Question
Use the questions associated with the situation and questions below for your discussion forum. As you complete your discussion response, answer this question: “Why is it important to ask these questions as you read research reports?”
Your boss just sent you an internal report that provides several graphs to support his/her research. As you begin to read the report and analyze the graphs, there are some questions you should ask yourself as you attempt to extract information from the graphical data displays provided in the report.
Is the chosen graphical display appropriate for the type of data collected? What are the types of data? Why is it important to understand the different types of data associated with statistical analysis and graphical displays?
For the graphical displays of univariate numerical data, how would you describe the shape of the distribution? What does this say about the variable being summarized?
Are there any outliers (noticeably unusual values) in the data set? Is there any plausible explanation for why these values differ from the rest of the data?
Where do most of the data values fall? What is a typical value for the data set? What does this say about the variable being summarized?
Is there much variability in the data values? What does this say about the variable being summarized?
Explanation / Answer
Exploratory Data Analysis
When you receive data related to a certain study, for analysis and possibly prediction, the following steps and questions should be part of the analysis. Note that the details may vary depending on the problem. As D. R. Cox (a famous statistician) put it: there are no routine statistical questions, only questionable statistical routines. With that caveat, let us look at the different points to be considered.
1. Setting up the Problem: Identify the business objectives related to the problem and understand how the data were collected. Determine whether predictions about future data are required or whether relationships within the existing data are to be identified. This step also includes defining the explanatory variables (columns) and the target variable. The final step is deciding what type of problem you are tackling: supervised or unsupervised. Supervised problems are generally divided into regression and classification problems. If the target variable is continuous, it is a regression problem; an example is house price prediction. If the target variable is discrete (categorical with two or more categories), it is a binary or multiclass classification problem, respectively; examples are spam classification and fraud detection. Clustering is a common kind of unsupervised problem. Make your modelling objectives clear at this stage.
2. Data Understanding: This involves both data summarization and visualization techniques, for single variables and for two or more variables. The techniques differ for continuous and categorical data. The first step is to identify which columns are continuous, categorical (discrete with no order), or ordinal (discrete with a specific order, e.g., number of children). A useful heuristic is to count the number of unique values in each column: if there are many unique values, the variable is probably continuous; if there are only a few countable values, it is probably discrete.
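A minimal pandas sketch of that heuristic (the frame, column names, and the 0.5 ratio threshold are illustrative assumptions, not fixed rules):

    import pandas as pd

    # Hypothetical example frame; column names are illustrative only.
    df = pd.DataFrame({
        "price": [210.5, 340.0, 199.9, 275.0],           # continuous
        "bedrooms": [2, 3, 1, 3],                        # ordinal
        "city": ["Austin", "Dallas", "Austin", "Waco"],  # categorical
    })

    # Many unique values relative to the row count suggests continuous;
    # a small, countable set suggests discrete.
    for col in df.columns:
        ratio = df[col].nunique() / len(df)
        kind = "likely continuous" if ratio > 0.5 else "likely discrete"
        print(f"{col}: {df[col].nunique()} unique values -> {kind}")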
The df.describe() function in pandas can then be used to get a single-variable summary of the numeric columns: the count of non-missing values, mean, standard deviation, minimum, quartiles, and maximum. Missing-value counts, skewness, and kurtosis are obtained separately (e.g., df.isna().sum(), df.skew(), df.kurt()). This is mostly helpful for continuous variables. For categorical variables, frequency counts and percentages for each category are the insightful summaries.
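For example, a short pandas sketch (the frame and column names are made up for illustration):

    import pandas as pd

    df = pd.DataFrame({"income": [42000, 55000, None, 61000, 48000],
                       "region": ["N", "S", "N", "E", "N"]})

    print(df.describe())        # count, mean, std, min, quartiles, max
    print(df.isna().sum())      # missing values per column
    print(df["income"].skew())  # skewness is computed separately
    print(df["region"].value_counts(normalize=True))  # category percentages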
For visual analysis in a single dimension: histograms, density (distribution) plots, boxplots (for outlier analysis), and time series plots for time series data are helpful tools. For categorical data, count plots (bar charts of category frequencies) are useful. For analysis of two or more variables, we look at scatterplots, pairplots (scatterplot matrices), correlation plots, heatmaps, and crosstabs (for categorical data). Anscombe's quartet is a classic illustration of why these visual checks matter: four data sets with nearly identical summary statistics but very different scatterplots.
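A small matplotlib sketch of the two most common univariate views, on synthetic data:

    import matplotlib.pyplot as plt
    import numpy as np

    rng = np.random.default_rng(0)
    values = rng.normal(loc=50, scale=10, size=200)  # synthetic sample

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.hist(values, bins=20)         # shape of the distribution
    ax1.set_title("Histogram")
    ax2.boxplot(values, vert=False)   # center, spread, outlier flags
    ax2.set_title("Boxplot")
    plt.tight_layout()
    plt.show()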
3. Data Preparation: Following data understanding, the next step is preparing the data for analysis. This step is the most important and brings out the best results. The first task is variable cleaning: removing observations or rows that appear erroneous, whether entered deliberately or by mistake. Columns such as a unique ID should be dropped, as they provide no information for modelling.
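For instance (a hypothetical frame; the column names and the age check are assumptions for illustration):

    import pandas as pd

    df = pd.DataFrame({"customer_id": [101, 102, 103],
                       "age": [34, -5, 41]})  # -5 is clearly an entry error

    df = df.drop(columns=["customer_id"])  # unique IDs carry no signal
    df = df[df["age"] >= 0]                # drop rows with impossible values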
The next thing to check is consistency of data formats: categorical variables should be of type category, continuous variables of a numeric type such as int64 or float64, dates of a datetime type, and zip codes of type string. Sometimes categorical variables are masked as continuous (e.g., numeric codes), which is problematic. Then the missing values should be taken care of. Missing values can be MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random). The usual methods of dealing with them are imputation with a constant, imputation by mean/median, imputation from a fitted distribution, imputation from the variable's own distribution, and imputation by modelling. For categorical data, missing values should generally be replaced with an explicit value representing "missing". When data are MAR or MNAR, the fact that a value is missing can itself be significant; in such cases, missingness indicator (dummy) variables and multiple imputation help.
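A minimal scikit-learn sketch combining median imputation with a missingness indicator (the data are made up; SimpleImputer's add_indicator flag appends the indicator column):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"income": [42000, np.nan, 61000, 48000],
                       "segment": ["A", "B", None, "A"]})

    # Median imputation plus an indicator column, so the fact that a
    # value was missing is itself preserved (useful under MAR/MNAR).
    imp = SimpleImputer(strategy="median", add_indicator=True)
    df[["income", "income_missing"]] = imp.fit_transform(df[["income"]])

    # Categorical column: replace missing with an explicit level.
    df["segment"] = df["segment"].fillna("Missing")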
The next step is dealing with outliers. A common choice is the 1.5 × interquartile range (IQR) rule; methods based on k-means clustering, k-nearest neighbors, and Mahalanobis distance are also available. Outliers can be removed, capped at an upper/lower limit, or replaced with the mean so that they cause the least damage.
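A sketch of the 1.5 IQR rule in pandas (the series is illustrative):

    import pandas as pd

    s = pd.Series([12, 14, 13, 15, 16, 14, 95])  # 95 is a suspect value

    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = s[(s < lower) | (s > upper)]  # values outside the fences
    capped = s.clip(lower, upper)            # or cap instead of removing
    print(outliers.tolist())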
Feature creation using simple variable transformations like e^x, log(x), and x^2 should be tried, checking whether any transformed variable is more strongly correlated with the target than the original. Such transformations also help in fixing skew. Further, binning continuous variables should be considered in some cases, such as age.
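For example (made-up numbers; log requires positive values):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"income": [20000, 35000, 50000, 120000, 400000],
                       "target": [0.8, 1.1, 1.4, 2.0, 2.6]})

    # Keep whichever version correlates more strongly with the target.
    print(df["income"].corr(df["target"]))          # raw
    print(np.log(df["income"]).corr(df["target"]))  # log-transformed

    # Binning a continuous variable such as age into ordered ranges:
    ages = pd.Series([22, 37, 45, 61, 70])
    groups = pd.cut(ages, bins=[0, 30, 50, 100],
                    labels=["young", "middle", "senior"])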
Numeric variable scaling can be done using normalization (rescaling to [0, 1]) or standardization (mean 0, standard deviation 1); both are easily implemented with Python's scikit-learn library. Transformation of categorical data includes one-hot encoding, which creates one dummy variable per category; for linear models, one dummy is usually dropped (leaving n-1 dummies for n categories) to avoid collinearity, while still letting the effect of each category be studied.
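A short sketch with scikit-learn and pandas (column names are illustrative):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    df = pd.DataFrame({"income": [42000.0, 55000.0, 61000.0],
                       "region": ["N", "S", "E"]})

    df["income_std"] = StandardScaler().fit_transform(df[["income"]]).ravel()
    df["income_norm"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

    # drop_first=True yields n-1 dummies for n categories, avoiding
    # collinearity (the "dummy variable trap") in linear models.
    df = pd.get_dummies(df, columns=["region"], drop_first=True)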
The final step before modelling is feature selection. Correlation, the chi-square test, forward and backward selection, ANOVA, and principal component analysis (PCA) are common methods; strictly speaking, PCA is an unsupervised feature-extraction method, since it builds new components rather than selecting existing columns. More unsupervised methods can be considered to find feature interactions and drive further feature creation before starting the modelling process.
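A brief scikit-learn sketch contrasting supervised selection with PCA, on the built-in iris data:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Supervised selection: keep the k features most associated with y.
    X_best = SelectKBest(chi2, k=2).fit_transform(X, y)

    # Unsupervised extraction: PCA builds new components from all features.
    X_pca = PCA(n_components=2).fit_transform(X)
    print(X_best.shape, X_pca.shape)  # both (150, 2)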
The modelling process itself should be approached carefully. Follow Occam's razor: simple models are the best starting point, so move from simple to complicated models. Look into sampling methods, class imbalance problems, regularization for overfitting, and evaluation metrics for each model before settling on one. Ensembles of different models improve robustness. Also, splitting the data into training and test sets before modelling is important for robust results and inferences. Hope this helps.
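As a final sketch, a simple baseline with a held-out test set (scikit-learn's built-in iris data; the model choice is just an example):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)

    # Hold out a test set before any modelling so evaluation stays honest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42, stratify=y)

    # Start with a simple, regularized baseline (Occam's razor).
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))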