I am starting a new project. It comes from the Kaggle playground, where the objective is to build a regression model (the response variable, also called the outcome or dependent variable, is continuous in nature) from a given set of predictors, or independent variables.
My motivations for working on this project are the following:
- Help me learn and improve at feature engineering and advanced regression algorithms such as random forests and gradient boosting with xgboost
- Help me articulate compelling data-powered stories
- Help me understand and build a complete end-to-end data-powered solution
From the Kaggle page, “The Ames Housing dataset was compiled by Dean De Cock for use in data science education. It’s an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset.”
The Data Dictionary
The data dictionary can be accessed from here.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Submissions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.) In simple terms, the lower the RMSE value, the more accurate the prediction model.
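To make the metric concrete, here is a minimal sketch of the calculation. The analysis in this post is done in R; plain Python is used here only to spell out the arithmetic, and the function name and the prices are made up for illustration.

```python
import math

def rmse_of_logs(predicted, observed):
    """RMSE between log(predicted) and log(observed) sale prices."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices: both predictions are 10% too high.
cheap = rmse_of_logs([110_000], [100_000])
expensive = rmse_of_logs([550_000], [500_000])
```

Both calls give the same score, log(1.1), even though the absolute errors are 10,000 and 50,000. This is what "affect the result equally" means: the metric penalizes relative error, not absolute error.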
About the dataset
The dataset is split into training and testing files: the training dataset has 81 variables in 1460 rows and the testing dataset has 80 variables in 1459 rows. These variables focus on the quantity and quality of many physical attributes of the real estate property.
There are a large number of categorical variables (23 nominal, 23 ordinal) associated with this data set. They range from 2 to 28 classes with the smallest being STREET (gravel or paved) and the largest being NEIGHBORHOOD (areas within the Ames city limits). The nominal variables typically identify various types of dwellings, garages, materials, and environmental conditions while the ordinal variables typically rate various items within the property.
The 14 discrete variables typically quantify the number of items occurring within the house. Most are specifically focused on the number of kitchens, bedrooms, and bathrooms (full and half) located in the basement and above grade (ground) living areas of the home.
In general the 20 continuous variables relate to various area dimensions for each observation. In addition to the typical lot size and total dwelling square footage found on most common home listings, other more specific variables are quantified in the data set.
“A strong analysis should include the interpretation of the various coefficients, statistics, and plots associated with their model and the verification of any necessary assumptions.”
An interesting feature of the dataset is that several of the predictors contain values labelled NA that are not missing values at all but actual data points. This can be verified from the data dictionary, where variables such as Alley and PoolQC have an NA value that corresponds to No Alley Access and No Pool respectively. This SO question, answered by the user ‘flodel’, solves the problem of recoding specific columns of a dataset.
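The recoding idea can be sketched as follows. This is not the approach from the SO answer (which is in R), only a plain Python illustration of the same logic: replace NA in the columns where the data dictionary gives it a meaning, and leave it alone everywhere else. The replacement labels, the helper name and the pool-quality column name PoolQC are assumptions made for the example.

```python
# Columns where the data dictionary defines NA as a real category.
# Only two are listed here; the replacement labels are illustrative.
NA_MEANS_ABSENT = {"Alley": "NoAlley", "PoolQC": "NoPool"}

def recode_structural_na(rows, mapping=NA_MEANS_ABSENT, na_token="NA"):
    """Return copies of rows with na_token replaced in the mapped columns only."""
    recoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay untouched
        for column, label in mapping.items():
            if new_row.get(column) == na_token:
                new_row[column] = label
        recoded.append(new_row)
    return recoded

# Made-up rows: LotFrontage is genuinely missing in the first one.
rows = [
    {"Alley": "NA", "PoolQC": "NA", "LotFrontage": "NA"},
    {"Alley": "Grvl", "PoolQC": "Gd", "LotFrontage": "65"},
]
recoded = recode_structural_na(rows)
```

After recoding, the first row has explicit "no alley" and "no pool" categories, while its LotFrontage is still NA and will be counted as a true missing value.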
A total of 357 missing values are present in the training predictors (LotFrontage-259, MasVnrType-8, MasVnrArea-8, Electrical-1, GarageYrBlt-81) and 358 missing values in the testing predictors (MSZoning-4, LotFrontage-227, Exterior1st-1, Exterior2nd-1, MasVnrType-16, MasVnrArea-15, BsmtFinSF1-1, BsmtFinType2-1, BsmtFinSF2-1, BsmtUnfSF-1, TotalBsmtSF-1, BsmtFullBath-2, BsmtHalfBath-2, KitchenQual-1, Functional-2, GarageYrBlt-78, SaleType-1).
Some basic problems need to be solved first, namely data dimensionality reduction, missing value treatment, correlation and dummy coding. A common question is how to determine the relevant predictors in a high-dimensional dataset such as this one. The approach that I will use for dimensionality reduction is two-fold; first, I will check for zero variance predictors.
(a) Check for Near Zero Variance Predictors
A predictor with zero variability does not contribute anything to the prediction model and can be removed.
Computing: This can easily be accomplished with the nearZeroVar() function from the caret package. In the training dataset there are 21 near zero variance variables, namely (‘Street’ ‘LandContour’ ‘Utilities’ ‘LandSlope’ ‘Condition2’ ‘RoofMatl’ ‘BsmtCond’ ‘BsmtFinType2’ ‘BsmtFinSF2’ ‘Heating’ ‘LowQualFinSF’ ‘KitchenAbvGr’ ‘Functional’ ‘GarageQual’ ‘GarageCond’ ‘EnclosedPorch’ ‘X3SsnPorch’ ‘ScreenPorch’ ‘PoolArea’ ‘MiscFeature’ ‘MiscVal’), and in the testing dataset there are 19 near zero variance predictors, namely (‘Street’ ‘Utilities’ ‘LandSlope’ ‘Condition2’ ‘RoofMatl’ ‘BsmtCond’ ‘BsmtFinType2’ ‘Heating’ ‘LowQualFinSF’ ‘KitchenAbvGr’ ‘Functional’ ‘GarageCond’ ‘EnclosedPorch’ ‘X3SsnPorch’ ‘ScreenPorch’ ‘PoolArea’ ‘MiscVal’). After removing these predictors from both the training and testing datasets, the data dimension is reduced to 60 predictors for the train data and 61 predictors for the test data.
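For readers who want to see what such a check does, here is a rough plain Python sketch of the rule. As I understand the caret documentation, nearZeroVar() flags a predictor when the ratio of the most common value's frequency to the second most common value's frequency exceeds freqCut (default 95/5) and the percentage of distinct values is at most uniqueCut (default 10), or when the predictor has a single distinct value. The function below is my own approximation of that rule, not caret's code, and the column contents are made up.

```python
from collections import Counter

def is_near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column that is constant, or dominated by one value
    while having few distinct values relative to its length."""
    counts = Counter(values).most_common()
    if len(counts) <= 1:
        return True  # zero variance: at most one distinct value
    # Frequency of the most common value over the second most common one.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of the number of rows.
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut

# Made-up columns of 1460 rows each.
dominated = ["Pave"] * 1454 + ["Grvl"] * 6   # one value takes almost every row
balanced = ["A"] * 700 + ["B"] * 760         # two values, roughly even
constant = ["AllPub"] * 1460                 # a single value

flags = [is_near_zero_variance(col) for col in (dominated, balanced, constant)]
```

The dominated and constant columns are flagged, while the balanced one is kept, because its two values occur about equally often.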
(b) Missing data treatment
Missing data is commonly classified into three types:
(i) MCAR (Missing Completely At Random), (ii) MAR (Missing At Random) & (iii) MNAR (Missing Not At Random)
Usually, MCAR is the desirable scenario in case of missing data. For this analysis I will assume that MCAR is at play. Even when data is MCAR, too much missing data can be a problem. A commonly used safe maximum threshold is 5% of the total for large datasets: if more than 5% of the data for a certain feature or sample is missing, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing using a simple function. Some good references are 1 and 2.
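Such a function could look like the sketch below. It is written in plain Python for illustration rather than in R, it assumes the table is held as a dict of equal-length columns with None marking a missing entry, and the function names and values are made up.

```python
def missing_fraction(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)

def over_threshold(table, threshold=0.05):
    """Return (column names, row indices) whose missing fraction exceeds threshold.

    table maps column name -> list of values; all lists have the same length.
    """
    if not table:
        return [], []
    columns = [
        name for name, col in table.items() if missing_fraction(col) > threshold
    ]
    n_rows = len(next(iter(table.values())))
    rows = [
        i for i in range(n_rows)
        if missing_fraction([col[i] for col in table.values()]) > threshold
    ]
    return columns, rows

# Made-up table with four samples and three features.
table = {
    "LotFrontage": [65.0, None, 80.0, None],
    "LotArea": [8450, 9600, 11250, 9550],
    "Electrical": ["SBrkr", "SBrkr", "SBrkr", "SBrkr"],
}
columns, rows = over_threshold(table)
```

With so few columns, any row containing a missing entry exceeds 5%. On the full dataset, a single missing value in a row of about 80 features stays under the threshold.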
Computing: I have used the VIM package in R for missing data visualization. I set the threshold at 0.80; any predictor whose proportion of missing values is equal to or above this threshold is not imputed and is removed instead. After removing the near zero variance predictors, I check for high missing values and find that there are no predictors with high missing values in either the train or the test data.
Important Note: As per this r-blogger’s post, it is not advisable to use mean imputation for continuous predictors because it can affect the variance in the data. One should likewise avoid mode imputation for categorical variables. I therefore use the mice library to impute the missing values of the continuous variables.
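The effect of mean imputation on the variance is easy to demonstrate with a small made-up example. The sketch below uses plain Python's statistics module rather than R; the values are invented and the helper name is mine.

```python
import statistics

def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Made-up column with three missing entries out of eight.
lot_frontage = [60.0, None, 80.0, 70.0, None, 90.0, 50.0, None]
observed = [v for v in lot_frontage if v is not None]
imputed = mean_impute(lot_frontage)

var_observed = statistics.variance(observed)  # sample variance of the known values
var_imputed = statistics.variance(imputed)    # sample variance after mean imputation
```

The five observed values have a sample variance of 250. Filling the three gaps with the mean of 70 adds nothing to the sum of squared deviations but increases the sample size, so the variance drops to about 143. Every imputed point sits exactly on the mean, which makes the column look less spread out than it really is.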
(c) Correlation treatment
Correlation refers to a technique used to measure the relationship between two or more variables. When two variables are correlated, it means that they vary together. Positive correlation means that high scores on one are associated with high scores on the other, and low scores on one with low scores on the other. Negative correlation, on the other hand, means that high scores on the first are associated with low scores on the second, and low scores on the first with high scores on the second.
Pearson r is a statistic that is commonly used to calculate bivariate correlations. Or, better said, it checks for linear relations.
For example, Pearson r = -0.80, p < .01. What does this mean?
To interpret correlations, four pieces of information are necessary.
1. The numerical value of the correlation coefficient. Correlation coefficients range from -1.0 to 1.0, so their absolute value lies between 0.0 and 1.0. The closer the absolute value is to 1.0, the stronger the relationship between the two variables. A correlation of 0.0 indicates the absence of a linear relationship. In the example, the correlation coefficient is -0.80, which indicates the presence of a strong relationship.
2. The sign of the correlation coefficient. A positive correlation coefficient means that as variable 1 increases, variable 2 increases, and conversely, as variable 1 decreases, variable 2 decreases. In other words, the variables move in the same direction when there is a positive correlation. A negative correlation means that as variable 1 increases, variable 2 decreases and vice versa. In other words, the variables move in opposite directions when there is a negative correlation. If, for instance, the two variables in the example were class size and mean reading score, the negative sign would indicate that as class size increases, mean reading scores decrease.
3. The statistical significance of the correlation. A statistically significant correlation is conventionally indicated by a probability value of less than 0.05. This means that, if there were no relationship between the variables, a correlation coefficient this large would be obtained by chance fewer than five times out of 100, so the result indicates the presence of a relationship.
4. The effect size of the correlation. For correlations, the effect size is called the coefficient of determination and is defined as r². The coefficient of determination can vary from 0 to 1.00 and indicates the proportion of variation in the scores that can be predicted from the relationship between the two variables.
A correlation can only indicate the presence or absence of a relationship, not the nature of the relationship. Correlation is not causation.
In any data analysis activity, the analyst should always check for highly correlated variables and remove them from the dataset.
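The value, sign and effect size can be computed by hand on a toy example. For the r = -0.80 quoted earlier, the coefficient of determination is 0.64, meaning 64% of the variation is accounted for by the linear relationship. The sketch below, in plain Python with made-up data and my own function name, computes Pearson r from its definition and squares it. It does not compute the p-value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up data: y tends to fall as x rises, but not perfectly.
x = [1, 2, 3, 4, 5]
y = [10, 9, 7, 8, 4]
r = pearson_r(x, y)
r_squared = r ** 2  # coefficient of determination
```

Here r comes out at roughly -0.89 and r² at roughly 0.80: a strong negative linear relationship in which about 80% of the variation in y can be predicted from x.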
How Problematic is Multicollinearity?
Moderate multicollinearity may not be problematic. However, severe multicollinearity is a problem because it can increase the variance of the coefficient estimates and make the estimates very sensitive to minor changes in the model. The result is that the coefficient estimates are unstable and difficult to interpret. Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model. According to Tabachnick & Fidell (1996) the independent variables with a bivariate correlation more than .70 should not be included in multiple regression analysis.
To detect highly correlated predictors in the data, I used the findCorrelation() function of the caret library. There are four predictors in the training dataset with a correlation above 0.80, namely “YearRemodAdd”, “OverallCond”, “BsmtQual” and “Foundation”, which I then remove from the train data, thereby reducing the data dimension to 56. I follow the same procedure for the test data and find two predictors with a correlation above 0.80, “Foundation” and “LotShape”, which I then remove from the test data.
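The general idea behind this kind of filter can be sketched in a few lines. The code below is a simplified greedy version written in plain Python for illustration. It is not a port of caret's findCorrelation() and may choose different columns. It assumes numeric columns with no constant column (those were removed in the near zero variance step), and the column names and values are made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def find_correlated(columns, cutoff=0.80):
    """Return names of columns to drop so that no remaining pair has |r| > cutoff.

    columns maps name -> list of numbers; at least two columns are expected.
    """
    names = list(columns)
    abs_r = {
        a: {b: abs(pearson_r(columns[a], columns[b])) for b in names if b != a}
        for a in names
    }
    mean_abs = {a: sum(abs_r[a].values()) / len(abs_r[a]) for a in names}
    # Visit columns from most to least correlated with everything else.
    order = sorted(names, key=lambda name: mean_abs[name], reverse=True)
    dropped = set()
    for i, a in enumerate(order):
        # Drop a if it is too correlated with any column visited later;
        # of the two, a has the larger mean absolute correlation.
        if any(abs_r[a][b] > cutoff for b in order[i + 1:]):
            dropped.add(a)
    return [name for name in names if name in dropped]

# Made-up data: "a" and "b" move almost in lockstep, "c" does not.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
}
to_drop = find_correlated(columns)
```

In this toy data only "a" is dropped: it is correlated above 0.80 with "b", and of the two it has the higher average absolute correlation with the other columns.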
The final data dimensions are 1460 rows in 56 columns in the train data and 1459 rows in 59 columns in the test data.
In the next post, I will discuss outlier detection, skewness resolution and data visualization.