
One of the first things I came across while studying data science is that a data science project has three important steps: data preparation, creating and testing the model, and reporting. It is widely accepted that data preparation takes up most of the time, followed by model building and then reporting. Some argue that we should reduce the time spent on data preparation and use it for creating and testing the model. But a model is only as good as the data on which it is built: a simple model based on clean data will most likely outperform a complicated model based on dirty or ambiguous data.

### Big Mart Sales dataset

Using a regression problem to predict sales as an example, we can walk through some common situations we might face while creating a good dataset. The dataset I am using is the Big Mart Sales prediction problem from http://www.analyticsvidhya.com/, which I have modified slightly to include some outliers in the response variable Item_Outlet_Sales.

### Data description

Item_Identifier : Unique product ID
Item_Weight : Weight of product
Item_Fat_Content : Whether the product is low fat or not
Item_Visibility : The % of total display area of all products in a store allocated to the particular product
Item_Type : The category to which the product belongs
Item_MRP : Maximum Retail Price (list price) of the product
Outlet_Identifier : Unique store ID
Outlet_Establishment_Year : The year in which store was established
Outlet_Size : The size of the store in terms of ground area covered
Outlet_Location_Type : The type of city in which the store is located
Outlet_Type : Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales : Sales of the product in the particular store. This is the outcome variable to be predicted

## Dealing with missing values

First let’s take a look at the missing values. If the number of observations with missing values is much lower than the total number of observations, there is not much loss of information in dropping them. I am using the function complete.cases() to check for rows without missing values. The function returns a logical vector indicating which cases are complete, i.e. have no missing values. Note that this function only looks for NA values; missing values can also appear in other forms, such as blanks in character or factor columns.

nrows <- nrow(Data)
ncomplete <- sum(complete.cases(Data))
ncomplete

## [1] 7060

ncomplete/nrows

## [1] 0.8283468

Here we can see that dropping all the rows with missing values would lose about 17% of the data, so we cannot simply drop them.
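Since dropping rows is off the table, the next step is locating where the missingness actually lives. This is a minimal sketch on a toy data frame (the column names mirror the real dataset, but the values and counts are illustrative only); it shows both true NAs and the "blank" pseudo-missing values that complete.cases() does not catch:

```r
# Toy data frame mimicking the two kinds of missingness in the real data
toy <- data.frame(
  Item_Weight = c(9.3, NA, 12.6, NA),          # true NAs
  Outlet_Size = c("Medium", "", "Small", ""),  # blanks, which are NOT NAs
  stringsAsFactors = FALSE
)

colSums(is.na(toy))               # counts only the NAs per column
colSums(toy == "", na.rm = TRUE)  # counts the blank pseudo-missing values
```

Running the same two checks on the real Data frame tells us which columns need an imputation strategy before modelling.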

## Data Exploration

Now let’s take a proper look at the dataset, beginning with the response variable. From the information we have, it is a continuous variable. I am using the ggplot2 package for data visualization, which most of you will be familiar with. Let's look at the distribution of the dependent variable Item_Outlet_Sales.

library(ggplot2)
#Plotting the dependent variable distribution
pl1 <- ggplot(Data, aes(Item_Outlet_Sales))
pl1 + geom_density(fill = "red", alpha = 0.7)

Here we can see that the distribution looks similar to a half-normal distribution. On closer inspection, there is a sudden spike towards the right end of the graph. This might be a sentinel value. A sentinel value is a special kind of bad numerical value: a value that used to represent "unknown", "not applicable" or other special cases in numeric data. One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values. We can take a look at the summary of the Item_Outlet_Sales variable to confirm this.

summary(Data$Item_Outlet_Sales)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    33.29   834.90  1798.00  2218.00  3104.00 33330.00

Here we can see that the maximum value, 33333 (rounded to 33330 above), is nowhere close to the other values. We can now examine these observations to check whether they follow a pattern; if not, we can drop them. I am using the dplyr package here. If you are not familiar with it, please go through the package help: the documentation is comprehensive. The filter() function in dplyr subsets the data based on column values.

library(dplyr)
#Creating a data frame with only the outliers
outlier <- Data %>% filter(Item_Outlet_Sales == 33333)
outlier

##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1            DRI11          NA          Low Fat      0.03423768
## 2            FDP25      15.200          Low Fat      0.02132748
## 3            FDU55      16.200          Low Fat      0.03598410
## 4            FDF44       7.170          Regular      0.05997133
## 5            FDE41          NA              reg      0.00000000
## 6            FDK04       7.360          Low Fat      0.05260793
## 7            FDJ53      10.500          Low Fat      0.07125791
## 8            FDX08      12.850          Low Fat      0.02264989
## 9            FDE53      10.895          Low Fat      0.02703220
## 10           FDA31       7.100          Low Fat      0.11023479
##                Item_Type Item_MRP Outlet_Identifier
## 1            Hard Drinks 113.2834            OUT027
## 2                 Canned 216.8824            OUT017
## 3  Fruits and Vegetables 260.6278            OUT045
## 4  Fruits and Vegetables 132.1968            OUT018
## 5           Frozen Foods  83.7566            OUT019
## 6           Frozen Foods  56.3588            OUT017
## 7           Frozen Foods 121.3098            OUT046
## 8  Fruits and Vegetables 179.3318            OUT045
## 9           Frozen Foods 106.3280            OUT017
## 10 Fruits and Vegetables 171.7080            OUT045
##    Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1                       1985      Medium               Tier 3
## 2                       2007                           Tier 2
## 3                       2002                           Tier 2
## 4                       2009      Medium               Tier 3
## 5                       1985       Small               Tier 1
## 6                       2007                           Tier 2
## 7                       1997       Small               Tier 1
## 8                       2002                           Tier 2
## 9                       2007                           Tier 2
## 10                      2002                           Tier 2
##          Outlet_Type Item_Outlet_Sales
## 1  Supermarket Type3             33333
## 2  Supermarket Type1             33333
## 3  Supermarket Type1             33333
## 4  Supermarket Type2             33333
## 5      Grocery Store             33333
## 6  Supermarket Type1             33333
## 7  Supermarket Type1             33333
## 8  Supermarket Type1             33333
## 9  Supermarket Type1             33333
## 10 Supermarket Type1             33333

There are only 10 observations with this sentinel value. We can see that Item_Type, Item_MRP, Outlet_Location_Type, Item_Weight and Outlet_Type all differ among these outliers, so it does not look like wrong data from a particular store or location. Let's drop them. Dealing with these types of values normally requires domain knowledge.

#Removing the observations with the sentinel value
Data <- Data %>% filter(Item_Outlet_Sales != 33333)
pl2 <- ggplot(Data, aes(Item_Outlet_Sales))
pl2 + geom_density(fill = "blue", alpha = 0.5)
summary(Data$Item_Outlet_Sales)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    33.29   833.60  1794.00  2182.00  3101.00 13090.00

Now we can explore the remaining variables.

summary(Data)

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility
##  FDG33  :  10    Min.   : 4.555   LF     : 316     Min.   :0.00000
##  FDW13  :  10    1st Qu.: 8.775   low fat: 112     1st Qu.:0.02699
##  DRE49  :   9    Median :12.600   Low Fat:5081     Median :0.05395
##  DRN47  :   9    Mean   :12.860   reg    : 116     Mean   :0.06616
##  FDD38  :   9    3rd Qu.:16.850   Regular:2888     3rd Qu.:0.09466
##  FDF52  :   9    Max.   :21.350                    Max.   :0.32839
##  (Other):8457    NA's   :1461
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932
##  Household            : 910   Median :143.02   OUT035 : 930
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927
##  (Other)              :2993                    (Other):2931
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985                    :2404   Tier 1:2386
##  1st Qu.:1987              High  : 932   Tier 2:2779
##  Median :1999              Medium:2791   Tier 3:3348
##  Mean   :1998              Small :2386
##  3rd Qu.:2004
##  Max.   :2009
##
##             Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082   Min.   :   33.29
##  Supermarket Type1:5570   1st Qu.:  833.58
##  Supermarket Type2: 927   Median : 1794.33
##  Supermarket Type3: 934   Mean   : 2181.56
##                           3rd Qu.: 3101.30
##                           Max.   :13086.97
##

Looking at the Item_Weight variable, we can see that there are 1461 missing values. We can also see that the Item_Fat_Content variable is coded inconsistently: the same factor levels are coded in different ways.
There are 2404 missing values in the Outlet_Size variable, shown above as blank levels. Another interesting point concerns the Item_Visibility variable: in my opinion, there cannot be any item with 0 visibility, as no item in a supermarket or grocery store is supposed to be completely invisible to customers.
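The zero-visibility suspicion is easy to quantify before deciding how to treat it. A minimal sketch on a toy visibility vector (against the real data the check would simply be sum(Data$Item_Visibility == 0)):

```r
# Toy stand-in for Data$Item_Visibility; zeros mark "invisible" items
visibility <- c(0.034, 0, 0.021, 0.059, 0, 0.11)

n_zero <- sum(visibility == 0)    # how many suspicious zeros
prop_zero <- mean(visibility == 0)  # what share of the rows they represent
n_zero
prop_zero
```

If the proportion is non-trivial, as it is in the real dataset, the zeros are worth imputing rather than ignoring.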

Let’s treat them one by one:

First let’s recode the Item_Fat_Content variable. There are really only 2 levels, Regular and low fat, but they are coded as 5 different levels: LF, low fat, Low Fat, reg and Regular. We can recode them into the two levels lowfat and Regular.

#with gsub replacing the levels with Regular or lowfat as required
Data$Item_Fat_Content <- gsub("LF", "lowfat",Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("low fat", "lowfat",Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("Low Fat", "lowfat",Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("reg", "Regular",Data$Item_Fat_Content)
Data$Item_Fat_Content <- as.factor(Data$Item_Fat_Content)
summary(Data$Item_Fat_Content)

##  lowfat Regular
##    5509    3004

Now let us replace the missing values in the Item_Weight variable. There are many ways to deal with missing values in a continuous variable, including mean replacement, median replacement, replacement with an arbitrary constant, regression methods, etc. I will be using mean replacement for Item_Weight and regression for Item_Visibility. In real projects these methods are chosen based on requirements; normally we use mean replacement for variables which have lower predictive power for the final response variable.

#Using the mean to replace the missing values in the Item_Weight variable
MeanItem_Weight <- mean(Data$Item_Weight[!is.na(Data$Item_Weight)])
Data$Item_Weight[is.na(Data$Item_Weight)] <- MeanItem_Weight

#Using regression to replace the zeros in the Item_Visibility variable
Data_1 <- Data %>% filter(Item_Visibility != 0)
visibility_model <- lm(Item_Visibility ~ Item_Weight + Item_Fat_Content + Item_Type +
                         Item_MRP + Outlet_Establishment_Year + Outlet_Size +
                         Outlet_Location_Type + Item_Outlet_Sales, data = Data_1)
Data$Item_Visibility[Data$Item_Visibility == 0] <- predict(visibility_model,
                                                           newdata = Data[Data$Item_Visibility == 0, ])
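The mean-replacement step can be sanity-checked on a toy vector; this sketch (toy values, not the real Item_Weight data) also shows the median alternative mentioned above, which is more robust when the variable is skewed:

```r
# Toy weight vector with two missing values
w <- c(4.5, NA, 12.6, 16.8, NA, 21.3)

# Mean replacement, as used for Item_Weight
w_mean <- w
w_mean[is.na(w_mean)] <- mean(w, na.rm = TRUE)    # fills NAs with 13.8

# Median replacement: a common alternative for skewed variables
w_med <- w
w_med[is.na(w_med)] <- median(w, na.rm = TRUE)    # fills NAs with 14.7

sum(is.na(w_mean))   # 0 -- no missing values remain
```

The same post-check, sum(is.na(Data$Item_Weight)), confirms the real imputation worked.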

Finally we have to fill in the missing values in the Outlet_Size variable, which we can treat as a classification problem. I am using the random forest algorithm: in my experience it has worked well for classification, as it has the advantage of being an ensemble model. I am using the randomForest package, which has a very good implementation of the algorithm.
The dataset is split into train and test sets using the caTools package. The function sample.split() performs the split, producing two subsets labelled TRUE and FALSE; normally we use the TRUE subset for training and the FALSE subset for testing.

library(caTools)
set.seed(100)
Data$Outlet_Size <- as.character(Data$Outlet_Size)
Storetypes <- subset(Data, Outlet_Size != "")
spl <- sample.split(Storetypes$Outlet_Size, SplitRatio = 0.8)
Train <- subset(Storetypes, spl == TRUE)
Test <- subset(Storetypes, spl == FALSE)

#Using random forest for classification
library(randomForest)
Train$Outlet_Size <- as.factor(Train$Outlet_Size)
Test$Outlet_Size <- as.factor(Test$Outlet_Size)

#Creating the model
SizeForest <- randomForest(Outlet_Size ~ . -Item_Outlet_Sales -Item_Identifier,
                           data = Train, nodesize = 25, ntree = 100)

#Predicting on the test set
PredictForest <- predict(SizeForest, newdata = Test)
#Confusion matrix for accuracy
table(Test$Outlet_Size, PredictForest)

##         PredictForest
##          High Medium Small
##   High    186      0     0
##   Medium    0    558     0
##   Small     0      0   477
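Overall accuracy can be read off a confusion matrix as the diagonal over the total. A sketch using the matrix printed above, rebuilt by hand so it stands alone:

```r
# Confusion matrix from the random forest predictions above
conf <- matrix(c(186,   0,   0,
                   0, 558,   0,
                   0,   0, 477),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("High", "Medium", "Small"),
                               c("High", "Medium", "Small")))

# Accuracy = correctly classified (diagonal) / all test observations
accuracy <- sum(diag(conf)) / sum(conf)
accuracy   # 1 -- every test observation is classified correctly
```

Perfect accuracy on a held-out set is unusual; here it suggests Outlet_Size is almost fully determined by the other store-level variables, so the imputation step below is safe.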

#Classifying the missing values in the dataset
Data$Outlet_Size <- predict(SizeForest, newdata = Data)

Now we can check the complete dataset once more. We can see that the problem of missing values is resolved and the factors are properly coded.

summary(Data)

##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility
##  FDG33  :  10    Min.   : 4.555   lowfat :5509     Min.   :0.003575
##  FDW13  :  10    1st Qu.: 9.310   Regular:3004     1st Qu.:0.033088
##  DRE49  :   9    Median :12.860                    Median :0.060615
##  DRN47  :   9    Mean   :12.860                    Mean   :0.070478
##  FDD38  :   9    3rd Qu.:16.000                    3rd Qu.:0.096000
##  FDF52  :   9    Max.   :21.350                    Max.   :0.328391
##  (Other):8457
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932
##  Household            : 910   Median :143.02   OUT035 : 930
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927
##  (Other)              :2993                    (Other):2931
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985              High  : 932   Tier 1:2386
##  1st Qu.:1987              Medium:5195   Tier 2:2779
##  Median :1999              Small :2386   Tier 3:3348
##  Mean   :1998
##  3rd Qu.:2004
##  Max.   :2009
##
##             Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082   Min.   :   33.29
##  Supermarket Type1:5570   1st Qu.:  833.58
##  Supermarket Type2: 927   Median : 1794.33
##  Supermarket Type3: 934   Mean   : 2181.56
##                           3rd Qu.: 3101.30
##                           Max.   :13086.97
##

The dataset is now ready for modelling.

The post Preparing the data for modelling with R appeared first on MilanoR.