Preparing the data for modelling with R

May 9, 2016
(This article was first published on MilanoR, and kindly contributed to R-bloggers)

One of the first things I came across while studying data science was that the three important steps in a data science project are data preparation, creating and testing the model, and reporting. It is widely accepted that data preparation takes up most of the time, followed by creating the model and then reporting. Some argue that we should cut down the time spent on data preparation and use it for creating and testing the model instead. But a model is only as good as the data on which it is built: a simpler model based on clean data will most likely outperform a complicated model based on dirty or ambiguous data.

Big Mart Sales dataset

Using a regression problem, predicting sales, as an example, we can go through some of the common situations we might face while creating a good data set. The dataset I am using is the Big Mart Sales prediction problem from http://www.analyticsvidhya.com/, which I have modified slightly to include some outliers in the response variable `Item_Outlet_Sales`

Data description

Item_Identifier : Unique product ID
Item_Weight : Weight of product
Item_Fat_Content : Whether the product is low fat or not
Item_Visibility : The % of total display area of all products in a store allocated to the particular product
Item_Type : The category to which the product belongs
Item_MRP : Maximum Retail Price (list price) of the product
Outlet_Identifier : Unique store ID
Outlet_Establishment_Year : The year in which store was established
Outlet_Size : The size of the store in terms of ground area covered
Outlet_Location_Type : The type of city in which the store is located
Outlet_Type : Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales : Sales of the product in the particular store. This is the outcome variable to be predicted

Dealing with missing values

First let’s take a look at the missing values. If the number of observations with missing values is much lower than the total number of observations, there is not much loss of information in dropping them. I am using the function `complete.cases()` to check for rows without missing values. The function returns a logical vector indicating which cases are complete, i.e., have no missing values. Please note that this function only looks for NULL/NA values; there might be missing values in other forms, such as blanks in character or factor columns.

```nrows <- nrow(Data)
ncomplete <- sum(complete.cases(Data))
ncomplete```

`## [1] 7060`

`ncomplete/nrows`

`## [1] 0.8283468`

Here we can see that by dropping all the rows with missing values we would lose about 17% of the data, so we cannot simply drop them.
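Since `complete.cases()` only flags `NA`/`NaN`, it is worth a quick check for blanks hiding in character or factor columns. A minimal sketch on a toy data frame (not the Big Mart data):

```r
# complete.cases() treats blank strings as valid values, so this toy
# data frame looks fully "complete" even though one size is missing
toy <- data.frame(id   = c("A", "B", "C"),
                  size = c("Small", "", "Medium"),
                  stringsAsFactors = FALSE)
sum(complete.cases(toy))   # 3 -- every row passes the check
# Counting empty strings per column exposes the hidden gap
sapply(toy, function(x) sum(x == "", na.rm = TRUE))
```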

Data Exploration

Now let’s have a proper look into the data set. We can begin with the response variable: from the information we have, it is a continuous variable. I am using the `ggplot2` package for data visualization, which I believe most of you will be familiar with. I will start by showing the distribution of the dependent variable `Item_Outlet_Sales`

```library(ggplot2)
#Plotting the dependent variable distribution
pl1 <- ggplot(Data, aes(Item_Outlet_Sales))
pl1 + geom_density(fill = "red", alpha = "0.7")```

Here we can see that the distribution looks similar to a half-normal distribution. If we take a closer look, there is a sudden spike towards the right end of the graph. This might possibly be a sentinel value. A sentinel value is a special kind of bad numerical value: a value used to represent “unknown”, “not applicable” or other special cases in numeric data. One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values. We can now take a look at the summary of the `Item_Outlet_Sales` variable to confirm this

`summary(Data$Item_Outlet_Sales)`

```##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    33.29   834.90  1798.00  2218.00  3104.00 33330.00```

Here we can see that the maximum value, 33333, is nowhere close to the other values.
We can now examine these observations to check whether they follow a pattern. If not, we can drop them.
I am now using the `dplyr` package. If you are not familiar with it, please go through the package help; the documentation is comprehensive.
The `filter()` function in dplyr helps us subset the data based on column values.
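Before pulling out the full rows, a quick sanity check (a toy sketch, not part of the original analysis) is to count how many observations share the maximum: a genuine continuous tail almost never repeats its extreme value exactly, while a sentinel does.

```r
# Toy example: a smooth right-skewed sample plus a repeated sentinel
set.seed(1)
x <- c(rexp(1000, rate = 1/2000), rep(33333, 10))
max(x)             # the suspicious extreme value
sum(x == max(x))   # 10 -- a spike of identical values, not a tail
```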

```library(dplyr)
#Creating a data frame with only the outliers
outlier <- Data %>% filter(Item_Outlet_Sales == 33333)
outlier```

```##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1            DRI11          NA          Low Fat      0.03423768
## 2            FDP25      15.200          Low Fat      0.02132748
## 3            FDU55      16.200          Low Fat      0.03598410
## 4            FDF44       7.170          Regular      0.05997133
## 5            FDE41          NA              reg      0.00000000
## 6            FDK04       7.360          Low Fat      0.05260793
## 7            FDJ53      10.500          Low Fat      0.07125791
## 8            FDX08      12.850          Low Fat      0.02264989
## 9            FDE53      10.895          Low Fat      0.02703220
## 10           FDA31       7.100          Low Fat      0.11023479
##                Item_Type Item_MRP Outlet_Identifier
## 1            Hard Drinks 113.2834            OUT027
## 2                 Canned 216.8824            OUT017
## 3  Fruits and Vegetables 260.6278            OUT045
## 4  Fruits and Vegetables 132.1968            OUT018
## 5           Frozen Foods  83.7566            OUT019
## 6           Frozen Foods  56.3588            OUT017
## 7           Frozen Foods 121.3098            OUT046
## 8  Fruits and Vegetables 179.3318            OUT045
## 9           Frozen Foods 106.3280            OUT017
## 10 Fruits and Vegetables 171.7080            OUT045
##    Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1                       1985      Medium               Tier 3
## 2                       2007                           Tier 2
## 3                       2002                           Tier 2
## 4                       2009      Medium               Tier 3
## 5                       1985       Small               Tier 1
## 6                       2007                           Tier 2
## 7                       1997       Small               Tier 1
## 8                       2002                           Tier 2
## 9                       2007                           Tier 2
## 10                      2002                           Tier 2
##          Outlet_Type Item_Outlet_Sales
## 1  Supermarket Type3             33333
## 2  Supermarket Type1             33333
## 3  Supermarket Type1             33333
## 4  Supermarket Type2             33333
## 5      Grocery Store             33333
## 6  Supermarket Type1             33333
## 7  Supermarket Type1             33333
## 8  Supermarket Type1             33333
## 9  Supermarket Type1             33333
## 10 Supermarket Type1             33333```

There are only 10 observations with this sentinel value. We can see that `Item_Type`, `Item_MRP`, `Outlet_Location_Type`, `Item_Weight` and `Outlet_Type` all vary among these outliers, so it does not look like wrong data from one particular store or location. Let’s drop them. Dealing with this type of value normally requires domain knowledge.

```#Removing the observations with sentinel value
Data <- Data %>% filter(Item_Outlet_Sales != 33333)
pl2 <- ggplot(Data, aes(Item_Outlet_Sales))
pl2 + geom_density(fill = "blue", alpha = "0.5")```

`summary(Data$Item_Outlet_Sales)`

```##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
##    33.29   833.60  1794.00  2182.00  3101.00 13090.00```

Now we can explore the remaining variables.

`summary(Data)`

```##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility
##  FDG33  :  10    Min.   : 4.555   LF     : 316     Min.   :0.00000
##  FDW13  :  10    1st Qu.: 8.775   low fat: 112     1st Qu.:0.02699
##  DRE49  :   9    Median :12.600   Low Fat:5081     Median :0.05395
##  DRN47  :   9    Mean   :12.860   reg    : 116     Mean   :0.06616
##  FDD38  :   9    3rd Qu.:16.850   Regular:2888     3rd Qu.:0.09466
##  FDF52  :   9    Max.   :21.350                    Max.   :0.32839
##  (Other):8457    NA's   :1461
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932
##  Household            : 910   Median :143.02   OUT035 : 930
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927
##  (Other)              :2993                    (Other):2931
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985                    :2404   Tier 1:2386
##  1st Qu.:1987              High  : 932   Tier 2:2779
##  Median :1999              Medium:2791   Tier 3:3348
##  Mean   :1998              Small :2386
##  3rd Qu.:2004
##  Max.   :2009
##
##             Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082   Min.   :   33.29
##  Supermarket Type1:5570   1st Qu.:  833.58
##  Supermarket Type2: 927   Median : 1794.33
##  Supermarket Type3: 934   Mean   : 2181.56
##                           3rd Qu.: 3101.30
##                           Max.   :13086.97
##```

Looking at the `Item_Weight` variable, we can see that there are 1461 missing values. We can also see that the `Item_Fat_Content` variable is coded inconsistently: the same factor levels are coded in different ways.
There are 2404 missing values in the `Outlet_Size` variable. Another interesting point is the `Item_Visibility` variable: in my opinion, there can’t be any item with 0 visibility, as no item in a supermarket or grocery store is supposed to be completely invisible to customers.

Let’s treat them one by one:

First let’s recode the `Item_Fat_Content` variable. There are really only 2 levels, `Regular` and `low fat`, but they are coded as 5 different levels named `LF`, `low fat`, `Low Fat`, `reg` and `Regular`. We can recode them into `lowfat` and `Regular`.

```#With gsub, replacing the levels with Regular or lowfat as required
Data$Item_Fat_Content <- gsub("LF", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("low fat", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("Low Fat", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("reg", "Regular", Data$Item_Fat_Content)
Data$Item_Fat_Content <- as.factor(Data$Item_Fat_Content)
summary(Data$Item_Fat_Content)```

```##  lowfat Regular
##    5509    3004```
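A more compact alternative to the chain of `gsub()` calls (a sketch on a toy vector, not part of the original code) is a named lookup vector that maps every raw spelling to its clean level in one step:

```r
# Named lookup vector: each raw spelling maps to its clean level
lookup <- c("LF" = "lowfat", "low fat" = "lowfat", "Low Fat" = "lowfat",
            "reg" = "Regular", "Regular" = "Regular")
raw   <- c("LF", "Regular", "low fat", "reg", "Low Fat")
clean <- factor(unname(lookup[raw]))
levels(clean)   # the two clean levels: lowfat and Regular
```

This scales better than repeated substitutions and cannot accidentally rewrite a substring of an already-clean level.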

Now let us replace the missing values in the `Item_Weight` variable. There are many ways to deal with missing values in a continuous variable, including mean replacement, median replacement, replacing with an arbitrary constant, regression methods, etc. I will be using mean replacement for `Item_Weight` and regression for `Item_Visibility`. In real projects these methods are chosen based on requirements; normally we use mean replacement for variables that have lower predictive power for the final response variable.

```#Using the mean to replace the missing values in the Item_Weight variable
MeanItem_Weight <- mean(Data$Item_Weight, na.rm = TRUE)
Data$Item_Weight[is.na(Data$Item_Weight)] <- MeanItem_Weight
#Using regression to replace the zeros in the Item_Visibility variable
Data_1 <- Data %>% filter(Item_Visibility != 0)
visibility_model <- lm(Item_Visibility ~ Item_Weight + Item_Fat_Content +
                         Item_Type + Item_MRP +
                         Outlet_Establishment_Year + Outlet_Size +
                         Outlet_Location_Type + Item_Outlet_Sales,
                       data = Data_1)
Data$Item_Visibility[Data$Item_Visibility == 0] <-
  predict(visibility_model, newdata = Data[Data$Item_Visibility == 0, ])```
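When a variable is skewed or contains outliers, median replacement is a more robust variant of the same idea, since the median is not dragged around by extreme observed values. A minimal sketch on a toy vector standing in for `Item_Weight`:

```r
# Median replacement: fill NAs with the median of the observed values
w <- c(4.5, 8.8, 12.6, NA, 16.8, 21.3, NA)
w[is.na(w)] <- median(w, na.rm = TRUE)
w   # both NAs are now 12.6, the median of the five observed weights
```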

Finally we have to classify the missing values in the `Outlet_Size` variable.
I am using the random forest algorithm for classification. In my experience, random forests have worked well for classification models, as they have the advantage of being ensemble models. I am using the `randomForest` package, which has a very good implementation of the algorithm. The dataset is split into train and test sets using the `caTools` package, a handy tool for splitting datasets for machine learning algorithms.
The function `sample.split()` is used for splitting: it creates two subsets, labelled `TRUE` and `FALSE`. Normally we use the `TRUE` subset for training and the `FALSE` subset for testing.

```library(caTools)
set.seed(100)
Data$Outlet_Size <- as.character(Data$Outlet_Size)
Storetypes <- subset(Data, Outlet_Size != "")
spl <- sample.split(Storetypes$Outlet_Size, SplitRatio = 0.8)
Train <- subset(Storetypes, spl == TRUE)
Test <- subset(Storetypes, spl == FALSE)
###Using random forest for classification
library(randomForest)
Train$Outlet_Size <- as.factor(Train$Outlet_Size)
Test$Outlet_Size <- as.factor(Test$Outlet_Size)
###Creating the model
SizeForest <- randomForest(Outlet_Size ~ . -Item_Outlet_Sales -Item_Identifier,
                           data = Train, nodesize = 25, ntree = 100)
###Predicting on the test set
PredictForest <- predict(SizeForest, newdata = Test)
#Confusion matrix for accuracy
table(Test$Outlet_Size, PredictForest)```

```##         PredictForest
##          High Medium Small
##   High    186      0     0
##   Medium    0    558     0
##   Small     0      0   477```
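From a confusion matrix like this, overall accuracy is the sum of the diagonal (correct predictions) divided by the total. Rebuilding the table above by hand as a sanity check:

```r
# Correct predictions sit on the diagonal of the confusion matrix
cm <- matrix(c(186,   0,   0,
                 0, 558,   0,
                 0,   0, 477),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("High", "Medium", "Small"),
                             predicted = c("High", "Medium", "Small")))
sum(diag(cm)) / sum(cm)   # 1 -- every test observation classified correctly
```

Perfect test-set accuracy is unusual in practice; here it simply means `Outlet_Size` is almost fully determined by the other store-level variables.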

```###Classifying the missing values in the dataset
Data$Outlet_Size <- predict(SizeForest, newdata = Data)```

Now we can check the complete dataset once more. We can see that the problem of missing values is resolved and the factors are well coded.

`summary(Data)`

```##  Item_Identifier  Item_Weight     Item_Fat_Content Item_Visibility
##  FDG33  :  10    Min.   : 4.555   lowfat :5509     Min.   :0.003575
##  FDW13  :  10    1st Qu.: 9.310   Regular:3004     1st Qu.:0.033088
##  DRE49  :   9    Median :12.860                    Median :0.060615
##  DRN47  :   9    Mean   :12.860                    Mean   :0.070478
##  FDD38  :   9    3rd Qu.:16.000                    3rd Qu.:0.096000
##  FDF52  :   9    Max.   :21.350                    Max.   :0.328391
##  (Other):8457
##                  Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932
##  Household            : 910   Median :143.02   OUT035 : 930
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927
##  (Other)              :2993                    (Other):2931
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985              High  : 932   Tier 1:2386
##  1st Qu.:1987              Medium:5195   Tier 2:2779
##  Median :1999              Small :2386   Tier 3:3348
##  Mean   :1998
##  3rd Qu.:2004
##  Max.   :2009
##
##             Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082   Min.   :   33.29
##  Supermarket Type1:5570   1st Qu.:  833.58
##  Supermarket Type2: 927   Median : 1794.33
##  Supermarket Type3: 934   Mean   : 2181.56
##                           3rd Qu.: 3101.30
##                           Max.   :13086.97
##```

The dataset is now ready for modelling.

