[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Most of us have come across situations where we do not have enough data to build reliable models, for reasons such as the expense of collecting data (human studies), limited resources, or a lack of historical data (earthquakes). Before we talk about how to overcome this challenge, let's first discuss why we need a minimum number of samples before we even consider building a model. To be clear, we can build a model with few samples; it is definitely possible. But as the number of samples decreases, the margin of error increases, and vice versa. If you want to build a model with the highest accuracy, you need as many samples as possible. If the model is for a real-world application, you also need data collected across multiple days to account for any changes in the system. The sample size can be calculated with the following formula:

n = (Z × σ / MOE)²

Where, n = sample size

Z = Z-score value

σ = population standard deviation

MOE = acceptable margin of error

You can also calculate it with an online calculator, such as the one at this link:
https://www.qualtrics.com/blog/calculating-sample-size/
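The formula above is easy to evaluate directly in R. The numbers below (95% confidence level, an assumed population standard deviation, and a 5% margin of error) are purely illustrative, not taken from this article's data:

```r
# Illustrative sample-size calculation: n = (Z * sigma / MOE)^2
z     <- qnorm(0.975)  # Z-score for a 95% confidence level (~1.96)
sigma <- 0.5           # assumed population standard deviation
moe   <- 0.05          # acceptable margin of error

n <- (z * sigma / moe)^2
ceiling(n)             # round up to a whole number of samples -> 385
```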

Now that we know why a minimum number of samples is required to achieve the desired accuracy, suppose we do not have the opportunity to collect more samples. Then we have the following options:

1. K-fold cross validation
2. Leave-P-out cross validation
3. Leave-one-out cross validation
4. New data creation through estimation

In the k-fold method, the data is split into k partitions; the model is trained on k−1 partitions and tested on the left-out kth partition, repeating so each partition serves as the test set once. In k-fold, not all combinations are considered, only the user-specified number of partitions. In leave-one-out and leave-p-out, all combinations are considered, which makes them more exhaustive validation techniques. These are among the most popular techniques used in both machine learning and deep learning.
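The k-fold idea above can be sketched in a few lines of base R. The data frame `df` and the linear model here are hypothetical stand-ins; any model could be substituted:

```r
# Minimal k-fold cross-validation sketch (base R, hypothetical data)
set.seed(42)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)

k <- 5
folds <- sample(rep(1:k, length.out = nrow(df)))  # assign each row to a fold

errors <- sapply(1:k, function(i){
  train <- df[folds != i, ]
  test  <- df[folds == i, ]
  fit <- lm(y ~ x, data = train)            # any model could go here
  mean((predict(fit, test) - test$y)^2)     # held-out mean squared error
})
mean(errors)  # cross-validated estimate of the error
```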

When it comes to handling NA's in a data set, we have traditionally imputed them with the mean, median, zeros, or random numbers. But this would probably not make sense when we want to create new data.

In the new-data-creation-through-estimation technique, rows of missing data are added to the data set and a separate imputation model is used to fill them in. Multivariate Imputation by Chained Equations (MICE) is one of the most popular algorithms for imputing missing data irrespective of data type, handling mixes of continuous, binary, unordered categorical and ordered categorical data.
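As a minimal illustration before we apply it to our own data, mice ships with a small built-in data set, `nhanes`, that already contains missing values; `m = 5` below asks for five alternative completed data sets:

```r
library(mice)

data(nhanes)   # small example data with NAs (age, bmi, hyp, chl)

# chained-equations imputation with predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)

completed <- complete(imp, 1)  # first of the five completed data sets
anyNA(completed)               # FALSE: all missing values are filled in
```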

There are various tutorials available for the k-fold and leave-one-out methods. This tutorial will focus on the fourth option, where new data is created to handle a small sample size. In the end, a simple classification model will be trained to see whether there is a significant improvement. The distributions of imputed and non-imputed data will also be compared for any significant difference.

Let’s load all the libraries needed for now.

```
options(warn = -1)

library(mice)
library(dplyr)
```

## Load data into a data frame

The data available in my GitHub repository is used for the analysis.

```
setwd("C:/OpenSourceWork/Experiment")

# read the four data files (hypothetical file names; the actual files are in the GitHub repository)
file1 = read.csv("file1.csv")
file2 = read.csv("file2.csv")
file3 = read.csv("file3.csv")
file4 = read.csv("file4.csv")

# add a class label to each file
file1$y = 1
file2$y = 2
file3$y = 3
file4$y = 4

# view top rows of data
head(file1)
```

## Create some features from data

The data used in this study is vibration data with different states, collected at 100 Hz. Used as-is, the data is high-dimensional, and we do not have any good summary of it. Hence, some statistical features are extracted: the sample standard deviation, mean, min, max and median. The data is also aggregated into 1-second windows.

```
file1$group = as.factor(round(file1$time))
file2$group = as.factor(round(file2$time))
file3$group = as.factor(round(file3$time))
file4$group = as.factor(round(file4$time))
# head(file1, 20)

# list of all files
files = list(file1, file2, file3, file4)

# loop through all files and combine
features = NULL
for (i in 1:4){
  res = files[[i]] %>%
    group_by(group) %>%
    summarize(ax_mean = mean(ax),
              ax_sd = sd(ax),
              ax_min = min(ax),
              ax_max = max(ax),
              ax_median = median(ax),
              ay_mean = mean(ay),
              ay_sd = sd(ay),
              ay_min = min(ay),
              ay_may = max(ay),
              ay_median = median(ay),
              az_mean = mean(az),
              az_sd = sd(az),
              az_min = min(az),
              az_maz = max(az),
              az_median = median(az),
              aT_mean = mean(aT),
              aT_sd = sd(aT),
              aT_min = min(aT),
              aT_maT = max(aT),
              aT_median = median(aT),
              y = mean(y))
  features = rbind(features, res)
}

features = subset(features, select = -group)

# store it in a data frame for future reference
actual.features = features
```

## Study data

First, let's look at the size of our data along with a summary of the features and their data types.

```
# show data types
str(features)

Classes 'tbl_df', 'tbl' and 'data.frame':	362 obs. of  21 variables:
 $ ax_mean  : num  -0.03816 -0.00581 0.06985 0.01155 0.04669 ...
 $ ax_sd    : num  0.659 0.633 0.667 0.551 0.643 ...
 $ ax_min   : num  -1.26 -1.62 -1.46 -1.93 -1.78 ...
 $ ax_max   : num  1.38 1.19 1.47 1.2 1.48 ...
 $ ax_median: num  -0.0955 -0.0015 0.107 0.0675 0.0836 ...
 $ ay_mean  : num  -0.068263 0.003791 0.074433 0.000826 -0.017759 ...
 $ ay_sd    : num  0.751 0.782 0.802 0.789 0.751 ...
 $ ay_min   : num  -1.39 -1.56 -1.48 -2 -1.66 ...
 $ ay_may   : num  1.64 1.54 1.8 1.56 1.44 ...
 $ ay_median: num  -0.19 0.0101 0.1186 -0.0027 -0.0253 ...
 $ az_mean  : num  -0.138 -0.205 -0.0641 -0.0929 -0.1399 ...
 $ az_sd    : num  0.985 0.925 0.929 0.889 0.927 ...
 $ az_min   : num  -2.68 -3.08 -1.82 -2.16 -1.85 ...
 $ az_maz   : num  2.75 2.72 2.49 3.24 3.55 ...
 $ az_median: num  0.0254 -0.2121 -0.1512 -0.1672 -0.1741 ...
 $ aT_mean  : num  1.27 1.26 1.3 1.2 1.23 ...
 $ aT_sd    : num  0.583 0.545 0.513 0.513 0.582 ...
 $ aT_min   : num  0.4 0.41 0.255 0.393 0.313 0.336 0.275 0.196 0.032 0.358 ...
 $ aT_maT   : num  3.03 3.2 2.64 3.32 3.6 ...
 $ aT_median: num  1.08 1.14 1.28 1.12 1.17 ...
 $ y        : num  1 1 1 1 1 1 1 1 1 1 ...
```

### Create observations with NA values at the end

Next, for the purposes of this tutorial, we will introduce some NA's at the end of the table.

```
features1 = features
for(i in 363:400){
  features1[i,] = NA
}
```

### View the bottom 50 rows

We see the missing values at the end of the table.

Disclaimer: here we are introducing 38 entirely empty rows at the end of the data. In the real world this is highly unlikely; you would more often have only a few scattered missing values.

`tail(features1, 50)`

## Impute NA’s using an iterative method

Next, to impute the missing values we will use the mice() function, with a maximum of 50 iterations and the 'pmm' (predictive mean matching) method.

```
imputed_Data = mice(features1,
                    m = 1,
                    maxit = 50,
                    method = 'pmm',
                    seed = 999,
                    printFlag = FALSE)
```

## View imputed results

Now we have imputed results. Since m = 1 above, only one completed data frame is produced; with a larger m you could test the different imputations to see which works better.

```
imputedResultData = mice::complete(imputed_Data, 1)
tail(imputedResultData, 50)
```

## Comparing the distributions of actual and imputed data

We will first compare basic statistics and then the distributions of a couple of features. Comparing the statistics of the actual and imputed data, we can observe that the mean and SD are almost equal in both.

```
data.frame(actual_ax_mean = c(mean(features$ax_mean), sd(features$ax_mean))
           , imputed_ax_mean = c(mean(imputedResultData$ax_mean), sd(imputedResultData$ax_mean))
           , actual_ax_median = c(mean(features$ax_median), sd(features$ax_median))
           , imputed_ax_median = c(mean(imputedResultData$ax_median), sd(imputedResultData$ax_median))
           , actual_az_sd = c(mean(features$az_sd), sd(features$az_sd))
           , imputed_az_sd = c(mean(imputedResultData$az_sd), sd(imputedResultData$az_sd))
           , row.names = c("mean", "sd"))
```

Now, let's look at the distributions of the data. From the density plots below, we can observe that the distributions of the actual and imputed data are almost identical; the bandwidths reported in the plots confirm this.

```
par(mfrow = c(3, 2))
plot(density(features$ax_mean), main = "Actual ax_mean", type = "l", col = "red")
plot(density(imputedResultData$ax_mean), main = "Imputed ax_mean", type = "l", col = "red")
plot(density(features$ax_median), main = "Actual ax_median", type = "l", col = "red")
plot(density(imputedResultData$ax_median), main = "Imputed ax_median", type = "l", col = "red")
plot(density(features$az_sd), main = "Actual az_sd", type = "l", col = "red")
plot(density(imputedResultData$az_sd), main = "Imputed az_sd", type = "l", col = "red")
```

## Building a classification model based on actual data and Imputed data

In the following, y will be our classification variable. We will build a classification model using a simple support vector machine (SVM) on both the actual and the imputed data. No transformation will be done on the data. In the end, we will compare the results.

### Actual Data

#### Sample data creation

Let's split the data into train and test sets with an 80:20 ratio.

```
# create samples of 80:20 ratio
features$y = as.factor(features$y)
sample = sample(nrow(features), nrow(features) * 0.8)
train = features[sample,]
test = features[-sample,]
```

#### Build a SVM model

Now we can train the model using the train set. We will not do any parameter tuning in this example.

```
library(e1071)
library(caret)

actual.svm.model = svm(y ~ ., data = train)
summary(actual.svm.model)

Call:
svm(formula = y ~ ., data = train)

Parameters:
   SVM-Type:  C-classification
       cost:  1
      gamma:  0.05

Number of Support Vectors:  142

 ( 47 18 47 30 )

Number of Classes:  4

Levels:
 1 2 3 4

```

#### Validate SVM model

In the confusion matrix below, we observe the following:

1. Accuracy > NIR, indicating the model performs far better than always guessing the majority class
2. High accuracy and kappa values indicate a very accurate model
3. Even the balanced accuracy is close to 1, indicating the model is highly accurate
```
# build a confusion matrix using the caret package
confusionMatrix(predict(actual.svm.model, test), test$y)

Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 10  1  0  0
         2  0 26  0  0
         3  0  0 22  0
         4  0  0  3 11

Overall Statistics

               Accuracy : 0.9452
                 95% CI : (0.8656, 0.9849)
    No Information Rate : 0.3699
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9234
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            1.0000   0.9630   0.8800   1.0000
Specificity            0.9841   1.0000   1.0000   0.9516
Pos Pred Value         0.9091   1.0000   1.0000   0.7857
Neg Pred Value         1.0000   0.9787   0.9412   1.0000
Prevalence             0.1370   0.3699   0.3425   0.1507
Detection Rate         0.1370   0.3562   0.3014   0.1507
Detection Prevalence   0.1507   0.3562   0.3014   0.1918
Balanced Accuracy      0.9921   0.9815   0.9400   0.9758
```

### Imputed Data

#### Sample data creation

```
# create samples of 80:20 ratio
imputedResultData$y = as.factor(imputedResultData$y)
sample = sample(nrow(imputedResultData), nrow(imputedResultData) * 0.8)
train = imputedResultData[sample,]
test = imputedResultData[-sample,]
```

#### Build a SVM model

```
imputed.svm.model = svm(y ~ ., data = train)
summary(imputed.svm.model)

Call:
svm(formula = y ~ ., data = train)

Parameters:
   SVM-Type:  C-classification
       cost:  1
      gamma:  0.05

Number of Support Vectors:  167

 ( 59 47 36 25 )

Number of Classes:  4

Levels:
 1 2 3 4

```

#### Validate SVM model

In the confusion matrix below, we observe the following:

1. Accuracy > NIR, indicating the model performs far better than always guessing the majority class
2. High accuracy and kappa values indicate a very accurate model
3. Even the balanced accuracy is close to 1, indicating the model is highly accurate
```
confusionMatrix(predict(imputed.svm.model, test), test$y)

Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 15  0  0  0
         2  1 21  0  0
         3  0  0 17  0
         4  0  0  0 26

Overall Statistics

               Accuracy : 0.9875
                 95% CI : (0.9323, 0.9997)
    No Information Rate : 0.325
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9831
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.9375   1.0000   1.0000    1.000
Specificity            1.0000   0.9831   1.0000    1.000
Pos Pred Value         1.0000   0.9545   1.0000    1.000
Neg Pred Value         0.9846   1.0000   1.0000    1.000
Prevalence             0.2000   0.2625   0.2125    0.325
Detection Rate         0.1875   0.2625   0.2125    0.325
Detection Prevalence   0.1875   0.2750   0.2125    0.325
Balanced Accuracy      0.9688   0.9915   1.0000    1.000
```

### Overall results

What we saw above and its interpretation is somewhat subjective. One way to truly validate it is to create random train and test samples many times (say 100), build a model each time, validate it and capture the accuracy. Finally, a simple t-test can tell us whether there is a significant difference.

Null hypothesis (H0): there is no significant difference between the two samples.

```
# let's create a function to simplify the process
test.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]

  # build model
  svm.model = svm(y ~ ., data = train)

  # get metrics
  metrics = confusionMatrix(predict(svm.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# now let's calculate accuracy with the actual data over 100 runs
actual.results = NULL
for(i in 1:100) {
  actual.results[i] = test.function(features)
}

# 0.978021978021978
# 0.978021978021978
# 0.978021978021978
# 0.945054945054945
# 0.989010989010989
# 0.967032967032967

# now let's calculate accuracy with the imputed data over 100 runs
imputed.results = NULL
for(i in 1:100) {
  imputed.results[i] = test.function(imputedResultData)
}

# 0.97
# 0.95
# 0.92
# 0.96
# 0.92
# 0.96
```

#### T-test to test the results

What's better than statistically proving whether there is a significant difference? We will do a t-test to see if there is any statistical difference in accuracy.

```
# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.results, y = imputed.results, conf.level = 0.95)

Welch Two Sample t-test

data:  actual.results and imputed.results
t = 7.9834, df = 194.03, p-value = 1.222e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.01673213 0.02771182
sample estimates:
mean of x mean of y
0.968022  0.945800
```

In the above t-test we set the confidence level at 95%. The p-value is well below 0.05, indicating a significant difference in accuracy between the actual and imputed data. From the sample means, the average accuracy on the actual data is about 96.8%, while on the imputed data it is about 94.6%: a gap of roughly 2%. So, does that mean imputing more data reduces accuracy across various models?

Why not run the same test with other models to compare the results? Let's consider four other models:

1. Random forest
2. Decision tree
3. KNN
4. Naive Bayes

#### Random Forest

Let's use the same steps as above to fit the different models and compare accuracies.

```
library(randomForest)

# let's create a function to simplify the process
test.rf.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]

  # build model
  rf.model = randomForest(y ~ ., data = train)

  # get metrics
  metrics = confusionMatrix(predict(rf.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# now let's calculate accuracy with the actual data over 100 runs
actual.rf.results = NULL
for(i in 1:100) {
  actual.rf.results[i] = test.rf.function(features)
}

# now let's calculate accuracy with the imputed data over 100 runs
imputed.rf.results = NULL
for(i in 1:100) {
  imputed.rf.results[i] = test.rf.function(imputedResultData)
}
head(data.frame(Actual = actual.rf.results, Imputed = imputed.rf.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.rf.results, y = imputed.rf.results, conf.level = 0.95)
```
```
Welch Two Sample t-test

data:  actual.rf.results and imputed.rf.results
t = 11.734, df = 183.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02183138 0.03065654
sample estimates:
mean of x mean of y
 0.976044  0.949800
```

From the above t-test results we come to a similar conclusion: there is a significant difference between the accuracy on actual and imputed data, of approximately 2.6%.

#### Decision Tree

```
library(rpart)

# let's create a function to simplify the process
test.dt.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]

  # build model
  dt.model = rpart(y ~ ., data = train, method = "class")

  # get metrics
  metrics = confusionMatrix(predict(dt.model, test, type = "class"), test$y)
  return(metrics$overall['Accuracy'])
}

# now let's calculate accuracy with the actual data over 100 runs
actual.dt.results = NULL
for(i in 1:100) {
  actual.dt.results[i] = test.dt.function(features)
}

# now let's calculate accuracy with the imputed data over 100 runs
imputed.dt.results = NULL
for(i in 1:100) {
  imputed.dt.results[i] = test.dt.function(imputedResultData)
}
head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
```
```
Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 16.24, df = 167.94, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03331888 0.04254046
sample estimates:
mean of x mean of y
0.9703297 0.9324000
```

From the above t-test results we come to a similar conclusion: there is a significant difference between the accuracy on actual and imputed data, of approximately 3.8%.

#### K-Nearest Neighbor (KNN)

```
library(class)

# let's create a function to simplify the process
test.knn.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]

  # build model (the label column y must be excluded from the feature
  # matrices passed to knn, otherwise the label leaks into the features)
  knn.model = knn(train[, names(train) != "y"],
                  test[, names(test) != "y"],
                  cl = train$y, k = 5)

  # get metrics
  metrics = confusionMatrix(knn.model, test$y)
  return(metrics$overall['Accuracy'])
}

# now let's calculate accuracy with the actual data over 100 runs
actual.dt.results = NULL
for(i in 1:100) {
  actual.dt.results[i] = test.knn.function(features)
}

# now let's calculate accuracy with the imputed data over 100 runs
imputed.dt.results = NULL
for(i in 1:100) {
  imputed.dt.results[i] = test.knn.function(imputedResultData)
}
head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
```
```
Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 3.2151, df = 166.45, p-value = 0.001566
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.002126868 0.008895110
sample estimates:
mean of x mean of y
 0.989011  0.983500
```

From the above t-test results we come to a similar conclusion: there is a significant, though much smaller, difference between the accuracy on actual and imputed data of approximately 0.6%.

#### Naive Bayes

```
# naiveBayes() comes from the e1071 package loaded earlier

# let's create a function to simplify the process
test.nb.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]

  # build model
  nb.model = naiveBayes(y ~ ., data = train)

  # get metrics
  metrics = confusionMatrix(predict(nb.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# now let's calculate accuracy with the actual data over 100 runs
actual.nb.results = NULL
for(i in 1:100) {
  actual.nb.results[i] = test.nb.function(features)
}

# now let's calculate accuracy with the imputed data over 100 runs
imputed.nb.results = NULL
for(i in 1:100) {
  imputed.nb.results[i] = test.nb.function(imputedResultData)
}
head(data.frame(Actual = actual.nb.results, Imputed = imputed.nb.results))

# do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.nb.results, y = imputed.nb.results, conf.level = 0.95)
```
```
Welch Two Sample t-test

data:  actual.nb.results and imputed.nb.results
t = 18.529, df = 174.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04214191 0.05218996
sample estimates:
mean of x mean of y
0.9740659 0.9269000
```

From the above t-test results we come to a similar conclusion: there is a significant difference between the accuracy on actual and imputed data, of approximately 4.7%.

### Conclusion

From the above results we observe that, irrespective of the type of model built, accuracy dropped by roughly 0.6% to 5% when imputed data was used. In every case, the actual data produced a better model than the imputed data.

If you enjoyed this tutorial, then check out my other tutorials and my GitHub page for all the source code and various R-packages.

The post Testing the Effect of Data Imputation on Model Accuracy appeared first on Hi! I am Nagdev.