**Hi! I am Nagdev**, and this post was kindly contributed to R-bloggers.


Most of us have come across situations where we do not have enough data to build reliable models, for various reasons: collecting data is expensive (human studies), resources are limited, or historical data simply isn't available (earthquakes). Before we talk about how to overcome this challenge, let's first discuss why we need a minimum number of samples before we even consider building a model. First of all, can we build a model with few samples? It is definitely possible! But as the number of samples decreases, the margin of error increases, and vice versa. If you want to build a model with the highest accuracy, you need as many samples as possible. If the model is for a real-world application, you also need data collected across multiple days to account for any changes in the system. The sample size can be calculated with the following formula:

n = (Z × σ / MOE)²

Where, n = sample size

Z = Z-score value

σ = population standard deviation

MOE = acceptable margin of error

You can also calculate it with an online calculator such as this one:

https://www.qualtrics.com/blog/calculating-sample-size/
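As a quick sketch of the formula in R (the Z of 1.96 for 95% confidence, the σ of 15 and the MOE of 5 are made-up illustration values, not from any data set in this post):

```r
# hypothetical inputs, purely for illustration
z     = 1.96  # Z-score for a 95% confidence level
sigma = 15    # assumed population standard deviation
moe   = 5     # acceptable margin of error

n = (z * sigma / moe)^2  # sample size formula: n = (Z * sigma / MOE)^2
ceiling(n)               # round up to a whole number of samples: 35
```

Tightening the margin of error to 2.5 quadruples the required sample size, which is why small-data situations are so hard to escape by tweaking requirements alone.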

Now we know why a minimum number of samples is required to achieve the required accuracy. But say we do not have the opportunity to collect more samples. Then we have the following options:

- K-fold cross validation
- Leave-P-out cross validation
- Leave-one-out cross validation
- New data creation through estimation

In the k-fold method, the data is split into k partitions; the model is trained on k−1 partitions and tested on the held-out partition, rotating through all k folds. In the k-fold method, not all combinations are considered, only the user-specified number of partitions. In leave-one-out and leave-p-out, all combinations of partitions are considered, which makes them more exhaustive validation techniques. The above techniques are the most popular ones used in both machine learning and deep learning.
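As a rough sketch of the k-fold idea (using R's built-in `mtcars` data and a plain linear model as a stand-in, not the vibration data used later in this post):

```r
# minimal k-fold cross-validation sketch on built-in data
set.seed(42)
k = 5
data  = mtcars[sample(nrow(mtcars)), ]                    # shuffle rows
folds = cut(seq_len(nrow(data)), breaks = k, labels = FALSE)

rmse = numeric(k)
for (i in 1:k) {
  test_idx = which(folds == i)
  train = data[-test_idx, ]
  test  = data[test_idx, ]
  fit  = lm(mpg ~ wt + hp, data = train)   # train on the other k-1 folds
  pred = predict(fit, newdata = test)      # test on the held-out fold
  rmse[i] = sqrt(mean((test$mpg - pred)^2))
}
mean(rmse)  # average error across all k folds
```

Every row is used for testing exactly once, so the averaged error is a much less optimistic estimate than a single train/test split on a small sample.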

When it comes to handling NA's in a data set, we have traditionally imputed them with the mean, the median, zeros, or random numbers. But this would probably not make sense when we want to create new data.

In the new-data-creation-through-estimation technique, rows of missing data are created in the data set and a separate imputation model is used to fill in the missing values. Multivariate Imputation by Chained Equations (MICE) is one of the most popular algorithms available to impute missing data irrespective of data type, handling mixes of continuous, binary, unordered categorical and ordered categorical data.
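For contrast, here is what naive mean imputation looks like on a toy vector (made-up numbers): every NA receives the same value, which flattens the variable's spread — the distortion that MICE's chained-equation draws are designed to avoid:

```r
# toy vector with two missing values
x = c(1.2, 2.5, NA, 3.1, NA, 2.8)

# naive mean imputation: every NA gets the same value
x_mean = x
x_mean[is.na(x_mean)] = mean(x, na.rm = TRUE)
x_mean
# both NAs become 2.4, so the imputed values add no variability
```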

There are various tutorials available for the k-fold and leave-one-out methods. This tutorial will focus on the fourth method, where new data is created to handle a small sample size. In the end, a simple classification model will be trained to see if there is a significant improvement. Also, the distributions of imputed and non-imputed data will be compared to check for any significant differences.

### Load libraries

Let’s load all the libraries needed for now.

```r
options(warn = -1)

# load libraries
library(mice)
library(dplyr)
```

## Load data into a data frame

The data available in my GitHub repository is used for the analysis.

```r
setwd("C:/OpenSourceWork/Experiment")

# read csv files
file1 = read.csv("dry run.csv", sep = ",", header = T)
file2 = read.csv("base.csv", sep = ",", header = T)
file3 = read.csv("imbalance 1.csv", sep = ",", header = T)
file4 = read.csv("imbalance 2.csv", sep = ",", header = T)

# add labels to data
file1$y = 1
file2$y = 2
file3$y = 3
file4$y = 4

# view top rows of data
head(file1)
```

time | ax | ay | az | aT | y |
---|---|---|---|---|---|
0.002 | -0.3246 | 0.2748 | 0.1502 | 0.451 | 1 |
0.009 | 0.6020 | -0.1900 | -0.3227 | 0.709 | 1 |
0.019 | 0.9787 | 0.3258 | 0.0124 | 1.032 | 1 |
0.027 | 0.6141 | -0.4179 | 0.0471 | 0.744 | 1 |
0.038 | -0.3218 | -0.6389 | -0.4259 | 0.833 | 1 |
0.047 | -0.3607 | 0.1332 | -0.1291 | 0.406 | 1 |

## Create some features from data

The data used in this study is vibration data with different machine states, collected at 100 Hz. Used as-is, the data is high dimensional, and we do not have any good summary of it. Hence, some statistical features are extracted: in this case, the sample standard deviation, mean, min, max and median are calculated. The data is also aggregated into 1-second windows.

```r
file1$group = as.factor(round(file1$time))
file2$group = as.factor(round(file2$time))
file3$group = as.factor(round(file3$time))
file4$group = as.factor(round(file4$time))
#(file1,20)

# list of all files
files = list(file1, file2, file3, file4)

# loop through all files and combine
features = NULL
for (i in 1:4){
  res = files[[i]] %>%
    group_by(group) %>%
    summarize(ax_mean = mean(ax), ax_sd = sd(ax), ax_min = min(ax), ax_max = max(ax), ax_median = median(ax),
              ay_mean = mean(ay), ay_sd = sd(ay), ay_min = min(ay), ay_may = max(ay), ay_median = median(ay),
              az_mean = mean(az), az_sd = sd(az), az_min = min(az), az_maz = max(az), az_median = median(az),
              aT_mean = mean(aT), aT_sd = sd(aT), aT_min = min(aT), aT_maT = max(aT), aT_median = median(aT),
              y = mean(y))
  features = rbind(features, res)
}
features = subset(features, select = -group)

# store it in a df for future reference
actual.features = features
```

## Study data

First, let's look at the size of our populations and a summary of our features along with their data types.

```r
# show data types
str(features)
```

```
Classes 'tbl_df', 'tbl' and 'data.frame': 362 obs. of 21 variables:
 $ ax_mean  : num  -0.03816 -0.00581 0.06985 0.01155 0.04669 ...
 $ ax_sd    : num  0.659 0.633 0.667 0.551 0.643 ...
 $ ax_min   : num  -1.26 -1.62 -1.46 -1.93 -1.78 ...
 $ ax_max   : num  1.38 1.19 1.47 1.2 1.48 ...
 $ ax_median: num  -0.0955 -0.0015 0.107 0.0675 0.0836 ...
 $ ay_mean  : num  -0.068263 0.003791 0.074433 0.000826 -0.017759 ...
 $ ay_sd    : num  0.751 0.782 0.802 0.789 0.751 ...
 $ ay_min   : num  -1.39 -1.56 -1.48 -2 -1.66 ...
 $ ay_may   : num  1.64 1.54 1.8 1.56 1.44 ...
 $ ay_median: num  -0.19 0.0101 0.1186 -0.0027 -0.0253 ...
 $ az_mean  : num  -0.138 -0.205 -0.0641 -0.0929 -0.1399 ...
 $ az_sd    : num  0.985 0.925 0.929 0.889 0.927 ...
 $ az_min   : num  -2.68 -3.08 -1.82 -2.16 -1.85 ...
 $ az_maz   : num  2.75 2.72 2.49 3.24 3.55 ...
 $ az_median: num  0.0254 -0.2121 -0.1512 -0.1672 -0.1741 ...
 $ aT_mean  : num  1.27 1.26 1.3 1.2 1.23 ...
 $ aT_sd    : num  0.583 0.545 0.513 0.513 0.582 ...
 $ aT_min   : num  0.4 0.41 0.255 0.393 0.313 0.336 0.275 0.196 0.032 0.358 ...
 $ aT_maT   : num  3.03 3.2 2.64 3.32 3.6 ...
 $ aT_median: num  1.08 1.14 1.28 1.12 1.17 ...
 $ y        : num  1 1 1 1 1 1 1 1 1 1 ...
```

### Create observations with NA values at the end

Next, for the purpose of this tutorial, we will introduce some NA's at the end of the table.

```r
features1 = features
for(i in 363:400){
  features1[i,] = NA
}
```

### View the bottom 50 rows

We see the missing values at the end of the table.

Disclaimer: here we are introducing entire rows of NAs at the end of the table. In the real world, this is highly unlikely; you might have only a few values missing.

```r
tail(features1, 50)
```

ax_mean | ax_sd | ax_min | ax_max | ax_median | ay_mean | ay_sd | ay_min | ay_may | ay_median | … | az_sd | az_min | az_maz | az_median | aT_mean | aT_sd | aT_min | aT_maT | aT_median | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-0.016097030 | 0.8938523 | -2.3445 | 2.3006 | -0.07360 | -0.009759406 | 1.311817 | -3.4215 | 2.5028 | 0.10890 | … | 1.264572 | -2.8751 | 3.3718 | -0.07070 | 1.866030 | 0.7808319 | 0.380 | 4.098 | 1.8200 | 4 |
-0.015565347 | 0.8956615 | -2.2661 | 2.5089 | 0.08640 | 0.027313861 | 1.294063 | -2.9421 | 2.3497 | 0.15260 | … | 1.368576 | -3.3165 | 2.6989 | -0.01660 | 1.930426 | 0.7749686 | 0.127 | 4.463 | 1.8350 | 4 |
0.024006250 | 0.8653758 | -2.4099 | 2.5328 | -0.03170 | 0.008440625 | 1.376398 | -3.0422 | 2.3727 | 0.11390 | … | 1.449783 | -4.2171 | 4.7703 | 0.00110 | 2.003552 | 0.8300253 | 0.387 | 5.138 | 1.9920 | 4 |
-0.015563000 | 0.8720967 | -2.3451 | 2.3269 | -0.05325 | 0.013962000 | 1.240091 | -3.1360 | 2.8563 | 0.09145 | … | 1.418988 | -3.3758 | 3.4279 | -0.10410 | 1.895380 | 0.8351505 | 0.173 | 4.458 | 1.8735 | 4 |
0.003894898 | 0.8806773 | -2.3098 | 3.1902 | -0.09260 | 0.022575510 | 1.301955 | -3.2561 | 2.7833 | -0.05380 | … | 1.271799 | -3.8035 | 3.1323 | -0.26115 | 1.852265 | 0.7909640 | 0.436 | 3.944 | 1.7570 | 4 |
-0.039379208 | 0.8127135 | -2.1523 | 1.8828 | -0.11250 | 0.005454455 | 1.189519 | -2.8057 | 2.4852 | 0.03040 | … | 1.366368 | -3.3928 | 2.4507 | 0.05430 | 1.828059 | 0.7562042 | 0.580 | 3.573 | 1.6960 | 4 |
0.021469000 | 0.8272527 | -1.5895 | 3.7505 | -0.08995 | 0.011312000 | 1.285206 | -2.7423 | 2.6785 | -0.03640 | … | 1.177012 | -2.6649 | 2.1685 | 0.02755 | 1.785930 | 0.7120829 | 0.298 | 3.895 | 1.7575 | 4 |
0.005917000 | 0.9139808 | -2.3310 | 2.8131 | -0.07800 | -0.040868000 | 1.320873 | -2.9778 | 2.2841 | -0.01435 | … | 1.401567 | -3.3728 | 3.3165 | 0.19485 | 1.947570 | 0.8513573 | 0.397 | 4.191 | 1.8180 | 4 |
-0.034448571 | 0.8640626 | -2.4917 | 2.4113 | -0.01960 | -0.013410476 | 1.235196 | -3.3305 | 2.4912 | 0.09420 | … | 1.327886 | -2.9864 | 2.8430 | -0.05300 | 1.882590 | 0.6971337 | 0.370 | 3.775 | 1.9030 | 4 |
0.046837374 | 0.9776022 | -1.8688 | 2.6644 | -0.03600 | 0.019817172 | 1.293644 | -2.7836 | 2.6166 | 0.12540 | … | 1.245906 | -2.4813 | 3.2677 | -0.11460 | 1.901646 | 0.7296095 | 0.283 | 3.813 | 1.8440 | 4 |
-0.014453061 | 0.9553743 | -2.7118 | 2.4640 | -0.01000 | -0.037717347 | 1.285358 | -3.1225 | 2.4506 | 0.03085 | … | 1.457232 | -4.2512 | 3.3754 | 0.09325 | 1.984418 | 0.8511168 | 0.446 | 4.351 | 1.8600 | 4 |
0.046810870 | 0.9259427 | -1.5309 | 1.9420 | -0.11455 | 0.230676087 | 1.491983 | -2.8435 | 2.8405 | 0.33060 | … | 1.111205 | -2.1748 | 2.9009 | -0.03790 | 1.927174 | 0.7622031 | 0.491 | 3.355 | 2.1620 | 4 |
NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | … | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |

*(… the remaining 37 rows introduced above are likewise all NA …)*

## Impute NA's with best values using an iterative method

Next, to impute the missing values, we will use the `mice` function. We will keep the maximum number of iterations at 50 and the method as 'pmm' (predictive mean matching).

```r
imputed_Data = mice(features1, m = 1, maxit = 50, method = 'pmm', seed = 999, printFlag = FALSE)
```

## View imputed results

Now we have imputed results. We will use the first imputed data frame for this study. You could actually test all the different imputations to see which works better.

```r
imputedResultData = mice::complete(imputed_Data, 1)
tail(imputedResultData, 50)
```

 | ax_mean | ax_sd | ax_min | ax_max | ax_median | ay_mean | ay_sd | ay_min | ay_may | ay_median | … | az_sd | az_min | az_maz | az_median | aT_mean | aT_sd | aT_min | aT_maT | aT_median | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
351 | -0.016097030 | 0.8938523 | -2.3445 | 2.3006 | -0.07360 | -0.009759406 | 1.3118166 | -3.4215 | 2.5028 | 0.10890 | … | 1.2645719 | -2.8751 | 3.3718 | -0.07070 | 1.8660297 | 0.7808319 | 0.380 | 4.098 | 1.8200 | 4 |
352 | -0.015565347 | 0.8956615 | -2.2661 | 2.5089 | 0.08640 | 0.027313861 | 1.2940627 | -2.9421 | 2.3497 | 0.15260 | … | 1.3685757 | -3.3165 | 2.6989 | -0.01660 | 1.9304257 | 0.7749686 | 0.127 | 4.463 | 1.8350 | 4 |
353 | 0.024006250 | 0.8653758 | -2.4099 | 2.5328 | -0.03170 | 0.008440625 | 1.3763983 | -3.0422 | 2.3727 | 0.11390 | … | 1.4497833 | -4.2171 | 4.7703 | 0.00110 | 2.0035521 | 0.8300253 | 0.387 | 5.138 | 1.9920 | 4 |
354 | -0.015563000 | 0.8720967 | -2.3451 | 2.3269 | -0.05325 | 0.013962000 | 1.2400913 | -3.1360 | 2.8563 | 0.09145 | … | 1.4189884 | -3.3758 | 3.4279 | -0.10410 | 1.8953800 | 0.8351505 | 0.173 | 4.458 | 1.8735 | 4 |
355 | 0.003894898 | 0.8806773 | -2.3098 | 3.1902 | -0.09260 | 0.022575510 | 1.3019546 | -3.2561 | 2.7833 | -0.05380 | … | 1.2717989 | -3.8035 | 3.1323 | -0.26115 | 1.8522653 | 0.7909640 | 0.436 | 3.944 | 1.7570 | 4 |
356 | -0.039379208 | 0.8127135 | -2.1523 | 1.8828 | -0.11250 | 0.005454455 | 1.1895194 | -2.8057 | 2.4852 | 0.03040 | … | 1.3663678 | -3.3928 | 2.4507 | 0.05430 | 1.8280594 | 0.7562042 | 0.580 | 3.573 | 1.6960 | 4 |
357 | 0.021469000 | 0.8272527 | -1.5895 | 3.7505 | -0.08995 | 0.011312000 | 1.2852056 | -2.7423 | 2.6785 | -0.03640 | … | 1.1770121 | -2.6649 | 2.1685 | 0.02755 | 1.7859300 | 0.7120829 | 0.298 | 3.895 | 1.7575 | 4 |
358 | 0.005917000 | 0.9139808 | -2.3310 | 2.8131 | -0.07800 | -0.040868000 | 1.3208731 | -2.9778 | 2.2841 | -0.01435 | … | 1.4015674 | -3.3728 | 3.3165 | 0.19485 | 1.9475700 | 0.8513573 | 0.397 | 4.191 | 1.8180 | 4 |
359 | -0.034448571 | 0.8640626 | -2.4917 | 2.4113 | -0.01960 | -0.013410476 | 1.2351957 | -3.3305 | 2.4912 | 0.09420 | … | 1.3278861 | -2.9864 | 2.8430 | -0.05300 | 1.8825905 | 0.6971337 | 0.370 | 3.775 | 1.9030 | 4 |
360 | 0.046837374 | 0.9776022 | -1.8688 | 2.6644 | -0.03600 | 0.019817172 | 1.2936436 | -2.7836 | 2.6166 | 0.12540 | … | 1.2459059 | -2.4813 | 3.2677 | -0.11460 | 1.9016465 | 0.7296095 | 0.283 | 3.813 | 1.8440 | 4 |
361 | -0.014453061 | 0.9553743 | -2.7118 | 2.4640 | -0.01000 | -0.037717347 | 1.2853576 | -3.1225 | 2.4506 | 0.03085 | … | 1.4572321 | -4.2512 | 3.3754 | 0.09325 | 1.9844184 | 0.8511168 | 0.446 | 4.351 | 1.8600 | 4 |
362 | 0.046810870 | 0.9259427 | -1.5309 | 1.9420 | -0.11455 | 0.230676087 | 1.4919834 | -2.8435 | 2.8405 | 0.33060 | … | 1.1112049 | -2.1748 | 2.9009 | -0.03790 | 1.9271739 | 0.7622031 | 0.491 | 3.355 | 2.1620 | 4 |
363 | 0.011238614 | 0.8127502 | -1.9602 | 2.1430 | 0.00680 | -0.013367308 | 1.3019546 | -3.0628 | 2.7338 | 0.00070 | … | 1.4534581 | -4.4325 | 2.9648 | -0.03520 | 1.9383000 | 0.8526128 | 0.373 | 4.351 | 1.8705 | 4 |
364 | -0.009812264 | 0.7680463 | -2.3492 | 1.3919 | 0.03110 | 0.013984158 | 0.6084791 | -1.4155 | 0.9273 | 0.11860 | … | 0.9997898 | -3.0031 | 3.5781 | -0.25930 | 1.2219510 | 0.6450616 | 0.233 | 3.603 | 1.0730 | 1 |
365 | -0.026760000 | 0.4780558 | -1.1826 | 0.9934 | 0.05560 | -0.035218269 | 0.5632648 | -1.0761 | 1.2307 | -0.08165 | … | 0.7635922 | -2.3115 | 1.8934 | 0.03005 | 0.9714200 | 0.4214891 | 0.214 | 2.180 | 0.9265 | 1 |
366 | 0.029083000 | 0.7515921 | -2.2628 | 2.4640 | -0.00820 | 0.011159596 | 1.3073606 | -3.1360 | 2.8527 | 0.04010 | … | 1.4534581 | -3.6751 | 2.6187 | -0.22680 | 1.9367549 | 0.7439326 | 0.354 | 4.156 | 1.8450 | 4 |
367 | 0.002401000 | 0.5641062 | -1.1533 | 1.4479 | -0.04215 | 0.011159596 | 1.0358946 | -1.9856 | 2.9217 | -0.07040 | … | 0.7141977 | -1.7791 | 1.3013 | -0.20785 | 1.2607358 | 0.4523664 | 0.376 | 2.106 | 1.2830 | 4 |
368 | 0.017670707 | 0.4158231 | -0.9785 | 1.0647 | 0.07680 | -0.026719608 | 0.4759174 | -0.9340 | 0.9077 | -0.03650 | … | 0.6919936 | -1.6094 | 2.0555 | -0.19365 | 0.8742105 | 0.3962710 | 0.230 | 2.123 | 0.8120 | 1 |
369 | -0.078038776 | 0.4413032 | -1.1099 | 0.9826 | -0.03910 | -0.010626042 | 0.4768587 | -0.9392 | 0.8497 | -0.04655 | … | 0.8165436 | -2.2936 | 2.1036 | -0.29570 | 0.9319524 | 0.4517633 | 0.193 | 2.380 | 0.8865 | 2 |
370 | 0.004372632 | 0.8352791 | -1.6966 | 2.3897 | 0.00845 | -0.010064000 | 1.2746954 | -2.7832 | 2.2841 | 0.03085 | … | 1.2177225 | -3.1289 | 3.0919 | 0.01905 | 1.7844653 | 0.7343952 | 0.489 | 3.764 | 1.7520 | 3 |
371 | 0.016103000 | 0.3997476 | -0.9537 | 1.1546 | 0.03655 | -0.031622772 | 0.4828770 | -0.9772 | 1.1237 | -0.14540 | … | 0.7672163 | -1.9821 | 1.8173 | -0.09240 | 0.9053800 | 0.4160549 | 0.201 | 2.053 | 0.8520 | 2 |
372 | -0.020355446 | 0.4178729 | -1.0524 | 0.9076 | -0.09340 | 0.044400000 | 0.5439558 | -0.9843 | 1.0798 | 0.14000 | … | 0.7552593 | -2.0607 | 1.6134 | -0.17990 | 0.9498911 | 0.3846176 | 0.222 | 1.752 | 0.8950 | 1 |
373 | 0.001363636 | 0.4868077 | -0.9027 | 1.5155 | 0.04820 | 0.031339000 | 1.0619675 | -2.3261 | 2.4081 | -0.00210 | … | 0.7598489 | -1.7482 | 1.3013 | -0.20075 | 1.3272772 | 0.4315494 | 0.478 | 2.288 | 1.3220 | 4 |
374 | -0.008122222 | 0.8831968 | -1.9394 | 3.3244 | -0.09610 | 0.017400971 | 1.3778757 | -3.7580 | 2.4527 | 0.16935 | … | 1.4260617 | -3.1893 | 3.5781 | 0.09325 | 1.9576857 | 0.9167571 | 0.295 | 4.830 | 1.9430 | 4 |
375 | -0.065401010 | 0.8489219 | -2.4871 | 2.1672 | -0.11250 | -0.043491753 | 0.5648206 | -1.5188 | 0.8497 | 0.05440 | … | 1.4259974 | -3.1893 | 4.6557 | 0.08010 | 1.4950297 | 0.8012418 | 0.198 | 4.290 | 1.2550 | 1 |
376 | 0.039720000 | 0.5946125 | -1.5250 | 1.7390 | 0.05040 | 0.061424510 | 0.8133879 | -1.2303 | 1.6255 | 0.05660 | … | 0.9355264 | -2.2936 | 2.9202 | 0.02420 | 1.2507900 | 0.5391791 | 0.294 | 3.081 | 1.1770 | 3 |
377 | 0.022841000 | 0.8646867 | -2.1253 | 2.6378 | 0.05720 | 0.052515306 | 1.1332836 | -2.5429 | 2.3692 | 0.10620 | … | 1.0360114 | -3.0924 | 3.0590 | 0.00110 | 1.5811275 | 0.7053254 | 0.326 | 3.742 | 1.5815 | 3 |
378 | -0.001924510 | 0.5975310 | -1.4775 | 1.4089 | -0.11455 | -0.040868000 | 1.0363392 | -2.3289 | 2.2123 | 0.03025 | … | 0.7546022 | -1.6175 | 1.2922 | -0.18510 | 1.3324845 | 0.5131552 | 0.305 | 2.091 | 1.2830 | 4 |
379 | 0.017975000 | 0.4780750 | -1.2011 | 1.4923 | -0.07450 | -0.022319802 | 0.5072372 | -1.1404 | 1.0361 | -0.04135 | … | 0.7439169 | -2.0052 | 1.7066 | -0.09450 | 0.9151400 | 0.4541700 | 0.262 | 2.264 | 0.8270 | 2 |
380 | -0.070804000 | 0.4780558 | -1.9254 | 0.9244 | -0.05830 | -0.074927551 | 0.5037149 | -1.0485 | 1.0710 | -0.07750 | … | 0.7598489 | -2.1735 | 2.0385 | -0.24560 | 0.9281400 | 0.4813814 | 0.150 | 2.084 | 0.7900 | 2 |
381 | -0.002204762 | 0.9310547 | -2.7832 | 2.5242 | -0.07875 | -0.019305882 | 1.3019546 | -2.4215 | 2.8615 | -0.02880 | … | 1.1771775 | -3.0903 | 2.4800 | -0.19155 | 1.8377451 | 0.7254306 | 0.377 | 3.348 | 1.7770 | 4 |
382 | 0.021469000 | 0.8646867 | -2.0001 | 2.4477 | -0.03400 | 0.051977895 | 1.3628383 | -2.6574 | 2.7414 | 0.15305 | … | 1.1474602 | -2.9516 | 2.6371 | 0.08870 | 1.7884124 | 0.7520192 | 0.400 | 3.651 | 1.9180 | 4 |
383 | -0.015468354 | 0.8127502 | -2.2034 | 2.3405 | -0.02150 | 0.046179798 | 1.3628383 | -2.8594 | 2.7288 | 0.02130 | … | 1.1112049 | -4.2171 | 1.7215 | 0.09600 | 1.7592828 | 0.7680118 | 0.295 | 3.671 | 1.7780 | 4 |
384 | -0.002143000 | 0.4442709 | -0.9949 | 1.0734 | -0.04265 | -0.007904000 | 0.5386439 | -1.2828 | 1.2250 | -0.06765 | … | 0.7335329 | -2.2694 | 2.1640 | -0.30150 | 0.9293627 | 0.4517633 | 0.266 | 2.407 | 0.8000 | 2 |
385 | 0.027587129 | 0.4551125 | -1.2785 | 1.0285 | 0.05660 | -0.035263725 | 0.4854652 | -1.0143 | 1.1332 | -0.03650 | … | 0.7048400 | -2.1237 | 1.8689 | 0.11100 | 0.8571800 | 0.4493956 | 0.164 | 2.222 | 0.8120 | 2 |
386 | 0.017670707 | 0.6981887 | -1.5387 | 2.1808 | -0.04500 | 0.043603191 | 1.2152972 | -2.6631 | 3.1973 | 0.09380 | … | 0.8017314 | -1.6094 | 1.2922 | -0.10680 | 1.4910700 | 0.5158915 | 0.376 | 2.428 | 1.5820 | 4 |
387 | 0.017401000 | 0.7680463 | -1.4528 | 2.2822 | -0.00350 | 0.055612871 | 1.0989870 | -2.7737 | 2.3134 | 0.16785 | … | 1.0468209 | -2.8051 | 1.7055 | -0.01470 | 1.5737525 | 0.6825190 | 0.428 | 2.988 | 1.5810 | 4 |
388 | 0.001363636 | 0.4354711 | -1.0677 | 0.9579 | 0.03655 | -0.017115842 | 0.5501718 | -1.1134 | 1.0798 | -0.01640 | … | 0.7466890 | -2.1237 | 2.0555 | 0.02230 | 0.9342100 | 0.4437911 | 0.266 | 2.222 | 0.8410 | 1 |
389 | 0.036087000 | 0.8741671 | -2.2967 | 3.3393 | -0.03330 | -0.019919792 | 1.4065464 | -2.9778 | 3.0511 | -0.04680 | … | 1.2155255 | -3.8281 | 1.9302 | 0.08820 | 1.8953800 | 0.7778120 | 0.242 | 4.098 | 1.9170 | 4 |
390 | 0.007588000 | 0.8409728 | -1.9602 | 2.2383 | -0.07985 | 0.025797000 | 1.3525870 | -3.1511 | 2.7414 | -0.02135 | … | 1.4189884 | -3.6947 | 2.7486 | -0.14945 | 1.9648889 | 0.8489206 | 0.397 | 3.963 | 1.8600 | 4 |
391 | 0.065754545 | 0.4533416 | -0.7769 | 1.1179 | 0.10470 | 0.047955446 | 0.5539467 | -0.9340 | 1.0356 | 0.03360 | … | 0.7569361 | -2.1362 | 2.3655 | -0.10495 | 0.9663913 | 0.4276036 | 0.285 | 2.353 | 0.8930 | 2 |
392 | -0.030526733 | 0.4442709 | -1.7119 | 1.0302 | 0.03000 | -0.021866667 | 0.6103892 | -1.0198 | 1.6418 | -0.01105 | … | 1.4149706 | -3.3599 | 5.0202 | -0.11600 | 1.3062900 | 0.7562042 | 0.131 | 4.443 | 1.1075 | 1 |
393 | -0.001643000 | 0.8086920 | -1.9033 | 2.5242 | -0.03200 | -0.033747959 | 1.3111909 | -3.0231 | 2.3208 | 0.01690 | … | 1.1671442 | -3.7451 | 2.0425 | -0.19155 | 1.7976224 | 0.7133729 | 0.326 | 3.651 | 1.7310 | 4 |
394 | -0.023916346 | 0.4139117 | -0.6977 | 1.1179 | -0.04360 | 0.011312000 | 0.4828770 | -1.2828 | 1.1237 | 0.04940 | … | 0.7135787 | -1.9553 | 1.8769 | -0.23950 | 0.8609714 | 0.4064190 | 0.054 | 2.031 | 0.7900 | 2 |
395 | 0.037914706 | 0.4369138 | -0.9701 | 0.9937 | 0.07080 | -0.011703810 | 0.4883374 | -1.0822 | 1.1166 | -0.08405 | … | 0.7141977 | -1.9285 | 2.0766 | 0.08010 | 0.8621584 | 0.4222442 | 0.193 | 2.180 | 0.7910 | 2 |
396 | -0.024820792 | 0.8127135 | -1.9299 | 2.6378 | 0.01800 | -0.044580000 | 1.1363141 | -2.5429 | 2.4081 | -0.12910 | … | 1.0066063 | -2.4043 | 1.5056 | -0.12860 | 1.6121359 | 0.5853224 | 0.052 | 2.517 | 1.6945 | 4 |
397 | -0.016237500 | 0.7620745 | -2.4099 | 1.7855 | -0.05150 | 0.032355102 | 1.1534694 | -2.6734 | 2.4506 | 0.07725 | … | 1.4259974 | -4.1238 | 4.2297 | -0.24790 | 1.7976224 | 0.9082928 | 0.212 | 5.397 | 1.6595 | 3 |
398 | -0.039379208 | 0.5614528 | -1.7119 | 1.4600 | -0.11620 | -0.032463000 | 1.1096189 | -2.4111 | 2.4533 | -0.09910 | … | 1.1076786 | -3.1215 | 2.2947 | -0.14000 | 1.5025833 | 0.7521618 | 0.168 | 3.790 | 1.4420 | 3 |
399 | 0.026206186 | 0.7980083 | -1.9033 | 2.3863 | 0.00210 | 0.009870874 | 1.2557210 | -2.8507 | 2.4343 | 0.13105 | … | 1.2135140 | -2.5112 | 2.1638 | -0.22680 | 1.7924158 | 0.6828006 | 0.397 | 3.197 | 1.7150 | 3 |
400 | 0.072777778 | 0.4051881 | -0.8386 | 0.8847 | 0.15575 | 0.015370408 | 0.4759174 | -0.9340 | 1.2039 | 0.01090 | … | 0.7135787 | -2.1186 | 1.5632 | -0.13970 | 0.9087400 | 0.3767882 | 0.170 | 2.507 | 0.8120 | 1 |

## Looking at distributions of actual and imputed data

We will first compare basic statistics and then the distributions of a couple of features. Comparing the statistics of the actual and imputed data, we can observe that the mean and SD of both are almost equal.

```r
data.frame(actual_ax_mean = c(mean(features$ax_mean), sd(features$ax_mean)),
           imputed_ax_mean = c(mean(imputedResultData$ax_mean), sd(imputedResultData$ax_mean)),
           actual_ax_median = c(mean(features$ax_median), sd(features$ax_median)),
           imputed_ax_median = c(mean(imputedResultData$ax_median), sd(imputedResultData$ax_median)),
           actual_az_sd = c(mean(features$az_sd), sd(features$az_sd)),
           imputed_az_sd = c(mean(imputedResultData$az_sd), sd(imputedResultData$az_sd)),
           row.names = c("mean", "sd"))
```

 | actual_ax_mean | imputed_ax_mean | actual_ax_median | imputed_ax_median | actual_az_sd | imputed_az_sd |
---|---|---|---|---|---|---|
mean | 0.006307909 | 0.005851233 | -0.001328867 | -0.00214025 | 1.0588650 | 1.0528059 |
sd | 0.030961085 | 0.031125848 | 0.059619834 | 0.06011342 | 0.2446782 | 0.2477697 |

Now, let's look at the distributions of the data. From the density plots below, we can observe that the distributions of the actual and imputed data are almost identical. We can confirm this with the bandwidths shown in the plots.

```r
par(mfrow = c(3, 2))
plot(density(features$ax_mean), main = "Actual ax_mean", type = "l", col = "red")
plot(density(imputedResultData$ax_mean), main = "Imputed ax_mean", type = "l", col = "red")
plot(density(features$ax_median), main = "Actual ax_median", type = "l", col = "red")
plot(density(imputedResultData$ax_median), main = "Imputed ax_median", type = "l", col = "red")
plot(density(features$az_sd), main = "Actual az_sd", type = "l", col = "red")
plot(density(imputedResultData$az_sd), main = "Imputed az_sd", type = "l", col = "red")
```

## Building a classification model on actual data and imputed data

In the following, y will be our classification variable. We will build a classification model using a simple support vector machine (SVM) on both the actual and the imputed data. No transformation will be done on the data. In the end, we will compare the results.

### Actual Data

#### Sample data creation

Let's split the data into train and test sets with an 80:20 ratio.

```r
# create samples of 80:20 ratio
features$y = as.factor(features$y)
sample = sample(nrow(features), nrow(features) * 0.8)
train = features[sample,]
test = features[-sample,]
```

#### Build a SVM model

Now, we can train the model using the train set. We will not do any parameter tuning in this example.

```r
library(e1071)
library(caret)

actual.svm.model = svm(y ~ ., data = train)
summary(actual.svm.model)
```

```
Loading required package: ggplot2

Call:
svm(formula = y ~ ., data = train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05

Number of Support Vectors:  142
 ( 47 18 47 30 )

Number of Classes:  4

Levels:
 1 2 3 4
```

#### Validate SVM model

In the confusion matrix below, we observe the following:

- accuracy > NIR, indicating the model performs much better than always predicting the most common class
- high accuracy and kappa values indicate a very accurate model
- even the per-class balanced accuracy is close to 1, indicating the model is highly accurate

```r
# build a confusion matrix using the caret package
confusionMatrix(predict(actual.svm.model, test), test$y)
```

```
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 10  1  0  0
         2  0 26  0  0
         3  0  0 22  0
         4  0  0  3 11

Overall Statistics

               Accuracy : 0.9452
                 95% CI : (0.8656, 0.9849)
    No Information Rate : 0.3699
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9234
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            1.0000   0.9630   0.8800   1.0000
Specificity            0.9841   1.0000   1.0000   0.9516
Pos Pred Value         0.9091   1.0000   1.0000   0.7857
Neg Pred Value         1.0000   0.9787   0.9412   1.0000
Prevalence             0.1370   0.3699   0.3425   0.1507
Detection Rate         0.1370   0.3562   0.3014   0.1507
Detection Prevalence   0.1507   0.3562   0.3014   0.1918
Balanced Accuracy      0.9921   0.9815   0.9400   0.9758
```
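As a quick arithmetic check, the headline statistics above (accuracy, no-information rate and kappa) can be recomputed by hand from the confusion matrix itself:

```r
# confusion matrix from the output above (rows = prediction, cols = reference)
cm = matrix(c(10,  1,  0,  0,
               0, 26,  0,  0,
               0,  0, 22,  0,
               0,  0,  3, 11), nrow = 4, byrow = TRUE)

n = sum(cm)                                       # 73 test observations
accuracy = sum(diag(cm)) / n                      # correct predictions: 69/73
nir = max(colSums(cm)) / n                        # share of the largest class: 27/73
expected = sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement
kappa = (accuracy - expected) / (1 - expected)    # agreement beyond chance

c(accuracy = accuracy, nir = nir, kappa = kappa)
```

These reproduce the reported 0.9452, 0.3699 and 0.9234, which is a useful sanity check when reading `caret` output.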

### Imputed Data

#### Sample data creation

```r
# create samples of 80:20 ratio
imputedResultData$y = as.factor(imputedResultData$y)
sample = sample(nrow(imputedResultData), nrow(imputedResultData) * 0.8)
train = imputedResultData[sample,]
test = imputedResultData[-sample,]
```

#### Build a SVM model

```r
imputed.svm.model = svm(y ~ ., data = train)
summary(imputed.svm.model)
```

```
Call:
svm(formula = y ~ ., data = train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.05

Number of Support Vectors:  167
 ( 59 47 36 25 )

Number of Classes:  4

Levels:
 1 2 3 4
```

#### Validate SVM model

In the confusion matrix below, we observe the following:

- accuracy > NIR, indicating the model performs much better than always predicting the most common class
- high accuracy and kappa values indicate a very accurate model
- even the per-class balanced accuracy is close to 1, indicating the model is highly accurate

```r
confusionMatrix(predict(imputed.svm.model, test), test$y)
```

```
Confusion Matrix and Statistics

          Reference
Prediction  1  2  3  4
         1 15  0  0  0
         2  1 21  0  0
         3  0  0 17  0
         4  0  0  0 26

Overall Statistics

               Accuracy : 0.9875
                 95% CI : (0.9323, 0.9997)
    No Information Rate : 0.325
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9831
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: 1 Class: 2 Class: 3 Class: 4
Sensitivity            0.9375   1.0000   1.0000    1.000
Specificity            1.0000   0.9831   1.0000    1.000
Pos Pred Value         1.0000   0.9545   1.0000    1.000
Neg Pred Value         0.9846   1.0000   1.0000    1.000
Prevalence             0.2000   0.2625   0.2125    0.325
Detection Rate         0.1875   0.2625   0.2125    0.325
Detection Prevalence   0.1875   0.2750   0.2125    0.325
Balanced Accuracy      0.9688   0.9915   1.0000    1.000
```

### Overall results

What we saw above and its interpretation is largely subjective. One way to truly validate the models is to create random train and test samples multiple times (say 100), build a model each time, validate it, and capture the accuracy. Finally, use a simple t-test to see if there is a significant difference.

Null hypothesis:

H0: there is no significant difference between two samples.

```r
# let's create a function to simplify the process
test.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample,]
  test = data[-sample,]
  # build model
  svm.model = svm(y ~ ., data = train)
  # get metrics
  metrics = confusionMatrix(predict(svm.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# now let's calculate accuracy with actual data to get 100 results
actual.results = NULL
for(i in 1:100) {
  actual.results[i] = test.function(features)
}
head(actual.results)
# 0.978021978021978
# 0.978021978021978
# 0.978021978021978
# 0.945054945054945
# 0.989010989010989
# 0.967032967032967

# now let's calculate accuracy with imputed data to get 100 results
imputed.results = NULL
for(i in 1:100) {
  imputed.results[i] = test.function(imputedResultData)
}
head(imputed.results)
# 0.97
# 0.95
# 0.92
# 0.96
# 0.92
# 0.96
```

#### T-test to test the results

What’s better than statistically prove if there is significant difference right? So, we will do a t-test to see if there is any statistical difference in the accuracy.

```
# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.results, y = imputed.results, conf.level = 0.95)

	Welch Two Sample t-test

data:  actual.results and imputed.results
t = 7.9834, df = 194.03, p-value = 1.222e-13
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.01673213 0.02771182
sample estimates:
mean of x mean of y
 0.968022  0.945800
```

In the above t-test we set the confidence level at 95%. The p-value is well below 0.05, indicating a significant difference in accuracy between the actual and imputed data. From the sample estimates, the mean accuracy with actual data is about 96.8%, while with imputed data it is about 94.6%, a gap of roughly 2.2%. So, does that mean imputing data reduces accuracy across various models?

Why not run the same test on other models and compare the results? Let's consider four other models:

- Random forest
- Decision tree
- KNN
- Naive Bayes

#### Random Forest

Let’s apply the same steps as above to fit a random forest. The resulting accuracies are shown in the table below.

```r
library(randomForest)

# lets create a function to simplify the process
test.rf.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample, ]
  test = data[-sample, ]
  # build model
  rf.model = randomForest(y ~ ., data = train)
  # get metrics
  metrics = confusionMatrix(predict(rf.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# now lets calculate accuracy with actual data over 100 runs
actual.rf.results = NULL
for(i in 1:100) {
  actual.rf.results[i] = test.rf.function(features)
}

# now lets calculate accuracy with imputed data over 100 runs
imputed.rf.results = NULL
for(i in 1:100) {
  imputed.rf.results[i] = test.rf.function(imputedResultData)
}

head(data.frame(Actual = actual.rf.results, Imputed = imputed.rf.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.rf.results, y = imputed.rf.results, conf.level = 0.95)
```

Actual | Imputed |
---|---|
0.956044 | 0.95 |
1.000000 | 0.93 |
0.967033 | 0.96 |
0.967033 | 0.96 |
1.000000 | 0.97 |
0.967033 | 0.93 |

```
	Welch Two Sample t-test

data:  actual.rf.results and imputed.rf.results
t = 11.734, df = 183.2, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.02183138 0.03065654
sample estimates:
mean of x mean of y
 0.976044  0.949800
```

The above t-test leads to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 2.6%.

#### Decision Tree

```r
library(rpart)

# lets create a function to simplify the process
test.dt.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample, ]
  test = data[-sample, ]
  # build model
  dt.model = rpart(y ~ ., data = train, method = "class")
  # get metrics
  metrics = confusionMatrix(predict(dt.model, test, type = "class"), test$y)
  return(metrics$overall['Accuracy'])
}

# now lets calculate accuracy with actual data over 100 runs
actual.dt.results = NULL
for(i in 1:100) {
  actual.dt.results[i] = test.dt.function(features)
}

# now lets calculate accuracy with imputed data over 100 runs
imputed.dt.results = NULL
for(i in 1:100) {
  imputed.dt.results[i] = test.dt.function(imputedResultData)
}

head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
```

Actual | Imputed |
---|---|
0.978022 | 0.92 |
0.967033 | 0.94 |
0.967033 | 0.95 |
0.956044 | 0.94 |
0.956044 | 0.94 |
0.978022 | 0.95 |

```
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 16.24, df = 167.94, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.03331888 0.04254046
sample estimates:
mean of x mean of y
0.9703297 0.9324000
```

The above t-test leads to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 3.8%.

#### K-Nearest Neighbor (KNN)

```r
library(class)

# lets create a function to simplify the process
test.knn.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample, ]
  test = data[-sample, ]
  # build model; knn() expects numeric predictors, so drop the response column
  knn.model = knn(train[, names(train) != "y"], test[, names(test) != "y"],
                  cl = train$y, k = 5)
  # get metrics
  metrics = confusionMatrix(knn.model, test$y)
  return(metrics$overall['Accuracy'])
}

# now lets calculate accuracy with actual data over 100 runs
actual.dt.results = NULL
for(i in 1:100) {
  actual.dt.results[i] = test.knn.function(features)
}

# now lets calculate accuracy with imputed data over 100 runs
imputed.dt.results = NULL
for(i in 1:100) {
  imputed.dt.results[i] = test.knn.function(imputedResultData)
}

head(data.frame(Actual = actual.dt.results, Imputed = imputed.dt.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.dt.results, y = imputed.dt.results, conf.level = 0.95)
```

Actual | Imputed |
---|---|
0.967033 | 0.97 |
1.000000 | 0.98 |
0.978022 | 0.99 |
0.978022 | 1.00 |
0.967033 | 1.00 |
0.978022 | 1.00 |

```
	Welch Two Sample t-test

data:  actual.dt.results and imputed.dt.results
t = 3.2151, df = 166.45, p-value = 0.001566
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.002126868 0.008895110
sample estimates:
mean of x mean of y
 0.989011  0.983500
```

The above t-test leads to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy. Here the gap is much smaller, approximately 0.5%, but still statistically significant.

#### Naive Bayes

```r
library(e1071)  # provides naiveBayes()

# lets create a function to simplify the process
test.nb.function = function(data){
  # create samples
  sample = sample(nrow(data), nrow(data) * 0.75)
  train = data[sample, ]
  test = data[-sample, ]
  # build model
  nb.model = naiveBayes(y ~ ., data = train)
  # get metrics
  metrics = confusionMatrix(predict(nb.model, test), test$y)
  return(metrics$overall['Accuracy'])
}

# now lets calculate accuracy with actual data over 100 runs
actual.nb.results = NULL
for(i in 1:100) {
  actual.nb.results[i] = test.nb.function(features)
}

# now lets calculate accuracy with imputed data over 100 runs
imputed.nb.results = NULL
for(i in 1:100) {
  imputed.nb.results[i] = test.nb.function(imputedResultData)
}

head(data.frame(Actual = actual.nb.results, Imputed = imputed.nb.results))

# Do a simple t-test to see if there is a difference in accuracy when data is imputed
t.test(x = actual.nb.results, y = imputed.nb.results, conf.level = 0.95)
```

Actual | Imputed |
---|---|
0.989011 | 0.95 |
0.967033 | 0.92 |
0.978022 | 0.94 |
1.000000 | 0.95 |
0.989011 | 0.90 |
0.967033 | 0.93 |

```
	Welch Two Sample t-test

data:  actual.nb.results and imputed.nb.results
t = 18.529, df = 174.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.04214191 0.05218996
sample estimates:
mean of x mean of y
0.9740659 0.9269000
```

The above t-test leads to a similar conclusion: there is a significant difference between the actual-data and imputed-data accuracy, of approximately 4.7%.

### Conclusion

From the above results we observe that, irrespective of the type of model built, the actual data consistently produced a more accurate model than the imputed data, with the gap in mean accuracy ranging from roughly 0.5% (KNN) to almost 5% (Naive Bayes).
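The per-model gaps can be collected into one table; the values below are copied from the "sample estimates" in the t-test outputs earlier in this post:

```r
# Mean accuracies taken from the t-test outputs above
model.summary = data.frame(
  model   = c("SVM", "Random Forest", "Decision Tree", "KNN", "Naive Bayes"),
  actual  = c(0.9680, 0.9760, 0.9703, 0.9890, 0.9741),
  imputed = c(0.9458, 0.9498, 0.9324, 0.9835, 0.9269)
)
model.summary$drop = round(model.summary$actual - model.summary$imputed, 4)
model.summary[order(-model.summary$drop), ]  # largest accuracy drop first
```

Naive Bayes shows the largest drop (about 4.7%) and KNN the smallest (about 0.5%).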

If you enjoyed this tutorial, then check out my other tutorials and my GitHub page for all the source code and various R-packages.

The post Testing the Effect of Data Imputation on Model Accuracy appeared first on Hi! I am Nagdev.
