Model Exploration using K-sample Plot in Big Data


Generally, the error rate of a model predicting a binary variable plateaus as the sample size increases. On the training data, the error rate rises from 0 toward the true error; on test data, it decreases from 1 toward the true error. The same phenomenon occurs when predicting a continuous variable, with the error rate replaced by the proportion of explained variance. A figure illustrating this concept appears in the textbook "Data Mining and Statistics for Decision Making".



When we have to deal with Big Data, most machine learning methods cannot estimate their parameters in a practical amount of time. So I propose checking the performance of a candidate model with the
"K-sample Plot (K's plot)" explained below when dealing with Big Data.

K's plot is a simple algorithm that draws 1 – AUC (or the proportion of explained variance) against sample size. The reason for using 1 – AUC instead of the error rate is that AUC accounts for sensitivity and specificity simultaneously.
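
For reference, AUC can be computed directly from prediction scores and binary labels via the rank-based (Mann-Whitney) formula. Here is a minimal sketch; auc_hat is an illustrative helper of mine, not part of the KsPlot package:

# Minimal AUC from scores and 0/1 labels using the rank (Mann-Whitney) formula.
# auc_hat is an illustrative helper, not part of the KsPlot package.
auc_hat <- function(score, label) {
  n1 <- sum(label == 1)   # number of positives
  n0 <- sum(label == 0)   # number of negatives
  r  <- rank(score)       # ranks of all scores, ties averaged
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}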

Step 1. Sample K observations from the Big Data.
Step 2. Estimate 1 – AUC on the training samples and the test samples by cross-validation.
Step 3. Increase the sampling number K from a small size to a sufficient size, repeating Step 2 for each K.
Step 4. Plot K (x-axis) vs. 1 – AUC (y-axis).
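
A minimal sketch of these four steps for a continuous outcome, using the proportion of explained variance (R-squared) in place of AUC, might look like the following. The function name ks_plot_sketch and its arguments are illustrative and are not the KsPlot API:

# Illustrative sketch of the K's plot steps for a continuous outcome.
# ks_plot_sketch is my own name, not a function of the KsPlot package.
ks_plot_sketch <- function(X, y, k_grid = c(100, 500, 1000, 5000, 10000),
                           n_folds = 5) {
  r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  res <- t(sapply(k_grid, function(k) {             # Step 3: repeat over K
    idx  <- sample(nrow(X), k)                      # Step 1: sample K rows
    Xs   <- X[idx, , drop = FALSE]
    ys   <- y[idx]
    fold <- sample(rep(seq_len(n_folds), length.out = k))
    cv <- sapply(seq_len(n_folds), function(f) {    # Step 2: cross-validation
      train <- data.frame(Xs[fold != f, , drop = FALSE], y = ys[fold != f])
      fit   <- lm(y ~ ., data = train)
      pred  <- predict(fit, newdata = Xs[fold == f, , drop = FALSE])
      c(train = r2(train$y, fitted(fit)), test = r2(ys[fold == f], pred))
    })
    rowMeans(cv)                                    # average over folds
  }))
  matplot(k_grid, 1 - res, type = "b", log = "x", pch = 1:2,  # Step 4: plot
          xlab = "sample size K", ylab = "1 - explained variance")
  legend("topright", c("training", "test"), pch = 1:2, col = 1:2)
  invisible(data.frame(K = k_grid, res))
}

Calling ks_plot_sketch(X, y) on a tall data set then shows where the training and test curves converge toward the true error.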


I have implemented K's plot in an R package named "KsPlot". A partial example using KsPlot follows.

library(KsPlot)

# Simulate one million observations: y is linear in x1 and quadratic in x2
set.seed(1)
x1 <- rnorm(1000000)
set.seed(2)
x2 <- rnorm(1000000)
set.seed(3)
y  <- 2 * x1 + x2^2 + rnorm(1000000)

X1 <- data.frame(x1 = x1, x2 = x2)            # raw predictors only
X2 <- data.frame(x1 = x1, x2 = x2, x3 = x2^2) # with the quadratic term added

# K's plot for a linear model (the default) and for an SVM
set.seed(1)
KsResult1 <- KsamplePlot(X1, y)
set.seed(1)
KsResult2 <- KsamplePlot(X1, y, Method = "svm")

This example has a continuous outcome y and two continuous explanatory variables, x1 and x2, with one million observations! The relationship between y and x1 is linear, and the relationship between y and x2 is quadratic. If we fit a linear model and an SVM (support vector machine) to the full data, the linear model takes a few seconds, but the SVM takes over a day! So it is better to check the performance of the linear model and the SVM with K's plot first.
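
To get a feel for why the full-data SVM is infeasible, one can time both models on a modest subsample. A rough sketch, assuming the SVM comes from the e1071 package (the post does not name the SVM implementation):

library(e1071)                      # assumed source of svm(); not named in the post

idx <- sample(length(y), 10000)     # a modest subsample
d   <- data.frame(X1[idx, ], y = y[idx])
system.time(lm(y ~ ., data = d))    # the linear model finishes almost instantly
system.time(svm(y ~ ., data = d))   # the SVM is already noticeably slower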

Result of the linear model:

Result of the SVM:

Comparing the two results, we can see that the SVM performs better than the linear model in this case. At present, the following models are implemented (selected via the Method= option) for a continuous target variable: linear model (lm), support vector machine (svm), neural network (nn), random forest (rf), multivariate adaptive regression splines (mars), classification and regression tree (cart), and LASSO (lasso). For a binary target, lm and svm are implemented.
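
Switching methods is just a matter of changing the Method argument. For example (the result names KsResult3 and KsResult4 are mine; the Method values come from the list above):

set.seed(1)
KsResult3 <- KsamplePlot(X1, y, Method = "rf")    # random forest
set.seed(1)
KsResult4 <- KsamplePlot(X1, y, Method = "lasso") # LASSO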

This method is part of my doctoral thesis, which is written in Japanese. We are now translating it into English and plan to submit it to a statistics journal.
