How to Split data into train and test in R

[This article was first published on Methods – finnstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

finnstats:-For the latest Data Science, jobs and UpToDate tutorials visit finnstats

Split data into train and test in r, It is critical to partition the data into training and testing sets when using supervised learning algorithms such as Linear Regression, Random Forest, Naïve Bayes classification, Logistic Regression, and Decision Trees etc.

We first train the model using the training dataset’s observations and then use it to predict from the testing dataset.

Splitting helps to avoid overfitting and to improve the training dataset accuracy.

Finally, we need a model that can perform well on unknown data, therefore we utilize test data to test the trained model’s performance at the end.

Decision Tree R Code » Classification & Regression »

First, we need to load some sample data for illustration purposes.

We can make use of the iris data set for the same.

For the same output, we will utilize set.seed function.

set.seed(123)                              
data <- iris
head(data)                                        
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dim(data)
[1] 150   5

The format of our exemplifying data is shown in the preceding RStudio console output.

It has four numeric columns, Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and 150 rows.

Now will try to split these data sets.

Example: split data into train and test in r

Will show you how to use the sample function in R to divide a data frame into training and test data.

Cluster Analysis in R » Unsupervised Approach »

To begin, we’ll create a fake indicator to indicate whether a row is in the training or testing data set.

In an ideal world, we’d have 70% training data and 30% testing data, which would provide the highest level of accuracy.

split1<- sample(c(rep(0, 0.7 * nrow(data)), rep(1, 0.3 * nrow(data))))
split1
  [1] 1 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1
 [48] 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0
 [95] 0 1 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0
[142] 1 0 1 0 0 0 1 1 1

You must first create a seed before creating your training and test sets. This is the random number generator in R.

Timeseries analysis in R » Decomposition, & Forecasting »

The main benefit of using seed is that you may get the same sequence of random numbers every time you use the random number generator with the same seed.

Now we can cross verify our split1 data.

table(split1)                                 
 0   1
105  45

As you can see from the split1 data set, 105 observations assign to the training data (i.e. 0), and 45 observations assign to the testing data (i.e. 45). (i.e. 1).

Now we can create train data set and split the data set separately.

Logistic Regression R- Tutorial » Detailed Overview »

train <- data[split1 == 0, ]            

Let’s have a look at the first rows of our training data:

head(train)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa
dim(train)
[1] 105   5

Rows 3, 4, 6, 8, 9, 10 were assigned to the training data, as seen in the previous RStudio console output.

To define our test data, we may do the same thing.

test <- data[split1== 1, ]    

Let’s additionally print the data set’s head        

head(test)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
7           4.6         3.4          1.4         0.3  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3         3.0          1.1         0.1  setosa

Now we can check the dim (test)

[1] 45  5

It appears to be in good condition. The above data sets were able to utilize to run statistical procedures such as machine learning algorithms on them.

Please note for the better model accuracy need to avoid class imbalance issues.

K Nearest Neighbor Algorithm in Machine Learning » finnstats

The post How to Split data into train and test in R appeared first on finnstats.

To leave a comment for the author, please follow the link and comment on their blog: Methods – finnstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)