How to arrange training and testing datasets in R

[This article was first published on Data Analysis in R, and kindly contributed to R-bloggers].

The post How to arrange training and testing datasets in R appeared first on finnstats.

If you are interested to learn more about data science, you can find more articles here finnstats.

To divide a data frame into training and test sets for model building in R, you can use the createDataPartition() function from the caret package.

The basic syntax used by this function is as follows:

createDataPartition(y, times = 1, p = 0.5, list = TRUE, …)

where:

y: vector of outcomes

times: number of partitions to create

p: proportion of the data (a value between 0 and 1) to use in the training set

list: logical; if TRUE the row indices are returned as a list, if FALSE as a matrix
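
Before the full example, here is a minimal sketch of how the times and list arguments described above change what comes back. The outcome vector y and the seed are made up for illustration; only createDataPartition() itself comes from caret.

```r
library(caret)

set.seed(1)                           # arbitrary seed, only for repeatability
y <- runif(100, min = 40, max = 100)  # hypothetical outcome vector

# list = TRUE (the default): a list with one vector of row indices per partition
parts_list <- createDataPartition(y, times = 2, p = 0.8, list = TRUE)
str(parts_list)

# list = FALSE: a matrix with one column of row indices per partition
parts_mat <- createDataPartition(y, times = 2, p = 0.8, list = FALSE)
dim(parts_mat)
```

In both cases the values are row positions in y, not the outcomes themselves, so they are meant to be used for subsetting the original data.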

The example below demonstrates how to use this function in practice.

Example: How to arrange training and testing datasets in R

Assume we have a data frame in R with 1,000 rows containing information about students’ study hours and their final exam scores:

Make this example reproducible.

set.seed(0)

Let’s create a data frame

df <- data.frame(hours=runif(1000, min=0, max=10),
                 score=runif(1000, min=40, max=100))

Now we can view the head of the data frame

head(df)
     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
6 2.016819 47.10139

Suppose we want to fit a simple linear regression model that predicts final exam score from hours studied, training it on 80% of the rows in the data frame and testing it on the remaining 20%.

The following code demonstrates how to split the data frame into training and testing sets using the caret package’s createDataPartition() function.

library(caret)

Divide the data frame into training and testing sets.

train_indices <- createDataPartition(df$score, times=1, p=.8, list=FALSE)

Now we are ready to create the training set

dftrain <- df[train_indices , ]

Now we can create a testing set

dftest  <- df[-train_indices, ]

Let’s view the number of rows in each set

nrow(dftrain)
[1] 800
nrow(dftest)
[1] 200

As we can see, our training dataset has 800 rows, which accounts for 80% of the original dataset.

Similarly, our test dataset has 200 rows, which represents 20% of the original dataset.
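
A few quick checks can confirm that the split behaves as described. The sketch below rebuilds the same kind of data frame so that it runs on its own; the seed value is arbitrary and the helper names (train_rows, test_rows, no_overlap, full_cover, train_share) are introduced here only for illustration.

```r
library(caret)

set.seed(0)  # any fixed seed works; it only makes the run repeatable
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))

train_indices <- createDataPartition(df$score, times = 1, p = 0.8, list = FALSE)
dftrain <- df[train_indices, ]
dftest  <- df[-train_indices, ]

# No row should appear in both sets, and together they should cover df
train_rows <- as.integer(rownames(dftrain))
test_rows  <- as.integer(rownames(dftest))
no_overlap <- length(intersect(train_rows, test_rows)) == 0
full_cover <- nrow(dftrain) + nrow(dftest) == nrow(df)

# Share of rows that ended up in the training set
train_share <- nrow(dftrain) / nrow(df)

# For a numeric outcome, caret samples within quantile groups of y,
# so the two score distributions should look similar
summary(dftrain$score)
summary(dftest$score)
```

Comparing the two summaries is a simple way to see that neither set is dominated by unusually high or low scores.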

The first few rows of each set can also be seen:

Now we can view the head of the training set

head(dftrain)
     hours    score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
7 8.983897 42.34600

Let’s view the head of the testing set

head(dftest)
      hours    score
6  2.016819 47.10139
12 2.059746 96.67170
18 7.176185 92.61150
23 2.121425 89.17611
24 6.516738 50.47970
25 1.255551 90.58483

We can then use the training set to train the regression model and the testing set to evaluate its performance.
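
As a rough sketch of that last step, the code below fits the regression on the training rows and computes the root mean squared error (RMSE) on the test rows. It repeats the setup so that it is self-contained; the seed is arbitrary, and the names model, pred and rmse are chosen here for illustration rather than taken from the original post.

```r
library(caret)

set.seed(0)  # any fixed seed works; it only makes the run repeatable
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))

train_indices <- createDataPartition(df$score, times = 1, p = 0.8, list = FALSE)
dftrain <- df[train_indices, ]
dftest  <- df[-train_indices, ]

# Fit a simple linear regression on the training rows only
model <- lm(score ~ hours, data = dftrain)

# Predict scores for the held-out test rows
pred <- predict(model, newdata = dftest)

# Root mean squared error on the test set
rmse <- sqrt(mean((dftest$score - pred)^2))
rmse
```

Because hours and score are generated independently in this toy data, the fitted slope should be close to zero and the test RMSE should sit near the spread of score itself (roughly 17 for values drawn uniformly between 40 and 100). With real data, the same steps give an estimate of how well the model predicts rows it has not seen.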
