How to prepare and apply machine learning to your dataset

[This article was first published on R-exercises, and kindly contributed to R-bloggers].


Dear reader,

If you are a newbie in the world of machine learning, then this tutorial is exactly what you need in order to introduce yourself to this exciting new part of the data science world.

This post includes a full machine learning project that will guide you step by step to create a “template,” which you can use later on other datasets.

In this step-by-step tutorial you will:

1. Use one of the most popular machine learning packages in R.
2. Explore a dataset by using statistical summaries and data visualization.
3. Build 5 machine-learning models, pick the best, and build confidence that the accuracy is reliable.

No two machine learning projects follow exactly the same process, but most share these standard steps:

1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.


1. Load the Package and Data

The first thing you have to do is install and load the “caret” package with:
install.packages("caret")
library(caret)

We also need a dataset to work with. We will use the built-in “iris” dataset, which contains 150 observations of iris flowers. Four columns hold measurements of the flowers in centimeters, and the fifth column holds the species of the flower observed. Every flower belongs to one of three species. To attach it to the environment, use:
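The snippet that belongs here appears to be missing from the post. A minimal sketch, assuming the built-in iris data and the variable name dataset that the rest of the tutorial uses:

```r
# Load the built-in iris data and copy it to `dataset`,
# the name the later code in this tutorial refers to.
data(iris)
dataset <- iris
```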

1.1 Create a Validation Dataset

We need a way to know whether the models we create are any good. Later, we will use statistical methods to estimate the accuracy of the models on unseen data. To be more certain about the accuracy of the best model, we will also evaluate it on data it has actually never seen. To do this, we will hold back some data that the algorithms will not get to see, and use it later to get a second, independent idea of how accurate the best model really is.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% of which we will hold back as a validation dataset. Look at the example below:
# create a list of 80% of the rows in the original dataset to use for training
# (despite its name, validation_index holds the rows that are kept for training)
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# select the other 20% of the data for validation
validation <- dataset[-validation_index,]
# use the 80% of the data selected above for training and testing the models
dataset <- dataset[validation_index,]

You now have training data in the dataset variable and a validation set that will be used later in the validation variable.

Learn more about machine learning in the online course Beginner to Advanced Guide on Machine Learning with R Tool. In this course you will learn how to:

  • Create a machine learning algorithm from a beginner's point of view
  • Quickly dive into more advanced methods at an accessible pace and with more explanations
  • And much more

This course shows a complete workflow from start to finish. It is a great introduction, and a useful reference once you have some experience.


2. Summarize the Dataset

In this step, we are going to explore our dataset. More specifically, we need to know certain features of it, such as:

1. Dimensions of the dataset.
2. Types of the attributes.
3. Details of the data.
4. Levels of the class attribute.
5. Analysis of the instances in each class.
6. Statistical summary of all attributes.

2.1 Dimensions of Dataset

We can see how many instances (rows) and how many attributes (columns) the data contains with the dim function. Look at the example below:
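The example seems to have been dropped from the post. A minimal sketch of the call; the fallback line is an assumption added only so that the snippet runs on its own:

```r
# Fall back to the full iris data if `dataset` has not been created yet.
if (!exists("dataset")) dataset <- iris

# Returns the number of rows first, then the number of columns.
dim(dataset)
```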

2.2 Types of Attributes

Knowing the types of the attributes is important, as it helps you decide how to summarize the data and which transformations you might need to apply before modeling. They could be doubles, integers, strings, factors or other types. You can find them with:
sapply(dataset, class)

2.3 Details of the Data

You can take a look at the first 5 rows of the data with:
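The call itself is missing from the post. A minimal sketch: head returns six rows by default, so the row count is passed explicitly to get five. The fallback line and the name first_rows are assumptions for illustration.

```r
# Fall back to the full iris data if `dataset` has not been created yet.
if (!exists("dataset")) dataset <- iris

# Show the first 5 rows (head() would return 6 without n = 5).
first_rows <- head(dataset, n = 5)
print(first_rows)
```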

2.4 Levels of the Class

The class variable is a factor that has multiple class labels or levels. Let’s look at the levels:
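The snippet is missing here as well. A minimal sketch using the base R levels function; the fallback line and the name class_levels are assumptions for illustration:

```r
# Fall back to the full iris data if `dataset` has not been created yet.
if (!exists("dataset")) dataset <- iris

# List the class labels of the Species factor.
class_levels <- levels(dataset$Species)
print(class_levels)  # "setosa" "versicolor" "virginica"
```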

Because the class has three levels, this is a multi-class (multinomial) classification problem. If there were only two levels, it would be a binary classification problem.

2.5 Class Distribution

Let’s now take a look at the number of instances that belong to each class. We can view this as an absolute count and as a percentage with:
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)

2.6 Statistical Summary

Finally, we can look at a summary of each attribute. This includes the mean, the min and max values, as well as some percentiles. Look at the example below:
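The example is not present in the post. A minimal sketch using the base R summary function; the fallback line and the name attribute_summary are assumptions for illustration:

```r
# Fall back to the full iris data if `dataset` has not been created yet.
if (!exists("dataset")) dataset <- iris

# Min, quartiles, median, mean and max for each numeric column,
# plus a count per level for the Species factor.
attribute_summary <- summary(dataset)
print(attribute_summary)
```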


3. Visualize the Dataset

We now have a basic idea about the data. We need to extend that with some visualizations, and for that reason we are going to use two types of plots:

1. Univariate plots to understand each attribute.
2. Multivariate plots to understand the relationships between attributes.

3.1 Univariate Plots

It helps to be able to refer to the input attributes and the output attribute separately. Let’s set that up and call the input attributes x and the output attribute (the class) y.
x <- dataset[,1:4]
y <- dataset[,5]

Since the input variables are numeric, we can create box and whisker plots of each one with:
par(mfrow=c(1,4))
for(i in 1:4) {
  boxplot(x[,i], main=names(iris)[i])
}

We can also create a barplot of the Species class variable to graphically display the class distribution.
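The plotting call is missing from the post. A minimal sketch using base R's barplot on the class counts; the fallback line and the plot title are assumptions for illustration:

```r
# Fall back to the full iris data if `dataset` has not been created yet.
if (!exists("dataset")) dataset <- iris
y <- dataset[, 5]

# One bar per species, with the bar height giving the number of instances.
barplot(table(y), main = "Species")
```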

3.2 Multivariate Plots

First, we create scatterplots of all pairs of attributes and color the points by class. We can then draw ellipses around the points of each class to make the classes easier to tell apart.
You have to install and load the “ellipse” package to do this.
install.packages("ellipse")
library(ellipse)
featurePlot(x=x, y=y, plot="ellipse")

We can also create box and whisker plots of each input variable, but this time they are broken down into separate plots for each class.
featurePlot(x=x, y=y, plot="box")

Next, we can get an idea of the distribution of each attribute. We will use some probability density plots to give smooth lines for each distribution.
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)


4. Evaluate Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

1. Set up the test harness to use 10-fold cross-validation.
2. Build 5 different models to predict species from flower measurements.
3. Select the best model.

4.1 Test Harness

This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
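To make the idea of the splits concrete, here is a small base R illustration of assigning rows to 10 folds by hand. It is only a sketch of the principle, not how caret builds its folds internally, and the names n_rows, fold_id, test_rows and train_rows are hypothetical.

```r
# Illustration only: assign each of 120 rows to one of 10 folds at random.
set.seed(7)
n_rows  <- 120
fold_id <- sample(rep(1:10, length.out = n_rows))

# Use fold 1 as the test part and the other nine folds as the training part.
test_rows  <- which(fold_id == 1)
train_rows <- which(fold_id != 1)

length(test_rows)   # 12
length(train_rows)  # 108
```

Repeating this with each fold in turn as the test part gives the 10 train-test combinations.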

We are using the “Accuracy” metric to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage. Note that caret reports this value as a proportion between 0 and 1.
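As a worked example of the calculation, here is a minimal sketch on made-up labels. The vectors actual and predicted are hypothetical and exist only for illustration.

```r
# Hypothetical labels, for illustration only.
actual    <- c("setosa", "setosa", "versicolor", "virginica", "virginica")
predicted <- c("setosa", "setosa", "versicolor", "versicolor", "virginica")

# Share of correct predictions, as a proportion and as a percentage.
accuracy     <- mean(predicted == actual)
accuracy_pct <- accuracy * 100

accuracy      # 0.8
accuracy_pct  # 80
```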

4.2 Build Models

We don’t know in advance which algorithms will do well on this problem or what configurations to use. The plots we created earlier give us some idea, but the only way to find out is to try several algorithms.

We will evaluate five algorithms:

1. Linear Discriminant Analysis (LDA)
2. Classification and Regression Trees (CART).
3. k-Nearest Neighbors (kNN).
4. Support Vector Machines (SVM) with a radial kernel.
5. Random Forest (RF)

This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run so that every algorithm is evaluated on exactly the same data splits, which makes the results directly comparable.

NOTE: To proceed, first install and load the following packages: “rpart”, “kernlab”, “e1071” and “randomForest”.

Let’s build our five models:
# a) linear algorithms
# LDA
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)

4.3 Select the Best Model

We now have 5 models and accuracy estimations for each so we have to compare them.

It is a good idea to create a list of the created models and use the summary function.
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)

Moreover, we can create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times.
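The plotting call is missing from the post. A minimal sketch, assuming caret's dotplot method for resamples objects. The fallback block is an assumption added so that the snippet runs on its own: it fits only two of the five models, and it is skipped when results already exists from the earlier step.

```r
library(caret)

# Fallbacks so this sketch runs on its own; in the tutorial, `dataset`
# and `results` already exist from the earlier steps.
if (!exists("dataset")) dataset <- iris
if (!exists("results")) {
  control <- trainControl(method = "cv", number = 10)
  set.seed(7)
  fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                   metric = "Accuracy", trControl = control)
  set.seed(7)
  fit.cart <- train(Species ~ ., data = dataset, method = "rpart",
                    metric = "Accuracy", trControl = control)
  results <- resamples(list(lda = fit.lda, cart = fit.cart))
}

# Dot plot of the resampled performance metrics for each model.
print(dotplot(results))
```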

You can summarize the results for just the LDA model, which seems to be the most accurate.
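The snippet is missing here. A minimal sketch that prints the fitted model. The fallback block is an assumption added so that the snippet runs on its own, and it is skipped when fit.lda already exists from the earlier step.

```r
library(caret)

# Fallbacks so this sketch runs on its own; in the tutorial, `dataset`
# and `fit.lda` already exist from the earlier steps.
if (!exists("dataset")) dataset <- iris
if (!exists("fit.lda")) {
  set.seed(7)
  fit.lda <- train(Species ~ ., data = dataset, method = "lda",
                   metric = "Accuracy",
                   trControl = trainControl(method = "cv", number = 10))
}

# Prints the resampling setup and the cross-validated accuracy and kappa.
print(fit.lda)
```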

5. Make Predictions

The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.

We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
