3-step lesson, going into the life of machine learning

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers].

Automatic Machine Learning

Introduction

“I want to develop a model that automatically learns over time”, a really challenging objective. In this post we’ll develop a procedure that loads data, builds a model, makes predictions and, if something changes over time, creates a new model, all with R.

Picture credit: S.H Horikawa

This post intends to recreate, as simply as possible, the machine learning scenario: the automatic creation of a predictive model with temporal concerns. It’s going to be somewhat manual because the objective is to cover a little of the logic behind a machine that learns.

Start with “Small Data” to conquer Big Data 😉

Our case

We have one input variable (age) and one output variable (purchases). We want to predict next month’s purchases based on age. If the model is inaccurate, a new one should be built.

Temporality

In machine learning, it’s quite important to understand temporality. We will stand at 3 different points in time to introduce this concept:

Step 1: Model building (January)

We’re in January, and we’re building the model with historical data, for which we know both variables: age and purchases.

## Loading needed libraries
suppressMessages(library(ggplot2))  
suppressMessages(library(forecast))  

The data sets used in this example are available on GitHub.

## Reading historical data
set.seed(999)

data_historical=read.delim(file="data_historical.txt", header=TRUE, sep="\t")

## Plotting current relationship between age and purchases

ggplot(data_historical, aes(x=age, y=purchases)) +  
  geom_point(shape=1) + ## Points as circles (good to see density)
  geom_smooth(method=lm) ## Linear regression line

## Model creation. Input variable: "age", to predict "purchases".

model=lm(purchases~age, data=data_historical)

You probably know linear regression already, but if you don’t, check this.

Clearly the relationship between age and purchases is linear. After building the linear regression model, we check one accuracy metric: MAPE (Mean Absolute Percentage Error); the closer to 0, the better.

MAPE measures how far the prediction is from the real value, as a percentage.
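To make the metric concrete, here is a minimal sketch of MAPE computed by hand in base R. The helper name `mape` and the toy values are assumptions made for illustration; they are not part of the original post, which relies on `accuracy()` from the forecast package instead.

```r
## Hypothetical helper: MAPE computed by hand, in percentage points
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

## Toy values, invented for illustration only
actual    <- c(100, 200, 400)
predicted <- c(110, 180, 400)

## Individual errors are 10%, 10% and 0%, so the mean is about 6.67
mape(actual, predicted)
```

Note that MAPE is undefined when an actual value is 0, so it only suits targets that stay away from zero.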

## Checking accuracy model
historical_error=round(accuracy(model)[,"MAPE"],2)  
historical_error

## Setting up error threshold (to be used later)
threshold=10 ## 10 represents "10%" of error (MAPE)  

We expect a similar value over the next months; if not, the model is no longer a good representation of reality.

Defining threshold:

We need to define an error threshold. Let’s say that if the error (measured by MAPE) in the following months is higher than 10%, the model has to be rebuilt.

This rebuilding is the key point here. We can automate the process to take new data and build a new model; if this new model has an error below the threshold, it becomes the new model in production (the simplest scenario).
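The rule just described can be sketched as a small decision function. The name `decide_action` and its return labels are hypothetical, chosen only to illustrate the logic; the default threshold of 10 is the one used in this post.

```r
## Hypothetical helper: decide what to do at the end of a month.
## current_error: MAPE of the production model on the month's data
## new_error:     MAPE of a candidate model (NA if none was built)
decide_action <- function(current_error, new_error = NA, threshold = 10) {
  if (current_error <= threshold) {
    return("keep current model")
  }
  if (!is.na(new_error) && new_error < threshold) {
    return("promote new model")
  }
  "manual inspection"
}

decide_action(current_error = 8)                     ## "keep current model"
decide_action(current_error = 18.8, new_error = 6)   ## "promote new model"
decide_action(current_error = 18.8, new_error = 12)  ## "manual inspection"
```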

## Checking model coefficients
model$coefficients  

R output:

(intercept)    age   
-15.4992    100.3812

In other words, this is what the model looks like:
purchases=100.3812*age – 15.4992
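As a quick check of how the equation is used, the sketch below applies the coefficients by hand. The coefficients are copied from the R output above; the three ages are hypothetical and only serve as an example.

```r
## Coefficients copied from the R output above
intercept <- -15.4992
slope     <- 100.3812

## Hypothetical customer ages, chosen only for illustration
ages <- c(20, 30, 40)

## Same arithmetic that predict() performs for a one-variable linear model
predicted_purchases <- intercept + slope * ages
predicted_purchases  ## 1992.125 2995.937 3999.749
```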

Step 2: Model performing ok (February to April)

During this period new customers arrive, and the model to forecast purchases is applied on the first day of each month. At the end of this 3-month period we know how the model performed by looking at the real error (MAPE): predicted purchases vs. real purchases.

Note: The performance simulation and rebuilding with R code are covered in the next step (May).

Error table shows the following:

As can be seen, the error shows an increasing trend, getting closer to the maximum allowed.

Step 3.1: Model performing badly (May)

It is now May 31st, so we know what purchases were over the current month. The following procedure should be executed at the end of every month.

## Read data from past month, May.
data_may=read.delim(file="data_may.txt", header=TRUE, sep="\t")

## Retrieve the predictions made on May 1st based on the model built in January.
forecasted_purchases=predict.lm(object = model, newdata = data.frame(age=data_may$age))

## Checking error
error_may=accuracy(forecasted_purchases, data_may$purchases)[,"MAPE"]  
error_may

## Difference to threshold (10%)
threshold-error_may  

R output says:

“error_may” is 18.79473, and “threshold-error_may” is -8.794733

This month the error exceeds the threshold by 8.79 percentage points.

This is how the model is working in May:

## Further inspection plotting forecasted (blue) against actual (black) purchases.

ggplot(data_may, aes(x=age)) +  
  geom_line(aes(y=forecasted_purchases), colour="blue") +
  geom_point(aes(y=purchases), shape=1)

Step 3.2: Model rebuilding

Clearly, the model works well when predicting purchases for customers under 35 years old, and becomes inaccurate for older people. This segment is buying more than before.

This could be caused, for example, by some change in business policy, a discount which is no longer available, etc.
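One way to confirm this kind of reading of the plot is to compute the error per age segment. The sketch below uses synthetic, noise-free data invented for illustration (not the post’s `data_may`), and `100 * age` merely stands in for the January model’s predictions.

```r
## Synthetic data, invented for illustration: customers older than 35
## buy more than the January relationship (100 * age) suggests
age <- 20:50
purchases <- ifelse(age <= 35, 100 * age, 100 * age + 150 * (age - 35))
forecasted <- 100 * age  ## stands in for the January model's predictions

## Absolute percentage error per customer
ape <- abs((purchases - forecasted) / purchases) * 100

## MAPE per age segment; the cut point of 35 comes from reading the plot
segment <- cut(age, breaks = c(-Inf, 35, Inf),
               labels = c("35 or younger", "older than 35"))
mape_by_segment <- tapply(ape, segment, mean)
mape_by_segment
```

With real data the younger segment would not show an error of exactly 0, but a large gap between the two segments would point to the same conclusion.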

A new model must be created, and its error metrics computed.

## Procedure to generate a new model
if(error_may>threshold) {

  ## Build new model, based on new data.
  new_model=lm(purchases~age, data=data_may)

  ## Assign predictions to 'May' data. They are the predictions for training data.
  data_may$forecasted_purchases=new_model$fitted.values

  ## Plot: new Linear regression
  p=ggplot(data_may, aes(x=age)) +
    geom_line(aes(y=forecasted_purchases), colour="blue") +
    geom_point(aes(y=purchases), shape=1)

  print(p)

  new_error=accuracy(new_model)[,"MAPE"]

  if(new_error<threshold)
    {print("We have a new model built in an automated process! =)")} else
    {print("Manual inspection & building =(")}

  }

R output:

"We have a new model built in an automated process! =)"

We have the new model to run next month (June):

  ## Checking new model coefficients
  new_model$coefficients

R output:

(intercept)    age   
-4414.1504    244.8179

In other words…

purchases=244.8179*age – 4414.1504

And there is the final model!
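Since this procedure should run at the end of every month, it can be wrapped in a single function. The following is a minimal sketch in base R under stated assumptions: `mape` and `monthly_update` are hypothetical names, the data is synthetic, and the new model is checked against its own training data, as in the code above (a holdout set would be a stricter check).

```r
## Hypothetical helper: MAPE in percentage points
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

## Hypothetical wrapper around the end-of-month procedure.
## Returns the model to use next month plus the errors behind the decision.
monthly_update <- function(model, new_data, threshold = 10) {
  current_error <- mape(new_data$purchases, predict(model, newdata = new_data))
  if (current_error <= threshold) {
    return(list(model = model, status = "kept",
                current_error = current_error, new_error = NA))
  }
  ## Current model is too inaccurate: build a candidate on the new data
  candidate <- lm(purchases ~ age, data = new_data)
  ## In-sample error of the candidate, as in the post
  new_error <- mape(new_data$purchases, fitted(candidate))
  if (new_error < threshold) {
    list(model = candidate, status = "rebuilt",
         current_error = current_error, new_error = new_error)
  } else {
    list(model = model, status = "manual inspection",
         current_error = current_error, new_error = new_error)
  }
}

## Synthetic data, invented for illustration (deterministic "noise" via sin)
january <- data.frame(age = 20:60)
january$purchases <- 100 * january$age + 50 * sin(january$age)
may <- data.frame(age = 20:60)
may$purchases <- 250 * may$age - 2000 + 50 * sin(may$age)

model <- lm(purchases ~ age, data = january)
result <- monthly_update(model, may)
result$status              ## "rebuilt"
coef(result$model)         ## coefficients of the model to use in June
```

Returning the errors alongside the model keeps a record of why each decision was taken, which helps when a month ends in manual inspection.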

Final comments

Finally…
