Machine Learning Strategy (Part 1)

[This article was first published on Philipp Probst, and kindly contributed to R-bloggers].

Machine learning strategy is about how to tackle machine learning tasks strategically. In three blog posts I will give an introduction to this topic, and I hope for some comments and opinions on it.

Task Definition

The first step of a machine learning task is usually to define the task itself: what is the purpose, and what do we want to achieve? Sometimes the target is not entirely clear at the beginning.

After that, the next step is to think about how to achieve the task in the best way. Questions that arise are:

  1. Which data, and how much of it, should be collected or is already available?
  2. How should the data be structured, transformed and divided?
  3. Which algorithm with which hyperparameters should be used?

How can these questions be answered?

Target definition

The first step is to clearly define the target:

  1. What metric(s) do we want to optimize? (optimizing metrics)
  2. Under which constraints should they be optimized? (satisficing metrics)
  3. Are there observations to which we should give a stronger weight?

Example: We want to minimize the mean squared error under the constraint that the runtime should be less than 5 minutes.
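
The example above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Candidate` class, the `select_model` helper, the model names and all numbers are made up for this sketch and do not come from a real experiment. The satisficing metric (runtime) is used as a filter, and the optimizing metric (mean squared error) is minimized among the candidates that pass the filter.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str           # hypothetical model label
    mse: float          # optimizing metric: lower is better
    runtime_min: float  # satisficing metric: must stay below the limit


def select_model(candidates, max_runtime_min=5.0):
    """Return the candidate with the lowest MSE among those under the runtime limit."""
    feasible = [c for c in candidates if c.runtime_min < max_runtime_min]
    if not feasible:
        return None  # no candidate satisfies the constraint
    return min(feasible, key=lambda c: c.mse)


# Made-up numbers, for illustration only.
candidates = [
    Candidate("model_a", mse=0.80, runtime_min=1.0),
    Candidate("model_b", mse=0.50, runtime_min=4.0),
    Candidate("model_c", mse=0.40, runtime_min=12.0),
]

best = select_model(candidates)
print(best.name)
```

Here `model_c` has the lowest error but violates the runtime constraint, so it is never considered; the choice is made only among the candidates that run in under 5 minutes.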

Data split

The next step is to divide the data into different parts. Usually the data is divided into three parts:

  1. Training data: used for training the algorithm
  2. Development data: used for iteratively evaluating the trained algorithm
  3. Test data: used for the final evaluation of the trained algorithm
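
A three-way split like the one listed above could look roughly as follows. This is a minimal sketch using only the Python standard library; the function name `split_data` and the fractions 0.6 / 0.2 are placeholders chosen for the illustration, not a recommendation for how large each part should be.

```python
import random


def split_data(rows, train_frac=0.6, dev_frac=0.2, seed=0):
    """Shuffle the rows and cut them into training, development and test parts.

    The fractions are arbitrary placeholders; whatever is left after the
    training and development parts becomes the test part.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed so the split is reproducible
    n_train = round(len(rows) * train_frac)
    n_dev = round(len(rows) * dev_frac)
    train = rows[:n_train]
    dev = rows[n_train:n_train + n_dev]
    test = rows[n_train + n_dev:]
    return train, dev, test


train, dev, test = split_data(range(100))
print(len(train), len(dev), len(test))
```

Shuffling before cutting avoids a split that follows the original order of the data; for time series or grouped data a different splitting scheme would be needed.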

By using (repeated) cross-validation, training and development data can be interchanged. For example, in 5-fold cross-validation the data is divided into 5 parts, and each part is used once as development data for evaluating the metric while the other parts are used for training. At the end one can, for example, take the mean of the results on the development data.
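
The 5-fold procedure described above can be sketched as follows. The helper names (`k_fold_indices`, `cross_validate`, `fit_mean`, `predict_mean`) and the toy data are assumptions made for this illustration; the "algorithm" is deliberately trivial (it always predicts the mean of the training targets) so that the mechanics of the folds stay visible.

```python
import random
from statistics import mean


def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and distribute them over k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Use each fold once as development data and return the mean MSE and per-fold MSEs."""
    scores = []
    for dev_idx in k_fold_indices(len(xs), k, seed):
        dev_set = set(dev_idx)
        train_idx = [i for i in range(len(xs)) if i not in dev_set]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in dev_idx]
        scores.append(mean(errors))  # mean squared error on this fold
    return mean(scores), scores


# Toy "algorithm" for illustration: always predict the mean training target.
def fit_mean(xs, ys):
    return mean(ys)


def predict_mean(model, x):
    return model


# Made-up data, for illustration only.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

cv_mse, fold_mses = cross_validate(xs, ys, fit_mean, predict_mean)
print(len(fold_mses), cv_mse)
```

For repeated cross-validation, the same function could be called several times with different values of `seed` and the results averaged again.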

How to divide the data?

Data sizes:

In the next blog post, I will write more about the possibilities of improving an algorithm once it has been trained on the training data, and how this can be done in an iterative process.

This blog post is partly based on a course about deep learning on coursera.org that I took recently. Hence, a lot of credit for this post goes to Andrew Ng, who taught this course.

Feel free to leave a comment below and share your experiences and opinions about this topic. How do you tackle machine learning problems strategically?
