Machine Learning Strategy (Part 3)

[This article was first published on Philipp Probst, and kindly contributed to R-bloggers.]

In this third and last blog post about machine learning strategy I will talk about the problems that arise when the training, development and test sets have different distributions, and about learning from multiple tasks.

Different distributions of training and test/dev data sets

In many cases the training set and the test set follow different distributions. The test set should resemble the data on which the model will finally be applied, and the dev set should ideally resemble the test set. When the training set and the dev/test sets are distributed differently, it is not clear where a difference in performance comes from:

→ Is it variance or the different distributions that lead to different performances?

Solution: Create a training-dev set, which has the same distribution as the training set but is held out from training and used for development. Also create a standard dev set with the same distribution as the test set. The performance gap between the training set and the training-dev set then reflects variance, while the gap between the training-dev set and the standard dev set reflects the different distributions.
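The comparison described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not a fixed recipe: the function names `carve_training_dev` and `diagnose_gaps` and all error values are hypothetical, and the error rates would normally come from evaluating a trained model on each set.

```python
import random
from typing import Dict, List, Tuple


def carve_training_dev(train_rows: List, n_train_dev: int, seed: int = 0) -> Tuple[List, List]:
    """Hold out n_train_dev rows of the training data as a training-dev set.

    Both parts come from the same distribution because they are a random
    split of the same data.
    """
    rows = list(train_rows)
    random.Random(seed).shuffle(rows)
    return rows[n_train_dev:], rows[:n_train_dev]


def diagnose_gaps(train_error: float, train_dev_error: float, dev_error: float) -> Dict:
    """Split the train-to-dev performance gap into two parts.

    variance_gap: training vs. training-dev (same distribution, unseen data)
    mismatch_gap: training-dev vs. dev (different distribution)
    """
    variance_gap = train_dev_error - train_error
    mismatch_gap = dev_error - train_dev_error
    dominant = "variance" if variance_gap >= mismatch_gap else "distribution mismatch"
    return {
        "variance_gap": variance_gap,
        "mismatch_gap": mismatch_gap,
        "dominant": dominant,
    }


# Hypothetical error rates, for illustration only.
print(diagnose_gaps(train_error=0.01, train_dev_error=0.09, dev_error=0.10))
print(diagnose_gaps(train_error=0.01, train_dev_error=0.015, dev_error=0.10))
```

In the first call most of the gap already appears on data from the training distribution, which points to variance. In the second call the error only jumps on the dev set, which points to the different distributions.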

Learning from multiple tasks

Here I will briefly introduce learning from multiple tasks by mentioning some topics of this vast field.

Transfer Learning

Reuse infrastructure and algorithms developed for similar problems and adjust them to the current case. This is especially useful if more data is available for similar problems than for the problem at hand.
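As a toy illustration of this idea, the following sketch pre-trains a one-dimensional linear model on a large "source" task and then fine-tunes it for a few steps on a small, related "target" task. Everything here is an assumption made for the example: the data is synthetic and noise-free, the model is plain gradient descent on the squared error, and in practice transfer learning usually means reusing a pre-trained neural network rather than two coefficients.

```python
from typing import List, Tuple


def mse(w: float, b: float, xs: List[float], ys: List[float]) -> float:
    """Mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs: List[float], ys: List[float], w: float = 0.0, b: float = 0.0,
                     steps: int = 5, lr: float = 0.1) -> Tuple[float, float]:
    """Run a fixed number of gradient steps on the MSE, starting from (w, b)."""
    n = len(xs)
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical data: a large source task and a small, similar target task.
source_x = [i / 49 for i in range(50)]
source_y = [2.0 * x + 1.0 for x in source_x]
target_x = [0.0, 0.5, 1.0]
target_y = [2.0 * x + 1.5 for x in target_x]

# Pre-train on the source task, then fine-tune briefly on the target task.
w_src, b_src = gradient_descent(source_x, source_y, steps=2000)
w_ft, b_ft = gradient_descent(target_x, target_y, w=w_src, b=b_src, steps=5)

# Baseline: the same small training budget on the target task, from scratch.
w_sc, b_sc = gradient_descent(target_x, target_y, steps=5)

finetuned_mse = mse(w_ft, b_ft, target_x, target_y)
scratch_mse = mse(w_sc, b_sc, target_x, target_y)
print("fine-tuned:", finetuned_mse, "from scratch:", scratch_mse)
```

With the same five gradient steps on the target data, the model that starts from the source-task solution ends up with a lower error than the model that starts from zero, because the two tasks are similar.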

Multi-task Learning

Try to learn tasks simultaneously:

  • E.g. if the task is to predict several classes at once, it could be converted to a multilabel or multivariate task.
  • It might be useful to use a single loss function for all tasks.
  • It only makes sense if the tasks are similar and if there is a connection between them.
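One common way to use a single loss function for a multilabel task is to average the binary cross-entropy over all labels. The sketch below assumes this choice; the function names and the example targets are hypothetical, and other ways of combining task losses (for example weighted sums) are equally possible.

```python
import math
from typing import List


def binary_cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """Cross-entropy for one binary label y with predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multilabel_loss(targets: List[List[int]], probs: List[List[float]]) -> float:
    """Single loss for all tasks: mean cross-entropy over examples and labels."""
    total = 0.0
    count = 0
    for target_row, prob_row in zip(targets, probs):
        for y, p in zip(target_row, prob_row):
            total += binary_cross_entropy(y, p)
            count += 1
    return total / count


# Hypothetical example: two observations, three labels predicted at once.
targets = [[1, 0, 1], [0, 0, 1]]
uninformative = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
confident = [[0.9, 0.1, 0.9], [0.1, 0.1, 0.9]]

print(multilabel_loss(targets, uninformative))
print(multilabel_loss(targets, confident))
```

Because all labels contribute to one number, a single model can be optimized for all tasks simultaneously.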

End-to-end Learning

End-to-end learning is the modeling of the whole pipeline with just one model:

  • The alternative is to divide the task into separate modeling steps
  • Example: Identification of persons with a camera. This task can be divided into two steps:
    1. Identify the face in the picture and zoom in on it
    2. Use an algorithm to identify the person from the zoomed picture

      → In this case the single tasks are much easier to learn than an algorithm that learns everything at once
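The structure of such a two-step pipeline can be sketched with a deliberately simple toy. All names and data are hypothetical: a "picture" is a small grid of numbers, "finding the face" is cropping to the non-zero pixels, and "identification" is an exact match against known faces. Real systems would use learned models for both steps; the point is only how the steps are chained.

```python
from typing import Dict, List, Optional

Image = List[List[int]]


def crop_to_face(image: Image) -> Image:
    """Step 1 stand-in: crop the picture to the bounding box of non-zero pixels."""
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        return []  # nothing found in the picture
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def identify(face: Image, known_faces: Dict[str, Image]) -> Optional[str]:
    """Step 2 stand-in: compare the zoomed picture with the known faces."""
    for name, template in known_faces.items():
        if face == template:
            return name
    return None


def recognize_person(image: Image, known_faces: Dict[str, Image]) -> Optional[str]:
    """The pipeline: the output of step 1 is the input of step 2."""
    return identify(crop_to_face(image), known_faces)


# Hypothetical data for illustration.
known = {"alice": [[1, 2], [3, 4]], "bob": [[5, 5], [5, 5]]}
photo = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
print(recognize_person(photo, known))
```

An end-to-end model would replace `recognize_person` with a single learned function from the raw picture to the name, without the explicit cropping step in between.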

Advantages and Disadvantages of end-to-end learning

Advantages:

  • Let the data speak
  • No manual adjustment of the modeling design is necessary

Disadvantages:

  • Usually more data is necessary
  • External knowledge (not available in the data) cannot be incorporated
  • Manually designed pipelines/features, by contrast, can possibly incorporate this knowledge

The strategies and methods mentioned here are applicable to many machine learning problems, to classical statistical problems as well as to complex deep learning pipelines.

The end

This blog post is partly based on information from a tutorial about deep learning that I took recently. Hence, a lot of credit for this post goes to Andrew Ng, who held this tutorial.

Feel free to post your questions and comments below.
