Machine Learning Strategy (Part 3)


In this third and last blog post about machine learning strategy, I will talk about the problems that arise when the training, development, and test sets have different distributions, and about learning from multiple tasks.

Different distributions of training and test/dev data sets

In many cases the training set and the test set have different distributions (the test set should resemble the final data the model will be applied to, and the dev set should ideally be similar to the test set). With different distributions of the training and dev/test sets, it is not clear where a difference in performance comes from:

→ Is it variance or the different distributions that lead to the different performances?

Solution: Create a training-dev set, which has the same distribution as the training set but is not used for fitting the model. Also keep a standard dev set with the same distribution as the test set. By comparing the performance on the training-dev set with the performance on the standard dev set, we can analyze whether the difference comes from the different distributions or from variance.
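
To make this concrete, here is a minimal R sketch with simulated data (the data-generating function and all variable names are hypothetical, not from the original post). The gap between the train and training-dev errors estimates variance; the gap between the training-dev and dev errors estimates the effect of the distribution shift.

```r
set.seed(1)

# Simulate two distributions: the training data (A) and the dev/test
# data (B), where B has a shifted covariate distribution.
make_data <- function(n, x_mean) {
  x <- rnorm(n, mean = x_mean)
  data.frame(x = x, y = 2 * x + rnorm(n))
}
train_all <- make_data(2000, x_mean = 0)  # distribution A
dev_set   <- make_data(500,  x_mean = 2)  # distribution B (like the test set)

# Carve a training-dev set out of the training data: same distribution
# as the training set, but held out from fitting.
idx       <- sample(nrow(train_all), 1800)
train     <- train_all[idx, ]
train_dev <- train_all[-idx, ]

fit  <- lm(y ~ x, data = train)
rmse <- function(data) sqrt(mean((data$y - predict(fit, data))^2))

c(train     = rmse(train),      # baseline error
  train_dev = rmse(train_dev),  # train vs. train-dev gap: variance
  dev       = rmse(dev_set))    # train-dev vs. dev gap: distribution shift
```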

Learning from multiple tasks

Here I will briefly introduce learning from multiple tasks by mentioning a few topics from this vast field.

Transfer Learning

Reuse architectures and models from similar problems and adjust them to the current case. This is especially useful if much more data is available for the similar problem than for the problem at hand; a short sketch follows below.
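
As an illustration (not part of the original post), here is a minimal sketch using the keras R package, assuming a TensorFlow backend; the input shape and the 5-class head are hypothetical. A network pre-trained on a data-rich task (ImageNet) is frozen, and only a small new head is trained on the current task.

```r
library(keras)  # assumes the keras R package with a TensorFlow backend

# Feature extractor pre-trained on a data-rich, related task (ImageNet).
base_model <- application_mobilenet_v2(weights = "imagenet",
                                       include_top = FALSE,
                                       input_shape = c(128, 128, 3))
freeze_weights(base_model)  # keep the transferred weights fixed

# New task-specific head for a hypothetical 5-class problem.
inputs  <- layer_input(shape = c(128, 128, 3))
outputs <- inputs %>%
  base_model() %>%
  layer_global_average_pooling_2d() %>%
  layer_dense(units = 5, activation = "softmax")

model <- keras_model(inputs, outputs)
model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy",
                  metrics = "accuracy")
# model %>% fit(x_small, y_small, epochs = 5)  # the small dataset at hand
```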

Multi-task Learning

Try to learn several tasks simultaneously with one model, so that the tasks can share a common representation; this tends to help when the tasks benefit from similar low-level features (see the sketch below).
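
As a minimal sketch (again with the keras R package; all layer sizes and task names are hypothetical), a network with one shared trunk and two task-specific output heads, trained on both tasks at once:

```r
library(keras)  # assumes the keras R package with a TensorFlow backend

# Hypothetical multi-task sketch: one shared trunk, two task-specific
# output heads, trained simultaneously.
inputs <- layer_input(shape = 20)
shared <- inputs %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 32, activation = "relu")

out_a <- shared %>% layer_dense(units = 1, activation = "sigmoid", name = "task_a")
out_b <- shared %>% layer_dense(units = 1, name = "task_b")

model <- keras_model(inputs, list(out_a, out_b))
model %>% compile(
  optimizer = "adam",
  loss = list(task_a = "binary_crossentropy", task_b = "mse")
)
# model %>% fit(x, list(task_a = y_class, task_b = y_reg), epochs = 10)
```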

End-to-end Learning

End-to-end learning is the modeling of the whole pipeline with just one model: instead of chaining several separately built processing steps, a single model maps the raw input directly to the final output (for example, from an audio recording directly to its transcript). A toy contrast is sketched below.
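
To make the contrast concrete, here is a toy R sketch with simulated data (not from the original post): a pipeline that models a single hand-crafted summary feature versus one model fit directly on all raw inputs.

```r
set.seed(2)
n <- 500; p <- 50
raw <- matrix(rnorm(n * p), n, p)          # raw input (e.g., a signal)
colnames(raw) <- paste0("x", seq_len(p))
y <- rowMeans(raw[, 1:10]) + rnorm(n, sd = 0.1)

# Pipeline: hand-designed feature extraction, then a simple model on it.
feature  <- rowMeans(raw)                  # hand-crafted summary feature
pipeline <- lm(y ~ feature)

# End-to-end: one model mapping the raw input directly to the target.
end_to_end <- lm(y ~ ., data = data.frame(y = y, raw))
```

The end-to-end fit can discover the relevant inputs itself, but it has to estimate many more parameters, which hints at why end-to-end approaches tend to need more data.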

Advantages and Disadvantages of end-to-end learning

Advantages:

- There is less need to hand-design intermediate features and pipeline components.
- The model can "let the data speak" and learn the best internal representation itself.

Disadvantages:

- Learning the whole pipeline at once typically requires a large amount of data.
- Potentially useful hand-designed components and domain knowledge are left out.

The strategies and methods mentioned here are applicable to many machine learning problems, to classical statistical problems as well as to complex deep learning pipelines.

The end

This blog post is partly based on information from a course about deep learning on coursera.org that I took recently. Hence, a lot of credit for this post goes to Andrew Ng, who taught that course.

Feel free to post your questions and comments below.
