Empirical Economics with R (Part A): The wine formula and machine learning
This semester I am teaching a new course: Empirical Economics with R. All material is available online, including video lectures with quizzes and interactive RTutor problem sets, which are perhaps the most important part of the course.
One goal when designing the course was to motivate most concepts with one or two main applications. The main application in chapter 1 is based on research by Orley Ashenfelter and co-authors, who use a small data set and a simple linear regression to develop a formula for the quality of Bordeaux red wines:
wine quality = 0.6160 * average temperature during growing season + 0.00117 * rainfall in preceding winter months - 0.00386 * rainfall in August (harvest month)
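To make the formula concrete, here is a minimal sketch in R that evaluates it for two made-up vintages. The coefficients are the ones quoted above; the helper `wine_quality()`, the column names and the weather numbers are hypothetical, and the units (degrees Celsius, millimetres of rain) are an assumption. Since the formula as quoted has no intercept, only differences between vintages are meaningful here.

```r
# Coefficients exactly as quoted in the formula above.
wine_coef <- c(temp = 0.6160, winter_rain = 0.00117, harvest_rain = -0.00386)

# Hypothetical helper: evaluates the linear formula for one or more vintages.
# Units are an assumption (temperature in degrees Celsius, rainfall in mm).
wine_quality <- function(temp, winter_rain, harvest_rain) {
  wine_coef[["temp"]] * temp +
    wine_coef[["winter_rain"]] * winter_rain +
    wine_coef[["harvest_rain"]] * harvest_rain
}

# Two made-up vintages (illustrative numbers, not real data):
# the first has a warmer growing season and a drier harvest month.
vintages <- data.frame(
  temp         = c(17.5, 16.5),
  winter_rain  = c(600, 600),
  harvest_rain = c(100, 200)
)

vintages$quality <- with(vintages, wine_quality(temp, winter_rain, harvest_rain))

# Difference in predicted quality between the two vintages.
quality_gap <- vintages$quality[1] - vintages$quality[2]

print(vintages)
print(quality_gap)
```

In this toy comparison, one extra degree of temperature and 100 mm less rain in the harvest month raise the predicted quality index by roughly one unit.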
I feel that this application nicely demonstrates an approach to empirical economic research that was more prevalent in earlier days. One has a small data set and wants to find a simple formula that describes a stable empirical relationship in the real world. Or perhaps that is just how I, as a young student, imagined economics would be. Unfortunately, it turns out that economics is a field that seems mostly devoid of simple, stable empirical relationships, at least if one takes natural laws as a benchmark. Still, you will see that the simple regression model for the wine formula has a surprisingly good in-sample fit. Its out-of-sample prediction accuracy, however, is only discussed qualitatively and is less clear.
The discussion of out-of-sample prediction accuracy then naturally leads us to chapter 2, which presents, as a contrast, the modern machine learning approach to prediction problems. We study a reasonably large data set of house prices and introduce sample splitting into a training and a test data set to systematically assess out-of-sample prediction accuracy. We also introduce regression trees, random forests and parameter tuning via k-fold cross-validation. The RTutor problem set also covers some strategies for dealing with missing values.
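The workflow of sample splitting and k-fold cross-validation can be sketched in a few lines of base R. This is only an illustration under loud assumptions: the data are simulated rather than the course's house price data, and to avoid extra packages the tuned parameter is the degree of a polynomial regression instead of a tree or random forest parameter. The variable names and the helper `rmse()` are hypothetical.

```r
set.seed(42)

# Simulated stand-in for a house price data set (NOT the course data).
n <- 500
houses <- data.frame(area = runif(n, 50, 250))
houses$price <- 50 + 2 * houses$area - 0.004 * houses$area^2 + rnorm(n, sd = 20)

# Root mean squared error as the measure of prediction accuracy.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 1. Sample splitting: 80% training data, 20% test data.
train_rows <- sample(n, size = round(0.8 * n))
train <- houses[train_rows, ]
test  <- houses[-train_rows, ]

# 2. k-fold cross-validation on the training data only.
#    Assumption: we tune the polynomial degree of a linear regression,
#    standing in for e.g. a tree depth or a random forest parameter.
k <- 5
fold <- sample(rep(1:k, length.out = nrow(train)))
degrees <- 1:4

cv_rmse <- sapply(degrees, function(d) {
  fold_rmse <- sapply(1:k, function(i) {
    fit <- lm(price ~ poly(area, d), data = train[fold != i, ])
    held_out <- train[fold == i, ]
    rmse(held_out$price, predict(fit, newdata = held_out))
  })
  mean(fold_rmse)
})
best_degree <- degrees[which.min(cv_rmse)]

# 3. Refit on the full training data with the chosen degree and
#    evaluate once on the untouched test data.
final_fit  <- lm(price ~ poly(area, best_degree), data = train)
train_rmse <- rmse(train$price, predict(final_fit, newdata = train))
test_rmse  <- rmse(test$price, predict(final_fit, newdata = test))

print(data.frame(degree = degrees, cv_rmse = cv_rmse))
print(c(best_degree = best_degree, train_rmse = train_rmse, test_rmse = test_rmse))
```

The key design choice is that the test data are never used for tuning: the parameter is chosen by cross-validation inside the training data, so the test RMSE remains an honest estimate of out-of-sample accuracy.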
The later chapters deal with strategies to estimate causal effects and corresponding applications and will be summarized in future posts.
You can find all material in the course’s GitHub repository. Take a look at the setup instructions if you want to solve the RTutor problem sets on your own computer.