March 8, 2011

(This article was first published on Back Side Smack » R Stuff, and kindly contributed to R-bloggers)

Two posts ago I mentioned the age-earnings profile but did not provide a regression of log earnings on age. I also offered, without evidence, that fitting a simple linear regression would be inappropriate. How do I know that? How could we determine the appropriateness of a regression? There are a number of econometric tests that check mechanically whether or not a specification is appropriate. We can probe the functional form with the Breusch-Pagan test (a story about which will be left for another time) or the White test. Strictly speaking, both of these tests are for heteroskedasticity, not the functional form. However, imagine a process where our model is:

• $y_i = \alpha + \beta X_i + \epsilon_i$

But the true process is:

• $y_i = \alpha + \beta X^2_i + \epsilon_i$

Our residuals (different from the errors!) from fitting the first model to data generated by the second will vary systematically with the $X$ term, just as though our errors were heteroskedastic, which is why a heteroskedasticity test can flag a wrong functional form. But for simple enough models, we can take a step back and eyeball the regression. If we fit a linear model to a quadratic or otherwise nonlinear relationship and plot the residuals against the $X$ term we should be able to see some shape emerge. If our model is well specified and the underlying process is linear, then the residuals will scatter evenly around zero at every value of the independent variable. If our model is mis-specified (as in our example above) the residuals might look like this:

[Figure: residual plot from the linear fit of log earnings on age. From AEP]
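
The same pattern can be reproduced on simulated data. The sketch below is not the post's dataset: the variable names, the coefficients and the noise level are all assumptions chosen for illustration. It fits a straight line to a concave "age-earnings" process, averages the residuals within bins of the regressor, and then computes a White-style statistic by hand (an auxiliary regression of the squared residuals on the regressor and its square).

```r
# Simulated illustration only: names and parameter values are assumptions,
# not taken from the post's data.
set.seed(42)
n <- 500
x <- runif(n, min = 20, max = 65)                       # an age-like regressor
y <- 0.5 + 0.12 * x - 0.0013 * x^2 + rnorm(n, sd = 0.1) # true process is quadratic

fit_linear <- lm(y ~ x)   # mis-specified model: linear in x
res <- resid(fit_linear)

# Average residual within five equal-width bins of x.
# A well-specified model would give bin means close to zero everywhere;
# here the middle bin sits above zero and the outer bins below it.
bins <- cut(x, breaks = seq(20, 65, length.out = 6), include.lowest = TRUE)
bin_means <- tapply(res, bins, mean)
print(round(bin_means, 3))

# White-style test by hand for a single regressor:
# regress squared residuals on x and x^2, then LM = n * R^2,
# compared against a chi-squared distribution with 2 degrees of freedom.
aux <- lm(I(res^2) ~ x + I(x^2))
white_stat <- n * summary(aux)$r.squared
white_pvalue <- pchisq(white_stat, df = 2, lower.tail = FALSE)
print(c(statistic = white_stat, p.value = white_pvalue))
```

The errors in this simulation are homoskedastic by construction, so a rejection here comes entirely from the wrong functional form. Note that an auxiliary regression on x alone would have little power in this setup, because the squared residuals are roughly symmetric around the middle of the x range.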

The above plot is easily recovered by plot(lm(log(eph) ~ age, data=adams)), a command which steps through a number of different diagnostic plots, the first of which shows residuals against fitted values. Let’s fit a local regression to the data and see what comes out.

[Figure: local regression fit of log earnings on age. From AEP]
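
For readers who want to try the comparison without the original data, here is a minimal sketch on simulated data. The data frame sim, its columns and the parameter values are assumptions made up for this example, and loess() is used as one possible local regression; the post does not say which smoother produced the figure.

```r
# Simulated stand-in for an age-earnings dataset (all values are assumptions).
set.seed(1)
n <- 400
age <- runif(n, min = 18, max = 65)
log_earn <- 1 + 0.12 * age - 0.0014 * age^2 + rnorm(n, sd = 0.25)
sim <- data.frame(age = age, log_earn = log_earn)

fit_lm <- lm(log_earn ~ age, data = sim)               # straight line
fit_lo <- loess(log_earn ~ age, data = sim,
                span = 0.75, degree = 2)               # local quadratic fit

# Compare in-sample residual sums of squares.
rss_lm <- sum(resid(fit_lm)^2)
rss_lo <- sum(resid(fit_lo)^2)
print(c(linear = rss_lm, loess = rss_lo))

# Evaluate the local fit on a grid inside the observed age range
# and find the age at which fitted log earnings peak.
grid <- data.frame(age = seq(20, 64, by = 1))
grid$fitted <- predict(fit_lo, newdata = grid)
peak_age <- grid$age[which.max(grid$fitted)]
print(peak_age)

# Overlay both fits on the scatter plot.
plot(sim$age, sim$log_earn, pch = 16, col = "grey60",
     xlab = "age", ylab = "log earnings")
abline(fit_lm, lty = 2)                                # linear fit, dashed
lines(grid$age, grid$fitted, lwd = 2)                  # loess fit, solid
```

The span argument controls how much of the data each local fit uses: smaller values follow the data more closely and larger values smooth more. The value 0.75 is simply the loess() default.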

We probably over-estimate the decline in earnings at older ages, but this is much better than our linear regression. Some causes of mis-estimation might be within our capacity to easily solve. I mentioned in the last post that a proper age-earnings profile would correctly code the ages of workers in the dataset, subtracting years of schooling from age. We might also talk about non-wage compensation and how that may increase over time. Further, we have dropped all the zeros from our dataset, which is pretty inappropriate. Correcting for entry and exit from the labor force may change the shape of our profile.
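
The age recoding mentioned above might look like the following sketch. The data frame and column names are hypothetical (the post does not show the dataset's variable names), and the second variant, which also subtracts six pre-school years, is the common "potential experience" convention rather than something stated in the post.

```r
# Hypothetical toy data: column names are illustrative only.
workers <- data.frame(
  age       = c(25, 40, 58),
  schooling = c(16, 12, 10)   # completed years of schooling
)

# As described in the text: age net of years of schooling.
workers$age_adj <- workers$age - workers$schooling

# Common variant (an assumption here): potential experience,
# which also subtracts 6 years for the time before school starts.
workers$potential_exp <- workers$age - workers$schooling - 6

print(workers)
```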

The code for the plots above isn’t included because it is basically two lines stemming immediately from the past post.
