The inspiration for this post came while I was browsing articles about US GDP and wondering which variables might have a positive relationship with GDP that would be interesting to graph and explore.
I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough, so I decided to write a quick R program to see what I could find.

First I imported the two datasets and merged them into one data frame. I then built a linear model using GDP per capita as the response variable and the percentage of the population with a bachelor's degree as the predictor.
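A minimal sketch of the merge-and-fit steps just described. The two data frames here are simulated stand-ins (in practice they would come from something like `read.csv()`), and the column names `state`, `pct_bachelors`, and `gdp_per_capita` are my own assumptions, not the names used in the original datasets.

```r
# Simulated stand-ins for the two datasets; none of these numbers are real state data.
set.seed(42)
education <- data.frame(
  state = state.name,                           # built-in vector of the 50 state names
  pct_bachelors = runif(50, min = 18, max = 42)
)
gdp <- data.frame(
  state = state.name,
  gdp_per_capita = 30000 + 800 * education$pct_bachelors + rnorm(50, sd = 8000)
)
gdp <- gdp[sample(nrow(gdp)), ]                 # shuffle rows so the merge has to align by key

# Join the two tables on the shared key column
states <- merge(education, gdp, by = "state")

# Simple linear regression: GDP per capita explained by percent with a bachelor's degree
model <- lm(gdp_per_capita ~ pct_bachelors, data = states)
summary(model)
```

Merging on an explicit key, rather than binding columns side by side, keeps each state's two values together even when the source files list the states in different orders.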
The summary of this simple linear model is featured below:

As we can see, the p-value is very small, small enough for the model to be statistically significant. However, the low R-squared leaves much to be desired: very little of the variance in the data is accounted for by the model, and I question whether it is a good fit.
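The two numbers discussed here can be pulled out of a fitted model directly instead of being read off the printed summary. This is only a sketch on simulated data, with `x` and `y` as placeholder variables, so the values it produces say nothing about the real datasets.

```r
# Simulated placeholder data; not the real education or GDP figures.
set.seed(42)
x <- runif(50, min = 18, max = 42)
y <- 30000 + 800 * x + rnorm(50, sd = 8000)

fit <- lm(y ~ x)
s <- summary(fit)

# Share of the variance in y accounted for by the model
r_squared <- s$r.squared

# p-value of the slope's t-test, taken from the coefficient table
slope_p <- coef(s)["x", "Pr(>|t|)"]

# p-value of the overall F-test, rebuilt from the stored F statistic
f <- s$fstatistic
model_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
```

With a single predictor the slope test and the overall F-test give the same p-value, so a small p-value only says the slope is unlikely to be zero. It says nothing about how tightly the points cluster around the line, which is what R-squared measures.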
Nevertheless, we can see a pretty interesting graph below:

Aha! We can attribute some of the lack of fit to the outliers Wyoming, Alaska, and Delaware. With such a small dataset (only 50 states), three outliers can have a strong impact on the fit.
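One way to check how much a few points pull on a fit is Cook's distance. I did not do this in the original analysis, so treat the following as a sketch: the data are simulated, three outliers are planted by hand, the ids `s1` to `s50` are placeholders, and the `4 / n` cutoff is just a common rule of thumb.

```r
# Simulated data with three hand-planted outliers; not the real state figures.
set.seed(1)
toy <- data.frame(id = paste0("s", 1:50), x = runif(50, min = 18, max = 42))
toy$y <- 30000 + 800 * toy$x + rnorm(50, sd = 2000)
toy$y[1:3] <- toy$y[1:3] + 60000              # push three points far above the trend

fit <- lm(y ~ x, data = toy)

# Cook's distance measures how much the fitted values move when one point is dropped
cooks <- cooks.distance(fit)
flagged <- toy$id[cooks > 4 / nrow(toy)]      # rule-of-thumb threshold (an assumption)

# Refit without the flagged points and compare R-squared
fit_trimmed <- lm(y ~ x, data = toy[!(toy$id %in% flagged), ])
c(full = summary(fit)$r.squared, trimmed = summary(fit_trimmed)$r.squared)
```

Dropping points just because they hurt the fit is hard to justify, so a comparison like this is better used to gauge sensitivity than to pick the final model.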
I ran a Box-Cox analysis in R to see whether the fit might improve if we transformed the response variable. The Box-Cox plot suggests that raising the response to the power of -1 (that is, taking its reciprocal) might be beneficial.
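A sketch of how such a Box-Cox check can be run with `boxcox()` from the MASS package, assuming that is the function in play (the post only says "BoxCox"). The data are simulated so that the reciprocal of the response is roughly linear in the predictor, and the names `pct` and `gdp` are placeholders.

```r
library(MASS)   # provides boxcox()

# Simulated data: a strictly positive response whose reciprocal is roughly linear in pct
set.seed(7)
toy <- data.frame(pct = runif(50, min = 18, max = 42))
toy$gdp <- 1 / (4e-5 - 5e-7 * toy$pct + rnorm(50, sd = 2e-6))

# y = TRUE stores the response in the fit, so boxcox() does not need to refit the model
fit <- lm(gdp ~ pct, data = toy, y = TRUE)

# Profile log-likelihood over a grid of candidate powers (lambda)
bc <- boxcox(fit, lambda = seq(-3, 3, by = 0.05), plotit = FALSE)

# The lambda with the highest log-likelihood is the suggested power
best_lambda <- bc$x[which.max(bc$y)]
```

Setting `plotit = TRUE` instead draws the familiar curve with its confidence interval. A peak near -1 points to the reciprocal, near 0 to a log transform, and near 1 to leaving the response alone. Box-Cox requires a strictly positive response, which GDP per capita satisfies.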
This new model is outlined below:

The p-value of this new model is even lower, and the R-squared value also improved slightly. Still, the model is not perfect, and it would be unwise to claim a strong linear relationship between the two variables.
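For reference, a reciprocal-response model can be fitted by wrapping the transformation in `I()` inside the formula. This sketch again uses simulated data and the placeholder names `pct` and `gdp`, so its numbers are illustrative only.

```r
# Simulated data; not the real education or GDP figures.
set.seed(7)
toy <- data.frame(pct = runif(50, min = 18, max = 42))
toy$gdp <- 1 / (4e-5 - 5e-7 * toy$pct + rnorm(50, sd = 2e-6))

# Original scale versus reciprocal (power of -1) scale
fit_raw <- lm(gdp ~ pct, data = toy)
fit_inv <- lm(I(1 / gdp) ~ pct, data = toy)

# R-squared of each fit, side by side
comparison <- c(
  raw = summary(fit_raw)$r.squared,
  reciprocal = summary(fit_inv)$r.squared
)
comparison

# The slope changes sign: higher pct means higher gdp, hence lower 1 / gdp
coef(fit_inv)["pct"]
```

Two things are worth keeping in mind when reading such a comparison. A positive relationship on the original scale shows up as a negative slope on the reciprocal scale. And the two R-squared values describe variance on different response scales, so a small difference between them is only a rough guide.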
Graphing this we get:

Once again we see the same three outliers, which have a large influence on a model fitted to so few observations. Either way, this model seems to exhibit a somewhat stronger relationship between the two variables.