Exploring GDP per Capita vs. Educational Attainment

[This article was first published on Frank Portman, and kindly contributed to R-bloggers.]

The inspiration for this post came as I was browsing texts and articles about the USA's GDP and wondered which variables might have a positive relationship with GDP and be interesting to graph and explore.

I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough, so I decided to write a quick R program to see what I could find.

```r
library(ggplot2)

# Read the two state-level datasets
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)

# Join them on the shared "State" column
new.data <- merge(bachelors, GDP, by = "State")

colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")

# Simple linear regression: GDP per capita on percent with a bachelor's degree
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)
```

First I imported the two datasets and merged them into one data frame. I then built a linear model using GDP per capita as the response variable and the percentage of the population with a bachelor's degree as the predictor.
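The merge step is worth a quick sanity check, because `merge()` keeps only the states whose names match exactly in both files. Here is a minimal sketch using two small made-up data frames; the values and the `Pct` and `GDP` column names are invented for illustration and are not the original data:

```r
# Toy stand-ins for the two CSV files; all values are made up.
bachelors <- data.frame(State = c("Alaska", "Ohio", "Texas"),
                        Pct   = c(28.0, 26.1, 27.5),
                        stringsAsFactors = FALSE)
GDP <- data.frame(State = c("Alaska", "Ohio", "Utah"),
                  GDP   = c(61000, 40000, 38000),
                  stringsAsFactors = FALSE)

# Inner join (the default): only states present in both frames survive.
inner <- merge(bachelors, GDP, by = "State")

# Outer join: unmatched states are kept, with NA in the missing column.
outer <- merge(bachelors, GDP, by = "State", all = TRUE)

# States that the default merge would silently drop.
dropped <- setdiff(outer$State, inner$State)
print(dropped)
```

If `dropped` is non-empty on real data, the usual culprits are spelling differences or stray whitespace in the state names.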

The summary of this simple linear model is featured below:

```r
> summary(model)

Call:
lm(formula = GDP.Per.Capita ~ Percent.Bachelors, data = new.data)

Residual standard error: 6923 on 48 degrees of freedom
Multiple R-squared: 0.3564,	Adjusted R-squared: 0.343
F-statistic: 26.58 on 1 and 48 DF,  p-value: 4.737e-06
```

As we can see, the p-value is very small (4.7e-06), so the relationship is statistically significant. However, the low R-squared leaves much to be desired: only about 36% of the variance in GDP per capita is accounted for by the model, and I question whether it is a good fit.
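To make the two numbers in that summary less of a black box, here is a small sketch that recomputes them by hand. It uses simulated data rather than the states data, and the variable names are my own:

```r
set.seed(1)
# Simulated data; not the states dataset.
x <- runif(50, 20, 40)
y <- 20000 + 800 * x + rnorm(50, sd = 6000)

fit <- lm(y ~ x)
s <- summary(fit)

# R-squared is 1 - RSS / TSS: the share of variance the model accounts for.
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2_manual <- 1 - rss / tss

# The overall p-value comes from the F statistic and its degrees of freedom.
f <- s$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

print(c(reported = s$r.squared, manual = r2_manual))
print(unname(p_value))
```

The point is that the p-value and R-squared answer different questions: the first asks whether the slope is distinguishable from zero, the second how much of the spread the line explains. A model can easily score well on one and poorly on the other.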

Nevertheless, we can see a pretty interesting graph below:

```r
g <- ggplot(new.data, aes(x = Percent.Bachelors,
                          y = GDP.Per.Capita)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("GDP Per Capita")

# Draw each observation as its state name instead of a point
g <- g + geom_text(aes(label = State))

g
```

Aha! We can attribute some of the lack of fit to the outliers Wyoming, Alaska, and Delaware. With such a small dataset (50 US states), three outliers can have a strong impact on the fit.
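One way to back up an eyeball judgement about outliers is Cook's distance, which measures how much the fitted values move when a single observation is left out. The sketch below uses simulated data with three planted outliers (not the states data), and the 4/n cutoff is only a common rule of thumb, not something the analysis above relied on:

```r
set.seed(42)
# Simulated data with three planted outliers; not the states dataset.
x <- runif(50, 20, 40)
y <- 20000 + 800 * x + rnorm(50, sd = 3000)
y[1:3] <- y[1:3] + 40000   # push three points far above the trend

fit <- lm(y ~ x)

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# Flag observations above the 4 / n rule-of-thumb cutoff.
keep <- cd <= 4 / length(cd)
flagged <- which(!keep)
print(flagged)

# Refit without the flagged points to see how much the fit changes.
refit <- lm(y ~ x, subset = keep)
print(c(all = summary(fit)$r.squared, trimmed = summary(refit)$r.squared))
```

Dropping points just because they hurt the fit is hard to justify, so this is best treated as a diagnostic: it shows how sensitive the model is to a handful of states.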

I tried the Box-Cox procedure in R to see whether the fit might improve if we transformed the response variable. The Box-Cox plot suggests that raising the response to the power of -1 (that is, taking its reciprocal) might be beneficial.
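For readers who have not used it, a Box-Cox check can be sketched as follows. I am assuming `boxcox()` from the MASS package here, and the data are simulated so that the reciprocal of the response is linear in the predictor; none of the numbers come from the states data:

```r
library(MASS)  # provides boxcox(); MASS ships with standard R installations

set.seed(7)
# Simulated data in which 1/gdp is linear in pct; not the states dataset.
sim <- data.frame(pct = runif(50, 10, 40))
sim$gdp <- 1 / (6e-05 - 1.2e-06 * sim$pct + rnorm(50, sd = 1e-06))

# y = TRUE stores the response in the fit so boxcox() does not have to refit.
fit <- lm(gdp ~ pct, data = sim, y = TRUE)

# Profile log-likelihood over a grid of candidate powers (lambda).
bc <- boxcox(fit, lambda = seq(-3, 3, by = 0.1), plotit = FALSE)

# The lambda with the highest log-likelihood is the suggested power.
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

With `plotit = TRUE` (the default) the same call draws the log-likelihood curve with a confidence interval for lambda. A peak near -1 points to the reciprocal transformation, a peak near 0 to a log transformation, and a peak near 1 to leaving the response alone.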

This new model is outlined below:

```r
> model2 <- lm(GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)
> summary(model2)

Call:
lm(formula = GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)

Residual standard error: 2.967e-06 on 48 degrees of freedom
Multiple R-squared: 0.4199,	Adjusted R-squared: 0.4079
F-statistic: 34.75 on 1 and 48 DF,  p-value: 3.627e-07
```

The p-value of this new model is even lower, and R-squared improved slightly (from 0.36 to 0.42). Still, the model is far from perfect, and it would be unwise to claim a strong linear relationship between the two variables.
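One caveat when reading the two summaries side by side: the second R-squared is measured on the 1/GDP scale, so it is not strictly comparable with the first. A rough way to put both models on the same footing is to back-transform the reciprocal model's fitted values into GDP units and compute the explained share there. The sketch below does this on simulated data; the helper name `r2_original_scale` is my own:

```r
set.seed(3)
# Simulated data; not the states dataset.
sim <- data.frame(pct = runif(50, 20, 40))
sim$gdp <- 1 / (4e-05 - 5e-07 * sim$pct + rnorm(50, sd = 2e-06))

fit_raw <- lm(gdp ~ pct, data = sim)
fit_inv <- lm(I(1 / gdp) ~ pct, data = sim)

# Back-transform the reciprocal model's fitted values to GDP units.
pred_inv <- 1 / fitted(fit_inv)

# Hypothetical helper: share of variance in the observed response
# explained by a set of predictions on the same scale.
r2_original_scale <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

print(c(raw        = r2_original_scale(sim$gdp, fitted(fit_raw)),
        reciprocal = r2_original_scale(sim$gdp, pred_inv)))
```

For the untransformed model this reproduces the R-squared from `summary()`; for the reciprocal model it gives a figure that can be compared with it directly.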

Graphing this we get:

```r
h <- ggplot(new.data, aes(x = Percent.Bachelors,
                          y = GDP.Per.Capita^-1)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("1 divided by GDP Per Capita")

# Draw each observation as its state name instead of a point
h <- h + geom_text(aes(label = State))

h
```

Once again we see the same three outliers, which have a large impact on the fit for such a small dataset. Either way, this model seems to exhibit a somewhat stronger relationship between the two variables.
