Exploring GDP per Capita vs. Educational Attainment

January 9, 2013

(This article was first published on Frank Portman, and kindly contributed to R-bloggers)

The inspiration for this post came while I was browsing texts and articles about the USA's GDP and wondering which variables might have a positive relationship with GDP that would be interesting to graph and explore.

I stumbled upon two datasets: US States by Educational Attainment and US States by GDP. The data looked clean enough so I decided to write up a quick R program to see what I could find.

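library(ggplot2)

# Load the two state-level tables; both are keyed by a "State" column.
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)

# Inner join on the state name.
new.data <- merge(bachelors, GDP, by = "State")

# Rename the columns (assumes the merged frame has exactly these three, in this order).
colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")

# Simple linear regression of GDP per capita on percent with a bachelor's degree.
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)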

First I imported the two datasets and merged them into one data frame. I then built a linear model using GDP per capita as the response variable and the percentage of the population with a bachelor's degree as the predictor.
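One thing worth checking at this step: merge() performs an inner join by default, so a state whose name is spelled differently in the two files would be dropped without any warning. The sketch below shows one way to check for that. The two small data frames and their values are invented placeholders standing in for the CSV files, and the column names Pct and GDP are assumptions made for the illustration.

```r
# Synthetic stand-ins for the two CSV files (all values are made up).
bachelors <- data.frame(State = c("Alabama", "Alaska", "Arizona"),
                        Pct = c(22.0, 27.5, 26.3),
                        stringsAsFactors = FALSE)
GDP <- data.frame(State = c("Alabama", "Alaska", "Arkansas"),
                  GDP = c(36000, 61000, 35000),
                  stringsAsFactors = FALSE)

# Inner join: only states present in both tables survive.
merged <- merge(bachelors, GDP, by = "State")

# States that appear in one table but not the other.
only_in_bachelors <- setdiff(bachelors$State, GDP$State)
only_in_gdp <- setdiff(GDP$State, bachelors$State)

nrow(merged)        # 2
only_in_bachelors   # "Arizona"
only_in_gdp         # "Arkansas"
```

In the post itself, the 48 residual degrees of freedom reported in the model summary imply 50 rows, so all 50 states appear to have matched.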

The summary of this simple linear model is featured below:

> summary(model)

Call:
lm(formula = GDP.Per.Capita ~ Percent.Bachelors, data = new.data)

Residual standard error: 6923 on 48 degrees of freedom
Multiple R-squared: 0.3564,	Adjusted R-squared: 0.343
F-statistic: 26.58 on 1 and 48 DF,  p-value: 4.737e-06

As we can see, the p-value is very small (4.737e-06), so the relationship is statistically significant. However, the R-squared of 0.3564 leaves much to be desired: the model accounts for only about 36% of the variance in GDP per capita, and I question whether it is a good fit.
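For a regression with a single predictor, R-squared is simply the squared Pearson correlation between the two variables, so an R-squared of 0.3564 corresponds to a correlation of roughly 0.60. The sketch below checks that identity on simulated data. The variables x and y and every number in the simulation are made up for illustration and are not the state data.

```r
# Simulated data (not the state data): one predictor, one noisy response.
set.seed(1)
x <- runif(50, min = 15, max = 40)
y <- 30000 + 800 * x + rnorm(50, sd = 7000)

fit <- lm(y ~ x)

# R-squared as reported by summary().
r2 <- summary(fit)$r.squared

# The same quantity computed two other ways.
r2_from_cor <- cor(x, y)^2
rss <- sum(residuals(fit)^2)        # residual sum of squares
tss <- sum((y - mean(y))^2)         # total sum of squares
r2_from_ss <- 1 - rss / tss

all.equal(r2, r2_from_cor)  # TRUE
all.equal(r2, r2_from_ss)   # TRUE
```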

Nevertheless, we can see a pretty interesting graph below:

g <- ggplot(new.data, aes(x = Percent.Bachelors,
                          y = GDP.Per.Capita)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("GDP Per Capita")

g <- g + geom_text(aes(label = State))

g

Aha! We can attribute some of the lack of fit to the outliers Wyoming, Alaska, and Delaware. With such a small dataset (50 states), three outliers can have a strong impact on the fit.
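Outliers spotted by eye can also be checked numerically. One common option, which the original analysis did not use, is Cook's distance, which measures how much the fitted values change when a single observation is left out. Below is a minimal sketch on simulated data with one planted outlier. The data frame toy is hypothetical, and the 4/n cutoff is only a rule of thumb, not a formal test.

```r
# Simulated data with one deliberately planted outlier in row 1.
set.seed(42)
n <- 50
toy <- data.frame(x = runif(n, min = 15, max = 40))
toy$y <- 30000 + 800 * toy$x + rnorm(n, sd = 3000)
toy$x[1] <- 16
toy$y[1] <- 90000   # far above what the trend predicts at x = 16

fit <- lm(y ~ x, data = toy)

# Cook's distance for every observation.
d <- cooks.distance(fit)

# Flag observations above the 4/n rule-of-thumb cutoff.
flagged <- which(d > 4 / n)

# The most influential observation.
most_influential <- unname(which.max(d))
most_influential   # 1
```

With the real data, the same call on model would give one Cook's distance per state, which could then be matched against new.data$State.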

I ran a Box-Cox analysis in R to see whether the fit might improve if we transformed the response variable. The Box-Cox plot suggests that raising the response to the power of -1 (that is, taking its reciprocal) might be beneficial.
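The post does not show the Box-Cox call itself. One plausible way to run it is boxcox() from the MASS package, which profiles the log-likelihood over a grid of powers lambda. The sketch below uses simulated data, so the data frame toy, the lambda grid, and the simulated coefficients are all assumptions and not the author's code.

```r
library(MASS)  # provides boxcox()

# Simulated data in which the reciprocal of y is roughly linear in x.
set.seed(7)
toy <- data.frame(x = runif(50, min = 15, max = 40))
toy$y <- 1 / (3e-5 - 4e-7 * toy$x + rnorm(50, sd = 1e-6))

# Profile log-likelihood over a grid of candidate powers.
# plotit = FALSE returns the grid (x) and log-likelihoods (y) without plotting.
bc <- boxcox(y ~ x, data = toy, lambda = seq(-3, 3, by = 0.1), plotit = FALSE)

# The power with the highest log-likelihood on the grid.
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

A lambda near -1 points to the reciprocal transformation, near 0 to a log, and near 1 to no transformation. Setting plotit = TRUE draws the likelihood curve with a confidence interval, which is presumably the plot the paragraph above refers to.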

This new model is outlined below:

> model2 <- lm(GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)
> summary(model2)

Call:
lm(formula = GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)

Residual standard error: 2.967e-06 on 48 degrees of freedom
Multiple R-squared: 0.4199,	Adjusted R-squared: 0.4079
F-statistic: 34.75 on 1 and 48 DF,  p-value: 3.627e-07

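The p-value of this new model is even lower (3.627e-07) and the R-squared rose from 0.3564 to 0.4199. Keep in mind that the two R-squared values are measured on different response scales (GDP per capita versus its reciprocal), so they are not strictly comparable. Still, the model is not perfect and it would be unwise to claim that we have a strong linear relationship between the two variables.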
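One way to put two such models on an equal footing is to back-transform the reciprocal model's fitted values into the original units and compare prediction error there. The following is a rough sketch on simulated data. The data frame toy and the helper rmse() are hypothetical names introduced for the example, and inverting the fitted values is only an approximate back-transformation.

```r
# Simulated data (not the state data).
set.seed(3)
toy <- data.frame(x = runif(50, min = 15, max = 40))
toy$y <- 30000 + 800 * toy$x + rnorm(50, sd = 6000)

# Model on the raw response and model on the reciprocal response.
fit_raw <- lm(y ~ x, data = toy)
fit_inv <- lm(I(1 / y) ~ x, data = toy)

# Predictions in the original units of y.
pred_raw <- fitted(fit_raw)
pred_inv <- 1 / fitted(fit_inv)   # back-transform the reciprocal fit

# Root mean squared error, a hypothetical helper for this comparison.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

errors <- c(raw = rmse(toy$y, pred_raw), inverse = rmse(toy$y, pred_inv))
errors
```

Both numbers are in the same units as the response, so unlike the two R-squared values they can be compared directly.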

Graphing this we get:

h <- ggplot(new.data, aes(x = Percent.Bachelors,
                          y = GDP.Per.Capita^-1)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("1 divided by GDP Per Capita")

h <- h + geom_text(aes(label = State))

h

Once again we see the same three outliers, which have a large influence on the fit of a model built on so few observations. Either way, this model seems to show a somewhat stronger relationship between the two variables.
