The inspiration for this post came while I was browsing texts and articles about the USA's GDP. I wondered what might have a positive relationship with GDP that would be interesting to graph and explore.
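library(ggplot2)

# Load the two state-level datasets
bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)

# Join on state name and give the columns descriptive names
new.data <- merge(bachelors, GDP, by = "State")
colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")

# Simple linear regression: GDP per capita on percent with a bachelor's degree
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)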
First I imported the two datasets and merged them into one data frame. I then built a linear model using GDP per capita as the response variable and the percent of the population with a bachelor's degree as the predictor.
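A quick sanity check on the merge can be worthwhile, since `merge()` by default keeps only the rows whose key appears in both inputs. The sketch below uses two tiny made-up data frames (the names `bachelors_demo` and `gdp_demo` and every number in them are hypothetical, not the contents of the real CSV files) to show how a mismatched state name would silently drop a row.

```r
# Hypothetical toy data, for illustration only
bachelors_demo <- data.frame(
  State = c("Alabama", "Alaska", "Arizona"),
  Pct   = c(20, 25, 30)
)
gdp_demo <- data.frame(
  State = c("Alaska", "Arizona", "Arkansas"),
  GDP   = c(60000, 40000, 35000)
)

# Inner join on State: only states present in both frames survive
merged_demo <- merge(bachelors_demo, gdp_demo, by = "State")

# States that would be lost from each side of the join
only_in_bachelors <- setdiff(bachelors_demo$State, gdp_demo$State)
only_in_gdp       <- setdiff(gdp_demo$State, bachelors_demo$State)

print(merged_demo)
print(only_in_bachelors)
print(only_in_gdp)
```

On the real data, `nrow(new.data)` should come out to 50, which is consistent with the 48 residual degrees of freedom in the model summary (50 observations minus 2 estimated coefficients).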
The summary of this simple linear model is featured below:
> summary(model)

Call:
lm(formula = GDP.Per.Capita ~ Percent.Bachelors, data = new.data)

Residual standard error: 6923 on 48 degrees of freedom
Multiple R-squared: 0.3564, Adjusted R-squared: 0.343
F-statistic: 26.58 on 1 and 48 DF, p-value: 4.737e-06
As we can see, the p-value is very small (well below the conventional 0.05 cutoff, so the relationship is statistically significant). However, the low R-squared leaves much to be desired: only about 36% of the variance in GDP per capita is accounted for by the model, and I question whether it is a good fit.
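To make the R-squared figure more concrete: it is one minus the ratio of the residual sum of squares to the total sum of squares, and in a one-predictor model with an intercept it equals the squared correlation between the two variables. An R-squared of 0.3564 therefore corresponds to a correlation of about 0.60 in absolute value. The sketch below checks these identities on simulated data; the data frame `sim` and the coefficients used to generate it are assumptions chosen for illustration, not the real state figures.

```r
set.seed(1)

# Simulated stand-in for the 50 states (hypothetical values)
sim <- data.frame(Percent.Bachelors = runif(50, min = 18, max = 40))
sim$GDP.Per.Capita <- 20000 + 900 * sim$Percent.Bachelors + rnorm(50, sd = 7000)

fit <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = sim)

# R-squared from its definition: 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((sim$GDP.Per.Capita - mean(sim$GDP.Per.Capita))^2)
r2_manual <- 1 - rss / tss

# The same quantity as reported by summary(), and as a squared correlation
r2_summary <- summary(fit)$r.squared
r2_cor <- cor(sim$Percent.Bachelors, sim$GDP.Per.Capita)^2

print(c(manual = r2_manual, summary = r2_summary, squared.cor = r2_cor))
```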
Nevertheless, we can see a pretty interesting graph below:
g <- ggplot(new.data, aes(x = Percent.Bachelors, y = GDP.Per.Capita)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("GDP Per Capita")
g <- g + geom_text(aes(label = State))
g
Aha! We can attribute some of our lack of fit to the outliers Wyoming, Alaska, and Delaware. With such a small dataset (the 50 US states), three outliers can definitely have a strong impact on the fit.
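One way to put a number on how much individual points pull on the fit is Cook's distance, which base R provides as `cooks.distance()`. This is not something the analysis above actually ran; it is a minimal sketch on simulated data with three artificially inflated responses standing in for the outliers. The data frame `sim`, the state labels, and the `4 / n` cutoff (a common rule of thumb, not a formal test) are all assumptions.

```r
set.seed(2)

# Simulated stand-in data with hypothetical state labels S1..S50
sim <- data.frame(
  State = paste0("S", 1:50),
  Percent.Bachelors = runif(50, min = 18, max = 40)
)
sim$GDP.Per.Capita <- 20000 + 900 * sim$Percent.Bachelors + rnorm(50, sd = 4000)

# Inflate three responses to mimic unusually high-GDP states
sim$GDP.Per.Capita[1:3] <- sim$GDP.Per.Capita[1:3] + 25000

fit <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = sim)

# Cook's distance for every observation, flagged with the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- sim$State[cd > 4 / nrow(sim)]

# Refit without the flagged points and compare R-squared
sim_trimmed <- sim[!(sim$State %in% flagged), ]
fit_trimmed <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = sim_trimmed)

print(flagged)
print(c(all.points = summary(fit)$r.squared,
        trimmed    = summary(fit_trimmed)$r.squared))
```

Dropping real states just because they fit poorly would need a substantive justification, so a comparison like this is better read as a sensitivity check than as a corrected model.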
I ran the Box-Cox procedure in R to see whether the fit might be improved by transforming the response variable. The resulting Box-Cox plot suggests that raising the response to the power of -1 (that is, taking its reciprocal) might be beneficial.
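The exact call is not shown above, but a Box-Cox analysis of this kind is commonly done with `boxcox()` from the MASS package. Here is a minimal sketch on simulated data; the data frame `sim`, its generating coefficients, and the lambda grid are assumptions for illustration. Passing `plotit = FALSE` returns the profile log-likelihood instead of drawing it, so the best-supported power can be read off directly.

```r
library(MASS)  # provides boxcox()

set.seed(3)

# Simulated stand-in data (hypothetical); the response must be strictly positive
sim <- data.frame(Percent.Bachelors = runif(50, min = 18, max = 40))
sim$GDP.Per.Capita <- 1 / (4e-05 - 5e-07 * sim$Percent.Bachelors +
                             rnorm(50, sd = 3e-06))

# Keep the response and QR decomposition so boxcox() does not need to refit
fit <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = sim, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of candidate powers
bc <- boxcox(fit, lambda = seq(-3, 3, by = 0.05), plotit = FALSE)

# Power with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: powers within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

print(lambda_hat)
print(ci)
```

When the interval covers a convenient value such as -1, 0 (the log), or 0.5, it is usual to pick that value over the exact maximizer, which matches the choice of the reciprocal here.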
This new model is outlined below:
model2 <- lm(GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)
> summary(model2)

Call:
lm(formula = GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)

Residual standard error: 2.967e-06 on 48 degrees of freedom
Multiple R-squared: 0.4199, Adjusted R-squared: 0.4079
F-statistic: 34.75 on 1 and 48 DF, p-value: 3.627e-07
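The p-value of this new model is even lower, and the R-squared rose from about 0.36 to about 0.42, although the two figures are not strictly comparable because the response is now on a different scale. Still, the model is far from perfect, and it would be unwise to claim a strong linear relationship between the two variables.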
Graphing this we get:
h <- ggplot(new.data, aes(x = Percent.Bachelors, y = GDP.Per.Capita^-1)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("1 divided by GDP Per Capita")
h <- h + geom_text(aes(label = State))
h
Once again we see the same three outliers, which definitely have a large influence on a model fit to so few observations. Either way, this model seems to exhibit a somewhat stronger relationship between the two variables.
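Because the two models measure the response on different scales (dollars versus reciprocal dollars), one rough way to compare them on equal footing is to back-transform the second model's fitted values and compute an R-squared-style measure in the original units. The sketch below does this on simulated data; `sim`, its generating coefficients, and the helper `r2_original_scale()` are hypothetical names introduced for illustration, and the back-transformed figure is only a descriptive comparison, not the R-squared that `summary()` reports.

```r
set.seed(4)

# Simulated stand-in data (hypothetical), strictly positive response
sim <- data.frame(Percent.Bachelors = runif(50, min = 18, max = 40))
sim$GDP.Per.Capita <- 1 / (4e-05 - 5e-07 * sim$Percent.Bachelors +
                             rnorm(50, sd = 3e-06))

fit_raw <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = sim)
fit_inv <- lm(I(1 / GDP.Per.Capita) ~ Percent.Bachelors, data = sim)

# Share of variance in the original response explained by a set of predictions
r2_original_scale <- function(pred, y) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Predictions from the reciprocal model, converted back to dollars
pred_inv <- 1 / fitted(fit_inv)

r2_raw <- r2_original_scale(fitted(fit_raw), sim$GDP.Per.Capita)
r2_inv <- r2_original_scale(pred_inv, sim$GDP.Per.Capita)

print(c(raw.model = r2_raw, reciprocal.model = r2_inv))
```

Writing the transformed response as `I(1 / GDP.Per.Capita)` is equivalent to `GDP.Per.Capita^-1` on the left-hand side of the formula; it is used here only because it makes the intent a little easier to read.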