library(ggplot2)

bachelors <- read.csv("bachelors.csv", header = TRUE)
GDP <- read.csv("gdppercapita.csv", header = TRUE)
new.data <- merge(bachelors, GDP, by = "State")
colnames(new.data) <- c("State", "Percent.Bachelors", "GDP.Per.Capita")
model <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = new.data)
First I imported the two datasets and merged them into one data frame, keyed on State. I then built a linear model using GDP per Capita as the response variable and Percent of the Population with Bachelors as the predictor.
The summary of this simple linear model is featured below:
> summary(model)

Call:
lm(formula = GDP.Per.Capita ~ Percent.Bachelors, data = new.data)

Residual standard error: 6923 on 48 degrees of freedom
Multiple R-squared: 0.3564, Adjusted R-squared: 0.343
F-statistic: 26.58 on 1 and 48 DF, p-value: 4.737e-06
As we can see, the p-value is very small (small enough for this model to be significant). However, the low R-Squared leaves much to be desired. Only about 36% of the variance in GDP per Capita is accounted for by our statistical model, and I question whether it is a good fit or not.
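As a quick sanity check on those numbers, here is a small sketch that recovers the R-Squared and Adjusted R-Squared from the F-statistic and its degrees of freedom. The inputs are copied from the summary output; the variable names are my own and are not part of the original analysis.

```r
# Values copied from the summary(model) output
f_stat <- 26.58   # F-statistic
df_num <- 1       # numerator degrees of freedom (one predictor)
df_den <- 48      # residual degrees of freedom

# For a linear model with an intercept: R^2 = F * df1 / (F * df1 + df2)
r2 <- (f_stat * df_num) / (f_stat * df_num + df_den)

# Adjusted R^2 penalises for the number of predictors
n <- df_num + df_den + 1   # number of observations
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_den

round(r2, 4)      # 0.3564
round(adj_r2, 3)  # 0.343
```

Both values agree with the summary, and n works out to 50, one observation per state.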
Nevertheless, we can see a pretty interesting graph below:
g <- ggplot(new.data, aes(x = Percent.Bachelors, y = GDP.Per.Capita)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("GDP Per Capita")
g <- g + geom_text(aes(label = State))
g
Aha! We can attribute some of our lack of fit to the outliers Wyoming, Alaska, and Delaware. With such a small dataset (50 states in the USA) 3 outliers can definitely have a strong impact on the fit.
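One way to gauge how much a few points can move the fit is to refit without them and compare R-Squared. The sketch below does this on made-up data, not the real state figures: the toy data frame, the state labels S1 to S50, and the three planted outliers are all hypothetical, and I did not run this comparison on the actual dataset.

```r
# Synthetic, deterministic stand-in for the merged data (NOT the real figures)
toy <- data.frame(
  State = paste0("S", 1:50),
  Percent.Bachelors = seq(20, 40, length.out = 50)
)
toy$GDP.Per.Capita <- 20000 + 1000 * toy$Percent.Bachelors + 500 * sin(1:50)

# Plant three hypothetical outliers well above the trend
flagged <- c("S10", "S25", "S40")
is_flagged <- toy$State %in% flagged
toy$GDP.Per.Capita[is_flagged] <- toy$GDP.Per.Capita[is_flagged] + 20000

# Fit once on everything and once with the flagged rows dropped
fit_all <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = toy)
fit_trimmed <- lm(GDP.Per.Capita ~ Percent.Bachelors, data = toy[!is_flagged, ])

r2_all <- summary(fit_all)$r.squared
r2_trimmed <- summary(fit_trimmed)$r.squared
c(all = r2_all, trimmed = r2_trimmed)
```

On the toy data the trimmed fit has a much higher R-Squared. Dropping real states would need a better justification than "they hurt the fit", so treat this as a diagnostic rather than a fix.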
I tried the Box-Cox procedure in R to see whether our fit might be improved if we transformed the response variable. The Box-Cox plot suggests that raising the response to the power of -1 (that is, modelling 1 divided by GDP per Capita) might be beneficial.
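For readers who want to try the same kind of check, here is a minimal sketch using boxcox() from the MASS package. That choice of function is an assumption on my part, since the code behind the plot is not shown here, and the data below is synthetic, built so that the reciprocal of y is roughly linear in x.

```r
library(MASS)  # provides boxcox()

# Synthetic data where 1 / y is roughly linear in x (NOT the real state data)
x <- seq(20, 40, length.out = 50)
y <- 1 / (4e-5 - 5e-7 * x + 1e-7 * sin(1:50))
toy <- data.frame(x = x, y = y)

# Keep y and the QR decomposition in the fit so boxcox() can reuse it directly
fit <- lm(y ~ x, data = toy, y = TRUE, qr = TRUE)

# Profile log-likelihood over a grid of lambda values;
# plotit = FALSE returns the grid instead of drawing the plot
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)

# The lambda with the highest log-likelihood is the suggested power
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

With plotit = TRUE (the default) the same call draws the likelihood curve with a confidence interval for lambda. A value near -1 points to the reciprocal transformation, a value near 0 to a log transformation, and a value near 1 to no transformation at all.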
This new model is outlined below:
model2 <- lm(GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)

> summary(model2)

Call:
lm(formula = GDP.Per.Capita^-1 ~ Percent.Bachelors, data = new.data)

Residual standard error: 2.967e-06 on 48 degrees of freedom
Multiple R-squared: 0.4199, Adjusted R-squared: 0.4079
F-statistic: 34.75 on 1 and 48 DF, p-value: 3.627e-07
The p-value of this new model is even lower, and the R-Squared value also improved slightly, from 0.3564 to 0.4199. Still, the model is not perfect and it would be unwise to claim that we have a strong linear relationship between the two variables.
Graphing this we get:
h <- ggplot(new.data, aes(x = Percent.Bachelors, y = GDP.Per.Capita^-1)) +
  xlab("Proportion of Population with Bachelor's or Higher") +
  ylab("1 divided by GDP Per Capita")
h <- h + geom_text(aes(label = State))
h
Once again we see the same 3 outliers, which definitely have a large impact on the fit with such a small dataset. Either way, this model seems to exhibit a somewhat stronger relationship between the two variables.
To leave a comment for the author, please follow the link and comment on their blog: Frank Portman.