Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This is the second part of the forecasting exercise, where I am looking at a multiple regression. To keep it simple I chose the states that boarder WI and the US unemployment information for the regression. Again this is a down and dirty analysis, I would not call this complete in any sense.

#getting the data for WI, US, IL, MI, MN, IO

#Using Quandl data, great little site

#merging the data into one dataframe
#I started with WI so that I could isolate the rest of the variables better
unemp<-merge(wi, io, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io')
unemp<-merge(unemp, mi, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi')
unemp<-merge(unemp, mn, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn')
unemp<-merge(unemp, us, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us')
unemp<-merge(unemp, il, by='Date'); colnames(unemp)<-c('Date', 'wi', 'io', 'mi', 'mn', 'us', 'il')

save(unemp, file='Unemployment WI.RData')

library(car)
library(lmtest)
library(forecast)

unemp.b<-unemp[1:436,]
unemp.p<-unemp[437:448,]

plot(unemp$Date, unemp$us, col='black',
main='Unemployment WI Region', type='l', xlab='Date',
ylab='Unemp. Rate', ylim=c(2, 17))
lines(unemp$Date, unemp$il, col='blue')
lines(unemp$Date, unemp$io, col='dark green')
lines(unemp$Date, unemp$mi, col='purple')
lines(unemp$Date, unemp$mn, col='gray')
lines(unemp$Date, unemp$wi, col='red')
leg.txt<-c('US', 'IL', 'IO', 'MI', 'MN', "WI")
leg.col<-c('black','blue', 'dark green', 'purple', 'gray', 'red')
legend('topleft', leg.txt, col=leg.col, pch=15)

summary(unemp)

#correlation matrix

cm<-cor(unemp[,-1])
cm

#From the graph and correlation matrix, everything
#is highly correlated, the lowest r value is 0.7879
#between Iowa and US.
#plots of each of the data, then each individual,
#everything looks really good, nice linear trends,
#although Michigan has quite bit of volatility
#compared to the other variables.

plot(unemp.b, pch='.')
model<-lm(wi~., data=unemp.b[,-1])
step.m<-step(model, direction='backward')
summary(step.m)

#testing for outliers
outlierTest(step.m)

# After conducting the outlier test, looks like it is michigan
#Sept 1980 where unemployment was 13.2 percent, point 57 is
#close at 12.9 those are the two biggest, but the outlierTest
#isolated row 56

dwtest(step.m)
dwtest(step.m, alternative='two.sided')
#yep lots of autocorrelation, so a multiple ARAIMA might be in order

pred<-unemp.p[3:7]
mult.f<-predict(step.m, newdata=pred, interval='prediction')
a1<-accuracy(mult.f, unemp.p$wi) #because the MASE is missing I needed to add the following to compare later a1<-c(a1,'0') names(a1)<-c("ME", "RMSE", "MAE", "MPE", "MAPE", "MASE") # ARIMA auto.r<-auto.arima(unemp.b$wi, xreg=unemp.b[,3:7])
auto.r
auto.f<-forecast(auto.r, xreg=unemp.p[,3:7], h=12)
a2<-accuracy(auto.f, unemp.p$wi) a2 #Just remember that the 0 is just a place holder r.table<-rbind(a1, a2) row.names(r.table)<-c('Multiple Regression', 'Reg. with ARIMA Errors') r.table At this point I combined the two accuracy tables to see which one was the best. b.table<-rbind(a.table[,1:6], r.table) b.table<-b.table[order(b.table$MAPE),]
b.table

The winner is: Regression with ARIMA Errors, for most of the analysis the simple mean like predictions dominated, even the multiple regression is considerably low on the list.