Data Mining the California Solar Statistics with R: Part IV

[This article was first published on R – Beyond Maxwell, and kindly contributed to R-bloggers.]

Predicting residential solar power installations by county and quarter in CA from 2009-2013

So far I have gathered three data sets and combined them into one, which I will now use to try to predict the residential solar power installed by county and quarter in CA from 2009-2013. The three data sets are: the California Solar Statistics, a record of applications for state incentives for solar power installations; the American Community Survey, which provides the median income and population of each county; and the NASA Surface Meteorology and Solar Energy Release, which contains the average amount of sun shining on a given latitude and longitude. The dependent variable I would like to predict is the solar power installed in a given county per quarter in kW (named Total.kW). The predictors are as follows:
1. population (has yearly changes) called pop (#)
2. median income (has yearly changes) called income ($)
3. cost of solar (has quarterly changes) called cost ($)
4. solar insolation (no changes, annual average) called insolation (kWh/m²/day)

Getting Started

To predict the installed solar power, I will start with the simplest, most interpretable linear models and then work towards more complex models that increase prediction accuracy but make it harder to infer what the model is doing. Step 1 is always to load the packages I will be using and the .RData file that holds the data frames from the last several posts.

require(boot) ## needed for cross validation
require(gam) ##model for generalized additive models
require(splines) ##splines model, will use with gam
require(randomForest) ##random forest models
require(gbm) ##gradient boosted tree models
load("CaSolarIII.RData") ###Load data from last post

Individual Linear Models

To start out, I fit the installed solar power against each predictor individually to get an idea of how strongly the response depends on each one.

##fit predictors one at a time against total install of residential kW of PV/county/quarter
pop.fit = lm(Total.kW ~  pop , data = installsByYearCountyQuarter) ##fit against population
cost.fit = lm(Total.kW ~  cost , data = installsByYearCountyQuarter) ##fit against cost of PV system
income.fit = lm(Total.kW ~  income , data = installsByYearCountyQuarter) ##county median income
insolation.fit = lm(Total.kW ~  insolation , data = installsByYearCountyQuarter) ##solar insolation by county

Below you can see the fits of the response against each individual predictor. It is clear, and not surprising, that none of them alone makes an accurate predictor for this problem: the low R-squared values show that no single predictor explains more than 46% of the variance in the data.

par(mfrow=c(2,2)) ##set plot mode to a 2 x 2 grid
##plots of each fit
plot(Total.kW ~ pop , data = installsByYearCountyQuarter)
abline(pop.fit, col="red")
text(7000000,8000,paste("Rsq = ",round(summary(pop.fit)$r.squared,2)))
plot(Total.kW ~ cost , data = installsByYearCountyQuarter)
abline(cost.fit, col="red")
text(7.0,8000,paste("Rsq = ",round(summary(cost.fit)$r.squared,2)))
plot(Total.kW ~ income , data = installsByYearCountyQuarter)
abline(income.fit, col="red")
text(80000,8000,paste("Rsq = ",round(summary(income.fit)$r.squared,2)))
plot(Total.kW ~ insolation , data = installsByYearCountyQuarter)
abline(insolation.fit, col="red")
text(5.0,9500,paste("Rsq = ",round(summary(insolation.fit)$r.squared,2)))
legend(4.5,8000,c("data","fit"),pch=c(1,NA),lty=c(NA,1),col=c("black","red")) ## add legend to lower left plot

[Plot: Total.kW vs. each individual predictor with the fitted regression lines and R-squared values]
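If you prefer a table to reading the R-squared values off the plots, a small helper like the one below (reusing the four fits above) collects them in one place.

single.fits = list(pop = pop.fit, cost = cost.fit, income = income.fit, insolation = insolation.fit)
sapply(single.fits, function(f) round(summary(f)$r.squared, 2)) ##R-squared of each single-predictor fit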

Combined Linear Models

Next, let's fit two more models: a linear model that includes all of the predictors at once, and a second that includes all of the predictors as well as their pairwise interactions.

###fit linear model to data set
comb.fit = lm(Total.kW ~  pop + cost + income + insolation , data = installsByYearCountyQuarter)
summary(comb.fit)

## 
## Call:
## lm(formula = Total.kW ~ pop + cost + income + insolation, data = installsByYearCountyQuarter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3167.3  -270.6   -64.6   124.7  8289.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.807e+03  6.689e+02  -2.702  0.00702 ** 
## pop          4.182e-04  1.582e-05  26.432  < 2e-16 ***
## cost        -2.235e+02  2.537e+01  -8.809  < 2e-16 ***
## income       6.846e-03  1.700e-03   4.028 6.07e-05 ***
## insolation   6.149e+02  1.273e+02   4.831 1.57e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 696.6 on 988 degrees of freedom
## Multiple R-squared:  0.5183, Adjusted R-squared:  0.5164 
## F-statistic: 265.8 on 4 and 988 DF,  p-value: < 2.2e-16

Looking at the summary above, you can see that each of the predictors is significant. The combined model improves on the population-only model, raising the adjusted R-squared to 51.6%. Now let's see how much better we can do by adding the interaction terms.

comb.fit.w.int = lm(Total.kW ~  pop + cost + income + insolation +
                      pop:cost + pop:income + pop:insolation +
                      cost:income + cost:insolation + income:insolation, data = installsByYearCountyQuarter)
summary(comb.fit.w.int)

## 
## Call:
## lm(formula = Total.kW ~ pop + cost + income + insolation + pop:cost + 
##     pop:income + pop:insolation + cost:income + cost:insolation + 
##     income:insolation, data = installsByYearCountyQuarter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2217.2  -202.8   -21.0    75.3  7484.3 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.230e+04  4.769e+03  -2.579 0.010050 *  
## pop                1.159e-02  1.358e-03   8.534  < 2e-16 ***
## cost               2.506e+03  5.999e+02   4.177 3.22e-05 ***
## income            -2.218e-01  5.193e-02  -4.270 2.14e-05 ***
## insolation         2.791e+03  9.438e+02   2.957 0.003181 ** 
## pop:cost          -2.513e-04  1.468e-05 -17.119  < 2e-16 ***
## pop:income         3.639e-09  2.413e-09   1.508 0.131871    
## pop:insolation    -1.821e-03  2.404e-04  -7.573 8.41e-14 ***
## cost:income        4.027e-03  1.578e-03   2.552 0.010854 *  
## cost:insolation   -5.426e+02  1.182e+02  -4.592 4.97e-06 ***
## income:insolation  3.781e-02  1.015e-02   3.726 0.000206 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 566.5 on 982 degrees of freedom
## Multiple R-squared:  0.6833, Adjusted R-squared:  0.6801 
## F-statistic: 211.9 on 10 and 982 DF,  p-value: < 2.2e-16

#plot(comb.fit.w.int)
par(mfrow=c(2,2))
plot(Total.kW~pop,data=installsByYearCountyQuarter,col=rgb(0,0,0,0.3),pch=19)
points(fitted(comb.fit.w.int)~pop,data=installsByYearCountyQuarter,col=rgb(1,0,0,0.3),pch=19)
plot(Total.kW~cost,data=installsByYearCountyQuarter,col=rgb(0,0,0,0.3),pch=19)
points(fitted(comb.fit.w.int)~cost,data=installsByYearCountyQuarter,col=rgb(1,0,0,0.3),pch=19)
plot(Total.kW~income,data=installsByYearCountyQuarter,col=rgb(0,0,0,0.3),pch=19)
points(fitted(comb.fit.w.int)~income,data=installsByYearCountyQuarter,col=rgb(1,0,0,0.3),pch=19)
plot(Total.kW~insolation,data=installsByYearCountyQuarter,col=rgb(0,0,0,0.3),pch=19)
points(fitted(comb.fit.w.int)~insolation,data=installsByYearCountyQuarter,col=rgb(1,0,0,0.3),pch=19)
legend(4.5,8000,c("data","fit"),pch=c(19,19),col=c(rgb(0,0,0,0.3),rgb(1,0,0,0.3)))

[Plot: actual (black) and fitted (red) Total.kW vs. each predictor for the interaction model]
The interaction terms vary in their degree of significance, but all of them appear significant except for the interaction between population and income. This model improves the R-squared to 68%; later I will look at the 10-fold cross-validation error to see whether this improvement is just due to overfitting the training data. Looking at the predicted values next to the actual values on the plots above, you can tell the model is moving in the right direction.
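As an additional in-sample check (it says nothing about overfitting, which is what the cross-validation below is for), the two nested models can be compared directly with an F-test; a minimal sketch using the fits above:

anova(comb.fit, comb.fit.w.int) ##F-test: do the interaction terms add explanatory power on the training data?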

Generalized Additive Models

For the next step up in complexity, and hopefully accuracy, I am fitting the data with a generalized additive model that uses natural splines with four degrees of freedom for each predictor.

gam.comb.fit = lm(Total.kW ~  ns(pop,4) + ns(cost,4) + ns(income,4) + ns(insolation,4) , data = installsByYearCountyQuarter)
summary(gam.comb.fit)

## 
## Call:
## lm(formula = Total.kW ~ ns(pop, 4) + ns(cost, 4) + ns(income, 
##     4) + ns(insolation, 4), data = installsByYearCountyQuarter)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2124.4  -215.7    -0.2   175.3  7391.0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           31.08     188.80   0.165  0.86928    
## ns(pop, 4)1          224.98      73.70   3.053  0.00233 ** 
## ns(pop, 4)2         3958.79     171.19  23.125  < 2e-16 ***
## ns(pop, 4)3         3893.32     179.21  21.726  < 2e-16 ***
## ns(pop, 4)4         3162.82     155.95  20.281  < 2e-16 ***
## ns(cost, 4)1        -312.25     111.93  -2.790  0.00538 ** 
## ns(cost, 4)2        -441.49     105.38  -4.190 3.05e-05 ***
## ns(cost, 4)3        -550.15     188.12  -2.924  0.00353 ** 
## ns(cost, 4)4        -498.50      75.32  -6.618 5.99e-11 ***
## ns(income, 4)1       200.41     165.60   1.210  0.22649    
## ns(income, 4)2        83.57     142.85   0.585  0.55866    
## ns(income, 4)3        96.63     357.44   0.270  0.78697    
## ns(income, 4)4        67.98     126.75   0.536  0.59183    
## ns(insolation, 4)1  -103.04     179.52  -0.574  0.56612    
## ns(insolation, 4)2   323.45     159.79   2.024  0.04322 *  
## ns(insolation, 4)3   557.13     309.44   1.800  0.07210 .  
## ns(insolation, 4)4  -121.77     140.43  -0.867  0.38608    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 605.1 on 976 degrees of freedom
## Multiple R-squared:  0.641,  Adjusted R-squared:  0.6351 
## F-statistic: 108.9 on 16 and 976 DF,  p-value: < 2.2e-16

Adding this much flexibility appears to have reduced the significance of income and insolation; I suspect that at this point we are overfitting the data. The more flexible model shrinks the median residual, but the adjusted R-squared has actually decreased compared to the previous model because so many degrees of freedom were added. The gam package has a nice function for plotting the fit against each predictor along with its standard error; below is an example.

par(mfrow=c(2,2)) ##set plot mode to a 2 x 2 grid
plot.gam(gam.comb.fit,se=TRUE,col="red")

[Plot: plot.gam output showing the spline fit of Total.kW against each predictor with standard error bands]
Now let's check how accurate each of our models is so far using 10-fold cross-validation.

set.seed(72) ## set seed for reproducibility
cv.error=as.data.frame(matrix(data=NA, nrow = 7,  ncol=10)) ##data frame which will hold cv error for each model
for (i in 1:10){
  cv.pop.fit = glm(Total.kW ~  pop , data = installsByYearCountyQuarter)
  cv.error[1,i]=cv.glm(installsByYearCountyQuarter,cv.pop.fit,K=10)$delta[1]
  cv.cost.fit = glm(Total.kW ~  cost , data = installsByYearCountyQuarter)
  cv.error[2,i]=cv.glm(installsByYearCountyQuarter,cv.cost.fit,K=10)$delta[1]
  cv.income.fit = glm(Total.kW ~  income , data = installsByYearCountyQuarter)
  cv.error[3,i]=cv.glm(installsByYearCountyQuarter,cv.income.fit,K=10)$delta[1]
  cv.insolation.fit = glm(Total.kW ~  insolation , data = installsByYearCountyQuarter)
  cv.error[4,i]=cv.glm(installsByYearCountyQuarter,cv.insolation.fit,K=10)$delta[1]
  cv.comb.fit = glm(Total.kW ~  pop + cost + income + insolation , data = installsByYearCountyQuarter)
  cv.error[5,i]=cv.glm(installsByYearCountyQuarter,cv.comb.fit,K=10)$delta[1]
  cv.comb.fit.w.int = glm(Total.kW ~  pop + cost + income + insolation +
                            pop:cost + pop:income + pop:insolation +
                            cost:income + cost:insolation + income:insolation , data = installsByYearCountyQuarter)
  cv.error[6,i]=cv.glm(installsByYearCountyQuarter,cv.comb.fit.w.int,K=10)$delta[1] 
  cv.gam.comb.fit = glm(Total.kW ~  ns(pop,4) + ns(cost,4) + ns(income,4) + ns(insolation,4) , data = installsByYearCountyQuarter)
  cv.error[7,i]=cv.glm(installsByYearCountyQuarter,cv.gam.comb.fit,K=10)$delta[1] 
}
par(mfrow=c(1,1))
levels=c("population only","cost only","income only","insolation only","all 4 combined","all 4 +n interactions","all 4 combinedn-splines")
bp = barplot(rowMeans(cv.error),width=1.0,space=0.1,col="red",xlab="",ylab="Mean Squared Error")
text(bp, labels = levels, srt = 45, adj = c(1.0,1.0), xpd = TRUE, cex=0.9)

[Plot: bar chart of mean 10-fold cross-validation MSE for each model]
The cross-validated mean squared error results indicate that adding the interaction terms was not useful, but the GAM with splines did reduce the MSE relative to the plain linear model.
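The bar plot gives the ranking visually; printing the same numbers makes the comparison concrete. This just reuses the cv.error data frame and labels from the chunk above.

cv.summary = data.frame(model = levels, mean.MSE = rowMeans(cv.error)) ##average the 10 repetitions for each model
cv.summary[order(cv.summary$mean.MSE),] ##sort from best to worst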

Random Forests & Gradient Boosted Tree Models

Now let's look at an extremely flexible model: the random forest. I'm only going to use the four original predictors, not the interaction terms. Random forest models are easy to fit in R, as you can see below. Variable importance plots show which predictors reduce the error the most, which is useful since you don't have p-values as you do in a linear regression.

solarForest=randomForest(Total.kW ~ pop + cost + income + insolation , data = installsByYearCountyQuarter, mtry=2, ntree=5000,importance = TRUE)
varImpPlot(solarForest,type=1) ##variable importance plot

[Plot: random forest variable importance (%IncMSE) for the four predictors]
As in the linear model, population and cost are the most important predictors. In this example I chose to fit 5000 decision trees in the random forest, but looking at the plot below I probably could have gone with as few as 140 trees. The error on the y-axis of that plot is the mean squared error on the out-of-bag samples, so it is fair to compare it with the cross-validated MSE of the linear fits. The random forest achieves an MSE below 90,000, while the best previous model, the GAM with splines, only reached roughly 375,000: about a 4x improvement in MSE.
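To see where the out-of-bag error stops improving, the per-tree OOB MSE stored in the fitted object can be inspected directly. The 5% threshold below is an arbitrary cutoff used only for illustration, not anything built into randomForest.

plot(solarForest$mse, type="l", xlab="number of trees", ylab="OOB MSE") ##OOB error after each additional tree
min(solarForest$mse) ##lowest OOB MSE reached
which(solarForest$mse < 1.05*min(solarForest$mse))[1] ##first tree count within 5% of that minimum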

Another powerful tool based on decision trees is boosting, which is provided by the gbm package in R.

solarBoostedForest=gbm(Total.kW ~ pop + cost + income + insolation , data = installsByYearCountyQuarter, distribution="gaussian", n.trees=5000,interaction.depth = 2,shrinkage=.05, cv.folds = 10)
summary(solarBoostedForest) 

[Plot: relative influence of each predictor from the boosted tree model]

##                   var   rel.inf
## pop               pop 69.818739
## cost             cost 19.901584
## income         income  5.995389
## insolation insolation  4.284288

The boosted tree model has also picked up that population and cost are the most important predictors. Let's see how the random forest and the boosted model compare; keep in mind the boosted model's error is 10-fold cross-validation error while the random forest's is out-of-bag error.

plot(solarForest,main="")
points(solarBoostedForest$cv.error,col="red",pch=19)
legend(2000,160000,c("randomForest","gradient boosted trees"),lty=c(1,NA),pch=c(NA,19),col=c("black","red"))

[Plot: random forest OOB error vs. number of trees, with the boosted model's CV error overlaid]
For this example the random forest outperformed the boosted trees. However, I only spent a little time optimizing the shrinkage factor and interaction depth, so it is possible that with the right parameters the boosted trees could have done better.
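A more systematic way to look for better settings would be a small grid search over shrinkage and interaction depth, keeping the combination with the lowest cross-validated error. The grid values below are illustrative guesses, not tuned choices, and the loop refits gbm several times so it is slow.

grid = expand.grid(shrinkage = c(0.01, 0.05, 0.1), depth = c(1, 2, 4))
grid$min.cv.error = apply(grid, 1, function(g) {
  fit = gbm(Total.kW ~ pop + cost + income + insolation, data = installsByYearCountyQuarter,
            distribution = "gaussian", n.trees = 2000, interaction.depth = g["depth"],
            shrinkage = g["shrinkage"], cv.folds = 5)
  min(fit$cv.error) ##best 5-fold CV error for these settings
})
grid ##lowest CV error for each shrinkage/depth combination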

Finally, let's look at the residuals of our best fit (the random forest) plotted against the actual total solar power installed. The model does quite well when the installed power is less than roughly 1000 kW per county per quarter. As the installed power increases, the model becomes less accurate, likely because the data are much sparser in that range.

plot((installsByYearCountyQuarter$Total.kW - solarForest$predicted)~installsByYearCountyQuarter$Total.kW
     ,ylab="residuals",xlab="Total.kW",col=rgb(1,0,0,0.3),pch=19)

[Plot: random forest residuals vs. actual Total.kW]

Wrapping up

Today I've taken the merged data set and fit it with various linear models, with and without interaction terms and splines, and then with a random forest, which was a vast improvement over the other models. In my next post I'll use the random forest to estimate how much solar would have been installed over this same time span if the incentive program did not exist. I'd love to hear any questions or comments you may have about this project.
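To give a flavor of how that next step might look: once a counterfactual version of the data is built, predictions from the random forest on it can be compared with the actual totals. The cost adjustment below is purely hypothetical, a placeholder for the real no-incentive costs that will have to come from the California Solar Statistics incentive amounts.

noIncentive = installsByYearCountyQuarter
noIncentive$cost = noIncentive$cost * 1.2 ##hypothetical: assume systems cost ~20% more without the rebate
counterfactual.kW = predict(solarForest, newdata = noIncentive)
sum(counterfactual.kW) / sum(installsByYearCountyQuarter$Total.kW) ##ratio of predicted no-incentive installs to actual installs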
