Data Mining the California Solar Statistics with R: Part V

June 8, 2015

(This article was first published on R – Beyond Maxwell, and kindly contributed to R-bloggers)

Building a Shiny App to explore the model and the data

About the Shiny App

In my previous post I built several models to predict the amount of residential solar installed per county per quarter as a function of solar insolation, the price of solar electricity, county population, and county median income. To explore the data and model predictions I’ve built a Shiny app in RStudio. I could not install RStudio using the hosting service that I have now, so if you want to check it out, you’ll need to head over to

The app lets you compare actual and predicted installations by year and quarter. I also included a bar plot of the same data shown on the map, which makes the comparison a bit easier to see, and an interactive scatter plot where you can see the effect of the different predictors on the total installed residential solar by county. Rather than post the code here for the Shiny App, the code is hosted on my GitHub at

Examining the effect of solar subsidies on the total amount of residential solar installed from 2009-2013

In the previous posts, I have been using the subsidized cost of solar installs as a predictor, but now I would like to predict how much residential solar would have been installed if no CA subsidies had been given. To do this I first need to create a variable for the actual upfront cost of solar paid.

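library(plyr)
##load previous data
##create variable for cost/watt
##CEC.PTC.Rating is assumed to be in kW, so it is multiplied by 1000 to get watts
##columns are referenced without solarData$ so the mean is taken within each year/quarter group
actualCostByQuarter = ddply(solarData, .(year, quarter), summarise, cost = mean(Total.Cost / (CEC.PTC.Rating * 1000), na.rm = TRUE))
##merge with data set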
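
As a rough illustration of the two steps above (average cost per watt by quarter, then merging it back onto the county-level data), here is a small sketch on made-up numbers. It uses base R's aggregate and merge rather than plyr so that it runs without extra packages, and the toy data frames (toyInstalls, toyCounty) and their values are hypothetical, not taken from the California Solar Statistics.

```r
# Toy stand-in for the installation records; all values are made up.
toyInstalls <- data.frame(
  year           = c(2009, 2009, 2009, 2010),
  quarter        = c(1, 1, 2, 1),
  Total.Cost     = c(40000, 30000, 24000, 35000),  # dollars
  CEC.PTC.Rating = c(5, 4, 4, 7)                   # assumed to be in kW
)

# Cost per watt for each installation (kW * 1000 = W).
toyInstalls$costPerWatt <- toyInstalls$Total.Cost / (toyInstalls$CEC.PTC.Rating * 1000)

# Mean cost per watt within each year/quarter group.
toyCostByQuarter <- aggregate(costPerWatt ~ year + quarter, data = toyInstalls, FUN = mean)
names(toyCostByQuarter)[names(toyCostByQuarter) == "costPerWatt"] <- "cost"

# Hypothetical county-by-quarter table that receives the cost predictor.
toyCounty <- data.frame(
  county  = c("A", "B", "A"),
  year    = c(2009, 2009, 2010),
  quarter = c(1, 2, 1)
)

# Attach the quarterly cost to every county row with a matching year and quarter.
toyMerged <- merge(toyCounty, toyCostByQuarter, by = c("year", "quarter"))
print(toyMerged)
```

Every county row in a given quarter receives the same cost value, which is what a statewide quarterly average implies.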

Now that I have the subsidy free cost as a predictor, I can use the random forest model to predict how much residential solar would have been installed if there had been no subsidies.

library(randomForest) ## needed for predict() on the random forest model
library(ggplot2)
noSubPreds = round(sum(predict(solarForest, newdata = installsByYearCountyQuarter))) ## predictions with no subsidies
SubPreds   = round(sum(solarForest$predicted)) ## predictions with subsidies
actual = round(sum(installsByYearCountyQuarter$Total.kW)) ## actual residential installs from 2009-2013
TotalSolarInstalled = c(noSubPreds,SubPreds,actual)
type = c('Without CA subsidies (predicted)','With CA subsidies (predicted)','With CA subsidies (actual)') 
finale = data.frame(TotalSolarInstalled,type) 
ggplot(finale,aes(type,TotalSolarInstalled))+geom_bar(stat="identity")+theme_bw()+ylab('Total residential solar installed from 2009-2013 (kW)') +
 geom_text(aes(label = TotalSolarInstalled), vjust=-0.5, position = position_dodge(0.9), size = 4) ##add labels


This bar chart really surprised me. The random forest model predicts that without CA subsidies about 25% less residential solar would have been installed between 2009 and 2013. That is quite a dramatic effect. Judging by my analysis, the Go Solar California program has helped California take a big step towards a future that is less dependent on fossil fuels.
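
The comparison behind the chart can be sketched on toy data: fit a model, predict once on the observed predictors and once on a copy where only the cost is changed, then compare the totals. This is only an illustration under stated assumptions. It swaps in a plain linear model (lm) for the random forest so it runs in base R, and the data, the variable names (toy, toyNoSub) and the $1/W cost increase are all made up.

```r
set.seed(1)

# Made-up county-quarter data: cost in $/W, insolation in arbitrary units.
toy <- data.frame(
  cost       = runif(50, 3, 8),
  insolation = runif(50, 4, 6)
)
toy$Total.kW <- 500 - 40 * toy$cost + 30 * toy$insolation + rnorm(50, sd = 10)

# Stand-in model (linear regression, not the random forest used in the post).
toyModel <- lm(Total.kW ~ cost + insolation, data = toy)

# Counterfactual data: identical rows, but with a higher, unsubsidized cost.
# The $1/W increase is a hypothetical value chosen for illustration.
toyNoSub <- toy
toyNoSub$cost <- toyNoSub$cost + 1

# Total predicted installs under each scenario.
withSub    <- sum(predict(toyModel, newdata = toy))
withoutSub <- sum(predict(toyModel, newdata = toyNoSub))

# Relative change in predicted installs when the subsidy is removed.
pctChange <- 100 * (withoutSub - withSub) / withSub
print(round(c(withSub = withSub, withoutSub = withoutSub, pctChange = pctChange), 1))
```

A negative pctChange means the model predicts fewer installs at the unsubsidized cost, which is the quantity the roughly 25% figure above refers to.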

Well, I think that about wraps things up for this project. If you have any ideas for analysis I didn’t think of, or any comments or questions about what I’ve done, please don’t hesitate to reach out to me.
