Data Mining the California Solar Statistics with R: Part I

April 24, 2015

(This article was first published on R – Beyond Maxwell, and kindly contributed to R-bloggers)

Data Mining the California Solar Statistics with R: Part I


Today I’m taking a look at the data set available from California Solar Statistics availalbe from This data set lists all the applications for state incentives for both residential and commercial systems, it contains information about the PV (Photovoltaic) system size, location, cost, incentive amount, panel and inverter manufacturer and a lot more, if you want more details check out their web page. I’m interested in taking a look at this data set and seeing what we can learn about residential solar installations in CA.

Getting Started – Cleaning the data

The working data sets have already been screened for input errors but the data still isn’t in the format I am looking for, I would like to see the total installed kW of solar by county by year. Additionally, the input which contains the install county has inconsistent nomenclature. For example, some values for county will say “Los Angles County” while others will just say “Los Angeles”, this is a problem because it will create more counties than exist when I try to group by county. Also, the data currently contains canceled applications, I’m only interested in taking a look at installed systems. In this first block of code I’m going to get the data in a more usable format for my purposes.

##First I'll load the packages I plan on using 
##load data, this may take a min
solarData=read.csv(file = "WorkingDataSet_4-15-2015.csv")
## We're interested in the residential data
solarData=subset(solarData,Host.Customer.Sector == "Residential")
## same labels are redundant, remove "county" to avoid this using gsub command
solarData$Host.Customer.Physical.Address.County=gsub( " County","",solarData$Host.Customer.Physical.Address.County)
##There are two instances where no county was listed, I am removing them here
solarData=solarData[solarData$Host.Customer.Physical.Address.County != "",]
#remove cancelled applications, we're insteresed installed systems
solarData=solarData[solarData$First.Cancelled.Date == "",]
#only keep installed applications, if people have filed to receieve their incentives, I am counting it as installed
solarData=solarData[solarData$First.Incentive.Claim.Request.Review.Date != "",]
#extract install year and month variables
##using a package called lubridate with the functions year and month to extract year and month as variables form the data

Next I want to group the data by county and year. The plyr package is great for aggregating data like this, it can be achieved with just a few lines of code. For more about the plyr package see

#Get data by county
countyData = ddply(solarData, .(tolower(Host.Customer.Physical.Address.County),year ),summarize,
      Systems = length(na.omit(year)),
      Total.Size = sum(na.omit(CSI.Rating)))  
##rename column name to "County" from "Host.Customer.Physical.Address.County" this will be important when I want to merge that data set with another one
colnames(countyData)[1] ="County"

Now that I’ve got the data set in the format I’m looking for, I’ll use the gglpot2 package to load a county map of CA. Later We’ll be able to plot the solar install data we have per county on this map. I create a variable called “sort” so that after I merge this data set with the solar data I can put it back in the original order from my CA data frame. The merge function doesn’t keep the order of either your data frames. This is important because I won’t be able to plot the data by county correctly if it isn’t in the same order that it came from the ggplot package in.

Loading the California Map from ggplot2

##create a sort variable so that after we merge the data sets, we can put in proper order to plot
for (i in 1:nrow(CA)){
colnames(CA)[6] ="County"

I also need to subset the data so that each year is it’s own data set if I want to compare state wide solar install plots by year. To do this I’m merging the county labels from my CA data frame with the county labels from my countyData data frame.

countyData2007=subset(countyData,year == 2007)
CA2007 = merge(CA,countyData2007,all.x=TRUE,sort=FALSE, by="County")

countyData2008=subset(countyData,year == 2008)
CA2008 = merge(CA,countyData2008,all.x=TRUE,sort=FALSE, by="County")

countyData2009=subset(countyData,year == 2009)
CA2009 = merge(CA,countyData2009,all.x=TRUE,sort=FALSE, by="County")

countyData2010=subset(countyData,year == 2010)
CA2010 = merge(CA,countyData2010,all.x=TRUE,sort=FALSE, by="County")

countyData2011=subset(countyData,year == 2011)
CA2011 = merge(CA,countyData2011,all.x=TRUE,sort=FALSE, by="County")

countyData2012=subset(countyData,year == 2012)
CA2012 = merge(CA,countyData2012,all.x=TRUE,sort=FALSE, by="County")

countyData2013=subset(countyData,year == 2013)
CA2013 = merge(CA,countyData2013,all.x=TRUE,sort=FALSE, by="County")

countyData2014=subset(countyData,year == 2014)
CA2014 = merge(CA,countyData2014,all.x=TRUE,sort=FALSE, by="County")

##Data set for the installs statewide

Plotting the data with ggplot2 and grid

Next I want to visualize the data, for this I’ll use two great packages, ggplot2 along with the viewport function of the grid package which makes it easy to make very flexible panel plots.

## create countywide plots for 2007, 2010 and 2013
p2007 = ggplot(CA2007 ,aes(x = long, y = lat,group=group,fill=Total.Size)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Installed nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+
  ggtitle("2007 by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))

p2010 = ggplot(CA2010 ,aes(x = long, y = lat,group=group,fill=Total.Size)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Installed nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+
  ggtitle("2010 by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))

p2013 = ggplot(CA2013 ,aes(x = long, y = lat,group=group,fill=Total.Size)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Installed nkW/year",colours = brewer.pal(8,"YlOrRd"),limits=c(0,max(countyData$Total.Size)))+
  ggtitle("2013 by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))

## create statewide plot
pCA = ggplot(CA_comb_data,aes(x=year,y=CAtotal/1000))+
  ylab("CA annually installedn residential solar (MW)")

Below is the code required to generate my plot

##create a new page for plot
##tell viewport that you want 2 rows x 3 cols
pushViewport(viewport(layout = grid.layout(2, 3)))
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
##specify where each plot should be located
print(p2007, vp = vplayout(1, 1))
print(p2010, vp = vplayout(1, 2))
print(p2013, vp = vplayout(1, 3))
##plots can take up more that one spot in your layout
print(pCA, vp=vplayout(2,1:3))


Looking at the plots by county it’s clear that the applications are more focused in southern CA. In a later post we’ll try and figure out how much of this is due to the high population in southern CA and how much is due to the increased solar insolation. One thing that lookes really strange/surprising to me is the drop in applications in 2014, during this period of time the cost of solar was dropping and the economy was improving. I reached out to the California Energy Comission to ask them about the drop in applications for incentives and they explained that while the CSI (California Solar Incentive) program was intended to last until 2016, the total budget for the program was starting to dry up in 2014 due the larger than anticipated growth in solar. So the drop in applications isn’t related to a decline in PV, just a lack of budget. It would be interesting to see how much the incentives running out changed the number of PV system installs but I haven’t been able to find a source for that data.

In my next post we’ll take a closer look at this data set and see what else we can learn.

To leave a comment for the author, please follow the link and comment on their blog: R – Beyond Maxwell. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)