Data Mining the California Solar Statistics with R: Part III

[This article was first published on R – Beyond Maxwell, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data Mining the California Solar Statistics with R: Part III

Today I want to combine the California solar statistics with information about the annual solar insolation in each county as well as information about the population and median income. These can then be used as predictors in the models I'll build in the next post. Information about the annual solar insolation can be found in a NASA data set, This data set lists the annual insolation incident on a horizontal surface in kWh/m2 /day for the entire globe at a 1 degree latitude/longitude precision. The population and median income data can be downloaded from the American Community Survey which is run by the census bureau,

Annual insolation in California by county

Ideally, I would like to average the annual solar insolation over the area of the county and assign that insolation to the county, however, the insolation data is much too sparse for that; not to mention that finding a way to average anything over those odd county shapes would be very difficult. What I settled on is to assign the annual solar insolation of each county's county seat to the entire county, this should capture most of the variation across the state. To find the latitude and longitude of each county seat I used the Google Maps API that is built into a great package called ggmaps. Once I have the latitude and longitude of each county seat then I just look up the average insolation at that latitude and longitude.


##load text file containing solar insolation data
avgAnnualInsolation=read.csv("nasaSolarData.txt",sep = " ", skip = 13 , header = T)
##remove monthly totals, only need lat,lon and annual insolation
avgAnnualInsolation = avgAnnualInsolation[,c(1,2,15)]
##subset the data so that only california area is included
avgAnnualInsolation = subset(avgAnnualInsolation, Lat > 32 & Lat < 43 & Lon > -125 & Lon < -113)

The following code is what I wrote to look up the latitdue and longitude using the Google Maps API

##I saved the CA county seats in this text file, read it into a variable
##remove "county" from the end of the county name
countySeats$County=tolower(gsub(" County", "" ,countySeats$County ))
##add latitide,longitude and annual solar insolation variables to the county seat data frame
##This for loop uses the geocode funciton of ggmaps to look up the latitude and longitude of each county seat
##then the instolation from the nasa data set at that lat,lon pair is assigned to the county
for ( counties in 1:nrow(countySeats)){
  seat =  geocode(paste(countySeats$Seat[counties],", CA"))
  countySeats$lat[counties] = seat$lat
  countySeats$lon[counties] = seat$lon
  countySeats$insolation[counties]=avgAnnualInsolation$Ann[avgAnnualInsolation$Lat == round(seat$lat) & 
                                                           avgAnnualInsolation$Lon == round(seat$lon)]
  print(paste(seat$lat," ",seat$lon,countySeats$insolation[counties]))
##save the data to csv file so that i only have to do this once

Now that I have the insolation per county I can merge that data with my California map and plot the results.

##load the data
##merge the insolation data with our state map from post 1
##plot the data
ggplot(CASUN ,aes(x = long, y = lat,group=group,fill=insolation)) +
  geom_polygon(colour = "white", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name=expression(paste("kWh/"*m^2*"/day")),colours = brewer.pal(8,"YlOrRd"),limits=c(min(CASUN$insolation),max(CASUN$insolation)))+
  ggtitle("Insolation incident on a horizontal surface")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ 
  coord_fixed(ratio = 1)

plot of chunk unnamed-chunk-3

As you would expect, southern California gets much more sun that northern California and the coast get's less sun than inland areas because the coast is often more cloudy.

Population and median income in California by county

As I mentioned before, this data is available from the American Community Survey. They have data for each county available from 2009-2013, the data sets from 2007 and 2008 did not include every county. The data was a bit tedious to get. I had to search around to find it and then download by year one year at a time and then combine them in an excel file. I'll put these files in my public dropbox folder so that anyone else interested doesn't have to go through the same tiresome exercise that I did.

##load population by county, data collected from american community survey -, data available form 2009-2013
pop2013=subset(pop,year == 2013)
###Susbset out 2013 as an example to plot
caPop2013 = merge(CA,pop2013,all.x=TRUE,sort=FALSE, by="County")
##must be sorted after merging for plot to work
##load income by county, data collected from american community survey -, data available form 2009-2013
income2013=subset(income,year == 2013)
caInc2013 = merge(CA,income2013,all.x=TRUE,sort=FALSE, by="County")
##create plots using grid and ggplot2 packages
caPopPlot = ggplot(caPop2013 ,aes(x = long, y = lat,group=group,fill=pop)) +
  geom_polygon(colour = "black", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Population",colours = brewer.pal(8,"Blues"),limits=c(min(caPop2013$pop),max(caPop2013$pop)))+
  ggtitle("2013 population by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ 
  coord_fixed(ratio = 1)

caIncPlot = ggplot(caInc2013 ,aes(x = long, y = lat,group=group,fill=income)) +
  geom_polygon(colour = "black", size = 0.1) +  
  theme_bw(base_size = 16)+
  scale_fill_gradientn(name="Mediann Income ($)",colours = brewer.pal(8,"Greens"),limits=c(min(caInc2013$income),max(caInc2013$income)))+
  ggtitle("2013 median income by county")+
  theme(axis.text.x = element_text( angle = 25,vjust=-0.1))+ 
  coord_fixed(ratio = 1)
##create a new page for plot
##tell viewport that you want 2 rows x 3 cols
pushViewport(viewport(layout = grid.layout(1, 2)))
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
##specify where each plot should be located
print(caPopPlot, vp = vplayout(1, 1))
print(caIncPlot, vp = vplayout(1, 2))

plot of chunk unnamed-chunk-4
I did not realize how concentrated the population in California is around Los Angeles until I saw this plot. Financially, the bay area has the highest median salary.Regions around Los Angeles, San Diego and Sacramento also have elevated incomes compared to the rest of the state.

The effective cost of solar by county

In my last post we looked at how the cost of solar decreased over the duration of this program. Now I would like to calculate what I am calling the “effective cost” which is the cost of the system minus the incentive received. I should also note that the owners of these system would have been available for the federal invesment tax credit (ITC) as well , but the incentive amounts for that are not included in this data set and are thus not inlcuded in my analysis.

##create new vairable for effective cost which subtracts incentive from total cost
##group installs by quarter and year, calculate average cost for each quarter by year
costByQuarter = ddply(solarData, .(year,quarter),summarise, cost=mean(Effective.Cost,na.rm=TRUE))


##use ddply to group install kW/county/quarter
installsByYearCountyQuarter=ddply(solarData, .(year, County, quarter),summarise, 
                                  Total.kW = sum(na.omit(CSI.Rating)))
##only include 2009-2013, because that is the range we have all the variables for
installsByYearCountyQuarter=subset(installsByYearCountyQuarter,year > 2008 & year < 2014)
##Merge our external data sets with the california solar statistics set 

Now that we've got our combined data set together i want to look for correlation between the variables


plot of chunk unnamed-chunk-6
Looking at the correlation plot, the trends seem to be what I would expect; total install kW of PV increases with increasing population, increasing insolation and decreasing cost. The trend between installed kW and income is not as clear as I would have guessed.

Wrapping up

Well, it's been a lot of work through these last 3 posts, but now I'm ready to perform some machine learning algorithms on this data set I've pulled together. I am eager to see how accurately I'll be able to predict the totaled installed residential PV per quarter using these predictors. Check in next time to see the results!


population by year/county –
median income by year/county –
CA county seats –
This R markdown file –

To leave a comment for the author, please follow the link and comment on their blog: R – Beyond Maxwell. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)