Data Mining the California Solar Statistics with R: Part II

[This article was first published on R – Beyond Maxwell, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data Mining the California Solar Statistics with R: Part II

In today's post I'll be working some more with the working data set from California Solar Statistics. Last time I imported the data, cleaned it up a bit, grouped it by county and year, and made some plots to look at how residential solar installations have been changing from year to year by county as well as statewide. Now I would like to look a bit closer at the data set and try to answer some further questions about the residential solar market in California. The questions I would like to answer using R with this data set are:
1. What companies are installing the most panels?
2. What brand solar panels are being installed the most?
3. What time of year are the panels being installed?
4. How has the cost of solar energy per watt changed during the time range in the data set?
5. What is the average incentive given to customers installing residential systems?

Question 1-What companies are installing the most panels?

First, I need to load the variables I created in my first post and load the necessary packages.

##Load the dataset that I made in Part I
##I really like the tableau palettes that are a part of the ggthemes package 

Now I need to group the systems by who installed them, then I will take the top 5 installers from each year and take a look at how many systems they have installed between 2007 and 2013. I actaully grouped the installers by their license # rather than their name, I did this because there are some input errors in the data, sometimes the installer name may be capitalized while other times it may not be and then sometimes it may say LLC at the end while others it doesn't, so grouping by license # helps me get around that problem. Then I assign the most frequently occuring name for each installer as the name that goes with the license #.

##use plyr package to aggregate data by installer's contractor #, i use contractor # instead of name because R is case sensitive and the
##contractor names arent put in the exact same way every time
installers=ddply(solarData,.(Contractor.License.Number, year),summarise, 

##sort the data by # of installs
##some installers names are very long, cut off to 20 chars
## Solar City was listed as 3 different names on different years, combine them all just to sat "SolarCity"
installers$installer[installers$installer=="SolarCity Corporatio" |installers$installer== "USB Solar City Owner" ]="SolarCity"
##want to find top 5 installers for each year from 2007-2013
##create a data frame to store the top 5 installers from each year, intialize data frame with 2007 data then row bind on all the rest of the years
topinstallers = head(installers[installers$year==2007,],n=5)
for (years in 2008:2013){
##plot the # of installs by top installers using ggplot

plot of chunk unnamed-chunk-2

These results are not too surprising, SolarCity has by far been installing more panels than any other installer. By 2013 they were installing about twice as many as the next nearest competitor. I suspected that SolarCity would be the leading installer on residences but I did not realize by how much.

Question 2-What brand solar panels are being installed the most?

The approach I took for this section is almost identical to question 1, mostly swapping out variable names.

##I can recycle the code I used to get the top installers by year to look at the top panels by year just by changing variable names
##aggreagate the data
panelsByYear=ddply(solarData,.(PV.Module.1.Manufacturer, year),summarise, installs=length(PV.Module.1.Manufacturer))
##sort it by # of installs
panelsByYear=panelsByYear[order(panelsByYear$year,panelsByYear$installs,decreasing = TRUE),]
##select the top 5 from each year
topPanels = head(panelsByYear[panelsByYear$year==2007,],n=5)
for (years in 2008:2013){
##plot the # of Panels installs by top module manufacturers
  scale_color_tableau("tableau20",name = "Module Manufacturer")    

plot of chunk unnamed-chunk-3

The plot above is for the number of projects the manufacturers panels are used in, not the total number of panels sold or total kW of panels sold. Though that would be interesting to look at…maybe in a future post. It is interesting that most years more homeowners choose Sunpower panels than any other brand. This could be because Sunpower sells the most efficient panels on the market, so if you have a small roof and want to generate as much power as possible, Sunpower is a good way to go. Also, it is fitting for Sunpower to get much of the California business since it is a spin-off from Stanford. I find it interesting that in 2013 Trina solar panels were used in more projects than any other manufacturer despite not being one of the top 5 manufacturers in the previous 6 years.

Question 3. What time of year are the panels being installed?

I decided to group the system installs by the quarter they were installed in rather than the month, so I need to create a new variable called quarter in my solarData data set. I assigned Jan, Feb, March as Winter; April, May, June as Spring and so on. Once the quarter variable exists, it is easy to aggregate and plot the data using ddply and ggplot.

##create a variable for the install quarter
##assign quarter based on month installed
solarData$quarter[solarData$month == 3 |solarData$month == 1 |solarData$month == 2 ] = "Winter"
solarData$quarter[solarData$month == 6 |solarData$month == 4 |solarData$month == 5 ] = "Spring"
solarData$quarter[solarData$month == 9 |solarData$month == 7 |solarData$month == 8 ] = "Summer"
solarData$quarter[solarData$month == 12 |solarData$month == 10 |solarData$month == 11 ] = "Autumn"
##Reorder the quarters, want chronological not alphabeitcal ordering when I plot
##aggregate number if installs by year and by quarter
installsByQuarter = ddply(solarData, .(year,quarter),summarize,
#plot the data
  theme_bw(base_size= 16)+

plot of chunk unnamed-chunk-4

I took the install date to be the date that the application was submitted for reimbursement, so there is probably some lag there, but since I group it by quarter most projects are grouped into the right quarter. I would have thought there may be a trend in the data where more projects are installed after tax refunds are received in the spring or perhaps after year end bonuses. Also, I would have thought Summer could have more installs because of the longer days. But looking at the above plot I can't find a consistent trend. So I thought I'd look at the overall projects installed by quarter instead of by year. The boxplot below shows this result, the median number of installs by quarter is very similar for each quarter.


plot of chunk unnamed-chunk-5

Question 4. How has the cost of solar energy per watt changed during the time range in the data set?

According to the go solar California web site,, customer owned system costs are often reported differently from business owned system costs in this data set. I have already refined the data set so that all the systems are on residences not businesses or power plants, but some of them are owned by third parties like SolarCity. So to make sure I am using accurate cost numbers, I will further refine the data set so that all systems are customer owned.

resOwned = subset(solarData,System.Owner.Sector == "Residential")
##create vector for cost
##calculate cost
##group installs by quarter and year, calculate average cost for each quarter by year
costByQuarter = ddply(resOwned, .(year,quarter),summarise, cost=mean(cost,na.rm=TRUE))
##plot the results
  ylab("cost ( $/W )")+
  theme_bw(base_size= 16)+

plot of chunk unnamed-chunk-6

To determine the $/W for these projects I took the to total cost reported and divided it by the alternating current output power of the module. This analysis does not take into account any changes in consumer price index over this time period. This is probably the most interesting results form this data set so far. By 2013 the cost of solar dropped to nearly 50% of the 2007 cost. This has been fueled by larger scale manufacturing, dramatically cheaper poly Si prices (the biggest single cost component of modules) and higher module efficiencies to name a few.

Question 5. What is the average incentive given to customers installing residential systems?

Since I have already created this subset, all I need to do is plot the results.

  theme_bw(base_size = 16)+
  xlab("Incentive Amount ( $ )")

plot of chunk unnamed-chunk-7

It looks like the majority of residential installs received about a $1000 incentive for their system but there is a long tail to this distribution of installs which received much larger incentives. It's surprising that there are many systems which have incentives in the $100,000 range. It's hard to believe that these could be residential systems. Maybe, I'll take a closer look at this later. Perhaps it will be more interesting to look at the incentives as a function of the total cost of the system. This can be seen in the plot below.

  theme_bw(base_size = 16)+
  xlab("Total Cost ( $ )")+
  ylab("Incentive Amount ( $ )")+
  scale_color_continuous_tableau(name="Nameplate Rating ( kW )",limits=c(0,25))+

plot of chunk unnamed-chunk-8

##grad all the systems where cost = incentive

There is a very large spread of incentives for a given system cost, but the general relationship is as you would expect, as the system cost increases so does the incentive amount. according to the California solar initiative incentive amounts are determined by system size, location, angle and tilt. There are several different incentive programs in this data, that could also help explain the large variation. One thing that surprised me when I created this figure was that for many systems the incentive amount was identical to the cost. In other words they were totally free! I took the subset of the data where the cost is the same as the incentive, and almost all of them were a part of a program called “Single-family Affordable Solar Homes” or SASH, this is a program for low-income single-family households which gives larger incentives than the other programs. The mean of the incentive amount is $5194 and the median incentive is $2433.

Wrapping up

Well, I have answered the questions that I was initially curious about and found a few surprises along the way. In my next post, I will merge this data set with some other publicly available data sets on population and income by county as well as annual solar insolation and fit a regression model to try to understand which factors affect the number of residential solar installations in California. I would love to hear any questions/comments anyone has about the data set and/or approach I took to analyzing the data.

To leave a comment for the author, please follow the link and comment on their blog: R – Beyond Maxwell. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)