Modeling Residential Electricity Usage with R

[This article was first published on Commodity Stat Arb, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.


Wow, I can’t believe it has been 11 months since my last blog posting!  The next series of postings will be related to the retail energy field.  Residential power usage is satisfying to model as it can be forecast fairly accurately with the right inputs.  Partly as a consequence of deregulation there is now more data more available than before.  As in prior postings I will use reproducible R code each step of the way.

For this posting I will be using data from Commonwealth Edison [ComEd] in Chicago, IL.  ComEd makes available historical usage data for different rate classes.  In this example I use rate class C23.

First we must download the data and fix column and header names:

library(xlsx)
library(downloader)

# load historical electric usage data from ComEd website
download(‘https://www.comed.com/Documents/customer-service/rates-pricing/retail-electricity-metering/Historical-Load-Profiles.xls’,’ComEd.xls’,mode=’wb’)
ComEd<-read.xlsx(file='ComEd.xls',sheetName='C23')

# edit row and column names
dimnames(ComEd)[2][[1]]<-{c('Date',1:24)}
dimnames(ComEd)[1][[1]]<-substr(ComEd[,1],5,15)
ComEd[,1]<-as.Date(substr(ComEd[,1],5,15),'%m/%d/%Y')

Next we hypothesize some explanatory variables.  Presumably electricity usage is influenced by day of the week, time of the year and the passage of time.  Therefore we construct a set of explanatory variables and use them to predict usage for a particular hour of the day [16] as follows:

# construct time related explanatory variables
times<-as.numeric(ComEd[,1]-ComEd[1,1])/365.25
weekdayInd<-!(weekdays(ComEd[,1])=="Saturday"|weekdays(ComEd[,1])=="Sunday")
timcols<-cbind(weekday=weekdayInd,
               Time=times,S1=sin(2*pi*times),C1=cos(2*pi*times),S2=sin(4*pi*times),C2=cos(4*pi*times),S3=sin(6*pi*times),C3=cos(6*pi*times))

lm1<-lm(ComEd[,16]~timcols)
summary(lm1)

While all the input variables are highly predictive, something seems to be missing in overall predictive accuracy:
Call:
lm(formula = ComEd[, 16] ~ timcols)

Residuals:
     Min       1Q   Median       3Q      Max
-1.11876 -0.16639 -0.01833  0.12886  1.64360

Coefficients:
               Estimate Std. Error t value Pr(>|t|)   
(Intercept)     1.30253    0.02577  50.545  < 2e-16 ***
timcolsweekday -0.08813    0.02226  -3.959 7.93e-05 ***
timcolsTime     0.06901    0.00959   7.196 1.03e-12 ***
timcolsS1       0.37507    0.01450  25.863  < 2e-16 ***
timcolsC1      -0.17110    0.01424 -12.016  < 2e-16 ***
timcolsS2      -0.23303    0.01425 -16.351  < 2e-16 ***
timcolsC2      -0.37622    0.01439 -26.136  < 2e-16 ***
timcolsS3      -0.05689    0.01434  -3.967 7.65e-05 ***
timcolsC3       0.13164    0.01424   9.246  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3679 on 1331 degrees of freedom
Multiple R-squared: 0.6122,    Adjusted R-squared: 0.6099
F-statistic: 262.7 on 8 and 1331 DF,  p-value: < 2.2e-16

A plot of the residuals clearly shows something else is going on:
plot(lm1$residuals)

As you may have already concluded there is seasonal heteroscedasticity in the data, beyond what we have fitted with the seasonal variables.

Let’s load some temperature data for the Chicago area and see how it relates to this data.
First we load the weather data from the NOAA database.  It is stored in annual fixed width files which we stitch together:

# Load weather data from NOAA ftp site
KORDtemps<-NULL
for (yearNum in 2009:2013)
{
  ftpString<-paste('ftp://ftp.ncdc.noaa.gov/pub/data/gsod/',yearNum,'/725300-94846-',yearNum,'.op.gz',sep='')
  fileString<-paste('KORD',yearNum,'.gz',sep='')
  download.file(ftpString,fileString,mode=’wb’)
  temp2<-read.fwf(fileString,c(6,6,10,8,3,8,3,8,3,8,3,7,3,7,3,7,7,8,1,7,1,6,1,6,8),
                  skip=1,
                  col.names=c(‘STN’,’WBAN’,’YEARMODA’,’TEMP’,’TEMPCOUNT’,’DEWP’,’DEWPCOUNT’,’SLP’,’SLPCOUNT’,’STP’,’STPCOUNT’,’VISIB’,’VISIBCOUNT’,’WDSP’,’WDSPCOUNT’,’MAXSPD’,’GUST’,’MAX’,’MAXFLAG’,’MIN’,’MINFLAG’,’PRCP’,’PRCPFLAG’,’SNDP’,’FRSHTT’))
  KORDtemps<-rbind(KORDtemps,temp2)
}

# Change missing data to NAs
KORDtemps[KORDtemps==9999.9]<-NA
KORDtemps<-cbind(date=as.Date(as.character(KORDtemps[,3]),'%Y%m%d'),KORDtemps)
head(KORDtemps)

# create cross reference date index
dateInd<-sapply(ComEd[,1], function(x) which(KORDtemps[,1]==x))



Next we plot the residuals from our regression vs. the prevailing max temperature for the day:
plot(KORDtemps[dateInd,”MAX”],lm1$residuals)

As expected there is something going on relating temperature to residuals.

In my next blog posting I will discuss approaches to fitting this weather data to the demand model…


To leave a comment for the author, please follow the link and comment on their blog: Commodity Stat Arb.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)