We NEED more data

[This article was first published on James Keirstead » Rstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

One of the historic difficulties of doing research on urban energy systems has been the limited availability of data at sufficiently detailed spatial resolutions. Without this data, you might end up relying on aggregate information about the built environment, building occupants, and local geography that doesn't apply to the specifics of a particular neighbourhood or street.

Fortunately things are gradually improving and the UK government has a major initiative in this area: NEED, the National Energy Efficiency Data Framework. The project website has recently been updated with a series of data sets, mainly consisting of cross-tabulated statistics comparing insulation, dwelling size, income, dwelling age, and other factors.

I haven't had a chance to do anything meaningful with this data, but I thought a good starting point would be to write some functions to extract the necessary information from the provided Excel spreadsheets. So the analysis is purely exploratory, examining electricity and gas assumption measured at the level of individual English local authorities and grouped by dwelling floor area. The full code for parsing the data and making the plots is available at the bottom of the post, but here’s the main result.

Energy consumption by floor area for English local authorities

As you would expect, both the mean and the variance of gas consumption increase with larger dwellings. In electricity consumption, the effect is less noticeable with less distinction between consumption levels in small dwellings.

Let's hope that DECC makes more NEED data available quickly.

Full code on Github:

## Analysis of DECC NEED data
## James Keirstead
## 22 November 2012
##
## Underlying data available from
##     http://www.decc.gov.uk/en/content/cms/statistics/energy_stats/en_effic_stats/need/need.aspx
## For more info, see
##     http://www.jameskeirstead.ca/blog/we-need-more-data/ 


## @knitr define-functions
## Define useful global variables
## ------------------------------

file_name <- "data/6951-local-authority-consumption-tables-2010-1.xls"
floor_area_labels <- c("50 or less", "51 to 100", "101 to 150", "151 to 200", "201 to 250", "Over 250")

### Define functions
## ------------------------------

## Remove regional totals
## 
## The spreadsheets contain subtotals by region and country.
## However since I want to analyse the information by local authority
## I first need to strip out these extra rows.

##    df  - the input data frame.  It should contain the local authority
##          names in the second column ("Col1")
remove_regional_totals <- function(df) {
  ## The regional names are kept in Col0 and therefore Col1 is blank
  tmp <- subset(df, !is.na(Col1))
  
  ## Remove whitespace from local authority names
  require(stringr)	  
  tmp <- mutate(tmp, Col1=str_trim(Col1))
  
  ## Return the result, dropping the now empty first column
  return(tmp[,-1])
}

## Extract columns
##
## The data are arranged in columns with discrete categories.
## This function will extract a generic set of columns and apply a new set
## of labels
##
##    df       - the input data frame with local authority name in first column
##    startCol - the number of the first column to be selected
##    labels   - the labels of the extracted columns, e.g. c("Apples", "Oranges", "Pears")
##    category - the generic category of the labels, e.g. "Fruit"
##    melt     - return the columns as a melted data frame
extract_columns <- function(df, startCol, labels, category, do_melt=TRUE) {
  ## Extract the columns and rename
  tmp <- df[, c(1, startCol:(startCol + length(labels) - 1))]
  names(tmp) <- c("LAU", labels)
  
  if (do_melt) {
    require(reshape2)
    tmp.m <- melt(tmp, id="LAU", variable.name=category)
	# Remove DECC's placeholder for confidential info
	require(stringr)
	tmp.m <- mutate(tmp.m, value=as.numeric(str_replace(value, "x", NA)))
	return(tmp.m)
  } else {
    return(tmp)
  }
}	
  

## Extract the number of dwellings in each local authority by floor area 
get_number_dwellings <- function(df) {
  tmp <- extract_columns(df, 3, floor_area_labels, "floor")
  names(tmp)[3] <- "number"
  return(tmp)
}

## Extract the mean energy consumption in each local authority by floor area
## Both the gas and electricity sheets position this information in the same column
get_mean_consumption <- function(df) {
  tmp <- extract_columns(df, 12, floor_area_labels, "floor")
  names(tmp)[3] <- "energy"
  return(tmp)
}

## A convenience function to assemble the overall data set
clean_data <- function(df) {
   tmp <- remove_regional_totals(df)
   mean_energy <- get_mean_consumption(tmp)
   dwellings <- get_number_dwellings(tmp)
   result <- merge(mean_energy, dwellings)
   return(result)
}


## @knitr run-analysis
## Run the analysis
##------------------------

## Read in the spreadsheet
## Start and end coordinates are by manual inspection
require(XLConnect)

## Process the gas data first
raw_gas_data <- readWorksheetFromFile(file_name, "LAG1", 
	startRow=10, startCol=which(LETTERS=="A"), 
	endRow=432, endCol=which(LETTERS=="S"))
gas_data <- clean_data(raw_gas_data)

## Then the electricity data
raw_elec_data <- readWorksheetFromFile(file_name, "LAE1", 
	startRow=10, startCol=which(LETTERS=="A"), 
	endRow=432, endCol=which(LETTERS=="S"))
elec_data <- clean_data(raw_elec_data)

## Merge the two data sets, renaming columns first
df.gas <- summarize(gas_data, LAU=LAU, floor=floor, gas=energy)
df.elec <- summarize(elec_data, LAU=LAU, floor=floor, elec=energy)
data <- merge(df.gas, df.elec)
data <- melt(data, id=c('LAU', 'floor'), variable.name='fuel')
data <- mutate(data, fuel=factor(fuel, lev=c('elec','gas'), labels=c('Electricity', 'Gas')))


## @knitr make-plots
require(ggplot2)
gg <- ggplot(data, aes(x=floor, y=value)) + 
  geom_boxplot() +
  facet_wrap(~ fuel, ncol=1, scale='free') +
  labs(x="Floor area (m2)", y="Mean annual energy consumption (kWh)") +  
  theme_bw() 
print(gg)


To leave a comment for the author, please follow the link and comment on their blog: James Keirstead » Rstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)