Processing Public Data with R


I use R aplenty in analysis and thought it might be worthwhile for some to see the typical process a relative newcomer goes through in extracting and analyzing public datasets.

In this instance I happen to be looking at Canadian air pollution statistics.

The data I am interested in is available on the Ontario Ministry of the Environment’s website. I have downloaded the hourly ozone readings from two weather stations (Grand Bend and Toronto West) for two years (2000 and 2011), which are available in several formats, including my preference, CSV. According to the Ministry’s 2010 annual report, the two stations selected represent the extremes in readings for that year.
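
For anyone who prefers to script the download step as well, base R’s download.file can fetch each file. This is a sketch only: the URL below is a placeholder, not the Ministry’s actual address.

# hypothetical URL, shown purely to illustrate the pattern
download.file("http://example.org/airquality/grandBendO32000.csv",
              destfile = "grandBendO32000.csv")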

I first set the directory in which the code and the associated data files will reside and then import the data. I would normally load any R packages I will utilize at the head of the script (if not already in my startup file) but will hold off here until they are put to use.
I had to do a small amount of row deletion in the CSV files so that only the readings data was included.

# set working directory
setwd("C:/Users/pssguy/Documents/R/societal/pollution")
 
# create and combine dataframes for each dataset
# The read.csv function has ideal default arguments of header = TRUE, sep = ","
gb2000 <- read.csv("grandBendO32000.csv")
gb2011 <- read.csv("grandBendO32011.csv")
tw2000 <- read.csv("torontoWestO32000.csv")
tw2011 <- read.csv("torontoWestO32011.csv")
 
# combine the above into a single dataframe
pollutant <- rbind(gb2000, gb2011, tw2000, tw2011)
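
As an aside, once the number of files grows beyond a handful, reading them in a loop scales better. A minimal sketch, assuming the same four file names as above:

# read each CSV and bind the resulting dataframes in one step
files <- c("grandBendO32000.csv", "grandBendO32011.csv",
           "torontoWestO32000.csv", "torontoWestO32011.csv")
pollutant <- do.call(rbind, lapply(files, read.csv))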

It is almost de rigueur for me to look at the structure of the resultant dataframe.

str(pollutant)
 
# 'data.frame':   1490 obs. of  27 variables:
# $ Station.ID: chr  "15020" "15020" "15020" "15020" ...
# $ Pollutant : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
# $ Date      : chr  "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ...
# $ H01       : int  9999 6 25 16 38 25 0 15 8 12 ...
# $ H02       : int  6 6 24 14 37 17 0 13 7 9 ...
# etc.

There are several aspects of the dataframe that need to be addressed.
Firstly, 24 of the 27 variables are hourly readings. To do a proper analysis I need
to reshape the data so that I have just one ‘hour’ column which contains all the
ozone readings. For this, I use the reshape2 package. As with all of the packages
covered here, it is readily downloadable from the CRAN repository from within the R environment.

# load the pre-installed package
library(reshape2)
 
# reshape data, retaining the id variables Station.ID, Pollutant and Date
pollutant.melt <- melt(pollutant, id = c("Station.ID", "Pollutant","Date"))
 
# check out the structure of the new dataframe
str(pollutant.melt)
 
# 'data.frame':   35088 obs. of  5 variables:
# $ Station.ID: int  15020 15020 15020 15020 15020 15020 15020 15020 15020 15020 ...
# $ Pollutant : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
# $ Date      : chr  "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ...
# $ variable  : Factor w/ 24 levels "H01","H02","H03",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value     : int  9999 6 25 16 38 25 0 15 8 12

The new ‘variable’ column now shows the hour at which each reading was taken.

I would like to rename the station ID to the geographical location. There is probably just one ID for each station, but to be on the safe side that can be checked via the unique function. As it happens, this shows three IDs in the dataframe. On checking the raw data, it can be seen that the Toronto West station moved geographical location between 2000 and 2011. In this case, the impact on the data readings is probably minimal, but this should certainly be borne in mind in any rigorous analysis.

# apply the unique function to the Station.ID column referable via the $ sign
unique(pollutant$Station.ID)
# [1] 15020 35003 35125
 
# create a new column with the station name
# The [] notation subsets the rows with a specific ID
# and the <- applies the name to a new 'station' column
 
pollutant.melt$station[pollutant.melt$Station.ID=="35003"] <- "West Toronto"
pollutant.melt$station[pollutant.melt$Station.ID=="35125"] <- "West Toronto"
pollutant.melt$station[pollutant.melt$Station.ID=="15020"] <- "Grand Bend"

I now need to address the date field. It needs to be changed from a character to a date vector, and I also want to extract various components of it for analysis. The lubridate package comes in handy here.

As you can see below, some errors were noted. The cause (at least for the first one) is that in one of the CSV files, the date which should have been “03/10/2000” had actually come across as “2000-10-02-999”. In a proper analysis, this would simply be corrected or investigated further with the original data provider. However, for the purpose of this exercise, the fact that the error is found, and that it relates to just 72 out of more than 35,000 observations, means that the only action taken is to exclude the offending rows from the dataframe.

 
library(lubridate)
pollutant.melt$Date <- dmy(pollutant.melt$Date)
# Using date format %d/%m/%Y. 72 failed to parse
 
# check the first instance by testing for NA output via the is.na function
# again using [] notation for subsetting; the trailing ',' selects all columns
head(pollutant.melt[is.na(pollutant.melt$Date),],1)
 #   Station.ID Pollutant Date variable value
# 277      15020     Ozone <NA>      H01  -999
 
# ascertain where row 277 occurs by viewing it in context
pollutant.melt[276:278, ]
#276      15020     Ozone 2000-10-02      H01    38
#277      15020     Ozone       <NA>      H01  -999
#278      15020     Ozone 2000-10-04      H01    10
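 
# sanity check: the count of unparseable dates should equal the
# 72 reported in the parse warning above
sum(is.na(pollutant.melt$Date))
# [1] 72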
 
# look at raw data for Grand Bend 3rd Oct 2000
 
# exclude errors from dataframe
pollutant.melt <- pollutant.melt[!is.na(pollutant.melt$Date),]
 
# apply lubridate functions to get constituent date/time fields
# as separate columns
pollutant.melt$month <- month(pollutant.melt$Date)
pollutant.melt$year  <- year(pollutant.melt$Date)
pollutant.melt$wday  <- wday(pollutant.melt$Date)
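
Note that wday() returns a number with, by default, 1 representing Sunday. If labelled days are more readable, lubridate’s label argument can be used instead (a usage sketch, not part of the original script):

# returns an ordered factor (Sun, Mon, ...) rather than 1-7
pollutant.melt$wday <- wday(pollutant.melt$Date, label = TRUE)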

There are just a couple more aspects to tidy up. The variable column is a series of
hours of the day, so removing the leading ‘H’ and creating a new ‘hour’ column makes sense.
You may also have noted that the very first reading is 9999. A must-do initial perusal
of the raw data warns that any reading of 9999 is invalid data and -999 is missing data.
Again, if this data were critical then some follow-up would be needed, but for this situation
they can either be excluded from the data, as we did earlier, or set to NA and specifically excluded from any analysis.

# create a new 'hour' column, converting e.g. "H01" to the integer 1
pollutant.melt$hour <- as.integer(gsub("H", "", pollutant.melt$variable))
 
# set invalid and missing data to NA
pollutant.melt$value[pollutant.melt$value == -999] <- NA
pollutant.melt$value[pollutant.melt$value == 9999] <- NA
 
# if desired, any unwanted columns can now be dumped
pollutant.melt$Station.ID <- NULL
pollutant.melt$variable <- NULL
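
A quick summary() confirms the recoding; the NA count it reports now covers both the invalid (9999) and missing (-999) readings:

# sanity check on the cleaned readings; note the NA's count
summary(pollutant.melt$value)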

I now have an acceptable set of data to examine. That examination is not the subject of this blog,
but a quick look at how the average monthly ozone levels have changed by station
and year highlights how a couple of other packages – plyr and ggplot2 – are extremely
useful contributors to the R arsenal. As with the other packages mentioned, PDFs
detailing all their functions, along with a host of tutorials, are available online.

 
# load the pre-installed packages
library(plyr)
library(ggplot2)
 
# create a new dataframe which gives average ozone levels
# na.rm excludes the NA readings we introduced
ozone_m_y_s <- ddply(pollutant.melt, c("month", "year", "station"),
                     summarize, av = mean(value, na.rm = TRUE))
head(ozone_m_y_s,3)
#month year      station       av
#1     1 2000   Grand Bend 25.10633
#2     1 2000 West Toronto 13.73108
#3     1 2011   Grand Bend 32.62988
 
# write a graph of the data to file
png("ozone1.png")
p <- ggplot(ozone_m_y_s, aes(x = month, y = av, colour = station))
p <- p + geom_line()
p <- p + facet_wrap(~ year)
p <- p + labs(x = "Month", y = "Ozone level (parts per billion)")
# ggtitle replaces the now-defunct opts(title = ...)
p <- p + ggtitle("Avg Monthly Ozone levels (ppb) at two Ontario locations")
# explicit print() so the plot is written inside the png device
print(p)
dev.off()
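
As a design note, ggplot2’s ggsave() wraps the device handling above into a single call, which some may find tidier:

# equivalent, letting ggplot2 manage the graphics device
ggsave("ozone1.png", plot = p)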

The graph highlights the fact that ozone levels are highest in the summer months and
have hardly changed at either location over the two years under consideration.

Check out the blog for further analyses of air quality data.
