I use R aplenty in analysis and thought it might be worthwhile for some to see the typical process a relative newcomer goes through in extracting and analyzing public datasets
In this instance I happen to be looking at Canadian air pollution statistics.
The data I am interested in is available on the Ontario Ministry of Environment’s website. I have downloaded the hourly ozone readings from two weather stations (Grand Bend and Toronto West) for two years (2000 and 2011) which are available in several formats , including my preference, csv. According to the 2010 annual report from the Ministry, the two selected represent the extremes in readings for that year
I firstly set the directory in which the code and the associated datafiles will reside and import the data. I would normally load any R packages I will utilize at the head of the script (if not already in my start up file) but will hold off here until they are put to use.
I had to do a small amount of row deletion in the csv files so that only the readings data was included
# set working directory setwd("C:/Users/pssguy/Documents/R/societal/pollution") # create and combine dataframes for each dataset # The read.csv function has ideal default arguments of header = TRUE, sep = "," gb2000 <- read.csv("grandBendO32000.csv") gb2011 <- read.csv("grandBendO32011.csv") tw2000 <- read.csv("torontoWestO32000.csv") tw2011 <- read.csv("torontoWestO32011.csv") # combine the above pollutant <- rbind(gb2000,gb2011) pollutant <- rbind(pollutant,tw2000) pollutant <- rbind(pollutant,tw2011)
It is almosr de rigeur for me to look at the structure of the resultant dataframe
str(pollutant) #data.frame': 1490 obs. of 27 variables: # $ Station.ID: chr "15020" "15020" "15020" "15020" ... # $ Pollutant : chr "Ozone" "Ozone" "Ozone" "Ozone" ... # $ Date : chr "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ... # $ H01 : int 9999 6 25 16 38 25 0 15 8 12 ... # $ H02 : int 6 6 24 14 37 17 0 13 7 9 ... # etc.
There are several aspects of the dataframe that need to be addressed.
Firstly, 24 the 27 variables are hourly readings. To do a proper analyses I need
to reshape the data so that I have just one ‘hour’ column which contains all the
ozone readings. For this, I use the reshape2 package. As with all of the packages
covered here, it is readily downloadable from the CRAN repository from within the R environment
# load the pre-installed package library(reshape2) # reshape data retaining the independent variables of , station Id, pollutant and date pollutant.melt <- melt(pollutant, id = c("Station.ID", "Pollutant","Date")) # check out the structure of the new dataframe str(pollutant.melt) # data.frame': 35088 obs. of 5 variables: # $ Station.ID: int 15020 15020 15020 15020 15020 15020 15020 15020 15020 15020 ... # $ Pollutant : chr "Ozone" "Ozone" "Ozone" "Ozone" ... # $ Date : chr "01/01/2000" "02/01/2000" "03/01/2000" "04/01/2000" ... # $ variable : Factor w/ 24 levels "H01","H02","H03",..: 1 1 1 1 1 1 1 1 1 1 ... # $ value : int 9999 6 25 16 38 25 0 15 8 12
The new ‘variable’ column now shows each hour a reading was taken.
I would like to rename the station ID to the geographical location. There is probably just one ID for each station but to be on the safe side that can be checked via the unique function. As it happens this shows three ids in the dataframe. On checking the raw data, it can be seen that the Toronto West station moved geographical location between 2000 and 2011. In this case, the impact on the data readings is probably minimal but this should certainly be borne in mind in any rigorous analysis
# apply the unique function to the Station.ID column referable via the $ sign unique(pollutant$Station.ID) #  15020 35003 35125 # create a new column with the station name # The  notation subset subsets the rows with a specific ID # and the <- applies the name to a new 'station' column pollutant.melt$station[pollutant.melt$Station.ID=="35003"] <- "West Toronto" pollutant.melt$station[pollutant.melt$Station.ID=="35125"] <- "West Toronto" pollutant.melt$station[pollutant.melt$Station.ID=="15020"] <- "Grand Bend"
I now need to address the date field. It needs to be changed from a character to a date vector and I also want to extract various components of it for analysis. The lubridate package comes in handy here.
As you can see below, some errors were noted. The cause (at least for the first one) is that in one of the csv files, the date which should have been “03/10/2000″ had actually come across as “2000-10-02-999″. In a proper analysis, this would simply be corrected or investigated further with the original data provider. However, for the purpose of this exercise the fact that the error is found and that it relates to just 72 out of more than 35,000 observations means that the only action taken is that the offending rows are excluded from the dataframe
library(lubridate) pollutant.melt$Date <- dmy(pollutant.melt$Date) # Using date format %d/%m/%Y. 72 failed to parse # check the first instance by testing for NA output via the is.na function # again use  notation for subsetting. the ', ' outputs all columns head(pollutant.melt[is.na(pollutant.melt$Date),],1) # Station.ID Pollutant Date variable value # 277 15020 Ozone <NA> H01 -999 # ascertain where row 277 occurs head(pollutant.melt,278) #276 15020 Ozone 2000-10-02 H01 38 #277 15020 Ozone <NA> H01 -999 #278 15020 Ozone 2000-10-04 H01 10 # look at raw data for Grand Bend 3rd Oct 2000 # exclude errors from dataframe pollutant.melt <- pollutant.melt[!is.na(pollutant.melt$Date),] # apply lubridate functions to get constituent date/time fields # as separate columns pollutant.melt$month <-month( pollutant.melt$Date) pollutant.melt$year <-year(pollutant.melt$Date) pollutant.melt$wday <- wday(pollutant.melt$Date)
There are just a couple more aspects to tidy up. The variable column is a series of
the hour of day. Removing the leading ‘H’ and creating a new, ‘hour’ column makes sense.
You may also have noted that the very first reading is 9999. A, must-do, initial perusal
of the raw data warns that any reading of 9999 is invalid data and -999 is missing data.
Again, if this data was critical then some followup would be needed but for this situation
they can either be excluded from data as we did earlier or set to NA and specifically excluded from any analysis
# create a new column pollutant.melt$hour<-gsub("H","",pollutant.melt$variable) # set invalid and missing data to NA pollutant.melt$value[pollutant.melt$value==-999] <- NA pollutant.melt$value[pollutant.melt$value==9999] <- NA # if desired, any unwanted columns can be dumped. pollutant.melt$Station.ID <- NULL pollutant.melt$variable <- NULL
I now have an acceptable set of data to examine. This is not the subject of this blog
but a quick look at how the average monthly ozone levels have changed by station
by year highlights how a couple of other packages – plyr and ggplot2 – are extremely
useful contributors to the R arsenal. As with the other packages mentioned, there are
pdfs providing details of all their functions and hosts of tutorials etc. available online
# create a new dataframe which gives average ozone levels # na.rm excludes the NA readings we introduced ozone_m_y_s <- ddply(pollutant.melt,c("month","year","station"), summarize,av=mean(value,na.rm=TRUE)) head(ozone_m_y_s,3) #month year station av #1 1 2000 Grand Bend 25.10633 #2 1 2000 West Toronto 13.73108 #3 1 2011 Grand Bend 32.62988 # write a graph of the data to file png("ozone1.png") p <- ggplot(ozone_m_y_s, aes(x=month, y=av, colour=station)) p <- p + geom_line() p <- p + facet_wrap(~ year) p <- p+ labs(x="Month",y="Ozone level (parts per billion)") p <- p+ opts(title="Avg Monthly Ozone levels (ppb) at two Ontario locations") p dev.off()
Check out the blog for a further analyses of air quality data