**Peter's stats stuff - R**, and kindly contributed to R-bloggers)

## Seasonal air pollution in India

The motivation for this blog post was a conference paper I recently heard that analysed five years of daily pollution data in an India city with a non-seasonal auto-regressive integrated moving average (ARIMA) model. In discussion after the presentation, there were differing views on whether such data should be modelled with a seasonal approach, and I wanted to look at some data to see for myself (disclaimer – I was one of the pro-seasonal camp).

To cut to the chase, ultimately I was looking at this sort of obviously seasonal result:

## Open Air Quality Data

But first thing was to get some data. I didn’t have access to the original data but was aware of other sources, with shorter time spans, but hopefully still enough for my purposes. My first call was the openaq Air Quality Data site and in particular the excellent `ropenaq`

R package which speaks to the openaq API and makes it easy to access their data. This looked very promising, but the data available are for far too short a period for me. Here is the six months or so of data which is all that is available for Ahmedabad.

For the record, here’s the R code that downloaded that data (nearly 23,000 observations – many observations per day) and drew the graph, and also set up the session for everything else done in the rest of this post

## US Embassy and Consulate India Air Quality Monitoring

Luckily, the GitHub page for the `ropenaq`

R package referenced a second R package, `usaqmindia`

. This package has tidied and bundled up various source CSV and PDF files published by the US State Department on particulate matter concentrations in five Indian cities. The R package might not be completely current, but it looks pretty good. The particulate matter data are “commonly referred to as PM 2.5 because they are less than or equal to 2.5 microns in diameter”.

If your Indian geography is as rusty as mine you may appreciate this map of the five cities in question:

… which was made with this code using the `ggmap`

R package (yes, I looked up those latitude and longitudes and copied and pasted them by hand – there are better ways of doing this in general, but for just five cities this was simplest):

The original data published by the US has multiple observations for the five cities for most days, and some days missing observations altogether. It’s not easy to spot patterns with this irregular data. Here’s how it looks out of the box:

… which was created with this:

Those observations at concentration levels of 2000 looks suspicious for two reasons:

- they are so high compared to other levels. It’s plausible there are massive spikes some days, but it’s also plausible there’s something over-sensitive about the measuring device. I lack the subject matter expertise and resources to decide, so simply note that I might want to carefully use robust methods.
- they are exactly 2000, which looks like some kind of truncation (ie the real values may be more than 2000, and 2000 is just the maximum value of the censors).

To see patterns I want more regularised data. As a first go, I tried taking the trimmed mean average observation for each day. This gave me this graphic:

This is getting me somewhere. The patterns in Kolkata and Mumbai are looking plausibly regular now. This daily aggregation was made on the fly with this code:

## Aggregate to monthly data

There are still a fair few days missing data altogether (apparent in the plot above by breaks in the lines). So I decided that for my purposes it made sense to aggregate the data by month. There’s all sorts of information that’s chucked out by this – variation on days within the month, and hours within the day; subtle treatment of the different number of days by month etc. But it’s good enough for my purposes of examining seasonality, and in fact this summarising/aggregation process is probably the best way to investigate a broad annual seasonal pattern. Just because we have large amount of irregular, granular data doesn’t mean we’re compelled to leave it in that state for analysis. Here’s how it looks when aggregated at a monthly level:

Looking at the pattern for each month, the seasonality is very apparent; there is much more particulate matter in the air in the winter-centred November to February period, most noticeably in the four northern cities (ie less so in Chennai). I don’t know enough about the subject matter to speculate if this is related purely to weather (the dominant factor being the May to November monsoon), or to human response to seasons.

We can decompose the monthly series into the trend, seasonal and random elements using the `ggseas`

package to get a simultaneous view of scale and seasonality. The GitHub page of the `usaqmindia`

R package draws attention to the importance of the Hindu Diwali ‘festival of lights’ – which involves fireworks, burning effigies, lamps, and general celebrations – for the particulate matter concentrations, so I’m interested in annotating graphics with those months. This gives us the graphic I started the post with:

Here’s the code that did the above aggregation and analysis.

## Delhi time series analysis

Diwali’s date is in a different month from year to year, which means we have a naturally occuring experiment that lets us distinguish between the underlying seasonal impact of any month, and the particular effect of Diwali. To investigate this further, I had a look at just the Delihi data, which is more complete than that of the other four cities even when aggregated to monthly level.

Here are graphics of the auto-correlation and partial auto-correlation functions of the Delhi monthly time series. These show the strength of relationship between any particular observation and lagged versions of the series. A lag of 1.0 means 12 months in this case. The patterns we see are typical of a series with a noticeable seasonal impact and a strong relationship to the observation of the previous month:

For analysis I was a little worried about second order stationarity, in particular whether the variance of monthly observations is higher when the mean observations are higher. I also had in mind the possibility of using a model for forecasting. While not shown in this already-too-long post, forecasting based on the original data hand a tendency for the lower range of the prediction interval to be negative, which isn’t possible and suggested the need for a transformation. All these problems can be fixed by a Box-Cox power transform. The `BoxCox.lambda`

function used below uses Guerrero’s 1993 method for choosing a value of that minimises the coefficient of variation of the time series. With an estimated value of 0.21 for Delhi’s data (somewhere between a logarithm value of 0 and a square-root-like value of 0.5) this is well in the range of rule of thumb transformation commonly used for this sort of time series.

I use for fitting an ARIMA model with a dummy variable for whether Diwali occurred in the month:

That gives us these results, showing some modest evidence of a gentle drift downwards as well as a Diwali impact (0.46, with a standard error of 0.29 – well after all, it is only three years) within a seasonal ARIMA context

A 95% confidence interval `confint(mod0)`

for the Diwali impact is between -0.12 and +1.04 on the transformed scale. This range includes zero so this analysis isn’t sufficient by itself to conclude it has an impact. However, the power of this test would be pretty low (one of the prices paid for by the aggregation to the monthly level) and my intuitive Bayesian inside would be happy to say it is more likely than not that there is some impact of Diwali on particulate matter in the atmosphere in Delhi.

**leave a comment**for the author, please follow the link and comment on their blog:

**Peter's stats stuff - R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...