Time Series Analysis in R Part 2: Time Series Transformations
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In Part 1 of this series, we got started by looking at the ts object in R and how it represents time series data. In Part 2, I’ll discuss some of the many time series transformation functions that are available in R. This is by no means an exhaustive catalog. If you feel I left out anything important, please let me know. I compile these posts as a guide in RMarkdown which I plan to make available on the web soon.
Often in time series analysis and modeling, we will want to transform data. There are a number of different functions that can be used to transform time series data such as the difference, log, moving average, percent change, lag, or cumulative sum. These type of function are useful for both visualizing time series data and for modeling time series. For example, the moving average function can be used to more easily visualize a high-variance time series and is also a critical part the ARIMA family of models. Functions such as the difference, percent change, and log difference are helpful for making non-stationary data stationary.
Let’s start with a very common operation. To take a lag we use the lag() function from the stats package. It takes a ts object and a value for the lag argument.
tseries_lag1 <- lag(tseries, 1) head(cbind(tseries, tseries_lag1)) tseries tseries_lag1 1999 Q4 NA 7.07667 2000 Q1 7.07667 10.38876 2000 Q2 10.38876 23.91096 2000 Q3 23.91096 14.49356 2000 Q4 14.49356 15.90501 2001 Q1 15.90501 28.00545
Notice that when we do this, tseries_lag1 is equivalent to tseries offset by 1, such that the value for 2000q1 in tseries become the value for 1999q4 in tseries_lag1 and so on. We can do this for any lag of our choosing. Let’s try 3 instead:
tseries_lag3 <- lag(tseries, 3)
head(cbind(tseries, tseries_lag3))
tseries tseries_lag3
1999 Q2 NA 7.07667
1999 Q3 NA 10.38876
1999 Q4 NA 23.91096
2000 Q1 7.07667 14.49356
2000 Q2 10.38876 15.90501
2000 Q3 23.91096 28.00545
We can also do this in the opposite direction using a negative value for the lag argument, effectively creating a lead series instead:
tseries_lead1 <- lag(tseries, -1)
head(cbind(tseries, tseries_lead1))
tseries tseries_lead1
2000 Q1 7.07667 NA
2000 Q2 10.38876 7.07667
2000 Q3 23.91096 10.38876
2000 Q4 14.49356 23.91096
2001 Q1 15.90501 14.49356
2001 Q2 28.00545 15.90501
Notice that tseries_lead1 now leads the original tseries by one time period.
We can take differences of a time series as well. This is equivalent to taking the difference between each value and its previous value:
tseries_diff1 <- diff(tseries, lag = 1) tm <- cbind(tseries, tseries_diff1) head(tm) tseries tseries_diff1 [1,] 7.07667 NA [2,] 10.38876 3.312087 [3,] 23.91096 13.522201 [4,] 14.49356 -9.417399 [5,] 15.90501 1.411455 [6,] 28.00545 12.100441 plot.ts(tm)
As you can see, taking a difference is an effective way to remove a trend and make a time series stationary.
The difference can be calculated over more than one lag. To do so, simply alter the value of the lag argument.
tseries_diff2 <- diff(tseries, lag = 2)
tm <- cbind(tseries, tseries_diff2)
head(tm)
tseries tseries_diff2
2000 Q1 7.07667 NA
2000 Q2 10.38876 NA
2000 Q3 23.91096 16.834288
2000 Q4 14.49356 4.104801
2001 Q1 15.90501 -8.005944
2001 Q2 28.00545 13.511896
plot(tm)
Some time series transformation functions are useful for series in which the variance gets larger over time. These range from the basic logarithm function to the Box-Cox group of transformations (of which the natural logarithm is a special case). We can take the log of a time series using the log function in the same way that we would take the log of a vector. Let’s generate a time series that has increasing variance:
trend <- ts(seq(from = 10, to = 110)) cycle <- ts(sin(trend)) * 0.2 * trend tseries_h <- trend + cycle plot.ts(tseries_h)
As we’ll discuss in more detail, the prevalence of increasing, or more generally non-constant, variance is called heteroskedasticity and can cause problems in linear regression. Often, it will need to be corrected before modeling. One way to do this is taking a log:
tseries_log <- log(tseries_h) tm <-cbind(tseries_h, tseries_log) plot.ts(tm)
There are other more advanced ways of eliminating non-constant variance, one of which is the Box-Cox transformation, which allows us a bit more control over the transformation. The Box-Cox takes the form (Hyndman and Athanasopoulos, 2013):
$$w_t = \begin{cases} \log{ y_t }, \text{ if } \lambda = 0; \\ \frac{ ({y_t}^{\lambda} – 1) }{ \lambda },\text{ otherwise } \\ \end{cases} $$
In order to demonstrate the Box-Cox transformation, I’ll introduce Rob Hyndman’s `forecast` package:
library(forecast)
It has two functions that are of use here. The primary function is BoxCox(), which will return a transformed time series given a time series and a value for the parameter lambda:
plot.ts(BoxCox(tseries_h, lambda = 0.5))
Notice that this value of lambda here does not entirely take care of the heteroskedasticity problem. We can experiment with different values of lambda, or we can use the BoxCox.lambda() function, which will provide us an optimal value for parameter lambda:
lambda <- BoxCox.lambda(tseries_h) print(lambda) [1] 0.05527724 plot.ts(BoxCox(tseries_h, lambda = lambda))
The BoxCox.lambda() function has chosen the value 0.055. If we then use this value in our BoxCox() function, it returns a time series that appears to have constant variance.
Another common calculation that we may want to perform on time series is the percent change from one period to another. Because it seems that R does not have a function for performing this calculation, we can do it ourselves.
tseries_pch <- tseries / lag(tseries, -1) - 1
head(cbind(tseries, tseries_pch))
tseries tseries_pch
2000 Q1 7.07667 NA
2000 Q2 10.38876 0.46802901
2000 Q3 23.91096 1.30161865
2000 Q4 14.49356 -0.39385287
2001 Q1 15.90501 0.09738501
2001 Q2 28.00545 0.76079409
We can turn this into a function so that we can easily calculate percent change in the future. We’ll add an argument to specify the number of periods over which we want to calculate the change and set it to 1 by default. Additionally, we’ll add some argument verification.
pch <- function(data, lag = 1) {
# argument verification
if (!is.ts(data)) stop("data must be of type ts")
if (!is.numeric(lag)) stop("lag must be of type numeric")
# return percent change
data / lag(data, -lag) - 1
}
head(cbind(tseries, pch(tseries)))
tseries pch(tseries)
2000 Q1 7.07667 NA
2000 Q2 10.38876 0.46802901
2000 Q3 23.91096 1.30161865
2000 Q4 14.49356 -0.39385287
2001 Q1 15.90501 0.09738501
2001 Q2 28.00545 0.76079409
Now if we wished to calculate the percentage change over more than one period we can do so. Let’s say we wanted to calculate the year over year percent change. We can call our pch() function with a lag of 4.
tseries_pchya <- pch(tseries, lag=4)
Two of the functions that we have discussed so far, the difference and the log, are often combined in time series analysis. The log difference function is useful for making non-stationary data stationary and has some other useful properties. We can calculate the log difference in R by simply combining the log() and diff() functions.
tseries_dlog <- ts(diff(log(tseries)), start = c(2000, 1), frequency = 4) plot.ts(cbind(tseries, tseries_dlog))
Notice that after taking the log return, tseries appears to be stationary. We see some higher than normal volatility in the beginning of the series. This is due largely to the fact that the series levels start off so small. This leads into a nice property of the log return function, which is that it is a close approximation to the percent change:
plot.ts(cbind(pch(tseries), tseries_dlog))
This similarity is only approximate. The relationship does break down somewhat when the percent change from one period to the next is particularly large. You can read a good discussion of this topic here.
A moving average is another essential function for working with time series. For series with particularly high volatility, a moving average can help us to more clearly visualize its trend. Unfortunately, base R does not (to my knowledge) have a convenient function for calculating the moving average of a time series directly. However, we can use base R’s filter() function, which allows us to perform general linear filtering. We can set up the parameters of this function to be a moving average (Shumway and Stoffer, 2011). Here we apply the filter() function to tseries to create a 5 period moving average. The filter argument lets up specify the filter, which in this case is just a weighted average of 5 observations. The sides argument allows us to specify whether we want to apply the filter over past values (sides = 1), or to both past and future values (sides = 2). Here I set sides to 1 indicating that I want a trailing moving average:
tseries_lf5 <- filter(tseries, filter = rep(1/5, 5), sides = 1)
plot.ts(cbind(tseries, tseries_lf5), plot.type = 'single', col = c('black', 'red'))
The fact that we have to define the linear filter each time we use this function makes it a little cumbersome to use. If we don’t mind introducing a dependency to our code, we could use the SMA() function from the TTR package or the ma() function from the forecast package. The SMA() function takes a ts object and a value for n – the window over which we want to calculate the moving average.
library(TTR) tseries_ma5 <- SMA(tseries, n = 5)
The ma() function from the forecast package also performs moving average calculations. We supply the time series and a value for the order argument.
tseries_ma5fore <- ma(tseries, order = 5)
Let’s compare the results. The SMA() function returns a trailing moving average where each value is the mean of the n most recent trailing values. This is equivalent to the results we get from using the filter() function. The ma() function from the forecast package returns a centered moving average. In this case tseries_ma5for is equal to the average of the current observation, the previous two observation, and the next two observations. Which one you use would depend on your application.
tseries tseries_lf5 tseries_ma5 tseries_ma5fore
2000 Q1 7.076670 NA NA NA
2000 Q2 10.388758 NA NA NA
2000 Q3 23.910958 NA NA 14.35499
2000 Q4 14.493559 NA NA 18.54075
2001 Q1 15.905014 14.35499 14.35499 20.50828
2001 Q2 28.005455 18.54075 18.54075 17.55500
2001 Q3 20.226413 20.50828 20.50828 17.49470
2001 Q4 9.144571 17.55500 17.55500 17.68977
2002 Q1 14.192030 17.49470 17.49470 18.00239
2002 Q2 16.880366 17.68977 17.68977 18.86085
That’s it for Part 2. I don’t know about you, but I am sick of working with generated data. I want to work with some real world data. In Part 3 we’ll do just that using Quandl. See you then.
References
- Hyndman, Rob J. and George Athanasopoulos (2013) Forecasting: Principles and Practice. https://www.otexts.org/fpp
Shumway, Robert H. and David S. Stoffer (2011) Time Series Analysis and Its Applications With R Examples. New York, NY: Springer.
Related Post
- Time Series Analysis in R Part 1: The Time Series Object
- Parsing Text for Emotion Terms: Analysis & Visualization Using R
- Using MongoDB with R
- Finding Optimal Number of Clusters
- Analyzing the first Presidential Debate
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.








