# To Transform or Not To Transform

December 3, 2012
By

[This article was first published on Strategic Thinking - Automated In R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Many of the forecasting packages in R requires a time series that is covariance stationary. For those who are not familiar with this term, there is an excellent online textbook by Hyndman and Athanasopoulos, Forecasting: Principles and Practice. Click here to go directly to that chapter.

The first step, therefore, in making a predictive or analytic procedure is to analyze a time series to see if it is already covariance stationary. More often than not, a transformation is required to convert the non-stationary time series to a stationary time series.

Examples of time series, include stock prices, raw material prices, and ALL data that is ordered by a given interval in date and time. But to keep this article easy to follow, I will use a time series of Powerball, which is a type of lottery game played once a week in Australia.

Fortunately, the historical data is easy to obtain from the Powerball website as a CSV file. For this example, I will be using only ONE (1) column to keep it easy to follow. However, you can extend this example to the rest of the columns.

### (1) Load the data into a data frame

library(forecast)
library(tseries)
library(zoo)

# (1-1) Load the data into a data.frame
hist.df <- read.csv( “powerball.csv”, colClasses = “character” )

# (1-2) Coerce the column into a numeric vector and create a zoo object
n1.zoo <- zoo( matrix( as.numeric(hist.df\$Number_1), ncol=1 ),
as.Date(hist.df\$Draw_date, “%Y/%m/%d”) )
names(n1.zoo) <- c(“n1”)

# (1-3) Note: For the zoo object, the latest result is at the LAST row
tail(n1.zoo)

First, we load the data into a data.frame, then we coerce ONE (1) column into a numeric vector and create a zoo object. For the zoo object, the LAST row corresponds to latest result, the penultimate row corresponds to previous week’s results, and so on.

### (2) Test the column for normality

# (2-1) Sharpiro test (p>0.05 is normal) of diff(log(data))
#         Requires a vector as input
n1.vtr <- as.numeric(n1.zoo)

p3 <- shapiro.test(diff(log(n1.vtr)))\$p.value

# (2-2) Plot histogram of
hist(diff(log(n1.vtr)), prob = T,
main = c(“diff(log(n1))”, paste(“Shapiro p=”, prettyNum(p3, digits = 2))), xlab = “diff(log(n1))”)
lines(density(diff(log(n1.vtr))))

# (2-3) QQ plot of the above
qqnorm(diff(log(n1.zoo)))

The above code tests the column for a normal distribution. It can be seen using a histogram (and a QQ plot), that the diff(log(data)) is approximately normal. However, the Shapiro test indicates that the distribution is NOT normal (p-value is LESS THAN 0.05).

### (3) Test the column for autocorrelation

# (3-1) Acf plot
acf(diff(log(n1.zoo)), main=”diff(log(n1))”)

The ACF plot has two horizontal blue lines for significance tests. If any of the period has a vertical line that crosses the blue lines, then there is a correlation between that period and period 0. It can be seen in the ACF plot, that the diff(log(data)) has an autocorrelation between period 1 and period 0.

### (4) Test the column for stationary

# (4-1) Augmented Dickey-Fuller (ADF) test
#       The null-hypothesis for an ADF test is that the data is non-stationary
#       Therefore, p>0.05 is non-stationary

Using the ADF test, we show the diff(log(data)) is a stationary time series.

### (5) Test the column for seasonality

# (5-1) Seasonal differencing test
#       If ns>0, then seasonal differencing is required

nsdiffs(diff(log(n1.zoo)))

Using the nsdiffs() function, we show the column does NOT require seasonal differencing.

### (6) Auto fit the column using Arima

# (6-1) AIC test
auto.arima(diff(log(n1.zoo)))

Using the auto.arima() function from the package forecast, we find that the column has the best fitted model (with the lowest AIC) when using the diff(log(data)) form. This AIC has improved almost THREE (3) times that of the fitted model when using the data without transformation.

# (7-1) Obtain ONE(1)-period lookahead forecast and confidence interval
f3 <- forecast(auto.arima(diff(log(n1.zoo))), h=1)
f3

# (7-2) Transform diff(log(data)) back into data
#       Select the 95% confidence interval
#       Powerball numbers MUST be between ONE (1) AND FORTY-FIVE (45)
u <- exp( log(n1.zoo[NROW(n1.zoo), 1]) + max(f3\$upper) )
p <- exp( log(n1.zoo[NROW(n1.zoo), 1]) + f3\$mean )
l <- exp( log(n1.zoo[NROW(n1.zoo), 1]) + min(f3\$lower) )
u <- min(45, trunc(u))
p <- round(p)
l <- max(1, round(l))
data.frame(lower=l, predict=p, upper=u)

Based on the above AIC, and taking into consideration ALL the previous tests, we select the diff(log(data)) as our time series input, and then predict a ONE(1)-period lookahead for this time series. We also obtain a confidence interval and then transform our predicted value (with intervals) back to its original form.

### Conclusion

I have shown you how to analyze ONE (1) column of the Powerball time series to determine if it requires transformation prior to fitting a model for forecasting. The transformations diff(data) and diff(log(data)) are commonly used for ANY time series, however, you are NOT limited to these two types. In fact, you are encouraged to explore other types of transformation that may lead you to a better forecasting model than mine.

The entire code (and output) can be viewed here.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.