Last week, version 5.0 of the
forecast package for R was released. There are a few new functions and changes made to the package, which is why I increased the version number to 5.0. Thanks to Earo Wang for helping with this new version.
Handling missing values and outliers
Data cleaning is often the first step that data scientists and analysts take to ensure statistical modelling is supported by good data. Some new functions and extended functions have been added to the
forecast package to make this job easier, and to automate some steps.
The na.interp function has been upgraded to handle seasonal series much better. It now fits a seasonal model to the data, interpolates the seasonally adjusted series, and then re-seasonalizes. I’ve tested it on a lot of data and I think it works pretty well, although I’m sure users will come up with some test cases that cause problems.
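A minimal sketch of the seasonal interpolation, using the built-in AirPassengers series with a few observations removed. The series and the positions removed are arbitrary choices for illustration, not one of the test cases mentioned above.

```r
library(forecast)

x <- AirPassengers
gaps <- c(30, 31, 75)
x[gaps] <- NA                 # remove three observations

filled <- na.interp(x)        # interpolate the monthly series

# Compare the interpolated values with the ones that were removed
print(cbind(actual = AirPassengers[gaps], interpolated = filled[gaps]))
```

Because the seasonal pattern is used, the filled values should follow the monthly pattern rather than lie on a straight line between neighbouring observations.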
tsoutliers is a new function for identifying outliers and suggesting reasonable replacements. Residuals are obtained by fitting a loess curve for non-seasonal data and via a periodic STL decomposition for seasonal data. Residuals are labelled as outliers if they lie outside the range $\pm 2(q_{0.9} - q_{0.1})$, where $q_p$ is the $p$-quantile of the residuals. This is a little experimental. For a Gaussian distribution, it will identify fewer than 1 point in 3 million as an outlier. In comparison, when boxplots are used, an outlier is shown if it lies outside $\pm 1.5(q_{0.75} - q_{0.25})$, and extreme outliers are outside $\pm 3(q_{0.75} - q_{0.25})$. By these rules, under a Gaussian distribution, about 4% of points will be identified as outliers and about 1 in 20000 as extreme outliers.
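A quick way to check these Gaussian figures is to compute the tail probabilities in base R. The sketch below assumes each range is symmetric about zero, with half-width 2(q0.9 - q0.1) for tsoutliers and 1.5 or 3 times (q0.75 - q0.25) for boxplots. The helper name tail_prob is just for illustration.

```r
# Probability that a standard normal value falls outside +/- half_width
tail_prob <- function(half_width) 2 * pnorm(-half_width)

iqr_norm  <- qnorm(0.75) - qnorm(0.25)   # interquartile range of N(0,1)
wide_norm <- qnorm(0.90) - qnorm(0.10)   # 10% to 90% range of N(0,1)

p_tsoutliers <- tail_prob(2 * wide_norm)    # rule assumed for tsoutliers
p_box        <- tail_prob(1.5 * iqr_norm)   # boxplot outliers
p_extreme    <- tail_prob(3 * iqr_norm)     # boxplot extreme outliers

print(c(tsoutliers = p_tsoutliers, boxplot = p_box, extreme = p_extreme))
print(1 / c(tsoutliers = p_tsoutliers, extreme = p_extreme))  # "1 in N" form
```

Under these assumptions the first rule is far more conservative than either boxplot rule, which is the point of the comparison.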
Real data are often not as well-behaved as a Gaussian distribution, and outliers can be present. For example, the weekly air passenger traffic between Melbourne and Sydney (
melsyd in the
fpp package) contains seven consecutive weeks of zero traffic, and one week of partial traffic, due to a pilots’ strike. The
tsoutliers function can replace those with estimates:
    library(fpp)
    tsoutliers(melsyd[,3])
    $index
    [1] 113 114 115 116 117 118 119 120

    $replacements
    [1] 17.57579 16.94973 16.15192 16.12787 16.20718 14.88098 14.46360 12.56176
A more general function is
tsclean, which combines na.interp and tsoutliers, so it handles both missing values and outliers. It returns a cleaned version of a time series, with outliers and missing values replaced by estimated values.
    library(fpp)
    plot(melsyd[,3], main="Economy class passengers: Melbourne-Sydney")
    lines(tsclean(melsyd[,3]), col='red')
These three functions have one common argument
lambda (a Box-Cox transformation parameter). If present, the time series is transformed before the outliers are identified and replaced, or missing values are estimated.
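As a small illustration of the lambda argument, the sketch below damages a copy of the built-in AirPassengers series and cleans it on the log scale (lambda = 0). The series, the positions altered, and the factor of 10 are arbitrary choices made for the example.

```r
library(forecast)

x <- AirPassengers
x[50] <- NA               # introduce a missing value
x[100] <- 10 * x[100]     # introduce an obvious outlier

# lambda = 0 is a log transformation: outliers are identified and values
# estimated on the log scale, and results are returned on the original scale
found   <- tsoutliers(x, lambda = 0)
cleaned <- tsclean(x, lambda = 0)

print(found$index)             # positions flagged as outliers
print(cleaned[c(50, 100)])     # estimated replacements
```

A log transformation is a natural choice here because the seasonal swings in this series grow with its level.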
These functions are also now used when robust=TRUE is set in
forecast.ts. The idea is that
forecast.ts can take any time series and return something reasonable, even if the original series has missing values and outliers.
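A minimal sketch of that idea, assuming the robust argument of forecast.ts is what switches the cleaning on. The damaged series is made up for the example.

```r
library(forecast)

x <- AirPassengers
x[c(40, 41)] <- NA      # two missing months
x[90] <- 5 * x[90]      # one gross outlier

# With robust = TRUE the series is cleaned before a model is chosen
fc <- forecast(x, h = 12, robust = TRUE)

print(fc$method)        # model selected for the cleaned series
print(fc$mean)          # point forecasts for the next 12 months
```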
We’ve added two functions, bizdays and
easter, to the package; they can be used when adjusting for calendar effects. Like the function
monthdays, both functions work for monthly and quarterly data.
bizdays, as its name suggests, returns the number of business days in each month or quarter of the observed time series. Along with a time series input, it has an argument
FinCenter referring to the “Financial Center” (equivalent to the
finCenter in the
timeDate package). It is also assumed that weekdays are from Monday to Friday.
The Easter holiday isn’t fixed in relation to the civil calendar, which can make it challenging to forecast a time series with Easter effects. The function
easter will return a dummy variable indicating if Easter is present in each month. Easter is defined as the days between Good Friday and Easter Sunday inclusively, plus optionally Easter Monday if
easter.mon = TRUE. The function will return 0 for all months or quarters except those containing some of the days of Easter. A fractional result is returned if Easter spans March and April; otherwise 1 indicates that Easter falls entirely within the month or quarter.
These two functions are intended to give output that can be used as regression variables.
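One way the two calendar variables might be used as regressors is sketched below. The choice of series, the "New York" financial centre, and the model orders are all assumptions made for the example rather than recommendations.

```r
library(forecast)

y <- AirPassengers

# Calendar regressors on the same monthly grid as y
biz <- bizdays(y, FinCenter = "New York")   # business days in each month
est <- easter(y)                            # share of Easter in each month

xreg <- cbind(bizdays = as.numeric(biz), easter = as.numeric(est))

# Seasonal ARIMA model with the calendar variables as regressors;
# the orders are an arbitrary choice for this sketch
fit <- Arima(y, order = c(0, 1, 1), seasonal = c(0, 1, 1), xreg = xreg)
print(coef(fit))
```

To forecast from such a model, the same two variables would also need to be computed for the future months and supplied as regressors.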
Changes to ARIMA modelling
The biggest change is actually not part of the forecast package. When a regression variable is present (including when a drift term is used), the estimation was very poorly initialized in the
stats::arima function. I proposed a fix to the R core team, and this became part of R v3.0.2. As
stats::arima is the engine behind the Arima and
auto.arima functions in the
forecast package, this means that the package can now sometimes return results that differ from those obtained under older versions of R.
Changes to the
forecast package itself include:
- Added arguments
- Removed drift term in
Arima() when $d + D > 1$.
- Added bootstrap option to
The latter option now makes it possible to forecast from an ARIMA model without making the assumption of Gaussian errors.
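A short sketch of the bootstrap option, assuming it is passed as bootstrap = TRUE when forecasting from a fitted ARIMA model. The WWWusage series is just a convenient built-in example.

```r
library(forecast)

set.seed(123)   # bootstrap intervals are simulated, so fix the seed
fit <- auto.arima(WWWusage)

fc_gauss <- forecast(fit, h = 10)                    # Gaussian intervals
fc_boot  <- forecast(fit, h = 10, bootstrap = TRUE)  # resampled residuals

# Compare the widths of the 95% intervals under the two approaches
print(cbind(gaussian  = fc_gauss$upper[, 2] - fc_gauss$lower[, 2],
            bootstrap = fc_boot$upper[, 2] - fc_boot$lower[, 2]))
```

The bootstrap intervals come from simulated future paths built on resampled residuals, so they need not be symmetric about the point forecasts.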
Minor changes and bug fixes
Other changes include:
- Added argument model to
dshw() to enable an estimated model to be applied to a new time series.
- Made several functions more robust to
- Corrected an error in the calculation of AICc when using
- Made minimum default p in
nnetar equal to 1 so it can no longer return a null model.
- Improved output from
naive() to better reflect user expectations.
- Allowed Acf() to handle missing values by using
na.contiguous. I might change this to
na.interp in a future release.
- Changed default information criterion in
ets() to AICc. For short time series, it may choose a different model from previous versions.
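To see whether the criterion matters for a particular short series, the two choices can be compared directly. This sketch assumes the ic argument of ets() accepts "aic"; the 36-month window of AirPassengers is an arbitrary example.

```r
library(forecast)

y <- window(AirPassengers, end = c(1951, 12))   # a short series: 36 months

fit_aicc <- ets(y)               # default criterion (AICc)
fit_aic  <- ets(y, ic = "aic")   # AIC, for comparison

print(c(aicc = fit_aicc$method, aic = fit_aic$method))
```

If the two methods printed differ, this is a series where the new default changes the selected model.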
If you think you have found a bug, please report it on the GitHub page and include a minimal reproducible example. If I can’t reproduce it, I can’t fix it.