**Hyndsight » R**, and kindly contributed to R-bloggers)

Last week, version 5.0 of the `forecast`

package for R was released. There are a few new functions and changes made to the package, which is why I increased the version number to 5.0. Thanks to Earo Wang for helping with this new version.

### Handling missing values and outliers

Data cleaning is often the first step that data scientists and analysts take to ensure statistical modelling is supported by good data. Some new functions and extended functions have been added to the `forecast`

package to make this job easier, and to automate some steps.

The existing `na.interp`

function has been upgraded to handle seasonal series much better. It now fits a seasonal model to the data, and then interpolates the seasonally adjusted series, before re-seasonalizing. I’ve tested it on a lot of data and I think it works pretty well, although I’m sure users will come up with some test cases that cause problems.

`tsoutliers`

is a new function for the purpose of identifying outliers and suggesting reasonable replacements. Residuals are identified by fitting a loess curve for non-seasonal data and via a periodic STL decomposition for seasonal data. Residuals are labelled as outliers if they lie outside the range where is the –quantile of the residuals. This is a little experimental. For a Gaussian distribution, it will identify less than 1 point in 3 million as an outlier. In comparison, when boxplots are used, an outlier is shown if it lies outside and extreme outliers are outside . By these rules, under a Gaussian distribution, 4% of points will be identified as outliers and about 1 in 20000 as extreme outliers.

Real data are often not as well-behaved as a Gaussian distribution, and outliers can be present. For example, the weekly air passenger traffic between Melbourne and Sydney (`melsyd`

in the `fpp`

package) contain seven consecutive weeks of zero traffic, and one week of partial traffic, due to a pilots’ strike. The `tsoutliers`

function can replace those with estimates:

library(fpp) tsoutliers(melsyd[,3]) $index [1] 113 114 115 116 117 118 119 120 $replacements [1] 17.57579 16.94973 16.15192 16.12787 16.20718 14.88098 14.46360 12.56176 |

A more general function is `tsclean`

which is a combination of `na.interp`

and `tsoutliers`

, so it handles both missing values and outliers. It will return a cleaned version of a time series with outliers and missing values replaced by estimated values.

library(fpp) plot(melsyd[,3], main="Economy class passengers: Melbourne-Sydney") lines(tsclean(melsyd[,3]), col='red') |

These three functions have one common argument `lambda`

(a Box-Cox transformation parameter). If present, the time series is transformed before the outliers are identified and replaced, or missing values are estimated.

These functions are also now used when `robust=TRUE`

in `forecast.ts`

. The idea is that `forecast.ts`

can take any time series and return something reasonable, even if the original series has missing values and outliers.

### Calendar variables

We’ve added two functions, `bizdays`

and `easter`

, into the package; they can be used when adjusting for calendar effects. Like the function `monthdays`

, both functions work for monthly and quarterly data.

`bizdays`

, as its name suggests, returns the number of business days in each month or quarter of the observed time series. Along with a time series input, it has an argument `FinCenter`

referring to the “Financial Center” (equivalent to the `finCenter`

in the `timeDate`

package). It is also assumed that weekdays are from Monday to Friday.

As Easter holiday isn’t fixed in relation to the civil calendar, which can make it challenging to forecast a time series with Easter effects. The function `easter`

will return a dummy variable indicating if Easter is present in each month. Easter is defined as the days between Good Friday and Easter Sunday inclusively, plus optionally Easter Monday if `easter.mon = TRUE`

. The function will return 0 for all months or quarters except those containing some of the days of Easter. A fractional result is returned if Easter spans March and April; otherwise 1 indicates that Easter falls entirely within the month or quarter.

These two functions are intended to give output that can be used as regression variables in `auto.arima`

or `tslm`

.

### Changes to ARIMA modelling

The biggest change is actually not part of the forecast package. When a regression variable is present (including when a drift term is used), the estimation was very poorly initialized in the `stats::arima`

function. I proposed a fix to the R core team, and this became part of Rv3.0.2. As `stats::arima`

is the engine behind the `Arima`

and `auto.arima`

functions in the `forecast`

package, this means that the package can now sometimes return different results to the results obtained in older versions of R.

Changes to the `forecast`

package itself include:

- Added arguments
`max.D`

and`max.d`

to`auto.arima()`

,`ndiffs()`

and`nsdiffs()`

. - Removed drift term in
`Arima()`

when . - Added bootstrap option to
`forecast.Arima()`

The latter option now makes it possible to forecast from an ARIMA model without making the assumption of Gaussian errors.

### Minor changes and bug fixes

Other changes include:

- Added argument model to
`dshw()`

to enable an estimated model to be applied to a new time series. - Made several functions more robust to
`zoo`

objects. - Corrected an error in the calculation of AICc when using
`CV()`

. - Made minimum default p in
`nnetar`

equal to 1 so it can no longer return a null model. - Improved output from
`snaive()`

and`naive()`

to better reflect user expectations - Allowed
`Acf()`

to handle missing values by using`na.contiguous`

. I might change this to`na.interp`

in a future release. - Changed default information criterion in
`ets()`

to AICc. For short time series, it may choose a different model from previous versions.

### Bugs?

If any user thinks they have found a bug, please report it on the github page and include a minimal reproducible example. If I can’t reproduce it, I can’t fix it.

**leave a comment**for the author, please follow the link and comment on their blog:

**Hyndsight » R**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...