Feature-based time series analysis

In my last post, I showed how the feasts package can be used to produce various time series graphics.

The feasts package also includes functions for computing FEatures And Statistics from Time Series (hence the name). In this post I will give three examples of how these might be used.

library(tidyverse)
library(tsibble)
library(feasts)

Exploring Australian tourism data

I used this example in my talk at useR!2019 in Toulouse, and it is also the basis of a vignette in the package and of a recent blog post by Mitchell O’Hara-Wild. The data set contains domestic tourist visitor nights in Australia, disaggregated by State, Region and Purpose.

tourism
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # … with 24,310 more rows

An example of a feature would be the autocorrelation function at lag 1 — it is a numerical summary capturing some aspect of the time series. Autocorrelations at other lags are also features, as are the autocorrelations of the first differenced series, or the seasonally differenced series, etc. Another example of a feature is the strength of seasonality of a time series, as measured by \(1-\text{Var}(R_t)/\text{Var}(S_t+R_t)\) where \(S_t\) is the seasonal component and \(R_t\) is the remainder component in an STL decomposition. Values close to 1 indicate a highly seasonal time series, while values close to 0 indicate a time series with little seasonality.
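
To make these two features concrete, here is a minimal sketch computing them by hand for a single series (holiday trips to Adelaide). It assumes the STL(), model() and components() functions that become available when feasts is loaded, and the season_year and remainder column names that components() produces for quarterly data.

# Lag-1 autocorrelation of one series
adelaide_holiday <- tourism %>%
  filter(Region == "Adelaide", Purpose == "Holiday")
acf(adelaide_holiday$Trips, lag.max = 1, plot = FALSE)

# Strength of seasonality: 1 - Var(R_t) / Var(S_t + R_t)
adelaide_holiday %>%
  model(STL(Trips)) %>%
  components() %>%
  as_tibble() %>%
  summarise(seasonal_strength = 1 - var(remainder) / var(season_year + remainder))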

The feasts package has some inbuilt feature calculation functions. For example, coef_hurst will calculate the Hurst coefficient of a time series, and feat_spectral will compute its spectral entropy. To apply these to all series in the tourism data set, we can use the features function like this.

tourism %>% features(Trips, list(coef_hurst, feat_spectral))
## # A tibble: 304 x 5
##    Region         State              Purpose  coef_hurst spectral_entropy
##    <chr>          <chr>              <chr>         <dbl>            <dbl>
##  1 Adelaide       South Australia    Business      0.571            0.988
##  2 Adelaide       South Australia    Holiday       0.558            0.968
##  3 Adelaide       South Australia    Other         0.862            0.928
##  4 Adelaide       South Australia    Visiting      0.583            0.964
##  5 Adelaide Hills South Australia    Business      0.555            0.980
##  6 Adelaide Hills South Australia    Holiday       0.651            0.986
##  7 Adelaide Hills South Australia    Other         0.672            0.973
##  8 Adelaide Hills South Australia    Visiting      0.653            0.994
##  9 Alice Springs  Northern Territory Business      0.713            0.973
## 10 Alice Springs  Northern Territory Holiday       0.500            0.826
## # … with 294 more rows

There are also functions which produce collections of features based on tags. For example, feature_set(tags="acf") will produce 7 features based on various autocorrelation coefficients, while feature_set(tags="stl") produces 7 features based on an STL decomposition of a time series, including the strength of seasonality mentioned above.
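
For example, the autocorrelation features could be obtained with a call like this (a sketch; the resulting columns include acf1 and acf10, along with autocorrelations of the differenced series):

tourism %>% features(Trips, feature_set(tags="acf"))

The STL features are computed in the same way.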

tourism %>% features(Trips, feature_set(tags="stl"))
## # A tibble: 304 x 12
##    Region State Purpose trend_strength seasonal_streng… seasonal_peak_y…
##    <chr>  <chr> <chr>            <dbl>            <dbl>            <dbl>
##  1 Adela… Sout… Busine…          0.451            0.380                3
##  2 Adela… Sout… Holiday          0.541            0.601                1
##  3 Adela… Sout… Other            0.743            0.189                2
##  4 Adela… Sout… Visiti…          0.433            0.446                1
##  5 Adela… Sout… Busine…          0.453            0.140                3
##  6 Adela… Sout… Holiday          0.512            0.244                2
##  7 Adela… Sout… Other            0.584            0.374                2
##  8 Adela… Sout… Visiti…          0.481            0.228                0
##  9 Alice… Nort… Busine…          0.526            0.224                0
## 10 Alice… Nort… Holiday          0.377            0.827                3
## # … with 294 more rows, and 6 more variables: seasonal_trough_year <dbl>,
## #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## #   stl_e_acf10 <dbl>

We can then use these features in plots to identify which types of series are heavily trended and which are most seasonal.

tourism %>% features(Trips, feature_set(tags="stl")) %>%
  ggplot(aes(x=trend_strength, y=seasonal_strength_year, col=Purpose)) +
    geom_point() + facet_wrap(vars(State))

Clearly, holiday series are the most seasonal, which is unsurprising. The strongest trends tend to be in Western Australia.

The most seasonal series can also be easily identified and plotted.

tourism %>%
  features(Trips, feature_set(tags="stl")) %>%
  filter(seasonal_strength_year == max(seasonal_strength_year)) %>%
  left_join(tourism, by = c("State","Region","Purpose")) %>%
  ggplot(aes(x = Quarter, y = Trips)) + geom_line() +
    facet_grid(vars(State,Region,Purpose))

This is holiday trips to the most popular ski region of Australia.

The feasts package currently includes 42 features, which can all be computed in one line like this.

# Compute features
tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs="feasts"))

Unusual time series can be identified by first performing a principal components decomposition on the features.

pcs <- tourism_features %>%
  select(-State, -Region, -Purpose) %>%
  prcomp(scale=TRUE) %>%
  augment(tourism_features)
pcs %>%
  ggplot(aes(x=.fittedPC1, y=.fittedPC2, col=Purpose)) +
  geom_point() + theme(aspect.ratio=1)

We can then identify some unusual series.

pcs %>%
  filter(.fittedPC1 > 10.5) %>%
  select(Region, State, Purpose, .fittedPC1, .fittedPC2)
## # A tibble: 3 x 5
##   Region                 State             Purpose  .fittedPC1 .fittedPC2
##   <chr>                  <chr>             <chr>         <dbl>      <dbl>
## 1 Australia's North West Western Australia Business       16.5      -8.24
## 2 Australia's South West Western Australia Holiday        11.3       3.03
## 3 Melbourne              Victoria          Holiday        14.2      -8.24

Replicating Hyndman, Wang and Laptev (2015)

I used a similar approach to identifying outliers in my ICDM 2015 paper with Earo Wang and Nikolay Laptev. In that paper, we used hourly data on thousands of mail servers from Yahoo.

The data is stored in the tsfeatures package as an mts object. Unfortunately, mts and ts objects are particularly bad at handling hourly and other sub-daily data, as time is stored numerically and the precision is not always sufficient. So we need to do a little work to create a properly formulated tsibble.

yahoo <- tsfeatures::yahoo_data() %>%
  as_tsibble() %>%
  mutate(
    Time = hms::hms(day = trunc(index) - 1L,
                    hour = as.integer((round(24*(index-1))) %% 24))
  ) %>%
  as_tsibble(index = Time, key=key) %>%
  select(Time, key, value)

The key variable here identifies the particular server and the measurement variable on that server. There are 1748 such time series, each with 1437 observations (almost 60 days).
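
As a quick check of those dimensions, here is a small sketch using the tsibble helper n_keys() together with a simple count on the yahoo object created above.

# Number of distinct series, and the number of observations in each
n_keys(yahoo)
yahoo %>%
  as_tibble() %>%
  count(key) %>%
  distinct(n)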

Now we compute the features for all series, matching the original paper as closely as possible. There are a few small differences due to different choices in how the features are computed, but they make little difference to the results.

The features are computed in two groups — the first group uses the original data, while the second group uses the scaled data.

yahoo_features <- bind_cols(
    yahoo %>% features(value, features=list(
      mean = ~ mean(., na.rm = TRUE),
      var = ~ var(., na.rm = TRUE)
    )),
    yahoo %>% features(scale(value), features = list(
      ~ feat_acf(.),
      ~ feat_spectral(.),
      ~ n_flat_spots(.),
      ~ n_crossing_points(.),
      ~ var_tiled_var(., .period = 24, .size = 24),
      ~ shift_level_max(., .period = 24, .size = 24),
      ~ shift_var_max(., .period = 24, .size = 24),
      ~ shift_kl_max(., .period = 24, .size = 48),
      ~ feat_stl(., .period = 24, s.window = "periodic", robust = TRUE)
    ))
  ) %>%
  rename(lumpiness = var_tiled_var) %>%
  select(key, mean, var, acf1, trend_strength,
         seasonal_strength_24, linearity, curvature,
         seasonal_peak_24, seasonal_trough_24,
         spectral_entropy, lumpiness, spikiness,
         shift_level_max, shift_var_max,
         n_flat_spots, n_crossing_points,
         shift_kl_max, shift_kl_index)

Now we can compute the principal components.

hwl_pca <- yahoo_features %>%
  select(-key) %>%
  na.omit() %>%
  prcomp(scale=TRUE) %>%
  augment(na.omit(yahoo_features))
hwl_pca %>%
  as_tibble() %>%
  ggplot(aes(x=.fittedPC1, y=.fittedPC2)) +
    geom_point()

A useful plot for identifying outlying observations is an HDR scatterplot, which highlights regions of high density and identifies outliers as observations in low-density regions. The red dots are the 1% highest-density observations, the orange dots are in the 50% region, and the yellow dots are in the 99% region, while those shown in black are the 1% most “unusual” observations (having the lowest bivariate density on this plot). The five most outlying points are highlighted by their row numbers.

hdrcde::hdrscatterplot(hwl_pca$.fittedPC1, hwl_pca$.fittedPC2, noutliers=5)

Replicating Kang, Hyndman and Smith-Miles (2017)

Another paper that uses features is this IJF paper from 2017 in which we explored the feature space of the M3 time series data.

First we need to create several tsibbles, corresponding to the different seasonal periods in the M3 data.

m3totsibble <- function(z) {
  rbind(
    as_tsibble(z$x) %>% mutate(Type="Training"),
    as_tsibble(z$xx) %>% mutate(Type="Test")
  ) %>%
  mutate(
    st = z$st,
    type = z$type,
    period = z$period,
    description = z$description,
    sn = z$sn
  ) %>%
  as_tibble()
}
m3yearly <- Mcomp::M3 %>%
  subset("yearly") %>%
  purrr::map_dfr(m3totsibble) %>%
  as_tsibble(index=index, key=c(sn,period,st))
m3quarterly <- Mcomp::M3 %>%
  subset("quarterly") %>%
  purrr::map_dfr(m3totsibble) %>%
  mutate(index = yearquarter(index)) %>%
  as_tsibble(index=index, key=c(sn,period,st))
m3monthly <- Mcomp::M3 %>%
  subset("monthly") %>%
  purrr::map_dfr(m3totsibble) %>%
  mutate(index = yearmonth(index)) %>%
  as_tsibble(index=index, key=c(sn,period,st))
m3other <- Mcomp::M3 %>%
  subset("other") %>%
  purrr::map_dfr(m3totsibble) %>%
  as_tsibble(index=index, key=c(sn,period,st))

There are some bespoke features used by Kang et al. (IJF 2017), so rather than use the inbuilt feasts functions, we can create our own feature calculation function.

khs_stl <- function(x, period) {
  output <- c(frequency=period, feat_spectral(x))
  lambda <- forecast::BoxCox.lambda(ts(x, frequency=period),
                      lower=0, upper=1, method='loglik')
  stlfeatures <- feat_stl(box_cox(x, lambda), .period=period,
      s.window='periodic', robust=TRUE)
  output <- c(output, stlfeatures, lambda=lambda)
  if(period==1L)
    output <- c(output, seasonal_strength=0)
  return(output)
}
m3_features <- bind_rows(
    m3yearly %>% features(value, features=list(~ khs_stl(.,1))),
    m3quarterly %>% features(value, features=list(~ khs_stl(.,4))) %>%
      rename(seasonal_strength = seasonal_strength_4),
    m3monthly %>% features(value, features=list(~ khs_stl(.,12))) %>%
      rename(seasonal_strength = seasonal_strength_12),
    m3other %>% features(value, features=list(~ khs_stl(.,1)))
  ) %>%
  select(sn, spectral_entropy, trend_strength, seasonal_strength,
         frequency, stl_e_acf1, lambda)

Finally we plot the first two principal components of the feature space.

m3_features %>%
  select(-sn) %>%
  prcomp(scale=TRUE) %>%
  augment(m3_features) %>%
  ggplot(aes(x=.fittedPC1, y=.fittedPC2)) +
    geom_point(aes(color = factor(frequency)))
