We’re into the second day of Business Science Demo Week. What’s demo week? Every day this week we are demoing an R package: tidyquant (Monday), timetk (Tuesday), sweep (Wednesday), tibbletime (Thursday) and h2o (Friday)! That’s five packages in five days! We’ll give you intel on what you need to know about these packages to go from zero to hero. Second up is timetk, your toolkit for time series in R. Here we go!
It’s a good idea to visualize the data so we know what we’re working with. Visualization is particularly important for time series analysis and forecasting (as we see during time series machine learning). We’ll use tidyquant charting tools: mainly geom_ma(ma_fun = SMA, n = 12) to add a 12-period simple moving average to get an idea of the trend. We can also see there appears to be both trend (moving average is increasing in a relatively linear pattern) and some seasonality (peaks and troughs tend to occur at specific months).
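As a rough sketch of the chart described above, the snippet below draws a monthly series with a 12-period simple moving average via geom_ma(). The data is synthetic (made-up values standing in for the real beer sales series), and the column names date and price are assumptions for illustration.

```r
library(tidyquant)  # provides geom_ma()
library(ggplot2)
library(tibble)

# Synthetic monthly series (made-up values): trend + yearly cycle + noise
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

p <- ggplot(beer_sales_tbl, aes(x = date, y = price)) +
  geom_line() +
  geom_point() +
  geom_ma(ma_fun = SMA, n = 12) +  # 12-period simple moving average
  labs(title = "Synthetic monthly series with 12-month SMA",
       x = NULL, y = "Value")

print(p)
```

The first 11 points have no moving average value, so ggplot2 may warn about removed rows when drawing that layer.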
Now that you have a feel for the time series we’ll be working with today, let’s move on to the demo!
We’ve split this demo into two parts. First, we’ll follow a workflow for time series machine learning. Second, we’ll check out coercion tools.
Part 1: Time Series Machine Learning
Time series machine learning is a great way to forecast time series data, but before we get started here are a couple of pointers for this demo:
Key Insight: The time series signature (timestamp information expanded column-wise into a feature set) is used to perform machine learning.
Objective: We’ll predict the next 12 months of data for the time series using the time series signature.
We’ll go through a workflow that can be used to perform time series machine learning. You’ll see how several timetk functions can help with this process. We’ll do machine learning with a simple lm() linear regression, and you will see how powerful and accurate this can be when a time series signature is used. You should also think about which more powerful machine learning algorithms could be used instead, such as xgboost, glmnet (LASSO), and others.
Step 0: Review data
Just to show our starting point, let’s print out our beer_sales_tbl.
We can quickly get a feel for the time series using tk_index() to extract the index and tk_get_timeseries_summary() to retrieve summary information of the index. We use glimpse() to output in a nice format for review.
We can see important features like start, end, units, etc. We also have the quantiles of the time-diffs (difference in seconds between observations), which is useful for assessing the degree of regularity. Because the scale is monthly and months differ in length, the number of seconds between observations is not constant.
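A minimal sketch of this review step, using a synthetic stand-in for beer_sales_tbl (the values are made up and the column names date and price are assumptions):

```r
library(timetk)
library(dplyr)   # for %>% and glimpse()
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

idx <- tk_index(beer_sales_tbl)                # extract the Date index
idx_summary <- tk_get_timeseries_summary(idx)  # one-row summary tibble

idx_summary %>% glimpse()

# Month lengths vary, so the gap between observations is not constant
idx_summary$diff.minimum  # shortest gap, in seconds
idx_summary$diff.maximum  # longest gap, in seconds
```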
Step 1: Augment Time Series Signature
The tk_augment_timeseries_signature() function expands out the timestamp information column-wise into a machine learning feature set, adding columns of time series information to the original data frame.
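The augmentation step might look like the sketch below. The data is synthetic (made-up values), and date and price are assumed column names.

```r
library(timetk)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

# Original columns are kept; signature columns are appended to the right
beer_sales_tbl_aug <- tk_augment_timeseries_signature(beer_sales_tbl)

names(beer_sales_tbl_aug)  # date, price, index.num, diff, year, ...
```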
Step 2: Model
Apply any regression model to the data. We’ll use lm(). Note that we drop the date and diff columns. Most algorithms do not work with dates, and the diff column is not useful for machine learning (it’s more useful for finding time gaps in the data).
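Here is a hedged sketch of the modeling step on synthetic data (made-up values; the target column name price is an assumption). Several signature columns are constant or collinear for monthly data, so some coefficients will come back as NA.

```r
library(timetk)
library(dplyr)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

beer_sales_tbl_aug <- tk_augment_timeseries_signature(beer_sales_tbl)

# Regress the target on every signature column except date and diff
fit_lm <- lm(price ~ ., data = select(beer_sales_tbl_aug, -c(date, diff)))

summary(fit_lm)
```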
Step 3: Build Future (New) Data
Use tk_index() to extract the index.
Make a future index from the existing index with tk_make_future_timeseries(). The function internally checks the periodicity and returns the correct sequence. Note that we have a whole vignette on how to make future time series, which is helpful due to the complexity of the topic.
From the future index, use tk_get_timeseries_signature() to turn the index into a time series signature data frame.
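The three steps above can be sketched as follows, again on synthetic data (made-up values; column names are assumptions). The number of future periods is passed positionally because, as far as I know, the argument was named n_future in older timetk releases and length_out in newer ones.

```r
library(timetk)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

# 1. Extract the existing index
idx <- tk_index(beer_sales_tbl)

# 2. Extend it by 12 periods (second argument = number of future periods)
future_idx <- tk_make_future_timeseries(idx, 12)

# 3. Expand the future index into a signature data frame
new_data_tbl <- tk_get_timeseries_signature(future_idx)

head(future_idx)
names(new_data_tbl)
```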
Step 4: Predict the New Data
Use the predict() function for your regression model. Note that we drop the index and diff columns, the same as before when using the lm() function.
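Putting the pieces together, a prediction step could look like this sketch. It uses synthetic data (made-up values), the column names are assumptions, and predict() may warn about a rank-deficient fit because some signature columns are redundant for monthly data.

```r
library(timetk)
library(dplyr)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

# Fit on the augmented training data
beer_sales_tbl_aug <- tk_augment_timeseries_signature(beer_sales_tbl)
fit_lm <- lm(price ~ ., data = select(beer_sales_tbl_aug, -c(date, diff)))

# Build the future signature (second argument = number of future periods)
future_idx   <- tk_make_future_timeseries(tk_index(beer_sales_tbl), 12)
new_data_tbl <- tk_get_timeseries_signature(future_idx)

# Predict, dropping the same non-feature columns as in training
pred <- predict(fit_lm, newdata = select(new_data_tbl, -c(index, diff)))

predictions_tbl <- tibble(date = future_idx, value = as.numeric(pred))
predictions_tbl
```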
Step 5: Compare Actual vs Predictions
We can use tq_get() to retrieve the actual data. Note that we don’t have all of the data for comparison, but we can at least compare the first several months of actual values.
Visualize our forecast.
We can investigate the error on our test set (actuals vs predictions).
We can also calculate a few residual metrics. The MAPE (mean absolute percentage error) is approximately 4.5%, which is pretty good for a simple multiple linear regression. A more complex algorithm could produce more accurate results.
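For reference, here is a small base R sketch of common residual metrics using their standard definitions. The input values are made up purely to show the arithmetic, so this is not necessarily the exact metric set behind the figure quoted above.

```r
# Made-up actual and predicted values, purely for illustration
actual <- c(100, 200, 400)
pred   <- c(110, 190, 400)

error     <- actual - pred   # residuals
error_pct <- error / actual  # residuals relative to the actual value

metrics <- data.frame(
  me   = mean(error),            # mean error
  rmse = sqrt(mean(error^2)),    # root mean squared error
  mae  = mean(abs(error)),       # mean absolute error
  mape = mean(abs(error_pct)),   # mean absolute percentage error
  mpe  = mean(error_pct)         # mean percentage error
)

print(metrics)
```

With these toy numbers the MAPE works out to 0.05, i.e. predictions are off by 5% of the actual value on average.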
Part 2: Coercion
Problem: Switching between various time classes in R is painful and inconsistent.
Solution: tk_tbl, tk_xts, tk_zoo, tk_ts
We are starting with a tbl object. Sometimes, though, we would like to convert to an xts object so we can use xts-based functions from the numerous packages that work with xts objects (xts, zoo, quantmod, etc.).
We can easily convert to an xts object using tk_xts(). Notice that tk_xts() auto-detects the time-based column and uses its values as the index for the xts object.
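A minimal sketch of the tbl to xts conversion, along with the trip back to tbl, on synthetic data (made-up values; date and price are assumed column names). Passing silent = TRUE just hides the messages about the auto-detected date column.

```r
library(timetk)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

# tbl -> xts: the date column is detected and becomes the xts index
beer_sales_xts <- tk_xts(beer_sales_tbl, silent = TRUE)
head(beer_sales_xts)

# xts -> tbl: the index becomes a column, renamed to match the original
beer_sales_back <- tk_tbl(beer_sales_xts, rename_index = "date")
head(beer_sales_back)
```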
We can also go from xts back to tbl. We tack on rename_index = "date" to have the index name match what we started with. This used to be very difficult. Notice that
A number of packages use a different time class called ts. Probably the most popular is the forecast package. The advantage of using the tk_ts() function is two-fold:
It’s consistent with the other tk_ coercion functions so coercing back and forth is straightforward and easy.
IMPORTANT: When tk_ts() is used, the ts-object carries the original irregular time index (usually dates) as an index attribute. This makes keeping date and datetime information possible.
Here’s an example. We can use tk_ts() to convert to a ts object. Because the ts-based system only works with regular time series, we need to add the arguments start = 2010 and freq = 12.
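As a sketch on synthetic data (made-up values; column names are assumptions), the conversion to ts might look like this. The full argument name frequency is used here, and silent = TRUE hides the note about the dropped date column.

```r
library(timetk)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

# tbl -> ts: a regular monthly series starting in January 2010
beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12,
                       silent = TRUE)

start(beer_sales_ts)      # first period of the regular index
frequency(beer_sales_ts)  # observations per year
```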
There are two ways we can go back to tbl:
Just coerce back using tk_tbl() and we get the “regular” index as the yearmon class from zoo.
If the object was created with tk_ts() and has a timetk index, we can coerce back using tk_tbl(timetk_idx = TRUE) and we get the original “irregular” index as the Date class.
Method 1: We go back to tbl. Note that the date column is yearmon class.
Method 2: We go back to tbl but specify timetk_idx = TRUE to return the original date or datetime information.
First, you can check to see if the ts-object has a timetk index with has_timetk_idx().
If TRUE, then specify timetk_idx = TRUE during the tk_tbl() coercion. Notice that the date column is now Date class. This was previously very difficult to do.
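Both ways back to tbl can be sketched as below, using synthetic data (made-up values; column names are assumptions).

```r
library(timetk)
library(tibble)

# Synthetic monthly series (made-up values)
set.seed(123)
n <- 84
beer_sales_tbl <- tibble(
  date  = seq(as.Date("2010-01-01"), by = "month", length.out = n),
  price = 10000 + 50 * seq_len(n) +
    1000 * sin(2 * pi * seq_len(n) / 12) + rnorm(n, sd = 200)
)

beer_sales_ts <- tk_ts(beer_sales_tbl, start = 2010, frequency = 12,
                       silent = TRUE)

# Method 1: regular index derived from the ts time attributes
tbl_regular <- tk_tbl(beer_sales_ts, rename_index = "date")
class(tbl_regular$date)

# Method 2: check for a stored timetk index, then ask for it explicitly
if (has_timetk_idx(beer_sales_ts)) {
  tbl_original <- tk_tbl(beer_sales_ts, timetk_idx = TRUE,
                         rename_index = "date")
}
class(tbl_original$date)
```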
We’ve only scratched the surface of timetk. There’s more to learn including working with time series indices and making future indices. Here are a few resources to help you along the way: