Specifying complicated groups of time series in hts
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
With the latest version of the hts package for R, it is now possible to specify rather complicated grouping structures relatively easily.
All aggregation structures can be represented as hierarchies or as cross-products of hierarchies. For example, a hierarchical time series may be based on geography: country, state, region, store. Often there is also a separate product hierarchy: product groups, product types, packet size. Forecasts of all the different types of aggregation are required; e.g., product type A within region X. The aggregation structure is a cross-product of the two hierarchies.
This framework includes even apparently non-hierarchical data: consider the simple case of a time series of deaths split by sex and state. We can consider sex and state as two very simple hierarchies with only one level each. Then we wish to forecast the aggregates of all combinations of the two hierarchies.
Any number of separate hierarchies can be combined in this way. Non-hierarchical factors such as sex can be treated as single-level hierarchies.
The hts package stores the data only at the bottom (most disaggregated) level, and records information about the various types of aggregates that are of interest. The hts() function is appropriate for a single hierarchy (i.e., strictly hierarchical data). More complicated aggregation structures can be specified using the more general gts() function.
Here is an example, based on a question asked on stackoverflow. The problem involves a geographical hierarchy and an industrial classification hierarchy.
Suppose there are two states with four and five counties respectively, and two industries with three and two sub-industries respectively. So there are 9×5 series at the most disaggregated level (sub-industry x county combinations). I will call the states A and B, and the counties A1,A2,A3,A4 and B1,B2,B3,B4,B5. I will call the industries X and Y with sub-industries Xa,Xb,Xc and Ya,Yb respectively. Suppose you have the bottom level series (the most disaggregated level) in a matrix y
, with one column per series, and columns in the following order:
County A1, industry Xa County A1, industry Xb County A1, industry Xc County A1, industry Ya County A1, industry Yb County A2, industry Xa County A2, industry Xb County A2, industry Xc County A2, industry Ya County A2, industry Yb ... County B5, industry Xa County B5, industry Xb County B5, industry Xc County B5, industry Ya County B5, industry Yb
So that we have a reproducible example, I will create y
randomly:
y <- ts(matrix(rnorm(900),ncol=45,nrow=20))
Then we can construct labels for the columns of this matrix as follows:
blnames <- paste(c(rep("A",20),rep("B",25)), # State rep(1:9,each=5), # County rep(c("X","X","X","Y","Y"),9), # Industry rep(c("a","b","c","a","b"),9), # Sub-industry sep="") colnames(y) <- blnames
For example, the first series in the matrix has name "A1Xa"
meaning state A, county 1, industry X, sub-industry a.
We can then easily create the grouped time series object using
gy <- gts(y, characters=list(c(1,1),c(1,1)))
Only the bottom level series are contained in y
. The characters
argument species what aggregations are of interest. In this case, the characters
argument indicates there are two hierarchies (two elements in the list), and the first hierarchy is specified by the first two characters, with the second hierarchy specified by the next two characters. Each level of each hierarchy is specified using a single character (hence the 1s).
A slightly more complicated but analogous example (with labels taking more than one character each) is given in the help file for gts
in v4.3 of the hts
package.
It is possible to specify the grouping structure without using column labels. Then you have to specify the groups matrix which defines what aggregations are of interest. In the example above, the groups matrix is given by
gps <- rbind( c(rep(1,20),rep(2,25)), # State rep(1:9,each=5), # County rep(c(1,1,1,2,2),9), # Industry rep(1:5, 9), # Sub-industry c(rep(c(1,1,1,2,2),4),rep(c(3,3,3,4,4),5)), # State x industry c(rep(1:5, 4),rep(6:10, 5)), # State x Sub-industry rep(1:18, rep(c(3,2),9)) # County x industry )
The order of the rows does not matter. Each row is specifying an aggregation of the bottom level series which is of interest.
Then
gy <- gts(y, groups=gps)
The advantage of using the characters
argument is that the cross-products are handled for you. Also, if your data already comes with helpful column names that can be interpreted as specifying levels of one or more hierarchies, then there is really nothing to do but figure out what the characters
argument should be.
Once the gts
object has been created using the gts()
function, you can proceed to forecast. For exmaple
fc <- forecast(gy)
will generate forecasts for all the bottom level series, and all the aggregate series specified in the call to gts()
. Then it will reconcile the forecasts until they add up for all the specified aggregations, and finally it returns only the reconciled bottom level series. The reconciled aggregated series can easily be constructed from these when they are required.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.