Forecasting my weight with R

(This article was first published on Econometrics and Free Software, and kindly contributed to R-bloggers)

I’ve been measuring my weight almost daily for almost 2 years now; I actually started earlier, but
not as consistently. The goal of this blog post is to get re-acquaiented with time series; I haven’t
had the opportunity to work with time series for a long time now and I have seen that quite a few
packages that deal with time series have been released on CRAN. In this blog post, I will explore
my weight measurements using some functions from the {tsibble} and {tibbletime} packages,
and then do some predictions with the {forecast} package.

First, let’s load the needed packages, read in the data and convert it to a tsibble:

library("tidyverse")
library("readr")
library("forecast")
library("tsibble")
library("tibbletime")
library("mice")
weight <- read_csv("https://gist.githubusercontent.com/b-rodrigues/ea60679135f8dbed448ccf66a216811f/raw/18b469f3b0720f76ce5ee2715d0f9574b615f170/gistfile1.txt") %>% 
    as_tsibble()
## Parsed with column specification:
## cols(
##   Date = col_date(format = ""),
##   Poids = col_double()
## )
## The `index` is `Date`.

You can read more about {tsibble} here. Here, I use {tsibble} mostly
for the next step, which is using the function fill_na() on the tsibble. fill_na() turns
implicit missing values into explicit missing values. These are implicit missing values:

          Date Poids
1   2013-01-01 84.10
2   2013-01-04 85.60

and this is the same view, but with explicit missing values:

          Date Poids
1   2013-01-01 84.10
2   2013-01-02 NA
3   2013-01-03 NA
4   2013-01-04 85.60

This is useful to do, because I want to impute the missing values using the {mice} package.
Let’s do this:

weight <- weight %>% 
    fill_na()

imp_weight <- mice(data = weight) %>% 
    mice::complete("long")
## 
##  iter imp variable
##   1   1  Poids
##   1   2  Poids
##   1   3  Poids
##   1   4  Poids
##   1   5  Poids
##   2   1  Poids
##   2   2  Poids
##   2   3  Poids
##   2   4  Poids
##   2   5  Poids
##   3   1  Poids
##   3   2  Poids
##   3   3  Poids
##   3   4  Poids
##   3   5  Poids
##   4   1  Poids
##   4   2  Poids
##   4   3  Poids
##   4   4  Poids
##   4   5  Poids
##   5   1  Poids
##   5   2  Poids
##   5   3  Poids
##   5   4  Poids
##   5   5  Poids

Let’s take a look at imp_weight:

head(imp_weight)
##   .imp .id       Date Poids
## 1    1   1 2013-10-28  84.1
## 2    1   2 2013-10-29  84.4
## 3    1   3 2013-10-30  83.5
## 4    1   4 2013-10-31  84.1
## 5    1   5 2013-11-01  85.6
## 6    1   6 2013-11-02  85.2

Let’s select the relevant data. I filter from the 11th of July 2016, which is where I started
weighing myself almost every day, to the 31st of May 2018. I want to predict my weight for the
month of June (you might think of the month of June 2018 as the test data, and the rest as training
data):

imp_weight_train <- imp_weight %>% 
    filter(Date >= "2016-07-11", Date <= "2018-05-31")

In the next lines, I create a column called imputation which is simply the same as the column
.imp but of character class, remove unneeded columns and rename some other columns (“Poids” is
French for weight):

imp_weight_train <- imp_weight_train %>% 
    mutate(imputation = as.character(.imp)) %>% 
    select(-.id, -.imp) %>% 
    rename(date = Date) %>% 
    rename(weight = Poids)

Let’s take a look at the data:

ggplot(imp_weight_train, aes(date, weight, colour = imputation)) +
    geom_line() + 
    theme(legend.position = "bottom")

This plots gives some info, but it might be better to smooth the lines. This is possible by
computing a rolling mean. For this I will use the rollify() function of the {tibbletime} package:

mean_roll_5 <- rollify(mean, window = 5)
mean_roll_10 <- rollify(mean, window = 10)

rollify() can be seen as an adverb, pretty much like purrr::safely(); rollify() is a higher
order function that literally rollifies a function, in this case mean() which means that
rollifying the mean creates a function that returns the rolling mean. The window argument lets
you decide how smooth you want the curve to be: the higher the smoother. However, you will lose
some observations. Let’s use this functions to add the rolling means to the data frame:

imp_weight_train <- imp_weight_train %>% 
    group_by(imputation) %>% 
    mutate(roll_5 = mean_roll_5(weight),
           roll_10 = mean_roll_10(weight))

Now, let’s plot these new curves:

ggplot(imp_weight_train, aes(date, roll_5, colour = imputation)) +
    geom_line() + 
    theme(legend.position = "bottom")
## Warning: Removed 20 rows containing missing values (geom_path).

ggplot(imp_weight_train, aes(date, roll_10, colour = imputation)) +
    geom_line() + 
    theme(legend.position = "bottom")
## Warning: Removed 45 rows containing missing values (geom_path).

That’s easier to read, isn’t it?

Now, I will use the auto.arima() function to train a model on the data to forecast my weight for
the month of June. However, my data, imp_weight_train is a list of datasets. auto.arima() does
not take a data frame as an argument, much less so a list of datasets. I’ll create a wrapper around
auto.arima() that works on a dataset, and then map it to the list of datasets:

auto.arima.df <- function(data, y, ...){

    y <- enquo(y)

    yts <- data %>% 
        pull(!!y) %>% 
        as.ts()

    auto.arima(yts, ...)
}

auto.arima.df() takes a data frame as argument, and then y, which is the column that contains the
univariate time series. This column then gets pulled out of the data frame, converted to a time
series object with as.ts(), and then passed down to auto.arima(). I can now use this function
on my list of data sets. The first step is to nest the data:

nested_data <- imp_weight_train %>% 
    group_by(imputation) %>% 
    nest() 

Let’s take a look at nested_data:

nested_data
## # A tibble: 5 x 2
##   imputation data              
##                     
## 1 1          
## 2 2          
## 3 3          
## 4 4          
## 5 5          

nested_data is a tibble with a column called data, which is a so-called list-column. Each
element of data is itself a tibble. This is a useful structure, because now I can map auto.arima.df()
to the data frame:

models <- nested_data %>% 
    mutate(model = map(data, auto.arima.df, y = weight))

This trick can be a bit difficult to follow the first time you see it. The idea is the following:
nested_data is a tibble. Thus, I can add a column to it using mutate(). So far so good.
Now that I am “inside” the mutate call, I can use purrr::map(). Why? purrr::map() takes a list
and then a function as arguments. Remember that data is a list column; you can see it above,
the type of the column data is list. So data is a list, and thus can be used inside purrr::map().
Great. Now, what is inside data? tibbles, where inside each of them is a column
called weight. This is the column that contains my univariate time series I want to model. Let’s
take a look at models:

models
## # A tibble: 5 x 3
##   imputation data               model      
##                           
## 1 1           
## 2 2           
## 3 3           
## 4 4           
## 5 5           

models is a tibble with a column called model, where each element is a model of type ARIMA.

Adding forecasts is based on the same trick as above, and we use the forecast() function:

forecasts <- models %>% 
    mutate(predictions = map(model, forecast, h = 24)) %>% 
    mutate(predictions = map(predictions, as_tibble)) %>% 
    pull(predictions) 

I forecast 24 days (I am writing this on the 24th of June), and convert the predictions to tibbles,
and then pull only the predictions tibble:

forecasts
## [[1]]
## # A tibble: 24 x 5
##    `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95`
##  *                        
##  1             71.5    70.7    72.3    70.2    72.8
##  2             71.5    70.7    72.4    70.3    72.8
##  3             71.5    70.6    72.3    70.1    72.8
##  4             71.5    70.6    72.4    70.1    72.9
##  5             71.4    70.5    72.4    70.0    72.9
##  6             71.5    70.5    72.4    70.0    72.9
##  7             71.4    70.5    72.4    69.9    72.9
##  8             71.4    70.4    72.4    69.9    72.9
##  9             71.4    70.4    72.4    69.9    72.9
## 10             71.4    70.4    72.4    69.8    73.0
## # ... with 14 more rows
## 
## [[2]]
## # A tibble: 24 x 5
##    `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95`
##  *                        
##  1             71.6    70.8    72.3    70.3    72.8
##  2             71.6    70.8    72.5    70.3    72.9
##  3             71.5    70.6    72.4    70.2    72.9
##  4             71.5    70.6    72.5    70.1    72.9
##  5             71.5    70.5    72.5    70.0    73.0
##  6             71.5    70.5    72.5    70.0    73.0
##  7             71.5    70.5    72.5    69.9    73.0
##  8             71.5    70.4    72.5    69.9    73.1
##  9             71.5    70.4    72.5    69.8    73.1
## 10             71.4    70.3    72.6    69.7    73.1
## # ... with 14 more rows
## 
## [[3]]
## # A tibble: 24 x 5
##    `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95`
##  *                        
##  1             71.6    70.8    72.4    70.4    72.8
##  2             71.5    70.7    72.4    70.2    72.8
##  3             71.5    70.6    72.4    70.2    72.9
##  4             71.5    70.6    72.4    70.1    72.9
##  5             71.5    70.5    72.4    70.0    72.9
##  6             71.5    70.5    72.4    70.0    73.0
##  7             71.5    70.5    72.5    69.9    73.0
##  8             71.4    70.4    72.5    69.9    73.0
##  9             71.4    70.4    72.5    69.8    73.0
## 10             71.4    70.4    72.5    69.8    73.1
## # ... with 14 more rows
## 
## [[4]]
## # A tibble: 24 x 5
##    `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95`
##  *                        
##  1             71.5    70.8    72.3    70.3    72.8
##  2             71.5    70.7    72.4    70.3    72.8
##  3             71.5    70.7    72.4    70.2    72.8
##  4             71.5    70.6    72.4    70.1    72.9
##  5             71.5    70.6    72.4    70.1    72.9
##  6             71.5    70.5    72.5    70.0    73.0
##  7             71.5    70.5    72.5    69.9    73.0
##  8             71.5    70.4    72.5    69.9    73.0
##  9             71.4    70.4    72.5    69.8    73.1
## 10             71.4    70.3    72.5    69.8    73.1
## # ... with 14 more rows
## 
## [[5]]
## # A tibble: 24 x 5
##    `Point Forecast` `Lo 80` `Hi 80` `Lo 95` `Hi 95`
##  *                        
##  1             71.5    70.8    72.3    70.3    72.8
##  2             71.5    70.7    72.4    70.3    72.8
##  3             71.5    70.7    72.4    70.2    72.8
##  4             71.5    70.6    72.4    70.1    72.9
##  5             71.5    70.6    72.4    70.1    72.9
##  6             71.5    70.5    72.4    70.0    73.0
##  7             71.5    70.5    72.5    69.9    73.0
##  8             71.5    70.4    72.5    69.9    73.0
##  9             71.4    70.4    72.5    69.8    73.1
## 10             71.4    70.3    72.5    69.8    73.1
## # ... with 14 more rows

So forecasts is a list of tibble, each containing a forecast. Remember that I have 5 tibbles, because
I imputed the data 5 times. I will merge this list of data sets together into one, but before I need
to add a column that indices the forecasts:

forecasts <- map2(.x = forecasts, .y = as.character(seq(1, 5)), 
     ~mutate(.x, id = .y)) %>% 
    bind_rows() %>% 
    select(-c(`Lo 80`, `Hi 80`))

colnames(forecasts) <- c("point_forecast", "low_95", "hi_95", "id")

Let’s take a look again at forecasts:

forecasts
## # A tibble: 120 x 4
##    point_forecast low_95 hi_95 id   
##                 
##  1           71.5   70.2  72.8 1    
##  2           71.5   70.3  72.8 1    
##  3           71.5   70.1  72.8 1    
##  4           71.5   70.1  72.9 1    
##  5           71.4   70.0  72.9 1    
##  6           71.5   70.0  72.9 1    
##  7           71.4   69.9  72.9 1    
##  8           71.4   69.9  72.9 1    
##  9           71.4   69.9  72.9 1    
## 10           71.4   69.8  73.0 1    
## # ... with 110 more rows

I now select the true values for the month of June. I also imputed this data, but here I will
simply keep the average of the imputations:

weight_june <- imp_weight %>% 
    filter(Date >= "2018-06-01") %>% 
    select(-.id) %>% 
    group_by(Date) %>% 
    summarise(true_weight = mean(Poids)) %>% 
    rename(date = Date)

Let’s take a look at weight_june:

weight_june
## # A tibble: 24 x 2
##    date       true_weight
##               
##  1 2018-06-01        71.8
##  2 2018-06-02        70.8
##  3 2018-06-03        71.2
##  4 2018-06-04        71.4
##  5 2018-06-05        70.9
##  6 2018-06-06        70.8
##  7 2018-06-07        70.5
##  8 2018-06-08        70.1
##  9 2018-06-09        70.3
## 10 2018-06-10        71.0
## # ... with 14 more rows

Let’s repeat weight_june 5 times, and add the index 1 to 5. Why? Because I want to merge the
true data with the forecasts, and having the data in this form makes things easier:

weight_june <- modify(list_along(1:5), ~`<-`(., weight_june)) %>% 
    map2(.y = as.character(seq(1, 5)), 
         ~mutate(.x, id = .y)) %>% 
    bind_rows()

The first line:

modify(list_along(1:5), ~`<-`(., weight_june)) 

looks quite complicated, but you will see that it is not, once we break it apart. modify()
modifies a list. The list to modify is list_along(1:5), which create a list of NULLs:

list_along(1:5)
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL

The second argument of modify() is either a function or a formula. I created the following
formula:

~`<-`(., weight_june)

We all know the function <-(), but are not used to see it that way. But consider the following:

a <- 3
`<-`(a, 3)

These two formulations are equivalent. So these lines fill the empty element of the list of NULLs
with the data frame weight_june. Then I add the id column and then bind the rows together: bind_rows().

Let’s bind the columns of weight_june and forecasts and take a look at it:

forecasts <- bind_cols(weight_june, forecasts) %>% 
    select(-id1)

forecasts
## # A tibble: 120 x 6
##    date       true_weight id    point_forecast low_95 hi_95
##                             
##  1 2018-06-01        71.8 1               71.5   70.2  72.8
##  2 2018-06-02        70.8 1               71.5   70.3  72.8
##  3 2018-06-03        71.2 1               71.5   70.1  72.8
##  4 2018-06-04        71.4 1               71.5   70.1  72.9
##  5 2018-06-05        70.9 1               71.4   70.0  72.9
##  6 2018-06-06        70.8 1               71.5   70.0  72.9
##  7 2018-06-07        70.5 1               71.4   69.9  72.9
##  8 2018-06-08        70.1 1               71.4   69.9  72.9
##  9 2018-06-09        70.3 1               71.4   69.9  72.9
## 10 2018-06-10        71.0 1               71.4   69.8  73.0
## # ... with 110 more rows

Now, for the last plot:

ggplot(forecasts, aes(x = date, colour = id)) +
    geom_line(aes(y = true_weight), size = 2) + 
    geom_line(aes(y = hi_95)) + 
    geom_line(aes(y = low_95)) + 
    theme(legend.position = "bottom")

The true data fall within all the confidence intervals, but I am a bit surprised by the intervals,
especially the upper confidence intervals; they all are way above 72kg, however my true weight
has been fluctuating around 71kg for quite some months now. I think I have to refresh my memory
on time series, because I am certainly missing something!

If you found this blog post useful, you might want to follow me on twitter
for blog post updates.

To leave a comment for the author, please follow the link and comment on their blog: Econometrics and Free Software.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)