When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.
Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it were parametric.
dplyr is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr does have a complete parametric interface in the “underbar forms” (for example: using dplyr::filter_ instead of dplyr::filter). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.
As a work-around, replyr now supplies an adapter that applies a mapping from the column names you have (which can be supplied parametrically) to concrete column names you wish you had (which lets you write dplyr code simply in terms of known or assumed column names).
It is easier to show than to explain.
First we set up our libraries and type in some notional data as our example:
# install.packages('devtools') # Run this if you don't already have devtools
# devtools::install_github('WinVector/replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
d <- data.frame(Sepal_Length=c(5.8,5.7),Sepal_Width=c(4.0,4.4),
Species='setosa',rank=c(1,2))
print(d)
# Sepal_Length Sepal_Width Species rank
# 1 5.8 4.0 setosa 1
# 2 5.7 4.4 setosa 2
Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):
nmap <- c('GroupColumn','ValueColumn','RankColumn')
names(nmap) <- c('Species','Sepal_Length','rank')
print(nmap)
# Species Sepal_Length rank
# "GroupColumn" "ValueColumn" "RankColumn"
dtmp <- replyr_renameRestrictCols(d,nmap)
print(dtmp)
# GroupColumn ValueColumn RankColumn
# 1 setosa 5.8 1
# 2 setosa 5.7 2
At this point you do know the column names (they are the ones you picked) and can write nice, neat dplyr code. You can then do your work:
# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
Notice we were able to use dplyr::mutate without needing dplyr::mutate_ (and without needing to go to Stack Overflow to look up the lazy-eval notation yet again).
Once you have your desired result you restore the original names of the restricted column set:
invmap <- names(nmap)
names(invmap) <- as.character(nmap)
print(invmap)
# GroupColumn ValueColumn RankColumn
# "Species" "Sepal_Length" "rank"
replyr_renameRestrictCols(dtmp,invmap)
# Species Sepal_Length rank
# 1 setosa 5.8 0
# 2 setosa 5.7 1
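As a rough mental model of what the method does, here is a base-R sketch of the rename-and-restrict idea. The helper name renameRestrict is hypothetical; the real replyr::replyr_renameRestrictCols also works with remote dplyr data sources, which this toy does not.

```r
# Hypothetical base-R sketch of rename-and-restrict (in-memory data only).
renameRestrict <- function(d, nmap) {
  d <- d[, names(nmap), drop = FALSE]  # restrict to the mapped columns
  names(d) <- as.character(nmap)       # rename them to the standard names
  d
}

d <- data.frame(Sepal_Length = c(5.8, 5.7), Sepal_Width = c(4.0, 4.4),
                Species = 'setosa', rank = c(1, 2))
nmap <- c(Species = 'GroupColumn', Sepal_Length = 'ValueColumn', rank = 'RankColumn')
dtmp <- renameRestrict(d, nmap)
names(dtmp)
# [1] "GroupColumn" "ValueColumn" "RankColumn"

# The inverse map restores the original names:
invmap <- setNames(names(nmap), nmap)
names(renameRestrict(dtmp, invmap))
# [1] "Species" "Sepal_Length" "rank"
```

Note that setNames builds the inverse map in one step, equivalent to the two-line construction shown below.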
If you haven’t worked a lot with dplyr this won’t look that interesting. If you do work a lot with dplyr you may have been asking for something like this for quite a while. If you use the dplyr::*_ forms you will love replyr::replyr_renameRestrictCols. Be aware: replyr::replyr_renameRestrictCols is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.
I feel the replyr::replyr_renameRestrictCols interface represents the correct design for a better dplyr-based adapter.
I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr:
- view_as(df, columnNameMap): takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
- unview(): removes the view_as annotation.
Obviously there is an issue of nested views; I would suggest maintaining the views as a stack, applying the composite transformation implied by the stack of mapping specifications. I am assuming dplyr does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).
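To make the proposed semantics concrete, here is a toy in-memory illustration. These operators do not exist in dplyr; the names view_as and unview are the proposed ones, and this sketch only renames columns on a local data.frame while keeping the mapping stack as an attribute (a real implementation would have to work on remote data sources without touching the data):

```r
# Toy illustration of the proposed (non-existent) view_as/unview operators.
view_as <- function(df, columnNameMap) {
  stack <- c(attr(df, "viewStack"), list(columnNameMap))
  names(df)[match(names(columnNameMap), names(df))] <- as.character(columnNameMap)
  attr(df, "viewStack") <- stack
  df
}
unview <- function(df) {
  stack <- attr(df, "viewStack")
  top <- stack[[length(stack)]]  # pop the most recent mapping
  names(df)[match(as.character(top), names(df))] <- names(top)
  attr(df, "viewStack") <- if (length(stack) > 1) stack[seq_len(length(stack) - 1)] else NULL
  df
}

d <- data.frame(Species = 'setosa', rank = c(1, 2))
v <- view_as(d, c(rank = 'RankColumn'))
names(v)
# [1] "Species" "RankColumn"
names(unview(v))
# [1] "Species" "rank"
```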
I would also suggest that dplyr::arrange be enhanced to carry a visible annotation (just the column names it has arranged by) that allows the user to check if the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr data sources would support three primary annotations:
- Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
- Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (which just removes the annotation, it does not undo the arrangement; the annotation is also removed by any operation that re-orders the data, such as join), and viewed by Xdplyr::arrangement.
- Column Views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.
The “Xdplyr::” items are the extensions being proposed.
Another possibility would be some sort of “let” statement that controls name bindings for the duration of a block of code. Using such a statement we would write our above example calculation as:
let(
list(RankColumn='rank',GroupColumn='Species'),
{
# pretend this block is a huge sequence of complicated and expensive operations.
d %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
}
)
The idea is that the items 'rank' and 'Species' could be passed in parametrically (notice the let specification is essentially invmap, so we could just pass that in). This isn’t quite R’s “with” statement, as we are not binding names to values but names to names. Essentially we are asking for a macro facility that is compatible with dplyr remote data sources (and with the non-standard evaluation methods used to capture variable names). I haven’t experimented with gtools yet, so I don’t know if this is feasible (some R macro examples can be found here).
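One possible base-R starting point is substitute(), which can re-map symbols in a quoted expression. This is only a sketch (let_sketch is a hypothetical helper, not part of any package), and it handles ordinary evaluation; whether it survives dplyr’s non-standard evaluation machinery is exactly the open question above:

```r
# Sketch of a let-like name-to-name binding using base R's substitute().
# It rewrites symbols in a quoted expression, then evaluates the result.
let_sketch <- function(alias, expr, env = parent.frame()) {
  mapping <- lapply(alias, as.name)  # 'rank' (string) -> rank (symbol)
  rewritten <- do.call(substitute, list(expr, mapping))
  eval(rewritten, env)
}

rank <- c(1, 2)
let_sketch(list(RankColumn = 'rank'),
           quote(RankColumn - 1))
# [1] 0 1
```

The do.call(substitute, ...) idiom is needed because substitute() does not evaluate its first argument; do.call splices the already-quoted expression in directly.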
If somebody has this working I would love to hear about it. The reason I worry is the following: clever meta-programming depends on all of the rest of the code not being clever and not breaking referential transparency. In this case dplyr is using non-standard evaluation to capture variable names, which is typically not referentially transparent. Also, R functions are roughly Lisp fexprs, which signals to me that R’s design may not have a fully clean distinction between functions and macros.
I feel the above methods will make working with parameterized variables in dplyr much easier.
When blogging about statistics and R, it is very useful to differentiate the body text from the R code. I used to manage this by highlighting the code, and pretty-R from Revolution Analytics was a valuable instrument for accomplishing this. However, as you may know, Microsoft acquired that company, and this feature (dressing R code for the web) is no longer available.
After some searching, I found this online syntax highlighter and it seems to work pretty well. Besides, it allows you to select from different styles, and you can choose among many programming languages.
When modeling any phenomenon by including explanatory variables that are highly related to the variable of interest, one question arises: which of the auxiliary variables has the greater influence on the response? I am not writing about significance testing or anything like that. I am just thinking like a researcher who wants to know the ranking of the variables that influence the response, and their related weights.
There are a variety of methods that try to answer that question. The one motivating this post is very simple: isolate units from variables. Assume a linear model with the following structure (for the sake of simplicity, assume only two explanatory variables):
$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
If you assume this model is true and $\beta_i > 0$, then the influence of variable $x_i$ over the response $y$ can be found by isolating measurement units from the variables. One could fit a model over the standardized variables (explanatory and response) and then directly compare the regression coefficients. Another way to do this is by means of the following expression:
$$I(i) = \frac{\beta_i}{sd(\beta_i)} = \beta_i\frac{ sd(x_i)}{sd(y)}$$
For example, let’s consider the following model $y = -500 x_1 + 50 x_2 + \varepsilon$, then the relative importance of the first and second variable is around 500/(500+50) = 0.9, and 50/(500+50) = 0.1, respectively. The following code shows how to perform this simple analysis in R.
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)
model <- lm(y ~ 0 + x1 + x2)
# 1a. Standardized betas
summary(model)$coe[,2]
sd.betas <- summary(model)$coe[,2]
betas <- model$coefficients
imp <- abs(betas)/sd.betas
imp <- imp/sum(imp)
imp
# 1b. Standardized betas
imp1 <- abs(model$coefficients[1] * sd(x1)/sd(y))
imp2 <- abs(model$coefficients[2] * sd(x2)/sd(y))
imp1 / (imp1 + imp2)
imp2 / (imp1 + imp2)
# 2. Standardized variables
model2 <- lm(I(scale(y)) ~ 0 + I(scale(x1)) + I(scale(x2)))
summary(model2)
abs(model2$coefficients)/sum(abs(model2$coefficients))
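As a quick sanity check (a sketch, with a fixed seed chosen here for reproducibility), the standardized-beta importances from the closed-form expression should land near 500/550 ≈ 0.91 and 50/550 ≈ 0.09 for this simulated model:

```r
# Sanity check: with x1, x2 ~ U(0,1) and coefficients -500 and 50, the
# standardized-beta relative importances should be close to 500/550 and 50/550.
set.seed(42)
n  <- 10000
x1 <- runif(n)
x2 <- runif(n)
y  <- -500 * x1 + 50 * x2 + rnorm(n)
fit <- lm(y ~ 0 + x1 + x2)

# I(i) = |beta_i| * sd(x_i) / sd(y), normalized to sum to one
imp <- abs(coef(fit)) * c(sd(x1), sd(x2)) / sd(y)
imp / sum(imp)
# x1 close to 0.91, x2 close to 0.09
```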
A few months ago we launched R course finder, an online directory that helps you to find the right R course quickly. With so many R courses available online, we thought it was a good idea to offer a tool that helps people to compare these courses, before they decide where to spend their valuable time and (sometimes) money.
If you haven’t looked at it yet, go to the R Course Finder now by clicking here.
Last month we added 22 courses to the Course Finder. Currently we are at 140 courses on 14 different online platforms, and 2 offline Learning Institutes.
This month we added two new platforms:
Other than adding these platforms, we also kept expanding the content on platforms already available in the Course Finder. Some new courses added this month that we’re excited about and wanted to highlight:
Foundation of Strategic Business Analytics
This course, offered by Coursera, is part of the ‘Strategic Business Analytics’ specialization. Business analytics is one of the hottest fields to work in these days, and this course, together with the entire specialization, lays out in great detail the skills you need to enter it. It also raises the question of when we are going to merge business analytics and Data Science into one. Data analytics, anyone?
Regression Machine Learning with R
This Udemy course seems to me (and I’ve been working on this database for a while) to be one of the most complete courses on Machine Learning available so far. It also has a good pace that suits those of us who have a solid statistical background and know how to work with R, but want to focus on this topic in more detail.
And, last but not least: If you like R Course Finder, please share this announcement with friends and colleagues using the buttons below.
@drsimonj here to discuss how to conduct k-fold cross validation, with an emphasis on evaluating models supported by David Robinson’s broom package. Full credit also goes to David, as this is a slightly more detailed version of his past post, which I read some time ago and felt like unpacking.
This post assumes you know what k-fold cross validation is. If you want to brush up, here’s a fantastic tutorial from Stanford University professors Trevor Hastie and Rob Tibshirani.
Before worrying about models, we can generate K folds using crossv_kfold from the modelr package. Let’s practice with the mtcars data to keep things simple.
library(modelr)
set.seed(1) # Run to replicate this post
folds <- crossv_kfold(mtcars, k = 5)
folds
#> # A tibble: 5 × 3
#> train test .id
#> <list> <list> <chr>
#> 1 <S3: resample> <S3: resample> 1
#> 2 <S3: resample> <S3: resample> 2
#> 3 <S3: resample> <S3: resample> 3
#> 4 <S3: resample> <S3: resample> 4
#> 5 <S3: resample> <S3: resample> 5
This function takes a data frame and randomly partitions its rows (1 to 32 for mtcars) into k roughly equal groups. We’ve partitioned the row numbers into k = 5 groups. The results are returned as a tibble (data frame) like the one above.
Each cell in the test column contains a resample object, which is an efficient way of referencing a subset of rows in a data frame (see ?resample to learn more). Think of each cell as a reference to the rows of the data frame belonging to each partition. For example, the following tells us that the first partition of the data references rows 5, 9, 17, 20, 27, 28, 29, which accounts for roughly 1 / k of the total data set (7 of the 32 rows).
folds$test[[1]]
#> <resample [7 x 11]> 5, 9, 17, 20, 27, 28, 29
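The random partitioning itself is easy to sketch in base R. This is not how crossv_kfold is implemented internally, and it will not reproduce the exact rows shown above; it just illustrates the idea of splitting row numbers into k roughly equal groups:

```r
# Sketch: split 32 row numbers into k = 5 roughly equal random folds.
set.seed(1)
k <- 5
n <- 32
fold_id <- sample(rep(seq_len(k), length.out = n))  # shuffled fold labels
test_rows <- split(seq_len(n), fold_id)             # row numbers per fold
sapply(test_rows, length)
# five groups of 6 or 7 rows each
```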
Each cell in train also contains a resample object, but referencing the rows in all other partitions. For example, the first train object references all rows except 5, 9, 17, 20, 27, 28, 29:
folds$train[[1]]
#> <resample [25 x 11]> 1, 2, 3, 4, 6, 7, 8, 10, 11, 12, ...
We can now run a model on the data referenced by each train object, and validate the model results on each corresponding partition in test.
Say we’re interested in predicting Miles Per Gallon (mpg) with all other variables. With the whole data set, we’d do this via:
lm(mpg ~ ., data = mtcars)
Instead, we want to run this model on each set of training data (the data referenced in each train cell). We can do this as follows:
library(dplyr)
library(purrr)
folds <- folds %>% mutate(model = map(train, ~ lm(mpg ~ ., data = .)))
folds
#> # A tibble: 5 × 4
#> train test .id model
#> <list> <list> <chr> <list>
#> 1 <S3: resample> <S3: resample> 1 <S3: lm>
#> 2 <S3: resample> <S3: resample> 2 <S3: lm>
#> 3 <S3: resample> <S3: resample> 3 <S3: lm>
#> 4 <S3: resample> <S3: resample> 4 <S3: lm>
#> 5 <S3: resample> <S3: resample> 5 <S3: lm>
- folds %>% mutate(model = ...) adds a new model column to the folds tibble.
- map(train, ...) applies a function to each of the cells in train.
- ~ lm(...) is the regression model applied to each train cell.
- data = . specifies that the data for the regression model will be the data referenced by each train object.
The result is a new model column containing fitted regression models based on each set of train data (i.e., the whole data set excluding each partition).
For example, the model fitted to our first set of training data is:
folds$model[[1]] %>% summary()
#>
#> Call:
#> lm(formula = mpg ~ ., data = .)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.6540 -0.9116 0.0439 0.9520 4.2811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -44.243933 31.884363 -1.388 0.1869
#> cyl 0.844966 1.064141 0.794 0.4404
#> disp 0.016800 0.015984 1.051 0.3110
#> hp 0.004685 0.022741 0.206 0.8398
#> drat 3.950410 1.989177 1.986 0.0670 .
#> wt -4.487007 2.016341 -2.225 0.0430 *
#> qsec 2.327131 1.243095 1.872 0.0822 .
#> vs -3.963492 3.217176 -1.232 0.2382
#> am -0.550804 2.333252 -0.236 0.8168
#> gear 5.476604 2.648708 2.068 0.0577 .
#> carb -1.595979 1.104272 -1.445 0.1704
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.159 on 14 degrees of freedom
#> Multiple R-squared: 0.9092, Adjusted R-squared: 0.8444
#> F-statistic: 14.02 on 10 and 14 DF, p-value: 1.205e-05
The next step is to use each model to predict the outcome variable in the corresponding test data. There are many ways to achieve this. One general approach might be:
folds %>% mutate(predicted = map2(model, test, <FUNCTION_TO_PREDICT_TEST_DATA> ))
map2(model, test, ...) iterates through each model and set of test data in parallel. Given a function for predicting the test data, this would add a predicted column with the predicted results.
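One concrete, if less elegant, option for that placeholder is stats::predict. Here is a base-R sketch of the same train/predict loop over folds (manual indexing instead of modelr’s resample objects, just to show the idea; the fold assignment is my own and will not match the folds above):

```r
# Sketch: manual train/predict over k = 5 folds using only base R.
set.seed(1)
fold_id <- sample(rep(1:5, length.out = nrow(mtcars)))  # assign rows to folds
preds <- lapply(1:5, function(i) {
  train <- mtcars[fold_id != i, ]            # all rows except fold i
  test  <- mtcars[fold_id == i, ]            # held-out fold i
  fit   <- lm(mpg ~ ., data = train)
  predict(fit, newdata = test)               # predictions for held-out rows
})
length(unlist(preds))  # one prediction per row of mtcars
# [1] 32
```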
For many common models, an elegant alternative is to use augment from broom. For regression, augment will take a fitted model and a new data frame, and return a data frame of the predicted results, which is just what we want. Following the above, we can use augment as follows:
library(broom)
folds %>% mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y)))
#> # A tibble: 5 × 5
#> train test .id model predicted
#> <list> <list> <chr> <list> <list>
#> 1 <S3: resample> <S3: resample> 1 <S3: lm> <data.frame [7 × 13]>
#> 2 <S3: resample> <S3: resample> 2 <S3: lm> <data.frame [7 × 13]>
#> 3 <S3: resample> <S3: resample> 3 <S3: lm> <data.frame [6 × 13]>
#> 4 <S3: resample> <S3: resample> 4 <S3: lm> <data.frame [6 × 13]>
#> 5 <S3: resample> <S3: resample> 5 <S3: lm> <data.frame [6 × 13]>
To extract the relevant information from these predicted results, we’ll unnest the data frames thanks to the tidyr package:
library(tidyr)
folds %>%
mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y))) %>%
unnest(predicted)
#> # A tibble: 32 × 14
#> .id mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 2 1 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 3 1 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 4 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 5 1 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 6 1 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 7 1 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 8 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 9 2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 10 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> # ... with 22 more rows, and 2 more variables: .fitted <dbl>,
#> # .se.fit <dbl>
This was to show you the intermediate steps. In practice we can skip the mutate step:
predicted <- folds %>% unnest(map2(model, test, ~ augment(.x, newdata = .y)))
predicted
#> # A tibble: 32 × 14
#> .id mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
#> 2 1 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 3 1 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
#> 4 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 5 1 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 6 1 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 7 1 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
#> 8 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
#> 9 2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 10 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
#> # ... with 22 more rows, and 2 more variables: .fitted <dbl>,
#> # .se.fit <dbl>
We now have a tibble of the test data for each fold (.id = fold number) and the corresponding .fitted, or predicted, values for the outcome variable (mpg) in each case.
We can compute and examine the residuals:
# Compute the residuals
predicted <- predicted %>%
mutate(residual = .fitted - mpg)
# Plot actual v residual values
library(ggplot2)
predicted %>%
ggplot(aes(mpg, residual)) +
geom_hline(yintercept = 0) +
geom_point() +
stat_smooth(method = "loess") +
theme_minimal()
It looks like our models could be overestimating mpg around 20-30 and underestimating higher mpg values. But there are clearly fewer data points at the high end, making prediction difficult there.
We can also use these values to calculate the overall proportion of variance accounted for by each model:
rs <- predicted %>%
group_by(.id) %>%
summarise(
sst = sum((mpg - mean(mpg)) ^ 2), # Sum of Squares Total
sse = sum(residual ^ 2), # Sum of Squares Residual/Error
r.squared = 1 - sse / sst # Proportion of variance accounted for
)
rs
#> # A tibble: 5 × 4
#> .id sst sse r.squared
#> <chr> <dbl> <dbl> <dbl>
#> 1 1 321.5886 249.51370 0.2241214
#> 2 2 127.4371 31.86994 0.7499164
#> 3 3 202.6600 42.19842 0.7917773
#> 4 4 108.4733 50.79684 0.5317113
#> 5 5 277.3283 59.55946 0.7852385
# Overall
mean(rs$r.squared)
#> [1] 0.616553
So, across the folds, the regression models have accounted for an average of 61.66% of the variance of mpg in new, test data.
Plotting these results:
rs %>%
ggplot(aes(r.squared, fill = .id)) +
geom_histogram() +
geom_vline(aes(xintercept = mean(r.squared))) # Overall mean
Although the model performed well on average, it performed pretty poorly when fold 1 was used as test data.
With this new knowledge, we can do something similar to the k = 20 case shown in David’s post. See if you can understand what’s going on here:
set.seed(1)
# Select four variables from the mpg data set in ggplot2
ggplot2::mpg %>% select(year, cyl, drv, hwy) %>%
# Create 20 folds (5% of the data in each partition)
crossv_kfold(k = 20) %>%
# Fit a model to training data
mutate(model = map(train, ~ lm(hwy ~ ., data = .))) %>%
# Unnest predicted values on test data
unnest(map2(model, test, ~ augment(.x, newdata = .y))) %>%
# Compute R-squared values for each partition
group_by(.id) %>%
summarise(
sst = sum((hwy - mean(hwy)) ^ 2),
sse = sum((hwy - .fitted) ^ 2),
r.squared = 1 - sse / sst
) %>%
# Plot
ggplot(aes(r.squared)) +
geom_density() +
geom_vline(aes(xintercept = mean(r.squared))) +
theme_minimal()
Thanks for reading and I hope this was useful for you.
For updates of recent blog posts, follow @drsimonj on Twitter, or email me at drsimonjackson@gmail.com to get in touch.
If you’d like the code that produced this blog, check out the blogR GitHub repository.
I have been teaching the course “Mathematics” for the PhD students at Ca’ Foscari University for a few years. In the lectures I cover some “scattered” material that may prove useful, sooner or later, to develop quantitative models in Economics. One of the weeks of the course is devoted to non-linear dynamics, one-dimensional maps and chaos.
R can be extremely useful to explore the dynamics, compute fixed points or cycles and have a numerical look at the map. The mix of computational techniques and theoretical investigation is widely acknowledged to be fruitful. As Strogatz puts it in “Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering”:
The study of maps is still in its infancy, but exciting progress has been made in the last twenty years, thanks to the growing availability of calculators, then computers, and now computer graphics. Maps are easy and fast to simulate on digital computers where time is inherently discrete. Such computer experiments have revealed a number of unexpected and beautiful patterns…
Given a non-linear map \(f\) you may want to plot the time series of states \(x_{t+1}=f(x_t)\), depict the cobweb diagram and look at fixed points, to begin with. Let \(f\) be the immortal logistic map depending on the parameter \(r\).
f <- function(x,r) r*x*(1-x)
bounce <- function(f, init = 4, n = 10, cobweb = TRUE, timeseries = FALSE, dom = NULL, ...){
  iterates <- NULL
  x0 <- init
  for(t in 1:n){
    x1 <- f(x0, ...)
    iterates[t] <- x1
    x0 <- x1
  }
  if(cobweb && !timeseries){
    if(is.null(dom)){
      a <- range(c(init, iterates))[1] - 0.5
      b <- range(c(init, iterates))[2] + 0.5
    } else {
      a <- dom[1]; b <- dom[2]
    }
    curve(f(x, ...), a, b); grid(col = 1); abline(0, 1, lty = 2)
    lines(c(init, iterates), f(c(init, iterates), ...), t = "s")
    points(c(init, iterates), f(c(init, iterates), ...))
  }
  if(timeseries){
    plot(0:n, c(init, iterates), t = "l")
  }
}
bounce(f,0.1,r=3.7,n=100)
bounce computes \(n\) iterates of the map starting from init and plots a cobweb diagram. The defaults may or may not work for a specific map, but the plotting domain can be supplied if the educated guess fails (here, say, bounce(f,0.1,r=3.7,n=100,dom=c(0,1)) would probably be better).
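Since the cobweb diagram is organized around intersections with the diagonal, it can also help to compute the fixed points. A minimal sketch (the choice r = 3.7 just mirrors the example above):

```r
# A fixed point satisfies f(x) = x; for the logistic map r*x*(1-x) = x
# the non-trivial solution is x* = 1 - 1/r.
f <- function(x, r) r * x * (1 - x)
r <- 3.7
x_star <- uniroot(function(x) f(x, r) - x, interval = c(0.1, 1))$root
c(numeric = x_star, analytic = 1 - 1/r)  # both about 0.7297
```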
bounce(f,0.1,r=3.8282,n=100,timeseries=T)
In a coming post, we’ll use R to draw bifurcation diagrams and Lyapunov exponents.
One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.
This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
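The scale-invariance of correlation is easy to demonstrate directly; here is a small sketch (the thrust values are made up for illustration, only the ratio comes from the text):

```r
# Correlation cannot detect a units mistake: rescaling by a positive
# constant leaves it unchanged. Illustrative numbers, not mission data.
pound_seconds  <- c(10, 25, 40, 55)
newton_seconds <- 1.4881639 * pound_seconds
cor(pound_seconds, newton_seconds)          # 1 (up to floating point)
max(abs(newton_seconds - pound_seconds))    # yet the values disagree badly
```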
The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.
Please read on for a nasty little example.
Consider the following “harmless data frame.”
d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),
actual=c(2,3,1,2,2,3,1,2))
The recommended test for checking the quality of “prediction” relative to “actual” is an F-test (this is the test stats::lm uses). We can run this test directly with sigr (assuming we have installed the package) as follows:
sigr::formatFTest(d,'prediction','actual',format='html')$formatStr
F Test summary: (R^{2}=-16, F(1,6)=-5.6, p=n.s.).
sigr reports an R-squared of -16 (please see here for some discussion of R-squared). This may be confusing, but it correctly communicates that we have no model, and that “prediction” is in fact worse than just using the average (a very traditional null model).
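The reported value can be reproduced by hand from the definition R-squared = 1 - SSE/SST (a quick sketch using the same data frame):

```r
d <- data.frame(prediction = c(0, 0, -1, -2, 0, 0, -1, -2),
                actual     = c(2, 3,  1,  2, 2, 3,  1,  2))
sse <- sum((d$actual - d$prediction)^2)    # 66: error of "prediction"
sst <- sum((d$actual - mean(d$actual))^2)  # 4: error of the mean-only null model
1 - sse/sst                                # -15.5, displayed by sigr as -16
```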
However, cor.test appears to think “prediction” is a usable model:
cor.test(d$prediction,d$actual)
Pearson's product-moment correlation
data: d$prediction and d$actual
t = 1.1547, df = 6, p-value = 0.2921
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3977998 0.8697404
sample estimates:
cor
0.4264014
This is all for a prediction where sum((d$actual-d$prediction)^2)==66, which is larger than sum((d$actual-mean(d$actual))^2)==4. We concentrate on effect measures (such as R-squared) because we can drive the p-values wherever we want just by adding more data rows. Our point is: you are worse off using this model than using the mean value of the actual (2) as your constant predictor. To my mind that is not a good prediction. And lm seems similarly excited about “prediction.”
summary(lm(actual~prediction,data=d))
Call:
lm(formula = actual ~ prediction, data = d)
Residuals:
Min 1Q Median 3Q Max
-0.90909 -0.43182 0.09091 0.52273 0.72727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.2727 0.3521 6.455 0.000655 ***
prediction 0.3636 0.3149 1.155 0.292121
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7385 on 6 degrees of freedom
Multiple R-squared: 0.1818, Adjusted R-squared: 0.04545
F-statistic: 1.333 on 1 and 6 DF, p-value: 0.2921
One reason not to trust the lm result is that it didn’t score the quality of “prediction”; it scored the quality of “0.3636*prediction + 2.2727.” It can be the case that “0.3636*prediction + 2.2727” is in fact a good predictor, but that doesn’t help us if it is “prediction” we are showing to our boss or putting into production. We can try to mitigate this by insisting lm stay closer to the original by turning off the intercept or offset with the “0+” notation. That looks like the following.
summary(lm(actual~0+prediction,data=d))
Call:
lm(formula = actual ~ 0 + prediction, data = d)
Residuals:
Min 1Q Median 3Q Max
0.00 0.00 1.00 2.25 3.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
prediction -1.0000 0.6094 -1.641 0.145
Residual standard error: 1.927 on 7 degrees of freedom
Multiple R-squared: 0.2778, Adjusted R-squared: 0.1746
F-statistic: 2.692 on 1 and 7 DF, p-value: 0.1448
Even the lm(0+) adjusted prediction is bad, as we see below:
d$lmPred <- predict(lm(actual~0+prediction,data=d))
sum((d$actual-d$lmPred)^2)
[1] 26
Yes, lm(0+) found a way to improve the prediction; but the improved prediction is still very bad (worse than using a well-chosen constant). And it is hard to argue that “-prediction” is the same model as “prediction.”
Now sigr is fairly new code, so it is a bit bold saying it is right when it disagrees with the standard methods. However, sigr is right in this case. The standard methods are not so much wrong as different, for two reasons:
1. “cor.test” and “lm %>% summary” are designed to check if any re-scaling of the prediction is in fact good. These are different questions. Using “cor.test” or “lm %>% summary” to test the utility of a potential variable is a good idea; the reprocessing hidden in these tests is consistent with the later use of a variable in a model. Using them to score model results that are supposed to be used directly is wrong.
2. The lm(0+) summaries are designed to see how well lm(0+) is working. This means lm(0+) scores the quality of its output (not its inputs), so it gets credit for flipping the sign on the prediction. Also, it considers the natural null model to be one it can form where there are no variable-driven effects. Since there is no intercept or “dc-term” in these models (caused by the “0+” notation), the grand average is not considered a plausible null model, as it isn’t in the concept space of the modeling situation the lm was presented with. Or, from help(summary.lm):
R^2, the ‘fraction of variance explained by the model’,
R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2),
where y* is the mean of y[i] if there is an intercept and zero otherwise.
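This null-model switch can be verified directly; the numbers below match the summaries shown earlier (a sketch that re-enters the data so it is self-contained):

```r
d <- data.frame(prediction = c(0, 0, -1, -2, 0, 0, -1, -2),
                actual     = c(2, 3,  1,  2, 2, 3,  1,  2))
fit <- lm(actual ~ 0 + prediction, data = d)
sse <- sum(residuals(fit)^2)                  # 26, as computed above
# Null model = zero (what summary.lm uses when there is no intercept):
1 - sse / sum(d$actual^2)                     # 0.2778, the reported R-squared
# Null model = grand mean (the usual definition):
1 - sse / sum((d$actual - mean(d$actual))^2)  # -5.5: worse than the mean
```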
I admit, this is very confusing. But it corresponds to the documentation, and makes sense from a modeling perspective. It is correct. The silent switching of the null model from the average to zero makes sense in the context it is defined in. It does not make sense for testing our prediction, but that is just one more reason to use the proper F-test directly instead of trying to hack “cor.test” or “lm(0+) %>% summary” to calculate it for you.
And that is what sigr is about: standard tests (using R supplied implementations) with a slightly different calling convention to better document intent (which in our case is almost always measuring the quality of a model separately from model construction). It is a new library, so it doesn’t yet have the documentation needed to achieve its goal, but we will eventually get there.
I have prepared a file with four aggregated time series for analysis and forecasting. It can be found on my github repo; the name of the file is DT_4_ind. Of course, I’m using EnerNOC smart meter data again, and the time series were aggregated by industry into four series. The file was created with the package feather (CRAN link), so you will need this package to read the file again. feather is a useful tool for sharing data between R and Python users.
Data manipulation will be done with the data.table package, and visualizations with the ggplot2, plotly and animation packages.
The first step to do some “magic” is to load all of the needed packages.
library(feather)
library(data.table)
library(ggplot2)
library(plotly)
library(animation)
Read the mentioned smart meter data with read_feather into one data.table.
DT <- as.data.table(read_feather("DT_4_ind"))
Let’s plot what I have prepared for you: the aggregated time series of electricity consumption by industry.
ggplot(data = DT, aes(x = date, y = value)) +
geom_line() +
facet_grid(type ~ ., scales = "free_y") +
theme(panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey90"),
panel.grid.major = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
strip.text = element_text(size = 9, face = "bold")) +
labs(x = "Date", y = "Load (kW)")
An interesting fact is that the consumption of the industry Food Sales & Storage isn’t changing during holidays as much as others.
The aim of multiple linear regression is to model a dependent variable (output) by independent variables (inputs). Another goal can be to analyze the influence (correlation) of the independent variables on the dependent variable. As in the previous post, we want to forecast consumption one week ahead, so the regression model must capture the weekly pattern (seasonality). The inputs will be two types of seasonal dummy variables: daily (\( d_1, \dots, d_{48} \)) and weekly (\( w_1, \dots, w_6 \)). A daily dummy is \( 1 \) when the consumption is measured in the corresponding period of the day, otherwise \( 0 \). A weekly dummy is \( 1 \) when the consumption is measured on the corresponding day of the week, otherwise \( 0 \).
The regression model can be formally written as:

\( y_i = \beta_1 d_{i1} + \dots + \beta_{48} d_{i48} + \beta_{49} w_{i1} + \dots + \beta_{54} w_{i6} + \varepsilon_i, \)

where \( y_i \) is the electricity consumption at time \( i \), with \( i = 1, \dots, N \). \( \beta_1, \dots, \beta_{54} \) are the regression coefficients, which we want to estimate. \( d_1, \dots, d_{48} \) and \( w_1, \dots, w_6 \) are the dummy independent variables. \( \varepsilon_i \) is a random error. The assumption on the errors is that they are independently identically distributed (i.i.d.) with distribution \( \varepsilon \sim N(0,~\sigma^2) \).
Estimation of the regression coefficients is done by ordinary least squares (OLS). So if we write our model in matrix form as:

\( Y = \mathbf{X}\beta + \varepsilon, \)

where \( Y \) is a vector of length \( N \), \( \beta \) is a vector of length \( p \) and \( \mathbf{X} \) is a matrix of size \( N\times p \), then the OLS estimate of \( \beta \) is:

\( \hat{\beta} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}Y. \)
You may be asking where the independent variable \( w_7 \) or the intercept \( \beta_0 \) is. We must omit them due to collinearity of the independent variables: the model matrix \( \mathbf{X} \) must be regular, not singular. Moreover, an intercept makes little sense in this time series regression model, because we do not usually consider time 0.
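A small sketch makes the collinearity concrete (toy sizes: a “day” of 3 periods and 2 weekdays, only for readability):

```r
# Toy example: 2 days, 3 measurement periods per day.
day  <- factor(rep(1:3, times = 2))           # daily dummies d_1..d_3
week <- factor(rep(c("Mon", "Tue"), each = 3))
# Keeping ALL daily and ALL weekly dummies makes the columns dependent:
# d_1 + d_2 + d_3 = w_Mon + w_Tue = a column of ones.
X_all <- cbind(sapply(levels(day),  function(l) as.integer(day  == l)),
               sapply(levels(week), function(l) as.integer(week == l)))
qr(X_all)$rank                         # 4 < 5 columns, so X_all is singular
# With factors, model.matrix drops one redundant level automatically:
ncol(model.matrix(~ 0 + day + week))   # 4 columns, full rank
```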
Let’s finally do some regression analysis of our proposed model. First, prepare DT for work with a regression model: transform the weekday characters to integers.
DT[, week_num := as.integer(as.factor(DT[, week]))]
Store the industry types, dates, weekdays and the period length in variables.
n_type <- unique(DT[, type])
n_date <- unique(DT[, date])
n_weekdays <- unique(DT[, week])
period <- 48
Let’s look at a chunk of the consumption data and do regression analysis on it. I have picked the aggregate consumption of education (school) buildings over two weeks. Store it in the variable data_r and plot it.
data_r <- DT[(type == n_type[2] & date %in% n_date[57:70])]
ggplot(data_r, aes(date_time, value)) +
geom_line() +
theme(panel.border = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey90"),
panel.grid.major = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold")) +
labs(x = "Date", y = "Load (kW)")
Let’s now create the mentioned independent dummy variables and store all of them in matrix_train. When using the method lm in R, it’s simple to define dummy variables in one vector: just use as.factor on a vector of class labels. We don’t need to create 48 vectors for the daily dummy variables and 6 vectors for the weekly ones.
N <- nrow(data_r)
window <- N / period # number of days in the train set
# 1, ..., period, 1, ..., period - and so on for the daily season
# using feature "week_num" for the weekly season
matrix_train <- data.table(Load = data_r[, value],
Daily = as.factor(rep(1:period, window)),
Weekly = as.factor(data_r[, week_num]))
Let’s create our first multiple linear model with the function lm. lm automatically adds an intercept to the linear model, so we must suppress it with 0. Also, we can put all the variables into the model simply with the dot (.).
lm_m_1 <- lm(Load ~ 0 + ., data = matrix_train)
smmr_1 <- summary(lm_m_1)
paste("R-squared: ",
round(smmr_1$r.squared, 3),
", p-value of F test: ",
1-pf(smmr_1$fstatistic[1], smmr_1$fstatistic[2], smmr_1$fstatistic[3]))
## [1] "R-squared: 0.955 , p-value of F test: 0"
You would see a nice summary of the linear model here, but I omit it because of its length (we have 54 variables). So I’m showing only the two most important statistics: R-squared and the p-value of the F-statistic of the goodness of fit. They seem pretty good.
Let’s look at the fitted values.
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = lm_m_1$fitted.values, date_time = data_r[, date_time])))
datas[, type := rep(c("Real", "Fitted"), each = nrow(data_r))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from MLR")
That’s horrible! We are missing something here.
Look at the fitted values vs. residuals now.
ggplot(data = data.table(Fitted_values = lm_m_1$fitted.values,
Residuals = lm_m_1$residuals),
aes(Fitted_values, Residuals)) +
geom_point(size = 1.7) +
geom_smooth() +
geom_hline(yintercept = 0, color = "red", size = 1) +
labs(title = "Fitted values vs Residuals")
This is a typical example of heteroskedasticity: the occurrence of nonconstant variance of the residuals in a regression model. Linear regression assumes that the residuals come from the \( N(0,~\sigma^2) \) distribution and are i.i.d. In other words, the residuals must be symmetrically distributed around zero.
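Beyond the plot, a rough numerical check for heteroskedasticity (a base-R sketch on synthetic data, not on the post’s data set) is to regress the squared residuals on the predictor, in the spirit of the Breusch-Pagan test:

```r
set.seed(42)
x <- 1:200
y <- 2 * x + rnorm(200, sd = x)            # noise deliberately grows with x
fit <- lm(y ~ x)
aux <- summary(lm(residuals(fit)^2 ~ x))   # auxiliary regression
aux$coefficients["x", "Pr(>|t|)"]          # small p-value -> nonconstant variance
```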
Let’s look at further proof that our residuals are not normal. We can use a normal Q-Q plot here. I’m using the function from this stackoverflow question to plot it with ggplot2.
ggQQ <- function(lm){
# extract standardized residuals from the fit
d <- data.frame(std.resid = rstandard(lm))
# calculate 1Q/4Q line
y <- quantile(d$std.resid[!is.na(d$std.resid)], c(0.25, 0.75))
x <- qnorm(c(0.25, 0.75))
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
p <- ggplot(data = d, aes(sample = std.resid)) +
stat_qq(shape = 1, size = 3) + # open circles
labs(title = "Normal Q-Q", # plot title
x = "Theoretical Quantiles", # x-axis label
y = "Standardized Residuals") + # y-axis label
geom_abline(slope = slope, intercept = int, linetype = "dashed",
size = 1, col = "firebrick1") # dashed reference line
return(p)
}
ggQQ(lm_m_1)
Of course, it is absolutely not normal (the points should lie close to the red line).
What can we do now? Use other regression methods (especially nonlinear ones)? No. Let’s think about why this happened. We saw in the fitted values that the measurements within a day were shifted by a constant (the estimated coefficient of the weekly variable), but the behavior during the day wasn’t captured. We need to capture this behavior, because weekends in particular behave completely differently. This can be handled by adding interactions between the daily and weekly dummy variables to the regression model, i.e., we multiply every daily variable with every weekly one. Again, to avoid collinearity and a singular model matrix, one daily dummy variable must be omitted (for example \( d_{1} \)); fortunately, this is done automatically by lm when we use factors as variables.
Let’s train a second linear model. The interactions should solve the problem that we saw in the plot of the fitted values.
lm_m_2 <- lm(Load ~ 0 + Daily + Weekly + Daily:Weekly,
data = matrix_train)
Look at the R-squared of the previous model and of the new one with interactions:
c(Previous = summary(lm_m_1)$r.squared, New = summary(lm_m_2)$r.squared)
## Previous New
## 0.9547247 0.9989725
R-squared seems better.
Look at the comparison of the residuals of the two fitted models, using an interactive plotly plot here.
ggplot(data.table(Residuals = c(lm_m_1$residuals, lm_m_2$residuals),
Type = c(rep("MLR - simple", nrow(data_r)),
rep("MLR with interactions", nrow(data_r)))),
aes(Type, Residuals, fill = Type)) +
geom_boxplot()
ggplotly()
This is much better than the previous model; it seems that the interactions are working. Let’s prove it with a sequence of three plots: fitted values, fit vs. residuals, and the Q-Q plot.
datas <- rbindlist(list(data_r[, .(value, date_time)],
data.table(value = lm_m_2$fitted.values, date_time = data_r[, date_time])))
datas[, type := rep(c("Real", "Fitted"), each = nrow(data_r))]
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
theme_bw() +
labs(x = "Time", y = "Load (kW)",
title = "Fit from MLR")
ggplot(data = data.table(Fitted_values = lm_m_2$fitted.values,
Residuals = lm_m_2$residuals),
aes(Fitted_values, Residuals)) +
geom_point(size = 1.7) +
geom_hline(yintercept = 0, color = "red", size = 1) +
labs(title = "Fitted values vs Residuals")
ggQQ(lm_m_2)
Everything seems much better than in the previous model. The fitted values seem almost perfect.
I also tried adding a linear trend to boost this model, but it did not help (it wasn’t significant). So let’s go ahead and forecast consumption with this model.
Again, I build a function (as in the previous post) to return the forecast for one week ahead, so we can then simply compare with the STL+ARIMA method (which was better than STL+ETS). The arguments of this function are just data and set_of_date, so it’s easy to use. Let’s add everything needed to the function predWeekReg to create a regression model and a forecast.
predWeekReg <- function(data, set_of_date){
# Subsetting the dataset by dates
data_train <- data[date %in% set_of_date]
N <- nrow(data_train)
window <- N / period # number of days in the train set
# 1, ..., period, 1, ..., period - and so on for the daily season
# Using feature "week_num" for the weekly season
matrix_train <- data.table(Load = data_train[, value],
Daily = as.factor(rep(1:period, window)),
Weekly = as.factor(data_train[, week_num]))
# Creation of the model
lm_m <- lm(Load ~ 0 + Daily + Weekly + Daily:Weekly, data = matrix_train)
# Creation of the forecast for one week ahead
pred_week <- predict(lm_m, matrix_train[1:(7*period), -1, with = FALSE])
return(as.vector(pred_week))
}
Define MAPE (Mean Absolute Percentage Error) for evaluation of our forecast.
mape <- function(real, pred){
return(100 * mean(abs((real - pred)/real)))
}
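A quick sanity check of the error measure (the function is repeated so the snippet is self-contained; the numbers are made up):

```r
mape <- function(real, pred){
  return(100 * mean(abs((real - pred)/real)))
}
# A 10% error on each of two points gives a MAPE of 10:
mape(real = c(100, 200), pred = c(110, 180))
```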
Now we are ready to produce forecasts. I set the training set length to two weeks (determined experimentally). In the experiments, the whole data set of one year is used, so forecasts for 50 weeks will be produced, with a sliding window approach for training. Let’s produce (compute) the forecast for every type of industry (4) to see the differences between them:
n_weeks <- floor(length(n_date)/7) - 2
# Forecasts
lm_pred_weeks_1 <- sapply(0:(n_weeks-1), function(i)
predWeekReg(DT[type == n_type[1]], n_date[((i*7)+1):((i*7)+7*2)]))
lm_pred_weeks_2 <- sapply(0:(n_weeks-1), function(i)
predWeekReg(DT[type == n_type[2]], n_date[((i*7)+1):((i*7)+7*2)]))
lm_pred_weeks_3 <- sapply(0:(n_weeks-1), function(i)
predWeekReg(DT[type == n_type[3]], n_date[((i*7)+1):((i*7)+7*2)]))
lm_pred_weeks_4 <- sapply(0:(n_weeks-1), function(i)
predWeekReg(DT[type == n_type[4]], n_date[((i*7)+1):((i*7)+7*2)]))
# Evaluation (computation of errors)
lm_err_mape_1 <- sapply(0:(n_weeks-1), function(i)
mape(DT[(type == n_type[1] & date %in% n_date[(15+(i*7)):(21+(i*7))]), value],
lm_pred_weeks_1[, i+1]))
lm_err_mape_2 <- sapply(0:(n_weeks-1), function(i)
mape(DT[(type == n_type[2] & date %in% n_date[(15+(i*7)):(21+(i*7))]), value],
lm_pred_weeks_2[, i+1]))
lm_err_mape_3 <- sapply(0:(n_weeks-1), function(i)
mape(DT[(type == n_type[3] & date %in% n_date[(15+(i*7)):(21+(i*7))]), value],
lm_pred_weeks_3[, i+1]))
lm_err_mape_4 <- sapply(0:(n_weeks-1), function(i)
mape(DT[(type == n_type[4] & date %in% n_date[(15+(i*7)):(21+(i*7))]), value],
lm_pred_weeks_4[, i+1]))
Similarly, you can do this with the function predWeek from the previous post. I used the STL+ARIMA method to compare with the MLR with interactions. Here is the plotly plot of the MAPEs:
For every industry MLR was more accurate than STL+ARIMA, so our basic regression method is working very well for double seasonal time series.
I have created 4 (IMHO) interesting GIFs with the package animation to show the whole forecast for the year. I have done it this way for each of the four industries:
datas <- data.table(value = c(as.vector(lm_pred_weeks_1),
DT[(type == n_type[1]) & (date %in% n_date[-c(1:14,365)]), value]),
date_time = c(rep(DT[-c(1:(14*48), (17473:nrow(DT))), date_time], 2)),
type = c(rep("MLR", nrow(lm_pred_weeks_1)*ncol(lm_pred_weeks_1)),
rep("Real", nrow(lm_pred_weeks_1)*ncol(lm_pred_weeks_1))),
week = c(rep(1:50, each = 336), rep(1:50, each = 336)))
saveGIF({
oopt = ani.options(interval = 0.9, nmax = 50)
for(i in 1:ani.options("nmax")){
print(ggplot(data = datas[week == i], aes(date_time, value, group = type, colour = type)) +
geom_line(size = 0.8) +
scale_y_continuous(limits = c(min(datas[, value]), max(datas[, value]))) +
theme(panel.border = element_blank(), panel.background = element_blank(),
panel.grid.minor = element_line(colour = "grey90"),
panel.grid.major = element_line(colour = "grey90"),
panel.grid.major.x = element_line(colour = "grey90"),
title = element_text(size = 15),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold")) +
labs(x = "Time", y = "Load (kW)",
title = paste("Forecast of MLR (", n_type[1], "); ", "week: ", i, "; MAPE: ",
round(lm_err_mape_1[i], 2), "%", sep = "")))
ani.pause()
}}, movie.name = "industry_1.gif", ani.height = 450, ani.width = 750)
Here are the created GIFs:
In these animations we can see that the behavior of electricity consumption can be very stochastic and that many external factors (holidays, weather etc.) influence it, so forecasting it is a challenging task.
In my next post, I will continue with the introduction of regression methods, this time with GAM (Generalized Additive Model).
Script for the creation of this whole post can be found on my github repo.
I will continue in describing forecast methods, which are suitable to seasonal (or multi-seasonal) time series. In the previous post smart meter data of electricity consumption were introduced and a forecast method using similar day approach was proposed. ARIMA and exponential smoothing (common methods of time series analysis) were used as forecast methods. The biggest disadvantage of this approach was that we created multiple models at once for different days in the week, which is computationally expensive and it can be a little bit unclear. Regression methods are more suitable for multi-seasonal times series. They can handle multiple seasonalities through independent variables (inputs of a model), so just one model is needed. In this post, I will introduce the most basic regression method – multiple linear regression (MLR).