Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

@drsimonj here to discuss how to conduct k-fold cross validation, with an emphasis on evaluating models supported by David Robinson’s broom package. Full credit also goes to David, as this is a slightly more detailed version of his past post, which I read some time ago and felt like unpacking.

## Assumed knowledge: K-fold Cross validation

This post assumes you know what k-fold cross validation is. If you want to brush up, here’s a fantastic tutorial from Stanford University professors Trevor Hastie and Rob Tibshirani.

## Creating folds

Before worrying about models, we can generate K folds using crossv_kfold from the modelr package. Let’s practice with the mtcars data to keep things simple.

library(modelr)
set.seed(1)  # Run to replicate this post
folds <- crossv_kfold(mtcars, k = 5)
folds
#> # A tibble: 5 × 3
#>            train           test   .id
#>
#> 1       1
#> 2       2
#> 3       3
#> 4       4
#> 5       5


This function takes a data frame and randomly partitions it’s rows (1 to 32 for mtcars) into k roughly equal groups. We’ve partitioned the row numbers into k = 5 groups. The results are returned as a tibble (data frame) like the one above.

Each cell in the test column contains a resample object, which is an efficient way of referencing a subset of rows in a data frame (?resample to learn more). Think of each cell as a reference to the rows of the data frame belonging to each partition. For example, the following tells us that the first partition of the data references rows 5, 9, 17, 20, 27, 28, 29, which accounts for roughly 1 / k of the total data set (7 of the 32 rows).

folds$test[[1]] #> 5, 9, 17, 20, 27, 28, 29  Each cell in train also contains a resample object, but referencing the rows in all other partitions. For example, the first train object references all rows except 5, 9, 17, 20, 27, 28, 29: folds$train[[1]]
#>  1, 2, 3, 4, 6, 7, 8, 10, 11, 12, ...


We can now run a model on the data referenced by each train object, and validate the model results on each corresponding partition in test.

## Fitting models to training data

Say we’re interested in predicting Miles Per Gallon (mpg) with all other variables. With the whole data set, we’d do this via:

lm(mpg ~ ., data = mtcars)


Instead, we want to run this model on each set of training data (data referenced in each train cell). We can do this as follows:

library(dplyr)
library(purrr)

folds <- folds %>% mutate(model = map(train, ~ lm(mpg ~ ., data = .)))
folds
#> # A tibble: 5 × 4
#>            train           test   .id    model
#>
#> 1       1
#> 2       2
#> 3       3
#> 4       4
#> 5       5

• folds %>% mutate(model = ...) is adding a new model column to the folds tibble.
• map(train, ...) is applying a function to each of the cells in train
• ~ lm(...) is the regression model applied to each train cell.
• data = . specifies that the data for the regression model will be the data referenced by each train object.

The result is a new model column containing fitted regression models based on each of the train data (i.e., the whole data set excluding each partition).

For example, the model fitted to our first set of training data is:

folds$model[[1]] %>% summary() #> #> Call: #> lm(formula = mpg ~ ., data = .) #> #> Residuals: #> Min 1Q Median 3Q Max #> -3.6540 -0.9116 0.0439 0.9520 4.2811 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) -44.243933 31.884363 -1.388 0.1869 #> cyl 0.844966 1.064141 0.794 0.4404 #> disp 0.016800 0.015984 1.051 0.3110 #> hp 0.004685 0.022741 0.206 0.8398 #> drat 3.950410 1.989177 1.986 0.0670 . #> wt -4.487007 2.016341 -2.225 0.0430 * #> qsec 2.327131 1.243095 1.872 0.0822 . #> vs -3.963492 3.217176 -1.232 0.2382 #> am -0.550804 2.333252 -0.236 0.8168 #> gear 5.476604 2.648708 2.068 0.0577 . #> carb -1.595979 1.104272 -1.445 0.1704 #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 2.159 on 14 degrees of freedom #> Multiple R-squared: 0.9092, Adjusted R-squared: 0.8444 #> F-statistic: 14.02 on 10 and 14 DF, p-value: 1.205e-05  ## Predicting the test data The next step is to use each model for predicting the outcome variable in the corresponding test data. There are many ways to achieve this. One general approach might be: folds %>% mutate(predicted = map2(model, test, ))  map2(model, test, ...) iterates through each model and set of test data in parallel. By referencing these in the function for predicting the test data, this would add a predicted column with the predicted results. For many common models, an elegant alternative is to use augment from broom. For regression, augment will take a fitted model and a new data frame, and return a data frame of the predicted results, which is what we want! Following above, we can use augment as follows: library(broom) folds %>% mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y))) #> # A tibble: 5 × 5 #> train test .id model predicted #> #> 1 1 #> 2 2 #> 3 3 #> 4 4 #> 5 5  To extract the relevant information from these predicted results, we’ll unnest the data frames thanks to the tidyr package: library(tidyr) folds %>% mutate(predicted = map2(model, test, ~ augment(.x, newdata = .y))) %>% unnest(predicted) #> # A tibble: 32 × 14 #> .id mpg cyl disp hp drat wt qsec vs am gear carb #> #> 1 1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 #> 2 1 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> 3 1 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 #> 4 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 #> 5 1 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 #> 6 1 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 #> 7 1 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 #> 8 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> 9 2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> 10 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> # ... with 22 more rows, and 2 more variables: .fitted , #> # .se.fit  This was to show you the intermediate steps. In practice we can skip the mutate step: predicted <- folds %>% unnest(map2(model, test, ~ augment(.x, newdata = .y))) predicted #> # A tibble: 32 × 14 #> .id mpg cyl disp hp drat wt qsec vs am gear carb #> #> 1 1 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 #> 2 1 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> 3 1 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 #> 4 1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 #> 5 1 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 #> 6 1 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 #> 7 1 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 #> 8 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> 9 2 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> 10 2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> # ... with 22 more rows, and 2 more variables: .fitted , #> # .se.fit  We now have a tibble of the test data for each fold (.id = fold number) and the corresponding .fitted, or predicted values for the outcome variable (mpg) in each case. ## Validating the model We can compute and examine the residuals: # Compute the residuals predicted <- predicted %>% mutate(residual = .fitted - mpg) # Plot actual v residual values library(ggplot2) predicted %>% ggplot(aes(mpg, residual)) + geom_hline(yintercept = 0) + geom_point() + stat_smooth(method = "loess") + theme_minimal()  It looks like our models could be overestimating mpg around 20-30 and underestimating higher mpg. But there are clearly fewer data points, making prediction difficult. We can also use these values to calculate the overall proportion of variance accounted for by each model: rs <- predicted %>% group_by(.id) %>% summarise( sst = sum((mpg - mean(mpg)) ^ 2), # Sum of Squares Total sse = sum(residual ^ 2), # Sum of Squares Residual/Error r.squared = 1 - sse / sst # Proportion of variance accounted for ) rs #> # A tibble: 5 × 4 #> .id sst sse r.squared #> #> 1 1 321.5886 249.51370 0.2241214 #> 2 2 127.4371 31.86994 0.7499164 #> 3 3 202.6600 42.19842 0.7917773 #> 4 4 108.4733 50.79684 0.5317113 #> 5 5 277.3283 59.55946 0.7852385 # Overall mean(rs$r.squared)
#> [1] 0.616553


So, across the folds, the regression models have accounted for an average of 61.66% of the variance of mpg in new, test data.

Plotting these results:

rs %>%
ggplot(aes(r.squared, fill  = .id)) +
geom_histogram() +
geom_vline(aes(xintercept = mean(r.squared)))  # Overall mean


Although the model performed well on average, it performed pretty poorly when fold 1 was used as test data.

## All at once

With this new knowledge, we can do something similar to the k = 20 case shown in David’s post. See that you can understand what’s going on here:

set.seed(1)
# Select four variables from the mpg data set in ggplot2
ggplot2::mpg %>% select(year, cyl, drv, hwy) %>%
# Create 20 folds (5% of the data in each partition)
crossv_kfold(k = 20) %>%
# Fit a model to training data
mutate(model = map(train, ~ lm(hwy ~ ., data = .))) %>%
# Unnest predicted values on test data
unnest(map2(model, test, ~ augment(.x, newdata = .y))) %>%
# Compute R-squared values for each partition
group_by(.id) %>%
summarise(
sst = sum((hwy - mean(hwy)) ^ 2),
sse = sum((hwy - .fitted) ^ 2),
r.squared = 1 - sse / sst
) %>%
# Plot
ggplot(aes(r.squared)) +
geom_density() +
geom_vline(aes(xintercept = mean(r.squared))) +
theme_minimal()


## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at [email protected] to get in touch.

If you’d like the code that produced this blog, check out the blogR GitHub repository.