# Running cross_validate from cvms in parallel


The cvms package is useful for cross-validating a list of linear and logistic regression model formulas in R. To speed up the process, I’ve added the option to cross-validate the models in parallel. In this post, I will walk you through a simple example and introduce the combine_predictors() function, which generates model formulas by combining a list of fixed effects. We will be using the simple `participant.scores` dataset from cvms.

First, we will install the newest versions of cvms and groupdata2 from GitHub. You will also need the doParallel package.

```
# Install packages
# install.packages(c("devtools", "doParallel"))  # if not already installed
devtools::install_github("ludvigolsen/groupdata2")
devtools::install_github("ludvigolsen/cvms")
```

Then, we attach the packages and set the random seed to 1.

```
# Attach packages
library(cvms) # cross_validate, combine_predictors
library(groupdata2) # fold
library(doParallel) # registerDoParallel
```

```
# Set seed for reproducibility
# Note that R versions < 3.6.0 may give different results
set.seed(1)
```

Now, we will create the folds for cross-validation. This simply adds a factor called .folds to the dataset, containing fold identifiers (e.g. 1,1,1,2,2,3,3,…). We will also ensure that the folds have a similar ratio of the two diagnoses, and that all rows pertaining to a participant are put in the same fold.

```
# Create folds in the dataset
data <- fold(participant.scores, k = 4,
             cat_col = "diagnosis",
             id_col = "participant")
```
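One way to sanity-check the folds is to confirm that no participant ends up in more than one fold. The sketch below uses base R only. Both `check_id_folds()` and the toy data frame are hypothetical and made up for illustration; they are not part of groupdata2. With the real data, you would pass the `data` object created above.

```r
# Hypothetical helper (not part of groupdata2): returns TRUE when every
# ID appears in exactly one fold.
check_id_folds <- function(data, id_col, fold_col = ".folds") {
  # Count the distinct folds each ID occurs in
  folds_per_id <- tapply(
    as.character(data[[fold_col]]),
    as.character(data[[id_col]]),
    function(x) length(unique(x))
  )
  all(folds_per_id == 1)
}

# Toy data for illustration only: three rows per participant
toy <- data.frame(
  participant = rep(c("a", "b", "c", "d"), each = 3),
  .folds = factor(rep(c(1, 2, 1, 2), each = 3))
)

check_id_folds(toy, id_col = "participant")
```

With the folded dataset, `table(data$.folds, data$diagnosis)` gives a quick view of how the two diagnoses are distributed across the folds.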

We will use the combine_predictors() function to generate our model formulas. We supply the list of fixed effects (we will use *age* and *score*) and it combines them with and without interactions. Note that with more than 6 fixed effects, it becomes very slow due to the number of possible combinations. To deal with this, it has options to limit the number of fixed effects per formula, along with the maximum size of included interactions. We will not use those here though.

```
# Generate model formulas with combine_predictors()
models <- combine_predictors(dependent = "diagnosis",
                             fixed_effects = c("age", "score"))
models
```

```
### [1] "diagnosis ~ age" "diagnosis ~ score"
### [3] "diagnosis ~ age * score" "diagnosis ~ age + score"
```
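To get a feel for why the number of formulas grows so quickly, the sketch below builds only the additive (`+`) combinations with base R's `combn()`. This is not how combine_predictors() is implemented, and it ignores interactions entirely, so the counts are only a lower bound on what the real function has to generate. `additive_formulas()` is a hypothetical name used for illustration.

```r
# Hypothetical helper: all additive formulas for a set of fixed effects
additive_formulas <- function(dependent, fixed_effects) {
  # For each subset size k, join every k-sized subset with " + "
  rhs <- unlist(lapply(seq_along(fixed_effects), function(k) {
    combn(fixed_effects, k, FUN = paste, collapse = " + ")
  }))
  paste(dependent, "~", rhs)
}

# The additive formulas for our two fixed effects
additive_formulas("diagnosis", c("age", "score"))

# Number of additive formulas for 2 to 8 fixed effects (2^n - 1 each)
sapply(2:8, function(n) {
  length(additive_formulas("y", paste0("x", seq_len(n))))
})
```

Even without interactions, the count roughly doubles with every added fixed effect, which is why limiting the formula size matters for larger sets of predictors.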

We want to test if running cross_validate() in parallel is faster than running it sequentially. This would be hard to tell with only 4 simple models, so we repeat the model formulas 100 times each.

```
# Repeat formulas 100 times
models_repeated <- rep(models, each = 100)
```

Now we can cross-validate with and without parallelization. We will start *without* it.

```
# Cross-validate the model formulas without parallelization
system.time({cv_1 <- cross_validate(data,
                                    models = models_repeated,
                                    family = "binomial")})
```

```
###    user  system elapsed
###  26.290   0.194  26.595
```

This took **26.595** seconds to run.

For the parallelization, we will use the doParallel package. There are other options out there though.

First, we register the number of CPU cores to use. I will use 4 cores.

```
# Register CPU cores
registerDoParallel(4)
```
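If you are unsure how many cores to register, a common pattern is to detect the available cores and leave one free for the rest of the system. The sketch below assumes that heuristic suits your machine; `n_detected` and `n_cores` are just illustrative variable names. `detectCores()` comes from the base `parallel` package and can return `NA`, so we fall back to a single core in that case.

```r
# Detect the number of cores; may be NA on some platforms
n_detected <- parallel::detectCores()

# Leave one core free, but always use at least one
n_cores <- if (is.na(n_detected)) 1L else max(1L, n_detected - 1L)
n_cores

# Register the workers only if doParallel is installed
if (requireNamespace("doParallel", quietly = TRUE)) {
  doParallel::registerDoParallel(n_cores)
  # When you are completely done with the parallel work,
  # doParallel::stopImplicitCluster() releases the workers again.
}
```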

Then, we simply set parallel to TRUE in cross_validate().

```
# Cross-validate the model formulas with parallelization
system.time({cv_2 <- cross_validate(data,
                                    models = models_repeated,
                                    family = "binomial",
                                    parallel = TRUE)})
```

```
###    user  system elapsed
###  39.274   1.845  10.955
```

This time it took only **10.955** seconds!
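The elapsed (wall-clock) times from the two runs can be turned into a speedup figure with a bit of arithmetic. The sketch below simply plugs in the numbers reported above; the variable names are illustrative. The run comes out roughly 2.4 times faster on 4 cores, or about 61% of the ideal 4x speedup. The `user` time is higher in the parallel run, most likely because it adds up CPU time across the worker processes, so elapsed time is the figure to compare.

```r
# Elapsed times (in seconds) from the two system.time() calls
elapsed_sequential <- 26.595
elapsed_parallel <- 10.955
n_cores <- 4

# How many times faster the parallel run was
speedup <- elapsed_sequential / elapsed_parallel

# Fraction of the ideal (n_cores times) speedup that was achieved
efficiency <- speedup / n_cores

round(speedup, 2)
round(efficiency, 2)
```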

As these formulas are very simple, and the dataset is very small, it’s difficult to estimate how much time the parallelization will save in the real world. If we were cross-validating a lot of larger models on a big dataset, it could be a meaningful option.

In this post, you have learned to run cross_validate() in parallel. This functionality can also be found in validate(), and I have also added it to the new baseline() function, which I will cover in a future post. It creates baseline evaluations, so we have something to compare our awesome models to. Pretty neat!

You have also learned to generate model formulas with combine_predictors().
