cvms 0.1.0 released on CRAN

July 4, 2019
By

(This article was first published on R, and kindly contributed to R-bloggers)

After a fairly long life on GitHub, my R package, cvms, for cross-validating linear and logistic regression, is finally on CRAN!

With a few additions in the past months, this is a good time to catch you all up on the included functionality. For examples, check out the readme on GitHub!

The main purpose of cvms is to allow researchers to quickly compare their models with cross-validation, with a tidy output containing the relevant metrics. Once the best model has been selected, it can be validated (that is, trained on the entire training set and evaluated on a validation set) with the same set of metrics. cross_validate() and validate() are the main tools for this. Besides the set of evaluation metrics, the results and predictions from each cross-validation iteration are included, allowing further analysis.

The main additions and improvements to cross_validate() are, that we can now run repeated cross-validation by simply specifying a list of fold columns (as created by groupdata2::fold), and that the models can be cross-validated in parallel.

Four new functions have recently been added to cvms:

baseline()

baseline() creates baseline evaluations of the task at hand (either linear regression or binomial logistic regression). Baseline evaluations tell us what performance we should expect if our model was randomly guessing or always predicting the same value. With imbalanced datasets, this information is very important, as we could get seemingly good results with a useless model. Hence, we create a baseline evaluation and compare our models against it during our analysis.

combine_predictors()

combine_predictors() creates model formulas with all* combinations of a list of fixed effects, including two- and three-way interactions. *When including interaction terms, there are some restrictions, as the model formulas have been precomputed to speed up the process. With these restrictions, we can generate up to 259,358 formulas, which seems like enough for most use cases. While combine_predictors() is useful for trying out a lot of combinations of our fixed effects, we should of course be aware that such combinations may not be theoretically meaningful and could do well due to overfitting. Note that you can add random effects to the formulas, and that multiple versions of the same predictor (e.g. transformed and untransformed) can be added as a sublist, in which case model formulas with both versions will be created, without them ever being in the same formula.

select_metrics()

select_metrics() is used to select the evaluation metrics and the model formula columns from the output in cross_validate(). As we have a lot of information in the output, it allows us to focus on the metrics when reporting the models or creating tutorials.

reconstruct_formulas()

reconstruct_formulas() is used to reconstruct the model formulas from the output of cross_validate(). This is useful, if we have cross-validated a long list of models and want to perform repeated cross-validation on the best models. We simply order the results data frame by our selection criterion (or find the nondominated models with rPref::psel) and reconstruct the model formulas for the best models.

Installation

Installation of cvms:

CRAN:
install.packages("cvms")
Development version:
devtools::install_github("ludvigolsen/cvms")

 

Being the first CRAN release, I hope to get feedback that can improve cvms, be it changes or additions. Feel free to open an issue on GitHub or send me a mail at r-pkgs at ludvigolsen.dk

cvms was designed to work well with groupdata2. See the additions and improvements in version 1.1.0 of groupdata2 released along with cvms here.

Indlægget cvms 0.1.0 released on CRAN blev først udgivet på .

To leave a comment for the author, please follow the link and comment on their blog: R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)