A few days ago, I released a new version of my R package, groupdata2, on CRAN.
groupdata2 contains a set of functions for grouping data, such as creating balanced partitions and folds for cross-validation.
Version 1.1.0 adds the
balance() function for using up- and downsampling to balance the categories (e.g. classes) in your dataset. The main difference between existing up-/downsampling tools and
balance() is that it has methods for dealing with IDs. If, for instance, our dataset contains a number of participants with multiple measurements (rows) each, we might want to either keep all the rows of a participant or delete all of them, e.g. if we are measuring the effect of time and need all the timesteps for a meaningful analysis.
balance() currently has four methods for dealing with IDs. As it is a new function, I’m very open to ideas for improvements and additional functionality. Feel free to open an issue on github or send me a mail at r-pkgs at ludvigolsen.dk
fold() can now balance the groups (partitions/folds) by a numerical column. In short, the rows are ordered as smallest, largest, second smallest, second largest, etc. (I refer to this as extreme pairing in the documentation), and grouped as pairs. This seems to work pretty well, especially with larger datasets.
fold() can now create multiple unique fold columns at once, e.g. for repeated cross-validation with cvms. As ensuring the uniqueness of the fold columns can take a while, I’ve added the option to run the column comparisons in parallel. You need to register a parallel backend first (E.g. with
The new helper function
differs_from_previous() finds values, or indices of values, that differ from the previous value by some threshold(s). It is related to the
"l_starts" method in
groupdata2 can be installed with: