Predicting a large and imbalanced data set using the R package tidymodels


Introduction

The easiest way, at least for me, to build and deploy machine learning models is the R package tidymodels, a collection of packages that keeps the steps of a modeling workflow smooth, tightly connected to each other, and easily manageable in a well-structured manner. The core packages contained in tidymodels are:

  • rsample: tools for data splitting and resampling.
  • parsnip: a unified interface to the most common machine learning models.
  • recipes: a unified interface to the most common pre-processing tools for feature engineering.
  • workflows: bundles the workflow steps together.
  • tune: hyperparameter optimization.
  • yardstick: provides the most common performance metrics.
  • broom: converts outputs into user-friendly formats such as tibbles.
  • dials: provides tools for parameter grids.
  • infer: provides tools for statistical inference.

In addition to the packages above, tidymodels also includes some classical packages such as dplyr, ggplot2, purrr, and tibble. For more detail click here.

In order to explore and understand tidymodels thoroughly, we need a noisy dataset with a large number of variables and missing values. Fortunately, I found an open-source dataset that fulfils these requirements and, in addition, is highly imbalanced. The data concerns Scania trucks and can be downloaded from the UCI machine learning repository, along with an extra file describing it.

The target variable relates to the truck's air pressure system (APS), which generates the pressurized air used for various functions in the truck. It has two classes: positive (pos) if a component failure is due to a failure in the APS, and negative (neg) if the failure is not related to the APS. This means we are dealing with a binary classification problem.

Data exploration

The data comes already split into training and testing sets at the source, so let's load the packages we need along with the data.

ssh <- suppressPackageStartupMessages
ssh(library(readr))
ssh(library(caret))
ssh(library(themis))
ssh(library(tidymodels))
train <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv", skip = 20)
test <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_test_set.csv", skip = 20)

Notice that in the raw file the first twenty rows are a mix of text description and empty rows, and the 21st row contains the column names. That is why we set the skip argument to 20; the 21st row is then read as the column names by default (col_names = TRUE).

Summary of the variables

First, let's check the dimensions of the two sets to see what we are dealing with.

dim(train)
[1] 60000   171
dim(test)
[1] 16000   171

The training set has 60000 rows and 171 variables, which is a moderately large dataset. Inspecting it with the usual functions such as summary or str would give heavy and hard-to-read output. A better alternative is to extract, in an aggregated way, only the information needed to build a machine learning model: the variable types, some statistics about their values, the missing values, and so on.

map_chr(train, typeof) %>% 
  tibble() %>% 
  table()
.
character    double 
      170         1 

Strangely, all the variables but one are characters, which contradicts the data description file. To figure out what is going on, we display a few rows and columns.

train[1:5,1:7]
# A tibble: 5 x 7
  class aa_000 ab_000 ac_000     ad_000 ae_000 af_000
  <chr>  <dbl> <chr>  <chr>      <chr>  <chr>  <chr> 
1 neg    76698 na     2130706438 280    0      0     
2 neg    33058 na     0          na     0      0     
3 neg    41040 na     228        100    0      0     
4 neg       12 0      70         66     0      10    
5 neg    60874 na     1368       458    0      0     

The problem is that missing values are encoded in the data as the string na, which read_csv does not recognize as missing; it therefore coerces every variable containing these na values to character type. To fix this we can either go back and set the na argument of read_csv to "na", or replace the missing values by hand as follows.

train[-1] <- train[-1] %>% 
  modify(~replace(., .=="na", NA)) %>%
  modify(., as.double)
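
Alternatively, the same cleanup could have been done at import time by telling read_csv which strings mark missing values. A minimal sketch, re-reading the training file with the same URL and skip value as before:

train <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv",
                  skip = 20,
                  na = c("na", "NA", ""))  # treat the string "na" (plus the usual markers) as missing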

Now let’s check again

map_chr(train, typeof) %>% 
  tibble() %>% 
  table()
.
character    double 
        1       170 

The first column, excluded above, is our target variable class. We should not forget to apply the same transformation to the test set.

test[-1] <- test[-1] %>% 
  modify(~replace(., .=="na", NA)) %>%
  modify(., as.double)

If we applied the summary function to all 170 variables, we would spend a lot of time reading each summary without much gain. Instead, we look for an automated way to get only the information needed to build our model efficiently. To decide whether we should normalize the data, for instance, we display the standard deviations of all the variables in decreasing order.

Note: with tree-based models we need neither to normalize the data nor to convert factors to dummy variables.

map_dbl(train[-1], sd, na.rm=TRUE) %>% 
  tibble(sd = .) %>% 
  arrange(-sd)
# A tibble: 170 x 1
           sd
        <dbl>
 1 794874918.
 2  97484780.
 3  42746746.
 4  40404413.
 5  40404412.
 6  40404411.
 7  11567771.
 8  10886737.
 9  10859905.
10  10859904.
# ... with 160 more rows

The variables differ hugely in scale, which means the data should be normalized for any machine learning model that relies on gradient descent or on distance calculations.

Another thing to check is whether some variables have a small number of unique values and could hence be converted to factor type.

map(train[-1], unique) %>% 
  lengths(.) %>% 
  sort(.) %>% 
  head(5)
cd_000 ch_000 as_000 ef_000 ab_000 
     2      3     22     29     30 

To keep things simple, we consider only the first two as candidates for conversion to factor type.

The first one, cd_000, is constant (a zero-variance variable, since its variance equals zero), and the second one, ch_000, could be converted to a factor with two levels (in both sets), but since it has many missing values we will decide about it later. Notice that we do not apply these transformations here; they will be combined with all the other required transformations, as shown shortly.
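
For illustration only, such a conversion could be written with dplyr as below; it is not run here, and ch_000 will in fact be dropped later because of its many missing values.

train <- train %>% mutate(ch_000 = as.factor(ch_000))  # hypothetical conversion, not applied
test <- test %>% mutate(ch_000 = as.factor(ch_000))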

Missing values

The best way to deal with missing values depends on their number relative to the size of the dataset. If there are few, it is easiest to simply remove the affected rows; if there are many, the better choice is to impute them using one of the common methods designed for this issue.

dim(train[!complete.cases(train),])
[1] 59409   171

As we can see, almost every row (59409 out of 60000) contains at least one missing value. Let's check how the missing values are distributed across the columns.

df <- modify(train[-1], is.na) %>% 
  colSums() %>%
  tibble(names = colnames(train[-1]),missing_values=.) %>% 
  arrange(-missing_values)
  
df
# A tibble: 170 x 2
   names  missing_values
   <chr>           <dbl>
 1 br_000          49264
 2 bq_000          48722
 3 bp_000          47740
 4 bo_000          46333
 5 ab_000          46329
 6 cr_000          46329
 7 bn_000          44009
 8 bm_000          39549
 9 bl_000          27277
10 bk_000          23034
# ... with 160 more rows

I think the best strategy is to first remove the columns that have a large number of missing values and then impute the rest; this reduces both the number of predictors and the number of missing values at once. The following script keeps only the predictors with fewer than 10000 missing values.

names <- modify(train[-1], is.na) %>% 
  colSums() %>%
  tibble(names = colnames(train[-1]), missing_values=.) %>% 
  filter(missing_values < 10000) %>% 
  select(1)
train1 <- train[c("class",names$names)]
test1 <- test[c("class",names$names)]

An important point should be noted here: since imputation methods use information from other columns and/or rows to predict each missing value, the data must be split into training and testing sets before any imputation, to abide by the crucial rule of machine learning that the test data should never be seen by the model during training. Fortunately, our data is already split, so the imputation can be done separately on each set. The imputation itself will be implemented later with the help of the recipes package, where we bundle all the pre-processing steps together. Note: ch_000, discussed above, was removed since it did not meet the required threshold.
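
For completeness, if the data had arrived as a single file, splitting would be the very first step. A minimal sketch with rsample, assuming a hypothetical unsplit data frame full_data and an 80/20 split stratified by the target:

set.seed(123)
data_split <- initial_split(full_data, prop = 0.8, strata = class)  # full_data is hypothetical here
train_part <- training(data_split)
test_part <- testing(data_split)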

Imbalanced data

Another important issue we face with this data is class imbalance.

prop.table(table(train1$class))

       neg        pos 
0.98333333 0.01666667 

This data is highly imbalanced, which lets even the worst machine learning model achieve a very high accuracy rate. In other words, if we use no model at all and simply predict every observation as the largest class (here negative), the accuracy rate will be approximately equal to the proportion of that class (here about 98%), which is a very misleading result. This can be catastrophic when we are mainly interested in the minority class (here positive), as in detecting fraudulent credit cards. If you would like more detail about how to deal with imbalanced data, please check this article.
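
To see how misleading that baseline is, one line is enough: a "model" that always predicts the majority class neg already reaches the accuracy of the neg proportion shown above.

mean(train1$class == "neg")  # about 0.983, with no model at all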

Building the recipe

Our initial model will be a random forest, which is among the most popular models. The first step is to define the model along with its engine, which is the method (or package) used to fit it, and its mode, which takes one of two values: classification or regression. In our case there are two available engines, randomForest and ranger. Notice that it is the parsnip package that provides these settings. For more detail about all the available models click here.

Note: To speed up the computation process we restrict the forest to 100 trees instead of the default 500.

rf <- rand_forest(trees = 100) %>% 
  set_engine("ranger", num.threads=3, seed = 123) %>%
  set_mode("classification")

Most machine learning models require pre-processed data with some feature engineering. Traditionally, base R (and packages such as dplyr and stringr) provides a wide range of functions for almost every kind of feature engineering. However, when many different transformations are needed, they have to be performed separately, and it is cumbersome to repeat the same scripts, for instance for the testing set. The recipes package therefore provides an easy way to combine all the transformations, together with other model-related information such as which predictors to include or which columns are identifiers, into a single object that can be applied to any other subset of the data.

For our case we will apply the following transformations:

  • Imputing the missing values with the median of the corresponding variable, since we have only numeric predictors (for simplicity).
  • Removing variables that have zero variance (i.e., a single unique value).
  • Removing highly correlated predictors using a threshold of 0.8.
  • Normalizing the data (not needed for this model, but we add the step since the recipe will be reused with models that rely on gradient descent or distance calculations).
  • Using the subsampling method SMOTE to create balanced data. Notice that the SMOTE step is provided by the themis package.

To combine all these operations together we call the function recipe.

data_rec <- recipe(class~., data=train1) %>% 
  step_medianimpute(all_predictors() , seed_val = 111) %>% 
  step_zv(all_predictors()) %>% 
  step_corr(all_predictors(), threshold = 0.8) %>% 
  step_normalize(all_predictors()) %>%
  step_smote(class) 

As you can see, everything combines nicely and elegantly. However, this recipe has not transformed anything yet; it only records the formula, the predictors, and the transformations to be applied. This means we can still update the formula or add and remove steps at any time before fitting the model. A very useful feature of a recipe is that it can be applied to any data other than the one used above (train1), provided it has the same variable names. To apply the transformations to the training data, use the prep function and retrieve the result with juice; for any other data, use bake after prep, so that parameters estimated on the training data are reused. For instance, when normalizing, bake uses the predictor means computed from the training data rather than from the testing data. In our case, however, we will bundle everything up to the model-fitting step.
For more detail about all the steps available click here.
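
As a small sketch of that pattern (the same calls are used further down, in the tuning section):

prepped <- prep(data_rec, training = train1)   # estimate the pre-processing parameters on the training set
train_baked <- juice(prepped)                  # retrieve the transformed training data
test_baked <- bake(prepped, new_data = test1)  # apply the same transformations, reusing training-set parameters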

Building the workflow

To organize our workflow in a structured and smoother way, we use the workflows package, which is part of the tidymodels collection.

rf_wf <- workflow() %>% 
  add_model(rf) %>% 
  add_recipe(data_rec)
rf_wf
== Workflow ======================================
Preprocessor: Recipe
Model: rand_forest()

-- Preprocessor ----------------------------------
5 Recipe Steps

* step_medianimpute()
* step_zv()
* step_corr()
* step_normalize()
* step_smote()

-- Model -----------------------------------------
Random Forest Model Specification (classification)

Main Arguments:
  trees = 100

Engine-Specific Arguments:
  num.threads = 3
  seed = 123

Computational engine: ranger 

Random forest model

Now we can run everything at once, the recipe and the model. Notice that here too we can update, add, or remove elements before going ahead and fitting the model, as sketched below.
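
For instance, assuming the update_model and update_recipe helpers from the workflows package, a sketch of such a change (not needed here) could look like this:

rf_200 <- rand_forest(trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification")
rf_wf %>% update_model(rf_200)     # swap in a different model specification
rf_wf %>% update_recipe(data_rec)  # or replace the recipe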

Model training

Everything now is ready to run our model with the default values.

model_rf <- rf_wf %>% 
  fit(data = train1)

We can extract the summary of this model as follows

model_rf %>% pull_workflow_fit()
parsnip model object

Fit time:  49.9s 
Ranger result

Call:
 ranger::ranger(formula = formula, data = data, num.trees = ~100,      num.threads = ~3, seed = ~123, verbose = FALSE, probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  100 
Sample size:                      118000 
Number of independent variables:  95 
Mtry:                             9 
Target node size:                 10 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.003974456 

This model grew 100 trees, randomly sampling 9 candidate predictors (mtry) at each split. With these settings we obtain a very low out-of-bag prediction error (a Brier score of about 0.004). However, be cautious with such a good result, since in practice it may reflect overfitting. One last thing to note in this output is the sample size of 118000: the SMOTE step has oversampled the minority class, so the data actually fed to the model is now balanced.

Model evaluation

The best way to evaluate our model is on the testing set. Notice that yardstick provides a whole bunch of metrics, but let's start with the most popular one for classification problems: accuracy.

model_rf %>% 
  predict( new_data = test1) %>% 
  bind_cols(test1["class"]) %>% 
  accuracy(truth= as.factor(class), .pred_class) 
# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.990
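
As an aside, yardstick can report several metrics at once through a metric set. A sketch adding sensitivity and specificity (which, with the default event level, treat the first factor level neg as the event):

multi_metrics <- metric_set(accuracy, sens, spec)
model_rf %>%
  predict(new_data = test1) %>%
  bind_cols(test1["class"]) %>%
  mutate(class = as.factor(class)) %>%
  multi_metrics(truth = class, estimate = .pred_class)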

With this model we get a high accuracy on the test set, consistent with the very low out-of-bag error above. However, we should not forget that we are dealing with imbalanced data: even though we used a subsampling method (SMOTE here), such methods do not completely solve the issue, they only mitigate it to some degree, which is why so many of them exist. It is therefore better to look at the confusion matrix from the caret package, since it gives more information.

caret::confusionMatrix(as.factor(test1$class), predict(model_rf, new_data = test1)$.pred_class)
Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15535    90
       pos    62   313
                                          
               Accuracy : 0.9905          
                 95% CI : (0.9889, 0.9919)
    No Information Rate : 0.9748          
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.7998          
                                          
 Mcnemar's Test P-Value : 0.02853         
                                          
            Sensitivity : 0.9960          
            Specificity : 0.7767          
         Pos Pred Value : 0.9942          
         Neg Pred Value : 0.8347          
             Prevalence : 0.9748          
         Detection Rate : 0.9709          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8863          
                                          
       'Positive' Class : neg             
                                          

As mentioned above, the specificity, which here measures how well the minority pos class is detected (about 77%), is very low compared with the sensitivity of the majority neg class (about 99%); you can think of this as partial overfitting towards the majority class. So if we care more about the minority class (which is often the case), we have to go back and either tune the model or try another subsampling method.

Model tuning

For model tuning we set some arguments to values other than the defaults, and leave the tuning of the others to the dials package. Let's fix the following argument values:

  • num.trees = 100. The default is 500.
  • num.threads = 3. The default is 1.

And tune the following:

  • mtry = tune(). The default is the square root of the number of variables.
  • min_n = tune(). The default is 1.

First, we define the model with these new arguments.

model_tune <- rand_forest(trees= 100, mtry=tune(), min_n = tune()) %>%
  set_engine("ranger", num.threads=3, seed=123) %>% 
  set_mode("classification")

Since the two arguments mtry and min_n are data dependent, we should at least specify their ranges for the grid search.

grid <- grid_regular(mtry(range = c(9,15)), min_n(range = c(5,40)), levels = 3)
grid
# A tibble: 9 x 2
   mtry min_n
  <int> <int>
1     9     5
2    12     5
3    15     5
4     9    22
5    12    22
6    15    22
7     9    40
8    12    40
9    15    40

By setting levels to 3 we get 9 combinations, and hence 9 models will be trained. The recipe above has steps that should not be repeated over and over during tuning, so we apply it once to the training data to get the transformed data, and we do not forget to apply it to the testing data as well.

train2 <- prep(data_rec) %>% 
  juice()
test2 <- prep(data_rec) %>% 
  bake(test1)

To tune the model we use cross-validation. Since we have a large dataset, we use only 3 folds.

set.seed(111)
fold <- vfold_cv(train2, v = 3, strata = class)

Now we bundle the tuning model specification with a simple formula (the recipe has already been applied to the data above).

tune_wf <- workflow() %>% 
  add_model(model_tune) %>%
  add_formula(class~.)

To fit these models across the folds we use the tune_grid function instead of fit.

tune_rf <- tune_wf %>% 
  tune_grid(resamples = fold, grid = grid)

For classification problems this function uses two metrics by default: accuracy and the area under the ROC curve. We can extract the metric values as follows.

tune_rf %>% collect_metrics()
# A tibble: 18 x 7
    mtry min_n .metric  .estimator  mean     n    std_err
   <int> <int> <chr>    <chr>      <dbl> <int>      <dbl>
 1     9     5 accuracy binary     0.995     3 0.000150  
 2     9     5 roc_auc  binary     1.00      3 0.0000105 
 3     9    22 accuracy binary     0.994     3 0.000170  
 4     9    22 roc_auc  binary     1.00      3 0.0000195 
 5     9    40 accuracy binary     0.993     3 0.000461  
 6     9    40 roc_auc  binary     1.00      3 0.00000961
 7    12     5 accuracy binary     0.995     3 0.000290  
 8    12     5 roc_auc  binary     1.00      3 0.0000159 
 9    12    22 accuracy binary     0.994     3 0.000119  
10    12    22 roc_auc  binary     1.00      3 0.0000257 
11    12    40 accuracy binary     0.993     3 0.000239  
12    12    40 roc_auc  binary     1.00      3 0.00000559
13    15     5 accuracy binary     0.995     3 0.000231  
14    15     5 roc_auc  binary     1.00      3 0.0000162 
15    15    22 accuracy binary     0.994     3 0.000180  
16    15    22 roc_auc  binary     1.00      3 0.0000132 
17    15    40 accuracy binary     0.993     3 0.000274  
18    15    40 roc_auc  binary     1.00      3 0.0000184 

To get the best model we have to choose one of the two metrics, so let’s go ahead with the accuracy rate.

best_param <- 
  tune_rf %>% select_best(metric = "accuracy")
best_param
# A tibble: 1 x 2
   mtry min_n
  <int> <int>
1    12     5

We can now finalize the workflow with the best parameter values.

tune_wf2 <- tune_wf %>% 
  finalize_workflow(best_param)
tune_wf2
== Workflow ======================================
Preprocessor: Formula
Model: rand_forest()

-- Preprocessor ----------------------------------
class ~ .

-- Model -----------------------------------------
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 12
  trees = 100
  min_n = 5

Engine-Specific Arguments:
  num.threads = 3
  seed = 123

Computational engine: ranger 

Now we fit the model with the best parameter values to the entire training data.

best_model <- tune_wf2 %>% 
  fit(train2)
best_model
== Workflow [trained] ============================
Preprocessor: Formula
Model: rand_forest()

-- Preprocessor ----------------------------------
class ~ .

-- Model -----------------------------------------
Ranger result

Call:
 ranger::ranger(formula = formula, data = data, mtry = ~12L, num.trees = ~100,      min.node.size = ~5L, num.threads = ~3, seed = ~123, verbose = FALSE,      probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  100 
Sample size:                      118000 
Number of independent variables:  95 
Mtry:                             12 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.003681062 

Let’s get the confusion matrix

caret::confusionMatrix(as.factor(test2$class), predict(best_model, new_data = test2)$.pred_class)
Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15535    90
       pos    65   310
                                          
               Accuracy : 0.9903          
                 95% CI : (0.9887, 0.9918)
    No Information Rate : 0.975           
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.795           
                                          
 Mcnemar's Test P-Value : 0.05389         
                                          
            Sensitivity : 0.9958          
            Specificity : 0.7750          
         Pos Pred Value : 0.9942          
         Neg Pred Value : 0.8267          
             Prevalence : 0.9750          
         Detection Rate : 0.9709          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.8854          
                                          
       'Positive' Class : neg             
                                          

As we can see, we do not get any improvement in the specificity rate, so let's try another subsampling method, the ROSE method.

rf_rose <- rand_forest(trees = 100, mtry=15, min_n = 5) %>% 
  set_engine("ranger", num.threads=3, seed = 123) %>%
  set_mode("classification")
data_rec2 <- recipe(class~., data=train1) %>% 
  step_medianimpute(all_predictors() , seed_val = 111) %>% 
  step_zv(all_predictors()) %>% 
  step_corr(all_predictors(), threshold = 0.8) %>% 
  step_normalize(all_predictors()) %>%
  step_rose(class) 
rf_rose_wf <- workflow() %>% 
  add_model(rf_rose) %>% 
  add_recipe(data_rec2)
model_rose_rf <- rf_rose_wf %>% 
  fit(data = train1)
caret::confusionMatrix(as.factor(test1$class), predict(model_rose_rf, new_data = test1)$.pred_class)
Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 15328   297
       pos    55   320
                                          
               Accuracy : 0.978           
                 95% CI : (0.9756, 0.9802)
    No Information Rate : 0.9614          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.6345          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.9964          
            Specificity : 0.5186          
         Pos Pred Value : 0.9810          
         Neg Pred Value : 0.8533          
             Prevalence : 0.9614          
         Detection Rate : 0.9580          
   Detection Prevalence : 0.9766          
      Balanced Accuracy : 0.7575          
                                          
       'Positive' Class : neg             
                                          

The ROSE method performs much worse than SMOTE here, since the specificity rate has dropped to about 52%.

Logistic regression model

Logistic regression is another model for data with a binary outcome. As before we use the first recipe, with the SMOTE method.

logit <- logistic_reg() %>% 
  set_engine("glm") %>%
  set_mode("classification")

logit_wf <- workflow() %>% 
  add_model(logit) %>% 
  add_recipe(data_rec)

set.seed(123)
model_logit <- logit_wf %>% 
  fit(data = train1)

caret::confusionMatrix(as.factor(test1$class), predict(model_logit, new_data = test1)$.pred_class)
Confusion Matrix and Statistics

          Reference
Prediction   neg   pos
       neg 14813   812
       pos    27   348
                                        
               Accuracy : 0.9476        
                 95% CI : (0.944, 0.951)
    No Information Rate : 0.9275        
    P-Value [Acc > NIR] : < 2.2e-16     
                                        
                  Kappa : 0.4333        
                                        
 Mcnemar's Test P-Value : < 2.2e-16     
                                        
            Sensitivity : 0.9982        
            Specificity : 0.3000        
         Pos Pred Value : 0.9480        
         Neg Pred Value : 0.9280        
             Prevalence : 0.9275        
         Detection Rate : 0.9258        
   Detection Prevalence : 0.9766        
      Balanced Accuracy : 0.6491        
                                        
       'Positive' Class : neg           
                                        

With this model the rate for the minority class is even worse than with the random forest model: the specificity drops to 30%.

Session information

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_0.0.6  workflows_0.1.1  tune_0.1.0       tibble_2.1.3    
 [5] rsample_0.0.6    purrr_0.3.3      parsnip_0.1.0    infer_0.5.1     
 [9] dials_0.0.6      scales_1.1.0     broom_0.5.5      tidymodels_0.1.0
[13] themis_0.1.1     recipes_0.1.9    dplyr_0.8.3      caret_6.0-85    
[17] ggplot2_3.3.0    lattice_0.20-40  readr_1.3.1     

loaded via a namespace (and not attached):
  [1] mlr_2.17.1           backports_1.1.5      fastmatch_1.1-0     
  [4] tidytext_0.2.3       plyr_1.8.6           igraph_1.2.4.2      
  [7] splines_3.6.3        crosstalk_1.0.0      listenv_0.8.0       
 [10] SnowballC_0.6.0      rstantools_2.0.0     inline_0.3.15       
 [13] digest_0.6.23        foreach_1.4.8        htmltools_0.4.0     
 [16] rsconnect_0.8.16     fansi_0.4.1          magrittr_1.5        
 [19] checkmate_2.0.0      BBmisc_1.11          unbalanced_2.0      
 [22] doParallel_1.0.15    globals_0.12.5       gower_0.2.1         
 [25] matrixStats_0.55.0   xts_0.12-0           hardhat_0.1.2       
 [28] prettyunits_1.1.1    colorspace_1.4-1     xfun_0.12           
 [31] callr_3.4.2          crayon_1.3.4         lme4_1.1-21         
 [34] survival_3.1-8       zoo_1.8-7            iterators_1.0.12    
 [37] glue_1.3.1           gtable_0.3.0         ipred_0.9-9         
 [40] pkgbuild_1.0.6       rstan_2.19.3         miniUI_0.1.1.1      
 [43] Rcpp_1.0.4           xtable_1.8-4         GPfit_1.0-8         
 [46] stats4_3.6.3         lava_1.6.7           StanHeaders_2.21.0-1
 [49] prodlim_2019.11.13   DT_0.12              htmlwidgets_1.5.1   
 [52] threejs_0.3.3        FNN_1.1.3            ellipsis_0.3.0      
 [55] pkgconfig_2.0.3      loo_2.2.0            ParamHelpers_1.14   
 [58] nnet_7.3-13          utf8_1.1.4           tidyselect_1.0.0    
 [61] rlang_0.4.5          DiceDesign_1.8-1     reshape2_1.4.3      
 [64] later_1.0.0          munsell_0.5.0        tools_3.6.3         
 [67] cli_2.0.2            generics_0.0.2       ranger_0.12.1       
 [70] ggridges_0.5.2       evaluate_0.14        stringr_1.4.0       
 [73] fastmap_1.0.1        yaml_2.2.1           ModelMetrics_1.2.2.1
 [76] processx_3.4.2       knitr_1.28           RANN_2.6.1          
 [79] future_1.17.0        nlme_3.1-145         mime_0.9            
 [82] rstanarm_2.19.3      rstudioapi_0.11      tokenizers_0.2.1    
 [85] compiler_3.6.3       bayesplot_1.7.1      shinythemes_1.1.2   
 [88] curl_4.3             e1071_1.7-3          tidyposterior_0.0.2 
 [91] lhs_1.0.2            stringi_1.4.6        ps_1.3.2            
 [94] blogdown_0.18        Matrix_1.2-18        nloptr_1.2.2        
 [97] markdown_1.1         shinyjs_1.1          vctrs_0.2.4         
[100] pillar_1.4.3         lifecycle_0.2.0      furrr_0.1.0         
[103] data.table_1.12.8    httpuv_1.5.2         R6_2.4.1            
[106] bookdown_0.18        promises_1.1.0       gridExtra_2.3       
[109] janeaustenr_0.1.5    codetools_0.2-16     boot_1.3-24         
[112] colourpicker_1.0     MASS_7.3-51.5        gtools_3.8.1        
[115] assertthat_0.2.1     ROSE_0.0-3           withr_2.1.2         
[118] shinystan_2.5.0      parallel_3.6.3       hms_0.5.3           
[121] grid_3.6.3           rpart_4.1-15         timeDate_3043.102   
[124] minqa_1.2.4          tidyr_1.0.0          class_7.3-15        
[127] rmarkdown_2.1        parallelMap_1.5.0    pROC_1.16.1         
[130] tidypredict_0.4.5    shiny_1.4.0          lubridate_1.7.4     
[133] base64enc_0.1-3      dygraphs_1.1.1.6    
