How does the EMMA package simplify imputation of missing data in mlr3pipelines?
Missing data in ML
There is no machine learning without data, and real-world data usually looks very different from the examples in academic tutorials. One of the most common difficulties when working with datasets is dealing with missing values. Some machine learning algorithms (e.g. decision trees) can handle missing values by default, but many others need additional preprocessing.
Imputation of missing values is probably the most popular such preprocessing method and is itself a broad topic for research and software development. Imputation methods can be divided into simple ones (e.g. replacing missing values with the mean or mode) and advanced ones (e.g. multiple imputation, regression, KNN).
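To make the "simple" end of that spectrum concrete, here is a minimal base R sketch of mean and mode replacement. The toy data frame and the helper names `impute_mean` and `impute_mode` are our own illustrations and are not part of EMMA.

```r
# Toy data frame with missing values in a numeric and a categorical column
df <- data.frame(
  age   = c(23, NA, 35, 41, NA),
  color = factor(c("red", "blue", NA, "blue", "blue"))
)

# Mean replacement for a numeric vector (hypothetical helper)
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Mode (most frequent level) replacement for a factor (hypothetical helper)
impute_mode <- function(x) {
  counts <- table(x)  # table() drops NA values by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

df$age   <- impute_mean(df$age)
df$color <- impute_mode(df$color)

# Count the remaining missing values
sum(is.na(df))
```

Advanced methods replace the single summary statistic with a model fitted on the other columns, which is where the interface differences between packages start to matter.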
Why do we need EMMA?
There are many great imputation methods implemented in R packages. However, it is often quite uncomfortable to use them as part of an ML model. Why? They require careful usage: adjusting many parameters, handling different input requirements, and dealing with possible failures of the algorithms. As they all have different interfaces, it is practically impossible to use them out-of-the-box in ML models. By creating the EMMA package, we hope to eliminate these problems and deliver a practical, user-friendly imputation tool.
Imputation methods available in EMMA
The EMMA package brings together a wide spectrum of imputation methods available in R packages, wrapped as mlr3pipelines operators. The packages used in EMMA are mice, Amelia, missMDA, VIM, softImpute, missRanger, and missForest. Each wrapped package is used to its full potential and, where possible, with hyperparameter optimization. This means EMMA offers methods such as Predictive Mean Matching, K-Nearest Neighbours, and Random Forest imputation. On top of that, EMMA also includes simple imputation methods. Probably the most important feature of the package is an implementation of mice that follows the standard machine learning approach: a mice model trained on the training set is used to impute missing data in the test set. Why is this ML approach not obvious?
How does it work?
As we mentioned earlier, imputation using advanced techniques can cause problems in ML. Let’s assume that we want to treat imputation as an integral part of the model. It should then be a model trained on the training set and afterwards used to impute missing values in the test set. Unfortunately, not all advanced imputation methods can follow this scheme (let’s call it approach A).
For example, there is no training step in the SoftImpute or K-NN methods. Moreover, most of the other packages that could in principle be used in approach A are not prepared for it. Because of that, we decided to use what we call approach B, which differs from the typical ML approach. In approach B, we run the imputation separately on the training set and on the test set.
Because of these different approaches, EMMA includes simple methods too. We are aware of the problems with approach B; for example, it is impossible to test on a single observation. We are working on extending the number of methods available in approach A. On the other hand, in some cases approach B can be superior, for instance when we train on historical data but test on present data. Our main goal is to simplify and standardize the imputation interface across different packages. We do that by building on the existing interface of mlr3pipelines, which EMMA essentially extends. As a result, EMMA can be used on its own or like any other pipe operator from that package. Below, you can find an example of usage in practice.
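The difference between the two approaches can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: mean imputation stands in for an advanced method, and the helpers `fit_mean_imputer` and `apply_mean_imputer` are hypothetical names, not EMMA functions.

```r
# Toy train and test sets, each with one missing value
train <- data.frame(x = c(1, 2, NA, 3))  # mean of observed values: 2
test  <- data.frame(x = c(10, NA, 20))   # mean of observed values: 15

# "Training" step: learn the value used for imputation (hypothetical helper)
fit_mean_imputer <- function(x) mean(x, na.rm = TRUE)

# "Prediction" step: fill missing entries with a given value (hypothetical helper)
apply_mean_imputer <- function(x, value) {
  x[is.na(x)] <- value
  x
}

# Approach A: fit on the training set, then reuse the fitted value on the test set
train_mean <- fit_mean_imputer(train$x)
test_a <- apply_mean_imputer(test$x, train_mean)

# Approach B: fit and impute on the test set independently of the training set
test_b <- apply_mean_imputer(test$x, fit_mean_imputer(test$x))

test_a
test_b

# With approach B a single incomplete observation leaves nothing to fit on
single <- c(NA_real_)
fit_mean_imputer(single)
```

In this toy case approach A fills the gap with a value learned from the training data, approach B uses only the test data, and the last call returns `NaN` because there are no observed values to average.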
Let’s use EMMA!
First, install EMMA from GitHub using devtools.
devtools::install_github("https://github.com/ModelOriented/EMMA", subdir = "/EMMA_package/EMMA")
library(EMMA)
Now we will take a task from OpenML that contains missing values and impute them with mice, wrapped as a pipe operator by EMMA.
library(mlr3oml)
# Download OpenML task 54 and convert it to an mlr3 task
task_oml <- OMLTask$new(54)
task <- task_oml$task
# Optionally, visualize missing values with the naniar package
library(naniar)
gg_miss_var(task$data(), show_pct = TRUE)
# Create the mice imputation pipe operator and run it on the task
pipe_imp <- PipeOpMice$new()
imputed_task <- pipe_imp$train(list(task))$output
# Check the output task: no missing values left
sum(imputed_task$missings())
Now we will use EMMA in a simple model. Note that the learner can handle categorical variables by default but cannot handle missing values. We use the same task and imputer as above.
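# mlr3 provides lrn(), resample() and rsmp(); mlr3pipelines provides %>>% and GraphLearner
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
# Build the model with pipelines: imputation followed by a random forest
pipe_model <- lrn("classif.ranger")
graph <- pipe_imp %>>% pipe_model
graph_learner <- GraphLearner$new(graph)
# Evaluate with a holdout split
rr <- resample(task, graph_learner, rsmp("holdout"))
# Print the resampling result
rr$print()
# Print the achieved score
rr$score()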
That was imputation with EMMA!
We are continuously working on improving and extending the package.
Be aware that, depending on the dataset, imputation with a particular method is not always possible.
If you are interested in other posts about explainable, fair, and responsible ML, follow #ResponsibleML on Medium.