Automatic data types checking in predictive models

[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers.]

The problem: we have data, and we need to create models (xgboost, random forest, regression, etc.). Each of them has its own constraints regarding data types.
Many strange errors appear when creating models just because of the data format.

The new version of funModeling, 1.9.3 (Oct 2019), aims to provide quick and clean assistance with this.

tl;dr;code

We have some messy data and we want to run a random forest. Before running into weird errors, we can check it:

Example 1:

# install.packages("funModeling")
library(funModeling)
library(tidyverse)

# Load data
data=read_delim("https://raw.githubusercontent.com/pablo14/data-integrity/master/messy_data.txt", delim = ';')

# Call the function:
integ_mod_1=data_integrity_model(data = data, model_name = "randomForest")

# Any errors?
integ_mod_1

## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant

Apart from the “one unique value” case, the other errors need to be solved in order to create a random forest.

Algorithms have their own data type restrictions and their own error messages, which makes execution a hard debugging task. data_integrity_model reports such problems with a common error message.

Introduction

data_integrity_model is built on top of the data_integrity function. We talked about it in the post: Fast data exploration for predictive modeling.

It checks:

  • NA
  • Data types (allow non-numeric? allow character?)
  • High cardinality
  • One unique value
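
To make the four checks concrete, here is a minimal base R sketch of what such checks could look like. It is not funModeling's implementation: check_columns and its max_unique argument are hypothetical names made up for this example, and the threshold of 53 is simply borrowed from the randomForest row of the metadata table.

```r
# Illustrative sketch only: NOT how funModeling implements its checks.
# check_columns() and max_unique are hypothetical names for this example.
check_columns <- function(df, max_unique = 53) {
  # Number of distinct non-NA values per column
  n_unique <- vapply(df, function(x) length(unique(x[!is.na(x)])), integer(1))
  # Columns that are character or factor (candidates for high cardinality)
  is_cat <- vapply(df, function(x) is.character(x) || is.factor(x), logical(1))

  list(
    has_na           = names(df)[vapply(df, anyNA, logical(1))],
    is_character     = names(df)[vapply(df, is.character, logical(1))],
    high_cardinality = names(df)[is_cat & n_unique > max_unique],
    one_unique_value = names(df)[n_unique <= 1]
  )
}

# Toy data made up for the example
toy <- data.frame(
  age      = c(63, 41, NA, 57),
  gender   = c("male", "female", "male", NA),
  constant = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

report <- check_columns(toy)
print(report)
```

On the toy data, age and gender are flagged for NA, gender for being character, and constant for having one unique value.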

Supported models

It takes the metadata from a table that is pre-loaded with funModeling:

head(metadata_models)

## # A tibble: 6 x 6
##   name         allow_NA max_unique allow_factor allow_character only_numeric
##   <chr>        <lgl>         <dbl> <lgl>        <lgl>           <lgl>       
## 1 randomForest FALSE            53 TRUE         FALSE           FALSE       
## 2 xgboost      TRUE            Inf FALSE        FALSE           TRUE        
## 3 num_no_na    FALSE           Inf FALSE        FALSE           TRUE        
## 4 no_na        FALSE           Inf TRUE         TRUE            TRUE        
## 5 kmeans       FALSE           Inf TRUE         TRUE            TRUE        
## 6 hclust       FALSE           Inf TRUE         TRUE            TRUE

The idea is that anyone can add the most popular models, or any configuration that is not there yet.
There are some redundancies, but the purpose is to focus on the model, not on the metadata it needs.
This way we don’t have to remember that random forest doesn’t allow NA; we just write randomForest.
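
As a rough illustration of the "name in, constraints out" idea, the sketch below looks up a model's constraints by name. The models_meta table is hand-copied from three rows of the head(metadata_models) output above, and get_constraints is a hypothetical helper, not a funModeling function.

```r
# Hand-copied from three rows of the head(metadata_models) output above;
# this is a stand-in for the example, not the full table.
models_meta <- data.frame(
  name            = c("randomForest", "xgboost", "num_no_na"),
  allow_NA        = c(FALSE, TRUE, FALSE),
  max_unique      = c(53, Inf, Inf),
  allow_factor    = c(TRUE, FALSE, FALSE),
  allow_character = c(FALSE, FALSE, FALSE),
  only_numeric    = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# get_constraints() is a hypothetical helper for this example.
get_constraints <- function(meta, model_name) {
  row <- meta[meta$name == model_name, , drop = FALSE]
  if (nrow(row) != 1) stop("Unknown model name: ", model_name)
  as.list(row)
}

rf <- get_constraints(models_meta, "randomForest")
rf$allow_NA    # FALSE
rf$max_unique  # 53
```

We only have to remember the model name; the restrictions come from the table.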

Some custom configurations:

  • no_na: no NA variables.
  • num_no_na: numeric with no NA (for example, useful when doing deep learning).

Embed in a data flow in production

Some typical questions come up when interviewing candidates. I like these ones: “How do you deal with new data?” or “What do you take into account when you deploy?”

Based on our first example:

integ_mod_1

## 
## ✖ {NA detected} num_vessels_flour, thal, gender
## ✖ {Character detected} gender, has_heart_disease
## ✖ {One unique value} constant

We can check:

integ_mod_1$data_ok

## [1] FALSE

data_ok is a flag that is useful for stopping a process, by raising an error, when anything goes wrong.
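
A minimal sketch of how that flag could guard a production step. stop_if_bad_data is a hypothetical helper, and integ_stub is a stand-in list so the snippet runs without downloading data; the only thing assumed about the real object is the data_ok field shown above.

```r
# Hypothetical helper: halts the pipeline when the integrity check fails.
stop_if_bad_data <- function(integrity) {
  if (!isTRUE(integrity$data_ok)) {
    stop("Data integrity check failed; aborting before model training.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Stand-in for the object returned by data_integrity_model();
# only the data_ok field from the article is assumed here.
integ_stub <- list(data_ok = FALSE)

# In a real flow we would let the error propagate; here we catch it
# just to show the message.
result <- tryCatch(
  stop_if_bad_data(integ_stub),
  error = function(e) conditionMessage(e)
)
print(result)
```

In a real flow we would call stop_if_bad_data(integ_mod_1) right before training or scoring, so bad data never reaches the model.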

More examples

Example 2:

On the mtcars data frame, check if there is any variable with NA:

di2=data_integrity_model(data = mtcars, model_name = "no_na")

# Check:
di2

## ✔ Data model integrity ok!

Good to go?

di2$data_ok

## [1] TRUE

Example 3:

data_integrity_model(data = heart_disease, model_name = "pca")

## 
## ✖ {NA detected} num_vessels_flour, thal
## ✖ {Non-numeric detected} gender, chest_pain, fasting_blood_sugar, resting_electro, thal, exter_angina, has_heart_disease

Example 4:

data_integrity_model(data = iris, model_name = "kmeans")

## 
## ✖ {Non-numeric detected} Species

Any suggestions?

If you come across any cases which aren’t covered here, you are welcome to contribute: funModeling’s github.

How about time series? I treated them as numeric with no NA (model_name = "num_no_na"). You can add any new model by updating the metadata_models table.

And that’s it.


In case you want to understand more about data types and quality, you can check the Data Science Live Book.

Have data fun!

You can find me on LinkedIn & Twitter.
