# Package funModeling: data cleaning, importance variable analysis and model performance

This article was first published on **R - Data Science Heroes Blog**, and kindly contributed to R-bloggers.


Hi there 🙂

This new package – `install.packages("funModeling")` – tries to cover, with simple concepts, common tasks in data science. Written like a *short tutorial*, its focus is on data interpretation and analysis.

Below you’ll find a copy-paste from the package vignette (so you can drink a good coffee while you read it…).

#### Introduction

This package covers common aspects in predictive modeling:

- **Data Cleaning**
- **Variable importance analysis**
- **Assessing model performance**

The main purpose of this package is to teach some predictive modeling, using a practical toolbox of functions and concepts, to people who are starting out in data science – small data and big data alike – with a **special focus** on understanding results and analysis.

#### Part 1: Data cleaning

**Overview**: The quantity of zeros, NAs and unique values, as well as the data type, may lead to a good or a bad model. Here is an approach covering the very first step in data modeling.

```r
## Loading needed libraries
library(funModeling)
data(heart_disease)
```

###### Checking NA, zeros, data type and unique values

```r
my_data_status = df_status(heart_disease)
```

- `q_zeros`: quantity of zeros (`p_zeros`: in percentage)
- `q_na`: quantity of NA (`p_na`: in percentage)
- `type`: factor or numeric
- `unique`: quantity of unique values
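To make these columns concrete, here is a minimal base-R sketch that reproduces the same kind of counts `df_status` reports, on a hypothetical toy data frame (the names `q_zeros`, `p_zeros`, etc. mirror the package output; the implementation below is an illustration, not the package code):

```r
# Toy data frame (hypothetical, for illustration only)
df <- data.frame(
  a = c(0, 0, 1, NA, 2),
  b = factor(c("x", "y", "x", "x", "y"))
)

# Count zeros, NAs, types and unique values per column, as df_status does
q_zeros <- sapply(df, function(x) sum(x == 0, na.rm = TRUE))
p_zeros <- round(100 * q_zeros / nrow(df), 2)
q_na    <- sapply(df, function(x) sum(is.na(x)))
p_na    <- round(100 * q_na / nrow(df), 2)
types   <- sapply(df, function(x) class(x)[1])
uniq    <- sapply(df, function(x) length(unique(na.omit(x))))

data.frame(variable = names(df), q_zeros, p_zeros, q_na, p_na,
           type = types, unique = uniq, row.names = NULL)
```

On this toy data, column `a` has 2 zeros (40%) and 1 NA (20%), which is exactly the kind of summary the real status table gives you per variable.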

###### Why are these metrics important?

- **Zeros**: Variables with **lots of zeros** may not be useful for modeling and, in some cases, may dramatically bias the model.
- **NA**: Several models automatically exclude rows with NA (**random forest**, for example). As a result, the final model can be biased due to several missing rows caused by only one variable. For example, if just one out of 100 variables has 90% of NAs, the model will be trained with only 10% of the original rows.
- **Type**: Some variables are encoded as numbers, but they are codes or categories, and models **don’t handle them** in the same way.
- **Unique**: Factor/categorical variables with a high number of different values (~30) tend to overfit if categories have low representativeness (**decision tree**, for example).

###### Filtering unwanted cases

The function `df_status` takes a data frame and returns a status table that makes it easy to quickly remove unwanted cases.

**Removing variables with high number of NA/zeros**

```r
# Removing variables with 60% of zero values
vars_to_remove = subset(my_data_status, my_data_status$p_zeros > 60)
vars_to_remove["variable"]
```

```r
## Keeping all except vars_to_remove
heart_disease_2 = heart_disease[, !(names(heart_disease) %in% vars_to_remove[, "variable"])]
```

**Ordering data by percentage of zeros**

```r
my_data_status[order(-my_data_status$p_zeros), ]
```

#### Part 2: Variable importance with cross_plot

**Overview**:

- **Analysis purpose**: To identify whether the input variable is a good or bad predictor, through visual analysis.
- **General purpose**: To explain the decision of including – or not – a variable in a model to a *non-analyst* person.

**Constraint:** The target variable must have only 2 values. If it has `NA` values, they will be removed.

*Note: there are many ways of selecting the best variables to build a model; presented here is one more, based on visual analysis.*

###### Example 2.1: Is gender correlated with heart disease?

```r
cross_gender = cross_plot(heart_disease, str_input = "gender", str_target = "has_heart_disease")
```

The last two plots have the same data source, showing the distribution of `has_heart_disease` in terms of `gender`. The one on the left shows percentage values, while the one on the right shows absolute values.

###### How to extract conclusions from the plots? (Short version)

The `gender` variable seems to be a **good predictor**, since the likelihood of having heart disease differs between the female and male groups: **it gives an order to the data**.

###### How to extract conclusions from the plots? (Long version)

**From 1st plot (%):**

- The **likelihood** of having heart disease for males is 55.3%, while for females it is 25.8%.
- The heart disease rate for males **doubles** the rate for females (55.3% vs. 25.8%, respectively).

**From 2nd plot (count):**

There are a total of **97 females**:

- 25 of them have heart disease (25/97 = 25.8%, which is the ratio shown in the 1st plot).
- The remaining 72 do not have heart disease (74.2%).

There are a total of **206 males**:

- 114 of them have heart disease (55.3%).
- The remaining 92 do not have heart disease (44.7%).

Total cases: summing the values of the four bars: 25 + 72 + 114 + 92 = **303**.

*Note: What would have happened if, instead of rates of 25.8% vs. 55.3% (female vs. male), they had been more similar, like 30.2% vs. 30.6%? In that case the variable gender would have been much less relevant, since it doesn’t separate the has_heart_disease event.*

###### Example 2.2: Crossing with numerical variables

Numerical variables should be **binned** in order to plot them in a histogram; otherwise, the plot does not show information, as can be seen here:

###### Equal frequency binning

The package includes a function (inherited from the Hmisc package): `equal_freq`, which returns the bins/buckets based on the **equal frequency criterion** – each bin has *- or tries to have -* the same quantity of rows.
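The equal-frequency idea itself can be sketched with base R alone, using quantile breaks (a simplified illustration of the criterion on hypothetical data; `equal_freq` comes from the Hmisc lineage and handles more edge cases, such as ties):

```r
# Hypothetical numeric variable with 12 values
x <- c(1, 2, 2, 3, 5, 7, 8, 9, 11, 15, 18, 20)

# Equal-frequency breaks: cut points at evenly spaced quantiles (here, 3 bins)
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 4)))
bins <- cut(x, breaks = breaks, include.lowest = TRUE)

# Each bin ends up with roughly the same number of rows
table(bins)
```

With 12 values and 3 bins, each bucket gets 4 rows, which is exactly what the equal-frequency criterion aims for.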

For numerical variables, `cross_plot` has `auto_binning=T` by default, which automatically calls the `equal_freq` function with `n_bins=10` (or the closest possible number).

```r
cross_plot(heart_disease, str_input = "max_heart_rate", str_target = "has_heart_disease")
```

###### Example 2.3: Manual binning

If you don’t want the automatic binning, then set `auto_binning=F` in the `cross_plot` function.

For example, creating `oldpeak_2` based on equal frequency, with 3 buckets:

```r
heart_disease$oldpeak_2 = equal_freq(var = heart_disease$oldpeak, n_bins = 3)
summary(heart_disease$oldpeak_2)
```

Plotting the binned variable (`auto_binning = F`):

```r
cross_oldpeak_2 = cross_plot(heart_disease, str_input = "oldpeak_2", str_target = "has_heart_disease", auto_binning = F)
```

**Conclusion**

This new plot based on `oldpeak_2` clearly shows how the likelihood of **having heart disease increases** as **oldpeak_2 increases**. *Again, it gives an order to the data.*

###### Example 2.4: Noise reduction

Converting the variable `max_heart_rate` into one of 10 bins:

```r
heart_disease$max_heart_rate_2 = equal_freq(var = heart_disease$max_heart_rate, n_bins = 10)
cross_plot(heart_disease, str_input = "max_heart_rate_2", str_target = "has_heart_disease")
```

At first glance, `max_heart_rate_2` shows a negative and linear relationship; however, some buckets add noise to it. For example, the bucket `(141, 146]` has a higher heart disease rate than the previous bucket, where a lower one was expected. *This could be noise in the data.*

**Key note**: One way to reduce the **noise** (at the cost of **losing** some information) is to split into fewer bins:

```r
heart_disease$max_heart_rate_3 = equal_freq(var = heart_disease$max_heart_rate, n_bins = 5)
cross_plot(heart_disease, str_input = "max_heart_rate_3", str_target = "has_heart_disease")
```

**Conclusion**: As can be seen, the relationship is now much cleaner and clearer. Bucket *‘N’* has a higher rate than bucket *‘N+1’*, which implies a negative correlation.

**How about saving the cross_plot result into a folder?**
Just set the parameter `path_out` to the folder you want – it creates a new one if it doesn’t exist.

```r
cross_plot(heart_disease, str_input = "max_heart_rate_3", str_target = "has_heart_disease", path_out = "my_plots")
```

It creates the folder `my_plots` in the working directory.

###### Example 4: `cross_plot` on multiple variables

Imagine you want to run `cross_plot` for several variables at the same time. To achieve this, just define a vector containing the variable names.

If you want to analyze 3 variables at once:

```r
vars_to_analyze = c("age", "oldpeak", "max_heart_rate")
cross_plot(data = heart_disease, str_target = "has_heart_disease", str_input = vars_to_analyze)
```

###### Final notes:

- Correlation does not imply causation.
- `cross_plot` is good for visualizing **linear** relationships, and it gives a hint about **non-linear** ones.
- Cleaning the variables helps the model to better fit the data.

#### Part 3: Assessing model performance

**Overview**: Once the predictive model is developed with `training` data, it should be compared against `test` data (which the model has not seen before). Presented here is a wrapper for the **ROC curve**, the **AUC** (area under the ROC curve) and the **KS** (Kolmogorov-Smirnov) statistic.

###### Creating the model

```r
## Training and test data. Percentage of training cases, default value = 80%.
index_sample = get_sample(data = heart_disease, percentage_tr_rows = 0.8)

## Generating the samples
data_tr = heart_disease[index_sample, ]
data_ts = heart_disease[-index_sample, ]

## Creating the model with training data only
fit_glm = glm(has_heart_disease ~ age + oldpeak, data = data_tr, family = binomial)
```

###### ROC, AUC and KS performance metrics

```r
## Performance metrics for Training Data
model_performance(fit = fit_glm, data = data_tr, target_var = "has_heart_disease")
```

```r
## Performance metrics for Test Data
model_performance(fit = fit_glm, data = data_ts, target_var = "has_heart_disease")
```

**Key notes**

- The **higher** the KS and AUC, the **better** the performance.
- KS range: from 0 to 1.
- AUC range: from 0.5 to 1.
- Performance metrics should be **similar** between the training and test sets.
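As a sanity check on what these two metrics mean, here is a small base-R sketch that computes AUC and KS by hand on hypothetical scores and labels (this only illustrates the definitions; `model_performance` reports these values for you):

```r
# Hypothetical predicted probabilities and true labels (1 = has heart disease)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   1,   0,   1,   0,   0,   0)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# AUC: probability that a random positive scores higher than a random negative
# (ties count as half)
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# KS: maximum gap between the empirical score distributions of the two classes
grid <- sort(unique(scores))
ks <- max(abs(ecdf(pos)(grid) - ecdf(neg)(grid)))

c(AUC = auc, KS = ks)
```

On this toy example the model ranks almost all positives above the negatives, so both metrics come out high; a useless model would give AUC near 0.5 and KS near 0.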

**Final comments**

- KS and AUC focus on similar aspects: how well the model distinguishes the class to predict.
- ROC and AUC article: link

###### End of vignette, some links…

If you want to look inside the `funModeling` “black-box” *(either for learning or to contribute – the code is not complex, and it is commented)*:
