[This article was first published on R - Data Science Heroes Blog, and kindly contributed to R-bloggers].

tl;dr: Convert numerical variables into categorical ones, as shown in the next image.

## Let’s start!

The package `funModeling` (from version > 1.6.6) introduces two
functions— `discretize_get_bins` & `discretize_df` —that work together
in order to help us in the discretization task.

If you were using version 1.6.6, please see the update note below (Jan-19-2018).

```
# First we load the libraries
# install.packages("funModeling")
library(funModeling)
library(dplyr)
```

Let’s see an example. First, we check current data types:

```
df_status(heart_disease, print_results = FALSE) %>%
  select(variable, type, unique, q_na) %>%
  arrange(type)

##                  variable    type unique q_na
## 1                  gender  factor      2    0
## 2              chest_pain  factor      4    0
## 3     fasting_blood_sugar  factor      2    0
## 4         resting_electro  factor      3    0
## 5                    thal  factor      3    2
## 6            exter_angina  factor      2    0
## 7       has_heart_disease  factor      2    0
## 8                     age integer     41    0
## 9  resting_blood_pressure integer     50    0
## 10      serum_cholestoral integer    152    0
## 11         max_heart_rate integer     91    0
## 12            exer_angina integer      2    0
## 13                  slope integer      3    0
## 14      num_vessels_flour integer      4    4
## 15 heart_disease_severity integer      5    0
## 16                oldpeak numeric     40    0
```

We’ve got factor, integer, and numeric variables: a good mix! The
transformation has two steps. The first gets the cuts, or threshold
values, at which each segment begins. The second uses those thresholds
to convert the variables into categorical ones.

Two variables will be discretized in the following example:
`max_heart_rate` and `oldpeak`. Also, we’ll introduce some `NA` values
into `oldpeak` to test how the function works with missing data.

```
# Introducing some missing values in the first 30 rows of the oldpeak variable
heart_disease$oldpeak[1:30] = NA
```

Step 1) Getting the bin thresholds for each input variable:

`discretize_get_bins` returns a data frame that needs to be used in the
`discretize_df` function, which returns the final processed data frame.

```
d_bins = discretize_get_bins(data = heart_disease, input = c("max_heart_rate", "oldpeak"), n_bins = 5)

## [1] "Variables processed: max_heart_rate, oldpeak"

# Checking `d_bins` object:
d_bins

##         variable                     cuts
## 1 max_heart_rate 131|147|160|171|Inf
## 2        oldpeak   0.1|0.3|1.1|2|Inf
```

Parameters:

• `data`: the data frame containing the variables to be processed.
• `input`: vector of strings containing the variable names.
• `n_bins`: the number of bins/segments to have in the discretized
data.

We can see each threshold point (or upper boundary) for each variable.
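To make the idea of upper boundaries concrete, here is a minimal base-R sketch of how equal-frequency thresholds could be derived from quantiles. It is an illustration under that assumption, not `funModeling`'s actual implementation, and `get_cuts_sketch` is a hypothetical helper name.

```r
# Hypothetical helper (not part of funModeling): equal-frequency cuts via quantiles
get_cuts_sketch <- function(x, n_bins = 5) {
  # Inner probabilities: 1/n_bins, 2/n_bins, ..., (n_bins - 1)/n_bins
  probs <- seq(0, 1, length.out = n_bins + 1)[-c(1, n_bins + 1)]
  # Repeated quantiles (caused by ties) are collapsed with unique()
  inner <- unique(quantile(x, probs = probs, na.rm = TRUE, names = FALSE))
  # Close the last segment with Inf and build a pipe-separated string
  paste(c(inner, Inf), collapse = "|")
}

x <- 1:100                      # made-up data
cuts <- get_cuts_sketch(x, n_bins = 4)
cuts
```

The resulting string has the same shape as the `cuts` column above: one upper boundary per segment, with `Inf` closing the last one.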

Update Jan-19-2018: Some points that differ between versions 1.6.6 and 1.6.7:

• `discretize_get_bins` doesn’t create the `-Inf` threshold, since that value was always considered to be the minimum.
• A single-value category is now represented as a range; for example, what used to be `"5"` is now `"[5, 6)"`.
• Bucket formatting may have changed. If you were using this function in production, you will need to check the new values.

Time to continue with the next step!

Step 2) Applying the thresholds for each variable:

```
# Now it can be applied to the same data frame or to a new one (for example, in a predictive model whose data changes over time)
heart_disease_discretized = discretize_df(data = heart_disease, data_bins = d_bins, stringsAsFactors = TRUE)

## [1] "Variables processed: max_heart_rate, oldpeak"
```

Parameters:

• `data`: data frame containing the numerical variables to be
discretized.
• `data_bins`: data frame returned by `discretize_get_bins`. If it is
changed by the user, then each upper boundary must be separated by a
pipe character (`|`) as shown in the example.
• `stringsAsFactors`: `TRUE` by default, so the final variables will be
factors (instead of character), which is useful when plotting.
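As a rough illustration of what applying a pipe-separated `cuts` string involves, the following base-R sketch splits the string and passes the boundaries to `cut()`. It assumes left-closed intervals, as in the output shown in this post; `apply_cuts_sketch` is a hypothetical name, and the real `discretize_df` may differ in its details.

```r
# Hypothetical helper (not part of funModeling): apply an "a|b|c|Inf" cuts string
apply_cuts_sketch <- function(x, cuts) {
  # Split on the literal pipe and convert to numeric upper boundaries
  upper <- as.numeric(strsplit(cuts, "|", fixed = TRUE)[[1]])
  # Prepend -Inf so values below the first threshold still get a segment
  breaks <- unique(c(-Inf, upper))
  # right = FALSE gives left-closed intervals; include.lowest closes the last one
  cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
}

# Made-up heart-rate values, binned with the thresholds printed in Step 1
binned <- apply_cuts_sketch(c(100, 131, 150, 200), "131|147|160|171|Inf")
table(binned)
```

Editing the string by hand (for instance, removing one boundary) changes the segments without touching the code, which is the point of keeping the cuts in a plain data frame.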

#### Final results and their plots

Before and after

Final distribution:

```
describe(heart_disease_discretized %>% select(max_heart_rate, oldpeak))

## heart_disease_discretized %>% select(max_heart_rate, oldpeak)
##
##  2  Variables      303  Observations
## ---------------------------------------------------------------------------
## max_heart_rate
##        n  missing distinct
##      303        0        5
##
## Value      [-Inf, 131) [ 131, 147) [ 147, 160) [ 160, 171) [ 171, Inf]
## Frequency           63          59          62          62          57
## Proportion       0.208       0.195       0.205       0.205       0.188
## ---------------------------------------------------------------------------
## oldpeak
##        n  missing distinct
##      303        0        6
##
## Value      [-Inf, 0.1) [ 0.1, 0.3) [ 0.3, 1.1) [ 1.1, 2.0) [ 2.0, Inf]
## Frequency           97          18          54          54          50
## Proportion       0.320       0.059       0.178       0.178       0.165
##
## Value              NA.
## Frequency           30
## Proportion       0.099
## ---------------------------------------------------------------------------

# ggplot2 provides ggplot(); load it explicitly in case it is not already attached
library(ggplot2)

p5 = ggplot(heart_disease_discretized, aes(max_heart_rate)) +
  geom_bar(fill = "#0072B2") + theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
p6 = ggplot(heart_disease_discretized, aes(oldpeak)) +
  geom_bar(fill = "#CC79A7") + theme_bw() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

gridExtra::grid.arrange(p5, p6, ncol = 2)
```

Showing final variable distribution:

Sometimes it is not possible to get the same number of cases per bucket
when computing equal-frequency bins, as the `oldpeak` variable shows.
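Heavily repeated values are a common reason for uneven buckets. The base-R sketch below uses made-up data in which one value dominates, so several quantiles coincide and the resulting segments cannot hold equal counts. It only illustrates the effect and makes no claim about how `funModeling` resolves ties internally.

```r
# Toy vector where the value 0 is heavily repeated (made-up data)
x <- c(rep(0, 6), 1, 2, 3, 4)

# Quartile thresholds: the ties at 0 make the first two quantiles identical
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
breaks <- unique(c(-Inf, q, Inf))

# Count the cases per left-closed segment
counts <- table(cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE))
counts
```

Four equal buckets were requested, but only three distinct segments remain, and most cases fall into the one that contains 0.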

#### NA handling

Regarding the `NA` values, the new `oldpeak` variable has six
categories: the five defined by `n_bins=5` plus the `NA.` value.
Note the trailing period, which marks the presence of missing values.

• `discretize_df` will never return an `NA` value without transforming
it to the string `NA.`.
• `n_bins` sets the number of bins for all the variables.
• If `input` is missing, then it will run for all numeric/integer
variables whose number of unique values is greater than the number
of bins (`n_bins`).
• Only the variables defined in `input` will be processed while
remaining variables will not be modified at all.
• `discretize_get_bins` returns just a data frame that can be changed
by hand as needed, either in a text file or in the R session.
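The `NA.` category can be mimicked in base R by turning missing values into an explicit factor level. This is a sketch of the idea on made-up values, not the package's code.

```r
# Made-up values with one missing entry
x <- c(0.2, NA, 1.5, 3.0)
binned <- cut(x, breaks = c(-Inf, 1, 2, Inf), right = FALSE, include.lowest = TRUE)

# Replace NA with the explicit label "NA." and rebuild the factor
labels <- as.character(binned)
labels[is.na(labels)] <- "NA."
binned_no_na <- factor(labels, levels = c(levels(binned), "NA."))

table(binned_no_na)
```

Because the missing values become a regular category, they show up in frequency tables and bar plots instead of being silently dropped.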

#### Discretization with new data

In our data, the minimum value for `max_heart_rate` is 71. The data
preparation must be robust with new data; e.g., if a new patient arrives
whose `max_heart_rate` is 68, then the current process will assign
her/him to the lowest category.

In other functions from other packages, this preparation may return an
`NA` because the value falls outside every segment.

As we pointed out before, if new data comes over time, it’s likely to
get new min/max value/s. This can break our process. To solve this,
`discretize_df` will always have as min/max the values `-Inf`/`Inf`;
thus, any new value falling below/above the minimum/maximum will be
added to the lowest or highest segment as applicable.
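The difference between finite and open-ended boundaries can be seen with base R's `cut()`. The values below are made up, and the sketch only illustrates the `-Inf`/`Inf` idea described above.

```r
train <- c(71, 120, 150, 202)   # made-up training values; the minimum is 71
new_value <- 68                 # new patient below the training minimum

# Finite breaks taken from the training range: the new value falls outside
finite_bins <- cut(new_value, breaks = c(min(train), 131, 160, max(train)),
                   right = FALSE, include.lowest = TRUE)

# Open-ended breaks: the new value lands in the lowest segment
open_bins <- cut(new_value, breaks = c(-Inf, 131, 160, Inf),
                 right = FALSE, include.lowest = TRUE)

is.na(finite_bins)      # TRUE: no segment covers 68
as.integer(open_bins)   # 1: the lowest segment
```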

The data frame returned by `discretize_get_bins` must be saved in order
to apply it to new data. If the discretization is not intended to run
with new data, then there is no sense in having two functions: it can be
only one. In addition, there would be no need to save the results of
`discretize_get_bins`.
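One simple way to keep the thresholds for later use is to serialize the data frame with base R's `saveRDS()`. In the sketch below, `d_bins` is a hand-built stand-in with the values printed in Step 1, and the file location is a temporary, hypothetical path.

```r
# Stand-in for the data frame returned by discretize_get_bins
d_bins <- data.frame(
  variable = c("max_heart_rate", "oldpeak"),
  cuts = c("131|147|160|171|Inf", "0.1|0.3|1.1|2|Inf"),
  stringsAsFactors = FALSE
)

# Hypothetical location; in practice, a path kept alongside the model
bins_file <- tempfile(fileext = ".rds")
saveRDS(d_bins, bins_file)

# Later, when new data arrives, reload the very same thresholds
d_bins_restored <- readRDS(bins_file)
all.equal(d_bins, d_bins_restored)
```

The restored data frame can then be passed as `data_bins` to `discretize_df` so that new data is segmented with exactly the same cuts.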

Having this two-step approach, we can handle both cases.

Using `discretize_get_bins` + `discretize_df` provides quick data
preparation, with a clean data frame that is ready to use. It clearly
shows where each segment begins and ends, which is indispensable when
making statistical reports.

Not failing when new data brings a new min/max is a design choice.
In some contexts, failure would be the desired behavior.
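For contexts where failure is preferred, a strict variant could check the range before binning. The helper below is a hypothetical base-R sketch with made-up breaks; `funModeling` does not provide it.

```r
# Hypothetical strict variant: stop instead of absorbing out-of-range values
cut_strict_sketch <- function(x, breaks) {
  out_of_range <- !is.na(x) & (x < min(breaks) | x > max(breaks))
  if (any(out_of_range)) {
    stop("Values outside the training range: ",
         paste(x[out_of_range], collapse = ", "))
  }
  cut(x, breaks = breaks, right = FALSE, include.lowest = TRUE)
}

breaks <- c(71, 131, 160, 202)   # made-up finite breaks from training data

ok <- cut_strict_sketch(c(80, 150), breaks)   # within range: binned normally
# 68 is below the minimum, so the call raises an error that try() captures
failed <- inherits(try(cut_strict_sketch(68, breaks), silent = TRUE), "try-error")
failed
```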

The human intervention: The easiest way to discretize a data frame
is to select the same number of bins to apply to every variable—just
like the example we saw—however, if tuning is needed, then some
variables may need a different number of bins. For example, a
variable with less dispersion can work well with a low number of bins.

Common values for the number of segments could be 3, 5, 10, or 20 (but
no more). It is up to the data scientist to make this decision.

#### Bonus track: The trade-off art ⚖️

• A high number of bins => More noise captured.
• A low number of bins => Oversimplification, less variance.

Do these terms sound similar to any other ones in machine learning?

They echo what happens when adding or subtracting variables from a predictive model:

• More variables: Overfitting alert (too detailed predictive model).
• Fewer variables: Underfitting danger (not enough information to
capture general patterns).

Just as Eastern philosophy has pointed out for thousands of years, there is an art in finding the right balance between one value and its opposite.

`Keep in touch:` @pabloc_ds.