tree.bins provides users the ability to recategorize categorical variables, dependent on a response variable, by iteratively creating a decision tree for each categorical (factor class) variable and the selected response variable. The decision tree is created using rpart::rpart. You often encounter variables with several levels when conducting data analysis or applying machine learning algorithms. Decision trees can be used to decide how best to collapse these categorical variables into more manageable factors. The rules from the leaves of the decision tree are extracted and used to recategorize (bin) the appropriate categorical variable (predictor). Only variables containing more than two factor levels are considered by the function. The final output is a dataset containing the recategorized variables and/or a list containing a mapping table for each candidate variable. For more details, see "Decision tree methods: applications for classification and prediction" by Yan-yan Song and Ying Lu, or T. Hastie et al. (2009, ISBN: 978-0-387-84857-0).
At FI Consulting, the Data Science Book Club explores cutting-edge topics over a Friday beer. We occasionally challenge each other to a friendly competition practicing new predictive and machine learning models. Last meeting's challenge used the Ames housing dataset to predict housing prices in Ames, Iowa, a dataset consisting of over 70 categorical and continuous variables. To truly understand the relationship between the variables and the response, in particular the categorical variables, I needed to visualize their relationship with the average price of homes sold in Ames. My goal was to recategorize the current levels into bins that displayed a similar relationship with the sale price. Performing such a task manually would have taken extensive effort.
When working with large datasets, there may be a need to recategorize candidate variables by some criterion. tree.bins allows you to recategorize these variables through a decision tree method derived from rpart::rpart.
tree.bins is especially useful if the dataset contains several factor class variables with an unusually large number of levels. The intended purpose of the package is to recategorize predictors in order to reduce the number of dummy variables created when applying a statistical method to model a response, which can lead to more parsimonious and/or accurate models. The first half of this post illustrates data analysis procedures that surface a typical problem, a variable with many levels, and the latter half covers tree.bins functionality and usage.
Pre-Categorization: Typical Variable for Consideration
This section illustrates a typical variable that could be considered for recategorization.
Visualization of Candidate Variable
Using a subset of the Ames dataset, the chunk below illustrates the average home sale price of each Neighborhood.
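The original chunk is not reproduced here; a minimal sketch of such a summary, assuming a data.frame named AmesSubset with Neighborhood and SalePrice columns (names borrowed from the Ames housing data), could look like:

```r
library(dplyr)
library(ggplot2)

# Average sale price per neighborhood; AmesSubset, Neighborhood, and
# SalePrice are assumed names, not taken from the original chunk
AmesSubset %>%
  group_by(Neighborhood) %>%
  summarise(avg.price = mean(SalePrice)) %>%
  ggplot(aes(x = reorder(Neighborhood, avg.price), y = avg.price)) +
  geom_col() +
  coord_flip() +
  labs(x = "Neighborhood", y = "Average Sale Price")
```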
Notice that many neighborhoods share a similar average sale price. This indicates that we could combine levels and recategorize the Neighborhood variable into fewer bins.
Statistical Method Implementation of Candidate Variable
The following illustrates the results of applying a statistical learning method, linear regression in this example, to the Neighborhood categorical variable without using the tree.bins function.
Notice that multiple dummy variables are created to capture the different levels found within the Neighborhood variable.
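As a sketch (with the same assumed AmesSubset names as above), fitting a linear regression on the raw factor makes the dummy-variable explosion visible:

```r
# One coefficient is estimated for every non-reference neighborhood,
# so a factor with many levels inflates the model
fit <- lm(SalePrice ~ Neighborhood, data = AmesSubset)
summary(fit)
```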
Visualizing the Leaves Created by a Decision Tree
The steps below illustrate how rpart categorizes the different levels of Neighborhood into separate leaves. These leaves are used to generate the mappings that tree.bins extracts and applies to recategorize the current data.
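One way to reproduce those steps, sketched here with the rpart and rpart.plot packages (the exact chunk from the post is not shown), is:

```r
library(rpart)
library(rpart.plot)

# Regress SalePrice on Neighborhood alone; each leaf of the tree
# groups neighborhoods with similar average sale prices
neigh.tree <- rpart(SalePrice ~ Neighborhood, data = AmesSubset)
rpart.plot(neigh.tree)  # visualize the leaves
neigh.tree$frame        # inspect the splits and leaf values
```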
These 5 categories are what tree.bins will use to recategorize the Neighborhood variable.
Post-Categorization: Typical Variable for Consideration
This section illustrates the result of using tree.bins to recategorize a typical variable.
Recategorization of Candidate Variable
There are many neighborhoods with similar average home sale prices. To create fewer dummy variables, we could group the neighborhoods with similar sale prices into one bin. We could alternatively create visualizations to identify similar levels for each variable, but that would remain an extremely tedious task, not to mention subjective to the analyst.
A better method is to use the rules generated from a decision tree. Even using rpart::rpart directly, the task remains tedious, especially when there are numerous factor class variables to consider. The tree.bins function automates this process by iteratively recategorizing each factor class variable in a given dataset.
The control parameter in the tree.bins function serves the same purpose as the control parameter in the rpart function. If you specify a value for this parameter, that value is used to prune the tree for each variable passed in to the data parameter. Remember that a decision tree is being built to refactor each variable into new levels.
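A minimal sketch of passing a single control value, assuming the tree.bins() signature described in this post (data, y, control, return):

```r
library(tree.bins)
library(rpart)

# The same cp value prunes the tree built for every candidate variable
binned <- tree.bins(data = AmesSubset, y = SalePrice,
                    control = rpart.control(cp = 0.01),
                    return = "new.fctrs")
```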
You can also create a two-column data.frame and pass it into the control parameter. The first column must contain the variable name(s) found in the data.frame specified in the data parameter, and the second column must contain the cp values for those variable(s). Any variable name(s) not included in your data.frame will use the cp value generated within the rpart function. Lastly, the column names of your created data.frame are irrelevant; only the elements matter.
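Sketched under the same assumptions, a per-variable control data.frame might look like this (MS.Zoning is a hypothetical second candidate variable):

```r
# Column names are irrelevant; only the elements matter
cp.df <- data.frame(Variable = c("Neighborhood", "MS.Zoning"),
                    CP       = c(0.001, 0.01))

# Variables absent from cp.df fall back to rpart's generated cp value
binned <- tree.bins(data = AmesSubset, y = SalePrice, control = cp.df)
```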
The Different Return Options of tree.bins
Depending on what you want, tree.bins can return either the recategorized data.frame or a list of lookup tables. The lookup tables contain the old-to-new value mappings for each recategorized variable generated by tree.bins.
"new.fctrs" returns the recategorized data.frame.
"lkup.list" returns a list of the lookup tables.
"both" returns an object containing both the new.fctrs and lkup.list outputs, which can be accessed using the "$" notation.
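The three options can be sketched as follows (same assumed signature as above):

```r
# The recategorized data.frame
new.fctrs <- tree.bins(data = AmesSubset, y = SalePrice, return = "new.fctrs")

# A list of old-to-new lookup tables
lkup.list <- tree.bins(data = AmesSubset, y = SalePrice, return = "lkup.list")

# Both, accessed with "$"
both <- tree.bins(data = AmesSubset, y = SalePrice, return = "both")
head(both$new.fctrs)
both$lkup.list
```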
Using the bin.oth Function
tree.bins recategorizes factor class variables of any data.frame. Assuming that similar data will continue to be collected, or perhaps used to test the performance of the model, you may want to recategorize this new data.frame using the same lookup tables that were generated from the first data.frame. In this case, being able to bin other data.frames with the same lookup tables is quite useful. The example below takes in a subset of the AmesSubset data and returns a data.frame recategorized by the lookup list generated from the first data.frame.
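A sketch of that workflow, assuming bin.oth() accepts the lookup list and the new data (argument names are assumptions):

```r
# Generate lookup tables from the first data.frame ...
lkup.list <- tree.bins(data = AmesSubset, y = SalePrice, return = "lkup.list")

# ... then apply the same mappings to another subset (AmesOther is a
# hypothetical data.frame containing the same factor variables)
oth.binned <- bin.oth(list = lkup.list, data = AmesOther)
```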
Hopefully, tree.bins will help you solve similar problems. Please feel free to post any issues you encounter, or simply keep up with the latest changes and updates, on GitHub.