
By Xiaotong Ding (Claire), With Greg Page

A practical tool that enables a modeler to remove non-informative data points during the variable selection process of data modeling

In this article, we will introduce a powerful function called ‘nearZeroVar()’. This function, which comes from the caret package, is a practical tool that enables a modeler to remove non-informative data points during the variable selection process of data modeling.

For starters, the nearZeroVar() function identifies zero-variance predictors: constants that take a single unique value across all samples. In addition, nearZeroVar() diagnoses predictors as having “near-zero variance” when they possess very few unique values relative to the number of samples, and when the ratio of the frequency of the most common value to the frequency of the second most common value is large.
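To make these two diagnostics concrete, the following base-R sketch (our own illustration, not caret's internal code) computes a frequency ratio and a uniqueness percentage for a toy predictor, then applies caret's default cutoffs of 95/5 and 10:

```r
# Toy predictor: 98 copies of "a" and 2 of "b" -- few unique values,
# with the most common value far outnumbering the runner-up.
x <- c(rep("a", 98), rep("b", 2))

tab <- sort(table(x), decreasing = TRUE)
freq_ratio <- as.numeric(tab[1] / tab[2])           # 98 / 2 = 49
pct_unique <- length(unique(x)) / length(x) * 100   # 2 unique out of 100 = 2

# caret's default cutoffs: freqCut = 95/5 (i.e. 19) and uniqueCut = 10
freq_ratio > 95/5 && pct_unique < 10   # TRUE: near-zero variance
```

Because the ratio 49 exceeds 19 and the uniqueness of 2 percent falls below 10, this toy variable would be flagged.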

Regardless of the modeling process being used, or of the specific purpose for a particular model, the removal of non-informative predictors is a good idea. Leaving such variables in a model only adds extra complexity, without any corresponding payoff in model accuracy or quality.

For this analysis, we will use the dataset hawaii.csv, which contains information about Airbnb rentals in Hawaii. In the code chunk below, the dataset is read into R, and blank cells are converted to NA values.

library(dplyr)
library(caret)
options(scipen=999)  #display decimal values, rather than scientific notation
data <- read.csv("hawaii.csv")
dim(data)
## [1] 21523    74
data[data==""] <- NA
nzv_vals <- nearZeroVar(data, saveMetrics = TRUE)
dim(nzv_vals)
## [1] 74  4
The code chunk shown above generates a dataframe with 74 rows (one for each variable in the dataset) and four columns. If saveMetrics is set to FALSE, the function instead returns the column positions of the zero- and near-zero-variance predictors.
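For instance, on a small made-up data frame (the column names here are hypothetical), the position form can be used to drop the flagged columns directly:

```r
library(caret)

# Hypothetical toy data: 'flag' is constant, 'mostly_a' is near-zero
# variance, and 'id' is fully unique.
df <- data.frame(
  flag     = rep("y", 100),
  mostly_a = c(rep("a", 98), "b", "b"),
  id       = 1:100
)

nearZeroVar(df)   # positions of the flagged columns: 1 and 2

# Drop the flagged columns, keeping the result as a data frame
df_clean <- df[, -nearZeroVar(df), drop = FALSE]
names(df_clean)   # "id"
```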
nzv_sorted <- arrange(nzv_vals, desc(freqRatio))
head(nzv_sorted)

##                                                 freqRatio percentUnique zeroVar   nzv
## has_availability                             21522.000000   0.009292385   FALSE  TRUE
## calculated_host_listings_count_shared_rooms    521.634146   0.032523347   FALSE  TRUE
## host_has_profile_pic                           282.184211   0.009292385   FALSE  TRUE
## number_of_reviews_l30d                          26.545337   0.046461924   FALSE  TRUE
## calculated_host_listings_count_private_rooms    13.440804   0.097570041   FALSE FALSE
## room_type                                        9.244102   0.018584770   FALSE FALSE

The first column, freqRatio, tells us the ratio of frequencies for the most common value over the second most common value for that variable. To see how this is calculated, let’s look at the freqRatio for host_has_profile_pic (282.184):

table(sort(data$host_has_profile_pic, decreasing=TRUE))
## 
##     f     t 
##    76 21446
In the entire dataset, there are 76 ‘f’ values, and 21446 ‘t’ values. The frequency ratio of the most common outcome to the second-most common outcome, therefore, is 21446/76, or 282.1842.

The second column, percentUnique, indicates the percentage of unique data points out of the total number of data points. To illustrate how this is determined, let’s examine the ‘license’ variable, which has a percentUnique value of 45.384007806. The length of the output from the unique() function, generated below, indicates that license contains 9768 distinct values throughout the entire dataset (most likely, some are repeated because a single individual may own multiple Airbnb properties).
length(unique(data$license))
## [1] 9768
By dividing the number of unique values by the number of observations, and then multiplying by 100, we arrive back at the percentUnique value shown above:
length(unique(data$license)) / nrow(data) * 100
## [1] 45.38401
For predictive modeling with numeric input features, it can be okay to have 100 percent uniqueness, as numeric values exist along a continuous spectrum. Imagine, for example, a medical dataset with the weights of 250 patients, all taken to 5 decimal places of precision: it is quite possible that no two patients’ weights would be identical, yet weight could still carry predictive value in a model focused on patient health outcomes.

For non-numeric data, however, 100 percent uniqueness means that the variable will not have any predictive power in a model. If every customer in a bank lending dataset has a unique address, for example, then the ‘customer address’ variable cannot offer us any general insights about default likelihood.

The third column, zeroVar, is a vector of logicals (TRUE or FALSE) indicating whether the predictor has only one distinct value. Such variables will not yield any predictive power, regardless of their data type.

The fourth column, nzv, is also a vector of logical values, for which TRUE indicates that the variable is a near-zero variance predictor. For a variable to be flagged as such, it must meet two conditions: (1) its frequency ratio must exceed the freqCut threshold used by the function; AND (2) its percentUnique value must fall below the uniqueCut threshold used by the function. By default, freqCut is set to 95/5 (or 19, if expressed as an integer value), and uniqueCut is set to 10.

Let’s take a look at the variables with the 10 highest frequency ratios:
head(nzv_sorted, 10)

##                                                 freqRatio percentUnique zeroVar   nzv
## has_availability                             21522.000000   0.009292385   FALSE  TRUE
## calculated_host_listings_count_shared_rooms    521.634146   0.032523347   FALSE  TRUE
## host_has_profile_pic                           282.184211   0.009292385   FALSE  TRUE
## number_of_reviews_l30d                          26.545337   0.046461924   FALSE  TRUE
## calculated_host_listings_count_private_rooms    13.440804   0.097570041   FALSE FALSE
## room_type                                        9.244102   0.018584770   FALSE FALSE
## review_scores_checkin                            7.764874   0.041815732   FALSE FALSE
## review_scores_location                           7.632574   0.041815732   FALSE FALSE
## maximum_nights_avg_ntm                           7.083577   6.095804488   FALSE FALSE
## minimum_maximum_nights                           7.018508   0.715513637   FALSE FALSE
Right now, number_of_reviews_l30d (number of reviews in the last 30 days) is flagged as an nzv variable: its frequency ratio of 26.54 exceeds the default freqCut of 19, and its uniqueness percentage of 0.046 falls below the default uniqueCut of 10. If we adjust the function’s settings so that either of those conditions no longer holds, it will no longer be considered an nzv variable:
nzv_vals2 <- nearZeroVar(data, saveMetrics = TRUE, uniqueCut = 0.04)
nzv_sorted2 <- arrange(nzv_vals2, desc(freqRatio))
head(nzv_sorted2, 10)

##                                                 freqRatio percentUnique zeroVar   nzv
## has_availability                             21522.000000   0.009292385   FALSE  TRUE
## calculated_host_listings_count_shared_rooms    521.634146   0.032523347   FALSE  TRUE
## host_has_profile_pic                           282.184211   0.009292385   FALSE  TRUE
## number_of_reviews_l30d                          26.545337   0.046461924   FALSE FALSE
## calculated_host_listings_count_private_rooms    13.440804   0.097570041   FALSE FALSE
## room_type                                        9.244102   0.018584770   FALSE FALSE
## review_scores_checkin                            7.764874   0.041815732   FALSE FALSE
## review_scores_location                           7.632574   0.041815732   FALSE FALSE
## maximum_nights_avg_ntm                           7.083577   6.095804488   FALSE FALSE
## minimum_maximum_nights                           7.018508   0.715513637   FALSE FALSE
Note that with the lower cutoff for percentUnique in place, number_of_reviews_l30d no longer qualifies for nzv status. Raising freqCut to any value above 26.55 would have had a similar effect.

So what is the “correct” setting to use? Like nearly everything else in the world of modeling, this question does not lend itself to a “one-size-fits-all” answer. At times, nearZeroVar() may serve as a handy way to quickly whittle down the size of an enormous dataset. Other times, it might even be used in a nearly opposite way: if a modeler is specifically looking to call attention to anomalous values, this function could be used to flag variables that contain them.

Either way, we encourage you to explore this function, and to consider making it part of your Exploratory Data Analysis (EDA) routine, especially when you are faced with a large dataset and looking for places to simplify the task in front of you.
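As one way to fold the function into an EDA routine, here is a small helper sketch (nzv_report is our own hypothetical name, not part of caret) that reports which columns would be flagged at chosen thresholds:

```r
library(caret)

# Report which columns of a data frame are flagged as near-zero
# variance at the given thresholds (defaults mirror nearZeroVar's).
nzv_report <- function(df, freqCut = 95/5, uniqueCut = 10) {
  m <- nearZeroVar(df, saveMetrics = TRUE,
                   freqCut = freqCut, uniqueCut = uniqueCut)
  rownames(m)[m$nzv]
}

# Hypothetical toy data for illustration
df <- data.frame(
  rare_flag = c(rep("no", 198), "yes", "yes"),  # freqRatio 99, 1% unique
  price     = runif(200)                        # essentially all unique
)

nzv_report(df)                  # flags "rare_flag"
nzv_report(df, freqCut = 150)   # stricter ratio cutoff: nothing flagged
```

Wrapping the metrics this way makes it easy to experiment with freqCut and uniqueCut before deciding what to drop.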
Function With Special Talent from ‘caret’ package in R — NearZeroVar() was first posted on September 4, 2021 at 8:54 am.