Function With Special Talent from ‘caret’ package in R — NearZeroVar()
By Xiaotong Ding (Claire), With Greg Page
A practical tool that enables a modeler to identify and remove non-informative predictors during the variable selection process of data modeling
In this article, we will introduce a powerful function called ‘nearZeroVar()’. This function, which comes from the caret package, is a practical tool that enables a modeler to identify, and then remove, non-informative predictors during the variable selection process of data modeling.
For starters, the nearZeroVar() function identifies constant predictors, i.e., those with only one unique value across samples. In addition, nearZeroVar() flags predictors as having “near-zero variance” when they possess very few unique values relative to the number of samples and the ratio of the frequency of the most common value to the frequency of the second most common value is large.
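To make these two quantities concrete, here is a minimal base-R sketch on a made-up vector. The vector x and the names counts, freq_ratio and percent_unique are our own illustrative choices; this is only a hand calculation of the two ideas described above, not the code that caret runs internally.

```r
# Toy predictor (made-up data): 97 "t" values and 3 "f" values
x <- c(rep("t", 97), rep("f", 3))

# Frequency of each distinct value, largest first
counts <- sort(as.vector(table(x)), decreasing = TRUE)

# Ratio of the most common value's frequency to the second most common
freq_ratio <- counts[1] / counts[2]

# Number of distinct values as a percentage of the number of samples
percent_unique <- length(unique(x)) / length(x) * 100

freq_ratio      # 97 / 3
percent_unique  # 2 / 100 * 100
```

A predictor like this one has a large frequency ratio and a tiny share of unique values, which is the pattern the function is designed to flag.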
Regardless of the modeling process being used, or of the specific purpose of a particular model, removing non-informative predictors is a good idea. Leaving such variables in a model only adds complexity, without any corresponding payoff in model accuracy or quality.
For this analysis, we will use the dataset hawaii.csv, which contains information about Airbnb rentals in Hawaii. In the code cell below, the dataset is read into R, and blank cells are converted to NA values.
library(dplyr)
library(caret)

options(scipen = 999)  # display decimal values, rather than scientific notation

data <- read.csv("/Users/xiaotongding/Desktop/Page BAWritingProject/hawaii.csv")
dim(data)
## [1] 21523 74

data[data == ""] <- NA  # convert blank cells to NA

nzv_vals <- nearZeroVar(data, saveMetrics = TRUE)
dim(nzv_vals)
## [1] 74 4
The code chunk shown above generates a data frame with 74 rows (one for each variable in the dataset) and four columns. If saveMetrics is set to FALSE, the positions of the zero-variance or near-zero-variance predictors are returned instead.
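The difference between the two return styles, and how column positions can be used to drop predictors, can be sketched in base R. Everything below is an assumption made for illustration: nzv_sketch() is a hypothetical helper that approximates the flagging rule as this article describes it, and the toy data frame is invented. It is not caret's implementation, and details such as the handling of missing values or of values that sit exactly on a threshold may differ from the real function.

```r
# Hypothetical helper: approximates the rule described in this article.
# NOT caret's own code.
nzv_sketch <- function(df, freq_cut = 95 / 5, unique_cut = 10,
                       save_metrics = FALSE) {
  metrics <- do.call(rbind, lapply(df, function(col) {
    # Counts of each distinct non-missing value, largest first
    counts <- sort(as.vector(table(col)), decreasing = TRUE)
    n_unique <- length(counts)
    data.frame(
      freqRatio = if (n_unique >= 2) counts[1] / counts[2] else 0,
      percentUnique = n_unique / length(col) * 100,
      zeroVar = n_unique <= 1
    )
  }))
  metrics$nzv <- metrics$zeroVar |
    (metrics$freqRatio > freq_cut & metrics$percentUnique < unique_cut)
  # Either the full metrics table or just the flagged column positions
  if (save_metrics) metrics else which(metrics$nzv)
}

# Invented toy data: one useful column, one imbalanced, one constant
toy <- data.frame(
  price = seq(50, 440, by = 10),
  has_pic = c(rep("t", 39), "f"),
  country = rep("US", 40)
)

drop_idx <- nzv_sketch(toy)  # positions of the flagged columns

# Guard against an empty result: toy[, -integer(0)] would drop every column
reduced <- if (length(drop_idx) > 0) toy[, -drop_idx, drop = FALSE] else toy
names(reduced)
```

The guard on the last step matters in practice: negative indexing with an empty vector selects no columns at all, so a dataset with nothing to remove would otherwise come back empty.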
nzv_sorted <- arrange(nzv_vals, desc(freqRatio))
head(nzv_sorted)

variable                                         freqRatio  percentUnique  zeroVar    nzv
has_availability                              21522.000000    0.009292385    FALSE   TRUE
calculated_host_listings_count_shared_rooms     521.634146    0.032523347    FALSE   TRUE
host_has_profile_pic                            282.184211    0.009292385    FALSE   TRUE
number_of_reviews_l30d                           26.545337    0.046461924    FALSE   TRUE
calculated_host_listings_count_private_rooms     13.440804    0.097570041    FALSE  FALSE
room_type                                         9.244102    0.018584770    FALSE  FALSE
The first column, freqRatio, tells us the ratio of frequencies for the most common value over the second most common value for that variable. To see how this is calculated, let’s look at the freqRatio for host_has_profile_pic (282.184):
table(sort(data$host_has_profile_pic, decreasing = TRUE))
##
##     f     t
##    76 21446

In the entire dataset, there are 76 ‘f’ values and 21446 ‘t’ values. The frequency ratio of the most common outcome to the second-most common outcome, therefore, is 21446/76, or 282.1842.

The second column, percentUnique, indicates the percentage of unique data points out of the total number of data points. To illustrate how this is determined, let’s examine the ‘license’ variable, whose percentUnique value is 45.384007806. The length of the output from the unique() function, generated below, indicates that license contains 9768 distinct values throughout the entire dataset (most likely, some are repeated because a single individual may own multiple Airbnb properties).
length(unique(data$license))
## [1] 9768

By dividing the number of unique values by the number of observations, and then multiplying by 100, we arrive back at the percentUnique value shown above:
length(unique(data$license)) / nrow(data) * 100
## [1] 45.38401

For predictive modeling with numeric input features, it can be okay to have 100 percent uniqueness, as numeric values exist along a continuous spectrum. Imagine, for example, a medical dataset with the weights of 250 patients, all taken to 5 decimal places of precision. It is quite possible that no two patients’ weights would be identical, yet weight could still carry predictive value in a model focused on patient health outcomes.

For non-numeric data, however, 100 percent uniqueness means that the variable will not have any predictive power in a model. If every customer in a bank lending dataset has a unique address, for example, then the ‘customer address’ variable cannot offer us any general insights about default likelihood.

The third column, zeroVar, is a vector of logicals (TRUE or FALSE) that indicate whether the predictor has only one distinct value. Such variables will not yield any predictive power, regardless of their data type.

The fourth column, nzv, is also a vector of logical values, for which TRUE values indicate that the variable is a near-zero variance predictor. For a variable to be flagged as such, it must meet two conditions: (1) its frequency ratio must exceed the freqCut threshold used by the function; AND (2) its percentUnique value must fall below the uniqueCut threshold used by the function. By default, freqCut is set to 95/5 (or 19, if expressed as an integer value), and uniqueCut is set to 10.

Let’s take a look at the variables with the 10 highest frequency ratios:
head(nzv_sorted, 10)

variable                                         freqRatio  percentUnique  zeroVar    nzv
has_availability                              21522.000000    0.009292385    FALSE   TRUE
calculated_host_listings_count_shared_rooms     521.634146    0.032523347    FALSE   TRUE
host_has_profile_pic                            282.184211    0.009292385    FALSE   TRUE
number_of_reviews_l30d                           26.545337    0.046461924    FALSE   TRUE
calculated_host_listings_count_private_rooms     13.440804    0.097570041    FALSE  FALSE
room_type                                         9.244102    0.018584770    FALSE  FALSE
review_scores_checkin                             7.764874    0.041815732    FALSE  FALSE
review_scores_location                            7.632574    0.041815732    FALSE  FALSE
maximum_nights_avg_ntm                            7.083577    6.095804488    FALSE  FALSE
minimum_maximum_nights                            7.018508    0.715513637    FALSE  FALSE
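To see how the thresholds affect the results, we can rerun the function with uniqueCut lowered to 0.04:

nzv_vals2 <- nearZeroVar(data, saveMetrics = TRUE, uniqueCut = 0.04)
nzv_sorted2 <- arrange(nzv_vals2, desc(freqRatio))
head(nzv_sorted2, 10)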

variable                                         freqRatio  percentUnique  zeroVar    nzv
has_availability                              21522.000000    0.009292385    FALSE   TRUE
calculated_host_listings_count_shared_rooms     521.634146    0.032523347    FALSE   TRUE
host_has_profile_pic                            282.184211    0.009292385    FALSE   TRUE
number_of_reviews_l30d                           26.545337    0.046461924    FALSE  FALSE
calculated_host_listings_count_private_rooms     13.440804    0.097570041    FALSE  FALSE
room_type                                         9.244102    0.018584770    FALSE  FALSE
review_scores_checkin                             7.764874    0.041815732    FALSE  FALSE
review_scores_location                            7.632574    0.041815732    FALSE  FALSE
maximum_nights_avg_ntm                            7.083577    6.095804488    FALSE  FALSE
minimum_maximum_nights                            7.018508    0.715513637    FALSE  FALSE
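The only row whose nzv flag differs between the two runs is number_of_reviews_l30d. A short sketch shows why, by applying the two-condition rule described earlier to the freqRatio and percentUnique values copied from the tables above. The helper flag_nzv() and the data frame top4 are hypothetical names introduced for this illustration, and the rule is written as stated in this article rather than taken from caret's source.

```r
# freqRatio and percentUnique for the four highest-ratio variables,
# copied from the output tables in this article
top4 <- data.frame(
  freqRatio = c(21522.000000, 521.634146, 282.184211, 26.545337),
  percentUnique = c(0.009292385, 0.032523347, 0.009292385, 0.046461924),
  row.names = c(
    "has_availability",
    "calculated_host_listings_count_shared_rooms",
    "host_has_profile_pic",
    "number_of_reviews_l30d"
  )
)

# Hypothetical helper: both conditions must hold for a variable to be flagged
flag_nzv <- function(m, freq_cut = 95 / 5, unique_cut = 10) {
  m$freqRatio > freq_cut & m$percentUnique < unique_cut
}

top4$nzv_default <- flag_nzv(top4)                    # uniqueCut = 10
top4$nzv_strict <- flag_nzv(top4, unique_cut = 0.04)  # uniqueCut = 0.04
top4
```

With the default uniqueCut of 10, all four variables satisfy both conditions. With uniqueCut at 0.04, number_of_reviews_l30d no longer qualifies, because its percentUnique of roughly 0.046 sits above the new cutoff even though its frequency ratio still exceeds 19.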
Function With Special Talent from ‘caret’ package in R — NearZeroVar() was first posted on September 4, 2021 at 8:54 am.