# A guide to encoding categorical features using R

**Anindya Mozumdar**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this article, we will look at various options for encoding categorical features. We will also present R code for each of the encoding techniques. Categorical feature encoding is an important data processing step required for using these features in many statistical modelling and machine learning algorithms.

The material in the article is heavily borrowed from the post Smarter Ways to Encode Categorical Data for Machine Learning by Jeff Hale. The linked article provides many useful tips along with Python code. Our aim is to implement each of the encoding techniques using R while minimizing the use of additional packages. Please note that the code provided in this article is illustrative and does not follow any software engineering best practice. It may also not be suitable for production, though my hope is that some of you readers will find it useful in your own work. If you find any errors or have suggestions to improve the code, please make a pull request to the Github repository linked at the end of the article.

The words – *features*, *variables* or *columns* – are used interchangeably throughout the rest of the article. The same is true for the words *observations* or *rows* to represent each individual data point.

### Data

The data for this article was prepared synthetically and the code to prepare it can be found in the code “01_Synthetic_Data_Preparation.R” in the repository. It comprises 5,000 observations and three variables

*city*– A categorical variable whose values may be one of 100 unique American city names obtained from the*world.cities*dataset available in the*maps*package. Some observations may have missing values for this variable.*hml*– A categorical variable whose values are ‘High’, ‘Medium’ or ‘Low’. The intent is to demonstrate an ordinal feature. Some observations may have missing values for this variable.*target*– A numeric variable which may take one of two values 0 or 1. This can be considered as the dependent variable for a binary classification problem. There are no missing values for this variable.

There is also a relationship between the *hml* and *target* variables, with the value *High* having the lowest target rate (percentage of records where the target is 1) and the value *Low* having the highest target rate.

The data is referred to by the variable *df* in the example code below.

### Encoders – A Short Summary

The encoders described in this article can be divided into two key groups

- Classic and Contrast encoders – These do not use the information from the dependent variable in the encoding. If there are
*k*unique values in a categorical variable, they create up to*k*distinct columns to store the encoding depending on the technique. - Bayesian encoders – These use information from the dependent variable in the encoding. One must ensure that the encoding is only done after splitting the data into training and test sets and using the target from the training set. These encoders always create one column to store the encoded value. It is often recommended to add some ‘noise’ to the resulting encoded value.

Some encoders are applicable mostly for ordinal variables while others can be used for both ordinal and nominal variables. In this article, we will not get into the question of deciding which encoding is the most appropriate. There are already excellent resources available in the internet on this topic and a quick web search should lead you to interesting articles or discussions for that particular encoding technique.

The classic and contrast encoders described are

- Ordinal
- One-Hot
- Binary
- Hashing
- Helmert
- Backward difference
- Polynomial

The Bayesian encoders described in the article are

- Target
- Leave one out
- Weight of evidence
- James Stein
- M Estimator

### Classic and Contrast Encoders

#### Ordinal

This is the simplest form of encoding where each value is converted to an integer. The maximum value is the equal to the number of unique values of the variable. By default, the values are considered in the order in which they appear in the column but can be changed using the *order* argument to the function. The *exclude* argument is needed in the *factor* function to ensure that missing values are also assigned an encoded value.

encode_ordinal <- function(x, order = unique(x)) { x <- as.numeric(factor(x, levels = order, exclude = NULL)) x }

A few examples to demonstrate the function in action. The order in which the values occur in the data is *Medium*, *Low*, *High* and *NA*. So the original call to the function assigns the values 1 to 4 in this order. This can be changed by the user passing an explicit order. In general, encoding a nominal variable like *city* is not recommended using this technique and the code has been used for illustration only.

table(df[["hml"]], encode_ordinal(df[["hml"]]), useNA = "ifany") ## ## 1 2 3 4 ## High 0 0 1618 0 ## Low 0 1545 0 0 ## Medium 1587 0 0 0 ## <NA> 0 0 0 250 table(df[["hml"]], encode_ordinal(df[["hml"]], order = c(NA, "Low", "Medium", "High")), useNA = "ifany") ## ## 1 2 3 4 ## High 0 0 0 1618 ## Low 0 1545 0 0 ## Medium 0 0 1587 0 ## <NA> 250 0 0 0 table(df[["hml"]], encode_ordinal(df[["hml"]], order = c("Low", "Medium", "High", NA)), useNA = "ifany") ## ## 1 2 3 4 ## High 0 0 1618 0 ## Low 1545 0 0 0 ## Medium 0 1587 0 0 ## <NA> 0 0 0 250 new_df <- df new_df[["hml_encoded"]] <- encode_ordinal(df[["hml"]]) new_df[["city_encoded"]] <- encode_ordinal(df[["city"]]) head(new_df) ## city hml target hml_encoded city_encoded ## 1: Hartford Medium 0 1 1 ## 2: Athens-Clarke Low 0 2 2 ## 3: Pasco Low 1 2 3 ## 4: Nashua High 0 3 4 ## 5: Bellflower High 0 3 5 ## 6: Kendall Low 1 2 6

#### One-Hot

This is probably the most common form of encoding and is often referred to as creating dummy or indicator variables. It creates a new column for each unique value of the categorical variable. Each of these columns are binary with values 1 or 0 depending on whether the value of the variable is equal to the unique value being encoded by this column.

There are multiple ways to do one-hot encoding in R. One way would be to use the *model.matrix* function from the *stats* package.

# Using model.matrix new_df <- df new_df$city <- factor(new_df$city, exclude = NULL) new_df$hml <- factor(new_df$hml, exclude = NULL) new_df <- model.matrix(~.-1, data = new_df[, c("city", "hml")], contrasts.arg = list( city = contrasts(new_df$city, contrasts = FALSE), hml = contrasts(new_df$hml, contrasts = FALSE) )) head(new_df[, 1:3]) ## cityAlameda cityAmes cityApple Valley ## 1 0 0 0 ## 2 0 0 0 ## 3 0 0 0 ## 4 0 0 0 ## 5 0 0 0 ## 6 0 0 0

I would personally recommend using the *designTreatmentsZ* and *prepare* functions from the *vtreat* package. While a comprehensive discussion of *vtreat* is beyond the scope of this article, it has a lot of nice features and takes care of multiple potential issues for you.

library(vtreat) tz <- vtreat::designTreatmentsZ(df, c("city", "hml")) ## [1] "vtreat 1.5.1 inspecting inputs Mon Feb 03 00:16:33 2020" ## [1] "designing treatments Mon Feb 03 00:16:33 2020" ## [1] " have initial level statistics Mon Feb 03 00:16:33 2020" ## [1] " scoring treatments Mon Feb 03 00:16:33 2020" ## [1] "have treatment plan Mon Feb 03 00:16:34 2020" new_df <- vtreat::prepare(tz, df, extracols = "target") head(new_df[, 1:5]) ## city_catP hml_catP city_lev_NA city_lev_x_Alameda city_lev_x_Ames ## 1 0.0090 0.3174 0 0 0 ## 2 0.0094 0.3090 0 0 0 ## 3 0.0106 0.3090 0 0 0 ## 4 0.0116 0.3236 0 0 0 ## 5 0.0106 0.3236 0 0 0 ## 6 0.0094 0.3090 0 0 0 detach("package:vtreat", unload = TRUE)

The third option is to use the *dummyVars* function from the *caret* package. This uses the R formula interface and one needs to use the *predict* function to actually create the variables.

library(caret) ## Loading required package: lattice ## Loading required package: ggplot2 new_df <- df new_df$city <- factor(new_df$city, exclude = NULL) new_df$hml <- factor(new_df$hml, exclude = NULL) new_df$city <- addNA(new_df$city) new_df$hml <- addNA(new_df$hml) dv <- caret::dummyVars(" ~ city + hml", data = new_df) new_df <- data.frame(predict(dv, newdata = df)) head(new_df[, 1:3]) ## city.Alameda city.Ames city.Apple.Valley ## 1 0 0 0 ## 2 0 0 0 ## 3 0 0 0 ## 4 0 0 0 ## 5 0 0 0 ## 6 0 0 0 detach("package:caret", unload = TRUE)

#### Binary

Instead of each unique value being assigned to an integer, the integers are converted to a binary number and a representation of this binary number is stored as multiple columns, where each column can take the values 0 or 1. In that sense, it can be thought of as a combination of ordinal and one-hot encoding. So if you have 27 distinct values of a categorical variable, then 5 columns are sufficient to encode this variable - as 5-digit binary numbers can store any value from 0 to 31. An implementation is provided below using the *binaryLogic* package. We first use ordinal encoding. The *as.binary* function from this package is used to get the binary representation of the integer assigned to that category. As not all numbers will have the same length of representation, the numbers are “padded by 0s”. From the example, we see that 7 indicator variables are sufficient to represent all the 100 cities as well as the missing value in the data.

library(binaryLogic) encode_binary <- function(x, order = unique(x), name = "v_") { x <- as.numeric(factor(x, levels = order, exclude = NULL)) x2 <- as.binary(x) maxlen <- max(sapply(x2, length)) x2 <- lapply(x2, function(y) { l <- length(y) if (l < maxlen) { y <- c(rep(0, (maxlen - l)), y) } y }) d <- as.data.frame(t(as.data.frame(x2))) rownames(d) <- NULL colnames(d) <- paste0(name, 1:maxlen) d } new_df <- cbind(df, encode_binary(df[["city"]], name = "city_")) head(new_df) ## city hml target city_1 city_2 city_3 city_4 city_5 city_6 city_7 ## 1: Hartford Medium 0 0 0 0 0 0 0 1 ## 2: Athens-Clarke Low 0 0 0 0 0 0 1 0 ## 3: Pasco Low 1 0 0 0 0 0 1 1 ## 4: Nashua High 0 0 0 0 0 1 0 0 ## 5: Bellflower High 0 0 0 0 0 1 0 1 ## 6: Kendall Low 1 0 0 0 0 1 1 0 detach("package:binaryLogic", unload = TRUE)

#### Hashing

This uses the hashing trick described in this article.

An implementation of this technique is provided by the *FeatureHashing* package. The *hash.size* should be chosen so that the collision rate is not too high. A simple example of creating the features and using it in *xgboost* is also shown in the example below.

library(FeatureHashing) library(xgboost) m.mat <- hashed.model.matrix(c("city", "hml"), df, hash.size = 2 ^ 10, create.mapping = TRUE) # Extract the mapping mapping <- hash.mapping(m.mat) # Check collision rate mean(duplicated(mapping)) ## [1] 0.05825243 # Names names(mapping) ## [1] "cityIndio" "cityCherry Hill" "hmlHigh" ## [4] "cityHattiesburg" "cityRockford" "cityLynwood" ## [7] "cityEdinburg" "cityAustin" "cityPleasanton" ## [10] "cityWest Orange" "cityWilkes-Barre" "cityHaverhill" ## [13] "cityPeoria" "cityBellflower" "hmlMedium" ## [16] "cityDeKalb" "hmlLow" "cityCitrus Heights" ## [19] "cityRacine" "citySomerville" "cityFlorence-Graham" ## [22] "cityYorba Linda" "cityWilson" "cityVacaville" ## [25] "cityHamilton" "cityChandler" "cityPocatello" ## [28] "cityHartford" "cityDetroit" "cityFranconia" ## [31] "citySaint Louis Park" "cityElizabeth" "cityPearland" ## [34] "cityOverland Park" "cityFindlay" "cityApple Valley" ## [37] "cityNewport News" "cityMesa" "cityHemet" ## [40] "cityPassaic" "cityCharleston" "cityBlacksburg" ## [43] "citySioux City" "cityCary" "cityChico" ## [46] "cityCrystal Lake" "cityBolingbrook" "cityWalnut Creek" ## [49] "cityWestminster" "cityWaterbury" "cityFond du Lac" ## [52] "citySalt Lake City" "cityBridgewater" "cityCarson" ## [55] "cityWheaton-Glenmont" "cityNorth Bethesda" "cityWatsonville" ## [58] "cityGrand Island" "cityNashua" "cityDothan" ## [61] "cityKalamazoo" "cityEllicott City" "cityEau Claire" ## [64] "cityGilbert" "cityPasco" "cityHoboken" ## [67] "cityMontgomery Village" "cityHilo" "cityJupiter" ## [70] "cityTopeka" "cityPittsburgh" "cityGlendora" ## [73] "cityBoynton Beach" "cityFort Myers" "cityWestland" ## [76] "cityYonkers" "cityFort Worth" "cityMinnetonka" ## [79] "citySarasota" "cityConcord" "cityBattle Creek" ## [82] "cityWest Sacramento" "cityShelby" "cityAlameda" ## [85] "cityBroken Arrow" "cityJoliet" "cityGarden Grove" ## [88] "citySan Clemente" "cityPharr" "cityAmes" ## [91] "cityTurlock" "cityFort Smith" "cityWoodland" ## [94] "citySan Buenaventura" "cityBeaverton" "cityFort Wayne" ## [97] "cityAthens-Clarke" "cityMansfield" "cityMinneapolis" ## [100] "cityToledo" "cityKendall" "cityBloomfield" ## [103] "cityBryan" # Hashed values hashed.value(names(mapping)) ## [1] -1783827961 -1554688119 -1885087034 903710395 -433534873 -1698157690 ## [7] 1000408268 -1738513785 327611540 744213640 101526589 -97403553 ## [13] 305346091 -1622681438 -1895904773 -1797244496 -775669476 -1581269032 ## [19] -583515489 823993455 -475980681 1164925682 -1986211414 -1614513926 ## [25] 664810711 -283418510 -1627492277 594557760 -873606777 -1364185228 ## [31] -1160352055 -1874159752 -1049081229 523278412 -1020228123 1555812291 ## [37] -1468685840 -1311498863 407357228 2002198630 1444597631 176643076 ## [43] 1486456091 -489505536 -40219489 86822926 1912518627 1934415346 ## [49] 229200030 1718684157 -212552741 -1918227473 356556145 -434725725 ## [55] 311621850 1373663104 1322620681 285510556 1916763754 273751018 ## [61] 1583117190 949246687 -1883264844 -1177970518 1718097732 -951650603 ## [67] -1766063377 -1433688539 1350976259 -563052972 1423031903 -1800689849 ## [73] 232908642 -1215777705 1600270663 266027690 -1060761545 -1107504569 ## [79] -588399577 1817885574 -1773860050 -756120522 743992857 1583240399 ## [85] -741470805 1905324187 320817142 -2121018767 -1079400486 1445605708 ## [91] -676887260 221591167 -1991619194 20030044 -912243463 172128751 ## [97] -1909238049 -1076346490 -1617920854 -1900950692 1677793300 -822743039 ## [103] -201665694 # Testing with xgboost bst <- xgboost(data = m.mat, label = df$target, nround = 10, params = list( booster = "gbtree", objective = "binary:logistic", eval_metric = "auc" )) ## [1] train-auc:0.785905 ## [2] train-auc:0.792083 ## [3] train-auc:0.796824 ## [4] train-auc:0.798893 ## [5] train-auc:0.799901 ## [6] train-auc:0.801744 ## [7] train-auc:0.802468 ## [8] train-auc:0.803876 ## [9] train-auc:0.805856 ## [10] train-auc:0.806401 detach("package:xgboost", unload = TRUE) detach("package:FeatureHashing", unload = TRUE)

#### Helmert

Helmert encoding is a form of contrast encoding where each value of a categorical variable is compared to the mean of the subsequent levels. A nice explanation can be found in this StackOverlow answer by user *StatsStudent* and the comments by user *whuber*. The implementation below is inspired by one of the code samples from *whuber*, while keeping it consistent with the values output by the *category_encoders* module for use with *scikit-learn* in Python. The output of the *encode_helmert* function below is a data frame which can be merged back to the original data to get the encoding for all values in the data.

helmert <- function(n) { m <- t((diag(seq(n-1, 0)) - upper.tri(matrix(1, n, n)))[-n,]) t(apply(m, 1, rev)) } encode_helmert <- function(df, var) { x <- df[[var]] x <- unique(x) n <- length(x) d <- as.data.frame(helmert(n)) d[[var]] <- rev(x) names(d) <- c(paste0(var, 1:(n-1)), var) d } d <- encode_helmert(df, "hml") d ## hml1 hml2 hml3 hml ## 1 0 0 3 <NA> ## 2 0 2 -1 High ## 3 1 -1 -1 Low ## 4 -1 -1 -1 Medium

#### Backward Difference

This is similar to Helmert encoding, except that in this system, the mean for one level of the categorical variable is compared to the mean for the prior adjacent level.

backward_difference <- function(n) { m <- matrix(0:(n-1), nrow = n, ncol = n) m <- m + upper.tri(matrix(1, n, n)) m2 <- matrix(-(n-1), n, n) m2[upper.tri(m2)] <- 0 m <- (m + m2) / n m <- (t(m))[, -n] m } encode_backward_difference <- function(df, var) { x <- df[[var]] x <- unique(x) n <- length(x) d <- as.data.frame(backward_difference(n)) d[[var]] <- x names(d) <- c(paste0(var, 1:(n-1)), var) d } d <- encode_backward_difference(df, "hml") d ## hml1 hml2 hml3 hml ## 1 -0.75 -0.5 -0.25 Medium ## 2 0.25 -0.5 -0.25 Low ## 3 0.25 0.5 -0.25 High ## 4 0.25 0.5 0.75 <NA>

#### Polynomial

It is again similar to the two contrast encoders above; it looks for the linear, quadratic and cubic trends in a categorical variable. The implementation of this encoding is simple using the *contr.poly* function from the *stats* package.

encode_polynomial <- function(df, var) { x <- df[[var]] x <- unique(x) n <- length(x) d <- as.data.frame(contr.poly(n)) d[[var]] <- x names(d) <- c(paste0(var, 1:(n-1)), var) d } d <- encode_polynomial(df, "hml") d ## hml1 hml2 hml3 hml ## 1 -0.6708204 0.5 -0.2236068 Medium ## 2 -0.2236068 -0.5 0.6708204 Low ## 3 0.2236068 -0.5 -0.6708204 High ## 4 0.6708204 0.5 0.2236068 <NA>

### Bayesian Encoders

#### Target

Target encoding is also very simple, where the encoded value of each value of a categorical variable is simply the mean of the target variable. The mean of the target is obtained by using the *aggregate* R function. Some noise can be added to the encoded value by specifying the *sigma* argument. If specified, it draws a random sample from the normal distribution with mean 1 and standard deviation equal to *sigma*, and multiplies the mean of the target with this random value. Note that the addition of this noise is usually done only on the training set and not the test set.

encode_target <- function(x, y, sigma = NULL) { d <- aggregate(y, list(factor(x, exclude = NULL)), mean, na.rm = TRUE) m <- d[is.na(as.character(d[, 1])), 2] l <- d[, 2] names(l) <- d[, 1] l <- l[x] l[is.na(l)] <- m if (!is.null(sigma)) { l <- l * rnorm(length(l), mean = 1, sd = sigma) } l } table(encode_target(df[["hml"]], df[["target"]]), df[["hml"]], useNA = "ifany") ## ## High Low Medium <NA> ## 0.0247218788627936 1618 0 0 0 ## 0.0945179584120983 0 0 1587 0 ## 0.323624595469256 0 1545 0 0 ## 0.5 0 0 0 250 new_df <- df new_df[["hml_encoded"]] <- encode_target(df[["hml"]], df[["target"]]) new_df[["hml_encoded2"]] <- encode_target(df[["hml"]], df[["target"]], sigma = 0.05) new_df[["city_encoded"]] <- encode_target(df[["city"]], df[["target"]]) head(new_df) ## city hml target hml_encoded hml_encoded2 city_encoded ## 1: Hartford Medium 0 0.09451796 0.09918551 0.3111111 ## 2: Athens-Clarke Low 0 0.32362460 0.33714512 0.2553191 ## 3: Pasco Low 1 0.32362460 0.31293654 0.1132075 ## 4: Nashua High 0 0.02472188 0.02665522 0.0862069 ## 5: Bellflower High 0 0.02472188 0.02271573 0.1320755 ## 6: Kendall Low 1 0.32362460 0.33760350 0.1914894

#### Leave One Out

This is very similar to target encoding, except that for each row we exclude that row in calculating the mean of the target. Some noise can be added to the encoded value by specifying the *sigma* argument. In the implementation below, this is done by iterating over each row and removing it prior to calculating the mean of the target. For binary targets, the implementation can be significantly optimized by noting that we can pre-calculate the totals by each unique value and the target. Removing that row is then only subtracting from the total for that value and target combination (the implementation is left as an exercise to the reader).

encode_leave_one_out <- function(x, y, sigma = NULL) { n <- length(x) x[is.na(x)] <- "__MISSING" x2 <- vapply(1:n, function(i) { xval <- x[i] yloo <- y[-i] xloo <- x[-i] yloo <- yloo[xloo == xval] mean(yloo, na.rm = TRUE) }, numeric(1)) if (!is.null(sigma)) { x2 <- x2 * rnorm(n, mean = 1, sd = sigma) } x2 } new_df <- df new_df[["hml_encoded"]] <- encode_leave_one_out(df[["hml"]], df[["target"]]) set.seed(1123) new_df[["city_encoded"]] <- encode_leave_one_out(df[["city"]], df[["target"]], sigma = 0.05) head(new_df) ## city hml target hml_encoded city_encoded ## 1: Hartford Medium 0 0.09457755 0.32939353 ## 2: Athens-Clarke Low 0 0.32383420 0.27531012 ## 3: Pasco Low 1 0.32318653 0.10067333 ## 4: Nashua High 0 0.02473717 0.08372383 ## 5: Bellflower High 0 0.02473717 0.14392368 ## 6: Kendall Low 1 0.32318653 0.17560372

#### Weight of Evidence

Weight of evidence encoding is very popular in the field of credit scoring (credit risk modelling). In this, the encoded value is the logarithm of the odds of the target. The odds is defined to be the percentage of records with target value 0 divided by the percentage of records with target value 1. The implementation below is valid only for binary targets. The formula to calculate weight of evidence can also be modified for continous target (see link), but that implementation is left as an exercise to the reader. In the field of credit scoring, it is also common practice to group related values together and then calculate the weight of evidence. For example, if there is a field called *marital_status*, one may decide to combine the values of *Divorced* and *Separated* before calculating weight of evidence if the empirical data suggests that they have very similar target rates.

encode_woe <- function(x, y, sigma = NULL) { d <- aggregate(y, list(factor(x, exclude = NULL)), mean, na.rm = TRUE) d[["woe"]] <- log(((1 / d[, 2]) - 1) * (sum(y) / sum(1-y))) m <- d[is.na(as.character(d[, 1])), 3] l <- d[, 3] names(l) <- d[, 1] l <- l[x] l[is.na(l)] <- m if (!is.null(sigma)) { l <- l * rnorm(length(l), mean = 1, sd = sigma) } l } table(encode_woe(df[["hml"]], df[["target"]]), df[["hml"]], useNA = "ifany") ## ## High Low Medium <NA> ## -1.63607386968271 0 0 0 250 ## -0.898909803705993 0 1545 0 0 ## 0.623603722300056 0 0 1587 0 ## 2.03896017760917 1618 0 0 0 new_df <- df new_df[["hml_encoded"]] <- encode_woe(df[["hml"]], df[["target"]]) new_df[["hml_encoded2"]] <- encode_woe(df[["hml"]], df[["target"]], sigma = 0.05) new_df[["city_encoded"]] <- encode_woe(df[["city"]], df[["target"]]) head(new_df) ## city hml target hml_encoded hml_encoded2 city_encoded ## 1: Hartford Medium 0 0.6236037 0.6259107 -0.8411440 ## 2: Athens-Clarke Low 0 -0.8989098 -0.8513090 -0.5656325 ## 3: Pasco Low 1 -0.8989098 -0.8585435 0.4223143 ## 4: Nashua High 0 2.0389602 2.0354463 0.7247801 ## 5: Bellflower High 0 2.0389602 2.1024377 0.2466574 ## 6: Kendall Low 1 -0.8989098 -0.8857241 -0.1957123

#### James Stein

A good explanation of this encoding can be found in this article. This method is similar to target encoding but the encoded value is shrunk towards the overall population mean of the target. So the encoded value is a weighted sum of the target mean for that category and the population mean.

encode_james_stein <- function(x, y, sigma = NULL) { n_all <- length(y) p_all <- mean(y) var_all <- (p_all * (1 - p_all)) / n_all d <- aggregate(y, list(factor(x, exclude = NULL)), mean, na.rm = TRUE) d2 <- aggregate(y, list(factor(x, exclude = NULL)), length) g <- names(d)[1] d <- merge(d, d2, by = g, all = TRUE) d[, 4] <- (d[, 2] * (1 - d[, 2])) / d[, 3] d[, 5] <- d[, 4] / (d[, 4] + var_all) d[, 6] <- (1 - d[, 5]) * d[, 2] + d[, 5] * p_all m <- d[is.na(as.character(d[, 1])), 6] l <- d[, 6] names(l) <- d[, 1] l <- l[x] l[is.na(l)] <- m if (!is.null(sigma)) { l <- l * rnorm(length(l), mean = 1, sd = sigma) } l } table(encode_james_stein(df[["hml"]], df[["target"]]), df[["hml"]], useNA = "ifany") ## ## High Low Medium <NA> ## 0.0735644532521975 1618 0 0 0 ## 0.13999163072412 0 0 1587 0 ## 0.171951205029329 0 0 0 250 ## 0.18893950763687 0 1545 0 0 new_df <- df new_df[["hml_encoded"]] <- encode_james_stein(df[["hml"]], df[["target"]]) new_df[["hml_encoded2"]] <- encode_james_stein(df[["hml"]], df[["target"]], sigma = 0.05) new_df[["city_encoded"]] <- encode_james_stein(df[["city"]], df[["target"]])

#### M Estimator

It is similar to the James Stein encoder where the target mean for each unique value is adjusted using the population mean using one additional parameter *m*. The default value of *m* is 1.

encode_m_estimator <- function(x, y, m = 1, sigma = NULL) { p_all <- mean(y) d <- aggregate(y, list(factor(x, exclude = NULL)), sum, na.rm = TRUE) d2 <- aggregate(y, list(factor(x, exclude = NULL)), length) g <- names(d)[1] d <- merge(d, d2, by = g, all = TRUE) d[, 4] <- (d[, 2] + p_all * m) / (d[, 3] + m) m <- d[is.na(as.character(d[, 1])), 4] l <- d[, 4] names(l) <- d[, 1] l <- l[x] l[is.na(l)] <- m if (!is.null(sigma)) { l <- l * rnorm(length(l), mean = 1, sd = sigma) } l } table(encode_m_estimator(df[["hml"]], df[["target"]]), df[["hml"]], useNA = "ifany") ## ## High Low Medium <NA> ## 0.0248072884496603 1618 0 0 0 ## 0.0945610831234257 0 0 1587 0 ## 0.323520698576973 0 1545 0 0 ## 0.498657370517928 0 0 0 250 new_df <- df new_df[["hml_encoded"]] <- encode_m_estimator(df[["hml"]], df[["target"]]) new_df[["hml_encoded2"]] <- encode_m_estimator(df[["hml"]], df[["target"]], sigma = 0.05) new_df[["city_encoded"]] <- encode_m_estimator(df[["city"]], df[["target"]])

### Github Repository

All code is available in this Github repository. It may also be worthwhile to look at the category_encoders Python package.

**leave a comment**for the author, please follow the link and comment on their blog:

**Anindya Mozumdar**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.