# Slight inconsistency between forcats’ fct_lump_min and fct_lump_prop

[This article was first published on **R – Statistical Odds & Ends**, and kindly contributed to R-bloggers.]

I recently noticed a slight inconsistency between the `forcats` package’s `fct_lump_min` and `fct_lump_prop` functions. (I’m working with v0.5.1, which is the latest version at the time of writing.) These functions lump levels that meet a certain criterion into an “Other” level. According to the documentation, `fct_lump_min` “lumps levels that appear fewer than `min` times”, and `fct_lump_prop` “lumps levels that appear fewer than `prop * n` times”, where `n` is the length of the factor variable.

Let’s try this out in an example:

    library(forcats)

    x <- factor(c(rep(1, 6), rep(2, 3), rep(3, 1)))
    x
    # [1] 1 1 1 1 1 1 2 2 2 3
    # Levels: 1 2 3

    fct_lump_min(x, min = 3)
    # [1] 1     1     1     1     1     1     2     2     2     Other
    # Levels: 1 2 Other

The levels 1 and 2 appear at least 3 times, and so they are not converted to the “Other” level. The level 3 appears just once, and so is converted.

What do you think this line of code returns?

    fct_lump_prop(x, prop = 0.3)

Since `prop * n = 0.3 * 10 = 3`, the documentation suggests that only level 3, which appears just once, should be converted to “Other”. However, that is NOT the case:

    fct_lump_prop(x, prop = 0.3)
    # [1] 1     1     1     1     1     1     Other Other Other Other
    # Levels: 1 Other

The level 2 appears exactly 3 times, and is converted into “Other”, contrary to what the documentation says.
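To see why level 2 sits exactly on the boundary, it can help to tabulate the counts and proportions by hand. The snippet below is a minimal base R sketch (it does not call `forcats`); the variable names `counts`, `props` and `threshold` are my own, chosen for illustration.

```r
x <- factor(c(rep(1, 6), rep(2, 3), rep(3, 1)))

counts <- as.vector(table(x))  # occurrences per level, in the order of levels(x)
props  <- counts / length(x)   # proportion per level

counts  # 6 3 1
props   # 0.6 0.3 0.1

# Threshold implied by the documentation: prop * n
threshold <- 0.3 * length(x)   # 3

# Levels that appear fewer than prop * n times: only level 3 qualifies
levels(x)[counts < threshold]  # "3"
```

Level 2 has a count equal to the threshold, so whether it is lumped depends entirely on whether the comparison is strict.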

Digging into the source code of `fct_lump_prop`, you will find this line of code:

    new_levels <- ifelse(prop_n > prop, levels(f), other_level)

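`prop_n` is the proportion of times each factor level appears. A level keeps its name only when its proportion is strictly greater than `prop`, so a level whose proportion is exactly `prop` gets lumped. If the documentation is correct, the greater-than sign should really be a greater-than-or-equal sign.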
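The effect of that one comparison operator can be reproduced without `forcats`. The function below is a simplified base R sketch, not the package's implementation: `lump_by_prop` and its `strict` argument are hypothetical names, and it skips details the real function handles (weights, negative `prop`, reordering of the “Other” level).

```r
# Hypothetical helper (not part of forcats): lump levels by proportion.
# strict = TRUE keeps levels with prop_n >  prop (the comparison quoted above);
# strict = FALSE keeps levels with prop_n >= prop (what the documentation implies).
lump_by_prop <- function(f, prop, strict = TRUE, other_level = "Other") {
  prop_n <- as.vector(table(f)) / length(f)
  keep <- if (strict) prop_n > prop else prop_n >= prop
  new_levels <- ifelse(keep, levels(f), other_level)
  # Assigning duplicated level names merges the corresponding levels
  levels(f) <- new_levels
  f
}

x <- factor(c(rep(1, 6), rep(2, 3), rep(3, 1)))

levels(lump_by_prop(x, prop = 0.3))                  # "1" "Other"
levels(lump_by_prop(x, prop = 0.3, strict = FALSE))  # "1" "2" "Other"
```

With the strict comparison, level 2 is lumped, matching the output of `fct_lump_prop` shown earlier; with `>=`, only level 3 is lumped, matching the documented wording.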

So what’s the right fix here? One way is to fix the documentation to say that `fct_lump_prop` “lumps levels that appear at most `prop * n` times”, but that breaks consistency with `fct_lump_min`. Another is to make the fix in the code as suggested above, but that will change code behavior. Either is probably fine, and it’s up to the package owner to decide which makes more sense.
