A very common task in data processing is transforming numeric variables (continuous, discrete, etc.) into categorical ones by creating bins. For example, it is quite common to convert age to an age group. Let’s see how we can easily do that in R.
We will consider a random variable from the Poisson distribution with parameter λ=20:
library(dplyr)

# Generate 1000 observations from the Poisson distribution
# with lambda equal to 20
df <- data.frame(MyContinuous = rpois(1000, 20))

# Get the histogram
hist(df$MyContinuous)
Create specific Bins
Let’s say that you want to create the following bins:
- Bin 1: (-inf, 15]
- Bin 2: (15,25]
- Bin 3: (25, inf)
We can easily do that using the cut function. Let’s start:
df <- df %>%
  mutate(MySpecificBins = cut(MyContinuous, breaks = c(-Inf, 15, 25, Inf)))
head(df, 10)
Let’s have a look at the counts of each bin.
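One way to get these counts is sketched below with the count function of dplyr. The fixed seed and the name bin_counts are assumptions made for illustration, and the data frame is rebuilt so that the snippet runs on its own:

```r
library(dplyr)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))
df <- df %>%
  mutate(MySpecificBins = cut(MyContinuous, breaks = c(-Inf, 15, 25, Inf)))

# One row per bin, with the number of observations that fall into it
bin_counts <- df %>% count(MySpecificBins)
print(bin_counts)
```

With λ=20 most observations should land in the middle bin, but the exact counts depend on the random draw.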
Notice that you can also define your own labels within the cut function, using its labels argument.
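As a minimal sketch of custom labels, the labels argument of cut takes one name per bin. The label names "Low", "Medium" and "High", the column name MyLabelledBins and the seed are hypothetical choices, not part of the original example:

```r
library(dplyr)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))

# The label names are hypothetical; cut needs exactly one label per bin,
# i.e. one fewer than the number of breaks
df <- df %>%
  mutate(MyLabelledBins = cut(MyContinuous,
                              breaks = c(-Inf, 15, 25, Inf),
                              labels = c("Low", "Medium", "High")))
head(df, 10)
```

The labels are assigned in the order of the bins, so "Low" corresponds to (-inf, 15] and "High" to (25, inf).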
Create Bins based on Quantiles
Let’s say that you want each bin to hold the same number of observations, for example 4 bins with 25% of the observations each. We can easily do it as follows:
numbers_of_bins <- 4
df <- df %>%
  mutate(MyQuantileBins = cut(MyContinuous,
                              breaks = unique(quantile(MyContinuous, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                              include.lowest = TRUE))
head(df, 10)
We can check whether the MyQuantileBins contain the same number of observations, and also look at their ranges:
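A possible way to run this check is to summarise each bin with its count, minimum and maximum. The summary column names, the name quantile_summary and the seed are assumptions made for this sketch:

```r
library(dplyr)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))

numbers_of_bins <- 4
df <- df %>%
  mutate(MyQuantileBins = cut(MyContinuous,
                              breaks = unique(quantile(MyContinuous, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                              include.lowest = TRUE))

# Count, smallest and largest value observed in each quantile bin
quantile_summary <- df %>%
  group_by(MyQuantileBins) %>%
  summarise(n = n(),
            min_value = min(MyContinuous),
            max_value = max(MyContinuous))
print(quantile_summary)
```

Because Poisson values are integers with many ties, the bins will usually hold roughly, rather than exactly, 25% of the observations each.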
Notice that if you want to split your continuous variable into bins with an equal number of observations, you can also use the ntile function of the dplyr package, but it does not create bin labels based on the ranges.
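A short sketch of the ntile approach follows. The column name MyNtileBins, the name ntile_counts and the seed are illustrative assumptions:

```r
library(dplyr)

set.seed(123)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(MyContinuous = rpois(1000, 20))

# ntile returns the bin number (1 to 4) instead of a range label
df <- df %>% mutate(MyNtileBins = ntile(MyContinuous, 4))

# Number of observations per bin
ntile_counts <- df %>% count(MyNtileBins)
print(ntile_counts)
```

Here every bin gets exactly 250 observations, because ntile assigns rows by rank. The trade-off is that tied values can end up in different bins, so the same number may appear in two neighbouring bins.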