R Function of the Day: cut

September 23, 2009

(This article was first published on sigmafield - R, and kindly contributed to R-bloggers)

This post originally appeared on my Wordpress blog on September 23, 2009. I present it here in its original form.

The R Function of the Day series describes in plain language how certain R functions work, using simple examples that you can apply to gain insight into your own data.

Today, I will discuss the cut function.

What situation is cut useful in?

In many data analysis settings, it can be useful to break up a continuous variable such as age into a categorical variable. Or you might want to group a discrete variable like year into larger bins, such as 1990-2000. There are many reasons not to do this when performing regression analysis, but for simple displays of demographic data in tables, it can make sense. The cut function in R makes this task simple!

How do I use cut?

First, we will simulate some data from a hypothetical clinical trial that includes variables for patient ID, age, and year of enrollment.


> ## generate data for clinical trial example
> clinical.trial <-
    data.frame(patient = 1:100,              
               age = rnorm(100, mean = 60, sd = 8),
               year.enroll = sample(paste("19", 85:99, sep = ""),
                 100, replace = TRUE))
> summary(clinical.trial)
    patient            age         year.enroll
 Min.   :  1.00   Min.   :41.18   1991   :12  
 1st Qu.: 25.75   1st Qu.:52.99   1988   :11  
 Median : 50.50   Median :60.08   1985   : 9  
 Mean   : 50.50   Mean   :59.67   1993   : 7  
 3rd Qu.: 75.25   3rd Qu.:65.67   1995   : 7  
 Max.   :100.00   Max.   :76.40   1997   : 7  
                                  (Other):47   
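
Because the data are random draws, a fresh run will not reproduce the exact numbers shown above. The following is a minimal sketch of a repeatable variant, under two assumptions that are not part of the original post: the seed value 2009 is arbitrary, and stringsAsFactors = TRUE is passed explicitly because R 4.0.0 and later no longer convert strings to factors by default.

```r
## reproducible variant of the simulation; the seed is an arbitrary choice
set.seed(2009)
clinical.trial <-
    data.frame(patient = 1:100,
               age = rnorm(100, mean = 60, sd = 8),
               year.enroll = sample(paste("19", 85:99, sep = ""),
                 100, replace = TRUE),
               stringsAsFactors = TRUE)  # keep year.enroll a factor in R >= 4.0.0
str(clinical.trial)
```

With a different seed the values will differ from the summary above, but the structure (an integer ID, a numeric age, and a factor for year of enrollment) is the same.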

Now, we will use the cut function to make age a factor, which is what R calls a categorical variable. Our first example calls cut with the breaks argument set to a single number. Here, that causes cut to break up age into 4 intervals of equal width. The default labels use standard mathematical notation for open and closed intervals: (50,58.8] excludes 50 and includes 58.8.


> ## basic usage of cut with a numeric variable
> c1 <- cut(clinical.trial$age, breaks = 4)
> table(c1)
c1
  (41.1,50]   (50,58.8] (58.8,67.6] (67.6,76.4] 
          9          34          41          16  
> ## year.enroll is a factor, so convert it to character and then to numeric;
> ## as.numeric alone would return the internal level codes, not the years
> c2 <- cut(as.numeric(as.character(clinical.trial$year.enroll)),
            breaks = 3)
> table(c2)
c2
(1985,1990] (1990,1994] (1994,1999] 
         36          34          30  
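
To see where those interval endpoints come from, here is a rough sketch of what cut does when breaks is a single number, following the description in ?cut: the range of the data is split into equal pieces and the two outer limits are moved outward by 0.1% of the range so that the minimum and maximum are included. The vector x and the name manual are made up for illustration and are not the simulated trial data.

```r
## small made-up vector, not the simulated trial data
x <- c(41, 50, 55, 60, 62, 70, 76)

## breaks = 4 written out by hand
rng <- range(x)
width <- diff(rng)
manual <- seq(rng[1], rng[2], length.out = 5)   # 5 cutpoints = 4 intervals
manual[1] <- rng[1] - width / 1000              # move the lower limit out by 0.1%
manual[5] <- rng[2] + width / 1000              # move the upper limit out by 0.1%

## compare the automatic and the hand-built binning
by.number <- cut(x, breaks = 4)
by.hand   <- cut(x, breaks = manual)
table(by.number)
all(as.integer(by.number) == as.integer(by.hand))
```

The labels printed for the two factors can differ slightly because cut rounds the endpoints for display, but each value lands in the same bin either way.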

The intervals that cut chose by default are not very tidy for age, since the endpoints fall at values like 58.8, although they are fine for year, which was already discrete. Luckily, we can specify the exact intervals we want for age. Our next example shows how.


> ## specify break points explicitly using seq function
> 
> ## look what seq does  
> seq(30, 80, by = 10)
[1] 30 40 50 60 70 80 
> ## cut the age variable using the seq defined above
> c1 <- cut(clinical.trial$age, breaks = seq(30, 80, by = 10))
> ## table of the resulting factor           
> table(c1)
c1
(30,40] (40,50] (50,60] (60,70] (70,80] 
      0       9      40      42       9  

That looks pretty good. The breaks do not have to be equally spaced as they are above; they can define any grouping that you want.
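
As a small illustration of unequal breaks, the sketch below uses a made-up vector of ages (not the simulated trial data) placed on or near the break points. It also shows two details worth knowing: values outside the breaks become NA, and the right argument controls which end of each interval is closed.

```r
## made-up ages, chosen to sit on or near the break points
ages <- c(17, 18, 39.5, 40, 64, 65, 90)

## unequally spaced breaks; intervals are closed on the right by default,
## so 18 is NOT in (18,40] and becomes NA along with 17
grp <- cut(ages, breaks = c(18, 40, 65, 120))
table(grp, useNA = "ifany")

## right = FALSE closes intervals on the left instead, so 18 now falls in
## the first group and 40 moves up into the second
grp.left <- cut(ages, breaks = c(18, 40, 65, 120), right = FALSE)
table(grp.left, useNA = "ifany")
```

If the lowest break should be included with the default right-closed intervals, cut also accepts include.lowest = TRUE.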

Finally, I am going to show you an example of a custom R function to categorize ages. It calls cut internally, but first builds a vector of labels and passes it to cut through the labels argument so that the output looks nice.

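age.cat <- function(x, lower = 0, upper, by = 10,
                    sep = "-", above.char = "+") {

  ## labels such as "30-39", plus an open-ended top label such as "70+"
  labs <- c(paste(seq(lower, upper - by, by = by),
                  seq(lower + by - 1, upper - 1, by = by),
                  sep = sep),
            paste(upper, above.char, sep = ""))

  ## right = FALSE gives intervals [lower, lower + by), ..., [upper, Inf);
  ## values below lower become NA
  cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
      right = FALSE, labels = labs)
}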

This function categorizes age in a fairly flexible way. The assignment to labs inside the function creates a vector of labels. Then, the cut function is called to do the work, with the custom labels as an argument. Here are some examples using our simulated data from above. Rather than saving the results of the function calls to a variable and calling table on them, I now nest the call to age.cat inside a call to table. I previously did a post on the table function.


> ## only specifying an upper bound, uses 0 as lower bound, and
> ## breaks up categories by 10
> table(age.cat(clinical.trial$age, upper = 70))
  0-9 10-19 20-29 30-39 40-49 50-59 60-69   70+ 
    0     0     0     0     9    40    42     9  
> ## now specifying a lower bound
> table(age.cat(clinical.trial$age, lower = 30, upper = 70))
30-39 40-49 50-59 60-69   70+ 
    0     9    40    42     9  
> ## now specifying a lower bound AND the "by" argument 
> table(age.cat(clinical.trial$age, lower = 30, upper = 70, by = 5))
30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69   70+ 
    0     0     3     6    22    18    22    20     9  
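
To check how age.cat treats values on the category boundaries, here is a short sketch with made-up ages (the vector edge.ages is an illustration, not part of the original post). The function definition is repeated only so that the block runs on its own.

```r
## age.cat as defined above, repeated so this block is self-contained
age.cat <- function(x, lower = 0, upper, by = 10,
                    sep = "-", above.char = "+") {
  labs <- c(paste(seq(lower, upper - by, by = by),
                  seq(lower + by - 1, upper - 1, by = by),
                  sep = sep),
            paste(upper, above.char, sep = ""))
  cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
      right = FALSE, labels = labs)
}

## made-up ages on and around the category boundaries
edge.ages <- c(29.9, 30, 39.9, 40, 69.9, 70, 95)
edge.cats <- age.cat(edge.ages, lower = 30, upper = 70)
data.frame(age = edge.ages, category = edge.cats)
```

An age of exactly 30 or 40 starts a new category, anything at or above upper goes into the open-ended top category, and an age below lower comes back as NA rather than being placed in the first category.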

Summary of cut

The cut function is useful for turning continuous variables into factors. You saw how to specify the number of intervals, how to specify the exact cutpoints, and how a function built around cut can simplify categorizing an age variable and giving it appropriate labels.
