(This article was first published on **sigmafield - R**, and kindly contributed to R-bloggers)

*This post originally appeared on my WordPress blog on September 23, 2009. I present it here in its original form.*

The **R Function of the Day** series describes in *plain language* how certain R functions work, using simple examples that you can apply to gain insight into your own data.

Today, I will discuss the **cut** function.

### What situation is cut useful in?

In many data analysis settings, it might be useful to break up a continuous variable such as age into a categorical variable. Or, you might want to group a discrete variable like year into larger bins, such as 1990-2000. There are many reasons *not* to do this when performing regression analysis, but for simple displays of demographic data in tables, it could make sense. The **cut** function in R makes this task simple!

### How do I use cut?

First, we will simulate some data from a hypothetical clinical trial that includes variables for patient ID, age, and year of enrollment.

```r
> ## generate data for clinical trial example
> clinical.trial <-
      data.frame(patient = 1:100,
                 age = rnorm(100, mean = 60, sd = 8),
                 year.enroll = sample(paste("19", 85:99, sep = ""),
                                      100, replace = TRUE))
> summary(clinical.trial)
    patient            age         year.enroll
 Min.   :  1.00   Min.   :41.18   1991   :12
 1st Qu.: 25.75   1st Qu.:52.99   1988   :11
 Median : 50.50   Median :60.08   1985   : 9
 Mean   : 50.50   Mean   :59.67   1993   : 7
 3rd Qu.: 75.25   3rd Qu.:65.67   1995   : 7
 Max.   :100.00   Max.   :76.40   1997   : 7
                                  (Other):47
```
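If you want to rerun the simulation yourself, two details matter. The draw is random, so your counts will not match the ones printed in this post, and since R 4.0.0 `data.frame` no longer converts character columns to factors by default. The sketch below is one way to get a repeatable data set with a factor `year.enroll`. The seed value and the name `clinical.trial.seeded` are my own assumptions, chosen only for illustration.

```r
## repeatable version of the simulation (seed value is arbitrary)
set.seed(2009)

clinical.trial.seeded <- data.frame(
  patient = 1:100,
  age = rnorm(100, mean = 60, sd = 8),
  year.enroll = sample(paste("19", 85:99, sep = ""), 100, replace = TRUE),
  stringsAsFactors = TRUE  # needed in R >= 4.0.0 to get a factor column
)

## check the column types: patient is integer, age numeric, year.enroll factor
str(clinical.trial.seeded)
```

All counts shown in the rest of the post come from the original, unseeded draw.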

Now, we will use the **cut** function to make age a factor, which is what R calls a categorical variable. Our first example calls cut with the **breaks** argument set to a single number, 4. This causes **cut** to break up age into 4 intervals of equal width. The default labels use standard mathematical notation for open and closed intervals: (a,b] excludes a and includes b.

```r
> ## basic usage of cut with a numeric variable
> c1 <- cut(clinical.trial$age, breaks = 4)
> table(c1)
c1
  (41.1,50]   (50,58.8] (58.8,67.6] (67.6,76.4]
          9          34          41          16
> ## year.enroll is a factor, so must convert to numeric first!
> c2 <- cut(as.numeric(as.character(clinical.trial$year.enroll)), breaks = 3)
> table(c2)
c2
(1985,1990] (1990,1994] (1994,1999]
         36          34          30
```
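The comment about converting the factor deserves a quick illustration. Calling `as.numeric` directly on a factor returns the internal level codes, not the values you see printed, which is why the example goes through `as.character` first. Here is a minimal sketch using a tiny made-up factor (the vector `year.f` is hypothetical, not part of the trial data):

```r
## a small factor of enrollment years (illustrative values only)
year.f <- factor(c("1990", "1985", "1999", "1985"))

## as.numeric() alone returns the internal level codes
codes <- as.numeric(year.f)
codes
## [1] 2 1 3 1

## converting to character first recovers the actual years
years <- as.numeric(as.character(year.f))
years
## [1] 1990 1985 1999 1985
```

Passing the level codes to **cut** would run without error, but the resulting intervals would describe the codes rather than the years.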

Well, the intervals that **cut** chose by default do not have very readable endpoints in the age example, although they are fine in the year example, since year was already discrete. Luckily, we can specify the exact intervals we want for age. Our next example shows how.

```r
> ## specify break points explicitly using seq function
>
> ## look what seq does
> seq(30, 80, by = 10)
[1] 30 40 50 60 70 80
> ## cut the age variable using the seq defined above
> c1 <- cut(clinical.trial$age, breaks = seq(30, 80, by = 10))
> ## table of the resulting factor
> table(c1)
c1
(30,40] (40,50] (50,60] (60,70] (70,80]
      0       9      40      42       9
```

That looks pretty good. There is no reason that the breaks argument has to be equally spaced as I have done above. It could be any grouping that you want.
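As a small sketch of unequal spacing, the example below cuts a handful of made-up ages (the vector `ages` is hypothetical) into three groups of very different widths, using `-Inf` and `Inf` as open-ended outer breaks. It also shows how values that sit exactly on a break are assigned, which depends on the **right** argument.

```r
## illustrative ages, not the simulated trial data
ages <- c(23, 40, 41, 59.5, 65, 65.1, 88)

## three unequal intervals: up to 40, over 40 up to 65, over 65
grp <- cut(ages, breaks = c(-Inf, 40, 65, Inf))
table(grp)            # counts per interval: 2, 3, 2

## intervals are closed on the right by default,
## so 40 and 65 fall into the lower of their two neighbors
as.integer(grp)
## [1] 1 1 2 2 2 3 3

## with right = FALSE the intervals are closed on the left instead,
## so 40 and 65 move up one group
grp.left <- cut(ages, breaks = c(-Inf, 40, 65, Inf), right = FALSE)
as.integer(grp.left)
## [1] 1 2 2 2 3 3 3
```

With finite outer breaks, any value outside the range of the breaks is returned as `NA`, so open-ended breaks are one way to make sure every observation lands in a group.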

Finally, I am going to show you an example of a custom R function to categorize ages. It uses **cut** inside of it, but does some preprocessing and uses the **labels** argument to cut to make the output look nice.

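```r
age.cat <- function(x, lower = 0, upper, by = 10,
                    sep = "-", above.char = "+") {

  ## labels such as "30-39", plus a final open-ended label such as "70+"
  labs <- c(paste(seq(lower, upper - by, by = by),
                  seq(lower + by - 1, upper - 1, by = by),
                  sep = sep),
            paste(upper, above.char, sep = ""))

  ## floor the ages, then cut into left-closed intervals [lower, lower + by), ...
  cut(floor(x), breaks = c(seq(lower, upper, by = by), Inf),
      right = FALSE, labels = labs)
}
```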

This function categorizes age in a fairly flexible way. The assignment to **labs** inside the function creates a vector of labels such as "30-39", plus a final open-ended label such as "70+". Then, the **cut** function is called to do the work, with the custom labels as an argument. Setting **right = FALSE** makes each interval include its lower bound and exclude its upper bound, which matches the labels. Here are some examples using our simulated data from above. I am no longer going to save the results of the function calls to a variable and call **table** on them, but rather just nest the call to **age.cat** in a call to **table**. I previously did a post on the table function.

```r
> ## only specifying an upper bound, uses 0 as lower bound, and
> ## breaks up categories by 10
> table(age.cat(clinical.trial$age, upper = 70))

  0-9 10-19 20-29 30-39 40-49 50-59 60-69   70+
    0     0     0     0     9    40    42     9
> ## now specifying a lower bound
> table(age.cat(clinical.trial$age, lower = 30, upper = 70))

30-39 40-49 50-59 60-69   70+
    0     9    40    42     9
> ## now specifying a lower bound AND the "by" argument
> table(age.cat(clinical.trial$age, lower = 30, upper = 70, by = 5))

30-34 35-39 40-44 45-49 50-54 55-59 60-64 65-69   70+
    0     0     3     6    22    18    22    20     9
```
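If the label construction inside **age.cat** feels opaque, it may help to run its pieces by hand for one set of arguments. The sketch below repeats the function's own expressions for lower = 30, upper = 70, and by = 10. The intermediate names `starts`, `ends`, and `brks` are mine, introduced only for illustration.

```r
## reproduce the label construction from age.cat for one set of arguments
lower <- 30; upper <- 70; by <- 10

starts <- seq(lower, upper - by, by = by)          # 30 40 50 60
ends   <- seq(lower + by - 1, upper - 1, by = by)  # 39 49 59 69

labs <- c(paste(starts, ends, sep = "-"), paste(upper, "+", sep = ""))
labs
## [1] "30-39" "40-49" "50-59" "60-69" "70+"

## cut needs one more break than labels
brks <- c(seq(lower, upper, by = by), Inf)         # 30 40 50 60 70 Inf
length(brks) - length(labs)
## [1] 1

## floor() plus right = FALSE puts 39.9 in "30-39" and 40 in "40-49"
res <- cut(floor(c(39.9, 40, 75)), breaks = brks, right = FALSE, labels = labs)
as.character(res)
## [1] "30-39" "40-49" "70+"
```

One consequence of this design is that ages below **lower** fall outside every interval and come back as `NA`, so choose **lower** at or below the smallest age in your data.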

### Summary of cut

The **cut** function is useful for turning continuous variables into factors. You saw how to specify the number of intervals, how to specify the exact cutpoints, and how a function built around **cut** can simplify categorizing an age variable and giving it appropriate labels.
