# Proportions with mean()

[This article was first published on **blogR**, and kindly contributed to R-bloggers.]

One of the most common tasks I want to do is calculate the proportion of observations (e.g., rows in a data set) that meet a particular condition. For example, what is the proportion of missing data, or people over the age of 18?

There is a surprisingly easy solution to this problem: combining boolean vectors with `mean()`.

## Step 1: creating a boolean vector

We start with a boolean vector: a vector that is `TRUE` whenever an observation meets our condition and `FALSE` whenever it doesn’t. We create this boolean vector by passing our observations through some sort of conditional statement (or a relevant function like `is.na()`). Let’s take a look at a few examples:

```r
x <- letters[1:10]
x == "b"  # return a boolean vector which is TRUE whenever x is "b"
#>  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

x <- 1:10
x > 5  # TRUE whenever x is greater than 5
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

x > 5 & x %% 2 == 0  # TRUE when x > 5 AND divisible by 2
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE

x <- c(1, 2, NA, 4)
is.na(x)  # TRUE when x is a missing value
#> [1] FALSE FALSE  TRUE FALSE
```

If you’re unsure how the above works, take a look at this page on R Programming Operators.

With this under our belt, it’s simple enough to create a boolean vector that tells us whether each observation meets some condition (`TRUE`) or not (`FALSE`).
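The examples above use `==`, `>`, `&` and `is.na()`. As a small supplementary sketch (not from the original post), a few other base R operators also return boolean vectors; the variable names here are made up for illustration:

```r
x <- 1:10

# %in% tests membership in a set of values
in_set <- x %in% c(2, 4, 11)
in_set
#>  [1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

# | is element-wise OR: TRUE when either condition holds
low_or_high <- x < 3 | x > 8
low_or_high
#>  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

# ! negates a boolean vector
not_high <- !(x > 8)
not_high
#>  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
```

Any of these can be used wherever a conditional statement is needed.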

## Step 2: calculating the proportion of TRUE

From this point, all we need to do is wrap our conditional statement inside `mean()`:

```r
x <- 1:10
mean(x > 5)  # proportion of values in x greater than 5
#> [1] 0.5
```

How and why does this work? If you take a look at the help page with `?mean`, you’ll read that the argument `x` can be a logical vector. But what does this mean? Well, when you use a boolean vector, `mean()` first converts it to a numeric vector. This means that every `TRUE` becomes `1`, and every `FALSE` becomes `0`:

```r
x <- 1:10
as.numeric(x > 5)
#>  [1] 0 0 0 0 0 1 1 1 1 1
```

It then computes the mean of these 1’s and 0’s. At this point, you just need to think a little. How is the mean calculated? It’s the sum of all the values divided by how many values there are. The sum of a vector of 1’s and 0’s is the total number of 1’s, so dividing by the length gives you the proportion of 1’s, that is, the proportion of `TRUE`. As a side note, you can use `sum()` instead of `mean()` if you want the count rather than the proportion. Let’s break this right down:

```r
x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10

test <- x > 5
test
#>  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

as.numeric(test)
#>  [1] 0 0 0 0 0 1 1 1 1 1

sum(test)
#> [1] 5

length(test)
#> [1] 10

sum(test) / length(test)
#> [1] 0.5

mean(test)
#> [1] 0.5
```
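One caveat the breakdown above does not cover is missing values. A comparison involving `NA` returns `NA`, so the mean of the resulting boolean vector is `NA` too unless you pass `na.rm = TRUE`. The sketch below uses a made-up `ages` vector (an assumption for illustration, not data from the post) to compute both a proportion of missing data and a proportion of people over 18:

```r
# Hypothetical ages with two missing values
ages <- c(25, 17, NA, 40, 12, NA, 33, 19)

# Proportion of missing values
prop_missing <- mean(is.na(ages))
prop_missing
#> [1] 0.25

# Comparing NA with a number gives NA, so the mean is NA as well
mean(ages > 18)
#> [1] NA

# Drop the NAs: proportion over 18 among the observed ages only
prop_adult <- mean(ages > 18, na.rm = TRUE)
prop_adult
#> [1] 0.6666667
```

Note that with `na.rm = TRUE` the denominator is the number of non-missing values (6 here), not the full length of the vector.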

## Some useful examples

At this point, we can apply this to all sorts of problems. Here are some examples using the mtcars data set:

```r
d <- mtcars
head(d)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# Proportion of rows (cars) with cyl == 6 (6 cylinders)
mean(d$cyl == 6)
#> [1] 0.21875

# Proportion of rows (cars) with hp > 250 (horsepower over 250)
mean(d$hp > 250)
#> [1] 0.0625

# Proportion of cars with 8 cylinders that get more than 15 Miles/(US) gallon
mean(d$cyl == 8 & d$mpg > 15)
#> [1] 0.25
```
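The same idea extends to several groups or columns at once. As a rough sketch using only base R (the 20 mpg threshold and the variable names are arbitrary choices for illustration), `tapply()` applies `mean()` to a boolean vector within each group, and `colMeans()` averages each column of the boolean matrix returned by `is.na()`:

```r
d <- mtcars

# Proportion of cars with mpg > 20 within each cylinder group
prop_by_cyl <- tapply(d$mpg > 20, d$cyl, mean)
prop_by_cyl

# Proportion of missing values in every column
# (mtcars has no missing values, so every entry should be 0)
prop_na <- colMeans(is.na(d))
prop_na
```

`prop_by_cyl` is a named vector with one proportion per value of `cyl`, and `prop_na` has one proportion per column of `d`.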

## Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at [email protected] to get in touch.

If you’d like the code that produced this blog, check out my GitHub repository, blogR.
