# Introduction to missing data (NAs) in R

This article was first published on **R on R (for ecology)**, and kindly contributed to R-bloggers.


As many of us know, science is not a perfect process. Maybe you can’t get out in the field on a certain day. Maybe you can only sample a portion of what needs to get done. Or maybe you’re downloading public data sets and they aren’t lining up perfectly. All of these can result in missing data, which can be a real pain when it comes time for analysis.

Another common source of missing data, especially when recording species abundance data in community ecology, is when you forget to write a ‘0’ and instead leave the entry blank. In the moment you might know that blank entries mean zero, but give it just a few weeks and you’ll be scratching your head! In those cases it’s often best to label those entries as unknown or missing.

In this tutorial, I’m going to explain what exactly an `NA` value is, how you can find `NA`s in your data, and how you can remove them.

### What does it mean to have NAs in my data?

`NA`s represent missing values in R. This is pretty common if you’re importing data from Excel and have some empty cells in the spreadsheet. When you load the data into R, empty cells in numeric columns will be populated with `NA`s.

`NA` stands for ‘Not Available’ in R. In fact, you’ll notice the color change when you type `NA` in your code, since R already knows what that means.

```r
# Read in an example data set with NAs
ex <- read.csv("example_data.csv")

# View data
ex
##   example data set
## 1       1    2   4
## 2      NA    2   4
## 3      16    1   4
## 4       2   NA   5
## 5       3    1  NA
## 6       6    7   8
```

**Click here to download the example_data.csv file** if you want to follow along.
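
If you can’t download the file, a minimal sketch like the one below should give you the same data frame directly in R. The column names and values are copied from the printed output above. The `csv_text` string and the `ex_from_text` name are only illustrative assumptions about what the CSV might look like, with empty fields standing in for blank spreadsheet cells.

```r
# Build the example data frame by hand (values taken from the output above)
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Hypothetical CSV contents: empty fields play the role of blank cells
csv_text <- "example,data,set
1,2,4
,2,4
16,1,4
2,,5
3,1,
6,7,8"

# read.csv() turns empty fields in numeric columns into NA
ex_from_text <- read.csv(text = csv_text)

# Check that both versions mark the same cells as missing
all(is.na(ex) == is.na(ex_from_text))
```

Either object can stand in for `ex` in the rest of the tutorial.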

`NA`s cannot be treated like other types of data (e.g., strings, numeric values). For example, you generally can’t perform math with them or use them in comparisons. If you do so, all you’ll get is an `NA`. In the following examples, all positions in the vector with `NA` just return `NA` again, no matter what operation is performed. We also get `NA` if we use mathematical functions such as `sum()` on the vector, because R can’t add `NA`s.

```r
# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)

# Can we do math with NAs?
v + 1
## [1] 2.2 5.5  NA 9.9  NA

sum(v)
## [1] NA

# Can we perform logical comparisons?
v < 7
## [1]  TRUE  TRUE    NA FALSE    NA

v == 4.5
## [1] FALSE  TRUE    NA FALSE    NA
```

And the reason of course is simple… What’s the answer to `5 + 'some unknown number'`?

Have you figured it out yet?

The answer is `'some unknown number'`! 😄

Thus: `5 + NA = NA`

### How can I detect NAs in my data?

So how can we see if we have `NA`s in our data? We normally use `==` to see if a value is equal to another one. Let’s see if that will work on our vector. We know that there are `NA`s in the 3rd and 5th positions of our vector.

```r
# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)
```

So theoretically, `v == NA` should return `FALSE FALSE TRUE FALSE TRUE`.

```r
# Are there any NAs in our vector?
v == NA
## [1] NA NA NA NA NA
```

But this code just gives us `NA`s. Unfortunately, `NA`s don’t work with comparison operators either.

Same as with math operations, `NA` is just a placeholder for `'I don't know the real value'`, so asking whether `NA == NA` is the same as asking whether `'some unknown number' == 'some unknown number'`, which clearly has no known answer.

Luckily, R gives us a special function to detect `NA`s. This is the `is.na()` function. And actually, if you try to type `my_vector == NA`, some editors, such as RStudio, will flag it and tell you to use `is.na()` instead.

`is.na()` will work on individual values, vectors, lists, and data frames. It will return `TRUE` where you have an `NA` and `FALSE` where you don’t.

```r
# Which values in my vector are NA?
is.na(v)
## [1] FALSE FALSE  TRUE FALSE  TRUE

# Which values in my data frame are NA?
is.na(ex)
##      example  data   set
## [1,]   FALSE FALSE FALSE
## [2,]    TRUE FALSE FALSE
## [3,]   FALSE FALSE FALSE
## [4,]   FALSE  TRUE FALSE
## [5,]   FALSE FALSE  TRUE
## [6,]   FALSE FALSE FALSE
```

You can also combine `is.na()` with `sum()` and `which()` to figure out how many `NA`s you have and where they’re located.

```r
# How many NAs in my data frame?
sum(is.na(ex))
## [1] 3

# Which row contains an NA in the 'data' column?
which(is.na(ex$data))
## [1] 4

# Which vector positions contain NAs?
which(is.na(v))
## [1] 3 5
```

The reason `sum(is.na(ex))` works is that `is.na()` first converts your values to `TRUE` or `FALSE`, and applying math operations to `TRUE`/`FALSE` values automatically converts them to 1s and 0s.

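
The same `TRUE`/`FALSE` to 1/0 conversion can be pushed a little further. Here is a small sketch, not part of the original example, that counts `NA`s per column and per row and works out the share of missing values. It rebuilds `ex` and `v` by hand (values copied from the output above) so that it runs on its own.

```r
# Example data frame and vector from this tutorial, rebuilt so the block is self-contained
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)
v <- c(1.2, 4.5, NA, 8.9, NA)

# TRUE counts as 1 and FALSE as 0, so sums give counts...
colSums(is.na(ex))   # number of NAs in each column
rowSums(is.na(ex))   # number of NAs in each row

# ...and means give proportions
mean(is.na(v))       # share of missing values in v (2 out of 5)

# anyNA() is a quick yes/no check for missing values
anyNA(v)
```

Per-column counts are often more useful than a single total, since they tell you which variables are causing trouble.
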
### How do I remove NAs from my data?

Now that we know we have `NA`s in our data… how do we get rid of them?

Some functions have an easy built-in argument, `na.rm`, which you can set to `TRUE` to remove `NA`s from the data before it is evaluated (the default is `FALSE`). If you remember the example from earlier, just running `sum(v)` returned `NA`. Adding `na.rm = TRUE` fixes this:

```r
# Sum across vector v
sum(v, na.rm = TRUE)
## [1] 14.6

# Take the mean of our vector v
mean(v, na.rm = TRUE)
## [1] 4.866667
```
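
`sum()` and `mean()` aren’t the only functions with this argument. As a rough sketch, here are a few other base R summary functions that also accept `na.rm`, applied to the same vector and to a hand-built copy of the example data frame (values copied from the output earlier in this post).

```r
v <- c(1.2, 4.5, NA, 8.9, NA)

max(v)                    # NA, because the missing values could be anything
max(v, na.rm = TRUE)      # largest of the known values
median(v, na.rm = TRUE)   # median of the known values

# Hand-built copy of the example data frame
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Column means computed from the non-missing values in each column
colMeans(ex, na.rm = TRUE)
```

Not every function has an `na.rm` argument, so it’s worth checking the help page (e.g. `?max`) before relying on it.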

In some cases it makes sense to have a function ignore `NA` values, but in other cases it makes more sense to remove them manually. Either way, this goes beyond the current scope of this post, but it is an important note to keep in mind.

If you want to remove all observations containing `NA`s, you can also use the `na.omit()` function. Keep in mind that removing an observation means removing the entire row of data.

```r
# Remove NAs from our data frame
na.omit(ex)
##   example data set
## 1       1    2   4
## 3      16    1   4
## 6       6    7   8
```
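
If dropping every row with any `NA` feels too aggressive, one alternative is to filter rows yourself. The sketch below uses base R’s `complete.cases()` together with `is.na()`. The `keep` variable is just an illustrative name, and `ex` is rebuilt by hand from the output above so the block runs on its own.

```r
# Hand-built copy of the example data frame
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# complete.cases() returns TRUE for rows that contain no NAs
keep <- complete.cases(ex)
keep

# Keeps the same rows that na.omit(ex) keeps (rows 1, 3, and 6)
ex[keep, ]

# Drop rows only when one particular column is missing
ex[!is.na(ex$data), ]
```

The last line keeps five of the six rows, because only row 4 is missing a value in the `data` column.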

Something else you might want to do is replace those `NA`s with another value. Maybe you want to replace missing values with 0 (You’re 200% sure those missing values were supposed to be 0s?? 😄), or maybe you want to replace those missing values with the mean of your data to approximate what those values would be (that can be especially useful for multivariate analyses). You can subset your vector or data frame to the places where `is.na()` is true, and set those equal to a new value.

```r
# Replace NAs in data frame with 0
ex[is.na(ex)] <- 0

# View data frame
ex
##   example data set
## 1       1    2   4
## 2       0    2   4
## 3      16    1   4
## 4       2    0   5
## 5       3    1   0
## 6       6    7   8

# Replace NAs in vector with the mean
v[is.na(v)] <- mean(v, na.rm = TRUE)

# View vector
v
## [1] 1.200000 4.500000 4.866667 8.900000 4.866667
```
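
The mean replacement above was only shown for a vector. For a data frame, one possible approach is to fill each column with its own mean. This is a sketch rather than the only way to do it, and the names `ex_filled`, `col`, and `is_missing` are illustrative. It starts from a fresh hand-built copy of the example data, since the `ex` above has already had its `NA`s replaced with 0.

```r
# Fresh copy of the example data frame, with its NAs still in place
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Work on a copy so the original stays untouched
ex_filled <- ex

# Replace the NAs in each column with that column's mean
for (col in names(ex_filled)) {
  is_missing <- is.na(ex_filled[[col]])
  ex_filled[[col]][is_missing] <- mean(ex_filled[[col]], na.rm = TRUE)
}

# View the filled data frame
ex_filled
```

Working on a copy makes it easy to compare the filled values against the original data afterwards.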

Awesome! Now you know how to find `NA`s in your data, perform functions without letting `NA`s get in the way, and remove `NA`s from your data for further analysis. Soon these functions will come to you `NA`turally…haha. I hope you found this tutorial helpful. Happy coding!

P.S. I’d recommend listening to this song to put you in the `NA`-removing mood!

Also be sure to check out **R-bloggers** for other great tutorials on learning R
