As many of us know, science is not a perfect process. Maybe you can’t get out in the field on a certain day. Maybe you can only sample a portion of what needs to get done. Or maybe you’re downloading public data sets and they aren’t lining up perfectly. All of these can result in missing data, which can be a real pain when it comes time for analysis.
Another common source of missing data, especially when recording species abundance data in community ecology, is when you forget to write a ‘0’ and instead leave the entry blank. In the moment you might know that blank entries mean zero, but give it just a few weeks and you’ll be scratching your head! In those cases it’s often best to label those entries as unknown or missing.
In this tutorial, I’m going to explain what exactly an NA value is, how you can find NAs in your data, and how you can remove them.
What does it mean to have NAs in my data?
NAs represent missing values in R. This is pretty common if you’re importing data from Excel and have some empty cells in the spreadsheet. When you load the data into R, empty cells in numeric columns will be populated with NA (which stands for ‘Not Available’). In fact, you’ll notice the color change when you type NA in your code, since R already knows what that means.
# Read in an example data set with NAs
ex <- read.csv("example_data.csv")
# View data
ex
##   example data set
## 1       1    2   4
## 2      NA    2   4
## 3      16    1   4
## 4       2   NA   5
## 5       3    1  NA
## 6       6    7   8
Click here to download the
example_data.csv file if you want to follow along.
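If you'd rather not download the file, a small sketch like the one below builds the same data frame from inline text. The text = and na.strings arguments are standard read.csv() options; the CSV contents are simply typed in to match the output above, and the names csv_text and ex_inline are made up for illustration.

```r
# Inline stand-in for example_data.csv (hypothetical contents matching the printed output);
# blank fields play the role of empty spreadsheet cells
csv_text <- "example,data,set
1,2,4
,2,4
16,1,4
2,,5
3,1,
6,7,8"

# Treat both empty cells and the literal string "NA" as missing
ex_inline <- read.csv(text = csv_text, na.strings = c("", "NA"))

# View data
ex_inline
```

Listing "" in na.strings matters most for text columns, where an empty cell would otherwise be read as an empty string rather than NA.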
NAs cannot be treated like other types of data (e.g., strings, numeric values). For example, you can’t perform math with them or use them in comparisons. If you do so, all you’ll get is an NA. In the following examples, all positions in the vector with NA just return NA again, no matter what operation is performed. We also get NA if we use mathematical functions such as sum() on the vector, because R can’t add an unknown value to the rest.
# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)
# Can we do math with NAs?
v + 1
## [1] 2.2 5.5  NA 9.9  NA
sum(v)
## [1] NA
# Can we perform logical comparisons?
v < 7
## [1]  TRUE  TRUE    NA FALSE    NA
v == 4.5
## [1] FALSE  TRUE    NA FALSE    NA
And the reason of course is simple… What’s the answer to
5 + 'some unknown number' ?
Have you figured it out yet?
The answer is
'some unknown number'! 😄
5 + NA = NA
How can I detect NAs in my data?
So how can we see if we have NAs in our data? We normally use == to see if a value is equal to another one. Let’s see if that will work on our vector. We know that there are NAs in the 3rd and 5th positions of our vector.
# Create a vector with NAs
v <- c(1.2, 4.5, NA, 8.9, NA)
So we might expect v == NA to return FALSE FALSE TRUE FALSE TRUE.
# Are there any NAs in our vector?
v == NA
## [1] NA NA NA NA NA
But this code just gives us NAs, because NAs don’t work with comparison operators like == either.
Same as with math operations, NA is just a placeholder for 'I don't know the real value', so asking whether NA == NA is the same as asking whether 'some unknown number' == 'some unknown number', which clearly has no known answer.
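One small caveat, offered as a side note rather than something the rest of this tutorial depends on: a handful of logical operations do return a real answer, because the result would be the same whatever the missing value turned out to be. A quick sketch:

```r
# FALSE AND anything is FALSE, so the unknown value doesn't matter
NA & FALSE

# TRUE OR anything is TRUE, for the same reason
NA | TRUE

# When the unknown value could change the answer, we get NA back
NA & TRUE
NA | FALSE
```

The first two lines return FALSE and TRUE, while the last two return NA. This follows the same 'some unknown number' logic as before: R only answers when the answer can't depend on the missing value.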
Luckily, R gives us a special function to detect NAs. This is the is.na() function. And actually, if you try to type my_vector == NA, your editor (RStudio, for example) may tell you to use is.na() instead.
is.na() will work on individual values, vectors, lists, and data frames. It will return TRUE where you have an NA and FALSE where you don’t.
# Which values in my vector are NA?
is.na(v)
## [1] FALSE FALSE  TRUE FALSE  TRUE
# Which values in my data frame are NA?
is.na(ex)
##      example  data   set
## [1,]   FALSE FALSE FALSE
## [2,]    TRUE FALSE FALSE
## [3,]   FALSE FALSE FALSE
## [4,]   FALSE  TRUE FALSE
## [5,]   FALSE FALSE  TRUE
## [6,]   FALSE FALSE FALSE
You can also combine is.na() with sum() and which() to figure out how many NAs you have and where they’re located.
# How many NAs in my data frame?
sum(is.na(ex))
## [1] 3
# Which row contains an NA in the 'data' column?
which(is.na(ex$data))
## [1] 4
# Which vector positions contain NAs?
which(is.na(v))
## [1] 3 5
The reason sum(is.na(ex)) works is that is.na() first converts your values to TRUE or FALSE, and applying math operations to T/F values automatically converts them to 1s or 0s.
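The same TRUE/FALSE-to-1/0 trick extends to per-column summaries. The sketch below rebuilds the example data frame by hand so it runs on its own (the values are copied from the output shown earlier), then uses the base R functions colSums(), colMeans(), and which() with arr.ind = TRUE.

```r
# Rebuild the example data frame so this snippet is self-contained
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Quick yes/no check: is there at least one NA anywhere?
any(is.na(ex))

# Number of NAs in each column (each TRUE counts as 1)
colSums(is.na(ex))

# Proportion of missing values in each column
colMeans(is.na(ex))

# Row and column position of every NA
which(is.na(ex), arr.ind = TRUE)
```

Per-column counts are often a more useful first look than a single total, since they show whether the missing values are concentrated in one variable.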
How do I remove NAs from my data?
Now that we know we have
NAs in our data… how do we get rid of them?
Some functions have an easy built-in argument, na.rm, which you can set to TRUE to remove NAs from the data to be evaluated. If you remember the example from earlier, just running sum(v) returned NA. Setting na.rm = TRUE fixes this:
# Sum across vector v
sum(v, na.rm = TRUE)
## [1] 14.6
# Take the mean of our vector v
mean(v, na.rm = TRUE)
## [1] 4.866667
In some cases it makes sense to let a function skip NA values, but in other cases it makes more sense to remove them manually. Either way, this goes beyond the current scope of this post, but it is an important note to keep in mind.
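As a rough illustration of that point, here is a sketch showing that na.rm is shared by several base summary functions but not by all of them. cor(), for instance, takes a use argument instead. The second vector w is made up for this example.

```r
# The vector from earlier, plus a made-up second vector to correlate it with
v <- c(1.2, 4.5, NA, 8.9, NA)
w <- c(2.0, 5.1, 3.5, 9.1, 4.4)

# Several summary functions accept na.rm
max(v, na.rm = TRUE)
median(v, na.rm = TRUE)

# cor() has no na.rm argument; by default an NA in the input gives NA back
cor(v, w)

# Its 'use' argument controls NA handling: keep only the complete pairs
cor(v, w, use = "complete.obs")
```

When a function returns NA unexpectedly, its help page (for example ?cor) is the place to check how it expects missing values to be handled.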
If you want to remove all observations containing
NAs, you can also use the
na.omit() function. Keep in mind that removing an observation means removing the entire row of data.
# remove NAs from our data frame
na.omit(ex)
##   example data set
## 1       1    2   4
## 3      16    1   4
## 6       6    7   8
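If dropping every row with any NA is too aggressive, one option is to subset rows yourself. The sketch below, which rebuilds the example data frame by hand so it runs on its own, uses the base R function complete.cases() and a plain is.na() filter on a single column.

```r
# Rebuild the example data frame so this snippet is self-contained
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# complete.cases() returns TRUE for rows that contain no NAs at all
complete.cases(ex)

# Keeps the same rows as na.omit(ex), written as a row subset
ex[complete.cases(ex), ]

# Only require the 'data' column to be present; NAs in other columns are kept
ex[!is.na(ex$data), ]
```

The last line keeps five of the six rows, dropping only row 4, which can be preferable when your analysis uses just one or two of the columns.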
Something else you might want to do is replace those
NAs with another value. Maybe you want to replace missing values with 0 (You’re 200% sure those missing values were supposed to be 0s?? 😄), or maybe you want to replace those missing values with the mean of your data to approximate what those values would be (that can be especially useful for multivariate analyses). You can subset your vector or data frame to the places where
is.na() is true, and set those equal to a new value.
# Replace NAs in data frame with 0
ex[is.na(ex)] <- 0
# View data frame
ex
##   example data set
## 1       1    2   4
## 2       0    2   4
## 3      16    1   4
## 4       2    0   5
## 5       3    1   0
## 6       6    7   8
# Replace NAs in vector with the mean
v[is.na(v)] <- mean(v, na.rm = TRUE)
# View vector
v
## [1] 1.200000 4.500000 4.866667 8.900000 4.866667
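The mean replacement above works on a single vector. For a data frame, you would usually want each column filled with its own mean. Here is a minimal sketch of one way to do that with a loop, assuming every column is numeric. The data frame is rebuilt by hand, and the name ex_filled is made up for illustration.

```r
# Rebuild the example data frame (with its NAs) so this snippet is self-contained
ex <- data.frame(
  example = c(1, NA, 16, 2, 3, 6),
  data    = c(2, 2, 1, NA, 1, 7),
  set     = c(4, 4, 4, 5, NA, 8)
)

# Work on a copy so the original stays untouched
ex_filled <- ex

# Replace the NAs in each column with that column's own mean
for (col in names(ex_filled)) {
  missing <- is.na(ex_filled[[col]])
  ex_filled[[col]][missing] <- mean(ex_filled[[col]], na.rm = TRUE)
}

# View the filled data frame
ex_filled
```

Keeping the original ex around makes it easy to compare results with and without the filled-in values.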
Awesome! Now you know how to find
NAs in your data, perform functions without letting
NAs get in the way, and remove
NAs from your data for further analysis. Soon these functions will come to you
NAturally…haha. I hope you found this tutorial helpful. Happy coding!
Also be sure to check out R-bloggers for other great tutorials on learning R