Handling Missing Values in R using tidyr

[This article was first published on r-bloggers on Programming with R, and kindly contributed to R-bloggers.]

In this post, we’ll look at three functions from tidyr that are useful for handling missing values (NAs) in a dataset. Please note: this post isn’t about missing value imputation.

tidyr

According to the documentation of tidyr,

The goal of tidyr is to help you create tidy data. Tidy data is data where:

+ Every column is a variable.
+ Every row is an observation.
+ Every cell is a single value.

Let’s start by loading the tidyr library. tidyr is also one of the packages in the tidyverse.

library(tidyr)

tidyr functions

The following three tidyr functions are handy for processing missing values:

  • drop_na()
  • fill()
  • replace_na()

Dataset with Missing Values

To get a dataset with missing values, let’s take mtcars and introduce a few NAs into it.

df <- mtcars

df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# counting number of missing values
paste("Number of Missing Values", sum(is.na(df)))
## [1] "Number of Missing Values 4"
# dimensions

paste("Number of Rows",nrow(df))
## [1] "Number of Rows 32"
paste("Number of Columns",ncol(df))
## [1] "Number of Columns 11"

Now we’ve got a dataset with missing values (NAs) in it: four NAs spread across three rows.
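
Before cleaning anything, it can help to see where the NAs actually sit. The following is a minimal sketch using base R only; the names na_per_column and rows_with_na are ours for illustration and are not part of the original example.

```r
# Recreate the example data so this snippet runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# Number of NAs in each column; show only columns with at least one NA
na_per_column <- colSums(is.na(df))
na_per_column[na_per_column > 0]

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(df))
rows_with_na   # 3 (rows 2, 5 and 10)
```

Row 5 holds two of the four NAs, so only three rows are affected. This is why dropping incomplete rows from the 32-row dataset leaves 29 rows rather than 28.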

drop_na()

drop_na() removes the rows that contain missing values.

library(dplyr) # dplyr provides verbs used later in this post, such as mutate_all()
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df_no_na <- drop_na(df)


# counting number of missing values
paste("Number of Missing Values", sum(is.na(df_no_na)))
## [1] "Number of Missing Values 0"
# dimensions

paste("Number of Rows",nrow(df_no_na))
## [1] "Number of Rows 29"
paste("Number of Columns",ncol(df_no_na))
## [1] "Number of Columns 11"
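
drop_na() also accepts column names, in which case only those columns are checked for NAs. Here is a small sketch of that usage, assuming we only care about hp being complete; df_hp_complete is an illustrative name.

```r
library(tidyr)

# Recreate the example data so this snippet runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# Drop rows only where hp is missing; NAs in other columns are kept
df_hp_complete <- drop_na(df, hp)

nrow(df_hp_complete)         # 31: only row 2 had a missing hp
sum(is.na(df_hp_complete))   # 3: the NAs in cyl, gear and mpg remain
```

This is useful when a missing value matters only for the variables used in a particular analysis, since it avoids throwing away rows unnecessarily.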

fill()

fill() fills the NAs (missing values) in the selected columns with the nearest non-missing value in the chosen direction. Columns are selected in the same way as with dplyr::select(), as in the example below, which uses everything().

It also lets us choose the .direction in which values are filled: "down" (the default), "up", "downup" or "updown".

This is quite naive, but it can be handy in many situations, for example with time series data.

df_na_filled <- df %>% 
                    fill(
                      dplyr::everything()
                    )


# counting number of missing values
paste("Number of Missing Values", sum(is.na(df_na_filled)))
## [1] "Number of Missing Values 0"
# dimensions

paste("Number of Rows",nrow(df_na_filled))
## [1] "Number of Rows 32"
paste("Number of Columns",ncol(df_na_filled))
## [1] "Number of Columns 11"
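
To make the effect of .direction concrete, here is a small sketch on a made-up series. The readings data frame is hypothetical and exists only for illustration. Note that with the default "down" direction a leading NA stays NA, because there is no earlier value to carry forward; the mtcars example above ends up with zero NAs only because none of its NAs are in the first row.

```r
library(tidyr)

# Hypothetical toy series, made up for illustration
readings <- data.frame(
  day   = 1:5,
  value = c(NA, 10, NA, NA, 20)
)

filled_down <- fill(readings, value)                        # default: .direction = "down"
filled_up   <- fill(readings, value, .direction = "up")
filled_both <- fill(readings, value, .direction = "downup") # down first, then up

filled_down$value   # NA 10 10 10 20
filled_up$value     # 10 10 20 20 20
filled_both$value   # 10 10 10 10 20
```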

replace_na()

Use replace_na() when you already know the value that should replace the NAs.

Below is an example that replaces all NAs with zero (0):

df_na_replaced <- df %>% 
                    mutate_all(replace_na,0)


# counting number of missing values
paste("Number of Missing Values", sum(is.na(df_na_replaced)))
## [1] "Number of Missing Values 0"
# dimensions

paste("Number of Rows",nrow(df_na_replaced))
## [1] "Number of Rows 32"
paste("Number of Columns",ncol(df_na_replaced))
## [1] "Number of Columns 11"
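
replace_na() can also be applied directly to a data frame, with a named list giving one replacement value per column. The sketch below assumes arbitrary replacement values chosen purely for illustration; they are not recommendations for this dataset.

```r
library(tidyr)

# Recreate the example data so this snippet runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# One replacement value per column (values are arbitrary, for illustration only)
df_custom <- replace_na(df, list(hp = 0, cyl = 4, gear = 3, mpg = 0))

sum(is.na(df_custom))   # 0
df_custom$cyl[5]        # 4
```

Columns that are not named in the list are left untouched, so any NAs in them would remain.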

Alternatively, we could’ve restricted the replacement to numeric (continuous) columns and replaced their NAs with zero like this:

df_na_replaced <- df %>% 
                    mutate_if(is.numeric, replace_na,0)
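
In recent versions of dplyr (1.0.0 and later), the scoped verbs mutate_all() and mutate_if() are superseded by across(). A rough equivalent of the numeric-only replacement, assuming such a dplyr version is installed, might look like this; df_na_replaced_across is an illustrative name.

```r
library(tidyr)
library(dplyr)

# Recreate the example data so this snippet runs on its own
df <- mtcars
df$hp[2] <- NA
df$cyl[5] <- NA
df$gear[5] <- NA
df$mpg[10] <- NA

# Replace NAs with 0 in every numeric column using across()
df_na_replaced_across <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0)))

sum(is.na(df_na_replaced_across))   # 0
```

The older scoped verbs still work, so this is a matter of style rather than a required change.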

Hopefully, this post has shed some light on three tidyr functions for handling missing values: drop_na(), fill() and replace_na().

If you liked this, please subscribe to my Language-agnostic Data Science Newsletter and share it with your friends!
