How to Handle Missing Data

Posted on January 18, 2021 by Business Science in R bloggers

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

This article is part of R-Tips Weekly, a weekly video tutorial series that shows you step by step how to do common R coding tasks.

Missing values used to drive me nuts… until I learned how to impute them! In 10 minutes, learn how to visualize and impute missing data in R using ggplot2, dplyr, and 3 more packages for simple imputation.

Here are the links to get set up.

Handling missing values

We’re going to kick the tires on 3 key packages:

visdat – For quickly visualizing data
naniar – For working with NA’s (missing data)
simputation – For simple imputation (converting missing data to values)

So let’s get started!

Visualizing Missing Data
Using vis_miss(), gg_miss_upset() and geom_miss_point()

Quickly Skim Missing Data

It doesn’t get any easier than this. Simply use visdat::vis_miss() to visualize the missing data. We can see Ozone and Solar.R are the offenders.
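As a minimal sketch of this step, assuming the data is R's built-in airquality dataset (our assumption; it is the standard dataset that has Ozone and Solar.R columns), the skim could look like this. The names p_miss and miss_summary are just for illustration, and miss_var_summary() from naniar is added here as an optional numeric cross-check.

```r
library(visdat)  # vis_miss()
library(naniar)  # miss_var_summary()

# Assumption: the built-in airquality dataset stands in for the article's data
p_miss <- vis_miss(airquality)
p_miss

# Optional: the same information as a table (variable, n_miss, pct_miss)
miss_summary <- miss_var_summary(airquality)
miss_summary
```

The plot shades each missing cell, so columns with gaps stand out at a glance, and the table gives the exact counts per column.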

Identify Interactions in Column Missingness

Use Case: It often makes sense to evaluate the interactions between columns containing missing data. We can use an “upset” plot for this.

Start with a good question:

“Is it often that we have both Ozone and Solar.R missing at the same time?”

We can answer this with gg_miss_upset(). We can see that 2 of the 7 missing Solar.R values (about 29%) occur at the same observations where Ozone is also missing; the other 5 are missing on their own. Might want to check for IoT sensor issues!
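Here is a rough sketch of the upset plot, again assuming the built-in airquality dataset (our assumption). gg_miss_upset() needs the UpSetR package installed. The dplyr count() call is our own addition, with hypothetical column names ozone_missing and solar_missing, to tabulate the same combinations the plot draws as bars.

```r
library(naniar)  # gg_miss_upset(); relies on the UpSetR package being installed
library(dplyr)

# Assumption: the built-in airquality dataset stands in for the article's data
gg_miss_upset(airquality)

# Tabulate how often each combination of missingness occurs
miss_combos <- airquality %>%
    count(ozone_missing = is.na(Ozone), solar_missing = is.na(Solar.R))
miss_combos
```

The row where both flags are TRUE corresponds to the intersection bar in the upset plot.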

Visualize Missing Observations in a Scatter Plot

Use Case: This is a great before/after visual.

For our final exploratory plot, let’s plot the missing data using geom_miss_point(). It works just like geom_point(), but plots where the missing data are located in addition to the non-missing data.
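A minimal sketch of the "before" plot, assuming the built-in airquality dataset and Solar.R versus Ozone as the two axes (both are our assumptions; p_before is just an illustrative name):

```r
library(ggplot2)
library(naniar)  # geom_miss_point()

# Assumption: airquality data, with Solar.R on x and Ozone on y
p_before <- ggplot(airquality, aes(x = Solar.R, y = Ozone)) +
    geom_miss_point()
p_before
```

geom_miss_point() places observations with a missing value just below the minimum of that axis and colors them separately, so they stay visible instead of being silently dropped.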

Before Imputation. Using geom_miss_point()

Imputation
impute_rf()

The simputation package comes with a host of impute_*() functions. We’ll focus on impute_rf(), which uses a random forest to do the imputation.

Running impute_rf() imputes the NA’s, replacing the missing Ozone and Solar.R values. In the updated scatter plot, we can see that the imputed values follow the distribution of the non-missing data.
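The imputation step can be sketched roughly as follows. This is not necessarily the exact code from the video: the airquality dataset, the choice of Temp and Wind as predictors, and the helper column was_missing are all our assumptions. impute_rf() needs the randomForest package installed.

```r
library(dplyr)
library(ggplot2)
library(simputation)  # impute_rf(); relies on the randomForest package

set.seed(123)  # random forests are stochastic, so fix the seed for repeatability

air_imputed <- airquality %>%
    # Hypothetical flag: remember which rows had a gap before imputing
    mutate(was_missing = is.na(Ozone) | is.na(Solar.R)) %>%
    # Assumption: Temp and Wind (both complete) serve as predictors
    impute_rf(Ozone ~ Temp + Wind) %>%
    impute_rf(Solar.R ~ Temp + Wind)

# Same scatter plot as before, now with imputed points highlighted
p_after <- ggplot(air_imputed, aes(x = Solar.R, y = Ozone, color = was_missing)) +
    geom_point()
p_after
```

Flagging the rows before imputing is a design choice: once the NA’s are filled in, there is no other way to tell imputed points from observed ones in the plot.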

After Imputation. Using impute_rf() and geom_point()

Reminders:

Full code in the GitHub Repository (https://github.com/business-science/free_r_tips).
Watch the YouTube Video for detailed instructions.

Time for an air-guitar celebration with your co-worker.

But if you really want to improve your productivity…

Here’s how to master R programming and become powered by R.

What happens after you learn R for Business.

When your CEO gets word of your Shiny Apps saving the company $$$.

This is career acceleration.

SETUP R-TIPS WEEKLY PROJECT

Sign Up to Get the R-Tips Weekly (You’ll get email notifications of NEW R-Tips as they are released): https://mailchi.mp/business-science/r-tips-newsletter
Set Up the GitHub Repo: https://github.com/business-science/free_r_tips
Check out the setup video (https://youtu.be/F7aYV0RPyD0), or hit Pull in the Git menu to get the R-Tips code.

Once you take these actions, you’ll be set up to receive R-Tips with Code every week. =)
