ggplot your missing data

November 30, 2015
By

(This article was first published on njtierney - rbloggers, and kindly contributed to R-bloggers)

Visualising missing data is important when analysing a dataset. I wanted to make a plot of the presence/absence in a dataset. One package, Amelia provides a function to do this, but I don’t like the way it looks. So I made a ggplot version of what it did.

Let’s make a dataset using the awesome wakefield package, and add random missingness.

library(dplyr)
library(wakefield)
df <- 
  r_data_frame(
  n = 30,
  id,
  race,
  age,
  sex,
  hour,
  iq,
  height,
  died,
  Scoring = rnorm,
  Smoker = valid
  ) %>%
  r_na(prob=.4)

This is what the Amelia package produces by default:

library(Amelia)

missmap(df)

plot of chunk unnamed-chunk-2

And let’s explore the missing data using my own ggplot function:

# A function that plots missingness
# requires `reshape2`

library(reshape2)
library(ggplot2)

ggplot_missing <- function(x){
  
  x %>% 
    is.na %>%
    melt %>%
    ggplot(data = .,
           aes(x = X2,
               y = X1)) +
    geom_raster(aes(fill = value)) +
    scale_fill_grey(name = "",
                    labels = c("Present","Missing")) +
    theme_minimal() + 
    theme(axis.text.x  = element_text(angle=45, vjust=0.5)) + 
    labs(x = "Variables in Dataset",
         y = "Rows / observations")
}

Let’s test it out

ggplot_missing(df)

plot of chunk unnamed-chunk-4

It’s much cleaner, and easier to interpret.

This function, and others, is available in the neato package, where I store a bunch of functions I think are neat.

Quick note – there used to be a function, missing.pattern.plot that you can see here in the package mi. However, it doesn’t appear to exist anymore. This is a shame, as it was a really nifty plot that clustered the groups of missingness. My friend and colleague, Sam Clifford heard me complaining about this and wrote some code that does just that – I shall share this soon, it will likely be added to the neato repository.

Thoughts? Write them below.

To leave a comment for the author, please follow the link and comment on their blog: njtierney - rbloggers.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)