ggplot your missing data

[This article was first published on njtierney - rbloggers, and kindly contributed to R-bloggers].

Visualising missing data is important when analysing a dataset. I wanted to make a plot of the presence/absence of values in a dataset. One package, Amelia, provides a function to do this, but I don’t like the way it looks. So I made a ggplot version of what it did.

Let’s make a dataset using the awesome wakefield package, and add random missingness.

    library(dplyr)
    library(wakefield)

    df <-
      r_data_frame(
        n = 30,
        id,
        race,
        age,
        sex,
        hour,
        iq,
        height,
        died,
        Scoring = rnorm,
        Smoker = valid
      ) %>%
      r_na(prob = .4)
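
For readers without wakefield installed, the idea of adding random missingness can be sketched in base R. This is only an illustration under my own assumptions, not how `r_na` is implemented; the data frame `toy` and the helper `add_random_na` are hypothetical names that do not appear in the original post.

```r
# Hypothetical helper: blank out each cell with probability `prob`.
# Not wakefield's implementation of r_na(); just the same idea in base R.
add_random_na <- function(data, prob = 0.4) {
  data[] <- lapply(data, function(column) {
    column[runif(length(column)) < prob] <- NA
    column
  })
  data
}

set.seed(1)  # make the example reproducible
toy <- data.frame(
  age    = sample(20:60, 30, replace = TRUE),
  height = round(rnorm(30, mean = 170, sd = 10)),
  sex    = sample(c("Male", "Female"), 30, replace = TRUE)
)
toy_na <- add_random_na(toy, prob = 0.4)

# Proportion of missing values in each column
colMeans(is.na(toy_na))
```

A per-column summary like `colMeans(is.na(...))` is a quick numeric check to run alongside a missingness plot.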

This is what the Amelia package produces by default:

    library(Amelia)

    missmap(df)

[Figure: output of missmap(df)]

And let’s explore the missing data using my own ggplot function:

    # A function that plots missingness
    # requires `reshape2` and `ggplot2`, plus `dplyr` (or `magrittr`) for %>%

    library(reshape2)
    library(ggplot2)

    ggplot_missing <- function(x){

      x %>%
        is.na %>%
        melt %>%
        # reshape2::melt() on a matrix names the index columns Var1 (rows)
        # and Var2 (columns)
        ggplot(data = .,
               aes(x = Var2,
                   y = Var1)) +
        geom_raster(aes(fill = value)) +
        scale_fill_grey(name = "",
                        labels = c("Present", "Missing")) +
        theme_minimal() +
        theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
        labs(x = "Variables in Dataset",
             y = "Rows / observations")
    }
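
To see what kind of table the plotting function works from, it may help to look at the long format on a tiny example. The sketch below builds a row/variable/value table in base R, so it runs without reshape2. The data frame `toy` and the column names `row`, `variable`, and `missing` are my own illustrative choices, not the names `melt` produces.

```r
toy <- data.frame(
  age = c(31, NA, 45),
  sex = c("Male", "Female", NA)
)

# is.na() on a data frame returns a logical matrix of the same shape
miss <- is.na(toy)

# One row per cell: which row, which variable, and whether it is missing
long <- data.frame(
  row      = as.vector(row(miss)),
  variable = colnames(miss)[as.vector(col(miss))],
  missing  = as.vector(miss)
)
long
```

Each cell of the original data becomes one row of `long`, which is the shape a raster plot needs: one x value, one y value, and one fill value per tile.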

Let’s test it out:

    ggplot_missing(df)

[Figure: output of ggplot_missing(df)]

It’s much cleaner, and easier to interpret.

This function, and others, is available in the neato package, where I store a bunch of functions I think are neat.

Quick note: there used to be a function, missing.pattern.plot, in the mi package. However, it no longer appears to exist. This is a shame, as it was a really nifty plot that clustered the groups of missingness. My friend and colleague, Sam Clifford, heard me complaining about this and wrote some code that does just that. I shall share this soon, and it will likely be added to the neato repository.
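
As a rough idea of what clustering the groups of missingness could look like, here is a minimal base R sketch. It is not Sam's code and not the method used by missing.pattern.plot; it assumes that hierarchical clustering on a 0/1 missingness matrix is an acceptable stand-in, and `toy`, `miss`, and `row_order` are hypothetical names.

```r
toy <- data.frame(
  a = c(1, NA, 3, NA, 5, NA),
  b = c(NA, "x", NA, "y", NA, "z"),
  c = c(1, 2, 3, 4, 5, 6)
)

# 0/1 matrix: 1 where a value is missing
miss <- is.na(toy) * 1

# Hierarchical clustering of rows on their missingness pattern
row_order <- hclust(dist(miss))$order

# Rows with the same pattern now sit next to each other
miss[row_order, ]
```

Reordering rows this way groups observations that are missing the same variables, which makes blocks of missingness easier to spot than in the original row order.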

Thoughts? Write them below.
