Look! It’s your data!

Posted on December 10, 2014 by jehrlinger in R bloggers | 0 Comments

[This article was first published on Learning Slowly » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

If a picture is worth a thousand words, then how many tables are a single visualization worth? Exploratory data analysis is a great way to see what is and is not in your dataset.

I work in a hospital research group. Most of my colleagues are more comfortable in SAS than in R. One of the first ideas I had to help the workflow here was to create an R script that generated a series of data plots to visualize variables within our analysis dataset. This allowed our statistical programmers to do some data checking for quality, outliers and missing data before handing the data set to our biostatisticians. This moved the error checking step up in the project workflow, hopefully saving some time and making us all more productive.

The problem was this script required some custom edits for every data set. Something that the stat programmers wanted to learn, but in the heat of (deadlines) battle, that time for learning new things rarely happens. Hence it gets put off until “later”.

I’ve been contemplating how to use Shiny apps for a long time, and this finally caused me jumped in. It was pretty simple to get started actually, I just dumped a bunch of code I had been using (read: copy and pasting into lots of jobs) and put it in the server.R file. I created a ui.R and BAM! the xportEDA Shiny app was born!

xportEDA A shiny app for data visualization

xportEDA is a Shiny app that generates graphics used to explore your data set. It started out for SAS xport files only, but we’ve added csv and some rdata file support. Written in [R](http://cran.r-project.org/), this shiny app requires the following packages:

shiny
foreign (to load SAS xport files)
ggplot2 (nice figures)
reshape2
RColorBrewer (color palettes are my friend)

Description

The xportEDA app makes it easy to visualize your data quickly, without requiring programming effort to get a jump on your data wrangling.

You supply the app with a data file. The app can read in a data.frame from a SAS xpt, csv or rdata file, and generates a set of data visualizations.

The app first classifies the variables as continuous, logical or categorical. Any variable with only 2 unique values is interpreted as logical. An ad-hoc definition for categorical variables is any factor plus any variable with more than 2 and less than 10 unique values. By default, character variables are converted to factors, however if we have more than 20 levels, we will not show a panel figure for that variable.

The app creates a faceted set of histograms for all categorical and logical variables, and another set of scatter plots for all continuous variables. Since we often are working in time-to-event settings, the app searches the variable names for some of our “standard” time related variable names to use for the x-axis. Typically, we use a “date of procedure” for this. However, if your data does not have a “time” variable name, we will select the first continuous variable for the x-axis. This variable can be changed though the Shiny interface.

A separate page is set up for visualizing individual variables, making it easy to export a single figure for use in reports or other communications. Useful for when your collaborators do not believe you are missing large chunks of data in a variable, or there are negative values for strictly positive variables, like height.

We also include a data summary page for further data debugging purposes.

Examples

This example is from our research data.

The first figure shows all the categorical data in this dataset. Each of the histograms look pretty uniformly distributed, though there are some missing values shown in the hx_fcad variable.

Continuous variables are also informative. Here we see the uniformly distributed data over the time of interest (about 7 years of follow up). However, there are a series of extreme values many variables like bun_pr or BMI. BMI is a variable made from height and weight, so the extremely short or extremely large people can make for very strange BMI measures. This is typically a problem with units of measurement. What should we do with these extreme values?

The bottom panel also shows a simple way to check goodness of follow up for time to event data. We expect the triangular shape as subjects that entered the study early should have the longest follow up. The internal part of the triangle should be mostly red xs, indicating an event occurred (death) and most of the blue circles should be on the hypotenuse of the triangle, as they indicate censored or alive case, occuring hopefully at the end of the study as opposed to lost to follow up cases in the interior.

A single plot for bun_pr makes it pretty clear there are only 3 values that are suspect. Since there are quite a few observations, we may want to make these missing and use imputation, or return these observations for data correction.

This is not the optimal way to view a data summary, but it may help the user understand where issues with the app are coming from.

Running the App

To try it out, you can see it on the shinapps.io site
https://ehrlinger.shinyapps.io/xportEDA/

I have also posted the app code to a GitHub repository where you can download it, and try it out. Let me know how it goes, report bugs or contribute back. I’d love to make this better, and learn more Shiny tricks along the way.

The easiest way is to run the app locally is to download it from the https://github.com/ehrlinger/xportEDA repository.

Then run it from R
R> library(shiny) R> runApp()

or run it in R directly from the repository:

R> shiny::runGitHub("ehrlinger/xportEDA")

Issues and Caveats

I could put the standard “use at your own risk” disclaimer here. I will also add:

The app is written with my specific problem domain in mind though I am open for suggestions on how to improve it.

We tend to use time to event data (working in a hospital after all).

Our group uses SAS predominantly, hence the “xport” functionality and the app naming structure.

xportEDA will have trouble with large p data sets, as I have not figured out how to make shiny extend the figures indefinitely down the page. I do dynamically set the number of columns in an effort to control how small the panel plots get. But if you get into the 75 categorical or continuous variable range, it may become illegible.

Progress not perfection! The best way to start, is to start.

Filed under: R Tagged: ggplot2, R, shiny