Scatter Plots with Marginal Densities – An Example for Doing Exploratory Data Analysis with Tableau and R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One of the first stages in most data analysis projects is about exploring the data at hand. During this stage the analyst tries to get familiar with his dataset by looking at summary statistics, feature distributions and relationships between different attributes – just to name the key tasks. It is a really important procedure before the start of hypotheses testing and statistical modeling, as it gives important insight about what can be done with data and where we should expect problems. For example, a discrete target attribute where the labels are extremely uneven distributed (rare events) should guide our choice for the right modeling and data prep technique. Or if we detect an independent feature that is highly correlated with the target, then this indicates a good candidate for feature selection. Visualizations are the key technique used within exploratory data analysis, which conversely should give us great preconditions for using Tableau during this stage.
One of the most frequently used visualizations is a scatter plot. It is used for showing the relationship between two continuous features. A simple scatter plot can be easily enriched with more “features” like showing the correlation, marginal densities plots and histograms, groupings as well as trend lines. After reading this blog post by John Mount and Nina Zumel that shows exactly such an extended version of a scatter plot, I was wondering if the same visualization can be created within Tableau and how we can strengthen Tableau’s advantages regarding interactivity to make the plot even more useful.
Lets first think about the possible structure of this visualization in Tableau. To put three sheets for the two density histograms and the scatter plot together in one dashboard should be no problem. The correlation information can be added as a separate panel floating on top of the scatter plot or fixed in the upper right corner.
How to get all of this different statistical estimates into a common Tableau is a different question – the density histogram (which is a modified histogram showing relative frequencies in a way that the area over all bars add up to 1), the density curve (where the area under the curve adds up to 1), the smoothing function and the correlation estimate. One approach would be to combine some of the things available in Tableau (like the trend functionality) with some hacks to build the other concepts within Tableau. A second approach is, to use some “specialist” for all sorts of statistical questions and let it do all the calculations – for example R. This is the approach I have chosen and Tableau’s interface towards R was really helpful with that. The big advantage of using R is that there are well known functions together with a vast amount of documentation for all of those concepts.
In the following I will explain how this can be applied to the different plots. To compare the result with the original visualization, I will use the original data set from Johns and Ninas blog that can be found here.
Scatter Plot with Smoothing Function
Let’s begin with the central visualization – the scatter plot. We could start by dropping the two features we want to compare on the column and row shelf, but this would be unhandy if we want to change the attributes later one. Therefore, the idea is to use the well-known Tableau pattern for creating dynamic fields – a parameter that contains the different options together with a feature that maps the current parameter value on the right measure. Each of the axis will be represented by one dynamic field.
Then we place both dynamic fields on row and column shelf and add a third dynamic field to the color shelf. The latter one allows the user to choose a dimension for grouping and can give additional insights into the relationship between two measures. Don’t forget to disable the “Aggregate Measures” option to show each single item from the dataset. The next piece that is to be added is the trend line (smoothing function) that is available as build-in function in Tableau. Again, we first should think if this gives us the flexibility we want. For example, we might want to give the user an option to choose between different smoothing approaches with one click in the dashboard? R is well suited for that. Another advantage is that by using the R calculation we avoid that when changing the grouping dimension, Tableau always first add a separate trend line for each group (this is the standard behavior of Tableau’s own trend functionality). The following R code will calculate a trend line from linear regression or local polynomial regression fitting (loess) together with the corresponding 95% confidence interval: At the end it returns a string vector of the same length as the number of input items, where each vector item contains fit, lower and upper bound separated by a special delimiter (“~”). Just split the string accordingly, transform the three values into numbers and add them inside a “Measure Values” box by creating a dual axis chart. The final scatter plot looks like this:
The panel showing the correlation is most probably the easiest component of the dashboard. It just shows the result from the R table calc as part of the column header. First step is to move the dynamic fields for x and y axis to the detail section to use them as inputs for the R table calc. Then turn of “Aggregate Measures”. Again, the result from the R calculation needs to be split into its single components – the pearson product-moment correlation coefficient and the corresponding p value. After that move both parts to the column shelf. After some formatting – font, font size, cell border etc. – the result is hopefully looking like this:
Marginal Density Distribution and Marginal Density Histogram
Creating a plot showing the marginal density distribution together with the marginal density histogram is the trickiest part. Building a density histogram in Tableau should be possible with the help of some table calcs, given that the high of a bar of a bin has to be equal to the # of elements in the bin/(# of total elements * width of the bin). But estimating a density function isn’t Tableau’s business and there are more efficient ways to do that. For example in R we can use the function ’density’ to calculate a density function and the function ’hist’ to give us the values of a density histogram. How to bring all that together in Tableau?
The general idea is that we discretize the attribute already in Tableau, send bins and the number of elements per bin to R and let R calculate the density function as well as the high of the bars for the density histogram. Because we send pre-aggregated information (bins) to R instead of the raw data, the density calculation is an approximation. But this should be good enough for the purpose of visualization. Second, our approach will be faster for larger datasets than sending the raw data because we just transmit a data package of the size of the number of bins instead of the number of items.
Let’s start with the binning of the attribute. Tableau’s build-in binning functionality has two major drawbacks: The result can’t be used as input to an R table calc and we cannot specify the number of bins directly – just the bin width. But thanks to the new LOD feature in Tableau 9.0, it is easy to implement our own function for equal width binning. First calculate the minimum and maximum of the attribute using the new LOD functions. Based on both fields we then define the bin width. Using bin width and the minimum, the lower and upper bound of the bin can be calculated. Finally, we assign every item the midpoint of the interval as numeric label.
Because of performance reasons we calculate the marginal density function and marginal density histogram together in one table calc. After saving the bins and the number of elements per bin into R variables, we use the “density” function in R to compute the density estimate and assign each bin the density value from the coordinate next to the bin. The “hist” function gives us the high of the bar for each bin if we use the parameter “freq=FALSE” and set the breakpoints equal to the bins. The value of the density estimate and the height of the density histogram are then concatenated and returned as string. This is done with the following R code: After placing the bin field on the column shelf (the one that represents the feature from the x axis of the scatter plot), the density estimate as well as the high of the bar on the row shelf and adding some tooltips, the marginal density plot for the top of the dashboard looks as follows:
The visualization of the marginal density function and the marginal density histogram for the feature that is on the y axis of the scatter plot is just a replicate where columns and rows are swapped.
Now we just need to arrange everything we build so far on a dashboard and get some basic version of the desired scatter plot with marginal densities. As opposite to the static versions of the graph in R and python it comes with an interactive user interface and some nice feature like configurable binning width, different smoothing methods and an optional grouping of the data points.
Are we done – not really. When playing around with the dashboard using a much larger dataset and different features I spotted three things that need some explanations and fixes where possible.
Scatter Plot with Marginal Distributions 2.0
The first issue is about adapting the scatter plot to new datasets because creating it from the scratch doesn’t seem to be a good option. I didn’t find a one-click solution, but rather want to highlight the points of the workbook that need customization and present a guideline here. I suggest not to replace the data source as this operation often crashes parts of the layout and format. Instead, keep the endpoint (e.g. file or table name) of the connection constant and just change the data in the data source directly. After opening the workbook Tableau will show lots of errors (because of the wrong field references) that we fix within the next steps. Meanwhile deactivate the auto-update. There are only six attributes that need to be changed when using a new data source: three parameters that define features for x axis, y axis and grouping as well as the three corresponding dynamic fields.
Change the selectable features for the parameter of the x axis and use this one as input for the y axis parameter (“Add from Parameter”). Next change the measure representing the x axis feature and copy the code into the measure for the y axis feature. Repeat the whole procedure for the grouping parameter and field and you are done. If some of you have any other idea about doing this a little bit more programmatically, (maybe hacking the workbook xml) I’m happy if you leave a comment.
The second issue is about the alignment of the axis of the marginal density plots with the scatter plot (x axis from scatter plot with density chart on top and y axis from scatter plot with density chart on the right). You can see that because Tableau choose different ranges for the continuous feature and the binned version of the same feature, there is some unpleasant overlap. This is a nasty problem because it is not possible to define the range for a feature on a global level.
But within the tips section of Jonathans blog a small text snippet describes a possible solution:
“Use an invisible reference line, either based on a constant or a calculated field. Tableau will make sure the axis range is big enough.”
Because we can use parameters or table calc to define the value of a reference line, we create four new fields that calculate the two maxima and minima of the two features and add respectively subtract a small user-defined “buffer”.
Place those fields on the corresponding marginal density visualizations and add two reference lines on each chart – one for the minimum and one for the maximum. The result are “beautifully” aligned axis.
The last issue is easily spotted if you chose “loess” as smoothing function and change the dataset towards a larger one – for example something like the 54k diamond dataset. Most probably you will get an error from R. In a nutshell this tells you that R failed on creating a loess estimate because of the large number of cases. A sufficient solution is to use a sampling procedure, calculate the loess estimate on the sample and extrapolate the result back to the complete dataset. Then following modification of the R code from above shows a solution that uses a sample of 500 data points if the dataset contains more than 500 items: The revised dashboard with perfectly aligned axis and ready for analyzing larger datasets is shown below:
The importance of statistical charts as part of the exploratory data analysis process is a good reason for using a tool specialized in visualization like Tableau. As shown above it can provide easy access and additional interactions in an enterprise environment (e.g. together with Tableau Server) that are otherwise unachievable for people not familiar with special purpose languages like R or special visualization packages like the seaborn library for Python. And again it shows that R is a perfect partner, as it does all the statistical calculations in the background and keeps Tableau free of them. Also, it allows the dashboard creator to add additional features like other smoothing functions or summary statistics.
I hope that these blog post was something useful for anyone considering to use Tableau for statistical visualization and it would make me happy if you leave a short comments. As always the underling two workbooks (packed) can be found here (basic version) and here (revised version).
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.