Datashader is a big deal

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I recently got back from Strata West 2017 (where I ran a very well received workshop on R and Spark). One thing that really stood out for me at the exhibition hall was Bokeh plus datashader from Continuum Analytics.

I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.

I am so excited about datashader capabilities I literally will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.

datashader

Every plotting system has two important ends: the grammar where you specify the plot, and the rendering pipeline that executes the presentation. Switching plotting systems means switching how you specify plots and can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots to hide or decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then sometimes you don’t have to leave: for example the absolutely amazing adapter plotly::ggplotly).

Currently, to use datashader you must talk directly to Python and Bokeh (i.e. learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.

datashader makes points and pixels first class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite resolution abstract plane) and allows the user to specify scale dependent calculations and re-calculations over them. It is easiest to show by example.

Please take a look at these stills from the datashader US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:

NewImage

If we were to use the interactive version of this graph we could zoom in on Chicago and the majorities are re-calculated based on the new scale:

NewImage

What is important to understand is that is this is vastly more powerful than zooming in on a low-resolution rendering:

Unknown

and even more powerful than zooming out on a static high-resolution rendering:

Unknown 2

datashader can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.

You need to run datashader to really see the effect. The html exports, while interactive, sometimes do not correctly perform in all web browsers.

An R example

I am going to share a simple datashader example here. Again, to see the full effect you would have to copy it into an Jupyter notebook and run it. But I will use it to show my point.

After going through the steps to install Anaconda and Juputer notebook (plus some more conda install steps to include necessary packages) we can make a plot of the ggplot2 data example diamonds

ggplot2 renderings of diamonds typically look like the following (and show of the power and convenience of the grammar):

NewImage

A datashader rendering looks like the following:

NewImage

If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18300 and 3.025 carats we get the following dynamic re-render:

NewImage

Notice the points shrunk (and didn’t subdivide) and there are some extremely faint points. There is something wrong with that as a presentation; but it isn’t because of datashader! It is something unexpected in the data which is now jumping out at us.

datashader is shading proportional to aggregated count. So the small point staying very dark (and being so dark it causes other point to render near transparent) means there are multiple observations in this tiny neighborhood. Going back to R we can look directly at the data:

> library("dplyr")
> diamonds %>% filter(carat>=3, carat<=3.05, price>=18200, price<=18400)

# A tibble: 5 × 10
  carat     cut color clarity depth table price     x     y     z
  <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.01 Premium     I     SI2  60.2    59 18242  9.36  9.31  5.62
2  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
3  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
4  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
5  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77

There are actually 5 rows with the exact carat and pricing indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The “ugly” presentation was an important warning. This is hands on the data, the quickest path to correct results.

For some web browsers, you don’t always see proper scaling, yielding artifacts like the following:

NewImage

The Jupyter notebooks always work, and web-browsers usually work (I am assuming it is security or ad-blocking that is causing the effect, not a datashader issue).

Conclusion

datashader brings to production resolution dependent per-pixel analytics. This is a very powerful style of interaction that is going to appear more and more places. This is something that the Continuum Analytics team has written about before and requires some interesting cross-compiling (Numba) to implement at scale. Now that analysts have seen this in action they are going to want this and ask for this.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)