Extending RevoScaleR for Mining Big Data – Hexbins

April 5, 2013

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Derek McCrae Norton, Senior Sales Engineer

It is my job to help potential clients see that the tasks they are used to completing can be completed on big data in Revolution R Enterprise (and that it is easy).  Honestly, this is my dream job, and in my eyes it is sort of like playing and getting paid for it.

Many times RevoScaleR has exactly what the clients are looking for, which is great, even if not as much fun for me. Sometimes, however, the client wants to carry out a task that is not explicitly included in RevoScaleR, and this is where the fun begins.

Now that my long-winded introduction is over, I want to share a few of these extensions and their motivations.

Motivation: Visualizing bivariate relationships with big data.

Scatterplots are often the go-to bivariate visualization, but there are times when they do not perform well, such as with big data (too much ink on the page) or heavy overlap (no distinction between points). Hexagonal binning can deal with both issues, and there is a package, hexbin, built just for that. The problem is that it does not scale to big data as is. Luckily, we only have to calculate counts based on bins. With xbins = 30 and shape = 1, we have on the order of 30 x 30 = 900 bins, which is not big data.

hexbin allows us to easily define the bins based on an x and y range, and rxDataStep can step through the data to calculate the counts by bin. The end result is a relatively small function that leverages RevoScaleR for a small piece (the only big data piece), and then creates a standard hexbin object.
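
To make the idea concrete, here is a minimal sketch of counting by bin one chunk at a time. It is written in plain Python rather than R, it is not the rxHexBin code, and it does not use any RevoScaleR or hexbin API. The names hex_cell, count_chunk and count_in_chunks are hypothetical, and the cell identifiers are my own convention, not hexbin's cell numbering. The point it illustrates is that once the x and y ranges are fixed, each chunk can be binned independently and the counts simply added up, so only one chunk ever needs to be in memory.

```python
import math
import random
from collections import Counter


def hex_cell(x, y, xrange, yrange, xbins=30, shape=1.0):
    """Return an identifier for the hexagon containing (x, y).

    Hexagon centers lie on two staggered rectangular lattices; the point
    is assigned to whichever candidate center is nearer. The identifier
    (lattice, row, column) is a hypothetical convention for this sketch.
    """
    # Rescale so that centers sit on integer or half-integer coordinates.
    sx = xbins * (x - xrange[0]) / (xrange[1] - xrange[0])
    sy = xbins * shape * (y - yrange[0]) / ((yrange[1] - yrange[0]) * math.sqrt(3))
    # Nearest center on lattice 0 (integer coordinates).
    j0, i0 = math.floor(sx + 0.5), math.floor(sy + 0.5)
    d0 = (sx - j0) ** 2 + 3 * (sy - i0) ** 2
    # Nearest center on lattice 1 (coordinates offset by one half).
    j1, i1 = math.floor(sx), math.floor(sy)
    d1 = (sx - j1 - 0.5) ** 2 + 3 * (sy - i1 - 0.5) ** 2
    return (0, i0, j0) if d0 <= d1 else (1, i1, j1)


def count_chunk(chunk, xrange, yrange, xbins=30, shape=1.0):
    """Count the points of one chunk per hexagon."""
    return Counter(hex_cell(x, y, xrange, yrange, xbins, shape) for x, y in chunk)


def count_in_chunks(chunks, xrange, yrange, xbins=30, shape=1.0):
    """Accumulate per-hexagon counts over an iterable of chunks."""
    total = Counter()
    for chunk in chunks:
        # Counts are additive, so chunks can be processed one at a time.
        total.update(count_chunk(chunk, xrange, yrange, xbins, shape))
    return total


# Small demonstration with simulated data split into ten chunks.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(10_000)]
chunks = [points[k:k + 1_000] for k in range(0, len(points), 1_000)]
counts = count_in_chunks(chunks, (0.0, 1.0), (0.0, 1.0))
```

The resulting table of counts stays small no matter how many rows are streamed through, which is why only the counting step needs big data machinery. The ranges must be known before the pass starts, since every chunk has to be binned against the same grid.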

[Figure: Hexbin]
You can test this out yourself with the function rxHexBin on GitHub.

Look for upcoming posts on other ways to extend RevoScaleR for mining Big Data.
