(This article refers to an initial proof-of-concept version of r-big-pivot)
I have to admit that I very much enjoy pivoting through data using Excel. Its pivoting tool is great for getting a quick insight into a data set’s structure and for discovering interesting anomalies (the sudden rise of deaths due to viral hepatitis serves as a nice example). Unfortunately Excel itself is handicapped by several restrictions:
- a maximum of 1,048,576 rows per data set (which is crucial because the data needs to be formatted long)
- Excel is slow and often unstable once you go beyond 100’000 rows
- cumbersome integration into available data infrastructures
- restricted charting and analysis capabilities
To be fair – the first two points have been addressed by PowerPivot, which allows for theoretically unrestricted data set sizes and also speeds up calculations considerably. Plus, PowerPivot can easily fetch data from databases like MySQL. So PowerPivot does a good job, but sometimes you just want a box plot or a heat map – to give a few examples. And then it gets painful to impossible.
But why bother with Excel anyway?
Now R on the other hand is THE tool for processing and visualizing data. Thanks to Hadley Wickham’s reshape(2), ggplot2 and you name it there is pretty much nothing you can’t get done – but – and this is a big but in my opinion – you have to use the command line. Perfect for specific data operations but certainly no fun for data exploration. Changing how an aggregation is structured takes 10 seconds or longer to type instead of just two mouse clicks. Restricting plots and aggregations to a varying subset of a large number of categories is no fun without an appropriate UI.
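To make the friction concrete: a restructuring that takes Excel’s pivot tool two clicks takes a retyped `dcast()` formula in reshape2. A minimal sketch, using made-up long-format data (all column names and numbers below are invented for the example):

```r
library(reshape2)

# hypothetical long-format data: one row per year/cause/sex
d <- data.frame(
  year  = rep(2008:2010, each = 4),
  cause = rep(c("A", "B"), times = 6),
  sex   = rep(c("F", "M"), each = 2, times = 3),
  n     = c(10, 12, 9, 11, 14, 13, 8, 10, 15, 16, 7, 9)
)

# pivot: causes as rows, years as columns, summing over sex
dcast(d, cause ~ year, value.var = "n", fun.aggregate = sum)

# restructuring the aggregation means retyping the formula,
# e.g. sex by year instead of cause by year:
dcast(d, sex ~ year, value.var = "n", fun.aggregate = sum)
```

Each change of perspective is a new formula typed at the prompt – exactly the ten-seconds-instead-of-two-clicks problem described above.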
As I don’t know of any open source tool which closes this gap, I have been fantasizing about doing this in R for a while. Obviously crafting a UI is most easily done with HTML and company. But how to get a web UI to talk to R? Well, since Shiny this is actually pretty easy. The Shiny package comes with a full web server and provides a very well crafted R framework for quickly setting up a basic UI. The documentation is superb, by the way, and the community very helpful.
Have a look at the GitHub repo for the code and an example of how to use it. But let me warn you – the current state of the project is rather a proof of concept, as it is still too rough for serious usage. Still, implementing this tool provided me with insights into ergonomic as well as performance bottlenecks that have to be solved.
List of Handicaps and Bottlenecks:
- Performance of data processing is okay (on my PC) but slow for data exceeding something like two million rows.
- Processing data with SQL is not comfortable and in many cases not possible. MDX might be just the right tool for playing with data. Something analogous to sqldf, just with MDX (mdxdf?), might be what we need here.
- A graphical user interface for pivoting would make the data interaction more fun and easier. Some jQuery UI component translating its state to a command which is passed to a hidden text input could do the trick.
- Given that no big data set is ever really ready to use (0s, NAs, NULLs and unexpected or invalid values), some further assistance is needed in this area.
- Every large data set has to make use of categories and labels. “F” and “M” for the gender are a very common and simple example. Once a category contains more than ten values, keeping track of them all becomes tricky, though the labels can still be short and descriptive at the same time. Take Germany’s NUTS regions down to level 3, however, and the labels are not only no longer descriptive; handling them on the command line becomes impossible. Especially here a well-developed UI is necessary.
- It is also very useful to visually link a category ID with its description – like “V01-V09” for “pedestrian injured in transport accident” in the case of the ICD-10 codes.
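The sqldf point can be made concrete. sqldf already lets you run SQL against data frames, which is exactly the kind of interface a hypothetical mdxdf would need to provide for MDX (the regions and counts below are invented for the example):

```r
library(sqldf)

# made-up counts per NUTS region
d <- data.frame(
  region = c("DE11", "DE11", "DE12"),
  deaths = c(5, 7, 3)
)

# SQL over a plain data frame: grouping works fine, but pivot-style
# cross tabulation quickly becomes awkward in plain SQL
sqldf("SELECT region, SUM(deaths) AS deaths FROM d GROUP BY region")
```

An mdxdf following the same pattern would accept an MDX statement instead of SQL and return the resulting cube slice as a data frame.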
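For the ID-to-description linking, a plain named vector already goes a long way as a lookup table; how such a table is best surfaced in the UI is the open question. A sketch using the ICD-10 example from above (the second code range and its shortened description are taken from ICD-10 as well):

```r
# lookup table: category ID -> human-readable label
icd_labels <- c(
  "V01-V09" = "pedestrian injured in transport accident",
  "B15-B19" = "viral hepatitis"
)

codes <- c("B15-B19", "V01-V09", "B15-B19")

# translate IDs to descriptions, keeping the ID visible alongside
paste0(codes, ": ", icd_labels[codes])
```

In a UI this mapping would back tooltips or legend entries, so the short ID stays usable for filtering while the description stays visible.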