Last quarter was a bit too busy to write some new blog post because of a new job. And changing the job often come along with changing the tools you work with. That was my way to Tableau. Tableau is one of the new stars in the BI/Analytics world and definitely worth a look. The people at Tableau describe their tool as an instrument that combines interactive visual analytics, sophisticated visualization methods, an easy to use graphical user-interface and a lot of connectors for dozens of different data sources ranging from standard Excel to more “Big Data” like things such as different Hadoop distributions and NoSQL databases. What is really impressing is that you can create interactive visualizations in just no time.
I’m currently using Tableau to carry out explorative analysis and reporting up to management level (creating high polish PDFs). I’m quite satisfied with Tableaus capabilities for this even that Tableaus focuses more on Self-BI. What I missed are sophisticated analytical methods like different algorithms for classification, regression, clustering et cetera. So I was really happy that Tableau announced the integration of R in their upcoming release and quickly enrolled in their beta tester program a few weeks ago. Time now to share some insights from my first experiments with the beta version of the upcoming Tableau release 8.1.
First step is to get the connection between R and Tableau based on Rserve up and running. Following the approach given here it should be really easy even for Windows. It is good to know that you can launch Rserve with “R CMD Rserve_d” instead of “R CMD Rserve” to grab some information from the communication between Tableau and R for debugging (here you can find another short comment about some more advanced debugging techniques). Alternatively you can launch Rserve from within R with:
If everything works fine you can test you connection from inside Tableau by selecting “Help -> Manage R Connections …”. Be sure to replace “localhost” with “127.0.0.1” because of a bug in the current beta version (tested with 8.1 Beta 2 – may be fixed in newer versions) which will slow down performance otherwise.
The integration uses so called table calculations to retrieve results from calculations within R. A table calculation could be used to state a calculation like “Partition the sales data by region and calculate the overall revenue per product category”. This means that the arguments from Tableau to R have to be specified as aggregates. The result of the R calculation then should be one value or the same number of values as in the partition (vector). This sounds a little bit difficult and to be honest in the beginning it was difficult for me to understand the logic behind. But let’s try some examples to demonstrate what that means.
Our aim is visualize the resulting clustering which should give us some understanding about how it matches the true classification. We use a calculated field to pass the data to R and get the resulting clustering ids from the k-Means function. The code for the table calculation function inside the calculated field is very simple:
SCRIPT_INT('result <- kmeans(x = data.frame(.arg1,.arg2,.arg3,.arg4), 3) result$cluster', SUM([Sepal#Length]),SUM([Sepal#Width]),SUM([Petal#Length]),SUM([Petal#Width]) )
SCRIPT_INT( R code, Parameters ) is the Tableau table calculation for running R code that returns an integer value or vector (Tableau also offers functions that return boolean, string and real values). It is important to know, that this function only returns one value/vector and such, if you need more parts of the results from your R computation, you have to create additional calculated fields. In the code above we use data.frame on the given parameters (going from
.argN where N is the number of parameters) first to create a data object for the k-Means function. The resulting clustering ids are then passed back to Tableau. As you see all four parameters must be aggregated first (
SUM(...)) before we can use them in our R calculation.
We can then use the calculated field to show the cluster assignment by assigning every cluster id a different shape. Because we want to see the results on a detailed level we need to deactivate the default aggregation in Tableau by clicking on “Analysis -> Aggregate Measures”. The resulting plot reveals that there is a huge mismatch between the clusters and classes – not the best clustering solution.
Now it would be nice to strengthen the interactive features of Tableau and therefore parameterize the R calculations in a way that we can modify the calculation from within Tableau. This leads to my second example visualizing the results from a PCA – showing a scatterplot of the objects in the space of the two choosen components. Therefore we use the USArrets dataset (Violent Crime Rates by US State) and the standard R function princomp for doing principal component analysis:
SCRIPT_REAL('ncomp <- max(.arg5) df <- data.frame(.arg1,.arg2,.arg3,.arg4) pc <- princomp(df, cor = TRUE) as.numeric(pc$scores[,ncomp])', SUM([Assault]),SUM([Murder]),SUM([Rape]),SUM([UrbanPop]),[No. Component - x axis] )
We can extract the scores for every component from the result of
princomp. Because we need the scores for x and y axis we create two calculated fields – one for each axis (above you can see the code for the calculated field that shows the x axis). Together with the data we can pass a parameter from Tableau to R that determines which component scores we want (one parameter for every calculated field).
Because R retrieves this parameter the same size as the other inputs we need to aggregate the parameter to yield a scalar value for selecting the right scores.
Now it’s beautiful to see how we can integrate these R calculations into Tableau’s typical interactive workflow: Drag one of the attributes to column and one to ‘Shape’ – now you can see how theses features load on the selected components. If you activate the parameter control for each of the parameters you can change the scores mapped on x and y axis on the fly. Really a nice tool to explore the results of a principal component analysis.
My last example demonstrates a use case that is common for me but that in my opinion also shows the limitations of the current R integration. Imagine that your personal data scientist sends you a fine tuned data mining model that you want to use for further prediction. You want to apply these model on your current dataset in Tableau and use the predictions as additional feature for your visualizations to answer further questions. How could the new R integration in Tableau helps us with that?
Let’s start and first create the model outside of Tableau because this is likely the way the data scientist will do it (using tools from his own development environment). For this we create a simple classification model with rpart (decision trees) for predicting the outcome of the species feature from our iris dataset and save the model in an RData file ready for sharing (we could also use a more standardized format like PMML available in the corresponding package pmml).
require(rpart) data(iris) rpartModel <- rpart(Species ~ ., data = iris) save(rpartModel, file = "C:/tmp/RpartModel.RData")
Now we could try to utilize the model from within Tableau. For this we create the following calculated field that loads the model and predicts the outcome of the given data. The code looks like this
SCRIPT_STR(' require(rpart) data(iris) load(file = "C:/tmp/RpartModel.RData") newiris <- data.frame(Sepal.Length = .arg1, Sepal.Width = .arg2, Petal.Length = .arg3, Petal.Width = .arg4) as.character(predict(rpartModel, newiris, type ="class"))', ATTR([Sepal#Length]),ATTR([Sepal#Width]),ATTR([Petal#Length]),ATTR([Petal#Width]) )
It is not surprising that the visualization of true class (color) versus predictions (shape) reveals good results. As before you need to deselect “Aggregate Measures” for the plot.
So far so good. But what if we want to use the predictions together with the class labels to create a simple confusion matrix? Let’s try and plot both features against each other and color the results using an additional calculated field called Misclassification (
IF ATTR([Species]) = [R Prediction] THEN 'No' ELSE 'Yes' END). What I achieved after some experiments are either disaggregated results (showing a “1” for each entry colored according to “Misclassification”“ – which is not very helpful) or a single prediction for each partition of the aggregated data that gets multiplied by the number of data points inside each partition. Here are two pictures showing both possible results:
Sadly, until now I didn’t get the things running in way it should be (maybe some of you have an helpful idea idea – if so then don’t hesitate to leave a comment) which make me think if some of my desired use cases are not possible with the current R integration in Tableau.
Finally I want to summarize what I observed so far: I’m really excited whenever a commercial software company announces that it will integrate R into their software stack. What Tableau did with R is really useful as the shown uses cases hopefully demonstrated. But the integration is also not that simple than other things in Tableau and I still need more time to get a better picture of things that work and things that won’t work. Therefore it is perfect, that Tableau has a very active community and I’m sure that given some weeks we will see dozens of new possibilities on how this integration of R can produce fruitful results.
Please also find my workbook with all the examples from this blog post attached: