How to Use R for Connecting Neo4j and Tableau (A Recommendation Use Case)


Venue Recommendation Dashboard using Tableau, R and Neo4j

Introduction

The year is just a little more than two months old and we already got good news from Tableau – beta testing for version 9.0 has started. But it looks like one of my most anticipated features didn't make it into the first release – the Tableau Web Data Connector (it's mentioned at 01:18:00 in Christian Chabot's keynote, which you can find here). The connector can be used to access REST APIs from within Tableau.
Instead of waiting for an unknown future release containing the Web Data Connector, in this post I will show how you can use the current version of Tableau together with R to build your own "Web Data Connector". Specifically, we connect to an instance of the graph database Neo4j using Neo4j's REST API. And there is more good news: our approach creates a live connection to the "REST API data source", going beyond approaches that utilize Tableau's Data Extract API to produce static .tde files that are then loaded into Tableau.

Architecture

As Tableau currently lacks a native connector for Neo4j (for examples of the demand, see here, here and here), we first need to think about a kind of "middleware" to connect the two. The figure below shows how that could be done:

Data flow when accessing Neo4J from R

The main idea is to write a small R program that sends dynamically built Cypher queries to Neo4j. Cypher is a query language (somewhat similar to SQL) specifically designed for Neo4j's graph model. The query result is then parsed from JSON, transformed and sent back to Tableau as the answer.
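To give a first flavor of Cypher, here is a small illustrative query in the style used later in this post; the relationship type "RATED" and the property names are assumptions for illustration, not necessarily the ones from the final script:

```
// Find all venues rated by a given user, together with the rating
// (uses the legacy auto index on "name" that we configure below)
START u=node:node_auto_index(name='U#1')
MATCH (u)-[r:RATED]->(v:venue)
RETURN v.name, r.stars
```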

Use Case

If you are a regular reader of my blog, you may know the "check-in" data set. It was crawled for scientific purposes from the location-based social network Foursquare over a couple of months in 2012 and contains a sample of check-ins for the city of Cologne during that time. Every data point consists of a location/venue, a user and a timestamp of when the user checked in at that location (plus some "metadata", e.g. location type, latitude and longitude). Based on this dataset, the goal is to calculate and visualize the similarity of each pair of venues (please look at this older blog post for additional information about the dataset and location recommendation). You can find a link to the data set at the end of the posting.

Installing Neo4J

To start with the recommendation example, we need to set up a personal Neo4j instance. Fortunately, the installation procedure is very easy. I will explain how to do it for the Windows version: the first step is to download the community edition for Windows (currently I use 2.1.6) from the Neo4j website. The download contains a setup program that asks for an installation directory. After finishing the installation, open the "neo4j-community.exe" program in the bin folder of the installation directory.

Neo4J administration tool

Now choose a directory for the graph database data and then click on "Settings" at the bottom left. There we need to change some of the parameters: select "Edit" in the database configuration section and uncomment the entries "node_auto_indexing=true" and "node_keys_indexable=name,age". For the latter, change the parameter value from "name,age" to "name". This tells Neo4j to automatically index all nodes on the field "name" as we add them to the database; we will use this index later to query nodes efficiently. Afterwards close the "Settings" dialog and press "Start" to launch the Neo4j server. From this point on you can use your browser to open the Neo4j web front-end and query the database.
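After the edit, the relevant lines in the database configuration file (the one opened by the "Edit" button) should look like this:

```
# enable automatic node indexing and index the "name" property
node_auto_indexing=true
node_keys_indexable=name
```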

Data Preparation

Before we can start building our viz in Tableau, we need to load the data into Neo4j. R is ideal here for doing some transformations and then shifting the data over to Neo4j. For both uploading and querying we use the same R method to access Neo4j's REST interface.
The overall code for data preparation and upload can be found in the following script. The workhorse is the function 'query', which uses the curl library to send HTTP requests and decodes the returned JSON results into R data objects.
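A minimal sketch of such a 'query' function could look like this, assuming a local Neo4j 2.x instance and its Cypher REST endpoint at /db/data/cypher (the original script may differ in details):

```r
# Minimal sketch of the 'query' workhorse, assuming a local Neo4j 2.x
# instance with the Cypher REST endpoint at /db/data/cypher
library(RCurl)    # send HTTP requests
library(RJSONIO)  # encode/decode JSON

query <- function(cypher, url = "http://localhost:7474/db/data/cypher") {
  h <- basicTextGatherer()
  curlPerform(url = url,
              postfields = toJSON(list(query = cypher)),
              httpheader = c("Content-Type" = "application/json",
                             "Accept" = "application/json"),
              writefunction = h$update)
  # Neo4j answers with a JSON object containing 'columns' and 'data'
  fromJSON(h$value())
}
```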
The next part is about data transformation. The only special thing here is that we use the "data.table" package to sum up the number of visits for every pair of venue and user. The result is an approximation of the user's rating for the venue (one can debate this approach, but I think it's fine for this example). As the last step in this part, we write the venue data frame into an Excel file that later serves as the data source for Tableau.
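A hedged sketch of this aggregation step could look as follows; the column names ('venueid', 'userid') and the Excel helper are assumptions for illustration, not necessarily those of the original script:

```r
# Sum up the visits per (venue, user) pair as a rating approximation
library(data.table)
checkins <- as.data.table(checkins_df)
ratings  <- checkins[, list(stars = .N), by = list(venueid, userid)]

# write the venue master data for Tableau, e.g. with the 'xlsx' package
# xlsx::write.xlsx(as.data.frame(venues), "venues.xlsx", row.names = FALSE)
```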
The next part of the script uploads the data to Neo4j using dynamically built Cypher 'CREATE' statements. (It's up to the reader to go deeper into the Cypher syntax – this blog post is not intended as a Cypher tutorial, but if you want to know more, please look at the Cypher reference or the good tutorial you can find here. There is also a Neo4j package for R, but I haven't tested it and therefore can't say whether handling all the queries with the package is easier than using plain Cypher.) The order in which we upload the data to Neo4j is as follows (a sketch of such statements comes after the list):

  1. Venues: as nodes of type “venue” with the additional attributes ID, name, category name and coordinates
  2. Users: as nodes of type “user” with the additional attribute ID
  3. Visits: as relationships with the single attribute "stars" containing the rating
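A sketch of such dynamically built statements might look like this (label, relationship type and property names are assumptions based on the list above):

```r
# Create one venue node per row; properties follow the list above
for (i in seq_len(nrow(venues))) {
  query(sprintf(
    "CREATE (v:venue {name:'%s', venuename:'%s', category:'%s', lat:%f, lng:%f})",
    venues$venueid[i], venues$venuename[i], venues$category[i],
    venues$lat[i], venues$lng[i]))
}

# Visits become relationships between existing nodes, looked up through
# the auto index on 'name' that we configured earlier, e.g.:
# START u=node:node_auto_index(name='U#1'), v=node:node_auto_index(name='V#1')
# CREATE (u)-[:RATED {stars: 2}]->(v)
```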

To simplify the similarity calculation later, it is a good idea to pre-calculate static components during the data preparation phase and store them as attributes of nodes or relationships inside the graph. As similarity measure I choose the cosine similarity, which consists of two parts: the dot product of two vectors (where each vector holds the ratings of the different users for one venue), divided by the product of both vector magnitudes. The magnitude of a venue's rating vector is the static part and hence can be pre-calculated. This is done in the last part of the script.
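Written out, with $r_{a,u}$ denoting the (approximated) rating of user $u$ for venue $a$, the measure is:

$$\text{sim}(a,b) = \frac{\sum_{u} r_{a,u}\, r_{b,u}}{\sqrt{\sum_{u} r_{a,u}^{2}}\,\sqrt{\sum_{u} r_{b,u}^{2}}}$$

Both square-root terms in the denominator depend on a single venue only, which is why they can be stored as node attributes up front.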

Computing recommended venues

Before we put everything together, we need to set up the recommendation procedure. The basic idea is to give R a list of all venues plus a selected/reference venue for which we want to calculate the N most similar locations. R then marks the top N venues in the list of all locations and hands the entire list back to Tableau. The R script for this looks as follows: first, R loads all the libraries, receives the input from Tableau and does some nasty regexp hacking to extract the ID out of a parameter containing the user-selected venue. I explain later in the Tableau part why this is needed.
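For illustration, the extraction could look like the following sketch; the parameter format "venue name (ID)" is an assumption based on the Tableau section below:

```r
# Extract the ID from a parameter value like "Some Venue (V#234)"
# (the exact format and the example value are assumptions)
selected <- "Some Venue (V#234)"
venue_id <- sub(".*\\((V#\\d+)\\)\\s*$", "\\1", selected, perl = TRUE)
venue_id  # "V#234"
```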
Second, we define the function for accessing Neo4j's REST API, which differs only slightly from the one in the data preparation part: when receiving data we have to take care of the case where Neo4j returns an empty result (e.g. for a venue not connected in the graph).
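A small sketch of that guard, reusing the 'query' helper from above (the variable names are placeholders):

```r
# Guard against an empty answer before reshaping the result
res <- query(similarity_cypher)  # 'similarity_cypher' is a placeholder
if (length(res$data) == 0) {
  top_n <- data.frame()  # venue has no co-rated neighbours in the graph
} else {
  top_n <- as.data.frame(t(sapply(res$data, unlist)),
                         stringsAsFactors = FALSE)
}
```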
The third and last part merges the results of the computation with the list of all venues and returns it to Tableau as a "special-character-delimited" string (because Tableau expects a single vector as the result of an R calculation, we concatenate all fields of a row into one string and decompose it later in Tableau). The fields we send back to Tableau are the cosine similarity score and the number of joint users who rated both venues (the reference and the recommendation).
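The return value could then be assembled roughly like this (the '~' delimiter and the column names are arbitrary choices for illustration):

```r
# Mark the top N venues in the full list and return one string per row
result <- merge(all_venues, top_n, by = "venueid", all.x = TRUE)
paste(result$similarity, result$num_users, sep = "~")
```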

Building a recommendation dashboard in Tableau

Now it is time to put all the pieces together. The first thing is to load the dataset we created during data preparation into Tableau. Next we need a parameter that takes the venue for which the user wants a recommendation. We cannot use the name alone because it is not unique. We could take the ID, but that leads to another problem: would you know what's behind venue "V#234"? We could change the label for each parameter value manually, but nobody wants to do this for thousands of labels. Therefore, we create a field combining name and ID and use it as the source field for filling the parameter. A second parameter is used to restrict the number of recommendations to how many the user wants.

Filling the venue parameter from a field containing venue name and ID

Now we need to create the calculated field for the R computation. Use the code explained in the previous section to create a calculated field "recommendation_result", and use two separate fields ("cosine similarity" and "number of user") to extract the cosine similarity and the number of users from the return value. Let's start creating the viz by putting "LNG" on columns, "LAT" on rows and "Venueid" on the detail shelf. Then change the coordinates into dimensions (one tip – disable automatic updates of the visualization until all table calculations are configured correctly). Because we want to calculate the recommendations over all venues, change the table calculation for the field "recommendation_result" by moving "LAT", "LNG" and "Venueid" to the addressing fields.
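For orientation: the calculated field wraps the R script in one of Tableau's R integration functions (they require a running Rserve instance). The sketch below uses SCRIPT_STR; the field and parameter names are assumptions:

```
// "recommendation_result" – hands the venue IDs and the two parameters
// to R; inside the script they arrive as .arg1, .arg2 and .arg3
SCRIPT_STR('
    # R recommendation script from the previous section goes here
', ATTR([Venueid]), [Select venue], [Number of recommendations])
```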
For a proper visualization of the results, we want to change the mark symbol to differentiate between the selected venue, the recommendations and all other venues. Additionally, the size attribute is used to increase the visibility of the recommendations, whereas color encodes the similarity of the recommendations as given by the cosine similarity. For size and symbol, create a field "venue type" as shown in the picture, change the viz type to "Shape" and place the new field on the shape and size shelves (don't forget to adjust the table calculation addressing).

Field 'venue type' is used to show different labels for selection, recommendations and other venues

For color there is another obstacle: we want to avoid non-recommendations getting a color similar to that of recommendations with low similarity (Tableau doesn't let you set a special color for NULL). So instead of using the numeric cosine similarity directly, create a discrete field classifying the cosine similarity into fixed intervals. This allows us to give all non-recommended venues plus the selection a color-coding that is completely distinct from the color range used for the cosine similarity.
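Such a discrete classification field might look like the following sketch (interval boundaries and labels are assumptions; the screenshot below shows the original):

```
// Discrete color classes; NULL similarity means "not recommended"
IF ISNULL([cosine similarity]) THEN 'no recommendation'
ELSEIF [cosine similarity] >= 0.75 THEN '0.75 - 1.0'
ELSEIF [cosine similarity] >= 0.5 THEN '0.5 - 0.75'
ELSE '< 0.5'
END
```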

Field used for coloring the venues

To move the selected venue and all recommendations to the front of the map and the rest to the back, we need one last hack, because "normal" ordering does not work for our table calculation. Thanks to the Tableau Community, I got some help from Yuriy: his approach uses Tableau's page functionality to control the plot order of the venues. Drag "venue type" to the Pages shelf, select "Show History" plus "All", and deselect "Fade". Then move the slider to the right so the recommendations are plotted on top of all non-recommended venues and the selected venue on top of everything.

Pages in Tableau can be used to control plot order on a map

So far I have found only one drawback of this workaround: you should not filter out the object that is plotted last (in our case the selected venue). For example, if you add a venue type filter and use it to remove the selected venue, the element will not "come back" even if you select it again in the filter. Instead, you have to go back to the page controls and move the slider to the right again.

Now, that’s it pretty much – a complete visualization showing venue recommendation in Tableau computed live inside a graph database brought to you by R. You can add labels and tooltips or just use my “prefabricated” dashboard.

Summary

As this blog post shows, using R's REST capabilities to build a kind of live connection to a web data source is clearly worth thinking about. Whereas most available workarounds extract static data, the described approach not only gives you access to live data but also allows you to move intensive calculations from Tableau into a big number-crunching cluster and/or to utilize calculation techniques specific to the application or data source (like Cypher for Neo4j). The drawback is that you have to handle everything through table calculations. Time will show how the new Web Data Connector feature coming in one of the 9.x versions compares with this approach.
As always, you can find the workbook here and the check in dataset here.
