Since I started using Tableau I’m quite fascinated about the capabilities of this piece of software. Before Christmas I was looking how I could build an interactive visualization that helps me to explore the relationships between different objects in a form that shows which objects are very close to each other according to some similarity measure or vice versa. This is one fundamental question in recommendation systems. It was a nice coincidence that at the same time Bora Beran published his blog post on how you could visualize a correlation matrix because his tutorial together with a well written article from Jonathan Drummey about how to calculate “No-SQL” cross product served as a starting point for me to create a second version of my previous blog post about venue recommendation – but now inside Tableau and interactive!
The current blog article is also based on data from the location based social network Foursquare which was crawled over a couple of month during 2012 (data showing a sample from the city of cologne). In principle every data point states a location/venue, a user and a timestamp when the user checked in at the specified location (and quit some other “metadata” about location type, latitude, longitude, e.g.). Based on these dataset the goal is to calculate and visualize the similarity of each pair of venues. The similarity could be used afterward to recommend locations based on venues that a user visited before (Please look into my older blog post to get additional information about the dataset and location recommendation).
A similarity measure that is quite common in the domain of recommendation systems is the so-called cosine similarity. Basically the cosine similarity measures the angle between two vectors where every vector represents one of the objects (here venues) someone wants to compare. Every element \(i\) of the vector express the affinity of person \(i\) towards the object that is represented by that vector. In this example we measure the affinity by calculating the number of times that person visited a specific venue. For example, if person 2 travels to venue A three times, then A2 is 3 whereas it is 0 if person 2 never visited place A.
The cosine similarity ranges between -1 and 1, where 1 means completely similar, 0 means no similarity and -1 means both venues are completely different. Because every score has to be zero or positive, the smallest value for the cosine similarity is 0 which makes sense because using the described methodology a user cannot express any negative affinity for a place.
Prepare raw data with R
My first objective was just to reproduce a visualization that is similar to Bora’s correlation matrix, but is based on the cosine similarity. The matrix itself should provide a nice overview to identify the most connected venues very quickly.
The first step therefore is to bring the data in the correct shape for Tableau. Because Tableau’s abilities to reshape and enrich data are limited this is a nice example how preprocessing data with R could leverage advanced analytics in Tableau.
The simple idea is to calculate the number of check-ins by venue and user (~score) and then use the inner join of this dataset to itself based on the USERID (through a “multiple table” connection in Tableau). Now we can easily compare different venues by looking at both scores from a user using a set of Tableau table calculations.
For the aggregation the R code below shows the necessary steps: First transform the necessary fields. Then, extract a single dataset that consists of all venues and their properties like latitude, longitude, venue name and category name.
# -------------------------------------- # filter parameters # -------------------------------------- minNumberOfCheckinsPerVenue = 10 minNumberOfUsersPerVenue = 2 # -------------------------------------- # load required libraries # -------------------------------------- require(data.table) # -------------------------------------- # load Foursquare data # -------------------------------------- fileName <- "Foursquare_Checkins_Cologne.csv" dataset <- read.csv2(fileName, colClasses = c(rep("character", 7), rep("factor", 2), rep("character", 2)), dec = ",", encoding = "UTF-8") # how the first 10 elements look like head(dataset) # -------------------------------------- # data preprocessing # -------------------------------------- dataset$CHECKIN_DATE <- as.POSIXct(dataset$CHECKIN_DATE, format = "%Y-%m-%d %H:%M:%S") dataset$LAT <- sub(",", ".", dataset$LAT) dataset$LNG <- sub(",", ".", dataset$LNG) dataset$LAT <- as.numeric(dataset$LAT) dataset$LNG <- as.numeric(dataset$LNG) dataset$HOUR24 <- as.numeric(format(dataset$CHECKIN_DATE, "%H")) venueDataset <- unique(dataset[, c("VENUEID", "LNG", "LAT", "VENUE_NAME", "CATEGORY_NAME")]) # use data.table for aggregation datasetDT <- data.table(dataset) venueUserDataset <- datasetDT[, list(COUNT_CHECKINS = length(unique(CHECKIN_ID))), by = list(VENUEID, USERID)] # filter for artificial venues tmpVenueDataset <- data.table:::merge.data.table( datasetDT[, list(COUNT_USER_ALL = length(unique(USERID))), by = list(VENUEID)][COUNT_USER_ALL >= minNumberOfUsersPerVenue], datasetDT[, list(COUNT_CHECKINS_ALL = length(unique(CHECKIN_ID))), by = list(VENUEID)][COUNT_CHECKINS_ALL >= minNumberOfCheckinsPerVenue], by = "VENUEID") venueUserDataset <- data.table:::merge.data.table(venueUserDataset, tmpVenueDataset, by = "VENUEID") venueUserDataset <- data.frame(venueUserDataset) # -------------------------------------- # merge scoring table with venue information and save as csv # -------------------------------------- # venueDataset contains all venues head(venueDataset) ## VENUEID LNG LAT VENUE_NAME ## 1 4aef5d85f964a520dfd721e3 6.959 50.94 Koeln Hauptbahnhof ## 24 4bade052f964a520506f3be3 6.949 50.93 Stadtbibliothek Koeln ## 25 4baf1998f964a52033eb3be3 6.964 50.93 Deutsches Sport & Olympia Museum ## 26 4baf428cf964a52024f43be3 6.962 50.92 Ubierschaenke ## 27 4ba4f032f964a520dac538e3 6.849 50.92 OBI Baumarkt ## 28 4bc210d92a89ef3b7925f388 6.927 50.95 Pfeiler Grill ## CATEGORY_NAME ## 1 Travel & Transport ## 24 College & University ## 25 Arts & Entertainment ## 26 Nightlife Spot ## 27 Shop & Service ## 28 Food # venueUserDataset contains all the relationships (ratings) head(venueUserDataset) ## VENUEID USERID COUNT_CHECKINS COUNT_USER_ALL COUNT_CHECKINS_ALL 1 ## 4a120aa1f964a5206d771fe3 594791 1 11 13 2 ## 4a120aa1f964a5206d771fe3 24447 2 11 13 3 ## 4a120aa1f964a5206d771fe3 139050 1 11 13 4 ## 4a120aa1f964a5206d771fe3 74382 1 11 13 5 ## 4a120aa1f964a5206d771fe3 2226845 2 11 13 6 ## 4a120aa1f964a5206d771fe3 13255988 1 11 13 venueUserFullDataset <- merge(x = venueUserDataset, y = venueDataset, by = "VENUEID") head(venueUserFullDataset) write.csv2(venueUserFullDataset, file = "140116_venue_recommendation_dataset.csv", row.names = FALSE)
We use the data.table package to easily calculate the score and two other aggregates (COUNT_USER_ALL and COUNT_CHECKINS_ALL) that are used to filter the dataset for “artificial” venues like places that are only visited by one person or that had less than 10 check-ins. Last step then is to enrich the result with the venue information and save it as a simple CSV. Now we are ready to start with Tableau.
Creating a similarity matrix with Tableau
If a text file connection with “Multiple Tables” (common key is the USERID) is used, it is possible to create a data structure that allows a direct comparison of the scores for each venue pair.
After that we rename some of the measures and dimensions, create a hierarchy based on VENUEID and VENUENAME and setup the correct type for both coordinates. After that it should look like this:
To calculate the cosine similarity we need to precompute three things:
- the dot product between the vectors of the two venues,
- the Euclidean norm for the vector of the first venue and
- the Euclidean norm for the vector of the second venue
Because of how we created our overall dataset (inner join) it is very easy to compute the dot product by multiplying both scores we have per comparison (aka row).
Both Euclidean norms are somewhat more complex and we need the help of Tableaus “domain completion ” (explained here) to calculate them. We use this approach (changing this
SUM(POWER( ) and create two table calculations. Keeping in mind that we want to create a matrix like visualization, one of the created fields will compute the Euclidean norm down and one across the matrix. This can be configured after we placed both field onto the visualization shelf.
Now we can compute the cosine similarity by putting all three things together:
After that lets start to build the visualization by bringing the venue hierarchy from the first table onto columns and the venue hierarchy from the second table onto rows.
Now put the field Cosine Similarity on color (change the visualization to Square) and change the table calculation for Sum of Squared Scores (horizontal) to across and for Sum of Squared Scores (vertical) to down.
That’s it, you can also place the field on Label to show up the scores and/or use the computed cosine similarity as filter.
One step further – interactive exploration by bringing together venue similarity and spatial distance
The matrix shows a lot of information available and gives you a nice overview of the overall pattern (here one can see that for most of the venues the similarity is rather small) but on the other side lacks of clarity. You can sort by one column to bring up the most similar venues but, you still need to scroll up and down to oversee all values or from left to the right to compare it with other venues. And what’s also not possible based on this visualization type, is to examine for example if the spatial proximity is correlated with the computed similarity.
Putting both things together I’d like to show a visualization that use a map based approach to show similar locations towards a chosen venue (technically this corresponds to a visualization that is displaying either a row or a column from the matrix). Every time the user choose a different venue the map will update to show similar venues based on the calculated cosine similarity.
The first step is to create a parameter Reference Venue that contains all the venues the user can select and is used to be the “reference venue”. For this we create a calculated field that combines VENUEID and VENUE_NAME (because we can’t assume that the venue name is unique) and use it to “fill” the new parameter.
Next we need to adjust the calculation of the dot product. The idea is to set the dot product to zero if the selected venue is not involved (this will lead the cosine similarity for this pair of venues to be zero). After that, we can use the cosine similarity to adjust the size of any venue on the map – venues that are not relevant will “disappear”.
Last but not least we need to change the calculation of the cosine similarity a little bit:
And that’s it. Place both coordinates onto column and row and (Y_)VENUE_NAME and (Y_)VENUE_ID on the detail shelf. The calculated field for cosine similarity should be placed on color and size. Remember to configure the table calculation in the same way as before.
You will see that as you change the reverence venue with the dropdown field the map is updated accordingly.
Visualizing object similarity for non-spatial data
Because the foursquare dataset is about geographical data it seems natural to use a map as visualization method. But what if these spatial relationship is not given – for example how could we display the cosine similarity for different product sold in an online shop?
I think treemaps are perfectly suited to display such an information. Similar to the map approach we use the cosine similarity for size and color. Using treemaps, it is easy to get on overview over the most similar venues as they are clustered around the upper left corner.
The transformation from the map is simple – just drop the coordinates and add the venue name and cosine similarity to label.
There is one important thing left: if you download the attached workbook and play around with it you may experience that the performance is rather low. Especially if one use larger datasets Tableau takes some time to render the matrix. It would be interesting to see if someone can speedup the calculation.
Please also find my workbook with all the examples from this blog post attached: