Google Analytics + R = FUN!

[This article was first published on Stats raving mad » R, and kindly contributed to R-bloggers].

The scope of this post is to show how simple it is to get data out of Google Analytics and create your own reports (which, you hope, can be at least semi-automated) and your favourite statistical graphs (those that GA is currently missing). As you already know, R is a favourite tool of mine, so it will be the main tool to get the data, reshape them and depict them. You will need elementary knowledge of the R language, and in the end you'll soon realise that a Google search is more than 40% of the code polishing your code will ever need…

R packages

There are two packages (or libraries) that connect to the Google Analytics API and return data to you: the older one is RGoogleAnalytics, and the new champion is rga. Both are excellent and I have used both on many occasions. rga seems a bit nicer, but RGoogleAnalytics is certainly more robust and works in all situations. Apart from the core, I will use ProjectTemplate for my personal organisation (you won't see it, however) and ggplot2 for graphics.
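Getting the packages in place is a one-off step. A minimal sketch, assuming the current install locations (rga is not on CRAN and installs from its GitHub repo, skardhamar/rga; the others are assumed available from CRAN):

```r
# One-off installation (locations are an assumption, check each package's page)
install.packages(c("RGoogleAnalytics", "ProjectTemplate", "ggplot2", "devtools"))
devtools::install_github("skardhamar/rga")

# Load on each session
library(rga)
library(ggplot2)
```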

Google Authentication

First of all, go to Google's API Console and create a new API application, after making sure that you have the Google Analytics service enabled.

Because lengthy scripts would ruin the flow of the post, I have created a GitHub repo where all the scripts reside [zip].


Now that API access is set up on the one side, we should make the connection from R. You should already know that RCurl is a bit tricky, as I have outlined in the A tiny RCurl headache note. The solution proposed there is applied here as well. Note that this issue will be solved in the next release of rga. On the other hand, RGoogleAnalytics seems to be already on the spot. Have in mind that using

ssl.verifypeer = FALSE

isn’t the most secure way to handle network communication in R. You can use the following to create a connection to the API [rga_initiate_API_connection.R]. This is heavily copy-pasted from Randy Zwitch’s (not provided): Using R and the Google Analytics API post.
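The full script is in the repo, but the gist of the connection looks roughly like this (the client id and secret placeholders stand in for the credentials you created in the API Console; this is a sketch, not the repo script itself):

```r
library(RCurl)
library(rga)

# Workaround from the "A tiny RCurl headache" note -- convenient, but as said
# above, disabling peer verification is not the most secure option
options(RCurlOptions = list(ssl.verifypeer = FALSE))

# Opens (or restores) an authenticated instance named "ga".
# The id/secret below are placeholders for your API Console credentials.
rga.open(instance = "ga",
         client.id = "xxxxxxxx.apps.googleusercontent.com",
         client.secret = "xxxxxxxx")
```

On first run this sends you to the browser to grant access; afterwards the `ga` instance is what you query against.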

One issue is how to get the profile IDs. The hard way would be to go to the Query Explorer, cycle through all profiles and write down the IDs you are interested in. However, you are in luck, as there is a function that returns all the profiles accessible to the account tied to the API access you created. (By the way, it is excellent that the rga package provides access to the Management API.)
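With an open `ga` instance, the Management API helper in rga does the cycling for you (a sketch, since it needs live credentials to run; the column selection is illustrative):

```r
# Lists every profile (view) the authenticated account can reach
profiles <- ga$getProfiles()

# Keep the ids you care about -- e.g. all of them:
ids <- profiles$id
head(profiles)
```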

In what follows I will assume that you have defined the ids you are interested in.

The main hypothesis that I want to get a taste of is whether the different post categories (e.g. measure, statistics, music, etc.) have different load times. This will be interesting given that not all categories carry the same burden to get loaded (images vary, YouTube videos, scripts). To achieve this you will need to use a filters vector and loop over it. Give appropriate names to the vector elements and you will be done. Note that we have created extra metrics:

  • e-commerce rate : this is not meaningful in the case of a blog, but if you are advanced in analytics you might have implemented goals as e-commerce events, as B. Clifton suggests.
  • bounce rate : the bounce rate should be correlated with the page load time.
  • buckets of page load time : we use a 4-second range for each bucket to be consistent with the Apdex standard.
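The bucketing in the last bullet is just arithmetic on the load-time column. A minimal sketch (the helper name and the sample values are mine, not from the repo scripts):

```r
# Hypothetical helper: map an average page load time in seconds to a
# 4-second Apdex-style bucket label, e.g. 13.5 -> "12-16"
load_bucket <- function(secs, width = 4) {
  lo <- floor(secs / width) * width
  paste(lo, lo + width, sep = "-")
}

load_bucket(c(3.1, 7.9, 13.5))
```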

Because I want more metrics than a single query allows (11), I use another query in the loop to get the rest and then merge them. Now, if you run all these scripts you will end up with a data frame like the one extracted at the end of the script using the head() function.
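The merge step is plain base R. Here is a toy version with made-up stand-ins for the two query results (in the real scripts these frames come back from the API calls, and the column names are illustrative):

```r
# Toy stand-ins for the two query results, keyed on date
q1 <- data.frame(date = as.Date("2013-04-01") + 0:2,
                 pageviews = c(120, 98, 143),
                 avgPageLoadTime = c(5.2, 12.8, 6.1))
q2 <- data.frame(date = as.Date("2013-04-01") + 0:2,
                 entranceBounceRate = c(61.2, 70.5, 58.9))

# Combine the two result sets into one frame on the shared date key
final_dataset <- merge(q1, q2, by = "date")
head(final_dataset)
```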

Enough with scripting!

Now that the data are in our console, we can finally get some graphs. The following histogram is the aggregated page load speed histogram of this blog. You should note that a significant volume of sample units belongs to the 12-16 second bucket. I have the suspicion that they also belong to a specific country group, as the host provides good page load timings in the US and Western Europe. (Note to self: I should add the ga:country dimension to the second query run.)
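A histogram like that takes only a few lines of ggplot2. Since the real data frame lives in the repo scripts, the sketch below simulates some load times just to show the shape of the call:

```r
library(ggplot2)

# Simulated average page load times (seconds), standing in for the
# avgPageLoadTime column of the merged data frame
set.seed(42)
loads <- data.frame(avgPageLoadTime = rgamma(500, shape = 2, scale = 4))

# 4-second bins to match the Apdex-style buckets used above
p <- ggplot(loads, aes(x = avgPageLoadTime)) +
  geom_histogram(binwidth = 4) +
  labs(x = "Avg. page load time (sec)", y = "Pages")
p
```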


OK, this is not a nice picture at all! I know that I have experimented with various analytics scripts over the last months, plus in the first 3 months of 2013 I was using a significantly heavier WordPress theme, but I still think the sample is skewed by the geographic distribution of the readers (a new post will come soon on this!).


Extend the script to your needs

In a modification of the script above, I can loop over the web properties I have access to, so I use R to store data and create a roll-up report in a fast way. If you look at the comments section of the scripts you will notice the following.

# In the future we should only get data for increment dates. Don't we?
<- min(final_dataset$date)

I use this to incrementally query and store data in the final_dataset data frame (this helps with the sampling that I would run into the first time I run the script over a long period). I am pretty sure a cron job could be streamlined here; however, I have no idea about cron jobs…
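The incremental idea can be sketched like this: on each run, look at the dates already stored and only request what comes after them. Everything below (the stored frame, its columns, the commented query) is illustrative of the approach, not the repo script:

```r
# Pretend this is what a previous run already stored
final_dataset <- data.frame(date = as.Date("2013-04-01") + 0:9,
                            pageviews = 100 + 0:9)

# First date not yet stored -- use it as the start of the next query
next_start <- max(final_dataset$date) + 1

# Then append only the new rows (sketched; needs a live 'ga' instance):
# new_rows <- ga$getData(id, start.date = next_start, end.date = Sys.Date(), ...)
# final_dataset <- rbind(final_dataset, new_rows)
next_start
```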

Head now to !

