Google Correlate Certainly Does Not Imply Causation

July 6, 2011

(This article was first published on Speaking Statistically, and kindly contributed to R-bloggers)

I recently heard about a new tool called Google Correlate that helps one find Google search patterns that correspond to (i.e. correlate with) real-world trends.

In case that isn't clear: you type in a search term, and the tool finds other Google searches that exhibit the same search pattern over time. You can even enter your own time series data to see whether that data is correlated with the search activity of some Google search terms. As an example of the former use of the tool, I typed in "amazon" and one of the "similar" search terms that Google found was "USPS Lost Package."

Google Correlate Results for "Amazon" Suggest that "USPS Lost Package" is often Searched By Google Users Around the Same Time as "Amazon."

This suggests that "amazon" is often searched at the same time during the year as "USPS lost package." This makes sense to some degree; people tend to shop at (and probably search for) Amazon more frequently during the holiday season; USPS is likely to be busier at those times, leading to more lost packages and thus more people searching for lost packages.

But Correlation Does Not Imply Causation!

While I think that Google Correlate could be very useful, anyone using it has to be very careful.

Why?

When you have hundreds of billions of searches to look through, you're likely to find at least one search result that correlates well with your data.

As an example, I tried looking for correlations with "hobbits" and found that "OK Go lyrics" is well correlated with it. I can't think of any explanation for that one. (If you can, I'd be very interested to hear it because I enjoy both OK Go and Lord of the Rings.)

Google Correlate Results for "Hobbits" Suggest that "OK Go Lyrics" is often Searched By Google Users Around the Same Time as "Hobbits."

Let's Take This Further


To demonstrate the idea that if you have billions of possible searches, at least a few are going to be well correlated, I decided to generate random data in a pattern called a random walk. I then uploaded these random walk time series into Google Correlate to see if it could find search terms correlated with the data.

My methodology was as follows:


First, I generated 30 random walks. At any time t, the value of the time series x equals its value at the previous time plus a normally distributed random number with mean 0 and standard deviation 0.1:

$$x_t = x_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0,\ 0.1^2)$$

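The recurrence described above can be sketched in a few lines of Python with NumPy. This is only an illustration, not the code used for the post: the function name `random_walks`, the series length of 400 points, the seed, and the choice to start each walk from 0 are all assumptions, since the post does not state them.

```python
import numpy as np

def random_walks(n_walks, n_steps, step_sd=0.1, seed=None):
    """Return an (n_walks, n_steps) array; each row is one random walk."""
    rng = np.random.default_rng(seed)
    # Independent normal increments with mean 0 and standard deviation step_sd.
    steps = rng.normal(loc=0.0, scale=step_sd, size=(n_walks, n_steps))
    # x_t = x_{t-1} + step_t, so each walk is the running sum of its steps
    # (the walk implicitly starts from 0, which is an assumption).
    return np.cumsum(steps, axis=1)

# 30 walks as in the post; the length of 400 points is an assumed value.
walks = random_walks(n_walks=30, n_steps=400, seed=0)
```

Note that the mean of 0 and standard deviation of 0.1 describe the individual steps. The walks themselves drift, and their spread grows with time.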
Here's what they look like:

A Visualization of the 30 Random Walks Generated. The Steps of Each Random Walk have Mean 0 and Std. Dev. 0.1.

Next, I uploaded the time series to Google Correlate and asked it to find Google searches that were correlated with these random walks. For each time series, Google generally returns between 10 and 20 searches that it considers correlated. For each time series I uploaded, I computed the correlation between that series and every search that Google reported as correlated with it.
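Computing the correlation between one uploaded series and each returned search series is a small calculation. The sketch below shows one way to do it in Python. The helper name `correlations_with_target` is hypothetical, and the numbers are placeholder values for illustration, not real Google Correlate data.

```python
import numpy as np

def correlations_with_target(target, candidates):
    """Pearson correlation between `target` and each row of `candidates`."""
    target = np.asarray(target, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    # Center every series, then divide the dot products by the norms.
    t = target - target.mean()
    c = candidates - candidates.mean(axis=1, keepdims=True)
    return (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))

# Placeholder data: one "uploaded" series and three "returned" series.
uploaded = [1.0, 2.0, 3.0, 4.0, 5.0]
returned = [
    [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated
    [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated
    [1.0, 3.0, 2.0, 5.0, 4.0],   # partially correlated
]
r = correlations_with_target(uploaded, returned)
```

Repeating this for every uploaded random walk and pooling the resulting coefficients gives the distribution discussed in the results.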

Finally, I generated 10,000 additional random walks and computed the pairwise correlations between them (i.e. take one random walk, compute its correlation with every other random walk, then take the next random walk and do the same, and so on). This provides a distribution of the correlation coefficients that one would expect between random walks.
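A rough Python sketch of this pairwise step follows. It uses 1,000 walks rather than 10,000 to keep the correlation matrix small, and the series length of 400 and the seed are assumptions; none of the variable names come from the original analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n_walks, n_steps = 1000, 400  # fewer walks than the post; length is assumed

# Each row is a random walk: the cumulative sum of N(0, 0.1^2) steps.
walks = np.cumsum(rng.normal(0.0, 0.1, size=(n_walks, n_steps)), axis=1)

# Full correlation matrix between all rows.
corr = np.corrcoef(walks)

# Keep each unordered pair once and drop the diagonal of self-correlations.
upper = np.triu_indices(n_walks, k=1)
pairwise = corr[upper]

# Share of pairs whose correlation exceeds 0.6 in absolute value.
frac_strong = np.mean(np.abs(pairwise) > 0.6)
```

A histogram of `pairwise` (for example with `np.histogram(pairwise, bins=50)`) approximates the distribution of correlations between unrelated random walks, and `frac_strong` shows how often purely random pairs clear the 0.6 mark.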

Results

For each of the random walks I uploaded, Google found Google searches whose search activity was correlated with the random walk. To get an idea of how strong this correlation is, let's compare the correlation coefficients that one would expect between random walks with a histogram of the correlation coefficients that Google Correlate found.

[Update @ July 6th, 2011, 2:21AM EDT:] The code used to plot the histogram had an error. The old plot contained a histogram of correlations between random noise with mean = 0 and std. dev. = 0.1 NOT of a random walk as described earlier. The plot here is the corrected version of the original plot. Also, for clarity, I decided to show only the distribution of correlation coefficients from Google Searches.



Notice how the correlation coefficients that Google Correlate gives us are all greater than 0.6 with many coefficients much greater than that. This gives us two pieces of information:
  1. Google shows only correlations above 0.6. (If a lot of searches for any particular time series have a correlation above 0.6, Google only shows the top 90).
  2. Google Correlate tends to find strong correlations with anything, even random, made-up data.
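
The second point can be illustrated without any search data at all. The sketch below, in Python, takes one random walk as the "uploaded" series and looks for its best match in pools of unrelated random walks of increasing size. The pool sizes, the series length of 400, and the seed are arbitrary assumptions, and simulated walks are only a stand-in for real search series.

```python
import numpy as np

rng = np.random.default_rng(42)
n_steps = 400  # assumed series length

def make_walks(n):
    """n independent random walks with N(0, 0.1^2) steps, one per row."""
    return np.cumsum(rng.normal(0.0, 0.1, size=(n, n_steps)), axis=1)

target = make_walks(1)[0]  # plays the role of the uploaded series
pool = make_walks(5000)    # plays the role of the candidate searches

# Pearson correlation of the target with every candidate in the pool.
t = target - target.mean()
c = pool - pool.mean(axis=1, keepdims=True)
r = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))

# Best match found when only the first n candidates are searched.
best = {n: r[:n].max() for n in (10, 100, 1000, 5000)}
```

Because each larger pool contains the smaller ones, the best correlation in `best` can only stay the same or rise as more candidates are searched, which is the same effect at work when a tool searches billions of real queries.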

Conclusions

  • It will be very hard for people to use Google Correlate effectively. If random data is so well correlated with search results, it will be hard for anyone to claim that the results they find are real and not just a fluke.

That said, I do still think the tool will be useful. Indeed, a friend and I wrote a brief report for Prof. Steele's class on using appropriate Google searches to predict the number of weekly refinance applications and the rate of prepayment in mortgage-backed securities. (On a side note, the Google whitepaper on Google Correlate independently tested and verified some of our results.)
