Some years ago, Google discovered that when people are concerned about influenza, they search for flu-related information, and that search traffic is, to some extent, an indicator of flu activity. Google Flu Trends was born.
Illness is sweeping through our department this week and I have succumbed. It’s not flu but at one point, I did wonder if my symptoms were those of bronchitis. Remembering Google Flu Trends, I thought I’d try my query for “bronchitis” at Google Trends, where I saw the chart shown at right.
Interesting. The pattern is clearly seasonal, peaking around the last and first months of each year: winter, for those of you in the northern hemisphere.
To compare with the southern hemisphere, go back to Google Trends and:
- select USA and Australia as regions
- download the data in CSV format (I chose fixed scaling), rename files “us.csv” and “aus.csv”
- edit the files a little to retain only the “Week, bronchitis, bronchitis (std error)” section
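If you would rather not trim the files by hand, the last step could be done in R. This is only a minimal sketch: `extract_weekly` is a hypothetical helper, and it assumes the weekly block starts at a header line beginning with "Week," and ends at the next blank line. That layout is an assumption, so check it against your own download. The input lines here are made up for illustration.

```r
# Hypothetical helper: keep only the weekly block of a raw Trends export.
# ASSUMPTION: the block starts at a header line beginning with "Week,"
# and ends at the first blank line after it.
extract_weekly <- function(lines) {
  start <- grep("^Week,", lines)[1]
  if (is.na(start)) stop("no 'Week,' header line found")
  blank <- which(trimws(lines) == "")
  end <- blank[blank > start][1]
  # no blank line after the header: read to the end of the file
  if (is.na(end)) end <- length(lines) + 1
  read.csv(text = paste(lines[start:(end - 1)], collapse = "\n"),
           stringsAsFactors = FALSE)
}

# Made-up example input, not real Trends data
raw <- c("Web Search Interest: bronchitis",
         "",
         "Week,bronchitis,bronchitis (std error)",
         "Jan 4 2004,1.10,5%",
         "Jan 11 2004,1.05,5%",
         "",
         "Region,bronchitis")
weekly <- extract_weekly(raw)
```

For a real file you would pass in something like `readLines("us_raw.csv")`, where the file name is again just a placeholder.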
Fire up your R console and try this:
library(ggplot2)

# read the trimmed CSV files, keeping Week as text
us  <- read.table("us.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)
aus <- read.table("aus.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

# add a region column
us$region  <- "usa"
aus$region <- "aus"

# combine data
alldata <- rbind(us, aus)

# add a date column (as.Date rather than strptime, so the column is a plain Date)
alldata$week <- as.Date(as.character(alldata$Week), format = "%b %d %Y")

# and plot the non-zero values
ggplot(alldata[alldata$bronchitis > 0, ], aes(week, bronchitis)) +
  geom_line(aes(color = region)) +
  xlab("Date")
Result shown at right.
That’s not unexpected, but it’s rather nice. In the USA, peak searches for “bronchitis” coincide with troughs in Australia, and vice versa. The reason, of course, is that searches peak in both regions during winter, but winter in the USA (northern hemisphere) falls during the southern summer (and again, vice versa).
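One way to put a number on "peaks coincide with troughs" is to correlate the two series and look at which month each one peaks in. The sketch below uses synthetic data, not the Trends download: two sine waves with a one-year period, half a year out of phase, stand in for the real series. The `demo` data frame and its column names are made up for illustration.

```r
# Synthetic stand-in for the two weekly series (NOT real Trends data):
# two waves with a one-year period, half a year out of phase.
week  <- seq(as.Date("2004-01-04"), by = "week", length.out = 260)
phase <- 2 * pi * as.numeric(format(week, "%j")) / 365.25
demo  <- data.frame(
  week = week,
  usa  = 1 + 0.5 * cos(phase),  # highest around the turn of the year
  aus  = 1 - 0.5 * cos(phase)   # highest around mid-year
)

# A correlation close to -1 means one series peaks when the other troughs
r <- cor(demo$usa, demo$aus)

# Calendar month with the highest average value in each series
month    <- as.integer(format(demo$week, "%m"))
peak_usa <- as.integer(names(which.max(tapply(demo$usa, month, mean))))
peak_aus <- as.integer(names(which.max(tapply(demo$aus, month, mean))))
```

With the real files, something like `merge(us, aus, by = "Week", suffixes = c(".usa", ".aus"))` should put the two series side by side so the same two checks can be run. Real search data is noisier than a sine wave, so expect a weaker negative correlation.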
There must be all sorts of interesting and potentially useful information buried away in web usage data. I guess that’s why so many companies are investing in it. However, for those of us more interested in analysing data than marketing – what else is “out there”? Can we “do science” with it? How many papers are published using data gathered only from the Web?