Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the whole history of the world there is but one thing that money can not buy… to wit the wag of a dog’s tail (Josh Billings)

In this post I combine several things:

• Simple webscraping to read the list of companion dogs from Wikipedia. I love rvest package to do these things.
• Google Trends queries to download the evolution of searchings of breeds during last 6 months. I use gtrendsR package to do this and works quite well.
• A dinamic Highchart visualization using the awesome highcharter package
• A static ggplot visualization.

The experiment is based on a simple idea: what people search on the Internet is what people do. Can be Google Trends an useful tool to know which breed will become fashionable in the future? To be honest, I don’t really know but I will make my own bet.

What I have done is to extract last 6 months of Google trends of this list of companion breeds. After some simple text mining, I divide the set of names into 5-elements subsets because Google API doesn’t allow searchings with more than 5 items. The result of the query to Google trends is a normalized time series, meaning the 0 – 100 values are relative, not absolute, measures. This is done by taking all of the interest data for your keywords and dividing it by the highest point of interest for that date range. To make all 5-items of results comparable I always include King Charles Spaniel breed in all searchings (as a kind of undercover agent I will use to compare searching levels). The resulting number is my “Level” Y-Axis of the plot. I limit searchings to code=”0-66″ which is restrict results to Animals and pets category. Thanks, Philippe, for your help in this point. I also restrict rearchings To the United States of America.

There are several ways to obtain an aggregated trend indicator of a time series. My choice here was doing a short moving average order=2 to the resulting interest over time obtained from Google. The I divide the weekly variations by the smoothed time series. The trend indicator is the mean of these values. To obtains a robust indicator, I remove outliers of the original time series. This is my X-axis.

This is how dog breeds are arranged with respect my Trend and Level indicators:

Inspired by Gartner’s Hype Cycle of Emerging Technologies I distinguish two sets of dog breeds:

• Plateau of Productivity Breeds (succesful breeds with very high level indicator and possitive trend): Golden Retriever, Pomeranian, Chihuahua, Collie and Shih Tzu.
• Innovation Trigger Breeds (promising dog breeds with very high trend indicator and low level): Mexican Hairless Dog, Keeshond, West Highland White Terrier and German Spitz.

I discovered recently a wonderful package called highcharter which allows you to create incredibly cool dynamic visualizations. I love it and I could not resist to use it to do the previous plot with the look and feel of The Economist. This is an screenshot (reproduce it to play with tits interactivity):

And here comes my prediction. After analyzing the set Innovation Trigger Breeds, my bet is Keeshond will increase its popularity in the nearly future: don’t you think it is lovely?

Photo by Terri BrownFlickr: IMG_4723, CC BY 2.0

Here you have the code:

library(gtrendsR)
library(rvest)
library(dplyr)
library(stringr)
library(forecast)
library(outliers)
library(highcharter)
library(ggplot2)
library(scales)

#Webscraping
x="https://en.wikipedia.org/wiki/Companion_dog"
html_nodes("ul:nth-child(19)") %>%
html_text() %>%
strsplit(., "n") %>%
unlist() -> breeds

#Some simple cleansing
breeds=iconv(breeds[breeds!= ""], "UTF-8")

gconnect(usr, psw)

#Reference (undercover agent)
ref="King Charles Spaniel"

#Remove the undercover agent from the set of breeds
breeds=setdiff(breeds, ref)

#Subsets. Do not worry about warning message
sub.breeds=split(breeds, 1:ceiling(length(breeds)/4))

#Loop to obtain google trends of each 5-items subset
results=list()
for (i in 1:length(sub.breeds))
{
res <- gtrends(unlist(union(ref, sub.breeds[i])),           start_date = Sys.Date()-180,           cat="0-66",           geo="US")   results[[i]]=res } #Loop to obtain trend and level indicator of each breed trends=data.frame(name=character(0), level=numeric(0), trend=numeric(0)) for (i in 1:length(results)) {   df=results[[i]]$trend lr=mean(results[[i]]$trend[,3]/results[[1]]$trend[,3]) for (j in 3:ncol(df)) { s=rm.outlier(df[,j], fill = TRUE) t=mean(diff(ma(s, order=2))/ma(s, order=2), na.rm = T) l=mean(results[[i]]$trend[,j]/lr)     trends=rbind(data.frame(name=colnames(df)[j], level=l, trend=t), trends)   } } #Prepare data for visualization trends %>%
group_by(name) %>%
summarize(level=mean(level), trend=mean(trend*100)) %>%
filter(level>0 & trend > -10 & level<500) %>%
na.omit() %>%
mutate(name=str_replace_all(name, ".US","")) %>%
mutate(name=str_replace_all(name ,"[[:punct:]]"," ")) %>%
rename(
x = trend,
y = level
) -> trends
trends$y=(trends$y/max(trends\$y))*100

#Dinamic chart as The Economist
highchart() %>%
hc_title(text = "The Hype Bubble Map for Dog Breeds") %>%
hc_subtitle(text = "According Last 6 Months of Google Searchings") %>%
hc_xAxis(title = list(text = "Trend"), labels = list(format = "{value}%")) %>%
hc_yAxis(title = list(text = "Level")) %>%
hc_add_series(data = list.parse3(trends), type = "bubble", showInLegend=FALSE, maxSize=40) %>%
hc_tooltip(formatter = JS("function(){
return ('<b>Trend: </b>' + Highcharts.numberFormat(this.x, 2)+'%' + '
<b>Level: </b>' + Highcharts.numberFormat(this.y, 2) + '
<b>Breed: </b>' + this.point.name)
}"))

#Static chart
opts=theme(
panel.background = element_rect(fill="gray98"),
panel.border = element_rect(colour="black", fill=NA),
axis.line = element_line(size = 0.5, colour = "black"),
axis.ticks = element_line(colour="black"),
panel.grid.major = element_line(colour="gray75", linetype = 2),
panel.grid.minor = element_blank(),
axis.text.y = element_text(colour="gray25", size=15),
axis.text.x = element_text(colour="gray25", size=15),
text = element_text(size=20),
legend.key = element_blank(),
legend.position = "none",
legend.background = element_blank(),
plot.title = element_text(size = 30))
ggplot(trends, aes(x=x/100, y=y, label=name), guide=FALSE)+
geom_point(colour="white", fill="darkorchid2", shape=21, alpha=.3, size=9)+
scale_size_continuous(range=c(2,40))+
scale_x_continuous(limits=c(-.02,.02), labels = percent)+
scale_y_continuous(limits=c(0,100))+
labs(title="The Hype Bubble Map for Dog Breeds",
x="Trend",
y="Level")+
geom_text(data=subset(trends, x> .2 & y > 50), size=4, colour="gray25")+
geom_text(data=subset(trends, x > .7), size=4, colour="gray25")+opts