Mr. Spearman or how to explore changes in trends
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Last week I was with my good friend Johannes on the phone. He just became a father and called her adorable daughter “Franziska”. It made me think about the meaning of the name… Franziska you could tell from the South of Germany -Bavaria- or even Austria… and a quite traditional name as well.
So names can reveal a lot about the contextual situations of a person or a time frame. Even with the proper data set you could potentially tell which periods of time follow the same naming patterns. I starting searching for baby names lists with a popularity ranking and I came to the Social Security Administration where you can pull male and female names ranked by popularity from the 1880 until last year -BTW, sometimes you cannot reach the site with a non-USA IP-Address, so you might want to have a look at proxy-like solutions like ZenMate or Hola.
Well, I managed to -obviously programmatically :)- download all the years available from the US census. I’m not providing the downloaded data, because I don’t want to potentially infringe any copywriting or whatsoever law, so I just point you guys to the source. The data looked promising but I wasn’t quite sure it would be possible to detect trend changes. My reasoning started like:
- Each year is a sorted list of names.
- The list from a particular year to the next or previous one is not likely to change much.
- Conversely, a “bigger-than-expected” change in the list from a year to the next one might reveal a trend interruption or a trend change
So what I need is a way of measuring how similar two ranked lists are… And good news! this metric exists and is provided by the Spearman’s Correlation Coefficient for Ranks (SCCR) -BTW, the Kendall correlation coefficient does the job as well-.
For each year, I computed the SCCR value with all other years split by gender. I created a scattered plot where the size and the transparency of each point is determined by the SCCR value. Intuitively if we have a look at the diagonal, bigger opaque points together form a cluster where the trend persists and places with almost no color in this diagonal represent trend interruption.
Some interesting findings from this picture:
- The average trend duration is something around 5 years for male and a bit less for female.
- The female names are more likely to change from a year to the next one, as we see less consistency over time.
- In the 19th century, people used to stick more to the baby names -see the big cluster before 1910-
- The male baby names between 1985 and 1997 didn’t vary much -long lasting trend here-
- It’s very unlikely, that years distant in time follow the same pattern -almost not even a single case-
Just filtering out years pairs with SCCR below 0.30 produces a picture where mostly consecutive years stay -diagonal-, while the other non-consecutive ones just disappear -with some punctual exceptions-.
Obviously, the action is in the diagonal! If we just focus on the SCCR(year i, year i+1) and map this value to 0 if it falls under a threshold or to another number otherwise, we can produce a clearer picture of the trend changes. In following picture I’ve played with different thresholds, which allows for understanding trend changes with different resolutions (as you can see the lower we set the threshold, the longer the trends last for).
Obviously, more sophisticated approaches can be taken, such as us the Twitter’s approach for breakout detection or time series change point detection packages (good ones are CPM, BCP or ECP)… but I love simplicity and for this post the threshold mapping does a decent job.
On a separate note, the kind of data we have here allows for running Density Based Cluster Analysis – I particularly like DBSCAN or one of the improved versions of it-. I encourage the reader to get the data I’m providing below and give it a play… Please share your experience if you happen to
The following R code snippet computes the year-to-year SCCR for both gender values. I’ve exported the resulting data as csv your those who want to play with it and were too lazy to download the yearly names rank: year-to-year-sccr
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | # Year over Year Spearman generation all.years<-unique(names.ranking$year) year.sccr<-NULL #names.ranking.male<-subset(names.ranking, names.ranking$gender=='male') for (i in 1:length(all.years)) { origin<-subset(names.ranking, year==all.years[i]) origin<-origin[order(origin$name),] for (j in 1:length(all.years)) { destination<-subset(names.ranking, year==all.years[j]) destination<-destination[order(destination$name),] a.female<-cor(origin[origin$gender=='female',]$rank, destination[destination$gender=='female',]$rank, use="na.or.complete", method="spearman") a.male<-cor(origin[origin$gender=='male',]$rank, destination[destination$gender=='male',]$rank, use="na.or.complete", method="spearman") year.sccr<-rbind(year.sccr, data.frame(sccr=a.female,gender='female', origin=all.years[i], destination=all.years[j])) year.sccr<-rbind(year.sccr, data.frame(sccr=a.male,gender='male', origin=all.years[i], destination=all.years[j])) } } |
The visualization is a pretty straight forward ggplot with some annotations to draw the decades lines. In geom_point we use the SCCR value to determine both the point transparency and size.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 | # creating the annotation for the decades x<-seq(1880,2010,by=10) y<-rep(1880) x.end<-seq(1880,2010,by=10) y.end<-seq(1880,2010,by=10) ann.1<- data.frame(x,x.end,y,y.end) y<-seq(1880,2010,by=10) x<-rep(1880) y.end<-seq(1880,2010,by=10) x.end<-seq(1880,2010,by=10) ann.2<- data.frame(x,x.end,y,y.end) #setting to 0 own correlation values to improve the visualization year.sccr[year.sccr$origin == year.sccr$destination,]$sccr<-0 # to see just the cluster in a crisp way dataset<-subset(year.sccr,sccr>0.1) # for the last 40 years dataset<-subset(year.sccr,sccr>0.1 & origin>1970 & destination > 1970) ggplot(dataset, aes(x=destination, y=origin, group=origin, color=factor(origin))) + geom_point(aes(size=sccr, alpha=sccr))+ theme(legend.position='none') + xlab("")+ylab("") + facet_wrap(~gender, ncol = 2) + theme(strip.text.x = element_text(size = 10, colour = "black")) + theme(axis.text=element_text(size=10)) + theme(axis.title=element_text(size=10,face="bold")) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme(panel.background = element_rect(fill = 'white')) + theme(panel.grid.major = element_line( color="snow2")) + theme(strip.background = element_rect(fill = 'white')) + theme(strip.text = element_text(size=14,face="bold")) + scale_x_continuous(breaks=seq(1880,2010,10)) + scale_y_continuous(breaks=seq(1880,2010,10)) + theme(axis.text = element_text(color = "black", size = 10)) + annotate("segment", x=ann.1$x, xend=ann.1$x.end,y=ann.1$y, yend=ann.1$y.end, alpha = 0.5,colour = "black" , linetype=2) + annotate("segment", x=ann.2$x, xend=ann.2$x.end,y=ann.2$y, yend=ann.2$y.end, alpha = 0.5,colour = "black" , linetype=2) |
For the chart showing the different periods with different thresholds in the diagonal, have a look at:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 | sccr.next.year<-NULL min.year<-min(unique(year.sccr$origin)) max.year<-max(unique(year.sccr$origin)) max.year<-max.year-1 for (i in 1880:max.year) { sccr.m<-year.sccr[year.sccr$origin==i & year.sccr$destination==i+1 & year.sccr$gender=='male',]$sccr sccr.f<-year.sccr[year.sccr$origin==i & year.sccr$destination==i+1 & year.sccr$gender=='female',]$sccr #sccr<-year.sccr[year.sccr$origin==i & year.sccr$destination==i+1,]$sccr sccr.next.year<-rbind(data.frame(year=i,sccr=sccr.m, gender='male'),sccr.next.year) sccr.next.year<-rbind(data.frame(year=i,sccr=sccr.m, gender='female'),sccr.next.year) } sccr.next.year$trend40<-ifelse(all.sccr.next.year$sccr>0.40,10,0) sccr.next.year$trend30<-ifelse(all.sccr.next.year$sccr>0.30,20,0) sccr.next.year$trend20<-ifelse(all.sccr.next.year$sccr>0.20,30,0) sccr.next.year$trend10<-ifelse(all.sccr.next.year$sccr>0.10,40,0) melt.sccr.next.year<-melt(sccr.next.year,id.vars = c("year","gender","sccr")) ggplot(melt.sccr.next.year, aes(x=year, y=value, group=variable, color=factor(year))) + geom_step()+ theme(legend.position='none') + facet_wrap(~gender)+ xlab("")+ylab("") + theme(strip.text.x = element_text(size = 10, colour = "black")) + theme(axis.text=element_text(size=10)) + theme(axis.title=element_text(size=10,face="bold")) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + theme(panel.background = element_rect(fill = 'white')) + theme(panel.grid.major = element_line( color="snow2")) + theme(strip.background = element_rect(fill = 'white')) + theme(strip.text = element_text(size=14,face="bold")) + scale_x_continuous(breaks=seq(1880,2010,10)) + scale_y_continuous(breaks=seq(1880,2010,10)) + theme(axis.text = element_text(color = "black", size = 10)) + annotate("text", x = 1895,y = 41, label="sccr>.10",size=3.5) + annotate("text", x = 1895,y = 31, label="sccr>.20",size=3.5) + annotate("text", x = 1895,y = 21, label="sccr>.30",size=3.5) + annotate("text", x = 1895,y = 11, label="sccr>.40",size=3.5) |
I hope you enjoy the post and take away the value of Spearman to measure ranked lists changes over time and to detect trends and patterns. It would be great repeating the analysis with the German naming lists, but unfortunately there’s not enough history in the freely available sources. So I’m not going to be able to tell my friend much about the name “Franziska”, but I will keep searching for the right source!
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.