Popular Baby Names Walk-Through Part 2 – Graphing the fast movers

November 21, 2011
By

(This article was first published on Command-Line Worldview, and kindly contributed to R-bloggers)

I will assume you have read through part 1 and have the csv file loaded. While we covered some basic graphing in the last post i hope to get into a little more of the data crunching. Specifically I am interested in the names which where driven by a specific cultural phenomena. There are clearly decade long trends where “Beth” is fading in popularity, but you can not point to a specific event or figure in the public consciousness that is driving this. Enough talk let me show you what I mean. Here is the chart for the female name Trinity. Hat Tip to the google dev python people for pointing this out.

nm<-"Trinity"
p<-ggplot(names,aes(x=Year,y=Rank)) 
p<- p + ylim(max(names$Rank),min(names$Rank)) 
p<- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p<- p + opts(title = "People Liked the Matrix Way Too Much")
matrix.label<-data.frame(Year = 1985 , Rank = 220, Text = "Matrix Released - 1999") # create the custom on graphic text label
p <- p +  geom_rect(aes(xmin = 1998 , xmax = 2000 , ymin = 1000 , ymax = 1 ),fill = "Green", alpha = .002)
p <- p + geom_text(data = matrix.label, aes(label = Text))
p

You can see here that some people loved the Matrix enough to brand their child for life. As you can see this rapid rise in popularity in a one or two year period can more easily be attributed to a particular social phenomena. Let take a first stab at this. What are the female names that have the greatest change in our data.

#This would be more elequently done with Hadley Wickam's ddply. The lapply/do.call("rbind") combo is brillinatly useful and for simple things I use
name.min.max<-function(nm){
data.frame(
  name = nm,
  min = min(names[which(names$Female == nm),2]),
  max = max(names[which(names$Female == nm),2]),
  dif = max(names[which(names$Female == nm),2]) - min(names[which(names$Female == nm),2])
  )
}
name.list<-unique(names$Female)
out<-lapply(X = as.list(name.list), FUN = name.min.max)
out <- do.call("rbind", out)
female.dif<-out[order(-out$dif),]
#Top Female Names with greatest change overtime # Need to seperate the winners and losser just plots largest difference
nm<-as.character(female.dif[1:10,1])

What we have done here is built a simple name.min.max function that returns a data frame of for a female name passed in. the lapply() function iterates over a list of distinct Female Names in name.list and returns a list of data frames(one fr each name). do.call(“rbind”,out) stitches them all back together. Then we order by the greatest difference a grab the top ten.

> female.dif[1:10,]
name min max dif 469 Ava 4 999 995 1232 Samantha 3 998 995 915 Tammy 8 994 986 59 Debra 2 982 980 29 Cheryl 13 992 979 1257 Kayla 11 988 977 38 Kathy 14 990 976
 22 Janice 21 996 975 738 Lily 18 993 975 1844 Taylor 6 978 972

Here is the result lets graph it with a facet for each name.

p<-ggplot(names,aes(x=Year,y=Rank)) 
p <- p + ylim(max(names$Rank),min(names$Rank)) 
p <- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1)
p <- p + opts(title = "Top Movers")
p <- p + facet_wrap(~Female)
p

We definitely found names that changed over time but we have names that are rising over time and falling. Also our new name.min.max function doesn’t give us any insight about rate of change. Is it a slow change like Lily or a dramatic rise like Kayla. As we said before names with a dramatic rise over a short period are more likely associated with something in the pop culture of the time.

What we are most interested is the names that jumped the most over a one year period. I tried a couple different techniques to get this and im going to walk through what i think to be the best. I would love to see other people’s approach. To calculated this we are going to use the function Delt() from the package quantmod. Quantmod is a package for testing financial models. If you actually want to make god money with R, stop reading about baby names and go build a trading platform. Delt is used to calculate change in stock prices over time. i.e. 2% up since yesterday. So load up quantmod and see what happens.

library(quantmod)
female.names<-names[,c(1,4,2)]
female.names<-female.names[order(female.names$Female, female.names$Year),]
female.names$delta <- Delt(female.names$Rank)
female.names[1540:1580,]      

      Year Female Rank Delt.1.arithmetic
44018 1994 Alexis   18      -0.333333333
45014 1995 Alexis   14      -0.222222222
46008 1996 Alexis    8      -0.428571429
47008 1997 Alexis    8       0.000000000
48006 1998 Alexis    6      -0.250000000
49003 1999 Alexis    3      -0.500000000
50006 2000 Alexis    6       1.000000000
51005 2001 Alexis    5      -0.166666667
52005 2002 Alexis    5       0.000000000
53007 2003 Alexis    7       0.400000000
54011 2004 Alexis   11       0.571428571
55013 2005 Alexis   13       0.181818182
56014 2006 Alexis   14       0.076923077
57019 2007 Alexis   19       0.357142857
58015 2008 Alexis   15      -0.210526316
59013 2009 Alexis   13      -0.133333333
41952 1991 Alexus  952      72.230769231
42616 1992 Alexus  616      -0.352941176
43354 1993 Alexus  354      -0.425324675
44250 1994 Alexus  250      -0.293785311
45234 1995 Alexus  234      -0.064000000
46178 1996 Alexus  178      -0.239316239
47182 1997 Alexus  182       0.022471910
48184 1998 Alexus  184       0.010989011
49228 1999 Alexus  228       0.239130435
50252 2000 Alexus  252       0.105263158
51281 2001 Alexus  281       0.115079365
52317 2002 Alexus  317       0.128113879
53370 2003 Alexus  370       0.167192429
54425 2004 Alexus  425       0.148648649
55529 2005 Alexus  529       0.244705882
56547 2006 Alexus  547       0.034026465
57627 2007 Alexus  627       0.146252285
58719 2008 Alexus  719       0.146730463
59895 2009 Alexus  895       0.244784423
48992 1998 Alexys  992       0.108379888
49821 1999 Alexys  821      -0.172379032
50842 2000 Alexys  842       0.025578563
51837 2001 Alexys  837      -0.005938242
52909 2002 Alexys  909       0.086021505
53958 2003 Alexys  958       0.053905391

We subsetted the female names into their own object, ordered by name and year(this is important when calculating a rolling change). We opened up a slice of the new object and there are some problems. First is Delt() function is returning arithmetic differences(that is the difference in the two observation divided by the first). This would be a great help if the numbers it was calculating were truly continuous. Remember, these are ordinal rankings.

1998 Alexis    6      -0.250000000
1999 Alexis    3      -0.500000000

This change in Alexis ranking between 1998 and 1999 is being a calculated as a 50% increase. This isn’t strictly meaningful because what we want to calculate is the absolute change. This brings up my second favorite part of R. If you have a problem with how function is behaving, just go change the function.

we can download the source file for quantmod:

wget http://cran.r-project.org/src/contrib/quantmod_0.3-17.tar.gz

extract the tar file and grep the function you care about:

tar -zxvf quantmod_0.3-17.tar.gz

grep Delt ./quantmod/R/*

We get a R source file named ./quantmod/R/OHLC.transformations.R that contains the function we care about. Here is the original definition:

`Delt` <-
function(x1,x2=NULL,k=0,type=c('arithmetic','log'))
{
    x1 <- try.xts(x1, error=FALSE)
    type <- match.arg(type[1],c('log','arithmetic'))
    if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length');
    if(is.null(x2)){
        x2 <- x1 #copy for same symbol deltas
        if(length(k) < 2) {
            k <- max(1,k)
        }
    }
    dim(x2) <- NULL  # allow for multiple k matrix math to happen
    if(type=='log') {
        xx <- lapply(k, function(K.) {
                log(unclass(x2)/Lag(x1,K.))
              })
    } else {
        xx <- lapply(k, function(K.) {
                unclass(x2)/Lag(x1,K.)-1
              })
    }
    xx <- do.call("cbind", xx)
    colnames(xx) <- paste("Delt",k,type,sep=".")
    reclass(xx,x1)
}

There is a default type parameter that if not set will apply to the row.

unclass(x2)/Lag(x1,K.)-1

This is how the change is normalize. what we want is:

unclass(x2) - Lag(x1,K.)  

What we can do is add another type option(absolute) and add an else if to handle it. We can run this to define a new function and apply it to the data:

`Delt.Absolute` <-
function(x1,x2=NULL,k=0,type=c('arithmetic','log'))
{
    x1 <- try.xts(x1, error=FALSE)
    type <- match.arg(type[1],c('log','arithmetic', 'absolute'))
    if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length');
    if(is.null(x2)){
        x2 <- x1 #copy for same symbol deltas
        if(length(k) < 2) {
            k <- max(1,k)
        }
    }
    dim(x2) <- NULL  # allow for multiple k matrix math to happen
    if(type=='log') {
        xx <- lapply(k, function(K.) {
                log(unclass(x2)/Lag(x1,K.))
              })
    } else if (type=='absolute') {
        xx <- lapply(k, function(K.) {
                unclass(x2) - Lag(x1,K.)  
              })
    } else {
       xx <- lapply(k, function(K.) {
                unclass(x2)/Lag(x1,K.)-1
              })
 
    }
    xx <- do.call("cbind", xx)
    colnames(xx) <- paste("Delt",k,type,sep=".")
    reclass(xx,x1)
}
 
female.names<-names[,c(1,4,2)]
female.names<-female.names[order(female.names$Female, female.names$Year),]
female.names$delta <- Delt.Absolute(female.names$Rank,k=1, type = "absolute")
female.names[20:30,]
      Year  Female Rank Delt.1.absolute
28913 1978   Aaron  913             -33
29990 1979   Aaron  990              77
30883 1980   Aaron  883            -107
31956 1981   Aaron  956              73
32969 1982   Aaron  969              13
33881 1983   Aaron  881             -88
49978 1999 Abagail  978              97
50954 2000 Abagail  954             -24
51920 2001 Abagail  920             -34
52882 2002 Abagail  882             -38
53868 2003 Abagail  868             -14

now the function returns an absolute change. We still have an issue. look at the difference calculated between Aaron and Abagail. It is calculating between names, not good. To solve im going to use a trick I read on SO.

female.names[c(TRUE, female.names$Female[-1] != female.names$Female[-length(female.names$Female)]), 4] <- NA
female.names<-female.names[order( female.names[4]),]
 

This elegant line of code take the first instance of all the names in our data frame and replaces the Delt col with a NA(R’s missing data handler). There is some tweaking if you want to use it on numeric data but it is a clever piece of code. After we reorder and extract the top ten names with greatest one year positive change(which is actually a negative number from our Delt,Absolute function, think a change in Rank 10 -> 2 is actually -8 change). Lets plot these top ten.

female.names.winners<-female.names[order(female.names[4]),]    
top.female.names<-unique(female.names.winners[1:12,2])
p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank)) + geom_line(data = names[which(names$Female %in% top.female.names),], aes(group=Female, colour = Female), alpha = 1, size = 1)+ opts(title = "Female Baby Name Popularity Since 1950")+ opts(axis.text.x=theme_text(angle=-70),hjust=0) + facet_wrap(~ Female)
p

As you can see there are some names that flash up but do not stick around. However Some names shoot up and stayed popular. Take Desiree for example. Lets take a closer look.

p<-ggplot(names,aes(x=Year,y=Rank)) 
p <- p + ylim(max(names$Rank),min(names$Rank)) # Flip the Y-Axis
nm <- "Desiree" # enter a female name
p <- p  + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2) + opts(title = nm)
p

Now I have never seen this movie but clearly Brando has an effect. Desiree (1954)

I hope you found some of this useful. There are some fun association that can be found. Have fun. Next Up fun with Baby Names Part 3: Biggest Losser’s

To leave a comment for the author, please follow the link and comment on his blog: Command-Line Worldview.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , , , ,

Comments are closed.