Popular Baby Names Walk-Through Part 1 – Web Scraping and ggplotting

Posted on November 20, 2011 by Command-Line Worldview in R bloggers | 0 Comments

[This article was first published on Command-Line Worldview, and kindly contributed to R-bloggers].

This is the first walk-through I have posted. Reading these types of posts has been incredibly helpful as I have been learning R and other useful tools in the Unix universe. Hopefully you find it helpful.

First, some background. I have been watching the Google Python videos for the last couple of days, and they have a coding assignment that uses the Social Security Administration's baby-name data. Not having the downloads for the course, I thought it would be a good Python exercise to try to get the same data myself. So my interest in baby names has nothing to do with any impending life decision (or any recent drunken decisions). You can get the Python script and the R code we are going to use here. The CSV file we are going to use can also be downloaded here. If you are interested in more baby-name projects and a web scraper written in R/Ruby, check out Hadley Wickham’s project.

Now on to the R.

First we load ggplot2 and the data.

library(ggplot2)

# load the data
names <- read.csv("top_names_since_1950.csv", header = TRUE)
# switch factors to characters (by column name rather than by position)
names$Female <- as.character(names$Female)
names$Male <- as.character(names$Male)

# take a quick look
head(names)
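
Before plotting, it may be worth confirming that the file has the shape the rest of the walk-through relies on. The sketch below is not part of the original script: `check_names_data` is a hypothetical helper, and the column names Year, Rank, Male and Female are taken from the code in this post.

```r
# Hypothetical helper: stop early if a column the plots rely on is missing,
# then report the span of years and ranks and the number of distinct names.
check_names_data <- function(df) {
  needed <- c("Year", "Rank", "Male", "Female")
  missing_cols <- setdiff(needed, colnames(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  list(
    years    = range(df$Year),
    ranks    = range(df$Rank),
    n_male   = length(unique(df$Male)),
    n_female = length(unique(df$Female))
  )
}

# Usage, once the CSV has been read into `names` as above:
# check_names_data(names)
```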

The data runs from 1950 to 2009 and provides the year, the rank, and the male and female name at that rank. Let's take a quick plot.

p<-ggplot(names,aes(x=Year,y=Rank)) +geom_line(aes(group=Female),colour="#431600",alpha=0.1)
p

We have to use a low alpha or the over-plotting gets to be too much. It looks pretty, but there are a couple of issues. First, ggplot’s defaults treat the Rank variable like any other number, so the top-ranked names (rank 1) end up at the bottom of the y-axis. Second, the plot doesn’t give you much insight; it’s a little much. Let's flip the y-axis and focus on a single name.

Here we get to explore one of the best parts of R in general and ggplot in particular: iterative coding. You write something, run it, write some more, run it again. Here we can just add layers to the ggplot object we just created.

p <- p + ylim(max(names$Rank), min(names$Rank)) # flip the y-axis
nm <- "Beth" # enter a female name
# opts(title = ...) has been removed from ggplot2; ggtitle() replaces it
p <- p + geom_line(data = names[which(names$Female == nm), ], aes(group = Female, colour = Female), alpha = 1, size = 2) + ggtitle(nm)
p
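
Passing the limits to ylim() in reverse order works. ggplot2 also provides scale_y_reverse(), which flips the axis without computing the limits by hand. Here is a minimal sketch of the same plot built that way, assuming the same `names` data frame; `plot_female_name` is a hypothetical wrapper, not something from the original script.

```r
library(ggplot2)

# Hypothetical wrapper: every female name as a faint background line,
# one name highlighted in colour, and rank 1 at the top of the y-axis.
plot_female_name <- function(df, nm) {
  ggplot(df, aes(x = Year, y = Rank)) +
    geom_line(aes(group = Female), colour = "#431600", alpha = 0.1) +
    geom_line(data = df[which(df$Female == nm), ],
              aes(group = Female, colour = Female)) +
    scale_y_reverse() +
    ggtitle(nm)
}

# Usage:
# plot_female_name(names, "Beth")
```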

We can see that Beth is not as popular as it once was. We can try the same thing with a group of names. Let's look at some male names together. I have some brothers, so we can turn this into a competitive contest of nomenclature.

###MALE####
nm <- c("Malcolm", "Ethan", "Allen", "John", "Eric")
p <- ggplot(names, aes(x = Year, y = Rank)) + ylim(max(names$Rank), min(names$Rank))
p <- p + geom_line(data = names[which(names$Male %in% nm), ], aes(group = Male, colour = Male), alpha = 1, size = 1)
# ggtitle() replaces opts(title = ...), which has been removed from ggplot2
p <- p + geom_line(aes(group = Male), colour = "#431600", alpha = 0.1) + ggtitle("Male Baby Name Popularity Since 1950")
p

Comparing the relative fortunes of these names is informative, but the chart has two main shortcomings. The first is the background: charting all ~2500 names gives an interesting pattern and texture, but it doesn’t obviously add to the understanding of the underlying data. At best it is an interesting background; at worst it is distracting. I don’t want to sound like a fanboy Tufte evangelist, but this borders on chartjunk. We can print it without the background nonsense.

nm <- c("Malcolm", "Ethan", "John", "Eric")
p <- ggplot(names, aes(x = Year, y = Rank)) + ylim(max(names$Rank), min(names$Rank))
p <- p + geom_line(data = names[which(names$Male %in% nm), ], aes(group = Male, colour = Male), alpha = 1, size = 1)
# ggtitle() replaces opts(title = ...), which has been removed from ggplot2
p <- p + ggtitle("Male Baby Name Popularity Since 1950")
p

This is clearer, but the second shortcoming remains: a legend lookup issue. You have to keep glancing back at the legend to work out which name goes with which line. ggplot’s facet_wrap is great for this.

p <- p + facet_wrap(~Male)
p
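
Faceting is one answer to the legend lookup problem. Another option, shown here only as a rough sketch and not as part of the original script, is to keep everything in one panel and write each name next to the end of its line. `label_line_ends` is a hypothetical helper, and the five years of extra room on the x-axis is an arbitrary allowance for the labels.

```r
library(ggplot2)

# Hypothetical helper: one panel, no legend, each selected male name
# written beside the last point of its own line.
label_line_ends <- function(df, nm) {
  sub <- df[which(df$Male %in% nm), ]
  # last year in which each selected name appears in the data
  last_year <- tapply(sub$Year, as.character(sub$Male), max)
  ends <- sub[which(sub$Year == last_year[as.character(sub$Male)]), ]
  ggplot(sub, aes(x = Year, y = Rank, colour = Male)) +
    geom_line(aes(group = Male)) +
    geom_text(data = ends, aes(label = Male), hjust = 0, nudge_x = 0.5) +
    scale_y_reverse() +                        # rank 1 at the top
    expand_limits(x = max(sub$Year) + 5) +     # room for the labels
    theme(legend.position = "none")
}

# Usage:
# label_line_ends(names, c("Malcolm", "Ethan", "John", "Eric"))
```

With many names, or names that finish at similar ranks, the labels will collide, which is where facet_wrap has the advantage.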

It is clear that the names with the greatest range of ranks are the more interesting subjects. Even though I think the regal name John is a fine subject of study, continued dominance in the name game can get tiresome. Looking into the names with the greatest rise or fall is an interesting exercise. That will be part 2 of this rambling visual diarrhea.
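
As a rough preview of that idea (a sketch only, not the code from part 2), the range of ranks for each name can be computed in base R. `rank_spread` is a hypothetical helper; it assumes the Rank, Male and Female columns used throughout this post.

```r
# Hypothetical helper: for each name in column `col`, the best and worst
# rank it reached and the spread between the two, widest spread first.
rank_spread <- function(df, col = "Male") {
  nm <- as.character(df[[col]])
  best  <- tapply(df$Rank, nm, min)   # numerically smallest rank = most popular
  worst <- tapply(df$Rank, nm, max)
  out <- data.frame(name = names(best),
                    best = as.vector(best),
                    worst = as.vector(worst),
                    stringsAsFactors = FALSE)
  out$spread <- out$worst - out$best
  out <- out[order(-out$spread), ]
  rownames(out) <- NULL
  out
}

# Usage:
# head(rank_spread(names, "Male"))
# head(rank_spread(names, "Female"))
```

One caveat: if the file only holds the top-ranked names for each year (an assumption about the data), a name that drops off the list simply has no rows for those years, so its spread covers only the years in which it charted.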
