Estimate Age from First Name

[This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers].

Today I read a cute post from Flowing Data on the trendiest names in US history. What caught my attention was a link in the article to the source data: yearly lists of baby names registered with the US Social Security Administration since 1880. I thought it might be good to compile and use these lists at work for two reasons:

(1) I don’t have experience handling file input programmatically in R (i.e., working with a bunch of files in a directory instead of manually loading one or two), and
(2) it could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe in which each row contains a name, a year, how many people were registered as born with that name in that year, the total population for that year, and the relative proportion of people with that name in that year.
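A minimal sketch of how such a dataframe could be assembled from a directory of yearly files follows. It assumes the files are named like `yob1880.txt`, have no header row, and hold three comma-separated columns (name, sex, count). The function name `read_name_files` and the column names are hypothetical, and summing the male and female counts for each name is my assumption rather than something the post specifies.

```r
# Sketch: build one long data frame from a directory of yearly name files.
# Assumed layout: files named "yobYYYY.txt", no header, columns name,sex,count.
read_name_files <- function(dir) {
  files <- list.files(dir, pattern = "^yob[0-9]{4}\\.txt$", full.names = TRUE)
  yearly <- lapply(files, function(f) {
    # colClasses keeps a sex column of "F" from being read as logical FALSE
    d <- read.csv(f, header = FALSE,
                  col.names = c("name", "sex", "count"),
                  colClasses = c("character", "character", "integer"))
    # Assumption: combine the male and female rows for each name
    d <- aggregate(count ~ name, data = d, FUN = sum)
    # Pull the four-digit year out of the file name
    d$year <- as.integer(sub("^yob([0-9]{4})\\.txt$", "\\1", basename(f)))
    d$total <- sum(d$count)      # everyone registered in that year's file
    d$prop <- d$count / d$total  # share of that year's registrations
    d
  })
  do.call(rbind, yearly)
}
```

Calling `read_name_files("path/to/names")` on such a directory would return one row per name per year, with `count`, `year`, `total`, and `prop` columns.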

Once I had that dataframe, I built a function that queries it for the year when a given name was most popular, an estimated age based on that year, and the relative proportion of people born with that name in that year.
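The query described above might look something like the sketch below. The function name `estimate_age`, the column names, and the toy numbers are all made up for illustration; the estimated age is simply the reference year minus the year in which the name's proportion peaked.

```r
# Toy data in the shape described above; the numbers are made up.
names_df <- data.frame(
  name = c("Mary", "Mary", "Jennifer", "Jennifer"),
  year = c(1920L, 1975L, 1920L, 1975L),
  prop = c(0.035, 0.004, 0.0001, 0.018),
  stringsAsFactors = FALSE
)

# Return the peak year, an age estimate, and the peak proportion for one name.
estimate_age <- function(first_name, names_df,
                         current_year = as.integer(format(Sys.Date(), "%Y"))) {
  # Case-insensitive match on the name column
  rows <- names_df[tolower(names_df$name) == tolower(first_name), ]
  if (nrow(rows) == 0) {
    # Unknown name: return a row of NAs instead of failing
    return(data.frame(name = first_name, peak_year = NA_integer_,
                      est_age = NA_integer_, prop = NA_real_))
  }
  peak <- rows[which.max(rows$prop), ]  # year with the highest proportion
  data.frame(name = first_name, peak_year = peak$year,
             est_age = current_year - peak$year, prop = peak$prop)
}

estimate_age("jennifer", names_df, current_year = 2020L)
```

Because this filters the whole dataframe on every call, it is fine for one name at a time but slow when repeated over a long vector of names.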

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this up so that age estimates can be calculated for every element of a vector of first names. As the code stands below, the function I made takes too long to be scaled up effectively.

What’s the best approach to take? Some ideas I have so far:

  • Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
  • Possibly convert the above dataframe into a data table and then build a function to query the data table instead.
  • If neither of the above ideas works well, load the popularity data into Python and make a function to query it there instead.
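The first idea can be sketched in base R as follows: collapse the big dataframe to one row per name once, then look up a whole vector of names with `match()`. The function names `build_peak_table` and `estimate_ages`, the column names, and the toy numbers are hypothetical; this is one possible approach, not a benchmarked solution.

```r
# Toy data, made up for illustration
names_df <- data.frame(
  name = c("Mary", "Mary", "Jennifer", "Jennifer"),
  year = c(1920L, 1975L, 1920L, 1975L),
  prop = c(0.035, 0.004, 0.0001, 0.018),
  stringsAsFactors = FALSE
)

# One row per name: the year in which its proportion was highest.
build_peak_table <- function(names_df) {
  # Sort so each name's highest-proportion row comes first, then keep it
  ord <- names_df[order(names_df$name, -names_df$prop), ]
  peaks <- ord[!duplicated(ord$name), c("name", "year", "prop")]
  names(peaks) <- c("name", "peak_year", "peak_prop")
  rownames(peaks) <- NULL
  peaks
}

# Vectorized lookup: one match() call covers the whole vector of names.
estimate_ages <- function(first_names, peaks, current_year) {
  idx <- match(tolower(first_names), tolower(peaks$name))  # NA if not found
  data.frame(name = first_names,
             peak_year = peaks$peak_year[idx],
             est_age = current_year - peaks$peak_year[idx],
             peak_prop = peaks$peak_prop[idx],
             stringsAsFactors = FALSE)
}

peaks <- build_peak_table(names_df)
estimate_ages(c("Mary", "JENNIFER", "Unknown"), peaks, current_year = 2020L)
```

The peak table is built once, so the per-name cost drops from filtering every name-year row to a single index lookup; names that are not in the table come back as `NA`.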

Does anyone have any better ideas for me?

I’ll also accept any suggestions for cleaning up my code as is :)
