Names in the U.S., from James Smith to Jose Rodriguez

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Two weeks ago, @mona published an interesting post on her blog, about a difficult question, What’s The Most Common Name In America? There were stats about first names, in the U.S., and last names, too. Those informations are – somehow – easy to get. But usually, it is more complicated to get the first and the last name together. For confidentiality issues ! Datasets – the ones I deal with – are supposed to be anonymized, so I never see the first and the last names.  In a previous post, a few years ago, I did mention the so-called Social Security Death Master File. In that file, we have Social Security numbers, with the date of birth, the date of death as well as the first and the last name. So I did use those files to get stats about the first and the last names of American citizens. Of course, it is very restrictive. I have only U.S. citizens that have a Social Security number (which is not compulsary in the U.S. as far as I understood) and who passed away (as mentioned in the name of the dataset: the death master file). Another great thing about that dataset is that I have the date of birth, so I can look at some cohort effect (see opendata.stackexchange for an interesting discussion on that dataset).

This dataset is quite big. Overall, it is a 8 Gigabyte file, ziped in three documents. My first idea was to work directly on the zipped file, and to scan it, piece by piece…

temp <- tempfile()
download.file(paste("http://ssdmf.info/raw_data/ssdm",1,".zip",sep=""),temp)
list.files <- unzip(temp,list=TRUE) 
s <- 1
data <- scan(unz(temp,list.files$Name[1]),
what="character",sep=",",skip=(s-1)*1e5,n=1e5)

(and to loop on the last part) but it was extremely slow... after 4 days, I was still scanning the files. So I asked @3wen to help me. And using nice functions from several packages, it ran much faster,

library(stringr)
library(lubridate)
library(plyr)
library(data.table)
library(LaF)

The first step was to download the files, and to extract them. Here, you need some space, like 8 Gb. To extract the appropriate information from that text file, use

cols <- c(1,9,20,4,15,15,1,2,2,4,2,2,4,2,5,5,7)
noms_col <- c("code", "ssn", "last_name", "name_suffix", "first_name", "middle_name", "VorPCode", "date_death_m", "date_death_d", "date_death_y", "date_birth_m", "date_birth_d", "date_birth_y", "state", "zip_resid", "zip_payment", "blanks")

Then, we will count many things,

cpttf_t <- NULL
cpttl_t <- NULL
cpttot_t <- NULL
cpttf_c_t <- NULL
cpttl_c_t <- NULL
cpttot_c_t <- NULL

We need those empty object to start our code,

temp ="/user/arthur/ssm/ssdm1"
ssn <- laf_open_fwf(temp, column_widths = cols, 
    column_types=rep("character", length(cols)),
    column_names = noms_col,
    trim = TRUE)

go_through <- seq(1, nrow(ssn), by = 1e05)
if(go_through[length(go_through)] != nrow(ssn)) go_through <- c(go_through, nrow(ssn))

go_through <- cbind(go_through[-
length(go_through)],
c(go_through[-c(1, length(go_through))]-1, go_through[length(go_through)]))

Now, we can start counting the appearence of each name in the dataset, in a function

count_ckunk <- function(s){
  print(s)
  data <- ssn[go_through[s,1]:go_through[s,2], c("last_name", "first_name", "date_birth_y")]
data$cohort <- trunc(as.numeric(data$date_birth_y)/10)*10

The counts are

cpttf_temp <- count(data, "first_name")
cpttl_temp <- count(data, "last_name")
cpttot_temp <- count(data, c("last_name", "first_name"))

The first name, the last name, and the pair first and last name. And then, we also look at the cohort,

cpttf_temp_c <- count(data, c("first_name", "cohort"))
cpttl_temp_c <- count(data, c("last_name", "cohort"))
cpttot_temp_c <- count(data, c("last_name", "first_name", "cohort"))

We create a list, to store that information

list(cpttf = cpttf_temp, cpttl = cpttl_temp, cpttot = cpttot_temp,
cpttf_c = cpttf_temp_c, cpttl_c = cpttl_temp_c, cpttot_c = cpttot_temp_c)
}

Now, that we have our function, we use it

data <- lapply(seq_len(nrow(go_through)),count_ckunk)

In that dataset, we can count appearences, of all names

cpttf <- do.call("rbind",lapply(data, function(x) x$cpttf))
cpttl <- do.call("rbind",lapply(data, function(x) x$cpttl))
(nom,prenom)
cpttot <- do.call("rbind",lapply(data, function(x) x$cpttot))

as well as those information and the cohort

cpttf_c <- do.call("rbind",lapply(data, function(x) x$cpttf_c))
cpttl_c <- do.call("rbind",lapply(data, function(x) x$cpttl_c))
cpttot_c <- do.call("rbind",lapply(data, function(x) x$cpttot_c))

We are almost done !

cpttf <- count(cpttf, "first_name", wt_var = "freq")
cpttl <-count(cpttl, "last_name", wt_var = "freq")
cpttot <- count(cpttot, c("last_name", "first_name"), wt_var = "freq")
cpttf_c <- count(cpttf_c, c("first_name", "cohort"), wt_var = "freq")
cpttl_c <-count(cpttl_c, c("last_name", "cohort"), wt_var = "freq")
cpttot_c <- count(cpttot_c, c("last_name", "first_name", "cohort"), wt_var = "freq")

We now have our counts. We can sort them, just to visualize

cpttf_t <- arrange(cpttf_t, desc(freq))
cpttl_t <- arrange(cpttl_t, desc(freq))
cpttot_t <- arrange(cpttot_t, desc(freq))
cpttf_c_t <- arrange(cpttf_c_t, cohort, desc(freq))
cpttl_c_t <- arrange(cpttl_c_t, cohort, desc(freq))
cpttot_c_t <- arrange(cpttot_c_t, cohort, desc(freq))

For instance, in the 1930 cohort, our top-6 was

> head(cpttot_c_t[cpttot_c_t$cohort==1930,])
         last_name first_name cohort freq
14033573     SMITH      JAMES   1930  682
14033574     SMITH     ROBERT   1930  616
14033575   JOHNSON     ROBERT   1930  527
14033576   JOHNSON      JAMES   1930  502
14033577  WILLIAMS      JAMES   1930  473
14033578     SMITH       MARY   1930  425

while in the 1970 cohort, it was

> head(cpttot_c_t[cpttot_c_t$cohort==1970,])
         last_name first_name cohort freq
17950485 RODRIGUEZ       JOSE   1970   54
17950486  GONZALEZ       JOSE   1970   41
17950487   JOHNSON    MICHAEL   1970   40
17950488  MARTINEZ       JOSE   1970   38
17950489 HERNANDEZ       JOSE   1970   36
17950490     SMITH    MICHAEL   1970   35

Frequencies are extremely small in the later. It should come from the fact that people in the dataset are gone.

Based on those datasets, we can start visualizing. Create lists of first and last names

L <- cpttl_t[,"last_name"]
F <- cpttf_t[,"first_name"]
F <- F[nchar(F)>2]

Consider the 1930 cohort

YEAR <- 1930
N <- sum(cpttl_c_t[(cpttl_c_t$cohort==YEAR)
,"freq"])
Freq_L <- cpttl_c_t[(cpttl_c_t$cohort==YEAR)&(cpttl_c_t$last_name%in%L[1:20]),c("last_name","freq")]
Freq_F <- cpttf_c_t[(cpttf_c_t$cohort==YEAR)&(cpttf_c_t$first_name%in%F[1:20]),c("first_name","freq")]

Freq_FL <- cpttot_c_t[(cpttot_c_t$cohort==YEAR)&(cpttot_c_t$last_name%in%L[1:20])&(cpttot_c_t$first_name%in%F[1:20]),c("last_name","first_name","freq")]

Here, we focus on the top-20 for the first and the last names. The count matrix is here

dt_Freq_FL <- data.table(Freq_FL)
dt_Freq_FL <- dcast.data.table(dt_Freq_FL, last_name~first_name, fun = sum, value.var = "freq")
df_Freq_FL <- data.frame(dt_Freq_FL)
mat_Freq_FL <- as.matrix(df_Freq_FL[,-1])
rownames(mat_Freq_FL) <- df_Freq_FL[,1]
Nobs <- mat_Freq_FL

that we can plot

We can also use the chi-square metrics to compare that matrix with the matrix under the independence assumption,

Nind <- chisq.test(Nobs)$expected
Q <- (Nobs-Nind)^2/Nind

The codes for the color are

library(RColorBrewer)
plotclr=colorRampPalette(brewer.pal(7,
"RdYlBu")[7:1] )(20)
niv=c( 0, 7.651122e-03 ,2.736844e-02 ,6.642095e-02 ,1.198073e-01,
 1.807768e-01, 2.599493e-01 ,3.543850e-01 ,4.645050e-01, 6.037386e-01,
 7.536596e-01, 9.836452e-01 ,1.215549e+00, 1.568113e+00 ,1.990363e+00,
2.538832e+00 ,3.228198e+00 ,4.883766e+00, 6.925505e+00 ,1.121456e+01,
10000000)
class_q=cut(as.vector(Q),breaks=niv,labels=1:20)
colcode=matrix(plotclr[as.numeric(class_q)],20,20)

Those values are the quantiles for 1930 counts.  I kept the values to use the same beaks in 1950 and 1970, and compare three cohorts.

For the 1950 cohort, the counts were

while the chi-square distance matrix was

In red, we have large contributions in the chi-square distance. But it can be either because observed counts were too small, or too large. Here, we do not know. Except very specific cells that we can guess (e.g. Willam Willams and David Davis are in red... I believe that those are uncommon combinations).  But observe that if the family name is Garcia then the first name will not be distributed as it could be for other family names. At least in 1950 and 1930. But when you look at the 1970 cohort,

the chi-square contribution is much smaller. So clearly, the effect that was also observed in @mona's post, is related to the year of birth. In 1950, it was rather rare to observe an English style first name with a Spanish style last name. But it seems that it is more comon nowadays...

To leave a comment for the author, please follow the link and comment on their blog: Freakonometrics » R-english.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)