Listening for trends in US baby names over 130 years

Posted on January 25, 2011 by Ethan Brown in R bloggers, Uncategorized

[This article was first published on Statisfactions: The Sounds of Data and Whimsy » R, and kindly contributed to R-bloggers].

What happens when you mash together R’s data crunching magic, Festival’s speech synthesis power, and the audio wonders of the venerable music language Csound? You fall even more in love with free and open-source software, and you start hearing sounds like this:

A single beat of the above sound represents the top 1000 baby names for one year, progressing from 1880 to 1905.

Each beat is composed of 1000 voices, each speaking one of the top 1000 names for that year; each name is pitch-shifted to reflect its relative percentage of all baby names for that sex for that year. Higher-pitched names were more frequent that year, with the left speaker playing female names and the right playing male ones.

You can hear all that, right? Let’s winnow this down to just the top name for each gender for each year, and extend the range from 1880 all the way to 2008. This is easiest to hear on headphones:

The biblical Mary and John hold tenaciously onto the top spot for quite a while at the beginning, from 1880 to 1923. After that it begins to get more sonically interesting: Robert and James come along for substantial stints, but only in 1946 does Mary finally give up as reigning name queen!

Around then, Michael begins a 44-year reign that spans a variety of female-name fashions. I especially like the exciting rhythm that Jessica brings to the proceedings, and Emma’s cadential quality.

You’ll also notice that overall the pitches are getting lower as the piece proceeds–John starts at 8 percent of boys and Mary at 7 percent of girls, but by 2008, the top names (Jacob and Emma) are only about 1 percent. This seems to follow a general trend in the increasing spread of the distribution of names: the top 1000 names account for 93 percent of babies in 1880, but only 79 percent for boys and 67 percent for girls in 2008.
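
The coverage figures above can be recomputed by summing the percent column within each year and sex. Here is a minimal sketch, assuming the column names year, sex, name, and percent (stored as a proportion between 0 and 1) as in baby-names.csv. The toy data frame is invented for illustration and does not contain real SSA figures:

```r
## Toy stand-in for babyNames (values are invented, not real data)
toy <- data.frame(
  year    = c(1880, 1880, 1880, 1880, 2008, 2008, 2008, 2008),
  sex     = c("boy", "boy", "girl", "girl", "boy", "boy", "girl", "girl"),
  name    = c("John", "William", "Mary", "Anna",
              "Jacob", "Michael", "Emma", "Isabella"),
  percent = c(0.08, 0.07, 0.07, 0.03, 0.01, 0.009, 0.009, 0.008)
)

## Share of babies covered by the listed names, per year and sex
coverage <- aggregate(percent ~ year + sex, data = toy, FUN = sum)
coverage
```

Run on the full dataset instead of the toy data, the same call should give one coverage value per year and sex.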

Long live diversity of names! Down with the tyranny of John and Mary! I’m rather fond of my biblical name–the second-most-popular US male name of 2009, I’ll have you know–but I’m glad to see Haruko, Cedric, Mafalda, Che, Amina, Zigmund, and Mohamed make appearances in the top thousand US names in this dataset as well.

Notes, Code, Acknowledgments, and Technical Stuff

The data is initially from the Social Security Administration; I used a cleaned-up subset of the data put together by the redoubtable Hadley Wickham from his GitHub project on this data.

The process for generating this is very much a hack to get the job done on my computer–after more experimentation with these tools I hope to develop more general code that provides a nicer R interface. But here’s a start.

After downloading the data into directory “data/”, I used “text2wave”, a command-line utility that comes with the free and open speech synthesis environment Festival. I just used a simple R loop to direct the process of generating the name soundfiles, using some bash commands (with Windows, this should work with Cygwin but won’t work with normal MS-DOS):

ttsVect <- function(x, directory = getwd()) {
  ## Runs festival's text2wave utility on all text in a character
  ## vector and produces WAV files from them
  ## in specified directory
  commands <- paste("echo '", x, "' | text2wave -o ", directory,
                    "/", x, ".wav -eval '(voice_rab_diphone)'", sep="")
  lapply(commands, system)
  invisible()
}
## Read in downloaded data (baby-names.csv must already be in 'data')
babyNames <- read.csv("data/baby-names.csv")
## Unique names as a character vector; unlike levels(), this works
## whether or not read.csv() converts strings to factors
allNames <- unique(as.character(babyNames$name))
## Creates WAV files in directory 'wav'
dir.create("wav")
ttsVect(allNames, directory = "wav") # Warning: can take hours!

This generates soundfiles for ALL the names, which takes quite a while since there are around 6700. If you’re interested in generating just a few, you can select the ones you want and pass them along to the function.
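
For instance, to synthesize only the names that ever held the top spot, one could pick them out first. This is a sketch under the assumption that the data has the columns year, sex, name, and percent; `toy` and `winners` are hypothetical names, and the values are invented:

```r
## Invented stand-in for babyNames
toy <- data.frame(
  year    = c(1880, 1880, 1880, 1881, 1881, 1881),
  sex     = rep("girl", 6),
  name    = c("Mary", "Anna", "Emma", "Mary", "Anna", "Emma"),
  percent = c(0.072, 0.027, 0.021, 0.070, 0.027, 0.020),
  stringsAsFactors = FALSE
)

## Keep the single most frequent name in each year/sex group
groups  <- split(toy, list(toy$year, toy$sex), drop = TRUE)
winners <- unique(unlist(lapply(groups,
                                function(g) g$name[which.max(g$percent)])))
winners
## These could then be passed along, e.g. ttsVect(winners, "wav")
```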

In order to do the pitch-shifting effects without changing the length of the sound samples, we need to generate Fourier transform data on each of the soundfiles. The powerful, free and open-source music programming language Csound has a command-line utility called PVANAL for this. (It sounds like solar-powered gay porn, but in fact it stands for “Phase Vocoder ANALysis”, the analysis step that makes this kind of pitch-shifting possible.) We invoke PVANAL with a similarly simplistic shell loop in R:

pvanalVect <- function(x, directory = getwd(), wavdir = "wav") {
  ## Runs csound's pvanal utility on all wav files
  ## associated with character vector
  ## and produces PVC files for phase vocoding in csound.
  ## wavdir is the directory where ttsVect() wrote the WAV files
  commands <- paste("csound -U pvanal ", wavdir, "/", x, ".wav ",
                    directory, "/", x, ".pvc", sep="")
  lapply(commands, system)
  invisible()
}
## Create phase-vocoder directory and files
dir.create("pvc")
pvanalVect(unique(as.character(babyNames$name)), "pvc") # Also time-consuming!

Now, we need to create a few functions for input to Csound. It would be nice eventually to make use of Festival’s and Csound’s APIs for this kind of stuff, but while I’m figuring that out I can still just write files and pass them to Csound to compile and run.

Csound programs have two parts, the orchestra and the score; the orchestra sets up the “instruments”, processes that csound uses to render sound, while the score provides control information to the instruments (rather like MIDI).

I won’t get too far into the syntax of Csound itself here, but the basic strategy is to create an instrument that will take various inputs to render the pitch-shifted name using the Fourier-transform data generated above. Then, I can use an R data.frame with the values of the instrument’s parameters (called p-fields in Csound) to generate the score.

Here’s the function to write the input file for csound; if no filename is provided, this function creates a temporary one:

csndWriteCsdFile<-function(df, file=NULL) {
  ## Adapted from Erich Neuwirth's Rcsound package
  ## from a function of the same name
  if(is.null(file)){
    file <- paste(tempfile("Rcs"),"csd",sep=".")
  }
  orchestra <- '<CsoundSynthesizer>
<CsOptions>
</CsOptions>
<CsInstruments>

sr = 16000
ksmps = 128
nchnls = 2
0dbfs = 1

instr 1
ifreqscale = p4
iamp = p5
ktime line 0.25, p3, 0.75
asig pvoc ktime, ifreqscale, p6, 1
outs (asig*iamp)*(1-p7), (asig*iamp)*p7
endin

</CsInstruments>
'
  score <- c("<CsScore>", paste("i",do.call(paste, df)),"e", "</CsScore>", "</CsoundSynthesizer>")
  write(orchestra, file)
  write(score, file, append=TRUE)
  file
}

Now, here’s a function that calls the above function to write the input file and then calls Csound on the result. The “output” argument can be changed to be the path to a WAV file to render the sound to; by default it renders to the computer’s output audio device (“dac”, for “digital-to-analog converter”).

csndPlayDF <- function(df, csdFile=NULL, output="dac") {
  ## Takes data.frame and uses it, in order, as
  ## p-fields in csound
  csdFile <- csndWriteCsdFile(df=df, file=csdFile)
  system(paste("csound",csdFile,"-o",output))
}

What are the parameters to our instrument? In order, they are:

p1: the instrument number–here it should always be 1

p2: start time of note (in seconds)

p3: duration

p4: frequency multiplier–how much to change the pitch

p5: amplitude multiplier–since we’re adding a lot of notes we need to decrease their volume so that it doesn’t overload the speakers

p6: filepath to the result of the PVANAL analysis–csound uses this to actually generate the sound

p7: pan (0 = left speaker, 1 = right speaker)
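
To make the mapping from p-fields to score lines concrete, here is a small sketch that builds two notes by hand and pastes them into Csound "i" statements with the same idiom csndWriteCsdFile() uses. The data frame `notes` and the file paths in it are placeholders for illustration, not real analysis files:

```r
## Two hypothetical notes; the .pvc paths are placeholders
notes <- data.frame(
  inst     = 1,
  start    = c(0, 0.4),
  dur      = 0.4,
  freqm    = c(1, 0.5),
  ampm     = 0.5,
  pvocpath = c('"pvc/Mary.pvc"', '"pvc/John.pvc"'),
  pan      = c(0, 1),
  stringsAsFactors = FALSE
)

## One "i" statement per row, with columns in p-field order
scoreLines <- paste("i", do.call(paste, notes))
scoreLines
## First line: i 1 0 0.4 1 0.5 "pvc/Mary.pvc" 0
```

Because the columns are pasted in order, the column order of the data frame is what determines which value lands in which p-field.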

So, we just need to transform the appropriate elements of the babyNames data frame to create the parameters, and then feed the resulting data frame into the csndPlayDF() function. To facilitate this and make sure the columns are in the right order (which is central to Csound), I have a simple template function:

csndtemplate <- function(rows=0) {
  newdf <- data.frame(matrix(nrow=rows,ncol=7))
  names(newdf) <- c("inst", "start", "dur", "freqm", "ampm","pvocpath","pan")
  newdf
}

After all this we’re in familiar R data-wrangling territory (phew!). First we have a simple function to retrieve the top n names per year per gender from the babyNames data.frame:

topNames <- function(n=1) {
  ## Retrieve top n names by gender by year
  require(plyr)
  tops <- ddply(babyNames,c("year","sex"), function(df)
                df[rank(df$percent, ties.method="max")> (length(df$percent)-n),])
  tops
}
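
One thing worth knowing about the rank-based filter in topNames(): with ties.method="max", tied values share the highest rank in their group, so a tie at the cutoff can return more than n rows. A tiny sketch on an invented vector of percentages:

```r
## Invented percentages for one year/sex group, with a tie
percent <- c(0.05, 0.03, 0.03, 0.01)
n <- 2

## Same comparison as inside topNames()
ranks <- rank(percent, ties.method = "max")
keep  <- ranks > (length(percent) - n)

ranks      # tied values both get rank 3
sum(keep)  # three rows pass the filter even though n is 2
```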

Then, we transform this appropriately into p-fields for Csound:

Pfields <- function(x, setrange = 1600, len = 0.4) {
  ## Takes input data.frame x derived from babyNames dataset
  ## and transforms it into p-fields
  ## Columns in input data frame need to match column names
  ## in the baby-names.csv file
  ## *setrange* is total spread of frequency shifts to be made (in cents)
  ## *len* is how long each sound should play

  nrows <- nrow(x)
  n <- max(as.data.frame(table(x$year))[,2])
  topNamesP <- csndtemplate(nrows)
  topNamesP$inst <- 1
  topNamesP$start <- (x$year - 1880)*len
  topNamesP$dur <- rep(len, nrows)
  ## Maps percent linearly onto [-setrange, 0] cents, so the most
  ## frequent name keeps its original pitch and the rest are lowered
  topNamesP$freqm <- (x$percent - min(x$percent)) * setrange /
    (max(x$percent) - min(x$percent)) - setrange
  ## Creates multiplicative factor from cents
  topNamesP$freqm <- 2^(topNamesP$freqm/1200)
  ## Scales all amplitude down to the number of different voices
  ## attempted to render
  topNamesP$ampm <- 1/n
  topNamesP$pvocpath <- paste('"',getwd(),"/pvc/",x$name, ".pvc",'"',sep="")
  topNamesP$pan <- as.numeric(x$sex == "boy")
  topNamesP
}
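
The pitch mapping in Pfields() happens in two steps: percentages are rescaled to a shift in cents, and cents are converted to a frequency multiplier via 2^(cents/1200), since 1200 cents make an octave. Here is a standalone sketch of both steps; the helper names percentToCents and centsToRatio are hypothetical, and the percentages are invented:

```r
## Step 1: rescale percentages to cents, as Pfields() does
percentToCents <- function(p, setrange = 1600) {
  (p - min(p)) * setrange / (max(p) - min(p)) - setrange
}

## Step 2: convert cents to a multiplicative frequency factor
centsToRatio <- function(cents) 2^(cents / 1200)

p      <- c(0.01, 0.045, 0.08)   # invented percentages
cents  <- percentToCents(p)      # about -1600, -800, 0
ratios <- centsToRatio(cents)
ratios
```

The most frequent name ends up with a multiplier of 1 (unchanged pitch), and a shift of -1200 cents would halve the frequency.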

At long last, we are ready to hear something! Here’s the command to generate the first file–I’ve specified an output file since there’s no way Csound can handle rendering 1000 pitch-shifted voices in real time. Let the coder beware: I actually ran the following command overnight and had to break out of it 8 hours later, when it was only a quarter done (which is why it only goes up to 1905):

## Theoretically generates all 1000 names
## pitch-shifted to their percentage for every year,
## but actually takes a billion years:
csndPlayDF(Pfields(topNames(1000)), output = "All1000.wav")

A more feasible option is to choose a much smaller n–like n=1, which generated the second example in this post. You can mess with the length of the sample and the spread of pitches to achieve different effects:

csndPlayDF(Pfields(topNames(1))) # Sounds like second example
csndPlayDF(Pfields(topNames(1),setrange = 2400)) # Larger spread of pitches
csndPlayDF(Pfields(topNames(1),setrange = 2400, len = 0.025)) # Blurs into new pitches
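
Since Pfields() starts each year's notes at (year - 1880)*len and gives each note a duration of len, the overall length of the piece follows directly from len. A quick sketch of that arithmetic; the helper name pieceLength is hypothetical:

```r
## Total length in seconds: start of the last year's notes plus one note
pieceLength <- function(firstYear, lastYear, len) {
  (lastYear - firstYear) * len + len
}

pieceLength(1880, 2008, 0.4)    # default len
pieceLength(1880, 2008, 0.025)  # the "blurred" version
```

At the default len the 1880 to 2008 piece runs a bit under a minute, while len = 0.025 compresses it to a few seconds.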

Finally, I’d like to give a shout-out to Erich Neuwirth, who provided the inspiration for my explorations into R and Csound with his Rcsound package, last updated in 2003. I borrowed some of his code in csndWriteCsdFile(). The package is not on CRAN, but you can see a vignette here and download the package from his website here. It provides a few functions for rendering data to sound. I found it required some fiddling to get it to install and work, and I may post a guide soon. Briefly: I had to delete the reference to “csound” in the dependency line of the DESCRIPTION file to get the package to install, and had to replace “pkg” with “package” in csoundfile.R to get the CSD file to write properly.

Are you interested in making sound in R? Send me an e-mail or leave a comment here. If there’s enough interest I’d love to set up a mailing list or Google group.