Site icon R-bloggers

The Rhythm of Names

[This article was first published on R-posts.com, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
One of the fundamental properties of English prosody is a preference for alternations between strong and weak beats. This preference for rhythmic alternation is expressed in several ways: This blog examines whether the preference for rhythmic alternation affects naming patterns. Consider the following names of lakes in the United States:
      1. Glacier Lake
      2. Guitar Lake*
      3. Lake Louise
      4. Lake Ellen*
The names (a) and (c) preserve rhythmic alternation whereas the asterisked (b) and (d) create a stress clash with two consecutive stressed syllables.

As you can see, lake names in the US can either begin or end with “Lake”. More than 90% end in “Lake” reflecting the standard modifier + noun word order in English. But the flexibility allows us to test whether particular principles (e.g., linguistic, cultural) affect the choice between “X Lake” and “Lake X.” In the case of rhythmic alternation, we would expect weak-strong “iambic” words like “Louise” and “Guitar” to be more common in names beginning than ending in “Lake.”

To test this hypothesis, I used the Quanteda package to pull all the names of lakes and reservoirs in the USGS database of placenames that contained the word “Lake” plus just one other word before or after it. Using the “nsyllable” function in Quanteda, I whittled the list down to names whose non-“lake” word contained just two syllables. Finally, I pulled random samples of 500 names each from those beginning and ending with Lake, then manually coded the stress patterns on the non-lake word in each name.

Coding details for these steps follow. First, we’ll load our place name data frame and take a look at the variable names in the data frame, which are generally self-explanatory:

setwd("/Users/MHK/R/Placenames")
load("placeNames.RData")
colnames(placeNames)
 [1] "FEATURE_ID"     "FEATURE_NAME"   "FEATURE_CLASS"  "STATE_ALPHA"    "STATE_NUMERIC" 
 [6] "COUNTY_NAME"    "COUNTY_NUMERIC" "PRIM_LAT_DEC"   "PRIM_LONG_DEC"  "ELEV_IN_M"

Next, we’ll filter to lakes and reservoirs based on the FEATURE_CLASS variable, and convert names to lower case. We’ll then flag lake and reservoir names that either begin or end with the word “lake”, filtering out those in neither category:

temp <- filter(placeNames, FEATURE_CLASS %in% c("Lake","Reservoir"))
temp$FEATURE_NAME <- tolower(temp$FEATURE_NAME)
temp$first_word <- 0
temp$last_word <- 0
temp$first_word[grepl("^lake\\b",temp$FEATURE_NAME)] <- 1
temp$last_word[grepl("\\blake$",temp$FEATURE_NAME)] <- 1
temp <- filter(temp, first_word + last_word > 0)
We’ll use the ntoken function in the Quanteda text analytics package to find names that contain just two words. By definition given the code so far, one of these two words is “lake.” We’ll separate out the other word, and use the nsyllable function in Quanteda to pull out just those words containing two syllables (i.e., “disyllabic” words). These will be the focus of our analysis.
temp$nWords <- ntoken(temp$FEATURE_NAME, remove_punct=TRUE)
temp <- filter(temp, nWords == 2)
temp$num_syl <- nsyllable(temp$otherWord)
temp <- filter(temp, num_syl == 2)
temp$otherWord <- temp$FEATURE_NAME
temp$otherWord <- gsub("^lake\\b","",temp$otherWord)
temp$otherWord <- gsub("\\blake$","",temp$otherWord)
temp$otherWord <- trimws(temp$otherWord)
  Given the large number of names with “lake” either in first or last position plus a two syllable word (30,391 names), we’ll take a random sample of 500 names beginning with “lake” and 500 ending with “lake”, combine into single data frame, and save as a csv file.
lake_1 <- filter(temp, first_word == 1) %>% sample_n(500)
lake_2 <- filter(temp, last_word == 1) %>% sample_n(500)
lakeSample <- rbind(lake_1,lake_2)
write.csv(lakeSample,file="lake stress clash sample.csv",row.names=FALSE)

I manually coded each of the disyllabic non-“lake” words in each name for whether it had strong-weak (i.e., “trochaic”) or weak-strong (“iambic”) stress. This coding was conducted blind to whether the name began or ended in “Lake.” Occasionally, I came across words like “Cayuga” that the nsyllable function erred in classifying as containing two syllables. I dropped these 23 words – 2.3% of the total – from the analysis (18 in names beginning with “Lake” and 5 in names ending in “Lake”).

Overall, 90% of the non-lake words had trochaic stress, which is consistent with the dominance of this stress pattern in the disyllabic English lexicon. However, as predicted from the preference for rhythmic alternation, iambic stress was almost 5x more common in names beginning than ending with “Lake” (16.4% vs. 3.4%, x2 = 44.81, p < .00001).

Place names provide a rich resource for testing the potential impact of linguistic and cultural factors on the layout of our “namescape.” For example, regional differences in the distribution of violent words in US place names are associated with long-standing regional variation in attitudes towards violence. Large databases of place names along with R tools for text analytics offer many opportunities for similar analyses.
The Rhythm of Names was first posted on April 10, 2021 at 6:54 pm.

To leave a comment for the author, please follow the link and comment on their blog: R-posts.com.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.