Elizabeth!

[This article was first published on MeanMean, and kindly contributed to R-bloggers.]

Over the last few months I have been spending my nights taking care of my newborn second daughter. Keeping me company during the sleepless wee hours of the morning was the Reconcilable Differences podcast. In episode 17 of this podcast, It’s Devastating, John Siracusa posed an open question about how baby names change over time and whether there have been any sudden changes. In this blog post, part of my series investigating podcast themes, I will take a look at these two questions. I will also provide my source code in the spirit of reproducible research.

Short a Few Apostles

I used the babynames package in R to perform this analysis. The R package babynames by Hadley Wickham provides a nifty way to get baby names and related information from the Social Security Administration and other government agencies. Downloading the data and plotting the name John relative to Jon over time can be done with a few simple commands.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# get the names
theJNames <-babynames %>%
  filter(name %in% c('Jon','John') & sex == 'M')

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=theJNames, aes(x=year,y=prop)) +
  geom_line(aes(color=name),size=1.5) +
  labs(
    title='Jon vs. John',
    y="Proportion of new SSN applications",
    x="Year")
plot(p)
Figure 1: Proportions of the names Jon and John over time.

With this little bit of work, we can see that the proportion of babies named John is strictly larger than the proportion named Jon, but it is also in a fairly substantial decline. Out of curiosity, let’s check James. Why James? It seems that just about every man I know in his 70s is named James.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# get the names
theJNames <-babynames %>%
  filter(name %in% c('James') & sex == 'M')

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=theJNames, aes(x=year,y=prop)) +
  geom_line() +
  labs(
    title='James',
    y="Proportion of new SSN applications",
    x="Year")

maxYear <- theJNames$year[which.max(theJNames$prop)]

plot(p)
Figure 2: Proportions of the name James over time.

It looks like we hit peak James in 1944 (the value of maxYear), which lines up with my observations.
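The peak-year lookup generalizes to several names at once. The following is a minimal base R sketch, not part of the original analysis: `peak_year` is a hypothetical helper, and the toy data frame only mimics the `year`, `name`, and `prop` columns of `babynames` (its values are made up for illustration).

```r
# Hypothetical helper: for each name, return the year with the largest prop.
# Assumes a data frame with columns `year`, `name`, and `prop`.
peak_year <- function(df) {
  pieces <- split(df, df$name)
  vapply(pieces, function(d) d$year[which.max(d$prop)], numeric(1))
}

# Toy data with made-up proportions, for illustration only
toy <- data.frame(
  year = c(1940, 1944, 1950, 1940, 1944, 1950),
  name = c('James', 'James', 'James', 'John', 'John', 'John'),
  prop = c(0.050, 0.055, 0.048, 0.052, 0.049, 0.045),
  stringsAsFactors = FALSE
)

peaks <- peak_year(toy)
print(peaks)
```

With the real data, something like `peak_year(subset(babynames, sex == 'M' & name %in% c('James', 'John')))` should work the same way, assuming the column names match.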

For the data pedantic, which should be everyone: this list of baby names is derived from Social Security card applications. Since the Social Security Administration didn’t exist until 1935 and didn’t start issuing cards until 1937, the names from 1880-1937 are likely to be a bit incomplete.

The Names of the 1880s

So John and James both trended downward in their share of new babies. Does this hold for other names? Let’s see what the top names of 1880 were and where they are today; likewise, let’s see what the top names in 2014 are and how they have trended historically.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# dream of the 1880s
# ladies first
oldNamesData <- babynames %>%
  filter(year == 1880 & sex =='F')

#baby names are presorted
oldNames <- head(oldNamesData$name,n=7)
oldNamesData <- filter(babynames, name %in% oldNames & sex == 'F')
oldNamesData$name <- as.factor(oldNamesData$name)

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=oldNamesData, aes(x=year,y=prop)) +
  geom_line(aes(color=name)) +
  labs(
    title='Top 1880 Female Names',
    y="Proportion of new SSN applications",
    x="Year")

plot(p)
Figure 3: Old time female names over time.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# 2014 females
newNamesData <- babynames %>%
  filter(year == 2014 & sex =='F')

#baby names are presorted
newNames <- head(newNamesData$name,n=7)
newNamesData <- filter(babynames, name %in% newNames & sex == 'F')
newNamesData$name <- as.factor(newNamesData$name)

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=newNamesData, aes(x=year,y=prop)) +
  geom_line(aes(color=name)) +
  labs(
    title='Top 2014 Female Names',
    y="Proportion of new SSN applications",
    x="Year")

plot(p)
Figure 4: New female names over time.
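The blocks above rely on the rows of babynames being presorted by count within each year and sex. If you would rather not depend on row order, a sketch along the following lines sorts explicitly. `top_names` is a hypothetical helper, the toy data is made up, and the column names `year`, `sex`, `name`, and `n` are assumed to match babynames.

```r
# Hypothetical helper: the n_top most common names for a given year and sex,
# sorted explicitly by count instead of relying on row order.
# Assumes columns `year`, `sex`, `name`, and `n`.
top_names <- function(df, yr, sx, n_top = 7) {
  rows <- df[df$year == yr & df$sex == sx, ]
  rows <- rows[order(rows$n, decreasing = TRUE), ]
  head(rows$name, n_top)
}

# Toy data, deliberately NOT presorted; the counts are made up
toy <- data.frame(
  year = rep(2014, 5),
  sex  = c('F', 'F', 'F', 'M', 'F'),
  name = c('Mia', 'Emma', 'Ava', 'Noah', 'Olivia'),
  n    = c(300, 500, 350, 450, 480),
  stringsAsFactors = FALSE
)

top_f <- top_names(toy, 2014, 'F', n_top = 3)
print(top_f)
```

The returned vector can be used in place of `newNames` or `oldNames` in the filters above.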

Now for the gents.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# dream of the 1880s
oldNamesData <- babynames %>%
  filter(year == 1880 & sex =='M')

#baby names are presorted
oldNames <- head(oldNamesData$name,n=5)
oldNamesData <- filter(babynames, name %in% oldNames & sex == 'M')
oldNamesData$name <- as.factor(oldNamesData$name)

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=oldNamesData, aes(x=year,y=prop)) +
  geom_line(aes(color=name)) +
  labs(
    title='Top 1880 Male Names',
    y="Proportion of new SSN applications",
    x="Year")

plot(p)
Figure 5: Old time male names over time.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# males in 2014
newNamesData <- babynames %>%
  filter(year == 2014 & sex =='M')

#baby names are presorted
newNames <- head(newNamesData$name,n=5)
newNamesData <- filter(babynames, name %in% newNames & sex == 'M')
newNamesData$name <- as.factor(newNamesData$name)

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=newNamesData, aes(x=year,y=prop)) +
  geom_line(aes(color=name)) +
  labs(
    title='Top 2014 Male Names',
    y="Proportion of new SSN applications",
    x="Year")

plot(p)
Figure 6: New male names over time.

The interesting bit about both of these results is that the popularity of names seems to be fairly short-lived in modern times. This can be seen in the popular names from 2014 popping up from nowhere, with the exception of William and maybe Emily.

Name Distribution

That was a lot of fun, but let’s get back to the question of how people have labeled their children over time. What we can hypothesize from the prior results is that there is an abundance of new names over the last century.

So let’s take a look at this new diversity in child naming. One way to look at this diversity is simply to plot the number of distinct names for each year and see how it changes over time (Figure 7).

library(ggplot2)
library(babynames)


# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# quick aggregation: count the rows (one per name) for each year and sex;
# there is a dplyr function for this as well.
# note that the `prop` column of the result holds the count of names
uniqueNames <- aggregate(prop ~ year+sex, data=babynames,
                         function(x) length(x))

# plot the number of unique names from new SSN applications 
# over time
p <- ggplot(data=uniqueNames, aes(x=year,y=prop)) +
  geom_line(aes(color=sex)) +
  labs(
    title='Number of Unique New Names over Time',
    y="Number of unique names",
    x="Year")

plot(p)
Figure 7: Number of unique names over time.
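The aggregation comment mentions that dplyr can do the same job. One possible version is sketched below on a made-up toy data frame; it assumes dplyr 1.0 or later (for the `.groups` argument), and the `count` column name is my own choice rather than anything from the original code.

```r
library(dplyr)

# Toy data with the same column names as babynames; values are made up
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881, 1881),
  sex  = c('F', 'F', 'M', 'F', 'M'),
  name = c('Mary', 'Anna', 'John', 'Mary', 'John'),
  stringsAsFactors = FALSE
)

# One row per year and sex, with the number of distinct names in that group
uniqueNames <- toy %>%
  group_by(year, sex) %>%
  summarise(count = n_distinct(name), .groups = 'drop')

print(uniqueNames)
```

Swapping `toy` for `babynames` and plotting `y=count` should give the same picture as Figure 7, with a column name that matches what is being plotted.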

Clearly there is an increase in distinct names. This increase could be a function of the change in population size, but it is likely more complicated than that. Immigration from non-European countries may play a large role in this change, but investigating that would require a bit more work than I’m willing to put into this blog post.

Another way to look at this data is to plot the sorted proportions. Plotting the sorted proportions will provide some insights into the distribution of the names. If the proportions are uniformly distributed, this implies no real preference towards a particular name. If the proportions are highly skewed, then the population likely has a large number of uncommon names and a small number of popular names. To limit the number of plots, I just took a quick look at the plots of proportions for 1890 (YouTube) and 1990 (YouTube). To make things simple I just pooled male and female names together.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# get names
namesDensity <- babynames %>% filter(year %in% c(1890,1990)) %>%
  group_by(sex,year) %>%
  mutate(index=1:length(prop)) %>%  # create an index
  ungroup(sex,year)

namesDensity$Year <- as.factor(namesDensity$year)

namesDensity1890 <- filter(namesDensity, Year==1890)
namesDensity1890 <-
  namesDensity1890[sort(namesDensity1890$prop, index.return=T)$ix,]

namesDensity1990 <- filter(namesDensity, Year==1990)
namesDensity1990 <-
  namesDensity1990[sort(namesDensity1990$prop, index.return=T)$ix,]

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=namesDensity1890, aes(x=index,y=prop)) +
  geom_line(aes(color=Year)) +
  labs(
    title='Plot of Decreasing Name Proportions in 1890',
    y="Proportion of new SSN applications",
    x="Number of Names")

plot(p)

p <- ggplot(data=namesDensity1990, aes(x=index,y=prop)) +
  geom_line(aes(color=Year)) +
  labs(
    title='Plot of Decreasing Name Proportions in 1990',
    y="Proportion of new SSN applications",
    x="Number of Names")

plot(p)
Figure 8: Plot of decreasing proportions in 1890.
Figure 9: Plot of decreasing proportions in 1990.

There are a few interesting characteristics of these two plots. The first we saw before in Figure 7: the number of unique names. Here we can see that the number of unique names increased roughly eightfold over one hundred years. So clearly there are more names out there, but how are they distributed? Both plots show a small number of common names and a much larger number of uncommon names. Finally, the magnitudes of these proportions are interesting: in 1890 the most popular names covered about 7% of the total population of new SSN applicants, while this was down to 3% in 1990.
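One way to put a number on this kind of concentration is the share of applications covered by the k most popular names. The sketch below is an illustration only: `top_share` is a hypothetical helper, and the two vectors are made-up proportions, not values taken from babynames.

```r
# Hypothetical helper: share of applications covered by the k most
# popular names. Assumes `prop` holds each name's proportion.
top_share <- function(prop, k = 1) {
  sum(head(sort(prop, decreasing = TRUE), k))
}

# Made-up proportions for two toy years: one concentrated, one spread out
concentrated <- c(0.07, 0.03, 0.02, 0.01)
spread_out   <- c(0.03, 0.02, 0.02, 0.02, 0.01, 0.01)

print(top_share(concentrated, k = 1))
print(top_share(spread_out, k = 1))
print(top_share(spread_out, k = 3))
```

Applied to the `prop` column for a single year and sex, the same function could be used to compare, say, the top 10 names of 1890 with the top 10 of 1990.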

Name Change

Were there any large changes in names over time? This is a somewhat more difficult question to answer, but let’s have a go at it. What we need to do is look at changes at a given lag, e.g. how do the names at time $t$ compare to the names at time $t+k$, where $k$ is a whole number.

Let’s take a look at the change with $k=1$, and use the mean squared difference between names. The mean squared difference is calculated as

$$\mathrm{MSE}_t = \sum_{i=1}^{N_t} \frac{\left(x_{i,t} - x_{i,t+1}\right)^2}{N_t},$$
where each proportion is identified as $x_{i,t}$, with $i \in \{1, \ldots, N_t\}$ indexing the set of names in year $t$.
If a name is observed in year $t$ but not in year $t+1$, its proportion in year $t+1$ is set to 0.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

result <- c()
k <- 1

# get the change in proportion
for(t in 1880:(2014-k)) {

  # get the names at t
  currentName <-babynames %>% filter( sex=='M' & year==t)

  # get the names at t+k
  currentNamePlusK <-babynames %>% filter(sex=='M' & year==t+k)

  #join the two years together by name
  check <- left_join(currentName, currentNamePlusK, by=c("name"="name"))
  check[is.na(check)] <- 0

  result <- c( result, mean((check$prop.x - check$prop.y)^2) )
}

result <- cbind(1880:(2014-k),result)
result <-as.data.frame(result)
names(result) <- c('year','change')

# plot the mean squared change in name proportions
# over time
p <- ggplot(data=result, aes(x=year,y=change)) +
  geom_line(size=1.5) +
  labs(
    title='Changes in Yearly Name Proportion',
    y="Mean of Squared Changes in Proportion",
    x="Year")
plot(p)
Figure 10: Name change over time.
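To make the mean squared difference concrete, here is a toy example small enough to check by hand. It is a base R sketch under my own assumptions, not the code used for Figure 10: `mean_sq_change` is a hypothetical helper, and the proportions are made up.

```r
# Hypothetical helper: mean squared change in proportion between year t
# and year t + k. Assumes columns `year`, `name`, and `prop`.
mean_sq_change <- function(df, t, k = 1) {
  now   <- df[df$year == t,     c('name', 'prop')]
  later <- df[df$year == t + k, c('name', 'prop')]

  # keep every name seen in year t; names missing in t + k become NA
  both <- merge(now, later, by = 'name', all.x = TRUE,
                suffixes = c('.now', '.later'))
  both$prop.later[is.na(both$prop.later)] <- 0

  mean((both$prop.now - both$prop.later)^2)
}

# Toy data with made-up proportions; 'Cy' disappears in 1881
toy <- data.frame(
  year = c(1880, 1880, 1880, 1881, 1881),
  name = c('Al', 'Bo', 'Cy', 'Al', 'Bo'),
  prop = c(0.5, 0.3, 0.2, 0.4, 0.3),
  stringsAsFactors = FALSE
)

mse <- mean_sq_change(toy, 1880)
print(mse)
```

By hand, the three squared differences are 0.1^2, 0^2, and 0.2^2, so the mean is 0.05 / 3. Note that only the target year's proportion is zero-filled here, whereas the loop above zero-fills every missing value in the joined table; for the squared differences this should come to the same thing.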

This plot is pretty interesting. We see some early changes around 1890 and 1900, followed by a spike around 1910; then, close to 1970, we have another spike. My best guess is that the 1970 change is associated with the black power movement that peaked in 1970, and that the earlier changes are likely due to immigration events.

If we do a little bit of Google sleuthing we can see in this plot that there seems to be a bit of correlation between these early events and immigration, but not the massive spike in immigration that occurs in 1990. Looking at Wikipedia we can see that around the turn of the century there was a large amount of movement from northern, southern, and eastern Europe to the United States.

However, the massive spike in 1990 doesn’t seem to be associated with a massive change in source countries, as seen in this plot. Of course this plot is of regions of birth, which I’ll assume is a better indicator of future baby names than country of emigration. Still, as mentioned earlier, the pre-1937 names are a bit incomplete, so the earlier changes in names might be an anomaly in the data, and I certainly don’t want to assume any degree of causal relationship.

From the island to the wall

Just for fun, let’s look at some popular TV and movie names over time for males and females.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# get the names
tvNames <-babynames %>%
  filter(name %in%
           c('Bilbo',
             'Aragorn',
             'Gandalf',
             'Boromir',
             'Faramir',
             'Legolas',
             'Frodo',
             'Gilligan',
             'Tyrion'
             ) & sex == 'M')

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=tvNames, aes(x=year,y=prop)) +
  geom_line(aes(color=name),size=1.5) +
  labs(
    title='TV and Movie Names, Male',
    y="Proportion of new SSN applications",
    x="Year")
plot(p)
Figure 11: TV and Movie names over time, Male.
library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# get the names
tvNames <-babynames %>%
  filter(name %in%
           c('Ginger',
             'Sansa',
             'Arya',
             'Daenerys',
             'Cersei',
             'Arwen'
             ) & sex == 'F')

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=tvNames, aes(x=year,y=prop)) +
  geom_line(aes(color=name),size=1.5) +
  labs(
    title='TV and Movie Names, Female',
    y="Proportion of new SSN applications",
    x="Year")
plot(p)
Figure 12: TV and Movie names over time, Female.

In the end

This was a lot of fun. Please let me know via Twitter or email if you would like to see any more analysis. If I find some free time I would like to build an interactive website for this data, but in the meantime please feel free to play around with the code I have provided.

library(ggplot2)
library(babynames)
library(dplyr)
library(magrittr)

# big plots, because I'm old and blind
theme_set(theme_gray(base_size = 18))

# get the names
theJNames <-babynames %>%
  filter(name %in% c('Fin') & sex == 'M')

# plot the proportion of new SSN applications 
# with the names over time
p <- ggplot(data=theJNames, aes(x=year,y=prop)) +
  geom_line(aes(color=name),size=1.5) +
  labs(
    title='Fin',
    y="Proportion of new SSN applications",
    x="Year")
plot(p)
Figure 13: Proportion of males named Fin over time.

As far as future work goes, I’ll probably take a look at the recent release of data on hard drive reliability from Backblaze.
