Which Gender is associated with this Name? R to the R-escue!

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers.]

When addressing somebody unknown to you who has an uncommon name, e.g. in an email, you might not know whether this person is male or female. In this post, we make it a fun little project to let R help us with that, so read on!

Of course, R cannot figure out the gender just by looking at the name; we need some data! A very impressive dataset can be found here: Gender by Name Data Set.

In this dataset, we find nearly one hundred fifty thousand instances of first/given names of male and female babies. The source datasets come from government authorities:

  • US: Baby Names from Social Security Card Applications – National Data, 1880 to 2019
  • UK: Baby names in England and Wales Statistical bulletins, 2011 to 2018
  • Canada: British Columbia 100 Years of Popular Baby names, 1918 to 2018
  • Australia: Popular Baby Names, Attorney-General’s Department, 1944 to 2019

NB: Because of the origin of the data, the categories here are strictly binary (male/female) and not gender-diverse.

We can now write a simple R function that looks up a name, tidies the output a little, and gives us percentage values in case the name is used for both genders:

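name_gender_data <- read.csv("data/name_gender_dataset.csv") # change path accordingly

name_gender <- function(name) {
  # keep only the rows for this name and the first three columns
  data <- name_gender_data[name_gender_data$Name == name, 1:3]
  # turn the third column into each gender's share of the name, in percent
  data <- cbind(data[1:2], round(data[3] / sum(data[3]), 3) * 100)
  colnames(data) <- c("Name", "Gender", "Percent")
  rownames(data) <- NULL
  data # return the formatted data frame
}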
I, of course, start by trying it on my own name 😉

name_gender("Holger")
##     Name Gender Percent
## 1 Holger      M     100

Now, how about a name not everybody might know the gender of, “Emre”:

name_gender("Emre")
##   Name Gender Percent
## 1 Emre      M     100

Same with “Elle”:

name_gender("Elle")
##   Name Gender Percent
## 1 Elle      F     100

How about names that are given to both genders, like “Charlie”:

name_gender("Charlie")
##      Name Gender Percent
## 1 Charlie      M    86.9
## 2 Charlie      F    13.1
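
For the email use case from the beginning, you might want a single answer instead of a table. The following is only a sketch: `likely_gender()` and its `threshold` argument are hypothetical helpers, not part of the function above, and the cut-off of 80 percent is an arbitrary assumption. It works on a data frame shaped like the output shown for "Charlie":

```r
# Hypothetical helper: return "M" or "F" if one gender reaches the threshold,
# otherwise NA (also NA when the table has no rows)
likely_gender <- function(shares, threshold = 80) {
  if (nrow(shares) == 0) return(NA_character_)
  top <- shares[which.max(shares$Percent), ]
  if (top$Percent >= threshold) as.character(top$Gender) else NA_character_
}

# Values copied from the "Charlie" output above
charlie <- data.frame(
  Name    = "Charlie",
  Gender  = c("M", "F"),
  Percent = c(86.9, 13.1)
)

likely_gender(charlie)                  # "M": 86.9 is above the default of 80
likely_gender(charlie, threshold = 90)  # NA: no gender reaches 90 percent
```

Returning `NA` for ambiguous names keeps the decision with you, for example to fall back on a neutral salutation.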

And, as the last example, what happens when the name is not included in the data:

## [1] Name    Gender  Percent
## <0 rows> (or 0-length row.names)
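
An empty data frame is easy to overlook, so you may prefer an explicit message. Here is a minimal sketch of that idea, again on made-up toy data. `lookup_name()` is a hypothetical wrapper, and the case-insensitive matching is my assumption about what would be convenient, not something the dataset requires:

```r
# Hypothetical toy data: names and percentages are invented for illustration
toy <- data.frame(
  Name    = c("Alex", "Alex", "Sam"),
  Gender  = c("M", "F", "M"),
  Percent = c(75, 25, 100)
)

# Hypothetical wrapper: ignore case and surrounding spaces,
# and say so explicitly when the name is not in the data
lookup_name <- function(name, df) {
  hits <- df[tolower(df$Name) == tolower(trimws(name)), ]
  if (nrow(hits) == 0) {
    message("'", name, "' was not found in the data")
    return(invisible(NULL))
  }
  rownames(hits) <- NULL
  hits
}

found   <- lookup_name("  alex ", toy)  # matches both "Alex" rows
missing <- lookup_name("Zzyzx", toy)    # prints a message, returns NULL
```

Returning `NULL` invisibly makes the missing case easy to check with `is.null()` in later code.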

I hope that you enjoyed this little project and that it will prove helpful. Do you have other ideas about what to do with this dataset? Leave them in the comments!

