Classify gender based on danish first names

[This article was first published on Renglish – 56north | Skræddersyet dataanalyse, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In Denmark we have official lists of what people are allowed to have as first names. That means there are lists of government approved boys names, girls names and unisex names. There is a total of 18.529 approved girls names, 15.052 boys names and 813 unisex names.

This means that we can write an R-package that can classify a name as either male, female, unisex or indeterminable. And I did just that. Allow me to introduce the “namesDK” package. It is available from github by running devtools::install_github(“56north/namesDK”).

After that you feed it a string of names. It uses the first name to classify the gender, so if you provide a full name (ie: Lars Løkke Rasmussen) then it will split the string and choose the first name (ie: Lars).

You can use the package if you have a lot of names, that you would like demographic variables attached to, such as gender. It could be names mined from social media, a customer list, etc.

In order to do this you simply call the “gender” function from the package. Here is a brief example of how it works:

library(namesDK)

gender(“Lars Løkke Rasmussen”)
#> [[1]]
#> [1] “male”

gender(c(“Helle Thorning Smidt”, “Lars Løkke Rasmussen”, “Traktor Troels”))
#> [[1]]
#> [1] “female”
#>
#> [[2]]
#> [1] “male”
#>
#> [[3]]
#> [1] NA

As you can see, the last string in the call above said “Traktor” as first name (the machine used in aggriculture) and therefore returns an NA, since Traktor is not an approved danish first name.

There you go. Sweet and simple. Enjoy.

If your country has the same sort of rules, maybe we should create a package that can classify gender based on first names across multiple languages. Let me know if you are interested ?

To leave a comment for the author, please follow the link and comment on their blog: Renglish – 56north | Skræddersyet dataanalyse.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)