Classify gender based on Danish first names

May 5, 2016

(This article was first published on Renglish – 56north | Skræddersyet dataanalyse, and kindly contributed to R-bloggers)

In Denmark we have official lists of the first names people are allowed to have. That means there are lists of government-approved boys' names, girls' names and unisex names. In total there are 18,529 approved girls' names, 15,052 boys' names and 813 unisex names.

This means that we can write an R package that classifies a name as male, female, unisex or indeterminable. And I did just that. Allow me to introduce the “namesDK” package. It is available from GitHub by running devtools::install_github("56north/namesDK").

After that you feed it one or more names as character strings. It uses the first name to classify the gender, so if you provide a full name (e.g. Lars Løkke Rasmussen), it will split the string and use the first name (here: Lars).
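To make the idea concrete, here is a minimal sketch of the "take the first name, then look it up" approach in base R. It is not the namesDK implementation: the helper classify_first_name is a hypothetical name, and the three tiny name vectors are illustrative stand-ins that were not copied from the official lists.

```r
# Toy stand-ins for the approved name lists (illustrative entries only;
# the real lists contain thousands of names).
girls  <- c("Helle", "Anne")
boys   <- c("Lars", "Troels")
unisex <- c("Kim", "Robin")

# Hypothetical helper, NOT the function shipped with namesDK.
classify_first_name <- function(full_names) {
  # Keep only the text before the first whitespace character.
  first <- sub("\\s.*$", "", trimws(full_names))

  # Start with NA for everything, then fill in the names we recognise.
  out <- rep(NA_character_, length(first))
  out[first %in% girls]  <- "female"
  out[first %in% boys]   <- "male"
  out[first %in% unisex] <- "unisex"
  out
}

classify_first_name(c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels"))
# returns c("female", "male", NA)
```

Note that "Traktor Troels" comes back as NA even though Troels is in the toy boys list, because only the first word of each string is looked up.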

The package is useful if you have a lot of names and would like to attach a demographic variable such as gender to them. The names could be mined from social media, come from a customer list, etc.

In order to do this you simply call the “gender” function from the package. Here is a brief example of how it works:

library(namesDK)

gender("Lars Løkke Rasmussen")
#> [[1]]
#> [1] "male"

gender(c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels"))
#> [[1]]
#> [1] "female"
#>
#> [[2]]
#> [1] "male"
#>
#> [[3]]
#> [1] NA

As you can see, the last string in the call above has “Traktor” (the machine used in agriculture) as its first name and therefore returns NA, since Traktor is not an approved Danish first name.
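Since the output above is a list with one element per name, you will probably want to flatten it before attaching it to your data. The sketch below shows one way to do that in base R. To keep it runnable without the package installed, the result list is typed in by hand from the output above rather than produced by calling gender(), and the customers data frame is a hypothetical example. It also assumes every list element has length one, as in the output shown.

```r
# Hypothetical customer list.
customers <- data.frame(
  name = c("Helle Thorning Smidt", "Lars Løkke Rasmussen", "Traktor Troels"),
  stringsAsFactors = FALSE
)

# Stand-in for gender(customers$name), copied by hand from the output above.
result <- list("female", "male", NA)

# Flatten the list into a character vector and store it as a new column.
# This assumes each list element holds exactly one value.
customers$gender <- unlist(result)

# Count how many names could not be classified.
n_unclassified <- sum(is.na(customers$gender))
```

With the real package you would replace the hand-typed list with the actual call, and the NA entries tell you which names need a manual look.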

There you go. Sweet and simple. Enjoy.

If your country has the same sort of rules, maybe we should create a package that can classify gender based on first names across multiple languages. Let me know if you are interested 🙂
