Finding Named Entities using R

February 11, 2014

(This article was first published on Stats and things, and kindly contributed to R-bloggers)

Occasionally, I’ll need to pick out names (first name, last name) from text. These days, the text I’m working with is usually tweets. Anyhow, I didn’t see any solution out there (that worked for me) when I developed this, so hopefully it can be a starting point for somebody else with similar needs…

First, I start out with a list of names from the Census Bureau. I downloaded male first names, female first names, and last names and saved them as variables in R. I do take out some of the names as “exceptions” that screw up my process here, because they also show up as ordinary words in text. Names like “In”, “An”, “Chi”, “So”, and so on.
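As a rough sketch of that setup step (not the original code), here is one way to hold the name lists and drop the exceptions. The short vectors below are made-up stand-ins for the real census files, and the upper-case storage is an assumption on my part.

```r
# Tiny stand-in name lists (hypothetical); the real lists would be read in
# from the downloaded census files. Names are assumed to be upper case.
first_names <- c("JOHN", "MARY", "AN", "SO", "TONI")
last_names  <- c("SMITH", "IN", "CHI", "PRECKWINKLE")

# Names that double as ordinary words and cause false matches.
exceptions <- c("IN", "AN", "CHI", "SO")

# setdiff() keeps everything in the first vector that is not in the second.
first_names <- setdiff(first_names, exceptions)
last_names  <- setdiff(last_names, exceptions)
```

The exception list is something you would likely grow by hand as false matches turn up in your own text.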

Then, I split my target text up into bigrams, that is, adjacent pairs of words in the original text…
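A minimal sketch of the bigram step in base R might look like the following. The function name `make_bigrams` and the sample tweet are mine, for illustration only, and this version splits on whitespace without stripping punctuation.

```r
# Split a string into adjacent word pairs (bigrams).
# make_bigrams is a hypothetical helper name, not from the original post.
make_bigrams <- function(text) {
  # Split on runs of whitespace and drop any empty tokens.
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]

  # Fewer than two words means there are no bigrams to return.
  if (length(words) < 2) {
    return(data.frame(first = character(0), second = character(0),
                      stringsAsFactors = FALSE))
  }

  # Pair each word with the word that follows it.
  data.frame(first  = head(words, -1),
             second = tail(words, -1),
             stringsAsFactors = FALSE)
}

bigrams <- make_bigrams("Board President Toni Preckwinkle spoke today")
```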

This returns every pair of adjacent words in the tweet. From here, I can look through each of these bigrams for names. To make the search for names a little easier, I throw out any bigram in which either word doesn’t start with a capital letter.
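The capital-letter filter could be sketched like this, assuming the bigrams sit in a data frame with one column per word (the column names `first` and `second` are my own choice). I use the `[[:upper:]]` character class rather than `[A-Z]`, since letter ranges can behave differently depending on locale.

```r
# Example bigrams (hypothetical data) with one column per word.
bigrams <- data.frame(
  first  = c("Board", "President", "Toni", "Preckwinkle", "spoke"),
  second = c("President", "Toni", "Preckwinkle", "spoke", "today"),
  stringsAsFactors = FALSE
)

# TRUE when a word starts with an upper-case letter.
is_capitalized <- function(x) grepl("^[[:upper:]]", x)

# Keep only the bigrams where both words are capitalized.
keep <- is_capitalized(bigrams$first) & is_capitalized(bigrams$second)
candidates <- bigrams[keep, ]
```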

Now that I have a list of bigrams in which both words start with capital letters, I can compare the words to the name list to see if they are names. I start with the last name. If the second word in the bigram doesn’t appear in the last name list, we can stop… there’s no need to check the first name. If the second word is a last name, then we check the first word against the first names list. If both of these check out, we have ourselves a name. Here’s the code for that…
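For readers who don’t want to click through, here is a small sketch of that lookup logic under my own assumptions; it is not the code from the repository linked below. The function name `find_names`, the column names, and the toy name lists are all hypothetical, and the name lists are assumed to be upper case.

```r
# Check each candidate bigram: last name first, then first name.
# find_names is a hypothetical helper, not the author's original function.
find_names <- function(candidates, first_names, last_names) {
  found <- character(0)
  for (i in seq_len(nrow(candidates))) {
    first  <- toupper(candidates$first[i])
    second <- toupper(candidates$second[i])

    # If the second word is not a last name, stop here for this bigram.
    if (!(second %in% last_names)) next

    # Only now check the first word against the first-name list.
    if (first %in% first_names) {
      found <- c(found, paste(candidates$first[i], candidates$second[i]))
    }
  }
  found
}

# Toy inputs (made up for illustration).
candidates <- data.frame(
  first  = c("Board", "President", "Toni"),
  second = c("President", "Toni", "Preckwinkle"),
  stringsAsFactors = FALSE
)
first_names <- c("JOHN", "MARY", "TONI")
last_names  <- c("SMITH", "PRECKWINKLE")

name_hits <- find_names(candidates, first_names, last_names)
```

Checking the last name first means most bigrams are rejected after a single lookup, which is the shortcut described above.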

The full code for this can be found here… https://github.com/corynissen/cook-county-tweet-dashboard/blob/master/cctweets/findNames.R

I have tried the openNLP package for this and couldn’t get it to work reliably and quickly, so I made my own. If you have any suggestions on how to do this better, let me know!

Follow me on twitter… https://twitter.com/corynissen
