Automatic Detection of the Language of a Tweet

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Two days ago, in my post on automatically extracting my own tweets and generating an HTML list, I mentioned that it would be great to have a function that could distinguish tweets in English from tweets in French (I usually tweet in one of those two languages). And once again, @3wen came to my rescue! In my previous post, I spent some time extracting mentions of other Twitter accounts, and URLs. This time, I have to remove those two pieces of information, since they are not relevant for detecting the language of a tweet.

> text_only <- function(x){
+ split_x <- strsplit(x,"@")[[1]]
+ x_id <- paste(split_x,collapse="",sep="")
+ split_x_id <- strsplit(x_id,"http")
+ n <- length(split_x_id[[1]])
+ tweet_x <- strsplit(split_x_id[[1]]," ")
+ if(n==1) rt <- x_id
+ if(n>1){
+ for(i in 2:n){
+ tweet_x[[i]] <- tweet_x[[i]][-1]
+ }}
+ tweet_x_text <- paste(unlist(tweet_x),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text," rt")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text," ht")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", rt")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", ht")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", via")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", poke")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",rt")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",ht")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",via")),collapse=" ")
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",poke")),collapse=" ")
+ # assumed intent: split on literal double quotes
+ tweet_x_text <- paste(unlist(strsplit(tweet_x_text,"\"")),collapse=" ")
+ return(tweet_x_text)
+ }

The function is simple. The one subtlety is that I wanted to exclude typical Twitter tokens, such as "RT", "HT", "via", or "poke", which can appear in a tweet in any language. To get the text of my own tweets, we can use

> tweets_freak_text <- lapply(tweets_freak, text_only)
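The same cleaning idea can also be sketched with regular expressions. This is only an illustration, not the function used in this post: the name text_only_regex is hypothetical, and unlike text_only it drops the whole @mention (handle included) and matches "RT", "HT", "via" and "poke" regardless of case.

```r
# Hypothetical helper (not from the original post): strip URLs, @mentions and
# Twitter-specific tokens with base-R regular expressions.
text_only_regex <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)   # drop URLs
  x <- gsub("@\\w+", " ", x, perl = TRUE)           # drop @mentions, handle included
  x <- gsub("\\b(rt|ht|via|poke)\\b", " ", x,
            ignore.case = TRUE, perl = TRUE)        # drop language-neutral tokens
  x <- gsub("\\s+", " ", x, perl = TRUE)            # collapse runs of whitespace
  trimws(x)                                         # remove leading/trailing blanks
}

# Made-up tweet, for illustration only
text_only_regex("RT @someone Life Expectancy improved http://example.com via @blog")
```

Because gsub is vectorised, such a helper could be applied to a whole character vector of tweets at once, without lapply.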

Now, to get the language of a tweet, the function I was looking for is the following,

> library("textcat")
> library("rvest")
> library("stringr")

> text_tweets <- str_trim(unlist(tweets_freak_text))
> lang <- textcat(text_tweets)

For instance

> text_tweets[1]
[1] "De la surveillance de masse à la paranoïa généralisée  par passionant !"
> textcat(text_tweets[1])
[1] "french"


> text_tweets[6]
[1] "Life Expectancy improved in every country of the world  on blog"
> textcat(text_tweets[6])
[1] "english"

All that makes sense. Of course, sometimes, the prediction is not the one I expected

> text_tweets[2]
[1] "You're more likely to die on your birthday   see also"
> textcat(text_tweets[2])
[1] "afrikaans"

(note that detected English, here). Consider now the following code, which builds a data frame with the detected language, as well as a flag indicating whether the language is neither English nor French,

> base_tweets <- data.frame(text=substr(text_tweets,1,25),
+ lang=textcat(text_tweets),
+ check=1-textcat(text_tweets) %in% c("english","french"))
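The check column relies on operator precedence: in R, %in% binds more tightly than the binary minus, so the expression is read as 1 - (lang %in% ...). A minimal sketch with made-up labels (the vector lang below is hypothetical and only stands in for the output of textcat) shows the resulting flag, and how the share of tweets to check by hand can be computed from it:

```r
# Made-up language labels standing in for the output of textcat();
# NA mimics a tweet that could not be classified.
lang <- c("french", "afrikaans", "english", NA, "slovak-ascii")

# %in% binds tighter than binary minus, so this is 1 - (lang %in% ...).
# NA %in% ... is FALSE, so unclassified tweets are flagged as well.
check <- 1 - lang %in% c("english", "french")
check

# Share of tweets that would need a manual check
mean(check)
```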

> head(base_tweets,20)
                        text         lang check
1  De la surveillance de mas       french     0
2  You're more likely to die    afrikaans     1
3  Universités: guerre ouver       french     0
4  [free ebook]  Statistique       french     0
5  (nice viz, even if I bel       english     0
6  Life Expectancy improved       english     0
7  Gigabyte gourmet: AI robo      english     0
8  interview with Damon Murr      english     0
9  Roundup and Gluten Intol       english     0
10 Measuring Inequality  by       english     0
11 Pour une révolution fisca       french     0
12 How statisticians changed      english     0
13 The Death and Life of the      english     0
14 “Le bénéficiaire de la lé       french     0
15 Traffic Accidents Are Mor      english     0
16 Flight Reliability When F      english     0
17 How good are out-of-sampl      english     0
18 Entrepreneurship, down-si      english     0
19                true story slovak-ascii     1
20 An NSA Big Graph experime         <NA>     1

That's not (so) bad... considering that a tweet has at most 140 characters, i.e. roughly 20 words, from which to recognize a language. I will have to check fewer than 15% of the tweets. Now, I can use the following lines to generate a text file with the HTML code I can use in my posts, with tweets in English first, then French, and finally those that were in neither English nor French,

> list_tweets_start <- unlist(tweets_freak_sub)
> list_tweets <- c(list_tweets_start[base_tweets$lang %in% "english"],
+ list_tweets_start[base_tweets$lang %in% "french"],
+ list_tweets_start[base_tweets$check==1])
> write.table(list_tweets,file="tweets_somewhere_else.txt",quote=FALSE,row.names=FALSE)
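A caveat with this kind of subsetting: the language can be NA for some tweets (see row 20 in the table above), and a comparison with == then returns NA, which inserts an NA entry into the result. Matching with %in% avoids that. The sketch below uses made-up tweets and labels (both vectors are hypothetical) to show the difference and the English, French, other ordering:

```r
# Hypothetical tweets and labels; NA mimics a tweet that was not classified
tweets <- c("tweet A", "tweet B", "tweet C", "tweet D")
lang   <- c("english", NA, "french", "afrikaans")

# == propagates NA, so an NA entry sneaks into the selection
with_eq <- tweets[lang == "english"]

# %in% returns FALSE for NA, so only the English tweet is kept
with_in <- tweets[lang %in% "english"]

# English first, then French, then everything else (NA included)
ordered <- c(tweets[lang %in% "english"],
             tweets[lang %in% "french"],
             tweets[!lang %in% c("english", "french")])
ordered
```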

For instance, the first tweets in English are

That's a good start... Then I have to include pictures, graphs, and maps, but at least, that's easy to spot here (from the url).
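A possible refinement, not tested on the tweets above: since I only tweet in English or in French, the set of candidate languages could be restricted to those two. The sketch below assumes that the default character profiles shipped with textcat (TC_char_profiles) can be subset by name and passed through the p argument; the object name en_fr_profiles is hypothetical.

```r
library(textcat)

# Keep only the two candidate languages from the default character profiles
en_fr_profiles <- TC_char_profiles[names(TC_char_profiles) %in% c("english", "french")]

# The tweet that was labelled "afrikaans" earlier in this post
tweet <- "You're more likely to die on your birthday   see also"

# With only two profiles available, the answer can only be one of them (or NA)
res <- textcat(tweet, p = en_fr_profiles)
res
```

The trade-off is that a tweet written in a third language would then be forced into one of the two categories instead of being flagged for a manual check.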

