Automatic Detection of the Language of a Tweet
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Two days ago, in my post on automatically extracting my own tweets and generating an HTML list from them, I mentioned that it would be great to have a function that could distinguish tweets in English from tweets in French (I usually tweet in one of those two languages). Once again, @3wen came to my rescue! In my previous post, I spent some time extracting mentions of other Twitter accounts, and URLs. This time, I have to remove those two pieces of information, since they are not relevant for detecting the language of a tweet.
> text_only <- function(x){
+   # turn each "@" into a URL prefix, so that mentions are handled like links
+   split_x <- strsplit(x,"@")[[1]]
+   x_id <- paste(split_x,collapse="http://twitter.com/",sep="")
+   # split on "http": every piece after the first one starts with the rest of a URL
+   split_x_id <- strsplit(x_id,"http")
+   n <- length(split_x_id[[1]])
+   tweet_x <- strsplit(split_x_id[[1]]," ")
+
+   if(n==1) rt <- x_id
+   if(n>1){
+     for(i in 2:n){
+       # drop the first "word" of the piece, i.e. the remainder of the URL
+       tweet_x[[i]] <- tweet_x[[i]][-1]
+     }}
+   tweet_x_text <- paste(unlist(tweet_x),collapse=" ")
+   # remove typical tweet words that carry no information about the language
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text," rt")),collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text," ht")),collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", rt")),collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", ht")),collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", via")),collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,", poke")),collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",rt")), collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",ht")), collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",via")), collapse=" ")
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,",poke")), collapse=" ")
+   # remove double quotes
+   tweet_x_text <- paste(unlist(strsplit(tweet_x_text,"\"")), collapse=" ")
+   return(tweet_x_text)
+ }
The function is very simple. The only point here is that I wanted to exclude typical tweet words, such as "RT", "HT", or words such as "via", or "poke", that can be used in any language. To get the text of my own tweets, we can use
> tweets_freak_text <- lapply(tweets_freak, text_only)
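The same cleaning step can also be sketched with regular expressions. This is only an illustration, not the function used in this post: the name `clean_tweet` is hypothetical, and the patterns assume that URLs start with `http` and that mentions are made of letters, digits and underscores.

```r
# Hypothetical helper (not from the original post): strip URLs, mentions and
# language-neutral tweet words with regular expressions (base R only).
clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)   # drop URLs first
  x <- gsub("@\\w+", " ", x, perl = TRUE)           # drop @mentions
  x <- gsub("\\b(rt|ht|via|poke)\\b", " ", x,
            ignore.case = TRUE, perl = TRUE)        # drop "RT", "HT", "via", "poke"
  x <- gsub("\"", " ", x, fixed = TRUE)             # drop double quotes
  x <- gsub("\\s+", " ", x, perl = TRUE)            # collapse repeated whitespace
  trimws(x)
}

clean_tweet("Gigabyte gourmet http://rt.com/news/ ht @aussietorres")
# [1] "Gigabyte gourmet"
```

Since `gsub` is vectorized, this version can be applied directly to a character vector of tweets, without `lapply`.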
Now, to get the language of a tweet, the function I was looking for is the following,
> library("textcat")
> library("rvest")
> library("stringr")
> text_tweets <- str_trim(unlist(tweets_freak_text))
> lang <- textcat(text_tweets)
For instance
> text_tweets[1]
[1] "De la surveillance de masse à la paranoïa généralisée par passionant !"
> textcat(text_tweets[1])
[1] "french"
or
> text_tweets[6]
[1] "Life Expectancy improved in every country of the world on blog"
> textcat(text_tweets[6])
[1] "english"
All that makes sense. Of course, sometimes, the prediction is not the one I expected
> text_tweets[2]
[1] "You're more likely to die on your birthday see also"
> textcat(text_tweets[2])
[1] "afrikaans"
(note that https://translate.google.com/ detected English here). Now consider the following code, which builds a data frame with the language of each tweet, as well as a flag set to 1 when the language is neither English nor French,
> base_tweets <- data.frame(text=substr(text_tweets,1,25),
+                           lang=textcat(text_tweets),
+                           check=1-textcat(text_tweets) %in% c("english","french"))
> head(base_tweets,20)
   text                      lang         check
1  De la surveillance de mas french       0
2  You're more likely to die afrikaans    1
3  Universités: guerre ouver french       0
4  [free ebook] Statistique  french       0
5  (nice viz, even if I bel  english      0
6  Life Expectancy improved  english      0
7  Gigabyte gourmet: AI robo english      0
8  interview with Damon Murr english      0
9  Roundup and Gluten Intol  english      0
10 Measuring Inequality by   english      0
11 Pour une révolution fisca french       0
12 How statisticians changed english      0
13 The Death and Life of the english      0
14 “Le bénéficiaire de la lé french       0
15 Traffic Accidents Are Mor english      0
16 Flight Reliability When F english      0
17 How good are out-of-sampl english      0
18 Entrepreneurship, down-si english      0
19 true story                slovak-ascii 1
20 An NSA Big Graph experime <NA>         1
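The share of tweets that need a manual check can be read off the `check` column. As a small sketch, using only the 20 rows printed above (the `lang` vector is copied by hand from that output, so the figure says nothing about the full set of tweets):

```r
# Languages of the 20 tweets shown above, copied from the printed data frame.
lang <- c("french", "afrikaans", "french", "french", "english", "english",
          "english", "english", "english", "english", "french", "english",
          "english", "french", "english", "english", "english", "english",
          "slovak-ascii", NA)

# Same flag as in base_tweets: 1 when the language is neither English nor French.
# NA %in% c(...) is FALSE, so undetected languages are flagged as well.
check <- 1 - lang %in% c("english", "french")

table(lang, useNA = "ifany")  # counts per detected language, NA included
mean(check)                   # 3 flagged tweets out of 20
# [1] 0.15
```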
That's not (so) bad, if we consider that we have at most 140 characters, i.e. fewer than 20 words or so, to recognize a language. I will have to check fewer than 15% of the tweets by hand. Now, I can use the following lines to generate a text file with the HTML code I can use in my posts, with tweets in English first, then French, and finally those that were neither in English nor in French
> list_tweets_start <- unlist(tweets_freak_sub)
> # which() skips tweets whose language is NA, instead of returning NA entries
> list_tweets <- c(list_tweets_start[which(base_tweets$lang=="english")],
+                  list_tweets_start[which(base_tweets$lang=="french")],
+                  list_tweets_start[base_tweets$check==1])
> write.table(list_tweets,file="tweets_somewhere_else.txt",quote=FALSE,row.names=FALSE)
For instance, the first tweets in English are
- (nice viz', even if I believe that the title is maybe not appropriate)
- "Life Expectancy improved in every country of the world" http://maxroser.com/everyone-is-better-off-life-expectancy-increased/ on @MaxCRoser's blog http://twitter.com/freakonometrics/status/551981550070693888/photo/1
- "Gigabyte gourmet: AI robot learns to cook just by watching YouTube videos" http://rt.com/news/219687-robot-learns-watching-video/ ht @aussietorres
- interview with Damon Murray, http://www.vice.com/read/russian-criminal-tattoo-fuel-damon-murray-interview-876 who published the Russian Criminal Tattoo Encyclopaedia series, ht @fdastous
That's a good start... Then I have to include pictures, graphs, and maps, but at least, that's easy to spot here (from the url).
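Since I only expect two languages, one possible refinement would be to restrict the candidate profiles passed to `textcat`, so that a tweet cannot be classified as Afrikaans or Slovak. The following is only a sketch that I have not run on the full set of tweets: it assumes that the textcat package is installed and that its `TC_char_profiles` database contains profiles named "english" and "french".

```r
# Sketch only: runs when the textcat package is available.
if (requireNamespace("textcat", quietly = TRUE)) {
  profiles <- textcat::TC_char_profiles
  # Keep only the two languages the tweets are expected to be written in
  # (assumption: the profile names are "english" and "french").
  two_langs <- profiles[names(profiles) %in% c("english", "french")]

  tweet <- "You're more likely to die on your birthday see also"
  lang2 <- textcat::textcat(tweet, p = two_langs)
  print(lang2)
}
```

The price to pay is that a tweet written in a third language would then be forced into one of the two categories, so the `check` flag would no longer catch it.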