Topic Modeling in R
Topic modeling is a statistical approach for discovering abstract "topics" in a collection of text documents based on the statistics of each word. In simple terms, it is the process of looking through a large collection of documents, identifying clusters of words, grouping them by similarity, and spotting the patterns in which those clusters appear. For example, consider the following five documents:
- I love playing cricket.
- Sachin is my favorite cricketer.
- Titanic is a heart-touching movie.
- Data Analytics is the future of IT.
- Data Analytics & Big Data complement each other.
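A topic model should group the first two documents together (cricket), the third on its own (movies), and the last two together (data analytics). As a minimal sketch of that idea (the choice of k = 3 and the seed are assumptions for illustration, not part of the original analysis):

library(tm)
library(topicmodels)
docs <- c("I love playing cricket.",
          "Sachin is my favorite cricketer.",
          "Titanic is a heart-touching movie.",
          "Data Analytics is the future of IT.",
          "Data Analytics & Big Data complement each other.")
corpus <- Corpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus)
lda <- LDA(dtm, k = 3, control = list(seed = 1234)) # 3 assumed topics
topics(lda)   # most likely topic for each document
terms(lda, 3) # top 3 terms per topic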
Topic Modeling can be achieved using the Latent Dirichlet Allocation (LDA) algorithm. Without going into the nuts & bolts of the algorithm, LDA learns to assign a probability to each word in the corpus and uses those probabilities to group the words into topics. A simple explanation of LDA can be found here. The analysis follows these steps:
1. Fetch tweets using the 'twitteR' package (a sketch of this step follows the list).
2. Load the data into the R environment.
3. Clean the data: remove re-tweet information, links, special characters, emoticons, and frequent stop words such as "is", "as", "this", etc.
4. Create a Term Document Matrix (TDM) using the 'tm' package.
5. Calculate TF-IDF (Term Frequency - Inverse Document Frequency) for every word in the matrix created in Step 4.
6. Exclude all words with tf-idf <= 0.1, dropping uninformative terms (terms that appear in almost every document score close to zero).
7. Find the optimal number of topics (K) using the log-likelihood method on the TDM from Step 6.
8. Apply LDA using the 'topicmodels' package to discover the topics.
9. Evaluate the model.
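The source code below starts from a saved tweets.txt file, so Step 1 is not shown there. A minimal, hypothetical sketch of fetching tweets with 'twitteR' might look like this (the airline handle, the tweet count, and the API credentials are placeholders):

library(twitteR)
# Placeholder credentials - replace with your own Twitter API keys
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
raw <- searchTwitter("@united", n = 1000, lang = "en") # hypothetical airline handle
writeLines(twListToDF(raw)$text, "tweets.txt") # save the tweet text for the analysis below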
Conclusion:
Topic modeling using LDA is a very good method for discovering the topics underlying a collection of documents, but the analysis gives good results only if we have a large corpus. In the above analysis, using tweets from the top 5 airlines, I found that one of the topics people talk about is the FOOD being served. We can then apply Sentiment Analysis techniques to mine what people think about these topics, products, and companies.
Source Code:
library("tm")
library("wordcloud")
library("slam")
library("topicmodels")
#Load Text
con <- file("tweets.txt", "rt")
tweets <- readLines(con)
close(con)
#Clean Text
tweets <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweets) # remove re-tweet markers
tweets <- gsub("http[^[:blank:]]+", "", tweets)           # remove links
tweets <- gsub("@\\w+", "", tweets)                       # remove @mentions
tweets <- gsub("[ \t]{2,}", " ", tweets)                  # collapse runs of spaces/tabs
tweets <- gsub("^\\s+|\\s+$", "", tweets)                 # trim leading/trailing whitespace
tweets <- gsub("\\d+", "", tweets)                        # remove numbers
tweets <- gsub("[[:punct:]]", " ", tweets)                # replace punctuation with spaces
corpus = Corpus(VectorSource(tweets))
corpus = tm_map(corpus,removePunctuation)
corpus = tm_map(corpus,stripWhitespace)
corpus = tm_map(corpus, content_transformer(tolower)) # wrap base tolower for newer versions of tm
corpus = tm_map(corpus, removeWords, stopwords("english"))
tdm = DocumentTermMatrix(corpus) # create a document-term matrix
# compute the mean tf-idf score for each term
term_tfidf <- tapply(tdm$v/row_sums(tdm)[tdm$i], tdm$j, mean) * log2(nDocs(tdm)/col_sums(tdm > 0))
summary(term_tfidf)
tdm <- tdm[, term_tfidf >= 0.1] # keep only informative terms
tdm <- tdm[row_sums(tdm) > 0, ] # drop documents left empty after filtering
summary(col_sums(tdm))
# Decide the best value of K using the log-likelihood method
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(tdm, d)}) # fits 49 models, which can be slow
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
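# Assumed continuation (not in the original post): pick the K with the
# highest log-likelihood and plot the curve
ll <- sapply(best.model, logLik)
best.k <- seq(2, 50)[which.max(ll)]
plot(seq(2, 50), ll, type = "l", xlab = "Number of topics K", ylab = "Log-likelihood")
best.k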
# Fit the LDA (and CTM) models
k = 50 # number of topics
SEED = 786 # random seed for reproducibility (not the number of tweets)
CSC_TM <- list(
  VEM = LDA(tdm, k = k, control = list(seed = SEED)),
  VEM_fixed = LDA(tdm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs = LDA(tdm, k = k, method = "Gibbs",
              control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
  CTM = CTM(tdm, k = k,
            control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
)
# To compare the fitted models, first inspect the alpha values of the two VEM
# models (alpha estimated vs. alpha fixed)
sapply(CSC_TM[1:2], slot, "alpha")
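# Mean entropy of each model's posterior topic distributions; higher values
# mean the topic assignments are spread more evenly over the topics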
sapply(CSC_TM, function(x) mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))
Topic <- topics(CSC_TM[["VEM"]], 1) # most likely topic for each document
Terms <- terms(CSC_TM[["VEM"]], 8)  # top 8 terms for each topic
Terms