Time Is on My Side – A Small Example for Text Analytics on a Stream

June 23, 2013
By

(This article was first published on Data * Science + R, and kindly contributed to R-bloggers)

Introduction and Background

While my last posting was about recommendation in the context of Location Based Social Networks there are also other interesting topics regarding the analysis of unstructured data. The most established one is probably Text Analytics/Mining focusing on all sorts of text data.For me, coming from spatial analysis, these topic is relatively new but I couldn’t help noticing that more and more easy-to-use packages for analyzing various kinds of text sources entered the R ecosystem and a lot of postings and tutorials about this topic were created. Time for me to carry out some experiments - this post is about the outcomes. A last short notice: You will read about a lot of different techniques you might not directly relate to Text Analytics like Shiny, rCharts and Data Stream Mining. That’s true. But all of them can be connected with Text Analytics and I will show a few examples for that. Mostly I use these blog posts as a playground for exploring how different new techniques could be nicely fit together and how worthwhile they could be for my daily work. This posting is a good example for that and I hope you can benefit at least a little bit from it.

Social media services like Twitter are clearly a rich source for analyzing texts and also show some clear challenges for simple methods (slang, spelling, …). Imagine you want to explore a special topic that people discuss on twitter. Examining terms used by the users could give you a first impression of the development of the discussion and also serve as a starting point for exploring different sub-themes or connected topics of the overall topic. A simple approach might look like this: One could extract all the terms/words out of some collected tweets and use them for exploratory data analysis. A potential drawback of such an approach might be that the size of the data is too large (for processing, computing or saving) especially when the topic is very popular and there are thousands or millions of corresponding tweets. To overcome this problem you could think about working only with a sample of terms. This should give you approximately the same information than working with all discovered terms/words and needs less resources. But taking a proper sample (each term has the same probability to be selected) from an ongoing discussion (a stream of terms) is not trivial. You could wait until the discussion is over and then sample from the collected tweets but this gives you no actionable knowledge (because the discussion is over) and you have to store all the tweets or terms (size!). The second approach could be to save all the extracted terms (size!) and take a new sample from all the saved terms each time a new word “arrives”. This is neither efficient nor feasible depending on the size of the data (imagine the discussion on twitter lasts for weeks or months). I will show a well established sampling method to overcome these problems but at the moment let me state the concrete objective of my experiment discussed so far:

Constantly crawl all tweets regarding a special topic while managing a proper sample of the terms (I don’t want to save all the extracted words). In addition create some prototypical tool to explore the discussion based on the sampled words.

Reservoir Sampling

So, first we have to think about the sampling method, that should work on a data stream of unknown length. Reservoir Sampling is such a method which is very simple. First define the sample size \(R\) (~ size of the reservoir) and fill it with the first items (~terms/words) from the stream. After every place in the reservoir is occupied for every new item on the stream the following steps have to be performed:

  1. Take a random number from the uniform distribution between 0 and 1 and compare it with \(R/Iter\) where \(Iter\) is the index of the current item.
  2. If the random number is smaller replace a randomly chosen term from the reservoir with the current item.

The statistics behind ensures it that at iteration K every term from the stream seen so far had the same probability to be part of the sample. Some more detailed explanation could be found here and here. The code for Reservoir sampling is shown below.

rsize = 1000
iTerm = 0
wordReservoir = character(rsize)
...

termVec <- ...  # vector with words

for (i in 1:length(termVec)) {
    if (iTerm <= rsize) {
        # still initializing reservoir
        wordReservoir[iTerm] <- termVec[i]
        iTerm <- iTerm + 1
    } else {
        pI <- runif(1)
        if (pI < (rsize/iTerm)) {
            # keep
            itemRemove <- sample(1:rsize, size = 1)
            wordReservoir[itemRemove] <- termVec[i]
        }
        iTerm <- iTerm + 1
    }
}

A Small Program for Crawling and Processing Tweets

The second step was to implement the Crawler for the tweet stream. Because some time ago I read a blog post about a new R package for working with streams I decided to follow the proposed framework and implemented my own stream data source for working with twitter. Therefore I had to create my own “DSD” class (Data Stream Data) that only needs a constructor function and a get_points() method. Although it is not really necessary to use the framework for my simple example I think that the small amount of extra time for implementation was worthwhile because in future we could now also use the algorithms offered by the stream package together with a twitter stream. The code for the “ DSD_Twitter” class is shown below (Note: You will need an account on Twitter and proper credentials to get it running - see also this new post):

# --------------------------------------------- 
# Content of file DSD_Twitter.R 
# ---------------------------------------------
DSD_Twitter <- function(searchTerm, lang = NULL,
                        streamMinLength = 20,
                        pathToCredentials = "twitterCredentials.RData") {
    require(twitteR)
    require(plyr)
    require(ROAuth)

    # Initialize OAuth
    options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", 
        "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
    load(pathToCredentials)
    registerTwitterOAuth(cred)

    # streamMinLength ensures us that we could read at least streamMinLength
    # Items from the stream .recentID stores the latest tweet ID from the
    # stream - so we will not read in the same tweet multiple times
    .recentID <<- min(twListToDF(searchTwitter(searchString = searchTerm, 
                                               n = streamMinLength))$id)

    l <- list(description = "Twitter Data Stream", searchTerm = searchTerm, 
        lang = lang)
    class(l) <- c("DSD_Twitter", "DSD_R", "DSD")
    l
}

get_points.DSD_Twitter <- function(x, n = 1, ...) {
    if (is.null(x$lang)) {
        tweets <- searchTwitter(searchString = x$searchTerm, n = n, sinceID = .recentID)
    } else {
        tweets <- searchTwitter(searchString = x$searchTerm, lang = x$lang, 
            n = n, sinceID = .recentID)
    }

    tweetsDF <- twListToDF(tweets)

    .recentID <<- max(tweetsDF$id)

    tweetsDF
}

The crawler itself then only is a simple endless loop which grasps the latest tweets every certain interval. What certainly is of more interest is how the tweets will be processed in every iteration. After getting the next tweets from the datastream all words/terms are extracted from the tweet messages, processed and transformed to their base form. These are typical steps in Text Mining and I will explain them a bit later on. Then Reservoir Sampling is used to update the sample. In the last two steps a snapshot of the current reservoir is saved for further offline processing and a simple wordcloud is created that shows the frequency distribution over the current sample. In this posting the wordcloud is only a placeholder for some more advanced analytical method working on the sample. The code for all this is as follows:

# --------------------------------------------- 
# Content of file Twitter_Stream_Mining.R 
# ---------------------------------------------
require(stream)
require(tm)
require(wordcloud)

source("DSD_Twitter.R")  # Data Stream Data class
source("TextMining_Functions.R")  # Functions used for text processing

rsize = 1000  # size of the reservoir
wordReservoir = character(rsize)
minToWait = 10
nIter <- 20

# sTerm is the search term and need to be specified before
dir.create(sTerm)
setwd(sTerm)

# initialize stream
mydatastream <- DSD_Twitter(searchTerm = paste("#", sTerm, sep = ""), 
                            streamMinLength = 50, 
                            lang = "de", 
                            pathToCredentials = "../twitterCredentials.RData")

iTerm = 0
while (TRUE) {
    res <- try(get_points(mydatastream, n = nIter), silent = TRUE)  # get next tweets
    if (!inherits(res, "try-error")) {
        # create vector of terms
        tdMatrix <- getTermDocumentMatrixFromTweets(tweetTextVector = res$text, 
            wordsToRemove = sTerm)
        termVec <- getTermVectorFromTermDocumentMatrix(tdMatrix)
        cat("+ ", length(termVec), " new terms (", format(Sys.time(), 
            "%H:%M:%S"), ")\n")

        # update sample
        for (i in 1:length(termVec)) {
            if (iTerm <= rsize) {
                # still initializing reservoir
                wordReservoir[iTerm] <- termVec[i]
                iTerm <- iTerm + 1
            } else {
                pI <- runif(1)
                if (pI < (rsize/iTerm)) {
                  # keep
                  itemRemove <- sample(1:rsize, size = 1)
                  wordReservoir[itemRemove] <- termVec[i]
                }
                iTerm <- iTerm + 1
            }
        }

        # a simple wordcloud as placeholder for more advanced methods
        png(paste("WordCloud_", format(Sys.time(), "%Y-%m-%d_%H_%M_%S"), ".png", 
            sep = ""), width = 12, height = 8, units = "in", res = 300)
        tM <- table(wordReservoir[wordReservoir != ""])
        wordcloud(names(tM), tM, random.order = FALSE, colors = 
            brewer.pal(8, "Dark2"), min.freq = 1)
        dev.off()

        ## save snapshot
        write.table(x = wordReservoir, file = paste("TwSrMin_Terms_Reservoir_", 
            format(Sys.time(), "%Y-%m-%d_%H_%M_%S"), ".txt", sep = ""), append = FALSE, 
            row.names = FALSE, col.names = FALSE)
    } else {
        cat("+ There are no new tweets available (", format(Sys.time(), 
            "%H:%M:%S"), ")\n")
    }

    Sys.sleep(60 * minToWait)
}

Some words about the text processing. I tried to keep it very simple and followed the instructions from the tm package and the tutorials given here and here. I put all things into a separate file that could then be sourced by my main script. The first functions contains most of the functionality. I transform the whole text into lower letters, remove punctuation and numbers. I also remove typical stop words, the search term and URLs (if I can find them). The final step is stemming yielding the base form of every word. The result after the preprocessing is a so called term-document matrix describing the frequency of the words over the tweets. Because we need this in the form of a “stream of words” the second function transform the matrix into a simple character vector. The code for both functions is as follows:

# --------------------------------------------- 
# Content of file TextMining_Functions.R 
# ---------------------------------------------
getTermDocumentMatrixFromTweets <- function(tweetTextVector, lang = "de", enc = "UTF-8", 
    wordsToRemove = "") {
    require(tm)

    myCorpus <- Corpus(VectorSource(tweetTextVector, encoding = enc), 
	readerControl = list(language = lang))
    myCorpus <- tm_map(myCorpus, tolower)
    # remove punctuation
    myCorpus <- tm_map(myCorpus, removePunctuation)
    # remove numbers
    myCorpus <- tm_map(myCorpus, removeNumbers)
    tmpTerms <- row.names(as.matrix(TermDocumentMatrix(myCorpus, 
	control = list(minWordLength = 1))))
    httpUrlsRemove <- tmpTerms[grep("httptco", tmpTerms)]
    myCorpus <- tm_map(myCorpus, removeWords, c(stopwords("german"), httpUrlsRemove, 
	wordsToRemove))
    dictCorpus <- myCorpus
    myCorpus <- tm_map(myCorpus, stemDocument)
    # stem completion
    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = dictCorpus)
    myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
    return(myDtm)
}

getTermVectorFromTermDocumentMatrix <- function(tdm) {
    tM <- rowSums(as.matrix(tdm))
    return(rep(names(tM), times = tM))
}

Then only two lines of code were needed to start the crawler:

sTerm = ... # Twitter Hashtag
source("Twitter_Stream_Mining.R")

Experiment

By the time I perform these experiment the ongoing discussion about the Deutsche Telekom, one of the world’s largest telecommunication company, provides me with a vibrant discussion so I decided to follow up on this. The cause of the discussion was an announcement by Deutsche Telekom to limit the bandwidth after reaching some monthly threshold for all of their customers from 2016 onwards (you can read more about this for example here). The discussion was divided into two main threads: What will this mean to most popular broadband services and does it violate “network neutrality”. The crawler was running for 8 days beginning on the fourth of May. The following three pictures show the status of the discussion after the first, the fourth and the eighth day (from the left).

Day 1Day 2Day 1

The term dominating the discussion clearly is “drosselkom”, a satirical combination of the words “drosseln” (reducing speed) and “Telekom”. The analyst should remove the term from the plots to gain a better picture of the situation (done for the following three pictures corresponding to the three above).

Day 1Day 2Day 1

The User Interface

This also shows the demand for a graphical interface to allow the analyst to explore a specific snapshot in more detail. I will leave this for the future and will be showing instead, how to build a simple tool to explore the occurrence of words over time (along all snapshots). I therefore used the Shiny package to create a small web application. For this I first had to aggregate the information from the different snapshots to yield a “word-snapshot matrix” showing the frequency of a word in every snapshot. Because analyzing words over a long time is not typically “Real-time Analytics” I implement this preprocessing as a batch job. The following code shows how I created the matrix:

# --------------------------------------------- 
# Content of file Prepare_Batch_View.R 
# ---------------------------------------------
require(plyr)
require(reshape2)
require(ggplot2)

snapshotDirectory = "telekom"  # the same as the search term

resFileList <- list.files(snapshotDirectory)[grep("TwSrMin_Terms_Reservoir", 
    list.files(snapshotDirectory))]

wordMatrix <- matrix(character(0), nrow = 1000, ncol = length(resFileList))
for (i in 1:length(resFileList)) {
    wordReservoir <- read.table(file = paste(snapshotDirectory, "/", resFileList[i], 
        sep = ""), colClasses = c("character"))$V1
    reservoirLength <- length(wordReservoir)
    wordMatrix[1:reservoirLength, i] <- wordReservoir
}
colnames(wordMatrix) <- as.POSIXct(substr(resFileList, 25, 43), 
    "%Y-%m-%d_%H_%M_%S", tz = "GMT")
# first create key value pairs
wordKeyValueList <- melt(wordMatrix, na.rm = TRUE)
wordKeyValueList$Var1 <- NULL
colnames(wordKeyValueList) <- c("TIMESTAMP", "WORD")
# create word x snapshot matrix
occurrencePerWordAndTime <- dcast(wordKeyValueList, WORD ~ TIMESTAMP, length, fill = 0)

save(occurrencePerWordAndTime, 
    file = paste(snapshotDirectory, "/Result_Batch_Processing.RData", sep = ""))

The Shiny application consists of three files in a new subdirectory called “Shiny_Timeline”. The first one is “global.R” which loads the prepared dataset and precompute some information needed for creating the dynamic interface. The interface enables the user to select a couple of words via a section box to visualize their occurrence in the sample over time. The content of the selection box itself is a named list of terms ordered by the overall occurrence of the terms over all snapshots. This could be calculated directly from the “word-snapshot matrix”. The code therefore is part of “global.R”:

# --------------------------------------------- 
# Content of file Shiny_Timeline/global.R 
# ---------------------------------------------
snapshotDirectory = "../telekom"

# load dataset
load(paste(snapshotDirectory, "/Result_Batch_Processing.RData", sep = ""))

# first column contains the term itself
indOrderWords <- order(rowSums(occurrencePerWordAndTime[, -1]), decreasing = TRUE)
# compute the input for the selection box
inputList <- as.list(as.character(occurrencePerWordAndTime$WORD)[indOrderWords])
names(inputList) <- paste(as.character(occurrencePerWordAndTime$WORD)[indOrderWords],
	" (", rowSums(occurrencePerWordAndTime[, -1])[indOrderWords], 
	")", sep = "")

The second file of the Shiny application is my interface description “ui.R”. We offer three ways to interact. The user can select words, choose between two plots showing the same content but offer a different degree of interaction and set the start and end date for the visualization. The user interface is dynamic, meaning that the user first has to select a word before he can select any plot or modify the dates. The output consists of a text box showing all labels and two conditional panels showing the plots as requested.

# ---------------------------------------------
# Content of file Shiny_Timeline/ui.R
# ---------------------------------------------
shinyUI(
  pageWithSidebar(

    headerPanel("Term Analyzer"),

    sidebarPanel(
      selectInput(inputId = "word", 
                  label = "Term (number of occurrences over all snaphshots):", 
                  inputList, multiple = TRUE),
      uiOutput("triggerGGplot2"),
      uiOutput("triggerRChart"),
      uiOutput("triggerDateSlider")
    ),

    mainPanel(
      h4("Selected words"),
      textOutput("selectedWords"),
      conditionalPanel(
        condition = "input.ggplot2Plot == true",
        div(),
        h4("Plot showing term occurrence over time (using ggplot2)"),
        plotOutput("timeseries")
      ),
      conditionalPanel(
        div(),
        h4("Plot showing term occurrence over time (using rCharts and nvd3)"),
        condition = "input.rchartPlot == true",
        showOutput("timeseriesRChartsControls","nvd3")
      )
    )
  )
)

The following screenshot shows both plots activated (click to enlarge).

User Interface

The “server.R” file provides the core functionality. Especially the handling of the inputs and the creation of the two plots. The first plot is based on the ggplot2 package the second one on the rCharts package. While ggplot2 is de facto standard in R, rCharts offers some real nice interactive features. You can (de)select different time series inside the plot or for example hover a time series to show its value. The only thing I didn’t figure out was how to display dates on the x-axis like in the ggplot2 plot. So the last part of code looks like this:

# --------------------------------------------- 
# Content of file Shiny_Timeline/server.R 
# ---------------------------------------------

# code to run: require(shiny); runApp('Shiny_Timeline')

require(wordcloud)
require(tm)
require(plyr)
require(reshape2)
require(ggplot2)
require(rCharts)


shinyServer(function(input, output) {

    getSelectedWords <- reactive({
        if (length(input$word) < 1) {
            return(NULL)
        } else {
            return(as.character(input$word))
        }
    })

    getDateRange <- reactive({
        dateRange <- as.numeric(input$daterange) * 24 * 60 * 60
        # because we need the complete day
        dateRange[2] <- dateRange[2] + 23 * 60 * 60 + 59 * 60 + 59
        return(dateRange)
    })

    output$selectedWords <- renderText({
        text <- getSelectedWords()
        if (length(text) == 0) {
            return("Nothing selected")
        } else {
            return(text)
        }
    })

    # dynamic ui components
    output$triggerGGplot2 <- renderUI({
        if (length(input$word) > 0) {
            checkboxInput("ggplot2Plot", "Show ggplot2 plot", FALSE)
        }
    })

    output$triggerRChart <- renderUI({
        if (length(input$word) > 0) {
            checkboxInput("rchartPlot", "Show rCharts plot", FALSE)
        }
    })

    output$triggerDateSlider <- renderUI({
        if (length(input$word) > 0) {
            dateSeq <- as.POSIXct(
                as.numeric(as.character(colnames(occurrencePerWordAndTime)[-1])), 
                origin = "1970-01-01", tz = "GMT")
            dateRangeInput("daterange", "Date range:", start = format(min(dateSeq), 
                "%Y-%m-%d"), end = format(max(dateSeq), "%Y-%m-%d"), 
                min = format(min(dateSeq), "%Y-%m-%d"), max = format(max(dateSeq), 
                "%Y-%m-%d"), language = "de", separator = " - ")
        }
    })

    output$timeseries <- renderPlot({
        indSelectedWords <- which(as.character(occurrencePerWordAndTime$WORD) %in% 
            getSelectedWords())
        dateRange <- getDateRange()
        indSelectedSnapShots <- which(
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) >= dateRange[1] & 
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) <= dateRange[2]) + 1
        tmp <- melt(occurrencePerWordAndTime[indSelectedWords, 
                c(1, indSelectedSnapShots)], id.vars = c("WORD"))
        colnames(tmp) <- c("WORD", "TIMESLOT", "VALUE")
        tmp$TIMESLOT <- as.POSIXct(as.numeric(as.character(tmp$TIMESLOT)), 
            origin = "1970-01-01", tz = "GMT")

        p <- ggplot(tmp, aes(TIMESLOT, VALUE, group = WORD, colour = WORD)) + 
            geom_path(alpha = 0.5) + scale_colour_hue() + ylab("occurrence") + 
            xlab("Snapshot (datetime)")
        print(p)
    })

    output$timeseriesRChartsControls <- renderChart({
        indSelectedWords <- which(as.character(occurrencePerWordAndTime$WORD) %in% 
            getSelectedWords())
        dateRange <- getDateRange()
        indSelectedSnapShots <- which(
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) >= dateRange[1] & 
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) <= dateRange[2]) + 1
        tmp <- melt(occurrencePerWordAndTime[indSelectedWords, 
                c(1, indSelectedSnapShots)], id.vars = c("WORD"))
        colnames(tmp) <- c("WORD", "TIMESLOT", "VALUE")
        tmp$TIMESLOT  <- as.numeric(as.character(tmp$TIMESLOT))
        tmp$TIMESLOT  <- tmp$TIMESLOT - (min(tmp$TIMESLOT) - 1)

        p2 <- nPlot(VALUE ~ TIMESLOT, group = "WORD", data = tmp, type = "lineChart", 
            dom = "timeseriesRChartsControls", width = 800)
        p2$xAxis(axisLabel = "time (in seconds from the first snapshot)", width = 40)
        p2$yAxis(axisLabel = "occurrence", width = 40)
        return(p2)
    })
})

Interpretation

Running this small application gives you the ability to show the development of the discussion in more detail. It is great to see how different subtopics enrich the discussion over time as it could be seen on the following picture with four terms selected.

User Interface

Some interpretation: Malte Götz is the name of a person who started a large online petition against the intention of Deutsche Telekom. On the third of May he handed over all signatures to Deutsche Telekom in a public event. Because the crawler was started one day later and also needed some time to initialize we can only see the last part of the discussion on that. But as the discussion goes on the importance relative to the overall discussion on this subtopic decreases. On the sixth of May the “Verbraucherzentrale Nordrhein-Westfalen”, a large consumer protection agency, announced that it sent a warning letter to Deutsche Telekom to stop their plan to reduce bandwidth. You can see that this caused a lot of discussion on twitter until the end with a peak around the seventh of May. On the same day the “re:publica” conference with topics related to the digital society opened in Berlin. Because the topic is very close to topic of the conference you can see an emerging relation between the conference and the case of Deutsche Telekom. This is especially important because as an analyst you are able to see in “Real-time” how events will “position” themselves to the topic under observation (become an important part of the discussion or not). This is also true for the last term I want to mention - “wiwo”. It is the abbreviation for a large business magazine called “Wirtschaftswoche”. On May the eighth they published an article about the relation between Deutsche Telekom and German politics in particular agencies responsible to oversee the telecommunication market. I could not find any other news article that got such a high attention for this twitter discussion in my data.

Conclusion

There are a lot of things I want to mention at the end - so I will keep it short:

  • You could easily start with Text Mining in R but as things will get more complex results will not always be as expected. For example I tried some simple sentiment analysis but slang, spelling and irony on twitter made it impossible for me to get some useful outcome.
  • The user interface described in my post and a good search engine made it really easy to get a dynamic picture of the discussion on twitter. Surely you can improve this by viewing topics instead of single terms. For this topic modeling could be a good starting point.
  • Reservoir sampling helped me great in keeping the data to handle small. And it is good to know this method ;-).
  • Shiny is my most important new observation from all the techniques. You could easily develop some prototypical web interfaces to show to colleagues and customers. Until now it felt like a restriction to show only static pictures and tables but that is over now.
  • Graphics based on JavaScript like rCharts are great and look very nice. But the available documentation is “limited” compared to ggplot2 and there are a lot of different libraries you have to deal with. Consulting the original java script code is necessary for customizing your charts.

Now, that’s all for this posting and I hope you can get something out of it. If you have questions or any comments feel free to leave a message.

To leave a comment for the author, please follow the link and comment on his blog: Data * Science + R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.