Generate text using Markov Chains (sort of)

[This article was first published on Clean Code, and kindly contributed to R-bloggers.]

Inspired by the hilarious podcast The Greatest Generation, I have again worked with all the lines from all the episode scripts of TNG. Today I will make a clunky bot (although it does nothing and is absolutely not useful) that talks like Captain Picard. I actually wanted to use a Markov chain to generate text. A Markov chain has a specific property: it doesn't care what happened before, it only looks at the probabilities of moving from the current state to a next state.

Markov Chains

An example is a board game based on dice throws. Every time I throw the die there is an equal probability of moving from my current side to any other side. In this case every probability is equal, but if the die were loaded the probabilities would change.
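The die example can be made concrete with a few lines of base R. This is a toy illustration (not part of the original analysis): we sample repeatedly from a loaded die and check that the loading shows up in the long-run frequencies.

```r
# Markov property in miniature: the next roll depends only on the
# probabilities attached to the current state, nothing earlier.
sides  <- 1:6
fair   <- rep(1/6, 6)                  # every transition equally likely
loaded <- c(.1, .1, .1, .1, .1, .5)    # die loaded towards six

set.seed(1)
rolls <- sample(sides, size = 10000, replace = TRUE, prob = loaded)
mean(rolls == 6)  # roughly 0.5, because of the loading
```

Swapping `loaded` for `fair` brings the frequency of sixes back to roughly 1/6; the weights are all that matter.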

In the current example I'm using all the phrases I extracted from the series TNG. We'll then pick words that follow each other, choosing "randomly" but weighted by occurrence.

But first cleaning

I'm using the packages dplyr, stringr, tidytext and tidyr. If you'd like to follow along, download the dataset and load it into R with load("path to file").

The dataset itself is rather large: dim(TNG) gives 110176 by 17, so about 110 thousand rows and 17 variables. However, we will only use the variables type, who and text. type contains either "description" or "speech"; we only need speech. Because I was a novice, and because in general I'm not very tidy, this dataset is rather dirty: I copied the who part and did not sanitize it. This means that if you filter on "Riker", you will not get everything he said. The script has weird things like "Riker's com voice", "Riker's voice" and several variants of "V.O." (meaning voice over).

Cleaning the who column

We have to clean up the who column and filter only the rows that contain speech. Perhaps we also want to select only rows for certain characters. And finally we want to cut the text up into bigrams or trigrams. Thanks to an excellent book by Julia Silge and David Robinson this was surprisingly easy to do. I cannot recommend their package tidytext enough; truly awesome!

You could do the following steps all in one pipe, but for clarity's sake, and your sanity, we will take smaller steps.

speech_TNG <- TNG %>%
        filter(type == "speech") %>%    # select only speech parts
        mutate(who = str_replace(who, "'S COM VOICE", "") %>%  # strip com-voice suffix
                       str_replace("\\(.{1,}\\)", "") %>%      # strip parenthesised directions
                       str_replace("\".{1,}\"", "") %>%        # strip quoted additions
                       str_replace("YOUNG", "") %>%            # strip YOUNG prefix
                       str_replace("'S VOICE", "") %>%         # strip voice suffix
                       str_trim())                             # remove leftover whitespace

This looks crazy, but don’t despair:

How did I know how to build this enormous pipe? I took a sample of the who column and tried stuff until it suited my needs [3].

[3]: for example sample <- TNG %>% filter(type == "speech") %>% select(who) %>% sample_n(20) %>% .$who
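To see the cleaning pipe in action without loading the full dataset, here is the same sequence of patterns applied with base R's sub() to a few made-up who values (these exact strings are hypothetical, but they mimic the variants described above):

```r
# Base-R mirror of the stringr cleaning pipe, applied to toy values.
who <- c("RIKER'S COM VOICE", "PICARD (O.S.)", "YOUNG PICARD")
who <- sub("'S COM VOICE", "", who)   # strip com-voice suffix
who <- sub("\\(.{1,}\\)", "", who)    # strip parenthesised directions
who <- sub("YOUNG", "", who)          # strip YOUNG prefix
who <- sub("'S VOICE", "", who)       # strip voice suffix
trimws(who)
#> [1] "RIKER"  "PICARD" "PICARD"
```

All three variants collapse to the plain character name, which is exactly what we need before filtering on who.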

Extracting bi- and trigrams per character

And now we can create separate datasets for every character.

bigrams_data <- speech_TNG %>%
        filter(who == "DATA") %>%
        unnest_tokens(bigram, text, token = "ngrams", to_lower = TRUE, n = 2) %>%
        separate(bigram, c("word1", "word2"), sep = " ") %>%
        count(word1, word2, sort = TRUE)

We use the unnest_tokens command from the tidytext package. This command takes your cell with text and extracts either sentences, words, ngrams or paragraphs. It also converts everything to lowercase and deletes any punctuation. I call the resulting variable "bigram". Then I take that variable "bigram" and split it into two variables, "word1" and "word2". Finally I count how many times each combination occurs.

For Picard, the bigrams would be:

bigrams_picard <- speech_TNG %>%
        filter(who == "PICARD") %>%
        unnest_tokens(bigram, text, token = "ngrams", to_lower = TRUE, n = 2) %>%
        separate(bigram, c("word1", "word2"), sep = " ") %>%
        count(word1, word2, sort = TRUE)

And his trigrams would be

trigrams_picard <- speech_TNG %>%
        filter(who == "PICARD") %>%
        unnest_tokens(trigram, text, token = "ngrams", to_lower = TRUE, n = 3) %>%
        separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
        count(word1, word2, word3, sort = TRUE)

Creating a Markov chain (sort of)

My end product takes two words and tries to find a third word. Then it takes the final two words and tries to find another word, until the sentence has reached a length that I specify at the start.

What I actually created is a trigram dataframe, and a function that searches that frame. The function takes two words and returns all the rows where the first word matches the first column and the second word matches the second column.

Furthermore I made a sentence creator: a function where you supply the first two words and the length of the sentence. That function keeps using the last two words in the sentence until the correct length is achieved. With the fallback method of using bigrams when the trigrams no longer match it can still fail, but not as often.

Word generator

return_third_word <- function(woord1, woord2){
        woord <- trigrams_picard %>%
                filter_(~word1 == woord1, ~word2 == woord2) %>%  # match the first two words
                sample_n(1, weight = n) %>%                      # pick one row, weighted by count
                .[["word3"]]
        woord
}

However, this function sometimes returns an empty result. So I baked in a backup for when it can't find the word combination.

return_third_word <- function(woord1, woord2){
        woord <- trigrams_picard %>%
                filter_(~word1 == woord1, ~word2 == woord2) %>%
                sample_n(1, weight = n) %>%
                .[["word3"]]
        if(length(woord) == 0){
                # fall back to bigrams: match only the last word
                bleh <- filter_(bigrams_picard, ~word1 == woord2) %>%
                        sample_n(1, weight = n) %>%
                        .[["word2"]]
                warning("no word found, adding ", bleh, " to ", woord1, " ", woord2)
                woord <- bleh
        }
        woord
}

I also used a bit of non-standard evaluation (NSE) just for fun.
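As an aside: filter_ and the ~ formulas have since been deprecated in dplyr; with current dplyr the same lookup works with plain filter(), because we are comparing columns against local variables. Here is a self-contained sketch on a made-up toy trigram table (the words and counts are invented for illustration):

```r
library(dplyr)

# Hypothetical trigram counts, standing in for trigrams_picard.
trigrams_toy <- tibble(
        word1 = c("make", "make"),
        word2 = c("it", "it"),
        word3 = c("so", "happen"),
        n     = c(10, 1)
)

return_third_word2 <- function(trigrams, woord1, woord2){
        trigrams %>%
                filter(word1 == woord1, word2 == woord2) %>%  # no NSE needed
                sample_n(1, weight = n) %>%
                pull(word3)
}

return_third_word2(trigrams_toy, "make", "it")  # usually "so", rarely "happen"
```

The weighting still works the same way: "so" is drawn ten times as often as "happen".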

Sentence generator

generate_sentence <- function(word1, word2, sentencelength = 5, debug = FALSE){
        # input validation
        if(sentencelength < 3) stop("I need more to work with")
        sentencelength <- sentencelength - 2
        # starting values
        sentence <- c(word1, word2)
        woord1 <- word1
        woord2 <- word2
        for(i in seq_len(sentencelength)){
                if(debug == TRUE) print(i)
                word <- return_third_word(woord1, woord2)
                sentence <- c(sentence, word)
                woord1 <- woord2
                woord2 <- word
        }
        output <- paste(sentence, collapse = " ")
        output
}

For example, generate_sentence("I", "am", 9) created for me:

"i am loathe to do with you because they"
"i am not the case however wesley is alive"
"i am aware of the tachyon pulse it might be able to determine how to"
# and with "the enterprise" and length 9
"the enterprise we use the cloak to escape do"
"the enterprise we have had"
"the enterprise for a thing"

Right, it makes no sense whatsoever. Thanks for following along!

Do you have suggestions or improvements, or did you find errors? Open an issue or email me: https://github.com/RMHogervorst/cleancode/issues/new/

Generate text using Markov Chains (sort of) was originally published at Clean Code on January 21, 2017.
