Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This one is named, yes, you guessed it, after Markov chains.
It is a simple calculation of the probability that one word follows another. Drawing the words that appear chained together more than once is reminiscent of a Markov chain (although this is not quite it!).
The gist is tokenizing the text into words, counting how often each token follows a given token (or group of tokens), and turning those counts into probabilities.
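library(stringr)
library(dplyr)

markov_babbler <- function(text, order = 2, n = 50, by_word = TRUE) {
  # Split the text into words (on whitespace) or into single characters
  tokens <- if (by_word) str_split(text, "\\s+")[[1]] else unlist(str_split(text, ""))
  tokens <- tokens[tokens != ""]
  #add the removal of full stops,....
  token <- c('I', 'I am', 'to', 'all', 'Oh')  # note: not used in the code shown here

  # Pair every run of `order` tokens ("from") with the token that follows it ("to")
  df <- data.frame(
    from = sapply(seq_len(length(tokens) - order),
                  function(i) paste(tokens[i:(i + order - 1)], collapse = " ")),
    to = tokens[(order + 1):length(tokens)],
    stringsAsFactors = FALSE
  )

  # Count each (from, to) pair and convert the counts to probabilities per "from"
  probs <- df %>%
    group_by(from, to) %>%
    summarise(freq = n(), .groups = "drop") %>%
    group_by(from) %>%
    mutate(prob = freq / sum(freq))

  # Start from a random state, then repeatedly sample the next token
  current <- sample(unique(probs$from), 1)
  output <- unlist(str_split(current, " "))
  for (i in seq_len(n)) {
    next_word <- probs %>% filter(from == current)
    if (nrow(next_word) == 0) break  # dead end: this state was never followed by anything
    next_token <- sample(next_word$to, 1, prob = next_word$prob)
    output <- c(output, next_token)
    current <- paste(tail(output, order), collapse = " ")
  }

  # Assumed ending: the listing in the post stops after the loop,
  # so returning the generated text here is a guess at the intent
  paste(output, collapse = if (by_word) " " else "")
}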
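To make the counting step concrete, here is a minimal sketch on a toy sentence. The sentence is made up for illustration, the sketch uses base R instead of dplyr, and it looks only one word back (order 1) rather than the function's default of two, so treat it as an illustration of the idea and not as the function itself.

```r
# Toy text (made up for this example); order-1 transitions:
# each word is used to predict the word that follows it
text <- "the wolf ate the grandmother and the wolf slept"
tokens <- strsplit(text, "\\s+")[[1]]

# Pair every token with the token that comes right after it
pairs <- data.frame(
  from = head(tokens, -1),
  to   = tail(tokens, -1),
  stringsAsFactors = FALSE
)

# Count each (from, to) pair and drop the pairs that never occur
counts <- as.data.frame(table(from = pairs$from, to = pairs$to),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Probability = pair count divided by the total count for the same "from"
counts$prob <- counts$Freq / ave(counts$Freq, counts$from, FUN = sum)

# Transitions out of "the": "wolf" twice, "grandmother" once
the_rows <- counts[counts$from == "the", ]
print(the_rows)

# Draw the next word after "the", weighted by the estimated probabilities
set.seed(1)
next_word <- sample(the_rows$to, 1, prob = the_rows$prob)
print(next_word)
```

In this toy text "the" is followed by "wolf" two times out of three, so the sampler picks "wolf" with probability 2/3 and "grandmother" with probability 1/3. Repeating that draw, each time moving to the newly chosen word, is what produces the babble.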
With this in mind, I took Red Riding Hood (Brothers Grimm) and plugged the story into the function, in both English and Slovenian.
…
Playing around with useless statistics is fun. Useless fun.
And no function is complete without a little ggplot for drawing the network of words.
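library(dplyr)
library(igraph)
library(ggraph)

# `probs` is the transition table (from, to, freq, prob) computed in markov_babbler()
# Keep only the transitions that occur more than once
g <- graph_from_data_frame(probs %>% filter(freq > 1), directed = TRUE)

plot <- ggraph(g, layout = "fr") +
  geom_edge_link(aes(edge_alpha = prob, edge_width = prob), color = "firebrick") +
  geom_node_label(aes(label = name), size = 4, repel = TRUE) +
  theme_void() +
  labs(title = "Markov Chain: Token Transitions")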
As always, the complete code is available on GitHub in the Useless_R_function repository. The sample file in this repository is here (filename: Markov_babbler.R). Check the repository for future updates.
Happy R-coding and stay healthy!