Little useless-useful R functions – Markov babbler

tomaztsql

7 hours ago

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

This one is named, yes, you guessed it, after Markov chains. The babbler is there to connotate the simplicity of useless R function.

It’s simple calculation of probability of words chaining and drawing the multiple times appeared chained words reminds of markov chain (although this is not it!).

The gist is is tokenization of words, counting the appearances and calculating the probabilities.

markov_babbler <- function(text, order = 2, n = 50, by_word = TRUE) {
  tokens <- if (by_word) str_split(text, "\\s+")[[1]] else unlist(str_split(text, ""))
  tokens <- tokens[tokens != ""]
  
  #add the removal of full stops,....
  token <- c('I', 'I am', 'to', 'all', 'Oh')
  
  df <- data.frame(
    from = sapply(seq_len(length(tokens) - order), function(i) paste(tokens[i:(i + order - 1)], collapse = " ")),
    to = tokens[(order + 1):length(tokens)],
    stringsAsFactors = FALSE
  )
  
  probs <- df %>%
    group_by(from, to) %>%
    summarise(freq = n(), .groups = "drop") %>%
    group_by(from) %>%
    mutate(prob = freq / sum(freq))
  
  current <- sample(unique(probs$from), 1)
  output <- unlist(str_split(current, " "))
  
  for (i in seq_len(n)) {
    next_word <- probs %>% filter(from == current)
    if (nrow(next_word) == 0) break
    next_token <- sample(next_word$to, 1, prob = next_word$prob)
    output <- c(output, next_token)
    current <- paste(tail(output, order), collapse = " ")
  }

Having this in mind, I have took Red Ridding hood (Brother Grimm) and plugged the story into the function. In both English and Slovenian languages.

…

Playing around with useless statistics is fun. Useless fun

And no function is complete with little ggplot for drawing the network of words.

  g <- graph_from_data_frame(probs %>% filter(freq > 1), directed = TRUE)
  plot <- ggraph(g, layout = "fr") +
    geom_edge_link(aes(edge_alpha = prob, edge_width = prob), color = "firebrick") +
    geom_node_label(aes(label = name), size = 4, repel = TRUE) +
    theme_void() +
    labs(title = "Markov Chain: Token Transitions")

As always, the complete code is available on GitHub in Useless_R_function repository. The sample file in this repository is here (filename: Markov_babbler.R). Check the repository for future updates.

Happy R-coding and stay healthy!

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Related