Investigating words distribution with R – Zipf’s law

February 7, 2019

(This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers)

Hello again! Typically I would start by describing a complicated problem that can be solved using machine or deep learning methods, but today I want to do something different: I want to show you an interesting probabilistic phenomenon!

Have you heard of Zipf’s law? I hadn’t until recently. Zipf’s law is an empirical law stating that many different datasets found in nature can be described using Zipf’s distribution. Most notably, word frequencies in books, documents and even whole languages can be described in this way. Put simply, Zipf’s law states that if we take a document, book or any other collection of words and count how many times each word occurs, the resulting frequencies will closely follow Zipf’s distribution. Let’s say that the number of occurrences of the most frequently occurring word is:

X

Zipf’s law states that the number of occurrences of the second most frequently occurring word will be equal to:

X/2

So basically this word will occur half as many times as the most frequent word. The number of occurrences of the third most frequently occurring word would be:

X/3

And so on … So the number of occurrences of the Nth most frequent word would be:

X/N

More recent studies of this phenomenon show that, in the case of words, the rank is typically raised to some exponent 𝞪, and the frequency of the Nth word is described as:

X/N^𝞪
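To make the formula concrete, here is a small base R sketch that computes the expected counts for the first few ranks. The helper name zipf_expected and the top count of 1000 are made up for illustration; they do not come from the dataset used in this article.

```r
# Hypothetical helper: expected count of the word at a given rank.
# top_count is X (the count of the most frequent word), alpha is the exponent.
zipf_expected <- function(top_count, rank, alpha = 1) {
  top_count / rank^alpha
}

ranks <- 1:5
zipf_expected(1000, ranks)             # alpha = 1: X, X/2, X/3, ...
zipf_expected(1000, ranks, alpha = 2)  # a larger alpha makes counts fall off faster
```

With alpha = 1 this reproduces the X, X/2, X/3 sequence described earlier.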

To check the theory I downloaded a set of the 50,000 most frequent Polish words in subtitles (https://github.com/hermitdave/FrequencyWords/blob/master/content/2016/pl/pl_50k.txt) from OpenSubtitles.org. Here’s a visualization of real and theoretical frequencies.

Zipf’s law using ggplot

To see it more clearly we can use logarithmic scales.

Zipf’s law using ggplot log
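The logarithmic scales help because taking logs of the Zipf formula gives log(count) = log(X) - 𝞪 * log(N), which is a straight line with slope -𝞪. As a rough sketch, one quick way to estimate 𝞪 is a least-squares fit on the log-log values. The data below is synthetic (generated with 𝞪 = 1.1 plus some noise) and the variable names are illustrative; this is a quick check under those assumptions, not a rigorous estimator and not necessarily how the plots here were produced.

```r
set.seed(42)

# Synthetic counts that follow Zipf's law with alpha = 1.1, plus mild noise.
true_alpha <- 1.1
word_rank <- 1:1000
word_freq <- 1e6 / word_rank^true_alpha * exp(rnorm(length(word_rank), sd = 0.05))

# On log-log axes Zipf's law is linear:
# log(count) = log(X) - alpha * log(rank)
fit <- lm(log(word_freq) ~ log(word_rank))

# The fitted slope estimates -alpha, so flip the sign.
estimated_alpha <- -unname(coef(fit)[2])
estimated_alpha
```

On real word lists the low-frequency tail is noisy, so the fitted value will depend on how many ranks you include.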

Try it out yourself: a list of example datasets can be found here: https://en.wiktionary.org/wiki/Wiktionary:Frequency_lists

You can use this example code to create a similar visualization:


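library(ggplot2)
library(dplyr)
library(gganimate)

# Any data frame with words and their frequencies works here.
# As an example (an assumption, not part of the original code), this reads
# pl_50k.txt downloaded from the link above: one "word count" pair per line.
word_count <- read.table("pl_50k.txt", sep = " ", quote = "", comment.char = "",
                         stringsAsFactors = FALSE, fileEncoding = "UTF-8")
colnames(word_count) <- c("word", "count")
alpha <- 1 # Change if needed

word_count <- word_count %>%
  arrange(desc(count)) %>% # rank 1 must be the most frequent word
  mutate(word = factor(word, levels = word),
         rank = row_number(),
         # Zipf's prediction: count of the top word divided by rank^alpha
         zipfs_freq = dplyr::first(count) / rank^alpha)

zipfs_plot <- ggplot(word_count, aes(x = rank, y = count)) +
  geom_point(aes(color = "observed")) +
  geom_point(aes(y = zipfs_freq, color = "theoretical")) +
  theme_bw() +
  # Reveal the data gradually along the rank axis
  transition_reveal(rank) +
  labs(x = "rank", y = "count", title = "Zipf's law visualization") +
  scale_colour_manual(name = "Word count",
                      values = c("theoretical" = "red", "observed" = "black")) +
  theme(legend.position = "top")
zipfs_animation <- animate(zipfs_plot)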
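
If you only need a static picture on logarithmic scales rather than an animation, a sketch along these lines should work. It assumes ggplot2 is installed and uses a small synthetic data frame (demo_counts, a made-up name) as a stand-in for a real word list.

```r
library(ggplot2)

# Synthetic stand-in for a real word list: rounded counts close to X / rank.
demo_counts <- data.frame(rank = 1:500)
demo_counts$count <- round(10000 / demo_counts$rank)
demo_counts$zipfs_freq <- demo_counts$count[1] / demo_counts$rank

log_plot <- ggplot(demo_counts, aes(x = rank, y = count)) +
  geom_point(aes(color = "observed")) +
  geom_line(aes(y = zipfs_freq, color = "theoretical")) +
  scale_x_log10() + # logarithmic rank axis
  scale_y_log10() + # logarithmic count axis
  scale_colour_manual(name = "Word count",
                      values = c("theoretical" = "red", "observed" = "black")) +
  labs(x = "rank", y = "count", title = "Zipf's law on logarithmic scales") +
  theme_bw() +
  theme(legend.position = "top")

print(log_plot)
```

On these axes the theoretical curve is a straight line, which makes deviations in the observed counts easier to spot.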

This experiment is amazing, because language is very complicated: words in text are not random in any sense, and they depend on the previous ones. That’s why it’s so surprising to see such patterns here. We should always remember that the world can astonish us in many different ways! See you next time 🙂
