A “Startlingly Neat & Simple” Rule & Five Graphs About Patterns That Might Surprise You

February 17, 2015

(This article was first published on Plotly, and kindly contributed to R-bloggers)

George Zipf popularized an idea, Zipf’s Law, that approximates populations of cities, distribution of money in counties, and how frequently words are used. Nobel Prize-winning columnist Paul Krugman wrote of Zipf’s Law that

“the usual complaint about economic theory is that our models are oversimplified — that they offer excessively neat views of complex, messy reality. [In the case of Zipf’s law] the reverse is true: we have complex, messy models, yet reality is startlingly neat and simple.”

Read on to learn more. Let us know if you want to run Plotly Enterprise on-premise.

A Zipfian Distribution: How Often Words Appear

A Zipfian distribution is a type of power law. A power law occurs when one quantity varies as a power of another. One application of Zipf’s law states that in texts of natural language (e.g., books), a word’s frequency is roughly inversely proportional to its rank: the most common word is used about twice as often as the second most common word, three times as often as the third, and so on. The graph below applies the rule to word usage in 29 UK books. “The” was the most commonly used word, with 225,300 uses. Note that the graph is interactive; you can press the “play with this data” link to edit, embed, and share your own version.

Number of uses of word vs word rank (log-log axes)
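
To make the rank and frequency idea concrete, here is a minimal Python sketch. It is not the code behind the graph; the function names `rank_frequency` and `zipf_prediction` and the toy sentence are made up for illustration. It counts words, ranks them, and compares each count with the idealized Zipf value (top count divided by rank).

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts, start=1)]


def zipf_prediction(top_count, rank):
    """Idealized Zipf count for a given rank: top_count / rank."""
    return top_count / rank


# Toy text, invented for illustration only (far too short to follow Zipf's law)
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
top_count = table[0][2]
for rank, word, count in table[:3]:
    print(rank, word, count, round(zipf_prediction(top_count, rank), 1))
```

With the figure quoted above, `zipf_prediction(225300, 2)` gives 112,650 as the idealized count for the second-ranked word. Real texts only follow this approximately.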

Evaluating Power Laws

We can test for a power law by plotting frequency (y-axis) against rank (x-axis) on log-log axes and checking for a straight line. The graph below shows three attempts to fit a power law function to datasets. The plot on the left is a good fit. The plot in the middle is a decent fit. The plot on the right is not a good fit.

Evaluating power law fits for three datasets
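
The straight-line check can also be done numerically. The sketch below is an assumption about one reasonable way to do it, not the method used for the plots: it fits a least-squares line through (log rank, log frequency) using only the standard library. The name `fit_power_law` and the synthetic data are hypothetical.

```python
import math


def fit_power_law(ranks, freqs):
    """Least-squares line through (log10 rank, log10 freq).

    Returns (slope, intercept, r_squared). A slope near -1 with
    r_squared near 1 is what an idealized Zipfian dataset would give.
    """
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic data that follows frequency = 1000 / rank exactly
ranks = list(range(1, 51))
freqs = [1000 / r for r in ranks]
slope, intercept, r_squared = fit_power_law(ranks, freqs)
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

Treat this as a quick check rather than a rigorous test: a line that looks straight on log-log axes is consistent with a power law but does not prove one.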

Evaluating Zipfian Distributions For City Populations

Another application of Zipf’s law is to populations. We’ve used ggplot2 to graph the population of cities (y-axis) against the rank of each city (x-axis). In this dataset, New York has the highest population and is ranked first.

City population vs rank across countries
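
If city sizes followed Zipf’s law exactly, rank multiplied by population would be the same for every city. The Python sketch below shows that check. The function name `rank_size_table` and the city figures are hypothetical, invented for illustration; they are not the dataset plotted above.

```python
def rank_size_table(populations):
    """Sort cities by population and attach rank and rank * population."""
    ordered = sorted(populations.items(),
                     key=lambda item: item[1], reverse=True)
    return [(rank, city, pop, rank * pop)
            for rank, (city, pop) in enumerate(ordered, start=1)]


# Hypothetical populations, invented for illustration only
cities = {
    "City A": 8_000_000,
    "City B": 4_100_000,
    "City C": 2_600_000,
    "City D": 2_050_000,
}
for rank, city, pop, product in rank_size_table(cities):
    # Under an idealized Zipfian distribution the last column stays roughly constant
    print(rank, city, pop, product)
```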

GDP Of Nations

Country GDP plotted against rank approaches a Zipfian distribution.

Zipf’s Law and Its Correlation to the GDP of Nations

Evaluating Power Laws For Many Datasets

Researchers use power laws to determine how much infrastructure a city needs, examine the number of gas stations required in a city, and much more.

Evaluating power laws for many datasets (panels: Works, Proteins, Metabolic)

