Retrieving Data from Google Books with `ngramr`
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Karl Marx is the most famous founding fathers of modern sociology with a popularity peak in 1975-6, but declining ever since.
Introduction
Google has a tool for tracking the frequency of words or phrases across its vast collection of scanned texts, the Google Books. The Google Ngram Viewer reports data and graphs the frequency of words encountered in one or across several corpus over time. For instance, the chart above campares the appearance in the English corpus of following bigrams names: “Karl Marx”, “Max Weber”, “Emile Durkheim”.
The y-axis shows of all the bigrams contained in the sample of books written in English, what percentage of them are “Karl Marx” or “Max Weber” or “Emile Durkheim”? From the chart above, we can conclude that Marx is the most famous sociologist among the others founding fathers, with a peak in popularity about 1975-6, but his influence has been declining ever since. These thinkers are considered the founding fathers of sociology because they set out to develop practical and scientifically sound methods of research to examine theories of the social world rooted in a specific historical and cultural context.
Using R with Google Ngram Viewer
There is a package to query the Google Ngram called ngramr written by Sean Carmody. With this package, one can retrieve data from Ngram pages in the form of data frame.
Getting Started
The first thing to do is to load the ggplot2
and ngramr
packages. In case you don’t have them installed, an installation is required.
Write a Query and Do the Plots
The following is equivalent to the chart above for the three sociologists bigrams, except that I’m applying a smoothed line–or moving average of 5 years, so trends become more apparent. For instance, the data shown for 2000 is an average of the raw count for 2000 plus 5 values on either side: (“count for 1995” + “count for 2000” + “count for 2005”), divided by 3. So a smoothing of 5 means that 11 values will be averaged: 5 on either side, plus the target value in the center of them.
ng <- c("(Karl Marx)", "(Max Weber)", "(Emile Durkheim)") ggram(ng, year_start = 1800, year_end = 2012, smoothing = 5, google_theme = FALSE) + ggtitle("Marx, Weber, Durkheim")+ theme_538(legend.position = "bottom")
Complex Queries
I test the trend in popularity of the unigran “Capitalism” vis-à-vis other two unigrans (“Socialism”, “Communism”) from 1850 to 2012. Here’s how we might combine + and / to show how the unigram “Capitalism” has blossomed at the expense of “Socialism” and “Communism” terms in the literature published in English.
ng <- c("Capitalism /(Capitalism + Socialism + Communism)") cap <- ggram(ng, year_start = 1850, year_end = 2012, smoothing = 3, google_theme = FALSE) + ggtitle("Capitalism over (Capitalism + Socialism + Communism)")+ theme_538() ng <- c("Socialism /(Capitalism + Socialism + Communism)") soc <- ggram(ng, year_start = 1850, year_end = 2012, smoothing = 3, google_theme = FALSE) + ggtitle("Socialism over (Capitalism + Socialism + Communism)")+ theme_538() ng <- c("Communism /(Capitalism + Socialism + Communism)") com <- ggram(ng, year_start = 1850, year_end = 2012, smoothing = 3, google_theme = FALSE) + ggtitle("Communism over (Capitalism + Socialism + Communism)")+ theme_538() multiplot(cap, soc, com, ncols=1) ggfootnote(size = .5)
In the following, I retrieve the frequency for the unigram “Capitalism” in the Russian corpus, French, and British English. Note that the results can be case-insensitive variants.
Below is a plot of the unigram “Capitalism” (“капитализм”) in the Russian language corpus.
ng <- "капитализм" rus=ggram(ng, year_start = 1850, corpus = "rus_2012", ignore_case = TRUE, google_theme = TRUE) + ggtitle("Russian: капитализм")+ theme_538() ng <- "capitalisme" fre=ggram(ng, year_start = 1850, corpus = "fre_2012", ignore_case = TRUE, google_theme = TRUE) + ggtitle("French: Capitalisme")+ theme_538() ng <- "capitalism" eng=ggram(ng, year_start = 1850, corpus = "eng_gb_2012", ignore_case = TRUE, google_theme = TRUE) + ggtitle("English: Capitalism")+ theme_538() multiplot(rus, fre, eng, ncols=1) ggfootnote(size = .5)
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.