Constructing a Word Cloud for ICML 2015

July 10, 2015
By

(This article was first published on Exegetic Analytics » R, and kindly contributed to R-bloggers)

Word clouds have become a bit cliché, but I still think that they have a place in giving a high level overview of the content of a corpus. Here are the steps I took in putting together the word cloud for the International Conference on Machine Learning (2015).

word-cloud

  1. Extract the hyperlinks to the PDFs of all of the papers from the Conference Programme web site using a pipeline of grep and uniq.
  2. In R, extract the text from each of these PDFs using the Rpoppler package.
  3. Split the text for each page into lines and remove the first line (corresponding to the running title) from every page except the first.
  4. Intelligently handle word breaks and then concatenate lines within each document.
  5. Transform text to lower case then remove punctuation, digits and stop words using the tm package.
  6. Compile the words for all of the documents into a single data.frame.
  7. Using the dplyr package count the number of times that each word occurs across the entire corpus as well as the number of documents which contain that word. This is what the top end of the resulting data.frame looks like:
    > head(word.counts)
           word count doc
    1       can  6496 270
    2  learning  5818 270
    3 algorithm  5762 252
    4      data  5687 254
    5     model  4956 242
    6       set  4040 269
    
  8. Finally, construct a word cloud with the tagcloud package using the word count to weight the word size and the document count to determine grey scale.

The first cloud above contains the top 300 words. The larger cloud below is the top 1000 words.

word-cloud-large

The post Constructing a Word Cloud for ICML 2015 appeared first on Exegetic Analytics.

To leave a comment for the author, please follow the link and comment on their blog: Exegetic Analytics » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)