Reading HTML pages in R for text processing

[This article was first published on Quantum Forest » rblogs, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

We were talking with one of my colleagues about doing some text analysis—that, by the way, I have never done before—for which the first issue is to get text in R. Not any text, but files that can be accessed through internet. In summary, we need to access an HTML file, parse it so we can access specific content and then remove the HTML tags. Finally, we may want to replace some text (the end of lines, \n, for example) before continue processing the files.

The package XML has the necessary functionality to deal with HTML, while the rest is done using a few standard R functions.

library(XML)

# Read and parse HTML file
doc.html = htmlTreeParse('http://apiolaza.net/babel.html',
           useInternal = TRUE)

# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))

# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)

# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')

Incidentally, babel.html contains a translation of the short story ‘The Library of Babel’ by Jorge Luis Borges. Great story! We can repeat this process with several files and then create a corpus (and analyze it) using the tm package.

To leave a comment for the author, please follow the link and comment on their blog: Quantum Forest » rblogs.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)