Acquiring data for language research (3/3): web scraping


Web scraping

There are many resources available through direct downloads from repositories and individual sites, and through R package interfaces to web resources with APIs, but these represent only a fraction of the public-facing textual data recorded on the web. In the case that you want to acquire data directly from webpages, R can be used to access the web programmatically through a process known as web scraping. The complexity of a web scrape can vary, but in general it requires more advanced knowledge of R as well as of the structure of the language of the web: HTML (Hypertext Markup Language).

A toy example

HTML is a cousin of XML and as such organizes web documents in a hierarchical format that is read by your browser as you navigate the web. Take for example the toy webpage I created for this demonstration in Figure 1.


Figure 1: Example web page.

The file accessed by my browser to render this webpage is test.html and in plain-text format looks like this:

<html>
  <head>
    <title>My website</title>
  </head>
  <body>
    <div class="intro">
      <p>Welcome!</p>
      <p>This is my first website. </p>
    </div>
    <table>
      <tr>
        <td>Contact me:</td>
        <td>
          <a href="mailto:[email protected]">[email protected]</a>
        </td>
      </tr>
    </table>
    <div class="conc">
      <p>Good-bye!</p>
    </div>
  </body>
</html>

Each element in this file is delineated by an opening and closing tag, such as <head></head>. Tags are nested within other tags to create the structural hierarchy. Tags can take class and id labels to distinguish them from other tags, and they often contain other attributes that dictate how the tag behaves when rendered visually by a browser. For example, there are two <div> tags in our toy example: one has the label class="intro" and the other class="conc". <div> tags are often used to separate sections of a webpage that may require special visual formatting. The <a> tag, on the other hand, creates a web link. As part of this tag's function, it requires the href= attribute and a web protocol; in this case it is a link to an email address, mailto:[email protected]. More often than not, however, href= contains a URL (Uniform Resource Locator). A working example might look like this: <a href="https://francojc.github.io/">My homepage</a>.

The aim of a web scrape is to download the HTML file, parse the document structure, and extract the elements containing the relevant information we wish to capture. Let's attempt to extract some information from our toy example. To do this we will need the rvest package. First, install/load the package; then read and parse the HTML stored in the character vector named web_file, assigning the result to html.
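The post assumes the toy page's markup is already stored in a character vector named web_file. One way to create it, assuming test.html sits in the working directory, is a minimal sketch like this:

web_file <- paste(readLines("test.html"), collapse = "\n") # read the toy page into a single string

read_html() would also accept the path to test.html (or a live URL) directly; the character vector is used here simply to mirror the code below.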

pacman::p_load(rvest) # install/ load `rvest`
html <- read_html(web_file) # read the raw html
html
## {xml_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body>\n    <div class="intro">\n      <p>Welcome!</p>\n      <p>Thi ...

read_html() parses the raw HTML into an object of class xml_document. The summary output above shows that the tags in the HTML structure have been parsed into ‘nodes’. These tag nodes can be accessed with the html_nodes() function by specifying the tag to isolate.

html %>% 
  html_nodes("div")
## {xml_nodeset (2)}
## [1] <div class="intro">\n      <p>Welcome!</p>\n      <p>This is my firs ...
## [2] <div class="conc">\n      <p>Good-bye!</p>\n    </div>

The %>% operator is used to ‘pipe’ the output of one R operation to the input of the next operation. Piping is equivalent to embedding functions but tends to lead to more legible code.

sum(1:5) # embedding example
## [1] 15
1:5 %>% sum() # piping example
## [1] 15

By default, the piped output is used as the first argument of the subsequent function. If this is not what you want, the . placeholder can be used to direct the output to the correct argument.

1:5 %>% paste("Number", .) # directing output with .
## [1] "Number 1" "Number 2" "Number 3" "Number 4" "Number 5"

Notice that html_nodes("div") has returned both div tags. To isolate one of the tags by its class, we append the class name to the tag name, separated by a period (.).

html %>% 
  html_nodes("div.intro")
## {xml_nodeset (1)}
## [1] <div class="intro">\n      <p>Welcome!</p>\n      <p>This is my firs ...

Great. Now say we want to drill down and isolate the subordinate <p> nodes. We can add p to our node filter.

html %>% 
  html_nodes("div.intro p")
## {xml_nodeset (2)}
## [1] <p>Welcome!</p>
## [2] <p>This is my first website. </p>

To extract the text contained within a node we use the html_text() function.

html %>% 
  html_nodes("div.intro p") %>% 
  html_text()
## [1] "Welcome!"                   "This is my first website. "

The result is a character vector with two elements corresponding to the text contained in each <p> tag. If you were paying close attention you might have noticed that the second element in our vector includes extra whitespace after the period. To trim leading and trailing whitespace from text we can add the trim = TRUE argument to html_text().

html %>% 
  html_nodes("div.intro p") %>% 
  html_text(trim = TRUE)
## [1] "Welcome!"                  "This is my first website."

From here we would work to organize the text into the format we want and write the results to disk. Let's leave writing data to disk for later in the post; for now, we will keep our focus on using rvest to acquire data from HTML documents by working through a more practical example.

A practical example

With some basic understanding of HTML and how to use the rvest package, let’s turn to a realistic example. Say we want to acquire text from the Spanish news site elpais.com. The first step in any web scrape is to investigate the site and page(s) we want to scrape. Minimally this includes identifying the URL we want to target and exploring the structure of the HTML document. Take the following webpage I have identified, seen in Figure 2.


Figure 2: Content page from the Spanish news site El País.

As in our toy example, first we want to feed the HTML document to the read_html() function to parse the tags into nodes. In this case we will assign the web address to the variable url. read_html() will automatically connect to the web and download the raw html.

url <- "https://elpais.com/elpais/2017/10/17/opinion/1508258340_992960.html"
html <- read_html(url)
html
## {xml_document}
## <html lang="es">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="salida_articulo" class="salida_articulo salida_articulo_op ...

At this point we have captured and parsed the raw HTML, assigning it to the object named html. The next step is to identify the node or nodes that contain the information we want to extract from the page. To do this it is helpful to use a browser to inspect specific elements of the webpage. Most browsers provide an “Inspect Element” command: hover your mouse over the element of the page you want to target, then right-click and select “Inspect Element”. This will split your browser window horizontally, showing you the raw HTML underlying the webpage.

Figure 3: Using the “Inspect Element” command to explore raw html.

From Figure 3 we see that the node we want to target is h1. This tag is common, however, and we don't want to extract every h1, so we use the class articulo-titulo to specify that we only want the title of the article. Using the convention described in our toy example, we can isolate the title of the page.

html %>% 
  html_nodes("h1.articulo-titulo")
## {xml_nodeset (1)}
## [1] <h1 class="articulo-titulo " id="articulo-titulo" ...

We can then extract the text with html_text().

title <- 
  html %>% 
  html_nodes("h1.articulo-titulo") %>% 
  html_text(trim = TRUE)
title
## [1] "Crímenes contra el periodismo en el seno de la UE"

Let’s extract the author’s name and the article text in the same way.

# Author
author <- 
  html %>% 
  html_node("span.autor-nombre") %>% 
  html_text(trim = TRUE)
# Article text
text <- 
  html %>% 
  html_nodes("div.articulo-cuerpo p") %>% 
  html_text(trim = TRUE)

Another piece of information we might want to include in our web scrape is the date the article was published. Again, we use the “Inspect Element” tool in the browser to locate the tag we intend to isolate. This time, however, the information returned by html_text() is less than ideal: the date is interspersed with formatting whitespace.

html %>% 
  html_nodes("div.articulo-datos time") %>% 
  html_text(trim = TRUE)
## [1] "18 OCT 2017 - 14:26\t\t\t\t\tCEST"

Looking at the time node provides another angle: a clean date is contained in the datetime attribute of the <time> tag.

html %>% 
  html_nodes("div.articulo-datos time")
## {xml_nodeset (1)}
## [1] <time datetime="2017-10-18T14:26:30+02:00" class="articulo-actualiza ...

To extract a tag’s attribute we use the html_attr() function.

# Date
date <- 
  html %>% 
  html_nodes("div.articulo-datos time") %>% 
  html_attr("datetime")
date
## [1] "2017-10-18T14:26:30+02:00"

At this point, we have isolated and extracted the title, author, date, and text from the webpage. Each of these elements is stored in a character vector in our R session. To complete our task we need to write this data to disk as plain text. With an eye towards a tidy dataset, an ideal format is a CSV file in which each column corresponds to one of the elements from our scrape and each row to an observation. The observations will contain the text from each <p> tag. A CSV file is a tabular format, so before we can write the data to disk let's coerce the data we have into tabular form. We will use the tibble() function here to streamline our data frame creation.1 Feeding each of the vectors title, author, date, and text as arguments to tibble() creates the tabular format we are looking for.

tibble(title, author, date, text)

Notice that there are six rows in this data frame, one corresponding to each paragraph in text. R has a bias towards working with vectors of the same length, so each of the shorter vectors (title, author, and date) is replicated, or recycled, until it matches the length of the longest vector, text, which has a length of six.
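To see this recycling on its own, here is a minimal illustration with made-up values (assuming the tibble package is loaded, as it is in the session above); the printed result will look roughly like this:

tibble(letter = "a", number = 1:3) # `letter` (length 1) is recycled to length 3
## # A tibble: 3 x 2
##   letter number
##   <chr>   <int>
## 1 a           1
## 2 a           2
## 3 a           3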

For good documentation, let's add our object url, which contains the actual web link to this page, to the data frame and assign the result to webpage_data.

webpage_data <- tibble(title, author, date, text, url)

The final step is to write this data to disk. To do this we will use the write_csv() function.

write_csv(x = webpage_data, path = "data/original/elpais_webpage.csv")
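As a quick sanity check (not part of the original workflow), the file can be read back into the session with read_csv(), which comes from readr, the same package that supplies write_csv():

read_csv("data/original/elpais_webpage.csv") # confirm the data round-trips cleanly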

Putting it all together

At this point you may be thinking, ‘Great, I can download data from a single page, but what about downloading multiple pages?’ Good question. That's really where the strength of a programming approach takes hold. Extracting information from multiple pages is not fundamentally different from working with a single page, but it does require more sophisticated code. I will not document that code in this post, but you are encouraged to download the GitHub repository, which contains the working code, and peruse the functions/acquire_functions.R script to see the details and replicate the processing covered here. I will, however, give you the gist of the steps taken to scrape multiple pages from the El País website.

As I mentioned earlier in this section, the first step in any web scrape is to investigate the structure of the site and page(s) we want to scrape. The El País site is organized such that each article is ‘tagged’ with some meta-category. After doing some browsing on their site, I discovered a searchable archive page that lists all the ‘tags’ used on the site. Selecting a tag brings up a paginated interface listing all of the articles associated with that tag.


Figure 4: El País archives page for the politica tag.

In a nutshell, the approach is to leverage these archives to harvest links to article pages with a specific tag, download the content from those links, and then organize and write the data to disk in CSV format. In more detail, here are the concrete steps along with the custom functions I wrote to accomplish each:

  1. Get the total number of archive pages available.

Includes an optional argument sample_size to specify the number of archive pages to harvest links from. The default is 1.

get_archive_pages <- function(tag_name, sample_size = 1) {
  # Function: Scrape tag main page and return selected number of archive pages
  url <- paste0("https://elpais.com/tag/", tag_name)
  html <- read_html(url) # load html from selected url
  pages_available <- 
    html %>% # pass html
    html_node("li.paginacion-siguiente a") %>% # isolate 'next page' link
    html_attr("href") %>% # extract 'next page' link
    str_extract("\\d+$") %>% # extract the numeric value (num pages of links) in link
    as.numeric() + 1 # convert to a numeric vector and add 1 (to include first page)
  cat(pages_available, "pages available for the", tag_name, "tag.\n")
  archive_pages <- paste0(url, "/a/", (pages_available - (sample_size - 1)):pages_available) # compile urls
  cat(sample_size, "pages selected.\n")
  return(archive_pages)
}
  2. Harvest the links to the content pages.

The str_replace() function from the stringr library is used here to create valid URLs by replacing the // with https:// in the links harvested directly from the webpage.

get_content_links <- function(url) {
  # Function: Scrape the content links from a tag archive page
  html <- read_html(url) # load html from selected url
  urls <- 
    html %>% # pass html
    html_nodes("h2.articulo-titulo a") %>% # isolate links
    html_attr("href") %>% # extract urls
    str_replace(pattern = "//", replacement = "https://") # create valid urls
  cat(length(urls),"content links scraped from tag archives.\n")
  return(urls)
}
  3. Get the content for a given link and organize it into tabular format.

A conditional statement is included to identify webpages with no text content. All pages have a boilerplate paragraph, so pages with a text vector of length greater than one will be content pages.

get_content <- function(url) {
  # Function: Scrape the title, author, date, and text from a provided
  # content link. Return as a tibble/data.frame
  cat("Scraping:", url, "\n")
  html <- read_html(url) # load html from selected url
  
  # Title
  title <- 
    html %>% # pass html
    html_node("h1.articulo-titulo") %>% # isolate title
    html_text(trim = TRUE) # extract title and trim whitespace
  
  # Author
  author <- 
    html %>% # pass html
    html_node("span.autor-nombre") %>% # isolate author
    html_text(trim = TRUE) # extract author and trim whitespace
  
  # Date
  date <- 
    html %>% # pass html
    html_nodes("div.articulo-datos time") %>% # isolate date
    html_attr("datetime") # extract date
  
  # Text
  text <- 
    html %>% # pass html
    html_nodes("div.articulo-cuerpo p") %>% # isolate text by paragraph
    html_text(trim = TRUE) # extract paragraphs and trim whitespace
  
  # Check to see if the article is text based
  # - only one paragraph suggests a non-text article (cartoon/ video/ album)
  if (length(text) > 1) { 
    # Create tibble/data.frame
    return(tibble(url, title, author, date, text, paragraph = (1:length(text))))
  } else {
    message("Non-text based article. Link skipped.")
    return(NULL)
  }
}
  4. Write the tabular data to disk.

Here I've added the same code we used in the previous data acquisition methods in this post to create a target directory before writing the file.

write_content <- function(content, target_file) {
  # Function: Write the tibble content to disk. Create the directory if
  # it does not already exist.
  target_dir <- dirname(target_file) # identify target file directory structure
  dir.create(path = target_dir, recursive = TRUE, showWarnings = FALSE) # create directory
  write_csv(content, target_file) # write csv file to target location
  cat("Content written to disk!\n")
}

These functions each perform a task in our workflow and can be joined together to do our web scrape. To make this workflow maximally efficient I've wrapped them, along with a conditional statement to avoid re-downloading a resource, in a function named download_elpais_tag(). I've also added the map() function to our workflow at a couple of key points. map() takes an object and iterates over each of its elements. Since get_content_links() and get_content() each operate on an object with a single element, we need them to be applied iteratively to objects with multiple elements. After map() has applied the function to each element, the results need to be joined. The results from map(get_content_links) form a list of character vectors, so combine() is the appropriate function to flatten them into a single vector; map(get_content) returns a list of tibble data frames, so we use bind_rows() to join the results into a single data frame.
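To make the pattern concrete before turning to the wrapper, here is a toy sketch of map() followed by bind_rows(), with made-up values (assuming purrr, dplyr, and tibble are loaded); the printed result will look roughly like this:

map(1:3, function(x) tibble(id = x, squared = x^2)) %>% # build one small tibble per element
  bind_rows() # stack the list of tibbles into a single data frame
## # A tibble: 3 x 2
##      id squared
##   <int>   <dbl>
## 1     1       1
## 2     2       4
## 3     3       9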

download_elpais_tag <- function(tag_name, sample_size = 1, target_file, force = FALSE) {
  # Function: Download articles from elpais.com based on tag name. Select
  # number of archive pages to consult, then scrape and write the content 
  # to disk. If the target file exists, do not download again.
  if(!file.exists(target_file) | force == TRUE) {
    cat("Downloading data.\n")
    get_archive_pages(tag_name, sample_size) %>% # select tag archive pages
      map(get_content_links) %>% # get content links from pages sampled
      combine() %>% # combine the results as a single vector
      map(get_content) %>% # get the content for each content link
      bind_rows() %>% # bind the results as a single tibble
      write_content(target_file) # write content to disk
  } else {
    cat("Data already downloaded!\n")
  }
}

Adding these functions, including download_elpais_tag(), to the functions/acquire_functions.R script in our project management template, and then sourcing that script from the acquire_data.R script in the code/ directory, makes the whole workflow available in our analysis code.
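In acquire_data.R, that sourcing step might look something like this (a sketch; the relative path assumes the project template described in earlier posts):

source("functions/acquire_functions.R") # make the custom scraping functions available

With the functions sourced, we can use download_elpais_tag() like so: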

# Scrape archives of the Spanish news site elpais.com by tag
# To search for valid tags: https://elpais.com/tag/listado/
download_elpais_tag(tag_name = "politica", 
                    target_file = "data/original/elpais/political_articles.csv")
90554 pages available for the politica tag.
1 pages selected.
22 content links scraped from tag archives.
Scraping: https://elpais.com/deportes/2017/10/20/actualidad/1508510590_014924.html 
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508506425_813840.html 
Scraping: https://elpais.com/internacional/2017/10/20/actualidad/1508503663_430515.html 
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508507460_569874.html 
Scraping: https://elpais.com/cultura/2017/10/20/actualidad/1508488913_681643.html 
Scraping: https://elpais.com/internacional/2017/10/20/actualidad/1508506096_337991.html 
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508503572_812343.html 
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508488656_838766.html 
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508489106_542799.html 
Scraping: https://elpais.com/ccaa/2017/10/19/valencia/1508445805_457854.html 
Scraping: https://elpais.com/elpais/2017/10/20/album/1508487891_134872.html 
Non-text based article. Link skipped.
Scraping: https://elpais.com/ccaa/2017/10/20/catalunya/1508492661_274873.html 
Scraping: https://elpais.com/elpais/2017/10/19/ciencia/1508412461_971020.html 
Scraping: https://elpais.com/ccaa/2017/10/20/andalucia/1508499080_565687.html 
Scraping: https://elpais.com/ccaa/2017/10/20/catalunya/1508495565_034721.html 
Scraping: https://elpais.com/cultura/2017/10/19/actualidad/1508403967_099974.html 
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508496322_284364.html 
Scraping: https://elpais.com/economia/2017/10/19/actualidad/1508431364_731058.html 
Scraping: https://elpais.com/elpais/2017/10/20/album/1508491490_512616.html 
Non-text based article. Link skipped.
Scraping: https://politica.elpais.com/politica/2017/10/20/actualidad/1508481079_647952.html 
Scraping: https://elpais.com/ccaa/2017/10/20/valencia/1508493387_961965.html 
Scraping: https://elpais.com/economia/2017/10/20/actualidad/1508492104_302263.html 
Content written to disk!

I applied the function to the tag gastronomia (gastronomy) in the same fashion. The results are stored in the data/original/ directory. Our complete data structure for this post looks like this:

data
├── derived
└── original
    ├── elpais
    │   ├── gastronomy_articles.csv
    │   └── political_articles.csv
    ├── gutenberg
    │   ├── works_pq.csv
    │   └── works_pr.csv
    ├── sbc
    │   ├── meta-data
    │   └── transcriptions
    └── scs
        ├── README
        ├── discourse
        ├── disfluency
        ├── tagged
        ├── timed-transcript
        └── transcript

8 directories, 10 files

Getting text from other formats

As a final note, it is worth pointing out that machine-readable data for analysis is often trapped in other formats such as Word or PDF files. R provides packages for working with these formats and can extract the text programmatically: see antiword for Word files and pdftools for PDF files. In the case that a PDF is an image that needs OCR (Optical Character Recognition), you can experiment with the tesseract package. It is important to be aware, however, that recovering plain text from these formats can often result in conversion artifacts, especially with OCR. Not to worry: we can still work with the data, it just might mean more pre-processing before we get to our analysis.
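None of these packages are demonstrated in this post, but a rough sketch of the basic calls might look like this (the file paths are hypothetical):

pacman::p_load(pdftools, antiword, tesseract) # install/ load text-extraction packages
pdf_text("data/original/report.pdf") # returns one character element per PDF page
antiword("data/original/notes.doc") # extracts plain text from a Word (.doc) file
ocr("data/original/scanned_page.png") # runs OCR on an image and returns the recognized text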

Round up

In this post we covered scraping language data from the web. The rvest package provides a host of functions for downloading and parsing HTML. We first looked at a toy example to get a basic understanding of how HTML works and then applied this knowledge to a practical example. To maintain a reproducible workflow, the code developed in this example was grouped into task-oriented functions, which were in turn joined and wrapped into a single function that provides convenient access to our workflow and avoids unnecessary downloads (in case the data already exists on disk).

Here we have built on previously introduced R coding concepts and demonstrated various others. Web scraping often requires more knowledge of and familiarity with R as well as other web technologies. Rest assured, however, practice will increase confidence in your abilities. I encourage you to practice on your own with other websites. You will encounter problems. Consult the R documentation in RStudio or online and lean on the R community on the web at sites such as StackOverflow.

At this point you have both a bird’s eye view of the data available on the web and strategies on how to access a great majority of it. It is now time to turn to the next step in our data analysis project: data curation. In the next posts I will cover how to wrangle your raw data into a tidy dataset. This will include working with and incorporating meta-data as well as augmenting a dataset with linguistic annotations.



  1. tibble objects are data.frame objects with some added bells and whistles that we won't get into here.
