Web Scraping With R – Easier Than Python


A good dataset is difficult to find. That’s expected, but it’s nothing to fear. Techniques like web scraping enable us to fetch data from just about anywhere at any time – if you know how. Today we’ll explore just how easy it is to scrape web data with R, and we’ll do it through R Shiny’s friendly GUI.

So, what is web scraping? In a nutshell, it’s a technique for gathering data from websites. One might use it when:

  • There’s no dataset available for the needed analysis
  • There’s no public API available

Still, you should always check the site’s policy on web scraping, along with this article on Ethics in web scraping. After that, you should be able to use common sense to decide whether scraping is worth it.

If it feels wrong, don’t do it.
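One concrete first step is checking the site’s robots.txt file, which describes what the site allows. The robotstxt package (an extra dependency, not used in the rest of this post – treat this as an optional sketch) can query it in one line:

library(robotstxt)

# Returns TRUE if the path may be scraped according to the site's robots.txt
paths_allowed('http://books.toscrape.com/')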

Luckily, some websites are made entirely for practicing web scraping. One of them is books.toscrape.com, which, as the name suggests, lists made-up books in various genres:

[Image: the Books to Scrape website]

So, let’s scrape the bastard next, shall we? 

Plan of attack

Open up the website, click on any two categories in the sidebar on the left, and inspect the URLs. Here’s our pick:

http://books.toscrape.com/catalogue/category/books/travel_2/index.html
http://books.toscrape.com/catalogue/category/books/mystery_3/index.html

What do these URLs have in common? Everything except the category slug – travel_2 and mystery_3, respectively. No idea what’s the deal with the numbering, but it is what it is.
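Since only that slug changes, assembling a category URL is trivial – here’s a quick sketch (the variable names are my own):

base_url <- 'http://books.toscrape.com/catalogue/category/books/'
category <- 'travel_2'  # swap in any category slug
url <- paste0(base_url, category, '/index.html')

Every page contains a list of books, and a single book looks like this: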

[Image: a single book listing]

Our job is to grab the information for every book in a category. Doing this requires a bit of HTML knowledge, but it’s a simple markup language, so I don’t see a problem there.

We want to scrape:

  • Title – h3 > a > title property
  • Rating – p.star-rating > class attribute
  • Price – div.product_price > div.price_color > text
  • Availability – div.product_price > div.instock > text
  • Book URL – div.image_container > a > href property
  • Thumbnail URL – div.image_container > img > src property

You know everything you need now, so let’s start scraping.

Scraping books

The `rvest` package is used in R to perform web scraping tasks. It feels very similar to `dplyr`, the well-known data analysis package, thanks to its use of the pipe operator and its overall behavior. We know how to get to specific elements, but how do we implement that logic in R?

Here’s an example of how to scrape book titles in the travel category:

library(rvest)

url <- 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'
titles <- read_html(url) %>% 
  html_nodes('h3') %>%
  html_nodes('a') %>% 
  html_text()

[Image: the scraped titles printed to the console]
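One caveat: `html_text()` grabs the link text, and the site truncates longer titles there. The plan above lists the title property for a reason – if you want the full titles, swap `html_text()` for `html_attr()`:

titles <- read_html(url) %>% 
  html_nodes('h3') %>%
  html_nodes('a') %>% 
  html_attr('title')  # the title attribute holds the untruncated title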

Wasn’t that easy? We can similarly scrape everything else. Here’s the script: 

library(rvest)
library(stringr)

url <- 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'
page <- read_html(url)  # download the page once and reuse it below

# Titles - the link text inside each h3
titles <- page %>% 
  html_nodes('h3') %>%
  html_nodes('a') %>% 
  html_text()

# Book URLs - hrefs are relative, so collapse the '../../../' prefix
urls <- page %>%
  html_nodes('.image_container') %>% 
  html_nodes('a') %>% 
  html_attr('href') %>% 
  str_replace_all(fixed('../../../'), '/')

# Thumbnail URLs
imgs <- page %>% 
  html_nodes('.image_container') %>%
  html_nodes('img') %>%
  html_attr('src') %>%
  str_replace_all(fixed('../../../../'), '/')

# Ratings are encoded in the class attribute, e.g. 'star-rating Three'
ratings <- page %>% 
  html_nodes('p.star-rating') %>% 
  html_attr('class') %>% 
  str_replace_all('star-rating ', '')

# Prices
prices <- page %>% 
  html_nodes('.product_price') %>% 
  html_nodes('.price_color') %>% 
  html_text()

# Availability, with surrounding whitespace trimmed
availability <- page %>% 
  html_nodes('.product_price') %>% 
  html_nodes('.instock') %>% 
  html_text() %>% 
  str_trim()

Awesome! As a final step, let’s glue all of this together into a single data frame:

scraped <- data.frame(
  Title = titles, 
  URL = urls, 
  SourceImage = imgs, 
  Rating = ratings, 
  Price = prices, 
  Availability = availability
)

[Image: the scraped data frame]
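If you’d like to keep the results around, one more line (the filename is arbitrary) writes the data frame to disk:

write.csv(scraped, 'books_travel.csv', row.names = FALSE)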

We could end the article here and call it a day, but there’s something else we can build from this – a simple, easy-to-use web application. Let’s do that next.

Web application for scraping

R has a fantastic library for web/dashboard development – `Shiny`. It’s far easier to use than anything similar in Python, so we’ll stick with it. To start, create a new R file and paste the following code inside:

library(shiny)
library(rvest)
library(stringr)
library(glue)

ui <- fluidPage()

server <- function(input, output) {}

shinyApp(ui=ui, server=server)

This is the boilerplate every Shiny app requires. Next, let’s style the UI. We’ll need:

  • Title – just a big, bold text on top of everything (optional)
  • Sidebar – contains a dropdown menu to select a book genre
  • Central area – displays a table output once the data gets scraped

Here’s the code:

ui <- fluidPage(
  column(12, tags$h2('Real-time web scraper with R')),
  sidebarPanel(
    width=3,
    selectInput(
      inputId='genreSelect',
      label='Genre',
      choices=c('Business', 'Classics', 'Fiction', 'Horror', 'Music'),
      selected='Business'
    )
  ),
  mainPanel(
    width=9,
    tableOutput('table')
  )
)

Next, we need to configure the server function. It has to remap our nicely formatted inputs to a URL portion (e.g., ‘Business’ to ‘business_35’) and scrape data for the selected genre. We already know how to do both. Here’s the code for the server function:

server <- function(input, output) {
  output$table <- renderTable({
    mappings <- c('Business' = 'business_35', 'Classics' = 'classics_6', 'Fiction' = 'fiction_10',
                  'Horror' = 'horror_31', 'Music' = 'music_14') 
    
    url <- glue('http://books.toscrape.com/catalogue/category/books/', mappings[input$genreSelect], '/index.html')
    page <- read_html(url)  # download the selected category page once
    
    titles <- page %>% 
      html_nodes('h3') %>%
      html_nodes('a') %>% 
      html_text()
    
    urls <- page %>%
      html_nodes('.image_container') %>% 
      html_nodes('a') %>% 
      html_attr('href') %>% 
      str_replace_all(fixed('../../../'), '/')
    
    imgs <- page %>% 
      html_nodes('.image_container') %>%
      html_nodes('img') %>%
      html_attr('src') %>%
      str_replace_all(fixed('../../../../'), '/')
    
    ratings <- page %>% 
      html_nodes('p.star-rating') %>% 
      html_attr('class') %>% 
      str_replace_all('star-rating ', '')
    
    prices <- page %>% 
      html_nodes('.product_price') %>% 
      html_nodes('.price_color') %>% 
      html_text()
    
    availability <- page %>% 
      html_nodes('.product_price') %>% 
      html_nodes('.instock') %>% 
      html_text() %>% 
      str_trim()
    
    data.frame(
      Title = titles, 
      URL = urls, 
      SourceImage = imgs, 
      Rating = ratings, 
      Price = prices, 
      Availability = availability
    )
  })
}
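Remember to keep the shinyApp() call from the boilerplate at the bottom of the file – it’s what actually launches the app:

shinyApp(ui=ui, server=server)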

And that’s it – we can run the app now and inspect the behavior!

[Image: the final Shiny app]

Just what we wanted – simple, but still entirely understandable. Let’s wrap things up in the next section.

Parting words

In only a couple of minutes, we went from zero to a working web scraping application. The options for scaling it are endless – add more categories, work on the visuals, include more data, format the data more nicely, add filters, and so on.

I hope you’ve managed to follow along and that you can see the power of web scraping. This was a dummy website and a dummy example, but the approach stays the same regardless of the data source.

Thanks for reading.

Join my private email list for more helpful insights.
