Web Scraping Amazon Reviews (March 2019)


Introduction

Web scraping is one of the most common (and sometimes tedious) data collection tasks in data science today. It is an essential step in gathering data, especially text data, for Natural Language Processing tasks such as Sentiment Analysis, Topic Modeling, and Word Embedding. In this post, we explore how to use R to automate scraping and easily pull data from web pages.

Web scraping works by selecting certain elements or paths of a given webpage and extracting the parts of interest (also known as parsing). A simple example of web scraping in R can be found in this awesome blog post on R-bloggers.

In this post we will be scraping reviews from Amazon, specifically reviews for the DVD / Blu-ray of the 2018 film Venom. In order to scrape data for a specific product, we first need its ASIN code. The ASIN code for a product is typically found within the URL of the product page (e.g. https://www.amazon.com/Venom-Blu-ray-Tom-Hardy/dp/B07HSKPDBV). Using the product code B07HSKPDBV, let’s scrape the product name from Amazon. The URLs of Amazon’s product pages are easy to build: simply concatenate the ASIN code to the “base” URL, as in https://www.amazon.com/dp/B07HSKPDBV.

[Screenshot: the Venom product page on Amazon]

We build the URL and point to a specific node, #productTitle, of the HTML page using its CSS selector (read about CSS selectors and how to obtain them with the SelectorGadget here). Finally, we clean the parsed text to obtain just the product name:

# Install / Load relevant packages
if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
pacman::p_load(rvest, dplyr, tidyr, stringr)

# Venom product code
prod_code <- "B07HSKPDBV"

url <- paste0("https://www.amazon.com/dp/", prod_code)
doc <- read_html(url)

# Obtain the text in the node, remove "\n" from the text, and trim whitespace
prod <- html_nodes(doc, "#productTitle") %>% 
  html_text() %>% 
  gsub("\n", "", .) %>% 
  trimws()

prod

## [1] "Venom"

With this simple code, we were able to obtain the product name for this ASIN.
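As an aside, the same node can be targeted with an XPath query instead of a CSS selector, since rvest's html_nodes() also accepts an xpath argument. A minimal equivalent sketch:

# Equivalent extraction using an XPath query instead of a CSS selector
prod_xpath <- html_nodes(doc, xpath = "//*[@id='productTitle']") %>%
  html_text() %>%
  gsub("\n", "", .) %>%
  trimws()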

Now, we want to grab all the reviews of this product, and combine them all into a nice single data.frame. Below is an R function to scrape various elements from a web page:

# Function to scrape elements from Amazon reviews
scrape_amazon <- function(url, throttle = 0){

  # Install / Load relevant packages
  if(!"pacman" %in% installed.packages()[,"Package"]) install.packages("pacman")
  pacman::p_load(RCurl, XML, dplyr, stringr, rvest, purrr)

  # Set throttle between URL calls
  sec <- 0
  if(throttle < 0) warning("throttle was less than 0: set to 0")
  if(throttle > 0) sec <- max(0, throttle + runif(1, -1, 1))

  # Pause for the throttle duration, then obtain HTML of URL
  Sys.sleep(sec)
  doc <- read_html(url)

  # Parse relevant elements from HTML
  title <- doc %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()

  author <- doc %>%
    html_nodes("#cm_cr-review_list .a-profile-name") %>%
    html_text()

  date <- doc %>%
    html_nodes("#cm_cr-review_list .review-date") %>%
    html_text() %>% 
    gsub(".*on ", "", .)

  review_format <- doc %>% 
    html_nodes(".review-format-strip") %>% 
    html_text() 

  stars <- doc %>%
    html_nodes("#cm_cr-review_list  .review-rating") %>%
    html_text() %>%
    str_extract("\\d") %>%
    as.numeric() 

  comments <- doc %>%
    html_nodes("#cm_cr-review_list .review-text") %>%
    html_text() 

  # Helpful votes: strip surrounding text, convert "One" to "1", then keep
  # the leading number (e.g. "12 people found this helpful" -> 12)
  suppressWarnings(n_helpful <- doc %>%
    html_nodes(".a-expander-inline-container") %>%
    html_text() %>%
    gsub("\n\n \\s*|found this helpful.*", "", .) %>%
    gsub("One", "1", .) %>%
    map_chr(~ str_split(string = .x, pattern = " ")[[1]][1]) %>%
    as.numeric())

  # Combine attributes into a single data frame
  df <- data.frame(title, author, date, review_format, stars, comments, n_helpful, stringsAsFactors = F)

  return(df)
}

Let’s use this function on the first page of reviews.

# load DT package
pacman::p_load(DT)

# run scraper function
url <- "http://www.amazon.com/product-reviews/B07HSKPDBV/?pageNumber=1"
reviews <- scrape_amazon(url)

# display data
str(reviews)

## 'data.frame':    8 obs. of  7 variables:
##  $ title        : chr  "Large amounts of fun" "Nobody channels Eddie Brock / Venom like Tom Hardy. Nobody!" "What a stinker - easily the wost movie I have ever rented on Amazon and I deeply regret it." "Excellent!" ...
##  $ author       : chr  "MacJunegrand" "kiwijinxter" "Charles Schoch" "H. Tague" ...
##  $ date         : chr  "October 24, 2018" "October 12, 2018" "December 20, 2018" "December 11, 2018" ...
##  $ review_format: chr  "Format: Prime Video" "Format: Blu-ray" "Format: Prime VideoVerified Purchase" "Format: Prime VideoVerified Purchase" ...
##  $ stars        : num  5 5 1 5 1 1 1 1
##  $ comments     : chr  "Movie critics have a though job, I get it. They have to watch lots of movies and not for pleasure, but as a job"| __truncated__ "As someone who had hundreds of Spider-Man comics when I was younger (I even owned a copy of Amazing Spider-Man "| __truncated__ "This movie was bad. How bad?  It is the only rental I have memory of where I deeply regret spending the 6 bucks"| __truncated__ "Been a huge fan of Venom since the 90s and been waiting for a good live action adaptation of him ever since and"| __truncated__ ...
##  $ n_helpful    : num  392 80 38 29 23 16 16 17

As you can see, this function obtains the Title, Author, Date, Review Format, Stars, Comments, and the number of customers who found the review helpful. (Note that by modifying the function above, you can also extract additional fields as desired.)
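For example, since the review_format strings embed the text "Verified Purchase" when applicable (visible in the output above), a logical flag is one line away. This is a small sketch of my own, not part of the original function:

# Flag reviews whose format string contains "Verified Purchase"
reviews$verified <- grepl("Verified Purchase", reviews$review_format)
table(reviews$verified)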

Let’s now loop this function over 100 pages of reviews to bulk-scrape more of them. Each page contains 8 to 10 reviews (this varies by product); in this case, there are 8 reviews per page, so looping over 100 pages yields 800 reviews. We also set a throttle of 3 seconds, which forces the function to pause for 3 seconds (plus roughly Uniform(-1, 1) seconds) between calls, so as not to trigger Amazon’s bot detection.

# Set # of pages to scrape. Note: each page contains 8 reviews.
pages <- 100

# create empty object to write data into
reviews_all <- NULL

# loop over pages
for(page_num in 1:pages){
  url <- paste0("http://www.amazon.com/product-reviews/",prod_code,"/?pageNumber=", page_num)
  reviews <- scrape_amazon(url, throttle = 3)
  reviews_all <- rbind(reviews_all, cbind(prod, reviews))
}

str(reviews_all)

## 'data.frame':    800 obs. of  8 variables:
##  $ prod         : chr  "Venom" "Venom" "Venom" "Venom" ...
##  $ title        : chr  "Large amounts of fun" "Nobody channels Eddie Brock / Venom like Tom Hardy. Nobody!" "What a stinker - easily the wost movie I have ever rented on Amazon and I deeply regret it." "Excellent!" ...
##  $ author       : chr  "MacJunegrand" "kiwijinxter" "Charles Schoch" "H. Tague" ...
##  $ date         : chr  "October 24, 2018" "October 12, 2018" "December 20, 2018" "December 11, 2018" ...
##  $ review_format: chr  "Format: Prime Video" "Format: Blu-ray" "Format: Prime VideoVerified Purchase" "Format: Prime VideoVerified Purchase" ...
##  $ stars        : int  5 5 1 5 1 1 1 1 5 5 ...
##  $ comments     : chr  "Movie critics have a though job, I get it. They have to watch lots of movies and not for pleasure, but as a job"| __truncated__ "As someone who had hundreds of Spider-Man comics when I was younger (I even owned a copy of Amazing Spider-Man "| __truncated__ "This movie was bad. How bad?  It is the only rental I have memory of where I deeply regret spending the 6 bucks"| __truncated__ "Been a huge fan of Venom since the 90s and been waiting for a good live action adaptation of him ever since and"| __truncated__ ...
##  $ n_helpful    : int  392 80 38 29 23 16 16 17 11 10 ...
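One caveat: as written, a single failed request (a timeout, a CAPTCHA page, or a product with fewer than 100 pages of reviews) aborts the entire loop. A minimal variation, my own sketch rather than part of the original post, wraps each call in tryCatch() so that bad pages are skipped, binding the results once at the end:

# Robust version of the loop: skip pages that error out instead of aborting
reviews_list <- vector("list", pages)
for(page_num in 1:pages){
  url <- paste0("http://www.amazon.com/product-reviews/", prod_code, "/?pageNumber=", page_num)
  reviews_list[[page_num]] <- tryCatch(
    cbind(prod, scrape_amazon(url, throttle = 3)),
    error = function(e){
      message("Skipping page ", page_num, ": ", conditionMessage(e))
      NULL  # failed pages contribute no rows
    }
  )
}
reviews_all <- do.call(rbind, reviews_list)  # NULL entries drop out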

Either way, we end up with 800 reviews for Venom. Happy analyzing!
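Before diving into analysis, a couple of quick sanity checks are worthwhile (again a sketch; the date parsing assumes an English locale):

# Parse date strings such as "October 24, 2018" into Date objects
reviews_all$date <- as.Date(reviews_all$date, format = "%B %d, %Y")

# Distribution of star ratings across the scraped reviews
table(reviews_all$stars)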

To see what you can do with text data, check out my other posts on Natural Language Processing.
