Scrapeover Friday — a.k.a. Another R Scraping Makeover


I caught a glimpse of a tweet by @dataandme on Friday.

Mara is — without a doubt — the best data science promoter in the Twitterverse. She seems to have her finger on the pulse of everything that’s happening in the data science world and is one of the most ardent amplifiers there is.

The post she linked to was a bit older (2015) and had a very “stream of consciousness” feel to it. I actually wish more R folks took to their blogs like this to post their explorations into various topics; the practice ultimately helps both you and others. The code in the post likely worked at the time it was published and accomplished the desired goal (which means it was ultimately decent code).

Makeover Time

As I’ve noted before, web scraping has some rules, even though they can be tough to find. This post made a very common mistake of not putting in a time delay between requests (a cardinal scraping rule), which we’ll fix in a moment.

There are a few other optimizations we can make. The first is moving from a for loop to something a bit more vectorized. Another is to use information in the first set of results to figure out how many pages we need to scrape.

However, an even bigger one is to take advantage of the underlying XHR POST request that the new version of the site ultimately makes (it appears the site has undergone some changes since the blog post, so it’s unlikely the code in the post still works).

Let’s start by setting up a function to grab individual pages:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

get_page <- function(i=1, pb=NULL) {

  # advance the progress bar if one was supplied
  if (!is.null(pb)) pb$tick()$print()

  # mirror the XHR POST the site itself makes; only `page` changes between calls
  POST(url = "http://www.propwall.my/wp-admin/admin-ajax.php",
       body = list(action = "star_property_classified_list_change_ajax",
                   tab = "Most Relevance",
                   page = as.integer(i), location = "Mont Kiara",
                   category = "", listing = "For Sale",
                   price = "", keywords = "Mont Kiara, Kuala Lumpur",
                   filter_id = "17", filter_type = "Location",
                   furnishing = "", builtup = "",
                   tenure = "", view = "list",
                   map = "on", blurb = "0"),
       encode = "form") -> res

  stop_for_status(res)

  # the response body is HTML, so parse it into something rvest can work with
  res <- content(res, as="parsed")

  # be kind to the server: pause 0-2 seconds (in 0.5s steps) between requests
  Sys.sleep(sample(seq(0, 2, 0.5), 1))

  res

}

The i parameter gets passed into the body of the POST request. You can find that XHR POST request via the Network tab of your browser’s Developer Tools view. You can either transcribe it by hand or use the curlconverter package (which is temporarily off CRAN, so you’ll need to get it from GitHub) to auto-convert it to an httr::VERB request.
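If you go the curlconverter route, the flow is roughly: copy the request as cURL from the Network tab, then hand it to the package. Here is a minimal sketch, assuming the straighten()/make_req() interface from the GitHub version of the package, with the cURL string trimmed way down for illustration (the real one will carry all the headers and form fields):

# devtools::install_github("hrbrmstr/curlconverter")
library(curlconverter)

# paste the full "Copy as cURL" string from the Network tab here (shortened for illustration)
curl_cmd <- "curl 'http://www.propwall.my/wp-admin/admin-ajax.php' --data 'action=star_property_classified_list_change_ajax&page=1'"

req <- straighten(curl_cmd)   # parse the cURL command into its components
fns <- make_req(req)          # build httr request function(s) from it
# fns[[1]]() would then issue the same POST via httr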

We also add a parameter (defaulting to NULL) to support the use of a progress bar (so we can see what’s going on). If we pass in a populated dplyr progress bar, this will tick it along for us.
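If you haven’t used dplyr progress bars before, the tick/print pattern on its own looks like this (a standalone sketch, separate from the scraping code):

pb <- progress_estimated(5)     # we know up front how many iterations there will be

purrr::walk(1:5, function(i) {
  pb$tick()$print()             # advance the bar and redraw it
  Sys.sleep(0.2)                # stand-in for real work
})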

Now, we can use that to get the total number of listings.

# grab page 1, find the "Classifieds: N,NNN" link and pull out the total count
get_page(1) %>% 
  html_node(xpath=".//a[contains(., 'Classifieds:')]") %>% 
  html_text() %>% 
  stri_match_last_regex("([[:digit:],]+)$") %>% 
  .[,2] %>% 
  stri_replace_all_fixed(",", "") %>% 
  as.numeric() -> classified_ct

# the site returns 20 listings per page (hence the %/% 20)
total_pages <- 1 + (classified_ct %/% 20)
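A quick aside on that page math (with made-up counts, purely for illustration): the 1 + %/% form covers every listing but requests one extra, empty page whenever the total is an exact multiple of 20; ceiling() sidesteps that edge case.

1 + (2457 %/% 20)    # 123 pages; the last page holds only 17 listings
1 + (2400 %/% 20)    # 121 -- one empty page too many
ceiling(2400 / 20)   # 120 -- ceiling() handles the exact-multiple case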

We’ll set up another function to extract the listing URLs and titles:

get_listings <- function(pg) {
  # each listing title link sits under an h4.media-heading header in the results list
  data_frame(
    link = html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)") %>% html_attr("href"),
    description = html_nodes(pg, "div#list-content > div.media * h4.media-heading > a:nth-of-type(1)") %>% html_text(trim = TRUE)
  )
}

Rather than chain calls to html_nodes() we take advantage of well-formed CSS selectors (which ultimately get auto-translated to XPath strings). This has the advantage of speed (though that’s not necessarily an issue when web scraping) as well as brevity.
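That translation is done by the selectr package, which rvest uses under the hood. If you’re curious what a given CSS selector turns into, you can ask it directly (output elided; the selector here is a slightly simplified version of the one above):

selectr::css_to_xpath("div#list-content > div.media h4.media-heading > a", prefix = "//")
# returns the equivalent XPath expression as a character string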

Now, we’ll scrape all the listings:

pb <- progress_estimated(total_pages)
listings_df <- map_df(1:total_pages, ~get_listings(get_page(.x, pb)))

Yep. That’s it. Everything’s been neatly abstracted into functions and we’ve taken advantage of some modern R idioms to accomplish our first task.

FIN

With the above code you should be able to do your own makeover of the remaining code in the original post. Remember to:

  • add a delay when you sequentially scrape pages from a site
  • abstract out common operations into functions
  • take advantage of purrr functions (or built-in *apply functions) to avoid for loops (see the sketch below)
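On that last point: if you’d rather stay in base R, the map_df() call above has a direct *apply equivalent, sketched here using the same get_page()/get_listings() helpers:

# base-R flavour of the purrr::map_df() call
listings_list <- lapply(seq_len(total_pages), function(i) get_listings(get_page(i)))
listings_df   <- do.call(rbind, listings_list)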

I’ll close with a note about adhering to site terms of service / terms and conditions. Nothing I found when searching for ToS/ToC on the site suggested that scraping, automated grabbing or use of the underlying data in bulk was prohibited. Many sites have such restrictions — like IMDB (I mention that as it’s been used a lot lately by R folks and it really shouldn’t be). LinkedIn recently sued scrapers for such ToS violations.

I fundamentally believe violating ToS is unethical behavior and should be avoided on those grounds alone. When I come across sites I need information from that have restrictive ToS, I contact the site owner (when I can find them) and ask for permission, and I have only been refused a small handful of times. Given those recent legal actions, it’s also better to be safe than sorry.
