Webscraping Aliexpress with Rselenium

Posted on November 17, 2020 by Aurelien Callens in R bloggers | 0 Comments

[This article was first published on R on Aurélien Callens, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today, I am going to show you how to scrape product prices from Aliexpress website.

A few words on web scraping

Before diving into the subject, you should be aware that web scraping is not allowed on certain websites. To know if it is the case for the website you want to scrape, I invit you to check the robots.txt page which should be located at the root of the website address. For Aliexpress this page is located here : www.aliexpress.com/robots.txt .

This page indicates that webscrapping and crawling are not allowed on several page categories such as /bin/*, /search/*, /wholesale* for example. Fortunately for us, the /item/* category, where the product pages are stored, can be scraped.

RSelenium

Installation for Ubuntu 18.04 LTS

The installation for RSelenium was not as easy as expected and I encountered two errors.

The first error I got after I installed the package and tried the function Rsdriver was :

Error in curl::curl_fetch_disk(url, x$path, handle = handle) :
Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.

Thanks to this post, I installed the missing package : stringi.

Once this error was addressed, I had a different one :

Error: Invalid or corrupt jarfile /home/aurelien/.local/share/binman_seleniumserver/generic/4.0.0-alpha-2/selenium-server-standalone-4.0.0-alpha-2.jar

This time the problem came from a corrupted file. Thanks to this post, I knew that I just had to download this file selenium-server-standalone-4.0.0-alpha-2.jar from the official selenium website and replace the corrupted file with it.

I hope this will help some of you to install RSelenium with Ubuntu 18.04 LTS !

Opening a web browser

After addressing the errors above, I can now open a firefox browser :

library(RSelenium)

#Open a firefox driver
rD <- rsDriver(browser = "firefox") 
remDr <- rD[["client"]]

Logging in Aliexpress

The first step to scrape product prices on Aliexpress is to log in into your account:

log_id <- "Your_mail_adress"
password <- "Your_password"

# Navigate to aliexpress login page 
remDr$navigate("https://login.aliexpress.com/")

# Fill the form with mail address
remDr$findElement(using = "id", "fm-login-id")$sendKeysToElement(list(log_id))

# Fill the form with password
remDr$findElement(using = 'id', "fm-login-password")$sendKeysToElement(list(password))

#Submit the login form by clicking Submit button
remDr$findElement("class", "fm-button")$clickElement()

Navigating through the URLs and scraping the prices

Now we have to navigate through a vector containing the URL of the aliexpress products we are interested in. Then we extract the price of the product by using the xpath of the product price of the webpage. The xpath of the element you want to scrape can be found by using the developer tool of chrome or firefox ( tutorial here ! ). Once the price is extracted we have to ensure this price is in numerical format by removing any special character (euro or dollar sign) and replace the comma by a point for the decimal separator. Here is the R code:

  url_list <- list("https://fr.aliexpress.com/store/product/Maxcatch-6Pcs-lot-Craws-Soft-Fishing-Lures-110mm-11-5g-Artificial-Bait-Soft-Bait-Craws-Lures/406467_32419930548.html?spm=a2g0w.12010612.0.0.5deb64f7836LnZ",
            "https://fr.aliexpress.com/store/product/Maxcatch-Fishing-Lure-5Pcs-Lot-155mm-7-4g-3-colors-Swimbait-Artificial-Lizard-Soft-Fishing-Lures/406467_32613648610.html?spm=a2g0w.12010612.0.0.5deb64f7836LnZ",
            "https://fr.aliexpress.com/store/product/Maxcatch-6Pcs-lot-Soft-Fishing-Lures-Minnow-Biat-95mm-6g-Jerkbait-Soft-Bait/406467_32419066106.html?spm=a2g0w.12010612.0.0.25fe5872CBqy0m") 

# Allocate a vector to store the price of the products 
currentp <- c()
for(i in 1:length(url_list)){
  
  # Navigate to link [i]
  remDr$navigate(url_list[i])
  
  # Find the price with an xpath selector and findElement.  
  # Sometimes products can be removed and this could throw an error this is why we are using 'try' to handle the potential errors
  
  current <- try(remDr$findElement(using = "xpath",'//*[contains(concat( " ", @class, " " ), concat( " ", "product-price-value", " " ))]'), silent = T)
  
  #If error : current price is NA 
  if(class(current) =='try-error'){
    currentp[i] <- NA
  }else{
    # Get the price 
    text <- unlist(current$getElementText())
    
    #Remove euro sign
    text <- gsub("[^A-Za-z0-9,;._-]","",text)
    
    #Case when there is a range of price instead of one price + replace comma by point
    if(grepl("-", text)) {  
      pe <- sub("-.*","",text) %>% sub(",", ".", ., fixed = TRUE)
      currentp[i] <-  as.numeric(pe)
    }else{
      currentp[i] <- as.numeric(sub(",", ".", text, fixed = TRUE))
  }
  }
  
Sys.sleep(4)
}

Between each link it is advised to wait a few seconds with Sys.sleep(4) to avoid being black-listed by the website.

Phantomjs version

If you execute the code above, you should see a firefox browser open and navigate through the list you provided. In case you don’t want an active window, you can replace firefox by phantomjs browser which is a headless browser (without a window).

I don’t know why but using rsDriver(browser = "phantomjs") does not work for me. I found this post which propose to start the phantomjs browser with the wdman package:

library(wdman)
library(RSelenium)
# start phantomjs instance
rPJS <- wdman::phantomjs(port = 4680L)

# is it alive?
rPJS$process$is_alive()

#connect selenium to it?
remDr <-  RSelenium::remoteDriver(browserName="phantomjs", port=4680L)

# open a browser
remDr$open()

remDr$navigate("http://www.google.com/")

# Screenshot of the headless browser to check if everything is working
remDr$screenshot(display = TRUE)

# Don't forget to close the browser when you are finished ! 
remDr$close()

Conclusion

Once you have understand the basics of RSelenium and how to select elements inside HTML pages, it is really easy to write a script to scrape data on the web. This post was a short example to scrape the product price on Aliexpress pages but the script can be extended to scrape more data on each page such as the name of the item, its rating etc… It is even possible to automate this script to run daily in order to see price changes over time. As you see possibilities are endless!

To leave a comment for the author, please follow the link and comment on their blog: R on Aurélien Callens.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Webscraping Aliexpress with Rselenium

A few words on web scraping

RSelenium

Installation for Ubuntu 18.04 LTS

Opening a web browser

Logging in Aliexpress

Navigating through the URLs and scraping the prices

Phantomjs version

Conclusion

Related

A few words on web scraping

RSelenium

Installation for Ubuntu 18.04 LTS

Opening a web browser

Logging in Aliexpress

Navigating through the URLs and scraping the prices

Phantomjs version

Conclusion

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)