Today, I am going to show you how to scrape product prices from the Aliexpress website.
A few words on web scraping
Before diving into the subject, you should be aware that web scraping is not allowed on certain websites. To know whether that is the case for the website you want to scrape, I invite you to check the robots.txt page, which should be located at the root of the website address. For Aliexpress, this page is located at www.aliexpress.com/robots.txt.
This page indicates that web scraping and crawling are not allowed on several page categories such as /bin/*, /search/* and /wholesale*, for example. Fortunately for us, the /item/* category, where the product pages are stored, can be scraped. A programmatic check is sketched below.
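As a side note, you can also run this check directly from R. Here is a minimal sketch using the robotstxt package (assuming it is installed); the paths below are only illustrative examples:

library(robotstxt)

# Ask whether a generic bot ("*") may crawl the given paths.
# Returns TRUE for allowed paths and FALSE for disallowed ones.
paths_allowed(paths = c("/item/example.html", "/wholesale"),
              domain = "aliexpress.com",
              bot = "*")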
RSelenium
Installation for Ubuntu 18.04 LTS
Installing RSelenium was not as easy as expected, and I encountered two errors.
The first error I got after installing the package and trying the rsDriver function was:
Error in curl::curl_fetch_disk(url, x$path, handle = handle) : Unrecognized content encoding type. libcurl understands deflate, gzip content encodings.
Thanks to this post, I installed the missing package: stringi.
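If you hit the same error, the fix should be as simple as installing that package before calling rsDriver again:

# Install the missing dependency, then retry rsDriver()
install.packages("stringi")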
Once this error was addressed, I ran into a different one:
Error: Invalid or corrupt jarfile /home/aurelien/.local/share/binman_seleniumserver/generic/4.0.0-alpha-2/selenium-server-standalone-4.0.0-alpha-2.jar
This time the problem came from a corrupted file. Thanks to this post, I knew that I just had to download the file selenium-server-standalone-4.0.0-alpha-2.jar from the official Selenium website and replace the corrupted file with it.
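If you prefer to do the replacement from R rather than a file manager, something along these lines should work. The destination path is the one from my error message; adjust it to your system, and the download location in ~/Downloads is an assumption:

# Path of the corrupted jar (taken from the error message above)
jar_path <- "~/.local/share/binman_seleniumserver/generic/4.0.0-alpha-2/selenium-server-standalone-4.0.0-alpha-2.jar"

# Overwrite it with the jar downloaded from the official Selenium website
# (assuming the download landed in ~/Downloads)
file.copy("~/Downloads/selenium-server-standalone-4.0.0-alpha-2.jar",
          jar_path, overwrite = TRUE)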
I hope this will help some of you install RSelenium on Ubuntu 18.04 LTS!
Opening a web browser
After addressing the errors above, I can now open a Firefox browser:
library(RSelenium)

# Open a Firefox driver
rD <- rsDriver(browser = "firefox")
remDr <- rD[["client"]]
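To make sure the session is alive before going further, you can, for example, navigate to a page and read its title:

# Quick sanity check of the Selenium session
remDr$navigate("https://www.aliexpress.com/")
remDr$getTitle()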
Logging in to Aliexpress
The first step to scrape product prices on Aliexpress is to log in to your account:
log_id <- "Your_mail_address"
password <- "Your_password"

# Navigate to the Aliexpress login page
remDr$navigate("https://login.aliexpress.com/")

# Fill the form with the mail address
remDr$findElement(using = "id", "fm-login-id")$sendKeysToElement(list(log_id))

# Fill the form with the password
remDr$findElement(using = "id", "fm-login-password")$sendKeysToElement(list(password))

# Submit the login form by clicking the Submit button
remDr$findElement("class", "fm-button")$clickElement()
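The login page needs a moment to process the form, so it can be worth pausing and checking where the browser ended up. A small sketch (the 5-second wait is an arbitrary choice):

# Give the login form a few seconds to be processed (arbitrary wait)
Sys.sleep(5)

# If the login succeeded, the browser should have left the login page
remDr$getCurrentUrl()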
Navigating through the URLs and scraping the prices
Now we have to navigate through a vector containing the URLs of the Aliexpress products we are interested in, and extract the price of each product by using the XPath of the price element on the page. The XPath of the element you want to scrape can be found with the developer tools of Chrome or Firefox (tutorial here!). Once the price is extracted, we have to make sure it is in a numerical format by removing any special characters (euro or dollar signs) and replacing the comma with a point as the decimal separator. Here is the R code:
library(magrittr) # for the %>% pipe used below

url_list <- list(
  "https://fr.aliexpress.com/store/product/Maxcatch-6Pcs-lot-Craws-Soft-Fishing-Lures-110mm-11-5g-Artificial-Bait-Soft-Bait-Craws-Lures/406467_32419930548.html?spm=a2g0w.12010612.0.0.5deb64f7836LnZ",
  "https://fr.aliexpress.com/store/product/Maxcatch-Fishing-Lure-5Pcs-Lot-155mm-7-4g-3-colors-Swimbait-Artificial-Lizard-Soft-Fishing-Lures/406467_32613648610.html?spm=a2g0w.12010612.0.0.5deb64f7836LnZ",
  "https://fr.aliexpress.com/store/product/Maxcatch-6Pcs-lot-Soft-Fishing-Lures-Minnow-Biat-95mm-6g-Jerkbait-Soft-Bait/406467_32419066106.html?spm=a2g0w.12010612.0.0.25fe5872CBqy0m"
)

# Allocate a vector to store the price of the products
currentp <- c()

for (i in 1:length(url_list)) {
  # Navigate to link [i]
  remDr$navigate(url_list[[i]])

  # Find the price with an XPath selector and findElement().
  # Products are sometimes removed, which would throw an error;
  # this is why we wrap the call in try() to handle it gracefully.
  current <- try(remDr$findElement(using = "xpath",
    '//*[contains(concat( " ", @class, " " ), concat( " ", "product-price-value", " " ))]'),
    silent = TRUE)

  if (inherits(current, "try-error")) {
    # If there was an error, the current price is NA
    currentp[i] <- NA
  } else {
    # Get the price text
    text <- unlist(current$getElementText())
    # Remove the euro sign and any other special characters
    text <- gsub("[^A-Za-z0-9,;._-]", "", text)
    # Handle the case when there is a range of prices instead of a single
    # price, and replace the comma by a point
    if (grepl("-", text)) {
      pe <- sub("-.*", "", text) %>% sub(",", ".", ., fixed = TRUE)
      currentp[i] <- as.numeric(pe)
    } else {
      currentp[i] <- as.numeric(sub(",", ".", text, fixed = TRUE))
    }
  }
  Sys.sleep(4)
}
Between each link, it is advised to wait a few seconds with Sys.sleep(4) to avoid being blacklisted by the website.
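Once the loop has finished, the prices are sitting in currentp. A minimal sketch to pair them with their URLs (and a date, if you plan to track changes over time):

# Gather the results in a data frame
results <- data.frame(url = unlist(url_list),
                      price = currentp,
                      date = Sys.Date())
results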
PhantomJS version
If you execute the code above, you should see a Firefox browser open and navigate through the list you provided. In case you don't want an active window, you can replace Firefox with PhantomJS, which is a headless browser (one without a visible window).
I don't know why, but rsDriver(browser = "phantomjs") does not work for me. I found this post, which proposes starting the PhantomJS browser with the wdman package:
library(wdman)
library(RSelenium)

# Start a phantomjs instance
rPJS <- wdman::phantomjs(port = 4680L)

# Is it alive?
rPJS$process$is_alive()

# Connect Selenium to it
remDr <- RSelenium::remoteDriver(browserName = "phantomjs", port = 4680L)

# Open a browser
remDr$open()
remDr$navigate("http://www.google.com/")

# Take a screenshot of the headless browser to check that everything is working
remDr$screenshot(display = TRUE)

# Don't forget to close the browser when you are finished!
remDr$close()
Conclusion
Once you have understood the basics of RSelenium and how to select elements inside HTML pages, it is really easy to write a script to scrape data on the web. This post was a short example of scraping product prices on Aliexpress pages, but the script can be extended to scrape more data on each page, such as the name of the item, its rating, etc. It is even possible to automate this script to run daily in order to track price changes over time. As you can see, the possibilities are endless!
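For the daily automation mentioned above, one option on Linux is the cronR package. A sketch, assuming the scraping code has been saved as scrape_prices.R (a hypothetical file name and path):

library(cronR)

# Build the command that runs the (hypothetical) scraping script
cmd <- cron_rscript("/home/aurelien/scrape_prices.R")

# Schedule it to run every day at 8:00
cron_add(cmd, frequency = "daily", at = "08:00", id = "aliexpress_prices")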