Due to the increasing interest in the Internet and its rising number of users, one can notice a surprising growth in the demand for analyzing data and information on the Internet left by users and for users. Many companies and institutions base their business decisions on extensive research of social media portals and Internet forums, where users leave reviews of various products and brands. Not only the analysis itself, but also the ability to obtain data from the Internet, is a key part of the puzzle…
… one can read in the description of the second talk (by Krzysztof Słomczynski (krzyslom)) at the first meetup of the new Tri-City R Users Group – meet(R) in Tricity!. The meeting will be held this Thursday (12-01-2017) at the Gdańsk University of Technology.
- Launching RSelenium
- Logging in to Ali Express
- A few basic RSelenium functions
- Extracting information from Ali Express available only to logged-in users
- Plotting expenses
Motivated by that upcoming talk, I took a tour through the RSelenium vignettes, RSelenium basics and RSelenium Docker, to launch my first Docker container with Selenium Server. If you are not yet motivated to use Docker containers, have a look at the post "R 3.3.0 is another motivation for Docker".
RSelenium is an R interface to Selenium (Server), a project focused on automating web browsers. It lets you create a regular web browser session that can be controlled from the command line. Such browsing automation is a big boost for web harvesting, because with RSelenium you can simulate a real user and pass keys (such as a user login and password) to the browser session. This way you can log into any portal automatically (or manually) and fill in anti-bot captcha codes (these rather manually). You can also interact with web elements that first need to be clicked to reveal the information you want to scrape.
I used the Selenium Docker container, which is available on Docker Hub, to launch Selenium Server. The following command launches Selenium Server and binds it to localhost on port 4445. It also binds the remote desktop to port 5901, so that we can run a VNC viewer (Vinagre) to observe the operations performed in a fake web session.
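A launch along these lines could look like the sketch below, called from R via `system2`. The image name `selenium/standalone-firefox-debug` and the container-side ports 4444 (Selenium) and 5900 (VNC) are my assumptions; only the host ports 4445 and 5901 come from the description above, so adjust the rest to the image you actually use.

```r
# Sketch: start a Selenium Server container from R.
# Assumptions: the image name and the container-side ports
# (4444 for Selenium, 5900 for VNC) are not taken from the post.
# Equivalent shell command:
#   docker run -d -p 4445:4444 -p 5901:5900 selenium/standalone-firefox-debug
docker_args <- c(
  "run", "-d",
  "-p", "4445:4444",   # Selenium Server reachable on localhost:4445
  "-p", "5901:5900",   # VNC reachable on localhost:5901
  "selenium/standalone-firefox-debug"
)

start_selenium <- function(args = docker_args) {
  # Returns the exit status of the docker command (0 means success).
  system2("docker", args)
}

# start_selenium()
```

Keeping the arguments in a vector makes it easy to change a port or the image without touching the call itself.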
Once the Docker container is running, we can establish a connection with Selenium Server from R with the following commands:
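A minimal sketch of such a connection, assuming Firefox as the browser and the server listening on localhost:4445 as described earlier. The helper name `connect_selenium` is hypothetical; `remoteDriver` and its `open` method are part of RSelenium.

```r
# Sketch: connect to the Selenium Server listening on localhost:4445.
# Assumptions: Firefox as the browser; `connect_selenium` is a
# hypothetical helper name, not part of RSelenium.
connect_selenium <- function(host = "localhost", port = 4445L) {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = host,
    port = port,
    browserName = "firefox"
  )
  remDr$open()  # opens a browser session on the server
  remDr
}

# remDr <- connect_selenium()
```

Calling `remDr <- connect_selenium()` gives the driver object used in the rest of the post.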
Logging in to Ali Express
The result of the previous commands is a `remoteDriver` object, in this situation called `remDr`. This S4 object can be used to navigate through the web browser session. The command below navigates to the Ali Express login page.
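As a sketch, navigation comes down to the `navigate` method of the driver. The URL below is my guess at the login page address and the helper name `open_page` is hypothetical.

```r
# Sketch: navigate the browser session to a login page.
# Assumption: the default URL is a guess at the login page address.
open_page <- function(remDr, url = "https://login.aliexpress.com/") {
  remDr$navigate(url)
  remDr$getCurrentUrl()[[1]]  # address the browser actually ended up at
}

# open_page(remDr)
```

Returning the current URL is a cheap way to notice redirects, for example when the portal sends you somewhere other than the page you asked for.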
In the remote desktop VNC viewer you can now see that we have entered Ali Express.
You can log in in 2 ways: by filling in the fields manually, or by finding the fields by their properties (like `class name`) and sending keys (like `user name`) to them.
Then you can click the sign in button with the `clickElement` method.
You can check the properties of a field with `inspect element` in any web browser, so that you know how to reach the element with `findElement` by its `id`, its `class` or even `css`. For this portal I wasn’t successful with sending keys, so I logged in manually.
However, for Facebook it worked like a charm.
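A login by sending keys could be sketched as follows. The helper `login_with_keys` is hypothetical, and the field ids `email` and `pass` are my assumptions about Facebook's login form; check them with inspect element, since they may have changed.

```r
# Sketch: log in by sending keys to the login form.
# Assumptions: the fields are found by id; "email" and "pass" are
# guesses at the ids of the login fields and may have changed.
login_with_keys <- function(remDr, url, user, password,
                            user_id = "email", pass_id = "pass") {
  remDr$navigate(url)
  user_field <- remDr$findElement(using = "id", value = user_id)
  user_field$sendKeysToElement(list(user))
  pass_field <- remDr$findElement(using = "id", value = pass_id)
  # Pressing Enter after the password submits the form.
  pass_field$sendKeysToElement(list(password, key = "enter"))
  invisible(remDr)
}

# login_with_keys(remDr, "https://www.facebook.com/", "me@example.com", "secret")
```

Submitting with the Enter key avoids having to locate the sign in button at all, which is one selector fewer to keep up to date.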
A few basic RSelenium functions
At this point I think a few comments on the basics of RSelenium commands are required. With the `findElement` method you can get the first element on the web page that matches the search criterion; with `findElements` you can find all such elements. On each such element (or on the list of elements) you can perform further operations like
- sending data to element (`sendKeysToElement`)
- clicking the element (`clickElement`)
- finding elements within that element (`findChildElements`)
- extracting the text from element (`getElementText`)
- highlighting the element (`highlightElement`)
and many more!
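To make the operations listed above concrete, here is a small sketch that combines them. The helper names are hypothetical, and the class names passed in are whatever you find with inspect element.

```r
# Sketch: collect the text of every element with a given class.
# Assumption: `class_name` is a class you looked up in the browser.
get_texts <- function(remDr, class_name) {
  elems <- remDr$findElements(using = "class name", value = class_name)
  vapply(elems, function(el) el$getElementText()[[1]], character(1))
}

# Sketch: text of the first child with class `child_class` inside the
# first element with class `parent_class`.
get_child_text <- function(remDr, parent_class, child_class) {
  parent <- remDr$findElement(using = "class name", value = parent_class)
  parent$highlightElement()  # flashes the element in the browser window
  child <- parent$findChildElement(using = "class name", value = child_class)
  child$getElementText()[[1]]
}
```

`getElementText` returns a list, hence the `[[1]]` to get a plain character string.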
Extracting information from Ali Express available only to logged-in users
Ali Express is a market portal where you can order products, mainly from China. They are of low quality but also cheap, which is why this portal is very popular despite the long delivery times (delivery is free in many cases). You can buy clothes for a few cents!
How much money did I 🙂 spend on all my transactions?
In the My orders panel, available only to a logged-in user, I found out that for each page of orders (I had 32 pages of transaction history) I can extract the whole body of a transaction, and then from that body I can extract the amount that I paid in dollars, check whether the transaction was cancelled and get the ID of the operation. I just need to properly specify the class names of the HTML tags/objects I need.
I will use
For all 32 pages of the orders’ history, the following `for` loop extracts information from each sub page and then navigates to the next sub page of the orders’ history.
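Such a loop could be sketched like this. The function name `scrape_orders` and the arguments `order_class` and `next_selector` are hypothetical placeholders; the real class name of an order block and the selector of the "next page" link have to be looked up with inspect element.

```r
# Sketch: walk through the order-history pages and collect raw text.
# Assumptions: `order_class` and `next_selector` are hypothetical
# placeholders for values you look up in the browser.
scrape_orders <- function(remDr, n_pages, order_class, next_selector,
                          pause = 2) {
  pages <- vector("list", n_pages)
  for (i in seq_len(n_pages)) {
    orders <- remDr$findElements(using = "class name", value = order_class)
    pages[[i]] <- vapply(orders, function(el) el$getElementText()[[1]],
                         character(1))
    if (i < n_pages) {
      next_btn <- remDr$findElement(using = "css selector",
                                    value = next_selector)
      next_btn$clickElement()
      Sys.sleep(pause)  # give the next sub page time to load
    }
  }
  unlist(pages)
}

# order_texts <- scrape_orders(remDr, 32, "order-class", "a.next-page")
```

The fixed pause is the simplest way to wait for a page load; polling for the presence of an element would be more robust but also longer.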
The extracted information is plain text, so it requires some text manipulation to obtain tidy data that can be plotted. Additionally, I filter the orders to those that haven’t been cancelled.
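The kind of text manipulation involved can be sketched with base R regular expressions. Everything about the input here is an assumption made for illustration: the layout of the text ("Order ID: <digits>", "$ <amount>"), the word that marks a cancelled order, and the sample strings themselves, which are made up and not real order data.

```r
# Sketch: turn raw order text into a tidy data frame.
# Assumptions: the text layout ("Order ID: <digits>", "$ <amount>") and
# the word marking a cancelled order are made up for illustration.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)  # texts without a match stay NA
  out
}

parse_orders <- function(order_texts, cancelled_word = "Closed") {
  data.frame(
    id = extract_first(order_texts, "(?<=Order ID: )[0-9]+"),
    amount = as.numeric(
      extract_first(order_texts, "(?<=\\$ )[0-9]+(?:\\.[0-9]+)?")
    ),
    cancelled = grepl(cancelled_word, order_texts, fixed = TRUE),
    stringsAsFactors = FALSE
  )
}

# Made-up sample input, not real order data.
sample_texts <- c(
  "Order ID: 1001 $ 4.52 Finished",
  "Order ID: 1002 $ 10.00 Closed"
)
orders <- parse_orders(sample_texts)
paid_orders <- orders[!orders$cancelled, ]  # keep non-cancelled orders
```

The lookbehind patterns keep only the value that follows the label, so no extra step is needed to strip "Order ID: " or "$ " afterwards.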
The plot of cumulative expenses can be obtained with the following code. I can’t believe that $450 was spent over 8 months! The result of the code is the main photo of this post.
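A cumulative plot of this kind can be sketched with base graphics. This is a stand-in under two assumptions of mine: the amounts arrive newest first (the order in which an order history is usually listed), and base `plot` replaces whatever plotting package produced the original figure. The helper names are hypothetical.

```r
# Sketch: cumulative expenses, oldest order first.
# Assumptions: `amounts` is ordered newest first, and base graphics
# stand in for the plotting package used for the original figure.
cumulative_expenses <- function(amounts) cumsum(rev(amounts))

plot_expenses <- function(amounts) {
  total <- cumulative_expenses(amounts)
  plot(seq_along(total), total, type = "s",
       xlab = "Order (oldest to newest)",
       ylab = "Cumulative expenses ($)")
  invisible(total)
}

# plot_expenses(c(3, 2, 5))  # made-up amounts
```

A step plot (`type = "s"`) fits this data well, because the total only changes at the moment of each order.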