RSelenium: A wonderful tool for web scraping.

[This article was first published on R – FordoX, and kindly contributed to R-bloggers.]

For one of my projects, I needed to fetch data in R from online sources. We all know that it's common practice to collect data from Twitter, Facebook and other social media websites and analyse it. I used to do this with the XML package until I ran into a problem while scraping one particular site. Even after searching the internet, I was unable to find a solution. So I raised my concern on Stack Overflow, where someone was generous enough to tell me about the RSelenium package. And trust me, I fell in love with this package. Kudos to its author.

But I have to admit, I faced a lot of trouble while using this package, even after following the steps mentioned here. The steps below are written in a simple but detailed manner to make setting up the RSelenium package easy.

Step 1: install.packages("RSelenium")

Step 2: library("RSelenium")

Step 3: checkForServer()

This function downloads the Selenium standalone server jar file and places it in the bin directory of the RSelenium package. To download it somewhere else, pass the location in the dir argument. You can also download the jar manually from here.

Step 4: startServer()

This function starts the standalone server, looking for it in the package's bin directory by default. If you placed the jar in another directory, pass that directory in the dir argument.
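As a rough sketch of Steps 3 and 4 with a custom location, both calls can be pointed at the same directory. The helper name start_selenium_in() and the path in the usage line are hypothetical, and the sketch assumes an RSelenium version that still provides checkForServer() and startServer() as used in this post.

```r
# Hypothetical helper (not part of RSelenium): fetch the standalone
# server jar into `server_dir`, then start the server from there.
start_selenium_in <- function(server_dir) {
  # Create the target directory if it does not exist yet.
  if (!dir.exists(server_dir)) {
    dir.create(server_dir, recursive = TRUE)
  }
  # Both functions accept the `dir` argument described in Steps 3 and 4.
  RSelenium::checkForServer(dir = server_dir)
  RSelenium::startServer(dir = server_dir)
  invisible(server_dir)
}

# Usage (hypothetical path):
# start_selenium_in("~/selenium")
```

Keeping the directory in one variable avoids the easy mistake of downloading the jar to one place and starting the server from another.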

Sometimes this function throws errors. To get clearer insight into them, I prefer an alternative: navigate to the server's location in your OS console (i.e., the command prompt on Windows or the terminal on Ubuntu) and run the following command:

java -jar selenium-server-standalone-x.xx.x.jar

This prints all the status messages to the console, so you can debug any errors from there.
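If you are unsure of the exact jar file name, a small helper can look it up and compose the command for you. This is only a sketch: build_server_command() is a hypothetical name, and it assumes the jar follows the selenium-server-standalone-*.jar naming shown above.

```r
# Hypothetical helper: build the "java -jar ..." command for the first
# standalone server jar found in `server_dir`.
build_server_command <- function(server_dir) {
  jars <- list.files(
    server_dir,
    pattern = "^selenium-server-standalone-.*\\.jar$",
    full.names = TRUE
  )
  if (length(jars) == 0) {
    stop("No selenium-server-standalone jar found in: ", server_dir)
  }
  # shQuote() protects paths that contain spaces.
  paste("java -jar", shQuote(jars[1]))
}

# Usage (hypothetical path): print the command, then paste it into the console.
# cat(build_server_command("~/selenium"), "\n")
```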

Step 5:

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4444,
  browserName = "firefox"
)

These are already the default parameters of remoteDriver(), so if you want to stick with them you can simply call remoteDriver(). I preferred Chrome, however. To use Chrome, I downloaded the chromedriver executable from here and added its location to the system PATH. Follow these steps:

If you now run the executable, it shows the port number (say, 9517) on which chromedriver is running. Use that port and set the browser name to match:

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 9517, browserName = "chrome")

Visit this page to find out more about using other browsers.

Step 6: If you now visit http://localhost:9517/selenium-server/driver?cmd=getLogMessages, you can check that the server is running.

You can also visit http://localhost:9517/selenium-server/driver?cmd=shutDownSeleniumServer to shut the server down.

Step 7: The following method opens a browser session and creates a session ID.

remDr$open()

Step 8: To query the status of the server, use this method.

remDr$getStatus()

You are now connected to the server through RSelenium and can start collecting data. If you run into any problems, take a moment to drop a comment below.
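To give an idea of what comes next, here is a minimal sketch that strings Steps 5 and 7 together and then loads a page. The helper fetch_page_title() and the URL in the usage line are hypothetical, and the sketch assumes a server (or chromedriver) is already listening on the given port.

```r
# Hypothetical helper: open a session, load `url`, and return the page title.
# Assumes a Selenium server (or chromedriver) is already running on `port`.
fetch_page_title <- function(url, port = 4444L, browser = "firefox") {
  remDr <- RSelenium::remoteDriver(
    remoteServerAddr = "localhost",
    port = port,
    browserName = browser
  )
  remDr$open(silent = TRUE)               # Step 7: create the session
  on.exit(remDr$close(), add = TRUE)      # close the browser even on error
  remDr$navigate(url)                     # load the page in the remote browser
  remDr$getTitle()[[1]]                   # getTitle() returns a list
}

# Usage (placeholder URL):
# fetch_page_title("http://www.example.com")
```

Closing the session in on.exit() keeps stray browser windows from piling up when a scrape fails halfway.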

Happy Mining!

