Playing with robots

May 3, 2011

(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers)

My son would be extremely proud if I told him I can spend hours building robots. Well, my robots are not as fancy as Dr Tenma’s, but they usually do what I ask them to do. For instance, it is extremely simple to build a robot with R to extract data from websites. I have mentioned it here (on tennis matches), but it failed there (on the NY Marathon).

To illustrate the use of robots, assume that one wants to build one's own dataset to study the prices of airline tickets. First, we have to choose a departure city (e.g. Paris) and an arrival city (e.g. Montréal). Then, one wants to look at all possible dates from April 1st (I ran it last month) until the end of December, so we create vectors with all departure dates: one for the day, one for the month, and one for the year. Finally, we choose a return date (say 3 days later).
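A minimal sketch of this step in R. The year 2011 is an assumption taken from the date of the post, and the variable names are only illustrative.

```r
# Departure dates: every day from 1 April to 31 December.
# The year (2011) is an assumption based on the date of the post.
departure <- seq(as.Date("2011-04-01"), as.Date("2011-12-31"), by = "day")

# One vector each for the day, the month and the year.
DAY   <- as.integer(format(departure, "%d"))
MONTH <- as.integer(format(departure, "%m"))
YEAR  <- as.integer(format(departure, "%Y"))

# Return date, 3 days after each departure. Date arithmetic takes care of
# month ends (30 April + 3 days is 3 May).
back <- departure + 3
```

Working with `Date` objects first and extracting day, month and year afterwards avoids handling month lengths by hand.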


It is also possible (for a nice robot) to skip all dates that have already passed.
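One way to do this, as a sketch: compare each departure date with the current date and keep only the later ones. The helper name `skip_past` is hypothetical, and the fixed date used in the example is only there to make the result reproducible.

```r
# All candidate departure dates (the year 2011 is an assumption).
departure <- seq(as.Date("2011-04-01"), as.Date("2011-12-31"), by = "day")

# Keep only the dates strictly after the day the robot runs.
# 'today' is an argument so the filter can be checked on a fixed date;
# in a real run it defaults to Sys.Date().
skip_past <- function(dates, today = Sys.Date()) {
  dates[dates > today]
}

# Example: pretend the robot runs on 3 May 2011.
upcoming <- skip_past(departure, today = as.Date("2011-05-03"))
```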


Then, we need a website where requests can be written nicely (with cities and dates appearing explicitly). Here, I cannot mention the website that I used, since it is stated on the website that it is strictly forbidden to run automatic requests… Anyway, consider a loop that creates a URL for each request. Actually, I chose the value of the date randomly, since I had been told that those websites have memory: if you ask too many times for the same thing during a short period of time, prices go up.
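As a rough sketch of such a loop: the host `www.example.com`, the query parameter names, and the helper `build_url` are all placeholders, not those of any real booking site, and the city codes are only illustrative. Visiting the dates in a shuffled order is one possible reading of "randomly". The code only builds the addresses; it does not send any request.

```r
# Hypothetical URL layout: the host and the parameter names are placeholders.
build_url <- function(from, to, depart, back) {
  sprintf("http://www.example.com/search?from=%s&to=%s&depart=%s&return=%s",
          from, to, format(depart, "%Y-%m-%d"), format(back, "%Y-%m-%d"))
}

# Candidate departure dates (the year 2011 is an assumption).
departure <- seq(as.Date("2011-04-01"), as.Date("2011-12-31"), by = "day")

# Visit the dates in random order rather than chronologically.
set.seed(1)  # only to make the sketch reproducible
visit_order <- sample(length(departure))

urls <- character(length(departure))
for (i in visit_order) {
  # Return date is 3 days after the departure date.
  urls[i] <- build_url("PAR", "YUL", departure[i], departure[i] + 3)
}
```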


Then, we just have to scan the webpage, looking for ticket prices (by searching for some specific names in the page source).
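A sketch of the scanning step, run on a made-up page fragment rather than a real site. The markup, the `price` class name and the amounts are invented for illustration; on a real page the lines would come from `readLines()` on the URL, and the marker to search for would have to be found by inspecting the page source. The sketch assumes prices are whole numbers, possibly written with a space as thousands separator.

```r
# A made-up fragment of a results page (markup and values are invented).
page <- c('<div class="flight">Paris - Montreal</div>',
          '<span class="price">685 EUR</span>',
          '<span class="price">1 240 EUR</span>',
          '<div class="footer">Conditions apply</div>')

# Keep the lines that carry the marker we are looking for.
hits <- grep('class="price"', page, value = TRUE, fixed = TRUE)

# Pull out the text between the tags, then drop everything that is not a
# digit (currency, spaces) and convert to numbers. This would mangle
# decimal prices, so it assumes whole amounts.
raw    <- sub('.*class="price">([^<]*)<.*', "\\1", hits)
prices <- as.numeric(gsub("[^0-9]", "", raw))
```

Because the digits are extracted after removing separators, the result does not depend on how many digits the price has.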


Here, we have to be a bit cautious if prices exceed 1000. Then, it is possible to start a statistical study. For instance, if we compare two destinations (from Paris), e.g. Montréal and New York, we obtain the following patterns (with high prices during holidays).

It is also possible to run the code twice (here it was run last month, and again a couple of days ago) for the same destination (from Paris to Montréal).

Of course, it would be great if I could run that code, say, every week, to build up a nice dataset and to study the dynamics of prices…

The problem is that it is forbidden to do this. In fact, the website mentions that if we want to extract data (for an academic purpose), it is possible to ask for an extraction. But if we tell them that we are studying specific prices, the data might be biased. So the good idea would be to use several servers to make several requests, randomly, and to collect the results (changing dates and destinations). But here, my computing skills, unfortunately, reach their limit…
