# Playing with robots

May 3, 2011
(This article was first published on Freakonometrics - Tag - R-english, and kindly contributed to R-bloggers) My son would be extremely proud if I told him I can spend hours building robots. Well, my robots are not as fancy as Dr Tenma's, but they usually do what I ask them to do. For instance, it is extremely simple to build a robot with R to extract data from websites. I have mentioned it here (on tennis matches), but it failed there (on the NY Marathon). To illustrate the use of robots, assume that one wants to build a dataset to study the prices of airline tickets. First, we have to choose a departure city (e.g. Paris) and an arrival city (e.g. Montreal). Then, we look at all possible departure dates from April 1st (I ran it last month) till the end of December (so we create a vector with all leaving dates, namely one vector for the day, one for the month, and one for the year). Finally, we choose a return date (say 3 days later).

```r
DEP = "Paris"
ARR = "Montreal"
# departure dates: April 1st, 2011 through February 29th, 2012
DATE1D = rep(c(1:30,1:31,1:30,1:31,1:31,1:30,1:31,1:30,1:31,1:31,1:29),3)
DATE1M = rep(c(rep(4,30),rep(5,31),rep(6,30),rep(7,31),rep(8,31),rep(9,30),
               rep(10,31),rep(11,30),rep(12,31),rep(1,31),rep(2,29)),3)
DATE1Y = rep(c(rep(2011,30+31+30+31+31+30+31+30+31),rep(2012,31+29)),3)
# return dates: k days after the departure date
k = 3
DATE3D = rep(c((1+k):30,1:31,1:30,1:31,1:31,1:30,1:31,1:30,1:31,1:31,1:29,1:k),3)
DATE3M = rep(c(rep(4,30-k),rep(5,31),rep(6,30),rep(7,31),rep(8,31),rep(9,30),
               rep(10,31),rep(11,30),rep(12,31),rep(1,31),rep(2,29),rep(3,k)),3)
DATE3Y = rep(c(rep(2011,30+31+30+31+31+30+31+30+31-k),rep(2012,31+29+k)),3)
```

(Two small fixes here: the 2011 counts in the year vectors should stop at December 31st, since January and February belong to 2012, and the return-date vectors need the same threefold replication as the departure vectors so that all six have the same length.)
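As a side note, the same vectors can be built more compactly with R's `Date` class; this is just a sketch (the names are mine, not the original code's), dropping the threefold replication for clarity:

```r
# equivalent, more compact construction using Date arithmetic
dep.dates = seq(as.Date("2011-04-01"), as.Date("2012-02-29"), by = "day")
ret.dates = dep.dates + 3              # return k = 3 days after departure
dep.day   = as.numeric(format(dep.dates, "%d"))
dep.month = as.numeric(format(dep.dates, "%m"))
dep.year  = as.numeric(format(dep.dates, "%Y"))
```

Date arithmetic handles month lengths and the 2012 leap year automatically, which is exactly where the hand-built vectors are fragile.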

It is also possible (for a nice robot) to skip all dates that are already in the past,

```r
skip = max(as.numeric(Sys.Date() - as.Date("2011-04-01")), 1)
```
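The post computes `skip` but does not show how it is used; a minimal sketch (the helper is mine, not from the original code) would simply drop the first `skip` entries of each of the six date vectors before starting the loop:

```r
# hypothetical helper: drop the first 'skip' entries of a date vector,
# keeping at least one entry, so the loop starts from today
drop.past = function(v, skip) v[(min(skip, length(v) - 1) + 1):length(v)]
# e.g. DATE1D = drop.past(DATE1D, skip), and likewise for the other five vectors
```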

Then, we need a website where requests can be written nicely (with cities and dates appearing explicitly in the URL). Here, I cannot mention the website that I used, since it is stated on the website that running automatic requests is strictly forbidden… Anyway, consider a loop that creates a URL address (I actually chose the date randomly, since I had been told that those websites have memory: if you ask too many times for the same thing during a short period of time, prices go up),

```r
URL = paste("http://www.♦♦♦♦/dest.dll?qscr=fx&flag=q&city1=", DEP,
            "&citd1=", ARR,
            "&date1=", DATE1D[s], "/", DATE1M[s], "/", DATE1Y[s],
            "&date2=", DATE3D[s], "/", DATE3M[s], "/", DATE3Y[s],
            "&cADULT=1", sep = "")
```
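The surrounding loop is not shown in the post; here is a sketch of how the random index could be drawn and the URL assembled (the function and its name are mine, purely illustrative):

```r
# hypothetical wrapper: draw a random index s, build the request URL for it
random.query = function(DATE1D, DATE1M, DATE1Y, DATE3D, DATE3M, DATE3Y, dep, arr) {
  s = sample(length(DATE1D), 1)          # random date, since the site has 'memory'
  url = paste("http://www.♦♦♦♦/dest.dll?qscr=fx&flag=q&city1=", dep,
              "&citd1=", arr,
              "&date1=", DATE1D[s], "/", DATE1M[s], "/", DATE1Y[s],
              "&date2=", DATE3D[s], "/", DATE3M[s], "/", DATE3Y[s],
              "&cADULT=1", sep = "")
  # Sys.sleep(runif(1, 5, 15))           # polite random pause between requests
  list(s = s, url = url)
}
```

A random pause between requests (commented out above) is in the same spirit as randomizing the dates: it makes the robot look a little less like a robot.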

Then, we just have to scan the web page, looking for ticket prices (searching for some specific tags),

```r
page = as.character(scan(URL, what = "character"))
I = which(page %in% c("Price0", "Price1", "Price2"))
if (length(I) > 0) {
  # drop the leading currency symbol in front of the price
  PRIX = substr(page[I+1], 2, nchar(page[I+1]))
  # prices above 1000 are split over two tokens (e.g. "1" then "234"):
  # paste the two pieces back together
  if (PRIX == "1") { PRIX = paste(PRIX, page[I+2], sep = "") }
  if (PRIX == "2") { PRIX = paste(PRIX, page[I+2], sep = "") }
}
```

(The closing brace was missing in the original snippet.)

Here, we have to be a bit cautious when prices exceed 1,000 (hence the two `if` statements above). Then, it is possible to start a statistical study. For instance, if we compare two destinations (from Paris), e.g. Montréal and New York, we obtain the following patterns (with high prices during holidays). It is also possible to run the code twice (here it was run last month, and again a couple of days ago) for the same destination (from Paris to Montréal). Of course, it would be great if I could run that code, say, every week, to build up a nice dataset and to study the dynamics of prices…
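The bookkeeping between the scraping and the statistical study is not shown in the post; a minimal sketch (column and function names are mine) could store one row per successful request, which then makes comparisons across destinations a one-liner:

```r
# hypothetical accumulator: one row per successful request
prices = data.frame(dep.date = as.Date(character(0)),
                    dest = character(0),
                    price = numeric(0),
                    stringsAsFactors = FALSE)

# called inside the loop once PRIX has been parsed
add.price = function(df, y, m, d, dest, prix) {
  rbind(df, data.frame(dep.date = as.Date(sprintf("%04d-%02d-%02d", y, m, d)),
                       dest = dest,
                       price = as.numeric(prix),
                       stringsAsFactors = FALSE))
}
# once enough rows are collected, e.g. average price per destination:
# tapply(prices$price, prices$dest, mean)
```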

The problem is that doing this is forbidden. In fact, the website mentions that if we want to extract data (for academic purposes), it is possible to ask for an extraction. But if we do tell them which prices we are studying, the data might be biased. So a good idea would be to use several servers, to make several requests, randomly, and to collect them (changing dates and destinations). But here, my computing skills – unfortunately – reach their limit…

