(This article was first published on **Freakonometrics - Tag - R-english**, and kindly contributed to R-bloggers)

Consider a departure city (e.g. *Paris*) and an arrival city (e.g. *Montreal*). Then, one wants to look at all possible departure dates, from April 1st (I ran the code last month) until the end of December (the vectors below actually extend to the end of February 2012). So we create vectors with all departure dates: one for the day, one for the month, and one for the year. Then, we choose a return date (say k = 3 days later) and build its day, month and year vectors the same way.

DEP="Paris"

ARR="Montreal"

DATE1D=rep(c(1:30,1:31,1:30,1:31,1:31,1:30,1:31,1:30,

1:31,1:31,1:29),3)

DATE1M=rep(c(rep(4,30),rep(5,31),rep(6,30),rep(7,31),

rep(8,31),rep(9,30),rep(10,31),rep(11,30),rep(12,31),

rep(1,31),rep(2,29)),3)

DATE1Y=rep(c(rep(2011,30+31+30+31+31+30+31+30+31),  # April to December 2011

rep(2012,31+29)),3)  # January and February 2012

k=3

DATE3D=c((1+k):30,1:31,1:30,1:31,1:31,1:30,1:31,

1:30,1:31,1:31,1:29,1:k)

DATE3M=c(rep(4,30-k),rep(5,31),rep(6,30),rep(7,31),rep(8,31),

rep(9,30),rep(10,31),rep(11,30),rep(12,31),rep(1,31),rep(2,29),rep(3,k))

DATE3Y=c(rep(2011,30+31+30+31+31+30+31+30+31-k),  # April (1+k) to December 2011

rep(2012,31+29+k))  # January, February and the first k days of March 2012
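The same vectors can also be obtained from R's `Date` class, which handles month lengths and the leap year automatically. The following is only a sketch: the names `first`, `last`, `dep`, `ret` and `split_date` are illustrative and do not appear in the code above, and the end date is assumed to be February 29, 2012, as in the hand-built vectors.

```r
# Sketch: build departure and return dates with Date arithmetic
first <- as.Date("2011-04-01")
last  <- as.Date("2012-02-29")  # assumed end date, matching the vectors above
k     <- 3

dep <- seq(first, last, by = "day")  # one entry per departure date
ret <- dep + k                       # return k days later

# Split a Date vector into the day / month / year integers used in the URL
split_date <- function(d) {
  data.frame(day   = as.integer(format(d, "%d")),
             month = as.integer(format(d, "%m")),
             year  = as.integer(format(d, "%Y")))
}

dep_parts <- split_date(dep)
ret_parts <- split_date(ret)
```

Here `dep_parts$day`, `dep_parts$month` and `dep_parts$year` play the role of one copy of `DATE1D`, `DATE1M` and `DATE1Y`, and `ret_parts` plays the role of the `DATE3` vectors.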

It is also possible (for a well-behaved robot) to skip all dates that have already passed:

skip=max(as.numeric(Sys.Date()-as.Date("2011-04-01")),1)

Then, we need a website where requests can be written nicely (with cities and dates appearing explicitly in the address). Here, I cannot mention the website that I used, since it is stated on the website that it is strictly forbidden to run automatic requests... Anyway, consider a loop that creates a URL for some index s (actually, I chose the date randomly, since I had been told that those websites have memory: if you ask too many times for the same thing within a short period of time, prices go up),

URL=paste("http://www.♦♦♦♦/dest.dll?qscr=fx&flag=q&city1=",

DEP,"&citd1=",ARR,"&",

"date1=",DATE1D[s],"/",DATE1M[s],"/",DATE1Y[s],

"&date2=",DATE3D[s],"/",DATE3M[s],"/",DATE3Y[s],

"&cADULT=1",sep="")

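As a sketch of this step, the URL construction and the random choice of date can be wrapped in two small functions. The names `build_url` and `pick_index` are hypothetical, and `base` is a placeholder for the address of the site, which is deliberately not given here; the query parameters are copied from the `paste()` call above.

```r
# Sketch: assemble the request URL for one departure/return pair.
# `base` is a placeholder; the real site is not named in this post.
build_url <- function(base, dep_city, arr_city, dep_date, ret_date) {
  fmt <- function(d) {
    # day/month/year without leading zeros, as produced by the paste() call above
    paste(as.integer(format(d, "%d")),
          as.integer(format(d, "%m")),
          format(d, "%Y"), sep = "/")
  }
  paste(base, "?qscr=fx&flag=q&city1=", dep_city, "&citd1=", arr_city,
        "&date1=", fmt(dep_date), "&date2=", fmt(ret_date),
        "&cADULT=1", sep = "")
}

# Pick one index at random among the dates that have not been skipped.
# sample.int() on the length avoids the surprise of sample(x, 1)
# when x has a single element.
pick_index <- function(skip, n) {
  remaining <- skip:n
  remaining[sample.int(length(remaining), 1)]
}

# Example with a made-up base address
url <- build_url("http://www.example.com/dest.dll", "Paris", "Montreal",
                 as.Date("2011-04-01"), as.Date("2011-04-04"))
```

Drawing `s` with `pick_index(skip, 335)` on each pass of the loop gives the random ordering of requests described above.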
Then, we just have to scan the web page, looking for ticket prices (that is, looking for a few specific labels):

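page=as.character(scan(URL,what="character"))  # page as a vector of whitespace-separated tokens

I=which(page%in%c("Price0","Price1","Price2"))  # positions of the price labels

if(length(I)>0){

PRIX=substr(page[I+1],2,nchar(page[I+1]))  # drop the first character of the token after each label

if(PRIX[1]%in%c("1","2")){PRIX=paste(PRIX,page[I+2],sep="")}  # thousands digit alone: append the next token

}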
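The parsing step can be sketched as a function that works on any vector of tokens, which makes it possible to try it without querying a website. This is not the code used above but a variant: it tests each price separately rather than only the first one, and it assumes (from the way the code above is written) that a price above 1000 arrives as a single leading digit followed by a token of exactly three digits. The name `extract_prices` is hypothetical.

```r
# Sketch: extract prices from the whitespace-separated tokens returned by scan().
# Assumes each label is followed by a token such as "$850", or by "$1" then "234"
# when the amount exceeds 1000 (an assumption, not a documented page format).
extract_prices <- function(page, labels = c("Price0", "Price1", "Price2")) {
  I <- which(page %in% labels)
  I <- I[I < length(page)]   # a label in last position has no price token
  if (length(I) == 0) return(numeric(0))
  first <- substr(page[I + 1], 2, nchar(page[I + 1]))  # drop the leading symbol
  nxt   <- page[I + 2]                                 # NA past the end of page
  # a lone leading digit followed by exactly three digits is a split amount
  split <- nchar(first) == 1 & !is.na(nxt) & grepl("^[0-9]{3}$", nxt)
  first[split] <- paste(first[split], nxt[split], sep = "")
  as.numeric(first)
}

# Made-up tokens, for illustration only
tokens <- c("x", "Price0", "$850", "y", "Price1", "$1", "234", "z", "Price2", "$999")
prices <- extract_prices(tokens)
```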

Here, we have to be a bit cautious when prices exceed 1000: the amount then seems to be split across two tokens, which is why the code pastes `page[I+2]` after a leading "1" or "2".

Then, it is possible to start a statistical study. For instance, if we compare two destinations (from Paris), e.g. Montréal and New York, we obtain the following patterns (with high prices during holidays),

It is also possible to run the code twice (here it was run last month, and a couple of days ago), for the same destination (from Paris to Montréal),

Of course, it would be great if I could run that code, say, every week, to build up a nice dataset and to study the dynamics of prices...
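Such a dataset could be as simple as a data frame with one row per observation, recording when the price was collected. The sketch below uses hypothetical names (`new_log`, `add_observation`, `price_change`) and made-up prices, and assumes one price per departure date and per run (for instance the lowest one).

```r
# Sketch: keep one row per (collection date, departure date, price) observation
new_log <- function() {
  data.frame(collected = as.Date(character(0)),
             departure = as.Date(character(0)),
             price     = numeric(0))
}

add_observation <- function(log, departure, price, collected = Sys.Date()) {
  rbind(log, data.frame(collected = as.Date(collected),
                        departure = as.Date(departure),
                        price     = price))
}

# Change in price between two collection dates, for each departure date
price_change <- function(log, run1, run2) {
  a <- log[log$collected == as.Date(run1), c("departure", "price")]
  b <- log[log$collected == as.Date(run2), c("departure", "price")]
  m <- merge(a, b, by = "departure", suffixes = c(".run1", ".run2"))
  m$change <- m$price.run2 - m$price.run1
  m
}

# Made-up values, for illustration only
log <- new_log()
log <- add_observation(log, "2011-07-01", 900, collected = "2011-03-15")
log <- add_observation(log, "2011-07-01", 950, collected = "2011-04-20")
pc  <- price_change(log, "2011-03-15", "2011-04-20")
```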

The problem is that this is forbidden. In fact, the website mentions that if we want to extract data (for academic purposes), it is possible to ask for an extraction. But if we tell them that we are studying specific prices, the data we get might be biased. So a good idea would be to use several servers, to send requests at random (changing dates and destinations), and then to collect the results. But here, unfortunately, my computing skills reach their limit...
