How Orbitz uses Hadoop and R to optimize hotel search

December 21, 2010

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

Positional bias, the tendency for users to preferentially select results in the first few positions of a search, is a big issue for all kinds of search engines. But for online travel site Orbitz the stakes are higher than for a traditional Web search engine: if a customer chooses the first-listed hotel in a search for accommodations but is then dissatisfied with their stay, Orbitz will soon have an unhappy customer. So for Orbitz, a key problem was to optimize their hotel search results for customer satisfaction.

As Orbitz's Jonathan Seidman (Lead Engineer on the Intelligent Marketplace/Machine Learning Team) and Ramesh Venkataramaiah (Principal Engineer on the Operations and Engineering Team) revealed in presentations at the WindyCityDB and Hadoop World NYC conferences, Orbitz solves this problem by using R to perform statistical analysis on data stored in Hadoop and extracted with Hive.

Orbitz statistical analysis components

After extracting data including customer hotel booking records and user ratings of hotels from Hive, the Orbitz team used statistical analysis to identify the best hotel to promote to the top of the list for each new booking. Ramesh reports that the statistical techniques included linear filtering of time series (via the filter function) and moving averages with equal weights. These models even allowed for seasonal trends to be incorporated into the recommendations: for example, the fact that longer hotel stays tend to be booked in the summer months, as shown by the red days in this calendar heat map:

Hotel stay length
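
As a rough illustration of the equal-weight moving average mentioned above, here is a minimal sketch using R's filter function from the stats package. The booking counts are simulated and the 7-day window is an assumption chosen for the example; neither comes from the Orbitz presentation, and this is not the team's actual code.

```r
# Simulated daily booking counts with a weekly cycle plus noise.
# (Hypothetical data for illustration only, not Orbitz data.)
set.seed(42)
days <- 1:56
bookings <- 100 + 10 * sin(2 * pi * days / 7) + rnorm(length(days), sd = 5)

# Equal-weight moving average: every point in the window gets weight 1/window.
window <- 7  # assumed window length
weights <- rep(1 / window, window)

# Linear (convolution) filter, centered on each day (sides = 2).
smoothed <- stats::filter(bookings, weights, method = "convolution", sides = 2)

# The first and last 3 values are NA because the centered window
# does not fit entirely inside the series at the edges.
print(head(smoothed, 5))
```

With an odd window length and sides = 2, each smoothed value is simply the mean of the day itself and the three days on either side, which averages out the weekly cycle in the simulated series.
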
This is another great example of applying advanced statistical and visualization techniques in R to large and complex data sets stored in a Hadoop environment. See the full slide deck for other analyses employed by the Orbitz team, including hexagonal binning charts to identify positional bias and kernel density estimation to model hotel ratings. As Ramesh says in the presentation, R has a "steep learning curve, but worth it!"

Slide deck: Using Hadoop and Hive to Optimize Travel Search
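
For readers unfamiliar with kernel density estimation, the following is a minimal sketch using base R's density function. The ratings are simulated on an assumed 1 to 5 scale; the scale, sample size and distribution are hypothetical and are not taken from the Orbitz slides.

```r
# Simulated user ratings clipped to a 1-5 scale.
# (Hypothetical data for illustration only, not Orbitz data.)
set.seed(1)
ratings <- pmin(5, pmax(1, rnorm(500, mean = 3.8, sd = 0.7)))

# Gaussian kernel density estimate, evaluated on a grid from 1 to 5.
kde <- density(ratings, from = 1, to = 5)

# The rating value at which the estimated density peaks.
mode_rating <- kde$x[which.max(kde$y)]
print(mode_rating)
```

Calling plot(kde) draws the estimated density curve, which gives a smoother picture of how ratings are distributed than a histogram with fixed bins.
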
