by Konstantin Golyaev, Data Scientist at Microsoft
I recently participated in an internal one-day Microsoft R Server (MRS) hackathon. For an experienced base R user but a complete MRS novice, this turned out to be an interesting challenge. R has fantastic and unparalleled set of tools for exploratory data analysis, as long as your data set is small enough to fit comfortably in memory. In particular, the dplyr package offers a concise and expressive data manipulation semantics, and is wildly popular among R users.
Unlike base R, MRS is able to handle large datasets that do not fit in memory. This is achieved by splitting the data into chunks that are small enough to fit in RAM, applying operations sequentially to every chunk, and storing the results, as well as data, on disk using the binary xdf (eXtended Data Frame) format. The downside to the flexibility is the need to use MRS-specific function from the RevoScaleR package, something that would have definitely slowed me down. Thankfully, the dplyrXdf package bridges the syntax gap between base R and MRS, and enables out-of-memory operations on large datasets with the same dplyr syntax. In what follows, I will describe the problem I chose to address and share the code used to obtain the solution.
Our data science team at Microsoft is split across locations in Redmond, the Bay Area, and Boston. For this reason, I decided to investigate whether planes departing from the corresponding major airports tend to arrive on time. To do this, I used the On-Time Performance Dataset from the Research and Innovative Technology Administration of the Bureau of Transportation Statistics. This dataset spans 26 years, 1988 through 2012, and is fairly large: over 148 million records, or 14 GB of raw information. The combined dataset can be obtained from the Revolution Analytics website. While it is definitely possible to work with such a dataset on a single machine using base R, unless one has access to a really powerful machine (e.g. Azure VM Standard_DS14), it would have taken a while, and we only had several hours for the hackathon.
Instead, I used 2012 data for prototyping the code, and then leveraged the out-of-memory capabilities of MRS to execute the same code on the entire dataset. I tend to be a heavy dplyr user, which makes prototyping with small data a breeze. Unfortunately, dplyr did not work with MRS xdf datasets back then (and it still does not today), which would mean that I would have had to rewrite most of my code in a way that MRS would understand. Thankfully, the dplyrXdf package made it entirely unnecessary; I relied on it as a translation mechanism that relayed my dplyr-style logic to MRS.
The data analysis itself was quite straightforward, given the time constraints of the hackathon. I provisioned a Windows Data Science Virtual Machine on Azure, which comes with a trial version of MRS. I focused on the four major airports that our team uses most frequently: Seattle (SEA), Boston (BOS), San Francisco (SFO), and San Jose (SJC). I grouped the last two into a single entity called SVC (for ‘Silicon Valley Campus’). I eliminated a number of records with missing data, as well as the few records where arrival time was not within the two-hour window of scheduled time. I grouped the flights by departure day of week, as well as departure time, which I bucketed into four six-hour long groups: from midnight to 6:00 AM, from 6:01 AM to noon, and so on. The code for the analysis can be obtained from my GitHub repository.
Once the data was properly partitioned, I computed frequencies of untimely arrivals within each cell, and plotted the results on a grid. In total, I obtained 84 plots – 3 locations, 7 days a week, 4 departure time groups. The vertical red line on each plot indicates the time when the flights were supposed to arrive. This means any probability mass to the right of the red line is bad news, while mass to the left of the red line is good news.
The first thing to notice from the SEA and SVC plots, is that, unless you travel on a plane that departs between midnight and 6:00 AM, you are not very likely to arrive late.
Frequency of late arrivals for Seattle (SEA)
It appears that “red-eye” flights from the West Coast are more likely to arrive later than scheduled, and being late by as much as 90 minutes is not out of the question for such flights.
Frequency of late arrivals for San Francisco (SFO) and San Jose (SJC)
For flights originating from BOS, however, the story is rather different. Namely, the flights leaving after 6:00 PM are also reasonably likely to be late, in addition to flights leaving between midnight and 6:00 AM. In fact, the “red-eyes” from BOS have a weird bi-modal distribution of relative arrival time: they are either 30 minutes too early or 90 minutes late. This is something to keep in mind if you travel from BOS and have a connecting flight on the way to your final destination.
Frequency of late arrivals for Boston (BOS)
In conclusion, a combination of Windows Data Science VM with MRS together with dplyrXdf turned out to be a powerhouse, for three reasons. First, it was trivial to get all the tools up and running, by provisioning the VM and installing dplyrXdf from GitHub. Second, the MRS component allowed me to scale operations to a pretty large dataset without having to worry about the implementation details. Third, the dplyrXdf component made me very productive because it eliminated the need to learn MRS-specific commands and let me use my dplyr skills.