by Dmitry Pechyoni, Microsoft Data Scientist
The New York City taxi dataset is one of the largest publicly available datasets. It has about 1.1 billion taxi rides in New York City. Previously this dataset was explored and visualized in a number of blog posts, where the authors used various technologies (e.g., PostgreSQL and Apache Elastic Search). Moreoever, in a recent blog post our colleagues showed how to build machine learning models over one year of this dataset using Microsoft R Server (MRS) running in a 4-node Hadoop cluster. In this blog post we will use a single commodity machine to show an end-to-end process of downloading and cleaning 4 years of data, as well building and evaluating a machine learning model. This entire process takes around 12 hours.
We used the Data Science Virtual Machine in Azure cloud. Our virtual machine runs Windows and has Standard A4 configuration (8 cores and 14Gb of memory). This machine comes with 126Gb of hard disk, which is not enough to store NYC taxi dataset. To store the dataset locally, we followed these instructions and attached a 1Tb hard disk to our virtual machine.
New York City taxi dataset has data from 2009-2015 for yellow taxis and from 2013-2015 for green taxis. Yellow and green taxis have different data schema. Also, we have found that the yellow taxi rides from 2010-2013 have exactly the same schema, whereas the schema of the yellow taxi rides from 2009, 2014, 2015 is different. To simplify our code, we have focused on yellow rides from 2010-2013 and discarded other ones. These are more than 600 million yellow taxi rides from 2010-2013, which take 115Gb. Our goal is to build a binary classification model that will predict if a passenger will pay a tip.
Obtaining the data, data cleaning and feature generation
To preprocess a big volume of data in a short time, we used parallel programming capabilities of Microsoft R Server (MRS), in particular the rxExec function. This function allows us to distribute a data processing over multiple cores. Since our machine has 8 cores, by using rxExec we were able to speed up the data processing significantly.
The original dataset has one csv file per month. Initially we convert each csv file to the xdf file format. Xdf files store data in compressed and optimized format and are input to all subsequent Microsoft R Server functions. This code does the above conversion, removes unnecessary columns, removes rows with noisy data, computes several new columns (duration of the trip, weekday, hour and label column) and finally converts several columns to factors.
By leveraging rxExec function and 8 cores, we are able to process 8 csv files in parallel. The data cleaning step takes 3 hours 50 minutes. The resulting xdf files occupy 26 Gb of disk space, a significant reduction from the original 115 Gb.
Generating training and test sets
In the next step we combine multiple xdf files into a single training and a single test file. We used 2010-2012 data for training and 2013 data for testing. When merging the files, we leverage again multiple cores and rxExec function. The merge is done using binary tree scheme, the merges at the same level are executed in parallel. The following picture shows an example of merging 6 files:
In this example initially 3 pairs of files (1+2, 3+4 and 5+6) are merged in parallel. Then 1+2 is merged with 3+4, and finally 1+2+3+4 is merged with 5+6. This code implements binary tree merge.
Merging of small files and creation of train.xdf and test.xdf files takes 5.5 hours. The resulting train.xdf file has 506M row and takes 18 Gb. The test file test.xdf has 148M row and takes 6 Gb.
An alternative way of preparing the data for modeling would be to convert csv files to a single composite xdf file and then to use it as an input to machine learning algorithm. In this case we can skip the merging step. However, the conversion from csv to composite xdf is done by a single call to rxImport function. This function processes csv files sequentially and is much slower than parallel creation of xdf files and their subsequent merge.
Training and Scoring
At this point we are ready to do training and scoring. Notice that the training file has 18 Gb and does not fit into memory (which is 14 Gb). Fortunately, implementations of machine learning algorithms in MRS can handle such big datasets. We ran a single iteration of logistic regression over the training set, which took 1 hour and 10 minutes. Then we ran scoring over the test set, which took 10 minutes.
We performed experiments with 3 feature sets. In the first feature set we used passenger type, trip distance, pickup latitude, pickup longitude, drop-off latitude, drop-off longitude, fare amount, MTA tax, tolls amount, surcharge, duration, hour as numeric features. We also used payment type, rate code, weekday as categorical features. Surprisingly, our model has AUC of 0.984 over the test set. Such a high AUC looks too good to be true and after additional data exploration we have found that the payment type is an excellent predictor is a passenger will leave a tip:
> rxSummary(formula = label ~ payment_type, data = ".\\xdf/train.xdf", summaryStats = c("Mean","ValidObs"), reportProgress = 0) Statistics by category (2 categories): Category payment_type Means ValidObs label for payment_type=Cash Cash 0.0005490915 289816908 label for payment_type=Credit Card Credit Card 0.9720573755 216974429
As we see in the training set, 97% of passengers who pay by credit card, leave a tip. But less than 0.1% of passengers who paid by cash left a tip. The test set has similar statistics. Hence predicting the tip at the time of payment is a very easy task. A more interesting task would be to predict if a passenger will leave a tip before the actual payment. We excluded the payment type feature, reran the experiment and got AUC of only 0.553 . This is better than the baseline AUC of 0.5, but not that impressive. At the next step we defined the hour feature as categorical variable and improved slightly the test set AUC to 0.564 . This relatively low AUC indicates that there is lots of room for improving our model.
This code runs training and scoring with our latest feature set.
The following table summarizes the running times of all steps of our experiment:
By leveraging Microsoft R Server, we were able to perform the entire process of building and evaluating machine learning model over hundreds of millions of examples using a single commodity machine within a number of hours. Our model is a simple logistic regression with a small set of features. In the future, it will be interesting to develop a model with larger feature set that will have more accurate prediction of tips before the actual payment.
The code used to create this experiment is in Github repository.