The rwml-R GitHub repo has been updated with R code for exploratory data analysis of New York City taxi data from Chapter 6 of the book “Real-World Machine Learning” by Henrik Brink, Joseph W. Richards, and Mark Fetherolf. Examples given include reading large data files with the
fread function from
data.table, joining data frames by multiple variables with
inner_join, and plotting categorical and numerical data with ggplot2.
Data for NYC taxi example
The data files for the examples in Chapter 6 of the book are available at
They are compressed as a 7-Zip file archive
(e.g. with p7zip), so you will
need to have the
7z command available in your path to decompress and load the data.
(On a Mac, you can use Homebrew to install p7zip with
brew install p7zip.)
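As a rough sketch of the decompression step, the snippet below checks from R whether `7z` is on the path and extracts an archive if one is present. The archive name `trip_data.7z` and the output directory `data` are hypothetical placeholders, not the actual file names used for the book's data.

```r
# Look up the 7z executable; Sys.which() returns "" if it is not on the PATH.
seven_zip <- Sys.which("7z")

# Hypothetical archive name, used only for illustration.
archive <- "trip_data.7z"

if (nzchar(seven_zip) && file.exists(archive)) {
  # "x" extracts with full paths; "-odata" writes into ./data (no space after -o).
  system2(seven_zip, args = c("x", archive, "-odata"))
} else {
  message("7z not found on PATH, or archive missing; skipping extraction.")
}
```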
Using fread (and dplyr…again)
As in Chapter 5, the
fread function from the
data.table library is used to quickly read in a sample of the rather large
data files. It is similar to
read.table but faster and more convenient.
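To illustrate reading only a sample of a large file, here is a minimal sketch that uses the `nrows` argument of `fread`. It writes a small temporary CSV as a stand-in for the real taxi files, so the column names and values are made up for the example.

```r
library(data.table)

# Write a small stand-in CSV so the example runs without the real taxi files.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, fare = round(runif(1000, 3, 60), 2)), tmp)

# nrows limits how many data rows are read; the header is detected automatically.
sample_dt <- fread(tmp, nrows = 100)
dim(sample_dt)
```

With the real data, the same call would take the path to a trip or fare file and `nrows = 50000`.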
The following code reads in the first 50k lines of data from one of the
trip data files and one of the fare data files. Functions from
dplyr are used to clean up the data (e.g. remove data
with unrealistic latitude and longitude values). The trip and fare data are
combined with the
inner_join function from the dplyr package.
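A minimal sketch of the cleaning and joining steps is shown here on small made-up tables rather than the real files. The column names, the key columns used in `by`, and the bounding box around New York are assumptions chosen for illustration and may differ from the code in the repo.

```r
library(dplyr)

# Synthetic stand-ins for the trip and fare tables; column names are illustrative.
pickup_times <- c("2013-01-01 00:00:00", "2013-01-01 01:00:00",
                  "2013-01-01 00:30:00", "2013-01-01 02:00:00")
trips <- data.frame(
  medallion = c("A", "A", "B", "C"),
  pickup_datetime = pickup_times,
  pickup_longitude = c(-73.98, 0, -73.95, -74.00),
  pickup_latitude = c(40.75, 0, 40.78, 40.72),
  stringsAsFactors = FALSE
)
fares <- data.frame(
  medallion = c("A", "A", "B", "C"),
  pickup_datetime = pickup_times,
  fare_amount = c(8.5, 12.0, 6.5, 20.0),
  stringsAsFactors = FALSE
)

# Drop rows whose coordinates fall outside a rough box around NYC
# (the bounds are assumptions, not values from the book).
trips_clean <- trips %>%
  filter(pickup_longitude > -74.05, pickup_longitude < -73.75,
         pickup_latitude > 40.58, pickup_latitude < 40.90)

# Join the two tables on more than one key column.
taxi <- inner_join(trips_clean, fares, by = c("medallion", "pickup_datetime"))
nrow(taxi)
```

Because `inner_join` keeps only rows with matching keys in both tables, the trip with coordinates (0, 0) that was filtered out does not reappear in the joined result.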
Exploring the data
In the complete code-through, plots of categorical and numerical
features of the data are made using
ggplot2, including a visualization of the pickup locations in latitude and
longitude space which is shown below. With slightly less than 50,000 data
points, we can clearly see the street layout of downtown Manhattan.
Many of the trips originate in the other boroughs of New York, too.
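A pickup-location plot of this kind can be sketched with `ggplot2` as follows. The points here are randomly generated around midtown coordinates purely so the example runs on its own; the column names and plot settings are assumptions, and the result will not show a street grid the way the real data does.

```r
library(ggplot2)

# Randomly generated stand-in for pickup coordinates (not real taxi data).
set.seed(1)
pickups <- data.frame(
  pickup_longitude = rnorm(500, mean = -73.98, sd = 0.02),
  pickup_latitude = rnorm(500, mean = 40.75, sd = 0.02)
)

# Small, semi-transparent points keep dense areas readable.
p <- ggplot(pickups, aes(x = pickup_longitude, y = pickup_latitude)) +
  geom_point(size = 0.5, alpha = 0.3) +
  labs(x = "Pickup longitude", y = "Pickup latitude")
print(p)
```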
If you have any feedback on the rwml-R project, please
leave a comment below or use the Tweet button.
As with any of my projects, feel free to fork the rwml-R repo
and submit a pull request if you wish to contribute.
For convenience, I’ve created a project page for rwml-R with
the generated HTML files from
knitr, including a page with
all of the event-modeling examples from Chapter 6.