This post was inspired by my wife who used the GPS data from her Strava app to plot her running routes during 2020. Since I don’t run nearly as much as I used to, I need to go back to when I was training for the NYC marathon to find enough running to make such a map worthwhile. Also presenting a challenge is that I’m a bit of a luddite when it comes to running technology. I don’t have a GPS watch and I don’t run with a phone. To track my runs I manually enter my routes and workouts into MapMyRun and I time my runs with an ol’ fashioned sportswatch.
While this works for me on the road, it made the data gathering process for this visualization more difficult. And while MapMyRun does have TCX files for each workout, its not that useful if the data didn’t come from a GPS watch.
At the end of the day, my goal with this analysis is to make a cool looking heatmap of my training routes for the NYC Marathon… or at least to make a visualization that was cooler looking that my wife’s.
For those who can’t wait… this was final output:
This analysis uses four main packages.
Tidyverse for data manipulation,
sf for modifying spatial data,
tigris for getting the basemaps to plot my routes and
extrafont to bring in new fonts for the plots.
library(tidyverse) # Data Manipulation library(sf) # Manipulation Spatial Data library(tigris) # Getting Tract and Roads Spatial Data library(extrafont) # Better Fonts For GGPLOT
If I had a GPS watch or used Strava, I could just download all my files which would contain Geo information and plot it directly. But because I do everything manually, I needed to jump through some hoops. From my MapMyRun account I was able to download:
user_workout_history.csv– Containing all of my workouts along with a column for route_id.
- GPX files for each route that I had saved.
This led to the semi-painful manual process of using the first file to write down each route id that I had run, look up that route, and download the individual GPX file. Fortunately, I’m a creature of habit and and ran the same routes often, so there were only 24 to individually download.
The User Workout History File
This file was a CSV file exported from MapMyRun which contained one row for each workout I did along with meta-data such as date, time, speed, etc. However, the important column is route id which will be used to join the geo-data from the route’s GPX files.
runs <- read_csv('data/user_workout_history.csv') %>% # Create Route ID column mutate(route_id = str_extract(RouteID, '\\d+') %>% as.integer)
The Route GPX Files
As mentioned above the geocoded data for each route lives in GPX files, one for each of the 24 routes. Since I would apply the same pre-processing to each file this is a good candidate for the
map_dfr function to construct the data frame.
The following code uses
dir() to get a list of all the files in the directory as vectors, the
keep() function trims the vector to only the GPX files, and each GPX file is then passed into
read_sf to read in the geo-data. The data is subset to only two columns, and a route_id is created based on the numbers in the file name.
Finally, geo-data in
sf lives in a GEOMETRY column. However, in order to get the latitudes and longitudes as individual columns I use
st_coodinates to creates “X” and “Y” columns for longitude and latitude.
all_routes <- map_dfr( #Get all gpx files in the directory keep(dir('data'), ~str_detect(.x, "gpx")), #Read them in ~read_sf(paste0('data/',.x), layer = "track_points") %>% #keep the segment id and the geometry field select(track_seg_point_id, geometry) %>% # create a route_id based on the file mutate(route_id = parse_number(.x)) ) %>% #Extract Lat and Long as Columns cbind(., st_coordinates(.))
After the processing the data looks like:
|0||111694131||-73.97597||40.77624||POINT (-73.97597 40.77624)|
|1||111694131||-73.97555||40.77605||POINT (-73.97555 40.77605)|
|2||111694131||-73.97555||40.77605||POINT (-73.97555 40.77605)|
|3||111694131||-73.97546||40.77582||POINT (-73.97546 40.77582)|
|4||111694131||-73.97546||40.77582||POINT (-73.97546 40.77582)|
|5||111694131||-73.97552||40.77527||POINT (-73.97552 40.77527)|
Combining Runs and Routes
With all the workouts in the
runs data and all the routes in the
all_routes data, a simple inner-join will combine them. This will duplicates routes that I ran multiple times, which in this case would be the desired behavior.
#Join Routes to Runs to Duplicate runs_and_routes <- runs %>% inner_join(all_routes, by = "route_id")
Creating a map of NYC
Since the goal is to create a heatmap of the various routes I ran as part of marathon training, I need a map that contains all of the possible roads in NYC. The
tigris package allows for the access to US Census TIGER shapefiles. One of the levels is “roads”, which can be downloaded using the
road() function where the first parameter is state and 2nd parameter is county (New York County is Manhattan):
###Download Roads Map from Tigris nyc <- roads("NY", "New York") ggplot() + geom_sf(data = nyc) + ggthemes::theme_map()
The function provides road data for all of Manhattan. However, I did not run every part of Manhattan, so it would make more sense to truncate the map to areas where I did run.
In order to do this, I first need to define a boundary box based on my routes. Given a geometry, the
st_bbox() function from
sf will return a “bbox” object containing the four corners of my routes.
st_bbox(runs_and_routes$geometry) ## xmin ymin xmax ymax ## -74.01880 40.70806 -73.93118 40.82113
However, this will not provide any padding around my running routes which will make for a worse visualization. So I will use
map2_dbl to add a delta of 0.01 to the maximum values and remove a delta of -0.01 to the minimum values to slightly increase the bounding box.
### Construct Bounding Boxes and Expand Limits By A Delta bbox <- map2_dbl( st_bbox(runs_and_routes$geometry), names(st_bbox(runs_and_routes$geometry)), ~if_else(str_detect(.y, 'min'), .x - .01, .x + .01) ) bbox ## xmin ymin xmax ymax ## -74.02880 40.69806 -73.92118 40.83113
With an updated bounding box, I can now crop the initial map with my bounding box using the
st_crop() function. Also, in order to make the Coordinate Reference Systems the same, I use
st_transform to make sure the NYC map is using the same coordinates as my routes.
#Set CRS for NYC to CRS for Running Routes And Crop to the Bounding Box nyc2 <- st_transform(nyc, crs = st_crs(runs_and_routes$geometry)) %>% st_crop(bbox) ggplot() + geom_sf(data = nyc2) + ggthemes::theme_map()
We’ve now cut off Governor’s Island from the bottom left corner as well as parts of Northern Manhattan that I never ran to.
Constructing the Heatmap
With the new basemap created and the route data in its own data frame. I can create the heatmap using
stat_density2d with the route data and
geom_sf with the map data. From the
stat_density2d piece I pass in the routes data and set the fill value to be the count at each X and Y using the
after_stat() option. The
n parameter sets the number of grid points in each directions for the density.
The base map is very rectangular where it is tall but skinny. This made it difficult to add titles. To make things look better, I use
ggdraw from the
cowplot package to create a new drawing layer and add titles/captions to that layer.
p <- ggplot() + #Construct the Heatmap Portion stat_density2d(data = runs_and_routes, aes(x = X, y = Y, fill = after_stat(count)), geom = 'tile', contour = F, n = 1024 ) + #Draw the Map of Manhattan geom_sf(data = nyc2, color = '#999999', alpha = .15) + scale_fill_viridis_c(option = "B", guide = F) + ggthemes::theme_map() + theme( panel.background = element_rect(fill = 'black'), plot.background = element_rect(fill = 'black') ) cowplot::ggdraw(p) + labs(title = "JLaw's Marathon Training Heatmap", caption = "**Author**: JLaw") + theme(panel.background = element_rect(fill = "black"), plot.background = element_rect(fill = 'black'), plot.title = element_text(color = "#DDDDDD", family = 'Nirmala UI', #face = 'bold', size = 18), plot.caption = ggtext::element_markdown(color = '#DDDDDD', family = 'Calibri Light', hjust = 1, size = 12), )
I’m really happy with how this came out. It also provides some information about my running habits, mainly that I ran in Central Park a lot and that you can roughly tell where I worked at the time as that area is slightly hotter. There are some parts of Manhattan that I did run but don’t show up well in the map because I might have only run there once. An exploration of how much of Manhattan did I run will be covered in a follow-up post.