Visualizing Bike Share Data (NiceRide)


This tutorial covers exploring and visualizing 2018 data for the Minneapolis, MN bike sharing service NiceRide. Part of what makes R incredible is the number of great packages, and part of what makes packages like ggmap and gganimate great is how they build on existing packages.

Station Network

First step, as always, is to include the libraries we will be using.

# The Usual Suspects
library(ggplot2)
library(ggthemes)
library(dplyr)
library(magrittr)
library(data.table)
library(stringr)
# Fetching
library(rvest)
# XML parsing (xml_child, xml_find_all, as_list)
library(xml2)
# Reading CSVs (read_csv)
library(readr)
# Cleaning column names
library(janitor)
# Date/Time formatting
library(lubridate)
# Maps
library(sf)
library(ggmap)
# Used for animated density plots
library(gganimate)
# Only needed for interactive maps
library(leaflet)
library(leaflet.extras)

Fetching the Data (Fail)

First we are going to cover how I would normally approach fetching tables like the ones shown on https://s3.amazonaws.com/niceride-data/index.html.

This approach did not work.

url <- "https://s3.amazonaws.com/niceride-data/index.html"
repo <- url %>% read_html
tbls <- html_nodes(repo, "table")
# Shows all tables
tbls

This is where we learn the approach didn’t work. Printing tbls gives us this:

{xml_nodeset (1)}
[1] <table class="hide-while-loading table table-striped">...

Since there is only one table, we would normally be able to use the html_table() function from rvest and get a nice data frame.

df <- html_table(tbls[1],fill = T)[[1]]
nrow(df)
# [1] 0

Instead, we get the result shown in the comment above: zero rows. This is because the table is empty in the initial HTML and is populated by JavaScript after the page loads, so read_html() never sees the data. There is still hope though.

Fetching the Data (Success)

This is the approach that worked. If we go up one level to https://s3.amazonaws.com/niceride-data/ we can see an XML listing of the bucket's contents, which we can use.

url <- "https://s3.amazonaws.com/niceride-data/"
repo <- url %>% read_html %>% 
  xml_child(1) %>% xml_child(1) %>% xml_find_all(".//key")

# Now we have an xml_nodeset
repo[1:10]

That last bit shows a sampling of repo:

{xml_nodeset (10)}
 [1] <key>201804-niceride-tripdata.csv.zip</key>
 [2] <key>201805-niceride-tripdata.csv.zip</key>
 [3] <key>201806-niceride-tripdata.csv.zip</key>
 [4] <key>201807-niceride-tripdata.csv.zip</key>
 [5] <key>201808-niceride-tripdata.csv.zip</key>
 [6] <key>201809-niceride-tripdata.csv.zip</key>
 [7] <key>201810-niceride-tripdata.csv.zip</key>
 [8] <key>201811-niceride-tripdata.csv.zip</key>
 [9] <key>Nice_Ride_data_2010_season.zip</key>
[10] <key>Nice_Ride_data_2011_season.zip</key>

Awesome, this is something we can use.

repo %<>% as_list %>% unlist %>% 
  #Starts with 2018
  str_match_all("^2018.*") %>% unlist

The function as_list is needed to convert the xml_nodeset into a plain list, which unlist then flattens into a character vector. We used str_match_all to keep only the files that start with 2018. The %<>% operator pipes repo into the chain and also assigns the chain's return value back to repo. At this point, if you were to call repo, you would see the following…

> repo
[1] "201804-niceride-tripdata.csv.zip"
[2] "201805-niceride-tripdata.csv.zip"
[3] "201806-niceride-tripdata.csv.zip"
[4] "201807-niceride-tripdata.csv.zip"
[5] "201808-niceride-tripdata.csv.zip"
[6] "201809-niceride-tripdata.csv.zip"
[7] "201810-niceride-tripdata.csv.zip"
[8] "201811-niceride-tripdata.csv.zip"

Downloading and Unzipping

You could merge both actions into one loop, but I made two. The %T>% (tee) operator comes in handy here: it passes the zip path on to unlink(), rather than the return value of unzip().

dir.create("import")
# Download Files
for(file in repo) {
  download.file(paste0(url,file),destfile = paste0("import/",file)) 
}
# Unzip 
for(file in repo) {
  paste0("import/",file) %T>%
    unzip(exdir = "import") %>%
    unlink
}
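
For clarity, the %T>% pipe above is doing the work of three plain statements; a rough equivalent without magrittr:

# What the tee pipe expands to for each file:
zipfile <- paste0("import/", file)   # path to the downloaded archive
unzip(zipfile, exdir = "import")     # extract the CSV alongside it
unlink(zipfile)                      # delete the zip, not unzip()'s return value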

Read and Merge into a Dataframe

Now that we have the files unzipped we can read them into a list and merge the list into a dataframe. You could do this manually, but this approach lets you change the year easily, or merge 2018 and 2017 into one dataframe to compare ridership between years.

# Read and merge
import <- lapply("./import" %>% list.files, function(name) {
  return(read_csv(paste0("./import/",name)))
})
rides <- rbindlist(import,fill = T)
rides %<>% clean_names(case = "snake")
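
If you do merge multiple seasons, tagging each ride with its season makes year-over-year comparison easy; a minimal sketch using lubridate (the season column is a name of my own choosing):

# Hypothetical convenience column: the season each ride belongs to,
# derived from start_time rather than the filename
rides$season <- year(rides$start_time)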

Additional Columns

Now that we have the data in rides, we can build some extra columns/features. Age is a much more intuitive field than birth_year, and grouping ages into bins will come in handy later.

rides$age <- 2019-rides$birth_year
rides$age_bin <- rides$age %>% .bincode(seq(0,120,20))
rides$age_bin <- sapply(rides$age_bin,function(bin) {
  return(paste0((bin-1)*20,"-",(bin*20)," Years Old"))
})
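
As a quick sanity check on the binning logic (a throwaway example, not part of the pipeline): a 30-year-old should land in bin 2, which the paste0() call renders as "20-40 Years Old".

# .bincode() with breaks seq(0,120,20) puts age 30 in bin 2
bin <- .bincode(30, seq(0, 120, 20))
paste0((bin - 1) * 20, "-", bin * 20, " Years Old")
# [1] "20-40 Years Old"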

We will probably want to view things at a more macro level, so we should build out some date/time columns as well.

# Trip times
rides$minutes <- rides$tripduration/60
rides$hours <- rides$tripduration/60/60
# Start times
rides$start_hour <- lubridate::hour(rides$start_time)
# Minutes past midnight
rides$mm <- hour(rides$start_time)*60 + minute(rides$start_time)
rides$start_day <- wday(rides$start_time,label = T, abbr = F, week_start = 1)
# Weekend/Weekday
rides$start_day_type <- ifelse(wday(rides$start_time, week_start = 1)>5, "Weekend", "Weekday")
# Week of year
rides$week <- week(rides$start_time)
# Month (1-12), kept numeric so we can filter on it later
rides$month <- month(rides$start_time)
# Month (January-December)
rides$month_text <- month(rides$start_time,label = T,abbr = F)
# Remove unused levels from factor
rides$month_text <- droplevels(rides$month_text)
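
A quick way to eyeball the new columns before plotting (output omitted):

rides %>%
  select(start_time, start_hour, mm, start_day, week, month, month_text) %>%
  head(3)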

Some Tables for Context

You can print the info in a different way; I am just doing this to format everything better for the web. Each bin is shown as a percentage of all rides.

table(rides$age_bin) %>% prop.table %>%
  multiply_by(100) %>% round(2) %>% as.list

$`0-20 Years Old`
[1] 1.43

$`100-120 Years Old`
[1] 0

$`20-40 Years Old`
[1] 35.39

$`40-60 Years Old`
[1] 58.25

$`60-80 Years Old`
[1] 4.82

$`80-100 Years Old`
[1] 0.1

From this we can see that tossing out anything below half a percent would remove the 80-100 and 100-120 bins. We could probably remove 0-20 as well, but that is up to you.
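
If you want to drop the sparse bins outright rather than filtering inline as the plots below do, one option (a sketch; rides_trimmed is just an illustrative name):

# Keep riders 80 and under; the plots below filter on age inline instead
rides_trimmed <- rides %>% filter(age <= 80)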

The Delicious Part

Now comes the time to visually explore the data. For the sake of simplicity I am just using theme_fivethirtyeight() and the scale_.*_viridis.*() functions for the theme and colors of most of these plots.

Visualizing Ridership by Month

ggplot(data=rides[which(rides$age<=60),], aes(x=week, fill= month_text)) +
  geom_histogram(alpha=.9) + theme_fivethirtyeight() + ggtitle("Ride Frequency by Week of Year") + 
  facet_grid(vars(usertype), vars(age_bin)) + scale_fill_viridis_d()
ggsave(filename = "ride-frequecy-histogram.png",width = 8,units = "in")

Ride Frequency Histogram

The first thing I notice is that the 0-20 and 60-80 year old frequencies are so low they are nearly impossible to identify from this plot. If we want to learn what week of the year people are riding, a density plot might work better. There is probably a better way to do this, but in order to get the red line I added an additional geom_density() layer.

ggplot(data=rides[which(rides$age<=80),], aes(x=week, fill= age_bin)) +
  geom_histogram(alpha=.9,aes(y=..density..)) + theme_fivethirtyeight() + ggtitle("Ride Distribution by Week of Year") + 
  geom_density(alpha=0,color=rgb(1,0,0,.4)) + 
  facet_grid(vars(usertype), vars(age_bin)) + scale_fill_viridis_d()
ggsave(filename = "ride-frequency-density.png",width = 8,units = "in")

Ride Frequency Density

Looks like we are seeing a downward trend toward the end of the year for subscribers/20-80 and customers/20-80, with the steepest decline for subscribers/60-80. That makes sense: from the previous plot we saw that week 40 falls in October, and Minneapolis can get chilly.

Visualizing Start Time

I am dumping a big chunk of code here, don’t be alarmed. The first bit filters with dplyr to keep only the more active months (May through September). I also filtered out the age bins we talked about earlier. The droplevels() call is there because ggplot didn’t like unused factor levels (from the months).

I also added breaks and labels for the time to correspond with peak times.

ggplot(rides %>% filter(age <= 80,
                        month > 4,
                        month < 10) %>% droplevels(),
       aes(x=mm, fill= age_bin)) +
  geom_density(alpha=.6) +
  scale_x_continuous(labels = c("5am","8am","1:30pm","5pm","8pm"),
                     breaks = c(300,480,750,1020,1200)) + 
  labs(fill="",title="NiceRide 2018 Start Times") + 
  theme_fivethirtyeight() +
  theme(strip.background = element_rect(fill = "#FFFFFF")) +
  facet_grid(vars(usertype), vars(start_day_type)) + scale_fill_viridis_d(option="A")

Ride Start Times

This plot gives us a lot of really interesting information! You can see how NiceRide is used by many people who are presumably commuting. There is also a lunch rush for NiceRide bikes.

Maps!

So far all of this information has been on a macro level for time-based ridership. Now we will switch to location-based ridership.

Static Network Map

In order to generate static heatmaps we will be using the ggmap package along with the ggplot2::geom_segment() function. There is also a ggplot2::geom_curve() function, which you can use to generate curved line segments; for curved segments you must also use coord_cartesian().

In order to speed up the rendering of geom_segment() I aggregated the data and then scaled the opacity of the line segments. I don’t think this was a necessary step, but it made rendering (and tweaking the plot) much easier.

df.lines <- rides %>%
  group_by(start_station_longitude,
           start_station_latitude,
           end_station_longitude,
           end_station_latitude,
           start_station_name,
           end_station_name) %>%
  summarize(rides = n())
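
Before plotting, it is worth a quick look at how much the aggregation shrank the data (the comments are illustrative; your exact counts will differ):

nrow(rides)    # one row per individual ride -- hundreds of thousands
nrow(df.lines) # one row per station pair -- a few thousand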

Now we have a handy dandy dataframe with the number of rides between each pair of stations. The start_station_name and end_station_name columns are not required, but they are helpful if you want to further explore station data. You can also summarize on the fly for station information with dplyr:

rides %>%
  group_by(start_station_name,end_station_name) %>%
  filter(start_station_name!="NULL") %>%
  summarize(rides = n()) %>% ungroup %>% top_n(10)

Selecting by rides
# A tibble: 10 x 3
   start_station_name                end_station_name                  rides
   <chr>                             <chr>                             <int>
 1 100 Main Street SE                100 Main Street SE                 1161
 2 11th Ave S & S 2nd Street         11th Ave S & S 2nd Street          1436
 3 6th Ave SE & University Ave       6th Ave SE & University Ave        1148
 4 Lake Calhoun Center               Lake Calhoun Center                2804
 5 Lake Como Pavilion                Lake Como Pavilion                 2070
 6 Lake Harriet Bandshell            Lake Harriet Bandshell             2332
 7 Lake Street & Knox Ave S          Lake Street & Knox Ave S           6852
 8 Minnehaha Park                    Minnehaha Park                     1673
 9 W 36th Street & W Calhoun Parkway W 36th Street & W Calhoun Parkway  2012
10 Weisman Art Museum                Willey Hall                        1081

Neat, right? That Lake Street & Knox Ave S location had over six thousand round trips. It is also pretty interesting that all of the top 10 were round trips except for Weisman Art Museum/Willey Hall. Expanding to the top 20, we see Lake Street & Knox Ave S pop up four times as a starting station.
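
The only change needed is the top_n() argument:

rides %>%
  group_by(start_station_name,end_station_name) %>%
  filter(start_station_name!="NULL") %>%
  summarize(rides = n()) %>% ungroup %>% top_n(20)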

Selecting by rides
# A tibble: 20 x 3
   start_station_name                end_station_name                  rides
   <chr>                             <chr>                             <int>
 1 100 Main Street SE                100 Main Street SE                 1161
 2 11th Ave S & S 2nd Street         11th Ave S & S 2nd Street          1436
 3 6th Ave SE & University Ave       6th Ave SE & University Ave        1148
 4 Harriet Island                    Harriet Island                      970
 5 Hiawatha Ave & E 50th Street      Hiawatha Ave & E 50th Street        827
 6 Lake Calhoun Center               Lake Calhoun Center                2804
 7 Lake Calhoun Center               Lake Street & Knox Ave S            864
 8 Lake Como Pavilion                Lake Como Pavilion                 2070
 9 Lake Harriet Bandshell            Lake Harriet Bandshell             2332
10 Lake Nokomis                      Lake Nokomis                       1059
11 Lake Street & Knox Ave S          Lake Calhoun Center                 777
12 Lake Street & Knox Ave S          Lake Harriet Bandshell              741
13 Lake Street & Knox Ave S          Lake Street & Knox Ave S           6852
14 Lake Street & Knox Ave S          W 36th Street & W Calhoun Parkway   757
15 Minnehaha Park                    Minnehaha Park                     1673
16 W 36th Street & W Calhoun Parkway Lake Street & Knox Ave S            826
17 W 36th Street & W Calhoun Parkway W 36th Street & W Calhoun Parkway  2012
18 Weisman Art Museum                Social Sciences                     795
19 Weisman Art Museum                Willey Hall                        1081
20 Willey Hall                       Weisman Art Museum                 1036

Getting the Base Map

The package ggmap has a function called get_map() which we will be using. You could pass the name of a city, but I am passing the bounding box of our station coordinates.

# First you will need a Google API Key
register_google(key = "YOUR_API_KEY")
mpls <- get_map(c(left = min(rides$start_station_longitude), 
                  bottom = min(rides$start_station_latitude), 
                  right = max(rides$start_station_longitude), 
                  top = max(rides$start_station_latitude)),
                maptype='terrain', source='stamen', zoom=13)
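
As an aside, if you would rather not register a Google key at all, ggmap also exposes get_stamenmap(), which takes the same bounding box directly; a minimal sketch:

# Same bounding box, requested straight from the Stamen tile set
mpls <- get_stamenmap(bbox = c(left   = min(rides$start_station_longitude),
                               bottom = min(rides$start_station_latitude),
                               right  = max(rides$start_station_longitude),
                               top    = max(rides$start_station_latitude)),
                      maptype = "terrain", zoom = 13)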

Now that we have our city mpls tiles, we can add some layers, including our geom_segment() layer. The function scale_alpha() was a lifesaver here; without it we could not have used the aggregated df.lines data, because the plot would either be blacked out (mapping alpha inside aes() with the default scale) or have an opacity not driven by df.lines$rides (setting a constant alpha outside aes()). The function theme_nothing() comes from ggmap and is useful for removing axis labels, among other attributes.

# Darken allows us to lighten up the map
ggmap(mpls,darken = c(.8,"#FFFFFF")) + 
  geom_segment(data = df.lines,
               aes(x = start_station_longitude, 
                   y = start_station_latitude,
                   xend = end_station_longitude,
                   yend = end_station_latitude,
                   alpha = sqrt(rides)),
               color = "#000000") + coord_cartesian() +
  scale_alpha(range = c(0.0001, .5)) +
  geom_point(data = df.lines %>% 
               group_by(longitude = start_station_longitude,
                        latitude = start_station_latitude) %>%
               summarize(rides = sum(rides)),
             aes(x = longitude, 
                 y = latitude,
                 size = rides),
             color="#009900",alpha=.4) + 
  scale_size_continuous(range = c(4, 100)) +
  scale_color_viridis_c() + 
  scale_fill_viridis_c() + 
  theme_nothing()
ggsave(filename = "station-network.jpg",width = 8,units = "in")

I made the executive decision to aggregate data for geom_point() on the fly. Without aggregating, we would be left with multiple points for each start station.

Station Network

Some of those round trips threw off the scale, which is why sqrt() is used: the busiest station pair logged 6,852 rides while many pairs see far fewer, and sqrt() compresses that 6,852 down to about 83 on the alpha scale. You could substitute log() for sqrt(), which would compress it further to about 8.8, but I found the results more palatable as written.

There are a couple of insights to be gleaned from this plot. There is a nice little rectangle of trips down at the bottom along Minnehaha Creek, the Chain of Lakes, and then Lake Street connecting to either West River Parkway or Hiawatha. There is also far more activity in Minneapolis than St. Paul, as expected. This might also be a good reference for where to put additional NiceRide stations. You may want to first replace geom_segment() with geom_path() fed by the ggmap::route() function, which I will go over in another, shorter tutorial.

Leaflet Interactive Map

This one also required a new dataframe. Essentially we are summarizing the start stations and end stations separately so we can see if there is a change in the heatmap when cycling between the two layers.

df.heatmap.start_end <- list()
df.heatmap.start_end$start <- rides %>% 
  group_by(start_station_longitude, start_station_latitude) %>%
  summarize(intensity = sqrt(n()))
names(df.heatmap.start_end$start)[1:2] <- c("longitude","latitude")

df.heatmap.start_end$end <- rides %>%
  group_by(end_station_longitude, end_station_latitude) %>%
  summarize(intensity = sqrt(n()))
names(df.heatmap.start_end$end)[1:2] <- c("longitude","latitude")

df.heatmap.start_end$start$pos <- "Start"
df.heatmap.start_end$end$pos <- "End"

df.heatmap.start_end %<>% rbindlist(fill = T)

Hot Hot Heat

You don’t have to add two layers of heatmaps, but you can! I segmented the data between start and end to see if there was any change. The function addLayersControl() lets us toggle between the two layers. You can tweak the parameters to your liking, but I found these settings produced a plot I was happy with.

leaflet(df.heatmap.start_end) %>% 
  addProviderTiles(providers$CartoDB.DarkMatter) %>%
  addHeatmap(data = df.heatmap.start_end %>% filter(pos=="Start"),
             lng=~longitude, 
             lat=~latitude, 
             intensity = ~intensity,
             blur = 10, 
             max = 100, radius = 15,
             layerId = "Start", group = "Start") %>%
  addHeatmap(data = df.heatmap.start_end %>% filter(pos=="End"),
             lng=~longitude, 
             lat=~latitude, 
             intensity = ~intensity,
             blur = 10, 
             max = 100, radius = 15,
             layerId = "End", group = "End") %>%
  addLayersControl(
    baseGroups = c("Start","End"),
    options = layersControlOptions(collapsed = FALSE)
  )

There was some change, and it was negligible.

Bonus

As a bonus, the ggTimeSeries package can turn a simple table of daily ride counts into a calendar heatmap.

library(ggTimeSeries)
# Generate frequency table
df.cal <- rides$start_time %>% as_date() %>% table %>% data.frame
names(df.cal) <- c("Date","Rides")
df.cal$Date %<>% as_date

ggplot_calendar_heatmap(
  df.cal,
  'Date',
  'Rides'
) + theme_fivethirtyeight() + 
  theme(legend.position = "right",
        legend.direction = "vertical") + 
  scale_fill_viridis_c()
ggsave(filename = "ride-calendar.png",width = 8,units = "in")

Ride Calendar

Wrap Up

All in all there is a ton of data to go through, and plenty more can be done. Some ideas for those who want to continue this project…

  • Compare ridership by years
  • Append rides with weather information
  • Determine features with significant deviation
  • Use geom_path() and map suggested bike routes
  • Create static heatmaps using 2d density plots (see the sketch below)
  • Calendar Views
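
For that static-heatmap idea, here is a minimal starting sketch over the same mpls basemap; the geom, alpha, and scale choices are guesses you would want to tune, and it only looks at ride starts:

# Hypothetical starting point: 2d density of ride starts on the basemap
ggmap(mpls, darken = c(.8, "#FFFFFF")) +
  stat_density_2d(data = rides,
                  aes(x = start_station_longitude,
                      y = start_station_latitude,
                      fill = ..level..),
                  geom = "polygon", alpha = .3) +
  scale_fill_viridis_c() +
  theme_nothing()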
