Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate

April 26, 2017
(This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers)

There is no shortage of analyses of the NYC bike share system. Most of them aim at predicting the demand for bikes and balancing bike stock, i.e., forecasting when to remove bikes from fully occupied stations and when to refill stations before the supply runs dry.

 

This is why I decided to take a different approach and use the Citi Bike data to help its users instead.

 

The Challenge

The online dating scene is complicated and unreliable: there is a discrepancy between what online daters say and what they do. Although this challenge is no longer relevant to me (I am married), I wish that, as a bike enthusiast, I had had a platform where I could spot like-minded people who actually ride a bike (and don’t just pretend to).

The goal of this project was to turn the Citi Bike data into an app where a rider could identify the best spots and times to meet other Citi Bike users and cyclists in general.

 

 

The Data

As of March 31, 2016, the total number of annual subscribers was 163,865, and Citi Bike riders took an average of 38,491 rides per day in 2016 (source: Wikipedia).

This is more than 14 million rides in 2016!
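As a quick sanity check on that figure, here is the arithmetic as a few lines of Python. The only input is the daily average quoted above; the day count assumes the average applies to every day of 2016, which was a leap year.

```python
# Average quoted in the text above; 2016 had 366 days (leap year).
AVG_RIDES_PER_DAY = 38_491
DAYS_IN_2016 = 366

# Assumes the daily average holds for the whole year.
total_rides_2016 = AVG_RIDES_PER_DAY * DAYS_IN_2016
print(f"Estimated rides in 2016: {total_rides_2016:,}")
```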

I used the Citi Bike data for the month of May 2016 (approximately 1 million observations). Citi Bike provides the following variables:

  • Trip duration (in seconds).
  • Timestamps for when the trip started and ended.
  • Station locations for where the trip started and ended (both the names and coordinates).
  • Rider’s gender and birth year – this is the only demographic data we have.
  • Rider’s plan (annual subscriber, 7-day pass user or 1-day pass user).
  • Bike ID.
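To make these variables concrete, here is a minimal sketch in Python of how one trip record could be turned into the fields used in the rest of this post. It is not the code behind the app. The column names, the timestamp layout, the gender coding (0, 1, 2) and the three sample rows are all assumptions made for illustration; check them against the file you actually download.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; station names are placeholders, not real stations.
SAMPLE_CSV = """\
tripduration,starttime,start station name,end station name,usertype,birth year,gender
538,5/2/2016 08:05:00,Station A,Station B,Subscriber,1988,1
1140,5/7/2016 13:20:00,Station C,Station C,Customer,,0
702,5/3/2016 17:45:00,Station B,Station A,Subscriber,1979,2
"""

DATA_YEAR = 2016  # used to turn a birth year into an approximate age
GENDER_CODES = {"0": "unknown", "1": "male", "2": "female"}  # assumed coding


def parse_trip(row):
    """Convert one raw CSV row into typed fields."""
    started = datetime.strptime(row["starttime"], "%m/%d/%Y %H:%M:%S")
    # Short-term customers may have no birth year on record.
    birth_year = int(row["birth year"]) if row["birth year"] else None
    return {
        "minutes": int(row["tripduration"]) / 60,  # duration is given in seconds
        "weekday": started.weekday(),              # 0 = Monday ... 6 = Sunday
        "hour": started.hour,
        "start_station": row["start station name"],
        "end_station": row["end station name"],
        "usertype": row["usertype"],
        "gender": GENDER_CODES.get(row["gender"], "unknown"),
        "age": DATA_YEAR - birth_year if birth_year is not None else None,
    }


trips = [parse_trip(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for trip in trips:
    print(trip)
```

Deriving the weekday, hour and age once, at load time, keeps every later aggregation a simple group-and-summarize step.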

 

Riders per Age Group

Before moving ahead with building the app, I was interested in exploring the data and identifying patterns in relation to gender, age and day of the week. Answering the following questions helped identify which variables influence how riders use the Citi Bike system and form better features for the app:

  • Who are the primary users of Citi Bike?
  • What is the median age per Citi Bike station?
  • How do the days of the week impact biking behaviours?

As I expected, based on my daily rides from Queens to Manhattan, 75% of the Citi Bike trips are taken by males. The primary users are 25 to 34 years old.
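Numbers like these come from two simple aggregations: the share of trips per gender and a count of trips per age group. The sketch below shows both in Python on toy data. The eight trips are made up and do not reproduce the figures above, and the age bins other than 25-34 and 35-44 are my own assumption.

```python
from collections import Counter

# Toy (gender, age) pairs, one per trip; not taken from the real dataset.
trips = [
    ("male", 28), ("male", 33), ("female", 26), ("male", 41),
    ("female", 52), ("male", 19), ("male", 30), ("female", 37),
]

# (lowest age, highest age, label); the outer bins are assumed for illustration.
AGE_BINS = [
    (0, 24, "under 25"),
    (25, 34, "25-34"),
    (35, 44, "35-44"),
    (45, 54, "45-54"),
    (55, 200, "55+"),
]


def age_group(age):
    """Return the label of the bin that contains this age."""
    for low, high, label in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age out of range: {age}")


male_share = sum(1 for gender, _ in trips if gender == "male") / len(trips)
trips_per_group = Counter(age_group(age) for _, age in trips)
largest_group, _ = trips_per_group.most_common(1)[0]

print(f"Share of trips taken by males: {male_share:.1%}")
print(f"Largest age group: {largest_group}")
```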

Riders per Age Group

 

Distribution of Riders per Hour of the Day (weekdays)

However, while we might expect these young professionals to be the primary users on weekdays around 8-9am and 5-6pm (when they commute to and from work), and older riders to take over the Citi Bike system midday, this hypothesis proved to be wrong. Tourists don’t have anything to do with it: short-term customers represent only 10% of the dataset.

Distribution of Riders per Hour of the Day (weekdays only)

 

Median Age per Departure Station

Looking at the median age of the riders for each departure station, we see the youngest riders in the East Village, while older riders start their commute from Lower Manhattan (as shown in the map below). The age trends disappear when mapping by arrival station, especially in the Financial District (in Lower Manhattan), which is populated by the young wolves of Wall Street (map not shown).

The map also confirms that the Citi Bike riders are mostly between 30 and 45 years old.
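The value behind each dot on such a map is a median computed per station. Here is a small Python sketch of that grouping on toy data. The station names are placeholders, and the minimum-trips threshold is an assumption I added so that a station with a single rider does not get a misleading median.

```python
from collections import defaultdict
from statistics import median

# Toy (departure station, rider age) pairs; station names are placeholders.
trips = [
    ("Station A", 24), ("Station A", 27), ("Station A", 31),
    ("Station B", 38), ("Station B", 44),
    ("Station C", 52),
]

MIN_TRIPS = 2  # assumed threshold: ignore stations with too few trips

# Collect all rider ages observed at each departure station.
ages_by_station = defaultdict(list)
for station, age in trips:
    ages_by_station[station].append(age)

# Keep one median per station that has enough trips.
median_age = {
    station: median(ages)
    for station, ages in ages_by_station.items()
    if len(ages) >= MIN_TRIPS
}

for station, age in sorted(median_age.items(), key=lambda item: item[1]):
    print(f"{station}: median age {age}")
```

Swapping the departure station for the arrival station in the input gives the second map mentioned above.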

Median Age per Departure Station

 

 

Rides by Hour of the Day

Finally, when analyzing how the days of the week impacted biking behaviours, I was surprised to see that Citi Bike users didn’t ride for a longer period of time during the weekend: the median trip duration is 19 minutes for each day of the week.
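For readers who want to run the same check, here is a minimal Python sketch of a median trip duration per day of the week. The six trips and the timestamp layout are assumptions for illustration, so the output says nothing about the real data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed timestamp layout
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Toy (start time, duration in seconds) pairs.
trips = [
    ("5/2/2016 08:05:00", 540),   # Monday
    ("5/2/2016 17:30:00", 900),   # Monday
    ("5/4/2016 12:00:00", 660),   # Wednesday
    ("5/7/2016 13:15:00", 1200),  # Saturday
    ("5/7/2016 15:40:00", 600),   # Saturday
    ("5/8/2016 11:00:00", 780),   # Sunday
]

# Group durations (converted to minutes) by weekday index, 0 = Monday.
minutes_by_day = defaultdict(list)
for raw_start, seconds in trips:
    day = datetime.strptime(raw_start, TIME_FORMAT).weekday()
    minutes_by_day[day].append(seconds / 60)

median_minutes = {
    DAY_NAMES[day]: median(values)
    for day, values in sorted(minutes_by_day.items())
}

for day_name, minutes in median_minutes.items():
    print(f"{day_name}: {minutes:.1f} min")
```

The median is used rather than the mean because a handful of very long rentals would otherwise dominate the result.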

Trip Duration per Gender and Age Group

 

However, as illustrated below, there is a difference in peak hours. During the weekend, riders hop on a bike later in the day, with most of the rides happening midday. During the weekdays, the peak hours are around 8-9am and 5-7pm, when riders commute to and from work.
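The comparison boils down to counting rides per hour separately for weekdays and weekends. Here is a Python sketch on toy start times. The timestamps and their layout are assumptions, chosen so that the two peaks differ.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed timestamp layout

# Toy trip start times: May 2-3, 2016 are weekdays, May 7-8 a weekend.
start_times = [
    "5/2/2016 08:10:00", "5/2/2016 08:45:00", "5/2/2016 17:20:00",
    "5/3/2016 08:30:00", "5/3/2016 18:05:00",
    "5/7/2016 12:15:00", "5/7/2016 12:50:00", "5/7/2016 15:00:00",
    "5/8/2016 12:30:00", "5/8/2016 10:05:00",
]

# One hour-of-day histogram per kind of day.
rides = {"weekday": Counter(), "weekend": Counter()}
for raw_start in start_times:
    started = datetime.strptime(raw_start, TIME_FORMAT)
    kind = "weekend" if started.weekday() >= 5 else "weekday"  # 5 = Sat, 6 = Sun
    rides[kind][started.hour] += 1

# Busiest hour for each kind of day.
peak_hour = {kind: counts.most_common(1)[0][0] for kind, counts in rides.items()}

for kind, hour in peak_hour.items():
    print(f"{kind}: busiest hour starts at {hour}:00")
```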

 

Number of Riders per Hour of the Day (weekdays vs. weekends)

 

 

The App

Where does this analysis leave us?

  • The day of the week and the hour of the day are meaningful variables which we need to take into account in the app.
  • Most of the users are between 30 and 45 years old. This means that the age groups 25-34 and 35-44 won’t be granular enough when app users need to filter their search. We will let them filter by exact age instead.

 

The Citi Tinder app in a few words and screenshots.

There are 3 steps to the app:

  • The “when”: find the times and days when your ideal mate is more likely to ride.

[Screenshot: step 1 of the app, the “when”]

 

  • The “where”: once you know the best times and days, filter the locations by day of the week, time of the day, gender and age. You can also select whether you want to spot where they arrive or depart.

[Screenshot: step 2 of the app, the “where”]

 

  • The “how”: the final step is to grab a Citi Bike and get to those hot spots. The app calls the Google Maps API to show the directions, with a little extra: you can compare the time Google estimates for the ride between two stations with the average time it took Citi Bike users. I believe the latter is more accurate because it factors in the time of the day and day of the week (which the app lets you filter).

[Screenshot: step 3 of the app, the “how”]
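The comparison in the third step can be sketched as follows in Python. This is not the app’s code: the trips are toy data, the station names are placeholders, and the Google estimate is a hard-coded hypothetical number rather than the result of a real API call.

```python
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"  # assumed timestamp layout

# Toy trips: (start station, end station, start time, duration in seconds).
trips = [
    ("Station A", "Station B", "5/2/2016 08:10:00", 600),
    ("Station A", "Station B", "5/3/2016 08:40:00", 720),
    ("Station A", "Station B", "5/7/2016 13:00:00", 900),
    ("Station B", "Station A", "5/2/2016 08:20:00", 660),
]


def average_minutes(trips, origin, destination, weekdays, hour):
    """Mean trip time in minutes for one station pair, filtered by day and hour.

    Returns None when no trip matches the filters.
    """
    minutes = []
    for start_station, end_station, raw_start, seconds in trips:
        started = datetime.strptime(raw_start, TIME_FORMAT)
        if (start_station == origin and end_station == destination
                and started.weekday() in weekdays and started.hour == hour):
            minutes.append(seconds / 60)
    return mean(minutes) if minutes else None


WORKDAYS = {0, 1, 2, 3, 4}  # Monday to Friday
rider_minutes = average_minutes(trips, "Station A", "Station B", WORKDAYS, hour=8)
google_minutes = 9.0  # hypothetical estimate, stands in for the API response
gap = rider_minutes - google_minutes

print(f"Riders: {rider_minutes:.1f} min, Google: {google_minutes:.1f} min, "
      f"gap: {gap:+.1f} min")
```

Returning None for an empty match matters in practice: many station pairs have no trips at a given hour, and the app should fall back to the routing estimate alone in that case.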

 

Although screenshots are nice, the interactive app is better so head to the first step of the app to get started!

 

 

Would Have, Should Have, Could Have

This is the first of the four projects from the NYC Data Science Academy Data Science Bootcamp program. With a two-week timeline and only 24 hours in a day, something’s gotta give… Below is a quick list of the analyses I could have, would have and should have done if given more time and data:

  • Limited scope: I only took the data from May 2016. However, I expect the Citi Bike riders to behave differently depending on the season, temperature, etc. Besides, the bigger the sample size, the more reliable the insights.
  • Missing data: I could not find data on the docks available per station that could be scraped from the Citi Bike website. The map would have been more complete if the availability of docks had been displayed.
  • Limited number of variables: I would have liked to have more demographic data (aside from gender and age); a dating app with only age and gender as filters is restrictive…
  • Incomplete filters: With more time, I’d have added a ‘speed’ filter in the 2nd step of the app (the ‘where’ part) to enable the hard-core cyclists to find the fastest riders…
  • Sub-optimal visualization: I am aware that the map on the introduction page (with the dots displaying the median age per station) is hard to read. With more time, I’d have used polygons instead to group stations by neighbourhood.
  • Finally, I would have liked to track unique users. Although users don’t have a unique identifier in the Citi Bike dataset, I could have identified unique users by looking at their gender, age, zip and usual start/end stations.
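The last idea could be sketched as a simple “fingerprint” count in Python. This is a heuristic I did not implement, shown on toy data: the fingerprint fields, the placeholder station names and the repeat threshold are all assumptions, and zip is left out of the toy records. Two different riders with the same gender, birth year and route would be merged, so the result is at best a set of likely regulars.

```python
from collections import Counter

# Toy trips: (gender, birth year, start station, end station).
trips = [
    ("female", 1985, "Station A", "Station B"),
    ("female", 1985, "Station A", "Station B"),
    ("female", 1985, "Station B", "Station A"),
    ("male", 1990, "Station C", "Station D"),
    ("female", 1985, "Station A", "Station B"),
    ("male", 1972, "Station A", "Station B"),
]

MIN_REPEATS = 3  # assumed: how often a fingerprint must recur to count as a regular

# Each distinct tuple is treated as one candidate rider on one habitual route.
fingerprints = Counter(trips)
likely_regulars = {
    fingerprint: count
    for fingerprint, count in fingerprints.items()
    if count >= MIN_REPEATS
}

for (gender, birth_year, start, end), count in likely_regulars.items():
    print(f"{gender}, born {birth_year}, {start} -> {end}: {count} trips")
```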

The post Using NYC Citi Bike Data to Help Bike Enthusiasts Find their Mate appeared first on NYC Data Science Academy Blog.
