Visualizing Hubway Trips in Boston

Posted on February 21, 2015 by Benevolent Planner in R bloggers

Most Popular Hubway Stations (in order):

Post Office Sq. – located in the heart of the financial district.
Charles St. & Cambridge – the first Hubway stop after crossing from Cambridge over Longfellow Bridge.
Tremont St & West – East side of the Boston Common
South Station
Cross St. & Hanover – entrance to the North End coming from the financial district.
Boylston St & Berkeley – between Copley and the Common.
Stuart St & Charles – Theatre district, just south of the Common.
Boylston & Fairfield – located in front of the Boylston Street entrance to the Pru.
The Esplanade (Beacon St & Arlington) – this stop is on the north end of the Esplanade running along the Charles.
Chinatown Gate Plaza
Prudential at Belvidere
Boston Public Library on Boylston
Boston Aquarium
Newbury Street & Hereford
Government Center

I received great feedback from the last post visualizing crime in Boston, so I’m continuing the Boston-related content.

Data

I used trip-level data, which Hubway has made available here. The data is de-identified, although some bicyclist information is provided – e.g. gender and address zip code of registered riders (there are over 4 times more trips taken by males than females).

I initially wanted to visualize the trips on a city-level map, but dropped the idea after seeing a great post on the arcdiagram package in R. The Hubway system is basically a network where the nodes are bike stations and the edges are trips from one station to another. Arc diagrams are a cool way to visualize networks.

Arc Diagram Interpretation

The arcs represent network edges, or trip routes.
The thickness of each arc is proportional to the popularity of the route, as measured by the number of trips taken on that route.
The size of each node is proportional to the popularity of the node, as measured by “degree.” The degree of a node is defined as the number of edges connected to that node.
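
As a minimal sketch of these two measures, here is a base R example on a toy edge list. The station names "A", "B" and "C" and the `trips`, `route_counts` and `node_degree` objects are made up for illustration and are not part of the Hubway data. Each row counts as one edge, so a route taken twice contributes two edges to the degree of both of its stations.

```r
# Toy edge list: each row is one trip (hypothetical station names)
trips <- data.frame(
  start = c("A", "A", "B", "A", "C"),
  end   = c("B", "B", "C", "C", "A"),
  stringsAsFactors = FALSE
)

# Arc weight: number of trips taken on each directed route
route_counts <- aggregate(n ~ start + end,
                          data = cbind(trips, n = 1),
                          FUN = sum)

# Node degree: number of edges touching each station (arrivals plus
# departures), counting every trip as its own edge
node_degree <- table(c(trips$start, trips$end))

print(route_counts)
print(node_degree)
```

In this toy network the route from A to B carries two trips, so it would get the thickest arc, and station A has the highest degree (4), so it would get the largest node.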

Data Cleaning

Some of the data was questionable. There were many trips that began and ended at the same station with a trip duration of 2 minutes. There were also trips that lasted over 10 hours.

I dropped the trips with very short durations (below the 1st duration percentile) and very long durations (above the 99th duration percentile).
There were many trips that began and ended at the same station that were not questionable. I removed these because they were cluttering the arc diagram without adding much value.
I only used data from bicyclists in certain zip codes (see zip_code vector in the code below).
Since the dataset was so massive, I only plotted a random sample of 1000 trips.
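
The cleaning steps above can be sketched on synthetic data as follows. This is only an illustration: the `toy` data frame is randomly generated, `clean_trips` is a hypothetical helper name, and the column names (`duration`, `strt_statn`, `end_statn`, `zip_code`) are assumed to match the Hubway trips file.

```r
# Hypothetical helper wrapping the cleaning steps described above
clean_trips <- function(trips, zips, lower = 0.01, upper = 0.99) {
  # drop negative durations
  trips <- trips[which(trips$duration >= 0), ]
  # drop very short trips and trips that start and end at the same station
  lo <- quantile(trips$duration, lower, names = FALSE)
  trips <- trips[which(trips$duration >= lo &
                       trips$strt_statn != trips$end_statn), ]
  # drop very long trips, using the percentile of the remaining trips
  hi <- quantile(trips$duration, upper, names = FALSE)
  trips <- trips[which(trips$duration <= hi), ]
  # keep only riders whose zip code is in the chosen set
  trips[which(trips$zip_code %in% zips), ]
}

# Synthetic trips, including a negative, a tiny and a huge duration
set.seed(42)
n <- 2000
toy <- data.frame(
  duration   = c(round(runif(n - 3, 60, 3600)), -10, 5, 90000),
  strt_statn = sample(1:10, n, replace = TRUE),
  end_statn  = sample(1:10, n, replace = TRUE),
  zip_code   = sample(c("02116", "02111", "99999"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

cleaned <- clean_trips(toy, zips = c("02116", "02111"))

# Random sample of at most 1000 trips for plotting
samp <- cleaned[sample(nrow(cleaned), min(1000, nrow(cleaned))), ]
```

Because the upper cutoff is computed after the short trips are removed, the order of the steps matters slightly. The sketch follows the same order as the list above.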

Comments on Arcdiagram Package

My one issue with the arcdiagram package is that there is no workaround for very small node labels.
Some arc diagrams have arcs both below and above the x-axis. This package doesn’t seem to include this option.

 
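install.packages('devtools')
# devtools must be loaded before install_github can be called
library('devtools')
install_github('gastonstat/arcdiagram')
library(arcdiagram)
# igraph provides graph.edgelist, get.edgelist, degree and clusters
library(igraph)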

input='.../Hubway/hubway_2011_07_through_2013_11'
setwd(input)

zip_code=c('02116','02111','02110','02114','02113','02109')

stations=read.csv('hubway_stations.csv')
trips=read.csv('hubway_trips.csv')

# clean data - there are negative values as well as outrageously huge values
# negative values 
trips_2=trips[which(trips$duration>=0),]

# remove very short trips (below the 1st duration percentile, e.g. clock resets)
# and trips that start and end at the same station
p1=as.vector(quantile(trips_2$duration,c(.01)))
trips_3=trips_2[which(trips_2$duration>=p1 & trips_2$strt_statn!=trips_2$end_statn),]
# remove outrageously long trips: anything above the 99th percentile
p99=as.vector(quantile(trips_3$duration,c(.99)))
trips_4=trips_3[which(trips_3$duration<=p99),]

# subset to trips by registered riders whose zip code is in zip_code
trips_5=trips_4[which(trips_4$zip_code %in% zip_code),]

set.seed(1000)
data=cbind(trips_5$strt_statn,trips_5$end_statn)
samp_n=seq(1,length(data)/2)
samp_set=sample(samp_n,1000,replace=FALSE)
samp=data.frame(data[samp_set,])

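# look up station names by id. match() keeps the sampled row order,
# whereas merge() re-sorts rows and would misalign start and end stations
names(samp)=c('id','id2')
m=data.frame(station=stations$station[match(samp$id,stations$id)])
m2=data.frame(station=stations$station[match(samp$id2,stations$id)])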

# create sample matrix
samp_w_labels=data.frame(m[,'station'],m2[,'station'])
names(samp_w_labels)=c('start','end')
samp_mat=as.matrix(samp_w_labels)

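# keep only repeat occurrences of a route (duplicated() flags every
# occurrence after the first), then delete trips that end where they start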
con=paste(samp_mat[,1],samp_mat[,2],sep='')
dup=duplicated(con)
dupp=samp_mat[dup,]
dupp=dupp[which(dupp[,1]!=dupp[,2]),]

# create weights for arcs: each arc's weight is the frequency of trips
# on its route, looked up per row so the weights stay aligned with dupp
clist=paste(dupp[,1],dupp[,2],sep='')
ctab=table(clist)
c_m=data.frame(clist=clist,Freq=as.vector(ctab[clist]))

# create network structure
g=graph.edgelist(dupp, directed=TRUE)
edges=get.edgelist(g)
deg=degree(g)
clus=clusters(g)

# create colors
pal=colorRampPalette(c('darkorchid1','darkorchid4'),bias=5)
colors=pal(length(clus$membership))
node_cols=colors[clus$membership]

# generate arcplot
arcplot(dupp, 
 lwd.arcs =.2*c_m$Freq,cex.nodes=.07*deg,
 col.nodes='black',bg.nodes=node_cols, pch.nodes = 21,
 ordering=order(deg,decreasing=TRUE),
 cex.labels=.18,
 horizontal=TRUE)