The Hour of Hell of Every Morning – Commute Analysis, April to October 2012
[This article was first published on everyday analytics, and kindly contributed to R-bloggers.]
Introduction
So a little while ago I quit my job. Well, actually, that sounds really negative. I’m told that when you are discussing large changes in your life, like finding a new career, relationship, or brand of diet soda, it’s important to frame things positively.
So let me rephrase that – I’ve left the job I previously held to pursue other directions. Why? Because I have to do what I love. I have to move forward. And I have to work with data. It’s what I want, what I’m good at, and what I was meant to do.
So onward and upward to bigger, brighter and better things.
But I digress. The point is that my morning commute has changed.
Background
I really enjoyed this old post at Omninerd, about commute tracking activities and an attempt to use some data analysis to beat traffic mathematically. So I thought, hey, I’m commuting every day, and there’s a lot of data being generated there – why not collect some of it and analyze it too? The difference here is that I was commuting by public transit instead of driving. So yes, the title is a bit dramatic (it’s an hour of hell in traffic for some people, but I actually quite enjoy taking the TTC).
When I started collecting the data, I intended to time my commute both to and from work. Unfortunately, I discovered that, with a busy personal and professional life outside of the 9 to 5, there was little point in tracking my commute at the end of the work day: I was very rarely going straight home, so I was ending up with a very sparse data set. I suppose this was one point of insight into my life before even doing any analysis in this experiment.
So I just collected data on the way to work in the morning.
Without going into the personal details of my life in depth, my commute went something like this:
- walk from home to station
- take streetcar from station west to next station
- take subway north to station near place of work
- walk from subway platform to place of work
Punching the route into Google Maps, it tells me the entire distance is 11.5 km. As we’ll see from the data, my travel time was pretty consistent and on average took about 40 minutes every morning (I knew this even before beginning the data collection). So my speed with all three modes of transportation averages out to ~17.25 km/hr. That probably doesn’t seem that fast, but if you’ve ever driven in Toronto traffic, trust me, it is.
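The average speed follows directly from those two figures. As a quick check in R (the variable names are just for illustration):

```r
# Figures quoted in the text above
distance_km  <- 11.5   # route length according to Google Maps
duration_min <- 40     # typical door-to-door time in minutes

# Convert minutes to hours, then divide distance by time
speed_kmh <- distance_km / (duration_min / 60)
speed_kmh
```

This gives 17.25 km/hr, matching the figure above.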
In terms of the methodology for data collection, I simply used the stopwatch on my phone, starting it when I left my doorstep and stopping it when reaching the revolving doors by the elevators at work.
So all told, I kept track of the date, starting time and commute length (and therefore end time). As with many things in life, hindsight is 20/20, and looking back I realized I could have collected the data in a more detailed fashion by breaking it up for each leg of the journey.
This occurred to me towards the end of the experiment, and so I did this for a day. Though you can’t do much data analysis with just this one day, it gives a general idea of the typical structure of my commute:
[Figure: the legs of one morning’s commute] There should be another line coming from the last circle, but it looks better this way.
Alternatively the visualization can be made more informative by leaving the circles sized by time and changing the curve lengths to represent the distance of each leg travelled. Then the distance for the waiting periods is zero and the graphic looks quite different:
[Figure: the same commute, with curve lengths showing the distance of each leg] I really didn’t think the walk from my house was that long in comparison to the streetcar. Surprising.
Cool, no? And there’s an infinite number of other ways you could go about representing that data, but we’re getting into the realm of information design here. So let’s have a look at the data set.
Analysis
So first and foremost, we ask the question, is there a relationship between the starting time of my morning commute and the length of that commute? That is to say, does how early I leave to go to work in the morning impact how long it takes me to get to work, regardless of which day it is?
Before even looking at the data, this is an interesting question to consider. You could assume (I would venture to say you know for a fact) that departure time is an important factor for a driving commute, since the speed of a morning drive is directly affected by congestion, which depends on the number of people commuting at any given time.
However, I was taking public transit and I’m fairly certain congestion doesn’t affect it as much. Plus I headed in the opposite direction of most (away from the downtown core). So is there a relationship here?
Looking at this graph we can see a couple things. First of all, there doesn’t appear to be a salient relationship between the commute start time and duration. Some economists are perfectly happy to run a regression and slam a trend line through a big cloud of data points, but I’m not going to do that here. Maybe if there were a lot of points I’d consider it.
The other reason I’m not going to do that is that you can see from looking at this graph that the data are unevenly distributed. There are more large values and outliers in the middle, but that’s only because the majority of my commutes started between ~8:15 and ~9:20, so that’s where most of the data lie.
You can see this if we look at the distribution of starting hour:
I’ve included a density plot as well so I don’t have to worry about bin-sizing issues, though it should be noted that in this case it gives the impression of continuity when there isn’t any. It does help illustrate the earlier point, however, about the distribution of starting times. If I were a statistician (which I’m not) I would comment on the distribution being symmetrical (i.e. not skewed) and on its kurtosis.
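For anyone who does want numbers to go with that comment, here is a minimal sketch of moment-based sample skewness and excess kurtosis in base R. The function names are my own, and the vector at the bottom is a made-up placeholder, not the commute data.

```r
# Moment-based sample skewness: roughly 0 for a symmetric distribution
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: roughly 0 for a normal distribution,
# negative for flatter shapes, positive for heavier tails
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# Placeholder values only (NOT the real starting hours)
x <- c(8.0, 8.5, 9.0, 9.5, 10.0)
skew_x <- sample_skewness(x)
kurt_x <- sample_excess_kurtosis(x)
c(skewness = skew_x, excess_kurtosis = kurt_x)
```

With the real data, the same functions could be applied to the starting hour and duration columns (called `starthour` and `time` in the ANOVA calls later in this post).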
The distribution of commute duration, on the other hand, is skewed:
I didn’t have any morning where the combination of my walking and the TTC could get me to North York in less than a half hour.
Next we look at commute duration and starting hour over time. The black line is a 5-day moving average.
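A moving average like this can be sketched in base R with `stats::filter`. This assumes a trailing window over the last five recorded commutes (rather than a centred window or five calendar days), which may not be exactly how the line in the plot was computed. The function name and the durations below are placeholders.

```r
# Trailing moving average over the last k observations.
# The first k - 1 entries are NA because the window is not yet full.
moving_average <- function(x, k = 5) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Placeholder durations in minutes (NOT the real commute data)
duration <- c(38, 41, 40, 45, 39, 42, 37)
ma <- moving_average(duration)
ma
```

Setting `sides = 2` instead would give a centred window, which lines up the peaks better but leaves gaps at both ends of the series.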
Other than several days near the beginning of the experiment in which I left for work extra early, the average start time for the morning trip did not change greatly over the course of the months. It looks like there might be some kind of pattern in the commute duration, though, with those recurring peaks?
We can investigate if this is the case by comparing the commute duration per day of week:
There seems to be slightly more variation in the commute duration on Monday, and it takes a bit longer on Thursdays? But look at the y-axis. These aren’t big differences; we’re talking about a matter of several minutes here. The breakdown for when I leave each day isn’t particularly earth-shattering either:
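The numbers behind per-weekday comparisons like these can be tabulated with `aggregate`. This is only a sketch: the column names `weekday` and `time` are taken from the ANOVA calls further down, and the rows are made-up placeholders rather than the real data.

```r
# Placeholder rows only (NOT the real data); `time` is duration in minutes
commute <- data.frame(
  weekday = factor(c("Mon", "Mon", "Tue", "Tue", "Wed", "Wed"),
                   levels = c("Mon", "Tue", "Wed", "Thu", "Fri")),
  time    = c(38, 44, 40, 41, 39, 43)
)

# Mean and standard deviation of commute duration for each weekday.
# The result has one row per weekday present in the data, and `time`
# becomes a matrix column with "mean" and "sd" columns.
by_day <- aggregate(time ~ weekday, data = commute,
                    FUN = function(x) c(mean = mean(x), sd = sd(x)))
by_day
```

Swapping `time` for `starthour` in the formula would give the same breakdown for departure times.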
Normally, I’d leave it at that, but are these differences significant? We can do a one-way ANOVA and check:
> # One-way ANOVA: do starting hour and commute duration differ by weekday?
> aov1 <- aov(starthour ~ weekday, data = commute)
> aov2 <- aov(time ~ weekday, data = commute)
> summary(aov1)
           Df Sum Sq Mean Sq F value Pr(>F)
weekday     4  0.456  0.1140     0.7  0.593
Residuals 118 19.212  0.1628               
> summary(aov2)
           Df Sum Sq Mean Sq F value Pr(>F)
weekday     4   86.4   21.59   1.296  0.275
Residuals 118 1965.4   16.66               
Neither p-value (0.593 for starting hour, 0.275 for commute duration) comes anywhere near the usual 0.05 threshold, so the differences between weekdays aren’t statistically significant. That is to say, on average, it took about the same amount of time to get to work each day of the week, and I left at around the same time.
This is in stark contrast to what people talk about around the water cooler when they’re discussing their commute. I’ve never done any data analysis on a morning drive myself (or seen any, other than the post at Omninerd), but there are likely more clearly defined weekly patterns to your average driving commute than what we saw here with public transit.
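As a sanity check on the tables above, the F statistic is just the ratio of the two mean squares, and the p-value is the upper tail of the F distribution with 4 and 118 degrees of freedom. The sketch below uses the rounded values printed in the output, so it should land close to the reported figures but only to within rounding. The variable names are illustrative.

```r
# Values copied from the printed ANOVA tables (already rounded)
df_between <- 4
df_resid   <- 118

# Starting hour by weekday: F = mean square (weekday) / mean square (residuals)
f_start <- 0.1140 / 0.1628
p_start <- pf(f_start, df_between, df_resid, lower.tail = FALSE)

# Commute duration by weekday
f_time <- 21.59 / 16.66
p_time <- pf(f_time, df_between, df_resid, lower.tail = FALSE)

round(c(f_start = f_start, p_start = p_start,
        f_time = f_time, p_time = p_time), 3)
```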
Conclusions
There are a couple of ways you can look at this.
You could say there were no earth-shattering conclusions as a result of the experiment.
Or you could say that, other than the occasional outlier (of the “Attention All Passengers on the Yonge-University-Spadina line” variety) the TTC is remarkably consistent over the course of the week, as is my average departure time (which is astounding given my sleeping patterns).
It’s all about perspective. So onward and upward, until next time.
		
            
Resources
How to Beat Traffic Mathematically
TTC Trip Planner
myTTC (independently built by an acquaintance of mine – check out more of his cool work at branigan.ca)
FlowingData: Commute times in your area, mapped [US only]
To leave a comment for the author, please follow the link and comment on their blog:  everyday analytics.