RObservations #38: Visualizing Average Delay Times On TTC Subway Stations

[This article was first published on r – bensstats, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Any Torontonian who has commuted regularly on the TTC has probably experienced their fair share of delays on the subway. Having experienced a few recently I was inspired to visualize the average delay times across all stops on the subway. What are the stations with the longest delays on average this past year? Could we make a nice visual with it?

After working on it, I managed to come up with this:

In this blog I share how I utilized the opendatatoronto package and a slew of others to visualize average delay times on the TTC subway. While there are alot of particular details to work out, I’m going to try to keep things high level so that the key steps can be understood and the specifics can be seen in the code.

If you are too impatient and just want to get the code, check it out on my Github here. Be sure to star the repository also!

Getting the data

By using the opendatatoronto package’s search_packages() function I was able to locate most of the data that I needed to make the visuals. I was able to get:

  1. The TTC subway polygons.
  2. Toronto’s neighborhood polygons.
  3. 2022 delay times on TTC subway stations.

Visualizing Subway Routes

The delay times data is great but it doesn’t give particular geo-coordinates of each stop which we are interested in plotting. However, we could get a basic map already with the routes plotted over the neighborhood polygons. Plotting this data together with the ggplot and ggdark packages makes for the start of a nice visual which shows the reach of the TTC subway within the city of Toronto. The colors for the lines I chose based on the color theme for the current subway map.

library(tidyverse)
library(opendatatoronto)
library(ggdark)

ttcSubwayShapefiles<- list_package_resources("c01c6d71-de1f-493d-91ba-364ce64884ac") %>% 
                      get_resource()

torontoMap <-get_resource("6cc2dbf4-96d3-4faa-8bcf-4ad3f688a1dc")

ggplot()+
  geom_sf(data=ttcSubwayShapefiles,mapping=aes(color=ROUTE_NAME),
          lwd=1.4)+
  geom_sf(data=torontoMap,
          fill="#777777",
          alpha=0.1,
          lwd=0.7)+
  dark_theme_minimal()+
  theme(legend.position="right",
        legend.title= element_blank(),
        axis.text = element_blank())+
  scale_color_manual(values=c("#FECA0A",
                              "#00A54F",
                              "#009BDE",
                              "#B30174"))

Plotting Subway Stops.

While the 2022 delay times dataset has information on all the stops of the TTC, it is not cleaned and formatted in the best manner for geo-coding (there are some spelling errors and some formatting issues). Since Wikipedia has all the subway stop names well formatted, I took the opportunity to use the rvest to get the table of listed subway stations and geocode them with the tidygeocoder package. I did have to make some fixes with some of the geocodes as the points didn’t line up directly with the route lines. I reason that this could be due to the distance from the subway entrance to the subway line itself.

library(tidygeocoder)
library(rvest)
subwayStations<-read_html("https://en.wikipedia.org/wiki/List_of_Toronto_subway_stations")%>%
                html_elements(xpath='//*[@id="mw-content-text"]/div[1]/table[2]')%>%
                html_table() %>% 
                .[[1]] %>% 
                mutate(fullAddress = paste0(Station," Station, Toronto, Canada")) %>% 
                geocode(address=fullAddress) %>% 
                # Doing some spot fixes on the geocoding to line up with the map. 
                mutate(lat = case_when(Station == "St. Patrick" ~ 43.6548308,
                                       Station == "Highway 407" ~ 43.783215,
                                       Station == "Vaughan" ~ 43.7942439,
                                       Station == "Ossington" ~43.6623565,
                                       TRUE ~ lat),
                       long = case_when(Station == "St. Patrick" ~ -79.3883485,
                                        Station == "Highway 407" ~ -79.5237479,
                                        Station == "Vaughan" ~ -79.5274867,
                                        Station == "Ossington" ~ -79.4263675,
                                        TRUE ~ long))

With this data I was able to come up with this visual:

  ggplot()+
   geom_sf(data=torontoMap,
          fill="#777777",
          alpha=0.1,lwd=0.7)+
  geom_sf(data=ttcSubwayShapefiles, mapping=aes(color=ROUTE_NAME),
          lwd=1.4)+
  geom_point(data=subwayStations,
             mapping=aes(x=long,y=lat),fill="white")+
  geom_text(data=subwayStations,
             mapping=aes(x=long,y=lat,label=Station),
            fill="white",
            vjust= 1.5,
            hjust=1.2,
            angle=45)+
  ggtitle("Subway Stations")+
  dark_theme_minimal()+
  theme(legend.position="bottom",
        legend.title= element_blank(),
        axis.text = element_blank())+
  scale_color_manual(values=c("#FECA0A",
                              "#00A54F",
                              "#009BDE",
                              "#B30174"))

Aggregating Delay Times for Each Subway Stop

The delay time data needs to be cleaned and filtered. For the scope of this visual we are interested in delays affecting individual stops, so multi-station delays needed to be filtered out. This could easily be filtered out by choosing only delays which “STATION” in the “Station” field. Combining those stops with the geo-coordinated station names and aggregating the data by average minimum delay we are able to visualize the delay times where the size of the points for each stop corresponds to average delay time.

delayTimes<-get_resource("441143ca-8194-44ce-a954-19f8141817c7") %>% 
  filter(Station %>% grepl("STATION",.)) %>% 
  mutate(Station_new = Station %>% 
           str_remove_all("( MC )|( BD )|( SRT )| TO.*|\\(.*|( YUS )|( CTR )|(LA$)") %>% 
           str_replace_all("STATION"," STATION") %>% 
           str_replace_all("BLOOR","BLOOR–YONGE") %>%
           str_replace_all("ST ","ST. ") %>% 
           str_replace_all("\\s+"," ") %>% 
           trimws() %>% 
           str_to_title() %>% 
           str_replace_all("East\\.","East") %>% 
           str_replace_all("West\\.","West") %>%
           str_replace_all("Bathurst.","Bathurst") %>% 
           str_replace_all("North York","North York Centre") %>%
           str_replace_all("Mccowan","McCowan") %>%
           str_remove_all(" Station") %>% 
           str_replace_all("^Yonge$","Bloor–Yonge") %>% str_replace_all("Sheppard-Yonge","Sheppard–Yonge") %>% str_replace_all("Bay Lower","Bay") %>% str_replace_all("^Sheppard$","Sheppard-West") %>% str_replace_all("Downview","Downsview")) %>% left_join(subwayStations, by=c("Station_new"="Station")) %>% group_by(Station_new) %>% summarize(avgDelay=mean(`Min Delay`)) %>% left_join(subwayStations,by=c("Station_new"="Station")) 

Doing this and adding some color to the top 5 longest average delay times we’re able to get the following visual. Since the text was very crowded, only the stops which were in the top 5 are listed on the map.

library(ggrepel)

ggplot()+
   geom_sf(data=torontoMap,
          fill="#777777",
          alpha=0.1,lwd=0.7)+
  geom_sf(data=ttcSubwayShapefiles, mapping=aes(color=ROUTE_NAME),
          lwd=1.4)+
  geom_point(data=
delayTimes %>% mutate(colorLab= ifelse(Station_new %in% (delayTimes %>% 
                             slice_max(avgDelay,n=5) %>% 
                             pull(Station_new)),"Yes","No")),
             mapping=aes(x=long,y=lat,size=avgDelay),
             color=ifelse(delayTimes %>% mutate(colorLab= ifelse(Station_new %in% (delayTimes %>% 
                             slice_max(avgDelay,n=5) %>% 
                             pull(Station_new)),"Yes","No")) %>% 
                             pull(colorLab)=="Yes","red","white"))+
             geom_text_repel(data=delayTimes %>% slice_max(avgDelay,n=5),
                             mapping=aes(x=long,y=lat,
                             label= paste0(Station_new,
                                           " Station, \n",
                                           round(avgDelay,1),
                                           " mins.")),
            vjust=case_when(delayTimes %>% slice_max(avgDelay,n=5) %>% pull(Station_new)=="Jane"~1.3,
                            delayTimes %>% slice_max(avgDelay,n=5) %>% pull(Station_new)=="Dupont"~-2,
                            delayTimes %>% slice_max(avgDelay,n=5) %>% pull(Station_new)=="Runnymede"~ -1,
                            TRUE ~ -0.5),
            colour="white")+
  ggtitle("TTC Subway Average Delay Times 2022")+
  dark_theme_minimal()+
  theme(legend.position="right",
        axis.text = element_blank(),
        axis.title = element_blank(),
        plot.title=element_text(size=20,
                                hjust = 0.5,
                                vjust=-2))+
  scale_color_manual(values=c("#FECA0A",
                              "#00A54F",
                              "#009BDE",
                              "#B30174"),
                     name=element_blank())+
 scale_size_continuous(name="Avg. Delay Times (mins.)",range = c(1, 10))

Incorperating The “TTC Font”

After doing some research on the Toronto subway font), I learned that the font is closely related to the Futura font family). While I initially used the Futura font, I found that it is possible to get the Toronto Subway font if you do a little research online. To have the font registered within R, we are going to need to use the extrafont package.

Be sure to run font_import() after you have the font installed on your machine (you only need to do this once) and be sure to run loadfonts() every for every session you run this code.

# Adjust point coordinate system. 
# figure out points. 
library(extrafont)
# Run Once
# font_import()
loadfonts(device = "win")

With that taken care of we have a nice looking visual:

ggplot()+
   geom_sf(data=torontoMap,
          fill="#777777",
          alpha=0.1,lwd=0.7)+
  geom_sf(data=ttcSubwayShapefiles, mapping=aes(color=ROUTE_NAME),
          lwd=1.4)+
  geom_point(data=
delayTimes %>% mutate(colorLab= ifelse(Station_new %in% (delayTimes %>% 
                             slice_max(avgDelay,n=5) %>% 
                             pull(Station_new)),"Yes","No")),
             mapping=aes(x=long,y=lat,size=avgDelay),
             color=ifelse(delayTimes %>% mutate(colorLab= ifelse(Station_new %in% (delayTimes %>% 
                             slice_max(avgDelay,n=5) %>% 
                             pull(Station_new)),"Yes","No")) %>% 
                             pull(colorLab)=="Yes","red","white"))+
  ggrepel::geom_text_repel(data=delayTimes %>% slice_max(avgDelay,n=5),
             mapping=aes(x=long,y=lat,
                         label= paste0(Station_new,
                                       " Station, \n",round(avgDelay,1)," mins.")),
            vjust=case_when(delayTimes %>% slice_max(avgDelay,n=5) %>% pull(Station_new)=="Jane"~1.3,
                            delayTimes %>% slice_max(avgDelay,n=5) %>% pull(Station_new)=="Dupont"~-2,
                            delayTimes %>% slice_max(avgDelay,n=5) %>% pull(Station_new)=="Runnymede"~ -1,
                            TRUE ~ -0.5),
            family="TorontoSubwayW01-Regular",
            colour="white")+
  ggtitle("TTC Subway Average Delay Times 2022")+
  dark_theme_minimal()+
  theme(legend.position="right",
        axis.text = element_blank(),
        axis.title = element_blank(),
        text=element_text(family="TorontoSubwayW01-Regular"),
        plot.title=element_text(size=20,
                                hjust = 0.5,
                                vjust=-2))+
  scale_color_manual(values=c("#FECA0A",
                              "#00A54F",
                              "#009BDE",
                              "#B30174"),
                     name="")+
  scale_size_continuous(name="Avg. Delay Times (mins.)",range = c(1, 10))

Conclusion

From my experience, the point of data visualization is to convey insight and tell a story. It should be easily to look at and tell a story on its own. “A picture is worth a thousand words”. I hope that making this visual did that. If you want to get the entire script for the visual you can copy it directly from my Github here.

Thank you for reading!

Want to see more of my content?

Be sure to subscribe and never miss an update!

To leave a comment for the author, please follow the link and comment on their blog: r – bensstats.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)