Visualization of NYC bus delays with R

[This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    Categories

    1. Advanced Modeling

    Tags

    1. Data Manipulation
    2. Data Visualisation
    3. ggplot2
    4. R Programming

    The motivation for choosing this data set to explore is straightforward: I wanted a real-world data set for which additional meta data would be easy to find, making NYC data perfect; I wanted something which did not require specialised domain knowledge, and so buses are a good choice; and I wanted a data set that might produce some interesting insights. As an additional bonus, this data set had very few downloads or kernels on Kaggle, so it seemed like mostly untrodden ground.

    Data loading and initial exporation

    Data are available from the NYC open data site. Linked is the main data page, which contains the bus breakdown ID number, as well as the route number, the schools serviced, the borough through which the bus travels, and so on.

    The readme file included in this bundle also contains references to linking information. Specifically, the drivers and attendants, routes, transportation sites, vehicles, pre-k riders by transportation site, routes by transportation site, and pre-k vendors by transportation site. All of this data can be combined (see later) to provide route- and vendor- level data to the main delay data set. For example, to quanitfy the total number of drivers employed by each company, the total number of students or schools that they service, and so on.

    Our goal is to be able to make some kind of useful prediction about how long a bus is going to be delayed, at the time its delay is called into the operations centre. So we take a look at our primary 'ii_' data set.

    in_csv <- "../data/II_Bus_Breakdown_and_Delays.csv"
    ii_breakdowns <- read_csv(in_csv)
    ii_breakdowns %>% glimpse
    ## Observations: 259,637
    ## Variables: 21
    ## $ School_Year                     <chr> "2015-2016", "2015-2016", "201...
    ## $ Busbreakdown_ID                 <int> 1212699, 1212700, 1212701, 121...
    ## $ Run_Type                        <chr> "Special Ed AM Run", "Special ...
    ## $ Bus_No                          <chr> "48186", "2518", "235", "2102"...
    ## $ Route_Number                    <chr> "N758", "L530", "K168", "K216"...
    ## $ Reason                          <chr> "Other", "Mechanical Problem",...
    ## $ Schools_Serviced                <chr> "75485", "21854", "18366", "21...
    ## $ Occurred_On                     <dttm> 2015-09-02 06:27:00, 2015-09-...
    ## $ Created_On                      <dttm> 2015-09-02 06:29:00, 2015-09-...
    ## $ Boro                            <chr> "Nassau County", "Brooklyn", "...
    ## $ Bus_Company_Name                <chr> "BORO TRANSIT, INC.", "RELIANT...
    ## $ How_Long_Delayed                <chr> "25 minutes", NA, "30MINS", "2...
    ## $ Number_Of_Students_On_The_Bus   <int> 0, 0, 0, 1, 0, 0, 0, 9, 0, 2, ...
    ## $ Has_Contractor_Notified_Schools <chr> "Yes", "Yes", "Yes", "Yes", "Y...
    ## $ Has_Contractor_Notified_Parents <chr> "No", "Yes", "Yes", "Yes", "Ye...
    ## $ Have_You_Alerted_OPT            <chr> "No", "Yes", "No", "No", "No",...
    ## $ Informed_On                     <dttm> 2015-09-02 06:29:00, 2015-09-...
    ## $ Incident_Number                 <chr> NA, NA, NA, NA, NA, NA, NA, NA...
    ## $ Last_Updated_On                 <dttm> 2015-09-02 06:29:16, 2015-09-...
    ## $ Breakdown_or_Running_Late       <chr> "Running Late", "Breakdown", "...
    ## $ School_Age_or_PreK              <chr> "School-Age", "School-Age", "S...
    
    ii_breakdowns %>% distinct(How_Long_Delayed) %>% head(n=20)
    ## # A tibble: 20 x 1
    ##    How_Long_Delayed
    ##    <chr>           
    ##  1 25 minutes      
    ##  2 <NA>            
    ##  3 30MINS          
    ##  4 20 min          
    ##  5 30 min          
    ##  6 90 MIN          
    ##  7 20 MINS         
    ##  8 20              
    ##  9 30 mins         
    ## 10 15 MINS         
    ## 11 20 MIN          
    ## 12 15 mins         
    ## 13 30 MINS         
    ## 14 25              
    ## 15 0               
    ## 16 10 mins         
    ## 17 25 min          
    ## 18 1hr             
    ## 19 20 mins         
    ## 20 45 mins
    

    There is an obvious problem: these delays are 'human recorded', meaning that the format is inconsistent. Sometimes we have minutes, min, mts, hours, hrs; sometimes just two numbers separated by ':', and so on. Additionally, the times are not accurately recorded, as we will verify in a moment.

    Wrangling the delay times into something machine readable takes some doing. I don't doubt that there is a more general path available, but I went through sequentially, case by case, as generally as I could; then cleaned up the rest at the end.

    #--sequential clean
    ii_times <- ii_breakdowns %>%
      mutate(units_delayed = str_extract(How_Long_Delayed, "[:alpha:]+"),
        time_delayed = as.numeric(str_extract(How_Long_Delayed, "[:digit:]+")),
        units_delayed = ifelse(str_detect(units_delayed, "^[mM]"), "minutes", "other"),
        units_delayed = ifelse(is.na(units_delayed), "other", units_delayed)
        ) %>%
      #remove "in" case, exclude any flagged with "[hH}"
      mutate(
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "in")  & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
      ) %>%
      #deal with "to" case with minutes by taking only the first number. There are apparently no 'to' strings with hours in them.
      mutate(
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "to") & str_detect(How_Long_Delayed, "[mM]"),"minutes", units_delayed)
      ) %>%
      #deal with the - cases which don't involve hours at all
      mutate(
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & !str_detect(How_Long_Delayed, "[hH]"),"minutes", units_delayed)
      ) %>%
      #deal with the double digit minute-1hr:
      mutate(
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-") & str_detect(How_Long_Delayed, "^[1-5][0-9]"),"minutes", units_delayed)
      ) %>%
      #everything left with a hyphen is a 1 hour or more delay
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), 60, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "-"), "minutes", units_delayed)
      ) %>%
      #-all hyphens dealt with at this point
      #Now work on the "/": first take the minute timers: all those with double digits first
      mutate(
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "/") & str_detect(How_Long_Delayed, "^[1-9][0-9]"), "minutes", units_delayed)
      ) %>%
      #All those starting with 1/2 go to 30 minutes
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), 30, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1/2"), "minutes", units_delayed)
      ) %>%
      #All those starting with 1 1/2 go to 90 minutes
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), 90, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1 1/2"), "minutes", units_delayed)
      ) %>%
      #---specific cases
      #"1 HOUR1/2"
      mutate(
        time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", 90, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1 HOUR1/2", "minutes", units_delayed)
      ) %>%
      #"@least 1/2"
      mutate(
        time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", 30, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "@least 1/2", "minutes", units_delayed)
      ) %>%
      #"1HR/20MIN"
      mutate(
        time_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", 80, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & How_Long_Delayed == "1HR/20MIN", "minutes", units_delayed)
        )%>%
      #--end specific cases
      #1hr/x case
      #This is modifying things it shouldn't. Why is this even here?
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), 60, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "1hr"), "minutes", units_delayed)
      )%>%
      #---cases with ":"
      #remove cases which erroneously report a time instead of a duration, since these are not reliable
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, ":") & str_detect(How_Long_Delayed, "[aA][mM]"), NA, time_delayed)
      )%>%
      #2:hr cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), 120, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2:[hH][rR]"), "minutes", units_delayed)
      )%>%
      #1:45 cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), 105, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:45"), "minutes", units_delayed)
      )%>%
      #1:30 cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), 90, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:30"), "minutes", units_delayed)
      )%>%
      #1: cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), 60, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1:"), "minutes", units_delayed)
      )%>%
      #--end cases with ":"
      #--cases with "^2"
      #2 hr cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), 120, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
      )%>%
      #double digit minute cases:
      mutate(
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^2[0-9]"), "minutes", units_delayed) #times are already correct
      )%>%
      #---"^1"
      #1 hour cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), 60, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.*[rR]$") & !(str_detect(How_Long_Delayed, "^1.5.*[rR]$")), "minutes", units_delayed)
      )%>%
      #1.5 hour cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), 90, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1.5"), "minutes", units_delayed)
      )%>%
      #NR: some of the ^1 x min cases going in here are already tagged as minutes.
      #1 hr 30 cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), 90, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "30"), "minutes", units_delayed)
      )%>%
      #1 hr 15 cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), 75, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "15"), "minutes", units_delayed)
      )%>%
      #1 hr 20 cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), 80, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "20"), "minutes", units_delayed)
      )%>%
      #1 hr 45 cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), 105, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "45"), "minutes", units_delayed)
      )%>%
      #1 hr half
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), 90, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "half"), "minutes", units_delayed)
      )%>%
      #1 h cases
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), 60, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^1") & str_detect(How_Long_Delayed, "[hH]"), "minutes", units_delayed)
      )%>%
      #---has pattern "^[1-9][oO]", that is, o replaces 0
      mutate(
        time_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]"), time_delayed*10, time_delayed),
        units_delayed = ifelse(units_delayed=="other" & str_detect(How_Long_Delayed, "^[1-9][oO]") , "minutes", units_delayed)
      ) %>%
      #---final outliers and rejects
      #just switch to minutes
      mutate(
        units_delayed = ifelse(How_Long_Delayed == "35 SM", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "30am", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "30INS", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "45 late", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "40n", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "45 am", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "45NIN", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "45 INUTES", "minutes", units_delayed),
        units_delayed = ifelse(How_Long_Delayed == "45IN", "minutes", units_delayed)
      ) %>%
      #needs time modifications
      mutate(
        time_delayed = ifelse(How_Long_Delayed == "!0 mins", 1, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "0-15 Min", 5, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "MAYBE 1/2", 30, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "half hour", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "Half hour", 30, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "Half hour", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "one hour", 60, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "one hour", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "ONE HOUR", 60, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "ONE HOUR", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "HOUR", 60, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "HOUR", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "hour", 60, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "hour", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "HR1", 60, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "HR1", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1/ 15 min", 75, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1:20MIN", 80, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1 20 min", 80, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1:30 min", 90, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1:30?mins", 90, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1 30mnts", 90, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1 30 min", 90, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "1:40 MINS", 90, time_delayed),
        time_delayed = ifelse(How_Long_Delayed == "IHR40MIN", 100, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "IHR40MIN", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "3 HRS", 180, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "3 HRS", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "3 hr", 180, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "3 hr", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "3 HOUR", 180, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "3 HOUR", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "4 hours", 240, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "4 hours", "minutes", units_delayed),
        time_delayed = ifelse(How_Long_Delayed == "4 hrs", 240, time_delayed),
        units_delayed = ifelse(How_Long_Delayed == "4 hrs", "minutes", units_delayed)
      ) %>%
      #sometimes the record just reads 'mins' or equivalent. My 'units_delayed' picks this up as 'minutes'. We take a guess and set this to 5 minutes.
      mutate(
        time_delayed = ifelse(units_delayed == "minutes" & str_detect(How_Long_Delayed, "[mM][iI][nN]") & is.na(time_delayed), 5, time_delayed)
      ) %>%
      filter(!(is.na(time_delayed))) %>%
      filter(time_delayed <= 300) %>% #Any delays beyond 5 hours are probably typos, but also problematic if real.
      filter(units_delayed == "minutes") %>% #222 820 / 259 637. The rest are garbage or NA.
      select(
        Busbreakdown_ID, School_Year, Route_Number, Schools_Serviced, Bus_Company_Name,
        Has_Contractor_Notified_Schools, Has_Contractor_Notified_Parents, Have_You_Alerted_OPT, Reason,
        Boro, Number_Of_Students_On_The_Bus, School_Age_or_PreK,
        Created_On, Occurred_On,
        time_delayed
      )
    
    #---plot of the time_delayed
    library(RColorBrewer)
    
    pal <- brewer.pal(12, "Paired")
    
    reason_levels <- ii_times %>%
      group_by(Reason) %>%
      mutate(
        total_of_reason = n()
      ) %>% distinct(Reason, total_of_reason) %>%
        arrange(total_of_reason) %>%
      ungroup() %>%
      mutate(total_of_reason = row_number())
    
    ii_times %>%
      mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
      mutate(time_delayed = time_delayed) %>%
      ggplot(aes(x = time_delayed, fill = Reason)) +
      geom_bar() +
      theme_classic() +
      scale_fill_manual(values = pal) +
      coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
      labs(x="time_delayed [min]",y="Observations") +
      theme(text=element_text(family="Arial", size=16)) +
      theme(axis.line = element_line(colour = 'black', size = 1)) +
      theme(axis.ticks = element_line(colour = "black", size = 1)) +
      ggtitle("Observations by Reason")
    

    A bar graph broken up Reason shows that, as observed when cleaning up the delay data, these are 'human approximated delays'. The driver or attendant (or perhaps someone at operations) has approximated the delay to the nearest convenient time package: 5 minutes, 10 minutes, 15 minutes, 20, 25, 30, 40, 45, 50, 60, 90 minutes, and so on. Some delays have been figured out more accurately by comparing time differences, but most have not. One of the consequences is that if we decide to run a regression model for this data, we can never expect it to be more precise than 5-10 minutes of RMS error. Because that is how the data have been recorded.

    Note that there are a few data with delays beyond 100 minutes, and these points will show up later. The graph is presented this way for clarity. Traffic constitutes the major source of delay in the data set; accident is the most rare problem. 'Other', meaning a missing reason or undisclosed reason, is the second most common value in this field.

    We can have a look at some of the structure via density plots:

    ii_times %>%
      mutate(Reason = factor(Reason, levels = reason_levels[[1]])) %>%
      mutate(time_delayed = time_delayed) %>%
      ggplot(aes(x = time_delayed, fill = Reason)) +
      geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
      theme_classic() +
      scale_fill_manual(values = pal) +
      coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
      labs(x="time_delayed [min]",y="Observation count") +
      theme(text=element_text(family="Arial", size=16)) +
      theme(axis.line = element_line(colour = 'black', size = 1)) +
      theme(axis.ticks = element_line(colour = "black", size = 1)) +
      ggtitle("Density: observations by Reason")
    

    boro_levels <- ii_times %>%
      mutate(
        Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
      ) %>%
      group_by(Boro) %>%
      mutate(
        total_of_boro = n()
      ) %>% distinct(Boro, total_of_boro) %>%
      arrange(total_of_boro) %>%
      ungroup() %>%
      mutate(total_of_boro = row_number())
    
    ii_times %>%
      mutate(
        Boro = ifelse(is.na(Boro), "All Boroughs", Boro),
      ) %>%
      mutate(Boro = factor(Boro, levels = boro_levels[[1]])) %>%
      ggplot(aes(x = time_delayed, fill = Boro)) +
      geom_density(aes(y=..count..), size = 0, adjust = 4, position = "stack", color = NA) +
      theme_classic() +
      scale_fill_manual(values = pal) +
      coord_cartesian(expand = c(0,0), xlim = c(0, 100)) +
      labs(x="time_delayed [min]",y="Observation count") +
      theme(text=element_text(family="Arial", size=16)) +
      theme(axis.line = element_line(colour = 'black', size = 1)) +
      theme(axis.ticks = element_line(colour = "black", size = 1)) +
      ggtitle("Density: observations by Borough")
    

    Here we must be careful with interpretation, since the density plot presents counts convolved with a gaussian density kernel. This helps us to see structure, but we must be careful interpreting the counts. There are still very few observations with delay time of 22 minutes, for example.

    But we can see some structure a little more clearly. For example, Manhattan produces more long-time delays than the Bronx or Staten Island.

    In the next set of notes we will try to fit a regression, based only on this data. However, we can already see the need to join more data on as we have few predictors. Before building a model or joining to the other data sets, there is a little more cleaning and dummifying of categorical variables to do:

    ii_spread <- ii_times %>%
      mutate(
        time_diff_report = abs(as.numeric(difftime(Created_On, Occurred_On, units = "mins"))) 
      ) %>%
      mutate(
        reported_before_resolved = (time_delayed > time_diff_report)*1L,
        Created_On = NULL,
        Has_Contractor_Notified_Schools = (Has_Contractor_Notified_Schools == "Yes")*1L,
        Has_Contractor_Notified_Parents = (Has_Contractor_Notified_Parents == "Yes")*1L,
        Have_You_Alerted_OPT = (Have_You_Alerted_OPT == "Yes")*1L,
        School_Age = (School_Age_or_PreK == "School-Age")*1L,
        School_Age_or_PreK = NULL
      ) %>%
      mutate(dummy = 1L,
        Reason = str_replace_all(Reason, " ", ""),
        Reason = str_replace_all(Reason, "\`", "")
        ) %>%
      spread(key = Reason, value = dummy, fill = 0L, sep = "_") %>%
      mutate(Reason_Other = NULL) %>% #default case is Other
      mutate(dummy = 1L,
        Boro = ifelse(is.na(Boro), "All Boroughs",Boro),
        Boro = str_replace_all(Boro, " ", "")
      ) %>%
      spread(key = Boro, value = dummy, fill = 0L, sep = "_") %>%
      mutate(Boro_AllBoroughs = NULL) %>% #default case is all
      mutate(
        Number_Of_Students_On_The_Bus = as.numeric(Number_Of_Students_On_The_Bus) #This field has some problems we will deal with later
      ) %>%
      filter(Number_Of_Students_On_The_Bus <= 80) #Although this cut is arbitrary, we choose not to believe the records with more than 80 people on the bus.
    

    Related Post

    1. Explaining Black-Box Machine Learning Models – Code Part 2: Text classification with LIME
    2. Image clustering with Keras and k-Means
    3. ‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm.
    4. Machine learning: Logistic regression and decision trees for healthcare in R
    5. Machine Learning Benchmarking with SFA in R

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
    Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)