NYC buses: simple Cubist regression

November 29, 2018
By

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Advanced Modeling

    Tags

    1. Data Manipulation
    2. Machine Learning
    3. R Programming
    4. Tips & Tricks

    We continue on from our previously loaded data of NYC bus delays. Although we are fairly sure that our target variable (time_delayed) is not accurate, it is all we have. We can thank the bus drivers of New York that we have anything at all, I suppose, but the inaccuracy of this variable is probably one of the reasons not very many people have touched this data set.

    Data preparation

    Some data preparation first. We have initially included the variable 'time_diff_report' in the ii main data set. This is the time between the incident and when the incident is called in to operations. The difficult with this number is that it is obviously included in the time estimate given for the breakdown. The time difference is only ever less than the breakdown time for cases which are reported after they are resolved. Obviously, this number can be used as a strong predictor, while also providing no useful predictive information for our real problem: working out how much longer the bus will be delayed. We correct all of this by subtracting this value from the delayed time, so that 'time_delayed' becomes the amount of time from after operations is called that the bus is delayed. We also remove cases which are resolved before they are reported.

    #remove zero variance columns
    zero_var <- function(dat) {
      out <- lapply(dat, function(x) length(unique(x)))
      keep <- rownames(data.frame(which(out > 1)))
      dat %>%
        select(keep)
    }
    
    in_csv <- "../output/intermediate/ii_spread.csv"
    ii_spread <- read_csv(in_csv) %>%
      mutate(
        time_delayed = as.numeric(time_delayed - time_diff_report),
        time_diff_report = NULL
      ) %>%
      filter(reported_before_resolved == 1) %>%
      select(-School_Year, -Route_Number, -Schools_Serviced, -Bus_Company_Name, -Occurred_On) %>%
      zero_var
    
    halfFold <- createFolds(ii_spread$time_delayed, k = 5, list=TRUE)
    ii_small <- ii_spread[halfFold[[2]],]
    ii_test <-ii_spread[halfFold[[3]],]
    ii_train <- ii_spread %>%
      anti_join(ii_test, by = "Busbreakdown_ID")
    
    rctrl_manual <- trainControl(method = "none", returnResamp = "all")
    rctrl_repcv <- trainControl(method = "repeatedcv", number = 2, repeats = 5, returnResamp = "all")
    

    Cross fold validation testing

    Since I am running all of this locally on a laptop which 'benchmarkme' tells me is the slowest system to have been benchmarked in 2018, I build data set 'ii_small' to do cross-fold validation testing, with root mean square error as my target. During my run, I test different numbers of committees and rules. The numbers presented here are what I ended up with. Neighbours always degrade the solution in this case are fixed to zero. We can see the full rule set of the model by calling 'summary', but this results in a lot of spam. A more succinct presentation is available by looking at the 'finalModel' components of the derived model. 'splits' gives a list of the derived rules, where multiple entries per rule indicates an 'AND' condition (both conditions satisfied). 'coefficients' gives the resulting slope and intercept of the application of that rule. Finally, we can get a full listing of how many times a predictor appears in the ruleset and the coefficient set by looking at 'usage': this is somewhat equivalent to a variable importance call.

    #---cubist
    cubist_grd <- expand.grid(
      committees = 1,
      neighbors = 0
    )
    
    set.seed(849)
    cubist_cv <- train(time_delayed ~ . -Busbreakdown_ID,
      data = ii_small,
      method = "cubist",
      metric="RMSE",
      na.action = na.pass,
      trControl = rctrl_repcv,
      tuneGrid = cubist_grd,
      control = Cubist::cubistControl(unbiased = T, rules = 20, sample = 0)
    )
    cubist_cv 
    ## Cubist 
    ## 
    ## 40908 samples
    ##    25 predictor
    ## 
    ## No pre-processing
    ## Resampling: Cross-Validated (2 fold, repeated 5 times) 
    ## Summary of sample sizes: 20454, 20454, 20453, 20455, 20455, 20453, ... 
    ## Resampling results:
    ## 
    ##   RMSE      Rsquared   MAE     
    ##   13.52728  0.1336734  9.883126
    ## 
    ## Tuning parameter 'committees' was held constant at a value of 1
    ## 
    ## Tuning parameter 'neighbors' was held constant at a value of 0
    
    #summary(cubist_cv)
    cubist_cv$finalModel$splits %>% glimpse
    ## Observations: 89
    ## Variables: 8
    ## $ committee   1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
    ## $ rule        1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 5, 5, 6, ...
    ## $ variable    School_Age, Boro_Bronx, Reason_HeavyTraffic, Number...
    ## $ dir         <=, <=, >, <=, >, >, >, >, <=, >, <=, >, <=, <=, <=...
    ## $ value       0, 0, 0, 6, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
    ## $ category    , , , , , , , , , , , , , , , , , , , , , , , , 
    ## $ type        "type2", "type2", "type2", "type2", "type2", "type2...
    ## $ percentile  0.13063459, 0.75171116, 0.29549232, 0.88540139, 0.9...
    
    cubist_cv$finalModel$coefficients %>% glimpse
    ## Observations: 20
    ## Variables: 27
    ## $ `(Intercept)`                    16.8, 14.9, 13.8, 21.0, 14.7, ...
    ## $ Has_Contractor_Notified_Schools  NA, NA, 1.4, 8.7, 1.4, 2.2, NA...
    ## $ Has_Contractor_Notified_Parents  NA, 7.2, 4.0, -7.9, NA, 1.2, 3...
    ## $ Have_You_Alerted_OPT             NA, 6.1, NA, -2.6, NA, -1.9, 3...
    ## $ Number_Of_Students_On_The_Bus    -0.78, -0.56, -0.17, -0.70, -0...
    ## $ School_Age                       NA, NA, NA, -1.2, NA, -2.3, 5....
    ## $ Reason_Accident                  NA, NA, 10, 2, NA, 6, 10, NA, ...
    ## $ Reason_DelayedbySchool           NA, NA, NA, NA, NA, -5, -5, NA...
    ## $ Reason_FlatTire                  NA, NA, 7, 1, NA, 4, NA, NA, 1...
    ## $ Reason_HeavyTraffic              NA, NA, -2.0, 1.9, NA, -1.2, N...
    ## $ Reason_LatereturnfromFieldTrip   NA, NA, NA, NA, NA, 2.1, 5.6, ...
    ## $ Reason_MechanicalProblem         NA, NA, 6.7, 14.6, NA, 5.0, 6....
    ## $ Reason_ProblemRun                NA, NA, NA, NA, NA, 2, 8, NA, ...
    ## $ Reason_WeatherConditions         NA, NA, 10.3, 3.6, NA, 3.4, NA...
    ## $ Reason_WontStart                 NA, NA, NA, NA, NA, 2, NA, NA,...
    ## $ Boro_Bronx                       NA, NA, NA, NA, NA, -2.5, NA, ...
    ## $ Boro_Brooklyn                    NA, 0.4, NA, NA, 3.7, 1.1, 4.8...
    ## $ Boro_Connecticut                 NA, NA, NA, NA, NA, NA, NA, NA...
    ## $ Boro_Manhattan                   NA, 0.6, NA, NA, NA, 2.6, 4.3,...
    ## $ Boro_NassauCounty                NA, NA, NA, NA, 8, 3, NA, NA, ...
    ## $ Boro_NewJersey                   NA, NA, NA, NA, NA, 4, 5, NA, ...
    ## $ Boro_Queens                      NA, 0.5, NA, NA, 6.8, 2.2, 3.6...
    ## $ Boro_RocklandCounty              NA, NA, NA, NA, NA, 4, NA, NA,...
    ## $ Boro_StatenIsland                NA, NA, NA, NA, NA, -2.5, NA, ...
    ## $ Boro_Westchester                 NA, 1.2, NA, NA, NA, 5.1, 13.6...
    ## $ committee                        "1", "1", "1", "1", "1", "1", ...
    ## $ rule                             "1", "2", "3", "4", "5", "6", ...
    
    cubist_cv$finalModel$usage %>% head
    ##   Conditions Model                        Variable
    ## 1         68    98   Number_Of_Students_On_The_Bus
    ## 2         66     6                      Boro_Bronx
    ## 3         65    29             Reason_HeavyTraffic
    ## 4         52    79 Has_Contractor_Notified_Parents
    ## 5         46    45                      School_Age
    ## 6         38    47                  Boro_Manhattan
    

    Validation on test data

    We could look through this data set for human-readable rules, but we can see from the high RMSE and low \(R^2\) value that we do not have a very good fit. To look at exactly how rough a model this is, we use our manual fold splits to predict from our 'ii_test' data set:

    cubist_man <- train(time_delayed ~ . -Busbreakdown_ID,
      data = ii_train,
      method = "cubist",
      metric="RMSE",
      na.action = na.pass,
      trControl = rctrl_manual,
      tuneGrid = cubist_grd,
      control = Cubist::cubistControl(unbiased = T, rules = 20, sample = 0)
    )
    
    ii_withpred <- ii_test %>%
      mutate(
        time_predicted = predict(cubist_man, newdata = ii_test)
      )
    
    ii_withpred %>%
      summarise(
        rms = (mean((time_predicted-time_delayed)^2))^0.5,
        sum_sq = sum((time_predicted-time_delayed)^2),
        tot_sum_sq = sum((time_delayed - mean(time_delayed))^2),
        r_sq = 1- sum_sq / tot_sum_sq
      ) #1,20: 13.5, 0.141
    ## # A tibble: 1 x 4
    ##     rms   sum_sq tot_sum_sq  r_sq
    ##              
    ## 1  13.6 7542210.   8829766. 0.146
    
    protec <- ii_withpred %>%
      mutate(Reason_Other = 1L) %>% #dummy flag to get the name in
      select(-contains("Reason")) %>%
      colnames()
    
    pal <- brewer.pal(10, "Paired")
    
    ii_withpred %>%
      mutate(Reason_Other = ifelse(Reason_Accident + Reason_DelayedbySchool + Reason_FlatTire + Reason_HeavyTraffic + Reason_LatereturnfromFieldTrip + Reason_MechanicalProblem + Reason_ProblemRun + Reason_WeatherConditions + Reason_WontStart == 1, 0L, 1L)) %>%
      gather(key = Reason, value = dummy, -one_of(protec)) %>%
      filter(dummy == 1) %>%
      mutate(dummy = NULL) %>%
      filter(row_number() %% 10 == 0) %>%
      ggplot(aes(x = time_predicted, y = time_delayed, fill = Reason, color = Reason)) +
      geom_point(alpha = 0.75, shape = 21) +
      scale_fill_manual(values = pal) +
      geom_abline(intercept = 0, slope = 1, color = "red") +
      coord_cartesian(expand = c(0,0), x = c(0,60)) +
      theme_classic() +
      labs(x="time_predicted [min]",y="time_delayed [min]") +
      theme(text=element_text(family="Arial", size=16)) +
      theme(axis.line = element_line(colour = 'black', size = 1)) +
      theme(axis.ticks = element_line(colour = "black", size = 1)) +
      ggtitle("Simple Cubist: delayed vs predicted")
    

    We can see that we haven't done a very good job. We are mis-predicting pretty much everything. But that is fine, since we know that we have more predictors available that we have not yet joined to the data set: company and route level metadata. We also do not know whether a regression approach is really going to be useful with this data, and indeed, later on we will switch over to a two class classification problem, and simply look at whether a delay will be greater or less than a cut off time.

    Related Post

    1. Visualization of NYC bus delays with R
    2. Explaining Black-Box Machine Learning Models – Code Part 2: Text classification with LIME
    3. Image clustering with Keras and k-Means
    4. ‘How do neural nets learn?’ A step by step explanation using the H2O Deep Learning algorithm.
    5. Machine learning: Logistic regression and decision trees for healthcare in R

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



    If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

    Comments are closed.

    Search R-bloggers


    Sponsors

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)