Kaggle Playground Series – Tidymodels

[This article was first published on Sport Data Science, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Hello readers, we are entering another Kaggle playground competition, so get your Yorkshire tea ready and enjoy the process of joining. This month the competition I entered is this one

https://www.kaggle.com/competitions/playground-series-s3e7It’seiew

It’s looks like looks are canncellations from hotels and spoiler alert – I had a lot of fun with this dataset.

EDA


setwd("C:/Users/aLEX/Documents/Projects/MontlyKagmach/feb22")

train = read_csv("train2.csv")

str(train)

First, I always work in workflow by using the str function to see the structure of the data and the column types. All the columns are numeric; however, I need to change the booking status column to a factor so a classification model can be fitted.



trin_sum = train %>% group_by(no_of_weekend_nights) %>%
                          summarise(bookper = mean(booking_status)) %>%
                            mutate(part_week = "Weekend") %>%
                                rename(nights = no_of_weekend_nights)





trin_sum2 = train %>% group_by(no_of_week_nights) %>%
                          summarise(bookper = mean(booking_status)) %>%
                            mutate(part_week = "Week day") %>%
                                rename(nights = no_of_week_nights)



week_tot = trin_sum %>% bind_rows(trin_sum2)


library(scales)


ggplot(week_tot, aes(x = nights, y = bookper)) + geom_col(fill = "#fc6b03") +
                                                    facet_wrap(~part_week, scales = "free") +
                                                      scale_y_continuous(labels = percent_format()) +
                                                        labs(x = "Nights", y = "Cancel %", title = "Comparison of Cancellation Rates Depending on Number of Booked Nights") +
                                   theme(panel.background = element_rect(fill = "#060036"), panel.grid = element_blank(), plot.background = element_rect(fill = "#060036"), 
                                         plot.title = element_text(colour = "white"), axis.title.x = element_text(colour = "white"), axis.title.y = element_text(colour = "white"), axis.ticks = element_line(colour = "white"), axis.line = element_line(colour = "white"), axis.text.x = element_text(colour = "white"), axis.text.y = element_text(colour = "white"))

The first area to investigate is how the number of nights booked at different times of the week impacts the cancellation rate. The dataset is split into the number of weeknights booked and weekend nights booked. Weekend nights show a clear trend, with an increase in cancellations up to 6 nights booked. 

cols = c("1" = "#fc6b03", "0" = "#060036")

ggplot(train, aes(x = as.factor(booking_status), y = avg_price_per_room, colour = as.factor(booking_status))) + 
                                                                                      facet_wrap(~room_type_reserved, scales = "free") +
                                                                                     geom_jitter(alpha = 0.3)  +
                                                                              scale_colour_manual(values = cols) +
                                                                                      guides(colour = F) +
                                           labs(x = "Booking Status", y = "Avg Room Price", title = "Booking Status Compared to price and room type") +
                                                                  theme(panel.background = element_blank())

The next area reviewed was the average price paid for the room alongside the type of room booked. Overall I would say the price thats paid has little impact on whether the room is cancelled across all the room types. When the room is free, there is a significant reduction in cancellations compared to the average.


date = train %>% group_by(arrival_year, arrival_month) %>%
                      summarise(percan = mean(booking_status))

cols2 = c("2017" = "#fc6b03", "2018" = "#060036")

ggplot(date, aes(x = arrival_month, y = percan, colour = as.factor(arrival_year))) + geom_line(size =2) +
                                                 labs(x = "Month", y = "Cancel %", title = "Comparing Cancellation Rate of Both Years") +
                                                                guides(colour = guide_legend(title = "Year")) +
                                                                    scale_colour_manual(values = cols2) +
                                                                         scale_y_continuous(labels = percent_format()) +
                                                                            theme(panel.background = element_blank())


This data set had 2 different years, covering 2017 and 18; therefore, I compared the cancellation rate across both years. In 2018 there were more cancellations in the latter part of the year. I’m starting this by just training a random forest model on all the data in the training dataset and then attempting some feature engineering to improve the performance

Model Fitting



train2 = train %>% 
              mutate(booking_status = as.factor(as.numeric(booking_status))) 


split1 = initial_split(train2, prop = 0.75, strata = booking_status)


train_data = training(split1)


test_data = testing(split1)


cross_val = vfold_cv(train_data, v = 5, repeats = 2, strata = booking_status)



ranger_recipe <- 
  recipe(formula = booking_status ~ ., data = train_data)  %>% 
      update_role(id, new_role = "ID")

ranger_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_mode("classification") %>% 
  set_engine("ranger") 

ranger_workflow <- 
  workflow() %>% 
  add_recipe(ranger_recipe) %>% 
  add_model(ranger_spec) 

doParallel::registerDoParallel()

set.seed(77062)
ranger_tune <-
  tune_grid(ranger_workflow, resamples = cross_val, grid = 10)



ran_test = ranger_tune %>% unnest(cols = c(.metrics)) %>%
                                  filter(.metric == "roc_auc") %>%
                                    select(mtry, min_n, .estimate) %>%
                                        pivot_longer(cols = 1:2, names_to = "parameter", values_to = "value")




ggplot(ran_test, aes(x = value, y = .estimate)) + geom_point(colour = "#fc6b03") +
                                                      facet_wrap(~parameter, scales = "free") +
                                                               labs(x = "Parameter Value", y = "ROC_AUC", title = "Random Forest Tuning Parameter Results") +
                                   theme(panel.background = element_rect(fill = "#060036"), panel.grid = element_blank(), plot.background = element_rect(fill = "#060036"), 
                                         plot.title = element_text(colour = "white"), axis.title.x = element_text(colour = "white"), axis.title.y = element_text(colour = "white"), axis.ticks = element_line(colour = "white"), axis.line = element_line(colour = "white"), axis.text.x = element_text(colour = "white"), axis.text.y = element_text(colour = "white"))


I trained a random forest model on the training data and identified the best tuning parameters. I will then fit a final model with the best tuning parameters, becoming the baseline for testing the feature-engineered columns.

Feature Engineering

I will investigate new features to improve the model. The price and length of the booking. If you had overpaid for the room, you are more likely to cancel the booking. The earlier a booking is made, the more likely to be cancelled. 

The average price per room is linked to the booked room and the time of year. I will create a simple model to predict what fee should be paid. The difference between what the person paid and what would be expected would then be used to predict cancellation. 


price_data = train %>% select(arrival_month, room_type_reserved, avg_price_per_room, lead_time)



price_est = lm(avg_price_per_room ~ ., price_data)


price_pred = predict(price_est, train)


train3 = train %>% bind_cols(price_pred) %>% 
                        rename(price_pred = "...20") %>% 
                      mutate(price_del = round(avg_price_per_room/price_pred-1,1)) %>% 
                        mutate(price_del2 = if_else(price_del > 1, 1, price_del)) %>%
                        group_by(price_del2) %>% 
                            summarise(meand = mean(booking_status), n = n())



ggplot(train3, aes(x = price_del2, y = meand, size = n)) + geom_point(col = "#fc6b03") +
                                                    scale_y_continuous(labels = percent_format()) +
                                                      scale_x_continuous(labels = percent_format()) + 
                                                      
                                           labs(x = "Delta to Expect %", y = "Cancel %", title = "Comparison of Cancellation Rates by Difference to Expected Price") +
                                   theme(panel.background = element_rect(fill = "#060036"), panel.grid = element_blank(), plot.background = element_rect(fill = "#060036"), 
                                         plot.title = element_text(colour = "white"), axis.title.x = element_text(colour = "white"), axis.title.y = element_text(colour = "white"), axis.ticks = element_line(colour = "white"), axis.line = element_line(colour = "white"), axis.text.x = element_text(colour = "white"), axis.text.y = element_text(colour = "white"))


I fitted a simple linear regression model to benchmark the expected price. Comparing the delta between the expected and actual price shows a difference in cancellation rate depending on the difference with the expected price. I then fitted a new model in the same format above. 



train_lead = train %>% group_by(lead_time) %>%
                          summarise(n = n(), per = mean(booking_status))


ggplot(train_lead, aes(x = lead_time, y = per)) + geom_line(colour = "#fc6b03") + 
                                                       scale_y_continuous(labels = percent_format()) +
                                                        labs(x = "Lead Time", y = "Cancel %", title = "Cancellation Rate by Lead Time of Booking") +
                                   theme(panel.background = element_rect(fill = "#060036"), panel.grid = element_blank(), plot.background = element_rect(fill = "#060036"), 
                                         plot.title = element_text(colour = "white"), axis.title.x = element_text(colour = "white"), axis.title.y = element_text(colour = "white"), axis.ticks = element_line(colour = "white"), axis.line = element_line(colour = "white"), axis.text.x = element_text(colour = "white"), axis.text.y = element_text(colour = "white"))

The next area to focus on was the impact of the booking when the booking was initially made. I plotted the cancellation % by the lead time of the booking. Less than 90 days cancellation rate is roughly 25%; between 90 and 150 days, the cancellation rate is around 45%. After that, the cancellation rate is approximately 70%. I created the new feature and created the model in the same process as the first model. 

Result


mod2_pred = predict(rf2, test_data2, type = "prob")

mod2_pred2 = predict(rf2, test_data2)

test_data_fin = test_data2 %>% bind_cols(mod2_pred) %>%
                                bind_cols(mod2_pred2)



result1 = roc_auc(test_data_fin, truth = booking_status, estimator = "macro_weighted", estimate = c(.pred_0:.pred_1))

result1 = result1 %>% mutate(model = "Expected Price")




mod_pred = predict(rf3, test_data3, type = "prob")

mod_pred2 = predict(rf3, test_data3)

test_data_fin3 = test_data3 %>% bind_cols(mod_pred) %>%
                                bind_cols(mod_pred2)



result3 = roc_auc(test_data_fin3, truth = booking_status, estimator = "macro_weighted", estimate = c(.pred_0:.pred_1))

result3 = result3 %>% mutate(model = "Baseline")



mod3_pred = predict(rf, test_data, type = "prob")

mod3_pred2 = predict(rf, test_data)

test_data_fin1 = test_data %>% bind_cols(mod3_pred) %>%
                                bind_cols(mod3_pred2)



result = roc_auc(test_data_fin1, truth = booking_status, estimator = "macro_weighted", estimate = c(.pred_0:.pred_1))

result = result %>% mutate(model = "Lead Time")




result_sum = result %>% bind_rows(result1) %>% 
                                  bind_rows(result3)

The results show that out of the 2 models with the new features the one with the lead time added if the model which has improved model performance. The price delta model had a worse performance which could be because the method for calculating the expected price wasnt accurate and the model could infer price impact from the price anyway. 

I then trained a lightgbm model and submitted the predictions and ended up 262nd

To leave a comment for the author, please follow the link and comment on their blog: Sport Data Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)