# Quantify the Covid19 Impact on the SFO Airport Passenger Air Traffic

**Rami Krispin**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Covid19 pandemic had, and unfortunately still have, a significant impact on most of the major industries. While for some sectors, the impact was positive (such as online retails, internet and steaming providers, etc.), it was negative for others (such as transportation, tourism, entertainment, etc.). In both cases, we can leverage time series modeling to quantify the effect of the Covid19 on the sector.

One simplistic approach for quantifying the impact of the pandemic (whether it is positive or negative) would include the following steps:

- Split the series into pre-covid and post-covid
- Train a time series model with the pre-covid data. This would enable us to simulate the value of the series if there was no pandemic
- Using the trained model to forecast the horizon of the post-covid series
- Use the difference between the forecast and actual (post-covid series) to quantify the Covid19 impact on the series

To demonstrate this approach, I will use the `sfo_passengers`

dataset from the sfo package. The `sfo_passengers`

dataset provides monthly statistics about the San Francisco International Airport (SFO) air traffic between July 2005 and September 2020. More details about the dataset available in the following post and vignette.

For this analysis, we will use the following packages:

`sfo`

– the passenger air traffic data`dplyr`

– data prep`plotly`

– data visualization`TSstudio`

– time series analysis and forecasting

library(sfo) library(dplyr) library(plotly) library(TSstudio)

#### Data

As mentioned above, the `sfo_passengers`

dataset provides monthly statistics about the air passenger traffic at SFO airport since 2005. That includes monthly information about the number of passengers by different categories such as operating airline, region, terminal, etc. Let’s load the data:

data("sfo_passengers") str(sfo_passengers) ## 'data.frame': 22576 obs. of 12 variables: ## $ activity_period : int 202009 202009 202009 202009 202009 202009 202009 202009 202009 202009 ... ## $ operating_airline : chr "United Airlines" "United Airlines" "United Airlines" "United Airlines" ... ## $ operating_airline_iata_code: chr "UA" "UA" "UA" "UA" ... ## $ published_airline : chr "United Airlines" "United Airlines" "United Airlines" "United Airlines" ... ## $ published_airline_iata_code: chr "UA" "UA" "UA" "UA" ... ## $ geo_summary : chr "International" "International" "International" "International" ... ## $ geo_region : chr "Mexico" "Mexico" "Mexico" "Mexico" ... ## $ activity_type_code : chr "Enplaned" "Enplaned" "Enplaned" "Deplaned" ... ## $ price_category_code : chr "Other" "Other" "Other" "Other" ... ## $ terminal : chr "Terminal 3" "Terminal 3" "International" "International" ... ## $ boarding_area : chr "F" "E" "G" "G" ... ## $ passenger_count : int 6712 396 376 6817 3851 3700 71 83 65 45 ...

Before we will convert the data into time series format, we will transform the `activity_period`

into a `Date`

format:

df <- sfo_passengers %>% mutate(date = as.Date(paste(substr(sfo_passengers$activity_period, 1,4), substr(sfo_passengers$activity_period, 5,6), "01", sep ="/")))

Next, we will transform the dataset into a time series format by grouping the passenger by the `date`

variable:

df <- df %>% group_by(date) %>% summarise(y = sum(passenger_count), .groups = "drop") head(df) ## # A tibble: 6 x 2 ## date y ## <date> <int> ## 1 2005-07-01 3225769 ## 2 2005-08-01 3195866 ## 3 2005-09-01 2740553 ## 4 2005-10-01 2770715 ## 5 2005-11-01 2617333 ## 6 2005-12-01 2671797

Now, we have a monthly time series:

plot_ly(data = df, x = ~ date, y = ~ y, type = "scatter", mode = "line", name = "Total Passengers") %>% add_segments(x = as.Date("2020-02-01"), xend = as.Date("2020-02-01"), y = min(df$y), yend = max(df$y) * 1.05, line = list(color = "black", dash = "dash"), showlegend = FALSE) %>% add_annotations(text = "Pre-Covid19", x = as.Date("2018-09-01"), y = max(df$y) * 1.05, showarrow = FALSE) %>% add_annotations(text = "Post-Covid19", x = as.Date("2021-08-01"), y = max(df$y) * 1.05, showarrow = FALSE) %>% layout(title = "Total Number of Air Passengers - SFO Airport", yaxis = list(title = "Number of Passengers"), xaxis = list(title = "Source: San Francisco Open Data Portal"))As can be observed in the plot above, the Covid19 effect is well pronounced since March 2020. To quantify the effect of the pandemic on the number of passengers, we will split the series into the following two series:

- Pre-Covid19 series- all observations prior to March 2020
- Post-Covid19 series- all observations post (include) to March 2020

We will use the Pre-Covid19 series to train a time series model. That will enable us to forecast the number of passengers as if there was no pandemic.

pre_covid <- df %>% dplyr::filter(date < as.Date("2020-03-01")) %>% dplyr::arrange(date) post_covid <- df %>% dplyr::filter(date >= as.Date("2020-03-01")) %>% dplyr::arrange(date)

We will use the `pre_covid`

series to train a time series model. That will enable us to forecast the number of passengers as there was no pandemic. Once we forecast the corresponding observations of the `post_covid`

series with the `pre_covid`

data, we could quantify the impact of the Covid19 on the total number of passengers.

#### Analyzing the data

Before we forecast the series, let’s run a quick exploratory analysis on the series to identify its main characteristics. We will use the TSstudio package to visualize the `pre_covid`

series. Note that the package does not support, yet, the `tsibble`

object. Therefore, we will convert the series into a `ts`

object first:

ts.obj <- ts(pre_covid$y, start = c(2005, 7), frequency = 12)

The series before the outbreak of the pandemic:

ts_plot(ts.obj, title = "Total Number of Air Passengers - SFO Airport", Ytitle = "Number of Passengers", slider = TRUE)

Like most series that describes monthly air passenger traffic, the series has a strong monthly seasonal pattern and a positive trend. You can also notice that the seasonal component’s oscillation has become larger since 2017 (compared to previous years). Let’s use the `ts_seasonal`

function to create a seasonal plot of the series:

ts_seasonal(ts.obj = ts.obj, type = "all")

As can see in the seasonal plot, the monthly seasonal effect kept overtime while the series continue to grow from year to year.

Similarly, we can review the series correlation with its past lags using the ACF and PACF functions:

ts_cor(ts.obj = ts.obj, lag.max = 36)

And as expected, we can see strong correlation between the series and the first and seasonal lags. We will leverage this information to select time-series models for seasonal data.

#### Forecast the Pre-Covid19 series

One of my favorite forecasting strategies is a combination of horse racing between different time series models and backtesting as a training approach. Backtesting is the time series equivalent of the machine learning cross-validation training approach. The idea here is simple - test each model with backtesting, and select the one that performed best, on average, on the different testing partition.

The `train_model`

function from the **TSstudio** package enables us to apply this strategy seamlessly using models from the **forecast** and **stats** packages. For simplicity, we will use different flavors of `ETS`

and `Holt-Winters`

models and out-of-the-box `auto.arima`

and `tslm`

models. For the backtesting, we will split the series into 6 testing partitions, each 12 months spaced by 3 months from each other.

The `methods`

argument defines the models to use and the `train_method`

argument defines the setting of the backtesting. Can find more details about the function here.

methods <- list(ets1 = list(method = "ets", method_arg = list(opt.crit = "lik"), notes = "ETS model opt.crit=lik"), ets2 = list(method = "ets", method_arg = list(opt.crit = "amse"), notes = "ETS model opt.crit=amse"), ets3 = list(method = "ets", method_arg = list(opt.crit = "mse"), notes = "ETS model opt.crit=mse"), auto_arima = list(method = "auto.arima", notes = "Auto ARIMA"), hw1 = list(method = "HoltWinters", method_arg = NULL, notes = "HoltWinters Model"), hw2 = list(method = "HoltWinters", method_arg = list(seasonal = "multiplicative"), notes = "HW with multip. seasonality"), tslm = list(method = "tslm", method_arg = list(formula = input ~ trend + season), notes = "tslm with trend and seasonal")) train_method = list(partitions = 6, sample.out = 12, space = 3)

After we defined the methods and `train_method`

arguments we will use the `train_model`

function to train the models. Note that the forecast horizon is set the the length of the `post_covid`

series. In addition we will set the `MAPA`

as the error metric to evaluate the performance of the different models on the testing partitions:

md <- train_model(input = ts.obj, methods = methods, train_method = train_method, horizon = nrow(post_covid), error = "MAPE") ## # A tibble: 7 x 7 ## model_id model notes avg_mape avg_rmse `avg_coverage_80%` `avg_coverage_95%` ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 hw1 HoltWinters HoltWinters Model 0.0277 149046. 0.792 0.958 ## 2 ets2 ets ETS model opt.crit=amse 0.0284 161754. 0.792 0.972 ## 3 hw2 HoltWinters HW with multip. seasonality 0.0300 171702. 0.528 0.847 ## 4 ets1 ets ETS model opt.crit=lik 0.0307 173337. 0.833 0.958 ## 5 ets3 ets ETS model opt.crit=mse 0.0311 174238. 0.861 0.972 ## 6 auto_arima auto.arima Auto ARIMA 0.0334 184381. 0.597 0.889 ## 7 tslm tslm tslm with trend and seasonal 0.0370 223194. 0.569 0.75

Based on the `leaderboard`

table from the `train_model`

function, the model that performed best on average on the different testing partitions is the **Holt-Winters** model (first version - `hw1`

). The model achieved, on average, the lowest MAPE (2.76%) and RMSE (149046) compared to the other models which evaluated. In addition, the model achieved a close to perfect coverage of the model prediction intervals with an average coverage of 79.2% and 95.8% for the 80% and 95% prediction interval, respectively. We can review the error distribution across the different partitions for each model with the `plot_error`

function:

plot_error(md)

The `plot_model`

enables us to animate the forecasted values of each model on the different testing partitions of the backtesting:

plot_model(md)

**leave a comment**for the author, please follow the link and comment on their blog:

**Rami Krispin**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.