Explanatory Analysis of the XGBoost Model for Budget Deficits of U.S.

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The debt ceiling was always an issue in the United States. As of today, the national government debt has reached the debt ceiling, which is $31.4 trillion. The authorities have warned of chaotic consequences if Congress no longer approves the debt ceiling.

The U.S. government has managed an annual deficit of approximately $1 billion since 2001. We will examine this situation in a more extended period and scope for the United States. The variables we are going to use:

  • gross domestic product per capita(gdp)
  • the general government deficit as a percentage of GDP(deficit)
  • the unemployment rate(unemployment)

We will compare gdp and deficit variables in an interactive bar chart.

library(tidyverse)
library(tidymodels)
library(DALEXtra)
library(ggtext)
library(glue)
library(plotly)
library(sysfonts)
library(showtext)
library(modelStudio)

df <- read_csv("https://raw.githubusercontent.com/mesdi/blog/main/deficit.csv")

#adding google font
font_add_google(name = "Roboto Slab", family = "slab")
showtext_auto()
 
#Hoverinfo texts 
text_gdp <- glue("GDP/capita: {number(df$gdp, scale_cut = cut_short_scale(),accuracy = 1)}\nYear: {df$time}")
text_deficit <- glue("Deficit/GDP: {number(df$deficit, suffix = '%', accuracy = 0.01)}\nYear: {df$time}")

#coefficient for dual y-axis transformation
coeff <- mean(df$gdp) / mean(df$deficit) %>% abs()

#Comparing GDP per capita and deficit % of GDP for the U.S.
df %>% 
  ggplot(aes(time)) +
  geom_bar(aes(y = gdp, text = text_gdp), 
           stat = "identity", 
           fill = "blue") +
  geom_line(aes(y = gdp, text = text_gdp, group =1), color = "navyblue", size =2) +
  geom_bar(aes(y = deficit * coeff, text = text_deficit), 
           stat= "identity", 
           fill = "red")+
  geom_line(aes(y = deficit * coeff, text = text_deficit, group =1), 
            color = "#800000", 
            size = 2) +
  #second(dual) y-axis
  scale_y_continuous(sec.axis = sec_axis(~./coeff)) +
  xlab("")+ 
  ylab("")+
  ggtitle("<span style = 'color:blue'>GDP per capita</span>  vs. <span style = 'color:red;'>Deficit % of GDP </span> for the U.S.")+
  theme_minimal()+
  theme(panel.grid.minor = element_blank(),
        axis.text.y = element_blank(),
        axis.text.x = element_text(size=12),
        plot.title = ggtext::element_markdown(hjust = 0.5)) -> p

#setting font family for ggplotly 
font <- list(
  family= "Roboto Slab", 
  size=15
)

#setting font family for hover label
label <- list(
  font = font
) 

#converts ggplot2 object to plotly for interactive chart
ggplotly(p, tooltip = "text") %>% 
  style(hoverlabel = label) %>% 
  layout(font = font)

When we analyze the above chart, we can say that the years 2010 and 2020 have the highest deficit rate values; unemployment rates during the mortgage crisis and the pandemic, respectively, might be one of the causes of that situation.

Now, we will examine what causes might affect the deficit rates; in order to do that, we will model the data with the xgboost and find the feature importance scores and the Shapley values with the modelStudio package.

#Preprocessing
df_rec <- 
  recipe(deficit ~ gdp + unemployment, data = df) 

#Creating a preprocessed data frame
df_proc <- 
  df_rec %>%
  prep() %>%
  bake(new_data = NULL)  

#Modeling and fitting
set.seed(12345)
df_fit <- 
  boost_tree() %>%
  set_mode("regression") %>%
  set_engine("xgboost") %>%
  fit(deficit ~ ., data = df_proc)

#Explainer object
explainer <- 
  DALEX::explain(
    model = df_fit,
    data  = df_proc %>% select(-deficit),
    y     = df$deficit,
    label = "XGBoost"
  )

#Model Studio
set.seed(1983)
modelStudio::modelStudio(explainer)

As seen above, both predictors have a close level of decisiveness on the target variable in general(feature importance); but when it comes to individual effects on the target(Shapley values), we see that they differ from each other inversely.

It is seen in the specific observation on the above graph, gdp has a decreasing effect, while unemployment has an increasing effect, on the deficit which has mostly negative values.

To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)