Dynamic Regression with ARIMA Errors: The Students on the Streets

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The higher education students have had trouble being housing in Turkey in recent days. There have been people who even sleep on the streets like a homeless. The government has been accused of investing inadequate dormitories for sheltering the students. Let’s examine the ongoing sheltering problem of students.

The dataset we have built for this article consists of the number of formal education students and the housing capacity provided by Higher Education Loans and Dormitories Institution(KYK), which is a state organization.

We will determine whether there is a capacity shortage this year based on historical data. The model we are going to use is the dynamic regression model with ARIMA errors; Because we will model the dormitories’ capacity in terms of the number of students by the historical data between 1992-2020.

First of all, we will check the stationary of the variables. If one of them is not stationary, all the variables have to be differenced. This modeling process is called model in differences.

#Creating the tsibble variables and examining the stationary
library(tidyverse)
library(fable)
library(feasts)
library(tsibble)
library(readxl)

df <- read_excel("kyk.xlsx")

df_ts <- df %>% 
  as_tsibble(index = date)

train <- df_ts %>% filter_index(.~ "2020")
test <- df_ts %>% filter_index("2021"~.)

train %>% 
  pivot_longer(c(capacity, students), 
               names_to = "var", values_to = "value") %>%
  ggplot(aes(x = date, y = value)) +
  geom_line(size=1.5, color="blue") +
  facet_grid(vars(var), scales = "free_y") +
  scale_x_continuous(breaks = seq(1990,2020,5))+
  scale_y_continuous(labels = scales::label_number_si()) +
  labs(title = "The Number of Capacity and Students",y = "",x="") +
  theme_minimal() +
  theme(plot.title=element_text(face = "bold.italic",hjust = 0.5),
        panel.background = element_rect(fill = "#f0f0f0", color = NA))

When we examine the above plots, it seems like the two variables need differencing as well. After we fit the model, we will check the residuals for stationary by drawing the innovations residuals.

#Modeling 
fit <- train %>%
  model(ARIMA(capacity ~ students + pdq(d=1)))

#Checking residuals for stationary
fit %>% 
  gg_tsresiduals()


In the ACF graph, we can clearly see that the spikes are within the thresholds which means the residuals are white noise. In the below code block, we have confirmed that the residuals have no patterns by using the Ljung-Box test in which we saw the p-value greater than 0.05 at %5 significance level.

#Ljung-Box independence test for white noise
augment(fit) %>% 
  features(.innov, ljung_box)

#.model                                  lb_stat
#  <chr>                                     <dbl>
#1 ARIMA(capacity ~ students + pdq(d = 1))    1.32
#  lb_pvalue
#      <dbl>
#1     0.251

We are writing the below command to examine the components of the model.

report(fit)

#Series: capacity 
#Model: LM w/ ARIMA(0,1,1) errors 

#Coefficients:
#         ma1  students  intercept
#      0.8672    -0.022  22964.621
#s.e.  0.1177     0.006   8369.892

The differenced model is shown as: y'_t = -0.022 x'_t + \eta'_t, where \eta_t = 22964.621 + \eta_{t-1} + 0.8672 \epsilon_{t-1} + \epsilon_t is a MA(1) error. The insteresting thing is that the students component has a negative impact on the capacity, according to the model.

Before finding the expected value of the model, we will take a look at the relation between the number of capacity and students over time.

#Comparing students and capacity
ggplot(df_ts,aes(x=date))+
  geom_bar(aes(y=students),stat = "identity",fill="skyblue")+
  geom_line(aes(y=students),size=1.5,color="orange")+
  geom_line(aes(y=(capacity/students)*10^7),size=1.5,color="red")+
  scale_x_continuous(breaks = seq(1991,2021,5))+
  scale_y_continuous(labels = scales::label_number_si(),
                     sec.axis = sec_axis(~./10^7,
                                         labels = scales::percent_format(accuracy = 1),
                                         name = "The rate of capacity per student"))+
  xlab("")+ylab("The number of students")+
  theme_minimal()+
  theme(panel.background = element_rect(fill = "#f0f0f0", color = NA),
        #Main y-axis
        axis.title.y = element_text(color = "skyblue",
                                    #setting positions of axis title
                                    margin = margin(r=0.5,unit = "cm"),
                                    hjust = 1),
        
        
        #Second y-axis
        axis.title.y.right = element_text(color = "red",
                                          #setting positions of second axis title
                                          margin = margin(l=0.5,unit = "cm"),
                                          hjust = 0.01))
  

We have to keep in mind that approximately half of the higher education students study out of their hometown, according to the common opinion before the analyze the above graph. The orange line shows the trend of the number of students.

It is understood that while the number of students increased, the capacity per student decreased until 2014. Although the capacity rate has increased after this year, it is clear that the reason for this is the decrease in the number of students. However, it is still seen well below 50%.

Now that we’ve found our model, we can predict this year’s capacity value and see whether the government falls short of this expected value.

#Forecasting the current year
capacity_fc<- forecast(fit, new_data = test)

#Setting legends for separate columns for the plot
colors <- c("Actual"="red","Predicted"="blue") 

#Plotting predicted and the actual value 
ggplot(train,mapping=aes(x=date,y=capacity))+
  geom_line(size=1.5)+
  geom_point(capacity_fc,mapping=aes(x=date,y=.mean,color="Predicted"))+
  geom_point(test,mapping=aes(x=date,y=capacity,color="Actual"))+
  scale_x_continuous(breaks = seq(1991,2021,5))+
  scale_y_continuous(breaks = seq(200000,750000,100000),
                     labels = scales::label_number_si())+
  scale_color_manual(values = colors,name="Points")+
  theme_minimal() +
  theme(axis.title.y=element_text(vjust = 2),
        panel.background = element_rect(fill = "#f0f0f0", color = NA))

To the above plot, the number provided by the government falls short than the expected value of the model. They seem right to complain about the capacity of dormitories considering the half of the students study out of their hometown.

To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)