Wildfires Comparison with ggplot2 dual Y-axis and Forecasting with KNN

Posted on August 23, 2021 by Selcuk Disci in R bloggers | 0 Comments

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In recent days, Turkey has been escalated wildfires. The government and the people were In recent days, Turkey has been escalated wildfires. The government and the people were on the alert and mobilized to put out the fires but, they couldn’t for days. A significant part of people thought that the simultaneous fires had no make sense, and there was probably some kind of terrorist actions beyond these fires. But there had also been wildfires in the other parts of the world at the same time. Hence, I have just decided to build a function to compare the numbers.

The countries that we will compare are Greece, Italy, and Turkey, which are all in the same Mediterranean climate zone and had brutal damage of fires. We will compare the hectares(ha) of damaged areas and the number of fires. The data set has annual observations between the 2009-2021 periods. Of course, the data for this year had to be taken until 20 august based on the EFFIS database; but it doesn’t change the underlying concept of this article.

library(tidyverse)
library(tsfknn)
library(forecast)
  
df <- read_excel("fires_statistics.xlsx")

For comparison purposes, we will use the second y-axis in our plot. We have to use transformation rate to create a second axis because the sec.axis function, which we are going to use for building the second y-axis, can not build an independent y-axis; it creates related to the main y-axis via some mathematical transformations. The transformation rate could be taken as approximate max(y1)/max(y2).

fun <- function(country="Greece"){
  
  
  df %>% 
    .[.$country==country,] %>% 
    ggplot(aes(x=year))+
    geom_bar(aes(y=burnt_areas_ha),stat = "identity",fill="orange")+
    #Bar labels(burnt_areas_ha)
    geom_text(aes(y=burnt_areas_ha,label=burnt_areas_ha),vjust=-0.2,color="orange")+
    #multiplied by transformation rate to tune in with the main y-axis(Burnt Areas) 
    geom_line(aes(y=fires_number*100),size=2,color="lightblue")+
    #Line labels(fires_number)
    geom_text(aes(y=fires_number,label=fires_number),vjust=1,color="lightblue")+
    scale_x_continuous(breaks = seq(2009,2021,1))+
    
    scale_y_continuous(
      
                      #for the main y-axis
                      labels = scales::label_number_si(),
                      breaks = scales::pretty_breaks(n=7),
      
                      #for the second y-axis
                      sec.axis = sec_axis(~./100,#divided by transformation rate, in order to be represented based on the first y-axis
                                          name = "Number of Fires",
                                          breaks = scales::pretty_breaks(7)),
                      
                      )+
    xlab("")+ylab("Burnt Areas(ha)")+
    ggtitle(paste("The comparison of the volume(ha) and numbers of Forest Fires in",
          country))+
    theme_minimal()+
    theme(
      
      #Main y-axis
      axis.title.y = element_text(color = "orange",
                                  #puts distance with the main axis labels
                                  margin = margin(r=0.5,unit = "cm"),
                                  hjust = 1),#sets the main axis title top left
      
      #Second y-axis
      axis.title.y.right = element_text(color = "lightblue",
                                        #puts distance with the second axis labels
                                        margin = margin(l=0.5,unit = "cm"),
                                        hjust = 0.01),#sets the second axis title bottom right
      
      #adjusts the date labels to avoid overlap line numbers
      axis.text.x = element_text(angle = 30,
                                 vjust = 1,
                                 hjust = 1),
      
      panel.grid.minor = element_blank(),#removes the minor grid of panel
      plot.title=element_text(face = "bold.italic",hjust = 0.5)
        
      
    )
  
}


fun("Turkey")

fun("Italy")

When we take a look above plots, the damaged areas have been quite increased compared to the past years for all countries. Probably the underlying reason, that confused Turkish people was the high ratio of the damaged areas to the number of fires. The other two countries also seem to have the same problem, compared to the past years. Undoubtedly, the reason beyond that is global warming. Of course, the insufficient treats of the governments to the fires had also played a part in the issue.

Now let’s forecast the number of fires this year for determining if there is an anomaly in Turkey. Because our data set is too small, it would be convenient to choose the k-nearest neighbor(KNN) algorithm. We will set k to 3 and 4 for optimization because the squared root of the number of observations is approximately 3,6.

Since the PACF order of the pattern are zero, we set the lags parameter as 1 to 5. In order to determine the order of PACF of the model, we first check the stationary of the data.

#Checking the pattern if it is stationary 
df %>% 
  filter(country=="Turkey") %>% 
  ggplot(aes(x=year,y=fires_number))+
  geom_line(size=1)+
  xlab("")+ylab("")+
  theme_minimal()

There seems to be exponential growth in the above graph. Hence, we will take the first difference of the data to stabilize the mean.

#Making the time series stationary and determining the order of ACF and PACF 
df %>% 
  filter(country=="Turkey") %>% 
  select(fires_number) %>% 
  pull() %>% #converts the dataframe column to a vector
  ts(start = 2009,frequency = 1) %>% #converts vector to a time series
  diff() %>% #gets first difference of the time series
  ggtsdisplay()# gets ACF/PACF plots

It seems the mean is stabilized, so we don’t have to take a second difference. To the PACF plot, there is no significant spike beyond the limits; that was why we took the order of PACF as zero.

#Forecasting with KNN  
df %>% 
  filter(country=="Turkey") %>% 
  select(fires_number) %>% 
  pull() %>% #converts the dataframe column to a vector  
  head(-1) %>% #removes the last observation for prediction
  ts(start = 2009,frequency = 1) %>% #converts the vector to a time series  
  knn_forecasting(h=1,lags = 1:5, k=c(3,4)) %>% 
  #.$pred (gets prediction value for 2021, which is 424.575)
  plot()
  #actual value(the last observation)
  points(x = 2021,
         y = last(df$fires_number[df$country=="Turkey"]),
         pch = 16,
         col="blue")
legend("topleft",
    legend = c("Predicted Value","Actual Value"),
    col = c("red","blue"),
    pch = c(16,16)
    
  )

When we analyze the plot, we see that the predicted value for this year is quite higher than the current value, which shows the speculation about the number of fires in Turkey seems to be baseless.

To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

Wildfires Comparison with ggplot2 dual Y-axis and Forecasting with KNN

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)