Introducing ACEA water level challenge

[This article was first published on rblog – Code X Value, and kindly contributed to R-bloggers.]

About two weeks ago, right around New Year, I was browsing Kaggle just for fun. It reminded me how much fun it actually is to play around with random data, very often with a cool purpose too. One of my New Year's goals is to have a bit more fun with data again, so it was a quick one-two and I have been diving into a challenge for the past two weeks. In this series I will discuss four distinct steps in my project:

  • Challenge goals and initial discussion
  • Data setup for modelling
  • Data modelling
  • Refinements and final insights

Please note: if you want to follow along with the code in this blog series, sign up for the challenge here: https://www.kaggle.com/c/acea-water-prediction/overview and you will receive the dataset for free.

The challenge: ACEA Water Levels

The goal of this challenge is to predict water levels in a collection of different water bodies in Italy. Specifically, we have to build a time series model that accurately predicts tomorrow's water level based on today's data. To briefly note the Kaggle rules: this is an analytics challenge, which means creating a compelling story and notebook is a very important part. My notebook is publicly available on Kaggle here, but I will work through some code excerpts and interesting highlights in these blogs.

For today I want to address my initial findings and explore how I thought through the challenge's initial steps.

Designing my load function

One of the first steps everyone undertakes is (obviously) to load the data. My personal preference is to condense this code into a load function. The main reason this has been so useful to me is that I can create clean and concise notebooks by only loading the necessary data. It also allows me to build code chunks that are independent of each other, by loading my data dynamically in the notebook as needed.

My load function will typically contain the following 3 components:

  • Filepath to data
  • Column based filters
  • Merge functionality

In the ACEA dataset this worked out as follows:

# Directories:
maindir   <- 'C:/Prive/Kaggle/ACEA'
datadir   <- 'Data'
scriptdir <- 'Scripts'
htmldir   <- 'HTML'

# Locate all csv files in the data directory:
files_loc <- list.files(path = paste(maindir, datadir, sep = '/'), recursive = TRUE, full.names = TRUE)
files_loc <- files_loc[grep("\\.csv", files_loc)]

# Load a single csv, tag it with its file name, and optionally keep only selected columns:
load_files <- function(path, filter = NULL){
    output <- read.csv(path, stringsAsFactors = FALSE, check.names = FALSE, fileEncoding = "UTF-8-BOM")
    output$name <- gsub('\\.csv', '', basename(path))
    if(!is.null(filter)){
        return(output[, filter])
    }
    return(output)
}

data_auser <- load_files(files_loc[1])
data_doganella <- load_files(files_loc[2])
data_luco <- load_files(files_loc[3])
data_petrignano <- load_files(files_loc[4])
data_bilancino <- load_files(files_loc[5])
data_arno <- load_files(files_loc[6])
data_amiata <- load_files(files_loc[7])
data_lupa <- load_files(files_loc[8])
data_madonna <- load_files(files_loc[9])

As you can see, there is currently no merge functionality in my load function; I imagine I will add it once I see a use for it later in the project. The crux is that when you are faced with multiple datasets that will logically have to be combined in some fashion later, incorporating the merge into your load function keeps the notebook clean, concise and easy to understand. The function can also load specific column sets, which will prove useful in the next chapter.
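To give an idea of what that could look like, below is a minimal sketch of a merge wrapper around load_files(). The load_merged() name, the by = 'Date' key and the usage example are my own assumptions for illustration, not part of the challenge notebook:

# Hypothetical wrapper: load several csv files and full-join them on a shared key column.
# Assumes every selected file contains the 'by' column (here: 'Date').
load_merged <- function(paths, filter = NULL, by = 'Date'){
    datasets <- lapply(paths, load_files, filter = filter)
    # Full outer join all datasets on the key; columns that appear in several files
    # (such as the added 'name' column) will get .x/.y suffixes from base merge().
    Reduce(function(x, y) merge(x, y, by = by, all = TRUE), datasets)
}

# Illustrative usage:
# data_auser_petrignano <- load_merged(files_loc[c(1, 4)], by = 'Date')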

Temporal elements of ACEA dataset

One of the main discussion points of the ACEA dataset is its temporal dimension, by which I mean the time-based elements in the data. The data contains multiple types of columns, such as rainfall, temperature, hydrometry and groundwater levels. The final challenge is to build a time series model: we have to predict the water level of tomorrow. This means there are some initial assumptions to be challenged or tested:

  • Trends over time
  • Seasonality
  • Data availability over time

One of the cornerstones of our narrative needs to be time-based trends. The data runs back to 1998 for some variables (see below), which means the different explanatory variables may contain long-running trends. I assume, for example, that rainfall is decreasing over time and temperature is increasing, due to global warming effects, but it is important to test these assumptions. When thinking of seasonality, we may furthermore assume that water levels are likely lower in summer and higher in winter or early spring.
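As a quick sanity check on these assumptions, a sketch like the one below can be used to eyeball trend and seasonality in the rainfall data. It reuses load_files(), and it assumes the rainfall columns of the AUSER dataset are prefixed with 'Rainfall_' and that Date parses as dd/mm/yyyy, as elsewhere in the notebook; the yearly and monthly averages are just one simple way of looking at it.

# Sketch: eyeball long-term trend and seasonality in the AUSER rainfall columns.
library(ggplot2)

rain <- load_files(files_loc[1])                                    # AUSER dataset
rain$Date <- as.Date(rain$Date, "%d/%m/%Y")
rain <- rain[!is.na(rain$Date), ]

rain_cols <- grep('^Rainfall_', names(rain), value = TRUE)          # assumed column prefix
rain$mean_rainfall <- rowMeans(rain[, rain_cols, drop = FALSE], na.rm = TRUE)

# Long-term trend: average daily rainfall per year
rain$year <- as.numeric(format(rain$Date, '%Y'))
yearly <- aggregate(mean_rainfall ~ year, data = rain, FUN = mean)

# Seasonality: average daily rainfall per calendar month
rain$month <- as.numeric(format(rain$Date, '%m'))
monthly <- aggregate(mean_rainfall ~ month, data = rain, FUN = mean)

ggplot(yearly, aes(x = year, y = mean_rainfall)) + geom_line() +
    labs(title = 'Average daily rainfall per year (AUSER)', x = 'Year', y = 'Rainfall')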

As far as easy-to-use temporal visualizations go, I designed a plotly of all datasets and the dates available in each, to showcase how far back they run and give us an idea of where we might find different trends. The code chunk below makes use of the load function we designed earlier, making it stand-alone executable.

# Collect the Date and name columns of all 9 datasets:
datalist_date_name <- lapply(1:9, function(x){load_files(files_loc[x], filter = c('Date', 'name'))})
data_date_name <- data.table::rbindlist(datalist_date_name)
data_date_name <- data_date_name[which(nchar(data_date_name$Date) == 10), ]   # keep only complete dd/mm/yyyy dates
data_date_name$Date <- as.Date(data_date_name$Date, "%d/%m/%Y")

# fig() (figure sizing) and insighttheme (ggplot theme) are custom helpers defined elsewhere in the notebook;
# ggplotly() comes from the plotly package.
fig(10, 6)
p <- ggplotly(ggplot(data_date_name, aes(x = Date, y = name)) + geom_point() +
    labs(x = 'Date (years)', y = '', title = 'Available dates by dataset') +
    insighttheme)

htmlwidgets::saveWidget(p, file = paste(maindir, htmldir, 'dates_viz', 'index.html', sep = '/'))

Geospatial elements of ACEA dataset

One of the final things I dived into this week is the geospatial element of the ACEA dataset. We have different aquifers, rainfall and temperature measurements around Italy, so it would be beneficial to explore whether maps can help us develop a story that helps our viewers better understand how the data works and what crucial insight we are working on.

For example, it would be helpful to show our temporal elements on a map, so people can more quickly understand the overall findings across all locations found in the ACEA dataset. I did have to use some elbow grease in this section, as the different locations are not groupable by, for example, municipality or city; some of them are mountains, bridges, etc. Hence there was no existing comprehensive dataset of locations I could use, so I put all 64 locations into Google Maps and wrote down the longitude and latitude by hand.

With 64 locations this was the quick fix; the longer fix is found here: https://developers.google.com/maps/documentation/geocoding/overview#RegionCodes. If you have to work with thousands of cities or a plethora of unknown locations, refer to the Google Geocoding API and programmatically extract the longitude and latitude. Since I am looking into doing more Kaggle and also working more with maps in my regular job, I might have some fun and build a function for that later on.
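For reference, here is a minimal sketch of what programmatic geocoding could look like in R. Note that this uses the tidygeocoder package with the free OpenStreetMap/Nominatim backend rather than the Google Geocoding API linked above, and the location strings are purely illustrative:

# Sketch: geocode location names programmatically instead of looking them up by hand.
# Uses tidygeocoder with the free OSM/Nominatim backend (rate-limited, so fine for dozens
# of locations; for thousands you would want the Google Geocoding API or another paid backend).
library(tidygeocoder)

locations_raw <- data.frame(location = c('Gallicano, Italy', 'Monte Serra, Italy'))   # illustrative names

# Appends lat/long columns looked up via Nominatim:
locations_geo <- geocode(locations_raw, address = location, method = 'osm')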

Below was my first attempt at a map, showcasing the average rainfall surrounding the AUSER aquifer. Please note that this code makes use of a shapefile for Italian municipalities found here: https://hub.arcgis.com/datasets/e68ceb0a193e4e378b29255b62ab75e0_0. This is a basic version of the map, which I will be updating further in my notebook as needed for the final submission.

# readOGR (rgdal), fortify (ggplot2) and read_excel (readxl) are used here;
# files_shp (municipality shapefile) and files_geo (the hand-made coordinates file) are defined earlier in the notebook.
italy_mun <- readOGR(files_shp)
italy_mun <- fortify(italy_mun, region = 'COMUNE')

locations <- read_excel(files_geo)
locations$lat <- as.numeric(locations$lat)
locations$long <- as.numeric(locations$long)

# Average rainfall per location; data_plot (long-format rainfall data for AUSER) is prepared elsewhere in the notebook:
mean_rainfall <- aggregate(rainfall_value ~ location, data = data_plot, FUN = mean)
mean_rainfall$location <- gsub('_', ' ', gsub('Rainfall_', '', mean_rainfall$location))
locations <- merge(locations, mean_rainfall, by.x = 'location', by.y = 'location', all.x = TRUE, all.y = FALSE)
locations <- locations[!is.na(locations$rainfall_value), ]

# Mean lat/long per municipality, then keep only municipalities within the bounding box of the AUSER locations:
municipalities <- aggregate(cbind(long, lat) ~ id, data = italy_mun, FUN = mean)
municipalities <- municipalities[which(max(locations$lat) + .1 >= municipalities$lat &
                                           min(locations$lat) - .1 <= municipalities$lat), ]

municipalities <- municipalities[which(max(locations$long) + .1 >= municipalities$long &
                                           min(locations$long) - .1 <= municipalities$long), ]

italy_mun_plot <- italy_mun[which(italy_mun$id %in% municipalities$id), ]

ggplot() + geom_polygon(data = italy_mun_plot, aes(x = long, y = lat, group = id), fill = "grey", alpha = 0.3, col = 'white') +
    geom_point(data = locations, aes(x = long, y = lat, colour = vartype, size = rainfall_value)) + insighttheme +
    labs(title = "Map of rainfall locations belonging to Auser aquifer in Italy")
(Figure: output of the basic map code)

My takeaways from this blog are simple: look at your data and develop an understanding of the important dimensions, in this case time and location. In other projects these could be different dimensions, such as currency, speed and/or connectivity. Connect your final model and insight back to these basic dimensions of your data to create a compelling narrative that is easy for all your users to follow.

I will be back soon with Part 2!
