Data Science for Water Utilities Using R

Peter Prevos

5 years ago

[This article was first published on The Devil is in the Data, and kindly contributed to R-bloggers].

Data science comes naturally to water utilities because of the engineering competencies required to deliver clean and refreshing water. Many water managers I speak to are interested in a more systematic approach to creating value from data.

My work in this area is gaining popularity. Two weeks ago I was the keynote speaker at an asset data conference in New Zealand. My paper about data science strategy for water utilities is the most downloaded paper this year. This week I am in Vietnam, assisting the local Phú Thọ water company with their data science problems.

In all my talks and publications I emphasise that utilities should collaborate and share code, because we all face the same problems. I am hoping to develop a global data science coalition for water services to achieve this goal.

My book about making water utilities more customer-centric will soon be published, so it is time to start another project. My new book will be about Data Science for Water Utilities Using R. At the moment, this book is no more than a collection of existing articles, code snippets and production work from my job. The cover is finished because it motivates me to keep writing.

This article describes my proposed chapter structure with some example code snippets. The most recent version of this code can be found on my GitHub repository. Feel free to leave a comment at the bottom of this article if you would like to see additional problems discussed, or if you want to participate by sharing code.

Data Science for Water Utilities

The first chapter will provide a strategic overview of data science and how water utilities can use this discipline to create value. This chapter is based on earlier articles and recent presentations on the topic.

Using R

This chapter will make a case for using R by providing just enough information for readers to be able to follow the code in the book. A recurring theme at a data conference in Auckland I spoke at was the problems posed by the high reliance on spreadsheets. This chapter will explain why code is superior and how to use R to achieve this advantage.

Reservoirs

This first practical chapter will discuss how to manage data from reservoirs. The core problem is to find the relationship between depth and volume based on bathymetric survey data. I started toying with bathymetric data from Prettyboy Reservoir in the state of Maryland. The code below downloads and visualises this data.

# RESERVOIRS
library(tidyverse)
library(RColorBrewer)
library(gridExtra)

# Read data
if (!file.exists("Hydroinformatics/prettyboy.csv")) {
    url <- "http://www.mgs.md.gov/ReservoirDataPoints/PrettyBoy1998.dat"
    prettyboy <- read.csv(url, skip = 2, header = FALSE)
    names(prettyboy) <- read.csv(url, nrows = 1, header = FALSE, stringsAsFactors = FALSE)
    write_csv(prettyboy, "Hydroinformatics/prettyboy.csv")
} else prettyboy <- read_csv("Hydroinformatics/prettyboy.csv")
head(prettyboy)

# Remove extremes, duplicates and Anomaly
ext <- c(which(prettyboy$Easting == min(prettyboy$Easting)), 
         which(prettyboy$Easting == max(prettyboy$Easting)),
         which(duplicated(prettyboy)))
prettyboy <- prettyboy[-ext, ]

# Visualise reservoir
bathymetry_colours <- c(rev(brewer.pal(3, "Greens"))[-2:-3], 
                        brewer.pal(9, "Blues")[-1:-3])
ggplot(prettyboy, aes(x = Easting, y = Northing, colour = Depth)) + 
    geom_point(size = .1) + coord_equal() + 
    scale_colour_gradientn(colors = bathymetry_colours)

Bathymetric survey of the Prettyboy reservoir.

In the plot, you can see the lines where the survey boat took soundings. I am working on converting this survey data to a non-convex hull to calculate its volume and to determine the relationship between depth and volume.
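Until the hull approach is working, a much cruder approximation gives a feel for the depth-volume relationship. The sketch below is not the method planned for the book: it bins soundings into square grid cells, averages the depth per cell and sums the remaining water column for a range of drawdowns. The function name depth_volume, the cell size and the synthetic flat-bottomed test basin are all assumptions for illustration. It also assumes that depths are positive and share the length unit of the coordinates, and it ignores cells without soundings, so it would underestimate the volume of a sparsely surveyed reservoir.

```r
# Hypothetical helper: approximate stored volume from scattered soundings
# by averaging depth on a regular grid (not a hull-based method).
depth_volume <- function(soundings, cell = 10, drawdown = 0) {
    # Assign each sounding to a square grid cell with sides of length `cell`
    gx <- floor(soundings$Easting / cell)
    gy <- floor(soundings$Northing / cell)
    # Mean depth for each cell that contains at least one sounding
    cell_depth <- tapply(soundings$Depth, paste(gx, gy), mean)
    # Water column left after lowering the surface by `drawdown`
    remaining <- pmax(cell_depth - drawdown, 0)
    sum(remaining) * cell^2
}

# Synthetic test basin: 100 x 100 units, uniformly 5 units deep
box <- expand.grid(Easting = seq(5, 95, 10), Northing = seq(5, 95, 10))
box$Depth <- 5

# Depth-volume curve for drawdowns of 0 to 5 units
curve <- data.frame(drawdown = 0:5)
curve$volume <- sapply(curve$drawdown,
                       function(d) depth_volume(box, cell = 10, drawdown = d))
curve
```

For a flat-bottomed box the curve is a straight line, which makes the function easy to check before applying it to real soundings.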

Other areas to be covered in this chapter could be hydrology and meteorology, but alas I am not qualified in these subjects. I hope to find somebody who can help me with this part.

Water Quality

The quality of water in tanks and networks is tested using samples. One of the issues in analysing water quality data is the low number of data points due to the cost of laboratory testing. There has been some discussion about how to correctly calculate percentiles and other statistical issues.
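The percentile question is easy to demonstrate in base R, because quantile() implements nine estimation methods through its type argument. The sketch below uses a made-up set of twelve turbidity results (the values are illustrative, not real laboratory data) to show how much the 95th percentile can depend on the method when samples are few. Which method a given regulator prescribes is not something this example assumes.

```r
# Hypothetical turbidity results (NTU) from twelve monthly samples
turbidity <- c(0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.8, 1.2, 2.5, 4.8)

# Calculate the 95th percentile with each of the nine methods in quantile()
p95 <- sapply(1:9, function(t) {
    quantile(turbidity, probs = 0.95, type = t, names = FALSE)
})
names(p95) <- paste0("type", 1:9)
round(p95, 2)
```

With so few samples, some methods simply return the largest observation while others interpolate between the two highest values, so a compliance result can hinge on the choice of method.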

This chapter will also describe how to create a water system index to communicate the performance of a water system to non-experts. The last topic in this chapter discusses analysing taste testing data.

Water system performance index.

Water Balance

We have developed a model to produce water balances based on SCADA data. I am currently generalising this idea by using the igraph package to define water network geometry. Next year I will start experimenting with a predictive model for water consumption that uses data from the Australian Census and historical data to predict future use.
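As a rough illustration of the network idea, and not our production model, the sketch below stores metered flows as edge attributes in a directed igraph graph and compares inflow with outflow at each node. The asset names and daily volumes are invented for the example.

```r
library(igraph)

# Hypothetical metered daily flows (ML) between assets
flows <- data.frame(
    from = c("Reservoir", "Treatment", "Treatment", "Tank"),
    to   = c("Treatment", "Tank", "Zone A", "Zone B"),
    flow = c(10, 6, 3.8, 5.7)
)

# Extra columns in the edge list become edge attributes
network <- graph_from_data_frame(flows, directed = TRUE)

# Sum the flow on incoming and outgoing edges for every node
inflow  <- strength(network, mode = "in",  weights = E(network)$flow)
outflow <- strength(network, mode = "out", weights = E(network)$flow)

balance <- data.frame(node = V(network)$name,
                      inflow = inflow,
                      outflow = outflow,
                      unaccounted = inflow - outflow)

# Only nodes with both inflow and outflow can be balanced
balance[balance$inflow > 0 & balance$outflow > 0, ]
```

Defining the network as a graph means the same balance calculation works for any system, as long as the edge list and the flow data are available.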

SCADA Data

Data from SCADA systems are time series. This chapter will discuss how to model this data, find spikes in the readings and conduct predictive analyses.
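One simple way to find spikes, offered here as a sketch and not necessarily the method the chapter will use, is to compare each reading with a running median and flag readings that deviate by more than a few median absolute deviations. The simulated tank level, the window of 11 readings and the threshold of 5 are all assumptions that would need tuning on real data.

```r
set.seed(42)

# Simulated tank level (%) with a smooth cycle, sensor noise and two spikes
n <- 500
level <- 50 + 5 * sin(2 * pi * seq_len(n) / 144) + rnorm(n, sd = 0.5)
injected <- c(100, 250)
level[injected] <- level[injected] + 20

# Hypothetical helper: flag readings far away from the running median
find_spikes <- function(x, window = 11, threshold = 5) {
    baseline <- runmed(x, k = window)   # robust estimate of the local level
    residual <- x - baseline
    which(abs(residual) > threshold * mad(residual))
}

spikes <- find_spikes(level)
spikes
```

The running median is used instead of a moving average because a single extreme reading barely moves it, so the spike itself does not hide in the baseline.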

Customer Perception

This chapter is based on my dissertation on customer perception. Most water utilities do not extract the full value from their customer surveys. In this chapter, I will show how to analyse latent variables in survey data. The code below loads the cleaned data set of the results of a customer survey I undertook in Australia and the USA. The first ten variables are the Personal Involvement Index. This code does a quick exploratory analysis using a boxplot and visualises a factor analysis that uncovers two latent variables.

# CUSTOMERS
library(tidyverse)
library(gridExtra)
library(psych)

# Read data
customers <- read_csv("Hydroinformatics/customers.csv")

# Exploratory Analysis
p1 <- customers[,1:10] %>% 
    gather %>% 
    ggplot(aes(x = key, y = value)) + 
        geom_boxplot() + 
        xlab("Item") + ylab("Response") + ggtitle("Personal Involvement Index")

# Factor analysis
fap <- fa.parallel(customers[,1:10])
grid.arrange(p1, ncol = 2)
customers[,1:10] %>%
    fa(nfactors = fap$nfact, rotate = "promax") %>% 
    fa.diagram(main = "Factor Analysis")

Customer Complaints

Customer complaints are a gift to the business. Unfortunately, most businesses view complaints punitively. This chapter will explain how to analyse and respond to complaints to improve the level of service to customers.

Customer Contacts

One of the topics in this chapter is how to use Erlang-C modelling to predict staffing levels in contact centres.
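The Erlang C formula gives the probability that a caller has to wait, given the number of agents and the traffic intensity in Erlangs (call rate multiplied by average handling time). The sketch below is a minimal base R implementation with a simple search for the smallest number of agents that meets a service level target. The function names and the example workload of 100 calls per hour, 180 seconds handling time and a target of 80% answered within 20 seconds are assumptions for illustration.

```r
# Probability that a call has to wait (Erlang C)
erlang_c <- function(agents, intensity) {
    if (agents <= intensity) return(1)  # queue grows without bound
    top <- intensity^agents / factorial(agents) * agents / (agents - intensity)
    k <- 0:(agents - 1)
    top / (sum(intensity^k / factorial(k)) + top)
}

# Fraction of calls answered within target_time (same time unit as aht)
service_level <- function(agents, intensity, aht, target_time) {
    1 - erlang_c(agents, intensity) *
        exp(-(agents - intensity) * target_time / aht)
}

# Smallest number of agents that meets the target service level
required_agents <- function(calls_per_hour, aht, target_time, target_level) {
    intensity <- calls_per_hour * aht / 3600  # aht in seconds
    agents <- floor(intensity) + 1
    while (service_level(agents, intensity, aht, target_time) < target_level) {
        agents <- agents + 1
    }
    agents
}

required_agents(calls_per_hour = 100, aht = 180,
                target_time = 20, target_level = 0.8)
```

The model assumes random arrivals, exponentially distributed handling times and callers who never abandon the queue, so the result is a starting point for a roster and not a guarantee.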

Economics

Last but not least, economics is the engine room of any organisation. In the early stages of my career, I specialised in cost estimating, including probabilistic methods. This chapter will include an introduction to Monte Carlo simulation to improve cost estimation reliability.
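A minimal sketch of the Monte Carlo idea: describe each cost item with a three-point estimate, draw many random values from a triangular distribution for each item and look at the distribution of the total. The cost items and their values are invented, the triangular distribution is only one of several possible choices, and the items are assumed to be independent, which real cost items often are not.

```r
set.seed(2017)

# Draw n values from a triangular distribution by inverting its CDF
rtriangle <- function(n, lower, likely, upper) {
    u <- runif(n)
    split <- (likely - lower) / (upper - lower)
    ifelse(u < split,
           lower + sqrt(u * (upper - lower) * (likely - lower)),
           upper - sqrt((1 - u) * (upper - lower) * (upper - likely)))
}

# Hypothetical three-point estimates (thousands of dollars)
items <- data.frame(
    item   = c("Pipework", "Pump station", "Electrical"),
    low    = c(400, 250, 80),
    likely = c(500, 300, 100),
    high   = c(750, 450, 160)
)

# One column of simulated costs per item, one row per simulation
n_sim <- 10000
draws <- mapply(rtriangle,
                lower = items$low, likely = items$likely, upper = items$high,
                MoreArgs = list(n = n_sim))
total <- rowSums(draws)

# Sum of the most likely values versus simulated percentiles of the total
sum(items$likely)
quantile(total, probs = c(0.1, 0.5, 0.9))
```

Because the estimates are skewed towards the high side, the simulated median sits above the sum of the most likely values, which is the usual argument for adding a contingency to a deterministic estimate.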

Data Science for Water Utilities Mind Map

This book is still in its early stages. The mind map below shows the work in progress on the proposed chapters and topics.

Data Science for Water Utilities: The next steps

I started writing bits and pieces of Data Science for Water Utilities using the fabulous bookdown system in RStudio. It will take me about a year to realise this vision, as I need to increase my analytical skills to write about such a broad range of topics. I would love to get some feedback on these two questions:

What is missing in this list? Any practical problems I should include?
Would you like to donate some data and code to include in the book?

Feel free to leave a comment below.
