[This article was first published on R – The Data Science Tribune, and kindly contributed to R-bloggers].

The Flint River in Flint, Michigan, USA, in the late 1970s. By U.S. Army Corps of Engineers, photographer unknown via Wikimedia Commons

### Introduction

As many have heard recently, residents of Flint, Michigan have been rightly outraged by the high levels of toxic chemicals, including lead, in their drinking water. The question arises: how did this occur, and was it a foreseeable incident? The backstory that led up to this incident can be summarized in a few main chapters.
1. Flint had long sourced its water from the Detroit Water and Sewerage Department (DWSD).
2. The city was under financial stress and had an incentive to reduce spending.
3. Flint entered into an agreement with the Karegnondi Water Authority (KWA), whose Lake Huron source was to be completed by the end of 2016.
4. The existing supplier, DWSD, gave 12 months’ notice that its supply contract would end in April 2014.
5. The Flint River was relied on to supply water in the interim.

### Hypothesis

In this analysis we will explore US Geological Survey (USGS) water quality data to analyse the Flint incident, starting at the source (pre-treatment water) in Flint as well as nearby streams in Detroit and near Lake Huron. It is not meant to serve as conclusive evidence of any kind. We will look specifically at chloride concentrations to see whether Flint has very corrosive water to begin with. Before we begin, let’s set the working directory and load the packages needed for this story.
setwd("~/DSTribune/Stories/FlintWaterQuality")
library(ggplot2)
library(dplyr)
library(xtable)
This is really a story of three water sources and three counties. Flint, located in Genesee County, originally sourced its water from the Detroit Water and Sewerage Department, which draws its water from multiple sources including the Detroit River and Lake Huron. The Detroit River is located in Wayne County and the Lake Huron intake in Sanilac County. The KWA plant that was to replace the more expensive DWSD water is still under construction and will source its water from Lake Huron. We will download fresh water data for the counties containing the Flint and Detroit Rivers (Genesee and Wayne) and merge them into one data frame. Our goal is to compare the untreated corrosivity of the two rivers, with the hypothesis that Flint’s water might be more corrosive to begin with.
#Genesee (Flint)
temp <- tempfile()
#wqGen is loaded from the Genesee County download in the same way as wqWayne below
wqGen$County = "Genesee"

#Wayne (Detroit River)
temp <- tempfile()
download.file("http://waterqualitydata.us/Result/search?countrycode=US&statecode=US%3A26&countycode=US%3A26%3A163&sampleMedia=Water&characteristicType=Inorganics%2C+Major%2C+Non-metals&characteristicName=Chloride&mimeType=csv&zip=yes&sorted=no", temp)
wqWayne <- read.csv(unz(temp, "result.csv"))
wqWayne$County = "Wayne"

#Merge the two county water measurements
wqDf <- rbind(wqGen, wqWayne)

#Save an offline version of the merged county water data
write.csv(wqDf, file ="MI3CountyCountyWaterData.csv")
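The download step above repeats the same long query URL for each county, changing only the county code. A minimal sketch of how that could be factored out is shown here. The helper names `wqp_chloride_url` and `read_county_chloride` are hypothetical and not part of the original analysis, the county code `049` for Genesee is an assumption to verify before use, and the endpoint itself may have changed since this post was written.

```r
# Hypothetical helper (not in the original post): build the query URL for
# chloride results in one county, using the same parameters as the Wayne query.
wqp_chloride_url <- function(county_fips, state_fips = "26") {
  paste0(
    "http://waterqualitydata.us/Result/search?countrycode=US",
    "&statecode=US%3A", state_fips,
    "&countycode=US%3A", state_fips, "%3A", county_fips,
    "&sampleMedia=Water",
    "&characteristicType=Inorganics%2C+Major%2C+Non-metals",
    "&characteristicName=Chloride",
    "&mimeType=csv&zip=yes&sorted=no"
  )
}

# Hypothetical helper: download one county's zipped results, read them,
# and tag every row with the county name.
read_county_chloride <- function(county_fips, county_name) {
  temp <- tempfile()
  on.exit(unlink(temp))  # remove the temporary zip when the function exits
  download.file(wqp_chloride_url(county_fips), temp, mode = "wb")
  df <- read.csv(unz(temp, "result.csv"))
  df$County <- county_name
  df
}

# Building the URLs does not touch the network.
wayne_url <- wqp_chloride_url("163")
# Assumption: 049 is the county code for Genesee; check it before relying on it.
genesee_url <- wqp_chloride_url("049")
```

With helpers like these, the merge could be written as `rbind(read_county_chloride("049", "Genesee"), read_county_chloride("163", "Wayne"))`, again assuming the county codes are right.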
We filtered our data to keep only high-quality measurements taken at the surface. We specifically collected data on dissolved chloride concentrations because chloride ions are a key contributor to the corrosion of Flint’s pipes, which leads to the leaching of metals such as lead. In the second half of this story we will also cover how the addition of chlorine escalated chloride concentrations, but for now we will focus on pre-treatment water quality.
#Keep dissolved-fraction surface water results with an accepted, final, or historical status
wqDf <- filter(wqDf,
               ActivityMediaSubdivisionName == "Surface Water",
               ResultSampleFractionText == "Dissolved",
               ResultStatusIdentifier %in% c("Accepted", "Final", "Historical"))
wqDf$MonitoringLocationIdentifier <- as.character(wqDf$MonitoringLocationIdentifier)
wqDf$ActivityStartDate <- as.POSIXct(wqDf$ActivityStartDate)
#Drop rows with no measured value
wqDf <- wqDf %>%
  filter(!is.na(ResultMeasureValue))
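One thing that may be worth checking before summarising: depending on the R version and on whether the file contains any non-numeric entries, `read.csv` can return `ResultMeasureValue` as a factor or character column rather than a numeric one. The sketch below uses synthetic values (not taken from the downloaded data) to show why a factor should go through `as.character` before `as.numeric`. Whether this applies to the actual download is an assumption.

```r
# Synthetic values standing in for a ResultMeasureValue column read as a factor;
# "ND" is an invented non-numeric entry used only for illustration.
raw <- factor(c("21.0", "8.5", "185", "ND"))

# Converting a factor directly returns its internal level codes,
# not the concentrations.
codes <- as.numeric(raw)

# Going through character first recovers the numbers; entries that are
# not numeric become NA (the conversion warning is suppressed here).
values <- suppressWarnings(as.numeric(as.character(raw)))

# Drop the missing values before computing any statistics.
clean <- values[!is.na(values)]
print(clean)
```

If the column is already numeric, the conversion is a no-op and the rest of the analysis is unaffected.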
We now would like to see if there is a significant difference in pre-treated chloride concentrations between the two counties.
#Summary statistics of chloride concentration for each county
percentConc <- wqDf %>%
  group_by(County) %>%
  summarise(Avg = mean(ResultMeasureValue, na.rm = TRUE),
            Max = max(ResultMeasureValue, na.rm = TRUE),
            Median = median(ResultMeasureValue, na.rm = TRUE),
            LatestSample = max(ActivityStartDate, na.rm = TRUE),
            totalSamples = n(),
            stdError = sd(ResultMeasureValue, na.rm = TRUE)) #sd() returns the standard deviation

#Error bar limits: the mean plus or minus one standard deviation
percentConc$min <- percentConc$Avg - percentConc$stdError
percentConc$max <- percentConc$Avg + percentConc$stdError

plot1 <- ggplot(percentConc, aes(x=County))
plot1 <- plot1 + geom_errorbar(aes(ymin=min,ymax=max),data=percentConc,width = 0.5)
plot1 <- plot1 + geom_boxplot(aes(y=Avg))
plot1 <- plot1 + ggtitle("Surface Water Chloride Concentrations\n in Genesee and Wayne County MI (USGS)") + ylab("Average Chloride Concentration")
plot1
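A note on the error bars: the `stdError` column is computed with `sd()`, which is the standard deviation of the samples, so the bars show the spread of individual measurements rather than the uncertainty of each county mean. As a rough sketch, the standard error of the mean can be obtained by dividing by the square root of the sample count. The numbers below are the averages, standard deviations, and sample counts reported in this post's county summary; the variable names are illustrative.

```r
# Values as reported in the post's county summary
county_avg <- c(Genesee = 41.31, Wayne = 25.41)
county_sd  <- c(Genesee = 42.88, Wayne = 46.62)
county_n   <- c(Genesee = 229,   Wayne = 383)

# Standard error of the mean = standard deviation / sqrt(sample count)
county_se <- county_sd / sqrt(county_n)

# Band of one standard error on either side of each county mean
se_band <- rbind(lower = county_avg - county_se,
                 upper = county_avg + county_se)

print(round(county_se, 2))
print(round(se_band, 2))
```

With these inputs the standard errors come out at roughly 2.8 and 2.4 mg/l, so bands of one standard error around the two means do not overlap, even though the standard-deviation bars do.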
At first glance it appears that Genesee County has a higher overall concentration of chloride in its surface water. Let’s see whether this is statistically significant, as there is overlap in the error bars.
Gen <- filter(wqDf, County == "Genesee")
Way <- filter(wqDf, County == "Wayne")
Gen_Way <- t.test(Gen$ResultMeasureValue, Way$ResultMeasureValue, alternative=c("greater"))
Gen_Way$p.value
## [1] 1.04371e-05
The p-value for this t-test shows that Genesee County has a significantly greater chloride concentration in its surface water compared to Wayne County. Remember that Wayne County contains the Detroit River, one of the sources of the water Flint was receiving before the switch.

Fig 1: Summary of all water readings in both counties

| | County | Avg | Max | Median | LatestSample | totalSamples | stdError | min | max |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Genesee | 41.31 | 185.00 | 21.00 | 1446015600.00 | 229 | 42.88 | -1.57 | 84.19 |
| 2 | Wayne | 25.41 | 330.00 | 8.50 | 1019113200.00 | 383 | 46.62 | -21.22 | 72.03 |

Surface water samples taken in Genesee County show a multi-decade historical average of 41.3 mg/l, about 1.6 times the 25.4 mg/l average in Wayne County. At this point I got a funny feeling: why not check the median? It should be relatively close to the mean.
tapply(wqDf$ResultMeasureValue, wqDf$County, median)
## Genesee   Wayne
##    21.0     8.5
It turns out the median was nowhere near the mean. The median shows Genesee County with a chloride concentration of 21.0 mg/l and Wayne County with 8.5 mg/l, so Genesee County has roughly 2.5 times the pre-treatment (initial) chloride concentration of Wayne County. The discrepancy between the median and the mean could be due to outliers or a non-normal distribution. If my experience has taught me anything, it is that in these circumstances I need to look at the full distribution to see what is happening.
ggplot(wqDf, aes(x = ResultMeasureValue, fill = County)) +
  geom_density(alpha = 0.3) +
  ggtitle("Density of Chloride Concentrations\n Genesee and Wayne County Surface Water") +
  xlab("[Chloride] (mg/l)") +
  ylab("Density")
That distribution sure doesn’t look normal. It appears Wayne County has a lot of samples with low concentrations of chloride. It could be that one sampling site has so many samples that it is skewing the mean and median.
Perhaps what we should be doing is computing a summary for each sample site and then looking at the distribution of the per-site values.
percentConc <- wqDf %>%
  group_by(MonitoringLocationIdentifier, County) %>%
  summarise(Avg = mean(ResultMeasureValue, na.rm = TRUE),
            Max = max(ResultMeasureValue, na.rm = TRUE),
            Median = median(ResultMeasureValue, na.rm = TRUE),
            LatestSample = max(ActivityStartDate, na.rm = TRUE),
            totalSamples = n(),
            stdError = sd(ResultMeasureValue, na.rm = TRUE))
tapply(percentConc$Median, percentConc$County, mean)
##  Genesee    Wayne
##  33.2125 115.5000
At first it appeared as though Genesee County had significantly higher concentrations of chloride than Wayne County. However, once we computed the median concentration at each site and averaged those medians by county, Wayne County appears to have roughly 3.5 times as much chloride in its surface water. To put this to rest, we will apply one more filter, removing sites with fewer than 3 samples to exclude possible outlier measurements at rarely sampled sites. Remember that running even one water sample requires multiple labs, USGS employees sampling at a site, and tens of thousands of dollars. So 3 samples is a big deal in this world (I should know: I sampled and analyzed water for the US Geological Survey for 4 years).
HighSampleSizePercentConc <- filter(percentConc, totalSamples >= 3)
#As written, this call summarises percentConc rather than HighSampleSizePercentConc, so the output matches the unfiltered result above
tapply(percentConc$Median, percentConc$County, mean)
##  Genesee    Wayne
##  33.2125 115.5000
Comparing chloride concentrations across USGS measurement sites near Flint, Detroit, and Lake Huron, we observe that initial chloride concentrations in surface water near Flint are lower than those in its neighboring counties.
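To see why pooling every sample and averaging per-site medians can point in opposite directions, here is a small sketch on synthetic data. The sites and values are invented purely for illustration and are not USGS measurements. One site is sampled often and reads low, which pulls the pooled median down, while the per-site view gives each site equal weight. The last step applies a minimum-sample filter to the per-site table and summarises the filtered result.

```r
# Synthetic data, invented for illustration only:
# site A is sampled often and reads low, sites B and C read high.
toy <- data.frame(
  site  = c(rep("A", 8), rep("B", 3), "C"),
  value = c(5, 6, 7, 8, 8, 9, 10, 11,
            100, 120, 140,
            300)
)

# Pooled view: every sample counts equally, so site A dominates.
pooled_median <- median(toy$value)

# Per-site view: one median and one sample count per site.
site_median <- tapply(toy$value, toy$site, median)
site_n      <- tapply(toy$value, toy$site, length)
mean_of_site_medians <- mean(site_median)

# Keep only sites with at least 3 samples, then summarise the kept sites.
keep <- names(site_n)[site_n >= 3]
mean_of_kept_medians <- mean(site_median[keep])

print(pooled_median)
print(mean_of_site_medians)
print(mean_of_kept_medians)
```

In this toy case the pooled median stays low because most samples come from site A, while the mean of site medians is much higher, and dropping the single-sample site changes it again. Which weighting is appropriate depends on the question being asked.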

### Conclusion

We have finally arrived closer to the truth. In general, the rivers and lakes in Genesee County appear to have a much lower chloride concentration than those in Wayne County. So our original hypothesis, that initial chloride concentrations would be high enough to be corrosive to begin with, is not supported by this data. It should be noted that Detroit also sourced much of its water from Lake Huron, and we show some water samples near the port in the interactive map above. During this analysis we also looked at bacterial measurements using the same USGS source, but found that Flint did not have enough microbiological samples to warrant a similar analysis of bacteria concentrations. The lack of high initial corrosivity in the rivers relative to nearby counties, as seen in the interactive map, suggests that initial chloride concentrations may not have been the main contributor to corrosivity; instead, the addition of chlorine to remove bacteria may have been the main contributor. Many of the methods we used have been inspired over the years by similar stories over at R-bloggers.