Using the USGS dataRetrieval package to analyze continuous water quality data

[This article was first published on Kevin Zolea on Kevin Zolea, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

Being that I work for the Division of Water Monitoring & Standards, I have to analyze a lot of water quality data. A majority of the data is in house from the samples we collect but most times I have to download data from online sources. These online sources usually consist of the same organizations and most have a pretty straight forward way of getting access to the data. However, the process of going to that organization’s website and getting whatever data you need is a bit time consuming. I always look for ways that I can automate a task and not have to do the same things over and over again. It wasn’t until recently that I learned about the dataRetrieval package, which can do exactly that.

What is the dataRetrieval package & how does it work?

The dataRetrieval package is a collection of functions to help retrieve U.S. Geological Survey (USGS) and U.S. Environmental Protection Agency (EPA) water quality and hydrology data from web services. With the dataRetrieval package you can discover, access, retrieve, and parse water data. The data comes from numerous different sources. The image below provides a nice overview of the different sources, data types, metadata, time series type, formats, and output.

Image Credit: USGS

Image Credit: USGS

I’m not going to go into all the different functions and usage of the dataRetrieval package but if you would like to learn more here are some of the sources I found the most useful:

What I will show in this blog post

In this blog post I will discuss the usage of the readNWISuv() function and how to create a nice plot with the ggplot2 package. The readNWISuv() function imports data from the NWIS web service. Specifically, this function retrieves instantaneous water quality data. In order to use this function you must have the following arguments:

readNWISuv(siteNumbers, parameterCd, startDate = "", endDate = "",tz = "UTC")

  • siteNumbers: A character vector of USGS site numbers (or multiple sites). This is usually an 8 digit number. You can use this map to find a site your interested in.
  • parameterCd: Character USGS parameter code.This is usually an 5 digit number. To find a parameter code of interest you can type in parameterCdFile. This allows you to explore the USGS parameter codes.
  • startDate: character starting date for data retrieval in the form YYYY-MM-DD.
  • endDate: character ending date for data retrieval in the form YYYY-MM-DD.
  • tz: character to set timezone attribute of dateTime. Default is “UTC”, and converts the date times to UTC. There are numerous different possible values to use. For example, if you wanted it to be in Eastern Standard Time, you would use "America/New_York"

Install and load dataRetrieval package from cran

install.packages("dataRetrieval") library(dataRetrieval)

Pull data with the readNWISuv() function

For my analysis I am going to pull continuous specific conductance (SC) data for a site of intereset in NJ. With this specific conductance data, I will calculate Total dissolved solids (TDS). I will do this by using an equation from a correlation I made between SC & TDS. Being that this isn’t the focus of the post, I will not go in detail about this. However, in a subsequent post I will explain how I did this.

USGS_continuous_sc_data<-readNWISuv("01408029","00095",tz = "America/New_York")

For simplicity, I am only looking up one site and one type of parameter. You can look up multiple sites and parameters in one pull. Now lets take a look at a preview of the pull we just made.

## Warning: package 'dataRetrieval' was built under R version 3.5.2
##   agency_cd  site_no            dateTime X_00095_00000 X_00095_00000_cd
## 1      USGS 01408029 2007-10-01 01:00:00           246                A
## 2      USGS 01408029 2007-10-01 01:15:00           246                A
## 3      USGS 01408029 2007-10-01 01:30:00           246                A
## 4      USGS 01408029 2007-10-01 01:45:00           246                A
## 5      USGS 01408029 2007-10-01 02:00:00           246                A
## 6      USGS 01408029 2007-10-01 02:15:00           246                A
##              tz_cd
## 1 America/New_York
## 2 America/New_York
## 3 America/New_York
## 4 America/New_York
## 5 America/New_York
## 6 America/New_York

The names of the columns in the dataframe can be described as follows:

  • agency_cd: The NWIS code for the agency reporting the data

  • site_no: The USGS site number

  • dateTime: The date and time of the value converted to UTC

  • X_00095_00000: The values of the parameter we gave to the function.

  • X_00095_00000_cd: The statistic code

  • tz_cd: The time zone code for dateTime

You can clean up the names with the reenameNWISColumns() function if you’d like.

We have the data… now what?

Now that we have retrieved the data, we can now start manipulating it and create a plot. We will create the plot using the ggplot2 package. I mentioned before that I was going to calculate TDS. Just to give some background… the reason I am doing this is because of a project I am working on that deals with roadsalt. I figured I would include it in this post just to show that you have a great deal of options in R. I will discuss my roadsalt research in more detail in later posts!

What is the ggplot2 package?

The ggplot2 package is a system for ‘declaratively’ creating graphics, based on “The Grammar of Graphics”. It is a great way to visualize the data you are analyzing. With ggplot2, you have a lot of flexibility with the amount of customization you can give your plot. In my opinion, I think it is very easy to learn and it produces beautiful high quality plots. To learn more about ggplot2, I recommend The Complete ggplot2 Tutorial.

Full code used to create plot:

library(dataRetrieval)
library(ggplot2)
library(dplyr)
library(plyr)

### Vector of sites with continuous specific conductance data ###
siteNumber<-c("01408029")

### Parameter code for specific conductance ###
parameterCd<-"00095"

### Function that retrieves near real time continuous data for specific sites and parameters ###
USGS_continuous_sc_data<-readNWISuv(siteNumber,parameterCd,tz = "America/New_York")

### Filter dataframe ###
USGS_continuous_sc_data<-USGS_continuous_sc_data%>%
  dplyr::select(site_no,dateTime,X_00095_00000)%>%
  dplyr::rename(Site = site_no,Specific_conductance = X_00095_00000)
  
### Calculate TDS based on continuous Specific Conductance data and eq from correlation plots ###
final_USGS_data_TDS<-USGS_continuous_sc_data%>%
  dplyr::mutate(Calculated_TDS = Specific_conductance * 0.572 +6.19)

### theme for plots ###
graph_theme_T<- theme_linedraw()+
  theme(plot.title=element_text(size=15, face="bold",vjust=0.5,hjust = 0.5),
        plot.subtitle = element_text(size=15, face="bold",vjust=0.5,hjust = 0.5),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        plot.background = element_blank(),
        panel.background = element_blank(),
        plot.margin = unit(c(1.5,2,4,2), "lines"), 
        legend.position = "bottom",
        legend.background = element_blank(),
        legend.text=element_text(size=10, face="bold"))
        
### Make plot of data ###
      
p<-ggplot(final_USGS_data_TDS, aes(x=dateTime,y=Calculated_TDS)) +
  geom_line(aes(color = "USGS Continuous Data"),
    stat = "identity",size=1.3)+
  scale_y_continuous(expand = c(0, 0), limits = c(0, max(final_USGS_data_TDS$Calculated_TDS)))+
  ggtitle("Total Dissolved Solids (TDS) Concentration (mg/L)") +
  labs(subtitle =paste("USGS Site:",final_USGS_data_TDS$Site,sep = ''))+
  xlab("Year") + ylab("TDS Concentration (mg/L)") +
  geom_hline(aes(yintercept = 500,color="Freshwater Aquatic Life Criteria for TDS = 500 mg/L"),size=1.3,alpha=0.6)+
  scale_color_manual("",
                     values = c("USGS Continuous Data"="#037907",
                                "Freshwater Aquatic Life Criteria for TDS = 500 mg/L"="red"))+
  
  graph_theme_T 

p
  

Final Product:

## Warning: package 'ggplot2' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2

What is this plot showing?

This plot is showing the calculated TDS concentration for the selected site from 2007- present day. The red line indicates the Freshwater Aquatic Life Criteria for TDS. In the most simpliest terms, when the graph shows the TDS concentration (green line) going over the red line, TDS is over the standard.

Conclusion

As you can see, the dataRetrieval package is a very useful tool for water quality analysis. There is sooo much data you can obtain with just the ease of writing a few lines of code! I only touched base on 1 function! There are so many different functions you can use to gain access to all different types of water quality data. Definitly look over the resources I included in the beginning if you want to learn more. I know this post was very basic but I hope it has helped you in some way. If you have any questions feel free to reach out to me!

To leave a comment for the author, please follow the link and comment on their blog: Kevin Zolea on Kevin Zolea.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)