Site icon R-bloggers

Everybody talks about the weather

[This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Everybody talks about the weather, but nobody does anything about it. – Charles Dudley Warner

As a scientist who models plant diseases, I use a lot of weather data. Often this data is not available for areas of interest. Previously, I worked with the International Rice Research Institute (IRRI) and often the countries I was working with did not have weather data available or I was working on a large area covering several countries and needed a single source of data to work from. Other scientists who work with crop biophysical models to model crop yields also have similar weather data needs and may experience similar issues with data availability.

The United States National Oceanic and Atmospheric Administration's (NOAA) National Centers for Environmental Information (NCEI 1) provides several sources of data, one of which is the Global Surface Summary of the Day (GSOD) data. The data are attractive because they are daily time-step and ground or buoy station based, freely available, and the data span several years, 1929 to current, with data from 1973 to current being the most complete. For more information on the GSOD data, please see the description of the data provided by NCEI, http://www7.ncdc.noaa.gov/CDO/GSOD_DESC.txt.

While the GSOD data are a valuable source of weather data with global coverage. The data files can be cumbersome and difficult to work with for research purposes. So instead of just talking about it, I did something about it. The GSODR package aims to make it easy to find, transfer and format the data you need for use in analysis. The package provides four main functions for facilitating this:

When reformatting data either with get_GSOD() or reformat_GSOD(), all units are converted to International System of Units (SI), e.g., inches to millimetres and Fahrenheit to Celsius. File output can be saved as a Comma Separated Value (CSV) file or in a spatial GeoPackage (GPKG) file, implemented by most major GIS software products, summarising each year by station, which also includes vapour pressure (ea and es) and relative humidity variables calculated from existing data in GSOD by the package.

Documentation is provided for GSODR using pkgdown, which you can find on the GSODR website, http://ropensci.github.io/GSODR/ or of course in the usual R vignettes and help files.

How Can You Use GSODR?

Recently a colleague contacted me asking if I knew of or had weather data for a time period covering 1960 to 2015 for selected provinces in the Philippines where the International Rice Research Institute (IRRI) has conducted surveys. The IRRI survey loop in Central Luzon is a study that aims to monitor the changes in rice farming in the major rice producing area of the Philippines, the Central Luzon region, which is called the "rice bowl of the Philippines". In this survey data have been collected several times since the 1960s, see the Farm Household Survey Database webpage for the data collected data. Using the GSODR package I was able to retrieve weather data from stations within a 100km radius of the centre of the provinces included in the survey and provide my colleague with a CSV file of weather data from ground-based weather stations.

As an example of how we can use GSODR, I will demonstrate the following:

Retrieve PHL Provincial Data and Select Loop Provinces

As a first step, we'll use the raster package to retrieve data from GADM that will provide the provincial spatial data for the survey area. We will then use this to find the centroid of the area of interest, which will be used to find the nearest stations. Using raster::getData() fetch level 0 (national) and 1 (provincial) data for the Philippines.

install.packages("raster")
library(raster)
RP0 <- raster::getData(country = "Philippines", level = 0)
RP1 <- raster::getData(country = "Philippines", level = 1)

Now we will select the provinces involved in the survey and make a new object called Central_Luzon from the provincial level data, RP1.

Central_Luzon <- RP1[RP1@data$NAME_1 == "Pampanga" |
                     RP1@data$NAME_1 == "Tarlac" |
                     RP1@data$NAME_1 == "Pangasinan" |
                     RP1@data$NAME_1 == "La Union" |
                     RP1@data$NAME_1 == "Nueva Ecija" |
                     RP1@data$NAME_1 == "Bulacan", ]

With a little help from an old acquaintance, Arnold Salvacion and Scott Chamberlain we can create a map inset showing where the Central Luzon Loop survey takes place.

First we'll use gSimplify() from rgeos to simplify the map of the Philippines to make the map generation in the next few steps quicker.

RP0 <- rgeos::gSimplify(RP0, tol = 0.05)

library(ggplot2)
library(grid)
library(gridExtra)

CL_names <- data.frame(coordinates(Central_Luzon)) # get center coordinates of provinces in Central Luzon
CL_names$label <- Central_Luzon@data$NAME_1

# Main Map
p1 <- ggplot() +
  geom_polygon(data = Central_Luzon, aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_text(data = CL_names, aes(x = X1, y = X2, label = label),
            size = 3, colour = "grey20") +
  theme(axis.text.y = element_text(angle = 90, hjust = 0.5)) +
  ggtitle("Central Luzon Provinces Surveyed") +
  theme_bw() +
  xlab("Longitude") +
  ylab("Latitude") +
  coord_map()

## Regions defined for each Polygons

# Inset
p2 <- ggplot() +
  geom_polygon(data = RP0, aes(long, lat, group = group),
               colour = "grey10",
               fill = "#fff7bc") +
  coord_equal() +
  theme_bw() +
  labs(x = NULL, y = NULL) +
  geom_rect(aes(xmin = extent(Central_Luzon)[1],
                xmax = extent(Central_Luzon)[2],
                ymin = extent(Central_Luzon)[3],
                ymax = extent(Central_Luzon)[4]),
            alpha = 0,
            colour = "red",
            size = 1,
            linetype = 1) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        plot.margin = unit(c(0, 0, 0 ,0), "mm"))

grid.newpage()
v1 <- viewport(width = 1, height = 1, x = 0.5, y = 0.5) # plot area for the main map
v2 <- viewport(width = 0.28, height = 0.28, x = 0.67, y = 0.78) # plot area for the inset map
print(p1, vp = v1)
print(p2, vp = v2)

Dissolve Polygons and Find Centroid of Loop Survey Area

Now that we have the provincial data that we want, we will dissolve the polygons that represent the individual provinces in Central Luzon and find the centroid of all of them, which we will use as the central point for querying stations from the GSOD data set.

Central_Luzon <- rgeos::gUnaryUnion(Central_Luzon)
centroid <- rgeos::gCentroid(Central_Luzon)

ggplot() +
  geom_polygon(data = Central_Luzon, aes(x = long, y = lat, group = group),
               colour = "grey10", fill = "#fff7bc") +
  geom_point(aes(x = centroid@coords[1], y = centroid@coords[2])) +
  theme(axis.text.y = element_text(angle = 90, hjust = 0.5)) +
  ggtitle("Centre of Provinces Surveyed") +
  theme_bw() +
  xlab("Longitude") +
  ylab("Latitude") +
  coord_map()

Next, make a list of stations that are within this area. First we need to fetch the station medadata, "isd-history.csv" from the FTP server and then check which stations fall within a 100km radius of the centre of the provinces we're interested in.

library(GSODR)
library(readr)
# Fetch station list from NCEI
station_<- read_csv(
  "ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv",
  col_types = "ccccccddddd",
  col_names = c("USAF", "WBAN", "STN_NAME", "CTRY", "STATE", "CALL", "LAT",
                "LON", "ELEV_M", "BEGIN", "END"), skip = 1)
station_meta$STNID <- as.character(paste(station_meta$USAF,
                                         station_meta$WBAN,
                                         sep = "-"))

loop_stations <- nearest_stations(LAT = centroid@coords[2],
                                  LON = centroid@coords[1],
                                  distance = 100)

loop_stations <- station_meta[station_meta$STNID %in% loop_stations, ]

loop_stations <- loop_stations[loop_stations$BEGIN <= 19591231 &
                               loop_stations$END >= 20151231, ]

print(loop_stations[, c(1:2, 3, 7:12)])

## # A tibble: 9 × 9
##     USAF  WBAN          STN_NAME    LAT     LON ELEV_M    BEGIN      END
##    <chr> <chr>             <chr>  <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
## 1 983240 99999  IBA/LUZON ISLAND 15.333 119.967    5.0 19490104 20170315
## 2 983250 99999           DAGUPAN 16.083 120.350    2.0 19450119 20170315
## 3 983270 99999        CLARK INTL 15.186 120.560  147.5 19450214 20170315
## 4 983280 99999            BAGUIO 16.375 120.620 1295.7 19490101 20170315
## 5 984260 41231 SUBIC BAY WEATHER 14.800 120.267   19.0 19450209 20170315
## 6 984290 99999 NINOY AQUINO INTL 14.509 121.020   22.9 19450331 20170315
## 7 984300 99999    SCIENCE GARDEN 14.650 121.050   46.0 19450228 20170315
## 8 984320 99999          AMBULONG 14.083 121.050   11.0 19490205 20170315
## 9 984330 99999             TANAY 14.583 121.367  651.0 19490101 20170315
## # ... with 1 more variables: STNID <chr>

These are all of the stations that are availble within 100km of the centroid of this area and the years for which data are available.

Using get_GSOD() to Fetch the Requested Station Files

This example shows how you could construct a query using the get_GSOD() function. Be aware that it may result in incomplete data and error from the server if it stops responding. We've done our best to make GSODR handle these errors, but if it does this, see the following option for using the reformat_GSOD() function.

PHL <- get_GSOD(station =
                  eval(parse(text = loop_stations[, 12])), years = 1960:2015)

##
## Checking requested station file for availability on server.

##
## Downloading individual station files.

## Starting data file processing

Another Option, Using reformat_GSOD()

GSODR provides a function for dealing with local files that have been transferred from the server already as well, reformat_GSOD(). If the previous example with get_GSOD() does not work, this is a good alternative that takes a bit more intervention but gives the same results.

Using your FTP client (e.g., FileZilla) log into the NCEI FTP server, ftp.ncdc.noaa.gov and navigate to /pub/data/gsod/. Manually downloading the files for each station listed above from 1960 to 2015 is possible, but tedious. An easier solution is to simply download the annual files found in each yearly directory, "gsod-YYYY.tar" and untar them locally and then use R to list the available files and select only the files for the stations of interest. Lastly, write the data to disk as a CSV file for saving and later use.

years <- 1960:2015

loop_stations <- eval(parse(text = loop_stations[, 12]))

# create file list
loop_stations <- do.call(
  paste0, c(expand.grid(loop_stations, "-", years, ".op.gz"))
  )

local_files <- list.files(path = "./GSOD", full.names = TRUE, recursive = TRUE)
local_files <- local_files[basename(local_files) %in% loop_stations]

loop_data <- reformat_GSOD(file_list = local_files)

readr::write_csv(loop_data, file = "Loop_Survey_Weather_1960-2015")

The Data Content and Format

The final data returned by either of these methods will be data that include the following elements for the years of 1960-2015

Here's what the data look like.

##    WBAN        STNID         STN_NAME CTRY STATE CALL    LAT     LON
## 1 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 2 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 3 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 4 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 5 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
## 6 99999 983240-99999 IBA/LUZON ISLAND   RP  <NA> RPUI 15.333 119.967
##   ELEV_M ELEV_M_SRTM_90m    BEGIN      END YEARMODA YEAR MONTH DAY YDAY
## 1      5               9 19490104 20170315 19730101 1973    01  01    1
## 2      5               9 19490104 20170315 19730102 1973    01  02    2
## 3      5               9 19490104 20170315 19730103 1973    01  03    3
## 4      5               9 19490104 20170315 19730104 1973    01  04    4
## 5      5               9 19490104 20170315 19730105 1973    01  05    5
## 6      5               9 19490104 20170315 19730106 1973    01  06    6
##   TEMP TEMP_CNT DEWP DEWP_CNT    SLP SLP_CNT STP STP_CNT VISIB VISIB_CNT
## 1 26.8        7 23.8        7 1014.1       7  NA       0  34.3         7
## 2 26.6        5 23.2        5 1013.8       5  NA       0  34.0         5
## 3 25.8        6 23.2        6 1013.5       6  NA       0  34.9         6
## 4 28.4        5 25.0        5 1014.5       5  NA       0  38.0         5
## 5 25.8        4 23.5        4 1014.8       4  NA       0    NA         0
## 6 25.0        8 22.9        8 1014.8       8  NA       0  35.6         8
##   WDSP WDSP_CNT MXSPD GUST MAX MAX_FLAG MIN MIN_FLAG PRCP PRCP_FLAG SNDP
## 1  0.4        7   3.0   NA  31        *  24        *    0         I   NA
## 2  0.9        5   4.1   NA  33        *  19        *    0         I   NA
## 3  0.8        6   4.1   NA  30        *  21        *    0         I   NA
## 4  1.0        5   5.1   NA  32        *  23        *    0         I   NA
## 5  0.2        4   1.5   NA  30        *  22        *    0         I   NA
## 6  0.4        8   2.5   NA  31        *  19        *    0         I   NA
##   I_FOG I_RAIN_DRIZZLE I_SNOW_ICE I_HAIL I_THUNDER I_TORNADO_FUNNEL  EA
## 1     0              0          0      0         0                0 2.9
## 2     0              0          0      0         0                0 2.8
## 3     0              0          0      0         0                0 2.8
## 4     0              0          0      0         0                0 3.2
## 5     0              0          0      0         0                0 2.9
## 6     0              0          0      0         0                0 2.8
##    ES   RH
## 1 3.5 82.9
## 2 3.5 80.0
## 3 3.3 84.8
## 4 3.9 82.1
## 5 3.3 87.9
## 6 3.2 87.5

Using the GSODR package and R I was able to easily retrieve and provide weather data for the years requested that cover the area of interest for this survey and create a CSV file of the data for use with other software for the analysis.

Using GSOD Data With Climate Data From GSODRdata

If you want to use climate and ecological data with weather data, we've also provided a supplementary data package to go with GSODR, GSODRdata, which provides climate data from five sources in six data frames:

Due to the size of the package, >9Mb, it is only available from GitHub. However, these data frames provide climate and ecological data that corresponds to the GSOD station locations, making it easy to find and work with weather and climate data at the same time. If you're interested you can find some further examples in the GSODR vignettes.

Conclusions

The GSOD data have a wide range of applications and GSODR makes this data more accessible to scientists that need a global weather data set. Using GSODR means that you can efficiently query for a set of years, for specific stations or areas within a given radius. The GSOD data are not perfect, there are many gaps prior to 1973, but in the more recent years the data became more reliable and more stations are being added.

ggplot(station_meta, aes(x = END)) +
  geom_histogram() +
  xlab("End Date (YYYYMMDD)") +
  ylab("Number of Stations") +
  ggtitle("Count of stations' end date") +
  theme_bw()

How could you use GSODR?

Let us know how you use GSODR in your work. If you find an issue, please file an issue and we'll work to get it corrected as quickly as possible. Also, if you think that you have an idea that might make GSODR better let us know that too.

Acknowledgments

We're grateful to Jeff Hanson and Dillon Niederhut who took the time to review GSODR as a part of the rOpenSci onboarding process and the related paper published in the Journal of Open Source Software (Sparks et al., 2017). Their suggestions greatly improved this package. Also, thanks to Scott Chamberlain for his editorial comments on this blog post, including spelling corrections to his name.

References

Adam H Sparks, Tomislav Hengl and Andrew Nelson (2017). GSODR: Global Summary Daily Weather Data in R. The Journal of Open Source Software, 2(10). DOI: 10.21105/joss.00177. URL: https://doi.org/10.21105/joss.00177

Notes


  1. Formerly the National Climatic Data Center (NCDC) 

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.