Parse NOAA Integrated Surface Data Files

[This article was first published on rOpenSci Blog - R, and kindly contributed to R-bloggers.]

A new package isdparser is
on CRAN. isdparser was in part liberated from rnoaa,
then improved. We'll use isdparser in rnoaa soon.

isdparser does not download files for you from NOAA's FTP servers. The
package focuses on parsing the files, which are variable-length ASCII strings
stored line by line. Each line has some mandatory data and any amount
of optional data.
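
To make that layout concrete, here is a minimal base R sketch of slicing a few fixed-position fields out of the start of one line. The record is synthetic, and the character positions are my reading of NOAA's ISD format document, so treat them as assumptions to verify. isdparser handles all of this, plus the optional section, for you.

```r
# Synthetic record: only the first 27 characters of the mandatory section
# are filled in; a real line is longer and carries many more fields.
line <- paste0(
  "0054",      # chars 1-4:   number of characters in the variable section
  "024130",    # chars 5-10:  USAF station identifier
  "99999",     # chars 11-15: WBAN station identifier
  "20160101",  # chars 16-23: observation date (YYYYMMDD)
  "0000"       # chars 24-27: observation time (HHMM)
)

# Assumed start/end positions for each field (hypothetical lookup table)
fields <- list(
  total_chars  = c(1, 4),
  usaf_station = c(5, 10),
  wban_station = c(11, 15),
  date         = c(16, 23),
  time         = c(24, 27)
)

# Slice each field out of the string by position
parsed <- lapply(fields, function(pos) substring(line, pos[1], pos[2]))

# Convert the pieces that are not plain identifiers
parsed$total_chars <- as.integer(parsed$total_chars)
parsed$date <- as.Date(parsed$date, format = "%Y%m%d")

str(parsed)
```

Station identifiers stay as character strings so that leading zeros, as in 024130, are not lost.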

The data is great: it includes, for example, wind speed and direction, temperature,
cloud data, sea level pressure, and more. It covers approximately 35,000
stations worldwide, though the best coverage is in North America, Europe, and Australia.
Data go all the way back to 1901 and are updated daily.

However, the data is not fun to parse,
which warrants a package to deal with the parsing.

Installation

install.packages("isdparser")

If binaries aren't available, try from source:
install.packages("isdparser", type = "source") or from GitHub:
devtools::install_github("ropenscilabs/isdparser")

library(isdparser)
library(dplyr)

Parse individual lines

If you want to parse individual lines, use isd_parse_line().

First, let's get an ISD file. A few come with the package:

path <- system.file('extdata/024130-99999-2016.gz', package = "isdparser")

Read in the file

lns <- readLines(path, encoding = "latin1")

Parse a line

isd_parse_line(lns[1])
#> # A tibble: 1 × 42
#>   total_chars usaf_station wban_station       date  time date_flag
#>                                    
#> 1          54       024130        99999 2016-01-01  0000         4
#> # ... with 36 more variables: latitude , longitude ,
#> #   type_code , elevation , call_letter , quality ,
#> #   wind_direction , wind_direction_quality , wind_code ,
#> #   wind_speed , wind_speed_quality , ceiling_height ,
#> #   ceiling_height_quality , ceiling_height_determination ,
#> #   ceiling_height_cavok , visibility_distance ,
#> #   visibility_distance_quality , visibility_code ,
#> #   visibility_code_quality , temperature ,
#> #   temperature_quality , temperature_dewpoint ,
#> #   temperature_dewpoint_quality , air_pressure ,
#> #   air_pressure_quality ,
#> #   AW1_present_weather_observation_identifier ,
#> #   AW1_automated_atmospheric_condition_code ,
#> #   AW1_quality_automated_atmospheric_condition_code ,
#> #   N03_original_observation , N03_original_value_text ,
#> #   N03_units_code , N03_parameter_code , REM_remarks ,
#> #   REM_identifier , REM_length_quantity , REM_comment 

By default you get a tibble back, but you can ask for a list in return instead.
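
A sketch of the list form follows. It assumes the switch is the as_data_frame argument of isd_parse_line(), which is how I understand the package documentation; check ?isd_parse_line if the call errors.

```r
library(isdparser)

path <- system.file("extdata/024130-99999-2016.gz", package = "isdparser")
lns <- readLines(path, encoding = "latin1")

# Assumption: as_data_frame = FALSE returns a named list instead of a tibble
rec <- isd_parse_line(lns[1], as_data_frame = FALSE)

# Fields are then reachable by name, as with any R list
rec$usaf_station
rec$date
```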

Parsing by line allows the user to decide how to apply parsing across lines,
whether it be lapply style, or for loop, etc.
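
As a rough illustration of those two styles, the sketch below parses the first five lines of the bundled file, once with lapply() and once with a for loop. It uses only isd_parse_line() as shown above; limiting it to five lines is just to keep the example quick.

```r
library(isdparser)

path <- system.file("extdata/024130-99999-2016.gz", package = "isdparser")
lns <- readLines(path, encoding = "latin1")

# lapply style: one parsed record per input line
parsed <- lapply(lns[1:5], isd_parse_line)

# for-loop style: same result, with room for per-line logic
parsed_loop <- vector("list", 5)
for (i in seq_len(5)) {
  parsed_loop[[i]] <- isd_parse_line(lns[i])
}

# Lines can carry different optional fields, so column counts may differ
vapply(parsed, ncol, integer(1))
```

Either way you end up with a list of one-row tibbles. Stacking them into a single data frame needs a row-binding step that fills in missing columns, for example dplyr::bind_rows().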

Parse entire files

You can also parse entire ISD files.

isd_parse(path)
#> # A tibble: 2,601 × 42
#>    total_chars usaf_station wban_station       date  time date_flag
#>                                     
#> 1           54       024130        99999 2016-01-01  0000         4
#> 2           54       024130        99999 2016-01-01  0100         4
#> 3           54       024130        99999 2016-01-01  0200         4
#> 4           54       024130        99999 2016-01-01  0300         4
#> 5           54       024130        99999 2016-01-01  0400         4
#> 6           39       024130        99999 2016-01-01  0500         4
#> 7           54       024130        99999 2016-01-01  0600         4
#> 8           39       024130        99999 2016-01-01  0700         4
#> 9           54       024130        99999 2016-01-01  0800         4
#> 10          54       024130        99999 2016-01-01  0900         4
#> # ... with 2,591 more rows, and 36 more variables: latitude ,
#> #   longitude , type_code , elevation , call_letter ,
#> #   quality , wind_direction , wind_direction_quality ,
#> #   wind_code , wind_speed , wind_speed_quality ,
#> #   ceiling_height , ceiling_height_quality ,
#> #   ceiling_height_determination , ceiling_height_cavok ,
#> #   visibility_distance , visibility_distance_quality ,
#> #   visibility_code , visibility_code_quality ,
#> #   temperature , temperature_quality ,
#> #   temperature_dewpoint , temperature_dewpoint_quality ,
#> #   air_pressure , air_pressure_quality ,
#> #   AW1_present_weather_observation_identifier ,
#> #   AW1_automated_atmospheric_condition_code ,
#> #   AW1_quality_automated_atmospheric_condition_code ,
#> #   N03_original_observation , N03_original_value_text ,
#> #   N03_units_code , N03_parameter_code , REM_remarks ,
#> #   REM_identifier , REM_length_quantity , REM_comment 

Optionally, you can print progress:

isd_parse(path, progress = TRUE)
#> # A tibble: 2,601 × 42
#>    total_chars usaf_station wban_station       date  time date_flag
#>                                     
#> 1           54       024130        99999 2016-01-01  0000         4
#> 2           54       024130        99999 2016-01-01  0100         4
#> 3           54       024130        99999 2016-01-01  0200         4
#> 4           54       024130        99999 2016-01-01  0300         4
#> 5           54       024130        99999 2016-01-01  0400         4
#> 6           39       024130        99999 2016-01-01  0500         4
#> 7           54       024130        99999 2016-01-01  0600         4
#> 8           39       024130        99999 2016-01-01  0700         4
#> 9           54       024130        99999 2016-01-01  0800         4
#> 10          54       024130        99999 2016-01-01  0900         4
#> # ... with 2,591 more rows, and 36 more variables: latitude ,
#> #   longitude , type_code , elevation , call_letter ,
#> #   quality , wind_direction , wind_direction_quality ,
#> #   wind_code , wind_speed , wind_speed_quality ,
#> #   ceiling_height , ceiling_height_quality ,
#> #   ceiling_height_determination , ceiling_height_cavok ,
#> #   visibility_distance , visibility_distance_quality ,
#> #   visibility_code , visibility_code_quality ,
#> #   temperature , temperature_quality ,
#> #   temperature_dewpoint , temperature_dewpoint_quality ,
#> #   air_pressure , air_pressure_quality ,
#> #   AW1_present_weather_observation_identifier ,
#> #   AW1_automated_atmospheric_condition_code ,
#> #   AW1_quality_automated_atmospheric_condition_code ,
#> #   N03_original_observation , N03_original_value_text ,
#> #   N03_units_code , N03_parameter_code , REM_remarks ,
#> #   REM_identifier , REM_length_quantity , REM_comment 

There's a parallel option as well, which comes in handy with larger ISD files:

isd_parse(path, parallel = TRUE)

Visualize the data

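Build a proper date-time column

# parse the example file first; res is the tibble that isd_parse() returns
res <- isd_parse(path)

df <- res %>%
  mutate(
    # strptime() is vectorized, so rowwise() is not needed
    datetime = as.POSIXct(strptime(paste(date, paste0(substring(time, 1, 2), ":00:00")), "%Y-%m-%d %H:%M:%S"))
  )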
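
The date-time step above keeps only the hour. If you also want the minutes, a base R variant along these lines may help. It runs on toy vectors shaped like the date and time columns, and it assumes the times are recorded in UTC, which is worth confirming against NOAA's format notes.

```r
# Toy values shaped like the date and time columns shown earlier
date <- as.Date(c("2016-01-01", "2016-01-01", "2016-01-02"))
time <- c("0000", "0130", "2350")

# Paste date and HHMM together, then parse in one vectorized call.
# tz = "UTC" is an assumption about how observation times are recorded.
datetime <- as.POSIXct(paste(date, time), format = "%Y-%m-%d %H%M", tz = "UTC")

datetime
```

Setting the time zone explicitly avoids results that shift depending on the locale of the machine running the code.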

Plot temperature over time

# drop some outliers (obviously, look into these more for serious use)
library(ggplot2)
ggplot(df[df$temperature < 100,], aes(datetime, temperature)) +
  geom_point() +
  theme_grey(base_size = 18)

[Plot: temperature versus datetime for the parsed file]
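
The outliers dropped before plotting are most likely placeholder codes for missing observations rather than real readings, though that is my assumption and not something the parsed output states. Under that assumption, recoding them to NA is a little safer than filtering rows, because the gaps stay visible. The values and the cutoff below are made up for illustration.

```r
# Toy temperatures; 999.9 stands in for a "missing" code (an assumption)
temperature <- c(2.1, 1.8, 999.9, -0.4, 999.9)

# Recode the placeholder to NA instead of dropping the rows
temperature_clean <- ifelse(temperature > 900, NA, temperature)

# Summaries can then skip the gaps explicitly
mean(temperature_clean, na.rm = TRUE)
```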

Future work

I plan to improve performance by profiling and swapping out slower code for faster code,
and possibly by dropping down to C++.

There is already a feature request for selecting fields of interest instead of
getting all fields, so that's on the list.

Do try out isdparser. Let us know of any bugs, and any feature requests!
