How NOT to format time series data

Posted on September 26, 2023 by jonathanscallahan in R bloggers | 0 Comments

[This article was first published on R – Blog – Mazama Science , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I work with lots of environmental time series data from stationary instruments. This post describes why you should avoid mixing data and metadata in a single file and instead suggests an easy-to-implement, easy-to-use, maximally compact format consisting of two .csv files linked by unique identifiers.

EPA Air Quality System

For decades, the US EPA has managed data from a network of ~1200 regulatory grade (i.e. expensive and professionally maintained) air quality monitors that produce hourly measurements at specific locations. Aggregated “raw” data are made publicly available in hourly data files. These .csv files are well documented and adhere to the tidy data standard with one measurement per record. The EPA AQS is a fantastic example of a government agency consistently doing the hard work of data ingest, harmonization, quality control and aggregation to produce useful datasets in a format that anyone can work with and make sense of. Kudos to the EPA for the AQS data files!

The problem

One downside, however, is that the records in these files contain lots of repeated, location- and instrument-specific metadata resulting in large file sizes. For small collections of data, repeated metadata is a small price to pay for the convenience of having everything you want in a single file. But this price increases as the number of records goes up. For stationary time series with many records, this convenience quickly becomes unaffordable.

An example

The 2020 wildfire season in the United States produced lots of smoke that impacted millions of people. This impact can be investigated using hourly PM2.5 measurements in the AQS data: https://aqs.epa.gov/aqsweb/airdata/hourly_88101_2020.zip

This file has 5,749,884 rows and is 44 MB compressed. (That seems OK. We should be able to work with ~2-3 x 44 MB.) But when we unzip this file it suddenly blows up to 1.5 GB!

What is going on? A look at the first few lines tells us:

"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
"01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",...
...

Lots of repeated metadata!

For every instrument location, about 250 characters of metadata (=250 bytes) will be repeated 24 (hours) * 365 (days) = 8,760 times. This .csv file has over a Megabyte of unneeded, unwanted, repeated metadata for each of the 1200 instrument locations.

I have addressed this issue before in various workshops and meetings:

If you collect data repeatedly along any spatiotemporal axis (x-y-z-t), you need to keep data and metadata in separate files.

Current and future data volumes

The world of air quality data is undergoing rapid evolution. A proliferation of low-cost sensors is generating data at sub-hourly levels and threatens to overwhelm existing data management solutions. The PurpleAir network alone has ~15,000 sensors generating raw data every two minutes. If this data were stored in the AQS format, a single parameter .csv file would weigh in at over half a Terabyte!

Clearly, another data format is needed!

How NOT to format other types of data

Mixing data and metadata is not just bad for time series data. It is bad for many types of data.

How NOT to format a conversation

This example is a bit contrived, but imagine you have been tasked to convert the record of a conversation into a tidy .csv file for some natural language processing. Notes of the conversation might be formatted like this:

Jon (male, 60, Seattle, WA, scientist)
Jennifer (female 50, Boston, MA, nurse)
Sally (female, 40, Phoenix, AR, chef)

Jon — Welcome to our survey.
Sally — Glad to be here.
Jennifer — Me too.
Jon — How do you like our new data format?
Sally — I doubt I will ever use a csv file.
Jennifer — I think this data format has a problem.
Jon — What is the problem?
Jennifer — Too much repeated metadata.
…

I would NOT recommend storing an hour long conversation like this:

name,sex,age,city,state,profession,word_order,word
Jon,male,60,Seattle,WA,scientist,1,welcome
Jon,male,60,Seattle,WA,scientist,2,to
Jon,male,60,Seattle,WA,scientist,3,our
Jon,male,60,Seattle,WA,scientist,4,survey
Sally,female,40,Phoenix,AR,chef,5,glad
Sally,female,40,Phoenix,AR,chef,6,to
Sally,female,40,Phoenix,AR,chef,7,be
Sally,female,40,Phoenix,AR,chef,8,here
Jennifer,female,50,Boston,MA,nurse,9,me
Jennifer,female,50,Boston,MA,nurse,10,too
Jon,male,60,Seattle,WA,scientist,11,how
Jon,male,60,Seattle,WA,scientist,12,do
...

Although “tidy”, this format seems absurd because we all make the following assumptions:

Sentences use the space character to separate words.
Sentences always display words arranged by word_order.
Individuals, once introduced, only need to be referenced by their identifier (name).

We don’t need to repeat the speaker context with every word. A more understandable, more compact format for storing and sharing this data would be:

id,words
Jon,"welcome to our survey"
Sally,"glad to be here"
Jennifer,"me too"
...

with detailed metadata about Jon, Sally and Jennifer in a separate table:

id,sex,age,city,state,profession
Jon,male,60,Seattle,WA,scientist
Sally,female,40,Phoenix,AR,chef
Jennifer,female,50,Boston,MA,nurse

How NOT to format gridded model output

The atmospheric scientists who run weather models have been working with huge amounts of data since before time began (at “1970-01-01 00:00:00 UTC”). Weather model output files store data for every grid cell. A modern model will have many of these. The WRF 1.33km CANSAC grid contains 700x808x40 grid points for each time step. At 4 bytes per floating point number, that’s 90.5 MB per parameter per time step. A single model run will have dozens of parameters and 72 or more time steps. It’s easy to understand why this community developed the NetCDF data format and associated software libraries at a time when many of us were still playing Pong.

Just storing a single ASCII date stamp with every data record would increase data volumes by 700*808*40*(23 characters in “1970-01-01 00:00:00 UTC”) = 0.5 GB per timestep … for a single piece of repeated metadata!

Clearly, if you have LOTS of data, you need to adopt special formats and special software. Like it or not, this is what the future looks like for environmental time series data. The data deluge is well under way with the increased deployment of low-cost sensors producing data records every few minutes.

But we don’t have to embrace NetCDF quite yet and can still use ASCII .csv files for environmental time series as long as we:

Keep data and metadata in separate files.

How TO format time series data

The most efficient way to store environmental time series data is to put all of your measurement data into a data.csv file with the first column containing a datestamp recording when measurements were made and all other columns containing those measurements. Column headers will contain unique identifiers for each “device-deployment”, aka unique time series.

Device-deployment metadata containing instrument and location information will go into a meta.csv file with the unique identifiers in the first column and as much metadata as you want in the other columns. For a single device at a single location, the meta.csv file will only have a single record. For multiple device-deployments all measuring the same parameter, meta.csv will have one record for each instrument.

Here is a tiny example with hourly measurements from 4 air quality instruments:

# pm2.5_data.csv
"datetime","id_1","id_2","id_3","id_4"
2023-09-16 18:00:00,0,3,2,5
2023-09-16 19:00:00,1,1,1,39
2023-09-16 20:00:00,3,1,1,9
2023-09-16 21:00:00,1,NA,1,7
2023-09-16 22:00:00,0,NA,1,8
...

# pm2.5_meta.csv
"id","longitude","latitude","timezone","stateCode","countyName",...
"id_1",-108.290764,37.350174,"America/Denver","CO","Montezuma",...
"id_2",-112.041621,46.58432,"America/Denver","MT","Lewis and Clark",...
"id_3",-114.894524,47.195087,"America/Denver","MT",NA,...
"id_4",-122.094742,45.385597,"America/Los_Angeles","OR",NA,...

The data.csv file can grow as needed without any bloat from repeated metadata. You can also add as many fields as you like to meta.csv, knowing that this data will not be unnecessarily repeated. This structure is easy to generate, easy to work with and far smaller than an equivalent “tidy” structure with all metadata in every record.

Best of all, packages already exist in R to work with this structure!

MazamaTimeSeries

A suite of R packages has been developed over the last decade to support efficient data processing and visualization of air quality and other environmental time series data. Packages AirMonitor, AirMonitorPlots, AirSensor2 and RAWSmet all use this data model and are built on top of the MazamaTimeSeries package which includes functions that mimic dplyr operations on a single table. Please see the detailed documentation and examples associated with each of these packages.

The advantages of structuring your data with separate data and metadata are pretty undeniable:

~ 100x smaller files means fast data transfer and no need for extra disk space
~ 100x smaller memory footprint means faster processing and no need for extra memory

Best wishes for fast and efficient analysis of environmental time series data!

To leave a comment for the author, please follow the link and comment on their blog: R – Blog – Mazama Science .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

How NOT to format time series data

EPA Air Quality System

The problem

An example

Current and future data volumes

How NOT to format other types of data

How NOT to format a conversation

How NOT to format gridded model output

How TO format time series data

MazamaTimeSeries

Related

EPA Air Quality System

The problem

An example

Current and future data volumes

How NOT to format other types of data

How NOT to format a conversation

How NOT to format gridded model output

How TO format time series data

MazamaTimeSeries

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)