I work with lots of environmental time series data from stationary instruments. This post describes why you should avoid mixing data and metadata in a single file and instead suggests an easy-to-implement, easy-to-use, maximally compact format consisting of two .csv files linked by unique identifiers.
EPA Air Quality System
For decades, the US EPA has managed data from a network of ~1200 regulatory grade (i.e. expensive and professionally maintained) air quality monitors that produce hourly measurements at specific locations. Aggregated “raw” data are made publicly available in hourly data files. These .csv files are well documented and adhere to the tidy data standard with one measurement per record. The EPA AQS is a fantastic example of a government agency consistently doing the hard work of data ingest, harmonization, quality control and aggregation to produce useful datasets in a format that anyone can work with and make sense of. Kudos to the EPA for the AQS data files!
One downside, however, is that the records in these files contain lots of repeated, location- and instrument-specific metadata resulting in large file sizes. For small collections of data, repeated metadata is a small price to pay for the convenience of having everything you want in a single file. But this price increases as the number of records goes up. For stationary time series with many records, this convenience quickly becomes unaffordable.
The 2020 wildfire season in the United States produced lots of smoke that impacted millions of people. This impact can be investigated using hourly PM2.5 measurements in the AQS data: https://aqs.epa.gov/aqsweb/airdata/hourly_88101_2020.zip
This file has 5,749,884 rows and is 44 MB compressed. (That seems OK. We should be able to work with ~2-3 x 44 MB.) But when we unzip this file it suddenly blows up to 1.5 GB!
What is going on? A look at the first few lines tells us:
"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... "01","073","0023","88101",3,33.553056,-86.815,"WGS84","PM2.5 - Local Conditions",... ...
Lots of repeated metadata!
For every instrument location, about 250 characters of metadata (=250 bytes) will be repeated 24 (hours) * 365 (days) = 8,760 times. This .csv file has over a Megabyte of unneeded, unwanted, repeated metadata for each of the 1200 instrument locations.
I have addressed this issue before in various workshops and meetings:
If you collect data repeatedly along any spatiotemporal axis (x-y-z-t), you need to keep data and metadata in separate files.
Current and future data volumes
The world of air quality data is undergoing rapid evolution. A proliferation of low-cost sensors is generating data at sub-hourly levels and threatens to overwhelm existing data management solutions. The PurpleAir network alone has ~15,000 sensors generating raw data every two minutes. If this data were stored in the AQS format, a single parameter .csv file would weigh in at over half a Terabyte!
Clearly, another data format is needed!
How NOT to format other types of data
Mixing data and metadata is not just bad for time series data. It is bad for many types of data.
How NOT to format a conversation
This example is a bit contrived, but imagine you have been tasked to convert the record of a conversation into a tidy .csv file for some natural language processing. Notes of the conversation might be formatted like this:
Jon (male, 60, Seattle, WA, scientist)
Jennifer (female 50, Boston, MA, nurse)
Sally (female, 40, Phoenix, AR, chef)
Jon — Welcome to our survey.
Sally — Glad to be here.
Jennifer — Me too.
Jon — How do you like our new data format?
Sally — I doubt I will ever use a csv file.
Jennifer — I think this data format has a problem.
Jon — What is the problem?
Jennifer — Too much repeated metadata.
I would NOT recommend storing an hour long conversation like this:
name,sex,age,city,state,profession,word_order,word Jon,male,60,Seattle,WA,scientist,1,welcome Jon,male,60,Seattle,WA,scientist,2,to Jon,male,60,Seattle,WA,scientist,3,our Jon,male,60,Seattle,WA,scientist,4,survey Sally,female,40,Phoenix,AR,chef,5,glad Sally,female,40,Phoenix,AR,chef,6,to Sally,female,40,Phoenix,AR,chef,7,be Sally,female,40,Phoenix,AR,chef,8,here Jennifer,female,50,Boston,MA,nurse,9,me Jennifer,female,50,Boston,MA,nurse,10,too Jon,male,60,Seattle,WA,scientist,11,how Jon,male,60,Seattle,WA,scientist,12,do ...
Although “tidy”, this format seems absurd because we all make the following assumptions:
- Sentences use the space character to separate words.
- Sentences always display words arranged by word_order.
- Individuals, once introduced, only need to be referenced by their identifier (name).
We don’t need to repeat the speaker context with every word. A more understandable, more compact format for storing and sharing this data would be:
id,words Jon,"welcome to our survey" Sally,"glad to be here" Jennifer,"me too" ...
with detailed metadata about Jon, Sally and Jennifer in a separate table:
id,sex,age,city,state,profession Jon,male,60,Seattle,WA,scientist Sally,female,40,Phoenix,AR,chef Jennifer,female,50,Boston,MA,nurse
How NOT to format gridded model output
The atmospheric scientists who run weather models have been working with huge amounts of data since before time began (at “1970-01-01 00:00:00 UTC”). Weather model output files store data for every grid cell. A modern model will have many of these. The WRF 1.33km CANSAC grid contains 700x808x40 grid points for each time step. At 4 bytes per floating point number, that’s 90.5 MB per parameter per time step. A single model run will have dozens of parameters and 72 or more time steps. It’s easy to understand why this community developed the NetCDF data format and associated software libraries at a time when many of us were still playing Pong.
Just storing a single ASCII date stamp with every data record would increase data volumes by 700*808*40*(23 characters in “1970-01-01 00:00:00 UTC”) = 0.5 GB per timestep … for a single piece of repeated metadata!
Clearly, if you have LOTS of data, you need to adopt special formats and special software. Like it or not, this is what the future looks like for environmental time series data. The data deluge is well under way with the increased deployment of low-cost sensors producing data records every few minutes.
But we don’t have to embrace NetCDF quite yet and can still use ASCII .csv files for environmental time series as long as we:
Keep data and metadata in separate files.
How TO format time series data
The most efficient way to store environmental time series data is to put all of your measurement data into a
data.csv file with the first column containing a datestamp recording when measurements were made and all other columns containing those measurements. Column headers will contain unique identifiers for each “device-deployment”, aka unique time series.
Device-deployment metadata containing instrument and location information will go into a
meta.csv file with the unique identifiers in the first column and as much metadata as you want in the other columns. For a single device at a single location, the
meta.csv file will only have a single record. For multiple device-deployments all measuring the same parameter,
meta.csv will have one record for each instrument.
Here is a tiny example with hourly measurements from 4 air quality instruments:
# pm2.5_data.csv "datetime","id_1","id_2","id_3","id_4" 2023-09-16 18:00:00,0,3,2,5 2023-09-16 19:00:00,1,1,1,39 2023-09-16 20:00:00,3,1,1,9 2023-09-16 21:00:00,1,NA,1,7 2023-09-16 22:00:00,0,NA,1,8 ... # pm2.5_meta.csv "id","longitude","latitude","timezone","stateCode","countyName",... "id_1",-108.290764,37.350174,"America/Denver","CO","Montezuma",... "id_2",-112.041621,46.58432,"America/Denver","MT","Lewis and Clark",... "id_3",-114.894524,47.195087,"America/Denver","MT",NA,... "id_4",-122.094742,45.385597,"America/Los_Angeles","OR",NA,...
data.csv file can grow as needed without any bloat from repeated metadata. You can also add as many fields as you like to
meta.csv, knowing that this data will not be unnecessarily repeated. This structure is easy to generate, easy to work with and far smaller than an equivalent “tidy” structure with all metadata in every record.
Best of all, packages already exist in R to work with this structure!
A suite of R packages has been developed over the last decade to support efficient data processing and visualization of air quality and other environmental time series data. Packages AirMonitor, AirMonitorPlots, AirSensor2 and RAWSmet all use this data model and are built on top of the MazamaTimeSeries package which includes functions that mimic dplyr operations on a single table. Please see the detailed documentation and examples associated with each of these packages.
The advantages of structuring your data with separate data and metadata are pretty undeniable:
- ~ 100x smaller files means fast data transfer and no need for extra disk space
- ~ 100x smaller memory footprint means faster processing and no need for extra memory
Best wishes for fast and efficient analysis of environmental time series data!