My R package designed to import all of the Berkeley Earth Surface temperature data is officially on CRAN, as BerkeleyEarth. The version there is 1.3 and I’ve completed some testing with the help of David Vavra. The result of that is version 1.5 which is available here at the drop box. I’ll be posting that to CRAN in a bit. For anyone who has worked with temperature data from the various sources Berkeley Earth is a godsend. For the first time we have a dataset that brings together all the open available temperature datasets into one consistent format. The following sources are merged and reconciled.
- Global Historical Climatology Network – Monthly
- Global Historical Climatology Network – Daily
- US Historical Climatology Network – Monthly
- World Monthly Surface Station Climatology
- Hadley Centre / Climate Research Unit Data Collection
- US Cooperative Summary of the Month
- US Cooperative Summary of the Day
- US First Order Summary of the Day
- Scientific Committee on Antarctic Research
- GSN Monthly Summaries from NOAA
- Monthly Climatic Data of the World
- GCOS Monthly Summaries from DWD
- World Weather Records (only those published since 1961)
- Colonial Era Weather Archives.
The data files are availabe here: http://berkeleyearth.org/data/
Let’s start with a top level description of the dataflow through the system. All the source data is collected and turned into a common format: http://berkeleyearth.org/source-files/. Those files are then merged into a single file called the “multi value” file. In this file every series for every station is present. The data format for all the temperature data files is common: there are 7 columns: Station Id, Series Number, Date, Temperature, Uncertainty, Observations, and Time of Observation. So, in the “multi-value” file a single station will have multiple series numbers. In the next step of the process “single value” files are created. There are four versions of these files depending upon the Quality control applied and whether or not seasonality is removed. Thus there are 5 versions of the data: Multi value, single value with no QC and no removal of seasonality, single value with QC and no removal… you get the idea. In addition, the final files are delivered as TMAX, TMIN and TAVG. In other words there are 15 datasets.
The 15 datasets can all be downloaded with version 1.5 of the package using the function downloadBerkeley(). The function is passed a data.frame of Urls to the files and those selected are downloaded and unzipped. In this process the package will create Three top level directories: TAVG, TMIN, and TMAX. Files are then downloaded to sub directories under the correct directory. It’s vitally important to keep all directories and file names intact for this package to function. The file named “data.txt”, has the same name across all 15 versions, so keeping things organized via directory structure will prevent obvious mistakes. There is a safeguard of sorts in the files themselves. Every file starts with comments that indicate the type of file that it is ( Tmax, multi value ). I’ve included a function getFileInformation() that will iterate through a directory and write this information to a local file. That function also reads all the files and extracts the “readme” headers, writing them to a separate “readme” subdirectory.
The download takes a while and I suggest you fire it up and leave your system alone while it downloads and unpacks the data. Should you get any warnings or errors you can always patch things up by hand ( download manually) or call downloadBerkeley() again subsetting the url data.frame to target those files that get corrupted. That is, on occasion you MAY get a warning that the file size downloaded doesnt match the file description. I suggest patching these up by hand downloading. I could, of course add some code to check and verify the whole downloading process, so ask if you would like that.
Once you have all the files downloaded you are ready to use the rest of the package. Every folder has the same set of files: data related to the stations and the core file “data.txt” The station metadata comes in several versions from the bare minimum ( station, lat,lon, altitude) to the complete station descrition. Functions are provided to read every file: They are all named to let you know exactly what file they read readSiteComplete() readSiteDetail(). The filenames are all defaulted to the Berkeley defined name. You merely set the Directory name and call the function. All functions return a data.frame with standard R NA’s used in place of the Berkeley values for NA. In addition, I’ve rearranged some of the data columns so that the station inventories can be used with the package RghcnV3.
The big task in all of this is reading the file “data.txt”. On the surface it’s easy. Its a 7 column file that can be read as a matrix, using read.delim(). There are two challenges. The first challenge is the “sparse” time format. There are over 44K stations. Some of those stations have a couple months of data, others have data from 1701 on. Some stations have complete records with reports for every month; other stations have gaps in there reporting. Berkeley common format only reports the months that have data. Like So:
Station Series date Temperature
1 1 1806.042 23
1 1 1925.125 16
If all the dates between 1806 and 1925 have no records ( either absent or dropped because of QC) then the months are simply missing. There are no NA. This gives us a very compact data storage solution, however, if you want to do any real work you have to put that data into a structure like a time series or a matrix where all times of all stations are aligned. In short, you need to fill in NAs. And you have to do this every time you read the data in. At some point I expect that people will get that storage is cheap and they will just store NAs where they are needed. Reading in sparse data and filling in NAs is simple, time consuming, and prone to bone headed mistakes. Our second challenge is memory. Once we’ve expanded the data to include NA we run the risk of blowing thru RAM. Then if we want to calculate on the data we might make intermediate versions. More memory. There isn’t a simple solution to this, but here is what version 1.5 has. It has three different routines for reading in the data:
readBerkeleyData(): This routine reads all 7 data columns and does no infilling. Its primary purpose is to create a memory backed file of the data. However, if you want to analyze things like time of observation or number of observations you have to use this function. Also, if you have your own method of “infilling NA” you can use this to grab all the data in its time sparse format. On FIRST read the function will take about 10 minutes to create a file backed version of the matrix using the package bigmemory. Every subsequent use of the call gets you immediate access to the data.bin file it creates.
readBerkeleyTemp(): this routine also creates a file backed matrix. On the very first call it sees if the temperature.bin file exists. Since that file doesnt exists, it is created. It is created from “data.txt” OR “data.bin”. Data.bin is created by readBerkeleyData(). So basically, readBerkeleyTemp() on first pass calls readBerkeleyData(). If readBerkeleyData() hasn’t been called before, it is called and data.bin is created and returned to readBerkeleyTemp(). The function then proceeds to create a file called temperature.bin. That file has a column for every station and a row for every time. NAs are put in place. The column names are also used to represent the station Id. Row names are used for time. The Berkeley “date” format is changed as well. This process can take over 2 hours. A buffer variable is provided to control how much data is read in before it is flushed to disk. It is set to 500K. At some stage This buffer will be optimized to the local RAM actually available. If you have more than 4GB you can play with this number to see if that speeds things up.
Lastly the function readAsArray() is provided. This function does not create a file backed matrix. It reads in “data.txt” and creates a 3D array of temperature only. The first dimension is stations, the second is months and the third is years. dimnames are provided. This data structure is used by the analytical functions in RghcnV3.