GHCN V3 Metadata Improvements

December 12, 2010

(This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers)

The Global Historical Climatology Network (GHCN) version 3 is in its beta stage. One of the stated goals of the project is to improve the metadata provided for the station data. Over the past few months several independent volunteers have been working on station metadata, each with their own focus. Ron Broberg deserves credit for taking the lead in applying GIS tools to the issue, and Peter O’Neill deserves credit for his station-by-station review of GISS inventories. A couple of other folks are busy at work, and I will leave it to them to discuss their efforts when the time is appropriate, as the publication process precludes them from talking openly about it.

Here, my main focus has been on GHCN and more recently GHCN V3. The goal of the project is to provide a more accurate and more comprehensive inventory of station locations and station metadata. A short recap of why this matters: imagine, if you would, an inventory of 2000 stations. We suspect that some are urban and some are rural. We want to estimate the difference between the urban and the rural with an eye toward assessing the impact of UHI on the record. Further, let’s suppose that the difference is large: urban stations warm 1C per century (from 1900 to 2010) while rural stations show 0C warming. In that case the rule we use to separate urban from rural, while important, is not critical. For example, if we misidentify some urban sites as rural (say 10%), then some fraction of urban warming will “infect” the rural subset, but the effect of urbanization will still be clear in our comparison. If our categorization is less accurate, say we get 50% of the urban wrong, then our ability to discriminate the signal will be reduced accordingly. If we also mislabel rural sites as urban, the effect will be compounded.

We can see then that if the UHI effect is small, the need for a better discrimination function increases. For example, if we think that the urban stations have warmed .8C over the course of 1900-2009 (+-.05C) while the rural have only warmed, say, .6C (+-.05C), misidentification will have more impact on our ability to find that UHI signal. One approach would be to take the 2000 stations and divide them into 3 groups: extreme rural, extreme urban, and “mixed”. This would, of course, reduce the number of stations, and the signal could then be lost in the noise that results from fewer stations. Still, that result would indicate that the UHI effect is small. That happens to be my position. Proving that, however, requires a diligent look at the metadata.
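The dilution argument can be put into numbers. What follows is a minimal Python sketch, not part of the inventory code. It assumes an even split of 1000 urban and 1000 rural stations and a single uniform trend within each class; both are simplifications made only for illustration, and the function names are hypothetical.

```python
def labelled_trends(n_urban, n_rural, urban_trend, rural_trend,
                    frac_urban_as_rural=0.0, frac_rural_as_urban=0.0):
    """Mean trend of the stations *labelled* urban and *labelled* rural.

    Assumes every truly urban station has trend `urban_trend` and every
    truly rural station has trend `rural_trend` (an idealisation).
    """
    u_to_r = n_urban * frac_urban_as_rural  # urban stations mislabelled rural
    r_to_u = n_rural * frac_rural_as_urban  # rural stations mislabelled urban

    true_u_in_u = n_urban - u_to_r          # correctly labelled urban
    true_r_in_r = n_rural - r_to_u          # correctly labelled rural

    urban_mean = (true_u_in_u * urban_trend + r_to_u * rural_trend) / (true_u_in_u + r_to_u)
    rural_mean = (true_r_in_r * rural_trend + u_to_r * urban_trend) / (true_r_in_r + u_to_r)
    return urban_mean, rural_mean


def apparent_difference(*args, **kwargs):
    """Urban-minus-rural trend as it would appear with the given mislabelling."""
    urban_mean, rural_mean = labelled_trends(*args, **kwargs)
    return urban_mean - rural_mean


# Large signal (1C vs 0C), 10% of urban stations mislabelled as rural.
big = apparent_difference(1000, 1000, 1.0, 0.0, frac_urban_as_rural=0.10)

# Small signal (.8C vs .6C), same 10% mislabelling.
small = apparent_difference(1000, 1000, 0.8, 0.6, frac_urban_as_rural=0.10)
```

Under these assumptions the large signal shrinks from 1C to about .91C, which is still obvious, while the small signal shrinks from .2C to about .18C, uncomfortably close to the +-.05C uncertainties quoted above. With 50% mislabelling in both directions the apparent difference vanishes entirely.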

Metadata Improvements:

The data file is in the Box, named ExtV3Metadata.inv; a csv file is also included.

The station information presented here is still in the beta stage, but it’s ready for a public release and some initial comments on what we can tell. The process of improving and extending the data is described below. The inventory draws on the following sources:


1. GHCN V3 Inventory.

2. Updated WMO station locations

3. Nightlights as used by Hansen 2010

4. Improved Nightlights as recommended by the Nightlights principal investigator

5. Nightlight Buffers

6. Gridded population density as provided by GPW

7. Gridded historical population density as provided by Hyde

8. Gridded Population Density as provided by GRUMP

9. Gridded Impervious Surface area.

10. Land masks provided by several sources.

Step One: The beta version of the GHCN V3 inventory is read into an R data frame. For the posted file, that inventory was the one matching the “adjusted” dataset. This inventory includes only 7279 stations, as one appears to be dropped from the unadjusted data. The data fields read in include:

Id: The Id field is the GHCN ID. It’s an 11 digit index of the form cccwwwwwddd, where ccc indicates a country code, wwwww indicates a WMO code, and ddd indicates an IMOD number. There are several things to note. For the US stations, the WMO code does not appear to map to the WMO master list. For example, “42500046506” is listed as the GHCN ID of Orland, California. In GHCN V2 the ID for Orland is 42572591004, and for USHCN it is 046506-02. The WMO list has no entry for Orland. In V2 Orland was listed under the WMO number for nearby Red Bluff, and the 004 in the Orland IMOD indicates that Orland is at a different location than the WMO station it is reported under. Confusing? You bet. To be accurate, the V3 readme will have to be changed to indicate that the middle 5 digits of the new GHCN ID do not reflect WMO numbers in all cases. In the US, the USHCN ID supplies the trailing digits (046506 in the Orland example). Basically there are USHCN stations that do not have WMO numbers. In V2 they were listed as IMODs of the closest WMO station (Red Bluff, in this case); in V3 they are listed according to their USHCN number. That makes comparing V2 to V3 a bit troublesome.
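To make the cccwwwwwddd layout concrete, here is a small Python sketch that splits an ID into its three parts. The function and field names are hypothetical; only the 3/5/3 digit layout comes from the description above.

```python
from typing import NamedTuple


class GhcnId(NamedTuple):
    country: str  # ccc
    wmo: str      # wwwww (not a real WMO number for every station, see above)
    imod: str     # ddd, "000" for the WMO station itself


def split_ghcn_id(ghcn_id: str) -> GhcnId:
    """Split an 11 digit GHCN ID into country code, WMO field, and IMOD."""
    if len(ghcn_id) != 11 or not (ghcn_id.isascii() and ghcn_id.isdigit()):
        raise ValueError(f"expected 11 digits, got {ghcn_id!r}")
    # Keep the pieces as strings so leading zeros survive.
    return GhcnId(ghcn_id[:3], ghcn_id[3:8], ghcn_id[8:])


v2_orland = split_ghcn_id("42572591004")  # V2: listed as an IMOD of a WMO station
v3_orland = split_ghcn_id("42500046506")  # V3: middle digits are not a WMO number
```

Splitting the two Orland IDs this way shows the problem directly: the V2 ID yields a WMO field of 72591 with IMOD 004, while the V3 ID yields a "WMO" field of 00046, which is just a slice of the USHCN number.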

Lat: The latitude of the station, reported in degrees north from -90 to 90. In my inventory the value is one of two values: the value found in GHCN V3 or the value found in the recently updated WMO master list. The WMO has required countries to update the precision of the station location data and that process is underway. It’s not entirely complete. Consequently some of the GHCN V3 station locations remain the same. Those that have been updated by WMO are updated here.

Lon: Longitude is degrees east, from -180 to 180. As with Latitude this field contains the corrections from the recent WMO updates.

Altitude: Altitude in meters from the GHCN V3 inventory. Corrected altitudes from the WMO master list are not included here. That will come later.

Name: The station name from the GHCN V3 inventory. I am in the laborious process of cleaning up the name list to remove the following: country names, state designations, province designations, partial names, punctuation marks. The goal would be to have a list of names as well as alternative names. Countries, states, provinces can be added properly by geocoding and should not be in the station name field.

GridEl: Grid elevation, as taken from the GHCN inventory data. This is the average elevation, in meters, of the .5 degree grid cell containing the station. Once the position data is improved this could be supplanted with more accurate metadata from DEMs.

Rural: A designation of R, S, or U that indicates whether the station is Rural, Small Town, or Urban. This characterization is based on the population of the nearest town, where R is a town with fewer than 10,000 people, U is a town with more than 50,000, and S falls in between. This is a dated measure of urbanity. It’s problematic because it does not tell us whether the town is densely populated or spread out.

Population: The population of the nearest town in 1000s.
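The R/S/U rule is simple enough to write down. The Python sketch below takes the Population field (in 1000s) and returns the class. How the exact boundary values of 10 and 50 are treated is an assumption here, since the description above only says "less than" and "greater than".

```python
def rural_class(population_thousands: float) -> str:
    """Return "R", "S", or "U" from the nearest town's population in 1000s.

    Assumption: towns of exactly 10K or exactly 50K people are treated as "S".
    """
    if population_thousands < 10:
        return "R"
    if population_thousands > 50:
        return "U"
    return "S"


classes = [rural_class(p) for p in (3, 25, 400)]
```

Note that the rule sees only a head count. A town of 60,000 packed around the station and one of 60,000 spread over a wide area both come out as U, which is the weakness the gridded density and nightlights fields are meant to address.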

Topography: type of topography in the environment surrounding the station, (Flat-FL,Hilly-HI,Mountain Top-MT,Mountainous Valley-MV).

Vegetation: Type of vegetation in the environment of the station, if the station is Rural and when it is indicated on the Operational Navigation Chart (Desert-DE, Forested-FO, Ice-IC, Marsh-MA).

Coastal: Indicates whether the site is at a coastal location (CO), near a lake (LA), or more than 30km away from water.

DistanceToCoast: If the site is close to water, this field indicates the distance in km.

Airport: A true/false flag for whether the station is at an airport or not. This has not yet been corrected using WMO data, although there are discrepancies.

DistanceToTown: If the station is at an airport, the distance in km from the airport to its associated town.

NDVI: Normalized Difference Vegetation Index. This field indicates the type of vegetation in the area. It’s the original V3 data and should be supplanted with improved data.

Light_Code: While the V3 readme does not include or explain this data, it was present in V2. Basically it is an undocumented description of the site’s urbanity.

Step Two: In the second step the updated WMO master list is merged with GHCN V3. This is not straightforward. First the GHCN list must be reduced to those stations that are not IMODs: where the GHCN ID is cccwwwwwddd, the ddd field must be 000. Next the WMO file must be trimmed as well, because it has multiple entries for some stations. The extra entries represent “air stations” that are collocated with the ground station. In the WMO index this is indicated by an indexSubNbr = 1. Next the GHCN V3 file is merged with the WMO file based on WMO ID. After this is completed, distances can be calculated and the names can be checked for consistency. That process results in the following fields:

WmoName : The name used by the WMO is recorded. In certain cases the WMO name is spelled differently. In some cases it is entirely different.

WmoLon: The longitude given by the updated WMO master list. These updates are in progress. Some mistakes remain as member nations are delivering partial results. The data is supposed to be accurate to degrees, minutes and seconds.

WmoLat: The latitude given by the updated WMO master list. These updates are in progress. Some mistakes remain as member nations are delivering partial results. The data is supposed to be accurate to degrees, minutes and seconds.

GhcnDistance: The distance between the old location given by GHCN V3 and the new location, as calculated by a Haversine distance calculation.

NameMatch: A true/false flag indicating whether the names matched under a rather lax fuzzy name-matching criterion.

GhcnLon: The Legacy longitude. This is the Longitude from the source GHCN V3 file.

GhcnLat: The legacy latitude. This is the latitude from the source GHCN V3 file.
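The merge in Step Two can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the R code that produced the inventory: the input field names (id, name, lat, lon, wmo) are hypothetical, rows with indexSubNbr equal to 1 are read as the collocated air-station entries to drop, the Earth radius of 6371 km is an assumed constant, and the "lax" name match is stood in for by a similarity ratio from Python's difflib with an arbitrary 0.6 threshold.

```python
import math
from difflib import SequenceMatcher

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def names_match(a, b, threshold=0.6):
    """Lax fuzzy comparison (a stand-in; the actual criterion is not given)."""
    ratio = SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()
    return ratio >= threshold


def merge_with_wmo(ghcn_rows, wmo_rows):
    """Join non-IMOD GHCN stations to trimmed WMO entries on the WMO number."""
    # Trim the WMO list: drop the entries flagged as collocated air stations.
    wmo_by_id = {w["wmo"]: w for w in wmo_rows if w["indexSubNbr"] != 1}
    merged = []
    for g in ghcn_rows:
        if g["id"][8:] != "000":  # ddd must be 000, otherwise it is an IMOD
            continue
        w = wmo_by_id.get(g["id"][3:8])  # middle five digits are the WMO code
        if w is None:
            continue
        merged.append({
            "Id": g["id"],
            "WmoName": w["name"],
            "WmoLon": w["lon"],
            "WmoLat": w["lat"],
            "GhcnDistance": haversine_km(g["lat"], g["lon"], w["lat"], w["lon"]),
            "NameMatch": names_match(g["name"], w["name"]),
            "GhcnLon": g["lon"],
            "GhcnLat": g["lat"],
        })
    return merged
```

A large GhcnDistance together with a failed NameMatch is the combination worth checking by hand, since it suggests the two lists may be describing different places under the same number.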

Step Three: In step three the corrected inventory is passed to a metadata compilation function. The longitude and latitude are passed in, and the metadata associated with those positions is passed out, along with the longitude and latitude that were passed in, for consistency checking.

Lon: Corrected longitude, same as field 1

Lat: Corrected latitude, same as field 2

LandWater: The fraction of land in the 1/4 degree grid cell surrounding the station. This includes inland water.

LandOcean: The fraction of land in the 1/4 degree grid cell surrounding the station. This includes only ocean water.

CoastDistance: The distance the station is from the coast. If the station is over land this should equal 0.  If a station is in the water it returns the distance to the closest coast. This occurs when coastal stations or island stations are misplaced. The accuracy of the coast map is 30 arc seconds.

Lights: The value of nightlights using the same file that Hansen2010 uses. It should be noted that this file has been deprecated by the file creators. It represents nightlights at the station in the 1995-97 era.

LightsF16: The value of nightlights using the most recent analysis from 2006.  The raw data in the file has been processed to produce a DN number according to the file readme.

Bright3km: Every LightsF16 field surrounding the station has been processed to extract the brightest pixel within 3km. Given that Nightlights positional accuracy is ~1-2km, a station with perfect location information may still be misregistered with the image because of positional errors in the nightlights data.

Bright5km: Same as above with a 5km radius

Bright10km: Same as above with a 10km radius

Bright20km: Same as above with a 20km radius
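The buffer fields can be thought of as a "brightest pixel within a radius" search. The Python sketch below shows the idea on a small in-memory grid. The grid layout (row 0 at the northern edge, cell centres spaced cell_deg apart), the parameter names, and the brute-force scan are assumptions for illustration; a real nightlights raster would need a windowed read rather than a scan of every cell.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def brightest_within(grid, top_lat, left_lon, cell_deg, lat, lon, radius_km):
    """Largest grid value whose cell centre lies within radius_km of (lat, lon).

    grid[r][c] is assumed to be centred at
    (top_lat - r * cell_deg, left_lon + c * cell_deg).
    Returns None if no cell centre falls inside the radius.
    """
    best = None
    for r, row in enumerate(grid):
        cell_lat = top_lat - r * cell_deg
        for c, value in enumerate(row):
            cell_lon = left_lon + c * cell_deg
            if haversine_km(lat, lon, cell_lat, cell_lon) <= radius_km:
                if best is None or value > best:
                    best = value
    return best


# Toy 3x3 grid around a station at (0, 0); cells are 0.01 degrees apart.
toy = [
    [0, 20, 63],
    [0, 5, 0],
    [0, 0, 0],
]
at_station = brightest_within(toy, 0.01, -0.01, 0.01, 0.0, 0.0, 0.5)
nearby = brightest_within(toy, 0.01, -0.01, 0.01, 0.0, 0.0, 2.0)
```

In the toy grid the station's own pixel is dim while a bright pixel sits diagonally next to it, so the value changes with the radius. That is exactly the case the buffers are designed to catch when the image and the station are misregistered by a kilometer or two.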

Isa: Impervious surface percentage. The percentage of impervious surfaces, estimated from 0-100%. A negative number indicates the station is in the water. ISA is the result of a regression and is based on Nightlights data and Landscan population.

GpwDensity: Population density (humans per square km) from the GPW source

GDensity: Population density (humans per square km) from the GRUMP source

The following fields are derived from the HYDE historical population/land use project, which is being used for AR5. The figure is the density of humans per sq km. Figures are given for every decade. The data has been processed from 5 minute data.

Pop1850, Pop1860, Pop1870, Pop1880, Pop1890, Pop1900, Pop1910, Pop1920,
Pop1930, Pop1940, Pop1950, Pop1960, Pop1970, Pop1980, Pop1990, Pop2000

GrumpUrban: A flag indicating whether the site is urban (2), rural (1), or in water (0)
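Most of the Step Three fields come from gridded products, so the lookup boils down to turning a longitude and latitude into a row and column. Here is a minimal Python sketch of that conversion. It assumes a global grid whose first cell starts at 90N, 180W with rows running north to south; actual products may use a different origin or extent, so treat the layout and the function name as hypothetical. The only fixed fact used is that 5 minute data has 12 cells per degree.

```python
import math


def grid_index(lat, lon, cells_per_degree):
    """Row and column of the global grid cell containing (lat, lon).

    Assumes row 0 starts at 90N and column 0 starts at 180W.
    """
    n_rows = 180 * cells_per_degree
    n_cols = 360 * cells_per_degree
    row = math.floor((90.0 - lat) * cells_per_degree)
    col = math.floor((lon + 180.0) * cells_per_degree)
    # Points exactly on the southern or eastern edge fall into the last cell.
    row = min(max(row, 0), n_rows - 1)
    col = min(max(col, 0), n_cols - 1)
    return row, col


# 5 minute data: 12 cells per degree, i.e. a 2160 x 4320 global grid.
row, col = grid_index(45.5, -122.25, 12)
```

Because the station position selects a single cell, a location error larger than the cell size can return metadata for the wrong cell, which is why the position corrections in Step Two come before this step.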
