Global done!

September 23, 2010

(This article was first published on Steven Mosher's Blog, and kindly contributed to R-bloggers)

Over the past few weeks I've been working at getting Moshtemp to run entirely in the raster package. I've been aided greatly by the author of that package, Robert Hijmans, who has been turning out improvements with regularity. For a while I was a bit stymied by some irregularities in getting source from R-Forge, so I turned to other work; more on that later. But now I'm happy to report that with version 1.5.9 I can do the whole analysis in raster.

First, some caveats. In what follows I benchmark against the analysis done by Hadley/CRU. That benchmark is complicated by the fact that we use different data and slightly different methods; I'll point out those differences as I go. Generally my goal is just to get this kind of analysis grounded in open source tools, not to quibble about the differences. So there is no grand takeaway message, other than "you can use raster to do this kind of work."

Also, I'll be going through the code a couple more times to throw out more of the code I wrote and to make the whole project-building process cleaner, more flexible, and better documented.

To the code:

source("Global.R")

First we load all the scripts used by the program. These contain file names, constants, and utility functions: a lot of infrastructure that is important to understand.

Start by getting all the data sources. These are all downloaded and set up in the first two scripts, "download" and "setup". We read a raster of the percent of land in each grid cell. The grid comes in 1/4-degree cells; the getMask function uses the default cell size of 5 degrees to aggregate the mask to 5-degree cells. That can be changed by passing in a different cell size. This grid includes inland water.

Next we calculate the "inverse" of that grid to get a mask of the percent of water in each cell. I do this just for clarity; there is no need to actually keep it in memory, since we use it and discard it.
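As a sketch of what these two mask steps do (the real quarter-degree land-percent raster is read from a file; here I fake it with random values):

```r
library(raster)

# A fake quarter-degree land-percent grid (0..100); the real one comes from file
landQuarter <- raster(nrow = 720, ncol = 1440)
values(landQuarter) <- runif(ncell(landQuarter), 0, 100)

# Aggregate to 5-degree cells: 5 / 0.25 = 20 quarter-degree cells per side
landMask <- aggregate(landQuarter, fact = 20, fun = mean)

# The "water percent" mask is just the complement
waterMask <- 100 - landMask
```

The aggregated grid is 36 x 72, i.e. 2592 five-degree cells, which matches the cell count used in the weighting discussion below.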

Next we get a list of all stations. The inventory of stations is a dataframe containing all the metadata for each station: lat/lon, population, etc. It has been previously downloaded and cleaned up.

Inv <- getGhcnInventory()

Next we load the temperature normals. This is a zoo object (an extension of time series) written out in the preprocessing steps. It has stations in columns and time in rows; that's vital to know. As noted before, you can just read this object in and plot station temperatures, like so: plot(Anom[ ,1]). There are thousands of stations.
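A toy version of that object, in the same shape, looks like this (the station ids and values are made up):

```r
library(zoo)

# Twelve months of 1900; three hypothetical stations in columns
months <- as.yearmon(1900 + (0:11) / 12)
Anom <- zoo(matrix(rnorm(36), nrow = 12,
                   dimnames = list(NULL,
                                   c("10160355000", "10160360000", "10160390000"))),
            order.by = months)

plot(Anom[, 1])   # plot one station's anomaly series
```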

In an OOP version I would probably turn this into an object type with the following properties:

1. rownames are a time class from zoo.
2. colnames are station ids from an inventory.
3. Measures are anomalies; units are degrees C.

Next we load the SST anomalies. These have been processed from netCDF to a RasterBrick, with data from 1900 to 2009 in one-month increments.

Now we reconcile the inventory with the stations available in the normals file. The inventory has over 7000 stations, but when we processed the normals we dropped many of those, so Anom has about 5000 stations while Inv has 7000. If we want, we can subset Inv further; for example, we can select only "urban" stations, of which there might be only 3000. To get Anom and Inv on the same footing, we have to "intersect" the station ids in Inv with the column names in Anom. To do that we call:

Data <- intersect.InvAnomalies(Inv,Anom)

intersect.InvAnomalies wraps a set-intersect call and returns a list in which the stations in the inventory match those in the anomaly structure Anom. They are ordered correctly, and Anom has been transposed so that rows are stations and columns are time. That's a design choice I need to think about: doing it prevents downstream issues, but if you try plot(Anom[ ,1]) after the transpose, you'll need to flip x and y again. Next we access the list output by the function:

Anom <- Data$Anomalies
Inv <- Data$Inventory

If you want to preserve Anom in its untransposed order, just do AnomTransposed <- Data$Anomalies instead; that leaves Anom available in its untransposed state. Next comes the function that does all the heavy lifting for you.
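A minimal base-R sketch of what intersect.InvAnomalies does, using toy ids rather than the real GHCN inventory:

```r
# Toy inventory and anomaly matrix; the real ones come from preprocessing
Inv  <- data.frame(Id = c("A", "B", "C", "D"), Lon = 1:4, Lat = 5:8,
                   stringsAsFactors = FALSE)
Anom <- matrix(rnorm(6), nrow = 2,
               dimnames = list(NULL, c("B", "D", "E")))  # stations in columns

common  <- intersect(Inv$Id, colnames(Anom))   # ids present in both
InvOut  <- Inv[match(common, Inv$Id), ]        # inventory restricted and reordered
AnomOut <- t(Anom[, common, drop = FALSE])     # transposed: rows are now stations

stopifnot(identical(rownames(AnomOut), InvOut$Id))
```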

Let's walk through this call. In Anom we have stations in rows, and every station has a lon/lat. We want to "map" those points into a grid structure. What kind of grid do we want to map them to? landMask: that tells the routine the kind of grid the points get mapped into.

What points? xy=getInvLonLat(Inv). These are the points associated with the stations in Inv. xy takes a structure, like a matrix, two columns wide, with rows equal to the number of points (stations). The points MUST be in lon/lat order; the utility function getInvLonLat() does that for you.

What are the values for those points by time (layer)? values=Anom. This assigns the time series (each row) to each layer in the brick: layer 1 = month 1, and so on. The number of rows in xy MUST MATCH the number of rows in values. That is why we had to transpose Anom, since it originally has points in columns. We could overcomplicate pointsToRaster and give it a flag to handle the transpose for us, but as it stands you have to do some work outside the call to prep your data.

When multiple points map into the same 5-degree bin, we want to apply a function. Do we add the temps? Subtract them? Take the average? fun=mean, na.rm=T: we take the mean and remove NAs from the calculation. Now the temperatures have been gridded: assigned to a cell and averaged on a monthly basis with all the other stations in that cell.
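Pieced together from the description above, the call is pointsToRaster(landMask, xy = getInvLonLat(Inv), values = Anom, fun = mean, na.rm = T). Here is a self-contained single-month sketch with made-up stations; I use rasterize, the name this function later took in the raster package:

```r
library(raster)

grid5 <- raster(nrow = 36, ncol = 72)      # a 5-degree global grid, like landMask

xy   <- cbind(lon = c(10.2, 11.0, -150.5), # points MUST be in lon/lat order
              lat = c(51.1, 52.3, 60.0))
vals <- c(0.4, 0.8, -1.2)                  # one month of anomalies per station

# The first two stations fall in the same 5-degree cell, so they get averaged
Temps <- rasterize(xy, grid5, field = vals, fun = mean, na.rm = TRUE)
```

Two cells end up with values: one holding the mean of 0.4 and 0.8, one holding -1.2; everything else is NA.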

Weights <- area(LandGrid,na.rm=T,weight=T)

We calculate the weights for each layer in the brick. The weight is simple: every cell has an area, and in any given month only some fraction of the cells have measurements, say 1000 of the 2592 cells in the entire grid. Every month we total the area of the cells with measurements and compute cellArea/totalArea, so the weights per layer sum to 1. The areas are areas on the sphere. Next we multiply the weights by the temperatures by the percent of land in each cell, which gives us the area-weighted temperature over land.
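The weighting arithmetic is easy to illustrate with plain vectors (the numbers are made up; the real areas come from area() on the sphere):

```r
# Hypothetical cell areas; NA marks cells with no measurement that month
areas <- c(100, 120, NA, 90, NA)
temps <- c(0.5, -0.2, NA, 1.1, NA)

# weight = cellArea / total area of cells with measurements
w <- areas / sum(areas, na.rm = TRUE)
stopifnot(abs(sum(w, na.rm = TRUE) - 1) < 1e-12)   # weights sum to 1

# The monthly value is then just the weighted sum, which is why
# cellStats(..., sum) is the right reduction later on
monthly <- sum(w * temps, na.rm = TRUE)
```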

layerNames(Land) <-  as.character(timeline)

landMonthly <- cellStats(Land,sum)

Then I assign the global variable "timeline", which is 1900 to 2009 in months, to the layer names, and collect the stats per layer into a vector named landMonthly. NOTE: you sum to get this value, since each cell has already been weighted. The land is done; next comes the ocean. The code is self-explanatory:

Weights <- area(SST,na.rm=T,weight=T)

layerNames(Ocean) <- as.character(timeline)

oceanMonthly <- cellStats(Ocean, sum)

then the final bits which I’ve explained before:

Coastal <- Ocean + Temps

Global <- cover(Coastal, Ocean, Temps)

layerNames(Global) <- as.character(timeline)

globalMonthly <- cellStats(Global, sum)

Recall that when we add the ocean to the land, we ONLY add those cells they SHARE, which gives us the coast. The area-weighted land fraction and ocean fraction, summed, give us the weighted value for the coast. Finally, we combine the three with cover. Coastal has values in coast cells and NA everywhere else; you cover that with Ocean, and the cells that are ocean get copied into the NAs. That's ocean and coast; cover that with the land, and the land gets added. Lastly you sum the entire lot and you are done.
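The fill-the-NAs behaviour of cover is easy to see on a toy one-row grid:

```r
library(raster)

r     <- raster(nrow = 1, ncol = 3)
coast <- setValues(r, c(0.5, NA,   NA))
ocean <- setValues(r, c(NA,  -0.1, NA))
land  <- setValues(r, c(NA,  NA,   0.9))

# Each NA in coast is filled from ocean, and any remaining NA from land
combined <- cover(coast, ocean, land)
```

The combined grid holds 0.5, -0.1, 0.9: the first non-NA value wins in each cell, exactly the precedence used in cover(Coastal, Ocean, Temps).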

Normally, when people plot these maps they plot the unweighted anomaly. I'll do that later, but for now we have this: a map of the weighted anomaly.
