MODIS processing with R, GDAL, and the NCO tools

June 3, 2010

(This article was first published on PlanetFlux, and kindly contributed to R-bloggers)

I use MODIS data for analysis of vegetation dynamics around the world.  The native HDF file format provided by NASA is great for archiving the data (it's amazing how much information they include in each file), but unfortunately there aren't many tools for directly (and easily) extracting data stored across files like there are for the NetCDF format.  I wanted to be able to use the NCO tools to extract a timeseries for a subsetted region from data stored across many files.  This is quite easy with ncrcat and similar NCO utilities.  The trouble is that the NCO tools require the data to be in netCDF-3 or netCDF-4 classic format to do this.  I tried various HDF-to-netCDF conversion tools but eventually decided it was easiest to convert HDF -> geotiff -> netCDF.  The hierarchical nature of the HDF files makes them difficult to parse, and so far the netCDF tools are not able to work with them unless they are 'flattened' to a single group.

So to use the NCO tools with MODIS data, I had to develop a processing routine that will convert them to netcdf-4 classic format.  This post will show you how.

I generally use R for my data processing and workflow scripting, though much of this could be done with shell scripting or other languages.  The code below is specific to my situation and will almost certainly require some editing if you want to use it.  I'm posting it to serve as a general guideline on how one might do this, not as a ready-to-use, generally applicable set of functions.  I appreciate any suggestions on improvements.

First you need to identify which tiles you need by looking at the MODIS sinusoidal tile grid map (tiles are indexed as hNNvNN).
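If you'd rather compute the tile indices than read them off the map, they follow directly from the geometry of the MODIS sinusoidal grid (36 x 18 tiles covering the projected globe).  This helper is my own sketch, not part of the original workflow:

```python
import math

# MODIS sinusoidal grid constants (meters)
EARTH_RADIUS = 6371007.181      # radius of the sphere used by the MODIS sinusoidal projection
TILE_SIZE = 1111950.5196666666  # width/height of one tile (grid is 36 x 18 tiles)
X_MIN = -20015109.354           # western edge of the projected grid
Y_MAX = 10007554.677            # northern edge of the projected grid

def latlon_to_tile(lat, lon):
    """Return the MODIS sinusoidal tile id (e.g. 'h25v04') containing a lat/lon point."""
    # forward sinusoidal projection of the point
    x = EARTH_RADIUS * math.radians(lon) * math.cos(math.radians(lat))
    y = EARTH_RADIUS * math.radians(lat)
    # count whole tiles from the upper-left corner of the grid
    h = int((x - X_MIN) // TILE_SIZE)
    v = int((Y_MAX - y) // TILE_SIZE)
    return "h%02dv%02d" % (h, v)

print(latlon_to_tile(35.7, 139.7))  # central Japan, one of the tiles listed below
```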

Then run something like this to download the entire MOD13Q1 (vegetation indices) data from MODIS.  Get ready for lots of data (the following list of tiles results in 269GB!):

### Some specifications
ftpsite="" # FTP site for the vegetation index product
hdffolder="1_HDF" # folder to hold the downloaded HDF files
ofolder="2_netcdf" # this is where the output netCDF files will go
bandsubset=paste(1) # keep only these bands from the HDF file
regionbounds="49 113 30 146" # region boundaries for clipping
## Get tiles to download - read these off the map
## They will be mosaicked and clipped later - the format is important, see the wget command below
tiles=c("h25v04","h26v04","h26v05","h27v04","h28v04","h27v05","h28v05","h29v05")
## build the comma-separated --accept pattern from the tile list
## (this definition was missing from the post; wget matches file names against these globs)
searchstring=paste("*",tiles,"*",sep="",collapse=",")
### download HDFs
### Use wget to recursively download all files for the selected tiles - will not overwrite unless the source is newer
system(paste("wget -S --recursive --no-parent --no-directories -N -P",hdffolder,"--accept",searchstring,"--reject \"*xml\"",ftpsite))

This will put all the files in the hdffolder (which I set to "1_HDF").  You now have each tile for each time period as a separate file.  I find it convenient to mosaic the tiles, optionally warp them to another projection, and subset them to a region smaller than all of the tiles together.  To do this I use the following functions (note that some details are specific to extracting NDVI; if you want something else, they will need editing).  The first converts a MODIS date code to the R Date format:

modday2date<-function(i) as.Date(paste(c(substr(i,2,5),substr(i,6,9)),collapse="-"),"%Y-%j")
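The MODIS date code is simply "A" followed by the year and the day of year (e.g. A2010145 is day 145 of 2010), so the conversion is a one-liner in most languages.  The same logic in Python, for reference (a sketch, not from the original post):

```python
from datetime import datetime, date

def modday2date(code):
    """Convert a MODIS date code like 'A2010145' (year + day-of-year) to a date."""
    # strip the leading 'A', then parse year (%Y) and day-of-year (%j)
    return datetime.strptime(code[1:], "%Y%j").date()

print(modday2date("A2010145"))  # -> 2010-05-25
```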
The second converts a geotiff to a netcdf with a single-date record dimension:

## Create netcdf file from modistool geotiff output
## (requires library(rgdal) and library(RNetCDF); the function wrapper and the
##  RNetCDF call names were reconstructed here, so argument names are illustrative)
tif2nc<-function(geotiff,ncfile,varname,startday,utdates,missing,scale,range){
## Read in geotiff to set netCDF attributes
print(paste("Importing ",geotiff))
x=readGDAL(geotiff)
## create file and open for editing
print("Defining NetCDF")
nc=create.nc(ncfile,clobber=T) # create the file
nc=open.nc(ncfile,write=T) # open a connection to the netcdf file with write permission
## Set dimensions
dim.def.nc(nc,"latitude",x@grid@cells.dim[2],unlim=FALSE)
dim.def.nc(nc,"longitude",x@grid@cells.dim[1],unlim=FALSE)
dim.def.nc(nc,"time",dimlength=length(utdates),unlim=TRUE)
## time coordinate
var.def.nc(nc,"time","NC_SHORT","time")
var.put.nc(nc,"time",utdates,start=NA,count=NA,na.mode=0)
att.put.nc(nc,"time","units","NC_CHAR",paste("days since ",startday," 0",sep=""))
att.put.nc(nc,"time","standard_name","NC_CHAR","time")
## longitude coordinate
var.def.nc(nc,"longitude","NC_DOUBLE","longitude")
att.put.nc(nc,"longitude","units","NC_CHAR","degrees_east")
att.put.nc(nc,"longitude","standard_name","NC_CHAR","longitude")
var.put.nc(nc,"longitude",seq(x@bbox[1,1],x@bbox[1,2],x@grid@cellsize[1]),start=NA,count=NA,na.mode=0)
## latitude coordinate
var.def.nc(nc,"latitude","NC_DOUBLE","latitude")
att.put.nc(nc,"latitude","units","NC_CHAR","degrees_north")
att.put.nc(nc,"latitude","standard_name","NC_CHAR","latitude")
var.put.nc(nc,"latitude",seq(x@bbox[2,1],x@bbox[2,2],x@grid@cellsize[2]),start=NA,count=NA,na.mode=0)
## data variable
var.def.nc(nc,varname,"NC_SHORT",c("longitude","latitude","time"))
att.put.nc(nc,varname,"missing_value","NC_DOUBLE",missing)
att.put.nc(nc,varname,"units","NC_CHAR",varname)
att.put.nc(nc,varname,"standard_name","NC_CHAR",varname)
## Process the data
print("Processing data and writing to disk")
notnull=x@data!=missing
x@data[notnull][x@data[notnull]>range[2]/scale]=missing # get rid of occasional weird values
## write the data to the netCDF file
var.put.nc(nc,varname,as.matrix(x)[,ncol(as.matrix(x)):1],start=c(1,1,1),count=c(x@grid@cells.dim[1],x@grid@cells.dim[2],1))
close.nc(nc)
}
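The time values stored in the "days since" coordinate above are just whole-day offsets from a reference date, which is plain date arithmetic.  A quick sketch (not from the original post; the reference date here is hypothetical):

```python
from datetime import date

def days_since(d, startday):
    """Whole-day offset of date d from the reference date startday --
    the value stored in a netCDF 'days since <startday>' time coordinate."""
    return (d - startday).days

start = date(2010, 1, 1)                     # hypothetical reference date
print(days_since(date(2010, 5, 25), start))  # -> 144
```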

And the third puts it all together: it first uses the mrtmosaic tool to mosaic the tiles and subset to the band of interest, then uses the resample tool (available at the link above) to convert the HDF to a geotiff, and finally calls the function above to convert that to a netCDF:

### Function to process the files using the commands above
### (the wrapper and temporary-file names were reconstructed and are illustrative)
procfile<-function(i,...){ # i is one MODIS date code, e.g. "A2010145"
tparams=tempfile() # text file listing the HDF tiles for this date
tmos=tempfile() # mosaicked HDF
tclip=tempfile() # clipped and reprojected geotiff
ofile=paste(ofolder,"/",prefix,".",i,".nc",sep="") # output netCDF file
f=file.exists(ofile) # file exists?
if(f) print(paste(ofile," already exists, moving to next file....")) # if it exists, skip it
if(!f) {
## write the list of HDF files for this date
writeLines(paste(hdffolder,allfiles[grep(i,allfiles,fixed=T)],sep="/"),tparams)
## Mosaic and subset
system(paste("mrtmosaic -i ",tparams," -s ",bandsubset," -o ",tmos,sep=""))
## Clip and reproject
if(!file.exists("ModisToolParams.prm")) stop("Parameter file missing, stopping process")
system(paste("resample -p ModisToolParams.prm -i ",tmos," -o ",tclip," -t GEO -l \'",regionbounds,"\' -a INPUT_LAT_LONG -r CC",sep=""))
## convert the geotiff to netCDF with the function above
tif2nc(tclip,ofile,...) # remaining tif2nc arguments passed through
file.remove(tmos,tclip,tparams); gc()
}
print(paste(which(dates%in%i), "out of ", length(dates)))
}

You can then run these functions to process a folder of HDF files with the following commands.  If you are on a *nix system with multiple cores, I recommend using the multicore package to parallelize the processing.  Otherwise you'll need to change the mclapply() below to lapply():
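The lapply/mclapply swap works because both are just a map over the date list, serial vs. parallel.  The same pattern in Python, as a standard-library sketch (the worker function here is a placeholder, not the real processing step):

```python
from concurrent.futures import ThreadPoolExecutor

def process(date_code):
    """Stand-in for the per-date processing function above."""
    return date_code.lower()  # placeholder work

dates = ["A2010145", "A2010161"]

# serial, like lapply()
serial = list(map(process, dates))

# parallel, like mclapply() -- same call shape, work spread over workers
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(process, dates))

print(serial == parallel)  # -> True: same results either way
```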

allfiles=as.character(list.files(path = paste(getwd(),hdffolder,sep="/"))) # build file list
dates=unique(do.call("rbind",strsplit(allfiles,".",fixed=T))[,2]) # get unique dates
prefix=do.call("rbind",strsplit(allfiles,".",fixed=T))[1,1] # get prefix of file type (e.g. "MOD13Q1")
### Run the files - will not overwrite existing mosaics
mclapply(dates,procfile) # apply the per-date processing function defined above (name assumed)
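The date and prefix extraction above works because MODIS file names are dot-delimited fields.  For example (a Python sketch; the file name is illustrative):

```python
# MODIS file names are dot-delimited:
# <product>.<date code>.<tile>.<collection>.<production stamp>.hdf
fname = "MOD13Q1.A2010145.h25v04.005.2010162033700.hdf"  # illustrative name
parts = fname.split(".")

prefix = parts[0]    # product prefix, e.g. "MOD13Q1"
datecode = parts[1]  # acquisition date code, e.g. "A2010145"
tile = parts[2]      # tile id, e.g. "h25v04"

print(prefix, datecode, tile)
```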

Now you can use the NCO commands to extract a timeseries from these files, perhaps for only a subset of the data, using ncrcat.
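That final step might look something like this (a sketch with hypothetical file names; ncrcat concatenates files along the record/time dimension, and its -d flag hyperslabs a dimension by coordinate range when given decimal limits):

```python
import glob
import subprocess

# concatenate all per-date files along the record (time) dimension,
# keeping only a lat/lon window -- ncrcat's -d flag takes dim,min,max
infiles = sorted(glob.glob("2_netcdf/*.nc"))  # hypothetical output folder from above
cmd = ["ncrcat",
       "-d", "latitude,35.0,45.0",            # subset a latitude range
       "-d", "longitude,120.0,140.0",         # subset a longitude range
       *infiles, "ndvi_timeseries.nc"]        # inputs..., then the output file

print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run ncrcat
```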
