Processing satellite image collections in R with the gdalcubes package

July 17, 2019
By

(This article was first published on r-spatial, and kindly contributed to R-bloggers)

[view raw
Rmd
]

The problem

Scientists working with collections and time series of satellite imagery
quickly run into some of the following problems:

  • Images from different areas of the world have different spatial
    reference systems (e.g., UTM zones).
  • The pixel size of a single image sometimes differs among its
    spectral bands / variables.
  • Spatially adjacent image tiles often overlap.
  • Time series of images are often irregular when the area of interest
    covers spatial areas larger than the extent of a single image.
  • Images from different data products or different satellites are
    distributed in diverse data formats and structures.

GDAL and the rgdal R
package
can solve most of
these difficulties by reading all relevant data formats and implementing
image warping to reproject, rescale, resample, and crop images.
However, GDAL does not know about image time series and hence there is
a lot of manual work needed before data scientists can actually work
with these data. Instead of collections of images, data users in many
cases desire a regular data structure such as a four-dimensional
raster data cube with dimensions
x, y, time, and band.

In R, there is currently no implementation to build regular data cubes
from image collections. The stars
package
provides a generic
implementation for processing raster and vector data cubes with an
arbitrary number of dimensions, but assumes that the data are already
organized as an array.

Introduction and overview of gdalcubes

This blog post introduces the gdalcubes R package, aiming at making
the work with collections and time series of satellite imagery easier
and more interactive.

The core features of the package are:

  • build regular dense data cubes from large satellite image
    collections based on a user-defined data cube view (spatiotemporal
    extent, resolution, and map projection of the cube)
  • apply and chaining operations on data cubes
  • allow for the execution of user-defined functions on data cubes
  • export data cubes as netCDF files, making it easy to process
    further, e.g., with
    stars or
    raster.

Technically, the R package is a relatively lightweight wrapper around
the gdalcubes C++ library. The
library strongly builds on GDAL,
netCDF, and
SQLite (the full list of C++
libraries used by gdalcubes is found
here).

This blog post focuses on how to use the R package gdalcubes, more
details about the underlying ideas can be found in our recent open
access paper in the datacube special issue of the DATA
journal
.

Installation

The package can be installed directly from
CRAN:

install.packages("gdalcubes")

Some features and functions used in this blog post are still in the
development version (which will be submitted to CRAN as version 0.2.0),
which currently needs a source install:

# install.packages("remotes")
remotes::install_git("https://github.com/appelmar/gdalcubes_R", ref="dev", args="--recursive")

If this fails with error messages like “no rule to make target …”,
please read here.

Demo dataset

We use a collection of 180 Landsat
8
surface reflectance images,
covering a small part of the Brazilian Amazon forest. If you would like
to work with the dataset on your own (and maybe reproduce some parts of
this blog post), you have two options:

Option A: Download the dataset with full resolution (30 meter pixel
size) here
(67 gigabytes compressed; 190 gigabytes unzipped)

Option B: Download a downsampled version of the dataset that
contains all images at a coarse spatial resolution (300 meter pixel
size) here
(740 megabytes compressed; 2 gigabytes unzipped)

Except for the spatial resolution of the images, the datasets are
identical, meaning that all code samples work for both options but
result images may look blurred when using option B. Here is the code to
download and unzip the dataset to your current working directory (which
might take some time):

# Option A
#download.file("https://uni-muenster.sciebo.de/s/SmiqWcCCeFSwfuY/download", destfile = "L8_Amazon.zip")

# Option B
if (!file.exists("L8_Amazon.zip")) {
  download.file("https://uni-muenster.sciebo.de/s/e5yUZmYGX0bo4u9/download", destfile = "L8_Amazon.zip")

  unzip("L8_Amazon.zip", exdir = "L8_Amazon")
}

Creating image collections

After extraction of the zip archive, we get one directory per image,
where each image contains 10 GeoTIFF files representing the spectral
bands and additional per-pixel quality information. As a first step, we
must scan all available images once, extract some metadata (e.g.
acquisition date/time and spatial extents of images and how the files
relate to bands), and store this information in a simple image
collection
index file. This file does not store any pixel values but
only metadata and references to where images can be found.

First, we simply need to find available GeoTIFF files in all
subdirectories of our demo dataset:

files = list.files("L8_Amazon", recursive = TRUE, full.names = TRUE, pattern = ".tif")
length(files)

## [1] 1800

head(files)

## [1] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_pixel_qa.tif"  
## [2] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_radsat_qa.tif" 
## [3] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_aerosol.tif"
## [4] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band1.tif"  
## [5] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band2.tif"  
## [6] "L8_Amazon/LC082260632014071901T1-SC20190715045926/LC08_L1TP_226063_20140719_20170421_01_T1_sr_band3.tif"

To understand the structure of particular data products, the package
comes with a set of predefined rules (called collection formats) that
define how required metadata can be derived from the data. These include
formats for some Sentinel-2, MODIS, and Landsat products. We can list
all available formats with:

library(gdalcubes)

## Loading required package: Rcpp

## Loading required package: RcppProgress

## Loading required package: jsonlite

## Loading required package: ncdf4

## Using gdalcubes library version 0.1.9999

collection_formats()

##   CHIRPS_v2_0_daily_p05_tif | Image collection format for CHIRPS v 2.0
##                             | daily global precipitation dataset (0.05
##                             | degrees resolution) from GeoTIFFs, expects
##                             | list of .tif or .tif.gz files as input.
##                             | [TAGS: CHIRPS, precipitation]
## CHIRPS_v2_0_monthly_p05_tif | Image collection format for CHIRPS v 2.0
##                             | monthly global precipitation dataset (0.05
##                             | degrees resolution) from GeoTIFFs, expects
##                             | list of .tif or .tif.gz files as input.
##                             | [TAGS: CHIRPS, precipitation]
##                     L8_L1TP | Collection format for Landsat 8 Level 1 TP
##                             | product [TAGS: Landsat, USGS, Level 1, NASA]
##                       L8_SR | Collection format for Landsat 8 surface
##                             | reflectance product [TAGS: Landsat, USGS,
##                             | Level 2, NASA, surface reflectance]
##                     MxD10A2 | Collection format for selected bands from
##                             | the MODIS MxD10A2 (Aqua and Terra) v006 Snow
##                             | Cover product [TAGS: MODIS, Snow Cover]
##                     MxD11A1 | Collection format for selected bands from
##                             | the MODIS MxD11A2 (Aqua and Terra) v006 Land
##                             | Surface Temperature product [TAGS: MODIS,
##                             | LST]
##                     MxD11A2 | Collection format for selected bands from
##                             | the MODIS MxD11A2 (Aqua and Terra) v006 Land
##                             | Surface Temperature product [TAGS: MODIS,
##                             | LST]
##                     MxD13A3 | Collection format for selected bands from
##                             | the MODIS MxD13A3 (Aqua and Terra) product
##                             | [TAGS: MODIS, VI, NDVI, EVI]
##               Sentinel2_L1C | Image collection format for Sentinel 2 Level
##                             | 1C data as downloaded from the Copernicus
##                             | Open Access Hub, expects a list of file
##                             | paths as input. The format works on original
##                             | ZIP compressed as well as uncompressed
##                             | imagery. [TAGS: Sentinel, Copernicus, ESA,
##                             | TOA]
##           Sentinel2_L1C_AWS | Image collection format for Sentinel 2 Level
##                             | 1C data in AWS [TAGS: Sentinel, Copernicus,
##                             | ESA, TOA]
##               Sentinel2_L2A | Image collection format for Sentinel 2 Level
##                             | 2A data as downloaded from the Copernicus
##                             | Open Access Hub, expects a list of file
##                             | paths as input. The format should work on
##                             | original ZIP compressed as well as
##                             | uncompressed imagery. [TAGS: Sentinel,
##                             | Copernicus, ESA, BOA, Surface Reflectance]

The “L8_SR” format is what we need for our demo dataset. Next, we must
tell gdalcubes to scan the files and build an image collection. Below,
we create an image collection from the set of GeoTIFF files, using the
“L8_SR” collection format and store the resulting image collection
under “L8.db”.

L8.col = create_image_collection(files, "L8_SR", "L8.db")
L8.col

## A GDAL image collection object, referencing 180 images with 17  bands
## Images:
##                                       name      left       top    bottom
## 1 LC08_L1TP_226063_20140719_20170421_01_T1 -54.15776 -3.289862 -5.392073
## 2 LC08_L1TP_226063_20140820_20170420_01_T1 -54.16858 -3.289828 -5.392054
## 3 LC08_L1GT_226063_20160114_20170405_01_T2 -54.16317 -3.289845 -5.392064
## 4 LC08_L1TP_226063_20160724_20170322_01_T1 -54.16317 -3.289845 -5.392064
## 5 LC08_L1TP_226063_20170609_20170616_01_T1 -54.17399 -3.289810 -5.392044
## 6 LC08_L1TP_226063_20170711_20170726_01_T1 -54.15506 -3.289870 -5.392083
##       right            datetime
## 1 -52.10338 2014-07-19T00:00:00
## 2 -52.11418 2014-08-20T00:00:00
## 3 -52.10878 2016-01-14T00:00:00
## 4 -52.10878 2016-07-24T00:00:00
## 5 -52.11958 2017-06-09T00:00:00
## 6 -52.09798 2017-07-11T00:00:00
##                                                  srs
## 1 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs 
## 2 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs 
## 3 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs 
## 4 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs 
## 5 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs 
## 6 +proj=utm +zone=22 +datum=WGS84 +units=m +no_defs 
## [ omitted 174 images ] 
## 
## Bands:
##         name offset scale unit       nodata image_count
## 1    AEROSOL      0     1                           180
## 2        B01      0     1      -9999.000000         180
## 3        B02      0     1      -9999.000000         180
## 4        B03      0     1      -9999.000000         180
## 5        B04      0     1      -9999.000000         180
## 6        B05      0     1      -9999.000000         180
## 7        B06      0     1      -9999.000000         180
## 8        B07      0     1      -9999.000000         180
## 9        EVI      0     1      -9999.000000           0
## 10     MSAVI      0     1      -9999.000000           0
## 11       NBR      0     1      -9999.000000           0
## 12      NBR2      0     1      -9999.000000           0
## 13      NDMI      0     1      -9999.000000           0
## 14      NDVI      0     1      -9999.000000           0
## 15  PIXEL_QA      0     1                           180
## 16 RADSAT_QA      0     1                           180
## 17      SAVI      0     1      -9999.000000           0

Internally, the output file is a simple SQLite database. Please notice
that our collection does not contain data for all possible bands (see
image_count column). Depending on particular data download requests,
Landsat 8 surface reflectance data may come e.g. with some
post-processed bands (like vegetation indexes) that can be used if
available.

Creating and processing data cubes

To create a raster data cube, we need (i) an image collection and (ii) a
data cube view, defining how we look at the data, i.e., at which
spatiotemporal resolution, window, and spatial reference system. For a
quick look at the data, we define a cube view with 1km x 1km pixel size,
yearly temporal resolution, covering the full spatiotemporal extent of
the image collection, and using the web mercator spatial reference
system.

v.overview = cube_view(extent=L8.col, dt="P1Y", dx=1000, dy=1000, srs="EPSG:3857", 
                       aggregation = "median", resampling = "bilinear")
raster_cube(L8.col, v.overview)

## A GDAL data cube proxy object
## Dimensions:
##   name        low       high size chunk_size
## 1    t     2013.0     2019.0    7         16
## 2    y  -764014.4  -205014.4  559        256
## 3    x -6582280.1 -5799280.1  783        256
## 
## Bands:
##         name    type offset scale nodata unit
## 1    AEROSOL float64      0     1    NaN     
## 2        B01 float64      0     1    NaN     
## 3        B02 float64      0     1    NaN     
## 4        B03 float64      0     1    NaN     
## 5        B04 float64      0     1    NaN     
## 6        B05 float64      0     1    NaN     
## 7        B06 float64      0     1    NaN     
## 8        B07 float64      0     1    NaN     
## 9        EVI float64      0     1    NaN     
## 10     MSAVI float64      0     1    NaN     
## 11       NBR float64      0     1    NaN     
## 12      NBR2 float64      0     1    NaN     
## 13      NDMI float64      0     1    NaN     
## 14      NDVI float64      0     1    NaN     
## 15  PIXEL_QA float64      0     1    NaN     
## 16 RADSAT_QA float64      0     1    NaN     
## 17      SAVI float64      0     1    NaN

As specified in our data cube view, the time dimension of the resulting
data cube only has 7 values, representing years from 2013 to 2019. The
aggregation parameter in the data cube view defines how values from
multiple images in the same year shall be combined. In contrast, the
selected resampling algorithm is applied when reprojecting and rescaling
individual images.

If we are interested in a smaller area at higher temporal resolution, we
simply need to define a data cube view with different parameters,
including a specific spatiotemporal extent by passing a list as extent
argument to cube_view. Below, we define a data cube view for a 100km x
100km area with 50m pixel size at monthly temporal resolution.

v.subarea = cube_view(extent=list(left=-6320000, right=-6220000, bottom=-600000, top=-500000, 
                      t0="2014-01-01", t1="2018-12-31"), dt="P1M", dx=50, dy=50, srs="EPSG:3857", 
                      aggregation = "median", resampling = "bilinear")

raster_cube(L8.col, v.subarea)

## A GDAL data cube proxy object
## Dimensions:
##   name      low     high size chunk_size
## 1    t   201401   201812   60         16
## 2    y  -600000  -500000 2000        256
## 3    x -6320000 -6220000 2000        256
## 
## Bands:
##         name    type offset scale nodata unit
## 1    AEROSOL float64      0     1    NaN     
## 2        B01 float64      0     1    NaN     
## 3        B02 float64      0     1    NaN     
## 4        B03 float64      0     1    NaN     
## 5        B04 float64      0     1    NaN     
## 6        B05 float64      0     1    NaN     
## 7        B06 float64      0     1    NaN     
## 8        B07 float64      0     1    NaN     
## 9        EVI float64      0     1    NaN     
## 10     MSAVI float64      0     1    NaN     
## 11       NBR float64      0     1    NaN     
## 12      NBR2 float64      0     1    NaN     
## 13      NDMI float64      0     1    NaN     
## 14      NDVI float64      0     1    NaN     
## 15  PIXEL_QA float64      0     1    NaN     
## 16 RADSAT_QA float64      0     1    NaN     
## 17      SAVI float64      0     1    NaN

The raster_cube function always returns a proxy object, meaning that
neither any expensive computations nor any data reads from disk are
started. Instead, gdalcubes delays the execution until the data is
really needed when users call plot(), or write_ncdf(). However, the
result of our call to raster_cube can be passed to data cube
operators. For example, the code below drops all bands except the
visible RGB bands and, again, returns a proxy object.

L8.cube.rgb = select_bands(raster_cube(L8.col, v.overview), c("B02","B03","B04")) 
L8.cube.rgb 

## A GDAL data cube proxy object
## Dimensions:
##   name        low       high size chunk_size
## 1    t     2013.0     2019.0    7         16
## 2    y  -764014.4  -205014.4  559        256
## 3    x -6582280.1 -5799280.1  783        256
## 
## Bands:
##   name    type offset scale nodata unit
## 1  B02 float64      0     1    NaN     
## 2  B03 float64      0     1    NaN     
## 3  B04 float64      0     1    NaN

Calling plot() will eventually start computationn, and hence might
take some time:

system.time(plot(L8.cube.rgb, rgb=3:1, zlim=c(0,1200)))

##    user  system elapsed 
##  16.367   0.605  17.131

Chaining data cube operations

For the remaining examples, we use multiple threads to process data
cubes by setting:

gdalcubes_options(threads=4)

We can chain many of the provided data cube operators (e.g., using the
pipe %>%). The following code will derive the median values of the RGB
bands over time, producing a single RGB overview image for our selected
subarea.

suppressPackageStartupMessages(library(magrittr)) # use the pipe
raster_cube(L8.col, v.subarea) %>%
  select_bands(c("B02","B03","B04")) %>%
  reduce_time("median(B02)", "median(B03)", "median(B04)") %>%
  plot(rgb=3:1, zlim=c(0,1200))

Implemented data cube operators include:

  • apply_pixel apply one or more arithmetic expressions on individual
    data cube pixels, e.g., to derive vegetation indexes.
  • reduce_time apply on or more reducers over pixel time series.
  • reduce_time apply on or more reducers over spatial slices.
  • select_bands subset available bands.
  • window_time apply an aggregation function or convolution kernel
    over moving time series windows.
  • join_bands combines the bands of two indentically shaped data
    cubes.
  • filter_pixel filter pixels by a logical expression on band values,
    e.g., select all pixels with NDVI larger than 0.
  • write_ncdf export a data cube as a netCDF file.

In a second example, we compute the normalied difference vegetation
index (NDVI) with apply_pixel and derive its maximum values over time:

suppressPackageStartupMessages(library(viridis)) # use colors from viridis package

raster_cube(L8.col, v.subarea) %>%
  select_bands(c("B04","B05")) %>%
  apply_pixel(c("(B05-B04)/(B05+B04)"), names="NDVI") %>%
  reduce_time("max(NDVI)") %>%
  plot(zlim=c(-0.2,1), col=viridis, key.pos=1)

User-defined functions

Previous examples used character expressions to define reducer and
arithmetic functions. Operations like apply_pixel and filter_pixel
take character arguments to define the expressions. The reason for this
is that expressions are translated to C++ functions and all computations
then are purely C++. However, to give users more flexibility and allow
the definition of user-defined functions, reduce_time and
apply_pixel also allow to pass arbitrary R functions as an argument.
In the example below, we derive the 0.25, and 0.75 quantiles over NDVI
time series. There is of course no limitation what the provided reducer
function does and it is thus possible to use functions from other
packages.

v.16d = cube_view(view=v.overview, dt="P16D")
raster_cube(L8.col, v.16d) %>%
  select_bands(c("B04", "B05")) %>%
  apply_pixel(c("(B05-B04)/(B05+B04)"), names="NDVI") %>%
  reduce_time(names = c("q1","q3"), FUN = function(x) {
    quantile(x["NDVI",], probs = c(0.25, 0.75), na.rm = TRUE)
  }) %>%
  plot(col=viridis, zlim=c(-0.2,1), key.pos=1)

However, there are some things, users need to keep in mind when working
with user-defined functions:

  1. Users should provide names of the output bands and make sure that
    the function always return the same number of elements.
  2. When executed, the function runs in a new R session, meaning that it
    cannot access variables in the current worskspace and packages must
    be loaded within the function if needed.
  3. Ideally, users should carefully check for errors. A frequent cause
    for errors is the presence of NA values, which are abundant in
    raster data cubes from irregular image collections.
  4. In the current version, only apply_pixel and reduce_time allow
    for passing user-defined functions.

Interfacing with stars

The stars package is much more generic and supports higher dimensional
arrays and hence supports e.g. data from climate model output. It also
does not assume data to be orthorectified, i.e. it works also with
curvilinear grids and hence supports data as from Sentinel-5P. In
contrast, gdalcubes concentrates on multispectral image time series (4d)
only.

gdalcubes currently comes with a simple as_stars() function, writing a
data cube as a (temporary) netCDF file, which is then opened by
read_stars. The stars object holds bands as attributes. If needed
(e.g. for ggplot below), st_redimension converts attributes to a new
dimension.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(stars))

x = raster_cube(L8.col, v.overview) %>%
  select_bands(c("B02","B03","B04")) %>%
  as_stars()
x

## stars object with 3 dimensions and 3 attributes
## attribute(s), summary of first 1e+05 cells:
##       B02              B03              B04         
##  Min.   : 123.8   Min.   : 174.1   Min.   :  112.1  
##  1st Qu.: 209.9   1st Qu.: 397.3   1st Qu.:  247.2  
##  Median : 295.0   Median : 497.9   Median :  380.3  
##  Mean   : 457.6   Mean   : 653.0   Mean   :  563.5  
##  3rd Qu.: 476.7   3rd Qu.: 686.5   3rd Qu.:  662.6  
##  Max.   :9419.0   Max.   :9780.7   Max.   :10066.2  
##  NA's   :79497    NA's   :79497    NA's   :79497    
## dimension(s):
##      from  to   offset delta                       refsys point
## x       1 783 -6582280  1000 +proj=merc +a=6378137 +b=... FALSE
## y       1 559  -205014 -1000 +proj=merc +a=6378137 +b=... FALSE
## time    1   7       NA    NA                      POSIXct FALSE
##                         values    
## x                         NULL [x]
## y                         NULL [y]
## time 2013-01-01,...,2019-01-01

ggplot() + 
  geom_stars(data = slice(st_redimension(x), "time", 5)) + 
  facet_wrap(~new_dim) + 
  theme_void() +
  coord_fixed() +
  scale_fill_viridis(limits=c(0, 2000)) 

Future work

There are many things, which we didn’t cover in this blog post like
applying masks during the construction of data cubes. More importantly,
a question we are very much interested in at the moment is how far we
can go with gdalcubes and stars in cloud computing envrionments, where
huge image collections such as the full Sentinel-2 archive are already
stored. This will be the topic of another blog post.

We also have a lot of ideas for future developments but no real
schedule. If you would like to contribute, get involved, or if you have
further ideas, please get in touch! Development and discussion takes
place on GitHub (R package on
GitHub
, C++ library on
GitHub
).

To leave a comment for the author, please follow the link and comment on their blog: r-spatial.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)