Using stars for remote big Earth Observation data processing

[This article was first published on r-spatial, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

[view raw


This is the third blog on the
stars project, an R-Consortium
funded project for spatiotemporal tidy arrays with R. It shows how
stars can be used in a client/server setup, where large amounts of
Earth observation data are held on a server, and an (R) client is used
to query the data, process it on the server side, and retrieve a
(small) portion, for example to plot.

This blog is a proof of concept, and was developed for the use case of
remotely processing a set of Sentinel
images; no serious effort
has been made so far to generalize this to other image sources. We use the
stars package both on server and client side.

The data

Our dataset consists of 36 zip files, containing Sentinel
scenes (or images, tiles,
granules). Together they are 20 Gb. GDAL has a
for such zip files, and
hence stars can read them directly. Each zip file contains over 100
files, of which 15 are JPEG2000 files with imagery having 10, 20 or 60 m
resolution. We focus on the 10 m bands (Red, Green, Blue, Near
Infrared); single images of these have 10980 x 10980 pixels, but also
come with overview (downsampled) levels fo 20, 40, 80 and 160 m. That is
convenient, because for plotting it does not make much sense to
transport more pixels than our screen has.

The data were downloaded from ESA’s
SciHub, which first looks cumbersome
but in the end it was quite easy using aria2c. (The aria2 file
contains my user credentials.)


The server side is set up using plumber:

r <- plumb("server.R")

The server.R script is now explained in sections.

Reading data

The following script reads all the files ending in .zip, and creates
names from them that are understood by the GDAL Sentinel 2
. The /vsizip prefix
makes sure that the zip files is read as if it were unzipped. Function
read_stars_meta reads as much as possible metadata from the set of
files (and their subdatasets: here only subdataset 1, which has the 10 m
x 10 m imagery), and returns it in a (still) somewhat chaotic nested
list. The object md is an sf-tibble with names and bounding boxes (as
polygon feature geometries) of every image.

library(jsonlite) # base64_enc

# load some imagery meta data
# from a S2 .zip file, create a readable gdal source:
s2_expand_zip = function(s) {
    paste0("/vsizip/", s, "/", sub(".zip", ".SAFE", s), "/MTD_MSIL1C.xml")
lst = list.files(pattern = "*.zip")
l = lapply(s2_expand_zip(lst), read_stars_meta, sub = 1)
bb <-, lapply(l, function(x) st_as_sfc(st_bbox(st_dimensions(x)))))
fn = sapply(l, function(x) attr(x, "file_names"))
md = st_sf(tibble(file = fn, geom = bb))

# global database
data = list(md = md)

Serving /data

The following function simply logs every request to screen and forwards

#* Log some information about the incoming request
#* @filter logger
  cat(as.character(Sys.time()), "-", 
    req$REQUEST_METHOD, req$PATH_INFO, "-", 
    req$HTTP_USER_AGENT, "@", req$REMOTE_ADDR, ":", req$postBody, "\n")

The final two functions do the real work: they read an expression,
evaluate it and return the result (get), or they store the result on
the server side (put):

# plumber REST end points /data:
#* @get  /data
#* @post /data
get_data <- function(req, expr = NULL) {
    if (is.null(expr))
        base64_enc(serialize( eval(parse(text = expr), data), NULL)) # to char

#* @put /data
put_data <- function(req, name, value) {
    data[[name]] <<- unserialize(base64_dec(fromJSON(value)))

This means that GET http://x.y/data gives a listing of available
datasets, and GET http://x.y/data/obj retrieves object obj. PUT
uploads objects to the database (called data).


On the client side, we need to also do some work:

I/O with the server:

library(httr)     # GET, POST, PUT
library(jsonlite) # base64_dec, base64_enc, toJSON, fromJSON
## Loading required package: methods

get_data = function(url, expr = NULL) {
    if (is.null(expr))
        fromJSON( content(GET(url), "text", encoding = "UTF-8"))
            content(POST(url, body = list(expr = expr), encode = "json"),
                "text", encoding = "UTF-8")

put_data = function(url, name, value) {
    value = toJSON(base64_enc(serialize(value, NULL)))
    PUT(url, body = list(name = name, value = value), encode = "json")

Somewhat confusingly I gave these the same name as on the server side,
but these exist in a different R session: the one you interact with
directly, the client. For get_data, the return value is decoded and
unserialized to get a copy of the original R object on the server. This
R object may have been present at the server, or may have been created
on the server: since the server uses eval, it takes arbitrary R

Example client session: metadata

We can get a view of the data present on the server by

url = "http://localhost:8000/data"
get_data(url) # list
## [1] "md"

Next, we can retrieve the md (metadata) object:

## Loading required package: abind
## Loading required package: sf
## Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3
library(tibble)   # print
md = get_data(url, "md")
## Simple feature collection with 36 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 3e+05 ymin: 5590200 xmax: 809760 ymax: 6e+06
## epsg (SRID):    32631
## proj4string:    +proj=utm +zone=31 +datum=WGS84 +units=m +no_defs
## # A tibble: 36 x 2
##    file                                                               geom
##    <chr>                                                     <POLYGON [m]>
##  1 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((499980 5890200, 609780 5890200,…
##  2 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5890200, 709800 5890200, …
##  3 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((699960 5690220, 809760 5690220,…
##  4 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((3e+05 5690220, 409800 5690220, …
##  5 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((499980 5590200, 609780 5590200,…
##  6 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5590200, 709800 5590200, …
##  7 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5690220, 709800 5690220, …
##  8 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5790240, 709800 5790240, …
##  9 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((6e+05 5890200, 709800 5890200, …
## 10 SENTINEL2_L1C:/vsizip/S2A_MSIL1C_20… ((699960 5590200, 809760 5590200,…
## # ... with 26 more rows

Query data

We will now create an RGB mosaic of the data, while reading them at
overview level 3, meaning 160 m x 160 m pixels.

# select a country:
nl = st_as_sf(raster::getData("GADM", country = "NLD", level = 0)) %>%

file = md[nl,]$file
plot(st_geometry(nl), axes = TRUE)
s = sapply(file, function(x) {
    expr = paste0("read_stars(\"", x, "\", options = \"OVERVIEW_LEVEL=3\", NA_value = 0)")
    r = get_data(url, expr)
    image(r, add = TRUE, rgb = 4:2, maxColorValue = 15000)
plot(st_geometry(md), border = 'yellow', add = TRUE, lwd = .8)
plot(st_geometry(nl), add=TRUE, border = 'orange', lwd = 2)

The expression file = md[nl,]$file selects the file names from the
image metadata md for those scenes (images) that intersect with the
country (nl).

On top of the image (which shows mostly clouds) we plotted the country
boundary, and the tile boundaries.

Process: st_apply, NDVI

This way, we can also create new variables on the server side and show
them. As an example, we can use st_apply which does an apply to the arrays
in a stars object (and takes care of the dimensions afterwards!), and
define an
function, which we upload to the server:

ndvi = function(x) (x[4]-x[1])/(x[4]+x[1])
put_data(url, "ndvi", ndvi)
## Response [http://localhost:8000/data]
##   Date: 2018-03-23 03:01
##   Status: 200
##   Content-Type: application/json
##   Size: 2 B
## [1] "md"   "ndvi"
get_data(url, "ndvi")
## function(x) (x[4]-x[1])/(x[4]+x[1])

Next, we can query the server giving it an expression that applies the
ndvi function to the (overview level 3) imagery, and returns the
single-band (NDVI) image:

plot(st_geometry(nl), axes = TRUE)
s = sapply(file, function(x) {
    expr = paste0("st_apply(read_stars(\"", x, "\", options = \"OVERVIEW_LEVEL=3\", NA_value = 0),
        1:2, ndvi)")
    r = get_data(url, expr)
    image(r, add = TRUE, zlim = c(-1,1), col = colorRampPalette(c(grey(.1), grey(.5), 'green'))(10))
plot(st_geometry(nl), add = TRUE) # once more

The diagonal line is caused by the fact that we combine imagery from
different satellite overpasses, of which most are pretty cloudy.

Computing NDVI is just one example; the Index
200 more indexes that can be computed from Sentinel 2 data.

Limitations and lessons learned

  • As a prototype, this implementation has strong limitations. To use
    this in an architecture with multiple, concurrent users would
    require authentication, user management, and concurrent R sessions.
  • if we try to read one full 10 m scene with 4 bands from the server,
    we see the error long vectors not supported, indicating that the
    encoded data is over 2 Gb; file download and upload will be needed
    for larger data sizes, or more clever interaction patterns such as a
    web map service or
    tile map service.
  • to pass R objects, we need to serialize them, base64_enc that,
    and wrap the result in toJSON. On the receiver side, we need to do
    the reverse (fromJSON, base_64_dec, unserialize).
  • never try to put an R expression or its base64 encoding in a url
    (using GET) – use POST and put it in the message body; GET
    will do funny things with funny characters (even =).
  • the way we submit expressions looks pretty messy using
    character strings, there must be a more tidy way to do this.

Try this at home

No blog without a reproducible example! Start two R sessions, one for a
client, one for a server. On the server sessions, run:

devtools::install_github("stars") # requires sf 0.6-1, which is now on CRAN
r <- plumb(system.file("plumber/server.R", package = "stars"))

and keep this running! On the client session, run

source(system.file("plumber/client.R", package = "stars"),, echo = TRUE)

and watch for the logs from the server. In the client, you can now


to see an object that was read in the server session.

The way forward

Where this example worked locally, and thus automatically with “small”
large data (20 Gb imagery), in next steps we will develop and deploy
this with “large” large data. Amazon hosts large collections of Landsat
and Sentinel 2
data, and with
the setup sketched above it should be fairly straightforward to directly
compute on these data, given access to a cloud back-end machine. The R
Consortium has sponsored the Earth data processing backend for testing
and evaluating

project, which has a budget for a storage and/or compute back-end for
evaluation purposes.

openEO is a project that develops
an open API to connect R, python and javascript clients to big Earth
observation cloud back-ends in a simple and unified way. That is a
similar ambition, but larger. openEO uses a process graph (directed,
acyclic graph) to represent the expression, which must be acceptable for
all back-ends. Here, we have R on both sides of the wire, which makes
life much more simple. But openEO is also a cool project: check out the
proof of concept videos!

Earlier stars blogs

To leave a comment for the author, please follow the link and comment on their blog: r-spatial. offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)