What’s R vector, Victor?
In this week’s episode of the “Hidden Monads in R” series, I’ll explore the vector aspect of R data structures, and see how the flatmap operation can be quite useful.
Flatmap? Aren’t all maps flat?
The Nobel Prize organisation provides an API with information about the prizes and laureates. We can retrieve a JSON file, which is what I did. I read the file and examine one of the entries below.
# Source: http://api.nobelprize.org/v1/prize.json
prizes <- jsonlite::fromJSON("./prize.json", simplifyDataFrame = FALSE)[["prizes"]]
str(prizes[[11]])
## List of 3
##  $ year     : chr "2023"
##  $ category : chr "physics"
##  $ laureates:List of 3
##   ..$ :List of 5
##   .. ..$ id        : chr "1026"
##   .. ..$ firstname : chr "Pierre"
##   .. ..$ surname   : chr "Agostini"
##   .. ..$ motivation: chr "\"for experimental methods that generate attosecond pulses of light for the study of electron dynamics in matter\""
##   .. ..$ share     : chr "3"
##   ..$ :List of 5
##   .. ..$ id        : chr "1027"
##   .. ..$ firstname : chr "Ferenc"
##   .. ..$ surname   : chr "Krausz"
##   .. ..$ motivation: chr "\"for experimental methods that generate attosecond pulses of light for the study of electron dynamics in matter\""
##   .. ..$ share     : chr "3"
##   ..$ :List of 5
##   .. ..$ id        : chr "1028"
##   .. ..$ firstname : chr "Anne"
##   .. ..$ surname   : chr "L’Huillier"
##   .. ..$ motivation: chr "\"for experimental methods that generate attosecond pulses of light for the study of electron dynamics in matter\""
##   .. ..$ share     : chr "3"
Let’s say I want a character vector containing the full names of Nobel laureates in medicine since 2020. First, I can concoct a function that gets such a vector from a single entry (I know, this one is physics).
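who_got_it <- function(prize) {
  # vapply() returns a 2-row character matrix: row 1 = surnames, row 2 = first names
  # (%||% is available in base R from version 4.4.0)
  laureates <- vapply(
    X = prize[["laureates"]],
    FUN = \(l) c(l[["surname"]] %||% "", l[["firstname"]] %||% ""),
    FUN.VALUE = c("Doe", "John")
  )
  # trimws() tidies up entries where one of the two name parts is missing
  trimws(paste(laureates[2,], laureates[1,]))
}
who_got_it(prizes[[11]])
## [1] "Pierre Agostini" "Ferenc Krausz"   "Anne L’Huillier"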
To achieve my goal, I just have to filter the list accordingly, and lapply the function on the matching entries.
(medicine_since_2020 <- Filter(
    f = \(p) p[["category"]] == "medicine" & as.numeric(p[["year"]]) >= 2020,
    x = prizes
  ) |>
  lapply(who_got_it)
)
## [[1]]
## [1] "Victor Ambros" "Gary Ruvkun"
##
## [[2]]
## [1] "Katalin Karikó" "Drew Weissman"
##
## [[3]]
## [1] "Svante Pääbo"
##
## [[4]]
## [1] "David Julius"      "Ardem Patapoutian"
##
## [[5]]
## [1] "Harvey Alter"     "Michael Houghton" "Charles Rice"
Neat! But I want them in a single vector, so I need an unlist step at the end.
unlist(medicine_since_2020)
##  [1] "Victor Ambros"     "Gary Ruvkun"       "Katalin Karikó"
##  [4] "Drew Weissman"     "Svante Pääbo"      "David Julius"
##  [7] "Ardem Patapoutian" "Harvey Alter"      "Michael Houghton"
## [10] "Charles Rice"
Yes, it’s that simple. This is a flatmap process for vectors: a composition of a map step and a flatten step (lapply and unlist in this case). It almost looks silly to write a flatmap function; after all, it’s not that difficult to lapply and unlist sequentially. But the pattern is used often, so a dedicated function saves time and reduces mistakes. In this case, to be correct, I should have used unlist(recursive = FALSE); otherwise unlist also flattens nested lists, and that would be wrong.
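To make the difference concrete, here is a minimal sketch on a made-up nested list (the nested object is hypothetical and not part of the Nobel data). For a list of plain character vectors, such as medicine_since_2020, both settings give the same result; the difference only shows when the mapped function returns lists.

```r
# Hypothetical example: each element of the outer list is itself a list
nested <- list(list(a = 1, b = "x"), list(a = 2, b = "y"))

# Default (recursive = TRUE): every level is flattened and all values
# are coerced to a common type, here character
flat_all <- unlist(nested)

# recursive = FALSE: only the outermost level is removed,
# so the result is still a list and element types are preserved
flat_one <- unlist(nested, recursive = FALSE)

is.character(flat_all)  # TRUE
is.list(flat_one)       # TRUE
length(flat_one)        # 4
```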
A biology-related problem
Laboratory experiments are often performed in 96-well plastic plates, with 8 rows (labeled A-H) and 12 columns (labeled 1-12). Each microwell is a separate micro-experiment (labeled A1-H12). Let’s generate well labels for such a dataset!
rows <- LETTERS[1:8]
columns <- 1:12 |> sprintf(fmt = "%02i")
So all we have to do is combine one vector of values with another, using the handy paste0() function, right? Wrong.
paste0(rows, columns) |> noquote()
## [1] A01 B02 C03 D04 E05 F06 G07 H08 A09 B10 C11 D12
We’ve only got 12 values instead of 96, and the shorter vector (letters) is recycled as needed. It’s often what you want, so it’s done this way for a good reason. But in this case, we’d prefer to have an each-with-each combination.
Some readers may already have started daydreaming of nested for loops (please don’t). More experienced R programmers would probably go for expand.grid() or rep(rows, each = length(columns)) to match up the vectors, and then paste() them together. But R is a versatile language, and there are many paths to the same destination. A functional R programmer could just take flatmap off the shelf, and here is how.
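For comparison, the two conventional routes mentioned above might look roughly like this. It is only a sketch: the variable names wells_rep, grid and wells_grid are my own, and the vectors are redefined so the block runs on its own.

```r
rows <- LETTERS[1:8]
columns <- sprintf("%02i", 1:12)

# rep(): repeat each row label 12 times; paste0() then recycles
# the 12 column labels across all 96 positions
wells_rep <- paste0(rep(rows, each = length(columns)), columns)

# expand.grid(): the first factor varies fastest, so columns go first
# to get the same row-by-row ordering
grid <- expand.grid(column = columns, row = rows, stringsAsFactors = FALSE)
wells_grid <- paste0(grid$row, grid$column)

length(wells_rep)                 # 96
identical(wells_rep, wells_grid)  # TRUE
head(wells_rep, 3)                # "A01" "A02" "A03"
```

Both work, but each requires thinking about ordering and recycling up front, which is the bookkeeping the flatmap approach avoids.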
For purely didactic reasons, let’s define a non-vectorized paste function, called paste01 (see footnote 1). It takes a single value and a character vector, and returns a character vector – the combination of the value with each member of the vector.
$$paste01 :: Str \rightarrow [Str] \rightarrow [Str]$$
paste01 <- \(x, y) {
  stopifnot(length(x) == 1L)
  paste0(x, y)
}
paste01(rows[1], columns)
## [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10" "A11" "A12"
When we map this function on our rows vector, we almost get what we need.
$$lapply(paste01) :: [Str] \rightarrow [Str] \rightarrow [[Str]]$$
lapply(rows, paste01, columns) |> head(3L)
## [[1]]
## [1] "A01" "A02" "A03" "A04" "A05" "A06" "A07" "A08" "A09" "A10" "A11" "A12"
##
## [[2]]
## [1] "B01" "B02" "B03" "B04" "B05" "B06" "B07" "B08" "B09" "B10" "B11" "B12"
##
## [[3]]
## [1] "C01" "C02" "C03" "C04" "C05" "C06" "C07" "C08" "C09" "C10" "C11" "C12"
It’s a list of vectors, so we have to flatten it. Yup, it’s a flatmap.
$$unlist(lapply(paste01)) :: [Str] \rightarrow [Str] \rightarrow [Str]$$
unlist(lapply(rows, paste01, columns)) |> noquote()
##  [1] A01 A02 A03 A04 A05 A06 A07 A08 A09 A10 A11 A12 B01 B02 B03 B04 B05 B06 B07
## [20] B08 B09 B10 B11 B12 C01 C02 C03 C04 C05 C06 C07 C08 C09 C10 C11 C12 D01 D02
## [39] D03 D04 D05 D06 D07 D08 D09 D10 D11 D12 E01 E02 E03 E04 E05 E06 E07 E08 E09
## [58] E10 E11 E12 F01 F02 F03 F04 F05 F06 F07 F08 F09 F10 F11 F12 G01 G02 G03 G04
## [77] G05 G06 G07 G08 G09 G10 G11 G12 H01 H02 H03 H04 H05 H06 H07 H08 H09 H10 H11
## [96] H12
Tadaa!
So, a flatmap function for vectors can be defined. It takes:
- a vector of values
- a function that turns one of those into a (potentially different kind of) vector
The output type matches the 2nd kind of vector.
$$ flatmap :: [a] \rightarrow (a \rightarrow [b]) \rightarrow [b] $$
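flatmap <- function(X, FUN, ..., USE.NAMES = TRUE) {
  # map step: lapply(); flatten step: unlist() one level only
  # note: unlist() spells its names argument in lower case (use.names)
  unlist(lapply(X, FUN, ...), recursive = FALSE, use.names = USE.NAMES)
}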
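A short usage sketch, applied to the well-label problem. The definition is restated so the block runs on its own; note that unlist() takes its names option as use.names, so that is the spelling used in the call. The last line is my own illustration of the point that the function may return vectors of any length, including zero.

```r
flatmap <- function(X, FUN, ..., USE.NAMES = TRUE) {
  unlist(lapply(X, FUN, ...), recursive = FALSE, use.names = USE.NAMES)
}

rows <- LETTERS[1:8]
columns <- sprintf("%02i", 1:12)

# Each row label is combined with all column labels, then flattened
wells <- flatmap(rows, paste0, columns)
length(wells)        # 96
wells[c(1, 13, 96)]  # "A01" "B01" "H12"

# The mapped function may return vectors of different lengths;
# empty results simply vanish from the output
flatmap(1:3, \(n) rep(n, times = n - 1L))  # 2 3 3
```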
Debrief
Such a function could also be defined as an infix operator, and could take the form of %>>=%, for example. If that looks familiar, it’s not a coincidence: flatmap is the bind operation for the vector monad.
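As a rough sketch of what such an operator could look like (the two-argument form below is my assumption, with the vector on the left and the function on the right):

```r
# Hypothetical infix flatmap: x is a vector, f maps one element to a vector
`%>>=%` <- function(x, f) unlist(lapply(x, f), recursive = FALSE)

columns <- sprintf("%02i", 1:12)
wells <- LETTERS[1:8] %>>=% (\(r) paste0(r, columns))
length(wells)  # 96

# Custom %...% operators are left-associative, so binds chain naturally
chained <- 1:2 %>>=% (\(n) c(n, n * 10L)) %>>=% (\(m) rep(m, 2L))
chained  # 1 1 10 10 2 2 20 20
```

The anonymous functions are wrapped in parentheses to keep the parsing unambiguous when several binds are chained.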
Previously, I assumed that yet another infix operator is not what R needs the most, and I created a function wrapper instead.
The same could be done here! R already has a very similar wrapper, base::Vectorize(), which only needs a tiny tweak: unlist()-ing the results. It’s so trivial that I won’t even write it out here.
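For readers who would still like to see the idea, here is one possible shape. It is a hand-rolled closure in the spirit of Vectorize(), not a modification of it, and the names flatmapped and well_labels are hypothetical.

```r
# Hypothetical wrapper: returns a version of FUN that maps over its
# first argument and flattens the results by one level
flatmapped <- function(FUN) {
  force(FUN)
  function(x, ...) unlist(lapply(x, FUN, ...), recursive = FALSE)
}

well_labels <- flatmapped(paste0)
labels <- well_labels(LETTERS[1:8], sprintf("%02i", 1:12))
length(labels)   # 96
head(labels, 2)  # "A01" "A02"
```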
What excites me much more is the possibility of combining the two ideas: handling NAs and flatmapping in a single bind wrapper function, which would truly allow focusing on the logic, and let the “expert” wrapper deal with the rest. As is customary, some more exploration is needed.
1. Actually, this works equally well with the original paste0, because lapply will map on the first argument anyway, which guarantees that we’ll deal with a single value in each iteration.