Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I’ve been learning Haskell for a few years now and I am really liking a lot of the features, not least the strong typing and functional approach. I thought it was lacking some of the things I missed from R until I found the dataHaskell project.
There have been several attempts recently to enhance R with some strong types, e.g. vapour, typr, using {rlang}’s checks, and even discussions about implementations at the core level, e.g. in September 2025, continued in November 2025. While these try to bend R towards types, perhaps an all-in solution makes more sense.
In this post I’ll demonstrate some of the features and explain why I think it makes for a good (great?) data science language.
I’ve posted more than a handful of times about Haskell, but maybe not so much
about real-world usage – more toy problems (e.g. I did a lot of Advent of Code
using it last year). I’ve been working towards using it more, and even managed
to get a custom {knitr} engine working – here’s the special sauce that makes a
```{haskell} block work:
knitr::knit_engines$set(haskell = function(options) {
  code <- options$code
  codefile <- tempfile(fileext = ".hs")
  codefile_brace <- tempfile(fileext = ".hs")
  on.exit(file.remove(codefile, codefile_brace))
  writeLines(c(":script dataframe", "", code), con = codefile)
  system2("hscript", codefile, stdout = codefile_brace)
  out <- system2(
    file.path(path.expand("~"), ".ghcup/bin/ghc"),
    c("-e", paste0("':script ", codefile_brace, "'")),
    stdout = TRUE
  )
  knitr::engine_output(options, code, out)
})
This writes the lines of code to a temporary file, prepended with some
configuration options, then runs essentially ghc -e ':script file.txt' and
deletes the temporary file. For the purposes of making cleaner code blocks, the
code detours through an awk script
which inserts some :{ blocks around
multi-line statements,
helping to reproduce how these look in a Jupyter notebook.
The result is then shown in the code block, so this is a “live” output
map (+5) [2..8]
## [7,8,9,10,11,12,13]
Neat, right?
Because I’m treating each code block as an independent script, it means there is
some repetition between blocks. I’ll hide that away with some judicious echo
options where necessary, but otherwise each block should be able to be run
as a ‘script’ with the right pre-processing.
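For reference, the ghci block markers that the pre-processing inserts look like this – `addOne` is just a toy definition for illustration; `:{` and `:}` are standard ghci syntax for multi-line input:

```haskell
:{
-- a multi-line definition; :{ and :} tell ghci to parse it as one unit
addOne :: Int -> Int
addOne x = x + 1
:}
addOne 41
## 42
```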
A Brief Intro to Haskell Syntax
Haskell is a bit different if you’ve only ever seen R or Python, but it doesn’t
take too much effort to understand what’s going on. Firstly, while parentheses are
used for function calls in R, a space is used in Haskell, so instead of sum(x)
you use sum x. Parentheses are still used for grouping together combinations of
things that need to be evaluated together.
Lists are a fundamental data type and are denoted by square brackets, e.g. [3,4,5]
and they need to contain a single type. For a strongly typed language, that shouldn’t
come as a surprise. A single number might be of type Double and a list of these
would be of type [Double].
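A quick way to see the distinction is to ask ghci for the types (standard ghci, nothing dataframe-specific here):

```haskell
x = 3.5 :: Double
xs = [3,4,5] :: [Double]
:t x
:t xs
## x :: Double
## xs :: [Double]
```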
If you’re worried that you’ve become too reliant on a piped workflow, fear not! dataHaskell’s dataframe package adds the familiar pipe operator
[2,8,7,10,1,9,5,3,4,6] |> reverse |> take 5
## [6,4,3,5,9]
with the important distinction that it passes the left side to the end of the right side (not to the first argument) which flows cleaner given how Haskell functions are typically written, e.g.
take 3 [1,2,3,4,5,6]
-- vs
[1,2,3,4,5,6] |> take 3
## [1,2,3]
## [1,2,3]
The line in the middle there demonstrates that comments start with two hyphens --,
or for multi-line comments, between {- and -}.
If you need to write a function (for which you use camelCase) you can annotate
it with a definition, though the compiler can figure this out itself most of the
time (plus it helps for readability). The way to do this is with one extra line
above the implementation. If the type is generic, you can use a placeholder e.g.
a rather than a specific type. Technically all functions take only one argument,
possibly returning another function (see currying)
but this is more explicit in the signature; e.g. [a] -> a -> [a] represents a
function which takes a list and a value and returns a list
appendValueToList :: [a] -> a -> [a]
appendValueToList xs y = xs ++ [y]

appendValueToList [2,4,6] 8
appendValueToList ["f", "o", "o"] "t"
## [2,4,6,8]
## ["f","o","o","t"]
The period is used for function composition, i.e.
import Data.List (sort)

(reverse . sort) [2,8,7,10,1,9,5,3,4,6]
## [10,9,8,7,6,5,4,3,2,1]
applies a composed ‘sort and reverse’ operation to the list. The import is there
because the ‘base’ library (“Prelude”) doesn’t have the sort function, so it’s
imported. There are actually a few of these which need to be imported to use the
code I’m showing below, but it’s inserted into the codeblocks via the :script dataframe
line in the engine definition above. That calls out to an executable which runs
the code block as if it was contained in a main function in a full program,
which enables us to use IO operations inline, such as reading from files and
printing results. That all gets a little trickier without this ‘scripting’ context,
but I’m here to make the point that such a scripting context works well for
doing data science.
So, what would one use this for?
I saw this (follow-up) post from Claus Wilke about Python not being a great language for data science and while I concur with the points made there, I do believe some of them are personal preference. I’m a proponent of “use the tools you’re comfortable with” and I can’t argue with however many thousands of data scientists are successfully using Python to do data science.
The point about “what makes for a good data science language” made me pause to think and I came to the conclusion that Haskell actually ticks the boxes, at least with the dataHaskell ecosystem and the dataframe package. What follows is not to be taken as a pile-on against Python or even a complaint about R, but rather something in the style of “if you like that, check this out!”
Lots of languages seem to have some sort of dataframe these days – thanks R! – e.g. Python has Pandas/Polars, Julia has DataFrames.jl, even Kotlin has a DataFrame. Haskell does, too, with dataframe, and I’ve been learning how to use this recently.
The points made in Claus’ post were that the features which make R a better language for data science over Python are (paraphrasing):
- call-by-value semantics (non-mutability)
- built-in missing values
- built-in vectorization
- non-standard evaluation (NSE)
Let’s look at how Haskell deals with each of these.
Non-mutability
Claus details how Python’s call-by-reference semantics enables one to modify variables unintentionally, since they’re scoped across functions. Haskell certainly doesn’t have this problem – everything is immutable, and functions are “pure” (no side-effects, though you can interact with typed side-effect ‘instructions’). If you want to “do” anything to a data object you pass it into a function and get a new object out. There’s no risk of accidentally modifying a variable, but of course the downside of this is that you can’t do so without a function. While in R it’s straightforward to do
a <- c(2, 9, 6)
a[2] <- 4
a
## [1] 2 4 6
in Haskell that sort of thing is off limits – you can use an operator to extract a value from a list (0-indexed), e.g.
a = [2,9,6]
a !! 1
## 9
but there’s no way to assign the second element to some other value. Instead, you need to break the list apart and stitch the new value inside
a = [2,9,6]

updateSecond :: [a] -> a -> [a]
updateSecond (x:_:z) y = x : y : z
updateSecond xs _ = xs

updateSecond a 4
## [2,4,6]
No risk of accidentally writing that, I’m sure.
I’ve also included the type definition in this case which reads as “a function
which takes a list of some type a ([a]) and a single value of type a and
returns a list of that same type, [a].” FYI, this is one example where you may
need the definition to be enclosed between :{ and :}, if you’re running
interactively in ghci, but here I’m using the
pre-processing trick
mentioned above.
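If you find yourself doing this often, a small generic helper works for any index – `updateAt` here is my own hypothetical function, not from any library:

```haskell
-- hypothetical helper: return a new list with the element at
-- (0-indexed) position i replaced; out-of-range indices leave it unchanged
updateAt :: Int -> a -> [a] -> [a]
updateAt i y xs = [ if j == i then y else x | (j, x) <- zip [0 ..] xs ]

updateAt 1 4 [2,9,6]
## [2,4,6]
```

Note that this still builds a new list rather than mutating the old one – the original `[2,9,6]` is untouched.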
A tick for truly immutable data – the only way to “alter” a value is to operate on it with a function and reassign it.
Built-in missing values
This is somewhere that Haskell shines – if you want a value that might not be
available in R you use an NA (which is a shorthand for whichever flavour/class
you actually want, e.g. NA_character_). Using one of these in any mathematical
calculation ‘poisons’ it and returns NA, e.g.
sum(1, NA, 3)
## [1] NA
To avoid this, most functions offer a na.rm argument which instructs them to remove
the missing values prior to performing the calculation
sum(1, NA, 3, na.rm = TRUE)
## [1] 4
What’s happening here is that R encodes a value that is maybe missing. Haskell
formalises this with the Maybe type (helper functions live in the Data.Maybe
module) and you have to be explicit in dealing with a missing value (Nothing)
or a definitely-not-missing value (Just x)
non_missing = [1, 2, 3, 4]
has_missing = [Just 1,Just 2,Nothing,Just 4]

:t non_missing
:t has_missing
## non_missing :: Num a => [a]
## has_missing :: Num a => [Maybe a]
where we see that has_missing is a list of Maybe values.
sum non_missing
## 10
You can’t just sum the latter; it produces an error because it doesn’t have a
function which can sum a Maybe Integer
sum has_missing
s:7:1: error: [GHC-39999]
• No instance for ‘Num (Maybe Integer)’ arising from a use of ‘it’
• In the first argument of ‘print’, namely ‘it’
In a stmt of an interactive GHCi command: print it
|
7 | sum has_missing
| ^^^^^^^^^^^^^^^
you need to remove any Nothing first, then most likely ‘unwrap’ from the
Maybe context
import Data.Maybe

sum $ map fromJust $ filter isJust has_missing
## 7
or alternatively
sum (catMaybes has_missing)
## 7
or you can get fancy
sum [x | Just x <- has_missing]
## 7
The point is that you have to deal with the missingness if it’s there. What this
also means is that if you have a Double column, it does NOT have missing values,
so you can safely sum those values (plus get all sorts of performance benefits from
the compiler because it, too, knows there are no missing values).
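If you want R’s ‘poisoning’ behaviour rather than dropping, sequence gives it to you for free: it collapses a list of Maybes to Nothing if any element is missing. A small sketch using only the Prelude and Data.Maybe:

```haskell
import Data.Maybe (catMaybes)

has_missing = [Just 1, Just 2, Nothing, Just 4]

-- like R's sum(..., na.rm = FALSE): any Nothing poisons the whole result
fmap sum (sequence has_missing)
## Nothing

-- like na.rm = TRUE: drop the Nothings, then sum what's left
sum (catMaybes has_missing)
## 7
```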
For Claus’ example, we can produce a proper Nothing at the end of this calculation
fmap (fmap (> 3)) x
## [Just False,Just False,Nothing,Just True,Just True]
Another tick – proper missing values.
Built-in vectorisation
Haskell is NOT an array language, so sure, it doesn’t have vectorisation built-in,
but it’s worth noting that at the end of Claus’ post he details some limitations
of R and acknowledges that “R does not have any scalar data types”. Haskell has
scalars, vectors, and arrays, and you need to be specific when you want to iterate
over those – the “type” of a variable includes the dimensionality, so Double is
not the same as [Double] (a list of doubles).
Since Haskell is a functional programming language it has every type of map you
could want, including specialities for monads and applicatives. While this means
you do need to write map when you want to iterate, it also means you’re never
surprised that there was more than one value there.
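A few of those flavours side by side – map for lists, fmap for any mappable context, mapM_ for running an action over each element:

```haskell
map (*2) [1,2,3]        -- plain list map
## [2,4,6]
fmap (*2) (Just 10)     -- map inside a context (here, Maybe)
## Just 20
mapM_ print [1,2,3]     -- run an IO action for each element
## 1
## 2
## 3
```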
What’s more, because it’s a compiled language, the compiler can optimise all sorts
of vector operations. One example is using “fusion” to combine a filter and a
map such that the
intermediate vector isn’t actually used at all.
This means that a stack of functions like
foldr (+) 0 . map (*2) . filter even
which would naively require a full pass to filter the even values, a half pass to double those, then another half pass to add them up, can be done in a single pass.
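Running that pipeline on a concrete list shows it in action – the even values 2, 4, 6, 8, 10 are doubled and summed:

```haskell
(foldr (+) 0 . map (*2) . filter even) [1..10]
## 60
```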
You can also add rewrite rules if you’re sure your replacement holds (many libraries can assert these conditions, and so implement such rules) so that some operations can be entirely compiled away. Reversing a finite list twice is a no-op, so one could add
{-# RULES
"reverse.reverse/id" reverse . reverse = id
#-}
which means a double reverse can be replaced with the identity function.
Even without such a rule, Haskell (being a compiled language) is fast
x = [1..1000000000]
:set +s
a = reverse $ reverse x
(0.00 secs, 0 bytes)
This is disappointing to run inline in R
x <- seq_len(1e9)
system.time(rev(rev(x)))
   user  system elapsed
  4.596   2.543   8.824
This isn’t just about compiling; R does have just-in-time compilation of functions, but it lacks the compiler tricks that Haskell uses, so a compiled version of this doesn’t do a lot better
revrev <- function(x) {
rev(rev(x))
}
revrev_comp <- compiler::cmpfun(revrev)
system.time(revrev_comp(x))
user system elapsed
4.035 0.739 4.777
So, no vectorisation, but possibly enough compiler tricks to make up for it – tick.
Non-standard evaluation (NSE)
This is where the fun really starts – the dataframe package from the dataHaskell ecosystem adds the sort of slicing and dicing you’re probably familiar with. Apart from general inspection of data frames
df <- D.readParquet "iris.parquet"
D.describeColumns df
## ---------------------------------------------------------
## Column Name  | # Non-null Values | # Null Values | Type
## -------------|-------------------|---------------|-------
## Text         | Int               | Int           | Text
## -------------|-------------------|---------------|-------
## variety      | 150               | 0             | Text
## petal.width  | 150               | 0             | Double
## petal.length | 150               | 0             | Double
## sepal.width  | 150               | 0             | Double
## sepal.length | 150               | 0             | Double
(don’t be fooled by that <- – that’s Haskell’s syntax for binding the result of
an IO action, i.e. something that reaches outside the program, such as reading a
file from disk) we can use D.dimensions to get the overall shape, and more
specific helpers like D.nRows and D.nColumns are available which we can
incorporate into e.g. text output
import Text.Printf (printf)

df <- D.readParquet "iris.parquet"
D.dimensions df
printf "%d rows, %d columns" (D.nRows df) (D.nColumns df)
## (150,5)
## 150 rows, 5 columns
Many of the dplyr-esque operations are available, with a lot of thought put into
how these would interact with a strongly typed structure
iris <- D.readParquet "iris.parquet"

iris |>
  D.filterWhere (F.col @Text "variety" .== "Setosa") |>
  D.filterWhere (F.col @Double "sepal.length" .> 5.4)
## -----------------------------------------------------------------
## sepal.length | sepal.width | petal.length | petal.width | variety
## -------------|-------------|--------------|-------------|--------
## Double       | Double      | Double       | Double      | Text
## -------------|-------------|--------------|-------------|--------
## 5.8          | 4.0         | 1.2          | 0.2         | Setosa
## 5.7          | 4.4         | 1.5          | 0.4         | Setosa
## 5.7          | 3.8         | 1.7          | 0.3         | Setosa
## 5.5          | 4.2         | 1.4          | 0.2         | Setosa
## 5.5          | 3.5         | 1.3          | 0.2         | Setosa
but dataframe goes one step further via Template Haskell: you can expose the columns as variables (admittedly, in the wider scope) so this works
iris <- D.readParquet "iris.parquet"

-- make columns available as expressions
:exposeColumns iris

iris |>
  D.derive "sepal.ratio" (sepal_width / sepal_length) |>
  D.take 5
## sepal_length :: Expr Double
## sepal_width :: Expr Double
## petal_length :: Expr Double
## petal_width :: Expr Double
## variety :: Expr Text
## --------------------------------------------------------------------------------------
## sepal.length | sepal.width | petal.length | petal.width | variety | sepal.ratio
## -------------|-------------|--------------|-------------|---------|-------------------
## Double       | Double      | Double       | Double      | Text    | Double
## -------------|-------------|--------------|-------------|---------|-------------------
## 5.1          | 3.5         | 1.4          | 0.2         | Setosa  | 0.6862745098039216
## 4.9          | 3.0         | 1.4          | 0.2         | Setosa  | 0.6122448979591836
## 4.7          | 3.2         | 1.3          | 0.2         | Setosa  | 0.6808510638297872
## 4.6          | 3.1         | 1.5          | 0.2         | Setosa  | 0.673913043478261
## 5.0          | 3.6         | 1.4          | 0.2         | Setosa  | 0.72
The info printed prior to the result is about the exposed columns, and it’s worth noting that the dots/periods have been replaced by underscores. That’s because in Haskell the period is used for composition, as described above.
Many verbs are supported, so we can do some more detailed transformations
iris <- D.readParquet "iris.parquet"
:exposeColumns iris
iris |>
D.filterWhere ( sepal_width .> 2.6 ) |>
D.groupBy ["variety"] |>
D.aggregate
[ "n" .= F.count petal_length
, "sl_mean" .= F.mean sepal_length
, "pl_mean" .= F.mean petal_length
]
## sepal_length :: Expr Double
## sepal_width :: Expr Double
## petal_length :: Expr Double
## petal_width :: Expr Double
## variety :: Expr Text
## ---------------------------------------------------------
## variety | n | sl_mean | pl_mean
## -----------|-----|-------------------|-------------------
## Text | Int | Double | Double
## -----------|-----|-------------------|-------------------
## Versicolor | 34 | 6.099999999999998 | 4.435294117647058
## Setosa | 49 | 5.016326530612244 | 1.4653061224489798
## Virginica | 43 | 6.651162790697675 | 5.57674418604651
and remember, because we know there are no missing values in a column of Doubles
(not Maybe Doubles) we can take averages without worrying about any na.rm
complications.
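In plain Haskell terms, a mean over a [Double] needs no na.rm branch at all – the type guarantees every element is present (mean here is my own throwaway definition, not a dataframe function):

```haskell
-- safe to write without any missing-value handling:
-- a [Double] simply cannot contain an NA
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

mean [1, 2, 3, 4]
## 2.5
```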
The NSE bit doesn’t quite work everywhere, but sometimes a string is just fine, e.g.
iris <- D.readParquet "iris.parquet"
D.plotScatter "sepal.length" "sepal.width" iris
## [braille terminal scatter plot: sepal.length (x axis, 4.1 to 8.1) vs
##  sepal.width (y axis, 1.9 to 4.5), legend: ⣿ sepal.length vs sepal.width]
(following https://blog.djnavarro.net/posts/2021-04-18_pretty-little-clis/ to get the ANSI sequences to work in a code block).
A more detailed comparison to dplyr is provided in
the dataframe documentation.
So, NSE? Tick!
Conclusion
I’ve hopefully demonstrated some of the power of a strongly typed language and a package focused on data science enabling the sort of functionality that an R (or Python) user might be looking for. I am hopeful that Haskell (and the dataHaskell ecosystem) can be a viable option for those of us wanting to do data science in a strongly typed language with a very clever compiler capable of making significant performance improvements.
If you’re interested in dataHaskell then check out this post and consider taking it for a spin – we’re working on reducing friction to get started via devcontainers and hosted notebook solutions, and are keen to hear from more data scientists about what they’d like the ecosystem to be able to support.
I believe Haskell IS a great language for data science!
As always, I can be found on Mastodon and the comment section below.
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.1 (2024-06-14)
##  os       macOS 15.6.1
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2025-12-05
##  pandoc   3.8.2.1 @ /opt/homebrew/bin/ (via rmarkdown)
##  quarto   1.7.31 @ /usr/local/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.21.1  2025-06-28 [1] Github (rstudio/blogdown@33313a5)
##  bookdown      0.41    2024-10-16 [1] CRAN (R 4.4.1)
##  bslib         0.9.0   2025-01-30 [1] CRAN (R 4.4.1)
##  cachem        1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
##  cli           3.6.5   2025-04-23 [1] CRAN (R 4.4.1)
##  devtools      2.4.6   2025-10-03 [1] CRAN (R 4.4.1)
##  digest        0.6.38  2025-11-12 [1] CRAN (R 4.4.1)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate      1.0.5   2025-08-27 [1] CRAN (R 4.4.1)
##  fansi         1.0.7   2025-11-19 [1] CRAN (R 4.4.3)
##  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
##  fs            1.6.6   2025-04-12 [1] CRAN (R 4.4.1)
##  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
##  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite      2.0.0   2025-03-27 [1] CRAN (R 4.4.1)
##  knitr         1.50    2025-03-16 [1] CRAN (R 4.4.1)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
##  magrittr      2.0.4   2025-09-12 [1] CRAN (R 4.4.1)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
##  pkgbuild      1.4.8   2025-05-26 [1] CRAN (R 4.4.1)
##  pkgload       1.4.1   2025-09-23 [1] CRAN (R 4.4.1)
##  purrr         1.2.0   2025-11-04 [1] CRAN (R 4.4.1)
##  R6            2.6.1   2025-02-15 [1] CRAN (R 4.4.1)
##  remotes       2.5.0   2024-03-17 [1] CRAN (R 4.4.1)
##  rlang         1.1.6   2025-04-11 [1] CRAN (R 4.4.1)
##  rmarkdown     2.30    2025-09-28 [1] CRAN (R 4.4.1)
##  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
##  sass          0.4.10  2025-04-11 [1] CRAN (R 4.4.1)
##  sessioninfo   1.2.3   2025-02-05 [1] CRAN (R 4.4.1)
##  usethis       3.2.1   2025-09-06 [1] CRAN (R 4.4.1)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  xfun          0.54    2025-10-30 [1] CRAN (R 4.4.1)
##  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
##
##  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────