Haskell IS a Great Language for Data Science


I’ve been learning Haskell for a few years now and I am really liking a lot of the features, not least the strong typing and functional approach. I thought it was lacking some of the things I missed from R until I found the dataHaskell project.

There have been several attempts recently to enhance R with some strong types, e.g. vapour, typr, and using {rlang}’s checks, and even discussions about implementations at the core level, e.g. in September 2025, continued in November 2025. While these try to bend R towards types, perhaps an all-in solution makes more sense.

In this post I’ll demonstrate some of the features and explain why I think it makes for a good (great?) data science language.

I’ve posted more than a handful of times about Haskell, but maybe not so much about the benefits of real-world usage – more toy problems (e.g. I did a lot of Advent of Code using it last year). I’ve been working towards using it more, and even managed to get a custom {knitr} engine working – here’s the special sauce that makes a ```{haskell} block work:

knitr::knit_engines$set(haskell = function(options) {
  code <- options$code
  codefile <- tempfile(fileext = ".hs")
  codefile_brace <- tempfile(fileext = ".hs")
  on.exit(file.remove(codefile, codefile_brace))
  # prepend the shared configuration, then write out the chunk code
  writeLines(c(":script dataframe", "", code), con = codefile)
  # pre-process: wrap multi-line statements in :{ ... :} blocks
  system2("hscript", codefile, stdout = codefile_brace)
  # run the pre-processed file as a ghci script
  out <- system2(
    file.path(path.expand("~"), ".ghcup/bin/ghc"),
    c("-e", shQuote(paste(":script", codefile_brace))),
    stdout = TRUE
  )

  knitr::engine_output(options, code, out)
})

This writes the lines of code to a temporary file, prepended with some configuration options, then runs essentially ghc -e ':script file.txt', and deletes the temporary files. For the purposes of making cleaner code blocks, the code detours through an awk script which inserts :{ and :} fences around multi-line statements (there’s a sketch of this below), helping to reproduce how these look in a Jupyter notebook. The result is then shown in the code block, so this is a “live” output:

map (+5) [2..8]
## [7,8,9,10,11,12,13]

Neat, right?
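Speaking of the pre-processing – for illustration, a multi-line definition would be rejected by ghci if entered one line at a time, so the awk step wraps such statements in :{ and :} fences. A sketch of the idea (addBoth is just a made-up example; the real work happens in the awk script):

:{
-- the :{ ... :} fences make ghci parse these lines as a single unit
addBoth :: Int -> Int -> Int
addBoth a b = a + b
:}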

Because I’m treating each code block as an independent script, it means there is some repetition between blocks. I’ll hide that away with some judicious echo options where necessary, but otherwise each block should be able to be run as a ‘script’ with the right pre-processing.

A Brief Intro to Haskell Syntax

Haskell is a bit different if you’ve only ever seen R or Python, but it doesn’t take too much effort to understand what’s going on. Firstly, while parentheses are used for function calls in R, a space is used in Haskell, so instead of sum(x) you use sum x. Parentheses are still used for grouping together combinations of things that need to be evaluated together.
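For example, summing a list and then negating the result:

sum [1, 2, 3]          -- function application is just a space: 6
negate (sum [1, 2, 3]) -- parentheses group the inner call: -6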

Lists are a fundamental data type and are denoted by square brackets, e.g. [3,4,5] and they need to contain a single type. For a strongly typed language, that shouldn’t come as a surprise. A single number might be of type Double and a list of these would be of type [Double].
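For instance, annotating a list and asking ghci for its type:

xs = [3, 4, 5] :: [Double]

:t xs
-- xs :: [Double]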

If you’re worried that you’ve become too reliant on a piped workflow, fear not! dataHaskell’s dataframe package adds the familiar pipe operator

[2,8,7,10,1,9,5,3,4,6] |>
  reverse |>
  take 5
## [6,4,3,5,9]

with the important distinction that it passes the left side as the last argument of the right side (not the first argument), which flows more cleanly given how Haskell functions are typically written, e.g.

take 3 [1,2,3,4,5,6]

-- vs 

[1,2,3,4,5,6] |>
  take 3
## [1,2,3]
## [1,2,3]

The line in the middle there demonstrates that comments start with two hyphens --, or for multi-line comments, between {- and -}.
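Both styles in action:

-- a single-line comment
{- a comment that
   spans multiple lines -}
x = 42 -- comments can also follow code on the same line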

If you need to write a function (for which you use camelCase) you can annotate it with a type signature, though the compiler can infer this itself most of the time (plus it helps readability). The way to do this is with one extra line above the implementation. If the type is generic, you can use a placeholder, e.g. a, rather than a specific type. Technically all functions take only one argument, possibly returning another function (see currying), but this is made explicit in the signature; e.g. [a] -> a -> [a] represents a function which takes a list and a value and returns a list

appendValueToList :: [a] -> a -> [a]
appendValueToList xs y = xs ++ [y]

appendValueToList [2,4,6] 8

appendValueToList ["f", "o", "o"] "t"
## [2,4,6,8]
## ["f","o","o","t"]

The period is used for function composition, i.e.

import Data.List (sort)

(reverse . sort) [2,8,7,10,1,9,5,3,4,6]
## [10,9,8,7,6,5,4,3,2,1]

applies a composed ‘sort, then reverse’ operation to the list. The import is there because the ‘base’ library (“Prelude”) doesn’t include the sort function, so it’s imported from Data.List. There are actually a few of these which need to be imported to use the code I’m showing below, but they’re inserted into the code blocks via the :script dataframe line in the engine definition above. That calls out to an executable which runs the code block as if it was contained in a main function in a full program, which enables us to use IO operations inline, such as reading from files and printing results. That all gets a little trickier without this ‘scripting’ context, but I’m here to make the point that such a scripting context works well for doing data science.
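To make that concrete, here is roughly what one of these blocks looks like when expanded into a full program – my illustration of the idea, not the actual generated code:

import Data.List (sort)

main :: IO ()
main = do
  -- a pure computation...
  let sorted = (reverse . sort) [2, 8, 7, 10, 1, 9, 5, 3, 4, 6]
  -- ...followed by the IO action that prints it
  print sorted -- [10,9,8,7,6,5,4,3,2,1]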

So, what would one use this for?

I saw this (follow-up) post from Claus Wilke about Python not being a great language for data science and while I concur with the points made there, I do believe some of them are personal preference. I’m a proponent of “use the tools you’re comfortable with” and I can’t argue with however many thousands of data scientists are successfully using Python to do data science.

The point about “what makes for a good data science language” made me pause to think and I came to the conclusion that Haskell actually ticks the boxes, at least with the dataHaskell ecosystem and the dataframe package. What follows is not to be taken as a pile-on against Python or even a complaint about R, but rather something in the style of “if you like that, check this out!”

Lots of languages seem to have some sort of dataframe these days – thanks, R! – e.g. Python has Pandas/Polars, Julia has DataFrames.jl, even Kotlin has a DataFrame. Haskell does, too, with dataframe, and I’ve been learning how to use it recently.

The points made in Claus’ post were that the features which make R a better language for data science over Python are (paraphrasing):

  • call-by-value semantics (non-mutability)
  • built-in missing values
  • built-in vectorization
  • non-standard evaluation (NSE)

Let’s look at how Haskell deals with each of these.

Non-mutability

Claus details how Python’s call-by-reference semantics enables one to modify variables unintentionally, since they’re scoped across functions. Haskell certainly doesn’t have this problem – everything is immutable, and functions are “pure” (no side-effects, though you can interact with typed side-effect ‘instructions’). If you want to “do” anything to a data object you pass it into a function and get a new object out. There’s no risk of accidentally modifying a variable, but of course the downside is that you can’t change anything without a function. While in R it’s straightforward to do

a <- c(2, 9, 6)
a[2] <- 4
a
## [1] 2 4 6

in Haskell that sort of thing is off limits – you can use the !! operator to extract a value from a list (0-indexed), e.g.

a = [2,9,6]

a !! 1
## 9

but there’s no way to assign the second element to some other value. Instead, you need to break the list apart and stitch the new value inside

a = [2,9,6]

updateSecond :: [a] -> a -> [a]
updateSecond (x:_:z) y = x : y : z
updateSecond xs _ = xs

updateSecond a 4
## [2,4,6]

No risk of accidentally writing that, I’m sure.

I’ve also included the type signature in this case, which reads as “a function which takes a list of some type a ([a]) and a single value of type a, and returns a list of that same type, [a]”. FYI, this is one example where the definition needs to be enclosed between :{ and :} if you’re running interactively in ghci, but here I’m using the pre-processing trick mentioned above.
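If you find yourself doing this a lot, a more general helper is easy to write. Here’s a hedged sketch using splitAt from the Prelude (updateAt is my own name, not a library function):

-- replace the element at index i (0-based) by rebuilding the list around it
updateAt :: Int -> a -> [a] -> [a]
updateAt i y xs = case splitAt i xs of
  (before, _ : after) -> before ++ y : after
  _                   -> xs -- index out of range: return the list unchanged

updateAt 1 4 [2,9,6]
-- [2,4,6]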

A tick for truly immutable data – the only way to “alter” a value is to operate on it with a function and reassign it.

Built-in missing values

This is somewhere that Haskell shines. If you want a value that might not be available in R, you use an NA (which is shorthand for whichever flavour/class you actually want, e.g. NA_character_). Using one of these in any mathematical calculation ‘poisons’ it and returns NA, e.g.

sum(1, NA, 3)
## [1] NA

To avoid this, most functions offer an na.rm argument which instructs them to remove the missing values prior to performing the calculation

sum(1, NA, 3, na.rm = TRUE)
## [1] 4

What’s happening here is that R encodes a value that is maybe missing. Haskell formalises this with the Maybe type (with helpers in the Data.Maybe module) and you have to be explicit in dealing with a missing value (Nothing) or a definitely-not-missing value (Just x)

non_missing = [1, 2, 3, 4]
has_missing = [Just 1,Just 2,Nothing,Just 4]

:t non_missing
:t has_missing
## non_missing :: Num a => [a]
## has_missing :: Num a => [Maybe a]

where we see that has_missing is a list of Maybe values.

sum non_missing
## 10

You can’t just sum the latter; it produces an error because there’s no Num instance for Maybe Integer, so sum doesn’t apply

sum has_missing
s:7:1: error: [GHC-39999]
    • No instance for ‘Num (Maybe Integer)’ arising from a use of ‘it’
    • In the first argument of ‘print’, namely ‘it’
      In a stmt of an interactive GHCi command: print it
  |
7 | sum has_missing
  | ^^^^^^^^^^^^^^^

you need to remove any Nothing first, then most likely ‘unwrap’ from the Maybe context

import Data.Maybe

sum $ map fromJust $ filter isJust has_missing
## 7

or alternatively

sum (catMaybes has_missing) 
## 7

or you can get fancy

sum [x | Just x <- has_missing]
## 7

The point is that you have to deal with the missingness if it’s there. What this also means is that if you have a Double column, it does NOT have missing values, so you can safely sum those values (plus get all sorts of performance benefits from the compiler because it, too, knows there are no missing values).
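Incidentally, if you do want R’s ‘poisoning’ behaviour – one missing value makes the whole result missing – sequence from the Prelude gives you exactly that. A quick sketch:

has_missing = [Just 1, Just 2, Nothing, Just 4]

-- sequence turns [Maybe a] into Maybe [a], failing if any element is Nothing
print (sum <$> sequence has_missing)              -- Nothing
print (sum <$> sequence [Just 1, Just 2, Just 4]) -- Just 7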

For Claus’ example (where x is a list of Maybe-wrapped numbers along the lines of has_missing), we can produce a proper Nothing at the end of this calculation – the outer fmap maps over the list, and the inner one maps inside each Maybe

fmap (fmap (> 3)) x
## [Just False,Just False,Nothing,Just True,Just True]

Another tick – proper missing values.

Built-in vectorisation

Haskell is NOT an array language, so sure, it doesn’t have vectorisation built-in, but it’s worth noting that at the end of Claus’ post he details some limitations of R and acknowledges that “R does not have any scalar data types”. Haskell has scalars, vectors, and arrays, and you need to be specific when you want to iterate over those – the “type” of a variable includes the dimensionality, so Double is not the same as [Double] (a list of doubles).

Since Haskell is a functional programming language it has every type of map you could want, including specialised versions for monads and applicatives. While this means you do need to write map when you want to iterate, it also means you’re never surprised to find there was more than one value there.
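A few quick examples of that explicitness:

print (map (*2) [1, 2, 3])  -- [2,4,6]: map over a list
print (fmap (*2) (Just 10)) -- Just 20: map inside a Maybe
mapM_ print [1, 2, 3]       -- run an IO action for each element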

What’s more, because it’s a compiled language, the compiler can optimise all sorts of vector operations. One example is using “fusion” to combine a filter and a map such that the intermediate vector isn’t actually used at all.

This means that a stack of functions like

foldr (+) 0 . map (*2) . filter even

which would naively require a full pass to filter the even values, a half pass to double those, then another half pass to add them up, can be done in a single pass.
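Conceptually, the fused version behaves something like a single fold – a sketch of the idea, since GHC’s actual rewrite machinery is more involved:

-- one pass, no intermediate lists
sumDoubledEvens :: [Int] -> Int
sumDoubledEvens = foldr (\x acc -> if even x then 2 * x + acc else acc) 0

sumDoubledEvens [1..6]
-- 24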

You can also add rewrite rules if you’re sure your replacement holds (and many libraries can guarantee those conditions, so implement such rules), meaning some operations can be compiled away entirely. Reversing a finite list twice is a no-op, so one could add

{-# RULES
"reverse.reverse/id" reverse . reverse = id
  #-}

which means a double reverse can be replaced with the identity function.

Even without such a rule, Haskell (being a compiled, lazy language) is fast – note that laziness means the binding below costs essentially nothing, since the result isn’t evaluated until it’s actually demanded

x = [1..1000000000]
:set +s
a = reverse $ reverse x
(0.00 secs, 0 bytes)

The same operation is disappointingly slow to run inline in R

x <- seq_len(1e9)
system.time(rev(rev(x)))
   user  system elapsed 
  4.596   2.543   8.824 

This isn’t just about compiling; R does have just-in-time compilation of functions, but it lacks the compiler tricks that Haskell uses, so a byte-compiled version of this doesn’t do a lot better

revrev <- function(x) {
  rev(rev(x))
}
revrev_comp <- compiler::cmpfun(revrev)
system.time(revrev_comp(x))
   user  system elapsed 
  4.035   0.739   4.777

So, no vectorisation, but possibly enough compiler tricks to make up for it – tick.

Non-standard evaluation (NSE)

This is where the fun really starts – the dataframe package from the dataHaskell ecosystem adds the sort of slicing and dicing you’re probably familiar with. Apart from general inspection of data frames

df <- D.readParquet "iris.parquet"

D.describeColumns df
## ---------------------------------------------------------
## Column Name  | # Non-null Values | # Null Values |  Type 
## -------------|-------------------|---------------|-------
##     Text     |        Int        |      Int      |  Text 
## -------------|-------------------|---------------|-------
## variety      | 150               | 0             | Text  
## petal.width  | 150               | 0             | Double
## petal.length | 150               | 0             | Double
## sepal.width  | 150               | 0             | Double
## sepal.length | 150               | 0             | Double

(don’t be fooled by that <- – that’s Haskell’s way of binding the result of an action that reaches outside the program, e.g. to the disk to read a file) we can use D.dimensions to get the overall shape, and more specific helpers like D.nRows and D.nColumns are available which we can incorporate into e.g. text output

import Text.Printf (printf)

df <- D.readParquet "iris.parquet"

D.dimensions df

printf "%d rows, %d columns" (D.nRows df) (D.nColumns df)
## (150,5)
## 150 rows, 5 columns

Many of the dplyr-esque operations are available, with a lot of thought put into how these interact with a strongly typed structure

iris <- D.readParquet "iris.parquet"

iris |> 
  D.filterWhere (F.col @Text "variety" .== "Setosa") |> 
  D.filterWhere (F.col @Double "sepal.length" .> 5.4)
## -----------------------------------------------------------------
## sepal.length | sepal.width | petal.length | petal.width | variety
## -------------|-------------|--------------|-------------|--------
##    Double    |   Double    |    Double    |   Double    |  Text  
## -------------|-------------|--------------|-------------|--------
## 5.8          | 4.0         | 1.2          | 0.2         | Setosa 
## 5.7          | 4.4         | 1.5          | 0.4         | Setosa 
## 5.7          | 3.8         | 1.7          | 0.3         | Setosa 
## 5.5          | 4.2         | 1.4          | 0.2         | Setosa 
## 5.5          | 3.5         | 1.3          | 0.2         | Setosa

but dataframe goes one step further via Template Haskell… you can expose the columns as variables (admittedly, in the wider scope) so this works

iris <- D.readParquet "iris.parquet"

-- make columns available as expressions
:exposeColumns iris

iris |> 
  D.derive "sepal.ratio" (sepal_width / sepal_length) |>
  D.take 5 
## sepal_length :: Expr Double
## sepal_width :: Expr Double
## petal_length :: Expr Double
## petal_width :: Expr Double
## variety :: Expr Text
## --------------------------------------------------------------------------------------
## sepal.length | sepal.width | petal.length | petal.width | variety |    sepal.ratio    
## -------------|-------------|--------------|-------------|---------|-------------------
##    Double    |   Double    |    Double    |   Double    |  Text   |       Double      
## -------------|-------------|--------------|-------------|---------|-------------------
## 5.1          | 3.5         | 1.4          | 0.2         | Setosa  | 0.6862745098039216
## 4.9          | 3.0         | 1.4          | 0.2         | Setosa  | 0.6122448979591836
## 4.7          | 3.2         | 1.3          | 0.2         | Setosa  | 0.6808510638297872
## 4.6          | 3.1         | 1.5          | 0.2         | Setosa  | 0.673913043478261 
## 5.0          | 3.6         | 1.4          | 0.2         | Setosa  | 0.72

The info printed prior to the result is about the exposed columns, and it’s worth noting that the dots/periods have been replaced by underscores. That’s because in Haskell the period is used for composition, as described above.

Many verbs are supported, so we can do some more detailed transformations

iris <- D.readParquet "iris.parquet"

:exposeColumns iris

iris |> 
  D.filterWhere ( sepal_width .> 2.6 ) |>
  D.groupBy ["variety"] |> 
  D.aggregate
      [ "n"       .= F.count petal_length
      , "sl_mean" .= F.mean sepal_length
      , "pl_mean" .= F.mean petal_length
      ]
## sepal_length :: Expr Double
## sepal_width :: Expr Double
## petal_length :: Expr Double
## petal_width :: Expr Double
## variety :: Expr Text
## ---------------------------------------------------------
##  variety   |  n  |      sl_mean      |      pl_mean      
## -----------|-----|-------------------|-------------------
##    Text    | Int |      Double       |       Double      
## -----------|-----|-------------------|-------------------
## Versicolor | 34  | 6.099999999999998 | 4.435294117647058 
## Setosa     | 49  | 5.016326530612244 | 1.4653061224489798
## Virginica  | 43  | 6.651162790697675 | 5.57674418604651

and remember, because we know there are no missing values in a column of Doubles (not Maybe Doubles), we can take averages without worrying about any na.rm complications.

The NSE bit doesn’t quite work everywhere, but sometimes a string is just fine, e.g.

iris <- D.readParquet "iris.parquet"

D.plotScatter "sepal.length" "sepal.width" iris
##    4.5│                                                            
##       │                       ⠈                                    
##       │                    ⠠                                       
##       │                ⠂                                           
##       │                   ⡀     ⠁                                  
##       │              ⠠        ⠠                              ⠄  ⠄  
##       │              ⠐  ⠐ ⠂                                        
##       │       ⠁   ⠈ ⡁⢀ ⡀   ⢀                         ⠈             
##       │       ⠄  ⠄  ⠄⠠ ⠄  ⠄        ⠄  ⠠ ⠄                          
##       │    ⡀  ⡀⢀    ⡂⠐           ⢀      ⠂⢀ ⡀  ⠂⢀ ⡀⢀  ⢀             
##    3.2│       ⠄  ⠄⠠                      ⠠    ⠄  ⠄                 
##       │  ⠐ ⠂     ⠂⠐ ⠂     ⠂  ⠂⠐  ⠐ ⠂⠐      ⠂⠐ ⠂⠐    ⠂⠐     ⠐ ⠂     
##       │    ⠁                 ⡁⢈ ⡀  ⠁⢈ ⢈ ⡁⢈ ⡀⠈  ⢀       ⠁⢀    ⡀     
##       │                ⠄     ⠄  ⠄  ⠄    ⠄⠠                         
##       │                    ⠐  ⠐ ⠂   ⠐                        ⠂     
##       │           ⢈  ⠈     ⢈ ⠁⠈         ⠁     ⠁                    
##       │     ⠠       ⠄      ⠠            ⠄                          
##       │                            ⠂  ⠐                            
##       │             ⡀                                              
##    1.9│                                                            
##       └────────────────────────────────────────────────────────────
##        4.1                           6.1                          8.1
## 
## ⣿ sepal.length vs sepal.width

(following https://blog.djnavarro.net/posts/2021-04-18_pretty-little-clis/ to get the ANSI sequences to work in a code block).

A more detailed comparison to dplyr is provided in the dataframe documentation.

So, NSE? Tick!

Conclusion

I’ve hopefully demonstrated some of the power of a strongly typed language, and of a package focused on data science that enables the sort of functionality an R (or Python) user might be looking for. I am hopeful that Haskell (and the dataHaskell ecosystem) can be a viable option for those of us wanting to do data science in a strongly typed language with a very clever compiler capable of making significant performance improvements.

If you’re interested in dataHaskell then check out this post and consider taking it for a spin – we’re working on reducing friction to get started via devcontainers and hosted notebook solutions, and are keen to hear from more data scientists about what they’d like the ecosystem to be able to support.

I believe Haskell IS a great language for data science!

As always, I can be found on Mastodon and in the comment section below.


<details><summary>devtools::session_info()</summary>
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.1 (2024-06-14)
##  os       macOS 15.6.1
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2025-12-05
##  pandoc   3.8.2.1 @ /opt/homebrew/bin/ (via rmarkdown)
##  quarto   1.7.31 @ /usr/local/bin/quarto
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blogdown      1.21.1  2025-06-28 [1] Github (rstudio/blogdown@33313a5)
##  bookdown      0.41    2024-10-16 [1] CRAN (R 4.4.1)
##  bslib         0.9.0   2025-01-30 [1] CRAN (R 4.4.1)
##  cachem        1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
##  cli           3.6.5   2025-04-23 [1] CRAN (R 4.4.1)
##  devtools      2.4.6   2025-10-03 [1] CRAN (R 4.4.1)
##  digest        0.6.38  2025-11-12 [1] CRAN (R 4.4.1)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
##  evaluate      1.0.5   2025-08-27 [1] CRAN (R 4.4.1)
##  fansi         1.0.7   2025-11-19 [1] CRAN (R 4.4.3)
##  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
##  fs            1.6.6   2025-04-12 [1] CRAN (R 4.4.1)
##  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
##  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.4.0)
##  jsonlite      2.0.0   2025-03-27 [1] CRAN (R 4.4.1)
##  knitr         1.50    2025-03-16 [1] CRAN (R 4.4.1)
##  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
##  magrittr      2.0.4   2025-09-12 [1] CRAN (R 4.4.1)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
##  pkgbuild      1.4.8   2025-05-26 [1] CRAN (R 4.4.1)
##  pkgload       1.4.1   2025-09-23 [1] CRAN (R 4.4.1)
##  purrr         1.2.0   2025-11-04 [1] CRAN (R 4.4.1)
##  R6            2.6.1   2025-02-15 [1] CRAN (R 4.4.1)
##  remotes       2.5.0   2024-03-17 [1] CRAN (R 4.4.1)
##  rlang         1.1.6   2025-04-11 [1] CRAN (R 4.4.1)
##  rmarkdown     2.30    2025-09-28 [1] CRAN (R 4.4.1)
##  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
##  sass          0.4.10  2025-04-11 [1] CRAN (R 4.4.1)
##  sessioninfo   1.2.3   2025-02-05 [1] CRAN (R 4.4.1)
##  usethis       3.2.1   2025-09-06 [1] CRAN (R 4.4.1)
##  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  xfun          0.54    2025-10-30 [1] CRAN (R 4.4.1)
##  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────
</details>

