
Comparing R’s {targets} and dbt for Data Engineering


I’m getting more and more into data engineering these days and having used R for a long time, I’m seeing a lot of problems that look nail-shaped to my R-shaped hammer. The available tools to solve those problems exist for (presumably) very good reasons, so I wanted to take some time to dig into how to use them and compare their workflows to what I would otherwise naively do in R.

I should mention here that I’m currently open to data/code-related opportunities and am actively seeking a new role – if your organisation is looking for someone aligned with my skillset, please get in touch with me any way you can, e.g. contact at jcarroll.com.au.

I’m a firm believer in “you learn with your hands, not with your eyes” so I wanted to actually build something. I definitely could spin up Claude Code and have it produce the entire thing for me – and in a different project I might do that – but in this case I want to make the mistakes myself so I can learn where the complexity really lives and where my prior assumptions are misaligned. I did have Claude (the chat version, not the full coding agent) guide me through the steps to get this project running, and I did let it clean up my SQL; this project wasn’t about learning to better optimise my SQL, but understanding exactly what it produced will help me write a better version on my next iteration.

Thinking of a real-world project I could take for a spin, I decided to build some ingestion for my personal finances. I’ve used QuickBooks previously, which connects up to my bank and helps categorise personal and business (as a freelance contractor) expenses. I decided I’d build my own ‘slowbooks’ processing workflow based on some manual exports (I don’t think my bank has an API).

Both of the approaches I’ll compare here build on the idea of a Makefile, which connects up commands to run based on dependencies and only runs what is needed; if none of a step’s input dependencies have changed, there’s no need to re-run that step. From what I understand, you could largely get away with just writing some Makefiles (or using the newer implementation, just), but these two approaches help to better structure how that’s constructed.
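For illustration, here’s a minimal Makefile sketch of that idea – the file and script names are made up – where each step re-runs only if one of its prerequisites is newer than its output:

# each rule re-runs only when a prerequisite is newer than its target
summary.csv: clean.csv summarise.R
	Rscript summarise.R clean.csv summary.csv

clean.csv: raw.csv clean.R
	Rscript clean.R raw.csv clean.csv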


dbt

One tool that comes up frequently is the ‘data build tool’, most commonly referred to as just dbt, though that full name doesn’t even show up on their website. Started in 2016, it’s released as a Python package (dbt-core), though if you try to install something called ‘dbt’ you get the cloud CLI tool, which isn’t quite the same. Naming stuff is hard.

It’s a way to write code you can commit, which translates to SQL and performs data ingestion, processing, transformation, and storage in a structured way, with relationships between the various steps in the workflow. It adds macros on top of plain SQL to make the transformations easier, written in Jinja, a template engine which enables writing something more like Python within SQL.
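As a sketch of what that looks like (the model names here are hypothetical), a model is a select statement, and the Jinja ref() function is what declares the dependency on another model:

-- a hypothetical model; ref() wires this model to stg_transactions
select
    date_trunc('month', transaction_date) as month,
    sum(amount) as total_amount
from {{ ref('stg_transactions') }}
group by 1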

This episode of Data Science Lab from Posit walks through an example of using dbt, and while it’s a fantastic overview of what a project looks like, it can’t answer all of the ‘how would I do that?’ problems that will come up in a different project.

Like they did, I will use DuckDB for a database – I enjoyed reading through ‘DuckDB in Action’ with the DSLC.io book club and can definitely see the advantages over SQLite which I would previously have reached for in this case.

I installed dbt via uv – the official instructions use pip and I’ve been burned too many times with that tool; uv is much nicer. Nonetheless, I still encountered Python-related issues because it looks like dbt doesn’t yet support Python 3.14, which isn’t mentioned in their instructions either. I got it working with these commands, adding the dbt-duckdb adapter I plan to use, as well as streamlit to make a dashboard later:

uv init slowbooks --python 3.12
cd slowbooks
uv add dbt-duckdb duckdb streamlit

I added a profiles.yml in the project root, defining the database (DuckDB) I want to produce to store the tables:

slowbooks:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: slowbooks.duckdb
      schema: main

I can then initialise the project with

uv run dbt init . --skip-profile-setup

This creates the basic project structure, and there’s a lot going on.
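Roughly, the generated layout looks like this (trimmed to the main pieces; the sample models live under models/example/):

dbt_project.yml
models/
  example/
analyses/
macros/
seeds/
snapshots/
tests/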

I also needed to define a dependency in packages.yml so that I could use the macros

packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0"]

and ran

uv run dbt deps

I put my exported CSVs (several for my transaction/savings accounts and one for my credit card) in a new raw/ folder; my understanding is that the seeds/ folder is for static data, although that’s the folder used in the Posit tutorial above.

I also ran some pre-processing over my CSVs to categorise the merchants. My bank provides a ‘category’ and ‘subcategory’ for each item, but I wanted to be able to override some of those with more specific definitions so that I could group by them, e.g. ‘total spent on books’, since I mainly buy those from just a couple of merchants. This produced a new CSV of patterns, resolved names, and classifications, since the ‘description’ of an item in my transactions might look like, e.g.

Paypal *FruitShop 0401000000 Au

and I want to identify the ‘FruitShop’ part, so I can match against that pattern. This is a (fairly) static file (the source data will occasionally be extended), so that did go into seeds/.
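For illustration (these rows are invented), the seed file has a shape something like:

pattern,merchant,category
FruitShop,FruitShop,Groceries
BookBarn,BookBarn,Books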

{targets}

The whole time I’ve been learning about dbt I’ve had a voice in my head asking “can’t I just use {targets}?” Yes, it’s an R-specific tool, but it does a fantastic job at what it does. It’s not a new tool at all – this post from 2021 demonstrates the power of it, and Miles McBain has been singing the praises of it since at least as early as 2020 (along with the predecessor {drake}).

Rather than double up all of my inputs, I will just keep the {targets} implementation as a subdirectory of my dbt project and refer to the exact same source files. I will create a distinct database, though.

Installing {targets}, provided you already have a working R installation, is as straightforward as

install.packages("targets")

within an R session, be that in RStudio, Positron, Emacs, or a terminal.

As for the rest of the file structure, 100% of the R code here goes into a _targets.R file – much cleaner, though that’s a tradeoff in terms of separating different components.
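As a minimal sketch of what that file contains – the helper functions here are hypothetical stand-ins for my actual processing code:

# _targets.R -- a minimal sketch
library(targets)

tar_option_set(packages = c("dplyr", "purrr", "readr"))

list(
  # track the raw exports as files so edits invalidate downstream targets
  tar_target(raw_files, list.files("raw", full.names = TRUE), format = "file"),
  tar_target(transactions, map_dfr(raw_files, read_csv)),
  tar_target(categorised, add_categories(transactions)),  # hypothetical helper
  tar_target(monthly, summarise_monthly(categorised))     # hypothetical helper
)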


Comparing Workflows

For the actual processing I’m going to show both dbt and {targets} approaches in tabsets for switching back and forth.

For dbt a ‘model’ is a select statement producing a table, with the models split out into three layers of increasingly production-ready data. From the dbt docs, these are, roughly: staging models, which clean and standardise the raw source data; intermediate models, which join and enrich the staged data; and marts, which expose the business-ready outputs. I’m trying to stick to that as best as I can.

Staging – Load Data

The first step was to ingest that into a ‘staging’ model. This is where the initial data loading happens. For this personal project I’ve exported the CSV files I need, and will do so again in the future, adding them to the same folder for de-duplication within the pipeline. In a more mature project these might be read from an API or a connection to a managed database, and both approaches can easily switch between different ’environments’ (dev, staging, prod, …) without adjusting much, certainly without having to rename all the dependency labels.
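On the dbt side, a staging model can lean directly on DuckDB’s CSV reader via the dbt-duckdb adapter; a sketch, where the column names are assumptions about my bank’s export format:

-- models/staging/stg_transactions.sql (a sketch)
select
    "Date"::date      as transaction_date,
    "Description"     as description,
    "Amount"::decimal as amount
from read_csv_auto('raw/*.csv', union_by_name = true)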


Intermediate – Joins and Enrichment

For this ‘simple’ example there won’t be a lot of difference between an ‘intermediate’ stage and a final ‘mart’ stage, but this is where the merging with the merchant categories occurs. The transactions from staging are loaded and joined according to the patterns I’ve defined in the seed file.
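A sketch of that join (names illustrative), matching each description against the seeded patterns:

-- models/intermediate/int_categorised.sql (a sketch)
select
    t.transaction_date,
    t.description,
    t.amount,
    coalesce(m.category, t.bank_category) as category,
    m.merchant
from {{ ref('stg_transactions') }} t
left join {{ ref('merchant_patterns') }} m
    on t.description like '%' || m.pattern || '%'

A caveat with pattern joins like this: a description matching more than one pattern will duplicate rows, which is exactly the sort of thing the tests below can catch.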


Marts – Summaries and Outputs

I could create some definitive ‘data product’ results here, but for now this is very similar to the ‘intermediate’ stage with one additional grouping by merchant as well as month.
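Something along these lines (again, names illustrative):

-- models/marts/mart_monthly_summary.sql (a sketch)
select
    date_trunc('month', transaction_date) as month,
    merchant,
    category,
    count(*)    as transaction_count,
    sum(amount) as total_amount
from {{ ref('int_categorised') }}
group by 1, 2, 3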


Tests / Validation

One ‘selling point’ I’ve seen for dbt is that it can additionally add validation tests within the workflow. That’s extremely useful to ensure that you’re not producing junk data inadvertently.
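In dbt these are declared in YAML next to the models; a sketch using two of the built-in tests (column names illustrative):

version: 2
models:
  - name: int_categorised
    columns:
      - name: transaction_date
        tests:
          - not_null
      - name: amount
        tests:
          - not_null

In {targets} the equivalent is just another target whose function throws an error (e.g. via stopifnot()) when an assertion fails, halting everything downstream.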


Analysis

What’s the point of organising this data if we’re not going to do something with it? This is where I start to really wonder whether {targets} has a bigger picture in mind when it connects up the data: while dbt will do all of the processing in SQL, R will happily continue on to do the analysis.

I think this is where a separation of concerns becomes necessary, and that depends on the scale of the data involved. While you or I working on a small project might be very happy to tie the analysis into the data preparation all in one place, Netflix probably wants to segregate the data processing and analysis steps into entirely different divisions, so tying a bow on the cleaned data and letting analysts pick it up from a database makes a lot more sense.

For my example, let’s say I’m interested in analysing which categories have out-of-the-ordinary amounts of spend in a given month – have I spent more on groceries this month? To do that, I want to calculate the average spend in each category each month, plus the variation, and identify when the spend is more than a standard deviation away from the average.
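A sketch of that logic in {dplyr} terms – the table and column names are placeholders:

# flag months where a category's spend is more than one standard
# deviation away from that category's average
library(dplyr)

flag_outliers <- function(monthly) {
  monthly |>
    group_by(category) |>
    mutate(
      avg_spend = mean(total_amount),
      sd_spend  = sd(total_amount),
      outlier   = abs(total_amount - avg_spend) > sd_spend
    ) |>
    ungroup()
}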


The Complete Workflow

That’s all the pieces I need to push the data in the exported CSVs through the pipeline and produce a database of monthly aggregated, categorised totals. Here’s how it looks in terms of the two tools.
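Running each end-to-end comes down to a single command; for dbt

uv run dbt build

which runs the seeds, models, and tests in DAG order, and for {targets}, from an R session

targets::tar_make()

which runs only the targets that are out of date.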


DAG / Visualisation / Docs

How similar these approaches are to a Makefile depends on being able to determine what has ‘changed’ and what is the same, and this is where the two approaches differ. Both consider the workflow as a Directed Acyclic Graph (DAG), with steps taking dependencies on previous steps or data sources. This means I can visualise the workflow as a graph, but it also makes for some important differences in how things actually run.
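Both tools will draw that graph for you; dbt via its generated documentation site

uv run dbt docs generate
uv run dbt docs serve

and {targets} with an interactive network, colour-coded by which targets are up to date

targets::tar_visnetwork()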


Exploration

I only built the exploration dashboard as part of the dbt project because I’ve built plenty of Shiny apps – I wanted to see what Claude could build from this database as a data source. It built a streamlit app which shows the monthly spend broken down by category, and I had it add filters for the various categories, tables of transactions, and the monthly outliers.

The dashboard works great, albeit not perfectly. It looks something like this

Slowbooks dashboard (click to embiggen)

There are obvious issues with it – not least that the legend is incomplete – but for the sort of exploration I wanted to try out, it’s a great starting point.

It reads the summary tables directly from the database, so the analysis doesn’t need to happen within the app – a nice separation of business logic and visualisation.
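A condensed sketch of that pattern – the column names here are assumptions about the mart, not the app Claude actually wrote:

import duckdb
import streamlit as st

# read the pre-built summary table; no analysis happens in the app
con = duckdb.connect("slowbooks.duckdb", read_only=True)
monthly = con.execute(
    "select month, category, total_amount from mart_category_trends"
).df()

category = st.selectbox("Category", sorted(monthly["category"].unique()))
st.bar_chart(monthly[monthly["category"] == category], x="month", y="total_amount")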

Comparison

As a final sanity check, I’ll confirm that I get the same number of transactions in the monthly trend tables which are saved to both databases, albeit with different names

duckdb slowbooks.duckdb -c "select sum(transaction_count) from mart_category_trends;"
┌────────────────────────┐
│ sum(transaction_count) │
│         int128         │
├────────────────────────┤
│                   2028 │
└────────────────────────┘

duckdb targets/slowbooks_r.duckdb -c "select sum(transaction_count) from mart_monthly_summary;"
┌────────────────────────┐
│ sum(transaction_count) │
│         int128         │
├────────────────────────┤
│                   2028 │
└────────────────────────┘

🎉

As for what I like and don’t like about each approach:

Other Solutions

While I’ve focused on this comparison between dbt and {targets}, these aren’t the only players in the game. I’m aware of Airflow, at least in the sense that it can ingest dbt pipelines and schedule them. For the Python folks there’s also prefect and dagster, the latter of which also has an R ingestion route in the form of dagsterpipes. A purely R solution is maestro which appears to target (pun intended) data coming from an API or database for which {targets} can’t identify the ‘up-to-date-ness’ (since that involves a hash of the file).


Conclusion

I’ve vastly grown my understanding of both dbt and {targets} and have a much greater appreciation for what goes into using each of these to move and curate data. Plus, now I have a cool new toy I’ve built to explore my finances. I’m not sharing the code itself – partly so that I don’t risk committing my own finance data by accident, and partly because what I’ve done here isn’t anything you need to build on; if you’re interested in learning either or both of these tools, I recommend you do what I did and build a toy project.

I’m interested to hear what you think of this comparison – have I overlooked some significant difference or similarity? Some use-case where one of them would really shine over the other? Have I misrepresented something? I’m here to learn, so by all means please do let me know. And if you’re looking for someone with a history of programming and data who digs into projects this way, I’m on the market for opportunities.

As always, I can be found on Mastodon and the comment section below.


devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.5.3 (2026-03-11)
##  os       macOS Tahoe 26.3.1
##  system   aarch64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       Australia/Adelaide
##  date     2026-05-04
##  pandoc   3.6.3 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
##  quarto   1.7.31 @ /usr/local/bin/quarto
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version date (UTC) lib source
##  blob          1.3.0   2026-01-14 [1] CRAN (R 4.5.2)
##  blogdown      1.23    2026-01-18 [1] CRAN (R 4.5.2)
##  bookdown      0.46    2025-12-05 [1] CRAN (R 4.5.2)
##  bslib         0.10.0  2026-01-26 [1] CRAN (R 4.5.2)
##  cachem        1.1.0   2024-05-16 [1] CRAN (R 4.5.0)
##  cli           3.6.5   2025-04-23 [1] CRAN (R 4.5.0)
##  DBI           1.3.0   2026-02-25 [1] CRAN (R 4.5.2)
##  dbplyr        2.5.2   2026-02-13 [1] CRAN (R 4.5.2)
##  devtools      2.4.6   2025-10-03 [1] CRAN (R 4.5.0)
##  digest        0.6.39  2025-11-19 [1] CRAN (R 4.5.2)
##  dplyr         1.2.1   2026-04-03 [1] CRAN (R 4.5.2)
##  duckdb        1.5.2   2026-04-13 [1] CRAN (R 4.5.2)
##  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.5.0)
##  evaluate      1.0.5   2025-08-27 [1] CRAN (R 4.5.0)
##  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.5.0)
##  fs            1.6.7   2026-03-06 [1] CRAN (R 4.5.2)
##  generics      0.1.4   2025-05-09 [1] CRAN (R 4.5.0)
##  glue          1.8.1   2026-04-17 [1] CRAN (R 4.5.2)
##  htmltools     0.5.9   2025-12-04 [1] CRAN (R 4.5.2)
##  jquerylib     0.1.4   2021-04-26 [1] CRAN (R 4.5.0)
##  jsonlite      2.0.0   2025-03-27 [1] CRAN (R 4.5.0)
##  knitr         1.51    2025-12-20 [1] CRAN (R 4.5.2)
##  lifecycle     1.0.5   2026-01-08 [1] CRAN (R 4.5.2)
##  magrittr      2.0.4   2025-09-12 [1] CRAN (R 4.5.0)
##  memoise       2.0.1   2021-11-26 [1] CRAN (R 4.5.0)
##  otel          0.2.0   2025-08-29 [1] CRAN (R 4.5.0)
##  pillar        1.11.1  2025-09-17 [1] CRAN (R 4.5.0)
##  pkgbuild      1.4.8   2025-05-26 [1] CRAN (R 4.5.0)
##  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.5.0)
##  pkgload       1.5.0   2026-02-03 [1] CRAN (R 4.5.2)
##  purrr         1.2.2   2026-04-10 [1] CRAN (R 4.5.2)
##  R6            2.6.1   2025-02-15 [1] CRAN (R 4.5.0)
##  remotes       2.5.0   2024-03-17 [1] CRAN (R 4.5.0)
##  rlang         1.1.7   2026-01-09 [1] CRAN (R 4.5.2)
##  rmarkdown     2.30    2025-09-28 [1] CRAN (R 4.5.0)
##  rstudioapi    0.18.0  2026-01-16 [1] CRAN (R 4.5.2)
##  sass          0.4.10  2025-04-11 [1] CRAN (R 4.5.0)
##  sessioninfo   1.2.3   2025-02-05 [1] CRAN (R 4.5.0)
##  tibble        3.3.1   2026-01-11 [1] CRAN (R 4.5.2)
##  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.5.0)
##  usethis       3.2.1   2025-09-06 [1] CRAN (R 4.5.0)
##  vctrs         0.7.1   2026-01-23 [1] CRAN (R 4.5.2)
##  withr         3.0.2   2024-10-28 [1] CRAN (R 4.5.0)
##  xfun          0.56    2026-01-18 [1] CRAN (R 4.5.2)
##  yaml          2.3.12  2025-12-10 [1] CRAN (R 4.5.2)
## 
##  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
## 
## ──────────────────────────────────────────────────────────────────────────────
