DuckDB: Quacking SQL

[This article was first published on R – Michael's and Christian's Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Lost in Translation between R and Python 8

This is the next article in our series “Lost in Translation between R and Python”. The aim of this series is to provide high-quality R and Python 3 code to achieve some non-trivial tasks. If you are to learn R, check out the R tab below. Similarly, if you are to learn Python, the Python tab will be your friend.

DuckDB

DuckDB is a fantastic in-process SQL database management system written completely in C++. Check its official documentation and other blogposts like this to get a feeling of its superpowers. It is getting better and better!

Some of the highlights:

  • Easy installation in R and Python, made possible via language bindings.
  • Multiprocessing and fast.
  • Allows to work with data bigger than RAM.
  • Can fire SQL queries on R and Pandas tables.
  • Can fire SQL queries on (multiple!) csv and/or Parquet files.
  • Quacks Apache Arrow.

Installation

DuckDB is super easy to install:

  • R: install.packages("duckdb")
  • Python: pip install duckdb

Additional packages required to run the code of this post are indicated in the code.

A first query

Let’s start by loading a dataset, initializing DuckDB and running a simple query.

The dataset we use here contains information on over 20,000 sold houses in Kings County. Along with the sale price, different features describe the size and location of the properties. The dataset is available on OpenML.org with ID 42092.

library(OpenML)
library(duckdb)
library(tidyverse)

# Load data
df <- getOMLDataSet(data.id = 42092)$data

# Initialize duckdb, register df and materialize first query
con = dbConnect(duckdb())
duckdb_register(con, name = "df", df = df)
con %>% 
  dbSendQuery("SELECT * FROM df limit 5") %>% 
  dbFetch()
import duckdb
import pandas as pd
from sklearn.datasets import fetch_openml

# Load data
df = fetch_openml(data_id=42092, as_frame=True)["frame"]

# Initialize duckdb, register df and fire first query
# If out-of-RAM: duckdb.connect("py.duckdb", config={"temp_directory": "a_directory"})
con = duckdb.connect()
con.register("df", df)
con.execute("SELECT * FROM df limit 5").fetchdf()
Result of first query (from R)

Average price per grade

If you like SQL, then you can do your data preprocessing and simple analyses with DuckDB. Here, we calculate the average house price per online grade (the higher the grade, the better the house).

query <- 
  "
  SELECT AVG(price) avg_price, grade 
  FROM df 
  GROUP BY grade
  ORDER BY grade
  "
avg <- con %>% 
  dbSendQuery(query) %>% 
  dbFetch()

avg
# Average price per grade
query = """
  SELECT AVG(price) avg_price, grade 
  FROM df 
  GROUP BY grade
  ORDER BY grade
  """
avg = con.execute(query).fetchdf()
avg
R output

Highlight: queries to files

The last query will be applied directly to files on disk. To demonstrate this fantastic feature, we first save “df” as a parquet file and “avg” as a csv file.

write_parquet(df, "housing.parquet")
write.csv(avg, "housing_avg.csv", row.names = FALSE)
# Save df and avg to different file types
df.to_parquet("housing.parquet")  # pyarrow=7
avg.to_csv("housing_avg.csv", index=False)

Let’s load some columns of “housing.parquet” data, but only rows with grades having an average price of one million USD. Agreed, that query does not make too much sense but I hope you get the idea…πŸ˜ƒ

# "Complex" query
query2 <- "
  SELECT price, sqft_living, A.grade, avg_price
  FROM 'housing.parquet' A
  LEFT JOIN 'housing_avg.csv' B
  ON A.grade = B.grade
  WHERE B.avg_price > 1000000
  "

expensive_grades <- con %>% 
  dbSendQuery(query2) %>% 
  dbFetch()

head(expensive_grades)

# dbDisconnect(con)
# Complex query
query2 = """
  SELECT price, sqft_living, A.grade, avg_price
  FROM 'housing.parquet' A
  LEFT JOIN 'housing_avg.csv' B
  ON A.grade = B.grade
  WHERE B.avg_price > 1000000
  """
expensive_grades = con.execute(query2).fetchdf()
expensive_grades

# con.close()
R output

Last words

  • DuckDB is cool!
  • If you have strong SQL skills but do not know R or Python so well, this is a great way to get used to those programming languages.
  • If you are unfamiliar to SQL but like R and/or Python, you can use DuckDB for a while and end up being an SQL addict.
  • If your analysis involves combining many large files during preprocessing, then you can try the trick shown in the last example of this post.

The Python notebook and R code can be found at:

To leave a comment for the author, please follow the link and comment on their blog: R – Michael's and Christian's Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)