Using PL/R and PL/Python in Postgres

[This article was first published on Data Science Riot!, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

I’ve recently been exploring options to calculate median and quartiles in my Postgres database. If you’re familiar with quartiles you know how handy they can be. There’s a few different options in the Postgres universe to accomplish this, so I figured I would give them all a whirl and see which was the friendliest (and fastest) on my CPU.

The Data

I’m using the “batting” table for Sean Lahman’s baseball database as my proof of concept. The table has just under 100,000 rows. Not too big, but a good test case. For my example here, I’m using the “r” column, which indicates total runs scored for a season.

The R Method

PL/R is a popular Postgres extension. If you haven’t checked it out, I would highly recommend it.

  • Pros: Very fast and simple code. Only one line of actual R code used to build this function.

  • Cons: R isn’t pre-installed on most systems and PLR isn’t shipped with Postgres. Takes a bit of system config., but not too much.

CREATE OR REPLACE FUNCTION r_quartile(ANYARRAY) RETURNS ANYARRAY AS $$
quantile(arg1, probs = seq(0, 1, 0.25), names = FALSE)
$$ LANGUAGE 'plr';

CREATE AGGREGATE quartile (ANYELEMENT) (
sfunc = array_append,
stype = ANYARRAY,
finalfunc = r_quartile,
initcond = '{}');

The Python Method

I’m a big fan of Python in general. It’s currently one of my favorite ETL languages. Here I’m using yet another great Postgres extension called PL/Python.

  • Pros: Python is pre-installed on most Linux and Mac systems making set up a breeze.

  • Cons: Surprisingly slow! Python ties to pipe all the data into the interpreter and then back out again as a function result. Too much system cost for me!

CREATE TYPE boxplot_values AS (
  min       numeric,
  q1        numeric,
  median    numeric,
  q3        numeric,
  max       numeric
);

CREATE OR REPLACE FUNCTION public._final_boxplot(strarr numeric[])
  RETURNS boxplot_values AS
$BODY$
    x = strarr
    a.sort()
    i = len(a)
    return ( a[0], a[i//4], a[i//2], a[i*3//4], a[-1] )
$BODY$
  LANGUAGE plpythonu IMMUTABLE
  COST 100;
 
CREATE AGGREGATE boxplot(numeric) (
  SFUNC=array_append,
  STYPE=numeric[],
  FINALFUNC=_final_boxplot,
  INITCOND='{}'
);

The C Method

Everyone remember C from your CS-101 class in college? Yeah, that’s why no one likes to write it. Fortunately, this is a pre-packaged Postgres extension written in C called Quantile. I’m not going to post the mile-long C code here, but you can see it on the GitHub repo.

  • Pros: BLAZING FAST! Returned an array faster than native SQL could calculate a median! I ended up putting the R solution into production because PL/R has room for further application, but if I were looking for speed and nothing else, the Quantile extension is the clear winner.

  • Cons: A third-party extension, so you’re at the mercy of the developers to keep things updated. This particular repo is about four years old and looks to be updated on a regular basis.

To leave a comment for the author, please follow the link and comment on their blog: Data Science Riot!.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)