Using PostgreSQL in R: A quick how-to

February 1, 2016
By

(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of big data infrastructure. In some cases you can use a serverless SQL database that gives you the power of SQL for data manipulation, while maintaining a lightweight infrastructure.

We call this work pattern “SQL Screwdriver”: delegating data manipulation to a lightweight infrastructure with the power of SQL for data manipulation.

NewImageImage: Iainf, some rights reserved.

We assume for this how-to that you already have a PostgreSQL database up and running. To get PostgreSQL for Windows, OSX, or Unix use the instructions at PostgreSQL downloads. If you happen to be on a Mac, then Postgres.app provides a “serverless” (or application oriented) install option.

For the rest of this post, we give a quick how-to on using the RpostgreSQL package to interact with Postgres databases in R.

You have your PostgresSQL database up and running. Now you want to work with the data in that database in R. First, let’s create a data frame that we want to insert into the database.

# An example data frame to play with

iris = as.data.frame(iris)
summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  

The column names of this data frame are problematic for databases (and especially PostgreSQL) for a few reasons: the “.”s in the names can be an issue, and PostgreSQL expects column names to be all lowercase. Here’s a function to make the column names db safe:

# make names db safe: no '.' or other illegal characters,
# all lower case and unique
dbSafeNames = function(names) {
  names = gsub('[^a-z0-9]+','_',tolower(names))
  names = make.names(names, unique=TRUE, allow_=TRUE)
  names = gsub('.','_',names, fixed=TRUE)
  names
}
colnames(iris) = dbSafeNames(colnames(iris))
summary(iris)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50

Now let’s open up a database connection and insert the table.

# Create a connection to the database
library('RPostgreSQL')

## Loading required package: DBI

pg = dbDriver("PostgreSQL")

# Local Postgres.app database; no password by default
# Of course, you fill in your own database information here.
con = dbConnect(pg, user="ninazumel", password="",
                 host="localhost", port=5432, dbname="ninazumel")

# write the table into the database.
# use row.names=FALSE to prevent the query 
# from adding the column 'row.names' to the table 
# in the db
dbWriteTable(con,'iris',iris, row.names=FALSE)

## [1] TRUE

The function dbWriteTable() returns TRUE if the table was successfully written. Note this call will fail if iris already exists in the database. Use overwrite=TRUE to force overwriting of an existing table, and append=TRUE to append to an existing table.

Now you can read the table back out.

# read back the full table: method 1
dtab = dbGetQuery(con, "select * from iris")
summary(dtab)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    

# read back the full table: method 2
rm(dtab)
dtab = dbReadTable(con, "iris")
summary(dtab)

##   sepal_length    sepal_width     petal_length    petal_width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    species         
##  Length:150        
##  Class :character  
##  Mode  :character  

Of course, the point of using a database is to extract subsets or transformations of your data, using SQL.

# get part of the table
rm(dtab)
dtab = dbGetQuery(con, "select sepal_length, species from iris")
summary(dtab)

##   sepal_length     species         
##  Min.   :4.300   Length:150        
##  1st Qu.:5.100   Class :character  
##  Median :5.800   Mode  :character  
##  Mean   :5.843                     
##  3rd Qu.:6.400                     
##  Max.   :7.900

You can use dbSendQuery for sending queries that don’t return a data-frame-like result.

# remove table from database
dbSendQuery(con, "drop table iris")

# commit the change
dbCommit(con)

When you are done, disconnect.

# disconnect from the database
dbDisconnect(con)

And that’s it!

Extra: RPostgreSQL and sqldf

If you are accustomed to manipulating data with SQL, you may prefer SQL notation to the sometimes convoluted calling conventions of the analogous R operations like aggregate(), or the functions in the dplyr package. In this case you have probably already discovered the sqldf package, which allows you to manipulate data frames using SQL. If you are a sqldf user, there is an additional subtlety if you are also using RPostgreSQL or other R packages for talking to databases: sqldf uses its own internal (and ephemeral) database to perform its operations, but if RPostgreSQL is loaded, sqldf will pick up your PostgreSQL driver by default. This is probably not what you want.

options(gsubfn.engine = "R")
library(sqldf)  
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
## sqldf will default to using PostgreSQL <=== NOTE THIS!

# Now try to use sqldf to aggregate sepal_length by species
query = "select avg(sepal_length) avg_sepal_length, 
                species 
         from dtab 
         group by species"

sqldf(query)

# Error in postgresqlNewConnection(drv, ...) : 
#   RS-DBI driver: (could not connect [email protected] on dbname "test"
# )
# Error in !dbPreExists : invalid argument type

To use sqldf on local data frames, you must specify the driver and dbname explicitly.

sqldf(query, drv="SQLite", dbname=":memory:")

##   avg_sepal_length    species
## 1            5.006     setosa
## 2            5.936 versicolor
## 3            6.588  virginica

Extra: PostgreSQL and dplyr

If you do use dplyr, the good news is that you can connect to a PostgreSQL database directly through the dplyr function src_postgres().

library('dplyr')

# Connect to local PostgreSQL via dplyr
localdb <- src_postgres(dbname = '',
                           host = 'localhost',
                           port = 5432,
                           user = 'ninazumel',
                           password = '')

# cheat and access the db connection directly
# assume we have made the colnames db safe
dbWriteTable(localdb$con,'iris',iris, row.names=FALSE)

## [1] TRUE

The tbl() command lets you access tables in the database remotely, and sql() lets you send queries.

# this is not a data frame; it's a dplyr PostgreSQL handle into the database
d = tbl(localdb, "iris") 
d

## Source: postgres 9.5.0 [[email protected]:5432/ninazumel]
## From: iris [150 x 5]
## 
##    sepal_length sepal_width petal_length petal_width species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## ..          ...         ...          ...         ...     ...

# this is a data frame
dtab = as.data.frame(d)

# send a query through dplyr
query = "select avg(sepal_length) avg_sepal_length, 
                species 
         from iris
         group by species"
dsub = tbl(localdb, sql(query))
dsub

## Source: postgres 9.5.0 [[email protected]:5432/ninazumel]
## From:  [?? x 2]
## 
##    avg_sepal_length    species
## 1             5.936 versicolor
## 2             6.588  virginica
## 3             5.006     setosa
## ..              ...        ...

# make it local
dsub = as.data.frame(dsub)
summary(dsub)

##  avg_sepal_length   species         
##  Min.   :5.006    Length:3          
##  1st Qu.:5.471    Class :character  
##  Median :5.936    Mode  :character  
##  Mean   :5.843                      
##  3rd Qu.:6.262                      
##  Max.   :6.588

# shuts down database
rm(list=c('d','localdb')); gc() 

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)