RDBL – manipulate data in-database with R code only

Posted on August 29, 2016 by David Kun in R bloggers | 0 Comments

[This article was first published on DataScience+, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post I introduce our own package RDBL, the R DataBase Layer. With this package you can manipulate data in-database without writing SQL code. The package interprets the R code and sends out the corresponding SQL statements to the database, fully transparently. To minimize overhead, the data is only fetched when absolutely necessary, allowing the user to create the relevant joins (merge), filters (logical indexing) and groupings (aggregation) in R code before the SQL is run on the database. The core idea behind RDBL is to let R users with little or no SQL knowledge to utilize the power of SQL database engines for data manipulation.

It is important to note that the SQL statements generated in the background are not executed unless explicitly requested by the command as.data.frame. Hence, you can merge, filter and aggregate your dataset on the database side and load only the result set into memory for R.

In general the design principle behind RDBL is to keep the models as close as possible to the usual data.frame logic, including (as shown later in detail) commands like aggregate, referencing columns by the $$$ operator and features like logical indexing using the $[]$ operator.

RDBL supports a connection to any SQL-like data source which supports a DBI interface or an ODBC connection, including but not limited to Oracle, MySQL, SQLite, SQL Server, MS Access and more.

To install RDBL, run the following command:

install.packages("RDBL", 
                 repos = c("https://lair.functionalfinances.com/repos/cran/", 
                           "https://lair.functionalfinances.com/repos/shared/"))

That’s right: you can install our private package (and all its dependencies) with an install.packages command, instead of the usual install_git setup where you have to go hunting for each dependency! For that kind of magic, we use laiR, our private R repository solution. You can find a demo version of this tool at lair.ownr.io. To hear more about that, please leave a comment and I will reach out to you.

For the source code, head to our private git repo and download the project.

Please note: the examples below require either RSQLite or RMySQL installed as well. In general you will need a connection driver package for the database you want to use RDBL with. It can be any database supporting either DBI or ODBC connections, let it be ROracle for your Oracle database or RSQLServer for your SQL Server. From here on, it is assumed that you have RDBL installed and loaded as well as the connection driver for the database you are using and you have set up a connection:

# load the package for the database driver
library(RSQLite)
# create a connection to the SQLite database
dbfile <- "Chinook_Sqlite.sqlite"
db.connection <- dbConnect(SQLite(), dbfile) 
SetConnection(db.connection)

If you want to use these examples with your own database, make sure to set up the DB connection to the database in place of the example connection to the SQLite database delivered along with the package.

The package comes with a database example in SQLite and one in MySQL. All examples below are using the SQLite database but everything works exactly the same way for the MySQL example database as well. For database layout see this link for the example database. The project you have downloaded also contains an example.R file showing how to use the package – this post walks you through that example.R file to showcase the possibilities.

Setting up the database models

RDBL provides an abstract representation for the database tables called a db.model. A db.model defines names for the table and its columns to be used in R. This serves two purposes: first, some databases support naming conventions that R does not (e.g. SQL Server allows spaces in column names) and, secondly and more importantly, this allows a decoupling of the physical data model and the R code, i.e. changes to the table name or the column names do not require changes to the code in any other place then the definition of the db.model.

# set up a database model using default settings
artist.db <- db.model("Artist")

# set up a database model defining mappings on the fly
artist.db.mapping <- db.model("Artist", 
                              ArtistId = "ArtistId", 
                              Name = "ArtistName")

As you can see, you can either use the default mapping for a table, in which case column names from the table are used as R names or you can specify the R names for the columns (but then you have to specify a name for each column you want in the db.model – any column you don't provide a mapping for will be left out of the model). Note that in case of the database containing non-supported characters in the column names you should always create an explicit mapping!

A quick look at the class of these variables show that they are indeed db.model's:

class(artist.db)
[1] "db.model"
attr(,"package")
[1] "RDBL"

To get the column names for a db.model, simply use the names command, just like in case a of data.frame:

names(artist.db)
[1] "ArtistId" "Name"

names(artist.db.mapping)
[1] "ArtistId"   "ArtistName"

As noted earlier, you can reference the columns of a db.model using the $ operator:

artist.db$Name
[1] "db.column"

To actually load the data for a model, simply convert it to a data.frame:

# load in the data from some of the tables
artists <- as.data.frame(artist.db)

artists
    ArtistId     Name
1          1     AC/DC
2          2     Accept
...
274      274     Nash Ensemble
275      275     Philip Glass Ensemble

Before we continue, let's set up some more models to showcase the possibilities:

# set up other database models for future use
album.db <- db.model("Album")
track.db <- db.model("Track")
genre.db <- db.model("Genre")

Adding derived columns

When defining models, another exciting possibility is adding a new column derived from other columns in the table, using simple SQL operators such as the + operator:

# define a new column using basic SQL operations such as +
track.db.mapping <- db.model("Track",
                             TrackId = "TrackId",
                             AlbumId = "AlbumId",
                             Bytes = "Bytes",
                             GenreId = "GenreId",
                             "Bytes + Bytes" = "Bytes2")

head(as.data.frame(track.db.mapping), 3)
  TrackId AlbumId    Bytes GenreId   Bytes2
1       1       1 11170334       1 22340668
2       2       2  5510424       1 11020848
3       3       3  3990994       1  7981988

Any SQL operator valid for a SELECT statement can be used in this form. Of course the new column is only "created" in the db.model, not in the actual database. Note: different databases may support different operators!

Renaming columns

One can also rename columns in the db.model, e.g. to make merging simpler:

# rename columns with the same name
# Note: no changes made to the database itself!
rename(genre.db, "Name", "GenreName")

names(genre.db)
[1] "GenreId"   "GenreName"

Alternatively, you can also assign values to the names(db.model) just like in case of a data.frame:

names(artist.db) <- c("ArtistId", "ArtistName")
names(artist.db)

[1] "ArtistId"   "ArtistName"

The same note applies: the renaming does not effect the database.

Merge is the now join

Another way to define a db.model is by merging other db.models (effectively creating a join on the database side):

# join tables together using merge
# artist x album x track x genre to get the mapping between artists and genre
artist.genre.db <- 
  merge(merge(merge(artist.db,
                    album.db,
                    by = "ArtistId")[,c("ArtistName", 
                                        "AlbumId")],
              track.db,
              by = "AlbumId")[, c("ArtistName",
                                  "GenreId")],
        genre.db,
        by = "GenreId")[, c("ArtistName",
                            "GenreName")]

Note: No data was loaded, instead a new db.model is created, see the class of the object:

class(artist.genre.db)
[1] "db.join"
attr(,"package")
[1] "RDBL"

it's not a db.model, it's a db.join which is a subclass of db.model.

A closer look at this new model shows that only the columns specified in the /([]/) operators are included:

names(artist.genre.db)
[1] "ArtistName" "GenreName"

Exactly the same model can be created by merging step-by-step:

# merge artists and albums, then restrict to several columns
artist.album.db <-
  merge(artist.db,
        album.db,
        by = "ArtistId")[,c("ArtistName", 
                            "AlbumId")]

# merge tracks and genres
track.genre.db <- 
  merge(track.db.mapping,
        genre.db,
        by = "GenreId")[,c("AlbumId", 
                           "GenreName")]

# merge the previous two merges to get the 
# artist-genre mapping
artist.genre.db.2 <- merge(artist.album.db,
                           track.genre.db)[, c("ArtistName",
                                               "GenreName")]

Logical indexing and column restrictions

Just like a data.frame, a db.model can be filtered by logical indexing and the columns to be returned can be specified as well:

# artists who have at least 1 rock song
rock.artists.db <- 
  unique(
    artist.genre.db[artist.genre.db$GenreName == "Rock", "ArtistName"]
  )

# note: again, no data loaded, see the class of the object
class(rock.artists.db)
[1] "db.unique"
attr(,"package")
[1] "RDBL"

It is interesting to note that the class is db.unique. Just like in case of db.join, db.unique is also a subclass of db.model.

Sorting

Ordering a db.model is also possible:

# one can also order the db.models
rock.artists.db.ordered <- rock.artists.db[order.db(rock.artists.db$ArtistName)]

Now let's load the result set for a quick look:

rock.artists <- 
  as.data.frame(rock.artists.db.ordered)

head(rock.artists, 4)
         ArtistName
1             AC/DC
2            Accept
3         Aerosmith
4 Alanis Morissette

Aggregating! In-database!

As you can know from my previous posts, I love aggregating! Fortunately, RDBL supports the aggregate command with a data.table-like syntax:

# which group has the most rock songs?
# aggregate on the database side
rocks <- artist.genre.db[artist.genre.db$GenreName == "Rock", ]  
most.rock.db <-     
  aggregate(rocks, 
            list(SongCount = count(rocks$GenreName)), 
            by = "ArtistName")

class(most.rock.db)
[1] "db.join"
attr(,"package")
[1] "RDBL"


# order the dataset by number of rock songs in decreasing order
most.rock.db <- most.rock.db[order.db(most.rock.db$SongCount, 
                                      decreasing = TRUE)]

# load the resulting data set from the database
head(as.data.frame(most.rock.db), 3)
    ArtistName SongCount
1 Led Zeppelin       114
2           U2       112
3  Deep Purple        92

You can also aggregate without specifying the by statement (effectively leaving out the GROUP BY statement on the database side):

# aggregate without a 'by' - aggregating through the whole (filtered) table
as.data.frame(aggregate(track.db[track.db$GenreId == 2, ], 
                        list(count = count(track.db$Name), 
                             total.size = sum(track.db$Bytes))))  

  count total.size
1   130 1233457751

As you can see you can define the aggregated columns in a list just like in data.table, while the by statement expects a character vector of the column names to group-by by (no pun intended).

Persisting

Finally, you can write a data set back to the database by using the default column mapping (in this case it is expected that the columns in the data.frame map one-to-one to the columns in the target db.model) or by specifying the column mapping yourself:

# write records into an existing table using auto-mapping of columns
auto.map.df <- data.frame(ArtistId = as.integer(runif(n=1, 
                                                      min = 8000000, 
                                                      max = 80000001)), 
                          ArtistName = "New_Artist")

write.db(auto.map.df, 
         artist.db, 
         append = TRUE)

# write records into an existing table using user-defined mapping of columns
to.map.df <- data.frame(aid = as.integer(runif(n=1, 
                                               min = 8000000, 
                                               max = 80000001)), 
                        aname = "New_Artist_2")

write.db(to.map.df, 
         artist.db, 
         append = TRUE,
         mapping = list(ArtistId = "aid",
                        ArtistName = "aname"))

That's all, folks! As you can see, RDBL allows database-manipulation from R without knowing SQL, although that never hurts. It also adds an abstraction layer in R to decouple from the physical data model. Finally, and, for many, most importantly: it allows loading resulting data sets only, trusting the database to do the heavy lifting when it comes to JOINs, WHERE clauses and GROUP BY statements (or merges, logical indexing and aggregation) reducing the size of the actually loaded data dramatically.

Related Post

To leave a comment for the author, please follow the link and comment on their blog: DataScience+.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.