Amazon’s columnar database, Redshift is a great companion for a lot of Data Science tasks, it allows for fast processing of very big datasets, with a familiar query language (SQL).
There are 2 ways to load data into Redshift, the classic one, using the
INSERT statement, works, but it is highly inefficient when loading big datasets. The other one, recommended in Redshift’s docs, consists on using the
One of the easiests ways to accomplish this, since we are already using Amazon’s infrastructure, is to do a load from S3. S3 loading requires that you upload your data to Redshift and then run a
COPY statement specifying where your data is.
Also, because Redshift is a distributed database, they recommend you to split your file, in a number of files which are a multiple of the number of slices on your database, so they can load it in parallel, and they also let you gzip your files for a faster upload.
Wait a second, now to upload a big dataset fast we have to:
- Create a table in Redshift with the same structure as my data frame
- Split the data into N parts
- Convert the parts into a format readable by Redshift
- Upload all the parts to Amazon S3
- Run the COPY statement on Redshift
- Delete the temporary files on Amazon S3
That does seem like a lot of work, but don’t worry, i’ve got your back! I’ve created an R Package which does exactly this, it’s
redshiftTools! The code is available at github.com/sicarul/redshiftTools.
To install the package, you’ll need to do:
install.packages('devtools') devtools::install_github("RcppCore/Rcpp") devtools::install_github("rstats-db/DBI") devtools::install_github("rstats-db/RPostgres") install.packages("aws.s3", repos = c(getOption("repos"), "http://cloudyr.github.io/drat")) devtools::install_github("sicarul/redshiftTools")
Afterwards, you’ll be able to use it like this:
library("aws.s3") library(RPostgres) library(redshiftTools) con <- dbConnect(RPostgres::Postgres(), dbname="dbname", host='my-redshift-url.amazon.com', port='5439', user='myuser', password='mypassword',sslmode='require') rs_replace_table(my_data, dbcon=con, tableName='mytable', bucket="mybucket") rs_upsert_table(my_other_data, dbcon=con, tableName = 'mytable', bucket="mybucket", keys=c('id', 'date'))
rs_replace_table truncates the target table and then loads it entirely from the data frame, only do this if you don't care about the current data it holds. On the other hand,
rs_upsert_table replaces rows which have coinciding keys, and inserts those that do not exist in the table.
Please open an issue in Github if you find any issues. Have fun!