Relational databases, such as MySQL, organize data into tables consisting of rows and columns similarly to an R
data.frame. While many beginners will be satisfied with R’s native
RData storage format, storing data in a flat-file has its disadvantages. First, they are difficult to share among collaborators. .RData files can contain many matrices and data.frames. It’s very easy to create different versions if two collaborators are performing analyses which require cleaning the data. Second, they are insecure compared to the permissions available to administrators of SQL databases. For example, I may want a research assistant or an analyst on the data science team to have read-only access to certain tables to prevent data corruption, and restrict access entirely to sensitive, identifiable data such as participant names and email addresses. Using MySQL, I can grant
SELECT to individuals that require read-only access and each individual can have separate login credentials. Thus, if an individual leaves the academic lab or resigns from the company, I can simply disable their access. Third, you can easily backup and version your database using
mysqldump or automatic backups if you choose to use Amazon’s RDS. Did you
merge incorrectly or accidentally wipe your data? No problem, just roll back to a previous version. Fourth, data can be efficiently streamed into R using SQL queries. As you may know, R data is stored entirely in-memory. This means to analyze a 24 GB dataset, you’ll need at least 24GB of free RAM in your computer in addition to what’s needed to perform your analysis. Instead, if I know that I only need last month’s customers to perform an analysis, I can stream only that data into R using a query such as:
SELECT * FROM table WHERE date > '2015-07-01';
If I wanted, I could automatically filter out incomplete values before it gets into R:
SELECT * FROM table WHERE date > '2015-07-01' AND test_score IS NOT NULL;
Interested? In this tutorial, you will learn how to store and retrieve data from a MySQL database with R using the
Since MySQL runs on a variety of platforms, I’ll demonstrate how to install it on Mac, Linux, and Windows, as well as how to provision it using Amazon RDS. These instructions are by no means flawless – feel free to email us feedback if you find problems.
Using Homebrew on OS X
Homebrew is a package manager for OS X that enables the installation of a variety of packages from source. For our purposes, we’re going to install MySQL and run it as a daemon each time your computer starts. To install Homebrew, open Terminal.app and run the following code:
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Before doing anything, run
brew doctor and make sure you fix any of its complaints before proceeding. After you have a working Homebrew installing, you can now install MySQL:
# Update Homebrew brew update # Install MySQL brew install MySQL # Tell MySQL to load at startup ln -sfv /usr/local/opt/mysql/*.plist ~/Library/LaunchAgents launchctl load ~/Library/LaunchAgents/homebrew.mxcl.mysql.plist
Congrats! Try logging in:
You can also try mycli, which offers a more modern CLI for MySQL with features such as syntax completion. If that command fails, check that the server is actually running:
ps -ax | grep mysqld
Installing MySQL on Ubuntu
Chances are, you may have an existing server in your lab or on an EC2 instance that you want to use for this tutorial. If you don’t, I highly recommend you sign up for Amazon’s Free Tier and try it out. Luckily, Linux distributions such as Ubuntu and Fedora come with package managers that make installing MySQL easier for end-users. In this example, we’ll use the APT package manager to complete the installation. Remember, you’ll need
sudo access to complete this installation. If you don’t have administrative access, contact your server administrator and ask them to install MySQL for you.
# Update APT sudo apt-get update # Install the server and client sudo apt-get install mysql-server-5.6 mysql-client-5.6
If you’re also using R on the same system, you can install RMySQL using APT. The benefit is that you avoid troubleshooting any compilation errors. However, you may want to add the official CRAN repo to your APT
sources.list file before doing this. In my case, I verified my Ubuntu version by running
lsb_release -a and added the following line to my APT sources list at
deb http://cran.rstudio.org/bin/linux/ubuntu trusty/. Once you have the CRAN Ubuntu repo installed and have updated APT, you can install many CRAN plugins from APT:
sudo apt-get install r-cran-rmysql
Setting up RDS on AWS
First, sign up for AWS using your Amazon account. Once you’re in, navigate to the RDS section under “Databases”:
Next, we’re going to install another package to extend R’s native capabilities and allow it to read and write from MySQL. First, try to install the binary package from CRAN:
install.packages("RMySQL") # Load the library library(RMySQL)
If you don’t get any errors, you’re all set! However, depending on your version of R and your platform, you may have to build from source. If that’s the case, don’t fret! As long as you’ve installed the MySQL headers (for example, using Homebrew) you’ll be able to specify that you want to install
RMySQL from source, in which it will compile the package:
install.packages("RMySQL", type = "source"> library(RMySQL) Loading required package: DBI
Next, we’ll discover how to read to and write data from a table in your MySQL database. This tutorial won’t deal with the best ways to store data, but I highly recommend the reader take a peak at Wickham, H. (2014). Tidy Data.
RMySQL contains many functions, but the three we’ll focus on are 1)
dbWriteTable(). These, as you’ve no doubt already guessed, allow you to 1) list all the tables available in a given database, read a MySQL table into R as a
data.frame, and write to a MySQL table in a way conceptually similar to how
write.csv() operates. First, let’s create the connection to the database and explore how some of the functions operate. Note that in my case,
dbListTables() returns a character vector with each table name because I’m connecting to an existing database. New databases should not return anything.
library(DBI) con_sql <- dbConnect(RMySQL::MySQL(), group = "group-name") # List tables in current database dbListTables(con_sql)  "year1" "year2" "year3" "year4" # Write data.frame to MySQL dbWriteTable(conn = con_sql, name = 'mtcars', value = mtcars) # Read MySQL table to data.frame mtcars <- dbReadTable(conn = con_sql, name = 'mtcars') # Check out your data! head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Since the result of
dbReadTable() is itself a data.frame, you can stream SQL directly into your plot.ly graphs and play around as usual. This functionality is fantastic for building persistent dashboards for your analysis.
library(plotly) plot_ly(data = mtcars, color = as.factor(cyl), y = mpg, type = "box")
Securing your Credentials in R
You’ll notice that in my example above I didn’t embed my credentials directly in my R script. Scripting usernames and passwords is a terrible idea in practice. For example, what if you upload your analysis to a public Github? Suddenly, the world knows that your password is
ilovecats!. We can avoid such embarrassment by storing our credentials in a .
my.cnf configuration file that RMySQL will look for and read. Values such as the database name, your username, and corresponding password can then be safely stored in your
$HOME directory, away from your analysis. To do this, create a file called
my.cnf in your home directory:
[group-name] database=test user=root password= host=127.0.0.1
While this post is mainly concerned with 1) connecting R to MySQL databases and 2) reading and writing data, I encourage readers to check out the Khan Academy’s free course on SQL. Using the
dbSendQuery() function, we can perform more complex queries that involve joining and subsetting data before they’re read into R:
query <- dbSendQuery(con_sql, "select * from mtcars where cyl = 4;") data <- fetch(query, n=10) dbClearResult(query)