Working With Databases and SQL in RStudio

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Photo by Conny Schneider on Unsplash

Relational databases are a common way to store information, and SQL is a widely-used language for managing data held in these systems. RStudio provides several options to work with these crucial tools.

Let’s explore using a PostgreSQL database that contains FiveThirtyEight’s data on Bob Ross paintings.

Connecting to Databases With RStudio

You can connect to databases in RStudio, either by manually writing the connection code or using the Connections Pane.

Install the packages that correspond to your database. For example, you can connect to a bit.io PostgreSQL database by creating an account and inserting the repo’s details in a code chunk:

# Install these packages if you have not already
# install.packages(c('DBI', 'RPostgres'))

con <- DBI::dbConnect(
  RPostgres::Postgres(),
  dbname = 'ivelasq3',
  host = 'db.bit.io',
  port = 5432,
  user = 'ivelasq3_demo_db_connection',
  password = Sys.getenv('BITIO_KEY') # insert your password here
)

In addition to manually writing code, you can connect to databases with the Connections Pane in the IDE. It shows all the connections to supported data sources. You can also scan through your databases, see which connections are currently active, and close connections.

RStudio IDE Connections Pane and callouts for each of its functionality

RStudio IDE Connections Pane

For RStudio commercial customers, we offer RStudio Professional ODBC Driver. These are ODBC data connectors that help you connect to some of the most popular databases and use them in a production environment.

Querying Databases Using RStudio

Once you have your connection set up, you can run database queries in RStudio. There are several ways of doing this. Let’s explore RStudio’s SQL integration, the DBI package, the dbplyr package, and R Markdown.

SQL Integration in RStudio

The RStudio IDE has direct integration with .sql files. You can open, edit, and test those file types inside RStudio.

Generate a .sql file with your open connection (or go to File, New File, SQL Script) and start writing your query.

Notice that there’s a comment RStudio added to the top of the file:

-- !preview conn=con

This comment tells RStudio to execute the query against the open connection named con. Click Preview or press Ctrl + Shift + Enter to run the query, and your results appear in a new tab:

Screenshot of a SQL query in a SQL file and previewing the results in the RStudio IDE

The DBI package

You can query your data with the DBI::dbGetQuery() function. Paste your SQL code as a quoted string. Using the example database from earlier, let’s query the first three rows of the elements table:

DBI::dbGetQuery(con, 'SELECT episode, title FROM \"ivelasq3/elements\".\"elements\" LIMIT 3')

  episode                       title
1  S01E01 "\\"A WALK IN THE WOODS\\""
2  S01E02        "\\"MT. MCKINLEY\\""
3  S01E03        "\\"EBONY SUNSET\\""

The glue package package makes writing SQL queries a little easier. The glue::glue_sql() function is able to handle the SQL quoting and variable placement:

tbl_glue <-
  glue::glue_sql('SELECT episode, title FROM "ivelasq3/elements"."elements" LIMIT 3')

DBI::dbGetQuery(con, tbl_glue)

  episode                       title
1  S01E01 "\\"A WALK IN THE WOODS\\""
2  S01E02        "\\"MT. MCKINLEY\\""
3  S01E03        "\\"EBONY SUNSET\\""

The dbplyr package

You can write your queries with dplyr syntax using the dbplyr package.

library(dplyr)

tbl_dbplyr <-
  tbl(con, dbplyr::ident_q('"ivelasq3/elements"."elements"'))

The dbplyr package translates dplyr verbs into SQL queries, making it easy to work with the data from your database.

tbl_dbplyr %>%
  summarise(total = n())

# Source:   lazy query [?? x 1]
# Database: postgres [[email protected]:5432/ivelasq3]
    total
  <int64>
1     403

You can always inspect the SQL translation with the show_query() function. The dbplyr package will switch between SQL syntaxes based on the DB type (e.g., MS, Oracle, PG, etc.).

tbl_dbplyr %>%
  summarise(total = n()) %>% 
  show_query()

<SQL>
SELECT COUNT(*) AS "total", TRUE AS "na.rm"
FROM "ivelasq3/elements"."elements"

The dbplyr package allows you to work iteratively like you would in dplyr. All of your code is in R so you do not have to switch between languages to explore the data.

tbl_dbplyr2 <-
  tbl_dbplyr %>%
  mutate(night_and_ocean =
    case_when(night == 1 & ocean == 1 ~ 1,
              TRUE ~ 0))

tbl_dbplyr2 %>% 
  summarise(night_sum = sum(night),
            ocean_sum = sum(ocean),
            night_and_ocean_sum = sum(night_and_ocean))

# Source:   lazy query [?? x 3]
# Database: postgres [[email protected]:5432/ivelasq3]
  night_sum ocean_sum night_and_ocean_sum
      <dbl>     <dbl>               <dbl>
1        11        36                   4

Using the function collect(), we can then use our data with other functions or R packages such as ggplot2.

library(ggplot2)

tbl_ggplot <-
  tbl_dbplyr %>%
  collect() %>% 
  rowwise() %>% 
  mutate(total_number =
         as.numeric(sum(c_across(where(is.numeric))))) %>% 
  ggplot(aes(total_number)) +
  geom_histogram(fill = "#A4C689") +
  theme_minimal() +
  xlab("Number of elements by episode")

Histogram of number of elements by Bob Ross episode

R Markdown

Would you rather write verbatim SQL code? You can run SQL code in an R Markdown document. Create a sql code chunk and specify your connection with the connection = con code chunk option.

```{sql, connection = con}
SELECT episode, title
FROM "ivelasq3/elements"."elements"
LIMIT 3
```
Table 1: 3 records
episode title
S01E01 “"A WALK IN THE WOODS"”
S01E02 “"MT. MCKINLEY"”
S01E03 “"EBONY SUNSET"”

R Markdown provides options that simplify using SQL with R. For example, this post shows how you can use the cat engine to write the content of a chunk to a file.

```{cat, engine.opts = list(file = "select_tbl.sql", lang = "sql")}
SELECT episode, title
FROM "ivelasq3/elements"."elements"
LIMIT 3
```

You can read in the file using the code chunk option so you do not have to write out your SQL query.

```{sql, connection = con, code=readLines("select_tbl.sql")}
```

You can send the query output to an R data frame by defining output.var in the code chunk. Then you can reuse that data frame elsewhere in your code.

```{sql, connection = con, code=readLines("select_tbl.sql"), output.var = "dat"}
```

```{r}
print(dat)
```

  episode                       title
1  S01E01 "\\"A WALK IN THE WOODS\\""
2  S01E02        "\\"MT. MCKINLEY\\""
3  S01E03        "\\"EBONY SUNSET\\""

These options make working with SQL in R Markdown even smoother.

Learn More

This blog post just touched on a few examples of how to work with databases and SQL in RStudio. Check out more resources below.

To leave a comment for the author, please follow the link and comment on their blog: RStudio | Open source & professional software for data science teams on RStudio.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)