Visualizations are one of R’s strengths. There are many functions and packages that create complex plots, often with one simple command. These plotting functions do two things: first, they take the raw data and run the calculations needed for a given visualization, and second, they draw the plot. If the source of the data resides within a database, the usual approach is to import all of the data and then create the plot. This is a problem, especially if the data is large.
A strategy to address this problem is found in the new Database with RStudio website. The Creating Visualizations page outlines a solution that introduces the “Transform in Database, plot in R” concept, and demonstrates its practical implementation. The article focused on knowledge sharing, rather than on providing a tool.
dbplot package is meant to collect multiple functions for in-database visualization code. It implements the principles laid out in the Creating Visualizations page, and it provides three types of functions:
Helper functions that return a
Helper functions that return the results of the plot’s calculations
db_bin()function introduced in the Creating Visualizations page
The package provides calculations or “base”
ggplot2 visualizations for the following:
dbplot from GitHub using the
This example will use a Microsoft SQL Server database connection to provide a quick glance of how the package works. For more examples, please visit the package’s GitHub repository.
dbplot_histogram() function creates a 30-bin histogram by default. Because it uses
dplyr commands to perform the bin calculations, the function will work with any database that has
dplyr support, including
sparklyr. The only caveat is that the database must support basic functions like
min(), which some database types do not support.
library(dbplyr) tbl(con, "airports") %>% dbplot_histogram(alt)
This example shows how the resulting plot object can be further refined after the
dbplot_histogram() function returns a plot:
tbl(con, "airports") %>% dbplot_histogram(alt, binwidth = 700) + labs(title = "Airports Altitude") + theme_minimal()
If more control over the plot is needed, then the
db_compute_bins() function returns a data frame with the lowest value of each bin and the record count per bin:
tbl(con, "airports") %>% db_compute_bins(alt)
## # A tibble: 28 x 2 ## alt count ## <dbl> <int> ## 1 -54.0 559 ## 2 250.4 176 ## 3 554.8 203 ## 4 859.2 131 ## 5 1163.6 82 ## 6 1468.0 40 ## 7 1772.4 20 ## 8 2076.8 18 ## 9 2381.2 16 ## 10 2685.6 12 ## # ... with 18 more rows
The results of the compute command can then be piped into a plot:
tbl(con, "airports") %>% db_compute_bins(alt) %>% ggplot() + geom_col(aes(alt, count, fill = count))
dbplot package includes the
db_bin() function, first introduced in the Creating Visualizations page. For more information, please read the Histogram section.
## (((max(any_field) - min(any_field))/(30)) * ifelse((as.integer(floor(((any_field) - ## min(any_field))/((max(any_field) - min(any_field))/(30))))) == ## (30), (as.integer(floor(((any_field) - min(any_field))/((max(any_field) - ## min(any_field))/(30))))) - 1, (as.integer(floor(((any_field) - ## min(any_field))/((max(any_field) - min(any_field))/(30))))))) + ## min(any_field)
More plots will be possible as
dplyr-to-SQL translations are fine-tuned and enhanced. The
dbplot package will be the place where new calculations and plots will be implemented.