A Data Science Lab for R

December 19, 2017

(This article was first published on R Views, and kindly contributed to R-bloggers)

In a previous post I described the role of analytic administrator: a data scientist who onboards new tools, deploys solutions, supports existing standards, and trains other data scientists. In this post I will describe how someone in that role might set up a data science lab for R.

Architecture

A data science lab is an environment for developing code and creating content. It should enhance the productivity of your data scientists and integrate with your existing systems. Your data science lab might live on your premises or in the cloud. It might be built with hardware, virtual machines, or containers. You may use it to support a single data scientist or hundreds of R developers. Here is one reference architecture of a data science lab based on server instances.

Key components of this setup include authentication, load balancing, a testing environment, data connectivity, and a publishing platform. In this server-based architecture, data scientists use a web browser to access the data science lab. High-performance compute and live data reside securely behind a firewall.

Instance Sizing

The size of your server instance depends on how many concurrent sessions you run and how much memory each session needs. Keep in mind that R is single-threaded by default and holds data in memory, so each busy session occupies roughly one core and its memory use grows with the data it loads. Here are some example server sizes:

| Instance Size | Cores | RAM | Description |
| --- | --- | --- | --- |
| Minimum recommended | 2 | 4 GB | This server will be for lightweight jobs, testing, and sandboxing. |
| Small | 4 | 8 GB | This server will support one or two analysts with small data. |
| Large | 16 | 256 GB | This server will support 15 analysts with a blend of large and small sessions. Alternatively, it will support dozens of analysts with small sessions. |
| Jumbo | 32+ | 1 TB+ | May be useful for heavier workloads. |
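The sizing reasoning above can be turned into rough arithmetic. The following R sketch is one possible way to do that, not an official sizing rule. It assumes one core per concurrently busy session (because R is single-threaded by default) and a hypothetical 25% memory headroom, then picks the smallest tier from the table that fits. The name `suggest_instance` and its parameters are illustrative.

```r
# Example tiers copied from the table above; "Jumbo" uses its lower bounds.
tiers <- data.frame(
  size   = c("Minimum recommended", "Small", "Large", "Jumbo"),
  cores  = c(2, 4, 16, 32),
  ram_gb = c(4, 8, 256, 1024),
  stringsAsFactors = FALSE
)

# Assumptions (not from the article): one core per busy session, and a
# headroom multiplier on memory for the OS and temporary copies of data.
suggest_instance <- function(sessions, gb_per_session, headroom = 1.25) {
  need_cores <- sessions
  need_ram   <- sessions * gb_per_session * headroom
  fits <- tiers$cores >= need_cores & tiers$ram_gb >= need_ram
  if (!any(fits)) {
    return("Larger than Jumbo: consider load balancing across instances")
  }
  # Tiers are ordered from smallest to largest, so take the first that fits.
  tiers$size[which(fits)[1]]
}

suggest_instance(sessions = 2, gb_per_session = 2)   # two analysts, small data
suggest_instance(sessions = 15, gb_per_session = 8)  # blend of session sizes
```

Because many sessions sit idle most of the time, the one-core-per-session assumption is conservative; a server can often host more small sessions than this estimate suggests.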

Open-Source R

If you haven’t done so already, I recommend you make R a legitimate part of your organization by officially recognizing it as an analytic standard. You should be familiar with installing and managing R and its packages.

You can install R as a pre-compiled binary from a repository, or you can install R from source. Installing R from source allows you to install multiple versions of R side by side. If you compile R from source, I recommend you link R against an optimized BLAS library so that you can speed up low-level linear algebra such as matrix multiplication.
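To see which BLAS a given R build is actually using, and to get a rough feel for linear algebra speed, something like the following can be run from the R console. This is a minimal sketch that assumes a recent version of R, where `extSoftVersion()` reports the BLAS in use and `La_library()` reports the LAPACK library. The 500 x 500 matrix size is an arbitrary choice for illustration.

```r
# Report the BLAS and LAPACK libraries this R build is linked against.
blas_path   <- extSoftVersion()[["BLAS"]]
lapack_path <- La_library()
cat("BLAS:  ", blas_path, "\n")
cat("LAPACK:", lapack_path, "\n")

# Rough timing of a BLAS-heavy operation (matrix cross-product).
set.seed(1)
m <- matrix(rnorm(500 * 500), nrow = 500)
timing <- system.time(xtx <- crossprod(m))
print(timing[["elapsed"]])
```

Running the same timing on two builds of R, one with the reference BLAS and one with an optimized BLAS, gives a quick comparison on your own hardware.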

Data science labs tend to require a modern toolkit. You should expect to upgrade R at least once a year. You should also keep your operating system up to date. New and improved R packages tend to work better when you use them with recent versions of R and updated system libraries.
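A small check like the one below can help an administrator see whether a server is keeping up. It is only a sketch: the minimum version and the CRAN mirror URL are assumptions to replace with your own standard, and `list_outdated` is a hypothetical helper name.

```r
# Hypothetical minimum version for the lab; adjust to your own standard.
minimum_r <- "3.4.0"

r_status <- list(
  running       = as.character(getRversion()),
  meets_minimum = getRversion() >= minimum_r
)
print(r_status)

# Wrapped in a function because it needs network access to a repository.
# The mirror URL is an assumption; use your organization's repository.
list_outdated <- function(repos = "https://cloud.r-project.org") {
  outdated <- old.packages(repos = repos)
  if (is.null(outdated)) character(0) else rownames(outdated)
}
```

Calling `list_outdated()` returns the names of installed packages that have newer versions available in the chosen repository.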

RStudio Server Pro

Building a data science lab involves installing, configuring, and managing tools. In this section I will describe how to administer RStudio Server Pro, which has features for authentication, security, and admin controls.

1. Installation

Once you have installed R, you can install RStudio Server Pro by downloading the binaries and following the instructions. You will need root privileges to install and run the software. You will also need to create local system accounts for all of your R developers.

2. Configuration

Authentication. The first thing you will want to do after you install RStudio Server Pro is to configure it to use your authentication system. RStudio Server Pro supports LDAP via PAM sessions. If you use single sign-on or another system, you can configure RStudio Server Pro to work in proxied authentication mode. You can also authenticate via Google accounts and local system accounts.

Data Connectivity. Most data scientists use R with databases. The RStudio Pro Drivers are ODBC drivers that will connect R to some of the most popular databases today. These drivers are a free add-on for RStudio Server Pro. If you are using a data source that is not supported, or if you are using the open-source version of RStudio Server, you can bring your own ODBC driver.
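Once an ODBC driver and a data source name (DSN) are configured on the server, a connection from R might look like the sketch below. It assumes the DBI and odbc packages are installed. The DSN name "warehouse" and the helper `connect_lab_db` are hypothetical and stand in for whatever your administrator has defined.

```r
# Sketch only: "warehouse" is a hypothetical DSN defined in odbc.ini.
connect_lab_db <- function(dsn = "warehouse") {
  if (!requireNamespace("DBI", quietly = TRUE) ||
      !requireNamespace("odbc", quietly = TRUE)) {
    stop("Install the DBI and odbc packages first.")
  }
  DBI::dbConnect(odbc::odbc(), dsn = dsn)
}

# Typical use once the DSN exists:
# con <- connect_lab_db("warehouse")
# DBI::dbListTables(con)
# DBI::dbDisconnect(con)
```

Keeping connection details in a DSN on the server, rather than in each script, lets the administrator change drivers or hosts without touching analysts' code.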

Load Balancing. If you want to load balance your server instances, you can use the load balancer that is built into RStudio Server Pro or you can bring your own load balancer. Load balancing is designed to distribute user sessions seamlessly across the cluster and provide high availability. It requires a shared home drive that is mounted on each of the instances.

More Features. RStudio Server Pro has a list of features that you can configure. You should decide which features you want to enable or disable. For more information on configuring each of these features, see the admin guide.

Features and their descriptions:

Authentication

  • LDAP, Active Directory, Google Accounts and system accounts
  • Full support for Pluggable Authentication Modules, Kerberos via PAM, and custom authentication via proxied HTTP header

Data Connectivity

  • RStudio Professional Drivers are ODBC data connectors that help you connect to some of the most popular databases.

Load Balancing

  • Load balance R sessions across two or more servers
  • Ensure high availability using multiple masters

Enhanced security

  • Encrypt traffic using SSL and restrict client IP addresses

Administrative dashboard

  • Monitor active sessions and their CPU and memory utilization
  • Suspend, forcibly terminate, or assume control of any active session
  • Review historical usage and server logs

Auditing and monitoring

  • Monitor server resources (CPU, memory, etc.) on both a per-user and system-wide basis
  • Send metrics to external systems with the Graphite/Carbon plaintext protocol
  • Health check with configurable output (custom XML, JSON)
  • Audit all R console activity by writing input and output to a central location

Advanced R session management

  • Tailor the version of R, reserve CPU, prioritize scheduling, and limit resources by user and group
  • Provision accounts and mount home directories dynamically via the PAM Session API
  • Automatically execute per-user profile scripts for database and cluster connectivity

Project sharing

  • Share projects and edit code files simultaneously with others

3. Management

Once RStudio Server Pro is installed and configured, you’ll need to manage it over time. RStudio Server Pro comes with a variety of tools for workspace and server management that will help keep your environment organized. For example, you can kill sessions, set session timeouts, and broadcast notifications to user sessions in real time. You can manage product licenses for both online and offline environments. If your instances start and stop frequently, you can use a floating license manager.

Next Steps

Your data science lab for R should be designed to scale. That might mean adding more people, more systems, or more tools. It also might mean creating more content. Shiny is an R package that makes it easy to build interactive web apps straight from R. R Markdown is an R package that makes it easy to author reports and build dashboards. You can publish your Shiny apps or R Markdown reports with the push of a button to RStudio Connect. RStudio Connect lets you share and manage content in one convenient place. You can also publish Shiny apps to shinyapps.io, which allows you to share your Shiny apps online.
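For readers who have not seen Shiny, here is a minimal sketch of an app, assuming the shiny package is installed. The input and output names (`n`, `hist`) and the title are arbitrary choices for illustration.

```r
library(shiny)

# User interface: one slider and one plot.
ui <- fluidPage(
  titlePanel("Sample size explorer"),
  sliderInput("n", "Number of draws", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server logic: redraw the histogram whenever the slider changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal draws", xlab = "Value")
  })
}

app <- shinyApp(ui = ui, server = server)
# In an interactive session, print(app) or runApp(app) launches it.
```

Saved as `app.R` in its own directory, an app like this is the kind of content that can then be published from the RStudio IDE.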
