RStudio Connect as a Solution for Remote Data Science Teams

[This article was first published on r – Appsilon Data Science | End to End Data Science Solutions, and kindly contributed to R-bloggers.]

The situation around COVID-19 has forced many countries around the world to adopt social distancing policies. This means that more of the world’s workforce is working remotely than ever before. Data Science teams are no exception. Distributed teams bring unique challenges, and managers may be looking for new tools. In this article we’ll explain how RStudio Connect helps organizations to properly organize teams and overcome the typical inefficiencies of remote work.

Some common problems for distributed teams include:

  • Onboarding new users, teams, and “teams of teams” 
  • Version control and arriving at a “single source of truth”
  • Organizational overhead
  • Security issues

At Appsilon we’ve grappled with these same challenges as we’ve promoted a remote-work-friendly culture. Our data scientists and developers collaborate with each other daily from at least three cities in two different countries, and we frequently work with clients around the globe in faraway time zones. We’ve found that RStudio Connect is a tool that can aid all of the parties involved with Data Science in an organization: producers of artifacts, consumers of artifacts, and IT Administrators. RStudio Connect empowers employees to consume and distribute information within an organization, and it removes much of the unnecessary labor that goes into these processes.

Getting Started and Onboarding New Members to the DS Ecosystem

One of the first problems that an organization may encounter in a remote work scenario is onboarding new individuals and teams to the data science ecosystem. RStudio Connect shortens the time it takes to get remote teams up and running with sharing and consuming R/Shiny applications. One of the main reasons for this is that much of the infrastructure work is completed for you automatically – there’s no need to design and maintain your own internal solutions for problems like user authentication. We’ve seen organizations spend vast amounts of developer time endlessly replicating features that are included automatically in RStudio Connect. 

Some organizations do not have IT Administrator support for their data science teams and users. In this case, the data scientists themselves may have to deploy and manage RStudio Connect. Connect’s developers had this use case in mind. RStudio has provided a “Jump Start Examples” tutorial within Connect to help Data Scientists adapt to their new environment and quickly learn best practices. This reduces the hands-on work that team leaders have to do to onboard new users and ensures that everyone gets started with the same common knowledge of the ecosystem and its capabilities.

Jump Start Examples [Source: RStudio]

Simplifying the Role of the System Administrator

RStudio Connect can help simplify the role of the system administrator by offering tools to manage visitor load:

  • Detailed metrics for the server and the associated processes
  • Logs for all processes spawned by Connect
  • Secure deployments and interactions with artifacts using SSL/TLS

Then there is the issue of access management. The recent release (1.8.0) makes it even easier to support data science teams with one enhancement in particular: seamless single sign-on (SSO) integration. RStudio Connect can integrate with the SAML Identity Provider (or IdP) of your company’s choice to perform user authentication and, optionally, user/group membership management. In the SAML world, RStudio Connect fulfills the role of service provider (or SP).

Plus, every RStudio Connect user account is configured with a role that controls its default capabilities on the system. Data scientists, analysts and others working in R will most likely want “publisher” accounts. Other users are likely to need only “viewer” accounts.

Task Scheduling with RStudio Connect

One powerful feature of RStudio Connect is the ability to schedule tasks. These tasks can be anything from simple ETL jobs to daily reports. Version 1.8.0 makes it easier for administrators to track these tasks across all publishers in a single place. This new view makes it possible to identify conflicts or times when the server is being overbooked.

RStudio Connect Task Scheduling [Source: RStudio]

Version Control and Single Source of Truth

An important reason to use RStudio Connect is the single source of truth feature. It is built around the “pins” R package and provides a way for R users to easily share resources using RStudio Connect. Your resources may be text files (CSV, JSON, etc.), R objects (.Rds, .Rda, etc.), or any other type of file you want to share. Sharing these files can be useful in many situations, such as when multiple pieces of content require the same data. Rather than copying that data, each piece of content references a “single source of truth” hosted on RStudio Connect.

When content depends on processed datasets or model objects that need to be regularly updated, rather than redeploying the content each time the information changes, use a pinned resource and update only the dataset or model. The update can be automated using a scheduled R Markdown document. Other deployed content will read the newest data on each run.

Connect is also helpful when you need to share resources that aren’t structured for traditional tools like databases. Models saved as R objects aren’t easy to store in a database. Rather than using email or file systems to share these R objects, use RStudio Connect to host these resources as pins. This ensures that everyone has easy access to the R objects in a single place.
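The pin workflow described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: it uses the pins 1.0.0+ API (`pin_write()` / `pin_read()`), a temporary local board stands in for the Connect server so the code runs anywhere, and the `sales` data and the pin name `sales_summary` are hypothetical.

```r
library(pins)

# Assumption: a temporary local board stands in for RStudio Connect here.
# Against a real server you would create the board with board_connect().
board <- board_temp()

# Producer side: publish a processed dataset as a pin (hypothetical data).
sales <- data.frame(region = c("north", "south"), units = c(120, 95))
pin_write(board, sales, name = "sales_summary", type = "rds")

# Consumer side: any report or app reads the shared copy by name.
latest <- pin_read(board, "sales_summary")

# Later, a scheduled job updates only the pin, not the content that uses it.
sales$units <- sales$units + c(10, 5)
pin_write(board, sales, name = "sales_summary", type = "rds")

# Content that reads the pin again picks up the new data.
refreshed <- pin_read(board, "sales_summary")
```

With a Connect board, the consumer code stays the same apart from the board definition, which is what lets several pieces of content share one dataset or model object.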

A single source of truth saves time for all participants, wherever they may be located.

Custom Emails: Reduce Manual Tasks

So now your data science ecosystem is up and running. The next step is communicating results: sending plots, tables, and results inline in emails is a powerful way for data scientists to make an impact. RStudio Connect allows you to create custom emails to send daily reminders and conditional alerts, and to track key metrics. The latest release of the blastula package makes it even easier for data scientists to specify these emails programmatically. In the example below, an alert is sent only when the demand forecast exceeds 1,000 units; otherwise the scheduled email is suppressed:

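library(blastula)  # Connect email helpers; also makes the %>% pipe available

# `demand_forecast` and `increase` are assumed to be computed earlier in the
# scheduled R Markdown report that contains this chunk.
if (demand_forecast > 1000) {
  # Render the alert email and attach it to this report's scheduled run.
  render_connect_email(input = "alert-supply-team-email.Rmd") %>%
    attach_connect_email(
      subject = sprintf("ALERT: Forecasted increase of %g units", increase),
      attach_output = TRUE,
      attachments = c("demand_forecast_data.csv")
    )
} else {
  # No alert needed: skip the email for this run.
  suppress_scheduled_email()
}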

Imagine sending emails about updates to datasets and dashboards manually for a year. Or two years. Now imagine sharing R/Shiny applications (and/or Plumber APIs, Pins, R Markdown docs, etc.) as easily as you share memes on Instagram. Which scenario is more appealing? 

Security

With the deployment of a new network (a whole new ecosystem, really), security should be a primary concern. For instance, you need to think about preventing brute-force and dictionary attacks. By default, RStudio Connect allows as many login attempts as it can handle from any source when using the PAM, LDAP, and Password authentication providers. Users will be able to log in directly by entering their username and password. Setting the Authentication.ChallengeResponseEnabled flag to true enables a CAPTCHA form on the login screen and requires the CAPTCHA to be solved in order to authenticate. Both visual and audio CAPTCHA challenges are provided for accessibility needs.
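As a rough sketch, the CAPTCHA setting mentioned above would look like this in the Connect configuration file. The section and key follow directly from the flag name; the file location in the comment is an assumption based on a default installation and may differ on your server.

```ini
; Assumed default location: /etc/rstudio-connect/rstudio-connect.gcfg
[Authentication]
; Require a CAPTCHA to be solved on the login screen
ChallengeResponseEnabled = true
```

Configuration changes of this kind generally take effect only after the Connect service is restarted, so plan the change for a quiet period.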

Additionally, we recommend setting up separate instances of RStudio Connect depending on their purpose: one public instance and a second instance accessible only from the internal infrastructure. This means that you can host publicly accessible demos of Shiny dashboards while keeping your internal RStudio Connect infrastructure inaccessible to unauthorized users. This way it’s easy to show off your work to clients or provide public access without compromising on security.

Conclusion

The new reality of enforced social distancing means that many organizations with Data Science ecosystems will have to overcome new challenges involved in remote work. We’ve found that RStudio Connect has solved many of our problems with a wide array of available tools and packages. Further, when sharing your data work is as easy as a couple of clicks, you can raise the data literacy of your entire organization by increasing access to data insights. 

We encourage other Data Science teams around the world to consider reaching out to certified RStudio partners for further consultation to make sure that RStudio Connect is the right choice for you. As an RStudio Full Certified Partner, we’re well-positioned to help you make the leap!
