Where Does RStudio Fit into Your Cloud Strategy?

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers.]

Photo by Mantas Hesthaven on Unsplash

Over the last few years, more companies have begun migrating their data science work to the cloud. As they do, they naturally want to bring along their favorite data science tools, including RStudio, R, and Python. In this blog post, we discuss the various ways RStudio products can help you along that journey.

Why Do Organizations Want to Move to the Cloud?

There are many reasons why organizations are looking to use cloud services more widely for data science. They include:

  • Long delays and high startup costs for new data science teams: When you bring a new team of data scientists on board, it can be costly and time-consuming to spin up the necessary hardware for the team. New hardware might be needed for developing data science analyses or for sharing interactive Shiny applications with stakeholders. These burdens tend to fall either on the individual data scientists or on the DevOps and IT administrators who are responsible for configuring servers.
  • Obstacles to collaboration between organizations or groups: If a team is restricted to operating within their organization’s firewall, it can be very difficult to support collaboration or instruction between groups that don’t normally interact with each other. For example, running a data science workshop or statistics class can be unwieldy if everyone is working within their own separate environments.
  • High costs of computing infrastructure: Another key challenge is the potentially high costs of setting up and maintaining an organization’s computing infrastructure, including both hardware and software. These costs include the initial investments, maintenance and upgrade fees, and the related manpower costs.
  • Difficulty scaling to meet variable demand: Scaling server resources to satisfy highly variable data science demands can be very difficult because organizations rarely maintain excess capacity. For example, an organization may want to publish a news article or a COVID dashboard for which they expect high demand, only to discover that it needs the IT organization to spin up a back-end Kubernetes cluster to handle the load.
  • Excessive time and costs moving the data to the analysis: If your organization’s data is already stored on one of the major cloud providers or in a remote data center, moving that data to your laptop for analysis can be slow and expensive. Ideally, you should perform the data access, transformation, and analysis as close to where the data lives as possible. Otherwise, you may incur excessive data transfer charges.
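
The data movement point above lends itself to a quick back-of-the-envelope estimate. The sketch below is illustrative only: the data size, per-GB price, and link speed are hypothetical placeholders rather than figures from any provider's price list, and the function names are invented for this example.

```python
def transfer_cost_usd(data_gb: float, price_per_gb_usd: float) -> float:
    """Estimated charge for moving `data_gb` gigabytes out of the cloud once."""
    return data_gb * price_per_gb_usd


def transfer_hours(data_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to move `data_gb` over a link of `bandwidth_mbps` megabits/s."""
    megabits = data_gb * 8 * 1000  # decimal units: 1 GB = 8,000 megabits
    return megabits / bandwidth_mbps / 3600


# Hypothetical inputs; substitute your provider's pricing and your own link speed.
DATA_GB = 500
PRICE_PER_GB_USD = 0.09   # assumed egress price, not a quoted rate
BANDWIDTH_MBPS = 100      # assumed sustained download speed

cost = transfer_cost_usd(DATA_GB, PRICE_PER_GB_USD)
hours = transfer_hours(DATA_GB, BANDWIDTH_MBPS)
print(f"One full pull: about ${cost:.2f} and {hours:.1f} hours")
```

With these placeholder numbers, a single full download costs about $45 and takes roughly 11 hours, and the charge recurs every time the data is pulled again. Running the analysis next to the data avoids both.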

Let Your Data Science Goals Drive Your Cloud Strategy

Depending on the circumstances of your organization and what specific challenges you are trying to address, you should consider four possible options for your data science cloud strategy:

  • Hosted and Software as a Service (SaaS) offerings: A fully hosted service can minimize the cost and time required to start up a new project. However, functionality may be limited compared to on-premises offerings, and integration with your internal data and infrastructure can be challenging.
  • Deployment to a Virtual Private Cloud (VPC) provider: Deploying software on a major cloud platform such as Amazon Web Services (AWS) or Azure can provide the full flexibility and customization of on-premises software. However, setting up a virtual private cloud application often requires more management overhead to integrate with your internal systems, as well as careful monitoring of usage to avoid unexpected charges.
  • Cloud marketplace offerings: Pre-built applications offered on services such as the AWS and Azure Marketplaces make it easy to get started at a pay-as-you-go hourly cost, but require careful management to ensure the software is available and running only when needed.
  • Data science in your data lake: By embedding your data science tools into your existing data platform, your computations can run close to the data, minimize overhead, and tie easily into your data pipeline. However, this approach adds complexity and potential limitations.
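
To see why pay-as-you-go offerings reward careful management, it helps to compare an instance left running all month with one that runs only during working hours. The sketch below is a rough illustration: the hourly rate and working schedule are assumptions chosen for the example, not prices of any marketplace product.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours per year / 12


def monthly_cost_usd(hourly_rate_usd: float, hours_running: float) -> float:
    """Pay-as-you-go cost for one instance billed by the hour."""
    return hourly_rate_usd * hours_running


# Hypothetical inputs; substitute the rate of the offering you actually use.
HOURLY_RATE_USD = 1.50    # assumed combined software and instance rate
WORKING_HOURS = 8 * 22    # 8 hours a day, 22 working days a month

always_on = monthly_cost_usd(HOURLY_RATE_USD, HOURS_PER_MONTH)
working_only = monthly_cost_usd(HOURLY_RATE_USD, WORKING_HOURS)
savings = 1 - working_only / always_on

print(f"Always on:          ${always_on:,.2f} per month")
print(f"Working hours only: ${working_only:,.2f} per month")
print(f"Savings from shutting down when idle: {savings:.0%}")
```

Under these assumptions, shutting the instance down outside working hours cuts the monthly bill by roughly three quarters. The exact figure depends entirely on your rate and usage pattern.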

We’ve provided the table below to help you assess the various RStudio cloud offerings. It matches up problems and potential solutions with specific RStudio options and resources to consider. The options are arranged in order of increasing complexity of configuration and administration.

Table 1: Summary of Cloud Options for RStudio Software
Each entry below lists a problem, a potential solution, its pros and cons, and options to consider.

Problem: Simplify and reduce startup costs
Potential solution: SaaS/Hosted offering
Pros and cons:
  • Simplest and lowest cost to deploy
  • Hardware and software managed by the provider
  • Costs may be fixed, variable, or a mix of the two
  • Limited integration with your organization’s internal data and security protocols
  • May not be cost efficient for large groups
  • May have limited options for custom configuration
Options to consider:
  • Create data science analyses with RStudio Cloud
  • Share Shiny applications with shinyapps.io
  • Manage packages with RStudio Public Package Manager, a free service that provides easy installation of package binaries and access to previous package versions

Problem: Promote collaboration or instruction between organizations or groups
Potential solution: SaaS/Hosted offering
Pros and cons:
  • Same pros as above, plus the ability to easily share projects
  • Same cons as above
Options to consider:
  • Share projects or teach classes/workshops with RStudio Cloud

Problem: Mitigate high costs of computing infrastructure
Potential solution 1: Marketplace offerings
Pros and cons:
  • Easy to get started at minimal, pay-as-you-go (hourly) cost
  • Access to specialized hardware (e.g., GPUs)
  • To manage hourly costs, careful management is required to ensure software is running only when needed
Options to consider:
  • RStudio products on AWS Marketplace, Azure Marketplace, and Google Cloud Platform
Potential solution 2: Deployment to a VPC on a major cloud provider
Pros and cons:
  • Outsources hardware costs
  • Integrates with existing analytic assets on cloud platforms
  • Allows easy customization and configuration
  • Provides access to specialized hardware (e.g., GPUs)
  • Ensures data sovereignty by running your processes in a local cloud region
  • Complexity of managing software configuration and integration with your organization’s on-premises data and security protocols
  • Costs may be highly variable, based on usage
Options to consider:
  • Deploy RStudio products in a VPC, using CloudFormation templates for AWS and ARM templates for Azure (see RStudio Cloud Tools)
  • Deploy RStudio products via Docker, e.g., using EKS (Elastic Kubernetes Service) on AWS (see Docker images for RStudio Professional Products)
  • Connect to cloud-based data storage, such as Redshift or S3

Problem: Scale to meet variable demand
Potential solution: Clustering approaches, including Kubernetes
Pros and cons:
  • Cloud-deployed applications can be easily scaled to meet demand, since cloud providers supply container resources on demand
  • Careful management is required to avoid unnecessary compute costs, while still matching job requirements to computational needs
Options to consider:
  • In addition to the points above, RStudio Server Pro’s Launcher integrates with Kubernetes, an industry-standard clustering solution that allows efficient scaling
  • RStudio Connect provides many options to scale and tune performance, including being part of an autoscaling group. These options allow Connect to deliver dashboards, Shiny applications, and other types of content to large numbers of users

Problem: Minimize data movement
Potential solution: Data lakes
Pros and cons:
  • Run your computations close to the data, minimizing overhead
  • Tie your data science directly into your data pipeline
  • Adds complexity and potential limitations
Options to consider:
  • Connect to cloud-based data storage, such as Redshift or S3
  • Managed RStudio Server Pro on Spark and Hadoop on Azure and AWS (Cazena)
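
The trade-off in the "Scale to meet variable demand" entry, meeting peak load without paying for idle capacity, can be sketched as a simple sizing rule. This is a simplified illustration, not the algorithm used by Kubernetes, an autoscaling group, or any RStudio product. The function name, the users-per-replica figure, the replica limits, and the demand numbers are all hypothetical.

```python
import math


def replicas_needed(concurrent_users: int, users_per_replica: int,
                    min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Replicas required to serve the load, clamped to the allowed range."""
    if users_per_replica <= 0:
        raise ValueError("users_per_replica must be positive")
    wanted = math.ceil(concurrent_users / users_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical demand over a day for a popular dashboard.
hourly_demand = [5, 40, 180, 950, 300, 20]
for users in hourly_demand:
    count = replicas_needed(users, users_per_replica=50)
    print(f"{users:>4} users -> {count} replica(s)")
```

The lower bound keeps the application available when demand is quiet, and the upper bound caps compute spending during a spike. Choosing those two limits is where the careful management mentioned above comes in.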

Ready to Take RStudio to the Cloud?

If you’d like to take RStudio along on your journey to the cloud, you can start by exploring the resources linked in the table above. We also invite you to join us on December 2 for a webinar, “What does it mean to do data science in the cloud?”, conducted with our partner ProCogia. You can register for the webinar here.

Our product team is also happy to provide advice and guidance along this journey. If you’d like to set up a time to talk with us, you can book a time here. We look forward to being your guide.
