3 Ways to Expand Your Data Science Compute Resources

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers].

Photo by Richard Gatley on Unsplash

Data science leaders have embraced the work-from-home era created by COVID-19. Most data science teams have continued their work either using their company laptops or server-based IDEs such as RStudio Server. However, these home workers often run into the limitations of their laptops when they:

  • Run long-lived programs: Machine learning models and simulations frequently run for hours or days on laptops.
  • Demand lots of memory: Training models, parameter tuning, or working with complex datasets (such as genomic data) often require more RAM than even the most tricked-out laptop has.
  • Need specialized architectures: Some machine learning libraries perform best on GPUs or with optimized system architectures that are not available on most laptops.
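
As a rough illustration of the memory point above, the sketch below estimates the in-memory size of a dense numeric table. It assumes 8 bytes per value (a double-precision float) and ignores overhead from the data frame library and from copies made during model training, so the result is a lower bound. The function name, the example dimensions, and the 16 GiB laptop figure are hypothetical and do not come from the original post.

```python
BYTES_PER_DOUBLE = 8  # size of one double-precision float


def estimate_dense_table_gib(n_rows: int, n_cols: int) -> float:
    """Lower-bound estimate, in GiB, for a dense table of doubles."""
    return n_rows * n_cols * BYTES_PER_DOUBLE / 2**30


if __name__ == "__main__":
    # Hypothetical dataset: 50 million rows by 100 numeric columns.
    size_gib = estimate_dense_table_gib(50_000_000, 100)
    laptop_ram_gib = 16  # assumed laptop memory, for comparison only
    print(f"Estimated size: {size_gib:.1f} GiB")
    if size_gib < laptop_ram_gib:
        print("Fits in the assumed laptop RAM")
    else:
        print("Exceeds the assumed laptop RAM")
```

Real workloads usually need several times this figure, because fitting a model tends to create intermediate copies of the data.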

Embrace Server-Based Data Science Development

The key to freeing data scientists from laptop limitations is to embrace server-based development, as we noted in a prior post, Equipping Work From Home Data Science Teams. Providing data scientists with access to a server-based IDE like RStudio Server can give them more processors, cores, memory, and architecture options than would be available on their laptops. Additionally, with RStudio Server Pro, data scientists can go even further by launching interactive or batch sessions on SLURM and Kubernetes clusters.

Figure 1: 3 ways RStudio Server allows data scientists to use server resources for their jobs.

As shown in Figure 1, RStudio offers three ways for data scientists to take advantage of centralized resources and escape the limitations of their laptops:

  • Local background jobs: In any version of RStudio, data scientists can run an R script in the background. This is especially helpful in RStudio Server, where the task has access to more resources, and you don’t have to worry about shutting off the laptop or a Windows update interrupting the process.

Figure 2: Menu options allow you to run jobs in the background or using Launcher.

  • Interactive Launcher sessions on RStudio Server Pro: RStudio Server Pro adds the ability for a data scientist to start an interactive session on a Kubernetes or SLURM cluster, giving them the full power of RStudio, but with code executing in these unique and powerful environments. These interactive sessions are useful for exploratory data analysis and debugging.
  • RStudio Server Pro Launcher jobs: Finally, data scientists can execute ad-hoc, long-running scripts and programs on clusters using Launcher and let them run without any further console interaction. This approach can be particularly useful for model training, ETL jobs, and other workloads that may run for hours or days. Running these workloads in a batch-oriented mode allows the data scientist to work on other projects without being blocked waiting for results to arrive.

| | RStudio Server | Interactive Launcher Sessions on RStudio Server Pro | Launcher Jobs on RStudio Server Pro |
|---|---|---|---|
| Typical RAM | Tens to hundreds of gigabytes | Multiple terabytes | Multiple terabytes |
| Typical Processor Cores | Tens | Hundreds to thousands | Hundreds to thousands |
| Typical Jobs | Routine analyses | Interactive tasks requiring large compute, GPUs, or RAM, such as exploratory data analysis | Batch tasks like parameter tuning, ETL, or model training and scoring |
| Setup Required | RStudio Server install | RStudio Server Pro + Cluster add-in | RStudio Server Pro + Cluster add-in |
| Limitations | Server resources | Best for interactive work, not parallel tasks | Jobs kicked off manually, limited job feedback |

Figure 3: Three Ways to Expand Data Science Computational Resources Using RStudio Server Pro and Launcher.
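
The first option above, running a script in the background so the console stays free, follows a general pattern that can be sketched outside RStudio. The example below is not RStudio's Jobs or Launcher API. It is a minimal Python illustration that uses only the standard library's subprocess module, and the helper name start_background_script and the file names are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def start_background_script(script_path: Path, log_path: Path) -> subprocess.Popen:
    """Launch a script in a separate process and return without waiting."""
    with open(log_path, "w") as log_file:
        # The child process receives its own handle to the log file, so the
        # parent can close its copy as soon as Popen returns.
        return subprocess.Popen(
            [sys.executable, str(script_path)],
            stdout=log_file,
            stderr=subprocess.STDOUT,
        )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        script = Path(work_dir) / "long_job.py"
        log = Path(work_dir) / "long_job.log"
        # Stand-in for a long-running model fit or simulation.
        script.write_text("import time\ntime.sleep(1)\nprint('job finished')\n")

        job = start_background_script(script, log)
        print("Job started; the session is free for other work.")

        # poll() returns None while the job is still running.
        print("Still running:", job.poll() is None)

        job.wait()  # block only when the results are actually needed
        print("Exit code:", job.returncode)
        print("Log contents:", log.read_text().strip())
```

Blocking only at job.wait() mirrors the batch-oriented mode described above: other work continues while the job runs, and results are collected when they are needed.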

Central Servers Improve Data Scientist Productivity

Using RStudio Server and RStudio Server Pro for analysis benefits data scientists because these servers:

  • Unblock the data scientist from waiting for long-lived jobs: Instead of going out for a cup of coffee while waiting to fit their model to a large training set, data scientists can run the model fitting in the background and work on other code while waiting for it to complete.
  • Free the data scientist from having to shoehorn their analysis onto a small platform: Laptop memory and processor limitations often force data scientists to sample their data or recode their models to run in a smaller footprint. With access to servers that have many times the resources of their laptops, data scientists can use their full data sets to fit complete models.
  • Allow data scientists more flexibility and make IT happy: Data scientists can use more flexible resources and server architectures, such as GPUs. Server-based development also benefits IT professionals, who see expanded use of the platforms they’ve built and reduced costs through elastic compute.

For More Information About Background and Cluster Jobs

To learn more about the new Launcher capabilities built into RStudio:

If you’d like to try out RStudio Server Pro for your team, you can learn how to download an evaluation copy from the RStudio Server Pro product page.
