Building Our Own Open Source Supercomputer with R and AWS

February 3, 2019
By

(This article was first published on Laurens Geffert, and kindly contributed to R-bloggers)

How to build a scaleable computing cluster on AWS and run hundreds or
thousands of models in a short amount of time. We will completely rely on R and
open source R packages. This is post 1 out of 2.

Introduction

An ever-increasing number of businesses is moving to the cloud and using
platforms such as Amazon Web Services(AWS)
for their data infrastructure. This is convenient for
Data Scientists like myself because this conversion of tools means that my
knowledge from previous jobs becomes much more applicable to a new role and
I can hit the ground running.

Lately I have become very excited about the
future
package and how it makes the scaling of computational tasks easy and intuitive.
The basic idea of the future package is to make your code infrastructure
independent. Specify your tasks and the future execution plan decides
how to run the calculations.

I wanted to see what we could do with future and other open source R packages
such as aws.ec2 by
cloudyR,
ssh
by rOpenSci,
remoter
by Drew Schmidt, and last but not least furrr
by Davis Vaughan.

The basic idea:

  • use R and AWS to spin up our own cloud compute cluster
  • log in to the head node and define a computationally expensive task
  • farm this task out to a number of worker nodes in our cluster
  • do all of this WITHOUT having to switch between RStudio, RStudioServer, the command line, the AWS console, etc.

Why do I care about the last point? Well, Data Science is a science and should
rely on the Scientific Method.
One core component of the Scientific Method is reproducibility, and one of the
best ways to keep your Data Science workflow reproducible is to write code that
can run start to finish without any user intervention. This also allows for
greater applicability in the future because you can re-use your previous data
product or service in the next project without retracing manual steps.
Don’t just take my word for it, here is another great Hadley
Wickham video in which he stresses the same point:

So without further ado, let’s get started implementing that bullet point list!

Preparation

There are a few basic requirements that need to be in place:

  1. an active AWS account.
  2. an Amazon Machine Image (AMI)
    with R, remoter, tidyverse, future, and furrr installed.
  3. a working ssh key pair on your local machine and the AMI that allows you
    to ssh into and between your ec2 instances.

Detailed instructions on how to fulfil these basic requirements are beyond the
scope of this post. You can find more information in the articles linked below.

Setup

Load the required packages. Also make sure your AWS access credentials are set.
I do this using Sys.setenv. There is other ways but I found that this works
best for me. We also specify the AMI ID and the instance type (this is a good
overview; I am using t2.micro
here because it is free). If you have any problems with this step, double-
check that the region set in Sys.setenv matches the region of your AMI.

library(aws.ec2)
library(ssh)
library(remoter)
library(tidyverse)

# set access credentials
aws_access <- aws.signature::locate_credentials()
Sys.setenv(
  "AWS_ACCESS_KEY_ID" = aws_access$key,
  "AWS_SECRET_ACCESS_KEY" = aws_access$secret,
  "AWS_DEFAULT_REGION" = aws_access$region
)

# set parameters
aws_ami <- "ami-06485bfe40a86470d"
aws_describe <- describe_images(aws_ami)
aws_type <- "t2.micro"

Ready for launch!

Boot and Connect

We can now fire up our head-node instance.

ec2inst <- run_instances(
  image = aws_ami,
  type = aws_type)

# wait for boot, then refresh description
Sys.sleep(10)
ec2inst <- describe_instances(ec2inst)

# get IP address of the instance
ec2inst_ip <- get_instance_public_ip(ec2inst)

The instance should be running and we can connect to it via ssh in bash.
That works, but personally I’d prefer to stay in RStudio instead of switching
to the command line. This is where remoter and ssh come in. We can
establish an ssh connection straight from our R session and use that to launch
the remoter::server on our instance. By using the future package to run the
ssh command we keep our interactive RStudio session free and can subsequently
use it to connect to the instance with remoter

# ssh connection
username <- system("whoami", intern = TRUE)
con <- ssh_connect(host = paste(username, ec2ip, sep = "@"))

# helper function for a random temporary password
random_tmp_password <- generate_password()
# CMD string to start remoter::server on instance
r_cmd_start_remoter <- str_c(
  "sudo Rscript -e ",
  "'remoter::server(",
  "port = 55555, ",
  "password = %pwd, ",
  "showmsg = TRUE)'",
  collapse = "") %>%
  str_replace("%pwd", str_c('"', random_tmp_password, '"'))

# connect and execute
plan(multicore)
x <- future(
  ssh_exec_wait(
    session = con,
    command = r_cmd_start_remoter))

remoter::client(
  addr = ec2ip,
  port = 55555,
  password = random_tmp_password,
  prompt = "remote")

Et Voila! We are connected to our remote head node and can run R code
in the cloud without ever leaving the comfort of RStudio. And the amazing bit:
all of this took me about a day to set up from scratch!

I will leave it here for now. In the next post we will dive into the
details of how to scale up the approach above to create an AWS cloud computing
cluster. This approach is extremely powerful for embarrassingly parallel
problems (which are actually not embarrassing at all, I swear!)

As always, I hope it is useful for you. I’d very much appreciate any thoughts,
comments, and feedback so write me a message below or get in touch via twitter!

To leave a comment for the author, please follow the link and comment on their blog: Laurens Geffert.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)