How to build a scaleable computing cluster on AWS and run hundreds or thousands of models in a short amount of time. We will completely rely on R and open source R packages. This is post 1 out of 2.
An ever-increasing number of businesses is moving to the cloud and using platforms such as Amazon Web Services(AWS) for their data infrastructure. This is convenient for Data Scientists like myself because this conversion of tools means that my knowledge from previous jobs becomes much more applicable to a new role and I can hit the ground running.
Lately I have become very excited about the
package and how it makes the scaling of computational tasks easy and intuitive.
The basic idea of the future package is to make your code infrastructure
independent. Specify your tasks and the
future execution plan decides
how to run the calculations.
The basic idea:
- use R and AWS to spin up our own cloud compute cluster
- log in to the head node and define a computationally expensive task
- farm this task out to a number of worker nodes in our cluster
- do all of this WITHOUT having to switch between RStudio, RStudioServer, the command line, the AWS console, etc.
Why do I care about the last point? Well, Data Science is a science and should rely on the Scientific Method. One core component of the Scientific Method is reproducibility, and one of the best ways to keep your Data Science workflow reproducible is to write code that can run start to finish without any user intervention. This also allows for greater applicability in the future because you can re-use your previous data product or service in the next project without retracing manual steps. Don’t just take my word for it, here is another great Hadley Wickham video in which he stresses the same point:
So without further ado, let’s get started implementing that bullet point list!
There are a few basic requirements that need to be in place:
- an active AWS account.
- an Amazon Machine Image (
- a working
sshkey pair on your local machine and the
AMIthat allows you to ssh into and between your
Detailed instructions on how to fulfil these basic requirements are beyond the scope of this post. You can find more information in the articles linked below.
Load the required packages. Also make sure your AWS access credentials are set.
I do this using
Sys.setenv. There is other ways but I found that this works
best for me. We also specify the
AMI ID and the instance type (this is a good
overview; I am using
here because it is free). If you have any problems with this step, double-
check that the region set in
Sys.setenv matches the region of your AMI.
library(aws.ec2) library(ssh) library(remoter) library(tidyverse) # set access credentials aws_access <- aws.signature::locate_credentials() Sys.setenv( "AWS_ACCESS_KEY_ID" = aws_access$key, "AWS_SECRET_ACCESS_KEY" = aws_access$secret, "AWS_DEFAULT_REGION" = aws_access$region ) # set parameters aws_ami <- "ami-06485bfe40a86470d" aws_describe <- describe_images(aws_ami) aws_type <- "t2.micro"
Ready for launch!
Boot and Connect
We can now fire up our head-node instance.
ec2inst <- run_instances( image = aws_ami, type = aws_type) # wait for boot, then refresh description Sys.sleep(10) ec2inst <- describe_instances(ec2inst) # get IP address of the instance ec2inst_ip <- get_instance_public_ip(ec2inst)
The instance should be running and we can connect to it via ssh in bash.
That works, but personally I’d prefer to stay in RStudio instead of switching
to the command line. This is where
ssh come in. We can
establish an ssh connection straight from our R session and use that to launch
remoter::server on our instance. By using the future package to run the
ssh command we keep our interactive RStudio session free and can subsequently
use it to connect to the instance with
# ssh connection username <- system("whoami", intern = TRUE) con <- ssh_connect(host = paste(username, ec2ip, sep = "@")) # helper function for a random temporary password random_tmp_password <- generate_password() # CMD string to start remoter::server on instance r_cmd_start_remoter <- str_c( "sudo Rscript -e ", "'remoter::server(", "port = 55555, ", "password = %pwd, ", "showmsg = TRUE)'", collapse = "") %>% str_replace("%pwd", str_c('"', random_tmp_password, '"')) # connect and execute plan(multicore) x <- future( ssh_exec_wait( session = con, command = r_cmd_start_remoter)) remoter::client( addr = ec2ip, port = 55555, password = random_tmp_password, prompt = "remote")
Et Voila! We are connected to our remote head node and can run R code in the cloud without ever leaving the comfort of RStudio. And the amazing bit: all of this took me about a day to set up from scratch!
I will leave it here for now. In the next post we will dive into the details of how to scale up the approach above to create an AWS cloud computing cluster. This approach is extremely powerful for embarrassingly parallel problems (which are actually not embarrassing at all, I swear!)
As always, I hope it is useful for you. I’d very much appreciate any thoughts, comments, and feedback so write me a message below or get in touch via twitter!