Setting up a Scalable RStudio Instance in AWS

(This article was first published on R on Sastibe's Data Science Blog, and kindly contributed to R-bloggers)

Assume you want to start to write R code (a very good decision, in my opinion) and you want to be able to write and test code whereever you are. Wouldn’t it be awesome if one could set up an environment that can be used for R coding independent of any device? Where all you need is a decent browser, a working internet connection and you’re good to go?

Obviously, that is the case. In this post, I will show you the steps for setting up such an environment on Amazon Web Services (AWS). The main advantages of using such a set-up:

  • Runs on any infrastructure: All you need is a working internet connection, a decent browser and an AWS account, which is usually1 free.
  • Runs everywhere: The AWS machine will be set up to automatically clone your GitHub repository (don’t worry if this doesn’t mean anything to you, this point is optional), so that you don’t even have to have your codes on the device.
  • Scalable: The AWS machine running your code can be chosen to suit any of your needs, in any session. Just playing around with a new package? Use the smallest size, doesn’t cost a dime. Trying to re-create state-of-the-art machine learning performance with a fancy DNN-classifier? Go all in with 500 GB of RAM; it’ll cost ya, but it’s fun.
  • Up-to-Date: Since the envirionment is freshly installed each time, your R version as well as the package versions in use are automatically up-to-date. In the latter case, that would also be easy to maintain on a local machine, the former, however, is a nice benefit.

Convinced? Awesome, let’s get started!

Overview of main steps

First a short overview of the main steps covered in this blog post:

  1. Get an AWS account (duh!),
  2. Configure your RStudio AMI,
    1. Find the right RStudio AMI,
    2. Configure Security Groups,
    3. Automatically Change your RStudio Password,
    4. Incorporate a clone of your GitHub repo,
  3. Start your First RStudio instance (and bask in its glory),
  4. Create a personal AMI for future convenience,
  5. Shut down the Instance and all Resources.

Preconditions for this tutorial should be basically none, at least in terms of coding and/or understanding R itself. The main task will lie in clicking the right buttons.

Step 1: Get an AWS account.

Well, it isn’t really my place to tell you how to get an AWS account if Amazon itself did such a great job explaining it. Just use the link to set up your account, and I further suggest to follow this set of instructions, building your very first instance. Take your time going through these instructions, I’ll wait…

Ready? Alright, sweet. Then we continue with

Step 2: Configure your RStudio AMI.

In this step, I collected several steps, not all of which are necessary. Steps 2a and 2b are crucial, Step 2c is recommended. Step 2d can be skipped on the first set-up. The implementation of this step can always be re-assessed whenever it becomes necessary.

Let’s begin by starting an instance in the AWS Dashboard. Just open “Instances” on the side menu of your EC2 Dashboard and click on “Launch Instance”:

Here we go!

Here we go!

Step 2a: Find the current RStudio AMI.

The first task is to choose an Amazon Machine Image, or AMI, which is essentially an operating system container. More to the point, in an AMI a Linux distribution can be bundled with addtional software packages tailored to any type of need: web development, accounting (I’m guessing here, but … sure) and, most importantly, using RStudio. On Louis Anslett’s homepage you can find a wonderful storage of RStudio AMIs. We use the newest version for the correct geographical zone, in my case Frankfurt:

One AMI for each region. Neat

One AMI for each region. Neat

As you can see, thanks to Louis Anslett’s work, the AMI includes not only the newest version of RStudio but also of R itself as well as a handful of helpful additional software packages. For instance, Git comes pre-installed, which we will use later on; also Juliais installed for those looking to try out the possible future of data science languages. But I’m deviating… Let’s note the AMI-ID (in our case “ami-a80db3c7”), put this in the start-up options and let’s continue.

Step 2b: Configure the security groups for your RStudio instance

In AWS, security groups control the access to the machine over the internet (if you don‘t care about how exactly this works and only want to follow the instructions, just skip the next sentences). More precisely, they define which kind of protocols may use which ports on your machine from a given IP range. For example, you can set the access rights for a ssh protocol to be able to connect to your machine on port 22 only from your personal IP address at home.

In our case, we actually only need access via http protocol, since the RStudio instance will allow log-in via browser interface. Therefore, our security group can be kept quite simple:

The bottom option allows the whole world to see the instance. Golly.

The bottom option allows the whole world to see the instance. Golly.

The IP range can be limited to your own personal IP to ensure the safety of your instance. This precaution could be necessary since only the login page of RStudio stands between the internet and your instance (spooky, huh?). However, since the personal IP usually changes each day (roughly speaking), this becomes a personal question of “privacy vs. convenience”. In my case, as you can see, convenience won.

2c. Automatically Change your RStudio Password

In the documentation of the RStudio AMI we can find the following passage: “It is highly recommended you change the password immediately and an easy means of doing this is explained upon login in the script that is loaded there”. Alright, fine, but I’d rather to that programmatically, i.e. automatically. The weirdly named “User data” option provides just the framework: All commands placed here get executed at the beginning of the start-up. You can find this setting in the menu “Configure Instance Details” under “Advanced Details”.

In order to change the password of the user “RStudio” on start-up, we paste the following code:

#!/bin/bash
echo "rstudio:guest" | chpasswd

where you should replace the password “guest” with whatever you deem appropriate. We are almost done with the set-up now, there only remains

Step 2d (optional): Automatically Clone a GitHub repo

I write all my private code projects on my GitHub account (here: https://github.com/sebastianschweer. What a shameless self-plug!) and I also would like my code to be available for me each time I start up my RStudio instance. Fortunately, this is easily configured with “User data” again, by just adding the command

git clone https://github.com/sebastianschweer/sastibe.git /home/rstudio/sastibe
chmod -R 777 /home/rstudio/sastibe

to the “User data” of Step 2c. Now, when I start up the new RStudio instance, the repository sastibe gets cloned into the folder /home/rstudio/sastibe, which is automatically loaded in RStudio. The line with chmod ensures that any user (not just root, who is executing this command at startup) has the rights to alter content in that folder. This permission allows me to change code and pushing my changes to the repository and all that, which is just super convenient.

Step 3: Start your First RStudio instance (and bask in its glory),

The last and most exciting click is this one:

Hooray, all the hard work pays off!

Hooray, all the hard work pays off!

We have now started the instance. This means that a virtual machine, configured according to our specifitcations is being run on one of Amazon’s bajillion2 cloud computing servers. In the menu “Instances” we now see an active instance running. After we are done, we will use this menu to shut it down again (so that it doesn#t cost us), but not now: we are eager to test it out! Accessing the instance is quite easy in our case: Just copy the “IPv4 Public IP” adress and paste it in your browser:

Green lights indicate the instance runs harmoniously.

Green lights indicate the instance runs harmoniously.

Hopefully, you haven’t forgotten your password (check Step 2c if you did), your username is “rstudio”. After succesful login, you’ll be greeted by this screen:

The login to a world of wonder.

The login to a world of wonder.

Et voilà: Your very own scalable RStudio instance, accessible world-wide and ready to use at all times. In other words: Congratulations, you now have a state-of-the-art Data Science Machine at your command. Use it wisely. If you want to see what kind of wonders you can do with this setup, check out the upcoming blog post. Otherwise, let me just point you towards another wonderful introduction.

Step 4: Create a personal AMI for future convenience

Now, Step 3 consisted of 4 different steps, and it would be ratehr inconvenient to have to repeat these steps each time you need a new RStudio instance, right? Luckily, AWS has got you covered: You can create an “image” of any AWS instance: simply put, this saves your current configuration for later use. The creation of such an image is straightforward: Just go to “Instances” in your AWS Dashboard, right-click on the machine you want to base the image on and select “Create Image”:

Locate Create Image in the menu of Instance Settings

Locate “Create Image” in the menu of “Instance Settings”

After this step, you will find the created image in the menu AMIs, ready to reuse. Before you go do crazy and wonderful Data Science in your wonderful new Environment, though, it is essential that you let me tell you about

The Last Step (After Each AWS Usage): Shutting Down

An AWS instance doesn’t shut down by itself, or go into hibernation or anything like that. It just keeps running unless otherwise specified, eventually costing lots of money (even the free tier services have their prices after some limit). So, let me show you how to shut down your brand new machine. It’s quite simple, just right-click on the running instance and set the “Instance State” to “Terminate”.

Show no mercy, terminate!

Show no mercy, terminate!

Since our instance also automatically loaded an EBS volume (like a hard disk to save data), we need to shut that down too. Choose the entry EBS volumes in the sidepane and Detach all volumes that are active. If your overview in the pane “Dashboard” looks similar to this :

5 volumes: make sure that they are not in-use, since storage may also cost after some initial period

“5 volumes”: make sure that they are not in-use, since storage may also cost after some initial period

There are no hidden services running racking up costs.

Summary

After configuring your AWS environment as decried above, your new ‚Data Science Workflow‘ can look like this:

  1. Log in to AWS,
  2. Choose your personal RStudio AMI,
  3. Choose the Necessary Specifications of the Machine,
  4. Log in to the Machine in the Browser,
  5. Do Awesome Data Science,
  6. Shut Down Machine and all Resources.

Have fun, and remember: Primere non nocere!


  1. For a given value of usually. I personally try to test out lots of resources just because I can, yet even so, my total expenses for AWS result in 0.37€ (January 2018).

  2. A rough estimate. Maybe only bajillions.

To leave a comment for the author, please follow the link and comment on their blog: R on Sastibe's Data Science Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)