Assume you want to start writing R code (a very good decision, in my opinion) and you want to be able to write and test code wherever you are. Wouldn’t it be awesome if one could set up an environment that can be used for R coding independent of any device? Where all you need is a decent browser, a working internet connection and you’re good to go?
Obviously, that is the case. In this post, I will show you the steps for setting up such an environment on Amazon Web Services (AWS). The main advantages of using such a set-up:
- Runs on any infrastructure: All you need is a working internet connection, a decent browser and an AWS account, which is usually free.
- Runs everywhere: The AWS machine will be set up to automatically clone your GitHub repository (don’t worry if this doesn’t mean anything to you, this point is optional), so that you don’t even have to have your code on the device.
- Scalable: The AWS machine running your code can be chosen to suit any of your needs, in any session. Just playing around with a new package? Use the smallest size, doesn’t cost a dime. Trying to re-create state-of-the-art machine learning performance with a fancy DNN-classifier? Go all in with 500 GB of RAM; it’ll cost ya, but it’s fun.
- Up-to-Date: Since the environment is freshly installed each time, your R version as well as the package versions in use are automatically up-to-date. The latter would also be easy to maintain on a local machine; the former, however, is a nice benefit.
Convinced? Awesome, let’s get started!
Overview of main steps
First a short overview of the main steps covered in this blog post:
- Get an AWS account (duh!),
- Configure your RStudio AMI:
  - Find the right RStudio AMI,
  - Configure Security Groups,
  - Automatically Change your RStudio Password,
  - Incorporate a clone of your GitHub repo,
- Start your First RStudio instance (and bask in its glory),
- Create a personal AMI for future convenience,
- Shut down the Instance and all Resources.
There are basically no prerequisites for this tutorial, at least in terms of coding and/or understanding R itself. The main task will lie in clicking the right buttons.
Step 1: Get an AWS account.
Well, it isn’t really my place to tell you how to get an AWS account if Amazon itself did such a great job explaining it. Just use the link to set up your account, and I further suggest following this set of instructions to build your very first instance. Take your time going through these instructions, I’ll wait…
Ready? Alright, sweet. Then we continue with
Step 2: Configure your RStudio AMI.
In this step, I collected several steps, not all of which are necessary. Steps 2a and 2b are crucial, Step 2c is recommended. Step 2d can be skipped on the first set-up. The implementation of this step can always be re-assessed whenever it becomes necessary.
Let’s begin by starting an instance in the AWS Dashboard. Just open “Instances” on the side menu of your EC2 Dashboard and click on “Launch Instance”:
Step 2a: Find the current RStudio AMI.
The first task is to choose an Amazon Machine Image, or AMI, which is essentially an operating system container. More to the point, in an AMI a Linux distribution can be bundled with additional software packages tailored to any type of need: web development, accounting (I’m guessing here, but … sure) and, most importantly, using RStudio. On Louis Aslett’s homepage you can find a wonderful collection of RStudio AMIs. We use the newest version for the correct geographical zone, in my case Frankfurt:
As you can see, thanks to Louis Aslett’s work, the AMI includes not only the newest version of RStudio but also of R itself as well as a handful of helpful additional software packages. For instance, Git comes pre-installed, which we will use later on; also Julia is installed for those looking to try out the possible future of data science languages. But I digress… Let’s note the AMI-ID (in our case “ami-a80db3c7”), put this in the start-up options and let’s continue.
Step 2b: Configure the security groups for your RStudio instance
In AWS, security groups control the access to the machine over the internet (if you don’t care about how exactly this works and only want to follow the instructions, just skip the next sentences). More precisely, they define which protocols may use which ports on your machine from a given IP range. For example, you can restrict SSH access so that connections to your machine on port 22 are only possible from your personal IP address at home.
In our case, we actually only need access via the HTTP protocol, since the RStudio instance will allow log-in via browser interface. Therefore, our security group can be kept quite simple:
The IP range can be limited to your own personal IP to ensure the safety of your instance. This precaution could be necessary since only the login page of RStudio stands between the internet and your instance (spooky, huh?). However, since the personal IP usually changes each day (roughly speaking), this becomes a personal question of “privacy vs. convenience”. In my case, as you can see, convenience won.
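For those who would rather script the stricter variant, the rule described above (HTTP on port 80, allowed from exactly one address) can be sketched with the AWS CLI. This is only a sketch under assumptions: `http_ingress_cmd` is a hypothetical helper, the security group ID and the IP address are placeholders, and the helper merely prints the command instead of running it, so nothing in your account changes until you execute the printed line yourself.

```shell
#!/bin/bash
# Sketch: build (but do not run) an AWS CLI call that opens HTTP (port 80)
# to a single IPv4 address. http_ingress_cmd is a hypothetical helper name.
http_ingress_cmd() {
  local group_id="$1" ip="$2"
  # The /32 suffix restricts the rule to exactly one address
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port 80 --cidr %s/32\n' \
    "$group_id" "$ip"
}

# Placeholders: sg-0123456789abcdef0 is not a real group, and 203.0.113.7
# is an address reserved for documentation. Your current public IP can be
# looked up with: curl -s https://checkip.amazonaws.com
http_ingress_cmd "sg-0123456789abcdef0" "203.0.113.7"
```

Since the personal IP changes regularly, such a rule would have to be refreshed before each session, which is exactly the convenience trade-off mentioned above.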
Step 2c: Automatically Change your RStudio Password
In the documentation of the RStudio AMI we can find the following passage: “It is highly recommended you change the password immediately and an easy means of doing this is explained upon login in the script that is loaded there”. Alright, fine, but I’d rather do that programmatically, i.e. automatically. The weirdly named “User data” option provides just the framework: All commands placed here get executed at the beginning of the start-up. You can find this setting in the menu “Configure Instance Details” under “Advanced Details”.
In order to change the password of the user “rstudio” on start-up, we paste the following code:
#!/bin/bash
echo "rstudio:guest" | chpasswd
where you should replace the password “guest” with whatever you deem appropriate. We are almost done with the set-up now; there only remains
Step 2d (optional): Automatically Clone a GitHub repo
I write all my private code projects on my GitHub account (here: https://github.com/sebastianschweer. What a shameless self-plug!) and I also would like my code to be available for me each time I start up my RStudio instance. Fortunately, this is easily configured with “User data” again, by just adding the command
git clone https://github.com/sebastianschweer/sastibe.git /home/rstudio/sastibe
chmod -R 777 /home/rstudio/sastibe
to the “User data” of Step 2c. Now, when I start up the new RStudio instance, the repository sastibe gets cloned into the folder /home/rstudio/sastibe, which is automatically loaded in RStudio. The line with chmod ensures that any user (not just root, who is executing this command at startup) has the rights to alter content in that folder. This permission allows me to change code and push my changes to the repository and all that, which is just super convenient.
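Put together, the commands from Steps 2c and 2d form one “User data” script. The following is a minimal sketch of how such a script could be assembled and checked locally before pasting it into the console. `make_user_data` is a hypothetical helper of my own naming, not something provided by AWS or the AMI, and it assumes the password contains no quotes, dollar signs or backticks, since those would need escaping in the generated script.

```shell
#!/bin/bash
# Sketch: assemble the user-data script from Steps 2c and 2d.
# make_user_data is a hypothetical helper, not part of AWS or the AMI.
# Assumption: the password contains no quotes, $ or backticks.
make_user_data() {
  local password="$1" repo_url="$2" target="$3"
  cat <<EOF
#!/bin/bash
# Step 2c: set the password of the user "rstudio"
echo "rstudio:${password}" | chpasswd
# Step 2d: clone the repository and make it writable for all users
git clone ${repo_url} ${target}
chmod -R 777 ${target}
EOF
}

# Print the script with the values used in this post
make_user_data "guest" \
  "https://github.com/sebastianschweer/sastibe.git" \
  "/home/rstudio/sastibe"

# Parse the generated script without executing it (syntax check only)
make_user_data "guest" \
  "https://github.com/sebastianschweer/sastibe.git" \
  "/home/rstudio/sastibe" | bash -n
```

The printed text is what goes into the “User data” field; the final `bash -n` call only verifies that the script parses and changes nothing on your machine.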
Step 3: Start your First RStudio instance (and bask in its glory)
The last and most exciting click is this one:
We have now started the instance. This means that a virtual machine, configured according to our specifications, is being run on one of Amazon’s bajillion cloud computing servers. In the menu “Instances” we now see an active instance running. After we are done, we will use this menu to shut it down again (so that it doesn’t cost us), but not now: we are eager to test it out! Accessing the instance is quite easy in our case: Just copy the “IPv4 Public IP” address and paste it in your browser:
Hopefully, you haven’t forgotten your password (check Step 2c if you did); your username is “rstudio”. After successful login, you’ll be greeted by this screen:
Et voilà: Your very own scalable RStudio instance, accessible world-wide and ready to use at all times. In other words: Congratulations, you now have a state-of-the-art Data Science Machine at your command. Use it wisely. If you want to see what kind of wonders you can do with this setup, check out the upcoming blog post. Otherwise, let me just point you towards another wonderful introduction.
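For the curious: the console clicks from Steps 2 and 3 correspond roughly to a single AWS CLI call. The sketch below only prints that call. `launch_cmd` is a hypothetical helper, the instance type `t2.micro` and the security group ID are assumptions for illustration, `user-data.sh` stands for a local file holding the script from Step 2c, and the region `eu-central-1` is the Frankfurt region that the AMI ID in this post belongs to.

```shell
#!/bin/bash
# Sketch: print (not run) an AWS CLI call that roughly mirrors the
# console steps. launch_cmd is a hypothetical helper name.
launch_cmd() {
  local region="$1" ami_id="$2" instance_type="$3" group_id="$4" user_data_file="$5"
  printf 'aws ec2 run-instances --region %s --image-id %s --instance-type %s --security-group-ids %s --user-data file://%s --count 1\n' \
    "$region" "$ami_id" "$instance_type" "$group_id" "$user_data_file"
}

# AMI ID from Step 2a; instance type, group ID and file name are
# placeholders chosen for this example.
launch_cmd "eu-central-1" "ami-a80db3c7" "t2.micro" \
  "sg-0123456789abcdef0" "user-data.sh"
```

AMI IDs are specific to a region, so the ID and the region always have to be changed together.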
Step 4: Create a personal AMI for future convenience
Now, Step 2 consisted of four different sub-steps, and it would be rather inconvenient to have to repeat them each time you need a new RStudio instance, right? Luckily, AWS has got you covered: You can create an “image” of any AWS instance: simply put, this saves your current configuration for later use. The creation of such an image is straightforward: Just go to “Instances” in your AWS Dashboard, right-click on the machine you want to base the image on and select “Create Image”:
After this step, you will find the created image in the menu AMIs, ready to reuse. Before you go do crazy and wonderful Data Science in your wonderful new Environment, though, it is essential that you let me tell you about
The Last Step (After Each AWS Usage): Shutting Down
An AWS instance doesn’t shut down by itself, or go into hibernation or anything like that. It just keeps running unless otherwise specified, eventually costing lots of money (even the free tier services have their prices after some limit). So, let me show you how to shut down your brand new machine. It’s quite simple, just right-click on the running instance and set the “Instance State” to “Terminate”.
Since our instance also automatically loaded an EBS volume (like a hard disk to save data), we need to take care of that too. Choose the entry EBS volumes in the side pane, then detach and delete all volumes that are still listed; a volume that is merely detached continues to be billed. If your overview in the pane “Dashboard” looks similar to this:
then there are no hidden services running and racking up costs.
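The same clean-up can be sketched with the AWS CLI. The script below is a dry run by default: it prints the commands rather than executing them, and only runs them for real if you set `DRY_RUN=0` and have the AWS CLI configured. `run` and `shut_down` are hypothetical helper names, and the instance ID is a placeholder.

```shell
#!/bin/bash
# Sketch: terminate an instance and list left-over volumes.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

# run is a hypothetical wrapper: print in dry-run mode, execute otherwise
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

shut_down() {
  local instance_id="$1"
  # Terminate the instance itself
  run aws ec2 terminate-instances --instance-ids "$instance_id"
  # List volumes that are no longer attached to any instance
  run aws ec2 describe-volumes --filters Name=status,Values=available \
    --query 'Volumes[].VolumeId' --output text
}

# i-0123456789abcdef0 is a placeholder instance ID
shut_down "i-0123456789abcdef0"
```

Any volume ID reported by the second command can then be removed with `aws ec2 delete-volume --volume-id <id>`, which also deletes the data stored on it.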
After configuring your AWS environment as described above, your new “Data Science Workflow” can look like this:
- Log in to AWS,
- Choose your personal RStudio AMI,
- Choose the Necessary Specifications of the Machine,
- Log in to the Machine in the Browser,
- Do Awesome Data Science,
- Shut Down Machine and all Resources.
Have fun, and remember: Primum non nocere!