Automating R Scripts on Amazon EC2

[This article was first published on Travis Nelson's Blog » R, and kindly contributed to R-bloggers.]

Overview:

  • How to set up R on an EC2 instance of Ubuntu 11.04 (Natty Narwhal)
  • How to set up the Apache Tomcat 6.0 web server and configure it with basic authentication so that we can view our output from R on a password-protected webpage
  • How to automate your R scripts to run as a daily cron job

Lately, my new hobby of algorithmic stock trading has necessitated running nightly R scripts which take about an hour to complete.  Most of this time is spent on single-threaded web scraping, which I could run in parallel to speed up the process, but this might flood those websites with requests and have them block my IP address from getting all that free data.  I’m also hesitant to run something on my home PC for fear of random Windows updates which could interrupt my program.  Another problem is random loss of connection from Comcast, so the running-at-home option was out.  I decided to turn to the cloud!

I’ve been playing around a lot with Amazon EC2 lately, and have been really happy with how powerful it is at such a reasonable cost.  It’s also nice to play around with the micro instances, since Amazon has the AWS free usage tier where you can have up to 750 free hours of usage per month.  I decided that a micro instance would be the best place to run my nightly R jobs.  Since you can do pretty much anything with your instances, I thought it would be good to put a simple web server on there so I could view my results from anywhere.  This also allows me to put some basic authentication in front of my files, so as not to give away my quant strategies.

Creating the EC2 instance

Log in to the AWS management console.  I started by launching an instance of a community AMI, ami-1aad5273, a 64-bit Ubuntu 11.04 Natty EBS boot image.

For the Instance Details, you should choose an instance type of Micro, unless you have a need for it to be higher.  This will also keep you eligible for the AWS free usage tier.

For Advanced Instance Options, keep the defaults and click continue.

Next, enter in a descriptive name for this instance.

Enter in a name for your Key Pair. Press ‘Create & Download your Key Pair’.  Remember where you save this, as you will need it later.

Enter a Security Group with ports 22 and 8080 open.  You will need port 22 for SSH access to the VM and port 8080 for web access to Apache Tomcat.

Then press ‘Launch’ on the Review page.  Your instance will be ready to use in seconds.

Logging on to your EC2 instance

Logging on to your EC2 instance can be done in several different ways, and the steps differ quite a bit between environments.  Since I did this from a Windows machine, I posted instructions for how to SSH to your Amazon EC2 instance using a free tool called PuTTY.  For the AMI we are using, the login will be ‘ubuntu’.

Installing R

Update your apt-get package list so you get the latest stable version for your OS.

sudo apt-get update

Install R using apt-get.  I like using the -y argument since it does not prompt you if you are sure you want to install.  At the time of writing this, apt-get was using R version 2.12.1 (2010-12-16).

sudo apt-get -y install r-base

At this point you need to see if R runs correctly by typing ‘R’ at the prompt.

To exit from R on the Linux command line, press Ctrl+D or type q().

Here is a good link about R on Ubuntu from UCSB if you would like more information.

Installing and configuring Apache Tomcat

Install Tomcat using apt-get, again with the -y argument to skip the confirmation prompt.  At the time of writing this, apt-get installed Tomcat 6.0.28-10.

sudo apt-get -y install tomcat6

Now you need to configure Tomcat to allow browsing of the directories on the server.  First edit the default servlet in the /etc/tomcat6/web.xml file so that ‘listings’ is ‘true’.

sudo vi /etc/tomcat6/web.xml
The default servlet is nested in the web.xml file at about line 104:

    <servlet>
        <servlet-name>default</servlet-name>
        <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
        <init-param>
            <param-name>debug</param-name>
            <param-value>0</param-value>
        </init-param>
        <init-param>
            <param-name>listings</param-name>
            <param-value>true</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>
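After saving the edit, a quick check from the shell can confirm that the listings parameter really is set to true.  This is only a minimal sketch: the function name listings_enabled is made up for illustration, and it assumes the param-value line directly follows the param-name line, as in the snippet above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tomcat): succeeds if the 'listings'
# init-param in the given web.xml is immediately followed by a 'true' value.
listings_enabled() {
    grep -A1 '<param-name>listings</param-name>' "$1" \
        | grep -q '<param-value>true</param-value>'
}

# Only report when the Tomcat config actually exists on this machine.
if [ -f /etc/tomcat6/web.xml ]; then
    if listings_enabled /etc/tomcat6/web.xml; then
        echo "listings enabled"
    else
        echo "listings still disabled"
    fi
fi
```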

Now create a directory that will be viewable in a web browser where you will save output from your R scripts.

cd /var/lib/tomcat6/webapps/ROOT
sudo mkdir testdir

Now check in a web browser to see if the directory is viewable.  You should see a directory listing and not a 404 Not Found error.  Be sure you are using port 8080.  http://IPADDRESS:8080/testdir/

Now we need to set up the basic authentication.  We will need to edit the /etc/tomcat6/web.xml file again.  This time we will be adding a security-constraint and a login-config directly after the default servlet in the file.

sudo vi /etc/tomcat6/web.xml
    <servlet>
        <servlet-name>default</servlet-name>
        <servlet-class>org.apache.catalina.servlets.DefaultServlet</servlet-class>
        <init-param>
            <param-name>debug</param-name>
            <param-value>0</param-value>
        </init-param>
        <init-param>
            <param-name>listings</param-name>
            <param-value>true</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

    <security-constraint>
        <web-resource-collection>
            <web-resource-name>R Test</web-resource-name>
            <url-pattern>/testdir/*</url-pattern>
        </web-resource-collection>
        <auth-constraint>
            <role-name>member</role-name>
        </auth-constraint>
    </security-constraint>

    <login-config>
        <auth-method>BASIC</auth-method>
        <realm-name>Secure Area</realm-name>
    </login-config>

Now we need to add users to the /etc/tomcat6/tomcat-users.xml file with the same role that we set up in the web.xml file.  We used a role called ‘member’ in web.xml, so we will need to use the same one in tomcat-users.xml.

sudo vi /etc/tomcat6/tomcat-users.xml
<tomcat-users>
    <user username="user" password="password" roles="member"/>
</tomcat-users>

Restart Tomcat for the changes to take effect.

sudo /etc/init.d/tomcat6 restart

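Now check to see if the authentication works.  Browsing to http://IPADDRESS:8080/testdir/ should now prompt for the username and password.

Once you have entered them, you’re in!  Additional information about Apache Tomcat on Ubuntu 11.04 can be found here.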
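For a sense of what BASIC authentication actually sends, the sketch below builds the header a browser would attach to each request, using the placeholder credentials from tomcat-users.xml.  The function name basic_auth_header is my own, and it assumes the base64 tool from coreutils is available.

```shell
#!/bin/sh
# Build the Authorization header used by HTTP BASIC authentication.
# $1 = username, $2 = password (placeholders from tomcat-users.xml).
basic_auth_header() {
    # Base64 is an encoding, not encryption: anyone who can see the
    # header can decode the credentials.
    printf 'Authorization: Basic %s\n' "$(printf '%s:%s' "$1" "$2" | base64)"
}

basic_auth_header user password
# prints: Authorization: Basic dXNlcjpwYXNzd29yZA==
```

Because the site here is served over plain HTTP on port 8080, this setup keeps casual visitors out rather than protecting the password in transit.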

Running a batched R script

In your home directory, put the R file you want to run in batch mode.  I wrote this quick test program that saves the output to the Tomcat web directory with the date in the filename.

sudo vi TestBatch.R
# This is a test program to save file with today's date
# Author: Travis Nelson
#######################################################
setwd("/var/lib/tomcat6/webapps/ROOT/testdir")
filename <- "_output.txt"
filename <- paste(as.character(Sys.Date()), filename, sep="")
data <- paste("Output for ", as.character(Sys.Date()), sep="")
write(x=data,file=filename)

Test the batch program to see if it works.  If you do not use sudo, nothing is printed at the prompt, but the TestBatch.Rout file will contain “Permission denied” and “Execution halted”.

sudo R CMD BATCH TestBatch.R

Check that the file was written to the web directory (/var/lib/tomcat6/webapps/ROOT/testdir/), either on the server or in your web browser.
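If you would rather check from the shell than list the directory by eye, something like the following works.  It is a sketch, not part of the original setup: todays_output is a made-up helper name, and it relies on date +%F printing the same YYYY-MM-DD format that as.character(Sys.Date()) produces in R.

```shell
#!/bin/sh
# Hypothetical helper: print the path TestBatch.R would write to today.
# $1 = output directory.
todays_output() {
    printf '%s/%s_output.txt\n' "$1" "$(date +%F)"
}

out=$(todays_output /var/lib/tomcat6/webapps/ROOT/testdir)
if [ -f "$out" ]; then
    echo "found $out"
else
    echo "missing $out"
fi
```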

Setting up a cron job to run your batch script

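Now let’s automate the job to run Monday through Friday at 5:00am.  Cron uses the server clock, which here is on GMT.  Since I am in Denver, which is GMT-6 during daylight saving time (GMT-7 the rest of the year), we need to take the offset into account when setting up the cron job: 5:00am in Denver is currently 11:00 GMT.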
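The time zone arithmetic can be sketched in the shell as follows.  The function name gmt_hour is hypothetical, and the sketch assumes the server clock is on GMT; Denver is 6 hours behind GMT during daylight saving time and 7 hours behind in winter.

```shell
#!/bin/sh
# Convert a local hour into the hour to put in a crontab on a GMT server.
# $1 = local hour (0-23), $2 = hours the local zone is behind GMT.
gmt_hour() {
    echo $(( ($1 + $2) % 24 ))
}

gmt_hour 5 6   # 5:00am in Denver in summer; prints 11
```

If the result wraps past midnight (for example, 8:00pm local), the day-of-week field in the crontab needs to shift by one day as well.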

First, create the script file that will be used to call the R batch command.  I just named it test.sh with the contents:

sudo R CMD BATCH TestBatch.R
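Note that this one-liner looks for TestBatch.R in whatever directory the script is started from.  Cron starts jobs in the crontab owner's home directory, so it works for the ubuntu user, but if you want test.sh to behave the same when launched from elsewhere, a small wrapper along these lines is one option.  The name run_in_dir is my own invention, not a standard command.

```shell
#!/bin/sh
# Hypothetical helper: run a command from a given directory without
# changing the caller's working directory (the cd happens in a subshell).
# $1 = directory, remaining arguments = command to run.
run_in_dir() {
    ( cd "$1" && shift && "$@" )
}

# In test.sh this could be used as:
#   run_in_dir /home/ubuntu R CMD BATCH TestBatch.R
run_in_dir / pwd   # prints /
```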

Then change the permissions for the file so that it can be executed.

sudo chmod 750 test.sh
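Mode 750 gives the owner read, write and execute, the group read and execute, and everyone else nothing.  You can see this on a scratch file; mode_string below is just an illustrative helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the type and permission bits that ls -l shows.
mode_string() {
    ls -ld "$1" | cut -c1-10
}

tmp=$(mktemp)       # scratch file, so test.sh itself is not touched
chmod 750 "$tmp"
mode_string "$tmp"  # prints -rwxr-x---
rm -f "$tmp"
```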

Verify that running the script will update the timestamp of your R output file in your output directory.  Here you see that the timestamp changes for this file so we know the script is working.

$ ls -l /var/lib/tomcat6/webapps/ROOT/testdir/2011-06-05_output.txt
-rw-r--r-- 1 root root 22 2011-06-05 08:55 /var/lib/tomcat6/webapps/ROOT/testdir/2011-06-05_output.txt
$ sudo /home/ubuntu/test.sh
$ ls -l /var/lib/tomcat6/webapps/ROOT/testdir/2011-06-05_output.txt
-rw-r--r-- 1 root root 22 2011-06-05 09:19 /var/lib/tomcat6/webapps/ROOT/testdir/2011-06-05_output.txt

Add a crontab for your job:

sudo crontab -u ubuntu -e

This will bring up an editor where you will need to add this line to the bottom of the file.

0 11 * * 1-5 sudo /home/ubuntu/test.sh

Save the file and you should be good to go.  To test this, you might want to use an hourly schedule such as “55 * * * * sudo /home/ubuntu/test.sh”, where 55 is the number of minutes after the hour, to see if it is running correctly.  Also, if you need additional help, I found this site with some helpful information about the crontab options.
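If the five schedule fields are hard to keep straight, this small sketch splits a crontab line and labels each part.  describe_cron is a made-up name for illustration; it does no validation and simply relies on read to split on whitespace.

```shell
#!/bin/sh
# Hypothetical helper: label the five schedule fields of a crontab line.
# The last variable given to read receives the rest of the line (the command).
describe_cron() {
    echo "$1" | {
        read -r minute hour dom month dow command
        printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
            "$minute" "$hour" "$dom" "$month" "$dow"
        printf 'command=%s\n' "$command"
    }
}

describe_cron '0 11 * * 1-5 sudo /home/ubuntu/test.sh'
# prints:
#   minute=0 hour=11 day-of-month=* month=* day-of-week=1-5
#   command=sudo /home/ubuntu/test.sh
```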

I hope this tutorial is helpful and please leave me comments/questions.

 

 
