Benchmarking AWS Instances with MNIST classification


In a previous post I showed you how to set up an AWS instance running the newest RStudio, R, Python, Julia and so forth, where the configuration of the instance can be freely chosen. However, there are quite a lot of possible instance configurations out there: there are different instance classes (General Purpose, Compute Optimized, RAM Optimized, …) and different instance sizes within these classes. For General Purpose, or t2, there are, e.g., t2.nano, t2.micro, t2.small, t2.medium, t2.large, t2.xlarge and t2.2xlarge1.

These instances differ in two dimensions: price and performance. Obviously, these dimensions are highly correlated, since a higher price means (or should mean, at least) higher performance. Now, price is easily measured, yet performance is a bit trickier: for example, it is not entirely straightforward to assess the impact of more RAM, CPU or even GPU directly across many different configurations. But we're doing data science, right? So why not create a programmatic test in order to gauge the performance empirically? Well, let's do it!

The Test

For this benchmark test I chose a classical machine learning task: the classification of the MNIST dataset of handwritten digits, to be categorized as 0-9. This data set is very commonly used as an example set for machine learning algorithms.

For this benchmark test, I borrowed a nice script written by Kory Becker, which trains a Support Vector Machine (SVM) on the problem, using only the first 1000 observations of the dataset, each with 784 attributes (the 28 × 28 pixel values). I altered the code ever so slightly so that each run of the script returns the following measurements (a sketch of such a run appears after the lists below):

  1. Elapsed Time: the time elapsed since starting the script (excluding the time to install the libraries and download the data),
  2. Accuracy of the model, i.e. the percentage of predictions that classify the digits correctly.

Additionally, I included the following information:

  1. RAM in Gigabytes,
  2. Number of CPUs in use, and finally
  3. Price in Dollars per Hour.
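
To make this concrete, here is a minimal sketch of such a benchmark run. This is not the original script: the file name train.csv, the label column name and the 75/25 split are assumptions on my part, and Kory Becker's code differs in its details.

```r
library(e1071)   # svm()
library(caret)   # createDataPartition()

# Kaggle-style MNIST dump: a 'label' column plus 784 pixel columns
mnist <- read.csv("train.csv")
mnist <- mnist[1:1000, ]                 # only the first 1000 observations
mnist$label <- as.factor(mnist$label)

set.seed(42)
in_train  <- createDataPartition(mnist$label, p = 0.75, list = FALSE)
train_set <- mnist[in_train, ]
test_set  <- mnist[-in_train, ]

# Time only training and prediction, not package installation or data download
elapsed <- system.time({
  fit  <- svm(label ~ ., data = train_set)
  pred <- predict(fit, test_set)
})[["elapsed"]]

accuracy <- mean(pred == test_set$label)
cat(sprintf("elapsed: %.2f s, accuracy: %.3f\n", elapsed, accuracy))
```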

The Candidates

AWS provides a large number of different configurations, and I will not discuss all of them in this post. Rather, let me focus on four instance classes as representatives of the main computing resource profiles:

  • General Purpose: t2, m4
  • Compute Optimized: c4
  • Memory Optimized: r4

For each of these classes, I had planned to test the sizes small, medium, large, xlarge and 2xlarge. The sizes micro, small and medium are actually only available for t2 (oh, no!), so I ended up testing only 14 configurations.

The Results

I started with the candidate t2.micro, which is free of charge. Unfortunately, the script never successfully completed the training of the model, presumably because merely 1 GB of RAM is not sufficient. Still, a “not possible” result is a useful one when choosing the right infrastructure.

Let’s have a first look at the results, in plain numbers:

instance_class  instance_size  ram (GB)  vcpus   ecu    price_per_hour (USD)  elapsed_time  accuracy
t2              micro              1.00    1.0     NA                 0.0134            NA        NA
t2              small              2.00    1.0     NA                 0.0268        68.624     0.917
t2              large              8.00    2.0     NA                 0.1072        65.335     0.918
t2              xlarge            16.00    4.0     NA                 0.2144        55.611     0.918
t2              2xlarge           32.00    8.0     NA                 0.4288        56.284     0.919
t2              medium             4.00    2.0     NA                 0.0536        63.961     0.919
m4              large              8.00    2.0    6.5                 0.1200        82.823     0.933
m4              xlarge            15.00    4.0   13.0                 0.2400        80.749     0.928
m4              2xlarge           32.00    8.0   26.0                 0.4800        65.728     0.912
m4              4xlarge           64.00   16.0   53.5                 0.9600        64.573     0.927
m4              16xlarge         256.00   64.0  188.0                 3.8400        93.310     0.915
r4              large             15.25    2.0    7.0                 0.1600        80.749     0.928
r4              xlarge            30.50    4.0   13.5                 0.3200        68.372     0.920
c4              large              3.75    2.0    8.0                 0.1140       121.004     0.915

At a quick glance, the accuracy of the models looks quite uniform. This is hardly surprising, as the algorithm itself is unchanged by hardware limitations, and the apparent fluctuations can be explained by the stochastic nature of the train-test sampling in the script.

A core assumption is that more computing power yields faster results. A second core assumption is that the higher the computing power, the higher the cost. Combining these assumptions, we expect higher cost to translate into lower elapsed time. A quick visualization of the data shows that the results support this notion:
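
As an illustration of how such a plot might be produced, here is a sketch rather than the original plotting code; it assumes the measurements are collected in a data frame called results with the columns from the table above.

```r
library(ggplot2)

# results: one row per tested instance, columns as in the table above
ggplot(results, aes(x = price_per_hour, y = elapsed_time,
                    colour = instance_class, label = instance_size)) +
  geom_point(size = 3) +
  geom_text(nudge_y = 2, show.legend = FALSE) +
  labs(x = "Price per hour (USD)", y = "Elapsed time",
       colour = "Instance class")
```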

The measurement for the instance “m4.16xlarge” doesn’t quite fit the pattern, and I am frankly unsure of the reasons. The measurement was taken twice, so circumstantial errors leading to this result can be ruled out.

Let us look a little more closely at the data, in order to establish the most influential factors determining the speed of the analysis. We use the wonderful ggpairs visualization from the GGally package and omit the observation for the instance “m4.16xlarge” from the analysis:
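
A sketch of how this pairs plot can be generated, again assuming the results data frame from above (the exact call in the original analysis may differ):

```r
library(GGally)

# Drop the outlier m4.16xlarge and keep the numeric indicators of interest
sub <- subset(results,
              !(instance_class == "m4" & instance_size == "16xlarge"),
              select = c(ram, vcpus, price_per_hour, elapsed_time))
ggpairs(sub)
```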

This plot contains a number of results at once. First off, and unsurprisingly, the price per hour correlates very strongly with the number of virtual CPUs and the size of the RAM, indicating that “the higher the computing power, the higher the cost” was a correct core assumption.

Second, the correlation between elapsed time and the numeric performance indicators RAM, vCPUs and price per hour is clearly negative across the board, but the strongest correlation is attained by RAM. This provides yet another indication that the performance of R hinges primarily on RAM.

One last question to consider: which instance type is optimal for R purposes? Optimality is defined here as providing the quickest results for the least money. Compare the fits of a standard linear model of elapsed time against price per hour, per instance class:

instance_class  (Intercept)  price_per_hour
m4                 83.66913       -22.66862
r4                 93.12600       -77.35625
t2                 66.86029       -29.47335

This shows that the entry price is cheapest for instances of the class “t2”, as the y-intercept is lowest in this case. However, at a higher price per hour, i.e. higher necessary computing power, “r4” is the better choice: the elapsed time decreases quickest with increasing power. Both lines intersect at a price of roughly 55 cents per hour, corresponding to an r4.2xlarge instance with 61 GB of RAM.
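
For completeness, here is a sketch of how these per-class fits and the break-even price can be computed from the results data frame; the names are illustrative, and the single c4 observation yields an NA slope, so it plays no role in the comparison.

```r
# Fit elapsed_time ~ price_per_hour separately per instance class
fits <- lapply(split(results, results$instance_class),
               function(d) lm(elapsed_time ~ price_per_hour, data = d))

a <- sapply(fits, function(m) coef(m)[["(Intercept)"]])
b <- sapply(fits, function(m) coef(m)[["price_per_hour"]])

# Break-even price where the t2 and r4 regression lines intersect:
# a_t2 + b_t2 * x = a_r4 + b_r4 * x  =>  x = (a_r4 - a_t2) / (b_t2 - b_r4)
x_star <- (a[["r4"]] - a[["t2"]]) / (b[["t2"]] - b[["r4"]])
x_star  # roughly 0.55 USD per hour with the coefficients above
```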

Takeaway Messages

To conclude this article, let me summarize the key findings:

  • The most important hardware feature for increasing the computing speed of R analyses is RAM.
  • For analyses with a small to medium performance scope (RAM less than 60 GB), the instance class “t2” is the best choice on AWS.
  • For larger-scale projects, the instance class “r4”, optimized for RAM usage, is the optimal choice.

  1. …why is it ‘nano’, ‘micro’, but then ‘large’, ‘extra large’? Be consistent, dangit!
