AzureDSVM: a new R package for elastic use of the Azure Data Science Virtual Machine

May 19, 2017
By

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft)

The Azure Data Science Virtual Machine (DSVM) is a curated VM which provides commonly-used tools and software for data science and machine learning, pre-installed. AzureDSVM is a new R package that enables seamless interaction with the DSVM from a local R session, by providing functions for the following tasks:

  1. Deployment, deallocation, deletion of one or multiple DSVMs;
  2. Remote execution of local R scripts: compute contexts available in Microsoft R Server can be enabled for enhanced computation efficiency for either a single DSVM or a cluster of DSVMs;
  3. Retrieval of cost consumption and total expense spent on using DSVM(s).

AzureDSVM is built upon the AzureSMR package and depends on the same set of R packages such as httr, jsonlite, etc. It requires the same initial set up on Azure Active Directory (for authentication).

To install AzureDSVM with devtools package:

library(devtools)
devtools::install_github("Azure/AzureDSVM")
library("AzureDSVM")

When deploying a Data Science Virtual Machine, the machine name, size, OS type, etc. must be specified. AzureDSVM supports DSVMs on Ubuntu, CentOS, Windows, and Windows with the Deep Learning Toolkit (on GPU-class instances). For example, the following code fires up a D4 v2 Ubuntu DSVM located in South East Asia:

deployDSVM(context, 
           resource.group="example",
           location="southeastasia",
           size="Standard_D4_v2",
           os="Ubuntu",
           hostname="mydsvm",
           username="myname",
           pubkey="pubkey")

where context is an azureActiveContext object created by AzureSMR::createAzureContext() function that encapsulates credentials (Tenant ID, Client ID, etc.) for Azure authentication.

In addition to launching a single DSVM, the AzureDSVM package makes it easy to launch a cluster with multiple virtual machines. Multi-deployment supports:

  1. creating a collection of independent DSVMs which can be distributed to a group of data scientists for collaborative projects, as well as
  2. clustering a set of connected DSVMs for high-performance computation.

To create a cluster of 5 Ubuntu DSVMs with default VM size, use:

cluster<-deployDSVMCluster(context, 
                           resource.group=RG, 
                           location="southeastasia", 
                           hostnames="mydsvm",
                           usernames="myname", 
                           pubkeys="pubkey",
                           count=5)

To execute a local script on remote cluster of DSVMs with a specified Microsoft R Server compute context, use the executeScript function. (NOTE: only Linux-based DSVM instances are supported at the moment as underneath the remote execution is achieved via SSH. Microsoft R Server 9.x allows remote interaction for both Linux and Windows, and more details can be found here.) Here, we use the RxForeachDoPar context (as indicated by the compute.context option):

executeScript(context,
              resource.group="southeastasia",
              machines="dsvm_names_in_the_cluster",
              remote="fqdn_of_dsvm_used_as_master",
              user="myname",
              script="path_to_the_script_for_remote_execution",
              master="fqdn_of_dsvm_used_as_master",
              slaves="fqdns_of_dsvms_used_as_slaves",
              compute.context="clusterParallel")

Information of cost consumption and expense spent on DSVMs can be retrieved with:

consum<-expenseCalculator(context,
                            instance="mydsvm",
                            time.start="time_stamp_of_starting_point",
                            time.end="time_stamp_of_ending_point",
                            granularity="Daily",
                            currency="USD",
                            locale="en-US",
                            offerId="offer_id_of_azure_subscription",
                            region="southeastasia")

print(consum)

Detailed introductions and tutorials can be found in the AzureDSVM Github repository, linked below.

Github (Azure): AzureDSVM

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)