useR 2015: Computational Performance

July 1, 2015

(This article was first published on Why? » R, and kindly contributed to R-bloggers)

These are my initial notes from useR 2015. I may revise them when I have time.

Computational Performance; Chair: Dirk Eddelbuettel

Running R+Hadoop using Docker Containers (E. James Harner)


  • Big data architectures:
    • HDFS/Hadoop: software framework for distributed storage and distributed processing
    • Tachyon/Spark: in-memory storage and processing

Rc2 server (R cloud computing)

  • Has an editor and an output panel, with interactive collaboration (demo)
  • Highly scalable
  • 4-tier architecture: client, app server, compute cloud (JSON over BSD sockets for R),
    databases (PostgreSQL and CouchDB)

Rc2 client

  • Sharable project and workspaces
  • Graphs are written to files and moved to the database as blobs
  • Security: a three-value token is used for automatic logins


Rc2 is an accessible IDE for students and data scientists that allows real-time collaboration. It also acts as a front end to Hadoop and Spark clusters.

Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen (Matt P. Dziubinski)


  • Parametric model: We want to estimate a parameter by maximizing an objective function
  • No closed-form expression is available, so we need to optimize numerically


  • Derivative-free: needs only evaluations of the objective function, not its derivatives
  • Gradient-based: needs the gradient of the objective function
    • Steepest ascent, Newton
    • Often exhibit superior convergence rates
    • But obtaining the gradient can be tricky, e.g. with finite-difference methods
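
To make the gradient-based idea concrete, here is a minimal sketch (mine, not from the talk) of steepest ascent where the gradient comes from a central finite difference. The objective, the step size and the function names are all assumptions chosen for illustration.

```python
def objective(theta):
    # Hypothetical concave objective with its maximum at theta = 2.
    return -(theta - 2.0) ** 2


def central_difference(f, x, h=1e-5):
    # Approximate f'(x); the result depends on the choice of h,
    # which is one reason finite differences can be tricky.
    return (f(x + h) - f(x - h)) / (2.0 * h)


def steepest_ascent(f, start, step=0.1, iterations=200):
    # Repeatedly move uphill along the approximate gradient.
    theta = start
    for _ in range(iterations):
        theta += step * central_difference(f, theta)
    return theta


estimate = steepest_ascent(objective, start=0.0)
```

For this toy objective the estimate ends up very close to 2. Each gradient costs two extra function evaluations per parameter, and its accuracy hinges on `h`.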

Algorithmic differentiation

  • Essentially use the chain rule
  • Need to recode the objective function in C++ using Rcpp
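
The chain-rule idea can be sketched with forward-mode dual numbers. This is a toy illustration in Python under my own assumptions, not the RcppEigen setup from the talk: the `Dual` class, the helper names and the objective theta * exp(-theta) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    value: float  # function value
    deriv: float  # derivative carried alongside the value

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    # Treat plain numbers as constants (derivative zero).
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def exp(x):
    # Chain rule: d/dt exp(u) = exp(u) * du/dt.
    e = math.exp(x.value)
    return Dual(e, e * x.deriv)


def objective(theta):
    # Hypothetical objective: theta * exp(-theta), maximised at theta = 1.
    return theta * exp(-1.0 * theta)


def derivative(f, x):
    # Seed the input with derivative 1 and read off the result.
    return f(Dual(x, 1.0)).deriv
```

Calling `derivative(objective, 1.0)` gives 0, matching the analytic derivative (1 - theta) * exp(-theta), with no step size to tune.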

Improving computational performance with algorithm engineering (Kirill Müller)

Application: activity based microsimulation models

Weighted sampling without replacement

  • Random sample:
  • Common framework: Subdivide an interval according to probabilities
    • If sampling without replacement, remove sub-interval
  • R uses a trivial algorithm with an O(n) update
    • Heap-like data structure
  • Alternative approaches:
    • Rejection sampling
    • One-pass sampling (Efraimidis and Spirakis, 2006)
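
As I understand the one-pass approach of Efraimidis and Spirakis, each item gets a key u^(1/w), where u is uniform on (0, 1) and w is the item's weight, and the k items with the largest keys form the sample. Below is a minimal sketch of that idea. The function name, the use of a min-heap to track the top k keys and the handling of non-positive weights are my assumptions, not details from the talk.

```python
import heapq
import random


def weighted_sample_without_replacement(items, weights, k, rng=None):
    rng = rng or random.Random()
    if k <= 0:
        return []
    heap = []  # min-heap holding the k largest (key, index, item) so far
    for index, (item, weight) in enumerate(zip(items, weights)):
        if weight <= 0:
            continue  # assumption: non-positive weights are never drawn
        key = rng.random() ** (1.0 / weight)
        entry = (key, index, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # Replace the current smallest key in one step.
            heapq.heapreplace(heap, entry)
    # Largest key first.
    return [item for _, _, item in sorted(heap, reverse=True)]


sample = weighted_sample_without_replacement(
    list("abcde"), [5, 1, 1, 1, 1], k=3, rng=random.Random(1))
```

The data are read once and only k entries are kept in memory, so nothing has to be removed from an interval after each draw.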

Statistical matching (data fusion)

  • Use Gower's distance to compare distributions
    • Works with interval, ordinal and nominal variables
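
To show how one distance can cover mixed variable types, here is a minimal sketch of Gower's distance between two records. It makes several assumptions of mine: ordinal values are coded as integer ranks and treated like interval values, all variables are weighted equally, and the record layout (age, education rank, travel mode) is invented for illustration.

```python
def gower_distance(a, b, kinds, ranges):
    # kinds[i] is "interval", "ordinal" or "nominal";
    # ranges[i] is max - min of variable i (ignored for nominal).
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "nominal":
            # Simple mismatch: 0 if equal, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Range-normalised absolute difference, in [0, 1].
            total += abs(x - y) / value_range if value_range else 0.0
    return total / len(kinds)


kinds = ("interval", "ordinal", "nominal")
ranges = (40.0, 3.0, None)  # hypothetical: age spans 20..60, education ranks 1..4

person_a = (30, 2, "car")
person_b = (50, 3, "bike")
distance = gower_distance(person_a, person_b, kinds, ranges)
```

Here the per-variable contributions are 20/40, 1/3 and 1, so the distance is their mean, about 0.61. In a matching step one would pick the donor record with the smallest such distance.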

Please note that the notes/talks section of this post is merely my notes on the
presentation. I may have made mistakes: these notes are not guaranteed to be
correct. Unless explicitly stated, they represent neither my opinions nor the
opinions of my employers. Any errors you can assume to be mine and not the
speaker’s. I’m happy to correct any errors you may spot – just let me know!
