(This article was first published on **Why? » R**, and kindly contributed to R-bloggers)

These are my initial notes from useR 2015. I may revise them when I have time.

# Computational Performance; Chair: Dirk Eddelbuettel

## Running R+Hadoop using Docker Containers (E. James Harner)

### Introduction

- Big data architectures:
  - HDFS/Hadoop: software framework for distributed storage and distributed processing
  - Tachyon/Spark: in-memory storage and processing

### Rc2 server (R cloud computing)

- Has an editor & output panel. Interactive collaboration (Demo)
- Highly scalable
- 4-tier architecture: client, app server, compute cloud (JSON over BSD sockets for R), databases (pgSQL & couchdb)

### Rc2 client

- Shareable projects and workspaces
- Graphs are written to files and moved to the database as blobs
- Security: a three-value token is used for automatic logins

### Summary

Rc2 is an accessible IDE that lets students and data scientists collaborate in real time. It also acts as a front end to Hadoop and Spark clusters.

## Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen (Matt P. Dziubinski)

### Why

- Parametric model: We want to estimate a parameter by maximizing an objective function
- There is no closed-form expression, so we need to optimize numerically

### Algorithms

- Derivative-free: needs only evaluations of the objective function, not its derivatives
- Gradient-based: needs the **gradient** of the objective function
  - Steepest ascent, Newton
  - Often exhibit superior convergence rates
  - But getting the gradient can be tricky, e.g. finite-difference approximations are sensitive to the step size
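
To make the gradient-based bullet concrete, here is a minimal sketch in R. The toy model (an exponential sample with the rate parametrised on the log scale), the data and all function names are my own assumptions for illustration, not taken from the talk. It compares a hand-derived gradient with a central finite-difference approximation and passes the exact gradient to `optim()`.

```r
# Toy extremum estimation problem (assumed for illustration):
# maximise the log-likelihood of an exponential sample over theta = log(rate).
set.seed(1)
x <- rexp(200, rate = 2)
n <- length(x)

loglik <- function(theta) n * theta - exp(theta) * sum(x)

# Hand-derived gradient of the log-likelihood
grad_exact <- function(theta) n - exp(theta) * sum(x)

# Central finite-difference approximation; its accuracy depends on the step h
grad_fd <- function(theta, h = 1e-6) {
  (loglik(theta + h) - loglik(theta - h)) / (2 * h)
}

# optim() minimises, so negate both the objective and the gradient
fit <- optim(par = 0,
             fn = function(theta) -loglik(theta),
             gr = function(theta) -grad_exact(theta),
             method = "BFGS")

rate_hat <- exp(fit$par)     # numerical estimate of the rate
rate_closed <- 1 / mean(x)   # closed-form maximiser, available only in this toy case
```

In this toy case the closed-form answer exists, which makes it easy to check the optimiser; the models discussed in the talk are precisely those where it does not.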

### Algorithmic differentiation

- Essentially uses the chain rule
- Need to recode the objective function in C++ using Rcpp
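
The chain-rule idea can be sketched with forward-mode "dual numbers". The talk used C++ via RcppEigen; the version below is a plain-R toy of my own, and every name in it (`dual`, `d_add`, `d_mul`, `d_exp`, `d_sin`, `f_dual`) is hypothetical. Each value carries its derivative, and each elementary operation updates both.

```r
# A dual number stores a value together with its derivative w.r.t. the input
dual <- function(value, deriv = 0) list(value = value, deriv = deriv)

# Elementary operations propagate derivatives with the usual calculus rules
d_add <- function(a, b) dual(a$value + b$value, a$deriv + b$deriv)
d_mul <- function(a, b) dual(a$value * b$value,
                             a$deriv * b$value + a$value * b$deriv)
d_exp <- function(a) dual(exp(a$value), exp(a$value) * a$deriv)
d_sin <- function(a) dual(sin(a$value), cos(a$value) * a$deriv)

# f(x) = x * exp(x) + sin(x), written with the dual-number operations
f_dual <- function(x) d_add(d_mul(x, d_exp(x)), d_sin(x))

# Seed the input with derivative 1 to get f(x) and f'(x) in one evaluation
res <- f_dual(dual(1.5, 1))

# Hand-derived derivative for comparison: (1 + x) * exp(x) + cos(x)
exact <- (1 + 1.5) * exp(1.5) + cos(1.5)
all.equal(res$deriv, exact)
```

Unlike a finite-difference approximation, there is no step size to tune here: the derivative is exact up to floating-point rounding.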

## Improving computational performance with algorithm engineering (Kirill Müller)

Application: activity-based microsimulation models

### Weighted sampling without replacement

- Random sample: `sample.int`

- Common framework: subdivide an interval according to the probabilities
  - If sampling without replacement, remove the chosen sub-interval

- R uses a trivial algorithm with an O(n) update
  - Heap-like data structure

- Alternative approaches:
  - Rejection sampling
  - One-pass sampling (Efraimidis and Spirakis, 2006)
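
Two of the ideas above can be sketched in a few lines of R. Both helpers are hypothetical names of mine and are illustrations only, not R's internal code or the speaker's implementation. `sample_interval()` follows the interval framework with a naive O(n) update per draw; `sample_one_pass()` follows the key-based scheme of Efraimidis and Spirakis, where each item gets the key u^(1/w) and the k largest keys form the sample.

```r
# Interval method: subdivide [0, sum(w)) and remove the chosen sub-interval.
# The cumulative sums are recomputed on every draw, hence O(n) per update.
sample_interval <- function(w, k) {
  stopifnot(k <= sum(w > 0))
  picked <- integer(k)
  for (i in seq_len(k)) {
    cum <- cumsum(w)                  # end points of the sub-intervals
    u <- runif(1) * cum[length(cum)]  # uniform point on the whole interval
    j <- findInterval(u, cum) + 1     # sub-interval that contains u
    picked[i] <- j
    w[j] <- 0                         # remove that sub-interval
  }
  picked
}

# One-pass method: a single vector of random keys, then keep the k largest.
sample_one_pass <- function(w, k) {
  stopifnot(all(w > 0), k <= length(w))
  keys <- log(runif(length(w))) / w   # log(u^(1/w)); the log avoids underflow
  order(keys, decreasing = TRUE)[seq_len(k)]
}

set.seed(2015)
w <- c(1, 2, 3, 4)
sample_interval(w, 2)
sample_one_pass(w, 2)
```

The one-pass version needs no per-draw update at all, at the cost of generating a key for every item even when k is small.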

### Statistical matching (data fusion)

- Use Gower's distance to compare distributions
  - Works with interval, ordinal and nominal variables
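
As a rough illustration of why Gower's distance suits mixed data, here is a minimal sketch in R. The helper `gower_dist()` and the `people` data frame are made up for this example. I assume one common convention: numeric and ordinal variables contribute a range-scaled absolute difference (ordinal levels are first converted to ranks), nominal variables contribute 0 or 1, and the result is the unweighted mean over variables.

```r
# Gower distance between rows a and b of a data frame with mixed column types
gower_dist <- function(a, b, df) {
  per_var <- vapply(names(df), function(v) {
    col <- df[[v]]
    if (is.numeric(col) || is.ordered(col)) {
      z <- as.numeric(col)               # ordered factors become integer ranks
      abs(z[a] - z[b]) / diff(range(z))  # scaled to [0, 1] by the observed range
    } else {
      as.numeric(col[a] != col[b])       # nominal: simple mismatch indicator
    }
  }, numeric(1))
  mean(per_var)
}

# Made-up records with one interval, one ordinal and one nominal variable
people <- data.frame(
  age    = c(25, 40, 65),
  edu    = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"), ordered = TRUE),
  region = factor(c("north", "south", "north"))
)

gower_dist(1, 3, people)  # (40/40 + 1/2 + 0) / 3 = 0.5
```

Packages such as `cluster` (via `daisy(metric = "gower")`) provide a full implementation, including weights and missing values.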

*Please note that the notes/talks section of this post is merely my notes on the
presentation. I may have made mistakes: these notes are not guaranteed to be
correct. Unless explicitly stated, they represent neither my opinions nor the
opinions of my employers. Any errors you can assume to be mine and not the
speaker’s. I’m happy to correct any errors you may spot – just let me know!*
