## Resampling data in Hadoop with RHadoop

February 27, 2013
On Revolution Analytics partner Cloudera's blog, Uri Laserson has posted an excellent guide to resampling from a large data set in Hadoop. Resampling is an important step in fitting ensemble models (including random forests and other bagging techniques), and Uri provides a step-by-step guide to implementing resampling methods using RHadoop. He provides the complete map-reduce code in the R...
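
The resampling step behind bagging can be sketched in a few lines. This is a single-machine Python illustration of sampling with replacement, not the map-reduce R code from the linked post; the function names `bootstrap_samples` and `bagged_mean` are hypothetical, and the "model" fitted to each resample is just a mean.

```python
import random


def bootstrap_samples(data, n_samples, seed=0):
    """Draw n_samples resamples of data, each the same size as data,
    sampling with replacement (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_samples)]


def bagged_mean(data, n_samples=200, seed=0):
    """Fit a trivial 'model' (the mean) on every resample and average the fits."""
    samples = bootstrap_samples(data, n_samples, seed)
    means = [sum(s) / len(s) for s in samples]
    return sum(means) / len(means)
```

In an ensemble method such as a random forest, the mean would be replaced by a model fitted to each resample, and the final averaging by a vote or an average over the models' predictions.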

## the BUGS Book [guest post]

February 24, 2013
(My colleague Jean-Louis Foulley, now at I3M, Montpellier, kindly agreed to write a review of the BUGS book for CHANCE. Here is the review, as a sneak preview! Watch out, it is fairly long and exhaustive! References will be available in the published version. The additions of book covers with BUGS in the title and of the corresponding

## Large correlation in parallel

February 24, 2013
This is a small improvement to the bigcor function proposed on Rmazing for computing huge correlation matrices in R: I made the function run in parallel using all the CPU cores available on the machine. The code is here. Here is a benchmark of the 2 func...

## The Wisdom of Crowds – Clustering Using Evidence Accumulation Clustering (EAC)

February 24, 2013
Today’s blog post is about a problem familiar to most people who use clustering algorithms on datasets without true labels (unsupervised learning). The challenge is the “freedom of choice” among a broad range of clustering algorithms and the question of how to determine the right parameter values. The difficulty is the following: Every clustering algorithm and even...

## bigcor: Large correlation matrices in R

February 22, 2013
As I am working with large gene expression matrices (microarray data) in my job, it is sometimes important to look at the correlation in gene expression of different genes. It has been shown that by calculating the Pearson correlation between genes, one can identify (by high values, i.e. > 0.9) genes that share a common
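
The idea of building a large Pearson correlation matrix block by block can be sketched as follows. This is a pure-Python illustration under stated assumptions, not the bigcor implementation: `standardize` and `blockwise_cor` are hypothetical names, every column is assumed to be non-constant, and the result is held in ordinary memory rather than on disk.

```python
import math
from itertools import combinations_with_replacement


def standardize(col):
    """Center a column and scale it to unit length, so that the dot product
    of two standardized columns equals their Pearson correlation.
    Assumes the column is not constant (otherwise the norm is zero)."""
    mean = sum(col) / len(col)
    centered = [x - mean for x in col]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def blockwise_cor(columns, block_size):
    """Fill the correlation matrix one pair of column blocks at a time."""
    z = [standardize(c) for c in columns]
    p = len(z)
    cor = [[0.0] * p for _ in range(p)]
    starts = range(0, p, block_size)
    # Visit each unordered pair of blocks once, including a block with itself.
    for a, b in combinations_with_replacement(starts, 2):
        for i in range(a, min(a + block_size, p)):
            for j in range(b, min(b + block_size, p)):
                r = sum(x * y for x, y in zip(z[i], z[j]))
                cor[i][j] = cor[j][i] = r  # the matrix is symmetric
    return cor
```

Only two blocks of columns are needed at any one time, which is what makes the approach attractive when the full set of columns is too large to correlate in one call; each block pair is also independent of the others, so the pairs can be distributed across cores.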

## ±∞

February 21, 2013
The Cauchy distribution (`?dcauchy` in `R`) nails a flashlight over the number line and swings it at a constant speed from 9 o’clock down to 6 o’clock over to 3 o’clock. (Or the other direction, from 3→6→9.) Then counts ...
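
The flashlight picture can be turned into the density directly. As a sketch, assume the flashlight sits at height 1 above the origin and that its angle $\theta$ from the vertical is uniform on $(-\pi/2, \pi/2)$, which is what a constant swing speed gives; the unit height is an assumption made here to obtain the standard Cauchy distribution. The beam hits the number line at $X = \tan\theta$, so

$$
P(X \le x) = P(\theta \le \arctan x) = \frac{\arctan x + \pi/2}{\pi} = \frac{1}{2} + \frac{\arctan x}{\pi},
$$

and differentiating with respect to $x$ gives

$$
f(x) = \frac{1}{\pi\,(1 + x^2)}.
$$

Angles near 9 o’clock and 3 o’clock send the beam arbitrarily far out along the line, which is where the heavy tails come from.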

## Progress bar in R

February 20, 2013
I spend a decent percentage of my working time in R looping over chromosomes, transcription factors or tissues, usually with parallelization. To get the stuff to run simultaneously I use the foreach function with the doMC package as the parallel backend, and for monitoring of ...
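
The basic mechanics of a text progress bar can be sketched as below. This is a sequential Python illustration, not the post's R code and not a parallel implementation; `render_progress` and `run_with_progress` are hypothetical names.

```python
import sys


def render_progress(done, total, width=30):
    """Return a one-line text bar: filled cells, empty cells, then a percentage."""
    filled = width * done // total
    percent = 100 * done // total
    return "[" + "#" * filled + "-" * (width - filled) + f"] {percent:3d}%"


def run_with_progress(items, work):
    """Apply work to each item, redrawing the bar on stderr after every step."""
    results = []
    total = len(items)
    for i, item in enumerate(items, start=1):
        results.append(work(item))
        # "\r" moves the cursor back to the start of the line,
        # so each update overwrites the previous bar.
        sys.stderr.write("\r" + render_progress(i, total))
        sys.stderr.flush()
    sys.stderr.write("\n")
    return results
```

With parallel workers the same rendering function can be reused, but the count of finished tasks has to be collected in one place, since the workers complete out of order.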

## Version 1.0 of multilevelPSA Available on CRAN

February 14, 2013
Version 1.0 of `multilevelPSA` has been released to CRAN. The `multilevelPSA` package provides functions to estimate and visualize propensity score models with multilevel, or clustered, data. The graphics are an extension of the `PSAgraphics` package by Helmreich and Pruzek. The example below will investigate the differences between private and public schools internationally using the Programme for International Student Assessment...

## Quantile Autoregression in R

February 9, 2013
In the past, I wrote about robust regression. This is an important tool which handles outliers in the data. Roger Koenker is a substantial contributor in this area. His website is full of useful information and code so visit when …