## Getting SASsy

April 29, 2012
By

Although I am most familiar with R for statistical analysis and programming, I also use a fair amount of SAS at work. I found it a huge transition at first, but one thing that helped make SAS “click” for me … Continue reading →

## Clustering analysis and its implementation in R

April 29, 2012
By

Earlier I posted a blog entry on "k-means + heatmap" for clustering analysis. Recently, to prepare for the "Bioinformatics Tools" meeting, I made a slide deck with more details on clustering analysis. Here it is: https://docs.google.com/presentation/d/1vMS3...
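
To illustrate the "k-means + heatmap" idea mentioned in this excerpt, here is a minimal sketch in base R. It is not the author's code: the toy matrix, the choice of two clusters, and the variable names are all assumptions made for the example.

```r
# Minimal sketch: run k-means on a toy matrix, then order the rows by
# cluster so that a heatmap shows each group as a contiguous block.
set.seed(1)

# Toy data (an assumption): 20 rows in two well-separated groups, 5 columns.
x <- rbind(
  matrix(rnorm(50, mean = 0), nrow = 10),
  matrix(rnorm(50, mean = 5), nrow = 10)
)
rownames(x) <- paste0("row", seq_len(nrow(x)))

# Partition the rows into 2 clusters; nstart tries several random starts.
km <- kmeans(x, centers = 2, nstart = 10)

# Reorder the rows by cluster label before plotting.
ord <- order(km$cluster)
x_ordered <- x[ord, ]

# Rowv = NA and Colv = NA keep our ordering instead of re-clustering.
# The plot is only drawn in an interactive session.
if (interactive()) {
  heatmap(x_ordered, Rowv = NA, Colv = NA, scale = "none")
}
```

Ordering by `km$cluster` is only one way to arrange the rows; `heatmap()` would otherwise apply its own hierarchical clustering, which is why the dendrograms are switched off here.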

## Animating Schelling’s segregation model

April 29, 2012
By

A recent blog post on Animations in R inspired me to write code that generates animations of a simulation model. For this task I chose Schelling's segregation model. Having written the code, I found that one year ago a similar code has been...

## Guess who wins: apply() versus for loops in R

April 28, 2012
By

Yesterday I tried to do some data processing on my really big data set in MS Excel. Wow, did it not like handling all those data!! Every time I tried to click on a different ribbon, the screen didn’t even … Continue reading →

## Open data and ecological fallacy

April 28, 2012
By

A couple of days ago, on Twitter, @alung mentioned an old post I published on this blog about open data, explaining how difficult it was to get access to data in France (the post, published almost 18 months ago, can be found here, in French)....

## microbenchmarking with R

April 28, 2012
By

I love to benchmark. Maybe I’m a bit weird but I love to bench everything in R. Recently I’ve had people raise accuracy challenges to the typical system.time and rbenchmark package approaches to benchmarking. I saw Hadley Wickham promoting the … Continue reading →
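
For readers unfamiliar with the `system.time` approach mentioned above, here is a rough sketch using base R only. The two functions being compared and the helper `time_it` are hypothetical names invented for this example; the post's own benchmarks are not shown in the excerpt.

```r
# Two ways to square a vector; both names are illustrative only.
square_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2
  out
}
square_vec <- function(x) x^2

x <- runif(1e5)

# Both versions must agree before a timing comparison means anything.
stopifnot(isTRUE(all.equal(square_loop(x), square_vec(x))))

# system.time() measures a single evaluation, so repeat the call and
# summarise the elapsed times to reduce noise.
time_it <- function(f, reps = 20) {
  elapsed <- replicate(reps, system.time(f(x))[["elapsed"]])
  c(median = median(elapsed), min = min(elapsed), max = max(elapsed))
}

rbind(loop = time_it(square_loop), vectorized = time_it(square_vec))
```

Because `system.time` has coarse resolution, very fast expressions can report an elapsed time of zero, which is one reason people question its accuracy for small benchmarks.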

## Correlation of temperature proxies with observations

April 28, 2012
By

The climate change debate centers mainly on the assumption that the annual global mean temperatures of the past few decades have been the highest in the past millennium. How do we know what the annual global mean temperature was in the year, say, 1351 AD? The answer is: Through temperature proxies. Such proxies include tree

## R equivalents to SAS and SPSS procedures

April 27, 2012
By

With more than 5,000 R packages now available (from the CRAN and BioConductor repositories), for any statistical or data analysis procedure you can confidently say, "there's a package for that". To make it easier for SAS and SPSS users to find what they need in R, Bob Muenchen has updated his useful table of equivalent R packages for SAS...

## Sage Bionetworks Synapse

April 27, 2012
By

Michael Kellen, Director of Technology at Sage Bionetworks, is trying to build a GitHub for science. It's called Synapse and Kellen described it in a talk at the Sage Bionetworks Commons Congress 2012, this past weekend: 'Synapse' Pilot for Building an...

## The Best Statistical Programming Language is …Javascript?

April 27, 2012
By

R-Bloggers has recently been buzzing about Julia, the new kid on the statistical programming block. Julia, however, is hardly the sole contender for the market of R defectors, with Clojure-fork Incanter generating buzz as well. Even with these two making noise, I think there’s a huge point that everyone is missing, and it’s front-and-center on

April 27, 2012
By

The R language has passed another milestone: a paper aimed at the academic programming language community (or at least one section of this community) has been written about it, Evaluating the Design of the R Language by Morandat, Hill, Osvald and Vitek. Hardly earth-shattering news, but it may have some impact on how R

## R Workshop: Reproducible Research using Sweave for Beginners

April 27, 2012
By

Monday, April 30, 2012, 14h-16h. Stewart Biology Rm w6/12 (Montreal). guRu: Denis Haine (Université de Montréal). Topics: The term "reproducible research" was first coined by Prof. Jon Claerbout, professor of geophysics at Stanford University, to describe the idea that research results can be replicated by other scientists by making available the data, procedures, materials and the computational environment

## How to download complete XML records from PubMed and extract data

April 27, 2012
By

Yesterday I wrote an article that looked at the top 20 Cognitive Behavior Therapy journals with the most publications; today I will explain how I did it with R.

## A Bayesian Consumption Function

April 27, 2012
By

What the title of this post is supposed to mean is: "Estimating a simple aggregate consumption function using Bayesian regression analysis". In a recent post I mentioned my long-standing interest in Bayesian Econometrics. When I teach this material I usually include a simple application that involves estimating a consumption function using U.S. time-series data. I used to have...

## Real Time Structural Break

April 27, 2012
By

Yesterday as I played with bfast I kept thinking “Yes, but this is all in hindsight.  How can I potentially use this in a system?”  Fortunately, one of the fine authors very generously commented on my post Structural Breaks (Bull or Bear?...

## Measuring user retention using cohort analysis with R

April 27, 2012
By

Cohort analysis is super important if you want to know whether your service is in fact a leaky bucket despite nice growth in absolute numbers. There’s a good write-up on the subject, “Cohorts, Retention, Churn, ARPU” by Matt Johnson. So how do you do it in R, and how do you visualize it? Inspired by examples
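
As a rough idea of what a cohort retention table looks like in R, here is a small base R sketch. The activity log, its column names, and the definition of a cohort as "first active month" are assumptions for illustration, not the post's actual data or code.

```r
# Toy activity log (an assumption): one row per user per active month.
activity <- data.frame(
  user  = c("a", "a", "a", "b", "b", "c", "c", "d"),
  month = c(1, 2, 3, 1, 3, 2, 3, 2)
)

# A user's cohort is the first month in which they were active.
activity$cohort <- ave(activity$month, activity$user, FUN = min)

# Number of months elapsed since the user joined (0 = first month).
activity$period <- activity$month - activity$cohort

# Count distinct active users for each cohort and period.
counts <- with(unique(activity[c("user", "cohort", "period")]),
               table(cohort, period))

# Divide each row by the cohort size (period 0) to get retention rates.
retention <- counts / counts[, "0"]
round(retention, 2)
```

Each row is a cohort and each column is the number of months since joining, so a row that drops off quickly is the "leaky bucket" the excerpt warns about, even when total user counts keep growing.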

## Speeding up R computations Pt II: compiling

April 27, 2012
By

A year ago I wrote a post on speeding up R computations. Some of the tips that I mentioned then have since been made redundant by a single package: compiler. Forget about worrying about curly brackets and whether to write 3*3 or 3^2 - compile...
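
For context, here is a minimal sketch of byte-compiling a function with `cmpfun()` from the compiler package, which ships with R. The function `sum_squares` is a made-up example, and the timings will vary by machine and R version.

```r
library(compiler)  # part of the standard R distribution

# A deliberately loop-heavy function (illustrative only).
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

# cmpfun() returns a byte-compiled version of the same function.
sum_squares_c <- cmpfun(sum_squares)

# Compilation must not change the result.
stopifnot(identical(sum_squares(1000), sum_squares_c(1000)))

# Compare timings. On R >= 3.4 the JIT compiler byte-compiles functions
# automatically, so the gap may be small or absent.
system.time(for (k in 1:200) sum_squares(1e4))
system.time(for (k in 1:200) sum_squares_c(1e4))
```

On the R versions current when this post was written, explicit compilation was needed to get the benefit; newer versions do it by default, so treat any speedup here as version-dependent.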

## Create polygons from a matrix

April 27, 2012
By

The following function matrix.poly allows for the addition of polygons to a plot based on a matrix and defined matrix positions. I have used this function on occasion to highlight specific matrix locations (e.g. in the above figure). You can do the same by overlaying another image (left in above plot) but with this...

## Read Big Text Files Column by Column

April 27, 2012
By

Dear R Programmers, There is a new package, "colbycol", on CRAN, which makes our jobs easier when we have large files (i.e. more than a GB) to be read into R, especially when we don't need all of the columns/variables for our analysis. Kudos to the author, Carlos...
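
The excerpt does not show the colbycol API, so the sketch below does not use it. Instead it shows a related base R technique for the same goal of skipping unneeded columns: passing `"NULL"` in `colClasses` to `read.table()`. The demo file and its column names are assumptions.

```r
# Write a small demo file (a stand-in for a multi-gigabyte file).
path <- tempfile(fileext = ".txt")
write.table(
  data.frame(id = 1:5, value = c(2.5, 3.1, 4.8, 1.2, 9.9), note = letters[1:5]),
  file = path, sep = "\t", row.names = FALSE, quote = FALSE
)

# colClasses = "NULL" tells read.table() to skip that column entirely;
# NA lets R guess the type of the columns we keep.
keep <- c(NA, NA, "NULL")

dat <- read.table(path, header = TRUE, sep = "\t", colClasses = keep)
str(dat)

unlink(path)  # remove the temporary file
```

This still scans the whole file once, so it reduces memory use rather than read time; a package built specifically for column-wise reading may behave differently.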

## Graphic Parameters (symbols, line types, and colors) for ggplot2

April 27, 2012
By

Following up on John Mount’s post on remembering symbol parameters in ggplot2, I decided to give it a try and included symbols, line types, and colors (based upon Earl Glynn’s wonderful color chart).  Code follows below. require(ggplot2) ...

## Randomization thoughts

April 27, 2012
By

Le Grand Casino of Monte Carlo. On Monday I’m going to be leading a little stats workshop on randomization tests and null models. In preparation for this I wrote up code for null model examples, and I wanted to write a post that introduced the basics of these models (Null models, bootstrapping,...
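
To give a flavour of a randomization test, here is a small base R sketch of a two-group permutation test. It is not the workshop code: the simulated data, the test statistic, and the number of permutations are all assumptions chosen for the example.

```r
set.seed(42)

# Toy data (an assumption): two groups whose true means differ by 2.
group_a <- rnorm(15, mean = 10)
group_b <- rnorm(15, mean = 12)

values <- c(group_a, group_b)
labels <- rep(c("a", "b"), each = 15)

# Test statistic: difference in group means.
mean_diff <- function(v, g) mean(v[g == "b"]) - mean(v[g == "a"])
observed <- mean_diff(values, labels)

# Null model: group labels are exchangeable, so shuffle them and recompute.
n_perm <- 5000
null_dist <- replicate(n_perm, mean_diff(values, sample(labels)))

# Two-sided p-value; the +1 counts the observed arrangement itself.
p_value <- (sum(abs(null_dist) >= abs(observed)) + 1) / (n_perm + 1)
p_value
```

The key design choice is the null model: shuffling labels assumes that, under the null hypothesis, any observation could equally have come from either group.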

## soilDB Demo: Processing SSURGO Attribute Data with SDA_query()

April 26, 2012
By

Mapping near Paloma, CA. This image has nothing to do with the following content. A quick example of how to use the USDA-NRCS soil data access query facility (SDA), via the soilDB package for R. The following code describes how to get component-level so...

## phyloseq: Reproducible interactive analysis of microbiome census data using R

April 26, 2012
By

Collaborative development of phyloseq on GitHub. Official stable release of phyloseq on Bioconductor. Advances in DNA sequencing technology have dramatically improved the scope and scale of culture-independent investigations into microbial communities. There are effective software tools available to process raw DNA … Continue reading →

April 26, 2012
By

In my previous post about rewriting my code to run in parallel (part one), I mentioned that we would make a small change to the adfTest() function as well. In this post we will make this small change, which has a dramatic effect on performance. When you take a closer look at the source code of this particular function from the fUnitRoots package

## Structural Breaks (Bull or Bear?)

April 26, 2012
By

When I spotted the bfast R package, I could not resist attempting to apply it to identify bull and bear markets.  For all the details that I do not understand, please see the references: Jan Verbesselt, Rob Hyndman, Glenn Newnham, Darius Culvenor...

## Graphic Parameters (symbols, line types, and colors) for ggplot2

April 26, 2012
By

Following up on John Mount’s post on remembering symbol parameters in ggplot2, I decided to give it a try and included symbols, line types, and colors (based upon Earl Glynn’s wonderful color chart).  Code follows below.

## Big Data statistics in the search for a cure for MS

April 26, 2012
By

Multiple Sclerosis (MS) is a debilitating and complex disease with an unknown cause, for which there is currently no cure. SUNY Buffalo is home to one of the leading MS research centers in the world, and as reported in Healthcare IT News, the research team is using IBM Netezza and Revolution R Enterprise to...