Benchmarking feature selection with Boruta and caret

November 25, 2010

Feature selection is the data mining process of selecting the variables from our data set that may have an impact on the outcome we are considering. For commercial data mining, which is often characterised by having too many variables for model building, this is an...

Read more »

Random graphs with fixed numbers of neighbours

November 24, 2010

In connection with Le Monde puzzle #46, I eventually managed to write an R program that generates graphs with a given number n of nodes and a given number k of edges leaving each of those nodes. (My early attempt was simply too myopic to achieve any level of success when n was larger than

Read more »

R preferred by Kaggle competitors

November 24, 2010

Kaggle, the predictive-analytics competition site, has analyzed the preferences of the 2,500 data scientists who participate in its competitions, and R was the most-preferred software of the competitors at 22.5%. The next-nearest alternative was Matlab, at 16%. On a related note, the premier of the Australian state of New South Wales has just launched a competition on Kaggle to...

Read more »

Life Is Short, Use Python

November 24, 2010

I started to play with Python two weeks ago because of R's limitations in handling large data. A friend of mine suggested I try Python, since I had to massage data frequently: "Python is the best choice, trust me", he...

Read more »

The joys of teaching R

November 23, 2010

Just read a funny but much to the point blog entry on the difficulties of teaching proper programming skills to first year students! I will certainly make use of the style file as grading 180 exams is indeed a recurrent nightmare…

Read more »

Great-circle distance calculations in R

November 23, 2010

Recently I found myself needing to calculate the distance between a large number of longitude and latitude locations. As it turns out, because the earth is a three-dimensional object, you cannot simply pretend that you are in Flatland, albeit some …
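
As a rough illustration of the kind of calculation this entry is about, here is a minimal sketch of the haversine formula in base R. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example, not taken from the linked post, which may use a different formula.

```r
# Haversine great-circle distance between points given in decimal degrees.
# earth_radius_km = 6371 is an assumed mean radius, not a value from the post.
haversine_km <- function(lon1, lat1, lon2, lat2, earth_radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * earth_radius_km * asin(pmin(1, sqrt(a)))
}

# A quarter of the way around the equator
quarter <- haversine_km(0, 0, 90, 0)
```

The function is vectorised, so whole columns of coordinates can be passed at once, which matters when the number of locations is large.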

Read more »

Principal Component Analysis: Which variables contribute most to principal components?

November 23, 2010

Principal component analysis (PCA) is a mathematical transformation of (possibly correlated) variables into a number of uncorrelated variables called principal components. The resulting components from this transformation are defined in such a way that t...
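
One common way to approach the question in the title, sketched here with base R's `prcomp`, is to look at squared loadings. The built-in `USArrests` data set is an assumption chosen for illustration, and the linked post may measure contributions differently.

```r
# Fit a PCA on a built-in data set, scaling variables to unit variance.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings: one row per original variable, one column per component.
loadings <- pca$rotation

# Each column of the rotation matrix has unit length, so the squared
# loadings of a component sum to 1 and can be read as percentages.
contrib_pct <- 100 * loadings^2

# Variables ranked by their contribution to the first component.
pc1_ranked <- sort(contrib_pct[, "PC1"], decreasing = TRUE)
```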

Read more »

Slides from first Utah.edu & R.P. RUG meeting

November 23, 2010

Here are the slides from the first University of Utah and Research Park R Users Group meeting. They discuss getting help and finding packages.

Read more »

How to make beautiful bubble charts with R

November 23, 2010

Nathan Yau has just published at FlowingData a step-by-step guide on making bubble charts in R. It's actually pretty simple: read in data, sqrt-transform the "bubble" variable (to scale the bubbles by area, not radius), and use the symbols function to plot. It's the last step, though, that really ups the presentation quality: read R's PDF file into Illustrator...
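
The steps described above can be sketched in a few lines of base R. The toy data, column names, colours and output path below are made up for illustration and are not from the FlowingData tutorial.

```r
# Toy data; values and column names are hypothetical.
d <- data.frame(x = c(1, 2, 3, 4),
                y = c(2, 3, 1, 4),
                size = c(10, 40, 90, 160))

# Circle area grows with radius^2, so taking the square root of the bubble
# variable makes area (not radius) proportional to the value.
radius <- sqrt(d$size / pi)

# Write to PDF so the figure can be polished in a vector editor afterwards.
out <- file.path(tempdir(), "bubbles.pdf")
pdf(out)
symbols(d$x, d$y, circles = radius, inches = 0.35,
        fg = "white", bg = "steelblue", xlab = "x", ylab = "y")
dev.off()
```

Here `inches = 0.35` rescales the circles so the largest has a radius of 0.35 inches; the relative areas are preserved.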

Read more »

R and AOL in NYC

November 23, 2010

R and the NYC R User Group get brief mentions in this article about AOL's offices in New York City. The NYC UseRs meet at AOL and (ironically) the next meeting on Dec 9 is on the topic of R at Google. New York Observer: Bringing Some Sizzle to the Dial-Up King

Read more »

R Style Guide

November 23, 2010

Each year I have the pleasure (actually it’s quite fun) of teaching R programming to first year mathematics and statistics students. The vast majority of these students have no experience of programming, yet think they are good with computers because they use Facebook! The class has around 100 students, and there are eight practicals. In

Read more »

Programming with R – Processing Football League Data Part I

November 23, 2010

In this post we will make use of football results data from the football-data.co.uk website to demonstrate creating functions in R to automate a series of standard operations that would be required for results data from various leagues and divisions. The first step is to consider what control options should be available as part of the

Read more »

Robust adaptive Metropolis algorithm [arXiv:1011.4381]

November 23, 2010

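Matti Vihola has posted a new paper on arXiv about adaptive (random walk) Metropolis-Hastings algorithms. The update in the (lower diagonal) scale matrix $S_n$ is $S_n S_n^T = S_{n-1} \left( I + \eta_n (\alpha_n - \alpha_*) \frac{U_n U_n^T}{\|U_n\|^2} \right) S_{n-1}^T$, where $\alpha_n$ is the current acceptance probability and $\alpha_*$ the target acceptance rate; $U_n$ is the current random noise for the proposal, $Y_n = X_{n-1} + S_{n-1} U_n$; $\eta_n$ is a step size sequence decaying to zero. The spirit of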
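
To make the scale-matrix update concrete, here is a minimal base R sketch of a single step. It assumes the update takes the rank-one form written in the code comment, obtains the new lower-triangular factor with `chol`, and uses a target rate of 0.234 and a step size of n^(-2/3); the function name `ram_update` and these particular values are illustrative assumptions, not details confirmed by this excerpt.

```r
# One scale-matrix update (assumed form, see the note above):
#   S_new S_new' = S_prev (I + eta * (alpha - alpha_star) * u u' / |u|^2) S_prev'
ram_update <- function(S_prev, u, alpha, eta, alpha_star = 0.234) {
  d <- length(u)
  inner <- diag(d) + eta * (alpha - alpha_star) * tcrossprod(u) / sum(u^2)
  M <- S_prev %*% inner %*% t(S_prev)
  t(chol(M))  # chol() gives the upper factor; transpose for the lower one
}

set.seed(1)
S <- diag(2)          # initial scale matrix
u <- rnorm(2)         # noise used for the proposal y = x + S %*% u
eta <- 10^(-2 / 3)    # hypothetical step size at iteration n = 10
S_new <- ram_update(S, u, alpha = 0.5, eta = eta)
```

When the observed acceptance probability exceeds the target, the scale grows along the direction of `u`; when it falls short, the scale shrinks, and when the two are equal the matrix is left unchanged.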

Read more »

Learn Logistic Regression (and beyond)

November 23, 2010

One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that. A statistical analyst working on data tends to deliberately start simple and move cautiously to more complicated methods.

Read more »

makefiles for Sweave, R and LaTeX using Eclipse on Windows

November 22, 2010

This post provides a brief introduction to make and makefiles. In particular it describes how to set up make on Windows with an emphasis on using make in Eclipse on projects involving R, Sweave, and LaTeX. Overview: make is software that uses makefile...

Read more »

RClimate Tools for Do It Yourself Climate Trend Analysis – Nov, 2010 Update

November 22, 2010

I have made several updates to RClimate tools for do-it-yourself climate scientists. The downloadable monthly climate trends file (link to csv file) now includes the 5 major global land-ocean temperature anomaly time series (GISS, HAD, NOAA, RS...

Read more »

R.I.P. StatProb?

November 22, 2010

As posted in early August from JSM 2010 in Vancouver, StatProb was launched as a way to promote an on-line encyclopedia/wiki with the scientific backup of expert reviewers. This was completely novel and I was quite excited to take part in the venture as a representative of the Royal Statistical Society. Most unfortunately, the separation

Read more »

Access the InfoChimps API from R

November 22, 2010

InfoChimps.com is mainly known as a clearinghouse for finding large data sets, for free or for sale. But they have also released (in beta, at least) an API that lets you find some pretty useful information on-demand. Normally, you'd have to use RESTful calls to access the API, but now Drew Conway has created an R package (and released...

Read more »

Example 8.15: Firth logistic regression

November 22, 2010

In logistic regression, when the outcome has low (or high) prevalence, or when there are several interacted categorical predictors, it can happen that for some combination of the predictors, all the observations have the same event status. A similar e...

Read more »

Homage to floating points

November 22, 2010

I recently got very close to the floating point trap, again, so here is a little tribute with some small examples!
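
For readers who have not met the trap yet, a few small examples of the general phenomenon follow. These are standard illustrations chosen for this listing and are not necessarily the ones used in the linked post.

```r
# 0.1, 0.2 and 0.3 have no exact binary representation,
# so a test for exact equality fails.
exact <- (0.1 + 0.2 == 0.3)

# Comparing with a tolerance behaves as expected.
within_tol <- isTRUE(all.equal(0.1 + 0.2, 0.3))

# The same happens with irrational intermediate results.
root <- (sqrt(2)^2 == 2)

print(c(exact = exact, within_tol = within_tol, root = root))
```

The usual remedy is to compare doubles with `all.equal` (wrapped in `isTRUE`) or an explicit tolerance rather than with `==`.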

Read more »

Retrieving transcriptome sequences for RNASeq analysis

November 22, 2010

One approach for analyzing RNASeq data from an organism with a well-annotated genome is to align the reads to mRNA (cDNA) sequences instead of the genome. To do that you need to extract the transcript sequences from a database. This is how to extract Ensembl transcript sequences from UCSC from within R:
_________________________________________________
library(GenomicFeatures)
library(BSgenome.Hsapiens.UCSC.hg18)
tr
tr_seq
write.XStringSet(tr_seq, file="hg18.ensgene.transcripts.fasta", 'fasta', width=80, append=F)
_________________________________________________
Next steps...

Read more »

Were stock returns really better in 2007 than 2008?

November 22, 2010

We know that the S&P 500 was up a little in 2007 and down a lot in 2008. So on the surface the question seems really stupid. But randomness played a part. Let’s have a go at deciding how much of a part. Figure 1: Comparison of 2007 and 2008 for the S&P 500. Statistical …

Read more »

Graphical comparison of MCMC performance [arXiv:1011.445]

November 22, 2010

A new posting on arXiv by Madeleine Thompson on a graphical tool for assessing MCMC performance. She has developed a software package called SamplerCompare, implemented in R and C. The graphical evaluation plots “log density evaluations per iteration times autocorrelation time against a tuning parameter in a grid of plots where rows represent distributions and columns represent

Read more »

Animate .gif images in R / ImageMagick

November 21, 2010

Yesterday I surfed the web looking for 3D wireframe examples to explain linear models in class. I stumbled across this site where animated 3D wireframe plots are output by SAS. Below I did something similar in R. This post shows the few steps needed to create an animated .gif file using R and ImageMagick.

Read more »

My First R Package: infochimps

November 20, 2010

I have finally taken the plunge and created my first R package! As frequent readers will know, I often sing the praises of infochimps, a startup out of Austin, TX attempting to be the world’s data clearinghouse. While infochimps is an excellent resource for data sets, they also provide their own set of excellent data

Read more »