First and easy steps with R and Sweave

November 28, 2008 · Posted in R bloggers · Comments Off 
What really sold me to the idea of using Sweave and therefore (re)learning LaTeX was the idea of Reproducible Research. Charlie Geyer has put together some examples how to mix and match R and LaTeX with Sweave. Today's goal therefore is to run his examples and to see, what problems I run into :)

Allright, if you all go to the example section of Charlie's Reproducible Research Page, you will find three examples. Let's start with the first one.

To start, I created a folder ReproducibleResearch, copied my Sweave.sty file into it and created a project within Textmate by dragging the folder onto the Textmate icon in my dock. Then I created a document foo.Rnw, copied the contents of the first example into it and saved it (if you have the R, SWeave and LaTeX bundles installed, TextMate should be recognizing this *.Rnw document as a Sweave document.

Feeling lucky, I just pressed cmd-R to run this code in R.

…drumroll…

It just worked. Wow! A TextMate "Sweave, Typeset & View" window just showed me "An (sic!) Sweave Demo" by Charles J. Geyer! Complete with LaTeX typesetting, R output and even graphics. That's what I want, so this is a great start! Many kudos to Charles.

Now, let's analyse the code to see what we can rip off this example.

\documentclass{article}

\usepackage{amsmath}
\usepackage{amscd}
\usepackage[tableposition=top]{caption}
\usepackage{ifthen}
\usepackage[utf8]{inputenc}

\begin{document}

\title{An Sweave Demo}
\author{Charles J. Geyer}
\maketitle


All right, so this is just plain LaTeX. It is a document of class article, it uses some packages, it begins the document and defines a title. Doesn't look pretty in code, but hey, this is LaTeX - you better get used to this ;)

This is a demo for using the \verb@Sweave@ command in R. To
get started make a regular \LaTeX\ file (like this one) but
give it the suffix \verb@.Rnw@ instead of \verb@.tex@ and then
turn it into a \LaTeX\ file (\verb@foo.tex@) with the (unix) command
\begin{verbatim}
R CMD Sweave foo.Rnw
\end{verbatim}
So you can do
\begin{verbatim}
latex foo
xdvi foo
\end{verbatim}
and so forth.


Now there is some text - all this \verb@sometext@ gives you a code-like text formating within normal text. \begin{verbatim} starts a code block, \end{verbatim} stops it. Pretty standard LaTeX stuff.

A few lines later it gets more interesting:

<>=
2 + 2
@


Alright, so this is a code chunk, which will be run and the output of R will be written. Very nice! Try it out and alter the 2 + 2 to something else and sweave the file again. I typed in 2 * 1024 - 35 and the result has been 2013. Easy enough.



In the next post I will digg deeper into creating graphs and doing more complex analysis reports with Sweave. For tonight, I am happy with the first results.

Sweave.sh plays with cacheSweave

November 26, 2008 · Posted in R bloggers · Comments Off 
I have added support for caching to Sweave.sh script as implemented in cacheSweave R package written by Roger D. Peng. Now, one can set caching on for chunks that are time consuming (data import, some calculations, ...) and the Sweaving process will reuse the cached objects each time they are needed. Read the details about the cacheSweave package in the package vignette. Option --cache for Sweave.sh script should also be easy to understand. However, here is a minimalist example:
\documentclass{article}
\usepackage{Sweave}
\begin{document}

<<setup>>=
n <- 10
s <- 15
@

Let us first simulate \Sexpr{n} values from a normal distribution and add a \Sexpr{s} sec pause to show the effect of caching.

<<simulate, cache=true>>=
x <- rnorm(n)
Sys.sleep(s)
@

Now print the values:

<<print, results=verbatim>>=
print(x)
@

\end{document}
Now, one can run the following command:
Sweave.sh --cache test.Rnw
and the output on the command line is:
Run Sweave and postprocess with LaTeX directly from the command line
- cache mode via cacheSweave R package

R version 2.8.0 (2008-10-20)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(package='cacheSweave'); Sweave(file='test.Rnw', driver=cacheSweaveDriver);
Loading required package: filehash
filehash: Simple key-value database (2.0 2008-08-03)
Loading required package: stashR
A Set of Tools for Administering SHared Repositories (0.3-2 2008-04-30)
Writing to file test.tex
Processing code chunks ...
1 : echo term verbatim (label=setup)
2 : echo term verbatim (label=simulate)
3 : echo term verbatim (label=print)

You can now run LaTeX on 'test.tex'
When you repeat the Sweaving process, which you more or less always do, there is no need to wait for 15 second since cacheSweave package takes the x object from the cache! Excellent job Roger!

Calculating an N50 from Velvet output

November 25, 2008 · Posted in R bloggers · Comments Off 
In sequencing circles the N50 length is a useful heuristic for judging the quality of an assembly. Here is my definition of N50 length, which you may or may not find intuitive:
N50 length is the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs
For example's sake imagine an assembler has created contigs of the following length (in descending order):
91 77 70 69 62 56 45 29 16 4
The sum of these is 519bp, so the sum of all contigs equal to or greater than N50 must be equal to or greater than 519/2 or 259.5
We can see by brute force that
91+77=168
91+77+70=238
91+77+70+69=307 (that'll do)
so the N50 for this assembly is 69bp

Another way to look at this:
at least half the nucleotides in this assembly belong to contigs of size 69bp or longer.

N50 vs N50 length

Technically N50, as opposed to N50 length, refers to the ordinal of that last contig that pushes it over the brink - in this example 4 (since 69bp is the 4th largest contig). Unfortunately, a higher N50 implies the opposite of a longer N50 length. Some papers refer to N50 length as L50, while most have simply followed the lazy convention of dropping "length" off of "N50 length". I think it is important to include units with your N50 to minimize confusion.

Contig N50 vs Scaffold N50

Another distinction is often made between contig N50 and scaffold N50. Contigs are "contiguous segments", while scaffolds (aka supercontigs) consist of contigs separated by gaps. Scaffolds are constructed using paired-end information at the read level and, in major sequencing projects, paired BAC ends. Because the scaffolds sequences are filled with varying quantities of empty N's, the scaffold N50 should not solely be used as a comparative score of assembly quality.

Velvet, when used with sane expCov settings, is very conservative with regard to scaffolding - so much that the contigs.fa N50 can be virtually considered a contig N50, as opposed to a scaffold N50. More aggressive programs, such as SOAPdenovo, produce separate contig and scaffold files.

Velvet N50 from stats.txt

Velvet is a popular assembler for short sequences that uses DeBruijn graphs and Eulerian graph theory instead of a repetitive align-consensus-align approach. Although it returns an N50 in the course of assembling, I wanted to derive it from the contigs themselves. These contigs are summarized in a table called stats.txt

Using R and its cumulative summation function we can easily compute N50.

Strategy: Order the contigs by decreasing size and find the first value for which the cumulative summation is at least half the total sum.

Thanks to Barry Rowlingson for providing this solution


myStatsTable<-read.table("stats.txt",header=TRUE)
contigs<-rev(sort(myStatsTable$lgth+myKmerValue-1))
n50<-contigs[cumsum(contigs) >= sum(contigs)/2][1]


Beware the kmer

Note: The length in the stats.txt file is given as length=lgth+kmer-1, where kmer is the kmer value chosen for that assembly. The N50 length given in the Log file also appears to be in kmers. You cannot convert an N50 in kmers to bp by adding kmer-1. The math doesn't work like that - you need to convert each contig to bp before recalculating N50.

Finally, you can calculate N50 from sequences in the contigs.fa, but this file only contains contigs longer than 2-kmers by default. The contigs.fa bp-N50 will sometimes approximate the kmer-N50 in the Log file, but that is not a rule to depend on.

Use anchoring to lose years off your age

November 21, 2008 · Posted in R bloggers · Comments Off 
It is well known in psychology that the way a question is asked can strongly influence the answer. For example, if you spin a random number wheel and later have people guess the proportion of United Nations countries that are African, their answers closely follow the number the wheel landed on. The initial random number serves as a reference or anchor for their guesses.

Suppose you wanted people to guess you had a lower age than you really had. One way to do it would be to anchor the question with the numbers 12 and 42 and compare the results. Specifically, when someone asked you what your age was you could tell them,
  • I'm really 12 years old, don't tell my mom and dad I'm talking to strangers
  • I'm 42 years old but you can't tell because of my macrobiotic diet based on microfibers
before having them guess your age.

Just for fun I asked 16 people in each anchoring condition to guess my age. The mean age of the people making the guess in the "42" condition was 20.4 years and for the "12" condition it was 20.8. Not much of difference, so any age effect would be solely due to the anchoring effect.

Results
The mean age for the guess in the "42" condition was 32.4, for the "12" condition it was 26.4, here is a boxplot of the respective guesses
To make sure the results were statistically significant I performed a bootstrap calculation with R
Intervals : Level      Normal     95%   ( 3.375,  8.622 )
The bootstrap confidence interval doesn't contain 0, which means there was indeed a statistically significant anchoring effect and if you want people to think you look younger than you really are, you might want to anchor your age to a ridiculously low number.

Here's the R code to perform the bootstrap
library(boot)
#The guesses of the girls
guess.42<-c(28,33,40,34,42,25,33,30,31,28,30,30,35,32,35)
guess.12<-c(23,28,26,25,25,30,27,28,28,23,27,28,30,24,24)
D = data.frame(guess.42,guess.12)
samplemean <- function(D, d) {
E=D[d,]
return(mean(E$guess.42)-mean(E$guess.12))
}
b = boot(D, samplemean, R=1000)
print(b)
plot(b)
boot.ci(b)
boxplot(guess.42,guess.12,ylab="age guessed",xlab="age anchored",names=c("42","12"))
#Now we do a t test
t.test(guess.42,guess.12,type="greater")


Note: It was obvious that the people who guessed 23, and above 35, as my age were joking, nonetheless it was interesting that the exaggerated ages they chose in jest were close to the anchor.

Setting up Textmate to use R

November 21, 2008 · Posted in R bloggers · Comments Off 


After becoming frustrated using the StatET plugin for Eclipse on my Mac (sometimes the R console would start and sometimes not), I decided to use Textmate instead. Textmate allows you to install extra bundles which are plug-ins to add some new functionality to TextMate. There are two bundles relevant for R developers:
- R bundle (here)
- Latex bundle (here)
- Sweave bundle (here)

If you are running Leopard (Mac OS 10.5), the subversion client should be installed (otherwise, do so - you can find more info about this on the TextMate help pages).

To install the two bundles, invoke the following terminal spells:

mkdir -p /Library/Application\ Support/TextMate/Bundles
cd /Library/Application\ Support/TextMate/Bundles
svn co http://svn.textmate.org/trunk/Bundles/R.tmbundle
svn co http://svn.textmate.org/trunk/Bundles/Latex.tmbundle
svn co http://svn.textmate.org/trunk/Bundles/SWeave.tmbundle


Start up TextMate and now you have support for R and Sweave in TextMate (this assumes that you already have installed a copy of Latex - get the MacTeX-2008 distribution here).

Chances are that if you want to "Sweave, Typeset and View" R will choke because it does not find the necessary Sweave LaTeX style file (sweave.sty) file. If this is the case, just put a copy of sweave.sty into your working directory and everything will work wonderfully (yes, I know that this should be fixable by setting up the right search path somewhere, so if you have run into this problem and fixed it better than me, please send some enlightening commentaries!).

Plot symbols in R

November 18, 2008 · Posted in R bloggers · Comments Off 
In the plot environment, the "pch" parameter decides the symbol of your output. There are about 130 symbols hard coded and passed into "pch."However, there are still some symbols that are not in these 130 pch's. For instance, there is no checkmark. To plot checkmark or other symbols, we can use symbol() in the expression() or quote() functions. For example:plot(0, 0, main=bquote(symbol("\326"

Call C from R and R from C

November 17, 2008 · Posted in R bloggers · Comments Off 
Several years ago, while a research associate at the University of Chicago, I had the privilege of sitting in on a course taught by Peter Rossi: Bayesian Applications in Marketing and MicroEconometrics. This course -- one I recommend to anyone at U Chicago who is interested in statistics -- was an incredibly clear treatment of Bayesian statistics, but the aspect I appreciated most was Peter's careful demonstration of Bayesian theory and methods using R.

One feature of R that I had not made use of up until that point was the ability to call compiled C and Fortran functions from within R (this makes loop-heavy Metropolis-Hastings samplers much, much faster). It turns out that you can also include the R libraries in C source code so that R functions (e.g., random number generators) can be easily accessed. The R-Cran website has an excellent tutorial on how to develop R extensions (here), but I wanted to share an example Peter used in class because it is extremely brief, and for 95% of what I do, this is all I need.

As Peter writes, this is an incredibly inefficient way of simulating from the chisquare distribution, but it demonstrates the point. His more extensive writeup is located here.

Save the following as testfun.c:
#include <R.h>
#include <Rmath.h>
#include <math.h>
/* Function written by Peter Rossi from the University of Chicago GSB */
/*http://faculty.chicagogsb.edu/peter.rossi/teaching/37904/Adding%20Functions%20Written%20in%20C%20to%20R.pdf */

/* include standard C math library and R internal function declarations */
void mychisq(double *vec, double *chisq, int *nu)
/* void means return nothing */
{
int i,iter; /* declare local vars */
/* all statements end in; */
GetRNGstate(); /* set random number seed */
for (i=0 ; i < *nu; ++i)
/* loop over elements of vec */
/*nu "dereferences" the pointer */
{ /* vectors start at 0 location!*/
vec[i] = rnorm(0.0,1.0); /*use R function to draw normals */
Rprintf("%ith normal draw= %lf \n",(i+1),vec[i]);
/* print out results for "debugging" */
}
*chisq=0.0;
iter=0;
while(iter < *nu) /* "while" version of a loop */
{
if( iter == iter)
{*chisq=*chisq + vec[iter]*vec[iter];}
/* redundant if stmnt */
iter=iter+1; /* note: can't use ** */
/* if you want to be "cool" use iter += 1 */
}
PutRNGstate(); /* write back ran number seed */
}



To call this function in R, you first need to compile it. To do this you need all the standard compilers and libraries for your operating system. For Debian or Ubuntu, this should do it (if I missed a package, let me know in the comments):

$ sudo aptitude update
$ sudo aptitude install build-essential r-base-dev

Now, you should be able to compile the function:

$ R CMD SHLIB testfun.c

If all goes well, you should see the files testfun.o and testfun.so in the directory. To test the function we will source the following R script into R:

call_mychisq<-function(nu){
##This function is just a wrapper for .C
vector=double(nu); chisq=1
.C("mychisq",as.double(vector),res=as.double(chisq),as.integer(nu))$res
}

##Load the compiled code (you may need to include
## the explicit file path if it is not local
## NOTE: for Windows machines, you will want to load testfun.dll"

dyn.load("testfun.so")
result<-call_mychisq(10)

##If you re-compile testfun.c, you must unload it
## and then re-load it:
## dyn.unload("testfun.so")
## dyn.load("testfun.so")

And get the following output:

> dyn.load("testfun.so")
> result<-call_mychisq(10) 1th normal draw= -1.031170 2th normal draw= -1.214103 3th normal draw= 0.002335 4th normal draw= 0.296146 5th normal draw= -0.908862 6th normal draw= -1.567820 7th normal draw= -0.079227 8th normal draw= -1.404414 9th normal draw= 0.616567 10th normal draw= -0.007855 > result
[1] 8.268028

Multivariate dependence with copulas

November 17, 2008 · Posted in R bloggers · Comments Off 
Classes (S4) of commonly used copulas including elliptical (normal and t), Archimedean (Clayton, Gumbel, Frank, and Ali-Mikhail-Haq), extreme value (Husler-Reiss and Galambos), and other families (Plackett and Farlie-Gumbel-Morgenstern). Methods for density, distribution, random number generation, bivariate dependence measures, perspective and contour plots. Functions for fitting copula models. Independence tests among random variables and random vectors. Serial independence tests for univariate and multivariate continuous time series. Goodness-of-fit tests for copulas based on multipliers and on the parametric bootstrap.

R package can be downloaded at http://cran.r-project.org/web/packages/copula/index.html
Tags -
...

reading the full post...

You may also interested into other posts:

Saving a Workspace

November 11, 2008 · Posted in R bloggers · Comments Off 

R will can save the users workspace at the end of a session so that he can take it up again where he left off. I personally don't like doing this but there are times when one would want to save their work, especially after complex and time consuming computations.

read more