Example 7.11: Plot an empirical cumulative distribution function from scratch
SAS
In SAS, we first read the data from a locally stored Stata data set. Then, we calculate the empirical CDF for pcs. We do this by first sorting (1.5.6) on pcs. Then the order statistic of the observation divided by the total number of obervations is the empirical probability that X ≤ x. We find the number of observations using the nobs data set option (A.6.1) and the observation number using the _n_ implied variable (1.4.15). Then we can plot the empirical CDF using proc gplot (section 5.1.1) to make a scatterplot and a symbol statement with the j interpolation (as in section 1.13.5) to connect the points.
proc import datafile="C:\book\help.dta"
out=help dbms=dta;
run;
proc sort data=help;
by pcs;
run;
data help_a;
set help nobs=totalobs;
ecdf_pcs = _n_ / totalobs;
run;
symbol1 i=j v=none c=blue;
proc gplot data=help_a;
plot ecdf_pcs * pcs;
run;
quit;
R
Now, R. First, make the ECDF values in a similar fashion to SAS, then make an empty frame (5.1.1), then fill the frame with lines connecting the calculated values (5.2.1).
library(foreign)
ds <- read.dta
("http://www.math.smith.edu/sasr/datasets/help.dta")
attach(ds)
sortpcs <- sort(pcs)
ecdfpcs <- (1:length(sortpcs))/length(sortpcs)
plot(sortpcs, ecdfpcs, type="n")
lines(sortpcs, ecdfpcs)

CO2 and Temperature Trends

Formatting Decimals in Texts with R
Yanping Chen raised a question in the Chinese COS forum on the output of Eviews: how to (re)format the decimal coefficients in equations as text output? For example, we want to round the numbers in CC = 16.5547557654 + 0.0173022117998*PP + 0.216234040485 * PP(-1) + 0.810182697599 * (WP + WG) to the 3rd decimal places. This can be simply done by regular expressions, as decimals always begin with a “.”. The basic steps are:
- find out where are the decimals in the character string;
- format them;
- replace the original decimals with formatted values;
Given a character vector, we can format the decimals with the code below:
# x: equations; FUN: formatting function; ...: passed to FUN
coefFormat = function(x, FUN, ...) {
sapply(x, function(s) {
dig = sapply(gregexpr("\\.[0-9]+", s), function(m) {
sapply(seq(along = m), function(i) {
substr(s, m[i], m[i] + attr(m, "match.length")[i] - 1)
})
})
for (j in {
if (is.null(dim(dig)))
NULL
else 1:dim(dig)[1]
}) {
s = sub(dig[j, 1], substring(FUN(as.numeric(dig[j, 1]), ...),
2), s, fixed = TRUE)
}
s
})
}I used sapply() for 3 times to avoid explicit loops but consequently the code might be difficult to read. The critical part is the regular expression “\\.[0-9]+” which means one of more (controlled by “+” after “[0-9]”) digits (“[0-9]” or “[:digit:]”) after a decimal point “.”. As “.” is a metacharacter in regular expressions, we need to use a backslash before it, and again, “\” is a special character in R, so we need another backslash to denote a backslash. 
x = readLines(zz <- textConnection( "CC = 16.5547557654 + 0.0173022117998 * PP + 0.216234040485 * PP(-1) + 0.810182697599 * (WP + WG) II = 20.2782089394 + 0.150221823899 * PP + 0.61594357734 * PP(-1) - 0.157787636546 * KK WP = 1.50029688603 + C(10) * XX + 0.146673821502 * XX(-1) + 0.130395687204 * AA ")) close(zz) writeLines(coefFormat(x, round, digits = 3)) # CC = 16.555 + 0.017 * PP + 0.216 * PP(-1) + 0.81 * (WP + WG) # # II = 20.278 + 0.15 * PP + 0.616 * PP(-1) - 0.158 * KK # # WP = 1.5 + C(10) * XX + 0.147 * XX(-1) + 0.13 * AA # writeLines(coefFormat(x, formatC, digits = 3, format = "f")) # CC = 16.555 + 0.017 * PP + 0.216 * PP(-1) + 0.810 * (WP + WG) # # II = 20.278 + 0.150 * PP + 0.616 * PP(-1) - 0.158 * KK # # WP = 1.500 + C(10) * XX + 0.147 * XX(-1) + 0.130 * AA #
Related Posts
Counterintuitive Results in Flipping Coins
HTH and HTT respectively? (For example, for the sequence HHTH, the number for HTH to appear is 4, and in THTHTT, the number for HTT is 6.)It seems that the two results are equivalent, as H and T occurs with equal probability 0.5, so we naturally believe the average numbers of steps to HTH and HTT are the same, but the fact is not as we imagined.
## smart guys use math formulae to solve the problem,
## but *lazy* guys like me use simulations with R
coin.seq = function(v) {
x = NULL
n = 0
while (!identical(x, v)) {
x = append(x[length(x) - 1:0], rbinom(1, 1, 0.5))
n = n + 1
}
n
}
set.seed(919)
mean(htt <- replicate(1e+05, coin.seq(c(1, 0, 0))))
# [1] 8.00304
mean(hth <- replicate(1e+05, coin.seq(c(1, 0, 1))))
# [1] 10.0062
png("coin-htt-hth.png", height = 150, width = 500)
par(mar = c(3, 2.5, 0.1, 0), mgp = c(2, 0.8, 0))
boxplot(list(HTT = htt, HTH = hth), horizontal = T,
xlab = "n", ylim = range(boxplot(list(HTT = htt, HTH = hth),
plot = FALSE)$stats))
points(c(mean(htt), mean(hth)), 1:2, pch = 19)
dev.off()The answer is counterintuitive, isn’t it?
Well, mathematicians certainly do not like my solution (I guess they even hate such an imprecise approach). I hope some smart guys can give me some hints on working out the probability distribution and hence the expectation.
Related Posts
‘R’ programming
Learning R 2009-08-28 13:11:00

fy<-c(1990:2019)
deficit.1<-c(-3.9, -4.5, -4.7, -3.9, -2.9, -2.2, -1.4, -0.3, 0.8, 1.4, 2.4, 1.3, -1.5, -3.5, -3.6, -2.6, -1.9, -1.2, -3.2, -11.2, -9.6, NA,NA, NA, NA, NA, NA, NA, NA, NA)
projected<-c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -9.6, -6.1, -3.7, -3.2, -3.2, -3.1, -3.3, -3.2, -3.1, -3.4)
png("c:/data/deficit_color.png",height=480,width=480)
plot(deficit.1~fy,ylim=c(-12,5),type="n",lwd=2,col="red",
main="Federal budget deficit, 1990-2019",
cex.lab=1.1,cex.axis=.75,
xlab="Fiscal year",ylab="Deficit (% of GDP)")
rect(1988,-15,1994,6,col="#FF9999",border=NA)
rect(1994,-15,2002,6,col="#6699FF",border=NA)
rect(2002,-15,2010,6,col="#FF9999",border=NA)
rect(2010,-15,2014,6,col="#6699FF",border=NA)
rect(2014,-15,2021,6,col="#CCCCCC",border=NA)
abline(h=0,lwd=.5,lty=3,col="#555555")
lines(fy,deficit.1,lwd=2.5)
lines(fy,projected,lwd=2.5,lty=2)
rect(1990,-11.5,2000,-9.5,col="#dddddd",border="#555555")
text(1995,-10.5,"Source: CBO")
graphics.off()
Combine R CMD build and junit
This is a post in the series Mixing R CMD build and ant. Previous posts have shown how to compile the java code that lives in the src directory of a package and how to document this code using javadoc.
This post tackles the problem of unit testing of the java functionality that is shipped as part of an R package. Java has several unit test platforms, we will use junit here, but similar things could be done with other systems such as testng, ...
The helloJavaWorld package now looks like this :
.
|-- DESCRIPTION
|-- NAMESPACE
|-- R
| |-- helloJavaWorld.R
| `-- onLoad.R
|-- inst
| |-- doc
| | |-- helloJavaWorld.Rnw
| | |-- helloJavaWorld.pdf
| | `-- helloJavaWorld.tex
| `-- java
| |-- hellojavaworld-tests.jar
| `-- hellojavaworld.jar
|-- man
| `-- helloJavaWorld.Rd
`-- src
|-- Makevars
|-- build.xml
|-- junit
| `-- HelloJavaWorld_Test.java
|-- lib
| `-- junit-4.7.jar
`-- src
`-- HelloJavaWorld.java
9 directories, 15 files
We have added the src/lib directory that contains the junit library and the HelloJavaWorld_Test.java that contain a simple class with a unit test
And the ant build file has been changed in order to
- build the junit test cases, see the build-testcases target
- run the unit tests, see the test target
- create nice html reports, see the report target
The package can be downloaded here
Coming next, handling of dependencies between java code that lives in different R packages
A Fast Intro to PLYR for R
I’m not dead yet! Although it has been rumored that I am. The new job is going great and I’m thrilled to be with a new firm doing interesting work alongside smart people. It makes me seem smarter by simple association.
There’s been a lot going on recently in the R user community. There was an R flash mob of Stack Overflow which resulted in a noticeable increase in the number of R questions and answers in SO. I’ve been blown away by the quality of the participants. There has also been increased quality discussions on Twitter which are being tagged with #rstats. These changes in the community have not gone unnoticed.
Recently I posted a question about how to do a ‘group by’ in a regression with R. I had a way I had been doing this but I was suspicious there was a better way. One of the answers proposed using the PLYR package. I think I had seen the plyr package a few times but never really understood it. Although I didn’t select this as my top answer, it prompted me to look into PLYR more. What I discovered was really interesting.
The PLYR package is a tool for doing split-apply-combine (SAC) procedures. I’m very fluent in SQL so the best analogy for me was the GROUP BY statement in SQL. PLYR adds very little new functionality to R. What it does do is take the process of SAC and make it cleaner, more tidy and easier. I think I’m not the only one who wants a clean and tidy SAC. Here’s a quick example of making some summary stats using PLYR:
# install.packages("plyr") #run this if you don't have the package already
library(plyr)
#make some example data
dd<-data.frame(matrix(rnorm(216),72,3),c(rep("A",24),rep("B",24),rep("C",24)),c(rep("J",36),rep("K",36)))
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2")
#ddply is the plyr function
ddply(dd, c("dim1","dim2"), function(df)mean(df$v1))result:
dim1 dim2 V1
1 A J 0.02554362
2 B J -0.15839675
3 B K -0.06077399
4 C K -0.02326776PLYR functions have a neat naming convention. The first two letters of the function tells the input and output data types, respectively. The one I use the most is ddply which takes a data frame in and spits out a data frame. Let me see if I can explain what ddply is doing. The first argument, dd, is the input data frame. The next argument is the “group by” variables. Since I want to group by two variables I send them as a vector (that’s what the c() bit does). What threw me for a loop initially was the third argument, the function. What I found myself trying (unsuccessfully) was just using mean(v1) as the third argument. If I did that, R would spit at me and bring the marital status of my parents into question. I discovered that the problem was the ddply function was splitting the data by my ‘group by’ variables and then it wanted to pass each of the resulting data frames to a function. So what does it mean to pass a data frame to mean(v1)? Yeah, it means Jack Crap, that’s what it means. So in one of the PLYR examples I saw they were using these inline functions. The idea behind function(df)mean(df$v1) is to create a function to which we can pass a data frame and get out a meaningful result. The subset (or split) of the data gets passed to the function and that subset is then known as df. mean(df$v1) calculates the mean of v1 and returns an answer. ddply holds on to the answers of each split and then reassembles them all in the end. Slick, ey?
As with most things in R the idea can be extended to a vector of functions in order to perform many operations on each split:
ddply(dd, c("dim1","dim2"), function(df)c(mean(df$v1),mean(df$v2),mean(df$v3),sd(df$v1),sd(df$v2),sd(df$v3)))The result looks like this:
dim1 dim2 V1 V2 V3 V4 V5 V6 1 A J 0.02554362 0.3400250 0.1206980 0.9326424 1.0044120 1.100762 2 B J -0.15839675 0.3662559 -0.1784193 0.7447807 0.8752162 1.105258 3 B K -0.06077399 0.5184403 -0.2076024 1.0385107 1.0609706 1.153153 4 C K -0.02326776 0.2639328 0.1352895 0.7940938 0.9025207 1.072460
Pretty nifty.
The author of PLYR is Hadley Wickham who is also the man behind GGPLOT2. If you like PLYR or GGPLOT2 then you should immediately buy Hadley’s GGPLOT2 book on Amazon. But be sure and use the link on this site or the link on Hadley’s site so he can get Amazon associate payment. The authors I have talked to told me they get more from the Associate program than they get from publishing royalties.
My father is a retired pilot turned crop farmer. He ALWAYS carries a pair of pliers in a nylon pouch on his belt. I can see that Hadley’s PLRY package is going to become my proverbial ‘belt pliers.’
Of course if I wrote an R package I’d have to name it Super RamBar, cause that’s just how I roll.
Using R and Bioconductor for sequence analysis
Here's another quick R vignette, in case I pick this up later and need to remind myself where I got stuck. I was trying to use R for a bit of basic sequence analysis, with mixed results.
First, install the BSgenome package, which is part of Bioconductor. Get GeneR while you're at it.
> source("http://bioconductor.org/biocLite.R")
> biocLite("BSgenome")
> biocLite("GeneR")Follow the instructions in the document How to forge a BSgenome data package. You'll need to get fasta files from somewhere such as NCBI's Entrez Genome. Another nice data source is Regulatory Sequence Analysis Tools.
I created a BSgenome package for our favorite model organism Halobacterium salinarum NRC-1, which I named halo for short. Now, I can ask what sequences make up the halo genome and find out how long they are.
> library(BSgenome.halo.NCBI.1)
> seqnames(halo)
[1] "chr" "pNRC200" "pNRC100"
> seqlengths(halo)
chr pNRC200 pNRC100
2014239 365425 191346
> length(halo$chr)
[1] 2014239
There are a few things I wanted to do next. First, I wanted to load a list of genes with their coordinates. That should allow me to quickly get the sequence for each gene, or get sequence of upstream regions for regulatory motif finding. Second, if I'm going to find any new protein coding regions, I'd like to have a function that could take a stretch of DNA and find ORFs (open reading frames). As far as I can tell, all there is to ORF finding is searching each reading frame for long stretches that start with a methionine (AUG) and end with a stop codon (UAG, UGA, and UAA ). Maybe there's more to it than that.
This is where I left off. GeneR seems to use an entirely different way of encoding sequence based on buffers. I have to admit to being a little disappointed. I hope it's just my cluelessness and there's really a reasonable way to do this kind of thing in R and Bioconductor.
Related stuff from Blue Collar Bioinformatics
- Gene Ontology analysis with Python and Bioconductor
- Differential expression analysis with Bioconductor and Python
ggplot2 Version of Figures in “Lattice: Multivariate Data Visualization with R” (Final Part)
Over the past weeks I have tried to replicate the figures in Lattice: Multivariate Data Visualization with R using Hadley Wickham’s ggplot2.
With the exception of a few graph types (e.g. ggplot2 doesn’t support 3d-graphs, and there were a few other cases), it was possible to create ggplot2 versions of almost all the figures. Sometimes this required data manipulation before plotting in order to get data into a suitable form to feed into ggplot2, but more often than not ggplot2 provided satisfactory out-of-the-box visualisation very closely comparable to that of lattice.
I would like to conclude this series with comments on a few keywords that stuck to my mind while preparing all these graphs.
Speed
Both lattice and ggplot2 are running on top of the grid graphics, however lattice is a lot faster. A lot. Whilst drawing one or two graphs, one might not even notice the difference in speed, but once the number of graphs increases or the datasets get bigger the relative slowness of ggplot2 becomes more clearly recognisable (have a look at the pdf-s linked to at the end of this post for comparative timings).
Reader Ben Bolker emailed Hadley Wickham about the issue, and the response he got was “So far I have been completely focused on functionality, and not at all on speed. I would really like to spend some time profiling and optimised ggplot2 (I suspect an order of magnitude speed increase would be possible), but unfortunately my summer is filling up rapidly and I am feeling some pressure to write papers rather than (more) code.”
It is good news that speed can be improved, now let’s hope Hadley finds some time to look into this.
Output Customisation
Almost every element of the output of both packages is highly customisable. lattice has more options to tinker with the finest details of the plot, allowing to make sure that the final graph looks exactly the way one wants. Such fine-tuning requires, though, a very good knowledge of the inner workings of the program, as the available options are not always so obvious. I find fine-tuning a graph using the ggplot2’s approach a lot easier, as it is clear which element of the plot is being adjusted.
Still, as always, there is room for improvement – the ability to better manipulate the heights/widths/aspect ratios of facets (facet_grid has the space="fixed" argument, but not facet_wrap); and better control over size and positioning of legends are the two main items that surfaced during this exercise.
Syntax
As already mentioned lattice has a jungle of parameters one can manipulate to achieve the best output possible. Lattice to me is more cluttered with all of its rich options (panel/prepanel functions), and I personally prefer the ggplot2 approach of building up a graph layer by layer using “human-readable” expressions. Compared to the use of various specialised functions in lattice I find this more intuitive and easier to follow.
The lattice panel functions in capable hands make it an extremely powerful tool. However, having seen the lattice examples, only now did I come to fully appreciate the power of the ggplot2 equivalents: stat_summary and stat_function.
Again, if there was one thing to add to my wish-list, it would be the ability to use formulas/functions (e.g. reorder) as facetting variables – allowing to skip one data pre-processing step.
Documentation
ggplot2 has a very good website with many useful examples (the same information without the rendered graphs is included in the help file), as well as a book with good explanations. Using a combination of all these, one gets a good overview of the available options, and answers to the questions that may arise. I especially like the examples on the website, that often highlight the more intricate features of the program.
lattice manual explains all the available options in great detail, sometimes requiring a good amount of concentration and will to go through the instructions. Apart from the book website, one can also make use of R Graphical Manual that includes “a collection of graphics from all R packages”.
Pdf-version of the posts
Some readers requested a pdf-version of the posts – all the chapters have been compiled into one pdf-file that can be downloaded here (6mb).
Another version of the same file which also includes the system.time results for most of the print statements used to generate the images is available here (6mb).
And yet another version with no images can be downloaded here (800 kb).
Tools
I will also list the tools I used to create the blog posts as well as the pdf-files:
- asciidoc – a text document format for writing short documents, articles, books and UNIX man pages. AsciiDoc files can be translated to HTML and DocBook markups.
- ascii – ascii is an R package that replaces R results in AsciiDoc document with AsciiDoc markup.
- blogpost – a command-line weblog client for publishing AsciiDoc documents to WordPress blog hosts. It creates and updates weblog posts and pages directly from AsciiDoc source documents.
- dblatex – PDF output was generated by passing AsciiDoc generated DocBook through dblatex/pdftex.


