How to use mcsm
Within the past two days, I received this email
Dear Prof.Robert
I have just bought your recent book on Introducing Monte Carlo Methods with R. Although I have checked your web page for the R programs (bits of the code in the book, codes for generating the figures and tec – not the package available on cran) used in the book, I have not found them.
I wonder whether you could make them available.
Thank you very much for your time and patience.
Yours Sincerely
and that one
Dear Prof. Robert,
I bought “Introducing Monte Carlo Methods with R” from Amazon booksore. I am a teacher at [...] University, and I choose this book as a textbook in my class.
I can not find the R package “mcsm” according to your book (page 5). Where can I download the R package “mcsm”?
I highly appreciate your help.
Best regards,
so I fear that readers may miss the piece of information provided in the book. As indicated on pages 36-37 of Introducing Monte Carlo Methods with R, mcsm is a registred R package, readers can therefore download it manually from CRAN, but they should first try using install.packages in R as this is both easier and safer. (They should check on the main R project webpage for more help in installing packages.)
Another useful information for readers is that the code used on the examples of Introducing Monte Carlo Methods with R is available from mcsm through the demo command/code. Typing demo(Chapter.3) starts the production of the examples of Chapter 3:
> demo(Chapter.3)
demo(Chapter.3)
————————
Type <Return> to start :
> # Section 3.1, Introduction
>
> ch=function(la){ integrate(function(x){x^(la-1)*exp(-x)},0,Inf)$val}
> plot(lgamma(seq(.01,10,le=100)),log(apply(as.matrix(
+ seq(.01,10,le=100)),1,ch)),xlab=”log(integrate(f))”,
+ ylab=expression(log(Gamma(lambda))),pch=19,cex=.6)
> S=readline(prompt=”Type <Return> to continue : “)
Type <Return> to continue :
and obviously the same for all other chapters. This also means the code is available in the corresponding file, something like
/usr/lib/R/site-library/mcsm/demo/Chapter.3.R
depending on your system.
Filed under: Books, R, Statistics Tagged: CRAN, demo(), install.packages, Introducing Monte Carlo Methods with R, mcsm, R package

Be Careful Searching Python Dictionaries!
For my talk on High Performance Computing in R (which I had to reschedule due to a nasty stomach bug), I used Wikipedia linking data, an adjacency list of articles and the articles to which they link. This data was linked from DataWrangling and was originally created by Henry Haselgrove. The dataset is small on disk, but I needed a dataset that was huge, very huge. So, without a simple option off the top of my head, I took this data and expanded a subset of it into an incidence matrix, occupying 2GB in RAM. Perfect!
The subsetting was a bit of a problem because I had to maintain the links within the subgraph induced by the same. This required me to search dictionary objects for keys. This is where things went awry. Usually, I try to be efficient as possible. Since I was just producing a little example, and would never ever run this code otherwise, I wasn’t as careful.
The data were presented as a follows
1. First, I looked at from, and if from was in the chosen subset, keep it and proceed to 2, otherwise, throw it out.
2. Then, take the to nodes and only keep the nodes that are also in the subset.
How we search for an element in a dict makes a huge, huge difference in running time. It is pretty easy to tell when you have fallen into this pitfall because your program will go ridiculously slow. In my case I had 50,000 unique keys, all of type int (originally).
The Wrong Way
Anybody with a background in mathematics may easily see this as the problem: is k in the set of keys for the dictionary? I mean, it is the keys we are searching, not the values.
for node in to:
if k in d.keys():
#keep the node
else:
#toss out the node.
or similarly
to = filter(lambda x: x in index.keys(), to)
This is bad. Using this method is the difference between your coding taking a few seconds, and running for a few hours.
The Better Ways
This correct way doesn’t make as much sense logically unless we understand one thing: the in operator is overloaded for dictionaries, to search for a key.
If we do not know this, it seems that k in d is ambiguous. My first thought was that it would search the values. Actually, it searches the keys.
>>> a = {'a': 1, 'b': 2, 'c': 3}
>>> 'b' in a
True
>>> 2 in a
False
>>> 2 in a.values()
True
So we can use the following,
for node in to:
if k in d:
#keep the node.
else:
#drop the node.
or similarly,
result = filter(lambda x: x in index, to)
To just check for the existence of a key, such as in my case, we can use has_key().
for node in to:
if d.has_key(k):
#keep the node.
else:
#drop the node.
Another way is to use exceptions. This may be natural for the beginning programmer. If the user queries k and it is not a key in the dictionary, an exception is thrown. The user can catch this exception just to shut up the interpreter.
for k in to:
try:
dummy = index[k]
except:
#drop the node
else:
#keep the node
Timing the Methods
I stripped down my code to run one iteration of the processing using the various methods, all Python scripts. Since the subsets are chosen randomly each time, I ran each method 100 times and averaged the running time.
Clarification: I did not want to limit my code sample too much. The curious reader will notice that 5000ms is pretty high for just a search operation. There is some other stuff going on before the search takes place: generating a random list, opening a file, reading a line from it, and parsing the line. This overhead increases the running time. The table below is solely to show the
To time a function, we can use a function decorator as follows. Define the function print_timing, and then add a decorator to the function being time:
def print_timing(func):
def wrapper(*arg):
t1 = time.clock()
res = func(*arg)
t2 = time.clock()
print '%0.3fms' % ((t2-t1)*1000.0)
return res
return wrapper
@print_timing
def my_function():
#stuff
return
Adapted from a post on DaniWeb.
The table below shows the results. The point to take home is never, ever search dict.keys(). Using has_key() and exceptions gives a pretty good improvement. Simply iterating over the dictionary itself takes the least amount of time.

The difference in the average time between the slowest method and the fastest method is 74ms. The difference between these methods is 7 seconds after processing 100 nodes. The difference becomes unacceptable when processing even just 1,000 to nodes.
So why is the d.keys() so slow? When we call d.keys(), we get a list. Searching the list is an operation. The other methods that I studied rely on the dict and only the dict object. A Python dict is like a hash table, and checking for existence of a key is an
operation.
While writing this post, I found this web page discussing common Python idioms and hints to improve efficiency. I recommend it for anyone that programs in Python.
An interesting paper
The paper pointed out the following options to fit GLMM using R:
- glmer
- glmmML
- glmm (from the repeated package)
- glmmADMB
- MCMCglmm
- glmmBUGS
- glmmPQL
- BUGS (through R2WinBUGS)
- glmmAK
I am not aware of the glmmAK package before. From the first glance, it seems to be very promising in the sense that it seems to allow non-Gaussian random effect in a Bayesian framework, something similar to what npmlreg does with ML method.
============== edited on March 3 ====================
DPpackage is another package that can estimate GLMM in a Bayesian framework.
============== edited on March 5 ====================
ASReml/ASReml-R is another choice. It is not free software, but it does seem to have some unique strengths. Maybe I should download a demo copy and try it myself.
============== edited on March 29 ===================
hglm is another possibility.
oro.dicom 0.2.4
- Increased speed
- Uploading only header information (for restricted memory)
- Reading implicit value representations (VR's)
- Parsing SequenceItem tags (undefined lengths are allowed)
- Integration with oro.nifti to convert DICOM to NIfTI
The list object from reading all 166 DICOM files has two fields: header and image. Each hdr element contains the DICOM header fields associated with that file and the corresponding img element contains the PixelData tag. The first DICOM file contains 312 DICOM header fields. Using extractHeader() one can obtain the value of a DICOM header field in either text or numeric format.
The image above is the mid-sagittal view of this subject from his/her MPRAGE acquisition.
Steve Miller on R at Predictive Analytics World
At the Information Management blog, Steve Miller has provided two great reviews (here and here) of last week's Predictive Analytics World conference, including a recap of the Bay Area User's Group meeting featuring John Chambers. (My personal highlight from John's talk? A photograph of the very first sketch of what was to become the S system, which ultimately begat the R language.) But I also wanted to point you to Steve's review of Mike Driscoll's jaw-dropping talk The Social Effect: Predicting Telecom Customer Churn with Call Data:
Mike Driscoll's: The Social Effect: Predicting Telecom Customer Churn with Call Data, was a good illustration of predictive analytics in a larger data warehousing and BI context. Driscoll and his team analyzed billions of calls, millions of records and thousands of defectors from a Greenplum DW looking for predictors of churn. Driscoll's a big proponent of the open source R Project for Statistical Computing to support his work flow of data munge, data model, and data visualize. And with a Ph.D. in Bioinformatics, he often thinks like an epidemiologist, in this case looking for indications of contagious churn behavior. Using several social network analysis packages available in R, Driscoll's team appears to have found that churn in an individual's social network of calling accounts in a given month is likely to lead to more churn in subsequent months, a clear indication of a network effect. That contagion is overwhelmingly the strongest signal the team found from the data. A next step is to work with marketing to intervene on early network churn with email campaigns to minimize losses from the affected networks. I'd love to see the results from those experiments.
Mike's talk was jaw-dropping for me for two reasons. Firstly, in innovation: using a caller's social network (defined by the people they call the most, information drawn from the call-log data) was an elegent and powerful way to predict probability of switching to another network. It seems intuitive -- if your friends switch to another provider, you're likely to, as well -- and it's always good to see intuition borne out by data. And secondly, in scale: we're talking about 10Gb of data here, analyzed in just a few hours using R. Not too long ago that type of computation would be reserved for high-tech computing grids and expensive software; these days, it's amazing what you can do with a hefty 64-bit workstation and open-source tools. Unfortunately Mike's slides aren't available just now (some great social network charts in there), but if they do become available I'll let you know.
Information Management: Predictive Analytics World – Take 2
Because it’s Friday: Visualizing an email chain
We've all been there: someone sends an email to a mailing list with a Reply-To directing responses back to the mailing list. Before long, someone replies (unwittingly, to everyone) to ask to be taken of the list. And before long, the entire affair devolves into an endless cycle of requests to unsubscribe and pleas to stop mailing the entire list. This process is visualized beautifully in this timeline of emails to the Caltrans Performance Measurement System mailing list (click to see the enlarged original on FlowingData):
Some of the classic r-help threads could also benefit from such a treatment.
FlowingData: Data Underload #8 – Unsolicited
http://flowingdata.com/2010/02/08/data-underload-8-unsolicited/
R and Sudoku solvers: Plus ca change…
But what everybody seems to be forgetting is that R has had a Sudoku solver for years, thanks to the sudoku package by David Brahm and Greg Snow which was first posted four years ago. What comes around, goes around.
With that, and about one minute of Emacs editing to get the Le Monde puzzle into the required ascii-art form, all we need to do is this:
That took all of five seconds while my computer was also compiling a particularly resource-hungry C++ package....R> library(sudoku) R> s <- readSudoku("/tmp/sudoku.txt") R> s [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 8 0 0 0 0 1 2 0 0 [2,] 0 7 5 0 0 0 0 0 0 [3,] 0 0 0 0 5 0 0 6 4 [4,] 0 0 7 0 0 0 0 0 6 [5,] 9 0 0 7 0 0 0 0 0 [6,] 5 2 0 0 0 9 0 4 7 [7,] 2 3 1 0 0 0 0 0 0 [8,] 0 0 6 0 2 0 1 0 9 [9,] 0 0 0 0 0 0 0 0 0 R> system.time(solveSudoku(s)) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [1,] 8 4 9 6 7 1 2 5 3 [2,] 6 7 5 2 4 3 9 1 8 [3,] 3 1 2 9 5 8 7 6 4 [4,] 1 8 7 4 3 2 5 9 6 [5,] 9 6 4 7 8 5 3 2 1 [6,] 5 2 3 1 6 9 8 4 7 [7,] 2 3 1 8 9 4 6 7 5 [8,] 4 5 6 3 2 7 1 8 9 [9,] 7 9 8 5 1 6 4 3 2 user system elapsed 5.288 0.004 5.951 R>
Just in case we needed another illustration that it is hard to navigate the riches and wonders that is CRAN...
Welcome, Robin!
Robin Ryder started his new blog with his different solutions to Le Monde puzzle of last Saturday (about the algebraic sum of products…), solutions that are much more elegant than my pedestrian rendering. I particularly like the one based on the Jacobian of a matrix! (Robin is doing a postdoc in Dauphine and CREST—under my supervision—on ABC and other computational issues, after completing a PhD in Oxford on philogenic trees for language history with Geoff Nicholls. His talk at the Big’MC seminar last month is reproduced there.)
And, in a totally unrelated way, here is the Sudoku (in Le Monde) that started my post on simulated annealing, nicely represented on Revolutions. (Although I cannot see why the central columns are set in grey…) I must mention that I am quite surprised at the number of visits my post received, given that using simulated annealing for solving Sudokus has been around for a while. Even my R code, while original, does not compete with simulated annealing solutions that take a few seconds… I thus completely share Dirk Eddelbuettel‘s surprise in this respect (but point to him that Robin’s blog entry has nothing to do with Sudokus, but with another Le Monde puzzle!)
Filed under: R, Statistics, University life Tagged: Le Monde, linguistics, philogenic trees, sudoku

Responding to the Flowingdata GDP Graph Challenge
Nathan Yau of Flowingdata put up a challenge earlier today to improve upon a graph showing government spending as a percentage of GDP, published in the Economist.
The underlying data wasn’t available. So I put on my graph-to-numbers glasses on and pulled out some data. Here it is in case you want to have a go.
I took on the first part of the challenge i.e. Can you think of a way to make this graph easier to read?
The Original Graph from the Economist:

I hacked up the following version in R. It was a bit of a challenge to get it right given the constraints of fitting all the data and legend within a 290 x 300 image.

Do you think this is an improvement? Leave a comment below or in Nathan’s original post at http://flowingdata.com/2010/02/25/challenge-make-this-graph-easier-to-read/.
And here’s the R code:
#Read the file
gdp<-read.delim(“gdp_long.txt”,header=T)
#Reset the column name from United.States to United States; R replaces spaces in variable names with dots; you’ll see why below.
colnames(gdp)[6]<-”United States”
#Define our colour palette so that we can edit it in one place and refer to colours by index as shown below
pal=c(“black”,”darkorange”,”blue”,”forestgreen”,”tomato”)
#Start PNG device with the given constraints of 300×290 (boy that’s a small image!)
png(“gdp.png”,height=300,width=290,units=”px”)
#Plot settings
par(mar=c(2,2,3,1) #Small images call for small margins
,las=1) #For some reason, the default orientation (las=0) of the axis labels is parallel to the axis. This works OK for the X axis but makes it hard to read Y axis labels, so set to horizontal.
#Finally, the main plot command
plot(Canada~Year,data=gdp
,type=”l”
,xaxt=”n” #Don’t draw default X axis; we’ll draw a custom one below.
,xaxs=”i” #X axis style (internal) just finds an axis with pretty labels that fits within the original data range. If you don’t use this then an extra space is added at the edges even if you set xlim
,yaxs=”i”#Style – Same reason as X Axis
,main=”Total government spending \n(% of GDP by year)” #Got rid of ‘The shape of the beast’ for space constraints
,col=pal[1]
,ylim=c(30,70) #Setting the top Y axis limits to allow space for the legend
,lwd=4 #Quite unusually high line width but good for improving visibility in a small graph.
)
#Custom X axis
axis(side=1 #That’s the bottom X axis side
,at=Year[2:16] #labels starting from 1996; using at=Year places the labels at odd years.
,labels=substr(Year[2:16],3,4)) #Instead of using the full year, use just 2 digits.
grid(lwd=0.4,lty=1,col=”#000000″) #Very faint grid to guide the eyes.
#Add the rest of the lines
#France
lines(France~Year,data=gdp,col=pal[2],lwd=4)
#Germany
lines(Germany~Year,data=gdp,col=pal[3],lwd=4)
#Britain
lines(Britain~Year,data=gdp,col=pal[4],lwd=4)
#United States; Note we can’t use United States~Year because of the space between United and States. This calls for use of the data[["variable name"]] notation.
lines(gdp[["United States"]]~Year,data=gdp,col=pal[5],lwd=4)
#Lastly the legend
legend(“top” #Align it at the top in the center
,ncol=2 #Number of columns to spread the legend labels overs; 2 works best for our graph.
,legend=colnames(gdp)[2:6]
,lty=1
,lwd=4
,col=pal #Make sure to use the same colour palette as the graph lines!
,bg=”#FFFFFF” #White background to make it merge with the plot background.
,inset=0.01) #Inset the legend so that it doesn’t quite touch the border of the plot.
dev.off() #Close the graphics device
Nutritional supplements efficacy score – Graphing plots of current studies results (using R)
In this post I showcase a nice bar-plot and a balloon-plot listing recommended Nutritional supplements , according to how much evidence exists for thier benefits, scroll down to see it(and click here for the data behind it)
* * * *
The gorgeous blog “Information Is Beautiful” recently publish an eye candy post showing a “balloon race” image (see a static version of the image here) illustrating how much evidence exists for the benefits of various Nutritional supplements (such as: green tea, vitamins, herbs, pills and so on) . The higher the bubble in the Y axis score (e.g: the bubble size) for the supplement the greater the evidence there is for its effectiveness (But only for the conditions listed along side the supplement).
There are two reasons this should be of interest to us:
- This shows a fun plot, that R currently doesn’t know how to do (at least I wasn’t able to find an implementation for it). So if anyone thinks of an easy way for making one – please let me know.
- The data for the graph is openly (and freely) provided to all of us on this Google Doc.
The advantage of having the data on a google doc means that we can see when the data will be updated. But more then that, it means we can easily extract the data into R and have our way with it (Thanks to David Smith’s post on the subject)
For example, I was wondering what are ALL of the top recommended Nutritional supplements, an answer that is not trivial to get from the plot that was in the original post.
In this post I will supply two plots that present the data: A barplot (that in retrospect didn’t prove to be good enough) and a balloon-plot for a table (that seems to me to be much better).
Barplot
(You can click the image to enlarge it)

The R code to produce the barplot of Nutritional supplements efficacy score (by evidence for its effectiveness on the listed condition).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | # loading the data supplements.data.0 <- read.csv("http://spreadsheets.google.com/pub?key=0Aqe2P9sYhZ2ndFRKaU1FaWVvOEJiV2NwZ0JHck12X1E&output=csv") supplements.data <- supplements.data.0[supplements.data.0[,2] >2,] # let's only look at "good" supplements supplements.data <- supplements.data[!is.na(supplements.data[,2]),] # and we don't want any missing data supplement.score <- supplements.data[, 2] ss <- order(supplement.score, decreasing = F) # sort our data supplement.score <- supplement.score[ss] supplement.name <- supplements.data[ss, 1] supplement.benefits <- supplements.data[ss, 4] supplement.score.col <- factor(as.character(supplement.score)) levels(supplement.score.col) <- c("red", "orange", "blue", "dark green") supplement.score.col <- as.character(supplement.score.col) # mar: c(bottom, left, top, right) The default is c(5, 4, 4, 2) + 0.1. par(mar = c(5,9,4,13)) # taking care of the plot margins bar.y <- barplot(supplement.score, names.arg= supplement.name, las = 1, horiz = T, col = supplement.score.col, xlim = c(0,6.2), main = c("Nutritional supplements efficacy score","(by evidence for its effectiveness on the listed condition)", "(2010)")) axis(4, labels = supplement.benefits, at = bar.y, las = 1) # Add right axis abline(h = bar.y, col = supplement.score.col , lty = 2) # add some lines so to easily follow each bar |
Also, the nice things is that if the guys at Information Is Beautiful will update there data, we could easily run the code and see the updated list of recommended supplements.
Balloon plot
So after some web surfing I came around an implementation of a balloon plot in R (Thanks to R graph gallery)
There where two problems with using the command out of the box. The first one was that the colors where non informative (easily fixed), the second one was that the X labels where overlapping one another. Since there is no “las” parameter in the function, I just opened the function up, found where this was plotted and changed it manually (a bit messy, but that’s what you have to do sometimes…)
Here are the result (you can click the image for a larger image):
And here is The R code to produce the Balloon plot of Nutritional supplements efficacy score (by evidence for its effectiveness on the listed condition).
(it’s just the copy of the function with a tiny bit of editing in line 146, and then using it)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | require(colorspace) require(gplots) # I was able to find the function by using # methods(balloonplot)[1] # This command: getAnywhere("balloonplot.default") # Wouldn't work... balloonplot2 <- gplots:::balloonplot.default # This one works :) # now run: fix(balloonplot2) # search for # y <- ny + 0.75 + (nlabels.x - i + 0.5) * colmar # And add beneath it the following line: # y <- rep(y, dim(xlabs)[1]) - c(0,.5,1) supplement.benefits <- tolower(supplement.benefits ) supplement.name <- tolower(supplement.name) balloonplot2( supplement.name,supplement.benefits, supplement.score, xlab ="supplement", ylab="Benefit", show.margins=F, dotsize = 15,fun=function(x)max(x,na.rm=T), rowmar = 7, colmar = 7, dotcolor = rev(heat_hcl(max( supplement.score)))[ supplement.score-1], main = c("Balloon plot of", "Nutritional supplements efficacy score","(by evidence for its effectiveness on the listed condition)", "(2010)"), sub = c("Published on www.r-statistics.com") ) |
Got any good ideas of how else to plot the data? let me know in the comments ![]()



