Updated R code and data for ARM

May 19, 2010
Patricia and I have cleaned up some of the R and Bugs code and collected the data for almost all the examples in ARM. See here for links to zip files with the code and data....

Mining and Analyzing Online Social Graph Data

May 19, 2010
Drew Conway, PhD student in NYU's Department of Politics, provides an introduction to mining social graph data from the Internet that focuses on the technical, substantive and ethical concerns related to this type of analysis.

Random [uniform?] sudokus [corrected]

May 19, 2010
As the discrepancy in the sum of the nine probabilities seemed too blatant to be attributed to numerical error given the problem scale, I went and checked my R code for the probabilities and found a choose(9,3) instead of a choose(6,3) in the last line… The fit between the true distribution and the

May 19, 2010
Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. Integer, floating point and complex numbers are supported, as well as a subset of trigonometric and statistics functions. Various matr...

Random [uniform?] sudokus

May 19, 2010
A longer run of yesterday's R code with a million sudokus produced the following qqplot. It does look OK but not perfect. Actually, it looks very much like yesterday's graph, although based on a 100-fold increase in the number of simulations. Now, if I test the fit with a basic chi-square

LSPM Joint Probability Tables

May 18, 2010
I've received several requests for methods to create joint probability tables for use in LSPM's portfolio optimization functions.  Rather than continue to email this example to individuals who ask, I post it here in hopes they find it via a Google...

May 18, 2010
Update x6 (Jul 27): so I guess people want pitch counts. The data @ MLB seems to only give the pitch count of the end result and the strikes/balls/outs of the particular pitch. Of course you can combine them to get the pitch count. Stupid WordPress comments strip out necessary HTML to properly display code,

robot (SPX) DNA Management Techniques

May 18, 2010
Yes, this is related to trading, but no, it is not my thesis on why the Euro is going to parity. Instead, it is sort of a workshop for robot(SPX) developers on how to organize their digital DNA. As you begin to use programming as a money extraction tool on the markets, you'll soon find...

Confusing slice sampler

May 18, 2010
Most embarrassingly, Liaosa Xu from Virginia Tech sent the following email almost a month ago and I forgot to reply: I have a question regarding your example 7.11 in your book Introducing Monte Carlo Methods with R.  To further decompose the uniform simulation by sampling a and b step by step, how you determine the

R: Dueling normals

May 18, 2010
More playing around with R. To create the graph above, I sampled 100 times from two different normal distributions, then plotted the ratio of times that the first distribution beat the second one on the y-axis. The second distribution always had a mean of 0, the mean of the first distribution went from 0 to 4,

Parallel Computing with R for Life Sciences

May 18, 2010
I hadn't heard of the CloudAsia 2010 conference before, but from the programme the workshop Master Class on HPC Application For Life Sciences looked like it was interesting. One workshop session in particular caught my eye: Practical Parallel Computing in R by Xie Chao and Tan Tin Wee (from the National University of Singapore). The workshop notes (PDF) provide...

Prototype: Web-Friendly Visualizations in R

May 18, 2010
Developing web-friendly data visualizations is not very difficult, though as far as I know, a package that allows one to do this directly in R does not exist (e-mail me if you know of one). As someone who has been developing lots of data-oriented software tools, it's always nice to post visualizations online. To facilitate

JAGS 2.1.0 and rjags 2.1.0 are released

May 17, 2010
JAGS 2.1.0 is now available from Sourceforge. You will find the source as well as binary packages for Windows and Mac OS X. Binary packages for Debian are available through the usual Debian channels, and packages for RPM-based Linux distributions …

House Mountain Hike

May 17, 2010
My wife Mary, my Dad Wesley, and I took a hike this weekend (5/14/10) to the House Mountain state recreation area in Knox County, Tennessee. The hike was about 3.8 miles with a total elevation gain of around 1000 feet (940.23 ft by GPS). The plot below gives the elevation profile over the course of

Random sudokus [test]

May 17, 2010
Robin Ryder pointed out to me that 3 is indeed the absolute minimum one could observe because of the block constraint (good grief, but of course!). The distribution of the series of 3 digits being independent over blocks, the theoretical distribution under uniformity can easily be simulated: #uniform distribution on the block diagonal

Rcpp 0.8.0

May 17, 2010
Romain and I are happy to announce the release of Rcpp version 0.8.0. It has been uploaded to CRAN. A Debian upload is delayed until the now-required inline package is accepted into Debian. The source package is also available from here. This release ...

Winning the first game in a baseball series: a harbinger, or not?

May 17, 2010
For those not familiar with major-league baseball in the US (and despite living here for more than 10 years, I still include myself in that category), the games are usually played in series: team A visits the home of team B, and the two teams play two or more games against each other on successive days. It's common wisdom...

Example 7.37: calculation of Hotelling’s T^2

May 17, 2010
Hotelling's T^2 is a multivariate statistic used to compare two groups, where multiple outcomes are observed for each subject. Here we demonstrate how to calculate Hotelling's T^2 using R and SAS, and test the code using a simulation study then apply ...

Index of the R-Sessions

May 17, 2010
The R-Sessions are a series of blog entries on using R. A large part consists of an R-manual I once wrote. Other posts include some tricks I found out, as well as entries detailing functions and packages I wrote for ...

Hitting the Big Data Ceiling in R

May 16, 2010
As a true R fan, I like to believe that R can do anything, no matter how big, how small or how complicated: there is some way to do it in R. I decided to approach my large, sparse matrix problem with this attitude. But here I sit a broken man. There is no “native” big data support built into...

Graphing using R

May 16, 2010
Long-time readers of the Stubborn Mule will know that charts are a regular feature here. Almost all of these charts were produced using the R statistical software package which, in my view, produces far superior results to the most commonly used graphing tool: Excel. As a community service to help rid the world of horrible

Random sudokus

May 16, 2010
After thinking about random sudokus for a few more weeks, I eventually came to read the paper by Newton and DeSalvo about the entropy of sudoku matrices. As written earlier, if we consider (as Newton and DeSalvo) a uniform distribution where the sudokus are drawn uniformly over the set of all sudokus, the entropy of

A 34 Minute Video on Using R to Analyse Winter Olympic Medal Data

May 16, 2010
In this post I present a 34-minute video on using R. The video is based on an analysis of 1924 to 2006 Winter Olympic Medals that I presented previously in text form. The video aims to show what an interactive session in R might look like using ...

Emulating Internet Traffic in Load Tests

May 15, 2010
One of the recurring questions in the GCaP class last week was: How can we make web-application load tests more representative of real Internet traffic? The sticking point is that conventional load-test simulators like LoadRunner, JMeter, and httperf, ...

Typo in Bayesian Core [again]

May 15, 2010
Reza Seirafi from Virginia Tech sent me the following email about Bayesian Core, which alas is pointing out a real typo in the reversible jump acceptance probability for the mixture model: With respect to the expression provided on page 178 for the acceptance probability of the split move, I was wondering if the omission of