Pre-processing text: R/tm vs. python/NLTK

February 16, 2011
Let’s say that you want to take a set of documents and apply a computational linguistic technique. If your method is based on the bag-of-words model, you probably need to pre-process these documents first by segmenting, tokenizing, stripping, stopwording, and …

Mixed models – Part 2: lme lmer

February 15, 2011
Getting more into mixed models, I’ve been playing around with both nlme::lme and lme4::lmer. The post at http://tolstoy.newcastle.edu.au/R/e2/help/06/10/3345.html does quite a good job of explaining the differences, which, from what I gather, are largely performance-based when using crossed or partially crossed models. In the models I am tinkering with at the moment I am noticing differences in

ABC in London

February 15, 2011
After the very exciting and, I think, quite successful ABC in Paris meeting two years ago, Michael Stumpf from Imperial College London suggested a second edition in London along the same lines. Michael kindly involved me in the planning of this meeting. It is (logically) called ABC in London (or ABCiL) and will take place

Reaching 1000

February 14, 2011
This is the 1000th post on the ‘Og! Here are the entries that have had more than 1000 views (not viewers) so far: In{s}a(ne)!! (5,353); “simply start over and build something better” (4,345); Julien on R shortcomings (1,966); Sudoku via simulated annealing (1,762); Of black swans and bleak prospects (1,462); Do we need an integrated Bayesian/likelihood

The Most Romantic Electro-Grunge Statistical Computing Song Ever Made

February 14, 2011
Warning message: This song contains highly suggestive coefficients and graphic depictions of exuberant R-core lovin’. “Plotting Ihaka” is based on Rotting Piñata by Sponge, and reflects a small measure of my boundless joy in the world of R. Despite being a firm proponent of muffins, I can confidently say that I would rather live in

Another Bernoulli factory

February 13, 2011
The paper “Exact sampling for intractable probability distributions via a Bernoulli factory” by James Flegal and Radu Herbei got posted on arXiv without me noticing, presumably because it came out just between Larry Brown’s conference in Philadelphia and my skiing vacations! I became aware of it only yesterday and find it quite interesting in that

Visualize NHL Play-by-Play using Tableau Public and R

February 13, 2011
Nothing like a little Sunday morning data hacking before a big game!  I have been wanting to play with the NHL play-by-play event files for some time now.  The JSON datasets provide a wealth of information about each event in the game including the location, as defined by the fields xcoord and ycoord. I am

Parallel computation [back]

February 12, 2011
We have now received reports back from JCGS for our parallel MCMC paper and they all are very nice and supportive! The reviewers essentially all like the Rao-Blackwellisation concept we developed in the paper and ask for additions towards a more concrete feeling for the practical consequences of the method. We should thus be able

Le Monde puzzle [#5]

February 10, 2011
Another Sudoku-like puzzle from the weekend edition of Le Monde. The object it starts with is a 9×9 table where each entry is an integer and where neighbours take adjacent values. (Neighbours are defined as north, west, south and east of an entry.) The question is about whether or not it is possible to find

Model weights for model choice

February 9, 2011
An ‘Og reader, Emmanuel Charpentier, sent me the following email about model choice: I read with great interest your critique of Peter Congdon’s 2006 paper (CSDA, 50(2):346-357) proposing a method of estimation of posterior model probabilities based on improper distributions for parameters not present in the model under examination, as well as a more general