The setup: when doing statistics the Bayesian way, we are sometimes bombarded with complicated integrals that do not lend themselves to closed-form solutions. This used to be a problem. Nowadays, not so much. This post illustrates how a person can...

In a tongue-in-cheek post at the Information Management blog, Steve Miller shares his "frustration" with R: package developers keep on releasing new functionality for R that makes his own work obsolete. For example, there's now pre-packaged functionality in R for enhanced dotplots, Economist-style graphics, additive regression models and more, which all obviate the need for Steve to implement such...

“It seems quite absurd to reject an EP-based approach, if the only alternative is an ABC approach based on summary statistics, which introduces a bias which seems both larger (according to our numerical examples) and more arbitrary, in the sense that in real-world applications one has little intuition and even less mathematical guidance on to

At the last Utah R Users group meeting I gave a presentation on data manipulation in R. Today, through the plyr mailing list, I found two commands I was previously unaware of that definitely deserve a mention: arrange and mutate.

User BobH asked on StackOverflow about accelerating path-dependent loops. He provided a simple example in which a vector gets filled conditional on the value of the preceding element. Simple to code, but hard to vectorise. By the time I saw that q...
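To make the idea of a path-dependent loop concrete, here is a minimal sketch in Python. It is not BobH's example: the function name, the reset rule and the threshold are hypothetical, chosen only to show why each element needs the previously computed one and so resists a straightforward vectorised rewrite.

```python
def fill_path_dependent(shocks, threshold=3.0):
    """Fill a vector where each element depends on the previous result.

    Hypothetical rule (an assumption for illustration): keep accumulating
    the incoming values, but reset to zero once the previous element has
    exceeded the threshold.
    """
    out = [0.0] * len(shocks)
    for i in range(1, len(shocks)):
        prev = out[i - 1]  # the value computed in the previous iteration
        if prev > threshold:
            out[i] = 0.0  # reset after the threshold was exceeded
        else:
            out[i] = prev + shocks[i]  # otherwise keep accumulating
    return out


result = fill_path_dependent([0.0, 1.0, 2.0, 1.5, 0.5, 1.0])
print(result)
```

Because `out[i]` reads `out[i - 1]`, which is itself produced by the loop, the elements cannot all be computed at once from the input alone.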

There are only three known jokes about statistics in the whole universe, so to complete the trilogy (see here and here for the other two), listen up: Three statisticians are on a train journey to a conference, and they get chatting to three epidemiologists who are also going to the same place. The epidemiologists are

Time series data are widely seen in analytics. Some examples are stock indexes/prices, currency exchange rates and electrocardiograms (ECG). Traditional time series analysis focuses on smoothing, decomposition and forecasting, and there are many R functions and packages available for those …

The usual approach to testing software is to create a specific problem and see if the software gets the correct answer. Although this is very useful, there are problems with it: it is labor-intensive; it almost totally neglects to test the code that throws errors; there can be unconscious bias in the test cases created …

Hong Ooi talks about some of the more interesting projects that he has used R for in the last year. These include fitting models for mortgage loss given default, a Monte Carlo application for stress-testing loan portfolios (in combination with Excel an...

No doubt you've heard about the tyranny of the 9s in reference to computer system availability. You're probably also familiar with the phrase six sigma, either in the context of manufacturing process quality control or the improvement of business processes. As we discovered in the recent Guerrilla Data Analysis Techniques class, the two concepts are related.

Which topics are the most popular at the BioStar bioinformatics Q&A site? One source of data is the tags used for questions. Tags are somewhat arbitrary of course, but fortunately BioStar has quite an active community, so “bad” tags are usually edited to improve them. Hint: if your question is “How to find SNPs”, then

If you missed last week's worldwide R user conference at the University of Warwick, several attendees have posted informative roundups of the event. Check out these posts from Patrick Burns, Karl Broman, Colin Gillespie, Pairach Piboonrungroj and Richie Cotton (which features a rare, good statistics joke). My own roundup of the conference was posted on Friday, in case you...

A heads-up that I'll be giving a free webinar this Wednesday, August 24. In 30 minutes, I'll give an overview of the open-source R project and the additional features of Revolution R Enterprise: R users already know why the R language is the lingua franca of statisticians today: because it's the most powerful statistical language in the world. Revolution...

RTextTools v1.3 was released on August 21, and the package binaries are now available on CRAN. This update fixes a major bug with the stemmers, and it is highly recommended that you upgrade to the latest version. Other changes include optimization of existing functions and improvements to the documentation. Additionally, Duncan Temple Lang has graciously

I have been waiting for the KDD conference to come to California, and I was ecstatic to see it held in San Diego this year. AdMeld did an awesome job displaying KDD ads on the sites that I visit, sometimes multiple times per page. That’s good targeting! Mining and Learning on Graphs Workshop 2011: I had originally planned to attend the...

Earlier, I discussed the nice properties of bow tie plots for visualizing and understanding inferences from simple randomized treatment experimental designs. R code to quickly create these plots is available here. You can use the command source("htt...

For a quick recap, Pierre and I supervised a team project at Ensae last year, on a statistical critique of the abstract painting 1024 Colours by painter Gerhard Richter. The four students, Clémence Bonniot, Anne Degrave, Guillaume Roussellet and Astrid Tricaud, did an outstanding job. Here is a selection of graphs and results they produced.

Over at the ExploringDataBlog, Ron Pearson just wrote a post about the cases when means are useless. In fact, it’s possible to calculate a whole load of stats on your data and still not really understand it. The canonical dataset for demonstrating this (spoiler alert: if you are doing an intro to stats course, you