Here’s a bit of code used to produce one of the figures in my recent paper dealing with modeling rocky intertidal snail body temperatures. This was my first foray into ggplot2, and it only involved a few hours of head-scratching. The plot is a co...


When predicting 0/1 data we can use logit (or probit or robit or some other robust model such as invlogit (0.01 + 0.98*X*beta)). Logit is simple enough and we can use bayesglm to regularize and avoid the problem of separation. What if there are more than 2 categories? If they’re ordered (1, 2, 3, etc),

Ever been editing an .Rnw (Sweave) file and tried to sync a pdf with the source in TeXShop (or TeXWorks) and had it open the .tex file? This happens because the synctex information (in the .synctex.gz file) is messed up. Both TeXShop and TeXWorks support synctex, which means that if everything is groovy, we should

The blog Heuristically Andrew puts Revolution R through its paces by running some benchmarks versus open-source R for data mining applications. The benchmarks set out to answer the following question: I recently upgraded my notebook (where I often use R for data mining) and was faced with two questions: for the fastest speed for building models, do I use...

In an earlier post I went through some econometrics that involved the problem of testing for multivariate cointegration in the case where there are one or more trend-breaks or level-breaks in the time-series data. Specifically, I talked about the modified Trace tests introduced by Johansen et al. (2000), and I mentioned the really nice discussion of the application of these tests...

Yesterday the Princeton machine learning reading group went through a paper by Tukey on “Some graphic and semigraphic displays”. One issue we talked about at length was Tukey’s idiosyncratic approach to visualizing periodic data in a circular format to emphasize the connections between the “start” and the “end” of the data set. Allison Chaney pointed

tl;dr Browse through Sarah Palin’s emails, automagically organized by topic, here. LDA-based Email Browser Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I did some topic modeling (in particular, using latent Dirichlet

Following my earlier posts on the revision of Lack of confidence, here is an interesting outcome from the derivation of the exact marginal likelihood in the Laplace case. Computing the posterior probability of a normal model versus a Laplace model in the normal (gold) and the Laplace (chocolate) settings leads to the above histogram(s), which

We are very proud to announce our cloudnumbers.com release number 5! Over the last few days we rolled out several releases and bug fixes. Cloudnumbers.com now supports many more features and has an optimized startup process. Here is a list of our most important new features: Bioconductor packages for the R application can be

Visit “The R Programming wikibook” to extend your knowledge of R and to find plenty of introductions on how to use it. If you are an R expert and wish to contribute your knowledge and editing skills to the project, you can learn how to write in wiki-markup and how to edit a wikibook.

The title of chapter 5 in my Guerrilla Capacity Planning book is, "Evaluating Scalability Parameters," and underneath it you'll see this quote: "With four parameters I can fit an elephant. With five I can make his trunk wiggle." (John von Neumann) In that vein, Guerrilla alumnus Stephen O'C. pointed me at a recent blog post and paper (PDF)...

LDA-based Email Browser Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups. I threw...

Why isn't everyone using the RObjectTables package? This is the best thing ever! Here's the basic idea of RObjectTables: An environment is an object where you can look up names and associate them with values. In particular, it's where you look up variables used in an expression. But there's no reason you can't take any other object that associates names...

The French agronomy research institute INRA is organising a Fall school in La Rochelle, Nov. 28 – Dec. 02, on Bayesian methods, oriented towards the applications in food sciences, environmental sciences, and biology. The provisional program (in French) is ■ Initiation aux outils informatiques R et WinBUGS (TP et réalisation de projets sur ordinateur) ■

With RTextTools now released and the feedback rolling in, the development team is getting the ball rolling on the help documentation for the library. Currently, you cannot access help files about the library or its functions from within R. However, we do offer a draft of a quick start guide in PDF format under the Documentation section of the...

The Generalized Linear Model (GLM) allows us to model responses with distributions other than the Normal distribution; normality is one of the assumptions underlying linear regression as it is commonly used. When the data are counts of events (or items), a discrete distribution is usually more appropriate than approximating with a continuous