Here I compare two distributions: the flowering durations of indigenous and allochthonous plant species. The hypothesis is that alien plant species exhibit longer flowering periods than indigenous ones.
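The excerpt does not say which test the author used. As a minimal sketch of one common choice, a one-sided Wilcoxon rank-sum test can check whether one sample tends to take larger values than the other. The data below are simulated placeholders, not the author's flowering data, and the names `native` and `alien` are assumptions made for illustration.

```r
# Simulated placeholder data (NOT real flowering durations), in months.
set.seed(42)
native <- rgamma(200, shape = 4, rate = 2)  # hypothetical indigenous species
alien  <- rgamma(200, shape = 6, rate = 2)  # hypothetical alien species

# One-sided test: do alien species tend to flower longer than native ones?
res <- wilcox.test(alien, native, alternative = "greater")
res$p.value

# A quick visual comparison of the two empirical distributions.
plot(ecdf(native), main = "Flowering duration (simulated)", xlab = "months")
lines(ecdf(alien), col = "red")
```

A rank-based test is used here because duration data are typically skewed; a permutation test on the difference in medians would be a reasonable alternative.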

Here I implemented some dithering algorithms in R:
- Floyd-Steinberg dithering
- Bill Atkinson dithering
- Jarvis-Judice-Ninke dithering
- Sierra 2-4a dithering
- Stucki dithering
- Burkes dithering
- Sierra2 dithering
- Sierra3 dithering
For each algorithm, I wrote a 2-dimensional convolution function (a matrix passing over a matrix); it is slow because I didn't implement any speed-up tricks. It can be easily implemented in C, then used...
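To make the idea concrete, here is a minimal sketch of Floyd-Steinberg error diffusion using plain nested loops. It is not the author's convolution-based code; the function name `floyd_steinberg` and the gradient test image are assumptions, and the input is assumed to be a grayscale matrix with values in [0, 1].

```r
# Floyd-Steinberg error diffusion on a grayscale matrix with values in [0, 1].
# Each pixel is thresholded to 0 or 1 and the quantization error is pushed
# to the not-yet-visited neighbours with weights 7/16, 3/16, 5/16 and 1/16.
floyd_steinberg <- function(img) {
  out <- img
  nr <- nrow(out)
  nc <- ncol(out)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      old <- out[i, j]
      new <- if (old >= 0.5) 1 else 0
      out[i, j] <- new
      err <- old - new
      if (j < nc) out[i, j + 1] <- out[i, j + 1] + err * 7 / 16
      if (i < nr) {
        if (j > 1) out[i + 1, j - 1] <- out[i + 1, j - 1] + err * 3 / 16
        out[i + 1, j] <- out[i + 1, j] + err * 5 / 16
        if (j < nc) out[i + 1, j + 1] <- out[i + 1, j + 1] + err * 1 / 16
      }
    }
  }
  out
}

# Hypothetical test image: a horizontal gradient from black to white.
gradient <- matrix(rep(seq(0, 1, length.out = 64), each = 32), nrow = 32)
dithered <- floyd_steinberg(gradient)

# Display the result; t() and column reversal orient the image as stored.
image(t(dithered[nrow(dithered):1, ]), col = c("black", "white"), axes = FALSE)
```

The other algorithms in the list differ mainly in the neighbourhood and the weights used to spread the error, so the same loop structure could be reused with a different weight table.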

I'm working on a three-part post on how to build, score, and optimize with predictive models in R. Having done this type of work in IBM SPSS for a number of years, I wanted to replicate it in R. It's amazing how little is published on how to s...

R seems to have two byte code compilers: the Ra add-on module (with the accompanying "jit" package) and the "compiler" package that comes with the default installation. I wonder how they differ from each other and what the strengths and weaknesses...
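For the "compiler" package, a minimal sketch of byte-compiling a function with `cmpfun()` looks like the following. The toy function `sum_squares` is a hypothetical example, not taken from the post, and the Ra/"jit" route is not shown. Note that in recent R versions (3.4.0 and later) just-in-time byte compilation is enabled by default, so the timing gap may be small or absent there.

```r
library(compiler)  # ships with base R

# Hypothetical toy function: sum of squares with an explicit loop.
sum_squares <- function(n) {
  total <- 0
  for (k in seq_len(n)) total <- total + k^2
  total
}

# Byte-compiled version of the same function.
sum_squares_cmp <- cmpfun(sum_squares)

# Both versions must return the same value; only the speed may differ.
stopifnot(identical(sum_squares(1000), sum_squares_cmp(1000)))

# Rough timing comparison; results depend on the R version and machine.
system.time(for (r in 1:200) sum_squares(1e4))
system.time(for (r in 1:200) sum_squares_cmp(1e4))
```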

In early May I had the opportunity to attend a workshop on using high performance computing in R, hosted at NIMBioS. I’ve been meaning to write a summary of the meeting ever since but got sidetracked by various other projects. Since a collaborator recently asked for meeting notes, I finally took the time to write

As September draws nearer, my mind inevitably turns away from my lofty (and largely unmet) summer research goals, and toward teaching. This semester I will be trying out a teaching technique using live data collection and analysis as a tool to encourage student engagement. The idea is based on the electronic polling technology known as

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions. The post generated three interesting comments that I want to respond to here. First and foremost, I...
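The three situations can be illustrated with a short simulation. This is a sketch on made-up data, not the examples from the original post; the sample sizes, the outlier value, and the choice of a Cauchy sample (whose second moment is infinite) are all assumptions.

```r
set.seed(1)

# 1. Severe outlier: one extreme value drags the mean far from the bulk.
x <- c(rnorm(99, mean = 10, sd = 1), 1e4)
mean(x)    # pulled toward the outlier
median(x)  # stays near the bulk of the data

# 2. Bimodal data: the mean lands between the modes, where few points lie.
y <- c(rnorm(500, mean = -5), rnorm(500, mean = 5))
frac_near_mean <- mean(abs(y - mean(y)) < 1)
frac_near_mean

# 3. Heavy tails: the running mean of a Cauchy sample never settles down.
z <- rcauchy(1e4)
running_mean <- cumsum(z) / seq_along(z)
plot(running_mean, type = "l", xlab = "sample size", ylab = "running mean")
```

In the first two cases the median or a description of each mode is more informative; in the third, no amount of extra data makes the sample mean stabilise.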

I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable...

As a coincidence, while I was waiting for the solution to puzzle #737 published this Friday in Le Monde, the delivery (wo)man forgot to include the weekend magazine and I had to buy it this morning with my baguette (as if anyone cares!). The solution is (y0,z0,w0)=(38,40,46) and…it does not work! The value of (x1,y1,z1,w1) is

If you missed this week's webinar, the slides from my presentation Revolution R Enterprise: 100% R and More may be useful as an introduction to R and the additional capabilities of Revolution R Enterprise. The slides themselves and the replay video are also available for download from the link below. Revolution Analytics webinars: Revolution R Enterprise: 100% R and...

Here's a followup to yesterday's post on using the rdatamarket package to import data into R. Ajay Ohri at the DecisionStats blog offers nine additional methods for bringing data into R, from sources including InfoChimps, the Google Prediction API, the World Bank World Development Indicators, Bloomberg Market Data, and much more. See Ajay's post at the link below for...

Last week I talked about our editrules package at the useR! 2011 conference. In the near future I plan to write a short series of blog posts about the functionality of editrules. Below I describe the eliminate and isFeasible functions. But first: …

One of the coolest R packages I heard about at the useR! Conference: Toby Dylan Hocking's directlabels package for putting labels directly next to the relevant curves or point clouds in a figure. I think I first learned about this idea from Andrew Gelman: that a separate legend requires a lot of back-and-forth glances, so

John Kay muses on interpreting statistical data: Always ask of such data “what is the question to which this number is the answer?”. “Earnings before interest, tax, depreciation and amortisation on a like-for-like basis before allowance for exceptional restructuring costs” is the answer to the question “what is the highest profit number we can present without attracting flat disbelief?”.

The puzzle in the weekend edition of Le Monde this week can be expressed as follows: Consider four integer sequences (xn), (yn), (zn), and (wn), such that and, if u=(xn,yn,zn,wn), for i=1,…,4, if ui is not the maximum of u and otherwise. Find the first return time n (if any) such that xn=0. Find the value

I was recently asked how to implement time series cross-validation in R. Time series people would normally call this “forecast evaluation with a rolling origin” or something similar, but it is the natural and obvious analogue to leave-one-out cross-validation for cross-sectional data, so I prefer to call it “time series cross-validation”. Here is some example
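The excerpt cuts off before the author's example, so the following is only a minimal sketch of one-step-ahead rolling-origin evaluation in base R. The helper names (`rolling_origin_errors`, `naive_fc`, `ar1_fc`), the built-in `lh` series, the AR(1) model, and the minimum training length of 20 are all assumptions chosen for illustration.

```r
# Rolling-origin evaluation: for each origin k, fit on y[1..k] and
# forecast y[k + 1]; return the vector of one-step forecast errors.
rolling_origin_errors <- function(y, forecast_fn, min_train = 20) {
  y <- as.numeric(y)
  n <- length(y)
  origins <- min_train:(n - 1)
  errors <- numeric(length(origins))
  for (idx in seq_along(origins)) {
    k <- origins[idx]
    train <- y[seq_len(k)]
    errors[idx] <- y[k + 1] - forecast_fn(train)
  }
  errors
}

# Two hypothetical forecasters to compare.
naive_fc <- function(train) train[length(train)]  # last observed value

ar1_fc <- function(train) {
  # AR(1) fitted by Yule-Walker, then a one-step-ahead prediction.
  fit <- ar(train, order.max = 1, aic = FALSE, method = "yule-walker")
  as.numeric(predict(fit, newdata = train, n.ahead = 1)$pred)
}

# Example on the built-in 'lh' series (48 observations).
err_naive <- rolling_origin_errors(lh, naive_fc)
err_ar1   <- rolling_origin_errors(lh, ar1_fc)

# Mean absolute error of each forecaster over all origins.
c(naive = mean(abs(err_naive)), ar1 = mean(abs(err_ar1)))
```

Unlike ordinary leave-one-out cross-validation, each model only ever sees observations that precede the point being forecast, which respects the time ordering of the data.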