Blog Archives

R: Eliminating observed values with zero variance

March 8, 2010

I needed a fast way of eliminating observed values with zero variance from large data sets using the R statistical computing and analysis platform. In other words, I wanted to find the columns in a data frame that have zero variance, and to do so as fast as possible, because my data sets are large, many, and changing fast....
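As a rough illustration of the task described in this teaser, here is a minimal sketch in R. It is not the method from the full post; the helper name `zero_variance_cols` and the toy data frame are hypothetical, and it treats a column as zero-variance when it holds at most one distinct non-missing value, which also covers non-numeric columns.

```r
# Hypothetical helper (not from the original post): return the names of
# columns that hold at most one distinct non-NA value.
zero_variance_cols <- function(df) {
  is_constant <- vapply(df, function(col) {
    length(unique(col[!is.na(col)])) <= 1L
  }, logical(1))
  names(df)[is_constant]
}

# Toy data: columns "a" and "c" are constant, "b" varies.
df <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3), c = c("x", "x", "x"))

constant <- zero_variance_cols(df)

# Drop the constant columns; drop = FALSE keeps the result a data frame.
df_clean <- df[, !(names(df) %in% constant), drop = FALSE]
```

Checking distinct values rather than calling `var()` avoids errors on character columns; whether it is fast enough for very large data sets would need to be measured.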

Read more »

Beautiful Data

July 27, 2009

O'Reilly's recent publication Beautiful Data has a chapter by Jeff Jonas which is enough reason in itself for me to recommend it. The chapter, Data Finds Data, is also available as a PDF download.

Read more »

Massively parallel database for analytics

July 22, 2009

This is by far the best description of why traditional parallel databases (like Teradata, Greenplum et al.) are an evolutionary dead end. But this is much more than a theoretical discussion: the authors have built a solution, which they call HadoopDB. It is based on Hadoop, PostgreSQL, and Hive and is completely open source. Alternative, column-based, backends to PostgreSQL...

Read more »

The Knapsack Problem

July 10, 2009

David posts a question about how to solve this knapsack problem using the R statistical computing and analysis platform. My reply in the comments seems to have disappeared for a while, so here is my proposed solution:

Read more »

OECD Statistics

July 2, 2009

I am a sucker for good quality data. I wrote about data.gov, the US Government data site, before, and now I find OECD Statistics, which has some 300 data sets, many of which seem to be readily accessible (though some may require a subscription).

Read more »

R tips: Determine if function is called from specific package

June 16, 2009

I like the "multicore" library for a particular task. I can easily write a combination of if(require("multicore",...)) that means that my function will automatically use the parallel mclapply() instead of lapply() where it is available. Which is grand 99% of the time, except when my function is called from mclapply() (or one of the lower level functions)...
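One possible way to detect the situation described here is to inspect the call stack with base R's `sys.calls()`. The sketch below is an assumption about how this could be done, not the approach from the full post; the helper names `call_name` and `called_from` are hypothetical, and plain `lapply()` stands in for `mclapply()` so that the example runs without the "multicore" package.

```r
# Hypothetical helper: extract the function name at the head of a call,
# handling both fn(...) and pkg::fn(...) forms.
call_name <- function(cl) {
  head <- cl[[1]]
  if (is.name(head)) return(as.character(head))
  if (is.call(head) && as.character(head[[1]])[1] %in% c("::", ":::")) {
    return(as.character(head[[3]]))
  }
  NA_character_  # e.g. an anonymous function at the head of the call
}

# Hypothetical helper: TRUE if any function in fn_names is on the call stack.
called_from <- function(fn_names) {
  callers <- vapply(as.list(sys.calls()), call_name, character(1))
  any(callers %in% fn_names)
}

# A worker that reports whether it is running underneath lapply().
worker <- function(x) called_from("lapply")

direct     <- worker(1)                  # called directly
via_lapply <- unlist(lapply(1:2, worker)) # called through lapply()
```

In real use the name to look for would be `"mclapply"` (and any lower-level functions of interest); whether the call stack inside a forked worker looks the same as in this single-process example is an assumption that should be checked.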

Read more »