Ubuntu Developer Summit in Barcelona

May 31, 2009 · Posted in R bloggers · Comments Off 
Due to some things falling into place, I had an opportunity to attend the first two days of last week's Ubuntu Developer Summit in beautiful Barcelona. Somehow, I had never managed to attend a Debian conference either, so it was good to meet a few of the old Debian hands now moving Ubuntu along, as well as a few of the Ubuntu folks. I also gave a short presentation on R in Debian / Ubuntu and the plans for the upcoming Ubuntu release. More on that another time.

All told, a well-organised conference in a nice setting -- two stone's throws from the legendary Camp Nou. Unfortunately, I had to leave by Wednesday, so I missed what was undoubtedly quite a scene in Barcelona following Barca's dismantling of Man U in this year's Champions League final.

Nice Interview

May 31, 2009 · Posted in R bloggers · Comments Off 
Here you can read a nice interview with David Smith, REvolution Computing’s Director of Community, statistician and bloggeR.

R used by KDD 2009 cup winner of slow challenge

May 31, 2009 · Posted in R bloggers · Comments Off 

The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.

The write-up (username/password may be required) from Hugh Miller and team at the University of Melbourne includes these points:

  • Decision tree, stub, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function
  • Models fit in an hour or so
  • Used the R statistical package
  • Most of the models were run on a Windows laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120GB hard drive.

Impressive hardware selection! Well done R. Weka was another popular tool among the top entrants. Key for all of them were clever data preparation and variable substitution. The fast track winners from IBM document this in some detail:

We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.

We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN.

On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.

We tried PCA on the large data set, but it did not seem to help.

Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.
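For readers who want to try something similar in R, the categorical coding and missing-value handling described above might be sketched roughly as follows. This is not the IBM team's code: the helper names encode_top_k and impute_mean, the "(missing)" label, and the toy vectors are all invented for illustration, and k = 2 stands in for the ten most common values only to keep the example small.

```r
# Hypothetical helper: one 0/1 column per top-k value, plus an "other" column.
encode_top_k <- function(x, k = 10) {
  x <- as.character(x)
  x[is.na(x)] <- "(missing)"              # assumption: NA becomes its own value
  counts <- sort(table(x), decreasing = TRUE)
  top <- names(counts)[seq_len(min(k, length(counts)))]
  ind <- outer(x, top, "==") + 0L         # logical matrix -> integer 0/1
  colnames(ind) <- paste0("is_", top)
  cbind(ind, is_other = as.integer(!(x %in% top)))
}

# Hypothetical helper: mean imputation plus a missing-value indicator column.
impute_mean <- function(x) {
  miss <- is.na(x)
  x[miss] <- mean(x, na.rm = TRUE)
  list(value = x, missing = as.integer(miss))
}

# Toy data, made up for this sketch.
colour <- c("a", "a", "a", "b", "b", "c", NA)
m <- encode_top_k(colour, k = 2)
print(m)

num <- impute_mean(c(1, NA, 3))
print(num)
```

Each row of the indicator matrix has exactly one 1, so the "other" column plays the role of the eleventh column in the description above.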

JPM Chase Corporate Challenge 2009

May 30, 2009 · Posted in R bloggers · Comments Off 
The 28th annual JP Morgan Chase Corporate Challenge race took place a couple of days ago, on May 21. Participation, at around 17,125 runners, was down from the record of 23,000 set last year. With splendid weather, it is always a nice way to start the Memorial Day weekend.

We fielded a small but spirited team of nine runners. I finished with a decent (hand-stopped) time of 22 minutes and 27.93 seconds for the 3.5 miles -- or a 6:25 min/mile pace. That is among the faster times, though not quite the fastest, compared to the other six times I have run this.

Most importantly, everybody seems to have had a blast. And we did set a record for the longest post-race party, which sets a nice precedent for 2010.

The R Journal, Issue 1 Volume 1

May 29, 2009 · Posted in R bloggers · Comments Off 
The R Journal just published its inaugural peer-reviewed issue. In keeping with the open-source ethos, the journal is free and openly accessible. It features short articles on topics focused on R, including notes about new add-on packages, hints for R newcomers, application reports detailing examples of data analysis with R, and other news items. The current issue in PDF and

R tips: Use read.table instead of strsplit to split a text column into multiple columns

May 29, 2009 · Posted in R bloggers · Comments Off 

Someone on the R-help mailing list had a data frame with a column containing IP addresses in dotted-quad format (e.g. 1.10.100.200). He wanted to sort by this column and I proposed a solution involving strsplit. But Peter Dalgaard came up with a much nicer method using read.table on a textConnection object:

> a <- data.frame(cbind(color=c("yellow","red","blue","red"),
                        status=c("no","yes","yes","no"),
                        ip=c("162.131.58.26","2.131.58.16","2.2.58.10","162.131.58.17")))
> con <- textConnection(as.character(a$ip))
> o <- do.call(order,read.table(con, sep="."))
> close(con)
> a[o,]
   color status            ip
3   blue    yes     2.2.58.10
2    red    yes   2.131.58.16
4    red     no 162.131.58.17
1 yellow     no 162.131.58.26

That is very, very neat! Thank you Peter.
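For comparison, a strsplit-based version could look something like the sketch below. It is only a guess at the general shape of such a solution, not the code originally posted to R-help, and the variable names are made up for illustration.

```r
ip <- c("162.131.58.26", "2.131.58.16", "2.2.58.10", "162.131.58.17")

# Split on literal dots; fixed = TRUE stops "." being read as a regex wildcard.
parts <- strsplit(ip, ".", fixed = TRUE)

# Stack the pieces into a numeric matrix, one row per address.
octets <- do.call(rbind, lapply(parts, as.numeric))

# Order by the first octet, then the second, and so on.
o <- do.call(order, as.data.frame(octets))
print(ip[o])
```

This gives the same ordering, but read.table handles the splitting and the numeric conversion in a single step.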

R Journal 1/1

May 29, 2009 · Posted in R bloggers · Comments Off 
R Journal 1/1 is out! Download it from here.

Accessing Soil Survey Data via Web-Services

May 28, 2009 · Posted in R bloggers · Comments Off 

Soil Survey Data

 
Online Querying of NRCS Soil Survey Data
Sometimes you are only interested in soils data for a single map unit, component, or horizon. In these cases downloading the entire survey from Soil Data Mart is not worth the effort. An online query mechanism would suffice. The NRCS provides a form-based, interactive querying mechanism and a SOAP-based analogue. These services allow soil data lookup from the current snapshot of all data stored in NASIS.


Making Sense of Large Piles of Soils Information: Soil Taxonomy

May 27, 2009 · Posted in R bloggers · Comments Off 

Western Fresno Soil Hierarchy: partial view of the hierarchy within the US Soil Taxonomic system

 
Soil Data
Field and lab characterization of soil profile data results in the accumulation of a massive, multivariate and three-dimensional data set. Classification is one approach to making sense of a large collection of this type of data. US Soil Taxonomy is the primary soil classification system used in the U.S.A. and many other countries. This system is hierarchical in nature, and makes use of the presence or absence of diagnostic soil features. A comprehensive discussion of Soil Taxonomy is beyond the scope of this post. A detailed review of Soil Taxonomy can be found in Buol, S. W.; Graham, R. C.; McDaniel, P. A. & Southard, R. J. Soil Genesis and Classification. Iowa State Press, 2003.


Embedding fonts in figures produced by R

May 27, 2009 · Posted in R bloggers · Comments Off 
Some publishers insist that we embed (include) the fonts in each figure. Here is a set of links regarding this issue for figures produced by R:
