Ubuntu Developer Summit in Barcelona
All told, a well-organised conference in a nice setting -- two stone's throws from the legendary Camp Nou. Unfortunately, I had to leave by Wednesday, so I missed what was undoubtedly quite a scene in Barcelona following Barca's dismantling of Man U in this year's Champions League final.
Nice Interview
R used by KDD 2009 cup winner of slow challenge
The results from the KDD Cup 2009 challenge (which we wrote about before) are in, and the winner of the slow challenge used the R statistical computing and analysis platform for their winning submission.
The write-up (username/password may be required) from Hugh Miller and team at the University of Melbourne includes these points:
- Decision tree, stump, or Random Forest as base classifiers with Logistic loss or cross-entropy loss function
- Models fit in an hour or so
- Used the R statistical package
- Most of the models ran on a Windows laptop with an Intel Core 2 Duo 2.66GHz processor, 2GB RAM, 120GB hard drive.
Impressive hardware selection! Well done R. Weka was another popular tool among the top entrants. Key for all of them were clever data preparation and variable substitution. The fast track winners from IBM document this in some detail:
We normalized the numerical variables by range, keeping the sparsity. For the categorical variables, we coded them using at most 11 binary columns for each variable. For each categorical variable, we generated a binary feature for each of the ten most common values, encoding whether the instance had this value or not. The eleventh column encoded whether the instance had a value that was not among the top ten most common values. We removed constant attributes, as well as duplicate attributes.
We replaced the missing values by mean for numerical attributes, and coded them as a separate value for discrete attributes. We also added a separate column for each numeric attribute with missing values, indicating whether the value was missing or not. We also tried another approach for imputing missing values based on KNN.
On the large data set we discretized the 100 numerical variables that had the highest mutual information with the target into 10 bins, and added them as extra features.
We tried PCA on the large data set, but it did not seem to help.
Because we noticed that some of the most predictive attributes were not linearly correlated with the targets, we built shallow decision trees (2-4 levels deep) using single numerical attributes and used their predictions as extra features. We also built shallow decision trees using two features at a time and used their prediction as an extra feature in the hope of capturing some non-additive interactions among features.
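To make the preparation steps quoted above a little more concrete, here is a minimal base-R sketch of three of them: the missing-value indicator with mean imputation, the "top values plus other" binary coding, and the binning of a numeric variable. It is not the IBM team's code. The toy data frame, the variable names, the choice of k = 2 instead of ten, the three equal-width bins instead of ten, and the reading of "normalized by range" as dividing by (max - min) without shifting are all my own assumptions.

```r
# Toy data, invented purely for illustration.
d <- data.frame(
  x   = c(1, 4, NA, 10, 7, 4, 0),
  cat = c("a", "b", "a", NA, "c", "a", "b"),
  stringsAsFactors = FALSE
)

# Numeric column: flag missing values first, then impute with the mean.
d$x_missing <- as.integer(is.na(d$x))
d$x[is.na(d$x)] <- mean(d$x, na.rm = TRUE)

# Scale by the range without subtracting the minimum, so zeros stay zeros
# (one possible reading of "keeping the sparsity"; an assumption).
d$x_scaled <- d$x / (max(d$x) - min(d$x))

# Categorical column: treat missing as its own value, then keep one binary
# column per top-k value plus a single "other" column.
k <- 2  # the quoted text uses ten; two is enough for this toy data
d$cat[is.na(d$cat)] <- "<missing>"
top <- head(names(sort(table(d$cat), decreasing = TRUE)), k)
onehot <- sapply(top, function(v) as.integer(d$cat == v))
colnames(onehot) <- paste0("cat_", top)
other <- as.integer(!(d$cat %in% top))

# Equal-width binning of the numeric column (the source does not say
# whether the bins were equal-width or equal-frequency).
x_bin <- cut(d$x, breaks = 3, labels = FALSE)

prepared <- cbind(d[, c("x_scaled", "x_missing")], onehot,
                  cat_other = other, x_bin = x_bin)
print(prepared)
```

On real data the top-k cut can land on tied counts, in which case which values make the cut depends on how the ties are ordered, so a deliberate tie-breaking rule would be worth adding.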
JPM Chase Corporate Challenge 2009
We fielded a small but spirited team of nine runners. I finished with a decent (hand-stopped) time of 22 minutes and 27.93 seconds for the 3.5 miles -- or a 6:25 min/mile pace. That is among the faster times, but not quite the fastest, compared to the other six times I have run this.
Most importantly, everybody seems to have had a blast. And we did set a record for the longest post-race party, which sets a nice precedent for 2010.
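For anyone who wants to check the pace arithmetic, a few lines of R will do. The variable names are mine; the time and distance are the ones given above.

```r
# Hand-stopped finishing time converted to seconds.
total_seconds <- 22 * 60 + 27.93
distance_miles <- 3.5

# Seconds per mile, then split into whole minutes and whole seconds.
pace_seconds <- total_seconds / distance_miles
pace <- sprintf("%d:%02d",
                as.integer(pace_seconds %/% 60),
                as.integer(floor(pace_seconds %% 60)))
print(pace)
```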
The R Journal, Issue 1 Volume 1
R tips: Use read.table instead of strsplit to split a text column into multiple columns
Someone on the R-help mailing list had a data frame with a column containing IP addresses in dotted-quad format (e.g. 1.10.100.200). He wanted to sort by this column and I proposed a solution involving strsplit. But Peter Dalgaard came up with a much nicer method using read.table on a textConnection object:
> a <- data.frame(cbind(color=c("yellow","red","blue","red"),
status=c("no","yes","yes","no"),
ip=c("162.131.58.26","2.131.58.16","2.2.58.10","162.131.58.17")))
> con <- textConnection(as.character(a$ip))
> o <- do.call(order,read.table(con, sep="."))
> close(con)
> a[o,]
   color status            ip
3   blue    yes     2.2.58.10
2    red    yes   2.131.58.16
4    red     no 162.131.58.17
1 yellow     no 162.131.58.26
That is very, very neat! Thank you Peter.
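For comparison, here is a small sketch that puts the two approaches side by side on the same four addresses. The strsplit version is my reconstruction of the general idea, not necessarily what was posted to the list. The read.table version uses the text= argument, which I am assuming is available (it was added to read.table in later R releases) and which saves opening and closing the textConnection by hand.

```r
ip <- c("162.131.58.26", "2.131.58.16", "2.2.58.10", "162.131.58.17")

# strsplit approach: split on the literal dot, convert each piece to an
# integer, and stack the results into a matrix with one row per address.
parts <- do.call(rbind, lapply(strsplit(ip, ".", fixed = TRUE), as.integer))
o_strsplit <- do.call(order, as.data.frame(parts))

# read.table approach: let read.table parse the four octets into columns,
# then order on all columns from left to right.
octets <- read.table(text = ip, sep = ".")
o_readtable <- do.call(order, octets)

# Both give the same permutation.
print(ip[o_readtable])
```

Either way, the important point is that the octets are compared as numbers; sorting the raw strings would put 162.131.58.17 ahead of 2.2.58.10.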
R Journal 1/1
Accessing Soil Survey Data via Web-Services
Online Querying of NRCS Soil Survey Data
Sometimes you are only interested in soils data for a single map unit, component, or horizon. In these cases downloading the entire survey from Soil Data Mart is not worth the effort. An online query mechanism would suffice. The NRCS provides a form-based, interactive querying mechanism and a SOAP-based analogue. These services allow soil data lookup from the current snapshot of all data stored in NASIS.
Making Sense of Large Piles of Soils Information: Soil Taxonomy
Western Fresno Soil Hierarchy: partial view of the hierarchy within the US Soil Taxonomic system
Soil Data
Field and lab characterization of soil profile data result in the accumulation of a massive, multivariate and three-dimensional data set. Classification is one approach to making sense of a large collection of this type of data. US Soil Taxonomy is the primary soil classification system used in the U.S.A. and many other countries. This system is hierarchical in nature, and makes use of the presence or absence of diagnostic soil features. A comprehensive discussion of Soil Taxonomy is beyond the scope of this post. A detailed review of Soil Taxonomy can be found in Buol, S. W.; Graham, R. C.; McDaniel, P. A. & Southard, R. J. Soil Genesis and Classification. Iowa State Press, 2003.
