Bot Botany – K-Means and ggplot2

September 2, 2010 · Posted in R bloggers · Comment 

So if you had a robot that was an expert at botany - would you have a bot botanist?  Among other things, it would need to to distinguish flowers through vision and image processing, and be able to classify various kinds of plants based upon specific characteristics.  What do both of these requirements have in common?   They can be done using the k-means clustering.  Image segmentation can be used to allow our robot to recognize objects.  Based upon petal and sepal size, it could determine say - the species of an iris.  The well-known iris data set has been featured in other posts.

K-Means in R
If you look up the k-means algorithm online or in a reference book, you will be met with a flurry a mathematical symbols and formal explanations.  The basic principal (informally stated) is rather simple... given set of observations (picture a scatter plot of points), and a number of groups or clusters that you wish to group them in, the k-means algorithm finds the center of each group and associates observations with the groups with the "closest" center.

To use k-means in R, call the kmeans function with a matrix of values and the number of centers.  The function seeks to partition the points into k groups (the number of centers) such that the sum of squares from points to the assigned cluster centers is minimized.  Each observation (point) belongs to the cluster with the nearest mean.

How-To
To start, we will copy the iris data set to a separate data frame.  Not strictly speaking necessary, but makes it easier me to reflexively enter df whenever the data frame is in view.  Next we create a matrix object containing only the Petal Length and Width.


df=iris
m=as.matrix(cbind(df$Petal.Length, df$Petal.Width),ncol=2)

Now we will do the actual clustering.


cl=(kmeans(m,3))

Simple eh?  The cl object contains a number of interesting attributes associated with the model.


cl$size
cl$withinss


Next we do a bit of data formatting and preparation for subsequent calls to graph the data.  Notice that we add the cluster information back to our original data frame.  This is a good organization of the data and also a requirement for working with ggplot2 which is designed to use data frames.

df$cluster=factor(cl$cluster)
centers=as.data.frame(cl$centers)

The following graph color codes the points by cluster.  We also add the centers and a semi transparent halo around the center to emphasize the place of the center... and its role in classifying the observations into clusters.

library(ggplot2)

ggplot(data=df, aes(x=Petal.Length, y=Petal.Width, color=cluster )) + 
 geom_point() + 
 geom_point(data=centers, aes(x=V1,y=V2, color='Center')) +
 geom_point(data=centers, aes(x=V1,y=V2, color='Center'), size=52, alpha=.3, legend=FALSE)



This plot is an interesting example of how several different sets of data (in this case the actual observations as well as the centers) in separate data frames can be included in a single ggplot2 chart.

Misclassified Observations
The models is pretty accurate, but not perfect.  The following SQL statement highlights the few misclassified observations:

sqldf('select Species, cluster, count(*) from df group by Species, Cluster')



     Species cluster count(*)
1     setosa       2       50
2 versicolor       1       48
3 versicolor       3        2
4  virginica       1        6
5  virginica       3       44

So we grab the outliers into their own data frame....

df2 = sqldf('select * from df where (Species || cluster) in 
             (select Species || cluster from df group by Species, Cluster having count(*) < 10)')


Now we can enhance the previous graph to put a diamond around misclassified points.


last_plot() +  geom_point(data=df2, aes(x=Petal_Length, y=Petal_Width, shape=5, alpha=.7, size=4.5), legend=FALSE) 


And so with a bit of Data Mining knowledge and the R programming language, even our machines can stop and smell the roses...

(Follow the link for other posts by C)

Taking R to the Limit, Part II – Large Datasets in R

August 20, 2010 · Posted in R bloggers · Comments Off 

Rlogo

For Part I, Parallelism in R, click here.

Tuesday night I again had the opportunity to present on high performance computing in R, at the Los Angeles R Users’ Group. This was the second part of a two part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R. I also tied in MapReduce into the talk. Unfortunately, there was too much material and I had originally planned to cover Rhipe, using R on EC2 and sparse matrix libraries.

Slides

My edited slides are posted on SlideShare, and available for download here.

Topics included:

  • bigmemory, biganalytics and bigtabulate
  • ff
  • HadoopStreaming
  • brief mention of Rhipe

Code

The corresponding demonstration code is here.

Data

Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data including trees and the famous iris dataset included in base R. To load these, just use the call library(iris) or library(trees).

Large datasets:

  • On-Time Airline Performance data from 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset for you called airline.csv in the directory in which it is executed. I would just post it here, but it is very large and I only have so much bandwidth!
  • The Twitter dataset appears to no longer be available. Instead, use anna.txt which comes with HadoopStreaming. Simply replace twitter.tsv with anna.txt.

Video

The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.

For Part I, Parallelism in R, click here.

(Follow the link for other posts by Ryan)

A brief introduction to “apply” in R

August 19, 2010 · Posted in R bloggers · Comments Off 

At any R Q&A site, you’ll frequently see an exchange like this one:

Q: How can I use a loop to [...insert task here...] ?
A: Don’t. Use one of the apply functions.

So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions.

If you fire up your R console, type “??apply” and scroll down to the functions in the base package, you’ll see something like this:

base::apply             Apply Functions Over Array Margins
base::by                Apply a Function to a Data Frame Split by Factors
base::eapply            Apply a Function Over Values in an Environment
base::lapply            Apply a Function over a List or Vector
base::mapply            Apply a Function to Multiple List or Vector Arguments
base::rapply            Recursively Apply a Function to a List
base::tapply            Apply a Function Over a Ragged Array

Let’s examine each of those.

1. apply
Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.”

OK – we know about vectors/arrays and functions, but what are these “margins”? Simple: either the rows (1), the columns (2) or both (1:2). By “both”, we mean “apply the function to each individual value.” An example:

# create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# mean of the rows
apply(m, 1, mean)
 [1]  6  7  8  9 10 11 12 13 14 15
# mean of the columns
apply(m, 2, mean)
[1]  5.5 15.5
# divide all values by 2
apply(m, 1:2, function(x) x/2)
      [,1] [,2]
 [1,]  0.5  5.5
 [2,]  1.0  6.0
 [3,]  1.5  6.5
 [4,]  2.0  7.0
 [5,]  2.5  7.5
 [6,]  3.0  8.0
 [7,]  3.5  8.5
 [8,]  4.0  9.0
 [9,]  4.5  9.5
[10,]  5.0 10.0

That last example was rather trivial; you could just as easily do “m[, 1:2]/2″ – but you get the idea.

2. by
Description: “Function ‘by’ is an object-oriented wrapper for ‘tapply’ applied to data frames.”

The by function is a little more complex than that. Read a little further and the documentation tells you that “a data frame is split by row into data frames subsetted by the values of one or more factors, and function ‘FUN’ is applied to each subset in turn.” So, we use this one where factors are involved.

To illustrate, we can load up the classic R dataset “iris”, which contains a bunch of flower measurements:

attach(iris)
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

# get the mean of the first 4 variables, by species
by(iris[, 1:4], Species, mean)
Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.936        2.770        4.260        1.326
------------------------------------------------------------
Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       6.588        2.974        5.552        2.026

Essentially, by provides a way to split your data by factors and do calculations on each subset. It returns an object of class “by” and there are many, more complex ways to use it.

3. eapply
Description: “eapply applies FUN to the named values from an environment and returns the results as a list.”

This one is a little trickier, since you need to know something about environments in R. An environment, as the name suggests, is a self-contained object with its own variables and functions. To continue using our very simple example:

# a new environment
e <- new.env()
# two environment variables, a and b
e$a <- 1:10
e$b <- 11:20
# mean of the variables
eapply(e, mean)
$b
[1] 15.5

$a
[1] 5.5

I don’t often create my own environments, but they’re commonly used by R packages such as Bioconductor so it’s good to know how to handle them.

4. lapply
Description: “lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.”

That’s a nice, clear description which makes lapply one of the easier apply functions to understand. A simple example:

# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# the mean of the values in each element
lapply(l, mean)
$a
[1] 5.5

$b
[1] 15.5

# the sum of the values in each element
lapply(l, sum)
$a
[1] 55

$b
[1] 155

The lapply documentation tells us to consult further documentation for sapply, vapply and replicate. Let’s do that.

    4.1 sapply
Description: “sapply is a user-friendly version of lapply by default returning a vector or matrix if appropriate.”

That simply means that if lapply would have returned a list with elements $a and $b, sapply will return either a vector, with elements [['a']] and [['b']], or a matrix with column names “a” and “b”. Returning to our previous simple example:

# create a list with 2 elements
l <- list(a = 1:10, b = 11:20)
# mean of values using sapply
l.mean <- sapply(l, mean)
# what type of object was returned?
class(l.mean)
[1] "numeric"
# it's a numeric vector, so we can get element "a" like this
l.mean[['a']]
[1] 5.5

    4.2 vapply
Description: “vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.”

A third argument is supplied to vapply, which you can think of as a kind of template for the output. The documentation uses the fivenum function as an example, so let’s go with that:

l <- list(a = 1:10, b = 11:20)
# fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
class(l.fivenum)
[1] "matrix"
# let's see it
l.fivenum
           a    b
Min.     1.0 11.0
1st Qu.  3.0 13.0
Median   5.5 15.5
3rd Qu.  8.0 18.0
Max.    10.0 20.0

So, vapply returned a matrix, where the column names correspond to the original list elements and the row names to the output template. Nice.

    4.3 replicate
Description: “replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).”

The replicate function is very useful. Give it two mandatory arguments: the number of replications and the function to replicate; a third optional argument, simplify = T, tries to simplify the result to a vector or matrix. An example – let’s simulate 10 normal distributions, each with 10 observations:

replicate(10, rnorm(10))
             [,1]        [,2]        [,3]       [,4]        [,5]         [,6]
 [1,]  0.67947001 -1.94649409  0.28144696  0.5872913  2.22715085 -0.275918282
 [2,]  1.17298643 -0.01529898 -1.47314092 -1.3274354 -0.04105249  0.528666264
 [3,]  0.77272662 -2.36122644  0.06397576  1.5870779 -0.33926083  1.121164338
 [4,] -0.42702542 -0.90613885  0.83645668 -0.5462608 -0.87458396 -0.723858258
 [5,] -0.73892937 -0.57486661 -0.04418200 -0.1120936  0.08253614  1.319095242
 [6,]  2.93827883 -0.33363446  0.55405024 -0.4942736  0.66407615 -0.153623614
 [7,]  1.30037496 -0.26207115  0.49818215  1.0774543 -0.28206908  0.825488436
 [8,] -0.04153545 -0.23621632 -1.01192741  0.4364413 -2.28991601 -0.002867193
 [9,]  0.01262547  0.40247248  0.65816829  0.9541927 -1.63770154  0.328180660
[10,]  0.96525278 -0.37850821 -0.85869035 -0.6055622  1.13756753 -0.371977151
             [,7]        [,8]       [,9]       [,10]
 [1,]  0.03928297  0.34990909 -0.3159794  1.08871657
 [2,] -0.79258805 -0.30329668 -1.0902070  0.73356542
 [3,]  0.10673459 -0.02849216  0.8094840  0.06446245
 [4,] -0.84584079 -0.57308461 -1.3570979 -0.89801330
 [5,] -1.50226560 -2.35751419  1.2104163  0.74650696
 [6,] -0.32790991  0.80144695 -0.0071844  0.05742356
 [7,]  1.36719970  2.34148354  0.9148911  0.20451421
 [8,] -0.51112579 -0.53658159  1.5194130 -0.94250069
 [9,]  0.52017814 -1.22252527  0.4519702  0.08779704
[10,]  1.35908918  1.09024342  0.5912627 -0.20709053

5. mapply
Description: “mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each (…) argument, the second elements, the third elements, and so on.”

The mapply documentation is full of quite complex examples, but here’s a simple, silly one:

l1 <- list(a = c(1:10), b = c(11:20))
l2 <- list(c = c(21:30), d = c(31:40))
# sum the corresponding elements of l1 and l2
mapply(sum, l1$a, l1$b, l2$c, l2$d)
 [1]  64  68  72  76  80  84  88  92  96 100

Here, we sum l1$a[1] + l1$b[1] + l2$c[1] + l2$d[1] (1 + 11 + 21 + 31) to get 64, the first element of the returned list. All the way through to l1$a[10] + l1$b[10] + l2$c[10] + l2$d[10] (10 + 20 + 30 + 40) = 100, the last element.

6. rapply
Description: “rapply is a recursive version of lapply.”

I think “recursive” is a little misleading. What rapply does is apply functions to lists in different ways, depending on the arguments supplied. Best illustrated by examples:

# let's start with our usual simple list example
l <- list(a = 1:10, b = 11:20)
# log2 of each value in the list
rapply(l, log2)
      a1       a2       a3       a4       a5       a6       a7       a8
0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
      a9      a10       b1       b2       b3       b4       b5       b6
3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
      b7       b8       b9      b10
4.087463 4.169925 4.247928 4.321928
# log2 of each value in each list
rapply(l, log2, how = "list")
$a
 [1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
 [9] 3.169925 3.321928

$b
 [1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
 [9] 4.247928 4.321928

# what if the function is the mean?
rapply(l, mean)
   a    b
 5.5 15.5

rapply(l, mean, how = "list")
$a
[1] 5.5

$b
[1] 15.5

So, the output of rapply depends on both the function and the how argument. When how = “list” (or “replace”), the original list structure is preserved. Otherwise, the default is to unlist, which results in a vector.

You can also pass a “classes=” argument to rapply. For example, in a mixed list of numeric and character variables, you could specify that the function act only on the numeric values with “classes = numeric”.

7. tapply
Description: “Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.”

Woah there. That sounds complicated. Don’t panic though, it becomes clearer when the required arguments are described. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”.

So, to go back to the famous iris data, “Species” might be a factor and “iris$Petal.Width” would give us a vector of values. We could then run something like:

attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
    setosa versicolor  virginica
     1.462      4.260      5.552

Summary
I’ve used very simple examples here, with contrived data and standard functions (such as mean and sum). For me, this is the easiest way to learn what a function does: I can look at the original data, then the result and figure out what happened. However, the “apply” family is a much more powerful than these illustrations – I encourage you to play around with it.

The things to consider when choosing an apply function are basically:

  • What class is my input data? – vector, matrix, data frame…
  • On which subsets of that data do I want the function to act? – rows, columns, all values…
  • What class will the function return? How is the original data structure transformed?

It’s the usual input-process-output story: what do I have, what do I want and what lies inbetween?


Filed under: computing, R, statistics Tagged: r-project, statistics, tutorial

(Follow the link for other posts by nsaunders)

Iris Data Set Visualization Web App in < 100 LOC

August 7, 2010 · Posted in R bloggers · Comments Off 



The iris data set pops up pretty regularly in statistical literature.  It consists of 50 records from three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor).   I came across it recently while reading Introduction to Data Mining.   It comes up in several places in the book to demonstrate techniques for visualization and classification.  There has been a number of articles, posts and videos about R and the web in recent times.  This post presents a way of creating some plots for the data set using R and a Ruby Web Application.

R and the Web
There are a number of situations where it would make sense to expose a data set and wrap a certain amount of R functionality within a web application.  Non-R users might need access to the data.  You might want to provide a presentation of findings available through the web.  You might even want to collaborate with other R developers by posting an HTML table that they can read in using XML.  In time, I expect that some standard web application frameworks will emerge to fill in the gap.  And so the current application is some “thinking out loud” on my part in that direction.

Prerequisites
In order to run this application on your machine, R and Ruby must be installed and functional.   The R packages ggplot2, R2HTML, RServe are used as well as the iris data set.  This web app was written and tested in Windows, but should run in *nix with small modifications.

Ruby Configuration
Install the package to allow communication with RServe

    gem install rserve-client

On *nix systems folks often sudo to install.

If for some reason you do not go the normal route of installing a gem (e.g. you downloaded from github at http://github.com/clbustos/Rserve-Ruby-client), make sure that your ruby $LOAD_PATH has the library available when you run the program.

You can even do this in line in the ruby program by including a line like the following at the beginning of the program:

$LOAD_PATH<<'C:\clbustos-Rserve-Ruby-client-v0.2.4-2-g47b0da7\lib'

The application itself is available on github.
Run the Web Application

Start Rserve.
  C:\Program Files\R\R-2.10.1\library\Rserve> Rserve

You should see output like this if it started successfully.
 Rserve: Ok, ready to answer queries.

Start the Ruby web application - specify a port if you like with the -p option.
 iris_data_set_webapp.rb -p 4445

With these steps complete, you should be able to hit the application at http://localhost:4445.  Three links are available.  The r_version link simply demonstrates that Ruby (through the sinatra framework and Rserve client) can communicate with R.  Clicking this link causes the version of R to display in the browser.

The second link is to the iris data itself.  This page displays a formatted HTML table rendered using the R2HTML package.  Admittedly, this is a bit of a confusion of concerns (view information be provided by R) but it provided a convenient mechanism to convert a data frame to an HTML table.

The third link allows you to modify the aesthetics of the plot.  Specifically, the x, y and color can be set to any of the available variables.  The result is a "grammatically correct" chart.

Code Walkthrough and Commentary
The package declarations could be written out
   require 'rubygems'
   require 'sinatra'
    ...

Instead all of the packages are included in an array (surrounded by brackets).  Then each require directive is issued as we iterate through each element in the array..

 ['rubygems', 'sinatra', 'rserve','fileutils','haml'].each{|r|require r}

The packages being used are

rubygems - the ruby packaging system itself
sinatra - a minimal web app DSL
rserve - to integrate with R
fileutils  - some convenience methods for file system access
haml - well, this one requires some explanation...

HAML is one of the many Ruby mark up/templating languages that is in vogue today among Rubyists.  It seems to save a few keystrokes from writing straight HTML, but it slows me down since I think in HTML and end up working backwards to writing the HAML.  I kind of like the pythonesque interpretation of indentation being meaningful and that the HTML looks pretty.

Anyway, it is used here but I am still on the fence about it.

To experiment with haml using irb, just require haml, create an engine and output the results to HTML.

require 'haml'
Haml::Engine.new('%h3 hello world').to_html

Back to the web app.  Create a global connection to Rserve.

include Rserve
$c = Connection.new

 The following lines kinda-sorta reload Sinatra most if the time which allows  you to change code and view the changes without starting and stopping the server.  Only it does not always work  :) …but it works enough for me that I included it and just restart if things are not updating the way I expect.

configure do
  Sinatra::Application.reset!
  use Rack::Reloader
end

This looks at a line that comes from the web app source file itself.  Yep, kind of wild (echoes of camping).  If the line matches the regexp  and is one of the get functions below (other than the index itself), we pull out the url path and slap it in an HTML anchor.  This is a convenient way to have a home screen during development where each get URL can be invoked.

def anchor(line)

  if line=~/get \'\/([a-z|A-Z])/
     l=line.split[1].gsub("'",'');
     haml "%a{:href => '#{l}'}> #{l} \n%br"    
  end
  
end

Return a string of html with an heading that says “Links” and a link to each “get” URL available in the web application.  The list of links is generated by reading the contents of this file, and creating a hyperlink (if possible) using the “anchor” method above.

get '/' do
  html=haml '%h3 Links'
  File.open(__FILE__).readlines.each{|l|html+=anchor(l).to_s}
  html
end

This is a simple example of how integration with R works.   The connection to RServe named $c is sent a string of R code to evaluate.  We expect a single result that we interpret as a string.

get '/r_version' do
  $c.eval("R.version.string").as_string  
end


This method creates an R script, evaluates it and returns a link to an image that will appear in the public directory that is in the same directory with this file.  The < < SCRIPT syntax is sometimes called a heredoc.  It is just a convenient way to create multi line strings - you could use double quotes in this context as well.  The variables x, y and color that are passed in are substituted  where you see #{x}, #{y}, #{color}.



def irisplot(x,y,color)
    script= < < SCRIPT
library(ggplot2)
  ggplot(
          data=iris, 
           aes(x=#{x}, y=#{y}, color=#{color})
         ) + geom_point()
   ggsave('#{FileUtils.pwd}/public/irisplot.png')
SCRIPT


  $c.eval(script)

  " < img src='irisplot.png' width='600', height='600' > "
  

end

Note: The spaces between the less than signs for the HEREDOC are artificial - they were required because blogger was not correctly interpreting them together.  Similar problem with the image tag - I just added spaces to prevent rendering issues.

This example demonstrates how to open an independent R script and run it.  See the iris.R script itself for more information about what is going on.  In general, the R2HTML package is being used to create a file whos handle is returned.  We then read the contents of the file in and these are returned as HTML.  The contents of the file are an HTML table that represents the iris data frame.

get '/iris_data' do  
  url=$c.eval(File.open('iris.R').readlines.join("\n")).as_string
  File.open(url).readlines()
end

This page can be scraped using R and two lines of code.  You could read this data into R running on another computer on the network:

library(XML)
df=readHTMLTable('http://nameofmachine:4445/iris')

The following creates an iris plot using the parameters passed in.

post '/plot' do
  irisplot(params['x'],params['y'],params['color'])
end

Finally, returns the form that allows you to input which fields are used to create a plot for the iris data set.

get '/iris_plot_input' do


  # Retrieve the iris data set column names into a ruby class variable.
  # These will be used to populate dropdowns.
  @colnames=$c.eval('data(iris);colnames(iris);').as_strings
  
  # WARNING Hard Coded Defaults below.  I used these so that 
  # we would have reasonable values by default.


  # Put all of the HAML markup in a string
html=<
%form{ :action => "/plot", :method => "post"}
  %table
    %tr
      %td
        %label{:for => "name"} x:
      %td  
        %select{:name=>'x'}  
          = @colnames.each do |col|
            %option{:value=> col, :selected => (col == 'Sepal.Length')} 
              =col
    %tr
      %td    
        %label{:for => "name"} y:    
      %td  
        %select{:name=>'y'}  
          = @colnames.each do |col|
            %option{:value=> col, :selected => (col == 'Sepal.Width')} 
              =col
    %tr
      %td    
        %label{:for => "name"} color:
      %td  
        %select{:name=>'color'}  
          = @colnames.each do |col|
            %option{:value=> col, :selected => (col == 'Species')} 
              =col
    %tr
      %td
        %input{:type => "submit", :value => "Create Plot"}
HAML


# Render it with the HAML engine
haml html


end

Examples produced by the app:



This application is amazing in that it simply pulls together some of the best programming resources around.  In well under 100 lines of code it is simple and easy to maintain.  With that in mind, I have been thinking about other directions that could be used to generalize this approach.  One is simply to bundle Sinatra with R (perhaps using JRuby).  Sinatra web apps could then be dynamically based upon data sets (kind of like the current app) or around R functions (kind of like the fgui package).  It seems like Hadley Wickham had a similar idea first and has a related project on Github. His approach is to port Sinatra to R so that web apps could be developed in R without the use of another language such as Ruby.


(Follow the link for other posts by C)

Thinking about Graphs

July 30, 2010 · Posted in R bloggers · Comments Off 
A recent Wall Street Journal article ruminated about the degree that language shapes thought (rather than the other way around).  This idea has rather profound implications in the more specific domain of programming languages. We initially learn a programming language but later “think” in terms of the language. 


To some degree, we are constrained in our ability to solve problems if we only know a single language. This situation has been recognized different ways by the programming community. The Logo programming language was built based upon constructionist learning theory and was intended to provide a “mental model” for children to come to understand mathematical constructs. In recent times, many programmers have committed to being polyglots, learning new languages as a part of professional development. Their concern is not always to learn the latest language that they will need to work, but to find out new ways of conceptualizing problems and structuring solutions. This leads to a more subtle goal of ggplot2.  


The ggplot2 package is appealing because it makes it possible to quickly create appealing graphs and charts. However, it is based upon an underlying “grammar of graphics”. This “Grammar” serves a number of purposes. It provides a structure for the API implementation. The API is designed so that you specify what you want rather than how to create it.


Another, perhaps more subtle effect is that it also can influence the way that an R programmer thinks about creating a graph. With this in mind, it is helpful to “think through” the process of creating a chart in the terms presented by ggplot2 in a more disciplined fashion.


Components of a Plot

According Hadley Wickham (the author of ggplot and the ggplot book), the following components make up a plot:
  • Data
  • Aesthetic Mappings
  • Geometric Objects
  • Statistical Transformations
  • Position Adjustment
  • Faceting
  • Coordinate System
The Reference Manual is also organized around these components:
  • Geoms (Geometric Objects)
  • Statistics (Statistical Transformations)
  • Scales
  • Coordinate System
  • Faceting
  • Position Adjustment 
He has organized the material slightly differently in a presentation at Vanderbilt.
  • Default Data Set
  • Set of Aesthetic Mappings
  • Multiple Layers (points, jittered points, box plots, histogram
  • Scale for Each Aesthetic
  • Faceting Specification
  • Coordinate System

In this case a layer comprises several of the elements listed earlier.
  • Data set and Aesthetic Mapping
  • Geometric Object.
  • Statistics
  • Position Adjustment

Data is not included as a part of ggplot2.  In addition, algebra (from a component identified by Wilkinson) is not included as it in the realm of data transformation rather than actual chart creation.  


The individual components of the grammar are fairly well defined regardless of where they appear on a list. The possible interactions between the components are rather complex. The construction of traditional charts are defined by a distinct combination of components. For example, the combination of geom and a stat is significant. At other times, the coordinate system is a defining factor. (A pie chart is a one column stacked bar chart that is mapped to a polar coordinate system).





Chart               Geom              Stat                 Coordinate System
Scatterplot        point                identity              cartesian
Histogram         bar                   bin                    cartesian
Pie Chart          bar                   identity              polar




Iris Data Set
The iris data set is a well known set of multivariate data introduced by Ronald Fisher in the 1930s. The first few rows of the set are as follows:


Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa


6          5.4         3.9          1.7         0.4  setosa


The charts below will use the following components:


  • Data - iris data set
  • Mapping to aesthetic
    • x - Petal.Length
    • y - Petal.Width
    • Color - Species
  • Geometric Object - point

A scatterpoint that includes these components can be created using qplot.  However, see Harlan's comment below (and this his blog which I appear to be echoing) - this is probably not the best way to start if one wants to "think" in the grammar rather than simply produce a good looking graph in the smallest number of keystrokes. Better to use ggplot as demonstrated later in the post.


library(ggplot2)
qplot(Petal.Length, Petal.Width, data=iris, color=Species)

The main decision that needed to be made to construct this call was how to map the aesthetics.  It is important to consider whether each variable being mapped is discrete or continuous to create a meaningful (and not just grammatically correct) result. 


                        Discrete                      Continuous
Color                Distinct color                Gradient (red to blue)
Size                  Distinct steps                Radius based on value
Shape               Distinct Shape               N/A


We can even include more information by mapping another attribute (sepal area – derived from sepal length times width) to size.


qplot(Petal.Length, 
  Petal.Width, 
  data=iris, 
  size=Sepal.Length * Sepal.Width, 
  color=Species) 

A great deal of information can be encoded using the various data attributes.  The plot gives an some indication regarding the petal length and width (based upon the position), species (based upon color) and sepal area (based upon size).  However, not every value is clearly in view.  There are a few changes that might provide an indication that values might overlap.  A jitter is might be used.  A better alternative is to set an alpha value provides a degree of transparency.  





qplot(Petal.Length, 
   Petal.Width, 
   data=iris, 
   size=Sepal.Length * Sepal.Width, 
   color=Species, 
   alpha=0.3) 


We can be more explicit about what is going on using ggplot rather than qplot.  The basic scatterplot can be created and in this case will be stored in a variable.





p = ggplot(data=iris, 
   aes(Petal.Length, 
       Petal.Width, 
       color=Species)
   ) + geom_point()




With the original plot in a variable, we can add components and immediately see their effect as it is rendered.  A line might help discern a trend in the original scatterplot. When applying a stat, you need to – well -  think statistically.  Consider the following.





p + stat_abline()

The line created doesn’t mean much – this is because it is simply a line with a slope of 1 and intercept of zero.  A more meaningful line can be created by determining the line of best fit.


coef(lm(Petal.Width ~ Petal.Length, data=iris))

# this returns 
# (Intercept) Petal.Length
#  -0.3630755    0.4157554


p + 
stat_abline(intercept=-0.363, 
            slope=0.416, 
            color='purple')



So calling a given stat did something in this case.  To get it to do something meaningful required additional work.  


Distinction between Grammar Components
The distinction between a statistic and geometric object is not always clear (at least in terms of the ggplot2 API).   A line with a slope and intercept might be though of as a statistic or a geometric object.


p + geom_abline(intercept=-0.363,slope=0.416, color='purple')


Likewise a position adjustment (like a jitter) can be thought of as both a geometric and positional terms.


qplot(Petal.Length, 
      Petal.Width, 
      data=iris, 
      position='jitter') + 
geom_abline(intercept=-0.363, slope=0.416)

qplot(Petal.Length, Petal.Width, data=iris) + 
geom_jitter()


I point out these idiosyncrasies because – as with many formal abstractions of real world concepts – edge cases exist.  For example, in Western music theory, one dutifully learns the rules of counterpoint only to find out that they are not always observed by composers in practice and that certain constructs are not easily classified.  This doesn’t eliminate the usefulness to studying music theory.  It simply highlights the difficulty in neatly categorizing every aspect of a specific creation in an accurate an meaningful way.  And for what its worth, I think that Hadley Wickham as done a marvelous job – and appears to have taken an approach of providing an interface to underlying functionality when it appears in more than one category.



Order of Application
Note that the order in which geoms and stats are applied matters!  For instance:


p+geom_boxplot()




The boxplot obscures the original points.  These can be added back on after applying the boxplot.


scap+geom_boxplot()+geom_point()




This gives a glimpse of the flexibility and sophistication of the system.  The fundamental elements of chart design that comprise the grammar can be combined in new and flexible ways.  Not every grammatically correct possibility is aesthetically pleasing or accurate as interpreted by human perception.  But ggplot2 is worth learning not only for its own sake, but for the insights it can provide into the creative activity of constructing charts and graphs.
           















(Follow the link for other posts by C)

New versions for ggplot2 (0.8.8) and plyr (1.0) were released today

July 6, 2010 · Posted in R bloggers · Comments Off 

As prolific as the CRAN website is of packages, there are several packages to R that succeeds in standing out for their wide spread use (and quality), Hadley Wickhams ggplot2 and plyr are two such packages.
plyr image
And today (through twitter) Hadley has updates the rest of us with the news:

just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats

Going to the CRAN website shows that plyr has gone through the most major update, with the last update (before the current one) taking place on 2009-06-23. And now, over a year later, we are presented with plyr version 1, which includes New functions, New features some Bug fixes and a much anticipated Speed improvements.
ggplot2, has made a tiny leap from version 0.8.7 to 0.8.8, and was previously last updated on 2010-03-03.

Me, and I am sure many R users are very thankful for the amazing work that Hadley Wickham is doing (both on his code, and with helping other useRs on the help lists). So Hadley, thank you!

Here is the complete change-log list for both packages:

plyr 1.0 (2010-07-02) —————————————————

(taken from the CRAN website)
New functions:

* arrange, a new helper method for reordering a data frame.
* count, a version of table that returns data frames immediately and that is
much much faster for high-dimensional data.
* desc makes it easy to sort any vector in descending order
* join, works like merge but can be much faster and has a somewhat simpler
syntax drawing from SQL terminology
* rbind.fill.matrix is like rbind.fill but works for matrices, code
contributed by C. Beleites

Speed improvements

* experimental immutable data frame (idata.frame) that vastly speeds up
subsetting – for large datasets with large numbers of groups, this can yield
10-fold speed ups. See examples in ?idata.frame to see how to use it.
* rbind.fill rewritten again to increase speed and work with more data types
* d*ply now much faster with nested groups

New features:

* d*ply now accepts NULL for splitting variables, indicating that the data
should not be split
* plyr no longer exports internal functions, many of which were causing
clashes with other packages
* rbind.fill now works with data frame columns that are lists or matrices
* test suite ensures that plyr behaviour is correct and will remain correct
as I make future improvements.

Bug fixes:

* **ply: if zero splits, empty list(), data.frame() or logical() returned,
as appropriate for the output type
* **ply: leaving .fun as NULL now always returns list
(thanks to Stavros Macrakis for the bug report)
* a*ply: labels now respect options(stringAsFactors)
* each: scoping bug fixed, thanks to Yasuhisa Yoshida for the bug report
* list_to_dataframe is more consistent when processing a single data frame
* NAs preserved in more places
* progress bars: guaranteed to terminate even if **ply prematurely terminates
* progress bars: misspelling gives informative warning, instead of
uninformative error
* splitter_d: fixed ordering bug when .drop = FALSE

ggplot2 0.8.8 (2010-07-02) —————————————-

(taken from the CRAN website)

Bug fixes:

* coord_equal finally works as expected (thanks to continued prompting from Jean-Olivier Irisson)
* coord_equal renamed to coord_fixed to better represent capabilities
* coord_polar and coord_polar: new munching system that uses distances (as defined by the coordinate system) to figure out how many pieces each segment should be broken in to (thanks to prompting from Jean-Olivier Irisson)
* fix ordering bug in facet_wrap (thanks to bug report by Frank Davenport)
* geom_errorh correctly responds to height parameter outside of aes
* geom_hline and geom_vline will not impact legend when used for fixed intercepts
* geom_hline/geom_vline: intercept values not set quite correctly which caused a problem in conjunction with transformed scales (reported by Seth Finnegan)
* geom_line: can now stack lines again with position = “stack” (fixes #74)
* geom_segment: arrows now preserved in non-Cartesian coordinate system (fixes #117)
* geom_smooth now deals with missing values in the same way as geom_line (thanks to patch from Karsten Loesing)
* guides: check all axis labels for expressions (reported by Benji Oswald)
* guides: extra 0.5 line margin around legend (fixes #71)
* guides: non-left legend positions now work once more (thanks to patch from Karsten Loesing)
* label_bquote works with more expressions (factors now cast to characters, thanks to Baptiste Auguie for bug report)
* scale_color: add missing US spellings
* stat: panels with no non-missing values trigged errors with some statistics. (reported by Giovanni Dall’Olio)
* stat: statistics now also respect layer parameter inherit.aes (thanks to bug report by Lorenzo Isella and investigation by Brian Diggs)
* stat_bin no longer drops 0-count bins by default
* stat_bin: fix small bug when dealing with single bin with NA position (reported by John Rauser)
* stat_binhex: uses range of data from scales when computing binwidth so hexes are the same size in all facets (thanks to Nicholas Lewin-Koh for the bug report)
* stat_qq has new dparam parameter for specifying distribution parameters (thanks to Yunfeng Zhang for the bug report)
* stat_smooth now uses built-in confidence interval (with small sample correction) for linear models (thanks to suggestion by Ian Fellows)
* sta

(Follow the link for other posts by Tal Galili)

Make R speak SQL with sqldf

July 5, 2010 · Posted in R bloggers · 2 Comments 


The R community is unique as programming communities go.  Many users of R come from academia and have a relatively extensive mathematical background.  The R community has developed in relative isolation from some other areas of programming that have been widely adopted by business.  To many business users, working with data is synonymous with dealing with a relational database system (RDBMS).  Yet none of the R books that I read use relational databases at all (and online resources on the subject are limited).

It is not that such integration is not possible, there are many packages on CRAN that relate to databases.  Oracle has even integrated its data mining package with R using RODM in recent days.  There is a rather obvious overlap of concerns between those in the RDMS crowd and R users.  The widespread use of RDBMS systems appears to be restricted more to business environments.  This is changing somewhat in recent days as alternatives to relational models are explored - particularly when dealing with large data sets.

As someone coming to R from an RDBMS background, I noticed a significant difference in the terminology and concepts that exist concerning the manipulation of data.  R users reshape data frames while database users construct queries and return result sets.  In either case, we are dealing with the same basic idea - essentially a rectangular grid containing a set of values obtained from a data source.  And we have the same basic goal, modify the structure of the grid without changing the essential meaning of the underlying data.

There is tremendous value in learning the R terms and functions involved in manipulating data.  There are some things that are trivial to do in R that are a large amount of work in SQL.  In the meantime, there is a package that I found useful - that also might be useful to RDBMS developers - those who "think in SQL" but want to use R for additional statistical analysis or creation of visualizations.  It is sqldfan R package for runing SQL statements on data frames.


Examples speak for themselves in large part.


# Load the package
library(sqldf)


# Use the titanic data set

data(titanic3, package="PASWR")
colnames(titanic3)
head(titanic3)

Incidentally, I really appreciate the tradition in the statistics (and therefore R) communities of using well known published data sets to illustrate analysis and programming techniques.  This is not done consistently in the RDBMS community, although the Oracle community tends to use sample schemas (SCOTT/TIGER, HR, etc) and other RDBMS vendors have similar sample schemas for training.  There are not nearly as well know or consistently used as data sets in the statistics community (titanic, iris, etc).  We will use the titanic data set in this example.

As an RDBMS developer, I might be interested in knowing the number of people that are a particular age in the data set.  That would translate in my mind as selecting the count of each age listed in the titanic3 data set.  A GROUP BY clause would be required to produce a COUNT(*) for each age listed.

sqldf('select age, count(*) from titanic3 where age != "NA" group by age')

This is better expressed recognizing that NA and NULL correspond:


sqldf('select age, count(*) from titanic3 where age is not null group by age')


This returns a listing of results - very similar to what I would do in a RDBMS environment.  Of course in R, this sort of information might be handled differently.  Rather than having a listing that runs off the screen, I can view a histogram that reflects the entire distribution in a manner that is viewable in a small amount of space.  In this case

library(ggplot2)

DF=sqldf('select age from titanic3 where age != "NA"')
qplot(DF$age,data=DF, geom="histogram")


The GROUP BY summarization using SQL is one example of a relatively straightforward summarization available using SQL.  Lets say that I am now interested in focusing in on the survival rate of 29 year olds on the titanic.

DF=sqldf('select count(*) total from titanic3 where age=29 group by survived')
DF2=t(DF)

colnames(DF2)=c('Died','Survived')


Using the t function to perform a matrix transform which is essentially the same as producing a pivot of the original data.


      Died Survived
total   17       13


The sqldf package is a great resource for RDBMS folks who are working in R.  I am not sure that it actually provides any functionality that is not readily available in a more R-ish form, but it certainly provides a way of conceptualizing problems and implementing solutions that is familiar to database developers.  Otherwise stated:  it is not so matter whether R or SQL can express it but how easy it is to express. Some queries are simply much easier to express in one notation or the other.

The package will be particularly helpful when introducing R to database developers - and could be helpful when R and database developers are collaborating by providing a common means of communication.

(Follow the link for other posts by C)

Clustergram: visualization and diagnostics for cluster analysis (R code)

June 15, 2010 · Posted in R bloggers · Comments Off 

About Clustergrams

In 2002, Matthias Schonlau published in “The Stata Journal” an article named “The Clustergram: A graph for visualizing hierarchical and . As explained in the abstract:

In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases.
This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.

A similar article was later written and was (maybe) published in “computational statistics”.

Both articles gives some nice background to known methods like k-means and methods for hierarchical clustering, and then goes on to present examples of using these methods (with the Clustergarm) to analyse some datasets.

Personally, I understand the clustergram to be a type of parallel coordinates plot where each observation is given a vector. The vector contains the observation’s location according to how many clusters the dataset was split into. The scale of the vector is the scale of the first principal component of the data.

Clustergram in R (a basic function)

After finding out about this method of visualization, I was hunted by the curiosity to play with it a bit. Therefore, and since I didn’t find any implementation of the graph in R, I went about writing the code to implement it.

The code only works for kmeans, but it shows how such a plot can be produced, and could be later modified so to offer methods that will connect with different clustering algorithms.

The function I present here gets a data.frame/matrix with a row for each observation, and the variable dimensions present in the columns.
The function assumes the data is scaled.
The function then goes about calculating the cluster centers for our data, for varying number of clusters.
For each cluster iteration, the cluster centers are multiplied by the first loading of the principal components of the original data. Thus offering a weighted mean of the each cluster center dimensions that might give a decent representation of that cluster (this method has the known limitations of using the first component of a PCA for dimensionality reduction, but I won’t go into that in this post).
Finally all of our data points are ordered according to their respective cluster first component, and plotted against the number of clusters (thus creating the clustergram).

My thank goes to Hadley Wickham for offering some good tips on how to prepare the graph.

Here is the code (example follows)

 
 
clustergram.kmeans <- function(Data, k, ...)
{
	# this is the type of function that the clustergram
	# 	function takes for the clustering.
	# 	using similar structure will allow implementation of different clustering algorithms
 
	#	It returns a list with two elements:
	#	cluster = a vector of length of n (the number of subjects/items)
	#				indicating to which cluster each item belongs.
	#	centers = a k dimensional vector.  Each element is 1 number that represent that cluster
	#				In our case, we are using the weighted mean of the cluster dimensions by 
	#				Using the first component (loading) of the PCA of the Data.
 
	cl <- kmeans(Data, k,...)
 
	cluster <- cl$cluster
	centers <- cl$centers %*% princomp(Data)$loadings[,1]	# 1 number per center
												# here we are using the weighted mean for each
 
	return(list(
				cluster = cluster,
				centers = centers
			))
}		
 
clustergram.plot.matlines <- function(X,Y, k.range, 
											x.range, y.range , COL, 
											add.center.points , centers.points)
	{
		plot(0,0, col = "white", xlim = x.range, ylim = y.range,
			axes = F,
			xlab = "Number of clusters (k)", ylab = "PCA weighted Mean of the clusters", main = "Clustergram of the PCA-weighted Mean of the clusters k-mean clusters vs number of clusters (k)")
		axis(side =1, at = k.range)
		axis(side =2)
		abline(v = k.range, col = "grey")
 
		matlines(t(X), t(Y), pch = 19, col = COL, lty = 1, lwd = 1.5)
 
		if(add.center.points)
		{
			require(plyr)
 
			xx <- ldply(centers.points, rbind)
			points(xx$y~xx$x, pch = 19, col = "red", cex = 1.3)
 
			# add points	
			# temp <- l_ply(centers.points, function(xx) {
									# with(xx,points(y~x, pch = 19, col = "red", cex = 1.3))
									# points(xx$y~xx$x, pch = 19, col = "red", cex = 1.3)
									# return(1)
									# })
						# We assign the lapply to a variable (temp) only to suppress the lapply "NULL" output
		}	
	}
 
 
 
clustergram <- function(Data, k.range = 2:10 , 
							clustering.function = clustergram.kmeans,
							clustergram.plot = clustergram.plot.matlines, 
							line.width = .004, add.center.points = T)
{
	# Data - should be a scales matrix.  Where each column belongs to a different dimension of the observations
	# k.range - is a vector with the number of clusters to plot the clustergram for
	# clustering.function - this is not really used, but offers a bases to later extend the function to other algorithms 
	#			Although that would  more work on the code
	# line.width - is the amount to lift each line in the plot so they won't superimpose eachother
	# add.center.points - just assures that we want to plot points of the cluster means
 
	n <- dim(Data)[1]
 
	PCA.1 <- Data %*% princomp(Data)$loadings[,1]	# first principal component of our data
 
	if(require(colorspace)) {
			COL <- heat_hcl(n)[order(PCA.1)]	# line colors
		} else {
			COL <- rainbow(n)[order(PCA.1)]	# line colors
			warning('Please consider installing the package "colorspace" for prittier colors')
		}
 
	line.width <- rep(line.width, n)
 
	Y <- NULL	# Y matrix
	X <- NULL	# X matrix
 
	centers.points <- list()
 
	for(k in k.range)
	{
		k.clusters <- clustering.function(Data, k)
 
		clusters.vec <- k.clusters$cluster
			# the.centers <- apply(cl$centers,1, mean)
		the.centers <- k.clusters$centers 
 
		noise <- unlist(tapply(line.width, clusters.vec, cumsum))[order(seq_along(clusters.vec)[order(clusters.vec)])]	
		# noise <- noise - mean(range(noise))
		y <- the.centers[clusters.vec] + noise
		Y <- cbind(Y, y)
		x <- rep(k, length(y))
		X <- cbind(X, x)
 
		centers.points[[k]] <- data.frame(y = the.centers , x = rep(k , k))	
	#	points(the.centers ~ rep(k , k), pch = 19, col = "red", cex = 1.5)
	}
 
 
	x.range <- range(k.range)
	y.range <- range(PCA.1)
 
	clustergram.plot(X,Y, k.range, 
											x.range, y.range , COL, 
											add.center.points , centers.points)
 
 
}

Example on the iris dataset

The iris data set is a favorite example of many R bloggers when writing about R accessors , Data Exporting, Data importing, and for different visualization techniques.
So it seemed only natural to experiment on it here.

data(iris)
set.seed(250)
par(cex.lab = 1.5, cex.main = 1.2)
Data <- scale(iris[,-5]) # notice I am scaling the vectors)
clustergram(Data, k.range = 2:8, line.width = 0.004) # notice how I am using line.width.  Play with it on your problem, according to the scale of Y.

Here is the output:

Looking at the image we can notice a few interesting things. We notice that one of the clusters formed (the lower one) stays as is no matter how many clusters we are allowing (except for one observation that goes way and then beck).
We can also see that the second split is a solid one (in the sense that it splits the first cluster into two clusters which are not “close” to each other, and that about half the observations goes to each of the new clusters).
And then notice how moving to 5 clusters makes almost no difference.
Lastly, notice how when going for 8 clusters, we are practically left with 4 clusters (remember – this is according the mean of cluster centers by the loading of the first component of the PCA on the data)

If I where to take something from this graph, I would say I have a strong tendency to use 3-4 clusters on this data.

But wait, did our clustering algorithm do a stable job?
Let’s try running the algorithm 6 more times (each run will have a different starting point for the clusters)

set.seed(500)
Data <- scale(iris[,-5]) # notice I am scaling the vectors)
par(cex.lab = 1.2, cex.main = .7)
par(mfrow = c(3,2))
for(i in 1:6) clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)

Resulting with: (press the image to enlarge it)

Repeating the analysis offers even more insights.
First, it would appear that until 3 clusters, the algorithm gives rather stable results.
From 4 onwards we get various outcomes at each iteration.
At some of the cases, we got 3 clusters when we asked for 4 or even 5 clusters.

Reviewing the new plots, I would prefer to go with the 3 clusters option. Noting how the two “upper” clusters might have similar properties while the lower cluster is quite distinct from the other two.

By the way, the Iris data set is composed of three types of flowers. I imagine the kmeans had done a decent job in distinguishing the three.

Limitation of the method (and a possible way to overcome it?!)

It is worth noting that the current way the algorithm is built has a fundamental limitation: The plot is good for detecting a situation where there are several clusters but each of them is clearly “bigger” then the one before it (on the first principal component of the data).

For example, let’s create a dataset with 3 clusters, each one is taken from a normal distribution with a higher mean:

set.seed(250)
Data <- rbind(
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)),
				cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3))
				)				
clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T)

The resulting plot for this is the following:

The image shows a clear distinction between three ranks of clusters. There is no doubt (for me) from looking at this image, that three clusters would be the correct number of clusters.

But what if the clusters where different but didn’t have an ordering to them?
For example, look at the following 4 dimensional data:

set.seed(250)
Data <- rbind(
				cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)),
				cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3))
				)				
clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)

In this situation, it is not clear from the location of the clusters on the Y axis that we are dealing with 4 clusters.
But what is interesting, is that through the growing number of clusters, we can notice that there are 4 “strands” of data points moving more or less together (until we reached 4 clusters, at which point the clusters started breaking up).
Another hope for handling this might be using the color of the lines in some way, but I haven’t yet figured out how.

Clustergram with ggplot2

Hadley Wickham has kindly played with recreating the clustergram using the ggplot2 engine. You can see the result here:
http://gist.github.com/439761
And this is what he wrote about it in the comments:

I’ve broken it down into three components:
* run the clustering algorithm and get predictions (many_kmeans and all_hclust)
* produce the data for the clustergram (clustergram)
* plot it (plot.clustergram)
I don’t think I have the logic behind the y-position adjustment quite right though.

Here is an example of how it looks:

Conclusions (some rules of thumb and questions for the future)

In a first look, it would appear that the clustergram can be of use. I can imagine using this graph to quickly run various clustering algorithms and then compare them to each other and review their stability (In the way I just demonstrated in the example above).

The three rules of thumb I have noticed by now are:

  1. Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them in higher number of clusters (do they re-group together)
  2. Observe the strands of the datapoints. Even if the clusters centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together – hinting at the real number of clusters
  3. Run the plot multiple times to observe the stability of the cluster formation (and location)

Yet there is more work to be done and questions to seek answers to:

  • The code needs to be extended to offer methods to various clustering algorithms.
  • How can the colors of the lines be used better?
  • How can this be done using other graphical engines (ggplot2/lattice?) – (Update: look at Hadley’s reply in the comments)
  • What to do in case the first principal component doesn’t capture enough of the data? (maybe plot this graph to all the relevant components. but then – how do you make conclusions of it?)
  • What other uses/conclusions can be made based on this graph?

I am looking forward to reading your input/ideas in the comments (or in reply posts).

(Follow the link for other posts by Tal Galili)

MLB Baseball Pitching Matchups ~ grabbing pitcher and/or batter codes by specify game date using R XML

June 1, 2010 · Posted in R bloggers · 1 Comment 

MLB Gameday stores its game data in XML format, with the players denoted in ID numbers. To find out who is who, the codes are stored in pitchers.xml or batters.xml of each game.

My DownloadPitchFX.R script can download the ID numbers, but it doesn’t look to see who the ID is because of the extra processing time. But to use the data (say in RMySQL), it helps to have another script that figures out the ID number for any player.

The following script (GetPitcherBatterCodes.R) requires the last and/or first name of the player, the team that he plays on and the specific date the player is assumed to play. It outputs a data frame with the matched name (however many) and their ID numbers. You can also let just.player = FALSE to download all of the players listed in that game (although it does that anyways).

The input for the team name is fairly general. You can use the codes that are specified in Gameday (“SF”, “sfn”), or the actual city of the team (“San Francisco”), or its team name (“Giants”).

## GetPitcherBatterCodes.R
## get pitcher batter codes for pitch f/x

library(XML)

# -- Outputs
# data frame of all matching names, OR
# data frame of all batters or pitchers in game

# -- Inputs
# game.date ~ game date player plays in, default POSIXlt format, e.g. "2009-05-20"
# is.pitcher ~ TRUE for pitcher, FALSE for batter
# last_name ~ a character vector for the last name
# first_name ~ a char vector for first name,
#   have to spell correctly but don't need both first and last names..
# team ~ denote team that player plays in,
#   use any of the following code within quotes.. example for SF Giants, or SD Padres:
#   away_name_abbrev="SF" home_name_abbrev="SD" away_code="sfn" away_file_code="sf" away_team_city="San Francisco" away_team_name="Giants" home_code="sdn" home_file_code="sd" home_team_city="San Diego" home_team_name="Padres"
# just.player ~ TRUE to get ID for player, FALSE to grab all pitchers OR batters in game

GetPitcherBatterCodes <- function(game.date = "2009-05-20",
                                  is.pitcher = TRUE,
                                  last_name = "Lincecum", first_name = "Tim",
                                  team = "sfn",
                                  just.player = TRUE,
                                  URL.base = "http://gd2.mlb.com/components/game/mlb/") {
  # extract date
  game.date <- as.POSIXlt(game.date)
  year <- game.date$year + 1900
  month <- game.date$mon + 1
  day <- game.date$mday
  URL.date <- paste(URL.base, "year_", year, "/",
                    ifelse(month >= 10, "month_", "month_0"), month, "/",
                    ifelse(day >= 10, "day_", "day_0"), day, "/", sep = "")

  # extract miniscoreboard.xml
  URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
  XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
  parse.scoreboard <- sapply(c("gameday_link",
                               "away_name_abbrev", "home_name_abbrev",
                               "away_code", "home_code",
                               "away_file_code", "home_file_code",
                               "away_team_city", "home_team_city",
                               "away_team_name", "home_team_name"), function(x)
                             xpathSApply(XML.scoreboard, "//game[@*]", xmlGetAttr, x))

  # get game URL of specified team
  team.index <- apply(parse.scoreboard, 1, function(x) team %in% x)
  team.URL <- parse.scoreboard[team.index, 1][1] # protect from double headers
  URL.game <- paste(URL.date, "gid_", team.URL, "/", sep = "")

  # get player data
  URL.players <- ifelse(is.pitcher, paste(URL.game, "pitchers/", sep = ""),
                        paste(URL.game, "batters/", sep = ""))
  HTML.players <- htmlParse(URL.players)
  codes.players <- xpathSApply(HTML.players, "//a[@*]", xmlGetAttr, "href")[-1]

  # loop through player codes to match last AND/OR first name
  info.players <- sapply(codes.players, function(x) {
    URL.player <- paste(URL.players, x, sep = "")
    XML.player <- xmlInternalTreeParse(URL.player)
    print(x)
    info.player <- sapply(c("team", "id", "type", "first_name", "last_name"), function(x)
                          xpathSApply(XML.player, "//Player[@*]", xmlGetAttr, x))
  })

  # get results and match player names if necessary
  if (just.player == TRUE) {
    last.index <- last_name == info.players["last_name",]
    first.index <- first_name == info.players["first_name",]
    matched.index <- as.logical(last.index + first.index)
    matched.players <- data.frame(id = info.players["id", matched.index],
                                  first_name = info.players["first_name", matched.index],
                                  last_name = info.players["last_name", matched.index])
    return(matched.players)
  }
  else return(info.players)
}

Some output:

> aho <- GetPitcherBatterCodes()
> aho
      id first_name last_name
1 453311        Tim  Lincecum

> aho2 <- GetPitcherBatterCodes(just.player = FALSE)
> aho2
           116615.xml 133982.xml 217096.xml 277405.xml 346793.xml 408241.xml
team       "sfn"      "sfn"      "sfn"      "sfn"      "sfn"      "sdn"
id         "116615"   "133982"   "217096"   "277405"   "346793"   "408241"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Randy"    "Bob"      "Barry"    "Justin"   "Jeremy"   "Jake"
last_name  "Johnson"  "Howry"    "Zito"     "Miller"   "Affeldt"  "Peavy"
           425514.xml 429718.xml 429723.xml 429781.xml 429985.xml 430161.xml
team       "sdn"      "sdn"      "sfn"      "sdn"      "sdn"      "sfn"
id         "425514"   "429718"   "429723"   "429781"   "429985"   "430161"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Heath"    "Shawn"    "Merkin"   "Kevin"    "Chad"     "Noah"
last_name  "Bell"     "Hill"     "Valdez"   "Correia"  "Gaudin"   "Lowry"
           430606.xml 430650.xml 430657.xml  430665.xml 430912.xml 432934.xml
team       "sdn"      "sdn"      "sdn"       "sfn"      "sfn"      "sdn"
id         "430606"   "430650"   "430657"    "430665"   "430912"   "432934"
type       "pitcher"  "pitcher"  "pitcher"   "pitcher"  "pitcher"  "pitcher"
first_name "Mike"     "Edwin"    "Cha Seung" "Brandon"  "Matt"     "Chris"
last_name  "Adams"    "Moreno"   "Baek"      "Medders"  "Cain"     "Young"
           435619.xml 445995.xml 446207.xml 448592.xml 450312.xml 450527.xml
team       "sfn"      "sdn"      "sdn"      "sdn"      "sdn"      "sfn"
id         "435619"   "445995"   "446207"   "448592"   "450312"   "450527"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Pat"      "Arturo"   "Josh"     "Cla"      "Mark"     "Alex"
last_name  "Misch"    "Lopez"    "Geer"     "Meredith" "Worrell"  "Hinshaw"
           450832.xml 451216.xml 452724.xml 453281.xml 453311.xml 456043.xml
team       "sfn"      "sfn"      "sfn"      "sdn"      "sfn"      "sfn"
id         "450832"   "451216"   "452724"   "453281"   "453311"   "456043"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Jesse"    "Brian"    "Billy"    "Wade"     "Tim"      "Jonathan"
last_name  "English"  "Wilson"   "Sadler"   "LeBlanc"  "Lincecum" "Sanchez"
           457117.xml 457566.xml 458155.xml 459987.xml 460044.xml 464351.xml
team       "sdn"      "sdn"      "sfn"      "sdn"      "sdn"      "sfn"
id         "457117"   "457566"   "458155"   "459987"   "460044"   "464351"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Ernesto"  "Greg"     "Joe"      "Cesar"    "Cesar"    "Kelvin"
last_name  "Frieri"   "Burke"    "Martinez" "Ramos"    "Carrillo" "Pichardo"
           464400.xml 465629.xml 466412.xml 467683.xml 471183.xml 477581.xml
team       "sfn"      "sdn"      "sdn"      "sfn"      "sfn"      "sdn"
id         "464400"   "465629"   "466412"   "467683"   "471183"   "477581"
type       "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"  "pitcher"
first_name "Henry"    "Edward"   "Luis"     "Osiris"   "Waldis"   "Walter"
last_name  "Sosa"     "Mujica"   "Perdomo"  "Matos"    "Joaquin"  "Silva"
           489265.xml 491159.xml 502381.xml  503355.xml
team       "sfn"      "sdn"      "sdn"       "sdn"
id         "489265"   "491159"   "502381"    "503355"
type       "pitcher"  "pitcher"  "pitcher"   "pitcher"
first_name "Sergio"   "Joe"      "Luke"      "Jackson"
last_name  "Romo"     "Thatcher" "Gregerson" "Quezada"


Filed under: Baseball, R, XML

(Follow the link for other posts by apeescape)

Color choosing in R made easy

May 21, 2010 · Posted in R bloggers · Comments Off 

I don’t know about you, but when I want to make a graph in R, I handpick the colors, line widths etc… to produce awesome output.

A lot of my time is spent on color choosing, I had to find a more convenient way of doing so. Earl F. Glynn’s “Chart of R colors”  posted on  http://research.stowers-institute.org/efg/R/Color/Chart/ gave me the idea to the following function.

R has an Internal called colors(), it’s output is a 657 long character vector of reserved names for colors.

The names can be used directly as in :

plot(iris,col="orange")

Iris attributes in orangeAlternatively, they can be referred to from colors()

plot(iris,col=colors()[503])

Iris attributes in orange redThe color() function has a two arguments (notice it is not plural… I chose this unreserved name because it’s easy to remember) :

  1. plot.it – FALSE by default
  2. locate – 0 by default

The call color() is the same as colors().

> color()[555]==colors()[555]
[1] TRUE

Calling color(plot.it=T) or color(T) gives the following output (very similar to Earl F. Glynn’s “Chart of R colors”) :

You can choose and remember the numbers, or print and stick above your working area… but the following makes color() more useful :

Specifying locate = k > 0 will plot the chart above and this time will use the locator() function in a loop to choose colors you want. After choosing k colors the output will be  a k-long character vector of the chosen colors.

> color(T,5)
[1] "firebrick4" "grey9"      "green"      "gray99"     "khaki1"

You can use it directly in a plot function, the palette of colors will plot first, choose your colors, the plot you called for will be followed by it:

plot(iris,col=color(T,5))

The function is given by :

color    <- function (plot.it=F,locate=0)
 {
 if(!plot.it)
   {
   return(.Internal(colors())) # so far, not different from colors()
   } # close on if
 else
   {
   ytop    <- rep(seq(1/26,1,by=1/26),each=26)[1:657]
   ybottom <- rep(seq(0,1-1/26,by=1/26),each=26)[1:657]
   xleft   <- rep(seq(0,1-1/26,by=1/26),times=26)[1:657]
   xright  <- rep(seq(1/26,1,by=1/26),times=26)[1:657]
   pall    <- round(col2rgb(colors())/256)
   pall    <- colSums(pall) ; pall2 <- character(0)
   pall2[pall>0]   <- "black"
   pall2[pall==0]  <- "white"

   par(mar=c(0,0,1,0))

   plot.new()
   title(main="Palette of colors()")
   rect(xleft,ybottom,xright,ytop,col=colors())
   text(x=xleft+((1/26)/2)
   ,y=ytop-((1/26)/2)
   ,labels = 1:657
   ,cex=0.55
   ,col=pall2)

   } # close on else
   if(locate==0) print("Palette of colors()")
   else
   {
   colmat    <- matrix(c(1:657,rep(NA,26^2-657)),byrow=T,ncol=26,nrow=26)
   cols        <- NA
   i        <- NA
   for(i in 1:locate)
   {
   h    <- locator(1)
   if(any(h$x<0,h$y<0,h$x>1,h$y>1)) stop("locator out of bounds!")
   else
   {
   cc        <- floor(h$x/(1/26))+1
   rr        <- floor(h$y/(1/26))+1            
   cols[i]    <- .Internal(colors())[colmat[rr,cc]]
   } # close on else
   } # close on i
   return(cols)
   } # close on else
   } # close on else+function

You can also write it to variable for further use:

> cols<- color(T,5)
> cols
[1] "magenta"        "orange3"        "palevioletred2" "seagreen4"     
[5] "seagreen2"

Of course it’s not perfect, I still have not solved issues of working with Devices such as pdf() jpeg() and such…

Your comments are welcome.


(Follow the link for other posts by aviadklein)

Next Page »