Bot Botany – K-Means and ggplot2
K-Means in R
If you look up the k-means algorithm online or in a reference book, you will be met with a flurry a mathematical symbols and formal explanations. The basic principal (informally stated) is rather simple... given set of observations (picture a scatter plot of points), and a number of groups or clusters that you wish to group them in, the k-means algorithm finds the center of each group and associates observations with the groups with the "closest" center.
To use k-means in R, call the kmeans function with a matrix of values and the number of centers. The function seeks to partition the points into k groups (the number of centers) such that the sum of squares from points to the assigned cluster centers is minimized. Each observation (point) belongs to the cluster with the nearest mean.
How-To
To start, we will copy the iris data set to a separate data frame. Not strictly speaking necessary, but makes it easier me to reflexively enter df whenever the data frame is in view. Next we create a matrix object containing only the Petal Length and Width.
df=iris
m=as.matrix(cbind(df$Petal.Length, df$Petal.Width),ncol=2)
Now we will do the actual clustering.
cl=(kmeans(m,3))
Simple eh? The cl object contains a number of interesting attributes associated with the model.
cl$size
cl$withinss
Next we do a bit of data formatting and preparation for subsequent calls to graph the data. Notice that we add the cluster information back to our original data frame. This is a good organization of the data and also a requirement for working with ggplot2 which is designed to use data frames.
df$cluster=factor(cl$cluster)
centers=as.data.frame(cl$centers)
The following graph color codes the points by cluster. We also add the centers and a semi transparent halo around the center to emphasize the place of the center... and its role in classifying the observations into clusters.
library(ggplot2)
ggplot(data=df, aes(x=Petal.Length, y=Petal.Width, color=cluster )) +
geom_point() +
geom_point(data=centers, aes(x=V1,y=V2, color='Center')) +
geom_point(data=centers, aes(x=V1,y=V2, color='Center'), size=52, alpha=.3, legend=FALSE)
This plot is an interesting example of how several different sets of data (in this case the actual observations as well as the centers) in separate data frames can be included in a single ggplot2 chart.
Misclassified Observations
The models is pretty accurate, but not perfect. The following SQL statement highlights the few misclassified observations:
sqldf('select Species, cluster, count(*) from df group by Species, Cluster')
Species cluster count(*)
1 setosa 2 50
2 versicolor 1 48
3 versicolor 3 2
4 virginica 1 6
5 virginica 3 44
Now we can enhance the previous graph to put a diamond around misclassified points.
last_plot() + geom_point(data=df2, aes(x=Petal_Length, y=Petal_Width, shape=5, alpha=.7, size=4.5), legend=FALSE)
And so with a bit of Data Mining knowledge and the R programming language, even our machines can stop and smell the roses...
(Follow the link for other posts by C)
Taking R to the Limit, Part II – Large Datasets in R
For Part I, Parallelism in R, click here.
Tuesday night I again had the opportunity to present on high performance computing in R, at the Los Angeles R Users’ Group. This was the second part of a two part series called “Taking R to the Limit: High Performance Computing in R.” Part II discussed ways to work with large datasets in R. I also tied in MapReduce into the talk. Unfortunately, there was too much material and I had originally planned to cover Rhipe, using R on EC2 and sparse matrix libraries.
Slides
My edited slides are posted on SlideShare, and available for download here.
Topics included:
- bigmemory, biganalytics and bigtabulate
- ff
- HadoopStreaming
- brief mention of Rhipe
Code
The corresponding demonstration code is here.
Data
Since this talk discussed large datasets, I used some, well, large datasets. Some demonstrations used toy data including trees and the famous iris dataset included in base R. To load these, just use the call library(iris) or library(trees).
Large datasets:
- On-Time Airline Performance data from 2009 Data Expo. This Bash script will download all of the necessary data files and create a nice dataset for you called airline.csv in the directory in which it is executed. I would just post it here, but it is very large and I only have so much bandwidth!
- The Twitter dataset appears to no longer be available. Instead, use anna.txt which comes with HadoopStreaming. Simply replace twitter.tsv with anna.txt.
Video
The video was created with Vara ScreenFlow and I am very happy with how easy it is to use and how painless editing was.
For Part I, Parallelism in R, click here.
(Follow the link for other posts by Ryan)
A brief introduction to “apply” in R
At any R Q&A site, you’ll frequently see an exchange like this one:
Q: How can I use a loop to [...insert task here...] ?
A: Don’t. Use one of the apply functions.
So, what are these wondrous apply functions and how do they work? I think the best way to figure out anything in R is to learn by experimentation, using embarrassingly trivial data and functions.
If you fire up your R console, type “??apply” and scroll down to the functions in the base package, you’ll see something like this:
base::apply Apply Functions Over Array Margins base::by Apply a Function to a Data Frame Split by Factors base::eapply Apply a Function Over Values in an Environment base::lapply Apply a Function over a List or Vector base::mapply Apply a Function to Multiple List or Vector Arguments base::rapply Recursively Apply a Function to a List base::tapply Apply a Function Over a Ragged Array
Let’s examine each of those.
1. apply
Description: “Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.”
OK – we know about vectors/arrays and functions, but what are these “margins”? Simple: either the rows (1), the columns (2) or both (1:2). By “both”, we mean “apply the function to each individual value.” An example:
# create a matrix of 10 rows x 2 columns
m <- matrix(c(1:10, 11:20), nrow = 10, ncol = 2)
# mean of the rows
apply(m, 1, mean)
[1] 6 7 8 9 10 11 12 13 14 15
# mean of the columns
apply(m, 2, mean)
[1] 5.5 15.5
# divide all values by 2
apply(m, 1:2, function(x) x/2)
[,1] [,2]
[1,] 0.5 5.5
[2,] 1.0 6.0
[3,] 1.5 6.5
[4,] 2.0 7.0
[5,] 2.5 7.5
[6,] 3.0 8.0
[7,] 3.5 8.5
[8,] 4.0 9.0
[9,] 4.5 9.5
[10,] 5.0 10.0
That last example was rather trivial; you could just as easily do “m[, 1:2]/2″ – but you get the idea.
2. by
Description: “Function ‘by’ is an object-oriented wrapper for ‘tapply’ applied to data frames.”
The by function is a little more complex than that. Read a little further and the documentation tells you that “a data frame is split by row into data frames subsetted by the values of one or more factors, and function ‘FUN’ is applied to each subset in turn.” So, we use this one where factors are involved.
To illustrate, we can load up the classic R dataset “iris”, which contains a bunch of flower measurements:
attach(iris)
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
# get the mean of the first 4 variables, by species
by(iris[, 1:4], Species, mean)
Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.006 3.428 1.462 0.246
------------------------------------------------------------
Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.936 2.770 4.260 1.326
------------------------------------------------------------
Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width
6.588 2.974 5.552 2.026
Essentially, by provides a way to split your data by factors and do calculations on each subset. It returns an object of class “by” and there are many, more complex ways to use it.
3. eapply
Description: “eapply applies FUN to the named values from an environment and returns the results as a list.”
This one is a little trickier, since you need to know something about environments in R. An environment, as the name suggests, is a self-contained object with its own variables and functions. To continue using our very simple example:
# a new environment e <- new.env() # two environment variables, a and b e$a <- 1:10 e$b <- 11:20 # mean of the variables eapply(e, mean) $b [1] 15.5 $a [1] 5.5
I don’t often create my own environments, but they’re commonly used by R packages such as Bioconductor so it’s good to know how to handle them.
4. lapply
Description: “lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.”
That’s a nice, clear description which makes lapply one of the easier apply functions to understand. A simple example:
# create a list with 2 elements l <- list(a = 1:10, b = 11:20) # the mean of the values in each element lapply(l, mean) $a [1] 5.5 $b [1] 15.5 # the sum of the values in each element lapply(l, sum) $a [1] 55 $b [1] 155
The lapply documentation tells us to consult further documentation for sapply, vapply and replicate. Let’s do that.
4.1 sapply
Description: “sapply is a user-friendly version of lapply by default returning a vector or matrix if appropriate.”
That simply means that if lapply would have returned a list with elements $a and $b, sapply will return either a vector, with elements [['a']] and [['b']], or a matrix with column names “a” and “b”. Returning to our previous simple example:
# create a list with 2 elements l <- list(a = 1:10, b = 11:20) # mean of values using sapply l.mean <- sapply(l, mean) # what type of object was returned? class(l.mean) [1] "numeric" # it's a numeric vector, so we can get element "a" like this l.mean[['a']] [1] 5.5
4.2 vapply
Description: “vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.”
A third argument is supplied to vapply, which you can think of as a kind of template for the output. The documentation uses the fivenum function as an example, so let’s go with that:
l <- list(a = 1:10, b = 11:20)
# fivenum of values using vapply
l.fivenum <- vapply(l, fivenum, c(Min.=0, "1st Qu."=0, Median=0, "3rd Qu."=0, Max.=0))
class(l.fivenum)
[1] "matrix"
# let's see it
l.fivenum
a b
Min. 1.0 11.0
1st Qu. 3.0 13.0
Median 5.5 15.5
3rd Qu. 8.0 18.0
Max. 10.0 20.0
So, vapply returned a matrix, where the column names correspond to the original list elements and the row names to the output template. Nice.
4.3 replicate
Description: “replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).”
The replicate function is very useful. Give it two mandatory arguments: the number of replications and the function to replicate; a third optional argument, simplify = T, tries to simplify the result to a vector or matrix. An example – let’s simulate 10 normal distributions, each with 10 observations:
replicate(10, rnorm(10))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.67947001 -1.94649409 0.28144696 0.5872913 2.22715085 -0.275918282
[2,] 1.17298643 -0.01529898 -1.47314092 -1.3274354 -0.04105249 0.528666264
[3,] 0.77272662 -2.36122644 0.06397576 1.5870779 -0.33926083 1.121164338
[4,] -0.42702542 -0.90613885 0.83645668 -0.5462608 -0.87458396 -0.723858258
[5,] -0.73892937 -0.57486661 -0.04418200 -0.1120936 0.08253614 1.319095242
[6,] 2.93827883 -0.33363446 0.55405024 -0.4942736 0.66407615 -0.153623614
[7,] 1.30037496 -0.26207115 0.49818215 1.0774543 -0.28206908 0.825488436
[8,] -0.04153545 -0.23621632 -1.01192741 0.4364413 -2.28991601 -0.002867193
[9,] 0.01262547 0.40247248 0.65816829 0.9541927 -1.63770154 0.328180660
[10,] 0.96525278 -0.37850821 -0.85869035 -0.6055622 1.13756753 -0.371977151
[,7] [,8] [,9] [,10]
[1,] 0.03928297 0.34990909 -0.3159794 1.08871657
[2,] -0.79258805 -0.30329668 -1.0902070 0.73356542
[3,] 0.10673459 -0.02849216 0.8094840 0.06446245
[4,] -0.84584079 -0.57308461 -1.3570979 -0.89801330
[5,] -1.50226560 -2.35751419 1.2104163 0.74650696
[6,] -0.32790991 0.80144695 -0.0071844 0.05742356
[7,] 1.36719970 2.34148354 0.9148911 0.20451421
[8,] -0.51112579 -0.53658159 1.5194130 -0.94250069
[9,] 0.52017814 -1.22252527 0.4519702 0.08779704
[10,] 1.35908918 1.09024342 0.5912627 -0.20709053
5. mapply
Description: “mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each (…) argument, the second elements, the third elements, and so on.”
The mapply documentation is full of quite complex examples, but here’s a simple, silly one:
l1 <- list(a = c(1:10), b = c(11:20)) l2 <- list(c = c(21:30), d = c(31:40)) # sum the corresponding elements of l1 and l2 mapply(sum, l1$a, l1$b, l2$c, l2$d) [1] 64 68 72 76 80 84 88 92 96 100
Here, we sum l1$a[1] + l1$b[1] + l2$c[1] + l2$d[1] (1 + 11 + 21 + 31) to get 64, the first element of the returned list. All the way through to l1$a[10] + l1$b[10] + l2$c[10] + l2$d[10] (10 + 20 + 30 + 40) = 100, the last element.
6. rapply
Description: “rapply is a recursive version of lapply.”
I think “recursive” is a little misleading. What rapply does is apply functions to lists in different ways, depending on the arguments supplied. Best illustrated by examples:
# let's start with our usual simple list example
l <- list(a = 1:10, b = 11:20)
# log2 of each value in the list
rapply(l, log2)
a1 a2 a3 a4 a5 a6 a7 a8
0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
a9 a10 b1 b2 b3 b4 b5 b6
3.169925 3.321928 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000
b7 b8 b9 b10
4.087463 4.169925 4.247928 4.321928
# log2 of each value in each list
rapply(l, log2, how = "list")
$a
[1] 0.000000 1.000000 1.584963 2.000000 2.321928 2.584963 2.807355 3.000000
[9] 3.169925 3.321928
$b
[1] 3.459432 3.584963 3.700440 3.807355 3.906891 4.000000 4.087463 4.169925
[9] 4.247928 4.321928
# what if the function is the mean?
rapply(l, mean)
a b
5.5 15.5
rapply(l, mean, how = "list")
$a
[1] 5.5
$b
[1] 15.5
So, the output of rapply depends on both the function and the how argument. When how = “list” (or “replace”), the original list structure is preserved. Otherwise, the default is to unlist, which results in a vector.
You can also pass a “classes=” argument to rapply. For example, in a mixed list of numeric and character variables, you could specify that the function act only on the numeric values with “classes = numeric”.
7. tapply
Description: “Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.”
Woah there. That sounds complicated. Don’t panic though, it becomes clearer when the required arguments are described. Usage is “tapply(X, INDEX, FUN = NULL, …, simplify = TRUE)”, where X is “an atomic object, typically a vector” and INDEX is “a list of factors, each of same length as X”.
So, to go back to the famous iris data, “Species” might be a factor and “iris$Petal.Width” would give us a vector of values. We could then run something like:
attach(iris)
# mean petal length by species
tapply(iris$Petal.Length, Species, mean)
setosa versicolor virginica
1.462 4.260 5.552
Summary
I’ve used very simple examples here, with contrived data and standard functions (such as mean and sum). For me, this is the easiest way to learn what a function does: I can look at the original data, then the result and figure out what happened. However, the “apply” family is a much more powerful than these illustrations – I encourage you to play around with it.
The things to consider when choosing an apply function are basically:
- What class is my input data? – vector, matrix, data frame…
- On which subsets of that data do I want the function to act? – rows, columns, all values…
- What class will the function return? How is the original data structure transformed?
It’s the usual input-process-output story: what do I have, what do I want and what lies inbetween?
Filed under: computing, R, statistics Tagged: r-project, statistics, tutorial
(Follow the link for other posts by nsaunders)
Iris Data Set Visualization Web App in < 100 LOC
The iris data set pops up pretty regularly in statistical literature. It consists of 50 records from three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). I came across it recently while reading Introduction to Data Mining. It comes up in several places in the book to demonstrate techniques for visualization and classification. There has been a number of articles, posts and videos about R and the web in recent times. This post presents a way of creating some plots for the data set using R and a Ruby Web Application. |
R and the Web
There are a number of situations where it would make sense to expose a data set and wrap a certain amount of R functionality within a web application. Non-R users might need access to the data. You might want to provide a presentation of findings available through the web. You might even want to collaborate with other R developers by posting an HTML table that they can read in using XML. In time, I expect that some standard web application frameworks will emerge to fill in the gap. And so the current application is some “thinking out loud” on my part in that direction.
Prerequisites
In order to run this application on your machine, R and Ruby must be installed and functional. The R packages ggplot2, R2HTML, RServe are used as well as the iris data set. This web app was written and tested in Windows, but should run in *nix with small modifications.
Ruby Configuration
Install the package to allow communication with RServe
gem install rserve-client
On *nix systems folks often sudo to install.
If for some reason you do not go the normal route of installing a gem (e.g. you downloaded from github at http://github.com/clbustos/Rserve-Ruby-client), make sure that your ruby $LOAD_PATH has the library available when you run the program.
You can even do this in line in the ruby program by including a line like the following at the beginning of the program:
$LOAD_PATH<<'C:\clbustos-Rserve-Ruby-client-v0.2.4-2-g47b0da7\lib'
The application itself is available on github.
Run the Web Application
Start Rserve.
C:\Program Files\R\R-2.10.1\library\Rserve> Rserve
You should see output like this if it started successfully.
Rserve: Ok, ready to answer queries.
Start the Ruby web application - specify a port if you like with the -p option.
iris_data_set_webapp.rb -p 4445
With these steps complete, you should be able to hit the application at http://localhost:4445. Three links are available. The r_version link simply demonstrates that Ruby (through the sinatra framework and Rserve client) can communicate with R. Clicking this link causes the version of R to display in the browser.
The second link is to the iris data itself. This page displays a formatted HTML table rendered using the R2HTML package. Admittedly, this is a bit of a confusion of concerns (view information be provided by R) but it provided a convenient mechanism to convert a data frame to an HTML table.
The third link allows you to modify the aesthetics of the plot. Specifically, the x, y and color can be set to any of the available variables. The result is a "grammatically correct" chart.
Code Walkthrough and Commentary
The package declarations could be written out
require 'rubygems'
require 'sinatra'
...
Instead all of the packages are included in an array (surrounded by brackets). Then each require directive is issued as we iterate through each element in the array..
['rubygems', 'sinatra', 'rserve','fileutils','haml'].each{|r|require r}
The packages being used are
rubygems - the ruby packaging system itself
sinatra - a minimal web app DSL
rserve - to integrate with R
fileutils - some convenience methods for file system access
haml - well, this one requires some explanation...
HAML is one of the many Ruby mark up/templating languages that is in vogue today among Rubyists. It seems to save a few keystrokes from writing straight HTML, but it slows me down since I think in HTML and end up working backwards to writing the HAML. I kind of like the pythonesque interpretation of indentation being meaningful and that the HTML looks pretty.
Anyway, it is used here but I am still on the fence about it.
To experiment with haml using irb, just require haml, create an engine and output the results to HTML.
require 'haml'
Haml::Engine.new('%h3 hello world').to_html
Back to the web app. Create a global connection to Rserve.
include Rserve
$c = Connection.new
The following lines kinda-sorta reload Sinatra most if the time which allows you to change code and view the changes without starting and stopping the server. Only it does not always work :) …but it works enough for me that I included it and just restart if things are not updating the way I expect.
configure do
Sinatra::Application.reset!
use Rack::Reloader
end
This looks at a line that comes from the web app source file itself. Yep, kind of wild (echoes of camping). If the line matches the regexp and is one of the get functions below (other than the index itself), we pull out the url path and slap it in an HTML anchor. This is a convenient way to have a home screen during development where each get URL can be invoked.
def anchor(line)
if line=~/get \'\/([a-z|A-Z])/
l=line.split[1].gsub("'",'');
haml "%a{:href => '#{l}'}> #{l} \n%br"
end
end
Return a string of html with an heading that says “Links” and a link to each “get” URL available in the web application. The list of links is generated by reading the contents of this file, and creating a hyperlink (if possible) using the “anchor” method above.
get '/' do
html=haml '%h3 Links'
File.open(__FILE__).readlines.each{|l|html+=anchor(l).to_s}
html
end
This is a simple example of how integration with R works. The connection to RServe named $c is sent a string of R code to evaluate. We expect a single result that we interpret as a string.
get '/r_version' do
$c.eval("R.version.string").as_string
end
This method creates an R script, evaluates it and returns a link to an image that will appear in the public directory that is in the same directory with this file. The < < SCRIPT syntax is sometimes called a heredoc. It is just a convenient way to create multi line strings - you could use double quotes in this context as well. The variables x, y and color that are passed in are substituted where you see #{x}, #{y}, #{color}.
def irisplot(x,y,color)
script= < < SCRIPT
library(ggplot2)
ggplot(
data=iris,
aes(x=#{x}, y=#{y}, color=#{color})
) + geom_point()
ggsave('#{FileUtils.pwd}/public/irisplot.png')
SCRIPT
$c.eval(script)
" < img src='irisplot.png' width='600', height='600' > "
end
Note: The spaces between the less than signs for the HEREDOC are artificial - they were required because blogger was not correctly interpreting them together. Similar problem with the image tag - I just added spaces to prevent rendering issues.
This example demonstrates how to open an independent R script and run it. See the iris.R script itself for more information about what is going on. In general, the R2HTML package is being used to create a file whos handle is returned. We then read the contents of the file in and these are returned as HTML. The contents of the file are an HTML table that represents the iris data frame.
get '/iris_data' do
url=$c.eval(File.open('iris.R').readlines.join("\n")).as_string
File.open(url).readlines()
end
This page can be scraped using R and two lines of code. You could read this data into R running on another computer on the network:
library(XML)
df=readHTMLTable('http://nameofmachine:4445/iris')
The following creates an iris plot using the parameters passed in.
post '/plot' do
irisplot(params['x'],params['y'],params['color'])
end
Finally, returns the form that allows you to input which fields are used to create a plot for the iris data set.
get '/iris_plot_input' do
# Retrieve the iris data set column names into a ruby class variable.
# These will be used to populate dropdowns.
@colnames=$c.eval('data(iris);colnames(iris);').as_strings
# WARNING Hard Coded Defaults below. I used these so that
# we would have reasonable values by default.
# Put all of the HAML markup in a string
html=<
%form{ :action => "/plot", :method => "post"}
%table
%tr
%td
%label{:for => "name"} x:
%td
%select{:name=>'x'}
= @colnames.each do |col|
%option{:value=> col, :selected => (col == 'Sepal.Length')}
=col
%tr
%td
%label{:for => "name"} y:
%td
%select{:name=>'y'}
= @colnames.each do |col|
%option{:value=> col, :selected => (col == 'Sepal.Width')}
=col
%tr
%td
%label{:for => "name"} color:
%td
%select{:name=>'color'}
= @colnames.each do |col|
%option{:value=> col, :selected => (col == 'Species')}
=col
%tr
%td
%input{:type => "submit", :value => "Create Plot"}
HAML
# Render it with the HAML engine
haml html
end
Examples produced by the app:
This application is amazing in that it simply pulls together some of the best programming resources around. In well under 100 lines of code it is simple and easy to maintain. With that in mind, I have been thinking about other directions that could be used to generalize this approach. One is simply to bundle Sinatra with R (perhaps using JRuby). Sinatra web apps could then be dynamically based upon data sets (kind of like the current app) or around R functions (kind of like the fgui package). It seems like Hadley Wickham had a similar idea first and has a related project on Github. His approach is to port Sinatra to R so that web apps could be developed in R without the use of another language such as Ruby.
(Follow the link for other posts by C)
Thinking about Graphs

- Data
- Aesthetic Mappings
- Geometric Objects
- Statistical Transformations
- Position Adjustment
- Faceting
- Coordinate System
- Geoms (Geometric Objects)
- Statistics (Statistical Transformations)
- Scales
- Coordinate System
- Faceting
- Position Adjustment
- Default Data Set
- Set of Aesthetic Mappings
- Multiple Layers (points, jittered points, box plots, histogram
- Scale for Each Aesthetic
- Faceting Specification
- Coordinate System
- Data set and Aesthetic Mapping
- Geometric Object.
- Statistics
- Position Adjustment
The individual components of the grammar are fairly well defined regardless of where they appear on a list. The possible interactions between the components are rather complex. The construction of traditional charts are defined by a distinct combination of components. For example, the combination of geom and a stat is significant. At other times, the coordinate system is a defining factor. (A pie chart is a one column stacked bar chart that is mapped to a polar coordinate system).
Chart Geom Stat Coordinate System
Iris Data Set
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
- Data - iris data set
- Mapping to aesthetic
- x - Petal.Length
- y - Petal.Width
- Color - Species
- Geometric Object - point
Order of Application
(Follow the link for other posts by C)
New versions for ggplot2 (0.8.8) and plyr (1.0) were released today
As prolific as the CRAN website is of packages, there are several packages to R that succeeds in standing out for their wide spread use (and quality), Hadley Wickhams ggplot2 and plyr are two such packages.

And today (through twitter) Hadley has updates the rest of us with the news:
just released new versions of plyr and ggplot2. source versions available on cran, compiled will follow soon #rstats
Going to the CRAN website shows that plyr has gone through the most major update, with the last update (before the current one) taking place on 2009-06-23. And now, over a year later, we are presented with plyr version 1, which includes New functions, New features some Bug fixes and a much anticipated Speed improvements.
ggplot2, has made a tiny leap from version 0.8.7 to 0.8.8, and was previously last updated on 2010-03-03.
Me, and I am sure many R users are very thankful for the amazing work that Hadley Wickham is doing (both on his code, and with helping other useRs on the help lists). So Hadley, thank you!
Here is the complete change-log list for both packages:
plyr 1.0 (2010-07-02) —————————————————
(taken from the CRAN website)
New functions:
* arrange, a new helper method for reordering a data frame.
* count, a version of table that returns data frames immediately and that is
much much faster for high-dimensional data.
* desc makes it easy to sort any vector in descending order
* join, works like merge but can be much faster and has a somewhat simpler
syntax drawing from SQL terminology
* rbind.fill.matrix is like rbind.fill but works for matrices, code
contributed by C. Beleites
Speed improvements
* experimental immutable data frame (idata.frame) that vastly speeds up
subsetting – for large datasets with large numbers of groups, this can yield
10-fold speed ups. See examples in ?idata.frame to see how to use it.
* rbind.fill rewritten again to increase speed and work with more data types
* d*ply now much faster with nested groups
New features:
* d*ply now accepts NULL for splitting variables, indicating that the data
should not be split
* plyr no longer exports internal functions, many of which were causing
clashes with other packages
* rbind.fill now works with data frame columns that are lists or matrices
* test suite ensures that plyr behaviour is correct and will remain correct
as I make future improvements.
Bug fixes:
* **ply: if zero splits, empty list(), data.frame() or logical() returned,
as appropriate for the output type
* **ply: leaving .fun as NULL now always returns list
(thanks to Stavros Macrakis for the bug report)
* a*ply: labels now respect options(stringAsFactors)
* each: scoping bug fixed, thanks to Yasuhisa Yoshida for the bug report
* list_to_dataframe is more consistent when processing a single data frame
* NAs preserved in more places
* progress bars: guaranteed to terminate even if **ply prematurely terminates
* progress bars: misspelling gives informative warning, instead of
uninformative error
* splitter_d: fixed ordering bug when .drop = FALSE
ggplot2 0.8.8 (2010-07-02) —————————————-
(taken from the CRAN website)
Bug fixes:
* coord_equal finally works as expected (thanks to continued prompting from Jean-Olivier Irisson)
* coord_equal renamed to coord_fixed to better represent capabilities
* coord_polar and coord_polar: new munching system that uses distances (as defined by the coordinate system) to figure out how many pieces each segment should be broken in to (thanks to prompting from Jean-Olivier Irisson)
* fix ordering bug in facet_wrap (thanks to bug report by Frank Davenport)
* geom_errorh correctly responds to height parameter outside of aes
* geom_hline and geom_vline will not impact legend when used for fixed intercepts
* geom_hline/geom_vline: intercept values not set quite correctly which caused a problem in conjunction with transformed scales (reported by Seth Finnegan)
* geom_line: can now stack lines again with position = “stack” (fixes #74)
* geom_segment: arrows now preserved in non-Cartesian coordinate system (fixes #117)
* geom_smooth now deals with missing values in the same way as geom_line (thanks to patch from Karsten Loesing)
* guides: check all axis labels for expressions (reported by Benji Oswald)
* guides: extra 0.5 line margin around legend (fixes #71)
* guides: non-left legend positions now work once more (thanks to patch from Karsten Loesing)
* label_bquote works with more expressions (factors now cast to characters, thanks to Baptiste Auguie for bug report)
* scale_color: add missing US spellings
* stat: panels with no non-missing values trigged errors with some statistics. (reported by Giovanni Dall’Olio)
* stat: statistics now also respect layer parameter inherit.aes (thanks to bug report by Lorenzo Isella and investigation by Brian Diggs)
* stat_bin no longer drops 0-count bins by default
* stat_bin: fix small bug when dealing with single bin with NA position (reported by John Rauser)
* stat_binhex: uses range of data from scales when computing binwidth so hexes are the same size in all facets (thanks to Nicholas Lewin-Koh for the bug report)
* stat_qq has new dparam parameter for specifying distribution parameters (thanks to Yunfeng Zhang for the bug report)
* stat_smooth now uses built-in confidence interval (with small sample correction) for linear models (thanks to suggestion by Ian Fellows)
* sta
(Follow the link for other posts by Tal Galili)
Make R speak SQL with sqldf
It is not that such integration is not possible, there are many packages on CRAN that relate to databases. Oracle has even integrated its data mining package with R using RODM in recent days. There is a rather obvious overlap of concerns between those in the RDMS crowd and R users. The widespread use of RDBMS systems appears to be restricted more to business environments. This is changing somewhat in recent days as alternatives to relational models are explored - particularly when dealing with large data sets.
As someone coming to R from an RDBMS background, I noticed a significant difference in the terminology and concepts that exist concerning the manipulation of data. R users reshape data frames while database users construct queries and return result sets. In either case, we are dealing with the same basic idea - essentially a rectangular grid containing a set of values obtained from a data source. And we have the same basic goal, modify the structure of the grid without changing the essential meaning of the underlying data.
There is tremendous value in learning the R terms and functions involved in manipulating data. There are some things that are trivial to do in R that are a large amount of work in SQL. In the meantime, there is a package that I found useful - that also might be useful to RDBMS developers - those who "think in SQL" but want to use R for additional statistical analysis or creation of visualizations. It is sqldf, an R package for runing SQL statements on data frames.
Examples speak for themselves in large part.
# Load the package
library(sqldf)
# Use the titanic data set
data(titanic3, package="PASWR")
colnames(titanic3)
head(titanic3)
Incidentally, I really appreciate the tradition in the statistics (and therefore R) communities of using well known published data sets to illustrate analysis and programming techniques. This is not done consistently in the RDBMS community, although the Oracle community tends to use sample schemas (SCOTT/TIGER, HR, etc) and other RDBMS vendors have similar sample schemas for training. There are not nearly as well know or consistently used as data sets in the statistics community (titanic, iris, etc). We will use the titanic data set in this example.
As an RDBMS developer, I might be interested in knowing the number of people that are a particular age in the data set. That would translate in my mind as selecting the count of each age listed in the titanic3 data set. A GROUP BY clause would be required to produce a COUNT(*) for each age listed.
This is better expressed recognizing that NA and NULL correspond:
sqldf('select age, count(*) from titanic3 where age is not null group by age')
This returns a listing of results - very similar to what I would do in a RDBMS environment. Of course in R, this sort of information might be handled differently. Rather than having a listing that runs off the screen, I can view a histogram that reflects the entire distribution in a manner that is viewable in a small amount of space. In this case
library(ggplot2)
DF=sqldf('select age from titanic3 where age != "NA"')
qplot(DF$age,data=DF, geom="histogram")

The GROUP BY summarization using SQL is one example of a relatively straightforward summarization available using SQL. Lets say that I am now interested in focusing in on the survival rate of 29 year olds on the titanic.
DF=sqldf('select count(*) total from titanic3 where age=29 group by survived')
DF2=t(DF)
colnames(DF2)=c('Died','Survived')
Using the t function to perform a matrix transform which is essentially the same as producing a pivot of the original data.
Died Survived
total 17 13
The sqldf package is a great resource for RDBMS folks who are working in R. I am not sure that it actually provides any functionality that is not readily available in a more R-ish form, but it certainly provides a way of conceptualizing problems and implementing solutions that is familiar to database developers. Otherwise stated: it is not so matter whether R or SQL can express it but how easy it is to express. Some queries are simply much easier to express in one notation or the other.
The package will be particularly helpful when introducing R to database developers - and could be helpful when R and database developers are collaborating by providing a common means of communication.
(Follow the link for other posts by C)
Clustergram: visualization and diagnostics for cluster analysis (R code)
About Clustergrams
In 2002, Matthias Schonlau published in “The Stata Journal” an article named “The Clustergram: A graph for visualizing hierarchical and . As explained in the abstract:
In hierarchical cluster analysis dendrogram graphs are used to visualize how clusters are formed. I propose an alternative graph named “clustergram” to examine how cluster members are assigned to clusters as the number of clusters increases.
This graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means and for hierarchical cluster algorithms when the number of observations is large enough to make dendrograms impractical.
A similar article was later written and was (maybe) published in “computational statistics”.
Both articles gives some nice background to known methods like k-means and methods for hierarchical clustering, and then goes on to present examples of using these methods (with the Clustergarm) to analyse some datasets.
Personally, I understand the clustergram to be a type of parallel coordinates plot where each observation is given a vector. The vector contains the observation’s location according to how many clusters the dataset was split into. The scale of the vector is the scale of the first principal component of the data.
Clustergram in R (a basic function)
After finding out about this method of visualization, I was hunted by the curiosity to play with it a bit. Therefore, and since I didn’t find any implementation of the graph in R, I went about writing the code to implement it.
The code only works for kmeans, but it shows how such a plot can be produced, and could be later modified so to offer methods that will connect with different clustering algorithms.
The function I present here gets a data.frame/matrix with a row for each observation, and the variable dimensions present in the columns.
The function assumes the data is scaled.
The function then goes about calculating the cluster centers for our data, for varying number of clusters.
For each cluster iteration, the cluster centers are multiplied by the first loading of the principal components of the original data. Thus offering a weighted mean of the each cluster center dimensions that might give a decent representation of that cluster (this method has the known limitations of using the first component of a PCA for dimensionality reduction, but I won’t go into that in this post).
Finally all of our data points are ordered according to their respective cluster first component, and plotted against the number of clusters (thus creating the clustergram).
My thank goes to Hadley Wickham for offering some good tips on how to prepare the graph.
Here is the code (example follows)
clustergram.kmeans <- function(Data, k, ...) { # this is the type of function that the clustergram # function takes for the clustering. # using similar structure will allow implementation of different clustering algorithms # It returns a list with two elements: # cluster = a vector of length of n (the number of subjects/items) # indicating to which cluster each item belongs. # centers = a k dimensional vector. Each element is 1 number that represent that cluster # In our case, we are using the weighted mean of the cluster dimensions by # Using the first component (loading) of the PCA of the Data. cl <- kmeans(Data, k,...) cluster <- cl$cluster centers <- cl$centers %*% princomp(Data)$loadings[,1] # 1 number per center # here we are using the weighted mean for each return(list( cluster = cluster, centers = centers )) } clustergram.plot.matlines <- function(X,Y, k.range, x.range, y.range , COL, add.center.points , centers.points) { plot(0,0, col = "white", xlim = x.range, ylim = y.range, axes = F, xlab = "Number of clusters (k)", ylab = "PCA weighted Mean of the clusters", main = "Clustergram of the PCA-weighted Mean of the clusters k-mean clusters vs number of clusters (k)") axis(side =1, at = k.range) axis(side =2) abline(v = k.range, col = "grey") matlines(t(X), t(Y), pch = 19, col = COL, lty = 1, lwd = 1.5) if(add.center.points) { require(plyr) xx <- ldply(centers.points, rbind) points(xx$y~xx$x, pch = 19, col = "red", cex = 1.3) # add points # temp <- l_ply(centers.points, function(xx) { # with(xx,points(y~x, pch = 19, col = "red", cex = 1.3)) # points(xx$y~xx$x, pch = 19, col = "red", cex = 1.3) # return(1) # }) # We assign the lapply to a variable (temp) only to suppress the lapply "NULL" output } } clustergram <- function(Data, k.range = 2:10 , clustering.function = clustergram.kmeans, clustergram.plot = clustergram.plot.matlines, line.width = .004, add.center.points = T) { # Data - should be a scales matrix. Where each column belongs to a different dimension of the observations # k.range - is a vector with the number of clusters to plot the clustergram for # clustering.function - this is not really used, but offers a bases to later extend the function to other algorithms # Although that would more work on the code # line.width - is the amount to lift each line in the plot so they won't superimpose eachother # add.center.points - just assures that we want to plot points of the cluster means n <- dim(Data)[1] PCA.1 <- Data %*% princomp(Data)$loadings[,1] # first principal component of our data if(require(colorspace)) { COL <- heat_hcl(n)[order(PCA.1)] # line colors } else { COL <- rainbow(n)[order(PCA.1)] # line colors warning('Please consider installing the package "colorspace" for prittier colors') } line.width <- rep(line.width, n) Y <- NULL # Y matrix X <- NULL # X matrix centers.points <- list() for(k in k.range) { k.clusters <- clustering.function(Data, k) clusters.vec <- k.clusters$cluster # the.centers <- apply(cl$centers,1, mean) the.centers <- k.clusters$centers noise <- unlist(tapply(line.width, clusters.vec, cumsum))[order(seq_along(clusters.vec)[order(clusters.vec)])] # noise <- noise - mean(range(noise)) y <- the.centers[clusters.vec] + noise Y <- cbind(Y, y) x <- rep(k, length(y)) X <- cbind(X, x) centers.points[[k]] <- data.frame(y = the.centers , x = rep(k , k)) # points(the.centers ~ rep(k , k), pch = 19, col = "red", cex = 1.5) } x.range <- range(k.range) y.range <- range(PCA.1) clustergram.plot(X,Y, k.range, x.range, y.range , COL, add.center.points , centers.points) }
Example on the iris dataset
The iris data set is a favorite example of many R bloggers when writing about R accessors , Data Exporting, Data importing, and for different visualization techniques.
So it seemed only natural to experiment on it here.
data(iris) set.seed(250) par(cex.lab = 1.5, cex.main = 1.2) Data <- scale(iris[,-5]) # notice I am scaling the vectors) clustergram(Data, k.range = 2:8, line.width = 0.004) # notice how I am using line.width. Play with it on your problem, according to the scale of Y.
Looking at the image we can notice a few interesting things. We notice that one of the clusters formed (the lower one) stays as is no matter how many clusters we are allowing (except for one observation that goes way and then beck).
We can also see that the second split is a solid one (in the sense that it splits the first cluster into two clusters which are not “close” to each other, and that about half the observations goes to each of the new clusters).
And then notice how moving to 5 clusters makes almost no difference.
Lastly, notice how when going for 8 clusters, we are practically left with 4 clusters (remember – this is according the mean of cluster centers by the loading of the first component of the PCA on the data)
If I where to take something from this graph, I would say I have a strong tendency to use 3-4 clusters on this data.
But wait, did our clustering algorithm do a stable job?
Let’s try running the algorithm 6 more times (each run will have a different starting point for the clusters)
set.seed(500) Data <- scale(iris[,-5]) # notice I am scaling the vectors) par(cex.lab = 1.2, cex.main = .7) par(mfrow = c(3,2)) for(i in 1:6) clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)
Resulting with: (press the image to enlarge it)

Repeating the analysis offers even more insights.
First, it would appear that until 3 clusters, the algorithm gives rather stable results.
From 4 onwards we get various outcomes at each iteration.
At some of the cases, we got 3 clusters when we asked for 4 or even 5 clusters.
Reviewing the new plots, I would prefer to go with the 3 clusters option. Noting how the two “upper” clusters might have similar properties while the lower cluster is quite distinct from the other two.
By the way, the Iris data set is composed of three types of flowers. I imagine the kmeans had done a decent job in distinguishing the three.
Limitation of the method (and a possible way to overcome it?!)
It is worth noting that the current way the algorithm is built has a fundamental limitation: The plot is good for detecting a situation where there are several clusters but each of them is clearly “bigger” then the one before it (on the first principal component of the data).
For example, let’s create a dataset with 3 clusters, each one is taken from a normal distribution with a higher mean:
set.seed(250) Data <- rbind( cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3)), cbind(rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3),rnorm(100,2, sd = 0.3)) ) clustergram(Data, k.range = 2:5 , line.width = .004, add.center.points = T)
The resulting plot for this is the following:

The image shows a clear distinction between three ranks of clusters. There is no doubt (for me) from looking at this image, that three clusters would be the correct number of clusters.
But what if the clusters where different but didn’t have an ordering to them?
For example, look at the following 4 dimensional data:
set.seed(250) Data <- rbind( cbind(rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,1, sd = 0.3),rnorm(100,0, sd = 0.3)), cbind(rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,0, sd = 0.3),rnorm(100,1, sd = 0.3)) ) clustergram(Data, k.range = 2:8 , line.width = .004, add.center.points = T)
In this situation, it is not clear from the location of the clusters on the Y axis that we are dealing with 4 clusters.
But what is interesting, is that through the growing number of clusters, we can notice that there are 4 “strands” of data points moving more or less together (until we reached 4 clusters, at which point the clusters started breaking up).
Another hope for handling this might be using the color of the lines in some way, but I haven’t yet figured out how.
Clustergram with ggplot2
Hadley Wickham has kindly played with recreating the clustergram using the ggplot2 engine. You can see the result here:
http://gist.github.com/439761
And this is what he wrote about it in the comments:
I’ve broken it down into three components:
* run the clustering algorithm and get predictions (many_kmeans and all_hclust)
* produce the data for the clustergram (clustergram)
* plot it (plot.clustergram)
I don’t think I have the logic behind the y-position adjustment quite right though.
Here is an example of how it looks:

Conclusions (some rules of thumb and questions for the future)
In a first look, it would appear that the clustergram can be of use. I can imagine using this graph to quickly run various clustering algorithms and then compare them to each other and review their stability (In the way I just demonstrated in the example above).
The three rules of thumb I have noticed by now are:
- Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them in higher number of clusters (do they re-group together)
- Observe the strands of the datapoints. Even if the clusters centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together – hinting at the real number of clusters
- Run the plot multiple times to observe the stability of the cluster formation (and location)
Yet there is more work to be done and questions to seek answers to:
- The code needs to be extended to offer methods to various clustering algorithms.
- How can the colors of the lines be used better?
- How can this be done using other graphical engines (ggplot2/lattice?) – (Update: look at Hadley’s reply in the comments)
- What to do in case the first principal component doesn’t capture enough of the data? (maybe plot this graph to all the relevant components. but then – how do you make conclusions of it?)
- What other uses/conclusions can be made based on this graph?
I am looking forward to reading your input/ideas in the comments (or in reply posts).
(Follow the link for other posts by Tal Galili)
MLB Baseball Pitching Matchups ~ grabbing pitcher and/or batter codes by specify game date using R XML
MLB Gameday stores its game data in XML format, with the players denoted in ID numbers. To find out who is who, the codes are stored in pitchers.xml or batters.xml of each game.
My DownloadPitchFX.R script can download the ID numbers, but it doesn’t look to see who the ID is because of the extra processing time. But to use the data (say in RMySQL), it helps to have another script that figures out the ID number for any player.
The following script (GetPitcherBatterCodes.R) requires the last and/or first name of the player, the team that he plays on and the specific date the player is assumed to play. It outputs a data frame with the matched name (however many) and their ID numbers. You can also let just.player = FALSE to download all of the players listed in that game (although it does that anyways).
The input for the team name is fairly general. You can use the codes that are specified in Gameday (“SF”, “sfn”), or the actual city of the team (“San Francisco”), or its team name (“Giants”).
## GetPitcherBatterCodes.R
## get pitcher batter codes for pitch f/x
library(XML)
# -- Outputs
# data frame of all matching names, OR
# data frame of all batters or pitchers in game
# -- Inputs
# game.date ~ game date player plays in, default POSIXlt format, e.g. "2009-05-20"
# is.pitcher ~ TRUE for pitcher, FALSE for batter
# last_name ~ a character vector for the last name
# first_name ~ a char vector for first name,
# have to spell correctly but don't need both first and last names..
# team ~ denote team that player plays in,
# use any of the following code within quotes.. example for SF Giants, or SD Padres:
# away_name_abbrev="SF" home_name_abbrev="SD" away_code="sfn" away_file_code="sf" away_team_city="San Francisco" away_team_name="Giants" home_code="sdn" home_file_code="sd" home_team_city="San Diego" home_team_name="Padres"
# just.player ~ TRUE to get ID for player, FALSE to grab all pitchers OR batters in game
GetPitcherBatterCodes <- function(game.date = "2009-05-20",
is.pitcher = TRUE,
last_name = "Lincecum", first_name = "Tim",
team = "sfn",
just.player = TRUE,
URL.base = "http://gd2.mlb.com/components/game/mlb/") {
# extract date
game.date <- as.POSIXlt(game.date)
year <- game.date$year + 1900
month <- game.date$mon + 1
day <- game.date$mday
URL.date <- paste(URL.base, "year_", year, "/",
ifelse(month >= 10, "month_", "month_0"), month, "/",
ifelse(day >= 10, "day_", "day_0"), day, "/", sep = "")
# extract miniscoreboard.xml
URL.scoreboard <- paste(URL.date, "miniscoreboard.xml", sep = "")
XML.scoreboard <- xmlInternalTreeParse(URL.scoreboard)
parse.scoreboard <- sapply(c("gameday_link",
"away_name_abbrev", "home_name_abbrev",
"away_code", "home_code",
"away_file_code", "home_file_code",
"away_team_city", "home_team_city",
"away_team_name", "home_team_name"), function(x)
xpathSApply(XML.scoreboard, "//game[@*]", xmlGetAttr, x))
# get game URL of specified team
team.index <- apply(parse.scoreboard, 1, function(x) team %in% x)
team.URL <- parse.scoreboard[team.index, 1][1] # protect from double headers
URL.game <- paste(URL.date, "gid_", team.URL, "/", sep = "")
# get player data
URL.players <- ifelse(is.pitcher, paste(URL.game, "pitchers/", sep = ""),
paste(URL.game, "batters/", sep = ""))
HTML.players <- htmlParse(URL.players)
codes.players <- xpathSApply(HTML.players, "//a[@*]", xmlGetAttr, "href")[-1]
# loop through player codes to match last AND/OR first name
info.players <- sapply(codes.players, function(x) {
URL.player <- paste(URL.players, x, sep = "")
XML.player <- xmlInternalTreeParse(URL.player)
print(x)
info.player <- sapply(c("team", "id", "type", "first_name", "last_name"), function(x)
xpathSApply(XML.player, "//Player[@*]", xmlGetAttr, x))
})
# get results and match player names if necessary
if (just.player == TRUE) {
last.index <- last_name == info.players["last_name",]
first.index <- first_name == info.players["first_name",]
matched.index <- as.logical(last.index + first.index)
matched.players <- data.frame(id = info.players["id", matched.index],
first_name = info.players["first_name", matched.index],
last_name = info.players["last_name", matched.index])
return(matched.players)
}
else return(info.players)
}
Some output:
> aho <- GetPitcherBatterCodes()
> aho
id first_name last_name
1 453311 Tim Lincecum
> aho2 <- GetPitcherBatterCodes(just.player = FALSE)
> aho2
116615.xml 133982.xml 217096.xml 277405.xml 346793.xml 408241.xml
team "sfn" "sfn" "sfn" "sfn" "sfn" "sdn"
id "116615" "133982" "217096" "277405" "346793" "408241"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Randy" "Bob" "Barry" "Justin" "Jeremy" "Jake"
last_name "Johnson" "Howry" "Zito" "Miller" "Affeldt" "Peavy"
425514.xml 429718.xml 429723.xml 429781.xml 429985.xml 430161.xml
team "sdn" "sdn" "sfn" "sdn" "sdn" "sfn"
id "425514" "429718" "429723" "429781" "429985" "430161"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Heath" "Shawn" "Merkin" "Kevin" "Chad" "Noah"
last_name "Bell" "Hill" "Valdez" "Correia" "Gaudin" "Lowry"
430606.xml 430650.xml 430657.xml 430665.xml 430912.xml 432934.xml
team "sdn" "sdn" "sdn" "sfn" "sfn" "sdn"
id "430606" "430650" "430657" "430665" "430912" "432934"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Mike" "Edwin" "Cha Seung" "Brandon" "Matt" "Chris"
last_name "Adams" "Moreno" "Baek" "Medders" "Cain" "Young"
435619.xml 445995.xml 446207.xml 448592.xml 450312.xml 450527.xml
team "sfn" "sdn" "sdn" "sdn" "sdn" "sfn"
id "435619" "445995" "446207" "448592" "450312" "450527"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Pat" "Arturo" "Josh" "Cla" "Mark" "Alex"
last_name "Misch" "Lopez" "Geer" "Meredith" "Worrell" "Hinshaw"
450832.xml 451216.xml 452724.xml 453281.xml 453311.xml 456043.xml
team "sfn" "sfn" "sfn" "sdn" "sfn" "sfn"
id "450832" "451216" "452724" "453281" "453311" "456043"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Jesse" "Brian" "Billy" "Wade" "Tim" "Jonathan"
last_name "English" "Wilson" "Sadler" "LeBlanc" "Lincecum" "Sanchez"
457117.xml 457566.xml 458155.xml 459987.xml 460044.xml 464351.xml
team "sdn" "sdn" "sfn" "sdn" "sdn" "sfn"
id "457117" "457566" "458155" "459987" "460044" "464351"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Ernesto" "Greg" "Joe" "Cesar" "Cesar" "Kelvin"
last_name "Frieri" "Burke" "Martinez" "Ramos" "Carrillo" "Pichardo"
464400.xml 465629.xml 466412.xml 467683.xml 471183.xml 477581.xml
team "sfn" "sdn" "sdn" "sfn" "sfn" "sdn"
id "464400" "465629" "466412" "467683" "471183" "477581"
type "pitcher" "pitcher" "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Henry" "Edward" "Luis" "Osiris" "Waldis" "Walter"
last_name "Sosa" "Mujica" "Perdomo" "Matos" "Joaquin" "Silva"
489265.xml 491159.xml 502381.xml 503355.xml
team "sfn" "sdn" "sdn" "sdn"
id "489265" "491159" "502381" "503355"
type "pitcher" "pitcher" "pitcher" "pitcher"
first_name "Sergio" "Joe" "Luke" "Jackson"
last_name "Romo" "Thatcher" "Gregerson" "Quezada"
(Follow the link for other posts by apeescape)
Color choosing in R made easy
I don’t know about you, but when I want to make a graph in R, I handpick the colors, line widths etc… to produce awesome output.
A lot of my time is spent on color choosing, I had to find a more convenient way of doing so. Earl F. Glynn’s “Chart of R colors” posted on http://research.stowers-institute.org/efg/R/Color/Chart/ gave me the idea to the following function.
R has an Internal called colors(), it’s output is a 657 long character vector of reserved names for colors.
The names can be used directly as in :
plot(iris,col="orange")
Alternatively, they can be referred to from colors()
plot(iris,col=colors()[503])
The color() function has a two arguments (notice it is not plural… I chose this unreserved name because it’s easy to remember) :
- plot.it – FALSE by default
- locate – 0 by default
The call color() is the same as colors().
> color()[555]==colors()[555] [1] TRUE
Calling color(plot.it=T) or color(T) gives the following output (very similar to Earl F. Glynn’s “Chart of R colors”) :
You can choose and remember the numbers, or print and stick above your working area… but the following makes color() more useful :
Specifying locate = k > 0 will plot the chart above and this time will use the locator() function in a loop to choose colors you want. After choosing k colors the output will be a k-long character vector of the chosen colors.
> color(T,5) [1] "firebrick4" "grey9" "green" "gray99" "khaki1"
You can use it directly in a plot function, the palette of colors will plot first, choose your colors, the plot you called for will be followed by it:
plot(iris,col=color(T,5))
The function is given by :
color <- function (plot.it=F,locate=0)
{
if(!plot.it)
{
return(.Internal(colors())) # so far, not different from colors()
} # close on if
else
{
ytop <- rep(seq(1/26,1,by=1/26),each=26)[1:657]
ybottom <- rep(seq(0,1-1/26,by=1/26),each=26)[1:657]
xleft <- rep(seq(0,1-1/26,by=1/26),times=26)[1:657]
xright <- rep(seq(1/26,1,by=1/26),times=26)[1:657]
pall <- round(col2rgb(colors())/256)
pall <- colSums(pall) ; pall2 <- character(0)
pall2[pall>0] <- "black"
pall2[pall==0] <- "white"
par(mar=c(0,0,1,0))
plot.new()
title(main="Palette of colors()")
rect(xleft,ybottom,xright,ytop,col=colors())
text(x=xleft+((1/26)/2)
,y=ytop-((1/26)/2)
,labels = 1:657
,cex=0.55
,col=pall2)
} # close on else
if(locate==0) print("Palette of colors()")
else
{
colmat <- matrix(c(1:657,rep(NA,26^2-657)),byrow=T,ncol=26,nrow=26)
cols <- NA
i <- NA
for(i in 1:locate)
{
h <- locator(1)
if(any(h$x<0,h$y<0,h$x>1,h$y>1)) stop("locator out of bounds!")
else
{
cc <- floor(h$x/(1/26))+1
rr <- floor(h$y/(1/26))+1
cols[i] <- .Internal(colors())[colmat[rr,cc]]
} # close on else
} # close on i
return(cols)
} # close on else
} # close on else+function
You can also write it to variable for further use:
> cols<- color(T,5) > cols [1] "magenta" "orange3" "palevioletred2" "seagreen4" [5] "seagreen2"
Of course it’s not perfect, I still have not solved issues of working with Devices such as pdf() jpeg() and such…
Your comments are welcome.
(Follow the link for other posts by aviadklein)


















