Some Python Nooks and Crannies

January 31, 2010 · Posted in R bloggers · Comments Off 

I spent this weekend reading Learning Python (Second Edition for Python 2.3!) by Mark Lutz. Python is my favorite programming language, but my experience with it has been mostly anecdotal; I come up with my own solutions and functions and I Google whatever I do not know. I decided to spend a couple of days with this incredibly out-of-date book to formalize my knowledge of the base Python language. It was fairly easy reading because I already had experience with about 80% of the constructs discussed. But it was fun to learn some things that I have not used, and some things that I did not even know existed. I want to share some of these gems here. Pardon me if all of this stuff is obvious to you ;-) .

Populating a String with a Dictionary

>>> data = {}
>>> data['first_name'] = "Ryan"
>>> data['age'] = 21 #some programming humor
>>> print "Hello, my name is %(first_name)s and I am %(age)d years old." % data
Hello, my name is Ryan and I am 21 years old.

Notice that the right-hand operand of the last % sign above can also be a function call, as long as the function returns a dictionary.
The s and the d after the dictionary keys are the usual format specifiers (s for a string and d for an integer).
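
As a minimal sketch of the point about functions, here is the same template filled from a function call. It is written in Python 3 syntax (an assumption; the session above is Python 2, where print is a statement), and the helper name load_person is hypothetical.

```python
def load_person():
    """Hypothetical helper: returns the dictionary used to fill the template."""
    return {"first_name": "Ryan", "age": 21}

template = "Hello, my name is %(first_name)s and I am %(age)d years old."

# The right-hand operand of % only has to evaluate to a mapping,
# so a function call works just as well as a variable.
message = template % load_person()
print(message)
```

Keys that the template does not mention are simply ignored, which is what makes this convenient with large dictionaries.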

This gem is important to me because I use dictionaries containing tons of data and need to reformat it! This is what I used to have to do:

>>> print "Hello, my name is %s and I am %d years old." % (data['first_name'], data['age'])

Some More on Printing to Streams

By default, separating variables by a comma will produce a space between the printed values.

>> x = 3
>> y = 2
>> print x, y
3 2

Usually print automatically inserts a newline at the end of the output. We can suppress it by using a dangling comma. This is more useful when printing to another stream such as a file.

>>> x = 2
>>> y = 3
>>> out = open("temp.txt", "w")
>>> print >> out, x,
>>> print >> out, y,
>>> out.close()

The file then contains the line

2 3
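
For comparison, a rough Python 3 equivalent of the session above (an assumption; the post itself uses the Python 2 print statement). An in-memory io.StringIO buffer stands in for temp.txt so the sketch leaves no file behind.

```python
import io

x, y = 2, 3
out = io.StringIO()           # stands in for open("temp.txt", "w")

# In Python 3 the trailing comma becomes end=..., and ">> out" becomes file=out.
print(x, end=" ", file=out)
print(y, end=" ", file=out)

contents = out.getvalue()     # "2 3 " (note the trailing space, unlike Python 2)
out.close()
```

Unlike the Python 2 dangling comma, end=" " writes the separator immediately, so the buffer ends with a space.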


Do it, Or Else…or Not

A for or while loop can have an else, to perform actions when control leaves the loop without encountering a break. Personally, I think done, when-complete or something similar would have been better than else.

Consider the example of searching a list for a value.

>>> names = ["Sarah", "Nick", "Sam", "Chloe"]
>>> for name in names:
...     if name == "Ryan":
...         break
... else:
...     print "Not found!"
...
Not found!

Global only Matters for Assignment

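If we define a variable outside any class or function, it is global to the module. Any function in the module can read it without a declaration, but assigning to the name inside a function creates a new local variable unless the name is declared global. This is new to me. What I used to do was this: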

>>> x = 3
>>> def f():
...     global x
...     print x
...
>>> f()
3

when all I really need is:


>>> x = 3
>>> def f():
...     print x
...
>>> f()
3

However, we do need the statement global x if we assign to the variable inside the function.
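
A small sketch of the assignment case, in Python 3 syntax (an assumption; the sessions above are Python 2). The names counter, read_counter, bump_counter and broken_bump are made up for illustration.

```python
counter = 0

def read_counter():
    # Reading a module-level name needs no declaration.
    return counter

def bump_counter():
    # Without this declaration the assignment below would create a local name.
    global counter
    counter += 1

def broken_bump():
    # No global statement: the assignment makes `counter` local to this
    # function, so reading it on the right-hand side raises UnboundLocalError.
    counter = counter + 1

bump_counter()
try:
    broken_bump()
    failed = False
except UnboundLocalError:
    failed = True
```

After this runs, read_counter() returns 1 and failed is True: only the function that declared global x-style access was able to rebind the module-level name.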

Class Properties

Class properties simplify the creation of getters and setters…sort of. Of course, if we define a class attribute outside of a class method, we can access it without a getter or setter:

>>> class MyClass:
...     myvar = 2
...
>>> a = MyClass()
>>> a.myvar
2
>>> a.myvar = 3
>>> a.myvar
3

But if we want to be more careful, we do not define the variable in such a way. Instead, we can store it on the instance in a class method and then write our getter and setter methods. By creating a property, we intercept reads of and assignments to the attribute, which allows access to the variable as if it were defined as above. Using the property constructor, we say which methods to use as the getter and setter. In other words, we do not need to call the getter and setter methods explicitly. After typing all of that, this seems trivial…


>>> class MyClass(object):  # properties need a new-style class in Python 2
...     def __init__(self):
...         self._myvar = 2  # stored under a different name than the property
...     def getmyvar(self):
...         return self._myvar
...     def setmyvar(self, val):
...         self._myvar = val
...     myvar = property(getmyvar, setmyvar, None, None)
...
>>> h = MyClass()
>>> h.myvar
2
>>> h.myvar = 3
>>> h.myvar
3
>>>
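
The same idea can be spelled with decorators, which is the form found in Python 2.6 and later and in Python 3. This is only a sketch: the backing attribute _myvar and the non-negative check are assumptions added to show why a setter can be worth having.

```python
class MyClass(object):
    def __init__(self):
        self._myvar = 2              # backing attribute, distinct from the property name

    @property
    def myvar(self):
        # Called on every read of obj.myvar.
        return self._myvar

    @myvar.setter
    def myvar(self, val):
        # Called on every assignment obj.myvar = val.
        if val < 0:
            raise ValueError("myvar must be non-negative")  # illustrative check
        self._myvar = val

h = MyClass()
h.myvar = 3      # goes through the setter, no explicit method call needed
```

Callers keep using plain attribute syntax, so validation can be added later without changing any code that uses the class.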

Finally, Exceptions

I use exceptions a lot, but there are some constructs I just now read about, particularly, finally. Ha ha.

try/finally

Suppose we want to run some code in a try block, and we know that a particular block must run whether or not an exception occurs. That block goes in finally. A try/finally statement does not handle the exception: after the finally block runs, the exception keeps propagating, so it is assumed that the caller (or a caller further up) catches it. On its own this does not leverage the power of exceptions, unless the caller catches the exception.
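
A minimal sketch of that behaviour, in Python 3 syntax (an assumption; the post is about Python 2, where the construct works the same way). The function name risky and the log list are made up for illustration.

```python
log = []

def risky(fail):
    try:
        log.append("try")
        if fail:
            raise ValueError("something went wrong")
    finally:
        # Runs on both the normal path and the exception path.
        log.append("finally")

risky(False)             # no exception: try, then finally

try:
    risky(True)          # the exception still propagates to the caller
except ValueError:
    log.append("caught by caller")
```

The log ends up as try, finally, try, finally, caught by caller: the cleanup ran both times, and the error was still the caller's to handle.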

Catch Multiple Exceptions

Our code can catch multiple exceptions by enclosing the exception types in a tuple. We can also get data associated with the exception. I have used this data several times, but always forget the syntax.

>>> import httplib, time, urllib2
>>> try:
...     html = urllib2.urlopen(url).read()
... except (urllib2.HTTPError, httplib.BadStatusLine), e:
...     # e.code exists only on HTTPError, so print the exception itself
...     print "An error %s occurred. Let's sleep it off." % e
...     time.sleep(1)

try/except/else

Our code in the try block is executed and one of two things happens: an exception occurs, or not. If an exception occurs, hopefully we catch it using except. If we do not, we hope the caller catches it. If no exception occurs, the code within else is run.

try:
    html = urllib2.urlopen(url).read()
except (urllib2.HTTPError, httplib.BadStatusLine):
    print "Website is down or something."
    time.sleep(1)
except:
    print "Something else tragic happened."
else:
    parse_html(html)

try/else

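Unlike the finally block of a try/finally construct, the else block runs only if the code in the try block completes without raising an exception. (An else clause is only allowed after at least one except clause.) This allows us to avoid using boolean flags to test for success!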

Although I learned a lot from the second edition, I think it is time to buy the fourth edition…

Rcpp 0.7.4

January 31, 2010 · Posted in R bloggers · Comments Off 
Yesterday, and about nine days after release 0.7.3 of Rcpp (a set of R / C++ interface classes), Romain and I released version 0.7.4. It has been uploaded to CRAN and Debian, and mirrors should already have new versions. As before, my local page is also available for downloads and some more details.

The release once again combines a number of necessary fixes with numerous new features:

  • Building on OS X did not support multi-arch, and we are grateful to Simon who once again came to the rescue. Things should be fine now. The big take-away is that under no circumstances should a package include either a configure file or src/Makefile if you want multi-arch builds for free. As Rcpp is effectively a library to be used by other packages, this mattered.
  • We added a file NEWS from which I include the relevant section below.
  • Much more code re-organisation and enhancement making passage of various C++ types even easier -- see the NEWS entry below.
  • More unit tests, now including ones for the 'old Rcpp API'.
Post-release, I also reworked the doxygen setup slightly so that all examples are now browseable, and the whole documentation is now searchable as well.

Lastly, we had a remaining Windows build issue. Also, Brian Ripley and Uwe Ligges kindly sent us a small patch supporting the new Windows 64-bit builds using the new MinGW 64-bit compiler for Windows -- so release 0.7.5 may follow in due course.

The NEWS file entry for release 0.7.4 is as follows:

0.7.4	2010-01-30

    o	matrix-like indexing using operator() for all vector 
    	types : IntegerVector, NumericVector, RawVector, CharacterVector
    	LogicalVector, GenericVector and ExpressionVector. 

    o	new class Rcpp::Dimension to support creation of vectors with 
    	dimensions. All vector classes gain a constructor taking a 
    	Dimension reference.

    o	an intermediate template class "SimpleVector" has been added. All
    	simple vector classes are now generated from the SimpleVector 
    	template : IntegerVector, NumericVector, RawVector, CharacterVector
    	LogicalVector.

    o	an intermediate template class "SEXP_Vector" has been added to 
    	generate GenericVector and ExpressionVector.

    o	the clone template function was introduced to explicitly
    	clone an RObject by duplicating the SEXP it encapsulates.

    o	even smarter wrap programming using traits and template
        meta-programming using a private header to be included only
        by RcppCommon.h

    o 	the as template is now smarter. The template now attempts to 
    	build an object of the requested template parameter T by using the
    	constructor for the type taking a SEXP. This allows third party code
    	to create a class Foo with a constructor Foo(SEXP) to have 
    	as for free.

    o	wrap becomes a template. For an object of type T, wrap uses
    	implicit conversion to SEXP to first convert the object to a SEXP
    	and then uses the wrap(SEXP) function. This allows third party 
    	code creating a class Bar with an operator SEXP() to have 
    	wrap for free.

    o	all specializations of wrap :  wrap, wrap< vector >
    	use coercion to deal with missing values (NA) appropriately.

    o	configure has been withdrawn. C++0x features can now be activated
    	by setting the RCPP_CXX0X environment variable to "yes".

    o	new template r_cast to facilitate conversion of one SEXP
    	type to another. This is mostly intended for internal use and 
    	is used on all vector classes

    o	Environment now takes advantage of the augmented smartness
    	of as and wrap templates. If as makes sense, one can 
    	directly extract a Foo from the environment. If wrap makes
    	sense then one can insert a Bar directly into the environment. 
    	Foo foo = env["x"] ;  /* as is used */
    	Bar bar ;
    	env["y"] = bar ;      /* wrap is used */

    o	Environment::assign becomes a template and also uses wrap to 
    	create a suitable SEXP

    o	Many more unit tests for the new features; also added unit tests
        for older API

As always, even fuller details are in the ChangeLog on the Rcpp page, which also leads to the downloads, the browseable doxygen docs and zip files of doxygen output for the standard formats. Questions, comments etc. should go to the rcpp-devel mailing list off the R-Forge page.

With With

January 31, 2010 · Posted in R bloggers · Comments Off 

No, that is not a typo in the title. In my programming I came across a solution that I thought was pretty cool. I have a function that basically takes two objects and passes the elements of the objects to another function as arguments. This is a pretty simple thing to do, but it can be painful to type everything out:

f<-function(obj1,obj2){
g(obj1$a,obj1$b,obj1$c,obj2$x,obj2$y,obj2$z)
}


Congruential generators all are RANDUs!

January 30, 2010 · Posted in R bloggers · Comments Off 

In case you did not read all the slides of Regis Lebrun’s talk on pseudo-random generators I posted yesterday, one result of Marsaglia’s (in a 1968 PNAS paper) exhibited my ignorance during Regis’ Big’ MC seminar on Thursday. Marsaglia indeed showed that the outputs of all multiplicative congruential generators

r_{i+1} = k r_i \bmod m

lie on a series of hyperplanes whose number gets ridiculously small as the dimension d increases! If you turn the r_i’s into uniforms u_i and look at the d-dimensional vectors

\pi_1=(u_1,\ldots,u_d),\,\pi_2=(u_2,\ldots,u_{d+1}),\,\ldots

they lie on a small number of hyperplanes, at most (d!m)^{1/d}, which gives 41 hyperplanes when m=2^{32} and d=10… So in this sense all generators share the same poor property as the infamous RANDU, which is such that (u_{i},u_{i+1},u_{i+2}) always lies on one of 16 hyperplanes, an exercise we use in both Introducing Monte Carlo Methods with R and Monte Carlo Statistical Methods (but not in our general audience solution manual). I almost objected that the general result was irrelevant as the \pi_i’s share u_j’s, but of course the subsequence \pi_1,\pi_{d+1},\pi_{2d+1},... of non-overlapping vectors also enjoys this property!
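
As a quick numerical check of the bound, here is a small Python sketch (the language choice and the function names are assumptions, not part of the talk). It evaluates (d!m)^{1/d} for d=10 and m=2^{32}, and also verifies the linear relation behind RANDU's defect. The RANDU constants used here, multiplier 65539 and modulus 2^{31}, are the standard published ones rather than something stated in this post.

```python
import math

def hyperplane_bound(d, m):
    """Marsaglia's upper bound (d! * m) ** (1/d) on the number of hyperplanes."""
    return (math.factorial(d) * m) ** (1.0 / d)

def randu(seed, n):
    """First n values of RANDU: x_{i+1} = 65539 * x_i mod 2**31."""
    values, x = [], seed
    for _ in range(n):
        x = (65539 * x) % 2**31
        values.append(x)
    return values

bound = hyperplane_bound(10, 2**32)

# Since 65539 = 2**16 + 3, squaring gives 65539**2 = 6 * 65539 - 9 (mod 2**31),
# hence x_{i+2} - 6 x_{i+1} + 9 x_i is always a multiple of 2**31.
xs = randu(1, 1000)
lattice_ok = all(
    (xs[i + 2] - 6 * xs[i + 1] + 9 * xs[i]) % 2**31 == 0
    for i in range(len(xs) - 2)
)

print(int(bound), lattice_ok)
```

Dividing that relation by 2^{31} shows that every triple of consecutive uniforms satisfies u_{i+2} - 6u_{i+1} + 9u_i = integer, which is exactly a small family of parallel planes.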


Filed under: R, Statistics, University life Tagged: congruential generators, Introducing Monte Carlo Methods with R, Monte Carlo methods, Monte Carlo Statistical Methods, random simulation, RANDU

Practical Implementation of Neural Network based time series (stock) prediction – PART 2

January 30, 2010 · Posted in R bloggers · Comments Off 
As a brief follow-up to the series, I want to take a moment to describe Weka, the machine learning tool that we will be using to implement the neural network. It is a fantastic open source Java-based tool that was developed at the University of Waikato, New Zealand. Users who are not all that experienced with programming have access to a GUI shell that makes running a regression or classification scenario a snap. More advanced Java programmers may opt to use a command shell or customize their own classes. In addition, there are numerous support options, including a fantastic Nabble thread that you may subscribe to, the Weka thread. I have found that questions are answered very promptly and there is a lot of activity at the site, so you don't have to wait a long time to get a response. There are also some great books by Ian Witten and Eibe Frank that guide you through practical data mining with a minimal barrage of mathematical theory, such as Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. I have the first edition and have found it an immensely useful reference.

There are a variety of built-in learning modules included in the free utility (Weka), such as linear regression, neural networks (a.k.a. multilayer perceptrons), decision trees, support vector machines, and even genetic algorithms.



Fig 1. Using the Weka Gui

In Fig 1, we see that the Weka GUI Chooser has been opened and the Explorer option was selected. The native format that Weka commonly uses is the .ARFF format. Fortunately for us, however, it also reads .CSV files, which are easily created with a save option in Excel. The file we will first train on is sim_training_set_perfect_sin.csv. Once it is loaded, you will see all of the relevant variables in the Weka Explorer shell.




Fig 2. Loaded Excel csv training source file for Weka

We notice some new variables have been introduced that were not in part 1.
To understand why, let's show the CSV file that is used here.



Fig 3. Training set variables.

What we see is that the original perfect sine wave signal has been preserved in the column labeled signal. The additional signals, s-1, s-2, s-3, s-4, are often called delayed or embedded (dimension) variables. They are simply lagged values of the signal that are used to train the neural network. There is no exact method to determine the number of lagged values, although a number of different heuristics exist. For now, we will simply accept that four delayed values of the signal are useful. The last column, called bias, is common to neural networks. The bias node allows the neural network to learn a constant offset in the signal via training. For instance, imagine our signal had an average of 2.0 and we were trying to learn it. The neural network needs to have some input that will track that constant value, or it will have large offset errors that will obstruct convergence. The bias node accomplishes that operation. Those familiar with engineering theory will recognize this node as a DC bias.
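
To make the layout of the training file concrete, here is a small Python sketch of how such rows could be built. It is an illustration under assumptions, not the spreadsheet used in the post: the function name lagged_rows is made up, and the constant bias value of 1.0 is a guess since the post does not state it.

```python
import math

def lagged_rows(signal, n_lags=4, bias=1.0):
    """Build rows of [signal, s-1, ..., s-n_lags, bias] for each usable time step."""
    rows = []
    for t in range(n_lags, len(signal)):
        # s-k is the value of the signal k steps before time t.
        lags = [signal[t - k] for k in range(1, n_lags + 1)]
        rows.append([signal[t]] + lags + [bias])
    return rows

# A perfect sine wave with a 30-day period, one sample per day.
signal = [math.sin(2 * math.pi * t / 30) for t in range(60)]

header = ["signal", "s-1", "s-2", "s-3", "s-4", "bias"]
rows = lagged_rows(signal)
```

The first four time steps are dropped because they do not yet have four earlier values to use as lags.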

OK, so one other thing we notice in the GUI is that Class:signal(num) is selected on the bottom right. This is because we are predicting a numerical class, rather than a nominal one (which is the typical default for classification schemes).

Next, we select the classify tab to select our learning scheme, which in this case will be the MultilayerPerceptron.



We then want to make sure certain options are selected.



We set nominalToBinaryFilter and normalizeAttributes to False, as we don't wish to convert the input data to binary and are not using nominal attributes. However, we want normalizeNumericClass set to True. As mentioned earlier, this makes Weka normalize the class to its internal limiting range, so we don't have to. Also, we will train for 1000 epochs.



Fig 6. Preferences for MLP training model.

We will build a model by training on 66% of the data. We want to store and output the predictions so that we can visually see what they look like. Lastly, we will Preserve order for split as it allows us to display the predicted out of sample time series in the original order. With all of these features set, we simply click OK and the start button and it will quickly build our first Neural Network model!



Fig 7. Results with summary of statistics console.

If we scroll up, we can see the actual weights that the model converged upon for our Multilayer Perceptron, which will be used to predict the out of sample data.
We can see that there is a nice printout of the last 34% of results (271 out of sample data points) along with the predicted value and error, as well as a useful summary of statistics at the bottom of the console. We often use root mean squared error as a performance metric for neural net regressions. In this case, the value of 0.0005 is quite good. But let's use a little trick to get a visual inspection of just how good. We can grab the data from the console (by selecting it with the left mouse button and dragging), then copy this data back into Excel. As a result, we can then plot the actual versus predicted out of sample results inside of Excel.
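
For reference, root mean squared error is simply the square root of the average squared difference between actual and predicted values. A minimal Python sketch follows; the sample numbers are made up for illustration and are not Weka's output.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual = [0.0, 0.5, 1.0, 0.5]          # made-up values
predicted = [0.0, 0.5, 0.998, 0.502]   # made-up values
error = rmse(actual, predicted)
```

Because the differences are squared before averaging, a few large misses raise the score much more than many tiny ones.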



Fig 8. Importing prediction results back into Excel.

Notice that we cut and paste the data from the Weka console back into Excel, but must select text to columns in order to separate the data back into columns.



Fig 9. Selecting the regions to separate as columns.

And tada! We can now plot the predicted vs. actual values, and look how nicely they line up. The errors are extremely small on the out of sample set: some are 0 and others are .001, imperceptible to the eye without zooming way in on a point.
The network actually found a perfect model for this time series (we will expand a bit later on why), and the errors can be attributed to numerical precision.



Fig 10. Resulting plot of predicted vs. actual data.

We have now built a basic Neural Network for a simple sine wave time series using Weka and Excel. The predicted out of sample results were extremely good.
However, the data signal we used, the simple sine wave, is a very easy signal to learn as it is perfectly repetitive and stationary. We will see that as the signal gets increasingly complex, the predictions do not work as well.
That's it for Part 2; comments are welcome.

Mining Tuition Data for US Colleges and Universities, and a Tangent

January 30, 2010 · Posted in R bloggers · Comments Off 

I wrote this script for the UCLA Statistical Consulting Center. I don’t know all of the specifics, but one of our faculty members has this idea that we can help our paper, The Daily Bruin, with their graphics or something to that effect. I don’t quite understand because our paper has never really been big on graphics for data, but apparently some undergraduates are going to work on this.

Anyway, we need datasets that are of interest to UCLA students so that our undergraduates can create cool graphics that will stun the readers. Some of the data we were considering:

  • parking data for one week; gate entries, to correlate with some other variable (weather was mentioned. ugh)
  • Registrar study list/class schedule information for every student (anonymized of course) from Fall 2008. $50 for programmer time. (I could have done it quickly, for free! …if I worked in their office and it was legal, I mean.)
  • 9/11 pager intercepts.
  • tuition data for US colleges and universities over ten years.

The tuition data was presented in a bunch of tables spread over several pages. Unfortunately, the type of school is not reported. Due to this limitation, I had to execute separate queries to access each year of data and each type of school. tuition.py is the result of my labor. Scraping is always a lot of fun, and it is an awesome feeling to be able to extract bulky data! This was also one of my first experiences with pylint. As much as I love Python, it is easy to write ugly code. pylint checks code for violations of Python style conventions such as tabs vs. spaces, spaces around binary operators, function naming conventions, line length and commenting conventions. It also checks for most (if not all) syntax errors, and some logic errors.


I provide this code for educational purposes only. Some may be tempted to ask for the dataset, but for me to grant the request would be in violation of the copyright.

Although extracting the data was the fun part, I feel it would be “sudden” for me to end the post here. So, I should take a quick look at it to show how important mining messy data is. There has recently been an uproar in California regarding increasing fees at the University of California and the California State University systems. The University of California system is the “research” University system in California, whereas California State University does not emphasize research as much (sorry, that’s the best way I can explain it) and does not offer a PhD degree. UCs are generally harder to get into. However, some of the better CSUs (such as Cal Poly etc.) can be more highly regarded than some of the lower tier UCs.

Anyway, I wanted to take a look at how fees at UCLA, the UC system and the California education system have changed over time. I also want to dispel some myths that students have propagated on campus about the current state of fees in the UC system. Note that next year these fees are expected to increase by 30%…

Myth #1: UC and CSU are in this together, and equally.

One would hope that the burden of higher in-state fees would be shared equally between UC and CSU. The figure below indicates that this is sadly not the case. Since 2002, the difference between UC and CSU fees has been growing. This may suggest that the services offered only at UC have grown more expensive over time (more research centers, more specialized staff?). Starting in 2002, and more so in 2006, UC fees began to skyrocket compared to those of CSU. There seems to be some non-statistical evidence that the State of California has disproportionately raised fees for UC students, mainly due to differences in philosophy and demographics between the two University systems. MYTH.

[Figure: UC vs. CSU in-state fees over time]

Myth #2: “We went from being one of the cheapest public school systems to one of the most expensive!”

While we are hurting here in California, the only reliable way to judge how our fees fare is to compare them to the national average over time. One argument is that UC fees are now some of the most expensive in the nation. My guess is that in-state fees are approximately normally distributed, but I chose to use the median to point out how important it is to understand what the median is! The plot below compares UC in-state fees to other public 4-year schools across the country. Our fees have been above the national median since data reporting began in 1999, but since 2003 or so, our fees have grown from being about $2000 above the median to about $3000 above the median. This means that our fees are in the top 50% of school systems, but it does not show how high into the top 50% the UC in-state fees fall. While the median is easy to interpret, top 50% and bottom 50% are not black and white. Instead, let’s look at the percentile rank of UC’s fees over time, which is displayed in the second plot below. This gives us a way to quantitatively compare UC fees to national fees without a “shock” factor.

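[Figure: UC in-state fees vs. the national median for public 4-year schools, and the percentile rank of UC fees over time]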
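
To show the difference between "above the median" and a percentile rank, here is a small Python sketch. The fee figures are invented for illustration (the real dataset cannot be shared), and the percentile rank definition used, the share of values below plus half of any ties, is one common convention and an assumption about how the plot was made.

```python
import statistics

def percentile_rank(value, population):
    """Percent of the population below `value`, counting ties as half."""
    below = sum(1 for v in population if v < value)
    equal = sum(1 for v in population if v == value)
    return 100.0 * (below + 0.5 * equal) / len(population)

# Invented in-state fees for eight public 4-year schools, in dollars.
fees = [3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
uc_fee = 8500   # invented value

national_median = statistics.median(fees)   # 6500.0
rank = percentile_rank(uc_fee, fees)        # 75.0
```

Both statements are true of the invented fee, but "75th percentile" says how far into the top half it sits, which "above the median" alone does not.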

From 1999-00 school year to 2009-10 school year, UC fees have consistently been above the national median for 4-year public schools. It appears that UC fees increased from the 70th percentile to about 85th percentile in the past 10 years. I do not have data from before 1999, so this rumor may be true, but based on this data, it is false. First part of the myth: Not enough data to conclude, second part: TRUE.

Myth #3: “UC(LA) fees have gotten so ridiculous, it is practically becoming a private school!” or “I might as well just go to a private school!”

Um, think again. This one should be simple to dispel. Let’s take the private school across town that Bruins love to hate: USC. We see that USC’s in-state fees have risen linearly since 1999 and the gap between USC and UCLA is growing, not shrinking! Even when compared to the national median for private 4-year not-for-profit colleges, UCLA is not anywhere close. It may be true that UC is more expensive than some cheaper private schools, though. So put away the USC gear and the notion that you won’t pay much more at USC than you would at UCLA. MYTH.

[Figure: UCLA vs. USC fees over time]

Conclusion

So, yes, the California education system is a mess, but it is not the apocalypse that many students at UC schools have made it out to be…yet. California is in a deeper mess than the nation as a whole (relatively speaking, of course), so it is to be expected that our fees increase faster than those of other public school systems, especially given the cost of living in the state. Based on this data, I predict that even with an improving national economy, the national median in-state fees will catch up closer to UC’s fees.

Of course, all of my analysis considers data before the 30% increase that takes effect in Fall 2010. So stay tuned.

Practical Implementation of Neural Network based time series (stock) prediction – PART 1

January 29, 2010 · Posted in R bloggers · Comments Off 
The following introduction is meant to help readers understand the basic concepts and practical implementation of neural nets applied to a financial time series. I will not go too deep into detail about the mathematics behind the neural net at the moment. My goal is to get you to understand practical details about how to actually implement a neural net using simple tools and models. We will start with a simple model to understand a basic time series. The time series waveform is a simple sine wave with the period set to 30 days. It is implemented in Excel as a source file to be processed in any machine learning capable software. For this example I will be using a very good Java-based GUI program called Weka.



Fig 1. Shows a simple sine wave set to a period (T) of 30 days.

It is a very simple time series based upon the well known sine wave model.
We can see that one complete cycle occurs over a period of 30 days. Each time step is set to 1 unit or day per step.



Fig 2. A complex sinusoidal signal with f1 set to 1/T, where T=30 days.

Anyone who has worked with financial time series knows that they can be far more complicated than simple sine-based models. However, it is often better to learn from basic principles and move up in complexity in order to have a good grasp of what we are doing. The second figure is a bit more complicated, as it is the sum of three different sine-based signals. Each signal has a different amplitude and frequency associated with it. We could use Fourier analysis to show the spectrum of the three different tones if we wished. However, for now we'll just accept that it is a complex signal. Notice that one property of this signal, which is also a bit optimistic, is that it is stationary. Essentially, a stationary signal has statistical properties that do not change over time. For example, if we were to sample the average from different slices, it would not change much. We can also see visually that the time series is mean reverting. Financial time series differ in that they are not stationary, but typically have a unit root and must often be transformed in order for the neural network to process them. The purpose of the complex signal, however, is to show how we can move to an increasingly complex signal from a very simple model.
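
A rough Python sketch of the two test signals, with a crude stationarity check. Only the 30-day period (f1 = 1/T) comes from the post; the amplitudes and the two higher frequencies in TONES are assumptions chosen for illustration, as are the function names.

```python
import math

T = 30  # period in days, as in Fig 1

def simple_sine(t):
    """Simple sine wave with a period of T days."""
    return math.sin(2 * math.pi * t / T)

# (amplitude, frequency) pairs. Only f1 = 1/T comes from the post;
# the other tones are assumed values.
TONES = [(1.0, 1.0 / T), (0.5, 2.0 / T), (0.25, 4.0 / T)]

def complex_signal(t):
    """Sum of three sine tones with different amplitudes and frequencies."""
    return sum(a * math.sin(2 * math.pi * f * t) for a, f in TONES)

series = [complex_signal(t) for t in range(360)]   # one sample per day

def mean(values):
    return sum(values) / len(values)

# Crude stationarity check: the average of different slices barely changes.
first_half_mean = mean(series[:180])
second_half_mean = mean(series[180:])
```

Both slice means come out at essentially zero here, which is the "average from different slices does not change much" property; a trending price series would fail this check.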



Fig 3. Normalized Complex Signal

The final step is to simply normalize the time series so that it is constrained to the vertical range (what we call rails) of minus 1 to plus 1. A typical neural net is limited by an internal function, sometimes called a squashing function. This is a non-linear processing function, often a sigmoid or tanh (hyperbolic tangent) function, which saturate at (0,1) and (-1,1), respectively.
A simple transformation is xnew = (xold-vmino)*(vmaxn-vminn)/(vmaxo-vmino) + vminn, which reduces to xnew = xold*(vmaxn-vminn)/(vmaxo-vmino) when the signal and the rails are both symmetric about zero.
Here vmaxn and vminn are the new maximum and minimum values of the time series, and vmaxo and vmino are the old ones. In this case we will use -.9 and +.9 as the limiting rails so as to avoid saturation effects. Often software will do the normalizing for you. In the case of Weka, you can choose to have it do this operation for you, in which case no manual normalization is necessary, although we should understand it for future reference.
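
A minimal Python sketch of rescaling a series to rails of -0.9 and +0.9. It uses the general min-max form, which shifts the series as well as scaling it; that shift is needed whenever the signal is not centered on zero. The function name rescale and the sample values are made up for illustration.

```python
def rescale(values, new_min=-0.9, new_max=0.9):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot rescale a constant series")
    scale = (new_max - new_min) / (old_max - old_min)
    # Shift to zero, stretch to the new width, then shift to the new floor.
    return [(v - old_min) * scale + new_min for v in values]

raw = [2.0, 3.5, 5.0, 4.25]   # made-up series with a non-zero average
scaled = rescale(raw)         # approximately [-0.9, 0.0, 0.9, 0.45]
```

The smallest value lands on the lower rail and the largest on the upper rail, so nothing reaches the saturated ends of a tanh unit.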

That's it for part I. Next we will investigate how to transport the data to Weka and have it build and predict the out of sample signal set!

Please add any comments on where I can improve my tutorial as I am new to the blogger scene and appreciate any feedback.

Big’MC seminar

January 29, 2010 · Posted in R bloggers · Comments Off 

Two very interesting talks at the Big’ MC seminar on Thursday:

Phylogenetic models and MCMC methods for the reconstruction of language history by Robin Ryder

Uniform and non-uniform random generators by Régis Lebrun

which are both on topics close to my interest, evolution of languages (I’ll be a philologist in another life!) and uniform random generators.


Filed under: R, Statistics, University life Tagged: Big' MC, language history, Monte Carlo methods, phylogenetic model, random generator

R creators win prestigious Statistical Computing and Graphics Award

January 29, 2010 · Posted in R bloggers · Comments Off 

The American Statistical Association recently created a new, bi-annual award to recognize an individual or team for innovation in computing, software, or graphics that has had a great impact on statistical practice or research. The committee has just announced the winner (or in this case, joint winners) of the first award: Robert Gentleman and Ross Ihaka, for their work in initiating the R Project for Statistical Computing.

It really can't be overstated how well-deserved this award is. No other project has made world-class software available to so many people while also encouraging so many to participate in advancing the art and science of statistical computing through the medium of an open-source project. While many, many people have made significant contributions to the R project over the years, it's undeniable that Robert and Ross were the ones that got it all started. Congratulations, R&R!

(For more information about the history of the R project, this New York Times article from 2009 is a great resource. You can also read a profile of Ross Ihaka in the New Zealand Herald. Disclosure: Robert Gentleman recently joined the board of REvolution Computing.)

ASA Sections on Statistical Computing and Statistical Graphics: Statistical Computing and Graphics Award

Crayola crayon colors, 1949-present

January 29, 2010 · Posted in R bloggers · Comments Off 

Here's an example I featured in my list of 7 Awesome Things about R (awesome thing #3: graphics and data visualization). The Learning R blog features a reproduction of a graphic that recently appeared on Flowing Data. It shows the colors in a box of Crayola crayons: before 1949 there were only 8, but over the years additional colors have been added to the mix. Today, there are 120 colors, and the chart below shows the progression of colors added over time.

Crayola

The amazing thing about this graph isn't just that it's reproduced in R from a hand-crafted original. The amazing thing is that this chart is completely automated: it reads the history of colors directly from the list of Crayola colors on Wikipedia, creates barcharts for each time period, and sorts the colors in a pleasing manner (by converting them to the HSV color space) before exporting the custom chart to a PNG file. All of this is done in less than 30 lines of R code. If Crayola introduces a new set of colors this year, then as long as someone updates the Wikipedia page, a new, up-to-date version of this graph can be created in seconds. Now that's awesome.

Learning R: ggplot2: Crayola Crayon Colours (via @gaygoygourmet)
