I'm a big fan of the Google Summer of Code. It brings great projects together with a learning opportunity for students. Once again the R Project was selected to be part of the Google Summer of Code in 2011. Some other notable mathemat...

Bob Muenchen, author of R for SAS and SPSS Users and co-author of R for Stata Users, has updated his in-depth analysis of the popularity of data analysis software. Determining "popularity" for software is a tricky task, but this analysis looks at several different metrics: mailing list traffic, blogs, search volumes, job listings and other such indirect methods, as...

Robin showed me a mathematical puzzle today that reminded me of a story my grandfather used to tell. When he was young, he and his cousin were working in the same place and on Sundays they used to visit my great-grandmother in another village. However, they only had one bicycle between them, so they would

Feature                        NumPy  R
contiguous (virtual) memory    ✔      ✔
'view' memory model            ✔      ✘
subset-assignment              ✔      ✔
vectorized operations          ✔      ✔
memory-mapping                 ✔      ✘*
broadcasting rules             ✔      ✔
index arrays                   ✔      ✔

This comparison is current as of R 2.13.0, NumPy version 1.4.1, and other web resources to date. Because this post was motivated by a

There are several graphics available for visualizing missing data. The following graphic was inspired by many sources. However, I wanted a version using ggplot2. What is visualized here is the percent missing for each variable in the PISA data across countries. The code will be available as part of the multilevelPSA package I am currently

I'm not sure how I missed this package, but I am sure glad I've found it. The data.table package for R provides something of a reconceptualization of the standard data.frame object, though it remains (mostly) compatible with data.frame. The advantage

I came up with an idea to draw a correlation network to get a grasp of the relationships among a list of stocks. An alternative way to show a correlation matrix would be a heat map, which can have limitations with big matrices (>100). Unfortunately, the ggplot2 package doesn’t have an easy way to draw networks, so I was left

Thanks to this post, I found OpenClassroom. In addition, thanks to Andrew Ng and his lectures, I took my first course in machine learning. These videos are quite easy to follow. Exercise 2 requires implementing the gradient descent algorithm to fit a linear regression model to data.
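As a rough illustration of what such an exercise involves, here is a minimal sketch of batch gradient descent for a one-variable linear regression. It is written in plain Python and is not the course's own code; the function name `gradient_descent`, the learning rate `alpha`, the iteration count, and the toy data are all assumptions made up for this example.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit y ~ theta0 + theta1 * x by batch gradient descent.

    Minimizes the halved mean squared error; alpha and iterations are
    illustrative defaults, not values taken from the course.
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Residuals of the current fit on every training point.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical toy data generated from y = 1 + 2x with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)
```

On noise-free data like this the estimates should approach the intercept and slope used to generate it. With real data, a learning rate that is too large makes the updates diverge, so it is worth watching the cost over iterations.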

For its 20th anniversary, JCGS offers free access to papers, including Andrew’s discussion paper "Why tables are really much better than graphs". (Another serious ending for an April Fools' joke!) Incidentally (or rather coincidentally), I received today the great news that our paper "Using parallel computation to improve Independent Metropolis-Hastings based estimation" has been accepted by

First you need to create a workflow in KNIME. This is what I used: I loaded the Iris data, renamed the tables for further use in my scripts, and then either showed a view directly or ran an R snippet first and showed a view afterwards. Once this is done, make sure your R-B...

R can easily generate random samples from a whole library of probability distributions. We might want to do this to gain insight into the distribution's shape and properties. A tricky aspect of statistics is that results like the central limit theorem come with caveats, such as "...for sufficiently large n...". Getting a feel for how...
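To make the "sufficiently large n" caveat concrete, here is a small simulation sketch. It uses Python's standard library rather than R, and the helper names (`sample_means`, `draw_exponential`), the choice of an exponential distribution, the sample sizes, and the seed are all assumptions chosen for illustration. It draws repeated samples from a skewed distribution and summarizes how the sample means behave as n grows.

```python
import random
import statistics


def draw_exponential(rng):
    """One draw from an exponential distribution with rate 1 (mean 1, sd 1)."""
    return rng.expovariate(1.0)


def sample_means(draw, n, replications=2000, seed=42):
    """Return the means of `replications` independent samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(replications)
    ]


# Compare the spread of the sample mean with the 1 / sqrt(n) value
# that theory gives for a distribution with standard deviation 1.
means_by_n = {}
for n in (2, 30, 200):
    means = sample_means(draw_exponential, n)
    means_by_n[n] = means
    print(
        f"n={n:>3}  mean of means={statistics.fmean(means):.3f}  "
        f"sd of means={statistics.stdev(means):.3f}  "
        f"1/sqrt(n)={n ** -0.5:.3f}"
    )
```

Plotting a histogram of each list in `means_by_n` would show the skew of the exponential fading as n increases, which is the kind of feel for the result that a simulation can give.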

The ASA is launching a new blog called the Statistics Forum, managed by Andrew Gelman and to which I will periodically contribute items that may induce some amount of discussion within the community, like the first entry by Michael Lavine on testing. (Meaning I will double-post on the Og and on the Statistics Forum, if

Recently, the good folks at Infochimps.com rolled out a series of new APIs to add to their already impressive set of data resources. I have been in a perpetual state of catch-up since the new year, so I have only now got around to adding some of these new APIs to the infochimps R package. Here

First of all, welcome to my blog! I will write posts about trading, quantitative and algorithmic trading, programming and everything else that is on my mind. Feel free to comment and suggest how to improve this blog. Thank you! Well, one of th...

I recently started using RStudio, the amazing new IDE for R. You can view all of RStudio's keyboard shortcuts by going to the help menu, but I made this printable reference for myself and thought I'd share it. I only included the Windows shortcuts, and...

What is JRI? An additional library and plugin for Eclipse to call R functions in a Java application. And according to Wikipedia it’s a now obsolete API for invoking native C++ calls from Java that has long been supplanted by Java Native Interfa...

Here are the results of a "Curse of Dimensionality" homework assignment for Terran Lane's Introduction to Machine Learning class. Pretty pictures, interesting results, and a good exercise in explicit parallelism with R. It's neat to see distance scaling linearly with standard deviation, and linearly with the Lth-root...

Here is a fast R function to extract exon locations from a BED12 file. Note that fast is a relative term: the function below is fast enough for me but may not be fast enough for others :) Anyway, a BED12 file typically has locations of genomic features (t...

