June 19, 2011
This humble blog has proudly been part of R-bloggers for a couple of weeks. I have had this website as my homepage for some months now and have found really inspiring and informative things there. So I wish all the best to Tal Galili and his great job with...

## A Little Sampling Puzzle

June 18, 2011
Suppose you have 10 objects from which you take a sample of size 20 (with replacement, or you're in trouble). What's the probability that each object was chosen at least once? Getting an answer via simulation is pleasantly easy: f <- function(n=10,...
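
One way such a simulation could look, as a sketch in base R rather than the post's own `f` (which is cut off in the excerpt). The helper names `covers_all` and `exact_prob` are made up for illustration; the exact value comes from inclusion-exclusion over the objects that are missed.

```r
# Hypothetical helper: TRUE if a sample of size k, drawn with replacement
# from objects 1..n, contains every object at least once.
covers_all <- function(n = 10, k = 20) {
  length(unique(sample(n, k, replace = TRUE))) == n
}

# Simulation estimate: the fraction of repeated samples that cover all objects.
set.seed(1)
est <- mean(replicate(20000, covers_all()))

# Exact answer by inclusion-exclusion:
# P = sum_{j=0}^{n} (-1)^j * choose(n, j) * ((n - j) / n)^k
exact_prob <- function(n = 10, k = 20) {
  j <- 0:n
  sum((-1)^j * choose(n, j) * ((n - j) / n)^k)
}

c(simulated = est, exact = exact_prob())
```

With n = 10 and k = 20 the exact probability is about 0.215, and the simulated estimate should land close to it.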

## Efficient loops in R — the complexity versus speed trade-off

June 18, 2011
I've written before about the up- and downsides of the plyr package -- I love its simplicity, but it can't be mindlessly applied, no pun intended. This week, I started building an agent-based model for a large population, and I figured I'd use something like a binomial per-timestep birth-death process for between-agent connections. My ballpark...

## Two textbooks on probability using R

June 18, 2011
This fall, I’ll be teaching a second-year course on Probability with Computer Applications, which is required for Computer Science majors.  I’ve taught this before, but that was five years ago, so I’ve been looking to see what new textbooks would be suitable.  The course aims not just to use computer science applications as examples, but

## Performance ratios, bootstrapping and infinite variances

June 18, 2011
If returns had infinite variance, would there be a problem bootstrapping information ratios? Background: there is a discussion on the Quant Finance group of LinkedIn with the title: “How do you measure the confidence intervals of performance ratios?” One suggestion was to use the statistical bootstrap. This resulted in a discussion of the efficacy of …

## Speeding Up MLE Code in R

June 18, 2011
Recently, I’ve been fitting some models from the behavioral economics literature to choice data. Most of these models amount to non-linear variants of logistic regression in which I want to infer the parameters of a utility function. Because several of these models aren’t widely used, I’ve had to write my own maximum likelihood code to

## Binary Installation Now Available

The biggest complaint we had during the installation process was that Xcode (account required) and Rtools were required for MacOS X and Windows. Today we released universal binaries (PPC/i386/x86_64) for MacOS 10.5+ as well as binaries (i386/x86_64) for Windows. This addition wil

## R progress indicators

June 18, 2011
Complicated calculations usually take a lot of time. So how can you track progress and estimate how much time the program still needs to finish?
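
As one possible answer, here is a minimal sketch using `txtProgressBar()` from base R's utils package. It is only one option among several and not necessarily the approach the post takes; the loop body is a stand-in for a slow computation. The bar shows the share of iterations completed, not a time estimate.

```r
n <- 50
total <- 0

# style = 3 draws a bar with a percentage next to it
pb <- txtProgressBar(min = 0, max = n, style = 3)

for (i in seq_len(n)) {
  total <- total + sqrt(i)    # stand-in for an expensive step
  setTxtProgressBar(pb, i)    # move the bar to iteration i
}

close(pb)                     # finish the bar and end the output line
```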

## Tracking execution paths

June 18, 2011
Earlier this week, I was trying to figure out the path of execution through a big chunk of code. Once you reach a certain size of codebase, tracking which function gets called when can be tricky. My first thought for dealing with this was to add a message line at the start of each function
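
The "message line at the start of each function" idea could be sketched like this. `note_entry`, `call_log` and the three toy functions are hypothetical names chosen for illustration; none of them come from the post.

```r
# Record of function names in the order they were entered
call_log <- character(0)

# Hypothetical helper: call it as the first line of a function body.
# sys.call(-1) is the call that invoked the function we were called from.
note_entry <- function() {
  caller <- deparse(sys.call(-1)[[1]])
  call_log <<- c(call_log, caller)
  message("entering ", caller)
}

# Toy functions standing in for a larger codebase
load_data <- function() {
  note_entry()
  1:10
}

summarise_data <- function(x) {
  note_entry()
  mean(x)
}

run_analysis <- function() {
  note_entry()
  x <- load_data()
  summarise_data(x)
}

result <- run_analysis()
call_log
```

Keeping the names in `call_log` as well as printing them makes it possible to inspect the execution path after the run, instead of scrolling back through console output.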

## A Brief Introduction to Mixture Distributions

Last time, I discussed some of the advantages and disadvantages of robust estimators like the median and the MADM scale estimator, noting that certain types of datasets – like the rainfall dataset discussed last time – can cause these estimators to fail spectacularly.  An extremely useful idea in working with datasets like this one is that of mixture distributions,...
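
To make the idea concrete, here is a small sketch of drawing from a two-component Gaussian mixture in base R. The mixing weight and the component means and standard deviations are arbitrary choices for illustration, not values from the post.

```r
set.seed(123)
n <- 10000
p <- 0.3   # assumed probability of drawing from the second component

# Step 1: pick a component for each observation
component <- rbinom(n, size = 1, prob = p)

# Step 2: draw from the chosen component's distribution
x <- ifelse(component == 1,
            rnorm(n, mean = 5, sd = 1),   # second component
            rnorm(n, mean = 0, sd = 1))   # first component

# The mixture mean is the weighted average of the component means
mixture_mean <- p * 5 + (1 - p) * 0
c(sample_mean = mean(x), mixture_mean = mixture_mean)
```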

## Exploring the Market with Hurst

June 17, 2011
Randomly trudging through PerformanceAnalytics source code, I was intrigued by the Hurst Index calculation, which I discovered is more commonly called Hurst Exponent.  After quickly satisfying myself that I could actually do the rolling Hurst calculat...

## Raster, CMSAF and solaR

The Satellite Application Facility on Climate Monitoring (CMSAF) generates, archives and distributes widely recognised high-quality satellite-derived products and services relevant for climate monitoring in operational mode. The data is freely accessible here after a registration process. I have asked them for several files with monthly averages of global solar radiation over the Iberian Peninsula (download).

## Big-Data PCA: 50 years of stock data

June 17, 2011
In this post, Revolution engineer Sherry LaMonica shows us how to use the RevoScaleR big-data package in Revolution R Enterprise to do principal components analysis on 50 years of stock market data -- ed. Principal components analysis, or PCA, seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as...
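
For a feel of what PCA produces, here is a small sketch using base R's `prcomp()` on simulated data. It does not use RevoScaleR or real stock data; the three made-up series `a`, `b` and `c` are assumptions for illustration, with `a` and `b` sharing a common factor.

```r
set.seed(2011)
n <- 500
common <- rnorm(n)   # shared factor driving series a and b

returns <- cbind(
  a = common + rnorm(n, sd = 0.3),
  b = common + rnorm(n, sd = 0.3),
  c = rnorm(n)       # unrelated series
)

pca <- prcomp(returns, center = TRUE, scale. = TRUE)

# Share of total variance accounted for by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
var_explained

# The axes (columns of the rotation matrix) are orthonormal,
# so this is the identity matrix up to rounding
round(crossprod(pca$rotation), 10)
```

Because `a` and `b` move together, the first component should pick up most of their shared variation, and the components come out ordered by decreasing variance.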

## solaR 0.24 at CRAN

Version 0.24 of solaR is on CRAN and R-Forge. Version 0.23 was uploaded a few days earlier, but I had to make a quick fix to readMAPA: the URL http://www.mapa.es/siar has been changed to http://www.marm.es/siar. Moreover, this function has been renamed to readSIAR, although it is still available as readMAPA. Consequently the mode

## Engineering Data Analysis (with R and ggplot2) – a Google Tech Talk given by Hadley Wickham

June 17, 2011
It appears that just days ago, Google Tech Talk released a new one-hour video of a presentation (from June 6, 2011) given by one of the R community's more influential contributors, Hadley Wickham. This seems to be one of the better talks to send a programmer friend who is interested in getting into R.

## serialize or turn a large parallel R job into smaller chunks for use with SGE

June 16, 2011
I use the snow package in R with OpenMPI and SGE quite often for my simulation studies; I’ve outlined how this can be done in the past. The ease of these methods makes it so simple for me to just specify the maximum number of cores available all the time. However, unless you own your...

## REITs for Everybody Now REITs for Nobody Part 2

June 16, 2011
As a quick follow-up to my first REITs for Everybody Might Now Mean REITs for Nobody, I want to look at REITs and High Yield bonds, which also might simultaneously attract conservative yield buyers and speculative beta chasers. HYG (iShares High Yield) ...

## Where Ichiro Hits

June 16, 2011
Google research scientist Peter Hauck used Weka and k-means cluster analysis to describe where Mariners right-fielder Ichiro favours hitting the baseball. He then used R to visualize the 6 clusters the k-means analysis identified: I sometimes find k-means clustering tough to explain as a statistical technique, but this makes for a great example: if you're a fielder facing Ichiro,...

## 5000 R questions on stackoverflow.com

June 16, 2011
The R tag on stackoverflow.com hit a milestone yesterday: 5000 questions about the R language. (The 5000th question was about the fortunes package, incidentally -- thanks to Andrie de Vries for pointing this out on Twitter.) Stackoverflow.com continue...

## Fixed: Unable to plot a decent x-Axis in a time series plot using zoo

June 16, 2011
Here is the link to the original problem. Briefly, I was unable to plot a custom x-axis showing abbreviated months in a time series plot of a zoo object. In the plot.zoo() function, set xaxt = "n" to suppress plotting …

June 16, 2011
As everyone knows, it seems that Sony is taking a bit of a battering from hackers.  Thanks to Sony, numerous account and password details are now circulating on the internet. Recently, Troy Hunt carried out a brief analysis of the password structure. Here is a summary of his post: There were around 40,000 passwords, of which

## Market arrows

June 16, 2011
Graphs like Figure 1 are reasonably common. But they are not reasonable.

Figure 1: A (log) price series with an explicit guide line.

Some have the prices on a logarithmic scale, which is an improvement on the raw prices. The problem with this sort of plot is that two particular data points are taken as …

## How to plot points, regression line and residuals

June 16, 2011
    x <- ...
    y <- ...
    # plot scatterplot and the regression line
    mod1 <- ...
    plot(x, y, xlim=c(min(x)-5, max(x)+5), ylim=c(min(y)-10, max(y)+10))
    abline(mod1, lwd=2)
    # calculate residuals and predicted values
    res <- ...
    pre <- ...
    # plot distances between points and the regression line
    segments(x, y, x, pre, col="red")
    # add labels (res values) to points
    library(calibrate)
    textxy(x, y, res, cx=0.7)
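
Since the data and model definitions did not survive in this excerpt, here is a self-contained sketch of the same kind of plot in base R. The simulated `x` and `y`, the use of `lm(y ~ x)`, and `text()` in place of `textxy()` from the calibrate package are all assumptions made for illustration, not the post's own code.

```r
# Simulated data (assumption: the post used its own x and y values)
set.seed(42)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 4)

# Fit a simple linear regression and plot points plus fitted line
mod1 <- lm(y ~ x)
plot(x, y, xlim = c(min(x) - 5, max(x) + 5), ylim = c(min(y) - 10, max(y) + 10))
abline(mod1, lwd = 2)

# Residuals and fitted (predicted) values
res <- residuals(mod1)
pre <- predict(mod1)

# Vertical red segments from each point to the regression line
segments(x, y, x, pre, col = "red")

# Label each point with its rounded residual, to the right of the point
text(x, y, labels = round(res, 1), pos = 4, cex = 0.7)
```

Each red segment has length equal to the absolute residual, so the labels and the segment lengths tell the same story.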

## RTextTools now 100% Java-free!

When we first wrote RTextTools, we opted to use RWeka for boosting and bagging algorithms for lack of a better alternative. We've discovered that this leads to all sorts of ugly rJava installation issues across platforms and prevents our users from getting started quickly. Recently, we've stumbled upon two excellent non-Java alternatives: LogitBoost in the

## Further Bernoulli factories

June 15, 2011
Yesterday, Andrew Thomas and José Blanchet posted a note on the Bernoulli factory on arXiv. This short paper links with the recent paper of Flegal and Herbei I commented on earlier. Considering the special target Thomas and Blanchet develop an elaborate scheme of cascading envelopes that converge to f from above. Their paper is very clear

## The Big Analytics Revolution starts with R

June 15, 2011
Thanks to everyone who attended our webinar The 'Big Analytics' Revolution Starts with R yesterday. If you missed the live session, you can download the presentation slides (PDF) and the 30-minute replay video (WMV) from the Revolution Analytics website. The presentation focuses on the issue of Big Data, and how businesses can use advanced analytics methods implemented in the...

## R: Analysis of a Sport Event

June 15, 2011
Inspired by a post by a

## Statistical Analysis of the LAC Degerloch Volkslauf 2010

June 15, 2011
Inspired by a post by an R-blogger, my interest was piqued to examine the runs in my athletic club. Therefore, I started R and analysed the LAC Degerloch Volkslauf 2010, a 10km race near Stuttgart-Hoffeld. In the following lines, I present this statistical examination. The data can be found at: data. Firstly, I converted the data