Welcome to FOSS Trading
Meet the authors:
Joshua Ulrich is currently the author and maintainer of four R packages:
- TTR - Technical Trading Rules - a suite of technical analysis functions
- opentick - an R API to the opentick databases
- xts - eXtensible Time Series - a time-based data class that integrates all of the current time-series classes (co-authored with Jeff Ryan)
- pack - convert binary to and from formats other programs and machines can understand
Jeff Ryan is the author and maintainer of the following R packages:
- quantmod - specify, build, trade, and analyse quantitative financial trading strategies
- IBrokers - provides native R access to Interactive Brokers Trader Workstation API
- xts - eXtensible Time Series - a time-based data class that integrates all of the current time-series classes (co-authored with Joshua Ulrich)
- Defaults - create global function defaults
- R allows for rapid prototyping, since it is an interpreted scripting language.
- The list of finance-oriented R packages is large and growing.
- The R-Finance community continues to grow rapidly.
- A multitude of statistical routines are available in contributed packages.
Group-level variances and correlations
Design Flaws in R #3 — Zero Subscripts
Unlike the two design flaws I posted about before (here, here, and also here), where one could at least see a reason for the design decision, even if it was unwise, this design flaw is just incomprehensible. For no reason at all that I can see, R allows one to use zero as a subscript without triggering an error. (Remember that in R, indexes for vectors and matrices start at one, not zero.)
This is of course a terrible decision, because it makes debugging harder, and makes it more likely that bugs will exist that have never been noticed.
So what does R do with a zero subscript, seeing as it’s meaningless? It just ignores it, which is possible because it views all numeric subscripts as vectors, that extract or replace a set of elements, not necessarily just one. So R simply removes all zeros from a vector used as a subscript, producing a shorter vector.
Here’s what happens (with the current version of R, 2.7.2):
> a
[1] 10 20 30 40 50
> a[0]
numeric(0)
> a[c(4,2)]
[1] 40 20
> a[c(4,0,2,0)]
[1] 40 20
> a[0] <- 7
> a
[1] 10 20 30 40 50
> a[c(4,0,2,0)] <- 7
> a
[1] 10  7 30  7 50
Contrast this with what happens when you use a subscript that is too large:
> a
[1] 10 20 30 40 50
> a[7]
[1] NA
> a[c(4,7,2)]
[1] 40 NA 20
> a[7] <- 7
> a
[1] 10 20 30 40 50 NA  7
Extending vectors automatically when an assignment is made beyond the end can obviously be useful (though it might be wiser not to). Returning NA when extracting an element beyond the end is also a sensible action (though signalling an error immediately might be more useful for debugging). And negative subscripts are usefully defined as referring to their complement. But what possible use is there for ignoring zero subscripts rather than signalling an error?
It’s perhaps belabouring the obvious, but let me explain that signalling an error when a zero subscript is used is desirable because this is a very common sort of program bug. It can easily arise when a program is scanning backwards through the vector elements, and goes one step too far. It can also easily arise when data is initialized to zeros, with the intent to replace the zeros with something sensible later, but actually some zeros are never replaced. The way R behaves when zero is used as a subscript when replacing elements is particularly bad, since doing nothing at all can easily lead to an apparently working program that produces wrong answers. (The behaviour of returning an empty vector when zero is used as a subscript when extracting an element is more likely to produce an error later on, so that at least the problem will be evident.)
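The zero-initialisation scenario can be reproduced in a few lines. This is only an illustrative sketch; the variable names are made up for the example.

```r
# Positions of the records to flag, initialised to zero as placeholders.
positions <- integer(3)
positions[1] <- 4
positions[3] <- 2
# positions[2] was never filled in, so it is still 0.

flagged <- rep(FALSE, 5)
flagged[positions] <- TRUE   # the zero is silently dropped, no error or warning

sum(flagged)                 # 2 elements are flagged, although 3 positions were supplied
```

The program runs to completion and the result looks plausible, which is exactly what makes this kind of bug hard to spot.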
So what should be done? That’s easy — change R so that use of zero as a subscript produces an immediate error. That’s trivial to do (mixing positive and negative subscripts produces an immediate error now, so the apparatus for it must be there). Might that break some existing programs? Yes, it will. But 99.9% of those programs are already broken. The users just don’t know it, thinking that the answers they get are correct when they’re not. The remaining 0.1% of these broken programs were written by really stupid programmers who thought that exploiting an obscure and unwise feature in order to produce a really hard-to-understand program was a good idea. It wasn’t.
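Until R itself behaves this way, one can get a similar effect in user code by validating subscripts before using them. The helper below is a hypothetical sketch (check_subscript is not part of R or of any package), assuming that only numeric subscripts need checking.

```r
# Hypothetical helper, not part of R: stop if a numeric subscript contains a zero.
check_subscript <- function(i) {
  if (is.numeric(i) && any(i == 0, na.rm = TRUE)) {
    stop("zero subscript at position(s): ",
         paste(which(i == 0), collapse = ", "))
  }
  invisible(i)
}

a <- c(10, 20, 30, 40, 50)

a[check_subscript(c(4, 2))]        # behaves like a[c(4, 2)]

# With zeros present, the error is raised before any indexing happens.
msg <- tryCatch(a[check_subscript(c(4, 0, 2, 0))],
                error = function(e) conditionMessage(e))
msg
```

Wrapping every subscript this way is clumsy, of course, which is why a check inside R would be preferable.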
Along with this, R should be changed so that using NA as a subscript when replacing elements in a vector also produces an error. What to do with NA subscripts used to extract elements is a little bit harder to decide, but it seems to me that something about the following is a bit funny:
> a
[1] 10 20 30 40 50
> a[NA]
[1] NA NA NA NA NA
> a[NA+0]
[1] NA
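The difference between the last two results appears to come from the type of the subscript rather than from its value. A bare NA is a logical constant, and logical subscripts are recycled to the length of the vector, whereas NA+0 is numeric and is treated as a single (missing) position. A short check, using the same vector a as in the transcript:

```r
a <- c(10, 20, 30, 40, 50)

typeof(NA)        # "logical"
typeof(NA + 0)    # "double"

length(a[NA])           # 5: the logical NA is recycled over all five elements
length(a[NA_integer_])  # 1: an integer NA is a single numeric subscript
```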
pmin and pmax
Did you know that there are multiple versions of the min and max functions? Make sure that you are using the right one. pmin and pmax are the 'parallel' versions of min and max: they take vector arguments and return a vector of element-wise minima or maxima, whereas min and max return a single value computed over all of their arguments. That is much better than setting up your own apply call.
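A small example of the difference, with made-up vectors x and y:

```r
x <- c(1, 5, 3)
y <- c(4, 2, 6)

min(x, y)           # 1: a single value over everything passed in
pmin(x, y)          # 1 2 3: element-wise minimum
pmax(x, y)          # 4 5 6: element-wise maximum

# A scalar is recycled, which gives a simple way to clamp values from below.
pmax(x, 2)          # 2 5 3

# The hand-rolled equivalent that pmin makes unnecessary.
mapply(min, x, y)   # 1 2 3
```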
rgraph6 on R-Forge
I have moved my rgraph6 R package to R-Forge. R-Forge is a website that facilitates development of R packages by providing services for version control (through Subversion), automatic checking and building of the packages including binaries for Windows and MacOS, as well as for collaboration with other R users/developers.
The rgraph6 package has already been available through my private mini-repository. It provides an interface to a pretty compact format for storing undirected graphs as sequences of printable ASCII characters, which is quite useful for handling large libraries of undirected graphs. The format itself is due to Brendan McKay. A detailed description of it is available here and is also included within the rgraph6 package.
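To give a feel for the format, here is a minimal sketch of the encoding in plain R for graphs with 2 to 62 vertices. It is an independent illustration based on the published format description, not the code used in rgraph6, and the function name to_graph6 is made up for this example.

```r
# Hypothetical sketch of graph6 encoding for small graphs (2 to 62 vertices).
# adj is a symmetric adjacency matrix with non-zero entries marking edges.
to_graph6 <- function(adj) {
  n <- nrow(adj)
  stopifnot(n == ncol(adj), n >= 2, n <= 62)

  # Upper triangle in column order: (1,2), (1,3), (2,3), (1,4), ...
  bits <- as.integer(adj[upper.tri(adj)] != 0)

  # Pad with zeros to a multiple of 6 bits.
  pad <- (6 - length(bits) %% 6) %% 6
  bits <- c(bits, rep(0L, pad))

  # Each group of 6 bits is read as a big-endian number and shifted by 63,
  # which lands it in the printable ASCII range.
  groups <- matrix(bits, nrow = 6)
  values <- as.vector(c(32, 16, 8, 4, 2, 1) %*% groups)

  # The first character encodes the number of vertices, also shifted by 63.
  rawToChar(as.raw(c(n + 63, values + 63)))
}

# Example graph on 5 vertices with edges 1-3, 1-5, 2-4 and 4-5.
adj <- matrix(0, 5, 5)
edges <- rbind(c(1, 3), c(1, 5), c(2, 4), c(4, 5))
adj[edges] <- 1
adj[edges[, 2:1]] <- 1

to_graph6(adj)   # "DQc"
```

Larger graphs need a longer prefix for the vertex count, which this sketch deliberately leaves out.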
Two crucial functions of the package are written in C. As my knowledge of C is rather limited, the code might be far from perfect. If you know C well, you are more than welcome to have a look at the sources and suggest improvements. I believe that one of the crucial things is checking the size of the character sequences that are converted to binary numbers and then to decimal; I’m not sure whether the code will work for arbitrary network sizes. I plan to put a public advertisement on R-Forge to look for people who would be willing to do that. Actually, that was one of the reasons I opened the package development to others through R-Forge. So don’t be shy and go ahead!
From now on I will not release new versions of rgraph6 through this website. All of them will be distributed through rgraph6's website on R-Forge. The older versions will remain available here, though. You can install the current version from R-Forge directly from R with:
install.packages("rgraph6", repos="http://R-Forge.R-project.org")
How do you measure a major league slugger?
I gave a talk last month at SAP Labs in Palo Alto, along with Jim Porzak of ResponSys, introducing the R Statistical Language to a Business Intelligence interest group. The goal was to highlight how open source tools, like R, can be used to build predictive models. The example I gave centered around baseball and a simple question: how do you measure a baseball slugger?
Michael Lewis, in Moneyball, described how the baseball analyst Bill James was frustrated by the fact that major league hitters were consistently rated by their batting averages. James wrote:
“a hitter should be measured by his success in that which he is trying to do, … create runs. It is startling, when you think about it, how much confusion there is about this.”
- Bill James, 1979 Baseball Abstract
However, since teams create runs, not batters, the only way to connect batting statistics with runs is to use team averages. The idea is that if we know which statistics predict runs at the team level, these statistics could be used to measure individual hitters.
I decided to test the value of three batting statistics myself — batting average, slugging percentage, and OPS (on-base plus slugging) — and see how well they predicted team runs, using MLB team data for the years 2000-2005 (available from baseball-databank.org). The results are shown in the three scatter plots below, and no surprise, Bill James is right: a team’s overall batting average (top-most chart) is a comparatively poor predictor of how many runs it will score in an average game. Slugging percentage (middle plot) is a slightly better predictor, and OPS (bottom plot) is the best of the three statistics I looked at: it has a 0.95 correlation with runs scored (the r shown in the upper right corner of the plots is the Pearson correlation coefficient, the red lines represent least-squares fits to the points).
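A comparison of this kind could be sketched in R roughly as follows. The function names, the column names and the toy numbers are all invented purely for illustration; they are not the MLB data or the code behind the plots.

```r
# Hypothetical helper: derive the three statistics from team season totals.
# Assumed columns: AB, H, X2B, X3B, HR, BB, HBP, SF, R (runs), G (games).
batting_stats <- function(teams) {
  total_bases <- with(teams, H + X2B + 2 * X3B + 3 * HR)
  avg <- with(teams, H / AB)
  slg <- total_bases / teams$AB
  obp <- with(teams, (H + BB + HBP) / (AB + BB + HBP + SF))
  data.frame(avg = avg, slg = slg, ops = obp + slg,
             runs_per_game = teams$R / teams$G)
}

# Pearson correlation of each statistic with runs per game.
run_correlations <- function(stats) {
  sapply(stats[c("avg", "slg", "ops")], cor, y = stats$runs_per_game)
}

# Invented toy values, only to show the calls; NOT real team data.
toy <- data.frame(
  AB  = c(5500, 5600, 5450, 5520),
  H   = c(1400, 1500, 1350, 1480),
  X2B = c(280, 300, 260, 310),
  X3B = c(30, 25, 20, 35),
  HR  = c(150, 200, 120, 180),
  BB  = c(500, 600, 450, 550),
  HBP = c(50, 60, 40, 55),
  SF  = c(45, 50, 40, 48),
  R   = c(720, 850, 650, 800),
  G   = c(162, 162, 162, 162)
)

stats <- batting_stats(toy)
run_correlations(stats)

# Least-squares line of the kind drawn in a scatter plot.
fit <- lm(runs_per_game ~ ops, data = stats)
coef(fit)
```

With real team-season rows in place of the toy data frame, the three correlations can be compared directly.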
