(April 2009 update: Unfortunately, the Money Tech conference was indefinitely postponed, but fortunately I will be presenting a version of this talk in July at OSCON 2009.)
I’ve been invited to speak at O’Reilly’s Money Tech conference this coming February 4-6 in New York City, and thought I’d share the abstract for my talk here. I’ll likely be in New York for several days; if you’d like to get together to chat about data, drop me a line!
My talk is entitled “Open Source Analytics: Visualization and Predictive Modeling of Big Data with the R Programming Language”
Just as the explosion of online data catalyzed the development of
storage technologies such as Hadoop, new challenges in data analytics
(turning terabytes into actionable insights) demand new tools. R,
an open-source language for statistical computing and graphics, is an
extensible, embeddable, and industrial-strength solution for analytics.
In this session, I showcase R’s power by building predictive models
for Brazilian soybean harvests and baseball slugger salaries.
The economics of data aggregation and analysis are being disrupted by
falling costs for storage and CPU power, the continuing shift of
business processes online, and the deluge of data that is being
generated as a consequence.
Satellite images, SEC filings, supply chain data (RFID data streams),
online prices, and newsgroup content represent just a few of the data
sources that hold potential for predictive modeling of markets.
Much of this data does not fit within existing paradigms for business
analysis: either its size overwhelms traditional desktop tools such as
Excel, or else its unique dimensions (such as geocodes) prevent it
from being pipelined into more powerful, but narrowly designed, analysis
tools. Finally, closed-source tools cannot keep pace with the leading
edge of innovation in statistical and machine-learning algorithms.
Enter the open-source programming language R. R has been dubbed the
lingua franca for statistical computing and graphical analysis, with a
pedigree tracing back several decades to the S language developed at
Bell Labs. Though its
million-plus users are concentrated within academia, R is gaining
currency within several high-profile quantitative analysis groups,
including Google’s Customer Insights team and Barclays Global
Investors. In addition, R’s extensibility via user-contributed
packages has spawned an active developer community.
In this session, I will focus on applying R’s powerful visualization
tools to guide the construction of predictive models, using the kind
of large, multidimensional data sets that increasingly confront
quantitative analysts. Along the way, I will highlight R’s packages
for inferential statistics, its compact modeling syntax, and its ease
of connectivity with persistent data stores.
The two specific examples I will discuss are:
- an analysis of NASA’s Landsat imagery of Brazil’s center-west
agricultural regions to detect correlates for soybean harvest yields,
and a derived predictor of the Brazilian soybean market based in part
on these correlates.
- a validation of Bill James’ sabermetrics approach to batting
performance using 30 years of Major League Baseball statistics, and a
derived predictor for batters’ salaries.
For all of its strengths, R has an admittedly steep learning curve.
While source code for the examples will be provided online, this talk
will emphasize techniques and working examples over technical details.
The goal of this session is to give quantitative analysts the courage
to invest in learning the R language, by showcasing R’s power,
highlighting its features, and providing examples of its use for