## Part 1 of 3: Building/Loading/Scoring Against Predictive Models in R

August 31, 2011

In this first installment, I'm going to focus on:

- Building/evaluating a predictive model with partitioned data
- Saving the predictive model to disk
- Loading the predictive model from disk
- Scoring data against a predictive model (within R)

This installment is ...

## Seriously … why don’t math classes use computers?…

August 31, 2011

Seriously … why don’t math classes use computers? Excel, simple Python scripts, Mathematica / Sage, everything beyond the TI-83. Kids could be creating totally sweet visuals instead of cribbing formulae. And thinking instead of copying. I can sa...

## Story of the Ljung-Box Blues: Progress Not Perfection

August 31, 2011

In the last post we determined that our ARIMA(2,2,2) model failed to pass the Ljung-Box test. In today's post we seek to completely discredit the last post's claim and finally arrive at some needed closure. The Ljung-Box is first performed on the s...
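For readers unfamiliar with the test, the Ljung-Box statistic over the first h lags of a residual series of length n is Q = n(n+2) Σ ρ̂_k² / (n − k), where ρ̂_k is the lag-k sample autocorrelation. The sketch below computes Q in plain Python. It is not the post's R code, the function name `ljung_box_q` is hypothetical, and it stops short of the p-value, which comes from comparing Q against a chi-squared distribution.

```python
def ljung_box_q(residuals, max_lag):
    """Return the Ljung-Box Q statistic for lags 1..max_lag (illustrative sketch)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    # Lag-0 sum of squares, used to normalise every autocorrelation.
    denom = sum(c * c for c in centered)
    total = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        acf_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        total += acf_k ** 2 / (n - k)
    return n * (n + 2) * total


# A strongly alternating series is heavily autocorrelated, so Q is large.
alternating = [1.0, -1.0] * 10
print(ljung_box_q(alternating, max_lag=1))
```

A large Q relative to the chi-squared reference suggests the residuals still carry autocorrelation that the fitted model has not captured.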

## rnpn: An R interface for the National Phenology Network

August 31, 2011

The team at rOpenSci and I have been working on a wrapper for the USA National Phenology Network API. The following is a demo of some of the current possibilities. We will have more functions down the road. Get the publicly available code, and contribu...

## XLConnect – A platform-independent interface to Excel

August 31, 2011

XLConnect is a comprehensive and platform-independent R package for manipulating Microsoft Excel files from within R. XLConnect differs from other related R packages in that it is completely cross-platform and as such runs under Windows, Unix/Linux and Mac (32- and 64-bit). Moreover, it …

## Posts of the year

August 30, 2011

Like last year, here are the most popular posts since last August:

- Home page: 92,982
- In{s}a(ne)!!: 6,803
- “simply start over and build something better”: 5,834
- Julien on R shortcomings: 2,373
- Parallel processing of independent Metropolis-Hastings algorithms: 1,455
- Do we need an integrated Bayesian/likelihood inference?: 1,361
- Coincidence in lotteries: 1,256
- #2 blog for the statistics geek?!: 863

## What language is R written in?

August 30, 2011

One of the nice things about R is that a lot of it is written in the R language. That means, as an R user, if you want to see how R calculates a certain statistic, or you want to modify an existing function for your own use, you can just look at the R code by typing the...

## The Visual Difference – R and Anscombe’s Quartet

August 30, 2011

I spent a chunk of today trying to get my thoughts in order for a keynote presentation at next week’s The Difference that Makes a Difference conference. The theme of my talk will be how visualisations can be used to discover structure and pattern in data, and as in many of my other recent

## Getting Started with Latent Dirichlet Allocation using RTextTools + topicmodels

RTextTools bundles a host of functions for performing supervised learning on your data, but what about other methods like latent Dirichlet allocation? With some help from the topicmodels package, we can get started with LDA in just five steps. Text in

## Nomograms everywhere!

August 30, 2011

At useR!, Jonty Rougier talked about nomograms, a once popular visualisation that has fallen by the wayside with the rise of computers. I’d seen a few before, but hadn’t understood how they worked or why you’d want to use them. Anyway, since that talk I’ve been digging around in biology books from the 60s and

## R combined gps-track plot of spatial intensity

August 30, 2011

To get a quick impression of how much time was spent at each place, it helps to plot the spatial density (intensity) of the trackpoints. As the 3D visualisation has both advantages and disadvantages, combining it with a 2D plot makes the data easier to interpret. The data used in this example is a GPS record

## Realized beta and beta equal 1

August 30, 2011

What does beta look like in the out-of-sample period for the portfolios generated to have beta equal to 1? In the comments Ian Priest wonders if the results in “The effect of beta equal 1” are due to a shift in beta from the estimation period to the out-of-sample period. (The current post will make …

## How Much of R is Written in R Part 2: Contributed Packages

August 29, 2011

So that mean old boss of mine is at it again.  This morning I came in beaming about how many people had read my post How Much of R is Written in R (thanks by the way!).  He then asks me about one little line in that post; the one about how if you looked

## Sharing live R functions with OpenCPU

August 29, 2011

OpenCPU is a new initiative from R user Jeroen Ooms to make innovations in statistics, visualization and data science more widely applicable. Based on open-source principles, it's a web-based service that lets you upload data visualizations and analyses as R scripts, and allows others to run them on demand. For example, you can upload a script to visualize a company's...

## another lottery coincidence

August 29, 2011

Once again, meaningless figures are published about a man who won the French lottery (Le Loto) for the second time. The reported probability of the event is indeed one chance in 363 (US) trillion (i.e., billions in the metric system, or 10^12)… This number is simply the square of which is the number of
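As a rough check on that arithmetic, the sketch below counts the possible grids and squares the result. It assumes the Loto format is five numbers out of 49 plus one "Chance" number out of 10. That format is an assumption, since the excerpt does not state the rules, and the variable names are illustrative.

```python
from math import comb

# ASSUMPTION (not stated in the excerpt): a Loto grid is 5 numbers out of 49
# plus 1 "Chance" number out of 10.
grids = comb(49, 5) * 10

# Chance that one given grid wins on two given, independent draws:
# one in grids squared.
two_wins = grids ** 2

print(f"possible grids: {grids:,}")
print(f"grids squared, in US trillions: {two_wins / 10**12:.1f}")
```

Under these assumed rules the squared count comes out at roughly 363 trillion, which matches the reported figure. That figure describes one specific player winning two specific draws, not the chance that somebody, somewhere, wins twice.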

## The effect of beta equal 1

August 29, 2011

Investment Performance Guy had a post about beta equal 1. It made me wonder about the properties of portfolios with beta equal 1. When I looked, I got a bigger answer than I expected. Data: I have some S&P 500 data lying about from the post ‘On “Stock correlation has been rising”’. So laziness dictated …

## Comparing Two Distributions

August 29, 2011

Here I compare two distributions: the flowering duration of indigenous and allochthonous plant species. The hypothesis is that alien plant species exhibit longer flowering periods than indigenous ones.

## R is a cool image editor #2: Dithering algorithms

August 29, 2011

Here I implemented some dithering algorithms in R:

- Floyd-Steinberg dithering
- Bill Atkinson dithering
- Jarvis-Judice-Ninke dithering
- Sierra 2-4a dithering
- Stucki dithering
- Burkes dithering
- Sierra2 dithering
- Sierra3 dithering

For each algorithm, I wrote a 2-dimensional convolution function (a matrix passing over a matrix); it is slow because I didn't implement any speed-up tricks. It can be easily implemented in C, then used...
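To illustrate the idea behind the first of these algorithms, here is a minimal plain-Python sketch of Floyd-Steinberg error diffusion. It is not the author's R code: it uses explicit loops rather than a convolution function, it assumes a grayscale image stored as rows of floats in [0, 1], and the function name `floyd_steinberg` is hypothetical.

```python
def floyd_steinberg(image):
    """Dither a grayscale image (rows of floats in [0, 1]) down to 0/1 pixels."""
    height, width = len(image), len(image[0])
    work = [list(row) for row in image]  # copy so the input is left untouched
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 1.0 if old >= 0.5 else 0.0  # threshold to black or white
            work[y][x] = new
            err = old - new
            # Push the quantisation error onto pixels not yet visited, using
            # the Floyd-Steinberg weights 7/16, 3/16, 5/16 and 1/16.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return [[int(v) for v in row] for row in work]


# A flat mid-gray patch turns into a mix of black and white pixels.
gray = [[0.5] * 8 for _ in range(8)]
for row in floyd_steinberg(gray):
    print("".join("#" if v else "." for v in row))
```

The other algorithms in the list follow the same pattern and differ mainly in which neighbouring pixels receive the error and with what weights.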

## Slides of 10+ talks at R Users Groups

August 29, 2011

Links to slides of 10+ talks at R Users Groups in Australia are provided below. The slides can be downloaded from the links, along with R code where available. MelbURN: Melbourne Users of R Network: Experiences with using R in …

## Real-time Scoring/Optimization of Predictive Models in R

August 28, 2011

I'm working on a 3-part post on how to build, score and perform optimization with predictive models in R. Having done this type of work in IBM SPSS for a number of years, I wanted to replicate it in R. It's amazing how little is published on how to s...

## Ra vs. compiler package

August 28, 2011

R seems to have two byte code compilers: the Ra add-on module (and the accompanying "jit" package) and the "compiler" package that comes with the default installation. I wonder how they differ from each other and what the strengths and weaknesses...

## HPC for biological research

August 28, 2011

In early May I had the opportunity to attend a workshop on using high performance computing in R hosted at Nimbios. I’ve been meaning to write a summary of the meeting ever since but got sidetracked by various other projects. Since a collaborator recently asked for meeting notes I finally took the time to write

## Real-time data collection and analysis in class

August 28, 2011

As September draws nearer, my mind inevitably turns away from my lofty (and largely unmet) summer research goals, and toward teaching.  This semester I will be trying out a teaching technique using live data collection and analysis as a tool to encourage student engagement.  The idea is based on the electronic polling technology known as

## Support Vector Machine with GPU

August 27, 2011

Most elementary statistical inference algorithms assume that the data can be modeled by a set of linear parameters with a normally distributed noise component. A new class of algorithms called support vector machines (SVMs) removes this constraint. rea...

## Some Additional Thoughts on Useless Averages

In my last post, I described three situations where the average of a sequence of numbers is not representative enough to be useful: in the presence of severe outliers, in the face of multimodal data distributions, and in the face of infinite-variance distributions. The post generated three interesting comments that I want to respond to here. First and foremost, I...

## Forecasting In R: The Greatest Shortcut That Failed The Ljung-Box

August 27, 2011

Okay, so you want to forecast in R, but don't want to manually find the best model and go through the drudgery of plotting and so on. I have recently found the perfect function for you. It's called auto.arima and it automatically fits the bes...

## SIGKDD 2011 Conference — Days 2/3/4 Summary

August 27, 2011

<< My review of Day 1. I am summarizing all of the days together since each talk was short, and I was too exhausted to write a post after each day. Due to the broken-up schedule of the KDD sessions, I group everything together instead of switching back and forth among a dozen different topics. By far the most enjoyable...