## In case you missed it: January 2014 roundup

February 5, 2014
In case you missed them, here are some articles from January of particular interest to R users: Princeton’s Germán Rodríguez has published a useful “Introduction to R” guide, with a focus on linear and logistic regression. The rxDForest function in the RevoScaleR package fits random forests of histogram-binning trees. A tutorial on using the xts package to analyze and...

## An Inconvenient Statistic

February 4, 2014
As I sit here waiting on more frigid temperatures subsequent to another 10 inches of snow, suffering from metastatic cabin fever, I can't help but ponder what I can do to examine global warming/climate change. Well, as luck would have it, R has the tools to explore this controversy. Using two packages, vars and forecast, I will see if I...

## Bad Bayes: an example of why you need hold-out testing

February 1, 2014
We demonstrate a dataset that causes many good machine learning algorithms to horribly overfit. The example is designed to imitate a common situation found in predictive analytic natural language processing. In this type of application you are often building a model using many rare text features. The rare text features are often nearly unique k-grams

## Inference for ARMA(p,q) Time Series

January 30, 2014
$ARMA(1,1)$

As we mentioned in our previous post, as soon as we have a moving average part, inference becomes more complicated. Again, to illustrate, we do not need too general a model. Consider, here, some $ARMA(1,1)$ process, where  is some white noise, and assume further that . > theta=.7 > phi=.5 > n=1000 > Z=rep(0,n) > set.seed(1) > e=rnorm(n) > for(t...
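The excerpt's simulation loop is cut off, so the following is only a sketch of how such an experiment might look, not the post's own code. It assumes the usual parameterisation $Z_t = \varphi Z_{t-1} + \varepsilon_t + \theta \varepsilon_{t-1}$ (an assumption, since the excerpt does not show the equation) and fits the model with `arima()` from base R's stats package.

```r
# Sketch only: the recursion below is an assumed ARMA(1,1) form,
# Z[t] = phi * Z[t-1] + e[t] + theta * e[t-1]; the excerpt's loop is truncated.
set.seed(1)
theta <- 0.7
phi <- 0.5
n <- 1000
e <- rnorm(n)
Z <- rep(0, n)
for (t in 2:n) Z[t] <- phi * Z[t - 1] + e[t] + theta * e[t - 1]

# Fit a zero-mean ARMA(1,1) by maximum likelihood.
fit <- arima(Z, order = c(1, 0, 1), include.mean = FALSE)
print(coef(fit))   # named vector with entries "ar1" and "ma1"
print(fit$sigma2)  # estimated innovation variance
```

With 1000 observations the estimates should land close to the values used in the simulation, though the exact figures depend on the seed.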

## A First Look at rxDForest()

January 30, 2014
by Joseph Rickert. Last July, I blogged about rxDTree(), the RevoScaleR function for building classification and regression trees on very large data sets. As I explained then, this function is an implementation of the algorithm introduced by Ben-Haim and Yom-Tov in their 2010 paper that builds trees on histograms of data and not on the raw data itself. This...

## Comparing multiple (g)lm in one graph #rstats

January 29, 2014
It’s been a while since a user of my plotting-functions asked whether it would be possible to compare multiple (generalized) linear models in one graph (see comment). While it is already possible to compare multiple models as table output, I now managed to build a function that plots several (g)lm-objects in a single ggplot-graph. The

## Inference for AR(p) Time Series

January 28, 2014
$Y_t =\varphi_1 Y_{t-1}+\varphi_2 Y_{t-2}+\varepsilon_t$

Consider a (stationary) autoregressive process, say of order 2, for some white noise $(\varepsilon_t)$ with variance $\sigma^2$. Here is some code to generate such a process, > phi1=.25 > phi2=.7 > n=1000 > set.seed(1) > e=rnorm(n) > Z=rep(0,n) > for(t in 3:n) Z[t]=phi1*Z[t-1]+phi2*Z[t-2]+e[t] > Z=Z > n=length(Z) > plot(Z,type="l") Here, we have to estimate two sets of parameters: the autoregressive...
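As a rough illustration of the estimation step, the sketch below simulates the AR(2) process from the displayed equation and recovers the coefficients by regressing the series on its two lags. Least squares is one simple choice made here for illustration; the excerpt is truncated before it says which estimators the post uses, and the names `lagged`, `phi_hat` and `sigma2_hat` are hypothetical.

```r
# Simulate Y_t = phi1 * Y_{t-1} + phi2 * Y_{t-2} + e_t with the excerpt's parameter values.
set.seed(1)
phi1 <- 0.25
phi2 <- 0.7
n <- 1000
e <- rnorm(n)
Z <- rep(0, n)
for (t in 3:n) Z[t] <- phi1 * Z[t - 1] + phi2 * Z[t - 2] + e[t]

# Line up each observation with its first and second lag.
lagged <- data.frame(
  y  = Z[3:n],
  l1 = Z[2:(n - 1)],
  l2 = Z[1:(n - 2)]
)

# Least-squares fit without an intercept (the simulated process has mean zero).
fit <- lm(y ~ 0 + l1 + l2, data = lagged)
phi_hat <- unname(coef(fit))

# Residual variance as an estimate of the white-noise variance.
sigma2_hat <- sum(residuals(fit)^2) / df.residual(fit)

print(phi_hat)
print(sigma2_hat)
```

The two groups of unknowns are visible in the output: the autoregressive coefficients in `phi_hat` and the noise variance in `sigma2_hat`.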

## How to convert odds ratios to relative risks

January 27, 2014
My short paper on this came out on Friday in the British Medical Journal. The aim is to help both authors and readers of research make sense of this rather confusing but unavoidable statistic, the odds ratio (OR). The fundamental …
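For orientation, a widely used conversion takes the odds ratio together with the baseline risk $p_0$ (the risk in the unexposed group) and returns $RR = OR / (1 - p_0 + p_0 \cdot OR)$. The helper below is a minimal sketch of that formula; the name `or_to_rr` is hypothetical, and the excerpt does not show enough to confirm that this matches the paper's presentation exactly.

```r
# Convert an odds ratio to a relative risk, given the baseline risk p0.
# Sketch of the standard formula RR = OR / (1 - p0 + p0 * OR).
or_to_rr <- function(or, p0) {
  stopifnot(all(or > 0), all(p0 >= 0), all(p0 <= 1))
  or / (1 - p0 + p0 * or)
}

# The same odds ratio implies different relative risks at different baseline risks.
print(or_to_rr(2, c(0.01, 0.1, 0.5)))
```

When the baseline risk is small the two measures nearly coincide, and they drift apart as the outcome becomes more common.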

## New in forecast 5.0

January 26, 2014
Last week, version 5.0 of the forecast package for R was released. There are a few new functions and changes made to the package, which is why I increased the version number to 5.0. Thanks to Earo Wang for helping with this new version. Handling missing values and outliers: Data cleaning is often the first step that data scientists...

## Tuning optim with parscale

January 26, 2014
I am often asked what the parscale parameter of the optim procedure in GNU R is for. Therefore I have decided to write a simple example showing its usage and importance. The function I test is a simplified version of an estimation problem I had to sol...
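The post's own test function is cut off in this excerpt, so the sketch below uses a made-up objective (the hypothetical `fn`) whose two parameters live on very different scales. It shows how `control = list(parscale = ...)` tells `optim` the typical magnitude of each parameter, so that the optimisation effectively runs on `par / parscale`.

```r
# Hypothetical objective: the minimum is at p = c(1, 2e6), so the second
# parameter is about a million times larger than the first.
fn <- function(p) (p[1] - 1)^2 + (p[2] / 1e6 - 2)^2

start <- c(0, 0)

# Default scaling: optim treats both parameters as if they were of order 1.
unscaled <- optim(start, fn, method = "BFGS")

# With parscale, optim works with par / parscale, which is well conditioned here.
scaled <- optim(start, fn, method = "BFGS",
                control = list(parscale = c(1, 1e6)))

print(unscaled$par)
print(scaled$par)
```

Comparing the two printed parameter vectors shows how much the result can depend on scaling; choosing `parscale` entries close to the expected size of each parameter is a reasonable starting point.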