A detailed guide to memory usage in R

November 12, 2013
R is designed as an in-memory application: all of the data you work with must be hosted in the RAM of the machine you're running R on. This optimizes performance and flexibility, but does place constraints on the size of data you're working with (since it must all fit in RAM). When working with large data sets in R,...
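
As a rough illustration of the in-memory point, the sketch below uses base R's `object.size()` to see how much RAM a vector occupies. The vector lengths are arbitrary choices for the example and are not taken from the post.

```r
# A double-precision vector needs 8 bytes per element plus a small header;
# an integer vector needs 4 bytes per element.
x_double <- numeric(1e6)
x_int <- integer(1e6)

size_double <- as.numeric(object.size(x_double))  # size in bytes
size_int <- as.numeric(object.size(x_int))

# Human-readable sizes
print(format(object.size(x_double), units = "Mb"))
print(format(object.size(x_int), units = "Mb"))
```

Multiplying the per-element cost by the number of rows and columns gives a quick lower bound on whether a data set will fit in RAM before it is loaded.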

RStudio OS X Mavericks Issues Resolved

November 12, 2013
When OS X Mavericks was released last month we were very disappointed to discover a compatibility issue between Qt (our cross-platform user interface toolkit) and OS X Mavericks that resulted in extremely poor graphics performance. We now have an updated preview version of RStudio for OS X (v0.98.475) that not only overcomes these issues, but

Workshop and Talk Slides from NEAIR Conference

November 12, 2013
I am about to head home from my fifth time attending the North East Association for Institutional Research (NEAIR), this year in Newport, RI, which was just fantastic. Really great people, interesting talks, and good food. I again taught an Introduction to R and LaTeX for Institutional Research pre-conference workshop and also gave a talk on Propensity Score...

googleVis 0.4.7 with RStudio integration on CRAN

November 12, 2013
In my previous post, I presented a preview version of googleVis that provided an integration with RStudio's Viewer pane (introduced with version 0.98.441). Over 80% in my little survey favoured the new default output mechanism of googleVis within RStudi...

A Shiny App for Playing with OLS

November 11, 2013
Ordinary least squares continues to be the staple estimator for causal inference for good reason. In order to help new and veteran OLS users get a better sense of how it works, I have created a Shiny app that allows for instant interactivity ...
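
For readers who want to see what OLS computes, here is a minimal sketch that solves the normal equations by hand and compares the result with `lm()`. The simulated data and the true coefficients (intercept 1, slope 2) are assumptions made for the example and have nothing to do with the app itself.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)  # hypothetical true intercept 1 and slope 2

# Design matrix with an intercept column
X <- cbind(1, x)

# OLS estimate: solve (X'X) b = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The same estimate from R's built-in linear model fitter
fit <- lm(y ~ x)
print(cbind(by_hand = as.numeric(beta_hat), lm = as.numeric(coef(fit))))
```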

In case you missed it: October 2013 Roundup

November 11, 2013
In case you missed them, here are some articles from October of particular interest to R users: Joe Rickert recounts the R presence at the Strata + Hadoop World conference, including slides from the R and Hadoop tutorial. Hadley Wickham's favorite tools, gadgets and software (including of course R). Revolution R Enterprise 7 is announced, with updated R engine...

Running Back-tests in parallel

November 11, 2013
Once you start experimenting with many different asset allocation algorithms, the computation time of running the back-tests can be substantial. One simple way to solve the computation time problem is to run the back-tests in parallel. I.e. if the asset allocation algorithm does not use the prior period holdings to make decisions about current allocation,
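
The general idea can be sketched with the base `parallel` package. The function `run_backtest()` below is a hypothetical stand-in that simulates returns, not the post's actual back-testing code, and the core count is an arbitrary choice. The sketch assumes each back-test is independent of the others, which is what makes it safe to run them in parallel.

```r
library(parallel)

# Hypothetical stand-in for one back-test: simulate 250 daily returns
# and report the cumulative return. Each run depends only on its seed.
run_backtest <- function(seed) {
  set.seed(seed)
  daily_returns <- rnorm(250, mean = 0.0005, sd = 0.01)
  prod(1 + daily_returns) - 1
}

seeds <- 1:8

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
results_parallel <- mclapply(seeds, run_backtest, mc.cores = n_cores)

# Serial version for comparison
results_serial <- lapply(seeds, run_backtest)
```

Because every run sets its own seed, the parallel and serial results should agree, which is a handy sanity check before scaling up.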

Merge Relational Dataframes

November 11, 2013
A Question Recently, a student, working on her Senior Thesis at Northland College, asked me the following question: Attached is an Excel file with three “important to R” worksheets. The only thing that connects all 3 worksheets is the Lift.ID …
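
A minimal sketch of joining three tables on a shared key with base R's `merge()` follows. Only the `Lift.ID` column name comes from the excerpt; the data frames, their other columns, and their values are invented for illustration.

```r
# Three hypothetical worksheets that share only the Lift.ID key
lifts <- data.frame(Lift.ID = c(1L, 2L, 3L),
                    Location = c("North", "South", "East"))
catch <- data.frame(Lift.ID = c(1L, 1L, 2L, 3L),
                    Species = c("Perch", "Walleye", "Perch", "Pike"),
                    Count = c(4L, 2L, 7L, 1L))
effort <- data.frame(Lift.ID = c(1L, 2L, 3L),
                     Hours = c(12, 8, 10))

# Join two tables at a time on the shared key
combined <- merge(merge(lifts, catch, by = "Lift.ID"), effort, by = "Lift.ID")
print(combined)
```

Each row of `catch` picks up the matching lift-level columns, so the result has one row per catch record.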

Imperialstan

November 11, 2013
Despite the map here, I'm not going to talk about yet another fraction of the former Soviet Empire which has taken the form of a people's republic, possibly with witty British Ambassadors. In fact, I'm going to talk about the Stan workshop that I have be...

A slightly different introduction to R, part V: plotting and simulating linear models

November 11, 2013
In the last episode (which was quite some time ago) we looked into comparisons of means with linear models. This time, let’s visualise some linear models with ggplot2, and practice another useful R skill, namely how to simulate data from known models. While doing this, we’ll learn some more about the layered structure of a
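
Simulating from a known model and then fitting it can be sketched in a few lines of base R. The intercept, slope, noise level, and sample size below are arbitrary assumptions for the example, not values from the post.

```r
set.seed(42)
n <- 200

# Simulate from a known model: y = 1 + 2 * x + noise
x <- runif(n, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(n, sd = 1)
sim <- data.frame(x = x, y = y)

# Fit a linear model and check that it recovers the known coefficients
fit <- lm(y ~ x, data = sim)
print(coef(fit))
```

The `sim` data frame could then be handed to ggplot2, for instance with `geom_point()` and `geom_smooth(method = "lm")`, to draw the fitted line over the simulated points.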

Visualising Structure in Topic Models

November 11, 2013
How exactly should we visualise topic models to get an overview of how topics relate to each other? This post is a brief lit review of that debate - I realise the subject matter is sooo last year. I also present my chosen solution to the dilemma: I use dendrograms to position topics, and add a...

Quick tip on controlling log output in tests

November 11, 2013
When running tests for a package, it’s important that the console output is unadulterated since test results are printed with …

A statistical review of ‘Thinking, Fast and Slow’ by Daniel Kahneman

November 11, 2013
I failed to find Kahneman’s book in the economics section of the bookshop, so I had to ask where it was. “Oh, that’s in the psychology section.” It should have also been in the statistics section. He states that his collaboration with Amos Tversky started with the question: Are humans good intuitive statisticians? The wrong...

sjPlotting functions now as package available #rstats

November 11, 2013
This weekend I had some time to deal with package building in R. After some struggling, I have now managed to set up RStudio, Roxygen and MiKTeX properly so I can compile my collection of R scripts into a package that even passes the package check. Downloads (package and manual) as well as package description are available at

cMDS: visualising changing distances

November 11, 2013
Gina Gruenhage has just arxived a new paper describing an algorithm we call cMDS. Here’s what it’s for: if you do any kind of data analysis you often find yourself comparing datapoints using some kind of distance metric. All’s well if you have a unique reasonable distance metric you can use, but often what you

ManchesterR and LondonR user group meetings

November 11, 2013
Mango Solutions advise details of the forthcoming R user group meetings in Manchester and London. These free meetings are open to anyone using R or interested in using R.

ManchesterR
Date: Wednesday 13th November
Venue: The Cornerhouse, 70 Oxford Street, Manchester M1 5NH
Time: 7pm
For detailed information please see www.rmanchester.org

LondonR
Date: ...

Plot axes with customized labels

November 11, 2013
How to modify axis labels is a FAQ for (almost) all R users. This short post tries to give a simple but exhaustive reply to this question. First of all, data are generated. dat = data.frame( label = …

What’s in my Pocket? (Part II) – Analysis of Pocket App Article Tagging

November 10, 2013
Introduction You know what's still awesome? Pocket. As I noted in an earlier post (oh god, was that really more than a year ago?!) I started using the Pocket application, previously known as Read It Later, in July of 2011 and it has changed my readi...

Hurricanes in South Carolina

November 10, 2013
In a recent post, I discussed the occurrence of hurricanes in the North Atlantic basin. The data comes from the National Oceanic and Atmospheric Administration, an agency of the US federal government. The data spans a bit more than 150 years. In that post, I make the observation that the data supports a model wherein

A small comparison of bio-equivalence calculations.

November 10, 2013
Last week I looked at two-way cross-over studies and followed the example of Schütz (http://bebac.at/) in the analysis. Since the EU has its own opinions (Questions & Answers: Positions on specific questions addressed to the pharmacokinetics working party) and two example data sets, I was wondering how the various computations compared. Data There...

Towards the R package sheldus: Part 2: Losses from Natural Disasters in the US

November 9, 2013
In my earlier post I summarized the work on my upcoming R package on the SHELDUS database. This is a database on human and property losses from natural disasters in the United States. Although the data is free, downloading the data is tedious and ...

Maximum Likelihood versus Goodness of Fit

November 8, 2013
$\{X_1,\cdots,X_n\}$

Thursday, I got an interesting question from a colleague of mine (JP). I mean, the way I understood the question turned out to be a nice puzzle (but I have to confess I might have misunderstood). The question is the following: consider an i.i.d. sample of continuous variables. We would like to choose between two (parametric) families for...

Key Driver vs. Network Analysis in R

November 8, 2013
When marketing researchers speak of driver analysis, they are referring to an input-output model with overall satisfaction as the output and performance ratings of specific product and service components as the inputs. The causal model is straightforwa...

CRAN now has 5000 R packages

November 8, 2013
Prof. Ripley today announced on the r-devel mailing list that CRAN now has its 5000th R package: Package 'quint' brought the number of packages on CRAN (for all platforms: some are Windows-only or non-Windows only) to 5000 a few minutes ago: see http://cran.r-project.org/web/packages/index.html. That's quite a milestone! The number of CRAN packages has been increasing rapidly recently, as the...

Hurricanes and Reproducible Research

November 8, 2013
On vacation with my family this week and that means I have a few minutes now and again to read. One of the books I brought along is Christopher Gandrud’s excellent “Reproducible Research with R and RStudio”. Looking for some data as a test project, I latched onto Hurricane data. Folly Beach was hit pretty

Was 2013 a record year for strikeouts in World Series Baseball?

November 8, 2013
The new book Analyzing Baseball Data with R by Max Marchi and Jim Albert is now available, and the authors have also launched a companion blog to share some of the analyses from the book. For example, they used the Lahman package in R to look at the strikeout rate in World Series baseball games over the last century...

Generating functions

November 8, 2013
$F(x)=1-e^{-x}/3$

Today, I wanted to publish a post on generating functions, based on discussions I had with Jean-Francois while having our coffee after lunch a couple of times already. The other reason is that I am publishing this post just as my students have finished their Probability exam (and there were a few questions on generating functions). A short introduction (back on...

Translating between R and SQL: the basics

November 8, 2013
An introductory comparison of using the two languages. Background R was made especially for data analysis and graphics. SQL was made especially for databases. They are allies. The data structure in R that most closely matches a SQL table is a data frame. The terms rows and columns are used in both. A mashup There...
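
To make the table and data frame correspondence concrete, here is a small sketch of one SQL query and a base R equivalent. The `scores` table, its columns, and the threshold are invented for the example and do not come from the post.

```r
# Hypothetical table of test scores
scores <- data.frame(name = c("Ann", "Bob", "Cy", "Dee"),
                     score = c(91, 78, 85, 60),
                     stringsAsFactors = FALSE)

# SQL: SELECT name, score FROM scores WHERE score > 80 ORDER BY score DESC;
# WHERE maps to row selection and SELECT to column selection
selected <- scores[scores$score > 80, c("name", "score")]
# ORDER BY ... DESC maps to order() on the negated column
selected <- selected[order(-selected$score), ]
print(selected)
```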