Describing Data: Frequently Used Commands

May 13, 2011

Obtaining a coherent numerical summary of data is a common task, and it is common to want to port these summary statistics into a table of results. When I am in interactive mode with my data, I use the summary() command applied to my data frame. For ...

Because it’s Friday: French Press Heat Retention

May 13, 2011

While responding to this thread on Reddit I made a rough guess as to the heat retention of my French press when completely full of coffee. When I went to bed I realized there was no good reason why I …

Review of 2011 Data Scientist Summit

May 13, 2011

Some time over the past 6 weeks I randomly saw a tweet announcing the “Data Scientist Summit” and shortly below it I saw that it would be held in Las Vegas at the Venetian. Being a Data Scientist myself is reason enough to not pass up this opportunity, but Vegas definitely sweetens the deal! On Wednesday I woke up...

Le Monde puzzle [#14]

May 13, 2011

Last week's Le Monde puzzle (I have not received this week's issue yet!) was about deriving an optimal strategy in fewer than 25 steps for finding the 25 answers to a binary multiple choice test, when at each trial, only the number of correct answers is known. Hence, if the correct answers are y1,…,y25, and

Reflections on Data Science Summit 2011

May 13, 2011

The Data Science Summit held in Las Vegas this week was outstanding - kudos and thanks to EMC/Greenplum for organizing the event. The energy of 150+ data scientists coupled with a well-curated agenda of talks created a real sense of being at the cusp of a real revolution in the applications of data analysis. Here are just a few...

plyr’s idata.frame VS. data.frame

May 13, 2011

I had seen the function idata.frame in plyr before, but not really tested it. Here are a few comparisons of operations on normal data frames and immutable data frames. Immutable data frames don't work with the doBy package, but do work with aggregate i...

The confusing gamma parameter

May 13, 2011

Boris from Ottawa sent me this email about Introducing Monte Carlo Methods with R: As I went through the exercises and examples, I believe I found a typo in exercise 6.4 on page 176 that is not in the list of typos posted on your website. For simulation of Gamma(a,1) random variables with candidate distribution

Competition: $45,000 for identification of substances from electromagnetic signatures

May 13, 2011

A Canadian hi-tech company offers $45,000 for the best algorithm for identification of substances from electromagnetic signatures. FIND Technologies Inc. is a Canadian company that owns novel sensor technology for measuring electromagnetic signatures of materials. The sensor is a robust, inexpensive instrument that detects passive electromagnetic emission from all matter. It has biomedical, homeland security, engineering, geological, and other...

Speed tests for R — and a look at the compiler

May 13, 2011

I’ve gotten back to work on speeding up R, starting with improving my suite of speed tests. Among other new features, this suite allows one to easily try out the “byte-code” compiler that is now a standard part of the latest release of R, version 2.13.0. You can get the suite here. I’ve been running

Fitting Distribution X to Data From Distribution Y

May 12, 2011

I had someone ask me about fitting a beta distribution to data drawn from a gamma distribution and how well the distribution would fit. I’m not a “closed form” kinda guy. I’m more of a “numerical simulation” type of fellow. So I whipped up a little R code to illustrate the process then we changed

Makefiles and Sweave

May 12, 2011

A Makefile is a simple text file that controls compilation of a target file. The key benefit of using a Makefile is that it uses file time stamps to determine if a particular action is needed. In this post we discuss how to use a simple Makefile that compiles a tex file that contains a number

Kaggle Competition Walkthrough: Fitting a model

May 12, 2011

Now that we've got the data we need into R, it is very easy to fit a model using the caret package. Caret's workhorse function is called 'train,' and it allows you to fit a wide variety of models using the same syntax. Furthermore, many models have '...

The R-Files: Martin Morgan

May 12, 2011

"The R-Files" is an occasional series from Revolution Analytics, where we profile prominent members of the R Community.

Name: Martin Morgan
Profession: Senior Staff Scientist at Fred Hutchinson Cancer Research Center
Nationality: Canadian
Years Using R: 7
Known for: Director of the Bioconductor project

Martin Morgan is a Senior Staff Scientist at the Fred Hutchinson Cancer Research Center (FHCRC)...

Learning R — Installing Packages

May 12, 2011

One of the reasons to use R for analysis and visualization is the rich ecosystem of ‘packages’ contributed by others. In most cases, just as with smartphones, “There’s a package for that.” If you want to be efficient you n...

XLConnect: Frequently Asked Questions

May 12, 2011

In the two months since the first release of XLConnect we have received some great feedback from the community. Most questions we saw seemed to cluster around a few central topics – memory issues, font styling and Excel feature support. …

Example 8.37: Read sheets from an excel file

May 11, 2011

Microsoft Excel is an awkward tool for data analysis. However, it is a reasonable environment for recording and transferring data. In our consulting practice, people frequently send us data in .xls (from Excel 97-2003) or .xlsx (from Excel 2007 or 201...

sab-R-metrics: Basics of LOESS Regression

May 11, 2011

Last week, I left you off at logistic regression. This week, I'll be pushing the limits of regression analysis a bit more with a smoothing technique called LOESS regression. There are a number of smoothing methods that can be used, such as Smoothing ...

One-way ANOVAs in R – including post-hocs/t-tests and graphs

May 11, 2011

In this post, I go over the basics of running an ANOVA using R. The dataset I’ll be examining comes from this website, and I’ve discussed it previously (starting here and then here). I’ve not seen many examples where someone runs through the …

Multivariate probit regression using (direct) maximum likelihood estimators

May 11, 2011

Consider a random pair of binary responses, each taking values 1 or 2. Assume that the probability can be a function of some covariates. The Gaussian vector latent structure: A standard model is based on a latent Gaussian structure, i.e. there exi...

EC2 Trials and Tribulations, Part 1 (Web Crawling)

May 11, 2011

Elastic Compute Cloud (EC2) is a service provided by Amazon Web Services that allows users to leverage computing power without the need to build and maintain servers, or spend money on special hardware. The idea is simple: the user “boots” up one or more machines and then accesses those machines as if they were logged into any other machine...

An essential vocabulary for the R language

May 11, 2011

The Oxford English Dictionary includes more than 600,000 words, yet most of us get by in our day-to-day lives with a vocabulary of just a few thousand. In a similar vein, the R language includes thousands of functions: when you start up R 2.13, you have 2832 functions at your disposal: > length(apropos(".", mode="function")) 2382 This includes only...

Comparison of functions for comparative phylogenetics

May 11, 2011

With all the packages (and beta stage groups of functions) for comparative phylogenetics in R (tested here: picante, geiger, ape, motmot, Liam Revell's functions), I was simply interested in which functions to use in cases where multiple functions exis...

Defining Custom Model Priors in BMS

Bayesian Model Averaging (BMA) allows for any kind of model prior distributions. While the R package BMS has built-in support for several types of commonly used priors, there may be the need for constructing a custom model prior in a particular exerci...

A clock utility, via console hackery

May 11, 2011

A discussion on StackOverflow today shows an interesting use of special characters inside the cat function. The most common special characters that you may have come across are the tab and newline characters, represented by \t and \n respectively. Try them for yourself. cat("Red\tlorry\nYellow\tlorry\n") cat also respects the backspace character, \b, and the carriage return

High Low Clustering on intraday high frequency sampled data

May 10, 2011

Nothing unusually exciting in this post, but I happened to be engaged in some particle-based methods recently and made some simple visual observations as I was setting up some of the sampling environment in R. I am also using Rkward and Ubuntu to...

Publishing in Veterinary Academic Journals

May 10, 2011

Following the post by Arthur Charpentier (Freakonometrics), I wondered what would be the outcome considering my current engagement (veterinary medicine, epidemiology, bovine mastitis). Briefly, Arthur Charpentier’s post looked at clusters of journals publishing the same kind of papers. So I looked at 25 journals (Journal of Dairy Science, Canadian Journal of Veterinary Medicine, Preventive Veterinary

ABC model choice by DIC

May 10, 2011

Yet another paper on ABC model choice was posted on arXiv a few days ago, just prior to the ABC in London meeting that ended in the pub above (most conveniently located next to my B&B!). It is written by Olivier Francois and Guillaume Laval and the approach relies on DIC for running model selection.

Late to the party for R in Finance blogging

May 10, 2011

I meant to blog about the R/Finance conference during a lull, but I didn’t find too many. Unlike many conferences I’ve been to, the structure of R/Finance was simple: one room and one speaker at a time. Relying on each …
