## Exploratory Data Analysis: Useful R Functions for Exploring a Data Frame

Data in R are often stored in data frames, because they can store multiple types of data. (In R, data frames are more general than matrices, because matrices can only store one type of data.) Today's post highlights some common functions in R that I like to use to explore a data frame before
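As a rough illustration of the kind of exploration the excerpt describes (not the post's own code), the sketch below runs a few base R functions on the built-in `iris` data frame. The choice of `iris` and the name `col_classes` are assumptions made for the example.

```r
# Built-in example data frame; any data frame would do here.
df <- iris

dim(df)        # number of rows and columns
str(df)        # column types with a short preview of the values
head(df, 3)    # first three rows
summary(df)    # per-column summaries

# Class of each column, returned as a named character vector.
col_classes <- sapply(df, class)
```

Because a data frame can mix column types, `col_classes` here contains both `"numeric"` and `"factor"` entries, which a matrix could not hold side by side.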

## Gaussian Processes with RStan

August 19, 2013


Previously I looked at how to simulate Gaussian processes in R, following the methods in Rasmussen and Williams. But now that Andrew Gelman et al. (of

## Question and Answer: Generating Binary and Discrete Response Data

August 19, 2013


I was recently contacted by a reader with two very specific questions, and I thought that this would be a good topic to respond to publicly. He would like to simulate his data: I have firm-level data and the model is discrete choice with the main expla...

## Text Mining with R – Comparing Word Counts in two Text Documents

August 19, 2013


Here's what I came up with to compare word counts in two pieces of text. If you have any ideas, I'd love to learn about alternatives! `## a function that compares word counts in two texts` `wordcount ...`

## Revolution Newsletter: August 2013

August 19, 2013


The most recent edition of the Revolution Newsletter is now available. In case you missed it, the news section is below, and you can read the full August edition (with highlights from this blog and community events) online. You can subscribe to the Revolution Newsletter to get it monthly via email. What is R? Has anyone ever asked you,...

## R vs Python Speed Comparison for Bootstrapping

August 19, 2013


I’m interested in Python a lot, mostly because it appears to be wickedly fast. The downside is that I don’t know it nearly as well as R, so any speed gain in computation time is more than offset by Google …

## The Bayesian Counterpart of Pearson’s Correlation Test

August 19, 2013


Except for maybe the t test, a contender for the title “most used and abused statistical test” is Pearson’s correlation test. Whenever someone wants to check if two variables relate somehow it is a safe bet (at least in psychology) that the first thing to be tested is the strength of a Pearson’s correlation. Only if that doesn’t...

## Is the Tax Code the longest Title?

August 19, 2013


Last week, I shared that Dan Katz and I had finally published a draft of our paper, Measuring the Complexity of the Law: The U.S. Code. We’d previewed this research on Computational Legal Studies years ago. Since then, we’ve received great…

## Slides from Rcpp talk in Chicago

August 19, 2013


A couple of days ago, I gave a talk to the Chicago R Users Group which is run ever-so-smoothly by Paul Teetor and Chase Carpenter. The talk provided a brief introduction to Rcpp for R and C++ integration. Slides are now up on my talks / presentation...

## #21 – Find significant relationships in data with a CoCo Matrix

August 19, 2013


The CoCo Matrix (correlation coefficient matrix) is a script for R that takes a table headed with multiple variables and calculates the correlation coefficients between each of the variables, determines which are statistically significant, and represents them visually in a grid-plot. I created the CoCo Matrix to cross correlate a table with a large number of

## Fitting psychometric functions using STAN

August 19, 2013


STAN is a new system for Bayesian inference, similar to BUGS and JAGS. I’ve played with it a bit and it’s quite promising: it really has the potential to make MCMC less of a pain (on simple models). I’ve written a short introduction to fitting psychometric functions using STAN and R, in case that’s useful

## Endogenous Spatial Lags for the Linear Regression Model

August 18, 2013


Over the past number of years, I have noted that spatial econometric methods have been gaining popularity. This is a welcome trend in my opinion, as the spatial structure of data is something that should be explicitly included in the empirical modelling procedure. Omitting spatial effects assumes that the location co-ordinates for observations are unrelated

## Fitting a Model by Maximum Likelihood

August 18, 2013


Maximum-Likelihood Estimation (MLE) is a statistical technique for estimating model parameters. It basically sets out to answer the question: what model parameters are most likely to characterise a given set of data? First you need to select a model for the data. And the model must have one or more (unknown) parameters. As the name
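A minimal sketch of the idea, assuming a normal model fitted to simulated data (the post's own model and data are not shown in this excerpt): write down the negative log-likelihood and hand it to `optim()`. The names `neg_log_lik`, `fit` and `mle` are chosen for the example.

```r
set.seed(1)
# Simulated data, an assumption made purely for illustration.
x <- rnorm(200, mean = 5, sd = 2)

# Negative log-likelihood of a normal model. The standard deviation is
# parameterised on the log scale so the optimiser can never propose a
# negative value for it.
neg_log_lik <- function(par, data) {
  -sum(dnorm(data, mean = par[1], sd = exp(par[2]), log = TRUE))
}

# Minimise the negative log-likelihood (Nelder-Mead is optim's default).
fit <- optim(par = c(0, 0), fn = neg_log_lik, data = x)

# Back-transform the second parameter to get the estimated sd.
mle <- c(mean = fit$par[1], sd = exp(fit$par[2]))
```

For this particular model the estimates can be checked against the closed-form answers: the sample mean, and the square root of the mean squared deviation from it.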

## Exercise in REML/Mixed model

August 18, 2013


I want to build a bit more experience in REML, so I decided to redo some of the SAS examples in R. This post describes the results of example 59.1 (page 5001, SAS(R)/STAT User guide 12.3). Following the list from freshbiostats I will analyze ...

## Clarifying vague interactions

August 18, 2013


For some reason, authors occasionally present linear model results with vague or unintelligible interaction effects. One way to be vague when presenting interaction effects is to provide only a table of model coefficients, including no information on the range of covariate values observed, and no plots to aid in interpretation. Here’s an example: Suppose you have discovered a statistically significant...

## Mapping Australian electoral divisions with ggplot2

August 18, 2013


I’ve seen some creative visualisations of issues surrounding the Australian election recently, though not as many maps as I expected. ‘ggplot2’ is the go-to package for plotting in R so I thought I’d see if I could plot the Australian electoral divisions with ggplot2. By using the Australian Electoral Commission’s GIS mapping coordinates and mutilating

## Negative Payments in Local Spending Data

August 17, 2013


In anticipation of a new R library from School of Data data diva @mihi_tr that will wrap the OpenSpending API and provide access to OpenSpending.org data directly from within R, I thought I’d start doodling around some ideas raised in Identifying Pieces in the Spending Data Jigsaw. In particular, common payment values, repayments/refunds and “balanced

## Update to Fantasy Football Draft Optimizer shiny app

August 17, 2013


By popular demand, I updated the Fantasy Football Draft Optimizer shiny app with two changes: The app now takes into account how many teams are in your league when estimating... The post Update to Fantasy Football Draft Optimizer shiny app appeared first on Fantasy Football Analytics.

## Working with climate data from the web in R

August 17, 2013


I recently attended ScienceOnline Climate, a conference in Washington, D.C. at AAAS. You may have heard of the ScienceOnline annual meeting in North Carolina - this was one of their topical meetings focused on Climate Change. I moderated a session on working with data from the web in R, focusing on climate data. Search Twitter for...

## Accuracy versus F score: Machine Learning for the RNA Polymerases

August 16, 2013


Hello, today I'm going to show you the difference between two common performance measures (useful not only for machine learning, but in every scientific field). Until now, I have found accuracy values more often than F scores in...
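To make the contrast concrete, here is a small sketch with made-up labels (not the RNA polymerase data from the post). With an imbalanced class split, accuracy can look comfortable while the F score, built from precision and recall on the positive class, stays low. All vectors and variable names below are hypothetical.

```r
# Hypothetical labels: 1 = positive class, 0 = negative class.
truth <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
pred  <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 1)

# Confusion-matrix counts for the positive class.
tp <- sum(pred == 1 & truth == 1)   # true positives
fp <- sum(pred == 1 & truth == 0)   # false positives
fn <- sum(pred == 0 & truth == 1)   # false negatives

accuracy  <- mean(pred == truth)    # share of all predictions that are correct
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
```

With these labels the accuracy is 0.8 while the F score is only 0.5, because the many correctly predicted negatives raise accuracy but do not enter precision or recall.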

## Using Heatmaps to Uncover the Individual-Level Structure of Brand Perceptions

August 16, 2013


Heatmaps, when the rows and columns are appropriately ordered, provide insight into the data structure at the individual level.  In an earlier post I showed a cluster heatmap with dendrograms for both the rows and the columns.  In addition, I...

## Foodborne Chicago finds dodgy restaurants with tweets, and R

August 16, 2013


If, like me, you've ever had a sandwich from a dubious deli and then been laid up for days afterwards, you know that food poisoning is no trifling matter. In the past, local authorities would only ever learn of such public health issues if they were reported by the victim (or the victim's doctor). But that...

## Equivocal Zones

August 16, 2013


In Chapter 11, equivocal zones were briefly discussed. The idea is that some classification errors are close to the probability boundary (i.e. 50% for two-class outcomes). If this is the case, we can create a zone where the samples are predicted as "equivocal" or "indeterminate" instead of one of the class levels. This only works if the...
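The idea can be sketched in a few lines of base R, assuming a two-class problem and a zone of 0.5 ± 0.10. The probabilities, the zone width and the labels `class1`/`class2` are all made up for the example and are not taken from the post.

```r
# Hypothetical predicted probabilities of the first class.
prob  <- c(0.05, 0.45, 0.52, 0.61, 0.93)
width <- 0.10   # half-width of the equivocal zone (an assumed value)

# Samples inside the zone are labelled "equivocal"; the rest are
# assigned to a class using the usual 50% cutoff.
pred <- ifelse(abs(prob - 0.5) < width, "equivocal",
               ifelse(prob >= 0.5, "class1", "class2"))

# Share of samples that still receive a class prediction.
reportable_rate <- mean(pred != "equivocal")
```

Widening the zone should reduce errors among the samples that do get a class label, at the cost of a lower `reportable_rate`, so the two are best looked at together.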

## Programming style guidelines: R and MATLAB

August 16, 2013


Summary of programming style conventions in R and MATLAB.

## RcppArmadillo 0.3.910.0

August 15, 2013


A new minor release 3.910.0 of Armadillo came out a few days ago. A new RcppArmadillo release 0.3.910.0 was provided right away, and after a brief back-and-forth with CRAN (mostly having to do with the non-standard vignette corresponding to our CSD...

## Creating a Quick Report with knitr, xtable, R Markdown, Pandoc (and some OpenBLAS Benchmark Results)

August 15, 2013


To cut a long story short, I always wanted to write professional-looking documents (technical reports and potentially my thesis) with R code. No more copy and paste. No more Microsoft Word. At the same time, I don't feel comfortable with LaTeX. Somehow I found a workaround with knitr, xtable, R Markdown...

## sapply is my new friend!

August 15, 2013


I’ve written previously about how the apply function is a major workhorse in many of my work projects. What I didn’t know is how handy the sapply function can be! There are a couple of cases so far where I’ve …

## R, drug development and the FDA

August 15, 2013


by Joseph Rickert. When you are not directly working in an industry, it is often extremely difficult to get any real insight into common practices that may be perfectly transparent to people who are. With some persistence, though, every once in a while you can stumble into an opportunity to see why things are the way they are. Last week, at...

