A few days ago I heard a talk about Simpson's paradox, and I decided to write a little example in R:

library(MASS) # For multivariate normals
# List of (vectors of) means
mu <- list(c(5, 175), c(6.25, 110))
# List of covariance matrices
sigma ...

Kickstarter, a social funding platform where individuals can chip in cash to get a worthy project going, just celebrated its 10,000th kickstarted project. Kickstarter employee Fred Benenson recognized the achievement by visualizing the funding of music, design, art, game and many other kinds of projects using R and ggplot2. For example, here's a chart that shows the increasing rate...

NppToR 2.6 is coming with improved flexibility and speed. Testers needed before setting as default.

At job-search site indeed.com, you can take a look at trends in the keywords used in job postings. As you might expect, job postings containing terms related to making sense of data are on the rise. Here's the growth in job postings mentioning big data: And here's statistician: The drop-off in demand for statisticians in 2011 seems to...

Computer Assisted Reporting This is the second of four articles about analyzing distances between sex offenders and child daycare centers in Missouri as part of a joint project with KSHB NBC Action News in Kansas City. The previous article gave details...

Computer Assisted Reporting This is the first of three articles about analyzing distances between sex offenders and child daycare centers in Missouri as part of a joint project with KSHB NBC Action News in Kansas City. The Missouri State Highway Patrol...

Shravan Vasishth has written a response to my review both published on the Statistics Forum. His response is quite straightforward and honest. In particular, he acknowledges not being a statistician and that he “should spend more time studying statistics”. I also understand the authors’ frustration at trying “to recruit several statisticians (at different points) to

I'm really pleased that an article I wrote, "5 real-world uses of big data", has been published in the widely read technology blog GigaOm. In the article, I review five examples of using data science techniques and R to make sense of some large real-world data sets: Drew Conway's analysis of the Afghanistan attacks data released by Wikileaks; Benetech's use...

Friday July 22 is the last day on which you can register for UseR! 2011 at the University of Warwick. The conference will be 2011 August 16-18. You can peruse the book of abstracts and view the draft schedule. I am scheduled to give a talk on “Random input testing with R”. The abstract is: … Continue reading...

When conducting any statistical analysis it is important to evaluate how well the model fits the data and whether the data meet the assumptions of the model. There are numerous ways to do this and a variety of statistical tests to evaluate deviations from model assumptions. However, there is little general acceptance of any of the statistical tests. Generally...

You might think that doing advanced statistical analysis on Big Data is out of reach for those of us without access to expensive hardware and software. For example, back in April SAS was proud to demonstrate being able to run logistic regression on a billion records (and "just a few" variables) in less than 80 seconds. But that feat...

Emilio Torres Manzanera has just announced the 1st Data Analysis Contest Using R: “Nestoria (http://www.nestoria.com/) is a specialized web search engine platform in house prices. Nestoria and Lokku Labs aim to improve the understanding of the public of the value of its databases. The company aims to engage a few brilliant statisticians in the expectation

Ticker Sense posted about the mean correlation of the S&P 500. The plot there — similar to Figure 1 — shows that correlation has been on the rise after a low in February. Figure 1: Mean 50-day rolling correlation of S&P 500 constituents to the index. For me, this post raised a whole lot more … Continue reading...

Reflection is a programming concept that sounds scarier than it is. There are three related concepts that fall under the umbrella of reflection, and I’ll be surprised if you haven’t come across most of these ideas in code already, even if you didn’t know they were called reflection. The first concept is examination of your variables.

This post provides links to a range of resources related to the use and interpretation of correlations. I wanted to provide a page with links to a number of additional resources that would be useful both for those of my students who might be keen to le...

I just got the following email from PNAS about our Lack of confidence in ABC model choice. Editor's Remarks to Author: both referees now find the manuscript acceptable for publication as do I. Each suggests small changes which I encourage the authors to make prior to having the manuscript go into production. Congratulations on an

In response to my last post, Chris had the following comment: I am actually trying to better understand the distinction between mixture models and mixture distributions in my own work. You seem to say mixture models apply to a small set of models – namely regression models. This comment suggests that my caution about the difference between mixed-effect models and mixture distributions...