Looking at Measles Data in Project Tycho, part II

April 6, 2014

Continuing from last week, I will now look at incidence rates of measles in the US. To recap, Project Tycho contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to any...

Read more »

On the rise of Big Data and Data Science

April 5, 2014

This post is going to differ slightly from the data-orientated material that I usually publish. I was recently playing around with the Google Trends API and came across some very interesting... well... trends. There has definitely been a huge amount of publicity surrounding “Big Data”, maybe even too much. For those of us who have been working

Read more »

Five Reasons to Teach Elementary Statistics With R: #1

April 5, 2014

Contents: Introduction; Reason #1: Package mosaic; Keeping Simple Things Simple; Flow-Control for the Masses; There is Much More; References. Introduction: This is the first in a projected five-part series of posts aimed at colleagues who teach...

Read more »

Analyzing VC investment strategies with Crunchbase data

April 5, 2014

If you look at the investments in Big Data companies in the last few years, one thing is obvious: This is a very dynamic and fast growing market. I am producing regular updates of this network map of Big Data investments with a Python program (actually an IPython Notebook). But what insights can be

Read more »

Choose Your Own Data Adventure

April 5, 2014

The question is: can we automate scientific discovery, and what might an interface to such a tool look like? I’ve been experimenting with automating simple and complex data analysis and report generation tasks for biological data, mostly using R and LaTeX. You can see some of my progress and the challenges encountered in the presentation

Read more »

Regressions with Multiple Fixed Effects – Comparing Stata and R

April 5, 2014

In my paper on the impact of the recent fracking boom on local economic outcomes, I am estimating models with multiple fixed effects. These fixed effects are useful because they take out, e.g., industry-specific heterogeneity at the county level or state-specific time shocks. The models can take the form:    where is

Read more »

Making inferences about unusual population quantities.

April 4, 2014

This post builds on my last: Alternatives to model diagnostics for statistical inference?, where I had claimed that we could make quality inferences about the best linear approximation to a quadratic relationship. The R code below implements such a scenario, to establish a framework for discussion. I have inserted comments to augment the code. set.seed(42)

Read more »

Scraping organism metadata for Treebase repositories from GOLD using Python and R

I recently wanted to get hold of habitat/phenotype/sequencing metadata for the individual organisms of an archived Treebase project. The GOLD database holds more than 18000 full genomes. For many of these it provides pretty good metadata (GOLDcards) which are indirectly linked to...

Read more »

Two R tutorials for beginners

I am currently in the process of rescuing some of the pages from my now defunct datajujitsu.co.uk blogger blog and moving them to this Github/Clojure/Bootstrap version. I also today gave a tutorial to the University of Manche...

Read more »

Functional programming in R

This post is based on a talk I gave at the Manchester R User Group on functional programming in R on May 2nd 2013. The original slides can be found here. This post is about functional programming, why it is at the heart of the R language and how it can hopefully help you...

Read more »

Develop in RStudio, run in RScript

I have been using RStudio Server for a few months now and am finding it a great tool for R development. The web interface is superb and behaves in almost exactly the same way as the desktop version. However, I do have one gripe which has forced me to change my working...

Read more »

Mapping academic collaborations in Evolutionary Biology

This post is a republication of a visualisation I did in 2011 for my (now defunct) datajujitsu.co.uk blog. It was a naive first attempt at web-scraping from an academic publisher's website. It was done before I was aware of the problems surrounding access to, and text-mining of, online academic content hosted by...

Read more »

R as a Publishing Engine | CPI Components Use Case

April 4, 2014

R was certainly not designed to be a publishing engine, but in my workflow, R is the primary method of content creation.  With that in mind, I have been thinking about a very different use case of rCharts in which we might want to include inflexible a...

Read more »

R for Open Science

April 4, 2014

FastCompany magazine recently published an in-depth feature on Open Science, with a focus on the R language and the ROpenSci project. If you're not familiar with ROpenSci, the article gives a nice introduction from Ted Hart, a member of the ROpenSci development team: A big sea change was the need to meet digital formatting requirements of scientific data. Hart...

Read more »

Flip the script, or, the joys of coord_flip()

April 4, 2014

Has this ever happened to you? I hate it when the labels on the x-axis overlap, but this can be hard to avoid. I can stretch the figure out, but then the data become farther apart and the space where I want to put the figure (either in a talk or a paper...

Read more »

The Collatz Fractal

April 4, 2014

It seems to me that the poet has only to perceive that which others do not perceive, to look deeper than others look. And the mathematician must do the same thing (Sofia Kovalevskaya). How beautiful is this fractal! In previous posts I colored plots using the modulus of complex numbers generated after some iterations. In this

Read more »

Le Monde puzzle [#860]

April 3, 2014

A Le Monde mathematical puzzle that connects to my awalé post of last year: For N≤18, N balls are placed in N consecutive holes. Two players, Alice and Bob, consecutively take two balls at a time provided those balls are in contiguous holes. The loser is left with orphaned balls. What are the values of

Read more »

Introduction to Data Science with R, April 28-29 San Francisco

April 3, 2014

Please join us for our popular Introduction to R course for data scientists and data analysts in San Francisco on April 28 and 29.  This is a two-day workshop, designed to provide a comprehensive introduction to R that will have you analyzing and modeling data with R in no time. We will cover practical skills for

Read more »

Some R Resources for GLMs

April 3, 2014

by Joseph Rickert. Generalized Linear Models have become part of the fabric of modern statistics, and logistic regression, at least, is a “go to” tool for data scientists building classification applications. The ready availability of good GLM software and the interpretability of the results make logistic regression a good baseline classifier. Moreover, Paul Komarek argues that, with a...

Read more »

Does R have too many packages?

April 3, 2014

The Homeless Econometrician. The amazing growth and success of CRAN (Comprehensive R Archive Network) is marked by the thousands of packages that have been developed and released by a highly active user base. Yet even so, one of the founders and primary...

Read more »

Boston Marathon Winners and Challenging Africa

April 2, 2014

The marathon is dominated by African runners. David Epstein, in a relatively recent interview, mentions a specific tribe in Kenya called the Kalenjin: "There are 17 American men in history who have run under 2:10 in the marathon...there were ...

Read more »

Inference for ARCH processes

April 2, 2014

Consider some ARCH($p$) process, say ARCH(2), where $\varepsilon_t=\eta_t\sigma_t$ and $\sigma_t^2=\omega+\alpha_1\varepsilon_{t-1}^2+\alpha_2\varepsilon_{t-2}^2$, with $(\eta_t)$ a Gaussian (strong) white noise.
> n=500
> a1=0.8
> a2=0.0
> w=0.2
> set.seed(1)
> eta=rnorm(n)
> epsilon=rnorm(n)
> sigma2=rep(w,n)
> for(t in 3:n){
+ sigma2[t]=w+a1*epsilon[t-1]^2+a2*epsilon[t-2]^2
+ epsilon[t]=eta[t]*sqrt(sigma2[t])
+ }
> par(mfrow=c(1,1))
> plot(epsilon,type="l",ylim=c(min(epsilon)-.5,max(epsilon)))
> lines(min(epsilon)-1+sqrt(sigma2),col="red")
(the red line is the square root of the conditional variance process, shifted below the series).
> par(mfrow=c(1,2))
> acf(epsilon,lag=50,lwd=2)...

Read more »

Seven quick facts about R

April 2, 2014

I've been spending the week at the Gartner Business Intelligence and Analytics Summit in Las Vegas, and R has been quite prominent here. Of course, R got namechecked several times on the panel about the Gartner Magic Quadrant for Advanced Analytics, and several of the regular talks mentioned R as well. I gave a short presentation on R and...

Read more »

Social Science Goes R: Weighted Survey Data

April 2, 2014

To get this blog started, I'll be rolling out a series of posts relating to the use of survey data in R. Most content comes from the ECPR...

Read more »

xts like endpoints in Javascript

April 2, 2014

I decided to promote this from a Twitter comment to a blog post. I had hoped to do a prototype JavaScript interactive rebalancing visualization of Unsolved Mysteries of Rebalancing integrating this, but I have not had the time, so I’ll release it...

Read more »

Announcing The Pooled Resources Open Access ALS Clinical Trial (PRO-ACT) database

April 2, 2014

Prize4Life and NEALS are proud to announce the launch of the Pooled Resources Open Access ALS Clinical Trial (PRO-ACT) database. It is a database of ALS clinical trials and contains 8500+ patient records and over 8 million data points, making it not only the biggest ALS clinical trial database currently available, but one of the largest clinical trial databases...

Read more »

Merge .ASC grids with R

April 2, 2014

A couple of years ago I found online a script to merge several .asc grids into a single file in R. I do not remember where I found it, but if you have the same problem, the script is the following: setwd("c:/temp") library(rgdal) library(raster) # ...

Read more »

AERA Preview

April 2, 2014

The American Educational Research Association (AERA) annual conference is this weekend in Philadelphia. I was lucky to have a paper accepted into the conference. I am presenting a meta analysis that I have been working on for the past two years or so titled: Model misspecification and assumption violations with the linear mixed model: A meta analysis. In...

Read more »

Deploying Desktop Apps with R

April 2, 2014

(Update) Despite the original publish date (Apr 1), this post was not an April Fools' joke. I’ve also shortened the title a bit. As part of my job, I develop utility applications that automate workflows that apply more involved analysis algorithms. When feasible, I deploy web applications as it lowers installation requirements to simply a modern (standards...

Read more »