Continuing from last week, I will now look at incidence rates of measles in the US. To recap, Project Tycho contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to any...

This post is going to differ slightly from the data-orientated material that I usually publish. I was recently playing around with the Google Trends API and came across some very interesting… well… trends. There has definitely been a huge amount of publicity surrounding “Big Data”, maybe even too much. For those of us who have been working

If you look at the investments in Big Data companies in the last few years, one thing is obvious: this is a very dynamic and fast-growing market. I am producing regular updates of this network map of Big Data investments with a Python program (actually an IPython Notebook). But what insights can be

The question is: can we automate scientific discovery, and what might an interface to such a tool look like? I’ve been experimenting with automating simple and complex data analysis and report generation tasks for biological data, mostly using R and LaTeX. You can see some of my progress and challenges encountered in the presentation

In my paper on the impact of the recent fracking boom on local economic outcomes, I am estimating models with multiple fixed effects. These fixed effects are useful because they take out, for example, industry-specific heterogeneity at the county level or state-specific time shocks. The models can take the form: where is
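As a rough illustration of what such fixed effects absorb, here is a minimal sketch in base R on simulated data. Everything in it is hypothetical and not taken from the paper: the panel of 50 counties over 10 years, the variable names, and the true slope of 2. It uses plain dummy variables via `factor()`, which is only one way to estimate such a model and becomes slow when the fixed effects have many levels.

```r
# Hypothetical simulated panel: 50 counties observed over 10 years.
set.seed(1)
panel <- expand.grid(county = 1:50, year = 2001:2010)

county_effect <- rnorm(50)[panel$county]       # time-invariant county heterogeneity
year_effect   <- rnorm(10)[panel$year - 2000]  # shock shared by all counties in a year

# The regressor is correlated with the county effect, so pooled OLS is biased.
panel$x <- county_effect + rnorm(nrow(panel))
panel$y <- 2 * panel$x + county_effect + year_effect +
  rnorm(nrow(panel), sd = 0.5)

# Dummy-variable estimator with two sets of fixed effects.
fit_fe     <- lm(y ~ x + factor(county) + factor(year), data = panel)
# Pooled regression that ignores both sets of effects.
fit_pooled <- lm(y ~ x, data = panel)

beta_fe     <- unname(coef(fit_fe)["x"])
beta_pooled <- unname(coef(fit_pooled)["x"])
print(c(fixed_effects = beta_fe, pooled = beta_pooled))
```

In this simulated setting the fixed-effects estimate should sit close to the true slope, while the pooled estimate is pulled upward by the omitted county heterogeneity.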

This post builds on my last: Alternatives to model diagnostics for statistical inference?, where I had claimed that we could make quality inferences about the best linear approximation to a quadratic relationship. The R code below implements such a scenario, to establish a framework for discussion. I have inserted comments to augment the code.
set.seed(42)

Scraping organism metadata for Treebase repositories from GOLD using Python and R
I recently wanted to get hold of habitat/phenotype/sequencing metadata for the individual organisms of an archived Treebase project. The GOLD database holds more than 18000 full genomes. For many of these it provides pretty good metadata (GOLDcards) which are indirectly linked to...

Two R tutorials for beginners
I am currently in the process of rescuing some of the pages from my now defunct datajujitsu.co.uk Blogger blog and moving to this Github/Clojure/Bootstrap version. Today I also gave a tutorial to the University of Manche...

Develop in RStudio, run in RScript
I have been using RStudio Server for a few months now and am finding it a great tool for R development. The web interface is superb and behaves in almost exactly the same way as the desktop version. However, I do have one gripe which has forced me to change my working...

Mapping academic collaborations in Evolutionary Biology
This post is a republication of a visualisation I did in 2011 for my (now defunct) datajujitsu.co.uk blog. It was a naive first attempt at web-scraping from an academic publisher's website. It was done before I was aware of the problems surrounding access to, and text-mining of, online academic content hosted by...

R was certainly not designed to be a publishing engine, but in my workflow, R is the primary method of content creation. With that in mind, I have been thinking about a very different use case of rCharts in which we might want to include inflexible a...

FastCompany magazine recently published an in-depth feature on Open Science, with a focus on the R language and the ROpenSci project. If you're not familiar with ROpenSci, the article gives a nice introduction from Ted Hart, a member of the ROpenSci development team: A big sea change was the need to meet digital formatting requirements of scientific data. Hart...

It seems to me that the poet has only to perceive that which others do not perceive, to look deeper than others look. And the mathematician must do the same thing (Sofia Kovalevskaya)
How beautiful is this fractal! In previous posts I colored plots using the modulus of complex numbers generated after some iterations. In this

A Le Monde mathematical puzzle that connects to my awalé post of last year: For N≤18, N balls are placed in N consecutive holes. Two players, Alice and Bob, consecutively take two balls at a time provided those balls are in contiguous holes. The loser is left with orphaned balls. What is the values of
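The excerpt cuts off before stating the question, so the following is only a minimal sketch of one way to explore the game as described, not the post's own solution. It assumes normal play (the player who cannot remove two contiguous balls loses) and uses Sprague-Grundy values: removing a pair splits a row into two independent rows, whose values combine by bitwise XOR. The function name `grundy_values` is made up for this illustration.

```r
# Grundy values for a row of n balls where a move removes two adjacent balls,
# leaving two independent rows (one on each side of the removed pair).
grundy_values <- function(n_max) {
  g <- integer(n_max + 1)          # g[n + 1] holds the value for a row of n balls
  for (n in seq_len(n_max)) {
    if (n < 2) next                # a single ball allows no move, value stays 0
    left  <- 0:(n - 2)             # balls to the left of the removed pair
    right <- n - 2 - left          # balls to the right of the removed pair
    reachable <- bitwXor(g[left + 1], g[right + 1])
    mex <- 0L                      # smallest value not reachable in one move
    while (mex %in% reachable) mex <- mex + 1L
    g[n + 1] <- mex
  }
  g
}

g <- grundy_values(18)
# Row lengths N in 1..18 from which the player to move loses under best play.
first_player_loses <- which(g[-1] == 0)
print(first_player_loses)
```

A row length with Grundy value 0 is a loss for the player about to move; any other value means that player can force a win.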

Please join us for our popular Introduction to R course for data scientists and data analysts in San Francisco on April 28 and 29. This is a two-day workshop, designed to provide a comprehensive introduction to R that will have you analyzing and modeling data with R in no time. We will cover practical skills for

by Joseph Rickert Generalized Linear Models have become part of the fabric of modern statistics, and logistic regression, at least, is a “go to” tool for data scientists building classification applications. The ready availability of good GLM software and the interpretability of its results make logistic regression a good baseline classifier. Moreover, Paul Komarek argues that, with a...

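Consider some ARCH($p$) process, say ARCH(2), where $\varepsilon_t = \eta_t \sigma_t$ and $\sigma_t^2 = \omega + a_1 \varepsilon_{t-1}^2 + a_2 \varepsilon_{t-2}^2$, with $(\eta_t)$ a Gaussian (strong) white noise.
> n=500
> a1=0.8
> a2=0.0
> w= 0.2
> set.seed(1)
> eta=rnorm(n)
> epsilon=rnorm(n)
> sigma2=rep(w,n)
> # recursion: conditional variance from the two previous shocks, then the new shock
> for(t in 3:n){
+ sigma2[t]=w+a1*epsilon[t-1]^2+a2*epsilon[t-2]^2
+ epsilon[t]=eta[t]*sqrt(sigma2[t])
+ }
> par(mfrow=c(1,1))
> plot(epsilon,type="l",ylim=c(min(epsilon)-.5,max(epsilon)))
> lines(min(epsilon)-1+sqrt(sigma2),col="red")
(the red line is the square root of the conditional variance process, shifted below the series).
> par(mfrow=c(1,2))
> acf(epsilon,lag=50,lwd=2)...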

I've been spending the week at the Gartner Business Intelligence and Analytics Summit in Las Vegas, and R has been quite prominent here. Of course, R got namechecked several times on the panel about the Gartner Magic Quadrant for Advanced Analytics, and several of the regular talks mentioned R as well. I gave a short presentation on R and...

Social Science Goes R: Weighted Survey Data
To get this blog started, I'll be rolling out a series of posts relating to the use of survey data in R. Most content comes from the ECPR...

I decided to promote this from a Twitter comment to a blog post. I had hoped to do a prototype interactive JavaScript rebalancing visualization of Unsolved Mysteries of Rebalancing integrating this, but I have not had the time, so I’ll release it...

Prize4Life and NEALS are proud to announce the launch of the Pooled Resources Open Access ALS Clinical Trial (PRO-ACT) database. It is a database of ALS clinical trials and contains 8500+ patient records and over 8 million data points, making it not only the biggest ALS clinical trial database currently available, but one of the largest clinical trial databases...

The American Educational Research Association (AERA) annual conference is this weekend in Philadelphia. I was lucky to have a paper accepted into the conference. I am presenting a meta analysis that I have been working on for the past two years or so, titled: Model misspecification and assumption violations with the linear mixed model: A meta analysis. In...

(Update) Despite the original publish date (Apr 1), this post was not an April Fools' joke. I’ve also shortened the title a bit. As part of my job, I develop utility applications that automate workflows that apply more involved analysis algorithms. When feasible, I deploy web applications as it lowers installation requirements to simply a modern (standards...