## Analyzing VC investment strategies with Crunchbase data

April 5, 2014
If you look at the investments in Big Data companies in the last few years, one thing is obvious: this is a very dynamic and fast-growing market. I am producing regular updates of this network map of Big Data investments with a Python program (actually an IPython Notebook). But what insights can be

April 5, 2014
The question is: can we automate scientific discovery, and what might an interface to such a tool look like? I’ve been experimenting with automating simple and complex data analysis and report generation tasks for biological data, mostly using R and LaTeX. You can see some of my progress and the challenges encountered in the presentation

## Regressions with Multiple Fixed Effects – Comparing Stata and R

April 5, 2014
In my paper on the impact of the recent fracking boom on local economic outcomes, I am estimating models with multiple fixed effects. These fixed effects are useful because they take out, e.g., industry-specific heterogeneity at the county level or state-specific time shocks. The models can take the form: where is

## Making inferences about unusual population quantities.

April 4, 2014
This post builds on my last: Alternatives to model diagnostics for statistical inference?, where I had claimed that we could make quality inferences about the best linear approximation to a quadratic relationship. The R code below implements such a scenario, to establish a framework for discussion. I have inserted comments to augment the code. set.seed(42)
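As a rough illustration of what "the best linear approximation to a quadratic relationship" can mean, here is a minimal sketch. It is not the author's code: the uniform design for `x`, the noise level, and the sample size are all assumptions chosen so that the target line has a closed form. With `x` uniform on (0, 1) and `y = x^2 + noise`, the population least-squares slope is Cov(x, x²)/Var(x) = (1/12)/(1/12) = 1 and the intercept is E[x²] − E[x] = 1/3 − 1/2 = −1/6.

```r
# Hypothetical scenario: the true relationship is quadratic, but we fit a line.
set.seed(42)
n <- 10000
x <- runif(n)                      # assumed design: uniform on (0, 1)
y <- x^2 + rnorm(n, sd = 0.1)      # assumed noise level

# The fitted line estimates the best linear approximation, not the true curve.
fit <- lm(y ~ x)

# Population target under the assumptions above: intercept -1/6, slope 1.
target <- c(-1/6, 1)
est <- unname(coef(fit))
print(rbind(target = target, estimate = est))
```

The estimate is a sensible inference about the target line even though the linear model is misspecified for the conditional mean.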

## Scraping organism metadata for Treebase repositories from GOLD using Python and R

I recently wanted to get hold of habitat/phenotype/sequencing metadata for the individual organisms of an archived Treebase project. The GOLD database holds more than 18000 full genomes. For many of these it provides pretty good metadata (GOLDcards) which are indirectly linked to...

## Two R tutorials for beginners

I am currently in the process of rescuing some of the pages from my now defunct datajujitsu.co.uk blogger blog and moving to this Github/Clojure/Bootstrap version. I also today gave a tutorial to the University of Manche...

## Functional programming in R

This post is based on a talk I gave at the Manchester R User Group on functional programming in R on May 2nd 2013. The original slides can be found here. This post is about functional programming, why it is at the heart of the R language and how it can hopefully help you...

## Develop in RStudio, run in RScript

I have been using RStudio Server for a few months now and am finding it a great tool for R development. The web interface is superb and behaves in almost exactly the same way as the desktop version. However, I do have one gripe which has forced me to change my working...

## Mapping academic collaborations in Evolutionary Biology

This post is a republication of a visualisation I did in 2011 for my (now defunct) datajujitsu.co.uk blog. It was a naive first attempt at web-scraping from an academic publisher's website. It was done before I was aware of the problems surrounding access to, and text-mining of, online academic content hosted by...

## R as a Publishing Engine | CPI Components Use Case

April 4, 2014
R was certainly not designed to be a publishing engine, but in my workflow, R is the primary method of content creation.  With that in mind, I have been thinking about a very different use case of rCharts in which we might want to include inflexible a...

## R for Open Science

April 4, 2014
FastCompany magazine recently published an in-depth feature on Open Science, with a focus on the R language and the ROpenSci project. If you're not familiar with ROpenSci, the article gives a nice introduction from Ted Hart, a member of the ROpenSci development team: A big sea change was the need to meet digital formatting requirements of scientific data. Hart...

## Flip the script, or, the joys of coord_flip()

April 4, 2014
Has this ever happened to you? I hate it when the labels on the x-axis overlap, but this can be hard to avoid. I can stretch the figure out, but then the data become farther apart and the space where I want to put the figure (either in a talk or a paper...

## The Collatz Fractal

April 4, 2014
It seems to me that the poet has only to perceive that which others do not perceive, to look deeper than others look. And the mathematician must do the same thing (Sofia Kovalevskaya). How beautiful is this fractal! In previous posts I colored plots using the modulus of complex numbers generated after some iterations. In this

## Le Monde puzzle [#860]

April 3, 2014
A Le Monde mathematical puzzle that connects to my awalé post of last year: For N≤18, N balls are placed in N consecutive holes. Two players, Alice and Bob, consecutively take two balls at a time provided those balls are in contiguous holes. The loser is left with orphaned balls. What are the values of

## Introduction to Data Science with R, April 28-29 San Francisco

April 3, 2014
Please join us for our popular Introduction to R course for data scientists and data analysts in San Francisco on April 28 and 29.  This is a two-day workshop, designed to provide a comprehensive introduction to R that will have you analyzing and modeling data with R in no time. We will cover practical skills for

## Some R Resources for GLMs

April 3, 2014
by Joseph Rickert. Generalized Linear Models have become part of the fabric of modern statistics, and logistic regression, at least, is a “go to” tool for data scientists building classification applications. The ready availability of good GLM software and the interpretability of its results make logistic regression a good baseline classifier. Moreover, Paul Komarek argues that, with a...

## Does R have too many packages?

April 3, 2014
The Homeless Econometrician. The amazing growth and success of CRAN (Comprehensive R Archive Network) is marked by the thousands of packages that have been developed and released by a highly active user base. Yet even so, one of the founders and primary...

## Boston Marathon Winners and Challenging Africa

April 2, 2014
The marathon is dominated by African runners. David Epstein, in a relatively recent interview, mentions a specific tribe in Kenya called the Kalenjin: "There are 17 American men in history who have run under 2:10 in the marathon...there were ...

## Inference for ARCH processes

April 2, 2014
Consider some ARCH($p$) process, say ARCH(2), where $\varepsilon_t=\eta_t\sqrt{\sigma_t^2}$ and $\sigma_t^2=\omega+\alpha_1\varepsilon_{t-1}^2+\alpha_2\varepsilon_{t-2}^2$, with $(\eta_t)$ a Gaussian (strong) white noise. > n=500 > a1=0.8 > a2=0.0 > w= 0.2 > set.seed(1) > eta=rnorm(n) > epsilon=rnorm(n) > sigma2=rep(w,n) > for(t in 3:n){ + sigma2[t]=w+a1*epsilon[t-1]^2+a2*epsilon[t-2]^2 + epsilon[t]=eta[t]*sqrt(sigma2[t]) + } > par(mfrow=c(1,1)) > plot(epsilon,type="l",ylim=c(min(epsilon)-.5,max(epsilon))) > lines(min(epsilon)-1+sqrt(sigma2),col="red") (the red line is the conditional variance process). > par(mfrow=c(1,2)) > acf(epsilon,lag=50,lwd=2)...
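The simulation in this excerpt can be sketched as a plain script rather than a console transcript. This is a minimal sketch, assuming an ARCH(2) recursion with explicit time indices (`[t]`, `[t-1]`, `[t-2]`), which is inferred from the two coefficients `a1`, `a2` and the loop starting at `t = 3`. The parameter values are those shown in the excerpt.

```r
# Simulate an ARCH(2) path: epsilon[t] = eta[t] * sqrt(sigma2[t]),
# sigma2[t] = w + a1 * epsilon[t-1]^2 + a2 * epsilon[t-2]^2.
set.seed(1)
n  <- 500
a1 <- 0.8
a2 <- 0.0
w  <- 0.2

eta     <- rnorm(n)      # Gaussian (strong) white noise
epsilon <- rnorm(n)      # first two values serve as starting values
sigma2  <- rep(w, n)     # conditional variance, initialised at w

for (t in 3:n) {
  sigma2[t]  <- w + a1 * epsilon[t - 1]^2 + a2 * epsilon[t - 2]^2
  epsilon[t] <- eta[t] * sqrt(sigma2[t])
}

# Quick look at the simulated series and its conditional variance.
print(summary(epsilon))
print(summary(sigma2))
```

With `a2 = 0`, the path is in effect an ARCH(1) process; setting `a2` to a positive value turns on the second lag.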

## Seven quick facts about R

April 2, 2014
I've been spending the week at the Gartner Business Intelligence and Analytics Summit in Las Vegas, and R has been quite prominent here. Of course, R got namechecked several times on the panel about the Gartner Magic Quadrant for Advanced Analytics, and several of the regular talks mentioned R as well. I gave a short presentation on R and...

## Social Science Goes R: Weighted Survey Data

April 2, 2014
To get this blog started, I'll be rolling out a series of posts relating to the use of survey data in R. Most content comes from the ECPR...

## xts like endpoints in Javascript

April 2, 2014
I decided to promote this from a Twitter comment to a blog post.  I had hoped to do a prototype javascript interactive rebalancing visualization of Unsolved Mysteries of Rebalancing integrating this, but I have not had the time, so  I’ll release it...

## Announcing The Pooled Resources Open Access ALS Clinical Trial (PRO-ACT) database

April 2, 2014
Prize4Life and NEALS are proud to announce the launch of the Pooled Resources Open Access ALS Clinical Trial (PRO-ACT) database. It is a database of ALS clinical trials and contains 8500+ patient records and over 8 million data points, making it not only the biggest ALS clinical trial database currently available, but one of the largest clinical trial databases...

## Merge .ASC grids with R

April 2, 2014
A couple of years ago I found online a script to merge several .asc grids into a single file in R. I do not remember where I found it but if you have the same problem, the script is the following: setwd("c:/temp") library(rgdal) library(raster) # ...

## AERA Preview

April 2, 2014
The American Educational Research Association (AERA) annual conference is this weekend in Philadelphia. I was lucky to have a paper accepted into the conference. I am presenting a meta-analysis that I have been working on for the past two years or so, titled: Model misspecification and assumption violations with the linear mixed model: A meta analysis. In...

## Deploying Desktop Apps with R

April 2, 2014
(Update) Despite the original publish date (Apr 1), this post was not an April Fools' joke. I’ve also shortened the title a bit. As part of my job, I develop utility applications that automate workflows that apply more involved analysis algorithms. When feasible, I deploy web applications as it lowers installation requirements to simply a modern (standards...

## Sales Dashboard in R with qplot and ggplot2 – Part 1

April 2, 2014
In a previous post on my personal blog about creating Pivot Tables in R with melt and cast we covered a simple way to generate sales reports and summary tables from a data set consisting of orders. It is often said that a picture is …

## Kaplan-Meier plots using ggplots2 (updated)

April 1, 2014
About 3 years ago I published some code on this blog to draw a Kaplan-Meier plot using ggplot2. Since then, ggplot2 has been updated (from 0.8.9 to 0.9.3.1) and has changed syntactically. Since that post, I have also become comfortable with Git and Github. I have updated the code, edited it for a small error,

