Fast(ish) extraction of exon locations from a BED12 file using data.table

March 20, 2011
By

Here is a fast R function to extract exon locations from a BED12 file. Note that fast is a relative term: the function below is fast enough for me but may not be fast enough for others :) Anyway, a BED12 file typically has locations of genomic features (t...
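As a rough illustration of what such an extraction involves (this is not the author's data.table function, just a base R sketch): in the BED12 format, each exon starts at chromStart plus the matching entry of blockStarts, and its length is the matching entry of blockSizes. The example table, its coordinates, and the name extract_exons are made up for illustration.

```r
# Hypothetical two-transcript BED12 table (made-up coordinates, illustration only).
# Only the BED12 columns needed for exon extraction are included.
bed <- data.frame(
  chrom       = c("chr1", "chr2"),
  chromStart  = c(1000L, 5000L),
  name        = c("tx1", "tx2"),
  blockCount  = c(2L, 3L),
  blockSizes  = c("100,200,", "50,60,70,"),
  blockStarts = c("0,500,", "0,100,300,"),
  stringsAsFactors = FALSE
)

# extract_exons is a hypothetical helper name, not from the original post.
extract_exons <- function(bed) {
  # Split the comma-separated block fields; a trailing comma yields no extra element.
  sizes  <- lapply(strsplit(bed$blockSizes, ",", fixed = TRUE), as.integer)
  starts <- lapply(strsplit(bed$blockStarts, ",", fixed = TRUE), as.integer)
  n <- lengths(sizes)  # number of exons per transcript

  # Block starts are relative to chromStart, so shift them to genomic coordinates.
  exon_start <- rep(bed$chromStart, n) + unlist(starts)
  data.frame(
    chrom = rep(bed$chrom, n),
    start = exon_start,
    end   = exon_start + unlist(sizes),
    name  = rep(bed$name, n),
    stringsAsFactors = FALSE
  )
}

exons <- extract_exons(bed)
print(exons)
```

The result has one row per exon, in 0-based half-open coordinates like the BED file itself.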

Read more »

Machine Learning Ex5.2 – Regularized Logistic Regression

March 20, 2011
By

Exercise 5.2 improves the Logistic Regression implementation done in Exercise 4 by adding a regularization parameter that reduces the problem of over-fitting. We will be using Newton's Method. Data: Here's the data we want to fit. # linear regression # load the data mydata = read.csv("http://spreadsheets.google.com/pub?key=0AnypY27pPCJydHZPN2pFbkZGd1RKeU81OFY3ZHJldWc&output=csv", header = TRUE) # plot the data plot(mydata$u, mydata$v, xlab="u", ylab="v") points(mydata$u,...
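For readers who want the general shape of the technique before clicking through, here is a minimal sketch of regularized logistic regression fitted with Newton's method. It is not the post's code: the data are simulated, the intercept is left unpenalized, and the names sigmoid and newton_logistic are assumptions made for this example.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Newton's method for L2-regularized logistic regression.
# X: design matrix whose first column is the intercept; y: 0/1 response.
newton_logistic <- function(X, y, lambda = 1, iterations = 15) {
  m <- nrow(X)
  n <- ncol(X)
  theta <- rep(0, n)
  reg <- diag(c(0, rep(1, n - 1)), nrow = n)  # the intercept is not penalized

  for (i in seq_len(iterations)) {
    h <- sigmoid(drop(X %*% theta))
    # Gradient and Hessian of the regularized cost
    grad <- drop(crossprod(X, h - y)) / m + (lambda / m) * drop(reg %*% theta)
    H <- crossprod(X, X * (h * (1 - h))) / m + (lambda / m) * reg
    theta <- theta - solve(H, grad)  # Newton update
  }
  theta
}

# Simulated data (an assumption; the exercise data set is not used here)
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = sigmoid(0.5 + x))
X <- cbind(1, x)

theta_unreg <- newton_logistic(X, y, lambda = 0)
theta_reg   <- newton_logistic(X, y, lambda = 10)
print(rbind(unregularized = theta_unreg, regularized = theta_reg))
```

With lambda = 0 the update reduces to ordinary logistic regression, so the result can be checked against glm(y ~ x, family = binomial); larger values of lambda shrink the non-intercept coefficients toward zero.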

Read more »

Bertrand’s paradox [R details]

March 19, 2011
By

Some may have had reservations about the “randomness” of the straws I plotted to illustrate Bertrand’s paradox, as they were all going North-West/South-East. I had actually swapped cbind and rbind in the R code, which explains this non-random orientation. Above is the corrected version, which looks “more random” indeed. (And using

Read more »

How to: Binomial regression models in R

March 19, 2011
By

Ever wondered how to predict success or failure as a function of other variables? Here's a quick tutorial on binomial regression in R.
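As a taste of what such a model looks like, here is a minimal sketch using glm with family = binomial. It is not taken from the tutorial: the dose/response data are simulated and the variable names are made up for this example.

```r
# Simulated data (an assumption): 20 trials at each of 5 dose levels, 4 replicates each
set.seed(42)
dose <- rep(1:5, each = 4)
trials <- rep(20L, length(dose))
successes <- rbinom(length(dose), size = trials, prob = plogis(-2 + 0.8 * dose))

# Binomial regression: the response is a two-column matrix of successes and failures
fit <- glm(cbind(successes, trials - successes) ~ dose, family = binomial)
print(coef(fit))

# Predicted probability of success at dose = 3
pred <- predict(fit, newdata = data.frame(dose = 3), type = "response")
print(pred)
```

The coefficients are on the log-odds scale; type = "response" converts a prediction back to a probability.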

Read more »

New GenABEL Website, and more *ABEL software

March 18, 2011
By

The *ABEL suite of R packages and software for genetic analysis has grown substantially since the appearance of GenABEL and the previously mentioned ProbABEL R packages. There are now a handful of useful R packages and other software utilities facilita...

Read more »

How to display scatter plot matrices with R and lattice

In lattice, there is a function called splom for the display of scatter plot matrices. For large datasets, the panel.hexbinplot from the hexbin package is a better option than the default panel. As an example, let’s use some meteorological data from MAPA-SIAR: library(solaR) library(hexbin) aranjuez <- readMAPA(prov=28, est=3, start='01/01/2004', end='31/12/2010') aranjuezDF <- subset(as.data.frame(getData(aranjuez)), select=c('TempMedia', 'TempMax',

Read more »

Flying off the Rack: R and the web in 2011

March 18, 2011
By

If there is ever a time to learn R and web application development, it is now...in the age of Big Data. The upcoming release of R 2.13 will provide basic functionality for developing R web applications on the desktop via the internal HTTP server, but t...

Read more »

Some upcoming R courses

March 18, 2011
By

A couple of quick notes about some upcoming R courses: In Vancouver, Canada, R trainer Isabella Ghement is presenting two R courses: An Introduction to the Statistical Software Package R, 8:30am-4:30pm, March 30-31, 2011, Vancouver, B.C., Canada (http://www.ghement.ca/RworkshopMarch30and31_2011.html) Advanced Statistical Modeling Using the Statistical Software Package R, 8:30am-4:30pm, May 5-6, 2011, Vancouver, B.C., Canada (http://www.ghement.ca/RworkshopMay5and6_2011.html); And in Seattle, Washington...

Read more »

More fun with sed

March 18, 2011
By

So I have this strange date and time string, which I would like to convert to a “useable” date, i.e., something that a spreadsheet programme or R can work with. It looks like this (MON has 3 chars): ddMONyr:hh:mm:ss The … Continue reading →
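The post itself uses sed; for comparison, a similar conversion can be sketched in R with the same kind of regular expression. This assumes that "yr" is a two-digit year in the 2000s and that MON is an English month abbreviation; convert_stamp is a hypothetical helper name and the sample timestamp is invented.

```r
# convert_stamp is a hypothetical helper; it assumes ddMONyr:hh:mm:ss with a
# two-digit year and an English three-letter month abbreviation.
convert_stamp <- function(x, century = 2000L) {
  pattern <- "^([0-9]{2})([A-Za-z]{3})([0-9]{2}):([0-9]{2}):([0-9]{2}):([0-9]{2})$"
  day   <- sub(pattern, "\\1", x)
  # Look the month up in month.abb so the result does not depend on the locale
  month <- match(toupper(sub(pattern, "\\2", x)), toupper(month.abb))
  year  <- century + as.integer(sub(pattern, "\\3", x))
  time  <- sub(pattern, "\\4:\\5:\\6", x)
  sprintf("%04d-%02d-%s %s", year, month, day, time)
}

stamp <- convert_stamp("18MAR11:09:05:30")
print(stamp)
```

The result is an ISO-style string that as.POSIXct or a spreadsheet can parse directly.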

Read more »

Machine Learning Ex5.1 – Regularized Linear Regression

March 18, 2011
By

Exercise 5.1 improves the Linear Regression implementation done in Exercise 3 by adding a regularization parameter that reduces the problem of over-fitting. Over-fitting occurs especially when fitting a high-order polynomial, which is what we will try to do here. Data: Here are the points we will make a model from: # linear regression mydata = read.csv("http://spreadsheets.google.com/pub?hl=en_GB&hl=en_GB&key=0AnypY27pPCJydGhtbUlZekVUQTc0dm5QaXp1YWpSY3c&output=csv", header = TRUE) # view data plot(mydata) http://al3xandr3.github.com/img/ml-ex51-data.png
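To give a feel for the idea, here is a minimal sketch of regularized linear regression solved with the normal equations, theta = (X'X + lambda * D)^-1 X'y, where D is the identity matrix with a zero in the intercept position. It is not the post's code: the data are simulated and ridge_normal_eq is a hypothetical helper name.

```r
# Simulated data (an assumption; the exercise data set is not reproduced here)
set.seed(1)
x <- seq(-1, 1, length.out = 20)
y <- sin(2 * x) + rnorm(20, sd = 0.2)

# Design matrix for a 5th-order polynomial: columns 1, x, x^2, ..., x^5
X <- outer(x, 0:5, `^`)

# ridge_normal_eq is a hypothetical helper: regularized normal equations
ridge_normal_eq <- function(X, y, lambda) {
  penalty <- diag(c(0, rep(1, ncol(X) - 1)))  # the intercept is not penalized
  drop(solve(crossprod(X) + lambda * penalty, crossprod(X, y)))
}

theta_0  <- ridge_normal_eq(X, y, lambda = 0)
theta_1  <- ridge_normal_eq(X, y, lambda = 1)
theta_10 <- ridge_normal_eq(X, y, lambda = 10)
print(round(rbind(theta_0, theta_1, theta_10), 3))
```

With lambda = 0 this is ordinary least squares; as lambda grows, the higher-order coefficients shrink and the fitted curve becomes smoother.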

Read more »

The housing bubble by city

March 17, 2011
By

The housing bubble by city. Miami sailed high and fell far. Detroit rose modestly but dropped more than it went up. Dallas held steady. DC is enjoying a bit of renewed growth, but are it and New York yet to fall?

Read more »

The story behind the software: the case of R.

Nowadays, whatever our area of application of statistics, we are required to know how to program. Conventional software packages such as SPSS and STATA are somewhat limited, their updates are not very frequent, and their price can be...

Read more »

More, Please!

March 17, 2011
By

Thanks to Jim, I’ve been using R in the shell more and more – in concert with vi. It’s been fun, and nice to integrate my workflows all on the server (although I haven’t had to do much graphing yet – I’m sure I’ll start kvetching then and return to a nice gui). One thing

Read more »

Staying up to date on R packages

March 17, 2011
By

Unless you regularly use particular R packages, it becomes difficult to stay on top of updates and bug fixes. Updates usually also include significant improvements in performance. I wrote this short snippet of code which I run about once a month to keep up on updates. This short bit of code will give you a

Read more »

Updated tty Connection for R

March 17, 2011
By

Below are some links to a patch against the R-2.12.2 source code that implements a tty connection for R. Since the release of R-2.13.0 is coming soon, I’ll have a patch for it soon also. What’s a tty connection? The tty connection is an R interface to computer terminals, as defined by the Portable Operating

Read more »

Circular or spherical data, and density estimation

March 17, 2011
By

A few years ago, while I was working on kernel-based density estimation for distributions with compact support (like copulas), I went through a series of papers on circular distributions. At that time, I thought it was something for mathematicians working ...

Read more »

Applying functions on groups: sqldf, plyr, doBy, aggregate or data.table ?

March 17, 2011
By

Which of the sqldf, plyr, doBy and aggregate functions/packages would be fastest for applying functions on groups of rows? I was wondering about this earlier in this post. It seems sqldf would be the fastest according to a post in manipulatr m...
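For context, "applying a function on groups" means something like the group means below. This is only a base R sketch on simulated data (the data frame and its column names are invented), not the benchmark from the post; sqldf, plyr, doBy and data.table each offer their own syntax for the same operation.

```r
# Simulated data (an assumption): 30 values spread over three groups
set.seed(1)
df <- data.frame(
  g = sample(letters[1:3], 30, replace = TRUE),
  x = rnorm(30),
  stringsAsFactors = FALSE
)

# Group means with aggregate (returns a data frame)
by_aggregate <- aggregate(x ~ g, data = df, FUN = mean)

# The same computation with tapply (returns a named vector)
by_tapply <- tapply(df$x, df$g, mean)

print(by_aggregate)
print(by_tapply)

# A single call can be timed with system.time(), e.g.
# system.time(aggregate(x ~ g, data = df, FUN = mean))
```

On a data set this small any timing difference is negligible; comparisons only become meaningful with many rows and many groups.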

Read more »

$3.2M in prizes for predicting hospitalization

March 17, 2011
By

Heritage Health and Kaggle have teamed up to create the biggest data science competition thus far: the Heritage Health Prize, which challenges competitors to build a statistical model to predict the number of days a person is likely to spend in hospital over the next year, based on (anonymized) factors such as demographics, medical visits and treatments, and other...

Read more »

Risk-Opportunity Analysis: Houston

March 17, 2011
By

I will be attending Ralph Vince's risk-opportunity analysis workshop in Houston this weekend.  I'll be in town Friday-Monday.  Drop me a note if you're in the area and would like to meet for coffee / drinks.

Read more »

Global Migration Maps

March 17, 2011
By

 Migrations of people have existed for millennia and oc

Read more »

basic ggplot2 network graphs

March 17, 2011
By

I have been looking around on the web and have not found anything yet related to using ggplot2 for making graphs/networks. I put together a few functions to make very simple graphs. The bipartite function especially is not ideal, as of course we only w...

Read more »

Having a problem with R-2.12.2 64-bit and "gam" package!

March 17, 2011
By

While working with some pitch location data recently, I ran across something strange when using my new computer (with R-2.12.2 64-bit) versus my work computer (with R-2.11.1 x64). Both are 64-bit computers, but I got the new one for portability (it's a laptop) and speed. Anyway, I had been doing some work in the office with Pitch F/X data,...

Read more »

Canabalt Revisited: Gamma Distributions, Multinomial Distributions and More JAGS Goodness

March 16, 2011
By

Introduction Neil Kodner recently got me interested again in analyzing Canabalt scores statistically by writing a great post in which he compared the average scores across iOS devices. Thankfully, Neil’s made his code and data freely available, so I’ve been revising my original analyses using his new data whenever I can find a free minute.

Read more »

How the New York Times uses R for Data Visualization

March 16, 2011
By

The New York Times introduced R to the world with a feature article in 2009, and has been using R for many years to support its pioneering presentation of data analysis and visualization, under the direction of graphics editor Amanda Cox. Last week, Amanda Cox was the featured speaker at the New York R User Group, where she presented ... how R...

Read more »

Updates to SoilWeb Mobile: Distance from Nearest Map Unit Boundary

March 16, 2011
By

Working on some new ideas on how map unit data can be summarized on small screens-- particularly for our mobile version of SoilWeb. The distance from the nearest map unit polygon boundary is now printed above mini soil profile sketches. This gives the ...

Read more »