Ebola, Wikipedia and data janitors

September 21, 2014

Sometimes, several strands of thought come together in one place. For me right now, it’s the Wikipedia page “Ebola virus epidemic in West Africa”, which got me thinking about the perennial topic of “data wrangling”, how best to provide public data and why I can’t shake my irritation ... [Read more...]

Venn figures go wrong

August 12, 2014

I thought nothing could top the classic “6-way Venn banana”, featured in “The banana (Musa acuminata) genome and the evolution of monocotyledonous plants”. That is, until I saw Figure 3 from “Compact genome of the Antarctic midge is likely an adaptation to an extreme environment”. What’s odd is that Figure 2 ... [Read more...]

When life gives you coloured cells, make categories

August 5, 2014

Let’s start by making one thing clear. Using coloured cells in Excel to encode different categories of data is wrong. Next time colleagues explain excitedly how “green equals normal and red = tumour”, you must explain that (1) they have sinned and (2) what they meant to do was add a column ... [Read more...]

Converting a spreadsheet of SMILES: my first OSM contribution

June 30, 2014

I’ve long admired the work of the Open Source Malaria Project. Unfortunately, time and “day job” constraints prevent me from being as involved as I’d like. So: I was happy to make a small contribution recently in response to this request for help: Can anyone help @O_S_... [Read more...]

This is why code written by scientists gets ugly

May 13, 2014

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic. One answer: we begin with exploratory data analysis and never get around to cleaning it up. An ... [Read more...]

A minor update to my “apply functions” post

February 27, 2014

One of my more popular posts is A brief introduction to “apply” in R. Come August, it will be four years old. Technology moves on, old blog posts do not. So: thanks to BioStar user zx8754 for pointing me to this Stack Overflow post, in which someone complains that the ... [Read more...]

Box plots. Like box plots, only…box plots.

February 2, 2014

On a rare, brief holiday (here and here, if you’re interested; both highly recommended), I make the mistake of checking my Twitter feed:

paging @neilfws . . . RT @psudmant: Ground breaking new methods from @naturemethods – boxplots – no rly…
Chris Miller (@chrisamiller), January 30, 2014

This points me to BoxPlotR. It ... [Read more...]

BLATting the internet: the most frequent gene?

January 23, 2014

I enjoyed this story from the OpenHelix blog today, describing a Microsoft Research project to mine DNA sequences from web pages and map them to UCSC genome builds. Laura DeMare asks: what was the most-hit gene?

Most hit gene? APOE? MT @GenomeBrowser We BLATed the Internet! DNA sequences from 40 billion ... [Read more...]

Quilt plots. Like heat maps, only…heat maps

January 15, 2014

Stephen tweets:

Quilt Plots: A Simple Tool for the #Visualisation of Large Epidemiological Data
Stephen Rudd (@SAGRudd), January 15, 2014

Quilt plots. Sounds interesting. The link points to a short article in PLoS ONE, containing a table and a figure. Here is Figure 1. If you looked at that ... [Read more...]

R: how not to use savehistory() and source()

December 2, 2013

Admitting to stupidity is part of the learning process. So in the interests of public education, here’s something stupid that I did today. You’re working in the R console. Happy with your exploratory code, you decide to save it to a file. Then, you type something else, for ... [Read more...]

Microarrays, scan dates and Bioconductor: it shouldn’t be this difficult

August 21, 2013

When dealing with data from high-throughput experimental platforms such as microarrays, it’s important to account for potential batch effects. A simple example: if you process all your normal tissue samples this week and your cancerous tissue samples next week, you’re in big trouble. Differences between cancer and normal ... [Read more...]

Interestingly: the sentence adverbs of PubMed Central

July 15, 2013

Scientific writing – by which I mean journal articles – is a strange business, full of arcane rules and conventions with origins that no-one remembers but to which everyone adheres. I’ve always been amused by one particular convention: the sentence adverb. Used with a comma to make a point at the ... [Read more...]

-omics in 2013

June 24, 2013

Just how many (bad) -omics are there anyway? Let’s find out.

1. Get the raw data

It would be nice if we could search PubMed for titles containing all -omics. However, we cannot, since leading wildcards don’t work in PubMed search. So let’s just grab all articles from 2013: ... [Read more...]

Using the Ensembl Variant Effect Predictor with your 23andme data

June 3, 2013

I subscribe to the Ensembl blog and found, in my feed reader this morning, a post which linked to the Variant Effect Predictor (VEP). The original blog post, strangely, has disappeared. Not to worry: so, the VEP takes genotyping data in one of several formats, compares it with the Ensembl ... [Read more...]

A brief note: R 3.0.0 and bioinformatics

April 3, 2013

Today marks the release of R 3.0.0. There will be plenty of commentary and useful information at sites such as R-bloggers (for example, Tal’s post). Version 3.0.0 is great news for bioinformaticians, due to the introduction of long vectors. What does that mean? Well, several months ago, I was using the ... [Read more...]

R/ggplot2 tip: aes_string

February 25, 2013

I’m a big fan of ggplot2. Recently, I ran into a situation which called for a useful feature that I had not used previously: aes_string. Imagine that you have data consisting of observations for several variables – let’s say A, B, C – where each observation is from one ... [Read more...]

Basic R: rows that contain the maximum value of a variable

February 12, 2013

File under “I keep forgetting how to do this basic, frequently-required task, so I’m writing it down here.” Let’s create a data frame which contains five variables, vars, named A – E, each of which appears twice, along with some measurements. Now, let’s say we want only the ... [Read more...]

Addendum to yesterday’s post on custom CSS and R Markdown

August 27, 2012

Updates from RStudio support: (1) “Thanks for reporting and I was able to reproduce this as well. I’ve filed a bug and we’ll take a look.” (2) “Taking a further look, this is actually a bug in the Markdown package and we’ve asked the maintainer (Jeffrey Horner) to look ... [Read more...]

Custom CSS for HTML generated using RStudio

August 26, 2012

People have been telling me for a while that the latest version of RStudio, the IDE for R, is a great way to generate reports. I finally got around to trying it out and for once, the hype is justified. Start with this excellent tutorial from Jeremy Anglim. Briefly: the ... [Read more...]
