# This is why code written by scientists gets ugly

May 13, 2014
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:

library(GEOquery)
gse <- getGEO("GSE22255")


Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz) names(pd) # [1] "title" "geo_accession" # [3] "status" "submission_date" # [5] "last_update_date" "type" # [7] "channel_count" "source_name_ch1" # [9] "organism_ch1" "characteristics_ch1" # [11] "characteristics_ch1.1" "characteristics_ch1.2" # [13] "characteristics_ch1.3" "characteristics_ch1.4" # [15] "characteristics_ch1.5" "characteristics_ch1.6" # [17] "treatment_protocol_ch1" "molecule_ch1" # [19] "extract_protocol_ch1" "label_ch1" # [21] "label_protocol_ch1" "taxid_ch1" # [23] "hyb_protocol" "scan_protocol" # [25] "description" "data_processing" # [27] "platform_id" "contact_name" # [29] "contact_email" "contact_phone" # [31] "contact_fax" "contact_laboratory" # [33] "contact_department" "contact_institute" # [35] "contact_address" "contact_city" # [37] "contact_state" "contact_zip/postal_code" # [39] "contact_country" "supplementary_file" # [41] "data_row_count"  Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5]) # [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73" # [4] "age-at-examination: 65" "age-at-examination: 73" # now you want to get rid of everything but the age value # so you wrap it in a gsub() gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd\$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73


Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”

But he never does. We stop when it works.

Filed under: programming, R, research diary, statistics