This is why code written by scientists gets ugly

[This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:

library(GEOquery)
gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
names(pd)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5])
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”

But he never does. We stop when it works.


Filed under: programming, R, research diary, statistics

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)