This is why code written by scientists gets ugly

May 13, 2014

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:

gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”

But he never does. We stop when it works.

Filed under: programming, R, research diary, statistics

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Search R-bloggers


Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)