This is why code written by scientists gets ugly

May 13, 2014
By

(This article was first published on What You're Doing Is Rather Desperate » R, and kindly contributed to R-bloggers)

There’s a lot of discussion around why code written by self-taught “scientist programmers” rarely follows what a trained computer scientist would consider “best practice”. Here’s a recent post on the topic.

One answer: we begin with exploratory data analysis and never get around to cleaning it up.

An example. For some reason, a researcher (let’s call him “Bob”) becomes interested in a particular dataset in the GEO database. So Bob opens the R console and use the GEOquery package to grab the data:

library(GEOquery)
gse <- getGEO("GSE22255")

Bob is interested in the covariates and metadata associated with the experiment, which he can access using pData().

pd <- pData(gse$GSE22255_series_matrix.txt.gz)
names(pd)
#  [1] "title"                   "geo_accession"          
#  [3] "status"                  "submission_date"        
#  [5] "last_update_date"        "type"                   
#  [7] "channel_count"           "source_name_ch1"        
#  [9] "organism_ch1"            "characteristics_ch1"    
# [11] "characteristics_ch1.1"   "characteristics_ch1.2"  
# [13] "characteristics_ch1.3"   "characteristics_ch1.4"  
# [15] "characteristics_ch1.5"   "characteristics_ch1.6"  
# [17] "treatment_protocol_ch1"  "molecule_ch1"           
# [19] "extract_protocol_ch1"    "label_ch1"              
# [21] "label_protocol_ch1"      "taxid_ch1"              
# [23] "hyb_protocol"            "scan_protocol"          
# [25] "description"             "data_processing"        
# [27] "platform_id"             "contact_name"           
# [29] "contact_email"           "contact_phone"          
# [31] "contact_fax"             "contact_laboratory"     
# [33] "contact_department"      "contact_institute"      
# [35] "contact_address"         "contact_city"           
# [37] "contact_state"           "contact_zip/postal_code"
# [39] "contact_country"         "supplementary_file"     
# [41] "data_row_count"  

Bob discovers that pd$characteristics_ch1.2 is “age at examination”, but it’s stored as a factor. He’d like to use it as a numeric variable. So he sets about figuring out how to do the conversion.

# first you need factor to character
as.character(pd$characteristics_ch1.2[1:5])
# [1] "age-at-examination: 68" "age-at-examination: 68" "age-at-examination: 73"
# [4] "age-at-examination: 65" "age-at-examination: 73"

# now you want to get rid of everything but the age value
# so you wrap it in a gsub()
gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5]))
# [1] "68" "68" "73" "65" "73"

# and the last step is character to numeric
as.numeric(gsub("age-at-examination: ", "", as.character(pd$characteristics_ch1.2[1:5])))
# [1] 68 68 73 65 73

Three levels of nested methods. Ugly. However, it works, so Bob moves to the next task (using those ages to do some science).

He thinks: “at some point, I should rewrite that as a function to convert ages.”

But he never does. We stop when it works.

Filed under: programming, R, research diary, statistics

To leave a comment for the author, please follow the link and comment on their blog: What You're Doing Is Rather Desperate » R.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.

Sponsors

Mango solutions



plotly webpage

dominolab webpage



Zero Inflated Models and Generalized Linear Mixed Models with R

Quantide: statistical consulting and training

datasociety

http://www.eoda.de





ODSC

ODSC

CRC R books series





Six Sigma Online Training









Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)