I’m not saying this is a good idea, but bear with me.
A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?
The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected: since the random forest algorithm samples a proportion of the rows to build each tree, no correct predictions are likely in this situation. As sometimes happens, the question was promptly deleted, which was unfortunate as we could have explored the problem further.
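That failure mode is easy to reproduce. In this sketch (synthetic data, not the yeast set), each of six classes has exactly one row, so the out-of-bag prediction for a row comes from trees that never saw that row – and no other row carries its label:

```r
library(randomForest)

set.seed(42)
# six rows, one per class, with ten random predictor columns
toy <- data.frame(matrix(rnorm(6 * 10), nrow = 6))
toy$class <- factor(letters[1:6])

rf_toy <- randomForest(class ~ ., data = toy)

# every class is misclassified: with a single example per class,
# the OOB prediction can never equal the true label
rf_toy$confusion
```

The confusion matrix shows an OOB error of 100%, exactly the behaviour the questioner reported.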
A little web searching revealed that the dataset in question is quite well-known. It’s published in an article titled Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast and has been transformed into a “tidy” format for use in tutorials, here and here.
As it turns out, there are 6 cases (rows) for each outcome (limiting nutrient), as experiments were performed at 6 different growth rates. Whilst random forests are good for “large p, small n” problems, ~5,500 predictors for 36 observations may be pushing the limit somewhat. But you know – let’s just try it anyway.
As ever, code and a report for this blog post can be found at Github.
First, we obtain the tidied Brauer dataset. But in fact we want to “untidy” it again (make wide from long) because for random forest we want observations (n) as rows and variables (p – both predicted and predictors) in columns.
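Before the real thing, here’s the reshape on a toy example (made-up values, not the actual Brauer measurements), going from one row per gene measurement to one column per gene:

```r
library(tidyr)

# toy long-format data: one row per (nutrient, gene) measurement
long <- data.frame(
  nutrient   = c("Glucose", "Glucose", "Ammonia", "Ammonia"),
  rate       = 0.05,
  gene       = c("YAL001C", "YAL002W", "YAL001C", "YAL002W"),
  expression = c(1.2, -0.3, 0.8, 0.1)
)

# spread() as used in this post; pivot_wider() is the modern equivalent
# fill = 0 supplies a value where a gene/condition combination is missing
wide <- spread(long, gene, expression, fill = 0)
```

After the reshape, `wide` has one row per condition and the genes YAL001C and YAL002W as columns – observations as rows, variables as columns, which is what `randomForest` expects.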
library(tidyverse)
library(randomForest)
library(randomForestExplainer)

brauer2007_tidy <- read_csv("https://4va.github.io/biodatasci/data/brauer2007_tidy.csv")
I’ll show the code for running a random forest without much explanation, assuming a basic understanding of how the process works. That is: rows are sampled from the data and used to build many decision trees, in which variables predict either a continuous outcome (regression) or a categorical outcome (classification). The trees’ outputs are then averaged (regression) or put to a majority vote (classification) to generate predictions. Each predictor has an “importance”: essentially, a measure of how much worse the model would perform were it not included.
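To see those moving parts on a small built-in dataset (iris here, purely illustrative), a classification forest and its importance scores look like this:

```r
library(randomForest)

set.seed(1)
# a classification forest on the built-in iris data
rf_iris <- randomForest(Species ~ ., data = iris, importance = TRUE)

# majority vote over the trees yields the predicted class; each row is
# scored only by trees that did not sample it (out-of-bag)
head(rf_iris$predicted)

# type = 1: permutation importance, i.e. how much accuracy drops
# when a predictor's values are shuffled
importance(rf_iris, type = 1)
```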
Here we go then. A couple of notes. First, setting a seed for random sampling would not usually be done; it’s here purely for reproducibility. Second, unless you are specifically using the model to predict outcomes on unseen data, there’s no real need to split the data into training and test sets – the procedure already performs a bootstrap by virtue of the out-of-bag error estimation.
brauer2007_tidy_rf1 <- brauer2007_tidy %>%
  # "-" in gene names would break the formula interface
  mutate(systematic_name = gsub("-", "minus", systematic_name),
         nutrient = factor(nutrient)) %>%
  select(systematic_name, nutrient, rate, expression) %>%
  # long to wide: one column per gene
  spread(systematic_name, expression, fill = 0) %>%
  randomForest(nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)
The model seems to have performed quite well:
Call:
 randomForest(formula = nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 74

        OOB estimate of  error rate: 5.56%
Confusion matrix:
          Ammonia Glucose Leucine Phosphate Sulfate Uracil class.error
Ammonia         6       0       0         0       0      0   0.0000000
Glucose         0       6       0         0       0      0   0.0000000
Leucine         0       1       5         0       0      0   0.1666667
Phosphate       0       0       0         6       0      0   0.0000000
Sulfate         0       0       0         0       6      0   0.0000000
Uracil          0       1       0         0       0      5   0.1666667
Let’s look at the expression of the top 20 genes (by variable importance), with respect to growth rate and limiting nutrient.
brauer2007_tidy %>%
  filter(systematic_name %in% important_variables(brauer2007_tidy_rf1, k = 20)) %>%
  ggplot(aes(rate, expression)) +
  geom_line(aes(color = nutrient)) +
  facet_wrap(~systematic_name, ncol = 5) +
  scale_color_brewer(palette = "Set2")
Several of those look promising. Let’s select a gene where expression seems to be affected by each of the 6 limiting nutrients and search online resources such as the Saccharomyces Genome Database to see what’s known.
| Gene | Biological process | Limiting nutrient | Annotation |
|------|--------------------|-------------------|------------|
| YHR208W | branched chain family amino acid biosynthesis* | leucine | Pathways – leucine biosynthesis |
| YKL216W | ‘de novo’ pyrimidine base biosynthesis | uracil | URA1 – null mutant requires uracil |
| YLL055W | biological process unknown | sulfate | Cysteine transporter; null mutant absent utilization of sulfur source |
| YLR108C | biological process unknown | phosphate | |
| YOR348C | proline catabolism* | ammonia | Proline permease; repressed in ammonia-grown cells |
| YOR374W | ethanol metabolism | glucose | Aldehyde dehydrogenase; expression is glucose-repressed |
I’d say the Brauer data “makes sense” for five of those genes. Little is known about the sixth (YLR108C, affected by phosphate limitation).
Normally, a study like this would start with the genes – identify those that are differentially expressed, then think about the conditions under which differential expression was observed. Here, the process is reversed in a sense: we view the experimental condition as an outcome rather than a parameter, and ask whether it can be “predicted” from the other observations.
So whilst not my first choice of method for this kind of study, and despite limited outcome data, random forest does seem to be generating some insights into which genes are affected by nutrient limitation. And at the end of the day: if a method provides insights, isn’t that what data science is for?