[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers].

I’m not saying this is a good idea, but bear with me.

A recent question on Stack Overflow [r] asked why a random forest model was not working as expected. The questioner was working with data from an experiment in which yeast was grown under conditions where (a) the growth rate could be controlled and (b) one of 6 nutrients was limited. Their dataset consisted of 6 rows – one per nutrient – and several thousand columns, with values representing the activity (expression) of yeast genes. Could the expression values be used to predict the limiting nutrient?

The random forest was not working as expected: not one of the nutrients was correctly classified. I pointed out that with only one case for each outcome, this was to be expected: since each tree is grown from a bootstrap sample of the rows, a row that is out of bag belongs to a class the tree has never seen, so correct predictions are all but impossible. As sometimes happens, the question was promptly deleted, which was unfortunate, as we could have explored the problem further.
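A toy sketch illustrates the point (this is simulated data, not the Brauer dataset): with one observation per class, whenever a row is out of bag, no row with its class was used to grow that tree, so its out-of-bag prediction can never be correct.

```r
library(randomForest)

# 6 observations, one per class, and a few noisy predictors
set.seed(42)
toy <- data.frame(matrix(rnorm(6 * 10), nrow = 6),
                  nutrient = factor(LETTERS[1:6]))
rf_toy <- randomForest(nutrient ~ ., data = toy)
rf_toy$confusion  # the diagonal is all zeros: 100% OOB error
```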

A little web searching revealed that the dataset in question is quite well-known. It’s published in an article titled Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast and has been transformed into a “tidy” format for use in tutorials, here and here.

As it turns out, there are 6 cases (rows) for each outcome (limiting nutrient), as experiments were performed at 6 different growth rates. Whilst random forests are good for “large p, small n” problems, it may be that ~5,500 × 36 is pushing the limit somewhat. But you know – let’s just try it anyway.

As ever, code and a report for this blog post can be found at Github.

First, we obtain the tidied Brauer dataset. But in fact we want to “untidy” it again (make wide from long), because for random forest we want observations (n) as rows and variables (p – both predicted and predictors) as columns.

library(tidyverse)
library(randomForest)
library(randomForestExplainer)

# read the tidied Brauer dataset; the file name here is a placeholder -
# substitute the path or URL of your copy (see the tutorials linked above)
brauer2007_tidy <- read_csv("brauer2007_tidy.csv")
I’ll show the code for running a random forest without much explanation. I’ll assume you have a basic understanding of how the process works. That is: rows are sampled from the data and used to build many decision trees where variables predict either a continuous outcome variable (regression) or a categorical outcome (classification). The trees are then averaged (for regression values) or the majority vote is taken (for classification) to generate predictions. Individual predictors have an “importance” which is essentially some measure of how much worse the model would be were they not included.
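As a minimal sketch of those ideas using R’s built-in iris data (not the Brauer data):

```r
library(randomForest)

set.seed(1)
# Classification: each tree votes for a class; the majority vote wins
rf_class <- randomForest(Species ~ ., data = iris, importance = TRUE)

# Regression: predictions from individual trees are averaged
rf_reg <- randomForest(Sepal.Length ~ ., data = iris[, -5])

# Permutation importance: how much worse does accuracy get when a
# predictor's values are shuffled?
importance(rf_class, type = 1)
```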

Here we go then. A couple of notes. First, setting a seed for random sampling would not usually be used; it’s for reproducibility here. Second, unless you are specifically using the model to predict outcomes on unseen data, there’s no real need for splitting the data into test and training sets – the procedure is already performing a bootstrap by virtue of the out-of-bag error estimation.
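To see why the out-of-bag estimate stands in for a test set, here’s a quick sketch, again on the iris data rather than the Brauer data:

```r
library(randomForest)

set.seed(2)
idx <- sample(nrow(iris), 100)

# OOB error from a forest grown on all the data...
rf_all <- randomForest(Species ~ ., data = iris)
oob_err <- tail(rf_all$err.rate[, "OOB"], 1)

# ...is comparable to the error on a held-out test set
rf_train <- randomForest(Species ~ ., data = iris[idx, ])
test_err <- mean(predict(rf_train, iris[-idx, ]) != iris$Species[-idx])

c(oob = oob_err, holdout = test_err)
```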

set.seed(42)  # any fixed seed will do; as noted above, it is only for reproducibility

brauer2007_tidy_rf1 <- brauer2007_tidy %>%
  mutate(systematic_name = gsub("-", "minus", systematic_name),
         nutrient = factor(nutrient)) %>%
  select(systematic_name, nutrient, rate, expression) %>%
  spread(systematic_name, expression, fill = 0) %>%
  randomForest(nutrient ~ ., data = ., localImp = TRUE, importance = TRUE)


The model seems to have performed quite well:

Call:
randomForest(formula = nutrient ~ ., data = ., localImp = TRUE,      importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 74

OOB estimate of  error rate: 5.56%
Confusion matrix:
          Ammonia Glucose Leucine Phosphate Sulfate Uracil class.error
Ammonia         6       0       0         0       0      0   0.0000000
Glucose         0       6       0         0       0      0   0.0000000
Leucine         0       1       5         0       0      0   0.1666667
Phosphate       0       0       0         6       0      0   0.0000000
Sulfate         0       0       0         0       6      0   0.0000000
Uracil          0       1       0         0       0      5   0.1666667


Let’s look at the expression of the top 20 genes (by variable importance), with respect to growth rate and limiting nutrient.

brauer2007_tidy %>%
  filter(systematic_name %in% important_variables(brauer2007_tidy_rf1, k = 20)) %>%
  ggplot(aes(rate, expression)) +
  geom_line(aes(color = nutrient)) +
  facet_wrap(~systematic_name, ncol = 5) +
  scale_color_brewer(palette = "Set2")


Several of those look promising. Let’s select, for each of the 6 limiting nutrients, a gene whose expression seems to be affected by that nutrient, and search online resources such as the Saccharomyces Genome Database to see what’s known.

| systematic_name | biological process (bp) | nutrient | search_results |
|---|---|---|---|
| YHR208W | branched chain family amino acid biosynthesis* | leucine | Pathways – leucine biosynthesis |
| YKL216W | ‘de novo’ pyrimidine base biosynthesis | uracil | URA1 – null mutant requires uracil |
| YLL055W | biological process unknown | sulfate | Cysteine transporter; null mutant absent utilization of sulfur source |
| YLR108C | biological process unknown | phosphate | |
| YOR348C | proline catabolism* | ammonia | Proline permease; repressed in ammonia-grown cells |
| YOR374W | ethanol metabolism | glucose | Aldehyde dehydrogenase; expression is glucose-repressed |

I’d say the Brauer data “makes sense” for five of those genes. Little is known about the sixth (YLR108C, affected by phosphate limitation).

In summary

Normally, a study like this would start with the genes: identify those that are differentially expressed, then think about the conditions under which differential expression was observed. Here, the process is reversed in a sense: we view the experimental condition as an outcome rather than a parameter, and ask whether it can be “predicted” from the other observations.

So whilst not my first choice of method for this kind of study, and despite limited outcome data, random forest does seem to be generating some insights into which genes are affected by nutrient limitation. And at the end of the day: if a method provides insights, isn’t that what data science is for?