The genome represents the history book and instruction set of all organisms on Earth. Over the last century, genetic analysis has provided deep insight into the underpinnings of biological processes and the molecular etiology of disease. A crucial principle for understanding how genes function is that defects can arise from mutation of the genes themselves, but also from the way they are regulated and the amount of gene product they produce. Messenger RNA (mRNA) copies of DNA genes are made in a process called “transcription”. These mRNAs are then used in turn as templates to produce proteins in the cell in a process called “translation”. The number of mRNA copies transcribed from a gene and the number of protein copies translated from that mRNA are referred to as the degree to which these molecules are “expressed”.
Expression of genes changes dynamically in response to cues from the environment, and these changes coordinate cellular biochemistry, signaling and behavior. Aberrant gene expression changes are also associated with the pathophysiology of many diseases, for example numerous cancers, in which key genes are expressed at too high or too low a level, which decouples certain pathways and behaviors from the normal function of the cell.
The study of transcriptional changes, known as transcriptomics, is an important component of the business model of biotech and pharmaceutical companies, often applied to the areas of synthetic biology or drug target discovery. The data analyzed in this project come from a next generation sequencing (NGS) technique called RNA sequencing (RNA-seq), which sequences and quantifies the RNA molecules in a cell.
Each of the samples analyzed in this set is from yeast, a very common model organism used for genetic research. Yeast are simple and inexpensive to work with, are eukaryotic cells (like the cells in our body), and have fewer genes than humans (a little over 6,000 compared with our >20,000). Although they seem humble and simple, yeast genetics has led to the discovery of numerous crucial genetic pathways in our own cells, such as the molecules responsible for controlling and mediating cell division.
Data & Methods
This dataset consisted of several CSV files describing the experimental conditions of each sample, some information about the genes, and a table of RNA-seq reads (normalized in reads of the specific gene per million reads of mRNA) for each gene in each sample.
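As a rough illustration of the normalization mentioned above (reads of a gene per million mRNA reads), here is a minimal Python sketch. Python is an assumption, since the text does not say which language was used; the function name `reads_per_million`, the gene names, and the counts are hypothetical, and the denominator is assumed to be the total read count of the sample.

```python
def reads_per_million(raw_counts):
    """Scale the raw read counts of one sample to reads per million (RPM)."""
    total = sum(raw_counts.values())
    if total == 0:
        raise ValueError("sample has no reads to normalize against")
    return {gene: count * 1_000_000 / total for gene, count in raw_counts.items()}


# Hypothetical raw counts for a single sample (10,000 reads in total).
sample_counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
rpm = reads_per_million(sample_counts)
# geneA: 500 of 10,000 reads, i.e. 50,000 reads per million
```

Normalizing this way makes samples sequenced to different depths comparable, because every sample sums to the same total.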
The experimental conditions in the sample set were extremely varied, and many of them had fewer than 3 replicate samples. To simplify presentation to the user and to increase the amount of data in each group, only 4 comparisons were evaluated in this analysis:
• Yeast samples grown at high temperature (37 C) vs ideal growth conditions (30 C)
• Yeast samples grown at lower temperature (15 C) vs ideal growth conditions (30 C)
• Yeast samples grown with either glucose or ethanol as their carbon source
• Wildtype yeast samples or strains optimized for biofuel production
The other CSV files, covering intracellular localization and additional information about the genes, were incomplete. This information will be filled in later by web scraping the Saccharomyces Genome Database (SGD, https://www.yeastgenome.org) and added to the analysis.
The columns for expression level were read in as strings and needed to be converted to numeric. Experimental conditions were converted to factors to provide categorical levels for grouping analysis.
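The type conversion described above might look something like the following Python sketch. Everything here is illustrative: the column names, sample labels, and values are hypothetical (the real file layout is not given), and the integer codes are only a rough stand-in for a factor type.

```python
import csv
import io

# Hypothetical excerpt of an expression table; real column names are not given.
raw = io.StringIO(
    "gene,sample_1,sample_2\n"
    "geneA,12.5,14.0\n"
    "geneB,0.0,3.25\n"
)


def load_expression(handle):
    """Read an expression table, converting each sample column from str to float."""
    table = {}
    for row in csv.DictReader(handle):
        gene = row.pop("gene")
        table[gene] = {sample: float(value) for sample, value in row.items()}
    return table


expression = load_expression(raw)

# Categorical encoding: each distinct condition label becomes one level, and
# every sample is stored as the integer code of its level.
conditions = ["control", "heat", "control", "heat"]  # hypothetical labels
levels = sorted(set(conditions))
codes = [levels.index(label) for label in conditions]
```

Converting once, at load time, means later grouping and arithmetic never have to deal with strings.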
For each comparison below, the mean expression of each gene was calculated across the replicate samples of each group (experimental and control), and a 2-sample t-test between the groups was carried out to obtain a p-value for the difference between their means. Filtering, preprocessing and calculation of these data subsets were carried out ahead of time and provided to the webpage, because calculating and loading the larger datasets initially hurt site performance. When users select a dataset, only the relevant expression data and pre-calculated values are loaded for the experimental comparison of interest.
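To make the per-gene calculation concrete, here is a small Python sketch of a group mean plus a 2-sample t-test. The text does not say which t-test variant was used, so Welch's unequal-variance test is an assumption, as are the function names and the replicate values. The p-value is obtained by numerically integrating the t density, which is adequate for illustration but loses precision for extremely small p-values.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * total * h / 3)


def welch_t_test(experimental, control):
    """Return (mean_exp, mean_ctrl, t, p); each group needs >= 2 replicates."""
    m_e, m_c = mean(experimental), mean(control)
    v_e = variance(experimental) / len(experimental)
    v_c = variance(control) / len(control)
    if v_e + v_c == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t = (m_e - m_c) / math.sqrt(v_e + v_c)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v_e + v_c) ** 2 / (
        v_e ** 2 / (len(experimental) - 1) + v_c ** 2 / (len(control) - 1)
    )
    return m_e, m_c, t, two_sided_p(t, df)


# Hypothetical normalized expression of one gene in three replicates per group.
m_e, m_c, t_stat, p_val = welch_t_test([4.0, 5.0, 6.0], [1.0, 2.0, 3.0])
```

Running this once per gene and storing the means and p-values is the kind of pre-calculation that keeps the interactive page from redoing the statistics on every selection.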
The volcano plot was chosen because it is a common method of displaying gene expression data. Expression ratios (experimental/control) for genes with higher expression range from 1 with no upper bound, whereas ratios for genes with lower expression are compressed between 0 and 1. A plot of raw expression ratios therefore gives a skewed impression of expression change and undue weight to genes that have higher expression in the experimental sample. The log2(experimental) – log2(control) transformation provides a symmetrical, evenly weighted display of genes with both higher and lower expression. Statistical significance (the t-test p-value) is plotted as -log10(p-value), so higher values on the y-axis reflect a higher degree of significance for the expression change of that gene. The user may select the thresholds for a significant gene expression difference (fold change of expression and the p-value for the t-test comparison of experimental variable vs control), and the genes that meet them are dynamically displayed in red on the plot.
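The two axis transformations and the threshold test can be sketched in a few lines of Python. This is only an illustration: the function names and default thresholds are hypothetical, the fold-change threshold is assumed to apply symmetrically to increases and decreases, and the text does not say how genes with zero expression (where the logarithm is undefined) are handled.

```python
import math


def volcano_point(mean_experimental, mean_control, p_value):
    """Return the (x, y) plot coordinates: log2 fold change and -log10(p)."""
    log2_fc = math.log2(mean_experimental) - math.log2(mean_control)
    return log2_fc, -math.log10(p_value)


def is_highlighted(log2_fc, neg_log10_p, min_fold=2.0, max_p=0.05):
    """True when a gene passes both thresholds (assumed symmetric in fold change)."""
    return abs(log2_fc) >= math.log2(min_fold) and neg_log10_p >= -math.log10(max_p)


# A 4-fold increase and a 4-fold decrease land at mirrored x positions,
# whereas their raw ratios (4.0 and 0.25) are not symmetric around 1.
up = volcano_point(40.0, 10.0, 0.01)
down = volcano_point(10.0, 40.0, 0.01)
```

Here `up` and `down` differ only in the sign of their x coordinate, which is the symmetry the log transformation is meant to provide.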
Clustering Analysis and Heatmap
The gene names from the Gene List selected by user-defined thresholds were used to filter the raw dataset and collect the expression data from each replicate for the significantly changed genes. The resulting expression matrix of experimental replicates was subjected to hierarchical clustering and visualized as an interactive heatmap. This visualization is useful for assessing the consistency of gene expression between experimental conditions, and whether the primary source of variance is the experimental conditions themselves or differences between experimental repeats or the individual strains used for the samples.
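For readers unfamiliar with hierarchical clustering, the following Python sketch shows the basic agglomerative procedure on a toy set of samples. The distance metric and linkage rule used in the app are not stated, so Euclidean distance with average linkage is an assumption, and the sample names and expression values are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering of samples; returns the merges in order.

    profiles maps a sample name to its expression vector. Each merge is a
    (cluster_a, cluster_b, distance) tuple, where clusters are tuples of names.
    """

    def cluster_distance(pair):
        a, b = pair
        pairwise = [euclidean(profiles[i], profiles[j]) for i in a for j in b]
        return sum(pairwise) / len(pairwise)

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(clusters, 2), key=cluster_distance)
        merges.append((a, b, cluster_distance((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical expression of two genes in four samples.
samples = {
    "ctrl_1": [1.0, 5.0],
    "ctrl_2": [1.2, 5.1],
    "heat_1": [4.0, 1.0],
    "heat_2": [4.1, 0.9],
}
merges = average_linkage(samples)
```

In this toy case the replicates of each condition merge first and the two conditions join last, which is the pattern expected when the experimental condition, rather than replicate noise, is the main source of variance. The merge order and distances are what a dendrogram draws.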
The Gene List tab shows a panel listing all of the significant genes that match the user threshold criteria, along with their expression values for both experimental conditions, the fold change in gene expression, and the p-value. Additional information about these genes will be implemented in the future. The full list of these genes may be downloaded by the user with a click of a button in the side panel.
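The filtering and download steps could be approximated as below. This is a hedged Python sketch rather than the app's actual code: the field names, default thresholds, and gene records are hypothetical, the fold-change threshold is assumed to apply in both directions, and control means of zero are not handled.

```python
import csv
import io

FIELDS = ["gene", "mean_experimental", "mean_control", "fold_change", "p_value"]


def significant_genes(rows, min_fold=2.0, max_p=0.05):
    """Keep genes changed at least min_fold in either direction with p <= max_p."""
    keep = []
    for row in rows:
        ratio = row["mean_experimental"] / row["mean_control"]
        changed = ratio >= min_fold or ratio <= 1 / min_fold
        if changed and row["p_value"] <= max_p:
            keep.append({**row, "fold_change": ratio})
    return keep


def to_csv(rows):
    """Serialize the filtered gene list to CSV text, ready to offer as a download."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical pre-calculated values for four genes.
genes = [
    {"gene": "geneA", "mean_experimental": 40.0, "mean_control": 10.0, "p_value": 0.01},
    {"gene": "geneB", "mean_experimental": 11.0, "mean_control": 10.0, "p_value": 0.001},
    {"gene": "geneC", "mean_experimental": 5.0, "mean_control": 20.0, "p_value": 0.2},
    {"gene": "geneD", "mean_experimental": 5.0, "mean_control": 20.0, "p_value": 0.01},
]
hits = significant_genes(genes)
csv_text = to_csv(hits)
```

Only geneA (4-fold up) and geneD (4-fold down) pass both thresholds here; geneB changes too little and geneC is not significant.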
How to Use
The user may select from 4 datasets, in each of which RNA expression from yeast grown in a comparison condition is compared against samples from control growth conditions. Slider bars on the left side panel allow the user to select the threshold cutoffs for significantly different genes of interest. The Gene Expression tab displays a volcano plot of mean gene fold change (experiment vs control) against statistical significance of the difference of the means (p-values). Genes of interest matching the thresholds set by the user are displayed in red.
The Clustering Analysis tab displays a heat map of all user-selected genes, with gene expression from each experimental sample from the 2 comparison conditions. The margins of the heat map also have a dendrogram showing hierarchical clustering of the samples based on their gene expression profiles. This enables the user to view trends in genes across samples and experimental conditions, as well as to compare the variance between sample replicates with the variance between experimental conditions.
The Gene List tab lists all of the user-selected genes along with their mean normalized expression level (experimental condition vs control), ratio of gene expression (experimental/control) and the t-test p-value of the difference between experimental conditions. This list may be downloaded with the click of a button in the user side panel.
Using the app, I collected the genes with a gene expression change of ≥ 2-fold and p ≤ 0.02 for each of the experimental conditions featured in the app. I manually browsed SGD for these significant genes to get an impression of the biological processes, pathways and functions that may be affected in these conditions (summarized below).
This project was also chosen because it will be easily extensible with web scraping and machine learning functionality. Web scraping will be used to collect additional information on each gene from publicly available genetic repositories, such as standard name, description, functional/pathway information, gene ontology terms, and sequence data. Machine learning methods will be applied to these additional data to glean further biological insights into the mechanisms of genetic networks under these experimental conditions.