
Characterising and understanding how human phenotypes relate to populations requires being able to count the occurrence of certain traits in individuals from different populations. With NMR- and MS-based metabolic profiles, this means being able to estimate the presence of a feature, i.e., a signal, across the whole dataset, aggregate the results by population, and estimate whether the occurrence varies significantly from one population to another.

Because NMR is known to be quantitative, the ideal solution would be to extract the concentration of as many metabolites as possible and stratify them. However, this step is not always as trivial as it sounds. Instead of a proper quantification, we can estimate the presence or absence of a signal, i.e., a trait, by computing the cross-correlation between a reference feature, i.e., an arbitrarily chosen portion of a spectrum with a well-formed pattern (signal), and the same portion from other individuals.
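As a toy illustration of this idea (not the code used in the study), scoring a candidate spectrum against a reference pattern with base R's `ccf()` might look like the following; `reference` and `candidate` are illustrative names for the intensities of the same ppm window in two spectra:

```r
# Hedged sketch: estimate the presence of a signal via cross-correlation.
set.seed(42)
ppm_axis <- seq(0, 1, length.out = 256)
# a synthetic triplet-like pattern as stand-in reference signal
triplet <- dnorm(ppm_axis, 0.45, 0.01) + 2 * dnorm(ppm_axis, 0.50, 0.01) +
           dnorm(ppm_axis, 0.55, 0.01)
reference <- triplet + rnorm(256, sd = 0.1)
candidate <- c(rep(0, 10), triplet[1:246]) + rnorm(256, sd = 0.1)  # shifted copy

# ccf() returns correlations over a range of lags; taking the maximum
# absolute value makes the score insensitive to a small shift
cc     <- ccf(reference, candidate, lag.max = 20, plot = FALSE)
max_cc <- max(abs(cc$acf))  # close to 1 when the pattern is present
```

Here the maximum over lags is what provides the robustness to chemical-shift misalignment mentioned below.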

A complete description of the process can be found here.

Here we describe how to get started with this approach using the R code developed for this study. The code depends on two R packages: MetaboMate for multivariate analysis and hastaLaVista for interactive visualization. To install them, please refer to the README file provided with each package.

## data modeling

In the github.com/cheminfo/COMPASS repository, a demo file, multiblocking.R, is provided that works with a demo dataset. Running this file will chunk the dataset into blocks of 0.5 ppm and run a PCA for each of them.

On line 24 of this demo file, the definition of the blocks can be modified as needed:
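To make the block definition concrete, here is a hedged base-R sketch of the same idea; `X`, `ppm`, and `block_size` are illustrative names, not the variables used in multiblocking.R:

```r
# Hedged sketch: chunk a samples x ppm matrix into 0.5 ppm blocks and
# fit one PCA per block.
set.seed(1)
ppm <- seq(0.01, 10, length.out = 1000)      # toy chemical-shift axis
X   <- matrix(rnorm(50 * 1000), nrow = 50)   # toy intensity matrix

block_size <- 0.5                            # block width in ppm; edit as needed
blocks <- split(seq_along(ppm), ceiling(ppm / block_size))

# one PCA model (prcomp) per 0.5 ppm chunk
pca_per_block <- lapply(blocks, function(idx) prcomp(X[, idx], scale. = TRUE))
```

Changing `block_size` plays the role of editing the block definition in the demo file.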

If you are planning to use a very large dataset, loading it into the browser for interactive visualization may become impractical. Instead, hastaLaVista can be configured to retrieve the spectra only when necessary. For this, uncomment lines 45-53 and define a path to store the JSON. Each spectrum will be converted into a single JSON file to make the interaction fast and efficient.

To do that, you must locate the folder where hastaLaVista is stored in your installation. This is achieved using the .libPaths() command and then creating the folder visu/data/json. Line 46 must point to this json folder.
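The two steps above can be sketched as follows; the exact subfolder layout is taken from the text, so treat the resulting path as an assumption to verify against your installation:

```r
# Hedged sketch: locate the installed hastaLaVista package and create
# the visu/data/json folder inside it (path layout assumed from the text).
lib_root <- .libPaths()[1]
json_dir <- file.path(lib_root, "hastaLaVista", "visu", "data", "json")
dir.create(json_dir, recursive = TRUE, showWarnings = FALSE)
json_dir   # this is the path that line 46 of the demo must point to
```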

Running this script will point your default browser to the URL

http://127.0.0.1:5474/?viewURL=http://127.0.0.1:5474/view/modelExplorer_1_1.view.json&dataURL=http://127.0.0.1:5474/data/multiblocking.data.json


and you should see the following page, where you can explore the scores and loadings for each block, the results for the whole PCA, and the super-scores and super-loadings that are reconstructions based on all blocks. Using Alt-click, it is possible to select scores and then select the spectra to be displayed.

This view (vista in Spanish) allows one to explore the dataset and identify with a high level of granularity the features of interest.

## feature identification and selection

A nice feature of the view allows you to select spectra from the score plot and then use these data to compute a STOCSY. The driver peak for the STOCSY is simply set by Alt-clicking on the selected spectra. A STOCSY trace will appear in the middle gray area, allowing you to identify features that belong to the same molecule.

## computing cross-correlation

In the github.com/cheminfo/COMPASS repository, a demo file, MBCC-metaboliteX.Rmd, is provided that works with a demo dataset.

The following code chunk from the demo file allows you to select the feature that will be used as a reference. For this example we selected two features, a triplet and a doublet, that are well resolved in the first sample (ID = 1).
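The demo's chunk is not reproduced here; as a hedged stand-in, selecting two reference windows from the first spectrum could look like this, where the ppm ranges and the helper function are placeholders, not those used in MBCC-metaboliteX.Rmd:

```r
# Hedged sketch: extract two reference windows (a "triplet" and a "doublet")
# from the first sample of a toy samples x ppm matrix.
set.seed(7)
ppm <- seq(0, 10, length.out = 2000)
X   <- matrix(rnorm(20 * 2000), nrow = 20)

# hypothetical helper: slice one sample's intensities over a ppm range
extract_window <- function(X, ppm, from, to, sample_id = 1) {
  X[sample_id, ppm >= from & ppm <= to]
}

ref_triplet <- extract_window(X, ppm, 1.10, 1.25)  # placeholder ppm range
ref_doublet <- extract_window(X, ppm, 3.40, 3.50)  # placeholder ppm range
```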

The reference features are shown here:

And finally we can compute and display the cross-correlation as shown in the following chunk.
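The demo's chunk itself is not shown here; a minimal, self-contained sketch of scoring every sample against a reference feature (all names and the synthetic data are illustrative) could be:

```r
# Hedged sketch: cross-correlate one reference pattern against the same
# ppm window of every sample and keep the maximum over lags.
set.seed(3)
ref <- 3 * dnorm(seq(-3, 3, length.out = 60))      # stand-in reference feature
X   <- matrix(rnorm(30 * 300, sd = 0.2), nrow = 30)
# spike the pattern into the first 15 toy samples at columns 101:160
X[1:15, 101:160] <- X[1:15, 101:160] + matrix(rep(ref, each = 15), nrow = 15)

cc_max <- apply(X[, 101:160], 1, function(s)
  max(abs(ccf(ref, s, lag.max = 10, plot = FALSE)$acf)))
# spiked samples score markedly higher than noise-only samples
```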

Results are shown here:

These figures enable the selection of the samples with the highest correlation. Although cross-correlation is insensitive to shifts, it is sensitive to distortions of the pattern due, for example, to overlap. Therefore it is necessary to visually inspect the results. Since this inspection is mandatory but cumbersome, we made efforts to provide the best possible interface to quickly check the results.

In the upper figure (triplet pattern), a threshold may safely be assumed. In the second case (doublet), the decision is less simple. Most of the time we choose to combine several patterns (as many as possible) and compute the combined cross-correlation. The more complete the pattern, the sharper the distribution. Signal-to-noise ratio is another factor that impacts the cross-correlation: for large signals, CC is usually sharper than for signals close to the noise.

The first approach used in this demo script consists of defining two thresholds that delimit three areas: green for spectra with a CC above both thresholds, orange for those between the two limits, and red for those below. This semaphore system allows the results to be checked quickly.
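The semaphore logic is straightforward to express in base R; the threshold values below are placeholders, not those used in the demo script:

```r
# Hedged sketch of the two-threshold "semaphore": green above both limits,
# orange between them, red below (threshold values are placeholders).
cc    <- c(0.95, 0.72, 0.40, 0.81, 0.15)  # toy cross-correlation scores
lower <- 0.5
upper <- 0.8

semaphore <- cut(cc, breaks = c(-Inf, lower, upper, Inf),
                 labels = c("red", "orange", "green"))
data.frame(cc, semaphore)
```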

Examples of figures are shown below.

As just mentioned, signal-to-noise affects the results. Close to the noise, CC becomes a less sharp statistic, while for larger signals it works nicely. This means that this approach has its own limit of detection, below which it is assumed that the feature is not found.

The second option makes use of the hastaLaVista package to make the visualization interactive, as shown here: