# Why R for Mass Spectrometrists and Computational Proteomics

[This article was first published on **Computational Proteomics** and kindly contributed to R-bloggers.]

It is now common practice to integrate statistical analysis of the resulting data, together with *in silico* predictions, into your manuscripts and daily research. Mass spectrometrists, biologists and bioinformaticians commonly use programs such as Excel, Calc or other office tools to generate their charts and statistical analyses. In recent years many computational biologists, especially those from the genomics field, have come to regard R and Bioconductor as fundamental tools for their research. R is a modern, functional programming language that allows for rapid development of ideas; it is a language and environment for statistical computing and graphics. Its rich set of built-in functions makes it ideal for high-volume analysis or statistical studies.

**Installing R on Windows or Linux:**

*Windows*: You can download the latest version from https://www.r-project.org/. You need to select a mirror; then, on the **base** page, you can select the latest release of R. The remaining steps are straightforward, as with any other Windows application.

*Linux*: You can download the latest precompiled release from the same page (https://www.r-project.org/) for SUSE, Debian, Ubuntu and Red Hat, as well as the source files in R-XXX.tar.gz.

If you have problems installing **R**, you can find some tips here: http://cran.r-project.org/doc/manuals/R-admin.html
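Once the installation finishes, a quick way to confirm that R works is to ask it for its own version from the R console. This is only a minimal sketch; the output depends on your machine and R release, so none is shown here.

```r
# Print the version string of the running R installation.
print(R.version.string)

# List the platform, locale and currently attached packages.
print(sessionInfo())
```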

**First MS Example in Three lines**:

*“I want to know the mass distribution of my identified peptides”*

First create a peptide-histogram.txt file containing the list of masses, one per line, as follows:

1392.6207

1576.7609

1809.956

1653.8549

1929.0003

Then run the following in R:

> peptides.txt <- read.table("peptide-histogram.txt", header=FALSE)

> peptides <- as.vector(peptides.txt$V1)

> hist(peptides, breaks=400)

**If you want to compute the mean of the masses, it’s simple:**

> mean(peptides)

[1] 1791.695

The hist() function can be customized with different options (remember that you can always see the help for any function using ?, for example ?hist):

http://msenux.redwoods.edu/math/R/hist.php

http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/hist.html
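As a rough illustration of those options, the sketch below inlines the five example masses listed earlier (so it runs without the text file) and sets explicit bin edges, a title, an axis label and a fill colour. The 100-unit bin width, the labels and the colour are arbitrary choices made for this example, not recommendations from the original analysis.

```r
# The five example masses from the text, inlined so no input file is needed.
peptides <- c(1392.6207, 1576.7609, 1809.956, 1653.8549, 1929.0003)

# Histogram with explicit bin edges every 100 mass units (an arbitrary choice),
# plus a title, an x-axis label and a fill colour.
h <- hist(peptides,
          breaks = seq(1300, 2000, by = 100),
          main   = "Mass distribution of identified peptides",
          xlab   = "Peptide mass",
          col    = "steelblue")

# hist() also returns the bin edges and counts, which can be inspected directly.
print(h$breaks)
print(h$counts)
```

Passing a vector of bin edges to `breaks` gives full control over the binning, whereas a single number (as in `breaks=400`) is only treated by R as a suggestion for the number of bins.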

One of the key advantages of working with R is the amount of data that you can analyze. Some desktop tools have row limits (for example, MS Excel before version 2007 is limited to 65,536 rows, and Excel 2007 to 1,048,576). Other reasons to consider R: (1) commercial software such as SPSS is expensive and not up to date; (2) public website services accept only a limited data volume; (3) self-written software is not an option: *“mass spectrometrists are not IT people”*.

**Generating the Venns for Search Engines (Mascot, XTandem, Sequest)**

*“I want a Venn diagram with the shared proteins identified with Sequest, XTandem and Mascot”*

Each file (mascot.txt, xtandem.txt, sequest.txt) is a list of protein IDs.

**Note: you can use the UniProt (www.uniprot.org) mapping service or PICR (http://www.ebi.ac.uk/Tools/picr/) to convert different protein IDs to a unique representation.**

> library(gplots)

> mascot.txt <- read.table("mascot.txt", header=FALSE)

> xtandem.txt <- read.table("xtandem.txt", header=FALSE)

> sequest.txt <- read.table("sequest.txt", header=FALSE)

> sequest <- as.vector(sequest.txt$V1)

> mascot <- as.vector(mascot.txt$V1)

> xtandem <- as.vector(xtandem.txt$V1)

> input <- list(Mascot=mascot, XTandem=xtandem, Sequest=sequest)

> venn(input)

Venn diagrams are provided by the **gplots** package, and they are really useful for showing all possible logical relations between a finite collection of sets.
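The numbers shown in each region of a Venn diagram can also be computed directly with R's base set functions, which is handy when you need the actual protein lists rather than just the counts. The sketch below uses short, made-up protein IDs (hypothetical placeholders, not real accessions) in place of the contents of the three files.

```r
# Hypothetical protein IDs standing in for the contents of the three files.
mascot  <- c("P1", "P2", "P3", "P4")
xtandem <- c("P2", "P3", "P5")
sequest <- c("P3", "P4", "P5", "P6")

# Proteins identified by all three search engines (centre of the diagram).
shared_all <- Reduce(intersect, list(mascot, xtandem, sequest))

# Proteins identified by Mascot only.
mascot_only <- setdiff(mascot, union(xtandem, sequest))

# Number of distinct proteins identified by at least one engine.
n_total <- length(Reduce(union, list(mascot, xtandem, sequest)))

print(shared_all)
print(mascot_only)
print(n_total)
```

With these placeholder IDs, only `P3` is shared by all three engines, `P1` is unique to Mascot, and six distinct proteins are identified in total.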

When I first read *“Five statistical things I wished I had been taught 20 years ago”* (Ewan Birney), the first thing I thought was: *“…which R packages would be useful for mass spectrometrists, as in the biologists' case?”*

- **ggplot2**, for data visualization, provides a set of functions to represent your data, such as the scatterplot function (Basic Introduction to ggplot2).
- The **caret** package (short for **C**lassification **A**nd **RE**gression **T**raining) is a set of functions that attempt to streamline the process of creating predictive models. It is a complete package for regression and classification techniques (caret).
- **FactoMineR** is an **R** package dedicated to multivariate exploratory data analysis. It performs classical methods such as Principal Component Analysis (PCA), Correspondence Analysis (CA) and Multiple Correspondence Analysis (MCA), as well as more advanced methods. A GUI is available (FactoMineR).
- **mzR** provides a unified API to the common file formats and parsers available for mass spectrometry data. It comes with a wrapper for the ISB random access parser for mass spectrometry mzXML, mzData and mzML files (mzR).
- **Bioconductor** provides tools for the analysis and comprehension of high-throughput biology data. Bioconductor has two releases each year, 554 software packages, and an active user community (bioconductor).
- **msProcess** provides tools for protein mass spectra processing, including data preparation, denoising, noise estimation, baseline correction, intensity normalization, peak detection, peak alignment, peak quantification, and various functionalities for data ingestion/conversion, mass calibration, data quality assessment, and protein mass spectra simulation (msProcess).
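To give a flavour of the multivariate methods mentioned in the list, here is a minimal PCA sketch. It deliberately uses base R's `prcomp()` rather than FactoMineR, so that it runs on a fresh installation with no extra packages, and it uses the built-in `iris` dataset purely as a stand-in for a real samples-by-features table such as protein intensities.

```r
# Built-in example data: 150 samples, 4 numeric measurements, 1 grouping column.
data(iris)
measurements <- iris[, 1:4]

# PCA on centred and scaled variables, using base R's prcomp()
# (a stand-in here for a dedicated package such as FactoMineR).
pca <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Scatterplot of the samples on the first two components, coloured by group.
plot(pca$x[, 1], pca$x[, 2],
     col  = as.integer(iris$Species),
     xlab = "PC1", ylab = "PC2",
     main = "PCA of the iris measurements")
```

Scaling the variables before the PCA is a common choice when the measurements are on different scales; whether it is appropriate for your own data is a judgment call.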

**R is the leading tool for statistics, data analysis, and machine learning in the research community. It is time to begin!**

*We can provide our scripts to the community along with our manuscripts and papers, which means that others can check the statistical analysis and the results.*

Some references:

- Statistics Using R with Biological Examples (http://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf)
- Biological Data Analysis Using R (http://dyerlab.bio.vcu.edu/downloads/Dyer_Data_Analysis_Using_R.pdf)
- R-bloggers (http://www.r-bloggers.com/)
