Using R: accessing PANTHER classifications

Posted on February 10, 2013 by mrtnj in R bloggers | 0 Comments

[This article was first published on There is grandeur in this view of life » R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Importing, subsetting, merging and exporting various text files with annotation (in the wide sense, i.e. anything that might help when interpreting your experiment) is not computation and it’s not biology either, but it’s housekeeping that needs to be done. Everyone has a weapon of choice for general-purpose scripting and mine is R. Yes, this is ”quite trivial if you only think it out”, but it still takes me some time. This script is a work in progress and could certainly be much cleaner and friendlier, but I’ll post it for the benefit of any other user that might google for this kind of thing.

Panther is a protein database with phylogenetic trees of proteins annotated with Gene Ontology terms and organised into pathways. (See the paper in the 2013 database issue of NAR.) Right now, I’m after pathway classification of chicken proteins. The pathways are also available in some xml formats for systems biologists, but I’m going to use the classification table. It contains all UniProtKB proteins, so it should cover most known genes products.

Note, if you just want to check up on a few genes or a few pathways the Panther web interface seems pretty nice. If you’re after nice pathway diagrams, also check out the website and the SBML format. But for accessing classifications en masse, a treatment in R may be useful.

For this post, I’ve broken down the script into parts. If you want a function for the whole thing, see github.

First, download the classification data for your favourite organism from the Panther ftp.

  panther <- read.delim(filename, sep="\t", head=F,
                        stringsAsFactors=F)

  colnames(panther)[1] <- "gene.id"
  colnames(panther)[3:5] <- c("panther.id", "panther.family",
                            "panther.subfamily")
  colnames(panther)[6:8] <- c("go.mf", "go.bp", "go.cc")
  colnames(panther)[9:10] <- c("panther.ontology", "panther.pathway")

This is a tab separated text file. Since some fields at the end of lines may be left empty, we use read.delim instead of read.table. I add som column names that I think agrees reasonably well with the readme. The first is a gene id string that contains mapping to Ensembl or Entrez ids, as well as the uniprot accession. It looks something like this:

CHICK|Gene=ENSGALG00000013995|UniProtKB=F1P447

CHICK|ENTREZ=378784|UniProtKB=Q75XU5

Of course, we want the ids as separate columns: stringr does regular expressions in a vectorised manner. str_match gets grouped matches into arrays. (Yes, it can be done with a single regexp, but I don’t take pleasure in that kind of thing.)

  panther$ensembl.id <- str_match(panther$gene, "Gene=(.*)\\|")[,2]
  panther$uniprot.id <- str_match(panther$gene, "UniProtKB=(.*)$")[,2]
  panther$entrez.id <- str_match(panther$gene, "ENTREZ=(.*)\\|")[,2]

The gene ontology fields are terms from each ontology, stringed together and separated by semicolons. This is not the best format for querying, so I’ll make a list of GO ids from each ontology:

  parse.go <- function(go.column) {
    go.list <- str_match_all(go.column, "GO:([0-9]*)")
    names(go.list) <- panther$gene.id.string
    go.list <- llply(go.list, function(x) {if (!is.null(dim(x))) x[,1]})
    return(go.list)
  }
  go.mf <- parse.go(panther$go.mf)
  go.bp <- parse.go(panther$go.bp)
  go.cc <- parse.go(panther$go.cc)

Finally, the pathway column. It says this in the readme:

***Example Pathway:
Inflammation mediated by chemokine and cytokine signaling pathway#Inflammation mediated by chemokine and cytokine signaling pathway#P00031>Integrin#Integrin#P00853;Integrin signalling pathway#Integrin signalling pathway#P00034>Integrin alpha#Integrin alpha#P00941

The format of the pathway information is: pathway_long_name#pathway_short_name#pathway_accession>
component_long_name#component_short_name#component_accession

Explanation of pathway accessions:
Gxxxxx Gene or RNA
Pxxxxx Protein
Sxxxxx small molecules
Uxxxxx others, such as ”unknown”, etc.

For now, I’d like to keep just the pathway id, which is the first id of each pathway entry. Again, there are often multiple entries for the same gene.

  pathway.list <- str_match_all(panther$panther.pathway,
                                "#([G,P,S,U][0-9]*)>")
  names(pathway.list) <- panther$gene.id.string
  pathway.list <- llply(pathway.list,
                        function(x) {if (!is.null(dim(x))) x[,2]})

We might want to have these additional descriptions (names of GO terms, name of pathway) later, but for now, I’ll be satisfied with the unique ids. Let’s package everything in a list, and slap a class attribute on it for good measure.

  panther.classification <- list(data=panther,
                                 go.mf=go.mf,
                                 go.bp=go.bp,
                                 go.cc=go.cc,
                                 panther.pathway=pathway.list)
  class(panther.classification) <- "panther.classification"

Now we can get the pathway ids associated with our favourite gene. Take for example GNB3 which has Uniprot id E1C9D6.

get.pathways <- function(panther, acc) {
  
  ix.uni <- which(panther$data$uniprot.id == acc)
  ix.ens <- which(panther$data$ensembl.id == acc)
  ix.ent <- which(panther$data$entrez.id == acc)
  
  if (length(ix.uni) > 0) {
    ix <- ix.uni
  } else if (length(ix.ens > 0)) {
    ix <- ix.ens
  } else if (length(ix.ent > 0)){
    ix <- ix.ent
  } 
  panther$panther.pathway[[ix]]
  
}
get.pathways(panther.classification, "E1C9D6")

I'll get back to this to maybe do something more useful with it.

Postat i:computer stuff, english, genetik Tagged: PANTHER, pathway, R