Mapping SNPs to Genes for GWAS Enrichment Analysis

June 30, 2011

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

There are several tools available for conducting a post-hoc analysis of GWAS data looking for enrichment of significant SNPs using literature or pathway based resources. Examples include GRAIL, ALLIGATOR, and WebGestalt among others (see SNPath R Package). Since gene enrichment and pathway analysis essentially evolved from methods for analyzing gene expression data, many of these tools require specific gene identifiers as input.

To prep GWAS data for analyses such as these, you’ll need to ensure that your SNPs have current map position information, and assign each SNP to genes. To update your map information, the easiest resource to use is the UCSC Tables browser.

To use UCSC, go to the TABLES link on the browser page. Select the current assembly, “Variation and Repeats” for group, “Common SNPs(132)” for the track, and “snp132Commmon” for the table. Now you’ll need to upload your SNP list. You can manually edit the map file using a tool like Excel or Linux commands like awk to extract the rs_ids for your SNPs into a text file. Upload this file by clicking the “Upload List” button in the “identifiers” section. Once this file has uploaded, select “BED – browser extensible data” for the output format and click “Get Output”. The interface will ask you if you want to add upstream or downstream regions — for SNPs this makes little sense, so leave these at zero. Also, ensure that “Send to Galaxy” is not checked.

This will export a BED file containing the positions of all your SNPs in the build you selected, preferably the most current one — let’s call it SNPs.bed. BED files were developed by the folks behind the UCSC browser to document genomic annotations. The basic format is:


Because BED files were designed to list genomic ranges, SNPs are listed with a range of 1 BP, meaning the StartPosition is the location of the SNP and the EndPosition is the location of the SNP + 1.

The next step is to map these over to genes. To save a lot of effort resolving identifiers (which is no fun, I promise), I have generated a list of ~17,000 genes that are consistently annotated across Ensembl and Entrez-gene databases, and which have HUGO identifiers. These files are available here:

Hugo Genes
Entrez Genes
Ensembl Genes

The genes listed in these files were generated by comparing the cross-references between the Ensembl and Entrez-gene databases. To arrive at a set of “consensus” genes, I selected only genes where Ensembl refers to an Entrez-gene with the same coordinates, and that Entrez-gene entry refers back to the same Ensembl gene. Nearly all cases of inconsistent cross-referencing are genes annotated based on electronic predictions, or genes with multiple or inconsistent mappings. For the purpose of doing enrichment analysis, these would be ignored anyway, as they aren’t likely to show up in pathway or functional databases. For these genes, we then obtained the HUGO approved gene identifier. The coordinates for all genes are mapped using hg19/GRChB37.

BED files are plain text and easy to modify. If you wanted to add 10KB to the gene bounds, for example, you could alter these files using excel formulas, or with a simple awk command:

awk ‘{printf(“%d\t%d\t%d\t%s\n”, $1, $2-10000, $3+10000, $4); }’ ensemblgenes.bed

You’ll need to download a collection of utilities called BEDTools. If you have access to a Linux system, this should be fairly straightforward. These utilities are already installed on Vanderbilt’s Linux application servers, but to compile the program yourself, download the .tar.gz file above, and use the following commands to compile:

tar -zxvf BEDTools..tar.gz
cd BEDTools
make clean
make all
ls bin

# copy the binaries to a directory in your PATH. e.g.,
sudo cp bin/* /usr/local/bin

Once you get BEDTools installed, there is a utility called intersectBed. This command allows you to examine the overlap of two BED files. Since we have our collection of SNPs and our collections of Genes in BED file format already, all we need to do is run the command.

intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb

This will print to the screen any SNP that falls within an EntrezGene boundary, along with the coordinates from each entry in each file. To strip this down to only RS numbers and gene identifiers, we can pipe this through awk.

intersectBed -a SNPs.bed -b entrezgenes.bed -wa -wb | awk ‘{printf(“%s\t%s\n”, $4, $8);}’

This will produce a tab-delimited file with SNPs that fall within genes identified by EntrezGene ID number. For HUGO names or Ensembl gene IDs, use the corresponding BED file.

To leave a comment for the author, please follow the link and comment on their blog: Getting Genetics Done. offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Tags: , ,

Comments are closed.


Mango solutions

RStudio homepage

Zero Inflated Models and Generalized Linear Mixed Models with R

Dommino data lab

Quantide: statistical consulting and training



CRC R books series

Six Sigma Online Training

Contact us if you wish to help support R-bloggers, and place your banner here.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)