Manhattan plots for SNP marker effects using ggplot2

June 17, 2010
By

(This article was first published on John B. Cole's Big Blog O' Science » John B. Cole's Big Blog O' Science, and kindly contributed to R-bloggers)

At AIPL, we've been posting Manhattan plots of the marker effects for each breed-trait combination with each official release of our genomic predictions. For example, consider the plot of lifetime net merit for Holsteins from the April, 2010 run: Marker effects for lifetime net merit of Holsteins from the April, 2010 U.S. genetic evaluation.These plots are produced automatically using Python and matplotlib. The scripts run fairly quickly and they've proven to be fairly dependable. I recently had a request from a colleague to produce similar plots for data produced as part of an ongoing research project here in the lab. I thought that this would be a great time to try making similar plots using the very cool ggplot2 library for R. I got 90% of the way to the result I wanted very quickly, which was great. Then I ran into problems. Fortunately, I found a straightforward solution that I will discuss below. As you can see from the plot above, the data on the x-axis consist of markers ordered by their location in the genome. Markers assigned to the same chromosome are shaded the same color, and I'm using a simple seven-color palette that repeats. I was able to produce a graphic that had the correct shading but an x-axis labeled with marker locations, not chromosome numbers. All I needed to do was replace the x-axis labels, or scale in the vocabulary used by ggplot2, with the correct labels. Fortunately, I had seen a final product similar to what I wanted over at Getting Genomics Done in the post GWAS Manhattan plots and QQ plots using ggplot2 in R. The solution was very simple: I had to provide a list of breaks (tick mark locations) and labels to scale_x_continuous() which I then overlaid on the graph. Here are the results: Unadjusted marker effects for Holstein milk yield I used the aggregate() function to find the locations of the first and last marker on each chromosome. The breaks were calculated as: first SNP location + floor( ( last SNP location - first SNP location ) / 2 ), and the chromosome numbers were used as the corresponding labels. (I know that cattle have only 30 chromosomes; we use a numbering scheme internally which distinguishes between SNP in the pseudoautosomal and non-pseudoautosomal regions of the X chromosome, which is why there's a 30 between autosome 29 and the X. The label "PAR" is too big to fit in the space available.) Here's the code that produced the second plot:

p_milk <- ggplot(data = yld, aes(snp_id, milk, color = chrome)) +
        geom_point() +
        scale_colour_manual(value = colours) +
        opts(legend.position = "none") +
        scale_x_continuous(name = "Chromosome", breaks = unique(yld$tick), labels = unique(yld$label)) +
        ylab("Marker Effect (pounds)") +
        ylim(0, 35) +
        opts(title = "Distribution of adjusted marker effects for HO milk")

Of particular interest is the line scale_x_continuous(name = "Chromosome", breaks = unique(yld$tick), labels = unique(yld$label)), which overlays the new scale with the chromosome numbers on it. The vector yld$tick includes the breaks calculated using the formula mentioned above, and yld$label is a vector of strings used to label the breaks. I also ran into a minor issue while trying to use scale_colour_brewer() to shade the points on each chromosome -- the number of chromosomes (31) exceeded the maximum number of levels that can scale_colour_brewer() accommodate. I still wanted to use those colors, so I manually pulled the first 7 colors from the Brewer "Set1" palette, put them in a list, and handed them to scale_colour_manual(). Worked like a charm.

library("RColorBrewer")
colours <- rep(c(brewer.pal(n = 7, name = "Set1")),5)

In addition to looking at the examples and code on Getting Genomics Done I also made heavy use of Hadley Wickham's excellent book "ggplot2: Elegant Graphics for Data Analysis". (Note: I copied the link from the author's website, so I suppose there could be an affiliate code in there.) The conceptual framework of building graphics in layers is both easy to understand and powerful in application. I look forward to making more use of ggplot2 in the future.

To leave a comment for the author, please follow the link and comment on his blog: John B. Cole's Big Blog O' Science » John B. Cole's Big Blog O' Science.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...



If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

Comments are closed.