An easy way to manage your genome-wide association data: the GenABEL package

June 12, 2012

(This article was first published on Milano R net, and kindly contributed to R-bloggers)

Here is a brief overview of the GenABEL library, developed by Yurii Aulchenko.

GenABEL is a full-featured R library for dealing with Genome-Wide Association analysis of binary and quantitative traits.

Compared with the ‘genetics’ package and many other tools, GenABEL provides specific features for storing and manipulating large amounts of data, tests for GWA analysis, and functions for estimating the kinship matrix from a dense marker panel.

Perhaps the most useful feature of GenABEL is its special data class, gwaa.data. An object of this class stores GWA data efficiently and makes it simple to retrieve the information in your dataset.

At the first level, a gwaa.data object has the phdata slot, which can be accessed with object@phdata (where object is the name of your data object) and which contains all phenotypic information in a data frame (data.frame-class object). The rows of this data frame correspond to study subjects, and the columns correspond to the variables/phenotypes. There are two default variables that are always present in phdata: the first is "id", which contains the study subject identification code and must be unique; the second is "sex", a dummy variable indicating the sex.
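As a minimal sketch of what the phenotype slot looks like, the snippet below inspects srdta, the small example data set that ships with GenABEL (it is not part of this article and is used here only for illustration). The requireNamespace guard is an assumption about your setup: it simply skips the example when the package is not installed.

```r
# Sketch only: the GenABEL part runs only if the package is installed.
have_genabel <- requireNamespace("GenABEL", quietly = TRUE)

if (have_genabel) {
  library(GenABEL)
  data(srdta)                          # example gwaa.data object bundled with GenABEL

  ph <- srdta@phdata                   # phenotypic data: an ordinary data frame
  print(class(ph))
  print(head(ph[, c("id", "sex")]))    # the two variables that are always present
  stopifnot(!any(duplicated(ph$id)))   # subject ids must be unique
}
```

Because phdata is a plain data frame, the usual tools (summary, subsetting with [ ], and so on) work on it directly.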

If you want to add phenotypes from another data frame to an existing phdata, use the special GenABEL function add.phdata. This function adds the variables contained in a data frame to the existing phdata slot. The data frame to be added should contain an "id" variable identical to the one in the object.
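A hedged sketch of add.phdata, again on the bundled srdta example data. The extra data frame and its score column are hypothetical and filled with random numbers purely for illustration.

```r
# Sketch only: the GenABEL part runs only if the package is installed.
have_genabel <- requireNamespace("GenABEL", quietly = TRUE)

if (have_genabel) {
  library(GenABEL)
  data(srdta)

  ids <- as.character(srdta@phdata$id)

  # Hypothetical new phenotype; the "id" column must match the ids in the object.
  set.seed(1)
  extra <- data.frame(id = ids,
                      score = rnorm(length(ids)),
                      stringsAsFactors = FALSE)

  srdta2 <- add.phdata(srdta, extra)   # returns a new object with the extra column
  print(names(srdta2@phdata))
}
```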

The other slot of a gwaa.data object is gtdata, which contains all genetic data in an object of class snp.data. This class, in turn, has slots containing the number of study subjects, the ID names of these subjects, the number of SNPs typed, the SNP names, the chromosome each SNP belongs to, the map positions of the SNPs, strand information and the sex code for the subjects. The latter is identical to the "sex" variable contained in phdata.
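The slots listed above can be inspected one by one. The sketch below does so on the bundled srdta example data; it assumes GenABEL is installed and is skipped otherwise.

```r
# Sketch only: the GenABEL part runs only if the package is installed.
have_genabel <- requireNamespace("GenABEL", quietly = TRUE)

if (have_genabel) {
  library(GenABEL)
  data(srdta)

  gt <- srdta@gtdata              # genetic data of the example object
  print(class(gt))

  print(gt@nids)                  # number of study subjects
  print(head(gt@idnames))         # subject ids
  print(gt@nsnps)                 # number of SNPs typed
  print(head(gt@snpnames))        # SNP names
  print(table(gt@chromosome))     # chromosome of each SNP
  print(head(gt@map))             # map positions
  print(head(gt@strand))          # strand information

  # The sex code stored with the genotypes should agree with phdata.
  print(all(gt@male == srdta@phdata$sex))
}
```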

To import data to GenABEL, you need to prepare two files: one containing the phenotypic data, and another one containing genotypic data.

  • Phenotype file: the first column must contain the subjects’ unique ID. The IDs listed here and in the genotypic data file must be the same. The second column must contain the sex information and other columns in the file should contain phenotypic information. The names of the first two columns must be ‘id’ and ‘sex’.
  • Genotype file: information on chromosome, map position and strand should be provided for every SNP, and the genotype of every SNP has to be given for every study subject.
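The phenotype file described above can be sketched with base R alone. Everything in this example is made up for illustration: the five subjects, the age and bmi columns, the 1 = male / 0 = female coding and the tab separator are all assumptions, and the file is written to a temporary directory.

```r
# Hypothetical phenotypes for five subjects; "id" and "sex" must come first.
pheno <- data.frame(
  id  = c("p1", "p2", "p3", "p4", "p5"),
  sex = c(1, 0, 1, 0, 0),              # assumed coding: 1 = male, 0 = female
  age = c(34, 51, 47, 29, 62),
  bmi = c(22.4, 27.1, 24.8, 21.9, 30.3),
  stringsAsFactors = FALSE
)

# Basic checks that mirror the requirements listed above.
stopifnot(identical(names(pheno)[1:2], c("id", "sex")),
          !any(duplicated(pheno$id)))

# Write a plain text file with a header line and no row names.
pheno_file <- file.path(tempdir(), "phe.txt")
write.table(pheno, pheno_file, quote = FALSE, row.names = FALSE, sep = "\t")
```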

GenABEL has a number of functions to convert these datasets from different formats to the internal GenABEL raw format. One of those formats is the Illumina format. To be clear, the "illumina" format is just one of the possible text output formats of Illumina BeadStudio; similar formats are generated by HapMap and Affymetrix. A file in the "illumina" format contains SNPs in rows and IDs in columns, and the first four columns should contain the SNP name, chromosome, position and strand. Each of the remaining columns corresponds to an individual, with the ID as the column name; the elements of these columns are the genotypes.

Since this file contains all the required genotypic information, you can now convert the data to GenABEL raw format with the conversion command:
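To make the layout concrete, here is a base R sketch that writes a tiny file with that structure: four annotation columns followed by one column per subject. The SNP names, positions, subject ids and the two-letter genotype coding are all invented for illustration; check the GenABEL documentation for the exact genotype and missing-value coding the converter expects.

```r
# Hypothetical data: three SNPs (rows) typed in five subjects (columns p1..p5).
geno <- data.frame(
  name   = c("rs1", "rs2", "rs3"),
  chr    = c(1, 1, 2),
  pos    = c(1200, 5300, 800),
  strand = c("+", "+", "-"),
  p1     = c("AA", "CT", "GG"),
  p2     = c("AG", "CC", "GT"),
  p3     = c("GG", "CT", "TT"),
  p4     = c("AA", "TT", "GG"),
  p5     = c("AG", "CC", "GT"),
  stringsAsFactors = FALSE
)

# The first four columns carry the SNP annotation, the rest are subject ids.
stopifnot(identical(names(geno)[1:4], c("name", "chr", "pos", "strand")))

geno_file <- file.path(tempdir(), "gen.illu")
write.table(geno, geno_file, quote = FALSE, row.names = FALSE, sep = "\t")
```

The subject ids in the header must be the same ids that appear in the phenotype file.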

> convert.snp.illumina(inf = "gen.illu", out = "gen.raw", strand = "file")

The option strand = "file" indicates that the strand information is provided in the file.

Finally, you can load the data into GenABEL by typing

> dataset <- load.gwaa.data(phe = "phe.txt", gen = "gen.raw")
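Putting the two steps together, a guarded sketch of the import might look like the following. It assumes that GenABEL is installed and that gen.illu and phe.txt exist in the working directory; if either assumption fails, the block only prints a message. The full argument names (infile, outfile, phenofile, genofile) are used here in place of the abbreviated ones.

```r
# Sketch only: run the import when both the package and the input files exist.
have_genabel  <- requireNamespace("GenABEL", quietly = TRUE)
files_present <- file.exists("gen.illu") && file.exists("phe.txt")

if (have_genabel && files_present) {
  library(GenABEL)

  # Step 1: convert the Illumina-style text file to GenABEL raw format.
  convert.snp.illumina(infile = "gen.illu", outfile = "gen.raw", strand = "file")

  # Step 2: combine the phenotype file and the raw genotype file in one object.
  dataset <- load.gwaa.data(phenofile = "phe.txt", genofile = "gen.raw")

  # Quick sanity checks on what was loaded.
  print(dataset@gtdata@nids)     # number of subjects
  print(dataset@gtdata@nsnps)    # number of SNPs
  print(head(dataset@phdata))    # first rows of the phenotypes
} else {
  message("Skipping: GenABEL or the input files are not available.")
}
```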

Now you can start with the analyses!
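As a hedged illustration of a first step, the sketch below runs a basic marker and sample quality control followed by a quick score-test scan on the bundled srdta example data. The thresholds are arbitrary example values, not recommendations, and qt1 is a trait that comes with srdta, not with the files described in this article.

```r
# Sketch only: the GenABEL part runs only if the package is installed.
have_genabel <- requireNamespace("GenABEL", quietly = TRUE)

if (have_genabel) {
  library(GenABEL)
  data(srdta)

  # Quality control: call rate per SNP and per subject, minor allele frequency.
  # p.level = 0 switches off the Hardy-Weinberg filter in this example.
  qc <- check.marker(srdta, callrate = 0.95, perid.call = 0.95,
                     maf = 0.01, p.level = 0)

  # Keep only the subjects and SNPs that passed.
  clean <- srdta[qc$idok, qc$snpok]

  # Fast score test for a quantitative trait, adjusted for sex.
  scan <- qtscore(qt1 ~ sex, data = clean, trait.type = "gaussian")
  print(descriptives.scan(scan, top = 5))
}
```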

This is only a brief introduction to the package. In my opinion there are many different methods (parametric and non-parametric) suitable for conducting a genome-wide association study, but the GenABEL package can be a fundamental help with the management and quality control of your dataset.

P.S. On the GenABEL website you will find the documentation, tutorials and a forum where you can find answers to your questions.
