GenABEL: an annoying error after the import of PLINK data format

[This article was first published on Milano R net, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In the previous post we saw how much convenient could be GenABEL in the management of genotypic/phenotypic data.
We introduced the import of genotypic data from an Illumina format file:
> convert.snp.illumina(inf = "gen.illu", out = "gen.raw", strand = "file")
but what happens if you’re analysing your data with PLINK, the open source toolset for GWAS?
GenABEL provides an useful function:

> convert.snp.tped(tped = "dataset.tped", tfam="dataset.tfam", out = "gen.raw", strand = "+")  

This function allows you to import the transposed-ped format file and the tfam format file and it permits to convert them into the .raw file which contains the genotypic raw data readable by load.gwaa.data function.
The conversion is performed by C++ code that is both fast and memory efficient.

It’s all great. But..

when you use the function:

> df <- load.gwaa.data(phe = phe.txt", gen = "gen.raw")

you could receive this warning:

person with id = a123456 was not found in genotypic file; excluded

Good! In your phenotypic dataset there are more person than in the genotypic dataset, GenABEL understands that ‘mistake’ and it doesn’t load that person.

What happens you are adding a person who is present in the genotypic file and not in the phenotypic file?

person with id = 654321a was not found in phenotypic file!!! – FATAL

Bad. You’ll get an error!

So you need to fix your dataset: you have to delete persons, which aren’t present in phenotypic data, from your genotypic dataset. This kind of problem could happen when you receive the datasets from another person (it’s my case).
You could edit the .tfam o .tped files. You should check the persons present in your phenotypic file, then you have to take off the rows in the .tfam file corresponding to those persons and the corresponding columns in the .tped file.

There could be another solution:

> convert.snp.tped(tped = "genabelGWASALL.tped", tfam="genabelGWASALL.tfam", out = "genot.raw", strand = "+") > file_tfam <- read.table(file="genabelGWASALL.tfam", sep=' ', head=FALSE) # genotypes > file_phe <- read.table(file="phe.txt", sep=' ', head=TRUE) # phenotypes > ot_id ← setdiff(file_tfam[,2], file_phe$id) # ids present only in tfam (and tped) file > lot_id ← length(ot_id) > fakepheno <- data.frame(id = ot_id, sex=rep('1', lot_id), phen = rep('NA', lot_id)) > names(fakepheno) <- names(file_phe) > fennew <- rbind(file_phe, fakepheno)

Now you can create the new phenotypic file:

> write.table(fennew, file="newpheno.txt", row.names=FALSE, col.names=TRUE, sep=' ')

Finally you can use load.gwaa.data function and move on with your analysis!

> newdataset <- load.gwaa.data(phe = "newpheno.txt", gen = "genot.raw",force = T)

Remenber to delete the fake phenotypes:

> newdataset1 <- newdataset[newdataset@phdata$id %in% file_phe$id,]

To leave a comment for the author, please follow the link and comment on their blog: Milano R net.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)