Impute2 crashes when using 23andme file converted to Oxford format with Plink 1.9


J1

Apr 29, 2014, 11:37:42 PM
to plink2...@googlegroups.com
Hello.

I have been trying to use Impute2 with the 1000 Genomes dataset to impute my 23andme results on my PC.
The examples included with Impute2 ran without any problems, for example when using only the g and m files.


I then used Plink 1.9 to convert my 23andme file to Oxford format (attached file Oxford Conversion 1). {Was not sure how to use --list-23-indels in Plink 1.9.} 


The resulting output file (attached Oxford Convert 1 {portion of file}) appears to be in Oxford format.

However, there are 3 concerns.

1) The first column in this file is listed as the chromosome though the Impute2 example files
use SNP_IDs. This might be unimportant.

2) Homozygous alleles report a "0" for the missing allele. This might also be unimportant.

3) It would be helpful if the 23andme file converted to Oxford format could be restricted to a single chromosome or region of a single chromosome.
The 1000 Genomes files are huge and it would be computationally more manageable to have Plink format a smaller region.
(This way the .gen and .sample files would both be present. When the large conversion file was pasted into an rtf file the sample portion of the file did not
carry over.) Is there a way to do this with Plink 1.9?
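
For readers comparing the two files by eye, here is a minimal Python sketch of how one line of an Oxford .gen file breaks down: five leading columns (SNP ID, rsID, position, allele A, allele B) followed by three probabilities per individual. The example line and the function name `parse_gen_line` are hypothetical, made up for illustration; they are not taken from the attached files.

```python
def parse_gen_line(line):
    """Split one .gen line into its five leading columns and per-sample triplets."""
    fields = line.split()
    snp_id, rs_id, position, allele_a, allele_b = fields[:5]
    probs = [float(value) for value in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("genotype probabilities must come in triplets")
    triplets = [tuple(probs[i:i + 3]) for i in range(0, len(probs), 3)]
    return {
        "snp_id": snp_id,            # a chromosome code in the converted file
        "rs_id": rs_id,
        "position": int(position),
        "allele_a": allele_a,
        "allele_b": allele_b,
        "triplets": triplets,
        # True when one allele is the "0" placeholder described in concern 2
        "has_missing_allele": "0" in (allele_a, allele_b),
    }

# Hypothetical line: chromosome 15 in column 1, homozygous call, second allele "0".
record = parse_gen_line("15 rs0000001 20400123 A 0 1 0 0")
print(record["snp_id"], record["has_missing_allele"], record["triplets"])
```

This only describes the layout; it says nothing about whether Impute2 accepts a "0" allele or a chromosome code in the first column.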


Then, when I ran the full 23andme converted file as the g file (which I called GLox1.gen [the file Oxford Conversion 1 is only a sample from this file])
and the 1000 Genomes chromosome 15 genetic map as the m file, Impute2 crashed after reaching the first MCMC iteration (attached file Impute2 run1).


Some of the files from the 1000 Genomes dataset are very large. So, I copied chromosome 15 from the full 23andme file converted to Oxford format using Plink 1.9 and made an rtf file for a more manageable analysis. (file called GLox15.gen.rtf.). When I ran this file with the 1000 genomes genetic map for chromosome 15, Impute2 detected 0 individuals (attached file Impute2 run2). (The reason why 0 individuals were detected might be because the sample file did not carry over when I cut and pasted into GLox15.gen.rtf.) Impute2 excluded all 9 of the SNPs that it read in from chromosome 15 (attached Impute2 run2 warnings.)

There is probably a simple formatting issue involved, as the Impute2 examples ran without complication. This has been very frustrating. Any help would be greatly appreciated.      




Oxford Conversion 1.PNG
Oxford Convert 1.txt
Impute2 run2.PNG
Impute2 run2 warnings.PNG
Impute2 run1.png

Christopher Chang

Apr 30, 2014, 12:47:30 AM
to plink2...@googlegroups.com
* The Oxford file format page (http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html#Genotype_File_Format ) explicitly states that the SNP ID column can be used for chromosome numbers. However, you're correct that in practice chromosome splitting is usually a good idea when working with their tools. If it's important, I can add an "oxford-1chr" --recode mode which requires only a single chromosome to be present, and repeats the rsID in the SNP ID column instead.
* To only recode a single chromosome, add the "--chr" flag, e.g. "--chr 15" if you just want a file with chromosome 15.
* I'll try to figure out what's causing the Impute2 crashes today.  In the meantime, you can try just using PLINK to convert to .bed and following up with GTOOL for the .bed -> .gen conversion; let me know whether or not Impute2 still has a problem there.
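
The single-chromosome conversion suggested above can be sketched as a small Python helper that only builds the command line. It assumes the executable is called `plink`, that the raw 23andMe file was loaded with `--23file`, and that the Oxford export is `--recode oxford`; the file name `genome.txt` and output prefix `chr15` are hypothetical placeholders. Nothing is executed here.

```python
def plink_oxford_command(genome_file, chromosome, out_prefix):
    """Build (but do not run) a PLINK 1.9 command for a one-chromosome Oxford export."""
    return [
        "plink",                    # assumed executable name
        "--23file", genome_file,    # assumed: raw 23andMe text file as input
        "--chr", str(chromosome),   # keep only this chromosome, as suggested above
        "--recode", "oxford",       # write .gen + .sample
        "--out", out_prefix,
    ]

cmd = plink_oxford_command("genome.txt", 15, "chr15")
print(" ".join(cmd))
```

The printed line can be pasted into a terminal, or the list passed to `subprocess.run` once the paths are real.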

Christopher Chang

Apr 30, 2014, 1:03:44 AM
to plink2...@googlegroups.com
I'm guessing that rtf is sufficiently different from plain text that Impute2 cannot read that format.  (I doubt PLINK can read it.)  You would need to save as plain text.
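
One quick way to tell whether a file was saved as RTF rather than plain text is to look at its first bytes, since RTF documents begin with the signature `{\rtf`. The Python sketch below assumes nothing about Impute2 itself; the file names and contents are hypothetical and are written to a temporary folder only to demonstrate the check.

```python
import os
import tempfile

def looks_like_rtf(path):
    """Return True if the file begins with the RTF signature '{\\rtf'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"{\\rtf"

with tempfile.TemporaryDirectory() as folder:
    rtf_path = os.path.join(folder, "GLox15.gen.rtf")
    txt_path = os.path.join(folder, "GLox15.gen.txt")
    # Hypothetical contents: the same genotype line saved two ways.
    with open(rtf_path, "wb") as handle:
        handle.write(b"{\\rtf1\\ansi 15 rs0000001 20400123 A G 1 0 0}")
    with open(txt_path, "wb") as handle:
        handle.write(b"15 rs0000001 20400123 A G 1 0 0\n")
    rtf_flag = looks_like_rtf(rtf_path)
    txt_flag = looks_like_rtf(txt_path)

print(rtf_flag, txt_flag)
```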

As for run1, I can make a better guess as to what's going on after I know whether using GTOOL for the .bed -> .gen conversion fixes the problem (and, if it does, what the difference is between the .gen files).



J1

May 1, 2014, 2:27:27 AM
to plink2...@googlegroups.com
Thank you very much for your quick reply.

The example g file from Impute2 and the g file resulting from converting the 23andme file using Plink were very similar. The snpID and the homozygous calls were the only obvious differences between them. There does not appear to be any other difference between the g files, yet Impute2 only works with its examples.



I recoded the chromosome with Plink as you suggested and obtained bed, bim and fam file types. I am not sure about installing gtool on a Windows computer.

I downloaded gtool. When I unzipped it, the file was not executable. I am not really clear how to work with Unix in Windows. I have Cygwin64 installed. It would be great if I could follow your suggestion and do the .bed -> .gen conversion. Impute2 might then run properly.

It would be very helpful if Impute2 had a method for splitting the files (especially the haplotype files). The genetic files (for example 1000 Genomes) are starting to become unmanageable. Uncompressed haplotype files from the latest 1000 Genomes download for single chromosomes average 3 GB. The Impute2 listserv is reporting that a new genome release will soon have a dataset of 20,000 people. I am very uncertain how to split the 1000 Genomes Haplotypes files into more manageable files, as these files are composed entirely of 0s and 1s. These files should be formatted. I am not clear whether my computer will be able to run Impute2 with the 3 GB haplotype files.

 
I reran the chr15 data using a .txt g file (instead of an .rtf file). Impute2 made it to the first MCMC iteration and then crashed (see attached Impute2 Run 4). This was the same result as with Impute2 run1 from the previous post.

(I am worried that in run 1, the m file was for chromosome 15, though the g file was for the whole genome. On the command line, -int 20.4e6 20.5e6 might be ambiguous for Impute2, as this interval could apply to all of the chromosomes. Yet, the program might not be confused by this. In Impute2 Run 4, the g file was only for chromosome 15, thus there would not be such ambiguity.)
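
The ambiguity described above can be illustrated with a short Python sketch. It filters .gen lines purely by the position in the third column, which is how a position-only interval would behave on a whole-genome file; this is an illustration of the concern, not a statement about how Impute2 handles -int internally. The three lines are hypothetical, and treating the bounds as inclusive is an assumption.

```python
def in_interval(gen_line, lower, upper):
    """True if the position (third column) lies within [lower, upper]."""
    position = int(gen_line.split()[2])
    return lower <= position <= upper

# Hypothetical whole-genome .gen lines with the chromosome in column 1.
lines = [
    "15 rs0000001 20450000 A G 1 0 0",
    "7 rs0000002 20460000 C T 0 1 0",
    "15 rs0000003 30000000 A G 0 0 1",
]

kept = [line for line in lines if in_interval(line, 20.4e6, 20.5e6)]
for line in kept:
    print(line)
```

Here the chromosome 7 SNP survives the filter alongside the chromosome 15 SNP, which is why restricting the g file to one chromosome first removes the doubt.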



This time the program recognized 1 individual (Impute2 Run 4). When I ran the program before as an .rtf (see attached Impute2 run2 from previous post), Impute2 detected 0 individuals, excluded all the SNPs, and then the program reported "There are no SNPs in the imputation interval, so there is nothing for Impute2 to analyze; the program will quit now."

Thank you for your help. If I can figure out how to use gtool, Impute2 might recognize the g file and run. 
Impute2 Run 4.png

Christopher Chang

May 1, 2014, 11:26:18 PM
to plink2...@googlegroups.com
Hmm, looks like you'll need to find a computer running OS X or Linux to run GTOOL.

J1

May 3, 2014, 11:11:52 PM
to plink2...@googlegroups.com
Progress!

When I ran the example file from Impute2 with only 1 individual (using only g and m files), Impute2 crashed (exactly as when I tried to run the 23andme file converted with Plink 1.9).

Yet, when I ran the example file using 2 individuals (with only the g and m files), Impute2 ran without trouble.
It appears that Impute2 requires at least 2 individuals in the g file.
If Plink could read in 2 23andme files, merge them, and then recode to Oxford format, Impute2 might work properly. (I am not sure what should be done if 1 person had a genotype call at a locus and the other person did not.)
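
Since the number of individuals seems to matter, a small Python sketch for counting them from a .gen line may help before launching a run. It relies only on the Oxford layout of five leading columns followed by three probabilities per person; the function name and example lines are hypothetical.

```python
def count_individuals(gen_line):
    """Number of individuals encoded on one .gen line."""
    extra = len(gen_line.split()) - 5   # columns after the five leading ones
    if extra < 0 or extra % 3 != 0:
        raise ValueError("expected 5 leading columns plus triplets")
    return extra // 3

one_person = count_individuals("15 rs0000001 20400123 A G 1 0 0")
two_people = count_individuals("15 rs0000001 20400123 A G 1 0 0 0 1 0")
print(one_person, two_people)
```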


However, when I ran the 1000 Genomes files with the full example file (with the g, m, h, and l files), Impute2 had a problem locating the type 2 SNPs (attached file Impute Run 1 part b). I was worried that the h file would be too large for my computer to handle (it is 2GB), though this apparently was not the problem. It is possible that the example file and the 1000 Genomes files are using different genome builds (see red error box in attached file Impute Run 1 part b).

       




 


Impute Run 1 part a.png
Impute Run 1 part b.png

J1

May 4, 2014, 1:24:27 AM
to plink2...@googlegroups.com
Success!!!

I took the Plink 1.9 formatted Oxford style 23andme file and added another person's genotype data (coded in the Oxford triplet notation). When I ran a 0.1e6
interval of this file with the 1000 Genomes haplotype file, I received a file with almost 1,400 imputed genotypes for the two people!!

If a full genome were imputed, then this might produce approximately 40 million imputed SNPs. This is amazing!

It would be very helpful now if Plink 1.9 could input 2 23andme files, merge them and output in Oxford notation. It would also be great if Plink 1.9 allowed missing genotypes in the Oxford file. (Sometimes one person would have a no call at a locus, while the other person would call at that locus. When hand coding the genotypes, I just threw out such SNPs. This might not be necessary.) 
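
The hand coding described above can be sketched in Python. The triplet holds the probabilities of the AA, AB and BB genotypes in that order, so a hard call puts a 1 in one slot. The sketch assumes a two-letter genotype string with "--" for a no-call, and writes a no-call as 0 0 0, the usual Oxford convention for a missing genotype; the function names and example line are hypothetical.

```python
def genotype_to_triplet(genotype, allele_a, allele_b):
    """Convert a two-letter call such as 'GT' into an Oxford triplet."""
    if genotype == "--":                      # assumed no-call marker
        return (0, 0, 0)
    count_a = sum(1 for allele in genotype if allele == allele_a)
    count_b = sum(1 for allele in genotype if allele == allele_b)
    if count_a + count_b != 2:
        raise ValueError("genotype does not match the two listed alleles")
    triplet = [0, 0, 0]
    triplet[count_b] = 1                      # 0, 1 or 2 copies of allele B
    return tuple(triplet)

def append_individual(gen_line, genotype):
    """Return the .gen line with one more individual's triplet appended."""
    fields = gen_line.split()
    triplet = genotype_to_triplet(genotype, fields[3], fields[4])
    return " ".join(fields + [str(value) for value in triplet])

merged_line = append_individual("22 rs0000001 100 G T 0 0 1", "GT")
print(merged_line)
```

With a no-call written as 0 0 0, a SNP called in only one person would not need to be thrown out when building the file by hand.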

Christopher Chang

May 4, 2014, 1:28:53 AM
to plink2...@googlegroups.com
Oxford-format import/export has no problem with missing genotypes.

To merge multiple 23andMe files, first convert them to PLINK binary, and then use --bmerge or --merge-list.
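
The merge route above might look roughly like the following Python sketch, which only builds the command lines and runs nothing. It assumes the executable is called `plink`, that `--23file` accepts optional family and individual IDs after the file name (used here so the two people get distinct IDs), and that the merged fileset is exported with `--recode oxford` in a separate step. All file names, IDs and prefixes are hypothetical.

```python
def merge_23andme_commands(file_a, file_b, out_prefix):
    """Build four PLINK 1.9 command lines; nothing is executed here."""
    return [
        # 1-2: convert each 23andMe file to PLINK binary with its own IDs
        ["plink", "--23file", file_a, "FAM1", "personA",
         "--make-bed", "--out", "personA"],
        ["plink", "--23file", file_b, "FAM2", "personB",
         "--make-bed", "--out", "personB"],
        # 3: merge the second fileset into the first
        ["plink", "--bfile", "personA",
         "--bmerge", "personB.bed", "personB.bim", "personB.fam",
         "--make-bed", "--out", "merged"],
        # 4: export the merged data in Oxford format
        ["plink", "--bfile", "merged", "--recode", "oxford", "--out", out_prefix],
    ]

commands = merge_23andme_commands("a_genome.txt", "b_genome.txt", "both")
for command in commands:
    print(" ".join(command))
```

For more than two people, --merge-list with a text file of filesets would replace step 3.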

J1

May 4, 2014, 6:59:19 PM
to plink2...@googlegroups.com

Thank you very much!

I was able to impute almost half a million SNPs on one chromosome today. This is amazing.

I am concerned, though, about the accuracy of the results.

First, I started with a Plink Oxford formatted 22nd chromosome 23andme file, and then I copied all the genotypes and pasted them into the file. This created an Oxford file with one person's genotype recorded twice. I successfully ran this file with Impute2 (attached sample file of the output Ldoubleerrors1).

Second, I read 2 different 23andme files into Plink, merged them with --merge, and formatted them into Oxford style. I successfully ran this file with Impute2 (attached sample file of the output Lcombodouble errors 1).  

When I examined the output, I noticed a problem. In file Ldoubleerrors1 (this is the file with one person's genotypes recorded twice), some of the calls start with a 22 in the first column. These calls are those reported in the 23andme file. Sometimes these calls are reported again in the output (often with different probabilities associated with different genotypes, e.g. see the last call in the file, rs624100). Impute2 appears to be trying to impute the calls from the 23andme file (even though they are provided as input to the program {possibly double checking}).

Interestingly, in file Lcombodouble errors 1 (this is the file with two people's genotype recorded {produced using Plink's --merge function} [the first genotype is for the person recorded twice in Ldoubleerrors1]) the genotypes are different and appear much less accurate. For example, the above noted rs624100 (last reported SNP in the file) has a listed genotype of G T 0.089 0.526 0.385 for individual 1. However, as noted above, this call is on the 23andme chip and is reported as 0 0 1.

I am worried that Impute2 only gives a 38.5% probability for a TT call when the 23andme chip calls it TT (the 23andme gene chip has approximately 99.99% accuracy).

For some reason, Impute2 has dramatically changed its imputation probabilities apparently because 1 run had the same person reported twice and the other run had two different people.       
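
The comparison above can be made mechanical with a few lines of Python. The sketch takes the two alleles and the output triplet, reports the most probable genotype, and looks up the probability given to the genotype the chip reported. The numbers are the rs624100 values quoted above; the function name is hypothetical.

```python
def best_call(allele_a, allele_b, probs):
    """Return (most probable genotype, its probability, all genotype labels)."""
    genotypes = [allele_a * 2, allele_a + allele_b, allele_b * 2]
    index = max(range(3), key=lambda i: probs[i])
    return genotypes[index], probs[index], genotypes

imputed = (0.089, 0.526, 0.385)          # rs624100, individual 1, merged run
call, probability, labels = best_call("G", "T", imputed)
chip_probability = imputed[labels.index("TT")]

print(call, probability)                 # genotype Impute2 favours
print("TT", chip_probability)            # weight given to the chip's call
```

Run over every SNP that is also on the chip, a check like this would show how often the two runs disagree with the typed genotypes.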
Ldoubleerrors1.txt
Lcombodouble errors 1.txt