3+ alleles present when merging SNP chip data and NGS VCF: 'Use --flip...'


Mark Ebbert

Mar 28, 2014, 12:50:31 AM
to plink2...@googlegroups.com
Hi,

We have SNP array and whole-genome sequencing data from several hundred individuals. I'm merging the SNP array data with the VCF data. Not surprisingly, there were several thousand SNPs with 3+ alleles, so I flipped the SNP chip data since I want to use the designated reference allele from the NGS data. After flipping I still had a small subset of the same SNPs with 3+ alleles. I discovered that the NGS data had two alternates but the more common alternate in the NGS data was not the same alternate genotyped on the SNP array. So basically the two data sets look like this:

# NGS (VCF columns: CHROM POS ID REF ALT)
4 184507881 rs10011527 C A,T

# SNP array
T C

The 'T' alternate in the NGS data was set as missing. What I think should happen in this case is for Plink to have an option to flip only when the reference alleles differ, in which case I'll be forced to decide which data set should overwrite the other. Or an option to only print SNPs to the .missnp file when the reference alleles differ. Or maybe there's an existing solution that's better?
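To make the situation concrete, here is a minimal Python sketch of the allele comparison. It is not how PLINK implements the merge; `classify` and `flip` are hypothetical helpers, and the sketch assumes the NGS site was loaded as C/A (second alternate dropped) as described above.

```python
# Hypothetical helpers for illustration; not part of PLINK.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the opposite-strand complement of each allele code."""
    return {COMPLEMENT[a] for a in alleles}


def classify(ngs_alleles, array_alleles):
    """Compare the allele sets of one site in two biallelic datasets.

    Returns "match" if merging leaves at most two alleles, "flip" if that
    only holds after flipping the array alleles, and "triallelic" if
    neither orientation works.
    """
    ngs, array = set(ngs_alleles), set(array_alleles)
    if len(ngs | array) <= 2:
        return "match"
    if len(ngs | flip(array)) <= 2:
        return "flip"
    return "triallelic"


# rs10011527: NGS assumed loaded as C/A, array reports T/C.
print(classify({"C", "A"}, {"T", "C"}))
```

Flipping the array alleles T/C gives A/G, which still does not fit C/A, so the site stays in the 3+ allele list in either orientation.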

I hope I explained that clearly. 

Thanks!

Mark

Christopher Chang

Mar 28, 2014, 12:59:17 AM
to plink2...@googlegroups.com
PLINK really shouldn't be used to merge genuinely triallelic data right now; instead, export the PLINK-format file(s) to VCF (via e.g. "--recode vcf") and use another tool for the merge.  Sorry about the inconvenience; this issue will be properly addressed next year with a major change to the file format.

Mark Ebbert

Mar 28, 2014, 3:36:37 PM
to Christopher Chang, plink2...@googlegroups.com
I guess I don't understand. I'm trying to ignore the second alternate NGS allele as plink did when I read the vcf in. And I'm more likely to trust the snp array anyway. I will just overwrite the NGS allele with the snp array allele (merge option 5). I was just looking for an option to tell plink that the ref allele should match the NGS allele and flip accordingly. I wound up writing a script to do this. 

I also don't know of any vcf tool that would get this right if I converted to vcf. 

Please excuse the brevity. Sent from my Zack Morris phone.

Christopher Chang

Mar 28, 2014, 7:15:41 PM
to plink2...@googlegroups.com, Christopher Chang
To just remove all sites with mismatching alleles from the NGS file,

plink --bfile [original NGS fileset prefix] --exclude [.missnp file generated by failed merge] --make-bed --out [new NGS prefix]

should work; you can then follow up with merge mode 5.
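As a rough sketch, the two steps could be assembled like this in Python. The function only builds the command lines and does not run them; the fileset prefixes, the `_clean` suffix and the .missnp filename are placeholders I made up, and passing the array as the `--bmerge` fileset assumes you want its calls to overwrite the NGS calls under merge mode 5.

```python
import shlex


def exclude_then_merge(ngs_prefix, array_prefix, missnp_path, out_prefix):
    """Build (not run) the exclude and merge commands as argv lists."""
    cleaned = ngs_prefix + "_clean"  # hypothetical name for the filtered NGS fileset
    # Drop every site listed in the .missnp file from the NGS fileset.
    exclude_cmd = ["plink", "--bfile", ngs_prefix, "--exclude", missnp_path,
                   "--make-bed", "--out", cleaned]
    # Merge the array data into the filtered NGS data using merge mode 5.
    merge_cmd = ["plink", "--bfile", cleaned, "--bmerge", array_prefix,
                 "--merge-mode", "5", "--make-bed", "--out", out_prefix]
    return [exclude_cmd, merge_cmd]


# Placeholder names; substitute your own prefixes and .missnp path.
for cmd in exclude_then_merge("ngs", "array", "first_attempt.missnp", "merged"):
    print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run` once the names are filled in.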

Christopher Chang

Mar 28, 2014, 8:39:40 PM
to plink2...@googlegroups.com, Christopher Chang
To better answer your original question, it seems to me that the "reference allele only" flip you're asking for is equivalent to using --flip on the first .missnp file, and then flipping back the errors that remain in the second merge attempt.
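The bookkeeping for that two-pass approach can be sketched in Python as below. It assumes a .missnp file is simply one variant ID per line; the function names are hypothetical, and every ID except rs10011527 is invented for the example.

```python
def parse_missnp(text):
    """Parse .missnp content, assumed to be one variant ID per line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def split_after_reflip(first_missnp, second_missnp):
    """Return (fixed_by_flip, flip_back) as sorted lists of variant IDs.

    first_missnp:  errors from the initial merge attempt (all of these get flipped)
    second_missnp: errors still reported when merging again after the flip
    """
    first = parse_missnp(first_missnp)
    second = parse_missnp(second_missnp)
    return sorted(first - second), sorted(first & second)


# Invented example contents, apart from rs10011527.
first = "rs10011527\nrs_hypothetical_1\nrs_hypothetical_2\n"
second = "rs10011527\n"
fixed, flip_back = split_after_reflip(first, second)
print("resolved by --flip:", fixed)
print("flip back:", flip_back)
```

Writing the `flip_back` IDs to a file and running --flip on them once more restores their original orientation, which leaves only the sites that are really triallelic to deal with.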

The broader thing to keep in mind is that --flip has limited value on triallelic data.  I've changed the merge error message to reflect this, and will update the online documentation.

Christopher Chang

Mar 31, 2014, 9:54:29 AM
to plink2...@googlegroups.com, Christopher Chang
The rewritten merge failure documentation is now up at https://www.cog-genomics.org/plink2/data#merge3; it has a very important point concerning A/T and C/G SNPs and --flip-scan.

(--flip-scan itself should be implemented within the next few days; for now you can use the PLINK 1.07 implementation.)
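For context on why A/T and C/G SNPs matter here: each of those allele pairs is its own complement, so a strand error at such a site never produces a third allele and never shows up in a .missnp file. A small Python sketch for listing those sites follows; the helper names are hypothetical, the .bim rows are invented, and it assumes the usual six-column .bim layout with the two allele codes in the last two columns.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose allele pair is its own complement."""
    return COMPLEMENT.get(a1.upper()) == a2.upper()


def ambiguous_ids(bim_lines):
    """Return IDs of strand-ambiguous variants from .bim-style lines.

    Assumed columns: chromosome, variant ID, cM position, bp position,
    allele 1, allele 2.
    """
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) >= 6 and is_strand_ambiguous(fields[4], fields[5]):
            ids.append(fields[1])
    return ids


# Invented rows for illustration only.
bim = [
    "4 rs_hypothetical_at 0 1000 A T",
    "4 rs_hypothetical_cg 0 2000 G C",
    "4 rs_hypothetical_ac 0 3000 A C",
]
print(ambiguous_ids(bim))
```

Sites flagged this way cannot be checked by allele codes alone, which is where an LD-based check such as --flip-scan comes in.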



Mark Ebbert

Mar 31, 2014, 8:38:59 PM
to Christopher Chang, plink2...@googlegroups.com
Thanks for your suggestions. I wound up removing from NGS and then merging, as you suggested.

Vincent Laufer

Apr 8, 2014, 9:42:13 AM
to plink2...@googlegroups.com, Christopher Chang
Hi Christopher - I have a follow-up to this question.

I have 62 different .ped files each of which originated from some NGS data. I want to merge all of the .ped files into one large file, but I am afraid that every successive merge step will fail due to the presence of triallelic sites.

In other words, even if I exclude all the .missnps from the first merge event, it will still fail successively on merging of the 3rd file to the first 2, then the 4th file to the first 3, etc.

Does that make sense?

What I would love to be able to do is just exclude any triallelic site ... all I need is ancestry, and I have way more than enough markers to calculate it, so the loss of all triallelic sites is not a concern to me.

What do you recommend?

Thank you so much for your help here and on other threads.

Kindly,

Vincent

Christopher Chang

Apr 8, 2014, 10:14:15 PM
to plink2...@googlegroups.com, Christopher Chang
1. Convert the .ped/.map filesets to binary format.  (The .missnp file may not be generated if you don't do this first; this is technically an incompatibility with PLINK 1.07 but there's a very good reason for it.)
2. Use --merge-list to attempt to merge all 62 filesets at once.  This will generate a master .missnp file containing all triallelic sites.
3. Exclude those SNPs from every fileset.
4. Use --merge-list again, successfully this time.
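
The four steps above could be scripted roughly as follows. This Python sketch only plans the commands and the two merge-list files; it runs nothing. The list filenames, the `_clean` and `_attempt1` suffixes and the .missnp path are placeholders I chose, and it assumes each merge-list line may be a bare binary fileset prefix and that --merge-list is used without a separate reference fileset.

```python
def plan_merge(prefixes, missnp_path, out="merged"):
    """Return (commands, list_files) for the four steps; nothing is executed.

    prefixes:    .ped/.map fileset prefixes, one per input dataset
    missnp_path: path of the .missnp file written by the failed first merge
    """
    commands = []
    # Step 1: convert each .ped/.map fileset to binary (.bed/.bim/.fam).
    for p in prefixes:
        commands.append(["plink", "--file", p, "--make-bed", "--out", p])
    # Step 2: first merge attempt; expected to fail and write the .missnp file.
    commands.append(["plink", "--merge-list", "all.txt",
                     "--make-bed", "--out", out + "_attempt1"])
    # Step 3: exclude the listed sites from every fileset.
    cleaned = [p + "_clean" for p in prefixes]
    for p, c in zip(prefixes, cleaned):
        commands.append(["plink", "--bfile", p, "--exclude", missnp_path,
                         "--make-bed", "--out", c])
    # Step 4: merge the cleaned filesets.
    commands.append(["plink", "--merge-list", "clean.txt",
                     "--make-bed", "--out", out])
    list_files = {
        "all.txt": "\n".join(prefixes) + "\n",
        "clean.txt": "\n".join(cleaned) + "\n",
    }
    return commands, list_files


# 62 placeholder prefixes: sample01 ... sample62.
prefixes = ["sample%02d" % i for i in range(1, 63)]
commands, list_files = plan_merge(prefixes, "first_attempt.missnp")
print(len(commands), "commands planned")
```

If you run these with `subprocess.run`, do not treat a nonzero exit from the step 2 command as fatal, since that attempt is the one that is supposed to fail and produce the .missnp file.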