need to deal with duplicate rsIDs in UK Biobank imputed data

jie huang

Jul 3, 2023, 10:01:49 AM
to plink2-users

Dear Chris:

Please see the screenshot below. I got a "variant ID appears multiple times" error.
Because of this error, PLINK aborts the entire command.

Is there a way for PLINK to handle this by comparing both the REF and ALT alleles instead of just one allele? Alternatively, it would be nice if PLINK could still process the remaining SNPs that don't have duplicate variant IDs.

Thank you & best regards,
jie

Attachments: 1.png, 2.png

Christopher Chang

Jul 3, 2023, 1:29:08 PM
to plink2-users
This is one of the main applications of --set-all-var-ids.
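
To see why rebuilding every ID removes the duplicates, here is a minimal Python sketch of the idea. It is not PLINK's implementation: it assumes an ID scheme of chromosome, position, REF and ALT (the equivalent of a template such as `@:#:$r:$a`), and the records and the helper names `make_var_id` and `duplicated` are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (chrom, pos, rsid, ref, alt).
# The first two share one rsID, as split multiallelic variants often do.
records = [
    ("1", 12345, "rs11", "A", "C"),
    ("1", 12345, "rs11", "A", "G"),
    ("2", 67890, "rs22", "T", "TA"),
]

def make_var_id(chrom, pos, ref, alt):
    """Build a chrom:pos:ref:alt ID (assumed scheme, not taken from the thread)."""
    return f"{chrom}:{pos}:{ref}:{alt}"

def duplicated(ids):
    """Return the sorted list of IDs that occur more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)

rsids = [r[2] for r in records]
new_ids = [make_var_id(c, p, ref, alt) for c, p, _, ref, alt in records]

print(duplicated(rsids))    # IDs that clash under the rsID scheme
print(duplicated(new_ids))  # IDs that clash under the chrom:pos:ref:alt scheme
```

Because the two records that share an rsID differ in their ALT allele, the rebuilt IDs no longer collide.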

jie huang

Jul 6, 2023, 6:21:11 AM
to plink2-users

Thanks, Chris!

I assume that plink2 --set-all-var-ids @:#[b37] will replace all the rsIDs with IDs built from CHR and POS (and alleles).
But I do NOT want to rename the SNP IDs in the original UK Biobank dataset. I want to keep things as close to the original as possible.

I simply want to run the above --score command without getting an error message. Ideally, I would not need to update the huge genotype dataset itself.
If I do need to update the variant IDs in the original genotype data, it would be good to have something like --set-DUPLICATE-var-ids.

Your clarification and help are greatly appreciated!

Best regards,
Jie

Chris Chang

Jul 6, 2023, 8:36:24 AM
to jie huang, plink2-users
Sorry, the --set-duplicate-var-ids operation you have in mind would not actually result in correct scores. You’re on your own if you refuse to use --set-all-var-ids.

jie huang

Jul 6, 2023, 4:36:49 PM
to plink2-users

Dear Chris:

I did not mean to "refuse to use" the nice features of PLINK2.
However, as you might understand, the UK Biobank imputed data files are huge, with ~100 million SNPs. It would be good to keep the original files as they are, for integrity reasons.
If I manually use --set-all-var-ids to change all the variant IDs into something like chr1:12345:A:C, then later on I will run into difficulty when I need rsIDs for other software and other purposes.

Best regards,
Jie

Chris Chang

Jul 6, 2023, 5:12:37 PM
to jie huang, plink2-users
Nobody said you had to delete the old .pvar file.

Matthew Maher

Jul 6, 2023, 6:55:47 PM
to jie huang, plink2-users
I think you’re looking for:

--make-just-pvar

and:

--pvar
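
The idea is that the renamed IDs go into a separate variant file, which is then supplied in place of the original, so the original stays untouched. A rough Python sketch of that idea on an in-memory .pvar-like table follows. It does not call PLINK; it assumes tab-separated columns in the order #CHROM, POS, ID, REF, ALT and a chrom:pos:ref:alt naming scheme, and `rename_pvar_lines` is a hypothetical helper.

```python
def rename_pvar_lines(lines):
    """Return new .pvar-style lines with the ID column replaced by chrom:pos:ref:alt.

    Assumes tab-separated columns in the order #CHROM, POS, ID, REF, ALT
    (extra columns are passed through). The input list is not modified.
    """
    out = []
    for line in lines:
        if line.startswith("#"):
            out.append(line)  # keep header/meta lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _old_id, ref, alt = fields[:5]
        fields[2] = f"{chrom}:{pos}:{ref}:{alt}"
        out.append("\t".join(fields) + "\n")
    return out

# Hypothetical input; the two data rows share one rsID.
original = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "1\t12345\trs11\tA\tC\n",
    "1\t12345\trs11\tA\tG\n",
]
renamed = rename_pvar_lines(original)
```

With real data, `renamed` would be written to a new file that sits alongside the original .pvar.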

Phil Greer

Jul 7, 2023, 7:35:55 AM
to plink2-users

It is common to run into multiallelic SNPs when computing PRS from published scores. While this is not specific to PLINK, you should look at this very good paper/tutorial and base your workflow for dealing with these common issues on it.

Paper can be found here:
https://www.frontiersin.org/articles/10.3389/fgene.2022.818574/full

and the tutorial is here:

One of the better things about this tutorial is that it extracts only the data you will use for the PRS calculation, making the data files much more manageable.

-Phil 

jie huang

Jul 7, 2023, 9:05:59 AM
to plink2-users
Thanks for the reply. 

I understand that, ideally, all variants in all genotype data would be set to a common format such as CHR:POS:REF.ALT. But that format has a lot of problems too. First, it is very difficult, if not impossible, to remember. Second, the position can also change over time. Third, the REF and ALT assignment is debatable, and the alleles could be very long ...

So, I think there is still value in continuing to use rsIDs. If I have a score file with 2 SNPs (rs11 and rs22) and I want to get a PRS for all samples in UK Biobank, and the UK Biobank pvar file now uses the CHR:POS:REF.ALT format, then I will get stuck again.

So, I simply hope that PLINK will allow duplicate rsIDs, as long as they have different alleles, and that when it computes a PRS, it is intelligent enough to check the REF/ALT alleles in addition to the rsIDs.
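
For illustration only, an allele-aware lookup of this kind could be sketched in Python as below. This is not something PLINK does, and it is not a recommendation from the thread. It assumes a score file with an rsID, an effect allele and a weight per row, translates each row to a chrom:pos:ref:alt ID, and gives up when the effect allele fits more than one record. All data and the `translate` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical variant rows from the original file: (chrom, pos, rsid, ref, alt)
variants = [
    ("1", 12345, "rs11", "A", "C"),
    ("1", 12345, "rs11", "A", "G"),
    ("2", 67890, "rs22", "T", "G"),
]

# Hypothetical score rows: (rsid, effect_allele, weight)
score_rows = [("rs11", "C", 0.12), ("rs22", "G", -0.05), ("rs11", "A", 0.30)]

# Group the variant records by rsID so duplicates stay visible.
by_rsid = defaultdict(list)
for chrom, pos, rsid, ref, alt in variants:
    by_rsid[rsid].append((chrom, pos, ref, alt))

def translate(rsid, effect_allele):
    """Return the chrom:pos:ref:alt ID for a score row, or None if the match is
    missing or ambiguous (more than one record carries the effect allele)."""
    hits = [v for v in by_rsid.get(rsid, []) if effect_allele in (v[2], v[3])]
    if len(hits) != 1:
        return None
    chrom, pos, ref, alt = hits[0]
    return f"{chrom}:{pos}:{ref}:{alt}"

translated = [(translate(r, a), a, w) for r, a, w in score_rows]
```

In this toy data the row with effect allele A for rs11 cannot be resolved, because A is the REF allele of both records. Cases like that are why matching on the rsID plus a single allele is not always enough.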

Best regards,
Jie

Chris Chang

Jul 7, 2023, 9:16:41 AM
to jie huang, plink2-users
Immediately below the documentation for --set-all-var-ids is the documentation for --recover-var-ids, which lets you switch back to rsIDs at will.

For the last time, --set-all-var-ids IS the mechanism for correctly managing duplicate rsIDs. --score is far from the only command that basically requires it. Either learn it, or stop using plink2.
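
As a rough picture of the round trip, the following Python sketch restores the original IDs by matching each renamed variant to the untouched variant table on chromosome, position, REF and ALT. It only illustrates the concept and is not a description of how --recover-var-ids is implemented. The rows and the helper names `build_recovery_map` and `recover_ids` are assumptions made for the example.

```python
def build_recovery_map(original_rows):
    """Map (chrom, pos, ref, alt) -> original ID from the untouched variant table."""
    return {(c, p, ref, alt): vid for c, p, vid, ref, alt in original_rows}

def recover_ids(renamed_rows, recovery_map):
    """Swap each ID back to the original; keep the current ID if no match is found."""
    return [
        (c, p, recovery_map.get((c, p, ref, alt), vid), ref, alt)
        for c, p, vid, ref, alt in renamed_rows
    ]

# Hypothetical rows: (chrom, pos, id, ref, alt)
original_rows = [
    ("1", 12345, "rs11", "A", "C"),
    ("1", 12345, "rs11", "A", "G"),
]
# The same variants after renaming to chrom:pos:ref:alt IDs.
renamed_rows = [
    (c, p, f"{c}:{p}:{ref}:{alt}", ref, alt) for c, p, _, ref, alt in original_rows
]

restored = recover_ids(renamed_rows, build_recovery_map(original_rows))
```

The rsIDs come back unchanged, duplicates included, as long as the original variant file has been kept.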
