Is it possible to remove duplicate SNPs?


Peiyuan Zhu

Oct 2, 2023, 4:26:06 PM
to plink2-users
Is it possible to remove duplicate SNPs with PLINK2? 

I don't know why the UK Biobank dataset would have duplicate rsIDs.

Dominick A. Leone

Oct 2, 2023, 5:37:29 PM
to Peiyuan Zhu, plink2-users
Have not worked with UK Biobank data, but in our data, we have loci where there are multiple alternative alleles, and I believe rsID is based on genomic position. Depending on the data format and how multiple alleles are formatted in the input file (VCF?), you could have multiple “lines” for the same locus: one line for each alternative allele. In that case you may not want to remove the additional lines (they are variants). 

Another issue I have seen with “duplicate SNPs” arose because one of the SNP arrays we used had multiple probes for some loci. The chip was still in development and we ended up with some overlap in the coverage. We consulted with a bioinformatician to decide which SNPs to keep, based on the probes used for our other chips. If I recall, most of the SNP calls were almost exactly the same between the probes used in the chip, but not all. Our downstream imputation and analyses were not affected (imputation was based on biallelic SNPs).

Hope that helps!
 
Dominick Leone, MPH, MS
Doctoral Candidate, Epidemiology Department
Chronic Kidney Disease in Central America Research Group    
Boston University School of Public Health


Christopher Chang

Oct 2, 2023, 11:34:07 PM
to plink2-users
Yes, many datasets, including some from the UK Biobank, have "split" multiallelic variants.  For example, if the original reference allele was A and both C and G alternate alleles have been observed, there will be both a REF=A ALT=C and a REF=A ALT=G variant, both with the same ID+position.  Lots of software, including PLINK 1.x, can only handle this split representation.
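To check whether the repeated IDs in a given file follow this split pattern, a minimal sketch along these lines may help. It assumes a tab-separated, uncompressed .pvar with the usual #CHROM, POS, ID, REF, ALT column order; list_dup_ids is a hypothetical helper name, not part of PLINK.

```shell
# Hypothetical helper: print every record whose ID occurs more than once.
# Assumes columns are #CHROM POS ID REF ALT, separated by tabs.
# Output: ID <tab> chrom:pos:ref:alt, in file order.
list_dup_ids() {
  awk -F'\t' '
    /^#/ { next }                     # skip the header and any ## meta lines
    NR == FNR { count[$3]++; next }   # first pass: count how often each ID occurs
    count[$3] > 1 { print $3 "\t" $1 ":" $2 ":" $4 ":" $5 }   # second pass: report repeats
  ' "$1" "$1"                         # the same file is read twice
}
```

Calling list_dup_ids on the .pvar file prints the repeated records. If the rows for one ID share a position and REF allele but differ in ALT, they are pieces of a split multiallelic variant rather than true duplicates.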

Even with PLINK2, you sometimes want to keep these variants in split form.  When you do, it is generally a good idea to use --set-all-var-ids to assign unique IDs to the pieces (which have different ALT alleles).

If true duplicate SNPs remain after --set-all-var-ids (i.e. the ALT alleles *aren't* different), you can use --rm-dup to deduplicate.
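As a rough sketch of that workflow, the two flags can be combined in one wrapper. The function name, the prefixes, and the chr:pos:ref:alt template are illustrative assumptions, not taken from this thread. The PLINK2 variable is only there so the binary can be swapped out, for example for a dry run.

```shell
# Path to the plink2 binary; override it, e.g. PLINK2=./plink2
PLINK2="${PLINK2:-plink2}"

# Hypothetical wrapper: rename every variant to chr:pos:ref:alt, then deduplicate.
# $1 = input pfile prefix, $2 = output prefix, $3 = optional --rm-dup mode.
uniq_ids_then_dedup() {
  # In the ID template, @ = chromosome, # = position, $r = REF, $a = ALT.
  # The single quotes stop the shell from expanding $r and $a.
  "$PLINK2" --pfile "$1" \
    --set-all-var-ids '@:#:$r:$a' \
    --rm-dup ${3:+"$3"} \
    --make-pgen --out "$2"
}
```

With no third argument, --rm-dup runs in its default mode, which stops with an error if records sharing an ID disagree. Datasets with long indel alleles may also need --new-id-max-allele-len; check the documentation for your PLINK2 build before relying on this.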

Peiyuan Zhu

Oct 3, 2023, 1:09:30 AM
to Christopher Chang, plink2-users
Hi Chris, 

I just want to make sure we're talking about the same issue. Here are the commands I use to convert the pfile to BGEN format so that it can be imported by the R package bigsnpr. In the result, rs751773215 appears twice. I've already seen variants labelled like 55501335_T_A, so I suppose different alternate alleles have already been labelled differently?

# filter to the gene's range:
./plink2 --pfile ukb22828_c1_b0_v3 --from-bp 55479771 --to-bp 55555903 --chr 1 --make-pgen --out ukb22828_c1_b0_v3_xxx
# convert pgen to bgen:
./plink2 --pfile ukb22828_c1_b0_v3_xxx --export bgen-1.2 bits=8 --out ukb22828_c1_b0_v3_xxx
# index the bgen file:
bgenix -g ukb22828_c1_b0_v3_xxx.bgen -index -clobber

I'll check whether --set-all-var-ids and --rm-dup work by adding them to the filter command. Thanks!



Peiyuan Zhu

Oct 3, 2023, 1:21:38 AM
to plink2-users

Adding --rm-dup to the export command gives the error "Error: 9 duplicate IDs with inconsistent genotype data or variant information":





./plink2 --pfile ukb22828_c1_b0_v3_xxx --export bgen-1.2 bits=8 --rm-dup --out ukb22828_c1_b0_v3_xxx
PLINK v2.00a3LM 64-bit Intel (11 Oct 2021)     www.cog-genomics.org/plink/2.0/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ukb22828_c1_b0_v3_xxx.log.
Options in effect:
  --export bgen-1.2 bits=8
  --out ukb22828_c1_b0_v3_xxx
  --pfile ukb22828_c1_b0_v3_xxx
  --rm-dup

Start time: Mon Oct  2 22:19:14 2023
257860 MiB RAM detected; reserving 128930 MiB for main workspace.
Allocated 7259 MiB successfully, after larger attempt(s) failed.
Using up to 64 threads (change this with --threads).
487409 samples (264251 females, 222957 males, 201 ambiguous; 487409 founders)
loaded from ukb22828_c1_b0_v3_xxx.psam.
2789 variants loaded from ukb22828_c1_b0_v3_xxx.pvar.
Note: No phenotype data present.
Error: 9 duplicate IDs with inconsistent genotype data or variant information
detected by --rm-dup; see ukb22828_c1_b0_v3_xxx.rmdup.mismatch .
End time: Mon Oct  2 22:19:14 2023

Peiyuan Zhu

Oct 3, 2023, 1:26:45 AM
to plink2-users
It seems the "force-first" mode of --rm-dup has done the job. Thanks.
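For anyone landing here later, this is roughly what that export command looks like. It is a sketch: the wrapper name is hypothetical, and the added "list" modifier is my assumption, not something taken from this thread.

```shell
# Path to the plink2 binary; override it, e.g. PLINK2=./plink2
PLINK2="${PLINK2:-plink2}"

# Hypothetical wrapper around the export command from this thread.
# $1 = pfile prefix, reused as the output prefix as in the original command.
export_bgen_force_first() {
  # force-first keeps the first record for each duplicated ID and drops the rest.
  # "list" is meant to write out the affected IDs (assumption: check your build's docs).
  "$PLINK2" --pfile "$1" \
    --rm-dup force-first list \
    --export bgen-1.2 bits=8 \
    --out "$1"
}
```

As I understand it, force-first keeps only the first record for each repeated ID, even when the dropped records differ from it. Since the error reported 9 IDs with inconsistent data, some dropped records may be distinct ALT alleles of split multiallelic variants. Assigning unique IDs with --set-all-var-ids first would keep them.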