PLINK 1.9 LD-clumping


Robin Liu

Aug 11, 2017, 10:01:54 AM
to plink2-users
I'm trying to run --clump in PLINK 1.9 to get a list of independent SNPs from a list of GWAS SNPs. I have the 1000 Genomes phase 3 VCF file (from the 1000 Genomes website) and tried converting it into PLINK binary files for use in clumping (using --vcf 1kg.vcf --make-bed). However, I keep running into the error Duplicate ID "rsxxxx", despite excluding duplicate variants from the analysis using the --list-duplicate-vars command. Does anyone have a solution to this? Is there a special 1000 Genomes phase 3 file that I should be using for this analysis?

Christopher Chang

Aug 11, 2017, 1:16:32 PM
to plink2-users
--list-duplicate-vars lists variants with matching positions and allele codes; variant ID is not considered.

Two possible solutions:
1. Dump all the variant IDs with --write-snplist (or Unix cut), use Unix sort and "uniq -d" on the .snplist file to extract just the duplicates, and then --exclude those duplicates.
2. Use PLINK 2.0's --set-all-var-ids flag to generate new variant IDs based on chromosome/position/alleles.
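
Option 1 could be sketched roughly as below, assuming the data are already in PLINK binary format, so that the variant ID is column 2 of the .bim file. The function name and the file names in the usage comment are placeholders, not anything from this thread.

```shell
# Print every variant ID that occurs more than once in a PLINK .bim file.
# Column 2 of a .bim file holds the variant ID.
list_duplicate_ids() {
    bim="$1"
    # awk splits on any whitespace, so tabs or spaces both work
    awk '{ print $2 }' "$bim" | LC_ALL=C sort | uniq -d
}

# Hypothetical usage (file names are placeholders):
#   list_duplicate_ids 1kg.bim > duplicate_ids.txt
#   plink --bfile 1kg --exclude duplicate_ids.txt --make-bed --out 1kg_nodup
```

Note that --exclude drops every variant carrying a listed ID, so all copies of a duplicated ID are removed, not just the extras.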

alymf...@gmail.com

Nov 30, 2017, 6:23:22 PM
to plink2-users
I am also trying to do LD-clumping with the 1000 Genomes phase 1 data and am having a similar problem. After getting a duplicate-ID error, I tried using PLINK 2.0's --set-all-var-ids flag. I then got this error:

"Error: 10335 allele codes too long for --set-all-var-ids. You should either
switch to a different allele/variant naming scheme for long indels, or use
--new-id-max-allele-len to raise the length limit."


I used --new-id-max-allele-len 15000, but it still says the allele codes are too long, and it does not allow me to go above 15000. How should I proceed?

Thanks very much

Christopher Chang

Nov 30, 2017, 6:30:48 PM
to plink2-users
You have two options in plink 2.0:
  "--new-id-max-allele-len 7500 missing" causes any variant with a longer allele code to have ID set to ".".  If you want, you can follow up with a script which assigns a different kind of name to these very long indels.
  "--new-id-max-allele-len 7500 truncate" causes just the first 7500 characters in each allele code to be used in the variant ID.  This is a bit sloppy, but it'll probably give you unique IDs.
(Note that I changed the constant from 15000 to 7500: plink has a variant ID length limit of 16000 characters, so if for some weird reason there are *two* super-long allele codes, 15000 could fail.)
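
A follow-up script for the "missing" mode might look something like the sketch below, which rewrites "." IDs in a .bim file. The CHR:POS:LONG<n> naming scheme and the function name are purely illustrative assumptions, not a PLINK convention; any scheme that yields short, unique IDs would do.

```shell
# Give every variant whose ID is "." a placeholder name of the form
# CHR:POS:LONG<n>, where <n> is a running counter over such variants.
# (Illustrative naming scheme only; not something PLINK defines.)
rename_missing_ids() {
    awk 'BEGIN { OFS = "\t" }
         $2 == "." { n++; $2 = $1 ":" $4 ":LONG" n }   # col 1 = chr, col 4 = bp position
         { $1 = $1; print }                            # re-emit every row tab-delimited
        ' "$1"
}

# Hypothetical usage (file names are placeholders):
#   rename_missing_ids data.bim > data.renamed.bim
```

The counter keeps the names unique even if two long indels share a position.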

alymf...@gmail.com

Dec 2, 2017, 2:24:59 PM
to plink2-users
Thanks. I'm now getting an "out of memory" message when I input the resulting files, as expected with the 1000 Genomes dataset. I get this message with the files from both "--new-id-max-allele-len 7500 missing" and "--new-id-max-allele-len 7500 truncate".

After getting the memory error, I tried both increasing and decreasing the workspace size (e.g. 8000 and 2000), but I still get this error message (see below). I saw you recommend splitting the data by chromosome. Should I just download each chr dataset, use plink2's --set-all-var-ids flag to generate new variant IDs, then input each chr binary file and clump my GWAS summary stats (e.g. --bfile chr1 --clump input.txt; --bfile chr2 --clump input.txt, and so on)?

Log with "--new-id-max-allele-len 7500 missing":

PLINK v1.90b5 64-bit (14 Nov 2017)             www.cog-genomics.org/plink/1.9/
(C) 2005-2017 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink.log.
Options in effect:
  --bfile plink2
  --clump input.txt
  --clump-kb 500
  --clump-p1 5e-8
  --memory 8000

8192 MB RAM detected; reserving 8000 MB for main workspace.
Error: Out of memory.  The --memory flag may be helpful.



Christopher Chang

Dec 2, 2017, 3:26:09 PM
to plink2-users
If you want to use plink 1.9, lower the limit to ~50; it wasn’t really designed with very long variant IDs in mind.

alymf...@gmail.com

Dec 4, 2017, 5:57:49 PM
to plink2-users
Truncating to 50 did not work, but truncating to 20 did. However, it still says I have duplicate IDs when I clump in plink 1.9. Do you see a reason why I would have duplicate IDs after setting new variant IDs in plink2?

In plink2, I generated new variant IDs from the 1000 Genomes phase 1 data and truncated allele codes to 20 characters:

./plink2 --bfile 1kg_phase1_all --make-bed --new-id-max-allele-len 20 truncate --set-all-var-ids @:#\$1,\$2

Then in plink 1.9, I input the binary file:
./plink --bfile plink2_truncate20 --clump DEPICT_input.txt --clump-kb 500 --clump-p1 5e-8 --clump-r2 0.05
Random number seed: 1512427828
8192 MB RAM detected; reserving 4096 MB for main workspace.
39728178 variants loaded from .bim file.
1092 people (525 males, 567 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1083 founders and 9 nonfounders present.
Calculating allele frequencies... done.
Warning: 855 het. haploid genotypes present (see plink.hh ); many commands
treat these as missing.
Total genotyping rate is 0.999956.
39728178 variants and 1092 people pass filters and QC.
Note: No phenotypes present.
Error: Duplicate ID '6:26745501<DEL>,C'.


Christopher Chang

Dec 4, 2017, 6:55:24 PM
to plink2-users
There may be a few actual duplicate (chr, pos, allele 1, allele 2) tuples in your dataset; it looks like (6, 26745501, <DEL>, C) is one of them.  It should be fine to get rid of them with e.g. --exclude + --make-bed, and then proceed with your analysis.

alymf...@gmail.com

Dec 12, 2017, 2:55:20 PM
to plink2-users

I excluded duplicate variants and am now getting the message "Warning: No significant --clump results.  Skipping."

I have tried clumping with p-value thresholds of 5e-8 and 5e-5, but get the same warning. How do you suggest I proceed? I am trying to clump my GWAS summary stats before running a DEPICT analysis.


./plink --bfile plink2_truncate20 --clump Full_GWAS.txt --clump-kb 500 --clump-p1 5e-8 --clump-r2 0.05 --exclude duplicates.dupvar

Christopher Chang

Dec 13, 2017, 7:04:31 PM
to plink2-users
I'd first check whether
(i) plink2_truncate20.bim has the same variant IDs as Full_GWAS.txt, and whether
(ii) you can see at least one significant p-value in Full_GWAS.txt corresponding to a variant in plink2_truncate20.bim.
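
Both checks could be done in one pass with something like the sketch below. It assumes the summary-statistics file is whitespace-delimited with a header row containing columns named SNP and P (the field names --clump looks for by default); the function name is a placeholder. If, for instance, the .bim IDs were regenerated with --set-all-var-ids while the summary statistics still use rsIDs, the shared count would come out as zero.

```shell
# Count how many variant IDs in a summary-statistics file also appear in a
# .bim file, and how many of those have P below a threshold (default 5e-8).
# Assumes the stats file has a header row with columns named SNP and P.
check_clump_inputs() {
    bim="$1"; stats="$2"; thresh="${3:-5e-8}"
    awk -v t="$thresh" '
        NR == FNR { in_bim[$2] = 1; next }       # first file: collect .bim IDs
        FNR == 1 {                               # second file: locate columns
            for (i = 1; i <= NF; i++) {
                if ($i == "SNP") s = i
                if ($i == "P")   p = i
            }
            if (!s || !p) {
                print "missing SNP or P column" > "/dev/stderr"
                bad = 1; exit 1
            }
            next
        }
        ($s in in_bim) {
            shared++
            # skip non-numeric p-values such as NA
            if ($p ~ /^[0-9]/ && $p + 0 < t + 0) sig++
        }
        END { if (!bad) printf "shared=%d significant=%d\n", shared, sig }
    ' "$bim" "$stats"
}

# Hypothetical usage:
#   check_clump_inputs plink2_truncate20.bim Full_GWAS.txt 5e-8
```

A result of shared=0 points to an ID mismatch (check i); shared>0 with significant=0 points to the p-value threshold or the P column (check ii).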