PLINK 1.9 LD-clumping

Robin Liu

Aug 11, 2017, 10:01:54 AM

to plink2-users

I'm trying to run --clump in PLINK 1.9 to get a list of independent SNPs from a list of GWAS SNPs. I have the 1000 Genomes phase 3 VCF file (from the 1000 Genomes website) and tried converting it into PLINK binary files for use in clumping (using --vcf 1kg.vcf --make-bed). However, I keep getting the error Duplicate ID "rsxxxx", despite excluding duplicate variants from the analysis using --list-duplicate-vars. Does anyone have a solution to this? Is there a special 1000 Genomes phase 3 file that I should be using for this analysis?

Christopher Chang

Aug 11, 2017, 1:16:32 PM

to plink2-users

--list-duplicate-vars lists variants with matching positions and allele codes; variant ID is not considered.

Two possible solutions:

1. Dump all the variant IDs with --write-snplist (or Unix cut), use Unix sort and "uniq -d" on the .snplist file to extract just the duplicates, and then --exclude those duplicates.

2. Use PLINK 2.0's --set-all-var-ids flag to generate new variant IDs based on chromosome/position/alleles.
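Option 1 above can also be done without sort and uniq. The following is a minimal Python sketch, assuming a standard six-column .bim file with the variant ID in column 2; the function names and file names are made up for illustration and are not part of PLINK.

```python
from collections import Counter
from pathlib import Path


def duplicate_variant_ids(bim_path):
    """Return the variant IDs that appear more than once in a PLINK .bim file."""
    counts = Counter()
    with open(bim_path) as bim:
        for line in bim:
            fields = line.split()
            if len(fields) >= 2:
                counts[fields[1]] += 1  # column 2 of a .bim file is the variant ID
    return sorted(vid for vid, n in counts.items() if n > 1)


def write_exclude_list(bim_path, out_path):
    """Write one duplicated ID per line, the layout --exclude reads."""
    dups = duplicate_variant_ids(bim_path)
    Path(out_path).write_text("".join(f"{vid}\n" for vid in dups))
    return len(dups)
```

The resulting file can then be passed to --exclude together with --make-bed. Note that this drops every copy of a duplicated ID, not just the extra ones.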

alymf...@gmail.com

Nov 30, 2017, 6:23:22 PM

to plink2-users

I am also trying to do LD clumping with the 1000 Genomes phase 1 data and am having a similar problem. After getting a duplicate-ID error, I tried PLINK 2.0's --set-all-var-ids flag. I then got this error: "Error: 10335 allele codes too long for --set-all-var-ids. You should either switch to a different allele/variant naming scheme for long indels, or use --new-id-max-allele-len to raise the length limit."

I used --new-id-max-allele-len 15000 but it still states allele codes were too long. It does not allow me to go above 15000. How should I proceed?

Thanks very much

Christopher Chang

Nov 30, 2017, 6:30:48 PM

to plink2-users

You have two options in plink 2.0:

"--new-id-max-allele-len 7500 missing" causes any variant with a longer allele code to have ID set to ".". If you want, you can follow up with a script which assigns a different kind of name to these very long indels.

"--new-id-max-allele-len 7500 truncate" causes just the first 7500 characters in each allele code to be used in the variant ID. This is a bit sloppy, but it'll probably give you unique IDs.

(Note that I changed the constant from 15000 to 7500: plink has a variant ID length limit of 16000 characters, so if for some weird reason there are *two* super-long allele codes, 15000 could fail.)
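For the "missing" option, the follow-up script mentioned above could look something like the sketch below. It assumes the output is a six-column .bim file (chromosome, ID, cM, position, allele 1, allele 2) and replaces each "." ID with chromosome, position, and a short hash of the two allele codes. That chr:pos:hash layout and both function names are invented here for illustration; any scheme that yields short, unique IDs would do.

```python
import hashlib


def name_long_indel(chrom, pos, a1, a2):
    """Build a short ID from the position plus a hash of the allele codes.

    The chr:pos:hash layout is a made-up convention, not a PLINK feature.
    """
    digest = hashlib.sha1(f"{a1},{a2}".encode()).hexdigest()[:12]
    return f"{chrom}:{pos}:{digest}"


def fill_missing_ids(bim_in, bim_out):
    """Copy a .bim file, replacing every '.' variant ID with a generated one."""
    renamed = 0
    with open(bim_in) as src, open(bim_out, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) == 6 and fields[1] == ".":
                fields[1] = name_long_indel(fields[0], fields[3], fields[4], fields[5])
                renamed += 1
            dst.write("\t".join(fields) + "\n")
    return renamed
```

Because only the ID column changes and the line order is preserved, the rewritten .bim can stand in for the original alongside the same .bed and .fam files.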

alymf...@gmail.com

Dec 2, 2017, 2:24:59 PM

to plink2-users

Thanks. I'm now getting an "out of memory" message when I input the resulting files, as expected with the 1000 Genomes dataset. I get this message with the files from both "--new-id-max-allele-len 7500 missing" and "--new-id-max-allele-len 7500 truncate".

After getting the memory error, I tried both increasing and decreasing the workspace size, e.g. 8000 and 2000, but am still getting this error message (see below). I saw you recommend splitting the data by chromosome. Should I just download each chromosome's dataset, use plink2's --set-all-var-ids flag to generate new variant IDs, then load each chromosome's binary fileset and clump my GWAS summary stats (e.g. --bfile chr1 --clump input.txt; --bfile chr2 --clump input.txt; and so on)?

"--new-id-max-allele-len 7500 missing"

PLINK v1.90b5 64-bit (14 Nov 2017) www.cog-genomics.org/plink/1.9/
Logging to plink.log.
Options in effect:
  --bfile plink2
  --clump input.txt
  --clump-kb 500
  --clump-p1 5e-8
  --memory 8000

8192 MB RAM detected; reserving 8000 MB for main workspace.
Error: Out of memory. The --memory flag may be helpful.
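The per-chromosome approach asked about above could be scripted roughly as follows. This is only a sketch: it assumes filesets named chr1 through chr22 (the names used in the question) and an illustrative output prefix, and it only builds and prints the commands rather than running them. The flags are the ones already used in this thread plus --out.

```python
import shlex


def clump_commands(gwas_file, chromosomes=range(1, 23), plink="./plink"):
    """Build one PLINK 1.9 --clump command per chromosome fileset."""
    commands = []
    for chrom in chromosomes:
        prefix = f"chr{chrom}"  # assumed fileset name: chr1.bed/.bim/.fam, ...
        commands.append([
            plink, "--bfile", prefix,
            "--clump", gwas_file,
            "--clump-kb", "500",
            "--clump-p1", "5e-8",
            "--out", f"{prefix}_clump",  # illustrative output prefix
        ])
    return commands


if __name__ == "__main__":
    for argv in clump_commands("input.txt"):
        # Print only; hand argv to subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Clumps are formed within a kb window around each index variant, so they do not span chromosomes, and the per-chromosome .clumped outputs can simply be concatenated afterwards.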

Christopher Chang

Dec 2, 2017, 3:26:09 PM

to plink2-users

If you want to use plink 1.9, lower the limit to ~50; it wasn’t really designed with very long variant IDs in mind.

alymf...@gmail.com

Dec 4, 2017, 5:57:49 PM

to plink2-users

Truncating to 50 did not work, but truncating to 20 did. However, it still says I have duplicate IDs when I clump in plink 1.9. Do you see a reason why I would have duplicate IDs after setting new variant IDs in plink2?

In plink2, I generated new variant IDs from the 1000 Genomes phase 1 data and truncated the allele codes to 20 characters:

./plink2 --bfile 1kg_phase1_all --make-bed --new-id-max-allele-len 20 truncate --set-all-var-ids @:#\$1,\$2

Then in plink 1.9, I input the binary fileset:

./plink --bfile plink2_truncate20 --clump DEPICT_input.txt --clump-kb 500 --clump-p1 5e-8 --clump-r2 0.05

Random number seed: 1512427828
8192 MB RAM detected; reserving 4096 MB for main workspace.
39728178 variants loaded from .bim file.
1092 people (525 males, 567 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 1083 founders and 9 nonfounders present.
Calculating allele frequencies... done.
Warning: 855 het. haploid genotypes present (see plink.hh ); many commands treat these as missing.
Total genotyping rate is 0.999956.
39728178 variants and 1092 people pass filters and QC.
Note: No phenotypes present.
Error: Duplicate ID '6:26745501<DEL>,C'.

Christopher Chang

Dec 4, 2017, 6:55:24 PM

to plink2-users

There may be a few actual duplicate (chr, pos, allele 1, allele 2) tuples in your dataset; it looks like (6, 26745501, <DEL>, C) is one of them. It should be fine to get rid of them with e.g. --exclude + --make-bed, and then proceed with your analysis.

alymf...@gmail.com

Dec 12, 2017, 2:55:20 PM

to plink2-users

I excluded the duplicate variants and am now getting the message "Warning: No significant --clump results. Skipping."

I have tried clumping with p-value thresholds of 5e-8 and 5e-5, but get the same warning. How do you suggest I proceed? I am trying to clump my GWAS summary stats before running a DEPICT analysis.

./plink --bfile plink2_truncate20 --clump Full_GWAS.txt --clump-kb 500 --clump-p1 5e-8 --clump-r2 0.05 --exclude duplicates.dupvar

Christopher Chang

Dec 13, 2017, 7:04:31 PM

to plink2-users

I'd first check whether

(i) plink2_truncate20.bim has the same variant IDs as Full_GWAS.txt, and whether

(ii) you can see at least one significant p-value in Full_GWAS.txt corresponding to a variant in plink2_truncate20.bim.
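Both checks can be run with a short script. The sketch below assumes the summary-statistics file is whitespace-delimited with a header row containing SNP and P columns (the field names --clump looks for by default), and it treats "significant" as p <= p1, which is an assumption about how the threshold is applied. The function names are illustrative.

```python
def load_bim_ids(bim_path):
    """Collect the variant IDs (column 2) of a PLINK .bim file."""
    with open(bim_path) as bim:
        return {line.split()[1] for line in bim if line.strip()}


def check_clump_input(bim_path, gwas_path, p1=5e-8, snp_col="SNP", p_col="P"):
    """Count GWAS rows whose ID is in the .bim, and how many of those pass p1."""
    bim_ids = load_bim_ids(bim_path)
    total = shared = significant = 0
    with open(gwas_path) as gwas:
        header = gwas.readline().split()
        snp_i, p_i = header.index(snp_col), header.index(p_col)
        for line in gwas:
            fields = line.split()
            if len(fields) <= max(snp_i, p_i):
                continue  # skip blank or truncated rows
            total += 1
            if fields[snp_i] in bim_ids:
                shared += 1
                try:
                    if float(fields[p_i]) <= p1:  # assumed comparison
                        significant += 1
                except ValueError:
                    pass  # non-numeric p-value such as "NA"
    return {"gwas_variants": total, "in_bim": shared, "significant_in_bim": significant}
```

If in_bim comes back as 0, the two files use different ID schemes, for example rsIDs in the summary statistics versus IDs like 6:26745501<DEL>,C in a .bim renamed with --set-all-var-ids. If in_bim is nonzero but significant_in_bim is 0, no shared variant passes the p1 threshold.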
