Issue with data pruning for ancestry estimation


Safa Majeed

Feb 15, 2024, 3:26:00 PM
to plink2-users

I'm trying to estimate ancestry by following the PLINK tutorial by Hannah Meyer. I've reached the "Prune study data" step, but when I run:

plink2 --bfile $qcdir/$name.no_ac_gt_snps \
--exclude range $refdir/$highld \
--indep-pairwise 50 5 0.2 \
--allow-extra-chr \
--out $qcdir/$name.no_ac_gt_snps

I get the following error:
13362315 variants remaining after main filters.
Error: --indep-pairwise requires unique variant IDs. (--set-all-var-ids and/or
--rm-dup may help.)

I've tried adding --set-missing-var-ids '@:#$r,$a' (which runs, but I still get the same error) and --rm-dup list (but I get "Error: 286705 duplicate IDs with inconsistent genotype data or variant"). If I overwrite all variant IDs with --set-all-var-ids, the SNP IDs are all replaced and the next step can no longer match any variants:

plink2 --bfile $qcdir/$name.no_ac_gt_snps \
--extract $qcdir/$name.no_ac_gt_snps.prune.in \
--make-bed \
--allow-extra-chr \
--out $qcdir/$name.pruned

--extract: 0 variants remaining.
Error: No variants remaining after main filters.

I'm using iOS, PLINK v2.00a4.4LM 64-bit Intel

My sample info is below: 

85 samples (0 females, 0 males, 85 ambiguous; 85 founders)
26252508 variants loaded 
Total (hardcall) genotyping rate is 0.972771.

Christopher Chang

Feb 16, 2024, 10:40:38 AM
to plink2-users
0. You should update your plink2 build; see the 21 Nov 2023 entry in the version history.
1. Set unique variant IDs upfront (--set-all-var-ids + --make-bed, then --rm-dup if some duplicates still remain at this point); don't try to do anything else yet.  After you have a working variant ID scheme, perform --indep-pairwise and other operations starting from the new IDs.
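
To see why step 1 comes first, here is a minimal sketch that emulates a chrom_pos_ref_alt ID template with awk on a made-up .bim file and counts duplicated IDs before and after. The file contents and the helper names (count_dup_ids, apply_id_template) are hypothetical, and it assumes column 5 holds ALT and column 6 holds REF; on real data the renaming should be done by plink2 itself, not by awk.

```shell
#!/bin/sh
# Toy illustration only: emulates a chrom_pos_ref_alt ID template on a
# made-up .bim file. This is not a substitute for --set-all-var-ids.
tmp=$(mktemp -d)

# Columns: chrom, ID, cM, pos, allele 1, allele 2.
# Assumption: allele 1 is ALT and allele 2 is REF.
cat > "$tmp/toy.bim" <<'EOF'
1 . 0 1000 A G
1 . 0 1000 T G
1 rs1 0 2000 C T
1 rs1 0 3000 G A
EOF

# Number of distinct IDs that occur more than once in column 2.
count_dup_ids() {
  awk '{ print $2 }' "$1" | sort | uniq -d | wc -l | tr -d ' '
}

# Rewrite column 2 as chrom_pos_ref_alt.
apply_id_template() {
  awk '{ $2 = $1 "_" $4 "_" $6 "_" $5; print }' "$1"
}

dup_before=$(count_dup_ids "$tmp/toy.bim")
apply_id_template "$tmp/toy.bim" > "$tmp/toy.renamed.bim"
dup_after=$(count_dup_ids "$tmp/toy.renamed.bim")

echo "duplicated IDs before: $dup_before, after: $dup_after"
rm -rf "$tmp"
```

On this toy input the missing ID "." and the reused "rs1" both count as duplicates before renaming, and none remain afterwards. Any duplicates that survive a position-and-allele template are true repeated records, which is the case --rm-dup is meant for.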


Safa Majeed

Feb 21, 2024, 12:39:26 PM
to plink2-users
Hi Christopher,

Thank you for your response. I have amended my script such that I am setting the unique variant IDs upfront. I am working on updating my plink build but need to go through my IT department to do so.
# Convert VCF to PLINK binary format (bed, bim, fam)
plink2 --vcf $vcf_file --keep-allele-order --make-bed --out $output_dir --allow-extra-chr --set-all-var-ids '@_#_\$r_\$a' --new-id-max-allele-len 928
plink2 --bfile $output_dir --make-bed --allow-extra-chr --rm-dup list --out $qcdir/$name.rm_dup

Now when I prune the study data it works. However, the next step fails for the same reason:
#Filter reference data for the same SNP set as in study
#We will use the list of pruned variants from the study sample to reduce the reference dataset to the size of the study samples:
plink2 --bfile $refdir/$refname \
--extract $qcdir/$name.no_ac_gt_snps.prune.in \
--make-bed \
--allow-extra-chr \
--out $qcdir/$refname.pruned
mv $qcdir/$refname.pruned.log $qcdir/plink_log/$refname.pruned.log

Start time: Wed Feb 21 12:31:05 2024
122722 MiB RAM detected, ~115959 available; reserving 61361 MiB for main
workspace.
Using 1 compute thread.
3202 samples (1603 females, 1599 males; 2583 founders) loaded from
74929081 variants loaded from
Note: No phenotype data present.

--extract: 0 variants remaining.
Error: No variants remaining after main filters.
End time: Wed Feb 21 12:31:31 2024

Christopher Chang

Feb 22, 2024, 3:27:39 PM
to plink2-users
Think about how you fixed your first step.
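
One reading of this hint, offered as an assumption rather than a confirmed diagnosis: the study IDs were rewritten to chrom_pos_ref_alt form, while the reference panel still carries its original IDs, so --extract finds no matching names. The sketch below checks ID overlap on made-up files and shows the overlap returning once the reference is given the same scheme. The file contents and the helper count_shared_ids are hypothetical; on real data the reference would be renamed with --set-all-var-ids using the same template as the study data, then extracted.

```shell
#!/bin/sh
# Toy illustration only: file contents and the awk renaming are made up.
tmp=$(mktemp -d)

# Pruned variant list from the study data, already in chrom_pos_ref_alt form.
cat > "$tmp/study.prune.in" <<'EOF'
1_1000_G_A
1_2000_T_C
EOF

# Reference .bim that still carries its original IDs.
# Assumption: column 5 is ALT and column 6 is REF.
cat > "$tmp/ref.bim" <<'EOF'
1 rs111 0 1000 A G
1 rs222 0 2000 C T
1 rs333 0 5000 G A
EOF

# Count how many IDs from a one-per-line list appear in column 2 of a .bim.
count_shared_ids() {
  awk 'NR == FNR { want[$1] = 1; next }
       ($2 in want) { n++ }
       END { print n + 0 }' "$1" "$2"
}

shared_before=$(count_shared_ids "$tmp/study.prune.in" "$tmp/ref.bim")

# Emulate giving the reference the same chrom_pos_ref_alt ID scheme.
awk '{ $2 = $1 "_" $4 "_" $6 "_" $5; print }' "$tmp/ref.bim" > "$tmp/ref.renamed.bim"
shared_after=$(count_shared_ids "$tmp/study.prune.in" "$tmp/ref.renamed.bim")

echo "shared IDs before renaming: $shared_before, after: $shared_after"
rm -rf "$tmp"
```

Running the same overlap count against the real .prune.in and the reference .bim before calling --extract is a quick way to confirm whether the two files use the same ID scheme.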