--extract doesn't work correctly with non-unique variant IDs (happens in 1000 Genomes)

Alejandro Ochoa

unread,

Feb 6, 2019, 3:20:45 PM2/6/19

to plink2-users

Dear Christopher,

I noticed SNP filtering using --extract was giving me weird results, and I just realized it's because the data has duplicate variant IDs, but --extract only takes variant IDs and not chromosome number and position so it lets things through that I don't want.

This is not weird data but the official 1000 Genomes! I'm actually using the plink2-formatted version you posted on your website, so you can verify directly that this data has duplicate variant IDs. Some are surprisingly odd, for example rs6658405 actually occurs at two completely different positions in Chr1:

zstdcat all_phase3.pvar.zst |grep -P '\trs6658405\t'
1 13019068 rs6658405 T C 100 PASS AC=575;AF=0.114816;AN=5008;NS=2504;DP=11215;EAS_AF=0.0268;AMR_AF=0.0375;AFR_AF=0.2806;EUR_AF=0.0258;SAS_AF=0.1278;AA=.|||;VT=SNP
1 13348171 rs6658405 A G 100 PASS AC=575;AF=0.114816;AN=5008;NS=2504;DP=12697;EAS_AF=0.0159;AMR_AF=0.0303;AFR_AF=0.2935;EUR_AF=0.0328;SAS_AF=0.1196;AA=.|||;VT=SNP

Also, lots of distinct loci have a period as their variant ID. You can see them by running this:

zstdcat all_phase3.pvar.zst |grep -P '\t\.\t'

8    89329148    .    G    A    100    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=18482;EAS_AF=0;AMR_AF=0;AFR_AF=0.0008;EUR_AF=0;SAS_AF=0;VT=SNP
8    89329148    .    GTT    G    100    PASS    AC=4;AF=0.000798722;AN=5008;NS=2504;DP=18482;EAS_AF=0;AMR_AF=0;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;VT=INDEL
12    6015534    .    A    G    100    PASS    AC=5;AF=0.000998403;AN=5008;NS=2504;DP=16975;EAS_AF=0;AMR_AF=0.0029;AFR_AF=0;EUR_AF=0.003;SAS_AF=0;AA=-|||;VT=SNP
12    6018909    .    G    C    100    PASS    AC=17;AF=0.00339457;AN=5008;NS=2504;DP=16597;EAS_AF=0;AMR_AF=0.0029;AFR_AF=0.0113;EUR_AF=0;SAS_AF=0;AA=G|||;VT=SNP
12    6027148    .    A    G    100    PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=22980;EAS_AF=0;AMR_AF=0;AFR_AF=0;EUR_AF=0.001;SAS_AF=0;AA=A|||;VT=SNP
12    6030041    .    T    A    100    PASS    AC=3;AF=0.000599042;AN=5008;NS=2504;DP=15974;EAS_AF=0;AMR_AF=0;AFR_AF=0.0023;EUR_AF=0;SAS_AF=0;AA=T|||;VT=SNP
12    6032289    .    C    G    100    PASS    AC=2;AF=0.000399361;AN=5008;NS=2504;DP=16153;EAS_AF=0;AMR_AF=0;AFR_AF=0.0015;EUR_AF=0;SAS_AF=0;AA=G|||;VT=SNP
...

So now that duplicate variant IDs are established as relatively common in an important dataset, here's my suggestion:

Can --extract and --write-snplist options be updated to include chromosome number and position, so filtering can be more accurate? I'm not bothered by multiallelic cases (i.e. when a multiallelic variant is shown taking up multiple rows as "8 89329148 ." above), but when my SNP list says to keep "12 6032289 ." I don't want to also keep "12 6030041 ."

Thanks!

-Alex

Christopher Chang

unread,

Feb 6, 2019, 3:28:40 PM2/6/19

to plink2-users

--extract is working properly. If you need unique IDs, use plink2's --set-all-var-ids flag to generate position/allele-based IDs, and follow up by identifying remaining duplicates with Unix sort/uniq and removing them.

Alejandro Ochoa

unread,

Feb 6, 2019, 3:34:07 PM2/6/19

to plink2-users

By the way, I've been using plink 2.0 alpha, and I'd be personally happy if only that version gets my requested feature/fix. Thanks!

Christopher Chang

unread,

Feb 6, 2019, 3:40:01 PM2/6/19

to plink2-users

In addition, there is already an "--extract range" mode (in both plink 1.9 and 2.0) which performs position- rather than ID-based extraction.

Alejandro Ochoa

unread,

Mar 1, 2019, 9:26:50 AM3/1/19

to plink2-users

Thank you for these pointers, I found --set-all-var-ids very useful! And of course plink2 overall is super useful, I appreciate all the work you're putting into it.

In my case I was generating a list of millions of SNPs to keep (goal is to ascertain for biallelic loci polymorphic in YRI), so "--extract range" wasn't helpful in that setting.

For everybody else, in case this helps, I had to do things in several steps.

1) Rewrite missing SNP IDs with "chr:pos" so these are treated each as a different ID:

> plink2 --pfile all_phase3 vzs --set-missing-var-ids '@:#' --make-just-pvar zs --out all_phase3_uniq
> # replace data
> mv all_phase3.pvar.zst all_phase3_orig.pvar.zst
> mv all_phase3_uniq.pvar.zst all_phase3.pvar.zst

2) Remove all remaining duplicate IDs (I gave up with these, though originally I wanted to differentiate them too, they are very few and most of these are multiallelic but unmarked/unmerged properly due to a 1000 Genomes bug documented on their website).
> plink2 --pfile all_phase3 vzs --rm-dup exclude-all --write-snplist zs --out nodups

3) Apply my other filters (in this case all_phase3_YRI.psam has YRI samples only, also only want biallelic SNPs that are polymorphic in YRI)

> plink2 --pfile all_phase3 vzs --keep all_phase3_YRI.psam --extract nodups.snplist.zst --autosome --snps-only just-acgt --max-alleles 2 --mac 1 --write-snplist zs --out YRI

I found that I couldn't '--rm-dup exclude-all' together with the filters of step 3 because this uniqueness filter is set after SNPs are removed due to not being polymorphic biallelic SNPs, so it kept some IDs that are duplicated in the full 1000 Genomes (I wanted to remove all of those).

-Alex

Reply all

Reply to author

Forward