Thank you for all that info - I was unaware of "pvar-cols=+info" and the "PR" value.
Followup question:
Is there a way to distinguish between these two "PR" cases?
- both alleles are valid against the REF (e.g. most indels)
- neither allele is valid against the REF (in liftover applications, these generally result from strand changes)
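To make the two cases concrete, here is a rough Python sketch of the distinction I have in mind. It is only an illustration, not how plink2 decides on the PR flag: the function names, the category labels, and the idea of passing the reference as a plain string with a 1-based position are all my own assumptions. The strand check is applied only to single-base alleles, since reverse-complementing an indel also shifts its anchor position.

```python
# Hypothetical helper: classify a biallelic variant against a reference string.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def matches_ref(ref_seq, pos, allele):
    """True if `allele` equals the reference bases starting at 1-based `pos`."""
    start = pos - 1
    return ref_seq[start:start + len(allele)].upper() == allele.upper()

def classify(ref_seq, pos, a1, a2):
    """Return one of: one_valid, both_valid, neither_valid_strand_flip, neither_valid."""
    n_valid = matches_ref(ref_seq, pos, a1) + matches_ref(ref_seq, pos, a2)
    if n_valid == 1:
        return "one_valid"      # unambiguous: exactly one allele can be REF
    if n_valid == 2:
        return "both_valid"     # typical indel: one allele is a prefix of the other
    # Neither allele matches. For SNPs, test whether a complemented allele would.
    if len(a1) == 1 and len(a2) == 1:
        for allele in (a1, a2):
            if matches_ref(ref_seq, pos, allele.upper().translate(COMPLEMENT)):
                return "neither_valid_strand_flip"
    return "neither_valid"

if __name__ == "__main__":
    ref = "ACGTTTGA"                      # toy reference, position 1 is 'A'
    print(classify(ref, 4, "T", "TT"))    # indel-like, both alleles fit
    print(classify(ref, 2, "G", "A"))     # SNP on the opposite strand
```

The "both_valid" and "neither_valid_strand_flip" labels correspond to the two bullets above; only the second group is what I would want to delete or flip.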
For example, I have an input genotype fileset in which I've confirmed that every variant is valid against the reference. After running UCSC liftover, dropping the liftover failures, and applying --update-chr and --update-map, there are still a few thousand variants that are not valid against the NEW reference, mostly due to changes in the choice of REF allele between the old and new references. Applying --ref-from-fa nicely fixes perhaps 3/4 of those. The remainder appear to be nearly all cases of strand changes between the old and new references. I would like to pick those out (for either deletion or a --flip), but I cannot use the PR flag to select them, because it is mostly tagging the indels (which I presume are largely correct, since they were valid before the liftover).
Perhaps --ref-from-fa could have an option (e.g. 'trust-indels') that accepts (i.e. does NOT add the PR tag to) indels that are ambiguous (both alleles valid)? That would be useful for a case like mine, where I know I'm working with a fileset that had previously been 100% validated against a reference.
Of course, I recognize that even then, a few extreme edge cases will remain unresolvable:
- indels that were REF/ALT-flipped between reference versions
- A/T or C/G SNPs that were strand-flipped between the reference versions.
But I'll bet instances of those could be counted on one hand.
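For the second edge case, a quick way to check that guess would be to count the strand-ambiguous SNPs directly. A minimal sketch, assuming the standard six-column .bim layout (chromosome, variant ID, cM, position, allele 1, allele 2); the function names are hypothetical and this is not a plink2 feature:

```python
def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose alleles look the same on either strand."""
    return {a1.upper(), a2.upper()} in ({"A", "T"}, {"C", "G"})

def count_strand_ambiguous(bim_lines):
    """Count strand-ambiguous SNPs among whitespace-delimited .bim lines."""
    count = 0
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue                      # skip blank or malformed lines
        if is_strand_ambiguous(fields[4], fields[5]):
            count += 1
    return count

if __name__ == "__main__":
    # Toy records, not taken from my real fileset.
    demo = [
        "1 rsA 0 1000 A T",
        "1 rsB 0 2000 A G",
        "2 rsC 0 3000 G C",
        "2 rsD 0 4000 AT A",
    ]
    print(count_strand_ambiguous(demo))
```

This counts every A/T and C/G SNP in the file, so it only gives an upper bound on how many could have been silently strand-flipped between reference versions.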
Sorry this is so long - this topic is really confusing.
=============================================
On that warning message about the sort order, I updated my build, but I'm still seeing that message. Maybe I'm missing some parameter?
PLINK v2.00a5LM 64-bit Intel (31 May 2023) www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to CRIMSON_QC.LIFTTEMP2.log.
Options in effect:
--allow-extra-chr
--bfile CRIMSON_QC.LIFTTEMP1
--chr 1-22,X,Y
--make-pgen
--out CRIMSON_QC.LIFTTEMP2
--sort-vars
--update-map CRIMSON_QC.UCSC.OUT.bed 2 4
Start time: Mon Jun 5 20:49:20 2023
515437 MiB RAM detected, ~430642 available; reserving 257718 MiB for main
workspace.
Using 1 compute thread.
774 samples (774 females, 0 males; 774 founders) loaded from
CRIMSON_QC.LIFTTEMP1.fam.
712035 out of 712139 variants loaded from CRIMSON_QC.LIFTTEMP1.bim.
Note: No phenotype data present.
--update-map: 712035 values updated, 104 variant IDs not present.