Liftover, etc.

298 views
Skip to first unread message

Matthew Maher

unread,
Jun 2, 2023, 7:15:59 PM6/2/23
to plink2-users
I've been streamlining a liftover process using PLINK* and a few comments/questions arise:

When applying --update-map, one gets the warning(s):

Warning: Base-pair positions are now unsorted!
Warning: Variants are not sorted by position.  Consider rerunning with the --sort-vars flag added to remedy this.


When the --sort-vars is added, everything works great, but the first of those two warning messages persists.  Which I guess makes sense as the data are momentarily unsorted.  But I wonder:  should that message be suppressed if --sort-vars is in the command?    Otherwise, the message gives a sense that something is wrong, but I don't believe anything is wrong.


I had to drop back to PLINK1 to get just --update-chr.  That seems odd to me since liftover involves possible changes to both Basepair Position (whose --update-map IS in PLINK2) and Chromosome.  That's no big deal, but I'm surprised that --update-map and --update-chr are not supported in parallel since I think liftover is the primary use case for both.  Again, not a big deal, just wondering if I'm missing something?


Since the difference between genome builds can involve changes to alleles and/or strands, getting those cleaned up is challenging.   Using --ref-from-fa does a wonderful job of correcting most cases where the reference allele changed between builds.  But there's always a few stragglers that are either ambiguous or neither allele matches the reference.  Does PLINK2 have some way to get a list of such ambiguous-AND/OR-invalid-against-the-REF variants?   This seems like it would be a subset of --ref-from-fa's functionality and would be very useful, allowing those stragglers to be dropped, resulting in a lifted-over fileset which is completely valid against the Target Ref.

Thanks for any info and thanks for PLINK*

Christopher Chang

unread,
Jun 3, 2023, 7:26:32 PM6/3/23
to plink2-users
1. Update your plink2 build: yes, I noticed that the --sort-vars part of that warning was silly when --sort-vars was going to be executed later in the same run, and I revised the warning-printing logic accordingly.
2. The data structure normally used by plink 1.9 and 2.0 to represent chromosomes makes the --update-chr operation relatively inconvenient to implement: it's a small table of chromosome IDs and start/end variant-indexes, not a big array with one chromosome-code entry per variant.  Simultaneously, it's straightforward to do that job with e.g. a short Python script operating on the .bim/.pvar.  This combination makes it a low priority compared to many other bits of missing functionality (e.g. merge...), though I have no intention of dropping the command.
3. If your .pvar has an INFO column, the variants that weren't disambiguated by --ref-from-fa will still have the INFO/PR ("provisional reference") flag set.  "--make-pgen pvar-cols=+info" can be used to force this column to exist.  You can then use e.g. "--require-no-info PR" (or bcftools) to filter these variants out.

Matthew Maher

unread,
Jun 5, 2023, 7:31:27 PM6/5/23
to Christopher Chang, plink2-users
Thank you for all that info - I was unaware of that "pvar-cols=+info" and the "PR" value. 
Followup question:
Is there a way to distinguish the "PR" cases of:  
- both alleles are valid against the REF (e.g. most indels)
- neither allele is valid against the REF (in liftover application, these generally result from strand changes)


For example, I have an input genotype fileset, all the variants of which I've confirmed are 100% valid against the reference.  After executing UCSC liftover, dropping the liftover-failures, --update-chr, --update-map, there are still a few thousand variants not valid against the NEW reference, mostly due to changes in choice of REF allele between the old+new references.  Applying --ref-from-fa nicely fixes perhaps 3/4 of those.   The remainders appear to be nearly all cases of strand-changes between the old+new references.  I would like to pick those out (for either deletion, or a --flip), but I cannot use the PR flag to select them as that is mostly tagging the indels (which I presume are largely correct since they were valid before the liftover).  

Perhaps if --ref-from-fa had an option (e.g. 'trust-indels') which would accept (i.e. NOT add the PR tag on) indels that are ambiguous (both alleles valid)? That would be useful for a case like mine where I know I'm working with a fileset that had previously been 100% validated against a reference.

Of course, I can certainly recognize that even still, there's a few super edge cases that will remain unresolvable:
- indels that were REF/ALT-flipped between reference versions
- A/T  or C/G  SNPs that were strand-flipped between the reference versions.
But I'll bet instances of those could be counted on one hand.

Sorry this is so long - this topic is really confusing.

=============================================

On that warning message about the sort order, I updated my build, but I'm still seeing that message.  Maybe I'm missing some parameter?

PLINK v2.00a5LM 64-bit Intel (31 May 2023)     www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to CRIMSON_QC.LIFTTEMP2.log.
Options in effect:
  --allow-extra-chr
  --bfile CRIMSON_QC.LIFTTEMP1
  --chr 1-22,X,Y
  --make-pgen
  --out CRIMSON_QC.LIFTTEMP2
  --sort-vars
  --update-map CRIMSON_QC.UCSC.OUT.bed 2 4


Start time: Mon Jun  5 20:49:20 2023
515437 MiB RAM detected, ~430642 available; reserving 257718 MiB for main
workspace.
Using 1 compute thread.
774 samples (774 females, 0 males; 774 founders) loaded from
CRIMSON_QC.LIFTTEMP1.fam.
712035 out of 712139 variants loaded from CRIMSON_QC.LIFTTEMP1.bim.
Note: No phenotype data present.
--update-map: 712035 values updated, 104 variant IDs not present.
Warning: Base-pair positions are now unsorted!

Writing CRIMSON_QC.LIFTTEMP2.pvar ... done.
Writing CRIMSON_QC.LIFTTEMP2.psam ... done.
Writing CRIMSON_QC.LIFTTEMP2.pgen ... done.
End time: Mon Jun  5 20:49:21 2023


Bonus Question
In the above, I also worry I'm misunderstanding something pertaining to "--allow-extra-chr":  my input file set has some of those alternative chromosome IDs (courtesy of liftover), and I want to drop them, as accomplished above with my "--chr ..." parameter.  But yet, the --aec is required, which feels odd.  I wonder if --aec should be assumed if --chr (or --autosome) are in effect?

--
You received this message because you are subscribed to the Google Groups "plink2-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to plink2-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/plink2-users/cf8b8e99-83ba-4142-808c-e14cd9b2a12bn%40googlegroups.com.

Chris Chang

unread,
Jun 5, 2023, 8:35:34 PM6/5/23
to Matthew Maher, plink2-users
1. —snps-only, in combination with —extract/—exclude/—write-snplist, is the main way to distinguish SNPs from indels for now.

In your case, you could start with “—require-info PR” + “—snps-only” + “—write-snplist” to write a list of just the SNPs you *don’t* want to keep, and then —exclude that list in the subsequent plink2 command.

2. Ok, I should have downgraded the Warning to a Note, in addition to removing the comment about —sort-vars.  I will clean that up in the next build.

Chris Chang

unread,
Jun 6, 2023, 12:37:02 AM6/6/23
to Matthew Maher, plink2-users
As for —aec… I’m pretty sure at this point that I should have turned it on by default in the first plink 2.0 build; it’s mostly just wasting users’ time, not adding meaningful error-checking value.  Better late than never: I will add a —prohibit-extra-chr flag to unambiguously specify the current behavior, and then turn —aec on by default in the next compatibility-breaking build.

On Mon, Jun 5, 2023 at 4:31 PM Matthew Maher <mma...@broadinstitute.org> wrote:
Reply all
Reply to author
Forward
0 new messages