General Questions about "Phase"


Matthew Maher

Aug 3, 2022, 7:43:09 PM
to plink2-users
Short form of the question:
When PLINK2 documentation discusses "phase" (i.e. 'phased dosages'), should that be taken to mean: 
a.  related to the linkage of specific alleles across multiple variants
b.  the splitting of one value into two.  e.g. two 0-1 dosages as opposed to one 0-2 dosage
c.  both a and b?

Longer form with more detailed questions:
My understanding (quite possibly wrong!) of the VCF file spec is that the '|' versus '/' distinction is meant to be interpreted in conjunction with the PS ("phase set") values that define which variants are phased together. Or, according to the VCF spec doc, if no PS is given, all variants are assumed to be phased in a single giant set.  This latter situation is what I believe one generally expects in results from imputation servers, since my understanding is that they start by phasing the genotypes along entire chromosomes and then impute each haplotype separately.  Does PLINK load and preserve detailed "phase set" information, or just the one giant phase set?
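To make the '|' / '/' and PS reading above concrete, here is a minimal Python sketch that parses one sample column of a VCF record. The names `Call`, `parse_call`, and `same_phase_set` are hypothetical helpers for illustration, not part of PLINK or any VCF library. Treating two phased calls without PS as belonging to one set follows the reading described in the question; how to treat a call with PS next to one without is an assumption of this sketch.

```python
from typing import NamedTuple, Optional, Tuple


class Call(NamedTuple):
    alleles: Tuple[str, ...]   # allele indices as strings, e.g. ("0", "1")
    phased: bool               # True when GT uses '|'
    phase_set: Optional[str]   # PS value, or None when absent or '.'


def parse_call(format_field: str, sample_field: str) -> Call:
    """Parse one sample column given the record's FORMAT column."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    gt = values["GT"]
    phased = "|" in gt
    alleles = tuple(gt.replace("|", "/").split("/"))
    ps = values.get("PS")
    if ps == ".":
        ps = None
    return Call(alleles, phased, ps)


def same_phase_set(a: Call, b: Call) -> bool:
    """True when both calls are phased and carry the same PS value.

    Two phased calls with no PS at all compare equal (None == None),
    which models the "one giant phase set" reading. A call with PS
    versus one without is treated as different sets (assumption).
    """
    return a.phased and b.phased and a.phase_set == b.phase_set


if __name__ == "__main__":
    first = parse_call("GT:PS", "0|1:1001")
    second = parse_call("GT:PS", "1|0:1001")
    print(same_phase_set(first, second))
```

Only calls for the same sample on the same chromosome are meaningful inputs to `same_phase_set`; the sketch does not check that.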

If I have a minimac4 imputation output that I load using the "dosage=HDS" option, then sure enough I end up with a fileset for which a subsequent invocation of --pgen-info reports:
  Explicitly phased hardcalls present
  Explicitly phased dosages present

which sounds good.
Now if I want to get back the two 0-1 dosage values for a particular variant in some text file, how would I do that?
Doing --export A seems to output only a combined 0-2 dosage value.
Doing --export AD adds a second column which the documentation calls the 'dominant' value, but I'm unclear whether or how those two values can be converted to a pair of 0-1 dosages.
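As a small illustration of why a single combined 0-2 value cannot be turned back into a pair of 0-1 values: a minimac4 HDS entry holds two per-haplotype alternate-allele dosages, and the combined dosage is their sum. The sketch below is plain Python arithmetic, not PLINK code; the function name `combined_dosage` and the example values are made up for illustration.

```python
from typing import Tuple


def combined_dosage(hds: Tuple[float, float]) -> float:
    """Collapse a pair of per-haplotype dosages (each 0-1) into one 0-2 dosage."""
    h1, h2 = hds
    return h1 + h2


# Three different haplotype-level pictures...
pairs = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]

# ...that become indistinguishable once summed (rounded to hide float noise).
totals = [round(combined_dosage(p), 6) for p in pairs]

if __name__ == "__main__":
    for pair, total in zip(pairs, totals):
        print(pair, "->", total)
```

Since several distinct pairs map to the same total, the per-haplotype values have to come from a format that stores them separately.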

I also noticed that doing --export haps (the name sounded promising) refuses with:
Error: '--export haps' must be used with a fully phased dataset.
But since --pgen-info said that phased data was present, I assume the key word in the error is 'fully', which makes me wonder: 
Is there some way to list which variants contain phased versus unphased data? 
FWIW, I scanned the minimac4 VCF that I had loaded and there are no '/'s anywhere - it's all '|'s, so I feel like my fileset should be 'fully' phased.  But I'm probably misunderstanding....

I'm using version: PLINK v2.00a3.4LM 64-bit Intel (1 Aug 2022)

Thanks for any enlightenment and thanks for PLINK(2)!

Christopher Chang

Aug 3, 2022, 8:15:08 PM
to plink2-users
* plink2 does not currently preserve phase-set information, though there is a plan to eventually do so if this annotation is found to be broadly useful.
* Neither "--export A" nor "--export AD" report phase information.
* Yes, "--export haps" reports phase information.  If your minimac4 VCF contained dosages and you used "--vcf <filename> dosage=HDS" to import it, plink2 will take the HDS values seriously and treat e.g. HDS=0.49,0.51 as 0/1 unphased, even if the GT field says 0|1.  This may be why your dataset is being reported as not "fully phased" even though the VCF actually was.  The workaround is to remove "dosage=HDS" in this one file-management operation.
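The workaround in the last bullet can be sketched as a small Python wrapper that builds the two command lines side by side, so they can be inspected before anything is run. The flags (`--vcf`, `dosage=HDS`, `--make-pgen`, `--export haps`, `--out`) are the ones named in this thread or in the plink2 documentation; the file names, output prefixes, and helper function names are hypothetical.

```python
import shlex
import shutil
import subprocess
from typing import List


def import_with_hds_cmd(vcf_path: str, out_prefix: str) -> List[str]:
    """Import that keeps HDS dosages (the original import from the question)."""
    return ["plink2", "--vcf", vcf_path, "dosage=HDS",
            "--make-pgen", "--out", out_prefix]


def export_haps_cmd(vcf_path: str, out_prefix: str) -> List[str]:
    """Haps export straight from the VCF, with "dosage=HDS" left out so that
    phase is taken from the GT field alone."""
    return ["plink2", "--vcf", vcf_path, "--export", "haps", "--out", out_prefix]


def run(cmd: List[str]) -> None:
    """Run a command, failing early if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names; print the commands instead of running them.
    print(shlex.join(import_with_hds_cmd("imputed.vcf.gz", "imputed_hds")))
    print(shlex.join(export_haps_cmd("imputed.vcf.gz", "imputed_phased")))
```

Keeping the dosage import and the haps export as separate runs means the dosage-bearing fileset stays available for analyses that need it.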

Chris Chang

Aug 3, 2022, 9:17:58 PM
to Matthew Maher, plink2-users
1. Yes, it does support the "all-in-one-giant-phase-set" concept; sorry about not stating that in my previous response.  There aren't many plink2 commands which use this information yet; --ld is one that does, and which illustrates how other LD-computing plink2 commands will treat phase when they are implemented.

2. If you're talking specifically about the "--export A" format, you do have to perform an intermediate conversion; "--make-pgen erase-dosage" is a more efficient way to do this than --make-bed.
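The two-step route in point 2 might look like the following when written out; this is a sketch that only assembles the command lines. `--pfile`, `--make-pgen erase-dosage`, `--export A`, and `--out` are existing plink2 flags, while the prefixes `imputed`, `imputed_hardcalls`, and `genotypes` and the function name are hypothetical.

```python
import shlex
from typing import List


def hardcall_export_cmds(pfile_prefix: str, tmp_prefix: str,
                         out_prefix: str) -> List[List[str]]:
    """Return the two plink2 invocations, in order:
    1. write a copy of the fileset with dosages erased (hardcalls kept);
    2. export that copy in the "--export A" text format.
    """
    return [
        ["plink2", "--pfile", pfile_prefix,
         "--make-pgen", "erase-dosage", "--out", tmp_prefix],
        ["plink2", "--pfile", tmp_prefix,
         "--export", "A", "--out", out_prefix],
    ]


if __name__ == "__main__":
    # Print the commands for review; pass each list to subprocess.run to execute.
    for cmd in hardcall_export_cmds("imputed", "imputed_hardcalls", "genotypes"):
        print(shlex.join(cmd))
```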

On Wed, Aug 3, 2022 at 5:57 PM Matthew Maher <mma...@broadinstitute.org> wrote:
Thanks for that info and quick response.   A couple follow-ups because I can tell I'm still failing to understand something:

While you say plink2 does not preserve phase-set information, I suspect you mean just those PS tags in VCFs; I'm guessing it IS supporting the all-in-one-giant-phase-set concept (in parallel with VCF format spec's description of having no PS values)?
I figure it must support that one-giant-phase-set concept, because otherwise, without any between-variant relationship, I don't understand what would be the difference between "0/1" and "0|1"? 
And if it does support that one-giant-phase-set concept, what are examples of plink2 functionalities that make use of that between-variant aspect of phase information (i.e. haplotypes of length > 1 )?  no doubt plenty; I just haven't yet encountered them...

Also:  If I have a PLINK2 fileset that contains dosages + genotypes and I want to export the genotypes (0/1/2), can that be done in one step?  I know I could first convert down to a PLINK1 fileset and then --export A, but perhaps there's a more direct way...

Thanks again!



Matthew Maher

Aug 3, 2022, 9:30:06 PM
to Chris Chang, plink2-users

Thanks again - quite helpful!