Treating vcf-half-calls as "half-calls"


Nils Paffen

Feb 14, 2024, 10:16:47 AM
to plink2-users
The --vcf-half-call parameter's documentation states:

"The current VCF standard does not specify how '0/.' and similar GT values should be interpreted."

Therefore I'm missing an option to treat half-calls as what they are: haplotype information, where one haplotype is missing in a diploid organism.

Use case?
Consider that one has run global or local admixture inference on a large sample and now wants to calculate admixture-related PRS for each origin. Would it be possible to add a mode to this parameter so that the information is kept as-is, even if this could cause trouble with some other functions? It would have to be used with caution and might need an "Only use this if you know what you are doing" warning.

Best,
Nils

Christopher Chang

Feb 14, 2024, 11:23:28 AM
to plink2-users
This use case is already covered by e.g. "<DEL>" allele codes.  plink 2.0 supports multiallelic variants.
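As a sketch of what this <DEL>-based representation could look like at the VCF level: append "<DEL>" to the variant's ALT column and point each missing haplotype at it. The function name and allele numbering below are illustrative assumptions, not plink2 behavior:

```python
def encode_half_call_as_del(gt: str, n_alt: int) -> str:
    """Rewrite a half-call GT (e.g. '0|.') so the missing haplotype
    becomes an explicit <DEL> allele, assumed to be appended to the
    ALT column as allele index n_alt + 1. Diploid GTs only."""
    del_idx = str(n_alt + 1)           # index of the appended <DEL> allele
    sep = "|" if "|" in gt else "/"
    # replace only missing haplotypes; known alleles are left untouched
    return sep.join(del_idx if a == "." else a
                    for a in gt.split(sep))
```

For a biallelic variant (one existing ALT), `encode_half_call_as_del("0|.", 1)` yields `"0|2"`, with allele 2 being the appended `<DEL>`; phased separators are preserved.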

Nils Paffen

Feb 14, 2024, 11:56:10 AM
to plink2-users
Great. How would I call plink2 to convert a VCF that contains GTs like 0|. to a valid plink2 format using the existing approach?

Nils Paffen

Feb 14, 2024, 11:56:55 AM
to plink2-users
While keeping the half-call information, of course.

Nils Paffen

Feb 14, 2024, 12:09:47 PM
to plink2-users
After giving it a second thought, using the haploid option is good enough, since it does not matter whether we score 1/. or 1/0. Thanks for your reply anyway!

Nils Paffen

Apr 29, 2024, 7:48:22 AM
to plink2-users
Hi Chris,

So I tried this and it does exactly what I want, but it blows up the file size by a factor of ~35. Did I miss something big here?

E.g., the original imputed dataset contains 11k+ samples and is about 22.2 GB for the pvar and pgen respectively. After I run my splitter program to get the partial genotypes, I end up with a pgen of 770+ GB. I expect this comes from the multi-character "<DEL>" allele at each position and the extra allele value "2" used to express a missing haplotype? Is there any better way to do this? I thought about an extra binary file that stores the "<DEL>" values in a format with a higher compression rate, perhaps a binary format such as pgen that just adds information about a "missing/<DEL>" haplotype at each sample's genotype and position.

Best, Nils
chrch...@gmail.com wrote on Wednesday, February 14, 2024 at 17:23:28 UTC+1:

Nils Paffen

Apr 29, 2024, 7:50:00 AM
to plink2-users
I ended up with a pgen of 770+ GB for just one origin*

Chris Chang

Apr 29, 2024, 10:18:01 AM
to Nils Paffen, plink2-users
How large is the corresponding VCF?

Did you convert every missing call in your file to <DEL> when some (most?) of them don’t have that meaning?


Nils Paffen

Apr 29, 2024, 10:55:23 AM
to plink2-users
About 1.2 TB. I converted every haplotype that is not part of the rfmix ancestry results (msp files) for a given population into <DEL>, yes. Most of that information does not represent <DEL> alleles, but I need a way to store the partial genotypes without losing the haplotype order (phasing) and to indicate whether a given haplotype is part of some population or not. Does this make sense to you?

Nils Paffen

Apr 29, 2024, 10:58:17 AM
to plink2-users
I'm aware that I do not need the phasing information for the partial polygenic scores I'm interested in, but we need it for further downstream analysis.

Chris Chang

Apr 29, 2024, 11:15:45 AM
to Nils Paffen, plink2-users
OK, there is no reason to prioritize more-efficient compression of this misrepresented data when it is already somewhat better than VCF.

Nils Paffen

Apr 29, 2024, 11:28:17 AM
to plink2-users
Do you have any idea how one could better represent this data so that the file size shrinks by an order of magnitude while plink2 still understands the context?

Christopher Chang

Apr 29, 2024, 12:49:12 PM
to plink2-users
Reread the last question I asked.

Nils Paffen

Apr 29, 2024, 12:59:59 PM
to plink2-users
I just followed the suggestion you gave in answer to my original question. I'm aware that this is a misrepresentation of the data, but plink2 does not provide any other way to handle this use case, right?

Nils Paffen

Apr 29, 2024, 1:02:04 PM
to plink2-users
To be more explicit: plink2 does not handle my use case in a more efficient way than the one you suggested, right?

Christopher Chang

Apr 29, 2024, 2:04:45 PM
to plink2-users
It looks like, when you wrote "where one haplotype is missing in a diploid organism" in the original post, you did not actually mean "haplotype is physically missing, i.e. deleted", which is what I thought you meant.  That is why I suggested <DEL>.  I never intended to suggest indiscriminately replacing ./. with double-<DEL> when it represents missing *information* rather than known deletion.

Generic data management for genotype likelihoods, which are defined in the VCF and BGEN specifications, is outside of plink2's scope.  When the likelihoods are needed for downstream analysis, you are expected to use e.g. bcftools for the relevant data handling.  plink2 performs lossy import (collapsing down to allele dosages) of this type of data.

Generic data management for your notion of half-missing call, which isn't even defined in either the VCF or BGEN specification, is even further outside plink2's scope than genotype likelihoods.  plink2 --vcf-half-call provides all the lossy import options anyone has asked for in 7 years.  As you noted, the result when importing with a sensible --vcf-half-call mode is compact.
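For reference, a rough sketch of what those lossy --vcf-half-call import modes do to a single GT field, per the plink2 documentation (e = error, h = haploid, m = missing, r = treat the missing half as reference). This is an illustrative approximation, not plink2 code:

```python
def import_half_call(gt: str, mode: str) -> str:
    """Approximate plink2 --vcf-half-call handling of a diploid GT
    such as '0/.' on import. Non-half-calls pass through unchanged."""
    sep = "|" if "|" in gt else "/"
    a, b = gt.split(sep)
    if "." not in (a, b) or (a == "." and b == "."):
        return gt                      # full call or fully missing: unchanged
    known = a if a != "." else b
    if mode == "e":
        raise ValueError(f"half-call encountered: {gt}")
    if mode == "h":
        return known                   # keep only the observed haplotype
    if mode == "m":
        return sep.join([".", "."])    # whole genotype becomes missing
    if mode == "r":
        # the missing half is assumed to be the reference allele (0)
        return sep.join([known, "0"]) if a != "." else sep.join(["0", known])
    raise ValueError(f"unknown mode: {mode}")
```

So, for example, `0/.` imports as `./.` under mode m, as a haploid `0` under mode h, and as `0/0` under mode r; each mode loses a different piece of the half-call information.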

Nils Paffen

Apr 29, 2024, 2:18:20 PM
to plink2-users
Yes. My initial description was sketchy at best, and I'm sorry for the confusion. The way you describe my situation in your post is exactly what I'm stuck with. My institute wants this data quite urgently, and I got the unpleasant task of creating the ancestry-split data from the main dataset; the split data should also be in a reasonably compressed format, if possible plink2 format, to run quicker downstream analyses, as bcftools is quite slow. So the 1.2 TB VCF with the pseudo-<DEL> is not an option. I guess I have to go with the '.' (missing) option and keep telling my institute that the compressed VCF is the best they can get, and that they have to use bcftools for calculating statistics such as allele frequencies. I'm open to any further suggestions, but from what you wrote in your last post it seems plink2 is a dead end here.

Christopher Chang

Apr 29, 2024, 2:24:22 PM
to plink2-users
plink2 is a dead end if you're looking for direct support for this bespoke workflow, yes.

But for allele frequency computations and the like, you may want to think harder about what can be done with *multiple* plink2 datasets, each imported with different --vcf-half-call modes.
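One way to read this hint (my interpretation, not something Christopher spells out): import the same VCF twice, e.g. with --vcf-half-call h and --vcf-half-call r, and compare per-mode allele frequencies, since the two modes count the half-call's missing side differently. A toy biallelic-only sketch of that comparison:

```python
def alt_freq(genotypes: list[str], mode: str) -> float:
    """ALT allele frequency under two half-call conventions.
    mode 'h': count only observed haplotypes (half-call keeps one allele);
    mode 'r': additionally count a half-call's missing side as REF.
    Biallelic diploid GTs only; illustrative, not plink2 output."""
    alt = total = 0
    for gt in genotypes:
        a, b = gt.replace("|", "/").split("/")
        for allele, other in ((a, b), (b, a)):
            if allele == ".":
                if mode == "r" and other != ".":
                    total += 1         # missing half of a half-call -> REF
                continue
            total += 1
            alt += allele != "0"
    return alt / total
```

For `["0/1", "1/.", "./.", "0/0"]` this gives 2/5 under mode h versus 2/6 under mode r; the gap quantifies how much the half-calls move the estimate.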

Nils Paffen

Apr 29, 2024, 2:47:58 PM
to plink2-users
Alright. Thanks for your time and support.