Dosage detection


Anthony Marcketta

May 21, 2018, 4:47:57 PM
to plink2-users
Hi Chris,

What is the easiest/safest way to determine if a given PGEN file contains both hardcalls and dosages, or only hardcalls?

Christopher Chang

May 21, 2018, 5:26:23 PM
to plink2-users
If the .pgen was generated by plink2, an inefficient but safe way is to run
  plink2 --pfile mydata --make-pgen erase-dosage --hard-call-threshold 0 --out nodosage
  plink2 --pfile mydata --make-pgen --out withdosage
  diff -q nodosage.pgen withdosage.pgen
The first command replaces all dosages with missing genotypes before recompressing the dataset, while the second command doesn't make any changes. If the two output files are identical, the input contains only hardcalls; if they differ, dosages are present. (You can almost always omit the second command and just compare mydata.pgen with nodosage.pgen, but the additional step protects against occasional minor changes to the compression algorithm.)
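
The comparison step above can be wrapped in a small shell function so that a script gets a readable answer instead of raw diff output. This is only a sketch: `compare_pgen` is a hypothetical helper name, not part of plink2, and it assumes both files were written by the same plink2 build, as in the two commands above.

```shell
# Hypothetical helper (not part of plink2): byte-compare the .pgen written
# with erase-dosage against the .pgen written without changes.
compare_pgen() {
  # $1: e.g. nodosage.pgen, $2: e.g. withdosage.pgen
  status=0
  cmp -s "$1" "$2" || status=$?
  case $status in
    0) echo "hardcalls only" ;;    # identical: erasing dosages changed nothing
    1) echo "dosages present" ;;   # files differ: dosages were erased
    *) echo "comparison failed (missing or unreadable file?)" >&2
       return 2 ;;
  esac
}

# Example, using the output prefixes from the commands above:
# compare_pgen nodosage.pgen withdosage.pgen
```

Treating cmp's exit status 2 separately avoids reporting "dosages present" when one of the files is simply missing.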

Yes, this is a bit silly.  I'll add a --pgen-stats command to plink2 today which reports this directly.

Christopher Chang

May 21, 2018, 6:32:13 PM
to plink2-users
--pgen-info implementation has been posted to GitHub; will upload new binaries tonight.



Anthony Marcketta

May 22, 2018, 9:10:39 AM
to plink2-users
Thanks Chris! This update is much appreciated.

sy06...@gmail.com

Sep 4, 2018, 4:34:12 PM
to plink2-users
Dear Chris, 

I have just used the --pgen-info command on a newly created pgen file (chr22 only) from UK Biobank. The resulting summary states "No phased hardcalls present".
Does this mean the pgen file does not contain any hardcall info? Could you please comment? From my understanding, the pgen file contains both dosages and hardcalls, so I am a little confused.

Below are the 2 log files: the first from creating the pgen file, the second from running the --pgen-info command.

Many thanks!

Cheers

Saliha


PLINK v2.00a2LM 64-bit Intel (30 Jul 2018)
Options in effect:
  --bgen ukb_imp_chr22_v3.bgen ref-first
  --make-pgen
  --memory 1600
  --out UKbb_imp_chr22
  --sample ukb31984_imp_chr22_v3_s487395.sample
  --threads 5


Random number seed: 1536090252
386743 MiB RAM detected; reserving 1600 MiB for main workspace.
Using up to 5 compute threads.
--bgen: 1255683 variants detected, format v1.2.
487409 samples imported from .sample file to UKbb_imp_chr22-temporary.psam .
--bgen: UKbb_imp_chr22-temporary.pgen + UKbb_imp_chr22-temporary.pvar written.
487409 samples (264362 females, 223033 males, 14 ambiguous; 487409 founders)
loaded from UKbb_imp_chr22-temporary.psam.
1255683 variants loaded from UKbb_imp_chr22-temporary.pvar.
Note: No phenotype data present.
Writing UKbb_imp_chr22.pgen ... done.
Writing UKbb_imp_chr22.pvar ... done.
Writing UKbb_imp_chr22.psam ... done.

End time: Tue Sep  4 20:09:16 2018




(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pgen UKbb_imp_chr22.pgen
  --pgen-info

Start time: Tue Sep  4 20:21:46 2018
386743 MiB RAM detected; reserving 193371 MiB for main workspace.
Using up to 40 threads (change this with --threads).
--pgen-info on UKbb_imp_chr22.pgen:
  Variants: 1255683
  Samples: 487409
  REF alleles are all known
  Maximum allele count for a single variant: not explicitly stored
  No phased hardcalls present
  Dosage present, none explicitly phased

End time: Tue Sep  4 20:21:50 2018

Christopher Chang

Sep 4, 2018, 4:38:28 PM
to plink2-users
No, this is just saying that the hardcalls are unphased; will try to rephrase this today so that the meaning is more obvious.
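
For scripted checks, the dosage line in the --pgen-info output quoted above can be picked out with grep. The sketch below uses a hypothetical helper, `pgen_has_dosage`, and assumes the line is worded exactly as in that log ("Dosage present, ..."); since the wording may be rephrased, other plink2 builds could print something different, so a missing match is reported as "not found" rather than as proof that no dosages exist.

```shell
# Hypothetical helper (not part of plink2): scan --pgen-info output on stdin.
# Assumption: the dosage line starts with "Dosage present", as in the log
# quoted in this thread; other plink2 builds may phrase it differently.
pgen_has_dosage() {
  if grep -q '^[[:space:]]*Dosage present'; then
    echo "dosage present"
  else
    echo "no 'Dosage present' line found"
  fi
}

# Example (file name taken from the log above):
# plink2 --pgen UKbb_imp_chr22.pgen --pgen-info | pgen_has_dosage
```

The hardcall phasing line ("No phased hardcalls present") is deliberately ignored here, because it says nothing about whether hardcalls or dosages exist.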

sy06...@gmail.com

Sep 5, 2018, 8:44:37 AM
to plink2-users
Thank you very much for clarifying, Chris!