Malformed pgen file

Rachel Kember

Jun 21, 2018, 2:46:02 PM
to plink2-users

I am trying to convert a VCF file to a bgen 1.1 file. I am using the following command:

plink2 --vcf test.vcf.gz dosage=DS --export bgen-1.1 --out outputfile

This has worked for five of my VCF files, but for the sixth one when it reaches 70% I get the error 'malformed pgen file':

PLINK v2.00a2LM 64-bit Intel (30 May 2018)     www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to allAmish705K.chrAll.log.
Options in effect:
  --export bgen-1.1
  --out outputfile
  --pgen test.pgen
  --psam test.psam
  --pvar test.pvar

Start time: Thu Jun 21 14:34:33 2018
128760 MiB RAM detected; reserving 64380 MiB for main workspace.
Using up to 24 threads (change this with --threads).
294 samples (0 females, 0 males, 294 ambiguous; 294 founders) loaded from
test.psam.
8597169 variants loaded from test.pvar.
Note: No phenotype data present.
Writing outputfile.bgen ... 70%
Error: Malformed .pgen file.

I tried to work around this error by going back to an earlier VCF file (a larger set of the same data), and found that it converts fine.

I then split the subset VCF file by chromosome and found that the error only happened with the chr11 file. I combined the chr11 file with chr10, converted it to bgen, and found that it worked. However, when I combine all chromosomes together it stops working again.

I tried converting the vcf to pgen, which worked fine. I then converted the pgen file to bgen, and got the malformed error at 70% again.

Any idea what is the cause of this error?

Thanks, Rachel

Christopher Chang

Jun 21, 2018, 2:52:32 PM
to plink2-users
Hi Rachel,

Can you retry this with the latest build?  There was a VCF-import bugfix earlier this month.

If it still fails, run with the --debug flag and post the .log file.

Rachel Kember

Jun 21, 2018, 3:10:17 PM
to plink2-users
Hi Christopher,

I re-ran with the new version and it still breaks. Output of the log file is as follows:

PLINK v2.00a2LM 64-bit Intel (20 Jun 2018)
Options in effect:
  --debug
  --export bgen-1.1
  --out allAmish705K.chrAll
  --vcf allAmish705K.chrAll.imputed.poly.R2_0.7_maf_0.01.vcf.gz dosage=DS

Hostname: compute0144
Working directory: /work-zfs/pbs/bsc/data/Imputation/Ambigen/allAmish705K/indiv_chrom
Start time: Thu Jun 21 14:59:55 2018

Random number seed: 1529607595

128760 MiB RAM detected; reserving 64380 MiB for main workspace.
Using up to 24 threads (change this with --threads).
--vcf: 8597169 variants scanned.
--vcf: allAmish705K.chrAll-temporary.pgen + allAmish705K.chrAll-temporary.pvar
+ allAmish705K.chrAll-temporary.psam written.

294 samples (0 females, 0 males, 294 ambiguous; 294 founders) loaded from
allAmish705K.chrAll-temporary.psam.
8597169 variants loaded from allAmish705K.chrAll-temporary.pvar.

Note: No phenotype data present.
Writing allAmish705K.chrAll.bgen ...
Error: Malformed .pgen file.

End time: Thu Jun 21 15:07:48 2018

It fails every time when the export reaches 70%:

Writing allAmish705K.chrAll.bgen ... 70%
Error: Malformed .pgen file.

Christopher Chang

Jun 21, 2018, 3:29:52 PM
to plink2-users
Okay.  Is it possible for you to send me the VCF to reproduce this bug with?

Rachel Kember

Jun 21, 2018, 3:33:06 PM
to plink2-users
Unfortunately not - this is not my data and there are restrictions on material transfer. And I don't have the issue with other VCF files, so sending you the others wouldn't help. It's always at 70% that it breaks, so I was wondering if there is a way to keep the temporary .pgen file so I can see if there is a particular line in the file that it gets stuck at each time? I'm happy to help debug if you let me know the steps to take.

Christopher Chang

Jun 21, 2018, 3:52:16 PM
to plink2-users
Okay. Try just --vcf + --out, then; this will create the intermediate .pgen, and you can follow up with --validate on that .pgen. Maybe that will give a variant number.
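
The two steps suggested here can be sketched as a small shell function. The helper name import_and_validate and the file names are illustrative and not from the thread; dosage=DS is carried over from the original command, and --pfile is the prefix form of --pgen/--pvar/--psam that appears later in this thread.

```shell
# Sketch: keep the intermediate .pgen from a VCF import, then validate it.
# import_and_validate is a hypothetical helper name; adjust paths as needed.
import_and_validate() {
  vcf=$1      # input VCF (plain or gzipped)
  prefix=$2   # output prefix for the .pgen/.pvar/.psam fileset

  # Step 1: import only. Per the thread, --vcf plus --out with no other
  # command leaves the intermediate fileset at "$prefix".
  plink2 --vcf "$vcf" dosage=DS --out "$prefix" || return 1

  # Step 2: validate the fileset written by step 1.
  plink2 --pfile "$prefix" --validate
}

# Example call (file names are placeholders):
# import_and_validate allAmish705K.chrAll.vcf.gz allAmish705K.chrAll
```

Stopping after a failed import avoids validating a stale fileset left over from an earlier run.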

Rachel Kember

Jun 21, 2018, 4:05:31 PM
to plink2-users
Thanks, that gives me the following error:

PLINK v2.00a2LM 64-bit Intel (20 Jun 2018)     www.cog-genomics.org/plink/2.0/

(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pgen allAmish705K.chrAll.pgen
  --psam allAmish705K.chrAll.psam
  --pvar allAmish705K.chrAll.pvar
  --validate

Start time: Thu Jun 21 16:04:19 2018

128760 MiB RAM detected; reserving 64380 MiB for main workspace.
Using up to 24 threads (change this with --threads).
294 samples (0 females, 0 males, 294 ambiguous; 294 founders) loaded from
allAmish705K.chrAll.psam.
8597169 variants loaded from allAmish705K.chrAll.pvar.
Validating allAmish705K.chrAll.pgen...
Error: Extra byte(s) in (0-based) variant record #1.
End time: Thu Jun 21 16:04:21 2018

Christopher Chang

Jun 21, 2018, 4:52:11 PM
to plink2-users
Hmm.  What is the output of --pgen-info on that .pgen?

Rachel Kember

Jun 21, 2018, 4:59:54 PM
to Christopher Chang, plink2-users
PLINK v2.00a2LM 64-bit Intel (20 Jun 2018)     www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pgen allAmish705K.chrAll.pgen
  --pgen-info
  --psam allAmish705K.chrAll.psam
  --pvar allAmish705K.chrAll.pvar

Start time: Thu Jun 21 16:59:18 2018

128760 MiB RAM detected; reserving 64380 MiB for main workspace.
Using up to 24 threads (change this with --threads).
294 samples (0 females, 0 males, 294 ambiguous; 294 founders) loaded from
allAmish705K.chrAll.psam.
8597169 variants loaded from allAmish705K.chrAll.pvar.
--pgen-info on allAmish705K.chrAll.pgen:
  Variants: 8597169
  Samples: 294
  REF alleles are all known
  Maximum allele count for a single variant: 2
  Phased hardcalls present
  Explicitly phased dosages present
End time: Thu Jun 21 16:59:20 2018

Christopher Chang

Jun 21, 2018, 5:29:39 PM
to plink2-users
Try creating a tiny VCF with just the header lines and first two genotype lines of test.vcf.gz (let me know if you want me to give you commands for this), and then check if plink2 still generates an invalid .pgen from that.
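
One possible way to build such a tiny VCF is a single awk pass that keeps every header line and stops after the first N genotype lines. This is only a sketch: clip_vcf is a made-up helper name, and the toy input exists solely so the example runs on its own; with real data the input would be test.vcf.gz.

```shell
# Sketch: keep all header lines plus the first N genotype (data) lines.
# clip_vcf is a hypothetical helper, not a plink2 feature.
clip_vcf() {
  n=$1   # number of data lines to keep
  # Header lines start with '#'; exit as soon as n data lines are printed,
  # so the rest of a large file is never read.
  awk -v n="$n" '/^#/ { print; next } seen < n { print; seen++; next } { exit }'
}

# Toy input (two header lines, three records) so the sketch is runnable.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > toy.vcf
printf '1\t100\trs1\tA\tG\t.\t.\t.\n1\t200\trs2\tC\tT\t.\t.\t.\n1\t300\trs3\tG\tA\t.\t.\t.\n' >> toy.vcf
gzip -c toy.vcf > toy.vcf.gz

# gzip -dc does the same job as zcat on a .gz file.
gzip -dc toy.vcf.gz | clip_vcf 2 > tiny.vcf
```

The resulting tiny.vcf can then be passed to plink2 --vcf in place of the full file.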

Rachel Kember

Jun 21, 2018, 8:50:57 PM
to plink2-users
No issues when I do that:

128760 MiB RAM detected; reserving 64380 MiB for main workspace.
Using up to 24 threads (change this with --threads).
--vcf: 2 variants scanned.
--vcf: test-temporary.pgen + test-temporary.pvar + test-temporary.psam written.

294 samples (0 females, 0 males, 294 ambiguous; 294 founders) loaded from
test-temporary.psam.
2 variants loaded from test-temporary.pvar.

Note: No phenotype data present.
Writing test.bgen ... done.
Writing test.sample ... done.

Christopher Chang

Jun 21, 2018, 9:01:35 PM
to plink2-users
Okay, thanks.

I'll work on producing a build with more debug logging that you can use.  In the meantime, one last thing to try: add "--threads 1" to the original VCF import command, and see if the problem still remains.

Rachel Kember

Jun 21, 2018, 9:16:48 PM
to plink2-users
Well I fixed it, but I don't know why this works. Following on from your idea, I just created a VCF file with all the variants using head:

zcat original.vcf.gz | head -8597193 > test.vcf

When I then ran:

plink2 --vcf test.vcf --export bgen-1.1 --out test

It worked:

128760 MiB RAM detected; reserving 64380 MiB for main workspace.
Using up to 24 threads (change this with --threads).
--vcf: 8597169 variants scanned.

--vcf: test-temporary.pgen + test-temporary.pvar + test-temporary.psam written.
294 samples (0 females, 0 males, 294 ambiguous; 294 founders) loaded from
test-temporary.psam.
8597169 variants loaded from test-temporary.pvar.

Note: No phenotype data present.
Writing test.bgen ... done.
Writing test.sample ... done.

Hopefully that helps with your testing!
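
The head count used above (8597193) minus the reported variant count (8597169) is 24, which would be the number of header lines if the count was chosen to cover the whole file. A sketch for deriving that line count instead of hard-coding it follows; total_lines is a hypothetical helper name and demo.vcf.gz is a toy file so the example runs on its own.

```shell
# Sketch: number of lines head must keep to cover the header plus a given
# number of variant records. total_lines is a hypothetical helper.
total_lines() {
  vcf_gz=$1
  n_variants=$2
  # Count header lines (those starting with '#').
  n_header=$(gzip -dc "$vcf_gz" | grep -c '^#')
  echo $((n_header + n_variants))
}

# Toy demo: two header lines and two records.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\n1\t100\n1\t200\n' | gzip -c > demo.vcf.gz
total_lines demo.vcf.gz 2

# With the real file this could be used as, for example:
# gzip -dc original.vcf.gz | head -n "$(total_lines original.vcf.gz 8597169)" > test.vcf
```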

Christopher Chang

Jun 21, 2018, 9:35:41 PM
to plink2-users
Thanks.  This implies there's some rare decompression bug.

If you can quickly check whether adding "--threads 1" makes this work even with the gzipped VCF, that would be helpful.

Rachel Kember

Jun 22, 2018, 11:32:44 AM
to plink2-users
Yes, adding "--threads 1" also solves the problem! However, when I just decompressed the file and tried without "--threads 1", it didn't work. So the original file, gzipped or decompressed, doesn't work with the default thread count. Writing it to a new file using head and then running it works. Using "--threads 1" works.

Christopher Chang

Jun 22, 2018, 11:47:56 AM
to plink2-users
Okay, that's very useful to know, and makes more sense: the exact positioning of the uncompressed files on your disk may result in systematic timing differences that cause a multithread race condition to be triggered for one file but not for the other file with identical contents.  I will focus on adding debug logging that is likely to detect such a race condition, then.
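
One way to probe for this kind of thread-dependent behavior is to import the same VCF twice, once with --threads 1 and once with the default thread count, and compare the resulting .pgen files byte for byte. This is a sketch under the assumption that a correct import produces identical output regardless of thread count; compare_thread_runs and the run_single/run_multi prefixes are made-up names, while --make-pgen and --threads are regular plink2 flags.

```shell
# Sketch: detect thread-count-dependent import output.
# compare_thread_runs is a hypothetical helper; output prefixes are arbitrary.
compare_thread_runs() {
  vcf=$1
  # Single-threaded import, which worked in this thread.
  plink2 --vcf "$vcf" dosage=DS --make-pgen --threads 1 --out run_single || return 1
  # Default (multithreaded) import of the same input.
  plink2 --vcf "$vcf" dosage=DS --make-pgen --out run_multi || return 1

  # Assumption: a correct import is deterministic, so any byte difference
  # between the two .pgen files is suspicious.
  if cmp -s run_single.pgen run_multi.pgen; then
    echo "identical"
  else
    echo "differ"
  fi
}

# Example call (file name is a placeholder):
# compare_thread_runs allAmish705K.chrAll.vcf.gz
```

Because a race condition may not trigger on every run, an "identical" result from one comparison does not rule the bug out.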

Christopher Chang

Jun 22, 2018, 6:34:39 PM
to plink2-users
Debug build is now posted.  This also makes --validate print more information about the "Extra byte(s)" error.

Joseph S Reddy

Oct 4, 2018, 12:42:53 PM
to plink2-users
Hi Chris,

I am having similar errors with the latest build (October 2) of PLINK 2 while importing a VCF (a Minimac3 VCF from the Michigan Imputation Server). This is only happening with a few chromosomes, not all.

I imported the vcf using "plink2 --vcf chr1.dosage.vcf dosage=DS --out chr1_vcfDosage". 

After this I ran "plink2 --pfile chr1_vcfDosage --validate" and it generates the following error "Error: Invalid unconditional phased-dosages for (0-based) variant #4". Complete log below:

--------------------------------------------------------------------------------------------------------------

$ plink2 --pfile chr1_vcfDosage --validate
PLINK v2.00a2LM 64-bit Intel (2 Oct 2018)      www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pfile chr1_vcfDosage
  --validate

Start time: Thu Oct  4 11:37:48 2018
775365 MiB RAM detected; reserving 387682 MiB for main workspace.
Using up to 80 threads (change this with --threads).
513 samples (0 females, 0 males, 513 ambiguous; 513 founders) loaded from
chr1_vcfDosage.psam.
3069931 variants loaded from chr1_vcfDosage.pvar.
Validating chr1_vcfDosage.pgen...
Error: Invalid unconditional phased-dosages for (0-based) variant #4.
End time: Thu Oct  4 11:37:48 2018
-----------------------------------------------------------------------------

I've tried using --threads 1 and this did not resolve the issue. Can you please help debug this issue?

Thanks,
Joseph. 

Christopher Chang

Oct 4, 2018, 12:49:55 PM
to plink2-users
Thanks for checking --threads 1 in advance!  If this bug is deterministic (you get the same bugged output every time with the same input), would it be possible for you to send me a VCF that I can reproduce the problem with?  Otherwise I'll have to generate random datasets and hope to get 'lucky'.

Christopher Chang

Oct 4, 2018, 2:15:49 PM
to plink2-users
Since the problem is occurring in the 5th variant, what happens if you clip the chr1 VCF file down to the first 5 variants and try to import that?  If that reproduces the error, the clipped VCF file would be really helpful; I was unable to quickly replicate it.
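
Because --validate reports a 0-based index, variant #4 is the fifth data line of the VCF. Before clipping, it can help to look at that record directly. The sketch below prints its eight fixed columns, the FORMAT column, and the first sample; show_record is a hypothetical helper name and toy5.vcf is a made-up file so the example runs on its own.

```shell
# Sketch: print one VCF record by its 0-based index among the data lines.
# show_record is a hypothetical helper; columns 1-10 are the eight fixed
# columns, FORMAT, and the first sample.
show_record() {
  vcf=$1
  idx=$2   # 0-based variant index, as reported by --validate
  # Drop header lines, print data line idx+1, then stop reading.
  grep -v '^#' "$vcf" | sed -n "$((idx + 1)){p;q;}" | cut -f1-10
}

# Toy input with six records so the sketch is runnable.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n'
  for i in 1 2 3 4 5 6; do
    printf '1\t%s00\trs%s\tA\tG\t.\t.\t.\tGT:DS\t0|1:1.0\t0|0:0.0\n' "$i" "$i"
  done
} > toy5.vcf

show_record toy5.vcf 4

# With the real file (name taken from the thread):
# show_record chr1.dosage.vcf 4
```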
