Malformed PGEN issue

Anthony Marcketta

Mar 25, 2022, 5:00:50 PM
to plink2-users
Hi, I'm running into an issue with malformed PGEN files that I can't seem to figure out.

Firstly, the data validates perfectly fine: 
PLINK v2.00a3LM 64-bit Intel (22 Mar 2022)
Options in effect:
  --out ../gwvalidate
  --pgen ./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --validate

Hostname: ip-172-24-83-240
Working directory: /mnt/EBS/dev/imputationSAS/rerun_clean
Start time: Fri Mar 25 11:04:24 2022

Random number seed: 1648220664
127358 MiB RAM detected; reserving 63679 MiB for main workspace.
Using up to 16 threads (change this with --threads).
44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.
Validating
./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen...
done.

End time: Fri Mar 25 12:13:45 2022

However, when we run part of our QC pipeline on the data, plink2 reports that the .pgen is malformed. Rerunning the same command fails on a different variant each time. Below are two log files (the second is from an earlier run with the 23 Feb 2022 build):

PLINK v2.00a3LM 64-bit Intel (22 Mar 2022)     www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ./EAS/intermediate.log.
Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/COLLAB_Freeze_Two.EAS.samples.txt
  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 80000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --rm-dup exclude-mismatch
  --sort-vars

Start time: Thu Mar 24 09:16:09 2022
127358 MiB RAM detected; reserving 80000 MiB for main workspace.
Using up to 16 threads (change this with --threads).
44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.
Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ... 69%
Error: Failed to unpack (0-based) variant #211495665 in .pgen file.
You can use --validate to check whether it is malformed.
* If it is malformed, you probably need to either re-download the file, or
  address an error in the command that generated the input .pgen.
* If it appears to be valid, you have probably encountered a plink2 bug.  If
  you report the error on GitHub or the plink2-users Google group (make sure to
  include the full .log file in your report), we'll try to address it.
End time: Thu Mar 24 11:01:20 2022

____________________

PLINK v2.00a3LM 64-bit Intel (23 Feb 2022)     www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ./EAS/intermediate.log.
Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/COLLAB_Freeze_Two.EAS.samples.txt
  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 80000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --rm-dup exclude-mismatch
  --sort-vars

Start time: Tue Mar 22 23:18:57 2022
127358 MiB RAM detected; reserving 80000 MiB for main workspace.
Using up to 16 threads (change this with --threads).
44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/COLLAB_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.
Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ... 84%
Error: Failed to unpack (0-based) variant #257550605 in .pgen file.
You can use --validate to check whether it is malformed.
* If it is malformed, you probably need to either re-download the file, or
  address an error in the command that generated the input .pgen.
* If it appears to be valid, you have probably encountered a plink2 bug.  If
  you report the error on GitHub or the plink2-users Google group (make sure to
  include the full .log file in your report), we'll try to address it.
End time: Wed Mar 23 01:10:44 2022

We've been trying our best to troubleshoot this on our own, with no luck. We have regenerated the input file a few times, but it always seems to wind up malformed. Do you have any other tips for us to try?
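For anyone comparing many such logs, here is a minimal Python sketch that pulls the failing variant index out of each log and checks whether the runs failed at the same point. The regular expression is written against the error line quoted in the logs above; the function names are hypothetical helpers, not part of plink2.

```python
import re

# Pattern for the plink2 error line quoted in the logs above.
UNPACK_ERROR = re.compile(
    r"Failed to unpack \(0-based\) variant #(\d+) in \.pgen file"
)


def failing_variant(log_text):
    """Return the 0-based variant index from the unpack error, or None if absent."""
    match = UNPACK_ERROR.search(log_text)
    return int(match.group(1)) if match else None


def same_failure_point(log_texts):
    """True when every log reports an unpack error at the same variant index."""
    indices = [failing_variant(text) for text in log_texts]
    return None not in indices and len(set(indices)) == 1


# Error lines copied from the two logs above.
run_a = "Error: Failed to unpack (0-based) variant #211495665 in .pgen file."
run_b = "Error: Failed to unpack (0-based) variant #257550605 in .pgen file."

print(failing_variant(run_a), failing_variant(run_b))
print(same_failure_point([run_a, run_b]))
```

In practice each string would be the full contents of a .log file; a failure point that moves between identical runs is the symptom reported here.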

Christopher Chang

Mar 25, 2022, 5:10:25 PM
to plink2-users
I'm guessing this is related to --sort-vars; I would not be surprised if the command succeeded without that flag.

So, try removing that flag and performing the sort in a separate step.  If that succeeds, or it succeeds after adding "--threads 1", we can debug the --sort-vars interaction with your immediate workflow unblocked.  (I'll go ahead and generate some random data, and throw it at your command line, in an effort to reproduce the crash on my end.)

Christopher Chang

Mar 25, 2022, 5:21:48 PM
to plink2-users
Clarification: "--threads 1" is intended to be added to the second sort-vars-only step if it fails on the first try.  If the first everything-but-sort-vars step fails without it... well, you could still try adding "--threads 1" there and see if it gets the immediate job done, but the computational cost would be higher.  It is reasonable for you to wait for me to send you a debug build in that case.
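The two-step workaround described above can be sketched as follows. This is an illustration only: the file names and the helper functions are hypothetical, the option list is abbreviated from the original command, and the second step assumes that `--pfile` plus a plain `--make-pgen --sort-vars` is enough for the sort-only pass.

```python
import shlex


def split_sort_step(options, tmp_prefix, final_prefix, deferred=("--sort-vars",)):
    """Split one plink2 option list into a filter/write step and a sort-only step.

    options: list of (flag, [values]) pairs, excluding --out.
    deferred: flags to move from the first step into the second step.
    """
    kept = [(flag, vals) for flag, vals in options if flag not in deferred]
    moved = [(flag, vals) for flag, vals in options if flag in deferred]
    step1 = kept + [("--out", [tmp_prefix])]
    step2 = (
        [("--pfile", [tmp_prefix]), ("--make-pgen", [])]
        + moved
        + [("--out", [final_prefix])]
    )
    return step1, step2


def to_argv(options, binary="plink2"):
    """Flatten (flag, values) pairs into an argv list."""
    argv = [binary]
    for flag, vals in options:
        argv.append(flag)
        argv.extend(vals)
    return argv


# Abbreviated stand-in for the original command; paths are placeholders.
original = [
    ("--pgen", ["input.pgen"]),
    ("--pvar", ["input.pvar"]),
    ("--psam", ["input.psam"]),
    ("--keep", ["samples.txt"]),
    ("--indiv-sort", ["n"]),
    ("--make-pgen", ["erase-phase", "fill-missing-from-dosage"]),
    ("--sort-vars", []),
]

step1, step2 = split_sort_step(original, "intermediate_unsorted", "intermediate")
print(shlex.join(to_argv(step1)))
print(shlex.join(to_argv(step2)))
```

Passing `deferred=("--sort-vars", "--indiv-sort")` also moves the sample sort into the second step. If the sort-only step still fails, appending `("--threads", ["1"])` to `step2` matches the suggestion above.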

Christopher Chang

Mar 25, 2022, 5:36:10 PM
to plink2-users
Incidentally, this is almost certainly the same issue that you ran into in https://groups.google.com/g/plink2-users/c/7_OmvZYMeuE/m/kmzl0fseAgAJ .

Christopher Chang

Mar 26, 2022, 2:57:04 AM
to plink2-users
I have not been able to reproduce this crash with random phased-dosage data.  The two console-.log excerpts you copied made it to 69% and 84%, respectively, so it does seem to be a very rare case; maybe triggered by a single variant in your entire dataset.  More precisely, I'm guessing that there's some variant that, when processed by the --make-pgen --sort-vars code path (maybe --indiv-sort is also relevant), corrupts the reader's state in a way that isn't immediately detected, but is practically guaranteed to cause a crash before --make-pgen is done.

Anyway, first priority is to unblock your immediate workflow, which I hope separating out --sort-vars (and maybe --indiv-sort) does.  I'm ready to create a sequence of debug builds, etc. if/when you're able to investigate this further, but I understand if you don't have much bandwidth for that right now.

Christopher Chang

Mar 28, 2022, 12:58:21 AM
to plink2-users
A debug build aimed at this crash has been posted to GitHub; you can also use the prebuilt 64-bit Linux binary at https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20220327.zip .  If you run this build with the --debug flag added, additional debug-prints and asserts will occur along the --make-pgen + --sort-vars code path.

Anthony Marcketta

Apr 4, 2022, 11:39:35 AM
to plink2-users
Thanks very much for this response.

After removing both --sort-vars and --indiv-sort the command completed successfully and we are able to work with this properly formatted dataset.

See below for the log from running the previous command using this build with --debug on. Please let me know if there's anything more I can do to help debug.

PLINK v2.00a3LM 64-bit Intel (27 Mar 2022)
Options in effect:

  --allow-extra-chr
  --chr 1-22 x xy
  --debug

  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/COHORT_Freeze_Two.EAS.samples.txt

  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 80000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/COHORT_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/COHORT_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/COHORT_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --rm-dup exclude-mismatch
  --sort-vars

Hostname: ip-172-24-83-240
Working directory: /mnt/efs_v2/imputationSAS/rerun_clean
Start time: Sun Apr  3 14:21:36 2022

Random number seed: 1649010096

127358 MiB RAM detected; reserving 80000 MiB for main workspace.
Using up to 16 threads (change this with --threads).
44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/COHORT_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/COHORT_Freeze_Two.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.

Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
read_dosage_present: 1
read_phase_present: 0
read_dphase_present: 0
write_gflags: 16
nonref_flags_storage: 1
spgw_alloc_cacheline_ct: 3263941
max_vrec_len: 1613
write_mhc_needed: 1
write_block_size: 65536
Writing ./EAS/intermediate.pgen ...
Error: Failed to unpack (0-based) variant #239172884 in .pgen file.

You can use --validate to check whether it is malformed.
* If it is malformed, you probably need to either re-download the file, or
  address an error in the command that generated the input .pgen.
* If it appears to be valid, you have probably encountered a plink2 bug.  If
  you report the error on GitHub or the plink2-users Google group (make sure to
  include the full .log file in your report), we'll try to address it.

End time: Sun Apr  3 16:12:31 2022

Christopher Chang

Apr 8, 2022, 10:04:59 PM
to plink2-users
I've posted another debug build to GitHub and https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20220408.zip .

Try also adding "--randmem --seed 1" to your command line, and running the command twice to check whether or not this is enough to get it to fail at the same variant both times.

If you have additional time, and "--randmem --seed 1" is not enough to make the point-of-failure reproducible, try also adding "--threads 1".

Anthony Marcketta

Apr 19, 2022, 1:17:16 PM
to plink2-users
I tried the experiment you suggested; here are the results of 4 runs:

Run 1 | previous command but adding --randmem --seed 1 | failed

PLINK v2.00a3LM 64-bit Intel (8 Apr 2022)

Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --debug
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/Cohort.EAS.samples.txt

  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 60000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --randmem
  --rm-dup exclude-mismatch
  --seed 1
  --sort-vars

Hostname: ip-172-24-83-240
Working directory: /EBS/dev/imputationSAS/rerun_clean
Start time: Mon Apr 18 09:09:53 2022

127355 MiB RAM detected; reserving 60000 MiB for main workspace.
Using up to 32 threads (change this with --threads).

44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.

Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ...
Error: Failed to unpack (0-based) variant #222501167 in .pgen file.

You can use --validate to check whether it is malformed.
* If it is malformed, you probably need to either re-download the file, or
  address an error in the command that generated the input .pgen.
* If it appears to be valid, you have probably encountered a plink2 bug.  If
  you report the error on GitHub or the plink2-users Google group (make sure to
  include the full .log file in your report), we'll try to address it.
write-index: 50593792
previous read-index: 222501161
block_widx: 0
g_debug_get_raw: 8

End time: Mon Apr 18 10:54:50 2022

________________

Run 2 | same as Run 1 | failed at a different position

PLINK v2.00a3LM 64-bit Intel (8 Apr 2022)

Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --debug
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/Cohort.EAS.samples.txt

  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 60000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --randmem
  --rm-dup exclude-mismatch
  --seed 1
  --sort-vars

Hostname: ip-172-24-83-240
Working directory: /EBS/dev/imputationSAS/rerun_clean
Start time: Mon Apr 18 11:37:48 2022

127355 MiB RAM detected; reserving 60000 MiB for main workspace.
Using up to 32 threads (change this with --threads).

44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.

Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ...
Error: Failed to unpack (0-based) variant #211495665 in .pgen file.
You can use --validate to check whether it is malformed.
* If it is malformed, you probably need to either re-download the file, or
  address an error in the command that generated the input .pgen.
* If it appears to be valid, you have probably encountered a plink2 bug.  If
  you report the error on GitHub or the plink2-users Google group (make sure to
  include the full .log file in your report), we'll try to address it.
write-index: 48103424
previous read-index: 211495663
block_widx: 0
g_debug_get_raw: 8

End time: Mon Apr 18 13:21:25 2022

________________

Run 3 | previous command but adding --threads 1 | succeeded!

PLINK v2.00a3LM 64-bit Intel (8 Apr 2022)

Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --debug
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/Cohort.EAS.samples.txt

  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 60000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --randmem
  --rm-dup exclude-mismatch
  --seed 1
  --sort-vars
  --threads 1

Hostname: ip-172-24-83-240
Working directory: /EBS/dev/imputationSAS/rerun_clean
Start time: Mon Apr 18 14:38:38 2022

127355 MiB RAM detected; reserving 60000 MiB for main workspace.
Using 1 compute thread.

44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.

Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ... done.

End time: Mon Apr 18 16:40:48 2022

________________

Run 4 | same as Run 3 | failed at a new position

PLINK v2.00a3LM 64-bit Intel (8 Apr 2022)

Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --debug
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/Cohort.EAS.samples.txt

  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 60000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --randmem
  --rm-dup exclude-mismatch
  --seed 1
  --sort-vars
  --threads 1

Hostname: ip-172-24-83-240
Working directory: /EBS/dev/imputationSAS/rerun_clean
Start time: Tue Apr 19 09:43:07 2022

127355 MiB RAM detected; reserving 60000 MiB for main workspace.
Using 1 compute thread.

44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.

Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ...
Error: Failed to unpack (0-based) variant #251348498 in .pgen file.

You can use --validate to check whether it is malformed.
* If it is malformed, you probably need to either re-download the file, or
  address an error in the command that generated the input .pgen.
* If it appears to be valid, you have probably encountered a plink2 bug.  If
  you report the error on GitHub or the plink2-users Google group (make sure to
  include the full .log file in your report), we'll try to address it.
write-index: 57147392
previous read-index: 251348497
block_widx: 0
g_debug_get_raw: 8

End time: Tue Apr 19 11:33:13 2022

_________

Let me know if there's any more I can do to help!
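The debug fields printed by the three failing runs above can be compared with a short Python sketch. Two assumptions are made here and are not confirmed anywhere in this thread: that `write-index` counts variants already written, and that it is meaningful to relate it to the `write_block_size: 65536` value from the earlier --debug log and to the 69583052 variants remaining after filters.

```python
import re

WRITE_BLOCK_SIZE = 65536      # "write_block_size" from the earlier --debug log
VARIANTS_TO_WRITE = 69583052  # "variants remaining after main filters"

# Matches the two debug lines printed after the error message.
FIELD = re.compile(r"^(write-index|previous read-index): (\d+)$", re.MULTILINE)


def debug_fields(log_text):
    """Collect the debug fields from a log into a dict of ints."""
    return {name: int(value) for name, value in FIELD.findall(log_text)}


def summarize(log_text):
    """Return (whole blocks, offset within block, integer percent written).

    Assumes write-index counts variants written so far (not confirmed).
    """
    write_index = debug_fields(log_text)["write-index"]
    blocks, offset = divmod(write_index, WRITE_BLOCK_SIZE)
    percent = 100 * write_index // VARIANTS_TO_WRITE
    return blocks, offset, percent


# Debug lines copied from Runs 1, 2 and 4 above.
runs = {
    "Run 1": "write-index: 50593792\nprevious read-index: 222501161\n",
    "Run 2": "write-index: 48103424\nprevious read-index: 211495663\n",
    "Run 4": "write-index: 57147392\nprevious read-index: 251348497\n",
}

for name, text in runs.items():
    print(name, summarize(text))
```

Under those assumptions, all three failures land exactly on a multiple of 65536 (offset 0, in line with `block_widx: 0`), at roughly 72%, 69% and 82% of the output. Run 2 failed at the same variant, #211495665, as the first log in this thread, which stopped at 69%.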

Christopher Chang

Apr 21, 2022, 12:00:42 AM
to plink2-users
Thanks!

I've posted another debug build on GitHub and at https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20220420.zip .

Try the "--threads 1" debug configuration.  (--randmem/--seed 1 appears to be irrelevant here.)  Assuming it fails, try again and see if it fails in roughly the same place this time.

Anthony Marcketta

Apr 22, 2022, 1:45:08 PM
to plink2-users
I ran it with this build twice, and both runs produced the same log. It was hard to tell whether it failed in the same place, though.

PLINK v2.00a3LM 64-bit Intel (20 Apr 2022)

Options in effect:
  --allow-extra-chr
  --chr 1-22 x xy
  --debug
  --extract ./EAS/extract_list.snplist ./req_files/ALL.BRAVO_TOPMed_Freeze_8.MAF0.0001.variants.wchr.list ./req_files/TOPMED_HRC_1KG_UKB.REP_functional_variants.wchr.list
  --indiv-sort n
  --keep ./req_files/Cohort.EAS.samples.txt
  --maf 1e-30
  --make-pgen erase-phase fill-missing-from-dosage pvar-cols=-xheader,-maybequal,-maybefilter,-maybeinfo,-maybecm
  --memory 60000
  --out ./EAS/intermediate
  --output-chr 26
  --pgen ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pgen
  --psam ./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam
  --pvar ./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar
  --rm-dup exclude-mismatch

  --sort-vars
  --threads 1

Hostname: ip-172-24-83-240
Working directory: /EBS/dev/imputationSAS/rerun_clean
Start time: Thu Apr 21 11:47:15 2022

Random number seed: 1650556035

127355 MiB RAM detected; reserving 60000 MiB for main workspace.
Using 1 compute thread.
44002 samples (21829 females, 21746 males, 427 ambiguous; 44002 founders)
loaded from
./EAS/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.wSex.psam.
307475576 variants loaded from
./req_files/Cohort.GT_hg38.pVCF.rgcpid.TOPMED_dosages.HDS.pvar.
Note: No phenotype data present.
--extract: 75157537 variants remaining.
Note: Skipping --rm-dup since no duplicate IDs are present.
--keep: 679 samples remaining.
679 samples (421 females, 254 males, 4 ambiguous; 679 founders) remaining after
main filters.
Calculating allele frequencies... done.
5574485 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
69583052 variants remaining after main filters.
--indiv-sort: 679 samples reordered.
Writing ./EAS/intermediate.pvar ... done.
Writing ./EAS/intermediate.psam ... done.
Writing ./EAS/intermediate.pgen ... debug_bubble: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
write-index: 65540

End time: Thu Apr 21 12:43:18 2022

Christopher Chang

Apr 22, 2022, 1:50:23 PM
to plink2-users
By "same place", I meant similar write-index.

We've finally nailed this down to a buffer-overrun occurring at a known point in execution.  I might have enough information to fix the bug now; if not, it's likely that one more debug-build will do the trick.

Christopher Chang

Apr 22, 2022, 4:16:00 PM
to plink2-users
Attempted bugfix is posted to GitHub and https://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20220422a.zip .  Feel free to try your original command (with "--threads 16", etc.) with it.

Anthony Marcketta

Apr 25, 2022, 11:43:21 AM
to plink2-users
Using this build, we have been able to run the original command multiple times without any malformed pgen errors.

Thank you for your time and effort to help fix this issue.
