BGEN file compatibility

Anthony Marcketta

Apr 19, 2018, 12:24:03 PM
to plink2-users
I am trying to use BGEN v1.2 files that I created with PLINK 2 in another program (BOLT-LMM). I'm not entirely sure whether the error comes from PLINK 2 or BOLT-LMM, but it looks like it may be a BGEN file formatting issue; see the logs below:

##Error from BOLT-LMM v2.3.2
Checking test.dosages.bgen
(with SAMPLE file test.dosages.sample)...
snpBlocks (Mbgen): X
samples (Nbgen): X
CompressedSNPBlocks: 1
Layout: 2
first snpID: 
first rsID: snp1
ERROR: snp1 has ploidy/missingness byte = 130 (not 2)


###Creating the BGENv1.2 format files used above
PLINK v2.00a2LM 64-bit Intel (16 Apr 2018)     www.cog-genomics.org/plink/2.0/
(C) 2005-2018 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test.dosages.log.
Options in effect:
  --autosome
  --export bgen-1.2 ref-first bits=8
  --memory 1000000000
  --out test.dosages
  --pfile test.dosages
  --keep samples.txt

Start time: Thu Apr 19 12:04:09 2018
64329 MB RAM detected; reserving 1000000000 MB for main workspace.
Allocated 100451 MB successfully, after larger attempt(s) failed.
Using up to 12 threads (change this with --threads).
X samples (X females, X males; X founders) loaded from test.dosages.psam.
X out of X variants loaded from test.dosages.pvar.
--keep: X samples remaining.
X samples (X females, X males; X founders) remaining after main filters.
X variants remaining after main filters.
Writing test.dosages.bgen ... done.
Writing test.dosages.sample ... done.
End time: Thu Apr 19 12:04:20 2018

If you can shed any light on this issue, it would be much appreciated. Thanks!

Christopher Chang

Apr 19, 2018, 12:49:49 PM
to plink2-users
I'm pretty sure that ploidy/missingness byte = 130 is the correct way to represent a missing diploid genotype (qctool/bgenie/etc. shouldn't complain?). So this may be a consequence of BOLT-LMM's bgen-1.2 support being limited to UK Biobank data, which has imputed dosages everywhere and no missing values: "WARNING: The BGEN format comprises a few sub-formats; we have only implemented support for the versions (and specific data layouts) used in the UK Biobank N=150K and N=500K releases. In particular, for BGEN v1.2, BOLT-LMM currently only supports the 8-bit encoding used for the UK Biobank N=500K data." However, let me know if any other programs complain about the .bgen files emitted by --export bgen-1.2.
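
For readers decoding the error message: as far as I understand the BGEN v1.2 (layout 2) spec, each sample gets one ploidy/missingness byte, where the low 6 bits give the ploidy and the top bit flags a missing genotype. A minimal sketch of that decoding (the function name is made up for illustration, it is not part of any tool mentioned here):

```python
from typing import Tuple


def decode_ploidy_missingness(byte: int) -> Tuple[int, bool]:
    """Split a per-sample ploidy/missingness byte into (ploidy, is_missing).

    Assumes the BGEN v1.2 layout-2 convention: ploidy in the least
    significant 6 bits, missingness in the most significant bit.
    """
    if not 0 <= byte <= 255:
        raise ValueError("expected a single unsigned byte")
    ploidy = byte & 0x3F            # keep the low 6 bits
    is_missing = bool(byte & 0x80)  # test the top bit
    return ploidy, is_missing


print(decode_ploidy_missingness(130))  # (2, True)
print(decode_ploidy_missingness(2))    # (2, False)
```

So 130 = 128 + 2 reads as "diploid, missing", while the value 2 that BOLT-LMM expects reads as "diploid, not missing".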

Oveis Jamialahmadi

Apr 17, 2020, 3:54:52 PM
to plink2-users
Hi Chris,

I have a question about the "bits=" option in --export. Is there any particular reason that you set the default to 16 bits (and not 8)? When I check BGENIX output (e.g. bgenix -g X.bgen -incl-rsids RS > Y.bgen), it's in 8 bits (as also in v1.2 probabilities are in bytes).

Thanks
Oveis

Christopher Chang

Apr 17, 2020, 4:01:37 PM
to plink2-users
~16 has been the historical norm for BGEN.  8 was chosen for UK Biobank due to the huge size of that dataset, so some tools designed primarily for it may also default to 8.
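
To put rough numbers on the bits= trade-off: in BGEN v1.2 each stored probability is an integer out of 2^bits - 1, so the finest step is 1/255 at 8 bits and 1/65535 at 16 bits. The sketch below uses plain rounding of a single value, which is a simplification (the spec's rounding rule also keeps each sample's probabilities summing to 1). The function names are hypothetical.

```python
def quantization_step(bits: int) -> float:
    """Smallest representable probability increment at a given bit depth."""
    return 1.0 / (2 ** bits - 1)


def quantize(p: float, bits: int) -> float:
    """Round one probability to the nearest representable value.

    Simplified: real BGEN encoders round a whole triplet jointly so that
    the stored integers still sum to 2**bits - 1.
    """
    scale = 2 ** bits - 1
    return round(p * scale) / scale


for bits in (8, 16):
    p = 0.123456
    print(bits, quantization_step(bits), abs(quantize(p, bits) - p))
```

The printed error at 8 bits is bounded by half of 1/255, which is why a file that was already 8-bit (like the UK Biobank release) gains nothing in precision from being re-exported at 16 bits.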

Oveis Jamialahmadi

Apr 17, 2020, 4:36:37 PM
to plink2-users
Thanks for your response!

Please consider the following two situations:
  1. I have a BGEN v1.2 file (from UKB). I extract/export 1 variant and read dosage values (using rbgen):
    plink2 --bgen XX.bgen ref-first --sample XX.sample --extract rslist.txt --export bgen-1.2 --out YY
    d1 = bgen.load('YY.bgen', rsrange)
  2. I read the same variant from the original BGEN file:
    d2 = bgen.load('XX.bgen', rsrange)

When I compare these two, the genotype probabilities differ for a number of samples (please note that I don't mean differences due to round-off).
What could be wrong here?
Thanks/Oveis

Christopher Chang

Apr 17, 2020, 5:20:07 PM
to plink2-users
Dosages vs. genotype probabilities: see the boldfaced sentence under https://www.cog-genomics.org/plink/2.0/input#vcf .  (I'll add another copy of that statement under the --bgen documentation today.)
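
A small illustration of why a round trip through dosages cannot preserve genotype-probability triplets: a biallelic dosage is a single number, and many different triplets map to the same value. This is only a sketch; the helper name and the example triplets are invented, and it does not reproduce how PLINK 2 chooses the probabilities it writes on export.

```python
def alt_dosage(p_ref_ref: float, p_ref_alt: float, p_alt_alt: float) -> float:
    """Expected ALT allele count for one diploid sample."""
    return p_ref_alt + 2.0 * p_alt_alt


# Two clearly different (hypothetical) triplets ...
triplet_a = (0.25, 0.50, 0.25)
triplet_b = (0.50, 0.00, 0.50)

# ... that collapse to the same dosage.
print(alt_dosage(*triplet_a), alt_dosage(*triplet_b))  # 1.0 1.0
```

Once only the dosage is kept, there is no way to tell which of these triplets was in the original file, so the probabilities in a re-exported BGEN can differ even when the dosages match.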

Oveis Jamialahmadi

Apr 17, 2020, 5:33:21 PM
to plink2-users
Yes, I saw it, but I assumed it only applied to VCF. Thanks for the clarification. Given that, is there any way to use PLINK to losslessly filter BGEN files for downstream analysis?
Just out of curiosity: I noticed that the probability differences only occur for a few samples (~8000 out of ~500k). So why do those mismatches happen at all?

Best /Oveis 

Oveis Jamialahmadi

Apr 17, 2020, 5:49:22 PM
to plink2-users

Sorry for asking again, and for my naive question.

So, do you mean that there won't be a difference at the dosage level, or should we just accept that they are different (before vs. after filtering the original BGEN file)? I ask because the difference in probabilities implies different dosages (for those ~8000 samples):

[Attached image: plink2dosage.png]


Thanks/Oveis

Christopher Chang

Apr 17, 2020, 5:58:25 PM
to plink2-users
Can you provide a concrete example of a "different dosage"?  (You can post a dataset with e.g. just 1 variant.)

Oveis Jamialahmadi

Apr 17, 2020, 6:14:23 PM
to plink2-users

Chris, since the data are from UKB, I'm not sure how much freedom I have to share them. But in my example, I'm using rs370652263 (chr 22, pos: 51237712) from UKB imputed genotypes v3. I just calculated dosage from genotype probabilities of d1 and d2 variables (in my example above) as dosage1 = 2*d1(alt/alt) + d1(alt/ref), and similarly for dosage2. Among 487409 samples in original BGEN file, there are 8457 samples with "different dosage values" between dosage1 and dosage2 (plotted in my previous response).
 Is there something wrong with my approach? 


Best/Oveis

Christopher Chang

Apr 17, 2020, 6:19:48 PM
to plink2-users
You can literally create a one-sample, one-variant file from scratch with a genotype-probability-triplet which is handled incorrectly.  Please provide SOME concrete example, otherwise I can't help you.

Oveis Jamialahmadi

Apr 17, 2020, 7:45:40 PM
to plink2-users

Sorry for my mistake, I carelessly calculated dosage, so that's why I saw the "false" difference. Indeed they are the same, as I show in the following example.
I use the example bgen files from here, with a slight modification on samples (to be compatible with PLINK2). I've attached all needed files for the following lines:
  • Starting from example.8bits:
    plink2 --bgen example.8bits.bgen ref-first --sample example.8bits.sample --extract rs.txt --export bgen-1.2 "bits=8" --out yy
  • Corresponding PLINK dosages can be generated as:
    plink2 --bgen yy.bgen ref-last --sample yy.sample --export A --out dos1
    plink2 --bgen example.8bits.bgen ref-first --sample example.8bits.sample --extract rs.txt --export A --out dos2
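
One way to check that dos1.raw and dos2.raw agree is to compare them per sample. The sketch below assumes the usual --export A layout (a header row with FID/IID/PAT/MAT/SEX/PHENOTYPE followed by one column per variant, with NA for missing values) and that both files count the same allele. The function names and toy data are made up; real files would be opened with open('dos1.raw') rather than io.StringIO.

```python
import io

# Assumed non-genotype columns in a .raw file; anything else is a variant.
META = {"FID", "IID", "PAT", "MAT", "SEX", "PHENOTYPE"}


def read_raw(handle):
    """Return {IID: [dosage or None, ...]} from a whitespace-delimited .raw file."""
    header = [name.lstrip("#") for name in handle.readline().split()]
    iid_col = header.index("IID")
    var_cols = [i for i, name in enumerate(header) if name not in META]
    table = {}
    for line in handle:
        fields = line.split()
        if fields:
            table[fields[iid_col]] = [
                None if fields[i] == "NA" else float(fields[i]) for i in var_cols
            ]
    return table


def count_mismatches(a, b, tol=1e-6):
    """Count dosage entries that differ between two tables, for shared samples."""
    n = 0
    for iid in a.keys() & b.keys():
        for x, y in zip(a[iid], b[iid]):
            if (x is None) != (y is None):
                n += 1
            elif x is not None and abs(x - y) > tol:
                n += 1
    return n


# Toy stand-ins for dos1.raw and dos2.raw.
head = "FID\tIID\tPAT\tMAT\tSEX\tPHENOTYPE\trs1_A\n"
dos1 = read_raw(io.StringIO(head + "1\ts1\t0\t0\t1\t-9\t0.5\n1\ts2\t0\t0\t2\t-9\t1.25\n"))
dos2 = read_raw(io.StringIO(head + "1\ts1\t0\t0\t1\t-9\t0.5\n1\ts2\t0\t0\t2\t-9\t1\n"))
print(count_mismatches(dos1, dos2))  # 1 (only s2 differs)
```

A result of 0 on the real files would match the conclusion here that the dosages are the same.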

[Attached image: untitled.png]


However, there are 218 samples (of 500) with different genotype probabilities (as you already mentioned). 

So again sorry for my mistake, and thanks for your help.


Best/Oveis

[Attachment: plink.bgen.example.tar]