Oxford format with PLINK2


Gad Abraham

May 9, 2017, 10:51:10 AM
to plink2-users
Hi Chris,

I've been using PLINK2 to process imputed genotypes in Oxford format
(.bgen), and it's amazingly fast and useful, thanks for all the hard
work!

A couple of questions:

- Is it possible to export gzipped dosages (.gen.gz) natively, i.e.,
without writing an intermediate .gen file and then gzipping it?

- Can PLINK2 compute the impute INFO statistic
(http://www.well.ox.ac.uk/~gav/qctool/#documentation, under 'Info')?

Thanks again,
Gad

Christopher Chang

May 9, 2017, 12:05:02 PM
to plink2-users
1. While it would be straightforward to add this to PLINK2, I suspect it would create more problems than it solves.

The issue is, the .gen format includes separate P(hom ref), P(het), and P(hom alt) probabilities, whereas PLINK2 only keeps track of a single "alt dosage" value (since the other values take up quite a bit of space, while having minimal relevance to PLINK2's analysis functions).  I.e. {P(hom ref)=0.2, P(het)=0.52, P(hom alt)=0.28} and {P(hom ref)=0, P(het)=0.92, P(hom alt)=0.08} are both imported as "alt dosage = 1.08", and if you then ask PLINK2 to re-export this information, it'll always be represented as {P(hom ref)=0, P(het)=0.92, P(hom alt)=0.08}.  This "lossy compression" makes PLINK2 a poor choice for general-purpose .gen data management.
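
The collapse described above can be reproduced in a few lines. This is a minimal Python sketch, not PLINK2 code: the function names are hypothetical, and the re-export rule is only inferred from the single example given (dosage 1.08 comes back as {0, 0.92, 0.08}), i.e. all probability mass is placed on the two genotypes adjacent to the dosage.

```python
def alt_dosage(p_hom_ref, p_het, p_hom_alt):
    """Expected ALT allele count implied by a genotype-probability triple."""
    return p_het + 2.0 * p_hom_alt

def reexport_triple(dosage):
    """Hypothetical re-export rule, inferred from the example in the post:
    spread the dosage over the two nearest genotypes only."""
    if dosage >= 1.0:
        return (0.0, 2.0 - dosage, dosage - 1.0)
    return (1.0 - dosage, dosage, 0.0)

a = (0.2, 0.52, 0.28)
b = (0.0, 0.92, 0.08)

# Two different triples, one dosage (about 1.08 for both).
print(alt_dosage(*a), alt_dosage(*b))

# Re-exporting from the dosage alone cannot recover triple `a`;
# it yields (approximately) triple `b` in both cases.
print(reexport_triple(alt_dosage(*a)))
```

Once only the dosage is stored, there is no way to tell which of the two triples was in the input, which is why the round trip is lossy.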

What I can do is create a mini-qctool which uses PLINK2's multithreaded .bgen loader, and can directly export a .gen.gz without throwing away any data.

2. The impute INFO statistic requires all three probabilities, so PLINK2 cannot compute it (instead, it will have an option for computing the similar MaCH r^2 statistic).  However, this would also be straightforward to support in a mini-qctool.
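
To illustrate why INFO needs the full triples while a MaCH-style r^2 does not, here is a rough Python sketch. It uses the commonly cited definitions (IMPUTE INFO as one minus the ratio of summed per-sample genotype variance to 2Nθ(1−θ); MaCH r^2 as the empirical dosage variance over 2θ(1−θ)). The function names and the handling of monomorphic variants are assumptions for illustration, not what PLINK2 or qctool necessarily do.

```python
def impute_info(probs):
    """IMPUTE-style INFO for one variant.
    probs: list of (P(hom ref), P(het), P(hom alt)) triples, one per sample."""
    n = len(probs)
    e = [p_het + 2.0 * p_alt for _, p_het, p_alt in probs]  # E[ALT count]
    f = [p_het + 4.0 * p_alt for _, p_het, p_alt in probs]  # E[ALT count^2]
    theta = sum(e) / (2.0 * n)
    if theta <= 0.0 or theta >= 1.0:
        return 1.0  # monomorphic: defined as 1 in the usual formulation
    uncertainty = sum(fi - ei * ei for fi, ei in zip(f, e))
    return 1.0 - uncertainty / (2.0 * n * theta * (1.0 - theta))

def mach_r2(dosages):
    """MaCH-style r^2 for one variant, from ALT dosages only."""
    n = len(dosages)
    mean = sum(dosages) / n
    theta = mean / 2.0
    if theta <= 0.0 or theta >= 1.0:
        return float("nan")  # assumption: left undefined for monomorphic variants
    var = sum(d * d for d in dosages) / n - mean * mean
    return var / (2.0 * theta * (1.0 - theta))

# Same five samples, except the first sample's triple differs
# (both versions have ALT dosage of about 1.08).
variant_a = [(0.2, 0.52, 0.28), (1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
variant_b = [(0.0, 0.92, 0.08), (1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

dos_a = [h + 2.0 * a for _, h, a in variant_a]
dos_b = [h + 2.0 * a for _, h, a in variant_b]

print(impute_info(variant_a), impute_info(variant_b))  # differ
print(mach_r2(dos_a), mach_r2(dos_b))                  # agree up to rounding
```

The two variants are indistinguishable to anything that sees only dosages, yet their INFO values differ because the first sample's genotype is less certain in `variant_a`.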

Gad Abraham

May 9, 2017, 12:36:20 PM
to Christopher Chang, plink2-users
Right, I see the problem now.

How much work would it be to make a tool that can read .bgen and write
.gen/.gen.gz but also supports --extract and --keep? That would be
super useful for people using imputed data as existing tools are very
slow.

Gad

Christopher Chang

May 9, 2017, 12:43:56 PM
to plink2-users, chrch...@gmail.com
I can probably have that ready by next week, if only .gen{.gz} export, --extract, --keep, and INFO computation are needed.



Gad Abraham

May 9, 2017, 12:48:13 PM
to Christopher Chang, plink2-users
That would be really great, happy to test it out when you're ready.

Gad

Gad Abraham

May 11, 2017, 5:51:51 AM
to Christopher Chang, plink2-users
One more question, when using --dosage together with --score, does
PLINK internally hard-threshold the dosages before computing the score
or does it treat it as a proper dosage?

Thanks
Gad

Christopher Chang

May 11, 2017, 12:06:17 PM
to plink2-users, chrch...@gmail.com
PLINK 1.x --score does actually work properly with --dosage (no thresholding).



Gad Abraham

May 11, 2017, 12:14:32 PM
to Christopher Chang, plink2-users
Great!

And to confirm, the --bgen/--sample interface does do hard
thresholding internally when calling --score?

Christopher Chang

May 11, 2017, 12:15:42 PM
to plink2-users, chrch...@gmail.com
With v1.9, and not with v2.0.
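
As a toy illustration of what hard-thresholding changes in a profile score, the Python sketch below compares a weighted sum over raw dosages with the same sum over hard-called genotypes. It is not PLINK's implementation: the nearest-integer rule, the function names, the example dosages and weights, and the plain unaveraged sum are all assumptions made for the example.

```python
def score(allele_counts, weights):
    """Weighted sum of ALT allele counts (or dosages) for one sample."""
    return sum(w * g for g, w in zip(allele_counts, weights))

def hard_call(dosage):
    """Illustrative thresholding only: snap to the nearest integer genotype."""
    return float(round(dosage))

dosages = [1.08, 0.30, 1.90]   # made-up ALT dosages for three variants
weights = [0.5, -0.2, 0.1]     # made-up per-variant effect sizes

dosage_score = score(dosages, weights)
hard_score = score([hard_call(d) for d in dosages], weights)

print(dosage_score, hard_score)  # the two scores differ
```

The gap between the two scores grows with the number of uncertain genotypes and the size of their weights, which is why the distinction matters for imputed data.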



Gad Abraham

May 28, 2017, 11:39:39 PM
to Christopher Chang, plink2-users
Could you clarify something:

I'm using PLINK2 2017-05-26 to do profile scoring on bgen v1.1 files.
As far as I understand, the above ambiguity in the genotype
probabilities shouldn't affect scoring since the expected dosage is
identical.

In the .sscore output file, what is the column NMISS_ALLELE_CT? I'm
getting values lower than the number of alleles (2x number of SNPs),
but there shouldn't be missing alleles, unless PLINK2 is interpreting
some dosages as missing?

Christopher Chang

May 29, 2017, 1:49:06 PM
to plink2-users, chrch...@gmail.com
You're interpreting NMISS_ALLELE_CT correctly.  "0 0 0" entries in the .bgen file are interpreted as missing dosages, and chrY variants only contribute one allele per SNP for males and zero for females; are either of these present in your data?
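
The two cases mentioned above can be written out as a small counting rule. This is a hedged Python sketch for one sample, with a hypothetical function name and input layout; it covers only the cases stated here ("0 0 0" entries and chrY) and assumes every other chromosome is diploid, which ignores any special handling of chrX or MT.

```python
def nonmissing_allele_ct(variants, is_male):
    """Count non-missing alleles for one sample.
    variants: list of (chrom, (p_hom_ref, p_het, p_hom_alt)) tuples."""
    total = 0
    for chrom, probs in variants:
        if all(p == 0 for p in probs):
            continue                      # "0 0 0" entry: missing dosage
        if chrom == "Y":
            total += 1 if is_male else 0  # chrY: one allele for males, none for females
        else:
            total += 2                    # assumption: all other chromosomes are diploid
    return total

variants = [
    ("1", (0.0, 1.0, 0.0)),   # normal autosomal call: 2 alleles
    ("1", (0.0, 0.0, 0.0)),   # missing
    ("Y", (1.0, 0.0, 0.0)),   # chrY variant
]
print(nonmissing_allele_ct(variants, is_male=True))
print(nonmissing_allele_ct(variants, is_male=False))
```

Under this rule the count falls below 2x the number of SNPs whenever either case occurs in the data.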

Katherine Fawcett

Dec 5, 2017, 6:42:19 AM
to plink2-users
Hi Chris,

I was just wondering whether you wrote a mini-qctool for converting from bgen to gen.gz format (see e-mails below)?  If so, how do I use it (I'm new to PLINK...)?

Many thanks

Kath

Christopher Chang

Dec 5, 2017, 2:09:33 PM
to plink2-users
Didn't end up getting around to it back then, sorry.  But if qctool2, etc. still can't handle this, I should be able to do it soon.