managing dosage data in binary format

Mike Miller

Apr 25, 2014, 6:49:40 PM
to plink2 users
I have been working on the problem of managing dosage data in binary
format. Where do things stand with PLINK? If you already have it figured
out, I would love to see what you are doing. In case anyone can profit
from reading about my experience, I'm sharing a bunch of info below about
what I have been doing. I think my best idea is at the end -- encode
dosage using single-byte integers from 0 to 254, reserving 255 for a
missing-value code -- see "(6)" below and subsequent comments.

First, I have studied the PLINK bed binary format and figured out how to
extract data from it using a bash script. Given the bfile info and a
marker name, I can extract a single marker to the .raw (allele count)
format 5-10 times faster than the new PLINK can do it. The new PLINK does
this about 100 times faster than the old plink. The output is identical
for all three programs. The bash script mostly relies on xxd and a fast
grep to do the hard parts. The speed of the script surprised me a lot.
I will share it here later.
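
For anyone who just wants the gist, here is a minimal R sketch of the
same idea (an illustration only, not the bash script itself): seek
straight to the block for variant i in a SNP-major .bed file and decode
the 2-bit genotype codes into A1 allele counts. The file name, variant
index and sample count are placeholders.

read_bed_variant <- function(bed_path, i, n_samples) {
  bpv <- ceiling(n_samples / 4)              # bytes per variant (4 samples per byte)
  con <- file(bed_path, "rb")
  on.exit(close(con))
  seek(con, where = 3 + (i - 1) * bpv)       # skip the 3 header bytes, jump to variant i
  geno <- readBin(con, what = "raw", n = bpv)
  bits <- rawToBits(geno)                    # low-order bits first, so sample 1 comes first
  code <- as.integer(bits[c(TRUE, FALSE)]) + 2L * as.integer(bits[c(FALSE, TRUE)])
  # 2-bit codes: 00 = hom A1, 01 = missing, 10 = het, 11 = hom A2
  c(2L, NA, 1L, 0L)[code[seq_len(n_samples)] + 1L]   # A1 allele counts, .raw-style
}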

Second, having studied PLINK and come to an understanding of how binary
formats work (I really never looked into it, but it's pretty simple), I am
now rethinking what I have been doing with dosage data. This is what
happened to me:

(1) dosages came to me in text files, tab delimited, always with three
digits to the right of the decimal: 0.000 1.000 2.000 0.894 1.007, etc.

(2) I wanted to store the data in a compact format so that the dosage data
would fit on a small hard drive and be readable at least by R. I studied
a few systems and decided to use R's writeBin() with gzip to store
two-byte integers, multiplying dosage by 1000: 0 1000 2000 894 1007, etc.
I knew that the largest integer, 2000, was only using 11 bits, so there
were always 5 bits per dosage unused. I saw no way to store the data more
efficiently, but I thought gzip would compress away the unused parts of
those bytes so that they wouldn't use extra disk space. I stored the data
in multiple files and would always unzip on the fly and read the entire
file into memory in R using readBin().
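
For reference, a stripped-down sketch of that storage scheme in R (the
file name and example dosages here are made up):

dos <- c(0.000, 1.000, 2.000, 0.894, 1.007)              # example dosages
con <- gzfile("dosage_chunk.bin.gz", "wb")
writeBin(as.integer(round(dos * 1000)), con, size = 2)   # two-byte integers, 0-2000
close(con)

con <- gzfile("dosage_chunk.bin.gz", "rb")
ints <- readBin(con, what = "integer", size = 2, signed = FALSE, n = length(dos))
close(con)
ints / 1000                                              # back to three-digit dosages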

(3) When I figured out how to extract data from PLINK .bed files, I
realized that GNU/Linux/UNIX programs like xxd, od and tail (with -c+N)
can start reading from any byte of the stored file almost instantaneously.
I assume they can find the correct block somehow from the file system and
thereby avoid reading many gigabytes of data to find the starting
position. This means there is a big advantage to not gzipping the binary
files -- if they are gzipped, one must unzip to stdout and read through
that output stream to find a certain byte number, which takes forever.

(4) A second mistake I'd made was to store the binary data in order by
sample (individual) with markers ordered within sample. I guess they made
the same mistake in early PLINK, using "individual-major format." That
format is usually not good because it means that the data for any marker
are distributed all over the file. We often want to retrieve data for all
subjects for a single marker, rarely for all markers for a single subject.
Thus, we always want to use SNP-major storage so that data for one marker
can be extracted by reading one small, contiguous part of the file. My
trick for rapidly extracting data for a single marker is totally dependent
on the SNP-major format of the PLINK .bed file. I want to be able to do
the same kind of thing with dosages.
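
To make that concrete: for a hypothetical uncompressed, SNP-major
binary dosage file (a fixed-size header, then a fixed number of bytes
per sample per variant -- this layout is an assumption for
illustration, not an existing format), pulling one marker is a single
seek plus one small read:

read_dosage_variant <- function(path, i, n_samples, bytes_per_value = 2,
                                header_bytes = 0) {
  con <- file(path, "rb")
  on.exit(close(con))
  seek(con, where = header_bytes + (i - 1) * n_samples * bytes_per_value)
  readBin(con, what = "integer", size = bytes_per_value, signed = FALSE,
          n = n_samples)                    # size 1 or 2 works with signed = FALSE
}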

(5) Now that I want to do away with gzipping the binary data, I'm more
concerned than before about those unused bits. What I see is that the
dosage precision is a bit overstated -- we can't really make probability
statements to .001 when we only have a few hundred reference haplotypes,
so what's wrong with getting to within .004 of the dosage that MACH or
Beagle gave me? In other words, a one-byte integer should be almost as
good as a two-byte integer for these data. I would store integers from 0
to 255 and I would then multiply them by 2/255 to approximate the original
dosages. I tested this with 40,000 markers and saw that the smallest
correlation was .9986 (there were usually 4 or 5 nines). I also saw that
the one-byte approximation never decoded to a value that was more than
.004 away from the original three-digit dosage value. So the one-byte
integer seems like the efficient way to go, a little lossy, but quite good
enough considering the quality of the original data.

(6) One thing about my result with one-byte integers was bothering me.
They would always return 0.000 and 2.000 when the original data were 0.000
and 2.000, but when the original datum was 1.000, my one-byte scheme
always returned 1.004. To deal with that, I came up with a better scheme:
I would only encode dosage using numbers from 0 to 254 and I would decode
by multiplying by 2/254. Thus, 1.000 would be encoded as 127 (01111111) and
it would always decode to 1.000. I compared my 255 and 254 single-byte
encoding schemes using 40,000 markers and I found that 254 did the same or
better than 255 by every criterion. For one, use of 255 never decoded to
the exact original dosages for any marker (for all 7694 samples), but 254
sometimes did so. I assume this is because of the encoding of 1.000,
which is a fairly common dosage value. Thinking in terms of resolution,
1/255 and 1/254 are very similar, differing only by 1/64770, so we are
giving up almost nothing by going from 255 to 254, but we are gaining
precision where we need it at the midpoint.

Added bonus: I never have missing values in my dosage data, but that
certainly is not impossible in some use cases, so we need a missing value,
and we have one: 255. It's a win-win.
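
A minimal R sketch of the 254 scheme with 255 as the missing code (the
helper names and example values are arbitrary):

encode254 <- function(d) {
  k <- as.integer(round(254 * d / 2))   # 0.000 -> 0, 1.000 -> 127, 2.000 -> 254
  k[is.na(d)] <- 255L                   # reserved missing-value code
  k
}
decode254 <- function(k) {
  d <- k * 2 / 254
  d[k == 255L] <- NA
  d
}
decode254(encode254(c(0, 1, 2, 0.894, NA)))   # 1.000 comes back exactly; 0.894 -> about 0.898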

Best,
Mike

Christopher Chang

Apr 26, 2014, 1:36:33 AM
to plink2...@googlegroups.com, mbmi...@umn.edu
* Yes, the PLINK 1 binary format is essentially perfect for extracting one marker at a time, and I'm not surprised that a well-written script can do it even more quickly than PLINK 1.9 (the order of operations essentially forces PLINK to load the entire .bim file and sort the IDs, instead of just hunting for one variant ID).

* I have not made a final decision on how dosage data will be represented, and I'd appreciate others' input on this.  It will be similar to the final scheme you describe, though.  More precisely:

  - I was leaning toward using 2 bytes instead of one per dosage value, partly to avoid loss of precision when exchanging data with IMPUTE2.  However, you correctly note that the additional precision is practically irrelevant for association analysis, it results in doubled file sizes and computation times, and sequencing/variant calling/imputation are not yet accurate enough for the trailing digits to be very meaningful.  So if enough other people tell me that they're happy to throw away the third decimal place of their imputation results, and practically nobody voices the contrary opinion, I will make the default a 1-byte format instead.

  - Yes, missing data will be represented by all-1 binary values of the appropriate width.  (I sometimes want to make an exception for PLINK 1-style data, to reduce the amount of code I need to rewrite between PLINK 1.9 and 2.0, but it's better to bite the bullet and clean up everything I can in one shot.)

  - For computational reasons, I prefer to use a power of 2 to represent the maximal dosage.  This is painless when using a 2 byte format (nobody really cares about the loss of accuracy associated with going from 0-65534 to 0-32768), and it conveniently matches up exactly with the IMPUTE2 .bgen format.  (It's too bad that .bgen lacks some other features I need in the PLINK 2 core file format, otherwise I would have adopted it wholesale.)  But going from 0-254 to 0-128 may be too aggressive.  x/254 (or x/200) really is a substantially bigger pain to deal with than x/128, though.

  - When many dosage values are identical, there will be a "store only deviations from reference" compression option, but it will always be possible to disable this for e.g. easier loading from R.  And it will definitely be practical to use a modified version of your script to quickly extract dosages for specific variants from PLINK 2 binary files.

  - It may eventually be worthwhile to include some measure of uncertainty, rather than just point dosage estimates.  But that's outside PLINK 2.0's scope.

* There is a new "BGZF" compressed format that largely solves the compressed random access problem in a manner backward-compatible with gunzip.  PLINK 2 won't use it because gzip is overkill for the kind of binary data it handles ("deviations from reference" is faster and good enough), but it's appropriate for and supported by BCF2.

Mike Miller

Apr 28, 2014, 3:07:38 AM
to Christopher Chang, plink2...@googlegroups.com
Thanks so much for your fast reply and great ideas. I think we should do
a study of the effect of dosage approximation in GWAS. We want to be able
to argue that the rounding doesn't reduce power. The effect seems to be
minimal, but that might depend on sample sizes and effect sizes (e.g., a
big effect with a small sample might be most vulnerable).

(For the work I did below, I just realized that I should have used 128 to
see how that worked out. I did use 100 and 200, so interpolating between
those will give about the right answer. I would say the results are
pretty encouraging for 128. What I am calling a >25% error in the p-value
means a change in -log10(p) of about 0.1 or more, and that seems to be
very rare with 100. Regarding the computational issue -- can't analysis
be done using the integers with the beta and SE adjusted appropriately in
the output? In other words, you don't have to divide the integer
representations to convert them to appropriately-scaled dosages to do
analyses with them. If so, you could use 250, which is what I will
probably do with my R data.)
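
A toy R check of that scaling argument (simulated data, not the real
analysis below): regressing on the raw 0-250 integer codes gives the
same t statistic and p-value as regressing on the decoded 0-2 dosages,
with the beta and SE off by the constant factor 125 that can be fixed
up in the output.

set.seed(1)
dos  <- pmin(pmax(rnorm(500, 1, 0.5), 0), 2)            # fake 0-2 dosages
y    <- 0.3 * dos + rnorm(500)                          # fake phenotype
code <- round(250 * dos / 2)                            # 0-250 integer encoding
f_dos  <- summary(lm(y ~ I(code / 125)))$coefficients   # regress on decoded dosages
f_code <- summary(lm(y ~ code))$coefficients            # regress on the raw integers
f_dos[2, 3:4] - f_code[2, 3:4]                          # same t and p (up to rounding)
f_dos[2, 1:2] / f_code[2, 1:2]                          # beta and SE ratio = 125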

I happened to have some data readily available, so I took a quick look.
These data were for 51,850 imputed dosages (1000G, European, 2/2012) in
the region of a possible signal. Data were from 4,466 subjects using 15
covariates in a generalized least squares regression (because of family
structures). All R²s were > .25.

When I refer to p_2000, I mean the p-value derived from regression
analysis of the original dosages, encoded as integers from 0 to 2000. I
compare p-values derived from the identical analysis of the same dosages
after rounding by encoding dosages as integers from 0 to either 250, 200,
100, 50, 20, 10 or 2 (using 2 means rounding to the nearest genotype).

In this table I look only at markers with p<1e-7. The probability ratio
is the p-value after rounding the dosage divided by the p_2000:


Probability Ratios after Rounding
-------------------------------------------------------------
p_2000 R² 250 200 100 50 20 10 2
--------- ------- ------- ------- ------- ------- ------- ------- -------
3.277e-09 0.42178 1.00238 1.00417 0.99368 1.01864 1.01171 1.12193 1.14666
3.958e-08 0.98877 1.00527 1.00815 0.99208 0.99053 0.96306 0.91693 1.44968
5.473e-08 0.99923 0.99751 1.00026 0.99707 1.00591 0.99557 0.98069 0.92960
5.490e-08 0.97396 0.99929 0.99869 0.99771 1.00104 0.99890 1.01012 0.92668
6.151e-08 0.98587 1.00313 0.99887 1.00221 1.01718 0.93848 1.04009 0.98202
6.987e-08 0.99820 1.00063 1.00563 1.00323 0.99572 0.98115 1.05582 0.89140
7.796e-08 0.99990 1.00286 0.99625 0.99522 1.03217 0.98965 0.92442 0.92425
7.856e-08 0.99320 1.00094 1.00094 1.00094 1.00094 1.04444 1.04444 1.04444
8.201e-08 0.99968 1.00042 1.00042 1.00042 1.00042 1.00042 1.00042 1.00042
8.212e-08 1.00007 0.99915 0.99915 0.99915 0.99915 0.99915 0.99915 0.99915
8.641e-08 0.91052 1.00290 1.00494 0.99626 1.04478 1.00574 0.91154 0.80820
9.753e-08 0.68969 1.00019 0.99886 0.96685 0.99620 0.93171 0.91562 0.56726
9.956e-08 0.99185 1.00845 0.99983 0.98642 1.04271 1.08779 0.90709 1.09417


The probability ratio is never terribly bad, but it is more consistently
near 1.0 for 250 or 200 than for 50 or less.

In the following tables I count the number of times I saw "errors"
exceeding 15%, 25%, 50%, 100% or 200% as a function of the max integer (M)
used for encoding dosage. When the denominator changes, that is because
markers became monomorphic, which almost certainly means that those
markers were useless to begin with.


>15% error

M p/p_2000 > 1.15 p_2000/p > 1.15
--- ------------------- -------------------
250 2/51850 = 0.00% 5/51850 = 0.01%
200 7/51850 = 0.01% 7/51850 = 0.01%
100 90/51850 = 0.17% 100/51850 = 0.19%
50 405/51850 = 0.78% 457/51850 = 0.88%
20 1648/51849 = 3.18% 1708/51849 = 3.29%
10 3195/51849 = 6.16% 3267/51849 = 6.30%
2 8148/51845 = 15.72% 7901/51845 = 15.24%

>25% error

M p/p_2000 > 1.25 p_2000/p > 1.25
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 13/51850 = 0.03% 10/51850 = 0.02%
50 109/51850 = 0.21% 112/51850 = 0.22%
20 696/51849 = 1.34% 766/51849 = 1.48%
10 1697/51849 = 3.27% 1761/51849 = 3.40%
2 5563/51845 = 10.73% 5494/51845 = 10.60%

>50% error

M p/p_2000 > 1.50 p_2000/p > 1.50
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 3/51850 = 0.01% 0/51850 = 0.00%
50 10/51850 = 0.02% 7/51850 = 0.01%
20 148/51849 = 0.29% 143/51849 = 0.28%
10 577/51849 = 1.11% 548/51849 = 1.06%
2 3022/51845 = 5.83% 2984/51845 = 5.76%

>100% error

M p/p_2000 > 2.00 p_2000/p > 2.00
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 0/51850 = 0.00% 0/51850 = 0.00%
50 1/51850 = 0.00% 1/51850 = 0.00%
20 19/51849 = 0.04% 15/51849 = 0.03%
10 114/51849 = 0.22% 103/51849 = 0.20%
2 1387/51845 = 2.68% 1350/51845 = 2.60%

>200% error

M p/p_2000 > 3.00 p_2000/p > 3.00
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 0/51850 = 0.00% 0/51850 = 0.00%
50 0/51850 = 0.00% 0/51850 = 0.00%
20 0/51849 = 0.00% 4/51849 = 0.01%
10 24/51849 = 0.05% 10/51849 = 0.02%
2 494/51845 = 0.95% 542/51845 = 1.05%


Use of 250 might be slightly better than 200. I see that the error
distribution is symmetric -- the p-value is as likely to be too big as it
is too small. However, if I restrict the analysis to selected markers
where p_2000 < .001, I see that the p-value is much more likely to become
larger with rounding than to become smaller, but that is only true for the
smaller encoding maxima. That makes sense as a regression effect.


>15% error, p_2000 < .001

M p/p_2000 > 1.15 p_2000/p > 1.15
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 1/159 = 0.63% 0/159 = 0.00%
100 1/159 = 0.63% 3/159 = 1.89%
50 5/159 = 3.14% 4/159 = 2.52%
20 14/159 = 8.81% 12/159 = 7.55%
10 23/159 = 14.47% 10/159 = 6.29%
2 68/159 = 42.77% 18/159 = 11.32%

>25% error, p_2000 < .001

M p/p_2000 > 1.25 p_2000/p > 1.25
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 3/159 = 1.89% 3/159 = 1.89%
20 5/159 = 3.14% 7/159 = 4.40%
10 15/159 = 9.43% 4/159 = 2.52%
2 38/159 = 23.90% 14/159 = 8.81%

>50% error, p_2000 < .001

M p/p_2000 > 1.50 p_2000/p > 1.50
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 1/159 = 0.63% 1/159 = 0.63%
20 1/159 = 0.63% 1/159 = 0.63%
10 11/159 = 6.92% 2/159 = 1.26%
2 23/159 = 14.47% 10/159 = 6.29%

>100% error, p_2000 < .001

M p/p_2000 > 2.00 p_2000/p > 2.00
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 0/159 = 0.00% 0/159 = 0.00%
20 0/159 = 0.00% 1/159 = 0.63%
10 7/159 = 4.40% 0/159 = 0.00%
2 13/159 = 8.18% 3/159 = 1.89%

>200% error, p_2000 < .001

M p/p_2000 > 3.00 p_2000/p > 3.00
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 0/159 = 0.00% 0/159 = 0.00%
20 0/159 = 0.00% 0/159 = 0.00%
10 5/159 = 3.14% 0/159 = 0.00%
2 7/159 = 4.40% 3/159 = 1.89%

So it looks like 250 and 200 both perform really well in this example. A
problem with this analysis is that I don't know if the "effect" associated
with small p-values was real, and I don't know if the rounding made the
result better or worse.

I also did this retaining only markers with certain values of r2 and it
seemed like the high r2 markers showed less effect of rounding, as one
would expect.

Based on this, I would think that use of an integer from zero to 200 or
250 would be quite adequate, but I'll have to do a little more before I'd
recommend that with great confidence. Same for 128.

Mike

Christopher Chang

Apr 28, 2014, 3:47:13 AM
to plink2...@googlegroups.com, Christopher Chang, mbmi...@umn.edu
While 250 is better than 200 in theory, it has an awkward rounding problem when we import dosages with 2 or 3 digits of precision: should 0.99 be represented as 247/250 or 248/250?  How about 0.994?  200, on the other hand, behaves very cleanly on both import and export.

Mike Miller

Apr 28, 2014, 2:30:20 PM
to Christopher Chang, plink2...@googlegroups.com
Use of 200 is neat because integers always decode to two-digit dosages,
while 250 would require three digits, but I'm not understanding why you
see 250 as causing special problems for three-digit dosages. If dosage
data come as two-digit numbers, then 200 will always represent them
perfectly, but are programs producing two-digit dosages? (Data I have
received from minimac and from impute2 were both three-digit.) Anyway, if
the dosages come to us as two-digit numbers, then 200 is better. If they
come as three-digit numbers, I think 250 is better and I don't see the
awkwardness. I'm not saying you are wrong, but I don't get the problem.

I'm using R to round numbers (M is the max integer):

round(M*dosage/2)

So the answer is that 0.990 would be encoded as 124 and decoded as 0.992
while 0.994 would be treated the same (also 0.992). Using 200, both would
decode to 0.99, which is too small on average. (In your question you may
have been thinking of 1.99 and 1.994, but the results are the same after
adding 1.) Some R code:

> .99 -> dosage ; encoded <- c(1000*dosage, round(250*dosage/2), round(200*dosage/2)) ; decoded <- encoded/c(1000, 125, 100); matrix(c(encoded,decoded),3,2, byrow=F)
[,1] [,2]
[1,] 990 0.990
[2,] 124 0.992
[3,] 99 0.990

> .994 -> dosage ; encoded <- c(1000*dosage, round(250*dosage/2), round(200*dosage/2)) ; decoded <- encoded/c(1000, 125, 100); matrix(c(encoded,decoded),3,2, byrow=F)
[,1] [,2]
[1,] 994 0.994
[2,] 124 0.992
[3,] 99 0.990

You asked about 0.994, but you didn't ask about 0.995. Isn't 0.995
awkward for rounding to two digits? I'm not sure, so I'm asking. With R,
when the third digit is 5, it rounds to the even second digit, which is a
well-known convention. The matrix below shows how this works for the top
41 three-digit dosages. The columns are (1) three-digit dosage, (2)
250-encoded dosage before rounding, (3) 250-encoded dosage, (4)
200-encoded dosage before rounding, (5) 200-encoded dosage. The asterisks
show the values ending with .5 that were rounded to the even value.

> seq(2.000,1.960,-.001) -> dosage ; matrix(c(dosage, 250*dosage/2, round(250*dosage/2), 200*dosage/2, round(200*dosage/2)), 41, 5, byrow=F)
[,1] [,2] [,3] [,4] [,5]
[1,] 2.000 250.000 250 200.0 200
[2,] 1.999 249.875 250 199.9 200
[3,] 1.998 249.750 250 199.8 200
[4,] 1.997 249.625 250 199.7 200
[5,] 1.996 249.500* 250 199.6 200
[6,] 1.995 249.375 249 199.5* 200
[7,] 1.994 249.250 249 199.4 199
[8,] 1.993 249.125 249 199.3 199
[9,] 1.992 249.000 249 199.2 199
[10,] 1.991 248.875 249 199.1 199
[11,] 1.990 248.750 249 199.0 199
[12,] 1.989 248.625 249 198.9 199
[13,] 1.988 248.500* 248 198.8 199
[14,] 1.987 248.375 248 198.7 199
[15,] 1.986 248.250 248 198.6 199
[16,] 1.985 248.125 248 198.5* 198
[17,] 1.984 248.000 248 198.4 198
[18,] 1.983 247.875 248 198.3 198
[19,] 1.982 247.750 248 198.2 198
[20,] 1.981 247.625 248 198.1 198
[21,] 1.980 247.500* 248 198.0 198
[22,] 1.979 247.375 247 197.9 198
[23,] 1.978 247.250 247 197.8 198
[24,] 1.977 247.125 247 197.7 198
[25,] 1.976 247.000 247 197.6 198
[26,] 1.975 246.875 247 197.5* 198
[27,] 1.974 246.750 247 197.4 197
[28,] 1.973 246.625 247 197.3 197
[29,] 1.972 246.500* 246 197.2 197
[30,] 1.971 246.375 246 197.1 197
[31,] 1.970 246.250 246 197.0 197
[32,] 1.969 246.125 246 196.9 197
[33,] 1.968 246.000 246 196.8 197
[34,] 1.967 245.875 246 196.7 197
[35,] 1.966 245.750 246 196.6 197
[36,] 1.965 245.625 246 196.5* 196
[37,] 1.964 245.500* 246 196.4 196
[38,] 1.963 245.375 245 196.3 196
[39,] 1.962 245.250 245 196.2 196
[40,] 1.961 245.125 245 196.1 196
[41,] 1.960 245.000 245 196.0 196


The matrix below shows results for the highest 41 dosages: 2.000 down to
1.960. I'm showing original dosage in column 1, dosage decoded from 250
in column 2 and dosage decoded from 200 in column three. I've added a
fourth column by hand showing which encoding scheme came closest to the
original value. An empty fourth column means it was a tie.

> seq(2,1.96,-.001) -> dosage ; matrix(c(dosage, round(250*dosage/2)/125, round(200*dosage/2)/100),41,3,byrow=F)
[,1] [,2] [,3]
[1,] 2.000 2.000 2.00
[2,] 1.999 2.000 2.00
[3,] 1.998 2.000 2.00
[4,] 1.997 2.000 2.00
[5,] 1.996 2.000 2.00
[6,] 1.995 1.992 2.00 250
[7,] 1.994 1.992 1.99 250
[8,] 1.993 1.992 1.99 250
[9,] 1.992 1.992 1.99 250
[10,] 1.991 1.992 1.99
[11,] 1.990 1.992 1.99 200
[12,] 1.989 1.992 1.99 200
[13,] 1.988 1.984 1.99 200
[14,] 1.987 1.984 1.99
[15,] 1.986 1.984 1.99 250
[16,] 1.985 1.984 1.98 250
[17,] 1.984 1.984 1.98 250
[18,] 1.983 1.984 1.98 250
[19,] 1.982 1.984 1.98
[20,] 1.981 1.984 1.98 200
[21,] 1.980 1.984 1.98 200
[22,] 1.979 1.976 1.98 200
[23,] 1.978 1.976 1.98
[24,] 1.977 1.976 1.98 250
[25,] 1.976 1.976 1.98 250
[26,] 1.975 1.976 1.98 250
[27,] 1.974 1.976 1.97 250
[28,] 1.973 1.976 1.97
[29,] 1.972 1.968 1.97 200
[30,] 1.971 1.968 1.97 200
[31,] 1.970 1.968 1.97 200
[32,] 1.969 1.968 1.97
[33,] 1.968 1.968 1.97 250
[34,] 1.967 1.968 1.97 250
[35,] 1.966 1.968 1.97 250
[36,] 1.965 1.968 1.96 250
[37,] 1.964 1.968 1.96
[38,] 1.963 1.960 1.96
[39,] 1.962 1.960 1.96
[40,] 1.961 1.960 1.96
[41,] 1.960 1.960 1.96

We see this win/loss record:

M win loss tie
--- --- ---- ---
250 16 9 16
200 9 16 16

For all 2001 possible three-digit dosages 0.000 to 2.000, I get this
win-loss record:

M win loss tie
--- --- ---- ---
250 928 554 519
200 554 928 519

Code for that:

> seq(2,0,-.001) -> dosage ; mat <- matrix(c(dosage, round(250*dosage/2)/125, round(200*dosage/2)/100),2001,3,byrow=F)
> sum(abs(mat[,1]-mat[,2]) < abs(mat[,1]-mat[,3]))
[1] 928
> sum(abs(mat[,1]-mat[,2]) > abs(mat[,1]-mat[,3]))
[1] 554
> sum(abs(mat[,1]-mat[,2]) == abs(mat[,1]-mat[,3]))
[1] 519

Of course, the average deviation from the original value is about 25%
greater when encoding with 200 than when encoding with 250, which is why I
think I will go with 250 for storing data for use with R.

Mike

Christopher Chang

Apr 28, 2014, 7:56:41 PM
to plink2...@googlegroups.com, Christopher Chang, mbmi...@umn.edu
I was thinking about the IMPUTE2 .gen text format, which contains triplets of probabilities (P(g=0), P(g=1), P(g=2)) represented with four digits after the decimal, but the fourth digit is always zero for now.  This would be converted to a 0-2 dosage of [P(g=1) + 2P(g=2)] / [P(g=0) + P(g=1) + P(g=2)], where the denominator could be less than 1.  A 0-250 representation does seem to have significantly more edge cases than 0-200 when P(g=0)+P(g=1)+P(g=2) is exactly 1, but I was incorrect about 0-200 having no edge cases since I had forgotten about the "halving" of P(g=1); P(g=1)=0.995, P(g=2)=0.005 is an edge case, P(g=0)=0.006, P(g=1)=0.993, P(g=2)=0.001 is another, etc.

So I'm mostly convinced that 250 is a better choice than 200 at this point; the top candidates are now 128, 250, and 32768.

Mike Miller

Apr 29, 2014, 2:56:31 PM
to Christopher Chang, plink2...@googlegroups.com
I did some imputation with IMPUTE2 last month and had three-digit numbers
in the probability output. I imported the probabilities to GNU Octave and
multiplied by a sparse Kronecker product to extract the dosages from the
probabilities...

> [M,S]=size(Probs);
> Dosage = Probs * kron(eye(S/3), [2;1;0]);

...but I wasn't considering that they might not sum to 1. I think in my
case that only happened due to (ignorable, I hope) rounding error:

> min( min( Probs * kron(eye(S/3), ones(3,1)) ) )
ans = 0.99800

So the smallest sum of a length-three probability vector was .998. They
can also sum to more than 1:

> max( max( Probs * kron(eye(S/3), ones(3,1)) ) )
ans = 1.0010

I read here...

http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html#Genotype_File_Format

...that "the probabilities need not sum to 1 to allow for the possibility
of a NULL genotype call," but I assumed that meant that I could use three
zeros in the genotype input file instead of using NA. I was also assuming
that use of non-zero sums for null genotypes applied only to the input
data, and that output data in the same format would never have missing
genotypes because genotype probabilities would revert to population values
under low information. It seems like that worked out in my case. I
should run it again, adding a fake subject with all markers missing to see
what I get for output for that subject.

Looking more closely at my data, this is what I find: the probability
vectors summed to precisely 1.000 in about 99% of cases. I'll refer to
"Sum" which is defined like so:

Sum = [P(g=0) + P(g=1) + P(g=2)]

Here is the frequency distribution:

Sum freq
------- --------
0.998 252
0.999 27814
1.000 3310339
1.001 6664

When the sum did not equal 1.0, I computed dosages two ways:

D1 = [P(g=1) + 2P(g=2)]

D2 = [P(g=1) + 2P(g=2)] / Sum

It turned out that the largest absolute value of D2 - D1 was 0.002. When
Sum == 1.001, abs(D1 - D2) exceeded .001 about 56% of the time, and when
Sum == 0.999, abs(D1 - D2) exceeded .001 about 69% of the time.
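
In R, assuming Probs holds the three genotype probabilities for a
single marker in columns 1-3 (the column order is an assumption), the
comparison is just:

d1 <- Probs[, 2] + 2 * Probs[, 3]   # unnormalized dosage
d2 <- d1 / rowSums(Probs)           # dosage rescaled by the probability sum
max(abs(d2 - d1))                   # largest discrepancy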

The interesting cases had Sum == 0.998. All of them had 0.998 for the
heterozygote probability and zeros for the two homozygotes. So D1 was
always 0.998 and D2 was always 1.000. That doesn't really make sense
because to round to 0.998, the actual value had to be no more than 0.9985,
which would leave at least 0.0015 to divide between the two homozygotes, so
at least one of them would have to be 0.00075 or greater and would round to
at least 0.001.

Anyway, if IMPUTE2 output probability vectors are not summing to 1.0 only
because of rounding error, I think it is safe to ignore that and just
compute dosage as [P(g=1) + 2P(g=2)], especially if we will then round to
the nearest 1/125 or so.

You wrote:

> So I'm mostly convinced that 250 is a better choice than 200 at this
> point; the top candidates are now 128, 250, and 32768.

Can you explain how use of 128 (a power of 2) helps your computations?
Which computations are those? If it's the division by 64 to get the
dosage, I get how that is easy, but I'm not sure why you have to do it.
Can't the integer encoding of dosage be used for computations and
adjustments made on the output (e.g., regression betas and s.e.s) without
ever having to convert dosages? I suppose multiplying betas by 64 is
easier than multiplying by 250, but it isn't a big deal because you only
have one per marker. With dosages you have N per marker (with N
subjects). You also don't want to convert integers to dosage values in
memory because it uses too much space.

How big of a deal would it be to have both single-byte and two-byte
integer storage formats? You would have to code it both ways, of course,
which is a hassle, but you could do something like what is done with the
magic number and third byte in PLINK files -- the third byte could be used
to tell you the number of bytes per dosage.

Best,
[snip]

Christopher Chang

Apr 30, 2014, 1:51:34 AM
to plink2...@googlegroups.com, Christopher Chang, mbmi...@umn.edu
* Multiplication of dosages is a lot faster when a power-of-2 step size is used; the division part of (a * b) / 128 or (a * b) / 32768 is a bit shift that can actually be executed on 4-8 dosages simultaneously, while (a * b) / 250 is substantially more painful to work with.  This comes up in quite a few contexts.

* While it's possible to support both single-byte and two-byte dosage formats, I feel that it would create a bigger mess than can be justified by the mere factor of two savings, and I would rather just support the two-byte format.  The open question is whether *anyone* cares for the accuracy provided by the second byte today; if I can't find even one person who does, I am willing to only code one-byte dosage logic next year and let two-byte support be bolted on at some indeterminate later time.

Mike Miller

Apr 30, 2014, 1:16:43 PM
to Christopher Chang, plink2...@googlegroups.com
I didn't know that you needed to multiply dosages. Is this about GRM
computations? Or are you planning some kind of epistasis analysis?

Is it necessary to have dosages on the 0,1,2 scale instead of, say, the
0,125,250 scale when doing these computations?

In your formulas, do "a" and "b" each represent an integer-encoded dosage?

Mike

Christopher Chang

Apr 30, 2014, 1:39:54 PM
to plink2...@googlegroups.com, Christopher Chang, mbmi...@umn.edu
* PLINK 1.07 has an epistasis module, 1.9 has a highly optimized implementation, and it would be appropriate for 2.0 to extend it to support dosage data.  Other computations involving lots of multiplication include r^2/LD and (as you note) the GRM.
* Yes, "a" and "b" denoted integer-encoded dosages.

Mike Miller

Apr 30, 2014, 2:48:47 PM
to Christopher Chang, plink2...@googlegroups.com
Thanks. Since "a" and "b" are integer-encoded dosages, if you are using a
max integer of 128, say, then you would divide both of them by 64:

( a / 64 ) * ( b / 64 )

Or ( a * b ) / 4096. For epistasis, I assume you are using the resulting
values as regressors. Can you not do the regression with a * b (without
dividing first), then multiply the coefficients and standard errors by
4096, thus achieving the identical result? Wouldn't that be a more
efficient approach?

I think the division of the dosages will increase the computational effort
unnecessarily.

When doing R^2 computations, can't you compute R^2 on the integer
encodings? What is the benefit of rescaling to 0,1,2 dosage?

And if I am right about these issues, doesn't that minimize the problem
with use of 250 as a max integer instead of using a power of 2?

Mike

Christopher Chang

Apr 30, 2014, 9:57:02 PM
to plink2...@googlegroups.com, Christopher Chang, mbmi...@umn.edu
* The (a * b) / 250 expression keeps the dosage in the same "units"; sometimes you want this, sometimes you don't.
* Yes, division (or multiplication) steps can frequently be deferred until many individuals/variants have been summed over, and if any comparison needs to be done against a fixed threshold, that threshold can be pre-converted to /250 or /62500 units.  So, okay, the time penalty might not be bad if everything has been properly optimized.  Still, there's an extra level of implementation complexity that simply is not there when a power of 2 is used, and I feel that if the 8th bit of precision is important enough to justify the extra complexity, it's probably best to just go with 15 bits instead.  I want it to be practical for others to write code that efficiently interacts with PLINK 2 dosages without needing to spend as much time accounting for low-level machine details as I do.

Mike Miller

May 1, 2014, 1:31:50 AM
to Christopher Chang, plink2...@googlegroups.com
Sorry, I thought there were typos in your earlier messages, but you meant
what you were saying and I just wasn't figuring it out fast enough. So
you mean to divide by the max integer ("MAX") used in encoding dosages and
thereby create a new integer c = (a * b) / MAX that is of the same
numerical type (uint8 or uint16) as both a and b. If MAX is a power of 2,
this is much faster and easier. The new number c is automatically
rounded, or truncated, because it has to fit into the same number of bits
as both a and b. Maybe I'm understanding it now.
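
A toy R illustration of why the power of 2 matters (plain R integers,
not the vectorized code PLINK would actually use; the values are
arbitrary):

a <- 97L; b <- 64L        # integer-encoded dosages on a 0-128 scale
bitwShiftR(a * b, 7L)     # (a * b) / 128 is just a right shift by 7 bits
(a * b) %/% 128L          # same value, written as integer division
(190L * 125L) %/% 250L    # on a 0-250 scale there is no shift shortcut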

I didn't understand the part about making it "practical for others to
write code that efficiently interacts with PLINK 2 dosages." What kind of
code would that be? PLINK source code? Code for programs that would
interface with PLINK?

I'm just curious about these issues and not trying to persuade you. I'll
use PLINK no matter what you do -- and all of your work has been great.
I just like to learn things about stuff like this. So thanks.

Christopher Chang

May 1, 2014, 1:57:47 AM
to plink2...@googlegroups.com, Christopher Chang, mbmi...@umn.edu
Programs like GCTA are able to efficiently perform computations on data in PLINK 1 binary files today.  I want to choose a data representation that makes it easy for them to efficiently support the PLINK 2 format when the time comes.

Peter Joshi

May 3, 2015, 1:51:11 PM
to plink2...@googlegroups.com, mbmi...@umn.edu, chrch...@gmail.com
Hi Chris and Mike,

I can't see great benefit in very accurate dosage information. Although imputations are getting better, an r2 of 0.95 is unlikely to be terribly affected by rounding to 1/128, and most of the time dosage is used for association anyway.

One slightly different thought. My imputations include genotyped SNPs, and most of us (naively) attribute 100% accuracy to these. Indeed, consortia often ask us to report which SNPs are GTd and which are not. Consortia also ask us to report an imputation quality score. As I'm sure you know, this is usually managed through an info file. It would be helpful to integrate info files into the plink format. I'm guessing we only need a report per SNP, not per SNP x sample, although, as a matter of principle, it would be handy to have a quality score on a SNP x sample basis. Anyway, in summary, I think it would be helpful to have a --dosage-info file and to produce consortium-style results files automatically.

Best


Peter Joshi

Oct 13, 2015, 7:16:08 AM
to plink2-users, mbmi...@umn.edu, chrch...@gmail.com
Hi Chris and Mike,

sorry for returning to a rather old thread, but recent work with UK Biobank has set me thinking.


plink --assoc is doing a great job for me on bed files - superfast with 120k samples and nearly 1m SNPs.

However, SNPtest is painful on the bgen files. The files are huge and SNPtest is incredibly slow, even working with residuals and no covariates.

I have also been looking at regscan (http://www.geenivaramu.ee/en/tools/regscan), which does a fast test on residuals, with runtime governed by I/O limits. Unfortunately it works on .gen, not .bgen, files, so it isn't practical for UK BB.


All this has drawn me to the following conclusion. For the very, very large datasets that are likely to become common, a very compact dosage format is desirable - one byte? This will help with both runtime (due to I/O) and disk storage. It would also be very helpful to have a fast test like regscan, which does a minimal single-phenotype univariate regression (I presume plink --assoc is already similar, but I want to check).

Ideally this would become a near-universal format and the only format I would store my imputation results in. But one problem with that is we're missing the genotype probabilities. If I want to test recessive effects, or do some other genotype-based analysis, I'd be stuck. I haven't quite thought through an optimal solution, but perhaps two bytes (P(a1=0), P(a1=1))? I realise I've just doubled file sizes and probably runtimes for a fast test, but it would be a format I'd be happy to store my definitive imputations in, and I only want one version given the huge file sizes, although I would not have phasing.

Interested to hear your thoughts.

Kevin Keys

Apr 7, 2016, 5:18:42 PM
to plink2-users, mbmi...@umn.edu, chrch...@gmail.com
Apologies for reviving a dead thread...

I have discussed fractional dosage compression with my professor off and on over the last several months.

I personally feel like one-byte (8-bit, 256-value) compression is the best. I also like the idea of an injective (one-to-one) map, i.e. 0.00 -> 0000 0000, 0.01 -> 0000 0001, 0.02 -> 0000 0010, etc. The injective map only uses 201 of the 256 possible bit configurations. However it is (1) consistent, (2) easy to describe, and (3) does not fuss about roundoff error.
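
A sketch of that injective map in R (illustrative only; with two
decimal places, 201 of the 256 one-byte values are used):

encode_inj <- function(d) as.integer(round(100 * d))   # 0.00 -> 0, 0.01 -> 1, ..., 2.00 -> 200
decode_inj <- function(k) k / 100                      # exact inverse on encoded values
decode_inj(encode_inj(c(0, 0.01, 0.02, 1.37, 2)))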

My professor has some code for dealing with fractional dosages. He simply divides the uint8 number by 256 to obtain the fractional dosage. In my mind this is bad practice since multiple bit configurations map to the same dosage, which makes compression messy: two newly compressed BED files with different bit configurations could have the same fractional dosages in floating point representation, and you would only know by decompressing the BED files or inspecting the bits themselves.

Also, he and all of his colleagues are quite happy with two decimal places of precision.

Lastly, I do not use IMPUTE2 and do not understand the two-byte format that it requires. I second Peter Joshi's comment about bigbigbig files; more compression is better. But if interoperability is your goal, then it may be best to use IMPUTE2's protocol (or perhaps code a separate translation function?)

--Kevin

Christopher Chang

Apr 8, 2016, 4:54:27 PM
to plink2-users, mbmi...@umn.edu, chrch...@gmail.com
I'm currently leaning to 2-byte dosages, but with compression of the 0/1/2 hardcall values.  This way, you can avoid data loss when converting to and back from BGEN v1.1, but you don't suffer from much file bloat when the calls are highly certain or only a few samples were imputed.