Thanks so much for your fast reply and great ideas. I think we should do
a study of the effect of dosage approximation in GWAS. We want to be able
to argue that the rounding doesn't reduce power. The effect seems to be
minimal, but that might depend on sample sizes and effect sizes (e.g., a
big effect with a small sample might be most vulnerable).
(For the work I did below, I just realized that I should have used 128 to
see how that worked out. I did use 100 and 200, so interpolating between
those will give about the right answer. I would say the results are
pretty encouraging for 128. What I am calling a >25% error in the p-value
means a change in -log10(p) of about 0.1 or more, and that seems to be
very rare with 100. Regarding the computational issue: can't the analysis
be done directly on the integers, with the beta and SE rescaled
appropriately in the output? In other words, you don't have to divide the
integer representations back into properly scaled dosages before analyzing
them. If so, you could use 250, which is what I will probably do with my R
data.)
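To illustrate the point, here is a minimal Python sketch with toy data (the `ols` helper and the encoding convention k = round(dosage * M / 2) are my assumptions, not anything PLINK itself does): regress on the raw integers, then multiply the resulting beta and SE by M/2 to get dosage-scale estimates. The t-statistic, and hence the p-value, is unchanged.

```python
# Sketch (hypothetical data): regress on integer-encoded dosages directly,
# then rescale beta and SE back to the dosage scale in the output.
# Encoding assumed here: integer k = round(dosage * M / 2), so dosage ~ k * 2 / M.
import numpy as np

rng = np.random.default_rng(0)
M = 250                                    # max integer used for encoding
dosage = rng.uniform(0, 2, size=1000)      # toy dosages on the 0..2 scale
k = np.rint(dosage * M / 2)                # integer-encoded dosages, 0..M

y = 0.3 * dosage + rng.normal(size=1000)   # toy phenotype

def ols(x, y):
    """One-predictor OLS with intercept; returns (beta, se)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, res, *_ = np.linalg.lstsq(X, y, rcond=None)
    dof = len(y) - 2
    sigma2 = res[0] / dof
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[1], np.sqrt(cov[1, 1])

beta_k, se_k = ols(k, y)
# Rescale to the dosage (0..2) scale: multiply both by M/2.
beta_d, se_d = beta_k * (M / 2), se_k * (M / 2)
# The t-statistic (and hence the p-value) is identical either way:
# beta_k / se_k == beta_d / se_d
```

So the division by M/2 can be deferred to the reporting step at essentially no cost.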
I happened to have some data readily available, so I took a quick look.
These data were for 51,850 imputed dosages (1000G, European, 2/2012) in
the region of a possible signal. Data were from 4,466 subjects using 15
covariates in a generalized least squares regression (because of family
structures). All R²s were > .25.
When I refer to p_2000, I mean the p-value from regression analysis of the
original dosages, encoded as integers from 0 to 2000. I compare it to
p-values from the identical analysis of the same dosages after re-encoding
them as integers from 0 to 250, 200, 100, 50, 20, 10, or 2 (using 2
amounts to rounding to the nearest genotype).
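The rounding step itself is simple; here is a small Python sketch (the function names `encode`/`decode` are just illustrative): a dosage d in [0, 2] is stored as the nearest integer on a 0..M grid, so the worst-case rounding error on the dosage scale is 1/M.

```python
# Minimal sketch of the encoding/rounding step (illustrative names):
# a dosage d in [0, 2] is stored as the nearest integer in 0..M,
# then mapped back to d_hat = k * 2 / M for analysis.
import numpy as np

def encode(dosage, M):
    """Round dosages in [0, 2] to integers in 0..M."""
    return np.rint(np.asarray(dosage) * M / 2).astype(int)

def decode(k, M):
    """Map integer codes back to the dosage scale."""
    return np.asarray(k) * 2.0 / M

d = np.array([0.013, 0.498, 1.251, 1.997])
for M in (2000, 250, 200, 100, 50, 20, 10, 2):
    d_hat = decode(encode(d, M), M)
    # max rounding error is at most 1/M on the dosage scale
    print(M, np.max(np.abs(d_hat - d)))
```

With M = 250 the dosage error is therefore at most 0.004, versus 0.0005 for M = 2000.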
In this table I look only at markers with p < 1e-7. The probability ratio
is the p-value after rounding the dosage divided by p_2000:
Probability Ratios after Rounding
-------------------------------------------------------------
p_2000 R² 250 200 100 50 20 10 2
--------- ------- ------- ------- ------- ------- ------- ------- -------
3.277e-09 0.42178 1.00238 1.00417 0.99368 1.01864 1.01171 1.12193 1.14666
3.958e-08 0.98877 1.00527 1.00815 0.99208 0.99053 0.96306 0.91693 1.44968
5.473e-08 0.99923 0.99751 1.00026 0.99707 1.00591 0.99557 0.98069 0.92960
5.490e-08 0.97396 0.99929 0.99869 0.99771 1.00104 0.99890 1.01012 0.92668
6.151e-08 0.98587 1.00313 0.99887 1.00221 1.01718 0.93848 1.04009 0.98202
6.987e-08 0.99820 1.00063 1.00563 1.00323 0.99572 0.98115 1.05582 0.89140
7.796e-08 0.99990 1.00286 0.99625 0.99522 1.03217 0.98965 0.92442 0.92425
7.856e-08 0.99320 1.00094 1.00094 1.00094 1.00094 1.04444 1.04444 1.04444
8.201e-08 0.99968 1.00042 1.00042 1.00042 1.00042 1.00042 1.00042 1.00042
8.212e-08 1.00007 0.99915 0.99915 0.99915 0.99915 0.99915 0.99915 0.99915
8.641e-08 0.91052 1.00290 1.00494 0.99626 1.04478 1.00574 0.91154 0.80820
9.753e-08 0.68969 1.00019 0.99886 0.96685 0.99620 0.93171 0.91562 0.56726
9.956e-08 0.99185 1.00845 0.99983 0.98642 1.04271 1.08779 0.90709 1.09417
The probability ratio is never terribly bad, but it is more consistently
near 1.0 for 250 or 200 than for 50 or less.
In the following tables I count the number of times I saw "errors"
exceeding 15%, 25%, 50%, 100% or 200% as a function of the max integer (M)
used for encoding dosage. When the denominator changes, that is because
markers became monomorphic, which almost certainly means that those
markers were useless to begin with.
>15% error
M p/p_2000 > 1.15 p_2000/p > 1.15
--- ------------------- -------------------
250 2/51850 = 0.00% 5/51850 = 0.01%
200 7/51850 = 0.01% 7/51850 = 0.01%
100 90/51850 = 0.17% 100/51850 = 0.19%
50 405/51850 = 0.78% 457/51850 = 0.88%
20 1648/51849 = 3.18% 1708/51849 = 3.29%
10 3195/51849 = 6.16% 3267/51849 = 6.30%
2 8148/51845 = 15.72% 7901/51845 = 15.24%
>25% error
M p/p_2000 > 1.25 p_2000/p > 1.25
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 13/51850 = 0.03% 10/51850 = 0.02%
50 109/51850 = 0.21% 112/51850 = 0.22%
20 696/51849 = 1.34% 766/51849 = 1.48%
10 1697/51849 = 3.27% 1761/51849 = 3.40%
2 5563/51845 = 10.73% 5494/51845 = 10.60%
>50% error
M p/p_2000 > 1.50 p_2000/p > 1.50
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 3/51850 = 0.01% 0/51850 = 0.00%
50 10/51850 = 0.02% 7/51850 = 0.01%
20 148/51849 = 0.29% 143/51849 = 0.28%
10 577/51849 = 1.11% 548/51849 = 1.06%
2 3022/51845 = 5.83% 2984/51845 = 5.76%
>100% error
M p/p_2000 > 2.00 p_2000/p > 2.00
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 0/51850 = 0.00% 0/51850 = 0.00%
50 1/51850 = 0.00% 1/51850 = 0.00%
20 19/51849 = 0.04% 15/51849 = 0.03%
10 114/51849 = 0.22% 103/51849 = 0.20%
2 1387/51845 = 2.68% 1350/51845 = 2.60%
>200% error
M p/p_2000 > 3.00 p_2000/p > 3.00
--- ------------------- -------------------
250 0/51850 = 0.00% 0/51850 = 0.00%
200 0/51850 = 0.00% 0/51850 = 0.00%
100 0/51850 = 0.00% 0/51850 = 0.00%
50 0/51850 = 0.00% 0/51850 = 0.00%
20 0/51849 = 0.00% 4/51849 = 0.01%
10 24/51849 = 0.05% 10/51849 = 0.02%
2 494/51845 = 0.95% 542/51845 = 1.05%
Use of 250 might be slightly better than 200. The error distribution is
symmetric: the p-value is as likely to be too big as too small. However,
if I restrict the analysis to selected markers where p_2000 < .001, the
p-value is much more likely to become larger with rounding than smaller,
but only for the coarser encodings. That makes sense as a
regression-to-the-mean effect.
>15% error, p_2000 < .001
M p/p_2000 > 1.15 p_2000/p > 1.15
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 1/159 = 0.63% 0/159 = 0.00%
100 1/159 = 0.63% 3/159 = 1.89%
50 5/159 = 3.14% 4/159 = 2.52%
20 14/159 = 8.81% 12/159 = 7.55%
10 23/159 = 14.47% 10/159 = 6.29%
2 68/159 = 42.77% 18/159 = 11.32%
>25% error, p_2000 < .001
M p/p_2000 > 1.25 p_2000/p > 1.25
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 3/159 = 1.89% 3/159 = 1.89%
20 5/159 = 3.14% 7/159 = 4.40%
10 15/159 = 9.43% 4/159 = 2.52%
2 38/159 = 23.90% 14/159 = 8.81%
>50% error, p_2000 < .001
M p/p_2000 > 1.50 p_2000/p > 1.50
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 1/159 = 0.63% 1/159 = 0.63%
20 1/159 = 0.63% 1/159 = 0.63%
10 11/159 = 6.92% 2/159 = 1.26%
2 23/159 = 14.47% 10/159 = 6.29%
>100% error, p_2000 < .001
M p/p_2000 > 2.00 p_2000/p > 2.00
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 0/159 = 0.00% 0/159 = 0.00%
20 0/159 = 0.00% 1/159 = 0.63%
10 7/159 = 4.40% 0/159 = 0.00%
2 13/159 = 8.18% 3/159 = 1.89%
>200% error, p_2000 < .001
M p/p_2000 > 3.00 p_2000/p > 3.00
--- ----------------- -----------------
250 0/159 = 0.00% 0/159 = 0.00%
200 0/159 = 0.00% 0/159 = 0.00%
100 0/159 = 0.00% 0/159 = 0.00%
50 0/159 = 0.00% 0/159 = 0.00%
20 0/159 = 0.00% 0/159 = 0.00%
10 5/159 = 3.14% 0/159 = 0.00%
2 7/159 = 4.40% 3/159 = 1.89%
So it looks like 250 and 200 both perform really well in this example. A
problem with this analysis is that I don't know whether the "effect"
associated with the small p-values was real, so I don't know whether the
rounding made the results better or worse.
I also did this retaining only markers above certain R² values, and it
seemed like the high-R² markers showed less effect of rounding, as one
would expect.
Based on this, I would think that an integer from zero to 200 or 250 would
be quite adequate, but I'll have to do a little more work before I'd
recommend that with great confidence. The same goes for 128.
Mike