pooling with normalization


Norman B. Grover

Feb 8, 2021, 7:00:18 AM
to methylkit_...@googlegroups.com
Dear Users,
    I have a simple question involving the processing of a single tile.
    Consider the following tile: 2 groups, with 5 samples in the control group and 6 samples in the test group.
 
sample      coverage    Cs     Ts
control1    192         192    0
control2    NA          NA     NA
control3    192         192    0
control4    138         138    0
control5    156         156    0

test1       NA          NA     NA
test2       198         0      198
test3       471         453    18
test4       112         104    8
test5       144         131    13
test6       126         126    0
 
    Pooling without normalization gives a meth.diff of
(0+453+104+131+126)/(198+471+112+144+126) - (192+192+138+156)/(192+192+138+156)
= 814/1051 - 1
= -0.225499524
That is what I get by hand and that is what methylKit gives.
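
The hand calculation above can be reproduced with a short script. This is only a sketch of the arithmetic in the post, not methylKit code; the variable and function names are illustrative, and samples marked NA are simply skipped.

```python
# Per-sample (numCs, numTs) for the tile in the post; None marks a sample
# with no coverage (NA). All names here are illustrative.
control = [(192, 0), None, (192, 0), (138, 0), (156, 0)]
test = [None, (0, 198), (453, 18), (104, 8), (131, 13), (126, 0)]


def pooled_fraction(samples):
    """Sum Cs and Ts over covered samples; return (Cs, Ts, Cs / coverage)."""
    covered = [s for s in samples if s is not None]
    cs = sum(c for c, _ in covered)
    ts = sum(t for _, t in covered)
    return cs, ts, cs / (cs + ts)


control_cs, control_ts, control_frac = pooled_fraction(control)
test_cs, test_ts, test_frac = pooled_fraction(test)

# The difference is test minus control, expressed in percent.
meth_diff = 100 * (test_frac - control_frac)

print(f"control: {control_cs}/{control_cs + control_ts}")
print(f"test:    {test_cs}/{test_cs + test_ts}")
print(f"meth.diff = {meth_diff:.2f}%")
```

The pooled counts are 678/678 for the control group and 814/1051 for the test group, so the difference is negative.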
    The question involves normalization (using the default option: median). How does methylKit compute the numbers that go into getting meth.diff in this case? What numbers does methylKit use and what result does it get?
    There is a 'part 2' to this question, but it depends on the reply to the above.
    I am completely stuck and any help would be greatly appreciated.
 
 
 
Norman

Altuna Akalin

Feb 9, 2021, 3:52:57 AM
to methylkit_...@googlegroups.com
What does the coverage distribution look like after pooling? Normalization may not be necessary if the samples have similar coverage. I think normalization makes sense when there are systematic differences in coverage between samples: if the coverages are very different, the samples with systematically high coverage will dominate the pooled counts.

Best,
Altuna



Norman Grover

Feb 9, 2021, 6:41:37 AM
to methylkit_discussion
Dear Altuna,
    Thank you for your quick reply.
    You could be right that pooling may not be necessary, but the purpose of my post was to find out how methylKit does the calculation using the data I provided: I am looking for the method used for the normalization, not its scientific implications.

Altuna Akalin

Feb 9, 2021, 6:58:32 AM
to methylkit_...@googlegroups.com
Hi Norman,
I'm saying normalizing may not be necessary. Check the coverage distributions; normalization only rescales the coverages, and the percent methylation values stay the same.
Pooling might be good if you have missing CpGs in different experiments.

Best,
Altuna

Norman Grover

Feb 9, 2021, 8:32:42 AM
to methylkit_discussion
Dear Altuna,
    You're right. That was a slip-up on my part. Sorry. Normalization may indeed not be necessary. But how does methylKit do it? As I said in the original post, there is a follow-up question, but it depends on how methylKit computes the meth.diff from the numbers I provided. Please, if you know how, share that information with me. I would be very grateful.

Altuna Akalin

Feb 10, 2021, 7:11:48 AM
to methylkit_...@googlegroups.com
Dear Norman,
For the median normalization, the median of the coverage distribution is calculated for each sample. A scaling factor is then derived from these medians, and the coverage values are multiplied by that scaling factor, keeping the percent methylation values intact or very similar to the original values.
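
The description above can be sketched roughly as follows. This is not methylKit's code. The thread does not say how the scaling factor is derived from the medians, so the sketch assumes it is the largest sample median divided by each sample's own median. The sample names and counts are made up for illustration, and the rounding details are also an assumption.

```python
from statistics import median


def normalize_coverage(samples):
    """Scale each sample's counts so that all samples share one median coverage.

    samples maps a sample name to a list of (numCs, numTs) pairs, one per tile.
    ASSUMPTION: scaling factor = largest sample median / this sample's median.
    """
    medians = {
        name: median(c + t for c, t in tiles) for name, tiles in samples.items()
    }
    target = max(medians.values())
    normalized = {}
    for name, tiles in samples.items():
        factor = target / medians[name]
        scaled = []
        for c, t in tiles:
            coverage = c + t
            new_coverage = round(factor * coverage)
            # Keep the methylated fraction as close as integer counts allow.
            new_c = round(new_coverage * c / coverage)
            scaled.append((new_c, new_coverage - new_c))
        normalized[name] = scaled
    return normalized


# Hypothetical data: "shallow" was sequenced at a quarter of the depth of "deep".
samples = {
    "shallow": [(10, 10), (30, 10), (5, 15)],
    "deep": [(40, 40), (90, 30), (25, 55)],
}
normalized = normalize_coverage(samples)
for name, tiles in normalized.items():
    print(name, tiles)
```

In this toy case the shallow sample's counts are multiplied by 4 and the deep sample is left alone, so the per-tile methylated fractions are unchanged while the absolute counts are not.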


If you need more help or clarification please send a reproducible example that includes your code and subset of the data.

Best,
Altuna

Norman Grover

Feb 13, 2021, 4:51:59 AM
to methylkit_discussion
Dear Altuna,
    Thank you for the code. Just what I needed. We were able to extract the four numbers I was looking for, and my suspicions were confirmed: pooled data following normalisation may give the requisite meth.diff value but they certainly cannot be analysed by Fisher's Exact Test. Let me explain.
    When the data are not normalised prior to pooling, the 2x2 table that is analysed by Fisher is

            678        0
            814        237

These give a meth.diff of -22.55% and a p-value of 4.42e-57. After normalisation, the pooled 2x2 table becomes

            1884        0
            1789        566

    The ratios are more or less preserved, as you see, but the absolute numbers have more than doubled. Fisher's Exact Test requires counts, actual counts, in all four positions of the table, and nothing else. The F-test, t-test, r statistic and others are invariant to linear transformations; Fisher's is not. It needs counts and nothing but counts, otherwise you get garbage, as in this case, where your normalisation decreased the p-value by more than 100 orders of magnitude, to 3.54e-160. (The meth.diff, which depends on ratios rather than absolute quantities, is now -24.03%.)
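
The sensitivity of Fisher's Exact Test to the absolute size of the counts can be checked with a small standard-library sketch. This is a generic two-sided test computed in log space; it is not methylKit's implementation, and the function names are illustrative. It should give p-values of roughly the magnitudes quoted above, though exact digits may differ between implementations.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = log_choose(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability that the top-left cell equals x.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = log_prob(a)
    # Sum over every table that is no more probable than the observed one.
    return sum(
        exp(lp)
        for lp in (log_prob(x) for x in range(lowest, highest + 1))
        if lp <= observed + 1e-7
    )


p_raw = fisher_two_sided(678, 0, 814, 237)          # pooled, not normalised
p_normalised = fisher_two_sided(1884, 0, 1789, 566)  # pooled after normalisation

print(f"raw counts:        p = {p_raw:.3g}")
print(f"normalised counts: p = {p_normalised:.3g}")
print(f"meth.diff raw        = {100 * (814 / 1051 - 1):.2f}%")
print(f"meth.diff normalised = {100 * (1789 / 2355 - 1):.2f}%")
```

The two tables have nearly the same proportions, yet the p-values differ by about a hundred orders of magnitude, because the test treats every entry as an independent observation.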
    I think this may not be the only problem with methylKit's statistics, but I'll leave it at that for now.
   Once again, thank you for your help.

Altuna Akalin

Feb 15, 2021, 7:37:52 AM
to methylkit_...@googlegroups.com
Dear Norman,
I think you are missing a point I raised earlier. Do you have systematic coverage differences between samples? If so, those differences will affect the test results artificially, much as in the case you described: simply because some samples happen to be sequenced more deeply than others, you can get the effects you describe (inflated significance). If you have such systematic differences, they should be normalized, in my opinion. If you have better ways of dealing with this, we are open to suggestions and pull requests as well.

Best,
Altuna

Norman Grover

Feb 16, 2021, 5:48:18 AM
to methylkit_discussion

Dear Altuna,

No, I’m not missing the point. The tile in my example may be problematic and give rise to the “100 orders of magnitude”, but even a change in p-value of half an order of magnitude is unacceptable. It is not the change in p-value that is the problem (that was merely a striking example); the problem lies in your use of Fisher’s Exact Test for 2x2 tables after having changed the observed counts during the normalisation:

 

            1884        0
            1789        566

 

are no longer counts and cannot be analysed by Fisher’s Exact Test.

I see only one alternative to pulling Fisher following normalisation, and that is to restore the original counts without affecting the results of the normalisation. That sounds like a contradiction, but it is not. The normalisation is concerned only with relative, not absolute, coverage; Fisher requires only that all four entries be counts and that the original coverage be preserved. Let me explain by way of the treatment group in my tile:

 

                                                    +         -         total
before normalisation                                814       237       1051
after normalisation                                 1789      566       2355

1. multiply all entries in row 2 by 1051/2355       798.40    252.60    1051
2. to allocate the final bp, you can
   (a) truncate entry with smaller remainder        798       253       1051
or (b) take into account that Fisher deals in factorials
       truncate max(799*(1-0.40), 253*(1-0.60))     798       253       1051
or (c) consider effect in context of normalisation
       choose according to effect on normalised meth.diff
       (increase or decrease in absolute value)
       requires processing both groups together

 

That is my suggestion. The relative coverage computed by the normalisation is preserved (except for the final integer, but even that becomes part of the normalisation if you adopt option (c)), and Fisher is permitted because the normalisation did not change the total number of counts but merely reapportioned them between the + and - categories.
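
Step 1 and option (a) of the proposal above can be sketched as follows. The function name is hypothetical and this is not part of methylKit; it scales the normalised counts back to the original coverage and then hands the leftover unit to the entry with the largest fractional part, which for two entries is the same as truncating the one with the smaller remainder.

```python
from math import floor


def rescale_to_total(counts, total):
    """Scale integer counts so they sum to `total` (largest-remainder rule)."""
    current = sum(counts)
    exact = [c * total / current for c in counts]
    result = [floor(x) for x in exact]
    leftover = total - sum(result)
    # Hand the leftover units to the entries with the largest fractional parts.
    by_remainder = sorted(
        range(len(counts)), key=lambda i: exact[i] - result[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        result[i] += 1
    return result


# Treatment group: normalised counts (+, -) restored to the original coverage.
restored = rescale_to_total([1789, 566], 1051)
print(restored)
print(f"normalised fraction: {1789 / 2355:.4f}")
print(f"restored fraction:   {restored[0] / sum(restored):.4f}")
```

For the treatment group this yields 798 and 253, as in the table. The methylated fraction moves slightly, from 1789/2355 to 798/1051, because of the integer rounding.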

When you wrote “If you have better ways of dealing with this”, I thought a bit about your logistic regression but I don’t see how you can set up a logistic regression in the present case (only 2 groups, albeit with replicates, and no covariates). What have I missed?

Altuna Akalin

Feb 16, 2021, 6:07:25 AM
to methylkit_...@googlegroups.com
My suggestion is not to normalize, or even think about normalization, if you don't have systematic differences. I'm sorry, but your suggestions don't entirely make sense to me at the moment. What you suggest seems to change the percent methylation values, and that is not "acceptable".

Best,
Altuna

Norman Grover

Feb 17, 2021, 4:56:12 AM
to methylkit_discussion


Dear Altuna,

My tile doesn’t need help, my data doesn’t need help, I don’t need help; methylKit needs help: its algorithm for normalisation transforms the empirical data so that it cannot subsequently be analysed by Fisher’s Exact Test. I realise you do not understand this, so get a proper statistician. And while he (or she) is at it, have him (or her) take a look at the rest of methylKit’s statistical procedures; I have my suspicions about them too.

MethylKit has been unwittingly providing erroneous results to scientists for years based on its sloppy statistics. You owe it to the scientific community to rectify the situation as quickly as possible.
