pooling with normalization

Norman B. Grover

Feb 8, 2021, 7:00:18 AM

to methylkit_...@googlegroups.com

Dear Users,

I have a simple question involving the processing of a single tile.

Consider the following tile: 2 groups, with 5 samples in the control group and 6 samples in the test group.

sample     coverage   Cs    Ts
control1   192        192   0
control2   NA         NA    NA
control3   192        192   0
control4   138        138   0
control5   156        156   0
test1      NA         NA    NA
test2      198        0     198
test3      471        453   18
test4      112        104   8
test5      144        131   13
test6      126        126   0

Pooling without normalization gives a meth.diff of

(0+453+104+131+126)/(198+471+112+144+126) - (192+192+138+156)/(192+192+138+156)

= 814/1051 - 1

= -0.225499524

That is what I get by hand and that is what methylKit gives.
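
The hand calculation above can be reproduced with a few lines of code. This is a minimal Python sketch, not methylKit code: the helper name `pooled_fraction` and the use of `None` for NA samples are assumptions made for illustration only.

```python
# Per-sample counts from the tile above: (coverage, Cs, Ts); None marks an NA sample.
control = [(192, 192, 0), None, (192, 192, 0), (138, 138, 0), (156, 156, 0)]
test = [None, (198, 0, 198), (471, 453, 18), (112, 104, 8), (144, 131, 13), (126, 126, 0)]

def pooled_fraction(samples):
    """Sum Cs and coverage over the non-missing samples and return Cs / coverage."""
    present = [s for s in samples if s is not None]
    total_cs = sum(cs for _, cs, _ in present)
    total_cov = sum(cov for cov, _, _ in present)
    return total_cs / total_cov

# Difference of pooled methylation fractions, test minus control.
meth_diff = pooled_fraction(test) - pooled_fraction(control)
print(f"{meth_diff:.9f}")
```

The result is a fraction; multiplying by 100 gives the percentage quoted later in the thread.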

The question involves normalization (using the default option: median). How does methylKit compute the numbers that go into getting meth.diff in this case? What numbers does methylKit use and what result does it get?

There is a 'part 2' to this question, but it depends on the reply to the above.

I am completely stuck and any help would be greatly appreciated.

Norman

Altuna Akalin

Feb 9, 2021, 3:52:57 AM

to methylkit_...@googlegroups.com

What does the coverage distribution look like after pooling? Normalization may not be necessary if the distributions are similar across samples. I think normalization might make sense if there are systematic differences in coverage values between samples. If the coverages are very different, the samples that have systematically high coverage will dominate the pooled samples.
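
To illustrate the last point with the test group of the tile from the original post: the pooled methylation fraction is a coverage-weighted mean of the per-sample fractions, so a sample's weight is simply its share of the total coverage. The following is a minimal Python sketch of that arithmetic; the variable names are made up for illustration and none of this is methylKit code.

```python
# Test-group samples with data for the tile in this thread: name -> (coverage, Cs).
test_group = {
    "test2": (198, 0),
    "test3": (471, 453),
    "test4": (112, 104),
    "test5": (144, 131),
    "test6": (126, 126),
}

total_cov = sum(cov for cov, _ in test_group.values())

# Each sample's share of the pooled coverage is its weight in the pooled estimate.
weights = {name: cov / total_cov for name, (cov, _) in test_group.items()}

# The coverage-weighted mean of the per-sample methylation fractions ...
weighted_mean = sum(weights[name] * cs / cov for name, (cov, cs) in test_group.items())
# ... is the same number as the pooled fraction computed from the summed counts.
pooled = sum(cs for _, cs in test_group.values()) / total_cov

for name, weight in weights.items():
    print(f"{name}: weight {weight:.3f}")
```

Here test3 alone contributes 471 of the 1051 pooled reads, so it carries more weight than any other sample in the group.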

Best,

Altuna

--
You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/methylkit_discussion/3631.186296437.1612785609%40gmail.com.

Norman Grover

Feb 9, 2021, 6:41:37 AM

to methylkit_discussion

Dear Altuna,
Thank you for your quick reply.
You could be right that pooling may not be necessary, but the purpose of my post was to find out how methylKit does the calculation using the data I provided: I am looking for the method used for the normalization, not its scientific implications.

Altuna Akalin

Feb 9, 2021, 6:58:32 AM

to methylkit_...@googlegroups.com

Hi Norman,

I'm saying normalizing may not be necessary. Check the coverage distributions: normalization only normalizes the coverages; percent methylation values stay the same.

Pooling might be good if you have missing CpGs in different experiments.

Best,

Altuna

Norman Grover

Feb 9, 2021, 8:32:42 AM

to methylkit_discussion

Dear Altuna,
You're right. That was a slip-up on my part. Sorry. Normalization may indeed not be necessary. But how does methylKit do it? As I said in the original post, there is a follow-up question, but it depends on how methylKit computes the meth.diff from the numbers I provided. Please, if you know how, share that information with me. I would be very grateful.

Altuna Akalin

Feb 10, 2021, 7:11:48 AM

to methylkit_...@googlegroups.com

Dear Norman,

For the median normalization, the median of the coverage distribution is calculated for each sample. Then a scaling factor is calculated from these medians, and the coverage values are multiplied by that scaling factor, keeping the percent methylation values intact or very similar to the original values.

The code is here: https://github.com/al2na/methylKit/blob/master/R/normalizeCoverage.R#L82
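
The procedure described above can be sketched in a few lines. This is an illustrative Python sketch, not the methylKit source, and it rests on assumptions that the description leaves open: the scaling factor is taken to be the largest sample median divided by each sample's own median, scaled values are rounded to integers, and Ts are derived as coverage minus Cs. The function name `normalize_coverage` and the two-sample data set are hypothetical.

```python
from statistics import median

def normalize_coverage(samples):
    """Scale each sample's coverage so that all sample medians match.

    samples: dict of sample name -> list of (coverage, num_cs), one pair per region.
    ASSUMPTION: scaling factor = largest sample median / this sample's median.
    """
    medians = {name: median(cov for cov, _ in regions) for name, regions in samples.items()}
    largest = max(medians.values())
    normalized = {}
    for name, regions in samples.items():
        factor = largest / medians[name]
        scaled = []
        for cov, num_cs in regions:
            fraction_c = num_cs / cov            # methylated fraction before scaling
            new_cov = round(factor * cov)        # scaled coverage, rounded to an integer
            new_cs = round(new_cov * fraction_c) # keep the methylated fraction (up to rounding)
            scaled.append((new_cov, new_cs, new_cov - new_cs))  # ASSUMPTION: Ts = coverage - Cs
        normalized[name] = scaled
    return normalized

# Hypothetical data: the "deep" sample has four times the coverage of the "shallow" one.
samples = {
    "shallow": [(10, 5), (20, 15), (30, 30)],
    "deep": [(40, 20), (80, 60), (120, 120)],
}
normalized = normalize_coverage(samples)
for name, regions in normalized.items():
    print(name, regions)
```

In this toy case the shallow sample is scaled up by a factor of four, so both samples end up with the same coverage and the same methylated fractions. With real data the rounding step can shift the fractions slightly.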

If you need more help or clarification please send a reproducible example that includes your code and subset of the data.

Best,

Altuna

Norman Grover

Feb 13, 2021, 4:51:59 AM

to methylkit_discussion

Dear Altuna,
    Thank you for the code. Just what I needed. We were able to extract the four numbers I was looking for, and my suspicions were confirmed: pooled data following normalisation may give the requisite meth.diff value but they certainly cannot be analysed by Fisher's Exact Test. Let me explain.
    When the data are not normalised prior to pooling, the 2x2 table that is analysed by Fisher is

                        Cs       Ts
            control     678      0
            test        814      237

These give a meth.diff of -22.55% and a p-value of 4.42e-57. After normalisation, the pooled 2x2 table becomes

                        Cs       Ts
            control     1884     0
            test        1789     566

    The ratios are more or less preserved, as you see, but the absolute numbers have more than doubled. Fisher's Exact Test requires counts, actual counts, in all four positions of the table, and nothing else. The F-test, t-test, r statistic and others are invariant to linear transformations; Fisher's is not. It accepts nothing but counts, otherwise you get garbage, as in this case, where your normalisation decreased the p-value by more than 100 orders of magnitude, to 3.54e-160. (The meth.diff, which depends on ratios rather than absolute quantities, is now -24.03%.)
    I think this may not be the only problem with methylKit's statistics, but I'll leave it at that for now.
    Once again, thank you for your help.
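
The sensitivity of Fisher's Exact Test to the absolute size of the table entries can be checked directly on the two tables quoted in this message. The following is a minimal, standard-library Python sketch of a two-sided Fisher test that works in log space so that very small probabilities do not underflow prematurely. It is an independent illustration, not methylKit's implementation, and the function names are made up here; exact digits may differ slightly from those reported by other software.

```python
from math import exp, lgamma

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    log_denominator = log_choose(total, col1)

    def log_prob(x):
        # Log-probability that the top-left cell equals x, given fixed margins.
        return log_choose(row1, x) + log_choose(row2, col1 - x) - log_denominator

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_observed = log_prob(a)
    p = sum(
        exp(lp)
        for lp in (log_prob(x) for x in range(lowest, highest + 1))
        if lp <= log_observed + 1e-7  # small tolerance for floating-point ties
    )
    return min(1.0, p)

p_raw = fisher_exact_two_sided(678, 0, 814, 237)           # pooled counts before normalisation
p_normalised = fisher_exact_two_sided(1884, 0, 1789, 566)  # pooled values after normalisation
print(f"before: {p_raw:.3g}, after: {p_normalised:.3g}")
```

Because the ratios in the two tables are similar while the totals differ by more than a factor of two, the second p-value comes out smaller by roughly a hundred orders of magnitude.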

Altuna Akalin

Feb 15, 2021, 7:37:52 AM

to methylkit_...@googlegroups.com

Dear Norman,

I think you are missing a point I raised earlier. Do you have systematic coverage differences between samples? If yes, that coverage difference will affect the test results artificially, similar to the case you described. Just because you happen to sequence some samples deeper than others you might get such effects as you described (inflated p-values). If you have such systematic differences, that should be normalized in my opinion. If you have better ways of dealing with this, we are open to suggestions and pull requests as well.

Best,

Altuna

Norman Grover

Feb 16, 2021, 5:48:18 AM

to methylkit_discussion

Dear Altuna,

No, I’m not missing the point. The tile in my example may be problematic and give rise to the “100 orders of magnitude”, but even a change in p-value of half an order of magnitude is unacceptable. It is not the change in p-value that is the problem (that was merely a striking example); the problem lies in your use of Fisher’s Exact Test for 2x2 tables after having changed the observed counts during the normalisation. The entries

1884    0
1789    566

are no longer counts and cannot be analysed by Fisher’s Exact Test.

I see only one alternative to pulling Fisher following normalisation, and that is to restore the original counts without affecting the results of the normalisation. Sounds like a contradiction, but it is not. The normalisation is concerned only with relative, not absolute, coverage; Fisher requires that all four entries be counts, and for that it is enough that the original total coverage be preserved. Let me explain by way of the treatment group in my tile:

                                                        +         -         total

before normalisation                                    814       237       1051
after normalisation                                     1789      566       2355

1. multiply all entries in row 2 by 1051/2355           798.40    252.60    1051

2. to allocate the final bp, you can

   (a) truncate the entry with the smaller remainder    798       253       1051

   or (b) take into account that Fisher deals in factorials:
       truncate max(799*(1-0.40), 253*(1-0.60))         798       253       1051

   or (c) consider the effect in the context of normalisation:
       choose according to the effect on the normalised meth.diff
       (increase or decrease in absolute value);
       this requires processing both groups together

That is my suggestion. The relative coverage computed by the normalisation is preserved (except for the final integer, but even that becomes part of the normalisation if you adopt option (c)), and Fisher is permitted because the normalisation did not change the total number of counts but merely reapportioned them between the + and - categories.
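
Steps 1 and 2(a) of the suggestion above can be sketched as follows. This is a minimal Python illustration of the proposal as described in this message, not something methylKit provides; the function name `rescale_to_total` is hypothetical, and integer arithmetic is used so that the remainders are compared exactly.

```python
def rescale_to_total(counts, target_total):
    """Scale integer counts so that they sum to target_total.

    Each entry is multiplied by target_total / sum(counts) and truncated;
    the units lost to truncation go to the entries with the largest remainders.
    """
    current_total = sum(counts)
    quotients, remainders = [], []
    for count in counts:
        # Integer division keeps the comparison of remainders exact.
        quotient, remainder = divmod(count * target_total, current_total)
        quotients.append(quotient)
        remainders.append(remainder)
    leftover = target_total - sum(quotients)
    # Hand the leftover units to the entries with the largest remainders.
    order = sorted(range(len(counts)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        quotients[i] += 1
    return quotients

# Treatment group after normalisation (+, -), restored to the original total of 1051.
rescaled = rescale_to_total([1789, 566], 1051)
print(rescaled)
```

Options (b) and (c) differ only in how the final unit is allocated, so they would replace the sorting rule in this sketch.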

When you wrote “If you have better ways of dealing with this”, I thought a bit about your logistic regression but I don’t see how you can set up a logistic regression in the present case (only 2 groups, albeit with replicates, and no covariates). What have I missed?

Altuna Akalin

Feb 16, 2021, 6:07:25 AM

to methylkit_...@googlegroups.com

My suggestion is not to normalize, or even think about normalization, if you don't have systematic differences. I'm sorry, but your suggestions don't fully make sense to me at the moment. What you suggest seems to change percent methylation values, and that is not "acceptable".

Best,

Altuna

Norman Grover

Feb 17, 2021, 4:56:12 AM

to methylkit_discussion

Dear Altuna,

My tile doesn’t need help, my data doesn’t need help, I don’t need help—methylKit needs help: its algorithm for normalisation transforms the empirical data so that it cannot subsequently be analysed by Fisher’s Exact Test. I realise you do not understand this, so get a proper statistician. And while he (or she) is at it, have him (her) take a look at the rest of methylKit’s statistical procedures, I have my suspicions about them too.

MethylKit has been unwittingly providing erroneous results to scientists for years based on its sloppy statistics. You owe it to the scientific community to rectify the situation as quickly as possible.
