A suggestion to improve MACS FDR calculation

lishen_mssm

Aug 11, 2010, 9:06:57 AM

to MACS announcement

I've read through several posts here about the FDR calculation in
MACS. Forgive me, but I was a bit alarmed to learn that the FDR is
not a monotonic function of the P-value. According to Liu's explanation,
this is a consequence of how MACS determines the FDR from two peak lists: a
positive and a negative peak list. If the negative peak list contains
several peaks with highly significant P-values, then the top-ranked
positive peaks will have a larger FDR, because the FDR is calculated as:

FDR = (# of negative peaks < P-cutoff) / (# of positive peaks < P-cutoff)
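
To make the formula above concrete, here is a minimal Python sketch. It is not MACS's actual code: the function name `empirical_fdr`, the cap at 1.0 (the "100%" case), the handling of an empty positive list, and the P-values in the usage note are all assumptions made for illustration.

```python
def empirical_fdr(pos_pvalues, neg_pvalues, cutoff):
    """Empirical FDR at a P-value cutoff, following the ratio quoted above:
    (# of negative peaks < cutoff) / (# of positive peaks < cutoff).
    """
    n_pos = sum(1 for p in pos_pvalues if p < cutoff)
    n_neg = sum(1 for p in neg_pvalues if p < cutoff)
    if n_pos == 0:
        # Assumption: with no positive peaks below the cutoff, report 0.
        return 0.0
    # Assumption: cap the ratio at 1.0, i.e. report at most "100%".
    return min(1.0, n_neg / n_pos)


# Made-up P-values: the negative list holds two very significant peaks.
pos = [1e-30, 1e-20, 1e-10, 1e-5]
neg = [1e-40, 1e-35]

strict = empirical_fdr(pos, neg, 1e-25)  # 1 positive peak, 2 negative peaks
loose = empirical_fdr(pos, neg, 1e-4)    # 4 positive peaks, 2 negative peaks
```

With these made-up numbers the strict cutoff gives an FDR of 1.0 while the looser cutoff gives 0.5, which is the non-monotonic behaviour discussed in this post.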

This can be problematic because it relies on a theoretical Poisson
model of the ChIP and control libraries. In reality, the ChIP library
usually has much lower coverage than the control. Therefore, there
will be many regions that show enrichment only in the control library.
Sometimes these enrichments can be highly significant, depending
on the experimental design, antibody efficiency, sequencing machine,
etc. As a result, many significant positive peaks end up
tagged with an FDR of "100%", which makes the FDR column meaningless.

Why don't we simply force the FDR to be monotonic in the P-value? For
example, without loss of generality, let's assume we have observed:

P_1 < P_2 < ... < P_N

while min(FDR(P_1), FDR(P_2), ... , FDR(P_N)) = FDR(P_N)

Then we simply set every FDR to this minimum: FDR(P_1) = FDR(P_2) = ... = FDR(P_N).
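
The adjustment described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `monotonize_fdr` is hypothetical, and the code generalizes the special case above by giving each peak the smallest FDR found at its own P-value or any larger one (a running minimum taken from the least significant peak upward).

```python
def monotonize_fdr(pvalues, fdrs):
    """Return adjusted FDRs that never decrease as the P-value increases.

    Each peak receives the minimum raw FDR among all peaks whose
    P-value is at least as large as its own.
    """
    # Peak indices ordered from most to least significant P-value.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    adjusted = list(fdrs)
    running_min = float("inf")
    # Walk from the least significant peak towards the most significant one.
    for i in reversed(order):
        running_min = min(running_min, fdrs[i])
        adjusted[i] = running_min
    return adjusted


# Made-up example matching the case above: the minimum FDR sits at P_N.
pvals = [1e-30, 1e-20, 1e-10, 1e-5]
raw_fdr = [1.0, 1.0, 0.67, 0.5]
adj_fdr = monotonize_fdr(pvals, raw_fdr)
```

In this example every peak ends up with an FDR of 0.5, so the top-ranked peaks are no longer reported at 100%.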

This is very similar to the Benjamini-Hochberg (BH) method. The advantage
is that we don't lose the top-ranked peaks because of a careless choice of
FDR cutoff!

Noboru Jo Sakabe

Aug 13, 2010, 3:41:42 PM

to macs-ann...@googlegroups.com

Since you got no responses, I'll put in my 2 cents, hoping that more
knowledgeable people will jump in.
As I understand it, it makes sense that the (empirical) FDR does not
strictly correlate with the P-value. This FDR is not a simple multiple-testing
correction. It is the "real" P-value (or Q-value), because it tells the
user the real odds of finding the experimental peaks just by
chance.
The only solution to a lack of reads is to obtain more reads. If
you get 100% FDR, it means that you could get those peaks with the input
alone, i.e. the sample is not enriched. When the sample is good, peaks
with much less than 100% FDR will be found in abundance.
