MACS2 -- pValue-cutoff-filtered subsets of a peak set differ from the peak set generated by macs2 if you specified that pValue in the first place

John Urban

unread,

Jul 3, 2013, 8:21:07 PM7/3/13

to macs-ann...@googlegroups.com

Subject:

MACS2 -- pValue-cutoff-filtered subsets of a peak set differ from the peak set generated by macs2 if you specified that pValue in the first place

Hi Tao and MACS community,

I am hoping for a recommendation --- and perhaps some more insight into the process that leads to what I will describe below.

I called a set of peaks (macs2 latest version) with a relatively loose cutoff of 1e-3. Then I wanted to see if the more stringent subsets obtained by filtering the peaks based on their -log10(pvalue) scores (e.g. take peaks with -log10(p) >= 5) were the same as if I just used that pValue cutoff in the first place when calling peaks.

So I did something like:

awk '{if ($5 >= 5) print $0}' peaks.bed > moreStringentPeaksFiltered.bed

as well as called the peaks using the same exact data and parameters aside from "-p 1e-5" instead of "-p 1e-3" -- lets say these more stringently called peaks are in a file called: "moreStringentPeaksCalled.bed"

These 2 "moreStringent" sets have dramatically different numbers of peaks:

wc -l moreStringentPeaksFiltered.bed

143034 (84321 of which overlap below set) -- all of which are in the set they were filtered from

wc -l moreStringentPeaksCalled.bed

97154 (96907 of which overlap above filtered set) -- all of which overlap the less stringently called set

This is surprising to me.

As for the recommendation:

Should I move forward always getting more stringent subsets straight from peak calling? or from filtering from less stringent sets?

...which is more ..."correct" ...?

As for some insight on this process -- why is it the case that these 2 approaches to obtaining the "same" subset of peaks give different results (in this case dramatically different)?

Many thanks,

John

--
John Urban

Ph.D. Candidate

NSF Graduate Research Fellow

Brown University

Dept. of Molecular Biology, Cell Biology, and Biochemistry

Laboratory of Susan Gerbi, Sydney Frank Hall 257

Providence, RI 02906

john_...@brown.edu

Tao Liu

unread,

Jul 3, 2013, 9:38:46 PM7/3/13

to macs-ann...@googlegroups.com

Hi John,

You would get identical result between filtering bed file from loose cutoff with stringent cutoff and applying the stringent cutoff in the first place. The reason is that, the p-value or q-value assigned to each peak corresponds to the highest point of peak -- peak summit; and while you set -p in command line, this cutoff will be required at EVERY position in the peak region. In another word, by setting '-p 1e-3' then filtering using '-p 1e-5', you will get the peaks where the whole regions are higher than 1e-3 and summits are higher than 1e-5; by setting '-p 1e-5' directly, you will have peaks where whole regions are higher than 1e-5. So, in general, -p 1e-5 peaks should be smaller than those from -p 1e-3. This will also cause some "-p 1e-5" being rejected due to smaller size.

Best,

Tao

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/groups/opt_out.

John Urban

unread,

Jul 4, 2013, 9:18:08 AM7/4/13

to macs-ann...@googlegroups.com

Thanks Tao,

That cleared up the issue for me.

To put what you said into my own words:

1. the p-value scores associated with peaks in the BED file only have to do with the summit -- not the entire peak. {This also goes for q-value -- and FE as well}

2. peaks are called if the p-value associated with the entire peak is smaller than the threshold (and > than a minimum length cutoff).

3. Most summits probably have smaller (more significant) p-values than the entire peak (e.g. a peak with p ~1e-3 that has a summit with p ~1e-5).

4. Thus, the number of 1e-3 peaks with summits with p < 1e-5 is larger than the number of peaks called at p<1e-5.

5. As a corollary to #2, though peaks called at 1e-5 overlap peaks called at 1e-3, they are likely to be narrower than the peaks they overlap in the less stringent set.

6. As a consequence of #5, many potential peaks that have p-val < 1e-5 (e.g. regions around summits in some of the peaks called at 1e-3) become too narrow (length < minimum size cutoff) and get rejected as peaks because of being too narrow. (What is the heuristic used here? --- d value?).

It seems clear that if I want peaks that are significant at p<1e-5, I need to call them at that level -- not filter for them from a less stringent set (as that is looking at summit p-values, not peak p-values). Best thing to do seems to be to use bdgcmp to generate the p-value signal track and use bdgpeakcall to call the stringency levels I need (to speed up the peak calling process).