MACS2 -- pValue-cutoff-filtered subsets of a peak set differ from the peak set generated by macs2 if you specified that pValue in the first place

893 views
Skip to first unread message

John Urban

unread,
Jul 3, 2013, 8:21:07 PM7/3/13
to macs-ann...@googlegroups.com
Subject: 
MACS2 --  pValue-cutoff-filtered subsets of a peak set differ from the peak set generated by macs2 if you specified that pValue in the first place

Hi Tao and MACS community,

I am hoping for a recommendation --- and perhaps some more insight into the process that leads to what I will describe below. 

I called a set of peaks (macs2 latest version) with a relatively loose cutoff of 1e-3. Then I wanted to see if the more stringent subsets obtained by filtering the peaks based on their -log10(pvalue) scores (e.g. take peaks with -log10(p) >= 5) were the same as if I just used that pValue cutoff in the first place when calling peaks.  

So I did something like:
awk '{if ($5 >= 5) print $0}' peaks.bed > moreStringentPeaksFiltered.bed 
    as well as called the peaks using the same exact data and parameters aside from "-p 1e-5" instead of "-p 1e-3" -- lets say these more stringently called peaks are in a file called: "moreStringentPeaksCalled.bed"


These 2 "moreStringent" sets have dramatically different numbers of peaks:
wc -l moreStringentPeaksFiltered.bed 
143034  (84321 of which overlap below set) -- all of which are in the set they were filtered from
wc -l moreStringentPeaksCalled.bed
97154  (96907 of which overlap above filtered set)  -- all of which overlap the less stringently called set


This is surprising to me. 


As for the recommendation:
Should I move forward always getting more stringent subsets straight from peak calling? or from filtering from less stringent sets?  
    ...which is more ..."correct" ...?


As for some insight on this process -- why is it the case that these 2 approaches to obtaining the "same" subset of peaks give different results (in this case dramatically different)?


Many thanks,

   John



-- 
John Urban
Ph.D. Candidate
NSF Graduate Research Fellow
Brown University
Dept. of Molecular Biology, Cell Biology, and Biochemistry
Laboratory of Susan Gerbi, Sydney Frank Hall 257
Providence, RI 02906






Tao Liu

unread,
Jul 3, 2013, 9:38:46 PM7/3/13
to macs-ann...@googlegroups.com
Hi John,

You would get identical result between filtering bed file from loose cutoff with stringent cutoff and applying the stringent cutoff in the first place. The reason is that, the p-value or q-value assigned to each peak corresponds to the highest point of peak -- peak summit; and while you set -p in command line, this cutoff will be required at EVERY position in the peak region. In another word, by setting '-p 1e-3' then filtering using '-p 1e-5', you will get the peaks where the whole regions are higher than 1e-3 and summits are higher than 1e-5; by setting '-p 1e-5' directly, you will have peaks where whole regions are higher than 1e-5. So, in general, -p 1e-5 peaks should be smaller than those from -p 1e-3. This will also cause some "-p 1e-5" being rejected due to smaller size. 

Best,
Tao



--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

John Urban

unread,
Jul 4, 2013, 9:18:08 AM7/4/13
to macs-ann...@googlegroups.com
Thanks Tao,

That cleared up the issue for me.

To put what you said into my own words:

1. the p-value scores associated with peaks in the BED file only have to do with the summit -- not the entire peak. {This also goes for q-value -- and FE as well}

2. peaks are called if the p-value associated with the entire peak is smaller than the threshold (and > than a minimum length cutoff). 

3. Most summits probably have smaller (more significant) p-values than the entire peak (e.g. a peak with p ~1e-3 that has a summit with p ~1e-5). 

4. Thus, the number of 1e-3 peaks with summits with p < 1e-5 is larger than the number of peaks called at p<1e-5. 


5. As a corollary to #2, though peaks called at 1e-5 overlap peaks called at 1e-3, they are likely to be narrower than the peaks they overlap in the less stringent set.


6. As a consequence of #5, many potential peaks that have p-val < 1e-5 (e.g. regions around summits in some of the peaks called at 1e-3) become too narrow (length < minimum size cutoff) and get rejected as peaks because of being too narrow. (What is the heuristic used here?  --- d value?).  


It seems clear that if I want peaks that are significant at p<1e-5, I need to call them at that level -- not filter for them from a less stringent set (as that is looking at summit p-values, not peak p-values).  Best thing to do seems to be to use bdgcmp to generate the p-value signal track and use bdgpeakcall to call the stringency levels I need (to speed up the peak calling process).

Thanks again,

   John






Reply all
Reply to author
Forward
0 new messages