MACS2 -- pValue-cutoff-filtered subsets of a peak set differ from the peak set generated by macs2 if you specified that pValue in the first place
Hi Tao and MACS community,
I am hoping for a recommendation --- and perhaps some more insight into the process that leads to what I will describe below.
I called a set of peaks (macs2 latest version) with a relatively loose cutoff of 1e-3. Then I wanted to see if the more stringent subsets obtained by filtering the peaks based on their -log10(pvalue) scores (e.g. take peaks with -log10(p) >= 5) were the same as if I just used that pValue cutoff in the first place when calling peaks.
So I did something like:
awk '{if ($5 >= 5) print $0}' peaks.bed > moreStringentPeaksFiltered.bed
as well as called the peaks using the same exact data and parameters aside from "-p 1e-5" instead of "-p 1e-3" -- lets say these more stringently called peaks are in a file called: "moreStringentPeaksCalled.bed"
These 2 "moreStringent" sets have dramatically different numbers of peaks:
wc -l moreStringentPeaksFiltered.bed
143034 (84321 of which overlap below set) -- all of which are in the set they were filtered from
wc -l moreStringentPeaksCalled.bed
97154 (96907 of which overlap above filtered set) -- all of which overlap the less stringently called set
This is surprising to me.
As for the recommendation:
Should I move forward always getting more stringent subsets straight from peak calling? or from filtering from less stringent sets?
...which is more ..."correct" ...?
As for some insight on this process -- why is it the case that these 2 approaches to obtaining the "same" subset of peaks give different results (in this case dramatically different)?
Many thanks,
John
-- John UrbanPh.D. Candidate
NSF Graduate Research Fellow
Brown University
Dept. of Molecular Biology, Cell Biology, and Biochemistry
Laboratory of Susan Gerbi, Sydney Frank Hall 257
Providence, RI 02906