For me, there is still an interest in getting FE, p and q information associated with the entire peak region output by MACS2 whether or not they are identical peak regions to what MACS1 would have output (in addition to having the information about the summit). My needs for this are not for subsetting, but for sorting the peak set by the value associated with the entire peak rather than the summit.
To motivate why I want this:
Sometimes, to do a quick check of reproducibility between replicate peak sets, I will sort and look to see if at least 80% of the top ("best") 40% of peaks (sorted by some score) overlap between replicates. In general, I expect the % of the top 40% that overlaps a replicate to be higher than the % of the entire peak set that overlaps the replicate (i.e. the top 40% is more likely to be real). Likewise, I expect the % of the weakest peaks (e.g. bottom 40%) to be lower than the % of the entire set that overlaps the replicate (i.e. bottom 40% is more likely to be noise). With my data, the above expectations were always met for MACS1 when sorting by -10*log10(p), %FDR, and FE. However, with MACS2 using the same data, the % of the top 40%, % of bottom 40%, and % of the entire set are pretty much the same -- i.e. the top 40% has the same effect on what % will overlap the replicate as taking a random sample of the same size. So in my mind, at least for my data (other people may not see this), the values associated with the peaks may be more useful than with the summits for this application.
Thus, I came up with the 2 ways discussed in previous emails (that I still have not tried) to get scores that reflect the entire peak: Approach#1 being the average -log10(p) of the entire peak taken from the -log10(p) track; Approach #2 being to try to calculate a p-value from the entire peak with information from the treatment pileup and control lambda signal tracks (or from BAM files of treatment an control reads [applying relevant scaling factor where necessary] --
discussed in this thread).
It seems to me that approach#2 can be tweaked to be similar to what MACS1 does after it identifies candidate peak regions. In this scenario, the MACS2 peaks can be considered the candidate peak regions (I am not saying that these would be the same candidate peak regions MACS1 would have identified though). One then can get the number of treatment reads in the MACS2 peak region, call it X. Then, for lambda, take the largest reads-per-bp-average amongst the genome-wide average and local averages in a small and large windows centered around the summit. The largest per bp average would then be multiplied by the peak length to get the lambda value to use. The MACS1-like poisson upper-tailed p-value for the entire MACS2 peak could then be calculated.
If one did not want the p-value to be dependent on the peak's length (but still wanted a p-value for the entire peak), then instead of multiplying the control's maximum per bp average by the peak length to get the lambda, one could divide the treatment read count by the peak length to get the average number of treatment reads per bp in the peak. Then the poisson upper-tailed p-value can be calculated from X = treatment-per-bp-avg-inside-peak-region, lambda=max-per-bp-average(global, slocal, llocal).
Any thoughts on this?
Best,
John