What you have stated is absolutely right.
The peak boundaries that come out of idrOverlap2npk.sh are not as accurate as what you directly get out of the peak caller for the pooled set since it is based on merging the peak boundaries of the overlapping peaks in the two replicates. Also, you lose the "summit" information which is more accurate from calling peaks on the pooled data, which is why we prefer going with the peak coordinates from the pooled data.
However, you are also right that while most peaks from the pooled set (after rank thresholding) will also part of the idrOverlap2npk output, some will not. This was partially by design for 2 reasons.
- If a peak was identified in the pooled set (but for some reason was not identified in one of the replicates) and lies within the ranks of mostly reproducible and IDR approved reproducible peaks then we are fine with accepting it in the final list.
- The IDR scores generally have a strong monotonic relationship with the original ranking measure. So we decided to maintain the original ranking measure of the peak caller on the pooled data when selecting our final peak sets even if some of the peaks dont necessarily pass IDR analysis on the replicates.
There is actually a way to be more conservative.
- Lets call the peak list on the pooled data 'P' and for the individual replicates 'R1' and 'R2'
- For each peak in P, record the peak coordinate from P. Then get the "best" representative peak from R1 record the ranking measures from the R1 peak. If you have multiple candidate R1 peaks, you can decide what you mean by most representative e.g. maximal overlap with the P peak, closest summit to the P peak, strongest peak etc. In my experience each of the above work well.
- Do the same for R2.
- If a peak in P has no overlap in one of the replicates, drop that peak.
- Thus, you create two new peak lists where the peak coordinates are identical between the two lists (since they come from peaks in P) but with independent ranking measures from R1 and R2.
- Run IDR on these two lists.
- Now use idrOverlap2npk on the IDR output to obtain peaks ranked by their IDR scores and now threshold this list based on your selected IDR threshold (Lets call this list Q). The selected peaks from Q will have the exact same coordinates as some peak in P. So you can match these exactly and obtain the peaks, summits, original peak caller ranking scores etc. from P that match the peaks in Q.
This way you only keep peaks that are perfectly reproducible (identified in both replicates) and also exactly pass IDR which maintaining the best resolution of peak coordinates and summits. It also allows you finer control of how you match peaks in the two replicates. This also helps when the peak caller calls peaks that are very near each other that may get merged into a single meta-peak if they partially overlap in the default IDR code.
We will likely be recommending to shift to this strategy soon since nowadays many of the newer peak callers can often deconvolve multiple binding events within the same peak region and we dont want to lose that resolution.
Thanks,
Anshul.