MACS2: bdgpeakcall output not totally consistent with callpeak output

1,062 views
Skip to first unread message

John Urban

unread,
Jul 9, 2013, 11:55:10 AM7/9/13
to macs-ann...@googlegroups.com
Hi Tao,

I have been using bdgpeakcall for a few days. It is great - e.g. it takes less than 10 minutes to call peaks on the human genome if you start at this step. However, an observation is that the narrowPeak file output by bdgpeakcall is different from the one generated from callpeak (with the same data and parameters).  To give an example:


narrowPeak from callpeak -- using q; cutoff of 0.01 ==> -log10(0.01) = 2:
chr1 9715  10753 peak_1 1049 . 6.25641 109.11100  104.95456  484
chr1 12402 13100 peak_2 57 . 2.42105 7.60889      5.74906      201

narrowPeak from bdgpeakcall -- from qvalue bdg; cutoff of 2:
chr1  9715 10753 narrowPeak1 1175 . 0.00000 0.00000 0.00000 631
chr1 12402 13100 narrowPeak2 74 . 0.00000 0.00000 0.00000 175


-- The number of peaks in each file is identical. Good.
-- The start and end locations of each peak seem to be identical. Great.
-- What concerns me is that columns 4 ("score") and 10 (relative summit) differ between the 2 ways of calling the peaks. 
Is there a reason for this? Why do the summit locations differ? While we are at it, what exactly is the score in column 4? How is it obtained?

As for columns 7, 8, and 9 (which usually give information on FE, -log(p) and -log(q)), would it be possible to fill in at least the relevant column (depending on which signal track you are using) with a score (e.g. if you used the p-value signal track, fill in column 8 with the -log10(p). I do realize it would not be possible to calculate FE from the pval or qval signal track and vice versa -- but I do imagine it might be possible to calculate the pvalues and qvalues interchangeably (so maybe fill in both columns if using one of those tracks). 

For now I will go back to using callpeaks for everything so it is all consistent. Nonetheless, I do like having the option to use bdgpeakcall. Thanks for adding it to the MACS arsenal.

Best,

  John 

p.s.
I saw the same thing when using p-value to call peaks. Same number of peaks, same starts/ends, different scores/summits.

-- 
John Urban
Ph.D. Candidate
NSF Graduate Research Fellow
Brown University
Dept. of Molecular Biology, Cell Biology, and Biochemistry
Laboratory of Susan Gerbi, Sydney Frank Hall 257
Providence, RI 02906

Tao Liu

unread,
Jul 10, 2013, 2:51:05 PM7/10/13
to macs-ann...@googlegroups.com
Hi Jon,

On Jul 9, 2013, at 11:55 AM, John Urban <john_...@brown.edu> wrote:

I have been using bdgpeakcall for a few days. It is great - e.g. it takes less than 10 minutes to call peaks on the human genome if you start at this step. However, an observation is that the narrowPeak file output by bdgpeakcall is different from the one generated from callpeak (with the same data and parameters).  To give an example:

narrowPeak from callpeak -- using q; cutoff of 0.01 ==> -log10(0.01) = 2:
chr1 9715  10753 peak_1 1049 . 6.25641 109.11100  104.95456  484
chr1 12402 13100 peak_2 57 . 2.42105 7.60889      5.74906      201

narrowPeak from bdgpeakcall -- from qvalue bdg; cutoff of 2:
chr1  9715 10753 narrowPeak1 1175 . 0.00000 0.00000 0.00000 631
chr1 12402 13100 narrowPeak2 74 . 0.00000 0.00000 0.00000 175

-- The number of peaks in each file is identical. Good.
-- The start and end locations of each peak seem to be identical. Great.
-- What concerns me is that columns 4 ("score") and 10 (relative summit) differ between the 2 ways of calling the peaks. 
Is there a reason for this? Why do the summit locations differ? While we are at it, what exactly is the score in column 4? How is it obtained?

I am not sure why you saw different results. I just tested and saw the same peak regions and summits… When I ran bdgpeakcall, besides -c for cutoff, I have to make sure other options such as '-g' for maximum gap being the same as read length, and '-l' for minimum length being the same as fragment size. But I am sure you have set them correctly otherwise you won't see same number of peaks and the same peak boundaries.  Also, you need to use bdgcmp with -m qpois to generate a qscore track.

$head -2 A_c2.0_l146_g101_peaks.narrowPeak
chr1 9349196 9349342 A_narrowPeak1 35 . 0.00000 0.00000 0.00000 98
chr1 9687137 9687283 A_narrowPeak2 45 . 0.00000 0.00000 0.00000 127

$head -2 NA_peaks.narrowPeak
chr1 9349196 9349342 NA_peak_1 35 . 5.37249 8.49611 3.50002 98
chr1 9687137 9687283 NA_peak_2 45 . 6.26791 10.27468 4.58189 127

As for columns 7, 8, and 9 (which usually give information on FE, -log(p) and -log(q)), would it be possible to fill in at least the relevant column (depending on which signal track you are using) with a score (e.g. if you used the p-value signal track, fill in column 8 with the -log10(p). I do realize it would not be possible to calculate FE from the pval or qval signal track and vice versa -- but I do imagine it might be possible to calculate the pvalues and qvalues interchangeably (so maybe fill in both columns if using one of those tracks). 

I don't think it's possible to infer other types of scores. I want to keep bdgpeakcall as a naive peak caller as possible. Column 7/8/9 are totally useless in its output. The only reason I use narrowPeak format is to incorporate summit information.

Best,
Tao
Reply all
Reply to author
Forward
0 new messages