qvalue when specifying at p-value cutoff

711 views
Skip to first unread message

Dietmar Rieder

unread,
Aug 23, 2013, 4:02:55 AM8/23/13
to macs-ann...@googlegroups.com
Hi Tao,

I'm currently testing MACS2 from github. I use callpeaks with a p-value cutoff (e.g. -p 1e-04) and the help output for this parameter says that the qvalue will not be calculated and reported as -1 in the final .xls file. However, when I look at the xls file I still see q-values in the -log10(qvalue) column they are not set to -1.

In MACS version 2.0.10.20120913 the q-values were set to -1 but now with MACS version 2.0.10.20130712 they are not. Is this intended? I'm wondering since the header in the xls file still says "qvalue will not be calculated and reported as -1 in the final output".


Thanks
  Didi

Tao Liu

unread,
Aug 23, 2013, 11:50:13 AM8/23/13
to macs-ann...@googlegroups.com
Thanks for reminding me! Right, currently q-value will still be calculated even -p is used. It will spend more time, but will give a more standard output. I will change the descriptions in later releases.

Tao


--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/groups/opt_out.

Dietmar Rieder

unread,
Aug 23, 2013, 2:32:48 PM8/23/13
to macs-ann...@googlegroups.com

Thanks for the explanation, I like the fact that macs is now also calculating the q-value even if -p is used.

Didi

Dietmar Rieder

unread,
Aug 26, 2013, 6:51:20 AM8/26/13
to macs-ann...@googlegroups.com
Hi Tao,

I'm trying to understand the influence of the -p parameter in MACS2 on the # of called peaks.
When I'm calling peaks on my dataset using the default cutoff setting w/o -p and w/o -q prarameter I get 14824 peaks.

wc -l TF-MACS2_peaks.narrowPeak 
14824

When I then use -p 1e-03, as expected I get many more peaks

wc -l TF-MACS2-p1e-03_peaks.narrowPeak 
34062

When I now filter this result for peaks with -log10(p) >= 5 and -log10(q) >= 1.30 I was expecting to get a number similar to the # of peaks called with the default settings. However I get about double the amount.

cut -f8,9 TF-MACS2-p1e-03_peaks.narrowPeak  | sort -n -k1,1 | awk '{if (($1 >=5) && ($2 >= 1.30)) print $1 " " $2}' | wc -l
27760

How can I interpret this difference?

Thanks for your help

  Didi

John Urban

unread,
Aug 26, 2013, 7:32:37 AM8/26/13
to macs-ann...@googlegroups.com
Hi Didi,

Perhaps this post will help you:


In short, the p-value reported is the p-value of the summit, not of the peak. Thus, you are filtering for peaks that were called at p >log10(p) >=3  and filtering for peaks with summit log10(p) >= 5 (and summit log10(q) >= 1.3).

Perhaps in the future one can specify to MACS2 whether to report the p-values/q-values/FEs associated with the summit or with the entire peak (as in MACS1)...?? That would be very helpful to me as well. 

In the mean time, one can generate the log10(p)/log10(q)/FE tracks and get, for example, the average FE over the entire peak by converting to bigWig and using bigWigAverageOverBed from UCSC Kent Utilities.


Best,

 John

John Urban

unread,
Aug 26, 2013, 7:34:04 AM8/26/13
to macs-ann...@googlegroups.com
little bit of a type in my previous email:
Fixed sentence:
Thus, you are filtering for peaks that were called at log10(p) >=3  and filtering for peaks with summit log10(p) >= 5 (and summit log10(q) >= 1.3).

John Urban

unread,
Aug 26, 2013, 8:25:47 AM8/26/13
to macs-ann...@googlegroups.com
I have been wondering about this (below) actually -- and am yet to do it. Let me know the results if you try it.

To get the peak associated p-values (rather than summit) I can think of at least 2 ways using MACS2 files.

1) Use bigWigAverageOverBed to obtain the average log10(p) of the peak region in the log10(p) signal track.
---> Can do this for log10(q) as well.

2) Use bigWigAverageOverBed on the treatment pileup track and the control lambda track. Use the sum column in the out.tab file to get the sum of the pileup over all bases in the peak. Do sum/readLength to get the approximate number of reads in the peak region.  Do the same over the control lambda. Then one could get the poisson p-value (e.g. in R) of the treatment read count in the region using the control lambda of the region.
--- One could also do this with SAMtools and other tools. What I like about the above approach, though, is that you can be sure it is looking at what MACS2 is looking at.
--- Question: Would this p-value approach be most similar to the MACS1 approach?


I am curious as to how the 2 estimates of p-value of the peak would be related to each other. I will probably be trying it this week, so will let you know.


Best,

John

Dietmar Rieder

unread,
Aug 26, 2013, 10:49:12 AM8/26/13
to macs-ann...@googlegroups.com
Hi John,

Tanks for the explanation and the link - I should have checked the archive before posting :-)
I also wonder what the cutoff size of a peak is below which it gets rejected at "-p 1e-5"...

Didi

John Urban

unread,
Aug 26, 2013, 11:00:41 AM8/26/13
to macs-ann...@googlegroups.com
***any region with -log10(p) > 5 but shorter than 350 is eliminated.


On Mon, Aug 26, 2013 at 10:59 AM, John Urban <mr.joh...@gmail.com> wrote:
Default cutoff size is fragment size or 'd'.
e.g. If one were to specify shiftsize as 175 indicating fragment length (or whatever we are calling it now) is 350, then I believe any region > -log10(p) > 5 but shorter than 350 is eliminated.

Unlike filtering for p-values in the resulting peak file, filtering for lengths gives the same result you would get if you specified different length cutoffs initially -- if I recall correctly. I forget if I checked that on regular "callpeak" or "bdgpeakcall" though -- it was a while ago.

John Urban

unread,
Aug 26, 2013, 10:59:52 AM8/26/13
to macs-ann...@googlegroups.com
Default cutoff size is fragment size or 'd'.
e.g. If one were to specify shiftsize as 175 indicating fragment length (or whatever we are calling it now) is 350, then I believe any region > -log10(p) > 5 but shorter than 350 is eliminated.

Unlike filtering for p-values in the resulting peak file, filtering for lengths gives the same result you would get if you specified different length cutoffs initially -- if I recall correctly. I forget if I checked that on regular "callpeak" or "bdgpeakcall" though -- it was a while ago.

Dietmar Rieder

unread,
Aug 27, 2013, 3:59:34 AM8/27/13
to macs-ann...@googlegroups.com
Thanks!
  Didi

Dietmar Rieder

unread,
Aug 27, 2013, 10:17:19 AM8/27/13
to macs-ann...@googlegroups.com
Hi John,

I now used bdgcmp to generate the fe-value, p-value and q-value signal tracks and compiled a new narrowPeak file using the first approach (1) you suggested. Basically for each peak called at p < 1e-03 it averages the fe-value, p-value and q-value from the signal tracks over the peak region from the bed file (TF-MACS2_peaks.narrowPeak).

Remember the original peak file called at p < 1e-03 and filtered for p <= 1e-05 and q <= 5e-02 resulted into 27760 peaks.

cut -f8,9 TF-MACS2-p1e-03_peaks.narrowPeak  | sort -n -k1,1 | awk '{if (($1 >=5) && ($2 >= 1.30)) print $1 " " $2}' | wc -l
27760

when I now apply the same filter with the fe-value, p-value and q-value averaged over the peak region i get 16850 peaks.

cut -f8,9 TF-MACS2-p1e-03_peaks_AVG.narrowPeak  | sort -n -k1,1 | awk '{if (($1 >=5) && ($2 >= 1.30)) print $1 " " $2}' | wc -l
16850

so this is closer to the # of peaks called with MACS2's defaults (14824), but still quite different as if I call peaks using -p 1e-05 (5620).

I'll see if I can try the second approach as well.

Best
  Didi

Tao Liu

unread,
Aug 27, 2013, 12:28:31 PM8/27/13
to macs-ann...@googlegroups.com
On Aug 26, 2013, at 8:25 AM, John Urban <mr.joh...@gmail.com> wrote:

> I have been wondering about this (below) actually -- and am yet to do it. Let me know the results if you try it.
>
> To get the peak associated p-values (rather than summit) I can think of at least 2 ways using MACS2 files.
>
> 1) Use bigWigAverageOverBed to obtain the average log10(p) of the peak region in the log10(p) signal track.
> ---> Can do this for log10(q) as well.
>
> 2) Use bigWigAverageOverBed on the treatment pileup track and the control lambda track. Use the sum column in the out.tab file to get the sum of the pileup over all bases in the peak. Do sum/readLength to get the approximate number of reads in the peak region. Do the same over the control lambda. Then one could get the poisson p-value (e.g. in R) of the treatment read count in the region using the control lambda of the region.
> --- One could also do this with SAMtools and other tools. What I like about the above approach, though, is that you can be sure it is looking at what MACS2 is looking at.
> --- Question: Would this p-value approach be most similar to the MACS1 approach?

They won't be the same in either way. In MACS1, the local bias (lambda) is calculated in a way related to candidate peak. For example: Given same amount of tags of ChIP and Control, MACS1 finds a candidate peak region of certain size (L) containing number of ChIP tags (X); it computes the 'maximum average tag number for each bp' at 1k/10k surrounding regions centered at candidate peak summit(B); then the background bias/local lambda is calculated as B*L; the final p-value is computed between X and B*L. In brief, the p-value in MACS1 will be influenced by L of each peak. In MACS2, L is fixed at scan-window size for every position on chromosome in order to get consistent p-value scores, and X is the number of fragment ( extended by d) falling into the same scan window.

-T

Tao Liu

unread,
Aug 27, 2013, 1:02:51 PM8/27/13
to macs-ann...@googlegroups.com
Hi Didi,

MACS2 calls a region 'ENTIRELY ABOVE' threshold, which means every point in such region should have a MINIMUM value, either p or q-value cutoff that you specified. The approaches you tried are either 1) to find regions where the SUMMITS are above cutoff ( by filtering MACS2 output ); or 2) to find regions where the AVERAGE is above cutoff ( by using bigWigAverageOverBed ). Thus they will all give you different results. If you tried 'macs2 bdgpeakcall', you will notice there are other two parameters, maximum-gap and minimum-length, that will also influence results.

Best,
Tao

Dietmar Rieder

unread,
Aug 28, 2013, 4:35:17 AM8/28/13
to macs-ann...@googlegroups.com
Hi Tao,

thanks for explaining and confirming the differences. I'm wondering if any of the two alternative approaches that I tried is valid and makes sense to use. Maybe you can comment on this.

Thanks also for suggesting to try 'macs2 bdgpeakcall', I'll defenitely play with this.

Didi

John Urban

unread,
Aug 28, 2013, 5:02:00 PM8/28/13
to macs-ann...@googlegroups.com
Thanks for all the helpful clarifications. 

For me, there is still an interest in getting FE, p and q information associated with the entire peak region output by MACS2 whether or not they are identical peak regions to what MACS1 would have output (in addition to having the information about the summit). My needs for this are not for subsetting, but for sorting the peak set by the value associated with the entire peak rather than the summit. 

To motivate why I want this: 
Sometimes, to do a quick check of reproducibility between replicate peak sets, I will sort and look to see if at least 80% of the top ("best") 40% of peaks (sorted by some score) overlap between replicates. In general, I expect the % of the top 40% that overlaps a replicate to be higher than the % of the entire peak set that overlaps the replicate (i.e. the top 40% is more likely to be real). Likewise, I expect the % of the weakest peaks (e.g. bottom 40%) to be lower than the % of the entire set that overlaps the replicate (i.e. bottom 40% is more likely to be noise). With my data, the above expectations were always met for MACS1 when sorting by -10*log10(p), %FDR, and FE. However, with MACS2 using the same data, the % of the top 40%, % of bottom 40%, and % of the entire set are pretty much the same -- i.e. the top 40% has the same effect on what % will overlap the replicate as taking a random sample of the same size. So in my mind, at least for my data (other people may not see this), the values associated with the peaks may be more useful than with the summits for this application. 

Thus, I came up with the 2 ways discussed in previous emails (that I still have not tried) to get scores that reflect the entire peak: Approach#1 being the average -log10(p) of the entire peak taken from the -log10(p) track; Approach #2 being to try to calculate a p-value from the entire peak with information from the treatment pileup and control lambda signal tracks (or from BAM files of treatment an control reads [applying relevant scaling factor where necessary] -- discussed in this thread).

It seems to me that approach#2 can be tweaked to be similar to what MACS1 does after it identifies candidate peak regions. In this scenario, the MACS2 peaks can be considered the candidate peak regions (I am not saying that these would be the same candidate peak regions MACS1 would have identified though). One then can get the number of treatment reads in the MACS2 peak region, call it X. Then, for lambda, take the largest reads-per-bp-average amongst the genome-wide average and local averages in a small and large windows centered around the summit. The largest per bp average would then be multiplied by the peak length to get the lambda value to use. The MACS1-like poisson upper-tailed p-value for the entire MACS2 peak could then be calculated. 

If one did not want the p-value to be dependent on the peak's length (but still wanted a p-value for the entire peak), then instead of multiplying the control's maximum per bp average by the peak length to get the lambda, one could divide the treatment read count by the peak length to get the average number of treatment reads per bp in the peak. Then the poisson upper-tailed p-value can be calculated from X = treatment-per-bp-avg-inside-peak-region, lambda=max-per-bp-average(global, slocal, llocal).


Any thoughts on this?

Best,

   John





Reply all
Reply to author
Forward
0 new messages