P value cutoff for output file

538 views
Skip to first unread message

rpsilos

unread,
Sep 19, 2016, 11:43:25 AM9/19/16
to MACS announcement
Hi 
I have been using MACS2 for calling peaks at ChIPseq experiments.
In an effort to increase the number of called peaks, I used more relaxed -p value threshold (e.g. from 0.01 to 0.1) without changing any other parameter.

e.g. for calling peaks from FILE.bam

macs2 callpeak -t FILE.bam --keep-dup all -f BAMPE -g hs -n tmp1 --nomodel --extsize 150 -B -p 0.01

macs2 callpeak -t FILE.bam --keep-dup all -f BAMPE -g hs -n tmp2 --nomodel --extsize 150 -B -p 0.1

And then testing the concordance of the top 100 
bedtools jaccard \
-a <(sortBed -i <(sort -k8,8gr tmp1_peaks.narrowPeak|head -100)) \
-b <(sortBed -i <(sort -k8,8gr tmp2_peaks.narrowPeak|head -100))

intersection union-intersection jaccard n_intersections

8706523 13567330 0.641727 89


As you can see there is Jaccard's J = 0.64 indicating a partial overlap.


Should it be 100% overlap ?

Anna Battenhouse

unread,
Sep 19, 2016, 1:31:52 PM9/19/16
to macs-announcement
So this comes back to what the effect of a less stringent P-value threshold is in MACS2. One might think that the only difference was the number of peaks, but there is another significant difference -- peak width. Peaks called with a less stringent threshold are wider, sometimes much wider. Also, sometimes one wide peak region called with lower stringency can be reported as more than one peak with higher stringency.

Why is this? Without doing a deep dive into the MACS2 source code I think of it like this. MACS2 may first identify summits, then "reach out" from those maxima to define the peak region. With a less stringent threshold, the reach extends farther before the signal drops below threshold. Hence wider peak regions.

In any case, that is why your jaccard is not 100%. Even if MACS2 called the "same" peaks in both cases (e.g., around the same peak summit positions), the lower-stringency set will be wider, leading to a considerably larger union than intersection.

If you just want to see the correlation between the top 100 in the two sets, you might try instead extracting a fixed interval around the peak summit for each set. Or use bedtools merge and count the number of merged peaks versus input peaks.

One additional note from our experience with MACS2 thresholding. We found that using a P-value threshold less stringent that 0.01 (which appears roughly equivalent to a Q-value threshhold of 0.25) produced overly broad peaks such that the genomic coverage of called peaks did not seem reasonable. This was true for for marks ranging from broad (e.g. H3K27me3) to "sharp" (e.g. H3K4me3, CTCF).

Hope this helps,
- Anna

--
Anna Battenhouse | Associate Research Scientist | Iyer Lab | Institute for Cellular & Molecular Biology | Center for Systems and Synthetic Biology | The University of Texas at Austin | 2500 Speedway, MBB 3.106 | Austin, TX 78712-0159 | (512) 232-2632 (o) | (512) 587-4159 (m) | abatte...@utexas.edu

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcement+unsubscribe@googlegroups.com.
To post to this group, send email to macs-announcement@googlegroups.com.
Visit this group at https://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/d/optout.



Reply all
Reply to author
Forward
0 new messages