why is macs calling peaks on unpaired pile-ups of duplicate reads?

TRypdal

unread,

Jun 26, 2013, 11:30:22 AM6/26/13

to macs-ann...@googlegroups.com

Hi

I'm using MACS2 2.0.10.20130419. I''m experimenting with the --keep-dup=all option and I'm experiencing a problem. Basically, MACS2 will call a peak even when a pileup is composed exclusively of the following kind of reads:

1-reads reside on 1 strand only

2-pileup is composed fully of duplicate reads

The following image shows an example of such peaks:

Now, I know I could exclude completely this peak by setting --keep-dup=1, however I do want to keep duplicates at the moment, as I have reason to believe these might contribute to real signal in other peaks. In this one however, only duplicates are present, and moreover no pair signal is present on the opposite strand. Therefore I would tend to be quite confident that this is not a peak. Setting the qvalue to a more stringent value does not seem to help, because it will get rid of some of these bad peaks and some good ones.

What is the best way to get rid of these? I know that another peak caller, genetrack, tags each peak with a std_dev value, and will give this kind of peak a std_dev=0.0. Is there anything similar I can do for MACS2? Additionally, why is this even considered a peak, given that there only signal on the (+) strand? Apologies if I have missed something obvious.

Best,

T

Tao Liu

unread,

Jun 27, 2013, 12:04:00 PM6/27/13

to macs-ann...@googlegroups.com

Good question! One possibility is to keep-dup=1 to call peaks REGIONS, then use the signal from keep-dup=all to assign values for each peak. We don't have such implementation currently. Another possibility is to make use of + and - strand tags distribution as you pointed out -- or like SPP or GPS. In MACS2, you will find a subcommand named 'refinepeak' which implements the WTD method in SPP to filter out those peaks with unbalanced + and - strand tags. It will take the peak from MACS2 callpeak command and alignment file for ChIP, then evaluate each peak region for strand balance and refine peak summit as the best point with balanced + and - tags on each side.

Best,

Tao Liu

--
Assistant Professor
Department of Biochemistry
University at Buffalo
NY State Center of Excellence in Bioinformatics & Life Sciences

B2-163 COEBLS
(O) 716-829-2749
tl...@buffalo.edu

Mailing address:
University at Buffalo-COEBLS
701 Ellicott St, B2-163
Buffalo, NY 14203-1221

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/groups/opt_out.

TRypdal

unread,

Jul 1, 2013, 11:48:06 AM7/1/13

to macs-ann...@googlegroups.com

Hi Tao

thanks a lot for your answer. I'm now looking at the refinepeak MACS2 subprogram, however I'm not sure I'm using it correctly. It will output a bed file with the same number of peaks present in the original macs2 callpeak output - the difference is that the new peaks will be smaller (1bp with 40bp reads). As for the peaks mentioned in my earlier email (all duplicates + 1strand only) they are still all there in the output bed of refinepeak, however the peak will be shifted to the left of the actual pileup. The amount of the shift is proportional to the size of the -w (window) parameter. This should show it better:

The first track is the original bed, with a spurious peak. The following two tracks are the output of refinepeak, with w=50 and w=100 respectively. Could you help me to interpret this output? Additionally, what is the meaning of the 4th column in the bed output of refinepeak? Thanks once more!

Tao Liu

unread,

Jul 1, 2013, 1:23:35 PM7/1/13

to macs-ann...@googlegroups.com

Sorry, I forgot to mention how to interpret the results in my previous email. The 'peak' name in this example from refinepeak module should be marked as '_F' which means it fails to pass the criteria of balanced reads distribution around peak summit. The last column in refinepeak output shows you how 'balance' it is, similar to Peter K's SPP paper published on Nat Biotech v26 n12 2008, equation is at bottom right of page 1357 about 'wtd' value. Bigger the value, better the summit is. And if it's a negative value, the summit should be marked as failed.

Basically, refinepeak module takes 'summits' from MACS2 output file, then scan the surrounding '-w' basepair window for the best summit with a highest wtd value. In the case of a bad peak, it will give you unstable bad result.

Tao

John Urban

unread,

Sep 24, 2013, 9:18:40 AM9/24/13

to macs-ann...@googlegroups.com

Hi all,

Would refinepeak work on broadly distributed signals/peaks where one does not expect the canonical positive and negative strand peaks with reads facing each other (i.e. in a situation where you specify --no-model)?

Basically, I am in need of a function that checks whether the + and - strands have 'equivalent' amounts of reads over a specified region regardless of whether it is a point source ChIP-seq peak or other-- i.e. even if its only genomic DNA sequencing. In other words, a function/command that marks regions (that I provide) that are heavily weighted with reads on one strand (and ,if so, then also reporting which strand is heavily covered by reads would be useful for my analysis). Since refinepeak looks at the "balance", it sounded similar to what I am looking for. I guess the question is -- does it do what I described? Can it be made to do it with a certain set of parameters? Or does it only work with canonical ChIP-seq type peaks?

Best,

John

Tao Liu

unread,

Sep 25, 2013, 5:38:27 PM9/25/13

to macs-ann...@googlegroups.com

Hi John,

On Sep 24, 2013, at 9:18 AM, John Urban <john_...@brown.edu> wrote:

> Would refinepeak work on broadly distributed signals/peaks where one does not expect the canonical positive and negative strand peaks with reads facing each other (i.e. in a situation where you specify --no-model)?

No. Currently it's the same 'wtd' method as that in SPP and it's only for factors with sharp and narrow binding sites. This refinepeak calculates a score for each potential summit, measuring how balanced tags distributed around the summit. So it's not appropriate for broad marks where 'summits' are not expected.

> Basically, I am in need of a function that checks whether the + and - strands have 'equivalent' amounts of reads over a specified region regardless of whether it is a point source ChIP-seq peak or other-- i.e. even if its only genomic DNA sequencing. In other words, a function/command that marks regions (that I provide) that are heavily weighted with reads on one strand (and ,if so, then also reporting which strand is heavily covered by reads would be useful for my analysis). Since refinepeak looks at the "balance", it sounded similar to what I am looking for. I guess the question is -- does it do what I described? Can it be made to do it with a certain set of parameters? Or does it only work with canonical ChIP-seq type peaks?

A good suggestion. I will consider counting plus/minus strand tags as a filtering step while detecting broad domains. Not sure whether it helps.

Best,
Tao

Davide Cittaro

unread,

Sep 25, 2013, 11:48:09 PM9/25/13

to macs-ann...@googlegroups.com

Succede che io ho detto giovedì

d
--
Sent from my iPhone

John Urban

unread,

Sep 29, 2013, 11:29:10 AM9/29/13

to macs-ann...@googlegroups.com

Thanks! It could help - especially as a modular function similar to 'refinepeak'. I have data that enriches specific genomic regions like ChIP-seq, but is a little different -- probably more like ChIP for a mark that is randomly distributed over specific regions. Hence, there are no canonical peaks that face each other. It is from an enzymatic prep that enriches select regions of the genome that sometimes are point-source-like in size and other times broad -- i.e. peaks are anywhere from ~350 bp to 10kb or more. MACS2 does great in identifying the regions when specifying --no-model (so did MACS1).

The reason it could help is because there are undesirable biases in the prep that can lead to one strand being enriched over the other (detectable with strand-specific sequencing). Nonetheless, since those regions are enriched, they are called as peaks just the same as the desirable enrichments. However, true positives should have a balanced amount of reads on both strands. Thus false positive peaks could be eliminated just by making sure both strands have "equivalent" amounts of reads.

More generally, looking for strand coverage equivalences/biases like this could also flag peaks called from redundant read pileups on one strand if using the --keep-all option (as alluded to earlier in this thread) as well as erroneous (or desirable) strand-biased peaks that arise in settings other than mine.

Reply all

Reply to author

Forward