macs2 gives the wrong number of reads in treatment

w...@sanger.ac.uk

Oct 21, 2013, 3:54:56 PM
to macs-ann...@googlegroups.com
Dear MACS2 users,

I am using MACS2 for peak calling. I have around 10 million reads in treatment and 6 million reads in control. However, when I ran macs2, it printed the following output, which really confused me:

INFO  @ Mon, 21 Oct 2013 16:33:05: #1 read tag files...
INFO  @ Mon, 21 Oct 2013 16:33:05: #1 read treatment tags...
INFO  @ Mon, 21 Oct 2013 16:33:13:  1000000
INFO  @ Mon, 21 Oct 2013 16:33:21:  2000000
INFO  @ Mon, 21 Oct 2013 16:33:29:  3000000
INFO  @ Mon, 21 Oct 2013 16:33:37:  4000000
INFO  @ Mon, 21 Oct 2013 16:33:44:  5000000
INFO  @ Mon, 21 Oct 2013 16:33:51:  6000000
INFO  @ Mon, 21 Oct 2013 16:33:58:  7000000
INFO  @ Mon, 21 Oct 2013 16:34:05:  8000000
INFO  @ Mon, 21 Oct 2013 16:34:12:  9000000
INFO  @ Mon, 21 Oct 2013 16:34:19:  10000000
INFO  @ Mon, 21 Oct 2013 16:34:26:  11000000
INFO  @ Mon, 21 Oct 2013 16:34:26: #1.2 read input tags...
INFO  @ Mon, 21 Oct 2013 16:34:33:  1000000
INFO  @ Mon, 21 Oct 2013 16:34:40:  2000000
INFO  @ Mon, 21 Oct 2013 16:34:47:  3000000
INFO  @ Mon, 21 Oct 2013 16:34:54:  4000000
INFO  @ Mon, 21 Oct 2013 16:35:01:  5000000
INFO  @ Mon, 21 Oct 2013 16:35:08:  6000000
INFO  @ Mon, 21 Oct 2013 16:35:13: #1 tag size is determined as 50 bps
INFO  @ Mon, 21 Oct 2013 16:35:13: #1 tag size = 50
INFO  @ Mon, 21 Oct 2013 16:35:13: #1  total tags in treatment: 472739
INFO  @ Mon, 21 Oct 2013 16:35:13: #1 user defined the maximum tags...
INFO  @ Mon, 21 Oct 2013 16:35:13: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s)
INFO  @ Mon, 21 Oct 2013 16:35:13: #1  tags after filtering in treatment: 334465
INFO  @ Mon, 21 Oct 2013 16:35:13: #1  Redundant rate of treatment: 0.29
INFO  @ Mon, 21 Oct 2013 16:35:13: #1  total tags in control: 6339252
INFO  @ Mon, 21 Oct 2013 16:35:13: #1 user defined the maximum tags...
INFO  @ Mon, 21 Oct 2013 16:35:13: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s)
INFO  @ Mon, 21 Oct 2013 16:35:15: #1  tags after filtering in control: 6206920
INFO  @ Mon, 21 Oct 2013 16:35:15: #1  Redundant rate of control: 0.02
INFO  @ Mon, 21 Oct 2013 16:35:15: #1 finished!

According to the output above, macs2 found only 472739 tags in treatment, but I actually have 10 million reads in treatment. The number of tags in control is correct. Can anyone explain why macs2 gave the wrong number of reads in treatment? Do 'reads' and 'tags' mean the same thing? Thank you!

Wei

Tao Liu

Oct 21, 2013, 4:01:32 PM
to macs-ann...@googlegroups.com
47 million is the number of 'valid' reads -- correctly mapped or paired reads. Perhaps the mapping rate of your treatment data is about 40%? You can check it with the 'samtools flagstat' command.

Tao Liu

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/groups/opt_out.

abk

Oct 21, 2013, 4:02:41 PM
to macs-ann...@googlegroups.com
MACS removes duplicate reads by default. Your ChIP dataset probably has a very high duplicate rate (lots of PCR/bottleneck duplicates), which is why you are losing most of your reads. You should try to obtain a more complex library before sequencing, as you have sequenced way beyond your library complexity. Also run Picard's MarkDuplicates tool and see how many duplicates you have.

-Anshul.

abk

Oct 21, 2013, 4:04:21 PM
to macs-ann...@googlegroups.com
Actually I take that back. Tao is right. I misread the statistics reported.

-Anshul.

w...@sanger.ac.uk

Oct 21, 2013, 4:14:36 PM
to macs-ann...@googlegroups.com
Hi Tao,

You are right. I used samtools to check the mapping rate, which is only 35% in my treatment. Could you please give me some advice on my data if I would like to use macs2 to call peaks? Should I use --downsample to 'shrink' the number of reads in control?

Cheers,
Wei

w...@sanger.ac.uk

Oct 21, 2013, 4:29:27 PM
to macs-ann...@googlegroups.com
Hi Tao,

samtools shows that 3.8 million reads are mapped (mapping rate ~35.18%), but macs2 only found 0.47 million valid reads. Can you explain why macs2 only found 0.47 million valid reads out of 11 million to use? Thank you!

Regards,
Wei
 

Tao Liu

Oct 21, 2013, 4:34:41 PM
to macs-ann...@googlegroups.com
Hi Wei,

Could you post the output from samtools and your command line for macs2?

Sorry, I thought MACS2 had read 47 million reads, but actually it's 470K reads.

Best,
Tao Liu

w...@sanger.ac.uk

Oct 21, 2013, 4:40:09 PM
to macs-ann...@googlegroups.com
The output of samtools:
11051448 + 0 in total (QC-passed reads + QC-failed reads)
3415200 + 0 duplicates
3887939 + 0 mapped (35.18%:nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (nan%:nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (nan%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The command I am using for peak calling:
macs2 callpeak -t sample1.bam -c sample4.bam -f BAM -g hs -n sample1 --call-summits --bdg

Thanks,
Wei

Tao Liu

Oct 21, 2013, 4:54:10 PM
to macs-ann...@googlegroups.com
Hi Wei,

This is the simple math:

3887939 (mapped) - 3415200 (duplicates) = 472739 (usable)

So most of the mapped reads are marked as 'PCR or optical duplicate' (flag 0x400) in your BAM file. Perhaps you used other software to mark duplicates?

Best,
Tao
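
Tao's arithmetic can be checked directly against the flagstat report posted above. A minimal Python sketch (the parsing assumes the flagstat line format shown in this thread):

```python
# Reproduce Tao's calculation (mapped - duplicates = usable) from the
# relevant lines of the `samtools flagstat` report posted above.
flagstat = """\
11051448 + 0 in total (QC-passed reads + QC-failed reads)
3415200 + 0 duplicates
3887939 + 0 mapped (35.18%:nan%)
"""

counts = {}
for line in flagstat.splitlines():
    n = int(line.split(" + ")[0])  # flagstat counts are "<passed> + <failed>"
    if "duplicates" in line:
        counts["duplicates"] = n
    elif "mapped (" in line:
        counts["mapped"] = n

usable = counts["mapped"] - counts["duplicates"]
print(usable)  # 472739 -- matches MACS2's "total tags in treatment"
```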

w...@sanger.ac.uk

Oct 21, 2013, 5:03:00 PM
to macs-ann...@googlegroups.com
Hi Tao,

Can I still use macs2 to call peaks, since I have so few valid reads in treatment compared to the reads in control? What parameters should I use for peak calling?

Thanks,
Wei

Tao Liu

Oct 21, 2013, 5:22:59 PM
to macs-ann...@googlegroups.com
If your genome is small, perhaps 470K valid reads are still useful. By default, MACS2 will down-scale the larger sample -- in your case, the control. If you have more control, you would expect to see more specific peak calls, since more 'chromatin bias' has been captured. Just do a normal MACS2 run to see what peaks you get. Load them together with the bedGraph (converted to bigWig) into a genome browser (e.g., IGV). Lower the cutoff where necessary.

I suggest you do some quality control analyses on your data: if you have replicates, try the IDR method (https://sites.google.com/site/anshulkundaje/projects/idr); check whether peaks are more conserved than surrounding regions; and check other characteristics of your data, such as signal-to-noise ratio, as mentioned in <http://www.ncbi.nlm.nih.gov/pubmed/22955991>.

Best,
Tao
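
The down-scaling Tao describes is linear: the larger sample's tag counts are multiplied by the ratio of the usable tag totals. A rough illustration with the totals from this thread (the 130-tag window count is a made-up example value):

```python
# Linear down-scaling of the larger sample (here, the control) to the
# smaller one, as in MACS2's default behavior.  Illustration only.
treatment_total = 472_739   # usable treatment tags (from the log above)
control_total = 6_206_920   # control tags after filtering

ratio = treatment_total / control_total

# A hypothetical local pileup of 130 control tags in some window only
# contributes ratio * 130 tags' worth of background after scaling.
window_control_tags = 130
scaled = ratio * window_control_tags
print(round(ratio, 4), round(scaled, 2))  # 0.0762 9.9
```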



w...@sanger.ac.uk

Oct 22, 2013, 5:14:51 AM
to macs-ann...@googlegroups.com
Hi Tao,

My genome is human. Thank you for your suggestions; I will check the quality of my data before I run macs2.

Regards,
Wei

w...@sanger.ac.uk

Oct 25, 2013, 7:09:35 AM
to macs-ann...@googlegroups.com
Hi Tao,

From the samtools flagstat report, there are 3.8 million reads mapped to the genome, which is human. However, macs2 only used 470K reads for the analysis. How can I make macs2 use these 3.8 million reads to call peaks instead of 470K? I have tried '--keep-dup=all' but it does not seem to work. Thank you!

Regards,
Wei

Tao Liu

Oct 25, 2013, 1:32:23 PM
to macs-announcement@googlegroups.com
Hi Wei,

But samtools also told you that 3.4 million reads are duplicates. Do you have any idea how you or others marked those 'duplicate' reads?

If you still want to use them, first convert your BAM file to BED format using the "bedtools bamtobed" function, then feed the BED files to MACS2 with the '--keep-dup=all' option.

Best,
Tao Liu

--
Assistant Professor
Department of Biochemistry
University at Buffalo
NY State Center of Excellence in Bioinformatics & Life Sciences

B2-163 COEBLS
(O) 716-829-2749
tl...@buffalo.edu
http://biomisc.org/

Mailing address:
University at Buffalo-COEBLS
701 Ellicott St, B2-163
Buffalo, NY 14203-1221

Nate Hoverter

Mar 31, 2014, 7:09:03 PM
to macs-ann...@googlegroups.com

Hi Tao,

I have a similar problem that is baffling me. I am getting different values from macs2's "tags after filtering in control" and samtools flagstat. For my input.bam file, here is the result of samtools flagstat:

66481032 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
55127587 + 0 mapped (82.92%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

From the MACS2 output I am getting:

# total tags in control: 55127587
# tags after filtering in control: 26027415
# maximum duplicate tags at the same position in control = 1
# Redundant rate in control: 0.53

What is the redundant rate referring to? It can't be PCR duplicates, because samtools flagstat says I have 0 duplicates... or am I misreading this?

As a result of MACS2 filtering my input file, my ChIP files have ~53 million tags after filtering and my input files have ~26 million tags after filtering. Is this problematic, given that input is supposed to be sequenced at greater depth? We are doing MNase-treated samples, and I have read that input is particularly important for these samples, so I'm worried that we don't have enough input (even though MACS2 scaled the larger sample down to the smaller one).






Ian Donaldson

Apr 1, 2014, 4:18:38 AM
to macs-announcement
MACS removes redundant reads (those sharing the same 5' coordinate and strand), so two reads may occupy the same position but be on different strands and both be kept. Sometimes, with smaller genomes (e.g. fly, yeast), two or more reads can legitimately have the same 5' start coordinate.

Ian
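
Ian's point can be made concrete with a small sketch: a tag counts as redundant only when another tag shares both its 5' coordinate and its strand. This is an illustration of the idea, not MACS's actual code:

```python
# Sketch of MACS-style redundant-tag filtering: a tag is redundant only
# if another tag shares BOTH its 5' coordinate and its strand.
def filter_redundant(tags, max_dup=1):
    """tags: iterable of (chrom, five_prime_pos, strand) tuples.
    Keep at most max_dup tags per (chrom, pos, strand) key."""
    seen = {}
    kept = []
    for tag in tags:
        seen[tag] = seen.get(tag, 0) + 1
        if seen[tag] <= max_dup:
            kept.append(tag)
    return kept

tags = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),   # same position, same strand -> dropped
    ("chr1", 100, "-"),   # same position, opposite strand -> kept
]
print(len(filter_redundant(tags)))  # 2
```

With `--keep-dup` raised (here, `max_dup=2`), all three tags would survive.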



Nate Hoverter

Apr 1, 2014, 12:03:16 PM
to macs-ann...@googlegroups.com
Hi Ian,

Thank you for your response. I used human cells, so that shouldn't be a problem. It might have something to do with using MNase to break up the chromatin. I was going to use the --keep-dup option, but I don't think it will solve the problem, because it only affects duplicates at the same coordinate and on the same strand (PCR duplicates).

Nate

Timothy Parnell

Apr 1, 2014, 12:28:29 PM
to macs-ann...@googlegroups.com
Hi Nate,

MNase has a restricted cut pattern; it only cuts at [AT][AT] dinucleotides. On top of that, digestion only occurs in linker regions. Therefore, it is not surprising to have a very high duplication rate. Removing the duplicates is throwing away the data on the premise that these are PCR duplicates when, in fact, they are not. Unfortunately, it is not possible to discriminate between real PCR duplicates and identical cut positions.

I think setting --keep-dup=1 is wrong for MNase. In my MNase ChIP-seq experiments, I typically keep all the duplicates, although that can result in false positives at some loci. You may want to experiment with different values for the --keep-dup option. You can try manually setting a maximum number of alignments, maybe 20 or 30 or more (check in a genome browser for a reasonable value). Recent versions of MACS2 can also calculate an appropriate duplicate maximum using a binomial distribution.

Good luck.
Tim
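
The binomial-model cutoff Tim mentions can be sketched roughly as follows: find the smallest duplicate count whose tail probability at a single genomic position falls below some threshold. This is an illustration of the statistics only, not MACS2's exact implementation; the tag total, effective genome size, and p-value threshold below are assumed example numbers, and a Poisson approximation stands in for the binomial:

```python
import math

def keep_dup_cutoff(total_tags, genome_size, pvalue=1e-5):
    """Rough sketch of a binomial-style duplicate cutoff: return the
    smallest duplicate count k whose tail probability P(X >= k) at a
    single genomic position drops below pvalue.  Uses a Poisson
    approximation to Binomial(total_tags, 1/genome_size).
    Illustration only -- not MACS2's actual code."""
    lam = total_tags / genome_size  # expected tags per position
    cum = 0.0
    k = 0
    while True:
        cum += math.exp(-lam) * lam**k / math.factorial(k)
        k += 1
        if 1.0 - cum < pvalue:  # P(X >= k) below threshold
            return k

# e.g. ~55 million tags on a ~2.7 Gbp effective human genome
print(keep_dup_cutoff(55_000_000, 2_700_000_000))  # 3 for these numbers
```

Deeper sequencing raises the allowed maximum, which matches the intuition that identical positions become more expected by chance at higher depth.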

Nate Hoverter

Apr 1, 2014, 11:26:10 PM
to macs-ann...@googlegroups.com, timothy...@hci.utah.edu
Thanks, Tim, for the info on keeping duplicates (--keep-dup 'all'); I'll try that. This might not be the appropriate place to bring this up, but have you tried other peak-calling software for MNase ChIP-seq experiments, or have you found that MACS works best for you? I'm wondering why MACS2 gives me the best results out of the three I've tried on the same dataset: MACS2, SPP, and QuEST.


