How exactly does MACS normalize by the total number of reads?


Nora

Feb 1, 2011, 12:41:32 PM
to MACS announcement
Dear all,

I wonder how MACS adjusts the total number of reads between treatment
and control. AFAIK, MACS works on count data, so does its
normalization randomly remove "excess" peaks in the sample with more
peaks, or does it multiply the number of reads in the sample with more
peaks by the difference factor (but then we don't have count data any
more)?

Nora

Feb 1, 2011, 12:42:59 PM
to MACS announcement
This was supposed to be "does its normalization randomly remove
"excess" READS in the sample with more peaks" -- sorry about the obvious typo!

Jacob Biesinger

Feb 2, 2011, 4:28:12 PM
to macs-ann...@googlegroups.com
Hi Nora,

On Tue, Feb 1, 2011 at 9:41 AM, Nora <macs...@googlemail.com> wrote:
Dear all,

I wonder how MACS adjusts the total number of reads between treatment
and control. AFAIK, MACS works on count data, so does its
normalization randomly remove "excess" READS in the sample with more
peaks, or does it multiply the number of reads in the sample with more
peaks by the difference factor (but then we don't have count data any
more)?

MACS does work on count data for the foreground, but the background is actually treated as a rate. When considering the different background regions (e.g., peak, 1k, 10k, genome-wide), the total number of tags will be higher for the larger regions, but once you take into account how wide the regions are, the tag RATE in the background will be similar.

The count difference between treatment and control is handled similarly -- the background rate is always scaled by the ratio of unique tag counts between treatment and control.
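In rough Python terms, the idea is something like the toy sketch below (purely illustrative, not the MACS source; the function and variable names are made up):

# Toy sketch of the idea above -- not the MACS source code.
# Each background region (peak width, 1k, 10k, whole genome) is turned into a
# tag RATE per scan window; the rate is rescaled by the ratio of unique tag
# totals, and the most conservative (largest) rate is used as the local lambda.

def scaled_local_background(region_counts, region_widths, window, n_treat, n_control):
    depth_ratio = float(n_treat) / n_control   # e.g. 1.5 if treatment has 1.5x more unique tags
    rates = [float(count) / width * window * depth_ratio
             for count, width in zip(region_counts, region_widths)]
    return max(rates)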

Hope that helps!

--
Jake Biesinger
Graduate Student
Xie Lab, UC Irvine
(949) 231-7587

Batool Akhtar-Zaidi

Feb 2, 2011, 5:21:58 PM
to macs-ann...@googlegroups.com
Hi Jake,

Thanks for that informative answer.

I've got another question for you along those lines:

Is there any reason why, in spite of the unique treatment and control tags being equal in number, MACS might report an FDR of 100% for all peaks when the number of peaks called is very low (e.g., 500)?
We've noticed this with some of our data sets. Interestingly, even with 40 million unique tags, the --diag option revealed that saturation was not achieved for the ~500 peaks called by MACS. This makes me wonder whether some aspect of the way MACS calculates FDRs makes the background rate scaling inaccurate when the number of called peaks is low. Or am I completely off here?

Thanks in advance for your input.

Best regards,

Batool



Noboru Jo Sakabe

Feb 2, 2011, 5:35:53 PM
to macs-ann...@googlegroups.com
Hi Nora, there were previous discussions on this list about whether
MACS should be given treatment and control samples with similar numbers
of reads. In my experience, and some other users agree, it's better to
use similar numbers of reads. So you would have to "manually" remove
excess reads from one of the samples.
The method explained by Jake should mean that MACS corrects for the
difference, but for some reason this may not always work as expected.
You can perform your own experiment by running MACS under both
conditions, then use other peak calling programs and see which condition
yields results more similar to the other peak callers.

Daniel Aaron Newkirk

Feb 2, 2011, 5:42:01 PM
to macs-ann...@googlegroups.com
The issue is a combination of the number of peaks and the nature of your control sample. As the number of peaks in the enriched sample decreases, it becomes more likely that the background will have a similar number of peaks. Of course, it isn't quite as simple as that, since you also have to consider the number of tags per peak, etc. What it means is that, for that particular experiment, the preimmune (or whatever control you use) has peaks with significant pileup in regions where the experimental sample does not. When the control sample is particularly noisy, it can have peaks with large tag counts (and therefore low p-values). When the FDR is calculated, the p-values for these control peaks, when binned, are at or above those of the experimental sample. This is also why the FDR and p-values aren't monotonically correlated.
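As a rough illustration of how that plays out (a simplified sketch of the sample-swap FDR idea, not the actual MACS code), you can think of the reported FDR as a ratio of control peaks to treatment peaks at each p-value cutoff:

# Simplified sketch of a sample-swap ("empirical") FDR -- not the MACS source.
# treatment_pvals: p-values of peaks called with treatment over control;
# control_pvals:   p-values of peaks called with the two samples swapped.

def empirical_fdr(treatment_pvals, control_pvals):
    fdrs = []
    for p in sorted(treatment_pvals):
        n_treat = sum(1 for q in treatment_pvals if q <= p)  # real peaks at this cutoff
        n_ctrl = sum(1 for q in control_pvals if q <= p)     # "false" peaks from the swap
        fdrs.append(min(1.0, float(n_ctrl) / n_treat))
    return fdrs

With only ~500 treatment peaks, a noisy control that produces a few hundred strong swapped peaks is enough to push these ratios to (or near) 100%.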
--
Daniel Newkirk
Graduate Student
Yokomori Lab
Dept. of Biological Chemistry, UCI
(949) 824-2158

Nora

Feb 3, 2011, 6:30:54 AM
to MACS announcement
Thanks to all of you for your useful input!

@Jake:
Ok, so that means that the Lambdas are actually estimated from a rate
and not from count data?

@Noboru:
Yes, I have read the discussions about this, but I couldn't find any
detailed explanation of how MACS normalizes by total read number.
I'm not a statistician, so would removing the "excess" reads to get
the same number of reads in sample and control be statistically sound?

Noboru Jo Sakabe

Feb 3, 2011, 9:57:52 AM
to macs-ann...@googlegroups.com
Hi Nora, I'm no statistician either, but starting with the same
number of reads should be statistically sound, since you empirically
eliminate any differences (though of course, each time you randomly
draw a smaller set, you may obtain different peak calls).
In my understanding, you eliminate the FDR bias, but you introduce
a sub-sampling one.
Again, in my experience, good IPs do not suffer from sub-sampling,
but poorly enriched IPs do: you will obtain <100% overlap when using
two different sub-samples of, say, the control reads.

In fact, I didn't know about the correction that Jake mentioned. I
know that MACS warns against using unbalanced treatment and control, so
in any case, MACS will only guarantee best results if you manually
balance them.

Jacob Biesinger

Feb 9, 2011, 2:15:06 AM
to macs-ann...@googlegroups.com
On Thu, Feb 3, 2011 at 6:57 AM, Noboru Jo Sakabe <nobo...@gmail.com> wrote:
   Hi Nora, I'm no statistician either, but starting with the same number of reads should be statistically sound, since you empirically eliminate any differences (but of course, each time you randomly obtain a smaller set, you may obtain a different peak calling).
   In my understanding, you eliminate the FDR bias, but you introduce a sub-sampling one.
   Again, in my experience, good IPs do not suffer from sub-sampling. But poorly enriched IPs do. You will obtain overlap <100% when using 2 different sub-samples of say, control reads.

I've heard of this approach a lot, and have done it myself. There are two potential issues: First, MACS removes most redundant tags (tags on the same start and strand).  It seems to me that you'd like subsampling to take place *after* the redundant tags have been removed from both datasets.

The second issue is that subsampling could remove some of the signal used in building the shifting model. Ideally, you'd want to keep the tags around until after the shift takes place, then do subsampling.

I've considered adding this as an option inside of MACS, and pushing the code to Tao for the next update.  Is this something people would be interested in?  You'd still want to run MACS several times to see what peaks are consistent across subsampling runs, but the change in the code would be pretty straightforward. 
 
   In fact, I didn't know about the correction that Jake mentioned. I know that MACS warns against using unbalanced treatment and control, so in any case, MACS will only guarantee best results if you manually balance them.

Agreed on this point! 


@Jake:
Ok, so that means that the Lambdas are actually estimated from a rate
and not from count data?

Rather, the lambdas are from counts, but they are always normalized to a certain width. For example:

In the 1,000 bases around a peak, I see 10 reads in the background, so the rate would be something like 10 / 1000 * 200 = 2.0,
and in the 10,000 bases around a peak, I see 80 reads in the background, so the rate would be 80 / 10000 * 200 = 1.6.

To account for the difference in total reads, the rate is scaled up or down. For example, if there are 1.5x more reads in the treatment than in the control, these rates are scaled up: 2.0 x 1.5 = 3.0 and 1.6 x 1.5 = 2.4.
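Restating that worked example as a few lines of Python (purely illustrative; 200 is just the width the rates above are normalized to):

window = 200                       # width each rate is normalized to
rate_1k = 10 / 1000.0 * window     # 2.0 expected background reads per window
rate_10k = 80 / 10000.0 * window   # 1.6
depth_ratio = 1.5                  # treatment has 1.5x more unique tags than control
print(rate_1k * depth_ratio, rate_10k * depth_ratio)   # 3.0 2.4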

Nora

Feb 9, 2011, 4:16:51 AM
to MACS announcement
Thanks Jake, it makes a lot more sense to me now!
I would definitely be interested in a subsampling option for MACS - at
least in the data I'm working on, I frequently have large differences
in read counts!

Zhu Wang

Feb 9, 2011, 2:17:29 PM
to macs-ann...@googlegroups.com
Hi Jake,
I have the same problem: the numbers of my input and ChIP reads are very different (I have 22 million input reads and 7 million ChIP reads). Do you know how to do the sub-sampling?
Zhu



Jacob Biesinger

Feb 9, 2011, 5:46:04 PM
to macs-ann...@googlegroups.com
On Wed, Feb 9, 2011 at 11:17 AM, Zhu Wang <zwan...@hotmail.com> wrote:
Hi Jake,
I have the same problem: the numbers of my input and ChIP reads are very different (I have 22 million input reads and 7 million ChIP reads). Do you know how to do the sub-sampling?
Zhu

I stripped out this code from one of my workflows. It only works on BED-formatted files in a UNIX environment (Mac, Linux, Cygwin, etc-- it relies on the "sort" command). I haven't tested extensively, but it should work fine. Let me know if you have issues with it.

Usage: unique_and_subsample.py treatment.bed control.bed

Remove *all* redundant tags from two BED files, then downsample the
larger file so that the total counts in both are equal.

Generates '.unique' files for both datasets and possibly one '.unique.downsampled'.

Note that this is different from the MACS behavior in that only ONE tag is
kept from redundant tags, while MACS may keep 1, 2, or 3 depending on the
dataset sizes.

Options:
  -h, --help  show this help message and exit
unique_and_subsample.py
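For anyone who can't open the attachment, the core of such a dedup-and-downsample step might look roughly like the sketch below (this is NOT the attached script; it assumes 6-column BED input and keys uniqueness on chromosome, start, and strand):

#!/usr/bin/env python
"""Minimal sketch of a dedup-and-downsample step (NOT the attached script).
Keeps ONE tag per (chrom, start, strand) in each 6-column BED file, then
randomly downsamples the larger dataset so both have the same number of tags."""
import random
import sys

def read_unique(path):
    seen, tags = set(), []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            key = (fields[0], fields[1], fields[5])   # chrom, start, strand
            if key not in seen:
                seen.add(key)
                tags.append(line)
    return tags

def write_tags(path, tags):
    with open(path, 'w') as handle:
        handle.writelines(tags)

if __name__ == '__main__':
    treat_file, control_file = sys.argv[1], sys.argv[2]
    treat, control = read_unique(treat_file), read_unique(control_file)
    target = min(len(treat), len(control))
    write_tags(treat_file + '.unique',
               treat if len(treat) == target else random.sample(treat, target))
    write_tags(control_file + '.unique',
               control if len(control) == target else random.sample(control, target))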

Jacob Biesinger

Feb 9, 2011, 5:47:27 PM
to macs-ann...@googlegroups.com
On Wed, Feb 9, 2011 at 2:46 PM, Jacob Biesinger <jake.bi...@gmail.com> wrote:
On Wed, Feb 9, 2011 at 11:17 AM, Zhu Wang <zwan...@hotmail.com> wrote:
Hi Jake,
I have the same problem: the numbers of my input and ChIP reads are very different (I have 22 million input reads and 7 million ChIP reads). Do you know how to do the sub-sampling?
Zhu

I stripped out this code from one of my workflows. It only works on BED-formatted files in a UNIX environment (Mac, Linux, Cygwin, etc-- it relies on the "sort" command). I haven't tested extensively, but it should work fine. Let me know if you have issues with it.

I guess I should mention that you'll probably want to do this several times and see how the called peaks overlap.

Zhu Wang

Feb 9, 2011, 9:54:24 PM
to macs-ann...@googlegroups.com
Thanks Jake. I will try it.
Zhu



Zhu Wang

Feb 10, 2011, 1:35:20 PM
to macs-ann...@googlegroups.com
Hi Jake,
I tried your script (unique_and_subsample.py) but I got an error message (attached). I am using Python 2.7.
I have another question. I submitted my unbalanced data to MACS and it returned ~5000 peaks. However, all of them have a pretty high FDR and very low fold enrichment. For example, for a positive control peak that we have verified by ChIP-qPCR, I got a 24% FDR and only 0.4-fold enrichment. My understanding is that this is because the normalization algorithm of MACS is biased towards the sample with more peaks (in my case the input control), and therefore there is more chance of calling peaks from the input sample. The bias also "increases" the local lambda and therefore decreases the fold_enrichment. Am I right?
If I don't want to introduce a sub-sampling error and just use the results from the unbalanced input data, can I use the fold_enrichment of my positive control to filter the whole peak list? I assume the fold_enrichment can be compared among different peaks, right?
Thank you very much.
Zhu



Error message.doc

Nora

Feb 11, 2011, 4:48:48 AM
to MACS announcement
Zhu, what environment are you using?
-S is a regular option of the UNIX sort command. What does
sort --help
give you if you type it into your shell?

Jacob Biesinger

Feb 11, 2011, 1:49:09 PM
to macs-ann...@googlegroups.com
The -S option is just there to make sure that sort doesn't eat too much of your memory. You can remove that option from the source, with the caveat that if you're sorting huge files, you may run out of memory (and the sort will take forever, if not fail completely). Alternatively, do a 'man sort' to see what the correct option is on your machine.

Zhu Wang

Feb 11, 2011, 5:25:50 PM
to macs-ann...@googlegroups.com
Hi Jake and Nora,
Thanks for your responses.
I deleted the "-S 2G" and the script works, although I did not find an option to limit the memory usage. I am using GNU sort on Mac OS X version 10.4.11. What environment are you running your samples in?
I have another question: in your experience, how many rounds of sub-sampling and peak-calling are enough? Thanks.
Zhu




Jacob Biesinger

Feb 14, 2011, 3:18:34 PM
to macs-ann...@googlegroups.com
On Fri, Feb 11, 2011 at 2:25 PM, Zhu Wang <zwan...@hotmail.com> wrote:
Hi Jake and Nora,
Thanks for your responses.
I deleted the "-S 2G" and the script works, although I did not find an option to limit the memory usage. I am using GNU sort on Mac OS X version 10.4.11. What environment are you running your samples in?
Ubuntu 10.10
 
I have another question: in your experience, how many rounds of sub-sampling and peak-calling are enough? Thanks.

What we're doing is essentially jackknifing the data. One dataset I ran this on had only ~1600 total peaks; 1200 were consistently called across 10 subsamples, while ~400 were highly variable (many of these were called in one subsample but did not overlap with *any* of the others).
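If anyone wants to run the same kind of consistency check, a rough way to count how often each peak recurs across subsampling runs is sketched below (each run here is just a list of (chrom, start, end) tuples parsed from a MACS peak BED file; an illustration, not an optimized implementation):

# Rough sketch of a peak-consistency check across subsampling runs.
def overlaps(a, b):
    """True if two (chrom, start, end) intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def recurrence(reference_run, other_runs):
    """For each peak in the reference run, count how many other runs
    contain at least one overlapping peak."""
    return [sum(1 for run in other_runs if any(overlaps(peak, p) for p in run))
            for peak in reference_run]

# Example: keep peaks that recur in at least 8 of 10 subsampling runs.
# counts = recurrence(runs[0], runs[1:])
# consistent = [peak for peak, n in zip(runs[0], counts) if n >= 8]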
 