Dear all,
I wonder how MACS adjusts for the difference in total read counts between
treatment and control. As far as I know, MACS works on count data, so does
its normalization randomly remove "excess" reads from the larger sample,
or does it multiply the read counts by a scaling factor (but then we don't
have count data any more)?
--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To post to this group, send email to macs-ann...@googlegroups.com.
To unsubscribe from this group, send email to macs-announcem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/macs-announcement?hl=en.
Hi Nora, I'm no statistician either, but starting with the same number of reads should be statistically sound, since you empirically eliminate any depth differences (though of course, each time you randomly draw a smaller set, you may obtain a different peak calling).
In my understanding, you eliminate the FDR bias, but you introduce a sub-sampling one.
Again, in my experience, good IPs do not suffer from sub-sampling, but poorly enriched IPs do. You will obtain less than 100% overlap when using two different sub-samples of, say, the control reads.
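To see why two sub-samples of the same library give different downstream results, here is a small illustration (my own toy example, not MACS code, with made-up read names and a scaled-down read count): two random draws of the same size from one read set share only a fraction of their reads.

```python
import random

# Toy stand-in for a read library (names are hypothetical).
reads = [f"read_{i}" for i in range(22_000)]

def subsample(reads, n, seed):
    """Draw n reads uniformly at random, reproducibly via the seed."""
    rng = random.Random(seed)
    return set(rng.sample(reads, n))

# Two independent sub-samples of the same depth, e.g. mimicking
# trimming a 22M-read input down to a 7M-read ChIP depth (scaled 1000x).
a = subsample(reads, 7_000, seed=1)
b = subsample(reads, 7_000, seed=2)

overlap = len(a & b) / 7_000
print(f"overlap between two sub-samples: {overlap:.1%}")
```

The overlap is well below 100%, so peaks driven by reads present in one draw but not the other can flip between runs, which is the sub-sampling bias mentioned above.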
In fact, I didn't know about the correction that Jake mentioned. I know that MACS warns against using unbalanced treatment and control, so in any case, MACS will only guarantee best results if you manually balance them.
@Jake:
OK, so that means the lambdas are actually estimated from a rate
and not from raw count data?
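As I understand it, treating lambda as a rate means the control expectation can simply be rescaled by the depth ratio before the Poisson test, rather than reads being removed. A minimal sketch of that idea (my own illustration with made-up numbers, not MACS source code):

```python
import math

# Hypothetical depths, matching the thread's example libraries.
treatment_depth = 7_000_000    # ChIP reads
control_depth = 22_000_000     # input reads

# Raw control count in some candidate window.
control_reads_in_window = 30

# Because lambda is a rate, rescale the control expectation
# down to the treatment depth instead of discarding reads.
scaled_lambda = control_reads_in_window * treatment_depth / control_depth

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complement of the lower tail."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

observed = 25  # treatment reads in the same window (made-up)
p = poisson_sf(observed, scaled_lambda)
print(f"scaled lambda = {scaled_lambda:.2f}, p-value = {p:.3g}")
```

The scaled lambda stays a real number rather than an integer count, which is exactly why the data are no longer "count data" after normalization, and why the Poisson model doesn't mind.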
Hi, Jake,
I have the same problem: the numbers of my input and ChIP reads are very different (22 million input vs. 7 million ChIP). Do you know how to do sub-sampling?
Zhu
On Wed, Feb 9, 2011 at 11:17 AM, Zhu Wang <zwan...@hotmail.com> wrote:
> Hi, Jake,
> I have the same problem: the numbers of my input and ChIP reads are very
> different (22 million input vs. 7 million ChIP). Do you know how to do
> sub-sampling?
> Zhu

I stripped out this code from one of my workflows. It only works on BED-formatted files in a UNIX environment (Mac, Linux, Cygwin, etc. -- it relies on the "sort" command). I haven't tested it extensively, but it should work fine. Let me know if you have issues with it.
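Jake's script isn't reproduced in the archive here. As a rough stand-in (my own sketch, not his code), a BED file can be sub-sampled with reservoir sampling in Python, which keeps memory use proportional to the sample size and avoids any dependence on GNU sort options:

```python
import random

def reservoir_subsample_bed(in_path, out_path, n, seed=0):
    """Uniformly sample n lines from a BED file via reservoir sampling.

    Memory use is O(n) regardless of the input file size, so a 22M-read
    file can be trimmed without loading it all at once.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(in_path) as fh:
        for i, line in enumerate(fh):
            if i < n:
                # Fill the reservoir with the first n lines.
                reservoir.append(line)
            else:
                # Replace a reservoir entry with probability n / (i + 1),
                # which keeps every line equally likely to be kept.
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = line
    with open(out_path, "w") as out:
        out.writelines(reservoir)
```

Usage would be something like `reservoir_subsample_bed("input.bed", "input.7M.bed", 7_000_000)`; note the output lines are in reservoir order, so re-sort afterwards if position-sorted BED is required.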
Hi, Jake and Nora,
Thanks for your response.
I deleted the "-S 2G" and the script works, although I did not find an option to limit the cache usage. I am using GNU sort on Mac OS X version 10.4.11. What environment are you running your script in?
I have another question: do you have any experience with how many rounds of sub-sampling and peak-calling are enough? Thanks.