'nomodel' OR very small 'mfold'


Yuan Hao

Dec 10, 2010, 9:29:37 AM
to macs-ann...@googlegroups.com
Dear members,

I used MACS for peak calling on my ChIP-seq data, which has 5M reads
in total but only 3M unique reads. My input control has 3.8M reads in
total, 3.7M of which are unique. The command used was as follows:

$macs14 --format=BOWTIE -t treated.map --name=treated --gsize=hs \
        --tsize=30 --bw=200 --shiftsize=100 --mfold=5,10

Testing different '--mfold' ranges generated the following statistics:

--mfold   d    #Peaks  #negativePeaks  FDR
[5,10]    116  1662    2368            mostly 100
[11,20]   43   1961    2444            mostly 100
[21,30]   40   1753    2200            mostly 100

From the estimated d values, it looks like I should use the '--mfold'
range [5,10], but as far as I know it is not recommended to use
--mfold < 10; in that case '--nomodel' is preferred instead. With
'--nomodel' specified, I got d=200, #peaks=1036, #negativePeaks=1701,
and most FDR values were 100. I am not sure which way would suit my
data better: '--nomodel' or a very small '--mfold'. Please excuse me
if this question is too naive. Also, apart from the estimated d value,
what other information could I read from the model.r PDF in the MACS
output? All my model.r PDFs look quite similar to each other in shape.
Thank you very much for any input!

Regards,
Yuan
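[Editor's note: for context on the FDR column in the table above, MACS
estimates an empirical FDR at each p-value cutoff by swapping treatment
and control and counting the "negative peaks" it then calls; when there
are more negative than positive peaks, the reported FDR saturates at
100, as in these runs. A minimal sketch of the idea, with a
hypothetical helper function (not MACS's actual code):

```python
def empirical_fdr(treat_pvals, neg_pvals, cutoff):
    """Empirical FDR (%) in the spirit of MACS: peaks called when
    control and treatment are swapped ('negative peaks') divided by
    peaks called in the normal direction, at the same p-value cutoff,
    capped at 100."""
    n_pos = sum(p <= cutoff for p in treat_pvals)  # treatment-vs-control peaks
    n_neg = sum(p <= cutoff for p in neg_pvals)    # control-vs-treatment peaks
    return min(100.0, 100.0 * n_neg / n_pos) if n_pos else 0.0
```

With 2368 negative peaks against 1662 positive ones, this ratio exceeds
1, which is why the reported FDR sits at the 100 ceiling.]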

Ivan Gregoretti

Dec 10, 2010, 10:04:43 AM
to macs-ann...@googlegroups.com
Hello Yuan,

The human genome is very large and usually requires 10 million tags to
draw conclusions.

Exceptionally, conclusions are interpretable with ~4 million tags when
the antibody is very specific __and__ the immunoprecipitation has
worked very well. Antibodies against different varieties of H3 are
good examples.

Three million unique tags out of 5 million is, generally, not good. It
tells you that you have over-amplified by PCR. This is a very common
scenario when the amount of tissue to be analysed is limited. PCR
amplification is a fact of life for the most common sequencing
technologies. It will change when single-molecule techniques become
popular.

Before interpreting MACS output under different conditions, I would
make sure I am not trying to interpret noise. In other words, you need
to evaluate the biological quality of the sample. As a first-steps
guide, I recommend you search this mailing list for a past conversation
under the subject "recommended params for histone modification?".

If your conclusion is that the sample is indeed good, then it makes
sense to discuss the variety of MACS models.

Happy ChIP-ing!

Ivan


Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878


Yuan Hao

Dec 10, 2010, 10:11:34 AM
to macs-ann...@googlegroups.com
Hey Ivan,

I like your explanation very much! Thanks!!

However, the experiment was for a transcription factor rather than a
histone modification. Actually, we had a problem with the flow cell
during sequencing, which left us with fewer reads.

Regards,
Yuan

James Hensman

Dec 10, 2010, 10:00:24 AM
to macs-ann...@googlegroups.com

Hi Yuan, 

Sorry for dragging this conversation off-topic, but do you have any idea why there should be so many duplicate reads in the treatment data when there are so few in the control?

I have a very similar data set: 29% duplicates in the treatment, <1% in the control. Yet MACS removes all but two duplicates per position, based on the genome size and a binomial distribution that assumes a uniform tag distribution. With so many more duplicate tags in the treatment than in the control, I'm concerned that the duplication is due to the enrichment being very heavy locally, rather than to a PCR artifact or similar. Could the duplicate reads therefore be (at least in part) 'good' reads that should be counted?

Does anybody have any insight?

James.

Yuan Hao

Dec 13, 2010, 5:43:24 AM
to macs-ann...@googlegroups.com
Hi James,

My data is a bit unusual: apart from the possible PCR artifacts that
might cause a large fraction of duplicates in the ChIPed sample, a
fault occurred in the central area of the flow cell during sequencing,
which greatly reduced the reads from the lanes located in the middle.
Unfortunately, we put all the ChIPed samples in the middle lanes of
the flow cell and all the input controls on the edge (i.e. lane 8).
Given that, I can't really tell the true cause. It would be possible
to even out the ChIPed and control reads by randomly sampling a
similar number of reads from the input control, but the total number
of reads is not adequate in any sense, and such sampling might
introduce some bias. So I might adopt another peak caller, such as
QuEST (which is said to prefer more reads in the control), and
cross-check the two sets of results. Hope this helps!

Cheers,
Yuan
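[Editor's note: the random-sampling idea mentioned above, evening out
treatment and control by keeping a uniform random subset of the
control reads, could be sketched with reservoir sampling, which picks
a fixed-size subset in one pass without loading the whole file. The
helper below is hypothetical, not from the thread:

```python
import random

def downsample_reads(lines, n_keep, seed=1):
    """Reservoir sampling: keep a uniform random subset of n_keep
    reads (e.g. lines of a tag file) in a single pass."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    reservoir = []
    for i, line in enumerate(lines):
        if i < n_keep:
            reservoir.append(line)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)        # replace with decreasing probability
            if j < n_keep:
                reservoir[j] = line
    return reservoir
```

If the input has fewer than n_keep reads, everything is kept, so the
same call works on both large and small files.]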

James Hensman

Dec 13, 2010, 8:32:27 AM
to macs-ann...@googlegroups.com
Hi Yuan, 

Thanks for the input, most helpful. 

I've since decided that the duplicate removal by MACS is sensible, and I'm sticking with it. I was concerned that duplicates could be caused by the enrichment, while MACS's method for determining the p-value threshold for duplicate removal assumes a uniform distribution of reads across the genome. A back-of-the-envelope calculation tells me that even with a large number (thousands) of enriched regions under strong enrichment (100-fold), the p-value cut-off for duplicates changes very little. This is due simply to the gargantuan size of the remaining genome.

Cheers, 

James.
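[Editor's note: the back-of-the-envelope calculation above can be
sketched as follows. This assumes a Poisson approximation to the
binomial per-position tag count and a 1e-5 significance cutoff, and
models the genome as a two-component mixture (bulk rate n/G, plus a
small enriched fraction at fold * n/G); the function names are
hypothetical, not MACS's actual code:

```python
import math

def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam), via the first k pmf terms."""
    term, cdf = math.exp(-lam), 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def dup_cutoff(n_reads, genome_size, pvalue=1e-5, enriched_bp=0, fold=1.0):
    """Smallest per-position tag count whose tail probability falls
    below `pvalue`, under the two-component mixture described above."""
    lam = n_reads / genome_size        # bulk rate: tags per position
    frac = enriched_bp / genome_size   # fraction of genome enriched
    k = 1
    while ((1 - frac) * poisson_tail(lam, k)
           + frac * poisson_tail(fold * lam, k)) > pvalue:
        k += 1
    return k
```

With 3M reads on a 2.7Gb genome, the cutoff is 2 tags per position;
adding ~2000 enriched regions of 1 kb at 100-fold enrichment leaves it
at 2, because the enriched fraction of the genome is so small, which
is exactly the point of the argument.]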
