Interpreting models in MACS

1,257 views
Skip to first unread message

Matt Blades

unread,
Oct 20, 2011, 11:09:41 AM10/20/11
to MACS announcement

Hi,

I am analysing my first set of ChIP-Seq data and have run MACS with a
control and treated data set using a range of mfold and bw parameters
and generated the model R script so that the quality of the models can
be assessed. I am ok with the data contained in the *_peaks.xls file,
i.e. p-values, FDR, enchrichment ratio in terms of identifying good
peaks, but what I'm not sure of is how to interpret the .pdf graphs
output by the R-script, what is a good d value and how can i use those
graphs to pick MACS parameters that produce the best model?

Any help or direction to further sources of information would be
really appreciated

Also, any tips for analysing histone data would also be great

Thanks

Matt

Tao Liu

unread,
Oct 20, 2011, 5:05:54 PM10/20/11
to macs-ann...@googlegroups.com
Hi Matt,

If the d is not small ~ < 2*tag size (for those tag size < 50bp), and the model image in PDF shows clean bimodal shape, d may be good. And several bp differences on d shouldn't affect the peak detection on general transcription factor ChIP-seq much.

However, for Pol2 or histone marks, things may be different. Pol2 is moving so it's not appropriate to say there is a fixed fragment size. I don't know the correct answer. For histone mark ChIP-seq, since they would have a underlying characteristic 147bp resolution for a nucleosome size, you can simply skip model building and use "--shiftsize 74 --nomodel" instead. Also if you want, you can try other software like SICER and NPS.

Best,
Tao

> --
> You received this message because you are subscribed to the Google Groups "MACS announcement" group.
> To post to this group, send email to macs-ann...@googlegroups.com.
> To unsubscribe from this group, send email to macs-announcem...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/macs-announcement?hl=en.
>

Tao Liu

Research Fellow
Dept of Biostats and Comp Bio, DFCI / HSPH
450 Brookline Ave., Boston, MA 02215

Batool Akhtar-Zaidi

unread,
Oct 20, 2011, 5:21:35 PM10/20/11
to macs-ann...@googlegroups.com
Hi Matt,

While I can't speak to much of the first paragraph of your message, I wanted to let you know that when it comes to histone modification analysis, using the --call-supeaks option in the command line has proven to be a phenomenally successful way to use MACS to handle modifications (right now, I'm working on H3K4me1) which have both sharp defined peaks but also very broad "peaks" which are more like domains of H3K4me1 signals clustered together. This obviously created many situations, in part because MACS will automatically combine peaks which are separated by a distance of 10bp or less into one called peak, where artefactually broad regions were called (like 30kb peaks... yeah right!).

So many people have argued, including in a recent review in Nat Immunology about ChIP-seq and peak-calling, that modifications with broad distribution (even H3K27Ac can be broad in our hands) should not be analyzed with MACS; using various biological end-points as testing I have found that this is not true and that MACS outperforms ZINBA (specifically designed for broad histone modifications) WHEN the subpeaks function is called for detection of histone peaks with broad, narrow, or both types of signal distribution.

Hope this helps!

Good luck,

Batool 

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To post to this group, send email to macs-ann...@googlegroups.com.
To unsubscribe from this group, send email to macs-announcem...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/macs-announcement?hl=en.




--

Batool Akhtar-Zaidi
PhD Candidate, Scacheri Lab
Depts Molecular Medicine & Genetics
Cleveland Clinic & Case Western Reserve University
Cleveland OH 44106
tel. 216-368-2636


Matt Blades

unread,
Oct 21, 2011, 5:06:58 AM10/21/11
to MACS announcement

Thanks both for your suggestions I will give them a try.

Tao - Given what you said about d values and bimodal curves it looks
like my d values and curves are not too bad. Tag size 39bp and d
values ranging from 91 to 97 depending on parameters used, the curves
are also bimodal and look ok. That being the case when I have sorted
out optimum parameters, using --nomodel, --call-subpeaks etc etc.
What are the best discriminators to use when identifying significant
peaks from the *_peaks.xls file? i.e. do I consider tags,
-10*log10(pvalue), fold enrichment and FDR when picking peaks or just
one or two of these values?

My FDR is causing me concern as out of ~20,000 peaks called only a few
have a FDR(%) of 0 the rest are >35. I assume that is not good?

Apologies for my rather simple questions, but I need to get this
straight before I proceed

Thanks again,

Matt

Tao Liu

unread,
Oct 21, 2011, 9:52:46 AM10/21/11
to macs-ann...@googlegroups.com
Hi Matt,

I remember in a PloS One paper last year by Elizabeth G. Wilbanks et al.,  authors pointed out the best way to sort results in MACS is by -10*log10(pvalue) then fold enrichment. I agree with them. You don't have to worry about FDR too much if your input data are far more than ChIP data. MACS1.4 calculates FDR by swapping samples, so if your input signal has some strong bias somewhere in the genome, your FDR result would be bad. Bad FDR may mean something but it's just secondary. 

Best,
Tao

Lana Schaffer

unread,
Oct 21, 2011, 5:10:53 PM10/21/11
to macs-ann...@googlegroups.com

Tao,

I interpret 1.0 FDR as not a good peak at all and 0.4 FDR as perhaps.

Is this plausable?

Lana

--

You received this message because you are subscribed to the Google Groups "MACS announcement" group.

Anshul Kundaje

unread,
Aug 9, 2012, 12:01:24 PM8/9/12
to macs-ann...@googlegroups.com
The shift-size parameter is totally off (often happens with MACS1.4). See https://groups.google.com/d/msg/macs-announcement/XawMJuBLYrc/ErL5oWVUWdYJ for a discussion of how this can happen.

Use MACS2's shift-size parameter detection. The shift-size detection method used in MACS2 is much more robust and results in appropriate shifts than MACS1.4.

Thanks,
-A

On Thu, Aug 9, 2012 at 10:47 AM, António Miguel de Jesus Domingues <amjdom...@gmail.com> wrote:
Hi Tao,

I am having a similar issue: tag size is 75 bp (as expected) but d=76 which I assume is somewhat lower than expected. The distribution is binomial but with a smaller summit (see attach). Using these settings:
macs14 -tR1_trimmed.map -c input_trimmed.map -n R1_macs -f BOWTIE --llocal 10000 --verbose 3 --diag

I got the following results:

# effective genome size = 2.70e+09







# band width = 300







# model fold = 10,30







# pvalue cutoff = 1.00e-05







# Large dataset will be scaled towards smaller dataset.







# Range for calculating regional lambda is: 1000 bps and 10000 bps
















# tag size is determined as 74 bps







# total tags in treatment: 38801194







# tags after filtering in treatment: 26535992







# maximum duplicate tags at the same position in treatment = 1







# Redundant rate in treatment: 0.32







# total tags in control: 32842482







# tags after filtering in control: 31898554







# maximum duplicate tags at the same position in control = 1







# Redundant rate in control: 0.03







# d = 76







chr start end length summit tags -10*log10(pvalue) fold_enrichment FDR(%)
chr1 705945 706381 437 73 20 64.85 8.44 65.95
chr1 947036 947283 248 100 12 57.41 9.37 58.3

Changing the model fold to 10,50 only increased the number of FDR=100%, and increasing bandwidth to 500 (expected fragment size) made it worst and reduced d.

is that d=76 an acceptable model? What can I do to improve it?

Thanks.
To view this discussion on the web visit https://groups.google.com/d/msg/macs-announcement/-/7S98xOVGtigJ.

António Miguel de Jesus Domingues

unread,
Aug 13, 2012, 4:54:44 AM8/13/12
to macs-ann...@googlegroups.com
Hi A,

I've now tested MACS2 over the weekend with 2 different settings (--mfold default  and --mfold=5,50)  and still no luck:
macs2 -t R1_trimmed_sorted.bam -c Hela_input_trimmed_sorted.bam -f BAM -g hs -n ../MACS_results/test2/ -B -q 0.01 --llocal 10000 --mfold=5,50

In both cases d is still ~75 and now I get fewer called peaks (~20) in comparison with previous runs in MACS1.4 (~1000).

# This file is generated by MACS version 2.0.9 20111102 (tag:alpha)
# ARGUMENTS LIST:
# name = ../MACS_results/test1/
# format = BAM
# ChIP-seq file = R1_trimmed_sorted.bam
# control file = Hela_input_trimmed_sorted.bam
# effective genome size = 2.70e+09
# band width = 300
# model fold = 10,30
# qvalue cutoff = 1.00e-02
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
# Broad region calling is off

# tag size is determined as 74 bps
# total tags in treatment: 38801195
# tags after filtering in treatment: 35268985
# maximum duplicate tags at the same position in treatment = 2
# Redundant rate in treatment: 0.09
# total tags in control: 32842496
# tags after filtering in control: 32275086
# maximum duplicate tags at the same position in control = 2
# Redundant rate in control: 0.02
# d = 74

i. could it be it something funky with my data? How to test that?
ii. Should I force a model instead of having MACS modeling one (-nomodel + -shiftsize)?

Your help is greatly appreciated.

António

Ian Donaldson

unread,
Aug 15, 2012, 3:50:07 AM8/15/12
to macs-ann...@googlegroups.com
Try running HOMER to get an idea of what level of enrichment you have in your ChIP reads.
Also, have you done some general QC of your reads, e.g. with FASTQC?

It may just be that you have not got sufficient enrichment.  But these tools should help answer that.

Ian



António

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To view this discussion on the web visit https://groups.google.com/d/msg/macs-announcement/-/rvS1w0m1RqUJ.

António Miguel de Jesus Domingues

unread,
Aug 15, 2012, 10:36:04 AM8/15/12
to macs-ann...@googlegroups.com

First of all, many thanks to Anshul and Ian for the suggestions.

Now a detailed reply.

QC
I did QC of the reads using FASTQC, and it revealed an issue with kmers and sequence duplication levels (TrueSeq adaptors). These were removed using cutadapt and QCed (did I just made a verb?) again (see attach).

Mapping was done with bowtie -S -n 1 -k 1 --best, with high mapability
# reads processed: 34405769
# reads with at least one reported alignment: 32842496 (95.46%)
# reads that failed to align: 1563273 (4.54%)

I am also looking at the peaks in IGV: loading the raw reads of input and IP and then visually inspecting the called peaks. Regardless of the MACS setting there is evidence for enrichment in the called peaks. Funnily enough the model d=74 appears to call sharper peaks (vs d=100) and when peaks are called in the same region, these peaks fit better with read enrichment, that is, the summit is more centred. That said, we have no expectation as to whether the peaks will be sharper or broad (and how broad).

I am now running a diagnostic tool check enrichment. I'll report back just in case someone runs into the same problem.

António
duplication_levels.png
kmer_profiles.png
per_base_gc_content.png
per_base_n_content.png
per_base_quality.png
per_base_sequence_content.png
per_sequence_gc_content.png
per_sequence_quality.png
sequence_length_distribution.png
Reply all
Reply to author
Forward
Message has been deleted
0 new messages