This mainly happens due to a mappability phenomenon that gets amplified in
the following situations (individually or a combination of the situations
below).
- Your dataset is a good ChIP-seq dataset but has few binding sites and
hence peaks ( relative to the overall size of the genome. For human/mouse ~
< 1000 peaks)
- Your dataset is undersequenced (insufficient sequencing depth)
- Your dataset has quite a few mismapping due to lower read quality
- Your dataset has poor ChIP efficiency (hence lots of background noise and
weak peaks)
- Your dataset has broad regions of enrichment and not strong punctate peaks
A useful way of estimating fragment length (different from how MACS does
it) is to compute a strand cross-correlation profile of read start density
on the + and - strand i.e. you compute the number of read starts at each
position on the + strand and separately on the - strand for each
chromosome. Then simply shift these vectors wrt each other and compute the
correlation for each shift. You can then plot a cross-correlation profile
as the cross-correlation values on the y-axis and the shift that you used
to compute the correlation on the x-axis. This is the cross-correlation
profile for the dataset. Due to the 'shift' phenomenon of reads on the +
and - strand around true binding sites, one would get a peak in the
cross-correlation profile at the predominant fragment length.
For a really strong ChIP-seq dataset such as say CTCF in human cells (great
antibody and 45-60K peaks typically), the cross-correlation profile looks
like what u see in the attached Figure CTCF.pdf. Notice the RED vertical
line which is the dominant peak at the true peak shift. Also notice the
little bump (the blue vertical line). This is at read-length.
At the other extreme, lets take a control dataset (input DNA). The
cross-correlation profile is shown in CONTROL.pdf. Now notice how the
strongest peak is the blue line (read length) and there is basically almost
no other significant peak in the profile. The absence of a peak shud be
expected since unlike a ChIP-seq dataset for input DNA one expects no
significant clustering of fragments around specific target sites (except
potentially weak biases in open chromatin regions depending on the protocol
used). Now the read-length peak occurs due to unique mappability properties
of the mapped reads. If a position 'i' on the + strand in the genome is
uniquely mappable (i.e. a read starting at 'i' on the + strand maps
uniquely), it implies that the position 'i+readlength-1' is also uniquely
mappable on the - strand (ie. a read starting at i+readlength-1 on the -
strand maps uniquely to that position). So in the input dataset or in
random scattering of reads to uniquely mappable locations (in a genome made
up of unmappable, multimappable locations and unique mappable locations),
there is a greater odds of finding reads starting on the + and - strand
separated by read-length than any other shift. Which is why the
cross-correlation profile peaks at read-length compared to other values of
strand-shift and the cross-correlation at the true fragment
length/peak-shift is washed away since there are is no significant +/-
strand read density shift in the input dataset.
Now take a look at what you get for some a ChIP-seq dataset that is an
inbetween case.
POL2B.pdf : has few peaks (just about 3000 detectable ones in the human
genome), this particular antibody is not very efficient (there are other
POL2 antibodies that are very effective) and these are broad scattered
peaks (following elongation patterns of POL2). Notice how you now have 2
peaks in the cross-correlation profile. One at the true peak shift
(~185-200 bp) thats the one marked in red and the other at read length (the
one marked in blue). For such weaker datasets, the read-length peak starts
to dominate. Depending on the data quality characteristics of the dataset,
the read-length peak scales relative to the true fragment length peak.
So long story short, MACS effectively tends to just pick up just the
strongest peak in the cross-correlation profile (although it uses a
different method of estimating the peak-shift) and for datasets that have
the properties listed at the top of this email, basically it picks up the
read length. For strong datasets, it picks up the true shift. What one
needs to do is find the peak in the cross-correlation profile ignoring any
peak at read-length (which may be stronger or weaker than the other peaks
in the profile). This always gives reliable estimates of fragment length
(d/peak-shift). We have confirmed this using paired-end sequencing on a
variety of different TFs and histone marks with different binding
characteristics and ubiqiuity (where you can actually observe the
distribution of fragment lengths for comparison). We have seen this
phenomenon in a large number of datasets (ENCODE and modENCODE datasets).
We have a paper in press right now that deals with this phenomenon as well
as how it can be used as a useful data quality measure. Once it is
published I can send a link to those interested.
If you would like to have some code that computes the fragment length based
on the cross-correlation method shoot me an email. I am hesitant to link it
here without Tao's permission, since it uses the code-base from another
peak caller. You can then use the --shift-size parameter set to 1/2 the
estimated fragment length with --no-model. You will notice significantly
better results with a correctly estimated 'd'.
I think at some point, it might be useful to have this cross-correlation
method incorporated within MACS so as to make the d estimation more robust
(which is probably one of the only unstable aspects of an otherwise
fantastic peak caller).
Thanks,
Anshul.
On Sat, Jan 28, 2012 at 3:14 PM, Yuan Hao <yuan.x.
...@gmail.com> wrote:
> Hi,
> Here are my two cents. It's not uncommon to see such a small 'd' from
> MACS. I have ChIP-Seq data from different breast cancer cell lines,
> sequenced on different machines by different parties, with from very high
> to relatively low sequencing and alignment qualities years apart. Using
> MACS-1.x, I always got the d values as small as yours. As the predicted
> fragment size (i.e. 'd') is always smaller than 2 x read size, MACS
> couldn't build up the mode, but rather using the user specified shifting
> size. Through checking of the called peak qualities and by comparing
> results from using other peak calling programs, I had an impression that
> MACS-1.x tends to underestimates the distance. As MACS picks the top 1000
> best peaks for d estimation (if I memory is right), my personal guess is
> that this sampling process might caused some bias with regards to peak
> profiles in whole. I would take considerations of my actually fragment
> size, refer to at least one of another peak calling program, and check out
> the peaks called in a browser to decide on a proper shifting size that
> could be supplied to MACS. This is just my personal experience without
> application to MACS performance in general. Also, I haven't tried out MACS2
> yet, so no experience in the latest version. Hope this helps.
> Cheers,
> Yuan
> On 28 Jan 2012, at 19:11, Gyan Prakash Srivastava wrote:
> Hello MACS experts,
>> I have a serious concern with the MACS output from my chip-seq data. I
>> have chip-seq data from postmortem brain samples (same regions) from many
>> individuals. The protocol is exactly same for generating data. We expect
>> lot of similarity from chip-seq data from these samples. When I run MACS
>> for each sample with same parameter and control, I see different "d" values
>> like below.
>> # d = 43
>> # d = 269
>> # d = 38
>> # d = 302
>> # d = 38
>> # d = 39
>> # d = 38
>> # d = 37
>> # d = 40
>> # d = 39
>> # d = 37
>> # d = 35
>> # d = 40
>> My concern is, whether "d" is very sensitive variable? I think different
>> "d" values could lead different shiftsize and hence completely different
>> peak profile (count and location). Any suggestion?
>> Thanks,
>> Gyan
>> The information in this e-mail is intended only for the person to whom it
>> is
>> addressed. If you believe this e-mail was sent to you in error and the
>> e-mail
>> contains patient information, please contact the Partners Compliance
>> HelpLine at
>> http://www.partners.org/**complianceline<http://www.partners.org/complianceline>. If the e-mail was sent to you in error
>> but does not contain patient information, please contact the sender and
>> properly
>> dispose of the e-mail.
>> --
>> You received this message because you are subscribed to the Google Groups
>> "MACS announcement" group.
>> To post to this group, send email to macs-announcement@**googlegroups.com<macs-announcement@googlegroups.com>
>> .
>> To unsubscribe from this group, send email to
>> macs-announcement+unsubscribe@**googlegroups.com<macs-announcement%2Bunsubs cribe@googlegroups.com>
>> .
>> For more options, visit this group at http://groups.google.com/**
>> group/macs-announcement?hl=en<http://groups.google.com/group/macs-announcement?hl=en>
>> .
> --
> You received this message because you are subscribed to the Google Groups
> "MACS announcement" group.
> To post to this group, send email to macs-announcement@**googlegroups.com<macs-announcement@googlegroups.com>
> .
> To unsubscribe from this group, send email to
> macs-announcement+unsubscribe@**googlegroups.com<macs-announcement%2Bunsubs cribe@googlegroups.com>
> .
> For more options, visit this group at http://groups.google.com/**
> group/macs-announcement?hl=en<http://groups.google.com/group/macs-announcement?hl=en>
> .