Hi Anton,
I've got a rather specific question regarding
QuEST and results with my dataset. I've been very happy using QuEST to
analyze my ChIP-Seq data in bacteria, as I think that being able to
optimize the parameters for my data is ideal and I appreciate the
documentation and support you have provided via the Google Group.
With respect to my data, I am looking at a large number
of datasets so I won't be able to validate all the peaks, but I have
validated a small number and have seen good correlation between RT-PCR
and ChIP-Seq. Additionally, we have very few (almost no) regions to
serve as a positive control for enrichment- that is to say, we don't
actually have much data about where DNA-associated proteins are located
on the chromosome.
In order to be rigorous with respect to my data analysis,
I perform each ChIP-Seq experiment in biological triplicate, align with
bowtie2, and use QuEST to find peaks. Because I am working with the
tiny chromosome of a bacterium and getting millions of reads back, we
see significant background over the entire genome. As a result, I keep
the requirement for fold enrichment low, (1.25 seeding, 1.5 fold over
background, 1.25 extension). I then look through the peaks (from
peak_caller.ChIP.out.with_metrics, ignoring the "region" data), and
require that they have a positive peak shift, positive strand
correlations, and a -log10q value >= 2. If a peak matches those
parameters in 2/3 samples, I consider it a called peak.
So while I have used QuEST to analyze all my data,
and that data seem consistent with what we would expect (and there has
been some validation), I have just realized that one of my few potential
positive control regions was not called as a peak. That is to say, we
know a particular protein regulates expression of itself and would
expect it to be found at its own promoter. ChIP-Seq of this protein does
result in enrichment at the promoter site, but the issue is that the
peak shift is negative (note that the average peak shift in my
experiments tends to be around 130). These are the results for the peak
region called around that promoter region, in three biological
replicates:
A:
R-24 xxxxxx 424810-425186 ChIP: 15.8725 control: 1.4801
max_pos: 424988 ef: 10.7239 ChIP_tags: 5373 background_tags: 6419
tag_ef: 5.4187 ps: 120 cor: 0.356835 -log10_qv: 5883.77 -log10_pv:
5886.19 qv_rank: 36
P-24-1 xxxxxx 424988 ChIP: 15.8725 control: 1.4801 region:
424810-425187 ef: 10.7239 ps: 119 cor: 0.548702 -log10_qv: 6678.5
-log10_pv: 6680.92 qv_rank: 24
B:
R-25 xxxxxx 424807-425187
ChIP: 14.6417 control: 1.45236 max_pos: 424979 ef: 10.0813 ChIP_tags:
4031 background_tags: 6702 tag_ef: 4.94983 ps: -163 cor: 0.271332
-log10_qv: 3923.51 -log10_pv: 3925.91 qv_rank: 38
P-25-1 xxxxxx 424979 ChIP: 14.6417 control: 1.45236 region:
424807-425188 ef: 10.0813 ps: -158 cor: 0.465295 -log10_qv: 4628.29
-log10_pv: 4630.68 qv_rank: 25
C:
R-25 xxxxxx 424781-425199
ChIP: 14.9932 control: 1.50569 max_pos: 424990 ef: 9.95771 ChIP_tags:
5959 background_tags: 7012 tag_ef: 5.22367 ps: 132 cor: 0.229571
-log10_qv: 6125.94 -log10_pv: 6128.38 qv_rank: 34
P-25-1 xxxxxx 424990 ChIP: 14.9932 control: 1.50569 region:
424781-425200 ef: 9.95771 ps: -181 cor: 0.414566 -log10_qv: 6029.09
-log10_pv: 6031.53 qv_rank: 25
It is surprising to me
that such good enrichment would be an artifact- which is of course,
possible. But when I look at the distribution of the forward and reverse
sequencing reads (image attached, around regions 24 or 25, red lines
are forward reads, blue are reverse), I don't understand why sample A
has a positive peak shift and samples B and C don't- qualitatively, the
relative abundances of the forward and reverse reads look similar to me!
Can you possibly suggest a reason QuEST is calling a
negative peak shift in this region or a way to modify this parameter? I think I may have to use a different peak calling algorithm if one of the only control regions we have is not called, and I would much rather not switch at this point! I appreciate any assistance you
could offer.