Issues calling peaks for H3K4me1: low quality data or am I doing something wrong?

Alvaro J. González

Aug 24, 2013, 3:06:17 PM
to idr-d...@googlegroups.com
Dear Anshul and IDR community,

I'm having issues trying to call peaks for one of the histone modification ChIP-seq datasets released by the NIH Epigenomics Roadmap. This time I'm interested in H3K4me1 in embryonic stem cells (data downloaded from here). I've tried the two BI replicates and two of the three UCSD replicates, always running into the same bug, which I describe below.

I initially tried macs2 with the input parameters that you suggest. This is what I get, for instance, when using replicate UCSD.H1.H3K4me1.LL311.bed.gz:

INFO  @ Sat, 24 Aug 2013 14:47:18: #2 Build Peak Model... 
INFO  @ Sat, 24 Aug 2013 14:47:44: #2 number of paired peaks: 0 
WARNING @ Sat, 24 Aug 2013 14:47:44: Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, we suggest to use --nomodel and --extsize 147 or other fixed number instead.
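
(For reference, the fixed-extension fallback that the warning points to would look roughly like the sketch below; the input file name is a placeholder, and --extsize should ideally be a real fragment-length estimate rather than the generic 147.)

macs2 callpeak -t UCSD.H1.H3K4me1.LL311.bed.gz -c UCSD.H1.Input.bed.gz \
    -f BED -g hs -n UCSD.H1.H3K4me1.LL311 --nomodel --extsize 147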

So obviously the peak-caller isn't able to come up with a model for peak-pairing. This happens not only with this replicate but also with three others that I've tried so far, which makes me feel that maybe the data quality is low. One solution could be to estimate the fragment length and give it to macs2 as a fixed parameter. I tried to figure this out using spp, following your guidance, but I ran into a bug once again:

Error in if ((crosscorr$rel.phantom.coeff >= 0) & (crosscorr$rel.phantom.coeff <  : 
  argument is of length zero
Execution halted

This happens in the run_spp.R script, in the block "# Compute phantom peak coefficient". Obviously, crosscorr$rel.phantom.coeff is not being successfully calculated, so the evaluation throws an error. I wonder if this is happening, again, because the data quality is low.

Anyway, I went on and tried to do peak-calling directly using spp (command exactly as you use it in your pipeline), but the same error was thrown.

I noticed that someone else had previously reported this bug, but I didn't find an answer to it.

Any ideas? Have you been able to successfully call peaks on this cell type, for this histone mark?

Thanks a lot!

- Alvaro.

Anshul Kundaje

Aug 25, 2013, 3:04:13 PM
to idr-d...@googlegroups.com
Hi Alvaro,

Unfortunately, there are several inconsistencies and formatting issues with the Roadmap datasets at the EDACC (I have asked them to correct these, and they will be doing so soon). So the problems you are facing have nothing to do with SPP, MACS, or data quality. See below.

1. Strand information is often not in the correct column.
The BED files at the EDACC site are a mixture of 6-column and 5-column files. For example, BI.H1.H3K4me1.Solexa-10529.bed.gz has only 5 columns, with strand information in the 5th column. As per the standard BED format, which SPP and MACS expect, strand information should be in the 6th column.

2. All the BED files at the EDACC site have 1-based start positions. As per the standard BED format, these should be 0-based, so 1 needs to be subtracted from all the start positions.

3. Finally, for some reason the reads have been pre-extended to 200 bp, thereby losing all information about the actual read lengths. For SPP and MACS2, the read length is important, as it is used to correctly estimate fragment length and to compute QC scores. So you should artificially re-truncate the reads to 36 bp, i.e., adjust start or stop according to strand so that each read ends up 36 bp long (a sketch covering all three fixes is below). If you are integrating data from multiple experiments, you may also want to refilter all reads using a 36 bp mappability mask so that differences in read length do not contribute to artifactual differences between the datasets.
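
For one of the 5-column files, all three fixes could be applied along the lines below. This is only a minimal sketch: the column layout (chrom/start/end/score/strand), the 36 bp original read length, and the output file name are assumptions that should be checked against your files.

zcat BI.H1.H3K4me1.Solexa-10529.bed.gz \
  | awk 'BEGIN{OFS="\t"; RL=36} {
      start = $2 - 1; end = $3; strand = $5;    # fix 2: 1-based -> 0-based; fix 1: strand is in column 5 here
      if (strand == "+") end = start + RL;      # fix 3: keep only the first RL bases from the 5-prime end
      else               start = end - RL;
      print $1, start, end, "N", 1000, strand   # emit a 6-column tagAlign-style BED line
    }' \
  | gzip -c > BI.H1.H3K4me1.Solexa-10529.fixed.tagAlign.gz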

In case it helps, I have already reprocessed a large chunk of the Roadmap data (which we are using in some upcoming papers), fixing these 3 problems: http://www.broadinstitute.org/~anshul/projects/roadmap/mapped/36bpReadLensFiltered/ . Note that a small fraction of the datasets are likely to be replaced with better-quality data, so the repository I have linked to is subject to some minor changes. You can reproduce my files by following the steps above if you like.

Thanks,
Anshul.



Alvaro J. González

Aug 26, 2013, 3:50:14 PM
to idr-d...@googlegroups.com
Thanks for your post, Anshul. It was extremely useful. I had actually noticed weird things in the format of these files (the 200 bp reads were unexpected, and the peak-callers were calculating a read length of 199; now I realize that's because of the incorrect 1-based start positions), but I hadn't reasoned that this could be throwing the peak-callers off. It now makes sense to me: how could you estimate a fragment length when the reads are potentially as long as the peak?

But unfortunately I keep getting the same results even when I use your reformatted files.

For instance, take again the H3K4me1 histone modification in the H1hesc cell line. For input, I'm concatenating the two Broad Institute input files (BI.H1.Input.Solexa-10531.filt.tagAlign.gz and BI.H1.Input.Solexa-12532.filt.tagAlign.gz). The ChIP-seq sample could be BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.gz (the results are the same no matter which replicate I use). This is the command I use to run spp (identical to what you have in your pipeline, I believe):

Rscript run_spp.R -c=BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.gz -i=input.catted.tagAlign.gz -npeak=300000 -odir=BI.H1.H3K4me1.Solexa-10529.filt.tagA/ -savr -savp -rf -out=BI.H1.H3K4me1.Solexa-10529.filt.tagA/phantomPeakStats.tab

This produces the following output:

ChIP data: BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.gz 
Control data: input.catted.tagAlign.gz 
strandshift(min): -100 
strandshift(step): 5 
strandshift(max) 600 
user-defined peak shift NA 
exclusion(min): 10 
exclusion(max): NaN 
num parallel nodes: NA 
FDR threshold: 0.01 
NumPeaks Threshold: 3e+05 
Output Directory: BI.H1.H3K4me1.Solexa-10529.filt.tagA/ 
narrowPeak output file name: NA 
regionPeak output file name: BI.H1.H3K4me1.Solexa-10529.filt.tagA//BI.H1.H3K4me1.Solexa-10529.filt.tagAlign_VS_catted.tagAlign.regionPeak 
Rdata filename: NA 
plot pdf filename: BI.H1.H3K4me1.Solexa-10529.filt.tagA//BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.pdf 
result filename: BI.H1.H3K4me1.Solexa-10529.filt.tagA/phantomPeakStats.tab 
Overwrite files?: TRUE

Decompressing ChIP file
Decompressing control file
Loading required package: caTools
Reading ChIP tagAlign/BAM file BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.gz 
opened /tmp/RtmpT2MCvR/BI.H1.H3K4me1.Solexa-10529.filt.tagAlign741522c3fa4f
done. read 7615187 fragments
ChIP data read length 36 
[1] TRUE
Reading Control tagAlign/BAM file input.catted.tagAlign.gz 
opened /tmp/RtmpT2MCvR/input.catted.tagAlign7415658dafcd
done. read 16518507 fragments
Control data read length 36 
Calculating peak characteristics
Warning message:
In max(ccl.av$x[ccl.av$y >= th]) :
  no non-missing arguments to max; returning -Inf
Minimum cross-correlation value  
Minimum cross-correlation shift  
Peak cross-correlation value NA 
Peak strand shift NA 
Warning message:
In max(crosscorr$cross.correlation$x[crosscorr$cross.correlation$y >=  :
  no non-missing arguments to max; returning -Inf
Window half size -Inf 
Phantom peak location  
Phantom peak Correlation  
Normalized cross-correlation coefficient (NCCC)  
Relative Cross correlation Coefficient (RCCC)  
Error in if ((crosscorr$rel.phantom.coeff >= 0) & (crosscorr$rel.phantom.coeff <  : 
  argument is of length zero
Execution halted

This is the same bug I reported before. When I try macs2, as before, it's unable to calculate a model for peak-pairing.

Has this happened to you with these samples?

Anshul Kundaje

Aug 26, 2013, 3:52:31 PM
to idr-d...@googlegroups.com
Alvaro,

Use the run_spp_nodups.R script, since duplicates have already been removed from these files. Let me know if you still have the problem. The model-building step in MACS2 is still a little unstable, so use SPP to compute the fragment length and then pass it to MACS2. I have processed these files this way and everything has worked fine.
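
As a rough sketch of that hand-off (file names are placeholders, and the assumption that the first value in column 3 of the stats file is the top fragment-length estimate should be verified against the version of run_spp_nodups.R you are running):

Rscript run_spp_nodups.R -c=BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.gz \
    -i=input.catted.tagAlign.gz -savp -rf -out=phantomPeakStats.tab

FRAGLEN=$(cut -f3 phantomPeakStats.tab | cut -d, -f1)   # top fragment-length estimate (assumed column/format)

macs2 callpeak -t BI.H1.H3K4me1.Solexa-10529.filt.tagAlign.gz \
    -c input.catted.tagAlign.gz -f BED -g hs -n BI.H1.H3K4me1.Solexa-10529 \
    --nomodel --extsize "$FRAGLEN"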

-A


Alvaro J. González

Aug 28, 2013, 1:10:33 PM
to idr-d...@googlegroups.com
Hi Anshul,

It's running like a breeze now. Thank you so much.

Two interesting observations that I'm sure you're aware of:

1) SPP is significantly slower than MACS2 for peak-calling. A little unexpected, since SPP uses the Boost libraries. I guess MACS2 being a Python script vs. SPP being in R makes a huge difference here.

2) Most of the NIH Epigenomics Roadmap samples, even after your filtering, don't pass the SPP quality control. Also a little unexpected. The good news is that for almost all the cell-type/histone-mark combinations I've tried, you can find at least 2 replicates that pass the control, and in the end that's all you need for IDR. Now what you mentioned before about some of those replicates being planned for replacement in the near future makes sense to me.

Thanks again, Anshul. It was tons of help.

- Al.

Anshul Kundaje

Sep 2, 2013, 9:02:03 PM
to idr-d...@googlegroups.com
Hi Alvaro,


> 1) SPP is significantly slower than MACS2 for peak-calling. A little unexpected, since SPP uses the Boost libraries. I guess MACS2 being a Python script vs. SPP being in R makes a huge difference here.

The difference in speed between SPP and MACS doesn't have much to do with Python vs. R; the core of SPP is in C++. The difference is inherently due to how scores are computed by the two methods. MACS uses a parametric assumption of a Poisson distribution to compute p-values and then computes q-values directly using BH, whereas SPP does a ChIP/input swap and computes empirical FDRs. Also, SPP does a genome-wide cross-correlation to estimate fragment length and QC scores, whereas MACS does this only at pre-thresholded regions. Finally, SPP does have a parallel mode that makes it run faster (it uses the snow package), but in my experience the snow package is unstable on clusters, so I tend not to use it. Anyway, I don't find either of them rate-limiting.
 
> 2) Most of the NIH Epigenomics Roadmap samples, even after your filtering, don't pass the SPP quality control. Also a little unexpected. The good news is that for almost all the cell-type/histone-mark combinations I've tried, you can find at least 2 replicates that pass the control, and in the end that's all you need for IDR. Now what you mentioned before about some of those replicates being planned for replacement in the near future makes sense to me.

Not all Roadmap samples have biological replicates (whereas ENCODE samples do). In fact most don't, because these are primary tissues and cell types and it's often hard to get enough material. The multiple files per sample are often replicates, but as you may have seen they are really not well matched in terms of sequencing depth. So I would highly recommend using the pooled-pseudoreplicate strategy to get the best sensitivity in peak calling, because comparing 2 replicates with very different sequencing depths or data quality will only give you highly conservative results.
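
For example, pooled pseudoreplicates can be generated along these lines (a sketch with placeholder file names): pool the replicate tagAlign files, shuffle the pooled reads, and split them into two equal halves, which are then run through peak calling and IDR exactly like true replicates.

# pool the two replicates (placeholder names)
zcat rep1.filt.tagAlign.gz rep2.filt.tagAlign.gz | gzip -c > pooled.tagAlign.gz

# shuffle the pooled reads and split them into two equal halves
NREADS=$(zcat pooled.tagAlign.gz | wc -l)
zcat pooled.tagAlign.gz | shuf | split -d -l $(( (NREADS + 1) / 2 )) - pooled.pr
gzip -c pooled.pr00 > pooled.pseudoRep1.tagAlign.gz
gzip -c pooled.pr01 > pooled.pseudoRep2.tagAlign.gz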

Thanks,
-A

Ryan Dale

Sep 4, 2013, 12:51:13 PM
to idr-d...@googlegroups.com
Hi Anshul -

> Finally, SPP does have a parallel mode that makes it run faster (it uses the snow package), but in my experience the snow package is unstable on clusters, so I tend not to use it.

Can you elaborate on what kind of instability you observe with the snow package on a cluster? For example, does R crash with an error or just hang? The reason I ask is that I've recently had some jobs using snow mysteriously fail on a cluster, and I wonder if it could be related.

I recognize this is off-topic for the thread -- please feel free to move if necessary.

thanks,
-ryan

Anshul Kundaje

Sep 4, 2013, 9:03:15 PM
to idr-d...@googlegroups.com
They typically just fail randomly, or sometimes never finish (they seem to hang). I'm not sure what the problem is. The failures are pretty random (at least I haven't been able to detect any pattern to them), so I'd much rather wait a bit and get a reliable run.

-A