Starting with IDR2 - Basic questions


Pierre-Francois Roux

Oct 16, 2015, 5:21:38 AM
to idr-discuss
Dear Anshul,

First of all, I would like to apologize if the questions I am about to ask sound really naive. This is the first time I have included the IDR procedure in a ChIP-seq pipeline.

I have many ChIP-seq datasets, most of them for "sharp" histone modifications (K4Me3, K27Ac ...), each generated in two biological replicates. So far I've processed all the datasets independently and performed the peak calling with MACS2 2.0.8 with a lenient threshold (p = 1e-2), as suggested on your website (https://sites.google.com/site/anshulkundaje/projects/idr).
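For concreteness, a sketch of this relaxed peak-calling step (BAM and input-control file names are hypothetical; narrowPeak column 8 holds the -log10(p-value)):

```shell
# Relaxed MACS2 peak calling for one replicate (hypothetical file names).
# The lenient threshold (-p 1e-2) deliberately lets through irreproducible
# peaks, which the IDR step will later filter out.
macs2 callpeak \
    -t D0K27Ac_Rep1.bam -c Input.bam \
    -f BAM -g hs -p 1e-2 \
    -n D0K27Ac_Rep1_MACS2

# Sort the resulting narrowPeak file by -log10(p-value) (column 8),
# best peaks first, matching the _SORT file names used below.
sort -k8,8nr D0K27Ac_Rep1_MACS2_peaks.narrowPeak \
    > D0K27Ac_Rep1_MACS2_peaks_SORT.narrowPeak
```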

I finally ran IDR2 using the -log10(p-value) to compare the peaks called in each biological replicate for each histone modification.

idr --samples D0K27Ac_Rep1_MACS2_peaks_SORT.narrowPeak D0K27Ac_Rep2_MACS2_peaks_SORT.narrowPeak --plot

According to IDR2, for this specific example, around 55% of the peaks have an IDR below 5%.

1) My first question is: considering good ChIP libraries, what is the expected value here? Is 55% a reasonable fraction of reproducible peaks? Below which value should we consider discarding the datasets from further analysis? For one of the histone modifications, I get only 20% reproducible peaks, and I am wondering if I should be concerned.

2) Is the workflow I am using the right one? Or is it better to run peak calling on pooled data to generate an oracle? In the pipeline described on your website, you also talk about generating pseudo-replicates. Just to be sure I understood: is this "pseudo-replicates" approach only used to assess self-consistency when you have no replicates?

3) I've read quite a lot of threads about IDR explaining how to interpret the results and the graphs, but all of them refer to the "old" version of the pipeline. Since the graphs generated with the latest release of the pipeline are quite different, I am not sure I properly understand them. Here is the output I get running the previous command line. Even considering that 55% of the called peaks are reproducible, the plots look very messy to me, especially the ones at the top, and no diagonal line really shows up. What should I conclude from this kind of pattern? Did I do something wrong?
My previous exploration of reproducibility for these two specific samples (comparison of binding in 100 bp bins across the genome, ...) suggested they were pretty good. I am therefore a little concerned by the results obtained with IDR2.

Many thanks for your advice and sorry for such naive questions.

Cheers,

Pef




[Attachment: idrValues.txt.png]

Pierre-Francois Roux

Oct 20, 2015, 10:13:40 AM
to idr-discuss
Could someone help me, please?
I am really sorry to insist.

Best

Pierre-François

Anshul Kundaje

Oct 20, 2015, 11:47:41 AM
to idr-d...@googlegroups.com

I'll reply tomorrow. I have a few deadlines I need to get through first.

Anshul


Pierre-Francois Roux

Oct 20, 2015, 12:17:39 PM
to idr-discuss
Thanks a lot, Anshul, for your help.

Cheers,

Pierre-Francois Roux

Nov 4, 2015, 10:24:19 AM
to idr-discuss
Dear Anshul,

Do you think you will have time to answer my questions in the next few days?

Many thanks,

Pierre-François

Anshul Kundaje

Dec 18, 2015, 1:23:16 PM
to idr-d...@googlegroups.com
Hi Pef,

Sorry for the slow response.

What ranking measure are you using, signal scores or -log10(p-values)? For MACS2, you should use the -log10(p-values) as your ranking measure in IDR; they tend to behave better.
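For reference, the ranking column can be made explicit on the command line; a sketch reusing the file names from the first message (without --rank, idr 2.x defaults to ranking by signal.value):

```shell
# Run idr ranking peaks by MACS2's -log10(p-value) column rather than
# the default signal.value (file names as in the original post).
idr --samples D0K27Ac_Rep1_MACS2_peaks_SORT.narrowPeak \
              D0K27Ac_Rep2_MACS2_peaks_SORT.narrowPeak \
    --input-file-type narrowPeak \
    --rank p.value \
    --output-file idrValues.txt \
    --plot
```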
 
> According to IDR2, for this specific example, around 55% of the peaks have an IDR below 5%.
>
> 1) My first question is: considering good ChIP libraries, what is the expected value here? Is 55% a reasonable fraction of reproducible peaks? Below which value should we consider discarding the datasets from further analysis? For one of the histone modifications, I get only 20% reproducible peaks, and I am wondering if I should be concerned.

There is no specific expected value here. It really depends on the reproducibility of your data, the specific histone mark, the cell type, and the number of peaks you feed into IDR. Note that the peaks you input into IDR are relaxed calls, so they should in fact include both reproducible and irreproducible peaks. How many peaks pass the IDR 5% threshold, and how many peaks are you feeding into IDR?
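If I read the idr 2.x output format correctly, column 5 of the output file holds a scaled IDR value, min(int(-125 * log2(IDR)), 1000), so IDR <= 0.05 corresponds to a scaled value >= 540, and the passing peaks can be counted with awk. A minimal sketch (the three-line file here is a dummy standing in for a real idrValues.txt):

```shell
# Dummy stand-in for an idr output file; only column 5 (scaled IDR) matters.
# Scaled IDR = min(int(-125 * log2(IDR)), 1000); 0.05 maps to 540.
printf 'chr1\t100\t200\tpeak1\t600\nchr1\t300\t400\tpeak2\t100\nchr1\t500\t600\tpeak3\t1000\n' > idrValues.txt

# Count peaks passing the 5% IDR threshold.
awk '$5 >= 540' idrValues.txt | wc -l
```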

To estimate whether your replicates are reproducible or not you should compare the IDR peaks obtained by comparing replicates to the IDR peaks you obtain by comparing pooled-pseudoreplicates.
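The pooled-pseudoreplicate generation can be sketched with plain coreutils. This is a minimal sketch assuming reads already in BED form (with BAM input you would first convert, e.g. via bedtools bamtobed); tiny dummy files stand in for real replicates:

```shell
# Dummy 10-read BED files standing in for the two real replicates.
for i in $(seq 1 10); do printf 'chr1\t%s\t%s\n' $((i*100)) $((i*100+50)); done > rep1.bed
for i in $(seq 1 10); do printf 'chr2\t%s\t%s\n' $((i*100)) $((i*100+50)); done > rep2.bed

# Pool the reads, shuffle them, and split into two equal halves.
cat rep1.bed rep2.bed > pooled.bed
nreads=$(wc -l < pooled.bed)
shuf pooled.bed | split -d -l $(( (nreads + 1) / 2 )) - pseudorep_

# pseudorep_00 and pseudorep_01 are the two pooled-pseudoreplicates:
# call peaks on each and run idr on the pair exactly as for true replicates.
```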
 
> 2) Is the workflow I am using the right one? Or is it better to run peak calling on pooled data to generate an oracle?

You should call peaks on the pooled data as the oracle. For histone marks especially, that is the better way to do it, since peak boundaries and peak merging/splitting are rather unstable for most peak callers.
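A sketch of that oracle setup with idr's --peak-list option (the pooled-peak file name is hypothetical; the replicate file names are from the first message):

```shell
# Use peaks called on the pooled reads as the oracle peak list;
# idr then assigns each oracle peak the scores from each replicate.
idr --samples D0K27Ac_Rep1_MACS2_peaks_SORT.narrowPeak \
              D0K27Ac_Rep2_MACS2_peaks_SORT.narrowPeak \
    --peak-list D0K27Ac_Pooled_MACS2_peaks_SORT.narrowPeak \
    --input-file-type narrowPeak \
    --rank p.value \
    --plot
```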
 
> In the pipeline described on your website, you also talk about generating pseudo-replicates. Just to be sure I understood: is this "pseudo-replicates" approach only used to assess self-consistency when you have no replicates?

Pseudoreplicates are not just to be used in cases where you don't have replicates. Even with replicates, you want to compute how many peaks pass IDR when you compare pooled-pseudoreplicates. That gives you a sort of upper bound on how well you would do if all you had was sampling noise. If the true replicates give a dramatically different number of reproducible peaks compared to the pooled-pseudoreplicates, it indicates that one of your replicates is very different from the other.
 
> 3) I've read quite a lot of threads about IDR explaining how to interpret the results and the graphs, but all of them refer to the "old" version of the pipeline. Since the graphs generated with the latest release of the pipeline are quite different, I am not sure I properly understand them. Here is the output I get running the previous command line. Even considering that 55% of the called peaks are reproducible, the plots look very messy to me, especially the ones at the top, and no diagonal line really shows up. What should I conclude from this kind of pattern? Did I do something wrong?
> My previous exploration of reproducibility for these two specific samples (comparison of binding in 100 bp bins across the genome, ...) suggested they were pretty good. I am therefore a little concerned by the results obtained with IDR2.

Generate these plots using the oracle peaks. That is roughly equivalent to using fixed bins from the oracle and obtaining the scores from each replicate. In general, for histone marks you don't expect as strong a rank concordance as you do for TFs, even more so with MACS2, where the ranking measures are not very stable.
 
-Anshul.


Pierre-Francois Roux

Jan 5, 2016, 4:06:50 AM
to idr-discuss
Hi Anshul,

Thank you so much for this detailed answer.
This is much clearer to me now, and I managed to integrate IDR into the workflow I was creating to analyze my ChIP-seq and ATAC-seq data.

Many thanks again and happy new year.

Pef