How are the number of peaks passing IDR threshold of 0.05 generated - idr version 2.0.3

582 views
Skip to first unread message

Charlie George

unread,
May 11, 2017, 10:29:51 AM5/11/17
to idr-discuss
Hi,

Thanks for all your work with IDR and the discuss group, its really helpful! However  I'm abit confused about how the number of peaks passing the idr-threshold are filtered.

When I run the statement on two macs2 peak files (for example two pseudoreplicates) using the python3 idr version 2.0.3 - the log output from idr tells me I have:

'Number of peaks passing IDR cutoff of 0.05 - 3849/62510 (6.2%)'    

I know in this case this is very low and that I need to change the IDR parameters and number of peaks I use for my input files, however my question is how is the number of peaks passing IDR cutoff of 3849 generated? When I output the whole list of peaks using the --soft-idr-threshold option as 0.05 I get the full list of merged peaks (62510) , however if I try to filter this list myself down to those passing an IDR threshold of 0.05 by any of the methods (below) I get different number in my filtered list of peaks and I can't find the discrepancy between how the idr program is filtering and how I should filter my peak lists manually to get the same answer. Any advice would be appreciated.

Call to IDR:
idr   --samples sample1_pseudo_1.macs2.IDRpeaks sample2_pseudo_2.macs2.IDRpeaks
       --output-file sample1_pseudo_1.macs2_v_sample1_filtered_pseudo_2.macs2.tsv
       --plot
       --soft-idr-threshold 0.05
       --input-file-type narrowPeak
       --rank p.value
       --output-file-type narrowPeak
       --use-best-multisummit-IDR
       --verbose 2>>sample1_pseudo_1.macs2_v_sample2_filtered_pseudo_2.macs2.log        

My filtering attempts of the tsv file using python notebook python2.7.11:

import maths
import pandas as pd
df = pd.read_csv('sample1_pseudo_1.macs2_v_sample1_filtered_pseudo_2.macs2.tsv', sep='\t', header=None)
df.rename(columns={0:'chr', 1:'idr_start', 2:'idr_stop',3:'name',4:'score',5:'strand',6:'signalval',7:'pval',8:'qval',9:'summit',10:'localIDR',11:'globalIDR',12:'rep1start',13:'rep1stop',14:'rep1sig'},inplace=True)

Threshold no rounding - filtering on column 12 (globalIDR value)

threshold =-math.log(0.05,10)                 #this equals 1.301029995663981
df[df['globalIDR']>= threshold]].shape     # returns 3823  peaks  

Threshold rounded to 5 d.p. (as calculated in bash like in Kundaje lab ATAC-Seq pipeline) - filtering on column 12 (globalIDR value)
bash_thresh = 1.30103
df[df['globalIDR']>= bash_threshold]].shape     # returns 3823  peaks  

Filter on 'IDR score' column 5 of idr output
idr_score_threshold= int(-125* math.log(0.05+1e-12,2))          # this equals 540
df[df['score']>=idr_score_threshold].shape                            #returns 3853 peaks

As you can see none of these give the 3849 answer that is specified in the idr log output - how should i filter the tsv file produced using --soft-idr-threshold 0.05 to get the 3849 peaks specified by the software?

Many thanks in advance
Charlie

Anshul Kundaje

unread,
May 16, 2017, 6:14:26 PM5/16/17
to idr-d...@googlegroups.com, Nathan Boley
I'm cc-ing Nathan to answer this.

My guess is the text output in the IDR file is rounding the IDR values resulting in some discrepancy relative to the thresholding done internal to the code on the doubles. So it causes a few peaks on the border of the threshold go one way or the other. We'll push a fix for this if this is the case.

Nathan - Can you confirm.

-Anshul.

--
You received this message because you are subscribed to the Google Groups "idr-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Charlie George

unread,
May 17, 2017, 7:53:50 AM5/17/17
to idr-discuss

Thank you - I really appreciate it.
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss...@googlegroups.com.

Charlie George

unread,
Jun 1, 2017, 5:32:52 AM6/1/17
to idr-discuss, npb...@gmail.com
Hi Nathan,

sorry to bother you, I wondered if you'd had a chance to check this out.

Best,
Charlie

On Tuesday, May 16, 2017 at 11:14:26 PM UTC+1, Anshul Kundaje wrote:
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss...@googlegroups.com.

Nathan Boley

unread,
Jun 3, 2017, 9:27:55 PM6/3/17
to idr-d...@googlegroups.com
Hi Charlie,

Sorry for taking so long to get back to you.

I took a look at this, and I have a couple of hypotheses, but it's impossible to say for sure without looking at the test data.

1) The IDR values are being rounded to two decimal points before being written to the tsv, but the filtering is happening on the un-rounded values. Could you please try changing the formatting on line 329 in idr.py from "%.2f" to "%f" and trying again?
2) it doesn't appear as if you specified a random seed during your idr runs. When ranking peaks, ties are broken stochastically so the different peak counts could be due to this. I would try running these again and see if you get different results.

If neither of these end up being the cause, I'd be happy to take a look but I'll need access to your data.

Best, Nathan

Charlie George

unread,
Jun 14, 2017, 12:56:05 PM6/14/17
to idr-discuss
Hi,

Thanks Nathan and Anshul, It was indeed the rounding of the global IDR values as they were being written out that causing the filtering error - changing line "%.2f" to  "%f" in line 334 has fixed this. I've also changed this for line 333 for output for the localIDR and I've forked the code and made a pull request incase you want to update the code.

Best,
Charlie

Anshul Kundaje

unread,
Jun 14, 2017, 2:53:42 PM6/14/17
to idr-d...@googlegroups.com
Great thanks! We'll accept the fix.

-Anshul.

--
You received this message because you are subscribed to the Google Groups "idr-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss+unsubscribe@googlegroups.com.

Jin

unread,
Jun 14, 2017, 3:58:19 PM6/14/17
to idr-d...@googlegroups.com
Hello Charlie,

Thanks for figuring this out. I tested your PR and applied it to the forked repo (https://github.com/kundajelab/idr). Nathan's original repo is no longer maintained.
Thanks,

Jin

Charlie George

unread,
Jun 15, 2017, 6:26:27 AM6/15/17
to idr-discuss, lee...@stanford.edu

Brilliant! Thank you :)

On Wednesday, June 14, 2017 at 8:58:19 PM UTC+1, Jin wrote:
Hello Charlie,

Thanks for figuring this out. I tested your PR and applied it to the forked repo (https://github.com/kundajelab/idr). Nathan's original repo is no longer maintained.
Thanks,

Jin
On Wed, Jun 14, 2017 at 11:53 AM, Anshul Kundaje <ans...@kundaje.net> wrote:
Great thanks! We'll accept the fix.

-Anshul.
On Wed, Jun 14, 2017 at 9:56 AM, Charlie George <clgeor...@gmail.com> wrote:
Hi,

Thanks Nathan and Anshul, It was indeed the rounding of the global IDR values as they were being written out that causing the filtering error - changing line "%.2f" to  "%f" in line 334 has fixed this. I've also changed this for line 333 for output for the localIDR and I've forked the code and made a pull request incase you want to update the code.

Best,
Charlie



On Sunday, June 4, 2017 at 2:27:55 AM UTC+1, Nathan Boley wrote:
Hi Charlie,

Sorry for taking so long to get back to you.

I took a look at this, and I have a couple of hypotheses, but it's impossible to say for sure without looking at the test data.

1) The IDR values are being rounded to two decimal points before being written to the tsv, but the filtering is happening on the un-rounded values. Could you please try changing the formatting on line 329 in idr.py from "%.2f" to "%f" and trying again?
2) it doesn't appear as if you specified a random seed during your idr runs. When ranking peaks, ties are broken stochastically so the different peak counts could be due to this. I would try running these again and see if you get different results.

If neither of these end up being the cause, I'd be happy to take a look but I'll need access to your data.

Best, Nathan

--
You received this message because you are subscribed to the Google Groups "idr-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "idr-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to idr-discuss...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages