Hi,
Thanks for all your work with IDR and the discuss group, its really helpful! However I'm abit confused about how the number of peaks passing the idr-threshold are filtered.
When I run the statement on two macs2 peak files (for example two pseudoreplicates) using the python3 idr version 2.0.3 - the log output from idr tells me I have:
'Number of peaks passing IDR cutoff of 0.05 - 3849/62510 (6.2%)'
I know in this case this is very low and that I need to change the IDR parameters and number of peaks I use for my input files, however my question is how is the number of peaks passing IDR cutoff of 3849 generated? When I output the whole list of peaks using the --soft-idr-threshold option as 0.05 I get the full list of merged peaks (62510) , however if I try to filter this list myself down to those passing an IDR threshold of 0.05 by any of the methods (below) I get different number in my filtered list of peaks and I can't find the discrepancy between how the idr program is filtering and how I should filter my peak lists manually to get the same answer. Any advice would be appreciated.
Call to IDR:
idr --samples sample1_pseudo_1.macs2.IDRpeaks sample2_pseudo_2.macs2.IDRpeaks
--output-file sample1_pseudo_1.macs2_v_sample1_filtered_pseudo_2.macs2.tsv
--plot
--soft-idr-threshold 0.05
--input-file-type narrowPeak
--rank p.value
--output-file-type narrowPeak
--use-best-multisummit-IDR
--verbose 2>>sample1_pseudo_1.macs2_v_sample2_filtered_pseudo_2.macs2.log
My filtering attempts of the tsv file using python notebook python2.7.11:
import maths
import pandas as pd
df = pd.read_csv('sample1_pseudo_1.macs2_v_sample1_filtered_pseudo_2.macs2.tsv', sep='\t', header=None)
df.rename(columns={0:'chr',
1:'idr_start',
2:'idr_stop',3:'name',4:'score',5:'strand',6:'signalval',7:'pval',8:'qval',9:'summit',10:'localIDR',11:'globalIDR',12:'rep1start',13:'rep1stop',14:'rep1sig'},inplace=True)
Threshold no rounding - filtering on column 12 (globalIDR value)
threshold =-math.log(0.05,10) #this equals 1.301029995663981
df[df['globalIDR']>= threshold]].shape # returns 3823 peaks
Threshold rounded to 5 d.p. (as calculated in bash like in Kundaje lab ATAC-Seq pipeline) - filtering on column 12 (globalIDR value)
bash_thresh = 1.30103
df[df['globalIDR']>= bash_threshold]].shape # returns 3823 peaks
Filter on 'IDR score' column 5 of idr output
idr_score_threshold= int(-125* math.log(0.05+1e-12,2)) # this equals 540
df[df['score']>=idr_score_threshold].shape #returns 3853 peaks
As you can see none of these give the 3849 answer that is specified in the idr log output - how should i filter the tsv file produced using --soft-idr-threshold 0.05 to get the 3849 peaks specified by the software?
Many thanks in advance
Charlie