qvalue results differ from FIMO q-value results

106 views
Skip to first unread message

Teshome Mulugeta

unread,
Jan 30, 2018, 9:29:20 AM1/30/18
to MEME Suite Q&A
Hi,

I am not getting similar qvalue results as FIMO did?  

FIMO logs

fimo --verbosity 4 --bgfile Ssal.allpromoter.masked.fasta.markov1 -oc result x8650 T2_SP.induced_UP.fa
-------------------------------------------------------------------------------------------
Using motif +REST_M09650_TRANSFAC of width 22.
Using motif -REST_M09650_TRANSFAC of width 22.
Num stored matches 59
Num scanned positions 315036
Computing q-values.
Estimating pi_0 from a uniformly sampled set of 10000 p-values.
Estimating pi_0.
Minimal pi_zero = 0.986395
Estimated pi_0=0.994619

q-value logs
I took all p-vlaues reported by FIMO result
awk -F'\t' -v OFS='\t' '{print $1,$3,$8}' result/fimo.txt >REST_M09650_TRANSFAC.pval

computed qvalue using qvalue tool obtained from MEME suite
qvalue --header 1 --column 3 --append --pi-zero --pi-zero-file REST_M09650_TRANSFAC.pval.pi_0  REST_M09650_TRANSFAC.pval  >REST_M09650_TRANSFAC.pval.qva
----------------------------------------------------------------------------------------------
Read 58 values from REST_M09650_TRANSFAC.pval.
Warning: Cannot estimate pi_0 accurately from fewer than 100 p-values.         total p-values = 58. Using pi_zero = 1.0.

Differences
$> head result/fimo.txt
# motif_id motif_alt_id sequence_name start stop strand score p-value q-value matched_sequence
REST_M09650_TRANSFAC REST gene56427:106591944::NW_012347437.1:20808-22008(+) 997 1018 - 30.5741 2.99e-11 9.37e-06 TTCAGCACCGGGGACAGCTCCT
REST_M09650_TRANSFAC REST gene40373:106579160::ssa19:55991880-55993080(-) 1031 1052 - 18.8519 1.42e-07 0.0222 GTCAAAACCAAGGAGAGCGACA
REST_M09650_TRANSFAC REST gene57355:106592182::NW_012347787.1:17807-19007(-) 167 188 + 15.537 8.52e-07 0.089 ATCAGCACCTTGGATACGGGCA
REST_M09650_TRANSFAC REST gene4980:106590030::ssa02:55733108-55734308(-) 293 314 - 13.9352 1.91e-06 0.123 TTCAGCACTCTGGGGAGCAGAG
REST_M09650_TRANSFAC REST gene43968:106582810::ssa22:11947443-11948643(-) 711 732 + 13.8796 1.96e-06 0.123 TCCAGCACCAGGGAGACATGAA
REST_M09650_TRANSFAC REST gene13172:106606520::ssa06:12603464-12604664(-) 1136 1157 + 13.0093 2.99e-06 0.156 TTCTGCACCCCGCAGAGCTGGA
REST_M09650_TRANSFAC REST gene5016:106590330::ssa02:56911576-56912776(+) 477 498 - 12.3426 4.1e-06 0.18 TTCAGCACTCAGGGGAGCAGAG
REST_M09650_TRANSFAC REST gene27443:106566391::ssa13:5831858-5833058(+) 1045 1066 + 11.8889 5.06e-06 0.18 TTCAGCACCATAAACACCTCAT
REST_M09650_TRANSFAC REST gene14716:106607921::ssa06:62116024-62117224(+) 993 1014 - 11.8426 5.17e-06 0.18 ATGAGCGCCCTGGACAGCCTTA
[ teshmu@login-0 ] ~/Anadromy/motif6-masked/upregulated/logfoldchange/upregulated/fimo/all/test

$> head REST_M09650_TRANSFAC.pval.qva
# motif_id sequence_name p-value q-value
REST_M09650_TRANSFAC gene56427:106591944::NW_012347437.1:20808-22008(+) 2.99e-11 1.7641e-09
REST_M09650_TRANSFAC gene40373:106579160::ssa19:55991880-55993080(-) 1.42e-07 4.189e-06
REST_M09650_TRANSFAC gene57355:106592182::NW_012347787.1:17807-19007(-) 8.52e-07 1.6756e-05
REST_M09650_TRANSFAC gene4980:106590030::ssa02:55733108-55734308(-) 1.91e-06 2.3128e-05
REST_M09650_TRANSFAC gene43968:106582810::ssa22:11947443-11948643(-) 1.96e-06 2.3128e-05
REST_M09650_TRANSFAC gene13172:106606520::ssa06:12603464-12604664(-) 2.99e-06 2.94017e-05
REST_M09650_TRANSFAC gene5016:106590330::ssa02:56911576-56912776(+) 4.1e-06 3.38922e-05
REST_M09650_TRANSFAC gene27443:106566391::ssa13:5831858-5833058(+) 5.06e-06 3.38922e-05
REST_M09650_TRANSFAC gene14716:106607921::ssa06:62116024-62117224(+) 5.17e-06 3.38922e-05

Is it because FIMO did sample 10000 p-values? We have only 59 matches and where did FIMO got the 10000 p-values?

Best,
Teshome

CharlesEGrant

unread,
Jan 30, 2018, 3:06:19 PM1/30/18
to MEME Suite Q&A
Hi Teshome,

The command you used:

fimo --verbosity 4 --bgfile Ssal.allpromoter.masked.fasta.markov1 -oc result x8650 T2_SP.induced_UP.fa

is using the default p-value threshold of 0.0001, so it's only reporting the matches that pass that default p-value threshold. In order to compute the q-values corresponding to p-values, you have to have the full distribution of observed p-values, or at least a 'fair' sample of the distribution. By taking the p-values from the output file, you are only getting the significant p-values. There are two few of those to calculate q-values and more importantly, they are a highly skewed sample.

The closest you could come would be to use the FIMO '--text' option which will report all the observed p-values, but which doesn't report q-values.

Charles

Teshome Mulugeta

unread,
Jan 30, 2018, 5:15:04 PM1/30/18
to MEME Suite Q&A
Thanks, i tried to run it with --text option and the result is same. Similar p-values are reported. 

fimo --verbosity 4 --bgfile Ssal.allpromoter.masked.fasta.markov1 --text xx8650 T2_SP.induced_UP.fa >x8650.txt

awk -F'\t' -v OFS='\t' '{print $1,$3,$8}' x8650.txt  >REST_M09650_TRANSFAC.pval
qvalue --header 1 --column 3 --append --pi-zero --pi-zero-file REST_M09650_TRANSFAC.pval.pi_0  REST_M09650_TRANSFAC.pval >REST_M09650_TRANSFAC.pval.qval

-------------------------------------------------------
Read 59 values from REST_M09650_TRANSFAC.pval.
Warning: Cannot estimate pi_0 accurately from fewer than 100 p-values.         total p-values = 59. Using pi_zero = 1.0.

$> head REST_M09650_TRANSFAC.pval.qval
# motif_id sequence_name p-value q-value
REST_M09650_TRANSFAC gene56427:106591944::NW_012347437.1:20808-22008(+) 2.99e-11 1.7641e-09
REST_M09650_TRANSFAC gene40373:106579160::ssa19:55991880-55993080(-) 1.42e-07 4.189e-06
REST_M09650_TRANSFAC gene57355:106592182::NW_012347787.1:17807-19007(-) 8.52e-07 1.6756e-05
REST_M09650_TRANSFAC gene4980:106590030::ssa02:55733108-55734308(-) 1.91e-06 2.3128e-05
REST_M09650_TRANSFAC gene43968:106582810::ssa22:11947443-11948643(-) 1.96e-06 2.3128e-05
REST_M09650_TRANSFAC gene13172:106606520::ssa06:12603464-12604664(-) 2.99e-06 2.94017e-05
REST_M09650_TRANSFAC gene5016:106590330::ssa02:56911576-56912776(+) 4.1e-06 3.38922e-05
REST_M09650_TRANSFAC gene27443:106566391::ssa13:5831858-5833058(+) 5.06e-06 3.38922e-05
REST_M09650_TRANSFAC gene14716:106607921::ssa06:62116024-62117224(+) 5.17e-06 3.38922e-05

Thank you for the help.

Teshome

CharlesEGrant

unread,
Jan 30, 2018, 10:45:46 PM1/30/18
to meme-...@googlegroups.com
You didn't adjust the p-value threshold, so it's still using the default of 0.0001. You need to set that to 1.0 so that all the p-values are reported. Use the --thresh option, as in

fimo --thresh 1.0

Teshome Mulugeta

unread,
Jan 31, 2018, 3:06:02 AM1/31/18
to MEME Suite Q&A
I am sorry, I missed that. The q-values between FIMO and qvalue are almost identical  (p-value=2.99e-11 q-value=9.37425e-06)

$> qvalue --header 1 --column 3 --append --pi-zero --pi-zero-file REST_M09650_TRANSFAC.pval.pi_0  REST_M09650_TRANSFAC.pval >REST_M09650_TRANSFAC.pval.qval

Read 315036 values from REST_M09650_TRANSFAC.pval.

Estimating pi_0 from all 315036 observed p-values.

Estimating pi_0.

Estimated pi_0=0.996425



Now, the problem is dealing with the millions of rows . I need to do sampling. I did it with R but i lost the lowest p-value and q-values that reported by FIMO


library(data.table)

pval <- fread("REST_M09650_TRANSFAC.pval", sep='\t')

pval.10000 <- pval[sample(dim(pval)[1],10000),]


 summary(pval[sample(dim(pval)[1],10000),])
  # motif_id        sequence_name         p-value        
 Length:10000       Length:10000       Min.   :0.000103  
 Class :character   Class :character   1st Qu.:0.248000  
 Mode  :character   Mode  :character   Median :0.512000  
                                       Mean   :0.507579  
                                       3rd Qu.:0.768000  
                                       Max.   :1.000000
> summary(pval)
  # motif_id        sequence_name         p-value      
 Length:315036      Length:315036      Min.   :0.0000  
 Class :character   Class :character   1st Qu.:0.2470  
 Mode  :character   Mode  :character   Median :0.5070  
                                       Mean   :0.5041  
                                       3rd Qu.:0.7630  
                                       Max.   :1.0000

$> qvalue --header 1 --column 3 --append --pi-zero --pi-zero-file REST_M09650_TRANSFAC.pval.pi_0  REST_M09650_TRANSFAC.pval.10000 >REST_M09650_TRANSFAC.pval.qval.10000Read 10000 values from REST_M09650_TRANSFAC.pval.10000.
Estimating pi_0 from all 10000 observed p-values.
Estimating pi_0.
Estimated pi_0=0.996751


Thanks for the help.
Teshome

CharlesEGrant

unread,
Jan 31, 2018, 2:34:04 PM1/31/18
to MEME Suite Q&A
But why would you want to sample the results when you have all the p-values and q-values in hand? Don't you just want to filter out all the results below your confidence threshold?

Teshome Mulugeta

unread,
Feb 1, 2018, 3:05:50 AM2/1/18
to MEME Suite Q&A
I wanted the sampling to minimise the computation time. I thought FIMO did the sampling. I have 81000 genes and 13000 motifs. It took 25 minutes for one motif if i use all 92,305,182 observed p-values (this includes getting pvalue:5, extracting:6, running qvalue:9 and filtering:5). Imagine this will be done for 5 - 9 genomes. I am happy with FIMO result without q-value but i have to find a way to get FDR for each motif discovered. It doesn't have to be q-value. 
Reply all
Reply to author
Forward
0 new messages