How exactly does FIMO calculate the score value and corresponding pvalue?

affan akhter

May 21, 2015, 3:38:08 PM
to meme-...@googlegroups.com
Hello,

I have a question about how FIMO generates its scores. Consider an experiment run here. From the text file output, the first line is

chr1 3501424 3501433 + 10.9542 2.27e-06 1 CTATATATAG

So the software assigns the score 10.9542 to the sequence CTATATATAG, but I don't see how it gets this number. If we turn the frequency matrix into a log-odds matrix using the natural log (base e), the score should be 12.20125. If we use base-2 log-odds instead, the score should be 8.460273. If we just use the frequency matrix, the score should be 5.9168. By 'score' I mean the sum, over the positions of the sequence, of the matrix entry for the base observed at each position.

I was thinking maybe it adds a pseudo-count but even with pseudo-counts I don't see why the discrepancy would be so large.

As an additional question, my q-value is 1 for all of my entries. What does this mean?

cegrant

May 21, 2015, 6:02:58 PM
to meme-...@googlegroups.com
FIMO computes the log-odds scores using log base 2. However, in order to estimate the corresponding p-values, the raw scores are scaled and converted to integers before they are stored internally. A dynamic programming algorithm is then used to estimate the probability distribution of all possible match scores to the motif. Using that PDF, the p-value corresponding to a score can be calculated. When reporting match scores, FIMO inverts the scaling, but because of the integer conversion step the reported scores are only close to the original raw scores and may not match them exactly.
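The steps described above can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not FIMO's actual code: the 3-column motif, the uniform background, the scale factor of 100, and all function names are assumptions made up for the example, and FIMO's own scaling and rounding may differ in detail.

```python
import math

BASES = "ACGT"

def log_odds(freqs, background):
    """Convert per-column base frequencies to base-2 log-odds scores."""
    return [{b: math.log2(col[b] / background[b]) for b in BASES} for col in freqs]

def to_int_scores(pwm, scale):
    """Scale each log-odds entry and round it to an integer."""
    return [{b: round(s * scale) for b, s in col.items()} for col in pwm]

def score_distribution(int_pwm, background):
    """Dynamic programming over columns: P(total integer score) for a random site."""
    dist = {0: 1.0}
    for col in int_pwm:
        nxt = {}
        for total, prob in dist.items():
            for b in BASES:
                key = total + col[b]
                nxt[key] = nxt.get(key, 0.0) + prob * background[b]
        dist = nxt
    return dist

def p_value(dist, int_score):
    """Probability of a score at least this large under the background model."""
    return sum(p for s, p in dist.items() if s >= int_score)

# Hypothetical toy motif; none of these numbers come from the thread.
background = {b: 0.25 for b in BASES}
freqs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
scale = 100  # arbitrary scale factor chosen for the example

pwm = log_odds(freqs, background)
int_pwm = to_int_scores(pwm, scale)
dist = score_distribution(int_pwm, background)

site = "ATC"
raw = sum(col[b] for col, b in zip(pwm, site))            # unrounded log-odds score
int_score = sum(col[b] for col, b in zip(int_pwm, site))  # score as stored internally
reported = int_score / scale                              # scaling inverted after rounding
print(raw, reported, p_value(dist, int_score))
```

In this toy case the reported score differs slightly from the raw one because each column entry is rounded before the sum is taken, which is the kind of small discrepancy described above.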

The raw log-odds scores are of limited utility. An excellent match to a short motif may have the same score as a poor match to a wide motif. This is why the FIMO HTML output only reports p-values and q-values for matches. The q-value is explained here: "What should I use as a threshold of significance for q-value?" A q-value of 1.0 indicates that almost all matches at that level of significance are due simply to chance. The probability of a perfect match to your motif occurring by chance is (0.25)^10, or about 9e-7. That may not seem very likely, but you are scanning the entire mouse genome, which contains about 3e9 bp. That means you'd expect to get ~2700 perfect matches to your motif entirely by chance! Biologically meaningful matches will be swamped by the chance matches.
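The back-of-the-envelope estimate above can be reproduced as follows. This is only a rough sketch: it assumes equal base frequencies of 0.25, a single fixed 10-mer, one strand, and independent positions, and the function name is made up for the example.

```python
def expected_chance_matches(motif_width, genome_size, base_prob=0.25):
    """Expected number of exact matches to one fixed sequence on one strand."""
    per_site_prob = base_prob ** motif_width
    return per_site_prob * genome_size

# 10 bp motif scanned against a genome of about 3e9 bp.
print(expected_chance_matches(10, 3e9))
```

Without rounding the per-site probability to 9e-7, the estimate comes out a little above the ~2700 quoted above (roughly 2,860), which does not change the conclusion.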

affan akhter

May 22, 2015, 3:19:23 PM
to meme-...@googlegroups.com
Thank you for that. It clears up a few things for me. I have some (easy) follow-up questions, if you don't mind:


  • So my PWM using log base 2 scores the sequence at 12.20, whereas FIMO scores it at 10.95. As you mentioned, the discrepancy is due to the scaling and conversion to integers. Is there a way to measure this discrepancy?

  • Is there a way to see the estimated probability distribution? I do this step manually also (so I can compute the p-value). I fit a normal distribution on all possible scores of my PWM (which can be as high as 4^10).

  • I have experimentally verified peaks on the transcription factor of interest. I am guessing that I can simply ignore the q-value here as I am not 'discovering' new sites. Basically, I can compare the results FIMO gives me with the experimentally found sites and conduct my sensitivity analysis and ROC curves.

  • I realize that a perfect match will occur on average 2700 times in the genome. But this is very little, is it not? Compared to the size of the genome this is negligible. Furthermore, the results from FIMO show ~20,000 matches. If 2700 of them occur by chance, then the rest should really have a lower q-value, right? The q-value is 1 for ALL my matches, which is disheartening.

cegrant

May 22, 2015, 4:46:11 PM
to meme-...@googlegroups.com


On Friday, May 22, 2015 at 12:19:23 PM UTC-7, affan akhter wrote:
Thank you for that. It clears up a few things for me. I have some (easy) follow-up questions, if you don't mind:


  • So my PWM using log base 2 scores the sequence at 12.20, whereas FIMO scores it at 10.95. As you mentioned, the discrepancy is due to the scaling and conversion to integers. Is there a way to measure this discrepancy?
You could look at the FIMO source code and perform the same scaling and rounding calculations.

  • Is there a way to see the estimated probability distribution? I do this step manually also (so I can compute the p-value). I fit a normal distribution on all possible scores of my PWM (which can be as high as 4^10).

No, there is no FIMO option that prints out that table. You could modify the FIMO code to do this if you like. I don't think a normal distribution is a very good estimate of the score distribution. By your own calculation, the best possible match to the motif has a log-odds score on the order of 10, so I don't understand how a distribution could produce a score as high as 4^10.
 
  • I have experimentally verified peaks on the transcription factor of interest. I am guessing that I can simply ignore the q-value here as I am not 'discovering' new sites. Basically, I can compare the results FIMO gives me with the experimentally found sites and conduct my sensitivity analysis and ROC curves.

If you have a biological gold standard that gives you the ground truth, then you are all set, but given the q-values FIMO is reporting I wouldn't expect FIMO to perform very well in the sensitivity analysis. I'd expect most of FIMO's reported matches to be false positives.

 
  • I realize that a perfect match will occur on average 2700 times in the genome. But this is very little, is it not? Compared to the size of the genome this is negligible. Furthermore, the results from FIMO show ~20,000 matches. If 2700 of them occur by chance, then the rest should really have a lower q-value, right? The q-value is 1 for ALL my matches, which is disheartening.
2700 is a rough estimate of the number of perfect matches that are due to chance. If you start considering lower-quality matches, the number attributable to chance goes up very quickly. Your FIMO results only contain about 4000 perfect matches. That means well over half of your perfect matches are completely accounted for by chance. It only gets worse for the matches that are less than perfect. FIMO is just telling you that it can't assign much statistical confidence as to whether any given match is more than a chance occurrence.
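The "well over half" figure is just the ratio of expected chance matches to observed matches. The snippet below works through that arithmetic with the numbers quoted above; it is a crude illustration with a made-up function name, and it is not how FIMO computes q-values.

```python
def chance_fraction(expected_by_chance, observed):
    """Rough share of observed matches attributable to chance (capped at 1)."""
    return min(1.0, expected_by_chance / observed)

# About 2700 perfect matches expected by chance, about 4000 observed.
print(chance_fraction(2700, 4000))
```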

affan akhter

May 22, 2015, 5:21:44 PM
to meme-...@googlegroups.com
No, there is no FIMO option that prints out that table. You could modify the FIMO code to do this if you like. I don't think a normal distribution is a very good estimate of the score distribution. By your own calculation, the best possible match to the motif has a log-odds score on the order of 10, so I don't understand how a distribution could produce a score as high as 4^10.

Sorry, what I meant was that the total number of possible scores is 4^10 = 1048576. With my PWM, the minimum score is -33.543 and the maximum score is 12.20125. The diagram below shows my scores in a histogram (blue) with a fitted normal distribution (red).


So I can compute the significance of a score based on the CDF of this (which requires some modification and approaches a Gumbel distribution).
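For a short motif, the normal fit can be checked against the exact tail by enumerating every possible sequence. The sketch below does this for a hypothetical 4-column PWM under a uniform background; the matrix values and function names are invented for the example and are not taken from my actual PWM.

```python
import itertools
from statistics import NormalDist, mean, pstdev

BASES = "ACGT"

def all_scores(pwm):
    """Score every possible sequence of the motif's width (4**width of them)."""
    return [
        sum(col[b] for col, b in zip(pwm, kmer))
        for kmer in itertools.product(BASES, repeat=len(pwm))
    ]

def empirical_p(scores, threshold):
    """Exact fraction of sequences scoring at least the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def normal_p(scores, threshold):
    """Upper-tail probability from a normal distribution fitted to the scores."""
    fit = NormalDist(mean(scores), pstdev(scores))
    return 1.0 - fit.cdf(threshold)

# Hypothetical log-odds matrix, one dict per motif position.
pwm = [
    {"A": 1.5, "C": -1.3, "G": -1.3, "T": -1.3},
    {"A": -1.3, "C": -1.3, "G": -1.3, "T": 1.5},
    {"A": -2.0, "C": 1.2, "G": 0.3, "T": -2.0},
    {"A": 0.8, "C": -0.5, "G": 0.8, "T": -3.0},
]

scores = all_scores(pwm)
best = max(scores)
print(empirical_p(scores, best), normal_p(scores, best))
```

Comparing the two printed values at a few thresholds shows how far the normal approximation drifts from the exact tail, which matters most at the high-scoring end where p-values are small.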
 
As you mentioned, I do have my biological gold standard. I am expecting a high number of false positives. My Master's thesis is to modify the scoring function of the PWM by incorporating epigenetic priors and hopefully reduce the number of false positives.

As an additional control, I may compute the score under my PWM that corresponds to a p-value < 1e-5. Using this score as the cutoff, I will simply call the matchPWM() function in R to scan the genome and see how many results I get.

Thank you for your help.