chr1 3501424 3501433 + 10.9542 2.27e-06 1 CTATATATAG
So the software assigns the score 10.9542 to the sequence CTATATATAG, but I don't see how it arrives at this number. If we turn the frequency matrix into a log-odds matrix with the natural (base e) logarithm, the score should be 12.20125. If we use base-2 log-odds instead, the score should be 8.460273. Using the frequency matrix directly, the score should be 5.9168. By 'score' I mean the sum, over the positions of the sequence, of the matrix entry for the base observed at that position.
I was thinking maybe it adds a pseudo-count but even with pseudo-counts I don't see why the discrepancy would be so large.
As an additional question: my q-value is 1 for all of my entries. What does this mean?
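For readers following along, the 'score' described above can be sketched roughly as follows. The count matrix, the uniform background and the pseudocount scheme are made-up placeholders, not the motif from this thread and not necessarily what FIMO does internally. One thing the sketch does make concrete: changing the logarithm base only rescales every score by a constant factor (1/ln 2, about 1.44, going from natural log to log base 2).

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, background, pseudocount=0.1, base=2.0):
    """Turn per-position base counts into a log-odds matrix.

    counts: one dict per motif position, mapping base -> count.
    pseudocount: total pseudocount per column, split according to the
    background (an assumed scheme, chosen only for this illustration).
    """
    matrix = []
    for col in counts:
        total = sum(col[b] for b in BASES) + pseudocount
        matrix.append({
            b: math.log((col[b] + pseudocount * background[b]) / total
                        / background[b], base)
            for b in BASES
        })
    return matrix

def score(seq, matrix):
    """Sum the matrix entry of the observed base at each position."""
    return sum(col[b] for b, col in zip(seq, matrix))

# Hypothetical 3-position motif and uniform background, for illustration only.
counts = [
    {"A": 1, "C": 17, "G": 1, "T": 1},
    {"A": 2, "C": 1, "G": 1, "T": 16},
    {"A": 18, "C": 0, "G": 1, "T": 1},
]
background = {b: 0.25 for b in BASES}

bits = score("CTA", log_odds_matrix(counts, background, base=2.0))
nats = score("CTA", log_odds_matrix(counts, background, base=math.e))
print(bits, nats, bits / nats)  # the ratio is 1 / ln(2), about 1.4427
```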
Thank you for that. It clears up a few things for me. I have some (easy) follow-up questions, if you don't mind:
- So my PWM using log base 2 scores the sequence at 12.20, whereas FIMO scores it at 10.95. As you mentioned, the discrepancy is due to the scaling and conversion to integers. Is there a way to measure this discrepancy?
- Is there a way to see the estimated probability distribution? I also do this step manually (so I can compute the p-value). I fit a normal distribution on all possible scores of my PWM (which can be as high as 4^10).
- I have experimentally verified peaks on the transcription factor of interest. I am guessing that I can simply ignore the q-value here as I am not 'discovering' new sites. Basically, I can compare the results FIMO gives me with the experimentally found sites and conduct my sensitivity analysis and ROC curves.
- I realized that a perfect match will occur on average 2700 times in the genome. But this is very little, is it not? Compared to the size of the genome it is negligible. Furthermore, the results from FIMO show ~20,000 matches. If 2700 of them occur by chance, then the rest should really have a lower q-value, right? The q-value is 1 for ALL my matches, which is disheartening.
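On the q-value point, a minimal sketch may help show why every hit can end up at q = 1. It assumes a Benjamini-Hochberg style calculation in which each p-value is scaled by the total number of positions tested. The function name, the `n_tests` parameter and the toy p-values are all made up for illustration; this is not FIMO's code.

```python
def bh_qvalues(pvalues, n_tests=None):
    """Benjamini-Hochberg style q-values.

    n_tests: total number of positions scanned (hypothetical parameter);
    defaults to the number of p-values supplied.
    """
    n = n_tests if n_tests is not None else len(pvalues)
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    qvalues = [0.0] * len(pvalues)
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues

# Toy p-values, for illustration only.
toy_pvalues = [1e-6, 2e-6, 1e-4]

# Corrected only against each other, the three hits look significant.
q_small = bh_qvalues(toy_pvalues)

# Corrected against ten million scanned positions, every hit is capped at 1.
q_large = bh_qvalues(toy_pvalues, n_tests=10_000_000)
print(q_small, q_large)
```

Under this assumption, a hit only gets q below 1 when its p-value times the number of tests is smaller than its rank among the hits, which is hard to achieve with a short motif scanned over a whole genome.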
No, there is no FIMO option that prints out that table. You could modify the FIMO code to do this if you like. I don't think a normal distribution is a very good estimate of the score distribution. By your own calculation, the best possible match to the motif has a log-odds score on the order of 10, so I don't understand how a distribution could produce a score as high as 4^10.
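As an alternative to fitting a normal distribution, the exact score distribution of a small PWM can be built by dynamic programming once the scores are rounded to integers. The sketch below shows the general idea only; the toy matrix, the `scale` value and the function names are assumptions for illustration, and the rounding scheme is not claimed to match FIMO's.

```python
from collections import defaultdict

def discretize(matrix, scale=100):
    """Round each log-odds entry to an integer number of 1/scale units."""
    return [{b: round(v * scale) for b, v in col.items()} for col in matrix]

def score_distribution(int_matrix, background):
    """Return {integer score: probability} under an i.i.d. background."""
    dist = {0: 1.0}
    for col in int_matrix:
        nxt = defaultdict(float)
        for s, p in dist.items():
            for b, w in col.items():
                nxt[s + w] += p * background[b]
        dist = dict(nxt)
    return dist

def pvalue(dist, threshold):
    """Probability of a score at least as large as threshold."""
    return sum(p for s, p in dist.items() if s >= threshold)

# Toy 2-position log-odds matrix and uniform background, illustration only.
toy = [
    {"A": 1.5, "C": -0.7, "G": -0.7, "T": -0.7},
    {"A": -0.7, "C": -0.7, "G": -0.7, "T": 1.5},
]
background = {b: 0.25 for b in "ACGT"}

scale = 100
dist = score_distribution(discretize(toy, scale), background)
best = max(dist)
print(best / scale, pvalue(dist, best))  # best score 3.0, p-value 1/16
```

With this rounding, each matrix entry is off by at most 0.5/scale, so the score of a sequence of width w differs from the unrounded score by at most w * 0.5 / scale. That gives one way to bound the discrepancy introduced by the integer conversion in this sketch.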
