FIMO matching outside the provided matrix

Teshome Mulugeta

Dec 1, 2017, 5:52:38 AM
to MEME Suite Q&A
Hi,

My motif:

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies (from uniform background):
A 0.25000 C 0.25000 G 0.25000 T 0.25000

MOTIF GGAAAWDHMC

letter-probability matrix: alength= 4 w= 10 nsites= 20 E= 0
  0.000000   0.000000   1.000000   0.000000
  0.000000   0.000000   1.000000   0.000000
  1.000000   0.000000   0.000000   0.000000
  1.000000   0.000000   0.000000   0.000000
  1.000000   0.000000   0.000000   0.000000
  0.500000   0.000000   0.000000   0.500000
  0.333333   0.000000   0.333333   0.333333
  0.333333   0.333333   0.000000   0.333333
  0.500000   0.500000   0.000000   0.000000
  0.000000   1.000000   0.000000   0.000000


My sequence:

>gene35158:106574031
CTGTTTGTCACAGAAACATCAAACTAGGTATGATTTATTCTATAAATGTCTAGCATACCTTTACAACTTATTTTGAGGAAAAAATCCAGTTTTGATTTGATAGAGGGCATGTAGTACGTCTAGAGGGCGTGTGGTAAAAGTTCTAACGAGTCTGTTATACCCTATTAGAAGGGGCCTGGCAGAAAATTCTGCACCTTCAGTCATAATAAATTAGATTACAGGAAGAAACGAAGGGACGAAGGGGTCAGGCCTGACTAACAAAGAAGACCGGTGAATGGATCAAGAGGACCAAGATAACCAAGAGGATCAACATGGACATTCCACGGCGGAGATAAGGAAAACTACGAGTGAACCATAACATACATTTTTACTGTGTATATATATATAAAATAGCAGGGGATCAAATACTTGATCCCCTGCTATATATATATATATACACAGTAAAAATGTATGTTATGGTTCACTCGTAGTTTTCCTTATCTCCGCCGTGGAATGTCCATGTTGATCCTCTTGGTTATCTTGGTCCTCTTGATCCATTCACCGGTCTTCTTTGTTAGTCAGGCCTGACCCCTTCGTCCCTTCGTTTCTTCCTGTAATCTAATTTATTATGACTGAAGGTGCAGAATTTTCTGCCAGGCCCCTTCTAATAGGGTATAACAGACTCGTTAGAACTTTTACCACACGCCCTCTAGACGTACTACATGCCCTCTATCAAATCAAAACTGGATTTTTTCCTCAAAATAAGTTGTAAAGGTATGCTAGACATTTATAGAATAAATCATACCTAGTTTGATGTTTCTGTGACAAACAGTGAATTAGTTATTGATTTATACCGCTTTAAAATGGCATTTTATAGATTTTATGTATTTCTATCGTCAAAACTGGCATTTTTCCTCAAAACAAAGGTTGTAAAAGTATGCTAGACATTTATAGAATAAATCATACATCGTTTGATGTTTCTATGACAAACAGTGAATTAGTTATTAATTTATACTGCTTTAAATGGCATTTTATAGATTTTAGGTATTTCTATCAAGTCAAAACTGGAATTTCTACTCAAAACAAAGGTTGTAAAAGTATGCTAGACATTTATAGAATACATCATATCTAGCTTGATGATTCTGTGACAACCAACTTTACCTCGGTAGCTAACGATACCTAGGACACCCGTAGACAGTGTACTTCGCCCACGCCTGTCGG


Fimo result:
fimo ore.meme test.fa

#pattern name sequence name start stop strand score p-value q-value matched sequence
GGAAAWDHMC gene35158:106574031 336 345 + 6.31633 6.82e-05 0.0772 GGAAAACTAC
GGAAAWDHMC gene35158:106574031 467 476 - 6.31633 6.82e-05 0.0772 GGAAAACTAC


My question:

Where does the C at position 7 of the matched sequence come from? Row 7 of the motif matrix only allows [AGT].

Best,
Teshome

CharlesEGrant

Dec 1, 2017, 1:18:15 PM
to MEME Suite Q&A
The FIMO match score is the sum of the match scores over all the positions of the motif. Even if one column has a matrix entry of zero for the observed letter, the sum of scores over all the columns may still be significant. The FIMO scores are not calibrated: a score of 1 may not be statistically significant for one motif, but highly significant for another. Therefore the raw FIMO score is not directly interpretable. You should be looking at the p-values or, preferably, the q-values of the matches. The matches you report here have q-values of 0.077, so they are not statistically significant even at the relaxed threshold of 0.05.

This posting may be helpful:

Teshome Mulugeta

Dec 1, 2017, 2:31:59 PM
to MEME Suite Q&A
Thank you for the clarification. The example I gave is just a sample, but will FIMO report similar results at a lower q-value, for example 0.001? If yes, what is the best way to get, or filter for, the exact matches to the PSSM? Will a stringent q-value help? In my experiment, I want exact matches as in the PSSM.

CharlesEGrant

Dec 1, 2017, 3:28:27 PM
to MEME Suite Q&A
> i want exact matches as in the PSSM.

Asking for an exact match to a PSSM is not well defined. You can ask for the best possible match to a PSSM, but the whole point of a PSSM is to allow for non-exact matches! 

Maybe what you actually want is an exact match to a regular expression. FIMO doesn't provide this. The command line version of the MEME Suite does contain a utility that might work for you: fasta-grep. Note though that fasta-grep doesn't provide p-values or q-values. It simply reports the locations of exact matches to the provided regular expression.
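
To illustrate what exact matching against a regular expression means here, the following is a minimal Python sketch. It is not fasta-grep and does not use any MEME Suite code. It assumes the motif name GGAAAWDHMC can be read as an IUPAC consensus, and the helper names (`iupac_to_regex`, `revcomp`, `exact_matches`) are made up for this example. Like fasta-grep, it reports locations only, with no p-values or q-values.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[c] for c in consensus.upper())


def revcomp(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def exact_matches(consensus, seq):
    """Yield (start, stop, strand, site) for every exact match.

    Coordinates are 1-based, inclusive, and always given on the + strand.
    The lookahead lets overlapping sites be reported as well.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))")
    seq = seq.upper()
    n, w = len(seq), len(consensus)
    for m in pattern.finditer(seq):
        yield (m.start() + 1, m.start() + w, "+", m.group(1))
    for m in pattern.finditer(revcomp(seq)):
        start = n - (m.start() + w) + 1  # map back to + strand coordinates
        yield (start, start + w - 1, "-", m.group(1))


if __name__ == "__main__":
    # The site from the question has C where only [AGT] is allowed: no match.
    print(list(exact_matches("GGAAAWDHMC", "TTGGAAAACTACTT")))
    # Changing that C to G gives an exact match.
    print(list(exact_matches("GGAAAWDHMC", "TTGGAAAAGTACTT")))
```

With this kind of matching, GGAAAACTAC is rejected outright because C is not in the class [AGT] at position 7.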

Teshome Mulugeta

Dec 1, 2017, 10:29:34 PM
to MEME Suite Q&A
Hi,

I agree that the whole point of a PSSM is to allow for non-exact matches, but not matches that include nucleotides that are not represented at all. In my example, I don't want to get the C at position 7 in GGAAAACTAC. At that position, only [AGT] are represented, each with a probability of 0.333 in the PSSM; C is not represented. This is useful if someone is interested in seeing a mutation in the motif. In my case, if there is no [AGT] match at that position, then the site shouldn't be reported as a match, or it should be reported with a gap (GGAAAA-TAC) or with extra information so that it is easier to filter. We need FIMO because it finds individual occurrences of a motif. If it is reporting variants of a motif outside the PSSM, then it is searching for de novo motifs. By the way, fasta-grep is not an option here, as it does not consider the background of the sequence and does not do any statistical test. I am probably lost here, but I need to understand why FIMO is adding an unrepresented nucleotide at a particular position, and whether there is a way to filter such matches from the result.


CharlesEGrant

Dec 2, 2017, 12:48:47 AM
to meme-...@googlegroups.com
> This is useful if someone is interested to see a mutation in the motif. In my case, if there is no [AGT] match at that position then this shouldn't be reported as a match or reported with a gap GGAAAA-TAC or with extra information so that it is easier to filter it.

While that may be useful for you, that's simply not a facility FIMO provides. 

FIMO evaluates motif matches by computing a log-likelihood score at each position in your sequence database. Since the log of 0 is not defined, probabilities of 0 in the PSSM are computationally awkward. FIMO gets around this by adding pseudo-counts to the provided PSSM. By default, the pseudo-count is 0.1. You can adjust the pseudo-count on the FIMO command line using the --motif-pseudo option. You might try setting this to a very small value. However, if you set it to 0 and try to use a PSSM that has entries containing 0, FIMO will simply end up halting with floating point exceptions. While the immediate motivation here is the computational awkwardness of taking the log of 0, this also reflects the biological reality that even if none of your experiments has ever observed a particular nucleotide at a particular position in a motif, this is simply a limit of your observations, not a biological law.
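
As a rough illustration of why the C at position 7 still gets a finite score, here is a small Python sketch of pseudo-count smoothing followed by log-odds scoring. This is not FIMO's code. It assumes the pseudo-count is weighted by the background frequency and combined with the motif's nsites value as (p * nsites + pseudo * bg) / (nsites + pseudo), and that scores are log base 2 odds against the uniform background. FIMO's internal scaling and rounding are not modelled, so the total need not reproduce the reported 6.31633 exactly. The function names are invented for this example.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
NSITES = 20  # from the "nsites= 20" field of the motif in the question

# Letter-probability matrix of GGAAAWDHMC, columns in A C G T order.
MATRIX = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.333333, 0.0, 0.333333, 0.333333],
    [0.333333, 0.333333, 0.0, 0.333333],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]


def smoothed(p, letter, pseudo, nsites=NSITES):
    """Assumed smoothing rule: blend the motif probability with the background."""
    return (p * nsites + pseudo * BACKGROUND[letter]) / (nsites + pseudo)


def column_scores(site, pseudo=0.1):
    """Log2-odds contribution of each position of `site` against the background."""
    scores = []
    for row, letter in zip(MATRIX, site.upper()):
        p = smoothed(row[ALPHABET.index(letter)], letter, pseudo)
        # math.log2(0.0) raises ValueError, the Python analogue of the
        # undefined log of 0 that the pseudo-count is there to avoid.
        scores.append(math.log2(p / BACKGROUND[letter]))
    return scores


if __name__ == "__main__":
    per_column = column_scores("GGAAAACTAC")
    for position, score in enumerate(per_column, start=1):
        print(position, round(score, 3))
    print("total", round(sum(per_column), 3))
```

Under these assumptions the C at position 7 contributes a large negative but finite score (roughly -7.65), and the nine well-matched positions outweigh it, so the total stays positive. Calling `column_scores("GGAAAACTAC", pseudo=0.0)` raises a ValueError instead.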

It sounds like you need an idiosyncratic scoring system that applies an infinite penalty to matches containing nucleotides not represented in the column of the PSSM. FIMO simply doesn't provide this. 
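
One possible workaround, done outside FIMO, is to post-filter its output: keep FIMO's p-values and q-values, but discard any hit whose matched sequence uses a letter with probability 0 in the original matrix. The Python sketch below is only an illustration of that idea. It assumes whitespace-separated output lines with the matched sequence as the last field, as in the result posted above, and that the matched sequence is reported in the motif's orientation for both strands (which the two posted hits suggest). The function names are hypothetical.

```python
ALPHABET = "ACGT"

# Letter-probability matrix of GGAAAWDHMC, columns in A C G T order.
MATRIX = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.333333, 0.0, 0.333333, 0.333333],
    [0.333333, 0.333333, 0.0, 0.333333],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]


def within_matrix(site, matrix=MATRIX):
    """True if every letter of `site` has non-zero probability in its column."""
    site = site.upper()
    if len(site) != len(matrix):
        return False
    return all(
        letter in ALPHABET and row[ALPHABET.index(letter)] > 0
        for row, letter in zip(matrix, site)
    )


def filter_hits(lines):
    """Keep only result lines whose last field passes `within_matrix`."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        if within_matrix(line.split()[-1]):
            kept.append(line)
    return kept


if __name__ == "__main__":
    hits = [
        "#pattern name sequence name start stop strand score p-value q-value matched sequence",
        "GGAAAWDHMC gene35158:106574031 336 345 + 6.31633 6.82e-05 0.0772 GGAAAACTAC",
        "GGAAAWDHMC gene35158:106574031 467 476 - 6.31633 6.82e-05 0.0772 GGAAAACTAC",
    ]
    print(filter_hits(hits))  # both hits are dropped because of the C at position 7
```

Note that the q-values FIMO reports were computed over all hits, including the discarded ones, so after filtering they are best treated as conservative rather than recalibrated.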