Only find motifs present in every input sequence


Michael Milton

Jun 11, 2021, 2:08:55 AM
to MEME Suite Q&A
I have a series of enzyme sequences that I know all share a certain motif. I'm interested in finding out whether there is a longer version of this motif, or perhaps other new motifs, in these enzymes, but the key thing is that any motif we discover must be present in every single input sequence. I'm currently using MEME, which runs fine and finds a reasonable-looking motif, but I gather from the E-value that it isn't in every single sequence. MEME doesn't seem to have a flag for this, but is there some kind of E-value hack I can use to force this condition? Thanks.

Michael Milton

Jun 11, 2021, 3:03:04 AM
to MEME Suite Q&A
To be more specific, I was hoping to get MEME to output a regular expression that matches every sequence in the dataset, and not just a matrix that returns a high probability for every sequence. The regex output by MEME seems to only capture the consensus motif, which doesn't actually match all input sequences, at least with my settings.

cegrant

Jun 16, 2021, 11:56:17 PM
to MEME Suite Q&A
Note that the point of tools like MEME is that different instances of a motif are typically not exact matches, and you want to find the underlying abstract motif from a collection of candidate motif sites that may vary considerably among themselves. The E-value in itself doesn't tell you how many sequences the motif was found in. It is a measure of the overall statistical significance of the motif. A highly significant E-value might be due to nearly identical instances in a few sequences or somewhat similar instances in all the sequences. You should be looking at the site count, which is in the column next to the E-value.

You can't force MEME to find exactly matching sites in each sequence, but you can force it to find a statistically significant matching site in each sequence. You need to use the OOPS (one occurrence per sequence) model for the site distribution. If you are using the command line tools, this is done with the '-mod oops' command line option. If you are using the public web application, under "Select the site distribution" you'd choose "One occurrence per sequence".
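
For command line users, here is a minimal sketch of assembling such a call from Python. The file and directory names are hypothetical placeholders, and the helper `build_meme_command` is made up for illustration. `-protein`, `-mod`, `-nmotifs` and `-oc` are MEME options, but it is worth checking `meme -h` for your installed version before relying on them.

```python
import shlex


def build_meme_command(fasta_path, out_dir, model="oops", nmotifs=3):
    """Assemble a MEME command line for protein sequences.

    fasta_path and out_dir are hypothetical placeholders; model is the
    site distribution model passed to -mod.
    """
    if model not in {"oops", "zoops", "anr"}:
        raise ValueError(f"unknown site distribution model: {model}")
    return [
        "meme", fasta_path,
        "-protein",               # input sequences are proteins (enzymes)
        "-mod", model,            # oops = one occurrence per sequence
        "-nmotifs", str(nmotifs),
        "-oc", out_dir,           # output directory, overwritten if present
    ]


cmd = build_meme_command("enzymes.fasta", "meme_oops_out")
print(shlex.join(cmd))
# To actually run it (requires the MEME Suite to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

With OOPS, the site count reported for each motif should equal the number of input sequences.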

Michael Milton

Jun 27, 2021, 10:18:23 PM
to MEME Suite Q&A
Many thanks for this answer! The oops setting was broadly what I was looking for. It's a pity it's not possible to force MEME to find 100% conserved positions, but this still gives me something very close to what I want.
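
One possible workaround, sketched here under assumptions rather than as anything MEME does itself: once OOPS has reported one site per sequence, a regex built column by column from those aligned sites matches every site by construction, and the columns with a single residue are the 100% conserved positions. The sites below are invented for illustration, and `regex_from_sites` and `conserved_columns` are hypothetical helper names.

```python
import re


def regex_from_sites(sites):
    """Build a regex with one character class per column of the aligned sites."""
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same width")
    parts = []
    for column in zip(*sites):
        residues = sorted(set(column))
        if len(residues) == 1:
            # Fully conserved column: emit the residue itself.
            parts.append(re.escape(residues[0]))
        else:
            # Variable column: allow every residue observed at this position.
            parts.append("[" + "".join(re.escape(r) for r in residues) + "]")
    return "".join(parts)


def conserved_columns(sites):
    """Return 0-based indices of columns where all sites share one residue."""
    return [i for i, column in enumerate(zip(*sites)) if len(set(column)) == 1]


# Hypothetical sites, one per input sequence (as an OOPS run would report).
sites = ["GDSLS", "GDSIT", "GDSLT"]
pattern = regex_from_sites(sites)
print(pattern)                   # GDS[IL][ST]
print(conserved_columns(sites))  # [0, 1, 2]
```

The trade-off is that the more variable the sites are, the more permissive the regex becomes, so it may also match unrelated positions.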

Michael Milton

Jun 28, 2021, 12:34:00 AM
to MEME Suite Q&A
Also, is there now any way to find a motif that is present "two or more times" (or ideally "n or more times") per sequence, rather than OOPS?

cegrant

Jun 30, 2021, 9:25:30 PM
to MEME Suite Q&A
> It's a pity it's not possible to force MEME to find 100% conserved positions, but this still gives me something very close to what I want.

Finding exact matches to a given regular expression is pretty simple in any of the popular scripting languages. In the MEME Suite command line tools we include fasta-grep for this.
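
As an illustration of the scripting route (not of how fasta-grep itself works), a minimal Python sketch that reports which sequences a given regular expression fails to match might look like this. The FASTA text and the pattern are invented for the example, and `read_fasta` and `sequences_missing` are hypothetical helper names.

```python
import re


def read_fasta(lines):
    """Parse FASTA lines into a {name: sequence} dict (name = first header word)."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else f"unnamed_{len(records)}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def sequences_missing(pattern, records):
    """Return the names of sequences with no match to the pattern."""
    regex = re.compile(pattern)
    return [name for name, seq in records.items() if regex.search(seq) is None]


fasta_text = """>seq1 example
MKTGDSLSAA
>seq2
AAGDSITKK
>seq3
MMGESLTQQ
"""
records = read_fasta(fasta_text.splitlines())
print(sequences_missing(r"GDS[IL][ST]", records))  # ['seq3']
```

An empty list means the pattern matches every input sequence.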

> Also, is there now any way to find a motif that is present "two or more times" (or ideally "n or more times") per sequence, rather than OOPS?

MEME has three models for the number of motif occurrences in a sequence database: OOPS (one occurrence per sequence), ZOOPS (zero or one occurrence per sequence), and ANR (any number of repetitions per sequence).
However, that doesn't seem to fit what you are asking for. If you are limiting yourself to exact matches of a regular expression it would be pretty easy to implement in a script.
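
A sketch of such a script, assuming the regular expression is already known and the sequences are held in a dict: it counts matches per sequence and lists the sequences that fall below a minimum count. The sequences, the pattern, and the function names are hypothetical. Overlapping matches are counted by wrapping the pattern in a zero-width lookahead.

```python
import re


def count_occurrences(pattern, sequence, overlapping=True):
    """Count matches of pattern in sequence, optionally counting overlaps."""
    if overlapping:
        # A lookahead consumes no characters, so a match is found at every
        # start position where the pattern fits, even if matches overlap.
        regex = re.compile(f"(?=(?:{pattern}))")
    else:
        regex = re.compile(pattern)
    return sum(1 for _ in regex.finditer(sequence))


def sequences_below_threshold(pattern, records, min_count=2):
    """Return {name: count} for sequences with fewer than min_count matches."""
    counts = {name: count_occurrences(pattern, seq) for name, seq in records.items()}
    return {name: c for name, c in counts.items() if c < min_count}


# Hypothetical sequences for illustration.
records = {
    "seq1": "AAGDSLTAAGDSITAA",
    "seq2": "GDSLSGDSLS",
    "seq3": "MMGDSLTQQ",
}
print(sequences_below_threshold(r"GDS[IL][ST]", records, min_count=2))
# {'seq3': 1}
```

An empty result means the pattern occurs at least `min_count` times in every sequence.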

Michael Milton

Jun 30, 2021, 10:21:24 PM
to MEME Suite Q&A
> Finding exact matches to a given regular expression is pretty simple in any of the popular scripting languages. In the MEME Suite command line tools we include fasta-grep for this.

It's trivial to find known motifs using a regex, but what I want MEME to do is discover a regex that I can use. It does produce a regex as discussed, but that isn't guaranteed to actually match all the sequences, so isn't sufficient for what I want.


> However, that doesn't seem to fit what you are asking for. If you are limiting yourself to exact matches of a regular expression it would be pretty easy to implement in a script.

Can you explain how I would do this in a script? I want MEME to discover a motif that is present 2 or more times in each sequence. If I use ANR it will output one motif, and then I can verify whether it occurs twice in every sequence, but if it doesn't match my criteria I'll have to repeat this process until I find one that does (which might be never). It would be ideal if the objective function of the algorithm could look for a motif with that criterion in mind.

cegrant

Jul 1, 2021, 12:51:53 AM
to MEME Suite Q&A
Sorry, I was confused as to whether you were doing motif discovery or motif search.

Given your particular requirements, I'm not sure MEME is the tool you want. Internally, MEME is entirely devoted to finding the best PWM, and only comes up with a consensus regular expression based on the final PWM as a convenience. Since you are looking for 100% conservation, maybe one of the k-mer counting tools would be more appropriate? You might have to make multiple runs to find the largest common k-mer in your sequences, but at least it would automatically be enforcing the 100% conservation.
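
For a small set of sequences, the k-mer idea can be sketched directly in Python instead of using a dedicated k-mer counting tool. This is an illustrative sketch, not a recommendation of a specific tool. The function names and example sequences are hypothetical. The `min_count` parameter is an added assumption that also covers the "n or more times per sequence" case. Because every prefix of a shared k-mer is itself shared, the largest k can be found by binary search rather than by repeated manual runs.

```python
from collections import Counter


def kmers_with_min_count(sequence, k, min_count=1):
    """Return the k-mers that occur at least min_count times in sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}


def common_kmers(sequences, k, min_count=1):
    """Return the k-mers present at least min_count times in every sequence."""
    shared = None
    for seq in sequences:
        current = kmers_with_min_count(seq, k, min_count)
        shared = current if shared is None else shared & current
        if not shared:
            return set()
    return shared or set()


def longest_common_kmers(sequences, min_count=1):
    """Binary-search the largest k that still has a k-mer shared by all sequences."""
    if not sequences:
        return 0, set()
    lo, hi = 0, min(len(s) for s in sequences)
    best = set()
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = common_kmers(sequences, mid, min_count)
        if found:
            lo, best = mid, found   # a shared k-mer exists, try longer
        else:
            hi = mid - 1            # nothing shared at this length, go shorter
    return lo, best


# Hypothetical sequences for illustration.
print(longest_common_kmers(["MKTGDSLSAA", "AAGDSLTKK", "QQGDSLWW"]))
# (4, {'GDSL'})
print(longest_common_kmers(["GDSLAAGDSL", "TTGDSLGDSLT"], min_count=2))
# (4, {'GDSL'})
```

This only finds exact, ungapped matches, so positions that vary between sequences will split a motif into shorter shared k-mers.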
