Dreme masking and sequence length


Raymond Wan

Apr 14, 2016, 2:35:45 AM
to MEME Suite Q&A

Dear all,

I have a couple of general questions about Dreme (command-line version).

In the tutorial, it says that sequences of up to 100 bp in length are suggested.  If efficiency is not a factor (i.e., I don't mind waiting), is there a problem with using input sequences that are much longer than this -- for example, 1000 bp or longer?  I mean, is there a problem with the correctness of the result?

Second, can Dreme handle masked sequences (i.e., by ignoring them)?  By this, I mean A, C, G, and T in lower case.  Or should I replace them with N instead, which is what step 8 in the tutorial says that Dreme does internally?

Thank you!

Ray

PS:  I'm asking for the Meme 4.11.1 distribution.



CharlesEGrant

Apr 14, 2016, 4:11:01 PM
to meme-...@googlegroups.com
Hi Raymond,

DREME discovers motifs by finding matches to regular expressions that are enriched in the positive sequence set over the negative sequence set. At one step in the algorithm, the number of sequences containing a match to a regular expression is compared between the two sets. However, each sequence is counted only once, whether it contains 1 match or 100 matches. Longer sequences are more likely to contain multiple matches, so if you submit a collection of long sequences to DREME it may miss some significant motifs. The multiple matches in a single sequence won't add to the evidence for the motif. You'll increase DREME's sensitivity if you break up your 1000bp sequences into 10 100bp sequences.
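The splitting step suggested above can be sketched in a few lines of Python. This is only an illustration, not part of the MEME Suite: the function name `split_into_windows` and its `size` and `step` parameters are made up for this example.

```python
def split_into_windows(seq, size=100, step=None):
    """Split seq into windows of at most `size` bp.

    step defaults to size, which gives non-overlapping windows.
    A smaller step gives overlapping windows, so a motif cut by one
    window boundary is still intact in a neighbouring window.
    """
    if step is None:
        step = size
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + size])
        # Stop once a window has reached the end of the sequence.
        if start + size >= len(seq):
            break
    return windows


if __name__ == "__main__":
    long_seq = "ACGT" * 250            # a 1000 bp toy sequence
    pieces = split_into_windows(long_seq)
    print(len(pieces), len(pieces[0]))
```

Note that overlapping windows (a `step` smaller than `size`) can place the same motif instance in two windows, which would inflate the per-sequence counts, so non-overlapping windows are probably the safer default.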

Since we've added support for arbitrary alphabets in the MEME Suite, the standard DNA alphabet will map nucleotides in lower case to upper case. If you want DREME to skip the soft-masked characters, you can either replace the lower-case nucleotides with 'N', or provide a custom alphabet where the lower-case letters are mapped to 'N'. See the format for a custom alphabet for more information.
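The first option, replacing lower-case nucleotides with 'N' before running DREME, might look like the following. It is a minimal sketch that assumes plain FASTA input; `hard_mask` and `hard_mask_fasta` are hypothetical helper names, not MEME Suite tools.

```python
import re


def hard_mask(seq):
    """Replace every soft-masked (lower-case) character with 'N'."""
    return re.sub(r"[a-z]", "N", seq)


def hard_mask_fasta(lines):
    """Yield FASTA lines with headers untouched and sequence lines hard-masked."""
    for line in lines:
        line = line.rstrip("\n")
        # Header lines start with '>' and may contain lower-case text.
        yield line if line.startswith(">") else hard_mask(line)


if __name__ == "__main__":
    example = [">seq1 example\n", "ACGTacgtACGT\n"]
    for out in hard_mask_fasta(example):
        print(out)
```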

Raymond Wan

Apr 14, 2016, 10:51:07 PM
to MEME Suite Q&A


Hi Charles,

Thanks a lot for getting back to me so quickly!



On Friday, 15 April 2016 04:11:01 UTC+8, CharlesEGrant wrote:
Hi Raymond,

DREME discovers motifs by finding matches to regular expressions that are enriched in the positive sequence set over the negative sequence set. At one step in the algorithm, the number of sequences containing a match to a regular expression is compared between the two sets. However, each sequence is counted only once, whether it contains 1 match or 100 matches. Longer sequences are more likely to contain multiple matches, so if you submit a collection of long sequences to DREME it may miss some significant motifs. The multiple matches in a single sequence won't add to the evidence for the motif. You'll increase DREME's sensitivity if you break up your 1000bp sequences into 10 100bp sequences.


I see.  So, if the number of sequences is counted only once, then for long sequences, there should be "ties".  How are these ties "broken" so that only a single one is printed out?  Is it somewhat arbitrary?

Currently, I'm providing a set of N sequences.  Would I be correct in saying if I want to know what motif occurs frequently across the N sequences, it is acceptable to not break up the 1000 bp sequences?  However, if I want some comparison that "weights" this by the number of times the motif occurs per sequence, then I can only get that by breaking it up.

Is there something special with the number "100"?  I presume sensitivity would increase if I continue breaking the sequences into even smaller chunks.  However, if I start breaking motifs in half, sensitivity would suffer.  Thus, "100" is just a compromise?
 


Since we've added support for arbitrary alphabets in the MEME Suite, the standard DNA alphabet will map nucleotides in lower case to upper case. If you want DREME to skip the soft-masked characters, you can either replace the lower-case nucleotides with 'N', or provide a custom alphabet where the lower-case letters are mapped to 'N'. See the format for a custom alphabet for more information.


Ah!  I was not aware of MEME Suite's use of custom alphabets.  For my purpose, substitution with "N" is fine, but I will keep the above web page in mind if I want to do anything fancy later.

Thanks a lot for your help!

Ray



 

CharlesEGrant

Apr 15, 2016, 9:07:28 PM
to meme-...@googlegroups.com
I see.  So, if the number of sequences is counted only once, then for long sequences, there should be "ties".  How are these ties "broken" so that only a single one is printed out?  Is it somewhat arbitrary?

I'm sorry, I wasn't clear. I was referring to a single step in the DREME algorithm. The overall process is iterative. DREME guesses a regular expression, counts the number of sequences that contain at least one match to the regular expression in the positive and negative data sets, figures out if sequences with a match to the regular expression are enriched in the positive set relative to the negative set, and if so, tweaks the regular expression so there are even more matches. The details of the algorithm are explained in "DREME: motif discovery in transcription factor ChIP-seq data".

Currently, I'm providing a set of N sequences.  Would I be correct in saying if I want to know what motif occurs frequently across the N sequences, it is acceptable to not break up the 1000 bp sequences?

I'm not sure I understand your goal here, but consider this: suppose you have a sequence in the positive set that contains 100 instances of a motif, and no sequences in the negative set containing the motif. During its search, DREME guesses a regular expression that matches the motif well. Then DREME is going to count the number of sequences containing a match to that regular expression. That'll be one sequence in the positive set, and zero sequences in the negative set. Hmm, 1 vs 0. That's not a significant enrichment, so that regular expression is dropped and DREME goes on to try something else.  Instead, suppose you'd broken up your sequence into smaller pieces. Instead of 1 sequence containing 100 instances of the motif, you have 100 sequences, 50 of which contain instances of the motif (maybe you broke some by splitting up the sequences). Now when DREME goes to count up the sequences that contain a match to the motif, it will count 50 sequences in the positive set, and still zero in the negative set, which is a pretty substantial enrichment, so DREME will start trying to refine the regular expression.
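To put rough numbers on the "1 vs 0" and "50 vs 0" cases, here is a small one-sided Fisher's exact test written with the Python standard library only. The set sizes are assumptions made for illustration (the negative set is taken to be the same size as the positive set), and `fisher_enrichment_p` is a made-up name; this is not DREME's own code.

```python
from math import comb


def fisher_enrichment_p(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher's exact test p-value.

    Probability of seeing at least pos_hits matching sequences in the
    positive set if the matching sequences were spread at random over
    the positive and negative sets (hypergeometric tail).
    """
    total = pos_total + neg_total
    hits = pos_hits + neg_hits
    denom = comb(total, pos_total)
    p = 0.0
    for k in range(pos_hits, min(hits, pos_total) + 1):
        p += comb(hits, k) * comb(total - hits, pos_total - k) / denom
    return p


if __name__ == "__main__":
    # Assumed sizes: 1 long positive sequence vs 1 negative sequence.
    print(fisher_enrichment_p(1, 1, 0, 1))
    # Same data split into 100 pieces, 50 of which keep an intact motif.
    print(fisher_enrichment_p(50, 100, 0, 100))
```

With these assumed sizes the unsplit case gives p = 0.5, while the split case gives a p-value many orders of magnitude smaller, which matches the intuition in the explanation above.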

Is there something special with the number "100"?  I presume sensitivity would increase if I continue breaking the sequences into even smaller chunks.  However, if I start breaking motifs in half, sensitivity would suffer.  Thus, "100" is just a compromise?

Yep, it's a compromise. Note though that DREME was designed especially to work with ChIP-Seq analyses. ChIP-Seq peaks are typically centered in an interval of about 100 bp. It may be that MEME using the ANR (any number of repetitions) model would be a better choice for your analysis. Unfortunately, MEME is much slower than DREME, and the running time of MEME grows as the square of the overall size of the sequence data, and the cube of the number of sequences. MEME is really only practical for sequence files smaller than 1 Mb.

Raymond Wan

Apr 18, 2016, 2:40:49 AM
to MEME Suite Q&A

Hi Charles,

Thanks a lot for taking the time from your weekend to reply to me!  I very much appreciate it!



On Saturday, 16 April 2016 09:07:28 UTC+8, CharlesEGrant wrote:
I see.  So, if the number of sequences is counted only once, then for long sequences, there should be "ties".  How are these ties "broken" so that only a single one is printed out?  Is it somewhat arbitrary?

I'm sorry, I wasn't clear. I was referring to a single step in the DREME algorithm. The overall process involves multiple steps. DREME goes through an iterative process. It guesses a regular expression, counts the number of sequences that contain at least one match to the regular expression in the positive and negative data sets, figures out if sequences with a match to the regular expression are enriched in the positive set relative to the negative set, and if so, tweaks the regular expression so there are even more matches. The details of the algorithm are explained in DREME: motif discovery in transcription factor ChIP-seq data




I see.  Perhaps I did misunderstand something.  I thought if there are N sequences of length 100 bp then the count you mentioned should be at most N (i.e., excluding the generalizing step).  If these N sequences were all of length 1000 bp, then the maximum count is still N.  So, breaking up each of the N sequences will make the algorithm more sensitive by creating 10N sequences of length 100 bp each.

What I was thinking was that maybe there's a possibility that two motifs (let's say TTTTTT and GGGGGG) occur the same number of times.  Let's say 75, and that's even with a generalization step.  I think this is what I meant by a "tie" and I was wondering how DREME would choose which one to output.  (I guess we're also assuming these two motifs occur with the same frequency in the negative sequences...)  I guess there are too many "if's" at this point so maybe such a tie would very rarely occur.

I will take another look at the DREME paper -- clearly I misunderstood something.  Sorry about this and thanks for taking the time to explain it to me!

 
Currently, I'm providing a set of N sequences.  Would I be correct in saying if I want to know what motif occurs frequently across the N sequences, it is acceptable to not break up the 1000 bp sequences?

I'm not sure I understand your goal here, but consider this: suppose you have a sequence in the positive set that contains 100 instances of a motif, and no sequences in the negative set containing the motif. During its search, DREME guesses a regular expression that matches the motif well. Then DREME is going to count the number of sequences containing a match to that regular expression. That'll be one sequence in the positive set, and zero sequences in the negative set. Hmm, 1 vs 0. That's not a significant enrichment, so that regular expression is dropped and DREME goes on to try something else.  Instead, suppose you'd broken up your sequence into smaller pieces. Instead of 1 sequence containing 100 instances of the motif, you have 100 sequences, 50 of which contain instances of the motif (maybe you broke some by splitting up the sequences). Now when DREME goes to count up the sequences that contain a match to the motif, it will count 50 sequences in the positive set, and still zero in the negative set, which is a pretty substantial enrichment, so DREME will start trying to refine the regular expression.


Indeed, I'm not using DREME for ChIP-Seq data, but I am using it for something that someone else has used DREME for.  However, they never gave DREME such long sequences -- I think they were all less than 100 bp.  I think I understand why...

I guess I am making the gross assumption that the motif will appear once per sequence (yes, even if it's 1000 bp in length).  And I would like to know how many sequences have the motif (as opposed to how many times the motif appears across all sequences).

I think leaving the 1000 bp sequences as-is will answer this question for me, but I think I need to return to the drawing board to see if that is truly what I want.

Thanks a lot for the detailed example above!  That certainly makes things clearer for me!

 

Is there something special with the number "100"?  I presume sensitivity would increase if I continue breaking the sequences into even smaller chunks.  However, if I start breaking motifs in half, sensitivity would suffer.  Thus, "100" is just a compromise?

Yep, it's a compromise. Note though that DREME was designed especially to work with ChIP-Seq analyses. ChIP-Seq peaks are typically centered in an interval of about 100 bp. It may be that MEME using the ANR (any number of repetitions) model would be a better choice for your analysis. Unfortunately, MEME is much slower than DREME, and the running time of MEME grows as the square of the overall size of the sequence data, and the cube of the number of sequences. MEME is really only practical for sequence files smaller than 1 Mb.


My data is larger than 1 Mb, but I can give MEME a try.  Since the aforementioned paper used DREME, I was repeating its procedure (but of course, with my data set). 

I can also see if MEME with the ANR model works for me.

Thank you very much for your help!

Ray



CharlesEGrant

Apr 25, 2016, 5:43:32 PM
to meme-...@googlegroups.com
Hi Ray,

What I was thinking was that maybe there's a possibility that two motifs (let's say TTTTTT and GGGGGG) occur the same number of times.  Let's say 75, and that's even with a generalization step.  I think this is what I meant by a "tie" and I was wondering how DREME would choose which one to output.  (I guess we're also assuming these two motifs occur with the same frequency in the negative sequences...)  I guess there are too many "if's" at this point so maybe such a tie would very rarely occur.

There isn't any explicit competition between motifs in DREME. DREME depends on having two sets of sequences, one containing instances of the motifs and one not. If you don't provide a negative sequence set, DREME generates one by randomly shuffling the sequences you do provide. DREME then counts the exact matches in the two sequence sets to all words of length 4 to 8. At this stage wildcards are not part of the allowed alphabet. For each word, DREME compares the number of exact matches in the positive and negative sets, and picks initial candidate motifs based on the p-value of Fisher's exact test for the counts in the two sets. The candidate motifs are then extended by adding wildcards to the allowed alphabet. If two motifs end up with the same significant final p-value, they are both reported.
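The word-counting stage described above can be approximated as follows. This is a simplified sketch, not DREME's implementation: it counts each sequence at most once per word (following the earlier description of per-sequence counting), it shuffles single letters to build a stand-in negative set, and it ignores details such as strand handling. The names `shuffled_copy` and `sequences_containing_words` are hypothetical.

```python
import random
from collections import Counter

DNA = frozenset("ACGT")


def shuffled_copy(seq, rng):
    """Return seq with its letters in random order (same composition)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def sequences_containing_words(seqs, kmin=4, kmax=8):
    """Map each exact word of length kmin..kmax to the number of
    sequences that contain it at least once."""
    counts = Counter()
    for seq in seqs:
        seen = set()
        for k in range(kmin, kmax + 1):
            for i in range(len(seq) - k + 1):
                word = seq[i:i + k]
                # Skip words touching 'N' or any other non-ACGT symbol.
                if DNA.issuperset(word):
                    seen.add(word)
        counts.update(seen)  # each sequence adds at most 1 per word
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    positives = ["ACGTACGTTT", "GGACGTACCC", "TTTTACGTAA"]
    negatives = [shuffled_copy(s, rng) for s in positives]
    pos = sequences_containing_words(positives)
    neg = sequences_containing_words(negatives)
    print(pos["ACGT"], neg["ACGT"])
```

The per-word positive and negative counts produced this way are the kind of input a Fisher's exact test would then compare.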

I guess I am making the gross assumption that the motif will appear once per sequence (yes, even if it's 1000 bp in length).  And I would like to know how many sequences has the motif (as opposed to how many times the motif appears across all sequences).

For that kind of experiment it's typically useful to break it into two separate tasks: motif discovery and motif scanning. MEME and DREME are used to establish the existence of motifs and compute the motif's PWM, but they don't necessarily report every match to the motif in the input sequences. In fact, the DREME output doesn't include any information about which sequences contributed to the identification of the motif.   However, once you've identified the motif and have a PWM, you can then scan a sequence database using FIMO. Depending on what you are looking for, FIMO may be overkill, since it's scanning with the PWM. If you just want to identify all the matches to the regular expressions used by DREME then this post may be helpful.
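For the simpler regular-expression route, one possible approach is to translate the IUPAC consensus that DREME reports into a Python regular expression and count the sequences that match. The sketch below is an assumption-laden illustration rather than the method from the linked post: it scans the forward strand only, treats soft-masked letters as upper case, and uses made-up helper names.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_pattern(motif):
    """Translate an IUPAC motif such as 'CASTG' into a regex string."""
    return "".join(IUPAC[c] for c in motif.upper())


def count_sequences_with_motif(motif, seqs):
    """Number of sequences with at least one forward-strand match."""
    pattern = re.compile(iupac_to_pattern(motif))
    return sum(1 for s in seqs if pattern.search(s.upper()))


def match_positions(motif, seq):
    """0-based start positions of all matches, including overlapping ones."""
    lookahead = re.compile("(?=" + iupac_to_pattern(motif) + ")")
    return [m.start() for m in lookahead.finditer(seq.upper())]


if __name__ == "__main__":
    seqs = ["AACACTGAA", "TTTTTTTT", "GGCAGTGCACTG"]
    print(count_sequences_with_motif("CASTG", seqs))
    print(match_positions("CASTG", seqs[2]))
```

`count_sequences_with_motif` answers "how many sequences have the motif", while `match_positions` gives every occurrence within one sequence, so the two questions raised earlier in the thread can be kept separate.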


Charles

Raymond Wan

Apr 28, 2016, 3:53:56 AM
to MEME Suite Q&A

Hi Charles,



On Tuesday, 26 April 2016 05:43:32 UTC+8, CharlesEGrant wrote:
Hi Ray,

What I was thinking was that maybe there's a possibility that two motifs (let's say TTTTTT and GGGGGG) occur the same number of times.  Let's say 75, and that's even with a generalization step.  I think this is what I meant by a "tie" and I was wondering how DREME would choose which one to output.  (I guess we're also assuming these two motifs occur with the same frequency in the negative sequences...)  I guess there are too many "if's" at this point so maybe such a tie would very rarely occur.

There isn't any explicit competition between motifs in DREME. DREME depends on having two sets of sequences, one containing instances of the motifs and one not. If you don't provide a negative sequence set, DREME generates one by randomly shuffling the sequences you do provide. DREME then counts the exact matches in the two sequence sets to all words of length 4 to 8. At this stage wildcards are not part of the allowed alphabet. For each word, DREME compares the number of exact matches in the positive and negative sets, and picks initial candidate motifs based on the p-value of Fisher's exact test for the counts in the two sets. The candidate motifs are then extended by adding wildcards to the allowed alphabet. If two motifs end up with the same significant final p-value, they are both reported.


I see -- thank you for the detailed explanation!

I guess I was mistakenly picturing DREME's results like the web pages returned by Google, where the order of the results somehow matters.  But I see your point that they would both be returned and I should not read too much into the order of the results.

By the way, I noticed the original DREME paper also recommended motifs of 4 to 8 bases.  However, when I extended the search to 10 bases, the program ran without complaints and returned results to me.  Is the upper limit of 8 a recommendation to limit running time (yes, it did run for a while) or are results above 8 unreliable?

 

I guess I am making the gross assumption that the motif will appear once per sequence (yes, even if it's 1000 bp in length).  And I would like to know how many sequences has the motif (as opposed to how many times the motif appears across all sequences).

For that kind of experiment it's typically useful to break it into two separate tasks: motif discovery and motif scanning. MEME and DREME are used to establish the existence of motifs and compute the motif's PWM, but they don't necessarily report every match to the motif in the input sequences. In fact, the DREME output doesn't include any information about which sequences contributed to the identification of the motif.   However, once you've identified the motif and have a PWM, you can then scan a sequence database using FIMO. Depending on what you are looking for, FIMO may be overkill, since it's scanning with the PWM. If you just want to identify all the matches to the regular expressions used by DREME then this post may be helpful.




 I see.  I never considered FIMO in my workflow.  I will look into it -- thank you for mentioning it and the link to the past post!

Ray


