Hi Raymond,
DREME discovers motifs by finding matches to regular expressions that are enriched in the positive sequence set over the negative sequence set. At one step in the algorithm, the number of sequences containing a match to a regular expression is compared between the two sets. However, each sequence is counted only once, whether it contains 1 match or 100 matches, so multiple matches in a single sequence don't add to the evidence for the motif. Longer sequences are more likely to contain multiple matches, so if you submit a collection of long sequences to DREME it may miss some significant motifs. You'll increase DREME's sensitivity if you break each of your 1000bp sequences into ten 100bp sequences.
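In case it helps, here is a minimal Python sketch of that splitting step. It is not part of the MEME Suite; the function names and the `name_start-end` naming of the pieces are just assumptions for illustration, and the pieces do not overlap, so a motif sitting on a cut point will be broken.

```python
def split_sequence(seq, chunk_len=100):
    """Cut seq into consecutive, non-overlapping pieces of at most chunk_len letters."""
    return [seq[i:i + chunk_len] for i in range(0, len(seq), chunk_len)]


def split_records(records, chunk_len=100):
    """Split (name, sequence) pairs and yield (new_name, piece) pairs.

    The new name records the 1-based start and end of the piece in the
    original sequence (a hypothetical naming scheme, not a MEME convention).
    """
    for name, seq in records:
        for start in range(0, len(seq), chunk_len):
            piece = seq[start:start + chunk_len]
            yield f"{name}_{start + 1}-{start + len(piece)}", piece


if __name__ == "__main__":
    # One 1000bp sequence becomes ten 100bp sequences.
    for new_name, piece in split_records([("seq1", "ACGT" * 250)]):
        print(f">{new_name}\n{piece}")
```

The last piece is simply shorter when the sequence length is not a multiple of `chunk_len`.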
Since we've added support for arbitrary alphabets in the MEME Suite, the standard DNA alphabet maps lower case nucleotides to upper case. If you want DREME to skip the soft-masked characters you can either replace the lower case nucleotides with 'N', or provide a custom alphabet where the lower case letters are mapped to 'N'. See the format for a custom alphabet for more information.
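The first option (replacing the lower case nucleotides with 'N') can be done before the sequences are submitted. A minimal sketch, assuming the sequences are already in memory as plain strings; `hard_mask` is a made-up helper name, not a MEME Suite tool:

```python
import re


def hard_mask(seq):
    """Replace every lower-case letter (a soft-masked position) with 'N'."""
    return re.sub(r"[a-z]", "N", seq)


if __name__ == "__main__":
    print(hard_mask("ACGTacgtTTGA"))  # upper-case letters are left untouched
```

Apply it to sequence lines only, not to the FASTA header lines, since those usually contain lower case text too.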
I see. So, if the number of sequences is counted only once, then for long sequences, there should be "ties". How are these ties "broken" so that only a single one is printed out? Is it somewhat arbitrary?
Currently, I'm providing a set of N sequences. Would I be correct in saying if I want to know what motif occurs frequently across the N sequences, it is acceptable to not break up the 1000 bp sequences?
Is there something special with the number "100"? I presume sensitivity would increase if I continue breaking the sequences into even smaller chunks. However, if I start breaking motifs in half, sensitivity would suffer. Thus, "100" is just a compromise?
> I see. So, if the number of sequences is counted only once, then for long sequences, there should be "ties". How are these ties "broken" so that only a single one is printed out? Is it somewhat arbitrary?
I'm sorry, I wasn't clear. I was referring to a single step in the DREME algorithm; the overall process is iterative. DREME guesses a regular expression, counts the number of sequences that contain at least one match to it in the positive and negative data sets, and works out whether sequences with a match are enriched in the positive set relative to the negative set. If they are, it tweaks the regular expression so there are even more matches. The details of the algorithm are explained in "DREME: motif discovery in transcription factor ChIP-seq data".
> Currently, I'm providing a set of N sequences. Would I be correct in saying if I want to know what motif occurs frequently across the N sequences, it is acceptable to not break up the 1000 bp sequences?
I'm not sure I understand your goal here, but consider this: suppose you have a sequence in the positive set that contains 100 instances of a motif, and no sequences in the negative set containing the motif. During its search, DREME guesses a regular expression that matches the motif well. DREME then counts the number of sequences containing a match to that regular expression. That'll be one sequence in the positive set, and zero sequences in the negative set. Hmm, 1 vs 0. That's not a significant enrichment, so that regular expression is dropped and DREME goes on to try something else. Now suppose you'd broken up your sequence into smaller pieces. Instead of 1 sequence containing 100 instances of the motif you have 100 sequences, 50 of which contain instances of the motif (maybe you broke some instances by splitting up the sequences). When DREME counts the sequences that contain a match to the motif, it will find 50 sequences in the positive set and still zero in the negative set, which is a pretty substantial enrichment, so DREME will start trying to refine the regular expression.
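To put rough numbers on the 1 vs 0 and 50 vs 0 comparison, here is a small sketch of a one-sided Fisher's exact test on those counts. The set sizes are my assumption for the sake of the example (one sequence in each set before splitting, 100 pieces in each set after), and this is plain Python rather than DREME's own code.

```python
from math import comb


def fisher_enrichment_p(pos_hits, pos_total, neg_hits, neg_total):
    """One-sided Fisher's exact test p-value.

    Probability of seeing pos_hits or more matching sequences in the
    positive set if the matching sequences were spread at random over
    the positive and negative sets combined.
    """
    hits = pos_hits + neg_hits
    total = pos_total + neg_total
    upper = min(hits, pos_total)
    numerator = sum(
        comb(hits, k) * comb(total - hits, pos_total - k)
        for k in range(pos_hits, upper + 1)
    )
    return numerator / comb(total, pos_total)


if __name__ == "__main__":
    # Unsplit: 1 of 1 positive sequences match, 0 of 1 negative (assumed sizes).
    print(fisher_enrichment_p(1, 1, 0, 1))
    # Split: 50 of 100 positive pieces match, 0 of 100 negative (assumed sizes).
    print(fisher_enrichment_p(50, 100, 0, 100))
```

With these assumed sizes the unsplit case gives a p-value of 0.5, which is no evidence at all, while the split case gives a p-value many orders of magnitude smaller.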
> Is there something special with the number "100"? I presume sensitivity would increase if I continue breaking the sequences into even smaller chunks. However, if I start breaking motifs in half, sensitivity would suffer. Thus, "100" is just a compromise?
Yep, it's a compromise. Note though that DREME was designed especially to work with ChIP-Seq analyses. ChIP-Seq peaks are typically centered in an interval of about 100bp. It may be that MEME using the ANR (any number of repetitions) model would be a better choice for your analysis. Unfortunately MEME is much slower than DREME: its running time grows as the square of the overall size of the sequence data, and as the cube of the number of sequences. MEME is really only practical for sequence files less than 1Mb in size.
What I was thinking was that maybe there's a possibility that two motifs (let's say TTTTTT and GGGGGG) occur the same number of times. Let's say 75, and that's even with a generalization step. I think this is what I meant by a "tie" and I was wondering how DREME would choose which one to output. (I guess we're also assuming these two motifs occur with the same frequency in the negative sequences...) I guess there are too many "if's" at this point so maybe such a tie would very rarely occur.
I guess I am making the gross assumption that the motif will appear once per sequence (yes, even if it's 1000 bp in length). And I would like to know how many sequences have the motif (as opposed to how many times the motif appears across all sequences).
Hi Ray,
> What I was thinking was that maybe there's a possibility that two motifs (let's say TTTTTT and GGGGGG) occur the same number of times. Let's say 75, and that's even with a generalization step. I think this is what I meant by a "tie" and I was wondering how DREME would choose which one to output. (I guess we're also assuming these two motifs occur with the same frequency in the negative sequences...) I guess there are too many "if's" at this point so maybe such a tie would very rarely occur.
There isn't any explicit competition between motifs in DREME. DREME depends on having two sets of sequences, one containing instances of the motifs and one not. If you don't provide a negative sequence set, DREME generates one by randomly shuffling the sequences you do provide. DREME then counts the exact matches in the two sequence sets to every word of length 4 to 8. At this stage wildcards are not part of the allowed alphabet. For each word DREME compares the number of exact matches in the positive and negative sets, and picks initial candidate motifs based on the p-value of Fisher's exact test for the counts in the two sets. The candidate motifs are then extended by adding wildcards to the allowed alphabet. If two motifs end up with the same significant final p-value, they are both reported.
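As a rough illustration of that first counting step, the sketch below tabulates, for every exact word of length 4 to 8, how many sequences in each set contain it at least once. It is a simplification and not DREME's code: it ignores the reverse strand, skips words containing anything other than A, C, G or T, and the function names are made up.

```python
from collections import Counter

DNA = set("ACGT")


def sequences_containing_words(seqs, min_len=4, max_len=8):
    """For each exact word, count the sequences containing it at least once."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        words = set()  # a set, so each sequence counts a word only once
        for k in range(min_len, max_len + 1):
            for i in range(len(seq) - k + 1):
                word = seq[i:i + k]
                if set(word) <= DNA:  # no wildcards at this stage
                    words.add(word)
        counts.update(words)
    return counts


def seed_table(pos_seqs, neg_seqs):
    """Map each word seen in either set to (positive count, negative count)."""
    pos = sequences_containing_words(pos_seqs)
    neg = sequences_containing_words(neg_seqs)
    return {w: (pos[w], neg[w]) for w in set(pos) | set(neg)}


if __name__ == "__main__":
    table = seed_table(["ACGTACGT", "TTACGTTT"], ["TTTTTTTT"])
    print(table["ACGT"], table["TTTT"])
```

Each pair of counts, together with the two set sizes, is what would then go into Fisher's exact test to rank the candidate words. Two words with identical counts simply get identical p-values; nothing in this table forces a choice between them.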
> I guess I am making the gross assumption that the motif will appear once per sequence (yes, even if it's 1000 bp in length). And I would like to know how many sequences have the motif (as opposed to how many times the motif appears across all sequences).
For that kind of experiment it's typically useful to break it into two separate tasks: motif discovery and motif scanning. MEME and DREME are used to establish the existence of motifs and to compute each motif's PWM, but they don't necessarily report every match to the motif in the input sequences. In fact, the DREME output doesn't include any information about which sequences contributed to the identification of the motif. However, once you've identified the motif and have a PWM, you can then scan a sequence database using FIMO. Depending on what you are looking for, FIMO may be overkill, since it's scanning with the PWM. If you just want to identify all the matches to the regular expressions used by DREME then this post may be helpful.
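If the regular-expression route is enough, counting the sequences with at least one match takes only a few lines of Python. This sketch assumes the motif is written as an IUPAC consensus word, and it checks both strands by also searching for the reverse complement; the helper names are my own and it is not a replacement for FIMO.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def iupac_to_regex(motif):
    """Turn an IUPAC consensus word into a regular expression string."""
    return "".join(IUPAC[c] for c in motif.upper())


def reverse_complement(motif):
    """Reverse complement of an IUPAC word (ambiguity codes included)."""
    return motif.upper().translate(COMPLEMENT)[::-1]


def count_sequences_with_motif(seqs, motif):
    """Number of sequences with at least one match on either strand."""
    pattern = re.compile(
        iupac_to_regex(motif) + "|" + iupac_to_regex(reverse_complement(motif))
    )
    return sum(1 for s in seqs if pattern.search(s.upper()))


if __name__ == "__main__":
    seqs = ["AAGATAAGG", "CCTTATCTC", "GGGGGG"]
    print(count_sequences_with_motif(seqs, "GATAR"))
```

Note that `s.upper()` means soft-masked (lower case) regions are scanned as well; drop it if those should be skipped.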