How to refine motifs?

Alan Tourancheau

Oct 23, 2017, 10:10:05 AM
to MEME Suite Q&A
Hello,

I'd like to know if there is a way to refine motifs found with MEME. I have a large set of fixed-length sequences detected from peaks in the signal. Depending on the dataset and the threshold used, this set comprises between 50k and 100k sequences explained by 1 to 30 motifs ranging from 4 to 13 bp. Those are expected to have some degeneracy (ambiguous nucleotides).

I already tried DREME, MEME-ChIP, and MEME. The first two handle those large datasets easily, but motifs are often combined (short or overlapping motifs), and DREME does not handle gapped motifs, which I have. So far I have gotten the best results by using MEME recursively on the sequence set, ~2000 sequences at a time. I remove sequences containing the motifs found in step 1 and re-run MEME until no motifs are detected.

Unfortunately, I often get motifs slightly longer than expected, with 1 or 2 positions of low-confidence (low-probability) bases, which makes them difficult to distinguish from actual ambiguous bases. Is there a robust way to refine motifs found like that? I tried increasing the number of sequences in MEME, but it quickly becomes too computationally intensive for my application. Could I take the PWMs found by MEME and refine them with the complete set of sequences? I did not find other software in your suite to do that, but I may be missing something (AME?). Or are there other parameters in MEME that I can tweak?

Regards,

Alan

cegrant

Nov 8, 2017, 4:18:48 PM
to MEME Suite Q&A
> I'd like to know if there is a way to refine motifs found with MEME. I have a large set of fixed-length sequences detected from peaks in the signal. Depending on the dataset and the threshold used, this set comprises between 50k and 100k sequences explained by 1 to 30 motifs ranging from 4 to 13 bp.

I don't know the details of your experiment, but this is probably not a computationally tractable dataset for MEME analysis! 

MEME does not perform an exhaustive search of all possible motifs in your sequence data. Rather, it identifies alignments of short sequences that are statistically over-represented in your data. In such a large data set with such a diversity of motifs, any biological signal is probably going to be overwhelmed by chance signals, particularly for the short motifs. You don't mention screening the motifs reported by MEME by E-value. MEME will generally report the most statistically significant motif first, but if you ask MEME to report 20 motifs, it will report 20 motifs, even if those motifs have E-values indicating that they are indistinguishable from chance. If you start recursively deleting sequences based on the motifs MEME is reporting, then the E-values are no longer reliable and you may end up just chasing noise.
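
To make the E-value screening concrete, here is a minimal Python sketch. It does not parse real MEME output: the motif names, the E-values, and the 0.05 cutoff are all made up for illustration, and you would need to pick a cutoff appropriate for your own analysis.

```python
from typing import List, NamedTuple


class Motif(NamedTuple):
    name: str      # hypothetical motif label
    evalue: float  # E-value reported for the motif


def screen_by_evalue(motifs: List[Motif], max_evalue: float = 0.05) -> List[Motif]:
    """Keep only motifs whose E-value is at or below max_evalue.

    The 0.05 default is an illustrative assumption, not a recommendation.
    """
    return [m for m in motifs if m.evalue <= max_evalue]


# Made-up example values: the third motif has an E-value far above the cutoff.
reported = [
    Motif("MOTIF_1", 1.2e-40),
    Motif("MOTIF_2", 3.0e-5),
    Motif("MOTIF_3", 4.7e2),
]
kept = screen_by_evalue(reported)
print([m.name for m in kept])
```

Note that this kind of screening is only meaningful on the E-values from a single run over the original data; as mentioned above, E-values from runs on recursively reduced sequence sets are no longer reliable.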

If your data are akin to ChIP-Seq, you may be able to use MEME-ChIP rather than MEME directly. ChIP-Seq data are highly redundant, so before analyzing the sequence data with MEME, the sequences are trimmed to their central 100 bp, and 600 sequences are sampled from the trimmed sequences. MEME is then used to analyze the trimmed and sampled data.
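
The trim-then-sample step described above can be sketched roughly as follows. This is only an illustration of the idea, not MEME-ChIP's actual code: the function names and the fixed random seed are assumptions made for the example, and the real tool may treat short sequences or sampling differently.

```python
import random
from typing import List


def trim_to_center(seq: str, width: int = 100) -> str:
    """Return the central `width` characters of seq (whole seq if shorter)."""
    if len(seq) <= width:
        return seq
    start = (len(seq) - width) // 2
    return seq[start:start + width]


def trim_and_sample(seqs: List[str], width: int = 100, n: int = 600,
                    seed: int = 0) -> List[str]:
    """Trim every sequence to its center, then sample up to n of them.

    The seed is a hypothetical parameter added so the sketch is reproducible.
    """
    trimmed = [trim_to_center(s, width) for s in seqs]
    if len(trimmed) <= n:
        return trimmed
    return random.Random(seed).sample(trimmed, n)


# Toy input: 1000 identical 200 bp sequences with a distinct central block.
toy = ["A" * 50 + "C" * 100 + "G" * 50] * 1000
subset = trim_and_sample(toy)
print(len(subset), len(subset[0]))
```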