Query on MEME for large dataset

Arun Prasanna

Dec 27, 2016, 11:40:57 AM
to MEME Suite Q&A

Hello,

I am trying to do motif discovery on ~22,000 promoter sequences, each 1,000 bp long. I looked through the earlier questions here about running MEME on large datasets and its runtime, and I understand that it doesn't make sense to use -maxsize 23000000. My questions are:

Is it OK to split the sequences into batches of 1,000 and run MEME with zoops? The 'background' will vary from batch to batch, which will lead to variation when I merge the results at the end, right?

Can you help me decide on a -nmotifs value? I have seen different studies use different values, such as 3, 5, or 10. My dataset is eukaryotic.

Can you give me an idea of how to run these sequences?

Thanks in advance,

AP

CharlesEGrant

Dec 29, 2016, 4:35:20 PM
to MEME Suite Q&A
Can I clarify what you are trying to do here? Do you believe that most of your 22,000 promoter sequences contain one, or maybe a handful, of transcription factor binding sites (TFBS) in common? Or do you expect them to contain a wide variety of TFBS?

You have to keep in mind that MEME performs motif discovery by spotting short subsequences that occur more frequently in your sequence data than would be expected by chance. This means you can't give MEME 22,000 disparate promoter sequences and ask it to identify all the motifs present. There wouldn't be enough statistical evidence for any one motif for MEME to be able to spot it. On the other hand, if you expect all 22,000 promoter sequences to contain the same 2 or 3 motifs, then your data set is highly redundant, and forcing all of it into the analysis would just waste compute time.
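To put a rough number on "expected by chance", here is a back-of-the-envelope sketch. It assumes independent, uniformly distributed bases and counts exact matches to a single word on one strand; that is an illustrative simplification, not the background model or scoring MEME actually uses. The function name and the word width of 8 are made up for the example.

```python
def expected_chance_hits(n_seqs: int, seq_len: int, width: int) -> float:
    """Expected number of exact matches to one specific word of `width` bases,
    on one strand, assuming independent and uniformly distributed A/C/G/T."""
    positions_per_seq = seq_len - width + 1  # start positions where the word fits
    return n_seqs * positions_per_seq * 0.25 ** width


# Assumed, illustrative numbers: the full promoter set versus a 300-sequence sample.
full = expected_chance_hits(22000, 1000, 8)
sample = expected_chance_hits(300, 1000, 8)
print(f"full set: {full:.1f} chance hits, 300-sequence sample: {sample:.1f}")
```

Under these assumptions a given 8-mer turns up a few hundred times in the full set purely by chance, so a motif with only a few dozen real instances has little to distinguish it from background.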

If you are looking for a few TFBS common to all your sequences, then you should randomly sample a few hundred promoter sequences from your full set, and run MEME on those. You can use the fasta-subsample utility included in the MEME Suite source for this. MEME generally will find the most statistically significant motifs first, but this isn't strictly guaranteed. If you expect that there might be 2 or 3 motifs in your data set, you should start off with '-nmotifs' of around 10-20. You should observe that the motifs identified quickly fall off in statistical significance.
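If it helps to see what random subsampling amounts to, here is a minimal Python sketch. It is a stand-in for illustration only, not the implementation of fasta-subsample; the function names, the seed, and the tiny demo sequences are all assumptions.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample(records, n, seed=0):
    """Draw n records without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, n)


# Toy input standing in for the real promoter file.
demo = ">p1\nACGT\nAC\n>p2\nGGCC\n>p3\nTTAA\n"
picked = subsample(read_fasta(demo), 2, seed=1)
for header, seq in picked:
    print(f">{header}\n{seq}")
```

For the real data you would read the promoter FASTA file, draw a few hundred records, and write them to a new file to pass to MEME.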

If you expect your sequences to contain dozens and dozens of different motifs, with only a few instances of each, then motif discovery is not computationally feasible. The best option might be motif search, using a tool like FIMO, and one of the databases of known motifs. 
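As a sketch of what a FIMO scan might look like, the snippet below only assembles and prints a command line; it does not run anything. The file names, output directory, and p-value threshold are placeholders I have assumed, and you would substitute a motif database file in MEME format and your own promoter FASTA.

```python
import shlex


def fimo_command(motif_file, sequence_file, out_dir="fimo_out", p_thresh=1e-4):
    """Build a FIMO argument list: scan sequence_file for the motifs in motif_file."""
    return [
        "fimo",
        "--oc", out_dir,              # output directory, overwritten if it exists
        "--thresh", str(p_thresh),    # p-value cutoff for reported matches
        motif_file,
        sequence_file,
    ]


# Hypothetical file names, for illustration only.
cmd = fimo_command("known_motifs.meme", "promoters.fa")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once the MEME Suite is installed.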




Arun Prasanna

Dec 30, 2016, 4:56:19 AM
to MEME Suite Q&A
Hi Charles,
Thanks for the insights. I expect my promoter sequences to contain a wide variety of TFBS. So, if I understand your explanation correctly, I would do better to use only the promoters of the genes that are upregulated in my experiment (~100), find the motifs in those, and then use those motifs to search the other sequences to detect co-regulated genes. Please correct me if my strategy is wrong.

Happy New Year wishes.

Regards,
AP

CharlesEGrant

Dec 30, 2016, 4:27:45 PM
to meme-...@googlegroups.com
That sounds about right. You want to provide MEME with multiple sequences that you think contain instances of a common motif. To flesh out your reasoning a bit more: if a collection of genes are all upregulated under similar biological conditions, this may be evidence of a common regulatory pathway. Transcription factors often drive regulation, so maybe the promoters for these genes all bind a common transcription factor. If those promoters all bind a common transcription factor, then they may contain a common binding motif. If the promoter sequences all contain a common binding motif, then MEME should be able to spot it.
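A possible starting point for the MEME run on the upregulated set is sketched below. It only builds and prints the command line. The input file name, output directory, and the choice of zoops, 10 motifs, and both strands are assumptions for illustration, not settings prescribed in this thread, so adjust them to your data.

```python
import shlex


def meme_command(fasta, out_dir="meme_out", nmotifs=10, model="zoops"):
    """Build a MEME argument list for DNA motif discovery on one FASTA file."""
    return [
        "meme", fasta,
        "-dna",                      # nucleotide alphabet
        "-oc", out_dir,              # output directory, overwritten if it exists
        "-mod", model,               # zoops: zero or one site per sequence
        "-nmotifs", str(nmotifs),    # stop after this many motifs
        "-revcomp",                  # also consider the reverse strand
    ]


# Hypothetical file holding the ~100 upregulated promoters.
cmd = meme_command("upregulated_promoters.fa")
print(shlex.join(cmd))
```

MEME writes its results into the output directory, and the text output there should be usable as the motif input to FIMO when you go on to scan the remaining promoters.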