Retaining all contributing sites from command line run of MEME


Angus Hilts

May 23, 2025, 5:25:19 PM
to MEME Suite Q&A
Hi everyone,
I am interested in looking at the specific instances of motifs after running MEME. The issue is that in some of my data sets, I have a large number of sites, which MEME seems to suppress. Is there a way to increase the number of sites that MEME will report so that I don't keep hitting the limit and can then extract the sequences themselves afterwards?
Thanks in advance!

cegrant

May 23, 2025, 5:48:16 PM
to MEME Suite Q&A
Have you looked at the -nsites, -minsites, and -maxsites options described in the MEME documentation? Note that depending on your choice of site distribution model (the -mod option), one or all of these parameters may be ignored. If you choose the "one occurrence per sequence" (OOPS) model, you are telling MEME that each sequence contains exactly one site, so the number of sites used is always equal to the number of sequences. If you choose the "zero or one occurrence per sequence" (ZOOPS) model, you are telling MEME that each sequence contains at most one site, so the max sites parameter will always be equal to the number of sequences. If you choose the "any number of repetitions" (ANR) model, the maximum number of sites that can be used is set to 5 times the number of sequences.
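
As a rough sketch, the upper limits described above can be written as a small shell function. The function name described_maxsites is made up for illustration; it encodes only the rule of thumb in this reply, not MEME's actual code, and a given MEME build or the web server may apply further limits.

```shell
# Upper bound on sites per motif, following the description in this reply.
# Hypothetical helper, not part of the MEME Suite.
# Usage: described_maxsites <oops|zoops|anr> <number_of_sequences>
described_maxsites() {
  model=$1
  nseqs=$2
  case "$model" in
    oops|zoops) echo "$nseqs" ;;            # at most one site per sequence
    anr)        echo $((nseqs * 5)) ;;      # five times the sequence count
    *)          echo "unknown model: $model" >&2; return 1 ;;
  esac
}

described_maxsites anr 100   # prints 500
```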

Typically you'd just let MEME use the default number of sites for the model you select, and then use the discovered motifs to scan for other sites matching the motif with FIMO. You'd usually only adjust the number of sites to be used if the motifs discovered had low statistical significance and you wanted MEME to try a bit harder to find supporting evidence.
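
A minimal sketch of that second step, assuming MEME wrote its text output to meme_out/meme.txt and the sequences are in sequences.fa (both names are placeholders). The --oc and --thresh options are standard FIMO options; 1e-4 is FIMO's default p-value threshold, written out here so it is easy to change. The command is only run if fimo and both files are present.

```shell
# Placeholder paths; replace with your own MEME output and FASTA file.
motifs="meme_out/meme.txt"
seqs="sequences.fa"

# Scan the sequences for matches to the motifs MEME discovered.
fimo_cmd="fimo --oc fimo_out --thresh 1e-4 $motifs $seqs"

if command -v fimo >/dev/null 2>&1 && [ -f "$motifs" ] && [ -f "$seqs" ]; then
  $fimo_cmd
else
  echo "would run: $fimo_cmd"
fi
```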

Depending on your needs, you may want to switch to STREME for motif discovery: it's faster and more thorough.

X L

Dec 15, 2025, 2:57:19 PM
to MEME Suite Q&A

Hi everyone,

It seems that even with the "ANR" model selected, the maximum number of reported sites is still 1000. I have attached my MEME report here:
https://drive.google.com/drive/folders/1Dcqja_QfmqyX8YmO-1BBy_7Xuw_KX-Ce?usp=sharing

I would appreciate any thoughts or suggestions on this issue.

Thanks,
Xiao



cegrant

Dec 15, 2025, 3:03:43 PM
to MEME Suite Q&A
As I mentioned in the previous message, you can adjust the '-maxsites' parameter. This is available in the "Advanced options" section of the public web application. Look for the heading "How many sites must each motif have?". More details are available in the documentation for the command line version of MEME. For the ANR model the default is for max sites to be five times the number of sequences.
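
For the command line version, a sketch of setting -maxsites explicitly might look like the following. The file and directory names are placeholders, and count_seqs is a hypothetical helper that counts FASTA header lines; the factor of 5 simply restates the ANR default mentioned above.

```shell
# Placeholder input and output names; replace with your own.
fasta="sequences.fa"
outdir="meme_out"

# Count FASTA records (lines beginning with '>').
count_seqs() { grep -c '^>' "$1"; }

if [ -f "$fasta" ]; then
  nseqs=$(count_seqs "$fasta")
  maxsites=$((nseqs * 5))   # the ANR default described above, set explicitly
  meme_cmd="meme -oc $outdir -dna -mod anr -maxsites $maxsites $fasta"
  if command -v meme >/dev/null 2>&1; then
    $meme_cmd
  else
    echo "meme not found; would run: $meme_cmd"
  fi
else
  echo "input file $fasta not found"
fi
```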

Charles

X L

Dec 15, 2025, 3:11:12 PM
to MEME Suite Q&A
Hi, Charles,

However, my number of sequences is 165273, so the maximum sites should be 165273 x 5, which is significantly greater than 1000. Am I on the right track?

Regards,
Xiao

cegrant

Dec 15, 2025, 3:12:26 PM
to MEME Suite Q&A

I've now looked at your MEME output. While the motifs found are highly statistically significant, it looks like the underlying sequences contain low complexity regions and repeats. MEME has no way to distinguish these from actual biologically relevant motifs. You should filter your sequences with DUST or RepeatMasker before analyzing them with MEME.

cegrant

Dec 15, 2025, 8:01:15 PM
to MEME Suite Q&A
Ah, there is one more issue I missed. The public MEME application limits the number of sites to 1000 because the size of the MEME HTML output grows linearly with the number of sites. With more than 1000 sites, the HTML output becomes unresponsive on typical computers. In the command line version of MEME you can override this using the '-brief' option, but that option is not available on the public web server (we don't want people creating giant HTML output files on our server).

I think you are getting off the right track. As I mentioned earlier, it looks like your sequence data contains many low complexity regions and repeats like GCGCGCGC ... MEME will identify those as highly significant motifs with many instances, but they generally are not biologically relevant (unless, of course, you are studying repeats, in which case you need different tools than the MEME Suite). You need to mask those regions before running your sequences through MEME. We include DUST in the MEME Suite source, and RepeatMasker is available as a separate download.

Note that MEME is NOT performing an exhaustive search for motif sites to consider. It's using statistical sampling with heuristics to make the initial guesses for the motifs. There are some special cases where you do need to adjust the number of sites MEME uses to identify a significant motif, but generally it's not helpful to override the default number of sites considered: it significantly increases the running time while not greatly improving the statistical confidence in the motif.

X L

Dec 16, 2025, 11:09:27 AM
to MEME Suite Q&A

Thank you very much. I filtered my sequences with RepeatMasker and am using MEME version 5.5.8 installed locally on a Mac (ARM64). I ran the following command:

meme -oc CLIP74BC1_Kmer_MEME_motifs_out_Intron/MEME_motif_Intron -dna -mod anr -nmotifs 20 -minw 8 -maxw 12 -minsites 30 CLIP74BC1_Kmer_MEME_motifs_out_Intron/clusters_for_motif.fa

My sequence count is 7,568, and in anr mode I would expect maxsites to be 7,568 × 5, but as you can see it is still capped at 1,000. This seems odd to me.

Results are attached here:
https://drive.google.com/file/d/1LOzUjimMxYeOFfmjQAqr0qbN9lyxYY8K/view?usp=sharing

cegrant

12:21 AM (22 hours ago)
to MEME Suite Q&A
Did you see my later message on Dec 15? I'll repeat it here:

Ah, there is one more issue I missed. The public MEME application limits the number of sites to 1000 because the size of the MEME HTML output grows linearly with the number of sites. With more than 1000 sites, the HTML output becomes unresponsive on typical computers. In the command line version of MEME you can override this using the '-brief' option, but that option is not available on the public web server (we don't want people creating giant HTML output files on our server).

I think you are getting off the right track. As I mentioned earlier, it looks like your sequence data contains many low complexity regions and repeats like GCGCGCGC ... MEME will identify those as highly significant motifs with many instances, but they generally are not biologically relevant (unless, of course, you are studying repeats, in which case you need different tools than the MEME Suite). You need to mask those regions before running your sequences through MEME. We include DUST in the MEME Suite source, and RepeatMasker is available as a separate download.

Note that MEME is NOT performing an exhaustive search for motif sites to consider. It's using statistical sampling with heuristics to make the initial guesses for the motifs. There are some special cases where you do need to adjust the number of sites MEME uses to identify a significant motif, but generally it's not helpful to override the default number of sites considered: it significantly increases the running time while not greatly improving the statistical confidence in the motif.

This should explain why you are not getting more than 1000 sites. I guess I'd ask why you want more than 1000 sites. Do you actually expect to find more than 1000 biologically functional motif instances in your sequence data? Note that several of the motifs you are finding still look like runs of AG or tandem repeats of TC. By allowing up to 1000 sites, you may very well be overwhelming the functional motifs that occur at only a few tens of sites with repetitive junk that occurs thousands of times. Now that you've filtered out repeats, I would suggest you just change your search back to using the default number of sites.

Is it possible you are misunderstanding the function of MEME? MEME performs de novo motif identification; that is, it's trying to come up with position-weight matrices for any unknown motifs that may be in your sequence data. If you want to know where ALL good matches to a motif in your sequences are, you'd use the FIMO (Find Individual Motif Occurrences) tool.
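
To pull the matched sequences out of a FIMO run, something like the following could work. It assumes FIMO's tab-separated output (fimo.tsv in the output directory, here the placeholder fimo_out) with its usual header row naming the sequence_name, start, stop, strand and matched_sequence columns; extract_sites is a hypothetical helper name.

```shell
# Print sequence name, start, stop, strand and matched sequence for each hit.
# Columns are looked up by header name; comment lines and blank lines are skipped.
extract_sites() {
  awk -F'\t' -v OFS='\t' '
    /^#/ || NF == 0 { next }
    !seen_header {
      for (i = 1; i <= NF; i++) col[$i] = i   # map column name to its index
      seen_header = 1
      next
    }
    {
      print $(col["sequence_name"]), $(col["start"]), $(col["stop"]), \
            $(col["strand"]), $(col["matched_sequence"])
    }
  ' "$1"
}

# Placeholder path; only runs if the FIMO output exists.
if [ -f fimo_out/fimo.tsv ]; then
  extract_sites fimo_out/fimo.tsv > sites.tsv
fi
```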

By the way, for debugging it's more helpful for us to have the actual HTML files rather than a PDF. If we have the HTML files, there are several options on the page we can use to examine the motifs in more detail.