Why does centrimo skip sequences?

James Johnson

Jul 3, 2013, 1:46:51 AM
to meme-...@googlegroups.com
When running CentriMo, either on its own or as part of MEME-ChIP, it occasionally prints messages like the following:

Skipping sequence chrX:148668851,148669031 as its length (181) does not match the expected length (61).
Skipping sequence chrX:158943431,158943471 as its length (41) does not match the expected length (61).

Why does this happen? How can this be avoided?

James Johnson

Jul 3, 2013, 2:19:06 AM
to meme-...@googlegroups.com
CentriMo needs all sequences to be the same length!

This is because the statistical calculation assumes that a site for an uninteresting motif is equally likely to appear anywhere in the sequence, whereas sites for interesting motifs are likely to be biased toward a small section of the sequence (specifically the center, if not in local mode). Including sequences of different lengths introduces unavoidable biases into these calculations and may give wrong results, so we decided that CentriMo would accept a single length and skip any sequence of a different length.

If you run CentriMo without telling it what length you want, it will assume that the first sequence in the file has the correct length and skip every sequence whose length differs. Sometimes this is not what is wanted, so we also added the switch "--seqlen", which you can use to specify the desired sequence length, like so:
centrimo --seqlen 500 sequences.fna motifs.meme
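
To make the default concrete, here is a minimal Python sketch of the filtering rule as described above. It is not CentriMo's own code: the helper names read_fasta and split_by_length are made up for illustration, and the sketch only reports which sequences would be kept or skipped.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_by_length(records, seqlen=None):
    """Partition records into kept and skipped lists by sequence length.

    If seqlen is None, the length of the first record becomes the expected
    length (the default behaviour described in this post). Passing seqlen
    plays the role of the --seqlen switch.
    """
    kept, skipped = [], []
    for name, seq in records:
        if seqlen is None:
            seqlen = len(seq)  # first sequence sets the expected length
        target = kept if len(seq) == seqlen else skipped
        target.append((name, len(seq)))
    return seqlen, kept, skipped
```

For example, split_by_length(read_fasta(open("sequences.fna"))) would list the sequences dropped under the default, and adding seqlen=500 would show the effect of the command above.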

When implementing MEME-ChIP we realised that calling CentriMo with the default behaviour could lead to problems. If the first sequence had been truncated by some sequence feature, so that its length differed from most of the others, most of the remaining sequences were discarded! To work around this problem, MEME-ChIP runs its sequences through the script "fasta-most", which finds the most frequently occurring sequence length, and then passes that length to CentriMo with the "--seqlen" parameter.
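
If you want to pick a value for "--seqlen" yourself, the same idea can be sketched in a few lines of Python. This is only an illustration of "find the most frequent length", not the fasta-most script itself: the function names are hypothetical, and the tie-breaking rule (prefer the shorter length) is an arbitrary assumption of this sketch.

```python
from collections import Counter


def sequence_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def most_common_length(lines):
    """Return (length, count) for the most frequent sequence length.

    Ties are broken in favour of the shorter length; that choice is an
    assumption of this sketch, not documented fasta-most behaviour.
    """
    counts = Counter(sequence_lengths(lines))
    if not counts:
        raise ValueError("no sequences found")
    return max(counts.items(), key=lambda item: (item[1], -item[0]))
```

The returned length can then be passed to CentriMo with "--seqlen", so a truncated first sequence no longer decides which sequences are kept.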

Malcolm Cook

Nov 1, 2017, 12:29:11 PM
to MEME Suite Q&A
I think it is arguable that meme-chip, when called with -ccut, should pass that value as --seqlen to centrimo and ALSO pass in the "centered-seq" rather than the full sequence file. It does NOT do that in version 4.12.0.

Also, I think centrimo should NOT have -norc passed in unless it was passed to meme-chip.

Is this the best forum to get these suggestions vetted?

cegrant

Nov 7, 2017, 7:41:36 PM
to meme-...@googlegroups.com
> I think it is arguable that meme-chip, when called with -ccut, should pass that value as --seqlen to centrimo and ALSO pass in the "centered-seq" rather than the full sequence file. It does NOT do that in version 4.12.0.

This would drastically reduce Centrimo's statistical power, so it would not be a good idea. Centrimo works best with sequences of length 400-500bp. You need sequence from both peaks and non-peaks to reliably calculate the statistical significance of the identified, centrally enriched regions.

> Also, I think centrimo should NOT have -norc passed in unless it was passed to meme-chip.

In fact, MEME-ChIP does not pass -norc to Centrimo unless the "scan given strand only" checkbox under "Universal Options" is checked (when using the web application). When using the command-line version of MEME-ChIP, the default is to scan both strands, which means that MEME is run with the -revcomp option (scan reverse complement) and Centrimo is run without the '-norc' option. If you pass the '-norc' option to MEME-ChIP, then MEME will be run without the '-revcomp' option (scan only the forward strand), and Centrimo will be run with the '-norc' option.

> Is this the best forum to get these suggestions vetted?

You are welcome to ask in this group, or to send email directly to meme-...@uw.edu.

Malcolm Cook

Nov 8, 2017, 4:25:22 PM
to MEME Suite Q&A
The reason I recommend passing in the centered-seq file is that centrimo's approach requires all sequences to be the same length, and as written centrimo skips most input sequences if they vary in length, since it skips every sequence whose length differs from that of the first input sequence. Analyzing instead the centered-seq sequences, which have been clipped to length -ccut, and passing the -ccut value as --seqlen ensures that every input sequence of length at least -ccut contributes to the analysis.

In other words, unless I pass meme-chip sequences of the same length in the first place, almost all of them will be discarded from the centrimo analysis.  My proposal attempts to ameliorate this matter.  I realize that meme-chip has to strike certain compromises to be generally useful as a combined pipeline.  In this case I think my suggestion would strike a better balance.  Would you reconsider your response?

FWIW, I actually think the ideal may be to develop a variant approach to "central motif enrichment analysis" that did not require all the input sequences to be of the same length.

Regarding the handling of -norc, I see you are quite correct and my observation was not.  Thanks for setting me straight.

cegrant

Nov 8, 2017, 5:16:33 PM
to MEME Suite Q&A
> Would you reconsider your response?

No. MEME's statistical power is increased by trimming the sequences to the central peak, while Centrimo's analysis depends on including the flanking regions.

You can always trim or reorder your sequence file before submitting it to MEME-ChIP, or run Centrimo on the trimmed sequences after running MEME-ChIP.
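
As a rough illustration of the trimming option, the Python sketch below clips each sequence to a fixed-width central window and drops sequences that are too short, so that every remaining sequence has the same length. The function names and the choice to drop short sequences are assumptions of this sketch, not something MEME-ChIP does for you.

```python
def center_trim(seq, width):
    """Return the central `width` characters of seq, or None if seq is shorter.

    When the number of excess characters is odd, one more character is
    removed from the right end than from the left.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if len(seq) < width:
        return None
    start = (len(seq) - width) // 2
    return seq[start:start + width]


def trim_records(records, width):
    """Trim (name, sequence) pairs to equal length, dropping short sequences."""
    trimmed = []
    for name, seq in records:
        clipped = center_trim(seq, width)
        if clipped is not None:
            trimmed.append((name, clipped))
    return trimmed
```

Choose the width with the trade-off above in mind: a window that is too narrow removes the flanking sequence that Centrimo's analysis depends on.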

> FWIW, I actually think the ideal may be to develop a variant approach to "central motif enrichment analysis" that did not require all the input sequences to be of the same length.

Sure, but the statistical analysis behind Centrimo depends on the sequences having equal length (see https://academic.oup.com/nar/article/40/17/e128/2411117/Inferring-direct-DNA-binding-from-ChIP-seq for details). If you have an alternate analysis, by all means pursue it!

Malcolm Cook

Nov 14, 2017, 2:09:39 PM
to MEME Suite Q&A
OK.  Thanks for the discussion.  I will pursue the practice of trimming the sequences prior to submitting to meme-chip if I continue to use this useful pipeline.  