Control set selection for STREME in short sequence enrichment analysis

Yrjö Koski

Dec 17, 2025, 12:22:53 AM

to MEME Suite Q&A

Hi,

I’m trying to find sequence motifs enriched in a large set of short sequences that are associated with a DNA treatment in nanopore sequencing. The DNA is from a bacterial plasmid of around 10 kbp. I am having trouble setting up a proper control sequence set for my experiment. Here is a summary of my data analysis approach:

i) I group sequencing variables of interest (e.g., quality score or nanopore sequencing-derived values, such as signal mean level) by reference position in each sample (control and treated).

ii) I compare the treated sample and the control sample using a distributional test, such as Kolmogorov-Smirnov, for each reference position.

iii) I collect 9-mers centered at the reference positions that show a significant difference in the statistical test (after multiple testing correction). As an alternative, I have also collected 9-mers only for positions whose test scores fall in the top 5%.

iv) I run STREME to find sequence motifs that are enriched within the set of 9-mers that show a difference between the control and treated samples.
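To make step iii) concrete, here is a minimal sketch of how the positive set could be collected. It assumes that the per-position p-values from step ii) are already available as a list aligned with the reference, that the correction is Benjamini-Hochberg, and that the reference is treated as linear, so positions too close to either end are skipped. The function names and the `alpha` threshold are illustrative only, not my exact code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_kmers(reference, pvalues, k=9, alpha=0.05):
    """Collect (center, k-mer) pairs for positions with adjusted p < alpha.

    Assumes a linear reference: centers within k // 2 of either end are skipped.
    """
    half = k // 2
    adjusted = benjamini_hochberg(pvalues)
    kmers = []
    for pos, q in enumerate(adjusted):
        if q < alpha and half <= pos < len(reference) - half:
            kmers.append((pos, reference[pos - half:pos + half + 1]))
    return kmers
```

For the top-5% variant, the `q < alpha` condition would be replaced by a cutoff on the test statistic itself.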

I have tried two different approaches for the control sequence set, but I think they both have their own issues:

1. Building the control set from the positive sequences. As I understand from STREME’s documentation, if no control set is provided, the positive sequences are shuffled to create one. However, the resulting control set reflects the background frequencies of the positive set rather than those of the reference, which I think could introduce bias if the two differ substantially. For this reason, I would prefer not to build the control set from the positive sequences.

2. Using all 9-mers in the reference genome as a control set. I think this approach can capture the differences between the control and positive sets, but if a large number of 9-mers show a difference between the treated and control samples, the overlap between the positive and control sequences grows, which likely hides the enrichment. This approach seems to work best when I restrict the positive set to the sequences with the top 5% test scores, where the overlap between the sets is smaller.
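Approach 2 can be sketched roughly as follows. This is a simplified illustration rather than my exact pipeline: it assumes a linear reference, identifies positives by their center position, and the `exclude_positives` switch is a hypothetical parameter that toggles between keeping and dropping the positive positions in the control set. The two FASTA strings would be written to files and passed to STREME through its `--p` (primary) and `--n` (control) options.

```python
def reference_kmers(reference, k=9):
    """Map each valid center position to the k-mer centered on it."""
    half = k // 2
    return {c: reference[c - half:c + half + 1]
            for c in range(half, len(reference) - half)}


def split_sets(reference, positive_centers, k=9, exclude_positives=True):
    """Return (primary, control) dicts of center -> k-mer.

    With exclude_positives=False the control holds every reference k-mer,
    so it overlaps the primary set completely.
    """
    kmers = reference_kmers(reference, k)
    # Ignore centers that have no full-length k-mer around them.
    positives = {c for c in positive_centers if c in kmers}
    primary = {c: kmers[c] for c in sorted(positives)}
    control = {c: s for c, s in kmers.items()
               if not (exclude_positives and c in positives)}
    return primary, control


def to_fasta(kmers, prefix):
    """Format a center -> k-mer dict as FASTA text, one record per k-mer."""
    return "".join(f">{prefix}_{c}\n{s}\n" for c, s in kmers.items())
```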

Finally, I would like to get thoughts on these questions:

i) Is STREME the correct tool for this task, or could you suggest any alternatives?

ii) Which approach would you suggest for creating the control set? Do you have any suggestions other than the two that I have tried?

iii) When applying approach 2, would it make sense to exclude the positive sequences from the control set of all reference 9-mers, or would that introduce additional bias?

iv) Should I somehow take the test score values into account when creating the positive set? This could potentially be done by sampling from the reference 9-mers while using the test score values as weights.
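To illustrate the score-weighted sampling idea, here is a minimal sketch. It assumes one non-negative test score per reference 9-mer and samples with replacement using Python's `random.choices`. The function name, the sample size `n`, and the fixed seed are illustrative choices, not something I have settled on.

```python
import random


def weighted_positive_sample(kmers, scores, n, seed=0):
    """Draw n k-mers with replacement, with probability proportional to score.

    Assumes scores are non-negative and aligned one-to-one with kmers.
    """
    if len(kmers) != len(scores):
        raise ValueError("kmers and scores must have the same length")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.choices(kmers, weights=scores, k=n)
```

Because the sampling is with replacement, a high-scoring 9-mer can appear several times in the resulting positive set.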