SEA control sequences, background model and trimming


Zoe Wessely

Oct 29, 2023, 7:31:46 AM
to MEME Suite Q&A
Hello,

I am using SEA to search for motif enrichment in promoter sequences from co-expression clusters. Three observations currently confound me, and I would greatly appreciate any help in clarifying them.

1. Am I right in assuming that a zero-order background model only represents nucleotide frequencies? If the control sequences are derived from shuffled primary sequences, the nucleotide frequencies should remain identical.

2. I therefore tried providing the same control sequences for multiple distinct sets of primary promoter sequences. However, the number of false positive matches (in the control sequences) varied widely depending on the set of primary sequences. How is this possible?

3. All of the promoter sequences have identical lengths. However, SEA indicates that this is not the case and attempts to trim them - although this process is not always successful. As a result, either the Fisher's exact test or the Binomial test is used depending on the situation. The sequences have been masked using dust; could this be the cause of the issue?
Note that this problem arises with both shuffled and user-provided control sequences.

Here are some example output lines from a run with shuffled control sequences:
# Attempting to trim control hold-out sequences by 0.40% to average primary sequence length (1491.0).
# Using Binomial test for p-values because primary and control sequences have
#   different average lengths: 1497.6 vs. 1496.93. Bernoulli = 0.500112
# Warning: p-values will be inaccurate if primary and control
#          sequences have different length distributions.

# Attempting to trim control sequences by 0.01% to average primary sequence length (1499.0).
# Using Binomial test for p-values because primary and control sequences have
#   different average lengths: 1498.96 vs. 1498.96. Bernoulli = 0.500000
# Warning: p-values will be inaccurate if primary and control
#          sequences have different length distributions.

Thank you in advance for your assistance!

cegrant

Nov 6, 2023, 7:28:30 PM
to MEME Suite Q&A
A zero-order background model is indeed just the background nucleotide frequencies. However, when working with motifs, higher-order models may be needed; consider, for example, GC-rich regions. By default, SEA assumes a first-order background model generated by shuffling the input sequences in a way that preserves not only the nucleotide frequencies but also the 2-mer frequencies. If the control sequences you provide differ from the primary (target) sequences in their 2-mer frequencies, you could end up with the behavior troubling you in #2.
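
One way to check whether this applies to your data is to compare k-mer frequencies between the primary and control sets. The following is only a minimal sketch, not part of SEA: the function names `kmer_freqs` and `max_freq_diff` are hypothetical, and it assumes masked positions appear as `N` (soft-masked lowercase letters would be counted as ordinary bases here). The toy sequences show two sets that are identical at zero order yet very different in their 2-mers.

```python
from collections import Counter


def kmer_freqs(seqs, k):
    """Return relative frequencies of all k-mers observed in seqs."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: masked positions are written as N
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def max_freq_diff(seqs_a, seqs_b, k):
    """Largest absolute difference in k-mer frequency between two sets."""
    fa, fb = kmer_freqs(seqs_a, k), kmer_freqs(seqs_b, k)
    return max(abs(fa.get(m, 0.0) - fb.get(m, 0.0)) for m in set(fa) | set(fb))


# Toy data: same nucleotide composition, different dinucleotide composition.
primary = ["AAAACCCC"]
control = ["ACACACAC"]

print("max 1-mer difference:", max_freq_diff(primary, control, 1))
print("max 2-mer difference:", max_freq_diff(primary, control, 2))
```

If the 1-mer difference is near zero but the 2-mer difference is not, the control set matches the primary set only at zero order, which is the situation described above.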

To troubleshoot #3 I'd need to look at a copy of your input sequences. Can you forward us a copy?
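
In the meantime, it may be worth confirming the sequence lengths directly from the FASTA files. The sketch below is a hypothetical helper, not a MEME Suite tool: `fasta_lengths` counts every character on a sequence line, masked or not, so it reports the raw record length and cannot show how SEA itself measures length. The `length_ratio` function is only an observation about the log lines quoted above: the reported "Bernoulli" values are numerically consistent with the ratio of the average lengths, assuming equal numbers of primary and control sequences. Whether SEA computes the value this way is an assumption.

```python
from collections import Counter


def fasta_lengths(lines):
    """Map each FASTA record ID to its total sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def length_histogram(lengths):
    """Count how many records have each length."""
    return Counter(lengths.values())


def length_ratio(mean_primary, mean_control):
    """Ratio of average lengths; an assumed reading of the logged value."""
    return mean_primary / (mean_primary + mean_control)


# Toy records standing in for a real FASTA file.
example = [">p1 promoter", "ACGTNNNN", "ACGT", ">p2", "ACGTAC", ""]
lengths = fasta_lengths(example)
print(lengths)
print(length_histogram(lengths))
print(round(length_ratio(1497.6, 1496.93), 6))
```

For a real file, pass the open file handle as `lines`. A histogram with more than one key would mean the records are not all the same length.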
