Hello,
I am using SEA to search for motif enrichment in promoter sequences from co-expression clusters. Three observations currently confound me, and I would greatly appreciate any help in clarifying them.
1. Am I right in assuming that a zero-order background model only represents nucleotide frequencies? If the control sequences are derived from shuffled primary sequences, the nucleotide frequencies should remain identical.
2. I therefore supplied the same control sequences for several distinct sets of promoter sequences. However, the number of false positive matches (in the control sequences) varied widely depending on the set of primary sequences. How is this possible?
3. All of the promoter sequences have identical lengths. However, SEA reports that this is not the case and attempts to trim them, although the trimming does not always succeed. As a result, either Fisher's exact test or the Binomial test is used, depending on the situation. The sequences have been masked using dust; could this be the cause of the issue?
Note that this length problem occurs with both shuffled and user-provided control sequences.
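To make the assumption in point 1 concrete, here is a minimal Python sketch of what I mean by "zero-order" frequencies being unchanged by shuffling. It uses a plain letter shuffle, which may differ from the shuffling SEA performs internally, and the toy sequences and the function names `zero_order_freqs` and `shuffle_seq` are made up for illustration.

```python
import random
from collections import Counter

def zero_order_freqs(seqs):
    """Return single-letter frequencies pooled over all sequences."""
    counts = Counter()
    for s in seqs:
        counts.update(s.upper())
    total = sum(counts.values())
    return {base: n / total for base, n in sorted(counts.items())}

def shuffle_seq(seq, rng):
    """Shuffle the letters of one sequence; composition and length are preserved."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)

rng = random.Random(0)  # fixed seed so the run is reproducible
primary = ["ACGTTGCANNNNACGT", "GGGCCCATATATNNAA"]  # hypothetical toy sequences, N = masked
control = [shuffle_seq(s, rng) for s in primary]

primary_freqs = zero_order_freqs(primary)
control_freqs = zero_order_freqs(control)
print(primary_freqs == control_freqs)  # True: letter counts are untouched by shuffling
```

Under this simple model the pooled letter frequencies, including any masked N positions, are the same for the primary and shuffled sets, which is why I expected the background to be identical.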
Here are some example output lines from a run with shuffled control sequences:
# Attempting to trim control hold-out sequences by 0.40% to average primary sequence length (1491.0).
# Using Binomial test for p-values because primary and control sequences have
# different average lengths: 1497.6 vs. 1496.93. Bernoulli = 0.500112
# Warning: p-values will be inaccurate if primary and control
# sequences have different length distributions.
# Attempting to trim control sequences by 0.01% to average primary sequence length (1499.0).
# Using Binomial test for p-values because primary and control sequences have
# different average lengths: 1498.96 vs. 1498.96. Bernoulli = 0.500000
# Warning: p-values will be inaccurate if primary and control
# sequences have different length distributions.
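The reported Bernoulli values appear consistent with a simple ratio of the two average lengths, primary / (primary + control). This is inferred from the numbers in the log alone, under the assumption that both sets contain the same number of sequences; it is not a confirmed description of how SEA computes the value. The function name `bernoulli_from_lengths` is hypothetical.

```python
def bernoulli_from_lengths(primary_avg, control_avg):
    """Assumed relation: share of total sequence length contributed by the primary set,
    given equal numbers of primary and control sequences."""
    return primary_avg / (primary_avg + control_avg)

# Average lengths copied from the two log excerpts above.
first = bernoulli_from_lengths(1497.6, 1496.93)
second = bernoulli_from_lengths(1498.96, 1498.96)

print(f"{first:.6f}")   # 0.500112
print(f"{second:.6f}")  # 0.500000
```

If this reading is right, even the tiny difference of 1497.6 vs. 1496.93 is enough to move the value away from exactly 0.5, while the second case shows the Binomial test being chosen although the printed averages are equal.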
Thank you in advance for your assistance!