Thank you for your response!
1. Unfortunately, I'm still a bit confused by the score reported by MAST -hit_list.
If I understand correctly (based on
MAST's documentation, the
MAST paper, and general bioinformatics resources like the
Durbin textbook), the log of the ratio of the likelihood of observing the candidate sequence assuming it is an instance of the motif to the likelihood of observing it assuming it is just part of the background is simply calculated by summing up the relevant values in the score matrix.
> The match score of a motif with w columns and position i in a target sequence is defined as the sum of the scores for the letters in the sequence at positions i to i + w – 1 matched with columns 1 to w of the motif, respectively.
Example
Consider a sequence matching the consensus sequence of the CTCF motif
MA0139.1: TGGCCACCAGGGGGCGCTA
Scoring the motif
MA0139.1 against its own consensus sequence using FIMO gives a score of 28.6557:
motif_id motif_alt_id sequence_name start stop strand score p-value q-value matched_sequence
MA0139.1 CTCF test_MA0139.1 1 19 + 28.6557 1.67e-12 TGGCCACCAGGGGGCGCTA
Scoring the same motif against its own consensus sequence using MAST gives a score of 2871.34:
# All non-overlapping hits in all sequences from "/fasta/test_CTCF.fa".
# sequence_name (strand+/-)motif id alt_id hit_start hit_end score hit_p-value
test_MA0139.1 +1 MA0139.1 CTCF 1 19 2871.34 2.33e-12
# mast -bfile /genome/mm10.order_0.bfile -ev 1000 -hit_list /motifs/MA0139.1.meme /fasta/test_CTCF.fa
I understand that these scores are not useful for comparing across motifs (for which the p-values may be a better statistic to compare). But here, I'm using the same input sequence and the same motif. Why are the scores so different between what FIMO reports and MAST reports? Is there a way to "convert" the MAST scores to the same scale as output by FIMO?
Based on computing the score myself by summing up the relevant positions in the position score matrix, I get a value nearly identical to that output by FIMO.
Commands
fimo --text --bfile mm10.order_0.bfile MA0139.1.meme CTCF.fa
mast -bfile mm10.order_0.bfile -hit_list MA0139.1.meme CTCF.fa
where the input files are as follows:
mm10.order_0.bfile
# 0-order Markov frequencies from file /genome/mm10.fa
# seqs: 66 min: 1976 max: 195471971 avg: 41376845.1 sum: 2730871774 alph: DNA
# order 0
A 2.917e-01
C 2.083e-01
G 2.083e-01
T 2.917e-01
MA0139.1.meme
MEME version 4
ALPHABET= ACGT
strands: + -
Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF MA0139.1 CTCF
letter-probability matrix: alength= 4 w= 19 nsites= 913 E= 0
0.095290 0.318729 0.083242 0.502738
0.182913 0.158817 0.453450 0.204819
0.307777 0.053669 0.491785 0.146769
0.061336 0.876232 0.023001 0.039430
0.008762 0.989047 0.000000 0.002191
0.814896 0.014239 0.071194 0.099671
0.043812 0.578313 0.365827 0.012048
0.117325 0.474781 0.052632 0.355263
0.933114 0.012061 0.035088 0.019737
0.005488 0.000000 0.991218 0.003293
0.365532 0.003293 0.621295 0.009879
0.059276 0.013172 0.553238 0.374314
0.013187 0.000000 0.978022 0.008791
0.061538 0.008791 0.851648 0.078022
0.114411 0.806381 0.005501 0.073707
0.409241 0.014301 0.557756 0.018702
0.090308 0.530837 0.338106 0.040749
0.128855 0.354626 0.080396 0.436123
0.442731 0.199339 0.292952 0.064978
URL http://jaspar2022.genereg.net/matrix/MA0139.1
CTCF.fa
>test_MA0139.1
TGGCCACCAGGGGGCGCTA
2. What I'm trying to do
Why did I try MAST instead of FIMO? Here's the overall problem I want to solve:
I have a set of mouse genome sequences ~200 bp each that I would like to compute TF motif scores for - i.e., for each sequence, obtain a single value representing how strongly a given TF will recognize that sequence.
Naively, for each sequence, I could scan a motif across and report the highest score at any position in the sequence. (This would be akin to using -hit_list -best with MAST.)
However, I want to include 2 additional considerations:
- If a given sequence contains multiple copies of the TF motif, the score should be higher than just the score of a single motif (e.g., add the scores).
- Since the TF may recognize different motifs, I want to scan each sequence with all high quality motif matrices of the TF, and then report a summary score/statistic for that sequence.
- As an example, the 2 CTCF motifs MA1929.1 and MA1930.1 are highly similar except that the length of the "linker" region is different between the 2 high-information-content subsequences. Because MAST/FIMO tools assume fixed motif lengths, these should be considered distinct motifs recognized by CTCF.
Consequently, I used MAST -hit_list (with a very high e-value threshold) to find non-overlapping instances of CTCF motifs for each sequence. If a sequence had 2 or more "hits," I sum the scores of the hits to obtain a final single value per sequence.
Any suggestion on how to better solve this problem would also be appreciated.