MAST hit_list score meaning

31 views
Skip to first unread message

Liam

unread,
Mar 19, 2025, 12:06:03 AMMar 19
to MEME Suite Q&A
The scores reported from MAST -hit_list are much larger (in the thousands) than simply based on summing the appropriate positions of the PWMs. Comparing the MAST hit_list scores to the scores I would expect (via Biopython) suggested that they are highly correlated, just that the scales are different.

Can you please describe how to interpret the MAST hit_list scores? What scales/units are they in?

Thanks!

cegrant

unread,
Mar 19, 2025, 12:54:51 AMMar 19
to MEME Suite Q&A
The score reported by MAST is the log of the ratio of the likelihood of observing the candidate sequence assuming it is an instance of the motif to the likelihood of observing it assuming it is just part of the background. Note that the raw scores are not calibrated. That is, you can only use them for relative comparisons of matches to the same motif using the same background model. You can't use them to compare matches to different motifs. If you are comparing matches to different motifs you should be looking at the p-value and not the raw score.

Is there some reason that you are using MAST -hit_list? FIMO provides all the same information, but also provides q-values, which are p-values corrected for multiple testing, and would generally be the preferred tool.

Liam

unread,
Mar 19, 2025, 5:06:38 PMMar 19
to MEME Suite Q&A
Thank you for your response!

1. Unfortunately, I'm still a bit confused by the score reported by MAST -hit_list.

If I understand correctly (based on MAST's documentation, the MAST paper, and general bioinformatics resources like the Durbin textbook), the log of the ratio of the likelihood of observing the candidate sequence assuming it is an instance of the motif to the likelihood of observing it assuming it is just part of the background is simply calculated by summing up the relevant values in the score matrix.

> The match score of a motif with w columns and position i in a target sequence is defined as the sum of the scores for the letters in the sequence at positions i to i + w – 1 matched with columns 1 to w of the motif, respectively.

Example

Consider a sequence matching the consensus sequence of the CTCF motif MA0139.1: TGGCCACCAGGGGGCGCTA

Scoring the motif MA0139.1 against its own consensus sequence using FIMO gives a score of 28.6557:

motif_id  motif_alt_id  sequence_name  start  stop  strand  score    p-value   q-value  matched_sequence
MA0139.1  CTCF          test_MA0139.1  1      19    +       28.6557  1.67e-12           TGGCCACCAGGGGGCGCTA

Scoring the same motif against its own consensus sequence using MAST gives a score of 2871.34:

# All non-overlapping hits in all sequences from "/fasta/test_CTCF.fa".
# sequence_name (strand+/-)motif id alt_id hit_start hit_end score hit_p-value
test_MA0139.1 +1 MA0139.1 CTCF 1 19  2871.34 2.33e-12
# mast -bfile /genome/mm10.order_0.bfile -ev 1000 -hit_list /motifs/MA0139.1.meme /fasta/test_CTCF.fa

I understand that these scores are not useful for comparing across motifs (for which the p-values may be a better statistic to compare). But here, I'm using the same input sequence and the same motif. Why are the scores so different between what FIMO reports and MAST reports? Is there a way to "convert" the MAST scores to the same scale as output by FIMO?

Based on computing the score myself by summing up the relevant positions in the position score matrix, I get a value nearly identical to that output by FIMO.

Commands

fimo --text --bfile mm10.order_0.bfile MA0139.1.meme CTCF.fa

mast -bfile mm10.order_0.bfile -hit_list MA0139.1.meme CTCF.fa

where the input files are as follows:

mm10.order_0.bfile

# 0-order Markov frequencies from file /genome/mm10.fa
# seqs: 66    min: 1976    max: 195471971    avg: 41376845.1    sum: 2730871774    alph: DNA
# order 0
A 2.917e-01
C 2.083e-01
G 2.083e-01
T 2.917e-01

MA0139.1.meme

MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF MA0139.1 CTCF
letter-probability matrix: alength= 4 w= 19 nsites= 913 E= 0
 0.095290  0.318729  0.083242  0.502738
 0.182913  0.158817  0.453450  0.204819
 0.307777  0.053669  0.491785  0.146769
 0.061336  0.876232  0.023001  0.039430
 0.008762  0.989047  0.000000  0.002191
 0.814896  0.014239  0.071194  0.099671
 0.043812  0.578313  0.365827  0.012048
 0.117325  0.474781  0.052632  0.355263
 0.933114  0.012061  0.035088  0.019737
 0.005488  0.000000  0.991218  0.003293
 0.365532  0.003293  0.621295  0.009879
 0.059276  0.013172  0.553238  0.374314
 0.013187  0.000000  0.978022  0.008791
 0.061538  0.008791  0.851648  0.078022
 0.114411  0.806381  0.005501  0.073707
 0.409241  0.014301  0.557756  0.018702
 0.090308  0.530837  0.338106  0.040749
 0.128855  0.354626  0.080396  0.436123
 0.442731  0.199339  0.292952  0.064978
URL http://jaspar2022.genereg.net/matrix/MA0139.1


CTCF.fa

>test_MA0139.1
TGGCCACCAGGGGGCGCTA

2. What I'm trying to do

Why did I try MAST instead of FIMO? Here's the overall problem I want to solve:

I have a set of mouse genome sequences ~200 bp each that I would like to compute TF motif scores for - i.e., for each sequence, obtain a single value representing how strongly a given TF will recognize that sequence.

Naively, for each sequence, I could scan a motif across and report the highest score at any position in the sequence. (This would be akin to using -hit_list -best with MAST.)

However,  I want to include 2 additional considerations:
  1. If a given sequence contains multiple copies of the TF motif, the score should be higher than just the score of a single motif (e.g., add the scores).
  2. Since the TF may recognize different motifs, I want to scan each sequence with all high quality motif matrices of the TF, and then report a summary score/statistic for that sequence.
    1. As an example, the 2 CTCF motifs MA1929.1 and MA1930.1 are highly similar except that the length of the "linker" region is different between the 2 high-information-content subsequences. Because MAST/FIMO tools assume fixed motif lengths, these should be considered distinct motifs recognized by CTCF.

Consequently, I used MAST -hit_list (with a very high e-value threshold) to find non-overlapping instances of CTCF motifs for each sequence. If a sequence had 2 or more "hits," I sum the scores of the hits to obtain a final single value per sequence.

Any suggestion on how to better solve this problem would also be appreciated.

cegrant

unread,
Mar 31, 2025, 4:20:51 PMMar 31
to MEME Suite Q&A
Thanks for clarifying what you are trying to do. 

As far as the computation of the motif match scores reported by MAST: in order to compute the p-values for motif match scores we apply a dynamic programming algorithm, this requires that the raw log-likelihood ratios for each position get scaled and converted to integers. For historical reasons the match score that MAST reports is the sum of these scaled and integerized log-likelihood ratios, while FIMO reports the sum of the raw log-likelihood rations. The MAST score for a single position is

100 * log2(p/b)

where p is the probability of observing the given base/residue in the given position of the motif, and b is the probability of observing the given base/residue assuming it belongs to the background. You can see the code for this in the convert_freqs_into_scores() function in src/motif.c in the MEME Suite source code.

I'd note that MAST doesn't really satisfy your first requirement, but may partially satisfies your 2nd requirement. The MAST sequence score is computed by finding the single best match to each motif in a sequences. The p-values for these best matches are then multiplied together. The product of the p-values should follow a Chi-square distribution, and MAST uses that to report a p-value for the sequence. The MAST score will not be affected by a sequence having multiple instances of a single motif. Only the highest scoring instance of that motif is considered.  

You should also be aware that the hit list is generated before any of the sequence scores are computed. The high scoring motif matches listed in the hit list may not be associated with the high scoring sequences in the MAST HTML output.

The procedure you list in 2.1 seems reasonable, but there is no reason you couldn't use FIMO for that, other than filtering out the overlapping matches, which would be pretty simple to do as a post-processing step with FIMO.
Reply all
Reply to author
Forward
0 new messages