Regarding FIMO background and masking target sequence

Bharata Kalbuaji

Jan 18, 2019, 4:11:06 AM
to MEME Suite Q&A
Hello,

I have several questions about using FIMO for motif scanning. I have used it for a while, but I still don't understand some parts of it.

1. Regarding low-complexity or repetitive regions. I noticed that many FIMO matches fall in repetitive regions. Does this mean anything for protein binding site prediction? These low-complexity matches seem to skew the q-value calculation, so that almost all significant results come from matches in low-complexity regions. After I masked these low-complexity regions, I got a completely different result with more matches. With a q-value threshold of 0.1, I got ~2,299 significant matches, but after masking the FASTA file I got ~46,000 significant matches.

2. Regarding the background frequencies for FIMO. I have tested several promoter regions, for example 1000 nt, 2000 nt, and up to 5000 nt upstream of the 1st exon. I noticed that, because of the difference in length, the background frequencies of A, C, G, and T (from fasta-get-markov) also differ. As a result, the same PWM on the same target sequence gets different p-values and q-values. So, should I use the same background frequencies for all target sequences so that my p-value and q-value calculations are consistent?


3. Is there any paper or list of papers that have used FIMO? I want to know what a good way to select the background for FIMO is. I have read one paper that uses FIMO in its published method, but it seems the authors used all default parameters, which means a uniform background.

4. Regarding the uniform background (A, C, G, and T frequencies all 0.25). What is the justification for using uniform background frequencies rather than those of the whole genome or the target sequences? From what I understand of how FIMO uses log-odds to calculate the score and p-value, the background serves as a measure of how probable the motif is compared to random sequence. What is the actual best practice for determining the background frequencies? What does the p-value mean if I use a uniform background, a whole-genome background, or a target-sequence background? I have read the paper, but I could not find a clear explanation. As far as I understand, the purpose is to compare the probability of the motif against random sequence. So, does this mean we should always use a uniform background? If I choose whole-genome or target-sequence background frequencies, how does this affect the test against randomness?

Thank you.

Bharata Kalbuaji

Jan 18, 2019, 4:27:02 AM
to MEME Suite Q&A
Sorry, a correction for point 3: the default background in FIMO is from NRDB, which I think reflects whole-genome frequencies. Correct me if I'm wrong.

cegrant

Jan 29, 2019, 3:41:58 PM
to meme-...@googlegroups.com
I noticed that FIMO result actually match a lot with repeating region. Does this mean anything to the protein binding site prediction? 

FIMO simply scores matches to the Position Weight Matrix (PWM) that you provide in the motif file. It doesn't have any "biological" knowledge that would allow it to distinguish repeats and low-complexity sequences from actual TF binding sites. You should definitely mask repeats and low-complexity regions in your sequences before analyzing them with any of the tools in the MEME Suite. Note that some masking tools offer two styles of masking: one marks repeats and low-complexity regions using lowercase letters, and the other replaces them with 'X'. You'll want the latter style. I'd double-check the process by which you masked: the number of significant matches should go down significantly. The fact that it went up for you indicates that there might have been something wrong with your procedure.
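To make the two masking styles concrete, here is a minimal Python sketch that converts soft-masked FASTA lines (repeats in lowercase) into the hard-masked style (repeats replaced with 'X'). The function names `hard_mask` and `hard_mask_fasta` and the example record are made up for illustration, and the sketch assumes your masking tool has already marked repeats in lowercase.

```python
def hard_mask(sequence: str, mask_char: str = "X") -> str:
    """Replace every lowercase (soft-masked) residue with mask_char."""
    return "".join(mask_char if ch.islower() else ch for ch in sequence)


def hard_mask_fasta(lines, mask_char="X"):
    """Yield FASTA lines with sequence lines hard-masked; headers are untouched."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line  # header line: keep as-is, even if it has lowercase text
        else:
            yield hard_mask(line, mask_char)


# Hypothetical soft-masked record: the lowercase runs stand for repeats.
example = [">promoter_1 demo", "ACGTacacacacGGCT", "ttttAACC"]
masked = list(hard_mask_fasta(example))
for line in masked:
    print(line)
```

Comparing the number of masked residues before and after this step is a quick sanity check that the masking did what you expected.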

So, should I use same background frequency for all target sequence so that my result is consistent for p-value and q-value calculation?

Yes, assuming you are scanning promoter regions that are biologically similar. If the promoters are dissimilar (say, from different organisms), then you have to use some sort of compromise for the background. The most pragmatic choice is simply to generate the background model from the sequences you are scanning.

Is there any paper or list of paper that have used FIMO? 

FIMO has over 800 citations now. You can find a list of citations on the Web of Science. Unfortunately, there is no guarantee that a particular paper has used FIMO in an optimal way. As you point out, many papers simply use the default background, which is derived from an older version of the NRDB. Using the default generally works, in that it finds the best matches to the motif, but it isn't necessarily optimal; that is, it may miss some of the weaker matches.

The trick in choosing the sequences used to generate a background model is that they should be biologically similar to the sequences you wish to scan for motifs, but should contain no (or few) actual instances of the motifs. This is rarely easy to sort out, since we don't know in advance where the "true" (i.e. biologically significant) motif occurrences are. Therefore you have to use your judgment as to what a reasonable compromise is. Depending on the sequence data you are scanning, you might use the frequencies for the full genome, the NRDB average values, or simply the uniform frequencies. If you are scanning a large number of sequences where you expect the number of motif matches to be small (typical of promoters), then the most pragmatic choice is usually to use the observed frequencies for the sequences you are scanning. However, this won't work very well if your sequences contain repeats or low-complexity regions. This is another reason why you should mask your sequences for repeats and low-complexity regions.
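As a rough illustration of "observed frequencies for the sequences you are scanning", the Python sketch below computes a zero-order background (single-letter frequencies) from a set of sequences while skipping masked positions. It is not a reimplementation of fasta-get-markov, whose exact counting rules (for example, how it treats the reverse strand) may differ. The function name `background_frequencies` and the two toy sequences are assumptions for the example.

```python
from collections import Counter


def background_frequencies(sequences, alphabet="ACGT"):
    """Zero-order background: frequency of each uppercase base on the given strand.

    Lowercase (soft-masked) letters and hard-mask characters such as X or N
    are skipped, so masked repeats do not distort the estimate.
    """
    counts = Counter()
    for seq in sequences:
        for ch in seq:
            if ch in alphabet:
                counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unmasked residues found")
    return {base: counts[base] / total for base in alphabet}


# Toy hard-masked promoters (hypothetical data).
promoters = ["ACGTXXXXGGCC", "AATTNNNNGGCC"]
bg = background_frequencies(promoters)
print(bg)
```

Pooling all promoters into one call, rather than computing a separate background per promoter length, gives a single model that can be reused for every scan (FIMO accepts a background file through its `--bgfile` option), which keeps p-values comparable across runs.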

What is the justification of using uniform background frequency rather than whole genome frequency or the target sequence?

It's simply a convenient choice that is not too far off for most regions of most organisms. If you are scanning regions of high GC content or sequences from organisms that have high AT content, then it wouldn't be the best choice. There is no automatic perfect choice for a background model. You have to use your judgment as to what model would be appropriate. Again, the ideal is to derive the background model from sequences that are biologically similar to the ones you are scanning, but that don't contain many instances of the motifs you are scanning for.
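To see why the background matters, here is a small Python sketch of a plain log-odds score, the sum over motif positions of log2(motif probability / background probability). It only illustrates the dependence on the background; FIMO's real scoring and p-value computation involve more steps than this. The toy 4-column PWM, the `gc_rich` frequencies, and the function name `log_odds_score` are invented for the example.

```python
import math


def log_odds_score(pwm, site, background):
    """Sum of log2(p_motif / p_background) over the positions of a site."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif width")
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(pwm, site)
    )


# Toy GC-rich motif: one dict of base probabilities per motif position.
pwm = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.85, "G": 0.05, "T": 0.05},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}  # assumed GC-rich region

score_uniform = log_odds_score(pwm, "GCGC", uniform)
score_gc_rich = log_odds_score(pwm, "GCGC", gc_rich)
print(score_uniform, score_gc_rich)
```

The same site scores lower against the GC-rich background, because G and C are less surprising there. This is the sense in which a uniform background can overstate the significance of GC-rich matches in GC-rich regions.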