Hello,
I have several questions about using FIMO for motif scanning. I have used it for a while, but there are still parts of it I don't understand.
1. Low-complexity and repeat regions. I noticed that many FIMO matches fall in repeat regions. Does this mean anything for predicting protein binding sites? These low-complexity matches seem to skew the q-value calculation, so that almost all significant results come from low-complexity regions. After masking these regions, I got a completely different result with many more matches: with a q-value threshold of 0.1, I got ~2,299 significant matches before masking and ~46,000 after masking the FASTA file.
2. Background frequencies for FIMO. I have tested several promoter definitions: 1000 nt upstream of the first exon, 2000 nt upstream, and so on up to 5000 nt upstream. Because the regions differ in length, the A/C/G/T background frequencies (from fasta-get-markov) also differ. As a result, the same PWM scanned against the same target sequence gets different p-values and q-values. Should I use the same background frequencies for all target sequences so that the p-value and q-value calculations are consistent?
3. Is there a paper, or a list of papers, that used FIMO? I want to know a good way to choose the background for FIMO. I have read one paper that uses FIMO in its published method, but the authors seem to use all default parameters, which I believe means a uniform background.
4. Uniform background frequencies (A, C, G, and T each at 0.25). What is the justification for using a uniform background rather than whole-genome frequencies or the frequencies of the target sequences? As I understand it, FIMO computes a log-odds score and derives the p-value from it, so the background serves to measure how likely a motif match is compared with random sequence. What is the actual best practice for choosing the background frequencies? How should the p-value be interpreted under a uniform background, a whole-genome background, and a target-sequence background? I have read the paper but could not find a clear explanation. Does this mean we should always use a uniform background? If I choose the whole-genome or target-sequence background instead, how does that affect the test against random sequence?
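To make questions 2 and 4 concrete, here is a minimal Python sketch of how I understand the background to enter the score. It is not FIMO's or fasta-get-markov's actual code: the function names, the 4-position PWM, and the two short sequences are all made up for illustration, and it assumes a zero-order background, counts only the given strand, and uses no pseudocounts.

```python
import math
from collections import Counter

def background_freqs(seqs):
    """Zero-order A/C/G/T frequencies over a list of sequences.

    Hypothetical helper: counts only the given strand and skips any
    symbol other than A, C, G, T (for example N from masking).
    """
    counts = Counter()
    for seq in seqs:
        for base in seq.upper():
            if base in "ACGT":
                counts[base] += 1
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def log_odds_score(site, pwm, bg):
    """Sum over positions of log2(P(base | motif) / P(base | background))."""
    return sum(math.log2(pwm[i][base] / bg[base]) for i, base in enumerate(site))

# Made-up 4-position motif that prefers A, T, A, T (not a real PWM).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]

uniform_bg = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Made-up AT-rich "promoter" sequences standing in for my target set.
target_bg = background_freqs(["AATTATATGCAT", "TTAATACGAATT"])

site = "ATAT"
score_uniform = log_odds_score(site, toy_pwm, uniform_bg)
score_target = log_odds_score(site, toy_pwm, target_bg)

print("background from targets:", target_bg)
print("score with uniform background:", round(score_uniform, 3))
print("score with target background: ", round(score_target, 3))
```

In this toy case the same site scores lower against the AT-rich background than against the uniform one, because an A/T match is less surprising when A and T are already common. If my understanding is right, this is the same effect I see when the background changes with promoter length.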
Thank you.