Scanning known motif Position Specific Scoring Matrices in a single promoter region


meseret...@gmail.com

Nov 12, 2017, 5:18:29 PM
to MEME Suite Q&A
Hi,

I want to scan motif PSSMs against the 2500 bp upstream promoter region of a gene using FIMO. I have no particular biological data or question; I just want to check which known motifs occur in the promoter region of the gene. If it works, I want to do it for all genes in the genome, one promoter region at a time. Is FIMO also made for this purpose? If not, any suggestions on how to achieve what I want?

Best,
Mesi

cegrant

Nov 14, 2017, 2:29:33 PM
to MEME Suite Q&A
FIMO is certainly an appropriate tool for this task. There are a couple of caveats, though. First, you'll need to provide a database of motif PSSMs in the MEME motif format. You can download a collection of appropriately formatted databases here. Second, be aware that by scanning with multiple motifs you are introducing a multiple testing problem (also illustrated here). This means that the p-values of some chance motif matches may appear highly significant. The FIMO results include a q-value, which is a p-value corrected for multiple testing. However, the FIMO q-value is only corrected for having scanned for a match at each position in a collection of sequences; FIMO does not correct for having scanned with multiple motifs. You may want to apply a Bonferroni correction to your threshold for statistical significance.

Teshome Mulugeta

Nov 17, 2017, 7:51:57 AM
to MEME Suite Q&A
Thank you for the clarification. Can we avoid the multiple testing problem by providing one promoter sequence per gene (2500 bp long) and one MEME-formatted PSSM? We can submit thousands of jobs to a cluster to compute the one-to-one scans if the approach is correct and doesn't introduce another statistical problem.

Best,
Teshome

CharlesEGrant

Nov 20, 2017, 7:29:19 PM
to MEME Suite Q&A
No, that won’t fix the problem. This isn’t an issue with the FIMO software. It’s an inherent limitation of using p-values as a measure of statistical significance. 

You should refer to the paper on the multiple testing problem that I linked to in my answer on the Q&A site for a detailed review of the problem. But briefly, a p-value is the probability that you will observe a score at least as good as the given score entirely due to chance. Traditionally, a p-value of 0.01 is used as a threshold for statistical significance. That means that the probability of seeing an entirely chance match scoring at least as well is 1 in 100. However, that’s really only applicable to a single test of a single motif. If instead you apply that same p-value test 100 times in a single experiment, you should expect about one entirely random match scoring at least as well, and with more tests a chance match becomes all but certain. It doesn’t matter whether FIMO runs the experiments all at once or you run them one at a time; you are still using the p-value threshold to identify the significant matches.
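
To put rough numbers on this, here is a minimal Python sketch. It assumes the tests are independent, which is a simplification (overlapping positions and similar motifs are correlated), and the function names are made up for illustration. It computes the expected number of chance hits and the probability of at least one chance hit when the same p-value threshold is applied repeatedly.

```python
def expected_chance_hits(p_threshold, n_tests):
    """Expected number of purely random matches passing the threshold."""
    return p_threshold * n_tests


def prob_at_least_one_chance_hit(p_threshold, n_tests):
    """Probability of one or more random matches passing the threshold,
    ASSUMING the n_tests tests are independent (a simplification)."""
    return 1.0 - (1.0 - p_threshold) ** n_tests


if __name__ == "__main__":
    # Same nominal threshold, increasing number of tests.
    for n in (1, 100, 1000):
        print(
            n,
            expected_chance_hits(0.01, n),
            round(prob_at_least_one_chance_hit(0.01, n), 3),
        )
```

Splitting the scan into thousands of separate jobs leaves `n_tests` unchanged, so both quantities come out the same as for one large run.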

The FIMO q-value corrects for the fact that the p-value test has been run at thousands of positions in your sequence data, but it doesn’t correct for scanning with multiple motifs. You should use the FIMO q-value rather than the p-value as your measure of statistical significance, but you probably should also correct for the fact that you are scanning with hundreds of different motifs. The simplest correction is the Bonferroni correction I mentioned in the Q&A post. Essentially, take your nominal q-value threshold, say 0.01, divide it by twice the number of motifs you scan with, and use the result as your threshold of statistical significance. Use twice the number of motifs because each motif is scanned in both the forward and reverse orientation.
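
The correction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the toy inputs are invented, and the code assumes tab-separated FIMO output with a header row containing `motif_id` and `q-value` columns. Check the column names against the output of your FIMO version.

```python
import csv


def count_motifs(meme_text):
    """Count motifs in MEME-format text; each motif record begins with a 'MOTIF' line."""
    return sum(1 for line in meme_text.splitlines() if line.split()[:1] == ["MOTIF"])


def bonferroni_threshold(nominal, n_motifs, both_strands=True):
    """Divide the nominal threshold by the number of motif scans performed."""
    n_scans = n_motifs * (2 if both_strands else 1)
    return nominal / n_scans


def significant_matches(fimo_tsv_text, threshold, q_column="q-value"):
    """Return rows whose q-value is at or below the threshold.

    ASSUMPTION: tab-separated text with a header row naming q_column;
    blank lines and '#' comment lines are skipped, as are rows without a q-value.
    """
    lines = [ln for ln in fimo_tsv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if row.get(q_column) and float(row[q_column]) <= threshold]


# Invented toy inputs, for illustration only (not real motifs or real FIMO results).
TOY_MEME = "MEME version 4\n\nMOTIF M1 alt1\n\nMOTIF M2 alt2\n"
TOY_FIMO = (
    "motif_id\tsequence_name\tq-value\n"
    "M1\tgeneA_promoter\t1e-06\n"
    "M2\tgeneA_promoter\t0.004\n"
)

if __name__ == "__main__":
    threshold = bonferroni_threshold(0.01, count_motifs(TOY_MEME))
    for row in significant_matches(TOY_FIMO, threshold):
        print(row["motif_id"], row["sequence_name"], row["q-value"])
```

If FIMO is run on the given strand only, pass `both_strands=False` so the threshold is divided by the number of motifs rather than twice that number.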
