FIMO only calling motifs on plus strand


Kunle Demuren

Jul 11, 2016, 1:56:39 PM
to MEME Suite Q&A
Hi,

I'm using FIMO to find all the matches for a particular motif (TBX5, from the JASPAR core database) in the mouse genome (mm9), and on both the command-line version and the web version I only get + strand hits. I'm using all the default options, except relaxing the p-value threshold to 0.1. I've looked around and can't figure out why this might be happening, any advice? Input motif file pasted below.

Thanks,
Kunle


MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies (from uniform background):
A 0.25000 C 0.25000 G 0.25000 T 0.25000 

MOTIF MA0807.1 TBX5

letter-probability matrix: alength= 4 w= 8 nsites= 2380 E= 0
  0.719012  0.035472  0.163013  0.082503
  0.099492  0.080440  0.763760  0.056308
  0.000000  0.000000  0.984716  0.015284
  0.020969  0.061044  0.077353  0.840634
  0.014746  0.000000  0.985254  0.000000
  0.067726  0.094064  0.084030  0.754181
  0.036406  0.177983  0.521237  0.264374
  0.719872  0.030327  0.136872  0.112929

CharlesEGrant

Jul 11, 2016, 8:32:33 PM
to meme-...@googlegroups.com
I was able to duplicate the issue you are seeing.

The problem is that you have a relatively short motif (8 positions). That means that at any given position in the genome, there is roughly a 1 in 100,000 chance of having a perfect match to your motif, entirely by chance. That may sound unlikely, but when you take that chance at every position in the mouse genome, you are going to get tens of thousands of perfect matches to your motif simply by chance. This is an example of the multiple testing problem. A p-value threshold of 0.1 would also allow less-than-perfect matches, which will result in millions of matches. FIMO can't hold millions of matches in memory, so it periodically purges the matches with the least significant p-values. I think that if you look at your results you'll see that all of them are the sequence 'AGGTGTGA', which is the best possible match to your motif, with a p-value of 1.36e-05. As it happens, after all the purging, only the matches on the positive strand were retained.
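The arithmetic behind that estimate can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not FIMO's own scoring: it reads the consensus off the matrix posted above, assumes a uniform background, and assumes a genome size of about 2.7 billion bases for mm9 (an approximation of mine, not a figure from this thread). Both strands are counted.

```python
# Rows of the TBX5 letter-probability matrix from the post (columns: A, C, G, T).
MATRIX = [
    (0.719012, 0.035472, 0.163013, 0.082503),
    (0.099492, 0.080440, 0.763760, 0.056308),
    (0.000000, 0.000000, 0.984716, 0.015284),
    (0.020969, 0.061044, 0.077353, 0.840634),
    (0.014746, 0.000000, 0.985254, 0.000000),
    (0.067726, 0.094064, 0.084030, 0.754181),
    (0.036406, 0.177983, 0.521237, 0.264374),
    (0.719872, 0.030327, 0.136872, 0.112929),
]
ALPHABET = "ACGT"


def consensus(matrix):
    """Pick the most probable letter at each motif position."""
    return "".join(ALPHABET[row.index(max(row))] for row in matrix)


def expected_chance_matches(p_per_position, genome_bases, strands=2):
    """Expected number of positions that match purely by chance."""
    return p_per_position * genome_bases * strands


best = consensus(MATRIX)
# Chance of one specific 8-mer at a given position under a uniform background.
p_exact = 0.25 ** len(best)
GENOME_BASES = 2.7e9  # assumed approximate size of mm9; not from the thread

print(best)
print(f"{p_exact:.3g}")
print(f"{expected_chance_matches(p_exact, GENOME_BASES):,.0f}")
```

The per-position probability comes out at 1 in 65,536, the same order of magnitude as the 1.36e-05 p-value FIMO reports, and the expected count of chance perfect matches lands in the tens of thousands.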

If you use the '--text' option for FIMO, it will write out the matches as it finds them, and you will see matches on both strands. However, you will end up with millions of matches, almost all of which are simply due to chance. Any biologically significant matches will be completely drowned out by the chance matches. Normally the way to work around this is to pick a more stringent p-value threshold, or preferably, to use a q-value threshold instead. However, your motif is so short that even the best possible match to your motif only has a p-value of 1.36e-05, and a q-value of 1.0.
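To see why the q-value comes out at 1.0, here is a rough Python sketch using a Benjamini-Hochberg style estimate. This is an approximation for illustration and not necessarily the exact procedure FIMO uses. The number of tests and the number of best-scoring hits are assumed values, chosen to describe a genome where perfect matches occur about as often as chance predicts.

```python
def bh_qvalue_for_best(p_value, num_tests, num_hits_at_or_below):
    """Benjamini-Hochberg style q-value estimate for the best-scoring hits.

    q = p * (number of tests) / (number of hits with p-value <= p), capped at 1.
    """
    return min(1.0, p_value * num_tests / num_hits_at_or_below)


P_BEST = 1.36e-05      # best possible p-value for this motif, from the thread
NUM_TESTS = 2 * 2.7e9  # assumed: both strands of a roughly 2.7 Gb genome

# Assumed: the genome holds about as many perfect matches as chance predicts.
hits_by_chance = P_BEST * NUM_TESTS

q_chance = bh_qvalue_for_best(P_BEST, NUM_TESTS, hits_by_chance)
# Hypothetical contrast: perfect matches 100 times more common than chance.
q_enriched = bh_qvalue_for_best(P_BEST, NUM_TESTS, 100 * hits_by_chance)

print(q_chance, q_enriched)
```

Only in the hypothetical case where perfect matches were far more frequent than chance predicts would the q-value drop well below 1, and a short motif scanned across a whole genome does not behave that way.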

By itself, the motif doesn't provide FIMO enough statistical power to pinpoint the biologically significant matches in the full genome. You'll need to narrow down the regions you check for the motif, or provide some sort of external priors as described in this paper: Gabriel Cuellar-Partida, Fabian A. Buske, Robert C. McLeay, Tom Whitington, William Stafford Noble, and Timothy L. Bailey, "Epigenetic priors for identifying active transcription factor binding sites", Bioinformatics 28(1): 56-62, 2012. The online version of FIMO does provide epigenetic priors based on ENCODE DNase I hypersensitivity for three tissue types for mm9. Click on the "Enable tissue/cell-specific scanning" check box. If those tissues are not suitable, you'll have to create your own priors using the create-priors utility.
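A rough way to judge how far the search space has to shrink is to ask how many bases can be scanned before a perfect match is no longer surprising. The Python sketch below uses a simple Bonferroni-style bound, which is my simplification and not something FIMO computes; the 0.05 level is an assumed conventional cutoff.

```python
def max_positions(p_best, alpha=0.05, strands=2):
    """Largest number of bases that can be scanned on `strands` strands
    while the expected number of chance best-score hits stays below alpha."""
    return int(alpha / (p_best * strands))


P_BEST = 1.36e-05  # best possible p-value for this motif, from the thread

print(max_positions(P_BEST))
```

With this motif alone, the bound works out to a region on the order of a couple of kilobases, which is why priors or some other external evidence are needed at genome scale.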

