FIMO

Rocky Parida

May 3, 2017, 10:46:27 AM

to MEME Suite Q&A

Hi All

I am trying to find motif occurrences of certain motifs in my whole genome using FIMO.

As a test, I took a set of 14 known motifs from JASPAR and searched all the upstream sequences in my genome using a uniform background, but did not find a single occurrence with a q-value < 0.05.

But when I reduce the number of input sequences I find some occurrences.

Does this have to do with the number of input sequences and the false discovery rate?

If so, how do I go about finding all the occurrences of a motif in the whole genome?

Please let me know.

Thanking you

Rocky

CharlesEGrant

May 3, 2017, 3:44:13 PM

to meme-...@googlegroups.com

But when I reduce the number of input sequences I find some occurrences.
Does this have to do with the number of input sequences and the false discovery rate?

The q-value estimate is derived from the observed distribution of p-values. If your sequences are all roughly similar, then omitting some of them should not change the distribution of p-values much, and you should get roughly the same q-values for the remaining matches. Since you are seeing substantial changes in the q-values, that would indicate that something about the sequences you are removing is skewing the p-values. Did you select the sequences to be omitted randomly, or did you have some biological basis for selecting them? Did you screen your sequences for low-complexity regions or repeats?
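
To make the point about the p-value distribution concrete, here is a minimal sketch. It assumes Benjamini-Hochberg q-values as a stand-in (FIMO's own q-value estimation may differ in detail), and all p-values and counts are made-up illustrative numbers, not output from a real scan.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest so q-values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def make_scan(n_hits, n_null, hit_p=2e-6):
    """Hypothetical scan: a few strong matches plus evenly spread null p-values."""
    nulls = [(k + 1) / (n_null + 1) for k in range(n_null)]
    return [hit_p] * n_hits + nulls


# Full scan: 2 strong matches among 100,000 null positions.
q_full = bh_qvalues(make_scan(2, 100_000))[0]
# Random half of the sequences: matches and nulls shrink together.
q_half = bh_qvalues(make_scan(1, 50_000))[0]
# Non-random subset: both matches kept, most null positions dropped.
q_enriched = bh_qvalues(make_scan(2, 1_000))[0]

print(f"full={q_full:.4f} half={q_half:.4f} enriched={q_enriched:.4f}")
```

In this toy setup the q-value of a strong match barely moves when sequences are dropped at random, but falls sharply when the subset keeps the matches and discards mostly null positions.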

If so, how do I go about finding all the occurrences of a motif in the whole genome?

First, be aware that scanning an entire genome for motif occurrences is a difficult statistical problem. It is addressed by the "Futility Theorem" (Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004;5:276-87). The statistical power to identify biologically relevant motif matches depends on the information content of the motif and the quality of the background model. If the motif is short, or not very distinctive, there simply may not be the statistical power to distinguish random matches from biologically relevant ones. Consider this: in a 1 Gb genome there will be roughly 15,000 perfect matches to an arbitrary motif of length 8, entirely by chance. For FIMO, a perfect match is a perfect match, so the perfect matches that occur by chance will vastly outnumber the perfect matches that are "biological". On the other hand, in the same genome you would expect far fewer perfect matches to a motif of length 12 to occur by chance (tens rather than thousands), so you might have a shot at spotting the biologically active matches. This is just a limitation of using sequence similarity to identify motif occurrences.
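
The chance-match arithmetic above can be sketched as follows. This is a back-of-the-envelope estimate that assumes a uniform, independent base composition, a single strand, and one exact target sequence; the function name and parameters are illustrative only.

```python
def expected_chance_matches(genome_length, motif_width, alphabet_size=4):
    """Expected number of exact matches to one fixed sequence of the given
    width, assuming every base is equally likely and independent."""
    positions = genome_length - motif_width + 1  # possible start positions
    return positions / alphabet_size ** motif_width


GENOME_LENGTH = 1_000_000_000  # 1 Gb, as in the example above

for width in (8, 12, 16):
    expected = expected_chance_matches(GENOME_LENGTH, width)
    print(f"width {width:2d}: about {expected:,.2f} chance matches")
```

For width 8 this gives a little over 15,000, and each additional motif position divides the expectation by 4. Scanning both strands roughly doubles the counts, and a real genome with skewed base composition will deviate from these figures.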

Of course, the organism is able to distinguish between active motif sites and random sequence matches. If you have other sources of information, such as DNase I hypersensitivity data, you can convert that into position-specific priors, which FIMO can use in addition to simple sequence matching.

There are some other purely technical issues in using FIMO to scan whole genomes, due to the limits on the amount of memory FIMO can use, but it doesn't sound like you ran into those.

Finally, in your post you mentioned that you are using a uniform background model. FIMO's statistical power depends critically on how good the background model is. If you are scanning a whole genome, you should probably use the actual genome nucleotide frequencies for the background model. This is admittedly a compromise, since the frequencies may vary widely by region, but it's probably the best you can do given FIMO's simplistic background model.
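
As a rough illustration of what "actual genome nucleotide frequencies" means, here is a minimal sketch that tallies order-0 base frequencies from FASTA-formatted lines. The function names and the toy sequences are hypothetical, and the one "letter frequency" pair per line output is only an assumption about the background file layout; check the MEME Suite documentation for the exact format FIMO expects and for the tool it ships for generating background models.

```python
from collections import Counter


def nucleotide_frequencies(fasta_lines):
    """Order-0 A/C/G/T frequencies from an iterable of FASTA lines.
    Header lines and ambiguous bases (e.g. N) are ignored."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        counts.update(base for base in line.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in "ACGT"}


def format_background(freqs):
    """One 'letter frequency' pair per line (assumed layout, see note above)."""
    return "\n".join(f"{base} {freqs[base]:.4f}" for base in "ACGT")


# Toy input standing in for a genome FASTA file.
example = [">seq1", "ATATATGC", ">seq2", "aattnnGC"]
freqs = nucleotide_frequencies(example)
print(format_background(freqs))
```

For a real genome you would pass an open file handle instead of the toy list, for example `nucleotide_frequencies(open("genome.fa"))`, where `genome.fa` is a placeholder path.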

Rocky Parida

May 3, 2017, 7:07:59 PM

to MEME Suite Q&A

Thanks Grant for your reply.

Did you select the sequences to be omitted randomly or did you have some biological basis for selecting them? Did you screen your sequences for low complexity regions or repeats?

I selected a group of random sequences and ran FIMO on them to see whether I could find significant occurrences of the motifs in that random set.

This was a test to see how specific my motif matrices are to my initial set of genes compared with a random set of genes.

I also changed the background A/T/G/C frequencies to match my genome and reran FIMO, but I couldn't find a significant occurrence of any motif in my set against the whole genome (which also included the input sequences I originally used to generate these motifs with MEME).

However, I do understand your point:

The statistical power to identify biologically relevant motif matches depends on the information content of the motif and the quality of the background model. If the motif is short, or not very distinctive, there simply may not be the statistical power to distinguish random matches from biologically relevant ones. Consider this: in a 1 Gb genome there will be roughly 15,000 perfect matches to an arbitrary motif of length 8, entirely by chance. For FIMO, a perfect match is a perfect match, so the perfect matches that occur by chance will vastly outnumber the perfect matches that are "biological". On the other hand, in the same genome you would expect far fewer perfect matches to a motif of length 12 to occur by chance (tens rather than thousands), so you might have a shot at spotting the biologically active matches. This is just a limitation of using sequence similarity to identify motif occurrences.