FIMO returns no results when scanning large genomes

CharlesEGrant

Oct 26, 2012, 6:55:58 PM
to meme-...@googlegroups.com
Several users have reported problems when using FIMO to scan for motif occurrences in the human genome. The following error message is reported in the error log: 

fimo: heap.c:361: get_node: Assertion `get_num_nodes(h) > 0' failed. 

FIMO is finding so many sites that the code that generates the HTML for display in your web browser runs out of memory and fails. This can be corrected by selecting a more stringent p-value threshold. The default p-value threshold is reasonable for small sequence sets, but scanning the human genome at that level will generate around 3e5 matches for each motif, simply by chance! The HTML generation typically fails at around 1e5 matches.

Aside from the software problem, having 300,000 chance matches to your motif is probably not very helpful. When scanning whole genomes you need to use a higher level of stringency to avoid being overwhelmed by chance matches. We'd suggest setting a p-value threshold of 1e-6 or even less. You'll also want to look closely at the q-values for your results, as these provide you with a measure of statistical significance corrected for multiple testing.
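
As a rough illustration of the numbers above, here is a minimal Python sketch of the expected count of chance matches. It assumes about 3e9 scanned positions for the human genome and a default p-value threshold of 1e-4; both figures are assumptions for illustration and are not stated in this thread. The function name is hypothetical.

```python
def expected_chance_matches(num_positions, p_threshold, strands=1):
    """Expected number of positions that pass the p-value threshold by chance.

    Under the null model each scanned position passes with probability
    p_threshold, so the expectation is simply positions * threshold.
    """
    return num_positions * p_threshold * strands


HUMAN_GENOME_BP = 3e9      # assumed approximate genome size
DEFAULT_THRESHOLD = 1e-4   # assumed default p-value threshold

at_default = expected_chance_matches(HUMAN_GENOME_BP, DEFAULT_THRESHOLD)
at_strict = expected_chance_matches(HUMAN_GENOME_BP, 1e-6)

print(f"default threshold: about {at_default:,.0f} chance matches per motif")
print(f"threshold 1e-6:    about {at_strict:,.0f} chance matches per motif")
```

Under these assumptions, tightening the threshold from 1e-4 to 1e-6 cuts the expected chance matches by a factor of 100. Passing `strands=2` doubles the estimate when both strands are scanned.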

We'll incorporate a more helpful error message for this problem in a future version of FIMO.

CharlesEGrant

Apr 8, 2013, 3:44:36 PM
to meme-...@googlegroups.com
In fact this is the purpose of the '--psp' option for the command line version of FIMO. PSP stands for position specific priors. You can give FIMO estimates of your prior belief that each position in the input sequence is a binding site. The prior belief is used to correct the p-value and q-value of the FIMO matches. This is described in the paper:

Gabriel Cuellar-Partida, Fabian A. Buske, Robert C. McLeay, Tom Whitington, William Stafford Noble, and Timothy L. Bailey, "Epigenetic priors for identifying active transcription factor binding sites", Bioinformatics 28(1): 56-62, 2012.

Currently the priors have to be provided in a FASTA style file with prior values taking the place of nucleotide symbols. We're working on adding support for prior input as Wiggle or BigWig files.

Tonatiuh Pena Centeno

May 12, 2013, 9:06:50 AM
to meme-...@googlegroups.com
To whom it may concern

I am interested in finding the Shine Dalgarno (SD) sequence within the E. coli genome, and for that purpose I am using FIMO (default options) to scan the whole genome. However, doing this results in a list with a huge number of matches, many of them not being true locations of the SD sequence. I was wondering what type of information I could pass to --psp so that the list of resulting matches decreases in size but improves in quality.

Tonatiuh

CharlesEGrant

May 13, 2013, 3:09:13 PM
to
Hi Tonatiuh,

Before you start thinking about position specific priors, you may want to read these posts:



The problem is that when you apply even a stringent p-value test to data on the genomic scale you'll get an overwhelming number of false positives. It becomes important to apply some form of multiple testing correction. The problem is particularly difficult for a relatively short motif like Shine Dalgarno. Good matches will frequently occur entirely by chance. You may want to use the command line version of FIMO so you can apply a q-value rather than a p-value threshold. Otherwise you could try decreasing the p-value threshold until you get a manageable result set.
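
To make the idea of multiple testing correction concrete, here is a minimal Python sketch. It is not FIMO's q-value computation: it substitutes the Benjamini-Hochberg procedure, a standard false discovery rate adjustment, purely for illustration. The function name and the `num_tests` parameter are hypothetical.

```python
def benjamini_hochberg(p_values, num_tests=None):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    num_tests is the total number of tests performed. It defaults to
    len(p_values); set it to the number of scanned positions when only the
    best-scoring matches are available. In that case the adjusted values
    are conservative upper bounds.
    """
    m = num_tests if num_tests is not None else len(p_values)
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted = [0.0] * len(p_values)
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(len(order), 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p = {p:<6} adjusted = {q:.3f}")
```

Matches would then be kept only if their adjusted value falls below the chosen false discovery rate, rather than thresholding on the raw p-value.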

To use position specific priors you will need some other source of information about where biologically significant occurrences of your motif are likely to occur. For example, when searching for transcription factor binding sites in eukaryotes you can make use of the fact that actual binding sites are more likely to be associated with open chromatin. You can therefore use genome wide assays of DNase I hypersensitivity to generate a prior belief that a particular site is a possible binding site. Do you have any sort of assay that would inform you whether or not a particular site is likely to contain a biologically significant instance of a Shine Dalgarno sequence?

Tonatiuh Pena Centeno

May 14, 2013, 11:43:30 AM
to meme-...@googlegroups.com
Hello Charles,

Right now I am trying to find the Shine Dalgarno motif in archaeal genomes. Archaea share some distinctive features with Prokarya, e.g. the Shine Dalgarno (SD) pattern. As preparation, I used one part of the E. coli genome as training data to characterise the SD sequence as a MEME motif; the motif looks very good: "AGGAGG". My idea was then to use the remaining part of the E. coli genome as test data and see how well FIMO could find the SD pattern in it. If the results were good, I could then move on and try predicting SD sequences in Archaea. However, running FIMO on the E. coli test data gives me several matches that do not seem to have much "functional meaning", so I was wondering whether FIMO's resulting annotation could be cleaned up somehow.

I am not a biologist, but as far as I am aware, archaeal and bacterial genomes tend to be relatively small (~3 Mbp), so overwhelming the computer with a huge search space doesn't seem to be a problem.

I do not have the kind of assay data that you and the "Epigenetic priors" paper are talking about.

Any advice or opinion will be helpful,

Tonatiuh


CharlesEGrant

May 14, 2013, 2:30:03 PM
to meme-...@googlegroups.com
Your motif is very short, only 6 bases long. If you consider a series of 6 bases chosen at random, the chance of getting a perfect match to your motif is roughly (0.25)^6 = 0.0002441. That may seem small, but if you multiply 0.0002441 * 3Mb, you'd expect to get around 732 perfect matches to your motif entirely by chance. Those chance matches will probably swamp the biologically relevant matches. This is a well known problem in searching for motifs, jokingly called "The Futility Theorem" (Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet 2004;5:276-87.). Your motif taken in isolation simply doesn't contain enough information to pick out the biologically active sites. You'll need some other source of biological information to filter out the chance matches. I don't know anything about bacterial molecular biology, but the Wikipedia article on Shine Dalgarno sequences says that they are generally located 8 bases upstream of a start codon. Just as an example, maybe you could filter out the motif hits that don't have a start codon within 50 bases.
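
The filtering idea could be sketched in Python roughly as follows. This is only an illustration under several assumptions: hits are given as hypothetical (start, end) pairs with 0-based, end-exclusive coordinates on the forward strand, only ATG is treated as a start codon by default, and reverse-strand hits are not handled. The function names are made up for this example and are not part of FIMO.

```python
def has_start_codon_downstream(sequence, hit_end, window=50, start_codons=("ATG",)):
    """True if a start codon begins within `window` bases after the hit.

    hit_end is the 0-based, end-exclusive coordinate of the motif hit.
    The slice is extended by 2 so that a codon starting on the last base
    of the window is still seen in full.
    """
    region = sequence[hit_end:hit_end + window + 2].upper()
    return any(codon in region for codon in start_codons)


def filter_hits(sequence, hits, window=50, start_codons=("ATG",)):
    """Keep only the (start, end) hits followed closely by a start codon."""
    return [
        (start, end)
        for start, end in hits
        if has_start_codon_downstream(sequence, end, window, start_codons)
    ]


# Toy sequence: the first AGGAGG is 8 bases upstream of an ATG, the second
# is followed only by C's.
toy = "CC" + "AGGAGG" + "TTTTTTTT" + "ATG" + "C" * 100 + "AGGAGG" + "C" * 60
toy_hits = [(2, 8), (119, 125)]
print(filter_hits(toy, toy_hits))
```

Alternative start codons can be allowed by passing, for example, `start_codons=("ATG", "GTG", "TTG")`, and a smaller `window` makes the filter stricter.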

Tonatiuh Pena Centeno

May 16, 2013, 5:28:11 AM
to meme-...@googlegroups.com
Hi Charles,

Ok, sounds like a good idea :-)

Thanks,

Tonatiuh