Scanning genome-sized datasets with FIMO


CharlesEGrant

Nov 7, 2016, 3:25:58 PM
to meme-...@googlegroups.com
Some FIMO users have written to us about difficulties they've experienced when scanning whole genomes with FIMO. The most common problem is running out of memory, even on machines with quite generous amounts of RAM.

We've used FIMO to scan hg18 and hg19 with the motifs from the TRANSFAC and JASPAR databases and generated custom tracks for the UCSC Genome Browser.

The memory problems occur when FIMO is writing out the results. By default, FIMO saves its results in XML format, and then uses XSLT to transform the XML into the various output formats (HTML, GFF, plain text). The drawback of this is that XML is extremely verbose, and the XSLT library that FIMO uses reads the entire XML file into memory in order to parse it. This process is very memory intensive, but it can be turned off by using FIMO's '--text' option, which is described below.

Except for the translation of the XML file, FIMO's memory complexity is O(1). While FIMO calculates the score and p-value for each position in the input sequences, it only retains results up to the limit set by '--max-stored-scores', which defaults to 1e5. Once the stored score limit is reached, FIMO begins dropping the least significant scores. FIMO uses a randomly sampled selection of 10,000 of the observed p-values to calculate the q-values.
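To illustrate why a stored-score limit keeps memory bounded, here is a minimal sketch of retaining only the most significant p-values from a stream. It is not FIMO's actual implementation; the function name and the heap-based approach are assumptions chosen for illustration.

```python
import heapq


def keep_most_significant(pvalues, max_stored):
    """Retain at most max_stored of the smallest p-values from a stream.

    Hypothetical helper: it only mimics the idea of dropping the least
    significant scores once a storage limit is reached.
    """
    # heapq is a min-heap, so p-values are stored negated: the root is
    # then the largest (least significant) p-value currently retained.
    heap = []
    for p in pvalues:
        if len(heap) < max_stored:
            heapq.heappush(heap, -p)
        elif p < -heap[0]:
            # The new match is more significant than the worst one kept,
            # so it replaces it; memory use never exceeds max_stored.
            heapq.heapreplace(heap, -p)
    return sorted(-x for x in heap)
```

The memory used depends on the limit, not on the number of positions scanned, which is the property described above.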

The main trick to scanning whole genomes with FIMO is to set a stringent p-value threshold. In most cases you really don't want 10 million matches, almost all of which are statistically insignificant after correcting for multiple testing. When we were generating the custom tracks mentioned above, we used the following settings:

--output-pthresh 1e-6 
--max-stored-scores 500000 

This ran without problems on a cluster node with 16GB of memory. We filtered the FIMO output afterwards so that we only included matches with a q-value less than 0.01.
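The post-filtering step could be sketched as follows. This assumes the FIMO output is tab-delimited with a header row containing a column named "q-value"; the exact header differs between MEME Suite releases, so the column is looked up by name rather than by position. The function name is hypothetical.

```python
import csv


def filter_by_qvalue(lines, max_q=0.01, column="q-value"):
    """Return the header plus the rows whose q-value is below max_q.

    Assumes tab-delimited input with a header row; `column` is the
    assumed name of the q-value column.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    # Some releases start the header line with '#', so strip it before
    # searching for the column name.
    names = [name.lstrip("#").strip() for name in header]
    q_index = names.index(column)
    kept = [header]
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        try:
            q = float(row[q_index])
        except (ValueError, IndexError):
            continue  # skip rows with a missing or non-numeric q-value
        if q < max_q:
            kept.append(row)
    return kept
```

Because `csv.reader` accepts any iterable of lines, an open file handle can be passed in directly, so the output is processed one line at a time.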

If you can't apply a stringent p-value threshold, then you'll need to use FIMO's '--text' option. The '--text' option directs FIMO to skip creating the XML output and simply write each motif match to the standard output as it is evaluated. This prevents FIMO from providing a q-value with each score, but the q-values can be calculated after the fact using the 'qvalue' utility included with the MEME Suite. The documentation for 'qvalue' can be found here. To use 'qvalue' with the '--text' output of FIMO, you will have to write scripts to extract the p-values from the FIMO output and to assign the resulting q-values back to the FIMO matches. In order for 'qvalue' to correctly calculate q-values, you'll need to run FIMO with '--output-pthresh 1.0' and work with one motif at a time.
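As a rough illustration of the "extract p-values, compute q-values, assign them back" workflow, the sketch below groups matches by motif and computes q-values within each group. It uses the Benjamini-Hochberg procedure as a simple stand-in; this is not the 'qvalue' utility, and its numbers may differ from what 'qvalue' reports. The function names and the (motif, p-value) input shape are assumptions.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * n / rank over all ranks at or above the current one.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def qvalues_per_motif(matches):
    """matches: list of (motif_id, p_value) pairs, one per FIMO match.

    Returns one q-value per match, computed separately for each motif.
    """
    positions_by_motif = defaultdict(list)
    for pos, (motif, _) in enumerate(matches):
        positions_by_motif[motif].append(pos)
    result = [None] * len(matches)
    for positions in positions_by_motif.values():
        qs = bh_qvalues([matches[pos][1] for pos in positions])
        for pos, q in zip(positions, qs):
            result[pos] = q
    return result
```

The correction is only meaningful when every tested position is present in the input, which is why the p-value threshold has to be 1.0 for this workflow.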

Oriol Fornés

Apr 3, 2017, 2:13:16 PM
to MEME Suite Q&A
Hi Charles, where exactly are the UCSC tracks for the FIMO scans of the JASPAR/TRANSFAC databases available? Thank you in advance.

CharlesEGrant

Apr 11, 2017, 8:40:53 PM
to MEME Suite Q&A
We used to have them linked from the FIMO web application page, but nobody expressed any interest in them, so we've since taken them down.

Teshome Mulugeta

Dec 11, 2017, 4:06:17 AM
to MEME Suite Q&A
This is useful information, but I think it needs an update. I couldn't find the --output-pthresh option in meme/4.12.0. Is it equivalent to --qv-thresh in the new release? The link to the qvalue documentation is http://meme-suite.org/doc/qvalue.html. It would be helpful if you could update the text to reflect the changes made to FIMO since this post. What are the best options for qvalue (for example, --good-score, --pi-zero, --bootstraps, --fdr) to get results similar to FIMO's if I prefer to compute the FDR myself from the p-values?

CharlesEGrant

Dec 11, 2017, 10:48:56 PM
to MEME Suite Q&A
--output-pthresh 1e-6 is now just --thresh 1e-6. This is a p-value threshold, not a q-value threshold. The only qvalue parameter set by FIMO is --pi-zero. Of course, you'll have to set the appropriate values for --column and --header.
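For reference, the settings from the original post could be assembled with the renamed option roughly like this. The helper and the file names are hypothetical; only the option names '--thresh', '--max-stored-scores' and '--text' come from this thread.

```python
def fimo_command(motif_file, sequence_file, pthresh=1e-6,
                 max_stored=500000, text=False):
    """Build a FIMO argument list (hypothetical helper, runs nothing)."""
    cmd = ["fimo", "--thresh", str(pthresh)]
    if text:
        # In text mode each match is written as it is evaluated, so the
        # stored-score limit is left out of the command.
        cmd.append("--text")
    else:
        cmd += ["--max-stored-scores", str(max_stored)]
    # Positional arguments: motif file first, then the sequence file.
    cmd += [motif_file, sequence_file]
    return cmd


# Placeholder file names, for illustration only.
print(" ".join(fimo_command("motifs.meme", "genome.fa")))
```

The resulting list could be passed to `subprocess.run` on a machine where FIMO is installed.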

Teshome Mulugeta

Dec 11, 2017, 11:23:38 PM
to MEME Suite Q&A
The reason I want to try this is that I ran FIMO with a database of 50,000 sequences and 13,000 PWMs. I used a server with 192GB of RAM, but my job crashed after almost a week. Before trying this solution, I was thinking of running FIMO against the 50,000-sequence database separately for each PWM on a cluster. This means I will have 13,000 jobs. Will the q-values be the same as when running FIMO with the whole sequence database and the whole PWM database at once?

CharlesEGrant

Dec 11, 2017, 11:45:20 PM
to MEME Suite Q&A
You have to keep the sequence data all together, but each PWM is treated independently, so the q-values will be the same whether you run them all at once (probably impossible) or individually. As I've discussed with you elsewhere, FIMO doesn't apply a multiple testing correction for using 13,000 PWMs. You might apply a Bonferroni correction by taking your nominal q-value threshold and dividing by 13,000. Note, though, that this means you are unlikely to get useful results. Given inputs of this size, you are only going to reach statistical significance for very long, very specific motifs.
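To make the Bonferroni arithmetic concrete, here is a small sketch. The nominal threshold of 0.01 is only an example, borrowed from the q-value cutoff mentioned in the original post; the function name is hypothetical.

```python
def bonferroni_threshold(nominal, n_tests):
    """Divide a nominal threshold by the number of tests (here, PWMs)."""
    return nominal / n_tests


# Example: a nominal q-value threshold of 0.01 across 13,000 PWMs.
threshold = bonferroni_threshold(0.01, 13000)
print(f"{threshold:.2e}")  # prints 7.69e-07
```

A match would then need a q-value below roughly 7.7e-7 to pass, which shows how demanding the corrected threshold becomes.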