FIMO: detect all the promoters with transcription factor motifs

Tao Zhu

Nov 23, 2017, 2:50:52 AM
to MEME Suite Q&A
I need to use FIMO to find all the promoters that contain transcription factor motifs. The command is the same in both cases, but I see two ways to run it:

fimo [options] <motif file> <sequence file>

The first method is to put a single promoter sequence in the sequence file for each run, and to run the program many times.

The second method is to put all the promoter sequences (about 60,000) in one sequence file and run them together.

Currently I use the second method, and I have found that it runs very slowly. In addition, it generates a huge GFF file (fimo.gff), currently ~629 GB.

I would like to know the difference between the two methods, and how to deal with the huge file properly. I only want fimo.txt, not fimo.gff.

cegrant

Nov 28, 2017, 5:38:09 PM
The second method is preferred, as it will provide accurate q-values for your matches. If you use the first method, the reported q-values will not reflect the appropriate multiple testing correction.

How long is each of your promoter sequences? Typically this would be one or two thousand bp upstream from the TSS. FIMO should be able to handle that easily, and the output files should all be much, much smaller than 629 GB. In fact, I’d expect the text and GFF files each to be much less than 100 MB. The only way I can imagine the output files getting that large is if you have a very large sequence file and set the p-value or q-value threshold to near 1.0.
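As a rough sanity check on the sizes discussed above, the expected number of chance matches can be estimated as the number of scanned positions times the p-value threshold. The sketch below is only a back-of-the-envelope illustration. The 2,000 bp promoter length, the ~100 bytes per output line, and both function names are assumptions, not values from the original poster's data, and motif width is ignored. It also assumes FIMO's default p-value threshold of 1e-4 and scanning of both strands.

```python
def expected_chance_matches(n_sequences, seq_length, p_threshold,
                            n_motifs=1, both_strands=True):
    """Estimate how many positions pass the p-value threshold by chance alone.

    Ignores motif width, so it slightly overestimates the number of positions.
    """
    strands = 2 if both_strands else 1
    positions = n_sequences * seq_length * strands
    return positions * p_threshold * n_motifs


def approx_output_bytes(n_matches, bytes_per_line=100):
    """Very rough output size, assuming ~100 bytes per reported match (hypothetical)."""
    return n_matches * bytes_per_line


if __name__ == "__main__":
    # 60,000 promoters is from the question; 2,000 bp each is an assumption.
    for p in (1e-4, 1.0):
        n = expected_chance_matches(60_000, 2_000, p)
        size_mb = approx_output_bytes(n) / 1e6
        print(f"p < {p:g}: ~{n:,.0f} matches per motif, ~{size_mb:,.1f} MB per motif")
```

Under these assumptions, a threshold of 1e-4 gives roughly 24,000 chance matches per motif (a few MB of text), while a threshold of 1.0 reports all 240 million positions per motif, which is tens of gigabytes per motif.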

Could you attach copies of your motif and sequence files, and the exact command line you used? That would help us troubleshoot the problem.

Teshome Mulugeta

Dec 1, 2017, 2:19:01 PM
Interesting, but this seems to be the opposite of what we discussed earlier (https://groups.google.com/forum/#!topic/meme-suite/r2mFqWv6DLI). Can we say that the result (discovered motifs) will be the same for both methods, except that the second method produces accurate q-values? I am currently running both methods to see the difference.

CharlesEGrant

Dec 1, 2017, 3:03:57 PM
No, I gave the same advice in both questions: you should use a single sequence database containing all the sequences in which you wish to identify motifs. If you split up the sequence database into multiple files, then the q-values reported by FIMO will be incorrect for assigning statistical significance to your results.

In both of these questions there are two distinct sources of multiple testing problems. The first source is applying the significance threshold test at every position in the sequence database. The second source is scanning with hundreds of motifs. The q-value calculated by FIMO corrects for the first source but not the second. The q-values are computed for each motif separately, so you could split up your motif database into multiple files if you wish, but it's still incumbent on you to apply some sort of correction to the q-value for the total number of motifs you use.
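The post above does not say which correction to apply for the number of motifs. One simple and conservative option is a Bonferroni-style adjustment, sketched below. This is an illustrative choice, not something FIMO does for you, and the function name and the example numbers are hypothetical.

```python
def adjust_for_motif_count(q_value, n_motifs):
    """Bonferroni-style adjustment of a per-motif q-value.

    Multiplies the q-value by the total number of motifs scanned
    and caps the result at 1.0. This is an assumed, conservative
    choice of correction; other procedures are possible.
    """
    if n_motifs < 1:
        raise ValueError("n_motifs must be at least 1")
    return min(1.0, q_value * n_motifs)


if __name__ == "__main__":
    # Hypothetical example: a match with q = 0.0001 found while scanning 200 motifs.
    print(adjust_for_motif_count(0.0001, 200))
```

The total motif count should cover every motif scanned across all runs, even if the motif database was split into several files.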

> Can we say that the result (discovered motifs) will be the same for both methods, except that the second method produces accurate q-values?

Yes, the two methods will give the same raw scores and p-values, but different q-values. You definitely want to use the q-values for assigning statistical confidence to your results though!
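To illustrate why the p-values stay the same while the q-values change, here is a minimal sketch that uses the plain Benjamini-Hochberg procedure as a stand-in for a q-value calculation. FIMO's own implementation may differ in detail, and the p-values below are made up for the example.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values for one motif at positions in two promoters.
promoter_a = [1e-5, 0.2, 0.7]
promoter_b = [0.03, 0.5, 0.9]

# Method 1: each promoter scanned in its own run.
q_split = bh_qvalues(promoter_a) + bh_qvalues(promoter_b)

# Method 2: all promoters scanned together.
q_all = bh_qvalues(promoter_a + promoter_b)

if __name__ == "__main__":
    for p, qs, qa in zip(promoter_a + promoter_b, q_split, q_all):
        print(f"p={p:g}  q(split)={qs:g}  q(all)={qa:g}")
```

The best match keeps the same p-value in both cases, but its q-value is smaller in the split run because the correction only accounts for the positions in that one promoter.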