Advice on filtering FIMO output

Amira Kramdi

Jun 28, 2024, 11:45:05 AM
to MEME Suite Q&A
Hello everyone,

I am using FIMO to scan for a motif in 200 bp regions centered on ChIP-seq peak summits (npeaks=500). The peaks correspond to binding sites of a protein that binds DNA via a partner. The scanned motif therefore belongs to the partner, so I expect FIMO to return positive hits in a significant fraction of the input sequences.

Here's the command I used initially:

fimo --norc --oc $fimoDIR/$motifID --verbosity 1 --qv-thresh --thresh 0.01 $motifFile $fimoDIR/$motifID/sequencesToScan.fa

First, I noticed that with --qv-thresh --thresh 0.01 the program does not seem to threshold on the q-value, which was misleading at first. Am I using these options correctly?

Using this command, and after applying the q-value filter myself on the best_site.narrowPeak file, I was surprised to get very few significant hits (6 hits). Relaxing the q-value threshold to 0.05 brings this up to 100 best hits.

At this point, I considered the hypothesis that the ChIPed protein may have different DNA-binding partners, so I decided to run MEME-ChIP on the sequences with no hits (including those with no significant hits based on the q-value threshold). To my surprise, the top motif detected by STREME was the one I initially scanned for, and CentriMo showed a nice central enrichment around the summit. This made me wonder whether I was missing likely true occurrences because of the p-value/q-value filters.

While I am aware that it is important to account for multiple testing due to the sequence length and that these thresholds are arbitrary (this discussion was very helpful in this regard, btw), I am tempted not to filter on the q-value and to work with p-value=1e-3 in this case.
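
To put the p-value=1e-3 threshold in context, here is a rough back-of-the-envelope sketch in Python of how many positions would be expected to pass it by chance alone. The motif width of 10 is an assumed placeholder (the actual width is not given in this thread), and the function name is purely illustrative.

```python
def expected_chance_hits(n_seqs, seq_len, motif_width, p_thresh, both_strands=False):
    """Return (number of positions tested, expected hits passing p_thresh by chance)."""
    # Each sequence offers seq_len - motif_width + 1 start positions per strand.
    positions_per_seq = max(seq_len - motif_width + 1, 0)
    strands = 2 if both_strands else 1  # --norc scans one strand only
    n_tests = n_seqs * positions_per_seq * strands
    return n_tests, n_tests * p_thresh


# 500 sequences of 200 bp, one strand, ASSUMED motif width of 10.
n_tests, expected = expected_chance_hits(500, 200, 10, 1e-3)
print(f"{n_tests} positions tested, about {expected:.1f} chance hits expected")
```

Under these assumptions, the number of chance hits at p-value=1e-3 is of the same order as the number of best hits reported above, which is why an unadjusted p-value threshold is hard to interpret on its own.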

Any thoughts on this? Do FIMO users always use the q-value to report hits? I've seen papers that use only the p-value (maybe because the reported motifs checked out in terms of central enrichment, ChIP signal and such).

Many thanks in advance for the help!
Best,
Amira

cegrant

Jun 30, 2024, 12:54:13 AM
to MEME Suite Q&A
> First, I noticed that with --qv-thresh --thresh 0.01 the program does not seem to threshold on the q-value, which was misleading at first. Am I using these options correctly?

I just double checked, and that is the correct usage. When I run it, it does correctly threshold the results on the q-value. Could you forward us a copy of the input file you used and the FIMO HTML output? That would help us troubleshoot the problem.

> At this point, I considered the hypothesis that the ChIPed protein may have different DNA-binding partners, so I decided to run MEME-ChIP on the sequences with no hits (including those with no significant hits based on the q-value threshold). To my surprise, the top motif detected by STREME was the one I initially scanned for, and CentriMo showed a nice central enrichment around the summit. This made me wonder whether I was missing likely true occurrences because of the p-value/q-value filters.

This is entirely possible! You have to keep in mind, though, that FIMO has no biological insight. It performs a purely statistical test of whether a short sequence is a "good" match to a motif. In many cases, truly functional sequences may not have a statistically significant match to the motif, while other, non-functional sequences are a highly significant match. The larger your sequence set, the bigger the problem becomes, due to the multiple-testing issue that you noted. This is discussed briefly in the Example section of the FIMO paper. If you can provide priors for which segments of your sequences are more likely to be biologically active (say, from epigenetic marks), then you might take advantage of FIMO's ability to include position-specific priors in its scoring (see the FIMO documentation on the --psp option; also see Gabriel Cuellar-Partida, Fabian A. Buske, Robert C. McLeay, Tom Whitington, William Stafford Noble, and Timothy L. Bailey, "Epigenetic priors for identifying active transcription factor binding sites", Bioinformatics 28(1): 56-62, 2012).
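
The multiple-testing effect described above can be illustrated with the Benjamini-Hochberg procedure. This is a generic Python sketch, not FIMO's actual code; FIMO's own q-value computation may differ in detail, and the p-values below are made up for illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Hypothetical p-values: the same best p-value among few vs. many tests.
few = [0.001, 0.5, 0.9]
many = [0.001] + [0.5] * 999
print(round(bh_qvalues(few)[0], 4), round(bh_qvalues(many)[0], 4))
```

The same p-value yields a much larger q-value when it sits among many unremarkable tests, which is how a match that looks good in isolation can fail a q-value threshold in a large scan.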

If you want to use FIMO as an exploratory tool setting up your later work, then you are free to choose the filters and thresholds you find useful. However, if you are going to present FIMO output as actual evidence for the locations of motif binding sites, then it's best to be rigorous. Use q-values and a significance threshold that other researchers will find credible. 

Amira Kramdi

Aug 9, 2024, 6:26:14 AM
to MEME Suite Q&A
Hello,

Thanks a lot for your response. I ran a clean test and attached the html file and the sequences. 
Of note, I mainly use best_site.narrowPeak, where, according to the FIMO documentation, only sites passing the significance threshold are output. My remark about thresholding concerns this best-sites file: I noticed that the q-values in column 9 were not limited to my threshold of 0.05. I attached the output.
fimo.gff and fimo.tsv are indeed correctly filtered.
Could you please take a look at the HTML file in case I missed anything?
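
In the meantime, the best-sites file can be filtered after the fact. Below is a minimal Python sketch, assuming the file is tab-delimited with the q-value in column 9 as described above. Whether that column holds raw or -log10-transformed values is an assumption to check against the FIMO documentation, so the sketch exposes it as a `neg_log10` flag. The function name and the rows in `demo` are invented for illustration.

```python
import csv
import io


def filter_narrowpeak(handle, q_thresh=0.05, q_col=9, neg_log10=False):
    """Keep rows whose q-value (1-based column q_col) is <= q_thresh."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines, comments, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < q_col:
            continue
        try:
            value = float(row[q_col - 1])
        except ValueError:
            continue  # e.g. a header line
        # ASSUMPTION: set neg_log10=True if the column stores -log10(q).
        q = 10 ** (-value) if neg_log10 else value
        if q <= q_thresh:
            kept.append(row)
    return kept


# Invented example rows (10 tab-separated columns, q-value in column 9).
demo = (
    "seq1\t95\t105\tmotifA\t0\t+\t12.3\t1e-05\t0.01\t-1\n"
    "seq2\t40\t50\tmotifA\t0\t+\t8.1\t1e-03\t0.2\t-1\n"
    "seq3\t99\t109\tmotifA\t0\t+\t10.7\t1e-04\t0.04\t-1\n"
)
kept = filter_narrowpeak(io.StringIO(demo), q_thresh=0.05)
print([row[0] for row in kept])
```

For a real file, the `io.StringIO(demo)` argument would be replaced by an open file handle, for example `open("best_site.narrowPeak", newline="")`.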

Up to this point, I had assumed that centering the input sequences on the top ChIP-seq peak summits was "enough" to focus on biologically active sequences, that most of these sequences would have significant matches to the expected motif, and that a good match means a potential binding site (which is not always the case). It is becoming clear to me that this first selection is a good start, but it is not enough to retrieve the precise location of true TFBSs, given the length of the sequences and the multiple-testing issue. Including epigenetic priors is super interesting here. At one point I considered running FIMO on the valleys of H3K27ac density that overlap my ChIP-seq peaks, but I found it difficult to define those valleys, so I'm excited to test the --psp option!
I have a couple more questions about these priors. I will open a new discussion about this subject.

Many thanks.
Best,
(sorry for the duplicate, I couldn't attach files here)