High true positive percentage in SEA report

Ya Jiang

Aug 5, 2023, 11:47:53 AM

to MEME Suite Q&A

Hi,

I want to use SEA to find transcription factor motifs enriched in cis-regulatory elements (CREs), but I get a very high true positive percentage (>90%) in the output. Below is my command:

sea --n all_CRE.fasta --order 2 --seed 123 --m CIS-BP_2.00_hsa.meme --p CRE_of_DE.genes.fasta --o SEA

The control sequences are all cis-regulatory elements. The primary sequences are the cis-regulatory elements assigned to differentially expressed genes. The motifs were downloaded from the MEME motif database ("CIS-BP_2.00/Homo_sapiens.meme").

The top enriched TFs are listed below:

RANK DB ID ALT_ID CONSENSUS TP TP% FP FP% ENR_RATIO SCORE_THR PVALUE LOG_PVALUE EVALUE LOG_EVALUE QVALUE LOG_QVALUE
1 CD8_Nve_CB.UP_TF.meme M04519_2.00 ZNF704 YRCCGGCCGGYR 3977 77.1 18222 72.93 1.06 0.2 2.15E-10 -22.26 2.59E-08 -17.47 1.38E-08 -18.1
2 CD8_Nve_CB.UP_TF.meme M08275_2.00 ZFP64 SRBTCCCGGGSCCCS 4671 90.56 21896 87.64 1.03 0.17 8.43E-10 -20.89 1.01E-07 -16.11 2.71E-08 -17.42
3 CD8_Nve_CB.UP_TF.meme M02460_2.00 GMEB1 NNACGYNNN 4720 91.51 22214 88.91 1.03 4.1 9.77E-09 -18.44 1.17E-06 -13.66 1.66E-07 -15.61
4 CD8_Nve_CB.UP_TF.meme M01914_2.00 TET1 NNYRCGYWN 4925 95.48 23355 93.48 1.02 3.3 1.03E-08 -18.39 1.24E-06 -13.6 1.66E-07 -15.61
5 CD8_Nve_CB.UP_TF.meme M04400_2.00 KLF6 NRCCACGCCCH 5017 97.27 23906 95.69 1.02 1.9 2.49E-08 -17.51 2.99E-06 -12.72 3.20E-07 -14.95
6 CD8_Nve_CB.UP_TF.meme M08310_2.00 ZNF341 GCTSTTCCYBCYBCCSCCCBS 4137 80.21 19181 76.77 1.04 5.7 3.16E-08 -17.27 3.80E-06 -12.48 3.39E-07 -14.9

The top enriched motifs have a very high true positive percentage, some even more than 90%. This is likely because some motifs are poorly annotated, as they contain many Ns in the consensus sequence. My question is: should I trust these enriched motifs with high TP%? If not, is there anything I can do to filter out these motifs when I run SEA?

Best,

Ya

cegrant

Aug 6, 2023, 6:03:54 PM

to MEME Suite Q&A

Hi YY,

Hmm. There are many FPs as well. That is, there are many motif hits in the control set too, just more in the target set. The enrichment ratio is only slightly greater than 1 for each of your motifs. The control set looks like it is MUCH larger than the target set. The size of the control set shouldn't be too critical, but it's very important that the control set and the target set have nearly the same distribution of sequence lengths, which is harder to achieve when the two sets differ dramatically in size. Could I ask for a clarification? Is your target set included in your control set? The target set and the control set should be disjoint.
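
As a rough check of these observations, the set sizes and a simple enrichment ratio can be recovered from the counts and percentages in the SEA table. This is only a sketch: it uses the plain ratio TP%/FP%, whereas the ENR_RATIO that SEA reports may include pseudocounts, and the function names are made up for illustration.

```python
def implied_set_size(hits: int, percent: float) -> int:
    """Estimate the number of sequences in a set from a hit count and its percentage."""
    return round(hits * 100.0 / percent)


def simple_enrichment_ratio(tp_percent: float, fp_percent: float) -> float:
    """Plain ratio of hit percentages (no pseudocounts; an assumption, not SEA's exact formula)."""
    return tp_percent / fp_percent


# Values copied from the rank-1 row (ZNF704) of the SEA output above.
tp, tp_pct, fp, fp_pct = 3977, 77.1, 18222, 72.93

n_target = implied_set_size(tp, tp_pct)
n_control = implied_set_size(fp, fp_pct)
ratio = simple_enrichment_ratio(tp_pct, fp_pct)

print(f"target ~{n_target} sequences, control ~{n_control} sequences")
print(f"control/target size ratio ~{n_control / n_target:.1f}")
print(f"simple enrichment ratio ~{ratio:.2f}")
```

For this row the control set comes out roughly five times larger than the target set, and the ratio is barely above 1.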

As a first troubleshooting step, I'd run SEA without specifying a control set. SEA will then generate a control set internally by shuffling the target set. This guarantees that the control and target sets have the same distribution of sequence lengths. If the enriched motifs go away, I'd be inclined to think that there is some issue with your control set. If the enriched motifs remain, then there may be genuine enrichment. However, it would still be useful to screen both your sequence collections for simple repeats and low-complexity regions, as these can create spurious matches to motifs.

Ya Jiang

Aug 7, 2023, 4:29:05 PM

to MEME Suite Q&A

Hi,

Thanks for your advice! In my original analysis, the target set is a subset of the peaks in the control set. I tried to troubleshoot my analysis as you suggested. First, I ran SEA without specifying a control set, which got rid of the high TP% and FP%; this may indicate a problem with my control set. Then I removed the target set from the control set and masked repeats with DUST, but I still got very high TP% and FP% in the output. Next, I generated a 2-mer shuffled control set with `fasta-shuffle-letters -kmer 2 -dna control.DUST.fasta control.DUST.shuffled.fasta`. The shuffled control set does increase the score threshold and reduce TP% for many motifs. But I am not sure whether it is appropriate to use a shuffled control set, or whether I need to discard motifs that have a low match score threshold.

I also checked the length distributions of the target and control sets, listed below. According to the table, the target set does have a similar length distribution to the control set, but the peak lengths are not uniform. Do you think it would be better to remove some extremely long and short sequences from the target and control sets?

             Min.   1st Qu.  Median  Mean     3rd Qu.  Max.
target set   11bp   270bp    490bp   625.7bp  900bp    3270bp
control set  12bp   260bp    500bp   660.3bp  1000bp   3150bp
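
A length summary like the one above can be reproduced directly from the FASTA files. The following is a minimal sketch, assuming plain multi-line FASTA input; the function names and the tiny demo records are hypothetical, and the quartile method ("inclusive", i.e. linear interpolation) is an assumption about how the table was produced.

```python
import statistics
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of each record found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def length_summary(lengths: List[int]) -> Dict[str, float]:
    """Min, quartiles, mean and max of a list of sequence lengths (needs >= 2 values)."""
    q1, median, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    return {
        "min": min(lengths),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(lengths),
        "q3": q3,
        "max": max(lengths),
    }


# Tiny made-up example; with real data, pass open("target.fasta") instead.
demo = """>a
ACGTACGT
ACGT
>b
ACGT
>c
ACGTAC
""".splitlines()

demo_lengths = fasta_lengths(demo)
summary = length_summary(demo_lengths)
print(demo_lengths)
print(summary)
```

Running the same two functions on the target and control files gives summaries that can be compared side by side.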

Best,

Ya

cegrant

Aug 20, 2023, 12:37:48 AM

to MEME Suite Q&A

Hi Ya,

I chatted about your issues with Tim Bailey, the author of SEA. He made a few points:

1. The distribution of sequence lengths is probably a non-issue.

2. Having the target set included in the control set is not really a problem; it just makes the statistics a little conservative.

3. Except in very peculiar circumstances, letting SEA generate the control set by shuffling the target set is the best way to go. Providing your own control set can obscure matters because of differences in the nucleotide and 2-mer frequencies between the two sets.
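
To see how much the 2-mer composition mentioned in point 3 actually differs between two sets, the dinucleotide frequencies can be tabulated and compared. This is an illustrative sketch only, not something SEA does for you; the function names and the toy sequences are made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def kmer_frequencies(sequences: Iterable[str], k: int = 2) -> Dict[str, float]:
    """Relative frequency of each k-mer (overlapping windows), pooled over all sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def largest_differences(
    freq_a: Dict[str, float], freq_b: Dict[str, float], top: int = 3
) -> List[Tuple[str, float]]:
    """k-mers with the largest absolute frequency difference (a minus b)."""
    kmers = set(freq_a) | set(freq_b)
    diffs = {kmer: freq_a.get(kmer, 0.0) - freq_b.get(kmer, 0.0) for kmer in kmers}
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:top]


# Toy sequences standing in for the target and control sets.
target_seqs = ["CGCGCG", "ACGCGT"]
control_seqs = ["ATATAT", "ACGTAT"]

freq_target = kmer_frequencies(target_seqs)
freq_control = kmer_frequencies(control_seqs)
for kmer, diff in largest_differences(freq_target, freq_control):
    print(f"{kmer}: {diff:+.2f}")
```

Large differences for 2-mers such as CG would suggest that the two sets differ in base composition, which is the situation point 3 warns about.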

He suspects the real problem is that some of your motifs are going to get great matches in any C/G-rich region, so along with repeats and low-complexity regions, you are going to have to worry about CpG islands. The difficulty with the tools in the MEME Suite in general is that they simply can't distinguish between spurious matches like that and genuine binding sites for the TF.
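
One way to get a feel for how many sequences are C/G-rich is to compute the GC fraction and the observed/expected CpG ratio per sequence. The sketch below is an assumption-laden illustration, not a replacement for a proper CpG island or repeat-masking tool: the thresholds (GC above 50%, observed/expected CpG above 0.6, length at least 200 bp) are commonly used rule-of-thumb values and should be treated as adjustable, and the function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def looks_cpg_island_like(
    seq: str, min_gc: float = 0.5, min_obs_exp: float = 0.6, min_length: int = 200
) -> bool:
    """Flag a sequence that passes all three rule-of-thumb thresholds (assumed defaults)."""
    return (
        len(seq) >= min_length
        and gc_fraction(seq) > min_gc
        and cpg_obs_exp(seq) > min_obs_exp
    )


# Two artificial 200 bp sequences: one CpG-dense, one AT-only.
cpg_dense = "CG" * 100
at_only = "AT" * 100

for name, seq in [("cpg_dense", cpg_dense), ("at_only", at_only)]:
    print(name, round(gc_fraction(seq), 2), round(cpg_obs_exp(seq), 2),
          looks_cpg_island_like(seq))
```

Comparing the fraction of flagged sequences in the target and control sets gives a rough idea of whether C/G-rich motifs could be matching for compositional reasons alone.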
