Hi,
I want to use SEA to find transcription factor motifs enriched in cis-regulatory elements (CREs), but I get a very high true positive percentage (>90%) in the output. Below is my command:
sea --n all_CRE.fasta --order 2 --seed 123 --m CIS-BP_2.00_hsa.meme --p CRE_of_DE.genes.fasta --o SEA
The control sequences are all cis-regulatory elements. The primary sequences are the cis-regulatory elements assigned to differentially expressed genes. The motifs were downloaded from the MEME motif database ("CIS-BP_2.00/Homo_sapiens.meme").
The top enriched motifs are listed below:
RANK DB ID ALT_ID CONSENSUS TP TP% FP FP% ENR_RATIO SCORE_THR PVALUE LOG_PVALUE EVALUE LOG_EVALUE QVALUE LOG_QVALUE
1 CD8_Nve_CB.UP_TF.meme M04519_2.00 ZNF704 YRCCGGCCGGYR 3977 77.1 18222 72.93 1.06 0.2 2.15E-10 -22.26 2.59E-08 -17.47 1.38E-08 -18.1
2 CD8_Nve_CB.UP_TF.meme M08275_2.00 ZFP64 SRBTCCCGGGSCCCS 4671 90.56 21896 87.64 1.03 0.17 8.43E-10 -20.89 1.01E-07 -16.11 2.71E-08 -17.42
3 CD8_Nve_CB.UP_TF.meme M02460_2.00 GMEB1 NNACGYNNN 4720 91.51 22214 88.91 1.03 4.1 9.77E-09 -18.44 1.17E-06 -13.66 1.66E-07 -15.61
4 CD8_Nve_CB.UP_TF.meme M01914_2.00 TET1 NNYRCGYWN 4925 95.48 23355 93.48 1.02 3.3 1.03E-08 -18.39 1.24E-06 -13.6 1.66E-07 -15.61
5 CD8_Nve_CB.UP_TF.meme M04400_2.00 KLF6 NRCCACGCCCH 5017 97.27 23906 95.69 1.02 1.9 2.49E-08 -17.51 2.99E-06 -12.72 3.20E-07 -14.95
6 CD8_Nve_CB.UP_TF.meme M08310_2.00 ZNF341 GCTSTTCCYBCYBCCSCCCBS 4137 80.21 19181 76.77 1.04 5.7 3.16E-08 -17.27 3.80E-06 -12.48 3.39E-07 -14.9
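As a quick sanity check on these numbers, the short Python sketch below recomputes the enrichment ratio as simply TP% divided by FP%, using values copied from the table above. This is only an approximation that I am assuming is close to what SEA reports (SEA's own calculation may differ slightly, for example through pseudocounts), and the function name `approx_ratio` is my own.

```python
# (ALT_ID, TP%, FP%, reported ENR_RATIO), copied from the SEA output above.
rows = [
    ("ZNF704", 77.10, 72.93, 1.06),
    ("ZFP64", 90.56, 87.64, 1.03),
    ("GMEB1", 91.51, 88.91, 1.03),
    ("TET1", 95.48, 93.48, 1.02),
    ("KLF6", 97.27, 95.69, 1.02),
    ("ZNF341", 80.21, 76.77, 1.04),
]


def approx_ratio(tp_pct, fp_pct):
    """Approximate enrichment as the ratio of the two percentages (an assumption)."""
    return tp_pct / fp_pct


for name, tp_pct, fp_pct, reported in rows:
    print(f"{name}: TP%/FP% = {approx_ratio(tp_pct, fp_pct):.2f} (reported {reported})")
```

So the motifs match most primary sequences, but they match almost as many control sequences, and the enrichment itself is only 2-6%.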
The top enriched motifs have a very high true positive percentage, some above 90%. This is likely because some motifs are poorly defined, as their consensus sequences contain many Ns. My questions are: should I trust these enriched motifs with high TP%? If not, is there anything I can do to filter them out when I run SEA?
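To make the question concrete, here is a minimal sketch of the kind of post-hoc filtering I could apply to the results table if SEA has no built-in option for this. It assumes the results are tab-separated with the column names shown in the header above (`TP%`, `ENR_RATIO`) and that any comment lines start with `#`. The function name `filter_sea_rows` and both thresholds are arbitrary placeholders of mine, not recommended values.

```python
import csv
import io


def filter_sea_rows(tsv_text, max_tp_pct=90.0, min_enr_ratio=1.05):
    """Keep rows with TP% <= max_tp_pct and ENR_RATIO >= min_enr_ratio."""
    # Drop blank lines and '#' comment lines before parsing the table.
    lines = [
        ln for ln in tsv_text.splitlines()
        if ln.strip() and not ln.startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    kept = []
    for row in reader:
        tp_pct = float(row["TP%"])
        enr_ratio = float(row["ENR_RATIO"])
        if tp_pct <= max_tp_pct and enr_ratio >= min_enr_ratio:
            kept.append(row)
    return kept


# Small demo using two rows from the table above (only the columns needed here).
demo = "\n".join([
    "\t".join(["ID", "ALT_ID", "TP%", "ENR_RATIO"]),
    "\t".join(["M04519_2.00", "ZNF704", "77.1", "1.06"]),
    "\t".join(["M04400_2.00", "KLF6", "97.27", "1.02"]),
])
for row in filter_sea_rows(demo):
    print(row["ID"], row["ALT_ID"], row["TP%"], row["ENR_RATIO"])
```

With these placeholder thresholds, KLF6 would be dropped and ZNF704 kept, but I do not know whether filtering on TP% like this is statistically sensible.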
Best,
Ya