%FP variation using the same motifs and control sequence with SEA

César Martínez

Jun 18, 2024, 1:18:18 PM

to MEME Suite Q&A

Hi, I'm using the MEME Suite to discover overrepresented motifs in a set of 5 genes that are upregulated in my experimental condition. My workflow was:

1. De novo motif search: Using XSTREME, I used as primary sequences the 3 kbp upstream of the ATG for the 5 genes, and as a negative control 10,000 randomly generated 3 kbp sequences from the reference genome. The command was: xstreme --p promotors3kbp.fasta --n 3000bp.random.fasta --minw 6 --maxw 10 --streme-nmotifs 10 --meme-nmotifs 10
2. Motif enrichment: Using the 20 motifs predicted in the previous step, the same negative control (3000bp.random.fasta), and as primary sequences the 3 kbp upstream of the ATG for every gene in the genome. I did this step to check whether my motifs are overrepresented across the whole genome. The command was: sea --p 3kbpALLgenesPromotors.fasta --m Motifs_deNovo_CNRand.meme --n 3000bp.random.fasta

I found my motifs enriched in a lot of genes, but I don't understand why, if I'm using the same negative control in both analyses, I'm obtaining very different false positive values. In the SEA output of the de novo analysis (step 1) the %FP is very low, whereas in the motif enrichment analysis (step 2, using the predicted motifs) the %FP is very high. I thought the reason could be the hold-out set (because in step 1 there are not enough sequences for a hold-out set), but I set --hofract 0.0 in the motif enrichment and obtained very similar results.

Am I doing my analysis correctly, and is my negative control good enough to achieve my objective?

Why are these differences in %FP observed?

My files are in Dropbox.

Thank you very much in advance
César

tlawb...@gmail.com

Jun 19, 2024, 5:25:15 PM

to MEME Suite Q&A

César,

SEA is working as designed.

Given primary and control sets of sequences, SEA finds the score threshold for a given motif that maximizes the statistical significance (minimizes the unadjusted p-value). This threshold will depend on the primary sequences as well as the control sequences. The false positive percentage (FP%) will also depend on both sets of sequences because it depends on the score threshold.

This is the explanation for your observation that using a very large set of primary sequences resulted in a higher FP%. If you compare the score thresholds for motif STREME-1, you will notice that it is lower with the large primary set compared to the small primary set. Since the control set is the same, a lower threshold means more control sequences score at or above it, which implies a higher FP%.
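
To make the threshold effect concrete, here is a minimal Python sketch of the idea. It is not SEA's actual implementation: it assumes a one-sided Fisher exact test, ignores the hold-out set, and uses made-up per-sequence best-match scores. The function names (fisher_right_tail, best_threshold) and all numbers are hypothetical and serve only to illustrate why the same control set can give different FP% values.

```python
from math import comb

def fisher_right_tail(tp, n_pos, fp, n_neg):
    """One-sided Fisher exact p-value: P(X >= tp) under the hypergeometric null."""
    total = n_pos + n_neg
    hits = tp + fp
    upper = min(n_pos, hits)
    num = sum(comb(hits, k) * comb(total - hits, n_pos - k)
              for k in range(tp, upper + 1))
    return num / comb(total, n_pos)

def best_threshold(primary_scores, control_scores):
    """Scan candidate thresholds and return (threshold, p-value, FP%) for the
    threshold that minimizes the p-value."""
    best = None
    for thr in sorted(set(primary_scores)):
        tp = sum(s >= thr for s in primary_scores)   # primary sequences passing
        fp = sum(s >= thr for s in control_scores)   # control sequences passing
        p = fisher_right_tail(tp, len(primary_scores), fp, len(control_scores))
        if best is None or p < best[1]:
            best = (thr, p, 100.0 * fp / len(control_scores))
    return best

# Made-up best-match scores per sequence (assumption, for illustration only).
control = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 11]
small_primary = [12, 12, 13, 12, 14]            # few sequences, strong matches
large_primary = [8] * 30 + [4] * 5 + [12] * 5   # many sequences, mostly moderate matches

small = best_threshold(small_primary, control)
large = best_threshold(large_primary, control)
print("small primary set: threshold=%s, p=%.3g, FP%%=%.1f" % small)
print("large primary set: threshold=%s, p=%.3g, FP%%=%.1f" % large)
```

In this toy data the control list is identical in both calls, yet the large primary set settles on a lower threshold than the small one, so more control sequences pass and the FP% goes up.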

Hope this helps,

Tim
