meme duplicated motifs


Rocky Parida

Sep 12, 2016, 4:35:00 PM
to MEME Suite Q&A
Hi All
I am getting the same motif reported as enriched three separate times, each time for a different pair of sequences (six sequences in total).

for example 
a) CGGCGC is enriched for gene 1 and 2
b) CGGCGC is also enriched for gene 3 and 4
c) CGGCGC is also enriched for gene 6 and 7
My question is: why does MEME not report these together, as one motif enriched in the upstream regions of six different genes?

Here is my command line:
meme INPUTSEQ.txt -oc OUTDIR -dna -mod zoops -nmotifs 10 -minw 6 -maxw 8 -revcomp -bfile BACKGROUNDMODEL -maxsize 18422000

This is the output I am getting:

MOTIF  3 MEME   width =   6  sites =   2  llr = 20  E-value = 5.0e+007
MOTIF  5 MEME   width =   6  sites =   2  llr = 20  E-value = 5.0e+007
MOTIF  9 MEME   width =   6  sites =   2  llr = 20  E-value = 5.0e+007

I will be very grateful for your concern.
Rocky


CharlesEGrant

Sep 12, 2016, 5:13:28 PM
to MEME Suite Q&A
I'm afraid I don't understand your question. MEME doesn't measure motif enrichments, and doesn't know anything about genes. It performs de novo motif discovery on a collection of sequences. In any case, notice that each of your motifs has an E-value of 5.0e+007, which means that you are really just looking at noise. The typical threshold for statistical significance is an E-value of 0.01 (a smaller E-value is more statistically significant).
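To make the E-value point concrete, here is a minimal Python sketch that reads the three summary lines quoted in the first post and keeps only motifs at or below the 0.01 threshold mentioned above. The regular expression is an assumption that matches this particular line layout; other MEME versions may print the summary line differently, and the function names are hypothetical.

```python
import re

# Summary lines copied from the MEME text output quoted in the first post.
SUMMARY = """\
MOTIF  3 MEME   width =   6  sites =   2  llr = 20  E-value = 5.0e+007
MOTIF  5 MEME   width =   6  sites =   2  llr = 20  E-value = 5.0e+007
MOTIF  9 MEME   width =   6  sites =   2  llr = 20  E-value = 5.0e+007
"""

# Assumed layout of the summary line; adjust for other MEME versions.
LINE_RE = re.compile(
    r"MOTIF\s+(\S+)\s+MEME\s+width\s*=\s*(\d+)\s+sites\s*=\s*(\d+)"
    r"\s+llr\s*=\s*(\d+)\s+E-value\s*=\s*(\S+)"
)


def parse_summary(text):
    """Return one dict per motif summary line found in the text."""
    motifs = []
    for match in LINE_RE.finditer(text):
        name, width, sites, llr, evalue = match.groups()
        motifs.append({
            "motif": name,
            "width": int(width),
            "sites": int(sites),
            "llr": int(llr),
            "evalue": float(evalue),  # float() accepts the 'e+007' form
        })
    return motifs


def significant(motifs, threshold=0.01):
    """Keep motifs whose E-value is at or below the threshold."""
    return [m for m in motifs if m["evalue"] <= threshold]


motifs = parse_summary(SUMMARY)
kept = significant(motifs)
print(len(motifs), "motifs parsed,", len(kept), "pass the threshold")
```

With these three lines, none of the motifs passes, which is the "just looking at noise" situation described above.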

Rocky Parida

Sep 12, 2016, 6:07:25 PM
to MEME Suite Q&A
Hi Charles
Thank you for getting back to me. Appreciate it.
True, we are looking at noise, but because the number of input sequences is small, we do not have enough power to reach a typical E-value threshold.
For example, my input has 48 sequences, and my background model was built from close to 40,000 sequences.

But the top ten motifs returned by meme are making sense biologically. If I take those motifs and check against JASPAR they do make sense based on our experiment.
Therefore, we are taking these noisy motifs into account; it's just that we do not have enough power to get a better E-value.

Next, when I am using these commands:

meme INPUTSEQ.txt -oc OUTDIR -dna -mod zoops -nmotifs 10 -minw 6 -maxw 8 -revcomp -bfile BACKGROUNDMODEL -maxsize 18422000

********************************************************************************
--------------------------------------------------------------------------------
        Motif 4 Description
--------------------------------------------------------------------------------
Simplified        A  :::a::
pos.-specific     C  :a::aa
probability       G  a:a:::
matrix            T  ::::::

         bits    2.3 *** **
                 2.1 *** **
                 1.9 *** **
                 1.6 ******
Relative         1.4 ******
Entropy          1.2 ******
(13.5 bits)      0.9 ******
                 0.7 ******
                 0.5 ******
                 0.2 ******
                 0.0 ------

Multilevel           GCGACC
consensus
sequence

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 4 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name             Start   P-value             Site
-------------             ----- ---------            ------
scaffold_88:338158-33899    569  8.79e-05 TCAAAACATA GCGACC ATTTTATATG
scaffold_25:1033470-1033    129  8.79e-05 CAGATGATAA GCGACC GATTGACGGA
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 4 block diagrams
--------------------------------------------------------------------------------
SEQUENCE NAME            POSITION P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
scaffold_88:338158-33899          8.8e-05  568_[+4]_262
scaffold_25:1033470-1033          8.8e-05  128_[+4]_365
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 4 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 4 width=6 seqs=2
scaffold_88:338158-33899 (  569) GCGACC  1
scaffold_25:1033470-1033 (  129) GCGACC  1
//


MOTIF  9 MEME   width =   6  sites =   2  llr = 19  E-value = 6.2e+006
********************************************************************************
--------------------------------------------------------------------------------
        Motif 9 Description
--------------------------------------------------------------------------------
Simplified        A  :::a::
pos.-specific     C  :a::aa
probability       G  a:a:::
matrix            T  ::::::

         bits    2.3 *** **
                 2.1 *** **
                 1.9 *** **
                 1.6 ******
Relative         1.4 ******
Entropy          1.2 ******
(13.5 bits)      0.9 ******
                 0.7 ******
                 0.5 ******
                 0.2 ******
                 0.0 ------

Multilevel           GCGACC
consensus
sequence

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 9 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name             Start   P-value             Site
-------------             ----- ---------            ------
scaffold_84:365050-36600    221  8.79e-05 CGTTTAAGTC GCGACC AAAAATTGGT
scaffold_2:1677226-16777    228  8.79e-05 GGAAGAGCCA GCGACC CCGGCCGCCG
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 9 block diagrams
--------------------------------------------------------------------------------
SEQUENCE NAME            POSITION P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
scaffold_84:365050-36600          8.8e-05  220_[+9]_728
scaffold_2:1677226-16777          8.8e-05  227_[+9]_267
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 9 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 9 width=6 seqs=2
scaffold_84:365050-36600 (  221) GCGACC  1
scaffold_2:1677226-16777 (  228) GCGACC  1
//

meme INPUTSEQ.txt -oc OUTDIR -dna -mod anr -nmotifs 10 -minw 6 -maxw 8 -revcomp -bfile BACKGROUNDMODEL -maxsize 18422000

I am getting motifs that each occur in a separate pair of sequences, for example:
MOTIF  3 MEME   width =   6  sites =   2  llr = 19  E-value = 2.1e+006
********************************************************************************
--------------------------------------------------------------------------------
        Motif 3 Description
--------------------------------------------------------------------------------
Simplified        A  :::a::
pos.-specific     C  :a::aa
probability       G  a:a:::
matrix            T  ::::::

         bits    2.3 *** **
                 2.1 *** **
                 1.9 *** **
                 1.6 ******
Relative         1.4 ******
Entropy          1.2 ******
(13.5 bits)      0.9 ******
                 0.7 ******
                 0.5 ******
                 0.2 ******
                 0.0 ------

Multilevel           GCGACC
consensus
sequence

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 3 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name             Start   P-value             Site
-------------             ----- ---------            ------
scaffold_84:365050-36600    221  8.79e-05 CGTTTAAGTC GCGACC AAAAATTGGT
scaffold_100:79340-80455    159  8.79e-05 ACAAACCAAA GCGACC CTAATTTGAA
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 3 block diagrams
--------------------------------------------------------------------------------
SEQUENCE NAME            POSITION P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
scaffold_84:365050-36600          8.8e-05  220_[+3]_728
scaffold_100:79340-80455          8.8e-05  158_[+3]_951
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 3 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 3 width=6 seqs=2
scaffold_84:365050-36600 (  221) GCGACC  1
scaffold_100:79340-80455 (  159) GCGACC  1
//

********************************************************************************
MOTIF  4 MEME   width =   6  sites =   2  llr = 19  E-value = 2.1e+006
********************************************************************************
--------------------------------------------------------------------------------
        Motif 4 Description
--------------------------------------------------------------------------------
Simplified        A  :::a::
pos.-specific     C  :a::aa
probability       G  a:a:::
matrix            T  ::::::

         bits    2.3 *** **
                 2.1 *** **
                 1.9 *** **
                 1.6 ******
Relative         1.4 ******
Entropy          1.2 ******
(13.5 bits)      0.9 ******
                 0.7 ******
                 0.5 ******
                 0.2 ******
                 0.0 ------

Multilevel           GCGACC
consensus
sequence

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 4 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name             Start   P-value             Site
-------------             ----- ---------            ------
scaffold_88:338158-33899    569  8.79e-05 TCAAAACATA GCGACC ATTTTATATG
scaffold_25:1033470-1033    129  8.79e-05 CAGATGATAA GCGACC GATTGACGGA
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 4 block diagrams
--------------------------------------------------------------------------------
SEQUENCE NAME            POSITION P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
scaffold_88:338158-33899          8.8e-05  568_[+4]_262
scaffold_25:1033470-1033          8.8e-05  128_[+4]_365
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
        Motif 4 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 4 width=6 seqs=2
scaffold_88:338158-33899 (  569) GCGACC  1
scaffold_25:1033470-1033 (  129) GCGACC  1
//

My question is: why is the same motif being discovered separately like this? Why are all the sequences not treated as a single set, so that motif GCGACC is reported as found in 4 sequences?
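As a cross-check outside MEME, the reported sites can be pooled by their site string. The following is a minimal Python sketch, with the site table copied by hand from the ANR output above; the helper names are hypothetical, and because the run used -revcomp, each site is reduced to the smaller of itself and its reverse complement before pooling.

```python
from collections import defaultdict

# (motif number, sequence name, start, site) copied from the ANR output above.
SITES = [
    (3, "scaffold_84:365050-36600", 221, "GCGACC"),
    (3, "scaffold_100:79340-80455", 159, "GCGACC"),
    (4, "scaffold_88:338158-33899", 569, "GCGACC"),
    (4, "scaffold_25:1033470-1033", 129, "GCGACC"),
]

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(site):
    """Return the lexicographically smaller of a site and its reverse complement."""
    reverse_complement = site.translate(_COMPLEMENT)[::-1]
    return min(site, reverse_complement)


def pool_by_site(sites):
    """Group motif numbers and sequence names under each canonical site string."""
    pooled = defaultdict(lambda: {"motifs": set(), "sequences": set()})
    for motif, sequence, _start, site in sites:
        key = canonical(site)
        pooled[key]["motifs"].add(motif)
        pooled[key]["sequences"].add(sequence)
    return dict(pooled)


pooled = pool_by_site(SITES)
for site, info in pooled.items():
    print(site, "motifs:", sorted(info["motifs"]),
          "sequences:", len(info["sequences"]))
```

This only confirms that the two reported motifs share one site string across four sequences; it says nothing about why MEME reported them separately.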
I am very grateful for your concern Charles.

Rocky


Rocky Parida

Sep 14, 2016, 3:01:37 PM
to MEME Suite Q&A
For anyone who might benefit from this, here is what worked for me.

When I used the ANR model and set -maxsites to 80, the repeated motifs disappeared and distinct motifs such as GCGACC and GCGACCCC showed up instead.
It has to do with how many occurrences of a motif are allowed across the set of sequences when using ANR. I tried different -maxsites values to see how they affect the issue; a -maxsites of 80 or above stops the same motif from being reported independently for more than one set of sequences.

Thanks
Rocky

CharlesEGrant

Sep 16, 2016, 4:03:02 PM
to meme-...@googlegroups.com
The underlying issue is that MEME uses a greedy algorithm. As soon as MEME has enough evidence for a motif to estimate its statistical significance it will report that motif, and begin looking for other motifs.

In particular, when you choose the ZOOPS model but don't specify the -minsites option, MEME defaults to requiring only two sites. As soon as MEME had found two sites for a motif, it would start looking for other motifs. This is why in your first result set all your motifs showed only two occurrences. Except for the OOPS model, the minimum number of sites defaults to 2.

Whether you choose the OOPS, ZOOPS, or ANR model should be governed by your understanding of the biology. Do you expect your sequences to contain at most one instance of the motif (OOPS or ZOOPS), or could they contain multiple instances of the motif (ANR)?

Rocky Parida

Sep 23, 2016, 8:05:55 PM
to MEME Suite Q&A
Thank you so much for getting back to me Charles. I highly appreciate it. I understand that meme is trying to find the minimum sites for a motif and as soon as it finds them it starts looking for other motifs.

I have tried this new command:
meme ISEQUENCES.txt -oc MEMEMOTIF -dna -revcomp -maxsites 100 -mod anr -nmotifs 50 -minw 6 -maxw 8 -bfile BG_5.model -maxsize 19810000 -p 16
Based on the manual, the default minimum number of sites for this command should be 465, correct? That is min(5 × 93, 600), as I am submitting only 93 sequences as my input.

I am getting the following scenario:
--------------------------------------------------------------------------------
        Motif 21 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name            Strand  Start   P-value             Site
-------------            ------  ----- ---------            ------
scaffold_1:2079505-20800     +    412  1.36e-04 AAACTGTTTC CGACGA TCTTTTTCGA
scaffold_1:2079505-20800     -    165  1.36e-04 GAAATTCAGA CGACGA TTGTGTATTT
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
        Motif 29 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name            Strand  Start   P-value             Site
-------------            ------  ----- ---------            ------
scaffold_17:1203250-1203     +    191  1.36e-04 AACCGAAAAA CGACGA TTGTTGGACA
scaffold_9:1776808-17773     -    327  1.36e-04 ATCCTCGTCC CGACGA TTACATCGTC
--------------------------------------------------------------------------------

My question is: if the same motif occurs in all these sequences, why doesn't MEME combine all the occurrences to make a better consensus? I am allowing up to 100 sites with -maxsites, and clearly there are more than 2 sites for this motif.

I am confused because this behavior is not consistent across all motifs discovered by MEME. It also gives me motifs such as this:

--------------------------------------------------------------------------------
        Motif 4 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name            Strand  Start   P-value              Site
-------------            ------  ----- ---------            --------
scaffold_138:281271-2817     -     68  8.62e-06 GTGCATACAG GGAGGAGG AGACTCTCTC
scaffold_21:1032274-1032     +    245  8.62e-06 ATACAATAGC GGAGGGGG GGGGGGGGTG
scaffold_8:401504-402004     -    464  8.62e-06 TCCTTGAAGC GGAGGGGG AAAAGAAACC
scaffold_86:153800-15429     -    224  8.62e-06 TTGTGTTTTT GGAGGGGG ATGTTCTTAT
scaffold_86:153800-15429     -    128  8.62e-06 TGAAGACTAG GGAGGAGG GCTTTTAATC
scaffold_7:572197-572697     -     16  8.62e-06 GTGGGAAGTA GGAGGGGG AGTCACATGA
scaffold_17:688200-68869     +    404  8.62e-06 TGTGGTTGGT GGAGGAGG AGGCTGGCCG
scaffold_4:702324-702823     +    441  8.62e-06 CGAATAGGAA GGAGGGGG GGGGGGAACG
scaffold_1329:2586-3085      -    161  8.62e-06 CTAATAGGAA GGAGGAGG GAGGTGGGAA
scaffold_32:261278-26177     +     57  8.62e-06 GAAACAGGGT GGAGGGGG GGATACAAAA
scaffold_5:1253243-12537     +    386  8.62e-06 CGTAAGGGCA GGAGGAGG GGGGTCCAAA
scaffold_471:17995-18495     +    354  8.62e-06 AGGCGGAAGT GGAGGAGG ATTCCTTGTT
scaffold_58:309773-31027     -    212  1.20e-05 GGAAGGCCGT GGAGGCGG AAAAAGAGGA
scaffold_9:1776808-17773     +    292  1.20e-05 TCTACATATA GGAGGCGG GCGATCGGCT
scaffold_12:923707-92420     -    249  1.20e-05 TTACATGGCT GGAGGCGG TGATGGCAAA
scaffold_2:2191051-21915     -    405  1.76e-05 AGGGTTTTTT GGGGGAGG CCTCTTTATA
scaffold_7:992552-993052     +     44  1.76e-05 GCCCCGTTGA GGGGGAGG ATGTGACAAT
scaffold_37:989489-98998     +    360  1.76e-05 GGCTCTCTGT GGGGGAGG AATTTTCAAG
scaffold_21:601673-60217     -    361  1.97e-05 GGAGAGCGGC GGGGGCGG GTCAATGGCC
scaffold_5:1259883-12603     -    113  2.83e-05 TTGTGGTGGT GGTGGAGG TGGTTGTGGA
scaffold_60:721319-72181     +     14  2.83e-05 GTATTAGTAT GGTGGGGG AACTTATGGT
scaffold_86:158050-15854     +      7  2.83e-05     CATGAA GGTGGAGG GTTCCAGCTC
scaffold_3:1327506-13280     -    272  3.17e-05 GCAGTCCCAG GGTGGCGG CTCCTGTTTT
--------------------------------------------------------------------------------

Why does MEME take multiple occurrences into account for this motif, GCT[GC]CTG[CG], but in the former case split them into two separate motifs when it is actually the same motif? Why does it not split these into sets of 2 occurrences as well?
 
Hope this makes sense. I highly appreciate your answers and I will be grateful for your concern.
This is an excellent resource for us and for that I am very grateful.
Rocky

CharlesEGrant

Sep 27, 2016, 8:57:28 PM
to meme-...@googlegroups.com
Hi Rocky,

Sorry, there was an error in my previous post, which I've since corrected. For the ANR model, min(5 × sequence count, 600) is the default value for max sites, not min sites. For the ANR and ZOOPS models the min sites parameter defaults to 2. For OOPS, min sites defaults to the number of sequences. These parameters are described in the command-line documentation for MEME. If you explicitly set the -minsites option to at least four, it should pull motifs 21 and 29 together.
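The defaults described above, and the suggested -minsites change, can be written down as a short Python sketch. The function names are hypothetical, the default values are only those stated in this thread (the maximum is left as None where the thread gives none), and the command is assembled as a string rather than executed. Options from the earlier command that do not matter here (-bfile, -maxsize, -p) are left out for brevity.

```python
import shlex


def default_site_bounds(model, n_sequences):
    """Return (minsites, maxsites) defaults as stated in this thread.

    maxsites is None where the thread gives no default. These values are
    assumptions to confirm against the MEME command-line documentation
    for the installed version.
    """
    model = model.lower()
    if model == "oops":
        return n_sequences, None
    if model == "zoops":
        return 2, None
    if model == "anr":
        return 2, min(5 * n_sequences, 600)
    raise ValueError(f"unknown model: {model!r}")


def meme_command(input_file, out_dir, model, minsites=None, maxsites=None):
    """Assemble a MEME argument list using flags that appear in this thread."""
    cmd = ["meme", input_file, "-oc", out_dir, "-dna", "-revcomp",
           "-mod", model, "-nmotifs", "50", "-minw", "6", "-maxw", "8"]
    if minsites is not None:
        cmd += ["-minsites", str(minsites)]
    if maxsites is not None:
        cmd += ["-maxsites", str(maxsites)]
    return cmd


print(default_site_bounds("anr", 93))  # (2, 465)

# Same file names as the earlier command, with -minsites set explicitly.
cmd = meme_command("ISEQUENCES.txt", "MEMEMOTIF", "anr", minsites=4, maxsites=100)
print(shlex.join(cmd))
```

Note that the earlier command already passes -maxsites 100 explicitly, so the 465 default for max sites would not apply to that run either way.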

Rocky Parida

Sep 27, 2016, 9:41:27 PM
to MEME Suite Q&A
Thank you very much for getting back to me. I did try increasing -minsites a few days ago and, as you mentioned, I saw multiple (>2) occurrences for motif 21 taken into account for the final consensus.
I still do not know how MEME decides that there are multiple (>2) sites for motif 4 but not for motif 21.
As always, I am very grateful for your concern.

CharlesEGrant

Sep 27, 2016, 11:30:37 PM
to meme-...@googlegroups.com
Hi Rocky,

MEME does not perform an exhaustive search for motifs. The search space of all possible motifs for even a tiny amount of sequence data is beyond astronomical! Instead, like other de novo motif discovery programs, MEME is performing a stochastic search. MEME makes intelligent initial guesses for the locations of motif instances, and the Position Weight Matrix (PWM) for the motifs. It then makes adjustments to both the locations of the motif instances and the contents of the PWM, trying to optimize the match. It keeps pursuing the optimizations as long as good progress is being made. Presumably the optimization for motif 4 quickly identified dozens of plausible sites, while the optimization of motif 21 gave up before finding the additional sites that were later used for motif 29. Notice that the p-values for the sites for motif 4 are all more significant than the p-values for the sites of motifs 21 and 29.
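The alternation described above (guess site locations, build a PWM, re-pick locations, repeat) can be illustrated with a toy Python sketch. This is not MEME's algorithm: it is a deterministic, hard-assignment simplification with exactly one site per sequence, a uniform background, and made-up sequences, and every name in it is hypothetical.

```python
import math

ALPHABET = "ACGT"

# Made-up sequences, each containing GCGACC once (illustration only).
SEQS = ["TTATGCGACCAT", "AGCGACCTTATA", "TATAAGCGACCT"]


def build_pwm(sites, pseudocount=0.5):
    """Per-column log-odds scores against a uniform background."""
    pwm = []
    for col in range(len(sites[0])):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / 0.25) for b in ALPHABET})
    return pwm


def score(pwm, word):
    """Sum the column scores of a word of the same width as the PWM."""
    return sum(column[base] for column, base in zip(pwm, word))


def best_start(pwm, seq):
    """Start position of the highest-scoring window in one sequence."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))


def refine(seqs, starts, width, max_iter=20):
    """Alternate between rebuilding the PWM and re-picking site positions."""
    for _ in range(max_iter):
        sites = [s[i:i + width] for s, i in zip(seqs, starts)]
        pwm = build_pwm(sites)
        new_starts = [best_start(pwm, s) for s in seqs]
        if new_starts == starts:  # no further progress, stop
            break
        starts = new_starts
    return starts, [s[i:i + width] for s, i in zip(seqs, starts)]


# The initial guess is right for two sequences and wrong for the third.
final_starts, final_sites = refine(SEQS, [4, 1, 0], width=6)
print(final_starts, final_sites)
```

Here two good starting sites are enough to pull the third one into place. With a poorer starting point, the same loop can stop at a different answer, which is the kind of outcome suggested above for motifs 21 and 29.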

If you want to know the details of the MEME algorithm you should consult this paper:

Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994.


Rocky Parida

Sep 29, 2016, 12:26:17 AM
to MEME Suite Q&A
Thank you, Charles, for this answer. I am very grateful.
Rocky