Enrichment of de-novo discovered motifs

Nick Riddiford

Jul 20, 2018, 6:21:30 AM
to MEME Suite Q&A
I have identified some motifs in a set of sequences using MEME, and I now want to ask whether these are enriched in these sequences compared to a background set of sequences that we've generated.

To do this I'm using: 

    # Options currently commented out: -neg $negativeSeqs -objfun de
    meme $sequences -dna -oc "$outdir" \
      -bfile $markov_model \
      -evt 0.01 \
      -nostatus \
      -time 18000 \
      -mod anr \
      -nmotifs 15 \
      -maxw 30 \
      -revcomp

    ame -oc "$outdir/ame" \
      --scoring totalhits \
      --control $negativeSeqs \
      --bfile $markov_model \
      $sequences \
      "$outdir/meme.txt" \
      $motif_databases/*

where $markov_model is a 2nd-order Markov model generated by `fasta-get-markov`.

My first question relates to the `-objfun` selection in MEME (v5): If I run with -neg $negativeSeqs and specify -objfun de, does this look for enrichment of de-novo discovered motifs relative to my background sequences in a way comparable to the approach AME uses? If so, can I use this in place of AME, or does AME provide a better estimate of enrichment? 

Secondly, I would like to be able to see the raw counts of motif presence in both my $sequences and $negativeSeqs so that I can assess the results I get from AME a bit more clearly. Here is a line from my output ame.html: 

    Database  ID                             Alt ID  p-value  E-value  TP Thresh  TP (%)      FP (%)
    meme      KTTTDTWTTTKTTTTTKTTTTTTTMAYTT  MEME-1  1.15e-5  1.79e-2  1.00       28 (34.1%)  217 (10.9%)

Can I say here that I have a match to this motif in 34% of my $sequences and 10.9% of my $negativeSeqs, so roughly 3-fold enrichment? How confident can I be about this sort of enrichment (where the E/P values are fairly weak, and there's only a modest enrichment)? Statistics aside, the fact that only 34% of my sequences have this motif would suggest to me that it's probably not that biologically relevant.
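
As a quick check of the arithmetic in the question above, here is a minimal sketch that divides the TP percentage by the FP percentage from the AME row. `fold_enrichment` is a hypothetical helper name, not a MEME Suite command, and the ratio says nothing about statistical significance.

```shell
# fold_enrichment is a hypothetical helper, not part of the MEME Suite.
# $1 = percent of primary sequences with a match (TP %)
# $2 = percent of control sequences with a match (FP %)
fold_enrichment() {
    awk -v tp="$1" -v fp="$2" 'BEGIN {
        # Avoid dividing by zero when no control sequence matches.
        if (fp <= 0) { print "NA"; exit }
        printf "%.2f\n", tp / fp
    }'
}

# Values taken from the TP (%) and FP (%) columns quoted above.
fold_enrichment 34.1 10.9
```

With the values from the table this comes out at a little over 3, in line with the "roughly 3-fold" estimate.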

I also wanted to ask whether I can omit the $markov_model entirely from both commands, seeing as I provide $negativeSeqs, which is a better reflection of the k-mer frequencies in my $sequences than the whole genome I used to generate the $markov_model. How does the order of the Markov model influence motif enrichment?

Let me know if I can provide any more details of my set up. 

Thanks! 

Nick 

CharlesEGrant

Jul 23, 2018, 3:02:12 PM
to meme-...@googlegroups.com
Hi Nick,

I noticed a few issues with your example.

The first thing I'd note is that MEME is the only tool in the MEME Suite that can take advantage of background models of higher order. All the other tools (like AME) will simply extract the 0th-order model (single-symbol frequencies) from higher-order models. Second, you probably don't want to use the '-evt' option. MEME is a greedy algorithm, and while it generally finds 'good' motifs first, it isn't guaranteed to find motifs in strictly increasing order of E-value. If you use the '-evt' option, MEME will quit as soon as it finds a motif with an E-value larger than the threshold. This may result in MEME missing significant motifs. Finally, I'd note that the motif that shows up in your AME results ('KTTTDTWTTTKTTTTTKTTTTTTTMAYTT') essentially looks like a run of 'T'. MEME isn't able to discriminate simple repeats or low-complexity regions from more biologically significant motifs. You may want to filter repeats and low-complexity regions from your input sequences using tools like RepeatMasker or DUST.

Now to your questions:

> My first question relates to the `-objfun` selection in MEME (v5): If I run with -neg $negativeSeqs and specify -objfun de, does this look for enrichment of de-novo discovered motifs relative to my background sequences in a way comparable to the approach AME uses? If so, can I use this in place of AME, or does AME provide a better estimate of enrichment?


MEME is performing de novo motif discovery, while AME is measuring enrichment of specified motifs. These are quite different tasks.

MEME's '-objfun de' objective function is useful when your input sequences contain multiple distinct motifs. As I mentioned, MEME is a greedy algorithm, so if multiple motifs are present, one motif may end up dominating the results. It can be useful to tell MEME to ignore certain motifs, making it easier to spot other motifs. The idea is that your primary input sequences contain instances of both motif A and motif B, and as it turns out motif B is known to be irrelevant to your experiment. You'd like MEME to ignore the instances of the B motif. You provide a collection of negative sequences that are known to contain instances of motif B and are presumed not to contain motif A. As MEME searches for motifs, it will find that motif A is enriched in the primary set compared to the control set, so it will score instances of motif A higher than motif B. Note that the MEME output won't return any statistics about the significance of the enrichment of A between the two sequence collections, so this is not a substitute for running AME.


> Secondly, I would like to be able to see the raw counts of motif presence in both my $sequences and $negativeSeqs so that I can assess the results I get from AME a bit more clearly. Here is a line from my output ame.html:

AME doesn't necessarily count motif instances. AME provides several scoring methods, and the default is to take the average log-odds match score over the entire sequence. The 'totalhits' value for the '--scoring' option is the only scoring method that counts individual motif matches. See the section on the '--scoring' option in the AME documentation for more information.

The 'totalhits' method doesn't report raw counts of motif matches. You could run FIMO with an appropriate p-value threshold to obtain them.
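
One way to turn FIMO output into per-set counts could look like the sketch below. It assumes the tab-separated fimo.tsv layout of MEME Suite v5 (motif_id in column 1, sequence_name in column 3). `count_hit_sequences` is a hypothetical helper, and the output directories and the 1e-4 threshold in the comments are placeholders rather than recommendations.

```shell
# count_hit_sequences is a hypothetical helper, not a MEME Suite command.
# Assumes a tab-separated FIMO results file with motif_id in column 1
# and sequence_name in column 3.
count_hit_sequences() {
    # $1 = FIMO results file, $2 = motif ID to count
    awk -F'\t' -v id="$2" '
        /^#/ || NF < 3 { next }      # skip comment lines and blank lines
        $1 == id { seen[$3] = 1 }    # record each matching sequence once
        END { n = 0; for (s in seen) n++; print n }
    ' "$1"
}

# Intended use (requires FIMO; paths and threshold are placeholders):
#   fimo --oc "$outdir/fimo_pos" --thresh 1e-4 "$outdir/meme.txt" $sequences
#   fimo --oc "$outdir/fimo_neg" --thresh 1e-4 "$outdir/meme.txt" $negativeSeqs
#   count_hit_sequences "$outdir/fimo_pos/fimo.tsv" KTTTDTWTTTKTTTTTKTTTTTTTMAYTT
#   count_hit_sequences "$outdir/fimo_neg/fimo.tsv" KTTTDTWTTTKTTTTTKTTTTTTTMAYTT
```

The helper counts sequences with at least one match rather than total matches, which is the quantity needed to compare the fraction of sequences hit in each set.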

> I also wanted to ask whether I can omit the $markov_model entirely from both commands. Seeing as I provide $negativeSeqs which is a better reflection of the k-mer frequencies in my $sequences than the whole genome I used to generate the $markov_model? How does the order of the markov model influence motif enrichment?
 
The background model serves a completely different function from the negative sequences used with the '-objfun de' objective function. The idea of the background model is that it provides the probabilities of observing particular nucleotides when a position is NOT part of a motif. Ideally you'd derive the background model from sequences that are biologically similar to the sequences of interest but that are believed not to contain motifs. That can be a hard collection to come up with. Pragmatically, we may assume that motif instances are not common in the input sequences and simply use the observed frequencies from the input sequences as the background model. This is the default behavior if you don't provide an external background model. Choosing an accurate background model can noticeably improve MEME's statistical power.
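
To make concrete what a 0-order background contains, here is a minimal sketch that tallies single-nucleotide frequencies from a FASTA file and prints one letter and frequency per line. `zero_order_background` is a hypothetical helper for illustration only. In practice `fasta-get-markov -m 0` is the Suite's tool for this, and its output may differ (for example in how it treats the reverse strand).

```shell
# zero_order_background is a hypothetical helper, not a MEME Suite command.
# It counts A, C, G and T on the given strand only and ignores other symbols.
zero_order_background() {
    # $1 = FASTA file
    awk '
        /^>/ { next }                        # skip FASTA header lines
        {
            seq = toupper($0)
            n = length(seq)
            for (i = 1; i <= n; i++) {
                base = substr(seq, i, 1)
                if (base ~ /[ACGT]/) { count[base]++; total++ }
            }
        }
        END {
            if (total == 0) exit 1           # no usable sequence found
            split("A C G T", bases, " ")
            for (i = 1; i <= 4; i++)
                printf "%s %.3f\n", bases[i], count[bases[i]] / total
        }
    ' "$1"
}

# Intended use (file name is a placeholder):
#   zero_order_background control_sequences.fa
```

Comparing this output for the input sequences and for a candidate background set is a simple way to see how much the two compositions differ.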

Nick Riddiford

Jul 24, 2018, 10:40:22 AM
to MEME Suite Q&A

Hi Charles,

Thanks a lot for your very helpful and detailed response.

We are looking for motifs surrounding particular loci that are found genome-wide (excluding unmappable regions); initially we are looking within +/- 100 bp of each locus. The repetitive motif is actually one that we regularly find in sequences surrounding these loci, and we are therefore interested in it. As long as MEME doesn't have any bias towards finding these sorts of highly repetitive motifs, we don't want to exclude it.

To then test whether these motifs are indeed enriched at these loci (rather than just being very abundant in the genome), we are essentially extracting sequences of the same length from randomly generated positions in the mappable genome. One thing to note is that we simulate many more control sequences (10,000) than we have test sequences (~200), to give us a better approximation of the background nucleotide frequencies.

Given your response above, I am now using these control sequences to construct a 0-order background (rather than the whole genome that I was previously using), which I now use as input to run MEME for de-novo motif discovery in our test sequences ($sequences). I have also removed the -evt option, as you suggested:


    meme $sequences -dna -oc "$outdir" \
      -bfile $markov_model \
      -time 18000 \
      -mod anr


This gives me some motifs, which I then use as input to AME, along with my test sequences ($sequences) and my control sequences ($control_seqs), to test for enrichment:


    ame -oc "$outdir/ame" \
      --scoring totalhits \
      --control $control_seqs \
      --bfile $markov_model \
      $sequences \
      "$outdir/meme.txt" \
      $motif_databases/*


Is it appropriate to use the same background for both MEME and AME? In this example, the background is also created from the control sequences. Is this the recommended approach?


Seeing as there's no pattern per se to the genomic distribution of the loci that we are looking at, I think that it will be difficult to generate a set of background sequences that we believe not to contain motifs. In this case, maybe it's better to use the default and generate a background from shuffled input? 


Many thanks, 


Nick 
