Background Models for MEME


CharlesEGrant

Oct 26, 2012, 2:29:38 PM
to meme-...@googlegroups.com
What is the background model and why is it important? 

The background model is used by MEME to estimate the probability of a candidate motif appearing in your dataset simply by chance. MEME's default background model is a 0th order Markov model, the character frequencies of which are derived from the submitted sequence data. A 0th order Markov model assumes that character frequencies at each position in the sequence are independent of the characters found in the previous positions. In many cases this is a reasonable assumption, but in other cases it may be an invalid assumption (CpG islands, for example). MEME allows you to submit a custom background Markov model. Providing your own higher-order background model can greatly improve MEME's ability to discover motifs. 

How do I create a custom background model? 

The command-line utility fasta-get-markov, included in the MEME Suite download, generates custom background Markov models. Its input is a FASTA file containing "background" sequences. Ideally, these "background" sequences will be different from the sequences you are analyzing with MEME, but as similar in nature as possible. For example, if you wanted to discover motifs in certain intergenic regions, you might use sequence data from other intergenic regions to generate the background. The larger the set of "background" sequences, the more reliable the estimated frequencies will be. The '-m' option for fasta-get-markov sets the order of the background Markov model. The order of the Markov model is the number of preceding positions considered when calculating the character frequencies for the current position. 
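
To make "order" concrete, here is a minimal sketch that counts the (order+1)-letter tuples in a FASTA file, which is the kind of statistic a Markov background model is built from. It only illustrates the idea and is not a reimplementation of fasta-get-markov, which may handle pseudocounts, strands and ambiguous characters differently. The function name tuple_freqs is made up for this example.

```sh
# tuple_freqs ORDER: read FASTA on stdin and print the relative frequency of
# every (ORDER+1)-letter tuple, one per line, sorted alphabetically.
# Illustration only; hypothetical helper, not part of the MEME Suite.
tuple_freqs() {
    awk -v k="$(( $1 + 1 ))" '
        # Count every k-letter window in one complete sequence.
        function count_tuples(seq,    i, n) {
            n = length(seq) - k + 1
            for (i = 1; i <= n; i++) {
                count[substr(seq, i, k)]++
                total++
            }
        }
        # A header line ends the previous record, so count it and reset.
        /^>/ { count_tuples(seq); seq = ""; next }
        # Sequence lines are joined so tuples can span line breaks.
        { seq = seq toupper($0) }
        END {
            count_tuples(seq)
            for (t in count) printf "%s %.3f\n", t, count[t] / total
        }
    ' | sort
}

# Example use (background.fasta is a placeholder file name):
#   tuple_freqs 0 < background.fasta    # single-letter frequencies
#   tuple_freqs 1 < background.fasta    # dinucleotide frequencies
```

With order 0 each letter is counted on its own. With order 1 the counts are over pairs, so a depleted pair such as CG shows up directly in the output.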

Typically, you should not specify an order larger than 3 for DNA sequences, or larger than 2 for protein sequences. However, if your input sequences contain higher-order non-random effects that are getting in the way of motif finding, you can apply these "rules of thumb": 
    * Use a background model whose order is at least four less than the width of the shortest motif you are looking for. So, if you want to find motifs as short as six, I wouldn't use a model higher than order two. 
    * For an accurate model of order N, the FASTA file you give to fasta-get-markov should contain at least 10 times 4**(N+1) DNA characters. So,

      order 3 requires 2560 characters 
      order 4 requires 10240 characters 
      order 5 requires 40960 characters 
      etc.
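
The second rule of thumb is easy to check before building a model. The sketch below computes 10 times 4**(N+1) for a given order and counts the sequence characters in a FASTA file. The function names min_chars and fasta_chars are made up for this example, and the count simply ignores header lines and whitespace.

```sh
# min_chars ORDER: rule-of-thumb minimum number of DNA characters,
# i.e. 10 * 4^(ORDER+1). Hypothetical helper name.
min_chars() {
    chars=10
    i=0
    while [ "$i" -le "$1" ]; do
        chars=$(( chars * 4 ))
        i=$(( i + 1 ))
    done
    echo "$chars"
}

# fasta_chars: count sequence characters in FASTA read from stdin,
# skipping header lines and whitespace. Hypothetical helper name.
fasta_chars() {
    grep -v '^>' | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Example use (background.fasta is a placeholder file name):
#   [ "$(fasta_chars < background.fasta)" -ge "$(min_chars 3)" ] \
#       && echo "enough data for an order 3 model"
```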


Example 

Suppose the FASTA file is called 'background.fasta'. Then a typical use of fasta-get-markov would be 

Code:
fasta-get-markov -m 0 < background.fasta > background.model


This would read the sequences from background.fasta, generate a 0th order Markov model, and write it to background.model. 

Similarly you could generate a 1st order Markov model with 

Code:
fasta-get-markov -m 1 < background.fasta > background.model


The file background.model can then be used as the background file when running MEME.
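
As a sketch of that last step: MEME reads the background file through its -bfile option, and DNA input also needs the -dna switch because MEME assumes protein sequences by default. The wrapper name run_meme_bg and the file names in the usage comment are placeholders for illustration, and any other MEME options you normally use would be added to the same command line.

```sh
# run_meme_bg SEQUENCES MODEL: run MEME on DNA sequences with a custom
# background model. Hypothetical wrapper; -dna and -bfile are MEME options.
run_meme_bg() {
    meme "$1" -dna -bfile "$2"
}

# Example use (placeholder file names):
#   run_meme_bg sequences.fasta background.model
```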

Gonzalo Olivares

Nov 26, 2014, 5:31:24 PM
to meme-...@googlegroups.com
Dear Charles,

Thanks a lot for the post. I'm trying to use MEME to find motifs, and I used fasta-get-markov to generate a 2nd order background model. However, when I used it I received an error saying that the probability for AAD is missing. I don't know why it is asking for this combination, since I'm using DNA. Do you have any insight that could help me?

Thanks

Gonzalo

James Johnson

Nov 30, 2014, 9:15:12 PM
This happens when MEME has been run without the "-dna" switch. By default, MEME assumes the sequences are protein. The default for fasta-get-markov is the opposite: in the currently released version it assumes DNA (though in the future it will auto-detect the alphabet, as we have that feature on a development branch).
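
If you are unsure which alphabet a model file was built for, one quick check is whether its tuples contain only A, C, G and T. The sketch below assumes the model file has one tuple and its frequency per line, with comment lines starting with '#'. That layout is an assumption here, so check it against your own file. The function name model_is_dna is made up for this example.

```sh
# model_is_dna: succeed if every tuple in the model read from stdin uses only
# A, C, G and T. Assumes "TUPLE FREQUENCY" lines and '#' comment lines.
# Hypothetical helper, not part of the MEME Suite.
model_is_dna() {
    ! grep -v '^#' | awk 'NF { print $1 }' | grep -q '[^ACGTacgt]'
}

# Example use (background.model is a placeholder file name):
#   model_is_dna < background.model && echo "DNA model: run MEME with -dna"
```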

Gonzalo Olivares

Dec 15, 2014, 3:20:51 AM
to meme-...@googlegroups.com
Thanks!!

av...@ualberta.ca

Dec 4, 2015, 6:28:44 AM
to MEME Suite Q&A, marian...@na.icar.cnr.it
Hi there,
I have a background dataset that does not include the sequences I am analyzing with MEME. The background is a very large sequence file, and the two datasets are of the same kind (upstream sequences), so I believe I am following all the rules you specify. My only doubt is that my background is human, while I am running two datasets in MEME (FIMO) against that background: one is also human, and the other is murine (the upstream sequences of the genes homologous to the human ones). Do you think it is a problem to use a human background for the murine dataset?
Thanks a lot in advance,
Mariano

CharlesEGrant

Dec 7, 2015, 7:25:56 PM
to MEME Suite Q&A, marian...@na.icar.cnr.it
Hi Mariano,

Murine nucleotide frequencies can differ from human, so using the same background model for both may not be ideal. Why not create two background models: one for your murine sequences, and another for your human sequences? On the other hand, the differences may not be significant enough to make a practical difference. If FIMO is finding unambiguous matches to your motifs, then your background model may be good enough. There is no hard and fast rule here. Essentially, you give it a try and see if it works.
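
A rough sketch of the two-model approach, assuming a file layout in which each species has SPECIES.fasta (the sequences to scan) and SPECIES.model (a background built separately with fasta-get-markov). The function name fimo_per_species and all file names are placeholders. --bgfile and --oc are FIMO options, and the 0th order model in the comments is just one reasonable choice.

```sh
# Build one background model per species first, for example:
#   fasta-get-markov -m 0 < human_background.fasta > human.model
#   fasta-get-markov -m 0 < mouse_background.fasta > mouse.model

# fimo_per_species MOTIFS SPECIES...: scan SPECIES.fasta with FIMO using the
# matching SPECIES.model, writing results to fimo_SPECIES/.
# Hypothetical wrapper and file layout.
fimo_per_species() {
    motifs=$1
    shift
    for species in "$@"; do
        fimo --bgfile "$species.model" --oc "fimo_$species" "$motifs" "$species.fasta"
    done
}

# Example use (placeholder file names):
#   fimo_per_species motifs.meme human mouse
```

Comparing these results with a run that uses the human model for both species is one way to see whether the difference matters in practice.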

Charles