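# Options tried separately for discriminative mode: -neg $negativeSeqs -objfun de
# (kept out of the command itself, because a comment line inside a
# backslash-continued command ends the command at that point)
meme $sequences -dna -oc "$outdir" \
-bfile $markov_model \
-evt 0.01 \
-nostatus \
-time 18000 \
-mod anr \
-nmotifs 15 \
-maxw 30 \
-revcomp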
ame -oc "$outdir/ame" \
--scoring totalhits \
--control $negativeSeqs \
--bfile $markov_model \
$sequences \
"$outdir/meme.txt" \
$motif_databases/*
My first question relates to the `-objfun` option in MEME (v5). If I run with `-neg $negativeSeqs` and specify `-objfun de`, does MEME look for enrichment of the de novo discovered motifs relative to my background sequences in a way comparable to the approach AME uses? If so, can I use this in place of AME, or does AME provide a better estimate of enrichment?
Secondly, I would like to see the raw counts of motif occurrences in both my $sequences and my $negativeSeqs, so that I can assess the results I get from AME more clearly. Here is a line from my ame.html output:
meme | KTTTDTWTTTKTTTTTKTTTTTTTMAYTT | MEME-1 | 1.15e-5 | 1.79e-2 | 1.00 | 28 (34.1%) | 217 (10.9%) |
I also wanted to ask whether I can omit the $markov_model entirely from both commands, seeing as I provide $negativeSeqs, which are a better reflection of the k-mer frequencies in my $sequences than the whole genome I used to generate the $markov_model. How does the order of the Markov model influence motif enrichment?
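On the raw-counts question, one crude cross-check outside AME is to count how many sequences in each FASTA file contain a literal match to the motif's IUPAC consensus. The sketch below is only that: the function names `iupac_to_regex` and `count_seqs_with_motif` are made up for illustration, it scans the forward strand only, and a consensus match is not the same thing as the PWM scoring AME performs, so the numbers will not reproduce AME's columns.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the MEME Suite).

# Turn an IUPAC DNA consensus into a bracket-expression regex.
iupac_to_regex() {
  printf '%s' "$1" | tr 'a-z' 'A-Z' | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Print "<sequences with at least one match><TAB><total sequences>".
# Forward strand only; multi-line FASTA records are joined before matching.
count_seqs_with_motif() {
  local fasta=$1 regex
  regex=$(iupac_to_regex "$2")
  awk -v re="$regex" '
    function tally() { if (have) { total++; if (toupper(seq) ~ re) hits++ } }
    /^>/ { tally(); seq = ""; have = 1; next }
    { seq = seq $0 }
    END { tally(); printf "%d\t%d\n", hits, total }
  ' "$fasta"
}

# Usage (file names are placeholders):
#   count_seqs_with_motif "$sequences"    KTTTDTWTTTKTTTTTKTTTTTTTMAYTT
#   count_seqs_with_motif "$negativeSeqs" KTTTDTWTTTKTTTTTKTTTTTTTMAYTT
```

Comparing the two hit/total pairs gives a rough feel for how often the motif turns up in each set, which can then be set against the counts and percentages AME reports.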
Hi Charles,
Thanks a lot for your very helpful and detailed response.
We are looking for motifs surrounding particular loci (initially ±100 bp from each locus) that are found genome-wide (excluding unmappable regions). The repetitive motif is actually one that we regularly find in sequences surrounding these loci, so we are interested in it. As long as MEME doesn't have any bias towards finding this sort of highly repetitive motif, we don't want to exclude it.
To then test whether these motifs are indeed enriched at these loci (rather than just being very abundant in the genome) we are essentially extracting sequences of the same length from randomly generated positions in the mappable genome. One thing to note is that we simulate many more control sequences (10,000) than we have in our test sequences (~200), to give us a better approximation of the background nucleotide frequencies.
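As an illustration of this kind of control sampling, here is a minimal sketch rather than the actual pipeline. It assumes a two-column chromosome sizes file (name, length), and the function name `random_intervals` is hypothetical. It draws windows of 2 x half-width + 1 bp at random positions, weighting chromosomes by length, and writes BED intervals. It does not exclude unmappable regions, which would need an extra filtering step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: random_intervals SIZES_FILE N HALF_WIDTH [SEED]
# SIZES_FILE has two whitespace-separated columns: chromosome name, length.
# Prints N BED intervals (0-based, half-open), each 2*HALF_WIDTH + 1 bp long.
random_intervals() {
  awk -v n="$2" -v w="$3" -v seed="${4:-1}" '
    BEGIN { srand(seed); OFS = "\t" }
    { name[NR] = $1; len[NR] = $2; cum[NR] = cum[NR - 1] + $2 }
    END {
      made = 0
      # The cap on tries avoids looping forever if no window can fit.
      while (made < n && tries++ < 1000 * n) {
        x = rand() * cum[NR]                 # genome-wide coordinate
        for (i = 1; i < NR && x >= cum[i]; i++) ;
        centre = int(x - cum[i - 1])         # position within chromosome i
        if (centre - w < 0 || centre + w + 1 > len[i]) continue
        print name[i], centre - w, centre + w + 1
        made++
      }
    }' "$1"
}

# Usage (file names are placeholders):
#   random_intervals genome.sizes 10000 100 42 > controls.bed
```

The resulting BED file can then be turned into FASTA with something like `bedtools getfasta -fi genome.fa -bed controls.bed -fo controls.fa`, where the file names are again placeholders.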
Given your response above, I am now using these control sequences to construct a 0-order background (rather than the whole genome that I was using previously), which I now supply to MEME for de novo motif discovery in our test sequences ($sequences). I have also removed the -evt option, as you suggested:
meme $sequences -dna -oc "$outdir" \
-bfile $markov_model \
-time 18000 \
-mod anr
This gives me some motifs, which I then use as input to AME, along with my test sequences ($sequences) and my control sequences ($control_seqs), to test for enrichment:
ame -oc "$outdir/ame" \
--scoring totalhits \
--control $control_seqs \
--bfile $markov_model \
$sequences \
"$outdir/meme.txt" \
$motif_databases/*
Is it appropriate to use the same background for both MEME and AME? In this example, the background is also created from the control sequences. Is this the recommended approach?
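To be explicit about what "0-order background" means here: it is just the single-nucleotide frequencies of the control sequences. The MEME Suite ships `fasta-get-markov` for this (with `-m 0` for order 0). The awk sketch below only illustrates the calculation. The function name `zero_order_background` is hypothetical, the output follows the simple "letter frequency" layout that I understand `-bfile` to accept, and it counts the forward strand only, which may differ from what `fasta-get-markov` does by default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: zero_order_background FASTA
# Prints A/C/G/T frequencies, ignoring header lines and non-ACGT symbols.
zero_order_background() {
  awk '
    /^>/ { next }                            # skip FASTA headers
    {
      s = toupper($0)
      for (i = 1; i <= length(s); i++) {
        b = substr(s, i, 1)
        if (b ~ /[ACGT]/) { count[b]++; total++ }
      }
    }
    END {
      if (total == 0) exit 1                 # nothing to count
      print "# order 0"
      split("A C G T", bases, " ")
      for (i = 1; i <= 4; i++)
        printf "%s %.4f\n", bases[i], count[bases[i]] / total
    }' "$1"
}

# Usage (file names are placeholders):
#   zero_order_background "$control_seqs" > background_order0.txt
```

A higher-order model would additionally tabulate dinucleotide, trinucleotide and longer k-mer frequencies, which is where the choice of order starts to matter for repetitive motifs like the T-rich one above.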
Seeing as there's no pattern per se to the genomic distribution of the loci that we are looking at, I think that it will be difficult to generate a set of background sequences that we believe not to contain motifs. In this case, maybe it's better to use the default and generate a background from shuffled input?
Many thanks,
Nick