Trying to find motifs in Gly-Xaa-Yaa protein sequences


Mark

Dec 17, 2016, 4:07:37 PM
to MEME Suite Q&A
I have protein sequences that have a glycine every third amino acid. I am trying to find if there is a pattern for the second or the third amino acids.
Example sequence would be GPTGPQGATGPRGEAGPQGPQGVT

I calculated the frequency of every possible Gly-Xaa-Yaa triplet (all 400) and generated data to be used as background/reference data. There is no pattern in that data, as I generated those triplets randomly, but it should nullify any patterns you automatically get from randomly assigning the common triplets like GPQ, GPT, GAT, etc.
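A minimal sketch of how such triplet frequencies could be tabulated, assuming every sequence starts in frame (Gly at positions 1, 4, 7, ...). The function names are hypothetical and this is not the script used in the post.

```python
from collections import Counter

def split_triplets(seq):
    """Split a sequence into consecutive, non-overlapping triplets (frame 0)."""
    usable = len(seq) - len(seq) % 3  # drop an incomplete trailing triplet
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def triplet_frequencies(sequences):
    """Return the relative frequency of each triplet that starts with Gly."""
    counts = Counter()
    for seq in sequences:
        # Keep only triplets that really start with glycine.
        counts.update(t for t in split_triplets(seq) if t.startswith("G"))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: n / total for t, n in counts.items()}

# The example sequence from the post.
freqs = triplet_frequencies(["GPTGPQGATGPRGEAGPQGPQGVT"])
for triplet, freq in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(triplet, round(freq, 3))
```

On the single example sequence, GPQ accounts for 3 of the 8 triplets; a real background would of course be built from the full data set.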

I also tried to generate the Markov background file.

But when I search for motifs, the motifs I get are basically Gly-Xaa-Yaa repeats, and some of the sequences that are found have zero identity apart from the glycine in the first position.

For example, I get GSTGSTGSTGSTGSTGST and GERGPRGKDGPQGPAGIQ that are aligning. It seems it scores the glycines anyway.
The motifs I get are just always G in the first position, and a mix of all the popular AAs for X and Y in the second and third. That is not a pattern. It seems that the actual patterns are combined because they share the G in the first position.
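The "zero identity besides the glycine" observation can be checked directly. The sketch below compares the two sequences quoted above, once over all positions and once over the Xaa/Yaa positions only; it assumes both sequences are in the same frame, and the function names are hypothetical.

```python
def identity_all(a, b):
    """Fraction of identical residues over all aligned positions."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def identity_without_frame_gly(a, b):
    """Fraction of identical residues at the Xaa/Yaa positions only.

    Assumes Gly sits at indices 0, 3, 6, ... in both sequences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    pairs = [(x, y) for i, (x, y) in enumerate(zip(a, b)) if i % 3 != 0]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

seq_a = "GSTGSTGSTGSTGSTGST"
seq_b = "GERGPRGKDGPQGPAGIQ"
print(identity_all(seq_a, seq_b))                # one third, all from the glycines
print(identity_without_frame_gly(seq_a, seq_b))  # 0.0
```

All of the apparent similarity between these two sites comes from the fixed glycines.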


I don't want the glycines to affect any of the E-values or p-values, as they are already fixed and contain zero information.

Are there any settings that allow me to fix this? I am considering removing all the glycines from the sequences, but then it will see RGE and GRE as identical, when they aren't.

CharlesEGrant

Dec 19, 2016, 9:04:53 PM
to MEME Suite Q&A
I don't think MEME is the appropriate tool for this analysis. MEME is looking for short sequences that are statistically overrepresented in your sequence data. By default, the shortest motif MEME considers is 8 positions wide. If I've understood your question correctly, you're only interested in a 'motif' 2-3 positions wide. In an 8-position motif the effect of the content of one or two positions is going to be completely washed out by the content of all the other positions. You could force MEME to consider only smaller motifs by setting MEME's '-minw' parameter to 2 and the '-maxw' parameter to 3, but unless your motif has a super strong signal it's unlikely that any motifs MEME finds would be statistically significant.
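For reference, a small sketch that assembles (but does not run) such a command line. The '-minw' and '-maxw' options are the ones named above; the '-protein' flag and the file name sequences.fasta are assumptions added for illustration.

```python
import shlex

def meme_short_motif_command(fasta_path, minw=2, maxw=3):
    """Build a MEME command line restricted to short motifs (not executed here)."""
    return [
        "meme", fasta_path,
        "-protein",          # assumption: the input is a protein FASTA file
        "-minw", str(minw),  # minimum motif width
        "-maxw", str(maxw),  # maximum motif width
    ]

cmd = meme_short_motif_command("sequences.fasta")  # hypothetical file name
print(shlex.join(cmd))
```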

Using MEME for this analysis seems like overkill. Have you considered just writing your own script to count and catalog all the 2-mers and 3-mers in your sequences that start with G? That might even be doable by hand, if your sequences aren't too long. You could then compare that table with the 2-mer and 3-mer counts you'd expect given your background sequence. You'd still need a pretty strong signal for a result to be statistically significant though.
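A script along those lines might look like the following sketch. It counts every k-mer that starts with G and reports a log2 ratio against counts from a background set. The pseudocount and the toy input strings are assumptions, and the ratio alone says nothing about statistical significance.

```python
import math
from collections import Counter

def count_g_kmers(sequences, k=3):
    """Count every overlapping k-mer that starts with G."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            if seq[i] == "G":
                counts[seq[i:i + k]] += 1
    return counts

def log2_enrichment(observed, background, pseudocount=1.0):
    """log2(observed frequency / background frequency) for each k-mer.

    A pseudocount is added to every k-mer so that k-mers missing from one
    of the two tables do not cause a division by zero.
    """
    kmers = set(observed) | set(background)
    obs_total = sum(observed.values()) + pseudocount * len(kmers)
    bg_total = sum(background.values()) + pseudocount * len(kmers)
    return {
        kmer: math.log2(
            ((observed[kmer] + pseudocount) / obs_total)
            / ((background[kmer] + pseudocount) / bg_total)
        )
        for kmer in kmers
    }

# Toy data only; real input would be the sequence set and the background set.
observed = count_g_kmers(["GPQGPQGPT"])
background = count_g_kmers(["GPQGPTGAT"])
scores = log2_enrichment(observed, background)
for kmer in sorted(scores, key=scores.get, reverse=True):
    print(kmer, round(scores[kmer], 2))
```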


Mark

Dec 20, 2016, 7:37:53 AM
to MEME Suite Q&A
No, I need 6-mers and more. You can see by eye it repeats with G every third time. The question is if you have 10 of these 'triplets', if there is a pattern to which triplets follow which, or which are grouped together.
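One simple way to look at "which triplets follow which" outside of MEME is to compare the observed count of each adjacent triplet pair with the count expected if neighbouring triplets were independent. The sketch below does that; it assumes in-frame sequences, uses hypothetical function names, and does not assess significance.

```python
from collections import Counter

def triplets(seq):
    """Split a sequence into consecutive, non-overlapping triplets (frame 0)."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def adjacent_pair_ratios(sequences):
    """Observed / expected count for each pair of neighbouring triplets.

    The expected count assumes each triplet is drawn independently of its
    predecessor, using the triplet frequencies of the same data set.
    """
    single = Counter()
    pairs = Counter()
    for seq in sequences:
        ts = triplets(seq)
        single.update(ts)
        pairs.update(zip(ts, ts[1:]))  # (triplet, following triplet)
    n_single = sum(single.values())
    n_pairs = sum(pairs.values())
    ratios = {}
    for (a, b), obs in pairs.items():
        expected = n_pairs * (single[a] / n_single) * (single[b] / n_single)
        ratios[(a, b)] = obs / expected
    return ratios

ratios = adjacent_pair_ratios(["GPTGPQGATGPRGEAGPQGPQGVT"])
for pair, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(pair, round(ratio, 2))
```

A ratio well above 1 on a large data set would suggest that the two triplets occur next to each other more often than their individual frequencies explain.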

I already know the propensities of each triplet. That's how I was able to generate a background file.

I found out that the fasta-get-markov output was truncated when I generated it on my SSD. On my normal HDD, I was able to generate a file with words of up to 6, and the results seem somewhat better.
It seems the complete Markov model improved the results somewhat. I ran the algorithm over the weekend, but it had 50 locations as a max.

The current problem is turning on parallel computing. It doesn't autodetect whatever parallel computing library I install, and when I point the makefile to the correct directory using --with-mpidir=[dir], it crashes because the makefile calls 'DHAVE_CONFIG [stuff]' rather than 'gcc -DHAVE_CONFIG [stuff]'.

Maybe openmpi aka mpirun is not the right package? Or makefile is bugged?

CharlesEGrant

Dec 20, 2016, 6:42:58 PM
to MEME Suite Q&A
I'm afraid I'm still not understanding what you are trying to force MEME to do, and whether it's an appropriate tool for your task.

MEME uses heuristics to guess a motif that might be present, and where instances of the motif are located. It then evaluates the match between the guessed motif and its instances by taking the log likelihood ratio of the instance assuming the motif vs the instance assuming the background model. It then modifies the motif and locations to improve those scores and repeats the process. There is no way to tell MEME not to consider certain positions when calculating the log likelihood score.
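To make the scoring step concrete, here is a toy log likelihood ratio calculation for a single site. It is only a sketch with a 0-order background and invented probabilities, not MEME's actual implementation; the motif, the background values, and the function name are all assumptions.

```python
import math

def site_llr(site, motif, background):
    """Per-position log2 likelihood ratios: motif model vs. 0-order background."""
    return [
        math.log2(column[aa] / background[aa])
        for aa, column in zip(site, motif)
    ]

# Toy 3-column motif: Gly is fixed, the other two columns are mixtures.
motif = [
    {"G": 1.0},
    {"P": 0.5, "A": 0.5},
    {"Q": 0.5, "T": 0.5},
]
# Toy background letter frequencies (sum to 1 over this reduced alphabet).
background = {"G": 0.25, "P": 0.25, "A": 0.125, "Q": 0.25, "T": 0.125}

llr = site_llr("GPQ", motif, background)
print(llr, sum(llr))
```

In this toy case the fixed glycine column contributes half of the total score, which is the effect described in the thread: every column counts, including one that carries no information about the Xaa/Yaa pattern.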

Note that there can be issues when using anything more than a 2nd order background model for AA with MEME. See Background Models for MEME for details. 

Are you using the current version of the MEME Suite, 4.11.2 patch 2? I wasn't able to reproduce your problem with the '--with-mpidir' flag. If you can let us know which version you are using and the exact configure command line you used, we can try to troubleshoot the problem. It would also be helpful if you could attach the 'configure' script.

Mark

Dec 21, 2016, 12:23:18 PM
to MEME Suite Q&A
Manually installing openmpi seems to have fixed the problem. For some reason, it wasn't recognizing the apt-get installed openmpi, and using the directory flag gave an error that seems to be related to it not being installed correctly. Maybe it gives that error every time you give an incorrect mpidir flag? It just seemed odd, since the directory was there and the error came from running a command where the 'gcc -' part had obviously disappeared.

I downloaded meme_4.11.2

I want to use at least 6th order, because I want to look how groups of 3 amino acids are arranged. I already know how common the 3-words are. I want to know how they are connected.

I now have MEME installed with parallel support. But the new problem is that when using a 6th order background model with multiple processors, MEME uses so much memory that it gets killed:
'mpirun noticed that process rank1 with PID 0 on node mark exited on signal 9 (killed)'

So far, it seems to have enough memory to give 3 processors the 24 gig 6th order background model, without it running out of memory. I guess 3 is better than using 1.

Do you want to look into the error I worked around? I am using linux mint 18.1 and I can imagine it would be nice if users with Mint (and Ubuntu/Debian maybe) can just use apt-get to install openmpi and get MEME working.

As for my problem, I think I found the right tool in MEME. I have 3-words that all start with glycine and I want to know if they form motifs that are 9, 12, 15-word. Without a background, it will recognise all my sequences as a motif, because they all have G every third AA.

All I seem to be missing is the ability to run on all 8 of my processors using the 24 gig background. But maybe that just requires 192 gig of memory, by definition?
Otherwise, I'll revert to 1 processor, if memory is the bottleneck.
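The arithmetic behind that guess, as a sketch. It assumes that every MPI process holds its own full copy of the 24 GB background model, which the thread does not confirm; the 80 GB machine is a hypothetical example.

```python
ALPHABET_SIZE = 20  # standard amino acids

def longest_tuple_count(order):
    """Number of distinct (order + 1)-letter words an order-k model covers."""
    return ALPHABET_SIZE ** (order + 1)

def total_memory_gb(per_process_gb, n_processes):
    """Total memory if each process keeps a private copy of the model."""
    return per_process_gb * n_processes

def max_processes(available_gb, per_process_gb):
    """How many private copies fit into the available memory."""
    return int(available_gb // per_process_gb)

print(longest_tuple_count(6))   # 7-letter words for a 6th order model
print(total_memory_gb(24, 8))   # 8 processes with 24 GB each
print(max_processes(80, 24))    # hypothetical machine with 80 GB of RAM
```

Under that assumption, 8 processes would indeed need 8 x 24 = 192 GB.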

One thing I am missing, though, is an option to only find motifs that occur n times in the same sequence.



Mark

Dec 27, 2016, 9:22:39 PM
to MEME Suite Q&A
Hmm, I seem to run into several different issues.

When I use a Markov background of 1 on my normal data, I get several hits that always get the maximum number of sites. They are just a bunch of GxyGxyGxy where, in x and y, several amino acids are stacked, so the information content is low. It just aligns almost everything that follows GxyGxyGxy, which is 99% of the data. And because of the G's, that gets a very low E value.

So I want to use a Markov background of 3-words. That way, I think the fact that something follows the Gxy pattern means nothing, as 99% of all the 3-words are Gxy as well.
But using that setting, I get 3 hits at or near the maximum of 500 sites (there are 336 sequences in the data file, but I am using anr mode of course, as repeats within the same sequence are very important). All the other hits are made up of only 2 sequences being aligned that have little in common except for the G's. I am sure there are several motifs in the dataset that are identical in at least 10 or so sequences, but they are not found using the 3-word background. I don't know why it is aligning dissimilar sequences and then giving them a high E score.

I figured out, though, that I can replace all G's by X's and still align accurately. It seems that works best, as I am looking for the pattern in the x and y positions. But then I have to manually adjust all the logos to get nice images.
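A sketch of one way to do the masking. Note that this variant only replaces the glycine at the first position of each triplet and leaves any glycine at an Xaa or Yaa position alone, which differs from replacing all G's as described above; it assumes the sequence starts in frame, and the function name is hypothetical.

```python
def mask_frame_glycines(seq, mask="X"):
    """Replace the glycine at the first position of each triplet with `mask`.

    Assumes Gly sits at indices 0, 3, 6, ... Glycines at Xaa or Yaa
    positions are left untouched.
    """
    return "".join(
        mask if i % 3 == 0 and aa == "G" else aa
        for i, aa in enumerate(seq)
    )

print(mask_frame_glycines("GPTGPQGATGPRGEAGPQGPQGVT"))
print(mask_frame_glycines("GRGGPQ"))  # the G in the Yaa position is kept
```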

I just want to find the 20 or so most common motifs that make up my dataset.

Mark

Dec 28, 2016, 7:02:40 AM
to MEME Suite Q&A
I see that the G's are causing another issue. If I use a high number for maxsites, the first motif gets a very high site number. But the motif is just a G every time, and then something else in the second or third position. And then all those sites are assigned to motif 1 and ignored for any further searches.

Seems I need to replace the G's by X's for sure.

CharlesEGrant

Dec 29, 2016, 5:42:00 PM
to MEME Suite Q&A
> One thing I am missing, though, is an option to only find motifs that occur n times in the same sequence.

MEME only offers three models for motif occurrences: OOPS - exactly one occurrence per sequence, ZOOPS - zero or one occurrence per sequence, and ANR - any number of repetitions per sequence. There is no option to find motifs that occur n times per sequence.
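Since there is no such option, one possible workaround is to run in ANR mode and filter the reported sites afterwards. The sketch below assumes the sites of one motif have already been extracted into a list of (sequence_name, start_position) tuples; that in-memory format and the function name are hypothetical, and parsing MEME's output is not shown.

```python
from collections import Counter

def sequences_with_min_sites(sites, n):
    """Return the names of sequences that contain at least n sites of a motif.

    `sites` is a list of (sequence_name, start_position) tuples for a
    single motif (a hypothetical in-memory format).
    """
    per_sequence = Counter(name for name, _ in sites)
    return sorted(name for name, count in per_sequence.items() if count >= n)

# Toy site list for one motif.
sites = [("seq1", 4), ("seq1", 31), ("seq2", 10), ("seq1", 58)]
print(sequences_with_min_sites(sites, 3))
```

A motif could then be kept only if enough sequences pass the filter. Changing `>=` to `==` would select sequences with exactly n sites instead.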