"Error: too many ambiguity codons in the data."


Nathan Galloway

Jul 21, 2014, 5:32:38 PM
to pamlso...@googlegroups.com
I am attempting to run a codeml analysis on a large data set containing a single CDS of 777 nucleotides across 135 species. I have collapsed within-species SNPs using ambiguity codes.

The data is read in and then the log file reads:
"Error: too many ambiguity codons in the data.  Contact author."

What's the best approach to this problem? Is my data set simply too large, or is it the fact that I have used ambiguity codes to collapse by species?

Thanks for your time,
Nathan

Ziheng

Aug 3, 2014, 4:41:49 PM
to pamlso...@googlegroups.com
135 species or sequences are not too many. The error has to do with the fact that I wrote the program to accommodate at most 256 (I think) distinct codons, including TTT, TTC, ..., GGG, TYY, NGG, etc. If you used lots of ambiguity codes, you may have reached the limit.
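To see how close an alignment is to that limit, one could count the distinct codons it contains. The following is only a minimal sketch: the function names and the toy alignment are hypothetical, the sequences are assumed to be in-frame DNA held in a dict, and the figure of 256 is the one quoted above with an "I think", not a value checked against the codeml source.

```python
from typing import Dict, Set

# Limit as quoted in the reply ("256 (I think)"); an assumption, not verified.
ASSUMED_LIMIT = 256


def distinct_codons(seqs: Dict[str, str]) -> Set[str]:
    """Collect every distinct triplet seen across all sequences."""
    codons: Set[str] = set()
    for name, seq in seqs.items():
        seq = seq.upper()
        if len(seq) % 3 != 0:
            raise ValueError(f"{name}: length {len(seq)} is not a multiple of 3")
        for i in range(0, len(seq), 3):
            codons.add(seq[i:i + 3])
    return codons


def non_standard_codons(codons: Set[str]) -> Set[str]:
    """Triplets containing anything other than A, C, G, T (ambiguity codes, gaps)."""
    return {c for c in codons if set(c) - set("ACGT")}


# Hypothetical toy alignment: three species, three codons each.
alignment = {
    "sp1": "TTTTAYGGY",
    "sp2": "TTCTACNGG",
    "sp3": "TTTTATGGT",
}

found = distinct_codons(alignment)
odd = non_standard_codons(found)
print(f"{len(found)} distinct codons, {len(odd)} with ambiguity codes or gaps")
if len(found) > ASSUMED_LIMIT:
    print("Above the assumed limit; the error is likely to appear.")
```

Running this on the real alignment would show whether the ambiguity codes, rather than the number of species, are what pushes the count up.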
One thing you can do is sample a sequence from each species at random, rather than using ambiguity codes to represent polymorphisms. You could then analyze your polymorphism data separately and try to integrate the results.
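The random sampling could be sketched as follows. This assumes the original per-individual sequences are still available, grouped by species in a dict; the function name, species labels, and sequences are hypothetical.

```python
import random
from typing import Dict, List, Optional


def sample_one_per_species(
    seqs_by_species: Dict[str, List[str]],
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Pick one observed sequence per species, uniformly at random.

    A fixed seed makes the draw reproducible. An empty list for a
    species raises IndexError from random.Random.choice.
    """
    rng = random.Random(seed)
    return {sp: rng.choice(seqs) for sp, seqs in seqs_by_species.items()}


# Hypothetical input: every sequence observed within each species.
individuals = {
    "species1": ["TATGGT", "TACGGC"],
    "species2": ["TATGGT"],
}

sampled = sample_one_per_species(individuals, seed=1)
for sp, seq in sampled.items():
    print(sp, seq)
```

Repeating the draw with several different seeds and comparing the codeml results would give a rough sense of how sensitive the analysis is to which individual is chosen.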
The program is not designed to deal with polymorphism data, so even if you code the polymorphic codons TTT and TTC as TTY, the program will be misinterpreting the data. It will treat TTY as meaning that the species has a codon TT?, where ? is T or C but definitely not both and we don't know whether it is T or C. But in fact you know both T and C are observed.
Ziheng

Nathan Galloway

Aug 4, 2014, 1:32:59 PM
to pamlso...@googlegroups.com
Thank you so much for the reply. This really clarifies for me just how the program deals with those ambiguity codes.

I am wondering about an alternative approach. Perhaps I could take each species sequence and divide it into two (or three) separate sequences showing the observed haplotypes. For example, if the sequence for species 1 is TAY GGY, I could double the number of sequences:

species 1A: TAT GGT
species 1B: TAC GGC

with the tree now showing a small sub-tree to accommodate within-species variation; "(species 1A, species 1B)".

It seems that I could do this at many codon sites because PAML does not consider site-by-site interaction. In other words, I wouldn't need to worry about the other two possible sequences from my example: TAC GGT and TAT GGC.
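The splitting described above could be sketched like this. It is only an illustration of the idea, not something PAML provides: the function name and the A/B/C suffixes are hypothetical, only the two- and three-base IUPAC codes are expanded, and N and gap characters are left untouched on the assumption that they mean "unknown" rather than "polymorphic".

```python
from typing import Dict

# Standard IUPAC codes for two or three possible bases.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def split_haplotypes(name: str, seq: str) -> Dict[str, str]:
    """Expand one ambiguity-coded sequence into pseudo-haplotypes.

    The number of haplotypes equals the largest number of alleles at any
    ambiguous site. Haplotype k takes allele k at each ambiguous site,
    cycling when a site has fewer alleles, so every allele listed by a
    code appears in at least one output. The pairing of alleles across
    sites is arbitrary.
    """
    seq = seq.upper()
    n_haps = max((len(IUPAC[b]) for b in seq if b in IUPAC), default=1)
    haps = {}
    for k in range(n_haps):
        resolved = "".join(
            IUPAC[b][k % len(IUPAC[b])] if b in IUPAC else b for b in seq
        )
        haps[f"{name}{'ABC'[k]}"] = resolved
    return haps


result = split_haplotypes("species1", "TAYGGY")
for label, hap in result.items():
    print(label, hap)
```

Which allele ends up in which haplotype is arbitrary here, so the labels may be swapped relative to the 1A/1B example. The arbitrary pairing is only harmless across codons: when two ambiguity codes fall within the same codon, the pairing determines which codons are produced, so the phase within a codon would need to come from the observed data.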

Does this seem like a reasonable approach?

Again, thank you so much for your time and expertise.
Nathan