Questions about dat/jones.dat file (amino acid replacement model, AKA JTT Model)

Fei Yuan

Apr 13, 2017, 2:58:50 PM
to PAML discussion group
Dear All,

I am trying to use the JTT model in codeml. It seems the relevant file is dat/jones.dat. However, when I looked at this file and the files for other amino acid models (attached below), I found some things I cannot understand. I hope someone can help me.

In jones.dat, lines 1-24 seem to show the symmetrical part of the rate matrix and the amino acid frequencies, with the rates given as absolute counts. In wag.dat and lg.dat, however, the rates are given as relative decimal values. My first question is therefore: do all these rate matrices have the same meaning? The wag.dat and lg.dat files clearly state that the matrix is not scaled and will be scaled by PAML, but jones.dat says nothing about this. Can its matrix be scaled by the same method? After scaling, can I use the scaled matrix to derive the Q matrix (which I think should be called the transition rate matrix) using the formula Q_ij = s_ij*pi_j?

In jones.dat, lines 42-94 do not seem to be used by PAML, but I would really like to know what they are and why one is an asymmetrical matrix while the other is a symmetrical matrix. Can both be used in some situation to derive the JTT model?

In jones.dat, lines 97-158 seem to show a transition probability matrix and the stationary frequencies of the 20 amino acids. Since all rows add up to 1, it should be a matrix denoted P, not the Q mentioned above, whose rows must add up to 0 and whose diagonal elements should be negative after scaling. Is this what this part is meant to show?

dat/wag.dat and dat/lg.dat both contain only the symmetrical part of the rate matrix and the amino acid frequencies. How can I get a transition probability matrix like the one shown in jones.dat (lines 97-159)?

Thanks for your consideration. 

FEI
jones.dat
lg.dat
wag.dat

Ziheng

Apr 28, 2017, 12:17:48 PM
to PAML discussion group

Your descriptions are correct.
All those rate matrices are scaled by codeml before they are used, so the absolute values do not matter; only the relative values do. Q is used to calculate P(t) = exp(Qt), where the time or branch length t is measured by the expected number of substitutions per site.
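The scaling described above can be sketched in a few lines. This is a minimal illustration, not codeml's actual code: it assumes the usual convention q_ij = s_ij * pi_j for i != j, rows summing to 0, and a rescaling so that the expected rate -sum_i pi_i * q_ii equals 1. The 3-state numbers and the name `build_scaled_q` are made up for the example and are not taken from jones.dat.

```python
def build_scaled_q(s, pi):
    """Build Q from symmetric exchangeabilities s and frequencies pi.

    Off-diagonals are q_ij = s_ij * pi_j, each diagonal makes its row sum
    to 0, and the matrix is rescaled so that -sum_i pi_i * q_ii == 1.
    """
    n = len(pi)
    q = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 when the row is summed
    rate = -sum(pi[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


# Toy 3-state example with hypothetical numbers.
s_counts = [[0, 58, 54], [58, 0, 45], [54, 45, 0]]           # count-like values
s_decimal = [[x / 100.0 for x in row] for row in s_counts]   # same ratios
pi = [0.5, 0.3, 0.2]

q_a = build_scaled_q(s_counts, pi)
q_b = build_scaled_q(s_decimal, pi)

# Both inputs give the same scaled Q (up to rounding), which is why a file
# may store counts or decimals interchangeably.
max_diff = max(abs(q_a[i][j] - q_b[i][j]) for i in range(3) for j in range(3))
print("largest difference between the two scaled matrices:", max_diff)
```

Because the rescaling divides out any common factor, a file that stores counts and a file that stores decimal exchangeabilities lead to the same Q as long as the ratios among entries agree.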
In the jones.dat file, codeml reads only the first block, the triangular matrix, plus the amino acid frequencies if the model is JTT (those are not used if the model is JTT+F). The rest of the file is my notes and is not read by the program. They were probably generated for the paper cited below, or earlier. The notes should be self-explanatory; if they are not, you can take a look at the paper.
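To illustrate the JTT versus JTT+F distinction, here is a small sketch. It assumes that "+F" means the frequencies in the file are replaced by the proportions observed in the data while the exchangeabilities stay the same. The three-letter alphabet, the sequences, and the helper names are hypothetical and chosen only to keep the example short.

```python
from collections import Counter


def observed_frequencies(sequences, alphabet):
    """Proportion of each residue in the alignment (other characters ignored)."""
    counts = Counter(ch for seq in sequences for ch in seq if ch in alphabet)
    total = sum(counts.values())
    return [counts[a] / total for a in alphabet]


def offdiag_rates(s, pi):
    """Unscaled off-diagonal rates q_ij = s_ij * pi_j (diagonal left at 0)."""
    n = len(pi)
    return [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)]
            for i in range(n)]


alphabet = "ARN"                                  # toy alphabet, not all 20
s = [[0, 58, 54], [58, 0, 45], [54, 45, 0]]       # hypothetical exchangeabilities
pi_file = [0.5, 0.3, 0.2]                         # frequencies "from the file"
pi_data = observed_frequencies(["AARNN", "ARRNN"], alphabet)

q_plain = offdiag_rates(s, pi_file)   # analogue of JTT
q_plus_f = offdiag_rates(s, pi_data)  # analogue of JTT+F

print("frequencies from data:", pi_data)
print("rate A->N, file frequencies:", q_plain[0][2])
print("rate A->N, data frequencies:", q_plus_f[0][2])
```

Only the columns whose target frequency changed end up with different rates; the exchangeabilities themselves are shared by both variants.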
For example, the notes say that the big square matrix near the end of the file is P(0.01). This should be very close to I + Q*0.01, because 0.01 is close to 0. Here again Q is the scaled rate matrix. For an arbitrary t, P(t) is calculated using the spectral decomposition. If you can find a copy of Yang (2014, Molecular Evolution: A Statistical Approach), this is explained on pages 10 and 65.
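The relationship between P(t) and I + Q*t can be checked numerically. The sketch below is one common way to do the spectral decomposition for a reversible Q, and is not necessarily how codeml implements it: the matrix is symmetrized as B = Pi^{1/2} Q Pi^{-1/2}, B is diagonalized with NumPy's `numpy.linalg.eigh`, and the result is transformed back. The toy 3-state numbers and the function names are assumptions made for illustration.

```python
import numpy as np


def scaled_q(s, pi):
    """Q with q_ij = s_ij * pi_j, zero row sums, and expected rate 1."""
    q = s * pi                            # broadcasting multiplies column j by pi_j
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))   # row sums become 0
    return q / -(pi @ np.diag(q))         # divide by -sum_i pi_i * q_ii


def transition_matrix(q, pi, t):
    """P(t) = exp(Qt) for a reversible Q, via a symmetric eigendecomposition."""
    root = np.sqrt(pi)
    b = q * root[:, None] / root[None, :]  # b_ij = sqrt(pi_i) * q_ij / sqrt(pi_j)
    b = (b + b.T) / 2.0                    # remove rounding-level asymmetry
    lam, u = np.linalg.eigh(b)
    exp_b = (u * np.exp(lam * t)) @ u.T    # U diag(exp(lam * t)) U^T
    return exp_b / root[:, None] * root[None, :]


# Toy 3-state example with hypothetical numbers (not taken from jones.dat).
s = np.array([[0, 58, 54], [58, 0, 45], [54, 45, 0]], dtype=float)
pi = np.array([0.5, 0.3, 0.2])

q = scaled_q(s, pi)
p_small = transition_matrix(q, pi, 0.01)
approx = np.eye(3) + 0.01 * q

print("row sums of P(0.01):", p_small.sum(axis=1))
print("max |P(0.01) - (I + 0.01 Q)|:", np.abs(p_small - approx).max())
```

The same function also satisfies P(s + t) = P(s) P(t), so multiplying P(0.01) by itself n times gives P(0.01 * n).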

Yang, Z., R. Nielsen and M. Hasegawa. 1998. Models of amino acid substitution and
applications to mitochondrial protein evolution. Mol. Biol. Evol. 15:1600-1611.

杜康

Jan 31, 2019, 8:36:09 AM
to PAML discussion group
Dear Prof. Ziheng,

I am trying to learn how to calculate the probability "Pij(b)" that amino acid i changes to j along a branch of length b. Some questions about dat/jones.dat came up along the way, and I think it is best to ask them here.

1) As mentioned by FEI, lines 97-158 seem to be a transition probability matrix (let's call it JTTm). However, instead of transforming vectors from column space to row space, this matrix should be used the other way around (transforming vectors from row space to column space). Am I right?

2) From the codeml output, I got the rate r for each site from the file "rates" and the branch length b from the file "mlc". When I calculate Pij(b), should I raise the transition probability matrix to the power r*b*10000? (Actually I am not sure about the 10000 here. I got it from a paper. Why 10000?)

3) Also in that paper, the authors derive a new matrix from JTTm based on the observed amino acid frequencies πj using the formula NEWm = JTTm*πj/JTTπj. I am confused here. I understand from your paper "Estimating the Pattern of Nucleotide Substitution" that the rate matrix Q of a reversible homogeneous Markov process can be decomposed into "rate parameters" and "frequency parameters", but JTTm is a transition probability matrix rather than a rate matrix Q, right? Or is this to say that the transition probability matrix can also be decomposed into "frequency parameters" and something else?

              Sincerely,
                       Kang


Ziheng

Mar 30, 2019, 6:23:48 AM
to PAML discussion group
I think Jones et al. used the same approach as Dayhoff et al. to derive their substitution rate matrix, mostly based on parsimony reconstruction of ancestral sequences and then counting differences. The counts are made to be symmetrical, which I think is equivalent to requiring the Q matrix to be time reversible. One can make that process a bit more sophisticated, but I do not suppose it will make a big difference. The following papers discuss some of those issues:

Yang Z, Kumar S. 1996. Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rates among sites. Mol Biol Evol 13:650-659.

Kosiol C, Goldman N. 2005. Different versions of the Dayhoff rate matrix. Mol Biol Evol 22:193-199.

The ML method for estimating the Q matrix from a sequence alignment is described in the following papers:

Yang Z. 1994. Estimating the pattern of nucleotide substitution. J Mol Evol 39:105-111.
Adachi J, Hasegawa M. 1996. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol 42:459-468.
Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum likelihood approach. Mol Biol Evol 18:691-699.

Regarding the difference between Q and P(t): when t is small, the off-diagonal elements of Q and P(t) are more or less the same (proportional).
You can read about the general theory in chapter 1 of
Yang Z. 2014. Molecular Evolution: A Statistical Approach. Oxford University Press, Oxford, England.

ziheng
