Alphabet states and Model states

Julien Yann Dutheil

Aug 11, 2014, 2:46:22 AM

to biopp-de...@googlegroups.com

Dear all,

Here is a small post to clarify some ambiguities in the current version of the libraries, which I'm trying to correct for the next release.

The ambiguity concerns the use of the term "state". On the one hand, we call "AlphabetState" the possible values a sequence position can take (as defined in the Alphabet classes). These states can be "letters" in simple alphabets, or "words" in more complex cases (e.g. codons).

We also refer to "states" in the case of substitution models. Here the states are the ones that the Markov chain can take.

In the case of alphabets, states are coded by an integer value, with a character equivalence: state 0 is A for DNA alphabets, -1 is gap etc.

For substitution models, states are not explicitly coded, but are referred to by indices: basically, the position in the generator matrix of the model.

Historically (and this goes back to the Semphy library...), "canonical" states such as A, C, G and T are coded as 0 to 3, and this code was used as their ranking in the generator matrix. The equivalence between alphabet states and model states is simply done by a static_cast. This is of course unsatisfying!
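To make the shortcut concrete, here is a minimal sketch of what the cast amounts to. The constant and function names are hypothetical and not part of the libraries; only the codes A = 0 to T = 3 and gap = -1 come from the description above.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-ins for DNA alphabet codes (int values, as in the Alphabet classes).
constexpr int kA = 0, kC = 1, kG = 2, kT = 3, kGap = -1;

// The historical shortcut: reuse the alphabet code as the generator-matrix index.
constexpr std::size_t naiveModelIndex(int alphabetState) {
  return static_cast<std::size_t>(alphabetState);
}

// Works by coincidence for the canonical states...
static_assert(naiveModelIndex(kA) == 0, "A maps to row 0");
static_assert(naiveModelIndex(kT) == 3, "T maps to row 3");

// ...but a gap (-1) silently wraps around to the largest size_t value,
// which is not a valid row of a 4x4 generator matrix.
static_assert(naiveModelIndex(kGap) == std::numeric_limits<std::size_t>::max(),
              "gap wraps around");
```

The cast compiles without any warning in both cases, which is why the mismatch is easy to miss.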

Problems occur for instance with covarion models: here a model state is a combination of an alphabet state and a rate class. To cope with this, the SubstitutionModel interface has a method (almost never used)
std::vector<size_t> getModelStates(int i)
which, in the case of covarion models, returns a vector whose size equals the number of rate classes.
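As an illustration of the one-to-many relation, here is a toy sketch of such a mapping. The struct name and the index layout (rate class major, alphabet state minor) are assumptions made for the example; the actual covarion classes may order their states differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy mapping between alphabet states and model states for a covarion-like model.
// Assumed layout: modelIndex = rateClass * nAlphabetStates + alphabetState.
struct ToyCovarionStateMap {
  std::size_t nAlphabetStates;
  std::size_t nRateClasses;

  std::size_t numberOfModelStates() const { return nAlphabetStates * nRateClasses; }

  // One alphabet state corresponds to one model state per rate class.
  std::vector<std::size_t> getModelStates(int alphabetState) const {
    assert(alphabetState >= 0 &&
           static_cast<std::size_t>(alphabetState) < nAlphabetStates);
    std::vector<std::size_t> indices;
    indices.reserve(nRateClasses);
    for (std::size_t r = 0; r < nRateClasses; ++r)
      indices.push_back(r * nAlphabetStates + static_cast<std::size_t>(alphabetState));
    return indices;
  }

  // Reverse direction: every model state maps back to exactly one alphabet state.
  int getAlphabetState(std::size_t modelStateIndex) const {
    assert(modelStateIndex < numberOfModelStates());
    return static_cast<int>(modelStateIndex % nAlphabetStates);
  }
};
```

With 4 nucleotides and 3 rate classes, getModelStates(2) (G) would give the indices 2, 6 and 10 under this layout, and a plain static_cast would only ever find the first of them.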

The problem is that in many places, we still assume the equivalence between alphabet states and model states, making the procedures buggy when used with more complex models. This is the case in substitution mapping classes and sequence simulation classes for instance.

In order to solve these issues, I have modified these classes to work with states coded as "size_t" instead of "int", to explicitly show that these methods use model states and not alphabet states. In addition, variable names are changed, e.g. "initialState" => "initialStateIndex", to clarify things further. These changes will probably lead to backward incompatibilities, but I'm afraid they are necessary to clean things up.
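To give an idea of what the new convention looks like, here is a rough sketch of a simulation step written purely in terms of model state indices. The function name, the matrix type and the inverse-CDF draw are illustrative assumptions, not the actual code of the simulation classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row i holds the transition probabilities from model state i over some branch.
using TransitionMatrix = std::vector<std::vector<double>>;

// Draw the next model state index given the current one and a uniform
// random number u in [0, 1). Both input and output are matrix positions
// (size_t), never alphabet codes (int).
std::size_t drawNextStateIndex(const TransitionMatrix& probabilities,
                               std::size_t initialStateIndex,
                               double u) {
  assert(initialStateIndex < probabilities.size());
  const std::vector<double>& row = probabilities[initialStateIndex];
  assert(!row.empty());
  double cumulative = 0.0;
  for (std::size_t j = 0; j < row.size(); ++j) {
    cumulative += row[j];
    if (u < cumulative) return j;
  }
  return row.size() - 1;  // guard against rounding when u is close to 1
}
```

Because the parameter is a size_t, a caller holding an alphabet state has to convert it explicitly (for instance through getModelStates) instead of passing the int code directly.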

Any comment / question welcome.

Best,

Julien.
