Morphological dataset: polymorphic characters

astrid cruaud

Aug 22, 2011, 7:42:46 PM
to raxml
Dear all,

I am working on multi-state morphological datasets. However some of
the characters are polymorphic. I mean that for some species two
different character states can be encountered.

Is there a way to deal with that in RAxML?

I know that in some programs polymorphisms can be entered by
enclosing the states in square brackets...

Thanks in advance for your answer!

Best

Astrid Cruaud

Alexis

Aug 23, 2011, 2:49:38 AM
to raxml
Dear Astrid,

Unfortunately, there is no option to use polymorphic multi-state
characters in RAxML.
It is not implemented because doing so would be extremely hard and
work-intensive from a software engineering point of view.

Alexis

astrid cruaud

Sep 1, 2011, 9:31:08 PM
to raxml
Dear Alexis,

That's a pity but I understand...

In any case, thanks a lot for your answer and congratulations on this
nice software!

Best

Astrid

A-A_lowie li

Sep 2, 2011, 4:09:38 AM
to ra...@googlegroups.com
Dear Astrid: 

I am new to this field, but curious to know what "polymorphic multi-state characters" are.

You wrote that "for some species two different character states can be encountered."
Do you mean, for example, that the nucleotide code K represents G or T, or something different?

Lowie

        
--
Without dream, life is incomplete!

Alexis

Sep 5, 2011, 1:11:51 PM
to raxml
Hi Lowie,

In multi-state datasets that are encoded with, say, 5 states (0, 1, 2,
3, 4), you may not be sure about certain character states (much like
ambiguous DNA codes such as X, Y and the like).
E.g., you may want to encode something as being either 3 or 4, so
you'd like to write {3,4}, representing a single character
in the alignment, to indicate that this is an ambiguous state.

I think that this is the way it is implemented in PAUP.
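
To make the notation concrete, here is a minimal Python sketch that reads one matrix row containing such groups and returns the set of possible states for each character. The function name `parse_row` and the accepted syntax (braces or parentheses, with optional commas) are assumptions for illustration; this is not how any particular program parses its input.

```python
import re

# One token per alignment column: either a single state symbol or a
# bracketed group such as {3,4} or (34). The accepted syntax is an assumption.
TOKEN = re.compile(r"\{([^}]*)\}|\(([^)]*)\)|(\S)")

def parse_row(row):
    """Return a list of frozensets, one set of possible states per column."""
    columns = []
    for braces, parens, single in TOKEN.findall(row):
        if single:
            # An unambiguous state occupies one column on its own.
            columns.append(frozenset(single))
        else:
            # A group also occupies one column; drop separators inside it.
            group = braces or parens
            columns.append(frozenset(ch for ch in group if ch not in ", "))
    return columns

# "01{3,4}2" has four columns; the third one is ambiguous between 3 and 4.
row = parse_row("01{3,4}2")
print(len(row), sorted(row[2]))
```

The point of the sketch is only that {3,4} counts as a single column whose value is a set of states rather than one state.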

Alexis



A-A_lowie li

Sep 5, 2011, 7:03:48 PM
to ra...@googlegroups.com
Dear Prof. Alexis: 

        Thanks for your insights :)
        If the characters are all nucleotides, IUB/IUPAC has standardized the ambiguity letters (http://www.bioinformatics.org/sms/iupac.html), which takes this kind of multi-state problem into account.

        What I am trying to say is that any program or algorithm robust enough to analyse IUPAC codes, rather than dealing strictly with A, T, C and G, should have no problem with multi-state ambiguity, at least in nucleotide datasets.

        Well, it's always easier said than done :)
        Thanks for your explanation, it is clear to me now :)

Lowie

Ingo Michalak

Sep 5, 2011, 7:32:32 PM
to raxml
Dear Lowie,
RAxML accepts different types of data. One of these is "multi-state",
i.e. morphological characters with more than 2 possible states for
each character (with only 2 character states you would use the binary
data type). Another data type is DNA, which includes the IUPAC
ambiguity characters.
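
For readers unfamiliar with the IUPAC ambiguity characters, the following Python sketch lists the standard nucleotide codes as sets of compatible bases. The table itself is the standard IUPAC one; the names `IUPAC` and `compatible_bases` are only illustrative, and the sketch says nothing about how RAxML stores these codes internally.

```python
# Standard IUPAC nucleotide codes and the bases each one is compatible with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def compatible_bases(code):
    """Return the set of bases that an IUPAC code may stand for."""
    return set(IUPAC[code.upper()])

# K stands for "G or T", the example raised earlier in the thread.
print(sorted(compatible_bases("K")))
```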

Best
Ingo


A-A_lowie li

Sep 5, 2011, 7:39:34 PM
to ra...@googlegroups.com
Dear Ingo:

       Thanks for the information.
       So there is no problem at all, I hope :)

Lowie

Seraina Klopfstein

May 22, 2024, 6:14:57 AM
to raxml
Dear all,

I second Astrid's request for inclusion of polymorphisms in morphological datasets in RAxML. It would be extremely useful to be able to contrast the Bayesian results from mixed datasets with ML, but of course I understand that this is not a priority. (I just got very excited when I realized that RAxML includes the Mk model at all.)

Best,
Seraina

 

Grimm

May 22, 2024, 9:10:07 AM
to raxml
Hi all,

I don't think that implementing an explicit polymorphism model would make sense, because non-genetic polymorphism does not have the same information content as, e.g., allelic variation (which RAxML-NG covers with its GENOTYPE model). Instead, it reflects either intra-tip variation (typically within a species) or uncertainty about a character's state (e.g. when combining fossils and extant species). When we model substitution probabilities, this makes a difference.
E.g. for a binary character coded as
TaxonA 0
TaxonB 1
TaxonC May be 0 or 1 (logical OR)
TaxonD Can be both 0 and 1 (logical AND)
we would have the following tip probability vectors:
TaxonA (1,0)
TaxonB (0,1)
TaxonC (0.5,0.5)
TaxonD (1,1)
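
The distinction between the two kinds of coding can be written as a small Python sketch. It follows the four vectors listed above; the function name `tip_vector` and its `mode` argument are hypothetical and do not correspond to any RAxML or RAxML-NG interface.

```python
def tip_vector(observed, n_states, mode="uncertain"):
    """Build the tip vector for one character of one taxon.

    observed: set of state indices coded for the tip.
    mode "uncertain":   exactly one of the states applies (logical OR),
                        so the weight is split evenly between them.
    mode "polymorphic": all coded states occur in the taxon (logical AND),
                        so each of them gets the full weight of 1.
    """
    if not observed:
        raise ValueError("at least one state must be observed")
    if mode == "uncertain":
        weight = 1.0 / len(observed)
    elif mode == "polymorphic":
        weight = 1.0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return tuple(weight if s in observed else 0.0 for s in range(n_states))

print(tip_vector({0}, 2))                         # TaxonA
print(tip_vector({1}, 2))                         # TaxonB
print(tip_vector({0, 1}, 2))                      # TaxonC, uncertain
print(tip_vector({0, 1}, 2, mode="polymorphic"))  # TaxonD, polymorphic
```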

Another problem that is hard to tackle within explicit models is the quality of the states, which also differs greatly from amino acids or nucleotides. Only in binary or (ordered) ternary morpho-partitions are the states more or less comparable. As soon as we have different state numbers, e.g. a combination of binary, unordered multi-state and ordered binned characters, which is the case for all morpho-partitions I have come across, even a state-only model is problematic, because the probability of substituting 0 by 1, by 2, etc. differs greatly from state to state. (This is an inherent problem for both Bayesian and ML total-evidence matrices and can be a reason for decreased branch support, PP < 1.00, BS <<< 100.) If we now add polymorphisms to the model, we may reinforce an incipient overparametrisation and trigger branching artefacts.

Last, is there really a need? Most polymorphisms in morpho-partitions carry very little signal for optimising a tip's position in a tree. That's another difference from intragenomic nucleotide variation. Re Astrid's original problem: a simple test is to replace all polymorphisms ("(0,1)", "{0,1}", etc.) in your matrix with missing data ("?" or "-", it doesn't matter) and re-infer the tree. If the same branches are supported (or preferred), the coded polymorphisms are irrelevant.
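
The replacement step of this test can be sketched in a few lines of Python. The helper `mask_polymorphisms` is hypothetical, and it assumes that polymorphisms are written as bracketed groups and that only the character string of each row is passed in (taxon names containing brackets would otherwise be altered too).

```python
import re

# A bracketed group of states such as (0,1), {0,1} or [01].
POLYMORPHISM = re.compile(r"\([^()]*\)|\{[^{}]*\}|\[[^\[\]]*\]")

def mask_polymorphisms(characters, missing="?"):
    """Replace every bracketed polymorphism with the missing-data symbol."""
    return POLYMORPHISM.sub(missing, characters)

# Each group collapses to a single "?", so the row keeps its column count.
print(mask_polymorphisms("01(0,1)2{1,2}"))
```

Running the inference once on the original matrix and once on the masked one then gives the two trees to compare.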

If the trees look very different, then there are two possible workarounds that cover both situations, intra-taxon polymorphism and uncertainty (i.e. TaxonD vs TaxonC in the example above).

One would be binarisation of multistate/polymorphic characters in the matrix: rather than using 2, 3 or more states, one uses a vector of binary or ordered ternary states (to distinguish uncertainty from intra-taxon variation) and then runs that with the Mk model, either unweighted or with differential character weights (to normalise the effect of binary vs. many-state traits).
E.g. for the example above, the binarisation would be
TaxonA 2 0  = Trait #1 is 0
TaxonB 0 2  = Trait #1 is 1
TaxonC 1 1  = Trait #1 may be 0 or 1
TaxonD 2 2  = Trait #1 is both 0 and 1
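
A minimal Python sketch of this recoding, under the reading that each original state becomes one ordered ternary column with 0 = absent, 1 = possibly present (uncertain) and 2 = present. The function name `binarise` and the `uncertain` flag are assumptions for illustration.

```python
def binarise(observed, n_states, uncertain=False):
    """Recode one character as a vector with one column per original state.

    0 = state absent, 1 = state possibly present (uncertain coding),
    2 = state present (a single state, or true intra-taxon polymorphism).
    """
    # Only a multi-state coding flagged as uncertain is downgraded to 1.
    mark = 1 if uncertain and len(observed) > 1 else 2
    return [mark if s in observed else 0 for s in range(n_states)]

print(binarise({0}, 2))                     # TaxonA
print(binarise({1}, 2))                     # TaxonB
print(binarise({0, 1}, 2, uncertain=True))  # TaxonC
print(binarise({0, 1}, 2))                  # TaxonD
```

The same function handles characters with more states, e.g. `binarise({3, 4}, 5, uncertain=True)` for a five-state character that is either 3 or 4.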

The other is to make use of the customised model in RAxML-NG by treating polymorphisms and uncertainties as additional states.
In this case we would redefine the binary trait as a quaternary character with e.g. 0 = is 0, 1 = is 1, X = 0 OR 1, A = 0 AND 1, and the transformed matrix would be
TaxonA 0
TaxonB 1
TaxonC X
TaxonD A
And the corresponding definition file would be
4 2
01AX
Is0 Is1
0   1,0
1   0,1
A   1,1
X   0.5,0.5
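
The matrix transformation for this second workaround can be sketched in Python as follows. It only produces the recoded symbols 0, 1, X and A for a binary trait; it does not write the definition file, whose exact format should be checked against the RAxML-NG documentation. The function name `recode` and the `uncertain` flag are hypothetical.

```python
def recode(observed, uncertain=False):
    """Map the coding of a binary trait to one of the symbols 0, 1, X, A."""
    observed = frozenset(observed)
    if observed == {0}:
        return "0"
    if observed == {1}:
        return "1"
    if observed == {0, 1}:
        # X = 0 OR 1 (uncertain), A = 0 AND 1 (polymorphic).
        return "X" if uncertain else "A"
    raise ValueError(f"unexpected coding for a binary trait: {sorted(observed)}")

print(recode({0}), recode({1}), recode({0, 1}, uncertain=True), recode({0, 1}))
```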

Note that for most morphomatrices I have come across (including anything from dinosaurs to intrageneric variation in plants), using overly complex models just reinforces systematic bias; there is a higher risk of false positives, and the simpler the model and matrix, the easier it is to pinpoint branching artefacts.
If you're unsure, just run two analyses: one treating polymorphisms as missing data, the other using one of the above schemes (or both, pretty sure no one has tested them against each other yet), and compare the results.

Cheers, Guido