Hi all,
I don't think that implementing an explicit polymorphism model would make sense, because non-genetic polymorphism does not have the same information content as e.g. allelic variation (which RAxML-ng covers with its GENOTYPE model). Instead, it reflects either intra-tip variation (typically within a species) or uncertainty about a character's state (e.g. when combining fossils and extant species). This difference matters when we model substitution probabilities.
E.g. take a binary trait coded as
TaxonA 0
TaxonB 1
TaxonC May be 0 or 1 (logical OR)
TaxonD Is both 0 and 1 (logical AND)
For these, we would have the following tip probability vectors
TaxonA (1,0)
TaxonB (0,1)
TaxonC (0.5,0.5)
TaxonD (1,1)
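To make the tip vectors concrete, here is a minimal Python sketch. The values for the OR and AND cases are the ones used in the definition file further down in this post. The input notation is my assumption (braces for uncertainty, parentheses for polymorphism, as commonly done in NEXUS files), and `tip_vector` is just a hypothetical helper name, not anything RAxML-ng provides.

```python
# Tip vectors for one binary trait, ordered as (state 0, state 1).
# Assumption: "{0,1}" codes uncertainty (OR), "(0,1)" codes polymorphism (AND).
TIP_VECTORS = {
    "0": (1.0, 0.0),      # observed state 0
    "1": (0.0, 1.0),      # observed state 1
    "{0,1}": (0.5, 0.5),  # may be 0 or 1
    "(0,1)": (1.0, 1.0),  # is both 0 and 1
}

def tip_vector(cell):
    """Return the tip vector for a single coded cell (hypothetical helper)."""
    return TIP_VECTORS[cell.replace(" ", "")]

matrix = {"TaxonA": "0", "TaxonB": "1", "TaxonC": "{0,1}", "TaxonD": "(0,1)"}
for taxon, cell in matrix.items():
    print(taxon, tip_vector(cell))
```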
Another problem that is hard to tackle within explicit models is the quality of the states, which also differs much from amino acids or nucleotides. Only in binary or (ordered) ternary morpho-partitions are the states more or less comparable. As soon as we have different state numbers, e.g. a combination of binary, unordered multistate and ordered binned characters, which is the case for all morpho-partitions I came across, even a state-only model is problematic, because the probability to substitute 0 by 1, by 2, etc. is much different from state to state. This is an inherent problem for both Bayesian and ML total-evidence matrices and can be a reason for decreased branch support (PP < 1.00, BS <<< 100). If we now add polymorphisms to the model, we may aggravate an already incipient overparametrisation and trigger branching artefacts.
Last, is there really a need? Most polymorphisms in morpho-partitions have very little signal for optimising a tip's position in a tree. That's another difference from intragenomic nucleotide variation. Re Astrid's original problem: a simple test is to replace all polymorphisms "(0,1)" or "{0,1}" etc. in your matrix by missing data ("?" or "-", doesn't matter) and re-infer the tree. If the same branches are supported (or preferred), the coded "polymorphisms" are irrelevant.
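The replacement test can be scripted in a few lines. This is only a sketch under assumptions: it expects the matrix as (taxon, sequence) pairs and treats anything inside matching parentheses or braces as one polymorphic cell. The function names are made up for illustration, and reading or writing your actual file format (PHYLIP, NEXUS) is left out.

```python
import re

# Assumption: polymorphic or uncertain cells are written in brackets,
# e.g. "(0,1)", "{0,1}", "(0 1)" or "{01}". Each bracket pair is one cell.
POLYMORPHIC_CELL = re.compile(r"\([^()]*\)|\{[^{}]*\}")

def mask_polymorphisms(sequence, missing="?"):
    """Replace every bracketed cell in one row by the missing-data symbol."""
    return POLYMORPHIC_CELL.sub(missing, sequence)

def mask_matrix(rows, missing="?"):
    """rows: iterable of (taxon, sequence); returns a list of masked rows."""
    return [(taxon, mask_polymorphisms(seq, missing)) for taxon, seq in rows]

rows = [
    ("TaxonA", "0"),
    ("TaxonB", "1"),
    ("TaxonC", "{0,1}"),
    ("TaxonD", "(0,1)"),
]
for taxon, seq in mask_matrix(rows):
    print(taxon, seq)
```

Infer a tree from the masked matrix with the same settings as before and compare branch support between the two runs.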
If the trees look much different, there are two possible workarounds that cover both situations, intra-taxon polymorphism and uncertainty (i.e. TaxonC vs TaxonD in the example above).
One would be binarisation of multistate/polymorphic characters in the matrix: rather than using 2, 3 or more states, one uses a vector of binary or ordered ternary states (to distinguish uncertainty from intra-taxon variation) and then runs that with the Mk model, unweighted or using differential character weights (to normalise the effect of binary vs. many-state traits).
E.g. for the example above, the binarisation would be
TaxonA 2 0 = Trait #1 is 0
TaxonB 0 2 = Trait #1 is 1
TaxonC 1 1 = Trait #1 may be 0 or 1
TaxonD 2 2 = Trait #1 is both 0 and 1
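The binarisation above can be sketched as a small Python function. Assumptions, not anything fixed by RAxML-ng: uncertainty is coded in braces and polymorphism in parentheses, states are single digits, and `binarise` is a hypothetical name. Each original state becomes one ordered ternary column with 2 = observed, 1 = possible, 0 = absent.

```python
def binarise(cell, n_states=2):
    """Turn one coded cell into a list of ordered ternary indicators.

    Only single-digit states (0-9) are handled in this sketch.
    """
    cell = cell.strip()
    if cell in ("?", "-"):
        # Missing data stays missing in every derived column.
        return ["?"] * n_states
    if cell.startswith("{") and cell.endswith("}"):
        level = 1  # uncertainty: each listed state is only possible
    else:
        level = 2  # single state or "(...)" polymorphism: states are observed
    states = {int(ch) for ch in cell if ch.isdigit()}
    return [level if state in states else 0 for state in range(n_states)]

matrix = {"TaxonA": "0", "TaxonB": "1", "TaxonC": "{0,1}", "TaxonD": "(0,1)"}
for taxon, cell in matrix.items():
    print(taxon, *binarise(cell))
```

For a trait with more states, pass `n_states` accordingly, e.g. `binarise("{0,2}", n_states=3)` gives one column per original state.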
The other is to make use of the customised model in RAxML-ng by treating polymorphisms and uncertainties as additional states.
In this case we would redefine the binary trait as a quaternary character with e.g. 0 = is 0, 1 = is 1, X = 0 OR 1, A = 0 AND 1, and the transformed matrix would be
TaxonA 0
TaxonB 1
TaxonC X
TaxonD A
And the corresponding definition file would be
4 2
01AX
Is0 Is1
0 1,0
1 0,1
A 1,1
X 0.5,0.5
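Recoding the matrix for this second workaround can be sketched in the same way. Again a sketch under assumptions: braces mean uncertainty (becomes X), parentheses mean polymorphism (becomes A), and only binary characters are handled. `recode` is a hypothetical helper and does not touch the definition file itself.

```python
import re

# One cell is either a bracketed group or a single character.
CELL = re.compile(r"\([^()]*\)|\{[^{}]*\}|.")

def recode(sequence):
    """Recode binary cells: {0,1} -> X (0 OR 1), (0,1) -> A (0 AND 1)."""
    out = []
    for cell in CELL.findall(sequence):
        if cell[0] in "({":
            # This sketch only covers binary traits, so the group must be 0 and 1.
            if set(re.findall(r"\d", cell)) != {"0", "1"}:
                raise ValueError(f"not a binary polymorphism: {cell}")
            out.append("A" if cell[0] == "(" else "X")
        else:
            out.append(cell)  # plain state or missing data, kept as is
    return "".join(out)

rows = [("TaxonA", "0"), ("TaxonB", "1"), ("TaxonC", "{0,1}"), ("TaxonD", "(0,1)")]
for taxon, seq in rows:
    print(taxon, recode(seq))
```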
Note that for most morphomatrices I came across (including anything from dinosaurs to intrageneric variation in plants), using overly complex models just reinforces systematic bias: there is a higher risk of false positives, and the simpler the model and matrix, the easier it is to pinpoint branching artefacts.
If you're unsure, just run two analyses: one treating polymorphisms as missing data, the other using one of the above schemes (or both, pretty sure no one has tested them against each other yet) and compare the results.
Cheers, Guido