A/C would be M not Y :)
The labels for ambiguity codes have a simple logic.
- Y stands for pYrimidine(s) which is C (cytosine) and T (thymine)
- R is for puRine(s), adenine (no y) and guanine (again no y)
- W means Weak, so it's A/T, forming a weak bond (two hydrogen bonds) vs.
- S for Strong, C/G, three hydrogen bonds in the double helix
- K stands for Keto-group, which you have in G and T
- M stands for aMino-group, which you have in A and C
You don't have them with single-copy heterozygous data, but the triplets are easy, too. It's just one letter after the one missing:
- B means A missing = C, G, T
- D means C missing = A, G, T
- H means G missing = A, C, T
- V means T missing = A, C, G (U is uracil, hence, we have to jump to V)
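
To make the logic above easy to check, here is a minimal Python sketch of the code table. The names `IUPAC`, `ambiguity_code` and `three_base_code` are made up for this illustration, and N (any base) is added for completeness although it is not in the list above.

```python
# IUPAC nucleotide ambiguity codes as listed above (bases sorted alphabetically).
# N = any base is included for completeness.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

# Reverse lookup: sorted string of bases -> ambiguity code.
BASES_TO_CODE = {bases: code for code, bases in IUPAC.items()}


def ambiguity_code(bases):
    """Return the one-letter code for a set of bases, e.g. 'AC' -> 'M'."""
    key = "".join(sorted(set(bases.upper())))
    return BASES_TO_CODE[key]


def three_base_code(missing):
    """Apply the 'one letter after the one missing' rule for B, D, H, V."""
    nxt = chr(ord(missing.upper()) + 1)
    if nxt == "U":  # U is uracil, so T has to jump to V
        nxt = "V"
    return nxt
```

For example, `ambiguity_code("AC")` gives `"M"` (the A/C case from the first line), and `three_base_code("T")` gives `"V"`.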
Regarding options 1, 2 and 3: what to expect, and when to choose what.
Heterozygotic data is a simple version of what we called "twisps" (2ISP, intra-individual site polymorphism) in this paper, where we tested their information content in 2ISP-rich data using, among other things, the MULTISTATE option of RAxML:
The standard implementation in RAxML is, in contrast to the MP, distance and Bayesian implementations (back then; not sure whether they changed it in MrBayes), not ambiguity-naive, as Alexej pointed out for his possibility 2. Depending on the diversity in the data (how well sorted it is), it makes a difference whether you treat e.g. R as A/G (implementation in RAxML) or as ? (implementation in MrBayes 2 and 3.1). Possibility 1 would be ambiguity-naive.
As a quick test to see how informative the heterozygous sites are, you just run your data following Alexej's suggestions 1 (replace by ?) and 2. If the trees are the same, the heterozygous sites are irrelevant for the inference. If the trees differ in important aspects (placement of your heterozygotic accessions), then you go for Alexej's option 2 or 3. I guess 1 is quicker than 2, and 3 is the slowest.
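
For the "replace by ?" matrix of suggestion 1, a small script is enough. This is only a sketch: it assumes the alignment is already in memory as a dict of name to sequence, and `mask_ambiguities` and `mask_alignment` are hypothetical helper names, not part of any program.

```python
# Two- and three-base ambiguity codes; N, ? and gaps are left as they are.
AMBIGUOUS = set("RYWSKMBDHV")


def mask_ambiguities(seq, missing="?"):
    """Replace every partial ambiguity code in a sequence by the missing symbol."""
    return "".join(missing if ch.upper() in AMBIGUOUS else ch for ch in seq)


def mask_alignment(alignment):
    """Apply the masking to a whole alignment given as {name: sequence}."""
    return {name: mask_ambiguities(seq) for name, seq in alignment.items()}
```

You then run the original and the masked matrix with the same settings and compare the two trees.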
The experimental genotype implementation (Alexej's possibility 3) is the more sophisticated version (since it uses both the mutation rates and the ambiguity codes) of our recoding procedure: we simply replaced A, C, G, T, R, Y, etc. by 0, 1, 2, 3, 4, 5, etc. and ran the same matrices as DNA or MULTISTATE.
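
The recoding itself can be sketched in a few lines of Python. Note the assumptions: the symbol order in `STATE_ORDER` is just an example (only A, C, G, T, R, Y = 0 to 5 follows from the sentence above), only the four bases and the six two-base codes are given their own state, and everything else except gaps is written as missing. `recode_multistate` is a hypothetical name.

```python
# Assumed state order: four bases first, then the six two-base codes.
STATE_ORDER = "ACGTRYWSKM"
RECODE = {sym: str(i) for i, sym in enumerate(STATE_ORDER)}


def recode_multistate(seq):
    """Recode a DNA sequence with ambiguity codes into numbered states."""
    out = []
    for ch in seq:
        if ch == "-":
            out.append("-")  # keep gaps as they are
        else:
            # symbols without their own state (N, B, D, H, V, ...) become missing
            out.append(RECODE.get(ch.upper(), "?"))
    return "".join(out)
```

With this order, `recode_multistate("ACGTRY-N")` gives `"012345-?"`.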
Using ambiguities like that for additional information was a double-edged sword under ML. When your data is 2ISP-rich (in your case: a lot of heterozygotic sites), it's unproblematic and you get fine phylogenetic hypotheses (we preferred networks since evolution is usually not tree-like towards the leaves), also outperforming standard ML (we also did simulations). But if it's 2ISP-poor, the MULTISTATE recoding will over-penalise the change from a monomorphic to a dimorphic (heterozygous) state, e.g. A<>G would have a higher probability than A<>R, which makes no sense. In those cases, the standard run (as DNA, option 2) produced more sensible ML trees.
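
A simplified way to picture the difference between the two treatments is to look at what a tip contributes to the likelihood. The sketch below is not RAxML's actual code, just the textbook idea: under the DNA treatment an ambiguity code is compatible with each of its bases, while under the recoding it is a separate state. The function names and the state order are assumptions made for this illustration.

```python
BASES = "ACGT"
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "CT", "R": "AG", "W": "AT", "S": "CG", "K": "GT", "M": "AC",
}


def tip_vector_dna(symbol):
    """DNA treatment: 1.0 for every base the symbol is compatible with."""
    allowed = IUPAC.get(symbol.upper(), BASES)  # unknown symbols count as fully missing
    return [1.0 if b in allowed else 0.0 for b in BASES]


def tip_vector_multistate(symbol, states="ACGTRYWSKM"):
    """Recoded treatment: every symbol is its own state (assumed state order)."""
    return [1.0 if s == symbol.upper() else 0.0 for s in states]


def compatible(u, v):
    """True if two tips share at least one state, i.e. no change is required."""
    return any(a > 0 and b > 0 for a, b in zip(u, v))
```

As DNA, an A next to an R needs no change at all (`compatible(tip_vector_dna("A"), tip_vector_dna("R"))` is `True`). After recoding, A and R are as different as A and G, and the rate of A<>R has to be estimated from however few such sites the data contain.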
The new genotype implementation should be much less vulnerable to that problem, I guess.
Cheers, Guido
Below are the split support changes on the real-world data (using low- to high-polymorphic multi-copy data, so the worst possible data for phylogenetics)
(the P index is the level of heterozygosity; -A is the standard implementation, i.e. ignorant for NJ and MP, partly aware for ML; -I uses the ambiguity codes as additional states, ordered for NJ and MP, 13-state matrix for ML)