Hi Brand,
Changing minority Rs into the majority A would be what we called a "modal consensus". Indeed, given how GBS data is assembled, I would argue that this is a sensible thing to do if the proportion of Rs etc. is low and the ambiguous base calls are typically extensions of the majority base. The argument would be that those minority calls mostly stem from intragenomic variation (not yet concerted mutations) or call uncertainty (detection artefacts, random noise) and do not reflect phylogenetic or genetic-evolutionary patterns.
Ages ago we tested the effect of using modal (majority base) vs. strict consensus data (ambiguity codes). I don't know if anyone has looked into that any further for highly polymorphic ITS datasets showing extreme levels of intra-genomic diversity.
Göker M, Grimm GW. 2008. General functions to transform associate data to host data, and their use in phylogenetic inference from sequences with intra-individual variability. BMC Evolutionary Biology 8:86.
Like in the Potts et al. study, all is possible and the best treatment may differ from dataset to dataset. It all depends on how many ambiguities are found and how they are sorted. If the ambiguity score (cf. Potts et al. 2014) is low, you can just use this as an argument and directly proceed with majority base data. Otherwise, majority bases may give you results that are distorted or wrong in at least some aspects.
To give a very simple example, let's say you have the phylogenetic split A1+A2 | B1+B2; then you may encounter the following principal site ambiguity patterns in your SNP data (dominant base in uppercase).
    #1  #2  #3  #4
A1  A   A   G   G
A2  A   Ag  Ag  aG
B1  Ag  aG  A   A
B2  A   A   A   A
Pattern #1 is random noise. Such patterns are unproblematic for finding the optimal topology, as they do not inform any split within the 4-tip problem. The only difference could be the branch length of the B1 tip. However, if B1 were the only ingroup tip showing much increased levels of ambiguity, and the outgroup differed from the ingroup in all those cases, one may step into the famous Felsenstein Zone: the analysis may get trapped by ingroup-outgroup long-branch attraction when using the standard 4x4 model, with less risk for the GENOTYPE model and no risk when using the majority base.
Pattern #2 is incongruent with the actual split. Depending on how ambiguity calls are treated and how many such patterns your data has, they may at least decrease the split's support. The distortion effect would differ between majority bases, ambiguity calls with the standard 4x4 model, and the GENOTYPE model. To my knowledge, the extent hasn't been studied properly, but I know a few oligogene and phylogenomic datasets that have very few phylogenetically consistently sorted SNPs in their gene samples and accordingly get biased coalescents riddled by branching artefacts, where I know first-hand or suspect that the ambiguities were mistreated.
Patterns #3 and #4 would be indistinguishable when we just use the ambiguity code and the standard model, but lead to different preferred splits if we use majority bases. In the latter case these patterns would have the same effect as pattern #1! This is a situation where the standard 4x4 model, being 2ISP semi-aware, can much outcompete a majority base approach (cf. Potts et al. 2014), and where the newly implemented GENOTYPE model will excel in its decision depth because it even recognises the difference between patterns #3 and #4. For instance, at the species-population level, where the genetic patterns are not yet consistently sorted phylogenetically but still affected by population dynamics (intra-species diversity close to inter-species divergence), distinguishing between #3 and #4 can be crucial. Imagine our dataset includes a tip A0, representing the genetically least drifted tip of the A-group, and A2 and A1 as increasingly drifted tips. Let O be a distant outgroup.
    #1  #2  #3  #4
A0  A   A   A   Ag
A1  A   A   G   G
A2  A   Ag  Ag  aG
B1  Ag  aG  A   A
B2  A   A   A   A
O   A   G   G   A
In this case, using majority bases would, provided we have a lot of these patterns, easily trigger a wrong topology, since patterns #3 and #4 would prefer an O + A1 (+A2) | A0 + B1/B2 (+A2) split and all #1 and #2 patterns would be split-ignorant. Keeping the ambiguity calls and using the 4x4 model may already save us from most branching artefacts if not all, as would the GENOTYPE model.
But with the speed of RAxML-ng, we don't need to make that choice anymore. It is easy to just test this for one's own data. You just run three analyses:
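To see what each treatment keeps of patterns #3 and #4, here is a minimal Python sketch. It assumes the notation of the table above (dominant base in uppercase, minor base in lowercase); the helper names `majority`, `ambiguity` and `encode` are made up for illustration and are not part of any tool.

```python
# IUPAC codes for two-base ambiguities.
IUPAC = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
         frozenset("GT"): "K", frozenset("AT"): "W", frozenset("CG"): "S"}

def majority(call):
    """Return the dominant (uppercase) base of a call such as 'Ag' or 'aG'."""
    return next(c for c in call if c.isupper())

def ambiguity(call):
    """Return the base itself, or the IUPAC code of a two-base call."""
    bases = frozenset(call.upper())
    return call.upper() if len(bases) == 1 else IUPAC[bases]

def encode(column, fn):
    """Encode one site pattern (a list of calls, one per tip) as a string."""
    return "".join(fn(call) for call in column)

tips = ["A0", "A1", "A2", "B1", "B2", "O"]
patterns = {  # columns #3 and #4 of the six-tip table, in tip order
    "#3": ["A", "G", "Ag", "A", "A", "G"],
    "#4": ["Ag", "G", "aG", "A", "A", "A"],
}

for name, column in patterns.items():
    print(name, "majority:", encode(column, majority),
          "ambiguity:", encode(column, ambiguity))

# Restricted to the original four tips (A1, A2, B1, B2), the ambiguity codes
# of #3 and #4 coincide, while the majority bases differ.
four = slice(1, 5)
print(encode(patterns["#3"][four], ambiguity) == encode(patterns["#4"][four], ambiguity))
print(encode(patterns["#3"][four], majority) == encode(patterns["#4"][four], majority))
```

The sketch only recodes the site patterns; it says nothing about how a likelihood model weighs them.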
- The original data with base proportions and the GENOTYPE model
- The data with ambiguity codes and the standard 4x4 model
- All ambiguity codes replaced by the majority base.
Typically these days you work with datasets that have too many tips: not only 20k SNPs but also thousands of tips, which is often unnecessary for the question at hand and includes many that are near-identical, but people like large trees 🙃 If this is the case, you just take an informed or random subset of the total tip set for a test that is quick to run, e.g. all tips with a certain proportion of ambiguity-including SNPs, or one or two tips per species/known clade.
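Picking such a test subset could be sketched as follows. This is only an illustration: it assumes the alignment is already in memory as a dict of tip name to IUPAC-coded sequence, and `ambiguous_fraction`, `pick_by_ambiguity`, `pick_per_group` and the `groups` mapping (species or clade to tip names) are hypothetical names. The fraction computed here is a plain proportion of ambiguous calls, not the ambiguity score of Potts et al.

```python
import random

# IUPAC codes standing for two or three bases; N, gaps and ? count as missing.
AMBIGUOUS = set("RYKMSWBDHV")

def ambiguous_fraction(seq):
    """Proportion of called sites in a sequence that carry an ambiguity code."""
    called = [c for c in seq.upper() if c not in "N-?"]
    if not called:
        return 0.0
    return sum(c in AMBIGUOUS for c in called) / len(called)

def pick_by_ambiguity(alignment, min_fraction):
    """Informed subset: tips with at least min_fraction ambiguous calls."""
    return sorted(tip for tip, seq in alignment.items()
                  if ambiguous_fraction(seq) >= min_fraction)

def pick_per_group(groups, per_group=2, seed=1):
    """Random subset: up to per_group tips from each species or known clade."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for group in sorted(groups):
        members = sorted(groups[group])
        chosen.extend(rng.sample(members, min(per_group, len(members))))
    return sorted(chosen)

# Toy data, made up for the example.
alignment = {"t1": "AAAA", "t2": "ARRA", "t3": "ARAN"}
groups = {"sp1": ["t1", "t2", "t3"], "sp2": ["t4"]}
print(pick_by_ambiguity(alignment, 0.5))
print(pick_per_group(groups, per_group=2))
```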
And then just compare the outcome.
- If analyses 1 and 2 give the same tree and support the same splits, but 3 gives a different one, there's phylogenetic information in the intra-tip variation, which may be worth exploring further depending on the question at hand, e.g. if there are lack-of-resolution issues.
- If they're all the same (i.e. if there are no highly supported conflicts), you can use this result to argue that all final optimisations can just use the majority base for the quickest possible and least noisy inference.
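For the comparison itself, a rough Python sketch: it reads Newick strings, collects the non-trivial splits (bipartitions) of each tree and reports those not shared. The function `splits` is written just for this example and is deliberately simple: it ignores branch lengths and support values, so it only tells you whether topologies differ, not whether a conflict is highly supported. The toy trees are made up.

```python
import re

def splits(newick):
    """Return the non-trivial bipartitions of a tree given as a Newick string."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, clades, all_tips = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            clade = stack.pop()
            clades.append(frozenset(clade))
            if stack:
                stack[-1] |= clade  # pass the tips up to the parent clade
        elif tok in ",;":
            pass
        elif prev == ")":
            pass  # label after ')': support value and/or branch length
        else:
            name = tok.split(":")[0].strip()  # drop the branch length
            if name:
                all_tips.add(name)
                stack[-1].add(name)
        prev = tok
    result = set()
    for clade in clades:
        other = frozenset(all_tips - clade)
        if len(clade) > 1 and len(other) > 1:
            # store both sides so that rooting does not matter
            result.add(frozenset((clade, other)))
    return result

genotype_tree = "((A1:0.1,A2:0.1)95:0.2,(B1:0.1,B2:0.1));"  # analysis 1
ambiguity_tree = "((A1,A2),B1,B2);"                          # analysis 2
majority_tree = "((A1,B1),A2,B2);"                           # analysis 3

print(splits(genotype_tree) == splits(ambiguity_tree))
print(len(splits(genotype_tree) ^ splits(majority_tree)))
```

A split found in only one tree is not necessarily a conflict; it may simply be unresolved in the other, so the support values still need a look.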
/G