SNPs-Ambiguous base


Sithions Brand

Jan 12, 2024, 11:24:32 AM
to raxml
Hi everyone,

When using GBS-SNP-CROP for SNP calling, I observed ambiguous bases in my data. The file is in FASTA (.fas) format, converted from VCF. Does this have any impact on using RAxML-NG to build a phylogenetic tree? Is genotyping necessary for further analysis?

Thanks and happy holidays!

Xu Jun

Grimm

Jan 15, 2024, 4:02:17 AM
to raxml
Hi Sithions,

This entirely depends on how much of the signal is due to intragenomic variation.
The standard 4x4 ML substitution model is what we called "2ISP semi-aware" (Potts et al., Syst. Biol., 2014), i.e. polymorphic (ambiguous) calls may stabilise the topology if they are phylogenetically sorted.

In datasets where most of the discriminating signal is encoded as intra-genomic polymorphism (for our 2014 paper, we had real-world and simulated datasets where most of the variation was in the form of intra-individual site polymorphisms, 2ISPs), one will get better results by changing to a model that is fully aware of 2ISPs, like the GENOTYPE model implemented in RAxML-NG.

However, a note of caution: a certain proportion of ambiguous base calls in GBS-generated datasets is just random noise or even artefacts. Since these are stochastically distributed, such dubious ambiguities will not affect the topology optimisation if there is enough signal in the data, but they may inflate branch lengths, especially terminal ones. So the sheer proportion of ambiguous base calls is relatively irrelevant; what matters is how they are distributed across the genes and within phylogenetic lineages. A general rule: if your organism evolved trivially, i.e. one species dividing into two (the dichotomous model), there is no real need for the GENOTYPE model, but if there has been reticulation (lineage-crossing), it may give you a better dichotomous approximation of the actual species network.

A simple test is hence to run the data under the standard 4x4 substitution model and under the GENOTYPE model, plot the BS supports of both runs against each other, and compare the resulting best-known trees (e.g. does the AU test reject the alternative topology across models?). This is also highly advisable if you want to use RAxML-NG gene trees as input for coalescent tree approaches (e.g. Astral; see also Data S5 to Cardoni et al., The Plant Journal, 2022, where I re-analysed 2ISP-rich nuclear oligogene data, revealing phylogenetic relationships much different from those the original study obtained using phased data).

Cheers, Guido.

Sithions Brand

Jan 24, 2024, 2:19:09 AM
to raxml
Hi Dr. Grimm,

I have come across literature where the operation is performed as follows: if a variant is present in the majority of samples (e.g., A), but in a minority of samples it appears as a heterozygous base (e.g., R), it is changed to the variant base consistent with the majority of samples (A). Is this operation feasible? If so, are there any software tools available for analysis? I have around 20 million SNPs in my dataset.

Thank you!

Cheers, 

Brand


Grimm

Jan 24, 2024, 3:47:00 AM
to raxml
Hi Brand,

Changing minority Rs into the majority A would be what we called a "modal consensus". Indeed, given how GBS data is assembled, I would argue that this is a sensible thing to do if the proportion of Rs etc. is low and if the ambiguous base calls are typically extensions of the majority base. The argument would be that those minority calls are mostly due to intragenomic variation (mutations not yet concerted) or call uncertainty (detection artefacts, random noise) and do not reflect phylogenetic/genetic-evolutionary patterns.
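To make the modal consensus idea concrete, here is a minimal Python sketch. It is not taken from any tool mentioned in this thread: the function name, the dictionary-based alignment, and the `max_ambiguous_fraction` threshold are all hypothetical choices for illustration. It only replaces a two-base ambiguity code when the column's majority base is one of the two bases the code stands for (an "extension of the majority base") and the column's share of ambiguous calls is low.

```python
from collections import Counter

# IUPAC two-base ambiguity codes and the bases they stand for.
IUPAC_PAIRS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def modal_consensus(alignment, max_ambiguous_fraction=0.25):
    """Replace minority two-base ambiguity codes by the column's majority base.

    alignment: dict mapping tip name -> sequence (all of equal length).
    max_ambiguous_fraction is an illustrative threshold, not a published value.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = []
    for i in range(length):
        column = [alignment[name][i].upper() for name in names]
        counts = Counter(b for b in column if b in "ACGT")
        n_ambiguous = sum(1 for b in column if b in IUPAC_PAIRS)
        fraction = n_ambiguous / len(column)
        if counts and n_ambiguous and fraction <= max_ambiguous_fraction:
            majority = counts.most_common(1)[0][0]
            # Only touch codes that contain the majority base.
            column = [
                majority if b in IUPAC_PAIRS and majority in IUPAC_PAIRS[b] else b
                for b in column
            ]
        columns.append(column)
    return {name: "".join(col[j] for col in columns) for j, name in enumerate(names)}


example = {"t1": "AAC", "t2": "AAC", "t3": "RAC", "t4": "AYC"}
print(modal_consensus(example))
```

In the toy input, the R of `t3` becomes A, while the Y of `t4` is left alone because the majority base A is not one of its two bases. For millions of SNPs one would want a streaming or array-based version; this sketch only shows the rule.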

Ages ago we tested the effect of using modal consensus data (majority base) vs. strict consensus data (ambiguity codes). I don't know whether anyone has looked into this any further for highly polymorphic ITS datasets showing extreme levels of intra-genomic diversity.

Göker M, Grimm GW. 2008. General functions to transform associate data to host data, and their use in phylogenetic inference from sequences with intra-individual variability. BMC Evolutionary Biology 8:86.

As in the Potts et al. study, anything is possible, and the best treatment may differ from dataset to dataset. It all depends on how many ambiguities are found and how they are sorted. If the ambiguity score (cf. Potts et al. 2014) is low, you can use this as an argument and proceed directly with majority-base data. Otherwise, doing so may give you results that are distorted or wrong, at least in some aspects.

To give a very simple example, let's say you have the phylogenetic split A1+A2 | B1+B2; then you may encounter the following principal site ambiguity patterns in your SNP data (dominant base in uppercase).
      #1   #2   #3   #4
A1    A    A    G    G
A2    A    Ag   Ag   aG
B1    Ag   aG   A    A
B2    A    A    A    A

Pattern #1 is random noise. Such patterns are unproblematic for finding the optimal topology because they do not inform any split within the 4-tip problem. The only difference could be the branch length of the B1 tip. However, if B1 were the only ingroup tip showing much increased levels of ambiguity, and the outgroup differed from the ingroup in all those cases, one may step into the famous Felsenstein Zone: the analysis may get trapped by ingroup-outgroup long-branch attraction when using the standard 4x4 model, with less risk under the GENOTYPE model and no risk when using the majority base.

Pattern #2 is incongruent with the actual split. Depending on how ambiguity calls are treated and how many such patterns your data has, they may at least decrease the split's support. The distortion effect would differ between majority bases, ambiguity calls with the standard 4x4 model, and the GENOTYPE model. To my knowledge, the extent has not been studied properly, but I know a few oligogene and phylogenomic datasets that have very few phylogenetically consistently sorted SNPs in their gene samples and accordingly yield biased coalescents riddled with branching artefacts, where I know first-hand or suspect that the ambiguities were mistreated.

Patterns #3 and #4 would be indistinguishable when we just use the ambiguity code and the standard model, but would lead to different preferred splits if we used majority bases. In the latter case, these patterns would have the same effect as pattern #1! This is a situation where the standard 4x4 model, being 2ISP semi-aware, can much outcompete a majority-base approach (cf. Potts et al. 2014), and where the newly implemented GENOTYPE model will excel in its decision depth because it even recognises the difference between pattern #3 and pattern #4. For instance, at the species-population level, where the genetic patterns are not yet consistently sorted phylogenetically but are still affected by population dynamics (intra-species diversity close to inter-species divergence), distinguishing #3 from #4 can be crucial. Imagine our dataset includes a tip A0, representing the genetically least drifted tip of the A-group, with A1 and A2 as increasingly drifted tips. Let O be a distant outgroup.

      #1   #2   #3   #4
A0    A    A    A    Ag
A1    A    A    G    G
A2    A    Ag   Ag   aG
B1    Ag   aG   A    A
B2    A    A    A    A
O     A    G    G    A

In this case, using majority bases would, provided we have a lot of these patterns, easily trigger a wrong topology, since patterns #3 and #4 would prefer an O + A1 (+A2) | A0 + B1/B2 (+A2) split, and all #1 and #2 patterns would be split-ignorant. Keeping the ambiguity calls and using the 4x4 model may already save us from most, if not all, branching artefacts, as would the GENOTYPE model.
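The difference between the codings can be spelled out for the six-tip example with a small Python sketch. The `CALLS` table simply transcribes the patterns above; the function names are hypothetical. It prints each tip once with majority bases and once with IUPAC ambiguity codes, so one can see which distinctions each coding keeps or loses.

```python
# Calls from the six-tip example; the uppercase letter is the dominant base.
CALLS = {
    "A0": ["A", "A", "A", "Ag"],
    "A1": ["A", "A", "G", "G"],
    "A2": ["A", "Ag", "Ag", "aG"],
    "B1": ["Ag", "aG", "A", "A"],
    "B2": ["A", "A", "A", "A"],
    "O": ["A", "G", "G", "A"],
}

# IUPAC codes for unordered pairs of bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def majority_base(call):
    """Keep only the dominant (uppercase) base of a call."""
    return next(c for c in call if c.isupper())


def ambiguity_code(call):
    """Collapse a call to a plain base or a two-base IUPAC code."""
    bases = frozenset(call.upper())
    return next(iter(bases)) if len(bases) == 1 else IUPAC[bases]


def encode(calls, coder):
    """Apply one coding scheme to every site of every tip."""
    return {tip: "".join(coder(c) for c in sites) for tip, sites in calls.items()}


majority = encode(CALLS, majority_base)
iupac = encode(CALLS, ambiguity_code)
for tip in CALLS:
    print(f"{tip:3s} majority={majority[tip]}  iupac={iupac[tip]}")
```

With IUPAC codes, A2 reads R at patterns #2, #3 and #4 alike, so the information on which base dominates is gone; with majority bases, A2 reads A at #3 and G at #4, so the minority base is gone instead. Only a coding that keeps the base proportions preserves both.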

But with the speed of RAxML-NG, we don't need to make that choice in advance anymore. It is easy to just test it on one's own data. You run three analyses:
  1. The original data with base proportions and the GENOTYPE model
  2. The data with ambiguity codes and the standard 4x4 model
  3. All ambiguity codes replaced by the majority base.
Typically these days you work with datasets that have too many tips: not only 20k SNPs but also thousands of tips, which is often unnecessary for the question at hand and includes many tips that are near-identical, but people like large trees 🙃 If this is the case, take an informed or random subset of the total tip set that is quick to run for the test, e.g. all tips with a certain proportion of ambiguity-including SNPs, or one or two tips per species/known clade.
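One way to pick such an informed subset is sketched below in Python. This is only an illustration under stated assumptions: the function names, the `min_fraction` and `max_tips` defaults, and the choice to treat N, ? and - as missing data rather than as ambiguity are all hypothetical, not taken from any tool in this thread.

```python
# Two-base IUPAC codes count as ambiguous calls.
AMBIGUITY_CODES = set("RYSWKM")
# Assumption: N, ? and - are missing data and are ignored in the proportion.
MISSING = set("N?-")


def ambiguity_fraction(sequence):
    """Fraction of called sites in one tip that carry a two-base ambiguity code."""
    called = [b for b in sequence.upper() if b not in MISSING]
    if not called:
        return 0.0
    return sum(b in AMBIGUITY_CODES for b in called) / len(called)


def select_test_tips(alignment, min_fraction=0.05, max_tips=50):
    """Return up to max_tips tip names, the most ambiguity-rich first.

    alignment: dict mapping tip name -> sequence.
    Only tips with at least min_fraction ambiguous calls are kept.
    """
    scored = {name: ambiguity_fraction(seq) for name, seq in alignment.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    return [name for name in ranked if scored[name] >= min_fraction][:max_tips]


example = {"t1": "AAAA", "t2": "ARAA", "t3": "RRN-"}
print(select_test_tips(example))
```

In practice one would probably combine such a ranking with one or two representatives per species or known clade, so that the test subset still spans the tree.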

And then just compare the outcome. 
  • If 1 and 2 give the same tree and support the same splits, and 3 gives a different one, there is phylogenetic information in the intra-tip variation, which may be worth exploring further depending on the question at hand, e.g. if there are lack-of-resolution issues.
  • If they are all the same (i.e. if there are no highly supported conflicts), you can use this result to justify using just the majority base for all final optimisations, for the quickest possible and least noisy inference.
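The check for highly supported conflicts between two runs can be sketched in Python as follows. The sketch assumes each tree is a Newick string with the support value written as the internal node label and with unquoted tip names; the function names and the support threshold of 70 are hypothetical. Splits are compared as unrooted bipartitions, by always taking the side that does not contain a fixed reference tip.

```python
import re


def split_supports(newick):
    """Map each non-trivial split (side without the reference tip) to its support."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    stack, clades, all_tips, last_closed = [], {}, set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
            last_closed = None
        elif tok == ",":
            last_closed = None
        elif tok == ")":
            tips = stack.pop()
            if stack:
                stack[-1] |= tips
            last_closed = frozenset(tips)
        elif tok == ";":
            break
        else:
            label = tok.split(":")[0].strip()
            if last_closed is not None:
                # A label right after ")" is read as the node's support value.
                try:
                    clades[last_closed] = float(label)
                except ValueError:
                    pass
                last_closed = None
            elif label:
                stack[-1].add(label)
                all_tips.add(label)
    reference = min(all_tips)
    splits = {}
    for clade, support in clades.items():
        side = clade if reference not in clade else frozenset(all_tips - clade)
        if 1 < len(side) < len(all_tips) - 1:
            splits[side] = support
    return splits


def supported_conflicts(splits_a, splits_b, threshold=70.0):
    """Pairs of well-supported splits that cannot occur in the same tree."""
    conflicts = []
    for a, support_a in splits_a.items():
        for b, support_b in splits_b.items():
            if support_a < threshold or support_b < threshold:
                continue
            # Two sides lacking the same reference tip are compatible only
            # if they are disjoint or nested.
            if a & b and not (a <= b or b <= a):
                conflicts.append((sorted(a), sorted(b)))
    return conflicts


tree_4x4 = "((A1,A2)90,(B1,B2)85,O);"
tree_majority = "((A1,B1)80,(A2,B2)75,O);"
print(supported_conflicts(split_supports(tree_4x4), split_supports(tree_majority)))
```

An empty list for all three pairwise comparisons corresponds to the second case above; conflicts only between analysis 3 and the other two correspond to the first.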
/G

Sithions Brand

Jan 28, 2024, 7:34:42 AM
to raxml
Dear Dr. Grimm,

Thank you for your advice!

Cheers, 

Brand
