Algorithms of deciding ambiguous sites?


Ding Yanqian

Jun 13, 2023, 10:50:11 AM
to GetOrganelle
Hi, 

It seems GetOrganelle doesn't output any ambiguous sites. What is the algorithm behind this? Is each base decided by the proportion of coverage supporting a specific nucleotide? If so, what is the threshold? When combining results with other software, e.g., NOVOPlasty, if there is a mismatch (for example, NOVOPlasty gives W while GetOrganelle gives A), how should we decide? Thanks very much.

Best,
Ding

Ding Yanqian

Jul 7, 2023, 5:08:26 AM
to GetOrganelle
Forwarded from the author's (Jianjun's) reply:

GetOrganelle has the options `--degenerate-depth` and `--degenerate-similarity` to partially control this. They set the maximum depth difference (default 1.5) and the minimum similarity (default 0.98) between parallel contigs (a bubble in the assembly graph). This means that under certain circumstances GetOrganelle will generate degenerate bases (see a recent issue at https://github.com/Kinggerm/GetOrganelle/issues/279, where GetOrganelle did produce ambiguous bases and reported them in the log, but `summary*py` failed to record them). However, other factors also influence whether bases are degenerated, including but not limited to the SPAdes-hammer correction process and the requirement that the two alleles have identical lengths, which is often violated in ITS, where indels occur.
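
To illustrate why equal allele lengths matter, here is a minimal Python sketch of collapsing two parallel sequences into IUPAC ambiguity codes position by position. It is not GetOrganelle's code; the function name `merge_alleles` and the table name `IUPAC` are hypothetical and used only for illustration.

```python
# Illustrative sketch only; NOT GetOrganelle's implementation.
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def merge_alleles(seq_a: str, seq_b: str) -> str:
    """Collapse two equal-length alleles into one degenerate sequence."""
    if len(seq_a) != len(seq_b):
        # With an indel there is no one-to-one pairing of positions,
        # so a simple per-site merge is not possible.
        raise ValueError("alleles differ in length; cannot merge per site")
    return "".join(
        IUPAC[frozenset((a, b))]
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


print(merge_alleles("ACGT", "TCGT"))
```

A site where one allele has A and the other has T becomes W, which is the kind of difference described in the question above.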
At the other end, GetOrganelle treats alternative contigs as contamination and removes them. The options are `--contamination-depth`, which sets a minimum depth difference (default 3.0), and `--contamination-similarity`, which sets a minimum similarity (default 0.9).
In the intermediate cases, where the sequence similarity is higher than 0.98 but the depth difference is between 1.5 and 3.0, GetOrganelle will warn about the polymorphism but take no action.
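
The three regimes above can be summarized in a rough Python sketch. This is only a reading of the description in this reply, not GetOrganelle's actual code: the function name `classify_parallel_contigs`, the returned labels, the treatment of "depth difference" as a higher-to-lower depth ratio, and the inclusive boundaries are all assumptions.

```python
# Illustrative sketch only; NOT GetOrganelle's implementation.
# Default values mirror the option defaults quoted in this thread.
def classify_parallel_contigs(
    depth_ratio: float,                     # assumed: higher depth / lower depth
    similarity: float,                      # sequence similarity in [0, 1]
    degenerate_depth: float = 1.5,          # --degenerate-depth
    degenerate_similarity: float = 0.98,    # --degenerate-similarity
    contamination_depth: float = 3.0,       # --contamination-depth
    contamination_similarity: float = 0.9,  # --contamination-similarity
) -> str:
    # Similar depth and very similar sequence: collapse into degenerate bases.
    if similarity >= degenerate_similarity and depth_ratio <= degenerate_depth:
        return "degenerate"
    # Very different depth and sufficiently similar: treat the alternative
    # contig as contamination and remove it.
    if similarity >= contamination_similarity and depth_ratio >= contamination_depth:
        return "remove_as_contamination"
    # Very similar sequence but depth in between: warn and leave untouched.
    if similarity >= degenerate_similarity:
        return "warn_only"
    # Combinations the reply does not describe (hypothetical label).
    return "unresolved"


for ratio, sim in [(1.2, 0.99), (5.0, 0.95), (2.0, 0.99)]:
    print(ratio, sim, classify_parallel_contigs(ratio, sim))
```

Laying the thresholds out this way makes the intermediate zone explicit: a bubble with similarity above 0.98 and a depth ratio between 1.5 and 3.0 falls into neither action, which is why only a warning appears.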

In general, deciding on degenerate bases during assembly can be complex. For example, horizontally transferred mt-pt is not true plastome heteroplasmy but is often misassembled into the plastome, yielding false heteroplasmy, a common mistake many tools make. In such cases a depth threshold can also give the wrong answer; further work (such as gene annotation) is needed to help decide. In your ITS case, I understand the difficulty, because I've browsed hundreds, if not thousands, of nrRNA assembly graphs, which is also why we did not include it in our manuscript. While some datasets contain purely one majority allele, many contain multiple alleles that create a capillary-vessel-like structure in the graph, making degeneration impossible or wrong.

So, even though you can use `--degenerate-depth` and `--degenerate-similarity` to partially control this, I would recommend manually checking every assembly graph (the *.fastg + *.csv, NOT the *.gfa produced by GetOrganelle <= v1.7.7; we have an unreleased version that changes the output structure) to see whether it contains ITS-associated polymorphism (ETS always can; ignore it) and whether the result is reasonable. Checking is fast and helps you understand what is happening.

Best,
Jianjun