Hi Charlie,
Thanks for sharing this important work!
I wanted to check whether there is a place here for additional evidence or evidence verification. Mappings between different taxonomies have some notorious issues, and much of NCBI Taxonomy originates from the pre-genome/sequencing era. So my question is whether sequence information (e.g. 16S), genome sequences, or perhaps other data (like isolation source or publications) could be included to make the synonym and other relation assignments more confident?
As it stands, I think the NCBI Taxonomy relationships are rather unclear with respect to the actual sequence data, and in the end that is what matters most: that the assigned synonyms/equivalences have 'identical' genetic information. Identical is in quotes because there is some fuzziness here, as well as different matching levels. For example:
- You can have the same 16S sequences but still have important differences in the genomes.
- Vice versa, you can have identical genomes and some small differences in 16S (even sequencing errors or strain/substrain differences).
- Just for fun, some taxa can have multiple different 16S copies.
- There is a newer class of genomes called MAGs (metagenome-assembled genomes). These come with no guarantees of completeness but are starting to enter the mix of genomic/taxonomic data out there ...
Based on the above, some relevant matching levels could be: identical 16S, identical genome, close 16S, close genome.
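To make those levels a bit more concrete, here is a rough Python sketch of how a candidate synonym pair could be tagged once similarity values are available. Everything in it is hypothetical: the `Evidence` structure, the `match_levels` function, and especially the two "close" cutoffs, which are placeholders for illustration and not values taken from NCBI Taxonomy, GTDB, or Greengenes. It assumes the identities were computed elsewhere and are expressed as fractions between 0 and 1. Because a taxon can carry several different 16S copies, the sketch uses the best identity across all copy pairs.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence

# Placeholder cutoffs for illustration only (assumptions, not recommendations).
CLOSE_16S = 0.987
CLOSE_GENOME = 0.95


@dataclass(frozen=True)
class Evidence:
    """Precomputed similarity evidence for one candidate taxon pair (hypothetical)."""
    # Identities between every pair of 16S copies from the two taxa (0..1).
    s16_identities: Sequence[float] = ()
    # Whole-genome identity (0..1), or None when a genome is missing for either taxon.
    genome_identity: Optional[float] = None


def match_levels(ev: Evidence,
                 close_16s: float = CLOSE_16S,
                 close_genome: float = CLOSE_GENOME) -> List[str]:
    """Return the matching levels supported by the evidence (possibly none)."""
    levels: List[str] = []

    if ev.s16_identities:
        # With multiple 16S copies, take the best-matching pair of copies.
        best = max(ev.s16_identities)
        if best == 1.0:
            levels.append("identical 16S")
        elif best >= close_16s:
            levels.append("close 16S")

    if ev.genome_identity is not None:
        if ev.genome_identity == 1.0:
            levels.append("identical genome")
        elif ev.genome_identity >= close_genome:
            levels.append("close genome")

    return levels


# Same 16S, but genomes that differ: the first bullet case above.
print(match_levels(Evidence(s16_identities=[1.0, 0.99], genome_identity=0.96)))
```

Keeping the 16S and genome levels separate, rather than collapsing them into one score, lets a mapping record exactly which kind of evidence supports it. An incomplete MAG would need extra care here, since a high identity over a partial genome says less than the same value over a complete one.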
GTDB is the main taxonomy built on sequence data (and it comes with NCBITaxon mappings), so ideally the two worlds will converge in the near future:
Separately, there is the Greengenes resource, which is the largest sequence-based taxonomy collection, albeit based only on 16S:
I'm happy to discuss more, and I also realize this may not quite be on your synonym path at the moment ...
best,
marcin