Standardization of Synonyms in NCBITaxon

Charles Tapley Hoyt (Charlie)

Oct 10, 2023, 4:07:23 AM

to obo-discuss

We are currently working towards standardizing synonym type definitions such as acronyms, layperson synonyms, misnomers, misspellings, and Latin terms across OBO Foundry ontologies. See more details and examples at https://github.com/OBOFoundry/OBOFoundry.github.io/issues/2450.

As part of this effort, we will replace some of the ad hoc synonym type definitions' IRIs in the NCBITaxon ontology dump with appropriate OMO terms. For example, this means that http://purl.obolibrary.org/obo/ncbitaxon#acronym will be replaced with http://purl.obolibrary.org/obo/OMO_0003000.
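To illustrate the kind of replacement described above, here is a minimal Python sketch. It is not the code used in the pull request. The mapping contains only the one pair named in this message, the function name is hypothetical, and the example line is an illustrative RDF/XML fragment rather than an excerpt from the actual NCBITaxon dump.

```python
# Only this one pair is stated in the announcement; any further pairs would
# have to come from the pull request itself, not from this sketch.
SYNONYM_TYPE_REPLACEMENTS = {
    "http://purl.obolibrary.org/obo/ncbitaxon#acronym":
        "http://purl.obolibrary.org/obo/OMO_0003000",
}


def replace_synonym_type_iris(text, replacements=SYNONYM_TYPE_REPLACEMENTS):
    """Return text with each ad hoc synonym type IRI swapped for its OMO IRI.

    Hypothetical helper: plain substring replacement, longest IRI first so
    that an IRI which is a prefix of another one is handled last.
    """
    for old in sorted(replacements, key=len, reverse=True):
        text = text.replace(old, replacements[old])
    return text


# Illustrative fragment, not copied from the real ontology file.
example = (
    '<oboInOwl:hasSynonymType '
    'rdf:resource="http://purl.obolibrary.org/obo/ncbitaxon#acronym"/>'
)
print(replace_synonym_type_iris(example))
```

Plain string replacement is the simplest option for a sketch; a real pipeline would more likely rewrite the IRIs with an ontology-aware tool so that partial matches cannot occur.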

If you have any concerns or comments about this, please join the discussion at https://github.com/obophenotype/ncbitaxon/pull/88.

Martin Krallinger

Oct 10, 2023, 7:52:40 AM

to cth...@gmail.com, obo-discuss

Dear Charles,

This sounds very interesting and relevant. I was wondering how layperson synonyms are dealt with in terms of language. We have been working on a Spanish translation of the NCBI Taxonomy.

I was wondering whether you or someone else is aware of plans to generate multilingual versions of the synonyms and layperson synonyms (for the NCBI Taxonomy or other OBO ontologies)?

Regards,

Martin

--
You received this message because you are subscribed to the Google Groups "obo-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to obo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/obo-discuss/4577dd98-2ae3-4505-a54a-61d1fb20e815n%40googlegroups.com.

--

=======================================
Martin Krallinger, Dr.
Head of NLP for Biomedical Information Analysis Unit
Barcelona Supercomputing Center (BSC-CNS)

https://www.linkedin.com/in/martin-krallinger-85495920/
=======================================

Charles Tapley Hoyt (Charlie)

Oct 10, 2023, 8:11:24 AM

to Martin Krallinger, obo-discuss

Within the ontology world, I think the best way to capture the language of a synonym is by tagging the string that appears as the object with the appropriate language. In this way, the language is orthogonal to the predicate (e.g., oboInOwl:hasExactSynonym) and the synonym type (e.g., layperson synonym).
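As a rough sketch of what "orthogonal" means here, the Python below models a synonym with three independent fields: the predicate, the synonym type, and the language tag on the literal. The class, the `ex:layperson_synonym` placeholder CURIE, and the Spanish string are all assumptions made up for illustration; none of them comes from an actual ontology file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Synonym:
    """Hypothetical model: language, predicate, and type vary independently."""

    text: str
    predicate: str                      # e.g. "oboInOwl:hasExactSynonym"
    language: Optional[str] = None      # language tag such as "en" or "es"
    synonym_type: Optional[str] = None  # CURIE of the synonym type, if any

    def literal(self) -> str:
        """Render the string as a Turtle-style, optionally language-tagged literal."""
        escaped = self.text.replace("\\", "\\\\").replace('"', '\\"')
        tag = f"@{self.language}" if self.language else ""
        return f'"{escaped}"{tag}'

    def triple(self, subject: str) -> str:
        """Render a Turtle-style triple; the synonym type is kept separately."""
        return f"{subject} {self.predicate} {self.literal()} ."


# Placeholder CURIE and illustrative strings, not taken from NCBITaxon.
LAYPERSON = "ex:layperson_synonym"
en = Synonym("human", "oboInOwl:hasExactSynonym", "en", LAYPERSON)
es = Synonym("humano", "oboInOwl:hasExactSynonym", "es", LAYPERSON)

print(en.triple("NCBITaxon:9606"))
print(es.triple("NCBITaxon:9606"))
```

The two synonyms share the same predicate and synonym type and differ only in the tag on the literal, which is the point of keeping language out of the other two dimensions.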

There are also some subtleties to this discussion. For example, if a non-English word is used as part of English discourse, it makes sense to tag it as an English phrase. This was the consensus in the following discussion about annotating “Latin term” as a synonym type instead of tagging the strings with the Latin language: https://github.com/information-artifact-ontology/ontology-metadata/issues/146. A less abstract notion is that we typically consider borrowed words like restaurant or kindergarten as English words instead of French or German ones, respectively, when used in English discourse.

As far as I understand, the OBO Foundry remains pretty monolingual with English, but there are several examples and places to jump into discussions about language support and how to best handle internationalization:

- Discussions about recommending specifying a language tag (https://github.com/OBOFoundry/OBOFoundry.github.io/issues/479)

- Policy and infrastructure for better language support (https://github.com/OBOFoundry/OBOFoundry.github.io/issues/437)

- Discussion about language requirement for new ontologies (https://github.com/OBOFoundry/OBOFoundry.github.io/issues/325)

- Check out what HPO is doing - their internationalization effort is notable within the OBOverse

Tiago Lubiana

Oct 10, 2023, 9:55:22 AM

to cth...@gmail.com, Martin Krallinger, obo-discuss

Adding my 2 cents on the discussion:

Coming from Wikidata, which is inherently multilingual, I found it very weird that terms in OBO (or at least in the Cell Ontology) almost never have a language tag. Moreover, tagging them with an en tag is not simple, as most tooling just assumes a monolingual structure in English.

Other than the technical challenges, there are operational/social barriers to having multiple multilingual labels for the same concept in individual ontologies. For simplicity, OBO Foundry ontologies are in general solely in English, and this is implicitly assumed (at least at the triple level).

It is a different world, but Wikidata has two different datatypes, "monolingual text" and "multilingual text" (see https://www.mediawiki.org/wiki/Wikibase/DataModel#Datatypes_and_their_Values). In its labels, aliases, and relations with the "multilingual text" type, tags are used, similar to how Charlie just described. Other than "rdfs:label" and "skos:altLabel", there are a few other Wikidata-specific relations dedicated to particular synonym-like assertions. The two most relevant here might be:

- https://www.wikidata.org/wiki/Property:P1843 (taxon common name), the most relevant for the NCBI example. It also receives language tags and takes references.

- https://www.wikidata.org/wiki/Property:P1813 (short name) - Used widely across Wikidata for all kinds of entities for shortened versions

These statements are queryable via the Wikidata Query Service. Via this query (https://w.wiki/7ixg) one can get the common names in Spanish for NCBI Taxon terms as far as Wikidata is complete (result attached).
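For readers who want to adapt this to another language, here is a hedged Python sketch that builds a similar SPARQL query. It is a reconstruction, not the contents of the short link above. It assumes that P685 is the Wikidata property for the NCBI taxonomy ID and uses P1843 for the common name; the function names are hypothetical. The sketch only builds the query and request URL and does not send anything.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def common_name_query(language, limit=100):
    """Build a SPARQL query for taxon common names in one language.

    Assumption: wdt:P685 holds the NCBI taxonomy ID, wdt:P1843 the
    language-tagged common name.
    """
    # Guard against injecting arbitrary text into the query string.
    if not language.replace("-", "").isalnum():
        raise ValueError(f"unexpected language tag: {language!r}")
    return f"""
SELECT ?taxon ?ncbiId ?commonName WHERE {{
  ?taxon wdt:P685 ?ncbiId ;
         wdt:P1843 ?commonName .
  FILTER(LANG(?commonName) = "{language}")
}}
LIMIT {int(limit)}
""".strip()


def request_url(language, limit=100):
    """Return a GET URL for the query, asking for JSON results."""
    params = {"query": common_name_query(language, limit), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


print(common_name_query("es", limit=10))
```

Filtering on `LANG(?commonName)` is what makes the language tag do the work, so switching from Spanish to another language only changes the argument.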

Best,

Tiago

ncbi_spanish.tsv

Marcin Joachimiak

Oct 10, 2023, 3:23:34 PM

to cth...@gmail.com, obo-discuss

Hi Charlie,

Thanks for sharing this important work!

I wanted to check whether there was any place here for including additional evidence or evidence verification. Mappings between different taxonomies have some notorious issues, and much of NCBI Taxonomy originates from the pre-genome/sequencing era. So my question is whether there is a place here to include sequence information (e.g., 16S), genome sequences, or perhaps other data (like isolation source or publications) to help make the synonym and other relation assignments more confident?

As it stands, I think the NCBI Taxonomy relationships are rather unclear with respect to the actual sequence data -- and in the end that is what matters most, that the assigned synonyms/equivalences have 'identical' genetic information. Identical is in quotes because there is a bit of fuzziness here as well as different matching levels. For example:

- You can have the same 16S sequences but still have important differences in the genomes.

- Vice versa, you can have identical genomes and some small differences in 16S (even sequencing errors or strain/substrain differences).

- Just for fun, some taxa can have multiple different 16S copies.

- There is a newer class of genomes out there called MAGs (metagenome-assembled genomes). These have no guarantees of completeness but are starting to enter the mix of genomic/taxonomic data out there ...

Based on the above, some relevant matching levels could be: identical 16S, identical genome, close 16S, close genome.
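A minimal Python sketch of how these four levels might be assigned from pairwise identities. Everything here is an assumption for illustration: the function name is hypothetical, identities are fractions between 0 and 1, and the "close" thresholds are placeholder defaults, not values proposed in this thread. The 16S and genome evidence are reported separately because, as noted above, one can match while the other does not.

```python
def matching_levels(identity_16s=None, identity_genome=None,
                    close_16s=0.97, close_genome=0.95):
    """Return the matching levels supported by the available evidence.

    identity_16s / identity_genome: fractional identity (0.0 to 1.0),
    or None when that kind of sequence data is not available.
    close_16s / close_genome: placeholder thresholds (assumptions).
    """
    levels = []
    evidence = (
        ("16S", identity_16s, close_16s),
        ("genome", identity_genome, close_genome),
    )
    for name, identity, threshold in evidence:
        if identity is None:
            continue  # no data of this kind, so no claim either way
        if identity == 1.0:
            levels.append(f"identical {name}")
        elif identity >= threshold:
            levels.append(f"close {name}")
    return levels


# Same 16S but only a close genome, as in the first bullet above.
print(matching_levels(identity_16s=1.0, identity_genome=0.96))
```

Returning a list rather than a single label keeps the two kinds of evidence independent, so a pair can be "identical 16S" and "close genome" at the same time.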

GTDB is the main taxonomy out there built based on sequence (and comes with NCBITaxon mappings), so ideally there will be a convergence of worlds in the near future:

https://gtdb.ecogenomic.org

Separately there is the Greengenes resource which is the largest sequence-based taxonomy collection, albeit only based on 16S:

https://greengenes.secondgenome.com

I'm happy to discuss more and also realize this may not quite be on your synonym path at the moment ...

best,

marcin
