Re: [ontolog-forum] Digest for ontolog-forum@googlegroups.com - 4 updates in 1 topic


Damion Dooley

Aug 6, 2025, 4:08:05 PM
to ontolo...@googlegroups.com
About: Reasons for not using user-defined IRIs?

I think the term-URI issue is mainly a transitory technological issue.  I agree that for those who don't have nice tools - most of us (all of us?) - to hide the ids and replace them with the language of their choice in tables and syntactic queries, the alphanumeric codes are frustrating and enable certain kinds of mistakes to occur.  But tools are coming that isolate the language a particular user sees when interacting with ontology and knowledge graph content from the underlying concept codes.  The analogy I make is to a word processor that encodes letters in some gobbledygook Unicode or binary bits which we never see.  Similarly, the compiled code of a software program, or a SQL query and all its table indexes, bears little resemblance to its textual presentation because of the software sitting between us and the encoding. (P.s. https://github.com/ontodev/robot is an example of a tool for managing ontology term content with its tabular template command, where one can use plain labels to reference terms - which, with the reference ontologies you provide it, are used to write OWL-format statements with alphanumeric or whatever identifiers.)
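To make the ROBOT point concrete, here is a minimal sketch of the kind of tab-separated table the `template` command consumes (column conventions paraphrased from the ROBOT documentation; the `EX:` prefix and the data row are invented for illustration). Curators edit plain labels; ROBOT resolves them against the reference ontologies you supply:

```python
# Build a schematic ROBOT-style template table: first row is human-readable
# headers, second row holds ROBOT template strings (ID, LABEL, SC % for
# "subclass of"), remaining rows are curated terms referenced by label.
rows = [
    ["ID", "Label", "Parent"],                            # headers for humans
    ["ID", "LABEL", "SC %"],                              # ROBOT template strings
    ["EX:0000001", "cell division", "biological process"] # hypothetical term
]

tsv = "\n".join("\t".join(row) for row in rows)
print(tsv)
```

The curator never types an alphanumeric IRI for the parent class; the tool does the lookup.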

Some other points: 
* Language change: terms shift in and out of popularity.  I often go to the Google Ngram Viewer to see which label I should choose for a term, and to see how popularity waxes and wanes across the centuries.  It is convenient to know that the code for a concept is timeless and doesn't depend on changes in human language, or English in particular.  Perhaps English has been the "language of science" these past few centuries, but call the alphanumeric concept codes the timeless Esperanto of data science.

* The "cell division" vs. "cytokinesis" renaming example doesn't sound on point if a semantic difference in the concept has occurred, as revealed by the definitions of the respective terms. One would not rename a term in that case. One might deprecate a term no longer in use and add a replacement pointer to a new term.  OBO Foundry frowns on renaming terms whose definition has changed, for example.

* Term reuse and homologues: In OBO Foundry alone, which favours term reuse, there are still over a dozen ontologies claiming their own "patient" term. So as one struggles to establish a federated database world, one still has to look at the rest of the URL and the term definition to distinguish what is mappable or not, i.e. what sense is being brought to bear on the term. How long or detailed do term names have to be anyway in order to distinguish, in an ontology lookup service, which ones are appropriate for a given use?

On a related track, I and others realize from a sociotechnical perspective that, rather than assembling collections of ontologies each with their own curator(s) and ontology prefixes, it would have been better to have an ISBN- or DOI-style system that curation groups participate in, being issued ids to use for defining the terms in their space. Then, if an ontology's curation is taken over by another curator, team, or consortium, each of the ontology's ISBN- or DOI-style term identifiers is simply reassigned - relayed - to the new curator entity, without affecting its URI at all. No ontology prefix to fuss about.  Reassignment of individual terms to other, more domain-relevant curation groups can be done too.  One could cook up a scheme where the DOI was a combination of a unique mark plus a plain-language component, but as I say, the tools will come such that one doesn't carry the semantic load of looking at identifiers unless engaged in the task of disambiguating terms.

Cheers,

Damion


On Aug 6, 2025, at 8:05 AM, ontolo...@googlegroups.com wrote:

Michael DeBellis <mdebe...@gmail.com>: Aug 05 10:24AM -0700

Thanks, Alex, that is exactly what I was looking for! In general as I
review these I think the arguments are mostly an example of what Dawkins
calls The Tyranny of the Discontinuous Mind:
https://richarddawkins.com/articles/article/the-tyranny-of-the-discontinuous-mind
In this case the assumption that if you use intuitive IRIs you can't use
labels and vice versa.
 
Here are my replies to the Arguments for Alphanumeric Codes (OBO Foundry
Approach): https://claude.ai/share/4e077884-81dc-4866-81de-c0d6daedb5cc
 
Stability and Evolution: Alphanumeric codes provide stable identifiers that
> don't need to change when the understanding of a concept evolves or when
> terminology needs refinement. If you initially call something "cell
> division" but later scientific understanding shows it should be
> "cytokinesis," the human-readable URI would need updating, potentially
> breaking existing references.
 
 
The unstated assumption is that Alphanumeric Codes are "more stable
identifiers that don't need to change when the understanding of a concept
evolves or when terminology needs refinement". How are alphanumeric codes
any more stable than user-defined IRIs? What I've heard people argue is
that with an alphanumeric code you can just change the label and not change
the IRI. However, I would argue that changes where all you want is to
rename some entity, and nothing else, are fairly rare. In those rare cases
you could just as easily change the label, leave the IRI as it was, and
add a comment that in this example the IRI doesn't map to the label. I've
done that several times. I would much rather have 90% of my labels and IRIs
have names that can be directly mapped to their prefLabel and 10% don't and
even for that 10% where there is a mismatch, the IRI still gives you some
idea what it is.
 
The example given, "you initially call something 'cell division' but later
scientific understanding shows it should be 'cytokinesis,'" is not a strong
argument because schema evolution is seldom this simple. Even this
example isn't compelling because I would think that you still want to keep
"cell division" as an altLabel in this case, i.e., you can't just change
the label, even in many of the (already rare) paradigmatic examples. If
you are still doing design and the ontology hasn't been rolled out, it is
easy to change the IRI. But if you did do a roll out and you don't want to
change the IRI, you *can *simply change the label and add a comment
explaining that for this class the IRI doesn't map to the prefLabel.
 
More importantly, most of the time you aren't just changing the name but
you are doing more complex things like inserting a new class in between two
existing classes. Here's a real example: for an ontology I built recently I
used the Agent pattern. Agent is a superclass of Organization and Person.
However, I realized I needed an intermediate class between Agent and
Organization called Group (a Group is any collection of individuals with
one or more identifying trait but no formal structure, e.g. all Males in
the US is a Group, Climate_Social_Science_Network is an organization). So I
added a new class and changed the definitions so that Group is a subclass
of Agent and Organization is a subclass of Group. But that wasn't
everything. Definitions (domain and range) of various properties needed to
change as well. E.g., has_member's domain was changed from Organization to
Group. In my experience most schema evolution is like this, you typically
change more than just the name of something, you also restructure the
ontology and need to change domain, range, and other axiom definitions so
you will be making changes at the level of IRIs anyway. If you roll out a
new version of your Cell ontology that still has Cell_Division as the IRI
but uses a different prefLabel but you also don't change things like the
domain and range of properties, then your code will break anyway. In this
example, if I assert a Group: Physic_Teacher_At_MIT has_member Alan_Adams
and don't update the domain of has_member then the reasoner will
incorrectly infer that Physic_Teacher_At_MIT is an Organization rather than
a Group.
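The reasoning step above can be sketched in a few lines. This is a hypothetical toy model, not Michael's actual ontology: an OWL reasoner types the subject of a property as an instance of the property's declared domain, so a stale domain produces the unwanted inference he describes.

```python
# Toy model of OWL property-domain inference: the subject of a triple using
# a property is inferred to be an instance of that property's domain class.
def domain_inference(asserted_types, prop_domain):
    """Return the types a reasoner would conclude for the triple's subject."""
    return set(asserted_types) | {prop_domain}

# Stale schema: has_member still declared with domain Organization.
stale = domain_inference({"Group"}, prop_domain="Organization")

# Updated schema: has_member's domain changed to Group.
fixed = domain_inference({"Group"}, prop_domain="Group")

print("Organization" in stale)  # True: the incorrect inference
print("Organization" in fixed)  # False: restructure plus domain update
```

The point is that the restructure only behaves correctly once the property axioms change too, regardless of whether the IRIs are codes or names.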
 
Language Independence: Numeric codes avoid issues with natural language
> variations, translations, and cultural differences in terminology. This is
> crucial for international collaboration and multilingual applications.
 
 
How do codes "avoid issues with natural language"? You still have to define
language tags and you still need to define which language tag to use for
different users. Users won't see the IRI anyway. Choosing one language as
the language that a team standardizes on for names in no way ties you to
only using that language for your labels. Developers have been doing this
since the beginning of the digital computer. Following this logic, if you
want your Python system to support multiple languages then you would choose
names like:
 
variable_123 = conn.createURI("http://www.w3.org/2002/07/owl#NamedIndividual")
 
 
Rather than:
 
owl_named_individual = conn.createURI("http://www.w3.org/2002/07/owl#NamedIndividual")
 
Uniqueness Guarantees: The OBO Foundry uses unique IDSPACE codes that
> identify each project, ensuring no conflicts between ontologies. Combined
> with systematic numbering, this prevents identifier collisions.
 
 
I would argue this is actually a reason you SHOULD use user defined names.
E.g., if I'm creating a class in Protege and I'm using user defined names,
then I will know when I'm creating a class that already has a given IRI.
Again, this happened to me recently. I realized that I was trying to define
a class with IRI "Communication" but I already had a property with that
IRI. I needed IRIs such as Communication_Event and
Greenwashing_Communication_Event. If I were using OBO I might not find the
problem until much later downstream, and of course the later you find a
problem (developer time vs. compile time vs. run time) the more expensive
it is to fix. The fact that I can have two different IRIs with the same
label is actually an argument to NOT use codes. Also, there already is a
mechanism in OWL (and every modern programming language) for resolving name
conflicts: namespaces.
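The authoring-time check described above can be sketched simply. This is an illustrative sketch, not Protege's actual behavior, and the term registry below is invented:

```python
# Sketch: with human-readable local names, a clash between a new term and an
# existing one is detectable the moment the IRI is minted.
existing_terms = {
    "Communication": "ObjectProperty",  # hypothetical: already a property
}

def check_new_term(local_name, existing):
    """Refuse to mint an IRI whose local name is already in use."""
    if local_name in existing:
        raise ValueError(
            f"'{local_name}' is already used as a {existing[local_name]}")

try:
    check_new_term("Communication", existing_terms)   # clash caught now
except ValueError as err:
    clash = str(err)

check_new_term("Communication_Event", existing_terms)  # distinct name, no clash
print(clash)
```

With opaque codes, the same collision of meanings is invisible at this stage and surfaces only downstream.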
 
Technical Robustness: Alphanumeric codes avoid issues with special
> characters, spaces, encoding problems, and URL-unsafe characters that can
> occur with natural language terms.
 
 
Another strawman. I use User Defined names but I am always very rigorous in
only using the basic alphabet and no spaces, colons, slashes, etc. Even
though the IRI spec supports most special characters, I've found using such
characters causes no end of headaches, especially when moving the same file
across different tools such as Protege, Stardog, and AllegroGraph. Again,
the 90/10 rule: I would rather have 90% of my IRIs that can automatically
sync with the prefLabel and 10% that don't, rather than 100% that don't.
 
Separation of Concerns: The identifier serves purely as a stable
> reference, while human-readable labels are handled through annotation
> properties (like rdfs:label). This allows multiple labels, synonyms, and
> translations without affecting the core identifier.
 
 
Another strawman. It assumes (and I see this all the time, including, alas,
in Protege) that the choice is between user-defined IRIs and using labels.
There is no reason you can't use both, which is what I and most of the
developers I know do. I use user defined IRIs for developers and labels
for pretty names that are relevant to end users. What I do when I'm not
constrained by some standard, is to use English IRIs with underscores for
blanks and with only basic alphabetic characters and numbers. That way, I
can use a very simple SPARQL transformation to auto-generate most of the
initial labels.
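The transformation Michael describes doing in SPARQL amounts to a simple string rewrite; here is the same idea sketched in Python (the example IRI and namespace are invented for illustration):

```python
# Derive an initial rdfs:label from an IRI whose local name uses underscores
# for blanks and only basic alphanumeric characters.
def label_from_iri(iri):
    """Take the fragment (or last path segment) and swap underscores for spaces."""
    local_name = iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]
    return local_name.replace("_", " ")

label = label_from_iri("http://example.org/onto#Greenwashing_Communication_Event")
print(label)  # Greenwashing Communication Event
```

Because the convention is mechanical, 90% of labels can be bootstrapped this way and only the exceptions need hand-written labels.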
 
This is anecdotal of course, but in the last few years I've worked with
ontologies that use user defined names and codes (which is one reason I
wanted to revisit this) and I find codes add extra work for the developer,
are harder to debug, and I've never seen any example of schema evolution
where I think "this would be easier with codes as IRIs" or any other use
case where using codes would have simplified things.
 
Also, my final argument is that the onus of proof here should be on those
who argue for codes. Clearly there is a cost to using codes or UUIDs. From
my initial post, clearly having to write SPARQL like:
 
SELECT ?p ?r
WHERE {?p a codo:Patient ;
       codo:hasFamilyRelationship ?r .}
 
 
is better than:
 
SELECT ?p ?r
WHERE {?p a codo:OWLClass_f861e81c_661a_4243_a9be_cb9c780cb78a ;
       codo:OWLProperty_c744v9fv_594j_3640_a9be_dge5305fe45v ?r .}
 
 
And in my work over the last few years, I've never seen any use cases where
codes made things easier. Also, this isn't the only cost of using codes.
Codes mean your team has to depend on some central authority to give you a
range of codes that haven't been used yet. And codes (for me this is one of
the clearest and most costly drawbacks) make it a lot harder to debug your
software.
 
Michael
https://www.michaeldebellis.com/blog
 
On Tue, Aug 5, 2025 at 1:59 AM Alex Shkotin <alex.s...@gmail.com> wrote:
 

Alex Shkotin

Aug 7, 2025, 5:59:53 AM
to ontolo...@googlegroups.com

Damion,


I suggest thinking about why it was necessary to encode terms at all.

On the one hand, we rewrite our knowledge in some formal language, and it turns out that our terms cannot be used there as identifiers of formal objects.

And then it turns out that if we have to write a lot of formulas while operating the system - for example, ad hoc formal queries - the writers will demand abbreviations, or a hint tool to help with writing a query.

On the other hand, it is possible that in our subject area, different people initially use different words to refer to the same referent.


Well, technologically, we have several solutions for encoding terms:

- as close as possible to the terms themselves.

- abbreviations, when "cd" is used instead of "cell division"!

- a single center for issuing codes.

Perhaps there are some other algorithms.


And "Term reuse" is sometimes called polymorphism. Why do you mention "homologues"? biologically? according to Claude. 😀


Alex



On Wed, Aug 6, 2025 at 23:08, Damion Dooley <damion...@gmail.com> wrote:
--
All contributions to this forum are covered by an open-source license.
For information about the wiki, the license, and how to subscribe or
unsubscribe to the forum, see http://ontologforum.org/info
---
You received this message because you are subscribed to the Google Groups "ontolog-forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ontolog-foru...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/ontolog-forum/D54B4CF9-79C1-4B73-92B7-B931D8E437DE%40gmail.com.

John Antill

Aug 7, 2025, 7:59:57 AM
to ontolo...@googlegroups.com
So on that note, I find it interesting that to be fluent in a language you need to know 2,000 words. As we know, there are ways to simplify the words we use in sentences so that more people can understand them. This is why ontology is important.
Below are several major languages and how many words are in each. Looking just at English, we can see that knowing 2,000 words covers a little more than 1% of the language. This is one reason why, IMHO, we need to classify words into similar concepts and categories. Don't get me started on abbreviations.

English: Estimates range from 171,476 (current use) to over 200,000 words, with a vast number of specialized or obsolete terms.
Spanish: Around 88,000 words are listed in the "Diccionario de la Real Academia Española".
Russian: Estimates range from 150,000 to 200,000 words.
Chinese: Modern Chinese is estimated to have around 100,000 words, but dictionaries like the Hanyu Da Cidian dictionary contain over 370,000, including less common terms.
French: Estimates vary, with some sources citing around 100,000 words.
German: The Duden dictionary lists around 135,000 words.
Italian: Some estimates cite around 270,000 words.

John Antill
MS KM, MCKM, CKS IA & KT, KCS
MS AI Student at Purdue


Alex Shkotin

Aug 7, 2025, 12:35:57 PM
to ontolo...@googlegroups.com

John,


We are mainly talking about formal ontologies here, usually for various sciences and technologies, but sometimes for everyday life, i.e. knowledge known as common sense knowledge.


Scientific and technological jargon is huge. For example, how many names do we have for drugs or materials?


By the way, another algorithm for coding compound terms uses brackets: "cell division" → (cell)division, i.e. we apply "division" to "cell" - but this is HOL.


Alex



On Thu, Aug 7, 2025 at 14:59, John Antill <djant...@gmail.com> wrote:

John F Sowa

Aug 7, 2025, 2:45:47 PM
to ontolo...@googlegroups.com
Being able to use 2000 words of any language X is enough to communicate effectively with anybody who speaks X. 

For many words, you can guess a fair approximation from the context.  You can also ask questions.   If you're in a country that speaks X, you can learn a lot very quickly by watching TV, reading whatever papers or magazines you find, etc.  It also helps to have a dictionary (on paper in the olden days or on your cell phone today). 

The paragraphs above are for a traveler visiting a country that speaks X.  But a computer system can support similar services for people who are just learning how to work with the system.

But that also requires people who design systems to develop good methods for helping beginners and novices.

As for formal ontologies, those are useful for professionals who are designing computer systems.  They are not helpful for people who are trying to get around in a foreign country  -- or read a book or a magazine in some language X.

John Sowa
 


From: "John Antill" <djant...@gmail.com>

So on that note I find it interesting that to be fluent in a language you need to know 2,000 words. As we know there are ways to simplify the words which we use in sentences that allow more people to understand them. This is why Ontology is important.