Dear Sondes,
Thank you very much for your thoughtful and detailed observations. We truly appreciate the time and effort you have invested in understanding the dataset and the evaluation setup, and your feedback is valuable to us.
Below, we address your questions and observations.
A. Evaluation Clarifications
1. Evaluation granularity: Each sub-task (that is, term typing, taxonomy discovery, and non-taxonomic relation extraction) will be evaluated independently, consistent with previous editions of the challenge. There is no single end-to-end pipeline score; instead, separate precision, recall, and F1 scores will be reported per sub-task.
2. Matching criterion: Evaluation is based on exact matching after normalization (case folding and whitespace trimming). For relation extraction, all components of a triple (head, relation, tail) must match simultaneously for the triple to count as correct. There is currently no approximate or fuzzy matching; however, we are considering adding fuzzy metrics alongside these strict ones, so feel free to use or recommend such an approach. We will take suggestions into account, and you are welcome to submit a PR with new metrics. The current evaluation metrics in OntoLearner are here:
https://github.com/sciknoworg/OntoLearner/blob/main/ontolearner/evaluation/metrics.py
We will also consider adding semantic matching to the metrics, so you will see both strict (exact) and semantic (fuzzy) metrics.
3. Low term coverage: Thank you for this observation. This was intentional. In most ontologies, terms are often also treated as classes (types, in our context), so distinguishing between the two is challenging in some cases. We therefore preferred to keep this realistic availability by providing low term coverage rather than balancing the distribution across all three tasks of the LLMs4OL paradigm. As a result, some documents are designed to contain only hierarchical or relational knowledge, without explicit instance terms. In such cases, it is acceptable (and expected) that systems output empty term lists. Terms should be extracted only when they are explicitly supported by the document content. A small hint: you can also combine the task B training set with the task A training set to increase your dataset size.
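To make the strict matching criterion in item 2 concrete, here is a minimal sketch of exact-match precision, recall, and F1 over relation triples. It is only an illustration under stated assumptions, not the OntoLearner implementation: the function names and the example triples are hypothetical, and normalization is limited to the case folding and whitespace trimming mentioned above.

```python
def normalize(text: str) -> str:
    """Case-fold and trim surrounding whitespace (the normalization named in item 2)."""
    return text.strip().casefold()


def normalize_triple(triple):
    """Normalize every component of a (head, relation, tail) triple."""
    head, relation, tail = triple
    return (normalize(head), normalize(relation), normalize(tail))


def exact_match_scores(predicted, gold):
    """Return (precision, recall, F1); a triple counts only if all three parts match."""
    pred_set = {normalize_triple(t) for t in predicted}
    gold_set = {normalize_triple(t) for t in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical example data, not taken from the challenge dataset.
gold = [("Sea", "partOf", "Ocean"), ("River", "flowsInto", "Sea")]
predicted = [(" sea ", "PartOf", "ocean"), ("River", "flowsInto", "Lake")]
print(exact_match_scores(predicted, gold))  # (0.5, 0.5, 0.5)
```

The first predicted triple matches after normalization; the second differs only in its tail, so it counts as fully wrong under the strict criterion.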
You are correct that several of the issues you identified stem from annotation artifacts, legacy formatting from source ontologies, and inconsistencies introduced during dataset construction.
At this stage of the challenge, we will not modify the training data, as this would affect reproducibility and fairness across participants. However, we provide the following guidance to clarify the intended interpretation.
Issue 1: Inconsistent term encoding within a single document. The evaluation will remove '_' and any punctuation before scoring. In addition, we will introduce semantic similarity-based precision, recall, and F1 scores alongside the strict evaluation described in section A of this email and on the website. I think this can handle the variations that you mentioned.
Issue 2: Spurious spaces in code-like terms. Very nice observation. IDs such as q715269 should not appear in the gold annotations or the context, so this is mostly an annotation issue. However, we assure you that the test set will not have such issues; you will receive clean texts. For the training set, feel free to ignore them or clean them up. For GeoNames-style codes only, I would suggest preserving h.sea (or similar) as they appear in both gold and text.
Issue 3: Underscore encoding vs. natural text form. The gold should match what appears in the documents, so if you observe '_' in the head or tail (not the predicate), you are free to remove it. Even if you do not remove it from the gold, the graph similarity metric (or the semantic variant of precision, recall, and F1 mentioned under Issue 1 as an extra metric) will count the prediction as correct, with or without '_'.
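As a rough illustration of the underscore and punctuation handling described under Issues 1 and 3, a participant-side normalizer might look like the sketch below. Several details are assumptions rather than the official evaluation code: the function name is hypothetical, underscores are replaced with spaces (rather than deleted), "punctuation" is taken to mean Python's string.punctuation, and case folding plus whitespace collapsing is applied as well.

```python
import string

# Translation table that deletes ASCII punctuation; the underscore is handled
# separately so that it becomes a space instead of disappearing.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("_", ""))


def normalize_label(label: str) -> str:
    """Map underscore-encoded and natural-text forms of a label to one string."""
    text = label.replace("_", " ")        # assumption: '_' acts as a word separator
    text = text.translate(_PUNCT_TABLE)   # drop remaining punctuation
    return " ".join(text.casefold().split())  # case-fold and collapse whitespace


# Both encodings of the same (hypothetical) label normalize identically.
print(normalize_label("Business_Function"))   # business function
print(normalize_label(" business function ")) # business function
```

Note that this sketch would also strip the dot from GeoNames-style codes such as h.sea; if you want to preserve those as suggested under Issue 2, you would need to exempt them before applying it.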
Issues 4, 5, 6, 7. You are right in all of these cases. I think the introduction of semantic similarity-based metrics (alongside the existing graph similarity) will handle both forms: whether you output "FORTRAN90" or "fortran 90", both will be treated as correct, of course with a minor error of 1 or 2 percent in the overall evaluation metrics, which I think is avoidable. Regarding Issue 6, this is usually resolvable from the context, as the texts discuss a specific domain; for example, when we mention "Buy", it is in the financial domain, so it relates to the business function. This technique is mostly used in domain-specific ontologies when the labels are created. Moreover, regarding the spelling errors, I would say that the entity comes from an ontology, and it seems that the ontology itself contains the spelling error. Interesting observation.
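To give a feel for how a non-strict metric can accept both "FORTRAN90" and "fortran 90", here is a small sketch using character-level similarity from Python's standard difflib module. This is only a stand-in: the planned semantic similarity metric is not specified in this email and may work quite differently (for example, with embeddings). The function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after case folding and trimming."""
    return SequenceMatcher(None, a.strip().casefold(), b.strip().casefold()).ratio()


def is_fuzzy_match(predicted: str, gold: str, threshold: float = 0.9) -> bool:
    """Treat two labels as matching when their similarity reaches the threshold."""
    return similarity(predicted, gold) >= threshold


# The two surface forms differ only by case and one space.
print(is_fuzzy_match("FORTRAN90", "fortran 90"))  # True
# Unrelated labels share no characters and do not match.
print(is_fuzzy_match("Buy", "Sell"))              # False
```

A character-level measure like this handles spacing and casing variants, but it cannot resolve context-dependent labels such as "Buy"; that still requires domain knowledge or a genuinely semantic model.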
Issue 8: Gold types absent from document text. If the gold type can be inferred from the terms, it is valid; if it cannot be derived from the terms, it could be invalid. Classes are usually not meant to be explicitly mentioned in the text. There are two paradigms: you can either take the type exactly as it appears in the text, or infer it under an expert-driven naming (one of the challenging parts). In real-world ontology construction, a collection of texts is usually used to create terms, which are then categorized into higher levels; those levels are extended to form a taxonomy, and axioms are added afterwards. As you can see, the names in the final ontology might not even appear in the text; however, you can link them (entity linking), and this is one way of building ontologies. Another way is to use the exact naming as it appears in the text. So, if a model is fine-tuned on the training samples, a perfect training set would need to account for this real-world scenario as well. However, you are free to ignore one of these scenarios in your training, and hopefully such an issue will not appear in testing. But, as you mentioned, there would indeed be a ceiling in such cases.
Some of the cases you observed are intentional and reflect realistic ontology population scenarios, while others are annotation artifacts that will not appear in the test set. Preprocessing and normalization are encouraged. Systems that can map surface text to canonical ontology labels (rather than relying solely on literal string matching) will be better aligned with the evaluation design.
Thank you again for your careful analysis and constructive feedback.
Best Regards,
Hamed
On behalf of the LLMs4OL Challenge Organizing Committee