Questions and Data Quality Issues — LLMs4OL 2026 Flagship Task

Sondes Bannour Souihi

May 18, 2026, 12:15:10 PM

to LLMs4OL Challenge

Dear LLMs4OL 2026 Organizing Committee,

We hope you are doing well. We are the LIST-CortAIx team participating in the Flagship Task, and we would like to raise several questions regarding the evaluation setting and the quality of the training data.

A. Evaluation clarifications

Evaluation metrics: Will each sub-task (term extraction, type extraction, term typing, taxonomy discovery, non-taxonomic relations) be evaluated independently as in previous editions, or will the final score be a single score over the full pipeline? Also, could you clarify how true positives are determined — exact matching or approximate matching?

Low term coverage: Only approximately 9.7% of the 4,303 training samples (419 documents) contain gold terms (subjects of instance-of triples). The remaining 90.3% have only is-a and non-taxonomic triples, with no explicit terms. Is this distribution intentional? Are models expected to output empty term lists for documents that only describe type hierarchies, or are terms implicitly present but unlabeled?
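For anyone wanting to reproduce the coverage figure, here is a minimal sketch. It assumes each training sample is a dict with a "terms" list (the key shown in the Issue 1 excerpt); the function name term_coverage and the synthetic samples are ours, not part of the challenge tooling.

```python
def term_coverage(samples):
    """Return (count, fraction) of samples whose "terms" list is non-empty."""
    with_terms = sum(1 for s in samples if s.get("terms"))
    fraction = with_terms / len(samples) if samples else 0.0
    return with_terms, fraction


if __name__ == "__main__":
    # Synthetic stand-in with the proportions quoted above: 419 of 4,303.
    samples = [{"terms": ["x"]}] * 419 + [{"terms": []}] * 3884
    count, fraction = term_coverage(samples)
    print(count, f"{fraction:.1%}")  # 419 9.7%
```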

B. Training data quality issues

While developing our pipeline, we identified several annotation inconsistencies in train_task_a.json that significantly impact evaluation. We believe these are artifacts rather than intentional design choices, and we share them hoping they can be addressed before the evaluation phase.

Issue 1 — Inconsistent term encoding within a single document

The same type of entity is annotated with two different formats within a single document:

{"terms": ["mm topic 3120", "mm topic 3250", ..., "mm_topic_3150", "mm_topic_3466", ...]}
Some MM_TOPIC_* identifiers use spaces, others use underscores, with no apparent rule.

Issue 2 — Spurious spaces in code-like terms

70+ terms have spaces inserted inside identifiers that appear without spaces in the source text:

GeoNames-style codes: Text h.sea, h.cnlq, u.shfu → Gold h. sea, h. cn lq, u s h f u
Alphanumeric IDs / Wikidata QIDs: Text q715269 → Gold q 715269
These create unavoidable false negatives for models that extract terms as they appear in text.
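A rough way to flag such cases, as a heuristic sketch only: a gold term is suspicious if it contains spaces, does not occur verbatim in the text, but matches a single whitespace-delimited token of the text once both sides are reduced to ASCII letters and digits. The helper names squash and has_spurious_spaces are hypothetical.

```python
import re


def squash(s: str) -> str:
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^0-9a-z]", "", s.lower())


def has_spurious_spaces(gold: str, text: str) -> bool:
    """Heuristic: gold has spaces, is absent verbatim, but equals one squashed text token."""
    if " " not in gold or gold.lower() in text.lower():
        return False
    tokens = {squash(tok) for tok in text.split()}
    tokens.discard("")
    return squash(gold) in tokens


if __name__ == "__main__":
    text = "The codes h.sea, h.cnlq and u.shfu refer to q715269."
    for gold in ["h. sea", "h. cn lq", "u s h f u", "q 715269", "h.sea"]:
        print(repr(gold), has_spurious_spaces(gold, text))
```

The check only covers identifiers that are a single token in the source, which is the pattern described in this issue; multi-word terms need a different test.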

Issue 3 — Underscore encoding vs. natural text form

Many multi-word terms use underscores in the gold (frustum_of_cone, molar_heat_capacity_at_constant_pressure) while the document text uses spaces. It is unclear whether models should extract in ontological identifier form or natural language form.

Issue 4 — Arbitrary splitting of compound identifiers

Identifiers that appear as single tokens in text are split inconsistently in the gold:

Text FORTRAN90 → Gold fortran 90
There is no consistent transformation rule, making the effective recall ceiling significantly below 100% even for a perfect extractor.
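The ceiling can be estimated with a sketch like the following, which counts how many gold terms occur verbatim (case-insensitively) in their document. The "text" key is an assumption about the file layout; "terms" is the key shown in Issue 1. The function name is ours.

```python
def verbatim_recall_ceiling(samples):
    """Fraction of gold terms found as a case-insensitive substring of their document."""
    found = total = 0
    for sample in samples:
        text = sample["text"].lower()
        for term in sample["terms"]:
            total += 1
            found += term.lower() in text
    return found / total if total else 1.0


if __name__ == "__main__":
    samples = [
        {"text": "FORTRAN90 and polyphenylenes", "terms": ["fortran 90", "polyphenylenes"]},
    ]
    print(verbatim_recall_ceiling(samples))  # 0.5
```

Any extractor that copies spans from the text unchanged cannot exceed this value under exact matching.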

Issue 5 — Inconsistent singular/plural forms

The gold uses both plural-as-in-text (polyphenylenes) and canonical singular (programme committee member when text says "members") with no consistent rule.

Issue 6 — Parenthetical qualifiers absent from text

168 terms (3.9%) include disambiguation qualifiers in parentheses that do not appear in the source text:

buy (business function, deprecated) — text says "Buy"
bar (food) — text says "bar"
atmosphere (standard) — text says "standard atmosphere"
Should models infer and output these qualifiers? If so, how?
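To count or strip these qualifiers during analysis, a small sketch such as the one below may help. It only handles a single trailing parenthetical without nested parentheses, and split_qualifier is a hypothetical helper name.

```python
import re

# Matches one trailing "(...)" group with no nested parentheses.
_QUALIFIER = re.compile(r"\s*\([^()]*\)\s*$")


def split_qualifier(label: str):
    """Return (base_label, qualifier or None) for a label with a trailing parenthetical."""
    match = _QUALIFIER.search(label)
    if match is None:
        return label.strip(), None
    base = label[: match.start()].strip()
    qualifier = match.group(0).strip()[1:-1]
    return base, qualifier


if __name__ == "__main__":
    for label in ["buy (business function, deprecated)", "bar (food)", "frustum of cone"]:
        print(split_qualifier(label))
```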

Issue 7 — Spelling errors in gold labels

Some gold labels contain apparent spelling errors:

material area density mesurement protocol instead of measurement — document: Soil Phosphorus Speciation and Density Measurement Protocols

Issue 8 — Gold types absent from document text

Some gold types do not appear to be explicitly mentioned in the corresponding document:

spermatogenic failure 21, spermatogenic failure 26 — document: Hypertension and Male Infertility: A Genetic Perspective
Should models infer such types from domain knowledge, or strictly extract what is mentioned in the text?

Issue 9 — Near-duplicate formatting variants

Labels with and without hyphens appear for the same concept:

Charcot-Marie-Tooth vs Charcot Marie Tooth — document: Classification and Genetic Variants of Charcot-Marie-Tooth Disease
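Such variants can be surfaced by grouping labels under a shared key. The sketch below treats hyphens and underscores as spaces and ignores case; this key is our own choice for illustration, not a rule taken from the dataset documentation.

```python
import re
from collections import defaultdict


def variant_groups(labels):
    """Group labels that differ only by case, hyphens, underscores, or spacing."""
    groups = defaultdict(set)
    for label in labels:
        key = " ".join(re.sub(r"[-_]", " ", label.lower()).split())
        groups[key].add(label)
    # Keep only keys that have more than one surface form.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


if __name__ == "__main__":
    labels = ["Charcot-Marie-Tooth", "Charcot Marie Tooth", "mm topic 3120", "mm_topic_3150"]
    print(variant_groups(labels))
```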

These issues collectively make it difficult to write a consistent extraction prompt and depress evaluation metrics for correct predictions. We would greatly appreciate:

Clarification on the intended canonical form for term and type extraction
Guidance on whether models should normalize, infer from domain knowledge, or strictly follow the source text
Ideally, a corrected version of the training data before the final evaluation phase
Thank you very much for your time and for organizing this challenge.

Kind regards,
LIST-CortAIx team

LLMs4OL Challenge

May 20, 2026, 5:45:46 AM

to Sondes Bannour Souihi, LLMs4OL Challenge

Dear Sondes,

Thank you very much for your thoughtful and detailed observations. We truly appreciate the time and effort you have invested in understanding the dataset and the evaluation setup, and your feedback is valuable to us.

Below, we address your questions and observations.

A. Evaluation Clarifications

1. Evaluation granularity: Each sub-task (that is, the term typing, taxonomy discovery, and non-taxonomic relation extraction tasks) will be evaluated independently, consistent with previous editions of the challenge. There is no single end-to-end pipeline score; instead, separate precision, recall, and F1 scores will be reported per sub-task.

2. Matching criterion: Evaluation is based on exact matching after normalization (case folding and whitespace trimming). For relation extraction, all components of a triple (head, relation, tail) must match simultaneously for the triple to count as correct. There is currently no approximate or fuzzy matching; however, we are considering adding fuzzy metrics alongside these strict metrics, so feel free to use or recommend such an approach. We will take suggestions into account, and you are welcome to submit a PR with new metrics. The current evaluation metrics in OntoLearner are here: https://github.com/sciknoworg/OntoLearner/blob/main/ontolearner/evaluation/metrics.py. We will also consider adding semantic matching to the metrics, so you would see both strict (exact) and semantic (fuzzy) metrics.
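As an illustration of the strict criterion described above, here is a minimal sketch of set-based precision, recall, and F1 after case folding and whitespace trimming. It is not the OntoLearner implementation; the function names, the de-duplication via sets, and the choice to return zeros for empty inputs are all assumptions.

```python
def normalize(item):
    """Case-fold and trim a string, or each component of a (head, relation, tail) tuple."""
    if isinstance(item, str):
        return item.strip().casefold()
    return tuple(normalize(part) for part in item)


def prf1(predicted, gold):
    """Exact-match precision, recall, and F1 over normalized, de-duplicated items."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Terms: only " FORTRAN 90 " matches after normalization.
    print(prf1([" FORTRAN 90 ", "cobol"], ["fortran 90", "fortran90"]))
    # Triples: every component must match for a true positive.
    print(prf1([("A", "is-a", "B")], [("a", "is-a", "b"), ("a", "is-a", "c")]))
```

Because a triple is compared as a whole tuple, a correct head and relation with a wrong tail contributes nothing to the true-positive count.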

3. Low term coverage: Thanks for the good observation. This was intentional. In most ontologies, terms are often also treated as classes (or types, in our context), so differentiating between the two is sometimes challenging. We therefore preferred to keep this natural availability by providing low term coverage rather than a balanced distribution across all three tasks of the LLMs4OL paradigm. As a result, some documents are designed to contain only hierarchical or relational knowledge without explicit instance terms. In such cases, it is acceptable (and expected) for systems to output empty term lists. Terms should only be extracted when they are explicitly supported by the document content. Small hint: you can also combine the task B training set with the task A training set to increase your dataset size.

B. Training data quality issues

You are correct that several of the issues you identified stem from annotation artifacts, legacy formatting from source ontologies, and inconsistencies introduced during dataset construction.

At this stage of the challenge, we will not modify the training data, as this would affect reproducibility and fairness across participants. However, we provide the following guidance to clarify the intended interpretation.

Issue 1 — Inconsistent term encoding within a single document. The evaluation will remove '_' and any punctuation before scoring. In addition, we will introduce semantic-similarity-based precision, recall, and F1 scores alongside the strict evaluation described in section A of this email and on the website. I think this can handle the variations that you mentioned.
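A sketch of what such a normalization might look like is given below. Whether punctuation is deleted or replaced by a space is not stated here, so replacing it with a space (and collapsing repeated whitespace) is an assumption, as is the function name normalize_label.

```python
import string

# Map every ASCII punctuation character (including "_") to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def normalize_label(label: str) -> str:
    """Case-fold, turn punctuation and underscores into spaces, collapse whitespace."""
    return " ".join(label.translate(_PUNCT_TO_SPACE).casefold().split())


if __name__ == "__main__":
    for label in ["mm_topic_3150", "MM topic 3150", "h.sea", "h. sea", "FORTRAN90"]:
        print(repr(label), "->", repr(normalize_label(label)))
```

Under this assumption "mm_topic_3150" and "mm topic 3150" coincide, as do "h.sea" and "h. sea", while "FORTRAN90" and "fortran 90" still differ.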

Issue 2 — Spurious spaces in code-like terms. Very nice observation. IDs such as q715269 shouldn't appear in the gold or the context at all, so this is mostly an annotation issue. However, we assure you that the test set will not have such issues; you will receive clean texts. For the training set, feel free to ignore them or clean them up. For the GeoNames-style codes only, I would say preserve h.sea (or similar) as they are in both gold and text.

Issue 3 — Underscore encoding vs. natural text form. The gold should be the same as what appears in the documents, so if you observe "_" in the head or tail (not the predicate), you are free to remove it. Even if you don't remove it from the gold, the graph similarity metric (or the semantic variant of precision, recall, and F1 mentioned under Issue 1 as an extra metric) will count the prediction as correct, with or without "_".

Issues 4, 5, 6, 7. I think the introduction of semantic-similarity-based metrics (together with the existing graph similarity) will catch both forms: whether you output "FORTRAN90" or "fortran 90", both will be treated as correct, of course with a minor 1 or 2 percent error in the overall evaluation metrics, which I think is avoidable. You are right in all of these cases. Regarding Issue 6, the qualifier can usually be recovered from context because the texts discuss a specific domain; for example, when we mention "Buy" it is in the financial domain, so it relates to the business function. This technique is mostly used in domain-specific ontologies when labels are created. Regarding the spelling errors, the entity comes from an ontology, and it seems that the ontology itself contains the spelling error. Interesting observation.
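The semantic metric itself is not specified in this thread. Purely as a stand-in, the sketch below uses character-level similarity from Python's difflib with greedy one-to-one matching; the 0.85 threshold, the function names, and the matching strategy are all assumptions and not the organizers' metric.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def fuzzy_prf1(predicted, gold, threshold=0.85):
    """Greedy one-to-one matching: a prediction counts if its best free gold label is close enough."""
    unmatched = list(gold)
    true_positives = 0
    for pred in predicted:
        best = max(unmatched, key=lambda g: similarity(pred, g), default=None)
        if best is not None and similarity(pred, best) >= threshold:
            unmatched.remove(best)  # each gold label can be matched only once
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(fuzzy_prf1(["FORTRAN90", "cobol"], ["fortran 90"]))
    print(fuzzy_prf1(["measurement"], ["mesurement"]))
```

A string-similarity stand-in handles spacing and spelling variants such as those in Issues 4 and 7, but it would not recover parenthetical qualifiers or true synonyms; an embedding-based measure would behave differently.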

Issue 8 — Gold types absent from document text. If the gold type can be inferred from the terms, it is valid; if it cannot be derived from the terms, it could be invalid. Classes are usually not meant to be explicitly mentioned in the text. There are two paradigms: you can either take your type exactly as it is in the text, or infer it under an expert-driven naming (one of the challenging parts). In real-world ontology construction, a collection of texts is usually used to create terms, which are then categorized into higher levels; those levels are extended to form a taxonomy, and axioms are added afterwards. So, as you see, the final ontology names might not even appear in the text; however, you can link them (entity linking). This is one way of building ontologies. Another way is to use the exact naming as it appears in the text. So, if a model is fine-tuned on the training samples, a perfect training set would need to account for this real-world scenario as well. However, you are free to ignore one of these scenarios in your training, and hopefully such an issue will not appear in testing. But, as you mentioned, there would indeed be a ceiling in such a case.

Some of the cases you observed are intentional and reflect realistic ontology population scenarios, while others are annotation artifacts that will not appear in the test set. Preprocessing and normalization are encouraged. Systems that can map surface text to canonical ontology labels (rather than relying solely on literal string matching) will be better aligned with the evaluation design.

Thank you again for your careful analysis and constructive feedback.

Best Regards,

Hamed

On behalf of the LLMs4OL Challenge Organizing Committee
