CL4Health is concerned with the resources, computational approaches, and behavioral and socio-economic aspects of the public's interactions with digital resources in search of health-related information that satisfies their information needs and guides their actions. The workshop invites papers on all areas of language processing focused on patients' health and on health-related issues concerning the public. These issues include, but are not limited to: accessibility and trustworthiness of health information provided to the public; explainable and evidence-supported answers to consumer-health questions; accurate summarization of patients' health records at their health literacy level; understanding patients' non-informational needs through their language; and accurate and accessible interpretations of biomedical research. In addition to the main workshop program, CL4Health hosts the following shared tasks:
Detecting Dosing Errors from Clinical Trials.
Medication errors constitute a significant threat to public health worldwide. Although various types of errors may occur, dosing errors have been identified as one of the most frequent types. The objective of the shared task is to develop and evaluate machine
learning methods capable of analyzing clinical trial data (including structured metadata and free-text protocol descriptions) to identify trials that are likely to experience unusually high rates of dosing errors. Such predictive tools could serve as early-warning
systems, supporting more reliable trial design and enhancing medication safety. A human-annotated dataset comprising 40,000 clinical trials will be used for the training and validation sets. Submissions will be evaluated primarily using the F1-score, with AUROC
and AUPRC reported as complementary metrics. To prevent participants from training on unauthorized data, only submissions of fully reproducible, open methods will be considered.
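To make the evaluation setup concrete, the sketch below shows one possible baseline under stated assumptions: a logistic-regression classifier over TF-IDF features of the free-text protocol descriptions, scored with the primary F1 metric and the complementary AUROC and AUPRC. It is not the official pipeline; the variable names (protocol_texts, labels) and the 80/20 split are illustrative only.

```python
# Hypothetical baseline sketch for the dosing-error prediction task.
# Assumes binary labels (1 = trial likely to have a high dosing-error rate).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def evaluate_dosing_error_baseline(protocol_texts, labels):
    """Train on free-text protocol descriptions only; return the task's metrics."""
    X_train, X_val, y_train, y_val = train_test_split(
        protocol_texts, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(min_df=2, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_val)[:, 1]   # ranking scores for AUROC / AUPRC
    preds = (scores >= 0.5).astype(int)         # hard decisions for F1
    return {
        "F1": f1_score(y_val, preds),
        "AUROC": roc_auc_score(y_val, scores),
        "AUPRC": average_precision_score(y_val, scores),
    }
```

In practice, the structured trial metadata would be combined with the text features, and the decision threshold tuned on the validation set rather than fixed at 0.5.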
Automatic Case Report Form (CRF) Filling from Clinical Notes.
Case Report Forms are standardized instruments in medical research used to collect patient data consistently and reliably. They consist of predefined items to be filled with patient information. Automating CRF filling from clinical notes would accelerate
clinical research, reduce manual burden on healthcare professionals, and create structured representations that can be directly leveraged to produce accessible, patient-friendly, and practitioner-friendly summaries. The shared task focuses on developing systems
that take clinical narratives as input and automatically populate the relevant slots in a CRF. Two multilingual datasets, one synthetic and one drawn from real clinical data, covering English and Italian, will be shared with participants for system development. The
evaluation will be performed in terms of F1-score by comparing systems' outputs with ground-truth labels.
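The following is a hedged sketch of how slot-level F1 could be computed once a system has filled a CRF, assuming each form is represented as a dictionary from item names to extracted values; the exact answer format and matching rules are defined by the organizers, so the function and field names here are illustrative only.

```python
# Illustrative micro-averaged F1 over (item, value) pairs across all documents.
def crf_micro_f1(predicted, gold):
    """predicted / gold: lists of dicts mapping CRF item names to filled values."""
    tp = fp = fn = 0
    for pred_form, gold_form in zip(predicted, gold):
        pred_pairs = {(k, v) for k, v in pred_form.items() if v is not None}
        gold_pairs = {(k, v) for k, v in gold_form.items() if v is not None}
        tp += len(pred_pairs & gold_pairs)
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# Example: one note, two hypothetical CRF items filled by a system vs. the gold annotation.
pred = [{"adverse_event": "nausea", "medication": "metformin"}]
gold = [{"adverse_event": "nausea", "medication": "insulin"}]
print(crf_micro_f1(pred, gold))  # 0.5
```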
Grounded Question Answering from Electronic Health Records.
While there have been studies on answering general health-related queries, few have focused on questions patients ask about their own medical records. Furthermore, grounding (linking responses to specific evidence) is critical in medicine. Yet, despite extensive studies in open domains,
its application in the clinical domain remains underexplored. To foster research in these sparsely studied areas of clinical natural language processing, the ArchEHR-QA (“Archer”) shared task was introduced as part of the BioNLP Workshop at ACL 2025. Given
a patient-posed natural language question, the corresponding clinician-interpreted question, and the patient's clinical note excerpt, the task is to produce a natural language answer with citations to the specific note sentences. The ArchEHR-QA dataset is
based on real-life patients' questions from public health forums aligned with clinical notes from publicly accessible EHR databases (MIMIC-III/IV) to form a cohesive question-answer source case. Submissions will be evaluated for evidence use (“Factuality”)
and answer quality (“Relevance”). Factuality is measured via Precision, Recall, and F1 Scores between the cited evidence sentences in systems' answers and ground truth labels. Relevance is measured against ground truth answers using BLEU, ROUGE, SARI, BERTScore,
AlignScore, and MEDCON.
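As an illustration of the Factuality side of the scoring, the sketch below computes micro-averaged Precision, Recall, and F1 over cited note-sentence identifiers, assuming each system answer and each gold annotation is reduced to a set of sentence IDs per case; the official ArchEHR-QA scorer may differ in parsing and aggregation details.

```python
# Illustrative citation-level Precision / Recall / F1 for grounded answers.
def citation_prf(pred_citations, gold_citations):
    """pred_citations / gold_citations: lists of iterables of cited sentence IDs, one per case."""
    tp = fp = fn = 0
    for pred_ids, gold_ids in zip(pred_citations, gold_citations):
        pred_ids, gold_ids = set(pred_ids), set(gold_ids)
        tp += len(pred_ids & gold_ids)
        fp += len(pred_ids - gold_ids)
        fn += len(gold_ids - pred_ids)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Example: the system cites note sentences 2 and 5; the gold evidence is sentences 2 and 7.
print(citation_prf([[2, 5]], [[2, 7]]))  # (0.5, 0.5, 0.5)
```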