We are pleased to announce the ClinSkill QA shared task, co-located with the BioNLP workshop at ACL 2026.
INTRODUCTION
Multimodal large language models (MLLMs) have the potential to support clinical training and assessment by assisting medical experts in interpreting procedural videos and verifying adherence to standardized workflows. Reliable deployment in these settings requires evidence that models can continuously interpret students' actions during clinical skill assessments, a capability that underpins MLLMs' understanding of clinical skills. Systematically evaluating and improving this understanding and continuous perception is therefore essential for building reliable, high-impact AI systems for medical education. To address this need, the ClinSkill QA shared task targets medical question answering in clinical skill assessment scenarios.
IMPORTANT DATES
Release of task data: Jan 30, 2026
Paper submission deadline: Apr 17, 2026
Notification of acceptance: May 4, 2026
Camera-ready paper due: May 12, 2026
BioNLP Workshop Date: July 3 or 4, 2026
Note that all deadlines are 23:59:59 AoE (UTC-12).
TASK DEFINITION
ClinSkill QA formulates clinical skill understanding and continuous perception for clinical skill assessment as an ordering task: the MLLM must arrange shuffled key frames into a coherent sequence of clinical actions and provide explanations for the resulting order. The dataset is constructed from video clips of clinical procedures performed by medical students, collected from Zhongnan Hospital of Wuhan University and cofun (http://www.curefun.com/). This study was approved by the Institutional Review Board (IRB), and all data collection and processing followed relevant ethical guidelines.
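To make the task format concrete, the sketch below shows one possible shape of a single instance and model output. The field names and values are purely illustrative assumptions, not the official data format, which will be specified with the data release.

# Hypothetical illustration of a single ClinSkill QA instance and a model prediction.
# All field names here are assumptions for illustration only, not the official schema.
example_instance = {
    "sample_id": "clinskill_0001",
    "shuffled_frames": ["frame_c.jpg", "frame_a.jpg", "frame_d.jpg", "frame_b.jpg"],
}

example_prediction = {
    "sample_id": "clinskill_0001",
    # Indices into `shuffled_frames`, giving the predicted temporal order of actions.
    "predicted_order": [1, 3, 0, 2],
    # Free-text rationale explaining why this ordering is clinically coherent.
    "rationale": "Hand hygiene precedes glove donning, which precedes sterile field preparation ...",
}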
DATASET
ClinSkill QA is built on 200 sets of shuffled key frames extracted from three types of clinical skill videos. Each set of key frames represents a sequence of continuous actions and is accompanied by expert-annotated ground-truth ordering and order rationales.
EVALUATION
For evaluation, we use Task Accuracy (exact ordering) and Pairwise Accuracy (the fraction of adjacent pairs correctly ordered) for the ordering results, and BERTScore as well as an LLM-as-Judge (G-Eval) for assessing the quality of the ordering explanations.
For each sample (a set of shuffled key frames), the following metrics are computed (see the sketch after this list):
Ordering evaluation
- Task Accuracy
- Pairwise Accuracy
Rationale evaluation
- BERTScore
- LLM-as-Judge (G-Eval)
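For clarity, the sketch below shows one way the ordering metrics could be computed, together with BERTScore for rationales via the open-source bert-score package. It is a minimal illustration, not the official scorer: in particular, it reads "adjacent pairs correctly ordered" as the fraction of ground-truth adjacent pairs whose relative order is preserved in the prediction, which is an assumption.

from bert_score import score as bertscore  # pip install bert-score

def task_accuracy(pred_orders, gold_orders):
    """Fraction of samples whose predicted ordering exactly matches the ground truth."""
    correct = sum(p == g for p, g in zip(pred_orders, gold_orders))
    return correct / len(gold_orders)

def pairwise_accuracy(pred_order, gold_order):
    """Fraction of ground-truth adjacent pairs whose relative order the prediction preserves.

    This reading of "adjacent pairs correctly ordered" is an assumption; the official
    scorer may define the metric differently.
    """
    pos = {frame: i for i, frame in enumerate(pred_order)}
    pairs = list(zip(gold_order, gold_order[1:]))
    kept = sum(pos[a] < pos[b] for a, b in pairs)
    return kept / len(pairs)

# Toy example with four key frames labelled by ID.
gold = ["f1", "f2", "f3", "f4"]
pred = ["f1", "f3", "f2", "f4"]
print(task_accuracy([pred], [gold]))   # 0.0  (not an exact match)
print(pairwise_accuracy(pred, gold))   # 0.667 (f1<f2 and f3<f4 preserved, f2<f3 not)

# Rationale quality via BERTScore (F1) between predicted and reference explanations.
cands = ["The student first performs hand hygiene, then dons sterile gloves."]
refs = ["Hand hygiene is performed before sterile gloves are put on."]
P, R, F1 = bertscore(cands, refs, lang="en")
print(F1.mean().item())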
REGISTRATION AND SUBMISSION
Registration and submission will be done via CodaBench (the link will be available soon on the task home page).
Each team is allowed up to ten successful submissions on CodaBench.
ORGANIZERS
Xiyang Huang, School of Artificial Intelligence, Wuhan University
Yihuai Xu, School of Artificial Intelligence, Wuhan University
Zhiyuan Chen, School of Artificial Intelligence, Wuhan University
Keying Wu, School of Artificial Intelligence, Wuhan University
Jiayi Xiang, School of Artificial Intelligence, Wuhan University
Buzhou Tang, School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen
Renxiong Wei, Zhongnan Hospital of Wuhan University
Yanqing Ye, Zhongnan Hospital of Wuhan University
Jinyu Chen, Zhongnan Hospital of Wuhan University
Cheng Zeng, School of Artificial Intelligence, Wuhan University
Min Peng, School of Artificial Intelligence, Wuhan University
Qianqian Xie, School of Artificial Intelligence, Wuhan University
Sophia Ananiadou, Department of Computer Science, The University of Manchester