About CALAMITA.
CALAMITA (Challenge the Abilities of LAnguage Models in ITAlian) is an AILC community effort to build a dynamic, ever-growing benchmark for evaluating Large Language Models (LLMs) in Italian. The inaugural edition (co-located with CLiC-it 2024) mobilised the community at scale: 80+ contributors defined tasks, annotated data, and helped automate the evaluation pipeline, enabling the assessment of 4 LLMs across 22 challenges and 67 sub-tasks. We are deeply grateful to everyone who made this possible. For background and results, see the CLiC-it 2024 paper: https://aclanthology.org/2024.clicit-1.116/
Following the first edition, CALAMITA is intended to remain a stable feature of the AILC community and to serve as the primary reference for benchmarking LLMs in Italian. To ensure continuity and dynamic development, we propose the following.
We are launching a rolling process for proposing new CALAMITA challenges (i.e., concrete benchmarks consisting of data, an evaluation setup, and a short paper). Cycles currently run every four months; we may adjust the cadence as the initiative progresses.
Pre-proposal (quick screening)
Submit your task proposal via the form (link below). The CALAMITA chairs run a rapid screening check (suitability for Italian, non-duplication/similarity with existing challenges, scope). Definitions: we use "challenge" for the specific benchmark you submit and "task" for the underlying capability being tested (e.g., commonsense reasoning, toxicity detection, coreference).
Data delivery and model evaluation
If the pre-proposal is accepted, you provide the dataset (public and/or private split) and evaluation scripts compatible with our pipeline (lm-eval-harness), usually within one month. The CALAMITA technical team can help with formatting and will then evaluate a set of LLMs on your data, returning scores, logs, and model outputs to support fine-grained error analysis.
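For orientation, here is a minimal, illustrative sketch of how a challenge might be run locally through the lm-eval-harness Python API before delivery; the model identifier, task name, and config path are placeholders, and the exact interface may differ across harness versions.

# Illustrative sketch only: placeholder model id, task name, and config path.
import lm_eval
from lm_eval.tasks import TaskManager

# Point the harness at a directory containing the task's config files
# ("my_calamita_task" is a hypothetical task name).
task_manager = TaskManager(include_path="path/to/task/configs")

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=some-org/some-model",  # placeholder model id
    tasks=["my_calamita_task"],
    num_fewshot=0,
    batch_size=8,
    log_samples=True,  # keep per-example outputs for error analysis
    task_manager=task_manager,
)
print(results["results"])

Submitting something along these lines (or the corresponding task config) makes it much easier for the technical team to plug your challenge into the shared pipeline.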
Short paper, peer review, publication, and presentation
Using the results and outputs, you submit a short paper (in the length and style of the CLiC-it short-paper format) for peer review. Because the process is rolling, an accepted paper will first be posted on arXiv and later included in the next volume of the annual AILC proceedings (either at CLiC-it or EVALITA, depending on the year). At the relevant event, new CALAMITA challenges and results will be presented by the proposers in a dedicated CALAMITA slot.
Form (pre-proposal): https://forms.gle/UjGJFeJMAubV14oV7
Italian first. We prioritise datasets natively in Italian; proposals based mainly on machine-translated data will generally be rejected.
Private test set (strongly encouraged). To maintain the benchmark's reliability and minimise contamination, proposals that include a held-out private test split (deposited with AILC) are eligible for the core private subset of CALAMITA (a minimal split example is sketched after these guidelines).
Reproducibility. Provide evaluation code compatible with lm-eval-harness; licensing information for data/code; and a brief data statement (provenance, annotation, labels).
Similarity vs. duplication. If your challenge is similar to a previous CALAMITA challenge, please indicate which one and explain the added value. Chairs may recommend integrating it as an extension of an existing challenge when appropriate.
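As an illustration of the private-split guideline above, the following sketch shows one possible way to carve out a reproducible held-out test split before depositing it with AILC; the file names, JSONL format, and 70/30 ratio are assumptions, not requirements.

# Illustrative sketch only: hypothetical file names, format, and split ratio.
import json
import random

with open("all_examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)    # fixed seed so the split can be reproduced
random.shuffle(examples)

cut = int(0.7 * len(examples))    # 70% public, 30% private (held out)
public_split, private_split = examples[:cut], examples[cut:]

for path, split in [("public_test.jsonl", public_split),
                    ("private_test.jsonl", private_split)]:
    with open(path, "w", encoding="utf-8") as f:
        for example in split:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

Whatever procedure you use, please document the seed and the splitting procedure in the data statement so the split can be reproduced.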
We intend to start shortly and to stress-test the pipeline. For this pilot cycle, we will select a first batch of submissions for full evaluation (the process itself is new for us, and we appreciate your patience and feedback).
Target venue: EVALITA 2026 - https://www.evalita.it/campaigns/evalita-2026/
Pilot timeline
1 November 2025 - Pre-proposal form deadline
8 November 2025 - Notification of pre-proposal decision (Yes/No)
8 December 2025 - Data and code due (for accepted pre-proposals)
7 January 2026 - Evaluation results (scores + logs/outputs) sent to authors
16 January 2026 - Short-paper deadline
30 January 2026 - Light review completed (decisions)
16 February 2026 - Camera-ready (EVALITA)
26–27 February 2026 - EVALITA Workshop (Bari)
CALAMITA is ambitious and resource-intensive. It is also a fantastic training ground for PhD students and early-career researchers to learn how to evaluate LLMs at scale and to co-run a community benchmarking process.
Depending on the volume of submissions, we will open volunteer positions for:
the Evaluation Team (running and monitoring model evaluations, QA of logs/outputs), and
Evaluation Supervisors (coordination, quality control, mentoring).
If you are interested in contributing, please indicate this when you submit the form or write to calami...@gmail.com.
Website (coming soon): the official site will carry announcements about open roles and timelines; stay tuned.
Questions about scope, formats, or logistics? Write to the address above.
— The CALAMITA Organizing Team
Pierpaolo Basile, Danilo Croce, Malvina Nissim, Viviana Patti