About CALAMITA.
CALAMITA (Challenge the Abilities of LAnguage Models in ITAlian) is an AILC community effort to build a dynamic, ever-growing benchmark for evaluating Large Language Models (LLMs) in Italian. The inaugural edition (co-located with CLiC-it 2024) mobilised the community at scale: 80+ contributors defined tasks, annotated data, and helped automate the evaluation pipeline, enabling the assessment of 4 LLMs across 22 challenges and 67 sub-tasks. We are deeply grateful to everyone who made this possible. For background and results, see the CLiC-it 2024 paper: https://aclanthology.org/2024.clicit-1.116/
Following the first edition, CALAMITA is intended to remain a stable feature of the AILC community and to serve as the primary reference for benchmarking LLMs in Italian. To ensure continuity and dynamic development, we propose the following.
We are launching a rolling process for proposing new CALAMITA challenges (i.e., concrete benchmarks consisting of data, an evaluation setup, and a short paper). Cycles currently run every four months; we may adjust the cadence as the initiative progresses.
Pre-proposal (quick screening)
Submit your task proposal via the form (link below). The CALAMITA chairs run a rapid screening check (suitability for Italian, non-duplication/similarity with existing challenges, scope). Definitions: we use "challenge" for the specific benchmark you submit and "task" for the underlying capability being tested (e.g., commonsense reasoning, toxicity detection, coreference).
Data delivery and model evaluation
If the pre-proposal is accepted, you provide the dataset (public and/or private split) and evaluation scripts compatible with our pipeline (lm-eval-harness), usually within one month. The CALAMITA technical team can help with formatting and will then evaluate a set of LLMs on your data, returning scores, logs, and model outputs to support fine-grained error analysis.
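For orientation, here is a minimal, illustrative sketch of how a challenge might be run locally through the lm-eval-harness Python API before delivery; the model identifier, task name, and config path are placeholders, and the exact interface may differ across harness versions.

# Illustrative sketch only: placeholder model id, task name, and config path.
import lm_eval
from lm_eval.tasks import TaskManager

# Point the harness at a directory containing the task's config files
# ("my_calamita_task" is a hypothetical task name).
task_manager = TaskManager(include_path="path/to/task/configs")

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=some-org/some-model",  # placeholder model id
    tasks=["my_calamita_task"],
    num_fewshot=0,
    batch_size=8,
    log_samples=True,  # keep per-example outputs for error analysis
    task_manager=task_manager,
)
print(results["results"])

Submitting something along these lines (or the corresponding task config) makes it much easier for the technical team to plug your challenge into the shared pipeline.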
Short paper, peer review, publication, and presentation
Using the results and outputs, you submit a short paper (in the length and style of the CLiC-it short-paper format) for peer review. Because the process is rolling, an accepted paper will first be posted on arXiv and later included in the next volume of the annual AILC proceedings (either at CLiC-it or EVALITA, depending on the year). At the relevant event, new CALAMITA challenges and results will be presented by the proposers in a dedicated CALAMITA slot.
Form (pre-proposal): https://forms.gle/UjGJFeJMAubV14oV7
Italian first. We prioritise datasets natively in Italian; proposals based mainly on machine-translated data will generally be rejected.
Private test set (strongly encouraged). To maintain the benchmark's reliability and minimise contamination, proposals that include a held-out private test split (deposited with AILC) are eligible for the core private subset of CALAMITA (a minimal split example is sketched after these guidelines).
Reproducibility. Provide evaluation code compatible with lm-eval-harness; licensing information for data/code; and a brief data statement (provenance, annotation, labels).
Similarity vs. duplication. If your challenge is similar to a previous CALAMITA challenge, please indicate which one and explain the added value. Chairs may recommend integrating it as an extension of an existing challenge when appropriate.
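As an illustration of the private-split guideline above, the following sketch shows one possible way to carve out a reproducible held-out test split before depositing it with AILC; the file names, JSONL format, and 70/30 ratio are assumptions, not requirements.

# Illustrative sketch only: hypothetical file names, format, and split ratio.
import json
import random

with open("all_examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)    # fixed seed so the split can be reproduced
random.shuffle(examples)

cut = int(0.7 * len(examples))    # 70% public, 30% private (held out)
public_split, private_split = examples[:cut], examples[cut:]

for path, split in [("public_test.jsonl", public_split),
                    ("private_test.jsonl", private_split)]:
    with open(path, "w", encoding="utf-8") as f:
        for example in split:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")

Whatever procedure you use, please document the seed and the splitting procedure in the data statement so the split can be reproduced.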
We intend to start shortly and to stress-test the pipeline. For this pilot cycle, we will select a first batch of submissions for full evaluation (the process itself is new for us, and we appreciate your patience and feedback).
Target venue: EVALITA 2026 - https://www.evalita.it/campaigns/evalita-2026/
Pilot timeline
1 November 2025 - Pre-proposal form deadline
8 November 2025 - Notification of pre-proposal decision (Yes/No)
8 December 2025 - Data and code due (for accepted pre-proposals)
7 January 2026 - Evaluation results (scores + logs/outputs) sent to authors
16 January 2026 - Short-paper deadline
30 January 2026 - Light review completed (decisions)
16 February 2026 - Camera-ready (EVALITA)
26–27 February 2026 - EVALITA Workshop (Bari)
CALAMITA is ambitious and resource-intensive. It is also a fantastic training ground for PhD students and early-career researchers to learn how to evaluate LLMs at scale and to co-run a community benchmarking process.
Depending on the volume of submissions, we will open volunteer positions for:
the Evaluation Team (running and monitoring model evaluations, QA of logs/outputs), and
Evaluation Supervisors (coordination, quality control, mentoring).
If you are interested in contributing, please indicate this when you submit the form or write to calami...@gmail.com.
Website (coming soon): the official site will carry announcements about open roles and timelines; stay tuned.
Questions about scope, formats, or logistics? Write to the address above.
— The CALAMITA Organizing Team
Pierpaolo Basile, Danilo Croce, Malvina Nissim, Viviana Patti