Reproducibility Certification: Early Classification of Time Series: A Survey and Benchmark
Aurélien Renault, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire
https://openreview.net/forum?id=bcNDYmBicK
---
Accepted papers
===============
Title: IndicFake Meets SAFARI-LLM: Unifying Semantic and Acoustic Intelligence for Multilingual Deepfake Detection
Authors: Rishabh Ranjan, Mayank Vatsa, Richa Singh
Abstract: Audio deepfakes pose a growing threat, particularly in linguistically diverse and low-resource settings where existing detection methods often struggle. This work introduces two transformative contributions to address these challenges. First, we present IndicFake, a pioneering audio deepfake dataset with over 4.2 million samples (7,350 hours) spanning English and 17 Indian languages across the Indo-European, Dravidian, and Sino-Tibetan families. With minimal overlap (Jaccard similarity: 0.00–0.06) with existing datasets, IndicFake offers an unparalleled benchmark for multilingual deepfake detection. Second, we propose SAFARI-LLM (Semantic Acoustic Feature Adaptive Router with Integrated LLM), a novel framework that integrates Whisper's semantic embeddings and m-HuBERT's acoustic features through an adaptive Audio Feature Unification Module (AFUM). Enhanced by a LoRA-fine-tuned LLaMA-7B, SAFARI-LLM achieves unmatched cross-lingual and cross-family generalization. Evaluations across the IndicFake, DECRO, and WaveFake datasets demonstrate its superiority, outperforming 14 state-of-the-art models with standout accuracies of 94.21% (English-to-Japanese transfer on WaveFake) and 84.48% (English-to-Chinese transfer on DECRO), alongside robust performance across diverse linguistic contexts. These advancements establish a new standard for reliable, scalable audio deepfake detection. Code and resources are publicly available at: https://anonymousillusion.github.io/indicfake/.
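The abstract does not spell out how AFUM combines the two feature streams; the sketch below shows one plausible adaptive fusion, a learned gate over pre-extracted Whisper and m-HuBERT embeddings. The module name, dimensions, and gating design are illustrative assumptions, not the paper's AFUM.
```python
# Hypothetical sketch of adaptive fusion of semantic (Whisper) and acoustic
# (m-HuBERT) embeddings via a learned gate; dimensions and the gating design
# are illustrative assumptions, not the paper's AFUM.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, sem_dim=1280, ac_dim=768, hidden=512):
        super().__init__()
        self.proj_sem = nn.Linear(sem_dim, hidden)   # project semantic embedding
        self.proj_ac = nn.Linear(ac_dim, hidden)     # project acoustic embedding
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Sigmoid())
        self.head = nn.Linear(hidden, 2)             # real vs. fake logits

    def forward(self, sem, ac):
        s, a = self.proj_sem(sem), self.proj_ac(ac)
        g = self.gate(torch.cat([s, a], dim=-1))     # per-dimension mixing weights
        fused = g * s + (1 - g) * a
        return self.head(fused)

# Toy usage with random tensors standing in for real model embeddings.
model = GatedFusion()
logits = model(torch.randn(4, 1280), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 2])
```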
URL: https://openreview.net/forum?id=s8pPYRVVTU
---
Title: SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models
Authors: Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie
Abstract: This work explores two distinct approaches for enhancing reasoning abilities in Large Vision Language Models (LVLMs): supervised fine-tuning (SFT) and reinforcement learning (RL). To support the SFT approach, we curate a multimodal reasoning dataset with complete reasoning traces guided by DeepSeek-R1. For the RL approach, we focus on GRPO and develop a training framework tailored to vision-language tasks, with a composite reward system comprising four signals that address both visual perception and reasoning challenges. Our extensive experiments reveal that RL is a significantly more effective strategy than SFT for training reasoning VLMs. While SFT can assist models that initially struggle with following reasoning instructions, it often induces "pseudo aha moments" that degrade overall reasoning performance, implying that only a minimal amount of SFT data is necessary. In contrast, RL leads to substantial improvements, outperforming recent baseline models on a range of math reasoning tasks by at least 2% on average. We also present several intriguing findings, e.g., that combining SFT and GRPO also hurts model performance, and that stronger instruction-aligned LVLMs consistently lead to better results in RL. We hope these findings provide valuable insights into the development of reasoning-capable VLMs and guide future research in this area.
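As a rough illustration of a composite reward for GRPO-style training, the sketch below combines a few hypothetical signals (format compliance, answer correctness, a perception score, and a length penalty). The specific signals and weights are assumptions; the paper's four reward signals are not detailed in the abstract.
```python
# Illustrative composite-reward sketch for GRPO-style training; the signals
# and weights here are hypothetical, not the paper's exact design.
import re

def format_reward(response: str) -> float:
    # Reward responses that wrap reasoning and answer in the expected tags.
    return 1.0 if re.search(r"<think>.*</think>.*<answer>.*</answer>",
                            response, re.DOTALL) else 0.0

def answer_reward(response: str, gold: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def composite_reward(response: str, gold: str,
                     perception_score: float, length_penalty: float) -> float:
    # Weighted sum of signals; perception_score and length_penalty would come
    # from task-specific checks in a real pipeline.
    return (0.25 * format_reward(response)
            + 0.5 * answer_reward(response, gold)
            + 0.15 * perception_score
            - 0.1 * length_penalty)

print(composite_reward("<think>...</think><answer>42</answer>", "42", 1.0, 0.0))
```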
URL: https://openreview.net/forum?id=wZI5qkQeDF
---
Title: Minimax Multi-Target Conformal Prediction with Applications to Imaging Inverse Problems
Authors: Jeffrey Wen, Rizwan Ahmad, Philip Schniter
Abstract: In ill-posed imaging inverse problems, uncertainty quantification remains a fundamental challenge, especially in safety-critical applications. Recently, conformal prediction has been used to quantify the uncertainty that the inverse problem contributes to downstream tasks like image classification, image quality assessment, fat mass quantification, etc. While existing works handle only a scalar estimation target, practical applications often involve multiple targets. In response, we propose an asymptotically minimax approach to multi-target conformal prediction that provides tight prediction intervals while ensuring joint marginal coverage. We then outline how our minimax approach can be applied to multi-metric blind image quality assessment, multi-task uncertainty quantification, and multi-round measurement acquisition. Finally, we numerically demonstrate the benefits of our minimax method, relative to existing multi-target conformal prediction methods, using both synthetic and magnetic resonance imaging (MRI) data.
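For intuition on multi-target conformal prediction, the sketch below implements a generic split-conformal baseline that calibrates on the maximum scaled residual across targets to obtain jointly valid intervals; it is not the paper's minimax procedure, and all data here are synthetic.
```python
# Generic multi-target split-conformal baseline (not the paper's minimax
# method): calibrate on the worst-case scaled residual across targets so the
# resulting per-target intervals hold jointly with 1 - alpha coverage.
import numpy as np

rng = np.random.default_rng(0)
n_cal, n_test, d = 500, 5, 3            # calibration size, test size, #targets
y_cal = rng.normal(size=(n_cal, d))
yhat_cal = y_cal + 0.3 * rng.normal(size=(n_cal, d))   # toy predictions
scale = np.std(y_cal - yhat_cal, axis=0)               # per-target residual scale

# Nonconformity score: worst-case scaled residual across the d targets.
scores = np.max(np.abs(y_cal - yhat_cal) / scale, axis=1)
alpha = 0.1
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal,
                method="higher")

yhat_test = rng.normal(size=(n_test, d))
lower, upper = yhat_test - q * scale, yhat_test + q * scale
print(lower.shape, upper.shape)          # per-target intervals, jointly calibrated
```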
URL: https://openreview.net/forum?id=53FEYwDQK0
---
Title: Early Classification of Time Series: A Survey and Benchmark
Authors: Aurélien Renault, Alexis Bondu, Antoine Cornuéjols, Vincent Lemaire
Abstract: In many situations, the measurements of a studied phenomenon are provided sequentially, and its class must be predicted as early as possible so as not to incur too high a time penalty, but not so early as to risk paying the cost of misclassification. This problem has been particularly studied in the case of time series and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still no systematic, shared evaluation protocol for comparing the relative merits of the various existing methods. In this paper, we highlight the two components of an ECTS system, decision and prediction, and focus on the approaches that separate them. The paper first situates these methods within a principle-based taxonomy, defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the-art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see https://github.com/ML-EDM/ml_edm).
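The decision/prediction separation the abstract highlights can be conveyed with a minimal sketch: one classifier per prefix length (prediction) plus a confidence-threshold trigger (decision). This toy example only illustrates the structure; the methods benchmarked in ml_edm are considerably more sophisticated.
```python
# Minimal sketch of the decision/prediction split in ECTS: one classifier per
# observation length ("prediction") and a confidence-threshold trigger
# ("decision"). Data, labels, and threshold are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
T, n = 50, 200
X = np.cumsum(rng.normal(size=(n, T)), axis=1)        # toy time series
y = (X[:, -1] > 0).astype(int)                        # toy labels

# Prediction component: a classifier fitted on each prefix length.
lengths = [10, 20, 30, 40, 50]
clfs = {t: LogisticRegression(max_iter=1000).fit(X[:, :t], y) for t in lengths}

def classify_early(series, threshold=0.8):
    # Decision component: stop as soon as the classifier is confident enough,
    # or at the final length if the threshold is never reached.
    for t in lengths:
        proba = clfs[t].predict_proba(series[:t].reshape(1, -1))[0]
        if proba.max() >= threshold or t == lengths[-1]:
            return t, int(proba.argmax())

t_star, label = classify_early(X[0])
print(f"triggered at t={t_star}, predicted class {label}")
```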
URL: https://openreview.net/forum?id=bcNDYmBicK
---
New submissions
===============
Title: Attention-Based Reward Shaping for Sparse and Delayed Rewards
Abstract: Sparse and delayed reward functions pose a significant obstacle for real-world Reinforcement Learning (RL) applications. In this work, we propose Attention-based REward Shaping (ARES), a general and robust algorithm which uses a transformer's attention mechanism to generate shaped rewards and create a dense reward function for any environment. ARES requires a set of episodes and their final returns as input. It can be trained entirely offline and is able to generate meaningful shaped rewards even when using small datasets or episodes produced by agents taking random actions. ARES is compatible with any RL algorithm and can handle any level of reward sparsity. In our experiments, we focus on the most challenging case where rewards are fully delayed until the end of each episode. We evaluate ARES across a diverse range of environments, widely used RL algorithms, and baseline methods to assess the effectiveness of the shaped rewards it produces. Our results show that ARES can indeed improve learning in delayed reward settings, enabling RL agents to train in scenarios that would otherwise require impractical amounts of data or even be unlearnable, though there remain some cases where ARES is not successful in doing so. To our knowledge, ARES is the first approach that works fully offline, remains robust to extreme reward delays and low-quality data, and is not limited to goal-based tasks.
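To convey the general idea of attention-based return redistribution (not ARES itself, whose details are not given in the abstract), the sketch below spreads a single delayed episode return across timesteps in proportion to attention weights produced by a small attention layer.
```python
# Hypothetical sketch of attention-based return redistribution: attend from a
# query over the episode's state-action embeddings and spread the delayed
# episode return across timesteps in proportion to the attention weights.
# This illustrates the general idea only, not ARES.
import torch
import torch.nn as nn

T, d = 20, 16                               # episode length, embedding dim
episode = torch.randn(1, T, d)              # embedded (state, action) pairs
episode_return = torch.tensor(5.0)          # single delayed return

attn = nn.MultiheadAttention(embed_dim=d, num_heads=1, batch_first=True)
query = torch.randn(1, 1, d)                # "return" query (random stand-in)

_, weights = attn(query, episode, episode, need_weights=True)
shaped_rewards = weights.squeeze() * episode_return   # one reward per timestep
print(shaped_rewards.shape, shaped_rewards.sum())     # (20,), sums to ~5.0
```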
URL: https://openreview.net/forum?id=Vl0SOQWJ6Y
---
Title: Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Abstract: Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, where human judgments involve subtle reasoning beyond annotation labels. Thinking traces, the reasoning behind a judgment, are highly informative but challenging to collect and curate. We present a human-LLM collaborative framework to infer thinking traces from label-only annotations. The proposed framework uses a simple and effective rejection sampling method to reconstruct these traces at scale. These inferred thinking traces are applied to two complementary tasks: (1) fine-tuning open LLM raters; and (2) synthesizing clearer annotation guidelines for proprietary LLM raters. Across multiple datasets, our methods lead to significantly improved LLM-human agreement. Additionally, the refined annotation guidelines increase agreement among different LLM models. These results suggest that LLMs can serve as practical proxies for otherwise unrevealed human thinking traces, enabling label-only corpora to be extended into thinking-trace-augmented resources that enhance the reliability of LLM raters.
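A minimal sketch of the rejection-sampling idea: sample candidate rationales for a labeled example and keep only those whose final judgment agrees with the human label. The `generate_trace` stub stands in for a real LLM call; its interface is a hypothetical assumption.
```python
# Minimal sketch of rejection sampling for inferring thinking traces: sample
# candidate rationales and keep only those whose final judgment matches the
# human label. `generate_trace` is a stand-in for an actual LLM call.
import random

def generate_trace(text: str, rng: random.Random) -> tuple[str, str]:
    # Placeholder: a real implementation would prompt an LLM to produce a
    # rationale that ends in a label. Here both are faked for illustration.
    label = rng.choice(["positive", "negative"])
    return f"The tone of '{text}' suggests it is {label}.", label

def infer_traces(text: str, human_label: str, n_samples=8, seed=0):
    rng = random.Random(seed)
    kept = []
    for _ in range(n_samples):
        trace, predicted = generate_trace(text, rng)
        if predicted == human_label:        # rejection step
            kept.append(trace)
    return kept

print(infer_traces("Great service, would come back!", "positive"))
```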
URL: https://openreview.net/forum?id=1jLQ629Yps
---
Title: Learning Robust Penetration Testing Policies under Partial Observability: A systematic evaluation
Abstract: Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem well-suited for reinforcement learning (RL) automation. As in many applications of RL to real-world problems, partial observability presents a major challenge, since it invalidates the Markov property of Markov Decision Processes (MDPs); Partially Observable MDPs require history aggregation or belief-state estimation to learn successful policies. We investigate stochastic, partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity through more challenging and representative benchmarks. This approach leads to the development of more robust and transferable policies, which are crucial for ensuring reliable performance across diverse and unpredictable real-world environments. Using vanilla Proximal Policy Optimization (PPO) as a baseline, we compare a selection of PPO variants designed to mitigate partial observability, including frame stacking, augmenting observations with historical information, and employing recurrent or transformer-based architectures. We conduct a systematic empirical analysis of these algorithms across different host network sizes. We find that this task benefits greatly from history aggregation, which converges three times faster than the other approaches. Manual inspection of the policies learned by the algorithms reveals clear distinctions and provides insights that go beyond the quantitative results.
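Frame stacking, one of the history-aggregation strategies compared in the paper, can be sketched as a simple observation wrapper that concatenates the last k observations so that a memoryless policy such as vanilla PPO sees a short history. The wrapper below is a standalone illustration, not tied to any particular pentesting environment.
```python
# Simple frame-stacking sketch: keep the last k observations in a buffer and
# concatenate them into one vector. Illustration only; real setups would wrap
# an RL environment's reset/step interface.
import numpy as np
from collections import deque

class FrameStacker:
    def __init__(self, k: int, obs_dim: int):
        self.k = k
        self.frames = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def reset(self, obs: np.ndarray) -> np.ndarray:
        # Fill the buffer with the initial observation.
        self.frames = deque([obs] * self.k, maxlen=self.k)
        return np.concatenate(list(self.frames))

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)              # oldest frame is dropped
        return np.concatenate(list(self.frames))

stacker = FrameStacker(k=4, obs_dim=8)
stacked = stacker.reset(np.random.rand(8))
stacked = stacker.step(np.random.rand(8))
print(stacked.shape)                         # (32,) = 4 stacked observations
```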
URL: https://openreview.net/forum?id=YkUV7wfk19
---
Title: FinMaster: A Holistic Benchmark for Full-Pipeline Financial Management with Large Language Models
Abstract: Financial management tasks are pivotal to global economic stability; however, their efficient execution faces persistent challenges, including labor-intensive processes, low error tolerance, data fragmentation, and limitations in existing technological tools. Although large language models (LLMs) have shown remarkable success in various natural language processing (NLP) tasks and have demonstrated potential in automating workflows through reasoning and contextual understanding, current benchmarks for evaluating LLMs in finance suffer from insufficient domain-specific data, simplistic task design, and incomplete evaluation frameworks. To address these gaps, we present FinMaster, a comprehensive financial management benchmark designed to systematically assess the capabilities of LLMs in financial literacy, accounting, auditing, and consulting. Specifically, FinMaster comprises three main modules: i) FinSim, which builds simulators that generate synthetic, privacy-compliant financial datasets for different types of companies to replicate real-world market dynamics; ii) FinSuite, which provides 183 tasks of varying types and difficulty levels across core financial domains; and iii) FinEval, a unified framework for streamlined evaluation. Extensive experiments on state-of-the-art LLMs, such as GPT-4o-mini, Claude-3.7-Sonnet, and DeepSeek-V3, reveal critical capability gaps in financial reasoning, with accuracy dropping from over 90% on basic tasks to merely 40% on complex scenarios requiring multi-step reasoning. This degradation reflects the propagation of computational errors: single-metric calculations achieve 58% accuracy, which drops to 37% in multi-metric scenarios. To the best of our knowledge, FinMaster is the first benchmark to comprehensively cover full-pipeline financial workflows with challenging and realistic tasks. We hope that FinMaster can bridge the gap between the research community and industry practitioners, driving the adoption of LLMs in real-world financial practices to enhance both efficiency and accuracy.
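A FinEval-style unified evaluation can be pictured as a small harness that scores model answers against gold answers per task and aggregates accuracy by difficulty; the task fields and exact-match metric below are illustrative assumptions, not FinMaster's actual implementation.
```python
# Hypothetical evaluation-harness sketch: exact-match scoring per task,
# aggregated by difficulty level. Task schema and metric are assumptions.
from collections import defaultdict

tasks = [
    {"id": 1, "difficulty": "basic",   "gold": "1200.00", "model_answer": "1200.00"},
    {"id": 2, "difficulty": "basic",   "gold": "0.35",    "model_answer": "0.35"},
    {"id": 3, "difficulty": "complex", "gold": "4875.50", "model_answer": "4912.00"},
]

def evaluate(tasks):
    hits, totals = defaultdict(int), defaultdict(int)
    for t in tasks:
        totals[t["difficulty"]] += 1
        if t["model_answer"].strip() == t["gold"].strip():   # exact match
            hits[t["difficulty"]] += 1
    return {d: hits[d] / totals[d] for d in totals}

print(evaluate(tasks))   # e.g. {'basic': 1.0, 'complex': 0.0}
```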
URL: https://openreview.net/forum?id=zCMMlKzbEe
---