Accepted papers
===============
Title: FlowBench: Benchmarking Optical Flow Estimation Methods for Reliability and Generalization
Authors: Shashank Agnihotri, Julian Yuya Caspary, Luca Schwarz, Xinyan Gao, Jenny Schmalfuss, Andres Bruhn, Margret Keuper
Abstract: Optical flow estimation is a crucial computer vision task often applied to safety-critical real-world scenarios like autonomous driving and medical imaging.
While optical flow estimation accuracy has greatly benefited from the emergence of deep learning, learning-based methods are also known for their lack of generalization and reliability.
However, reliability is paramount when optical flow methods are employed in the real world, where safety is essential.
Furthermore, a deeper understanding of the robustness and reliability of learning-based optical flow estimation methods is still lacking, hindering the research community from building methods safe for real-world deployment.
Thus, we propose FlowBench, a robustness benchmark and evaluation tool for learning-based optical flow methods.
FlowBench facilitates streamlined research into the reliability of optical flow methods by benchmarking their robustness to adversarial attacks and out-of-distribution samples.
With FlowBench, we benchmark 57 checkpoints across 3 datasets under 9 diverse adversarial attacks and 23 established common corruptions, making it the most comprehensive robustness analysis of optical flow methods to date.
Across this wide range of methods, we consistently find that methods with state-of-the-art performance on established standard benchmarks lack reliability and generalization ability.
Moreover, we find interesting correlations between the performance, reliability, and generalization ability of optical flow estimation methods under various lenses, such as the design choices used and the number of parameters.
The open-source code and weights for FlowBench are available in this GitHub repository: https://github.com/shashankskagnihotri/FlowBench.
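
A minimal illustration of the kind of robustness check FlowBench automates: corrupt both input frames with one common corruption (additive Gaussian noise here) and compare the end-point error (EPE) against the clean result. The model, data, and corruption below are illustrative stand-ins, not FlowBench's actual interface; the real evaluation tool is in the repository linked above.

    # Sketch of a clean-vs-corrupted robustness probe for an optical flow model.
    # All names are illustrative stand-ins; see the FlowBench repository for the real API.
    import torch

    def epe(pred, gt):
        # End-point error: mean Euclidean distance between predicted and
        # ground-truth flow vectors; flows have shape (B, 2, H, W).
        return torch.linalg.vector_norm(pred - gt, dim=1).mean()

    def gaussian_noise(frames, sigma=0.08):
        # One of many possible common corruptions: additive Gaussian pixel noise.
        return (frames + sigma * torch.randn_like(frames)).clamp(0.0, 1.0)

    # Hypothetical stand-ins for a trained flow model and one labelled sample.
    model = lambda f1, f2: torch.zeros(f1.shape[0], 2, *f1.shape[2:])  # dummy predictor
    frame1 = torch.rand(1, 3, 64, 64)
    frame2 = torch.rand(1, 3, 64, 64)
    flow_gt = torch.zeros(1, 2, 64, 64)

    with torch.no_grad():
        clean_epe = epe(model(frame1, frame2), flow_gt).item()
        corrupt_epe = epe(model(gaussian_noise(frame1), gaussian_noise(frame2)), flow_gt).item()

    print(f"clean EPE: {clean_epe:.3f}, corrupted EPE: {corrupt_epe:.3f}, "
          f"degradation: {corrupt_epe - clean_epe:.3f}")
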
URL: https://openreview.net/forum?id=Kh4bj6YDNm
---
Title: Unbiased Loss Functions for Multilabel Classification with Missing Labels
Authors: Erik Schultheis, Rohit Babbar
Abstract: This paper considers binary and multilabel classification problems in a
setting where labels are missing independently and with a known rate. Missing
labels are a ubiquitous phenomenon in extreme multi-label classification (XMC)
tasks, such as matching Wikipedia articles to a small subset out of the
hundreds of thousands of possible tags, where no human annotator can possibly
check the validity of all the negative samples. For this reason,
propensity-scored precision---an unbiased estimate for precision-at-k under a
known noise model---has become one of the standard metrics in XMC. Few
methods take this problem into account as early as the training phase, and
all of these are limited to loss functions that can be decomposed into a sum
of contributions from each individual label. A typical approach to training is
to reduce the multilabel problem to a series of binary or multiclass
problems, and it has been shown that if the surrogate task is to be
consistent for optimizing recall, the resulting loss function is not
decomposable over labels. Therefore, this paper develops unbiased estimators
for generic, potentially non-decomposable loss functions. These estimators
suffer from increased variance and may lead to ill-posed optimization
problems, which we address by switching to convex upper bounds. The
theoretical considerations are further supplemented by an experimental study
showing that the switch to unbiased estimators significantly alters the
bias-variance trade-off and thus requires stronger regularization.
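
For the label-decomposable case the abstract contrasts itself with, the standard propensity-scoring construction already gives an unbiased loss: if a relevant label is observed with known propensity p_l, weighting each observed positive's loss by 1/p_l (and compensating with the negative-label loss) recovers the fully labelled loss in expectation. The sketch below illustrates that baseline construction with a logistic loss and illustrative names; the paper's estimators for non-decomposable losses are not reproduced here.

    # Unbiased reweighting for a label-decomposable loss under independently
    # missing positives with known propensities p (illustrative sketch only).
    import numpy as np

    def loss_pos(score):   # loss contribution if the label is relevant
        return np.log1p(np.exp(-score))

    def loss_neg(score):   # loss contribution if the label is irrelevant
        return np.log1p(np.exp(score))

    def unbiased_decomposable_loss(scores, observed, propensity):
        # observed[l] = 1 only if label l is relevant AND was not dropped;
        # E[observed[l]] = propensity[l] * true_label[l].
        pos_term = loss_pos(scores) / propensity + (1.0 - 1.0 / propensity) * loss_neg(scores)
        return np.sum(observed * pos_term + (1.0 - observed) * loss_neg(scores))

    # Tiny sanity check: averaging over many resampled "observed" vectors
    # should recover the fully labelled loss.
    rng = np.random.default_rng(0)
    scores = rng.normal(size=5)
    true_labels = np.array([1, 0, 1, 1, 0], dtype=float)
    p = np.array([0.3, 0.9, 0.5, 0.7, 0.8])
    full_loss = np.sum(true_labels * loss_pos(scores) + (1 - true_labels) * loss_neg(scores))
    estimates = [
        unbiased_decomposable_loss(scores, true_labels * (rng.random(5) < p), p)
        for _ in range(20000)
    ]
    print(full_loss, np.mean(estimates))  # the two values should be close
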
URL: https://openreview.net/forum?id=hMq1hUhLqp
---
Title: Efficient pooling of predictions via kernel embeddings
Authors: Sam Allen, David Ginsbourger, Johanna Ziegel
Abstract: Probabilistic predictions are probability distributions over the set of possible outcomes. Such predictions quantify the uncertainty in the outcome, making them essential for effective decision making. By combining multiple predictions, the information sources used to generate the predictions are pooled, often resulting in a more informative forecast. Probabilistic predictions are typically combined by linearly pooling the individual predictive distributions; this encompasses several ensemble learning techniques, for example. The weights assigned to each prediction can be estimated based on their past performance, allowing more accurate predictions to receive a higher weight. This can be achieved by finding the weights that optimise a proper scoring rule over some training data. By embedding predictions into a Reproducing Kernel Hilbert Space (RKHS), we illustrate that estimating the linear pool weights that optimise kernel-based scoring rules is a convex quadratic optimisation problem. This permits an efficient implementation of the linear pool when optimally combining predictions on arbitrary outcome domains. This result also holds for other combination strategies, and we additionally study a flexible generalisation of the linear pool that overcomes some of its theoretical limitations, whilst allowing an efficient implementation within the RKHS framework. These approaches are compared in an application to operational wind speed forecasts, where this generalisation is found to offer substantial improvements upon the traditional linear pool.
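
The convexity result can be illustrated for sample-based forecasts: under a kernel score, the empirical score of the linear pool is a quadratic form w'Aw - 2b'w in the weights, with A built from pairwise kernel evaluations between forecasters' samples and b from kernel evaluations against the observations, so the optimal weights solve a small convex quadratic optimisation problem over the probability simplex. The sketch below uses a Gaussian kernel and a generic constrained solver; it is an illustration under these assumptions, not the authors' implementation.

    # Estimating linear-pool weights by minimising an empirical kernel score.
    # Each forecaster i provides samples from its predictive distribution at
    # each time t; the pooled score is quadratic in the weights (illustrative).
    import numpy as np
    from scipy.optimize import minimize

    def gauss_kernel(x, y, bw=1.0):
        # Gram matrix of a Gaussian kernel between two 1-D sample arrays.
        return np.exp(-0.5 * ((x[:, None] - y[None, :]) / bw) ** 2)

    def pool_weights(samples, obs, bw=1.0):
        # samples[t][i]: 1-D array of samples from forecaster i at time t
        # obs[t]: realised outcome at time t
        T, M = len(samples), len(samples[0])
        A = np.zeros((M, M))
        b = np.zeros(M)
        for t in range(T):
            for i in range(M):
                b[i] += gauss_kernel(samples[t][i], np.array([obs[t]]), bw).mean()
                for j in range(M):
                    A[i, j] += gauss_kernel(samples[t][i], samples[t][j], bw).mean()
        A, b = A / T, b / T
        objective = lambda w: w @ A @ w - 2.0 * b @ w  # kernel score up to a constant
        w0 = np.full(M, 1.0 / M)
        res = minimize(objective, w0, method="SLSQP",
                       bounds=[(0.0, 1.0)] * M,
                       constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
        return res.x

    # Toy usage: forecaster 0 is centred on the truth, forecaster 1 is biased.
    rng = np.random.default_rng(1)
    obs = rng.normal(size=50)
    samples = [[o + rng.normal(size=200), o + 2.0 + rng.normal(size=200)] for o in obs]
    print(pool_weights(samples, obs))  # weight on forecaster 0 should dominate
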
URL: https://openreview.net/forum?id=dji9MfONNP
---
New submissions
===============
Title: Uncertainty Quantification in SVM prediction
Abstract: This paper explores SVM models developed for regression and forecasting tasks through the lens of uncertainty quantification (UQ). Unlike neural network solutions, SVM solutions are typically more certain, stable, sparse, and interpretable, and they are globally optimal. However, there is only limited literature addressing UQ in SVM-based prediction. First, we provide a comprehensive summary of existing Prediction Interval (PI) estimation and probabilistic forecasting methods developed in the SVM framework. Although SVMs offer globally optimal and stable solutions, the existing literature on UQ within the SVM framework still exhibits several critical gaps. In this work, we address these gaps and extend contemporary UQ techniques to SVMs, promoting their applicability across diverse domains for more reliable estimation. Our major contributions include the development of sparse SVM models for PI estimation and probabilistic forecasting, an investigation of the role of feature selection in PI estimation, and the extension of SVM regression to the Conformal Regression (CR) setting to construct more stable prediction sets with finite-sample guarantees. Extensive numerical experiments highlight that SVM-based UQ methods yield PIs and probabilistic forecasts that are less uncertain than, and comparable to or even better than, those produced by modern, more complex deep learning models.
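
The conformal direction mentioned in the abstract can be illustrated with plain split conformal prediction around a support vector regressor: calibrate a residual quantile on held-out data and report f(x) ± q, which carries the usual finite-sample marginal coverage guarantee. This is a generic sketch using scikit-learn's SVR and synthetic data, not the authors' specific sparse or stabilised construction.

    # Split conformal prediction intervals around support vector regression
    # (generic sketch; the paper's sparse/stabilised variants are not reproduced).
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(600, 1))
    y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=600)          # synthetic data

    X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
    model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X_fit, y_fit)

    alpha = 0.1                                               # target 90% coverage
    scores = np.abs(y_cal - model.predict(X_cal))             # calibration residuals
    n = len(scores)
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n), method="higher")

    X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
    pred = model.predict(X_test)
    lower, upper = pred - q, pred + q                         # conformal prediction band
    print(np.column_stack([X_test[:, 0], lower, upper]))
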
URL: https://openreview.net/forum?id=LrazQEG1QW
---
Title: Lexical Hints of Accuracy in LLM Reasoning Chains
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering yields models that consistently achieve higher overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexical hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity's Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., "guess", "stuck", "hard") in the CoT are the strongest indicators of an incorrect response, while shifts in CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high (≈70%), and carries no signal on the harder HLE (≈9%), indicating that CoT length predicts correctness only on intermediate-difficulty benchmarks, i.e., inside the model's demonstrated capability but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
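
The lexical-marker signal is straightforward to prototype: count hedging terms in each chain-of-thought and use the counts (plus length) as features for predicting whether the final answer is correct. The word list, toy traces, and classifier below are illustrative only, not the paper's exact feature set or data.

    # Toy version of the lexical-uncertainty signal: hedging-word counts in a
    # chain-of-thought as features for predicting answer correctness.
    # Word list and data are illustrative only.
    import re
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    HEDGE_WORDS = {"guess", "stuck", "hard", "unsure", "maybe", "might", "unclear"}

    def cot_features(cot):
        tokens = re.findall(r"[a-z']+", cot.lower())
        n_hedges = sum(tok in HEDGE_WORDS for tok in tokens)
        return [n_hedges, n_hedges / max(len(tokens), 1), float(len(tokens))]

    # Hypothetical (CoT, answer-was-correct) pairs standing in for benchmark traces.
    traces = [
        ("The integral telescopes, so the sum is clearly 42.", 1),
        ("I'm stuck here; my best guess is that the answer might be 7.", 0),
        ("Applying the lemma directly gives the stated bound.", 1),
        ("This is hard and I'm unsure which case applies, maybe the second.", 0),
    ]
    X = np.array([cot_features(cot) for cot, _ in traces])
    y = np.array([label for _, label in traces])

    clf = LogisticRegression().fit(X, y)
    # Estimated probability that an answer preceded by this CoT is correct.
    print(clf.predict_proba([cot_features("I might be wrong, just a guess.")])[0, 1])
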
URL: https://openreview.net/forum?id=kSn0NTNVaK
---