Weekly TMLR digest for Oct 12, 2025


TMLR

Oct 12, 2025, 12:00:15 AM

to tmlr-annou...@googlegroups.com

New certifications
==================

Expert Certification: On Convolutions, Intrinsic Dimension, and Diffusion Models

Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem

https://openreview.net/forum?id=xSzBf1te4s

---

Survey Certification: Reliable and Responsible Foundation Models

Xinyu Yang, Junlin Han, Rishi Bommasani, Jinqi Luo, Wenjie Qu, Wangchunshu Zhou, Adel Bibi, Xiyao Wang, Jaehong Yoon, Elias Stengel-Eskin, Shengbang Tong, Lingfeng Shen, Rafael Rafailov, Runjia Li, Zhaoyang Wang, Yiyang Zhou, Chenhang Cui, Yu Wang, Wenhao Zheng, Huichi Zhou, Jindong Gu, Zhaorun Chen, Peng Xia, Tony Lee, Thomas P Zollo, Vikash Sehwag, Jixuan Leng, Jiuhai Chen, Yuxin Wen, Huan Zhang, Zhun Deng, Linjun Zhang, Pavel Izmailov, Pang Wei Koh, Yulia Tsvetkov, Andrew Gordon Wilson, Jiaheng Zhang, James Zou, Cihang Xie, Hao Wang, Philip Torr, Julian McAuley, David Alvarez-Melis, Florian Tramèr, Kaidi Xu, Suman Jana, Chris Callison-Burch, Rene Vidal, Filippos Kokkinos, Mohit Bansal, Beidi Chen, Huaxiu Yao

https://openreview.net/forum?id=nLJZh4M6S5

---

Accepted papers
===============

Title: Pre-Training Representations of Binary Code Using Contrastive Learning

Authors: Yifan Zhang, Chen Huang, Yueke Zhang, Huajie Shao, Kevin Leach, Yu Huang

Abstract: Binary code analysis and comprehension is critical to applications in reverse engineering and computer security tasks where source code is not available. Unfortunately, unlike source code, binary code lacks semantics and is more difficult for human engineers to understand and analyze. In this paper, we present ContraBin, a contrastive learning technique that integrates source code and comment information along with binaries to create an embedding capable of aiding binary analysis and comprehension tasks. Specifically, we present three components in ContraBin: (1) a primary contrastive learning method for initial pre-training, (2) a simplex interpolation method to integrate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to train a binary code embedding. We further analyze the impact of human-written and synthetic comments on binary code comprehension tasks, revealing a significant performance disparity. While synthetic comments provide substantial benefits, human-written comments are found to introduce noise, even resulting in performance drops compared to using no comments. These findings reshape the narrative around the role of comment types in binary code analysis. We evaluate the effectiveness of ContraBin through four indicative downstream tasks related to binary code: algorithmic functionality classification, function name recovery, code summarization, and reverse engineering. The results show that ContraBin considerably improves performance on all four tasks, measured by accuracy, mean of average precision, and BLEU scores as appropriate. ContraBin is the first language representation model to incorporate source code, binary code, and comments into contrastive code representation learning and is intended to contribute to the field of binary code analysis. The dataset used in this study is available for further research.

URL: https://openreview.net/forum?id=qmfUL6D0iz

---

Title: Permissive Information-Flow Analysis for Large Language Models

Authors: Shoaib Ahmed Siddiqui, Radhika Gaonkar, Boris Köpf, David Krueger, Andrew Paverd, Ahmed Salem, Shruti Tople, Lukas Wutschitz, Menglin Xia, Santiago Zanella-Beguelin

Abstract: Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. Assuming each piece of information comes with an additional meta-label (such as low/high integrity labels), one promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources.

In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the input labels that were \emph{influential} in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a $k$-nearest-neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that our label propagator assigns a more permissive label over the baseline in more than 85% of the cases, which underscores the practicality of our approach.

URL: https://openreview.net/forum?id=ufYRO8y3mr
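
As a rough illustration of the contrast drawn in the abstract above, the sketch below compares conservative taint propagation (the output takes the most restrictive input label) with a permissive variant that joins only the labels of influential inputs. The `Integrity` lattice, the function names, and the `influential` flags are assumptions made for this example. Deciding which inputs were influential is the paper's actual contribution and is not modelled here.

```python
from enum import IntEnum


class Integrity(IntEnum):
    """Two-point integrity lattice; a larger value is more restrictive."""
    HIGH = 0  # trusted input
    LOW = 1   # untrusted (possibly poisoned) input


def conservative_label(labels):
    """Classic taint tracking: output takes the most restrictive input label."""
    return max(labels, default=Integrity.HIGH)


def permissive_label(labels, influential):
    """Join only the labels of inputs flagged as influential.

    `influential` is a hypothetical list of booleans supplied by the caller;
    how influence is decided is outside the scope of this sketch.
    """
    kept = [label for label, used in zip(labels, influential) if used]
    return max(kept, default=Integrity.HIGH)


labels = [Integrity.HIGH, Integrity.LOW, Integrity.HIGH]
influential = [True, False, True]  # the low-integrity input was not used

print(conservative_label(labels).name)             # LOW
print(permissive_label(labels, influential).name)  # HIGH
```

When every input is flagged as influential, the two functions agree, so the permissive rule is never more restrictive than the conservative one.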

---

Title: An Asymptotically Optimal Algorithm for the Convex Hull Membership Problem

Authors: Gang Qiao, Ambuj Tewari

Abstract: We study the convex hull membership (CHM) problem in the pure exploration setting where one aims to efficiently and accurately determine if a given point lies in the convex hull of means of a finite set of distributions. We give a complete characterization of the sample complexity of the CHM problem in the one-dimensional case. We present the first asymptotically optimal algorithm called Thompson-CHM, whose modular design consists of a stopping rule and a sampling rule. In addition, we extend the algorithm to settings that generalize several important problems in the multi-armed bandit literature. Furthermore, we discuss the extension of Thompson-CHM to higher dimensions. Finally, we provide numerical experiments to demonstrate that the empirical behavior of the algorithm matches our theoretical results for realistic time horizons.

URL: https://openreview.net/forum?id=r8eAwBMtlN

---

Title: PixelWorld: Towards Perceiving Everything as Pixels

Authors: Zhiheng Lyu, Xueguang Ma, Wenhu Chen

Abstract: Recent agentic language models increasingly accept raw camera pixels rather than tokenized text, underscoring the need for a unified perception paradigm. We explore this idea through Perceive Everything as Pixels (PEAP) and release PixelWorld, a benchmark that renders natural-language, tabular, mathematical and diagrammatic inputs into a single pixel space. Experiments show that PEAP attains competitive accuracy on semantic-understanding tasks, indicating that a vision transformer can capture global textual semantics without explicit tokens. In contrast, reasoning-intensive benchmarks (math and code) exhibit sharp performance drops; however, Chain-of-Thought prompting partially mitigates this gap, hinting that explicit reasoning traces compensate for the missing token structure. We also observe that scenarios with tightly intertwined visual--text cues benefit from the unified pixel view, reducing preprocessing overhead and ambiguity relative to split-modality baselines. PixelWorld therefore provides a compact yet challenging yardstick and encourages wider adoption of PEAP for holistic evaluation of next-generation vision–language agents.

URL: https://openreview.net/forum?id=uY5eDN2bML

---

Title: Teaching Diffusion Models to Ground Alpha Matte

Authors: Tianyi Xiang, Weiying Zheng, Yutao Jiang, Tingrui Shen, Hewei Yu, Yangyang Xu, Shengfeng He

Abstract: The power of visual language models is showcased in visual understanding tasks, where language-guided models achieve impressive flexibility and precision. In this paper, we extend this capability to the challenging domain of image matting by framing it as a soft grounding problem, enabling a single diffusion model to handle diverse objects, textures, and transparencies, all directed by descriptive text prompts. Our method teaches the diffusion model to ground alpha mattes by guiding it through a process of instance-level localization and transparency estimation. First, we introduce an intermediate objective that trains the model to accurately localize semantic components of the matte based on natural language cues, establishing a robust spatial foundation. Building on this, the model progressively refines its transparency estimation abilities, using the learned semantic structure as a prior to enhance the precision of alpha matte predictions. By treating spatial localization and transparency estimation as distinct learning objectives, our approach allows the model to fully leverage the semantic depth of diffusion models, removing the need for rigid visual priors. Extensive experiments highlight our model’s adaptability, precision, and computational efficiency, setting a new benchmark for flexible, text-driven image matting solutions.

URL: https://openreview.net/forum?id=2gNy9Yeg8J

---

Title: On Convolutions, Intrinsic Dimension, and Diffusion Models

Authors: Kin Kwan Leung, Rasa Hosseinzadeh, Gabriel Loaiza-Ganem

Abstract: The manifold hypothesis asserts that data of interest in high-dimensional ambient spaces, such as image data, lies on unknown low-dimensional submanifolds. Diffusion models (DMs) -- which operate by convolving data with progressively larger amounts of Gaussian noise and then learning to revert this process -- have risen to prominence as the most performant generative models, and are known to be able to learn distributions with low-dimensional support. For a given datum in one of these submanifolds, we should thus intuitively expect DMs to have implicitly learned its corresponding local intrinsic dimension (LID), i.e. the dimension of the submanifold it belongs to. Kamkari et al. (2024b) recently showed that this is indeed the case by linking this LID to the rate of change of the log marginal densities of the DM with respect to the amount of added noise, resulting in an LID estimator known as FLIPD. LID estimators such as FLIPD have a plethora of uses, among others they quantify the complexity of a given datum, and can be used to detect outliers, adversarial examples and AI-generated text. FLIPD achieves state-of-the-art performance at LID estimation, yet its theoretical underpinnings are incomplete since Kamkari et al. (2024b) only proved its correctness under the highly unrealistic assumption of affine submanifolds. In this work we bridge this gap by formally proving the correctness of FLIPD under realistic assumptions. Additionally, we show that an analogous result holds when Gaussian convolutions are replaced with uniform ones, and discuss the relevance of this result.

URL: https://openreview.net/forum?id=xSzBf1te4s

---

Title: Equivalent Linear Mappings of Large Language Models

Authors: James Robert Golden

Abstract: Despite significant progress in transformer interpretability, an understanding of the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network's hidden representations but remain agnostic about how those representations are generated. We address this by mapping LLM inference for a given input sequence to an equivalent and interpretable linear system which reconstructs the predicted output embedding with relative error below $10^{-13}$ at double floating-point precision, requiring no additional model training. We exploit a property of transformer decoders wherein every operation (gated activations, attention, and normalization) can be expressed as $A(x) \cdot x$, where $A(x)$ represents an input-dependent linear transform and $x$ preserves the linear pathway. To expose this linear structure, we strategically detach components of the gradient computation with respect to an input sequence, freezing the $A(x)$ terms at their values computed during inference, such that the Jacobian yields an equivalent linear mapping. This "detached" Jacobian of the model reconstructs the output with one linear operator per input token, which is shown for Qwen 3, Gemma 3 and Llama 3, up to Qwen 3 14B. These linear representations demonstrate that LLMs operate in extremely low-dimensional subspaces where the singular vectors can be decoded to interpretable semantic concepts. The computation for each intermediate output also has a linear equivalent, and we examine how the linear representations of individual layers and their attention and multilayer perceptron modules build predictions, and use these as steering operators to insert semantic concepts into unrelated text. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through equivalent linear representations that reveal low-dimensional semantic structures in the next-token prediction process.
Code is available at https://github.com/jamesgolden1/equivalent-linear-LLMs/.

URL: https://openreview.net/forum?id=oDWbJsIuEp

---

Title: Where are we with calibration under dataset shift in image classification?

Authors: Mélanie Roschewitz, Raghav Mehta, Fabio De Sousa Ribeiro, Ben Glocker

Abstract: We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yields the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.

URL: https://openreview.net/forum?id=1NYKXlRU2H

---

Title: Diversity-Enhanced and Classification-Aware Prompt Learning for Few-Shot Learning via Stable Diffusion

Authors: Gaoqin Chang, Jun Shu, Xiang Yuan, Deyu Meng

Abstract: Recent text-to-image generative models have exhibited an impressive ability to generate fairly realistic images from text prompts. In this work, we explore leveraging off-the-shelf text-to-image generative models to train non-specific downstream few-shot classification model architectures, using synthetic datasets to classify real images. Current approaches use hand-crafted or model-generated text prompts to make text-to-image generative models produce the desired synthetic images; however, they have limited capability to generate diverse images. In particular, their synthetic datasets have relatively limited relevance to the downstream classification tasks. This makes it hard to guarantee that models trained on synthetic images are efficient in practice. To address this issue, we propose a method capable of adaptively learning proper text prompts for the off-the-shelf diffusion model to generate diverse and classification-aware synthetic images. Our approach shows consistent improvements on various classification datasets, with results comparable to existing prompt designing methods. We find that replacing the data generation strategy of existing zero/few-shot methods with the proposed method could consistently improve downstream classification performance across different network architectures, demonstrating its model-agnostic potential for few-shot learning. This makes it possible to train an efficient downstream few-shot learning model from synthetic images generated by the proposed method for real problems.

URL: https://openreview.net/forum?id=4CfliohyqK

---

Title: Understanding Self-supervised Contrastive Learning through Supervised Objectives

Authors: Byeongchan Lee

Abstract: Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.

URL: https://openreview.net/forum?id=cmE97KX2XM
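
For readers unfamiliar with the InfoNCE loss mentioned in the abstract above, here is a minimal sketch of its standard form for a single anchor: the negative log-probability of the positive pair under a temperature-scaled softmax over all candidate similarities. The function name, argument names, and default temperature are assumptions for this example. It does not implement the balanced loss or the prototype-bias analysis that the paper derives.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE for one anchor: -log softmax probability of the positive pair."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


# The loss is small when the positive is more similar than the negatives...
well_separated = info_nce(0.9, [0.1, 0.1])
# ...and large when a negative looks more similar than the positive.
confused = info_nce(0.1, [0.9, 0.9])
print(well_separated < confused)  # True
```

If the positive and all negatives have the same similarity, the loss equals the log of the number of candidates, which is a handy sanity check.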

---

Title: Is isotropy a good proxy for generalization in time series forecasting with transformers?

Authors: Rashed Shelim, Shengzhe Xu, Walid Saad, Naren Ramakrishnan

Abstract: Vector representations of contextual embeddings learned by transformer-based models have been shown to be effective even for downstream tasks in \emph{numerical domains} such as time series forecasting. Their success in capturing long-range dependencies and contextual semantics has led to broad adoption across architectures. But at the same time, there is little theoretical understanding of when transformers, both autoregressive and non-autoregressive, generalize well to forecasting tasks. This paper addresses this gap through an analysis of isotropy in contextual embedding space. Specifically, we study a log-linear model as a simplified abstraction for studying hidden representations in transformer-based models. In this formulation, time series embeddings are mapped to predictive outputs through a softmax layer, providing a tractable lens for analyzing generalization. We show that state-of-the-art performance requires embeddings to possess a structure that accounts for the shift-invariance of the softmax function. By examining the gradient structure of self-attention, we demonstrate how isotropy preserves representation structure, resolves the shift-invariance problem, and provides insights into model reliability and generalization. Experiments across $22$ different numerical datasets and $5$ different transformer-based models show that data characteristics and architectural choices significantly affect isotropy, which in turn directly influences forecasting performance. This establishes isotropy as a theoretically grounded and empirically validated indicator of generalization and reliability in time series forecasting. The code for the isotropy analysis and all data are publicly available.

URL: https://openreview.net/forum?id=iUtDYVQzFq
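
The shift-invariance of softmax referred to in the abstract above is the standard fact that adding the same constant to every logit leaves the output probabilities unchanged. The short check below illustrates only that property. The helper name and the example numbers are assumptions, and the sketch says nothing about how the paper relates this to isotropy.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 2.0, 3.0]
shifted = [z + 10.0 for z in logits]  # add the same constant to every logit

p = softmax(logits)
q = softmax(shifted)
same = all(math.isclose(a, b, rel_tol=1e-12) for a, b in zip(p, q))
print(same)  # True: the two logit vectors are indistinguishable after softmax
```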

---

Title: Explaining Bayesian Neural Networks

Authors: Kirill Bykov, Marina MC Höhne, Adelaida Creosteanu, Klaus Robert Muller, Frederick Klauschen, Shinichi Nakajima, Marius Kloft

Abstract: To advance the transparency of learning machines such as Deep Neural Networks (DNNs), the field of Explainable AI (XAI) was established to provide interpretations of DNNs' predictions. While different explanation techniques exist, a popular approach is given in the form of attribution maps, which illustrate, given a particular data point, the relevant patterns the model has used for making its prediction. Although Bayesian models such as Bayesian Neural Networks (BNNs) have a limited form of transparency built-in through their prior weight distribution, they lack explanations of their predictions for given instances. In this work, we take a step toward combining these two perspectives by examining how local attributions can be extended to BNNs. Within the Bayesian framework, network weights follow a probability distribution; hence, the standard point explanation extends naturally to an explanation distribution. Viewing explanations probabilistically, we aggregate and analyze multiple local attributions drawn from an approximate posterior to explore variability in explanation patterns. The diversity of explanations offers a way to further explore how predictive rationales may vary across posterior samples. Quantitative and qualitative experiments on toy and benchmark data, as well as on a real-world pathology dataset, illustrate that our framework enriches standard explanations with uncertainty information and may support the visualization of explanation stability.

URL: https://openreview.net/forum?id=ZxsR4t3wJd

---

Title: Multi-model Online Conformal Prediction with Graph-Structured Feedback

Authors: Erfan Hajihashemi, Yanning Shen

Abstract: Online conformal prediction has demonstrated its capability to construct a prediction set for each incoming data point that covers the true label with a predetermined probability. To cope with potential distribution shift, multi-model online conformal prediction has been introduced to select and leverage different models from a preselected candidate set. Along with the improved flexibility, the choice of the preselected set also brings challenges. A candidate set that includes a large number of models may increase the computational complexity. In addition, the inclusion of irrelevant models with poor performance may negatively impact the performance and lead to unnecessarily large prediction sets. To address these challenges, we propose a novel multi-model online conformal prediction algorithm that identifies a subset of effective models at each time step by collecting feedback from a bipartite graph, which is refined upon receiving new data. A model is then selected from this subset to construct the prediction set, resulting in reduced computational complexity and smaller prediction sets. Additionally, we demonstrate that using prediction set size as feedback, alongside model loss, can significantly improve efficiency by constructing smaller prediction sets while still satisfying the required coverage guarantee. The proposed algorithms are proven to ensure valid coverage and achieve sublinear regret. Experiments on real and synthetic datasets validate that the proposed methods construct smaller prediction sets and outperform existing multi-model online conformal prediction approaches.

URL: https://openreview.net/forum?id=9u8ugbismg

---

Title: Private and Fair Machine Learning: Revisiting the Disparate Impact of Differentially Private SGD

Authors: Lea Demelius, Simone Kopeinik, Dominik Kowald, Roman Kern, Andreas Trügler

Abstract: Differential privacy (DP) is a prominent method for protecting information about individuals during data analysis. Training neural networks with differentially private stochastic gradient descent (DPSGD) influences the model's learning dynamics and, consequently, its output. This can affect the model's performance and fairness. While the majority of studies on the topic report a negative impact on fairness, it has recently been suggested that fairness levels comparable to non-private models can be achieved by optimizing hyperparameters for performance directly on differentially private models (rather than re-using hyperparameters from non-private models, as is common practice). In this work, we analyze the generalizability of this claim by 1) comparing the disparate impact of DPSGD on different performance metrics, and 2) analyzing it over a wide range of hyperparameter settings. We highlight that a disparate impact on one metric does not necessarily imply a disparate impact on another. Most importantly, we show that while optimizing hyperparameters directly on differentially private models does not mitigate the disparate impact of DPSGD reliably, it can still lead to improved utility-fairness trade-offs compared to re-using hyperparameters from non-private models. We stress, however, that any form of hyperparameter tuning entails additional privacy leakage, calling for careful considerations of how to balance privacy, utility and fairness. Finally, we extend our analyses to DPSGD-Global-Adapt, a variant of DPSGD designed to mitigate the disparate impact on accuracy, and conclude that this alternative may not be a robust solution with respect to hyperparameter choice.

URL: https://openreview.net/forum?id=o8zrx0bfTp
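
For context on the DPSGD mechanism the abstract above refers to, the following is a minimal sketch of the usual gradient step: clip each per-example gradient to a fixed L2 norm, sum, add Gaussian noise scaled to the clipping norm, and average. The function and parameter names are assumptions for this example, and privacy accounting is omitted. It is not the authors' experimental code.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng=None):
    """Return a clipped, noised, averaged gradient (lists of floats)."""
    rng = rng or random.Random()
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


grads = [[3.0, 4.0], [0.6, 0.8]]  # L2 norms 5.0 and 1.0
# With the noise switched off, only the effect of clipping is visible.
clipped_mean = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0)
```

In this example the first gradient is shrunk by a factor of five while the second is left untouched, so examples with large gradients contribute proportionally less to the update.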

---

Title: Exponential Scaling of Factual Inconsistency in Data-to-Text Generation with Fine-Tuned LLMs

Authors: Joy Mahapatra, Soumyajit Roy, Utpal Garain

Abstract: Data-to-text (D2T) generation is a core task in text generation that involves converting semi-structured data (e.g., tables, graphs) into text. Recent advances in large language models (LLMs) have led to significant improvements in D2T. Despite these gains, factual inconsistency remains a persistent issue in LLMs for D2T. Understanding how such inconsistencies scale with factors like model size, compute (FLOPs), and data size is crucial for building trustworthy systems. While prior scaling studies focus on generalization error via power law scaling, the impact of these factors on factual inconsistency in D2T remains unexplored. This paper addresses the gap by empirically investigating how factual inconsistency scales with various scaling factors. Unlike prior studies that focus solely on power law scaling, we also examine exponential scaling. To rigorously compare these models, we introduce \textit{VaCScal}, a three-stage statistical validation framework: (1) predictive performance estimation, (2) goodness-of-fit assessment, and (3) comparative analysis. Experiments are conducted across six diverse LLM families and five D2T datasets. Factual inconsistency is inversely measured using four state-of-the-art consistency metrics, including human evaluation. We employ QLoRA, Prefix-Tuning, and full fine-tuning to fine-tune the LLMs. Our analysis, validated through the \textit{VaCScal} framework, consistently shows that factual inconsistency in D2T generation follows exponential scaling with respect to model (LLM) size, compute (FLOPs), and fine-tuning data size---challenging the prevailing assumption of power law scaling. To support this finding, a mathematical rationale is also provided, demonstrating why exponential scaling behavior is expected in factual inconsistency under typical D2T conditions.

URL: https://openreview.net/forum?id=xPaPd6g5WG
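
To make the distinction between the two scaling forms in the abstract above concrete: a power law $y = a x^{-b}$ is a straight line in log-log space, whereas an exponential $y = a e^{-bx}$ is a straight line when only $y$ is log-transformed. The sketch below fits both lines by ordinary least squares and compares residuals on synthetic data. It is a simplified illustration with hypothetical names and made-up data, not the VaCScal framework described in the paper.

```python
import math


def linear_fit_sse(xs, ys):
    """Least-squares line through (xs, ys); returns the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def compare_scaling(sizes, errors):
    """Residuals of a power-law fit and an exponential fit, both in log space."""
    log_err = [math.log(e) for e in errors]
    return {
        "power_law_sse": linear_fit_sse([math.log(s) for s in sizes], log_err),
        "exponential_sse": linear_fit_sse(list(sizes), log_err),
    }


# Synthetic data generated from an exponential, purely for illustration.
sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
errors = [math.exp(-0.5 * s) for s in sizes]
result = compare_scaling(sizes, errors)
print(result["exponential_sse"] < result["power_law_sse"])  # True
```

Comparing in-sample residuals alone is a weak test; the abstract describes a three-stage procedure that also covers predictive performance and goodness of fit.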

---

Title: Differentially Private Clustered Federated Learning

Authors: Saber Malekmohammadi, Afaf Taik, Golnoosh Farnadi

Abstract: Federated learning (FL), which is a decentralized machine learning (ML) approach, often incorporates differential privacy (DP) to provide rigorous data privacy guarantees to clients. Previous works attempted to address high structured data heterogeneity in vanilla FL settings through clustering clients (a.k.a clustered FL), but these methods remain sensitive and prone to errors, further exacerbated by the DP noise. This vulnerability makes the previous methods inappropriate for differentially private FL (DPFL) under structured data heterogeneity. To address this gap, we propose an algorithm for differentially private clustered FL, which is robust to the DP noise in the system and identifies the underlying clients’ clusters correctly. To this end, we propose to cluster clients based on both their model updates and training loss values. Furthermore, for clustering clients’ model updates at the end of the first round, our proposed approach addresses the server’s uncertainties by employing large batch sizes as well as Gaussian Mixture Models (GMM) to reduce the impact of DP and stochastic noise and avoid potential clustering errors. We provide theoretical analysis to justify our approach and evaluate it across diverse data distributions and privacy budgets. Our experimental results show the approach’s effectiveness in addressing high structured data heterogeneity in DPFL.

URL: https://openreview.net/forum?id=JSsko0a4yr

---

Title: An Architecture Built for Federated Learning: Addressing Data Heterogeneity through Adaptive Normalization-Free Feature Recalibration

Authors: Vasilis Siomos, Jonathan Passerat-Palmbach, Giacomo Tarroni

Abstract: Federated learning is a decentralized collaborative training paradigm preserving stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets degrades system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), a model architecture-level approach that combines weight standardization and channel attention to combat heterogeneous data in FL. ANFR leverages weight standardization to avoid mismatched client statistics and inconsistent averaging, ensuring robustness under heterogeneity, and channel attention to produce learnable scaling factors for feature maps, suppressing inconsistencies across clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by improving class selectivity and channel attention weight distribution. ANFR works with any aggregation method, supports both global and personalized FL, and adds minimal overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. Extensive experiments show ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions. Code is provided at https://github.com/siomvas/ANFR.

URL: https://openreview.net/forum?id=GtdYFLsblb

---

Title: EL-Clustering: Combining Upper- and Lower-Bounded Clusterings for Equitable Load Constraints

Authors: Rajni Dabas, Neelima Gupta, Rudra Bhardwaj, Sapna Grover

Abstract: The application of an ordinary clustering algorithm may yield a clustering output where the number of points per cluster (cluster size) varies significantly. In settings where the centers correspond to facilities that provide a service, this can be highly undesirable as the cluster size is essentially the service load for a facility. While prior work has considered imposing either a lower bound on the cluster sizes or an upper bound, imposing both bounds simultaneously has seen limited work, especially for the $k$-median objective, despite its strong practical motivation. In this paper, we solve the \emph{equitable load} (EL) clustering problem where we minimize the $k$-median objective subject to the cluster sizes not exceeding an upper bound or falling below a lower bound. We solve this problem using a modular approach. Specifically, given a clustering solution that satisfies the lower bound constraints and another that satisfies the upper bound constraints, we introduce a combination algorithm which essentially combines both solutions to produce one that satisfies both constraints simultaneously at the expense of a bounded degradation in the $k$-median objective and a slight violation of the upper bound. Our combination algorithm runs in $O(k^3+n)$ time, where $n$ is the number of points, and is faster than standard $k$-median algorithms that satisfy either the lower or upper bound constraints. Interestingly, our results can be generalized to various other clustering objectives, including the $k$-means objective. We also empirically evaluate the $k$-median objective on benchmark datasets to show that both the cost and the violation factor are significantly smaller in practice than the theoretical worst-case guarantees. Code: https://github.com/0-rudra-0/el-clustering

URL: https://openreview.net/forum?id=EkjDfnJ1gU

---

Title: From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning

Authors: Gaurav Chaudhary, Laxmidhar Behera

Abstract: Offline Reinforcement Learning (RL) aims to learn effective policies from a static dataset without requiring further agent-environment interactions. However, its practical adoption is often hindered by the need for explicit reward annotations, which can be costly to engineer or difficult to obtain retrospectively. To address this, we propose ReLOAD (Reinforcement Learning with Offline Reward Annotation via Distillation), a novel reward annotation framework for offline RL. Unlike existing methods that depend on complex alignment procedures, our approach adapts Random Network Distillation (RND) to generate intrinsic rewards from expert demonstrations using a simple yet effective embedding discrepancy measure. First, we train a predictor network to mimic a fixed target network’s embeddings based on expert state transitions. Later, the prediction error between these networks serves as a reward signal for each transition in the static dataset. This mechanism provides a structured reward signal without requiring handcrafted reward annotations. We provide a formal theoretical construct that provides insights into how RND prediction errors effectively serve as intrinsic rewards by distinguishing expert-like transitions. Experiments on the D4RL benchmark demonstrate that ReLOAD enables robust offline policy learning and achieves performance competitive with traditional reward-annotated methods.

URL: https://openreview.net/forum?id=F5K94JI2Jb
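
To make the RND idea in the abstract above concrete, here is a toy sketch, not the authors' implementation. It assumes scalar states instead of transitions, a fixed `tanh` function standing in for the target network, a one-parameter linear predictor fit by least squares on expert states, and reward defined as the negative squared prediction error. All names and constants (`target`, `fit_predictor`, `w=1.7`) are hypothetical.

```python
import math
import random


def target(x: float, w: float = 1.7) -> float:
    """Fixed target function (stand-in for a frozen random network)."""
    return math.tanh(w * x)


def fit_predictor(expert_states: list) -> float:
    """Least-squares slope a minimising sum((a*x - target(x))**2) on expert data."""
    num = sum(x * target(x) for x in expert_states)
    den = sum(x * x for x in expert_states)
    return num / den


def reward(x: float, a: float) -> float:
    """Assumed reward: negative squared prediction error (higher = more expert-like)."""
    return -((a * x - target(x)) ** 2)


rng = random.Random(0)
# Expert states are concentrated near the origin in this toy setup.
expert_states = [rng.uniform(-0.2, 0.2) for _ in range(200)]
a = fit_predictor(expert_states)

# Annotate a static dataset: the first two states look expert-like, the last two do not.
dataset_states = [0.05, -0.1, 1.5, -2.0]
rewards = [reward(x, a) for x in dataset_states]
for x, r in zip(dataset_states, rewards):
    print(f"state={x:+.2f} reward={r:.6f}")
```

The predictor matches the target only where it was trained, so states far from the expert data receive a much lower reward in this sketch.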

---

Title: Chimera: State Space Models Beyond Sequences

Authors: Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu

Abstract: Transformer-based deep learning methods have emerged as the standard approach to model diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires the use of inductive biases, such as position embeddings in sequences and images, and random walks in graphs, to incorporate topology. However, developing bespoke inductive biases for each task requires significant effort and can also introduce side-effects hindering generalization. In this work, we introduce Chimera, a unified model that directly incorporates the data topology in a principled way, obviating the need for domain-specific biases. Central to Chimera is the observation that state-space models---which naturally do not require position embeddings---can be generalized to capture any general graph topology. Our experiments demonstrate the versatility of our approach---Chimera achieves strong performance across the domains of language, vision, and graphs, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all the baselines on the Long Range Graph Benchmark. Our results validate Chimera's principled methodological contributions and affirm the long-held belief that data topology is a powerful inductive bias across modalities. We further propose algorithmic optimizations to improve Chimera's efficiency while maintaining performance: 1) For the subclass of Directed Acyclic Graphs we show that Chimera can be implemented as a linear time recurrence. 2) For general graphs, we relax the method with a simple mathematical approximation, achieving Transformer's quadratic complexity without relying on domain-specific biases.

URL: https://openreview.net/forum?id=yv0TUssepk

---

Title: Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization

Authors: Linli Zhou, Bokun Wang, My T. Thai, Tianbao Yang

Abstract: Two-way partial AUC (TPAUC) is a critical performance metric for binary classification with imbalanced data, as it focuses on specific ranges of the true positive rate (TPR) and false positive rate (FPR). However, stochastic algorithms for TPAUC optimization remain under-explored, with existing methods either limited to approximated TPAUC loss functions or burdened by sub-optimal complexities. To overcome these limitations, we introduce two innovative stochastic primal-dual double block-coordinate algorithms for TPAUC maximization. These algorithms utilize stochastic block-coordinate updates for both the primal and dual variables, catering to both convex and non-convex settings. We provide theoretical convergence rate analyses, demonstrating significant improvements over prior approaches. Our experimental results, based on multiple benchmark datasets, validate the superior performance of our algorithms, showcasing faster convergence and better generalization. This work advances the state of the art in TPAUC optimization and offers practical tools for real-world machine learning applications.

URL: https://openreview.net/forum?id=M3kibBFP4q

---

Title: Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

Authors: Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych

Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.

URL: https://openreview.net/forum?id=10QqO1tM1H

---

Title: Incorporating Spatial Information into Goal-Conditioned Hierarchical Reinforcement Learning via Graph Representations

Authors: Shuyuan Zhang, Zihan Wang, Xiao-Wen Chang, Doina Precup

Abstract: The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, existing approaches typically rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them, because they have problems passing the information in the graphs to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representation. This paper proposes a solution to these issues by developing a graph encoder-decoder to evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method when operating in environments with primarily symmetric and reversible transitions to enhance performance across this class of problems. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high- and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches at a small additional computational cost in dense and sparse reward environments.

URL: https://openreview.net/forum?id=a7Bx4s5gA8

---

Title: D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Authors: Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C. Stadie

Abstract: We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.

URL: https://openreview.net/forum?id=8KbstCUXhH

---

Title: HybridFlow: Quantification of Aleatoric and Epistemic Uncertainty with a Single Hybrid Model

Authors: Peter Van Katwyk, Karianne Bergen

Abstract: Uncertainty quantification is critical for ensuring robustness in high-stakes machine learning applications. We introduce HybridFlow, a modular hybrid architecture that unifies the modeling of aleatoric and epistemic uncertainty by combining a Conditional Masked Autoregressive normalizing flow for estimating aleatoric uncertainty with a flexible probabilistic predictor for epistemic uncertainty. The framework supports integration with any probabilistic model class, allowing users to easily adapt HybridFlow to existing architectures without sacrificing predictive performance. HybridFlow improves upon previous uncertainty quantification frameworks across a range of regression tasks, such as depth estimation, a collection of regression benchmarks, and a scientific case study of ice sheet emulation. We also provide empirical results of the quantified uncertainty, showing that the uncertainty quantified by HybridFlow is calibrated and better aligns with model error than existing methods for quantifying aleatoric and epistemic uncertainty. HybridFlow addresses a key challenge in Bayesian deep learning, unifying aleatoric and epistemic uncertainty modeling in a single robust framework.

URL: https://openreview.net/forum?id=xRiEdSyVjY

---

Title: Pref-GUIDE: Continual Policy Learning from Real-Time Human Feedback via Preference-Based Learning

Authors: Zhengran Ji, Boyuan Chen

Abstract: Training reinforcement learning agents with human feedback is crucial when task objectives are difficult to specify through dense reward functions. While prior methods rely on offline trajectory comparisons to elicit human preferences, such data is unavailable in online learning scenarios where agents must adapt on the fly. Recent approaches address this by collecting real-time scalar feedback to guide agent behavior and train reward models for continued learning after human feedback becomes unavailable. However, scalar feedback is often noisy and inconsistent, limiting the accuracy and generalization of learned rewards. We propose Pref-GUIDE, a framework that transforms real-time scalar feedback into preference-based data to improve reward model learning for continual policy training. Pref-GUIDE Individual mitigates temporal inconsistency by comparing agent behaviors within short windows and filtering ambiguous feedback. Pref-GUIDE Voting further enhances robustness by aggregating reward models across a population of users to form consensus preferences. Across three challenging environments, Pref-GUIDE significantly outperforms scalar-feedback baselines, with the voting variant exceeding even expert-designed dense rewards. By reframing scalar feedback as structured preferences with population feedback, Pref-GUIDE offers a scalable and principled approach for harnessing human input in online reinforcement learning.

URL: https://openreview.net/forum?id=dWGUwidXDm
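
The windowed comparison described in the abstract above can be illustrated with a minimal sketch. It is an assumption-laden toy, not the paper's procedure: the function name `feedback_to_preferences`, the `window` and `margin` parameters, and the rule "drop a pair when the two scalar values differ by less than `margin`" are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def feedback_to_preferences(
    feedback: List[Tuple[int, float]], window: int, margin: float
) -> List[Tuple[int, int]]:
    """Turn (timestep, scalar_feedback) samples into (preferred_t, rejected_t) pairs.

    Only samples at most `window` steps apart are compared, and pairs whose
    feedback differs by less than `margin` are treated as ambiguous and dropped.
    Assumes `feedback` is sorted by timestep.
    """
    pairs = []
    for i, (t_i, f_i) in enumerate(feedback):
        for t_j, f_j in feedback[i + 1:]:
            if t_j - t_i > window:
                break  # later samples are even further away
            if abs(f_i - f_j) < margin:
                continue  # ambiguous: the human signal does not separate them
            pairs.append((t_i, t_j) if f_i > f_j else (t_j, t_i))
    return pairs


feedback = [(0, 0.1), (1, 0.8), (2, 0.75), (10, -0.5)]
pairs = feedback_to_preferences(feedback, window=3, margin=0.2)
print(pairs)
```

In this example steps 1 and 2 are each preferred over step 0, the near-identical feedback at steps 1 and 2 is filtered out, and step 10 is never compared because it lies outside the window.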

---

Title: VSCoDe: Visual-Augmentation Selection for Contrastive Decoding

Authors: Sihyeon Kim, Boryeong Cho, Sangmin Bae, Sumyeong Ahn, Se-Young Yun

Abstract: Despite the impressive performance of recent Large Vision-Language Models (LVLMs), these models often produce inaccurate responses. To address this issue, previous studies have aimed to reduce hallucinations by using contrastive decoding (CD) with modified images, such as cropping objects related to the query or adding noise, thereby contrasting with the original image. However, these methods have several limitations. First, employing fixed visual augmentation, such as adding noise, is a simple approach but too rigid to provide contrast across varied queries. Conversely, using semantics in queries or images by leveraging external models can adaptively generate contrastive images, but it entails significant additional costs. To address these shortcomings, we explore using pre-defined visual augmentations to enable flexible adaptation to each query without relying on external models. We observe that each query achieves different contrasts through different visual augmentations. Based on this, we propose a novel method called VSCoDe, Visual-Augmentation Selection for Contrastive Decoding, which adaptively selects augmentations using a proposed distance metric to identify those with higher contrast. Our empirical evaluations demonstrate that VSCoDe outperforms previous methods and enhances the quality of various vision-language tasks without additional training or reliance on external models.

URL: https://openreview.net/forum?id=CqSyPc9W7Y

---

Title: Learning Time-Series Representations by Hierarchical Uniformity-Tolerance Latent Balancing

Authors: Amin Jalali, Milad Soltany, Michael Greenspan, Ali Etemad

Abstract: We propose TimeHUT, a novel method for learning time-series representations by hierarchical uniformity-tolerance balancing of contrastive representations. Our method uses two distinct losses to learn strong representations with the aim of striking an effective balance between uniformity and tolerance in the embedding space. First, TimeHUT uses a hierarchical setup to learn both instance-wise and temporal information from input time-series. Next, we integrate a temperature scheduler within the vanilla contrastive loss to balance the uniformity and tolerance characteristics of the embeddings. Additionally, a hierarchical angular margin loss enforces instance-wise and temporal contrast losses, creating geometric margins between positive and negative pairs of temporal sequences. This approach improves the coherence of positive pairs and their separation from the negatives, enhancing the capture of temporal dependencies within a time-series sample. We evaluate our approach on a wide range of tasks, namely 128 UCR and 30 UAE datasets for univariate and multivariate classification, as well as Yahoo and KPI datasets for anomaly detection. The results demonstrate that TimeHUT outperforms prior methods by considerable margins on classification, while obtaining competitive results for anomaly detection. Finally, detailed sensitivity and ablation studies are performed to evaluate different components and hyperparameters of our method.

URL: https://openreview.net/forum?id=NTmVEAiyB5

---

Title: Activation sharding for scalable training of large models

Authors: Xingzi Xu, Amir Tavanaei, Kavosh Asadi, Karim Bouyarmane

Abstract: Despite fast progress, efficiently training large language models (LLMs) in extremely long contexts remains challenging. Existing methods fall back to training LLMs with short contexts (up to a few thousand tokens) and use inference time techniques when evaluating on very long contexts (above 1M tokens). Training on very long contexts is limited by GPU memory availability and the prohibitively long training times it requires on state-of-the-art hardware. Meanwhile, many real-life applications require training/fine-tuning with long context on specific tasks. Such applications include, for example, augmenting the context with various sources of raw reference information for extraction, summarization, or fact reconciliation tasks. We propose adjoint sharding, a novel technique that comprises sharding gradient calculation during training to reduce memory requirements by orders of magnitude, making training on very long contexts computationally tractable. At the core of our adjoint sharding algorithm lies the adjoint method, which efficiently computes gradients that are provably equivalent to the gradients computed using standard backpropagation. We also propose truncated adjoint sharding to accelerate the algorithm while maintaining performance. We provide a distributed and a parallel-computing version of adjoint sharding to speed up training and to show that adjoint sharding is compatible with these standard memory-reduction techniques. Empirical results show the proposed adjoint sharding algorithm reduces memory usage by up to 3$\times$ on a large language model with 1.27B parameters on 1M context length training. This reduction in memory usage allows increasing the maximum context length of training a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.

URL: https://openreview.net/forum?id=kQCuMcEneq

---

Title: Reliable and Responsible Foundation Models

Authors: Xinyu Yang, Junlin Han, Rishi Bommasani, Jinqi Luo, Wenjie Qu, Wangchunshu Zhou, Adel Bibi, Xiyao Wang, Jaehong Yoon, Elias Stengel-Eskin, Shengbang Tong, Lingfeng Shen, Rafael Rafailov, Runjia Li, Zhaoyang Wang, Yiyang Zhou, Chenhang Cui, Yu Wang, Wenhao Zheng, Huichi Zhou, Jindong Gu, Zhaorun Chen, Peng Xia, Tony Lee, Thomas P Zollo, Vikash Sehwag, Jixuan Leng, Jiuhai Chen, Yuxin Wen, Huan Zhang, Zhun Deng, Linjun Zhang, Pavel Izmailov, Pang Wei Koh, Yulia Tsvetkov, Andrew Gordon Wilson, Jiaheng Zhang, James Zou, Cihang Xie, Hao Wang, Philip Torr, Julian McAuley, David Alvarez-Melis, Florian Tramèr, Kaidi Xu, Suman Jana, Chris Callison-Burch, Rene Vidal, Filippos Kokkinos, Mohit Bansal, Beidi Chen, Huaxiu Yao

Abstract: Foundation models, including Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), Image Generative Models (i.e., Text-to-Image Models and Image-Editing Models), and Video Generative Models, have become essential tools with broad applications across various domains such as law, medicine, education, finance, and beyond. As these models see increasing real-world deployment, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.

URL: https://openreview.net/forum?id=nLJZh4M6S5

---

New submissions
===============

Title: Certified Defense Against Cross-Modal Attacks in Multimodal LLMs via Semantic-Perceptual Abstractions

Abstract: Multimodal large language models (MLLMs) have revolutionized AI by enabling seamless integration of vision and language understanding across diverse applications, from visual question answering to image captioning. However, their cross-modal architecture introduces unique vulnerabilities to adversarial perturbations that exploit both text and image modalities simultaneously. While existing defense mechanisms rely on empirical robustness through adversarial training, they lack formal guarantees against sophisticated cross-modal attacks. This paper introduces a novel certified defense framework based on hybrid polytope-zonotope abstractions that provides provable robustness guarantees for MLLMs. Our approach unifies discrete text perturbations with continuous image perturbations within a single mathematical framework. Extensive evaluation on VQA v2.0 and Flickr30k across MLLMs demonstrates 88.5% clean accuracy with 76.4–81.2% certified accuracy under large perturbations, outperforming state-of-the-art baselines by 8.3% in certification rate and 6.7% in joint attack defense. This work establishes the first comprehensive certified defense for MLLMs, advancing trustworthy multimodal AI systems.

URL: https://openreview.net/forum?id=E8ZeSR92PA

---

Title: Image Enhancement: A Necessity for Effective Underwater Object Detection?

Abstract: Underwater vision is essential for applications such as marine engineering, aquatic robotics, and environmental monitoring. However, severe image degradation caused by light absorption and scattering often compromises object detection (OD) performance. Although underwater image enhancement (UIE) intuitively seems beneficial for restoring visual information and improving detection accuracy, its actual impact remains unclear. This work systematically evaluates state-of-the-art enhancement models and investigates their effects on underwater OD to answer the key question: "Is UIE necessary for accurate OD?" We conducted a systematic evaluation of 20 representative UIE algorithms, spanning traditional methods, convolutional neural networks (CNNs), generative adversarial networks (GANs), Transformers, and Diffusion models. These methods are applied to two benchmark datasets, RUOD and URPC2020, producing 21 domain variants per dataset (raw + 20 enhanced). To rigorously assess the effect of enhancement on detection, we trained five object detectors on each domain, resulting in 210 unique model configurations (5 detectors × 21 domains × 2 datasets). Our findings reveal that, contrary to intuitive expectations, most enhancement techniques actually degrade detection accuracy. Only well-designed methods, such as diffusion-based approaches that preserve key low-level features without introducing artificial distortions, can minimize this negative impact. These results provide critical insights into the role of enhancement in underwater vision and highlight important considerations for future research.

URL: https://openreview.net/forum?id=FJeHjSTjvJ
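
The configuration count quoted in the abstract above (5 detectors × 21 domains × 2 datasets) can be checked by enumerating the grid. The detector and enhancement names below are placeholders, since the abstract gives only the counts and the two dataset names.

```python
from itertools import product

# Placeholder identifiers: the abstract states how many, not which ones.
detectors = [f"detector_{i}" for i in range(1, 6)]       # 5 object detectors
domains = ["raw"] + [f"uie_{i}" for i in range(1, 21)]   # raw + 20 enhanced variants
datasets = ["RUOD", "URPC2020"]                          # named in the abstract

# One trained model per (detector, domain, dataset) combination.
configurations = list(product(detectors, domains, datasets))
print(len(domains), len(configurations))
```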

---

Title: $\texttt{LucidAtlas}$: Learning Uncertainty-Aware, Covariate-Disentangled, Individualized Atlas Representations

Abstract: Interpreting how covariates influence spatially structured biological variation remains a key challenge in developing models suitable for clinical application. We present $\texttt{LucidAtlas}$, a versatile framework for modeling and interpreting spatially varying information with associated covariates. To address the limitations of neural additive models when analyzing dependent covariates, we introduce a marginalization approach that enables accurate explanations of how combinations of covariates shape the learned atlas. $\texttt{LucidAtlas}$ integrates covariate interpretation, spatial representation, individualized prediction, population distribution analysis, and out-of-distribution detection into a single interpretable model. We validate its effectiveness on one synthetic spatiotemporal dataset and two real-world medical datasets. Our findings underscore the critical role of by-construction interpretable models in advancing scientific discovery.

URL: https://openreview.net/forum?id=3FbNwC8ua8

---

Title: Solving Truly Massive Budgeted Monotonic POMDPs with Oracle-Guided Meta-Reinforcement Learning

Abstract: Many real-world decision problems, ranging from asset-maintenance scheduling to portfolio rebalancing, can be naturally modelled as budget-constrained multi-component monotonic Partially Observable Markov Decision Processes (POMDPs): each component’s latent state degrades stochastically until an expensive restorative action is taken, while all assets share a fixed intervention budget. For a large number of assets, deriving an optimal policy for this joint POMDP is computationally intractable. To tackle this challenge, we prove that the value function of the associated belief-MDP is budget-concave, which allows an efficient two-step approach to finding a near-optimal policy. First, we approximate the optimal cross-component budget split via a random-forest surrogate of each single-component value function. Second, we solve each resulting budget-constrained single-component POMDP with an oracle-guided meta-trained Proximal Policy Optimization (PPO) policy: value iteration on the fully observable counterpart yields an oracle that shapes the PPO update and greatly accelerates learning. We validate our method through experiments in two disparate domains: (i) preventive maintenance for a large-scale building infrastructure containing 1,000 components, and (ii) portfolio risk management under debit-only loss-budget constraints, where each asset’s latent budget depletes with market losses and can only be replenished through costly recapitalization. Results show that our method consistently achieves longer component survival times and better portfolio viability than both baseline heuristics and vanilla PPO. Furthermore, our approach maintains linear scalability in solution time with respect to the number of components.

URL: https://openreview.net/forum?id=yEAnjlmliL

---

Title: Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?

Abstract: Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA, an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves $6.6\%$ improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of $3.8\%$ improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.

URL: https://openreview.net/forum?id=K6WwK8URlV

---

Title: Neural Architecture Search by Learning a Hierarchical Search Space

Abstract: Monte-Carlo Tree Search (MCTS) is a powerful tool for many non-differentiable search-related problems such as adversarial games. However, the performance of such an approach highly depends on the order of the nodes that are considered at each branching of the tree. If the first branches cannot distinguish between promising and deceiving configurations for the final task, the efficiency of the search is significantly reduced. In Neural Architecture Search (NAS), as only the final architecture matters, the visiting order of the branching can be optimized to improve learning. In this paper, we study the application of MCTS to NAS for image classification. We analyze several sampling methods and branching alternatives for MCTS and propose to learn the branching by hierarchical clustering of architectures based on their similarity. The similarity is measured by the pairwise distance of output vectors of architectures. Extensive experiments on two challenging benchmarks on CIFAR10 and ImageNet show that MCTS, if provided with a good branching hierarchy, often yields better solutions more efficiently than other approaches for NAS problems.

URL: https://openreview.net/forum?id=kp3p6sPqVJ

---

Title: Top-$k$ Feature Importance Ranking

Abstract: Accurate ranking of important features is a fundamental challenge in interpretable machine learning with critical applications in scientific discovery and decision-making. Unlike feature selection and feature importance, the specific problem of ranking important features has received considerably less attention. We introduce RAMPART (Ranked Attributions with MiniPatches And Recursive Trimming), a framework that utilizes any existing feature importance measure in a novel algorithm specifically tailored for ranking the top-$k$ features. Our approach combines an adaptive sequential halving strategy that progressively focuses computational resources on promising features with an efficient ensembling technique using both observation and feature subsampling. Unlike existing methods that convert importance scores to ranks as post-processing, our framework explicitly optimizes for ranking accuracy. We provide theoretical guarantees showing that RAMPART achieves the correct top-$k$ ranking with high probability under mild conditions, and demonstrate through extensive simulation studies that RAMPART consistently outperforms popular feature importance methods, concluding with a high-dimensional genomics case study.

URL: https://openreview.net/forum?id=2OSHpccsaV
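
For readers unfamiliar with sequential halving, which the abstract above mentions as one ingredient, here is a generic sketch of that strategy applied to top-$k$ ranking. It is not RAMPART itself: it omits the minipatch (observation and feature) subsampling, and the function `rank_top_k`, the doubling evaluation schedule, and the toy noisy importance measure are assumptions for illustration.

```python
import random
from typing import Callable, List, Sequence


def rank_top_k(
    features: Sequence[str],
    k: int,
    noisy_importance: Callable[[str], float],
    base_evals: int = 4,
) -> List[str]:
    """Generic sequential halving: drop the weaker half each round, while
    doubling the number of importance evaluations spent per surviving feature."""
    survivors = list(features)
    n_evals = base_evals
    while True:
        means = {
            f: sum(noisy_importance(f) for _ in range(n_evals)) / n_evals
            for f in survivors
        }
        survivors.sort(key=lambda f: means[f], reverse=True)
        if len(survivors) <= k:
            return survivors  # ranked with the most precise estimates
        survivors = survivors[: max(k, len(survivors) // 2)]
        n_evals *= 2


# Toy importance measure: feature x_i has true importance i, observed with noise.
rng = random.Random(0)
true_importance = {f"x{i}": float(i) for i in range(8)}
measure = lambda f: true_importance[f] + rng.gauss(0.0, 0.1)

top3 = rank_top_k(list(true_importance), k=3, noisy_importance=measure)
print(top3)
```

The design point the sketch tries to convey is that later rounds evaluate fewer features more often, so the evaluation budget concentrates on the candidates whose relative order matters for the final ranking.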

---

Title: BalancedDPO: Adaptive Multi-Metric Alignment

Abstract: Diffusion models have achieved remarkable progress in text-to-image generation, yet aligning them with human preference remains challenging due to the presence of multiple, sometimes conflicting, evaluation metrics (e.g., semantic consistency, aesthetics, and human preference scores). Existing alignment methods typically optimize for a single metric or rely on scalarized reward aggregation, which can bias the model toward specific evaluation criteria. To address this challenge, we propose BalancedDPO, a framework that achieves multi-metric preference alignment within the Direct Preference Optimization (DPO) paradigm. Unlike prior DPO variants that rely on a single metric, BalancedDPO introduces a majority-vote consensus over multiple preference scorers and integrates it directly into the DPO training loop with dynamic reference model updates. This consensus-based formulation avoids reward-scale conflicts and ensures more stable gradient directions across heterogeneous metrics. Experiments on Pick-a-Pic, PartiPrompt, and HPD datasets demonstrate that BalancedDPO consistently improves preference win rates over the baselines across Stable Diffusion 1.5, Stable Diffusion 2.1 and SDXL backbones. Comprehensive ablations further validate the benefits of majority-vote aggregation and dynamic reference updating, highlighting the method’s robustness and generalizability across diverse alignment settings.

URL: https://openreview.net/forum?id=8HRID5VLQw
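
To illustrate why a majority vote sidesteps reward-scale conflicts, here is a minimal sketch of a consensus label over several scorers. It is an assumption-laden toy rather than the paper's implementation: `majority_vote_preference` and the scorer callables are hypothetical, and the DPO training loop and reference-model updates are omitted entirely.

```python
def majority_vote_preference(image_a, image_b, scorers):
    """Return +1 if most scorers prefer image_a, -1 if most prefer image_b, 0 on a tie.

    Each scorer is a hypothetical callable mapping an image to a number.
    Only the sign of each comparison is used, so the scorers' scales never mix.
    """
    votes = 0
    for scorer in scorers:
        score_a, score_b = scorer(image_a), scorer(image_b)
        if score_a > score_b:
            votes += 1
        elif score_b > score_a:
            votes -= 1
    return (votes > 0) - (votes < 0)
```

Because each scorer contributes a single vote, a metric reported on a 0 to 10 scale cannot dominate one reported on a 0 to 1 scale, which is not the case for a plain sum of raw scores.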

---

Title: Privacy Profiles Under Tradeoff Composition

Abstract: Privacy profiles and tradeoff functions are two frameworks for comparing differential privacy guarantees of alternative privacy mechanisms. We study connections between these frameworks. We show that the composition of tradeoff functions corresponds to a binary operation on privacy profiles we call their T-convolution. Composition of tradeoff functions characterizes group privacy guarantees, so the T-convolution provides a bridge for translating group privacy properties from one framework to the other. Composition of tradeoff functions has also been used to characterize mechanisms with log-concave additive noise; we derive a corresponding property based on privacy profiles. We also derive new bounds on privacy profiles for log-concave mechanisms based on new convexity properties. In developing these ideas, we characterize regular privacy profiles, which are privacy profiles for mutually absolutely continuous probability measures.

URL: https://openreview.net/forum?id=gRvKjXWacu

---

Title: ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Abstract: Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-scale spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based framework that robustly integrates visual and tactile input data to learn task-agnostic representations for visuotactile perception. Our key idea is to encode positional structure at two complementary levels that emerge naturally in visuotactile perception: local, within each modality, and global, shared across modalities to place their tokens in a common reference before fusion. Unlike prior work, we provide provable guarantees in visuotactile fusion, showing that our encodings are injective, translation-equivariant, and information-preserving, validating these properties empirically. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success.

URL: https://openreview.net/forum?id=mxzzO66Zbu

---

Title: Rethinking the Mixture of Vision Encoders Paradigm for Enhanced Visual Understanding in Multimodal LLMs

Abstract: Mixture of Vision Encoders (MoVE) has emerged as a powerful approach to enhance the fine-grained visual understanding of multimodal large language models (MLLMs), improving their ability to handle tasks such as complex optical character recognition and scene understanding. Despite these advances, effectively combining diverse encoders and their visual tokens, while also scaling to high-resolution inputs, remains an open challenge. In this work, we conduct a systematic study of fusion designs for MoVE-based MLLMs, highlighting principles for token-level integration across complementary encoders. Our study shows that a lightweight recipe consisting of post-adaptation fusion with independent projectors, tile-level sequence interleaving, and dynamic tiling with global context delivers strong performance on diverse benchmarks. We integrate these principles into a simple and effective architecture that we call LEO. Extensive evaluation on 11 vision–language benchmarks demonstrates that LEO achieves better results on the majority of tasks compared to existing MoVE-based approaches. Furthermore, LEO adapts effectively to the specialized domain of autonomous driving without altering its architecture or training recipe, achieving competitive performance against established baselines and thereby highlighting its ability to generalize. The code and model will be publicly available.

URL: https://openreview.net/forum?id=tgnTVmRybs

---

Title: Online Bandit Learning with Offline Preference Data

Abstract: Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as preference feedback from human raters, as opposed to eliciting scores, since the latter tends to be noisy. On the other hand, RL theory and algorithms predominantly assume that reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We consider an offline preference dataset to be available, generated by a rater of unknown 'competence'. We propose $\mathsf{warmPref-PS}$, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback. We show that by modeling the 'competence' of the rater that generated it, we are able to use such a dataset most effectively. We support our claims with a novel theoretical analysis of its Bayesian regret, as well as an extensive empirical evaluation of an approximate loss function that optimizes for infinitely many arms and performs substantially better than baselines.

URL: https://openreview.net/forum?id=z2x0r7K3pO

---

Title: Policy-Guided Search on Tree-of-Thoughts for Efficient Problem Solving with Bounded Language Model Queries

Abstract: Recent studies have explored integrating state-space search algorithms with Language Models (LM) to perform look-ahead on the "Tree-of-Thoughts" (ToT) generated by LMs, thereby improving performance on problem-solving tasks. However, the affiliated search algorithms often overlook the significant computational costs associated with LM inference, particularly in scenarios with constrained computational budgets. Consequently, we address the problem of improving LM performance on problem-solving tasks under limited computational budgets. We demonstrate how the probabilities assigned to thoughts by LMs can serve as a heuristic to guide search within the ToT framework, thereby reducing the number of thought evaluations. Building on this insight, we adapt Levin Tree Search (LTS) to the ToT framework, which leverages LMs as policies to guide the tree exploration efficiently. We extend the theoretical results of LTS by showing that, for ToT (a pruned tree), LTS guarantees a bound on the number of states expanded, and consequently, on the number of thoughts generated. Additionally, we analyze the sensitivity of this bound to the temperature values commonly used in the final softmax layer of the LM. Empirical evaluation under a fixed LM query budget demonstrates that LTS consistently achieves comparable or higher accuracy than baseline search algorithms within the ToT framework, across three domains and three distinct LMs. These findings highlight the efficacy of LTS on ToT, particularly in enabling cost-effective and time-efficient problem-solving, making it well-suited for latency-critical and resource-constrained applications.

URL: https://openreview.net/forum?id=Rlk1bWe2ii
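
As a rough sketch of policy-guided search under a query budget, the snippet below runs a best-first search in which each node is ordered by its depth divided by the probability of the path leading to it, which is one common way to state the Levin cost. The interface is hypothetical: `children(state)` stands in for an LM that proposes thoughts with probabilities, counting the root as depth 1 is an assumption, and the pruning and temperature analysis from the paper are not modelled.

```python
import heapq
import itertools

def levin_tree_search(root, children, is_goal, max_expansions):
    """Best-first search ordered by (depth + 1) / path_probability.

    children(state) yields (child_state, probability) pairs; the probabilities
    stand in for LM-assigned thought probabilities (hypothetical interface).
    Returns (goal_state_or_None, number_of_expansions).
    """
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal cost
    frontier = [(1.0, next(tie_breaker), root, 0, 1.0)]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, state, depth, path_prob = heapq.heappop(frontier)
        expansions += 1
        if is_goal(state):
            return state, expansions
        for child, prob in children(state):
            if prob <= 0.0:
                continue  # zero-probability thoughts are never explored
            child_prob = path_prob * prob
            cost = (depth + 2) / child_prob  # child depth is depth + 1, plus 1 for the root
            heapq.heappush(frontier, (cost, next(tie_breaker), child, depth + 1, child_prob))
    return None, expansions
```

The `max_expansions` argument plays the role of the bounded LM query budget: the search stops and reports failure once the budget is spent.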

---

Title: Mollifier Layers: Enabling Efficient High-Order Derivatives in Inverse PDE Learning

Abstract: Parameter estimation in inverse problems involving partial differential equations (PDEs) underpins modeling across scientific disciplines, especially when parameters vary in space or time. Physics-informed Machine Learning (PhiML) integrates PDE constraints into deep learning, but prevailing approaches depend on recursive automatic differentiation (autodiff), which produces inaccurate high-order derivatives, inflates memory usage, and underperforms in noisy settings. We propose Mollifier Layers, a lightweight, architecture-agnostic module that replaces autodiff with convolutional operations using analytically defined mollifiers. This reframing of derivative computation as smoothing integration enables efficient, noise-robust estimation of high-order derivatives directly from network outputs. Mollifier Layers attach at the output layer and require no architectural modifications. We compare them with three distinct architectures and benchmark performance across first-, second-, and fourth-order PDEs—including Langevin dynamics, heat diffusion, and reaction-diffusion systems—observing significant improvements in memory efficiency, training time and accuracy for parameter recovery across tasks. To demonstrate practical relevance, we apply Mollifier Layers to infer spatially varying epigenetic reaction rates from super-resolution chromatin imaging data—a real-world inverse problem with biomedical significance. Our results establish Mollifier Layers as an efficient and scalable tool for physics-constrained learning.

URL: https://openreview.net/forum?id=6mFVZSzyev
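
The core identity behind replacing differentiation with convolution is that the derivative of a mollified function equals the function convolved with the derivative of the mollifier. The sketch below applies this on a uniform 1D grid using the standard compactly supported bump function. It covers a first derivative only, and the function name, the grid handling, and the choice of bump are assumptions made for illustration, not the paper's layer.

```python
import math

def mollified_derivative(f_values, h, eps):
    """Estimate f' on a uniform grid by convolving f with the derivative of a bump.

    f_values: samples of f with spacing h; eps: mollifier half-width.
    Values within eps of either boundary are unreliable (truncated kernel).
    """
    m = int(eps / h)
    offsets = range(-m, m + 1)
    bump, d_bump = [], []
    for t in offsets:
        u = t * h / eps
        w = 1.0 - u * u
        if w > 0.0:
            b = math.exp(-1.0 / w)                         # bump: exp(-1 / (1 - u^2))
            bump.append(b)
            d_bump.append(b * (-2.0 * u / (w * w)) / eps)  # analytic derivative in y
        else:
            bump.append(0.0)
            d_bump.append(0.0)
    z = sum(bump) * h  # normalise so the mollifier integrates to one
    n = len(f_values)
    out = []
    for i in range(n):
        acc = 0.0
        for t, dk in zip(offsets, d_bump):
            j = i - t
            if 0 <= j < n:
                acc += f_values[j] * dk
        out.append(acc * h / z)
    return out
```

Since the kernel is computed analytically, no differentiation of the sampled signal is needed. Noise in `f_values` is averaged over the kernel width instead of being amplified by finite differences.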

---

Title: Learning and Transferring Physical Models through Derivatives

Abstract: We propose Derivative Learning (DERL), a supervised approach that models physical systems by learning their partial derivatives. We also leverage DERL to build physical models incrementally, by designing a distillation protocol that effectively transfers knowledge from a pre-trained model to a student one. We provide theoretical guarantees that DERL can learn the true physical system, being consistent with the underlying physical laws, even when using empirical derivatives. DERL outperforms state-of-the-art methods in generalizing an ODE to unseen initial conditions and a parametric PDE to unseen parameters. We also design a method based on DERL to transfer physical knowledge across models by extending them to new portions of the physical domain and a new range of PDE parameters. We believe this is the first attempt at building physical models incrementally in multiple stages.

URL: https://openreview.net/forum?id=IbBCDDeDF7

---

Title: Generalized Dirichlet Energy and Graph Laplacians for Clustering Directed and Undirected Graphs

Abstract: Clustering in directed graphs remains a fundamental challenge due to the asymmetry in edge connectivity, which limits the applicability of classical spectral methods originally designed for undirected graphs. A common workaround is to symmetrize the adjacency matrix, but this often leads to losing critical directional information. In this work, we introduce the generalized Dirichlet energy (GDE), a novel energy functional that extends the classical Dirichlet energy to handle arbitrary positive vertex measures and Markov transition matrices. GDE provides a unified framework applicable to both directed and undirected graphs, and is closely tied to the diffusion dynamics of random walks. Building on this framework, we propose the generalized spectral clustering (GSC) method that enables the principled clustering of weakly connected digraphs without resorting to the introduction of teleportation to the random walk transition matrix. A key component of our approach is the utilization of a parametrized vertex measure encoding graph directionality and density. Experiments on real-world point-cloud datasets demonstrate that GSC consistently outperforms existing spectral clustering approaches in terms of clustering accuracy and robustness, offering a powerful new tool for graph-based data analysis.

URL: https://openreview.net/forum?id=AA6D7fJ9PN

---

Title: Probabilistic Pretraining for Improved Neural Regression

Abstract: While transfer learning has revolutionized computer vision and natural language processing, its application to probabilistic regression remains underexplored, particularly for tabular data. We introduce NIAQUE (Neural Interpretable Any-Quantile Estimation), a novel permutation-invariant architecture that enables effective transfer learning across diverse regression tasks. Through extensive experiments on 101 datasets, we demonstrate that pre-training NIAQUE on multiple datasets and fine-tuning on target datasets consistently outperforms both traditional tree-based models and transformer-based neural baselines. On real-world Kaggle competitions, NIAQUE achieves competitive performance against heavily hand-crafted and feature-engineered solutions and outperforms strong baselines such as TabPFN and TabDPT, while maintaining interpretability through its probabilistic framework. Our results establish NIAQUE as a robust and scalable approach for tabular regression, effectively bridging the gap between traditional methods and modern transfer learning.

URL: https://openreview.net/forum?id=F6BTATGXaf

---

Title: Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

Abstract: In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging contemporary quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in F1 score. We also provide evidence that the approach can benefit convolutional neural networks in cases where the available vision encoder has not been fully prepared for the downstream task.

URL: https://openreview.net/forum?id=hf4zEWWIvE

---

Title: Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination—incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

URL: https://openreview.net/forum?id=2xaOl0wluw

---

Title: On Gossip Algorithms for Machine Learning with Pairwise Objectives

Abstract: In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip-based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi-agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a $U$-statistic of degree two.
Motivated by various problems such as similarity learning, ranking or clustering for instance, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.

URL: https://openreview.net/forum?id=VxxpURovJF
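
For readers unfamiliar with the target quantity, a $U$-statistic of degree two averages a symmetric kernel over all pairs of distinct observations. The snippet below computes it centrally, with all data in one place; it does not implement any gossip protocol. The helper name `u_statistic` is hypothetical.

```python
from itertools import combinations

def u_statistic(samples, kernel):
    """Average a symmetric kernel over all unordered pairs of distinct samples.

    Requires at least two samples. For a symmetric kernel this equals
    the sum over ordered pairs i != j divided by n * (n - 1).
    """
    pairs = list(combinations(samples, 2))
    return sum(kernel(x, y) for x, y in pairs) / len(pairs)
```

With the kernel `(x - y) ** 2 / 2`, this quantity coincides with the unbiased sample variance. That gives a simple example of a pairwise objective that no single node can evaluate from its own observation alone.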

---

Title: Instruction-Level Weight Shaping: A Framework for Self-Improving AI Agents

Abstract: Large language models (LLMs) excel at surface fluency yet remain structurally static after pre-training; new or evolving domain knowledge is typically bolted on via retrieval-augmented generation (RAG) or parameter fine-tuning. In practice, RAG often retrieves facts without integrating them logically and adds latency and engineering overhead. Free-form prompt injection and ad hoc prompt engineering are brittle, prone to context-window drift, and can conflict with pre-trained knowledge. Fine-tuning, while effective for specific domains, is resource-intensive and risks catastrophic forgetting.

We propose Instruction-Level Weight Shaping (ILWS), which treats curated system instructions as external, auditable pseudo-parameters updated post-session via reflection and user feedback. After each session an LLM-driven Reflection Engine inspects the conversation trace, diagnoses reasoning successes or failures, and proposes typed deltas $\Delta K=(\Delta S,\Delta U,\Delta T)$ over instructions, user preferences, and tools. Each delta is version-controlled, evaluated under a sliding-window analysis of 1-5 star ratings, automatically repaired on first failure, and rolled back on repeated failure. When the accumulated edit budget crosses a threshold, the agent compiles a rating-weighted synthetic dataset and distils matured instruction-space gains into parameters, converting prompt-space improvements into weight-space without downtime.

Empirically, ILWS makes explicit the low-rank shaping implicitly induced by context in transformer blocks and preserves governance while eliminating per-call retrieval. In enterprise support, ILWS raised throughput by 2.4--5.0$\times$ and cut audited hallucinations by $\sim$80% versus a frozen baseline. A real-world e-commerce platform PoC called "L0 Support" with 1M-token context achieved 4--5$\times$ gains in tickets/hour and an $\sim$80% reduction in time per ticket, with autonomous instruction updates and optional tool synthesis. Because ILWS operates at the instruction layer until a controlled distillation stage, it generalises to dynamic domains (legal, medical, engineering) requiring adaptive reasoning, tool creation, and low-latency deployment.

URL: https://openreview.net/forum?id=2unHBbaor7

---

Title: SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba

Abstract: Large Language Models (LLMs) have achieved remarkable performance across tasks but remain energy-intensive due to dense matrix operations. Spiking neural networks (SNNs) improve energy efficiency by replacing dense matrix multiplications with sparse accumulations. Their sparse spike activity enables efficient LLM deployment on edge devices. However, prior SNN-based LLMs often sacrifice performance for efficiency, and recovering accuracy typically requires full pretraining, which is costly and impractical. To address this, we propose SpikingMamba, an energy-efficient SNN-based LLM distilled from Mamba that improves energy efficiency with minimal accuracy sacrifice. SpikingMamba integrates two key components: (a) TI-LIF, a ternary-integer spiking neuron that preserves semantic polarity through signed multi-level spike representations; and (b) a training-exclusive Smoothed Gradient Compensation (SGC) path that mitigates quantization loss while preserving spike-driven efficiency. We employ a single-stage distillation strategy to transfer the zero-shot ability of pretrained Mamba and further enhance it via reinforcement learning (RL). Experiments show that SpikingMamba-1.3B achieves a 4.76$\times$ energy benefit, with only a 4.78% zero-shot accuracy gap compared to the original Mamba, and achieves a further 2.55% accuracy improvement after RL.

URL: https://openreview.net/forum?id=uxb2jcCLxt

---

Title: MatLLMSearch: Crystal Structure Discovery with Evolution-Guided Large Language Models

Abstract: Crystal structure generation is fundamental to materials science, enabling the discovery of novel materials with desired properties. While existing approaches leverage Large Language Models (LLMs) through extensive fine-tuning on materials databases, we show that pre-trained LLMs can inherently generate novel and stable crystal structures without additional fine-tuning. Our framework employs LLMs as intelligent proposal agents within an evolutionary pipeline that guides them to perform implicit crossover and mutation operations while maintaining chemical validity. We demonstrate that MatLLMSearch achieves a 78.38% metastable rate validated by machine learning interatomic potentials and 31.7% DFT-verified stability (below the convex hull) via quantum mechanical calculations, outperforming specialized models such as CrystalTextLLM. Beyond crystal structure generation, we further demonstrate that our framework adapts to diverse materials design tasks, including crystal structure prediction and multi-objective optimization of properties such as deformation energy and bulk modulus, all without fine-tuning. These results establish our approach as a versatile and effective framework for consistent high-quality materials discovery, offering training-free generation of novel stable structures with reduced overhead and broader accessibility.

URL: https://openreview.net/forum?id=HfqbCoTpXg

---

Title: Fac-TDMPC: A Factored World Model for Robot Planning

Abstract: Model-based reinforcement learning (MBRL) has shown strong sample efficiency in robotics by learning predictive world models and planning with them, but existing methods suffer from high planning latency due to the combination of centralized world models and model predictive control (MPC) as planners, thus limiting the real-time deployment in high-dimensional action spaces. We introduce Fac-TDMPC, a factored latent-space world model that decomposes transition, reward, and value functions on the latent space and learns the factorization via model distillation. The factored design enables decentralized planning across action dimensions. Empirically, Fac-TDMPC achieves substantial planning speedups while preserving the control performance across a suite of continuous-control robotic tasks; it also demonstrates improved robustness to action perturbations, interpretable joint-level latent structure, and enhanced multi-task data efficiency.

URL: https://openreview.net/forum?id=Smb0sdocmd

---

Title: Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Abstract: Recent advances in reasoning-oriented Large Language Models (LLMs) have been driven by the introduction of Chain-of-Thought (CoT) traces, where models generate intermediate reasoning traces before producing an answer. These traces, as in DeepSeek R1, are not only used to guide model inference but also serve as supervision signals for Knowledge Distillation (KD) to improve smaller models. A prevailing but under-examined implicit assumption is that these CoT traces are both semantically correct and interpretable for the end-users. While there are reasons to believe that these intermediate tokens help improve solution accuracy, in this work, we question their validity (semantic correctness) and interpretability to the end user. To isolate the effect of trace semantics, we design experiments in the Question Answering (QA) domain using a rule-based problem decomposition method. This enables us to create Supervised Fine-Tuning (SFT) datasets for LLMs where each QA problem is paired with either verifiably correct or incorrect CoT traces, while always providing the correct final solution. Trace correctness is then evaluated by checking the accuracy of every sub-step in decomposed reasoning chains. To assess end-user trace interpretability, we also finetune LLMs with three additional types of CoT traces: DeepSeek R1 traces, LLM-generated summaries of R1 traces, and LLM-generated post-hoc explanations of R1 traces. We further conduct a human-subject study with 100 participants asking them to rate the interpretability of each trace type on a standardized Likert scale. Our experiments reveal two key findings: (1) Correctness of CoT traces is not reliably correlated with the model’s generation of correct final answers: correct traces led to correct solutions for only 28% of test-set problems, while incorrect traces do not necessarily degrade solution accuracy. (2) In interpretability studies, fine-tuning on verbose DeepSeek R1 traces produced the best model performance, but these traces were rated as least interpretable by users, scoring on average 3.39 for interpretability and 4.59 for cognitive load metrics on a 5-point Likert scale. In contrast, the decomposed traces that are judged significantly more interpretable do not lead to comparable solution accuracy. Together, these findings challenge the assumption in question, suggesting that researchers and practitioners should decouple model supervision objectives from end-user-facing trace design.

URL: https://openreview.net/forum?id=4D1QEEmabF

---

Title: The Clever Hans Mirage: A Comprehensive Survey on Spurious Correlations in Machine Learning

Abstract: Back in the early 20th century, a horse named Hans appeared to perform arithmetic and other intellectual tasks during exhibitions in Germany, while it actually relied solely on involuntary cues in the body language from the human trainer. Modern machine learning models are no different. These models are known to be sensitive to spurious correlations between non-essential features of the inputs (e.g., background, texture, and secondary objects) and the corresponding labels. Such features and their correlations with the labels are known as spurious because they tend to change with shifts in real-world data distributions, which can negatively impact the model's generalization and robustness. In this paper, we provide a comprehensive survey of this emerging issue, along with a fine-grained taxonomy of existing state-of-the-art methods for addressing spurious correlations in machine learning models. Additionally, we summarize existing datasets, benchmarks, and metrics to facilitate future research. The paper concludes with a discussion of the broader impacts, the recent advancements, and future challenges in the era of generative AI, aiming to provide valuable insights for researchers in the related domains of the machine learning community.

URL: https://openreview.net/forum?id=kIuqPmS1b1

---

Title: Natural Policy Gradient for Average Reward Non-Stationary Reinforcement Learning

Abstract: We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of $\Delta_T$. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose and analyze the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with a restart-based exploration for change and a novel interpretation of learning rates as adapting factors. Further, we present a bandit-over-RL based parameter-free algorithm, BORL-NS-NAC, that does not require prior knowledge of the variation budget $\Delta_T$. We present a dynamic regret of $\tilde{\mathcal{O}}(|\mathcal{S}|^{1/2}|\mathcal{A}|^{1/2}\Delta_T^{1/6}T^{5/6})$ for both algorithms under standard assumptions, where $T$ is the time horizon, and $|\mathcal{S}|$, $|\mathcal{A}|$ are the sizes of the state and action spaces. The regret analysis leverages a novel adaptation of the Lyapunov function analysis of NAC to dynamic environments and characterizes the effects of simultaneous updates in policy, value function estimate and changes in the environment.

URL: https://openreview.net/forum?id=hBJYNAYtoo

---

Title: Sensitivity of Stability: Theoretical & Empirical Analysis of Replicability for Adaptive Data Selection in Transfer Learning

Abstract: The widespread adoption of transfer learning has revolutionized machine learning by enabling efficient adaptation of pre-trained models to new domains. However, the reliability of these adaptations remains poorly understood, particularly when using adaptive data selection strategies that dynamically prioritize training examples. We present a comprehensive theoretical and empirical analysis of replicability in transfer learning, introducing a mathematical framework that quantifies the fundamental trade-off between adaptation effectiveness and result consistency. Our key contribution is the formalization of selection sensitivity ($\Delta_Q$), a measure that captures how adaptive selection strategies respond to perturbations in training data. We prove that the replicability failure probability, i.e., the likelihood that two independent training runs produce models differing in performance by more than a threshold, increases quadratically with selection sensitivity while decreasing exponentially with sample size. Through extensive experiments on the MultiNLI corpus using six adaptive selection strategies, ranging from uniform sampling to gradient-based selection, we demonstrate that this theoretical relationship holds precisely in practice. Our results reveal that highly adaptive strategies like gradient-based and curriculum learning achieve superior task performance but suffer from high replicability failure rates, while less adaptive approaches maintain failure rates below 7%. Crucially, we show that source domain pretraining provides a powerful mitigation mechanism, reducing failure rates by up to 30% while preserving performance gains. These findings establish principled guidelines for practitioners to navigate the performance-replicability trade-off and highlight the need for replicability-aware design in modern transfer learning systems.

URL: https://openreview.net/forum?id=rGfdrvwqEY

---

Title: The Transformer Cookbook

Abstract: We present the transformer cookbook: a collection of techniques for directly encoding algorithms into a transformer's parameters. This work addresses the steep learning curve of such endeavors, a problem exacerbated by a fragmented literature where key results are scattered across numerous papers. In particular, we synthesize this disparate body of findings into a curated set of recipes that demonstrate how to implement everything from basic arithmetic in feed-forward layers to complex data routing via self-attention. Our mise en place of formulations is for both newcomers seeking an accessible entry point and experts in need of a systematic reference. This unified presentation of transformer constructions provides a foundation for future work spanning theoretical research in computational complexity to empirical investigations in architecture design and interpretability.

URL: https://openreview.net/forum?id=sPshCSvDrX

---

Title: There are no Champions in Long-Term Time Series Forecasting

Abstract: Recent advances in long-term time series forecasting have introduced numerous complex prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. Our position emphasizes the need to shift focus away from pursuing ever-more complex models and towards enhancing benchmarking practices through rigorous and standardized evaluation methods. To support our claim, we first perform a broad, thorough, and reproducible evaluation of the top-performing models on the most popular benchmark by evaluating five models over 14 datasets encompassing 3,500+ trained networks for the hyperparameter (HP) searches. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art. Our findings suggest the need for rigorous and standardized evaluation methods that enable more substantiated claims, including reproducible HP setups and statistical testing.

URL: https://openreview.net/forum?id=yO1JuBpTBB

---

Title: Shattering the Rings: Reproducibility and Vulnerability Analysis of the ZoDiac Watermarking Framework

Abstract: This paper presents a reproducibility study and robustness evaluation of the paper ‘Attack-Resilient Image Watermarking Using Stable Diffusion’ by Zhang et al. (2024), which proposes ZoDiac, a Stable Diffusion-based framework for attack-resilient image watermarking. While successfully replicating the original method’s core claims, achieving >90% watermark detection rate (WDR) against diffusion-based regeneration attacks and across MS-COCO, DiffusionDB, and WikiArt datasets, we identify critical vulnerabilities under adversarial and geometrically asymmetric attack paradigms. Our extended analysis demonstrates that gradient-based adversarial perturbations reduce ZoDiac’s WDR, a threat model absent in prior evaluations. We also investigate rotationally asymmetric attacks achieving WDR below 65%. Additionally, we explore a new loss function to mitigate these limitations. Despite these enhancements, composite attacks combining adversarial noise with other methods reduce WDR to near-zero, exposing vulnerabilities through multi-stage offensive pipelines. Our implementation can be found on Anonymous Github.

URL: https://openreview.net/forum?id=l6QJfoIl1c

---

Title: CADmium: Fine-Tuning Code Language Models for Text-Driven Sequential CAD Design

Abstract: Computer-aided design (CAD) is the digital construction of 2D and 3D objects, and is central to a wide range of engineering and manufacturing applications like automobile and aviation. Despite its importance, CAD modeling remains largely a time-intensive, manual task. Recent works have attempted to automate this process with small transformer-based models and handcrafted CAD sequence representations. However, there has been little effort to leverage the potential of large language models (LLMs) for sequential CAD design. In this work, we introduce a new large-scale dataset of more than 170k CAD models annotated with high-quality, human-like descriptions generated with our pipeline based on GPT-4.1. Using this dataset, we fine-tune powerful code-LLMs to generate CAD sequences represented in a JSON-based format from natural language descriptions, demonstrating the viability and effectiveness of this approach for text-conditioned CAD generation. Because simple metrics often fail to reflect the quality of generated objects, we introduce geometric and topological metrics based on sphericity, mean curvature, and Euler characteristic to provide richer structural insights. Our experiments and ablation studies on both synthetic and human-annotated data demonstrate that CADmium is able to automate CAD design, drastically speeding up the design of new objects.
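
To make two of the quantities named above concrete, the sketch below computes the Euler characteristic (V - E + F) of a polygonal mesh and the standard Wadell sphericity from a volume and surface area. The abstract does not spell out the paper's exact metric definitions, so this uses the textbook formulas; `euler_characteristic`, `sphericity`, and `CUBE_FACES` are hypothetical names.

```python
import math


def euler_characteristic(faces):
    """Return V - E + F for a mesh given as tuples of vertex indices."""
    vertices = {v for face in faces for v in face}
    edges = set()
    for face in faces:
        for i in range(len(face)):
            a, b = face[i], face[(i + 1) % len(face)]
            edges.add(frozenset((a, b)))  # undirected edge
    return len(vertices) - len(edges) + len(faces)


def sphericity(volume, area):
    """Wadell sphericity: pi^(1/3) * (6V)^(2/3) / A (1.0 for a sphere)."""
    return math.pi ** (1 / 3) * (6 * volume) ** (2 / 3) / area


# Unit cube with vertex index = x + 2*y + 4*z and six quad faces.
CUBE_FACES = [
    (0, 1, 3, 2), (4, 5, 7, 6),  # z = 0, z = 1
    (0, 1, 5, 4), (2, 3, 7, 6),  # y = 0, y = 1
    (0, 2, 6, 4), (1, 3, 7, 5),  # x = 0, x = 1
]
```

For the unit cube, `euler_characteristic(CUBE_FACES)` is 2 (8 vertices, 12 edges, 6 faces) and `sphericity(1.0, 6.0)` is about 0.806.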

URL: https://openreview.net/forum?id=lExqWvQht8

---

Title: MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders

Abstract: Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition, tracking, etc. In this paper, we present MV2MAE, a method for self-supervised learning from synchronized multi-view videos, built on the masked autoencoder framework. We introduce two key enhancements to better exploit multi-view video data. First, we design a cross-view reconstruction task that leverages a cross-attention-based decoder to reconstruct a target viewpoint video from source view. This helps in effectively injecting geometric information and yielding representations robust to viewpoint changes. Second, we introduce a controllable motion-weighted reconstruction loss which emphasizes dynamic regions and mitigates trivial reconstruction of static backgrounds. This improves temporal modeling and encourages learning more meaningful representations across views.
MV2MAE achieves state-of-the-art results on the NTU-60, NTU-120 and ETRI datasets among self-supervised approaches. In the more practical transfer learning setting, it delivers consistent gains of +2.0% to +8.5% on NUCLA, PKU-MMD-II and ROCOG-v2 datasets, demonstrating the robustness and generalizability of our approach.

URL: https://openreview.net/forum?id=nqt35xJywK

---

Title: Fast Debiasing of the LASSO Estimator

Abstract: In high-dimensional sparse regression, the Lasso estimator offers excellent theoretical guarantees but is well-known to produce biased estimates. To address this, Javanmard and Montanari (2014) introduced a method to "debias" the Lasso estimates for a random sub-Gaussian sensing matrix $\boldsymbol{A}$. Their approach relies on computing an "approximate inverse" $\boldsymbol{M}$ of the matrix $\boldsymbol{A}^\top \boldsymbol{A}/n$ by solving a convex optimization problem. This matrix $\boldsymbol{M}$ plays a critical role in mitigating bias and allowing for construction of confidence intervals using the debiased Lasso estimates. However, the computation of $\boldsymbol{M}$ is expensive in practice as it requires iterative optimization.
In the presented work, we re-parameterize the optimization problem to compute a "debiasing matrix" $\boldsymbol{W} := \boldsymbol{AM}^{\top}$ directly, rather than the approximate inverse $\boldsymbol{M}$. This reformulation retains the theoretical guarantees of the debiased Lasso estimates, as they depend on the product $\boldsymbol{AM}^{\top}$ rather than on $\boldsymbol{M}$ alone. Notably, we derive a simple and computationally efficient closed-form expression for $\boldsymbol{W}$, applicable to the sensing matrix $\boldsymbol{A}$ in the original debiasing framework, under a specific deterministic condition.
This condition is satisfied with high probability for a wide class of randomly generated sensing matrices.
Also, the optimization problem based on $\boldsymbol{W}$ guarantees a unique optimal solution, unlike the original formulation based on $\boldsymbol{M}$. We verify our main result with numerical simulations.
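
For orientation, the debiased estimator in this framework is usually written as follows, which shows why only the product $\boldsymbol{AM}^{\top}$ enters. The symbols $\hat{\boldsymbol{\beta}}$ (the Lasso estimate), $\boldsymbol{y}$ (the measurement vector), and $n$ (the number of measurements) are not defined in the abstract and are assumed here to follow the standard convention.

```latex
\hat{\boldsymbol{\beta}}_d
  = \hat{\boldsymbol{\beta}}
    + \frac{1}{n}\,\boldsymbol{M}\boldsymbol{A}^{\top}
      \bigl(\boldsymbol{y} - \boldsymbol{A}\hat{\boldsymbol{\beta}}\bigr)
  = \hat{\boldsymbol{\beta}}
    + \frac{1}{n}\,\boldsymbol{W}^{\top}
      \bigl(\boldsymbol{y} - \boldsymbol{A}\hat{\boldsymbol{\beta}}\bigr),
\qquad
\boldsymbol{W} := \boldsymbol{A}\boldsymbol{M}^{\top}.
```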

URL: https://openreview.net/forum?id=gEVPlLhoNI

---

Title: Weakly-Supervised Disentangled Representation Learning via Filter-Based Adaptive Swapping

Abstract: Disentangled representation learning (DRL) aims to uncover semantically meaningful latent factors from observed data, thereby improving both interpretability and generalization of machine learning (ML) models. Despite remarkable progress, unsupervised DRL cannot achieve complete disentanglement without inductive biases or supervision. To address this challenge, existing approaches either rely on full supervision, which demands extensive manual labeling, or weak supervision, which involves complex training strategies that often result in unstable training. To address these limitations, we propose Filter-VAE, a weakly supervised variational autoencoder (VAE) that introduces a filter-based adaptive swapping strategy to learn stable and meaningful disentangled representations. Specifically, a relevance filter removes semantically meaningless latent factors, while an adaptive swapping filter exchanges those latent factors that have reached stability. With these two filters, Filter-VAE adaptively swaps only stable and semantically aligned latent factors, leading to robust and meaningful representations. We evaluate Filter-VAE on three standard benchmarks and our created traffic sign dataset in two downstream tasks: disentanglement and adversarial robustness. Experimental results demonstrate that Filter-VAE achieves strong disentanglement performance with reduced supervision and delivers remarkable robustness against diverse adversarial attacks and corruptions. The code will be released upon acceptance.

URL: https://openreview.net/forum?id=K69rKKozZU

---

Title: ASMa: Asymmetric Spatio-temporal Masking for Skeleton Action Representation Learning

Abstract: Self-supervised learning (SSL) has shown remarkable success in skeleton-based action recognition by leveraging data augmentations to learn meaningful representations. However, existing SSL methods rely on data augmentations that predominantly focus on masking high-motion frames and high-degree joints such as joints with degree 3 or 4. This results in biased and incomplete feature representations that struggle to generalize across varied motion patterns. To address this, we propose Asymmetric Spatio-temporal Masking (ASMa) for Skeleton Action Representation Learning, a novel combination of masking to learn a full spectrum of spatio-temporal dynamics inherent in human actions. ASMa employs two complementary masking strategies: one that selectively masks high-degree joints and low-motion frames, and another that masks low-degree joints and high-motion frames. These masking strategies ensure a more balanced and comprehensive skeleton representation learning. Furthermore, we introduce a learnable feature alignment module to effectively align the representations learned from both masked views. To facilitate deployment in resource-constrained settings and on low-resource devices, we compress the learned and aligned representation into a lightweight model using knowledge distillation. Extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets demonstrate that our approach outperforms existing SSL methods with an average improvement of 5-8% and achieves competitive performance compared to fully supervised baselines.

URL: https://openreview.net/forum?id=kIFo1q3VMS

---

Title: SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

Abstract: Understanding social interactions from visual cues is a fundamental challenge for a socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to unify and learn multiple social perception tasks simultaneously, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training process of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior further under two lenses: decodability through linear representation probing and compatibility through gradient conflict analysis, revealing that both play a role in the degradation, especially the former, which is significantly compromised in the VLM pre-training process. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Compared with existing VLMs, it achieves positive transfer across all five social tasks, leveraging synergies between them to enhance overall performance and achieves comparable performance to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially-aware training paradigms.

URL: https://openreview.net/forum?id=ofYhEoKIEx

---

Title: VirDA: Reusing Backbone for Unsupervised Domain Adaptation with Visual Reprogramming

Abstract: Image classification is the foundation of nearly all computer-vision pipelines. While state-of-the-art models excel within their training domains, their performance often deteriorates when transferred to a new, unlabeled setting. Unsupervised domain adaptation (UDA) addresses this challenge by repurposing a well-trained source classifier for the target domain, enabling strong downstream results without the need for additional labeled data. Existing UDA pipelines fine-tune already well-trained backbone parameters for every new source-and-target pair, resulting in the number of training parameters and storage memory growing linearly with each new pair, and also preventing the reuse of these well-trained backbone parameters.

Inspired by recent implications that existing backbones have textural biases, we propose making use of domain-specific textural bias for domain adaptation via visual reprogramming, namely VirDA.
Instead of fine-tuning the full backbone, VirDA prepends a domain-specific visual reprogramming layer to the backbone. This layer produces visual prompts that act as an added textural bias to the input image, adapting its "style" to a target domain. To optimize these visual reprogramming layers, we use multiple objective functions that optimize the intra- and inter-domain distribution differences when domain-adapting visual prompts are applied. This process does not require modifying the backbone parameters, allowing the same backbone to be reused across different domains.

We evaluate VirDA on Office-31 and obtain 92.8% mean accuracy with only 1.5M trainable parameters. VirDA surpasses PDA, the state-of-the-art parameter-efficient UDA baseline, by +1.6% accuracy while using just 46% of its parameters. Compared with full-backbone fine-tuning, VirDA outperforms CDTrans and FixBi by +0.2% and +1.4%, respectively, while requiring only 1.7% and 2.8% of their trainable parameters. Relative to the strongest current methods (PMTrans and TVT), VirDA uses ~1.7% of their parameters and trades off only 2.2% and 1.1% accuracy, respectively.

URL: https://openreview.net/forum?id=Qh7or7JRFI

---

Title: Are Data Embeddings effective in time series forecasting?

Abstract: Time series forecasting plays a crucial role in many real-world applications, and numerous complex forecasting models have been proposed in recent years. Despite their architectural innovations, most state-of-the-art models report only marginal improvements—typically just a few thousandths in standard error metrics. These models often incorporate complex data embedding layers, which typically transform raw inputs into higher-dimensional representations to enhance accuracy. But are data embedding techniques actually effective in time series forecasting? Through extensive ablation studies across fifteen state-of-the-art models and four benchmark datasets, we find that removing data embedding layers from many state-of-the-art models does not degrade forecasting performance—in many cases, it improves both accuracy and computational efficiency. The gains from removing embedding layers often exceed the performance differences typically reported between competing state-of-the-art models.

URL: https://openreview.net/forum?id=yeu44ZRvZZ

---

Title: Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Abstract: Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. Then MLIP foundation models are trained with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the foundation models can be used in different ways to obtain geometries either explicitly or implicitly. First, they can be used to obtain low-energy 3D geometries via geometry optimization, providing relaxed 3D geometries for downstream molecular property predictions. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the foundation models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP foundation models trained on relaxation data can provide valuable molecular geometries that benefit property predictions.

URL: https://openreview.net/forum?id=JwxhHTISJL

---

Title: Exploring Training Data Attribution under Limited Access Constraints

Abstract: Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by influence functions for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are not publicly accessible and computational resources are limited, existing TDA methods are often constrained by their reliance on full model access and high computational costs. This poses significant challenges to the broader adoption of TDA in practical applications.

In this work, we present a systematic study of TDA methods under various access and resource constraints. We investigate the feasibility of performing TDA under varying levels of access constraints by leveraging appropriately designed solutions such as proxy models. Besides, we demonstrate that attribution scores obtained from models without prior training on the target dataset remain informative across a range of tasks, which is useful for scenarios where computational resources are limited. Our findings provide practical guidance for deploying TDA in real-world environments, aiming to improve feasibility and efficiency under limited access.

URL: https://openreview.net/forum?id=4O0bwYy4Yu

---

Title: Conformal Calibration of Statistical Confidence Sets

Abstract: Constructing valid confidence sets is a crucial task in statistical inference, yet traditional methods often face challenges when dealing with complex models or limited observed sample sizes. These challenges are frequently encountered in modern applications, such as Likelihood-Free Inference (LFI). In these settings, confidence sets may fail to maintain a confidence level close to the nominal value.
In this paper, we introduce two novel methods, TRUST and TRUST++, for calibrating confidence sets to achieve distribution-free conditional coverage. These methods rely entirely on simulated data from the statistical model to perform calibration. Leveraging insights from conformal prediction techniques adapted to the statistical inference context, our methods ensure both finite-sample local coverage and asymptotic conditional coverage as the number of simulations increases, even if n is small. They effectively handle nuisance parameters and provide computationally efficient uncertainty quantification for the estimated confidence sets. This allows users to assess whether additional simulations are necessary for robust inference. Through theoretical analysis and experiments on models with tractable and intractable likelihoods, we demonstrate that our methods outperform existing approaches, particularly in small-sample regimes. This work bridges the gap between conformal prediction and statistical inference, offering practical tools for constructing valid confidence sets in complex models.

URL: https://openreview.net/forum?id=J4lK62PVE6

---

Title: Incentive-Aware Synthetic Control: Accurate Counterfactual Estimation via Incentivized Exploration

Abstract: Synthetic control methods (SCMs) are a canonical approach used to estimate treatment effects from panel data in the internet economy. We shed light on a frequently overlooked but ubiquitous assumption made in SCMs of "overlap": a treated unit can be written as some combination (typically convex or linear) of the units that remain under control. We show that if units select their own interventions, and there is sufficiently large heterogeneity between units that prefer different interventions, overlap will not hold. We address this issue by proposing a recommender system which incentivizes units with different preferences to take interventions they would not normally consider. Specifically, leveraging tools from information design and online learning, we propose an SCM that incentivizes exploration in panel data settings by providing incentive-compatible intervention recommendations to units. We establish that this estimator obtains valid counterfactual estimates without the need for an a priori overlap assumption. We extend our results to the setting of synthetic interventions, where the goal is to produce counterfactual outcomes under all interventions, not just control. Finally, we provide two hypothesis tests for determining whether unit overlap holds for a given panel dataset.

URL: https://openreview.net/forum?id=koln3ufP5c

---

Title: Lyria: A General LLM-Driven Genetic Algorithm Framework for Problem Solving

Abstract: While Large Language Models (LLMs) have demonstrated impressive abilities across various domains, they still struggle with complex problems characterized by multi-objective optimization, precise constraint satisfaction, immense solution spaces, etc. To address the limitation, drawing on the superior semantic understanding ability of LLMs and also the outstanding global search and optimization capability of genetic algorithms, we propose to capitalize on their respective strengths and introduce Lyria, a general LLM-driven genetic algorithm framework, comprising 7 essential components. Through conducting extensive experiments with 4 LLMs across 3 types of problems, we demonstrated the efficacy of Lyria. Furthermore, with 7 additional ablation experiments, we further systematically analyzed and elucidated the factors that affect its performance. We finally revealed its limitations and provided insights into future directions.

URL: https://openreview.net/forum?id=hu4oaIiCUe

---

Title: On a Gradient Approach to Optimal Function Learning via Chebyshev Centers

Abstract: We introduce $\textsf{gradOL}$, the first gradient-based optimization framework for solving Chebyshev center problems, a fundamental challenge in optimal function learning and geometric optimization. By leveraging automatic differentiation for precise (sub-)gradient computation, $\textsf{gradOL}$ ensures numerical stability and scalability, making it suitable for large-scale settings. Under strong convexity of the ambient norm, our method provably recovers optimal Chebyshev centers while directly computing the associated radius. This addresses a key bottleneck in constructing stable optimal interpolants. Empirically, $\textsf{gradOL}$ achieves significant improvements in accuracy and efficiency on 34 benchmark Chebyshev center problems from the $\textsf{CSIP}$ library. Furthermore, we extend our approach to general convex semi-infinite programming (CSIP), attaining up to $4000\times$ speedups over the state-of-the-art $\textsf{SIPAMPL}$ solver across 67 benchmark instances. Our work also provides the first theoretical foundation for applying gradient-based methods to Chebyshev center problems, bridging rigorous analysis with practical algorithms. $\textsf{gradOL}$ thus offers a unified solution framework for Chebyshev centers and broader CSIPs.
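
For readers unfamiliar with the term, the Chebyshev center of a bounded set $S$ in a normed space is commonly defined as the minimizer of the min-max problem below, with the optimal value being the Chebyshev radius. This is the textbook definition under assumed notation ($S$, $c$, $r^\star$); the paper's exact formulation may differ.

```latex
c^\star \in \arg\min_{c} \; \sup_{x \in S} \, \lVert x - c \rVert,
\qquad
r^\star = \sup_{x \in S} \, \lVert x - c^\star \rVert .
```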

URL: https://openreview.net/forum?id=lPZVsDhyj3

---

Title: Sublinear Algorithms for Estimating Wasserstein and TV Distances: Applications to Fairness and Privacy Auditing

Abstract: Resource-efficiently computing representations of probability distributions and the distances between them while only having access to the samples is a fundamental and useful problem across mathematical sciences. In this paper, we propose a generic framework to learn the probability and cumulative distribution functions (PDFs and CDFs) of a sub-Weibull, i.e. almost any light- or heavy-tailed, distribution while the samples from it arrive in a stream. The idea is to reduce these problems into estimating the frequency of an appropriately chosen subset of the support of a properly discretised distribution. We leverage this reduction to compute mergeable summaries of distributions from the stream of samples while requiring only sublinear space relative to the number of observed samples. This allows us to estimate Wasserstein and Total Variation (TV) distances between any two distributions while samples arrive in streams and from multiple sources. Our algorithms significantly improve on the existing methods for distance estimation incurring super-linear time and linear space complexities, and further extend the mergeable summaries framework to continuous distributions with possibly infinite support. Our results are tight with respect to the existing lower bounds for bounded discrete distributions. In addition, we leverage our proposed estimators of Wasserstein and TV distances to tightly audit the fairness and privacy of algorithms. We empirically demonstrate the efficiency of proposed algorithms across synthetic and real-world datasets.

URL: https://openreview.net/forum?id=m26nTKlpCr

---

Title: A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems

Abstract: Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. This unifying view highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, our framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
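
The basic idea of evaluating a sequential recursion with a fixed-point method can be sketched as follows. This is a generic Jacobi-style iteration on a toy scalar recursion, not the paper's LDS framework; the step function `f` and the names `sequential_rollout` and `jacobi_rollout` are hypothetical choices for illustration.

```python
import numpy as np


def f(x, u):
    """Toy nonlinear step x_t = f(x_{t-1}, u_t) (an assumed example)."""
    return np.tanh(0.5 * x + u)


def sequential_rollout(x0, inputs):
    """Reference evaluation: one step after another."""
    xs, x = [], x0
    for u in inputs:
        x = f(x, u)
        xs.append(x)
    return np.array(xs)


def jacobi_rollout(x0, inputs, tol=1e-12):
    """Jacobi-style fixed-point iteration over the whole trajectory.

    Every sweep updates all T states at once from the previous guess,
    so the T updates within a sweep are independent of each other.
    After k sweeps the first k states are exact, so at most T sweeps
    are needed; the loop stops early if the guess stops changing.
    """
    T = len(inputs)
    xs = np.zeros(T)  # initial guess for the trajectory
    for k in range(1, T + 1):
        prev = np.concatenate(([x0], xs[:-1]))  # shifted guess x_{t-1}
        new_xs = f(prev, inputs)                # parallelizable sweep
        converged = np.max(np.abs(new_xs - xs)) < tol
        xs = new_xs
        if converged:
            return xs, k
    return xs, T
```

Calling `jacobi_rollout(0.3, np.linspace(-1.0, 1.0, 20))` returns the same trajectory as `sequential_rollout` up to floating-point tolerance, together with the number of sweeps used.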

URL: https://openreview.net/forum?id=fw6GgAIGur

---

Title: A Watermark for Black-Box Language Models

Abstract: Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.

URL: https://openreview.net/forum?id=6gcHcgGmLo

---

Title: Multi-Step Alignment as Markov Games: An Optimistic Online Mirror Descent Approach with Convergence Guarantees

Abstract: Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach, Optimistic Multi-step Preference Optimization (OMPO), is built upon the optimistic online mirror descent algorithm (Rakhlin and Sridharan, 2013; Joulani et al., 2017). Theoretically, we provide a rigorous analysis for the convergence of OMPO and show that OMPO requires $\mathcal{O}(\epsilon^{-1})$ policy updates to converge to an $\epsilon$-approximate Nash equilibrium. We also validate the effectiveness of our method on a multi-turn conversation dataset and a math reasoning dataset.

URL: https://openreview.net/forum?id=ZWZKaqZCy0

---

Title: EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments

Abstract: Developing autonomous household robots controlled by natural language has long been a pursuit of humanity. While advancements in large language models (LLMs) and embodied intelligence bring this goal closer, several challenges persist: the lack of a robotic task and benchmark that are well-aligned with realistic household tasks, limited evaluation methods and metrics, and data incompatibility between LLMs and mobile manipulation trajectories. To address these issues, we propose Embodied Mobile Manipulation in Open Environments (EMMOE), a benchmark that requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space. EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment. Additionally, we collect EMMOE-100, which features various task attributes, detailed process annotations, re-plans after failures, and two sub-datasets for LLM training. Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error detection mechanisms. Finally, we demonstrate HomieBot’s performance and evaluations of different models and policies.

URL: https://openreview.net/forum?id=wSyExS00Wp

---

Title: Denoising Diffusions with Optimal Transport: Localization, Curvature, and Multi-Scale Complexity

Abstract: Adding noise is easy; what about denoising? Diffusion is easy; what about reverting a diffusion? Diffusion-based generative models aim to denoise a Langevin diffusion chain, moving from a log-concave equilibrium measure $\nu$, say an isotropic Gaussian, back to a complex, possibly non-log-concave initial measure $\mu$. The score function performs denoising, moving backward in time, and predicting the conditional mean of the past location given the current one. We show that score denoising is the optimal backward map in transportation cost. What is its localization uncertainty? We show that the curvature function determines this localization uncertainty, measured as the conditional variance of the past location given the current. We study in this paper the effectiveness of the diffuse-then-denoise process: the contraction of the forward diffusion chain, offset by the possible expansion of the backward denoising chain, governs the denoising difficulty. For any initial measure $\mu$, we prove that this offset net contraction at time $t$ is characterized by the curvature complexity of a smoothed $\mu$ at a specific signal-to-noise ratio (SNR) scale $r(t)$. We discover that the multi-scale curvature complexity collectively determines the difficulty of the denoising chain. Our multi-scale complexity quantifies a fine-grained notion of average-case curvature instead of the worst-case. Curiously, it depends on an integrated tail function, measuring the relative mass of locations with positive curvature versus those with negative curvature; denoising at a specific SNR scale is easy if such an integrated tail is light. We conclude with several non-log-concave examples to demonstrate how the multi-scale complexity probes the bottleneck SNR for the diffuse-then-denoise process.

URL: https://openreview.net/forum?id=sj1wU6gBXH

---

Title: Deep Multimodal Learning with Missing Modality: A Survey

Abstract: During multimodal model training and testing, certain data modalities may be absent due to sensor limitations, cost constraints, privacy concerns, or data loss, which can degrade performance. Multimodal learning techniques that explicitly account for missing modalities aim to improve robustness by enabling models to perform reliably even when certain inputs are unavailable. This survey presents the first comprehensive review of Multimodal Learning with Missing Modality (MLMM), with a focus on deep learning approaches. We outline the motivations and key distinctions between MLMM and conventional multimodal learning, provide a detailed analysis of existing methods, applications, and datasets, and conclude by highlighting open challenges and future research directions.

URL: https://openreview.net/forum?id=tc7RFcx4hT

---

Title: Video Creation by Demonstration

Abstract: We present Video Creation by Demonstration: given a demonstration video and an initial frame from any scene, we generate a realistic video that continues naturally from the initial frame and carries out the action concepts from the demonstration. This is important because, unlike captions, camera poses, or point tracks, a demonstration video can provide a detailed description of the target action without needing extensive manual annotations. The main challenge for training these models is the difficulty in curating supervised training data based on paired actions across different contexts. To mitigate this, we propose Delta-Diffusion, a self-supervised method that learns from unlabeled videos. Our key insight is that by placing a separately learned bottleneck on the features of a video foundation model, we can extract demonstration actions through these features and minimize degenerate solutions. We found Delta-Diffusion to outperform baselines in both human preference and large-scale machine evaluations.

URL: https://openreview.net/forum?id=jFxSMyEFVl

---

Title: Parameter Efficient Continual Learning with Dynamic Low-Rank Adaptation

Abstract: Catastrophic forgetting has remained a critical challenge for deep neural networks in Continual Learning (CL) as it undermines consolidated knowledge when learning new tasks. Parameter efficient fine-tuning CL techniques are gaining traction for their effectiveness in addressing catastrophic forgetting with a lightweight training schedule while avoiding degradation of consolidated knowledge in pre-trained models. However, low-rank adapters (LoRA) in these approaches are highly sensitive to rank selection, which can lead to sub-optimal resource allocation and performance. To this end, we introduce PEARL, a rehearsal-free CL framework that entails dynamic rank allocation for LoRA components during CL training. Specifically, PEARL leverages reference task weights and adaptively determines the rank of task-specific LoRA components based on the current task’s proximity to reference task weights in parameter space. To demonstrate the versatility of PEARL, we evaluate it across three vision architectures (ResNet, Separable Convolutional Network, and Vision Transformer) and a multitude of CL scenarios, and show that PEARL outperforms all considered baselines by a large margin.

URL: https://openreview.net/forum?id=ZqQATq0Geg

---

Title: Learning to Localize Leakage of Cryptographic Sensitive Variables

Abstract: While cryptographic algorithms such as the ubiquitous Advanced Encryption Standard (AES) are secure, *physical implementations* of these algorithms in hardware inevitably 'leak' sensitive data such as cryptographic keys. A particularly insidious form of leakage arises from the fact that hardware consumes power and emits radiation in a manner that is statistically associated with the data it processes and the instructions it executes. Supervised deep learning has emerged as a state-of-the-art tool for carrying out *side-channel attacks*, which exploit this leakage by learning to map power/radiation measurements throughout encryption to the sensitive data operated on during that encryption. In this work we develop a principled deep learning framework for determining the relative leakage due to measurements recorded at different points in time, in order to inform *defense* against such attacks. This information is invaluable to cryptographic hardware designers for understanding *why* their hardware leaks and how they can mitigate it (e.g. by indicating the particular sections of code or electronic components which are responsible). Our framework is based on an adversarial game between a classifier trained to estimate the conditional distributions of sensitive data given subsets of measurements, and a budget-constrained noise distribution which probabilistically erases individual measurements to maximize the loss of this classifier. We demonstrate our method’s efficacy and ability to overcome limitations of prior work through extensive experimental comparison on 6 publicly-available power/EM trace datasets from AES, ECC and RSA implementations. Our PyTorch code is available at https://anonymous.4open.science/r/learning_to_localize_leakage-420B.

URL: https://openreview.net/forum?id=9qxCSU8nDO

---

Title: Distilled Thompson Sampling: Practical and Efficient Thompson Sampling via Imitation Learning

Abstract: Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many online platforms where latency and ease of deployment are of concern. We operationalize TS by proposing a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation, allowing fast decision-making and easy deployment in mobile and server-based environments. Using batched data collected under the imitation policy, our algorithm iteratively performs offline updates to the TS policy, and learns a new explicit policy representation to imitate it. Empirically, our imitation policy achieves performance comparable to batch TS while allowing more than an order of magnitude reduction in decision-time latency. Buoyed by low latency and simplicity of implementation, our algorithm has been successfully deployed in multiple video upload systems for Meta. Using a randomized controlled trial, we show our algorithm resulted in significant improvements in video quality and watch time.

URL: https://openreview.net/forum?id=J8PrWwvYX2
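
For readers unfamiliar with Thompson sampling, the abstract's point that each decision requires posterior sampling followed by an optimization step can be illustrated with a minimal, non-contextual sketch. This is an assumption-laden toy (a Bernoulli bandit with uniform Beta priors) and not the paper's contextual or distilled algorithm; the names `thompson_step` and `run_bandit` are hypothetical.

```python
import random


def thompson_step(successes, failures, rng):
    """Sample a mean reward from each arm's Beta posterior and pick the argmax."""
    # Beta(s + 1, f + 1) is the posterior under a uniform Beta(1, 1) prior (an assumption).
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_means, rounds, seed=0):
    """Run Thompson sampling on a toy Bernoulli bandit; return pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_means)
    successes, failures = [0] * k, [0] * k
    for _ in range(rounds):
        arm = thompson_step(successes, failures, rng)
        # Simulated Bernoulli reward; true_means is hypothetical ground truth.
        if rng.random() < true_means[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


if __name__ == "__main__":
    print(run_bandit([0.2, 0.8], rounds=2000))
```

Every call to `thompson_step` draws fresh posterior samples, which is the per-decision cost the abstract describes; distilling the policy into an explicit representation is the paper's way of avoiding that cost at serving time.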

---

Title: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks

Abstract: Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering, and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness, and over-refusal. Check out the website at https://sites.google.com/view/llm-nbf/home.

URL: https://openreview.net/forum?id=dcyLr9xYoI

---

Title: Optimal Regret and Hard Violation for Constrained Markov Decision Processes with Adversarial Losses and Constraints

Abstract: We investigate online learning in finite-horizon episodic Constrained Markov Decision Processes (CMDPs) under the most demanding setting: adversarial losses and constraints, bandit feedback, and unknown transitions. The most popular approaches, like primal-dual or linear programming, either rely on Slater’s condition (yielding occasionally vacuous bounds) or require solving a complex optimization problem every round. Inspired by the groundbreaking work of \citet{sinha2024optimal} in Constrained Online Convex Optimization (COCO), we map the CMDP instances to a corresponding COCO problem, thus creating simple and elegant algorithms that require only a single Euclidean projection per episode. Our algorithm is the first to attain $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and $\widetilde{\mathcal{O}}(\sqrt{T})$ hard cumulative constraint violation for adversarial losses and constraints, unknown transition dynamics, and bandit feedback, without Slater's condition and also without access to a strictly feasible policy. We achieve $\mathcal{O}(\sqrt{T})$ regret and $\widetilde{\mathcal{O}}(\sqrt{T})$ hard violation for known transitions. Additionally, we study the remaining three permutations of known-unknown transitions and full-bandit feedback, again achieving optimal regret and hard violation bounds in each case. Besides closing several gaps in the literature, our simple construction of biased estimators for the sub-gradient could be of independent interest for didactic purposes.

URL: https://openreview.net/forum?id=EsInBaX0ko

---

Title: Proxy-Anchor and EVT-Driven Continual Learning Method for Generalized Category Discovery

Abstract: Continual generalized category discovery has been introduced and studied in the literature as a method that aims to continuously discover and learn novel categories in incoming data batches while avoiding catastrophic forgetting of previously learned categories. A key component in addressing this challenge is the model’s ability to separate novel samples, where Extreme Value Theory (EVT) has been effectively employed. In this work, we propose a novel method that integrates EVT with proxy anchors to define boundaries around proxies using a probability of inclusion function, enabling the rejection of unknown samples. Additionally, we introduce a novel EVT-based loss function to enhance the learned representation, achieving superior performance compared to other deep-metric learning methods in similar settings. Using the derived probability functions, novel samples are effectively separated from previously known categories. However, category discovery within these novel samples can sometimes overestimate the number of new categories. To mitigate this issue, we propose a novel EVT-based approach to reduce the model size and discard redundant proxies. We also incorporate experience replay and knowledge distillation mechanisms during the continual learning stage to prevent catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms state-of-the-art methods in continual generalized category discovery scenarios.

URL: https://openreview.net/forum?id=P3Qe9yJRvf

---

Title: PredLDM: Spatiotemporal Sequence Prediction with Latent Diffusion Models

Abstract: Predicting an accurate and realistic future is an attractive goal in spatiotemporal sequence prediction. Despite recent progress in spatiotemporal predictive models, explorations in this field are challenging due to difficulties in intricate global coherence and comprehensive history understanding. In this study, we introduce latent diffusion models (LDMs) into spatiotemporal sequence prediction (PredLDM) with a two-stage training paradigm. (i) To compress intricate global coherent spatiotemporal content into latent space, we propose the masked-attention transformer-based variational autoencoder (MT-VAE) by exploiting transformers with masked self-attention layers. (ii) Unlike LDMs in generation-related fields, where the condition is text, the condition in our problem setting is historical observations; we therefore provide the condition-aware LDM (CA-LDM) for comprehensive understanding of historical sequences. Our denoising diffusion process learns the distribution of both conditional generation and condition-aware reconstruction. Results on KittiCaltech, KTH and SEVIR datasets show that our PredLDM provides promising performance and realistic predictions in multiple scenarios including car driving, human motion, and weather evolution. Code will be released here during camera ready.

URL: https://openreview.net/forum?id=TWmnOUzcCo

---

Title: R3: Robust Rubric-Agnostic Reward Models

Abstract: Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source.

URL: https://openreview.net/forum?id=UVMCee4JOJ

---

Title: Fractal Generative Models

Abstract: Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models themselves into atomic modules. Our method constructs generative models by recursively invoking atomic generative modules, resulting in architectures with fractal-like, self-similar properties. We call this new class of models fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic modules and examine it on the challenging task of pixel-by-pixel image generation. Our experiments show strong performance in both likelihood estimation and generation quality. We hope this work could serve as a starting point for future research into fractal generative models, establishing a new paradigm in generative modeling.

URL: https://openreview.net/forum?id=Qk9kn6lOlW

---

Title: Constant Rate Scheduling: A General Framework for Optimizing Diffusion Noise Schedule via Distributional Change

Abstract: We propose a general framework for optimizing noise schedules in diffusion models, applicable to both training and sampling. Our method enforces a constant rate of change in the probability distribution of diffused data throughout the diffusion process, where the rate of change is quantified using a user-defined discrepancy measure. We introduce three such measures, which can be flexibly selected or combined depending on the domain and model architecture. While our framework is inspired by theoretical insights, we do not aim to provide a complete theoretical justification of how distributional change affects sample quality. Instead, we focus on establishing a general-purpose scheduling framework and validating its empirical effectiveness. Through extensive experiments, we demonstrate that our approach consistently improves the performance of both pixel-space and latent-space diffusion models, across various datasets, samplers, and a wide range of function-evaluation budgets, from 5 to 250. In particular, when applied to both training and sampling schedules, our method achieves a state-of-the-art FID score of 2.03 on LSUN Horse 256$\times$256, without compromising mode coverage.

URL: https://openreview.net/forum?id=Pjq6kdvMBj

---

Title: Meta Compression: Learning to Compress Pre-trained Deep Neural Networks

Abstract: State-of-the-art deep neural networks (DNNs) have achieved outstanding results in a variety of tasks. Unfortunately, these DNNs are so large that they cannot fit into the limited resources of edge servers or end devices such as smartphones and IoT sensors. Several approaches have been proposed to design compact yet efficient DNNs; however, the performance of the compressed model can only be characterized a posteriori. This work addresses this issue by introducing meta compression, a novel approach based on meta learning to simplify a pre-trained DNN into one that fulfills given constraints on size or accuracy. We leverage diffusion-based generative models to improve generalization performance of meta learning and extensively evaluate meta compression on an image classification task with popular pre-trained DNNs. The obtained results show that meta compression achieves a 92% top-5 recommendation accuracy and that the top-1 recommendation is only 1% away from the optimal compression method in terms of average accuracy loss.

URL: https://openreview.net/forum?id=p6x58idqhx

---

Title: Domain Indexing Collaborative Filtering

Abstract: In cross-domain recommendation systems, addressing cold-start items remains a significant challenge. Previous methods typically focus on maximizing performance using cross-domain knowledge, often treating the knowledge transfer process as a black box. However, the recent development of domain indexing introduces a new approach to better address such challenges. We have developed an adversarial Bayesian framework, Domain Indexing Collaborative Filtering (DICF), that infers domain indices during cross-domain recommendation. This framework not only significantly improves the recommendation performance but also provides interpretability for cross-domain knowledge transfer. This is verified by our empirical results on both synthetic and real-world datasets.

URL: https://openreview.net/forum?id=2Wvpq5M42E

---

Title: Reinforcement Learning in the Presence of Epistemic Ambivalence

Abstract: The complexity of online decision-making under uncertainty stems from the requirement of finding a balance between exploiting known strategies and exploring new possibilities. Naturally, the uncertainty type plays a crucial role in developing decision-making strategies that manage complexity effectively. In this paper, we focus on a specific form of uncertainty known as epistemic ambivalence (EA), which emerges from conflicting pieces of evidence or contradictory experiences. It creates a delicate interplay between uncertainty and confidence, distinguishing it from epistemic uncertainty that typically diminishes with new information. Indeed, ambivalence can persist even after additional knowledge is acquired. To address this phenomenon, we propose a novel framework, called the epistemically ambivalent Markov decision process (EA-MDP), aiming to understand and control EA in decision-making processes. This framework incorporates the concept of a quantum state from the quantum mechanics formalism, and its core is to assess the probability and reward of every possible outcome. We calculate the reward function using quantum measurement techniques and prove the existence of an optimal policy and an optimal value function in the EA-MDP framework. We also propose the EA-epsilon-greedy Q-learning algorithm. To evaluate the impact of EA on decision-making and the expedience of our framework, we study two distinct experimental setups, namely the two-state problem and the lattice problem. Our results show that using our methods, the agent converges to the optimal policy in the presence of EA.

URL: https://openreview.net/forum?id=E9UJcLEzHc

---

Title: Video Prediction Transformers without Recurrence or Convolution

Abstract: Video prediction has witnessed the emergence of RNN-based models led by ConvLSTM, and CNN-based models led by SimVP. Following the significant success of ViT, recent works have integrated ViT into both RNN and CNN frameworks, achieving improved performance. While we appreciate these prior approaches, we raise a fundamental question: Is there a simpler yet more effective solution that can eliminate the high computational cost of RNNs while addressing the limited receptive fields and poor generalization of CNNs? How far can it go with a simple pure transformer model for video prediction? In this paper, we propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D Attention in the context of video prediction. Extensive experiments demonstrate that PredFormer delivers state-of-the-art performance across four standard benchmarks. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer as a strong baseline for real-world video prediction applications. The source code and trained models will be released to the public.

URL: https://openreview.net/forum?id=Afvhu9Id8m

---

Title: SPARC: Concept-Aligned Sparse Autoencoders for Cross-Model and Cross-Modal Interpretability

Abstract: Understanding how different AI models encode the same high-level concepts, such as objects or attributes, remains challenging because each model typically produces its own isolated representation. Existing interpretability methods like Sparse Autoencoders (SAEs) produce latent concepts individually for each model, resulting in incompatible concept spaces and limiting cross-model interpretability. To address this, we introduce SPARC (Sparse Autoencoders for Aligned Representation of Concepts), a new framework that learns a single, unified latent space shared across diverse architectures and modalities (e.g., vision models like DINO, and multimodal models like CLIP). SPARC's alignment is enforced through two key innovations: (1) a Global TopK sparsity mechanism, ensuring all input streams activate identical latent dimensions for a given concept; and (2) a Cross-Reconstruction Loss, which explicitly encourages semantic consistency between models. On Open Images, SPARC dramatically improves concept alignment, achieving a Jaccard similarity of 0.80, more than tripling the alignment compared to previous methods. SPARC creates a shared sparse latent space where individual dimensions often correspond to similar high-level concepts across models and modalities, enabling direct comparison of how different architectures represent identical concepts without requiring manual alignment or model-specific analysis. As a consequence of this aligned representation, SPARC also enables practical applications such as text-guided spatial localization in vision-only models and cross-model/cross-modal retrieval.

URL: https://openreview.net/forum?id=IJfvoc2GbZ

---

Title: Lorenza: Enhancing Generalization in Low-Rank Gradient LLM Training via Efficient Zeroth-Order Adaptive SAM

Abstract: Modern applications often require fine-tuning large language models (LLMs) within strict memory and computational limits, but existing memory-efficient optimizers tend to compromise robustness and generalization. To tackle this, we introduce Lorenza, a low-memory optimizer based on Sharpness-Aware Minimization (SAM). Lorenza employs a stochastic zeroth-order estimator to approximate ascent directions, reducing the computational complexity of SAM while, as we prove, maintaining its convergence guarantees. Additionally, by applying randomized singular value decomposition, Lorenza performs efficient low-rank gradient updates, achieving memory efficiency similar to traditional methods. Our theoretical analysis and experiments demonstrate that Lorenza improves robustness and generalization, particularly in challenging language tasks. Furthermore, we present Lorenza+, which enhances Lorenza by incorporating the discarded orthogonal gradient component, resulting in additional performance gains without requiring extra memory or computational overhead.

URL: https://openreview.net/forum?id=YyA51ekcQo

---

Title: Learning object representations through amortized inference over probabilistic programs

Abstract: The recent developments of modern probabilistic programming languages have enabled the combination of pattern recognition engines implemented by neural networks to guide inference over explanatory factors written as symbols in probabilistic programs. We argue that learning to invert fixed generative programs, instead of learned ones, places stronger restrictions on the representations learned by feature extraction networks, which reduces the space of latent hypotheses and enhances training efficiency. To empirically demonstrate this, we investigate a neurosymbolic object-centric representation learning approach that combines a slot-based neural module optimized via inference compilation to invert a prior generative program of scene generation. By amortizing the search over posterior hypotheses, we demonstrate that approximate inference using data-driven sequential Monte Carlo methods achieves competitive results when compared to state-of-the-art fully neural baselines while requiring several times fewer training steps.

URL: https://openreview.net/forum?id=nUFSrlJaUr

---

Title: Mathematical Modeling and Fractal Geometry for Microtexture Fabric Analysis

Abstract: Automated microtexture analysis of textile materials is critical for scalable fabric characterization in industrial quality control and high-throughput processing. We introduce a reproducible pipeline that employs Raspberry Pi microscope imaging, robust preprocessing with augmentation and imaging condition simulation, and encodes each fabric sample as a 41-dimensional feature vector. This vector captures statistical, edge, Haralick/GLCM, LBP, fractal, wavelet, Tamura, and morphological descriptors, supplemented by fractal fitting overlays that yield interpretable surface roughness and complexity maps. We release an open dataset of 20 fabric types with 500 high-resolution images, paired feature vectors, raw microscopy data, and fractal overlay visualizations. Experimental results show consistent improvements in classification F1 scores and defect detection AUC compared to baseline handcrafted feature pipelines across diverse textile types. Our work provides a transparent, extensible framework for computational materials science, AI-driven quality control, and educational use in automated textile analysis.

URL: https://openreview.net/forum?id=d3y4SF34q8

---

Title: On Self-Adaptive Perception Loss Function for Sequential Lossy Compression

Abstract: We consider causal, low-latency, sequential lossy compression, with mean squared error (MSE) as the distortion loss, and a perception loss function (PLF) to enhance the realism of reconstructions. As the main contribution, we propose and analyze a new PLF that considers the joint distribution between the current source frame and the previous reconstructions. We establish the theoretical rate-distortion-perception function for first-order Markov sources and analyze the Gaussian model in detail. From a qualitative perspective, the proposed metric can simultaneously avoid the error-permanence phenomenon and better exploit the temporal correlation between high-quality reconstructions. The proposed metric is referred to as self-adaptive perception loss function (PLF-SA), as its behavior adapts to the quality of reconstructed frames. We provide a detailed comparison of the proposed perception loss function with previous approaches through both information-theoretic analysis and experiments involving moving MNIST and UVG datasets.

URL: https://openreview.net/forum?id=G6x2TPVIjm

---

Title: Generative Causal Structure Learning with Dual Latent Spaces and Annealing

Abstract: In this work, we address causal structure learning in the presence of unobserved confounders. Such causal structures can be represented by Acyclic Directed Mixed Graphs (ADMGs), where observed cause-effect relations are depicted by directed edges and unobserved confounded relations by bidirected edges. Prior methods for causal structure learning with unobserved common causes have primarily focused on search-based approaches, and more recently on flow-based generative models. We propose a novel generative method based on a variant of the Variational Autoencoder (VAE) with dual latent spaces to represent the directed cause-effect relations and the bidirected unobserved confounded relations, associating two trainable adjacency matrices. To enhance the learning process, we introduce a causality constraint combined with the concept of a causal annealing strategy during training, guiding the learning toward meaningful causal structures. Experimental results show that our method achieves competitive performance in identifying both observed and latent causal relationships on synthetic datasets. Furthermore, we demonstrate that the learned causal structure significantly improves downstream causal inference performance on real-world data.

URL: https://openreview.net/forum?id=wI5rFWfjKV

---

Title: Are foundation models for computer vision good conformal predictors?

Abstract: Recent advances in self-supervision and contrastive learning have brought the performance of foundation models to unprecedented levels in a variety of tasks. Fueled by this progress, these models are becoming the prevailing approach for a wide array of real-world vision problems, including risk-sensitive and high-stakes applications. However, ensuring safe deployment in these scenarios requires a more comprehensive understanding of their uncertainty modeling capabilities, which has been barely explored. In this work, we delve into the behaviour of vision and vision-language foundation models under Conformal Prediction (CP), a statistical framework that provides theoretical guarantees of marginal coverage of the true class. Across extensive experiments including popular vision classification benchmarks, well-known foundation vision models, and three CP methods, our findings reveal that foundation models are well-suited for conformalization procedures, particularly those integrating Vision Transformers. We also show that calibrating the confidence predictions of these models, a popular strategy to improve their uncertainty quantification, actually leads to efficiency degradation of the conformal set on adaptive CP methods. Furthermore, few-shot adaptation of Vision-Language Models (VLMs) to downstream tasks, whose popularity is surging, enhances conformal scores compared to zero-shot predictions. Last, our empirical study exposes APS as particularly promising in the context of vision foundation models, as it does not violate the marginal coverage guarantees across multiple challenging, yet realistic scenarios.

URL: https://openreview.net/forum?id=Kxdg98gZp4
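
The marginal coverage guarantee that the abstract attributes to Conformal Prediction can be illustrated with a minimal split-conformal sketch. It assumes a toy three-class setting with simulated probabilities and the simple nonconformity score 1 - p(true class); it is not claimed to match any of the three CP methods studied in the paper, and all function names are hypothetical.

```python
import math
import random


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, threshold):
    """Keep every class whose score 1 - p does not exceed the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


def empirical_coverage(alpha=0.1, n_cal=1000, n_test=2000, seed=0):
    """Estimate how often the prediction set contains the true label."""
    rng = random.Random(seed)

    def draw():
        # Simulated model output: random probabilities, label drawn from them.
        raw = [rng.expovariate(1.0) for _ in range(3)]
        total = sum(raw)
        probs = [r / total for r in raw]
        label = rng.choices(range(3), weights=probs)[0]
        return probs, label

    cal_scores = []
    for _ in range(n_cal):
        probs, label = draw()
        cal_scores.append(1.0 - probs[label])
    threshold = conformal_threshold(cal_scores, alpha)

    hits = 0
    for _ in range(n_test):
        probs, label = draw()
        hits += label in prediction_set(probs, threshold)
    return hits / n_test


if __name__ == "__main__":
    print(empirical_coverage())
```

With exchangeable calibration and test points, the true label falls in the set with probability at least 1 - alpha on average; the empirical value on a finite test sample fluctuates around that level.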

---

Title: Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

Abstract: Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework automatically constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially reducing effective concept usage and information leakage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning.

URL: https://openreview.net/forum?id=TNYLf65I3I

---
