Weekly TMLR digest for Apr 05, 2026


TMLR

Apr 5, 2026, 12:00:11 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Expert Certification: Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

Hannah Markgraf, Shambhuraj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff

https://openreview.net/forum?id=DDrGSEYxGU

---


Accepted papers
===============


Title: Learning Energy-Based Models by Self-Normalising the Likelihood

Authors: Hugo Henri Joseph Senetaire, Paul Jeha, Jes Frellsen, Pierre-Alexandre Mattei

Abstract: Training an energy-based model (EBM) with maximum likelihood is challenging due to the intractable normalisation constant. Traditional methods rely on expensive Markov chain Monte Carlo (MCMC) sampling to estimate the gradient of the logarithm of the normalisation constant. We propose a novel objective called self-normalised log-likelihood (SNL) that, compared to the regular log-likelihood, introduces a single additional learnable parameter representing the normalisation constant. SNL is a lower bound of the log-likelihood, and its optimum corresponds to both the maximum likelihood estimate of the model parameters and the normalisation constant. We show that the SNL objective is concave in the model parameters for exponential family distributions. Unlike the regular log-likelihood, SNL can be directly optimised using stochastic gradient techniques by sampling from a crude proposal distribution. We validate the effectiveness of our proposed method on various density estimation and parameter estimation tasks. Our results show that the proposed method, while simpler to implement and tune, outperforms existing techniques for small to moderate dimensions but degrades for high-dimensional problems. We extend this framework to EBMs for regression and show the usefulness of our method in this setting, as we outperform existing techniques.

URL: https://openreview.net/forum?id=GVaPBqI6ny
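As a quick numerical illustration of the idea (a toy sketch with a standard Gaussian energy where log Z is known exactly, not the paper's implementation), the self-normalised objective can be checked directly: for any value b of the extra parameter it lower-bounds the log-likelihood, with equality exactly at b = log Z.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalised model exp(-E(x)) with E(x) = x^2/2, so Z = sqrt(2*pi)
# and log Z is known in closed form -- convenient for checking the bound.
energy = lambda x: 0.5 * x ** 2
log_Z = 0.5 * np.log(2 * np.pi)
Z = np.exp(log_Z)

x = rng.normal(size=5000)                    # "data" drawn from the model
log_lik = np.mean(-energy(x)) - log_Z        # average log-likelihood

def snl(b):
    # Self-normalised objective: uses log Z <= exp(-b)*Z + b - 1, the
    # tangent-line bound on the logarithm, tight exactly at b = log Z.
    return np.mean(-energy(x)) - np.exp(-b) * Z - b + 1.0

bs = np.linspace(-2.0, 3.0, 501)
vals = np.array([snl(b) for b in bs])
b_star = bs[np.argmax(vals)]                 # maximiser estimates log Z
```

In the paper Z itself is intractable and is replaced by an importance-sampling estimate from a crude proposal; here the exact value is used so the lower-bound property can be verified without Monte Carlo error.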

---

Title: A Survey on Over-smoothing and Over-squashing: Unified Propagation Perspectives on Graph Neural Networks and Transformers

Authors: Alvaro Arroyo, Federico Barbero, Hugh Blayney, Michael M. Bronstein, Xiaowen Dong, Pietro Lio, Razvan Pascanu, Pierre Vandergheynst

Abstract: Decoder-Transformers have achieved remarkable success and have laid the groundwork for the development of Large Language Models (LLMs). At the core of these models is the self-attention matrix, which allows different tokens to interact with each other. This process is remarkably similar to the message-passing mechanism used in Graph Neural Networks (GNNs), and as such decoder-Transformers suffer many of the optimization difficulties studied extensively in the GNN literature. In this paper, we present a unified graph perspective that bridges the theoretical understanding of decoder-Transformers and GNNs. We systematically examine how well-known phenomena in GNNs, such as over-smoothing and over-squashing, directly manifest as analogous issues like rank collapse and representational collapse in deep Transformer architectures. By interpreting Transformers' self-attention as a learned adjacency operator, we reveal shared underlying principles governing signal propagation and demonstrate how insights from one field can illuminate challenges and solutions in the other. We analyze the role of architectural components like residual connections, normalization, and causal masking in these issues. We aim to provide a framework for understanding how information flows through deep learning models that perform sequence mixing through an adjacency operator, and to highlight areas for cross-pollination of research, as well as to provide a comprehensive reference for researchers interested in the underpinnings of these architectures.

URL: https://openreview.net/forum?id=H9zhC5pVnH
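The over-smoothing effect this survey analyzes is easy to reproduce in a stripped-down setting (a sketch that ignores residuals, normalisation, and per-layer weights): treating attention as a fixed row-stochastic mixing matrix, repeated application collapses all token representations onto a single vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# A self-attention matrix is row-stochastic (each row is a softmax).
n_tokens, dim = 8, 16
logits = rng.normal(size=(n_tokens, n_tokens))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

X = rng.normal(size=(n_tokens, dim))         # initial token features

def token_spread(X):
    # Largest pairwise distance between token representations.
    d = X[:, None, :] - X[None, :, :]
    return np.sqrt((d ** 2).sum(-1)).max()

spread_before = token_spread(X)
for _ in range(200):                         # 200 attention-only layers
    X = A @ X
spread_after = token_spread(X)               # collapses toward zero
```

This is the graph-side analogue of rank collapse: A acts like a (dense) normalised adjacency operator, and its powers converge to a rank-one matrix, which is exactly why the architectural components discussed above (residuals, masking) matter.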

---

Title: Tube Loss: A Novel Loss Function for Prediction Interval Estimation

Authors: Pritam Anand, Tathagata Bandyopadhyay, Suresh Chandra

Abstract: This paper proposes a novel loss function called Tube Loss, developed for the simultaneous estimation of the lower and upper bounds of a Prediction Interval (PI) in regression problems, including probabilistic forecasting in autoregressive frameworks. The PIs obtained through empirical risk minimization using Tube Loss exhibit superior performance compared to those derived from existing approaches. A theoretical analysis confirms that the estimated PIs asymptotically attain a user-specified confidence level $1-\alpha$. A distinctive feature of Tube Loss is its ability to shift the PI along the support of the response distribution through a tunable parameter, allowing the intervals to better align with high-density regions of the distribution. This is especially valuable for generating tighter intervals when the response distribution is skewed. Moreover, the method allows further narrowing of PIs through recalibration. Unlike several prior techniques, the empirical risk associated with Tube Loss can be efficiently optimized via gradient descent. Extensive experiments demonstrate the robustness and accuracy of the proposed method in delivering high-quality PIs across a range of models, including kernel machines, neural networks, and probabilistic forecasting frameworks.

URL: https://openreview.net/forum?id=3vwPza62Rr
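For context, the classical route that Tube Loss generalises estimates each PI bound separately with the pinball (quantile) loss. The sketch below (standard quantile estimation with constant predictors over a grid, not the paper's Tube Loss) shows how minimising pinball losses at levels alpha/2 and 1-alpha/2 yields an interval with roughly 1-alpha coverage:

```python
import numpy as np

rng = np.random.default_rng(0)

def pinball(q, y, tau):
    # Quantile ("pinball") loss: over a constant predictor q, it is
    # minimised at the tau-quantile of y.
    e = y - q
    return np.mean(np.maximum(tau * e, (tau - 1.0) * e))

alpha = 0.2
y = rng.exponential(size=20_000)             # skewed response

grid = np.linspace(0.0, 6.0, 601)
lo = grid[np.argmin([pinball(q, y, alpha / 2) for q in grid])]
hi = grid[np.argmin([pinball(q, y, 1 - alpha / 2) for q in grid])]
coverage = np.mean((y >= lo) & (y <= hi))    # close to 1 - alpha = 0.8
```

Tube Loss instead trains both bounds jointly with a single loss and exposes a shift parameter that slides the interval toward high-density regions, which is what tightens PIs for skewed responses like this one.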

---

Title: Feature Representation Transferring to Lightweight Models via Perception Coherence

Authors: Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda CHHAIBI, Serge Gratton, Thierry Giaccone

Abstract: In this paper, we propose a method for transferring feature representation to lightweight student models from larger teacher models. We mathematically define a new notion called perception coherence. Based on this notion, we propose a loss function which takes into account the dissimilarities between data points in feature space through their ranking. At a high level, by minimizing this loss function, the student model learns to mimic how the teacher model perceives inputs. More precisely, our method is motivated by the fact that the representational capacity of the student model is weaker than that of the teacher model. Hence, we aim to develop a conceptually new method allowing for a better relaxation: the student model does not need to preserve the absolute geometry of the teacher model, as long as it preserves global coherence through dissimilarity ranking. Importantly, while rankings are defined only on finite sets, our notion of perception coherence extends them into a probabilistic form. This formulation depends on the input distribution and applies to general dissimilarity metrics. Our theoretical insights provide a probabilistic perspective on the process of feature representation transfer. Our experimental results show that our method outperforms or achieves on-par performance with strong baseline methods for representation transfer, particularly class-unaware ones.

URL: https://openreview.net/forum?id=yQbNbeSEUq
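The ranking idea can be sketched in a few lines (an illustrative discordant-pair count, not the paper's probabilistic perception-coherence loss): the student is penalised only when the *order* of its dissimilarities disagrees with the teacher's, not when the absolute distances differ.

```python
import numpy as np

def rank_disagreement(d_teacher, d_student):
    # Fraction of pairs (i, j) whose dissimilarity ordering differs
    # between teacher and student.
    i, j = np.triu_indices(len(d_teacher), k=1)
    sign_t = np.sign(d_teacher[i] - d_teacher[j])
    sign_s = np.sign(d_student[i] - d_student[j])
    return float(np.mean(sign_t != sign_s))

d_t = np.array([0.1, 0.4, 0.9, 1.5])      # teacher distances to an anchor
d_same_order = 10.0 * d_t + 3.0           # different geometry, same ranking
d_shuffled = d_t[[2, 0, 3, 1]]            # same values, scrambled ranking

loss_coherent = rank_disagreement(d_t, d_same_order)
loss_incoherent = rank_disagreement(d_t, d_shuffled)
```

The first comparison incurs zero penalty despite a completely different scale, which is the "relaxation" the abstract describes: a weaker student need not reproduce the teacher's geometry, only its ordering.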

---

Title: Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems

Authors: Usman Akram, Haris Vikalo

Abstract: Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input–output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.

URL: https://openreview.net/forum?id=hIMK5MvGkP
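The linear-Gaussian baseline mentioned above is compact enough to state in full. A minimal scalar Kalman filter (a textbook sketch with made-up system parameters) for x_{t+1} = a*x_t + w_t, y_t = x_t + v_t:

```python
import numpy as np

rng = np.random.default_rng(0)

a, q_var, r_var, T = 0.9, 0.1, 0.5, 400    # toy system parameters

# Simulate the latent state and its noisy observations.
x, states, obs = 0.0, [], []
for _ in range(T):
    x = a * x + rng.normal(scale=np.sqrt(q_var))
    states.append(x)
    obs.append(x + rng.normal(scale=np.sqrt(r_var)))

# Kalman filter: predict, then update with each observation.
m, P, est = 0.0, 1.0, []
for y in obs:
    m, P = a * m, a * a * P + q_var        # predict
    K = P / (P + r_var)                    # Kalman gain
    m, P = m + K * (y - m), (1 - K) * P    # update
    est.append(m)

mse_filter = float(np.mean((np.array(est) - np.array(states)) ** 2))
mse_raw = float(np.mean((np.array(obs) - np.array(states)) ** 2))
```

The filtered estimates land well below the raw-observation error; it is this estimator that the frozen transformer's in-context predictions are compared against in the linear-Gaussian regime.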

---

Title: Towards Customized Knowledge Distillation for Efficient Dense Image Predictions

Authors: Dong Zhang, Pingcheng Dong, Long Chen, Kwang-Ting Cheng

Abstract: It has been revealed that efficient dense image prediction (EDIP) models designed for AI chips, trained using the knowledge distillation (KD) framework, encounter two key challenges: maintaining boundary region completeness and ensuring target region connectivity, despite their favorable real-time capacity to recognize the main object regions. In this work, we propose a customized boundary and context knowledge distillation (BCKD) method for EDIPs, which facilitates the targeted KD from large accurate teacher models to compact small student models. Specifically, the boundary distillation focuses on extracting explicit object-level boundaries from the hierarchical feature maps to enhance the student model's mask quality in boundary regions. Meanwhile, the context distillation leverages self-relations as a bridge to transfer implicit pixel-level contexts from the teacher model to the student model, ensuring strong connectivity in target regions. Our method is specifically designed for the EDIP tasks and is characterized by its simplicity and efficiency. Theoretical analysis and extensive experimental results across semantic segmentation, object detection, and instance segmentation on five representative datasets demonstrate the effectiveness of BCKD, resulting in well-defined object boundaries and smooth connecting regions.

URL: https://openreview.net/forum?id=4verIe3tE4

---

Title: SemVAD: Fusing Semantic and Vision Features for Weakly Supervised Video Anomaly Detection

Authors: Hamza Karim, Yasin Yilmaz

Abstract: In recent years, vision-language models such as CLIP and VideoLLaMA have demonstrated the ability to express visual data in semantically rich textual representations, making them highly effective for downstream tasks. Given their cross-modal semantic representation power, leveraging such models for video anomaly detection (VAD) holds significant promise. In this work, we introduce Semantic VAD (SemVAD), a novel methodology for weakly supervised video anomaly detection (wVAD) that effectively fuses visual and semantic features obtained from pretrained vision-language models, specifically VideoLLaMA 3 and CLIP. Our approach enhances performance and explainability in anomaly detection. Additionally, we analyze the sensitivity of recent state-of-the-art models to randomness in training initialization and introduce a more comprehensive evaluation framework to assess their robustness to small changes in training, such as the seed of the random number generator. This framework aims to provide a more rigorous and holistic assessment of model performance, ensuring a deeper understanding of their reliability and reproducibility in wVAD.

URL: https://openreview.net/forum?id=6tkvxrHidI

---

Title: Structure is Supervision: Multiview Masked Autoencoders for Radiology

Authors: Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E Vogt

Abstract: Building robust medical machine learning systems requires pretraining strategies that exploit the intrinsic structure present in clinical data.
We introduce Multiview Masked Autoencoder (MVMAE), a self-supervised framework that leverages the natural multi-view organization of radiology studies to learn view-invariant and disease-relevant representations. MVMAE combines masked image reconstruction with cross-view alignment, transforming clinical redundancy across projections into a powerful self-supervisory signal. We further extend this approach with MVMAE-V2T, which incorporates radiology reports as an auxiliary text-based learning signal to enhance semantic grounding while preserving fully vision-based inference. Evaluated on a downstream disease classification task on three large-scale public datasets, MIMIC-CXR, CheXpert, and PadChest, MVMAE consistently outperforms supervised and vision–language baselines. Furthermore, MVMAE-V2T provides additional gains, particularly in low-label regimes where structured textual supervision is most beneficial. Together, these results establish the importance of structural and textual supervision as complementary paths toward scalable, clinically grounded medical foundation models.

URL: https://openreview.net/forum?id=wryiHLzieX

---

Title: Asymptotically and Minimax Optimal Regret Bounds for Multi-Armed Bandits with Abstention

Authors: Junwen Yang, Tianyuan Jin, Vincent Y. F. Tan

Abstract: We introduce a novel extension of the canonical multi-armed bandit problem that incorporates an additional strategic innovation: \emph{abstention}. In this enhanced framework, the agent is not only tasked with selecting an arm at each time step, but also has the option to {\em abstain} from accepting the stochastic instantaneous reward before observing it. When opting for abstention, the agent either suffers a fixed regret or gains a guaranteed reward. This added layer of complexity naturally prompts the key question: can we develop algorithms that are both computationally efficient and asymptotically and minimax optimal in this setting? We answer this question in the affirmative by designing and analyzing algorithms whose regrets meet their corresponding information-theoretic lower bounds. Our results offer valuable quantitative insights into the benefits of the abstention option, laying the groundwork for further exploration in other online decision-making problems with such an option. Extensive numerical experiments validate our theoretical results, demonstrating that our approach not only advances theory but also has the potential to deliver significant practical benefits.

URL: https://openreview.net/forum?id=AYp5zOcFdA
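A toy version of the setting (an illustrative UCB1 variant, not the paper's asymptotically or minimax optimal algorithms, using the fixed-reward flavour of abstention) shows why the option helps: the agent abstains whenever even the optimistic index of the best arm falls below the guaranteed reward r_abs.

```python
import numpy as np

def ucb_with_abstention(means, r_abs, T=5000, seed=0):
    # Sketch: classic UCB1 over Bernoulli arms, except the agent takes
    # the guaranteed abstention reward r_abs whenever the largest UCB
    # index drops below it.
    rng = np.random.default_rng(seed)
    means = np.asarray(means, float)
    counts = np.ones(len(means))
    sums = rng.binomial(1, means).astype(float)   # one initial pull per arm
    total = sums.sum()
    for t in range(len(means), T):
        ucb = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
        i = int(np.argmax(ucb))
        if ucb[i] < r_abs:
            total += r_abs                        # abstain: guaranteed reward
        else:
            r = rng.binomial(1, means[i])
            sums[i] += r
            counts[i] += 1
            total += r
    return total / T

avg_good_arm = ucb_with_abstention([0.1, 0.6, 0.2], r_abs=0.4)  # pulling pays
avg_weak_arms = ucb_with_abstention([0.05, 0.15], r_abs=0.4)    # abstaining pays
```

With a strong arm available the agent earns close to its mean; when every arm is worse than r_abs, abstention keeps the average reward near the guarantee instead of the best arm's 0.15.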

---

Title: MV2MAE: Self-Supervised Video Pre-Training with Motion-Aware Multi-View Masked Autoencoders

Authors: Ketul Shah, Robert Crandall, Jie Xu, Peng Zhou, Vipin Pillai, Marian George, Mayank Bansal, Rama Chellappa

Abstract: Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition, tracking, etc. In this paper, we present MV2MAE, a method for self-supervised learning from synchronized multi-view videos, built on the masked autoencoder framework. We introduce two key enhancements to better exploit multi-view video data. First, we design a cross-view reconstruction task that leverages a cross-attention-based decoder to reconstruct a target viewpoint video from a source view. This helps in effectively injecting geometric information and yielding representations robust to viewpoint changes. Second, we introduce a controllable motion-weighted reconstruction loss which emphasizes dynamic regions and mitigates trivial reconstruction of static backgrounds. This improves temporal modeling and encourages learning more meaningful representations across views.
MV2MAE achieves state-of-the-art results on the NTU-60, NTU-120 and ETRI datasets among self-supervised approaches. In the more practical transfer learning setting, it delivers consistent gains of +2.0 -- 8.5% on NUCLA, PKU-MMD-II and ROCOG-v2 datasets, demonstrating the robustness and generalizability of our approach. Code: https://github.com/kshah33/mv2mae

URL: https://openreview.net/forum?id=nqt35xJywK
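The motion-weighting idea can be sketched simply (an illustrative weighting scheme; the paper's exact loss and its controllability are not reproduced here): weight per-pixel reconstruction errors by temporal-difference magnitude, so a static background contributes nothing to the loss.

```python
import numpy as np

rng = np.random.default_rng(0)

video = rng.normal(size=(8, 16, 16))        # T x H x W toy clip
video[:, :8, :] = 1.0                       # top half: static background

# Motion map from temporal differences, normalised into weights.
motion = np.abs(np.diff(video, axis=0)).mean(axis=0)
w = motion / (motion.sum() + 1e-8)

# Corrupt the reconstruction only in the static region.
recon = video.copy()
recon[:, :8, :] += 1.0

err = ((recon - video) ** 2).mean(axis=0)   # per-pixel MSE
loss_motion = float((w * err).sum())        # ignores background errors
loss_plain = float(err.mean())              # penalises them heavily
```

A plain reconstruction loss rewards the trivial strategy of copying the static background; the motion-weighted version assigns it zero weight, pushing capacity toward dynamic regions.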

---

Title: Accurate Split Learning on Noisy Signals

Authors: Hang Xu, Subhajit Maity, Aritra Dutta, Xin Li, Panos Kalnis

Abstract: Noise injection is applied in Split Learning to address privacy concerns about data leakage. Previous work protects Split Learning by adding noise to intermediate results during the forward pass. Unfortunately, noisy signals significantly degrade the accuracy of Split Learning training. This paper focuses on improving the training accuracy of Split Learning in the presence of noisy signals while protecting training data from reconstruction attacks. We propose two denoising techniques, namely scaling and random masking. Our theoretical results show that both of our denoising techniques accurately estimate the intermediate variables during the forward pass of Split Learning. Moreover, our experiments with deep neural networks demonstrate that the proposed denoising approaches allow Split Learning to tolerate high noise levels while achieving almost the same accuracy as the noise-free baseline. Interestingly, we show that, after applying our denoising techniques, the resulting network is more resilient to a state-of-the-art attack than the simple noise-injection approach. Our code is publicly available at: github.com/MaitySubhajit/AccurateSL.

URL: https://openreview.net/forum?id=in1T4BlzG9

---

Title: Language-Aware Information Maximization for Transductive Few-Shot CLIP

Authors: Ghassen Baklouti, Maxime Zanella, Ismail Ben Ayed

Abstract: Transductive few-shot learning has triggered an abundant literature focusing on vision-only models, but is still at a nascent stage within the recent context of foundational vision-language models (VLMs). Only a few recent methods addressed the problem, pointing to the potential of transduction in VLMs and to the need for VLM-tailored methods. Building on this momentum, we leverage information-theoretic concepts and recent progress in parameter-efficient fine-tuning (PEFT), developing a highly competitive transductive few-shot CLIP method. Specifically, we introduce a novel Language-aware Information MaximizatiOn (LIMO) loss integrating three complementary terms: (i) the mutual information between the vision inputs and the textual class descriptions; (ii) a Kullback-Leibler (KL) divergence penalizing deviation of the network's probabilistic outputs from the text-driven zero-shot predictions; and (iii) a standard cross-entropy loss based on the labeled shots. Furthermore, we challenge the commonly followed fine-tuning practices in the context of transductive few-shot learning, and explore PEFT strategies, completely overlooked in this context. Surprisingly, we observe substantial boosts in performance, which points to the potential of adapting a subset of the model's parameters in the transductive few-shot setting. We report comprehensive evaluations, which show that LIMO outperforms the very recent transductive few-shot CLIP methods by a large margin and yields significant gains over the best-performing inductive methods. We will publicly release our code.

URL: https://openreview.net/forum?id=JxYmNn5oaY

---

Title: Safe Reinforcement Learning using Action Projection: Safeguard the Policy or the Environment?

Authors: Hannah Markgraf, Shambhuraj Sawant, Hanna Krasowski, Lukas Schäfer, Sebastien Gros, Matthias Althoff

Abstract: Projection-based safety filters, which modify unsafe actions by mapping them to the closest safe alternative, are widely used to enforce safety constraints in reinforcement learning (RL). Two integration strategies are commonly considered: Safe environment RL (SE-RL), where the safeguard is treated as part of the environment, and safe policy RL (SP-RL), where it is embedded within the policy through differentiable optimization layers. Despite their practical relevance in safety-critical settings, a formal understanding of their differences is lacking.
In this work, we present a theoretical comparison of SE-RL and SP-RL. We identify a key distinction in how each approach is affected by action aliasing, a phenomenon in which multiple unsafe actions are projected to the same safe action, causing information loss in the policy gradients. In SE-RL, this effect is implicitly approximated by the critic, while in SP-RL, it manifests directly as rank-deficient Jacobians during backpropagation through the safeguard.
Our contributions are threefold: (i) a unified formalization of SE-RL and SP-RL in the context of actor-critic algorithms, (ii) a theoretical analysis of their respective policy gradient estimates, highlighting the role of action aliasing, and (iii) a comparative study of mitigation strategies, including a novel penalty-based improvement for SP-RL that aligns with established SE-RL practices. Empirical results support our theoretical predictions, showing that action aliasing is more detrimental for SP-RL than for SE-RL. However, with appropriate improvement strategies, SP-RL can match or outperform improved SE-RL across a range of environments. These findings provide actionable insights for choosing and refining projection-based safe RL methods based on task characteristics.

URL: https://openreview.net/forum?id=DDrGSEYxGU
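Action aliasing, the phenomenon at the centre of the analysis, is visible even for the simplest safeguard (a 1-D box projection; an illustration, not the paper's setup): distinct unsafe actions collapse onto the same boundary action, and the projection's derivative, hence the gradient signal flowing through it in SP-RL, vanishes there.

```python
import numpy as np

def project(a, lo=-1.0, hi=1.0):
    # Projection-based safety filter: map each action to the nearest
    # point of the safe set [lo, hi].
    return np.clip(a, lo, hi)

actions = np.array([-3.0, 1.5, 2.0, 0.3])
safe = project(actions)        # 1.5 and 2.0 alias to the same safe action

# Numerical derivative of the safeguard at each action: zero wherever the
# action was unsafe (the projection is locally constant), one where it
# was already safe.
eps = 1e-6
grad = (project(actions + eps) - project(actions - eps)) / (2 * eps)
```

In SE-RL this flat region is only seen through the critic's value estimates; in SP-RL it appears directly as a zero (in higher dimensions, rank-deficient) Jacobian during backpropagation, which is the asymmetry the paper formalises.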

---

Title: The Final-Stage Bottleneck: A Systematic Dissection of the R-Learner for Network Causal Inference

Authors: Sairam S, Sara Girdhar, Shivam Soni

Abstract: The R-Learner is a powerful, theoretically-grounded framework for estimating heterogeneous treatment effects, prized for its robustness to nuisance model errors. However, its application to network data—where causal heterogeneity may be driven by graph structure—presents critical and underexplored challenges to its core assumption of a well-specified final-stage model. In this paper, we conduct a large-scale, multi-seed empirical study to systematically dissect the R-Learner framework on graphs. Our results suggest that for network-dependent effects, a critical driver of performance is the inductive bias of the final-stage CATE estimator, a factor whose importance can dominate that of the nuisance models.
Our central finding is a systematic quantification of a "representation bottleneck": we demonstrate empirically and through a constructive theoretical example that graph-blind final-stage estimators, being theoretically misspecified, exhibit significant under-performance (MSE > 4.0, p < 0.001 across all settings). Conversely, we show that an R-Learner with a correctly specified, end-to-end graph-aware architecture (the "Graph R-Learner") achieves a significantly lower error.
Furthermore, we provide a comprehensive analysis of the framework’s properties. We identify a subtle "nuisance bottleneck" and provide a mechanistic explanation for its topology dependence: on hub-dominated graphs, graph-blind nuisance models can partially capture concentrated confounding signals, while on graphs with diffuse structure, a GNN’s explicit aggregation becomes critical. This is supported by our analysis of a "Hub-Periphery Tradeoff," which we connect to the GNN over-squashing phenomenon. Our findings are validated across diverse synthetic and semi-synthetic benchmarks, where the R-Learner framework also significantly outperforms a strong, non-DML GNN T-Learner baseline. We release our code as a comprehensive and reproducible benchmark to facilitate future research on this critical "final-stage bottleneck."

URL: https://openreview.net/forum?id=QIE0FVSn0p

---

Title: Logical Anomaly Detection with Masked Image Modeling

Authors: Shunsuke SAKAI, Tatsuhito Hasegawa, Makoto Koshino

Abstract: Detecting anomalies such as an incorrect combination of objects or deviations in their positions is a challenging problem in unsupervised anomaly detection (AD). Since conventional AD methods mainly focus on local patterns of normal images, they struggle with detecting logical anomalies that appear in the global patterns.
To effectively detect these challenging logical anomalies, we introduce \textbf{L}ogical \textbf{A}nomaly \textbf{D}etection with \textbf{M}asked \textbf{I}mage \textbf{M}odeling (\textbf{LADMIM}), a novel unsupervised AD framework that harnesses the power of masked image modeling and discrete representation learning.
Our core insight is that predicting the missing region forces the model to learn the long-range dependencies between patches.
Specifically, we formulate AD as a mask completion task, which predicts the distribution of discrete latents in the masked region.
As a distribution of discrete latents is invariant to the low-level variance in the pixel space, the model can desirably focus on the logical dependencies in the image, which improves accuracy in the logical AD.
We evaluate the AD performance on five benchmarks and show that our approach achieves comparable performance without any pre-trained segmentation models.
We also conduct comprehensive experiments to reveal the key factors that influence logical AD performance. Code is available at: \url{https://github.com/SkyShunsuke/LADMIM}.

URL: https://openreview.net/forum?id=uuuaRCMYE3

---

Title: Ternary Momentum For Quantized Training

Authors: Noga Bar, Amit Attia, Michal Moshkovitz, Dotan Di Castro

Abstract: Quantization enables efficient inference on resource-limited devices, yet training still depends on high-precision gradients and optimizer states.
We address this gap by introducing stochastic ternary momentum, a fully quantized optimizer that operates with quantized parameters and ternary gradient information, and maintains ternary momentum states for stable and memory-efficient quantized optimization.
Our method replaces deterministic and full-precision updates with integer-valued updates driven by stochastic sampling, ensuring that expected updates match standard momentum while maintaining strict memory constraints.
It eliminates re-quantization overhead and preserves quantization consistency throughout training.
We establish theoretical convergence guarantees of our ternary momentum method for convex objectives over bounded integer domains and for non-convex objectives over unbounded integer domains. Experiments on vision and language tasks demonstrate that our approach retains strong performance while reducing optimizer memory by 95\% compared to full-precision training, advancing the feasibility of fully quantized training.

URL: https://openreview.net/forum?id=A3mVmPlahU
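The core primitive is easy to demonstrate (a sketch of unbiased stochastic rounding onto {-1, 0, 1}, not the paper's full optimizer): round each value in [-1, 1] to an adjacent grid point with probabilities chosen so the rounding is unbiased, which is what lets the expected update match standard momentum.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternary_round(x, rng):
    # Stochastic rounding of x in [-1, 1] onto the grid {-1, 0, 1}:
    # round up with probability equal to the fractional part, so that
    # E[ternary_round(x)] = x (unbiased).
    lo = np.floor(x)
    p_up = x - lo
    return lo + (rng.random(x.shape) < p_up)

x = rng.uniform(-1.0, 1.0, size=200_000)
q = ternary_round(x, rng)                   # values in {-1, 0, 1}
bias = float(np.mean(q - x))                # ~0 by construction
```

Per-element rounding noise is traded for a 2-bit state: each momentum entry needs only a ternary value rather than a 32-bit float, which is where the reported optimizer-memory savings come from.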

---

Title: On Sketching for Gaussian Process Regression with New Statistical Guarantees

Authors: Jayesh Malaviya, Rachit Chhaya, Anirban Dasgupta, Supratim Shit

Abstract: The cubic computational complexity of Gaussian Process Regression (GPR) with respect to the number of data points is a major bottleneck to its scalability. While various approaches have been proposed to address this, few come with provable guarantees. Inspired by the success of ridge leverage score based sampling in scaling kernel ridge regression (El Alaoui & Mahoney, 2015), we propose a sketch-based approximation for GPR using ridge leverage scores. We provide theoretical guarantees on the approximation of the predictive mean, predictive variance, and negative log-marginal likelihood in this setting. To the best of our knowledge, these are the first theoretical guarantees for approximating the predictive variance and negative log-marginal likelihood of GPR using ridge leverage score sampling. We further show that a carefully constructed sketch of the kernel matrix preserves key statistical properties of the full GPR model with high probability. Our theoretical results are supported by empirical evaluations on real-world datasets, demonstrating strong trade-offs between accuracy and efficiency.

URL: https://openreview.net/forum?id=NmwrhyuVEu
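The sampling weights themselves are straightforward to compute at small scale (a sketch of the standard λ-ridge leverage scores the paper builds on, with an illustrative RBF kernel; the paper's sketching construction and guarantees are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

n, lam = 300, 1.0
X = rng.normal(size=(n, 2))

# RBF kernel matrix.
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)

# Ridge leverage scores: tau_i = (K (K + lam I)^{-1})_{ii}.
inv = np.linalg.solve(K + lam * np.eye(n), np.eye(n))
tau = np.einsum('ij,ji->i', K, inv)
d_eff = float(tau.sum())                    # effective dimension

# Sample landmark points proportionally to their leverage.
idx = rng.choice(n, size=50, replace=True, p=tau / tau.sum())
```

Each score lies in (0, 1) and their sum, the effective dimension, governs how large the sketch must be; sampling columns with probabilities proportional to tau is what preserves the kernel's statistical properties with high probability.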

---

Title: Towards Online Multimodal Social Interaction Understanding

Authors: Xinpeng Li, Shijian Deng, Bolin Lai, Weiguo Pian, James Matthew Rehg, Yapeng Tian

Abstract: In this paper, we introduce a new problem, Online-MMSI, where the model must perform multimodal social interaction understanding (MMSI) using only historical information. Given a recorded video and a multi-party dialogue, the AI assistant is required to immediately identify the speaker’s referent, which is critical for real-world human-AI interaction. Without access to future conversational context, both humans and models experience substantial performance degradation when moving from offline to online settings. To tackle the challenges, we propose Online-MMSI-VLM, a novel framework based on multimodal large language models. The core innovations of our approach lie in two components: (1) multi-party conversation forecasting, which predicts upcoming speaker turns and utterances in a coarse-to-fine manner; and (2) socially-aware visual prompting, which highlights salient social cues in each video frame using bounding boxes and body keypoints. Our model achieves state-of-the-art results on three tasks across two datasets, significantly outperforming the baseline and demonstrating the effectiveness of Online-MMSI-VLM.

URL: https://openreview.net/forum?id=5P7yVfUEuD

---

Title: ECLipsE-Gen-Local: Efficient Compositional Local Lipschitz Estimates for Deep Neural Networks

Authors: Yuezhu Xu, S Sivaranjani

Abstract: The Lipschitz constant is a key measure for certifying the robustness of neural networks to input perturbations. However, computing the exact constant is NP-hard, and standard approaches to estimate the Lipschitz constant involve solving a large matrix semidefinite program (SDP) that scales poorly with network size. Further, there is a potential to efficiently leverage local information on the input region to provide tighter Lipschitz estimates. We address this problem here by proposing a compositional framework that yields tight yet scalable Lipschitz estimates for deep feedforward neural networks. Specifically, we begin by developing a generalized SDP framework for Lipschitz estimation that is highly flexible, accommodating heterogeneous activation function slope bounds for each neuron on each layer, and allowing Lipschitz estimates with respect to arbitrary input-output pairs in the neural network and arbitrary choices of sub-networks of consecutive layers. We then decompose this generalized SDP into equivalent small sub-problems that can be solved sequentially, yielding the ECLipsE-Gen series of algorithms, with computational complexity that scales linearly with respect to the network depth. We also develop a variant that achieves near-instantaneous computation through closed-form solutions to each sub-problem. All our algorithms are accompanied by theoretical guarantees on feasibility and validity, serving as strict upper bounds on the true Lipschitz constant. Next, we develop a series of algorithms, termed ECLipsE-Gen-Local, that explicitly incorporate local information on the input region to provide tighter Lipschitz constant estimates. Our experiments demonstrate that our algorithms achieve substantial speedups over a multitude of benchmarks while producing significantly tighter Lipschitz bounds than global approaches.
Moreover, we demonstrate that our algorithms provide strict upper bounds for the Lipschitz constant with values approaching the exact Jacobian from autodiff when the input region is small enough. Finally, we demonstrate the practical utility of our approach by showing that our Lipschitz estimates closely align with network robustness. In summary, our approach considerably advances the scalability and efficiency of certifying neural network robustness, while capturing local input–output behavior to deliver provably tighter bounds, making it particularly suitable for safety-critical and adaptive learning tasks.

URL: https://openreview.net/forum?id=CuqnFjeu5a

---

Title: Quantitative Analysis of the Effect of Density Ratio Estimation in Covariate Shift Adaptation

Authors: Chenglin Yu, Zhengyu Zhou, Weiwei Liu

Abstract: In supervised learning, it is commonly assumed that the test sample and the training sample come from the same distribution. In reality, however, this assumption is frequently violated, which can lead to subpar performance from the learned model. We examine the learning problem under \emph{covariate shift}, in which the covariate distribution shifts while the conditional distribution of labels given covariates remains unchanged. Two-step procedures, which first estimate the density ratio and then carry out importance-weighted empirical risk minimization, are a popular family of methods for addressing covariate shift. However, the performance of two-step techniques can degrade due to estimation error in the density ratio.
Unfortunately, the extent to which density ratio estimation error affects the accuracy of learning algorithms is rarely analyzed. This paper accordingly provides a quantitative answer to this question. Specifically, we formulate two-step covariate shift adaptation methods as a meta-algorithm. We show that the effect of the density ratio estimation error on the excess risk bound of the meta-algorithm is of the fourth order, i.e., $\mathcal{O}\left(\epsilon_{1}\left(\mathcal{G}, S_{s1}, S_t, \delta/2\right)^4\right)$, provided the true risk satisfies a requirement known as the \emph{derivative vanishing} property, where $\epsilon_{1}\left(\mathcal{G}, S_{s1}, S_t, \delta/2\right)$ is the convergence rate of the density ratio estimation algorithm, $\mathcal{G}$ is the density ratio function class, $S_{s1}$ and $S_t$ are samples generated by the training and test distributions, respectively, and $\delta/2$ is the confidence parameter. Moreover, we analyze the impact of two specific density ratio estimation algorithms, the Kullback-Leibler Importance Estimation Procedure (KLIEP) and Kernel unconstrained Least-Squares Importance Fitting, on the final classifier's generalization error. We also report experimental results of two-step covariate shift adaptation on a toy classification dataset using KLIEP.
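The two-step procedure can be illustrated on a toy problem where the density ratio is available in closed form (in practice, step 1 would estimate it, e.g. with KLIEP); this is an illustrative sketch, not the paper's meta-algorithm:

```python
import numpy as np

def gaussian_density_ratio(x, mu_s=0.0, mu_t=1.0, sigma=1.0):
    # w(x) = p_test(x) / p_train(x) for two Gaussians with shared variance.
    return np.exp((x * (mu_t - mu_s) - 0.5 * (mu_t ** 2 - mu_s ** 2)) / sigma ** 2)

rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=200_000)   # samples from the training distribution
w = gaussian_density_ratio(x_train)            # step 1: density ratio (exact here)
est = np.sum(w * x_train) / np.sum(w)          # step 2: importance-weighted estimate
# est approximates the test-distribution mean (1.0) using only training samples.
```

Replacing the exact `w` with an estimated ratio introduces exactly the error whose effect the paper quantifies.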

URL: https://openreview.net/forum?id=TtWsnXTYUV

---


New submissions
===============


Title: United Minds or Isolated Agents? Exploring Coordination of LLMs under Cognitive Load Theory

Abstract: Large Language Models (LLMs) exhibit a notable performance ceiling on complex, multi-faceted tasks. As practitioners increasingly rely on heavy $\textit{context engineering}$---curating intricate instructions, tool schemas, and multi-turn histories---the processing demands often exceed the LLM's effective attention budget, leading to $\textit{context rot}$. Drawing a strong analogy to Cognitive Load Theory (CLT) in cognitive science, we posit that this bottleneck mirrors the bounded working memory of the human mind. Rather than relying on heuristic prompt engineering, we propose using CLT as a principled foundation for LLM system design. To validate this insight, we introduce $\text{\textbf{CoThinker}}$, an instantiation of a CLT-driven multi-agent framework. $\textit{CoThinker}$ operationalizes CLT principles by distributing intrinsic cognitive load through agent specialization and managing transactional load via structured communication and a collective working memory. We empirically validate $\textit{CoThinker}$ on complex problem-solving tasks and fabricated high cognitive load scenarios. Our analysis reveals characteristic interaction patterns that cast insights from collective cognition and load management into a principled approach to agent system design.

URL: https://openreview.net/forum?id=8BYJHZiZ5T

---

Title: Simplifying Multi-Task Architectures Through Task-Specific Normalization

Abstract: Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TS$\sigma$BN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TS$\sigma$BN improves stability across CNNs and Transformers, matching or exceeding performance on NYUv2, Cityscapes, CelebA, and PascalContext, while remaining highly parameter-efficient. Moreover, its learned gates provide a natural framework for analyzing MTL dynamics, offering interpretable insights into capacity allocation, filter specialization, and task relationships. Our findings suggest that complex MTL architectures may be unnecessary and that task-specific normalization offers a simple, interpretable, and efficient alternative.
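The core idea of replacing shared normalization with task-specific variants can be sketched as a per-task BatchNorm module; this is an illustrative simplification, not the paper's TS$\sigma$BN:

```python
import torch

class TaskSpecificBN(torch.nn.Module):
    # Feature-extractor weights stay fully shared elsewhere; only the
    # normalization layer is duplicated, one BatchNorm per task.
    def __init__(self, channels, tasks):
        super().__init__()
        self.bns = torch.nn.ModuleDict(
            {t: torch.nn.BatchNorm1d(channels) for t in tasks}
        )

    def forward(self, x, task):
        # Each task normalizes with its own running statistics and affine gains.
        return self.bns[task](x)

layer = TaskSpecificBN(16, ["seg", "depth"])
x = torch.randn(8, 16)
y = layer(x, "seg")
```

The per-task affine parameters are the only extra weights, which is what keeps this family of designs so parameter-efficient.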

URL: https://openreview.net/forum?id=QNO893OXrS

---

Title: Hyperedge Anomaly Detection with Hypergraph Neural Network

Abstract: A hypergraph is a data structure that enables us to model higher-order associations among data entities. Conventional graph-structured data can represent pairwise relationships only, whereas a hypergraph can associate any number of entities, which is essential in many real-life applications. Hypergraph learning algorithms have been well studied for numerous problem settings, such as node classification and link prediction. However, much less research has been conducted on anomaly detection in hypergraphs. Anomaly detection identifies events that deviate from the usual pattern and can be applied to hypergraphs to detect unusual higher-order associations. In this work, we propose an end-to-end hypergraph neural network-based model for identifying anomalous associations in a hypergraph. Our proposed algorithm operates in an unsupervised manner without requiring any labeled data. Extensive experimentation on several real-life datasets demonstrates the effectiveness of our model in detecting anomalous hyperedges.

URL: https://openreview.net/forum?id=etPYIk1BqO

---

Title: TabFlowM: Lightweight flow matching for Mixed-Type Tabular Data Synthesis in Latent Space

Abstract: Generative modeling for mixed-type tabular data has recently been dominated by diffusion-based methods, but their gains often come with schedule design, time-dependent score parameterization, and multi-step solvers that increase computational overhead and tuning difficulty. We present TabFlowM, a lightweight framework that asks a more targeted question: once mixed-type records are mapped into a decoder-compatible continuous transport space, is diffusion-style score learning still necessary? TabFlowM answers this by training a single time-conditioned velocity field via flow matching to deterministically transport Gaussian noise to the latent token space data distribution, replacing diffusion-specific score estimation and scheduling machinery with direct velocity regression on a closed-form coupling path. Experiments on six real-world benchmark datasets show that TabFlowM attains the best Shape/Trend score on 7 out of 12 dataset–metric pairs, improving over the next best baseline by 15.6% on average, achieves the strongest column-wise MLE efficacy on 5 out of 6 datasets, and reduces sampling time by more than half relative to the strongest diffusion baseline. Finally, on a million-scale fraud dataset with class ratios exceeding 100:1, where unconditional fidelity can decouple from rare event predictive utility and models with stronger Shape/Trend may underperform on minority class prediction, TabFlowM still delivers the strongest predictive utility with competitive fidelity and runtime. These findings suggest that, under an appropriate transport interface for mixed-type data, a minimalist flow matching generator can recover much of the benefit commonly associated with heavier diffusion models while substantially reducing computational and conceptual complexity.
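The velocity-regression objective described above reduces to a few lines; this is a generic flow matching loss on a linear coupling path over toy 2-D data, not TabFlowM's architecture or latent tokenizer:

```python
import torch

torch.manual_seed(0)
# Time-conditioned velocity field: input is (x_t, t), output is a 2-D velocity.
net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
data = torch.randn(256, 2) * 0.1 + 2.0          # toy stand-in for latent-space records

for _ in range(200):
    x0 = torch.randn_like(data)                 # Gaussian noise source
    t = torch.rand(data.shape[0], 1)
    xt = (1 - t) * x0 + t * data                # closed-form coupling path
    target = data - x0                          # exact velocity along that path
    pred = net(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()        # direct velocity regression
    opt.zero_grad(); loss.backward(); opt.step()
```

Sampling then just integrates the learned field from noise at $t=0$ to data at $t=1$, with no noise schedule or score parameterization involved.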

URL: https://openreview.net/forum?id=t5kygrpSIz

---

Title: Deep Models for Ground Reaction Force Prediction: Subject-Held-Out Evaluation Across Locomotion and Occupational Tasks

Abstract: Deep learning models for predicting ground reaction forces (GRFs) from motion capture markers are typically evaluated within a single laboratory using trial-level splits that permit subject identity leakage. We show that switching from standard trial-level splits to rigorous Leave-One-Subject-Out (LOSO) evaluation reduces $R^2$ by 0.20–0.30 points ($R^2$ 0.86 → 0.66 on locomotion, 0.86 → 0.56 on occupational tasks), quantifying for the first time the inflation caused by subject leakage across two task domains. Under LOSO, three representative architectures—ImprovedConvFormer (ICF), CNN-LSTM, and CNN—all generalize to unseen subjects with $R^2 > 0.5$, but with strongly task-dependent rankings: ICF leads on locomotion data ($R^2=0.655$ on GroundLink, 6 subjects) while CNN-LSTM leads on occupational data ($R^2=0.606$ on Patient Handling, 10 subjects, $p = 0.002$). We quantify a consistent hierarchy of feature difficulty across both datasets—vertical forces and CoP components are easiest, mediolateral forces intermediate, and anteroposterior forces consistently hardest—suggesting fundamental biomechanical observability constraints rather than model limitations.

Attention analysis reveals that ICF learns biomechanically meaningful sparse patterns (97.5% sparsity) organized into three functionally distinct head clusters that align with gait phase structure, with task-dependent differences between locomotion (periodic stride-locked peaks) and occupational tasks (broader, event-driven attention). Combined dual-holdout cross-validation achieves $R^2=0.587$ with asymmetric transfer: Patient Handling benefits from cross-lab data while GroundLink performance degrades. As an exploratory deployment stress test, we also evaluate cross-laboratory transfer: models trained in one laboratory and tested in the other. In this uncontrolled two-laboratory setting, all architectures exhibit severely negative $R^2$ (as low as $-1.5$), despite sharing anatomical markers and output channels, indicating that naïve cross-site deployment is unreliable when tasks, coordinate systems, and calibration protocols differ. We discuss how future work on geometric equivariance and physics-informed constraints could address this limitation, but do not implement such methods here. Code and all experimental outputs are publicly available.

URL: https://openreview.net/forum?id=X0dLMTGmJB

---

Title: A Survey of Hybrid Inference Systems for Large Language Models

Abstract: Efficient deployment of large language models (LLMs) requires balancing inference speed with output quality. Speculative decoding accelerates inference by using a smaller draft model to propose future tokens, whereas reasoning-heavy approaches such as chain-of-thought prompting, ensembles, and dynamic routing improve output quality through deep search and verification. Although historically treated as isolated research trajectories, the demands of complex, high-entropy tasks have forced these domains to converge. This paper presents a structured taxonomy and analysis focused specifically on the emerging paradigm of hybrid and orchestrated inference systems. We begin by examining isolated approaches and their limitations, highlighting the necessity of this shift and identifying a critical "Orchestration Gap" in current architectures. We intend this work to serve as a catalyst for future research on orchestrated inference, ultimately contributing to the development of systems that are both fast and capable.

URL: https://openreview.net/forum?id=OIrJI53MvN

---

Title: Tensor-Decomposed RNNs for Marked Temporal Point Processes

Abstract: We study parameter-efficient neural Marked Temporal Point Processes (MTPPs) for high-dimensional mark and exogenous feature spaces. Building on tensor-train (TT) factorization of recurrent kernels, we propose mark-aware TT shaping that aligns TT cores with known multi-way domain structure (e.g., asset/venue/side in finance). We provide a conditional intensity function-consistent training recipe and evaluate both accuracy and calibration (mark reliability and time-rescaling diagnostics). Across finance and public MTPP benchmarks, TT-compressed RNNs reduce parameters by 40-70% while matching dense baselines and remaining competitive with attention-based and state-space models; TT often improves calibration, and simple ensembling further strengthens reliability.

URL: https://openreview.net/forum?id=Y0up92GcjF

---

Title: DP-FedSOFIM: Differentially Private Federated Stochastic Optimization using Regularized Fisher Information Matrix

Abstract: Differentially private federated learning (DP-FL) often suffers from slow convergence under tight privacy budgets because the noise required for privacy preservation degrades gradient quality. Although second-order optimization can accelerate training, existing approaches for DP-FL face significant scalability limitations: Newton-type methods require clients to compute Hessians, while feature covariance methods scale poorly with model dimension. We propose \textbf{DP-FedSOFIM}, a simple and scalable second-order optimization method for DP-FL. The method constructs an online regularized proxy for the Fisher information matrix at the server using only privatized aggregated gradients, capturing useful curvature information without requiring Hessian computations or feature covariance estimation. Efficient rank-one updates based on the Sherman-Morrison formula enable communication costs proportional to the model size and require only $O(d)$ client-side memory. Because all curvature and preconditioning operations are performed at the server on already privatized gradients, \textbf{DP-FedSOFIM} introduces no additional privacy cost beyond the underlying privatized gradient release mechanism. Experiments on CIFAR-10 and PathMNIST show that \textbf{DP-FedSOFIM} converges faster and consistently achieves higher accuracy than DP-FedGD, DP-SCAFFOLD, and DP-FedFC across a range of privacy budgets, with particularly pronounced gains under stringent privacy constraints.
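The rank-one machinery referenced above rests on the Sherman-Morrison identity; a minimal sketch (shown in the dense $O(d^2)$ form for clarity; the paper's $O(d)$-memory variant would require a more structured representation):

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    # (A + u v^T)^{-1} = A^{-1} - (A^{-1} u v^T A^{-1}) / (1 + v^T A^{-1} u)
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(0)
d = 6
A = 2.0 * np.eye(d)                               # regularized proxy curvature
u, v = rng.normal(size=d), rng.normal(size=d)
fast = sherman_morrison(np.linalg.inv(A), u, v)   # O(d^2) incremental update
slow = np.linalg.inv(A + np.outer(u, v))          # O(d^3) inversion from scratch
```

This is why the server can keep a preconditioner current as privatized gradients arrive, without ever re-inverting a matrix.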

URL: https://openreview.net/forum?id=aDzj9DrwAR

---

Title: Fractional Rectifier Activations for Off-Policy Reinforcement Learning: A Systematic Empirical Study

Abstract: Activation functions play a key role in deep reinforcement learning by shaping how neural networks approximate policies and value functions. The Rectified Linear Unit (ReLU) remains the dominant choice due to its simplicity and computational efficiency. However, ReLU and its common variants rely on piecewise linear transformations, which may limit representational flexibility when modelling complex nonlinear dynamics in continuous-control tasks. In this work, we investigate fractional-order nonlinearities that extend rectifier-style activations while preserving their computational simplicity. We investigate three fractional variants: Fractional ReLU (FReLU), Fractional Leaky ReLU (FLReLU), and Fractional Parametric ReLU (FPReLU), which incorporate a fractional exponent that enables smooth and continuously adjustable nonlinear transformations.

We conduct a systematic empirical study of these activations in off-policy reinforcement learning by integrating them into two widely used algorithms, TD3 and SAC. Experiments are performed on continuous-control benchmarks from MuJoCo and the DeepMind Control Suite. Across tasks, architectures, and algorithms, fractional activations frequently outperform conventional rectifier functions, with the best fractional configuration yielding an average improvement of about 21\% in normalized Area Under the learning Curve (AUC) relative to the ReLU baseline. These results suggest that fractional rectifier activations provide a simple architectural modification that can improve function approximation and learning efficiency in deep reinforcement learning.
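One plausible form of such a fractional rectifier raises positive inputs to a fractional power; the papers' exact definitions of FReLU and its variants may differ, so treat this purely as a hedged illustration:

```python
import torch

def frelu(x, alpha=0.5):
    # Fractional ReLU sketch: 0 for x <= 0, x**alpha for x > 0.
    # alpha = 1 recovers the standard ReLU; alpha in (0, 1) compresses
    # large activations while keeping the rectifier's one-sided shape.
    return torch.where(x > 0, x.clamp(min=0.0) ** alpha, torch.zeros_like(x))

x = torch.tensor([-2.0, 0.0, 1.0, 4.0])
y = frelu(x)  # -> tensor([0., 0., 1., 2.])
```

The fractional exponent is the single extra knob: sweeping `alpha` traces a family of smooth nonlinearities between identity-like and heavily saturating behavior.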

URL: https://openreview.net/forum?id=8vRUIp5KfI

---

Title: What do near-optimal learning rate schedules look like?

Abstract: A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of the training process, but beyond having some kind of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure to find the best shapes within a parameterized schedule family. Our approach factors out the schedule shape from the base learning rate, which otherwise would dominate cross-schedule comparisons. We applied our search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on Wikitext103. We showed that our search procedure indeed generally found near-optimal schedules. We found that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on these workloads. Finally, we explored how the outputs of our shape search depend on other optimization hyperparameters, and found that weight decay can have a strong effect on the optimal schedule shape. To the best of our knowledge, our results represent the most comprehensive results on near-optimal schedule shapes for deep neural network training, to date.
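A common parameterized family of the kind searched over is linear warmup followed by cosine decay; the sketch below is illustrative rather than one of the paper's specific families, and it factors the shape out from the base learning rate as the paper's procedure does:

```python
import math

def warmup_cosine_shape(step, total_steps, warmup_frac=0.1):
    # Returns a multiplier in [0, 1]; the base learning rate is a separate
    # scalar factor, so schedule *shapes* can be compared independently of scale.
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return step / warmup_steps                      # linear warmup
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * frac))       # cosine decay to zero
```

Usage would be `lr = base_lr * warmup_cosine_shape(step, total_steps)`, with `base_lr` tuned per shape so that comparisons are not dominated by scale.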

URL: https://openreview.net/forum?id=pEsAMnmq0L

---

Title: Sampling Boltzmann distributions via normalizing flow approximation of transport maps

Abstract: In a celebrated paper, Noé, Olsson, Köhler and Wu (\cite{noe2019boltzmann}) introduced an efficient method for sampling high-dimensional Boltzmann distributions arising in molecular dynamics via normalizing flow approximation of transport maps. Here, we place this approach on a firm mathematical foundation. We prove the existence of a normalizing flow between the reference measure and the true Boltzmann distribution up to an arbitrarily small error in the Wasserstein distance. This result covers general Boltzmann distributions from molecular dynamics, which have low regularity due to the presence of interatomic Coulomb and Lennard-Jones interactions. The proof is based on a rigorous construction of the Moser transport map for low-regularity endpoint densities and approximation theorems for neural networks in Sobolev spaces.

Numerical simulations for a simple model system and for the alanine dipeptide molecule confirm that the true and generated distributions are close in the Wasserstein distance. Moreover we observe that the RealNVP architecture does not just successfully capture the equilibrium Boltzmann distribution but also the metastable dynamics.

URL: https://openreview.net/forum?id=4HLZD6LMuJ

---

Title: Systematic Evaluation of LoRA Adapter Placement and Rank Allocation for Resource-Constrained Fine-Tuning

Abstract: Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) enable adapting large language models (LLMs) under tight computational and data constraints. However, optimal adapter placement and rank allocation remain underexplored in ultra-low-resource regimes. Here we systematically evaluate LoRA configurations (attention-only, MLP-only, and combined placements with ranks 8 and 16) on the MBPP-mini code generation and GSM8K-mini math reasoning datasets using the Qwen2.5-0.5B-Instruct model with fixed training budgets. Our hardened evaluation protocol includes deterministic prompting, AST-based code extraction, and import-safe execution. Results show combined placement significantly improves code generation pass@1 accuracy (34% vs. 19%), while rank scaling often outperforms increased training steps for parameter efficiency. Hardware profiling reveals all configurations fit within 2.2GB GPU memory, enabling consumer-grade fine-tuning. These findings inform automated configuration selection, facilitating efficient, reliable adaptation of LLMs on minimal hardware for specialized tasks.
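The adapters being placed can be sketched as a frozen linear layer plus a trainable low-rank update; this is the standard LoRA formulation, with placement and rank being the knobs the study varies:

```python
import torch

class LoRALinear(torch.nn.Module):
    # y = W x + (alpha / r) * B A x, with W frozen and only A, B trained.
    def __init__(self, in_f, out_f, r=8, alpha=16):
        super().__init__()
        self.base = torch.nn.Linear(in_f, out_f, bias=False)
        self.base.weight.requires_grad_(False)              # frozen pretrained weight
        self.A = torch.nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_f, r))  # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(32, 32, r=8)
x = torch.randn(4, 32)
# With B initialized to zero, the adapter leaves the base model unchanged
# until fine-tuning begins; only 2 * r * 32 parameters are trainable.
```

Doubling the rank doubles the trainable parameter count per adapted matrix, which is the trade-off the rank-vs-steps comparison probes.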

URL: https://openreview.net/forum?id=KsPwgcQ60j

---

Title: Dual-Output LLM Explanation Evaluator: A Structured Multi-Metric Framework for Faithful Rationale Assessment (LREval)

Abstract: This study introduces LREval: Dual-Output LLM Explanation Evaluator, a structured multi-metric framework for assessing the faithfulness and quality of Large Language Models' (LLMs) rationales against persistent gaps in current explainability research. While LLMs demonstrate strong reasoning capabilities, quantifying the faithfulness of their dual outputs—both final answers and supporting explanations—remains challenging, particularly in domain-specific contexts like TruthfulQA and Climate-FEVER.

LREval establishes a novel dual-output evaluation paradigm, requiring models to generate both verdicts and corresponding natural language rationales for each input. We assess explanation quality across six complementary dimensions: Fidelity (semantic alignment), Simulatability (human-understandability), Completeness (comprehensiveness), Consistency (stability), Text Quality (coherence), and Robustness (perturbation resilience). By benchmarking LLM-generated explanations against curated, expert-verified ground truth rationales (multi-phrased via `|` separators capturing diverse valid reasoning paths), LREval enables quantitative measurement of explanation faithfulness using BERTScore, cosine similarity, and perturbation tests.

This framework provides a transparent, reproducible foundation for systematic LLM selection, improvement, and domain-adaptation based on how well their explanations align with specialized human reasoning expectations.

URL: https://openreview.net/forum?id=dvVSK9niyB

---

Title: Decomposable Neuro Symbolic Regression

Abstract: Symbolic regression (SR) models complex systems by discovering mathematical expressions that capture underlying relationships in observed data. However, most SR methods prioritize minimizing prediction error over identifying the governing equations, often producing overly complex or inaccurate expressions. To address this, we present a decomposable SR method that generates interpretable multivariate expressions leveraging transformer models, genetic algorithms (GAs), and genetic programming (GP). In particular, our explainable SR method distills a trained "opaque'' regression model into mathematical expressions that serve as explanations of its computed function. Our method employs a Multi-Set Transformer to generate multiple univariate symbolic skeletons that characterize how each variable influences the opaque model's response. We then evaluate the generated skeletons' performance using a GA-based approach to select a subset of high-quality candidates before incrementally merging them via a GP-based cascade procedure that preserves their original skeleton structure. The final multivariate skeletons undergo coefficient optimization via a GA. We evaluated our method on problems with controlled and varying degrees of noise, demonstrating lower or comparable interpolation and extrapolation errors compared to two GP-based methods, three neural SR methods, and a hybrid approach. Unlike these methods, our approach consistently learned expressions that matched the original mathematical structure. Similarly, our method achieved both a high symbolic solution recovery rate and competitive predictive performance relative to benchmark methods on the Feynman dataset.

URL: https://openreview.net/forum?id=54EL928uCf

---

Title: CANDERE-COACH: Reinforcement Learning from Noisy Feedback

Abstract: Reinforcement learning (RL) has been widely applied to many challenging tasks. However, in order to perform well, it requires access to a good reward function, which is often sparse or manually engineered with scope for error. Introducing human prior knowledge is often seen as a possible solution to this problem, as in imitation learning, learning from preferences, and inverse reinforcement learning. Learning from feedback is another framework that enables an RL agent to learn from binary evaluative signals describing the teacher's (positive or negative) evaluation of the agent's action. However, these methods often assume that evaluative teacher feedback is perfect, which is restrictive. In practice, such feedback can be noisy due to limited teacher expertise or other exacerbating factors like cognitive load, availability, and distraction. In this work, we propose the CANDERE-COACH algorithm, which is capable of learning from noisy feedback given by a sub-optimal teacher. We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to learn successfully even when up to 40\% of the teacher feedback is incorrect. Experiments on three common domains demonstrate the effectiveness of the proposed approach.

URL: https://openreview.net/forum?id=JgP7Wepetn

---

Title: EHR2Path: Scalable Modeling of Longitudinal Patient Pathways from Multimodal Electronic Health Records

Abstract: Forecasting how a patient’s condition is likely to evolve, including possible deterioration, recovery, treatment needs, and care transitions, could support more proactive and personalized care, but requires modeling heterogeneous and longitudinal electronic health record (EHR) data. Yet, existing approaches typically focus on isolated prediction tasks, narrow feature spaces, or short context windows, limiting their ability to model full patient pathways. To address this gap, we introduce EHR2Path, a multimodal framework for forecasting and simulating full in-hospital patient pathways from routine EHRs. EHR2Path converts diverse clinical inputs into a unified temporal representation, enabling modeling of a substantially broader set of patient information, including radiology reports, physician notes, vital signs, medication and laboratory patterns, and dense bedside charting. To support long clinical histories and broad feature spaces, we introduce a Masked Summarization Bottleneck that compresses long-term history into compact, task-optimized summary tokens while preserving recent context, improving both performance and token efficiency. In retrospective experiments on MIMIC-IV, EHR2Path enables next-step pathway forecasting and iterative simulation of complete in-hospital trajectories, while outperforming strong baselines on directly comparable tasks. These results establish a foundation for scalable pathway-level modeling from routine EHRs supporting anticipatory clinical decision-making. We will release our code upon acceptance.

URL: https://openreview.net/forum?id=ywa71iOykg

---

Title: Dynamic Regret with Untrusted Decision Predictions via Heterogeneous Expert Aggregation

Abstract: We study online convex optimization with dynamic regret, where the learner has access to untrusted decision predictions about the per-round minimizers. Existing methods either exploit only gradient feedback, achieving $O(\sqrt{T(1+P_T)})$ dynamic regret but remaining unable to benefit from predictions, or follow predictions blindly, obtaining regret proportional to the prediction error but with no worst-case safeguard. We propose a framework based on heterogeneous expert aggregation that simultaneously adapts to both the environment non-stationarity, characterized by path length $P_T$, and prediction quality, measured by cumulative error $\bar{E}_T$, without prior knowledge of either. The framework maintains a diverse pool of experts, which includes a gradient-based expert utilizing Online Gradient Descent, a prediction-based expert following predictions, and a new hybrid subroutine called Online Anchor Mirror Descent. These experts are aggregated by AdaHedge, whose small-loss property is critical to our results. We prove that our strongest variant achieves dynamic regret that smoothly interpolates between $O(GD\log\log T)$ when predictions are accurate and $O(R^*)$ when predictions are adversarial, where $R^*$ is the optimal prediction-free rate. The small-loss bound of AdaHedge ensures that the aggregation overhead depends on the best expert's loss rather than on $T$, enabling a qualitative improvement over the $\Omega(\sqrt{T})$ floor of prediction-free methods. We further introduce an instance-dependent refinement of the new hybrid subroutine that can strictly improve the guarantee on favorable trajectories. Experiments on synthetic benchmarks validate all theoretical predictions: our methods achieve near-constant regret under accurate predictions, degrade gracefully under adversarial predictions, and outperform baselines by up to $26\times$ in non-stationary environments.
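The aggregation layer can be illustrated with plain exponential weights, i.e. Hedge with a fixed learning rate; AdaHedge, used in the paper, additionally tunes this rate adaptively, which is what yields the small-loss property:

```python
import numpy as np

def hedge_weights(cum_losses, eta=1.0):
    # Exponential-weights distribution over experts given cumulative losses.
    z = np.exp(-eta * (cum_losses - cum_losses.min()))  # shift for numerical stability
    return z / z.sum()

# Two experts: a prediction-following expert with small losses on this
# (favorable) sequence, and a gradient-based fallback with larger losses.
losses = np.array([[0.1, 0.5],
                   [0.2, 0.4],
                   [0.0, 0.6]])
w = hedge_weights(losses.sum(axis=0))
# The learner plays the w-weighted combination of the experts' actions,
# so weight concentrates on whichever expert the environment favors.
```

Under accurate predictions the prediction-based expert accumulates little loss and dominates the mixture; under adversarial predictions the weight migrates to the prediction-free experts, which is the interpolation the regret bounds formalize.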

URL: https://openreview.net/forum?id=LWsEyfdnp9

---

Title: Dual-branch Graph Domain Adaptation for Cross-scenario Multi-modal Emotion Recognition

Abstract: Multimodal Emotion Recognition in Conversations (MERC) aims to predict speakers’ emotional states in multi-turn dialogues through text, audio, and visual cues. In real-world settings, conversation scenarios differ significantly in speakers, topics, styles, and noise levels. Existing MERC methods generally neglect these cross-scenario variations, limiting their ability to transfer models trained on a source domain to unseen target domains. To address this issue, we propose a Dual-branch Graph Domain Adaptation framework (DGDA) for multimodal emotion recognition under cross-scenario conditions. We first construct an emotion interaction graph to characterize complex emotional dependencies among utterances. A dual-branch encoder, consisting of a hypergraph neural network (HGNN) and a path neural network (PathNN), is then designed to explicitly model multivariate relationships and implicitly capture global dependencies. To enable out-of-domain generalization, a domain adversarial discriminator is introduced to learn invariant representations across domains. Furthermore, a regularization loss is incorporated to suppress the negative influence of noisy labels. To the best of our knowledge, DGDA is the first MERC framework that jointly addresses domain shift and label noise. Theoretical analysis provides tighter generalization bounds, and extensive experiments on IEMOCAP and MELD demonstrate that DGDA consistently outperforms strong baselines and better adapts to cross-scenario conversations. Our anonymous code is available at https://anonymous.4open.science/r/DGDA-Net-1A58.

URL: https://openreview.net/forum?id=HYd50uuxmL

---

Title: GE-FM: Geometry-aware Energy-based Flow Matching for Non Euclidean Manifolds

Abstract: Flow Matching has emerged as a powerful framework for generative transport and denoising, yet existing formulations are inherently Euclidean, neglecting the curved and time-evolving geometry of diffusion manifolds. Recent higher-order extensions seek to recover curved transport by explicitly modeling higher derivatives, but these approaches introduce instability and accumulate discretization error, particularly in few-step ODE sampling regimes. We propose a strictly first-order, energy-based flow matching framework that incorporates geometry through Christoffel-adjusted dynamics. Our method defines a total energy as the sum of kinetic energy induced by the predicted velocity field and a learned potential energy, and enforces approximate energy conservation along transport trajectories. Energy conservation encourages optimal low-energy denoising paths and yields a smoother optimization landscape, leading to faster and more stable convergence. Crucially, the energy formulation induces a time-dependent Riemannian metric that captures the evolving diffusion geometry without explicit manifold supervision. Christoffel symbols derived from this induced metric adjust the velocity field to account for curvature, implicitly modeling higher-order effects without introducing additional learnable dynamics. This geometric correction is manifold-agnostic and adapts automatically to the evolving diffusion structure. Empirically, our method outperforms existing few-step baselines, achieving improved performance in terms of both FID and mode coverage (roughly an 80\% improvement on synthetic spiral datasets). To our knowledge, this is the first geometry-aware flow matching framework that integrates energy conservation and Christoffel dynamics for stable curved generative transport.

URL: https://openreview.net/forum?id=Wb9wwUF7aV

---

Title: Batched Belief-State Planning and Partial Observation Inversion for Information Gathering POMDPs

Abstract: We present the inversion variational autoencoder ($\mathcal{I}$-VAE), a conditional generative model for efficient belief-state planning in partially observable sequential decision-making problems. The $\mathcal{I}$-VAE maps partial observations to stochastic posterior state samples by learning an observation-conditioned latent prior, enabling consistent belief updates without an explicit likelihood model. We further fine-tune the belief model with a trajectory-based mutual information objective to improve latent space consistency across observation sequences. To support scalable planning with these learned beliefs, we formulate the batched belief-state Markov decision process, which is designed to parallelize rollouts while preserving optimality in expectation. We analyze heuristic policies that maximize the expected entropy reduction of the updated belief and show that these heuristics result in the optimal one-step expected Bayesian information gain. Our approach is evaluated on a benchmark masked-pixel task and a real-world intrusion discovery task using indirect muon tomography data, showing improved estimation accuracy and planning efficiency over conventional methods.

URL: https://openreview.net/forum?id=K7MPwAV1AT

---

Title: Towards Bridging the Gap Between Offline and Iterative Alignment via Preference Distillation

Abstract: Direct preference optimization (DPO) is a promising offline approach for aligning large language models (LLMs) due to its simplicity, computational efficiency, and implicit modeling of human preferences. Interestingly, iterative extensions of DPO have achieved stronger performance on academic benchmarks, raising two key questions: (i) Why do iterative methods generally outperform offline ones? (ii) Can their advantages be incorporated into offline alignment? To answer the first question, our controlled experiments reveal that the \textit{explicit preference model}, additionally introduced in the iterative procedure, is a key factor behind its superiority over offline methods. This insight leads us to answer the second question affirmatively and propose Distilled Preference Probability Policy Optimization (DP3O), an effective and efficient offline alignment algorithm. DP3O first learns an explicit preference model using a helper class of LLMs and then distills its knowledge into policy optimization. Theoretically, we show that explicit preference modeling admits better estimation error control than implicit formulations, and that DP3O achieves a tighter generalization bound than hard-label DPO through variance reduction. Empirically, we evaluate DP3O on a wide range of chat-based and downstream tasks, showing that it outperforms state-of-the-art offline methods, achieves performance comparable to iterative DPO, and reduces training time by about $42\%$, demonstrating both its effectiveness and efficiency.
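The distillation idea can be sketched as a soft-label variant of the standard DPO loss on the implicit reward margin (illustrative only; the exact DP3O objective is the paper's, and the soft probability p here stands in for the distilled preference model's output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(margin, p=1.0, beta=0.1):
    """Preference loss on the implicit reward margin
    beta * [(log pi - log pi_ref)(y_w) - (log pi - log pi_ref)(y_l)].
    p = 1.0 recovers hard-label DPO; p in (0, 1) is a distilled soft
    preference probability, in the spirit of DP3O (a sketch, not the
    paper's exact objective)."""
    return -(p * np.log(sigmoid(beta * margin))
             + (1.0 - p) * np.log(sigmoid(-beta * margin)))

margins = np.array([-5.0, 0.0, 5.0])
hard = dpo_loss(margins, p=1.0)   # hard-label DPO
soft = dpo_loss(margins, p=0.8)   # soft (distilled) preference label
print(hard, soft)
```

Soft labels temper the loss on confidently separated pairs, which is one way to see the variance-reduction argument in the abstract.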

URL: https://openreview.net/forum?id=X6r1bU1m6x

---

Title: CURATE: Automatic Curriculum Learning for Reinforcement Learning Agents through Competence-Based Curriculum Policy Search in Structured Task Spaces

Abstract: Without informed priors or specialized algorithms, fundamental exploration challenges can leave agents unable to consistently receive informative rewards, leading to inefficient or intractable learning. To address these challenges, we introduce CURATE, an automatic curriculum learning algorithm for reinforcement learning agents in structured task spaces of monotonic difficulty. Through "exploration by exploitation," CURATE dynamically scales the task difficulty to match the agent's current competence. By exploiting its current capabilities that were learned in easier tasks, the agent improves its exploration in more difficult tasks. Our key insight is that the learning improvement in tasks that are close to those used for training is inversely proportional to their difficulty, and an agent that chooses a nearby distribution of the easiest unsolved tasks at any given time can automatically induce an easiest-to-hardest curriculum in these task spaces. To achieve this, CURATE conducts policy search in the task space to learn the best task distribution for training. As the agent's mastery grows, the learned curriculum adapts in an approximately easiest-to-hardest and task-directed fashion, efficiently culminating in a performant agent. Our experiments across three diverse domains (MiniGrid, Procgen, BipedalWalker) demonstrate that CURATE learns effective curricula for sample efficiency and generalization, matching or exceeding prior curriculum methods and yielding broadly capable agents.

URL: https://openreview.net/forum?id=DlnvWfoIgv

---

Title: Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Abstract: The growth of agentic AI has drawn significant attention to function calling Large Language Models (LLMs), which are designed to extend the capabilities of AI-powered systems by invoking external functions.
Injection and jailbreaking attacks have been extensively explored to showcase the vulnerabilities of LLMs to user prompt manipulation.
The expanded capabilities of agentic models introduce further vulnerabilities via their function calling interface.
Recent work in LLM security showed that function calling can be abused, leading to data tampering and theft, causing disruptive behavior such as endless loops, or causing LLMs to produce harmful content in the style of jailbreaking attacks.
This paper introduces a novel function hijacking attack (FHA) that manipulates the tool selection process of agentic models to force the invocation of a specific, attacker-chosen function. While existing attacks target the model's semantic preferences in function-calling tasks, we show that FHA is largely agnostic to the context semantics and robust to the choice of function set, making it applicable across diverse domains. We further demonstrate that FHA can be trained to produce universal adversarial functions, enabling a single attacked function to hijack tool selection across multiple queries and payload configurations. We conduct experiments on 5 different models, including instructed and reasoning variants, reaching attack success rates (ASR) of 70% to 100% on the established BFCL dataset.
Our findings further demonstrate the need for strong guardrails and security modules for agentic systems.

URL: https://openreview.net/forum?id=DqSQybYSKu

---

Title: Deep Models, Shallow Alignment: Uncovering the Granularity Mismatch in Neural Decoding

Abstract: Neural visual decoding is a central problem in brain–computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision, where deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains. Code is available at https://anonymous.4open.science/r/repo-7f41a8.

URL: https://openreview.net/forum?id=SzDHkICZpm

---

Title: Dynamic Reward Incentives for Emergent Cooperation under Changing Rewards

Abstract: Peer incentivization (PI) is a popular multi-agent reinforcement learning approach where all agents can reward or penalize each other to achieve cooperation in social dilemmas. Despite their potential for scalable cooperation, current PI methods heavily depend on fixed incentive values that need to be appropriately chosen with respect to the environmental rewards and thus are highly sensitive to their changes. Therefore, they fail to maintain cooperation under changing rewards in the environment, e.g., caused by modified specifications, varying supply and demand, or sensory flaws — even when the conditions for mutual cooperation remain the same. In this paper, we propose Dynamic Reward Incentives for Variable Exchange (DRIVE), an adaptive PI approach to cooperation in social dilemmas with changing rewards. DRIVE agents reciprocally exchange reward differences to incentivize mutual cooperation in a completely decentralized way. We show how DRIVE achieves mutual cooperation in the general Prisoner's Dilemma and empirically evaluate DRIVE in more complex sequential social dilemmas with changing rewards, demonstrating its ability to achieve and maintain cooperation, in contrast to current state-of-the-art PI methods.

URL: https://openreview.net/forum?id=9Ltu1HV2YI

---

Title: Accurate Probes Do Not Imply Causal Influence

Abstract: Linear probes are the standard tool for asking whether neural networks encode specific concepts, and high probe accuracy is routinely taken as evidence that a concept causally influences model predictions. We prove that this inference is unfounded: without additional structural assumptions, probe accuracy alone cannot identify causal importance. A perfectly accurate probe can detect a concept that is completely irrelevant to the model's output. We then identify four testable sufficient conditions, grounded in Pearl's causal identification framework, under which probe accuracy does reliably imply causal influence, and we derive universal false positive rate bounds that hold across all architectures. The non-identifiability is not a statistical limitation that more data can overcome. It persists even with perfect probes, unlimited samples, and arbitrary model capacity. Experiments on one thousand synthetic networks and MNIST confirm that all four identification conditions yield low estimation error, while violations produce errors exceeding forty percent. Our framework gives practitioners concrete criteria for determining when probing results support causal conclusions and when they do not.
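The core non-identifiability claim is easy to demonstrate on a toy "network" (a sketch of the phenomenon, not the paper's construction): a concept can be perfectly linearly decodable from the hidden state while having zero causal influence on the output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": hidden state h = (h1, h2); the output reads ONLY h1.
# The concept c is encoded solely in h2, so it is perfectly decodable
# yet causally irrelevant to the prediction.
n = 2000
c = rng.integers(0, 2, size=n)            # binary concept
h1 = rng.normal(size=n)                   # output-relevant feature
h2 = 2.0 * c - 1.0                        # concept-carrying feature
H = np.stack([h1, h2], axis=1)
model_out = (h1 > 0).astype(int)          # output ignores h2 entirely

# A linear probe: least-squares readout of the concept from H.
w, *_ = np.linalg.lstsq(H, 2.0 * c - 1.0, rcond=None)
probe_acc = np.mean((H @ w > 0) == (c == 1))

# Causal test: intervene on the concept feature and re-read the output.
H_flipped = H.copy()
H_flipped[:, 1] *= -1.0                   # flip the encoded concept
out_after = (H_flipped[:, 0] > 0).astype(int)
causal_effect = np.mean(model_out != out_after)

print(probe_acc, causal_effect)
```

The probe reaches essentially perfect accuracy while the intervention changes nothing, which is exactly the inference gap the paper formalizes.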

URL: https://openreview.net/forum?id=hn6szUk6es

---

Title: Task-Aware Model Merging via Fisher-Weighted Median

Abstract: Fine-tuning large language models provides strong in-domain performance but limits generalization and requires storage of many specialized models. Retraining a unified multitask model is often infeasible due to data unavailability or high computational cost. The majority of model merging approaches rely on performing arithmetic operations directly on model parameters. Although research in model merging has expanded significantly in recent years, two distinct approaches have become dominant: 1) techniques that mitigate interference from redundant parameters and sign conflicts, and 2) techniques that account for the varying sensitivity of individual parameters. However, these two approaches operate independently and remain disconnected from each other. In this work, we aim to unify these two well-established yet currently disconnected approaches by integrating insights from both. We propose DRIFT-MEDIAN, a Fisher-aware model merging method that assigns sensitivity-weighted task vectors using Fisher-weighted median aggregation, ensuring that task-relevant parameters dominate the merged model. Comprehensive experiments on several LLMs and CLIP models demonstrate that task-vector interference mitigation and parameter sensitivity are complementary, and that integrating both within DRIFT-MEDIAN leads to strong and consistent performance.
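A per-parameter Fisher-weighted median of task vectors can be sketched as follows (the numbers are made up for illustration; the paper's DRIFT-MEDIAN pipeline around this aggregation is not reproduced):

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median along axis 0 (i.e., per parameter coordinate)."""
    order = np.argsort(values, axis=0)
    v = np.take_along_axis(values, order, axis=0)
    w = np.take_along_axis(weights, order, axis=0)
    cdf = np.cumsum(w, axis=0) / np.sum(w, axis=0, keepdims=True)
    idx = np.argmax(cdf >= 0.5, axis=0)          # first index past half mass
    return np.take_along_axis(v, idx[None, :], axis=0)[0]

# Three task vectors (parameter deltas) with per-task diagonal Fisher
# weights; the second coordinate of task 2 is an outlier with low Fisher
# sensitivity, so it should not dominate the merge.
tau = np.array([[ 0.9,  0.1],
                [ 1.1, -2.0],     # outlier in the 2nd coordinate
                [ 1.0,  0.2]])
fisher = np.array([[1.0, 1.0],
                   [1.0, 0.1],    # task 2 is insensitive to coord 2
                   [1.0, 1.0]])

merged = weighted_median(tau, fisher)
print(merged)
```

Down-weighting the low-Fisher outlier keeps the merged second coordinate near the agreeing tasks, which is the intuition behind combining sensitivity weighting with a robust (median) aggregator.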

URL: https://openreview.net/forum?id=tB6bb0ZosX

---

Title: Depth Completion as Parameter-Efficient Test-Time Adaptation

Abstract: We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which risk degrading the pre-trained prior and limit generalization to new scenes or sensor configurations, CAPA freezes the FM backbone to preserve its strong geometric prior. It updates only a small set of parameters using Parameter-Efficient Fine-Tuning (e.g., LoRA or VPT), guided directly by gradients calculated from the sparse observations available at inference. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets.

URL: https://openreview.net/forum?id=71mdn3crgl

---

Title: Learnable Koopman-Enhanced Transformer-Based Time Series Forecasting with Spectral Control

Abstract: This paper proposes a unified family of learnable Koopman operator parameterizations that integrate linear dynamical systems theory with modern deep learning forecasting architectures. We introduce four learnable Koopman variants—scalar-gated, per-mode gated, MLP-shaped spectral mapping, and low-rank Koopman operators—which generalize and interpolate between strictly stable Koopman operators and unconstrained linear latent dynamics. Our formulation enables explicit control over the spectrum, stability, and rank of the linear transition operator while retaining compatibility with expressive nonlinear backbones such as PatchTST, Autoformer, and Informer. We evaluate the proposed operators in a large-scale benchmark that also includes LSTM, DLinear, and simple diagonal State-Space Models (SSMs), as well as lightweight transformer variants. Experiments across multiple horizons and patch lengths show that learnable Koopman models provide a favorable bias–variance trade-off, improved conditioning, and more interpretable latent dynamics. We provide a full spectral analysis, including eigenvalue trajectories, stability envelopes, and learned spectral distributions. Our results demonstrate that learnable Koopman operators are effective, stable, and theoretically principled components for deep forecasting.
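One plausible reading of the per-mode gated variant is a block-diagonal operator whose eigenvalue moduli are gated through a sigmoid, making stability hold by construction (a sketch under that assumption; the paper's exact parameterizations may differ):

```python
import numpy as np

def stable_koopman(raw_moduli, thetas):
    """Per-mode gated Koopman operator: block-diagonal 2x2 rotation
    blocks scaled by sigmoid-gated moduli, so every eigenvalue
    satisfies |lambda| < 1 by construction."""
    blocks = []
    for s, th in zip(raw_moduli, thetas):
        r = 1.0 / (1.0 + np.exp(-s))          # modulus in (0, 1)
        c, sn = np.cos(th), np.sin(th)
        blocks.append(r * np.array([[c, -sn], [sn, c]]))
    K = np.zeros((2 * len(blocks), 2 * len(blocks)))
    for i, B in enumerate(blocks):
        K[2*i:2*i+2, 2*i:2*i+2] = B
    return K

K = stable_koopman(raw_moduli=[2.0, -1.0], thetas=[0.3, 1.2])
eigvals = np.linalg.eigvals(K)

z = np.ones(4)
for _ in range(200):                          # latent rollout stays bounded
    z = K @ z
print(np.max(np.abs(eigvals)), np.linalg.norm(z))
```

Because the raw moduli are unconstrained learnable scalars, gradient descent can move eigenvalues freely inside the unit disk, which is the "explicit control over the spectrum and stability" the abstract refers to.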

URL: https://openreview.net/forum?id=3G9n71TlTK

---

Title: Regret Is Weighted Forgetting

Abstract: How much of an agent's regret comes from a bad representation, and how much from a bad policy? This paper gives an exact answer. For a fixed representation M and a finite evaluation distribution over history-test pairs, the minimum average normalized regret over all M-based policies equals the minimum margin-weighted deletion cost needed to make the optimal bet single-valued on each representation-test cell (M(h),T). A policy-wise decomposition then splits any actual policy's regret into irreducible aliasing cost plus avoidable within-cell misreporting. A Stack-Theoretic reformulation identifies the same quantity as a deficit in weighted weakness on a lifted task constructed from the evaluation support (where weakness is normally the degree to which a policy leaves open unseen diagnostic continuations). I use the identity to derive several direct corollaries, including a representation-convergence theorem in pure RL language, a regret-based partial order on abstractions, Lipschitz stability of K_ρ under margin estimation error, and connections to free energy and multi-agent coordination. A cross-framework corollary converts the regret floor into a generalisation probability. Under the canonical independent prior, the optimal M-based policy generalises with probability exp(−K_ρ(M)). The multi-class generalisation to K>2 diagnostic outcomes is proved. Controlled POMDP experiments confirm the decomposition is numerically exact and that K_ρ discriminates between representations where accuracy and raw impurity do not. The weakness-maximisation theorems predict optimal generalisation through least commitment, but their formal object (the extension of a policy in an embodied language) does not have a direct analogue in neural network function approximation. Bridging that gap is identified as an open problem.

URL: https://openreview.net/forum?id=bIQeQIOUnJ

---

Title: Scalable and Accurate CP Decomposition using Khatri-Rao Volume Sampling

Abstract: CP decomposition is a popular tool for analyzing the large volumes of multi-modal data generated in diverse applications ranging from analytical chemistry to modern computer vision. Despite being the workhorse algorithm for CP decomposition, alternating least squares (ALS) is computationally intensive. This is attributed to the high dimensionality of the underlying Khatri-Rao product regression problem in each iteration. This work proposes a fast sampling-based approach, called Khatri-Rao volume sampling (KRVS), to efficiently solve the Khatri-Rao product regression problem. The proposed KRVS-based solution is an unbiased estimate, and achieves a $(1+\epsilon)$ relative error approximation of the optimal least-squares solution. Building on the results of KRVS, a fast and accurate sampling-based algorithm called CP-KRVS is proposed for CP decomposition. Theoretical and numerical analysis reveals that the proposed KRVS approach achieves comparable accuracy at a faster runtime than exact leverage score sampling. Further, the KRVS approach has higher accuracy, at a slightly slower runtime, than implicit leverage score sampling. At higher sample sizes, the KRVS-based solution approximately converges with the optimal least-squares and ALS-based solutions for Khatri–Rao product regression and CP decomposition, respectively, while achieving a faster runtime.
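The subproblem being accelerated is least squares against a Khatri-Rao product, which row sampling can solve on a much smaller system (a sketch with plain uniform sampling on noiseless data; the paper's volume-sampling scheme and its guarantees are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Khatri-Rao product regression: min_x || (B (.) C) x - y ||, the core
# subproblem of ALS for CP decomposition.
I, J, R = 60, 50, 5
B, C = rng.normal(size=(I, R)), rng.normal(size=(J, R))
A = np.einsum('ir,jr->ijr', B, C).reshape(I * J, R)   # Khatri-Rao product
x_true = rng.normal(size=R)
y = A @ x_true                                        # noiseless targets

# Full least squares on all I*J rows.
x_full, *_ = np.linalg.lstsq(A, y, rcond=None)

# Sampled least squares on m << I*J rows.
m = 200
rows = rng.choice(I * J, size=m, replace=False)
x_samp, *_ = np.linalg.lstsq(A[rows], y[rows], rcond=None)

print(np.linalg.norm(x_samp - x_full))
```

The point of the sampled solve is that it never materializes or factors the full $IJ \times R$ system; smarter row distributions (leverage scores, or the proposed volume sampling) control the approximation error when the data are noisy.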

URL: https://openreview.net/forum?id=9JEX8ZEkVK

---

Title: QAT-SAM: Accurate Quantization for Segment Anything Model 2

Abstract: The Segment Anything Model 2 (SAM2) is a powerful foundation model for promptable segmentation. However, its high computational and memory costs are a major barrier to deployment on resource-constrained devices. In this paper, we present QAT-SAM, an accurate low-bit quantization method that achieves high compression and high fidelity. To address performance degradation arising from challenging weight and activation distributions during quantization, QAT-SAM introduces two novel contributions: Variance-Reduced Calibration (VRC), an initialization method that reduces weight statistical variance by minimizing the Frobenius norm over a small calibration batch; and Learnable Statistical Clipping (LSC), a Quantization-Aware Training (QAT) method that learns momentum-stabilized clipping factors to manage outliers in weights and activations. Comprehensive experiments demonstrate that QAT-SAM achieves highly accurate inference with substantial efficiency gains, significantly surpassing state-of-the-art general QAT schemes, particularly in the ultra-low 2-bit regime. Specifically, QAT-SAM achieves an accuracy gain of up to 9.7 ppt in J&F on the video segmentation benchmark and 7.3 ppt in mIoU for instance segmentation over the best competing QAT model, all while achieving an 8x reduction in model size compared to the BF16 baseline.
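Why clipping matters at 2 bits can be seen with a tiny symmetric fake-quantizer (the learnable, momentum-stabilized clipping of LSC is not reproduced; the fixed factor `alpha` is a simplified stand-in):

```python
import numpy as np

def fake_quant(w, bits, alpha):
    """Symmetric uniform fake-quantization with a clipping factor
    alpha in (0, 1] applied to the max-abs range -- a simplified
    stand-in for the learnable clipping the paper trains."""
    s = alpha * np.max(np.abs(w))            # clipped range
    levels = 2 ** (bits - 1) - 1             # 1 level per side at 2 bits
    q = np.clip(np.round(w / s * levels), -levels, levels)
    return q / levels * s

rng = np.random.default_rng(0)
w = rng.standard_t(df=3, size=10_000)        # heavy-tailed weights (outliers)

err_minmax = np.mean((w - fake_quant(w, 2, 1.0)) ** 2)   # min-max range
err_clip = np.mean((w - fake_quant(w, 2, 0.1)) ** 2)     # clipped range
print(err_minmax, err_clip)
```

With outliers, the min-max range collapses almost all weights onto the zero level, so sacrificing a few outliers to a tighter clipped range lowers the overall quantization error.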

URL: https://openreview.net/forum?id=ZPnmpLAsiT

---

Title: InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

Abstract: Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. This allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during training. In such novel physical environments, humans could adapt their physical reasoning based on provided demonstrations. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks do not evaluate inductive physical reasoning and only consider the parametric knowledge in LMMs. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs' ability to predict the outcome of collision events in algorithmically generated synthetic videos. By inspecting over 13 open-source and proprietary LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when the physical laws underlying inference scenarios were unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and may ignore the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.

URL: https://openreview.net/forum?id=TGmridJZQo

---

Title: Plant Phenotyping from Limited Training Data by Active Learning and Annotation

Abstract: AI-based plant phenotyping has significantly expanded the scope, scale, and speed of trait data collection. Phenotyping of epidermal cell patterning is a long-standing target of studies on carbon and water relations due to the important functions of stomata, bulliform cells and veins. Although stomatal density (SD) estimation has benefited from AI-based techniques, automated segmentation and analysis of costal zones and bulliform cell regions have received much less attention. One major bottleneck is that AI tools often require manual annotation of large datasets for training, which is labor-intensive and requires domain expertise. In addition, models trained on one species frequently perform poorly on others. In this paper, we propose an automated framework that enables the detection of stomata, costal zones, and bulliform cell regions, leveraging domain knowledge and minimizing the need for extensive training data. We extract features from both topographic and intensity data, separating them using knowledge of spatial structure, specifically the repeatable, linear, periodic arrangement of epidermal cells, and intrinsic cell models to minimize noise. We bootstrap learning by starting with the most structured parts of the images and progressively adding less structured regions. Furthermore, we incorporate active annotation to continually expand the training dataset throughout the learning process. We demonstrate the effectiveness of our method in two areas: (i) stomatal detection and (ii) detection of costal and bulliform zones. Through extensive quantitative and qualitative experimental results on three crop species: Setaria viridis, Sorghum bicolor, and Zea mays, we show that our approach outperforms state-of-the-art segmentation methods in terms of both precision and time efficiency.

URL: https://openreview.net/forum?id=HuV1cOnEXs

---

Title: A Framework for PAC-Bayes Derandomization with Applications to Majority Votes

Abstract: PAC-Bayes is a popular and efficient framework for obtaining generalization guarantees in situations involving uncountable hypothesis spaces. Unfortunately, in its classical formulation, it only provides guarantees on the expected risk of a randomly sampled hypothesis. This requires stochastic predictions at test time, making PAC-Bayes unusable in many practical situations where a single deterministic hypothesis must be deployed. We propose a unified framework to extract guarantees holding for a single hypothesis from PAC-Bayesian guarantees. We present a general oracle bound, and derive from it a numerical bound and a specialization to majority vote. We empirically show that our approach consistently outperforms popular baselines (by up to a factor of 2) when it comes to generalization bounds for single classifiers.

URL: https://openreview.net/forum?id=nq4nKKZjw8

---

Title: Forgetting of task-specific knowledge in model merging-based continual learning

Abstract: This paper investigates the linear merging of deep neural network models in the context of continual learning (CL).
Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.

URL: https://openreview.net/forum?id=AW7sLrBCJE

---

Title: Exponential Convergence of (Stochastic) Gradient Descent for Separable Logistic Regression

Abstract: Gradient descent and stochastic gradient descent are central to modern machine learning, yet their behavior under large step sizes remains theoretically unclear. Recent work suggests that acceleration often arises near the edge of stability, where optimization trajectories become unstable and difficult to analyze. Existing results for separable logistic regression achieve faster convergence by explicitly leveraging such unstable regimes through constant or adaptive large step sizes. In this paper, we show that instability is not inherent to acceleration. We prove that gradient descent with a simple, non-adaptive increasing step-size schedule achieves exponential convergence for separable logistic regression under a margin condition, while remaining entirely within a stable optimization regime. The resulting method is anytime and does not require prior knowledge of the optimization horizon or target accuracy. We also establish exponential convergence of stochastic gradient descent using a lightweight adaptive step-size rule that avoids line search and specialized procedures, improving upon existing polynomial-rate guarantees. Together, our results demonstrate that carefully structured step-size growth alone suffices to obtain exponential acceleration for both gradient descent and stochastic gradient descent.
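The flavor of the result shows up already in a one-dimensional separable problem: a non-adaptive, geometrically increasing step size drives the logistic loss down exponentially fast (an illustrative schedule and toy problem of our choosing, not the paper's proven schedule or margin condition):

```python
import numpy as np

# Linearly separable logistic regression in 1D: two points with margin 1.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(1)

def loss(w):
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

losses = [loss(w)]
for t in range(40):
    margins = y * (X @ w)
    # Gradient of the mean logistic loss: -mean of sigmoid(-margin) * y * x.
    grad = -(X * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).mean(axis=0)
    w -= (2.0 ** t) * grad         # geometrically increasing step size
    losses.append(loss(w))

print(losses[0], losses[-1])
```

The margin grows roughly linearly in the iteration count, so the loss decays like $e^{-ct}$, while with any constant step size it would only decay polynomially; the paper makes this precise and extends it to the stochastic setting.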

URL: https://openreview.net/forum?id=R5OaFwCmS0

---

Title: Quantification and Control of LSTM Resilience Based on Stability Theory

Abstract: This paper proposes a novel theoretical framework for guaranteeing and evaluating the resilience of long short-term memory (LSTM) networks in control systems. We introduce *recovery time* as a new metric of resilience in order to quantify the time required for an LSTM to return to its normal state after anomalous inputs. By mathematically refining incremental input-to-state stability ($\delta$ISS) theory for LSTM, we derive a practical data-independent upper bound on recovery time. This upper bound gives us resilience-aware training. Experimental validation on simple models demonstrates the effectiveness of our resilience estimation and control methods, enhancing a foundation for rigorous quality assurance in safety-critical AI applications.

URL: https://openreview.net/forum?id=hFmlMUNEsR

---

Title: Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

Abstract: Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption and observed data are consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalence classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a minimal set of diverse domains that satisfy this rank condition. PQAL can efficiently recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and a semi-synthetic dSprites dataset.

URL: https://openreview.net/forum?id=QFJuVreJDC

---

Title: Entropy Guided Semi-Supervised Graph Coarsening

Abstract: Graphs are foundational abstractions in data-intensive domains, yet the scale of modern datasets strains computation and memory for downstream learning. From recommender systems to biological networks, graphs have emerged as a fundamental substrate for learning. As graph sizes grow, the cost of training and inference becomes prohibitive, thereby necessitating compact surrogates that retain spectral properties and feature semantics.
We propose an entropy-regularized, semi-supervised framework for attributed graph coarsening that jointly leverages the original graph’s Laplacian, node features, and partially observed labels. Central to our approach is an information-theoretic regularizer that minimizes the per-supernode Shannon entropy of the node-profile matrix $\phi = C^TY$, encouraging label-coherent aggregations. We formulate a principled objective that balances structural fidelity and feature alignment, and we solve it using an efficient block MM/BSUM algorithm. We establish learning guarantees and show structural guarantees that control Dirichlet-energy distortion, preserve low-order spectral moments, and bound deviations in cut costs and effective resistances. Experiments on standard benchmarks (e.g., Cora, Citeseer, PubMed, coauthor-CS) demonstrate that our method produces a node-profile matrix with low row-wise entropy, in which nodes with the same label are grouped into the same supernode, and achieves competitive or superior node-classification accuracy and link-prediction performance across multiple GNN backbones (GCN, GAT, APPNP), while remaining computationally efficient.
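The regularized quantity $\phi = C^T Y$ and its per-supernode entropy can be computed directly (a tiny hand-built example for illustration; the coarsening matrix C would come from the paper's optimization):

```python
import numpy as np

# Per-supernode Shannon entropy of the node-profile matrix phi = C^T Y,
# the quantity the proposed regularizer drives down. C assigns n nodes
# to k supernodes; Y holds one-hot labels.
n, k, classes = 6, 2, 3
C = np.array([[1, 0], [1, 0], [1, 0],      # nodes 1-3 -> supernode 1
              [0, 1], [0, 1], [0, 1]], float)
Y = np.eye(classes)[[0, 0, 0, 1, 1, 2]]     # labels per node

phi = C.T @ Y                               # (k, classes) label counts
p = phi / phi.sum(axis=1, keepdims=True)    # per-supernode label distribution
entropy = -np.sum(p * np.log(np.where(p > 0, p, 1.0)), axis=1)
print(entropy)
```

Supernode 1 aggregates a single label (entropy 0), while supernode 2 mixes two labels (positive entropy); minimizing the row-wise entropy therefore pushes the coarsening toward label-pure supernodes.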

URL: https://openreview.net/forum?id=xZAwzxUP7c

---

Title: Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing

Abstract: Recent advances in large language models have highlighted the excessive quadratic cost of self-attention. Despite significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned, content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert-choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting $k$ tokens from a sequence of length $T$, MoSA reduces the computational complexity of each attention head from $O(T^2)$ to $O(k^2 + T)$. This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one that can outperform the dense baseline, sometimes with up to $27\%$ better perplexity for an identical compute budget. MoSA can also reduce resource usage compared to dense self-attention. Despite using a plain PyTorch implementation without an optimized kernel, perplexity-matched MoSA models are simultaneously faster in wall-clock time, require less memory for training, and drastically reduce the size of the KV-cache compared to dense transformer baselines.
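The $O(k^2 + T)$ decomposition above comes from an $O(T)$ routing pass followed by dense attention among only the $k$ selected tokens. A minimal single-head NumPy sketch, assuming a linear router (`Wr`) and standard projections (all names illustrative, not the paper's code):

```python
import numpy as np

def mosa_head(x, Wr, Wq, Wk, Wv, k):
    """One Mixture-of-Sparse-Attention head (illustrative sketch).

    x  : (T, d) token representations
    Wr : (d,)   router weights scoring every token      -> O(T)
    Wq, Wk, Wv : (d, d) projections, applied only to the k chosen tokens;
    attention then runs among those k tokens only       -> O(k^2)
    """
    T, d = x.shape
    scores = x @ Wr                        # content-based routing scores, O(T)
    idx = np.argsort(-scores)[:k]          # expert choice: the head picks its top-k tokens
    xs = x[idx]                            # (k, d)
    q, key, v = xs @ Wq, xs @ Wk, xs @ Wv
    att = q @ key.T / np.sqrt(d)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True) # softmax over the k selected tokens
    out = np.zeros_like(x)
    out[idx] = att @ v                     # unselected tokens receive no update
    return out, idx
```

Because the quadratic term depends only on $k$, shrinking $k$ frees compute that can be spent on additional, more specialized heads, which is the trade-off the abstract exploits. (The real method is causal and weights outputs by router scores; this sketch omits both.)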

URL: https://openreview.net/forum?id=HUpBs4TZkS

---

Title: LoRA as an Implicit KL Regularizer in GRPO Fine-Tuning: From Theory to Practice

Abstract: Low-Rank Adaptation (LoRA) is widely used for parameter-efficient reinforcement-learning fine-tuning of large language models (LLMs), often together with an explicit KL penalty toward a reference policy. In this work, we analyze how the low-rank constraint itself can restrict parameter trajectories during gradient descent and limit the resulting policy shift. We study the learning dynamics of LoRA updates and derive an explicit upper bound on the KL divergence between the reference and updated policies that depends on the adapter rank. Empirically, in Group Relative Policy Optimization (GRPO) fine-tuning of several LLM families on reasoning benchmarks, we observe that removing the explicit KL penalty yields similar evaluation accuracy while reducing training time by avoiding reference-policy evaluations. Our results provide theoretical grounding for KL-free fine-tuning with LoRA, maintaining reasoning performance while allowing for training speedups in practice.
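The structural fact underlying the abstract's rank-dependent KL bound is that the LoRA update $\Delta W = \frac{\alpha}{r} BA$ has rank at most $r$, so every gradient step moves the weights only inside an $r$-dimensional subspace. A minimal sketch (the scaling convention $\alpha/r$ is the common LoRA one; the bound itself is the paper's contribution, not reproduced here):

```python
import numpy as np

def lora_delta(A, B, alpha):
    """LoRA weight update Delta W = (alpha / r) * B @ A.

    A : (r, d_in), B : (d_out, r) adapter factors with rank r.
    The product has rank at most r, which is what confines the
    parameter trajectory (and hence the policy shift) during training.
    """
    r = A.shape[0]
    return (alpha / r) * B @ A

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
delta = lora_delta(A, B, alpha=1.0)
```

However large the entries of `A` and `B` grow, `delta` never leaves the rank-$r$ manifold, which is the mechanism that implicitly regularizes the fine-tuned policy toward the frozen reference.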

URL: https://openreview.net/forum?id=ZBdu9QjXFo

---

Title: Finite Sample Bounds for Non-Parametric Regression: Optimal Sample Efficiency and Space Complexity

Abstract: We address the problem of learning an unknown smooth function and its derivatives from noisy pointwise evaluations under the supremum norm. While classical nonparametric regression provides a strong theoretical foundation, traditional kernel-based estimators often incur high computational costs and memory requirements that scale with the sample size, limiting their utility in real-time applications such as reinforcement learning. To overcome these challenges, we propose a parametric approach based on a finite-dimensional representation that achieves minimax-optimal uniform convergence rates. Our method enables lightweight inference without storing all samples in memory. We provide sharp finite-sample bounds under sub-Gaussian noise, derive second-order Bernstein-type guarantees, and prove matching lower bounds, thereby confirming the optimality of our approach in both estimation error and memory efficiency.
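The memory claim in the abstract can be illustrated with a generic finite-dimensional estimator: keep only the $m \times m$ Gram matrix and $m$-vector of moments, so storage is independent of the sample count. This is a hedged stand-in using a polynomial basis, not the paper's specific representation:

```python
import numpy as np

class StreamingBasisRegressor:
    """Least squares in a fixed m-dimensional basis with O(m^2) memory.

    Only G = Phi^T Phi and b = Phi^T y are stored, so samples can be
    discarded after each update -- memory does not grow with n.
    """
    def __init__(self, degree):
        self.m = degree + 1
        self.G = np.zeros((self.m, self.m))
        self.b = np.zeros(self.m)

    def _features(self, x):
        # Monomial basis [1, x, x^2, ...]; the paper's basis may differ.
        return np.vander(np.atleast_1d(x), self.m, increasing=True)

    def update(self, x, y):
        P = self._features(x)
        self.G += P.T @ P                  # accumulate sufficient statistics
        self.b += P.T @ np.atleast_1d(y)

    def predict(self, x):
        w = np.linalg.solve(self.G + 1e-9 * np.eye(self.m), self.b)
        return self._features(x) @ w
```

Because the sufficient statistics are additive, batches arriving over time (e.g. in a reinforcement-learning loop) update the estimator in $O(m^2)$ per sample without revisiting old data.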

URL: https://openreview.net/forum?id=7AHO204EaZ

---

Title: StatQAT: Statistical Quantizer Optimization for Deep Networks

Abstract: Quantization is essential for reducing the computational cost and memory usage of deep neural networks, enabling efficient inference on low-precision hardware. Despite the growing adoption of uniform and floating-point quantization schemes, selecting optimal quantization parameters remains a key challenge, particularly for diverse data distributions encountered during training and inference. This work presents a novel statistical error analysis framework for uniform and floating-point quantization, providing theoretical insight into error behavior across quantization configurations. Building on this analysis, we propose iterative quantizers designed for arbitrary data distributions and analytic quantizers tailored for Gaussian-like weight distributions. These methods enable efficient, low-error quantization suitable for both activations and weights. We incorporate our quantizers into quantization-aware training and evaluate them across integer and floating-point formats. Experiments demonstrate improved accuracy and stability, highlighting the effectiveness of our approach for training low-precision neural networks.
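The gap between a naive max-based quantizer and an error-minimizing one can be seen with a small sketch. The grid search below is a simple stand-in for the paper's iterative quantizers (all names and the search strategy are illustrative), applied to a symmetric uniform integer format:

```python
import numpy as np

def quantize(x, scale, bits=8):
    """Symmetric uniform quantizer: round x/scale to the integer grid."""
    qmax = 2 ** (bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def best_scale(x, bits=8, n_grid=200):
    """Pick the scale minimising empirical squared error over a grid.

    A crude 'iterative quantizer for arbitrary distributions' in the
    spirit of the abstract; the paper's procedure is more principled.
    """
    cands = np.linspace(x.std() / 10, np.abs(x).max(), n_grid)
    errs = [np.mean((x - quantize(x, s, bits)) ** 2) for s in cands]
    return cands[int(np.argmin(errs))]
```

For Gaussian-like weights at low bit widths, clipping below the sample maximum trades rare large clipping errors for finer resolution in the bulk of the distribution, which is why a distribution-aware scale beats the max-based default.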

URL: https://openreview.net/forum?id=44TlErFmYO

---

Title: Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning

Abstract: Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, such as supervised fine-tuning on shorter traces, user-controlled budgets, or RL with uniform penalties, either require data curation or manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement-learning objective that tailors generation length to per-prompt solve rate. During training, ALP monitors each prompt's difficulty, measured by its online solve rate over multiple rollouts, and adds a differentiable penalty whose magnitude scales inversely with that difficulty, so easy prompts incur a high cost for extra tokens while hard prompts remain unhindered. Across model scales from 1.5B to 8B parameters, ALP cuts average token usage by up to 50% without significantly dropping performance. Relative to fixed-budget and uniform-penalty baselines, ALP redistributes its reduced budget more intelligently, cutting compute on easy prompts and reallocating the saved tokens to difficult ones, delivering higher accuracy on the hardest problems while using fewer tokens overall.
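The shaping described above can be sketched as a one-line reward adjustment. Taking difficulty as $1 - \text{solve rate}$, a penalty that scales inversely with difficulty grows with the solve rate; the coefficient `lam` and the linear form are illustrative assumptions, not the paper's exact objective:

```python
def alp_reward(base_reward, length, solve_rate, lam=1e-3):
    """Adaptive length penalty (sketch).

    solve_rate in [0, 1] is the prompt's online success rate over rollouts.
    Easy prompts (high solve_rate, low difficulty) pay a large per-token
    cost; hard prompts (low solve_rate) are barely penalised for length.
    """
    return base_reward - lam * solve_rate * length
```

At equal response length, the easy prompt's shaped reward drops more than the hard prompt's, so the policy learns to shorten easy answers while keeping long chains of thought where they still pay off.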

URL: https://openreview.net/forum?id=tJRTEZ4CJa

---

Title: BRIDLE: Generalized Self-supervised Learning with Quantization

Abstract: Self-supervised learning (SSL) has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
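The residual-quantization step at the core of BRIDLE is easy to sketch: each codebook quantizes the residual left by the previous stage, so the hierarchy of codes refines the reconstruction. A minimal nearest-neighbour version (illustrative only; the actual tokenizer is trained jointly with the encoder):

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Residual quantization with a list of (K, d) codebooks.

    Stage i encodes the residual x - reconstruction_so_far, yielding one
    discrete code per stage and an increasingly fine reconstruction.
    """
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)  # (n, K) distances
        idx = d2.argmin(1)                                     # nearest code vector
        codes.append(idx)
        recon += cb[idx]
        residual = x - recon
    return codes, recon
```

With a single codebook (plain VQ, as in BEATs) the reconstruction stops after the coarse stage; each additional codebook spends its capacity only on what the earlier stages missed, which is the fine-grained discretization the abstract credits for the improved representations.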

URL: https://openreview.net/forum?id=FWFc4rD2AS

---
