Daily TMLR digest for Dec 04, 2025

TMLR

Dec 4, 2025, 12:30:07 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: SIRE: SE(3) Intrinsic Rigidity Embeddings

Authors: Cameron Omid Smith, Basile Van Hoorick, Chonghyuk Song, Vincent Sitzmann, Vitor Campagnolo Guizilini, Yue Wang

Abstract: Motion serves as a powerful cue for scene perception and understanding by separating independently moving surfaces and organizing the physical world into distinct entities. We introduce SIRE, a self-supervised method for motion discovery of objects and dynamic scene reconstruction from casual scenes by learning intrinsic rigidity embeddings from videos. Our method trains an image encoder to estimate scene rigidity and geometry, supervised by a simple 4D reconstruction loss: a least-squares solver uses the estimated geometry and rigidity to lift 2D point track trajectories into SE(3) tracks, which are simply re-projected back to 2D and compared against the original 2D trajectories for supervision. Crucially, our framework is fully end-to-end differentiable and can be optimized either on video datasets to learn generalizable image priors, or even on a single video to capture scene-specific structure -- highlighting strong data efficiency. We demonstrate the effectiveness of our rigidity embeddings and geometry across multiple settings, including downstream object segmentation, SE(3) rigid motion estimation, and self-supervised depth estimation. Our findings suggest that SIRE can pave the way towards self-supervised learning of priors over geometry and motion rigidity from large-scale video data.
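
To make the 4D reconstruction loss concrete, here is a small numerical sketch of the general idea (my own illustration, not the authors' code): predicted depth lifts 2D tracks to 3D, a weighted least-squares (Kabsch-style) solve fits one SE(3) motion per rigid group, and the motion is re-projected to 2D and compared against the observed track. The camera intrinsics, the single rigid group, and the uniform rigidity weights are all illustrative assumptions.

import numpy as np

def unproject(uv, depth, K_inv):
    """uv: (N, 2) pixel coords, depth: (N,) -> (N, 3) camera-frame points."""
    homog = np.concatenate([uv, np.ones((len(uv), 1))], axis=1)
    return (K_inv @ homog.T).T * depth[:, None]

def fit_se3(P, Q, w):
    """Weighted least-squares SE(3) aligning P -> Q (weighted Kabsch solve)."""
    w = w / w.sum()
    mu_p, mu_q = w @ P, w @ Q
    H = (P - mu_p).T @ (np.diag(w) @ (Q - mu_q))
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_q - R @ mu_p

def project(P, K):
    uv = (K @ P.T).T
    return uv[:, :2] / uv[:, 2:3]

# Toy setup: one rigid group observed in two frames (illustrative values).
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
K_inv = np.linalg.inv(K)
rng = np.random.default_rng(0)
uv0 = rng.uniform([100, 100], [500, 400], size=(50, 2))
depth0 = rng.uniform(2.0, 5.0, size=50)                 # "predicted" depth, frame 0
P0 = unproject(uv0, depth0, K_inv)
P1 = P0 + np.array([0.1, 0.0, 0.05])                    # ground-truth rigid motion
uv1, depth1 = project(P1, K), P1[:, 2]                  # observed 2D track + "predicted" depth, frame 1
rigidity = np.ones(50)                                  # soft per-point rigidity weights

R, t = fit_se3(P0, unproject(uv1, depth1, K_inv), rigidity)   # lift to an SE(3) track
reproj = project(P0 @ R.T + t, K)                             # re-project back to 2D
loss = np.mean(rigidity[:, None] * (reproj - uv1) ** 2)       # 2D re-projection supervision
print(f"re-projection loss: {loss:.6f}")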

URL: https://openreview.net/forum?id=OZ9H0TOYMt

---

Title: A second-order-like optimizer with adaptive gradient scaling for deep learning

Authors: Jerome Bolte, Ryan Boustany, Edouard Pauwels, Andrei Purica

Abstract: In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory and compute requirements of standard DL methods such as AdamW or SGD. INNAprop is evaluated on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT. We also train GPT-2 (OpenWebText) from scratch and with LoRA fine-tuning (E2E). INNAprop consistently performs close to AdamW, and significantly better in our LLM training experiments, achieving faster convergence and higher accuracy with minimal hyperparameter tuning, even at large scale. Our code is public.

URL: https://openreview.net/forum?id=3khtiJDXQW

---

Title: Learning Task-Aware Abstract Representations for Meta-Reinforcement Learning

Authors: Louk van Remmerden, Zhao Yang, Shujian Yu, Mark Hoogendoorn, Vincent Francois-Lavet

Abstract: A central challenge in meta-reinforcement learning (meta-RL) is enabling agents trained on a set of environments to generalize to new, related tasks without requiring full policy retraining. Existing model-free approaches often rely on context-conditioned policies learned via encoder networks. However, these context encoders are prone to overfitting to the training environments, resulting in poor out-of-sample performance on unseen tasks. To address this issue, we adopt an alternative approach that uses an abstract representation model to learn augmented, task-aware abstract states. We achieve this by introducing a novel architecture that offers greater flexibility than existing recurrent network-based approaches. In addition, we optimize our model with multiple loss terms that encourage predictive, task-aware representations in the abstract state space. Our method simplifies the learning problem and provides a flexible framework that can be readily combined with any off-the-shelf reinforcement learning algorithm. We provide theoretical guarantees alongside empirical results, showing strong generalization performance across classical control and robotic meta-RL benchmarks, on par with state-of-the-art meta-RL methods and significantly better than non-meta RL approaches.

URL: https://openreview.net/forum?id=3CWyTh4hJ4

---

Title: State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

Authors: Xintong Duan, Yutong He, Fahim Tajwar, Wentse Chen, Ruslan Salakhutdinov, Jeff Schneider

Abstract: Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on successful trajectories generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

URL: https://openreview.net/forum?id=XB1dd01Ozz

---

Title: MDTree: A Masked Dynamic Autoregressive Model for Phylogenetic Inference

Authors: Zelin Zang, ChenRui Duan, Siyuan Li, Jinlin Wu, BingoWing-Kuen Ling, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li

Abstract: Phylogenetic tree inference requires optimizing both branch lengths and topologies, yet traditional MCMC-based methods suffer from slow convergence and high computational cost. Recent deep learning approaches improve scalability but remain constrained: Bayesian models are computationally intensive, autoregressive methods depend on fixed species orders, and flow-based models underutilize genomic signals. Fixed-order autoregression introduces an inductive bias misaligned with evolutionary proximity: early misplacements distort subsequent attachment probabilities and compound topology errors (exposure bias). Absent sequence-informed priors, the posterior over the super-exponential topology space remains diffuse and multimodal, yielding high-variance gradients and sluggish convergence for both MCMC proposals and neural samplers.
We propose MDTree, a masked dynamic autoregressive framework that integrates genomic priors into a Dynamic Ordering Network to learn biologically informed node sequences. A dynamic masking mechanism further enables parallel node insertion, improving efficiency without sacrificing accuracy. Experiments on standard benchmarks demonstrate that MDTree outperforms existing methods in accuracy and runtime while producing biologically coherent phylogenies, providing a scalable solution for large-scale evolutionary analysis.

URL: https://openreview.net/forum?id=dTSptQNygv

---

Title: Convergence of linear programming hierarchies for Gibbs states of spin systems

Authors: Hamza Fawzi, Omar Fawzi

Abstract: We consider the problem of computing expectation values of local functions under the Gibbs distribution of a spin system. In particular, we study two families of linear programming hierarchies for this problem. The first hierarchy imposes local spin flip equalities and has been considered in the bootstrap literature in high energy physics. For this hierarchy, we prove fast convergence under a spatial mixing (decay of correlations) condition. This condition is satisfied for example above the critical temperature for Ising models on a d-dimensional grid. The second hierarchy is based on a Markov chain having the Gibbs state as a fixed point and has been studied in the optimization literature and more recently in the bootstrap literature. For this hierarchy, we prove fast convergence provided the Markov chain mixes rapidly. Both hierarchies lead to an ε-approximation for local expectation values using a linear program of size quasi-polynomial in n/ε, where n is the total number of sites, provided the interactions can be embedded in a d-dimensional grid with constant d. Compared to standard Monte Carlo methods, an advantage of this approach is that it always (i.e., for any system) outputs rigorous upper and lower bounds on the expectation value of interest, without needing an a priori analysis of the convergence speed.

URL: https://openreview.net/forum?id=mc1dPxZsv3

---


New submissions
===============


Title: TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization

Abstract: Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training additional modules specifically for the task. We adopt a different strategy: we introduce a training-free approach that leverages Non-negative Matrix Factorization (NMF) to co-factorize audio and visual features from pre-trained models so as to reveal shared interpretable concepts. These concepts are passed on to an open-vocabulary segmentation model for precise segmentation maps. By using frozen pre-trained models, our method achieves high generalization and establishes state-of-the-art performance in unsupervised sound-prompted segmentation, significantly surpassing previous unsupervised methods.

URL: https://openreview.net/forum?id=Xt9sdzQQlJ

---

Title: COunterfactual Reasoning for Temporal EXplanations: Plausible and Robust Explanations for EEG-Based Seizure Detection

Abstract: Identifying the drivers of change in time-sensitive domains like healthcare is critical for reliable decision-making, yet explanations must account for both temporal dynamics and structural complexity. While counterfactual explanations are well-studied for static data, existing methods often fail in dynamic, spatio-temporal settings, producing implausible or temporally inconsistent explanations. To address this, we introduce COunterfactual Reasoning for Temporal EXplanations (CORTEX), a search-based explainer for multivariate time series modeled as spatio-temporal graphs, tailored to seizure detection from EEG recordings. CORTEX generates temporally robust and plausible counterfactuals by retrieving relevant past instances and sieving them via structural dissimilarity, temporal distance, and instability. Evaluated on clinical seizure detection data, CORTEX outperforms state-of-the-art methods with a $2.73\times$ improvement in validity and $5.32\times$ in fidelity, and achieves zero implausibility, demonstrating consistency and practical relevance. By shifting the focus from mere validity to plausible and time-consistent explanations, CORTEX enables more reliable and controllable counterfactual explanations.

URL: https://openreview.net/forum?id=FkHVmYnNS9

---

Title: On the Dynamics & Transferability of Latent Generalization during Memorization

Abstract: Deep networks have been known to have extraordinary generalization abilities, via mechanisms that are not yet well understood. It is also known that upon shuffling labels in the training data to varying degrees, deep networks, trained with standard methods, can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and typically comes at the cost of poorer generalization to true labels. Recent work has demonstrated, surprisingly, that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, several basic questions about this phenomenon of latent generalization remain poorly understood: (1) What is the origin and dynamics over training of latent generalization during memorization? Specifically, is it the case that model generalization and latent generalization use largely the same underlying mechanisms? (2) Is the specific nature of the probe critical for our ability to extract latent generalization from the model's layer-wise outputs? (3) Does there exist a way to immediately transfer latent generalization to model generalization by suitably modifying model weights directly? On the one hand, this question is conceptually important because it establishes conclusively that the latent generalization manifested by the probe is also within reach of the model, with exactly the information that the model was provided during training, namely the corrupted training data. On the other hand -- and more pragmatically -- it also suggests the possibility of "repairing" a trained model that has memorized, without requiring expensive retraining from scratch. To address (1), we track the training dynamics, empirically, and find that latent generalization abilities largely peak early in training, together with model generalization, suggesting a common origin for both. However, while model generalization degrades steeply over training thereafter, latent generalization falls more modestly and plateaus at a higher level over epochs of training. These experiments lend circumstantial evidence to the hypothesis that latent generalization uses largely similar mechanisms as those that underlie the model's generalization in the early phases of training. To investigate (2), we examine the MASC probe and show that it is a quadratic classifier. The question in (2) thus becomes whether the quadratic nature of the MASC probe underlies its remarkable effectiveness in extracting latent generalization. If this were so, a linear probe constructed along these lines would not be as effective. To investigate this, we designed a new linear probe for this setting, and find, surprisingly, that it has superior generalization performance in comparison to the quadratic probe, in most cases. This suggests that the quadratic nature of the probe is not critical in extracting latent generalization. Importantly, the effectiveness of the linear probe enables us to answer (3) in the affirmative. Specifically, using this new linear probe, we devise a way to transfer the latent generalization present in last-layer representations to the model by directly modifying the model weights. This immediately endows such models with improved generalization, i.e., without additional training.
Our findings provide a more detailed account of the rich dynamics of latent generalization during memorization, provide clarifying insight on the specific role of the probe in latent generalization, as well as demonstrate the means to leverage this understanding to directly transfer this generalization to the model.

URL: https://openreview.net/forum?id=t024Zm0tKF

---

Title: Networked Communication for Decentralised Cooperative Agents in Mean-Field Control

Abstract: The mean-field framework has been used to find approximate solutions to problems involving very large populations of symmetric, anonymous agents, which may be intractable by other methods. The cooperative mean-field control (MFC) problem has received less attention than the non-cooperative mean-field game (MFG), despite the former potentially being more useful as a tool for engineering large-scale collective behaviours. Decentralised communication algorithms have recently been introduced to MFGs, giving benefits to learning speed and robustness. Inspired by this, we introduce networked communication to MFC - where populations arguably have broader incentive to communicate - and in particular to the setting where decentralised agents learn online from a single, non-episodic run of the empirical system. We adapt recent MFG algorithms to this new setting, as well as contributing a novel sub-routine allowing networked agents to estimate the global average reward from their local neighbourhood. Previous theoretical analysis of decentralised communication in MFGs does not extend trivially to MFC. We therefore contribute new theory proving that in MFC the networked communication scheme allows agents to increase social welfare faster than under *both* of the two typical alternative architectures, namely independent and centralised learning. We provide experiments that support this new result across different classes of cooperative game, and also give numerous ablation studies and additional experiments concerning the number of communication rounds and robustness to communication failures.

URL: https://openreview.net/forum?id=qCTg7Dv0DT

---

Title: Representation Similarity Reveals Implicit Layer Grouping in Neural Networks

Abstract: Providing human-understandable insights into the inner workings of neural networks is an important step toward achieving more explainable and trustworthy AI. Analyzing representations across neural layers has become a widely used approach for this purpose in various applications. In this work, we take a step toward a holistic understanding of neural layers by investigating the existence of distinct layer groupings within them. Specifically, we explore using representation similarity within neural networks to identify clusters of similar layers, revealing potential layer groupings. We achieve this by proposing, for the first time to our knowledge, the use of Gromov-Wasserstein distance, which overcomes challenges posed by varying distributions and dimensionalities across intermediate representations--issues that complicate direct layer-to-layer comparisons.
On algebraic, language, and vision tasks, we observe the emergence of layer groups that correspond to functional abstractions within networks. These results reveal an implicit layer structure pattern and suggest that network computations may exhibit abrupt shifts rather than smooth transitions. Through downstream applications of model compression and fine-tuning, we validate our measure and further show that the proposed approach offers meaningful insights into the internal behavior of neural networks.
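
As a rough illustration of the measurement (my own sketch, not the authors' pipeline; the POT library, Euclidean distances, and uniform sample weights are assumptions), two layers can be compared on the same inputs via the Gromov-Wasserstein distance between their intra-layer pairwise-distance matrices, which sidesteps the mismatch in dimensionality between layers:

import numpy as np
import ot  # POT: pip install pot
from scipy.spatial.distance import cdist

def layer_gw_distance(X1, X2):
    """X1: (n, d1), X2: (n, d2) activations of the same n inputs at two layers."""
    C1 = cdist(X1, X1)                        # (n, n) pairwise distances, layer 1
    C2 = cdist(X2, X2)                        # (n, n) pairwise distances, layer 2
    C1, C2 = C1 / C1.max(), C2 / C2.max()     # scale for comparability
    p = q = np.full(len(X1), 1.0 / len(X1))   # uniform weights over samples
    return ot.gromov.gromov_wasserstein2(C1, C2, p, q, 'square_loss')

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(128, 512))          # stand-ins for one layer's activations
acts_b = acts_a @ rng.normal(size=(512, 64))  # a "nearby" layer (different dimension)
print("GW distance between layers:", layer_gw_distance(acts_a, acts_b))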

URL: https://openreview.net/forum?id=V91vAkesm7

---

Title: Interpreting Kolmogorov-Arnold Networks in Neuroimaging: A Path-Based Attribution Framework

Abstract: Explainability aspects of most classification models are learnt through instance-specific analysis. However, in understanding diseases, it is important to consider population-wide analysis in order to identify affected regions that are consistently seen across cohorts of the diseased population. In this study, we report the utility of Kolmogorov-Arnold Networks (KANs) in understanding population-wide characteristics seen in subjects affected by Alzheimer’s disease (AD). KANs offer enhanced interpretability through learnable activation functions on network edges. Thus, the learned functions reflect the characteristics of the entire span of training data. In a KAN network trained for classification, attributions through the network can be traced to understand how specific inputs influence the output label. In this study, we propose a path-based attribution framework that generates global importance maps by tracing exhaustive information flow through all potential paths. Our method scores edges using L2 norms of the learned spline and base functions. Subsequently, these scores are propagated through the network to compute path-attributions. This approach scales linearly with network depth, depends only on model training, and requires no further post-hoc analysis of the data. Evaluations on three public AD neuroimaging datasets (OASIS, ADNI, and Mendeley, comprising 7428 acquisitions in total) were carried out on 3D brain volumes as well as 2D brain slices. The corresponding KAN test accuracies are $93.24\%$, $81.85\%$, and $91.25\%$ on the OASIS, ADNI, and Mendeley datasets, respectively. We also demonstrate improved performance on metrics such as Insertion AUC, Deletion AUC, and Sufficiency. The generated attribution maps identify clinically meaningful regions including the body and genu of corpus callosum, corona radiata, bilateral caudate nuclei, medial prefrontal cortex and temporal lobe structures, aligned with established AD pathology literature. By providing voxel-level global attributions as network-intrinsic properties, our framework addresses a critical gap in medical AI interpretability and supports clinical validation of AI-assisted AD diagnosis systems.

URL: https://openreview.net/forum?id=cPtKpNdYc2

---

Title: Controlling Coverage of Uncertainty Sets for Batch Evaluation via Vanilla Conformal Prediction

Abstract: Conformal prediction (CP) provides provable coverage guarantees over uncertainty sets for any given black-box predictive model. The standard split CP guarantees that for a single test input, the uncertainty set contains the true output with a user-specified probability $1 - \alpha$ (say 90\%). However, in many real-world applications, practitioners evaluate the predictive model on a batch of test inputs after calibration on a fixed set. The marginal coverage guarantee of split CP does not say anything directly about the realized false-coverage proportion (FCP) across a batch of inputs. This paper develops a novel approach referred to as {\em Probably Approximately Correct FCP (PAC-FCP)}. PAC-FCP leverages the key insight that the FCP over a batch of test inputs from split CP follows a Beta-Binomial distribution, and inverts the Beta-Binomial tail to find the minimum nominal level that yields an FCP guarantee using vanilla CP methods. We provide theoretical analysis for the validity and effectiveness of PAC-FCP building on prior theoretical results. Our experimental results on 17 OpenML benchmarks for regression and ImageNet data for classification demonstrate that PAC-FCP achieves the specified FCP rate with smaller prediction sets/intervals.
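
A small sketch of what such a tail inversion could look like (my reading of the stated insight, not the paper's procedure; the parameterization l = floor(alpha * (n + 1)) and the monotonicity shortcut are assumptions): the number of miscovered points in a batch of m, after split-CP calibration on n points, is modeled as BetaBinom(m, l, n + 1 - l), and we keep the largest miscoverage budget l whose FCP tail stays below delta.

import numpy as np
from scipy.stats import betabinom

def pac_fcp_level(n_cal, m_batch, gamma, delta):
    """Largest budget l (and its nominal alpha) with P(FCP > gamma) <= delta."""
    k_max = int(np.floor(gamma * m_batch))           # allowed miscoverages in the batch
    best = None
    for l in range(1, n_cal + 1):                    # l corresponds to floor(alpha * (n_cal + 1))
        tail = betabinom.sf(k_max, m_batch, l, n_cal + 1 - l)   # P(#errors > k_max)
        if tail <= delta:
            best = l                                 # tail grows with l, so keep going
        else:
            break
    if best is None:
        raise ValueError("no nominal level satisfies the FCP requirement")
    return best, best / (n_cal + 1)

l, alpha_nominal = pac_fcp_level(n_cal=1000, m_batch=200, gamma=0.10, delta=0.05)
print(f"calibrate split CP at nominal alpha ~ {alpha_nominal:.4f} (l = {l})")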

URL: https://openreview.net/forum?id=H1dE34hmHA

---

Title: Uncertainty Regions for Multi-Target Regression via Input-Dependent Conformal Calibration

Abstract: We consider the problem of provable and effective uncertainty quantification (UQ) for multi-target regression tasks where we need to predict multiple related target variables. This is important in many safety-critical applications in domains including healthcare, engineering, and finance. Conformal prediction (CP) is a promising framework for calibrating predictive models for UQ with guaranteed finite sample coverage. There is relatively less work on multi-target CP compared to single-target CP, and existing methods tend to produce large prediction regions that are not useful in real-world applications. This paper proposes a novel approach referred to as {\em Adaptive Prediction Regions (APR)} to produce provably smaller prediction regions by exploiting heterogeneity in the input data. APR is inspired by the principle behind localized CP for single-target \cite{guan2023localized} and extends it to multi-target settings. The key idea behind APR is to perform adaptive calibration by assigning differential weights to multi-dimensional calibration examples based on their similarity to a test input. We theoretically analyze APR and show that it (a) achieves finite-sample coverage guarantees; and (b) constructs smaller prediction regions. Our experiments on diverse real-world datasets with various numbers of targets show that APR outperforms existing methods by producing significantly smaller prediction regions (achieving up to 85.51\% reduction in region area) over state-of-the-art multi-target CP methods.

URL: https://openreview.net/forum?id=O0AXPvbqG9

---

Title: PriSM: Prior-Guided Search Methods for Query Efficient Black-Box Attacks

Abstract: Deep Neural Networks are vulnerable to adversarial examples in black-box settings, requiring query-efficient attack methods. We propose PriSM (Prior-Guided Search Methods), which systematically exploits two types of transferable surrogate information: decision boundary geometry and loss landscape topography. We demonstrate their utility through complementary attacks: (1) TGEA leverages boundary geometry to initialize evolutionary optimization with surrogate-evolved populations, maximizing attack success rates, and (2) SGSA leverages loss topography via multi-scale saliency guidance to direct Square Attack's perturbations, minimizing query costs. Across MNIST, CIFAR-10, and ImageNet, both methods achieve 30-60% query reductions compared to uninformed baselines, while also being competitive with state-of-the-art hybrid attacks. Our evaluation reveals a strategic trade-off: SGSA excels in query efficiency through local exploitation, whereas TGEA maximizes success rates via global exploration. Our comprehensive evaluation also demonstrates that different types of surrogate information require matched exploitation strategies, providing practical guidance for query-efficient black-box attacks.

URL: https://openreview.net/forum?id=UQsOh2kfhP

---

Title: Improved Sample Complexity Bounds For Diffusion Model Training Without Empirical Risk Minimizer Access

Abstract: Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\mathcal{O}(\epsilon^{-4})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result that achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.

URL: https://openreview.net/forum?id=CFdNqqlqOv

---

Title: Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

Abstract: Approximate Bayesian Inference typically revolves around computing the posterior parameter distribution. The main practical interest, however, often lies in a model’s predictions rather than its parameters. In this work, we propose to bypass the posterior, focusing directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model’s predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: if a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows plugging in different prior specifications, enabling classical Bayesian sensitivity analysis (w.r.t. the choice of prior). In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically by employing it in models ranging from Bayesian linear models to Bayesian neural networks. Our approximations outperform classical Laplace approximations on a wide array of both simulated and real-world datasets.

URL: https://openreview.net/forum?id=T8w8L2t3JG

---

Title: A Systematic Study of Model Merging Techniques in Large Language Models

Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.
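
For reference, Task Arithmetic itself reduces to adding the sum of task vectors (fine-tuned minus base weights) back onto the base model. A minimal sketch follows; the checkpoint names and the scaling coefficient are illustrative, not values from the paper.

import torch

def task_arithmetic_merge(base_state, finetuned_states, scaling=0.3):
    """merged = base + scaling * sum_i (finetuned_i - base)."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for ft in finetuned_states:
        for k in merged:
            merged[k] += scaling * (ft[k] - base_state[k])   # add each task vector
    return merged

# Usage (hypothetical checkpoint files):
# base = torch.load("base.pt")
# fts = [torch.load(f"ft_{i}.pt") for i in range(12)]
# merged = task_arithmetic_merge(base, fts, scaling=0.3)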

URL: https://openreview.net/forum?id=6zSIyrqS7J

---

Title: Domain-Invariant Hyperbolic Distillation for Robust Medical Image Analysis

Abstract: Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the superiority of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT backbones. We further propose an unsupervised, domain-invariant hyperbolic distillation strategy. Extensive experiments confirm that our hyperbolic distillation learns domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of $+2.1\%$ AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-Wilds, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across severely different conditions. The code will be released upon acceptance.

URL: https://openreview.net/forum?id=1spGpYmDjy

---

Title: Communication-Efficient Adaptive Federated Bi-level Optimization with Data and System Heterogeneity

Abstract: Bilevel optimization is a popular nested optimization model in machine learning. Federated bilevel optimization, which extends bilevel optimization to the Federated Learning setting, faces challenges such as complex nested sub-loops, high communication overhead, and a lack of adaptive mechanisms. To address these issues, this paper proposes an Adaptive Single-loop Federated Bilevel Optimization algorithm (ASFBO) in the presence of both data heterogeneity (Non-IID client data) and system heterogeneity (partial client participation per round and varying numbers of local iterations). By replacing nested sub-iterations with a single-loop architecture, ASFBO significantly reduces communication frequency and computational costs. It employs multiple adaptive learning rate variables to dynamically adjust the step sizes of upper-level variable updates, thereby speeding up the algorithm's convergence. Furthermore, a locally accelerated version of the algorithm (LA-ASFBO) that incorporates momentum-based variance reduction techniques is proposed to mitigate hyper-gradient estimation bias across distributed nodes effectively. Theoretical analysis shows that, under the classic setting of a non-convex upper-level and strongly convex lower-level, ASFBO and LA-ASFBO achieve convergence to an $\epsilon$-stationary point with only $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication complexity. Experiments on federated hyper-representation learning tasks demonstrate the superiority of the proposed algorithm.

URL: https://openreview.net/forum?id=f9LWE2bA4R

---

Title: Leveraging Multimodal LLM Descriptions of Activity for Explainable Semi-Supervised Video Anomaly Detection

Abstract: Existing semi-supervised video anomaly detection (VAD) methods often struggle with detecting complex anomalies involving object interactions and generally lack explainability. To overcome these limitations, we propose a novel VAD framework leveraging Multimodal Large Language Models (MLLMs). Unlike previous MLLM-based approaches that make direct anomaly judgments at the frame level, our method focuses on extracting and interpreting object activity and interactions over time. By querying an MLLM with visual inputs of object pairs at different moments, we generate textual descriptions of the activity and interactions from nominal videos. These textual descriptions serve as a high-level representation of the activity and interactions of objects in a video. They are used to detect anomalies during test time by comparing them to textual descriptions found in nominal training videos. Our approach inherently provides explainability and can be combined with many traditional VAD methods to further enhance their interpretability. Extensive experiments on benchmark datasets demonstrate that our method not only detects complex interaction-based anomalies effectively but also achieves state-of-the-art performance on datasets without interaction anomalies.

URL: https://openreview.net/forum?id=dfc2HpDSlH

---

Title: Designing Preconditioners for SGD: Local Conditioning, Noise Floors, and Basin Stability

Abstract: Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate–floor behavior.
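
A tiny numerical sketch of the rate/noise-floor effect (my own toy example, not the paper's experiments): preconditioned SGD, x_{k+1} = x_k - eta * M^{-1} (grad + noise), on an anisotropic quadratic, comparing the late-stage loss floor with and without a diagonal curvature-aware M. The Hessian, step size, and noise level are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
H = np.diag([100.0, 1.0])                        # ill-conditioned quadratic 0.5 * x^T H x

def noise_floor(M_inv, eta=0.01, sigma=1.0, steps=2000):
    x = np.array([5.0, 5.0])
    losses = []
    for _ in range(steps):
        g = H @ x + sigma * rng.normal(size=2)   # stochastic gradient
        x = x - eta * (M_inv @ g)                # preconditioned step
        losses.append(0.5 * x @ H @ x)
    return np.mean(losses[-200:])                # late-stage loss ("noise floor")

print("floor, plain SGD (M = I):   ", noise_floor(np.eye(2)))
print("floor, diagonal curvature M:", noise_floor(np.diag(1.0 / np.diag(H))))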

URL: https://openreview.net/forum?id=vo8FOBt6f6

---

Title: Models with a Cause: Causal Discovery with Language Models on Temporally Ordered Text Data

Abstract: While language models (LMs) have been proposed for causal discovery tasks, it remains unclear whether they possess the inductive biases necessary to identify causal structures in token generation processes. We investigate whether LMs can learn the causal structure governing how tokens depend on their predecessors by testing if they possess the temporal and statistical properties required for causal discovery. We prove that existing algorithms can recover a unique causal model when token sequences satisfy standard causal assumptions and have temporal ordering. LMs' sequential processing and positional encodings enable them to leverage this temporal information. Using controlled experiments on synthetic data generated by mixtures of Markov chains, we test whether LMs learn conditional independencies and Markov exchangeability properties necessary for causal discovery. We find that transformers successfully learn these properties, achieving this not by approximating exact probability distributions but by learning qualitative probability rankings. These synthetic experiments provide initial evidence that LMs possess inductive biases suitable for discovering token-level causal structures.

URL: https://openreview.net/forum?id=YJddclPGuY

---

Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning

Abstract: Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Fed-SB, a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix ($R$) between adapters $B$ and $A$, keeping other components fixed. Direct averaging of $R$ guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB offers a state-of-the-art, efficient, and scalable solution for both private and non-private federated fine-tuning. Our code is available anonymously at: https://anonymous.4open.science/r/fed-sb-anonymous-EF55.
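
A hedged sketch of the communication pattern described above (shapes, the placeholder local update, and the uniform average are illustrative assumptions, not the authors' implementation): each client trains only the small r x r matrix R in W + B @ R @ A, and the server aggregates by averaging R exactly, so per-round communication is independent of model size.

import numpy as np

d_out, d_in, r, n_clients = 768, 768, 16, 8
rng = np.random.default_rng(0)
B = rng.normal(size=(d_out, r)) / np.sqrt(r)     # fixed, shared across clients
A = rng.normal(size=(r, d_in)) / np.sqrt(d_in)   # fixed, shared across clients

def client_update(R, lr=1e-2, steps=10):
    """Stand-in for local training: returns the client's updated R."""
    for _ in range(steps):
        grad = rng.normal(size=R.shape)           # placeholder for the true gradient
        R = R - lr * grad
    return R

R_global = np.zeros((r, r))
client_Rs = [client_update(R_global.copy()) for _ in range(n_clients)]
R_global = np.mean(client_Rs, axis=0)             # exact aggregation: only r*r floats per client
delta_W = B @ R_global @ A                        # merged low-rank update
print("communicated floats per client per round:", r * r)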

URL: https://openreview.net/forum?id=87UyFEhzyP

---

Title: Sparse Mean Estimation in Adversarial Settings via Incremental Learning

Abstract: In this paper, we study the problem of sparse mean estimation under adversarial corruptions, where the goal is to estimate the $k$-sparse mean of a heavy-tailed distribution from samples contaminated by adversarial noise. Existing methods face two key limitations: they require prior knowledge of the sparsity level $k$ and scale poorly to high-dimensional settings. We propose a simple and scalable estimator that addresses both challenges. Specifically, it learns the $k$-sparse mean without knowing $k$ in advance and operates in near-linear time and memory with respect to the ambient dimension. Under a moderate signal-to-noise ratio, our method achieves the optimal statistical rate, matching the information-theoretic lower bound. Extensive simulations corroborate our theoretical guarantees.
At the heart of our approach is an incremental learning phenomenon: we show that a basic subgradient method applied to a nonconvex two-layer formulation with an $\ell_1$-loss can incrementally learn the $k$ nonzero components of the true mean while suppressing the rest. More broadly, our work is the first to reveal the incremental learning phenomenon of the subgradient method in the presence of heavy-tailed distributions and adversarial corruption.
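
One plausible instantiation of the described two-layer l1 formulation (hypothetical details: the Hadamard parameterization, small initialization, and step size are my choices, not necessarily the paper's): parameterize the mean as theta = u * v, start near zero, and run the plain subgradient method on the l1 loss; the nonzero coordinates grow first, giving the incremental-learning behavior.

import numpy as np

rng = np.random.default_rng(0)
d, n, k = 200, 300, 5
true_mean = np.zeros(d); true_mean[:k] = 5.0
X = true_mean + rng.standard_t(df=3, size=(n, d))        # heavy-tailed samples
X[:15] += rng.normal(scale=50.0, size=(15, d))           # a few adversarially corrupted rows

u = np.full(d, 1e-3); v = np.full(d, 1e-3)                # small, balanced initialization
lr = 0.01
for _ in range(2000):
    theta = u * v
    g = np.mean(np.sign(theta - X), axis=0)              # subgradient of the l1 loss w.r.t. theta
    u, v = u - lr * g * v, v - lr * g * u                 # chain rule through theta = u * v

theta = u * v
print("recovered support:", np.sort(np.argsort(-np.abs(theta))[:k]))
print("estimate on support:", np.round(theta[:k], 2))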

URL: https://openreview.net/forum?id=S3e7ikEZfg

---

Title: Juliet: Per-Sample Conditional Branching for Efficient Convolutional Networks

Abstract: We introduce Juliet, a dynamic, trie-augmented neural architecture that improves the efficiency of convolutional neural networks by routing each input through learned per-node branches while growing and pruning capacity on the fly. Each node pairs a lightweight sub-module with a transformer-based path selector trained end-to-end; growing and pruning based on exponential moving average (EMA) usage let the model expand or contract during training to preserve accuracy within compute and memory budgets. We graft Juliet onto ResNet-18, EfficientNet-B0, and DenseNet-121 and train on CIFAR-10 (ARCHER2), with an ImageNet/H100 check using ResNet-101. On CIFAR-10, Juliet reduces theoretical training and inference FLOPs, even when the parameter count increases. The results show reductions of $\sim21\%$ (ResNet-18), $\sim68\%$ (EfficientNet-B0), and $\sim70\%$ (DenseNet-121) in inference FLOPs, while staying within $\sim1\%$ Top-1 of the baseline for ResNet-18 and DenseNet-121, with a larger trade-off on EfficientNet-B0. At ImageNet scale, Juliet-101 achieves $27.1$ Top-1 per GFLOP, outscoring SkipNet, ConvNet-AIG, and BlockDrop. Ablations and hyperparameter sweeps (growth/prune thresholds, prune interval, prebuild limit) reveal nuances in Juliet's architecture, and simpler routers (e.g., a small MLP) match transformer routing, indicating the transformer router may not be a prerequisite for achieving competitive accuracy. Overall, Juliet provides a flexible, interpretable approach to conditional computation for convolutional neural networks, improving the efficiency–accuracy trade-off for the CNNs we evaluate.

URL: https://openreview.net/forum?id=ETQbfcbtjJ

---

Title: Bayesian Optimisation via Difference-of-Convex Thompson Sampling

Abstract: Thompson sampling is a method for Bayesian optimisation whereby a randomly drawn belief of the objective function is sampled at each round and then optimised, informing the next observation point. The belief is typically maintained using a sufficiently expressive Gaussian process (GP) surrogate of the true objective function. The sample drawn is non-convex in general and non-trivial to optimise. Motivated by the desire to make this optimisation subproblem more tractable, we propose difference-of-convex Thompson sampling (DCTS): a scalable method for drawing GP samples that combines random neural network features with pathwise updates on the limiting kernel. The resulting samples belong to the difference-of-convex function class and are inherently easier to optimise while retaining rich expressive power. We establish sublinear cumulative regret bounds using a simplified proof technique and demonstrate the advantages of our framework on various problems, including synthetic test functions, hyperparameter tuning, and computationally expensive physics simulations.

URL: https://openreview.net/forum?id=Ih9sJCZ0sW

---

Title: Anytime Verified Agents: Adaptive Compute Allocation for Reliable LLM Reasoning under Budget Constraints

Abstract: Large language model (LLM) agents show promising results in reasoning, planning, and tool use. However, their performance scales with the computational budget. Existing methods allocate computational resources using static strategies such as fixed search depths, constant self-consistency sampling, or uniform verification. This means that simple problems consume as much compute as complex tasks. We present Anytime Verified Agents (AVA), a framework that dynamically allocates compute across search, tool use, and verification within a user-specified budget. AVA integrates calibrated uncertainty estimation, value-of-information-guided search expansion, and selective verification cascades with early exits. The controller dynamically allocates the compute based on the predicted failure risk and marginal reliability gains, allowing the agent to achieve higher accuracy at fixed budgets or lower costs at target reliability levels. AVA is evaluated on mathematical reasoning (GSM8K), multi-hop question answering (HotpotQA), and code generation (HumanEval) benchmarks, and it is compared to fixed-depth search, self-consistency, and always-verify baselines. The results show that the adaptive allocation achieves a 20-40% cost reduction at equivalent reliability while maintaining accuracy, showing clear Pareto improvements in the compute-reliability trade-off.

URL: https://openreview.net/forum?id=JMDCMf7mlF

---

Title: CP Merging: Joint LoRA Merging using Canonical Polyadic Decomposition

Abstract: Large language models (LLMs) are often fine-tuned for specific tasks using Low-Rank Adaptation (LoRA), an efficient method that adds small, task-specific modules called LoRA adapters to a pre-trained base model. However, a major challenge arises when merging multiple LoRA adapters trained on different data sources for a specific task: it often leads to \textit{task interference}, which refers to the redundancy or sign discrepancies found in parameters across different task models, resulting in information conflict and performance loss. While SVD-based merging methods show promise by decomposing adapters into orthogonal components to reduce cross-task interference, they suffer from a critical limitation: SVD decomposition treats the LoRA adapters merely as matrices, which prevents the identification of the optimal orthogonal basis, limiting these approaches from effectively reducing the task interference. To address this, we propose a novel LoRA merging approach using joint Canonical Polyadic (CP) decomposition, which we term CP Merging. We first aggregate the LoRA adapters into a single third-order tensor. Subsequently, we apply CP decomposition to this tensor to disentangle factors that are unique to each task from those that are shared across tasks. This joint factorization inherently helps to reduce cross-task interference without sacrificing critical information. Our extensive experiments further validate this approach, demonstrating that CP merging yields superior performance compared to existing SVD-based merging approaches.
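
As an illustrative sketch of the stated idea (not the paper's algorithm; the tensorly library, the CP rank, and the uniform recombination of task loadings are assumptions): stack the per-source LoRA updates B_i @ A_i into a third-order tensor, factor it jointly with CP decomposition, and rebuild a merged update from the shared factors.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

d_out, d_in, r_lora, n_adapters, r_cp = 64, 64, 8, 4, 16
rng = np.random.default_rng(0)
deltas = [rng.normal(size=(d_out, r_lora)) @ rng.normal(size=(r_lora, d_in))
          for _ in range(n_adapters)]                      # stand-ins for B_i @ A_i
T = tl.tensor(np.stack(deltas, axis=0))                    # (n_adapters, d_out, d_in)

weights, (S, U, V) = parafac(T, rank=r_cp, n_iter_max=200, tol=1e-7)
# S: (n_adapters, r_cp) task-specific loadings; U, V: shared output/input factors.

merged_loading = S.mean(axis=0)                            # simple uniform recombination
merged_delta = np.einsum('r,ir,jr->ij', weights * merged_loading, U, V)
print("merged LoRA update shape:", merged_delta.shape)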

URL: https://openreview.net/forum?id=2poB2149km

---

Title: Enhancing Model Robustness Against Noisy Labels via Kronecker Product Decomposition

Abstract: Deep learning models have made remarkable progress across various domains in recent years. These models heavily rely on large-scale datasets for training, and a noisy dataset can degrade the performance of the model. To train accurate deep learning models, it is crucial to develop training algorithms that are robust to noisy training data and outliers while ensuring high performance. In this work, we study the problem of model training under noisy labels/outputs and propose a method based on Kronecker product decomposition to improve robustness during training. The proposed method is easy to implement and can be readily combined with robust loss functions.
We report results from experiments conducted on both classification and regression tasks in the presence of noisy labels/outputs. Our results demonstrate that our approach outperforms existing robust loss methods in terms of model performance.

URL: https://openreview.net/forum?id=3C1JLecije

---

Title: eDQA: Efficient Deep Quantization of DNN Activations on Edge Devices

Abstract: Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained edge devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive online searches for the best hyperparameters. However, these expensive operations are impractical on edge devices with limited computational capabilities, memory capacities, and energy budgets. Furthermore, many existing methods either do not focus on sub-6-bit (or deep) quantization, or leverage mixed-precision approaches to achieve deep quantization on average but without further improving the hardware usage efficiency. To fill these gaps, in this paper we propose eDQA (Efficient Deep Quantization of DNN Activations on Edge Devices), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and data compression techniques to achieve high efficiency and accuracy. We evaluate eDQA with 3, 4, and 5-bit quantization levels and four different DNN models on two different datasets. eDQA shows up to 75\% better accuracy compared to three existing methods: direct quantization, classic power-of-two quantization, and the state-of-the-art NoisyQuant for sub-6-bit quantization. Additionally, we compare eDQA with NoisyQuant on an edge FPGA, achieving up to $309\times$ speedup. The code is available at https://github.com/xxxx.

URL: https://openreview.net/forum?id=SEIBCdgE5W

---

Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models

Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model’s internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a training-, subject-, and memory-free approach that is well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit fundamentally differs from traditional paradigms by computing parameter shifts in one step using only a hidden state and its gradient, making the approach simple yet efficient. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds more than 7× faster than the previous state-of-the-art method, while requiring 4× less VRAM. This makes it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 2M edits while maintaining high accuracy. Comprehensive experiments on five datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios, taking a further step towards safe and scalable lifelong learning. We will release the code and dataset upon acceptance.

URL: https://openreview.net/forum?id=GoJLp3BlRV

---
