Daily TMLR digest for Feb 19, 2026


TMLR

Feb 19, 2026, 12:30:08 AM
to tmlr-anno...@googlegroups.com


New certifications
==================

J2C Certification: A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil

https://openreview.net/forum?id=zAo0L7Dcqt

---


Accepted papers
===============


Title: From Link Prediction to Forecasting: Addressing Challenges in Batch-based Temporal Graph Learning

Authors: Moritz Lampert, Christopher Blöcker, Ingo Scholtes

Abstract: Dynamic link prediction is an important problem considered in many recent works that propose approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on continuous-time and discrete-time temporal graph datasets, typically using a traditional batch-oriented evaluation setup. However, as we show in this work, a batch-oriented evaluation is often unsuitable and can cause several issues. Grouping edges into fixed-sized batches regardless of their occurrence time leads to information loss or leakage, depending on the temporal granularity of the data. Furthermore, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. In this work, we empirically show how traditional batch-based evaluation leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data.
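The batching issue this abstract describes can be pictured with a small sketch (my own illustration, not the authors' code; edge timestamps are made up): fixed-size batches ignore timestamps and can split simultaneous edges or mix time periods, while fixed-duration time windows keep every prediction task over the same horizon.

```python
from collections import defaultdict

# Toy temporal edge list: (source, target, timestamp)
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.2), (0, 3, 2.5), (1, 3, 2.6)]

def fixed_size_batches(edges, batch_size):
    """Group edges into fixed-size batches regardless of timestamps.
    A batch may split simultaneous edges (information loss) or mix
    edges from different times (leakage)."""
    return [edges[i:i + batch_size] for i in range(0, len(edges), batch_size)]

def time_window_batches(edges, window):
    """Group edges by fixed-duration time windows instead, so every
    prediction covers the same time horizon."""
    buckets = defaultdict(list)
    for u, v, t in edges:
        buckets[int(t // window)].append((u, v, t))
    return [buckets[k] for k in sorted(buckets)]

print(fixed_size_batches(edges, 2))   # splits the two t=1.0 edges across batches
print(time_window_batches(edges, 1.0))
```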

URL: https://openreview.net/forum?id=iZPAykLE3l

---

Title: Towards Scalable Language-Image Pre-training for 3D Medical Imaging

Authors: Chenhui Zhao, Yiwei Lyu, Asadur Zaman Chowdury, Edward S Harake, Akhil Kondepudi, Akshay T Rao, Xinhai Hou, Honglak Lee, Todd C Hollon

Abstract: The scalability of current language-image pre-training for 3D medical imaging, such as CT and MRI, is constrained by the need for radiologists to manually curate raw clinical studies. In this work, we pioneer pre-training directly on uncurated studies, which both aligns more closely with the clinical workflow and provides a natural path to scalability. However, the unique structure of such data presents new challenges for existing model architectures, which were originally designed for 2D slices or single 3D scans. To address this, we introduce a novel hierarchical attention mechanism inspired by the intrinsic hierarchy of radiology data: slice, scan, and study. We denote our framework as Hierarchical attention for Language-Image Pre-training (HLIP). Trained on 220K studies with 3.13 million scans for brain MRI and 240K studies with 1.44 million scans for head CT, HLIP achieves state-of-the-art performance, e.g., +10.5% balanced ACC on the proposed publicly available brain MRI benchmark Pub-Brain-5; +8.3% and +1.7% macro AUC on head CT benchmarks CQ500 and RSNA, respectively. HLIP also exhibits strong generalizability on existing 3D medical language-image pre-training benchmarks, e.g., +4.3% macro AUC on the Rad-ChestCT benchmark when pre-trained on CT-RATE. These results demonstrate that, with HLIP, directly pre-training on uncurated clinical datasets is a scalable and effective direction for language-image pre-training in 3D medical imaging. The code is available at https://github.com/zch0414/hlip
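The slice-scan-study hierarchy can be pictured with a minimal pooling sketch (illustrative only; HLIP's actual mechanism is hierarchical attention inside a transformer, and all shapes and query vectors below are invented for the example):

```python
import numpy as np

def attn_pool(X, q):
    """Softmax-attention pooling of the rows of X against a query vector q."""
    scores = X @ q
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ X

rng = np.random.default_rng(0)
d = 8
q_slice, q_scan, q_study = rng.normal(size=(3, d))

# A study is a list of scans; a scan is a list of slices; a slice is tokens x d.
study = [[rng.normal(size=(16, d)) for _ in range(3)] for _ in range(2)]

scan_embs = []
for scan in study:
    # Pool tokens -> slice embeddings, then slices -> one scan embedding.
    slice_embs = np.stack([attn_pool(tokens, q_slice) for tokens in scan])
    scan_embs.append(attn_pool(slice_embs, q_scan))
# Pool scans -> one study embedding, mirroring the radiology hierarchy.
study_emb = attn_pool(np.stack(scan_embs), q_study)
print(study_emb.shape)  # (8,)
```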

URL: https://openreview.net/forum?id=WxHf4EcBWA

---

Title: TABASCO: A Fast, Simplified Model for Molecular Generation with Improved Physical Quality

Authors: Carlos Vonessen, Charles Harris, Miruna Cretu, Pietro Lio

Abstract: State-of-the-art models for 3D molecular generation are built on significant inductive biases: SE(3) equivariance, permutation invariance, and graph message-passing networks to capture local chemistry; yet the generated molecules struggle with physical plausibility.
We introduce TABASCO, which relaxes these assumptions: the model has a standard non-equivariant transformer architecture, treats the atoms in a molecule as a sequence, and does not explicitly model bonds. The absence of equivariant layers and message passing allows us to simplify the model architecture and scale data throughput.
On the GEOM-Drugs and QM9 benchmarks, TABASCO achieves state-of-the-art PoseBusters validity and delivers inference roughly 10x faster than the strongest baseline, while exhibiting emergent rotational equivariance without hard-coded symmetry.
Our work offers a blueprint for training minimalist, high-throughput, unconditional generative models, and the resulting architecture is readily extensible to future conditional tasks.
We provide a link to our implementation at https://github.com/carlosinator/tabasco.

URL: https://openreview.net/forum?id=Kg6CSrbXl4

---

Title: Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models

Authors: Inés Gonzalez Pepe, Hiba Akhaddar, Tristan Glatard, Yohan Chatelain

Abstract: We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such variability must be scalable, efficient, and integrate seamlessly with existing frameworks while minimizing code modifications. Fuzzy PyTorch enables this by integrating stochastic arithmetic into PyTorch through Probabilistic Rounding with Instruction Set Management, a novel library interfacing with Verificarlo, a numerical analysis compiler. The library offers a stochastic rounding mode and a novel up-down rounding mode.
Comparative evaluations show Fuzzy PyTorch maintains model performance and achieves runtime reductions of $5\times$ to $60\times$ versus Verrou, a state-of-the-art tool. We further demonstrate scalability by running models from 1 to 341 million parameters, confirming applicability across small and large DL architectures. Overall, Fuzzy PyTorch provides an efficient, scalable, and practical solution for assessing numerical variability in deep learning, enabling researchers and practitioners to quantify and manage floating-point uncertainty without compromising performance or computational efficiency.
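The core idea of stochastic rounding, which the library injects at the instruction level, can be sketched in a few lines (my illustration on a coarse grid, not the Fuzzy PyTorch/Verificarlo implementation, which operates on floating-point arithmetic):

```python
import math
import random

def stochastic_round(x, step=1.0, rng=random.random):
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fractional distance, so the rounding is unbiased in expectation."""
    lo = math.floor(x / step) * step
    frac = (x - lo) / step
    return lo + step if rng() < frac else lo

# Repeated runs average back to the input value; the run-to-run spread is
# exactly the numerical variability such tools are built to expose.
random.seed(0)
mean = sum(stochastic_round(0.3) for _ in range(10_000)) / 10_000
print(mean)  # close to 0.3
```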

URL: https://openreview.net/forum?id=0ogq232VGP

---

Title: A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Authors: Xinjie Liu, Cyrus Neary, Kushagra Gupta, Wesley A. Suttle, Christian Ellis, Ufuk Topcu, David Fridovich-Keil

Abstract: Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators—such as reduced-order models, heuristic reward functions, or generative world models—can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework by developing a practical, multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of multi-fidelity REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. We evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks in scenarios with limited high-fidelity data but abundant off-dynamics, low-fidelity data. In our baseline comparisons, for scenarios where low-fidelity data are neutral or beneficial and dynamics gaps are mild to moderate, MFPG is, among the evaluated off-dynamics RL and low-fidelity-only approaches, the only method that consistently achieves statistically significant improvements in mean performance over a baseline trained solely on high-fidelity data. When low-fidelity data become harmful, MFPG exhibits the strongest robustness against performance degradation among the evaluated methods, whereas strong off-dynamics RL methods tend to exploit low-fidelity data aggressively and fail substantially more severely.
An additional experiment in which the high- and low-fidelity environments are assigned anti-correlated rewards shows that MFPG can remain effective even when the low-fidelity environment exhibits reward misspecification. Thus, MFPG not only offers a reliable and robust paradigm for exploiting low-fidelity data, e.g., to enable efficient sim-to-real transfer, but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
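The control-variate construction at the heart of MFPG can be sketched generically (my own scalar illustration of a multi-fidelity control variate, not the paper's policy-gradient code; all sample counts and distributions are invented): a few expensive high-fidelity samples are combined with many cheap, correlated low-fidelity samples, leaving the estimator unbiased for the high-fidelity mean while reducing its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hi, n_lo = 50, 5000

z = rng.normal(size=n_lo)                  # shared randomness across fidelities
g_lo_all = 1.0 + z                         # many cheap low-fidelity samples
g_lo_paired = g_lo_all[:n_hi]              # the ones paired with expensive runs
g_hi = 1.2 + z[:n_hi] + 0.1 * rng.normal(size=n_hi)  # few expensive samples

# Near-optimal control-variate coefficient from the paired samples.
c = np.cov(g_hi, g_lo_paired)[0, 1] / np.var(g_lo_paired)

# Unbiased for E[g_hi], since the correction term has zero mean; variance
# shrinks when the two fidelities are strongly correlated.
estimate = g_hi.mean() - c * (g_lo_paired.mean() - g_lo_all.mean())
print(estimate)
```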

URL: https://openreview.net/forum?id=zAo0L7Dcqt

---

Title: Proc-to-Spec: A Functorial Map of Network Processes

Authors: Shanfeng Hu

Abstract: The analysis of dynamic networks is central to understanding complex environmental systems in nature, yet traditional methods often focus on describing changing states rather than formalising the underlying processes of change. In this work, we introduce a category-theoretical framework, Proc-to-Spec, that provides a principled, functorial method for analysing the transformations that govern network evolution. We model resource-constrained systems, such as those commonly found in biology and ecology, within a source category Proc, where morphisms represent dissipative physical processes. We then construct a spectral functor, $\chi: Proc \to Spec$, that maps each process to a unique linear transformation between the eigenspaces of the network's symmetrised Laplacian. This framework allows us to establish a set of rigorous theorems. We prove that physical conservation laws in Proc correspond directly to spectral invariants in Spec, such as the conservation of the Laplacian's trace. We derive a spectral sensitivity theorem that formally links resource dissipation to network fragmentation via the Fiedler value. We also establish a stability-spectrum equivalence theorem, proving that a system's physical dynamics converge to a stable state if and only if its spectral geometry converges. We also derive an optimal Spec-to-Func projection to compress these transformations into interpretable, low-dimensional functional fingerprints. We validate our theory with numerical experiments and demonstrate its generality as a tool for scientific discovery across two comprehensive, contrasting case studies. 
(1) In a high-signal, high-noise, macro-timescale ecological case study of the Serengeti food web in northern Tanzania, we use a large collection of 1.2 million classified image sets of animal activity from 225 camera traps spread across 1,125 km$^2$ of the Serengeti National Park from 2010 to 2013 to show that our framework can detect the subtle, cyclical signature of seasonal change and identify the unique geometric fingerprint of the 2011 East Africa drought. (2) In a low-signal, high-noise, micro-timescale neuroscience case study, we show that our framework's functional fingerprints can detect and characterise subtle cognitive processes from human brain fMRI data, classifying 8 distinct task states with high, generalisable accuracy. Our work provides a different way of thinking about dynamic systems, shifting the focus from describing states to understanding the fundamental geometry of change. Code to reproduce all results in the paper is released at https://github.com/shanfenghu/pts
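One of the spectral invariants mentioned above, conservation of the Laplacian's trace, is easy to check numerically (a generic sketch, not the paper's code): the trace of the symmetrised Laplacian equals the total symmetrised edge weight, so any process that conserves total weight conserves the trace.

```python
import numpy as np

def sym_laplacian(W):
    """Laplacian of the symmetrised weight matrix (zero diagonal assumed)."""
    Ws = (W + W.T) / 2
    return np.diag(Ws.sum(axis=1)) - Ws

W = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0]])
L = sym_laplacian(W)

# A weight-conserving "process": move 0.5 units from one edge to another.
W2 = W.copy()
W2[0, 1] -= 0.5
W2[1, 2] += 0.5
print(np.trace(L), np.trace(sym_laplacian(W2)))  # equal traces
```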

URL: https://openreview.net/forum?id=pT84Ii6igG

---

Title: Topological Inductive Bias fosters Multiple Instance Learning in Data-Scarce Scenarios

Authors: Salome Kazeminia, Carsten Marr, Bastian Rieck

Abstract: Multiple instance learning (MIL) is a framework for weakly supervised classification, where labels are assigned to sets of instances, i.e., bags, rather than to individual data points. This paradigm has proven effective in tasks where fine-grained annotations are unavailable or costly to obtain. However, the effectiveness of MIL drops sharply when training data are scarce, such as for rare disease classification. To address this challenge, we propose incorporating topological inductive biases into the data representation space within the MIL framework. This bias introduces a topology-preserving constraint that encourages the instance encoder to maintain the topological structure of the instance distribution within each bag when mapping them to MIL latent space. As a result, our Topology Guided MIL (TG-MIL) method enhances the performance and generalizability of MIL classifiers across different aggregation functions, especially under scarce-data regimes. Our evaluations show average performance improvements of 15.3% for synthetic MIL datasets, 2.8% for MIL benchmarks, and 5.5% for rare anemia classification compared to current state-of-the-art MIL models, where only 17–120 samples per class are available. We make our code publicly available at https://github.com/SalomeKaze/TGMIL.
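The general shape of such a structure-preserving penalty can be sketched with a simple geometric stand-in (my illustration; the paper's actual constraint is topological rather than this plain distance-matching loss): encourage the encoder to keep the pattern of pairwise distances within a bag when mapping instances to the latent space.

```python
import numpy as np

def pairwise_dists(a):
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_preservation_loss(x, z):
    """Penalty on the mismatch between scale-normalised pairwise distance
    matrices of a bag's instances x and their latent embeddings z."""
    dx = pairwise_dists(x)
    dz = pairwise_dists(z)
    dx = dx / (dx.max() + 1e-8)
    dz = dz / (dz.max() + 1e-8)
    return ((dx - dz) ** 2).mean()

bag = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
# A uniformly scaled embedding preserves the distance pattern (loss ~ 0);
# an anisotropically distorted one does not.
print(distance_preservation_loss(bag, 3.0 * bag))
print(distance_preservation_loss(bag, bag[[1, 0, 2]] * np.array([1.0, 0.1])))
```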

URL: https://openreview.net/forum?id=1hZy9ZjjCc

---


New submissions
===============


Title: Sample-wise Adaptive Weighting for Transfer Consistency in Adversarial Distillation

Abstract: Adversarial distillation in the standard min–max adversarial training framework aims to transfer adversarial robustness from a large, robust teacher network to a compact student. However, existing work often neglects to incorporate state-of-the-art robust teachers. Through extensive analysis, we find that stronger teachers do not necessarily yield more robust students, a phenomenon known as robust saturation. While typically attributed to capacity gaps, we show that such explanations are incomplete. Instead, we identify adversarial transferability, the fraction of student-crafted adversarial examples that remain effective against the teacher, as a key factor in successful robustness transfer. Based on this insight, we propose Sample-wise Adaptive Adversarial Distillation (SAAD), which reweights training examples by their measured transferability without incurring additional computational cost. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that SAAD consistently improves AutoAttack robustness over prior methods.

URL: https://openreview.net/forum?id=ek45VamPCE

---

Title: HyperDG: Hyperbolic Representation Alignment for Robust Domain Generalization via Curvature Refinement

Abstract: Domain generalization often suffers from geometric inconsistencies in representations learned across multiple source domains. Although recent approaches pursue flat minima or invariant features, they remain restricted to Euclidean space, overlooking the inherently curved nature of real data manifolds. We introduce HyperDG, a hyperbolic representation learning framework that models each domain as a Lorentz manifold with learnable negative curvature and enforces cross-domain consistency through a self-feedback mechanism alternating between local adaptation, tangent-space mapping, and global manifold adjustment, effectively unifying flat-minima consistency with non-Euclidean representation learning within a single optimization process. By jointly optimizing model parameters and manifold curvature, the framework learns a shared meta-manifold that preserves invariance across domains while maintaining hierarchical structure within each.
Extensive experiments on standard domain generalization benchmarks show consistent improvements in accuracy, robustness, and out-of-distribution performance, demonstrating that embracing hyperbolic representation spaces rather than flattening them leads to geometry-consistent and domain-resilient generalization.
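The basic Lorentz-model machinery the abstract relies on can be sketched concretely (standard hyperbolic-geometry formulas, not HyperDG's code; the curvature parameter K and the example points are illustrative): points on the hyperboloid of curvature -1/K satisfy <x, x>_L = -K under the Lorentz inner product, and distances are computed via arcosh.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentz (Minkowski) inner product: -x0*y0 + <spatial parts>."""
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_dist(x, y, K=1.0):
    """Geodesic distance on the hyperboloid of curvature -1/K."""
    arg = np.clip(-lorentz_inner(x, y) / K, 1.0, None)  # arcosh domain guard
    return np.sqrt(K) * np.arccosh(arg)

def lift(v, K=1.0):
    """Lift a Euclidean point v onto the hyperboloid (x0 chosen so that
    <x, x>_L = -K)."""
    x0 = np.sqrt(K + np.dot(v, v))
    return np.concatenate([[x0], v])

a = lift(np.array([0.3, 0.1]))
b = lift(np.array([-0.2, 0.4]))
print(lorentz_dist(a, a))  # 0.0
print(lorentz_dist(a, b))
```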

URL: https://openreview.net/forum?id=TSshrjqnXu

---

Title: Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $\beta \neq 0$ controls the agent's risk attitude: $\beta>0$ for risk-averse and $\beta<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state–action value function (value learning) and an optimal policy (policy learning) under recursive ERM.
We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-ERM-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|\beta|/(1-\gamma)$, where $\gamma$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|\beta|/(1-\gamma)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.
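The one-shot entropic risk measure underlying these results is rho_beta(X) = (1/beta) log E[exp(beta X)]; the recursive version in the paper composes it across time steps, and which sign of beta counts as "risk-averse" depends on whether X is a reward or a cost. A minimal sketch (my illustration, with log-sum-exp for stability):

```python
import numpy as np

def entropic_risk(samples, beta):
    """rho_beta(X) = (1/beta) * log E[exp(beta * X)], estimated from samples.
    beta < 0 weights low outcomes more heavily (value below the mean);
    beta > 0 weights high outcomes (value above the mean); beta -> 0
    recovers the plain mean."""
    samples = np.asarray(samples, dtype=float)
    if beta == 0.0:
        return samples.mean()
    m = (beta * samples).max()  # log-sum-exp shift for numerical stability
    return (m + np.log(np.mean(np.exp(beta * samples - m)))) / beta

x = np.array([0.0, 1.0, 2.0])
print(entropic_risk(x, 0.0))   # 1.0, the plain mean
print(entropic_risk(x, -2.0))  # below the mean (pessimistic)
print(entropic_risk(x, 2.0))   # above the mean (optimistic)
```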

URL: https://openreview.net/forum?id=TFwSG4uYwl

---

Title: Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

Abstract: Energy-based models (EBMs) are a flexible class of deep generative models and are well-suited to capture complex dependencies in multimodal data. However, learning multimodal EBM by maximum likelihood requires Markov Chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs have made progress in capturing such inter-modal dependencies by introducing a shared latent generator and a joint inference model. However, both the shared latent generator and joint inference model are parameterized as unimodal Gaussian (or Laplace), which severely limits their ability to approximate the complex structure induced by multimodal data. In this work, we study the learning problem of the multimodal EBM, shared latent generator, and joint inference model. We present a learning framework that effectively interweaves their MLE updates with corresponding MCMC refinements in both the data and latent spaces. Specifically, the generator is learned to produce coherent multimodal samples that serve as strong initial states for EBM sampling, while the inference model is learned to provide informative latent initializations for generator posterior sampling. Together, these two models serve as complementary models that enable effective EBM sampling and learning, yielding realistic and coherent multimodal EBM samples. Extensive experiments demonstrate superior performance for multimodal synthesis quality and coherence compared to various baselines. We conduct various analyses and ablation studies to validate the effectiveness and scalability of the proposed multimodal framework.

URL: https://openreview.net/forum?id=ZVD7bHNpY1

---

Title: An Information-Theoretic Framework for Training-Dependent Memory in Neural Sequence Models

Abstract: State-space models trained with identical architectures exhibit vastly different long-range retrieval performance depending solely on training procedure. Standard next-token prediction produces models that fail on tasks requiring precise recall, while multi-objective curricula enable the same architecture to approach Transformer-level accuracy. This training-induced capacity gap cannot be explained by existing theories, which treat representational capacity as an architectural property.
We resolve this puzzle by formalizing fixed-dimensional hidden states as communication channels where capacity depends on both bandwidth (dimension) and signal-to-noise ratio: the degree to which learned features align with task-relevant information versus interference. We prove that multi-objective training systematically increases task-aligned signal while suppressing noise (Lemmas 1-2), yielding strictly higher effective SNR without architectural modification (Theorem 1). This establishes that training can alter effective capacity within fixed dimension by reallocating representational energy across subspaces.
The framework distinguishes three architectural regimes through qualitative capacity bounds: fixed-state models as single channels, Transformers achieving bandwidth scaling through parallel storage, and training procedures amplifying SNR within fixed dimension. Observed performance patterns are quantitatively consistent with predicted scaling relationships through inverse inference. This demonstrates that representational geometry can be characterized using information-theoretic principles, formalizing how training objectives determine memory capacity in neural sequence models.

URL: https://openreview.net/forum?id=4pkynwvPtZ

---

Title: SA-PEF: Step-Ahead Partial Error Feedback for Efficient Federated Learning

Abstract: Biased gradient compression with error feedback (EF) reduces communication in federated learning (FL), but under heterogeneous (non-IID) data and local updates, the compression residual can decay slowly. This induces a mismatch between where gradients are evaluated and where the (decompressed) update is effectively applied, often slowing progress in the early rounds. We propose step-ahead partial error feedback (SA-PEF), which introduces a tunable step-ahead coefficient \(\alpha_r\in[0,1]\) and previews only a fraction of the residual while carrying the remainder through standard EF. SA-PEF interpolates smoothly between EF (\(\alpha_r=0\)) and full step-ahead EF (SAEF; \(\alpha_r=1\)). For nonconvex objectives with \(\delta\)-contractive compressors, we develop a second-moment bound and a residual recursion that yield nonconvex stationarity guarantees under data heterogeneity and partial client participation. With a constant inner stepsize, the bound exhibits the standard \(\mathcal{O}\!\bigl((\eta\,\eta_0TR)^{-1}\bigr)\) optimization term and an \(R\)-independent variance/heterogeneity floor induced by biased compression. Our analysis highlights a step-ahead-controlled residual contraction factor \(\rho_r\), explaining the observed early-phase acceleration, and suggests choosing \(\alpha_r\) near a theory-predicted optimum to balance SAEF’s rapid warm-up with EF’s long-run stability. Experiments across architectures, datasets, and compressors show that SA-PEF consistently reaches target accuracy in fewer communication rounds than EF.
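The step-ahead interpolation can be sketched in a few lines (my reading of the abstract, not the authors' code; the top-k compressor and the toy gradient are illustrative): compress the gradient plus the carried residual, apply a fraction alpha of the new residual immediately, and carry the rest forward as in standard error feedback.

```python
import numpy as np

def topk(v, k):
    """A delta-contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def sa_pef_step(grad, residual, alpha, k):
    """One SA-PEF-style step. alpha=0 recovers standard error feedback (EF);
    alpha=1 is full step-ahead EF (SAEF); intermediate values interpolate."""
    p = grad + residual
    c = topk(p, k)
    e = p - c                          # compression error
    update = c + alpha * e             # applied this round (previewed fraction)
    return update, (1.0 - alpha) * e   # residual carried to the next round

g = np.array([0.9, -0.1, 0.05, -0.6])
upd, res = sa_pef_step(g, np.zeros_like(g), alpha=0.5, k=2)
# Applied update plus carried residual always equals grad + incoming residual,
# so no gradient information is discarded, only deferred.
print(upd, res)
```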

URL: https://openreview.net/forum?id=ejnVWfknCm

---

Title: Privacy Leakage via Output Label Space and Differentially Private Continual Learning

Abstract: Differential privacy (DP) is a formal privacy framework that enables training machine learning (ML) models while protecting individuals' data. As pointed out by prior work, ML models are part of larger systems, which can lead to so-called privacy side-channels even if the model training itself is DP. We identify the output label space of a classification model as such a privacy side-channel and show a concrete privacy attack that exploits it. The side-channel becomes highly relevant in continual learning (CL), where the output label space changes over time. To reason about privacy guarantees in CL, we introduce a formalisation of DP for CL, which also clarifies how our approach differs from existing approaches. We propose and evaluate two methods for eliminating this side-channel: applying an optimal DP mechanism to release the labels in the sensitive data, and using a large public label space. We explore the trade-offs of these methods through adapting pre-trained models. We demonstrate empirically that our models consistently achieve higher accuracy under DP than previous work over both Split-CIFAR-100 and Split-ImageNet-R, with a stronger privacy model.

URL: https://openreview.net/forum?id=ZshFgRQWrm

---

Title: Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

Abstract: Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at https://anonymous.4open.science/r/adaptive-clipping-DPDL.
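The core idea of a tunable lower bound on the clipping threshold can be sketched as follows (an illustrative quantile-based adaptive rule, not necessarily the paper's exact update; the gradient norms and the floor value are made up):

```python
import numpy as np

def bounded_adaptive_bound(grad_norms, q=0.5, c_min=0.3):
    """Adapt the clipping bound to a gradient-norm quantile, but never let
    it fall below the floor c_min that protects hard samples' gradients."""
    return max(np.quantile(grad_norms, q), c_min)

def clip(grad, bound):
    """Standard per-sample gradient clipping to an L2 norm bound."""
    n = np.linalg.norm(grad)
    return grad if n <= bound else grad * (bound / n)

# A well-fitting majority with tiny gradients and one hard minority sample:
norms = np.array([0.05, 0.06, 0.07, 2.0])
print(bounded_adaptive_bound(norms))  # floor kicks in: 0.3, not ~0.065
```

Without the floor, the adaptive bound would shrink toward the majority's tiny gradient norms and all but erase the hard sample's signal, which is the disparate impact the abstract describes.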

URL: https://openreview.net/forum?id=UlzcKSHVoN

---

Title: PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems

Abstract: Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields, a fundamental task in numerous scientific domains. Neural networks, and particularly neural operators, are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations.
While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem (mapping physical parameters to solution fields), to our knowledge there are no comparably comprehensive studies and benchmark datasets for PDE inverse problems (mapping solution fields to underlying physical parameters). We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size.
Our experiments reveal several practical insights: (1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, (2) incorporating PDE derivatives as input features consistently improves accuracy, and (3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and evaluation codebase freely available to facilitate reproducibility and further development of our work.

URL: https://openreview.net/forum?id=MSjhqRnNyZ

---

Title: A Survey of Linear Attention: Algorithm, Theory, Application, and Infrastructure

Abstract: Large Language Models (LLMs) have proven effective in understanding and generating extremely long contexts.
Recently, linear attention mechanisms have garnered significant attention, as they reduce the quadratic computational complexity of traditional attention mechanisms to linear complexity in the token sequence length, thus balancing effectiveness and efficiency in LLM training and inference. This survey focuses on a broad spectrum of linear attention techniques, including traditional linear attention methods, the state space model (SSM) series, and linear recurrent neural networks (RNNs). These methods enable implicit historical information integration via state propagation, and achieve an approximately constant memory footprint as well as linear time complexity in sequence modeling tasks. Beyond algorithmic designs and model architectures, we further explore the characteristics, challenges, and successful applications of linear attention from a more comprehensive perspective. We also discuss the essential factors for practical hybrid frameworks, robust and efficient infrastructure, and scenario-specific features of downstream tasks, which jointly contribute to the successful deployment of linear attention mechanisms.
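The linear-complexity trick behind this family of methods can be sketched in its simplest causal form (a textbook kernelized-attention sketch, not any specific surveyed model; the feature map and shapes are illustrative): with a positive feature map phi, the output at step t is phi(q_t) S_t / (phi(q_t) z_t), where the state S_t = sum over s<=t of phi(k_s) v_s^T and the normalizer z_t = sum of phi(k_s) are updated recurrently, giving O(n) time and constant-size state.

```python
import numpy as np

def phi(x):
    """A simple positive feature map (ReLU plus epsilon)."""
    return np.maximum(x, 0.0) + 1e-6

def linear_attention(Q, K, V):
    """Causal linear attention via a running outer-product state."""
    d, dv = Q.shape[1], V.shape[1]
    S = np.zeros((d, dv))   # sum of phi(k_s) v_s^T  -- constant-size state
    z = np.zeros(d)         # sum of phi(k_s)        -- running normalizer
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        S += np.outer(fk, v)
        z += fk
        fq = phi(q)
        out.append(fq @ S / (fq @ z))
    return np.array(out)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
O = linear_attention(Q, K, V)
print(O.shape)  # (5, 3)
```

The first output always equals the first value vector, since at t=0 the weighted sum contains a single token; later steps mix history through the state rather than by re-reading all past tokens.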

URL: https://openreview.net/forum?id=ilkVX8aGmQ

---

Title: CauFR-TS: Causal Time-Series Identifiability via Factorized Representations

Abstract: Causal discovery from multivariate time series is a fundamental problem for interpretable modelling, causality-aware downstream analysis, and intervention-driven simulation. Recent neural approaches commonly rely on shared latent embeddings to capture temporal dynamics and utilize them for causal structure estimation and downstream prediction. We formally establish that such shared encoders entangle distinct causal mechanisms into a unified latent manifold, which exhibits fundamental theoretical limitations of structural non-identifiability and conditional independence assumptions required for Granger causality. To address these issues, we propose CauFR-TS, a recurrent variational framework that enforces mechanism modularity through dimension-wise encoders and ensures mediation of all cross-variable dependencies through structured latent aggregation. Furthermore, we address the instability of heuristic thresholding in continuous relaxation methods by proposing an adaptive, data-driven unsupervised link selection strategy based on decoder weight distribution. Empirical evaluation on synthetic and in silico biological benchmarks demonstrates that CauFR-TS outperforms recent baselines in graph recovery metrics while preserving competitive probabilistic forecasting performance.

URL: https://openreview.net/forum?id=Al4OnLoQsp

---

Title: When Does Margin Clamping Affect Training Variance? Dataset-Dependent Effects in Contrastive Forward-Forward Learning

Abstract: Contrastive Forward-Forward (CFF) learning is a layer-local alternative to backpropagation that trains Vision Transformers using supervised contrastive objectives at each layer independently. In practice, CFF can exhibit substantial seed-to-seed variability, complicating reproducibility and hyperparameter selection. We audit one implementation detail inside the supervised contrastive loss: applying the positive-pair margin via saturating similarity clamping, $\min(s + m, 1)$. We compare this against a post-log-probability subtraction reference that we prove is gradient-neutral under the mean-over-positives reduction (Proposition 4.1), thereby isolating the effect of saturation itself. On CIFAR-10 in a $2 \times 2$ factorial ablation ($n=7$ seeds per cell), the clamped variant exhibits $5.90\times$ higher pooled test-accuracy variance ($p=0.003$, bootstrap 95% CI $[1.62, 15.80]$) with no detectable difference in mean accuracy. Clamp activation rates (CAR), layerwise gradient norms, and a reduced-margin dose-response probe jointly indicate that this variance increase is associated with gradient truncation in early transformer layers. However, the effect is dataset-dependent: replication on CIFAR-100 ($\mathrm{VR} = 0.39\times$), SVHN ($\mathrm{VR} = 0.25\times$), and Fashion-MNIST ($\mathrm{VR} = 0.08\times$, $p=0.029$) reveals inverted variance ratios in all three cases. Cross-dataset analysis identifies layer-0 clamp activation rate as a necessary but insufficient condition for variance inflation: CIFAR-10's high L0 CAR (60.7%) co-occurs with the only elevated variance ratio, while CIFAR-100's low L0 CAR (29.0%) and SVHN/Fashion-MNIST's high task accuracy ($>92\%$) each independently suppress the effect. An SVHN difficulty sweep confirms this interaction: increasing augmentation difficulty on the same dataset drives the variance ratio from $0.25\times$ to $16.73\times$. These results characterize the conditions under which margin clamping destabilizes CFF training and offer practical guidance for practitioners.
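The two margin variants contrasted in the abstract can be sketched in a few lines (an illustrative toy, not code from the paper; function names and values are ours):

```python
import numpy as np

def clamped_margin_logits(sims, margin=0.1):
    """Saturating variant audited in the paper: add the margin to the
    positive-pair similarity, then clamp at 1 before the softmax.
    Wherever s + m > 1 the gradient of the similarity is truncated."""
    return np.minimum(sims + margin, 1.0)

def post_logprob_margin(log_probs, margin=0.1):
    """Gradient-neutral reference: subtract the margin from the
    log-probability after the softmax/log, leaving similarity
    gradients untouched under a mean-over-positives reduction."""
    return log_probs - margin

# Toy positive-pair similarities; two of the three entries saturate at 1.0.
sims = np.array([0.85, 0.95, 0.99])
print(clamped_margin_logits(sims))
```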

URL: https://openreview.net/forum?id=EmHvSp7Jm0

---

Title: ExpertLens: Activation steering features are highly interpretable

Abstract: Activation steering methods in large language models (LLMs) have emerged as an effective way to perform targeted updates to enhance generated language without requiring large amounts of adaptation data. We ask whether the features discovered by activation steering methods are interpretable. We identify neurons responsible for specific concepts (e.g., ``cat'') using the ``finding experts'' method from research on activation steering and show that the ExpertLens, i.e., inspection of these neurons, provides insights about model representation. We find that ExpertLens representations are stable across models and datasets and closely align with human representations inferred from behavioral data, matching inter-human alignment levels. ExpertLens significantly outperforms the alignment captured by word/sentence embeddings. By reconstructing human concept organization through ExpertLens, we show that it enables a granular view of LLM concept representation. Our findings suggest that ExpertLens is a flexible and lightweight approach for capturing and analyzing model representations.

URL: https://openreview.net/forum?id=FBIsN6RdYO

---

Title: Evading Protections Against Unauthorized Data Usage via Limited Fine-tuning

Abstract: Text-to-image diffusion models, such as Stable Diffusion, have demonstrated exceptional potential for generating high-quality images. However, recent studies have raised concerns about the use of unauthorized data to train these models, which can lead to intellectual property infringement or privacy violations. A promising approach to mitigating these issues is to embed a signature in the model that can be detected or verified from its generated images. Existing works also aim to prevent training on protected images by degrading generation quality, for example by injecting adversarial perturbations into the training data. In this paper, we propose RATTAN, which effectively evades such protection methods by removing protective perturbations from images and inducing catastrophic forgetting of the corresponding learned features in the model. RATTAN leverages the diffusion process to generate controlled images from the protected inputs, preserving high-level features while ignoring the low-level details used by the embedded pattern. A small number of generated images (e.g., 10) are then used to fine-tune a marked model to remove the learned features. Our experiments on four datasets, two different IP protection methods, and 300 text-to-image diffusion models reveal that, while some protections already suffer from weak memorization, RATTAN can reliably bypass stronger defenses, exposing fundamental limitations of current protections and highlighting the need for stronger defenses.

URL: https://openreview.net/forum?id=8xF5KYHRCU

---

Title: Multimodal Video Generation Models with Audio: Present and Future

Abstract: Video generation models have advanced rapidly and are now widely used across entertainment, advertising, filmmaking, and robotics applications such as world modeling and simulation. However, visual content alone is often insufficient for realistic and engaging media experiences—audio is also a key component of immersion and semantic coherence. As AI-generated videos become increasingly prevalent in everyday content, demand has grown for systems that can generate synchronized sound alongside visuals. This trend has driven rising interest in multimodal video generation, which jointly models video and audio to produce more complete, coherent, and appealing outputs. Since late 2025, a wave of multimodal video generation models has emerged, with releases including Veo 3.1, Sora 2,
Kling 2.6, Wan 2.6, OVI, and LTX 2. As multimodal generation technology advances, its impact expands across both daily consumer and industrial domains—revolutionizing daily entertainment while enabling more sophisticated world simulation for training embodied AI systems. In this paper, we provide a comprehensive overview of the multimodal video generation model literature covering the major topics: evolution and common architectures of multimodal video generation models; common post-training methods and evaluation; applications and active research areas of video generation; limitations and challenges of multimodal video generation.

URL: https://openreview.net/forum?id=8i5vInabkm

---

Title: CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Abstract: Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget.
CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard Frobenius-norm dictionary learning problem, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections.
Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
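The union-of-subspaces structure the abstract describes (dense dictionary times column-sparse coefficients) can be illustrated with a toy factorization; a minimal sketch with hypothetical sizes and a random-support least-squares fit standing in for the paper's calibration-guided optimization:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_atoms, k = 64, 128, 32, 4  # k nonzero atoms per column (illustrative)

W = rng.standard_normal((d_out, d_in))    # weight matrix to compress
D = rng.standard_normal((d_out, n_atoms)) # dense dictionary
C = np.zeros((n_atoms, d_in))             # column-sparse coefficient matrix

# Each weight column is expressed over its own subset of k atoms
# (a union of subspaces, unlike a shared low-rank subspace).
for j in range(d_in):
    support = rng.choice(n_atoms, size=k, replace=False)
    coef, *_ = np.linalg.lstsq(D[:, support], W[:, j], rcond=None)
    C[support, j] = coef

W_hat = D @ C                             # structured sparse approximation of W
params_dense = W.size
params_factored = D.size + k * d_in       # store only the nonzeros of C
print(params_factored < params_dense)     # True: fewer parameters at this budget
```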

URL: https://openreview.net/forum?id=N8WUKWDy5C

---

Title: A Survey on the Abstraction and Reasoning Corpus

Abstract: Chollet (2019) proposed a definition of intelligence that emphasizes efficiency in skill acquisition rather than performance on a predefined set of tasks, and introduced the Abstraction and Reasoning Corpus (ARC-v1, or ARC-AGI-1) as a challenge benchmark for machine learning research.
In the following years, ARC and the associated competitions have highlighted fundamental limitations of classical deep learning approaches and underscored the need for new ideas in abstract reasoning. This has incentivized extensive trial-and-error exploration, resulting in a wide variety of methods applied to the corpus.
As ARC-v2 was released in March 2025, this literature survey provides a systematic breadth-first overview of the methods applied to ARC-v1 in the six years since its introduction, prior to version 2, and covers early developments for ARC-v2 and ARC Prize 2025.
We apply a taxonomy distinguishing inductive (which explicitly construct transformation rules) and transductive approaches (which directly map inputs to outputs), examine the ecosystem of enabling techniques and auxiliary datasets, and synthesize patterns, trade-offs, and underexplored areas across the research landscape.
Our goal is to provide newcomers with a comprehensive foundation for understanding existing approaches and identifying promising research directions in abstract reasoning.

URL: https://openreview.net/forum?id=qzFxBcK9Cg

---

Title: Glocal Smoothness: Line search and adaptive sizes can help in theory too!

Abstract: Iteration complexities for optimizing smooth functions with first-order algorithms are typically stated in terms of a global Lipschitz constant of the gradient, and near-optimal results are then achieved using fixed step sizes. But many objective functions that arise in practice have regions with small Lipschitz constants where larger step sizes can be used. Many local Lipschitz assumptions have been proposed, which have led to results showing that adaptive step sizes and/or line searches yield improved convergence rates over fixed step sizes. However, these faster rates tend to depend on the iterates of the algorithm, which makes it difficult to compare the iteration complexities of different methods. We consider a simple characterization of global and local ("glocal") smoothness that only depends on properties of the function. This allows upper bounds on iteration complexities in terms of iterate-independent constants and enables us to compare iteration complexities between algorithms. Under this assumption it is straightforward to show the advantages of line searches over fixed step sizes and that, in some settings, gradient descent with line search has a better iteration complexity than accelerated methods with fixed step sizes. We further show that glocal smoothness can lead to improved complexities for the Polyak and AdGD step sizes, as well other algorithms including coordinate optimization, stochastic gradient methods, accelerated gradient methods, and non-linear conjugate gradient methods.
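The core contrast the abstract draws, adapting the step to a local Lipschitz constant instead of a fixed global one, can be sketched with a standard backtracking line search (a minimal illustration in our own notation, not the paper's algorithm; constants and the stopping rule are illustrative):

```python
import numpy as np

def gd_backtracking(f, grad, x, L0=1.0, tol=1e-8, max_iter=500):
    """Gradient descent with a backtracking line search on the local
    smoothness estimate L: regions with a small local Lipschitz constant
    get step sizes larger than the fixed 1/L_global a worst-case
    analysis would prescribe."""
    L = L0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        L = max(L / 2.0, 1e-12)  # optimistically try a larger step first
        # Backtrack until the sufficient-decrease (descent lemma) test holds.
        while f(x - g / L) > f(x) - np.dot(g, g) / (2 * L):
            L *= 2.0
        x = x - g / L
    return x

# Toy quadratic with condition number 10; the iterates converge to 0.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_star = gd_backtracking(f, grad, np.array([3.0, -2.0]))
print(np.allclose(x_star, 0.0, atol=1e-3))  # True
```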

URL: https://openreview.net/forum?id=be9PdukwEL

---

Title: Affine Invariance in Continuous-Domain Convolutional Neural Networks

Abstract: The notion of group invariance helps neural networks in recognizing patterns and features under geometric transformations. Group convolutional neural networks enhance traditional convolutional neural networks by incorporating group-based geometric structures into their design. This research studies affine invariance on continuous-domain convolutional neural networks. Despite other research considering isometric invariance or similarity invariance, we focus on the full structure of affine transforms generated by the group of all invertible $2 \times 2$ real matrices (generalized linear group $\mathrm{GL}_2(\mathbb{R})$). We introduce a new criterion to assess the invariance of two signals under affine transformations. The input image is embedded into the affine Lie group $G_2 = \mathbb{R}^2 \ltimes \mathrm{GL}_2(\mathbb{R})$ to facilitate group convolution operations that respect affine invariance. Then, we analyze the convolution of embedded signals over $G_2$. In sum, our research could eventually extend the scope of geometrical transformations that usual deep-learning pipelines can handle.

URL: https://openreview.net/forum?id=d4ZNyIAtXt

---

Title: PersonaFeedback: A Large-scale Human-annotated Benchmark For Personalization

Abstract: With the rapid improvement in the general capabilities of Large Language Models (LLMs), LLM personalization, i.e., how to build LLM systems that can generate personalized responses or services that are tailored to distinct user personas, has become an increasingly important research and engineering problem. However, unlike many new challenging benchmarks being released for evaluating the general/reasoning capabilities, the lack of high-quality benchmarks for evaluating LLM personalization greatly hinders progress in this field. To address this, we introduce PersonaFeedback, a new benchmark that directly evaluates LLMs' ability to provide personalized responses given pre-defined user personas and queries. Unlike existing benchmarks that require models to infer implicit user personas from historical interactions, PersonaFeedback decouples persona inference from personalization, focusing on evaluating the model's ability to generate responses tailored to explicit personas.
PersonaFeedback consists of 8298 human-annotated test cases, which are categorized into easy, medium, and hard tiers based on the contextual complexity of the user personas and the difficulty in distinguishing subtle differences between two personalized responses. We conduct comprehensive evaluations across a wide range of models. The empirical results reveal that even state-of-the-art LLMs that can solve complex real-world reasoning tasks could fall short on the hard tier of PersonaFeedback where even human evaluators may find the distinctions challenging. Furthermore, we conduct an in-depth analysis of failure modes across various types of systems, demonstrating that the current retrieval-augmented framework should not be seen as a de facto solution for personalization tasks. All benchmark data, annotation protocols, and the evaluation pipeline will be publicly available to facilitate future research on LLM personalization.

URL: https://openreview.net/forum?id=Q5HRUJuy9g

---

Title: Rectified Flows for Fast Multiscale Fluid Flow Modeling

Abstract: Statistical surrogate modeling of fluid flows is challenging due to multiscale dynamics and strong sensitivity to initial conditions. Conditional diffusion models can achieve high fidelity, but typically require hundreds of stochastic steps at inference.
We introduce a rectified-flow surrogate that learns a time-dependent conditional velocity field transporting input-to-output laws along nearly straight trajectories. Sampling reduces to solving a deterministic ODE along this learned transport, so each function evaluation is substantially more effective: on multi-scale 2D benchmarks we match diffusion-class posterior statistics with as few as $8$ ODE steps versus $\ge\!128$ steps for score-based diffusion.

On the theory side, we develop a law-level analysis tailored to conditional PDE forecasts.
First, we formalize the link between our evaluation criterion—one-point Wasserstein distances on fields—and the $k\!=\!1$ correlation-marginal viewpoint in statistical solutions.
Second, we provide a one-step error decomposition for the learned pushforward law into a \emph{coverage} (high-frequency tail) term controlled by structure functions (equivalently, by spectral decay), and a \emph{fit} term controlled directly by the training objective.
Third, we show how \emph{straightness} in rectification time governs local truncation error for ODE sampling, yielding step-count requirements and explaining why rectified transports admit large, stable steps.

Guided by this picture, we propose a curvature-aware sampler that tracks an EMA-based straightness proxy and adaptively blends and steps the velocity during inference.
Across multiscale incompressible and compressible 2D flows, our method matches diffusion models in Wasserstein statistics and energy spectra, preserves fine-scale structure missed by MSE baselines, and delivers high-resolution conditional samples at a fraction of the inference cost.
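The sampling step the abstract describes, deterministic ODE integration along a nearly straight learned transport, reduces to a few Euler steps; a minimal sketch with a toy constant velocity field (our illustration, not the paper's learned model):

```python
import numpy as np

def sample_rectified_flow(x0, velocity, n_steps=8):
    """Deterministic Euler integration of dx/dt = v(x, t) from t=0 to t=1.
    Near-straight trajectories are what allow rectified flows to take
    few, large steps compared with score-based diffusion sampling."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

# Toy case: for transport between fixed points x0 and x1 the rectified
# velocity is the constant v = x1 - x0, so Euler integration is exact
# at any step count. Values are illustrative, not from the paper.
x0 = np.zeros(4)
x1 = np.array([1.0, -2.0, 0.5, 3.0])
v = lambda x, t: x1 - x0
print(np.allclose(sample_rectified_flow(x0, v, n_steps=8), x1))  # True
```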

URL: https://openreview.net/forum?id=2tMD6YXgkp

---

Title: FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases processing cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50\%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.

URL: https://openreview.net/forum?id=mINaJKSy7A

---

Title: VoiceAgentBench: Are Voice Assistants Ready For Agentic Tasks?

Abstract: Large scale Speech Language Models have enabled voice assistants capable of understanding natural spoken queries and performing complex tasks. However, existing speech benchmarks largely focus on isolated capabilities such as transcription or question answering and do not systematically evaluate agentic behavior or adversarial robustness. To address this, we introduce VoiceAgentBench, a comprehensive benchmark for evaluating SpeechLMs in realistic spoken agentic settings, comprising 6,000+ synthetic spoken queries spanning single-tool invocations, multi-tool workflows, multi-turn dialogue, and safety evaluations across English and six Indic languages. To ensure speaker diversity, we further simulate speaker variability using a novel sampling strategy that selects audios for TTS voice conversion based on speaker embeddings to maximize acoustic diversity. Our evaluation measures tool selection accuracy, structural consistency, and the correctness of tool invocations, including adversarial robustness. Across agentic tasks, ASR-LLM pipelines outperform end-to-end SpeechLMs, achieving up to 60.6\% average parameter-filling accuracy on English, while SpeechLMs exhibit lower performance and sharper degradation on Indic languages. All models struggle in sequential workflows and safety evaluations, highlighting persistent limitations in tool orchestration, multilingual generalization, and safety robustness.

URL: https://openreview.net/forum?id=mi9q49AR3d

---

Title: SynQuE: Estimating Synthetic Dataset Quality Without Annotations

Abstract: We introduce and formalize the Synthetic Dataset Quality Estimation (SynQuE) problem: ranking synthetic datasets by their expected real-world task performance using only limited unannotated real data. This addresses a critical and open challenge where data is scarce due to collection costs or privacy constraints. We establish the first comprehensive benchmarks for this problem by introducing and evaluating proxy metrics that choose synthetic data for training to maximize task performance on real data. We introduce the first proxy metrics for SynQuE by adapting distribution and diversity-based distance measures to our context via embedding models. To address the shortcomings of these metrics on complex planning tasks, we propose Lens, a novel proxy that leverages large language model reasoning. Our results show that SynQuE proxies correlate with real task performance across diverse tasks, including sentiment analysis, Text2SQL, web navigation, and image classification, with Lens consistently outperforming others on complex tasks by capturing nuanced characteristics. For instance, on text-to-SQL parsing, training on the top-3 synthetic datasets selected via SynQuE proxies can raise accuracy from 30.4% to 38.4% (+8.1 points) on average compared to selecting data indiscriminately. This work establishes SynQuE as a practical framework for synthetic data selection under real-data scarcity and motivates future research on foundation model-based data characterization and fine-grained data selection.
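One plausible instance of the distance-based proxies the abstract mentions is a distribution distance between real and synthetic embeddings; a minimal sketch using squared MMD with an RBF kernel (our illustration, not the paper's exact metric; names and data are toy):

```python
import numpy as np

def mmd_rbf(x, y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel between two
    embedding sets. As a SynQuE-style proxy: rank a synthetic dataset
    higher when its embeddings have lower MMD to the real embeddings."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (200, 8))     # unannotated real embeddings (toy)
synth_a = rng.normal(0.0, 1.0, (200, 8))  # synthetic set matching the real law
synth_b = rng.normal(3.0, 1.0, (200, 8))  # synthetic set far from the real law
print(mmd_rbf(real, synth_a) < mmd_rbf(real, synth_b))  # True: rank A above B
```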

URL: https://openreview.net/forum?id=W4Pwb4SX3P

---

Title: Disjoint Generation of Synthetic Data

Abstract: We propose a new framework for generating tabular synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The advantages achieved by the disjoint generation include: i) An observed increase in the empirical measurement of privacy. ii) Increased computational feasibility of certain model types. iii) Ability to generate synthetic data using a mixture of different generative models. Specifically, mixed-model synthesis bridges the gap between privacy and utility performance, providing state-of-the-art performance on Accuracy and Area Under the Curve for downstream tasks while significantly lowering the empirical re-identification risk.

URL: https://openreview.net/forum?id=LSzXkAWBKI

---

Title: Embryology of a Language Model

Abstract: Understanding how language models develop their internal computational structure is a central problem in the science of deep learning. We study this development through an embryological lens, applying UMAP to susceptibility vectors to visualize structural organization over training. We observe the emergence of a striking ``body plan''---the rainbow serpent---with an anterior-posterior axis defined by global expression versus suppression, dorsal-ventral stratification corresponding to the induction circuit, and a novel ``spacing fin'' structure. This body plan is reproducible across random seeds, suggesting that high-level functional organization is determined by architecture and data rather than initialization. Our work demonstrates that the relationship between data and internal structure is legible and developmental, with implications for both understanding and guiding model development.

URL: https://openreview.net/forum?id=1sgL0GrY4l

---

Title: Self-Improvement as Coherence Optimization: A Theoretical Account

Abstract: Can language models improve their accuracy without external supervision? Methods such as debate, bootstrap, and internal coherence maximization achieve this surprising feat, even matching golden finetuning performance. Yet why they work remains theoretically unclear. We show that they are all special cases of coherence optimization, i.e., finding a context-to-behavior mapping that's most compressible and jointly predictable. We prove that coherence optimization is equivalent to description-length regularization, and that among all such regularization schemes, it is optimal for semi-supervised learning when the regularizer is derived from a pretrained model. Our theory, supported by preliminary experiments, explains why feedback-free self-improvement works and predicts when it should succeed or fail.

URL: https://openreview.net/forum?id=nR47qAX9oL

---
