Reproducibility Certification: Reproducibility Study of ’SLICE: Stabilized LIME for Consistent Explanations for Image Classification’
Aritra Bandyopadhyay, Chiranjeev Bindra, Roan van Blanken, Arijit Ghosh
https://openreview.net/forum?id=vKUPXuEzj8
---
Reproducibility Certification: NeoBERT: A Next Generation BERT
Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
https://openreview.net/forum?id=TJRyDi7mwH
---
Accepted papers
===============
Title: [Re] Improving Interpretation Faithfulness for Vision Transformers
Authors: Izabela Kurek, Wojciech Trejter, Stipe Frkovic, Andro Erdelez
Abstract: This work aims to reproduce the results of Faithful Vision Transformers (FViTs) proposed by Hu et al. (2024) alongside interpretability methods for Vision Transformers from Chefer et al. (2021) and Xu et al. (2022). We investigate claims made by Hu et al. (2024), namely that the usage of Diffusion Denoised Smoothing (DDS) improves interpretability robustness to (1) attacks in a segmentation task and (2) perturbation and attacks in a classification task. We also extend the original study by investigating the authors’ claims that adding DDS to any interpretability method can improve its robustness under attack. This is tested on baseline methods and the recently proposed Attribution Rollout method. In addition, we measure the computational costs and environmental impact of obtaining an FViT through DDS. Our results broadly agree with the original study’s findings, although minor discrepancies were found and discussed.
URL: https://openreview.net/forum?id=Z0DhgU8fBt
---
Title: Enhancing Sample Generation of Diffusion Models using Noise Level Correction
Authors: Abulikemu Abuduweili, Chenyang Yuan, Changliu Liu, Frank Permenter
Abstract: The denoising process of diffusion models can be interpreted as an approximate projection of noisy samples onto the data manifold. Moreover, the noise level in these samples approximates their distance to the underlying manifold. Building on this insight, we propose a novel method to enhance sample generation by aligning the estimated noise level with the true distance of noisy samples to the manifold. Specifically, we introduce a noise level correction network, leveraging a pre-trained denoising network, to refine noise level estimates during the denoising process. Additionally, we extend this approach to various image restoration tasks by integrating task-specific constraints, including inpainting, deblurring, super-resolution, colorization, and compressed sensing. Experimental results demonstrate that our method significantly improves sample quality in both unconstrained and constrained generation scenarios. Notably, the proposed noise level correction framework is compatible with existing denoising schedulers (e.g., DDIM), offering additional performance improvements.
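As a rough illustration of the idea in this abstract, the sketch below shows how a small correction network could adjust the scheduled noise level inside a DDIM-like sampling step. The `NoiseLevelCorrector` architecture, its multiplicative parameterization, and the update rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a noise-level-corrected sampling step (not the paper's code).
import torch
import torch.nn as nn

class NoiseLevelCorrector(nn.Module):
    """Tiny network predicting a multiplicative correction to the scheduled
    noise level from pooled features of the current noisy sample."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(channels, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x_t: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
        feats = x_t.mean(dim=(2, 3))                      # global-average-pooled features
        scale = 1.0 + 0.1 * torch.tanh(self.head(feats)).squeeze(-1)
        return sigma_t * scale                            # corrected noise level

@torch.no_grad()
def corrected_ddim_step(x_t, sigma_t, sigma_next, denoiser, corrector):
    """One DDIM-like update using the corrected noise level in place of the
    scheduled one (a sketch; real scheduler parameterizations differ)."""
    sigma_hat = corrector(x_t, sigma_t)
    eps = denoiser(x_t, sigma_hat)                        # predicted noise at corrected level
    x0_hat = x_t - sigma_hat.view(-1, 1, 1, 1) * eps
    return x0_hat + sigma_next.view(-1, 1, 1, 1) * eps

# Toy usage with a stand-in for a pre-trained denoiser.
denoiser = lambda x, s: torch.randn_like(x)
corrector = NoiseLevelCorrector()
x = torch.randn(2, 3, 32, 32)
sigma_t, sigma_next = torch.full((2,), 0.8), torch.full((2,), 0.6)
print(corrected_ddim_step(x, sigma_t, sigma_next, denoiser, corrector).shape)
```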
URL: https://openreview.net/forum?id=y8VXikiIU0
---
Title: Rational Tuning of LLM Cascades via Probabilistic Modeling
Authors: Michael J. Zellinger, Matt Thomson
Abstract: Understanding the reliability of large language models (LLMs) has recently garnered significant attention. Given LLMs' propensity to hallucinate, as well as their high sensitivity to prompt design, it is already challenging to predict the performance of an individual LLM. However, the problem becomes more complex for compound LLM systems such as cascades, where in addition to each model's standalone performance, we must understand how the error rates of different models interact. In this paper, we present a probabilistic model for the joint performance distribution of a sequence of LLMs, which enables a framework for rationally tuning the confidence thresholds of a LLM cascade using continuous optimization. Compared to selecting confidence thresholds using Bayesian optimization, our parametric Markov-copula model yields more favorable error-cost trade-offs, improving the area under the error-cost curve by 4.3% on average for cascades with $k\geq 3$ models. In the low-sample regime with $n \leq 30$ training examples, the performance improvement widens to 10.2%, suggesting that our framework's inductive assumptions about the interactions between the error rates of different LLMs enhance sample efficiency. Overall, our Markov-copula model provides a rational basis for tuning LLM cascade performance and points to the potential of probabilistic methods in analyzing systems of LLMs.
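For intuition about what tuning the confidence thresholds of an LLM cascade means operationally, here is a minimal sketch on simulated data: a small model answers when its confidence clears a threshold and otherwise defers to a large model, and the threshold is chosen by continuous optimization of an error-plus-cost objective. The simulated confidences, costs, and the bounded scalar search are assumptions; the paper's Markov-copula model is not reproduced here.

```python
# Illustrative threshold tuning for a two-model cascade (not the paper's copula model).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n = 1000
conf_small = rng.beta(5, 2, n)                       # small model's confidence per query
correct_small = rng.random(n) < conf_small           # assumed calibrated for the toy example
correct_large = rng.random(n) < 0.9                  # large model accuracy
cost_small, cost_large = 1.0, 10.0                   # relative inference costs

def objective(t, lam=0.02):
    """Error rate plus lam * average cost when deferring below threshold t."""
    defer = conf_small < t
    correct = np.where(defer, correct_large, correct_small)
    cost = cost_small + defer * cost_large           # the small model always runs first
    return (1 - correct.mean()) + lam * cost.mean()

res = minimize_scalar(objective, bounds=(0.0, 1.0), method="bounded")
print(f"tuned threshold: {res.x:.3f}, objective: {res.fun:.4f}")
```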
URL: https://openreview.net/forum?id=YCBVcGSZeR
---
Title: Proximal Policy Distillation
Authors: Giacomo Spigler
Abstract: We introduce Proximal Policy Distillation (PPD), a novel policy distillation method that integrates student-driven distillation and Proximal Policy Optimization (PPO) to increase sample efficiency and to leverage the additional rewards that the student policy collects during distillation. To assess the efficacy of our method, we compare PPD with two common alternatives, student-distill and teacher-distill, over a wide range of reinforcement learning environments that include discrete actions and continuous control (ATARI, Mujoco, and Procgen). For each environment and method, we perform distillation to a set of target student neural networks that are smaller, identical (self-distillation), or larger than the teacher network. Our findings indicate that PPD improves sample efficiency and produces better student policies compared to typical policy distillation approaches. Moreover, PPD demonstrates greater robustness than alternative methods when distilling policies from imperfect demonstrations. The code for the paper is released as part of a new Python library built on top of stable-baselines3 to facilitate policy distillation: <Anonymized GitHub Repository> .
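A plausible reading of a PPD-style update, combining the PPO clipped surrogate on student-collected rollouts with a distillation term toward the teacher's action distribution, is sketched below; the exact weighting and term placement in the paper may differ.

```python
# Hypothetical PPD-style loss: PPO clipped surrogate + KL distillation toward the teacher.
import torch
import torch.nn.functional as F

def ppd_loss(student_logits, old_logits, teacher_logits, actions, advantages,
             clip_eps: float = 0.2, distill_coef: float = 1.0):
    logp_new = F.log_softmax(student_logits, dim=-1).gather(1, actions[:, None]).squeeze(1)
    logp_old = F.log_softmax(old_logits, dim=-1).gather(1, actions[:, None]).squeeze(1)
    ratio = torch.exp(logp_new - logp_old)
    # Standard PPO clipped surrogate on rewards collected by the student.
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    # Distillation: KL(teacher || student) over the action distribution.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    return -surrogate.mean() + distill_coef * kl

# Toy usage on fabricated rollout data.
B, A = 8, 4
logits = torch.randn(B, A, requires_grad=True)
loss = ppd_loss(logits, logits.detach(), torch.randn(B, A),
                torch.randint(0, A, (B,)), torch.randn(B))
loss.backward()
```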
URL: https://openreview.net/forum?id=WfVXe88oMh
---
Title: Metamorphic Forward Adaptation Network: Dynamically Adaptive and Modular Multi-layer Learning
Authors: Yu Sun, Vijja Wichitwechkarn, Ronald Clark, Mirko Kovac, Basaran Bahadir Kocer
Abstract: Back-propagation is a widely used algorithm for training neural networks by adjusting weights based on error gradients. However, back-propagation is biologically implausible with global derivative computation and lacks robustness in long-term dynamic learning. A previously proposed alternative to back-propagation is the Forward-Forward algorithm, which bypasses global gradient dependency and localises computations, making it a more biologically plausible approach. However, Forward-Forward has been evaluated in limited environments, does not yet match back-propagation's performance, and only supports classification, not regression. This research introduces the Metamorphic Forward Adaptation Network (MFAN), using a contrastive learning property as its core, and retaining the layer-wise architecture of the Forward-Forward algorithm. Compared to the Forward-Forward model being limited to discrete classification, MFAN can process discrete and continuous data, showing stability, adaptability, and the ability to handle evolving data. MFAN performs well in continuous data stream scenarios, demonstrating superior adaptability and robustness compared to back-propagation, particularly in tasks requiring dynamic, long-term learning.
URL: https://openreview.net/forum?id=6RCs2tLsHq
---
Title: Lie Symmetry Net: Preserving Conservation Laws in Modelling Financial Market Dynamics via Differential Equations
Authors: Xuelian Jiang, Tongtian Zhu, Yingxiang Xu, Can Wang, Yeyu Zhang, Fengxiang He
Abstract: This paper employs a novel Lie symmetries-based framework to model the intrinsic symmetries within financial markets. Specifically, we introduce Lie symmetry net (LSN), which characterises the Lie symmetries of the differential equations (DE) estimating financial market dynamics, such as the Black-Scholes equation. To simulate these differential equations in a symmetry-aware manner, LSN incorporates a Lie symmetry risk derived from the conservation laws associated with the Lie symmetry operators of the target differential equations. This risk measures how well the Lie symmetries are realised and guides the training of LSN under the structural risk minimisation framework. Extensive numerical experiments demonstrate that LSN effectively realises the Lie symmetries and achieves an error reduction of more than one order of magnitude compared to state-of-the-art methods. The code is available at https://github.com/Jxl163/LSN_code.
URL: https://openreview.net/forum?id=rkfop9GyxB
---
Title: A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games
Authors: Shubhankar Agarwal, Hamzah I Khan, Sandeep P. Chinchali, David Fridovich-Keil
Abstract: Saddle point optimization is a critical problem employed in numerous real-world applications, including portfolio optimization, generative adversarial networks, and robotics. It has been extensively studied in cases where the objective function is known and differentiable. Existing work in black-box settings with unknown objectives that can only be sampled either assumes convexity-concavity in the objective to simplify the problem or operates with noisy gradient estimators. In contrast, we introduce a framework inspired by Bayesian optimization which utilizes Gaussian processes to model the unknown (potentially nonconvex-nonconcave) objective and requires only zeroth-order samples. Our approach frames the saddle point optimization problem as a two-level process which can flexibly leverage existing general-sum Nash game solvers to solve for saddle points of zero-sum games. The upper level of our framework produces a model of the objective function by sampling in promising locations, and the lower level of our framework uses the existing model to frame and solve a general-sum game to identify locations to sample. This lower level procedure can be designed in complementary ways, and we demonstrate the flexibility of our approach by introducing variants which appropriately trade off between factors like runtime, the cost of function evaluations, and the number of available initial samples. We experimentally demonstrate these algorithms on synthetic and realistic datasets in black-box nonconvex-nonconcave settings, showcasing their ability to efficiently locate local saddle points in these contexts.
URL: https://openreview.net/forum?id=NbRybPuWCv
---
Title: Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference
Authors: Xiaoyu Jiang, Sokratia Georgaka, Magnus Rattray, Mauricio A Álvarez
Abstract: The Multi-Output Gaussian Process (MOGP) is a popular tool for modelling data from multiple sources. A typical choice to build a covariance function for a MOGP is the Linear Model of Coregionalisation (LMC) which parametrically models the covariance between outputs. The Latent Variable MOGP (LV-MOGP) generalises this idea by modelling the covariance between outputs using a kernel applied to latent variables, one per output, leading to a flexible MOGP model that allows efficient generalisation to new outputs with few data points. The computational complexity in LV-MOGP grows linearly with the number of outputs, which makes it unsuitable for problems with a large number of outputs. In this paper, we propose a stochastic variational inference approach for the LV-MOGP that allows mini-batches for both inputs and outputs, making computational complexity per training iteration independent of the number of outputs. We demonstrate the performance of the model by benchmarking against some other MOGP models in several real-world datasets, including spatial-temporal climate modelling and spatial transcriptomics.
URL: https://openreview.net/forum?id=kK0WrBZAli
---
Title: CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Authors: Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan A. Rossi, Yixuan Li, Saayan Mitra
Abstract: Large Language Models (LLMs) have revolutionized code generation but require significant resources and tend to over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs is a cost-effective alternative, yet standard supervised approaches rely solely on correct examples, overlooking valuable insights from failures. We introduce CodeLutra, a new framework that leverages both correct and incorrect code attempts. Instead of purely instructing with correct solutions, CodeLutra uses iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This process narrows the performance gap with state-of-the-art, larger models, without requiring massive datasets or auxiliary models. For example, on a challenging data science coding task, using only 500 samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s level. By capitalizing on both successes and mistakes, CodeLutra offers a scalable, efficient path to high-quality code generation, making smaller open-source models more competitive with leading closed-source alternatives.
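The abstract describes preference-based refinement over successful and failed code attempts. One standard way to turn such pairs into a training signal is a DPO-style loss; whether CodeLutra uses exactly this form is an assumption of the sketch below.

```python
# Sketch of a DPO-style preference loss over (passing, failing) code attempts;
# whether CodeLutra uses this exact objective is an assumption here.
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                    beta: float = 0.1):
    """logp_* are summed token log-probabilities of a whole code attempt under the
    policy being tuned; ref_logp_* are the same under a frozen reference model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with fabricated log-probabilities for two preference pairs.
lp_c, lp_r = torch.tensor([-12.0, -8.0]), torch.tensor([-15.0, -9.5])
print(preference_loss(lp_c, lp_r, lp_c.clone(), lp_r.clone()).item())
```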
URL: https://openreview.net/forum?id=IGsEgWM4to
---
Title: Disappearance of Timestep Embedding: A Case Study on Neural ODE and Diffusion Models
Authors: Bum Jun Kim, Yoshinobu Kawahara, Sang Woo Kim
Abstract: Dynamical systems are often time-varying, and modeling them requires a function that evolves with respect to time. Recent studies such as the neural ordinary differential equation proposed a time-dependent neural network, which provides a neural network varying with respect to time. However, we claim that the architectural choice to build a time-dependent neural network significantly affects its time-awareness but still lacks sufficient validation in its current state. In this study, we conduct an in-depth analysis of the architecture of neural ordinary differential equations. Here, we report a vulnerability of vanishing timestep embedding, which disables the time-awareness of a time-dependent neural network. Specifically, we find that the ConcatConv operation, which is widely used in neural ordinary differential equations, causes an additive effect of timestep embedding, which is readily canceled out by the subsequent batch normalization. This vanishing timestep embedding also arises for group normalization and is analyzed thoroughly with respect to the number of channels, groups, and relative variance. Furthermore, we find that this vulnerability can also be observed in diffusion models because they employ a similar architecture that incorporates timestep embedding to discriminate between different timesteps during a diffusion process. Our analysis provides a detailed description of this phenomenon as well as several solutions to address the root cause. Through experiments on neural ordinary differential equations and diffusion models, we observed that ensuring alive time-awareness via the proposed solutions boosted their performance, such as classification accuracy, FID, and inception score, which implies that their current implementations lack sufficient time-dependency.
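The cancellation mechanism described here is easy to reproduce numerically. In the sketch below (our own minimal construction, not the authors' code), a ConcatConv-style layer makes the timestep's contribution a per-channel constant, which a following BatchNorm in training mode removes when all samples in the batch share the same timestep.

```python
# Numerical illustration: the timestep enters ConcatConv as a spatially constant
# extra channel, so its effect on the conv output is a per-channel additive offset;
# a subsequent BatchNorm (training mode, all samples sharing the same t) subtracts
# the per-channel mean and removes that offset. padding=0 keeps the offset exactly
# spatially constant; with zero padding the cancellation is exact only away from borders.
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(4, 8, kernel_size=3, padding=0)   # 3 data channels + 1 timestep channel
bn = nn.BatchNorm2d(8)
bn.train()

x = torch.randn(16, 3, 32, 32)

def concat_conv_bn(x, t):
    t_channel = torch.full((x.shape[0], 1, *x.shape[2:]), t)
    return bn(conv(torch.cat([x, t_channel], dim=1)))

out_a = concat_conv_bn(x, t=0.1)
out_b = concat_conv_bn(x, t=0.9)
print(torch.allclose(out_a, out_b, atol=1e-4))      # True: the timestep information is gone
```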
URL: https://openreview.net/forum?id=bpaLYaf6Dp
---
Title: Sparser, Better, Faster, Stronger: Sparsity Detection for Efficient Automatic Differentiation
Authors: Adrian Hill, Guillaume Dalle
Abstract: From implicit differentiation to probabilistic modeling, Jacobian and Hessian matrices have many potential use cases in Machine Learning (ML), but they are viewed as computationally prohibitive. Fortunately, these matrices often exhibit sparsity, which can be leveraged to speed up the process of Automatic Differentiation (AD).
This paper presents advances in sparsity detection, previously the performance bottleneck of Automatic Sparse Differentiation (ASD). Our implementation of sparsity detection is based on operator overloading, able to detect both local and global sparsity patterns, and supports flexible index set representations. It is fully automatic and requires no modification of user code, making it compatible with existing ML codebases.
Most importantly, it is highly performant, unlocking Jacobians and Hessians at scales where they were considered too expensive to compute. On real-world problems from scientific ML, graph neural networks and optimization, we show significant speed-ups of up to three orders of magnitude. Notably, using our sparsity detection system, ASD outperforms standard AD for one-off computations, without amortization of either sparsity detection or matrix coloring.
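To make the operator-overloading idea concrete, here is a toy tracer (ours, far simpler than the paper's implementation) that propagates index sets through arithmetic to recover a global Jacobian sparsity pattern.

```python
# Minimal illustration of global Jacobian sparsity detection via operator
# overloading: each tracer carries the set of input indices its value depends on,
# and arithmetic operations merge those sets.
class Tracer:
    def __init__(self, deps):
        self.deps = frozenset(deps)
    def _merge(self, other):
        other_deps = other.deps if isinstance(other, Tracer) else frozenset()
        return Tracer(self.deps | other_deps)
    __add__ = __radd__ = __sub__ = __rsub__ = _merge
    __mul__ = __rmul__ = __truediv__ = __rtruediv__ = _merge

def jacobian_sparsity(f, n):
    """Returns a boolean pattern[i][j] meaning 'output i may depend on input j'."""
    outputs = f([Tracer({j}) for j in range(n)])
    return [[j in out.deps for j in range(n)] for out in outputs]

# Example: a banded residual function on 5 inputs.
def f(x):
    return [x[0] * x[1], x[1] + x[2], x[2] * x[3] + 1.0, x[3] - x[4]]

for row in jacobian_sparsity(f, 5):
    print(["X" if v else "." for v in row])
```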
URL: https://openreview.net/forum?id=GtXSN52nIW
---
Title: Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations
Authors: Mohammed Baharoon, Jonathan Klein, Dominik Michels
Abstract: Vision-language contrastive learning frameworks such as CLIP enable learning representations from natural language supervision and provide strong zero-shot classification capabilities. However, due to the nature of the supervisory signal in these paradigms, they lack the ability to learn localized features, leading to degraded performance on dense prediction tasks such as segmentation and detection. On the other hand, self-supervised learning methods have shown the ability to learn granular representations, complementing the high-level features in vision-language training. In this work, we present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features that can be generalized across different downstream vision tasks. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path and addressing the one-to-one correspondence issue using soft CLIP targets generated by an EMA model. Moreover, Harmony optimizes for five different objectives simultaneously, efficiently utilizing the supervision in each data example, making it even more suited in data-constrained settings. We comprehensively evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP and outperforms the previously leading joint self- and weakly supervised methods, SLIP, MaskCLIP, and DetailCLIP. Specifically, when compared against these methods, Harmony shows superior performance in linear-probing, fine-tuning, and zero-shot classification on ImageNet-1k, semantic segmentation on ADE20K, and both object detection and instance segmentation on MS-COCO, when pre-training a ViT-B on CC3M. We also show that Harmony outperforms SILC on detection, linear and fine-tuning classification, and outperforms other self-supervised learning methods like iBOT and MAE across all tasks evaluated. Our code is publicly available at https://github.com/MohammedSB/Harmony.
URL: https://openreview.net/forum?id=IcOBCufqFO
---
Title: Full-Rank Unsupervised Node Embeddings for Directed Graphs via Message Aggregation
Authors: Ciwan Ceylan, Kambiz Ghoorchian, Danica Kragic
Abstract: Linear message-passing models have emerged as compelling alternatives to non-linear graph neural networks for unsupervised node embedding learning, due to their scalability and competitive performance on downstream tasks. However, we identify a fundamental flaw in recently proposed linear models that combine embedding aggregation with concatenation during each message-passing iteration: rank deficiency. A rank-deficient embedding matrix contains column vectors which take arbitrary values, leading to ill-conditioning that degrades downstream task accuracy, particularly in unsupervised tasks such as graph alignment. We deduce that repeated embedding aggregation and concatenation introduces linearly dependent features, causing rank deficiency. To address this, we propose ACC (Aggregate, Compress, Concatenate), a novel model that avoids redundant feature computation by applying aggregation to the messages from the previous iteration, rather than the embeddings. Consequently, ACC generates full-rank embeddings, significantly improving graph alignment accuracy from 10% to 60% compared to rank-deficient embeddings, while also being faster to compute. Additionally, ACC employs directed message-passing and achieves node classification accuracies comparable to state-of-the-art self-supervised graph neural networks on directed graph benchmarks, while also being over 70 times faster on graphs with over 1 million edges.
URL: https://openreview.net/forum?id=3ECbEZg2If
---
Title: Prior Learning in Introspective VAEs
Authors: Ioannis Athanasiadis, Fredrik Lindsten, Michael Felsberg
Abstract: Variational Autoencoders (VAEs) are a popular framework for unsupervised learning and data generation. A plethora of methods have been proposed focusing on improving VAEs, with the incorporation of adversarial objectives and the integration of prior learning mechanisms being prominent directions. When it comes to the former, an indicative instance is the recently introduced family of Introspective VAEs aiming at ensuring that a low likelihood is assigned to unrealistic samples. In this study, we focus on the Soft-IntroVAE (S-IntroVAE), one of only two members of the Introspective VAE family, the other being the original IntroVAE. We select S-IntroVAE for its state-of-the-art status and its training stability. In particular, we investigate the implication of incorporating a multimodal and trainable prior into this S-IntroVAE. Namely, we formulate the prior as a third player and show that when trained in cooperation with the decoder constitutes an effective way for prior learning, which shares the Nash Equilibrium with the vanilla S-IntroVAE. Furthermore, based on a modified formulation of the optimal ELBO in S-IntroVAE, we develop theoretically motivated regularizations, namely (i) adaptive variance clipping to stabilize training when learning the prior and (ii) responsibility regularization to discourage the formation of inactive prior modes. Finally, we perform a series of targeted experiments on a 2D density estimation benchmark and in an image generation setting comprised of the (F)-MNIST and CIFAR-10 datasets demonstrating the effect of prior learning in S-IntroVAE in generation and representation learning.
URL: https://openreview.net/forum?id=u4YDVFodYX
---
Title: Learning Using a Single Forward Pass
Authors: Aditya Somasundaram, Pushkal Mishra, Ayon Borthakur
Abstract: We propose a learning algorithm to overcome the limitations of traditional backpropagation in resource-constrained environments: the Solo Pass Embedded Learning Algorithm (SPELA). SPELA operates with local loss functions to update weights, significantly saving on the resources allocated to propagating gradients and storing computational graphs while remaining sufficiently accurate. Consequently, SPELA can closely match backpropagation while using less memory. Moreover, SPELA can effectively fine-tune pre-trained image recognition models for new tasks. Further, SPELA is extended with significant modifications to train CNNs, which we evaluate on the CIFAR-10, CIFAR-100, and SVHN datasets, showing performance equivalent to backpropagation. Our results indicate that SPELA, with its features such as local learning and early exit, is a potential candidate for learning in resource-constrained edge AI applications.
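To illustrate what training with purely local losses looks like, and why no global computational graph needs to be stored, here is a generic layer-local training sketch; SPELA's specific embedding scheme, update rule, and early-exit mechanism are not reproduced.

```python
# Generic layer-local training sketch: each block has its own head and loss, and
# activations are detached between blocks, so gradients never propagate globally.
import torch
import torch.nn as nn
import torch.nn.functional as F

blocks = nn.ModuleList([nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
                        nn.Sequential(nn.Linear(256, 128), nn.ReLU())])
heads = nn.ModuleList([nn.Linear(256, 10), nn.Linear(128, 10)])   # local classifiers
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=1e-2)
        for b, h in zip(blocks, heads)]

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
h = x
for block, head, opt in zip(blocks, heads, opts):
    h = block(h)
    loss = F.cross_entropy(head(h), y)       # purely local loss for this block
    opt.zero_grad()
    loss.backward()
    opt.step()
    h = h.detach()                           # stop gradients: no global backprop
print(f"last block's local loss: {loss.item():.3f}")
```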
URL: https://openreview.net/forum?id=EDQ8QOGqjr
---
Title: Reproducibility Study of ’SLICE: Stabilized LIME for Consistent Explanations for Image Classification’
Authors: Aritra Bandyopadhyay, Chiranjeev Bindra, Roan van Blanken, Arijit Ghosh
Abstract: This paper presents a reproducibility study of SLICE: Stabilized LIME for Consistent Explanations for Image Classification by Bora et al. (2024). SLICE enhances LIME by incorporating Sign Entropy-based Feature Elimination (SEFE) to remove unstable superpixels and an adaptive perturbation strategy using Gaussian blur to improve consistency in feature importance rankings. The original work claims that SLICE significantly improves explanation stability and fidelity. Our study systematically verifies these claims through extensive experimentation using the Oxford-IIIT Pets, PASCAL VOC, and MS COCO datasets. Our results confirm that SLICE achieves higher consistency than LIME, supporting its ability to reduce instability. However, our fidelity analysis challenges the claim of superior performance, as LIME often achieves higher Ground Truth Overlap (GTO) scores, indicating stronger alignment with object segmentations. To further investigate fidelity, we introduce an alternative AOPC evaluation to ensure a fair comparison across methods. Additionally, we propose GRID-LIME, a structured grid-based alternative to LIME, which improves stability while maintaining computational efficiency. Our findings highlight trade-offs in post-hoc explainability methods and emphasize the need for fairer fidelity evaluations. Our implementation is publicly available at our GitHub repository.
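A rough sketch of the sign-entropy idea as we understand it from the abstract: superpixels whose importance sign flips across repeated LIME runs have high sign entropy and would be candidates for elimination. The threshold and the exact estimator used by SLICE are assumptions here.

```python
# Sign-entropy sketch: estimate how often each superpixel's LIME importance is
# positive across repeated runs, and flag high-entropy (unstable) superpixels.
import numpy as np

def sign_entropy(importances: np.ndarray) -> np.ndarray:
    """importances: (n_runs, n_superpixels) LIME coefficients across repeats."""
    p_pos = (importances > 0).mean(axis=0)
    p = np.clip(p_pos, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # per-superpixel entropy in bits

rng = np.random.default_rng(0)
stable = rng.normal(0.5, 0.1, size=(20, 3))                # consistently positive superpixels
unstable = rng.normal(0.0, 0.3, size=(20, 2))               # sign flips from run to run
imps = np.concatenate([stable, unstable], axis=1)

H = sign_entropy(imps)
keep = H < 0.5                                              # illustrative cut-off
print(np.round(H, 2), keep)
```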
URL: https://openreview.net/forum?id=vKUPXuEzj8
---
Title: Multi-objective Bayesian optimization for Likelihood-Free inference in sequential sampling models of decision making
Authors: David Chen, Xinwei Li, Eui-Jin Kim, Prateek Bansal, David J Nott
Abstract: Statistical models are often defined by a generative process for simulating synthetic data, but this can lead to intractable likelihoods. Likelihood free inference (LFI) methods enable Bayesian inference to be performed in this case. Extending a popular approach to simulation-efficient LFI for single-source data, we propose Multi-objective Bayesian Optimization for Likelihood Free Inference (MOBOLFI) to perform LFI using multi-source data. MOBOLFI models a multi-dimensional discrepancy between observed and simulated data, using a separate discrepancy for each data source. The use of a multivariate discrepancy allows for approximations to individual data source likelihoods in addition to the joint likelihood, enabling detection of conflicting information and deeper understanding of the importance of different data sources in estimating individual parameters. The adaptive choice of simulation parameters using multi-objective Bayesian optimization ensures simulation efficient approximation of likelihood components for all data sources. We illustrate our approach in sequential sampling models (SSMs), which are widely used in psychology and consumer-behavior modeling. SSMs are often fitted using multi-source data, such as choice and response time. The advantages of our approach are illustrated in comparison with a single discrepancy for an SSM fitted to data assessing preferences of ride-hailing drivers in Singapore to rent electric vehicles.
URL: https://openreview.net/forum?id=hQjwDqfSzj
---
Title: Change Point Detection in the Frequency Domain with Statistical Reliability
Authors: Akifumi Yamada, Tomohiro Shiraishi, Shuichi Nishino, Teruyuki Katsuoka, Kouichi Taji, Ichiro Takeuchi
Abstract: Effective condition monitoring in complex systems requires identifying change points (CPs) in the frequency domain, as the structural changes often arise across multiple frequencies. This paper extends recent advancements in statistically significant CP detection, based on Selective Inference (SI), to the frequency domain. The proposed SI method quantifies the statistical significance of detected CPs in the frequency domain using $p$-values, ensuring that the detected changes reflect genuine structural shifts in the target system. We address two major technical challenges to achieve this. First, we extend the existing SI framework to the frequency domain by appropriately utilizing the properties of discrete Fourier transform (DFT). Second, we develop an SI method that provides valid $p$-values for CPs where changes occur across multiple frequencies. Experimental results demonstrate that the proposed method reliably identifies genuine CPs with strong statistical guarantees, enabling more accurate root-cause analysis in the frequency domain of complex systems.
URL: https://openreview.net/forum?id=FNRdaHz3qN
---
Title: Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework
Authors: Ismail Nejjar, Hao Dong, Olga Fink
Abstract: Open-set Domain Adaptation (OSDA) aims to adapt a model from a labeled source domain to an unlabeled target domain, where novel classes — also referred to as target-private unknown classes — are present. Source-free Open-set Domain Adaptation (SF-OSDA) methods address OSDA without accessing labeled source data, making them particularly relevant under privacy constraints. However, SF-OSDA presents significant challenges due to distribution shifts and the introduction of novel classes. Existing SF-OSDA methods typically rely on thresholding the prediction entropy of a sample to identify it as either a known or unknown class, but fail to explicitly learn discriminative features for the target-private unknown classes. We propose Recall and Refine (RRDA), a novel SF-OSDA framework designed to address these limitations by explicitly learning features for target-private unknown classes. RRDA employs a two-stage process. First, we enhance the model’s capacity to recognize unknown classes by training a target classifier with an additional decision boundary, guided by synthetic samples generated from target domain features. This enables the classifier to effectively separate known and unknown classes. Second, we adapt the entire model to the target domain, addressing both domain shifts and the distinguishability of unknown classes. Any off-the-shelf source-free domain adaptation method (e.g., SHOT, AaD) can be seamlessly integrated into our framework at this stage. Extensive experiments on three benchmark datasets demonstrate that RRDA significantly outperforms existing SF-OSDA and OSDA methods.
URL: https://openreview.net/forum?id=HBZoXjUAqV
---
Title: Mixed-View Panorama Synthesis using Geospatially Guided Diffusion
Authors: Zhexiao Xiong, Xin Xing, Scott Workman, Subash Khanal, Nathan Jacobs
Abstract: We introduce the task of mixed-view panorama synthesis, where the goal is to synthesize a novel panorama given a small set of input panoramas and a satellite image of the area. This contrasts with previous work which only uses input panoramas (same-view synthesis), or an input satellite image (cross-view synthesis). We argue that the mixed-view setting is the most natural to support panorama synthesis for arbitrary locations worldwide. A critical challenge is that the spatial coverage of panoramas is uneven, with few panoramas available in many regions of the world. We introduce an approach that utilizes diffusion-based modeling and an attention-based architecture for extracting information from all available input imagery. Experimental results demonstrate the effectiveness of our proposed method. In particular, our model can handle scenarios when the available panoramas are sparse or far from the location of the panorama we are attempting to synthesize.
URL: https://openreview.net/forum?id=ylUVRikhTL
---
Title: Link Prediction with Relational Hypergraphs
Authors: Xingyue Huang, Miguel Romero Orth, Pablo Barcelo, Michael M. Bronstein, Ismail Ilkan Ceylan
Abstract: Link prediction with knowledge graphs has been thoroughly studied in graph machine learning, leading to a rich landscape of graph neural network architectures with successful applications. Nonetheless, it remains challenging to transfer the success of these architectures to inductive link prediction with relational hypergraphs, where the task is over $k$-ary relations, substantially harder than link prediction on knowledge graphs with binary relations only. In this paper, we propose a framework for link prediction with relational hypergraphs, empowering applications of graph neural networks on fully relational structures. Theoretically, we conduct a thorough analysis of the expressive power of the resulting model architectures via corresponding relational Weisfeiler-Leman algorithms and also via logical expressiveness. Empirically, we validate the power of the proposed model architectures on various relational hypergraph benchmarks. The resulting model architectures substantially outperform every baseline for inductive link prediction and also lead to competitive results for transductive link prediction.
URL: https://openreview.net/forum?id=S6fe4aH6YA
---
Title: Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization
Authors: Daniel Jarne Ornia, Giannis Delimpaltadakis, Jens Kober, Javier Alonso-Mora
Abstract: In Reinforcement Learning (RL), agents have no incentive to exhibit predictable trajectories, and are often pushed (through e.g. policy entropy regularisation) to randomise their actions in favor of exploration. This lack of predictability awareness often makes it challenging for other agents and humans to predict an agent's trajectories, possibly triggering unsafe scenarios (e.g. in human-robot interaction). We propose a novel method to induce predictable trajectories in RL agents, termed Predictability-Aware RL (PARL), employing the agent's trajectory entropy rate to quantify predictability. Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability. We show how the entropy rate can be formally cast as an average reward, how entropy-rate value functions can be estimated from a learned model and incorporate this in policy-gradient algorithms, and demonstrate how this approach produces predictable (near-optimal) policies in tasks inspired by human-robot use-cases.
URL: https://openreview.net/forum?id=DDUsc1lD27
---
Title: Efficient and Accurate Optimal Transport with Mirror Descent and Conjugate Gradients
Authors: Mete Kemertas, Allan Douglas Jepson, Amir-massoud Farahmand
Abstract: We propose Mirror Descent Optimal Transport (MDOT), a novel method for solving discrete optimal transport (OT) problems with high precision, by unifying temperature annealing in entropic-regularized OT (EOT) with mirror descent techniques. In this framework, temperature annealing produces a sequence of EOT dual problems, whose solution gradually gets closer to the solution of the original OT problem. We solve each problem efficiently using a GPU-parallel nonlinear conjugate gradients algorithm (PNCG) that outperforms traditional Sinkhorn iterations under weak regularization. Moreover, our investigation also reveals that the theoretical convergence rate of Sinkhorn iterations can exceed existing non-asymptotic bounds when its stopping criterion is tuned in a manner analogous to MDOT.
Our comprehensive ablation studies of MDOT-PNCG affirm its robustness across a wide range of algorithmic parameters. Benchmarking on 24 problem sets of size $n=4096$ in a GPU environment demonstrates that our method attains high-precision, feasible solutions significantly faster than a representative set of existing OT solvers—including accelerated gradient methods and advanced Sinkhorn variants—in both wall-clock time and number of operations. Empirical convergence rates range between $O(n^2 \varepsilon^{-1/4})$ and $O(n^2 \varepsilon^{-1})$, where $\varepsilon$ is the optimality gap. For problem sizes up to $n=16\,384$, the empirical runtime scales as $\widetilde{O}(n^2)$ for moderate precision and as $\widetilde{O}(n^{5/2})$ at worst for high precision. These findings establish MDOT-PNCG as a compelling alternative to current OT solvers, particularly in challenging weak-regularization regimes.
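To illustrate the outer structure, temperature annealing across a warm-started sequence of entropic OT problems, here is a minimal log-domain Sinkhorn loop with a decreasing epsilon schedule. The paper's mirror-descent schedule and its PNCG inner solver are not reproduced; Sinkhorn is used here only because it is the simplest inner solver.

```python
# Epsilon-annealing sketch: solve a sequence of entropic OT problems, reusing the
# dual potentials (f, g) as a warm start as the temperature decreases.
import numpy as np
from scipy.special import logsumexp

def annealed_sinkhorn(C, a, b, eps0=1.0, eps_min=1e-2, decay=0.5, inner_iters=200):
    f, g = np.zeros_like(a), np.zeros_like(b)
    eps = eps0
    while True:
        for _ in range(inner_iters):                 # log-domain Sinkhorn at temperature eps
            f = eps * np.log(a) - eps * logsumexp((g[None, :] - C) / eps, axis=1)
            g = eps * np.log(b) - eps * logsumexp((f[:, None] - C) / eps, axis=0)
        if eps <= eps_min:
            break
        eps = max(eps * decay, eps_min)              # anneal, keeping (f, g) as warm start
    P = np.exp((f[:, None] + g[None, :] - C) / eps)  # primal plan at the final temperature
    return P, float(np.sum(P * C))

rng = np.random.default_rng(0)
n = 64
x, y = rng.normal(size=(n, 2)), rng.normal(size=(n, 2)) + 1.0
C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
a = b = np.full(n, 1.0 / n)
P, cost = annealed_sinkhorn(C, a, b)
print(f"transport cost ~ {cost:.4f}, worst row-marginal error {np.abs(P.sum(1) - a).max():.1e}")
```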
URL: https://openreview.net/forum?id=FVFqrxeF8e
---
Title: Efficient Hardware Scaling and Diminishing Returns in Large-Scale Training of Language Models
Authors: Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn
Abstract: To train the exceedingly large neural networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model training. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies with current best practices. In experiments with model sizes up to 70B parameters and utilizing up to 2048 H100 GPUs, we demonstrate that: (1) naive scale-out with Fully Sharded Data Parallelism (FSDP) incurs communication overhead that causes parallelization strategies previously thought to be sub-optimal to in fact become preferable; and (2) scaling the total number of accelerators for training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
URL: https://openreview.net/forum?id=p7jQEf3wlh
---
Title: Flow map matching with stochastic interpolants: A mathematical framework for consistency models
Authors: Nicholas Matthew Boffi, Michael Samuel Albergo, Eric Vanden-Eijnden
Abstract: Generative models based on dynamical equations such as flows and diffusions offer exceptional sample quality, but require computationally expensive numerical integration during inference. The advent of consistency models has enabled efficient one-step or few-step generation, yet despite their practical success, a systematic understanding of their design has been hindered by the lack of a comprehensive theoretical framework. Here we introduce Flow Map Matching (FMM), a principled framework for learning the two-time flow map of an underlying dynamical generative model, thereby providing this missing mathematical foundation. Leveraging stochastic interpolants, we propose training objectives both for distillation from a pre-trained velocity field and for direct training of a flow map over an interpolant or a forward diffusion process. Theoretically, we show that FMM unifies and extends a broad class of existing approaches for fast sampling, including consistency models, consistency trajectory models, and progressive distillation. Experiments on CIFAR-10 and ImageNet-32 highlight that our approach can achieve sample quality comparable to flow matching while reducing generation time by a factor of 10-20.
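For readers unfamiliar with the object being learned, the standard two-time flow map of a velocity field $b_t$ and the consistency (semigroup) property that few-step samplers exploit are recalled below; FMM's actual training objectives are in the paper and are not reproduced here.

```latex
% Two-time flow map X_{s,t} of an ODE with velocity field b_t, and the
% semigroup property that consistency-style objectives enforce.
\partial_t X_{s,t}(x) = b_t\bigl(X_{s,t}(x)\bigr), \qquad X_{s,s}(x) = x,
\qquad X_{u,t}\circ X_{s,u} = X_{s,t} \quad (s \le u \le t).
```

One-step generation then amounts to evaluating a learned approximation of $X_{0,1}$ on a sample from the base distribution.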
URL: https://openreview.net/forum?id=cqDH0e6ak2
---
Title: NeoBERT: A Next Generation BERT
Authors: Lola Le Breton, Quentin Fournier, Mariam El Mezouar, Sarath Chandar
Abstract: Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT$_{large}$, RoBERTa$_{large}$, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
URL: https://openreview.net/forum?id=TJRyDi7mwH
---
Title: Explicit Personalization and Local Training: Double Communication Acceleration in Federated Learning
Authors: Kai Yi, Laurent Condat, Peter Richtárik
Abstract: Federated Learning is an evolving machine learning paradigm, in which multiple clients perform computations based on their individual private data, interspersed by communication with a remote server. A common strategy to curtail communication costs is Local Training, which consists in performing multiple local stochastic gradient descent steps between successive communication rounds. However, the conventional approach to local training overlooks the practical necessity for client-specific personalization, a technique to tailor local models to individual needs. We introduce Scafflix, a novel algorithm that efficiently integrates explicit personalization with local training. This innovative approach benefits from these two techniques, thereby achieving doubly accelerated communication, as we demonstrate both in theory and practice.
URL: https://openreview.net/forum?id=qVUEuhlaEa
---
Title: MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models
Authors: Chunsan Hong, Tae-Hyun Oh, Minhyuk Sung
Abstract: Diffusion models have achieved remarkable success in Text-to-Image generation tasks, leading to the development of many commercial models. However, recent studies have reported that diffusion models often repeatedly generate memorized images in train data when triggered by specific prompts, potentially raising social issues ranging from copyright to privacy concerns. To sidestep the memorization, recent studies have been conducted to develop memorization mitigation methods for diffusion models. Nevertheless, the lack of benchmarks hinders the assessment of the true effectiveness of these methods. In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods. Our benchmark includes a large number of memorized image trigger prompts in various Text-to-Image diffusion models. Furthermore, in contrast to the prior work evaluating mitigation performance only on trigger prompts, we present metrics evaluating on both trigger prompts and general prompts, so that we can see whether mitigation methods address the memorization issue while maintaining performance for general prompts. Through our MemBench evaluation, we revealed that existing memorization mitigation methods notably degrade the overall performance of diffusion models and need to be further developed.
URL: https://openreview.net/forum?id=z3RIiidJgD
---
Title: NeurIPS 2023 Competition: Privacy Preserving Federated Learning Document VQA
Authors: Marlon Tobaben, Mohamed Ali Souibgui, Rubèn Tito, Khanh Nguyen, Raouf Kerkouche, Kangsoo Jung, Joonas Jälkö, Lei Kang, Andrey Barsky, Vincent Poulain d'Andecy, Aurélie JOSEPH, Aashiq Muhamed, Kevin Kuo, Virginia Smith, Yusuke Yamasaki, Takumi Fukami, Kenta Niwa, Iifan Tyou, Hiro Ishii, Rio Yokota, Ragul N, Rintu Kutum, Josep Llados, Ernest Valveny, Antti Honkela, Mario Fritz, Dimosthenis Karatzas
Abstract: The Privacy Preserving Federated Learning Document VQA (PFL-DocVQA) competition challenged the community to develop provably private and communication-efficient solutions in a federated setting for a real-life use case: invoice processing. The competition introduced a dataset of real invoice documents, along with associated questions and answers requiring information extraction and reasoning over the document images. Thereby, it brings together researchers and expertise from the document analysis, privacy, and federated learning communities. Participants fine-tuned a pre-trained, state-of-the-art Document Visual Question Answering model provided by the organizers for this new domain, mimicking a typical federated invoice processing setup. The base model is a multi-modal generative language model, and sensitive information could be exposed through either the visual or textual input modality. Participants proposed elegant solutions to reduce communication costs while maintaining a minimum utility threshold in track 1 and to protect all information from each document provider using differential privacy in track 2. The competition served as a new testbed for developing and testing private federated learning methods, simultaneously raising awareness about privacy within the document image analysis and recognition community. Ultimately, the competition analysis provides best practices and recommendations for successfully running privacy-focused federated learning challenges in the future.
URL: https://openreview.net/forum?id=3HKNwejEEq
---
Title: Leopard: A Vision Language Model for Text-Rich Multi-Image Tasks
Authors: Mengzhao Jia, Wenhao Yu, Kaixin Ma, Tianqing Fang, Zhihan Zhang, Siru Ouyang, Hongming Zhang, Dong Yu, Meng Jiang
Abstract: Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction tuning datasets for text-rich multi-image scenarios, and (2) the difficulty in balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM designed specifically for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length based on the original aspect ratios and resolutions of images. Experiments on a diverse set of benchmarks reveal that our model consistently outperforms state-of-the-art systems, such as Llama-3.2 and Qwen2-VL, in challenging text-rich, multi-image evaluations. Remarkably, our approach achieves outstanding performance using only 1.2M fully open-sourced training instances, outperforming models that rely on large-scale in-house data, highlighting its efficiency and effectiveness.
Our code and data are available at https://anonymous.4open.science/r/Leopard-908F.
URL: https://openreview.net/forum?id=R2rasAEPVi
---
Title: Labeling without Seeing? Blind Annotation for Privacy-Preserving Entity Resolution
Authors: Yixiang Yao, Weizhao Jin, Srivatsan Ravi
Abstract: The entity resolution problem requires finding pairs across datasets that belong to different owners but refer to the same entity in the real world. To train and evaluate solutions (either rule-based or machine-learning-based) to the entity resolution problem, generating a ground truth dataset with entity pairs or clusters is needed. However, such a data annotation process involves humans as domain oracles to review the plaintext data for all candidate record pairs from different parties, which inevitably infringes the privacy of data owners, especially in privacy-sensitive cases like medical records. To the best of our knowledge, there is no prior work on privacy-preserving ground truth labeling in the context of entity resolution. We propose a novel blind annotation protocol based on homomorphic encryption that allows domain oracles to collaboratively label ground truth without sharing data in plaintext with other parties. In addition, we design a domain-specific, user-friendly language that conceals the complex underlying homomorphic encryption circuits, making it more accessible and easier for users to adopt this technique. The empirical experiments indicate the feasibility of our privacy-preserving protocol (f-measure on average achieves more than 90\% compared with the real ground truth).
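As a toy illustration of the kind of primitive such a protocol builds on (not the paper's protocol or its language), the snippet below uses the `phe` (python-paillier) library so that one party can score a candidate record pair against another party's vector without ever seeing it in plaintext; the feature vectors, the dot-product score, and the threshold are all made up for illustration.

```python
# Toy additively homomorphic comparison of two records with python-paillier:
# party A encrypts its feature vector, party B computes an encrypted dot product
# against its own plaintext vector, and only A can decrypt the resulting score.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

record_a = [3, 0, 7, 1]                      # party A's (private) feature vector
record_b = [2, 1, 7, 2]                      # party B's (private) feature vector

enc_a = [public_key.encrypt(v) for v in record_a]          # A -> B: ciphertexts only
enc_score = sum(c * w for c, w in zip(enc_a, record_b))    # B: plaintext-by-ciphertext ops
similarity = private_key.decrypt(enc_score)                 # decrypted back at A
print("candidate pair for the oracle:", similarity >= 50)   # illustrative threshold
```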
URL: https://openreview.net/forum?id=bAM8y3Hm0p
---
Title: Dynamic Schwartz-Fourier Neural Operator for Enhanced Expressive Power
Authors: Wenhan Gao, Jian Luo, Ruichen Xu, Yi Liu
Abstract: Recently, neural operators have emerged as a prevailing approach for learning discretization-invariant mappings between function spaces. A particular example is the Fourier Neural Operator (FNO), which constrains integral kernels to be convolutions and learns the kernel directly in the frequency domain. Due to the capacity of Fourier transforms to effectively reduce the dimensionality and preserve information, FNOs demonstrate superior performance in terms of both efficiency and accuracy. In FNOs, the convolution kernel is fixed as a point-wise multiplication in the frequency domain; however, these translation-invariant kernels might limit the expressiveness of FNOs. For instance, if the underlying system lacks translational symmetries, the kernels learned by the FNO will still exhibit translational invariance, thereby limiting the model's expressive power. We propose a dynamic Schwartz operator that induces interactions between modes to enhance the expressiveness of FNOs. In this work, we introduce a novel approach that equips FNOs with Schwartz operators to learn dynamic kernels, termed Dynamic Kernel Fourier Neural Operators (DSFNOs). By incorporating this dynamic mechanism, our model gains the ability to capture relevant frequency information patterns, facilitating a better understanding and representation of complex physical phenomena. Through experiments, we demonstrate that DSFNOs can improve FNOs on a range of tasks, highlighting the effectiveness of our proposed approach. The code is available at https://github.com/wenhangao21/TMLR25_DSFNO.
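For reference, the fixed point-wise (per-mode) multiplication that the abstract says limits FNO expressiveness is the standard spectral convolution sketched below; DSFNO's dynamic, mode-coupling kernel is the paper's contribution and is not reproduced here.

```python
# Standard 1D FNO spectral convolution: keep the lowest `modes` Fourier modes and
# multiply each retained mode pointwise by a learned complex weight (no interaction
# between modes), which makes the learned kernel translation-invariant.
import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, channels, n)
        x_ft = torch.fft.rfft(x)
        out_ft = torch.zeros_like(x_ft)
        # Pointwise (per-mode) multiplication: modes do not interact.
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.shape[-1])

layer = SpectralConv1d(channels=2, modes=8)
print(layer(torch.randn(4, 2, 64)).shape)    # torch.Size([4, 2, 64])
```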
URL: https://openreview.net/forum?id=B0E2yjrNb8
---
Title: Normality-Guided Distributional Reinforcement Learning for Continuous Control
Authors: Ju-Seung Byun, Andrew Perrault
Abstract: Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) has been shown to improve performance by modeling the value distribution, not just the mean. We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal. We design a method that exploits this property, employing variances predicted from a variance network, along with returns, to analytically compute target quantile bars representing a normal for our distributional value function. In addition, we propose a policy update strategy based on the correctness as measured by structural characteristics of the value distribution not present in the standard value function. The approach we outline is compatible with many DRL structures. We use two representative on-policy algorithms, PPO and TRPO, as testbeds. Our method yields statistically significant improvements in 10 out of 16 continuous task settings, while utilizing a reduced number of weights and achieving faster training time compared to an ensemble-based method for quantifying value distribution uncertainty.
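The analytic step described here, turning a predicted mean and variance into target quantile bars of a normal, reduces to evaluating the normal quantile function at the quantile midpoints; a minimal version is below, with the number of quantiles and the midpoint convention as assumptions.

```python
# Target quantile bars of a normal value distribution from a predicted mean and
# variance: q_k = mu + sigma * Phi^{-1}(tau_k) at the quantile midpoints tau_k.
import numpy as np
from scipy.stats import norm

def normal_quantile_targets(mu: float, var: float, K: int = 8) -> np.ndarray:
    taus = (np.arange(K) + 0.5) / K            # quantile midpoints
    return mu + np.sqrt(var) * norm.ppf(taus)

print(np.round(normal_quantile_targets(mu=1.5, var=0.25, K=8), 3))
```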
URL: https://openreview.net/forum?id=z27hb0rmLT
---
Title: Mathematical Characterization of Better-than-Random Multiclass Models
Authors: Sébastien Foulle
Abstract: A binary supervised model outperforms chance if and only if the determinant of the confusion matrix is positive. This is equivalent to saying that the associated point in the ROC space is above the random guessing line. This also means that Youden's J, Cohen's $\kappa$ and Matthews' correlation coefficient are positive. We extend these results to any number of classes: for a target variable with $m \geq 2$ classes, we show that a model does better than chance if and only if the entries of the confusion matrix verify $m(m-1)$ homogeneous polynomial inequalities of degree 2, which can be expressed using generalized likelihood ratios. We also obtain a more theoretical formulation: a model does better than chance if and only if it is a maximum likelihood estimator of the target variable. When this is the case, we find that the multiclass versions of the previous metrics remain positive. If $m>2$, we notice that no-skill classifiers are only a small part of the topological boundary between better-than-random models and bad models. For $m=3$, we show that bad models occupy exactly 90\% of the ROC space, far more than the 50\% of the two-class problems. Finally, we propose to define weak multiclass classifiers by conditions on these generalized likelihood ratios.
URL: https://openreview.net/forum?id=VdW9SkALSd
---
Title: To Be Greedy, or Not to Be – That Is the Question for Population Based Training Variants
Authors: Alexander Chebykin, Tanja Alderliesten, Peter Bosman
Abstract: Achieving excellent results with neural networks requires careful hyperparameter tuning, which can be automated via hyperparameter optimization algorithms such as Population Based Training (PBT). PBT stands out for its capability to efficiently optimize hyperparameter schedules in parallel and within the wall-clock time of training a single network. Several PBT variants have been proposed that improve performance in the experimental settings considered in the associated publications. However, the experimental settings and tasks vary across publications, while the best previous PBT variant is not always included in the comparisons, thus making the relative performance of PBT variants unclear. In this work, we empirically evaluate five single-objective PBT variants on a set of image classification and reinforcement learning tasks with different setups (such as increasingly large search spaces). We find that the Bayesian Optimization (BO) variants of PBT tend to behave greedier than the non-BO ones, which is beneficial when aggressively pursuing short-term gains improves long-term performance and harmful otherwise. This is a previously overlooked caveat to the reported improvements of the BO PBT variants. Examining their theoretical properties, we find that the returns of BO PBT variants are guaranteed to asymptotically approach the returns of the greedy hyperparameter schedule (rather than the optimal one, as claimed in prior work). Together with our empirical results, this leads us to conclude that there is currently no single best PBT variant capable of outperforming others both when pursuing short-term gains is helpful in the long term, and when it is harmful.
URL: https://openreview.net/forum?id=3qmnxysNbi
---
Title: Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems
Authors: Hongkai Zheng, Wenda Chu, Austin Wang, Nikola Borislavov Kovachki, Ricardo Baptista, Yisong Yue
Abstract: When solving inverse problems, one increasingly popular approach is to use pre-trained diffusion models as plug-and-play priors. This framework can accommodate different forward models without re-training while preserving the generative capability of diffusion models. Despite their success in many imaging inverse problems, most existing methods rely on privileged information such as derivative, pseudo-inverse, or full knowledge about the forward model. This reliance poses a substantial limitation that restricts their use in a wide range of problems where such information is unavailable, such as in many scientific applications. We propose Ensemble Kalman Diffusion Guidance (EnKG), a derivative-free approach that can solve inverse problems by only accessing forward model evaluations and a pre-trained diffusion model prior. We study the empirical effectiveness of EnKG across various inverse problems, including scientific settings such as inferring fluid flows and astronomical objects, which are highly non-linear inverse problems that often only permit black-box access to the forward model. We open-source our code at https://github.com/devzhk/enkg-pytorch.
URL: https://openreview.net/forum?id=XPEEsKneKs
---
Title: A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models
Authors: Reda Bensaid, Vincent Gripon, François Leduc-Primeau, Lukas Mauch, Ghouthi BOUKLI HACENE, Fabien Cardinaux
Abstract: Few-shot semantic segmentation (FSS) is a crucial challenge in computer vision, driving extensive research into a diverse range of methods, from advanced meta-learning techniques to simple transfer learning baselines. With the emergence of vision foundation models (VFM) serving as generalist feature extractors, we seek to explore the adaptation of these models for FSS.
While current FSS benchmarks focus on adapting pre-trained models to new tasks with few images, they emphasize in-domain generalization, making them less suitable for VFM trained on large-scale web datasets. To address this, we propose a novel realistic benchmark with a simple and straightforward adaptation process tailored for this task. Using this benchmark, we conduct a comprehensive comparative analysis of prominent VFM and semantic segmentation models. To evaluate their effectiveness, we leverage various adaption methods, ranging from linear probing to parameter efficient fine-tuning (PEFT) and full fine-tuning.
Our findings show that models designed for segmentation can be outperformed by self-supervised (SSL) models. On the other hand, while PEFT methods yield competitive performance, the differences between adaptation methods are small, highlighting the critical role of the feature extractor in determining results.
To our knowledge, this is the first study on the adaptation of VFM for FSS.
URL: https://openreview.net/forum?id=5EXrH2h3I5
---
New submissions
===============
Title: An analysis of distributional reinforcement learning with Gaussian mixtures
Abstract: Distributional Reinforcement Learning (DRL) aims at optimizing a risk measure of the return by representing its distribution. However, finding a representation of this distribution is challenging as it requires a tractable estimation of the risk measure, a tractable loss, and a representation with enough approximation power. Although Gaussian mixtures (GM) are powerful statistical models to solve these challenges, only a few papers have investigated this approach and most use the L$_2$ space norm as a tractable metric between GM. In this paper, we provide new theoretical results on previously unstudied metrics. We show that the L$_2$ metric is not suitable and propose alternative metrics, a mixture-specific optimal transport (MW) distance and a maximum mean discrepancy distance. Focusing on temporal difference (TD) learning, we prove a convergence result for a related dynamic programming algorithm for the MW metric. Leveraging natural multivariate GM representations, we also highlight the potential of MW in multi-objective RL. Our approach is illustrated on some environments of the Arcade Learning Environment benchmark and shows promising empirical results.
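As a point of reference for the mixture-specific optimal transport (MW) distance mentioned above, the sketch below computes one common reading of it for 1-D Gaussian mixtures: an optimal transport problem over the mixture weights whose ground cost is the closed-form squared 2-Wasserstein distance between components. The paper's exact definition may differ; this is an illustration only.

    import numpy as np
    from scipy.optimize import linprog

    def w2_sq_gaussian_1d(m1, s1, m2, s2):
        # Closed-form squared 2-Wasserstein distance between two 1-D Gaussians.
        return (m1 - m2) ** 2 + (s1 - s2) ** 2

    def mw_distance(weights_a, means_a, stds_a, weights_b, means_b, stds_b):
        # Optimal transport over mixture weights with pairwise Gaussian W2^2 costs.
        n, m = len(weights_a), len(weights_b)
        cost = np.array([[w2_sq_gaussian_1d(means_a[i], stds_a[i], means_b[j], stds_b[j])
                          for j in range(m)] for i in range(n)])
        A_eq, b_eq = [], []
        for i in range(n):                      # row marginals = first mixture's weights
            row = np.zeros(n * m); row[i * m:(i + 1) * m] = 1.0
            A_eq.append(row); b_eq.append(weights_a[i])
        for j in range(m):                      # column marginals = second mixture's weights
            col = np.zeros(n * m); col[j::m] = 1.0
            A_eq.append(col); b_eq.append(weights_b[j])
        res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=(0, None), method="highs")
        return float(np.sqrt(res.fun))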
URL: https://openreview.net/forum?id=b4VgI1RTv8
---
Title: Adversarial Robustness of Graph Transformers
Abstract: Existing studies have shown that Message-Passing Graph Neural Networks (MPNNs) are highly susceptible to adversarial attacks. In contrast, despite the increasing importance of Graph Transformers (GTs), their robustness properties are unexplored. We close this gap and design the first adaptive attacks for GTs. In particular, we provide general design principles for strong gradient-based attacks on GTs w.r.t. structure perturbations and instantiate our attack framework for five representative and popular GT architectures. Specifically, we study GTs with specialized attention mechanisms and Positional Encodings (PEs) based on pairwise shortest paths, random walks, and the Laplacian spectrum. We evaluate our attacks on multiple tasks and perturbation models, including structure perturbations for node and graph classification and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Addressing this vulnerability, we show how our adaptive attacks can be effectively used for adversarial training, substantially improving robustness.
URL: https://openreview.net/forum?id=4xK0vjxTWL
---
Title: Can Masked Autoencoders Also Listen to Birds?
Abstract: Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, thereby revealing the performance limitations of general-domain Audio-MAE models. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data; adapting the entire training pipeline is crucial. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using \texttt{BirdSet}, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results in BirdSet's multi-label classification benchmark. Additionally, we introduce the parameter-efficient prototypical probing, enhancing the utility of frozen MAE representations and closely approaching fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37\%$_\text{p}$ in MAP and narrow the gap to fine-tuning to approximately 3.3\%$_\text{p}$ on average across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.
URL: https://openreview.net/forum?id=GIBWR0Xo2J
---
Title: Error Correction by Agreement Checking for Adversarial Robustness against Black-box Attacks
Abstract: Inspired by how the early stages of visual perception in humans and primates are vulnerable to adversarial attacks, we present a new defense method called Error Correction by Agreement Checking (ECAC). This strategy is designed to mitigate realistic black-box threats. We exploit the fact that natural and adversarially trained models rely on distinct feature sets for classification. Notably, naturally trained models retain commendable accuracy against adversarial examples generated using adversarially trained models. Leveraging this disparity, ECAC first moves the input toward the naturally trained model's prediction, reverting whenever this would cause the two models to disagree, and only then makes its prediction. This simple error correction mechanism is highly effective against leading score-based query attacks (SQA) as well as decision-based and transfer-based black-box attacks. We also verify that, unlike other black-box defenses, ECAC maintains significant robustness even when the adversary has full access to the model. We demonstrate its effectiveness through comprehensive experiments across various datasets (CIFAR and ImageNet) and architectures (ResNet and ViT).
URL: https://openreview.net/forum?id=XgK05fssnx
---
Title: MATEY: multiscale adaptive transformer models for spatiotemporal physical systems
Abstract: Accurate representation of the multiscale features in spatiotemporal physical systems using vision transformer (ViT) architectures requires extremely long, computationally prohibitive token sequences. To address this issue, we propose two adaptive tokenization schemes that dynamically adjust patch sizes based on local features: one ensures convergent behavior to uniform patch refinement, while the other offers better computational efficiency.
Moreover, we present a set of spatiotemporal attention schemes, where the temporal or axial spatial dimensions are decoupled, and evaluate their computational and data efficiencies.
We assess the performance of the proposed multiscale adaptive model, MATEY, in a sequence of experiments.
The results show that adaptive tokenization schemes achieve improved accuracy without significantly increasing the length of the token sequence.
Compared to a full spatiotemporal attention scheme or a scheme that decouples only the temporal dimension, we find that fully decoupled axial attention is less efficient and expressive, requiring more training time and model weights to achieve the same accuracy.
Finally, we demonstrate in two fine-tuning tasks featuring different physics that models pretrained on PDEBench data outperform the ones trained from scratch, especially in the low data regime with frozen attention.
URL: https://openreview.net/forum?id=sIhbw5c1nG
---
Title: Graph Reduction with Unsupervised Learning in Column Generation: A Routing Application
Abstract: Column Generation (CG) is a popular method dedicated to enhancing computational efficiency in large-scale Combinatorial Optimization (CO) problems. It reduces the number of decision variables in a problem by solving a pricing problem. For many CO problems, the pricing problem is an Elementary Shortest Path Problem with Resource Constraints (ESPPRC). Large ESPPRC instances are difficult to solve to near-optimality. Consequently, we use a Graph Neural Network (GNN) to reduce the size of the ESPPRC such that it becomes computationally tractable with standard solving techniques. Our GNN is trained by unsupervised learning and outputs a distribution over the arcs to be retained in the reduced pricing problem (PP). The reduced PP is solved by a local search that finds columns with large reduced costs and speeds up convergence. We apply our method to a set of Capacitated Vehicle Routing Problems with Time Windows and show significant improvements in convergence compared to simple reduction techniques from the literature. For a fixed computational budget, we improve the objective values by over 9% for larger instances. We also analyze the performance of our CG algorithm and test the generalization of our method to classes of instances different from the training data.
URL: https://openreview.net/forum?id=ANuu812EdY
---
Title: We Can (and Should) Design Neural Networks with a Systematic Dimensional Approach
Abstract: The design of neural network architectures, despite remarkable empirical successes, resembles an architecture zoo characterized by chance innovations and reliance on intuition rather than systematic thinking. This approach limits our ability to deeply understand why architectures succeed, efficiently explore the vast design space, and transfer knowledge across different paradigms. We argue for a shift in how the machine learning community approaches neural architecture design: moving from an architecture-centric cataloging to a dimensional-centric understanding. Building on prior taxonomic work and integrating insights from recent architecture search approaches, we introduce a framework comprising 10 quasi-orthogonal structural dimensions that govern the capabilities of neural networks. This dimensional approach facilitates deeper understanding by enabling the deconstruction of complex architectures into their core design choices and their associated inductive biases. This aims to enable more principled innovation by providing a modern map for systematic exploration of the design space and targeted design for specific problem characteristics. We demonstrate the framework's utility by mapping diverse, prominent architectures onto these dimensions and call upon the community to adopt such systematic frameworks for more principled and efficient advancement in neural network design.
URL: https://openreview.net/forum?id=lR54W6CjNh
---
Title: Improving Tabular Generative Models: Loss Functions, Benchmarks, and Improved Multi-objective Bayesian Optimization Approaches
Abstract: Access to extensive data is essential to improve model performance and generalization in deep learning (DL). When dealing with sparse datasets—those with limited samples relative to model complexity—a promising solution is to generate synthetic data using deep generative models (DGMs). However, these models often struggle to capture the complexities of real-world tabular data, including diverse variable types, imbalances, and intricate dependencies. Additionally, standard Bayesian optimization (SBO), commonly used for hyper-parameter tuning, struggles to optimize over aggregated metrics with different units, leading to unreliable averaging and suboptimal decisions. To address these gaps, we introduce a novel correlation- and distribution-aware loss function that regularizes DGMs, enhancing their ability to generate synthetic tabular data that faithfully represents the underlying data distributions. Theoretical guarantees for the proposed loss functions are provided, including stability and consistency analyses, ensuring their robustness. To enable principled hyperparameter search via Bayesian optimization (BO), we also propose a new multi-objective aggregation strategy based on iterative objective refinement Bayesian optimization (IORBO), along with a comprehensive statistical testing framework. We validate the proposed approach using a benchmarking framework with twenty real-world datasets and ten established tabular DGM baselines. The results demonstrate that the proposed loss function significantly improves the fidelity of the synthetic data generated with DGMs, leading to better performance in downstream machine learning (ML) tasks. Furthermore, the IORBO consistently outperformed SBO, yielding superior hyper-parameter results. This work advances synthetic data generation and optimization techniques, enabling more robust DL applications.
URL: https://openreview.net/forum?id=RPZ0EW0lz0
---
Title: Understanding Self-supervised Contrastive Learning through Supervised Objectives
Abstract: Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.
URL: https://openreview.net/forum?id=cmE97KX2XM
---
Title: Effect of Geometry on Graph Neural Networks
Abstract: Hyperbolic Graph Neural Networks (GNNs) have emerged as a promising approach for modeling graph-structured data with less embedding distortion than Euclidean GNNs. In this paper, we explore the effect of geometry on the performance of three types of GNNs for node classification and link prediction. To do so, we build on the hyperbolic framework outlined in Chen et al. (2022) and propose a family of GNNs with alternating geometry, integrating both hyperbolic and Euclidean components that can be trained jointly. We compare our alternating geometry models’ performance and stability against their Euclidean and hyperbolic counterparts across various datasets. Finally, we examine the impact of the choice of geometry and graph properties on hyperparameter selection. The alternating geometry models achieved the best performance in node classification, while the hyperbolic models outperformed alternating and Euclidean models in link prediction. Additionally, for node classification, architecture choice had a greater impact on performance than geometry, whereas for link prediction, geometry had a more significant effect than architecture.
URL: https://openreview.net/forum?id=qSF5Hsjmkd
---
Title: Recursive SNE: Fast Prototype-Based t-SNE for Large-Scale and Online Data
Abstract: Dimensionality reduction techniques like t-SNE excel at visualizing structure in high-dimensional data but incur high computational costs that limit their use on large or streaming datasets. We introduce the Recursive SNE (RSNE) framework, which extends t-SNE with two complementary strategies: i-RSNE for real-time, point-wise updates and Bi-RSNE for efficient batch processing. Across diverse settings, including standard image benchmarks (CIFAR10/CIFAR100) with DINOv2 and CLIP features, domain-specific iROADS road scenes, and long-term climate records, RSNE delivers substantial speedups over Barnes–Hut t-SNE while maintaining or even improving cluster separability. By combining a lightweight prototype-based initialization with localized KL-divergence refinements, RSNE offers a scalable and adaptable framework for both large-scale offline embedding and on-the-fly visualization of streaming data.
URL: https://openreview.net/forum?id=7wCPAFMDWM
---
Title: IBCL: Zero-shot Model Generation under Stability-Plasticity Trade-offs
Abstract: Algorithms that balance the stability-plasticity trade-off are well studied in the Continual Learning literature. However, only a few focus on obtaining models for specified trade-off preferences. When solving the problem of continual learning under specific trade-offs (CLuST), state-of-the-art techniques leverage rehearsal-based learning, which requires retraining when a model corresponding to a new trade-off preference is requested. This is inefficient, since there may be a significant number of different trade-offs, and a large number of models may be requested. As a response, we propose Imprecise Bayesian Continual Learning (IBCL), an algorithm that tackles CLuST efficiently. IBCL replaces retraining with a constant-time convex combination. Given a new task, IBCL (1) updates the knowledge base as a convex hull of model parameter distributions, and (2) generates one Pareto-optimal model per given trade-off via convex combination without additional training. That is, obtaining models corresponding to specified trade-offs via IBCL is zero-shot. Experiments against current CLuST baselines show that IBCL improves average per-task accuracy by up to 45% and peak per-task accuracy by up to 43%, while maintaining a near-zero to positive backward transfer. In addition, its training overhead, measured by the number of batch updates, remains constant at every task, regardless of the number of preferences requested. Details can be found at: https://github.com/ibcl-anon/ibcl.
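To make the zero-shot generation step concrete, here is a toy sketch that forms a model for a requested trade-off preference as a preference-weighted convex combination of per-task parameter vectors. IBCL itself combines parameter distributions drawn from a convex hull of posteriors, so this point-estimate version is a simplification for illustration only.

    import numpy as np

    def zero_shot_model(extreme_params, preference):
        # Toy zero-shot model generation: convex combination of per-task parameter
        # vectors, weighted by the requested trade-off preference. IBCL proper works
        # with parameter *distributions*; this simplification is for illustration.
        preference = np.asarray(preference, dtype=float)
        assert np.all(preference >= 0) and np.isclose(preference.sum(), 1.0)
        return sum(w * np.asarray(p) for w, p in zip(preference, extreme_params))

    # Example: two tasks, a preference of 70% task 1 / 30% task 2.
    theta = zero_shot_model([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [0.7, 0.3])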
URL: https://openreview.net/forum?id=HvTRpctE5n
---
Title: SELU: Self-Learning Embodied Multimodal Large Language Models in Unknown Environments
Abstract: Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, enabling the exploration of autonomously improving MLLMs in unknown environments. However, external feedback, such as human or environmental feedback, is not always available. To address this challenge, existing methods primarily focus on enhancing the decision-making capabilities of MLLMs through voting and scoring mechanisms, while little attention has been paid to improving the environmental comprehension of MLLMs in unknown environments. To fully unleash the self-learning potential of MLLMs, we propose a novel self-learning paradigm, dubbed SELU, inspired by the actor-critic paradigm in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Simultaneously, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments, and SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24% via self-learning.
URL: https://openreview.net/forum?id=G5gROx8AVi
---
Title: COMMA: A Communicative Multimodal Multi-Agent Benchmark
Abstract: Rapid advances in multimodal agents built on large foundation models have largely overlooked the agents' potential for language-based communication in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce a novel benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of scenarios, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many long chain-of-thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a simple random agent baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.
URL: https://openreview.net/forum?id=TIGQIem1na
---
Title: The Alpha-Alternator: Dynamic Adaptation To Varying Noise Levels In Sequences Using The Vendi Score For Improved Robustness and Performance
Abstract: Current state-of-the-art dynamical models, such as Mamba, assume the same level of noisiness for all elements of a given sequence, which limits their performance on noisy temporal data. In this paper, we introduce the \textbf{$\alpha$-Alternator}, a novel generative model for time-dependent data that dynamically adapts to the complexity introduced by varying noise levels in sequences. The $\alpha$-Alternator leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to adjust, at each time step $t$, the influence of the sequence element at time $t$ and the latent representation of the dynamics up to that time step on the predicted future dynamics. This influence is captured by a parameter that is learned and shared across all sequences in a given dataset. The sign of this parameter determines the direction of influence. A negative value indicates a noisy dataset, where a sequence element that increases the VS is considered noisy, and the model relies more on the latent history when processing that element. Conversely, when the parameter is positive, a sequence element that increases the VS is considered informative, and the $\alpha$-Alternator relies more on this new input than on the latent history when updating its predicted latent dynamics. The $\alpha$-Alternator is trained using a combination of observation masking and Alternator loss minimization. Masking simulates varying noise levels in sequences, enabling the model to be more robust to these fluctuations and improving its performance in trajectory prediction, imputation, and forecasting. Our experimental results demonstrate that the $\alpha$-Alternator outperforms both Alternators and state-of-the-art state-space models across neural decoding and time-series forecasting benchmarks.
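For readers unfamiliar with the Vendi Score referenced above, the sketch below follows its published definition: the exponential of the Shannon entropy of the eigenvalues of a normalized similarity kernel over the set. How the $\alpha$-Alternator feeds this quantity into its learned gating parameter is the paper's contribution and is not reproduced here; the RBF kernel is an arbitrary choice.

    import numpy as np

    def vendi_score(X, kernel=lambda a, b: np.exp(-np.sum((a - b) ** 2))):
        # Vendi Score of a set of vectors: exp(entropy of the eigenvalues of K/n),
        # where K is a positive semi-definite similarity matrix with unit diagonal.
        n = len(X)
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        eigvals = np.linalg.eigvalsh(K / n)
        eigvals = eigvals[eigvals > 1e-12]          # drop numerical zeros
        return float(np.exp(-np.sum(eigvals * np.log(eigvals))))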
URL: https://openreview.net/forum?id=L2ixqvYpnK
---
Title: Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization
Abstract: Reinforcement Learning with Human Feedback (RLHF) enhances the alignment of Large Language Models (LLMs). However, its limitations have led to the development of Direct Preference Optimization (DPO), an RL-free approach designed to overcome these shortcomings. While studies have shown that DPO improves instruction-following capabilities, it negatively impacts the reasoning ability of LLMs. Additionally, DPO is highly sensitive to judgment noise in preference datasets and the size of the training set. Although several modifications to DPO have been proposed, they still fail to fully resolve these issues. To address these limitations, we propose Triple Preference Optimization (TPO), a new preference learning method designed to enhance both reasoning and instruction-following abilities through one-step optimization. We compare TPO against DPO and its recent variants using state-of-the-art training setups, including both base and instruction-tuned models such as Mistral and Llama 3. Our evaluation covers a comprehensive range of chat-based and reasoning benchmarks. The results demonstrate that TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes. Specifically, TPO outperforms DPO and SimPO by up to 7.0% and 7.3% points on Arena-Hard, 12.2% and 13.3% points on MixEval-Hard, 10.4% and 10.1% points on MMLU-Pro, and 19.0% and 19.2% points on GSM8K, respectively. Furthermore, TPO achieves these improvements while requiring less data than DPO.
URL: https://openreview.net/forum?id=A4jyaZheE8
---
Title: Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common Framework
Abstract: Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations.
In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly.
We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation.
The code for our framework is available at https://anonymous.4open.science/r/unifi3d-39CD.
URL: https://openreview.net/forum?id=GQpTWpXILA
---
Title: ABC: Achieving Better Control of Visual Embeddings using VLLMs
Abstract: Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate an embedding model that can use a natural language instruction to control the representation of a visual embedding. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control.
URL: https://openreview.net/forum?id=RezANmBpxW
---
Title: Learning to Be Cautious
Abstract: A key challenge in the field of reinforcement learning is to develop agents that behave cautiously in novel situations. It is generally impossible to anticipate all situations that an autonomous system may face or what behavior would best avoid bad outcomes. An agent that could learn to be cautious would overcome this challenge by discovering for itself when and how to behave cautiously. In contrast, current approaches typically embed task-specific safety information or explicitly cautious behaviors into the system, which is error-prone and imposes extra burdens on practitioners. In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious, as well as an algorithm to demonstrate that it is possible for a system to learn to be cautious. The essential features of our algorithm are that it characterizes reward function uncertainty without task-specific safety information and uses this uncertainty to construct a robust policy. Specifically, we construct robust policies with a $k$-of-$N$ counterfactual regret minimization (CFR) subroutine given a learned reward function uncertainty represented by a neural network ensemble belief. These policies exhibit caution in each of our tasks without any task-specific safety tuning.
URL: https://openreview.net/forum?id=NXvGOaYExG
---
Title: Goal Recognition Design for General Behavioral Agents using Machine Learning
Abstract: Goal recognition design (GRD) aims to make limited modifications to decision-making environments to make it easier to infer the goals of agents acting within those environments. Although various research efforts have been made in goal recognition design, existing approaches are computationally demanding and often assume that agents are (near-)optimal in their decision-making. To address these limitations, we leverage machine learning methods for goal recognition design that can both improve run-time efficiency and account for agents with general behavioral models. Following existing literature, we use worst-case distinctiveness (wcd) as a measure of the difficulty in inferring the goal of an agent in a decision-making environment. Our approach begins by training a machine learning model to predict the wcd for a given environment and the agent behavior model. We then propose a gradient-based optimization framework that accommodates various constraints to optimize decision-making environments for enhanced goal recognition. Through extensive simulations, we demonstrate that our approach outperforms existing methods in reducing wcd and enhances runtime efficiency. Moreover, our approach also adapts to settings in which existing approaches do not apply, such as those involving flexible budget constraints, more complex environments, and suboptimal agent behavior. Finally, we conducted human-subject experiments that demonstrate that our method creates environments that facilitate efficient goal recognition from human decision-makers.
URL: https://openreview.net/forum?id=GDuWBhvMid
---
Title: Improving Single-round Active Adaptation: A Prediction Variability Perspective
Abstract: Machine learning models trained with offline data often suffer from distribution shifts in online environments and require fast adaptation to online data. The high volume of online data further stimulates the study of active adaptation approaches that achieve competitive adaptation performance by selectively annotating only 5%-10% of online data and using it to continuously train a model. Despite the reduction in data annotation cost, many prior active adaptation approaches assume a multi-round data annotation procedure during continuous training, which hinders timely adaptation. In this work, we study a single-round active adaptation problem that minimizes data annotation turnaround time but requires the selected subset of data samples to support the entire continuous training procedure until convergence. In our theoretical analysis, we find that the prediction variability of each data sample throughout the training is crucial, in addition to the conventional data diversity. The prediction variability measures how much the prediction could possibly change during the continuous training procedure. To this end, we introduce a novel approach called feature-norm scaled gradient embedding (FORGE), which incorporates prediction variability and improves the single-round active adaptation performance when combined with standard data selection strategies (e.g., k-center greedy). In addition, we provide efficient implementations to construct our FORGE embedding analytically without explicitly backpropagating gradients. Empirical results further demonstrate that our approach consistently outperforms the random selection baseline by up to 1.26% for various vision and language tasks while other competitors often underperform the random selection baseline.
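The k-center greedy selection strategy mentioned above is standard and is sketched below over arbitrary per-sample embeddings; constructing the FORGE embedding itself (the feature-norm scaled gradient embedding) is the paper's contribution and is not reproduced here.

    import numpy as np

    def k_center_greedy(embeddings, budget, seed_idx=0):
        # Standard k-center greedy selection: repeatedly pick the sample farthest
        # from the current set of selected centers. FORGE would supply `embeddings`.
        selected = [seed_idx]
        dists = np.linalg.norm(embeddings - embeddings[seed_idx], axis=1)
        for _ in range(budget - 1):
            nxt = int(np.argmax(dists))
            selected.append(nxt)
            dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
        return selected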
URL: https://openreview.net/forum?id=Vthqn5VE7L
---
Title: Measurement Manipulation of the Matrix Sensing Problem to Improve Optimization Landscape
Abstract: This work studies the matrix sensing (MS) problem through the lens of the Restricted Isometry Property (RIP). It has been shown in several recent papers that two different techniques of convex relaxations and local search methods for the MS problem both require the RIP constant to be less than 0.5 while most real-world problems have their RIPs close to 1. The existing literature guarantees a small RIP constant only for sensing operators having an i.i.d. Gaussian distribution, and it is well-known that the MS problem could have a complicated landscape when the RIP is greater than 0.5. In this work, we address this issue and improve the optimization landscape by developing two results. First, we show that any sensing operator with a model not too distant from i.i.d. Gaussian has a slightly higher RIP than i.i.d. Gaussian. Second, we show that if the sensing operator has an arbitrary distribution, it can be modified in such a way that the resulting operator will act as a perturbed Gaussian with a lower RIP constant. Our approach is a preconditioning/mixing technique that replaces each sensing matrix with a weighted sum of all sensing matrices. This approach does not require taking new measurements (which is not possible in many applications) and relies only on mixing existing measurements. We numerically demonstrate that the RIP constants for different distributions can be reduced from almost 1 to less than 0.5 via the preconditioning of the sensing operator.
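The preconditioning/mixing idea described above amounts to applying one weight matrix to both the sensing operators and the existing measurements, so no new measurements are needed. A minimal sketch follows; the paper's actual choice of mixing weights (designed to lower the RIP constant) is not reproduced, and the random Gaussian mixing matrix is purely a placeholder.

    import numpy as np

    def mix_measurements(A, y, W=None, rng=None):
        # A: (m, d1, d2) stack of sensing matrices; y: length-m measurement vector.
        # Each new sensing matrix is a weighted sum of all original ones, and the
        # measurements are mixed with the same weights: A'_j = sum_i W[j, i] * A_i,
        # y'_j = sum_i W[j, i] * y_i. The weight choice below is a placeholder only.
        rng = np.random.default_rng(rng)
        m = A.shape[0]
        if W is None:
            W = rng.standard_normal((m, m)) / np.sqrt(m)
        A_mixed = np.einsum('ji,ikl->jkl', W, A)
        y_mixed = W @ y
        return A_mixed, y_mixed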
URL: https://openreview.net/forum?id=OR8JWLRUrM
---
Title: From Reasoning to Learning: A Survey on Hypothesis Discovery and Rule Learning with Large Language Models
Abstract: Since the advent of Large Language Models (LLMs), efforts have largely focused on improving their instruction-following and deductive reasoning abilities, leaving open the question of whether these models can truly discover new knowledge. In pursuit of artificial general intelligence (AGI), there is a growing need for models that not only execute commands or retrieve information but also learn, reason, and generate new knowledge by formulating novel hypotheses and theories that deepen our understanding of the world. Guided by Peirce's framework of abduction, deduction, and induction, this survey offers a structured lens to examine LLM-based hypothesis discovery. We synthesize existing work in hypothesis generation, application, and validation, identifying both key achievements and critical gaps. By unifying these threads, we illuminate how LLMs might evolve from mere ``information executors'' into engines of genuine innovation, potentially transforming research, science, and real-world problem solving.
URL: https://openreview.net/forum?id=d7W38UzUg0
---
Title: Reinforcement Learning from Human Feedback with Active Queries
Abstract: Aligning large language models (LLMs) with human preference plays a key role in building modern generative models and can be achieved by reinforcement learning from human feedback (RLHF). Despite their superior performance, current RLHF approaches often require a large amount of human-labelled preference data, which is expensive to collect. In this paper, inspired by the success of active learning, we address this problem by proposing query-efficient RLHF methods. We first formalize the alignment problem as a contextual dueling bandit problem and design an active-query-based proximal policy optimization (APPO) algorithm with an $\tilde{O}(d^2/\Delta)$ instance-dependent regret bound and an $\tilde{O}(d^2/\Delta^2)$ query complexity, where $d$ is the dimension of feature space and $\Delta$ is the sub-optimality gap over all the contexts. We then propose ADPO, a practical version of our algorithm based on direct preference optimization (DPO) and apply it to fine-tuning LLMs. Our experiments show that ADPO, while making only about half as many human preference queries, matches the performance of DPO, establishing it as a data-efficient alternative to DPO.
URL: https://openreview.net/forum?id=EScatQaRxz
---
Title: Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks
Abstract: Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss–Newton update must normally be solved in the parameter space, incurring a prohibitive $\mathcal{O}(n^{3})$ time complexity, where $n$ is the number of network weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m=\sum_{\gamma}N_{\gamma}d_{\gamma}$, where each residual class $\gamma$ (e.g. PDE interior, boundary, initial data) contributes $N_{\gamma}$ collocation points of output dimension $d_{\gamma}$.
Building on this insight, we introduce Dual Natural Gradient Descent (D-NGD). D-NGD computes the Gauss–Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nyström-preconditioned conjugate-gradient solver for larger $m$.
Experimentally, D-NGD scales second-order PINN optimisation to networks with up to 12.8 million parameters, delivers one- to three-order-of-magnitude lower final $L^{2}$ error than first-order (Adam, SGD) and quasi-Newton methods, and—crucially—enables full natural-gradient training of PINNs at this scale on a single GPU.
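The residual-space step described above rests on a standard linear-algebra identity: the damped Gauss-Newton update $(J^\top J+\lambda I_n)^{-1}J^\top r$ equals $J^\top(JJ^\top+\lambda I_m)^{-1}r$, so it can be solved in the (often much smaller) residual space. A dense-solver sketch of that identity is given below; the geodesic-acceleration correction and the Nyström-preconditioned CG solver from the paper are not reproduced.

    import numpy as np

    def gauss_newton_step_dual(J, r, damping=1e-6):
        # Damped Gauss-Newton step computed in residual space (dual form).
        # Primal: delta = (J^T J + lam * I_n)^{-1} J^T r   -> O(n^3) in #parameters n
        # Dual:   delta = J^T (J J^T + lam * I_m)^{-1} r   -> O(m^3) in #residuals m
        m = J.shape[0]
        alpha = np.linalg.solve(J @ J.T + damping * np.eye(m), r)
        return J.T @ alpha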
URL: https://openreview.net/forum?id=GDHVRy6SDd
---
Title: On the Universal Statistical Consistency of Expansive Hyperbolic Deep Convolutional Neural Networks
Abstract: Deep Convolutional Neural Networks (DCNNs) have become a pervasive tool for a wide range of applications in computer vision. Despite their capability to capture intricate patterns in the data, the underlying embedding space remains Euclidean and primarily pursues contractive convolution. Several instances illustrate the degraded performance of DCNNs in such settings. The recent advancement of neural networks in hyperbolic spaces has gained traction, incentivizing the development of convolutional deep neural networks in the hyperbolic space. In this work, we propose a Hyperbolic DCNN based on the Poincar\'{e} Ball. The work predominantly revolves around analyzing the nature of expansive convolution in the non-Euclidean domain. We further offer extensive theoretical insights into the universal consistency of expansive convolution in the hyperbolic space. Several simulations were performed not only on synthetic datasets but also on some real-world datasets. The experimental results reveal that the hyperbolic convolutional architecture outperforms its Euclidean counterparts by a commendable margin.
URL: https://openreview.net/forum?id=YWJeeK8y2k
---
Title: Efficient Vocabulary-Free Fine-Grained Visual Recognition in the Age of Multimodal LLMs
Abstract: Fine-grained Visual Recognition (FGVR) involves distinguishing between visually similar categories, which is inherently challenging due to subtle inter-class differences and the need for large, expert-annotated datasets. In domains like medical imaging, such curated datasets are unavailable due to issues like privacy concerns and high annotation costs. In such scenarios lacking labeled data, an FGVR model cannot rely on a predefined set of training labels, and hence has an unconstrained output space for predictions. We refer to this task as Vocabulary-Free FGVR (VF-FGVR), where a model must predict labels from an unconstrained output space without prior label information. While recent Multimodal Large Language Models (MLLMs) show potential for VF-FGVR, querying these models for each test input is impractical because of high costs and prohibitive inference times. To address these limitations, we introduce Nearest-Neighbor label Refinement (NeaR), a novel approach that fine-tunes a downstream CLIP model using labels generated by an MLLM. Our approach constructs a weakly supervised dataset from a small, unlabeled training set, leveraging MLLMs for label generation. NeaR is designed to handle the noise, stochasticity, and open-endedness inherent in labels generated by MLLMs, and establishes a new benchmark for efficient VF-FGVR.
URL: https://openreview.net/forum?id=FvA0UMw9X2
---
Title: On the Role of Discrete Representation in Sparse Mixture of Experts
Abstract: Sparse Mixture of Experts (SMoE) is an effective solution for scaling up model capacity without increasing the computational costs. A crucial component of SMoE is the router, responsible for directing the input to relevant experts; however, it also presents a major weakness, leading to routing inconsistencies and representation collapse issues. Instead of fixing the router like previous works, we propose an alternative that assigns experts to input via \emph{indirection}, which employs the discrete representation of input that points to the expert. The discrete representations are learned via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating the VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28\% improvement in robustness compared to other SMoE routing methods while maintaining strong performance in fine-tuning tasks.
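A minimal sketch of the indirection idea above: each token is quantized to its nearest codebook entry, and that code index selects the expert (one code per expert here for simplicity). The vector-quantization training losses and the full VQMoE architecture are the paper's and are not reproduced.

    import torch
    import torch.nn as nn

    class VQRouter(nn.Module):
        # Sketch of vector-quantized expert assignment: the nearest codebook entry
        # determines the expert index for each token (one code per expert here).
        def __init__(self, dim, num_experts):
            super().__init__()
            self.codebook = nn.Parameter(torch.randn(num_experts, dim))

        def forward(self, x):                          # x: (batch, tokens, dim)
            flat = x.reshape(-1, x.size(-1))
            dists = torch.cdist(flat, self.codebook)   # (batch * tokens, num_experts)
            return dists.argmin(dim=-1).view(x.shape[:-1])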
URL: https://openreview.net/forum?id=GTWKmojpI7
---
Title: M4GN: Micro–Meso–Macro Mesh-based Graph Network for Dynamic Simulations
Abstract: Dynamic systems often exhibit intricate interactions that span from localized, fine-scale processes to broad, global effects. Accurately modeling these systems therefore demands methods that account for both localized dynamics and extended global dependencies while remaining computationally tractable. However, existing surrogate models often struggle to balance precision and scalability, especially for large datasets, complex mesh topologies, and long-range effects. In this paper, we introduce M4GN, a physics-informed hierarchical model designed to address the aforementioned challenges by aligning its framework with the inherent behaviors in dynamic simulations. M4GN comprises three stages: a micro-level stage for fine-grained local dynamics, a macro-level stage for far-reaching global interactions, and a meso-level stage that facilitates effective information exchange between these levels by aligning mesh hierarchy with physical properties. Experimental results show that M4GN achieves superior accuracy, excels at modeling long-range interactions, and maintains high computational efficiency. Moreover, M4GN generalizes well to larger physical domains, making it particularly suitable for complex, large-scale dynamic simulations. All code and data will be released upon acceptance.
URL: https://openreview.net/forum?id=R3vDbqWa1v
---
Title: HESSO: Towards Automatic Efficient and User Friendly Any Neural Network Training and Pruning
Abstract: Structured pruning is a popular technique for compressing deep neural networks (DNNs) into efficient sub-networks. However, existing methods often require a multi-stage process, engineering efforts, and human expertise. The Only-Train-Once series (OTOv1-v3) has been proposed to resolve some pain points by streamlining the workflow. However, the built-in sparse optimizers in the OTO series need hyperparameter tuning and implicit control over sparsity, necessitating human intervention. To address these limitations, we propose the Hybrid Efficient Structured Sparse Optimizer (HESSO), which automatically and efficiently trains a DNN within a single run to produce a high-performing sub-network. HESSO is almost tuning-free and enjoys user-friendly integration for generic training applications. In addition, to tackle the common issue of irreversible pruning performance collapse in certain DNNs, we further propose the Corrective Redundant Identification Cycle (CRIC), which integrates seamlessly with HESSO. Extensive numerical results showcase that HESSO can achieve competitive performance on various state-of-the-art benchmarks and support most DNN architectures. Moreover, CRIC can effectively prevent the irreversible performance collapse and further enhance the performance of HESSO on certain applications.
URL: https://openreview.net/forum?id=QDBPXiJJvp
---
Title: A Unified Understanding and Evaluation of Steering Methods
Abstract: Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations, guiding outputs toward desired behaviors while avoiding retraining. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets, hindering progress. This paper introduces a unified framework for analyzing and evaluating steering methods, formalizing their core principles and offering theoretical insights into their effectiveness. Through comprehensive empirical evaluations on multiple-choice and open-ended text generation tasks, we validate these insights, identifying key factors that influence performance and demonstrating the superiority of certain methods. Our work bridges theoretical and practical perspectives, offering actionable guidance for advancing the design, optimization, and deployment of steering methods in LLMs.
URL: https://openreview.net/forum?id=NDTtRaPCzF
---
Title: Dual Caption Preference Optimization for Diffusion Models
Abstract: Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a *conflict distribution*. Additionally, we identified that input prompts contain irrelevant information for less preferred images, limiting the denoising network's ability to accurately predict noise in preference optimization methods, known as the *irrelevant prompt* issue. To address these challenges, we propose **Dual Caption Preference Optimization (DCPO)**, a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the *Pick-Double Caption* dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.
URL: https://openreview.net/forum?id=ruZksIJBBd
---
Title: CycleSL: Server-Client Cyclical Update Driven Scalable Split Learning
Abstract: Over the past few years, split learning has developed by leaps and bounds due to the shift towards distributed computing and the demand for data privacy. To increase the scalability of sequential split learning, scalable variants such as parallel split learning and split federated learning have been proposed, which often entail huge computation and memory consumption on the server side, thus limiting their scalability. Moreover, prior aggregation-based methods generally converge at an inferior rate and quality due to factors such as client drift and lag, whilst existing aggregation-free methods cannot fully benefit from parallelism. In this paper, we present a novel aggregation-free split learning paradigm termed CycleSL, which can be integrated into existing algorithms to boost model performance while consuming fewer resources. Inspired by alternating coordinate descent, CycleSL models the training task on the server side as a standalone higher-level machine learning task and updates the server and client in cyclical turns through the reuse of smashed data. Benefiting from feature resampling and alternating gradient steps, CycleSL has great potential to advance model performance and robustness. We integrate CycleSL into previous algorithms and benchmark them on four publicly available datasets with non-iid data distribution and partial client attendance. Our results show that CycleSL can notably improve model performance and convergence.
URL: https://openreview.net/forum?id=IU8t8j90vs
---
Title: Towards Formalizing Spuriousness of Biased Datasets Using Partial Information Decomposition
Abstract: Spuriousness arises when there is an association between two or more variables in a dataset that are not causally related. In this work, we propose an explainability framework to preemptively disentangle the nature of such spurious associations in a dataset before model training. We leverage a body of work in information theory called Partial Information Decomposition (PID) to decompose the total information about the target into four non-negative quantities namely unique information (in core and spurious features respectively), redundant information, and synergistic information. Our framework helps anticipate when the core or spurious feature is indispensable, when either suffice, and when both are jointly needed for an optimal classifier trained on the dataset. Next, we leverage this decomposition to propose a novel measure of the spuriousness of a dataset. We arrive at this measure systematically by examining several candidate measures, and demonstrating what they capture and miss through intuitive canonical examples and counterexamples. Our framework Spurious Disentangler consists of segmentation, dimensionality reduction, and estimation modules, with capabilities to specifically handle high dimensional image data efficiently. Finally, we also perform empirical evaluation to demonstrate the trends of unique, redundant, and synergistic information, as well as our proposed spuriousness measure across $6$ benchmark datasets under various experimental settings. We observe an agreement between our preemptive measure of dataset spuriousness and post-training model generalization metrics such as worst-group accuracy, further supporting our proposition.
URL: https://openreview.net/forum?id=zw6UAPYmyx
---
Title: Trustworthy and Responsible AI for Human-Centric Autonomous Decision-Making Systems
Abstract: Artificial Intelligence (AI) has paved the way for revolutionary decision-making processes, which, if harnessed appropriately, can contribute to advancements in various sectors, from healthcare to economics. However, its black box nature presents significant ethical challenges related to bias and transparency. AI applications are heavily impacted by biases, producing inconsistent and unreliable findings, incurring significant costs and consequences, and highlighting and perpetuating inequalities and unequal access to resources. Hence, developing safe, reliable, ethical, and Trustworthy AI systems is essential. Our interdisciplinary team of researchers focuses on Trustworthy and Responsible AI, including fairness, bias mitigation, reproducibility, generalization, interpretability, and authenticity. In this paper, we review and discuss the intricacies of AI biases, definitions, methods of detection and mitigation, and metrics for evaluating bias. We also discuss open challenges with regard to the trustworthiness and widespread application of AI across diverse domains of human-centric decision making, as well as guidelines to foster Responsible and Trustworthy AI models.
URL: https://openreview.net/forum?id=1k833OTHpI
---
Title: Two Is Better Than One: Aligned Representation Pairs for Anomaly Detection
Abstract: Anomaly detection focuses on identifying samples that deviate from the norm. Discovering informative representations of normal samples is crucial to detecting anomalies effectively. Recent self-supervised methods have successfully learned such representations by employing prior knowledge about anomalies to create synthetic outliers during training. However, we often do not know what to expect from unseen data in specialized real-world applications. In this work, we address this limitation with our new approach Con$_2$, which leverages symmetries in normal samples to observe the data in different contexts. Con$_2$ clusters representations according to their context and simultaneously aligns their positions to learn an informative representation space that is structured according to the properties of normal data. Anomalies do not adhere to the same structure as normal data, making their representations deviate from the learned context clusters. We demonstrate the benefit of this approach in extensive experiments on specialized medical datasets, outperforming competitive baselines based on self-supervised learning and pretrained models.
URL: https://openreview.net/forum?id=Bt0zdsnWYc
---
Title: RePrompt: Reflection-based Automatic Prompting for LLM Agents
Abstract: In the past year, large language models (LLMs) have had remarkable success in domains outside traditional natural language processing, and their capacity is further expanded into the so-called LLM agents when connected with external tools. In all domains, the prompt to the LLMs has been shown to make a big difference in what the LLM would generate and thus affect the performance of the LLM agents. Therefore, automatic prompt engineering (APE) has become an important question for many researchers and users of LLMs. However, previous works in APE rely on a final checker to evaluate the performance of the given prompt -- a requirement that is hard to meet in the case of LLM agents, where intermediate feedback is easier to obtain, and the final evaluation could be expensive, inaccurate, or even missing. In this paper, we propose a novel method, \textsc{RePrompt}, which takes a ``gradient descent''-like approach to optimizing the step-by-step instructions in the prompts given to LLM agents, based on the chat history obtained from interactions and reflections with LLM agents. By leveraging intermediate feedback, \textsc{RePrompt} can optimize the prompt without the need for a final solution checker. We evaluate our approach on PDDL generation, TravelPlanner, and Meeting Planning to show that our method generally improves performance for different reasoning tasks.
URL: https://openreview.net/forum?id=76F00wmbl3
---
Title: Chimera: State Space Models Beyond Sequences
Abstract: Transformer-based deep learning methods have emerged as the standard approach to model diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires the use of inductive biases, such as position embeddings in sequences and images, and random walks in graphs, to incorporate topology. However, developing bespoke inductive biases for each task requires significant effort and can also introduce side-effects hindering generalization. In this work, we introduce Chimera, a unified model that directly incorporates the data topology in a principled way, obviating the need for domain-specific biases. Central to Chimera is the observation that state-space models---which naturally do not require position embeddings---can be generalized to capture any general graph topology. Our experiments demonstrate the versatility of our approach---Chimera achieves strong performance across the domains of language, vision, and graphs, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all the baselines on the Long Range Graph Benchmark. Our results validate Chimera's principled methodological contributions and affirm the long-held belief that data topology is a powerful inductive bias across modalities. We further propose algorithmic optimizations to improve Chimera's efficiency while maintaining performance: 1) For the subclass of Directed Acyclic Graphs we show that Chimera can be implemented as a linear time recurrence. 2) For general graphs, we relax the method with a simple mathematical approximation, achieving Transformer's quadratic complexity without relying on domain-specific biases.
URL: https://openreview.net/forum?id=yv0TUssepk
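To make the directed-acyclic-graph special case concrete, the following is a minimal sketch (our own illustration, not the released code) of a state-space-style linear recurrence over nodes visited in topological order, where each node's state mixes its parents' states with its own features:

```python
import torch

def dag_ssm_states(x, parents, A, B):
    """x: (num_nodes, d) node features listed in topological order.
    parents: list where parents[i] holds the indices of node i's parents (all < i).
    A, B: (d, d) learned state and input matrices (illustrative shapes)."""
    num_nodes, d = x.shape
    h = torch.zeros(num_nodes, d)
    for i in range(num_nodes):
        prev = h[parents[i]].mean(dim=0) if parents[i] else torch.zeros(d)
        h[i] = prev @ A.T + x[i] @ B.T   # analogue of h_t = A h_{t-1} + B x_t on sequences
    return h
```

On a path graph (each node's only parent is its predecessor) this reduces to the usual sequential state-space recurrence, which is one way to read the linear-time claim for DAGs.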
---
Title: Reproducibility Study of “Vision Transformers Need Registers”
Abstract: Vision Transformers (ViTs) have achieved state-of-the-art (SOTA) performance in numerous tasks. However, the emergence of high-norm artifact tokens in supervised and self-supervised ViTs hinders the interpretability of such models' attention maps. This study reproduces and validates previous work addressing this issue through the use of register tokens (learnable placeholders added to the input sequence) that mitigate artifacts and yield smoother feature maps. We evaluate the presence of artifacts in various ViT models, namely the DeiT-III and DINOv2 architectures, and investigate the impact of fine-tuning pre-trained ViTs with register tokens and additional regularization. By conducting experiments on pre-trained and fine-tuned models, we confirm that register tokens eliminate artifacts and improve the interpretability of attention maps.
URL: https://openreview.net/forum?id=CZk4QL3D2v
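For readers unfamiliar with register tokens, the following is a minimal sketch of the mechanism being reproduced: learnable placeholder tokens are appended to the ViT token sequence and discarded after the transformer blocks. Class and module names here are illustrative, not taken from the study's code.

```python
import torch
import torch.nn as nn

class RegisterTokenWrapper(nn.Module):
    """Minimal sketch: append N learnable register tokens to a ViT's token
    sequence and drop them again after the transformer blocks."""
    def __init__(self, blocks, dim, num_registers=4):
        super().__init__()
        self.blocks = blocks                                    # existing transformer blocks
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))

    def forward(self, tokens):                                  # tokens: (B, N, dim), e.g. [CLS] + patches
        b = tokens.shape[0]
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([tokens, reg], dim=1)                     # append register tokens
        x = self.blocks(x)
        return x[:, : tokens.shape[1]]                          # discard register outputs
```

The registers give the model extra "scratch" slots, which is the proposed explanation for why high-norm artifacts no longer appear on patch tokens.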
---
Title: Modularity is the Bedrock of Natural and Artificial Intelligence
Abstract: The astonishing performance showcased by AI systems in the last decade has been achieved through the use of massive amounts of data, computation, and, in turn, energy, which vastly exceed what human intelligence requires. This wide gap underscores the need for further research and points to leveraging brains as a valuable source of guiding principles. On the other hand, the No Free Lunch Theorem highlights that effective inductive biases must be problem-specific. This suggests designing architectures with specialized components that can solve subproblems --- namely, modular architectures. Interestingly, modularity is an established principle of brain organization that is considered essential for supporting the efficient learning and strong generalization abilities consistently demonstrated by humans. However, despite its recognized importance in natural intelligence and the proven benefits it has shown across various seemingly unrelated AI research areas, modularity remains somewhat underappreciated in AI. In this work, we review several research threads in artificial intelligence and neuroscience through a conceptual framework that highlights the central role of modularity in supporting both artificial and natural intelligence. In particular, we examine what computational advantages modularity provides, how it has emerged as a solution in several AI research areas, which modularity principles the brain exploits, and how modularity can help bridge the gap between natural and artificial intelligence.
URL: https://openreview.net/forum?id=bLTMjnUMfw
---
Title: Unlearning Misalignment for Personalized LLM Adaptation via Instance-Response-Dependent Discrepancies
Abstract: While Large Language Models (LLMs) have revolutionized chatbot interactions, they often fall short in aligning responses with the nuanced preferences of individual users, a challenge rooted in the inherently subjective and proprietary nature of user preferences. Consequently, prompt-based learning, though effective at enhancing factual accuracy due to its emphasis on universal correctness, remains insufficient for achieving accurate personalized response alignment. Because user preferences vary widely across individuals and contexts, aligning responses requires a more personalized and context-aware approach. To address this limitation, we propose Consistent Marginalization (CM), a novel framework that aims to unlearn misalignment by constructing a personalized memory bank of instance-response-dependent discrepancies, built from a small set of user preference samples. This personalized memory bank equips LLMs with the ability to understand, recall, and adapt to individual preferences, enabling more consistent and personalized responses. Evaluated across a diverse range of domain-specific datasets and model architectures, CM yields notable improvements in response alignment and robustness. We believe Consistent Marginalization is a valuable step toward enabling LLMs to become genuinely personable and adaptive conversational agents that understand user preferences and generate responses better aligned with individual user expectations.
URL: https://openreview.net/forum?id=njE3swFBMc
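The abstract does not specify the memory bank's internals; the following is a loosely hedged sketch of one plausible realization, in which instance embeddings key a store of response discrepancies that are retrieved by cosine similarity. All names and the retrieval rule are assumptions, not the paper's method.

```python
import torch
import torch.nn.functional as F

class PreferenceMemory:
    """Hypothetical per-user memory bank: stores (instance embedding, note on the
    discrepancy between the model's response and the user's preferred one) and
    retrieves the closest entries to condition future responses."""
    def __init__(self):
        self.keys, self.notes = [], []

    def add(self, instance_emb, discrepancy_note):
        self.keys.append(F.normalize(instance_emb, dim=-1))
        self.notes.append(discrepancy_note)

    def retrieve(self, query_emb, k=3):
        if not self.keys:
            return []
        sims = torch.stack(self.keys) @ F.normalize(query_emb, dim=-1)  # cosine similarities
        top = sims.topk(min(k, len(self.notes))).indices.tolist()
        return [self.notes[i] for i in top]
```

Retrieved notes could then be prepended to the prompt so the LLM's next response accounts for previously observed preference mismatches.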
---
Title: Removing Strong Attribute Bias from Neural Networks with Adversarial Filtering
Abstract: Ensuring a neural network is not relying on protected attributes (e.g., race, sex, age) for prediction is crucial in advancing fair and trustworthy AI. While several promising methods for removing attribute bias in neural networks have been proposed, their limitations remain under-explored. To that end, in this work, we mathematically and empirically reveal the limitation of existing attribute bias removal methods in the presence of strong bias and propose a new method that can mitigate this limitation. Specifically, we first derive a general non-vacuous information-theoretical upper bound on the performance of any attribute bias removal method in terms of the bias strength, revealing that such methods are effective only when the inherent bias in the dataset is relatively weak. Inspired by this theoretical finding, we then propose a new method using an adversarial objective that directly filters out protected attributes in the input space while maximally preserving all other attributes, without requiring any specific target label. The proposed method achieves state-of-the-art performance in both strong and moderate bias settings. We provide extensive experiments on synthetic, image, and census datasets to verify the derived theoretical bound and its consequences in practice, and to evaluate the effectiveness of the proposed method in removing strong attribute bias.
URL: https://openreview.net/forum?id=jxLFhuofgf
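A minimal sketch of an adversarial filtering objective of the kind described above, assuming a filter network that maps inputs to same-shape filtered inputs and an adversary that tries to recover the protected attribute from them; the exact losses and weighting are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def filtering_losses(filter_net, adversary, x, protected, lam=1.0):
    """Return (filter_loss, adversary_loss) for one batch.
    filter_net: maps x to a filtered version of x (same shape).
    adversary: predicts the protected attribute from the filtered input."""
    x_filtered = filter_net(x)
    recon = F.mse_loss(x_filtered, x)                 # preserve all other attributes
    attr_logits = adversary(x_filtered)
    adv = F.cross_entropy(attr_logits, protected)     # adversary's prediction loss
    filter_loss = recon - lam * adv                   # filter: reconstruct, but fool the adversary
    adversary_loss = adv                              # adversary: recover the protected attribute
    return filter_loss, adversary_loss
```

The two losses would be minimized in alternation, as in standard adversarial training; note that no task label appears anywhere, matching the "no specific target label" property claimed in the abstract.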
---
Title: Joint Diffusion for Universal Hand-Object Grasp Generation
Abstract: In this work, we focus on generating both the hand and the object in a grasp with a single diffusion model. Our proposed Joint Hand-Object Diffusion (JHOD) models the hand and object in a unified latent representation. It leverages large-scale object datasets to learn an inclusive object latent embedding, and it uses hand-object grasping data to learn to fit the hand and object embeddings together into grasps. With an object provided as an optional condition, the diffusion model generates grasps conditioned on that object; without one, it generates grasps unconditionally. Compared to the usual practice of learning object-conditioned grasp generation from hand-object grasp data alone, our method benefits from the more diverse object data used for training and handles grasp generation more universally. In both qualitative and quantitative experiments, conditional and unconditional grasp generation achieve good visual plausibility and diversity, and the proposed method generalizes well to unseen object shapes. The code and weights will be made public upon acceptance.
URL: https://openreview.net/forum?id=TZ0ztsYR6x
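As an illustration of how one diffusion model can support both conditional and unconditional grasp generation, here is a hedged sketch of a single training step using random dropout of the object condition; the noising scheme and all names are simplifications and assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def joint_diffusion_step(denoiser, hand_latent, object_latent, cond_drop_prob=0.2):
    """One illustrative denoising-training step on a joint hand-object latent.
    denoiser(zt, t, cond) is a stand-in for the model; dropping `cond` at random
    lets the same network generate grasps with or without a given object."""
    z0 = torch.cat([hand_latent, object_latent], dim=-1)      # unified latent representation
    t = torch.rand(z0.shape[0], 1)                            # noise level per sample in [0, 1)
    noise = torch.randn_like(z0)
    zt = (1 - t) * z0 + t * noise                             # simplified noising schedule
    drop = torch.rand(z0.shape[0], 1) < cond_drop_prob
    cond = torch.where(drop, torch.zeros_like(object_latent), object_latent)
    pred = denoiser(zt, t, cond)                              # predict the injected noise
    return F.mse_loss(pred, noise)
```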
---