Accepted papers
===============
Title: Policy Optimization via Adv2: Adversarial Learning on Advantage Functions
Authors: Matthieu Jonckheere, Chiara Mignacco, Gilles Stoltz
Abstract: We revisit the reduction of learning in adversarial Markov decision processes (MDPs) to adversarial learning based on $Q$-values; this reduction has been considered in a number of recent articles as one building block to perform policy optimization. Namely, we first consider and extend this reduction in an ideal setting where an oracle provides value functions: it may involve any adversarial learning strategy (not just exponential weights) and it may be based indifferently on $Q$-values or on advantage functions. We then present two extensions: on the one hand, convergence of the last iterate for a vast class of adversarial learning strategies (again, not just exponential weights) satisfying a property called monotonicity of weights; on the other hand, stronger regret criteria for learning in MDPs, inherited from the stronger regret criteria of adversarial learning called strongly adaptive regret and tracking regret. Third, we demonstrate how adversarial learning, also referred to as aggregation of experts, relates to the aggregation (orchestration) of expert policies: via yet another simple reduction, we obtain stronger forms of performance guarantees in this setting than existing ones. Finally, we discuss the impact of the reduction of learning in adversarial MDPs to adversarial learning in the practical scenarios where transition kernels are unknown and value functions must be learned. In particular, we review the literature and note that many strategies for policy optimization feature a policy-improvement step based on exponential weights with estimated $Q$-values. Our main message is that this step may be replaced by the application of any adversarial learning strategy on estimated $Q$-values or on estimated advantage functions. We leave the empirical evaluation of these twists for future research.
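The policy-improvement step that the abstract describes, exponential weights applied to estimated advantages, can be sketched in a few lines. The toy example below is illustrative only: the learning rate `eta` and the advantage values are made up, and it covers a single state.

```python
import numpy as np

def exponential_weights_step(policy, advantages, eta=0.5):
    """Reweight each action by exp(eta * advantage), then renormalize."""
    new_policy = policy * np.exp(eta * advantages)
    return new_policy / new_policy.sum()

pi = np.array([1 / 3, 1 / 3, 1 / 3])      # uniform policy over 3 actions
adv = np.array([0.2, -0.1, 0.5])          # estimated advantages (made up)
pi_next = exponential_weights_step(pi, adv)
print(pi_next)  # mass shifts toward the highest-advantage action
```

Swapping the `exp` reweighting for the update of some other adversarial learning strategy is the kind of replacement the abstract suggests.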
URL: https://openreview.net/forum?id=Oyueig10Ed
---
Title: Node Classification With Reject Option
Authors: Uday Bhaskar Kuchipudi, Jayadratha Gayen, Charu Sharma, Naresh Manwani
Abstract: One of the key tasks in graph learning is node classification. While graph neural networks (GNNs) have been used for various applications, their adaptability to reject-option settings has not been previously explored. In this paper, we propose NCwR, a novel approach to node classification with GNNs that integrates a reject option, allowing the model to abstain from making predictions for samples with high uncertainty. We propose cost-based and coverage-based methods for classification with abstention in node classification settings using GNNs. We perform experiments with our method on the standard citation network datasets Cora, CiteSeer, PubMed, and ogbn-arxiv. We also model the legal judgment prediction problem on the ILDC dataset as a node classification problem, where nodes represent legal cases and edges represent citations. We further interpret the model by analyzing the cases in which it abstains from predicting and visualizing which parts of the input features influenced this decision.
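As a concrete illustration of abstaining on high-uncertainty samples, here is a generic confidence-threshold rejection rule (a sketch, not the paper's NCwR method; the threshold value is arbitrary):

```python
import numpy as np

def predict_with_reject(probs, threshold=0.7):
    """Predict the argmax class, but return -1 (abstain) when the top
    class probability falls below the confidence threshold."""
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = -1
    return preds

probs = np.array([[0.90, 0.05, 0.05],   # confident -> predict class 0
                  [0.40, 0.35, 0.25]])  # uncertain -> abstain
out = predict_with_reject(probs)
print(out)  # [ 0 -1]
```

Cost-based variants tie the threshold to the price of a wrong prediction versus an abstention; coverage-based variants tune it to hit a target fraction of accepted samples.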
URL: https://openreview.net/forum?id=4xXJDO8Bvu
---
Title: Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models
Authors: Yang Zhang, Chenjia Bai, Bin Zhao, Junchi Yan, Xiu Li, Xuelong Li
Abstract: Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue across different numbers of agents in a centralized architecture, and the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first pioneering Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Extensive results on the StarCraft Multi-Agent Challenge (SMAC) and MAMujoco demonstrate superior sample efficiency and overall performance compared to strong model-free approaches and existing model-based methods.
URL: https://openreview.net/forum?id=xT8BEgXmVc
---
Title: Hard-Negative Sampling for Contrastive Learning: Optimal Representation Geometry and Neural- vs Dimensional-Collapse
Authors: Ruijie Jiang, Thuan Nguyen, Shuchin Aeron, Prakash Ishwar
Abstract: For a widely-studied data model and general loss and sample-hardening functions we prove that the losses of Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) are minimized by representations that exhibit Neural-Collapse (NC), i.e., the class means form an Equiangular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) losses are lower bounded by the corresponding SCL and UCL losses. In contrast to existing literature, our theoretical results for SCL do not require class-conditional independence of augmented views and work for a general loss function class that includes the widely used InfoNCE loss function. Moreover, our proofs are simpler, compact, and transparent. Similar to existing literature, our theoretical claims also hold for the practical scenario where batching is used for optimization. We empirically demonstrate, for the first time, that Adam optimization (with batching) of HSCL and HUCL losses with random initialization and suitable hardness levels can indeed converge to the NC-geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard-negatives or feature normalization, however, the representations learned via Adam suffer from Dimensional-Collapse (DC) and fail to attain the NC-geometry. These results exemplify the role of hard-negative sampling in contrastive representation learning and we conclude with several open theoretical problems for future work. The code can be found at https://github.com/rjiang03/HCL/tree/main
URL: https://openreview.net/forum?id=3cnpZ5SIjU
---
Title: Infrastructure for AI Agents
Authors: Alan Chan, Kevin Wei, Sihao Huang, Nitarshan Rajkumar, Elija Perrier, Seth Lazar, Gillian K Hadfield, Markus Anderljung
Abstract: \textbf{AI agents} plan and execute interactions in open-ended environments. For example, OpenAI's Operator can use a web browser to do product comparisons and buy online goods. To facilitate beneficial interactions and mitigate harmful ones, much research focuses on directly modifying agent behaviour. For example, developers can train agents to follow user instructions. This focus on direct modifications is useful, but insufficient. We will also need external protocols and systems that shape how agents interact with institutions and other actors. For instance, agents will need more efficient protocols to communicate with each other and form agreements. In addition, attributing an agent's actions to a particular human or other legal entity can help to establish trust, and also disincentivize misuse. Given this motivation, we propose the concept of \textbf{agent infrastructure}: technical systems and shared protocols external to agents that are designed to mediate and influence their interactions with and impacts on their environments. Just as the Internet relies on protocols like HTTPS, our work argues that agent infrastructure will be similarly indispensable to ecosystems of agents. We identify three functions for agent infrastructure: 1) attributing actions, properties, and other information to specific agents, their users, or other actors; 2) shaping agents' interactions; and 3) detecting and remedying harmful actions from agents. We provide an incomplete catalog of research directions for such functions. For each direction, we include analysis of use cases, infrastructure adoption, relationships to existing (internet) infrastructure, limitations, and open questions. Making progress on agent infrastructure can prepare society for the adoption of more advanced agents.
URL: https://openreview.net/forum?id=Ckh17xN2R2
---
Title: Conformal Bounds on Full-Reference Image Quality for Imaging Inverse Problems
Authors: Jeffrey Wen, Rizwan Ahmad, Philip Schniter
Abstract: In imaging inverse problems, we would like to know how close the recovered image is to the true image in terms of full-reference image quality (FRIQ) metrics like PSNR, SSIM, LPIPS, etc. This is especially important in safety-critical applications like medical imaging, where knowing that, say, the SSIM was poor could potentially avoid a costly misdiagnosis. But since we don’t know the true image, computing FRIQ is non-trivial. In this work, we combine conformal prediction with approximate posterior sampling to construct bounds on FRIQ that are guaranteed to hold up to a user-specified error probability. We demonstrate our approach on image denoising and accelerated magnetic resonance imaging (MRI) problems.
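The conformal-prediction machinery behind such bounds can be illustrated with a generic split-conformal quantile computation (a sketch under simplified assumptions, not the paper's exact procedure):

```python
import numpy as np

def conformal_upper_bound(cal_scores, alpha=0.1):
    """Split-conformal quantile: given calibration nonconformity scores,
    return a bound that a fresh exchangeable score stays below with
    probability at least 1 - alpha."""
    n = len(cal_scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return np.quantile(cal_scores, q_level, method="higher")

rng = np.random.default_rng(0)
cal = rng.normal(size=1000)               # stand-in calibration scores
bound = conformal_upper_bound(cal, alpha=0.1)
print(bound)  # roughly the 90th-percentile calibration score
```

In the paper's setting, the scores would quantify how far an FRIQ metric computed against approximate posterior samples can deviate from the true (unobservable) FRIQ value.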
URL: https://openreview.net/forum?id=WADLPccB6o
---
Title: G-RepsNet: A Lightweight Construction of Equivariant Networks for Arbitrary Matrix Groups
Authors: Sourya Basu, Suhas Lohit, Matthew Brand
Abstract: Group equivariance is a strong inductive bias useful in a wide range of deep learning tasks. However, constructing efficient equivariant networks for general groups and domains is difficult. Recent work by Finzi et al. directly solves the equivariance constraint for arbitrary matrix groups to obtain equivariant MLPs (EMLPs). But this method does not scale well and scaling is crucial in deep learning.
Here, we introduce Group Representation Networks (G-RepsNets), a lightweight equivariant network for arbitrary matrix groups with features represented using tensor polynomials. The key insight in our design is that using tensor representations in the hidden layers of a neural network along with simple inexpensive tensor operations leads to scalable equivariant networks. Further, these networks are universal approximators of functions equivariant to orthogonal groups. We find G-RepsNet to be competitive with EMLP on several tasks with group symmetries such as $O(5)$, $O(1, 3)$, and $O(3)$ with scalars, vectors, and second-order tensors as data types.
On image classification tasks, we find that G-RepsNet using second-order representations is competitive with, and often even outperforms, sophisticated state-of-the-art equivariant models such as GCNNs and $E(2)$-CNNs. To further illustrate the generality of our approach, we show that G-RepsNet is competitive with G-FNO and EGNN on N-body prediction and PDE-solving tasks respectively, while being efficient.
URL: https://openreview.net/forum?id=k1eYngOvf0
---
Title: Retrieve, Merge, Predict: Augmenting Tables with Data Lakes
Authors: Riccardo Cappuzzo, Aimee Coelho, Félix Lefebvre, Paolo Papotti, Gaël Varoquaux
Abstract: Machine learning from a disparate set of tables, a data lake, requires assembling features by merging and aggregating tables. Data discovery can extend autoML to data tables by automating these steps. We present an in-depth analysis of such automated table augmentation for machine learning tasks, analyzing different methods for the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for benchmarking this data discovery task. Systematic exploration on both lakes outlines 1) the importance of accurately retrieving join candidates, 2) the efficiency of simple merging methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental environment is easily reproducible and based on open data, to foster more research on feature engineering, autoML, and learning in data lakes.
URL: https://openreview.net/forum?id=4uPJN6yfY1
---
Title: Foundation Models Meet Federated Learning: A One-shot Feature-sharing Method with Privacy and Performance Guarantees
Authors: Mahdi Beitollahi, Alex Bie, Sobhan Hemati, Leo Maxime Brunswic, Xu Li, Xi Chen, Guojun Zhang
Abstract: Adapting foundation models for downstream tasks via Federated Learning (FL) is a promising strategy for protecting privacy while leveraging the capability of foundation models. However, FL's iterative training and model transmission result in high communication costs and GPU memory demands, making large foundation models impractical for FL. This paper introduces a one-shot FL method with a server-side performance bound to enable foundation models by reducing communication costs and GPU memory requirements. Our approach, FedPFT (FL with Parametric Feature Transfer), involves clients learning and transferring parametric models for features extracted from frozen foundation models in a single round. Parametric models are then used to generate synthetic features at the server to train a classifier head. We evaluate FedPFT across eight vision datasets using three vision foundation models. Our findings demonstrate that FedPFT is agnostic to data heterogeneity and network topology and it enhances the communication-accuracy frontier up to 7.8\%. Finally, we show FedPFT's compatibility with differential privacy and its resilience against reconstruction attacks. Our work highlights the capability of private, feature-sharing methods for one-shot knowledge transfer using foundation models.
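To make the feature-transfer idea concrete, here is a toy sketch in the spirit of FedPFT: a client summarizes frozen-backbone features with per-class diagonal Gaussians, and the server trains a simple head on synthetic samples. The diagonal Gaussians and the nearest-class-mean head are illustrative stand-ins for the paper's parametric models and classifier head:

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Client side: summarize features with a diagonal Gaussian per class."""
    stats = {}
    for c in np.unique(labels):
        x = features[labels == c]
        stats[c] = (x.mean(axis=0), x.std(axis=0) + 1e-6)
    return stats

def train_head_from_stats(stats, n_per_class=200, seed=0):
    """Server side: sample synthetic features from the transferred
    Gaussians and fit a nearest-class-mean head on them."""
    rng = np.random.default_rng(seed)
    centroids = {}
    for c, (mu, sigma) in stats.items():
        synth = rng.normal(mu, sigma, size=(n_per_class, len(mu)))
        centroids[c] = synth.mean(axis=0)
    return centroids

def predict(centroids, x):
    classes = list(centroids)
    dists = [np.linalg.norm(x - centroids[c], axis=1) for c in classes]
    return np.array(classes)[np.argmin(dists, axis=0)]

# Toy run: two well-separated feature clusters.
rng = np.random.default_rng(1)
feats = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(5, 1, (100, 8))])
labs = np.array([0] * 100 + [1] * 100)
head = train_head_from_stats(fit_class_gaussians(feats, labs))
acc = (predict(head, feats) == labs).mean()
print(acc)  # high accuracy on this easily separable toy data
```

Only the per-class statistics cross the network, in one round, which is the source of the communication savings the abstract reports.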
URL: https://openreview.net/forum?id=55593xywWG
---
Title: NITO: Neural Implicit Fields for Resolution-free and Domain-Adaptable Topology Optimization
Authors: Amin Heyrani Nobari, Lyle Regenwetter, Giorgio Giannone, Faez Ahmed
Abstract: Structural topology optimization plays a crucial role in engineering by determining the optimal material layout within a design space to maximize performance under given constraints. We introduce Neural Implicit Topology Optimization (NITO), a deep learning regression approach to accelerate topology optimization tasks.
We demonstrate that, compared to state-of-the-art diffusion models, NITO generates structures with less than 15% of their structural sub-optimality, and does so ten times faster. Furthermore, we show that NITO is entirely resolution-free and domain-agnostic, offering a more scalable solution than current fixed-resolution, domain-specific diffusion models.
To achieve this state-of-the-art performance, NITO combines three key innovations. First, we introduce the Boundary Point Order-Invariant MLP (BPOM), which represents loads and supports in a sparse and domain-agnostic manner, allowing NITO to train on variable conditioning, domain shapes, and mesh resolutions. Second, we adopt a neural implicit field representation, which allows NITO to synthesize topologies of any shape or resolution. Finally, we propose an inference-time refinement step using a few steps of gradient-based optimization to enable NITO to achieve results comparable to direct optimization methods. These three innovations empower NITO with a precision and versatility that is currently unparalleled among competing deep learning approaches for topology optimization. Code & Data: https://github.com/ahnobari/NITO_Public
URL: https://openreview.net/forum?id=XHXAvACdgv
---
Title: An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration
Authors: Hiroki Naganuma, Ryuichiro Hataya, Kotaro Yoshida, Ioannis Mitliagkas
Abstract: In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy. Different from most prior work that has focused on advancing learning algorithms, we systematically examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration on downstream tasks. We evaluated 100 models across diverse pre-trained model sizes, five pre-training datasets, and five data augmentations through extensive experiments on four distribution shift datasets totaling over 120,000 GPU hours. Our results demonstrate the significant impact of pre-trained model selection, with optimal choices substantially improving OOD accuracy over algorithm improvement alone. Additionally, we find that larger models and bigger pre-training datasets not only enhance OOD performance but also improve calibration, helping to mitigate overconfidence, contrary to some prior studies that found modern deep networks to calibrate worse than classical shallow models. Our work underscores the overlooked importance of pre-trained model selection for out-of-distribution generalization and calibration.
URL: https://openreview.net/forum?id=tYjoHjShxF
---
Title: Robust Offline Imitation Learning from Diverse Auxiliary Data
Authors: Udita Ghosh, Dripta S. Raychaudhuri, Jiachen Li, Konstantinos Karydis, Amit Roy-Chowdhury
Abstract: Offline imitation learning enables learning a policy solely from a set of expert demonstrations, without any environment interaction. To alleviate the issue of distribution shift arising due to the small amount of expert data, recent works incorporate large numbers of auxiliary demonstrations alongside the expert data. However, the performance of these approaches relies on assumptions about the quality and composition of the auxiliary data, and they are rarely successful when those assumptions do not hold. To address this limitation, we propose Robust Offline Imitation from Diverse Auxiliary Data (ROIDA). ROIDA first identifies high-quality transitions from the entire auxiliary dataset using a learned reward function. These high-reward samples are combined with the expert demonstrations for weighted behavioral cloning. For lower-quality samples, ROIDA applies temporal difference learning to steer the policy towards high-reward states, improving long-term returns. This two-pronged approach enables our framework to effectively leverage both high- and low-quality data without any assumptions. Extensive experiments validate that ROIDA achieves robust and consistent performance across multiple auxiliary datasets with diverse ratios of expert and non-expert demonstrations. ROIDA effectively leverages unlabeled auxiliary data, outperforming prior methods reliant on specific data assumptions.
URL: https://openreview.net/forum?id=Hy2KAldqAo
---
Title: Rethinking the Value of Training-Free Structured Pruning of LLMs
Authors: Nahush Lele, Arnav Chavan, Aryamaan Thakur, Deepak Gupta
Abstract: This paper investigates the effectiveness of training-free structured pruning techniques for Large Language Models (LLMs), with a particular focus on depth and width pruning strategies. Through an extensive empirical evaluation across a diverse range of tasks, datasets and modalities, we reveal critical limitations in current pruning methods. While some tasks exhibit minimal performance degradation, others face significant deterioration, even at low pruning rates, contradicting prior findings that often rely on selective benchmarks. Our analysis also finds that depth pruning, despite its simplicity, usually outperforms the more granular width pruning approaches in maintaining downstream task performance. Our findings highlight that existing evaluations of pruned LLMs often overstate their effectiveness due to incomplete or limited evaluation tasks, necessitating a critical reassessment of the true value of pruning and emphasizing the need to explore more robust pruning algorithms.
URL: https://openreview.net/forum?id=7KkytYYhMv
---
Title: FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback
Authors: Liqiang Jing, Xinya Du
Abstract: Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between the text and image modalities, which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) general feedback cannot indicate the hallucination type contained in the response; (2) sparse rewards only give a sequence-level reward for the whole response; and (3) annotation is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through \textbf{F}ine-\textbf{G}rained \textbf{A}rtificial \textbf{I}ntelligence \textbf{F}eedback (\textbf{FGAIF}), which mainly consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Reward. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with RL-based alignment methods, our proposed method is effective even with fewer parameters.
URL: https://openreview.net/forum?id=Qhfw5CUVd7
---
Title: Noise-free Loss Gradients: A Surprisingly Effective Baseline for Coreset Selection
Authors: Saumyaranjan Mohanty, Chimata Anudeep, Konda Reddy Mopuri
Abstract: The exponential rise in the size and complexity of deep learning models and datasets has resulted in a considerable demand for computational resources. Coreset selection is one of the methods to alleviate this rising demand. The goal is to select a subset from a large dataset to train a model that performs almost at par with one trained on the full dataset, while reducing computational time and resource requirements. Existing approaches either attempt to identify remarkable samples (e.g., Forgetting, Adversarial DeepFool, EL2N, etc.) that stand out from the rest, or solve complex optimization problems (e.g., submodular maximization, OMP) to compose the coresets. This paper proposes a novel and intuitive approach to efficiently select a coreset based on the similarity of loss gradients. Our method works on the hypothesis that gradients of samples belonging to a given class will point in similar directions during the early training phase. Samples whose neighbours mostly produce similar gradient directions, in other words, samples that produce noise-free gradients, will represent that class. Through extensive experimentation, we demonstrate the effectiveness of our approach in outperforming state-of-the-art coreset selection algorithms on a range of benchmark datasets from CIFAR-10 to ImageNet, with architectures of varied complexity (ResNet-18, ResNet-50, VGG-16, ViT). We also demonstrate the effectiveness of our approach in generative modelling by using coreset selection to reduce training time for various GAN models (DCGAN, MSGAN, SAGAN, SNGAN) on different datasets (CIFAR-10, CIFAR-100, Tiny ImageNet) without significantly impacting the performance metrics. Source code is provided at URL.
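The gradient-similarity hypothesis can be sketched as follows (an illustrative reconstruction from the abstract, not the authors' code): score each sample by how well its early-training gradient agrees with the other gradients of its class, and keep the top scorers.

```python
import numpy as np

def select_coreset(grads, labels, budget_per_class):
    """Within each class, score every sample by its mean cosine similarity
    to the class's other gradients, then keep the highest-scoring
    ('noise-free') samples. `grads` holds one per-sample loss gradient."""
    normed = grads / np.linalg.norm(grads, axis=1, keepdims=True)
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        sims = normed[idx] @ normed[idx].T
        scores = (sims.sum(axis=1) - 1) / (len(idx) - 1)  # exclude self-similarity
        keep.extend(idx[np.argsort(scores)[::-1][:budget_per_class]])
    return np.array(sorted(keep))

rng = np.random.default_rng(0)
grads = np.vstack([rng.normal([1, 0], 0.2, (50, 2)),    # class-0 gradient direction
                   rng.normal([0, 1], 0.2, (50, 2))])   # class-1 gradient direction
labels = np.array([0] * 50 + [1] * 50)
coreset = select_coreset(grads, labels, budget_per_class=10)
print(len(coreset))  # 20 selected indices, 10 per class
```

Real gradients live in very high dimensions, so practical implementations would use last-layer gradients or other low-dimensional proxies rather than the full parameter gradient.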
URL: https://openreview.net/forum?id=OE4P1tW8iQ
---
Title: Graph-based Confidence Calibration for Large Language Models
Authors: Yukun Li, Sijia Wang, Lifu Huang, Liping Liu
Abstract: Reliable confidence estimation is essential for enhancing the trustworthiness of large language models (LLMs), especially in high-stakes scenarios. Despite its importance, accurately estimating confidence in LLM responses remains a significant challenge. In this work, we propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the LLM. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct. Experiments demonstrate that this method has strong calibration performance on various benchmark datasets and generalizes well to out-of-domain cases.
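The consistency-graph construction can be sketched with exact-match agreement and a simple degree-based score (the paper trains a GNN on such a graph; the heuristic below is only an illustration):

```python
from itertools import combinations

def consistency_confidence(responses):
    """Connect responses that agree (here: exact string match), then score
    each response by the fraction of the other responses it agrees with."""
    n = len(responses)
    agree = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if responses[i].strip() == responses[j].strip():
            agree[i][j] = agree[j][i] = 1
    return [sum(row) / (n - 1) for row in agree]

samples = ["Paris", "Paris", "Lyon", "Paris", "Marseille"]
conf = consistency_confidence(samples)
print(conf)  # the majority "Paris" answers score highest
```

In practice agreement would be judged with a softer semantic-equivalence measure than exact matching, and the GNN learns to map the graph to a calibrated correctness probability.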
URL: https://openreview.net/forum?id=BDPvuD5FTg
---
Title: Preferential Multi-Objective Bayesian Optimization
Authors: Raul Astudillo, Kejun Li, Maegan Tucker, Chu Xin Cheng, Aaron Ames, Yisong Yue
Abstract: Preferential Bayesian optimization (PBO) is a framework for optimizing a decision-maker’s latent preferences over available design choices. While real-world problems often involve multiple conflicting objectives, existing PBO methods assume that preferences can be encoded by a single objective function. For instance, in the customization of robotic assistive devices, technicians aim to maximize user comfort while minimizing energy consumption to extend battery life. Likewise, in autonomous driving policy design, stakeholders must evaluate safety and performance trade-offs before committing to a policy. To bridge this gap, we introduce the first framework for PBO with multiple objectives. Within this framework, we propose dueling scalarized Thompson sampling (DSTS), a multi-objective generalization of the popular dueling Thompson sampling algorithm, which may also be of independent interest beyond our setting. We evaluate DSTS across four synthetic test functions and two simulated tasks—exoskeleton personalization and driving policy design—demonstrating that it outperforms several benchmarks. Finally, we prove that DSTS is asymptotically consistent. Along the way, we provide, to our knowledge, the first convergence guarantee for dueling Thompson sampling in single-objective PBO.
URL: https://openreview.net/forum?id=mjsoESaWDH
---
Title: Adam-family Methods with Decoupled Weight Decay in Deep Learning
Authors: Kuangyu Ding, Nachuan Xiao, Kim-chuan Toh
Abstract: In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural networks with weight decay. Motivated by AdamW, we propose a novel framework for Adam-family methods with decoupled weight decay. Within our framework, the estimators for the first-order and second-order moments of stochastic subgradients are updated independently of the weight decay term. Under mild assumptions and with non-diminishing stepsizes for updating the primary optimization variables, we establish the convergence properties of our proposed framework. In addition, we show that our proposed framework encompasses a wide variety of well-known Adam-family methods, hence offering convergence guarantees for these methods in the training of nonsmooth neural networks. More importantly, compared to the existing results on the choices of the parameters for the moment terms in Adam, we show that our proposed framework provides more flexibility for these parameters. As a practical application of our proposed framework, we propose a novel Adam-family method named Adam with Decoupled Weight Decay (AdamD), and establish its convergence properties under mild conditions. Numerical experiments demonstrate that AdamD outperforms Adam and is comparable to AdamW, in the aspects of both generalization performance and efficiency.
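For readers unfamiliar with decoupled weight decay, the following sketch shows an AdamW-style update in which the decay term bypasses the moment estimates, the structural property the paper's framework analyzes (this illustrates the decoupling only, not the paper's exact AdamD recursion):

```python
import numpy as np

def adamw_style_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, wd=1e-2):
    """One update with decoupled weight decay: the moment estimates m, v
    see only the raw gradient g; the decay term wd * w acts on the
    weights directly, outside the adaptive rescaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 501):
    w, m, v = adamw_style_step(w, w.copy(), m, v, t, lr=0.05)
print(np.linalg.norm(w))  # far below the initial norm of about 2.24
```

Coupled (L2-regularized) Adam would instead fold `wd * w` into `g` before updating `m` and `v`, which is exactly the distinction the decoupled framework removes.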
URL: https://openreview.net/forum?id=xVEHiAZ7uR
---
Title: Generalized Compressed Sensing for Image Reconstruction with Diffusion Probabilistic Models
Authors: Ling-Qi Zhang, Zahra Kadkhodaie, Eero P Simoncelli, David H. Brainard
Abstract: We examine the problem of selecting a small set of linear measurements for reconstructing high-dimensional signals. Well-established methods for optimizing such measurements include principal component analysis (PCA), independent component analysis (ICA) and compressed sensing (CS) based on random projections, all of which rely on axis- or subspace-aligned statistical characterization of the signal source. However, many naturally occurring signals, including photographic images, contain richer statistical structure. To exploit such structure, we introduce a general method for obtaining an optimized set of linear measurements for efficient image reconstruction, where the signal statistics are expressed by the prior implicit in a neural network trained to perform denoising (known as a ``diffusion model''). We demonstrate that the optimal measurements derived for two natural image datasets differ from those of PCA, ICA, or CS, and result in substantially lower mean squared reconstruction error. Interestingly, the marginal distributions of the measurement values are asymmetrical (skewed), substantially more so than those of previous methods. We also find that optimizing with respect to perceptual loss, as quantified by structural similarity (SSIM), leads to measurements different from those obtained when optimizing for MSE. Our results highlight the importance of incorporating the specific statistical regularities of natural signals when designing effective linear measurements.
URL: https://openreview.net/forum?id=lmHh4FmPWZ
---
Title: Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification
Authors: Shang Liu, Zhongze Cai, Guanting Chen, Xiaocheng Li
Abstract: Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer's in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks and, unlike all the existing literature, consider a bi-objective prediction task of predicting both the conditional expectation $\mathbb{E}[Y|X]$ and the conditional variance Var$(Y|X)$. This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments to distinguish ICL from in-weight learning (IWL) and (ii) make a better separation between algorithms that do and do not use the prior information of the training distribution. Theoretically, we show that the trained Transformer reaches near Bayes optimum, suggesting the usage of the information of the training distribution. Our method can be extended to other cases. Specifically, with the Transformer's context window $S$, we prove a generalization bound of $\tilde{\mathcal{O}}(\sqrt{\min\{S, T\}/(n T)})$ on $n$ tasks with sequences of length $T$, providing sharper analysis compared to previous results of $\tilde{\mathcal{O}}(\sqrt{1/n})$. Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform Bayesian inference when facing task shifts, in contrast to the \textit{equivalence} between these two proposed in much of the existing literature. We also demonstrate the trained Transformer's ICL ability over covariate shift and prompt-length shift and interpret them as generalization over a meta distribution.
URL: https://openreview.net/forum?id=0c6iG28rRl
---
Title: A Theoretical Study of Neural Network Expressive Power via Manifold Topology
Authors: Jiachen Yao, Lingjie Yi, Mayank Goswami, Chao Chen
Abstract: A prevalent assumption regarding real-world data is that it lies on or close to a low-dimensional manifold. When deploying a neural network on data manifolds, the required size, i.e., the number of neurons of the network, heavily depends on the intricacy of the underlying latent manifold. While significant advancements have been made in understanding the geometric attributes of manifolds, it's essential to recognize that topology, too, is a fundamental characteristic of manifolds. In this study, we investigate network expressive power in terms of the latent data manifold. Integrating both topological and geometric facets of the data manifold, we present a size upper bound of ReLU neural networks.
URL: https://openreview.net/forum?id=qRAjZuf48S
---
Title: UniZero: Generalized and Efficient Planning with Scalable Latent World Models
Authors: Yuan Pu, Yazhe Niu, Zhenjie Yang, Jiyuan Ren, Hongsheng Li, Yu Liu
Abstract: Learning predictive world models is crucial for enhancing the planning capabilities of reinforcement learning (RL) agents. Recently, MuZero-style algorithms, leveraging the value equivalence principle and Monte Carlo Tree Search (MCTS), have achieved superhuman performance in various domains. However, these methods struggle to scale in heterogeneous scenarios with diverse dependencies and task variability.
To overcome these limitations, we introduce UniZero, a novel approach that employs a transformer-based world model to effectively learn a shared latent space. By concurrently predicting latent dynamics and decision-oriented quantities conditioned on the learned latent history, UniZero enables joint optimization of the long-horizon world model and policy, facilitating broader and more efficient planning in the latent space. We show that UniZero significantly outperforms existing baselines in benchmarks that require long-term memory. Additionally, UniZero demonstrates superior scalability in multitask learning experiments conducted on Atari benchmarks. In standard single-task RL settings, such as Atari and DMControl, UniZero matches or even surpasses the performance of current state-of-the-art methods. Finally, extensive ablation studies and visual analyses validate the effectiveness and scalability of UniZero's design choices. Our code is available at \textcolor{magenta}{https://github.com/opendilab/LightZero}.
URL: https://openreview.net/forum?id=Gl6dF9soQo
---
Title: T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning
Authors: Shahzad Ahmad, Sukalpa Chanda, Yogesh S Rawat
Abstract: Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-based visual-language models, such as CLIP, for videos, extending their zero-shot capabilities to the video domain. While these adaptations have shown promising results, they come at a significant computational cost and struggle with effectively modeling the temporal aspects inherent to the video domain. In this study, we present Efficient Zero-Shot Action Recognition with Temporal Token Learning (T2L), a simple and efficient adaptation of CLIP that addresses these challenges. T2L leverages Temporal Token Learning (TTL) for seamless temporal adaptation, requiring no fundamental changes to the core CLIP architecture while preserving its remarkable generalization abilities. TTL relies on temporal feature diversity (TFD), a novel learning objective, which guides TTL to focus on capturing motion, thereby enhancing its learning capabilities from videos. We perform extensive experiments on nine different benchmark datasets, thoroughly evaluating T2L for zero-shot learning and base-to-novel video action recognition, and also demonstrating its potential for few-shot generalization. Impressively, with merely 5.2 million learnable parameters, T2L can be efficiently trained on a single GPU (with 25x fewer learnable parameters, a 3x reduction in GFLOPs, and a 4x improvement in throughput compared with the prior best model), outperforming existing approaches in several evaluations.
URL: https://openreview.net/forum?id=WvgoxpGpuU
---
New submissions
===============
Title: Dreaming is All You Need
Abstract: Achieving a harmonious balance between exploration and precision is of paramount importance in classification tasks. To this end, this research introduces two novel deep learning models, SleepNet and DreamNet, to strike this balance. SleepNet seamlessly integrates supervised learning with unsupervised ``sleep'' stages using pre-trained encoder models. Dedicated neurons within SleepNet are embedded in these unsupervised features, forming intermittent ``sleep'' blocks that facilitate exploratory learning. Building upon the foundation of SleepNet, DreamNet employs full encoder-decoder frameworks to reconstruct the hidden states, mimicking the human ``dreaming'' process. This reconstruction process enables further exploration and refinement of the learned representations. Moreover, the principal ideas of our SleepNet and DreamNet are generic and can be applied to both computer vision and natural language processing downstream tasks. Through extensive empirical evaluations on diverse image and text datasets, SleepNet and DreamNet have demonstrated superior performance compared to state-of-the-art models, showcasing the strengths of unsupervised exploration and supervised precision afforded by our innovative approaches.
URL: https://openreview.net/forum?id=vXhsnUIeTl
---
Title: AI Stereotypes: An Unequipartition Property for Perplexity in Generative Language Models
Abstract: We prove a new asymptotic unequipartition property for the perplexity of long texts generated by a language model and present supporting experimental evidence from open-source models. Specifically, we show that the logarithmic perplexity of any large text generated by a language model must asymptotically converge to the average entropy of its token distributions. This defines a ``typical set'' that all long synthetic texts generated by a language model must belong to. We refine the concept of ``typical set'' to include only grammatically correct texts. We then show that this refined typical set is a vanishingly small subset of all possible grammatically correct texts for a very general definition of grammar. This means that language models are strongly constrained in the range of their possible behaviors and outputs. We make no simplifying assumptions (such as stationarity) about the statistics of language model outputs, and therefore our results are directly applicable to practical real-world models without any approximations. We discuss possible applications of the typical set concept to problems such as detecting synthetic texts and membership inference in training datasets.
URL: https://openreview.net/forum?id=LeUcAoZ146
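The central claim, that the log-perplexity of a long generated text converges to the average entropy of its token distributions, can be made concrete with a toy i.i.d. token model. This is only an illustrative sketch: the paper explicitly avoids such stationarity assumptions, and the distribution `p` below is invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_perplexity_per_token(p, n):
    # Sample an n-token "text" from a fixed token distribution p and
    # return its average negative log-probability (log-perplexity).
    tokens = rng.choice(len(p), size=n, p=p)
    return -np.log(p[tokens]).mean()

# Toy token distribution and its entropy (in nats).
p = np.array([0.5, 0.25, 0.125, 0.125])
entropy = -(p * np.log(p)).sum()
# For long texts, log-perplexity concentrates around the entropy,
# carving out a "typical set" of sequences.
```

Running `log_perplexity_per_token(p, 200_000)` yields a value within a few thousandths of `entropy`, the toy analogue of the typical-set statement.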
---
Title: Dextr: Zero-Shot Neural Architecture Search with Singular Value Decomposition and Extrinsic Curvature
Abstract: Zero-shot Neural Architecture Search (NAS) typically optimises the architecture search process by exploiting the network or gradient properties at initialisation through zero-cost proxies. The existing proxies often rely on labelled data, which is usually unavailable in real-world settings. Furthermore, the majority of the current methods focus either on optimising the convergence and generalisation attributes or solely on the expressivity of the network architectures. To address both limitations, we first demonstrate how channel collinearity affects the convergence and generalisation properties of a neural network. Then, by incorporating the convergence, generalisation and expressivity in one approach, we propose a zero-cost proxy that omits the requirement of labelled data for its computation. In particular, we leverage the Singular Value Decomposition (SVD) of the neural network layer features and the extrinsic curvature of the network output to design our proxy. As a result, the proposed proxy is formulated as the simplified harmonic mean of the logarithms of two key components: the sum of the inverse of the feature condition number and the extrinsic curvature of the network output. Our approach enables accurate prediction of network performance on test data using only a single label-free data sample. Our extensive evaluation includes a total of six experiments, including the Convolutional Neural Network (CNN) search space, i.e. DARTS and the Transformer search space, i.e. AutoFormer. The proposed proxy demonstrates a superior performance on multiple correlation benchmarks, including NAS-Bench-101, NAS-Bench-201, and TransNAS-Bench-101-micro; as well as on the NAS task within the DARTS and the AutoFormer search space, all while being notably efficient.
URL: https://openreview.net/forum?id=X0vPof5DVh
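The two ingredients of the proxy can be sketched as follows. This is a hedged illustration only: the `log1p` form is an assumption made here to keep both logarithms positive, and the paper's exact normalisation and its extrinsic-curvature computation may differ.

```python
import numpy as np

def feature_condition_term(features):
    # Sum over layers of the inverse condition number of each layer's
    # feature matrix, read off from its singular values (SVD).
    total = 0.0
    for F in features:
        s = np.linalg.svd(F, compute_uv=False)
        total += s[-1] / s[0]  # inverse condition number, in (0, 1]
    return total

def dextr_proxy(features, curvature):
    # Harmonic mean of the logarithms of the two components described
    # in the abstract; log1p is assumed here purely for positivity.
    a = np.log1p(feature_condition_term(features))
    b = np.log1p(curvature)
    return 2.0 / (1.0 / a + 1.0 / b)
```

Both inputs are label-free: the feature matrices come from a forward pass on a single unlabelled sample, which is what makes the proxy zero-cost.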
---
Title: Continuous Language Model Interpolation yields Dynamic and Controllable Text Generation
Abstract: As large language models (LLMs) have gained popularity for a variety of use cases, making them adaptable and controllable has become increasingly important, especially for user-facing applications. In particular, linear interpolation between model parameters forms the backbone for many recent approaches to adapting models to user preferences. While the existing literature on LLM adaptation primarily focuses on finding methods that optimize for some set of performance criteria or user preferences, here we instead seek to better understand and characterize the behavior of dense, continuous interpolation between models. Specifically, we use low-rank updates to fine-tune a base model to various different domains, yielding a set of anchor models with distinct generation profiles. Then, we use the weight updates of these anchor models to parametrize the entire (infinite) class of models contained within their convex hull. We empirically show that varying the interpolation weights yields predictable and consistent change in the model outputs with respect to all of the controlled attributes simultaneously. We find that there is little entanglement between most attributes and identify and discuss the pairs of attributes for which this is not the case. Our results suggest that parameter merging facilitates flexible model adaptation due to its predictable behavior within the full interpolation region.
URL: https://openreview.net/forum?id=xD9Nu2Wah4
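The parametrization of the convex hull of anchor models can be sketched as a convex combination of the anchors' weight updates on a shared base. The dense-update form and the names below are illustrative (the paper uses low-rank updates):

```python
import numpy as np

def interpolate_model(base_weights, anchor_deltas, alphas):
    # Convex combination of the anchors' weight updates applied to a
    # shared base model; each choice of alphas in the simplex yields a
    # model inside the anchors' convex hull.
    assert all(a >= 0 for a in alphas) and sum(alphas) <= 1 + 1e-9
    merged = {}
    for name, w in base_weights.items():
        delta = sum(a * d[name] for a, d in zip(alphas, anchor_deltas))
        merged[name] = w + delta
    return merged
```

Varying `alphas` continuously traces out the dense interpolation region whose output behaviour the paper studies.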
---
Title: Certified Robustness to Data Poisoning in Gradient-Based Training
Abstract: Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding the behavior of learning algorithms under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.
URL: https://openreview.net/forum?id=9WHifn9ZVX
---
Title: Table Foundation Models: on knowledge pre-training for tabular learning
Abstract: Table foundation models bring high hopes to data science: pre-trained on tabular data to embed knowledge or priors, they should facilitate downstream tasks on tables. One specific challenge is that of data semantics: numerical entries take their meaning from context, *e.g.*, column name. The traditional approach combines column-specific data preparation with tree-based models that adapt to column specificities. Pre-trained neural networks that jointly model column names and table entries have recently boosted prediction accuracy. While these models outline the promises of world knowledge to interpret table values, they lack the convenience of popular foundation models in text or vision. Indeed, they must be fine-tuned to bring benefits, come with sizeable computation costs, and cannot easily be reused or combined with other architectures. Here we introduce TARTE, a foundation model that transforms tables to knowledge-enhanced vector representations using the string to capture semantics. Pre-trained on large relational data, TARTE yields representations that facilitate subsequent learning with little additional cost. These representations can be fine-tuned or combined with other learners, giving models that push the state-of-the-art prediction performance and improve the prediction/computation performance trade-off. Specialized to a task or a domain, TARTE gives domain-specific representations that facilitate further learning. Our study demonstrates an effective approach to knowledge pre-training for tabular learning.
URL: https://openreview.net/forum?id=QV4P8Csw17
---
Title: Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning
Abstract: Large-scale Transformer language models (LMs) trained solely on next-token prediction with web-scale data can solve a wide range of tasks after seeing just a few examples. The mechanism behind this capability, known as in-context learning (ICL), remains both controversial and poorly understood. Some studies argue that it is merely the result of memorizing vast amounts of data, while others contend that it reflects a fundamental, symbolic algorithmic development in LMs. In this work, we introduce a suite of investigative tasks and a novel method to systematically investigate ICL by leveraging the full Pythia scaling suite, including interim checkpoints that capture progressively larger amounts of training data. By carefully exploring ICL performance on downstream tasks and simultaneously conducting a mechanistic analysis of the residual stream's subspace, we demonstrate that ICL extends beyond mere "memorization" of the training corpus, yet does not amount to the implementation of an independent symbolic algorithm. Our results also clarify several aspects of ICL, including the influence of training dynamics, model capabilities, and elements of mechanistic interpretability. Overall, our work advances the understanding of ICL and its implications, offering model developers insights into potential improvements and providing AI security practitioners with a basis for more informed guidelines.
URL: https://openreview.net/forum?id=10QqO1tM1H
---
Title: Enhancing Cost Efficiency in Active Learning with Candidate Set Query
Abstract: This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64.
URL: https://openreview.net/forum?id=LhHxl30xQ1
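A minimal sketch of how a conformal candidate set might be built. The acquisition function and the paper's exact calibration are not reproduced; this is the standard split-conformal recipe on true-class probabilities:

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    # Split-conformal calibration: choose a probability threshold so that
    # roughly (1 - alpha) of candidate sets contain the true class.
    scores = cal_probs[np.arange(len(cal_labels)), cal_labels]
    k = int(np.floor(alpha * (len(scores) + 1)))
    return np.sort(scores)[max(k - 1, 0)]

def candidate_set(probs, threshold):
    # The oracle is asked to search only these classes instead of all
    # possible classes, which is what reduces the labeling cost.
    return np.where(probs >= threshold)[0]
```

As the model improves over AL rounds, recalibrating yields smaller candidate sets, matching the abstract's "small yet reliable" adaptation.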
---
Title: TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation
Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE.
URL: https://openreview.net/forum?id=ZMqMnwMfse
---
Title: Wolf: Dense Video Captioning with a World Summarization Framework
Abstract: We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore (caption quality) by 55.6% and CapScore (caption similarity) by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment.
URL: https://openreview.net/forum?id=Z1dH7hao7p
---
Title: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
Abstract: Robust alignment guardrails for large language models are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model subcomponents, particularly attention heads, using a simple binary choice probing strategy. These interventions then generalise to the open-ended generation setting effectively circumventing safety guardrails. We show that probing single attention heads is more effective than intervening on full layers and intervening on only four attention heads is comparable to supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. Our findings highlight the shortcomings of current alignment techniques. In addition, our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviors. Practically, the approach offers a straightforward methodology to steer large language model behaviour, which could be extended to diverse domains beyond safety requiring fine-grained control over the model output.
URL: https://openreview.net/forum?id=VY0huMBr5n
---
Title: Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models
Abstract: Recent advances in fine-tuning large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks, particularly when paired with chain-of-thought (CoT) prompting. However, these successes have been largely demonstrated on large-scale models with billions of parameters, where a strong pretraining foundation ensures effective initial exploration. In contrast, RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively, often leading to suboptimal reasoning patterns. This work introduces a novel intrinsic motivation approach that leverages episodic memory to address this challenge, improving tiny LLMs in CoT reasoning tasks. Inspired by human memory-driven learning, our method leverages successful reasoning patterns stored in memory while allowing controlled exploration to generate novel responses. Intrinsic rewards are computed efficiently using a kNN-based episodic memory, allowing the model to discover new reasoning strategies while quickly adapting to effective past solutions. Experiments on three reasoning datasets demonstrate that our approach significantly enhances smaller LLMs' reasoning performance and generalization capability, making RL-based reasoning improvements more accessible in low-resource settings.
URL: https://openreview.net/forum?id=tmdwuU2uKs
---
Title: Incorporating Spatial Information into Goal-Conditioned Hierarchical Reinforcement Learning via Graph Representations
Abstract: The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, existing approaches typically rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them, because they have problems passing the information in the graphs to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representation. This paper proposes a solution to these issues by developing a graph encoder-decoder to evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method to enhance performance. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches at a small extra computational cost in dense and sparse reward environments.
URL: https://openreview.net/forum?id=a7Bx4s5gA8
---
Title: Thompson Sampling For Bandits With Cool-Down Periods
Abstract: This paper investigates a variation of dynamic bandits, characterized by arms that follow a periodic availability pattern. Upon a "successful" selection, each arm transitions to an inactive state and requires a possibly unknown cool-down period before becoming active again. We devise Thompson Sampling algorithms specifically designed for this problem, guaranteeing logarithmic regret. Notably, this work is the first to address scenarios in which the agent lacks knowledge of each arm's active state. Furthermore, the theoretical findings extend to the sleeping bandit framework, offering a notably superior regret bound compared to existing literature.
URL: https://openreview.net/forum?id=1fv0ZS2mXm
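A toy Beta-Bernoulli variant with known cool-down periods can be sketched as follows; the paper's algorithms, including the unknown-cool-down and unknown-active-state cases and the regret analysis, are more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_cooldown(true_means, cooldown, horizon):
    # Beta-Bernoulli Thompson Sampling where a successful pull sends the
    # arm into a known cool-down period before it can be pulled again.
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)
    ready_at = np.zeros(k, dtype=int)  # round at which each arm reactivates
    total_reward = 0
    for t in range(horizon):
        active = np.where(ready_at <= t)[0]
        if active.size == 0:
            continue  # every arm is cooling down; skip the round
        samples = rng.beta(alpha[active], beta[active])
        arm = active[np.argmax(samples)]
        reward = rng.random() < true_means[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward
        if reward:  # a "successful" selection triggers the cool-down
            ready_at[arm] = t + 1 + cooldown[arm]
        total_reward += reward
    return total_reward
```

Posterior sampling is restricted to the currently active arms; the interesting regime in the paper is when this active set is not observed.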
---
Title: Spatio-temporal Partial Sensing Forecast of Long-term Traffic
Abstract: Traffic forecasting uses recent measurements by sensors installed at chosen locations to forecast the future road traffic. Existing work either assumes all locations are equipped with sensors or focuses on short-term forecast. This paper studies partial sensing forecast of long-term traffic, assuming sensors are available only at some locations. The problem is challenging due to the unknown data distribution at unsensed locations, the intricate spatio-temporal correlation in long-term forecasting, as well as noise in traffic patterns. We propose a Spatio-temporal Long-term Partial sensing Forecast model for traffic prediction, with several novel contributions, including a rank-based embedding technique to reduce the impact of noise in data, a spatial transfer matrix to overcome the spatial distribution shift from sensed locations to unsensed locations, and a multi-step training process that utilizes all available data to successively refine the model parameters for better accuracy. Extensive experiments on several real-world traffic datasets demonstrate its superior performance. Our source code is at https://anonymous.4open.science/r/STPS-166F
URL: https://openreview.net/forum?id=Ff08aPjVjD
---
Title: High-Dimensional Gaussian Process Regression with Soft Kernel Interpolation
Abstract: We introduce Soft Kernel Interpolation (SoftKI), a method that combines aspects of Structured Kernel Interpolation (SKI) and variational inducing point methods, to achieve scalable Gaussian Process (GP) regression on high-dimensional datasets. SoftKI approximates a kernel via softmax interpolation from a smaller number of interpolation points learned by optimizing a combination of the SoftKI marginal log-likelihood (MLL), and when needed, an approximate MLL for improved numerical stability. Consequently, it can overcome the dimensionality scaling challenges that SKI faces when interpolating from a dense and static lattice while retaining the flexibility of variational methods to adapt inducing points to the dataset. We demonstrate the effectiveness of SoftKI across various examples and show that it is competitive with other approximated GP methods when the data dimensionality is modest.
URL: https://openreview.net/forum?id=U9b2FIjvWU
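The core approximation can be sketched as softmax interpolation from learned points, by analogy with SKI's W K(Z, Z) W^T structure. Temperature handling, the MLL optimisation of the points Z, and the stability tricks are omitted, and the kernel choice below is an assumption:

```python
import numpy as np

def softmax_interp_weights(X, Z, temp=1.0):
    # Soft interpolation weights from data points X to learned
    # interpolation points Z: a softmax over negative squared distances.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    W = np.exp(logits)
    return W / W.sum(axis=1, keepdims=True)

def rbf_kernel(A, B):
    # Squared-exponential base kernel evaluated on interpolation points.
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def softki_kernel(X, Z, base_kernel=rbf_kernel):
    # K(X, X) is approximated as W K(Z, Z) W^T, as in SKI, but with
    # learned points Z and soft rather than local cubic interpolation.
    W = softmax_interp_weights(X, Z)
    return W @ base_kernel(Z, Z) @ W.T
```

Because the weights depend only on distances to Z rather than on a dense lattice, the number of interpolation points need not grow with the input dimension.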
---
Title: Single-positive Multi-label Learning with Label Cardinality
Abstract: We study learning a multi-label classifier from partially labeled data, where each instance has only a single positive label. We explain how auxiliary information available on the label cardinality, the number of positive labels per instance, can be used for improving such methods. We consider auxiliary information of varying granularity, ranging from knowing just the maximum number of labels over all instances to knowledge on the distribution of label cardinalities and even the exact cardinality of each instance. We introduce methods leveraging the different types of auxiliary information, study how close to the fully labeled accuracy we can get under different scenarios, and show that a simple method only assuming the knowledge of the maximum cardinality is comparable to the state-of-the-art methods.
URL: https://openreview.net/forum?id=XEPPXH2nKu
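The simplest kind of baseline the abstract alludes to, using only the maximum cardinality, might look like the following; the 0.5 cutoff and the exact selection rule are assumptions of this sketch, not the paper's method:

```python
import numpy as np

def predict_with_max_cardinality(scores, max_k, cutoff=0.5):
    # Keep at most max_k classes per instance (the only auxiliary
    # information assumed is the maximum label cardinality), and among
    # those only the classes clearing an absolute score cutoff.
    preds = np.zeros_like(scores, dtype=bool)
    for i, row in enumerate(scores):
        top = np.argsort(row)[::-1][:max_k]
        preds[i, top[row[top] > cutoff]] = True
    return preds
```

Finer-grained auxiliary information, such as the cardinality distribution or per-instance cardinalities, would replace the fixed `max_k` with instance-specific budgets.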
---
Title: Solution Augmentation for ARC Problems Using GFlowNet: A Probabilistic Exploration Approach
Abstract: One of the core challenges in building general reasoning systems lies in generating diverse, human-aligned solution trajectories—different yet valid paths by which a problem can be solved. Prior approaches often rely on handcrafted templates, rule-based augmentations, or human demonstrations, which are limited in scalability and stylistic diversity. To address this, we explore the use of Generative Flow Networks (GFlowNets) for automated solution augmentation in reasoning tasks. We propose a framework that learns to generate diverse reasoning trajectories with probabilities proportional to their quality, guided by a human-inspired reward function and a novel geometric forward policy. This enables the generation of multiple plausible solution paths without relying on manual supervision. We evaluate our framework on the Abstraction and Reasoning Corpus (ARC-AGI), a benchmark designed to test compositional and abstract reasoning. Our results show that GFlowNets can effectively explore the space of valid reasoning processes, producing trajectories that are diverse, concise, and consistent with human reasoning patterns. These findings suggest that GFlowNets offer a promising foundation for modeling structured reasoning in automated trajectory generation. Our code is here: https://anonymous.4open.science/r/GFN_to_ARC-B500/
URL: https://openreview.net/forum?id=ULCOhBgGzy
---
Title: In-context Learning for Mixture of Linear Regression: Existence, Generalization and Training Dynamics
Abstract: We investigate the in-context learning capabilities of transformers for the $d$-dimensional mixture of linear regression model, providing theoretical insights into their existence, generalization bounds, and training dynamics. Specifically, we prove that there exists a transformer capable of achieving a prediction error of order $\mathcal{O}(\sqrt{d/n})$ with high probability, where $n$ represents the training prompt size in the high signal-to-noise ratio (SNR) regime. Moreover, we derive in-context excess risk bounds of order $\mathcal{O}(L/\sqrt{B})$ for the case of two mixtures, where $B$ denotes the number of training prompts, and $L$ represents the number of attention layers. The dependence of $L$ on the SNR is explicitly characterized, differing between low and high SNR settings. We further analyze the training dynamics of transformers with single linear self-attention layers, demonstrating that, with appropriately initialized parameters, gradient flow optimization over the population mean square loss converges to a global optimum. Extensive simulations suggest that transformers perform well on this task, potentially outperforming other baselines, such as the Expectation-Maximization algorithm.
URL: https://openreview.net/forum?id=buZXVuTsHY
---
Title: Large-Scale Targeted Cause Discovery with Data-Driven Learning
Abstract: We propose a novel machine learning approach for inferring causal variables of a target variable from observations. Our focus is on directly inferring a set of causal factors without requiring full causal graph reconstruction, which is computationally challenging in large-scale systems. The identified causal set consists of all potential regulators of the target variable under experimental settings, enabling efficient regulation when intervention costs and feasibility vary across variables. To achieve this, we train a neural network using supervised learning on simulated data to infer causality. By employing a local-inference strategy, our approach scales with linear complexity in the number of variables, efficiently scaling up to thousands of variables. Empirical results demonstrate superior performance in identifying causal relationships within large-scale gene regulatory networks, outperforming existing methods that emphasize full-graph discovery. We validate our model's generalization capability across out-of-distribution graph structures and generating mechanisms, including gene regulatory networks of E. coli and the human K562 cell line.
URL: https://openreview.net/forum?id=NVgy29IQw8
---
Title: Structural Causal Circuits: Probabilistic Circuits Climbing All Rungs of Pearl's Ladder of Causation
Abstract: The complexity and vastness of our world can require large models with numerous variables. Unfortunately, coming up with a model that is both accurate and able to provide predictions in a reasonable amount of time can prove difficult. One possibility to help overcome such problems is sum-product networks (SPNs), probabilistic models with the ability to tractably perform inference in linear time. In this paper, we extend SPNs' capabilities to the field of causality and introduce the family of structural causal circuits (SCCs), a type of SPNs capable of answering causal questions. Starting from conventional SPNs, we ``climb the ladder of causation'' and show how SCCs can represent not only observational but also interventional and counterfactual problems. We demonstrate successful application in different settings, ranging from simple binary variables to physics-based simulations.
URL: https://openreview.net/forum?id=25XyUTICdZ
---
Title: Foldable SuperNets: Scalable Merging of Transformers with Different Initializations and Tasks
Abstract: Recent methods aim to merge neural networks (NNs) with identical architectures trained on different tasks into a single multi-task model. While most works focus on the simpler setup of merging NNs initialized from a common pre-trained network, we target the harder problem of merging large transformers trained on different tasks from distinct initializations. We show that traditional merging methods fail catastrophically in this setup, while Knowledge Distillation (KD) achieves much better results, though at a higher cost. However, KD is data-inefficient, as it does not exploit the original models' weights. To solve this, we introduce "Foldable SuperNet Merge" (FS-Merge), which trains a SuperNet containing the original models (with frozen weights) using a feature reconstruction objective. After training, the SuperNet is folded back to the size of a single original model. FS-Merge is simple, data-efficient, has a computational cost comparable to KD, and is proven to have superior expressiveness over traditional merging methods. It achieves SOTA results when tested on MLPs and transformers across various sizes, tasks, modalities, and distribution shifts, especially in low-data scenarios.
URL: https://openreview.net/forum?id=6FqwLestHv
---
Title: One Model, Any Conjunctive Query: Graph Neural Networks for Answering Queries over Incomplete Knowledge Graphs
Abstract: Motivated by the incompleteness of modern knowledge graphs, a new setup for query answering has emerged, where the goal is to predict answers that do not necessarily appear in the knowledge graph, but are present in its completion. In this paper, we formally introduce and study two query answering problems, namely, query answer classification and query answer retrieval. To solve these problems, we propose AnyCQ, a model that can classify answers to any conjunctive query on any knowledge graph. At the core of our framework lies a graph neural network trained using a reinforcement learning objective to answer Boolean queries. Trained only on simple, small instances, AnyCQ generalizes to large queries of arbitrary structure, reliably classifying and retrieving answers to queries that existing approaches fail to handle. This is empirically validated through our newly proposed, challenging benchmarks. Finally, we empirically show that AnyCQ can effectively transfer to completely novel knowledge graphs when equipped with an appropriate link prediction model, highlighting its potential for querying incomplete data.
URL: https://openreview.net/forum?id=l6LDN3M7vq
---
Title: StructDrop: A Structured Random Algorithm Towards Efficient Large-Scale Graph Training
Abstract: Graph neural networks (GNNs) have achieved considerable success in graph-based learning tasks, yet training GNNs on large graphs remains inefficient. The root cause is that graph-based sparse operations are difficult to accelerate with commodity hardware. Prior art reduces the computation cost of sparse-matrix-based operations (e.g., linear layers) via sampling-based approximation. However, two under-explored pain points persist in this paradigm. Inefficiency issue: random sampling approaches leave the non-zero entries randomly distributed over the adjacency matrix, which slows down memory access and is difficult to accelerate with commodity hardware. Under-fitting problem: previous sampling methods use the same subset of nodes throughout training, which may cause under-fitting on the remaining nodes. To systematically address these two pain points, we propose StructuredDropout, a.k.a. StructDrop, which selectively samples random columns and rows of a sparse matrix for computation. Comprehensive experiments validate the efficiency and generalization of our framework: StructDrop achieves up to 5.09x speedup for a single sparse operation and 5.29x end-to-end speedup with negligible accuracy loss or even better accuracy.
URL: https://openreview.net/forum?id=7B7LcXf72j
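The core sampling idea, approximating a matrix product by keeping only a subset of columns of one factor and the matching rows of the other, can be sketched as follows. This is an illustrative sketch with plain uniform sampling; the paper's structured selection and its integration into GNN training are not reproduced here.

```python
import numpy as np

def sampled_matmul(A, B, k, rng):
    """Approximate A @ B by sampling k of the n inner-dimension indices.

    Dropping whole columns of A (and the matching rows of B) keeps the
    remaining computation dense and hardware-friendly; rescaling by n/k
    makes the estimate unbiased under uniform sampling.
    """
    n = A.shape[1]
    idx = rng.choice(n, size=k, replace=False)
    return (n / k) * (A[:, idx] @ B[idx, :])
```

With k equal to the full inner dimension the exact product is recovered; smaller k trades accuracy for speed.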
---
Title: Diversity-Enhanced and Classification-Aware Prompt Learning for Few-Shot Learning via Stable Diffusion
Abstract: Recent text-to-image generative models have exhibited an impressive ability to generate fairly realistic images from text prompts. In this work, we explore leveraging off-the-shelf text-to-image generative models to train non-specific downstream few-shot classification model architectures on synthetic datasets to classify real images. Current approaches feed hand-crafted or model-generated text prompts to text-to-image generative models to generate the desired synthetic images; however, they have limited capability of generating diverse images. In particular, their synthetic datasets have relatively limited relevance to the downstream classification tasks, making it hard to guarantee that models trained on such synthetic images are effective in practice. To address this issue, we propose a method that adaptively learns proper text prompts for an off-the-shelf diffusion model to generate diverse and classification-aware synthetic images. Our approach shows consistent improvements on various classification datasets, with results comparable to existing prompt-designing methods. We find that replacing the data generation strategy of existing zero/few-shot methods with the proposed method consistently improves downstream classification performance across different network architectures, demonstrating its model-agnostic potential for few-shot learning. This makes it possible to train an efficient downstream few-shot learning model for real problems from synthetic images generated by the proposed method.
URL: https://openreview.net/forum?id=4CfliohyqK
---
Title: Convergence Properties of Natural Gradient Descent for Minimizing KL Divergence
Abstract: The Kullback-Leibler (KL) divergence plays a central role in probabilistic machine learning, where it commonly serves as the canonical loss function. Optimization in such settings is often performed over the probability simplex, where the choice of parameterization significantly impacts convergence. In this work, we study the problem of minimizing the KL divergence and analyze the behavior of gradient-based optimization algorithms under two dual coordinate systems within the framework of information geometry: the exponential family ($\theta$ coordinates) and the mixture family ($\eta$ coordinates). We compare Euclidean gradient descent (GD) in these coordinates with the coordinate-invariant natural gradient descent (NGD), where the natural gradient is a Riemannian gradient that incorporates the intrinsic geometry of the parameter space. In continuous time, we prove that the convergence rates of GD in the $\theta$ and $\eta$ coordinates provide lower and upper bounds, respectively, on the convergence rate of NGD. Moreover, under affine reparameterizations of the dual coordinates, the convergence rates of GD in $\eta$ and $\theta$ coordinates can be scaled to $2c$ and $\frac{2}{c}$, respectively, for any $c>0$, while NGD maintains a fixed convergence rate of $2$, remaining invariant to such transformations and sandwiched between them.
Although this suggests that NGD may not exhibit uniformly superior convergence in continuous time, we demonstrate that its advantages become pronounced in discrete time, where it achieves faster convergence and greater robustness to noise, outperforming GD. Our analysis hinges on bounding the spectrum and condition number of the Hessian of the KL divergence at the optimum, which coincides with the Fisher information matrix.
URL: https://openreview.net/forum?id=h6hjjAF5Bj
---
Title: Model Tensor Planning
Abstract: Sampling-based model predictive control (MPC) offers strong performance in nonlinear and contact-rich robotic tasks, yet often suffers from poor exploration due to locally greedy sampling schemes. We propose \emph{Model Tensor Planning} (MTP), a novel sampling-based MPC framework that introduces high-entropy control trajectory generation through structured tensor sampling. By sampling over randomized multipartite graphs and interpolating control trajectories with B-splines and Akima splines, MTP ensures smooth and globally diverse control candidates. We further propose a simple $\beta$-mixing strategy that blends local exploitative and global exploratory samples within the modified Cross-Entropy Method (CEM) update, balancing control refinement and exploration. Theoretically, we show that MTP achieves asymptotic path coverage and maximum entropy in the control trajectory space in the limit of infinite tensor depth and width.
Our implementation is fully vectorized using JAX and compatible with MuJoCo XLA, supporting \emph{Just-in-time} (JIT) compilation and batched rollouts for real-time control with online domain randomization. Through experiments on various challenging robotic tasks, ranging from dexterous in-hand manipulation to humanoid locomotion, we demonstrate that MTP outperforms standard MPC and evolutionary strategy baselines in task success and control robustness. Design and sensitivity ablations confirm the effectiveness of MTP’s tensor sampling structure, spline interpolation choices, and mixing strategy. Altogether, MTP offers a scalable framework for robust exploration in model-based planning and control.
URL: https://openreview.net/forum?id=fk1ZZdXCE3
---
Title: MoReact: Generating Reactive Motion from Textual Descriptions
Abstract: Modeling and generating human reactions poses a significant challenge with broad applications in computer vision and human-computer interaction. Existing methods either treat multiple individuals as a single entity, directly generating interactions, or rely solely on one person's motion to generate the other's reaction, failing to integrate the rich semantic information that underpins human interactions. Moreover, these methods often fall short in adaptive responsiveness, i.e., the ability to accurately respond to diverse and dynamic interaction scenarios. Recognizing this gap, our work introduces an approach tailored to address the limitations of existing models by focusing on text-driven human reaction generation. Our model generates realistic motion sequences for an individual responding to the other's actions, based on a descriptive text of the interaction scenario. The goal is to produce motion sequences that not only complement the opponent's movements but also semantically fit the described interactions. To achieve this, we present MoReact, a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially. This approach stems from the observation that generating global trajectories first is crucial for guiding local motion, ensuring better alignment with the given action and text. Furthermore, we introduce a novel interaction loss to enhance the realism of generated close interactions. Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach for this novel task: it produces realistic, diverse, and controllable reactions that closely match the movements of the counterpart while adhering to the textual guidance.
URL: https://openreview.net/forum?id=4zuT73heqm
---
Title: Enabling Automatic Differentiation with Mollified Graph Neural Operators
Abstract: Physics-informed neural operators offer a powerful framework for learning solution operators of partial differential equations (PDEs) by combining data and physics losses. However, these physics losses require the efficient and accurate computation of derivatives. Computing these derivatives remains challenging, with spectral and finite difference methods introducing approximation errors due to finite resolution. Here, we propose the mollified graph neural operator ($m$GNO), the first method to leverage automatic differentiation and compute exact gradients on arbitrary geometries. This enhancement enables efficient training on arbitrary point clouds and irregular grids with varying geometries while allowing the seamless evaluation of physics losses at randomly sampled points for improved generalization. For a PDE example on regular grids, $m$GNO paired with Autograd reduced the L2 relative data error by 20× compared to finite differences, suggesting it better captures the physics underlying the data. It can also solve PDEs on unstructured point clouds seamlessly, using physics losses only, at resolutions vastly lower than those needed for finite differences to be accurate enough. On these unstructured point clouds, $m$GNO leads to errors that are consistently 2 orders of magnitude lower than machine learning baselines (Meta-PDE, which accelerates PINNs) for comparable runtimes, and also delivers speedups from 1 to 3 orders of magnitude compared to the numerical solver for similar accuracy. $m$GNOs can also be used to solve inverse design and shape optimization problems on complex geometries.
URL: https://openreview.net/forum?id=CGoR1hFAGr
---
Title: Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models
Abstract: Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to the aesthetic preferences of the humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating the dataset on different pretrained models, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we show that current methods aiming to mitigate harmful spurious features do not take these factors into account, and hence fail to achieve considerable gains in worst-group accuracy when the size and location of core features in an image change.
URL: https://openreview.net/forum?id=Yqf2BhqfyZ
---
Title: UNICORNN: Unimodal Calibrated Ordinal Regression Neural Network
Abstract: Ordinal regression is a supervised machine learning technique aimed at predicting the value of a discrete dependent variable with an ordered set of possible outcomes. Many of the algorithms developed to address this problem rely on maximum likelihood for training. However, the standard maximum likelihood approach often fails to adequately capture the inherent order of classes, even though it tends to produce well-calibrated probabilities. Alternatively, some methods use Optimal Transport (OT) divergence as their training objective. Unlike maximum likelihood, OT accounts for the ordering of classes; however, in this manuscript, we show that it does not always yield well-calibrated probabilities. To overcome these limitations, we introduce UNICORNN, an approach inspired by the well-known Proportional Odds Model, which offers three key guarantees: (i) it ensures unimodal output probabilities, a valuable feature for many real-world applications; (ii) it employs an OT loss during training to accurately capture the natural order of classes; and (iii) it provides well-calibrated probability estimates through a post-training accuracy-preserving calibration step. Experimental results on six real-world datasets demonstrate that UNICORNN consistently either outperforms or performs as well as recently proposed deep learning approaches for ordinal regression. It excels in both accuracy and probability calibration, while also guaranteeing output unimodality. The code will be made publicly available upon acceptance.
URL: https://openreview.net/forum?id=Nb08hwXzsv
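One standard construction that guarantees unimodal probabilities over $K$ ordered classes is the binomial parameterization, shown here only to illustrate the unimodality guarantee the abstract refers to; it is not necessarily the architecture used in the paper.

```python
import math

def binomial_probs(pi, K):
    """Distribution over K ordered classes: p(k) = C(K-1, k) pi^k (1-pi)^(K-1-k).

    A binomial distribution is always unimodal, so any predictor that
    outputs a single pi in (0, 1) yields unimodal class probabilities
    by construction, with the mode moving from class 0 to class K-1
    as pi grows.
    """
    return [math.comb(K - 1, k) * pi ** k * (1 - pi) ** (K - 1 - k)
            for k in range(K)]
```

For example, `binomial_probs(0.7, 5)` places its mode at class 3 and decays monotonically on both sides.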
---
Title: The Geometry of Phase Transitions in Diffusion Models: Tubular Neighbourhoods and Singularities
Abstract: Diffusion models undergo phase transitions during the generative process where data features suddenly emerge in the final stages.
This study aims to elucidate this critical phenomenon from a geometric perspective. We employ the concept of ``injectivity radius'', a quantity that characterises the structure of the data manifold. Through theoretical and empirical evidence, we demonstrate that phase transitions in the generative process of diffusion models are closely related to the injectivity radius. Our findings offer a novel perspective on phase transitions in diffusion models, with potential implications for improving performance and sampling efficiency.
URL: https://openreview.net/forum?id=ahVFKFLYk2
---
Title: Risk-Seeking Reinforcement Learning via Multi-Timescale EVaR Optimization
Abstract: Risk sensitivity is pivotal in shaping agents' behavior when navigating uncertainty and diverging from risk-neutral scenarios. Risk measures such as Value at Risk (VaR) and Conditional Value at Risk (CVaR) have shown promising results in risk-sensitive reinforcement learning. In this paper, we study the incorporation of a relatively new coherent risk measure, Entropic Value at Risk (EVaR), as the objective the agent seeks to optimize. We propose a multi-timescale stochastic approximation algorithm to seek the optimal parameterized EVaR policy. Our algorithm facilitates effective exploration of the policy space and robust approximation of the gradient, leading to the optimization of the EVaR objective. We analyze the asymptotic behavior of our proposed algorithm and rigorously evaluate it across various discrete and continuous benchmark environments. The results highlight that the EVaR policy achieves higher cumulative returns and corroborate that EVaR is indeed a competitive risk-seeking objective for RL.
URL: https://openreview.net/forum?id=4nbEgNDsii
---
Title: Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry
Abstract: Diffusion models have demonstrated a remarkable capability to edit or imitate images, which has raised concerns regarding the safeguarding of intellectual property. To address these concerns, the adoption of adversarial attacks, which introduce into protected images adversarial perturbations that can fool the targeted diffusion model, has emerged as a viable solution. Consequently, diffusion models, like many other deep network models, are believed to be susceptible to adversarial attacks. However, in this work, we draw attention to an important oversight in existing research: all previous studies have focused solely on attacking latent diffusion models (LDMs), neglecting adversarial examples for pixel-space diffusion models (PDMs). Through extensive experiments, we demonstrate that nearly all existing adversarial attack methods designed for LDMs, as well as adaptive attacks designed for PDMs, fail when applied to PDMs. We attribute the vulnerability of LDMs to their encoders, indicating that diffusion models themselves exhibit strong robustness against adversarial attacks. Building upon this insight, we find that PDMs can be used as an off-the-shelf purifier to effectively eliminate adversarial patterns generated by LDMs, thereby maintaining the integrity of images. Notably, we highlight that most existing protection methods can be easily bypassed using PDM-based purification. We hope our findings prompt a reevaluation of adversarial samples for diffusion models as potential protection methods.
URL: https://openreview.net/forum?id=a31IcUSyeU
---
Title: Formulating Node Labelling as Node Classification or Link Prediction in Different Graph Representations
Abstract: Message-passing Graph Neural Networks (GNNs) are increasingly used for predictive tasks on graphs. Much work has been done to improve GNN architectures, but how the actual data graph should be designed is not well studied. In this paper, we investigate how two different graph representations impact the performance of GNN models across datasets with varying characteristics grouped by homophily, heterogeneity, and number of labels per node. A unique phenomenon is that the same abstract predictive task of labelling nodes is formulated as a node classification problem on one representation and as a link prediction problem on the other. Our work is the first to blur the line between these two basic and fundamental tasks in graph learning. Our experiments on 12 real-world datasets suggest that different representations (and tasks) are optimal for different datasets, models, and hyperparameters. We derive empirical heuristics of choosing between the two and pave the way towards a criterion of choosing the optimal graph representations and towards formally understanding the interconnection between node classification and link prediction.
URL: https://openreview.net/forum?id=lK7tjysj0s
---
Title: Genetic-Evolutionary Graph Neural Networks: A Paradigm for Improved Graph Representation Learning
Abstract: Message-passing graph neural networks have become the dominant framework for learning over graphs. However, empirical studies continually show that message-passing graph neural networks tend to generate over-smoothed representations for nodes after iteratively applying message passing. This over-smoothing problem is a core issue that limits the representational capacity of message-passing graph neural networks. We argue that the fundamental problem with over-smoothing is a lack of diversity in the generated embeddings, and that the problem can be reduced by enhancing embedding diversity in the embedding generation process. To this end, we propose genetic-evolutionary graph neural networks, a new paradigm for graph representation learning inspired by genetic algorithms. We view each layer of a graph neural network as an evolutionary process and develop operations based on crossover and mutation to prevent embeddings from becoming similar to one another, thus enabling the model to generate improved graph representations. The proposed framework has good interpretability, as it directly draws inspiration from genetic algorithms for preserving population diversity. We experimentally validate the proposed framework on six benchmark datasets spanning different tasks. The results show that our method significantly advances the performance of current graph neural networks, resulting in new state-of-the-art results for graph representation learning on these datasets.
URL: https://openreview.net/forum?id=qzYTklXVAB
---
Title: ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models
Abstract: Assessing the quality of generative model outputs from large language models (LLMs) or vision-language models (VLMs) poses significant challenges. Traditional evaluation methods either rely on human assessment, which is resource-intensive and not scalable, or on automatic metrics that often correlate poorly with human preferences. Another approach is to train dedicated neural evaluators, but this typically requires substantial training data and compute. In this study, we thus introduce ReFeR, a tuning-free framework for evaluating generative outputs, including both text and images, using a two-level hierarchy of pre-trained LLM and VLM evaluators. This multi-agent hierarchical strategy leverages additional compute at inference time by orchestrating multiple models and utilizing the increased test-time reasoning to boost performance. By having models themselves provide feedback and final judgments, ReFeR reduces the dependence on human evaluation. We rigorously evaluate ReFeR on four diverse evaluation benchmarks, where it surpasses prior methods in accuracy while also generating constructive feedback useful for downstream distillation and self-improvement via finetuning. Interestingly, ReFeR is also applicable to reasoning tasks: experiments on four reasoning benchmarks show ReFeR's superior collective reasoning abilities. We present two variants of the framework: ReFeR-Turbo, optimized for accelerated performance, and ReFeR-Lite, offering a more test-time compute efficient solution. ReFeR-Lite is $\sim12-14\times$ more compute efficient than previous works while being comparably accurate to ReFeR-Turbo.
URL: https://openreview.net/forum?id=otSHFe8wTf
---
Title: Efficient Object-Centric Representation Learning using Masked Generative Modeling
Abstract: Learning object-centric representations from visual inputs in an unsupervised manner has drawn attention as a route to solving more complex tasks, such as reasoning and reinforcement learning. However, current state-of-the-art methods, which rely on autoregressive transformers or diffusion models to generate scenes from object-centric representations, suffer from computational inefficiency due to their sequential or iterative nature. This computational bottleneck limits their practical application and hinders scaling to more complex downstream tasks. To overcome this, we propose MOGENT, an efficient object-centric learning framework based on masked generative modeling. MOGENT conditions a masked bidirectional transformer on learned object slots and employs a parallel iterative decoding scheme to generate scenes, enabling efficient compositional generation. We conduct experiments on 3D Shapes and CLEVR, demonstrating that MOGENT significantly improves computational efficiency, accelerating the generation process by up to 10x compared to autoregressive models. Importantly, this efficiency is attained while maintaining strong or competitive performance on object segmentation and compositional generation tasks.
URL: https://openreview.net/forum?id=t9KvOYPeL3
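The parallel iterative decoding scheme mentioned in the abstract (popularized by MaskGIT-style masked generative models) can be sketched generically as follows; the predictor here is a stand-in, not MOGENT's slot-conditioned transformer.

```python
import numpy as np

def iterative_decode(predict, T, steps):
    """Parallel iterative decoding in the style of masked generative models.

    All T token positions start masked (-1). At each step the model
    predicts every masked position at once; the most confident
    predictions are committed, and a cosine schedule decides how many
    positions remain masked, so the sequence is resolved in a few
    parallel passes instead of T sequential ones.
    """
    tokens = np.full(T, -1)
    for s in range(1, steps + 1):
        probs = predict(tokens)                         # (T, vocab) per-position distributions
        masked = tokens < 0
        conf = np.where(masked, probs.max(axis=1), -np.inf)
        n_mask = int(T * np.cos(np.pi / 2 * s / steps))  # positions left masked after this step
        n_commit = int(masked.sum()) - n_mask
        commit = np.argsort(-conf)[:n_commit]            # most confident masked positions
        tokens[commit] = probs.argmax(axis=1)[commit]
    return tokens
```

At the final step the cosine schedule reaches zero, so every position is committed and no masks remain.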
---
Title: Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budget
Abstract: Distributed high-dimensional mean estimation is a common aggregation routine used often in distributed optimization methods. Most of these applications call for a communication-constrained setting where the vectors, whose mean is to be estimated, have to be compressed before sharing. One could independently encode and decode these vectors to achieve compression, but that overlooks the fact that they are often close to each other. To exploit these similarities, Suresh et al., 2022, Jhunjhunwala et al., 2021, and Jiang et al., 2023 recently proposed multiple *correlation-aware compression schemes*. However, in most cases, the correlations have to be known for these schemes to work. Moreover, a theoretical analysis of the graceful degradation of these correlation-aware compression schemes with increasing *dissimilarity* is limited to only the $\ell_2$-error in the literature. In this paper, we propose four different collaborative compression schemes that agnostically exploit the similarities among vectors in a distributed setting. Our schemes are all simple to implement and computationally efficient, while resulting in large savings in communication. Our analysis shows how the $\ell_2$, $\ell_\infty$, and cosine estimation errors vary with the degree of similarity among vectors.
URL: https://openreview.net/forum?id=AtCKHCoMA7
---
Title: BALSA: Benchmarking Active Learning Strategies for Autonomous Laboratories
Abstract: The discovery of new materials and biological solutions is hindered by the vast complexity of design parameter spaces and resource-intensive data acquisition, which make traditional exhaustive search strategies impractical. Active learning methods, which iteratively identify the most informative data points, offer a promising solution: by focusing experiments or simulations on these points, they enable faster identification of optimal candidates while significantly reducing the data-labeling effort and resource requirements. Despite these advancements, the absence of standardized benchmarks impedes objective comparison of methodologies, slowing progress in autonomous scientific discovery. To address this, we introduce BALSA, a comprehensive benchmark tailored to systematically evaluate active learning search strategies in autonomous laboratories. BALSA provides a standardized evaluation protocol, novel metrics for high-dimensional optimization, and reference implementations to facilitate efficient and reproducible benchmarking. BALSA includes both synthetic benchmarks and real-world tasks in biology and materials science, designed to address the unique challenges, particularly limited data availability, of autonomous laboratories.
URL: https://openreview.net/forum?id=T1siqFh1lE
---
Title: Contextual Combinatorial Bandits With Changing Action Sets Via Gaussian Processes
Abstract: We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set $\mathcal{X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{\lambda^*(K)KT\overline{\gamma}_{T}} )$ regret with high probability, where $\overline{\gamma}_{T}$ is the maximum information gain associated with the set of base arm contexts that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds and $\lambda^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.
URL: https://openreview.net/forum?id=2RgfAY3jnI
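As background for the kernelized confidence bounds underlying such algorithms, a minimal GP-UCB index (exact GP posterior plus an optimistic exploration bonus) can be sketched as follows; this is a generic single-arm illustration, not the combinatorial O'CLOK-UCB procedure or its sparse-GP variant.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_ucb(X, y, Xs, beta=2.0, noise=1e-3):
    """GP posterior at candidate points Xs and the resulting UCB index.

    mu + beta * sigma is the optimistic score a UCB-style bandit
    algorithm would maximize over the available base arms' contexts.
    """
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = np.maximum(np.diag(rbf_kernel(Xs, Xs) - Ks.T @ sol), 0.0)
    return mu, var, mu + beta * np.sqrt(var)
```

The posterior variance shrinks near observed contexts, so the UCB index naturally favors arms that are either promising or under-explored.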
---
Title: Information-Theoretic State Variable Selection for Reinforcement Learning
Abstract: Identifying the most suitable variables to represent the state is a fundamental challenge in Reinforcement Learning (RL). These variables must efficiently capture the information necessary for making optimal decisions. To address this problem, we introduce the Transfer Entropy Redundancy Criterion (TERC), an information-theoretic criterion which determines whether there is \textit{entropy transferred} from state variables to actions during training. We define an algorithm based on TERC that provably excludes variables from the state that have no effect on the final performance of the agent, resulting in more sample-efficient learning. Experimental results show that this speed-up is present across three different algorithm classes (represented by tabular Q-learning, Actor-Critic, and Proximal Policy Optimization (PPO)) in a variety of environments. Furthermore, to highlight the differences between the proposed methodology and current state-of-the-art feature selection approaches, we present a series of controlled experiments on synthetic data before generalizing to real-world decision-making tasks. We also introduce a representation of the problem, based on Bayesian networks, that compactly captures the transfer of information from state variables to actions.
URL: https://openreview.net/forum?id=6aNZjKVwLn
---
Title: LLaVA-Video: Video Instruction Tuning With Synthetic Data
Abstract: The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we consider an alternative approach, creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this proposed dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
URL: https://openreview.net/forum?id=EElFGvt39K
---
Title: Interpretability for Time Series Transformers using A Concept Bottleneck Framework
Abstract: There has been a recent push of research on Transformer-based models for long-term time series forecasting, but the interpretability of these models remains largely unexplored.
To address this gap, we develop a framework based on Concept Bottleneck Models.
We modify the training objective to encourage a model to develop representations similar to predefined interpretable concepts using Centered Kernel Alignment.
We apply the framework to the Vanilla Transformer and Autoformer, and present an in-depth analysis on synthetic data and on a variety of benchmark datasets.
We find that the model performance remains mostly unaffected, while the model shows much improved interpretability.
Additionally, interpretable concepts become local, which makes the trained model easily intervenable.
We demonstrate this with an intervention after applying a time shift to the data.
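For reference, the linear form of Centered Kernel Alignment, the similarity measure used above to align model representations with concept representations, can be sketched as follows (assuming linear kernels; other kernel choices are possible):

```python
import numpy as np

def center(K):
    # Double-center a Gram matrix: H K H with H = I - 11^T / n.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def cka(X, Y):
    # Linear CKA between two representation matrices (n_samples x dim).
    Kc, Lc = center(X @ X.T), center(Y @ Y.T)
    return (Kc * Lc).sum() / (np.linalg.norm(Kc) * np.linalg.norm(Lc))
```

Linear CKA is invariant to orthogonal transformations and isotropic rescaling of either representation, which is what makes it suitable as a training objective for aligning learned features with predefined concepts.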
URL: https://openreview.net/forum?id=Jd7RA4z6Rs
---
Title: Symbolic Learning Enables Self-Evolving Agents
Abstract: The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents": complex pipelines built on large language models (LLMs) that combine prompting techniques and tool-usage methods. While language agents have demonstrated impressive capabilities on many real-world tasks, a fundamental limitation of current language-agent research is that it is model-centric, or engineering-centric. That is to say, progress on the prompts, tools, and pipelines of language agents requires substantial manual engineering effort from human experts rather than automatic learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key to their possibly achieving AGI.
In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in "self-evolving agents".
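The symbolic training loop described above might be caricatured as follows, where `llm` is a stand-in for any text-to-text model call and all prompt wordings are hypothetical, not the paper's actual optimizers:

```python
def symbolic_update(llm, prompts, example, goal):
    """One hypothetical 'training step' of agent symbolic learning:
    a forward pass through the prompt pipeline, a language loss, and
    a language-gradient update of every prompt."""
    out = example
    for p in prompts:                       # forward pass through the pipeline
        out = llm(f"{p}\n\nInput: {out}")
    # Language loss: a critique of the final output with respect to the goal.
    loss = llm(f"Criticise this output given the goal '{goal}':\n{out}")
    # Language-gradient step: rewrite each prompt to address the critique.
    return [llm(f"Rewrite this prompt to address the critique.\n"
                f"Prompt: {p}\nCritique: {loss}") for p in prompts]
```

The analogy to back-propagation is that the critique (loss) is pushed back through every node of the symbolic network, and each prompt is updated in natural language rather than by a numeric gradient step.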
URL: https://openreview.net/forum?id=OgqgZ8vV9N
---
Title: RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text
Abstract: The fixed context size of the Transformer renders the generation of coherent long text with GPT models challenging. In this paper, we introduce RecurrentGPT, a language-based simulacrum of the recurrence mechanism in RNNs, for text generation.
RecurrentGPT is built upon a large language model (LLM) such as ChatGPT and employs a recurrent prompting mechanism, expressed in natural language, to simulate the recurrent computation of RNNs and generate arbitrarily long texts. At each timestep, RecurrentGPT uses prompting to generate a paragraph of text and to update its language-based long short-term memory. This recurrent prompting mechanism enables RecurrentGPT to generate texts of arbitrary length without the need to fit long texts in the context window. Since human users can easily observe and edit the natural-language memories, RecurrentGPT is naturally interpretable and enables interactive generation of long text. RecurrentGPT is an initial step towards next-generation computer-assisted writing systems that go beyond local editing suggestions. Our experiments show that RecurrentGPT can generate long texts of better quality and coherence than other long-text generation strategies.
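The recurrent prompting loop can be sketched as follows (a minimal illustration; `llm` is a hypothetical text-to-text callable, and the prompt format and memory layout are invented for the example):

```python
def recurrent_generate(llm, topic, n_steps=3):
    """Sketch of RecurrentGPT-style recurrence: at each step the model
    produces a paragraph plus an updated one-line memory, and the memory
    is fed back into the next prompt instead of the full text so far."""
    memory = f"Plan: write a long text about {topic}."
    paragraphs = []
    for _ in range(n_steps):
        prompt = (f"Memory:\n{memory}\n\n"
                  "Write the next paragraph, then an updated one-line "
                  "memory, separated by '---'.")
        paragraph, memory = llm(prompt).split("---", 1)
        paragraphs.append(paragraph.strip())
    return "\n\n".join(paragraphs)
```

Because only the compact memory, never the full generated text, re-enters the context, the length of the output is decoupled from the model's context size; the memory string is also what a human user would inspect and edit interactively.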
URL: https://openreview.net/forum?id=VER7dmBoU0
---
Title: Efficient Prompting via Dynamic In-Context Learning
Abstract: In-context learning has become a common practice for prompting generalist models. Despite being effective, it can be computationally inefficient because it makes the input prompt much longer, consuming valuable space in the context window and incurring larger computational costs. In this paper, we propose DynaICL, a recipe for efficient prompting with black-box generalist models that dynamically allocates in-context examples according to the input complexity and the computational budget. We train a meta controller that predicts the number of in-context examples the generalist model needs to make a good prediction on a specific input, based on its difficulty. We then dynamically allocate the number of demonstrations for each input according to the computation budget. Experimental results show that DynaICL achieves a better performance-efficiency trade-off in two practical settings with constraints on either computational resources or minimum required performance. Specifically, DynaICL saves up to 46% of the token budget compared to the common practice of allocating the same number of in-context examples to every input. We also find that a meta controller trained on one backbone model and set of tasks generalizes to unseen models and tasks, suggesting that a meta controller can be trained once and reused across use cases.
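The budget-constrained allocation step might look as follows in a deliberately simplified form, where `difficulties` stands in for the meta controller's predictions (the greedy scheme and the `k_max` cap are assumptions for illustration, not the paper's exact recipe):

```python
def allocate_examples(difficulties, budget, k_max=8):
    """Greedy sketch of budget-aware allocation: harder inputs receive
    more in-context examples, one at a time, until the budget is spent
    or every input has reached the per-input cap k_max."""
    alloc = [0] * len(difficulties)
    order = sorted(range(len(difficulties)), key=lambda i: -difficulties[i])
    remaining = budget
    while remaining > 0:
        progressed = False
        for i in order:
            if remaining == 0 or alloc[i] >= k_max:
                continue
            alloc[i] += 1
            remaining -= 1
            progressed = True
        if not progressed:
            break
    return alloc
```

The key contrast with the standard recipe is that the uniform scheme would assign `budget // len(difficulties)` demonstrations to every input regardless of difficulty.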
URL: https://openreview.net/forum?id=W1sVdo0oF3
---
Title: Enhancing Plaque Segmentation in CCTA with Prompt-based Diffusion Data Augmentation
Abstract: Coronary computed tomography angiography (CCTA) is essential for non-invasive assessment of coronary artery disease (CAD). However, accurate segmentation of atherosclerotic plaques remains challenging due to data scarcity, severe class imbalance, and significant variability between calcified and non-calcified plaques. Inspired by DiffTumor’s tumor synthesis and PromptIR’s adaptive restoration framework, we introduce PromptLesion, a prompt-conditioned diffusion model for multi-class lesion synthesis. Unlike single-class methods, our approach integrates lesion-specific prompts within the diffusion generation process, enhancing diversity and anatomical realism in synthetic data. We validate PromptLesion on a private CCTA dataset and multi-organ tumor segmentation tasks (kidney, liver, pancreas) using public datasets, achieving superior performance compared to baseline methods. Models trained with our prompt-guided synthetic augmentation significantly improve Dice Similarity Coefficient (DSC) scores for both plaque and tumor segmentation. Extensive evaluations and ablation studies confirm the effectiveness of prompt conditioning.
URL: https://openreview.net/forum?id=hbTYt8PX9n
---
Title: Families of Optimal Transport Kernels for Cell Complexes
Abstract: Recent advances have discussed cell complexes as ideal learning representations. However, there is a lack of available machine learning methods suitable for learning on CW complexes. In this paper, we derive an explicit expression for the Wasserstein distance between cell complex signal distributions in terms of a Hodge-Laplacian matrix. This leads to a structurally meaningful measure to compare CW complexes and define the optimal transportation map. In order to simultaneously include both feature and structure information, we extend the Fused Gromov-Wasserstein distance to CW complexes. Finally, we introduce novel kernels over the space of probability measures on CW complexes based on the dual formulation of optimal transport.
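For intuition, when the compared signal distributions are zero-mean Gaussians whose covariances come from (pseudo-)inverses of Hodge-Laplacian matrices, the 2-Wasserstein distance has a closed form; the following is a generic sketch under that assumption (the paper's exact construction may differ):

```python
import numpy as np

def psd_sqrt(M):
    # Symmetric PSD matrix square root via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def w2_gaussian(A, B):
    # Bures/2-Wasserstein distance between N(0, A) and N(0, B):
    # W2^2 = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2}).
    rA = psd_sqrt(A)
    cross = psd_sqrt(rA @ B @ rA)
    inner = np.trace(A) + np.trace(B) - 2.0 * np.trace(cross)
    return np.sqrt(max(inner, 0.0))

def laplacian_covariance(L, eps=1e-9):
    # Covariance as the Moore-Penrose pseudo-inverse of a (Hodge) Laplacian.
    return np.linalg.pinv(L, rcond=eps)
```

The pseudo-inverse is needed because Hodge Laplacians are singular (their kernel is spanned by the harmonic signals), so the induced Gaussian is supported on the orthogonal complement of that kernel.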
URL: https://openreview.net/forum?id=haXP1WPfzw
---
Title: Efficient Reasoning Models: A Survey
Abstract: Reasoning models have demonstrated remarkable progress in solving complex and logic-intensive tasks by generating extended Chain-of-Thoughts (CoTs) prior to arriving at a final answer. Yet the emergence of this “slow-thinking” paradigm, with numerous tokens generated in sequence, inevitably introduces substantial computational overhead, highlighting an urgent need for effective acceleration. This survey aims to provide a comprehensive overview of recent advances in efficient reasoning. It categorizes existing works into three key directions: (1) shorter – compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller – developing compact language models with strong reasoning capabilities through techniques such as knowledge distillation, other model compression techniques, and reinforcement learning; and (3) faster – designing efficient decoding strategies to accelerate inference.
URL: https://openreview.net/forum?id=sySqlxj8EB
---
Title: Beyond Ordinary Lipschitz Constraints: Differentially Private Optimization with TNC
Abstract: We study Stochastic Convex Optimization in the Differential Privacy model (DP-SCO). Unlike previous studies, here we assume that the population risk function satisfies
the Tsybakov Noise Condition (TNC) with some parameter $\theta>1$, where the Lipschitz constant of the loss could be extremely large or even unbounded, but the $\ell_2$-norm of the loss gradient has a bounded $k$-th moment with $k\geq 2$.
For the Lipschitz case with $\theta\geq 2$, we first propose an $(\epsilon, \delta)$-DP algorithm whose utility bound is $\tilde{O}\left(\left(\tilde{r}_{2k}(\frac{1}{\sqrt{n}}+(\frac{\sqrt{d}}{n\epsilon}))^\frac{k-1}{k}\right)^\frac{\theta}{\theta-1}\right)$ with high probability, where $n$ is the sample size, $d$ is the model dimension, and $\tilde{r}_{2k}$ is a term that depends only on the $2k$-th moment of the gradient. Notably, this upper bound is independent of the Lipschitz constant. We then extend the result to the case where
$\theta\geq \bar{\theta}> 1$ for some known constant $\bar{\theta}$. Moreover, when the privacy budget $\epsilon$ is small enough, we show an upper bound of $\tilde{O}\left(\left(\tilde{r}_{k}(\frac{1}{\sqrt{n}}+(\frac{\sqrt{d}}{n\epsilon}))^\frac{k-1}{k}\right)^\frac{\theta}{\theta-1}\right)$ even if the loss function is not Lipschitz. For the lower bound, we show that for any $\theta\geq 2$, the private minimax rate under $\rho$-zero-concentrated differential privacy ($\rho$-zCDP) is lower bounded by $\Omega\left(\left(\tilde{r}_{k}(\frac{1}{\sqrt{n}}+(\frac{\sqrt{d}}{n\sqrt{\rho}}))^\frac{k-1}{k}\right)^\frac{\theta}{\theta-1}\right)$.
URL: https://openreview.net/forum?id=SZCygcrGng
---
Title: EGAIN: Enhanced Generative Adversarial Networks for Imputing Missing Values
Abstract: Missing values pose a challenge in predictive analysis, especially with big data, because most models depend on complete datasets to estimate functional relationships between variables. Generative Adversarial Imputation Networks (GAIN) are among the most reliable methods for imputing missing values with plausible values drawn from the data distribution. This research introduces Enhanced Generative Adversarial Networks (EGAIN), which address the GAIN convergence issue, add new functionality to the GAIN process, and significantly improve its performance.
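The masking-and-hint preprocessing that GAIN-style imputation rests on can be sketched as follows (a simplified sketch: the hint mechanism here reveals mask entries at random, whereas the original GAIN uses 0.5 placeholders for the hidden entries):

```python
import numpy as np

def gain_inputs(X, hint_rate=0.9, rng=None):
    """Build the masked input and hint matrix that a GAIN-style generator
    and discriminator consume (mask = 1 where the value is observed)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (~np.isnan(X)).astype(float)
    noise = rng.uniform(0.0, 0.01, X.shape)
    # Observed entries keep their value; missing entries get random noise
    # that the generator will learn to replace with plausible imputations.
    x_tilde = np.nan_to_num(X) * mask + noise * (1.0 - mask)
    # Simplified hint: reveal each mask entry with probability hint_rate,
    # so the discriminator cannot trivially infer which entries are fake.
    hint = mask * (rng.uniform(size=X.shape) < hint_rate)
    return x_tilde, mask, hint
```

The generator is then trained to fill the noise positions so that the discriminator, given only the hint, cannot tell imputed entries from observed ones.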
URL: https://openreview.net/forum?id=6ZnELqOKVz
---
Title: LRVS-Fashion: Extending Visual Search with Referring Instructions
Abstract: This paper introduces a new challenge for image similarity search in the context of fashion, addressing the inherent ambiguity in this domain stemming from complex images. We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity, following recent interest in the industry. We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs, designed explicitly for this task. Unlike traditional visual search methods in the industry, we demonstrate that superior performance can be achieved by bypassing explicit object detection and adopting weakly-supervised conditional contrastive learning on image tuples. Our method is lightweight and robust, reaching a Recall@1 superior to that of strong detection-based baselines against 2M distractors.
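For context, the standard (unconditional) InfoNCE objective underlying such contrastive training can be sketched as follows; the paper's conditional variant, which additionally injects the referring instruction into the embedding, is omitted here:

```python
import numpy as np

def info_nce(q, k, temperature=0.07):
    """InfoNCE over a batch of embedding pairs: q[i] should match k[i]
    against all other rows of k acting as in-batch negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Cross-entropy with the diagonal (the positive pair) as the target.
    return -np.log(np.diag(p) + 1e-12).mean()
```

In the retrieval setting, `q` would hold query-image embeddings and `k` the catalog-image embeddings, and evaluation against distractors amounts to ranking all rows of `k` by the same similarity.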
URL: https://openreview.net/forum?id=rHKh4cOFbl
---