Weekly TMLR digest for Jun 15, 2025

TMLR

Jun 15, 2025, 12:00:10 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Reproducibility Certification, Survey Certification: Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability

Michael M. Jerge, David Evans

https://openreview.net/forum?id=xeGWsmqFS8

---


Reproducibility Certification: Revisiting Discover-then-Name Concept Bottleneck Models: A Reproducibility Study

Freek Byrman, Emma Kasteleyn, Bart Kuipers, Daniel Uyterlinde

https://openreview.net/forum?id=946cT3Jsq5

---


Reproducibility Certification: Dynamics of the accelerated t-SNE

Kyoichi Iwasaki, Hideitsu Hino

https://openreview.net/forum?id=dfUebM9asV

---


Featured Certification: SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Shivanshu Shekhar, Shreyas Singh, Tong Zhang

https://openreview.net/forum?id=xQbRFHfgGL

---


Accepted papers
===============


Title: Solving Multi-agent Path Finding as an LLM Benchmark: How, How Good and Why

Authors: Weizhe Chen, Sven Koenig, Bistra Dilkina

Abstract: The rapid success of large language models (LLMs) has spurred extensive research into their ability to solve a wide range of tasks. However, their potential in multi-agent planning remains underexplored. Multi-agent planning presents unique challenges due to the combined complexity of coordination and long-horizon reasoning, often making it difficult to leverage external tools for assistance. In this paper, we introduce Multi-Agent Path Finding (MAPF), also known as multi-robot route planning, as a novel benchmark for evaluating the reasoning capabilities of LLMs. We first describe how the MAPF benchmark can be adapted for LLM-based evaluation, including dataset curation and an agentic workflow for LLMs. We show the motivating success of single-agent planning and multi-agent path finding on an empty room map without obstacles, and then the failure to plan on the harder room and maze maps of the standard MAPF benchmark. We present our position on why directly solving MAPF with LLMs has not yet been successful, and we use various experiments to support our hypothesis. Based on our results, we discuss how researchers with different backgrounds could approach this problem from different perspectives.

URL: https://openreview.net/forum?id=8hAxEFRVQT

---

Title: Towards Undistillable Models by Minimizing Conditional Mutual Information

Authors: Linfeng Ye, Shayan Mohajer Hamidi, EN-HUI YANG

Abstract: A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD). In this case, the distilled student (referred to as the knockoff student) does not outperform a student trained independently with label smoothing (LS student) in terms of prediction accuracy. To protect the intellectual property of DNNs, it is desirable to build undistillable DNNs. To this end, it is first observed that an undistillable DNN may have the trait that each cluster of its output probability distributions, in response to all sample instances with the same label, should be highly concentrated, to the extent that each cluster corresponding to each label should ideally collapse into one probability distribution. Based on this observation, and by measuring the concentration of each cluster in terms of conditional mutual information (CMI), a new training method called the CMI-minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature-scaled clusters across the entire temperature spectrum. The resulting CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods existing in the literature. That is, the knockoff students distilled by these KD methods from the CMIM model underperform the respective LS students. In addition, the CMIM model is also shown to perform better than the model trained with the CE loss alone in terms of its own prediction accuracy.
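
To make the training objective above concrete, here is a minimal PyTorch-style sketch, assuming one plausible reading of the CMI term: for each temperature, each label's "cluster center" is the mean softmax output over samples with that label, and the concentration penalty is the mean KL divergence of each sample's output from its center. The weighting lambda_cmi and the temperature set are illustrative placeholders, not the paper's settings.

    import torch
    import torch.nn.functional as F

    def cmim_style_loss(logits, labels, temperatures=(1.0, 2.0, 4.0), lambda_cmi=0.1):
        # Cross entropy plus a concentration penalty on label-conditional output clusters.
        ce = F.cross_entropy(logits, labels)
        classes = labels.unique()
        cmi_total = 0.0
        for t in temperatures:
            probs = F.softmax(logits / t, dim=-1)          # temperature-scaled outputs
            cmi_t = 0.0
            for c in classes:
                cluster = probs[labels == c]               # outputs sharing label c
                center = cluster.mean(dim=0, keepdim=True) # cluster "center"
                kl = (cluster * (cluster.clamp_min(1e-12).log()
                                 - center.clamp_min(1e-12).log())).sum(dim=-1)
                cmi_t = cmi_t + kl.mean()
            cmi_total = cmi_total + cmi_t / len(classes)
        return ce + lambda_cmi * cmi_total / len(temperatures)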

URL: https://openreview.net/forum?id=jVABSsD4Vf

---

Title: Distributionally Robust Coreset Selection under Covariate Shift

Authors: Tomonari Tanaka, Hiroyuki Hanada, Hanting Yang, Aoyama Tatsuya, Yu Inatsu, Akahane Satoshi, Yoshito Okura, Noriaki Hashimoto, Taro Murayama, Hanju Lee, Shinya Kojima, Ichiro Takeuchi

Abstract: Coreset selection, which involves selecting a small subset from an existing training dataset, is an approach to reducing the amount of training data, and a variety of methods have been proposed for it. In practical situations where these methods are employed, it is often the case that the data distributions differ between the development phase and the deployment phase, with the latter being unknown. Thus, it is challenging to select an effective subset of training data that performs well across all deployment scenarios. We therefore propose Distributionally Robust Coreset Selection (DRCS). DRCS theoretically derives an estimate of the upper bound of the worst-case test error, assuming that the future covariate distribution may deviate within a defined range from the training distribution. Furthermore, by selecting instances in a way that suppresses this estimate, DRCS achieves distributionally robust training instance selection. This study primarily applies to convex training settings, but we demonstrate that it can also be applied to deep learning under appropriate approximations. In this paper, we focus on covariate shift, a type of data distribution shift, and demonstrate the effectiveness of DRCS through experiments.

URL: https://openreview.net/forum?id=Eu7XMLJqsC

---

Title: Thoughts and Lessons on Using Visual Foundation Models for Manipulation

Authors: Ryan Chen, Ziteng Pang, Bradly C. Stadie

Abstract: Training vision-based robotic systems from scratch is both computationally expensive and memory intensive. To mitigate these challenges, recent approaches forgo end-to-end training in favor of adopting visual representations from visual foundation models -- large scale models designed for broad task transferability. Recent years have seen numerous vision foundation models emerge, including several designed specifically for manipulation tasks. However, we still lack clear principles for what makes these models effective for robotics applications. To address this gap, we systematically evaluate vision foundation models to understand what makes them effective for offline robotic learning. We find that across eleven diverse vision encoders, a representation's ability to reconstruct edges and predict keypoints strongly correlates with its performance on manipulation tasks. Extensive correlation analysis across 21 manipulation tasks consistently shows that representations preserving edge and keypoint information achieve the highest environment success rates. These findings appear to challenge conventional wisdom about holistic reconstruction-based pretraining and offer a new lens for understanding what makes vision representations effective for robotics.

URL: https://openreview.net/forum?id=o6mnkDzVuc

---

Title: SR-Reward: Taking The Path More Traveled

Authors: Seyed Mahdi B. Azad, Zahra Padar, Gabriel Kalweit, Joschka Boedecker

Abstract: In this paper, we propose a novel method for learning reward functions directly from offline demonstrations. Unlike traditional inverse reinforcement learning (IRL), our approach decouples the reward function from the learner's policy, eliminating the adversarial interaction typically required between the two. This results in a more stable and efficient training process. Our reward module, SR-Reward, leverages the successor representation (SR) to encode a state based on the expected visitation of future states under the demonstration policy and transition dynamics. By utilizing the Bellman equation, SR-Reward can be learned concurrently with most reinforcement learning (RL) algorithms without altering the existing training pipeline. We also introduce a negative sampling strategy to mitigate overestimation errors by reducing rewards for out-of-distribution data, thereby enhancing robustness. This strategy introduces an inherent conservative bias into RL algorithms that employ the learned reward, encouraging them to stay close to the demonstrations, where the consequences of actions are better understood. We evaluate our method on the D4RL and ManiSkill robot manipulation environments, achieving competitive results compared to offline RL algorithms with access to true rewards and to imitation learning (IL) techniques such as behavioral cloning.
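
For readers unfamiliar with successor representations, the sketch below shows the tabular version of the idea: the SR is learned from demonstration transitions with a TD-style Bellman update, and a reward-like score is read out linearly from it. The tabular setting, the learning rate, and the linear read-out are simplifying assumptions, not the SR-Reward implementation.

    import numpy as np

    def td_update_sr(M, s, s_next, gamma=0.99, lr=0.1):
        # One TD(0) update of a tabular successor representation, following the
        # Bellman equation for the SR: M(s, .) <- 1{s} + gamma * M(s', .)
        n = M.shape[0]
        target = np.eye(n)[s] + gamma * M[s_next]
        M[s] += lr * (target - M[s])
        return M

    def sr_reward(M, s, w):
        # Linear read-out: score a state by the discounted future visitation it induces.
        return float(M[s] @ w)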

URL: https://openreview.net/forum?id=bzk1sV1svm

---

Title: AI Agents That Matter

Authors: Sayash Kapoor, Benedikt Stroebl, Zachary S Siegel, Nitya Nadgir, Arvind Narayanan

Abstract: AI agents are an exciting new research direction, and agent development is driven by benchmarks. Our analysis of current agent benchmarks and evaluation practices reveals several shortcomings that hinder their usefulness in real-world applications. First, there is a narrow focus on accuracy without attention to other metrics. As a result, SOTA agents are needlessly complex and costly, and the community has reached mistaken conclusions about the sources of accuracy gains. Our focus on cost in addition to accuracy motivates the new goal of jointly optimizing the two metrics. We design and implement one such optimization, showing its potential to greatly reduce cost while maintaining accuracy. Second, the benchmarking needs of model and downstream developers have been conflated, making it hard to identify which agent would be best suited for a particular application. Third, many agent benchmarks have inadequate holdout sets, and sometimes none at all. This has led to agents that are fragile because they take shortcuts and overfit to the benchmark in various ways. We prescribe a principled framework for avoiding overfitting. Finally, there is a lack of standardization in evaluation practices, leading to a pervasive lack of reproducibility. We hope that the steps we introduce for addressing these shortcomings will spur the development of agents that are useful in the real world and not just accurate on benchmarks.
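
As one concrete reading of "jointly optimizing the two metrics", the snippet below computes a cost-accuracy Pareto frontier over candidate agents; the dictionary format and the example numbers are made up for illustration and are not the paper's evaluation harness.

    def pareto_frontier(agents):
        # Keep agents that no other agent dominates on (lower cost, higher accuracy).
        frontier = []
        for a in agents:
            dominated = any(
                b["cost"] <= a["cost"] and b["accuracy"] >= a["accuracy"]
                and (b["cost"] < a["cost"] or b["accuracy"] > a["accuracy"])
                for b in agents
            )
            if not dominated:
                frontier.append(a)
        return sorted(frontier, key=lambda x: x["cost"])

    agents = [
        {"name": "simple baseline", "cost": 0.02, "accuracy": 0.71},
        {"name": "retry-and-reflect", "cost": 0.45, "accuracy": 0.72},
        {"name": "large ensemble", "cost": 1.10, "accuracy": 0.74},
    ]
    print(pareto_frontier(agents))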

URL: https://openreview.net/forum?id=Zy4uFzMviZ

---

Title: LASE: Learned Adjacency Spectral Embeddings

Authors: María Sofía Pérez Casulo, Marcelo Fiori, Federico Larroca, Gonzalo Mateos

Abstract: We put forth a principled design of a neural architecture to learn nodal Adjacency Spectral Embeddings (ASE) from graph inputs. By bringing to bear the gradient descent (GD) method and leveraging the technique of algorithm unrolling, we truncate and re-interpret each GD iteration as a layer in a graph neural network (GNN) that is trained to approximate the ASE. Accordingly, we call the resulting embeddings and our parametric model Learned ASE (LASE), which is interpretable, parameter efficient, robust to inputs with unobserved edges, and offers controllable complexity during inference. LASE layers combine Graph Convolutional Network (GCN) and fully-connected Graph Attention Network (GAT) modules, which is intuitively pleasing since GCN-based local aggregations alone are insufficient to express the sought graph eigenvectors. We propose several refinements to the unrolled LASE architecture (such as sparse attention in the GAT module and decoupled layerwise parameters) that offer favorable approximation error versus computation tradeoffs; even outperforming heavily-optimized eigendecomposition routines from scientific computing libraries. Because LASE is a differentiable function with respect to its parameters as well as its graph input, we can seamlessly integrate it as a trainable module within a larger (semi-)supervised graph representation learning pipeline. The resulting end-to-end system effectively learns "discriminative ASEs" that exhibit competitive performance in supervised link prediction and node classification tasks, outperforming a GNN even when the latter is endowed with open loop, meaning task-agnostic, precomputed spectral positional encodings.
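
The unrolling idea is easiest to see without the learned modules: gradient descent on ||A - XX^T||_F^2 yields the update X <- X + eta (A - XX^T) X, and each such iteration becomes one layer. The sketch below is only this un-learned skeleton with a fixed step size; in LASE the update is parameterized with GCN/GAT components and trained.

    import torch

    def unrolled_ase(A, d=8, num_layers=10, step=0.01):
        # Approximate adjacency spectral embeddings by unrolling gradient descent
        # on ||A - X X^T||_F^2; one GD iteration plays the role of one layer.
        n = A.shape[0]
        X = 0.1 * torch.randn(n, d)
        for _ in range(num_layers):
            X = X + step * (A - X @ X.T) @ X
        return X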

URL: https://openreview.net/forum?id=J65NBLWrmh

---

Title: Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability

Authors: Michael M. Jerge, David Evans

Abstract: Though Large Language Models (LLMs) have demonstrated remarkable capabilities, they are still prone to outputting falsehoods in seemingly persuasive language. Many recent works attempt to address this problem by using LLMs in a framework where a single seed prompt results in a series of interactions involving augmented prompts with an otherwise unchanged LLM, and the results are aggregated with the goal of producing a more reliable output. We consider the replicability and generalizability of evaluations of inference-time methods intended to improve the reliability of responses from a base LLM. We survey how methods have been evaluated in the literature and find a great variety of benchmarks and models in use. Motivated by this, we conduct our own evaluation of the effectiveness of several such methods across a range of benchmarks and models. Our evaluation reveals that while these techniques show promise in improving reliability, there is still significant variability in performance across different domains and tasks, and methods that show substantial improvements on weaker base models often do not improve reliability for better base models.

URL: https://openreview.net/forum?id=xeGWsmqFS8

---

Title: Optimization Guarantees for Square-Root Natural-Gradient Variational Inference

Authors: Navish Kumar, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Aurelien Lucchi

Abstract: Variational inference with natural-gradient descent often shows fast convergence in practice, but its theoretical convergence guarantees have been challenging to establish. This is true even for the simplest cases that involve concave log-likelihoods and use a Gaussian approximation. We show that the challenge can be circumvented for such cases using a square-root parameterization for the Gaussian covariance. This approach establishes novel convergence guarantees for natural-gradient variational-Gaussian inference and its continuous-time gradient flow. Our experiments demonstrate the effectiveness of natural gradient methods and highlight their advantages over algorithms that use Euclidean or Wasserstein geometries.

URL: https://openreview.net/forum?id=OMOFmb6ve7

---

Title: Simple Calibration via Geodesic Kernels

Authors: Jayanta Dey, Haoyin Xu, Ashwin De Silva, Joshua T Vogelstein

Abstract: Deep discriminative approaches, such as decision forests and deep neural networks, have recently found applications in many important real-world scenarios. However, deploying these learning algorithms in safety-critical applications raises concerns, particularly when it comes to ensuring calibration for both in-distribution and out-of-distribution regions. Many popular methods for in-distribution (ID) calibration, such as isotonic and Platt’s sigmoidal regression, exhibit adequate ID calibration performance. However, these methods are not calibrated for the entire feature space, leading to overconfidence in the out-of-distribution (OOD) region. Existing OOD calibration methods generally exhibit poor ID calibration. In this paper, we jointly address the ID and OOD problems. We leveraged the fact that deep models learn to partition feature space into a union of polytopes, that is, flat-sided geometric objects. We introduce a geodesic distance to measure the distance between these polytopes and further distinguish samples within the same polytope using a Gaussian kernel. Our experiments on both tabular and vision benchmarks show that the proposed approaches, namely Kernel Density Forest (KDF) and Kernel Density Network (KDN), obtain well-calibrated posteriors for both ID and OOD samples, while mostly preserving the classification accuracy and extrapolating beyond the training data to handle OOD inputs appropriately.

URL: https://openreview.net/forum?id=dpcRp8ix5T

---

Title: TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining

Authors: Ruiyi Zhang, Sai Ashish Somayajula, Pengtao Xie

Abstract: Large-scale general domain pretraining followed by downstream-specific finetuning has become a predominant paradigm in machine learning. However, discrepancies between the pretraining and target domains can still lead to performance degradation in certain cases, underscoring the need for task-adaptive continued pretraining (TAP). TAP methods typically involve continued pretraining on task-specific unlabeled datasets or introducing additional unsupervised learning objectives to enhance model capabilities. While many TAP methods perform continued pretraining with multiple pretraining objectives, they often determine the tradeoff parameters between objectives manually, resulting in suboptimal outcomes and higher computational costs. In this paper, we propose TapWeight, a task-adaptive pretraining framework which automatically determines the optimal importance of each pretraining objective based on downstream feedback. TapWeight reweights each pretraining objective by solving a multi-level optimization problem. We applied TapWeight to both molecular property prediction and natural language processing tasks, significantly surpassing baseline methods. Experimental results validate the effectiveness and generalizability of TapWeight.

URL: https://openreview.net/forum?id=DCCw2CEVFS

---

Title: Enhancing deep neural networks through complex-valued representations and Kuramoto synchronization dynamics

Authors: Sabine Muzellec, Andrea Alamia, Thomas Serre, Rufin VanRullen

Abstract: Neural synchrony is hypothesized to play a crucial role in how the brain organizes visual scenes into structured representations, enabling the robust encoding of multiple objects within a scene. However, current deep learning models often struggle with object binding, limiting their ability to represent multiple objects effectively. Inspired by neuroscience, we investigate whether synchrony-based mechanisms can enhance object encoding in artificial models trained for visual categorization. Specifically, we combine complex-valued representations with Kuramoto dynamics to promote phase alignment, facilitating the grouping of features belonging to the same object. We evaluate two architectures employing synchrony: a feedforward model and a recurrent model with feedback connections to refine phase synchronization using top-down information. Both models outperform a real-valued baseline and complex-valued models without Kuramoto synchronization on tasks involving multi-object images, such as overlapping handwritten digits, noisy inputs, and out-of-distribution transformations. Our findings highlight the potential of synchrony-driven mechanisms to enhance deep learning models, improving their performance, robustness, and generalization in complex visual categorization tasks.
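
The Kuramoto dynamics referred to above have a standard discrete-time form, sketched below for plain phase variables; the coupling strength, step size, and all-to-all coupling are generic assumptions, not the paper's architecture, which applies the mechanism to the phases of complex-valued feature maps.

    import numpy as np

    def kuramoto_step(theta, omega, K, dt=0.1, coupling=None):
        # One Euler step of d theta_i/dt = omega_i + (K/N) * sum_j c_ij * sin(theta_j - theta_i).
        # Units with strong mutual coupling tend to phase-align (synchronize).
        n = len(theta)
        if coupling is None:
            coupling = np.ones((n, n))
        diff = theta[None, :] - theta[:, None]     # theta_j - theta_i
        dtheta = omega + (K / n) * (coupling * np.sin(diff)).sum(axis=1)
        return theta + dt * dtheta

    theta = np.random.uniform(0.0, 2.0 * np.pi, size=16)
    for _ in range(100):
        theta = kuramoto_step(theta, omega=np.zeros(16), K=2.0)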

URL: https://openreview.net/forum?id=zx6QGmBL43

---

Title: Large Language Model-Brained GUI Agents: A Survey

Authors: Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

Abstract: Graphical User Interfaces (GUIs) have long been central to human-computer interaction, providing an intuitive and visually-driven way to access and interact with digital systems. Traditionally, automating GUI interactions relied on script-based or rule-based approaches, which, while effective for fixed workflows, lacked the flexibility and adaptability required for dynamic, real-world applications. The advent of Large Language Models (LLMs), particularly multimodal models, has ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, task generalization, and visual processing. This has paved the way for a new generation of ''LLM-brained'' GUI agents capable of interpreting complex GUI elements and autonomously executing actions based on natural language instructions. These agents represent a paradigm shift, enabling users to perform intricate, multi-step tasks through simple conversational commands. Their applications span across web navigation, mobile app interactions, and desktop automation, offering a transformative user experience that revolutionizes how individuals interact with software. This emerging field is rapidly advancing, with significant progress in both research and industry.

To provide a structured understanding of this trend, this paper presents a comprehensive survey of LLM-brained GUI agents, exploring their historical evolution, core components, and advanced techniques. We address critical research questions such as existing GUI agent frameworks, the collection and utilization of data for training specialized GUI agents, the development of large action models tailored for GUI tasks, and the evaluation metrics and benchmarks necessary to assess their effectiveness. Additionally, we examine emerging applications powered by these agents. Through a detailed analysis, this survey identifies key research gaps and outlines a roadmap for future advancements in the field. By consolidating foundational knowledge and state-of-the-art developments, this work aims to guide both researchers and practitioners in overcoming challenges and unlocking the full potential of LLM-brained GUI agents. We anticipate that this survey will serve both as a practical cookbook for constructing LLM-powered GUI agents, and as a definitive reference for advancing research in this rapidly evolving domain.

URL: https://openreview.net/forum?id=xChvYjvXTp

---

Title: Revisiting Discover-then-Name Concept Bottleneck Models: A Reproducibility Study

Authors: Freek Byrman, Emma Kasteleyn, Bart Kuipers, Daniel Uyterlinde

Abstract: Concept Bottleneck Models (CBMs) (Koh et al., 2020) are a class of interpretable deep learning frameworks that improve transparency by mapping input data into human-understandable concepts. Recent advances, including the Discover-then-Name CBM proposed by Rao et al. (2024), eliminate reliance on external language models by automating concept discovery and naming using a CLIP feature extractor and sparse autoencoder. This study focuses on replicating the key findings reported by Rao et al. (2024). We conclude that the core conceptual ideas are reproducible, but not to the extent presented in the original work. Many representations of active neurons appear to be misaligned with their assigned concepts, indicating a lack of faithfulness of the DN-CBM’s explanations. To address this, we propose a model extension: an enhanced alignment method that we evaluate through a user study. Our extended model provides more interpretable concepts (with statistical significance), at the cost of a slight decrease in accuracy.

URL: https://openreview.net/forum?id=946cT3Jsq5

---

Title: End-to-end Training for Text-to-Image Synthesis using Dual-Text Embeddings

Authors: Yeruru Asrar Ahmed, Anurag Mittal

Abstract: Text-to-Image (T2I) synthesis is a challenging task that requires modeling complex interactions between two modalities ( i.e., text and image). A common framework adopted in recent state-of-the-art approaches to achieving such multimodal interactions is to bootstrap the learning process with pre-trained image-aligned text embeddings trained using contrastive loss. Furthermore, these embeddings are typically trained generically and reused across various synthesis models. In contrast, we explore an approach to learning text embeddings specifically tailored to the T2I synthesis network, trained in an end-to-end fashion. Further, we combine generative and contrastive training and use two embeddings, one optimized to enhance the photo-realism of the generated images, and the other seeking to capture text-to-image alignment. A comprehensive set of experiments on three text-to-image benchmark datasets (Oxford-102, Caltech-UCSD, and MS-COCO) reveal that having two separate embeddings gives better results than using a shared one and that such an approach performs favourably in comparison with methods that use text representations from a pre-trained text encoder trained using a discriminative approach. Finally, we demonstrate that such learned embeddings can be used in other contexts as well, such as text-to-image manipulation.

URL: https://openreview.net/forum?id=gJ1OknHV5e

---

Title: Dynamics of the accelerated t-SNE

Authors: Kyoichi Iwasaki, Hideitsu Hino

Abstract: This paper investigates the dynamics of t-Stochastic Neighbor Embedding (t-SNE), a popular tool for visualizing complex datasets in exploratory data analysis, optimized by the Nesterov’s accelerated gradient method. Building on the foundational work that connects t-SNE with spectral clustering and dynamical systems, we extend the analysis to include accelerated dynamics which is not addressed in the previous work, revealing the emergence of Bessel and modified Bessel functions as a novel aspect of the algorithm’s behavior characterizing the temporal evolution of the accelerated t-SNE. Because the ordinary differential equation corresponding to the optimization process under consideration has a closed-form solution, by performing eigenvalue decomposition of the data’s adjacency matrix as a pre-processing step, we can obtain low-dimensional embeddings at any point in time without performing sequential optimization. This advancement not only enhances the practical utility of t-SNE but also contributes to a deeper understanding of its underlying dynamics.

URL: https://openreview.net/forum?id=dfUebM9asV

---

Title: State-Constrained Offline Reinforcement Learning

Authors: Charles Alexander Hepburn, Yue Jin, Giovanni Montana

Abstract: Traditional offline reinforcement learning (RL) methods predominantly operate in a batch-constrained setting. This confines the algorithms to a specific state-action distribution present in the dataset, reducing the effects of distributional shift but restricting the policy to seen actions. In this paper, we alleviate this limitation by introducing state-constrained offline RL, a novel framework that focuses solely on the dataset’s state distribution. This approach allows the policy to take high-quality out-of-distribution actions that lead to in-distribution states, significantly enhancing learning potential. The proposed setting not only broadens the learning horizon but also improves the ability to combine different trajectories from the dataset effectively, a desirable property inherent in offline RL. Our research is underpinned by theoretical findings that pave the way for subsequent advancements in this area. Additionally, we introduce StaCQ, a deep learning algorithm that achieves state-of-the-art performance on the D4RL benchmark datasets and aligns with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in this domain.

URL: https://openreview.net/forum?id=KcR8ykFlHA

---

Title: Learning in complex action spaces without policy gradients

Authors: Arash Tavakoli, Sina Ghiassian, Nemanja Rakicevic

Abstract: While conventional wisdom holds that policy gradient methods are better suited to complex action spaces than action-value methods, foundational work has shown that the two paradigms are equivalent in small, finite action spaces (O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm but from universal principles that can also be applied to action-value methods, enabling similar functions. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces at a computational cost comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE exhibits strong performance on the DeepMind Control Suite, even when compared to state-of-the-art methods such as DMPO and D4PG.

URL: https://openreview.net/forum?id=nOL9M6D4oM

---

Title: Non asymptotic analysis of Adaptive stochastic gradient algorithms and applications

Authors: Antoine Godichon-Baggioni, Pierre Tarrago

Abstract: In stochastic optimization, a widely used approach for handling large samples sequentially is the stochastic gradient algorithm (SGD). However, a key limitation of SGD is that its step size sequence remains uniform across all gradient directions, which can lead to poor performance in practice, particularly for ill-conditioned problems. To address this issue, adaptive gradient algorithms, such as Adagrad and stochastic Newton methods, have been developed. These algorithms adapt the step size to each gradient direction, providing significant advantages in such challenging settings. This paper focuses on the non-asymptotic analysis of these adaptive gradient algorithms for strongly convex objective functions. The theoretical results are further applied to practical examples, including linear regression and regularized generalized linear models, using both Adagrad and stochastic Newton algorithms.
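
For reference, the per-coordinate adaptivity that separates Adagrad from plain SGD is the textbook update below; this is only the generic rule, not the specific estimators or step-size sequences analyzed in the paper.

    import numpy as np

    def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
        # Accumulate squared gradients per coordinate; directions with large past
        # gradients get smaller steps, which helps on ill-conditioned problems.
        accum = accum + grad ** 2
        theta = theta - lr * grad / (np.sqrt(accum) + eps)
        return theta, accum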

URL: https://openreview.net/forum?id=iyfbGyAkKt

---

Title: Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects

Authors: Wenhao Li, Yudong Xu, Scott Sanner, Elias Boutros Khalil

Abstract: The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused on visual reasoning in the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently popular data-driven approach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT—otherwise a state-of-the-art model for images—fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose ViTARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC. Specifically, we use a pixel-level input representation, design a spatially-aware tokenization scheme, and introduce a novel object-based positional encoding that leverages automatic segmentation, among other enhancements. Our task-specific ViTARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks strictly through supervised learning from input-output grids. This calls attention to the importance of imbuing the powerful (Vision) Transformer with the correct inductive biases for abstract visual reasoning that are critical even when the training data is plentiful and the mapping is noise-free. Hence, ViTARC provides a strong foundation for future research in visual reasoning using transformer-based architectures.

URL: https://openreview.net/forum?id=Al72Fp0rCg

---

Title: Tackling Feature and Sample Heterogeneity in Decentralized Multi-Task Learning: A Sheaf-Theoretic Approach

Authors: Chaouki Ben Issaid, Praneeth Vepakomma, Mehdi Bennis

Abstract: Federated multi-task learning (FMTL) aims to simultaneously learn multiple related tasks across clients without sharing sensitive raw data. However, in the decentralized setting, existing FMTL frameworks are limited in their ability to capture complex task relationships and handle feature and sample heterogeneity across clients. To address these challenges, we introduce a novel sheaf-theoretic approach for FMTL. By representing client relationships using cellular sheaves, our framework can flexibly model interactions between heterogeneous client models. We formulate the sheaf-based FMTL optimization problem using sheaf Laplacian regularization and propose the Sheaf-FMTL algorithm to solve it. We show that the proposed framework provides a unified view encompassing many existing federated learning (FL) and FMTL approaches. Furthermore, we prove that our proposed algorithm, Sheaf-FMTL, achieves a sublinear convergence rate in line with state-of-the-art decentralized FMTL algorithms. Extensive experiments show that although Sheaf-FMTL introduces computational and storage overhead due to the management of interaction maps, it achieves substantial communication savings in terms of transmitted bits when compared to decentralized FMTL baselines. This trade-off makes Sheaf-FMTL especially suitable for cross-silo FL scenarios, where managing model heterogeneity and ensuring communication efficiency are essential, and where clients have adequate computational resources.
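
A minimal sketch of the kind of sheaf-Laplacian penalty described above: each edge (i, j) carries restriction maps that project the two clients' (possibly differently sized) models into a shared comparison space, and disagreement there is penalized. The names and the plain quadratic form are generic illustrations, not the Sheaf-FMTL algorithm.

    import torch

    def sheaf_laplacian_penalty(models, edges, restrictions):
        # Sum over edges of ||P_ij w_i - P_ji w_j||^2, where restrictions[(i, j)]
        # maps client i's parameter vector into the space shared on edge (i, j).
        penalty = torch.zeros(())
        for i, j in edges:
            diff = restrictions[(i, j)] @ models[i] - restrictions[(j, i)] @ models[j]
            penalty = penalty + (diff ** 2).sum()
        return penalty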

URL: https://openreview.net/forum?id=JlPq0LmApB

---

Title: GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity

Authors: Arto Maranjyan, Mher Safaryan, Peter Richtárik

Abstract: We study a class of distributed optimization algorithms that aim to alleviate high communication costs by allowing clients to perform multiple local gradient-type training steps before communication. In a recent breakthrough, Mishchenko et al. (2022) proved that local training, when properly executed, leads to provable communication acceleration, and this holds in the strongly convex regime without relying on any data similarity assumptions. However, their ProxSkip method requires all clients to take the same number of local training steps in each communication round. We propose a redesign of the ProxSkip method, allowing clients with ``less important'' data to get away with fewer local training steps without impacting the overall communication complexity of the method. In particular, we prove that our modified method, GradSkip, converges linearly under the same assumptions and has the same accelerated communication complexity, while the number of local gradient steps can be reduced relative to a local condition number. We further generalize our method by extending the randomness of probabilistic alternations to arbitrary unbiased compression operators and by considering a generic proximable regularizer. This generalization, which we call GradSkip+, recovers several related methods in the literature as special cases. Finally, we present an empirical study on carefully designed toy problems that confirm our theoretical claims.

URL: https://openreview.net/forum?id=6R3fRqFfhn

---

Title: Reconciling Privacy and Explainability in High-Stakes: A Systematic Inquiry

Authors: Supriya Manna, Niladri Sett

Abstract: Deep learning’s preponderance across scientific domains has reshaped high-stakes decision-making, making it essential to follow rigorous operational frameworks that include both Right-to-Privacy (RTP) and Right-to-Explanation (RTE). This paper examines the complexities of combining these two requirements. For RTP, we focus on ‘Differential privacy’ (DP), which is considered the current gold standard for privacy-preserving machine learning due to its strong quantitative guarantee of privacy. For RTE, we focus on post-hoc explainers: they are the go-to option for model auditing as they operate independently of model training. We formally investigate DP models and various commonly-used post-hoc explainers: how to evaluate these explainers subject to RTP, and analyze the intrinsic interactions between DP models and these explainers. Furthermore, our work throws light on how RTP and RTE can be effectively combined in high-stakes applications. Our study concludes by outlining an industrial software pipeline, with the example of a widely used use case, that respects both RTP and RTE requirements.

URL: https://openreview.net/forum?id=DQqdjPcE6g

---

Title: MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

Authors: Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang

Abstract: Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interactions with an environment are prohibited. In this paper, we propose a new framework, namely Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to integrate with various offline MARL methods seamlessly. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.

URL: https://openreview.net/forum?id=gwUOzI4DuV

---

Title: Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation

Authors: Leander Weber, Jim Berend, Moritz Weckbecker, Alexander Binder, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Abstract: Gradient-based optimization has been a cornerstone of machine learning that enabled the vast advances of Artificial Intelligence (AI) development over the past decades. However, this type of optimization requires differentiation, and with recent evidence of the benefits of non-differentiable (e.g. neuromorphic) architectures over classical models w.r.t. efficiency, such constraints can become limiting in the future. We present Layer-wise Feedback Propagation (LFP), a novel training principle for neural network-like predictors that utilizes methods from the domain of explainability to decompose a reward to individual neurons based on their respective contributions. Leveraging these neuron-wise rewards, our method then implements a greedy approach reinforcing helpful parts of the network and weakening harmful ones. While having comparable computational complexity to gradient descent, LFP does not require gradient computation and generates sparse and thereby memory- and energy-efficient parameter updates and models. We establish the convergence of LFP theoretically and empirically, demonstrating its effectiveness on various models and datasets. Via two applications — neural network pruning and the approximation-free training of Spiking Neural Networks (SNNs) — we demonstrate that LFP combines increased efficiency in terms of computation and representation with flexibility w.r.t. choice of model architecture and objective function.

URL: https://openreview.net/forum?id=9oToxYVOSW

---

Title: CXAD: Contrastive Explanations for Anomaly Detection: Algorithms, Complexity Results and Experiments

Authors: Ian Davidson, Nicolás Kennedy, S. S. Ravi

Abstract: Anomaly/Outlier detection (AD/OD) is often used in controversial applications to detect unusual behavior which is then further investigated or policed. This means an explanation of why something was predicted as an anomaly is desirable not only for individuals but also for the general population and policy-makers. However, existing explainable AI (XAI) methods are not well suited for Explainable Anomaly detection (XAD). In particular, most XAI methods provide instance-level explanations, whereas a model/global-level explanation is desirable for a complete understanding of the definition of normality or abnormality used by an AD algorithm. Further, existing XAI methods try to explain an algorithm’s behavior by finding an explanation of why an instance belongs to a category. However, by definition, anomalies/outliers are chosen because they are different from the normal instances. We propose a new style of model agnostic explanation, called contrastive explanation, that is designed specifically for AD algorithms. It addresses the novel challenge of providing a model-agnostic and global-level explanation by finding contrasts between the outlier group of instances and the normal group. We propose three formulations: (i) Contrastive Explanation, (ii) Strongly Contrastive Explanation, and (iii) Multiple Strong Contrastive Explanations. The last formulation is specifically for the case where a given dataset is believed to have many types of anomalies. For the first two formulations, we show the underlying problem is in the computational class P by presenting linear and polynomial time exact algorithms. We show that the last formulation is computationally intractable, and we use an integer linear program for that version to generate experimental results. We demonstrate our work on several data sets such as the CelebA image data set, the HateXplain language data set, and the COMPAS dataset on fairness. These data sets are chosen as their ground truth explanations are clear or well-known.

URL: https://openreview.net/forum?id=Tnwci2kLna

---

Title: Fairness with respect to Stereotype Predictors: Impossibilities and Best Practices

Authors: Inbal Rachel Livni Navon, Omer Reingold, Judy Hanwen Shen

Abstract: As AI systems increasingly influence decision-making from consumer recommendations to educational opportunities, their accountability becomes paramount. This need for oversight has driven extensive research into algorithmic fairness, a body of work that has examined both allocative and representational harms. However, numerous works examining representational harms such as stereotypes encompass many different concepts measured by different criteria, yielding many, potentially conflicting, characterizations of harm. The abundance of measurement approaches makes the mitigation of stereotypes in downstream machine learning models highly challenging. Our work introduces and unifies a broad class of auditors through the framework of \textit{stereotype predictors}. We map notions of fairness with respect to these predictors to existing notions of group fairness. We give guidance, with theoretical foundations, for selecting one or a set of stereotype predictors and provide algorithms for achieving fairness with respect to stereotype predictors under various fairness notions. We demonstrate the effectiveness of our algorithms with different stereotype predictors in two empirical case studies.

URL: https://openreview.net/forum?id=FPJKZDzdsW

---

Title: Exploring and Improving Initialization for Deep Graph Neural Networks: A Signal Propagation Perspective

Authors: Senmiao Wang, Yupeng Chen, Yushun Zhang, Ruoyu Sun, Tian Ding

Abstract: Graph Neural Networks (GNNs) often suffer from performance degradation as the network depth increases. This paper addresses this issue by introducing initialization methods that enhance signal propagation (SP) within GNNs. We propose three key metrics for effective SP in GNNs: forward propagation, backward propagation, and graph embedding variation (GEV). While the first two metrics derive from classical SP theory, the third is specifically designed for GNNs. We theoretically demonstrate that a broad range of commonly used initialization methods for GNNs, which exhibit performance degradation with increasing depth, fail to control these three metrics simultaneously. To deal with this limitation, a direct exploitation of the SP analysis, searching for weight initialization variances that optimize the three metrics, is shown to significantly enhance the SP in deep GCNs. This approach is called Signal Propagation on Graph-guided Initialization (SPoGInit). Our experiments demonstrate that SPoGInit outperforms commonly used initialization methods on various tasks and architectures. Notably, SPoGInit enables performance improvements as GNNs deepen, which represents a significant advancement in addressing depth-related challenges and highlights the validity and effectiveness of the SP analysis framework.

URL: https://openreview.net/forum?id=6Aj0aNXfRy

---

Title: Spaced Scheduling for Large Language Model Training

Authors: Amine El hattami, Nicolas Chapados, Christopher Pal

Abstract: Recent breakthroughs in deep learning have accelerated progress toward increasingly capable large language models (LLMs), even sparking discussions about the path to Artificial General Intelligence (AGI). Yet, current LLM training pipelines continue to depend on heuristics and human-driven empirical analysis to curate data. In practice, more sophisticated data selection methods often incur high costs, exhibit limited adaptability, or do not consistently surpass simple random baselines across various models and datasets. In this work, we propose Spaced Scheduled Training (Sst), a novel adaptive data selection strategy that prioritizes training examples based solely on per-example perplexity computed from the model’s own evolving parameters. By obviating the need for external reference models, Sst customizes data selection to the model’s unique characteristics, including its pre-training data composition, and eliminates biases commonly introduced by these external models. Extensive experiments on seven LLMs (0.5B to 32B parameters) in the instruction-finetuning (IFT) setting show that Sst consistently outperforms representative state-of-the-art selection approaches like Deita and InsTag on the Open LLM Leaderboard. For instance, with Qwen2.5-32B and a 30k examples data budget, Sst achieved a 42.75% Open LLM Leaderboard score, exceeding a leading data-selection baseline (38.56%) and the full-100k dataset baseline (39.58%). We further present a theoretical framework to assess computational overhead of model-based selection methods, showing that Sst remains efficient in practical scenarios, and propose strategies to mitigate the overhead in worst-case scenarios. Our findings underscore the potential of model-informed dynamic data selection, offering an efficient, adaptable, and cost-effective approach. We release our training code, trained models, and data mixes in our public repository.
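
The core signal described above, per-example perplexity under the model's own current parameters, can be computed as in the sketch below. The Hugging Face-style calls and the "keep the highest-perplexity examples" rule are assumptions made for illustration; the actual Sst schedule and selection criterion are defined in the paper.

    import math
    import torch

    def perplexity(model, tokenizer, text, device="cpu"):
        # Per-example perplexity under the model's current (evolving) parameters.
        enc = tokenizer(text, return_tensors="pt").to(device)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        return math.exp(out.loss.item())

    def select_examples(model, tokenizer, texts, budget):
        # Rank examples by current-model perplexity and keep `budget` of them
        # (preferring high perplexity here; the direction is a design choice).
        scored = sorted(((perplexity(model, tokenizer, t), t) for t in texts), reverse=True)
        return [t for _, t in scored[:budget]]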

URL: https://openreview.net/forum?id=p0KTYl2B9T

---

Title: Selective Concept Bottleneck Models Without Predefined Concepts

Authors: Simon Schrodi, Julian Schur, Max Argus, Thomas Brox

Abstract: Concept-based models like Concept Bottleneck Models (CBMs) have garnered significant interest for improving model interpretability by first predicting human-understandable concepts before mapping them to the output classes. Early approaches required costly concept annotations. To alleviate this, recent methods utilized large language models to automatically generate class-specific concept descriptions and learned mappings from a pretrained black-box model’s raw features to these concepts using vision-language models. However, these approaches assume prior knowledge of which concepts the black-box model has learned. In this work, we discover the concepts encoded by the model through unsupervised concept discovery techniques instead. We further leverage a simple input-dependent concept selection mechanism that dynamically retains a sparse set of relevant concepts of each input, enhancing both sparsity and interpretability. Our approach not only improves downstream performance, but also needs significantly fewer concepts for accurate classification. Lastly, we show how large vision-language models can guide the editing of our models' weights to correct model errors.

URL: https://openreview.net/forum?id=PMO30TLI4l

---

Title: SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Authors: Shivanshu Shekhar, Shreyas Singh, Tong Zhang

Abstract: Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) according to human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics.
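
The underlying DPO objective is standard; the sketch below adds a generic entropy bonus to it, since the abstract describes a self-entropy regularizer. How that entropy is estimated for a diffusion model in SEE-DPO is not spelled out here, so the entropy argument and the weight alpha are placeholders.

    import torch.nn.functional as F

    def dpo_with_entropy(logp_w, logp_l, ref_logp_w, ref_logp_l, entropy, beta=0.1, alpha=0.01):
        # Standard DPO loss on (preferred, dispreferred) pairs, minus an entropy bonus
        # that rewards broader exploration and discourages reward hacking.
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        dpo_loss = -F.logsigmoid(margin).mean()
        return dpo_loss - alpha * entropy.mean()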

URL: https://openreview.net/forum?id=xQbRFHfgGL

---

Title: Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning

Authors: Zhihao Liu, Xianliang Yang, Zichuan Liu, Yifan Xia, Wei Jiang, Yuanyu Zhang, Lijuan Li, Guoliang Fan, Lei Song, Jiang Bian

Abstract: Multi-agent reinforcement learning (MARL) is employed to develop autonomous agents that can learn to adopt cooperative or competitive strategies within complex environments. However, the linear increase in the number of agents leads to a combinatorial explosion of the action space, which always results in algorithmic instability, difficulty in convergence, or entrapment in local optima. While researchers have designed a variety of effective algorithms to compress the action space, these methods also introduce new challenges, such as the need for manually designed prior knowledge or reliance on the structure of the problem, which diminishes the applicability of these techniques. In this paper, we introduce Evolutionary action SPAce Reduction with Knowledge (eSpark), an exploration function generation framework driven by large language models (LLMs) to boost exploration and prune unnecessary actions in MARL. Using just a basic prompt that outlines the overall task and setting, eSpark is capable of generating exploration functions in a zero-shot manner, identifying and pruning redundant or irrelevant state-action pairs, and then achieving autonomous improvement from policy feedback. In reinforcement learning tasks involving inventory management and traffic light control encompassing a total of 15 scenarios, eSpark consistently outperforms the combined MARL algorithm in all scenarios, achieving an average performance gain of 34.4% and 9.9% in the two types of tasks respectively. Additionally, eSpark has proven to be capable of managing situations with a large number of agents, securing a 29.7% improvement in scalability challenges that featured over 500 agents. The code can be found at https://github.com/LiuZhihao2022/eSpark.

URL: https://openreview.net/forum?id=T49vPTkIt5

---

Title: [RE] GNNBoundary: Finding Boundaries and Going Beyond Them

Authors: Jan Henrik Bertrand, Lukas Bierling, Ina Klaric, Aron Wezenberg

Abstract: Graph classification models are becoming increasingly popular, while explainability methods face challenges due to the discrete nature of graphs and other factors. However, investigating model decision-making, such as through decision-boundary regions, helps prevent misclassification and improve model robustness. This study aims to reproduce the findings of GNNBoundary: Towards Explaining Graph Neural Networks Through the Lens of Decision Boundaries (Wang & Shen, 2024). Their work supports three main claims: (1) their proposed algorithm can reliably identify adjacent class pairs, (2) GNNBoundary can effectively and consistently generate near-boundary graphs, outperforming the cross-entropy baseline, and (3) the generated near-boundary graphs can be used to accurately assess key properties of the decision boundary: margin, thickness, and complexity. We reproduce the experiments on the same datasets and extend them to two additional real-world datasets. Beyond that, we test different boundary probability ranges and their effect on decision-boundary metrics, develop an additional baseline, and conduct hyperparameter tuning. We confirm the first claim regarding adjacency discovery, as well as the second claim that GNNBoundary outperforms the cross-entropy baseline, with the caveat that it requires intensive hyperparameter tuning to converge. The third claim is only partially supported, as we observe high variance between the reported and obtained results, calling into question the reliability and precision of the boundary statistics.
Code and instructions are available at: https://github.com/jhb300/re_gnnboundary.

URL: https://openreview.net/forum?id=kEUvWFHEsn

---

Title: Return-Aligned Decision Transformer

Authors: Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra

Abstract: Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as return. It is increasingly important to adjust the performance of AI agents to meet human requirements, for example, in applications like video games and education tools. Decision Transformer (DT) optimizes a policy that generates actions conditioned on the target return through supervised learning and includes a mechanism to control the agent's performance using the target return. However, the action generation is hardly influenced by the target return because DT’s self-attention allocates scarce attention scores to the return tokens. In this paper, we propose Return-Aligned Decision Transformer (RADT), designed to more effectively align the actual return with the target return. RADT leverages features extracted by paying attention solely to the return, enabling action generation to consistently depend on the target return. Extensive experiments show that RADT significantly reduces the discrepancies between the actual return and the target return compared to DT-based methods.

URL: https://openreview.net/forum?id=lTt2cTW8h1

---

Title: Unified Preference Optimization: Language Model Alignment Beyond the Preference Frontier

Authors: Anirudhan Badrinath, Prabhat Agarwal, Jiajing Xu

Abstract: For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to easily tune language models to maximize auxiliary, non-preferential objectives according to the LLM designer's preferences (e.g., tuning lexical style or minimizing specific kinds of harmful content). Critically, these designer objectives may not be amply human-labeled or represented in available data, align with user preferences, or even be able to be captured tractably by binary preference pairs. To leverage the simplicity and performance of DPO with the generality of RL, we propose a unified approach. Based on a simple decomposition of preference and auxiliary objectives, we allow for tuning LLMs to optimize user and designer preferences without any additional specialized or preference data, computational cost, stability "tweaks", hyperparameter tuning, or training instability. The proposed method, Unified Preference Optimization, shows the ability to effectively generalize to user preferences and auxiliary objectives, while preserving or surpassing alignment performance on challenging benchmarks across a range of model sizes.

URL: https://openreview.net/forum?id=R7QFlwvnne

---


New submissions
===============


Title: Mixtures of Neural Cellular Automata: A Stochastic Framework for Growth Modelling and Self-Organization

Abstract: Neural Cellular Automata (NCAs) are a promising new approach to model self-organizing processes, with potential applications in life science. However, their deterministic nature limits their ability to capture the stochasticity of real-world biological and physical systems.

We propose the Mixture of Neural Cellular Automata (MNCA), a novel framework incorporating the idea of mixture models into the NCA paradigm. By combining probabilistic rule assignments with intrinsic noise, MNCAs can model diverse local behaviors and reproduce the stochastic dynamics observed in biological processes.

We evaluate the effectiveness of MNCAs in three key domains: (1) synthetic simulations of tissue growth and differentiation, (2) image morphogenesis robustness, and (3) microscopy image segmentation. Results show that MNCAs achieve superior robustness to perturbations, better recapitulate real biological growth patterns, and provide interpretable rule segmentation.

These findings position MNCAs as a promising tool for modeling stochastic dynamical systems and studying self-growth processes.
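
To make the mixture idea concrete, here is a toy sketch in which each cell softly mixes several learned update rules and the update carries additive intrinsic noise. The network sizes, the soft (rather than sampled) rule assignment, and the noise scale are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MixtureNCA(nn.Module):
    """Toy mixture-of-rules NCA: K small update networks plus a gating
    network that assigns rule weights to every cell; illustrative only."""
    def __init__(self, channels=16, K=3, noise_std=0.02):
        super().__init__()
        self.rules = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(K)])
        self.gate = nn.Conv2d(channels, K, 1)   # per-cell rule logits
        self.noise_std = noise_std

    def forward(self, state):
        # state: (B, C, H, W). Soft rule assignment per cell.
        probs = torch.softmax(self.gate(state), dim=1)            # (B, K, H, W)
        updates = torch.stack([r(state) for r in self.rules], 1)  # (B, K, C, H, W)
        update = (probs.unsqueeze(2) * updates).sum(dim=1)        # mix the rules
        noise = self.noise_std * torch.randn_like(state)          # intrinsic noise
        return state + update + noise
```

A hard, stochastic rule assignment (sampling one rule per cell from `probs`) would match the "probabilistic rule assignments" wording more literally; the soft mixture above is simply the easiest differentiable stand-in.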

URL: https://openreview.net/forum?id=GNGSUpfvCn

---

Title: VColRL: Learn to solve the Vertex Coloring Problem using Reinforcement Learning

Abstract: We propose VColRL, a deep reinforcement learning framework for solving the Vertex Coloring Problem (VCP), which aims to color the vertices of a graph using the minimum number of colors such that no two adjacent vertices share the same color. VColRL is based on a novel Markov Decision Process (MDP) formulation, identified through a systematic evaluation of multiple configurations. It employs a reduction-based neural architecture and a reward mechanism designed to minimize the highest-numbered color used from an ordered set. Experiments on synthetic and benchmark graphs show that VColRL consistently outperforms greedy and learning-based methods in terms of color usage, while achieving competitive performance with advanced optimization solvers and search-based baselines. In addition to delivering high-quality solutions, VColRL achieves significantly faster runtimes than the baselines, demonstrating strong scalability and generalization across diverse graphs.

URL: https://openreview.net/forum?id=a9AQRieTne

---

Title: Controlling Statistical, Discretization, and Truncation Errors in Learning Fourier Linear Operators

Abstract: We study the learning-theoretic foundations of operator learning, using the linear layer of the Fourier Neural Operator architecture as a model problem. We first identify three main errors that occur during the learning process: statistical error due to finite sample size, truncation error from the finite-rank approximation of the operator, and discretization error from handling functional data on a finite grid of domain points. We then analyze a Discrete Fourier Transform (DFT) based least squares estimator, establishing both upper and lower bounds on the aforementioned errors.
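
For intuition, a minimal numpy sketch of a DFT-based least-squares estimator for a linear Fourier-multiplier operator: given input/output function pairs sampled on a uniform grid, each retained frequency gets its own one-dimensional least-squares fit. The truncation rank, grid size, and noise handling are arbitrary choices here, not the paper's exact estimator.

```python
import numpy as np

def fit_fourier_multiplier(U, F_out, rank):
    """U, F_out: (n_samples, n_grid) real arrays of input/output functions on a
    uniform grid. Returns estimated multipliers for the `rank` lowest
    frequencies (truncation), fit by per-mode least squares."""
    U_hat = np.fft.rfft(U, axis=1)        # (n_samples, n_freq)
    F_hat = np.fft.rfft(F_out, axis=1)
    m_hat = np.zeros(rank, dtype=complex)
    for k in range(rank):                 # one scalar least-squares problem per mode
        num = np.vdot(U_hat[:, k], F_hat[:, k])        # sum conj(u_k) * f_k
        den = np.vdot(U_hat[:, k], U_hat[:, k]).real   # sum |u_k|^2
        m_hat[k] = num / (den + 1e-12)
    return m_hat

def apply_estimated_operator(u, m_hat, n_grid):
    """Apply the truncated estimated operator to a new input function u."""
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:len(m_hat)] = m_hat * u_hat[:len(m_hat)]
    return np.fft.irfft(out_hat, n=n_grid)
```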

URL: https://openreview.net/forum?id=A2sHNGcjLO

---

Title: PSC: Posterior Sampling-Based Compression

Abstract: Diffusion models have transformed the landscape of image generation and now show remarkable potential for image compression. Most of the recent diffusion-based compression methods require training and are tailored for a specific bit-rate. In this work, we propose Posterior Sampling-based Compression (PSC) -- a zero-shot compression method that leverages a pre-trained diffusion model as its sole neural network component, thus enabling the use of diverse, publicly available models without additional training. Our approach is inspired by transform coding methods, which encode the image in some pre-chosen transform domain. However, PSC constructs a transform that is adaptive to the image. This is done by employing a zero-shot diffusion-based posterior sampler so as to progressively construct the rows of the transform matrix. Each new chunk of rows is chosen to reduce the uncertainty about the image given the quantized measurements collected thus far. Importantly, the same adaptive scheme can be replicated at the decoder, thus avoiding the need to encode the transform itself. We demonstrate that even with basic quantization and entropy coding, PSC's performance is comparable to established training-based methods in terms of rate, distortion, and perceptual quality, all while providing greater flexibility: any desired rate or distortion can be chosen at inference time.

URL: https://openreview.net/forum?id=OsqgU6Jz4t

---

Title: Learning Criticality: Statistical Limits of Predicting Phase Transitions in Random Networks

Abstract: We study the fundamental limits of learning phase transitions in random graph models from observational data. Motivated by applications in infrastructure resilience, epidemics, and complex systems, we ask: when can a machine learning algorithm predict the onset of a critical transition (e.g., percolation, connectivity collapse, synchronization breakdown) purely from sampled system trajectories? We introduce a formal framework that connects the statistical learnability of phase transitions to large deviations, generalization bounds, and graph ensemble parameters. We prove that for certain classes of random graphs (e.g., Erdős–Rényi, configuration models), there exists a universal scaling law that governs the sample complexity required to distinguish subcritical from supercritical regimes. Moreover, we identify regimes where no learning algorithm—regardless of architecture—can outperform random guessing, due to vanishing information gain near the critical point. Our results establish a phase diagram of learnability and provide a theoretical foundation for predictive algorithms in networked stochastic systems near criticality.

URL: https://openreview.net/forum?id=n03lg7zNLX

---

Title: From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

Abstract: In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters—\textit{Actuality} and \textit{Finesse}—into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework's effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.

URL: https://openreview.net/forum?id=lJ7lEFFVXV

---

Title: Combining Machine Learning Defenses without Conflicts

Abstract: Machine learning (ML) models require protection against various risks to security, privacy, and fairness. Real-life ML models need simultaneous protection against multiple risks, necessitating combining multiple defenses effectively, without incurring a significant drop in the effectiveness of the constituent defenses. We present a systematization of existing work based on how defenses are combined and how they interact. We then identify unexplored combinations, and evaluate combination techniques to identify their limitations. Using these insights, we present Def\Con, a combination technique which is (a) accurate (correctly identifies whether a combination is effective or not), (b) scalable (allows combining multiple defenses), (c) non-invasive (allows combining existing defenses without modification), and (d) general (is applicable to different types of defenses). We show that Def\Con achieves 90% accuracy on eight combinations from prior work, and 86% on 30 unexplored combinations evaluated empirically.

URL: https://openreview.net/forum?id=C7FgsjfFRC

---

Title: Discrete Audio Tokens: More Than a Survey!

Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area.

URL: https://openreview.net/forum?id=eqNchtvc6v

---

Title: PrivShap: A Finer-granularity Network Linearization Method for Private Inference

Abstract: Private inference applies cryptographic techniques such as homomorphic encryption, garbled circuits, and secret sharing to preserve the privacy of both parties in a client-server setting during inference. It is often hindered by high communication overheads, especially at non-linear activation layers such as ReLU. Hence, ReLU pruning has been widely recognized as an efficient way to accelerate private inference. Existing approaches to ReLU pruning typically rely on coarse hypotheses (e.g., that the importance of ReLU layers is inversely correlated with that of linear layers, or that shallow activation layers are universally less important) to assign per-layer ReLU budgets while preserving inference accuracy. However, these assumptions are based on limited empirical evidence and can fail to generalize to diverse model architectures. In this work, we introduce a finer-granularity ReLU budget assignment approach that assesses the layer-wise importance of ReLU with the Shapley value.

To address the computational burden of exact Shapley value calculation, we propose a tree-trimming algorithm for fast estimation. We provide both theoretical guarantees and empirical validation of our method. Our extensive experiments show that we achieve better efficiency and accuracy than the state of the art across diverse model architectures, activation functions, and datasets. Specifically, we need $\sim$$2.5\times$ fewer ReLU operations to achieve similar inference accuracy, and gain up to a $\sim$$8.13\%$ increase in inference accuracy with similar ReLU budgets.
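
For readers unfamiliar with Shapley-value attribution, here is a hedged sketch of how layer-wise ReLU importance could be estimated with a plain permutation-based Monte Carlo approximation. The paper's tree-trimming estimator is more sophisticated, and `evaluate_accuracy` (the value function that keeps only a subset of ReLUs active) is a placeholder.

```python
import random

def shapley_layer_importance(layers, evaluate_accuracy, n_perms=100):
    """Monte Carlo Shapley values for a set of ReLU layers.

    layers: list of layer identifiers.
    evaluate_accuracy(active_layers): accuracy of the model when only the
        ReLUs in `active_layers` are kept (others replaced by identity).
    """
    phi = {l: 0.0 for l in layers}
    for _ in range(n_perms):
        perm = layers[:]
        random.shuffle(perm)
        active, prev = set(), evaluate_accuracy(set())
        for layer in perm:
            active.add(layer)
            curr = evaluate_accuracy(active)
            phi[layer] += (curr - prev) / n_perms   # marginal contribution
            prev = curr
    return phi
```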

URL: https://openreview.net/forum?id=7TliYmJr2m

---

Title: Understanding the learned look-ahead behavior of chess neural networks

Abstract: We investigate the look-ahead capabilities of chess-playing neural networks, specifically focusing on the Leela Chess Zero policy network. We build on the work of Jenner et al. (2024) by analyzing the model's ability to consider future moves and alternative sequences beyond the immediate next move. Our findings reveal that the network's look-ahead behavior is highly context-dependent, varying significantly based on the specific chess position. We demonstrate that the model can process information about board states up to seven moves ahead, utilizing similar internal mechanisms across different future time steps. Additionally, we provide evidence that the network considers multiple possible move sequences rather than focusing on a single line of play. These results offer new insights into the emergence of sophisticated look-ahead capabilities in neural networks trained on strategic tasks, contributing to our understanding of AI reasoning in complex domains. Our work also showcases the effectiveness of interpretability techniques in uncovering cognitive-like processes in artificial intelligence systems.

URL: https://openreview.net/forum?id=np4Bg2zIxL

---

Title: Double Machine Learning Based Structure Identification from Temporal Data

Abstract: Learning the causes of time-series data is a fundamental task in many applications, spanning finance, the earth sciences, and biomedicine. Common approaches for this task are based on vector auto-regression, and they do not take into account unknown confounding between potential causes. However, in settings with many potential causes and noisy data, these approaches may be substantially biased. Furthermore, potential causes may be correlated in practical applications or even contain cycles. To address these challenges, we propose a new double machine learning based method for structure identification from temporal data (DR-SIT). We provide theoretical guarantees, showing that our method asymptotically recovers the true underlying causal structure. Our analysis extends to cases where the potential causes have cycles, and they may even be confounded. We further perform extensive experiments to showcase the superior performance of our method. Code: https://anonymous.4open.science/r/TMLR_submission_DR_SIT-6B46/
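
For readers unfamiliar with the double machine learning template the method builds on, a minimal cross-fitted partialling-out sketch is shown below (scikit-learn). The nuisance regressors, fold count, and final residual-on-residual step are the textbook recipe, not DR-SIT itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_effect(X, d, y, n_splits=2):
    """Cross-fitted double ML estimate of the effect of treatment d on outcome y,
    controlling for confounders X. X, d, y are numpy arrays."""
    res_y = np.zeros_like(y, dtype=float)
    res_d = np.zeros_like(d, dtype=float)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(X):
        # Nuisance models: predict y and d from X, evaluated out of fold.
        res_y[test] = y[test] - RandomForestRegressor().fit(X[train], y[train]).predict(X[test])
        res_d[test] = d[test] - RandomForestRegressor().fit(X[train], d[train]).predict(X[test])
    # Final stage: regress outcome residuals on treatment residuals.
    return float(np.dot(res_d, res_y) / np.dot(res_d, res_d))
```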

URL: https://openreview.net/forum?id=4iHAoFVM2K

---

Title: Hard-Negative Prototype-Based Regularization for Few-Shot Class-Incremental Learning

Abstract: Few-shot class-incremental learning (FSCIL)---involving abundant base training data followed by novel classes with limited labeled samples---poses challenges such as catastrophic forgetting and overfitting, leading to significant performance degradation across incremental sessions. As a remedy, recent work focuses on minimizing the interference of embeddings between base and incremental classes. However, previous studies have not explicitly considered variation in discriminative difficulty across samples and classes, leaving room for improvement: we observe that hard-negative (i.e., difficult to discriminate from the label) samples and classes significantly affect FSCIL performance, whereas easy ones have little impact. To this end, we propose a hard-negative prototype-based regularization approach that enhances discrimination between similar classes by imposing a penalty margin between each sample and its most similar class prototypes based on cosine similarity. To select hard-negative prototypes, we explore two distinct mining strategies: dynamic selection that leverages the model's decision boundary, and static selection that utilizes a pre-defined class-wise similarity matrix derived from external sources such as pre-trained models. We evaluate our approach on three widely used benchmarks, miniImageNet, CIFAR100, and CUB200, achieving state-of-the-art performance on each. Comprehensive analyses demonstrate that our proposed method enhances intra-class cohesion and inter-class separability of embeddings, both of which are crucial for FSCIL to better accommodate novel classes. The code will be made publicly available upon publication.
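
A minimal sketch of the kind of cosine-similarity margin penalty described above: each embedding is pushed away from its most similar non-target ("hard-negative") class prototype. The margin value, the hardest-negative mining rule, and the mean reduction are illustrative choices, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def hard_negative_prototype_penalty(embeddings, labels, prototypes, margin=0.1):
    """embeddings: (B, D), prototypes: (C, D), labels: (B,) long tensor.
    Penalizes cases where similarity to the hardest negative prototype comes
    within `margin` of the similarity to the true-class prototype."""
    z = F.normalize(embeddings, dim=1)
    p = F.normalize(prototypes, dim=1)
    sims = z @ p.t()                                    # (B, C) cosine similarities
    pos = sims.gather(1, labels.view(-1, 1)).squeeze(1)
    sims_neg = sims.scatter(1, labels.view(-1, 1), float('-inf'))
    hard_neg = sims_neg.max(dim=1).values               # most similar wrong class
    return F.relu(hard_neg - pos + margin).mean()
```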

URL: https://openreview.net/forum?id=xKn7O0vDaR

---

Title: AutoGenDA: Automated Generative Data Augmentation for Imbalanced Classifications

Abstract: Data augmentation is an approach to increasing the training dataset size for deep learning using synthetic data. Recent advancements in image generative models have unleashed the potential of synthesizing high-quality images for data augmentation. However, real-life datasets commonly follow an imbalanced class distribution, where some classes have fewer samples than others. Image generation models may, therefore, struggle to synthesize diverse images for less common classes whose samples lack richness and diversity. To address this, we introduce an automated generative data augmentation method, AutoGenDA, to extract and transfer label-invariant changes across data classes through image captions and text-guided generative models. We also propose an automated search strategy to optimize the data augmentation process for each data class, leading to better generalization. Our experiments demonstrate the effectiveness of AutoGenDA in various object classification datasets. We improve the standard data augmentation baselines by up to 4.9\% on Pascal VOC, Caltech101, MS-COCO, and LVIS under multiple imbalanced classification settings.

URL: https://openreview.net/forum?id=tqj95Map3Q

---

Title: Curvature Diversity-Driven Deformation and Domain Alignment for Point Cloud

Abstract: Unsupervised Domain Adaptation is crucial for point cloud learning due to geometric variations across different generation methods and sensors. To tackle this challenge, we propose \textbf{Curvature Diversity-Driven Nuclear-Norm Wasserstein Domain Alignment (CDND)}. We first introduce a Curvature Diversity-driven Deformation Reconstruction (CurvRec) task, enabling the model to extract salient features from semantically rich regions of a given point cloud. We then propose a theoretical framework for Deformation-based Nuclear-norm Wasserstein Discrepancy (D-NWD), extending the Nuclear-norm Wasserstein Discrepancy to original and deformed samples. Our theoretical analysis demonstrates that D-NWD is effective for any deformation method. Empirical experiment results show that our CDND achieves state-of-the-art performance by a noticeable margin over existing approaches.

URL: https://openreview.net/forum?id=ePXWnH7rGk

---

Title: An Architecture Built for Federated Learning: Addressing Data Heterogeneity through Adaptive Normalization-Free Feature Recalibration

Abstract: Federated learning is a decentralized collaborative training paradigm preserving stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets degrades system performance. To address this issue, we propose \textbf{Adaptive Normalization-free Feature Recalibration (ANFR)}, the first architecture-level approach to combat heterogeneous data in FL. ANFR combines weight standardization, which avoids mismatched client statistics and inconsistent averaging and thereby ensures robustness under heterogeneity, with channel attention, which produces learnable scaling factors for feature maps and suppresses client-specific inconsistencies caused by heterogeneity. We demonstrate that this improves class selectivity and the distribution of channel attention weights, while working with any aggregation method, supporting both global and personalized FL, and adding minimal overhead. ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. Extensive experiments show ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions. Code is provided at \url{https://anonymous.4open.science/r/anfr_tmlr-280D}.
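
In isolation, the weight-standardization ingredient mentioned above looks roughly like the following standard formulation; the channel-attention component and the way ANFR combines the two are not shown.

```python
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with weight standardization: each filter's weights are normalized
    to zero mean and unit variance before the convolution, removing any
    dependence on (potentially mismatched) batch statistics."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```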

URL: https://openreview.net/forum?id=GtdYFLsblb

---

Title: Structured and Interpretable Learning via Diophantine-Elliptic Neural Networks

Abstract: We introduce Diophantine-Elliptic Curve Neural Networks (DEC-NNs), a novel class of architectures in which parameters are not unconstrained real numbers but integer-valued solutions to a fixed elliptic Diophantine equation. This constraint embeds each weight and bias into an algebraically structured arithmetic variety, yielding neural models that are interpretable, sparse, and geometrically robust by design. Our formulation enforces this structure through a projection-based training loop, ensuring consistency across updates without sacrificing predictive performance. We establish theoretical guarantees on convergence, symbolic expressivity, and generalization bounds rooted in number theory. Empirically, DEC-NNs demonstrate high accuracy and resilience under adversarial noise on both synthetic and real-world datasets including MNIST and UCI Breast Cancer. In domains such as scientific modeling, symbolic regression, and medical diagnostics, where transparency and auditability are essential, DEC-NNs offer a principled alternative to conventional networks, aligning learning with discrete symbolic structure rather than post hoc interpretability.

URL: https://openreview.net/forum?id=IBJECPzkGx

---

Title: Adaptive and Robust Watermark for Generative Tabular Data

Abstract: Recent developments in generative models have demonstrated their ability to create high-quality synthetic data. However, the pervasiveness of synthetic content online also brings forth growing concerns that it can be used for malicious purposes. To ensure the authenticity of the data, watermarking techniques have recently emerged as a promising solution due to their strong statistical guarantees. In this paper, we propose a flexible and robust watermarking mechanism for generative tabular data. Specifically, a data provider with knowledge of the downstream tasks can partition the feature space into pairs of (key, value) columns. Within each pair, the data provider first uses elements in the key column to generate a randomized set of ``green'' intervals, then encourages elements of the value column to fall in one of these ``green'' intervals. We show theoretically and empirically that the watermarked datasets (i) have negligible impact on data quality and downstream utility, (ii) can be efficiently detected, (iii) are robust against multiple attacks commonly observed in data science, and (iv) maintain strong security against adversaries attempting to learn the underlying watermark scheme.
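
A toy numpy sketch of the (key, value) "green interval" idea as described: the key element seeds a pseudorandom choice of green sub-intervals, the paired value is nudged into the nearest one, and detection counts how often values land in their green intervals (a one-sided z-test against the 50% chance baseline). Interval granularity, the seeding scheme, and the nudging rule are illustrative assumptions.

```python
import numpy as np

def green_mask(key_value, n_bins=64, frac=0.5, lo=0.0, hi=1.0):
    """Pseudorandomly mark `frac` of the bins on [lo, hi] as green, seeded by
    the paired key element."""
    rng = np.random.default_rng(abs(hash(round(float(key_value), 6))) % 2**32)
    mask = np.zeros(n_bins, dtype=bool)
    mask[rng.choice(n_bins, int(frac * n_bins), replace=False)] = True
    edges = np.linspace(lo, hi, n_bins + 1)
    return mask, edges

def watermark_value(key_value, value, **kw):
    """Nudge `value` into the nearest green interval for its key."""
    mask, edges = green_mask(key_value, **kw)
    idx = np.clip(np.searchsorted(edges, value) - 1, 0, len(mask) - 1)
    if mask[idx]:
        return value                                   # already green
    centers = (edges[:-1] + edges[1:]) / 2
    green_centers = centers[mask]
    return green_centers[np.argmin(np.abs(green_centers - value))]

def detect(keys, values, **kw):
    """z-score for 'values fall in green intervals more often than chance'."""
    hits = 0
    for k, v in zip(keys, values):
        mask, edges = green_mask(k, **kw)
        idx = np.clip(np.searchsorted(edges, v) - 1, 0, len(mask) - 1)
        hits += bool(mask[idx])
    n, p = len(values), 0.5
    return (hits - n * p) / np.sqrt(n * p * (1 - p))
```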

URL: https://openreview.net/forum?id=yYfNqpywUi

---

Title: BCQ: Block Clustered Quantization for 4-bit (W4A4) LLM Inference

Abstract: Post-training quantization (PTQ) is a promising approach to reducing the storage and computational requirements of large language models (LLMs) without additional training cost. Recent PTQ studies have primarily focused on quantizing only weights to sub-$8$-bits while maintaining activations at $8$-bits or higher. Accurate sub-8-bit quantization for both weights and activations without relying on quantization-aware training remains a significant challenge. We propose a novel quantization method called block clustered quantization (BCQ) wherein each operand tensor is decomposed into blocks (a block is a group of contiguous scalars), blocks are clustered based on their statistics, and a dedicated optimal quantization codebook is designed for each cluster. As a specific embodiment of this approach, we propose a PTQ algorithm called Locally-Optimal BCQ (LO-BCQ) that iterates between the steps of block clustering and codebook design to greedily minimize the quantization mean squared error. When weight and activation scalars are encoded to W4A4 format (with $0.5$-bits of overhead for storing scaling factors and codebook selectors), we advance the current state-of-the-art by demonstrating $<1$\% loss in inference accuracy across several LLMs and downstream tasks.
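
A rough illustration of the block-clustered quantization recipe outlined above, using scikit-learn k-means both to cluster blocks by their statistics and to design a 16-entry (4-bit) codebook per cluster. The block size, clustering features, and codebook design are simplifications and not the LO-BCQ algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def bcq_quantize(tensor, block=16, n_clusters=8, codebook_size=16):
    """Toy block-clustered quantization: split a flat tensor into blocks,
    cluster blocks by (mean, std), then fit a scalar codebook per cluster.
    Assumes tensor.size is divisible by `block`."""
    x = tensor.reshape(-1, block)
    stats = np.stack([x.mean(1), x.std(1)], axis=1)
    cluster_id = KMeans(n_clusters, n_init=10, random_state=0).fit_predict(stats)
    x_q = np.empty_like(x)
    for c in range(n_clusters):
        vals = x[cluster_id == c].reshape(-1, 1)
        if len(vals) == 0:
            continue
        k = min(codebook_size, len(np.unique(vals)))
        km = KMeans(k, n_init=10, random_state=0).fit(vals)
        codebook = km.cluster_centers_.ravel()            # per-cluster codebook
        x_q[cluster_id == c] = codebook[km.labels_].reshape(-1, block)
    return x_q.reshape(tensor.shape), cluster_id
```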

URL: https://openreview.net/forum?id=loWISTqGwW

---

Title: Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Abstract: Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before final verification, thus utilizing idle time and enhancing parallelism between edge and cloud. Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. These results demonstrate the potential of our framework for real-time LLM and VLM applications on resource-constrained edge devices.

URL: https://openreview.net/forum?id=PTIUjARnbc

---

Title: Sparse-Input Neural Network using Group Concave Regularization

Abstract: Simultaneous feature selection and non-linear function estimation are challenging, especially in high-dimensional settings where the number of variables exceeds the available sample size in modeling. In this article, we investigate the problem of feature selection in neural networks. Although the group LASSO has been utilized to select variables for learning with neural networks, it tends to select unimportant variables into the model to compensate for its over-shrinkage. To overcome this limitation, we propose a framework of sparse-input neural networks using group concave regularization for feature selection in both low-dimensional and high-dimensional settings. The main idea is to apply a proper concave penalty to the $l_2$ norm of weights from all outgoing connections of each input node, and thus obtain a neural net that only uses a small subset of the original variables. In addition, we develop an effective algorithm based on backward path-wise optimization to yield stable solution paths, in order to tackle the challenge of complex optimization landscapes. Our extensive simulation studies and real data examples demonstrate satisfactory finite-sample performances of the proposed estimator, in feature selection and prediction for modeling continuous, binary, and time-to-event outcomes.
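
To make the penalty concrete, a brief sketch of a group concave penalty (here the minimax concave penalty, MCP) applied to the $l_2$ norm of each input variable's outgoing first-layer weights. The lambda/gamma values are placeholders, and this fragment is independent of the paper's backward path-wise optimization algorithm.

```python
import torch

def mcp(t, lam, gamma):
    """Minimax concave penalty (MCP) applied elementwise to nonnegative t."""
    return torch.where(t <= gamma * lam,
                       lam * t - t ** 2 / (2 * gamma),
                       torch.full_like(t, 0.5 * gamma * lam ** 2))

def group_mcp_input_penalty(first_layer_weight, lam=0.05, gamma=3.0):
    """first_layer_weight: (hidden_dim, n_inputs). Penalize the l2 norm of each
    input variable's outgoing weights so unimportant inputs are zeroed out
    without the over-shrinkage of the group LASSO."""
    group_norms = first_layer_weight.norm(dim=0)   # one norm per input feature
    return mcp(group_norms, lam, gamma).sum()
```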

URL: https://openreview.net/forum?id=m9UsLHZYeX

---

Title: Expressivity of Parametrized Distributions over DAGs for Causal Discovery

Abstract: Bayesian approaches for causal discovery can in principle quantify uncertainty in the prediction of the underlying causal structure, typically modeled by a directed acyclic graph (DAG). Various semi-implicit models for parametrized distributions over DAGs have been proposed, but their limitations have not been studied thoroughly. In this work, we focus on the expressiveness of parametrized distributions over DAGs in the context of causal structure learning and show several limitations of candidate models in a theoretical analysis and validate them in experiments. To overcome them, we propose mixture models of distributions over DAGs.

URL: https://openreview.net/forum?id=UsJ0H6VJRl

---

Title: Circuit Explained: How Does a Transformer Perform Compositional Generalization

Abstract: Compositional generalization—the systematic combination of known components into novel structures—is fundamental to flexible human cognition, but the mechanisms in neural networks remain poorly understood in both machine learning and cognitive science. Lake & Baroni (2023) showed that a compact encoder-decoder transformer can achieve simple forms of compositional generalization in a sequence arithmetic task. In this work, we identify and mechanistically interpret the circuit responsible for compositional generalization in such a model. Using causal ablations, we isolate the circuit and further show that this understanding enables precise activation edits to steer the model’s outputs predictably. We found that the circuit leverages the disentangled representation of position and token so that functional transformations can be applied to positions in a token-independent manner. Our findings advance the understanding of how compositionality can emerge in neural networks and offer testable hypotheses for similar mechanisms in other neural architectures and compositional tasks. Code will be published after double-blind review.

URL: https://openreview.net/forum?id=eLSS89MPt6

---

Title: Source-Free Controlled Adaptation of Teachers for Continual Test-Time Adaptation

Abstract: In many real-world scenarios, encountering continual shifts in domain during inference is very common. Consequently, continual test-time adaptation (CTTA) techniques leveraging a teacher-student framework have gained prominence, allowing models to adapt continuously even after deployment. In such a framework, a weight-averaged mean teacher is used to produce pseudo-labels from test data for self-training. The mean teacher gets updated as an exponential moving average of the student parameters using a high value of momentum that is kept fixed even if different distributions of test data are encountered. To combat the resulting drift of the model, we propose a novel controlled teacher adaptation methodology that dynamically sets a proper momentum value depending on the quality of the incoming data. Additionally, we estimate class prototypes from the source pretrained model to help align the target data as they come in. Importantly, our method does not require access to source data or its statistics at any stage of the pipeline, making it truly source-free. We perform extensive experiments on benchmark datasets to demonstrate that our approach outperforms different state-of-the-art adaptation frameworks, many of which require access to source data.
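
A schematic of the controlled teacher update described above: the EMA momentum is modulated by a per-batch quality score, here the normalized prediction entropy of the teacher's pseudo-labels, which is an illustrative proxy and not necessarily the paper's criterion.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, teacher_probs, m_low=0.99, m_high=0.999):
    """EMA update of teacher parameters with data-dependent momentum: noisy
    batches (high pseudo-label entropy) push momentum towards m_high so the
    teacher adapts more cautiously."""
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-8).log()).sum(1).mean()
    max_entropy = torch.log(torch.tensor(float(teacher_probs.shape[1])))
    quality = 1.0 - (entropy / max_entropy).item()   # 1 = confident, 0 = noisy
    m = m_high - quality * (m_high - m_low)          # confident -> faster updates
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)
```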

URL: https://openreview.net/forum?id=nymWIrCIhF

---

Title: Perturbative partial moment matching and gradient-flow adaptive importance sampling transformations for Bayesian leave one out cross-validation

Abstract: Importance sampling (IS) allows one to approximate leave one out (LOO) cross-validation for a Bayesian model, without refitting, by inverting the Bayesian update equation to subtract a given data point from a model posterior. For each data point, one computes expectations under the corresponding LOO posterior by weighted averaging over the full data posterior. This task sometimes requires weight stabilization in the form of adapting the posterior distribution via transformation. So long as one is successful in finding a suitable transformation, one avoids refitting. To this end, we motivate the use of bijective perturbative transformations of the form $T(\boldsymbol{\theta})=\boldsymbol{\theta} + h Q(\boldsymbol{\theta})$, for a small parameter $h$, and introduce two classes of such transformations: 1) partial moment matching and 2) gradient flow evolution. The former extends prior literature on moment matching under the recognition that adaptation for LOO is a small perturbation on the full data posterior. The latter class of methods defines transformations by relaxing various statistical objectives: in our case, the variance of the IS estimator and the KL divergence between the transformed distribution and the statistics of the LOO fold. Being model-specific, the gradient flow transformations require evaluating Jacobian determinants. While these quantities are generally readily available through auto-differentiation, we derive closed-form expressions in the case of logistic regression and shallow ReLU-activated neural networks. We test the methodology on an $n \ll p$ dataset that is known to produce unstable LOO IS weights.
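
For orientation, the basic (un-stabilized) importance sampling identity that LOO approximations start from, as a numpy sketch: LOO expectations are weighted averages over full-data posterior draws, with weights proportional to $1/p(y_i \mid \theta)$. The perturbative transformations proposed in the paper reshape the draws before these weights are formed; that step is not shown here.

```python
import numpy as np

def loo_expectation(g_draws, loglik_i_draws):
    """Approximate E[g(theta) | data without point i] from S full-posterior draws.

    g_draws: (S,) values of g at each draw.
    loglik_i_draws: (S,) values of log p(y_i | theta_s).
    Raw IS weights satisfy w_s proportional to 1 / p(y_i | theta_s)."""
    logw = -loglik_i_draws
    logw -= logw.max()                  # stabilize before exponentiating
    w = np.exp(logw)
    return np.sum(w * g_draws) / np.sum(w)
```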

URL: https://openreview.net/forum?id=eDgBAjiLbn

---

Title: Capsule Network Projectors are Equivariant and Invariant Learners

Abstract: Learning invariant representations has been the longstanding approach to self-supervised learning. Recently, however, progress has been made in preserving equivariant properties in representations, yet these approaches rely on highly prescribed architectures. In this work, we propose an invariant-equivariant self-supervised architecture that employs Capsule Networks (CapsNets), which have been shown to capture equivariance with respect to novel viewpoints. We demonstrate that the use of CapsNets in equivariant self-supervised architectures achieves improved downstream performance on equivariant tasks with higher efficiency and fewer network parameters. To accommodate the architectural changes of CapsNets, we introduce a new objective function based on entropy minimisation. This approach, which we name CapsIE (Capsule Invariant Equivariant Network), achieves state-of-the-art performance on the equivariant rotation tasks of the 3DIEBench dataset compared to prior equivariant SSL methods, while performing competitively against supervised counterparts. Our results demonstrate the ability of CapsNets to learn complex and generalised representations for large-scale, multi-task datasets compared to previous CapsNet benchmarks.

URL: https://openreview.net/forum?id=7owCO3qskH

---

Title: MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation

Abstract: Diffusion models are successful at synthesizing high-quality videos but are limited to generating short clips (e.g., 2-10 seconds). Synthesizing sustained footage (e.g., over minutes) still remains an open research question. In this paper, we propose MALT Diffusion (using Memory-Augmented Latent Transformers), a new diffusion model specialized for long video generation. MALT Diffusion (or just MALT) handles long videos by subdividing them into short segments and performing segment-level autoregressive generation. To achieve this, we first propose recurrent attention layers that encode multiple segments into a compact memory latent vector; by maintaining this memory vector over time, MALT is able to condition on it and continuously generate new footage based on a long temporal context. We also present several training techniques that enable the model to generate frames over a long horizon with consistent quality and minimal degradation. We validate the effectiveness of MALT through experiments on long video benchmarks. We first perform an extensive analysis of MALT's long-context understanding capability and stability on popular long video benchmarks. For example, MALT achieves an FVD score of 220.4 on 128-frame video generation on UCF-101, outperforming the previous state-of-the-art of 648.4. Finally, we explore MALT's capabilities in a text-to-video generation setting and show that it can produce long videos competitive with recent techniques for long text-to-video generation.

URL: https://openreview.net/forum?id=YuUwwVoWES

---

Title: HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction

Abstract: How can we predict future interaction trajectories of human hands in a scene given high-level colloquial task specifications in the form of natural language? In this paper, we extend the classic hand trajectory prediction task to several tasks involving explicit and implicit language queries. Our proposed tasks require extensive understanding of human daily activities and reasoning abilities about what is happening next given cues from the current scene. We also develop new benchmarks to evaluate the two proposed tasks, Vanilla Hand Prediction (VHP) and Reasoning-Based Hand Prediction (RBHP). We enable solving these tasks by integrating the high-level world knowledge and reasoning capabilities of Vision-Language Models (VLMs) with the auto-regressive nature of low-level ego-centric hand trajectories. Our model, HandsOnVLM, is a novel VLM that can generate textual responses and produce future hand trajectories through natural-language conversations. Our experiments show that HandsOnVLM outperforms existing task-specific methods and other VLM baselines on the proposed tasks, and demonstrates its ability to effectively utilize world knowledge for reasoning about low-level human hand trajectories based on the provided context.

URL: https://openreview.net/forum?id=ehhMFjKnWm

---

Title: Hypergraph clustering using Ricci curvature: an edge transport perspective

Abstract: In this paper, we introduce a novel method for extending Ricci flow to hypergraphs by defining probability measures on the edges and transporting them on the line expansion. This approach yields a new weighting on the edges, which proves particularly effective for community detection. We extensively compare this method with a similar notion of Ricci flow defined on the clique expansion, demonstrating its enhanced sensitivity to the hypergraph structure, especially in the presence of large hyperedges. The two methods are complementary and together form a powerful and highly interpretable framework for community detection in hypergraphs.

URL: https://openreview.net/forum?id=HMROU8MXqV

---

Title: An Information-Theoretic Lower Bound on the Generalization Error of Autoencoders

Abstract: Quantifying the limitations of classical neural network architectures is a critically underexplored area of machine learning research. Deriving lower bounds on the optimal performance of these architectures can facilitate improved neural architecture search and overfitting detection. We present an information-theoretic lower bound on the generalization mean squared error of autoencoders with sigmoid activation functions. Through the Estimation Error and Differential Entropy (EEDE) inequality for continuous random vectors, we derive this lower bound, which provides a new perspective on the inherent limitations and capabilities of autoencoders. Our analysis extends to the examination of how this lower bound is influenced by various architectural features and data distribution characteristics. This study enriches our theoretical understanding of autoencoders and has substantial practical implications for their design, optimization, and application in the field of deep learning.

URL: https://openreview.net/forum?id=0esF0M467w

---

Title: Mind the Metrics: Patterns for Telemetry-Aware In-IDE AI Application Development using Model Context Protocol (MCP)

Abstract: Modern AI-driven development environments are destined to evolve into observability-first platforms by integrating real-time telemetry and feedback loops directly into the developer workflow. This paper introduces telemetry-aware IDEs driven by Model Context Protocol (MCP), a new paradigm for building software. We articulate how an IDE (integrated development environment), enhanced with an MCP client/server, can unify prompt engineering with live metrics, traces, and evaluations to enable iterative optimization and robust monitoring. We present a progression of design patterns: from local large language model (LLM) coding with immediate metrics feedback, to continuous integration (CI) pipelines that automatically refine prompts, to autonomous agents that monitor and adapt prompts based on telemetry. Instead of focusing on any single optimizer, we emphasize a general architecture (exemplified by the Model Context Protocol and illustrated through a reference MCP server implementation) that consolidates prompt and agent telemetry for the future integration of various optimization techniques. We survey related work in prompt engineering, AI observability, and optimization (e.g., Prompts-as-Programs, DSPy's MIPRO, Microsoft's PromptWizard) to position this approach within the emerging AI developer experience. This theoretical systems perspective highlights new design affordances and workflows for AI-first software development, laying a foundation for future benchmarking and empirical studies on optimization in these environments.

URL: https://openreview.net/forum?id=Qhc8xDRZuH

---

Title: Similarity-Distance-Magnitude Universal Verification

Abstract: We address the neural network robustness problem by adding Similarity (i.e., correctly predicted depth-matches into training)-awareness and Distance-to-training-distribution-awareness to the existing output Magnitude (i.e., decision-boundary)-awareness of the softmax function. The resulting SDM activation function provides strong signals of the relative epistemic (reducible) predictive uncertainty. We use this novel behavior to further address the complementary HCI problem of mapping the output to human-interpretable summary statistics over relevant partitions of a held-out calibration set. Estimates of prediction-conditional uncertainty are obtained via a parsimonious learned transform over the class-conditional empirical CDFs of the output of a final-layer SDM activation function. For decision-making and as an intrinsic model check, estimates of class-conditional accuracy are obtained by further partitioning the high-probability regions of this calibrated output into class-conditional, region-specific CDFs. The uncertainty estimates from SDM calibration are remarkably robust to test-time distribution shifts and out-of-distribution inputs; incorporate awareness of the effective sample size; provide estimates of uncertainty from the learning and data splitting processes; and are well-suited for selective classification and conditional branching for additional test-time compute based on the predictive uncertainty, as for selective LLM generation, routing, and composition over multiple models and retrieval. Finally, we construct SDM networks, LLMs with uncertainty-aware verification and interpretability-by-exemplar as intrinsic properties. We provide open-source software implementing these results.

URL: https://openreview.net/forum?id=HP4xCrmthO

---

Title: Improving Visual Commonsense in Language Models via Multiple Image Generation

Abstract: Commonsense reasoning is fundamentally based on multimodal knowledge. However, large language models (LLMs), trained using textual data only, are limited in their ability to incorporate essential visual information. In contrast, Visual Language Models (VLMs), which excel at visually-oriented tasks, often fail at non-visual tasks such as textual commonsense reasoning. This divergence highlights a critical challenge: the integration of robust visual understanding with foundational text-based reasoning. To this end, we introduce a method aimed at enhancing LLMs' visual commonsense while maintaining textual modeling and commonsense reasoning performance. Specifically, our method is based on test-time compute scaling. We generate multiple images based on the input text prompt and integrate these into the model's decision-making process by mixing their prediction probabilities. To facilitate multimodal grounded language modeling, we employ a late-fusion layer that combines the projected visual features with the output of a pre-trained LLM conditioned on text only. This late-fusion layer enables predictions based on comprehensive image-text knowledge as well as text only when required. We evaluate our approach using several visual commonsense reasoning tasks together with traditional NLP tasks, including commonsense reasoning and reading comprehension. Our experimental results demonstrate significant superiority over existing baselines. When applied to recent state-of-the-art LLMs (e.g., Llama3), we observe improvements not only in visual commonsense but also in NLP benchmarks.
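
A minimal sketch of the probability-mixing step as described: generate several images from the prompt, score the prediction under the image-conditioned model for each, and average these with the text-only prediction. The uniform averaging over images and the mixing weight `alpha` are placeholders, not the paper's exact fusion rule.

```python
import numpy as np

def mixed_prediction(text_only_probs, image_conditioned_probs, alpha=0.5):
    """text_only_probs: (V,) distribution from the text-only LLM head.
    image_conditioned_probs: (K, V) distributions, one per generated image.
    Returns a mixture that falls back towards the text-only prediction."""
    visual = np.mean(image_conditioned_probs, axis=0)   # average over K images
    mixed = alpha * text_only_probs + (1.0 - alpha) * visual
    return mixed / mixed.sum()
```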

URL: https://openreview.net/forum?id=qlmhXjhR0P

---

Title: A Pattern Language for Machine Learning Tasks

Abstract: We formalise the essential data of objective functions as equality constraints on composites of learners. We call these constraints ``tasks'', and we investigate the idealised view that such tasks determine model behaviours. We develop a flowchart-like graphical mathematics for tasks that allows us to: offer a unified perspective on approaches in machine learning across domains; design and optimise desired behaviours model-agnostically; and import insights from theoretical computer science into practical machine learning.
As preliminary experimental validation of our theoretical framework, we exhibit and implement a novel ``manipulation'' task that minimally edits input data to have a desired attribute. Our model-agnostic approach achieves this end-to-end, and without the need for custom architectures, adversarial training, random sampling, or interventions on the data, hence enabling capable, small-scale, and training-stable models.

URL: https://openreview.net/forum?id=IOianP0UHC

---

Title: Complementarity: Toward Better Metrics and Optimizing Data Efficiency in LLMs

Abstract: Generalist Large Language Models (LLMs) are trained with an immense amount of data from across different domains. However, not all data contribute to model performance equally, and prioritizing data quality over quantity can improve domain-specific accuracy. We suggest that quality is not merely an independent feature of datasets, but rather a matter of how data samples interfere with or complement one another. Furthermore, existing evaluation metrics are computationally expensive, require extensive design, are mathematically ill-defined, and are generally poorly suited to LLMs. Toward improving general performance while greatly reducing the amount of training data, and quantifying how data contribute to downstream tasks vis-a-vis their connection with other data, we introduce a new metric, Complementarity. We first establish a strong correlation between Complementarity and domain-specific task performance. Complementarity shows increased robustness over traditional metrics and is significantly less expensive computationally. Furthermore, without the reliance on heavy instruction-tuning and text scraping, Complementarity is easier to apply and applicable to a wide variety of potential target domains. Most interestingly, we demonstrate that Complementarity taken over a training validation set provides a better predictor of generalization to future test sets than directly measuring performance on a test validation set. With this, we introduce an algorithm that carefully selects the data to fine-tune upon, leading to a high-performing fine-tuned generalist model while using only a fraction of the data, and without requiring data from the test domain. Overall, Complementarity may serve as a key metric in future analysis of data utility and design of datasets, and may prove invaluable in achieving the goal of a truly generalist model.

URL: https://openreview.net/forum?id=feAbrMXGMh

---
