Weekly TMLR digest for Feb 16, 2025

TMLR

Feb 16, 2025, 12:00:14 AM

to tmlr-annou...@googlegroups.com

New certifications
==================

Expert Certification: Necessary and Sufficient Watermark for Large Language Models

Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, Makoto Yamada

https://openreview.net/forum?id=FcyHZ6Q4k0

---

Featured Certification: What is the Relationship between Tensor Factorizations and Circuits (and How Can We Exploit it)?

Lorenzo Loconte, Antonio Mari, Gennaro Gala, Robert Peharz, Cassio de Campos, Erik Quaeghebeur, Gennaro Vessio, Antonio Vergari

https://openreview.net/forum?id=Y7dRmpGiHj

---

Expert Certification: Personalized Negative Reservoir for Incremental Learning in Recommender Systems

Antonios Valkanas, Yuening Wang, Yingxue Zhang, Mark Coates

https://openreview.net/forum?id=jrUUk5Fskm

---

Survey Certification: Class Incremental Learning from First Principles: A Review

Neil Ashtekar, Jingxi Zhu, Vasant G Honavar

https://openreview.net/forum?id=sZdtTJInUg

---

Survey Certification: Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions

Anna Hedström, Philine Lou Bommer, Thomas F Burns, Sebastian Lapuschkin, Wojciech Samek, Marina MC Höhne

https://openreview.net/forum?id=ukLxqA8zXj

---

Expert Certification: The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers

Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag

https://openreview.net/forum?id=hGaWq5Buj7

---

Accepted papers
===============

Title: A Strong Baseline for Molecular Few-Shot Learning

Authors: Philippe Formont, Hugo Jeannin, Pablo Piantanida, Ismail Ben Ayed

Abstract: Few-shot learning has recently attracted significant interest in drug discovery, with a recent, fast-growing literature mostly involving convoluted meta-learning strategies. We revisit the more straightforward fine-tuning approach for molecular data, and propose a regularized quadratic-probe loss based on the Mahalanobis distance. We design a dedicated block-coordinate descent optimizer, which avoids the degenerate solutions of our loss. Interestingly, our simple fine-tuning approach achieves highly competitive performance in comparison to state-of-the-art methods, while being applicable to black-box settings and removing the need for specific episodic pre-training strategies. Furthermore, we introduce a new benchmark to assess the robustness of the competing methods to domain shifts. In this setting, our fine-tuning baseline obtains consistently better results than meta-learning methods.

URL: https://openreview.net/forum?id=JQ0agisXny
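
The Mahalanobis distance mentioned in the abstract can be illustrated with a minimal sketch. It assumes a probe that scores each class by the negative squared Mahalanobis distance to a class mean; the names `mahalanobis_sq` and `quadratic_probe_predict`, the class means, and the precision matrices are hypothetical, and the paper's regularized loss and block-coordinate optimizer are not reproduced here.

```python
from typing import List, Sequence


def mahalanobis_sq(x: Sequence[float],
                   mean: Sequence[float],
                   precision: Sequence[Sequence[float]]) -> float:
    """Squared Mahalanobis distance (x - mean)^T P (x - mean).

    `precision` is the inverse covariance matrix P, given as nested lists.
    """
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Matrix-vector product P @ diff.
    proj = [sum(p * d for p, d in zip(row, diff)) for row in precision]
    return sum(d * p for d, p in zip(diff, proj))


def quadratic_probe_predict(x: Sequence[float],
                            means: List[Sequence[float]],
                            precisions: List[Sequence[Sequence[float]]]) -> int:
    """Return the index of the class with the smallest squared distance.

    Assumption: one mean and one precision matrix per class.
    """
    scores = [-mahalanobis_sq(x, m, p) for m, p in zip(means, precisions)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage with two hypothetical 2-D classes sharing an identity precision.
identity = [[1.0, 0.0], [0.0, 1.0]]
predicted = quadratic_probe_predict([0.9, 0.1],
                                    [[0.0, 0.0], [1.0, 0.0]],
                                    [identity, identity])
```

With an identity precision matrix the score reduces to the squared Euclidean distance, so the probe behaves like a nearest-prototype classifier in that special case.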

---

Title: PROXI: Challenging the GNNs for Link Prediction

Authors: Astrit Tola, Jack Myrick, Baris Coskunuzer

Abstract: Over the past decade, Graph Neural Networks (GNNs) have transformed graph representation learning. In the widely adopted message-passing GNN framework, nodes refine their representations by aggregating information from neighboring nodes iteratively. While GNNs excel in various domains, recent theoretical studies have raised concerns about their capabilities. GNNs aim to address various graph-related tasks by utilizing such node representations; however, this one-size-fits-all approach proves suboptimal for diverse tasks.

Motivated by these observations, we conduct empirical tests to compare the performance of current GNN models with more conventional and direct methods in link prediction tasks. Introducing our model, PROXI, which leverages proximity information of node pairs in both graph and attribute spaces, we find that standard machine learning (ML) models perform competitively, even outperforming cutting-edge GNN models when applied to these proximity metrics derived from node neighborhoods and attributes. This holds true across both homophilic and heterophilic networks, as well as small and large benchmark datasets, including those from the Open Graph Benchmark (OGB). Moreover, we show that augmenting traditional GNNs with PROXI significantly boosts their link prediction performance. Our empirical findings corroborate the previously mentioned theoretical observations and imply that there exists ample room for enhancement in current GNN models to reach their potential.

URL: https://openreview.net/forum?id=u9EHndbiVw
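
To make "proximity information of node pairs in both graph and attribute spaces" concrete, here is a small sketch of pairwise features that a standard ML classifier could consume. The specific features (common neighbors, Jaccard coefficient, cosine similarity) and all function names are illustrative assumptions; the abstract does not list the exact features PROXI uses.

```python
import math
from typing import Dict, List, Sequence, Set


def common_neighbors(adj: Dict[int, Set[int]], u: int, v: int) -> int:
    """Graph-space proximity: number of shared neighbors of u and v."""
    return len(adj[u] & adj[v])


def jaccard(adj: Dict[int, Set[int]], u: int, v: int) -> float:
    """Graph-space proximity: shared neighbors over the neighborhood union."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Attribute-space proximity: cosine similarity of two attribute vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def pair_features(adj: Dict[int, Set[int]],
                  attrs: Dict[int, Sequence[float]],
                  u: int, v: int) -> List[float]:
    """Feature vector for the candidate link (u, v)."""
    return [float(common_neighbors(adj, u, v)),
            jaccard(adj, u, v),
            cosine(attrs[u], attrs[v])]


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy_adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
toy_attrs = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.0, 1.0]}
features_01 = pair_features(toy_adj, toy_attrs, 0, 1)
```

Feature vectors like `features_01` would then be passed, together with link/non-link labels, to any off-the-shelf classifier.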

---

Title: On Space Folds of ReLU Neural Networks

Authors: Michal Lewandowski, Hamid Eghbalzadeh, Bernhard Heinzl, Raphael Pisoni, Bernhard A. Moser

Abstract: Recent findings suggest that the consecutive layers of ReLU neural networks can be understood geometrically as space folding transformations of the input space, revealing patterns of self-similarity. In this paper, we present the first quantitative analysis of this space folding phenomenon in ReLU neural networks. Our approach focuses on examining how straight paths in the Euclidean input space are mapped to their counterparts in the Hamming activation space. In this process, the convexity of straight lines is generally lost, giving rise to non-convex folding behavior. To quantify this effect, we introduce a novel measure based on range metrics, similar to those used in the study of random walks, and provide a proof of the equivalence of convexity notions between the input and activation spaces. Furthermore, we provide empirical analysis on a geometrical analysis benchmark (CantorNet) as well as an image classification benchmark (MNIST). Our work advances the understanding of the activation space in ReLU neural networks by leveraging the phenomenon of geometric folding, providing valuable insights on how these models process input information.

URL: https://openreview.net/forum?id=RfFqBXLDQk
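
The mapping from a straight input path to the Hamming activation space can be sketched as follows. This only illustrates the mapping and the Hamming distance to the starting pattern; it does not reproduce the paper's range-based folding measure. The tiny network and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, biases)


def activation_pattern(x: Sequence[float], layers: Sequence[Layer]) -> Tuple[int, ...]:
    """Binary pattern with one bit per ReLU unit: 1 if its pre-activation is > 0."""
    pattern: List[int] = []
    h = list(x)
    for weights, biases in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + b
               for row, b in zip(weights, biases)]
        pattern.extend(1 if z > 0 else 0 for z in pre)
        h = [max(z, 0.0) for z in pre]  # ReLU output feeds the next layer
    return tuple(pattern)


def hamming(p: Sequence[int], q: Sequence[int]) -> int:
    """Number of positions where two activation patterns differ."""
    return sum(a != b for a, b in zip(p, q))


def path_patterns(x0: Sequence[float], x1: Sequence[float],
                  layers: Sequence[Layer], steps: int) -> List[Tuple[int, ...]]:
    """Activation patterns at steps + 1 evenly spaced points from x0 to x1."""
    points = [[a + (k / steps) * (b - a) for a, b in zip(x0, x1)]
              for k in range(steps + 1)]
    return [activation_pattern(p, layers) for p in points]


# Hypothetical one-layer network with identity weights and zero biases.
toy_layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
patterns = path_patterns([-1.0, -1.0], [1.0, 1.0], toy_layers, steps=2)
distances = [hamming(patterns[0], p) for p in patterns]
```

Inspecting how `distances` evolves along the path is one simple way to look at the image of a straight line in the activation space.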

---

Title: Improving Consistency in Large Language Models through Chain of Guidance

Authors: Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar

Abstract: Consistency is a fundamental dimension of trustworthiness in Large Language Models (LLMs). For humans to be able to trust LLM-based applications, their outputs should be consistent when prompted with inputs that carry the same meaning or intent. Despite this need, there is no known mechanism to control and guide LLMs to be more consistent at inference time. In this paper, we introduce a novel alignment strategy to maximize semantic consistency in LLM outputs. Our proposal is based on Chain of Guidance (CoG), a multistep prompting technique that generates highly consistent outputs from LLMs. For closed-book question-answering (Q&A) tasks, when compared to direct prompting, the outputs generated using CoG show improved consistency. While other approaches like template-based responses and majority voting may offer alternative paths to consistency, our work focuses on exploring the potential of guided prompting. We use synthetic data sets comprised of consistent input-output pairs to fine-tune LLMs to produce consistent and correct outputs. Our fine-tuned models are more than twice as consistent compared to base models and show strong generalization capabilities by producing consistent outputs over datasets not used in the fine-tuning process. Code is available at https://github.com/vijilAI/chain_of_guidance.

URL: https://openreview.net/forum?id=asiBW1bB9b

---

Title: Evaluation of Best-of-N Sampling Strategies for Language Model Alignment

Authors: Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Kenshi Abe, Kaito Ariu, Mitsuki Sakamoto, Eiji Uchibe

Abstract: Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) with human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Since the reward model is an imperfect proxy for the true objective, an excessive focus on optimizing its value can lead to a compromise of its performance on the true objective. Previous work proposes Regularized BoN sampling (RBoN), a BoN sampling with regularization to the objective, and shows empirically that it outperforms BoN sampling by mitigating reward hacking (Jinnai et al., 2024). However, Jinnai et al. (2024) introduce RBoN based on a heuristic and do not analyze why such a regularization strategy improves the performance of BoN sampling. The aim of this study is to analyze the effect of BoN sampling on regularization strategies. Using the regularization strategies corresponds to robust optimization, which maximizes the worst case over a set of possible perturbations in the proxy reward. Although the theoretical guarantees are not directly applicable to RBoN, RBoN corresponds to a practical implementation. This paper proposes an extension of the RBoN framework, called Stochastic RBoN sampling (SRBoN), which is a theoretically guaranteed approach to worst-case RBoN in proxy reward. We then perform an empirical evaluation using the AlpacaFarm and Anthropic’s hh-rlhf datasets to evaluate which factors of the regularization strategies contribute to the improvement of the true proxy reward. In addition, we also propose another simple RBoN method, the Sentence Length Regularized BoN, which performs better in the experiments than the previous methods.

URL: https://openreview.net/forum?id=H4S4ETc8c9
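
As a rough illustration of the selection rules discussed in this abstract, the sketch below contrasts plain BoN with a generic regularized variant that subtracts a weighted penalty from the proxy reward. The toy proxy reward, the repetition penalty, and the weight `beta` are hypothetical stand-ins; they are not the regularizers studied in the paper.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str],
              proxy_reward: Callable[[str], float]) -> str:
    """Plain BoN: return the candidate with the highest proxy reward."""
    return max(candidates, key=proxy_reward)


def regularized_best_of_n(candidates: Sequence[str],
                          proxy_reward: Callable[[str], float],
                          penalty: Callable[[str], float],
                          beta: float) -> str:
    """Generic regularized BoN: maximize proxy_reward(y) - beta * penalty(y)."""
    return max(candidates, key=lambda y: proxy_reward(y) - beta * penalty(y))


def toy_proxy_reward(text: str) -> float:
    """Hypothetical imperfect proxy: counts occurrences of the word 'great'."""
    return float(text.split().count("great"))


def toy_repetition_penalty(text: str) -> float:
    """Hypothetical penalty: number of repeated words in the response."""
    words = text.split()
    return float(len(words) - len(set(words)))


samples = ["good answer", "great great great", "a great answer"]
plain_choice = best_of_n(samples, toy_proxy_reward)
regularized_choice = regularized_best_of_n(
    samples, toy_proxy_reward, toy_repetition_penalty, beta=2.0)
```

In this toy setting plain BoN picks the degenerate response that exploits the proxy, while the penalized objective picks a different candidate; with `beta=0` the two rules coincide.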

---

Title: Unsupervised Discovery of Object-Centric Neural Fields

Authors: Rundong Luo, Hong-Xing Yu, Jiajun Wu

Abstract: We study inferring 3D object-centric scene representations from a single image. While recent methods have shown potential in unsupervised 3D object discovery, they are limited in generalizing to unseen spatial configurations. This limitation stems from the lack of translation invariance in their 3D object representations. Previous 3D object discovery methods entangle objects’ intrinsic attributes like shape and appearance with their 3D locations. This entanglement hinders learning generalizable 3D object representations. To tackle this bottleneck, we propose the unsupervised discovery of Object-Centric neural Fields (uOCF), which integrates translation invariance into the object representation. To allow learning object-centric representations from limited real-world images, we further introduce an object prior learning method that transfers object-centric prior knowledge from a synthetic dataset. To evaluate our approach, we collect four new datasets, including two real kitchen environments. Extensive experiments show that our approach significantly improves generalization and sample efficiency and enables unsupervised 3D object discovery in real scenes. Notably, uOCF demonstrates zero-shot generalization to unseen objects from a single real image. We attach our code in the supplementary file, and the project page is available at https://red-fairy.github.io/uOCF/

URL: https://openreview.net/forum?id=ScEv13W2f1

---

Title: Understanding LLM Embeddings for Regression

Authors: Eric Tang, Bangding Yang, Xingyou Song

Abstract: With the rise of large language models (LLMs) for flexibly processing information as strings, a natural application is regression, specifically by preprocessing string representations into LLM embeddings as downstream features for metric prediction. In this paper, we provide one of the first comprehensive investigations into embedding-based regression and demonstrate that LLM embeddings as features can be better for high-dimensional regression tasks than using traditional feature engineering. This regression performance can be explained in part by LLM embeddings over numeric data inherently preserving Lipschitz continuity over the feature space. Furthermore, we quantify the contribution of different model effects, most notably model size and language understanding, which we find surprisingly do not always improve regression performance.

URL: https://openreview.net/forum?id=Wt6Iz5XNIO

---

Title: APR-CNN: Convolutional Neural Networks for the Adaptive Particle Representation of Large Microscopy Images

Authors: Joel Jonsson, Bevan Leslie Cheeseman, Ivo Sbalzarini

Abstract: We present APR-CNN, a novel class of convolutional neural networks designed for efficient and scalable three-dimensional microscopy image analysis. APR-CNNs operate natively on a sparse, multi-resolution image representation known as the Adaptive Particle Representation (APR). This significantly reduces memory and compute requirements compared to traditional pixel-based CNNs. We introduce APR-native layers for convolution, pooling, and upsampling, along with hybrid architectures that combine APR and pixel layers to balance accuracy and computational efficiency. We show in benchmarks that APR-CNNs achieve comparable segmentation accuracy to pixel-based CNNs while drastically reducing memory usage and inference time. We further showcase the potential of APR-CNNs in large-scale volumetric image analysis, reducing inference times from weeks to days. This opens up new avenues for applying deep learning to large, high-resolution, three-dimensional biomedical datasets with constrained computational resources.

URL: https://openreview.net/forum?id=5qKI2dkrjL

---

Title: On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates

Authors: Stefano Bruno, Ying Zhang, Dongyoung Lim, Omer Deniz Akyildiz, Sotirios Sabanis

Abstract: We provide full theoretical guarantees for the convergence behaviour of diffusion-based generative models under the assumption of strongly log-concave data distributions, while our approximating class of functions used for score estimation is made of Lipschitz continuous functions, avoiding any Lipschitzness assumption on the score function. We demonstrate the power of our approach via a motivating example, sampling from a Gaussian distribution with unknown mean. In this case, explicit estimates are provided for the associated optimization problem, i.e. score approximation, while these are combined with the corresponding sampling estimates. As a result, we obtain the best known upper bound estimates in terms of key quantities of interest, such as the dimension and rates of convergence, for the Wasserstein-2 distance between the data distribution (Gaussian with unknown mean) and our sampling algorithm.
Beyond the motivating example and in order to allow for the use of a diverse range of stochastic optimizers, we present our results using an $L^2$-accurate score estimation assumption, which crucially is formed under an expectation with respect to the stochastic optimizer and our novel auxiliary process that uses only known information. This approach yields the best known convergence rate for our sampling algorithm.

URL: https://openreview.net/forum?id=zjxKrb4ehr

---

Title: Relax and penalize: a new bilevel approach to mixed-binary hyperparameter optimization

Authors: Sara Venturini, Marianna De Santis, Jordan Patracone, Martin Schmidt, Francesco Rinaldi, Saverio Salzo

Abstract: In recent years, bilevel approaches have become very popular to efficiently estimate high-dimensional hyperparameters of machine learning models. However, to date, binary parameters are handled by continuous relaxation and rounding strategies, which could lead to inconsistent solutions. In this context, we tackle the challenging optimization of mixed-binary hyperparameters by resorting to an equivalent continuous bilevel reformulation based on an appropriate penalty term. We propose an algorithmic framework that, under suitable assumptions, is guaranteed to provide mixed-binary solutions. Moreover, the generality of the method allows existing continuous bilevel solvers to be used safely within the proposed framework. We evaluate the performance of our approach for two specific machine learning problems, i.e., the estimation of the group-sparsity structure in regression problems and the data distillation problem. The reported results show that our method is competitive with state-of-the-art approaches based on relaxation and rounding.

URL: https://openreview.net/forum?id=A1R1cQ93Cb

---

Title: Necessary and Sufficient Watermark for Large Language Models

Authors: Yuki Takezawa, Ryoma Sato, Han Bao, Kenta Niwa, Makoto Yamada

Abstract: Large language models (LLMs) can now generate texts that are indistinguishable from those written by humans. Such remarkable performance of LLMs increases their risk of being used for malicious purposes. Thus, it is necessary to develop methods for distinguishing texts written by LLMs from those written by humans. Watermarking is one of the most powerful methods for achieving this. Although existing methods have successfully detected texts generated by LLMs, they inevitably degrade the text quality. In this study, we propose the Necessary and Sufficient Watermark (NS-Watermark) for inserting watermarks into generated texts with minimum text quality degradation. More specifically, we derive minimum constraints required to be imposed on the generated texts to distinguish whether LLMs or humans write the texts, and we formulate the NS-Watermark as a constrained optimization problem. Through the experiments, we demonstrate that the NS-Watermark can generate more natural texts than existing watermarking methods and distinguish more accurately between texts written by LLMs and those written by humans. Especially in machine translation tasks, the NS-Watermark can outperform the existing watermarking method by up to 30 BLEU points.

URL: https://openreview.net/forum?id=FcyHZ6Q4k0

---

Title: Maxwell's Demon at Work: Efficient Pruning by Leveraging Saturation of Neurons

Authors: Simon Dufort-Labbé, Pierluca D'Oro, Evgenii Nikishin, Irina Rish, Pierre-Luc Bacon, Razvan Pascanu, Aristide Baratin

Abstract: When training neural networks, dying neurons (units becoming inactive or saturated) are traditionally seen as harmful. This paper sheds new light on this phenomenon. By exploring the impact of various hyperparameter configurations on dying neurons during training, we gather insights on how to improve upon sparse training approaches to pruning. We introduce Demon Pruning (DemP), a method that controls the proliferation of dead neurons through a combination of noise injection on active units and a one-cycled schedule regularization strategy, dynamically leading to network sparsity. Experiments on CIFAR-10 and ImageNet datasets demonstrate that DemP outperforms existing dense-to-sparse structured pruning methods, achieving better accuracy-sparsity tradeoffs while speeding up training by up to 3.56$\times$. These findings provide a novel perspective on dying neurons as a resource for efficient model compression and optimization.

URL: https://openreview.net/forum?id=nmBleuFzaN

---

Title: Personalized Federated Learning of Probabilistic Models: A PAC-Bayesian Approach

Authors: Mahrokh Ghoddousi Boroujeni, Andreas Krause, Giancarlo Ferrari-Trecate

Abstract: Federated Learning (FL) aims to infer a shared model from private and decentralized data stored by multiple clients. Personalized FL (PFL) enhances the model’s fit for each client by adapting the global model to the clients. A significant level of personalization is required for highly heterogeneous clients but can be challenging to achieve, especially when clients’ datasets are small. We introduce PAC-PFL for PFL of probabilistic models. PAC-PFL infers a shared hyper-posterior and treats each client’s posterior inference as the personalization step. Unlike previous PFL algorithms, PAC-PFL does not regularize all personalized models towards a single shared model, thereby greatly enhancing its personalization flexibility. By establishing and minimizing a PAC-Bayesian generalization bound on the average true loss of clients, PAC-PFL effectively mitigates overfitting even in data-poor scenarios. Additionally, PAC-PFL provides generalization bounds for new clients joining later. PAC-PFL achieves accurate and well-calibrated predictions, as supported by our experiments.

URL: https://openreview.net/forum?id=ZMliWjMCor

---

Title: Wasserstein Coreset via Sinkhorn Loss

Authors: Haoyun Yin, Yixuan Qiu, Xiao Wang

Abstract: Coreset selection, a technique for compressing large datasets while preserving performance, is crucial for modern machine learning. This paper presents a novel method for generating high-quality Wasserstein coresets using the Sinkhorn loss, a powerful tool with computational advantages. However, existing approaches suffer from numerical instability in Sinkhorn's algorithm. We address this by proposing stable algorithms for the computation and differentiation of the Sinkhorn optimization problem, including an analytical formula for the derivative of the Sinkhorn loss and a rigorous stability analysis of our method. Extensive experiments demonstrate that our approach significantly outperforms existing methods in terms of sample selection quality, computational efficiency, and achieving a smaller Wasserstein distance.

URL: https://openreview.net/forum?id=DrMCDS88IL
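
For readers unfamiliar with the instability mentioned above, the sketch below shows a commonly used log-domain form of Sinkhorn's iterations for entropic optimal transport. It is a generic stabilization and should not be read as the algorithm proposed in the paper. The function name, the regularization strength `eps`, and the iteration count are assumptions, and all weights in `a` and `b` must be strictly positive.

```python
import math
from typing import List, Sequence


def _logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) via subtraction of the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sinkhorn_log_domain(a: Sequence[float], b: Sequence[float],
                        cost: Sequence[Sequence[float]],
                        eps: float, iters: int = 200) -> List[List[float]]:
    """Entropic OT plan between weights a and b, using dual potentials f, g.

    The plan has the form P[i][j] = a[i] * b[j] * exp((f[i] + g[j] - C[i][j]) / eps).
    Working with f and g in the log domain avoids exponentiating -C / eps directly.
    """
    n, m = len(a), len(b)
    log_a = [math.log(w) for w in a]
    log_b = [math.log(w) for w in b]
    f = [0.0] * n
    g = [0.0] * m
    for _ in range(iters):
        # Update f so that every row of the plan sums to a[i].
        f = [-eps * _logsumexp([log_b[j] + (g[j] - cost[i][j]) / eps
                                for j in range(m)]) for i in range(n)]
        # Update g so that every column of the plan sums to b[j].
        g = [-eps * _logsumexp([log_a[i] + (f[i] - cost[i][j]) / eps
                                for i in range(n)]) for j in range(m)]
    return [[math.exp(log_a[i] + log_b[j] + (f[i] + g[j] - cost[i][j]) / eps)
             for j in range(m)] for i in range(n)]


# Toy problem: two source points, two target points, 0/1 cost.
plan = sinkhorn_log_domain([0.3, 0.7], [0.6, 0.4],
                           [[0.0, 1.0], [1.0, 0.0]], eps=0.5, iters=500)
```

The naive formulation multiplies by the kernel `exp(-C / eps)`, which underflows for small `eps`; the log-sum-exp form above is one standard way around that.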

---

Title: Diffusion on Graph: Augmentation of Graph Structure for Node Classification

Authors: Yancheng Wang, Changyu Liu, Yingzhen Yang

Abstract: Graph diffusion models have recently been proposed to synthesize entire graphs, such as molecule graphs. Although existing methods have shown great performance in generating entire graphs for graph-level learning tasks, no graph diffusion models have been developed to generate synthetic graph structures, that is, synthetic nodes and associated edges within a given graph, for node-level learning tasks. Inspired by the research in the computer vision literature using synthetic data for enhanced performance, we propose Diffusion on Graph (DoG), which generates synthetic graph structures to boost the performance of GNNs. The synthetic graph structures generated by DoG are combined with the original graph to form an augmented graph for the training of node-level learning tasks, such as node classification and graph contrastive learning (GCL). To improve the efficiency of the generation process, a Bi-Level Neighbor Map Decoder (BLND) is introduced in DoG. To mitigate the adverse effect of the noise introduced by the synthetic graph structures, a low-rank regularization method is proposed for the training of graph neural networks (GNNs) on the augmented graphs. Extensive experiments on various graph datasets for semi-supervised node classification and graph contrastive learning have been conducted to demonstrate the effectiveness of DoG with low-rank regularization. The code of DoG is available at https://github.com/Statistical-Deep-Learning/DoG.

URL: https://openreview.net/forum?id=tzW948kU6x

---

Title: Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design

Authors: Lev Telyatnikov, Maria Sofia Bucarelli, Guillermo Bernardez, Olga Zaghen, Simone Scardapane, Pietro Lio

Abstract: Most of the current learning methodologies and benchmarking datasets in the hypergraph realm are obtained by lifting procedures from their graph analogs, which overshadows specific characteristics of hypergraphs. This paper attempts to confront some pending questions in that regard: Q1 Can the concept of homophily play a crucial role in Hypergraph Neural Networks (HNNs)? Q2 How do models that employ unique characteristics of higher-order networks perform compared to lifted models? Q3 Do well-established hypergraph datasets provide a meaningful benchmark for HNNs? To address them, we first introduce a novel conceptualization of homophily in higher-order networks based on a Message Passing (MP) scheme, unifying both the analytical examination and the modeling of higher-order networks. Further, we investigate some natural strategies for processing higher-order structures within HNNs (such as keeping hyperedge-dependent node representations or performing node/hyperedge stochastic samplings), leading us to the most general MP formulation to date, MultiSet. Finally, we conduct an extensive set of experiments that contextualize our proposals.

URL: https://openreview.net/forum?id=8rxtL0kZnX

---

Title: Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

Authors: Hikaru Umeda, Hideaki Iiduka

Abstract: The performance of mini-batch stochastic gradient descent (SGD) strongly depends on setting the batch size and learning rate to minimize the empirical loss in training the deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results supporting the analyses, showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).

URL: https://openreview.net/forum?id=sbmp55k6iE
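
The four scheduler types named in the abstract can be sketched as a single function returning a (batch size, learning rate) pair per epoch. The abstract only describes the schedulers qualitatively, so every functional form and constant below (stage length, growth and decay factors, number of warm-up stages) is a hypothetical choice made for illustration.

```python
from typing import Tuple


def schedule(kind: str, epoch: int, *, batch0: int = 8, lr0: float = 0.1,
             period: int = 10, batch_growth: int = 2, lr_growth: float = 1.5,
             lr_decay: float = 0.5, warmup_stages: int = 2) -> Tuple[int, float]:
    """Return (batch_size, learning_rate) for the given epoch.

    kind is one of "i", "ii", "iii", "iv", mirroring the four schedulers.
    Values change once every `period` epochs (assumed step-wise schedule).
    """
    stage = epoch // period
    growing_batch = batch0 * batch_growth ** stage
    if kind == "i":    # constant batch size, decaying learning rate
        return batch0, lr0 * lr_decay ** stage
    if kind == "ii":   # increasing batch size, decaying learning rate
        return growing_batch, lr0 * lr_decay ** stage
    if kind == "iii":  # increasing batch size, increasing learning rate
        return growing_batch, lr0 * lr_growth ** stage
    if kind == "iv":   # increasing batch size, warm-up then decaying learning rate
        if stage <= warmup_stages:
            return growing_batch, lr0 * lr_growth ** stage
        peak = lr0 * lr_growth ** warmup_stages
        return growing_batch, peak * lr_decay ** (stage - warmup_stages)
    raise ValueError(f"unknown scheduler kind: {kind!r}")


# Example: inspect scheduler (iii) at the start of the first four stages.
trace = [schedule("iii", e) for e in (0, 10, 20, 30)]
```

A training loop would call `schedule` at the start of each epoch and rebuild its data loader and optimizer step size from the returned pair.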

---

Title: JoIN: Joint GANs Inversion for Intrinsic Image Decomposition

Authors: Viraj Shah, Svetlana Lazebnik, Julien Philip

Abstract: Intrinsic Image Decomposition (IID) is a challenging inverse problem that seeks to decompose a natural image into its underlying intrinsic components such as albedo and shading. While recent image decomposition methods rely on learning-based priors on these components, they often suffer from component cross-contamination owing to joint training of priors, or from a Sim-to-Real gap, since the priors trained on synthetic data are kept frozen during the inference on real images. In this work, we propose to solve the intrinsic image decomposition problem using a bank of Generative Adversarial Networks (GANs) as priors where each GAN is independently trained only on a single intrinsic component, providing stronger and more disentangled priors. At the core of our approach is the idea that the latent space of a GAN is a well-suited optimization domain to solve inverse problems. Given an input image, we propose to jointly invert the latent codes of a set of GANs and combine their outputs to reproduce the input. Contrary to all existing GAN inversion methods that are limited to inverting only a single GAN, our proposed approach, JoIN, is able to jointly invert multiple GANs using only a single image as supervision while still maintaining distribution priors of each intrinsic component. We show that our approach is modular, allowing various forward imaging models, and that it can successfully decompose both synthetic and real images. Further, taking inspiration from existing GAN inversion approaches, we allow for careful fine-tuning of the generator priors during the inference on real images. This way, our method is able to achieve excellent generalization on real images even though it uses only synthetic data to train the GAN priors. We demonstrate the success of our approach through exhaustive qualitative and quantitative evaluations and ablation studies on various datasets.

URL: https://openreview.net/forum?id=JEHIVfjmOf

---

Title: Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

Authors: Cullen Anderson, Jeff M. Phillips

Abstract: Robust statistics aims to compute quantities to represent data where a fraction of it may be arbitrarily corrupted. The most essential statistic is the mean, and in recent years, there has been a flurry of theoretical advancement for efficiently estimating the mean in high dimensions on corrupted data. While several algorithms have been proposed that achieve near-optimal error, they all rely on large data size requirements as a function of dimension. In this paper, we perform extensive experiments with various mean estimation techniques where data size might not meet this requirement due to the high-dimensional setting.

For data with inliers generated from a Gaussian with known covariance, we find experimentally that several robust mean estimation techniques can practically improve upon the sample mean, with the quantum entropy scaling approach from Dong et al. (NeurIPS 2019) performing consistently the best. However, this consistent improvement is conditioned on a couple of simple modifications to how the steps to prune outliers work in the high-dimension low-data setting, and when the inliers deviate significantly from Gaussianity. In fact, with these modifications, they are typically able to achieve roughly the same error as taking the sample mean of the uncorrupted inlier data, even with very low data size. In addition to controlled experiments on synthetic data, we also explore these methods on large language models, deep pretrained image models, and non-contextual word embedding models that do not necessarily have an inherent Gaussian distribution. Yet, in these settings, a mean point of a set of embedded objects is a desirable quantity to learn, and the data exhibits the high-dimension low-data setting studied in this paper. We show both the challenges of achieving this goal, and that our updated robust mean estimation methods can provide significant improvement over using just the sample mean. We additionally publish a library of Python implementations of robust mean estimation algorithms, allowing practitioners and researchers to apply these techniques and to perform further experimentation.

URL: https://openreview.net/forum?id=1QeI99nH9k

---

Title: What is the Relationship between Tensor Factorizations and Circuits (and How Can We Exploit it)?

Authors: Lorenzo Loconte, Antonio Mari, Gennaro Gala, Robert Peharz, Cassio de Campos, Erik Quaeghebeur, Gennaro Vessio, Antonio Vergari

Abstract: This paper establishes a rigorous connection between circuit representations and tensor factorizations, two seemingly distinct yet fundamentally related areas. By connecting these fields, we highlight a series of opportunities that can benefit both communities. Our work generalizes popular tensor factorizations within the circuit language, and unifies various circuit learning algorithms under a single, generalized hierarchical factorization framework. Specifically, we introduce a modular “Lego block” approach to build tensorized circuit architectures. This, in turn, allows us to systematically construct and explore various circuit and tensor factorization models while maintaining tractability. This connection not only clarifies similarities and differences in existing models, but also enables the development of a comprehensive pipeline for building and optimizing new circuit/tensor factorization architectures. We show the effectiveness of our framework through extensive empirical evaluations, and highlight new research opportunities for tensor factorizations in probabilistic modeling.

URL: https://openreview.net/forum?id=Y7dRmpGiHj

---

Title: Personalized Negative Reservoir for Incremental Learning in Recommender Systems

Authors: Antonios Valkanas, Yuening Wang, Yingxue Zhang, Mark Coates

Abstract: Recommender systems have become an integral part of online platforms. Every day the volume of training data is expanding and the number of user interactions is constantly increasing. The exploration of larger and more expressive models has become a necessary pursuit to improve user experience. However, this progression carries with it an increased computational burden. In commercial settings, once a recommendation system model has been trained and deployed it typically needs to be updated frequently as new client data arrive. Cumulatively, the mounting volume of data is guaranteed to eventually make full batch retraining of the model from scratch computationally infeasible. Naively fine-tuning solely on the new data runs into the well-documented problem of catastrophic forgetting. Despite the fact that negative sampling is a crucial part of training with implicit feedback, no specialized technique exists that is tailored to the incremental learning framework. In this work, we propose a personalized negative reservoir strategy, which is used to obtain negative samples for the standard triplet loss of graph-based recommendation systems. Our technique balances alleviation of forgetting with plasticity by encouraging the model to remember stable user preferences and selectively forget when user interests change. We derive the mathematical formulation of a negative sampler to populate and update the reservoir. We integrate our design in three SOTA and commonly used incremental recommendation models. We show that these concrete realizations of our negative reservoir framework achieve state-of-the-art results for standard benchmarks using multiple top-k evaluation metrics.

URL: https://openreview.net/forum?id=jrUUk5Fskm

---

Title: Explaining Explainability: Recommendations for Effective Use of Concept Activation Vectors

Authors: Angus Nicolson, Lisa Schut, Alison Noble, Yarin Gal

Abstract: Concept-based explanations translate the internal representations of deep learning models into a language that humans are familiar with: concepts. One popular method for finding concepts is Concept Activation Vectors (CAVs), which are learnt using a probe dataset of concept exemplars. In this work, we investigate three properties of CAVs: (1) inconsistency across layers, (2) entanglement with other concepts, and (3) spatial dependency. Each property provides both challenges and opportunities in interpreting models. We introduce tools designed to detect the presence of these properties, provide insight into how each property can lead to misleading explanations, and provide recommendations to mitigate their impact. To demonstrate practical applications, we apply our recommendations to a melanoma classification task, showing how entanglement can lead to uninterpretable results and that the choice of negative probe set can have a substantial impact on the meaning of a CAV. Further, we show that understanding these properties can be used to our advantage. For example, we introduce spatially dependent CAVs to test if a model is translation invariant with respect to a specific concept and class. Our experiments are performed on natural images (ImageNet), skin lesions (ISIC 2019), and a new synthetic dataset, Elements. Elements is designed to capture a known ground truth relationship between concepts and classes. We release this dataset to facilitate further research in understanding and evaluating interpretability methods.

URL: https://openreview.net/forum?id=7CUluLpLxV

---

Title: What Makes ImageNet Look Unlike LAION

Authors: Ali Shirali, Moritz Hardt

Abstract: ImageNet was famously created by querying several image search engines such as Flickr. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, that we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.

URL: https://openreview.net/forum?id=IrBYuh9W3T

---

Title: Continual Learning from Simulated Interactions via Multitask Prospective Rehearsal for Bionic Limb Behavior Modeling

Authors: Sharmita Dey, Benjamin Paassen, Sarath Ravindran Nair, Sabri Boughorbel, Arndt F. Schilling

Abstract: Lower limb amputations and neuromuscular impairments severely restrict mobility, necessitating advancements beyond conventional prosthetics. While motorized bionic limbs show promise, their effectiveness depends on replicating the dynamic coordination of human movement across diverse environments. In this paper, we introduce a model for human behavior in the context of bionic prosthesis control. Our approach leverages human locomotion demonstrations to learn the synergistic coupling of the lower limbs, enabling the prediction of the kinematic behavior of a missing limb during tasks such as walking, climbing inclines, and stairs. We propose a multitasking, continually adaptive model that anticipates and refines movements over time. At the core of our method is a technique which we call the multitask prospective rehearsal, that anticipates and synthesizes future movements based on the previous prediction and employs a corrective mechanism for subsequent predictions. Our evolving architecture merges lightweight, task-specific modules on a shared backbone, ensuring both specificity and scalability. We validate our model through experiments on real-world human gait datasets, including transtibial amputees, across a wide range of locomotion tasks. Results demonstrate that our approach consistently outperforms baseline models, particularly in scenarios with distributional shifts, adversarial perturbations, and noise.

URL: https://openreview.net/forum?id=Bmy82p2eez

---

Title: Geometry-Aware visualization of high dimensional Symmetric Positive Definite matrices

Authors: Thibault de Surrel, Sylvain Chevallier, Fabien Lotte, Florian Yger

Abstract: Symmetric Positive Definite (SPD) matrices are pervasive in machine learning, from data features (such as covariance matrices) to optimization processes. These matrices induce a Riemannian structure, where the curvature plays a critical role in the success of approaches based on those geometries. Yet, for ML practitioners wanting to visualize SPD matrices, the existing (flat) Euclidean approaches will hide the curvature of the manifold. To overcome this lack of expressivity in the existing algorithms, we introduce Riemannian versions of two state-of-the-art techniques, namely t-SNE and Multidimensional Scaling. Therefore, we are able to reduce a set of $c \times c$ SPD matrices into a set of $2 \times 2$ SPD matrices in order to capture the curvature information and avoid any distortion induced by flattening the representation in a Euclidean setup. Moreover, our approaches pave the way for targeting more general dimensionality reduction applications while preserving the geometry of the data. We performed experiments on a controlled synthetic dataset to ensure that the low-dimensional representation preserves the geometric properties of both SPD Gaussians and geodesics. We also conduct experiments on various real datasets, such as video, anomaly detection, brain signals and others.

URL: https://openreview.net/forum?id=DYCSRf3vby
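
To give a feel for what a curved geometry on $2 \times 2$ SPD matrices looks like, here is a minimal sketch of the affine-invariant Riemannian distance, one common choice of metric on SPD matrices. The abstract does not say which metric the paper uses, so the metric and the helper name `airm_distance_2x2` are assumptions made for illustration only.

```python
import math

def airm_distance_2x2(a, b):
    """Affine-invariant Riemannian distance between two 2x2 SPD matrices.

    Uses d(A, B) = sqrt(sum_i ln^2(lambda_i)), where lambda_i are the
    eigenvalues of A^{-1} B, i.e. the roots of det(B - lambda * A) = 0.
    Inputs are nested lists and are assumed symmetric positive definite.
    """
    (a11, a12), (_, a22) = a
    (b11, b12), (_, b22) = b
    # det(B - lambda*A) = qa*lambda^2 - qb*lambda + qc for symmetric inputs
    qa = a11 * a22 - a12 * a12
    qb = a11 * b22 + a22 * b11 - 2.0 * a12 * b12
    qc = b11 * b22 - b12 * b12
    # Clamp tiny negative values caused by floating-point rounding
    disc = max(qb * qb - 4.0 * qa * qc, 0.0)
    root = math.sqrt(disc)
    lam1 = (qb + root) / (2.0 * qa)
    lam2 = (qb - root) / (2.0 * qa)
    return math.sqrt(math.log(lam1) ** 2 + math.log(lam2) ** 2)

A = [[2.0, 1.0], [1.0, 2.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
d_riemann = airm_distance_2x2(A, I2)

# Scaling both matrices by the same factor leaves the distance unchanged,
# which a flat Euclidean (Frobenius) distance would not do.
A10 = [[20.0, 10.0], [10.0, 20.0]]
I10 = [[10.0, 0.0], [0.0, 10.0]]
d_scaled = airm_distance_2x2(A10, I10)
```

The invariance under a common rescaling is one of the properties that separates this distance from the Frobenius norm of the difference.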

---

Title: CNN Interpretability with Multivector Tucker Saliency Maps for Self-Supervised Models

Authors: Aymene Mohammed Bouayed, Samuel Deslauriers-gauthier, Adrian IACOVELLI, David Naccache

Abstract: Interpreting the decisions of Convolutional Neural Networks (CNNs) is essential for understanding their behavior, yet it remains a significant challenge, particularly for self-supervised models. Most existing methods for generating saliency maps rely on reference labels, restricting their use to supervised tasks. EigenCAM is the only notable label-independent alternative, leveraging Singular Value Decomposition to generate saliency maps applicable across CNN models, but it does not fully exploit the tensorial structure of feature maps. In this work, we introduce the Tucker Saliency Map (TSM) method, which applies Tucker tensor decomposition to better capture the inherent structure of feature maps, producing more accurate singular vectors and values. These are used to generate high-fidelity saliency maps, effectively highlighting objects of interest in the input. We further extend EigenCAM and TSM into multivector variants—Multivec-EigenCAM and Multivector Tucker Saliency Maps (MTSM)—which utilize all singular vectors and values, further improving saliency map quality. Quantitative evaluations on supervised classification models demonstrate that TSM, Multivec-EigenCAM, and MTSM achieve competitive performance with label-dependent methods. Moreover, TSM enhances interpretability by approximately $50\%$ over EigenCAM for both supervised and self-supervised models. Multivec-EigenCAM and MTSM further advance state-of-the-art interpretability performance on self-supervised models, with MTSM achieving the best results.

URL: https://openreview.net/forum?id=VM8bNd5A09

---

Title: Class Incremental Learning from First Principles: A Review

Authors: Neil Ashtekar, Jingxi Zhu, Vasant G Honavar

Abstract: Continual learning systems attempt to efficiently learn over time without forgetting previously acquired knowledge. In recent years, there has been an explosion of work on continual learning, mainly focused on the class-incremental learning (CIL) setting. In this review, we take a step back and reconsider the CIL problem. We reexamine the problem definition and describe its unique challenges, contextualize existing solutions by analyzing non-continual approaches, and investigate the implications of various problem configurations. Our goal is to provide an alternative perspective to existing work on CIL and direct attention toward unexplored aspects of the problem.

URL: https://openreview.net/forum?id=sZdtTJInUg

---

Title: Neural Lattice Reduction: A Self-Supervised Geometric Deep Learning Approach

Authors: Giovanni Luca Marchetti, Gabriele Cesa, Kumar Pratik, Arash Behboodi

Abstract: Lattice reduction is a combinatorial optimization problem aimed at finding the most orthogonal basis in a given lattice. The Lenstra–Lenstra–Lovász (LLL) algorithm is the best algorithm in the literature for solving this problem. In light of recent research on algorithm discovery, in this work, we would like to answer this question: is it possible to parametrize the algorithm space for the lattice reduction problem with neural networks and find an algorithm without supervised data? Our strategy is to use equivariant and invariant parametrizations and train in a self-supervised way. We design a deep neural model outputting factorized unimodular matrices and train it in a self-supervised manner by penalizing non-orthogonal lattice bases. We incorporate the symmetries of lattice reduction into the model by making it invariant to isometries and scaling of the ambient space and equivariant with respect to the hyperoctahedral group permuting and flipping the lattice basis elements. We show that this approach yields an algorithm with comparable complexity and performance to the LLL algorithm on a set of benchmarks. Additionally, motivated by certain applications for wireless communication, we extend our method to a convolutional architecture which performs joint reduction of spatially-correlated lattices arranged in a grid, thereby amortizing its cost over multiple lattices.

URL: https://openreview.net/forum?id=YxXyRSlZ4b
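
The two ingredients named in the abstract, a unimodular transformation and a penalty on non-orthogonal bases, can be illustrated in two dimensions. This is a minimal sketch: the orthogonality defect used here is a standard measure, but the abstract does not specify the paper's loss, and the function names and the example basis are hypothetical.

```python
import math

def orthogonality_defect(basis):
    """Product of the basis vector norms divided by |det|.

    Equals 1 for an orthogonal basis and grows as the basis gets more skewed.
    Rows of `basis` are the two lattice basis vectors.
    """
    (a, b), (c, d) = basis
    det = a * d - b * c
    return math.hypot(a, b) * math.hypot(c, d) / abs(det)

def apply_unimodular(u, basis):
    """Return U @ B for a 2x2 integer matrix U with determinant +1 or -1.

    Such a U maps one basis of a lattice to another basis of the same lattice.
    """
    (u11, u12), (u21, u22) = u
    assert abs(u11 * u22 - u12 * u21) == 1, "U must be unimodular"
    b1, b2 = basis
    return [
        [u11 * b1[0] + u12 * b2[0], u11 * b1[1] + u12 * b2[1]],
        [u21 * b1[0] + u22 * b2[0], u21 * b1[1] + u22 * b2[1]],
    ]

skewed = [[1, 0], [7, 1]]   # a skewed basis of the integer lattice Z^2
U = [[1, 0], [-7, 1]]       # replaces b2 with b2 - 7 * b1
reduced = apply_unimodular(U, skewed)

defect_before = orthogonality_defect(skewed)
defect_after = orthogonality_defect(reduced)
```

A learned model in the spirit of the abstract would output `U` rather than have it hand-picked as it is here.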

---

Title: Evaluating Interpretable Methods via Geometric Alignment of Functional Distortions

Authors: Anna Hedström, Philine Lou Bommer, Thomas F Burns, Sebastian Lapuschkin, Wojciech Samek, Marina MC Höhne

Abstract: Interpretability researchers face a universal question: without access to ground truth labels, how can the faithfulness of an explanation to its model be determined? Despite immense efforts to develop new evaluation methods, current approaches remain in a pre-paradigmatic state: fragmented, difficult to calibrate, and lacking cohesive theoretical grounding. Observing the lack of a unifying theory, we propose a novel evaluative criterion entitled Generalised Explanation Faithfulness (GEF) which is centered on explanation-to-model alignment, and integrates existing perturbation-based evaluations to eliminate the need for singular, task-specific evaluations. Complementing this unifying perspective, from a geometric point of view, we reveal a prevalent yet critical oversight in current evaluation practice: the failure to account for the learned geometry, and non-linear mapping present in the model, and explanation spaces. To solve this, we propose a general-purpose, threshold-free faithfulness evaluator GEF that incorporates principles from differential geometry, and facilitates evaluation agnostically across tasks, and interpretability approaches. Through extensive cross-domain benchmarks on natural language processing, vision, and tabular tasks, we provide first-of-its-kind insights into the comparative performance of various interpretable methods. This includes local linear approximators, global feature visualisation methods, large language models as post-hoc explainers, and sparse autoencoders. Our contributions are important to the interpretability and AI safety communities, offering a principled, unified approach for evaluation.

URL: https://openreview.net/forum?id=ukLxqA8zXj

---

Title: The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers

Authors: Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag

Abstract: Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer assistants, we study whether gains on existing benchmarks or more preferred LLM responses translate to programmer productivity when coding with LLMs, including time spent coding. We introduce RealHumanEval, a web interface to measure the ability of LLMs to assist programmers, through either autocomplete or chat support. We conducted a user study (N=243) using RealHumanEval in which users interacted with seven LLMs of varying base model performance. Despite static benchmarks not incorporating humans-in-the-loop, we find that improvements in benchmark performance lead to increased programmer productivity; however gaps in benchmark versus human performance are not proportional---a trend that holds across both forms of LLM support. In contrast, we find that programmer preferences do not correlate with their actual performance, motivating the need for better proxy signals. We open-source RealHumanEval to enable human-centric evaluation of new models and the study data to facilitate efforts to improve code models.

URL: https://openreview.net/forum?id=hGaWq5Buj7

---

Title: Enhancing Remaining Useful Life Prediction with Ensemble Multi-Term Fourier Graph Neural Networks

Authors: Ya Song, Laurens Bliek, Yaoxin Wu, Yingqian Zhang

Abstract: Remaining useful life (RUL) prediction is crucial in predictive maintenance. Recently, deep learning forecasting methods, especially Spatio-Temporal Graph Neural Networks (ST-GNNs), have achieved remarkable performance in RUL prediction. Most existing ST-GNNs require searching for the graph structure before utilizing GNNs to learn spatial graph representation, and they necessitate a temporal model such as LSTM to leverage the temporal dependencies in a fixed lookback window. However, such an approach has several limitations. Firstly, it demands substantial computational resources to learn graph structures for the time series data. Secondly, independently learning spatial and temporal information disregards their inherent correlation, and thirdly, capturing information within a fixed lookback window ignores long-term dependencies across the entire time series. To mitigate the issues above, instead of treating the data within the lookback window as a sequence of graphs in ST-GNN methods, we regard it as a complete graph and employ a Fourier Graph Neural Network (FGN) to learn the spatiotemporal information within this graph in the frequency space. Additionally, we create training and test graphs with varying sizes of lookback windows, enabling the model to learn both short-term and long-term dependencies and provide multiple predictions for ensemble averaging. We also consider scenarios where sensor signals exhibit multiple operation conditions and design a sequence decomposition plugin to denoise input signals, aiming to enhance the performance of FGN. We evaluate the proposed model on two benchmark datasets, demonstrating its superior performance on the RUL prediction task compared to state-of-the-art approaches.

URL: https://openreview.net/forum?id=tzFjcVqmxw

---

Title: Data Augmentation Policy Search for Long-Term Forecasting

Authors: Liran Nochumsohn, Omri Azencot

Abstract: Data augmentation serves as a popular regularization technique to combat overfitting challenges in neural networks. While automatic augmentation has demonstrated success in image classification tasks, its application to time-series problems, particularly in long-term forecasting, has received comparatively less attention. To address this gap, we introduce a time-series automatic augmentation approach named TSAA, which is both efficient and easy to implement. The solution involves tackling the associated bilevel optimization problem through a two-step process: initially training a non-augmented model for a limited number of epochs, followed by an iterative split procedure. During this iterative process, we alternate between identifying a robust augmentation policy through Bayesian optimization and refining the model while discarding suboptimal runs. Extensive evaluations on challenging univariate and multivariate forecasting benchmark problems demonstrate that TSAA consistently outperforms several robust baselines, suggesting its potential integration into prediction pipelines. Code is available at this repository: https://github.com/azencot-group/TSAA.

URL: https://openreview.net/forum?id=Wnd0XY0twh

---

Title: Adaptive Multi-step Refinement Network for Robust Point Cloud Registration

Authors: Zhi Chen, Yufan Ren, Tong Zhang, Zheng Dang, Wenbing Tao, Sabine Susstrunk, Mathieu Salzmann

Abstract: Point Cloud Registration (PCR) estimates the relative rigid transformation between two point clouds of the same scene. Despite significant progress with learning-based approaches, existing methods still face challenges when the overlapping region between the two point clouds is small. In this paper, we propose an adaptive multi-step refinement network that refines the registration quality at each step by leveraging the information from the preceding step. To achieve this, we introduce a training procedure and a refinement network. Firstly, to adapt the network to the current step, we utilize a generalized one-way attention mechanism, which prioritizes the last step's estimated overlapping region, and we condition the network on step indices. Secondly, instead of training the network to map either random transformations or a fixed pre-trained model's estimations to the ground truth, we train it on transformations with varying registration qualities, ranging from accurate to inaccurate, thereby enhancing the network's adaptiveness and robustness. Despite its conceptual simplicity, our method achieves state-of-the-art performance on both the 3DMatch/3DLoMatch and KITTI benchmarks. Notably, on 3DLoMatch, our method reaches 80.4% recall rate, with an absolute improvement of 1.2%.

URL: https://openreview.net/forum?id=M3SkSMfWcP

---

Title: LLaVA-OneVision: Easy Visual Task Transfer

Authors: Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

Abstract: We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.

URL: https://openreview.net/forum?id=zKv8qULV6n

---

Title: On the Regularization of Learnable Embeddings for Time Series Forecasting

Authors: Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

Abstract: In forecasting multiple time series, accounting for the individual features of each sequence can be challenging. To address this, modern deep learning methods for time series analysis combine a shared (global) model with local layers, specific to each time series, often implemented as learnable embeddings. Ideally, these local embeddings should encode meaningful representations of the unique dynamics of each sequence. However, when these are learned end-to-end as parameters of a forecasting model, they may end up acting as mere sequence identifiers. Shared processing blocks may then become reliant on such identifiers, limiting their transferability to new contexts. In this paper, we address this issue by investigating methods to regularize the learning of local learnable embeddings for time series processing. Specifically, we perform the first extensive empirical study on the subject and show how such regularizations consistently improve performance in widely adopted architectures. Furthermore, we show that methods attempting to prevent the co-adaptation of local and global parameters by means of embeddings perturbation are particularly effective in this context. In this regard, we include in the comparison several perturbation-based regularization methods, going as far as periodically resetting the embeddings during training. The obtained results provide an important contribution to understanding the interplay between learnable local parameters and shared processing layers: a key challenge in modern time series processing models and a step toward developing effective foundation models for time series.

URL: https://openreview.net/forum?id=F5ALCh3GWG

---

Title: Towards context and domain-aware algorithms for scene analysis

Authors: Ibrahim Serouis, Florence Sèdes

Abstract: Interpersonal interactions and social situations in multimedia content encompass a rich blend of visual, textual, audio, and contextual cues. However, contextual data integration in multimodal scene analysis research has often been overlooked, leading to incomplete interpretations. For instance, recognizing that two combatants in a video are positioned within a designated ring with a dedicated referee drastically alters the perception from a simple scuffle to a structured martial arts contest.

This paper presents an innovative approach to scene analysis in video content, which not only incorporates contextual data but also emphasizes the most significant features during training. Additionally, we introduce a methodology for integrating domain knowledge into our framework. We evaluate our proposed methodology using two comprehensive datasets, demonstrating promising results compared to a baseline study using one of the datasets. These findings underscore the importance of integrating contextual data into multimodal video analysis, while also recognizing the challenges associated with their utilization.

URL: https://openreview.net/forum?id=JQGmbVK4Fr

---

Title: DELTA: Dual Consistency Delving with Topological Uncertainty for Active Graph Domain Adaptation

Authors: Pengyun Wang, Yadi Cao, Chris Russell, Yanxin Shen, Junyu Luo, Ming Zhang, Siyu Heng, Xiao Luo

Abstract: Graph domain adaptation has recently enabled knowledge transfer across different graphs. However, without the semantic information on target graphs, the performance on target graphs is still far from satisfactory. To address the issue, we study the problem of active graph domain adaptation, which selects a small quantity of informative nodes on the target graph for extra annotation. This problem is highly challenging due to the complicated topological relationships and the distribution discrepancy across graphs. In this paper, we propose a novel approach named Dual Consistency Delving with Topological Uncertainty (DELTA) for active graph domain adaptation. Our DELTA consists of an edge-oriented graph subnetwork and a path-oriented graph subnetwork, which can explore topological semantics from complementary perspectives. In particular, our edge-oriented graph subnetwork utilizes the message passing mechanism to learn neighborhood information, while our path-oriented graph subnetwork explores high-order relationships from substructures. To jointly learn from two subnetworks, we roughly select informative candidate nodes with the consideration of consistency across two subnetworks. Then, we aggregate local semantics from each candidate's K-hop subgraph based on node degrees for topological uncertainty estimation. To overcome potential distribution shifts, we compare target nodes and their corresponding source nodes for discrepancy scores as an additional component for fine selection. Extensive experiments on benchmark datasets demonstrate that DELTA outperforms various state-of-the-art approaches. The code implementation of DELTA is available at https://github.com/goose315/DELTA.

URL: https://openreview.net/forum?id=P5y82LKGbY

---

Title: When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang

Abstract: Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. Despite its limitations, BFloat16 remains desirable for its computational efficiency, particularly given the substantial memory overhead required to extend the context window. To improve long-context training under BFloat16, we develop AnchorAttention, a plug-and-play attention method that enhances long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains semantic coherence, and boosts computational efficiency by treating the first token as a shared anchor with a consistent position ID, making it visible to all documents within the training context. Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50\% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks.

URL: https://openreview.net/forum?id=gwXfZ3xkUq
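
As a small illustration of what "limited precision" means for bfloat16, the sketch below rounds integer position indices to the nearest bfloat16 value using only the standard library. It is an assumption-laden toy: whether and where positions or rotation angles are held in bfloat16 depends on the implementation, and the abstract does not describe the exact mechanism the paper analyzes. The helper `to_bfloat16` is hypothetical and omits NaN and infinity handling.

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a Python float to the nearest bfloat16 value (ties to even).

    bfloat16 keeps the top 16 bits of an IEEE float32: 1 sign bit,
    8 exponent bits, and 7 explicit mantissa bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that will be dropped
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

positions = [255, 256, 257, 258, 259]
rounded = [to_bfloat16(float(p)) for p in positions]

# Neighbouring positions that land on the same bfloat16 value can no longer
# be told apart, so their relative offset would appear to be zero.
collisions = [
    (p, q)
    for p, q in zip(positions, positions[1:])
    if to_bfloat16(float(p)) == to_bfloat16(float(q))
]
```

With 8 significant bits, integers above 256 are no longer all representable, which is why distinct positions start to coincide at this scale.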

---

New submissions
===============

Title: Multiple Invertible and Partial-Equivariant Function for Latent Vector Transformation to Enhance Disentanglement in VAEs

Abstract: Disentanglement learning is a core issue for understanding and re-using trained information in Variational AutoEncoder (VAE), and effective inductive bias has been reported as a key factor. However, the actual implementation of such bias is still vague. In this paper, we propose a novel method, called Multiple Invertible and partial-equivariant transformation (MIPE-transformation), to inject inductive bias by 1) guaranteeing the invertibility of latent-to-latent vector transformation while preserving a certain portion of equivariance of input-to-latent vector transformation, called Invertible and partial-equivariant transformation (IPE-transformation), 2) extending the form of prior and posterior in VAE frameworks to an unrestricted form through a learnable conversion to an approximated exponential family, called exponential Family conversion (EF-conversion), and 3) integrating multiple units of IPE-transformation and EF-conversion, and their training. In experiments on 3D Cars, 3D Shapes, and dSprites datasets, MIPE-transformation improves the disentanglement performance of state-of-the-art VAEs.

URL: https://openreview.net/forum?id=4c36ZobfSl

---

Title: PII-Scope: A Benchmark for Training Data PII Leakage Assessment in LLM

Abstract: In this work, we introduce PII-Scope, a comprehensive benchmark designed to evaluate state-of-the-art methodologies for PII extraction attacks targeting LLMs across diverse threat settings. Our study provides a deeper understanding of these attacks by uncovering several hyperparameters (e.g., demonstration selection) crucial to their effectiveness. Building on this understanding, we extend our study to more realistic attack scenarios, exploring PII attacks that employ advanced adversarial strategies, including repeated and diverse querying, and leveraging iterative learning for continual PII extraction. Through extensive experimentation, our results reveal a notable underestimation of PII leakage in existing single-query attacks. In fact, we show that with sophisticated adversarial capabilities and a limited query budget, PII extraction rates can increase by up to fivefold when targeting the pretrained model. Moreover, we evaluate PII leakage on finetuned models, showing that they are more vulnerable to leakage than pretrained models. Overall, our work establishes a rigorous empirical benchmark for PII extraction attacks in realistic threat scenarios and provides a strong foundation for developing effective mitigation strategies.

URL: https://openreview.net/forum?id=u20RDpgcGC

---

Title: Qualifying Knowledge and Knowledge Sharing in Multilingual Models

Abstract: Pre-trained language models (PLMs) have demonstrated a remarkable ability to encode factual knowledge. However, the mechanisms underlying how this knowledge is stored and retrieved remain poorly understood, with important implications for AI interpretability and safety. In this paper, we disentangle the multifaceted nature of knowledge: successfully completing a knowledge retrieval task (e.g., “The capital of France is __”) involves mastering underlying concepts (e.g., France, Paris), relationships between these concepts (e.g., capital of), and the structure of prompts, including the language of the query. We propose to disentangle these distinct aspects of knowledge and apply this typology to offer a critical view of neuron-level knowledge attribution techniques. For concreteness, we focus on Dai et al.’s (2022) Knowledge Neurons (KNs) across multiple PLMs, testing 10 natural languages and unnatural languages (e.g., Autoprompt). Our key contributions are twofold: (i) we show that KNs come in different flavors, some indeed encoding entity-level concepts, some having a much less transparent, more polysemantic role, and (ii) we uncover an unprecedented overlap in KNs across up to all of the 10 languages we tested, pointing to the existence of a partially unified, language-agnostic retrieval system. To do so, we introduce and release the Multi-ParaRel dataset, an extension of ParaRel, featuring prompts and paraphrases for cloze-style knowledge retrieval tasks in parallel over 10 languages.

URL: https://openreview.net/forum?id=hnpB3SRbZj

---

Title: Variational Inference with Unnormalized Priors

Abstract: Variational inference typically assumes normalized priors, which can limit the expressiveness of generative models like Variational Autoencoders (VAEs). In this work, we propose a novel approach by replacing the prior p(z) with an unnormalized energy-based distribution exp(−E(z))/Z, where E(z) is an unrestricted energy function and Z is the partition function. This leads to a variational lower bound that allows for two key innovations: (1) the incorporation of more powerful, flexible priors into the VAE framework, resulting in improved likelihood estimates and enhanced generative performance, and (2) the ability to train VAEs with energy priors independent of the intractable normalizing constant, requiring only that the prior estimates the aggregated posterior, which can be achieved via a variety of different objectives. Our approach bridges VAEs and EBMs, providing a scalable and efficient framework for leveraging unnormalized priors in probabilistic models.

URL: https://openreview.net/forum?id=Rdb5n2pD5k
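
To make the role of the partition function concrete, the standard evidence lower bound can be written with the energy-based prior from the abstract substituted in. This is only a sketch of the textbook ELBO: the encoder q_phi(z | x) and decoder p_theta(x | z) notation is assumed here, and the bound actually derived in the paper may differ.

```latex
\begin{aligned}
\log p_\theta(x)
  &\ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
     - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big),
     \qquad p(z) = \frac{\exp(-E(z))}{Z} \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
     + \mathcal{H}\big[q_\phi(z \mid x)\big]
     - \mathbb{E}_{q_\phi(z \mid x)}\big[E(z)\big]
     - \log Z .
\end{aligned}
```

In this form log Z is constant with respect to the encoder and decoder parameters, but it still depends on the parameters of E(z), which is where the intractable normalizing constant enters when the prior itself is trained.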

---

Title: Stochastic Frank Wolfe for Constrained Nonconvex Optimization

Abstract: We provide practical convergence analyses of Stochastic Frank Wolfe (SFW) and SFW with momentum with constant and decaying learning rates for constrained nonconvex optimization problems. We show that a convergence measure called the Frank Wolfe gap converges to zero only when we decrease the learning rate and increase the batch size. We apply SFW algorithms to adversarial attacks and propose a new adversarial attack method, Auto-SFW. Finally, we compare existing methods with the SFW algorithms in attacks against the latest robust models.

URL: https://openreview.net/forum?id=FUcti5SHAG
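
For readers unfamiliar with the Frank Wolfe gap mentioned in the abstract, the following minimal sketch shows plain SFW (without momentum) on an L-infinity ball, the constraint set commonly used for adversarial perturbations. The toy quadratic objective, the helper names, and the schedules are illustrative assumptions; this is not the Auto-SFW method from the paper.

```python
import random


def lmo_linf(grad, center, eps):
    """Linear minimization oracle for the ball {v : max_i |v_i - center_i| <= eps}.

    Returns a point of the ball that minimizes the inner product <grad, v>.
    """
    def sign(g):
        return (g > 0) - (g < 0)
    return [c - eps * sign(g) for g, c in zip(grad, center)]


def fw_gap(grad, x, center, eps):
    """Frank-Wolfe gap max_v <grad, x - v>; nonnegative whenever x is feasible."""
    v = lmo_linf(grad, center, eps)
    return sum(g * (xi - vi) for g, xi, vi in zip(grad, x, v))


def sfw(stoch_grad, x0, center, eps, steps, lr, batch):
    """Plain SFW: x <- (1 - lr_t) * x + lr_t * v_t, with v_t from the oracle."""
    x = list(x0)
    for t in range(steps):
        g = stoch_grad(x, batch(t))      # mini-batch gradient estimate
        v = lmo_linf(g, center, eps)     # vertex that minimizes the linearization
        step = lr(t)
        x = [(1 - step) * xi + step * vi for xi, vi in zip(x, v)]
    return x


def make_noisy_quadratic_grad(target, sigma, seed=0):
    """Hypothetical toy problem: f(x) = 0.5 * ||x - target||^2 with Gaussian
    gradient noise that is averaged over the mini-batch."""
    rng = random.Random(seed)

    def stoch_grad(x, batch_size):
        grad = []
        for xi, ti in zip(x, target):
            noise = sum(rng.gauss(0.0, sigma) for _ in range(batch_size)) / batch_size
            grad.append(xi - ti + noise)
        return grad

    return stoch_grad


if __name__ == "__main__":
    center, eps = [0.0, 0.0], 0.5
    grad_fn = make_noisy_quadratic_grad(target=[0.2, -0.1], sigma=0.5)
    # Decaying learning rate and growing batch size, as the abstract suggests.
    x = sfw(grad_fn, [0.0, 0.0], center, eps, steps=200,
            lr=lambda t: 2.0 / (t + 2), batch=lambda t: t + 1)
    exact_grad = [x[0] - 0.2, x[1] + 0.1]
    print("final FW gap:", fw_gap(exact_grad, x, center, eps))
```

Because every iterate is a convex combination of feasible points, the method never needs a projection step, which is the usual reason to prefer Frank Wolfe over projected gradient methods on such sets.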

---

Title: Exploring Weak-to-Strong Generalization for CLIP-based Classification

Abstract: Aligning large-scale commercial models with user intent is crucial to preventing harmful outputs. Current methods rely on human supervision but become impractical as model complexity increases. When models surpass human knowledge, providing accurate feedback becomes challenging and inefficient.
A novel solution proposed recently is using a weaker model to supervise a stronger model. This concept leverages the ability of weaker models to perform evaluations, thereby reducing the workload on human supervisors.
Previous work has shown the effectiveness of weak-to-strong generalization in the context of language-only models. Extending this concept to vision-language models leverages these insights, adapting the proven benefits to a multi-modal context.
In our study, we explore weak-to-strong generalization for CLIP-based classification. We propose a method, class prototype learning (CPL), which aims to enhance the classification capabilities of the CLIP model, by learning more representative prototypes for each category.
Our findings indicate that despite the simple loss function under weak supervision, CPL yields robust results.
We evaluate our method on challenging datasets; extensive experiments show that it is effective, achieving a 3.67% improvement over baseline methods.

URL: https://openreview.net/forum?id=quE8gDDegf

---

Title: MarDini: Masked Auto-regressive Diffusion for Video Generation at Scale

Abstract: We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini’s MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.

URL: https://openreview.net/forum?id=fuOHI59rUW

---

Title: Bi-Mamba: Towards Accurate 1-Bit State Space Model

Abstract: The typical selective state-space model (SSM) of Mamba addresses several limitations of Transformers, such as quadratic computational complexity with sequence length and significant inference-time memory requirements due to the key-value cache. However, the growing size of Mamba models continues to pose training and deployment challenges and raises environmental concerns due to considerable energy consumption. In this work, we introduce Bi-Mamba, a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models at multiple sizes: 780M, 1.3B, and 2.7B. Bi-Mamba models are trained from scratch on the same data volume as regular LLM pretraining using an autoregressive distillation loss. Extensive experimental results on language modeling demonstrate that Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16) and much better accuracy than post-training-binarization (PTB) Mamba and binarization-aware training (BAT) Transformer baselines, while significantly reducing memory footprint and energy consumption compared to the original Mamba model. Our study pioneers a new linear computational complexity LLM framework under low-bit representation and facilitates future design of specialized hardware tailored for efficient 1-bit Mamba-based LLMs. Our code is provided in supplementary material and the pre-trained weights are available anonymously at https://drive.google.com/drive/folders/1jfk_TlDzFbER84ITvU2hOX2VyPC9H4MA?usp=sharing.

URL: https://openreview.net/forum?id=HxU0wSMZ0n

---

Title: Symmetric Rank-One Quasi-Newton Methods for Deep Learning Using Cubic Regularization

Abstract: Stochastic gradient descent and other first-order variants, such as Adam and AdaGrad, are commonly used in the field of deep learning due to their computational efficiency and low-storage memory requirements. However, these methods do not exploit curvature information. Consequently, iterates can converge to saddle points or poor local minima. On the other hand, Quasi-Newton methods compute Hessian approximations which exploit this information with a comparable computational budget. Quasi-Newton methods re-use previously computed iterates and gradients to compute a low-rank structured update. The most widely used quasi-Newton update is L-BFGS, which guarantees a positive semi-definite Hessian approximation, making it suitable in a line search setting. However, the loss functions in DNNs are non-convex, where the Hessian is potentially non-positive definite. In this paper, we propose using a limited-memory symmetric rank-one quasi-Newton approach which allows for indefinite Hessian approximations, enabling directions of negative curvature to be exploited. Furthermore, we use a modified adaptive regularized cubics approach, which generates a sequence of cubic subproblems that have closed-form solutions with suitable regularization choices. We investigate the performance of our proposed method on autoencoders and feed-forward neural network models and compare our approach to state-of-the-art first-order adaptive stochastic methods as well as other quasi-Newton methods.

URL: https://openreview.net/forum?id=wVMLuhT9iC
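
The symmetric rank-one (SR1) update named in the abstract can be illustrated with a small dense sketch. It uses the textbook full-memory formula with the usual skip rule for a tiny denominator; the limited-memory variant and the cubic-regularized subproblem from the paper are not reproduced, and the function names are assumptions.

```python
def dot(a, b):
    """Inner product of two vectors given as lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(B, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [dot(row, v) for row in B]


def sr1_update(B, s, y, r=1e-8):
    """Return the SR1-updated Hessian approximation.

    s is the step x_new - x_old and y is the gradient difference.
    The update is B + u u^T / (u^T s) with u = y - B s, and it is skipped
    when |u^T s| is too small relative to ||s|| * ||u||.
    """
    u = [yi - bsi for yi, bsi in zip(y, matvec(B, s))]
    denom = dot(u, s)
    if abs(denom) <= r * (dot(s, s) ** 0.5) * (dot(u, u) ** 0.5):
        return [row[:] for row in B]     # skip: keep the old approximation
    n = len(s)
    return [[B[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Toy indefinite Hessian H = diag(2, -1); start from the identity.
    B0 = [[1.0, 0.0], [0.0, 1.0]]
    s = [1.0, 1.0]
    y = [2.0, -1.0]                      # y = H s
    B1 = sr1_update(B0, s, y)
    print(B1)                            # satisfies the secant equation B1 s = y
```

In the toy example the updated matrix has a negative determinant, so it is indefinite; a BFGS-type update would not be able to represent this negative curvature.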

---

Title: A Survey on Large Language Model Acceleration based on KV Cache Management

Abstract: Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations.
Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments.
Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications.

URL: https://openreview.net/forum?id=z3JZzu9EA3

---

Title: Disconnects between Dataset Representativeness and Group Algorithmic Fairness

Abstract: There have been numerous demonstrations that the prediction performance of machine learning algorithms differs among groups of people. This causes significant concerns about long-term social impact, including the perpetuation of disadvantages for certain populations. A common explanation is that disparity in performance (i.e., group unfairness) results from differences in group representation in datasets. Recent research has started to explore this explanation and proposed methods to address group unfairness by modulating group representation. We establish that, contrary to conventional wisdom, there exists a fundamental tradeoff between representativeness and group fairness. First, we theoretically describe this tradeoff in a simple univariate setting and confirm our theoretic results empirically across several commonly used datasets. To analyze whether these observations hold in more realistic settings, we then model the process of constructing representative datasets from multiple data sources using a multi-armed bandit framework and a novel Bayesian approach. We find that realistic sampling techniques further nuance the relationship between dataset representativeness and fairness. Notably, we show how the theoretically-sound solution of oversampling groups with lower performance may not hold for realistic multi-site data collection. Finally, we postulate that a key driver of unfairness is the extent to which labels are more challenging to predict for some groups than others. To validate this hypothesis, we show that greater model capacity can lead to improved group fairness independently of representation. In summary, we demonstrate how representativeness and group fairness may be at odds, how theoretically justified approaches to improve fairness may not hold true under realistic conditions, and propose a representation-independent method to improve algorithmic fairness.

URL: https://openreview.net/forum?id=Jr7frMe1Jw

---

Title: Pitfalls in Evaluating Inference-time Methods for Improving LLM Reliability

Abstract: Though Large Language Models (LLMs) have demonstrated remarkable capabilities, they are still prone to outputting falsehoods using seemingly persuasive language. Many recent works attempt to address this problem by using LLMs in a framework where a single seed prompt results in a series of interactions involving augmented prompts with an otherwise unchanged LLM, and the results are aggregated with a goal of producing a more reliable output. We consider the replicability and generalizability of evaluations of inference-time methods intended to improve the reliability of responses from a base LLM. We survey how methods have been evaluated in the literature and find a great variety of benchmarks and models in use. Motivated by this, we conduct our own evaluation of the effectiveness of a few methods across a range of benchmarks and models. Our evaluation reveals that while these techniques show promise in improving reliability, there is still significant variability in performance across different domains and tasks, and methods that show substantial improvements on weaker base models often do not improve reliability for better base models.

URL: https://openreview.net/forum?id=xeGWsmqFS8

---

Title: MemBench: Memorized Image Trigger Prompt Dataset for Diffusion Models

Abstract: Diffusion models have achieved remarkable success in Text-to-Image generation tasks, leading to the development of many commercial models. However, recent studies have reported that diffusion models often repeatedly generate memorized images from the training data when triggered by specific prompts, potentially raising social issues ranging from copyright to privacy concerns. To sidestep the memorization, recent studies have been conducted to develop memorization mitigation methods for diffusion models. Nevertheless, the lack of benchmarks hinders the assessment of the true effectiveness of these methods. In this work, we present MemBench, the first benchmark for evaluating image memorization mitigation methods. Our benchmark includes a large number of memorized image trigger prompts in various Text-to-Image diffusion models. Furthermore, in contrast to the prior work evaluating mitigation performance only on trigger prompts, we present metrics evaluating on both trigger prompts and general prompts, so that we can see whether mitigation methods address the memorization issue while maintaining performance for general prompts. Through our MemBench evaluation, we reveal that existing memorization mitigation methods notably degrade the overall performance of diffusion models and need to be further developed.

URL: https://openreview.net/forum?id=z3RIiidJgD

---

Title: Privacy-Preserving Language Model Inference with Instance Obfuscation

Abstract: Language Models as a Service (LMaaS) offers convenient access for developers and researchers to perform inference using pre-trained language models. Nonetheless, the input data and the inference results containing private information are exposed as plaintext during the service call, leading to privacy issues. Recent studies have started tackling the privacy issue by transforming input data into privacy-preserving representation from the user-end with techniques such as noise addition and content perturbation, while the exploration of inference result protection, namely decision privacy, is still a blank page. In order to maintain the black-box manner of LMaaS, conducting data privacy protection, especially for the decision, is a challenging task because the process has to be seamless to the models and accompanied by limited communication and computation overhead. We thus propose the Instance-Obfuscated Inference (IoI) method, which focuses on addressing the decision privacy issue of natural language understanding tasks in their complete life-cycle. Besides, we conduct comprehensive experiments to evaluate the performance as well as the privacy-protection strength of the proposed method on various benchmarking tasks.

URL: https://openreview.net/forum?id=izNA8x4kUp

---

Title: Efficient and Flexible Neural Network Training through Layer-wise Feedback Propagation

Abstract: Gradient-based optimization has been a cornerstone of machine learning that enabled the vast advances of Artificial Intelligence (AI) development over the past decades. However, this type of optimization requires differentiation, and with recent evidence of the benefits of non-differentiable (e.g. neuromorphic) architectures over classical models w.r.t. efficiency, such constraints can become limiting in the future. We present Layer-wise Feedback Propagation (LFP), a novel training principle for neural network-like predictors that utilizes methods from the domain of explainability to decompose a reward to individual neurons based on their respective contributions. Leveraging these neuron-wise rewards, our method then implements a greedy approach reinforcing helpful parts of the network and weakening harmful ones. While having comparable computational complexity to gradient descent, LFP does not require differentiation and generates sparse and thereby memory- and energy-efficient parameter updates and models. We establish the convergence of LFP theoretically and empirically, demonstrating its effectiveness on various models and datasets. Via two applications — neural network pruning and the approximation-free training of Spiking Neural Networks (SNNs) — we demonstrate that LFP combines increased efficiency in terms of computation and representation with flexibility w.r.t. choice of model architecture and objective function.

URL: https://openreview.net/forum?id=9oToxYVOSW

---

Title: Exploring the Spatial Dynamics of In-Distribution and Out-of-Distribution Data in Logit Space

Abstract: Out-of-distribution (OOD) data pose a significant challenge to deep learning (DL) classifiers, prompting extensive research into effective detection methods.
Current state-of-the-art OOD detection methods employ a scoring technique designed to assign lower scores to OOD samples compared to in-distribution (ID) ones.
Nevertheless, these approaches lack foresight into the configuration of OOD and ID data within the latent space, instead making an implicit assumption regarding their inherent separation.
As a result, most OOD detection methods result in complicated and hard-to-validate scoring techniques.
This study conducts a thorough analysis of the logit embedding landscape, revealing that both ID and OOD data exhibit a distinct trend.
Specifically, we demonstrate that OOD data tends to reside near the center of the logit space.
In contrast, ID data tends to be situated farther from the center, predominantly in the positive regions of the logit space, thus forming class-wise clusters along the orthogonal axes that span the logit space.
This study highlights the critical role of the DL classifier in differentiating between ID and OOD logits.

URL: https://openreview.net/forum?id=nmLzSnpHBY

---

Title: Gradient Inversion Attack on Graph Neural Networks

Abstract: Graph federated learning is of essential importance for training over large graph datasets while protecting data privacy, where each client stores a subset of local graph data, while the server collects the local gradients and broadcasts only the aggregated gradients. Recent studies reveal that a malicious attacker can steal private image data from gradient exchanging of neural networks during federated learning. However, none of the existing works have studied the vulnerability of graph data and graph neural networks under such attack. To answer this question, the present paper studies the problem of whether private data can be recovered from leaked gradients in both node classification and graph classification tasks and proposes a novel attack named Graph Leakage from Gradients (GLG). Two widely-used GNN frameworks are analyzed, namely GCN and GraphSAGE. The effects of different model settings on recovery are extensively discussed. Through theoretical analysis and empirical validation, it is shown that parts of the graph data can be leaked from the gradients.

URL: https://openreview.net/forum?id=a0mLrqkWyx

---

Title: Measuring the Impact of Equal Treatment as Blindness via Distributions of Explanations Disparities

Abstract: Liberal political philosophy advocates for the policy of "equal treatment as blindness", which seeks to achieve fairness by treating individuals without considering their protected characteristics directly. However, this policy has faced longstanding criticism for perpetuating existing inequalities. In machine learning, this policy can be translated into the concept of "fairness as unawareness", and be measured using disparate treatment metrics such as Demographic Parity (a.k.a. Statistical Parity). Our analysis reveals that Demographic Parity does not faithfully measure whether individuals are being treated independently of the protected attribute by the model. We introduce the Explanation Disparity metric to measure fairness under the "equal treatment as blindness" policy. Our metric evaluates the fairness of predictive models by analyzing the extent to which the protected attribute can be inferred from the distribution of explanation values, specifically using Shapley values. The proposed metric tests for statistical independence of the explanation distributions over populations with different protected characteristics. We show the theoretical properties of "Explanation Disparity" and devise an equal treatment inspector based on the AUC of a Classifier Two-Sample Test. We experiment with synthetic and natural data to demonstrate and compare the notion with related ones. We release explanationspace (https://anonymous.4open.science/r/explanationspace-B4B1/README.md), an open-source Python package with methods and tutorials.

URL: https://openreview.net/forum?id=AbrhFOT3YC

---

Title: Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching

Abstract: Energy-based models (EBMs) are a powerful class of probabilistic generative models due to their flexibility and interpretability. However, relationships between potential flows and explicit EBMs remain underexplored, while contrastive divergence training via implicit Markov chain Monte Carlo (MCMC) sampling is often unstable and expensive in high-dimensional settings. In this paper, we propose Variational Potential Flow Bayes (VPFB), a new energy-based generative framework that eliminates the need for implicit MCMC sampling and does not rely on auxiliary networks or cooperative training. VPFB learns an energy-parameterized potential flow by constructing a flow-driven density homotopy that is matched to the data distribution through a variational loss minimizing the Kullback-Leibler divergence between the flow-driven and marginal homotopies. This principled formulation enables robust and efficient generative modeling while preserving the interpretability of EBMs. Experimental results on image generation, interpolation, out-of-distribution detection, and compositional generation confirm the effectiveness of VPFB, showing that our method performs competitively with existing approaches in terms of sample quality and versatility across diverse generative modeling tasks.

URL: https://openreview.net/forum?id=vc7poEYOFK

---

Title: Dynamic Pricing in the Linear Valuation Model using Shape Constraints

Abstract: We propose a shape-constrained approach to dynamic pricing for censored data in the linear valuation model that eliminates the need for tuning parameters commonly required in existing methods. Previous works have addressed the challenge of unknown market noise distribution $F_0$ using strategies ranging from kernel methods to reinforcement learning algorithms, such as bandit techniques and upper confidence bounds (UCB), under the Lipschitz (and stronger) assumption(s) on $F_0$. In contrast, our method relies on isotonic regression under the weaker assumption that $F_0$ is $\alpha$-Hölder continuous for some $\alpha \in (0,1]$. We obtain an upper bound on the asymptotic expected regret that matches existing bounds in the literature for $\alpha = 1$ (the Lipschitz case). Simulations and experiments with real-world data obtained by Welltower Inc (a major healthcare Real Estate Investment Trust) consistently demonstrate that our method attains better empirical regret in comparison to several existing methods in the literature while offering the advantage of being completely tuning-parameter free.

URL: https://openreview.net/forum?id=uKZ0R4IQaO

---

Title: SEE-DPO: Self Entropy Enhanced Direct Preference Optimization

Abstract: Direct Preference Optimization (DPO) has been successfully used to align large language models (LLMs) according to human preferences, and more recently it has also been applied to improving the quality of text-to-image diffusion models. However, DPO-based methods such as SPO, Diffusion-DPO, and D3PO are highly susceptible to overfitting and reward hacking, especially when the generative model is optimized to fit out-of-distribution during prolonged training. To overcome these challenges and stabilize the training of diffusion models, we introduce a self-entropy regularization mechanism in reinforcement learning from human feedback. This enhancement improves DPO training by encouraging broader exploration and greater robustness. Our regularization technique effectively mitigates reward hacking, leading to improved stability and enhanced image quality across the latent space. Extensive experiments demonstrate that integrating human feedback with self-entropy regularization can significantly boost image diversity and specificity, achieving state-of-the-art results on key image generation metrics.

URL: https://openreview.net/forum?id=xQbRFHfgGL
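
As background for the abstract above, the standard DPO objective for a single preference pair can be sketched as follows. This is the generic language-model form of the loss; the diffusion-specific likelihood surrogates and the self-entropy regularizer proposed in the paper are not specified in the abstract and are not implemented here. The function and argument names are assumptions.

```python
import math


def neg_log_sigmoid(z):
    """Numerically stable evaluation of -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    logp_w, logp_l: policy log-probabilities of the preferred and rejected samples.
    ref_logp_w, ref_logp_l: the same quantities under the frozen reference model.
    beta: strength of the implicit penalty for drifting from the reference.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return neg_log_sigmoid(beta * margin)


if __name__ == "__main__":
    # Policy equal to the reference: the loss is log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy favors the preferred sample more than the reference does: lower loss.
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss keeps decreasing as the margin grows, with nothing in the objective itself to stop the policy from drifting far from the reference, which is one way to see why the overfitting and reward hacking described in the abstract can arise during prolonged training.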

---

Title: Towards Undistillable Models by Minimizing Conditional Mutual Information

Abstract: A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD). In this case, the distilled student (referred to as the knockoff student) does not outperform a student trained independently with label smoothing (LS student) in terms of prediction accuracy. To protect intellectual property of DNNs, it is desirable to build undistillable DNNs. To this end, it is first observed that an undistillable DNN may have the trait that each cluster of its output probability distributions in response to all sample instances with the same label should be highly concentrated to the extent that each cluster corresponding to each label should ideally collapse into one probability distribution. Based on this observation and by measuring the concentration of each cluster in terms of conditional mutual information (CMI), a new training method called CMI minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature scaled clusters across the entire temperature spectrum. The resulting CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods existing in the literature. That is, the knockoff students distilled by these KD methods from the CMIM model underperform the respective LS students. In addition, the CMIM model is also shown to perform better than the model trained with the CE loss alone in terms of its own prediction accuracy.

URL: https://openreview.net/forum?id=jVABSsD4Vf

---

Title: Towards Optimal LLM Selection

Abstract: Generative AI and LLMs in particular are heavily used nowadays for various document processing tasks such as question answering and document summarization. Enterprises are incurring huge costs when operating or using LLMs for their respective use cases.
In this work, we propose optimizing the usage costs of LLMs in a quality-aware manner for document summarization tasks. Specifically, we propose to exploit the variability of LLM performances across different types and formats of data to maximize the output quality while maintaining expected costs under a budget and latency within a threshold. This presents two challenges: 1) estimating the output quality of LLMs at runtime without invoking each LLM, 2) optimally allocating queries to LLMs such that the objectives are optimized and constraints are satisfied. We propose a model to predict the output quality of LLMs on text summarization, followed by an LP rounding algorithm to optimize the selection of LLMs. We study the problems both theoretically and empirically. Our methods reduce costs by $40\%$ to $90\%$ while improving quality by $4\%$ to $7\%$. In addition to the quantitative results, we further show through a user study that our model's quality estimation largely aligns with human preferences.

URL: https://openreview.net/forum?id=0tkcWwVtaK

---

Title: Instance-dependent Approximation Guarantees for Lipschitz Approximators, Application to Scientific Machine Learning

Abstract: Despite widespread adoption, Machine Learning models remain data-driven and lack exploitable theoretical guarantees on their approximation error. This limitation hinders their use for critical applications. In this paper, we show how to leverage the Lipschitz property for Lipschitz approximations, i.e., ML models that are Lipschitz continuous, to establish strict post-training, instance-dependent generalization error bounds given a set of validation points. We focus on the test case domain of ML for scientific computing called Scientific Machine Learning (SciML), where ML models are increasingly used but lack the theoretical approximation guarantees of classical scientific computing simulation schemes. We first show how to derive error bounds using Voronoï diagrams for a Lipschitz approximator trained to learn a $K$-Lipschitz function by taking advantage of the mesh-like structure of learning points. Second, we cast upper bounding as an optimization problem and use certified Deterministic Optimistic Optimization (introduced in Bachoc et al. 2021) and certified Voronoï Optimistic Optimization (that we design based on the non-certified version in Kim et al. 2020), to achieve tighter error bounds. The code is made available at https://anonymous.4open.science/r/lipschitz_bounds_doo-7FDF.

URL: https://openreview.net/forum?id=fR7DGWfX83
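
The basic idea behind such instance-dependent bounds follows from the triangle inequality: if the target f is K-Lipschitz and the model g is L-Lipschitz, then for any validation point x_i with known error e_i = |f(x_i) - g(x_i)|, the error at a query x is at most e_i + (K + L) * ||x - x_i||. The sketch below takes the minimum of this bound over all validation points. It is a simplified illustration under assumed names and a toy 1-D problem, not the Voronoï or optimistic-optimization procedures from the paper.

```python
import math


def pointwise_error_bound(x, val_points, val_errors, k_target, l_model):
    """Upper bound on |f(x) - g(x)| at a query point x.

    val_points: validation inputs x_i (tuples of floats).
    val_errors: known errors e_i = |f(x_i) - g(x_i)| at those inputs.
    k_target, l_model: Lipschitz constants of the target f and the model g,
    both with respect to the Euclidean norm.
    """
    return min(e + (k_target + l_model) * math.dist(x, xi)
               for xi, e in zip(val_points, val_errors))


if __name__ == "__main__":
    # Hypothetical toy setup: target f = sin (1-Lipschitz), model g(x) = x (1-Lipschitz).
    f = math.sin
    g = lambda t: t
    val_points = [(0.0,), (0.5,), (1.0,), (1.5,)]
    val_errors = [abs(f(p[0]) - g(p[0])) for p in val_points]
    for q in (0.25, 0.9, 1.4):
        bound = pointwise_error_bound((q,), val_points, val_errors, 1.0, 1.0)
        print(q, abs(f(q) - g(q)), bound)
```

The bound tightens as validation points become denser around the query, which is why exploiting the geometry of the validation set matters for obtaining useful certificates.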

---

Title: Reinforcement Learning from Bagged Reward

Abstract: In Reinforcement Learning (RL), it is commonly assumed that an immediate reward signal is generated for each action taken by the agent, helping the agent maximize cumulative rewards to obtain the optimal policy. However, in many real-world scenarios, designing immediate reward signals is difficult; instead, agents receive a single reward that is contingent upon a partial sequence or a complete trajectory. In this work, we define this challenging problem as RL from Bagged Reward (RLBR), where sequences of data are treated as bags with non-Markovian bagged rewards, leading to the formulation of Bagged Reward Markov Decision Processes (BRMDPs). Theoretically, we demonstrate that RLBR can be addressed by solving a standard MDP with properly redistributed bagged rewards allocated to each instance within a bag. Empirically, we find that reward redistribution becomes more challenging as the bag length increases, due to reduced informational granularity. Existing reward redistribution methods are insufficient to address these challenges. Therefore, we propose a novel reward redistribution method equipped with a bidirectional attention mechanism, enabling the accurate interpretation of contextual nuances and temporal dependencies within each bag. We experimentally demonstrate that our proposed method consistently outperforms existing approaches. The code is available at an anonymous link: https://anonymous.4open.science/r/RLBR-F66E/.

URL: https://openreview.net/forum?id=bXUipBbZDA

---

Title: Tighter sparse variational Gaussian processes

Abstract: Sparse variational Gaussian process (GP) approximations based on inducing points have become the de facto standard for scaling GPs to large datasets, owing to their theoretical elegance, computational efficiency, and ease of implementation. This paper introduces a provably tighter variational approximation by relaxing the standard assumption that the conditional approximate posterior given the inducing points must match that in the prior. The key innovation is to modify the conditional posterior to have smaller variances than that of the prior at the training points. We derive the collapsed bound for the regression case, describe how to use the proposed approximation in large data settings, and discuss its application to handle orthogonally structured inducing points and GP latent variable models. Extensive experiments on regression benchmarks, classification, and latent variable models demonstrate that the proposed approximation consistently matches or outperforms standard sparse variational GPs while maintaining the same computational cost. An implementation will be made available in all popular GP packages.

URL: https://openreview.net/forum?id=L33DSu3zvq

---

Title: MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems

Abstract: While large language models (LLMs) have shown promising capabilities as zero-shot planners for embodied agents, their inability to learn from experience and build persistent mental models limits their robustness in complex open-world environments like Minecraft. We introduce MINDSTORES, an experience-augmented planning framework that enables embodied agents to build and leverage mental models through natural interaction with their environment. Drawing inspiration from how humans construct and refine cognitive mental models, our approach extends existing zero-shot LLM planning by maintaining a database of past experiences that informs future planning iterations. The key innovation is representing accumulated experiences as natural language embeddings of (state, task, plan, outcome) tuples, which can then be efficiently retrieved and reasoned over by an LLM planner to generate insights and guide plan refinement for novel states and tasks. Through extensive experiments in MineDojo, a simulation environment that provides low-level controls for agents in Minecraft, we find that MINDSTORES learns and applies its knowledge significantly better than existing memory-based LLM planners while maintaining the flexibility and generalization benefits of zero-shot approaches, representing an important step toward more capable embodied AI systems that can learn continuously through natural experience.

URL: https://openreview.net/forum?id=A6olOEd8jX
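
The retrieval step described in the abstract, looking up past (state, task, plan, outcome) tuples by embedding similarity, can be sketched as follows. This is a minimal illustration under stated assumptions: the embeddings are hand-written toy vectors (a real system would embed the tuple's text with a language model), and the `Experience`, `cosine`, and `retrieve` names are hypothetical rather than taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Experience:
    state: str
    task: str
    plan: str
    outcome: str
    embedding: list  # toy vector standing in for a learned text embedding


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(memory, query_embedding, k=1):
    """Return the k stored experiences most similar to the query."""
    ranked = sorted(memory, key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
    return ranked[:k]


memory = [
    Experience("forest, no tools", "get wood", "punch tree", "success", [1.0, 0.0, 0.1]),
    Experience("cave, wooden pickaxe", "mine iron", "mine iron ore", "failure", [0.0, 1.0, 0.2]),
]
best = retrieve(memory, [0.1, 0.9, 0.3], k=1)[0]
print(best.task, best.outcome)
```

The retrieved tuples, including failures, would then be placed in the planner's prompt so that the next plan can account for what went wrong before.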

---

Title: Piecewise Constant Spectral Graph Neural Network

Abstract: Graph Neural Networks (GNNs) have achieved significant success across various domains by leveraging graph structures in data. Existing spectral GNNs, which use low-degree polynomial filters to capture graph spectral properties, may not fully identify the graph's spectral characteristics because of the polynomial's small degree. However, increasing the polynomial degree might be computationally infeasible. In this paper, we introduce the Piecewise Constant Spectral Graph Neural Network (PieCoN) to address these challenges. PieCoN combines constant spectral filters with polynomial filters to provide a more flexible way to leverage the graph structure. By adaptively partitioning the spectrum into intervals, our approach increases the range of spectral properties that can be effectively learned. Experiments on seven benchmark datasets, including both homophilic and heterophilic graphs, demonstrate that PieCoN is particularly effective on heterophilic datasets, highlighting its potential for a wide range of applications.

URL: https://openreview.net/forum?id=sTdVnDW0HX

---

Title: Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models

Abstract: Text-to-image (T2I) models are increasingly used in impactful real-life applications. As such, there is a growing need to audit these models to ensure that they generate desirable, task-appropriate images. However, systematically inspecting the associations between prompts and generated content in a human-understandable way remains challenging. To address this, we propose Concept2Concept, a framework where we characterize conditional distributions of vision language models using interpretable concepts and metrics that can be defined in terms of these concepts. This characterization allows us to use our framework to audit models and prompt-datasets. To demonstrate, we investigate several case studies of conditional distributions of prompts, such as user-defined distributions or empirical, real-world distributions. Lastly, we implement Concept2Concept as an open-source interactive visualization tool to facilitate use by non-technical end-users. A demo is available at https://tinyurl.com/Concept2ConceptDemo.

Warning: This paper contains discussions of harmful content, including CSAM and NSFW material, which may be disturbing to some readers.

URL: https://openreview.net/forum?id=mk1YIkVvTQ

---

Title: Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

Abstract: Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Main results on the StarCraft Multi-Agent Challenge (SMAC) and additional results on MAMujoco show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.

URL: https://openreview.net/forum?id=xT8BEgXmVc

---

Title: Mathematical Constraints of RL-Induced Reasoning: A Rebuttal to DeepSeek-R1

Abstract: DeepSeek-R1 claims that reinforcement learning (RL) induces emergent reasoning capabilities in large language models (LLMs), suggesting a fundamental shift in AI development. However, our theoretical and computational analysis challenges this assertion.
Our mathematical framework (Section 2) demonstrates that RL alone cannot induce reasoning without a strong pretraining foundation, which remains the primary driver of reasoning capabilities. Due to high computational costs, poor sample efficiency, and reward sparsity, RL struggles to develop complex reasoning from scratch. Instead, it fine-tunes and reinforces existing pretraining knowledge rather than generating novel reasoning abilities.
Furthermore, DeepSeek-R1’s observed improvements align with well-established pretraining scaling laws, not independent RL-driven emergence. A detailed analysis of DeepSeek-R1’s RL algorithm (Section 3.3) reveals that its Group Relative Policy Optimization (GRPO) approach constrains RL updates within the limits of pretraining knowledge rather than driving reasoning innovation. Additionally, its rule-based reward system optimizes response formatting but does not introduce conceptual advancements in reasoning.
Given these findings, we emphasize the need for rigorous empirical testing to isolate RL’s role from pretraining effects. Until such evidence is presented, RL should be viewed primarily as a fine-tuning mechanism rather than a fundamental source of emergent reasoning in LLMs.

URL: https://openreview.net/forum?id=4bNez06yJf

---

Title: System-2 Mathematical Reasoning via Enriched Instruction Tuning

Abstract: Solving complex mathematical problems via system-2 reasoning is a natural human skill, yet it remains a significant challenge for current large language models (LLMs). We identify the scarcity of deliberate multi-step reasoning data as a primary limiting factor. To this end, we introduce Enriched Instruction Tuning (EIT), a method that enriches existing human-annotated mathematical datasets by synergizing human and AI feedback to create fine-grained reasoning trajectories. These datasets are then used to fine-tune open-source LLMs, enhancing their mathematical reasoning abilities without reliance on any symbolic verification program. Concretely, EIT is composed of two critical steps: Enriching with Reasoning Plan (ERP) and Enriching with Reasoning Step (ERS). The former generates a high-level plan that breaks down complex instructions into a sequence of simpler objectives, while ERS fills in reasoning contexts often overlooked by human annotators, creating a smoother reasoning trajectory for LLM fine-tuning. Unlike existing CoT prompting methods that generate reasoning chains depending only on the LLM's internal knowledge, our method leverages human-annotated initial answers as "meta-knowledge" to help LLMs generate more detailed and precise reasoning processes, leading to a more trustworthy LLM expert for complex mathematical problems. In experiments, EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods, and even matching the performance of tool-augmented methods.

URL: https://openreview.net/forum?id=Cl9Uox031k

---

Title: Studying Exploration in RL: An Optimal Transport Analysis of Occupancy Measure Trajectories

Abstract: The rising successes of RL are propelled by combining smart algorithmic strategies and deep architectures to optimize the distribution of returns and visitations over the state-action space. A quantitative framework to compare the learning processes of these eclectic RL algorithms is currently absent but desired in practice. We address this gap by representing the learning process of an RL algorithm as a sequence of policies generated during training, and then studying the policy trajectory induced in the manifold of state-action occupancy measures. Using an optimal transport-based metric, we measure the length of the paths induced by the policy sequence yielded by an RL algorithm between an initial policy and a final optimal policy. Hence, we first define the Effort of Sequential Learning (ESL). ESL quantifies the relative distance that an RL algorithm travels compared to the shortest path from the initial to the optimal policy. Further, we connect the dynamics of policies in the occupancy measure space and regret (another metric to understand the suboptimality of an RL algorithm), by defining the Optimal Movement Ratio (OMR). OMR assesses the fraction of movements in the occupancy measure space that effectively reduce an analogue of regret. Finally, we derive approximation guarantees to estimate ESL and OMR with a finite number of samples and without access to an optimal policy. Through empirical analyses across various environments and algorithms, we demonstrate that ESL and OMR provide insights into the exploration processes of RL algorithms and the hardness of different tasks in discrete and continuous MDPs.

URL: https://openreview.net/forum?id=pdC092Nn8N
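
The idea behind ESL, comparing the length of the path a learner travels with the direct distance from start to finish, can be illustrated on a toy problem. The sketch below assumes occupancy measures are plain distributions over three states arranged on a line with unit spacing, for which the Wasserstein-1 distance equals the summed absolute difference of the cumulative distributions. The paper's exact definition of ESL, its ground metric, and its state-action setting may differ; the `w1_on_chain` and `effort_ratio` names are hypothetical.

```python
from itertools import accumulate


def w1_on_chain(p, q):
    """Wasserstein-1 distance between distributions on states 0..n-1 with unit spacing."""
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def effort_ratio(occupancies):
    """Path length through the occupancy measures divided by the direct distance."""
    path = sum(w1_on_chain(a, b) for a, b in zip(occupancies, occupancies[1:]))
    direct = w1_on_chain(occupancies[0], occupancies[-1])
    if direct == 0:
        raise ValueError("initial and final occupancy measures coincide")
    return path / direct


# One learner moves mass straight toward the last state; the other overshoots and returns.
straight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
detour = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(effort_ratio(straight), effort_ratio(detour))
```

A ratio of 1 means the learner followed a shortest path under this metric; larger values indicate extra movement in occupancy-measure space.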

---

Title: FoldDiff: Folding in Point Cloud Diffusion

Abstract: Diffusion denoising has emerged as a powerful approach for modeling data distributions, treating data as particles with their position and velocity modeled by a stochastic diffusion process. While this framework assumes data resides in a fixed vector space (e.g., images as pixel-ordered vectors), point clouds present unique challenges due to their unordered representation. Existing point cloud diffusion methods often rely on voxelization to address this issue, but this approach is computationally expensive, with cubically scaling complexity. In this work, we investigate the misalignment between point cloud irregularity and diffusion models, analyzing it through the lens of denoising implicit priors. First, we demonstrate how the unknown permutations inherent in point cloud structures disrupt denoising implicit priors. To address this, we then propose a novel folding-based approach that reorders point clouds into a permutation-invariant grid, enabling diffusion to be performed directly on the structured representation. This construction is exploited both globally and locally. Globally, it can be used to represent point clouds in a fixed vector space (like images), which enables us to extend the work on denoising as implicit priors to point clouds. Locally, it allows us to create efficient and novel token representations that can improve existing transformer-based point cloud diffusion models. Our experiments show that the proposed folding operation integrates effectively with both denoising implicit priors and advanced diffusion architectures, such as UNet and Diffusion Transformers (DiTs). Notably, DiT with folded tokens achieves competitive generative performance compared to state-of-the-art models while significantly reducing training and inference costs relative to voxelization-based methods. Code is available at http://anonymous.4open.science/r/FoldDiff-3B36/

URL: https://openreview.net/forum?id=pmRabMH1JW

---

Title: Generative Models for Long Time Series: Approximately Equivariant Recurrent Network Structures for an Adjusted Training Scheme

Abstract: We apply a novel training scheme to a specific implementation of a Variational Autoencoder (VAE), which, in combination, we refer to as the Recurrent Variational Autoencoder Subsequent Train (RVAE-ST). This method progressively increases the sequence length during training, leveraging the sequence-length independent parameterization of the model to address the challenge recurrent layers face when handling long sequences, particularly for datasets exhibiting approximate stationarity. Our experiments demonstrate that this approach significantly improves the model’s performance, especially for datasets with periodic behavior. Compared to other recurrent and convolutional-based generative models, our method excels in generating synthetic data for long sequences of l = 1000, with notable improvements in both sample quality and the distribution of the generated datasets. We evaluate the effectiveness of our approach using multiple metrics, including the discriminative score, evidence lower bound (ELBO), and visualizations of embeddings generated by t-SNE and PCA.

URL: https://openreview.net/forum?id=HQ9C9xcrWZ

---

Title: CoDe: Blockwise Control for Denoising Diffusion Models

Abstract: Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving competitive performance against state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/code_blockwise.

URL: https://openreview.net/forum?id=DqPCWMiMU0

---

Title: Learning in complex action spaces without policy gradients

Abstract: Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small and finite action spaces (O'Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to serve similar functionality. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces with a controllable computational cost that is comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE exhibits strong performance on the DeepMind Control Suite, even when compared to the state-of-the-art methods such as DMPO and D4PG.

URL: https://openreview.net/forum?id=nOL9M6D4oM

---

Title: Foundation Models Meet Federated Learning: A One-shot Feature-sharing Method with Privacy and Performance Guarantees

Abstract: Adapting foundation models for downstream tasks via Federated Learning (FL) is a promising strategy for protecting privacy while leveraging the capability of foundation models. However, FL's iterative training and model transmission result in high communication costs and GPU memory demands, making large foundation models impractical for FL. This paper introduces a one-shot FL method with a server-side performance bound to enable foundation models by reducing communication costs and GPU memory requirements. Our approach, FedPFT (FL with Parametric Feature Transfer), involves clients learning and transferring parametric models for features extracted from frozen foundation models in a single round. Parametric models are then used to generate synthetic features at the server to train a classifier head. We evaluate FedPFT across eight vision datasets using three vision foundation models. Our findings demonstrate that FedPFT is agnostic to data heterogeneity and network topology and it enhances the communication-accuracy frontier up to 7.8%. Finally, we show FedPFT's compatibility with differential privacy and its resilience against reconstruction attacks. Our work highlights the capability of private, feature-sharing methods for one-shot knowledge transfer using foundation models.

URL: https://openreview.net/forum?id=55593xywWG
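
The one-round flow described in the abstract (a client fits a parametric model to its features, sends only the parameters, and the server samples synthetic features from them) can be sketched with a deliberately simple model. The diagonal Gaussian used here is an assumption chosen for brevity; the abstract only says "parametric models", and the function names and toy data are hypothetical.

```python
import random
import statistics


def fit_diagonal_gaussian(features):
    """Client side: summarize feature vectors by per-dimension mean and standard deviation."""
    dims = list(zip(*features))
    return [(statistics.fmean(d), statistics.pstdev(d)) for d in dims]


def sample_features(params, n, rng):
    """Server side: draw synthetic feature vectors from the received parameters."""
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]


# Hypothetical client data: 2-D features for one class, as if produced by a frozen encoder.
client_features = [[1.0, -2.0], [1.2, -1.8], [0.8, -2.2], [1.0, -2.0]]
params = fit_diagonal_gaussian(client_features)  # the only thing the client transmits
synthetic = sample_features(params, 2000, random.Random(0))
print(params)
```

The server would train a classifier head on `synthetic` (one such set per class and client); the raw client features never leave the client in this sketch.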

---

Title: Entropy-Regularized Process Reward Model

Abstract: Large language models (LLMs) have shown promise in performing complex multi-step reasoning, yet they continue to struggle with mathematical reasoning, often making systematic errors. A promising solution is reinforcement learning (RL) guided by reward models, particularly those focusing on process rewards, which score each intermediate step rather than solely evaluating the final outcome. This approach is more effective at guiding policy models towards correct reasoning trajectories. In this work, we propose an entropy-regularized process reward model (ER-PRM) that integrates KL-regularized Markov Decision Processes (MDP) to balance policy optimization with the need to prevent the policy from shifting too far from its initial distribution. We derive a novel reward construction method based on the theoretical results. Our theoretical analysis shows that we could derive the optimal reward model from the initial policy sampling. Our empirical experiments on the MATH and GSM8K benchmarks demonstrate that ER-PRM consistently outperforms existing process reward models, achieving 1% improvement on GSM8K and 2-3% improvement on MATH under best-of-N evaluation, and more than 1% improvement under RLHF. These results highlight the efficacy of entropy-regularization in enhancing LLMs' reasoning capabilities.

URL: https://openreview.net/forum?id=cSxDH7N3x9

---

Title: Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark

Abstract: Large Language Models (LLMs) have become foundational in the realm of natural language processing, demonstrating performance improvements as model sizes increase. The Mixture-of-Experts (MoE) approach offers a promising way to scale LLMs more efficiently by using fewer computational FLOPs through sparse activation. However, it suffers from significant memory overheads, necessitating model compression techniques. Post-training quantization, a popular method for model compression, proves less effective when directly applied to MoE models due to MoE's overlooked inherent sparsity. This paper explores several MoE structure-aware quantization heuristics, ranging from coarse to fine granularity, from MoE block to individual linear weight. Our investigations reveal critical principles: different MoE structures (i.e., blocks, experts, linear layers) require varying numbers of weight bits for effective and efficient quantization. Conclusions are supported by extensive benchmarking across two representative MoE models and six tasks. We further introduce novel enhancements to more accurately identify the most critical weights in MoE quantization that necessitate higher bit allocations, including the linear weight outlier scorer and MoE block scorer. Additionally, subsequent experiments validate our findings in the context of both weight and activation quantization. Our code for reproducing all our experiments is provided as supplemental material.

URL: https://openreview.net/forum?id=VVty3mELRN
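
As background for why the number of weight bits matters, the sketch below applies generic symmetric round-to-nearest quantization to a small weight vector at several bit widths and reports the reconstruction error. It is a textbook-style illustration, not one of the structure-aware heuristics benchmarked in the paper; the function names and the example weights are hypothetical.

```python
def quantize_symmetric(weights, bits):
    """Round weights to a signed uniform grid with 2**(bits - 1) - 1 positive levels."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a signed grid")
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels  # one grid step; the largest weight maps to the top level
    return [round(w / scale) * scale for w in weights]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [0.03, -0.41, 0.27, 0.9, -0.65, 0.12]
errors = {}
for bits in (2, 4, 8):
    errors[bits] = mean_squared_error(weights, quantize_symmetric(weights, bits))
    print(bits, errors[bits])
```

Allocating more bits to some blocks, experts, or layers than to others amounts to choosing `bits` per weight group rather than once for the whole model.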

---

Title: Optimal Compressed Sensing for Image Reconstruction with Diffusion Probabilistic Models

Abstract: We examine the problem of selecting a small set of linear measurements for reconstructing high-dimensional signals. Well-established methods for optimizing such measurements include principal component analysis (PCA), independent component analysis (ICA) and compressed sensing (CS) based on random projections, all of which rely on axis- or subspace-aligned statistical characterization of the signal source. However, many naturally occurring signals, including photographic images, contain richer statistical structure. To exploit such structure, we introduce a general method for obtaining an optimized set of linear measurements for efficient image reconstruction, where the signal statistics are expressed by the prior implicit in a neural network trained to perform denoising (generally known as a "diffusion model"). We demonstrate that the optimal measurements derived for two natural image datasets differ from those of PCA, ICA, or CS, and result in substantially lower mean squared reconstruction error. Interestingly, the marginal distributions of the measurement values are asymmetrical (skewed), substantially more so than those of previous methods. We also find that optimizing with respect to perceptual loss, as quantified by structural similarity (SSIM), leads to measurements different from those obtained when optimizing for MSE. Our results highlight the importance of incorporating the specific statistical regularities of natural signals when designing effective linear measurements.

URL: https://openreview.net/forum?id=lmHh4FmPWZ

---

Title: Kick Bad Guys Out! Conditionally Activated Anomaly Detection in Federated Learning with Zero-Knowledge Proof Verification

Abstract: Federated Learning (FL) systems are susceptible to adversarial attacks, where malicious clients submit poisoned models to disrupt the convergence or plant backdoors that cause the global model to misclassify some samples. Current defense methods are often impractical for real-world FL systems, as they either rely on unrealistic prior knowledge or cause accuracy loss even in the absence of attacks. Further, these methods lack a protocol for verifying execution, leaving participants uncertain about the correct execution of the mechanism. To address these challenges, we propose a novel anomaly detection strategy that is designed for real-world FL systems. Our approach activates the defense only when potential attacks are detected, and enables the removal of malicious models without affecting the benign ones. Additionally, we incorporate zero-knowledge proofs to ensure the integrity of the proposed defense mechanism. Experimental results demonstrate the effectiveness of our approach in enhancing FL system security against a comprehensive set of adversarial attacks in various ML tasks.

URL: https://openreview.net/forum?id=9lafZCL8nv

---

Title: Interpretable Measurement of CNN Deep Feature Density using Copula and the Generalized Characteristic Function

Abstract: We present a novel empirical approach toward measuring the Probability Density Function (PDF) of the deep features of Convolutional Neural Networks (CNNs). Measurement of the deep feature PDF is a valuable problem for several reasons. Notably, a. Understanding the deep feature PDF yields new insight into deep representations. b. Feature density methods are important for tasks such as anomaly detection which can improve the robustness of deep learning models in the wild. Interpretable measurement of the deep feature PDF is challenging due to the Curse of Dimensionality (CoD), and the Spatial intuition Limitation. Our novel measurement technique combines copula analysis with the Method of Orthogonal Moments (MOM), in order to directly measure the Generalized Characteristic Function (GCF) of the multivariate deep feature PDF. We find that, surprisingly, the one-dimensional marginals of non-negative deep CNN features after major blocks are not well approximated by a Gaussian distribution, and that these features increasingly approximate an exponential distribution with increasing network depth. Furthermore, we observe that deep features become increasingly independent with increasing network depth within their typical ranges. However, we surprisingly also observe that many deep features exhibit strong dependence (either correlation or anti-correlation) with other extremely strong detections, even if these features are independent within typical ranges. We elaborate on these findings in our discussion, where we propose a new hypothesis that exponentially infrequent large valued features correspond to strong computer vision detections of semantic targets, which would imply that these large-valued features are not outliers but rather an important detection signal.

URL: https://openreview.net/forum?id=FHmUfXUVap

---

Title: Discovering group dynamics in coordinated time series via hierarchical recurrent switching-state models

Abstract: We seek a computationally efficient model for a collection of time series arising from multiple interacting entities (a.k.a. "agents"). Recent models of spatiotemporal patterns across individuals fail to incorporate explicit system-level collective behavior that can influence the trajectories of individual entities. To address this gap in the literature, we present a new hierarchical switching-state model that can be trained in an unsupervised fashion to simultaneously learn both system-level and individual-level dynamics. We employ a latent system-level discrete state Markov chain that provides top-down influence on latent entity-level chains which in turn govern the emission of each observed time series. Recurrent feedback from the observations to the latent chains at both entity and system levels allows recent situational context to inform how dynamics unfold at all levels in bottom-up fashion. We hypothesize that including both top-down and bottom-up influences on group dynamics will improve interpretability of the learned dynamics and reduce error when forecasting. Our hierarchical switching recurrent dynamical model can be learned via closed-form variational coordinate ascent updates to all latent chains that scale linearly in the number of entities. This is asymptotically no more costly than fitting a separate model for each entity. Analysis of both synthetic data and real basketball team movements suggests our lean parametric model can achieve competitive forecasts compared to larger neural network models that require far more computational resources. Further experiments on soldier data as well as a synthetic task with 64 cooperating entities show how our approach can yield interpretable insights about team dynamics over time.

URL: https://openreview.net/forum?id=LHchZthcOf

---

Title: Exact Recovery Guarantees for Parameterized Nonlinear System Identification Problem under Sparse Disturbances or Semi-Oblivious Attacks

Abstract: In this work, we study the problem of learning a nonlinear dynamical system by parameterizing its dynamics using basis functions. We assume that disturbances occur at each time step with an arbitrary probability $p$, which models the sparsity level of the disturbance vectors over time. These disturbances are drawn from an arbitrary, unknown probability distribution, which may depend on past disturbances, provided that it satisfies a zero-mean assumption. The primary objective of this paper is to learn the system's dynamics within a finite time and analyze the sample complexity as a function of $p$. To achieve this, we examine a LASSO-type non-smooth estimator, and establish necessary and sufficient conditions for its well-specifiedness and the uniqueness of the global solution to the underlying optimization problem. We then provide exact recovery guarantees for the estimator under two distinct conditions: boundedness and Lipschitz continuity of the basis functions. We show that finite-time exact recovery is achieved with high probability, even when $p$ approaches $1$. Unlike prior works, which primarily focus on independent and identically distributed (i.i.d.) disturbances and provide only asymptotic guarantees for system learning, this study presents the first finite-time analysis of nonlinear dynamical systems under a highly general disturbance model. Our framework allows for possible temporal correlations in the disturbances and accommodates semi-oblivious adversarial attacks, significantly broadening the scope of existing theoretical results.

URL: https://openreview.net/forum?id=c9o9UAmN3r

---

Title: An Empirical Study of Cross-Lingual Transfer Learning in Programming Languages

Abstract: Large language models have achieved state-of-the-art performance in various software engineering tasks, including error detection, clone detection, and code translation, primarily leveraging high-resource programming languages like Python and Java. However, many critical languages, such as COBOL, as well as emerging languages like Rust and Swift, remain low-resource due to limited openly available code. This scarcity hampers the training and effectiveness of LLMs for these languages, increasing software maintenance costs and stifling innovation. Addressing this gap, we investigate the potential of transfer learning to enhance LLM performance on low-resource programming languages by leveraging data from high-resource counterparts. Our extensive empirical study evaluates transferability across 11 to 41 programming languages and four key tasks: clone detection, code repair, solution domain classification, and error detection. Additionally, we develop a performance prediction model to identify optimal source languages for a given target and task, and analyze the features that influence transfer performance. Our findings demonstrate that cross-lingual transfer significantly outperforms zero-shot learning, with effectiveness varying based on both source and target languages. Languages such as Java and Go emerge as the best targets, while Kotlin and JavaScript are optimal sources. Furthermore, our model reliably predicts successful transfer sources by considering linguistic and dataset-specific features, offering practical guidance for data acquisition and model training. This work contributes to the development of LLM-driven tools for low-resource programming languages and provides insights into the characteristics that facilitate effective transfer learning across diverse language pairs.

URL: https://openreview.net/forum?id=1PRBHKgQVM

---

Title: Robustness Evaluation Using Local Substitute Networks

Abstract: The robustness of a neural network against adversarial examples is essential when a deep classifier is applied in safety-critical use cases like health care or autonomous driving. To assess the robustness, practitioners use various tools ranging from adversarial attacks to the exact computation of the distance to the decision boundary. We use the fact that the robustness of a neural network is a local property and empirically show that computing the same metrics for smaller local substitute networks yields reasonable estimates of the robustness for a lower cost. To construct the substitute network, we develop several pruning techniques that preserve the local properties of the initial network around a given anchor point. Our experiments on multiple datasets prove that this approach saves a significant amount of computation.

URL: https://openreview.net/forum?id=HQADVTuw8i

---

Title: Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

Abstract: The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and benchmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

URL: https://openreview.net/forum?id=fEwto4CuRS

---

Title: Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning

Abstract: Multi-agent reinforcement learning (MARL) is employed to develop autonomous agents that can learn to adopt cooperative or competitive strategies within complex environments. However, the linear increase in the number of agents leads to a combinatorial explosion of the action space, which always results in algorithmic instability, difficulty in convergence, or entrapment in local optima. While researchers have designed a variety of effective algorithms to compress the action space, these methods also introduce new challenges, such as the need for manually designed prior knowledge or reliance on the structure of the problem, which diminishes the applicability of these techniques. In this paper, we introduce Evolutionary action SPAce Reduction with Knowledge (eSpark), an exploration function generation framework driven by large language models (LLMs) to boost exploration and prune unnecessary actions in MARL. Using just a basic prompt that outlines the overall task and setting, eSpark is capable of generating exploration functions in a zero-shot manner, identifying and pruning redundant or irrelevant state-action pairs, and then achieving autonomous improvement from policy feedback. In reinforcement learning tasks involving inventory management and traffic light control encompassing a total of 15 scenarios, eSpark consistently outperforms the combined MARL algorithm in all scenarios, achieving an average performance gain of 34.4% and 9.9% in the two types of tasks respectively. Additionally, eSpark has proven to be capable of managing situations with a large number of agents, securing a 29.7% improvement in scalability challenges that featured over 500 agents. The code can be found in https://anonymous.4open.science/r/0CDH-0DF8/.

URL: https://openreview.net/forum?id=T49vPTkIt5

---

Title: Enhancing Molecular Conformer Generation via Fragment-Augmented Diffusion Pretraining

Abstract: Recent advances in diffusion-based methods have shown promising results for molecular conformer generation, yet their performance remains constrained by training data scarcity, particularly for structurally complex molecules. In this work, we present Fragment-Augmented Diffusion (FragDiff), a data-centric augmentation strategy that incorporates chemical fragmentation techniques into the pre-training phase of modern diffusion-based generative models. Our key innovation lies in decomposing molecules into chemically meaningful fragments that serve as building blocks for systematic data augmentation, enabling the diffusion model to learn enhanced local geometry while maintaining global molecular topology. Unlike existing approaches that focus on complex architectural modifications, FragDiff adopts a data-centric paradigm orthogonal to model design. Comprehensive benchmarks show FragDiff's superior performance, especially in data-scarce scenarios. Notably, it achieves 12.2-13.4% performance improvement on molecules 3× beyond training scale through pretraining on fragments. Overall, we establish a new paradigm integrating chemical fragmentations with diffusion models, advancing computational chemistry workflows. The code is available at https://anonymous.4open.science/r/FragDiff-BA54/.

URL: https://openreview.net/forum?id=t5WzHOniAF

---

Title: Faithful Interpretation for Graph Neural Networks

Abstract: Currently, attention mechanisms have garnered increasing attention in Graph Neural Networks (GNNs), such as Graph Attention Networks (GATs) and Graph Transformers (GTs). This is due to not only the commendable boost in performance they offer but also their capacity to provide a more lucid rationale for model behaviors, which are often viewed as inscrutable. However, Attention-based GNNs have demonstrated instability in interpretability when subjected to various sources of perturbations during both training and testing phases, including factors like additional edges or nodes. In this paper, we propose a solution to this problem by introducing a novel notion called Faithful Graph Attention-based Interpretation (FGAI). In particular, FGAI has four crucial properties in terms of stability and sensitivity to interpretation and the final output distribution. Built upon this notion, we propose an efficient methodology for obtaining FGAI, which can be viewed as an ad hoc modification to the canonical Attention-based GNNs. To validate our proposed solution, we introduce two novel metrics tailored for graph interpretation assessment. Experimental results demonstrate that FGAI exhibits superior stability and preserves the interpretability of attention under various forms of perturbations and randomness, which makes FGAI a more faithful and reliable explanation tool.

URL: https://openreview.net/forum?id=Y8EspxaksH

---

Title: Multimodal Prescriptive Deep Learning

Abstract: We introduce a multimodal deep learning framework, Prescriptive Neural Networks (PNNs), that combines ideas from optimization and machine learning, and is, to the best of our knowledge, the first prescriptive method to handle multimodal data. The PNN is a feedforward neural network trained on embeddings to output an outcome-optimizing prescription. In two real-world multimodal datasets, we demonstrate that PNNs prescribe treatments that are able to significantly improve estimated outcomes in transcatheter aortic valve replacement (TAVR) procedures by reducing estimated postoperative complication rates by 32% and in liver trauma injuries by reducing estimated mortality rates by over 40%. In four real-world, unimodal tabular datasets, we demonstrate that PNNs outperform or perform comparably to other well-known, state-of-the-art prescriptive models; importantly, on tabular datasets, we also recover interpretability through knowledge distillation, fitting interpretable Optimal Classification Tree models onto the PNN prescriptions as classification targets, which is critical for many real-world applications. Finally, we demonstrate that our multimodal PNN models achieve stability across randomized data splits comparable to other prescriptive methods and produce realistic prescriptions across the different datasets.

URL: https://openreview.net/forum?id=tPdIg0CK6B

---

Title: Explicit Personalization and Local Training: Double Communication Acceleration in Federated Learning

Abstract: Federated Learning is an evolving machine learning paradigm, in which multiple clients perform computations based on their individual private data, interspersed by communication with a remote server. A common strategy to curtail communication costs is Local Training, which consists in performing multiple local stochastic gradient descent steps between successive communication rounds. However, the conventional approach to local training overlooks the practical necessity for client-specific personalization, a technique to tailor local models to individual needs. We introduce Scafflix, a novel algorithm that efficiently integrates explicit personalization with local training. This innovative approach benefits from these two techniques, thereby achieving doubly accelerated communication, as we demonstrate both in theory and practice.

URL: https://openreview.net/forum?id=qVUEuhlaEa

---

Title: Lie Symmetry Net: Preserving Conservation Laws in Modelling Financial Market Dynamics via Differential Equations

Abstract: This paper employs a novel Lie symmetries-based framework to model the intrinsic symmetries within financial markets. Specifically, we introduce Lie symmetry net (LSN), which characterises the Lie symmetries of the differential equations (DE) estimating financial market dynamics, such as the Black-Scholes equation. To simulate these differential equations in a symmetry-aware manner, LSN incorporates a Lie symmetry risk derived from the conservation laws associated with the Lie symmetry operators of the target differential equations. This risk measures how well the Lie symmetries are realised and guides the training of LSN under the structural risk minimisation framework. Extensive numerical experiments demonstrate that LSN effectively realises the Lie symmetries and achieves an error reduction of more than one order of magnitude compared to state-of-the-art methods. The code is available at https://anonymous.4open.science/r/LSN_code-5608/README.md.

URL: https://openreview.net/forum?id=rkfop9GyxB

---

Title: Revisiting the Necessity of Graph Learning and Common Graph Benchmarks

Abstract: Graph machine learning has enjoyed a meteoric rise in popularity since the introduction of deep learning in graph contexts. This is no surprise due to the ubiquity of graph data in large scale industrial settings. Tacitly assumed in all graph learning tasks is the separation of the graph structure and node features: node features strictly encode individual data while the graph structure consists only of pairwise interactions. The driving belief is that node features are (by themselves) insufficient for these tasks, so benchmark performance accurately reflects improvements in graph learning. In our paper, we challenge this orthodoxy by showing that, surprisingly, node features are oftentimes more-than-sufficient for many common graph benchmarks, breaking this critical assumption. When comparing against a well-tuned feature-only MLP baseline on seven of the most commonly used graph learning datasets, one gains little benefit from using graph structure on five datasets. We posit that these datasets do not benefit considerably from graph learning because the features themselves already contain enough graph information to obviate or substantially reduce the need for the graph. To illustrate this point, we perform a feature study on these datasets and show how the features are responsible for closing the gap between MLP and graph-method performance. Further, in service of introducing better empirical measures of progress for graph neural networks, we present a challenging parametric family of principled synthetic datasets that necessitate graph information for nontrivial performance. Lastly, we section out a subset of real-world datasets that are not trivially solved by an MLP and hence serve as reasonable benchmarks for graph neural networks.

URL: https://openreview.net/forum?id=MFIWZPy57j

---

Title: Optimal Strategies for Federated Learning Maintaining Client Privacy

Abstract: Federated Learning (FL) emerged as a learning method to enable the server to train models over data distributed among various clients. These clients are protective about their data being leaked to the server, any other client, or an external adversary, and hence, locally train the model and share it with the server rather than sharing the data. The introduction of sophisticated inferencing attacks enabled the leakage of information about data through access to model parameters. To tackle this challenge, privacy-preserving federated learning aims to achieve differential privacy through learning algorithms like DP-SGD. However, such methods involve adding noise to the model, data, or gradients, reducing the model's performance.

This work provides a theoretical analysis of the tradeoff between model performance and communication complexity of the FL system. We formally prove that training for one local epoch per global round of training gives optimal performance while preserving the same privacy budget. We also investigate how the utility (tied to privacy) of FL models changes with the number of clients, and we observe and argue that when clients train using DP-SGD under the same privacy budget, utility improves as the number of clients increases. We validate our findings through experiments on real-world datasets. The results from this paper aim to improve the performance of privacy-preserving federated learning systems.

URL: https://openreview.net/forum?id=7YyOUavOiU

---

Title: Quasimetric Value Functions with Dense Rewards

Abstract: As a generalization of reinforcement learning (RL) to parametrizable goals, goal-conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL, Q*(s, a, g), has a quasimetric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting, a known aggravating factor to sample complexity. We show that the key property underpinning a quasimetric, viz., the triangle inequality, is preserved under a dense reward setting as well, specifically identifying the key condition necessary for the triangle inequality to hold. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we conjecture that dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. We evaluate this proposal in 12 standard benchmark environments in GCRL featuring challenging continuous control tasks. Our empirical results confirm that training a quasimetric value function in our dense reward setting indeed either improves upon, or preserves, the sample complexity of training with sparse rewards. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity.

URL: https://openreview.net/forum?id=BOq66KrngZ

---

Title: Selective Concept Bottleneck Models Without Predefined Concepts

Abstract: Concept-based models like Concept Bottleneck Models (CBMs) have garnered significant interest for improving model interpretability by first predicting human-understandable concepts before mapping them to the output classes. Early approaches required costly concept annotations. To alleviate this, recent methods utilized large language models to automatically generate class-specific concept descriptions and learned mappings from a pretrained black-box model’s raw features to these concepts using vision-language models. However, these approaches assume prior knowledge of which concepts the black-box model has learned. In this work, we discover the concepts encoded by the model through unsupervised concept discovery techniques instead. We further leverage a simple input-dependent concept selection mechanism that dynamically retains a sparse set of relevant concepts of each input, enhancing both sparsity and interpretability. Our approach not only improves downstream performance, but also needs significantly fewer concepts for accurate classification. Lastly, we show how large vision-language models can guide the editing of our models' weights to correct model errors.

URL: https://openreview.net/forum?id=PMO30TLI4l

---

Title: ViT-EBoT: Vision Transformer for Encrypted Botnet Detection in Resource-Constrained Edge Devices

Abstract: With the advent of lightweight cryptography in edge devices, attackers can hide malicious code under encrypted network communications to perform malware attacks. This makes IoT botnet attacks extremely challenging to detect by means of traditional signature-based techniques. In this paper, we propose a novel IoT botnet detection framework that uses vision transformers to detect malicious communications captured in encrypted network flow images. Our approach achieved ∼98% accuracy and around 94% reduced inference latency compared to state-of-the-art approaches. Further, we have validated the practicality of our approach by testing it on Jetson Orin Nano acting as an edge gateway and achieved reduced inference latency of 25.16 ms and area overhead of 88.13 MB.

URL: https://openreview.net/forum?id=P3vxtPoq8c

---

Title: Effect of Random Learning Rate: Theoretical Analysis of SGD Dynamics in Non-Convex Optimization via Stationary Distribution

Abstract: We consider a variant of the stochastic gradient descent (SGD) with a random learning rate and reveal its convergence properties. SGD is a widely used stochastic optimization algorithm in machine learning, especially deep learning. Numerous studies reveal the convergence properties of SGD and its simplified variants. Among these, the analysis of convergence using a stationary distribution of updated parameters provides generalizable results. However, to obtain a stationary distribution, the update direction of the parameters must not degenerate, which limits the applicable variants of SGD. In this study, we consider a novel SGD variant, Poisson SGD, which has degenerated parameter update directions and instead utilizes a random learning rate. Consequently, we demonstrate that a distribution of a parameter updated by Poisson SGD converges to a stationary distribution under weak assumptions on a loss function. Based on this, we further show that Poisson SGD finds global minima in non-convex optimization problems and also evaluate the generalization error using this method. As a proof technique, we approximate the distribution by Poisson SGD with that of the bouncy particle sampler (BPS) and derive its stationary distribution, using the theoretical advance of the piece-wise deterministic Markov process (PDMP).

URL: https://openreview.net/forum?id=RPtKkNx9ZK

---

Title: Neural Logic Networks for Interpretable Classification

Abstract: Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks on the other hand have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean networks discovery and is able to learn relevant, interpretable rules in tabular classification.

URL: https://openreview.net/forum?id=FxdpxfH02l

---

Title: Do Generative Models Learn Rare Generative Factors?

Abstract: Generative models are becoming a promising tool in AI alongside discriminative learning. Several models have been proposed to learn in an unsupervised fashion the corresponding generative factors, namely the latent variables critical for capturing the full spectrum of data variability. Diffusion Models (DMs), Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are of particular interest due to their impressive ability to generate highly realistic data. Through a systematic empirical study, this paper delves into the intricate challenge of how DMs, GANs and VAEs internalize and replicate rare generative factors. Our findings reveal a pronounced tendency towards memorization of these factors. We study the reasons for this memorization and demonstrate that strategies such as spectral decoupling can mitigate this issue to a certain extent.

URL: https://openreview.net/forum?id=EUih9FJI3y

---

Title: Learning to Guide Human Decision Makers with Vision-Language Models

Abstract: There is increasing interest in developing AIs for assisting human decision making in high-stakes tasks, such as medical diagnosis, for the purpose of improving decision quality and reducing cognitive strain. Mainstream approaches team up an expert with a machine learning model to which safer decisions are offloaded, thus letting the former focus on cases that demand their attention. This separation of responsibilities setup, however, is inadequate for high-stakes scenarios. On the one hand, the expert may end up over-relying on the machine’s decisions due to anchoring bias, thus losing the human oversight that is increasingly being required by regulatory agencies to ensure trustworthy AI. On the other hand, the expert is left entirely unassisted on the (typically hardest) decisions on which the model abstained. As a remedy, we introduce learning to guide (LTG), an alternative framework in which – rather than taking control from the human expert – the machine provides guidance useful for decision making, and the human is entirely responsible for coming up with a decision. In order to ensure guidance is interpretable and task-specific, we develop SLOG, an approach for turning any vision-language model into a capable generator of textual guidance by leveraging a modicum of human feedback. Our empirical evaluation highlights the promise of SLOG on a challenging, real-world medical diagnosis task.

URL: https://openreview.net/forum?id=JAW1C8RNth

---
