Weekly TMLR digest for Apr 06, 2025

63 views

Skip to first unread message

TMLR

unread,

Apr 6, 2025, 12:00:11 AMApr 6

to tmlr-annou...@googlegroups.com

New certifications
==================

Reproducibility Certification: Revisiting Deep Hybrid Models for Out-of-Distribution Detection

Paul-Ruben Schlumbom, Eibe Frank

https://openreview.net/forum?id=yeITEuhv4Q

---

Survey Certification: A Survey on the Honesty of Large Language Models

Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, Xinyu Zhu, Zesen Cheng, Deng Cai, Mo Yu, Lemao Liu, Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam

https://openreview.net/forum?id=FJgtVfUxLQ

---

Accepted papers
===============

Title: On Learning Representations for Tabular Data Distillation

Authors: Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne

Abstract: Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Based on an elaborate evaluation on ${{\sf \small TDBench}}$, resulting in 226,200 distilled datasets and 541,980 models trained on them, we demonstrate that $\texttt{TDColER}$ is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models. All of the code used in the experiments can be found in http://github.com/inwonakng/tdbench

URL: https://openreview.net/forum?id=GXlsrvOGIK

---

Title: Bayesian Learning-driven Prototypical Contrastive Loss for Class-Incremental Learning

Authors: Nisha L. Raichur, Lucas Heublein, Tobias Feigl, Alexander Rügamer, Christopher Mutschler, Felix Ott

Abstract: The primary objective of methods in continual learning is to learn tasks in a sequential manner over time (sometimes from a stream of data), while mitigating the detrimental phenomenon of catastrophic forgetting. This paper proposes a method to learn an effective representation between previous and newly encountered class prototypes. We propose a prototypical network with a Bayesian learning-driven contrastive loss (BLCL), tailored specifically for class-incremental learning scenarios. We introduce a contrastive loss that incorporates novel classes into the latent representation by reducing intra-class and increasing inter-class distance. Our approach dynamically adapts the balance between the cross-entropy and contrastive loss functions with a Bayesian learning technique. Experimental results conducted on the CIFAR-10, CIFAR-100, and ImageNet100 datasets for image classification and images of a GNSS-based dataset for interference classification validate the efficacy of our method, showcasing its superiority over existing state-of-the-art approaches.

URL: https://openreview.net/forum?id=dNWaTuKV9M

---

Title: Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting

Authors: Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen

Abstract: Recent advancements in zero-shot video diffusion models have shown promise for text-driven video editing, but challenges remain in achieving high temporal consistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimizing process tailored for editing dynamic monocular videos. In the first stage, Video-3DGS employs an improved version of COLMAP, referred to as MC-COLMAP, which processes original videos using a Masked and Clipped approach. For each video clip, MC-COLMAP generates the point clouds for dynamic foreground objects and complex backgrounds. These point clouds are utilized to initialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) aiming to represent foreground and background views. Both foreground and background views are then merged with a 2D learnable parameter map to reconstruct full views. In the second stage, we leverage the reconstruction ability developed in the first stage to impose the temporal constraints on the video diffusion model. This approach ensures the temporal consistency in the edited videos while maintaining high fidelity to the editing text prompt. We further propose a recursive and ensembled refinement by revisiting the denoising step and guidance scale used in video diffusion process with Video-3DGS. To demonstrate the efficacy of Video-3DGS on both stages, we conduct extensive experiments across two related tasks: Video Reconstruction and Video Editing. Video-3DGS trained with 3k iterations significantly improves video reconstruction quality (+3 PSNR, +7 PSNR increase) and training efficiency (×1.9, ×4.5 times faster) over NeRF-based and 3DGS-based state-of-art methods on DAVIS dataset, respectively. Moreover, it enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.

URL: https://openreview.net/forum?id=s1zfBJysbI

---

Title: Stabilizing the Kumaraswamy Distribution

Authors: Max Wasserman, Gonzalo Mateos

Abstract: Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the log-pdf, CDF, and inverse CDF, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models to address exploration-exploitation trade-offs in contextual multi-armed bandits and facilitate uncertainty quantification for link prediction with graph neural networks. We find these models to be most performant when paired with the stable KS. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.

URL: https://openreview.net/forum?id=baZLwdphqw

---

Title: Empirical Bayes Trend Filtering Through a Variational Inference Framework

Authors: Dongyue Xie

Abstract: This paper introduces a novel framework for Bayesian trend filtering using an empirical Bayes approach and a variational inference algorithm. Trend filtering is a nonparametric regression technique that has gained popularity for its simple formulation and local adaptability. Bayesian adaptations of trend filtering have been proposed as an alternative method, while they often rely on computationally intensive sampling-based methods for posterior inference. We propose an empirical Bayes trend filtering (EBTF) that leverages shrinkage priors, estimated through an empirical Bayes procedure by maximizing the marginal likelihood. To address the computational challenges posed by large datasets, we implement a variational inference algorithm for posterior computation, ensuring scalability and efficiency. Our framework is flexible, allowing the incorporation of various shrinkage priors, and optimizes the level of smoothness directly from the data. We also discuss alternative formulations of the EBTF model, along with their pros and cons. We demonstrate the performance of our EBTF method through comprehensive simulations and real-world data applications, highlighting its ability to maintain computational efficiency while providing accurate trend estimation.

URL: https://openreview.net/forum?id=AHTz2mTlKk

---

Title: Multi-Output Distributional Fairness via Post-Processing

Authors: Gang Li, Qihang Lin, Ayush Ghosh, Tianbao Yang

Abstract: The post-processing approaches are becoming prominent techniques to enhance machine learning models' fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model's distributional parity, a task-agnostic fairness measure. Existing methods for achieving distributional parity rely on the (inverse) cumulative density function of a model’s output, restricting their applicability to single-output models. Extending previous works, we propose to employ optimal transport mappings to move a model's outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter and a kernel regression method is proposed to extend this process to out-of-sample data. Our empirical studies evaluate the proposed approach against various baselines on multi-task/multi-class classification and representation learning tasks, demonstrating the effectiveness of the proposed approach.

URL: https://openreview.net/forum?id=MJOKrHqiV1

---

Title: Meta-Learning to Teach Semantic Prompts for Open Domain Generalization in Vision-Language Models

Authors: Shirsha Bose, Mainak Singha, Ankit Jha, Souradeep Mukhopadhyay, Biplab Banerjee

Abstract: Open Domain Generalization (ODG) addresses the challenges posed by domain and category shifts between labeled training sources and unlabeled target domains. Current state-of-the-art methods struggle with the limitations of traditional CNN backbones, leading to reduced generalization and increased error rates in detecting target open samples without prior knowledge. Additionally, recent CLIP-based prompt learning approaches fail to distinguish between known and unknown classes effectively, resulting in suboptimal performance. To address these challenges, we propose MetaPrompt, which leverages the semantic strengths of the vision-language model CLIP and the ''learning-to-learn'' capabilities of Meta-Learning to achieve robust generalization across domain and category shifts. Our framework introduces three key innovations: First, we approach ODG as a multi-class classification problem that includes both known and novel categories, designing novel prompts capable of detecting unknown class samples across multiple domains. These prompts are trained using Meta-Learning with momentum updates, enabling smooth and accurate differentiation between known and unknown classes. Second, we introduce a novel domain-agnostic semantic attention-based prompt alongside domain-focused prompts to enhance robustness in classifying unknown classes across various domains. Finally, we incorporate an unsupervised contrastive loss during episodic Meta-Training, which reinforces the boundaries in the metric space between known and unknown classes, thereby enhancing ''unknown'' class awareness in the prompts. MetaPrompt has demonstrated its superiority through extensive testing on diverse datasets, excelling in both closed and open-set DG scenarios and consistently outperforming existing solutions.

URL: https://openreview.net/forum?id=uJELgNGiMW

---

Title: Revisiting Deep Hybrid Models for Out-of-Distribution Detection

Authors: Paul-Ruben Schlumbom, Eibe Frank

Abstract: Deep hybrid models (DHMs) for out-of-distribution (OOD) detection, jointly training a deep feature extractor with a classification head and a density estimation head based on a normalising flow, provide a conceptually appealing approach to visual OOD detection. The paper that introduced this approach reported 100% AuROC in experiments on two standard benchmarks, including one based on the CIFAR-10 data. As there are no implementations available, we set out to reproduce the approach by carefully filling in gaps in the description of the algorithm. Although we were unable to attain 100% OOD detection rates, and our results indicate that such performance is impossible on the CIFAR-10 benchmark, we achieved good OOD performance. We provide a detailed analysis of when the architecture fails and argue that it introduces an adversarial relationship between the classification component and the density estimator, rendering it highly sensitive to the balance of these two components and yielding a collapsed feature space without careful fine-tuning. Our implementation of DHMs is publicly available.

URL: https://openreview.net/forum?id=yeITEuhv4Q

---

Title: An Efficient Training Algorithm for Models with Block-wise Sparsity

Authors: Ding Zhu, Zhiqun Zuo, Mohammad Mahdi Khalili

Abstract: Large-scale machine learning (ML) models are increasingly being used in critical domains like education, lending, recruitment, healthcare, criminal justice, etc. However, the training, deployment, and utilization of these models demand substantial computational resources. To decrease computation and memory costs, machine learning models with sparse weight matrices are widely used in the literature. Among sparse models, those with special sparse structures (e.g., models with block-wise sparse weight matrices) fit better with the hardware accelerators and can decrease the memory and computation costs during the inference. Unfortunately, while there are several efficient training methods, none of them are designed to train a block-wise sparse model efficiently. As a result, the current methods for training block-wise sparse models start with full and dense models leading to inefficient training. In this work, we focus on training models with \textit{block-wise sparse matrices} and propose an efficient training algorithm to decrease both computation and memory costs during training and inference.
In addition, we will show that our proposed method enables us to efficiently find the right block size for the sparsity pattern during the training process. Our extensive empirical and theoretical analyses show that our algorithms can decrease the computation and memory costs significantly without a performance drop compared to baselines.

URL: https://openreview.net/forum?id=nay3Kvw8BD

---

Title: Almost Sure Convergence of Stochastic Gradient Methods under Gradient Domination

Authors: Simon Weissmann, Sara Klein, Waïss Azizian, Leif Döring

Abstract: Stochastic gradient methods are among the most important algorithms in training machine learning problems. While classical assumptions such as strong convexity allow a simple analysis they are rarely satisfied in applications. In recent years, global and local gradient domination properties have shown to be a more realistic replacement of strong convexity. They were proved to hold in diverse settings such as (simple) policy gradient methods in reinforcement learning and training of deep neural networks with analytic activation functions. We prove almost sure convergence rates $f(X_n)-f^*\in o\big( n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for stochastic gradient descent (with and without momentum) under global and local $\beta$-gradient domination assumptions. The almost sure rates get arbitrarily close to recent rates in expectation. Finally, we demonstrate how to apply our results to the training task in both supervised and reinforcement learning.

URL: https://openreview.net/forum?id=OTwnNBxZFB

---

Title: Consistency-Guided Asynchronous Contrastive Tuning for Few-Shot Class-Incremental Tuning of Foundation Models

Authors: Shuvendu Roy, Elham Dolatabadi, Arash Afkanpour, Ali Etemad

Abstract: We propose Consistency-guided Asynchronous Contrastive Tuning (CoACT), a novel method for continuously tuning foundation models to learn new classes in few-shot settings. CoACT consists of three key components: (i) asynchronous contrastive tuning, which learns new classes by including LoRA modules in the pre-trained encoder while enforcing consistency between two asynchronous encoders; (ii) controlled fine-tuning, which facilitates effective tuning of a subset of the foundation model; and (iii) consistency-guided incremental tuning, which enforces additional regularization during later sessions to reduce forgetting of the learned classes. We evaluate our proposed solution on Few-Shot Class-Incremental Learning (FSCIL) as well as a new and more challenging setup called Few-Shot Class-Incremental Tuning (FSCIT), which facilitates the continual tuning of vision foundation models to learn new classes with only a few samples per class. Unlike traditional FSCIL, FSCIT does not require a large in-distribution base session for initial fully supervised training prior to the incremental few-shot sessions. We conduct extensive evaluations across 16 diverse datasets, demonstrating the effectiveness of CoACT in both FSCIL and FSCIT setups. CoACT outperforms existing methods by up to 5.02% in FSCIL and up to 12.51% in FSCIT for individual datasets, with an average improvement of 2.47%. Furthermore, CoACT exhibits reduced forgetting and enhanced robustness in low-shot experiments. Detailed ablation and sensitivity studies highlight the contribution of each component of CoACT. We make our code publicly available at https://github.com/ShuvenduRoy/CoACT-FSCIL.

URL: https://openreview.net/forum?id=WfAvMdwiE8

---

Title: LitLLMs, LLMs for Literature Review: Are we there yet?

Authors: Shubham Agarwal, Gaurav Sahu, Abhay Puri, Issam H. Laradji, Krishnamurthy Dj Dvijotham, Jason Stanley, Laurent Charlin, Christopher Pal

Abstract: Literature reviews are an essential component of scientific research, but they remain time-intensive and challenging to write, especially due to the recent influx of research papers. This paper explores the zero-shot abilities of recent Large Language Models (LLMs) in assisting with the writing of literature reviews based on an abstract. We decompose the task into two components: (1) Retrieving related works given a query abstract and (2) Writing a literature review based on the retrieved results. We analyze how effective LLMs are for both components. For retrieval, we introduce a novel two-step search strategy that first uses an LLM to extract meaningful keywords from the abstract of a paper and then retrieves potentially relevant papers by querying an external knowledge base. Additionally, we study a prompting-based re-ranking mechanism with attribution and show that re-ranking doubles the normalized recall compared to naive search methods while providing insights into the LLM’s decision-making process. In the generation phase, we propose a two-step approach that first outlines a plan for the review and then executes steps in the plan to generate the actual review. To evaluate different LLM-based literature review methods, we create test sets from arXiv papers using a protocol designed for rolling use with newly released LLMs to avoid test set contamination in zero-shot evaluations. We release this evaluation protocol to promote additional research and development in this regard. Our empirical results suggest that LLMs show promising potential for writing literature reviews when the task is decomposed into smaller components of retrieval and planning. Particularly, we find that combining keyword-based and document-embedding-based search improves precision and recall during retrieval by 10% and 30%, respectively, compared to using either of the methods in isolation. Further, we demonstrate that our planning-based approach achieves higher-quality reviews by minimizing hallucinated references in the generated review by 18-26% compared to existing simpler LLM-based generation methods. Our project page including a demonstration system and toolkit can be accessed here: https://litllm.github.io.

URL: https://openreview.net/forum?id=heeJqQXKg7

---

Title: DeformTime: capturing variable dependencies with deformable attention for time series forecasting

Authors: Yuxuan Shu, Vasileios Lampos

Abstract: In multivariable time series (MTS) forecasting, existing state-of-the-art deep learning approaches tend to focus on autoregressive formulations and often overlook the potential of using exogenous variables in enhancing the prediction of the target endogenous variable. To address this limitation, we present DeformTime, a neural network architecture that attempts to capture correlated temporal patterns from the input space, and hence, improve forecasting accuracy. It deploys two core operations performed by deformable attention blocks (DABs): learning dependencies across variables from different time steps (variable DAB), and preserving temporal dependencies in data from previous time steps (temporal DAB). Input data transformation is explicitly designed to enhance learning from the deformed series of information while passing through a DAB. We conduct extensive experiments on 6 MTS data sets, using previously established benchmarks as well as challenging infectious disease modelling tasks with more exogenous variables. The results demonstrate that DeformTime improves accuracy against previous competitive methods across the vast majority of MTS forecasting tasks, reducing the mean absolute error by 7.2% on average. Notably, performance gains remain consistent across longer forecasting horizons.

URL: https://openreview.net/forum?id=M62P7iOT7d

---

Title: Latent Covariate Shift: Unlocking Partial Identifiability for Multi-Source Domain Adaptation

Authors: Yuhang Liu, Zhen Zhang, Dong Gong, Mingming Gong, Biwei Huang, Anton van den Hengel, Kun Zhang, Javen Qinfeng Shi

Abstract: Multi-source domain adaptation (MSDA) addresses the challenge of learning a label prediction function for an unlabeled target domain by leveraging both the labeled data from multiple source domains and the unlabeled data from the target domain. Conventional MSDA approaches often rely on covariate shift or conditional shift paradigms, which assume a consistent label distribution across domains. However, this assumption proves limiting in practical scenarios where label distributions do vary across domains, diminishing its applicability in real-world settings. For example, animals from different regions exhibit diverse characteristics due to varying diets and genetics.

Motivated by this, we propose a novel paradigm called latent covariate shift (LCS), which introduces significantly greater variability and adaptability across domains. Notably, it provides a theoretical assurance for recovering the latent cause of the label variable, which we refer to as the latent content variable. Within this new paradigm, we present an intricate causal generative model by introducing latent noises across domains, along with a latent content variable and a latent style variable to achieve more nuanced rendering of observational data. We demonstrate that the latent content variable can be identified up to block identifiability due to its versatile yet distinct causal structure. We anchor our theoretical insights into a novel MSDA method, which learns the label distribution conditioned on the identifiable latent content variable, thereby accommodating more substantial distribution shifts. The proposed approach showcases exceptional performance and efficacy on both simulated and real-world datasets.

URL: https://openreview.net/forum?id=9kFlOyLwyf

---

Title: $f$-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

Authors: Kefan Su, Zongqing Lu

Abstract: Independent learning is a straightforward solution for fully decentralized learning in cooperative multi-agent reinforcement learning (MARL). The study of independent learning has a history of decades, and the representatives, such as independent Q-learning and independent PPO, can achieve good performances on several benchmarks. However, most independent learning algorithms lack convergence guarantees or theoretical support. In this paper, we propose a general formulation of independent policy optimization, $f$-divergence policy optimization. We hope that a more general policy optimization formulation will provide deeper insights into fully decentralized learning. We demonstrate the generality of this formulation and analyze its limitations. Based on this formulation, we further propose a novel independent learning algorithm, TVPO, which theoretically guarantees convergence. Empirically, we demonstrate that TVPO outperforms state-of-the-art fully decentralized learning methods on three popular cooperative MARL benchmarks, thereby verifying the efficacy of TVPO.

URL: https://openreview.net/forum?id=Wj8yFjIpom

---

Title: Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Authors: William Chen, Oier Mees, Aviral Kumar, Sergey Levine

Abstract: Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning capabilities, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually-complex, long horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings. Finally, we show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.

URL: https://openreview.net/forum?id=vQDKYYuqWA

---

Title: A Survey on the Honesty of Large Language Models

Authors: Siheng Li, Cheng Yang, Taiqiang Wu, Chufan Shi, Yuji Zhang, Xinyu Zhu, Zesen Cheng, Deng Cai, Mo Yu, Lemao Liu, Jie Zhou, Yujiu Yang, Ngai Wong, Xixin Wu, Wai Lam

Abstract: Honesty is a fundamental principle for aligning large language models (LLMs) with human values, requiring these models to recognize what they know and don't know and be able to faithfully express their knowledge. Despite promising, current LLMs still exhibit significant dishonest behaviors, such as confidently presenting wrong answers or failing to express what they know. In addition, research on the honesty of LLMs also faces challenges, including varying definitions of honesty, difficulties in distinguishing between known and unknown knowledge, and a lack of comprehensive understanding of related research. To address these issues, we provide a survey on the honesty of LLMs, covering its clarification, evaluation approaches, and strategies for improvement. Moreover, we offer insights for future research, aiming to inspire further exploration in this important area.

URL: https://openreview.net/forum?id=FJgtVfUxLQ

---

Title: Implicit Bias and Fast Convergence Rates for Self-attention

Authors: Bhavya Vasudeva, Puneesh Deora, Christos Thrampoulidis

Abstract: We study the fundamental optimization principles of self-attention, the defining mechanism of transformers, by analyzing the implicit bias of gradient-based optimizers in training a self-attention layer with a linear decoder in binary classification. Building on prior studies in linear logistic regression, recent findings demonstrate that the key-query matrix $W_t$ from gradient-descent (GD) converges in direction towards $W_{mm}$, which maximizes the margin between optimal and non-optimal tokens across sequences. However, this convergence is local, dependent on initial conditions, only holds asymptotically as the number of iterations increases, and leaves questions about the potential benefits of adaptive step-size rules unaddressed. To bridge this gap, we first establish scenarios for which convergence is provably global. We then analyze two adaptive step-size strategies: normalized GD and Polyak step-size, demonstrating finite-time convergence rates for $W_t$ to $W_{mm}$, and quantifying the sparsification rate of the attention map. These findings not only show that these strategies can accelerate parameter convergence over standard GD in a non-convex setting but also deepen the understanding of the implicit bias in self-attention, linking it more closely to the phenomena observed in linear logistic regression despite its intricate non-convex nature.

URL: https://openreview.net/forum?id=pKilnjQsb0

---

Title: Decomposing The Dark Matter of Sparse Autoencoders

Authors: Joshua Engels, Logan Riggs Smith, Max Tegmark

Abstract: Sparse autoencoders (SAEs) are a promising technique for decomposing language model activations into interpretable linear features. However, current SAEs fall short of completely explaining model performance, resulting in ``dark matter'': unexplained variance in activations. This work investigates dark matter as an object of study in its own right. Surprisingly, we find that much of SAE dark matter---about half of the error vector itself and $>90\% $ of its norm---can be linearly predicted from the initial activation vector. Additionally, we find that the scaling behavior of SAE error norms at a per token level is remarkably predictable: larger SAEs mostly struggle to reconstruct the same contexts as smaller SAEs. We build on the linear representation hypothesis to propose models of activations that might lead to these observations, including postulating a new type of ``introduced error''; these insights imply that the part of the SAE error vector that cannot be linearly predicted (``nonlinear'' error) might be fundamentally different from the linearly predictable component. To validate this hypothesis, we empirically analyze nonlinear SAE error and show that 1) it contains fewer not yet learned features, 2) SAEs trained on it are quantitatively worse, 3) it helps predict SAE per-token scaling behavior, and 4) it is responsible for a proportional amount of the downstream increase in cross entropy loss when SAE activations are inserted into the model. Finally, we examine two methods to reduce nonlinear SAE error: inference time gradient pursuit, which leads to a very slight decrease in nonlinear error, and linear transformations from earlier layer SAE outputs, which leads to a larger reduction.

URL: https://openreview.net/forum?id=sXq3Wb3vef

---

Title: Predicting sub-population specific viral evolution

Authors: Wenxian Shi, Menghua Wu, Regina Barzilay

Abstract: Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.

URL: https://openreview.net/forum?id=Mae23iEqPS

---

New submissions
===============

Title: A Comprehensive Survey of Contamination Detection Methods in Large Language Models

Abstract: With the rise of Large Language Models (LLMs) in recent years, abundant new opportunities are emerging, but also new challenges, among which contamination is quickly becoming critical. Business applications and fundraising in Artificial Intelligence (AI) have reached a scale at which a few percentage points gained on popular question-answering benchmarks could translate into dozens of millions of dollars, placing high pressure on model integrity. At the same time, it is becoming harder and harder to keep track of the data that LLMs have seen; if not impossible with closed-source models like GPT-4 and Claude-3 not divulging any information on the training set. As a result, contamination becomes a major issue: LLMs’ performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data. This limitation jeopardizes real capability improvement in the field of NLP, yet, there remains a lack of methods on how to efficiently detect contamination. In this paper, we survey all recent work on contamination detection with LLMs, analyzing their methodologies and use cases to shed light on the appropriate usage of contamination detection methods. Our work calls the NLP research community’s attention into systematically taking into account contamination bias in LLM evaluation.

URL: https://openreview.net/forum?id=SxNMjbtdFm

---

Title: Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to lengthy and redundant outputs, known as the ''overthinking phenomenon''.
Efficient Reasoning, which seeks to optimize reasoning length while preserving reasoning capabilities, offers practical benefits such as faster processing times, lower energy consumption, and improved responsiveness, especially valuable for reasoning-intensive applications. Despite its potential, efficient reasoning remains in the early stages of research.
In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking.

URL: https://openreview.net/forum?id=HvoG8SxggZ

---

Title: Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting

Abstract: Flow matching has recently emerged as a powerful paradigm for generative modeling and has been extended to probabilistic time series forecasting. However, the impact of the specific choice of probability path model on forecasting performance, particularly for high-dimensional spatio-temporal dynamics, remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting.

URL: https://openreview.net/forum?id=JApMDLwbLR

---

Title: Are Convex Optimization Curves Convex?

Abstract: In this paper, we study when we might expect the optimization curve induced by gradient descent to be \emph{convex} -- precluding, for example, an initial plateau followed by a sharp decrease, making it difficult to decide when optimization should stop. Although such undesirable behavior can certainly occur when optimizing general functions, might it also occur in the benign and well-studied case of smooth convex functions? As far as we know, this question has not been tackled in previous work. We show, perhaps surprisingly, that the answer crucially depends on the choice of the step size. In particular, for the range of step sizes which are known to result in monotonic convergence to an optimal value, we characterize a regime where the optimization curve will be provably convex, and a regime where the curve can be non-convex. We also extend our results to gradient flow, and to the closely-related but different question of whether the gradient norm decreases monotonically.

URL: https://openreview.net/forum?id=TZtpxselK2

---

Title: Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Abstract: We present and analyze a novel regularized form of the gradient clipping algorithm, proving that it converges to global minima of the loss surface of deep neural networks under the squared loss, provided that the layers are of sufficient width. The algorithm presented here, dubbed $\delta-$GClip, introduces a modification to gradient clipping that leads to a first-of-its-kind example of a step size scheduling for gradient descent that provably minimizes training losses of deep neural nets. We also present empirical evidence that our theoretically founded $\delta-$GClip algorithm is competitive with the state-of-the-art deep learning heuristics on various neural architectures including modern transformer based architectures. The modification we do to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality which was recently proven to be true for sufficiently wide neural networks at any depth within a neighbourhood of the initialization.

URL: https://openreview.net/forum?id=ABT1XQLbOx

---

Title: Adversarial Subspace Generation for Outlier Detection in High-Dimensional Data

Abstract: Outlier detection in high-dimensional tabular data is challenging since data is often distributed across multiple lower-dimensional subspaces—a phenomenon known as the Multiple Views effect (MV). This effect led to a large body of research focused on mining such subspaces, known as *subspace selection*. However, as the precise nature of the MV effect was not well understood, traditional methods had to rely on heuristic-driven search schemes that struggle to accurately capture the true structure of the data. Properly identifying these subspaces is critical for unsupervised tasks such as outlier detection or clustering, where misrepresenting the underlying data structure can hinder the performance. We introduce Myopic Subspace Theory (MST), a new theoretical framework that mathematically formulates the Multiple Views effect and writes subspace selection as a stochastic optimization problem. Based on MST, we introduce V-GAN, a generative method trained to solve such an optimization problem. This approach avoids any exhaustive search over the feature space while ensuring that the intrinsic data structure is preserved. Experiments on 42 real-world datasets show that using V-GAN subspaces to build ensemble methods leads to a significant increase in one-class classification performance—compared to existing subspace selection, feature selection, and embedding methods. Further experiments on synthetic data show that V-GAN identifies subspaces more accurately while scaling better than other relevant subspace selection methods. These results confirm the theoretical guarantees of our approach and also highlight its practical viability in high-dimensional settings.

URL: https://openreview.net/forum?id=k7QsjiRE17

---

Title: Learning Actionable Counterfactual Explanations in Large State Spaces

Abstract: Recourse generators provide actionable insights, often through feature-based counterfactual explanations (CFEs), to help negatively classified individuals understand how to adjust their input features to achieve a positive classification. These feature-based CFEs, which we refer to as \emph{low-level} CFEs, are overly specific (e.g., coding experience: $4 \to 5+$ years) and often recommended in feature space that doesn't straightforwardly align with real-world actions. To bridge this gap, we introduce three novel recourse types grounded in real-world actions: high-level continuous (\emph{hl-continuous}), high-level discrete (\emph{hl-discrete}), and high-level ID (\emph{hl-id}) CFEs.

We formulate single-agent CFE generation methods, where we model the hl-discrete CFE as a solution to a weighted set cover problem and the hl-continuous CFE as a solution to an integer linear program.
Since these methods require costly optimization per agent, we propose data-driven CFE-generation approaches that, given instances of agents and their optimal CFEs, learn a CFE generator that quickly provides optimal CFEs for new agents.
This approach, also viewed as one of learning an optimal policy in a family of large but deterministic MDPs, considers several problem formulations, including formulations in which the actions and their effects are unknown, and therefore addresses informational and computational challenges.

Through extensive empirical evaluation using publicly available healthcare datasets (BRFSS, Foods, and NHANES), we compare the proposed forms of recourse to low-level CFEs and assess the effectiveness of our data-driven approaches. Empirical results show that the proposed data-driven CFE generators are accurate and resource-efficient, and the proposed forms of recourse have various advantages over the low-level CFEs.

URL: https://openreview.net/forum?id=tXnVRpRlR8

---

Title: Bayesian Kernel Regression for Functional Data

Abstract: In supervised learning, the output variable to be predicted is often represented as a function, such as a spectrum or probability distribution. Despite its importance, functional output regression remains relatively unexplored. In this study, we propose a novel functional output regression model based on kernel methods. Unlike conventional approaches that independently train regressors with scalar outputs for each measurement point of the output function, our method leverages the covariance structure within the function values, akin to multitask learning, leading to enhanced learning efficiency and improved prediction accuracy. Compared with existing nonlinear function-on-scalar models in statistical functional data analysis, our model effectively handles high-dimensional nonlinearity while maintaining a simple model structure. Furthermore, the fully kernel-based formulation allows the model to be expressed within the framework of reproducing kernel Hilbert space, providing an analytic form for parameter estimation and a solid foundation for further theoretical analysis. The proposed model delivers a functional output predictive distribution derived analytically from a Bayesian perspective, enabling the quantification of uncertainty in the predicted function. We demonstrate the model’s enhanced prediction performance through experiments on artificial datasets and density of states prediction tasks in materials science.

URL: https://openreview.net/forum?id=g7eXsIUjI3

---

Title: Dense Attentive Probing: Dense output with less than 100K learnable parameters.

Abstract: The paradigm of pretraining a backbone on a large set of (often unlabeled) images has gained popularity. The quality of the resulting features is commonly measured by freezing the backbone and training different task heads on top of it. However, current evaluations cover only classifications of whole images or require complex dense task heads which introduce a large number of parameters and add their own inductive biases. In this work, we propose dense attentive probing, a parameter-efficient readout to make dense prediction using arbitrary backbones independent of the size and resolution of their feature volume. To this end, we utilize a masked cross-attention layer with learnable mask sizes which enables dense prediction with a small parameter budget, thus providing relatively unbiased access to the features. We employ this method to evaluate common backbones in three dimensions: instance awareness, local semantics and spatial understanding. We find that DINOv2 outperforms all other backbones tested -- including those supervised with masks and language -- across all three task categories. Furthermore, our analysis suggests that self-supervised training tends to yield features that separate object instances better than vision-language models. Code is available at https://to.be.released.

URL: https://openreview.net/forum?id=neMAx4uBlh

---

Title: Set-Based Training for Neural Network Verification

Abstract: Neural networks are vulnerable to adversarial attacks, i.e., small input perturbations can significantly affect the outputs of a neural network. Therefore, to ensure safety of neural networks in safety-critical environments, the robustness of a neural network must be formally verified against input perturbations, e.g., from noisy sensors. To improve the robustness of neural networks and thus simplify the formal verification, we present a novel set-based training procedure in which we compute the set of possible outputs given the set of possible inputs and compute for the first time a gradient set, i.e., each possible output has a different gradient. Therefore, we can directly reduce the size of the output enclosure by choosing gradients toward its center. Small output enclosures increase the robustness of a neural network and, at the same time, simplify its formal verification. The latter benefit is due to the fact that a larger size of propagated sets increases the conservatism of most verification methods. Our extensive evaluation demonstrates that set-based training produces robust neural networks with competitive performance, which can be verified using fast (polynomial-time) verification algorithms due to the reduced output set.

URL: https://openreview.net/forum?id=n0lzHrAWIA

---

Title: Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Abstract: We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction-tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure in human education system, we build the taxonomy by decomposing human knowledge and capabilities to various fields, sub-fields and ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with a broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions from mathematical reasoning, coding, academic exams, logical reasoning to general instruction following without using task-specific training data of these tasks. In addition, GLAN allows for easy customization and new fields or skills can be added by simply incorporating a new node into our taxonomy.

URL: https://openreview.net/forum?id=PahnCreCxK

---

Title: STC3D: Self-Supervised Contrastive Learning with Spatial Transformations for 3D Medical Image Analysis

Abstract: Self-Supervised Learning (SSL) has demonstrated promising results in 3D medical image analysis, but traditional SSL methods often lack high-level semantics during pre-training, limiting performance on downstream tasks. Recent methods like Volume Contrast (VoCo) have addressed this by leveraging contextual position priors in 3D images, but VoCo relies on random cropping, which may reduce robustness to anatomical variations. In this paper, we propose STC3D, a novel SSL framework that applies controlled spatial transformations (rotation, translation, scaling) to generate multiple views of 3D volume images. These transformed views are then used for contrastive learning, enhancing invariance to anatomical structure transformations. Additionally, STC3D includes a regularization branch to promote feature discrepancy between different base slices, improving the discriminative power of learned representations. Experimental results on several benchmark datasets, including BTCV, MSD Spleen, MM-WHS, and BraTS 21, demonstrate that STC3D outperforms existing methods in segmentation and classification tasks.

URL: https://openreview.net/forum?id=5Ges6wFoyk

---

Title: Can Your Uncertainty Scores Detect Hallucinated Entity?

Abstract: To mitigate the impact of hallucination nature of LLMs, many studies propose detecting hallucinated generation through uncertainty estimation. However, these approaches predominantly operate at the sentence or paragraph level, failing to pinpoint specific spans or entities responsible for hallucinated content. This lack of granularity is especially problematic for long-form outputs that mix accurate and fabricated information. To address this limitation, we explore entity-level hallucination detection. We propose a new data set, HalluEntity, which annotates hallucination at the entity level. Based on the dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experimental results show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods show better but still suboptimal performance. Through an in-depth qualitative study, we identify relationships between hallucination tendencies and linguistic properties and highlight important directions for future research.

URL: https://openreview.net/forum?id=494k7e9R5D

---

Title: Alternators For Sequence Modeling

Abstract: This paper introduces alternators, a novel family of non-Markovian dynamical models for sequences. An alternator features two neural networks: the observation trajectory network (OTN) and the feature trajectory network (FTN). The OTN and the FTN work in conjunction, alternating between outputting samples in the observation space and some feature space, respectively. The parameters of the OTN and the FTN are not time-dependent and are learned via a minimum cross-entropy criterion over the trajectories. Alternators are versatile. They can be used as dynamical latent-variable generative models or as sequence-to-sequence predictors. Alternators can uncover the latent dynamics underlying complex sequential data, accurately forecast and impute missing data, and sample new trajectories. We showcase the capabilities of alternators in three applications. We first used alternators to model the Lorenz equations, often used to describe chaotic behavior. We then applied alternators to Neuroscience to map brain activity to physical activity. Finally, we applied alternators to Climate Science, focusing on sea-surface temperature forecasting. In all our experiments, we found alternators are stable to train, fast to sample from, yield high-quality generated samples and latent variables, and often outperform strong baselines such as Mambas, neural ODEs, and diffusion models in the domains we studied.

URL: https://openreview.net/forum?id=Q70C1HQ0VO

---

Title: Client-only Distributed Markov Chain Monte Carlo Sampling over a Network

Abstract: We aim to sample from a target
$\exp\left(-\sum_{i=1}^n f_i(x|\mathcal{D}_i\right))$ where each client $f_i$ only has access to local data $\mathcal{D}_i$. We present a fully distributed Markov Chain Monte Carlo (MCMC) sampler that operates through client-to-client communication, eliminating the need for additional centralized servers. Unlike MCMC algorithms that rely on server-client structures, our proposed sampler is entirely distributed, enhancing security and robustness through decentralized communication.
In contrast to limited decentralized algorithms arising from Langevin dynamics, our sampler utilizes blocked Gibbs sampling on an augmented distribution. Furthermore, we establish a non-asymptotic analysis of our sampler, employing innovative techniques. This study contributes to one of the initial analyses of the non-asymptotic behavior of a fully distributed sampler arising from Gibbs sampling.

URL: https://openreview.net/forum?id=1bZ2rLfKwu

---

Title: Mathematical Characterization of Decent Multiclass Models

Abstract: A binary supervised model outperforms chance if and only if the determinant of the confusion matrix is positive. This is equivalent to saying that the associated point on the ROC curve is above the random guessing line. This also means that Youden's J, Cohen's Kappa and Matthews' correlation coefficient are positive. We extend these results to any number of classes: for a target variable with $m \geq 2$ classes, we show that a model does better than chance if and only if the entries of the confusion matrix verify $m(m-1)$ homogeneous polynomial inequalities of degree 2, which can be expressed using generalized likelihood ratios. We also obtain a more theoretical formulation: a model does better than chance if and only if it is a maximum likelihood estimator of the target variable. When this is the case, we find that the multiclass versions of the previous metrics remain positive. For $m=3$, we calculate Volumes Under the ROC Surface and show that bad models occupy exactly 90\% of the ROC space, far more than the 50\% of the binary case. Finally, we propose to define weak multiclass classifiers by conditions on these generalized likelihood ratios.

URL: https://openreview.net/forum?id=VdW9SkALSd

---

Title: Online Selective Conformal Inference: Errors and Solutions

Abstract: In online selective conformal inference, data arrives sequentially, and prediction intervals are constructed only when an online selection rule is met. Since online selections may break the exchangeability between the selected test datum and the rest of the data, one must correct for this by suitably selecting the calibration data. In this paper, we evaluate existing calibration selection strategies and pinpoint some fundamental errors in the associated claims that guarantee selection-conditional coverage and control of the false coverage rate (FCR). To address these shortcomings, we propose novel calibration selection strategies that provably preserve the exchangeability of the calibration data and the selected test datum. Consequently, we demonstrate that online selective conformal inference with these strategies guarantees both selection-conditional coverage and FCR control. Our theoretical findings are supported by experimental evidence examining tradeoffs between valid methods.

URL: https://openreview.net/forum?id=PjIQwFyP07

---

Title: Wasn’t Me: Enabling Users to Falsify Deepfake Attacks

Abstract: The rise of deepfake technology has made everyone vulnerable to false claims based on manipulated media. While many existing deepfake detection methods aim to identify fake media, they often struggle with deepfakes created by new generative models not seen during training. In this paper, we propose VeriFake, a method that enables users to prove that the media claiming to show them are false. VeriFake is based on two key assumptions: (i) generative models struggle to exactly depict a specific identity, and (ii) they often fail to perfectly synchronize generated lip movements with speech. By combining these assumptions with powerful modern representation encoders, VeriFake achieves highly effective results, even against previously unseen deepfakes. Through extensive experiments, we demonstrate that VeriFake significantly outperforms state-of-the-art deepfake detection techniques despite being simple to implement and not relying on any fake data for pretraining.

URL: https://openreview.net/forum?id=jl6G0DgyaT

---

Title: Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models

Abstract: Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations in (object, relation, object) triplets extracted from LVLMs’ responses, making it easily generalizable to various vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs.

URL: https://openreview.net/forum?id=iNywrSPpvc

---

Title: Label Embedding via Low-Coherence Matrices

Abstract: Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.

URL: https://openreview.net/forum?id=vrcWXcr4On

---

Title: Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers

Abstract: Self-supervised training of language models (LMs) has seen great success for protein sequences in learning meaningful representations and for generative drug design. Most protein LMs are based on the Transformer architecture trained on individual proteins with short context lengths. Such protein LMs cannot extrapolate to longer proteins and protein complexes well. They also fail to account for the underlying biological mechanisms carried out by biomolecular interactions and dynamics i.e., proteins often interact with other proteins, molecules, and pathways in complex biological systems. In this work, we propose LC-PLM based on an alternative protein LM architecture, BiMamba-S, built upon selective structured state-space models, to learn high-quality universal protein representations at the amino acid token level using masked language modeling. We also introduce its graph-contextual variant, LC-PLM, which contextualizes protein-protein interaction (PPI) graphs for a second stage of training. LC-PLM demonstrates favorable neural scaling laws, better length extrapolation capability, and up to 30% and 16% improvements on protein downstream tasks compared to Transformer-based ESM-2 when trained with 100B and 1T tokens, respectively. LC-PLM-G further trained within the context of PPI graphs shows promising results on protein structure and function prediction tasks. Our study demonstrates the benefit of increasing the context size with computationally efficient LM architecture (e.g., structured state space models) in learning universal protein representations and incorporating molecular interaction contexts contained in biological graphs.

URL: https://openreview.net/forum?id=dWvztQzfy4

---

Title: Input and Output Privacy in Cross-Silo Federated Settings: an MPC+DP Approach

Abstract: We address the problem of training a machine learning model on data held by multiple data holders in a cross-silo federated setup while ensuring privacy guarantees. Existing Federated Learning (FL) solutions with Differential Privacy (DP) or Secure Multiparty Computation (MPC) with DP are often limited to either horizontal or vertical partitioning and typically suffer from accuracy loss compared to a centralized setting. We propose an MPC-based approach for training differentially private linear models that supports any partitioning scenario and effectively combines MPC and DP. Our solution employs MPC protocols for both model training and output perturbation using Laplace-like noise. By simulating a trusted curator through MPC, our approach provides the benefits of global DP without requiring an actual trusted party. The resulting MPC+DP method achieves accuracy comparable to a centralized DP setup while maintaining privacy guarantees in a cross-silo federated setup.

URL: https://openreview.net/forum?id=bedKf80Sz2

---

Title: TRA: Better Length Generalisation with Threshold Relative Attention

Abstract: Transformers struggle with length generalisation, displaying poor performance even on basic tasks. We test whether these limitations can be explained through two key failures of the self-attention mechanism. The first is the inability to fully remove irrelevant information. The second is tied to position, even if the dot product between a key and query is highly negative (i.e. an irrelevant key) learned positional biases may unintentionally up-weight such information - dangerous when distances become out of distribution. Put together, these two failure cases lead to compounding generalisation difficulties. We test whether they can be mitigated through the combination of a) selective sparsity - completely removing irrelevant keys from the attention softmax and b) contextualised relative distance - distance is only considered as between the query and the keys that matter. We show how refactoring the attention mechanism with these two mitigations in place can substantially improve generalisation capabilities of decoder only transformers.

URL: https://openreview.net/forum?id=yNiBUc2hMW

---

Title: Adaptive Clipping for Differential Private Federated Learning in Interpolation Regimes

Abstract: We investigate improving the utility of standard differential private optimization algorithms by adaptively determining the clipping radius in federated learning. Our adaptive clipping radius is based on the root-mean-square of the gradient norms, motivated by the interpolation property and smoothness of the objectives. In addition to Renyi Differential Privacy (RDP) analysis, we conduct theoretical utility analysis of the proposed algorithm, showing that our method enhances utility compared to DP-SGD for smooth and non-strongly convex objectives. Numerical experiments confirm the superiority of our adaptive clipping algorithm over standard DP optimization with fixed clipping radius in federated learning settings.

URL: https://openreview.net/forum?id=vvSHlH3a8V

---

Title: EMMA: End-to-End Multimodal Model for Autonomous Driving

Abstract: We introduce EMMA, an End-to-end Multimodal Model for Autonomous driving. Built on a multimodal large language model foundation, EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements. EMMA maximizes the utility of world knowledge from the pre-trained large language models, by representing all non-sensor inputs (e.g. navigation instructions and ego vehicle status) and outputs (e.g. trajectories and 3D locations) as natural language text. This approach allows EMMA to jointly process various driving tasks in a unified language space, and generate the outputs for each task using task-specific prompts. Empirically, we demonstrate EMMA’s effectiveness by achieving state-of-the-art performance in motion planning on nuScenes as well as competitive results on an in-house large-scale benchmark. EMMA also yields competitive results for camera-primary 3D object detection on the Waymo Open Dataset (WOD). We show that co-training EMMA with planner trajectories, object detection, and road graph tasks yields improvements across all three domains, highlighting EMMA’s potential as a generalist model for autonomous driving applications.

URL: https://openreview.net/forum?id=kH3t5lmOU8

---

Title: TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining

Abstract: Large-scale general domain pretraining followed by downstream-specific finetuning has become a predominant paradigm in machine learning. However, discrepancies between the pretraining and target domains can still lead to performance degradation in certain cases, underscoring the need for task-adaptive continued pretraining (TAP). TAP methods typically involve continued pretraining on task-specific unlabeled datasets or introducing additional unsupervised learning objectives to enhance model capabilities. While many TAP methods perform continued pretraining with multiple pretraining objectives, they often determine the tradeoff parameters between objectives manually, resulting in suboptimal outcomes and higher computational costs. In this paper, we propose TapWeight, a task-adaptive pretraining framework which automatically determines the optimal importance of each pretraining objective based on downstream feedback. TapWeight reweights each pretraining objective by solving a multi-level optimization problem. We applied TapWeight to both molecular property prediction and natural language processing tasks, significantly surpassing baseline methods. Experimental results validate the effectiveness and generalizability of TapWeight. Our code is available in the supplementary material.

URL: https://openreview.net/forum?id=DCCw2CEVFS

---

Title: Multi-Agent Decision S4: Leveraging State Space Models for Offline Multi-Agent Reinforcement Learning

Abstract: Goal-conditioned sequence-based supervised learning with transformers has shown promise in offline reinforcement learning (RL) for single-agent settings. However, extending these methods to offline multi-agent RL (MARL) remains challenging. Existing transformer-based MARL approaches either train agents independently, neglecting multi-agent system dynamics, or rely on centralized transformer models, which face scalability issues. Moreover, transformers inherently struggle with long-term dependencies and computational efficiency. Building on the recent success of Structured State Space Sequence (S4) models, known for their parameter efficiency, faster inference, and superior handling of long context lengths, we propose a novel application of S4-based models to offline MARL tasks. Our method utilizes S4's efficient convolutional view for offline training and its recurrent dynamics for fast on-policy fine-tuning. To foster scalable cooperation between agents, we sequentially expand the decision-making process, allowing agents to act one after another at each time step. This design promotes bi-directional cooperation, enabling agents to share information via their S4 latent states or memory with minimal communication. Gradients also flow backward through this shared information, linking the current agent's learning to its predecessor. Experiments on challenging MARL benchmarks, including Multi-Robot Warehouse (RWARE) and StarCraft Multi-Agent Challenge (SMAC), demonstrate that our approach significantly outperforms state-of-the-art offline RL and transformer-based MARL baselines across most tasks.

URL: https://openreview.net/forum?id=ML9uAK0uzo

---

Title: Interactive Large Language Models for Reliable Answering under Incomplete Context

Abstract: The rise of large language models (LLMs) has revolutionized the way humans interact with artificial intelligence systems. However, their reliability in sensitive applications—such as personal consultations or clinical decision-making—remains limited. A critical shortfall lies in LLMs’ inherent lack of interactivity: these models generate responses even when essential context or domain-specific knowledge is absent, risking inaccurate or misleading outputs. A potential approach to mitigate this issue is to enable LLMs to pose clarifying questions, thereby uncovering the missing information required to provide accurate responses. However, previous methods often tend to greedily prompt LLMs to ask questions. This burdens the user to respond to potentially irrelevant questions and makes the system less flexible. In this paper, we introduce LaMSeI (Language Model with Selective Interaction) method, which enhances LLMs’ ability to judge when interaction is necessary under ambiguous or incomplete contexts. The motivation of LaMSeI is to measure the level of LLMs’ uncertainty about the user query, and interacts with user only when the uncertainty is high. Additionally, we incorporate active learning techniques to select the most informative questions from question candidates, for effectively uncovering the missing context. Our empirical studies, across various challenging question answering benchmarks, where LLMs are posed queries with incomplete context, demonstrate the effectiveness of LaMSeI. The method improves answer accuracy from 31.9% to 50.9%, outperforming other leading question-answering frameworks. Moreover, in experiments involving human participants, LaMSeI consistently generates answers superior to or comparable to baselines in more than 82% of the cases. Moreover, we verify the performance of LaMSeI on various LLMs, such as LLAMA2, LLAMA3, Vicuna and GPT-3.5, highlighting its capability to improve interactive language models.

URL: https://openreview.net/forum?id=nnlmcxYWlV

---

Title: Federated Learning under Evolving Distribution Shifts

Abstract: Federated learning (FL) is a distributed learning paradigm that facilitates training a global machine learning model without collecting the raw data from distributed clients. Recent advances in FL have addressed several considerations that are likely to transpire in realistic settings such as data distribution heterogeneity among clients. However, most of the existing works still consider clients' data distributions to be static or conforming to a simple dynamic, e.g., in participation rates of clients. In real FL applications, client data distributions change over time, and the dynamics, i.e., the evolving pattern, can be highly non-trivial. Further, evolution may take place from training to testing. In this paper, we address dynamics in client data distributions and aim to train FL systems from time-evolving clients that can generalize to future target data. Specifically, we propose two algorithms, FedEvolve and FedEvp, which are able to capture the evolving patterns of the clients during training and are test-robust under evolving distribution shifts. Through extensive experiments on both synthetic and real data, we show the proposed algorithms can significantly outperform the FL baselines across various network architectures.

URL: https://openreview.net/forum?id=ITTAlVyRCj

---

Title: Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Abstract: Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.

URL: https://openreview.net/forum?id=6LxMeRlkWl

---

Title: Exploring NLP pipelines for textual regression

Abstract: Natural language (NL) regression is the task of predicting numeric responses from text. Like other NL applications, NL regression uses a pipeline that typically comprises four steps: preprocessing, tokenization, featurization and modeling. Each step has multiple options, from traditional and modern NLP approaches, giving many possible pipelines. However, there is no work systematically comparing different combinations of these steps for NL regression. We systematically generate and evaluate hundreds of random valid pipeline configurations, including combinations not commonly studied. For example, approaches with transformers for featurization and gradient boosted trees for modeling. Then, we evaluate these pipelines on two real datasets. These experiments reveal several interesting aspects of pipeline construction:
i) BERT contextual featurization outperforms GloVe non-contextual featurization,
ii) BERT featurization needs to be finetuned to outperform bag of words, with implications for resource constrained applications,
iii) the variance associated with choosing steps upstream from modelling is comparable to that of selecting the model, and
iv) vector embeddings (BERT and GloVe) perform worse than bag of words for GBDT models.
This study provides systematic evidence highlighting the need for holistic pipeline optimization.

URL: https://openreview.net/forum?id=ejQA5b6l1P

---

Title: Graph Masked Language Models

Abstract: Language Models (LMs) and Graph Neural Networks (GNNs) have shown great promise in their respective areas, yet integrating structured graph data with rich textual information remains challenging. In this work, we propose \emph{Graph Masked Language Models} (GMLM), a novel dual-branch architecture that combines the structural learning of GNNs with the contextual power of pretrained language models. Our approach introduces two key innovations: (i) a \emph{semantic masking strategy} that utilizes graph topology to selectively mask nodes based on their structural importance, and (ii) a \emph{soft masking mechanism} that interpolates between original node features and a learnable mask token, ensuring smoother information flow during training. Extensive experiments on multiple node classification and language understanding benchmarks demonstrate that GMLM not only achieves state-of-the-art performance but also exhibits enhanced robustness and stability. This work underscores the benefits of integrating structured and unstructured data representations for improved graph learning.

URL: https://openreview.net/forum?id=NoSFk54m8I

---

Title: Offline Model-Based Optimization: Comprehensive Review

Abstract: Offline optimization is a fundamental challenge in science and engineering, where the goal is to optimize black-box functions using only offline datasets. This setting is particularly relevant when querying the objective function is prohibitively expensive or infeasible, with applications spanning protein engineering, material discovery, neural architecture search, and beyond. The main difficulty lies in accurately estimating the objective landscape beyond the available data, where extrapolations are fraught with significant epistemic uncertainty. This uncertainty can lead to objective hacking(reward hacking), exploiting model inaccuracies in unseen regions, or other spurious optimizations that yield misleadingly high performance estimates outside the training distribution. Recent advances in model-based optimization(MBO) have harnessed the generalization capabilities of deep neural networks to develop offline-specific surrogate and generative models. Trained with carefully designed strategies, these models are more robust against out-of-distribution issues, facilitating the discovery of improved designs. Despite its growing impact in accelerating scientific discovery, the field lacks a comprehensive review. To bridge this gap, we present the first thorough review of offline MBO. We begin by formalizing the problem for both single-objective and multi-objective settings and by reviewing recent benchmarks and evaluation metrics. We then categorize existing approaches into two key areas: surrogate modeling, which emphasizes accurate function approximation in out-of-distribution regions, and generative modeling, which explores high-dimensional design spaces to identify high-performing designs. Finally, we examine the key challenges and propose promising directions for advancement in this rapidly evolving field including safe control of superintelligent systems.

URL: https://openreview.net/forum?id=BZPtFEEwxH

---

Title: Federated Generalized Novel Category Discovery with Prompts Tuning

Abstract: Generalized category discovery (GCD) is proposed to handle categories from unseen labels during the inference stage by clustering them. Most work in GCD provides solutions for unseen classes in data-centralized settings. However, unlabeled categories possessed by clients, which are common in real-world federated learning (FL), have been largely ignored and degraded the performance of classic FL algorithms. To demonstrate and mitigate the harmful effect of unseen classes, we dive into a GCD problem setting applicable for FL named FedGCD, establish a strong baseline constructed with state-of-the-art GCD algorithm simGCD, and design a learning framework with prompt tuning to tackle both the overfitting and communication burden problems in FedGCD. In our methods, clients first separately carry out prompt learning on local data. Then, we aggregate the prompts from all clients as the global prompt to help capture global knowledge and then send the global prompts to local clients to allow access to broader knowledge from other clients. By this method, we significantly reduce the parameters needed to upload in FedGCD, which is a common obstacle in the real application of most FL algorithms. We conduct experiments on both generic and fine-grained datasets like CIFAR-100 and CUB-200, and show that our method is comparable to the FL version of simGCD and surpasses other baselines with significantly fewer parameters to transmit.

URL: https://openreview.net/forum?id=dVMESwnMlo

---

Title: Efficient Knowledge Injection in LLMs via Self-Distillation

Abstract: In many practical applications, large language models (LLMs) need to acquire new knowledge not present in their pre-training data. Efficiently leveraging this knowledge usually relies on supervised fine-tuning or retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. This paper proposes utilizing prompt distillation, a self-distillation-based method previously explored primarily for style alignment and instruction tuning, to internalize new factual knowledge from free-form documents. Unlike prior methods, our approach requires neither larger teacher models nor structured knowledge formats. Across multiple LLM sizes and model families, we show that prompt distillation outperforms standard supervised fine-tuning and can even surpass RAG. We analyze the key factors contributing to prompt distillation's effectiveness and examine how it scales.

URL: https://openreview.net/forum?id=drYpdSnRJk

---

Title: FedIN: Federated Intermediate Layers Learning for Model Heterogeneity

Abstract: Federated learning (FL) facilitates edge devices to cooperatively train a global shared model while maintaining the training data locally and privately. However, a prevalent yet impractical assumption in FL requires the participating edge devices to train on an identical global model architecture. Recent research endeavors to address this problem in FL using public datasets. Nevertheless, acquiring data distributions that closely match to those of participating users poses a significant challenge. In this study, we propose an FL method called Federated Intermediate Layers Learning (FedIN), which supports heterogeneous models without relying on any public datasets. Instead, FedIN leverages the inherent knowledge embedded in client model features to facilitate knowledge exchange. To harness the knowledge from client features, we propose Intermediate Layers (IN) training to align intermediate layers based on features obtained from other clients. IN training only needs minimal memory and communication overhead by employing a single batch of client features. Additionally, we formulate and resolve a convex optimization problem to mitigate the challenge of gradient divergence stemming from model heterogeneity. The experimental results demonstrate the superior performance of FedIN in heterogeneous model settings compared to state-of-the-art algorithms. Furthermore, the experiments discuss the details of how to protect user privacy leaked from IN features, and our ablation study illustrates the effectiveness of IN training.

URL: https://openreview.net/forum?id=Nh3pX3AnPt

---

Title: AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization

Abstract: Transformer architectures have revolutionized a broad spectrum of AI applications by leveraging
attention mechanisms for parallelized and long-range sequence processing. Despite
their remarkable success, building and customizing transformers remains prohibitively complex
for many domain experts who lack deep knowledge of low-level implementations. We
introduce AttentionSmithy, a modular software package that lowers the barrier to transformer
innovation by decomposing key components—attention modules, feed-forward networks,
normalization layers, and positional encodings—into reusable building blocks. By
disentangling architectural elements into well-defined interfaces, users can rapidly prototype,
adapt, and evaluate transformer variants without extensive coding overhead. Our framework
supports four distinct positional encoding strategies (sinusoidal, learned, rotary, and ALiBi)
and integrates seamlessly with neural architecture search (NAS) for automated design exploration.
We validate AttentionSmithy by replicating the original “Attention Is All You Need”
transformer under resource constraints, demonstrating near state-of-the-art performance on
a machine translation task. Leveraging the package’s integrated NAS capability, we made
the unexpected discovery that machine translation performance is maximized by combining
all available positional encoding methods—highlighting the complementary benefits of each
strategy. We further illustrate AttentionSmithy’s adaptability through gene-specific modeling,
where a variant of a BERT-style architecture achieves over 95% accuracy on downstream
cell type classification tasks using ranked transcriptomic data. These case studies underscore
AttentionSmithy’s core advantage: enabling specialized experimentation across diverse
application domains—from natural language processing to genomic analysis—by obviating
the need for labor-intensive, low-level framework manipulation. We anticipate that AttentionSmithy
will serve as a foundation for creative transformer-based solutions, expediting
research and development in numerous scientific and industrial fields.

URL: https://openreview.net/forum?id=0jhoriH9yA

---

Title: Approaching Deep Learning through the Spectral Dynamics of Weights

Abstract: We study the spectral dynamics of weights--the behavior of singular values and vectors during optimization--showing that they clarify and link many phenomena in deep learning. Through extensive experiments, covering small-scale ``grokking'' to large-scale tasks like image classification with ConvNets, image generation with UNets, speech recognition with LSTMs, and language modeling with Transformers, we identify a consistent bias with three key ingredients. First, singular values evolve unequally leading to rank minimization. As a result, top singular vectors stabilize well before the end of training, and lastly this happens without displaying alignment between neighboring layers used in several theoretical results. We show how this bias tracks the transition to generalization in grokking. We demonstrate more generally that weight decay enhances rank minimization beyond its role as a norm regularizer in practical systems. Moreover, we show that these spectral dynamics distinguish random label training from true labels, offering a novel perspective on this longstanding conundrum. Additionally, these dynamics reveal structure in well-performing sparse subnetworks (lottery tickets) and the shape of the loss surface through linear mode connectivity. Our findings suggest that spectral dynamics provide a coherent view that links the behavior of neural networks across diverse settings.

URL: https://openreview.net/forum?id=RtMyoQiCxZ

---

Title: The Diffusion Process as a Correlation Machine: Linear Denoising Insights

Abstract: Recently, diffusion models have gained popularity due to their impressive generative abilities. These models learn the implicit distribution given by a training dataset, and sample new data by transforming random noise through the reverse process, which can be thought of as gradual denoising. In this work, to shed more light on the evolution of denoisers in the reverse process, we examine the generation process as a ``correlation machine'', where random noise is repeatedly enhanced in correlation with the implicit given distribution.
To this end, we explore the linear case, where the optimal denoiser in the MSE sense is known to be the PCA projection. This enables us to connect the theory of diffusion models to the spiked covariance model, where the dependence of the denoiser on the noise level and the amount of training data can be expressed analytically, in the rank-1 case.
In a series of numerical experiments, we extend this result to general low rank data, and show that low frequencies emerge earlier in the generation process, where the denoising basis vectors are more aligned to the true data with a rate depending on their eigenvalues. This model allows us to show that the linear reverse process is a generalization of the prevalent power iteration method, where the generated distribution is composed of several estimations of the given covariance, in varying stages of convergence.
Finally, we empirically demonstrate the applicability of our findings beyond the linear case, in the Jacobians of a deep, non-linear denoiser, used in general image generation tasks.

URL: https://openreview.net/forum?id=FGDJOc27rt

---

Title: Learning the Language of Protein Structure

Abstract: Representation learning and \emph{de novo} generation of proteins are pivotal computational biology tasks.
Whilst natural language processing (NLP) techniques have proven highly effective for protein sequence modelling, structure modelling presents a complex challenge, primarily due to its continuous and three-dimensional nature.
Motivated by this discrepancy, we introduce an approach using a vector-quantized autoencoder that effectively tokenizes protein structures into discrete representations. This method transforms the continuous, complex space of protein structures into a manageable, discrete format with a codebook ranging from 4096 to 64000 tokens, achieving high-fidelity reconstructions with backbone root mean square deviations (RMSD) of approximately 1-4 \AA. To demonstrate the efficacy of our learned representations, we show that a simple GPT model trained on our codebooks can generate novel, diverse, and designable protein structures. Our approach not only provides representations of protein structure, but also mitigates the challenges of disparate modal representations and sets a foundation for seamless, multi-modal integration, enhancing the capabilities of computational methods in protein design.

URL: https://openreview.net/forum?id=SRRPQIOS4w

---

Title: Cumulative Reasoning with Large Language Models

Abstract: Recent advancements in large language models (LLMs) have shown remarkable progress, yet their ability to solve complex problems remains limited. In this work, we introduce Cumulative Reasoning (CR), an approach that utilizes LLMs cumulatively and iteratively, mirroring human thought processes for problem-solving. CR decomposes tasks into smaller, manageable components and leverages previous propositions for effective composition, significantly enhancing problem-solving capabilities. We demonstrate CR’s advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging level 5 problems. When incorporating a code environment with CR, we further harness LLMs’ reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%.

URL: https://openreview.net/forum?id=grW15p4eq2

---

Title: Best Practices To Compress Multimodal Large Language Models through Structural Pruning and Recovery

Abstract: Multimodal large language models (MLLMs) are increasingly developed to meet diverse deployment needs, varying in scale and computational demand. While recent research has focused on building MLLMs from Small Language Models (SLMs), these efforts remain limited in flexibility and are still data- and compute-intensive. In this paper, we present the first comprehensive study on flexibly compressing existing MLLMs through structural pruning and recovery training in a data-efficient manner. Hence, we address a critical gap in the literature by empirically analyzing best practices for adapting to specific hardware or resource limitations. Our study investigates pruning and knowledge distillation techniques, examining their impact on downstream performance across various model compression strategies, including pruning paradigms and recovery training schemes. We further investigate the feasibility of performing recovery training using only a small fraction of the available data. Key findings reveal that widthwise pruning is more effective than layerwise pruning in resource-constrained scenarios. For smaller compression ratios, finetuning the multimodal projector alone can restore most performance, while combining finetuning with hidden state knowledge distillation proves most effective across all compression levels. Notably, we demonstrate efficient model downsizing using as little as 5% of the original dataset for moderate compression, which achieves over 95% of the performance compared to using the full dataset. Our paper addresses a critical gap in the literature by empirically analysing the best practices for compressing MLLMs. With our best practices, Bunny-v1.0-3B retains over 95% of its original performance, while LLaVA-v1.5-7B maintains more than 97%, with compression ratios below 30%.

URL: https://openreview.net/forum?id=54dCGQwZwj

---

Title: Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

URL: https://openreview.net/forum?id=2U1KIfmaU9

---

Title: Does equivariance matter at scale?

Abstract: Given large data sets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.

URL: https://openreview.net/forum?id=wilNute8Tn

---

Title: LumiNet: Perception-Driven Knowledge Distillation via Statistical Logit Calibration

Abstract: In the knowledge distillation literature, feature-based methods have dominated due to their ability to effectively tap into extensive teacher models. In contrast, logit-based approaches, which aim to distill `dark knowledge' from teachers, typically exhibit inferior performance compared to feature-based methods. To bridge this gap, we present LumiNet, a novel knowledge distillation algorithm designed to enhance logit-based distillation. We introduce the concept of `perception', aiming to calibrate logits based on the model's representation capability. This concept addresses overconfidence issues in the logit-based distillation method while also introducing a novel method to distill knowledge from the teacher. It reconstructs the logits of a sample/instances by considering relationships with other samples in the batch. LumiNet excels on benchmarks like CIFAR-100, ImageNet, and MSCOCO, outperforming the leading feature-based methods, e.g., compared to KD with ResNet18 and MobileNetV2 on ImageNet, it shows improvements of 1.5\% and 2.05\%, respectively.

URL: https://openreview.net/forum?id=3rU1lp9w2l

---

Title: A Critical Review of Predominant Bias in Neural Networks

Abstract: Bias issues of neural networks garner significant attention along with its promising advancement.
Among various bias issues, mitigating two predominant biases is crucial in advancing fair and trustworthy AI: (1) ensuring neural networks yields even performance across demographic groups, and (2) ensuring algorithmic decision-making does not rely on protected attributes.
However, upon the investigation of \pc papers in the relevant literature, we find that there exists a persistent, extensive but under-explored confusion regarding these two types of biases.
Furthermore, the confusion has already significantly hampered the clarity of the community and subsequent development of debiasing methodologies.
Thus, in this work, we aim to restore clarity by providing two mathematical definitions for these two predominant biases and leveraging these definitions to unify a comprehensive list of papers.
Next, we highlight the common phenomena and the possible reasons for the existing confusion.
To alleviate the confusion, we provide extensive experiments on synthetic, census, and image datasets, to validate the distinct nature of these biases, distinguish their different real-world manifestations, and evaluate the effectiveness of a comprehensive list of bias assessment metrics in assessing the mitigation of these biases.
Further, we compare these two types of biases from multiple dimensions including the underlying causes, debiasing methods, evaluation protocol, prevalent datasets, and future directions.
Last, we provide several suggestions aiming to guide researchers engaged in bias-related work to avoid confusion and further enhance clarity in the community.

URL: https://openreview.net/forum?id=LLiJ1WsL2e

---

Title: Agnostic Label-Only Membership Inference Attack on Two-Tower Neural Networks for Recommendation Systems

Abstract: This paper presents an innovative adaptation of the Agnostic Label-Only Membership Inference Attack (ALOA) specifically designed for two-tower neural network (NN) models used in recommendation systems. Unlike traditional membership inference attacks that focus on categorical outputs, our approach targets models that produce continuous vector embeddings. We propose a comprehensive methodology that employs synthetic datasets, shadow model training, and a suite of perturbation techniques to evaluate model robustness using the Maximum Mean Discrepancy (MMD) metric. Experimental results demonstrate that the attack model achieves exceptionally high accuracy and precision in distinguishing whether data is part of the original training dataset, even without direct access to it. These findings extend the theoretical framework of membership inference attacks to continuous output spaces and highlight vulnerabilities in modern recommendation systems.

URL: https://openreview.net/forum?id=0QScq2eA5l

---

Title: Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

Abstract: Spurious correlations are unstable statistical associations that hinder robust decision-making. Conventional wisdom suggests that models relying on such correlations will fail to generalize out-of-distribution (OOD), particularly under strong distribution shifts. However, a growing body of empirical evidence challenges this view, as naive empirical risk minimizers often achieve the best OOD accuracy across popular OOD generalization benchmarks. In light of these counterintuitive results, we propose a different perspective: many widely used benchmarks for assessing the impact of spurious correlations on OOD generalization are misspecified. Specifically, they fail to include shifts in spurious correlations that meaningfully degrade OOD generalization, making them unsuitable for evaluating the benefits of removing such correlations. We establish sufficient—and in some cases necessary—conditions under which a distribution shift can reliably assess a model's reliance on spurious correlations. Crucially, under these conditions, we provably should not observe a strong positive correlation between in-distribution and out-of-distribution accuracy—often referred to as accuracy on the line. Yet, when we examine state-of-the-art OOD generalization benchmarks, we find that most exhibit accuracy on the line, suggesting they do not effectively assess robustness to spurious correlations. Our findings expose a limitation in evaluating algorithms for domain generalization, i.e., learning predictors that do not rely on spurious correlations. Our results highlight the need to rethink how we assess robustness to spurious correlations.

URL: https://openreview.net/forum?id=fNywRyqPQo

---

Title: Actor-only and Safe-Actor-only REINFORCE Algorithms with Deterministic Update Times

Abstract: Regular Monte-Carlo policy gradient reinforcement learning (RL) algorithms require aggregation of data over regeneration epochs constituting an episode (until a termination state is reached). In real-world applications involving large state and action spaces, the hitting times for goal states can be very sparse or infrequent resulting in large episodes of unpredictable length. As an alternative, we present an RL algorithm called Actor-only algorithm (AOA) that performs data aggregation over a certain (deterministic) number of epochs. This helps remove unpredictability in the data aggregation step and thereby the update instants. Note also that satisfying safety constraints in RL is extremely crucial in safety-critical applications. We also extend the aforementioned AOA to the setting of safe RL that we call Safe-Actor-only algorithm (SAOA). In this work, we provide the asymptotic and finite-time convergence guarantees of our proposed algorithms to obtain the optimal policy. The finite-time analysis of our proposed algorithms demonstrates that finding a first-order stationary point, i.e., $\left\|\nabla \bar J\left(\theta\right)\right\|_2^2\leq \epsilon$ and $\left\|\nabla \bar {\mathcal{L}}\left(\theta,\eta\right)\right\|_2^2\leq \epsilon$ of performance function $\bar J(\theta)$ and $\bar{\mathcal{L}}(\theta,\eta)$, respectively, both with $\mathcal{O}(\epsilon^{-2})$ sample complexity. Further, our empirical results on benchmark RL environments demonstrate the advantages of proposed algorithms over considered algorithms in the literature.

URL: https://openreview.net/forum?id=3EDa8Wkcs4

---

Title: Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations

Abstract: Training neural networks with high certified accuracy against adversarial examples remains an open challenge despite significant efforts.
While certification methods can effectively leverage tight convex relaxations for bound computation, in training, these methods, perhaps surprisingly, can perform worse than looser relaxations.
Prior work hypothesized that this phenomenon is caused by the discontinuity, non-smoothness, and perturbation sensitivity of the loss surface induced by tighter relaxations.
In this work, we theoretically show that Gaussian Loss Smoothing (GLS) can alleviate these issues.
We confirm this empirically by instantiating GLS with two variants: a zeroth-order optimization algorithm, called PGPE, which allows training with non-differentiable relaxations, and a first-order optimization algorithm, called RGS, which requires gradients of the relaxation but is much more efficient than PGPE.
Extensive experiments show that when combined with tight relaxations, these methods surpass state-of-the-art methods when training on the same network architecture for many settings.
Our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks and pave a path towards leveraging tighter relaxations for certified training.

URL: https://openreview.net/forum?id=lknvxcjuos

---

Title: Reasoning with trees: interpreting CNNs using hierarchies

Abstract: Challenges remain in providing interpretable explanations for neural network reasoning in explainable AI (xAI). Existing methods like Integrated Gradients produce noisy maps, and LIME, while intuitive, may deviate from the model's reasoning. We introduce a framework that uses hierarchical segmentation techniques for faithful and interpretable explanations of Convolutional Neural Networks (CNNs). Our method constructs model-based hierarchical segmentations that maintain the model's reasoning fidelity and allow both human-centric and model-centric segmentation. This approach can be combined with various xAI methods and provides multiscale explanations that help identify biases and improve understanding of neural network decision-making. Experiments show that our framework, xAiTrees, delivers highly interpretable and faithful model explanations, not only surpassing traditional xAI methods but shedding new light on a novel approach to enhancing xAI interpretability.

URL: https://openreview.net/forum?id=zjyWZh5IiI

---

Title: Black Box Causal Inference: Effect Estimation via Meta Prediction

Abstract: Causal inference and the estimation of causal effects plays a central role in decision-making across many areas, including healthcare and economics. Estimating causal effects typically requires an estimator that is tailored to each problem of interest. But developing estimators can take significant effort for even a single causal inference setting. For example, algorithms for regression-based estimators, propensity score methods, and doubly robust methods were designed across several decades to handle causal estimation with observed confounders. Similarly, several estimators have been developed to exploit instrumental variables (IVs), including two-stage least-squares (TSLS), control functions, and the method-of-moments. In this work, we instead frame causal inference as a dataset-level prediction problem, offloading algorithm design to the learning process. The approach we introduce, called black box causal inference (BBCI), builds estimators in a black-box manner by learning to predict causal effects from sampled dataset-effect pairs. We demonstrate accurate estimation of average treatment effects (ATEs) and conditional average treatment effects (CATEs) with BBCI across several causal inference problems with known identification, including problems with less developed estimators.

URL: https://openreview.net/forum?id=KEtlsENZSE

---

Title: Dependency-aware Maximum Likelihood Estimation for Active Learning

Abstract: Active learning aims to efficiently build a labeled training set by strategically selecting samples to query labels from annotators.
In this sequential process, each sample acquisition influences subsequent selections, causing dependencies among samples in the labeled set. However, these dependencies are overlooked during the model parameter estimation stage when updating the model using Maximum Likelihood Estimation (MLE), a conventional method that assumes independent and identically distributed (i.i.d.) data. We propose Dependency-aware MLE (DMLE), which corrects MLE within the active learning framework by addressing sample dependencies typically neglected due to the i.i.d. assumption, ensuring consistency with active learning principles in the model parameter estimation process. This improved method achieves superior performance across multiple benchmark datasets, reaching higher performance in earlier cycles compared to conventional MLE. Specifically, we observe average accuracy improvements of 6\%, 8.6\%, and 10.5\% for $k=1$, $k=5$, and $k=10$ respectively, after collecting the first 100 samples, where entropy is the acquisition function and $k$ is the query batch size acquired at every active learning cycle.

URL: https://openreview.net/forum?id=qDVDSXXGK1

---

Title: Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance

Abstract: The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.

URL: https://openreview.net/forum?id=tTdzbnvTno

---

Title: On the Challenges and Opportunities in Generative AI

Abstract: The field of deep generative modeling has grown rapidly in the last few years. With the availability of massive amounts of training data coupled with advances in scalable unsupervised learning paradigms, recent large-scale generative models show tremendous promise in synthesizing high-resolution images and text, as well as structured data such as videos and molecules. However, we argue that current large-scale generative AI models exhibit several fundamental shortcomings that hinder their widespread adoption across domains. In this work, our objective is to identify these issues and highlight key unresolved challenges in modern generative AI paradigms that should be addressed to further enhance their capabilities, versatility, and reliability. By identifying these challenges, we aim to provide researchers with insights for exploring fruitful research directions, thus fostering the development of more robust and accessible generative AI solutions.

URL: https://openreview.net/forum?id=NeS9Kj2JwF

---

Title: Personalized Federated Learning via Low-Rank Matrix Optimization

Abstract: Personalized Federated Learning (pFL) has gained significant attention for building a suite of models tailored to different clients. In pFL, the challenge lies in balancing the reliance on local datasets, which may lack representativeness, against the diversity of other clients' models, whose quality and relevance are uncertain. Focusing on the clustered FL scenario, where devices are grouped based on similarities in their data distributions without prior knowledge of cluster memberships, we develop a mathematical model for pFL using low-rank matrix optimization. Building on this formulation, we propose a pFL approach leveraging the Burer-Monteiro factorization technique. We examine the convergence guarantees of the proposed method, and present numerical experiments on training deep neural networks, demonstrating the empirical performance of the proposed method in scenarios where personalization is crucial.

URL: https://openreview.net/forum?id=DFJu1QB2Nr

---

Title: PCF Learned Sort: a Learning Augmented Sort Algorithm with $\mathcal{O}(n \log\log n)$ Expected Complexity

Abstract: Sorting is one of the most fundamental algorithms in computer science. Recently, Learned Sorts, which use machine learning to improve sorting speed, have attracted attention. While existing studies show that Learned Sort is empirically faster than classical sorting algorithms, they do not provide theoretical guarantees about its computational complexity. We propose PCF Learned Sort, a theoretically guaranteed Learned Sort algorithm. We prove that the expected complexity of PCF Learned Sort is $\mathcal{O}(n \log \log n)$ under mild assumptions on the data distribution. We also confirm empirically that PCF Learned Sort has a computational complexity of $\mathcal{O}(n \log \log n)$ on both synthetic and real datasets. This is the first study to theoretically support the empirical success of Learned Sort, and provides evidence for why Learned Sort is fast.

URL: https://openreview.net/forum?id=wVkb8WHbvR

---

Title: On the Efficiency of Diffusion Models in Generating Plausible Designs

Abstract: Diffusion-based generative models have huge potential in creating novel structural images in generative design where the user heavily values the design plausibility, e.g, no floating material or missing part. However, such models often require many denoising steps to achieve satisfactory plausibility, resulting in high computation costs; when using much fewer steps, we can not ensure plausibility. This paper addresses this trade-off and proposes an efficient training and inference method that can achieve the same or better plausibility than existing models while reducing the sampling time. We determine the noise schedule based on the evolution of pixel-value distributions in the forward diffusion process. Compared to previous models, e.g., DDPM and EDM, our method concentrates the noise schedule at a range of noise levels that highly influence the structural modeling and hereby achieves high efficiency in inference without compromising the visual quality or design plausibility. We apply this noise schedule to the EDM method on two structural data sets, BIKED and Seeing3DChairs. On BIKED images, for instance, our noise schedule significantly improves the quality of generated designs: the rate of plausible designs from 83.4% to 93.5%; FID from 7.84 to 4.87, compared to EDM.

URL: https://openreview.net/forum?id=rtNldOBA9N

---

Title: Game-Theoretic LLM: Agent Workflow for Negotiation Games

Abstract: We investigate the rationality of large language models (LLMs) for strategic decision making via a game-theoretic perspective. We evaluate several state-of-the-art LLMs across a spectrum of complete-information and incomplete-information games. Our findings reveal that LLMs frequently deviate from rational strategies, particularly as the complexity of the game increases with larger payoff matrices or deeper sequential trees.

To address these limitations, we design multiple game-theoretic workflows that guide the reasoning and decision-making processes of LLMs. These workflows aim to enhance the models' ability to compute Nash Equilibria and make rational choices, even under incomplete information. Experimental results demonstrate that adopting these workflows significantly improves the rationality and robustness of LLMs in game-theoretic tasks. Specifically, with the workflow, LLMs exhibit marked improvements in identifying optimal strategies, achieving near-optimal allocations in negotiation scenarios, and reducing susceptibility to exploitation during negotiations. Furthermore, we explore the meta-strategic considerations of whether it is rational for agents to adopt such workflows, recognizing that the decision to use or forgo the workflow constitutes a game-theoretic issue in itself.

Our research contributes to a deeper understanding of LLMs' decision-making capabilities in strategic contexts and provides insights into enhancing their rationality through structured workflows. The findings have implications for the development of more robust and strategically sound AI agents capable of navigating complex interactive environments. Code and data supporting this study are available at \url{https://anonymous.4open.science/r/game_theory-B7B5}.

URL: https://openreview.net/forum?id=Q5iCjZmpE4

---

Reply all

Reply to author

Forward

0 new messages