Weekly TMLR digest for Mar 09, 2025

TMLR

Mar 9, 2025, 12:00:12 AM

to tmlr-annou...@googlegroups.com

New certifications
==================

Expert Certification: The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, Nicolas Chapados

https://openreview.net/forum?id=5298fKGmv3

---

Reproducibility Certification: Towards Graph Foundation Models: A Study on the Generalization of Positional and Structural Encodings

Billy Joe Franks, Moshe Eliasof, Semih Cantürk, Guy Wolf, Carola-Bibiane Schönlieb, Sophie Fellenz, Marius Kloft

https://openreview.net/forum?id=mSoDRZXsqj

---

Expert Certification: Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation

Pavel Rumiantsev, Mark Coates

https://openreview.net/forum?id=SbGt90dxdp

---

Reproducibility Certification: Shedding Light on Problems with Hyperbolic Graph Learning

Isay Katsman, Anna Gilbert

https://openreview.net/forum?id=rKAkp1f3R7

---

Accepted papers
===============

Title: The BrowserGym Ecosystem for Web Agent Research

Authors: Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexandre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Graham Neubig, Quentin Cappart, Russ Salakhutdinov, Nicolas Chapados

Abstract: The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic’s latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

URL: https://openreview.net/forum?id=5298fKGmv3

---

Title: Calibrated Probabilistic Forecasts for Arbitrary Sequences

Authors: Charles Marx, Volodymyr Kuleshov, Stefano Ermon

Abstract: Real-world data streams can change unpredictably due to distribution shifts, feedback loops and adversarial actors, which challenges the validity of forecasts. We present a forecasting framework ensuring valid uncertainty estimates regardless of how data evolves. Leveraging the concept of Blackwell approachability from game theory, we introduce a forecasting framework that guarantees calibrated uncertainties for outcomes in any compact space (e.g., classification or bounded regression). We extend this framework to recalibrate existing forecasters, guaranteeing calibration without sacrificing predictive performance. We implement both general-purpose gradient-based algorithms and algorithms optimized for popular special cases of our framework. Empirically, our algorithms improve calibration and downstream decision-making for energy systems.

URL: https://openreview.net/forum?id=nuIUTHGlM5

---

Title: State space models can express $n$-gram languages

Authors: Vinoth Nandakumar, Qiang Qu, Peng Mi, Tongliang Liu

Abstract: Recent advancements in recurrent neural networks (RNNs) have reinvigorated interest in their application to natural language processing tasks, particularly with the development of more efficient and parallelizable variants known as state space models (SSMs), which have shown competitive performance against transformer models while maintaining a lower memory footprint. While RNNs and SSMs (e.g., Mamba) have been empirically more successful than rule-based systems based on $n$-gram models, a rigorous theoretical explanation for this success has not yet been developed, as it is unclear how these models encode the combinatorial rules that govern the next-word prediction task. In this paper, we construct state space language models that can solve the next-word prediction task for languages generated from $n$-gram rules, thereby showing that the former are more expressive. Our proof shows how SSMs can encode $n$-gram rules using new theoretical results on their memorization capacity, and demonstrates how their context window can be controlled by restricting the spectrum of the state transition matrix. We conduct experiments with a small dataset generated from $n$-gram rules to show how our framework can be applied to SSMs and RNNs obtained through gradient-based optimization.

URL: https://openreview.net/forum?id=QlBaDKb370

---

Title: Towards Graph Foundation Models: A Study on the Generalization of Positional and Structural Encodings

Authors: Billy Joe Franks, Moshe Eliasof, Semih Cantürk, Guy Wolf, Carola-Bibiane Schönlieb, Sophie Fellenz, Marius Kloft

Abstract: Recent advances in integrating positional and structural encodings (PSEs) into graph neural networks (GNNs) have significantly enhanced their performance across various graph learning tasks. However, the general applicability of these encodings and their potential to serve as foundational representations for graphs remain uncertain. This paper investigates the fine-tuning efficiency, scalability with sample size, and generalization capability of learnable PSEs across diverse graph datasets. Specifically, we evaluate their potential as universal pre-trained models that can be easily adapted to new tasks with minimal fine-tuning and limited data. Furthermore, we assess the expressivity of the learned representations, particularly, when used to augment downstream GNNs. We demonstrate through extensive benchmarking and empirical analysis that PSEs generally enhance downstream models. However, some datasets may require specific PSE-augmentations to achieve optimal performance. Nevertheless, our findings highlight their significant potential to become integral components of future graph foundation models. We provide new insights into the strengths and limitations of PSEs, contributing to the broader discourse on foundation models in graph learning.

URL: https://openreview.net/forum?id=mSoDRZXsqj

---

Title: Unlearning Personal Data from a Single Image

Authors: Thomas De Min, Massimiliano Mancini, Stéphane Lathuilière, Subhankar Roy, Elisa Ricci

Abstract: Machine unlearning aims to erase data from a model as if the latter never saw them during training. While existing approaches unlearn information from complete or partial access to the training data, this access can be limited over time due to privacy regulations. Currently, no setting or benchmark exists to probe the effectiveness of unlearning methods in such scenarios. To fill this gap, we propose a novel task we call One-Shot Unlearning of Personal Identities (1-SHUI) that evaluates unlearning models when the training data is not available. We focus on unlearning identity data, which is specifically relevant due to current regulations requiring personal data deletion after training. To cope with data absence, we expect users to provide a portrait picture to aid unlearning. We design requests on CelebA, CelebA-HQ, and MUFAC with different unlearning set sizes to evaluate applicable methods in 1-SHUI. Moreover, we propose MetaUnlearn, an effective method that meta-learns to forget identities from a single image. Our findings indicate that existing approaches struggle when data availability is limited, especially when there is a dissimilarity between the provided samples and the training data.

URL: https://openreview.net/forum?id=VxC4PZ71Ym

---

Title: FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

Authors: Vincent Abbott, Gioele Zardini

Abstract: Optimizing deep learning algorithms currently requires slow, manual derivation, potentially leaving much performance untapped. Methods like FlashAttention have achieved a 6x performance improvement over native PyTorch by avoiding unnecessary data transfers, but required three iterations over three years to be developed. Automated compiled methods have consistently lagged behind. This paper extends Neural Circuit Diagrams for deep learning models to consider resource usage and the distribution of tasks across a GPU hierarchy. We show how diagrams can use simple relabellings to derive high-level streaming and tiling optimization strategies along with performance models. We show how this high-level performance model allows the effects of quantization and multi-level GPU hierarchies to be readily considered. We develop a methodology for representing intermediate-level pseudocode with diagrams, allowing hardware-aware algorithms to be derived step-by-step. Finally, we show how our methodology can be used to better understand existing techniques like FlashAttention. This work uses a theoretical framework to link assumptions about GPU behaviour to claims about performance. We aim to lay the groundwork for a scientific approach to GPU optimization where experiments can address clear hypotheses rather than post-hoc rationalizations.

URL: https://openreview.net/forum?id=pF2ukh7HxA

---

Title: Out of Spuriousity: Improving Robustness to Spurious Correlations without Group Annotations

Authors: Phuong Quynh Le, Jörg Schlötterer, Christin Seifert

Abstract: Machine learning models are known to learn spurious correlations, i.e., features that have strong correlations with class labels but no causal relationship. Relying on these correlations leads to poor performance in data groups that do not contain these correlations, and poor generalization. Approaches to mitigate spurious correlations either rely on the availability of group annotations or require access to different model checkpoints to approximate these group annotations. We propose PruSC, a method for extracting a spurious-free subnetwork from a dense network. PruSC does not require prior knowledge of the spurious correlations and is able to mitigate the effect of multiple spurious attributes. Specifically, we observe that ERM training leads to clusters in representation space that are induced by spurious correlations. We then define a supervised contrastive loss to extract a subnetwork that distorts such clusters, forcing the model to learn only class-specific clusters, rather than attribute-class specific clusters. Our method outperforms all annotation-free methods, achieves worst-group accuracy competitive with methods that require annotations and can mitigate the effect of multiple spurious correlations. Our results show that in a fully trained dense network, there exists a subnetwork that uses only invariant features in classification tasks, thereby eliminating the influence of spurious features.

URL: https://openreview.net/forum?id=EEeVYfXor5

---

Title: No Need for Ad-hoc Substitutes: The Expected Cost is a Principled All-purpose Classification Metric

Authors: Luciana Ferrer

Abstract: The expected cost (EC) is one of the main classification metrics introduced in statistical and machine learning books. It is based on the assumption that, for a given application of interest, each decision made by the system has a corresponding cost which depends on the true class of the sample. An evaluation metric can then be defined by taking the expectation of the cost over the data. Two special cases of the EC are widely used in the machine learning literature: the error rate (one minus the accuracy) and the balanced error rate (one minus the balanced accuracy or unweighted average recall). Other instances of the EC can be useful for applications in which some types of errors are more severe than others, or when the prior probabilities of the classes differ between the evaluation data and the use-case scenario. Surprisingly, the general form for the EC is rarely used in the machine learning literature. Instead, alternative ad-hoc metrics like the F-beta score and the Matthews correlation coefficient (MCC) are used for many applications. In this work, we argue that the EC is superior to these alternative metrics, being more general, interpretable, and adaptable to any application scenario. We provide both theoretically-motivated discussions as well as examples to illustrate the behavior of the different metrics.

URL: https://openreview.net/forum?id=5PPbvCExZs
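
As a rough illustration of the metric described in the abstract above, the sketch below computes an expected cost from a confusion matrix. It assumes the common form EC = sum_i sum_j P_i * c_ij * R_ij, where P_i is the prior of true class i, c_ij is the cost of deciding j when the true class is i, and R_ij is the fraction of class-i samples given decision j. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def expected_cost(confusion, costs, priors=None):
    """Expected cost from a confusion matrix (rows: true class, cols: decision).

    Assumed form: EC = sum_i sum_j priors[i] * costs[i][j] * R[i][j],
    with R[i][j] = confusion[i][j] / (number of samples of class i).
    If priors is None, the empirical class frequencies are used.
    """
    class_counts = [sum(row) for row in confusion]
    total = sum(class_counts)
    if priors is None:
        priors = [n / total for n in class_counts]
    ec = 0.0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            rate = count / class_counts[i]
            ec += priors[i] * costs[i][j] * rate
    return ec


# Hypothetical two-class example: 100 samples of class 0, 50 of class 1.
confusion = [[90, 10], [20, 30]]
zero_one_costs = [[0, 1], [1, 0]]

# 0-1 costs with empirical priors give the error rate.
error_rate = expected_cost(confusion, zero_one_costs)
# 0-1 costs with uniform priors give the balanced error rate.
balanced_error_rate = expected_cost(confusion, zero_one_costs, priors=[0.5, 0.5])
# Hypothetical asymmetric costs: missing class 1 is five times worse.
asymmetric_cost = expected_cost(confusion, [[0, 1], [5, 0]])
```

Changing only the `costs` and `priors` arguments recovers the two special cases named in the abstract, and the same function covers applications where some errors are more severe than others.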

---

Title: Generalized Tangent Kernel: A Unified Geometric Foundation for Natural Gradient and Standard Gradient

Authors: Qinxun Bai, Steven Rosenberg, Wei Xu

Abstract: Natural gradients have been widely studied from both theoretical and empirical perspectives, and it is commonly believed that natural gradients have advantages over standard (Euclidean) gradients in capturing the intrinsic geometric structure of the underlying function space and being invariant under reparameterization. However, for function optimization, a fundamental theoretical issue regarding the existence of natural gradients on the function space remains underexplored. We address this issue by providing a geometric perspective and mathematical framework for studying both natural gradient and standard gradient that is more complete than existing studies. The key tool that unifies natural gradient and standard gradient is a generalized form of the Neural Tangent Kernel (NTK), which we name the Generalized Tangent Kernel (GTK). Using a novel orthonormality property of GTK, we show that for a fixed parameterization, GTK determines a Riemannian metric on the entire function space which makes the standard gradient as “natural” as the natural gradient in capturing the intrinsic structure of the parameterized function space. Many aspects of this approach relate to RKHS theory. For the practical side of this theory paper, we showcase that our framework motivates new solutions to the non-immersion/degenerate case of natural gradient and leads to new families of natural/standard gradient descent methods.

URL: https://openreview.net/forum?id=HOnL5hjaIt

---

Title: GeoMask3D: Geometrically Informed Mask Selection for Self-Supervised Point Cloud Learning in 3D

Authors: Ali Bahri, Moslem Yazdanpanah, Mehrdad Noori, Milad Cheraghalikhani, Gustavo Adolfo Vargas Hakim, David OSOWIECHI, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers

Abstract: We introduce a novel approach to self-supervised learning for point clouds, employing a geometrically informed mask selection strategy called GeoMask3D (GM3D) to boost the efficiency of Masked Auto Encoders (MAE). Unlike the conventional method of random masking, our technique utilizes a teacher-student model to focus on intricate areas within the data, guiding the model’s focus toward regions with higher geometric complexity. This strategy is grounded in the hypothesis that concentrating on harder patches yields a more robust feature representation, as evidenced by the improved performance on downstream tasks. Our method also presents a feature-level knowledge distillation technique designed to guide the prediction of geometric complexity, which utilizes a comprehensive context from feature-level information. Extensive experiments confirm our method’s superiority over State-Of-The-Art (SOTA) baselines, demonstrating marked improvements in classification, segmentation, and few-shot tasks.

URL: https://openreview.net/forum?id=Yk7GUlJwGa

---

Title: Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Authors: Shenao Zhang, Donghan Yu, Hiteshi Sharma, Han Zhong, Zhihan Liu, Ziyi Yang, Shuohang Wang, Hany Hassan Awadalla, Zhaoran Wang

Abstract: Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings.

URL: https://openreview.net/forum?id=FoQK84nwY3

---

Title: Path-Specific Counterfactual Fairness via Dividend Correction

Authors: Daisuke Hatano, Satoshi Hara, Hiromi Arai

Abstract: Counterfactual fairness is a fundamental principle in machine learning that allows the analysis of the effects of sensitive attributes in each individual decision by integrating the knowledge of causal graphs. An issue in dealing with counterfactual fairness is that unfair causal effects are often context-specific, influenced by religious, cultural, and national differences, making it difficult to create a universally applicable model. This leads to the challenge of dealing with frequent adaptation to changes in fairness assessments when localizing a model. Thus, applicability across a variety of models and efficiency become necessary to meet this challenge. We propose the first efficient post-process approach to achieve path-specific counterfactual fairness by adjusting a model's outputs based on a given causal graph. This approach is model-agnostic, prioritizing flexibility and generalizability to deliver robust results across various domains and model architectures. By means of mathematical tools from cooperative game theory, the Möbius inversion formula and dividends, we demonstrate that our post-process approach can be executed efficiently. We empirically show that the proposed algorithm outperforms existing in-process approaches for path-specific counterfactual fairness and a post-process approach for counterfactual fairness.

URL: https://openreview.net/forum?id=RXoSmiyObR

---

Title: KAGNNs: Kolmogorov-Arnold Networks meet Graph Learning

Authors: Roman Bresson, Giannis Nikolentzos, George Panagopoulos, Michail Chatzianastasis, Jun Pang, Michalis Vazirgiannis

Abstract: In recent years, Graph Neural Networks (GNNs) have become the de facto tool for learning node and graph representations. Most GNNs typically consist of a sequence of neighborhood aggregation (a.k.a., message-passing) layers, within which the representation of each node is updated based on those of its neighbors. The most expressive message-passing GNNs can be obtained through the use of the sum aggregator and of MLPs for feature transformation, thanks to their universal approximation capabilities. However, the limitations of MLPs recently motivated the introduction of another family of universal approximators, called Kolmogorov-Arnold Networks (KANs) which rely on a different representation theorem. In this work, we compare the performance of KANs against that of MLPs on graph learning tasks. We implement three new KAN-based GNN layers, inspired respectively by the GCN, GAT and GIN layers. We evaluate two different implementations of KANs using two distinct base families of functions, namely B-splines and radial basis functions. We perform extensive experiments on node classification, link prediction, graph classification and graph regression datasets. Our results indicate that KANs are on-par with or better than MLPs on all tasks studied in this paper. We also show that the size and training speed of RBF-based KANs is only marginally higher than for MLPs, making them viable alternatives. Code available at https://github.com/RomanBresson/KAGNN.

URL: https://openreview.net/forum?id=03UB1MCAMr

---

Title: Buffer-based Gradient Projection for Continual Federated Learning

Authors: Shenghong Dai, Jy-yong Sohn, Yicong Chen, S M Iftekharul Alam, Ravikumar Balakrishnan, Suman Banerjee, Nageen Himayat, Kangwook Lee

Abstract: Continual Federated Learning (CFL) is essential for enabling real-world applications where multiple decentralized clients adaptively learn from continuous data streams. A significant challenge in CFL is mitigating catastrophic forgetting, where models lose previously acquired knowledge when learning new information. Existing approaches often face difficulties due to the constraints of device storage capacities and the heterogeneous nature of data distributions among clients. While some CFL algorithms have addressed these challenges, they frequently rely on unrealistic assumptions about the availability of task boundaries (i.e., knowing when new tasks begin). To address these limitations, we introduce Fed-A-GEM, a federated adaptation of the A-GEM method, which employs a buffer-based gradient projection approach. Fed-A-GEM alleviates catastrophic forgetting by leveraging local buffer samples and aggregated buffer gradients, thus preserving knowledge across multiple clients. Our method is combined with existing CFL techniques, enhancing their performance in the CFL context. Our experiments on standard benchmarks show consistent performance improvements across diverse scenarios. For example, in a task-incremental learning scenario using the CIFAR-100 dataset, our method can increase the accuracy by up to 27%. Our code is available at https://github.com/shenghongdai/Fed-A-GEM.

URL: https://openreview.net/forum?id=Xz5IcOizQ6
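
For readers unfamiliar with the gradient projection mentioned in the abstract above, here is a minimal sketch of the projection rule used by the original A-GEM method: when the current gradient conflicts with a reference gradient computed on buffer samples (negative inner product), the conflicting component is removed. How Fed-A-GEM aggregates buffer gradients across clients is not specified in the abstract and is not shown; the function name and the toy vectors are hypothetical.

```python
def project_gradient(grad, ref_grad):
    """A-GEM-style projection of grad against a reference (buffer) gradient.

    If grad and ref_grad do not conflict (non-negative inner product),
    grad is returned unchanged. Otherwise the component of grad along
    ref_grad is subtracted, so the update no longer opposes ref_grad.
    """
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    ref_norm_sq = sum(r * r for r in ref_grad)
    if dot >= 0 or ref_norm_sq == 0:
        return list(grad)
    scale = dot / ref_norm_sq
    return [g - scale * r for g, r in zip(grad, ref_grad)]


# Toy example: the second coordinate of the new gradient opposes the reference.
conflicting = project_gradient([1.0, -1.0], [0.0, 1.0])
# Toy example: no conflict, so the gradient passes through unchanged.
compatible = project_gradient([1.0, 1.0], [0.0, 1.0])
```

After projection, the inner product between the update and the reference gradient is non-negative, which is the property that limits interference with previously learned data in this family of methods.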

---

Title: The 2024 Foundation Model Transparency Index

Authors: Rishi Bommasani, Kevin Klyman, Sayash Kapoor, Shayne Longpre, Betty Xiong, Nestor Maslej, Percy Liang

Abstract: Foundation models are increasingly consequential yet extremely opaque. To characterize the status quo, the Foundation Model Transparency Index was launched in October 2023 to measure the transparency of leading foundation model developers. The October 2023 Index (v1.0) assessed 10 major foundation model developers (e.g. OpenAI, Google) on 100 transparency indicators (e.g. does the developer disclose the wages it pays for data labor?). At the time, developers publicly disclosed very limited information with the average score being 37 out of 100. To understand how the status quo has changed, we conduct a follow-up study (v1.1) after 6 months: we score 14 developers against the same 100 indicators. While in v1.0 we searched for publicly available information, in v1.1 developers submit reports on the 100 transparency indicators, potentially including information that was not previously public. We find that developers now score 58 out of 100 on average, a 21 point improvement over v1.0. Much of this increase is driven by developers disclosing information during the v1.1 process: on average, developers disclosed information related to 16.6 indicators that was not previously public. We observe regions of sustained (i.e. across v1.0 and v1.1) and systemic (i.e. across most or all developers) opacity such as on copyright status, data access, data labor, and downstream impact. We publish transparency reports for each developer that consolidate information disclosures: these reports are based on the information disclosed to us via developers. Our findings demonstrate that transparency can be improved in this nascent ecosystem, the Foundation Model Transparency Index likely contributes to these improvements, and policymakers should consider interventions in areas where transparency has not improved.

URL: https://openreview.net/forum?id=38cwP8xVxD

---

Title: How to Leverage Predictive Uncertainty Estimates for Reducing Catastrophic Forgetting in Online Continual Learning

Authors: Giuseppe Serra, Ben Werner, Florian Buettner

Abstract: Many real-world applications require machine-learning models to be able to deal with non-stationary data distributions and thus learn autonomously over an extended period of time, often in an online setting. One of the main challenges in this scenario is the so-called catastrophic forgetting (CF) for which the learning model tends to focus on the most recent tasks while experiencing predictive degradation on older ones. In the online setting, the most effective solutions employ a fixed-size memory buffer to store old samples used for replay when training on new tasks. Many approaches have been presented to tackle this problem and conflicting strategies are proposed to populate the memory. Are the easiest-to-forget or the easiest-to-remember samples more effective in combating CF? Furthermore, it is not clear how predictive uncertainty information for memory management can be leveraged in the most effective manner. Starting from the intuition that predictive uncertainty provides an idea of the samples' location in the decision space, this work presents an in-depth analysis of different uncertainty estimates and strategies for populating the memory. The investigation provides a better understanding of the characteristics data points should have for alleviating CF. Then, we propose an alternative method for estimating predictive uncertainty via the generalised variance induced by the negative log-likelihood. Finally, we demonstrate that the use of predictive uncertainty measures helps in reducing CF in different settings.

URL: https://openreview.net/forum?id=dczXe0S1oL

---

Title: An elementary concentration bound for Gibbs measures arising in statistical learning theory

Authors: Kelly Ramsay, Aukosh Jagannath, Shojaeddin Chenouri

Abstract: We present an elementary concentration bound for Gibbs measures whose log-likelihood is a function of the empirical risk. This bound controls the distance between samples from the (random) Gibbs measure and the minimizers of the population risk function. This bound is a generalization of a recent inequality developed by Ramsay et al., 2024. As a corollary, we obtain sample complexity bounds and bounds on the inverse temperature so that the samples are within a prescribed error of the population value. The latter bound on the inverse temperature is essentially sharp. We demonstrate our work on three canonical classes of examples: classification of two component mixture models, robust regression, and spiked matrix and tensor models.

URL: https://openreview.net/forum?id=ZInwrlkQ3f

---

Title: Random Walk Diffusion for Efficient Large-Scale Graph Generation

Authors: Tobias Bernecker, Ghalia Rehawi, Francesco Paolo Casale, Janine Knauer-Arloth, Annalisa Marsico

Abstract: Graph generation addresses the problem of generating new graphs that have a data distribution similar to real-world graphs. While previous diffusion-based graph generation methods have shown promising results, they often struggle to scale to large graphs. In this work, we propose ARROW-Diff (AutoRegressive RandOm Walk Diffusion), a novel random walk-based diffusion approach for efficient large-scale graph generation. Our method encompasses two components in an iterative process of random walk sampling and graph pruning. We demonstrate that ARROW-Diff can scale to large graphs efficiently, surpassing other baseline methods in terms of both generation time and multiple graph statistics, reflecting the high quality of the generated graphs.

URL: https://openreview.net/forum?id=tSFpsfndE7

---

Title: Learning Linear Polytree Structural Equation Model

Authors: Xingmei Lou, Yu Hu, Xiaodong Li

Abstract: We are interested in the problem of learning the directed acyclic graph (DAG) when data are generated from a linear structural equation model (SEM) and the causal structure can be characterized by a polytree. Under the Gaussian polytree models, we study sufficient conditions on the sample sizes for the well-known Chow-Liu algorithm to exactly recover both the skeleton and the equivalence class of the polytree, which is uniquely represented by a CPDAG. On the other hand, necessary conditions on the required sample sizes for both skeleton and CPDAG recovery are also derived in terms of information-theoretic lower bounds, which match the respective sufficient conditions and thereby give a sharp characterization of the difficulty of these tasks. We also consider the problem of inverse correlation matrix estimation under the linear polytree models, and establish the estimation error bound in terms of the dimension and the total number of v-structures. We also consider an extension of group linear polytree models, in which each node represents a group of variables. Our theoretical findings are illustrated by comprehensive numerical simulations, and experiments on benchmark data also demonstrate the robustness of polytree learning when the true graphical structures can only be approximated by polytrees.

URL: https://openreview.net/forum?id=N28FdYO2sH
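
The skeleton-recovery step named in the abstract above can be sketched as follows. For jointly Gaussian variables, pairwise mutual information increases with the absolute correlation, so the Chow-Liu tree can be found as a maximum-weight spanning tree over absolute correlations. This is a generic illustration of that step, not the authors' code; the function name and the example correlation matrix are hypothetical, and the orientation step that yields the CPDAG is not shown.

```python
def chow_liu_skeleton(corr):
    """Return the edges of a maximum-weight spanning tree over |correlation|.

    corr is a symmetric matrix (list of lists) of pairwise correlations.
    Uses Kruskal's algorithm with a simple union-find structure.
    """
    n = len(corr)
    # Candidate edges sorted by decreasing absolute correlation.
    edges = sorted(
        ((abs(corr[i][j]), i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,
    )
    parent = list(range(n))

    def find(x):
        # Find the component root, compressing the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # adding the edge does not create a cycle
            parent[root_i] = root_j
            tree.append((i, j))
    return sorted(tree)


# Hypothetical sample correlations for three variables in a chain 0 - 1 - 2.
corr = [
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.5],
    [0.4, 0.5, 1.0],
]
skeleton = chow_liu_skeleton(corr)
```

In practice the matrix would be estimated from data, and the sample-size conditions studied in the paper concern when this estimated tree matches the true skeleton.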

---

Title: Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation

Authors: Pavel Rumiantsev, Mark Coates

Abstract: Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, a fast ranking function is used to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the evaluated architecture's weights' initialization or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost performance of a search on standard benchmark search spaces.

URL: https://openreview.net/forum?id=SbGt90dxdp
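
The contrast between averaging a ranking function and comparing its outputs as random variables can be illustrated with a small sketch. This is not the authors' procedure: the pairwise estimate of P(A > B), the 0.5 margin, and the scores are all hypothetical choices made only for illustration.

```python
from statistics import mean


def prob_greater(a_scores, b_scores):
    """Estimate P(A > B) + 0.5 * P(A == B) from two samples of scores."""
    wins = 0.0
    for a in a_scores:
        for b in b_scores:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(a_scores) * len(b_scores))


def pick_by_mean(a_scores, b_scores):
    """Common baseline: average repeated evaluations, then compare."""
    return "A" if mean(a_scores) >= mean(b_scores) else "B"


def pick_by_ordering(a_scores, b_scores, margin=0.5):
    """Assumed alternative: prefer A if it tends to score above B."""
    return "A" if prob_greater(a_scores, b_scores) >= margin else "B"


# Hypothetical repeated evaluations of two architectures.
# Architecture A has one outlier caused by an unlucky initialization.
arch_a = [0.90, 0.92, 0.91, 0.93, 0.10]
arch_b = [0.80, 0.82, 0.81, 0.83, 0.79]

print("by mean:    ", pick_by_mean(arch_a, arch_b))
print("by ordering:", pick_by_ordering(arch_a, arch_b))
```

In this toy case a single outlier drags the mean of A below that of B, while the pairwise comparison still prefers A in most draws.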

---

Title: DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Authors: Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz

Abstract: Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.

URL: https://openreview.net/forum?id=6nBIweDYzZ

---

Title: Optimizing Estimators of Squared Calibration Errors in Classification

Authors: Sebastian Gregor Gruber, Francis R. Bach

Abstract: In this work, we propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors in practical settings.
Improving the calibration of classifiers is crucial for enhancing the trustworthiness and interpretability of machine learning models, especially in sensitive decision-making scenarios.
Although various calibration (error) estimators exist in the current literature, there is a lack of guidance on selecting the appropriate estimator and tuning its hyperparameters.
By leveraging the bilinear structure of squared calibration errors, we reformulate calibration estimation as a regression problem with independent and identically distributed (i.i.d.) input pairs.
This reformulation allows us to quantify the performance of different estimators even for the most challenging calibration criterion, known as canonical calibration.
Our approach advocates for a training-validation-testing pipeline when estimating a calibration error on an evaluation dataset.
We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on standard image classification tasks.

URL: https://openreview.net/forum?id=BPDVZajOW5

---

Title: Reset-free Reinforcement Learning with World Models

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat, Edward S. Hu

Abstract: Reinforcement learning (RL) is an appealing paradigm for training intelligent agents, enabling policy acquisition from the agent's own autonomously acquired experience. However, the training process of RL is far from automatic, requiring extensive human effort to reset the agent and environments. To tackle the challenging reset-free setting, we first demonstrate the superiority of model-based (MB) RL methods in such a setting, showing that a straightforward adaptation of MBRL can outperform all the prior state-of-the-art methods while requiring less supervision. We then identify limitations inherent to this direct extension and propose a solution called model-based reset-free (MoReFree) agent, which further enhances the performance. MoReFree adapts two key mechanisms, exploration and policy learning, to handle reset-free tasks by prioritizing task-relevant states. It exhibits superior data-efficiency across various reset-free tasks without access to environmental reward or demonstrations while significantly outperforming privileged baselines that require supervision. Our findings suggest model-based methods hold significant promise for reducing human effort in RL. Website: https://yangzhao-666.github.io/morefree

URL: https://openreview.net/forum?id=ZdMIXltJzK

---

Title: Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance

Authors: Qi Zhang, Yi Zhou, Shaofeng Zou

Abstract: This paper provides the first tight convergence analyses for RMSProp and Adam for non-convex optimization under the most relaxed assumptions of coordinate-wise generalized smoothness and affine noise variance.
We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. Specifically, to address the challenges arising from the dependence among the adaptive update, the unbounded gradient estimate, and the Lipschitz constant, we demonstrate that the first-order term in the descent lemma converges and its denominator is upper bounded by a function of the gradient norm. Based on this result, we show that RMSProp with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal O(\epsilon^{-4})$. We then generalize our analysis to Adam, where the additional challenge is due to a mismatch between the gradient and the first-order momentum. We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm. We show that Adam with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal O(\epsilon^{-4})$. Our complexity results for both RMSProp and Adam match the complexity lower bound established in Arjevani et al. (2023).

URL: https://openreview.net/forum?id=QIzRdjIWnS
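
The relationship between the two optimizers can be sketched in a few lines: an Adam-style update with the first-order momentum coefficient set to zero reduces to RMSProp. This is a minimal illustration only. The function name, the hyperparameter values, the absence of bias correction, and the deterministic quadratic test problem are all assumptions, and none of them reflects the exact algorithmic variants or the noisy setting analyzed in the paper.

```python
import math


def adaptive_minimize(grad, x0, lr=0.01, beta1=0.0, beta2=0.99,
                      eps=1e-8, steps=2000):
    """Adam-style update without bias correction (an assumption).

    With beta1 = 0.0 the momentum m equals the current gradient,
    so the update is plain RMSProp.
    """
    x = list(x0)
    m = [0.0] * len(x)  # first-order momentum
    v = [0.0] * len(x)  # running average of squared gradients
    for _ in range(steps):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            x[i] -= lr * m[i] / (math.sqrt(v[i]) + eps)
    return x


def quad_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2)."""
    return [x[0], 10.0 * x[1]]


def norm(vec):
    return math.sqrt(sum(c * c for c in vec))


start = [1.0, -1.0]
x_final = adaptive_minimize(quad_grad, start)  # beta1 = 0.0, i.e. RMSProp
initial_grad_norm = norm(quad_grad(start))
final_grad_norm = norm(quad_grad(x_final))
print("gradient norm before:", initial_grad_norm)
print("gradient norm after: ", final_grad_norm)
```

The final gradient norm is the quantity an $\epsilon$-stationarity criterion would check against $\epsilon$.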

---

Title: HoSNNs: Adversarially-Robust Homeostatic Spiking Neural Networks with Adaptive Firing Thresholds

Authors: Hejia Geng, Peng Li

Abstract: While spiking neural networks (SNNs) offer a promising neurally-inspired model of computation, they are vulnerable to adversarial attacks. We present the first study that draws inspiration from neural homeostasis to design a threshold-adapting leaky integrate-and-fire (TA-LIF) neuron model and utilize TA-LIF neurons to construct the adversarially robust homeostatic SNNs (HoSNNs) for improved robustness. The TA-LIF model incorporates a self-stabilizing dynamic thresholding mechanism, offering a local feedback control solution to the minimization of each neuron's membrane potential error caused by adversarial disturbance. Theoretical analysis demonstrates favorable dynamic properties of TA-LIF neurons in terms of the bounded-input bounded-output stability and suppressed time growth of membrane potential error, underscoring their superior robustness compared with the standard LIF neurons. When trained with weak FGSM attacks ($\epsilon = 2/255$), our HoSNNs significantly outperform conventionally trained LIF-based SNNs across multiple datasets. Furthermore, under significantly stronger PGD7 attacks ($\epsilon = 8/255$), HoSNN achieves notable improvements in accuracy, increasing from 30.90% to 74.91% on FashionMNIST, 0.44% to 36.82% on SVHN, 0.54% to 43.33% on CIFAR10, and 0.04% to 16.66% on CIFAR100.

URL: https://openreview.net/forum?id=UV58hNygne

---

Title: A Self-Explainable Heterogeneous GNN for Relational Deep Learning

Authors: Francesco Ferrini, Antonio Longa, Andrea Passerini, Manfred Jaeger

Abstract: Recently, significant attention has been given to the idea of viewing relational databases as heterogeneous graphs, enabling the application of graph neural network (GNN) technology for predictive tasks. However, existing GNN methods struggle with the complexity of the heterogeneous graphs induced by databases with numerous tables and relations. Traditional approaches either consider all possible relational meta-paths, thus failing to scale with the number of relations, or rely on domain experts to identify relevant meta-paths. A recent solution does manage to learn informative meta-paths without expert supervision, but assumes that a node’s class depends solely on the existence of a meta-path occurrence. In this work, we present a self-explainable heterogeneous GNN for relational data that supports models in which class membership depends on aggregate information obtained from multiple occurrences of a meta-path. Experimental results show that in the context of relational databases, our approach effectively identifies informative meta-paths that faithfully capture the model’s reasoning mechanisms. It significantly outperforms existing methods in both synthetic and real-world scenarios.

URL: https://openreview.net/forum?id=8Q4qxe9a9Z

---

Title: Long Short-Term Imputer: Handling Consecutive Missing Values in Time Series

Authors: Jiacheng You, Xinyang Chen, Yu Sun, Weili Guan, Liqiang Nie

Abstract: Encountered frequently in time series data, missing values can significantly impede time-series analysis. With the progression of deep learning, advanced imputation models delve into the temporal dependencies inherent in time series data, showcasing remarkable performance. This positions them as intuitive selections for time series imputation tasks which assume "Missing Completely at Random". Nonetheless, long-interval consecutive missing values may obstruct the model's ability to grasp long-term temporal dependencies, consequently hampering imputation performance. To tackle this challenge, we propose Long Short-term Imputer (LSTI) to impute consecutive missing values over intervals of different lengths. Long-term Imputer is designed using the idea of bi-directional autoregression. A forward prediction model and a backward prediction model are trained with a consistency regularization, which is designed to capture long-time dependency and can adapt to long-interval consecutive missing values. Short-term Imputer is designed to capture short-time dependency and can impute the short-interval consecutive missing values effectively. A meta-weighting network is then proposed to take advantage of the strengths of the two imputers. As a result, LSTI can impute consecutive missing values with different intervals effectively. Experiments demonstrate that our approach, on average, reduces the error by 57.4% compared to state-of-the-art deep models across five datasets.

URL: https://openreview.net/forum?id=9NVJ0ZgEfT

---

Title: Evolution of Discriminator and Generator Gradients in GAN Training: From Fitting to Collapse

Authors: Weiguo Gao, Ming Li

Abstract: Generative Adversarial Networks (GANs) are powerful generative models but often suffer from mode mixture and mode collapse. We propose a perspective that views GAN training as a two-phase progression from fitting to collapse, where mode mixture and mode collapse are treated as inter-connected. Inspired by the particle model interpretation of GANs, we leverage the discriminator gradient to analyze particle movement and the generator gradient, specifically "steepness," to quantify the severity of mode mixture by measuring the generator's sensitivity to changes in the latent space. Using these theoretical insights into evolution of gradients, we design a specialized metric that integrates both gradients to detect the transition from fitting to collapse. This metric forms the basis of an early stopping algorithm, which stops training at a point that retains sample quality and diversity. Experiments on synthetic and real-world datasets, including MNIST, Fashion MNIST, and CIFAR-10, validate our theoretical findings and demonstrate the effectiveness of the proposed algorithm.

URL: https://openreview.net/forum?id=58gPkcVbFL

---

Title: Shedding Light on Problems with Hyperbolic Graph Learning

Authors: Isay Katsman, Anna Gilbert

Abstract: Recent papers in the graph machine learning literature have introduced a number of approaches for hyperbolic representation learning. The asserted benefits are improved performance on a variety of graph tasks, node classification and link prediction included. Claims have also been made about the geometric suitability of particular hierarchical graph datasets to representation in hyperbolic space. Despite these claims, our work makes a surprising discovery: when simple Euclidean models with comparable numbers of parameters are properly trained in the same environment, in most cases, they perform as well, if not better, than all introduced hyperbolic graph representation learning models, even on graph datasets previously claimed to be the most hyperbolic as measured by Gromov $\delta$-hyperbolicity (i.e., perfect trees). This observation gives rise to a simple question: how can this be? We answer this question by taking a careful look at the field of hyperbolic graph representation learning as it stands today, and find that a number of results do not diligently present baselines, make faulty modelling assumptions when constructing algorithms, and use misleading metrics to quantify geometry of graph datasets. We take a closer look at each of these three problems, elucidate the issues, perform an analysis of methods, and introduce a parametric family of benchmark datasets to ascertain the applicability of (hyperbolic) graph neural networks.

URL: https://openreview.net/forum?id=rKAkp1f3R7

---

New submissions
===============

Title: Variance Dichotomy in Feature Spaces of Facial Recognition Systems is a Weak Defense from Simple Weight Manipulation Attacks

Abstract: We analyze and amend a powerful scheme for anonymity/unlinkability and confusion attacks on facial recognition systems devised by Zehavi et al. (2024), which is based on simple weight manipulations in only the last hidden layer. We consider several leading pretrained networks, and show that they exhibit a variance dichotomy in their feature spaces, which causes the benign accuracy of the attacked system to decrease fast as the number of sequentially installed backdoors increases. We then propose a method for the attacker to overcome this intrinsic defense, and thereby significantly increase the number of backdoors which might avoid detection. We support and explain our empirical findings by a numerical analysis in a streamlined setting based on orthogonal projections of random vectors.

URL: https://openreview.net/forum?id=Q1Cf07flwD

---

Title: Conformal Bounds on Full-Reference Image Quality for Imaging Inverse Problems

Abstract: In imaging inverse problems, we would like to know how close the recovered image is to the true image in terms of full-reference image quality (FRIQ) metrics like PSNR, SSIM, LPIPS, etc. This is especially important in safety-critical applications like medical imaging, where knowing that, say, the SSIM was poor could potentially avoid a costly misdiagnosis. But since we don’t know the true image, computing FRIQ is non-trivial. In this work, we combine conformal prediction with approximate posterior sampling to construct bounds on FRIQ that are guaranteed to hold up to a user-specified error probability. We demonstrate our approach on image denoising and accelerated magnetic resonance imaging (MRI) problems.

URL: https://openreview.net/forum?id=WADLPccB6o
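
The general idea of a conformal bound on a quality metric can be sketched with standard split conformal prediction. This is not the paper's method: it assumes a calibration set of (estimated PSNR, true PSNR) pairs is available, that the estimates come from some upstream procedure such as posterior sampling, and that calibration and test samples are exchangeable. The function name and all numbers are hypothetical.

```python
import math


def conformal_lower_bound(cal_estimates, cal_truths, test_estimate, alpha=0.1):
    """Split-conformal lower bound on a quality metric such as PSNR.

    Scores measure how much each estimate overshoots the truth. Under
    exchangeability, the true test value is at least the returned bound
    with probability >= 1 - alpha.
    """
    scores = sorted(e - t for e, t in zip(cal_estimates, cal_truths))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal quantile index
    if k > n:
        # Too few calibration samples for this alpha: only a trivial bound.
        return -math.inf
    return test_estimate - scores[k - 1]


# Hypothetical calibration data: estimated vs. true PSNR in dB.
cal_est = [31.0, 30.5, 29.5, 32.0, 30.0, 31.5, 29.0, 30.5, 31.0]
cal_true = [30.0] * 9

bound = conformal_lower_bound(cal_est, cal_true, test_estimate=30.0, alpha=0.1)
print("PSNR lower bound (dB):", bound)
```

For metrics where lower is better, such as LPIPS, the same construction with the sign of the score flipped would give an upper bound instead.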

---

Title: NeoBERT: A Next Generation BERT

Abstract: Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT$_{large}$, RoBERTa$_{large}$, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.

URL: https://openreview.net/forum?id=TJRyDi7mwH

---

Title: Fast online node labeling with graph subsampling

Abstract: Large data applications rely on storing data in massive, sparse graphs with millions to trillions of nodes. Graph-based methods, such as node prediction, aim for computational efficiency regardless of graph size. Techniques like localized approximate personalized PageRank (APPR) solve sparse linear systems with complexity that is independent of graph size but depends on the maximum node degree, which in real-world large graphs can be much larger than the average node degree. In this paper, we consider an online subsampled APPR method, where messages are intentionally dropped at random. We use tools from graph sparsifiers and matrix linear algebra to give approximation bounds on the graph's spectral properties ($O(1/\epsilon^2)$ edges), and node classification performance (added $O(n\epsilon)$ overhead).

URL: https://openreview.net/forum?id=VZErRqUUER

---

Title: Class-wise Generalization Error: an Information-Theoretic analysis

Abstract: Existing generalization theories for supervised learning typically take a holistic approach and provide bounds for the expected generalization over the whole data distribution, which implicitly assumes that the model generalizes similarly for all different classes. In practice, however, there are significant variations in generalization performance among different classes, which cannot be captured by the existing generalization bounds. In this work, we tackle this problem by theoretically studying the class-generalization error, which quantifies the generalization performance of the model for each individual class. We derive a novel information-theoretic bound for class-generalization error using the KL divergence, and we further obtain several tighter bounds using recent advances in conditional mutual information bound, which enables practical evaluation. We empirically validate our proposed bounds in various neural networks and show that they accurately capture the complex class-generalization behavior. Moreover, we demonstrate that the theoretical tools developed in this work can be applied in several other applications.

URL: https://openreview.net/forum?id=asW4VcDFpi

---

Title: Agreement-Based Cascading for Efficient Inference

Abstract: Adaptive inference schemes reduce the cost of machine learning inference by assigning smaller models to easier examples, attempting to avoid invocation of larger models when possible. In this work we explore a simple, effective adaptive inference technique we term Agreement-Based Cascading (ABC). ABC builds a cascade of models of increasing size/complexity and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. Although ensemble execution introduces additional expense, we show that these costs can be easily offset in practice due to large expected differences in model sizes, parallel inference execution capabilities, and accuracy benefits of ensembling. We examine ABC theoretically and empirically in terms of these parameters, showing that the approach can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy. Additionally, we explore the performance of ABC relative to existing cascading methods in three common scenarios: (1) edge-to-cloud inference, where ABC reduces communication costs by up to 14x; (2) cloud-based model serving, where it achieves a 3x reduction in rental costs; and (3) inference via model API services, where ABC achieves a 2-25x reduction in average price per token/request relative to state-of-the-art LLM cascades.

URL: https://openreview.net/forum?id=jn9B7LMlzk
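
The routing idea described in the abstract can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the unanimous-vote rule (`min_agreement=1.0`), the function name, and the toy threshold "models" are hypothetical.

```python
from collections import Counter


def agreement_cascade(x, levels, min_agreement=1.0):
    """Route an input through ensembles ordered from small to large.

    `levels` is a list of ensembles, each a list of callables returning
    a label. The cascade stops at the first level whose majority vote
    share reaches `min_agreement`. The last level always answers.
    Returns (label, index of the level that answered).
    """
    for depth, ensemble in enumerate(levels):
        votes = Counter(model(x) for model in ensemble)
        label, count = votes.most_common(1)[0]
        is_last = depth == len(levels) - 1
        if is_last or count / len(ensemble) >= min_agreement:
            return label, depth


# Toy stand-ins: three cheap classifiers with slightly different
# decision thresholds, followed by one "large" classifier.
small_ensemble = [
    lambda x: int(x > 0.4),
    lambda x: int(x > 0.5),
    lambda x: int(x > 0.6),
]
large_ensemble = [lambda x: int(x > 0.5)]
levels = [small_ensemble, large_ensemble]

for x in (0.9, 0.45, 0.1):
    print(x, agreement_cascade(x, levels))
```

Easy inputs, on which the cheap models agree, never reach the large model. Only the ambiguous input near the decision boundary is escalated.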

---

Title: EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Abstract: Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with instructions and processed by the language model to generate high-quality responses. Despite significant progress in enhancing the language component, challenges persist in optimally fusing visual encodings within the language model for task-specific adaptability. Recent research has focused on improving this fusion through modality adaptation modules but at the cost of significantly increased model complexity and training data needs. In this paper, we propose EMMA (Efficient Multi-Modal Adaptation), a lightweight cross-modality module designed to efficiently fuse visual and textual encodings, generating instruction-aware visual representations for the language model. Our key contributions include: (1) an efficient early fusion mechanism that integrates vision and language representations with minimal added parameters (less than 0.2% increase in model size); (2) an in-depth interpretability analysis that sheds light on the internal mechanisms of the proposed method; (3) comprehensive experiments that demonstrate notable improvements on both specialized and general benchmarks for MLLMs. Empirical results show that EMMA boosts performance across multiple tasks by up to 9.3% while significantly improving robustness against hallucinations.

URL: https://openreview.net/forum?id=lbrO3bGpeO

---

Title: Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models

Abstract: This paper presents a reproducibility study examining how Large Language Models (LLMs) manage competing factual and counterfactual information, focusing on the role of attention heads in this process. We attempt to reproduce and reconcile findings from three recent studies by Ortu et al. [16], Yu, Merullo, and Pavlick [21] and McDougall et al. [8] that investigate the competition between model-learned facts and contradictory context information through Mechanistic Interpretability tools. Our study specifically examines the relationship between attention head strength and factual output ratios, evaluates competing hypotheses about attention heads' suppression mechanisms, and investigates the domain specificity of these attention patterns. Through this analysis, we aim to provide a clearer understanding of how different model components contribute to managing conflicting information in LLMs.

URL: https://openreview.net/forum?id=1QrB5WSWOR

---

Title: Reproducibility study of: "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"

Abstract: This paper presents a reproducibility study of Ortu et al. (2024), investigating the competition of the factual recall and counterfactual in-context adaptation mechanisms in GPT-2. We extend experiments developed by the original authors with softmax-normalized logits as another metric for gauging the evolution of the scoring of tokens in the model. Our reproduced and extended experiments validate the original paper's main claims regarding the location of the competition of mechanisms in GPT-2, i.e. that the competition emerges predominantly in later layers, and is driven by the attention blocks corresponding to a subset of specialized attention heads. Additionally, we explore intervention strategies based on attention modification to increase factual accuracy. We find that boosting multiple attention heads involved in factual recall simultaneously can have a synergistic effect on factual accuracy, which is further enhanced by the suppression of copy heads. Finally, we find that the specialized factual recall heads identified by Ortu et al. (2024) act as copy regulators, penalizing counterfactual in-context adaptation and rewarding the copying of factual information.

URL: https://openreview.net/forum?id=VCG6j3tcAA

---

Title: Reproducibility Study of "Improving Interpretation Faithfulness For Vision Transformers"

Abstract: This paper attempts to reproduce the findings of the study "Improving Interpretation Faithfulness For Vision Transformers" by Hu et al. (2024). The authors focus on making vision transformers (ViTs) more robust to adversarial attacks, and call these robust ViTs faithful ViTs (FViTs). In their paper they propose a universal method to transform ViTs to FViTs called denoised diffusion smoothing (DDS). The reproduction of the authors' study suffers from certain challenges, but the main claims still hold. Furthermore, this study extends the original paper by trying different diffusion models for DDS and attempts to generalize the increased robustness of FViTs.

URL: https://openreview.net/forum?id=a0rytDAGUD

---

Title: Revisiting XRec: How Collaborative Signals Influence LLM-Based Recommendation Explanations

Abstract: Recommender systems help users navigate large volumes of online content by offering personalized recommendations. However, the increasing reliance on deep learning-based techniques has made these systems opaque and difficult to interpret. To address this, XRec (Ma et al., 2024) was introduced as a novel framework that integrates collaborative signals and textual descriptions of past interactions into Large Language Models (LLMs) to generate natural language explanations for recommendations. In this work, we reproduce and expand upon the findings of Ma et al. (2024). While our results validate most of the original authors’ claims, we were unable to fully replicate the reported performance improvements from injecting collaborative information into every LLM attention layer, nor the claimed effects of data sparsity. Beyond replication, our contributions provide evidence that the Graph Neural Network (GNN) component does not enhance explainability. Instead, the observed performance improvement is attributed to the Collaborative Information Adapter, which can act as a form of soft prompting, efficiently encoding task-specific information. This finding aligns with prior research suggesting that lightweight adaptation mechanisms can condition frozen LLMs for specific downstream tasks. Our implementation is open-source.

URL: https://openreview.net/forum?id=cPtqOkxQqH

---

Title: Reproducibility Study of "Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning"

Abstract: This paper reports on a reproducibility study of the paper "Efficient episodic memory utilization of cooperative multi-agent reinforcement learning" by Na et al. (2024). The original study proposed a method, EMU, to enhance MARL performance by leveraging episodic memory to accelerate learning and prevent local optima convergence. EMU introduced a trainable encoder/decoder structure for memory retrieval and an episodic incentive reward mechanism to promote desirable transitions. The original work evaluated the method in StarCraft II and Google Research Football, demonstrating improvements over state-of-the-art approaches. This study further examines the effectiveness of EMU by assessing its reported performance improvements, the impact of its state embedding approach on exploration efficiency, and the robustness of its incentive mechanism in preventing suboptimal convergence. The analysis focuses on the SMAC benchmark, particularly in complex scenarios where EMU showed the most promise, while also exploring its scalability in high-performance computing environments to determine its computational feasibility. The findings confirm the advantages of EMU but underscore the sensitivity of its performance to embedding quality and hyperparameter selection. Our extended implementation and results are available at https://anonymous.4open.science/r/MLRC-EMU-E0EF/README.md.

URL: https://openreview.net/forum?id=OXf8mZfwPV

---

Title: Reproducibility Study of "SLICE: Stabilized LIME for Consistent Explanations for Image Classification"

Abstract: This paper presents a reproducibility study of SLICE: Stabilized LIME for Consistent Explanations for Image Classification by Bora et al. (2024). SLICE enhances LIME by incorporating Sign Entropy-based Feature Elimination (SEFE) to remove unstable superpixels and an adaptive perturbation strategy using Gaussian blur to improve consistency in feature importance rankings. The original work claims that SLICE significantly improves explanation stability and fidelity. Our study systematically verifies these claims through extensive experimentation using the Oxford-IIIT Pets, PASCAL VOC, and MS COCO datasets. Our results confirm that SLICE achieves higher consistency than LIME, supporting its ability to reduce instability. However, our fidelity analysis challenges the claim of superior performance, as LIME often achieves higher Ground Truth Overlap (GTO) scores, indicating stronger alignment with object segmentations. To further investigate fidelity, we introduce an alternative AOPC evaluation to ensure a fair comparison across methods. Additionally, we propose GRID-LIME, a structured grid-based alternative to LIME, which improves stability while maintaining computational efficiency. Our findings highlight trade-offs in post-hoc explainability methods and emphasize the need for fairer fidelity evaluations. Our implementation is publicly available at our GitHub repository.

URL: https://openreview.net/forum?id=vKUPXuEzj8

---

Title: WorkflowAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data

Abstract: LLM agents are advancing in handling web-based tasks. However, most LLM web agents rely on prompting general-purpose, proprietary models like GPT-4, which are not specifically trained to process web languages (e.g., HTML) or perform long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This approach shows substantial gains over prompting-based agents on existing benchmarks—our agent achieves state-of-the-art action generation performance on the Mind2Web benchmark and improves the task success rate by 7.3% over existing prompting-based agents on WebArena. We perform detailed ablation studies on various fine-tuning design choices and provide valuable insights into LLM selection, training recipes, context window optimization, and the effect of dataset sizes.

URL: https://openreview.net/forum?id=ACr7qRIWsE

---

Title: Reproducibility Study of "Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery"

Abstract: The DN-CBM framework proposed by Rao et al. represents a significant advancement in concept-based interpretability, leveraging Sparse Autoencoders (SAEs) for automatic concept discovery and naming. Our study successfully reproduces DN-CBM’s core findings, confirming its ability to extract meaningful concepts while maintaining competitive classification performance across ImageNet, Places365, CIFAR-10, and CIFAR-100. Additionally, we validate DN-CBM’s effectiveness in clustering semantically related concepts in the latent space, reinforcing its potential for interpretable machine learning.
Beyond replication, our extensions provide deeper insights into DN-CBM’s interpretability and robustness. We show that the discovered concepts are more concrete and less polysemantic, favoring monosemantic representations, and that polysemantic concepts have minimal impact on classification. Our intervention analysis on the Waterbirds100 dataset supports DN-CBM’s interpretability, and a novel loss function improves classification accuracy by reducing reliance on spurious background cues. In addition, we show through a user study the advantages of the new loss function on the interpretable concept selection for CIFAR-10. While our automatic concept intervention method offers an alternative to manual interventions, human selection remains more effective. These findings affirm DN-CBM’s validity and highlight opportunities for further refinement in interpretable deep learning.

URL: https://openreview.net/forum?id=opwdh0xHOd

---

Title: A reproducibility study of “User-item fairness tradeoffs in recommendations”

Abstract: Recommendation systems are necessary to filter the abundance of information presented in our everyday lives. A recommendation system could merely recommend items that users prefer the most, potentially resulting in certain items never getting recommended. Conversely, a mere focus on including all items could hurt overall recommendation quality. This gives rise to the challenge of balancing user and item fairness. The paper “User-item fairness tradeoffs in recommendations” by Greenwood et al. (2024) explores these tradeoffs by developing a theoretical framework that optimizes for user-item fairness constraints. Their theoretical framework suggests that the cost of item fairness is low when users have diverse preferences, and may be high for users whose preferences are misestimated. They empirically measured these phenomena by creating their own recommendation system on arXiv preprints, and discovered that the cost of item fairness is indeed low for users with diverse preferences. However, contrary to their theoretical expectations, misestimated users do not encounter a higher cost of item fairness. This study investigates the reproducibility of their research by replicating the empirical study. Additionally, we extend their research in two ways: (i) verifying the generalizability of their findings on a different dataset (Amazon books reviews), and (ii) analyzing the tradeoffs when recommending multiple items to a user instead of a single item. Our results further validate the claims made in the original paper. Moreover, the claims hold true when recommending multiple items, with the cost of item fairness decreasing as more items are recommended.

URL: https://openreview.net/forum?id=vltzxxhzLU

---

Title: "Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery" - A Reproducibility Study

Abstract: Concept Bottleneck Models (CBMs) are a class of interpretable deep learning frameworks that improve transparency by mapping input data into human-understandable concepts. Recent advances, including the Discover-then-Name CBM proposed by Rao et al. (2024), eliminate reliance on external language models by automating concept discovery and naming using a CLIP feature extractor and sparse autoencoder. This study is focused on replicating the key findings reported by Rao et al. (2024). We conclude that the core conceptual ideas are reproducible, but not to the extent presented in the original work. Many representations of the active neurons appear to be misaligned with their assigned concepts. To address this discrepancy, we suggest a model extension; we propose an enhanced alignment method evaluated through a user study. Our extended model provides more interpretable concepts (with statistical significance), at the cost of a slight decrease in accuracy.

URL: https://openreview.net/forum?id=946cT3Jsq5

---

Title: [Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games

Abstract: Large Language Models (LLMs) demonstrate significant potential in multi-agent negotiation tasks, yet evaluation in this domain remains challenging due to a lack of robust and generalizable benchmarks. Abdelnabi et al. (2024) introduce a negotiation benchmark based on Scoreable Games, with the aim of developing a highly complex and realistic evaluation framework for LLMs. The following work investigates the reproducibility of several claims in their benchmark, and extends their analysis to provide a deeper understanding of fairness, interpretability, and generalizability within the benchmark. We replicate the original experiments on a multitude of additional models, and introduce additional metrics to illuminate actual negotiation quality and fairness. Our findings reveal that while the benchmark is indeed complex, model comparison is ambiguous, raising questions about its objectivity. Furthermore, we identify limitations in the experimental setup, particularly in information leakage detection and ablation transferability, which impact the robustness of the results. By examining and analyzing the behavior of a wider range of models on an extended version of the benchmark, we reveal key insights that provide much-needed context to potential users. Our results highlight the importance of context in model-specific evaluations and the need for more nuanced metrics to assess negotiation performance.

URL: https://openreview.net/forum?id=BVH81SAAh2

---

Title: Toward Fair and Transparent Vision Transformers: Reproducing FairViT and Introducing FairDeiTA

Abstract: Vision Transformers (ViTs) have achieved state-of-the-art performance in image recognition but frequently inherit social biases from large-scale training data, raising concerns about fairness and transparency. FairViT was recently proposed to mitigate biases in ViTs through adaptive masking while preserving high accuracy with a distance-based loss. This study reproduces and evaluates FairViT’s claims on the CelebA dataset, focusing on accuracy and fairness metrics. Contrary to the original paper’s findings, our experiments reveal that FairViT does not outperform the baseline model in both performance and fairness. To enhance transparency, we apply interpretability techniques, including Gradient Attention Rollout (GAR) and local surrogate explanations (Ribeiro et al., 2016), providing deeper insight into the learned representations of FairViT. Our reproducibility study underscores the challenges of implementing and verifying fairness interventions in ViTs. Finally, we propose an adversarial debiasing (Zhang et al., 2018) component that improves fairness metrics while maintaining competitive accuracy, offering an alternative direction for fairness-focused ViT-based applications. We formulate this model as FairDeiTA.

URL: https://openreview.net/forum?id=6In6ExT3FU

---

Title: [Re] Improving Interpretation Faithfulness for Vision Transformers

Abstract: This work aims to reproduce the results of Faithful Vision Transformers (FViTs) proposed by Hu et al. (2024) alongside interpretability methods for Vision Transformers from Chefer et al. (2021) and Xu et al. (2022). We investigate claims made by Hu et al. (2024), namely that the usage of Diffusion Denoised Smoothing improves interpretability robustness (1) to attack in a segmentation task and (2) to perturbation in a classification task. We also extend the original study by investigating the authors’ claims that adding DDS to any method can improve its robustness under attack. This is tested on baseline interpretability algorithms and the recently proposed Attribution Rollout method. In addition, we measure the computational costs and environmental impact of obtaining an FViT through DDS. Our results agree broadly with the original study’s findings, although minor discrepancies were found and discussed.

URL: https://openreview.net/forum?id=Z0DhgU8fBt

---

Title: Change Point Detection on A Separable Model for Dynamic Networks

Abstract: This paper studies the unsupervised change point detection problem in time series of networks using the Separable Temporal Exponential-family Random Graph Model (STERGM). Inherently, dynamic network patterns can be complex due to dyadic and temporal dependence, and change point detection can identify the discrepancies in the underlying data generating processes to facilitate downstream analysis. Moreover, the STERGM, which utilizes network statistics to represent the structural patterns, is a flexible and parsimonious model to fit dynamic networks. We propose a new estimator derived from the Alternating Direction Method of Multipliers (ADMM) procedure and Group Fused Lasso (GFL) regularization to simultaneously detect multiple time points at which the parameters of a time-heterogeneous STERGM have changed. We also provide a Bayesian information criterion for model selection and an R package CPDstergm to implement the proposed method. Experiments on simulated and real data show good performance of the proposed framework.

URL: https://openreview.net/forum?id=DSNJykzHF3
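
As a rough illustration of the Group Fused Lasso idea mentioned in the abstract above, the sketch below shows the textbook group soft-thresholding operator, which is the proximal map of the penalty lam * ||v||_2 and is commonly used inside ADMM updates. It is a generic sketch, not the authors' estimator or the CPDstergm implementation; the function names, the parameter `lam`, and the array layout (one row of parameters per time point) are assumptions made for this example.

```python
import numpy as np


def group_soft_threshold(v, lam):
    """Proximal operator of lam * ||v||_2 (generic textbook form).

    Returns the zero vector when ||v||_2 <= lam, otherwise shrinks v
    toward zero while keeping its direction.
    """
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1.0 - lam / norm) * v


def shrink_differences(theta, lam):
    """Apply group soft-thresholding to consecutive parameter differences.

    theta is a hypothetical (T, p) float array with one parameter vector
    per time point (T >= 2). Rows of the result that are exactly zero
    correspond to time steps with no detected change under this penalty.
    """
    diffs = np.diff(theta, axis=0)
    return np.vstack([group_soft_threshold(d, lam) for d in diffs])
```

In this sketch, a nonzero row in the output of `shrink_differences` would be read as a candidate change point; how the paper combines this kind of step with the STERGM likelihood is not specified here.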

---

Title: FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback

Abstract: Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between text and image modalities, which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) General feedback cannot indicate the hallucination type contained in the response; (2) Sparse rewards only give the sequence-level reward for the whole response; and (3) Annotation is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which mainly consists of three steps: AI-based Feedback Collection, Fine-grained Reward Model Training, and Reinforcement Learning with Fine-grained Reward. Finally, a novel fine-grained feedback module is integrated into the Proximal Policy Optimization (PPO) algorithm. Extensive experiments are conducted on hallucination and general benchmarks, demonstrating the superior performance of our proposed method. Notably, compared with previous models trained with the RL-based aligning method, our proposed method is effective even with fewer parameters.

URL: https://openreview.net/forum?id=Qhfw5CUVd7

---

Title: Doubly Robust Uncertainty Quantification for Quantile Treatment Effects in Sequential Decision Making

Abstract: We consider multi-stage sequential decision-making, where the treatment at any stage may depend on the subject's entire treatment and covariate history. We introduce a general framework for doubly robust uncertainty quantification for the quantiles of cumulative outcomes corresponding to a sequential treatment rule, given baseline covariates. While previous studies focused on mean effects, quantile effects offer unique insights into the distributional properties and are more robust for heavy-tailed outcomes. It is known that doubly robust inference is significantly more challenging and largely unexplored for quantile treatment effects. More importantly, for mean effects, doubly robust estimation does not ensure doubly robust inference. Our approach first provides a doubly robust estimator for any quantile of interest based on pre-collected data, achieving semi-parametric efficiency. We then propose a novel doubly robust estimator for the asymptotic variance, enabling the construction of a doubly robust confidence interval. To overcome the challenges in non-smoothness and parameter-dependent nuisance functions, we leverage empirical process and deep conditional generative learning techniques. We demonstrate advantages of our approach via both simulation and real data from a short video platform. Additionally, we observe that our proposed approach leads to another mean effect estimator that outperforms existing estimators for heavy-tailed outcomes.

URL: https://openreview.net/forum?id=F0BwbieVws

---

Title: A Local Polyak-Łojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models

Abstract: Most prior work on the convergence of gradient descent (GD) for overparameterized neural networks relies on strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (large, spectral, balanced). Recent efforts to relax these assumptions focus on two-layer linear networks trained with the squared loss.
In this work, we derive a linear convergence rate for training two-layer linear neural networks with GD for general losses and under relaxed assumptions on the step size, width, and initialization. A key challenge in deriving this result is that classical ingredients for deriving convergence rates for nonconvex problems, such as the Polyak-Łojasiewicz (PL) condition and Descent Lemma, do not hold globally for overparameterized neural networks. Here, we prove that these two conditions hold locally with local constants that depend on the weights. Then, we provide bounds on these local constants, which depend on the initialization of the weights, the current loss, and the global PL and smoothness constants of the non-overparameterized model. Based on these bounds, we derive a linear convergence rate for GD. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size, as verified through our numerical experiments.

URL: https://openreview.net/forum?id=VPl3T43Hxb
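
For readers unfamiliar with the two ingredients named in the abstract above, the block below states the classical global forms of the Polyak-Łojasiewicz condition and the Descent Lemma, and the linear rate they imply for gradient descent. The symbols (f, f^\star, \mu, L, \eta) are standard textbook notation assumed here, not the paper's; the paper's contribution concerns local versions whose constants depend on the weights, which this sketch does not reproduce.

```latex
% PL condition with constant \mu > 0
\[
  \tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr)
\]
% Descent Lemma for an L-smooth function
\[
  f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\,\|y - x\|^2
\]
% Gradient descent step x_{k+1} = x_k - \eta \nabla f(x_k) with 0 < \eta \le 1/L:
\[
  f(x_{k+1}) \;\le\; f(x_k) - \eta\Bigl(1 - \tfrac{L\eta}{2}\Bigr)\|\nabla f(x_k)\|^2
  \;\le\; f(x_k) - \tfrac{\eta}{2}\,\|\nabla f(x_k)\|^2
\]
% Combining with the PL condition gives a linear rate:
\[
  f(x_{k+1}) - f^\star \;\le\; (1 - \eta\mu)\,\bigl(f(x_k) - f^\star\bigr)
\]
```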

---

Title: From Spikes to Heavy Tails: Unveiling the Spectral Evolution of Neural Networks

Abstract: Training strategies for modern deep neural networks (NNs) tend to induce a heavy-tailed (HT) empirical spectral density (ESD) in the layer weights. While previous efforts have shown that the HT phenomenon correlates with good generalization in large NNs, a theoretical explanation of its occurrence is still lacking. Especially, understanding the conditions which lead to this phenomenon can shed light on the interplay between generalization and weight spectra. Our work aims to bridge this gap by presenting a simple, rich setting to model the emergence of HT ESD. In particular, we present a theory-informed setup for 'crafting' heavy tails in the ESD of two-layer NNs and present a systematic analysis of the HT ESD emergence without any gradient noise. This is the first work to analyze a noise-free setting, and we also incorporate optimizer (GD/Adam) dependent (large) learning rates into the HT ESD analysis. Our results highlight the role of learning rates on the Bulk+Spike and HT shape of the ESDs in the early phase of training, which can facilitate generalization in the two-layer NN. These observations shed light on the behavior of large-scale NNs, albeit in a much simpler setting. Last but not least, we present a novel perspective on the ESD evolution dynamics by analyzing the singular vectors of weights and optimizer updates.

URL: https://openreview.net/forum?id=DJHB8eBUnt
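
To make the term "empirical spectral density" in the abstract above concrete, here is a minimal NumPy sketch that computes the eigenvalues of W^T W / n for a weight matrix W and bins them into a histogram. The normalization by the number of rows, the function name, and the random test matrix are assumptions for illustration; the paper's exact setup may differ.

```python
import numpy as np


def empirical_spectral_density(weight, bins=50):
    """Return eigenvalues of W^T W / n and a normalized histogram of them.

    weight is an (n, m) matrix. The eigenvalues of W^T W / n are the
    squared singular values of W divided by n (an assumed normalization).
    """
    n_rows = weight.shape[0]
    singular_values = np.linalg.svd(weight, compute_uv=False)
    eigenvalues = singular_values ** 2 / n_rows
    density, edges = np.histogram(eigenvalues, bins=bins, density=True)
    return eigenvalues, density, edges


# Example on a random Gaussian matrix, used only as a stand-in for a
# trained layer's weights.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 256))
eigs, density, edges = empirical_spectral_density(W)
```

Plotting `density` against the bin centers derived from `edges`, for weights saved at different training steps, is one way to look for the Bulk+Spike and heavy-tailed shapes the abstract describes.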

---

Title: Transform-Enabled Detection of Backdoor Attacks in Deep Neural Networks

Abstract: Deep Neural Networks (DNNs) have been widely deployed in a range of safety-critical applications. Recent work has illustrated their vulnerability to malicious backdoor attacks, which lead to DNN malfunction when a specific backdoor trigger is applied to the DNN input image. These backdoors cause uncharacteristic behavior in DNN hidden layers, causing the DNN to misclassify the input image. In this work we present Transform-Enabled Detection of Attacks (TESDA), a novel algorithm for on-line detection of uncharacteristic behavior in DNN hidden layers indicative of a backdoor. We leverage the training-dataset distributions of reduced-dimension transforms of deep features in a backdoored DNN to rapidly detect malicious behavior, using theoretically grounded methods with bounded false alarm rates. We verify that TESDA is able to achieve state-of-the-art detection with very low latency on a variety of attacks, datasets and network backbones. Further ablations show that only a small proportion of DNN training data is needed for TESDA to fit an attack detector to the backdoored network.

URL: https://openreview.net/forum?id=ceYXSjh1xO

---
