Weekly TMLR digest for Jan 26, 2025

4 views

Skip to first unread message

TMLR

unread,

Jan 26, 2025, 12:00:14 AMJan 26

to tmlr-annou...@googlegroups.com

New certifications
==================

Survey Certification: Unified Risk Analysis for Weakly Supervised Learning

Chao-Kai Chiang, Masashi Sugiyama

https://openreview.net/forum?id=RGsdAwWuu6

---

Featured Certification: Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

https://openreview.net/forum?id=YCt8lsIDwA

---

Accepted papers
===============

Title: Making Reliable and Flexible Decisions in Long-tailed Classification

Authors: Bolian Li, Ruqi Zhang

Abstract: Long-tailed classification is challenging due to its heavy imbalance in class probabilities. While existing methods often focus on overall accuracy or accuracy for tail classes, they overlook a critical aspect: certain types of errors can carry greater risks than others in real-world long-tailed problems. For example, misclassifying patients (a tail class) as healthy individuals (a head class) entails far more serious consequences than the reverse scenario. To address this critical issue, we introduce Making Reliable and Flexible Decisions in Long-tailed Classification (RF-DLC), a novel framework aimed at reliable predictions in long-tailed problems. Leveraging Bayesian Decision Theory, we introduce an integrated gain to seamlessly combine long-tailed data distributions and the decision-making procedure. We further propose an efficient variational optimization strategy for the decision risk objective. Our method adapts readily to diverse utility matrices, which can be designed for specific tasks, ensuring its flexibility for different problem settings. In empirical evaluation, we design a new metric, False Head Rate, to quantify tail-sensitivity risk, along with comprehensive experiments on multiple real-world tasks, including large-scale image classification and uncertainty quantification, to demonstrate the reliability and flexibility of our method.

URL: https://openreview.net/forum?id=hMO8sT9qaD

---

Title: Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

Authors: Michal Derezinski

Abstract: Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization. Incorporating second-order information has proven helpful in further improving the performance of these first-order methods. Yet, comparatively little is known about the benefits of using variance reduction to accelerate popular stochastic second-order methods such as Subsampled Newton. To address this, we propose Stochastic Variance-Reduced Newton (SVRN), a finite-sum minimization algorithm that provably accelerates existing stochastic Newton methods from $O(\alpha\log(1/\epsilon))$ to $O\big(\frac{\log(1/\epsilon)}{\log(n)}\big)$ passes over the data, i.e., by a factor of $O(\alpha\log(n))$, where $n$ is the number of sum components and $\alpha$ is the approximation factor in the Hessian estimate. Surprisingly, this acceleration gets more significant the larger the data size $n$, which is a unique property of SVRN. Our algorithm retains the key advantages of Newton-type methods, such as easily parallelizable large-batch operations and a simple unit step size. We use SVRN to accelerate Subsampled Newton and Iterative Hessian Sketch algorithms, and show that it compares favorably to popular first-order methods with variance~reduction.

URL: https://openreview.net/forum?id=dzQCRHKRdC

---

Title: Adjacency Search Embeddings

Authors: Meher Chaitanya, Kshitijaa Jaglan, Ulrik Brandes

Abstract: In this study, we propose two novel Adjacency Search Embeddings that are inspired by the theory of identifying s-t minimum cuts: Maximum Adjacency Search (MAS) and Threshold-based Adjacency Search (TAS), which leverage both the node and a subset of its neighborhood to discern a set of nodes well-integrated into higher-order network structures. This serves as context for generating higher-order representations. Our approaches, when used in conjunction with the skip-gram model, exhibit superior effectiveness in comparison to other shallow embedding techniques in tasks such as link prediction and node classification. By incorporating our mechanisms as a preprocessing technique, we show substantial improvements in node classification performance across GNNs like GCN, GraphSage, and Gatv2 on both attributed and non-attributed networks. Furthermore, we substantiate the applicability of our approaches, shedding light on their aptness for specific graph scenarios.

URL: https://openreview.net/forum?id=GDN5cFTNaL

---

Title: ExCeL: Combined Extreme and Collective Logit Information for Out-of-Distribution Detection

Authors: Naveen Karunanayake, Suranga Seneviratne, Sanjay Chawla

Abstract: Deep learning models often exhibit overconfidence in predicting out-of-distribution (OOD) data, underscoring the crucial role of OOD detection in ensuring reliability in predictions. Among various OOD detection approaches, post-hoc detectors have gained significant popularity, primarily due to their ease of implementation and competitive performance. However, recent benchmarks for OOD detection have revealed a lack of consistency in existing post-hoc methods. This inconsistency in post-hoc detectors can be attributed to their sole reliance either on extreme information, such as the maximum logit, or on collective information (i.e., information spanned across classes or training samples) embedded within the output layer. In this paper, we propose ExCeL, which combines both extreme and collective information within the output layer for enhanced and consistent performance in OOD detection. We leverage the logit of the top predicted class as the extreme information (i.e., the maximum logit), while the collective information is derived in a novel approach that involves assessing the probability of other classes appearing in subsequent ranks across various training samples. Our idea is motivated by the observation that, for in-distribution (ID) data, the ranking of classes beyond the predicted class is more deterministic compared to that in OOD data. Experiments conducted on CIFAR100, ImageNet-200, and ImageNet-1K datasets demonstrate that ExCeL consistently is among the five top-performing methods out of twenty-one existing post-hoc baselines when the joint performance on near-OOD and far-OOD is considered (i.e., in terms of AUROC and FPR95). Furthermore, ExCeL shows the best overall performance across all datasets, unlike other baselines that work best on one dataset but have a performance drop in others.

URL: https://openreview.net/forum?id=4Xz0WBAiX4

---

Title: Time Series Domain Adaptation via Channel-Selective Representation Alignment

Authors: Nauman Ahad, Mark A. Davenport, Eva L Dyer

Abstract: Building generalizable and robust multivariate time series models can be challenging for real-world settings that involve significant shifts between training and testing. Existing unsupervised domain adaptation methods often struggle with real world distribution shifts which are often much more severe in some channels than others. To overcome these obstacles, we introduce a novel method called Signal Selection and Screening via Sinkhorn alignment for Time Series domain Adaptation (SSSS-TSA). SSSS-TSA addresses channel-level variations by aligning both individual channel representations and selectively weighted combined channel representations. This dual alignment strategy based on channel selection not only ensures effective adaptation to new domains but also maintains robustness in scenarios with training and testing set shifts or when certain channels are absent or corrupted. We evaluate our method on several time-series classification benchmarks and find that it consistently improves performance over existing methods. These results demonstrate the importance of adaptively selecting and screening different channels to enable more effective alignment across domains.

URL: https://openreview.net/forum?id=8C8LJIqF4y

---

Title: Can AI-Generated Text be Reliably Detected? Stress Testing AI Text Detectors Under Various Attacks

Authors: Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, Soheil Feizi

Abstract: Large Language Models (LLMs) can perform impressively well in various applications, such as document completion and question-answering. However, the potential for misuse of these models in activities such as plagiarism, generating fake news, and spamming has raised concerns about their responsible use. Consequently, the reliable detection of AI-generated text has become a critical area of research. Recent works have attempted to address this challenge through various methods, including the identification of model signatures in generated text outputs and the application of watermarking techniques to detect AI-generated text.
These detectors have shown to be effective under their specific settings. In this paper, we stress-test the robustness of these AI text detectors in the presence of an attacker. We introduce recursive paraphrasing attack to stress test a wide range of detection schemes, including the ones using the watermarking as well as neural network-based detectors, zero-shot classifiers, and retrieval-based detectors. Our experiments conducted on passages, each approximately 300 tokens long, reveal the varying sensitivities of these detectors to our attacks. We also observe that these paraphrasing attacks add slight degradation to the text quality. We analyze the trade-offs between our attack strength and the resulting text quality, measured through human studies, perplexity scores, and accuracy on text benchmarks. Our findings indicate that while our recursive paraphrasing method can significantly reduce detection rates, it only slightly degrades text quality in many cases, highlighting potential vulnerabilities in current detection systems in the presence of an attacker. Additionally, we investigate the susceptibility of watermarked LLMs to spoofing attacks aimed at misclassifying human-written text as AI-generated. We demonstrate that an attacker can infer hidden AI text signatures without white-box access to the detection method, potentially leading to reputational risks for LLM developers. Finally, we provide a theoretical framework connecting the AUROC of the best possible detector to the Total Variation distance between human and AI text distributions. This analysis offers insights into the fundamental challenges of reliable detection as language models continue to advance.

URL: https://openreview.net/forum?id=OOgsAZdFOt

---

Title: Differentially Private Gradient Flow based on the Sliced Wasserstein Distance

Authors: Ilana Sebag, Muni Sreenivas Pydi, Jean-Yves Franceschi, Alain Rakotomamonjy, Mike Gartrell, Jamal Atif, Alexandre Allauzen

Abstract: Safeguarding privacy in sensitive training data is paramount, particularly in the context of generative modeling. This can be achieved through either differentially private stochastic gradient descent or a differentially private metric for training models or generators. In this paper, we introduce a novel differentially private generative modeling approach based on a gradient flow in the space of probability measures. To this end, we define the gradient flow of the Gaussian-smoothed Sliced Wasserstein Distance, including the associated stochastic differential equation (SDE). By discretizing and defining a numerical scheme for solving this SDE, we demonstrate the link between smoothing and differential privacy based on a Gaussian mechanism, due to a specific form of the SDE's drift term. We then analyze the differential privacy guarantee of our gradient flow, which accounts for both the smoothing and the Wiener process introduced by the SDE itself. Experiments show that our proposed model can generate higher-fidelity data at a low privacy budget compared to a generator-based model, offering a promising alternative.

URL: https://openreview.net/forum?id=aiOHc1LGpD

---

Title: Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust

Authors: zhuo zhi, Yuxuan Sun, Qiangqiang Wu, Ziquan Liu, Miguel R. D. Rodrigues

Abstract: Multimodal fusion with a multimodal transformer is an effective method for both early and late fusion paradigms. However, in a multimodal transformer, the modality fusion is performed solely through the self-attention mechanism, which is originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, like the Q-former in BLIP-2, is often used to align tokens from different modalities. Our empirical study unveils that only using the self-attention layer to perform the modality fusion makes the model less robust to missing modalities and input noise, as the model will overly rely on one certain modality. To improve the robustness of the transformer, our paper proposes an implicit approach based on Wasserstein distance that aligns tokens from different modalities without using any additional trainable parameters. Our empirical study shows that the implicit modality alignment improves the effectiveness of the multimodal Transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four downstream task datasets, including 2-modalities and 3-modalities tasks. We also consider different fusion paradigms, i.e., early and late fusion. The experimental results show that our proposed method has a significant improvement in both performance and robustness over all baselines across all datasets and fusion paradigms.

URL: https://openreview.net/forum?id=dbaGuiYsTl

---

Title: A thorough reproduction and evaluation of $\mu$P

Authors: Georgios Vlassis, David Belius, Volodymyr Fomichov

Abstract: This paper is an independent empirical reproduction of the claimed benefits of the $\mu$P parametrization proposed in \citet{yang2020feature} and \citet{yang2021tuning}. Under the so-called Standard Parametrization (SP), the weights of neural networks are initialized from the Gaussian distribution with variance scaling as the inverse of ``fan-in'', with the learning rate being the same for every layer. While this guarantees that (pre)activations are $\mathcal{O}(1)$ at initialization with respect to width, it causes their scale to be width-dependent during training. To address this, \citet{yang2020feature} and \citet{yang2021tuning} proposed the Maximal Update Parametrization ($\mu$P), which is also claimed to make the optimal value of various hyperparameters independent of width. However, despite its alleged benefits, $\mu$P has not gained much traction among practitioners. Possibly, this could stem from a lack of thorough independent evaluation of $\mu$P against SP. We address this by independently reproducing the empirical claims of the original works. At the same time, we substantially increase the scale of the experiments, by training $16000$ neural networks of sizes from $500$ to $1$B parameters, and empirically investigate $\mu$P's effect on outputs, gradient updates, weights, training loss and validation loss. We find that generally $\mu$P indeed delivers on its promises, even though this does not always translate to improved generalization.

URL: https://openreview.net/forum?id=AFxEdJwQcp

---

Title: Semantic Alignment for Prompt-Tuning in Vision Language Models

Authors: Hari Chandana Kuchibhotla, Sai Srinivas Kancheti, Abbavaram Gowtham Reddy, Vineeth N. Balasubramanian

Abstract: Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite their potential, effectively learning prompts faces the following challenges: (i) training in a low-shot scenario results in overfitting, limiting adaptability, and yielding weaker performance on newer classes or datasets; (ii) prompt-tuning's efficacy heavily relies on the label space, with decreased performance in large class spaces, signaling potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge image and text modalities. Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods, demonstrating substantial improvements.

URL: https://openreview.net/forum?id=avDr56QjSI

---

Title: Identifying Spurious Correlations using Counterfactual Alignment

Authors: Joseph Paul Cohen, Louis Blankemeier, Akshay S Chaudhari

Abstract: Models driven by spurious correlations often yield poor generalization performance. We propose the counterfactual (CF) alignment method to detect and quantify spurious correlations of black box classifiers. Our methodology is based on counterfactual images generated with respect to one classifier being input into other classifiers to see if they also induce changes in the outputs of these classifiers. The relationship between these responses can be quantified and used to identify specific instances where a spurious correlation exists. This is validated by observing intuitive trends in face-attribute and waterbird classifiers, as well as by fabricating spurious correlations and detecting their presence, both visually and quantitatively. Furthermore, utilizing the CF alignment method, we demonstrate that we can evaluate robust optimization methods (GroupDRO, JTT, and FLAC) by detecting a reduction in spurious correlations.

URL: https://openreview.net/forum?id=Utjw2z1ale

---

Title: Numerically Robust Fixed-Point Smoothing Without State Augmentation

Authors: Nicholas Krämer

Abstract: Practical implementations of Gaussian smoothing algorithms have received a great deal of attention in the last 60 years. However, almost all work focuses on estimating complete time series (``fixed-interval smoothing'', $\mathcal{O}(K)$ memory) through variations of the Rauch--Tung--Striebel smoother, rarely on estimating the initial states (``fixed-point smoothing'', $\mathcal{O}(1)$ memory). Since fixed-point smoothing is a crucial component of algorithms for dynamical systems with unknown initial conditions, we close this gap by introducing a new formulation of a Gaussian fixed-point smoother. In contrast to prior approaches, our perspective admits a numerically robust Cholesky-based form (without downdates) and avoids state augmentation, which would needlessly inflate the state-space model and reduce the numerical practicality of any fixed-point smoother code. The experiments demonstrate how a JAX implementation of our algorithm matches the runtime of the fastest methods and the robustness of the most robust techniques while existing implementations must always sacrifice one for the other.

URL: https://openreview.net/forum?id=LVQ8BEL5n3

---

Title: Explicitly Disentangled Representations in Object-Centric Learning

Authors: Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland

Abstract: Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have raised growing interest. In this context, enhancing the robustness of the latent features can improve the efficiency and effectiveness of the training of downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are known a priori, hence before the training process. Experiments on a range of object-centric benchmarks reveal that our approach achieves the desired disentanglement while also numerically improving baseline performance in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes.

URL: https://openreview.net/forum?id=r8UFp9olQ0

---

Title: Enhanced Federated Optimization: Adaptive Unbiased Client Sampling with Reduced Variance

Authors: Dun Zeng, Zenglin Xu, Yu Pan, Xu Luo, Qifan Wang, Xiaoying Tang

Abstract: Federated Learning (FL) is a distributed learning paradigm to train a global model across multiple devices without collecting local data. In FL, a server typically selects a subset of clients for each training round to optimize resource usage. Central to this process is the technique of unbiased client sampling, which ensures a representative selection of clients. Current methods primarily utilize a random sampling procedure which, despite its effectiveness, achieves suboptimal efficiency owing to the loose upper bound caused by the sampling variance. In this work, by adopting an independent sampling procedure, we propose a federated optimization framework focused on adaptive unbiased client sampling, improving the convergence rate via an online variance reduction strategy.
In particular, we present the first adaptive client sampler, K-Vib, employing an independent sampling procedure. K-Vib achieves a linear speed-up on the regret bound $\tilde{\mathcal{O}}\big(N^{\frac{1}{3}}T^{\frac{2}{3}}/K^{\frac{4}{3}}\big)$ within a set communication budget $K$. Empirical studies indicate that K-Vib doubles the speed compared to baseline algorithms, demonstrating significant potential in federated optimization.

URL: https://openreview.net/forum?id=Gb4HBGG9re

---

Title: Unified Risk Analysis for Weakly Supervised Learning

Authors: Chao-Kai Chiang, Masashi Sugiyama

Abstract: Among the flourishing research of weakly supervised learning (WSL), we recognize the lack of a unified interpretation of the mechanism behind the weakly supervised scenarios, let alone a systematic treatment of the risk rewrite problem, a crucial step in the empirical risk minimization approach. In this paper, we introduce a framework providing a comprehensive understanding and a unified methodology for WSL. The formulation component of the framework, leveraging a contamination perspective, provides a unified interpretation of how weak supervision is formed and subsumes fifteen existing WSL settings. The induced reduction graphs offer comprehensive connections over WSLs. The analysis component of the framework, viewed as a decontamination process, provides a systematic method of conducting risk rewrite. In addition to the conventional inverse matrix approach, we devise a novel strategy called marginal chain aiming to decontaminate distributions. We justify the feasibility of the proposed framework by recovering existing rewrites reported in the literature.

URL: https://openreview.net/forum?id=RGsdAwWuu6

---

Title: Leveraging a Simulator for Learning Causal Representations from Post-Treatment Covariates for CATE

Authors: Lokesh Nagalapatti, Pranava Singhal, Avishek Ghosh, Sunita Sarawagi

Abstract: Treatment effect estimation involves assessing the impact of different treatments on individual outcomes. Current methods estimate Conditional Average Treatment Effect (CATE) using observational datasets where covariates are collected before treatment assignment and outcomes are observed afterward, under assumptions like positivity and unconfoundedness. In this paper, we address a scenario where both covariates and outcomes are gathered after treatment. We show that post-treatment covariates render CATE unidentifiable, and recovering CATE requires learning treatment-independent causal representations. Prior work shows that such representations can be learned through contrastive learning if counterfactual supervision is available in observational data. However, since counterfactuals are rare, other works have explored using simulators that offer synthetic counterfactual supervision. Our goal in this paper is to systematically analyze the role of simulators in estimating CATE. We analyze the CATE error of several baselines and highlight their limitations. We then establish a generalization bound that characterizes the CATE error from jointly training on real and simulated distributions, as a function of the real-simulator mismatch. Finally, we introduce SimPONet, a novel method whose loss function is inspired from our generalization bound. We further show how SimPONet adjusts the simulator’s influence on the learning objective based on the simulator’s relevance to the CATE task. We experiment with various DGPs, by systematically varying the real-simulator distribution gap to evaluate SimPONet’s efficacy against state-of-the-art CATE baselines.

URL: https://openreview.net/forum?id=vmmgFW3ztz

---

Title: Preventing Conflicting Gradients in Neural Marked Temporal Point Processes

Authors: Tanguy Bosser, Souhaib Ben Taieb

Abstract: Neural Marked Temporal Point Processes (MTPP) are flexible models to capture complex temporal inter-dependencies between labeled events. These models inherently learn two predictive distributions: one for the arrival times of events and another for the types of events, also known as marks. In this study, we demonstrate that learning a MTPP model can be framed as a two-task learning problem, where both tasks share a common set of trainable parameters that are optimized jointly. We show that this often leads to the emergence of conflicting gradients during training, where task-specific gradients are pointing in opposite directions. When such conflicts arise, following the average gradient can be detrimental to the learning of each individual tasks, resulting in overall degraded performance. To overcome this issue, we introduce novel parametrizations for neural MTPP models that allow for separate modeling and training of each task, effectively avoiding the problem of conflicting gradients. Through experiments on multiple real-world event sequence datasets, we demonstrate the benefits of our framework compared to the original model formulations.

URL: https://openreview.net/forum?id=INijCSPtbQ

---

Title: Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

Authors: Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

Abstract: Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation.
While prior research indicates that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images can boost an ImageNet classifier's performance,
when synthetic images start to outnumber real ones in training, the classifier performance starts to degrade, underscoring the scalability challenge of training with synthetic data.
In this paper, we delve into the necessity of generative fine-tuning for achieving recognition performance improvements and investigate the scalability of training with large-scale synthetic images.
We find that leveraging off-the-shelf generative models without fine-tuning, while addressing challenges of class name ambiguity, limited prompt diversity, and domain shifts effectively mitigates performance degradation from large-scale synthetic data.
Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity.
To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs.
Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images.
Our framework consistently boosts recognition model performance with increased synthetic data,
even up to 6 times the original ImageNet size.
Models trained with our approach demonstrate significant in-domain improvement on ImageNet-val (1.20\% to 2.35\% across various architectures) and strong out-of-domain generalization on ImageNet-Sketch and -Rendition ($\sim$10\% improvement with large vision transformers).

URL: https://openreview.net/forum?id=YCt8lsIDwA

---

Title: Towards LifeSpan Cognitive Systems

Authors: Yu Wang, Chi Han, Tongtong Wu, Xiaoxin He, Wangchunshu Zhou, Nafis Sadeq, Xiusi Chen, Zexue He, Wei Wang, Gholamreza Haffari, Heng Ji, Julian McAuley

Abstract: Building a human-like system that continuously interacts with complex environments—whether simulated digital worlds or human society—presents several key challenges. Central to this is enabling continuous, high-frequency interactions, where the interactions are termed experiences. We refer to this envisioned system as the LifeSpan Cognitive System (LSCS). A critical feature of LSCS is its ability to engage in incremental and rapid updates while retaining and accurately recalling past experiences. In this paper we focus on the domain of Large Language Models (LLMs), where we identify two major challenges: (1) Abstraction and Experience Merging, and (2) Long-term Retention with Accurate Recall. These properties are essential for storing new experiences, organizing past experiences, and responding to the environment in ways that leverage relevant historical data. Unlike language models with continual learning, which typically rely on large corpora for fine-tuning and focus on improving performance within specific domains or tasks, LSCS must rapidly and incrementally update with new information from its environment at a high frequency. Existing technologies with the potential of solving the above two major challenges can be classified into four classes based on a conceptual metric called Storage Complexity, which measures the relative space required to store past experiences. Each of these four classes of technologies has its own strengths and limitations while we argue none of them alone can achieve LSCS alone. To this end, we propose a potential instantiation for LSCS that can integrate all four classes of technologies. The new instantiation, serving as a conjecture, operates through two core processes: Absorbing Experiences and Generating Responses.

URL: https://openreview.net/forum?id=LZ9FmeFeLV

---

Title: Distributed Multi-Agent Lifelong Learning

Authors: Prithviraj Tarale, Edward Rietman, Hava T Siegelmann

Abstract: Lifelong learning (LL) machines are designed to operate safely in dynamic environments by continually updating their knowledge. Conventional LL paradigms often assume that new data come labeled and that each LL machine has to learn independently from its environment. However, human labeling is expensive and impractical in remote conditions where automation is most desired. We introduce the Peer Parallel Lifelong Learning (PEEPLL) framework for distributed Multi-Agent Lifelong Learning, where agents continually learn online by actively requesting assistance from other agents instead of relying on the expensive environment to teach them. Unlike classical distributed AI, where communication scales poorly, lifelong learners need to communicate only on information they have not yet learned. Additionally, agents reply only if they are highly confident: Our TRUE confidence score uses a compute-efficient application of Variational Autoencoder to quantify confidence in prediction without needing data reconstruction. TRUE outperforms traditional Entropy-based confidence scores, reducing communication overhead by 18.05\% on CIFAR-100 and 5.8\% on MiniImageNet. To improve system resilience to low-quality or adversarial responses, our agents selectively accept a subset of received responses using the REFINE algorithm, which results in a 51.99\% increase in the percentage of correct accepted responses on CIFAR-100 and 25.79\% on MiniImageNet. Like traditional LL agents, PEEPLL agents store a subset of previously acquired knowledge as memory to learn alongside new information to prevent forgetting. We propose a Dynamic Memory-Update mechanism for PEEPLL agents that improves QA's classification performance by 44.17\% on CIFAR-100 and 26.8\% on MiniImageNet compared to the baseline Memory-Update mechanism. Our findings demonstrate that a PEEPLL agent can outperform an LL agent even if the latter has environmental supervision available, thus significantly reducing the need for labeling. PEEPLL provides a framework to facilitate research in distributed multi-agent LL, marking a substantial step towards practical, scalable lifelong learning technologies at the edge.

URL: https://openreview.net/forum?id=IIVr4Hu3Oi

---

Title: Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models

Authors: Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe

Abstract: The rapid advancement in Large Language Models (LLMs) has markedly enhanced the capabilities of language understanding and generation. However, the substantial model size poses hardware challenges, affecting both memory size for serving and inference latency for token generation. To address those challenges, we propose Dependency-aware Semi-structured Sparsity (DaSS), a new method for the recent prevalent GLU-based LLMs pruning, which incorporates structural dependency into the weight magnitude-based unstructured pruning. We introduce an MLP-specific pruning metric that evaluates the importance of each weight by jointly considering its magnitude and its corresponding MLP intermediate activation norms. DaSS facilitates a balance between the adaptability offered by unstructured pruning and the structural consistency inherent in dependency-based structured pruning. Empirical evaluations on LLaMA2, Mistral, and Gemma model families demonstrate that DaSS not only achieves superior perplexity and accuracy compared to SparseGPT and Wanda in achieving hardware-friendly N:M sparsity patterns but also maintains the computational efficiency of Wanda.

URL: https://openreview.net/forum?id=T5OuTgPxHS

---

Title: SelfEval: Leveraging discriminative nature of generative models for evaluation

Authors: Sai Saketh Rambhatla, Ishan Misra

Abstract: We present an automated way to evaluate the text alignment of text-to-image generative diffusion models using standard image-text recognition datasets. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts, and the likelihood can be used to perform recognition tasks with the generative model. We evaluate generative models on standard datasets created for multimodal text-image discriminative learning and assess fine-grained aspects of their performance: attribute binding, color recognition, counting, shape recognition, spatial understanding. Existing automated metrics rely on an external pretrained model like CLIP (VLMs) or LLMs, and are sensitive to the exact pretrained model and its limitations. SelfEval sidesteps these issues, and to the best of our knowledge, is the first automated metric to show a high degree of agreement for measuring text-faithfulness with the gold-standard human evaluations across multiple generative models, benchmarks and evaluation metrics. SelfEval also reveals that generative models showcase competitive recognition performance on challenging tasks such as Winoground image-score compared to discriminative models. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.

URL: https://openreview.net/forum?id=0mGho8wrv5

---

Title: Fairness Through Matching

Authors: Kunwoong Kim, Insung Kong, Jongjin Lee, Minwoo Chae, Sangchul Park, Yongdai Kim

Abstract: Group fairness requires that different protected groups, characterized by a given sensitive attribute, receive equal outcomes overall.
Typically, the level of group fairness is measured by the statistical gap between predictions from different protected groups.
In this study, we reveal an implicit property of existing group fairness measures, which provides an insight into how the group-fair models behave.
Then, we develop a new group-fair constraint based on this implicit property to learn group-fair models.
To do so, we first introduce a notable theoretical observation: every group-fair model has an implicitly corresponding transport map between the input spaces of each protected group.
Based on this observation, we introduce a new group fairness measure termed Matched Demographic Parity (MDP), which quantifies the averaged gap between predictions of two individuals (from different protected groups) matched by a given transport map.
Then, we prove that any transport map can be used in MDP to learn group-fair models, and develop a novel algorithm called Fairness Through Matching (FTM), which learns a group-fair model using MDP constraint with an user-specified transport map.
We specifically propose two favorable types of transport maps for MDP, based on the optimal transport theory, and discuss their advantages.
Experiments reveal that FTM successfully trains group-fair models with certain desirable properties by choosing the transport map accordingly.

URL: https://openreview.net/forum?id=dHljjaNHh1

---

Title: Zero-shot CLIP Class Forgetting via Text-image Space Adaptation

Authors: Alexey Kravets, Vinay P. Namboodiri

Abstract: Efficient class forgetting has attracted significant interest due to the high computational cost of retraining models from scratch whenever classes need to be forgotten. This need arises from data privacy regulations, the necessity to remove outdated information, and the possibility to enhance model robustness and security.
In this paper we address class forgetting in vision-language CLIP model. Modern class forgetting methods for CLIP have demonstrated that zero-shot forgetting is achievable by generating synthetic data and fine-tuning both visual and textual encoders with a regularization loss. Our approach shows that class forgetting in CLIP can be accomplished in a zero-shot manner without any visual data by adapting the shared vision-text space of CLIP, thereby making the class forgetting process more efficient. Our method delivers superior results, demonstrating strong performance and complete class removal, regardless of the visual encoder used in CLIP. Furthermore, we explore what exactly is being targeted by the class forgetting algorithm discovering some interesting properties of CLIP features.

URL: https://openreview.net/forum?id=V2SD2uVKEE

---

Title: A Scalable Approach for Mapper via Efficient Spatial Search

Authors: Luca Simi

Abstract: Topological Data Analysis (TDA) is a branch of applied mathematics that studies the shape of high dimensional datasets using ideas from algebraic topology. The Mapper algorithm is a widely used tool in Topological Data Analysis, used for uncovering hidden structures in complex data. However, existing implementations often rely on naive and inefficient methods for constructing the open covers that Mapper is based on, leading to performance issues, especially with large, high-dimensional datasets. In this study, we introduce a novel, more scalable method for constructing open covers for Mapper, leveraging techniques from computational geometry. Our approach significantly enhances efficiency, improving Mapper's performance for large high-dimensional data. We will present theoretical insights into our method and demonstrate its effectiveness through experimental evaluations on well-known datasets, showcasing substantial improvements in visualization quality and computational performance. We implemented our method in a new Python library called \emph{tda-mapper}, which is freely available at \url{https://github.com/lucasimi/tda-mapper-python}, providing a powerful tool for TDA practitioners and researchers.

URL: https://openreview.net/forum?id=lTX4bYREAZ

---

Title: Doubly Robust Conditional VAE via Decoder Calibration: An Implicit KL Annealing Approach

Authors: Chuanhui Liu, Xiao Wang

Abstract: Several variants of Variational Autoencoders have been developed to address inherent limitations. Specifically, $\sigma$-VAE utilizes a scaled identity matrix $\sigma^2 I$ in the decoder variance, while $\beta$-VAE introduces a hyperparameter $\beta$ to reweight the negative ELBO loss. However, a unified theoretical and practical understanding of model optimality remains unclear. For example, existing learning theories on the global optimality of VAE provide limited insight into their empirical success. Previous work showed the mathematical equivalence between the variance scalar $\sigma^2$ and the hyperparameter $\beta$ in shaping the loss landscape. While $\beta$-annealing is widely used, how to implement $\sigma$-annealing is still unclear. This paper presents a comprehensive analysis of $\sigma$-CVAE, highlighting its enhanced expressiveness in parameterizing conditional densities while addressing the associated estimation challenges arising from suboptimal variational inference. In particular, we propose Calibrated Robust $\sigma$-CVAE, a doubly robust algorithm that facilitates accurate estimation of $\sigma$ while effectively preventing the posterior collapse of $\phi$. Our approach, leveraging functional neural decomposition and KL annealing techniques, provides a unified framework to understand both $\sigma$-VAE and $\beta$-VAE regarding parameter optimality and training dynamics. Experimental results on synthetic and real-world datasets demonstrate the superior performance of our method across various conditional density estimation tasks, highlighting its significance for accurate and reliable probabilistic modeling.

URL: https://openreview.net/forum?id=VIkycTWDWo

---

Title: Improving CLIP Counting Accuracy via Parameter-Efficient Fine-Tuning

Authors: Ruisu Zhang, Yicong Chen, Kangwook Lee

Abstract: We focus on addressing the object counting limitations of vision-language models, with a particular emphasis on Contrastive Language-Image Pre-training (CLIP) models. Centered on our hypothesis that counting knowledge can be abstracted into linear vectors within the text embedding space, we develop a parameter-efficient fine-tuning method and several zero-shot methods to improve CLIP's counting accuracy. Through comprehensive experiments, we demonstrate that our learning-based method not only outperforms full-model fine-tuning in counting accuracy but also retains the broad capabilities of pre-trained CLIP models. Our zero-shot text embedding editing techniques are also effective in situations where training data is scarce, and can be extended to improve Stable Diffusion's ability to generate images with precise object counts. We also contribute two specialized datasets to train and evaluate CLIP’s counting capabilities. Our code is available at https://github.com/UW-Madison-Lee-Lab/CLIP_Counting.

URL: https://openreview.net/forum?id=IZrt6hB2sI

---

Title: Transfer Learning in $\ell_1$ Regularized Regression: Hyperparameter Selection Strategy based on Sharp Asymptotic Analysis

Authors: Koki Okajima, Tomoyuki Obuchi

Abstract: Transfer learning techniques aim to leverage information from multiple related datasets to enhance prediction quality against a target dataset. Such methods have been adopted in the context of high-dimensional sparse regression, and some Lasso-based algorithms have been invented: Trans-Lasso and Pretraining Lasso are such examples. These algorithms require the statistician to select hyperparameters that control the extent and type of information transfer from related datasets. However, selection strategies for these hyperparameters, as well as the impact of these choices on the algorithm's performance, have been largely unexplored. To address this, we conduct a thorough, precise study of the algorithm in a high-dimensional setting via an asymptotic analysis using the replica method. Our approach reveals a surprisingly simple behavior of the algorithm: Ignoring one of the two types of information transferred to the fine-tuning stage has little effect on generalization performance, implying that efforts for hyperparameter selection can be significantly reduced. Our theoretical findings are also empirically supported by \rev{applications on real-world and semi-artificial datasets using the IMDb and MNIST datasets, respectively.

URL: https://openreview.net/forum?id=ccu0M3nmlF

---

New submissions
===============

Title: M3CoL: Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

Abstract: Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.

URL: https://openreview.net/forum?id=NeQYi56MFj

---

Title: Generalizable Spectral Embedding with an Application to UMAP

Abstract: Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks which curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvectors separation.
In this paper, we introduce $\textit{GrEASE}$: Generalizable and Efficient Approximate Spectral Embedding, a novel deep-learning approach designed to address these limitations.
GrEASE incorporates an efficient post-processing step to achieve eigenvectors separation, while ensuring both generalizability and scalability, allowing for the computation of the Laplacian’s eigenvectors on unseen data. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications.
We empirically demonstrate GrEASE's ability to consistently approximate and generalize SE, while ensuring scalability. Additionally, we show how GrEASE can be leveraged to enhance existing methods. Specifically, we focus on UMAP, a leading visualization technique, and introduce $\textit{NUMAP}$, a generalizable version of UMAP powered by GrEASE. Our code will be publicly available upon acceptance.

URL: https://openreview.net/forum?id=Lmjqtdy1lU

---

Title: Time Series Anomaly Detection in the Frequency Domain with Statistical Reliability

Abstract: Effective anomaly detection in complex systems requires identifying change points (CPs) in the frequency domain, as abnormalities often arise across multiple frequencies. This paper extends recent advancements in statistically significant CP detection, based on Selective Inference (SI), to the frequency domain. The proposed SI method quantifies the statistical significance of detected CPs in the frequency domain using $p$-values, ensuring that the detected changes reflect genuine structural shifts in the target system. We address two major technical challenges to achieve this. First, we extend the existing SI framework to the frequency domain by appropriately utilizing the properties of discrete Fourier transform (DFT). Second, we develop an SI method that provides valid $p$-values for CPs where changes occur across multiple frequencies. Experimental results demonstrate that the proposed method reliably identifies genuine CPs with strong statistical guarantees, enabling more accurate root-cause analysis in the frequency domain of complex systems.

URL: https://openreview.net/forum?id=FNRdaHz3qN

---

Title: G-RepsNet: A Lightweight Construction of Equivariant Networks for Arbitrary Matrix Groups

Abstract: Group equivariance is a strong inductive bias useful in a wide range of deep learning tasks. However, constructing efficient equivariant networks for general groups and domains is difficult. Recent work by Finzi et al. directly solves the equivariance constraint for arbitrary matrix groups to obtain equivariant MLPs (EMLPs). But this method does not scale well and scaling is crucial in deep learning.
Here, we introduce Group Representation Networks (G-RepsNets), a lightweight equivariant network for arbitrary matrix groups with features represented using tensor polynomials. The key insight in our design is that using tensor representations in the hidden layers of a neural network along with simple inexpensive tensor operations leads to scalable equivariant networks. Further, these networks are universal approximators of functions equivariant to orthogonal groups. We find G-RepsNet to be competitive to EMLP on several tasks with group symmetries such as $O(5)$, $O(1, 3)$, and $O(3)$ with scalars, vectors, and second-order tensors as data types.
On image classification tasks, we find that G-RepsNet using second-order representations is competitive and often even outperforms sophisticated state-of-the-art equivariant models such as GCNNs and $E(2)$-CNNs. To further illustrate the generality of our approach, we show that G-RepsNet is competitive to G-FNO and EGNN on N-body predictions and solving PDEs respectively, while being efficient.

URL: https://openreview.net/forum?id=k1eYngOvf0

---

Title: Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

Abstract: Machine-learning from a disparate set of tables, a data lake, requires assembling features
by merging and aggregating tables. Data discovery can extend autoML to data tables
by automating these steps. We present an in-depth analysis of such automated table
augmentation for machine learning tasks, analyzing different methods for the three main
steps: retrieving joinable tables, merging information, and predicting with the resultant
table. We use two data lakes: Open Data US, a well-referenced real data lake, and a novel
semi-synthetic dataset, YADL (Yet Another Data Lake), which we developed as a tool for
benchmarking this data discovery task. Systematic exploration on both lakes outlines 1)
the importance of accurately retrieving join candidates, 2) the efficiency of simple merging
methods, and 3) the resilience of tree-based learners to noisy conditions. Our experimental
environment is easily reproducible and based on open data, to foster more research on feature
engineering, autoML, and learning in data lakes

URL: https://openreview.net/forum?id=4uPJN6yfY1

---

Title: When resampling/reweighting improves feature learning in imbalanced classification? A toy-model study

Abstract: A toy model of binary classification is studied with the aim of clarifying the class-wise resampling/reweighting effect on the feature learning performance under the presence of class imbalance. In the analysis, a high-dimensional limit of the input space is taken while keeping the ratio of the dataset size against the input dimension finite and the non-rigorous replica method from statistical mechanics is employed. The result shows that there exists a case in which the no resampling/reweighting situation gives the best feature learning performance irrespectively of the choice of losses or classifiers, supporting recent findings in~\citet{kang2019decoupling,cao2019learning}. It is also revealed that the key of the result is the symmetry of the loss and the problem setting. Inspired by this, we propose a further simplified model exhibiting the same property in the multiclass setting. These clarify when the class-wise resampling/reweighting becomes effective in imbalanced classification.

URL: https://openreview.net/forum?id=spqbyeGyLR

---

Title: Revisiting Deep Hybrid Models for Out-of-Distribution Detection

Abstract: Deep hybrid models (DHMs) for out-of-distribution (OOD) detection, jointly training a deep feature extractor with a classification head and a density estimation head based on a normalising flow, provide a conceptually appealing approach to visual OOD detection. The paper that introduced this approach reported 100% AuROC in experiments on two standard benchmarks, including one based on the CIFAR-10 data. As there are no implementations available, we set out to reproduce the approach by carefully filling in gaps in the description of the algorithm. Although we were unable to attain 100% OOD detection rates, and our results indicate that such performance is impossible on the CIFAR-10 benchmark, we achieved good OOD performance. We provide a detailed analysis of when the architecture fails and argue that it introduces an adversarial relationship between the classification component and the density estimator, rendering it highly sensitive to the balance of these two components and yielding a collapsed feature space without careful fine-tuning. Our implementation of DHMs is publicly available.

URL: https://openreview.net/forum?id=yeITEuhv4Q

---

Title: Towards identifiability of micro total effects in summary causal graphs with latent confounding: extension of the front-door criterion

Abstract: Conducting experiments to estimate total effects can be challenging due to cost, ethical concerns, or practical limitations. As an alternative, researchers often rely on causal graphs to determine whether these effects can be identified from observational data. Identifying total effects in fully specified causal graphs has received considerable attention, with Pearl's front-door criterion enabling the identification of total effects in the presence of latent confounding even when no variable set is sufficient for adjustment. However, specifying a complete causal graph is challenging in many domains. Extending these identifiability results to partially specified graphs is crucial, particularly in dynamic systems where causal relationships evolve over time. This paper addresses the challenge of identifying total effects using a specific and well-known partially specified graph in dynamic systems called a summary causal graph, which does not specify the temporal lag between causal relations and can contain cycles. In particular, this paper presents sufficient graphical conditions for identifying total effects from observational data, even in the presence of cycles and latent confounding, and when no variable set is sufficient for adjustment.

URL: https://openreview.net/forum?id=5f7YlSKG1l

---

Title: Towards Understanding the Role of Sharpness-Aware Minimization Algorithms for Out-of-Distribution Generalization

Abstract: Recently, sharpness-aware minimization (SAM) has emerged as a promising method to improve generalization by minimizing sharpness, which is known to correlate well with generalization ability. Since the original proposal of SAM, many variants of SAM have been proposed to improve its accuracy and efficiency, but comparisons have mainly been restricted to the i.i.d. setting. In this paper we study SAM for out-of-distribution (OOD) generalization. First, we perform a comprehensive comparison of eight SAM variants on zero-shot OOD generalization, finding that the original SAM outperforms the Adam baseline by 4.76% and the strongest SAM variants outperform the Adam baseline by 8.01% on average. We then provide an OOD generalization bound in terms of sharpness for this setting. Next, we extend our study of SAM to the related setting of gradual domain adaptation (GDA), another form of OOD generalization where intermediate domains are constructed between the source and target domains, and iterative self-training is done on intermediate domains, to improve the overall target domain error. In this setting, our experimental results demonstrate that the original SAM outperforms the baseline of Adam on each of the experimental datasets by 0.82% on average and the strongest SAM variants outperform Adam by 1.52% on average. We then provide a generalization bound for SAM in the GDA setting. Asymptotically, this generalization bound is no better than the one for self-training in the literature of GDA. This highlights a further disconnection between the theoretical justification for SAM versus its empirical performance, with recent work finding that low sharpness alone does not account for all of SAM's generalization benefits. For future work, we provide several potential avenues for obtaining a tighter analysis for SAM in the OOD setting.

URL: https://openreview.net/forum?id=Cnaglt4aDf

---

Title: Benchmark on Drug Target Interaction Modeling from a Structure Perspective

Abstract: The prediction modeling of drug-target interactions is crucial to drug discovery and design, which has seen rapid advancements owing to deep learning technologies. Recently developed methods, such as those based on graph neural networks (GNNs) and Transformers, demonstrate exceptional performance across various datasets by effectively extracting structural information. However, the benchmarking of these novel methods often varies significantly in terms of hyperparameter settings and datasets, which limits algorithmic progress. In view of these, we conduct a comprehensive survey and benchmark for drug-target interaction modeling from a structure perspective, via integrating tens of explicit (i.e., GNN-based) and implicit (i.e., Transformer-based) structure learning algorithms. To this end, we first unify the hyperparameter setting within each class of structure learning methods. Moreover, we conduct a macroscopical comparison between these two classes of encoding strategies as well as the different featurization techniques that inform molecules' chemical and physical properties. We then carry out the microscopical comparison between all the integrated models across the six datasets, via comprehensively benchmarking their effectiveness and efficiency. Remarkably, the summarized insights from the benchmark studies lead to the design of model combos. We demonstrate that our combos can achieve new state-of-the-art performance on various datasets associated with cost-effective memory and computation.

URL: https://openreview.net/forum?id=mymphmqDTL

---

Title: Predictable Reinforcement Learning Dynamics through Entropy Rate Minimization

Abstract: In Reinforcement Learning (RL), agents have no incentive to exhibit predictable behaviors, and are often pushed (through e.g. policy entropy regularisation) to randomise their actions in favor of exploration. This often makes it challenging for other agents and humans to predict an agent's behavior, triggering unsafe scenarios (e.g. in human-robot interaction). We propose a novel method to induce predictable behavior in RL agents, termed Predictability-Aware RL (PARL), employing the agent's trajectory entropy rate to quantify predictability.
Our method maximizes a linear combination of a standard discounted reward and the negative entropy rate, thus trading off optimality with predictability. We show how the entropy rate can be formally cast as an average reward, how entropy-rate value functions can be estimated from a learned model and incorporate this in policy-gradient algorithms, and demonstrate how this approach produces predictable (near-optimal) policies in tasks inspired by human-robot use-cases.

URL: https://openreview.net/forum?id=DDUsc1lD27

---

Title: SE3Set: Harnessing Equivariant Hypergraph Neural Networks for Molecular Representation Learning

Abstract: In this paper, we develop SE3Set, an SE(3) equivariant hypergraph neural network architecture tailored for advanced molecular representation learning. Hypergraphs are not merely an extension of traditional graphs; they are pivotal for modeling high-order relationships, a capability that conventional equivariant graph-based methods lack due to their inherent limitations in representing intricate many-body interactions. To achieve this, we first construct hypergraphs via proposing a new fragmentation method that considers both chemical and three-dimensional spatial information of the molecular system. We then design SE3Set, which incorporates equivariance into the hypergraph neural network. This ensures that the learned molecular representations are invariant to spatial transformations, thereby providing robustness essential for accurate prediction of molecular properties. SE3Set has shown performance on par with state-of-the-art (SOTA) models for small molecule datasets like QM9 and MD17. It demonstrates outstanding performance on the MD22 dataset, achieving a remarkable ~20\% improvement in accuracy across all molecules. Furthermore, on the OE62 dataset, Se3Set outperforms all short-range models. We also conducted a detailed analysis on OE62, highlighting the prevalence of complex many-body interactions in large molecules. This exceptional performance of SE3Set across diverse molecular structures underscores its transformative potential in computational chemistry, offering a route to more accurate and physically nuanced modeling.

URL: https://openreview.net/forum?id=muWEt1TOyo

---

Title: ArxEval: Evaluating Retrieval and Generation in Language Models for Scientific Literature

Abstract: Language Models [LMs] are now playing an increasingly large role in information generation and synthesis; the representation of scientific knowledge in these systems needs to be highly accurate. A prime challenge is hallucination; that is, generating apparently plausible but
actually false information, including invented citations and nonexistent research papers. This kind of inaccuracy is dangerous in all the domains that require high levels of factual correctness, such as academia and education. This work presents a pipeline for evaluating the
frequency with which language models hallucinate in generating responses in the scientific literature. We propose ArxEval, an evaluation pipeline with two tasks using ArXiv as a repository: Jumbled Titles and Mixed Titles. Our evaluation includes fifteen widely used language models and provides comparative insights into their reliability in handling scientific literature.

URL: https://openreview.net/forum?id=aolB8zSDyO

---

Title: Seeing Beyond Labels: Source-Free Domain Adaptation via Hypothesis Consolidation of Prediction Rationale

Abstract: Source-Free Unsupervised Domain Adaptation (SFUDA) is a challenging task where a model needs to be adapted to a new domain without access to target domain labels or source domain data. The primary difficulty in this task is that the model's predictions may be inaccurate, and using these inaccurate predictions for model adaptation can lead to misleading results. To address this issue, this paper proposes a novel approach that considers multiple prediction hypotheses for each sample and investigates the rationale behind each hypothesis. By consolidating these hypothesis rationales, we identify the most likely correct hypotheses, which we then use as a pseudo-labeled set to support a semi-supervised learning procedure for model adaptation. To achieve the optimal performance, we propose a three-step adaptation process: model pre-adaptation, hypothesis consolidation, and semi-supervised learning. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance in the SFUDA task and can be easily integrated into existing approaches to improve their performance.

URL: https://openreview.net/forum?id=fywo0eRzAu

---

Title: On the Interplay Between Sparsity and Training in Deep Reinforcement Learning

Abstract: We study the benefits of different sparse architectures for deep reinforcement learning. In particular, we focus on image-based domains where spatially-biased and fully-connected architectures are common. Using these and several other architectures of equal capacity, we show that sparse structure has a significant effect on learning performance. We also observe that choosing the best sparse architecture for a given domain depends on whether the hidden layer weights are fixed or learned.

URL: https://openreview.net/forum?id=6Rp7ETrZK4

---

Title: Improving out-of-distribution generalization by mimicking the human visual diet.

Abstract: Human visual experience is markedly different from the large-scale computer vision datasets consisting of internet images. Babies densely sample a few $3D$ scenes with diverse variations such as object viewpoints or illuminations, while datasets like ImageNet contain one single snapshot from millions of 3D scenes. We investigated how these differences in input data composition (\ie visual diet) impact the Out-Of-Distribution (OOD) generalization capabilities of a visual system. Training models on a dataset mimicking attributes of the human-like visual diet improved generalization to OOD lighting, material, and viewpoint changes by up to $18\%$. This observation held despite the fact that the models were trained on $1,000$-fold less training data. Furthermore, when trained on purely synthetic data and tested on natural images, incorporating these visual diet attributes in the training dataset improved OOD generalization by $17\%$. These experiments are enabled by our newly proposed benchmark---the Human Visual Diet (HVD) dataset, and a new model (Human Diet Network) designed to leverage the attributes of a human-like visual diet. These findings highlight a critical problem in modern day Artificial Intelligence---building better datasets requires thinking beyond dataset size and rather focus on improving data composition. All data and source code are available at \url{https://bit.ly/3yX3PAM}.

URL: https://openreview.net/forum?id=5muZmOzrFC

---

Title: ∇QDARTS: Quantization as an Elastic Dimension to Differentiable NAS

Abstract: Differentiable Neural Architecture Search methods efficiently find high-accuracy architectures
using gradient-based optimization in a continuous domain, saving computational resources.
Mixed-precision search helps optimize precision within a fixed architecture. However, applying
it to a NAS-generated network doesn’t assure optimal performance as the optimized quantized
architecture may not emerge from a standalone NAS method. In light of these considerations,
this paper introduces ∇QDARTS, a novel approach that combines differentiable NAS with
mixed-precision search for both weight and activation. ∇QDARTS aims to identify the optimal
mixed-precision neural architecture capable of achieving remarkable accuracy while operating
with minimal computational requirements in a single shot, end-to-end differentiable framework
obviating the need for pertaining and proxy. Compared to fp32, ∇QDARTS shows impressive
performance on CIFAR10 with (2,4) bit precision, reducing bit operations by 160× with
a slight 1.57% accuracy drop. Increasing the capacity enables ∇QDARTS to match fp32
accuracy while reducing bit operations by 18×. For the ImageNet dataset, with just (2,4)
bit precision, ∇QDARTS outperforms state-of-the-art methods such as APQ, SPOS, OQA,
and MNAS by 2.3%, 2.9%, 0.3%, and 2.7% in terms of accuracy. By incorporating (2,4,8)
bit precision, ∇QDARTS further minimizes the accuracy drop to a 1% compared to fp32,
alongside a substantial reduction of 17× in required bit operations and 2.6× in memory
footprint. In terms of bit-operation (memory footprint) ∇QDARTS excels over APQ, SPOS,
OQA, and MNAS with similar accuracy by 2.3× (12×), 2.4× (3×), 13% (6.2×), 3.4× (37%),
for bit-operation (memory footprint), respectively. ∇QDARTS enhances the overall search
and training efficiency, achieving a 3.1× and 1.54× improvement over APQ and OQA,
respectively.

URL: https://openreview.net/forum?id=ubrOSWyTS8

---

Title: Flexible Conditional Generation with Stochastically Factorized Autoregressive Models

Abstract: Deep autoregressive generative models have demonstrated promising results in unconditional generation tasks for structured data, such as images. However, their effectiveness in conditional generation remains relatively underexplored. Recent autoregressive diffusion models efficiently amortize across all possible generation orders by defining uniform permutation orders or Markov chains with absorbing states, allowing them to parameterize any conditional distribution between data elements. Despite this flexibility, these models often struggle to conditionally generate content accurately during testing if the masking mechanism deviates from training assumptions. To address this limitation, we propose a novel deep generative model that leverages intrinsic data properties alongside self-supervision principles. Our approach extends established autoregressive frameworks by probabilistically modeling per-element generation as a mixture of semi-supervised mechanisms. This design provides a robust framework for conditional generation across diverse masking patterns. Furthermore, we hypothesize that the ability to model any conditional distribution makes these models particularly well-suited for data acquisition tasks, where collecting data to maximize predictive accuracy is critical. We propose a novel method for active data acquisition using autoregressive diffusion models, demonstrating promising results. Experimental evaluations show significant improvements in both simplicity and accuracy for conditional generation tasks, outperforming conventional methods that rely on random permutations or simultaneous generation of all dimensions

URL: https://openreview.net/forum?id=r9ZfE62jwG

---

Title: Gaussian Pre-Activations in Neural Networks: Myth or Reality?

Abstract: The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. A very common assumption is that the pre-activations are Gaussian. Although this convenient *Gaussian hypothesis* can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental work for finite-width neural networks. Our main contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network depth, even in narrow neural networks, under the assumption that the pre-activations are independent. In the process, we discover a set of constraints that a neural network should satisfy to ensure Gaussian pre-activations. In addition, we provide a critical review of the claims of the Edge of Chaos line of work and construct a non-asymptotic Edge of Chaos analysis. We also propose a unified view on the propagation of pre-activations, encompassing the framework of several well-known initialization procedures. More generally, our work provides a principled framework for addressing the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are guaranteed to be Gaussian?

URL: https://openreview.net/forum?id=goe6fv6iSh

---

Title: Return-Aligned Decision Transformer

Abstract: Traditional approaches in offline reinforcement learning aim to learn the optimal policy that maximizes the cumulative reward, also known as return. It is increasingly important to adjust the performance of AI agents to meet human requirements, for example, in applications like video games and education tools. Decision Transformer (DT) optimizes a policy that generates actions conditioned on the target return through supervised learning and includes a mechanism to control the agent's performance using the target return. However, the action generation is hardly influenced by the target return because DT’s self-attention allocates scarce attention scores to the return tokens. In this paper, we propose Return-Aligned Decision Transformer (RADT), designed to more effectively align the actual return with the target return. RADT leverages features extracted by paying attention solely to the return, enabling action generation to consistently depend on the target return. Extensive experiments show that RADT significantly reduces the discrepancies between the actual return and the target return compared to DT-based methods.

URL: https://openreview.net/forum?id=lTt2cTW8h1

---

Title: DITTO: Offline Imitation Learning with World Models

Abstract: For imitation learning algorithms to scale to real-world challenges, they must handle high-dimensional observations, offline learning, and policy-induced covariate-shift. We propose DITTO, an offline imitation learning algorithm which addresses all three of these problems. DITTO optimizes a novel distance metric in the latent space of a learned world model: First, we train a world model on all available trajectory data, then, the imitation agent is unrolled from expert start states in the learned model, and penalized for its latent divergence from the expert dataset over multiple time steps. We optimize this multi-step latent divergence using standard reinforcement learning algorithms, which provably induces imitation learning, and empirically achieves state-of-the art performance and sample efficiency on a range of Atari environments from pixels, without any online environment access. We also adapt other standard imitation learning algorithms to the world model setting, and show that this considerably improves their performance. Our results show how creative use of world models can lead to a simple, robust, and highly-performant policy-learning framework.

URL: https://openreview.net/forum?id=NCD1PAZ0Xv

---

Title: A functional framework for nonsmooth autodiff with {\it maxpooling} functions

Abstract: We make a comment on the recent work by Boustany, by showing that the Murat-TrombettiTheorem provides a simple and efficient mathematical framework for nonsmooth automatic differentiation of {\it maxpooling} functions. In particular it gives a the chain rule formula which correctly defines the composition of Lipschitz-continuous functions which are piecewise $C^1$. The formalism is applied to four basic examples, with some tests in PyTorch. A self contained proof of an important Stampacchia formula is in the appendix.

URL: https://openreview.net/forum?id=qahoztvThX

---

Title: Efficient Open Set Single Image Test Time Adaptation of Vision Language Models

Abstract: Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Existing Test Time Adaptation methods fail to handle this setting due to their reliance on closed-set assumptions or batch processing, making them unsuitable for real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for {\em Open-set Single image Test Time Adaptation using Vision-Language Models}. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the separation between known and unknown classes. Our approach effectively adapts models to domain shifts for known classes while rejecting unfamiliar samples. Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. Adapting models to dynamic, real-world environments characterized by shifting data distributions and unseen test scenarios is a critical challenge in deep learning. In this paper, we consider a realistic and challenging Test Time Adaptation setting, where a model must continuously adapt to test samples that arrive sequentially, one at a time, while distinguishing between known and unknown classes. Existing Test Time Adaptation methods fail to handle this setting due to their reliance on closed-set assumptions or batch processing, making them unsuitable for real-world open-set scenarios. We address this limitation by establishing a comprehensive benchmark for {\em Open-set Single image Test Time Adaptation using Vision-Language Models}. Furthermore, we propose ROSITA, a novel framework that leverages dynamically updated feature banks to identify reliable test samples and employs a contrastive learning objective to improve the separation between known and unknown classes. Our approach effectively adapts models to domain shifts for known classes while rejecting unfamiliar samples. Extensive experiments across diverse real-world benchmarks demonstrate that ROSITA sets a new state-of-the-art in open-set TTA, achieving both strong performance and computational efficiency for real-time deployment. The code is available anonymously at https://github.com/ostta/ROSITA.git.

URL: https://openreview.net/forum?id=72YVabBErN

---

Title: Rethinking the Value of Training-Free Structured Pruning of LLMs

Abstract: This paper investigates the effectiveness of training-free structured pruning techniques for Large Language Models (LLMs), with a particular focus on depth and width pruning strategies. Through an extensive empirical evaluation across a diverse range of tasks, datasets and modalities, we reveal critical limitations in current pruning methods. While some tasks exhibit minimal performance degradation, others face significant deterioration, even at low pruning rates, contradicting prior findings that often rely on selective benchmarks. Unexpectedly, our analysis finds that depth pruning, despite its simplicity, consistently outperforms the more granular width pruning approaches in maintaining downstream task performance. Our findings highlight that existing evaluations of pruned LLMs often overstate their effectiveness due to incomplete or limited evaluation tasks, necessitating a critical reassessment of the true value of pruning and emphasizing the need to explore more robust pruning algorithms.

URL: https://openreview.net/forum?id=7KkytYYhMv

---

Title: Random Policy Enables In-Context Reinforcement Learning within Trust Horizons

Abstract: Pretrained foundation models (FMs) have exhibited extraordinary in-context learning performance, allowing zero-shot (or few-shot) generalization to new environments/tasks not encountered during the pretraining. In the case of reinforcement learning (RL), in-context RL (ICRL) emerges when pretraining FMs on decision-making problems in an autoregressive-supervised manner. Nevertheless, the current state-of-the-art ICRL algorithms, such as Algorithm Distillation, Decision Pretrained Transformer and Decision Importance Transformer, impose stringent requirements on the pretraining dataset concerning the behavior (source) policies, context information, and action labels, etc. Notably, these algorithms either demand optimal policies or require varying degrees of well-trained behavior policies for all pretraining environments. This significantly hinders the application of ICRL to real-world scenarios, where acquiring optimal or well-trained policies for a substantial volume of real-world training environments can be prohibitively expensive or even intractable. To overcome this challenge, we introduce a novel approach, termed State-Action Distillation (SAD), that allows to generate an effective pretraining dataset guided solely by random policies. In particular, SAD selects query states and corresponding action labels by distilling the outstanding state-action pairs from the entire state and action spaces by using random policies within a trust horizon, and then inherits the classical autoregressive-supervised mechanism during the pretraining. To the best of our knowledge, this is the first work that enables effective ICRL under (e.g., uniform) random policies and random contexts. We also establish the quantitative analysis of the trustworthiness as well as the performance guarantees of our SAD approach. Moreover, our empirical results across multiple popular ICRL benchmark environments demonstrate that, on average, SAD outperforms the best baseline by 236.3% in the offline evaluation and by 135.2% in the online evaluation.

URL: https://openreview.net/forum?id=mAiMKnr9r5

---

Title: An Adversarial Training Approach to Robustify Stable Diffusion Systems Against Prompting Attacks

Abstract: Text-to-Image (T2I) systems are generative models designed to generate images based on textual descriptions. Despite their remarkable performance, it has been shown that they are susceptible to misuse. One form of misuse involves manipulating the input prompt, leading to images that do not match with the given description. To address this, we introduce an adversarial training (AT) procedure for Stable Diffusion. Our aim is to train the model across various concepts (e.g., ``bicycle''), ensuring that the output aligns with the original concept even under adversarial modifications (e.g., ``bicycle MJZM4''). To our knowledge, this is the first method to develop an adversarial training approach against this type of misuse. Finally, through several experiments, we demonstrate that the proposed method enhances the robustness of the model against certain classes of prompting attacks.

URL: https://openreview.net/forum?id=HbdsotVvkf

---

Title: An Embedding is Worth a Thousand Noisy Labels

Abstract: The performance of deep neural networks scales with dataset size and label quality, rendering the efficient mitigation of low-quality data annotations crucial for building robust and cost-effective systems. Existing strategies to address label noise exhibit severe limitations due to computational complexity and application dependency. In this work, we propose WANN, a Weighted Adaptive Nearest Neighbor approach that builds on self-supervised feature representations obtained from foundation models. To guide the weighted voting scheme, we introduce a reliability score, which measures the likelihood of a data label being correct. WANN outperforms reference methods, including a linear layer trained with robust loss functions, on diverse datasets of varying size and under various noise types and severities. WANN also exhibits superior generalization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed k-NNs. Furthermore, the proposed weighting scheme enhances supervised dimensionality reduction under noisy labels. This yields a significant boost in classification performance with 10x and 100x smaller image embeddings, minimizing latency and storage requirements. Our approach, emphasizing efficiency and explainability, emerges as a simple, robust solution to overcome inherent limitations of deep neural network training. The code will be released upon acceptance.

URL: https://openreview.net/forum?id=X3gSvQjShh

---

Title: Evolution guided generative flow networks

Abstract: Generative Flow Networks (GFlowNets) are a family of probabilistic generative models recently invented that learn to sample compositional objects proportional to their rewards. One big challenge of GFlowNets is training them effectively when dealing with long time horizons and sparse rewards. To address this, we propose Evolution guided generative flow networks (EGFN), a simple but powerful augmentation to the GFlowNets training using Evolutionary algorithms (EA). Our method can work on top of any GFlowNets training objective, by training a set of agent parameters using EA, storing the resulting trajectories in the prioritized replay buffer, and training the GFlowNets agent using the stored trajectories. We present a thorough investigation over a wide range of toy and real-world benchmark tasks showing the effectiveness of our method in handling long trajectories and sparse rewards.

URL: https://openreview.net/forum?id=UgZIR6TF5N

---

Title: \textbf{$\text{C}^\text{2}\text{P}$}: Featuring Large Language Models with Causal Reasoning

Abstract: Causal reasoning is one of the primary bottlenecks that Large Language Models (LLMs) must overcome to attain human-level intelligence. Recent studies indicate that LLMs display near-random performance on reasoning tasks. To address this, we introduce the Causal Chain of Prompting ($\text{C}^2\text{P}$), a reasoning framework that aims to equip current LLMs with causal reasoning capabilities as the first framework of its kind operating autonomously without relying on external tools or modules during both the causal learning and reasoning phases. To evaluate the performance of $\text{C}^2\text{P}$, we first demonstrate that reasoning accuracy improved by over $30.7\%$ and $25.9\%$ for GPT-4 Turbo and LLaMA 3.1, respectively, when using our framework, compared to the same models without $\text{C}^2\text{P}$ on a synthetic benchmark dataset. Then, using few-shot learning of the same LLMs with $\text{C}^2\text{P}$, the reasoning accuracy increased by more than $20.05\%$ and $20.89\%$, respectively, with as few as ten examples, compared to the corresponding LLMs without $\text{C}^2\text{P}$ on the same dataset.
To evaluate $\text{C}^2\text{P}$ in realistic scenarios, we utilized another benchmark dataset containing natural stories across various fields, including healthcare, medicine, economics, education, social sciences, environmental science, and marketing. The results show improved reasoning when $\text{C}^2\text{P}$ is applied, compared to cases where our framework is not used, which often leads to random and hallucinated responses. By showing the improved performance of few-shot learned GPT-4 Turbo and LLaMA 3.1 with $\text{C}^2\text{P}$, we demonstrate the generalizability of our framework.

URL: https://openreview.net/forum?id=PboLe05KuB

---

Title: Coupled Distributional Random Expert Distillation for World Model Online Imitation Learning

Abstract: Imitation Learning (IL) has achieved remarkable success across various domains, including robotics, autonomous driving, and healthcare, by enabling agents to learn complex behaviors from expert demonstrations. However, existing IL methods often face instability challenges, particularly when relying on adversarial reward or value formulations in world model frameworks. In this work, we propose a novel approach to online imitation learning that addresses these limitations through a reward model based on random network distillation (RND) for density estimation. Our reward model is built on the joint estimation of expert and behavioral distributions within the latent space of the world model. We evaluate our method across diverse benchmarks, including DMControl, Meta-World, and ManiSkill2, showcasing its ability to deliver stable performance and achieve expert-level results in both locomotion and manipulation tasks. Our approach demonstrates improved stability over adversarial methods while maintaining expert-level performance.

URL: https://openreview.net/forum?id=9fsdvnMWsC

---

Title: Monocular Dynamic Gaussian Splatting is Fast and Brittle but Smooth Motion Helps

Abstract: Gaussian splatting methods are emerging as a popular approach for converting multi-view image data into scene representations that allow view synthesis. In particular, there is interest in enabling view synthesis for dynamic scenes using only monocular input data---an ill-posed and challenging problem. The fast pace of work in this area has produced multiple simultaneous papers that claim to work best, which cannot all be true. In this work, we organize, benchmark, and analyze many Gaussian-splatting-based methods, providing apples-to-apples comparisons that prior works have lacked. We use multiple existing datasets and a new instructive synthetic dataset designed to isolate factors that affect reconstruction quality.
We systematically categorize Gaussian splatting methods into specific motion representation types and quantify how their differences impact performance. Empirically, we find that their rank order is well-defined in synthetic data, but the complexity of real-world data currently overwhelms the differences. Furthermore, the fast rendering speed of all Gaussian-based methods comes at the cost of brittleness in optimization. We summarize our experiments into a list of findings that can help to further progress in this lively problem setting.

URL: https://openreview.net/forum?id=fzmw8Joug4

---

Title: Generalizable and Robust Spectral Method for Multi-view Representation Learning

Abstract: Multi-view representation learning (MvRL) has garnered substantial attention in recent years, driven by the increasing demand for applications that can effectively process and analyze data from multiple sources. In this context, graph Laplacian-based MvRL methods have demonstrated remarkable success in representing multi-view data. However, these methods often struggle with generalization to new data and face challenges with scalability. Moreover, in many practical scenarios, multi-view data is contaminated by noise or outliers. In such cases, modern deep-learning-based MvRL approaches that rely on alignment or contrastive objectives present degraded performance in downstream tasks, as they may impose incorrect consistency between clear and corrupted data sources. We introduce *SpecRaGE*, a novel fusion-based framework that integrates the strengths of graph Laplacian methods with the power of deep learning to overcome these challenges. SpecRage uses neural networks to learn parametric mapping that approximates a joint diagonalization of graph Laplacians. This solution bypasses the need for alignment while enabling generalizable and scalable learning of informative and meaningful representations. Moreover, it incorporates a meta-learning fusion module that dynamically adapts to data quality, ensuring robustness against outliers and noisy views. Our extensive experiments demonstrate that SpecRaGE outperforms state-of-the-art methods, particularly in scenarios with data contamination, paving the way for more reliable and efficient multi-view learning. Our code will be made publicly available upon acceptance.

URL: https://openreview.net/forum?id=X6IY04Akw1

---

Title: Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

Abstract: Large language models (LLMs), primarily built on decoder-only transformer architectures, excel in natural language generation tasks and have shown promise in adapting to diverse downstream tasks using zero-shot and few-shot prompting techniques. However, these prompting methods often fall short on natural language understanding (NLU) tasks, where smaller encoder-only models like BERT-base consistently outperform LLMs on benchmarks such as GLUE and SuperGLUE. In this paper, we explore two approaches—supervised fine-tuning and proximal policy optimization (PPO)—to enhance the NLU capabilities of LLMs. To reduce the computational cost of full-model fine-tuning, we integrate low-rank adaptation (LoRA) layers, restricting updates to these layers during both supervised fine-tuning and PPO stages. In the supervised fine-tuning approach, task-specific prompts are concatenated with input queries and ground-truth labels from the NLU training corpus, optimizing the model using the next-token prediction objective. Despite this, LLMs still underperform compared to encoder-only models like BERT-base on several NLU tasks. To address this gap, we employ PPO, a reinforcement learning technique that treats each token generation as an action and evaluates the sequence of generated tokens using a reward function based on their alignment with ground-truth answers. PPO then updates the model to maximize these rewards, effectively aligning its outputs with the correct labels. Our experiments with the LLAMA2-7B model demonstrate that PPO-based fine-tuning significantly improves performance, delivering an average gain of 6.3 points over supervised fine-tuning on the GLUE benchmark. PPO surpasses zero-shot prompting by 38.7 points and few-shot prompting by 26.1 points on GLUE, while also outperforming these baselines by 28.8 and 28.5 points on SuperGLUE. Additionally, PPO exceeds the performance of BERT-large, a strong baseline, with an average improvement of 2.7 points on GLUE and 9.3 points on SuperGLUE. These improvements are consistent across models such as Qwen2.5-7B and MPT-7B, highlighting PPO’s robustness and effectiveness in improving the NLU capabilities of LLMs. Furthermore, LLAMA2-7B and LLAMA2-13B models fine-tuned with PPO on a single dataset exhibit strong zero-shot generalization across diverse unseen datasets. On average, they outperform GPT-4o by over 4% on sentiment analysis and natural language inference tasks, achieving notable gains of 7.4% on the Mental Health dataset and more than 10.9% on SIGA-nli.

URL: https://openreview.net/forum?id=EirFORi6v8

---

Title: Adaptive and Scalable Discovery and Mitigation of Multiple Biased Subgroups in Image Classifiers

Abstract: Deep learning models generally perform well across entire datasets but often exhibit disparate behaviors across different subgroups. Such biases hinder real-world applications. Despite numerous efforts to identify and mitigate biases in biased subgroups using the powerful vision-language foundation model CLIP, these approaches commonly neglect inherent biases in CLIP’s feature encoding, which can restrict performance improvements. In our work, we introduce a novel strategy that employs an ensemble of surrogate models for adaptive and scalable discovery of biased subgroups, effectively reducing the impact of feature encoding biases inherent in CLIP. Additionally, we utilize the large vision-language model to elucidate inherent subgroup biases and employ relative Fisher information to identify critical layers for mitigating subgroup bias and suppressing the learning of shortcuts. Extensive experiments on CIFAR-100, Breeds, and ICSD-171K demonstrate the effectiveness of our proposed methods. We also confirm the presence of subgroup bias by analyzing the image encoder of CLIP on the Hard ImageNet dataset.

URL: https://openreview.net/forum?id=IHYXmlHr9z

---

Title: GRAPES: Learning to Sample Graphs for Scalable Graph Neural Networks

Abstract: Graph neural networks (GNNs) learn to represent nodes by aggregating information from
their neighbors. As GNNs increase in depth, their receptive field grows exponentially, leading
to high memory costs. Several works in the literature proposed to address this shortcoming by
sampling subgraphs, or by using historical embeddings. These methods have mostly focused
on benchmarks of single-label node classification on homophilous graphs, where neighboring
nodes often share the same label. However, most of these methods rely on static heuristics
that may not generalize across different graphs or tasks. We argue that the sampling method
should be adaptive, adjusting to the complex structural properties of each graph. To this
end, we introduce GRAPES, an adaptive sampling method that learns to identify the set of
nodes crucial for training a GNN. GRAPES trains a second GNN to predict node sampling
probabilities by optimizing the downstream task objective. We evaluate GRAPES on various
node classification benchmarks involving homophilous as well as heterophilous graphs. We
demonstrate GRAPES’ effectiveness in accuracy and scalability, particularly in multi-label
heterophilous graphs. Additionally, GRAPES uses orders of magnitude less GPU memory
than a strong baseline based on historical embeddings. Unlike other sampling methods,
GRAPES maintains high accuracy even with smaller sample sizes and, therefore, can scale
to massive graphs. Our implementation is publicly available online.

URL: https://openreview.net/forum?id=QI0l842vSq

---

Title: Universal Link Predictor By In-Context Learning on Graphs

Abstract: Link prediction is a crucial task in graph machine learning, where the goal is to infer missing or future links within a graph. Traditional approaches leverage heuristic methods based on widely observed connectivity patterns, offering broad applicability and generalizability without the need for model training. Despite their utility, these methods are limited by their reliance on human-derived heuristics and lack the adaptability of data-driven approaches. Conversely, parametric link predictors excel in automatically learning the connectivity patterns from data and achieving state-of-the-art but fail short to directly transfer across different graphs. Instead, it requires the cost of extensive training and hyperparameter optimization to adapt to the target graph. In this work, we introduce the Universal Link Predictor (UniLP), a novel model that combines the generalizability of heuristic approaches with the pattern learning capabilities of parametric models. UniLP is designed to autonomously identify connectivity patterns across diverse graphs, ready for immediate application to any unseen graph dataset without targeted training. We address the challenge of conflicting connectivity patterns—arising from the unique distributions of different graphs—through the implementation of In-context Learning (ICL). This approach allows UniLP to dynamically adjust to various target graphs based on contextual demonstrations, thereby avoiding negative transfer. Through rigorous experimentation, we demonstrate UniLP's effectiveness in adapting to new, unseen graphs at test time, showcasing its ability to perform comparably or even outperform parametric models that have been finetuned for specific datasets. Our findings highlight UniLP's potential to set a new standard in link prediction, combining the strengths of heuristic and parametric methods in a single, versatile framework.

URL: https://openreview.net/forum?id=EYpqmoejB8

---

Title: xVal: A Continuous Numerical Tokenization for Scientific Language Models

Abstract: Due in part to their discontinuous and discrete default encodings for numbers, Large Language Models (LLMs) have not yet been commonly used to process numerically-dense scientific datasets. Rendering datasets as text, however, could help aggregate diverse and multi-modal scientific data into a single training corpus, thereby potentially facilitating the development of foundation models for science. In this work, we introduce xVal, a strategy for continuously tokenizing numbers within language models that results in a more appropriate inductive bias for scientific applications. By training specially-modified language models from scratch on a variety of scientific datasets formatted as text, we find that xVal generally outperforms other common numerical tokenization strategies on metrics including out-of-distribution generalization and computational efficiency.

URL: https://openreview.net/forum?id=kSPHESKMve

---

Title: Stabilizing the Kumaraswamy Distribution

Abstract: Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the log-pdf, CDF, and inverse CDF, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models to improve exploration-exploitation trade-offs in contextual multi-armed bandits and enhance uncertainty quantification for link prediction with graph neural networks. We find these models to be most performant when paired with the stable KS. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.

URL: https://openreview.net/forum?id=baZLwdphqw

---

Title: Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

Abstract: Generative models achieve remarkable results in multiple data domains, including images and texts, among other examples. Unfortunately, malicious users exploit synthetic media for spreading misinformation and disseminating deepfakes. Consequently, the need for robust and stable fake detectors is pressing, especially when new generative models appear everyday. While the majority of existing work train classifiers that discriminate between real and fake information, such tools typically generalize only within the same family of generators and data modalities, yielding poor results on other generative classes and data domains. Towards a “universal” classifier, we propose the use of large pre-trained multi-modal models for the detection of generative content. Effectively, we show that the latent code of these models naturally captures information discriminating real from fake. Building on this observation, we demonstrate that linear classifiers trained on these features can achieve state-of-the-art results across various modalities, while remaining computationally efficient, fast to train, and effective even in few-shot settings. Our work primarily focuses on fake detection in audio and images, achieving performance that surpasses or matches that of strong baseline methods.

URL: https://openreview.net/forum?id=jnYGxN9ZAe

---

Title: SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Abstract: Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, an algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM offers improved robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM.

URL: https://openreview.net/forum?id=laPAh2hRFC

---

Title: Statistical Error Bounds for GANs with Nonlinear Objective Functionals

Abstract: Generative adversarial networks (GANs) are unsupervised learning methods for training a generator distribution to produce samples that approximate those drawn from a target distribution. Many such methods can be formulated as minimization of a metric or divergence between probability distributions. Recent works have derived statistical error bounds for GANs that are based on integral probability metrics (IPMs), e.g., WGAN which is based on the 1-Wasserstein metric. In general, IPMs are defined by optimizing a linear functional (difference of expectations) over a space of discriminators. A much larger class of GANs, which we here call $(f,\Gamma)$-GANs, can be constructed using $f$-divergences (e.g., Jensen-Shannon, KL, or $\alpha$-divergences) together with a regularizing discriminator space $\Gamma$ (e.g., $1$-Lipschitz functions). These GANs have nonlinear objective functions, depending on the choice of $f$, and have been shown to exhibit improved performance in a number of applications. In this work we derive statistical error bounds for $(f,\Gamma)$-GANs for general classes of $f$ and $\Gamma$ in the form of finite-sample concentration inequalities. These results prove the statistical consistency of $(f,\Gamma)$-GANs and reduce to the known results for IPM-GANs in the appropriate limit. Finally, our results also give new insight into the performance of GANs for distributions with unbounded support.

URL: https://openreview.net/forum?id=ZgjhykPSdU

---

Title: Bayesian Learning-driven Prototypical Contrastive Loss for Class-Incremental Learning

Abstract: The primary objective of methods in continual learning is to learn tasks in a sequential manner over time (sometimes from a stream of data), while mitigating the detrimental phenomenon of catastrophic forgetting. This paper proposes a method to learn an effective representation between previous and newly encountered class prototypes. We propose a prototypical network with a Bayesian learning-driven contrastive loss (BLCL), tailored specifically for class-incremental learning scenarios. We introduce a contrastive loss that incorporates novel classes into the latent representation by reducing intra-class and increasing inter-class distance. Our approach dynamically adapts the balance between the cross-entropy and contrastive loss functions with a Bayesian learning technique. Experimental results conducted on both the CIFAR-10 and CIFAR-100 dataset for image classification and images of a GNSS-based dataset for interference classification validate the efficacy of our method, showcasing its superiority over existing state-of-the-art approaches.

URL: https://openreview.net/forum?id=dNWaTuKV9M

---

Title: An Empirical Study on Information Extraction using Large Language Models

Abstract: Human-like large language models (LLMs), especially the most powerful and popular ones in OpenAI's GPT family, have proven to be very helpful for many natural language processing (NLP) related tasks. Therefore, various attempts have been made to apply LLMs to information extraction (IE), which is a fundamental NLP task that involves extracting information from unstructured plain text. To demonstrate the latest representative progress in LLMs' information extraction ability, we assess the information extraction ability of GPT-4 (the latest version of GPT at the time of writing this paper) from four perspectives: Performance, Evaluation Criteria, Robustness, and Error Types. Our results suggest a visible performance gap between GPT-4 and state-of-the-art (SOTA) IE methods. To alleviate this problem, considering the LLMs' human-like characteristics, we propose and analyze the effects of a series of simple prompt-based methods, which can be generalized to other LLMs and NLP tasks. Rich experiments show our methods' effectiveness and some of their remaining issues in improving GPT-4's information extraction ability.

URL: https://openreview.net/forum?id=bC03x33q0o

---

Title: A Vector Bernstein Inequality for Self-Normalized Martingales

Abstract: We prove a Bernstein inequality for vector-valued self-normalized martingales. We first give an alternative perspective of the corresponding sub-Gaussian bound due to Abbasi-Yadkori et al. via a PAC-Bayesian argument with Gaussian priors. By instantiating this argument to priors drawn uniformly over well-chosen ellipsoids, we obtain a Bernstein bound.

URL: https://openreview.net/forum?id=4ZJjr9YbBw

---

Title: Learning General Representation of 12-Lead Electrocardiogram With a Joint-Embedding Predictive Architecture

Abstract: Electrocardiogram (ECG) captures the heart's electrical signals, offering valuable information for diagnosing cardiac conditions. However, the scarcity of labeled data makes it challenging to fully leverage supervised learning in medical domain. Self-supervised learning (SSL) offers a promising solution, enabling models to learn from unlabeled data and uncover meaningful patterns. In this paper, we show that masked modeling in the latent space can be a powerful alternative to existing self-supervised methods in the ECG domain. We introduce ECG-JEPA, a SSL model for 12-lead ECG analysis that learns semantic representations of ECG data by predicting in the hidden latent space, bypassing the need to reconstruct raw signals. This approach offers several advantages in the ECG domain: (1) it avoids producing unnecessary details, such as noise, which is common in ECG; and (2) it addresses the limitations of naïve L2 loss between raw signals. Another key contribution is the introduction of Cross-Pattern Attention (CroPA), a specialized masked attention mechanism tailored for 12-lead ECG data. ECG-JEPA is trained on the union of several open ECG datasets, totaling approximately 180,000 samples, and achieves state-of-the-art performance in various downstream tasks including ECG classification and feature prediction.

URL: https://openreview.net/forum?id=dkemJJT2Y2

---

Title: Network Depth and Inductive Bias in Convolutional Neural Networks

Abstract: Understanding the remarkable generalization abilities of Deep Learning systems remains one of the significant scientific challenges of our time. It is widely accepted that the success of DNNs stems, at least partially, from having many hidden layers. However, the benefits of such depth are not universal. We introduce a simple experimental paradigm that demonstrates the contrasts between CNNs and MLPs in this respect. This paradigm demonstrates the ability of contemporary architectures to leverage deep, multi-layered structures to systematically improve model generalization ability. However, this conflicts with statistical learning theory (SLT) and its key concept of the bias-variance tradeoff. Therefore, we present an alternative framework to understand the relationship between network architecture and generalization by viewing classifiers as maps between different metric spaces. Through comparative analysis, we uncover how deeper networks develop a bias towards smoother input representations; and that the inductive bias responsible for superior generalization in deep CNNs is distinct from the standard “minimal complexity” (Occam’s razor) that is the focus of SLT.

URL: https://openreview.net/forum?id=XelbRjhn8R

---

Reply all

Reply to author

Forward

0 new messages