Survey Certification: A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law
Zhiyu Chen, Jing Ma, Xinlu Zhang, Nan Hao, An Yan, Armineh Nourbakhsh, Xianjun Yang, Julian McAuley, Linda Ruth Petzold, William Yang Wang
https://openreview.net/forum?id=upAWnMgpnH
---
Accepted papers
===============
Title: Graphon-Explainer: Generating Model-Level Explanations for Graph Neural Networks using Graphons
Authors: Sayan Saha, Sanghamitra Bandyopadhyay
Abstract: Graph Neural Networks (GNNs) form the backbone of several state-of-the-art methods for performing machine learning tasks on graphs. As GNNs find application across diverse real-world scenarios, ensuring their interpretability and reliability becomes imperative. In this paper, we propose Graphon-Explainer, a model-level explanation method to elucidate the high-level decision-making process of a GNN. Graphon-Explainer learns a graphon—a symmetric, continuous function viewed as a weighted adjacency matrix of an infinitely large graph—to approximate the distribution of a target class as learned by the GNN. The learned graphon then acts as a generative model, yielding distinct graph motifs deemed significant by the GNN for the target class. Unlike existing model-level explanation methods for GNNs, which are limited to explaining a GNN for individual target classes, Graphon-Explainer can also generate synthetic graphs close to the decision boundary between two target classes by interpolating graphons of both classes, aiding in characterizing the GNN model’s decision boundary. Furthermore, Graphon-Explainer is model-agnostic, does not rely on additional black-box models, and does not require manually specified handcrafted constraints for explanation generation. The effectiveness of our method is validated through thorough theoretical analysis and extensive experimentation on both synthetic and real-world datasets on the task of graph classification. Results demonstrate its capability to effectively learn and generate diverse graph patterns identified by a trained GNN, thus enhancing its interpretability for end-users.
URL: https://openreview.net/forum?id=yHUtuvoIQv
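To make the central object concrete, here is a minimal sketch of how a graphon acts as a generative model, including the class-interpolation idea from the abstract. The graphons W0 and W1 below are hypothetical stand-ins, not ones learned by Graphon-Explainer:

```python
import numpy as np

def sample_graph(W, n, rng):
    """Sample an n-node graph from a graphon W: [0,1]^2 -> [0,1]."""
    u = rng.uniform(size=n)                      # latent node positions
    P = W(u[:, None], u[None, :])                # pairwise edge probabilities
    A = (rng.uniform(size=(n, n)) < P).astype(int)
    A = np.triu(A, 1)                            # keep upper triangle only
    return A + A.T                               # symmetrize, no self-loops

# Two hypothetical class graphons and an interpolation between them,
# mirroring the decision-boundary probing described in the abstract.
W0 = lambda x, y: 0.9 * np.exp(-3.0 * (x + y))           # core-periphery-like
W1 = lambda x, y: 0.3 + 0.5 * (np.abs(x - y) < 0.2)      # band/community-like
t = 0.5
W_mid = lambda x, y: (1 - t) * W0(x, y) + t * W1(x, y)

rng = np.random.default_rng(0)
A = sample_graph(W_mid, n=50, rng=rng)
print(A.sum() // 2, "edges")
```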
---
Title: A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law
Authors: Zhiyu Chen, Jing Ma, Xinlu Zhang, Nan Hao, An Yan, Armineh Nourbakhsh, Xianjun Yang, Julian McAuley, Linda Ruth Petzold, William Yang Wang
Abstract: In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics of LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be released and continually updated.
URL: https://openreview.net/forum?id=upAWnMgpnH
---
Title: Simple Steps to Success: A Method for Step-Based Counterfactual Explanations
Authors: Jenny Hamer, Nicholas Perello, Jason Valladares, Vignesh Viswanathan, Yair Zick
Abstract: Algorithmic recourse is a process that leverages counterfactual explanations, going beyond understanding why a system produced a given classification, to providing a user with actions they can take to change their predicted outcome. Existing approaches to compute such interventions---known as recourse---identify a set of points that satisfy some desiderata---e.g. an intervention in the underlying causal graph, minimizing a cost function, etc. Satisfying these criteria, however, requires extensive knowledge of the underlying model structure, an often unrealistic amount of information in several domains. We propose a data-driven and model-agnostic framework to compute counterfactual explanations. We introduce StEP, a computationally efficient method that offers incremental steps along the data manifold that direct users toward their desired outcome. We show that StEP uniquely satisfies a desirable set of axioms. Furthermore, via a thorough empirical and theoretical investigation, we show that StEP offers provable robustness and privacy guarantees while outperforming popular methods along important metrics.
URL: https://openreview.net/forum?id=R6ey5DKaoX
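The step-based idea can be pictured with a small sketch: repeatedly nudge the query toward its nearest positively classified neighbors until the prediction flips. This only illustrates incremental, data-manifold-guided recourse under a toy decision rule; it is not the StEP algorithm itself:

```python
import numpy as np

def step_recourse(x, X_pos, classifier, step_size=0.1, k=5, max_steps=50):
    """Move a query point in small increments toward its k nearest
    positively classified training points until the prediction flips."""
    path = [x.copy()]
    for _ in range(max_steps):
        if classifier(x) == 1:
            break
        d = np.linalg.norm(X_pos - x, axis=1)
        nbrs = X_pos[np.argsort(d)[:k]]          # stay near the data manifold
        direction = nbrs.mean(axis=0) - x
        x = x + step_size * direction / (np.linalg.norm(direction) + 1e-12)
        path.append(x.copy())
    return path

# Toy model: positive iff x0 + x1 > 1 (a hypothetical decision rule).
clf = lambda z: int(z[0] + z[1] > 1)
X_pos = np.random.default_rng(1).normal(loc=1.0, scale=0.3, size=(100, 2))
steps = step_recourse(np.array([0.0, 0.0]), X_pos, clf)
print(len(steps) - 1, "steps; final prediction:", clf(steps[-1]))
```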
---
Title: Autoencoding Hyperbolic Representation for Adversarial Generation
Authors: Eric Qu, Dongmian Zou
Abstract: With the recent advance of geometric deep learning, neural networks have been extensively used for data in non-Euclidean domains. In particular, hyperbolic neural networks have proved successful in processing hierarchical information of data. However, many hyperbolic neural networks are numerically unstable during training, which precludes using complex architectures. This crucial problem makes it difficult to build hyperbolic generative models for real and complex data. In this work, we propose a hyperbolic generative network in which we design novel architecture and layers to improve stability in training. Our proposed network contains a hyperbolic autoencoder (AE) that produces hyperbolic embedding for input data and a hyperbolic generative adversarial network (GAN) for generating the hyperbolic latent embedding of the AE from simple noise. Our generative pipeline composes the generator of the GAN with the decoder of the AE. Our architecture fosters expressive and numerically stable representation in the hyperbolic space. Theoretically, we validate the training of the GAN in hyperbolic space, and prove the stability of the hyperbolic layers used in the AE. Experiments show that our model is capable of generating tree-like graphs as well as complex molecular data with comparable structure-related performance.
URL: https://openreview.net/forum?id=NQi9U0YLW3
---
Title: Feature Alignment: Rethinking Efficient Active Learning via Proxy in the Context of Pre-trained Models
Authors: Ziting Wen, Oscar Pizarro, Stefan B. Williams
Abstract: Fine-tuning the pre-trained model with active learning holds promise for reducing annotation costs. However, this combination introduces significant computational costs, particularly with the growing scale of pre-trained models. Recent research has proposed proxy-based active learning, which pre-computes features to reduce computational costs. Yet, this approach often incurs a significant loss in active learning performance, sometimes outweighing the computational cost savings. This paper demonstrates that not all sample selection differences result in performance degradation. Furthermore, we show that suitable training methods can mitigate the decline of active learning performance caused by certain selection discrepancies. Building upon detailed analysis, we propose a novel method, aligned selection via proxy, which improves proxy-based active learning performance by updating pre-computed features and selecting a proper training method. Extensive experiments validate that our method reduces the total cost of active learning while maintaining computational efficiency.
URL: https://openreview.net/forum?id=PNcgJMJcdl
---
Title: Bandits with Mean Bounds
Authors: Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai
Abstract: We study a variant of the bandit problem where side information in the form of bounds on the mean of each arm is provided. We prove that these translate to tighter estimates of subgaussian factors and develop novel algorithms that exploit these estimates. In the linear setting, we present the Restricted-set OFUL (R-OFUL) algorithm that additionally uses the geometric properties of the problem to (potentially) restrict the set of arms being played and reduce exploration rates for suboptimal arms. In the stochastic case, we propose the non-optimistic Global Under-Explore (GLUE) algorithm which employs the inferred subgaussian estimates to adapt the rate of exploration for the arms. We analyze the regret of R-OFUL and GLUE, showing that our regret upper bounds are never worse than those of the standard OFUL and UCB algorithms, respectively. Further, we also consider a practically motivated setting of learning from confounded logs where mean bounds appear naturally.
URL: https://openreview.net/forum?id=4TZ4DE24fX
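One concrete way side information of this kind can enter a bandit algorithm is by clipping the optimistic index with the known upper bound on each arm's mean. The sketch below illustrates that clipping mechanism only; it is not the paper's R-OFUL or GLUE:

```python
import numpy as np

def ucb_with_mean_bounds(means, upper, T=5000, seed=0):
    """UCB where each arm's index is clipped by a known upper bound on its
    mean, curbing exploration of clearly suboptimal arms (illustrative)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.ones(K)
    sums = np.array([rng.normal(m, 1.0) for m in means])  # pull each arm once
    for t in range(K, T):
        idx = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        idx = np.minimum(idx, upper)          # side information clips the index
        a = int(np.argmax(idx))
        sums[a] += rng.normal(means[a], 1.0)
        counts[a] += 1
    return counts

true_means = np.array([0.9, 0.5, 0.2])
bounds = np.array([1.0, 0.55, 0.25])          # hypothetical known upper bounds
print(ucb_with_mean_bounds(true_means, bounds).astype(int))  # pull counts
```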
---
Title: Merging Text Transformer Models from Different Initializations
Authors: Neha Verma, Maha Elbayad
Abstract: Recent work on one-shot permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations.
However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain.
Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape.
The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class.
In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging for several models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark.
Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
URL: https://openreview.net/forum?id=nWnYSLncXa
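For readers new to permutation-based merging, here is a one-hidden-layer MLP sketch of the match-then-average recipe; Transformers require the additional interventions the abstract mentions (residual streams, attention heads, embeddings), so this is background, not the paper's method:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_merge(W1a, W2a, W1b, W2b):
    """Permute the hidden units of model B to best match model A, then
    average the weights. Both models compute y = W2 @ relu(W1 @ x)."""
    # similarity between hidden unit i of A and unit j of B, from both layers
    sim = W1a @ W1b.T + W2a.T @ W2b
    _, perm = linear_sum_assignment(-sim)      # maximize total similarity
    W1b_p, W2b_p = W1b[perm], W2b[:, perm]     # apply the permutation
    return 0.5 * (W1a + W1b_p), 0.5 * (W2a + W2b_p)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(12, 16)), rng.normal(size=(8, 12))
shuffle = rng.permutation(12)                  # model B = permuted copy of A
W1m, W2m = match_and_merge(W1, W2, W1[shuffle], W2[:, shuffle])
print(np.allclose(W1m, W1), np.allclose(W2m, W2))  # True, True if recovered
```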
---
Title: Deep-Graph-Sprints: Accelerated Representation Learning in Continuous-Time Dynamic Graphs
Authors: Ahmad Naser Eddin, Jacopo Bono, David Oliveira Aparicio, Hugo Ferreira, Pedro Manuel Pinto Ribeiro, Pedro Bizarro
Abstract: Continuous-time dynamic graphs (CTDGs) are essential for modeling interconnected, evolving systems. Traditional methods for extracting knowledge from these graphs often depend on feature engineering or deep learning. Feature engineering is limited by the manual and time-intensive nature of crafting features, while deep learning approaches suffer from high inference latency, making them impractical for real-time applications. This paper introduces Deep-Graph-Sprints (DGS), a novel deep learning architecture designed for efficient representation learning on CTDGs with low-latency inference requirements. We benchmark DGS against state-of-the-art (SOTA) feature engineering and graph neural network methods using five diverse datasets. The results indicate that DGS achieves competitive performance while improving inference speed by 4x to 12x over other deep learning approaches on our benchmark datasets. Our method effectively bridges the gap between deep representation learning and low-latency application requirements for CTDGs.
URL: https://openreview.net/forum?id=0uwe0z2Hqm
---
Title: Contrastive Learning with Adaptive Neighborhoods for Brain Age Prediction on 3D Stiffness Maps
Authors: Jakob Träuble, Lucy V Hiscox, Curtis Johnson, Carola-Bibiane Schönlieb, Gabriele S Kaminski Schierle, Angelica I Aviles-Rivero
Abstract: In the field of neuroimaging, accurate brain age prediction is pivotal for uncovering the complexities of brain aging and pinpointing early indicators of neurodegenerative conditions. Recent advancements in self-supervised learning, particularly in contrastive learning, have demonstrated greater robustness when dealing with complex datasets. However, current approaches often fall short in generalizing across non-uniformly distributed data, prevalent in medical imaging scenarios. To bridge this gap, we introduce a novel contrastive loss that adapts dynamically during the training process, focusing on the localized neighborhoods of samples. Moreover, we expand beyond traditional structural features by incorporating brain stiffness—a mechanical property previously underexplored yet promising due to its sensitivity to age-related changes. This work presents the first application of self-supervised learning to brain mechanical properties, using compiled stiffness maps from various clinical studies to predict brain age. Our approach, featuring dynamic localized loss, consistently outperforms existing state-of-the-art methods, demonstrating superior performance and paving the way for new directions in brain aging research.
URL: https://openreview.net/forum?id=oI2Tpd4tiP
---
Title: Heterogeneous graph adaptive flow network
Authors: Lu Yiqi, Feng Ji, Wee Peng Tay
Abstract: Many graphs or networks are heterogeneous by nature, involving various vertex types and relation types. Most graph learning models for heterogeneous graphs employ meta-paths to guide neighbor selections and extract composite relations. However, the use of meta-paths to generate relations between the same vertex types may result in directed edges and failure to fully utilize the other vertex or edge types in the data. To address such a limitation, we propose the Heterogeneous graph adaptive flow network (HetaFlow), which removes the need for meta-paths. HetaFlow decomposes the heterogeneous graph into flows and performs convolution across heterogeneous vertex and edge types, using an adaptation to change the vertex features based on the corresponding vertex and edge types during aggregation. Experiments on real-world datasets for vertex clustering and vertex classification demonstrate that HetaFlow outperforms other benchmark models and achieves state-of-the-art performance on commonly used benchmark datasets. The code is available at https://github.com/AnonymizedC/HetaFlow.
URL: https://openreview.net/forum?id=usvg3yhjAx
---
Title: MoCaE: Mixture of Calibrated Experts Significantly Improves Object Detection
Authors: Kemal Oksuz, Selim Kuzucu, Tom Joy, Puneet K. Dokania
Abstract: Combining the strengths of many existing predictors to obtain a Mixture of Experts which is superior to its individual components is an effective way to improve performance without having to develop new architectures or train a model from scratch. However, surprisingly, we find that naively combining off-the-shelf object detectors in a similar way to Deep Ensembles can often lead to degraded performance. We identify that the primary cause of this issue is that the predictions of the experts do not match their performance, a phenomenon referred to as miscalibration. Consequently, the most confident detector dominates the final predictions, preventing the mixture from leveraging all the predictions from the experts appropriately. To address this, when constructing the Mixture of Experts for object detection, we propose to combine their predictions in a manner which reflects the individual performance of the experts; an objective we achieve by first calibrating the predictions before filtering and refining them. We term this approach the Mixture of Calibrated Experts (MoCaE) and demonstrate its effectiveness through extensive experiments on 5 different detection tasks, showing that it: (i) improves object detectors on COCO and instance segmentation methods on LVIS by up to $\sim 2.5$ AP; (ii) reaches state-of-the-art on COCO test-dev with $65.1$ AP and on DOTA with $82.62$ $\mathrm{AP_{50}}$; (iii) outperforms single models consistently on recent detection tasks such as Open Vocabulary Object Detection. Code is available at: https://github.com/fiveai/MoCaE
URL: https://openreview.net/forum?id=fJEsas1z8J
---
Title: IM-Context: In-Context Learning for Imbalanced Regression Tasks
Authors: Ismail Nejjar, Faez Ahmed, Olga Fink
Abstract: Regression models often fail to generalize effectively in regions characterized by highly imbalanced label distributions. Previous methods for deep imbalanced regression rely on gradient-based weight updates, which tend to overfit in underrepresented regions. This paper proposes a paradigm shift towards in-context learning as an effective alternative to conventional in-weight learning methods, particularly for addressing imbalanced regression. In-context learning refers to the ability of a model to condition itself, given a prompt sequence composed of in-context samples (input-label pairs) alongside a new query input to generate predictions, without requiring any parameter updates. In this paper, we study the impact of the prompt sequence on the model performance from both theoretical and empirical perspectives. We emphasize the importance of localized context in reducing bias within regions of high imbalance. Empirical evaluations across a variety of real-world datasets demonstrate that in-context learning substantially outperforms existing in-weight learning methods in scenarios with high levels of imbalance.
URL: https://openreview.net/forum?id=p4Y844vJWG
---
Title: Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Authors: Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang
Abstract: Large models represent a groundbreaking advancement in multiple application fields, enabling remarkable achievements across various tasks. However, their unprecedented scale comes with significant computational costs. These models, often consisting of billions of parameters, require vast amounts of computational resources for execution. In particular, their expansive scale and computational demands pose considerable challenges when customizing them for specific downstream tasks, especially on hardware platforms with constrained computational capabilities.
Parameter-Efficient Fine-Tuning (PEFT) provides a practical solution by efficiently adapting large models to various downstream tasks. In particular, PEFT refers to the process of adjusting the parameters of a pre-trained large model to adapt it to a specific task or domain while minimizing the number of additional parameters introduced or computational resources required. This approach is particularly important when dealing with large-scale language models with high parameter counts, as fully fine-tuning these models can be computationally expensive and resource-intensive, posing considerable challenges for the supporting system platform design.
In this survey, we present comprehensive studies of various PEFT algorithms, examining their performance and computational overhead. Moreover, we provide an overview of applications developed using different PEFT algorithms and discuss common techniques employed to mitigate PEFT computation costs. In addition to providing an extensive survey from an algorithmic standpoint, we also examine various real-world system designs to investigate the implementation costs associated with different PEFT approaches. This survey serves as a valuable resource for researchers aiming to understand both the PEFT algorithm and its system implementation, offering detailed insights into recent advancements and practical applications.
URL: https://openreview.net/forum?id=lIsCS8b6zj
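As one concrete instance of the family of methods the survey covers, here is a minimal LoRA-style adapter in PyTorch; it is a generic sketch of the low-rank-update recipe, not tied to any specific implementation discussed in the survey:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: freeze the base weights, learn a rank-r
    additive update B @ A on top of the frozen linear map."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pre-trained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts at 0
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scaling

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")  # a few percent of total
```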
---
Title: Federated Graph Learning with Graphless Clients
Authors: Xingbo Fu, Song Wang, Yushun Dong, Binchi Zhang, Chen Chen, Jundong Li
Abstract: Federated graph learning is tasked with training machine learning models, such as Graph Neural Networks (GNNs), for multiple clients, each with its own graph data. Existing methods usually assume that each client has both node features and graph structure of its graph data. In real-world scenarios, however, there exist federated learning systems where only a part of the clients have such data, while the remaining clients (graphless clients) may only have node features. This naturally leads to a novel problem in federated graph learning: how to jointly train a model over distributed graph data with graphless clients? To tackle this problem, we propose a novel Federated Graph Structure Learning (FedGSL) framework in this paper. In FedGSL, we devise a local graph learner on each graphless client which learns the local graph structure with the structure knowledge transferred from other clients. To enable structure knowledge transfer, we design a GNN model and a feature encoder on each client. During local training, the feature encoder retains the local graph structure knowledge together with the GNN model via knowledge distillation, and the structure knowledge is transferred among clients in the global update. Our extensive experiments on five real-world graph datasets demonstrate the superiority of FedGSL over five other federated learning approaches.
URL: https://openreview.net/forum?id=mVAp0eDfyR
---
Title: One by One, Continual Coordinating with Humans via Hyper-Teammate Identification
Authors: Cong Guan, Feng Chen, Ke Xue, Chunpeng Fan, Lichao Zhang, Ziqian Zhang, Pengyao Zhao, Zongzhang Zhang, Chao Qian, Lei Yuan, Yang Yu
Abstract: One of the primary objectives in modern artificial intelligence research is to empower agents to effectively coordinate with diverse teammates, particularly human teammates. Previous studies focused on training agents either with a fixed population of pre-generated teammates or through the co-evolution of distinct populations of agents and teammates. However, it is challenging to enumerate all possible teammates in advance, and it is costly, or even impractical, to maintain a sufficiently diverse population and repeatedly interact with previously encountered teammates. Additional design considerations, such as prioritized sampling, are also required to ensure efficient training. To address these challenges and obtain an efficient human-AI coordination paradigm, we propose a novel approach called \textbf{Concord}. Considering that human participants tend to arrive sequentially, we model the training process with different teammates as a continual learning framework, akin to how humans learn and adapt in the real world. We propose a mechanism based on hyper-teammate identification to prevent catastrophic forgetting while promoting forward knowledge transfer. Concretely, we introduce a teammate recognition module that captures the identification of corresponding teammates. Leveraging the identification, a well-coordinated AI policy can be generated via the hyper-network. The entire framework is trained in a decomposed policy gradient manner, allowing for effective credit assignment among agents. This approach enables us to train agents with each generated teammate or human one by one, ensuring that agents can coordinate effectively with current teammates without forgetting previous knowledge. Our approach outperforms multiple baselines in various multi-agent benchmarks, either with generated human proxies or real human participants.
URL: https://openreview.net/forum?id=HVxumpoWBm
---
Title: PLUM: Improving Inference Efficiency By Leveraging Repetition-Sparsity Trade-Off
Authors: Sachit Kuhar, Yash Jain, Alexey Tumanov
Abstract: Efficient inference of Deep Neural Networks (DNNs) on resource-constrained edge devices is essential. Quantization and sparsity are key techniques that translate to repetition and sparsity within tensors at the hardware-software interface. This paper introduces the concept of the repetition-sparsity trade-off that helps explain computational efficiency during inference. We propose PLUM, a unified co-design framework that integrates DNN inference systems and quantization (forward and backward pass) to leverage the repetition-sparsity trade-off to improve inference efficiency. Our results demonstrate that PLUM’s quantization method is more accurate than binary quantization with the same number of non-zero weights. Detailed analysis indicates that signed binarization generates a smaller distribution of effectual (non-zero) parameters nested within a larger distribution of total parameters of latent full-precision weights for a DNN block. Finally, the proposed PLUM framework achieves a 26% speedup on real hardware, doubles energy efficiency, and reduces density by 2.8× compared to binary methods, while retaining top-1 accuracy when compared to prior-art methods for ResNets on ImageNet (achieving 66.2% top-1 accuracy), presenting an alternative solution for deploying efficient models in resource-limited environments.
URL: https://openreview.net/forum?id=IEKtMMSblm
---
Title: Global Convergence Guarantees for Federated Policy Gradient Methods with Adversaries
Authors: Swetha Ganesh, Jiayu Chen, Gugan Thoppe, Vaneet Aggarwal
Abstract: Federated Reinforcement Learning (FRL) allows multiple agents to collaboratively build a decision-making policy without sharing raw trajectories. However, if a small fraction of these agents are adversarial, it can lead to catastrophic results. We propose a policy gradient based approach that is robust to adversarial agents which can send arbitrary values to the server. Under this setting, our results provide the first global convergence guarantees with general parametrization. These results demonstrate resilience to adversaries, while achieving optimal sample complexity of order $\tilde{\mathcal{O}}\left( \frac{1}{N\epsilon^2} \left( 1+ \frac{f^2}{N}\right)\right)$, where $N$ is the total number of agents and $f < N/2$ is the number of adversarial agents.
URL: https://openreview.net/forum?id=Ea0LrPORzM
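A standard building block for tolerating f < N/2 arbitrarily corrupted reports is coordinate-wise median aggregation at the server; the sketch below shows that aggregation step only and may differ from the paper's exact robust estimator:

```python
import numpy as np

def robust_aggregate(gradients):
    """Coordinate-wise median of per-agent gradient reports: a classic
    aggregator that tolerates f < N/2 arbitrarily corrupted values."""
    return np.median(np.stack(gradients), axis=0)

rng = np.random.default_rng(0)
honest = [rng.normal(1.0, 0.1, size=4) for _ in range(8)]
adversarial = [np.full(4, 1e6) for _ in range(3)]   # f = 3 of N = 11 agents
print(robust_aggregate(honest + adversarial))        # stays near 1.0
```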
---
Title: The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models
Authors: Alexandra Hotti, Riccardo Sven Risuleo, Stefan Magureanu, Aref Moradi, Jens Lagergren
Abstract: Web automation holds the potential to revolutionize how users interact with the digital world, offering unparalleled assistance and simplifying tasks via sophisticated computational methods. Central to this evolution is the web element nomination task, which entails identifying unique elements on webpages. Unfortunately, the development of algorithmic designs for web automation is hampered by the scarcity of comprehensive and realistic datasets that reflect the complexity faced by real-world applications on the Web. To address this, we introduce the Klarna Product Page Dataset, a comprehensive and diverse collection of webpages that surpasses existing datasets in richness and variety. The dataset features 51,701 manually labeled product pages from 8,175 e-commerce websites across eight geographic regions, accompanied by a dataset of rendered page screenshots. To initiate research on the Klarna Product Page Dataset, we empirically benchmark a range of Graph Neural Networks (GNNs) on the web element nomination task. We make three important contributions. First, we find that a simple Convolutional GNN (GCN) outperforms complex state-of-the-art nomination methods, and further enhance its performance using a Reversible GNN (RevGNN) architecture. Second, we introduce a training refinement procedure that involves identifying a small number of relevant elements from each page using the aforementioned GNN. These elements are then passed to a Large Language Model for the final nomination. This procedure significantly improves the nomination accuracy by 10.9 percentage points on our challenging dataset, without any need for fine-tuning. Finally, in response to another prevalent challenge in this field – the abundance of training methodologies suitable for element nomination – we introduce the Challenge Nomination Training Procedure, a training method that further boosts nomination accuracy.
URL: https://openreview.net/forum?id=zz6FesdDbB
---
Title: Towards Provable Log Density Policy Gradient
Authors: Pulkit Katdare, Anant A Joshi, Katherine Rose Driggs-Campbell
Abstract: Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Modern policy gradient methods, although successful, introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and correcting for it could potentially improve the sample complexity of reinforcement learning methods. To that end, we propose the log density gradient to estimate the policy gradient, which corrects for this residual error term. The log density gradient method computes the policy gradient using the discounted state-action distribution formulation. We first present the equations needed to exactly find the log density gradient for tabular Markov Decision Processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates the log density gradient by utilizing backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive, we also propose a min-max optimization that can approximate the log density gradient using just on-policy samples. We also prove uniqueness, and convergence under linear function approximation, for this min-max optimization. Finally, we show the sample complexity of our min-max optimization to be of the order of $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof-of-concept for our log density gradient method on a gridworld environment, and observe that our method improves upon the classical policy gradient method by a clear margin, thus indicating a promising novel direction for developing reinforcement learning algorithms that require fewer samples.
URL: https://openreview.net/forum?id=qIWazsRaTR
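The discounted state-action distribution underlying this formulation can be computed in closed form for tabular MDPs; a small sketch with hypothetical MDP numbers:

```python
import numpy as np

def discounted_occupancy(P, pi, mu, gamma):
    """Exact discounted state-action occupancy for a tabular MDP:
    d_s = (1 - gamma) * mu^T (I - gamma * P_pi)^{-1}, then
    d(s, a) = d_s(s) * pi(a | s). This is the distributional object the
    log density gradient formulation builds on."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)        # state kernel under pi
    d_s = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, mu)
    return d_s[:, None] * pi                     # shape (S, A)

# A tiny 2-state, 2-action MDP (hypothetical numbers); P[s, a, s'].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3], [0.4, 0.6]])
d = discounted_occupancy(P, pi, mu=np.array([1.0, 0.0]), gamma=0.95)
print(d, d.sum())                                # occupancies sum to 1
```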
---
Title: Practical Synthesis of Mixed-Tailed Data with Normalizing Flows
Authors: Saba Amiri, Eric Nalisnick, Adam Belloum, Sander Klous, Leon Gommans
Abstract: Capturing the correct tail behavior is difficult, yet essential for a faithful generative model. In this work, we provide an improved framework for training flow-based models with robust capabilities to capture the tail behavior of mixed-tail data. We propose a combination of a tail-flexible base distribution and a robust training algorithm to enable the flow to model heterogeneous tail behavior in the target distribution. We support our claim with extensive experiments on synthetic and real-world data.
URL: https://openreview.net/forum?id=uphsKDj0Uu
---
Title: Single-Shot Plug-and-Play Methods for Inverse Problems
Authors: Yanqi Cheng, Lipei Zhang, Zhenda Shen, Shujun Wang, Lequan Yu, Raymond H. Chan, Carola-Bibiane Schönlieb, Angelica I Aviles-Rivero
Abstract: The utilisation of Plug-and-Play (PnP) priors in inverse problems has become increasingly prominent in recent years. This preference is based on the mathematical equivalence between the general proximal operator and the regularised denoiser, facilitating the adaptation of various off-the-shelf denoiser priors to a wide range of inverse problems. However, existing PnP models predominantly rely on pre-trained denoisers using large datasets. In this work, we introduce Single-Shot PnP methods (SS-PnP), shifting the focus to solving inverse problems with minimal data. First, we integrate Single-Shot proximal denoisers into iterative methods, enabling training with single instances. Second, we propose implicit neural priors based on a novel function that preserves relevant frequencies to capture fine details while avoiding the issue of vanishing gradients. We demonstrate, through extensive numerical and visual experiments, that our method leads to better approximations.
URL: https://openreview.net/forum?id=vXevE43NxF
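The generic PnP iteration the abstract builds on alternates a gradient step on the data term with an off-the-shelf denoiser in place of the proximal operator. A minimal sketch with a Gaussian-blur denoiser standing in for a learned one; SS-PnP's single-instance training and implicit neural prior are not shown:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pnp_ista(A, y, shape, denoise, step, iters=100):
    """Plug-and-Play proximal gradient: gradient step on ||A x - y||^2,
    then a denoiser acting as the prior's proximal operator."""
    x = np.zeros(shape)
    for _ in range(iters):
        grad = A.T @ (A @ x.ravel() - y)
        x = denoise((x.ravel() - step * grad).reshape(shape))
    return x

rng = np.random.default_rng(0)
x_true = gaussian_filter(rng.normal(size=(32, 32)), 2.0)   # smooth test image
A = rng.normal(size=(512, 32 * 32)) / np.sqrt(512)         # compressive measurements
y = A @ x_true.ravel()
x_hat = pnp_ista(A, y, (32, 32), lambda z: gaussian_filter(z, 1.0), step=0.1)
print("rel. error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```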
---
Title: Hashing with Uncertainty Quantification via Sampling-based Hypothesis Testing
Authors: Yucheng Wang, Mingyuan Zhou, Xiaoning Qian
Abstract: To quantify different types of uncertainty when deriving hash-codes for image retrieval, we develop a probabilistic hashing model (ProbHash). Sampling-based hypothesis testing is then derived for hashing with uncertainty quantification (HashUQ) in ProbHash to improve the granularity of hashing-based retrieval by prioritizing the data with confident hash-codes. HashUQ can drastically improve the retrieval performance without sacrificing computational efficiency. For efficient deployment of HashUQ in real-world applications, we discretize the quantified uncertainty to reduce the potential storage overhead. Experimental results show that our HashUQ can achieve state-of-the-art retrieval performance on three image datasets. Ablation experiments on model hyperparameters, different model components, and effects of UQ are also provided with performance comparisons. Our code is available at https://github.com/QianLab/HashUQ.
URL: https://openreview.net/forum?id=cc4v6v310f
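Sampling-based uncertainty for hash codes can be pictured as follows: draw many codes from the model's per-bit probabilities and measure per-bit disagreement. The per-bit probabilities below are hypothetical model outputs, and HashUQ's actual hypothesis tests are defined in the paper:

```python
import numpy as np

def hash_uncertainty(p, n_samples=200, seed=0):
    """Sample hash codes from per-bit probabilities p = P(bit = 1) and
    report per-bit disagreement across the samples (0 = confident)."""
    rng = np.random.default_rng(seed)
    codes = (rng.uniform(size=(n_samples, len(p))) < p).astype(int)
    freq = codes.mean(0)
    bit_uncertainty = np.minimum(freq, 1 - freq)   # in [0, 0.5]
    return codes[0], bit_uncertainty

code, unc = hash_uncertainty(np.array([0.99, 0.55, 0.03, 0.48]))
print("code:", code, "per-bit uncertainty:", unc.round(2))
# Bits near probability 0.5 are unreliable; retrieval can prioritize
# queries and database entries with confident codes.
```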
---
Title: Adaptively Robust and Sparse $K$-means Clustering
Authors: HAO LI, Shonosuke Sugasawa, Shota Katayama
Abstract: While $K$-means is known to be a standard clustering algorithm, its performance may be compromised due to the presence of outliers and high-dimensional noisy variables. This paper proposes adaptively robust and sparse $K$-means clustering (ARSK) to address these practical limitations of the standard $K$-means algorithm. For robustness, we introduce a redundant error component for each observation, and this additional parameter is penalized using a group sparse penalty. To accommodate the impact of high-dimensional noisy variables, the objective function is modified by incorporating weights and implementing a penalty to control the sparsity of the weight vector. The tuning parameters to control the robustness and sparsity are selected by Gap statistics. Through simulation experiments and real data analysis, we demonstrate the proposed method's superiority to existing algorithms in simultaneously identifying clusters free of outliers and selecting informative variables.
URL: https://openreview.net/forum?id=EhC84fT2yA
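The robustness mechanism, a per-observation error vector under a group-sparse penalty, has a closed-form update via block soft-thresholding; a sketch of that single step (not the full ARSK procedure):

```python
import numpy as np

def update_outlier_terms(X, centers, labels, lam):
    """Closed-form update of per-observation error vectors e_i: minimizing
    0.5 * ||r_i - e_i||^2 + lam * ||e_i||_2 over e_i block-soft-thresholds
    the residual r_i. Points whose residual norm exceeds lam keep a nonzero
    e_i, i.e. are flagged as outliers."""
    R = X - centers[labels]                       # residuals to assigned centers
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return R * scale                              # e_i = 0 for well-fit points

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), [[6.0, 6.0]]])   # one gross outlier
centers = np.array([[0.0, 0.0]])
E = update_outlier_terms(X, centers, np.zeros(len(X), dtype=int), lam=2.0)
print("nonzero error vectors:", int((np.linalg.norm(E, axis=1) > 0).sum()))
```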
---
Title: Modular Quantization-Aware Training for 6D Object Pose Estimation
Authors: Saqib Javed, Chengkun Li, Andrew Lawrence Price, Yinlin Hu, Mathieu Salzmann
Abstract: Edge applications, such as collaborative robotics and spacecraft rendezvous, demand efficient 6D object pose estimation on resource-constrained embedded platforms. Existing 6D object pose estimation networks are often too large for such deployments, necessitating compression while maintaining reliable performance. To address this challenge, we introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy that exploits the modular structure of modern 6D object pose estimation architectures. MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques. Our experiments showcase the generality of MQAT across datasets, architectures, and quantization algorithms. Additionally, we observe that MQAT quantized models can achieve an accuracy boost (>7% ADI-0.1d) over the baseline full-precision network while reducing model size by a factor of 4x or more.
Project Page: https://saqibjaved1.github.io/MQAT_
URL: https://openreview.net/forum?id=lIy0TEUou7
---
Title: SASSL: Enhancing Self-Supervised Learning via Neural Style Transfer
Authors: Renan A. Rojas-Gomez, Karan Singhal, Ali Etemad, Alex Bijamov, Warren Richard Morningstar, Philip Andrew Mansfield
Abstract: Existing data augmentation in self-supervised learning, while diverse, fails to preserve the inherent structure of natural images. This results in distorted augmented samples with compromised semantic information, ultimately impacting downstream performance. To overcome this limitation, we propose SASSL: Style Augmentations for Self Supervised Learning, a novel data augmentation technique based on Neural Style Transfer. SASSL decouples semantic and stylistic attributes in images and applies transformations exclusively to their style while preserving content, generating diverse samples that better retain semantic information. SASSL boosts top-1 image classification accuracy on ImageNet by up to 2 percentage points compared to established self-supervised methods like MoCo, SimCLR, and BYOL, while achieving superior transfer learning performance across various datasets. Because SASSL can be performed asynchronously as part of the data augmentation pipeline, these performance impacts can be obtained with no change in pretraining throughput.
URL: https://openreview.net/forum?id=NxhXtkPYsk
---
Title: Analyzing the Impact of Learnable Softmax Temperature in Contrastive Visual-Textual Alignment Systems: Benefits, Drawbacks, and Alternative Approaches
Authors: Zhun Sun, Chao Li
Abstract: This work does NOT read like “fabricate motivation - propose something - obtain sota results”. Instead, we provide an in-depth analysis of the learnable softmax temperature parameter in the practical training of contrastive visual-textual alignment models, commonly known as CLIP models. This parameter is critical for optimal system performance, yet its mechanism and potential drawbacks have been largely overlooked. Our study addresses this gap and proposes a novel solution by utilizing the architecture of Vision Transformers (ViTs). We focus on the crucial role of the softmax temperature in managing noisy training data. We demonstrate that there is a balance in the gradient of the contrastive loss, with the temperature parameter acting as a distance scaling factor. If not properly calibrated, the model struggles to align positive pairs due to numerical issues in the loss term. Conversely, a high temperature can lead to unstable learning dynamics. We explore alternative approaches to mitigate this problem from a topological perspective of the contrastive loss. Ultimately, we leverage multiple class tokens embedded within the transformer architecture to present a concise solution. This configuration significantly enhances zero-shot classification performance, improving baseline CLIP models pretrained on large-scale datasets by an average of 6.1%.
URL: https://openreview.net/forum?id=rx1QNhsNsK
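For reference, here is the standard learnable-temperature contrastive loss under analysis, in the usual CLIP parameterization with a clamped log logit scale; the multi-class-token remedy the abstract proposes is not shown:

```python
import torch
import torch.nn.functional as F

class CLIPLoss(torch.nn.Module):
    """Symmetric InfoNCE with a learnable temperature, parameterized as a
    log logit scale as in CLIP; the clamp guards learning stability."""
    def __init__(self, init_temp=0.07, max_scale=100.0):
        super().__init__()
        self.log_scale = torch.nn.Parameter(
            torch.log(torch.tensor(1.0 / init_temp)))
        self.max_scale = max_scale

    def forward(self, img, txt):
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)
        scale = self.log_scale.exp().clamp(max=self.max_scale)
        logits = scale * img @ txt.T       # temperature rescales cosine distances
        labels = torch.arange(len(img))
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.T, labels))

loss_fn = CLIPLoss()
img, txt = torch.randn(16, 64), torch.randn(16, 64)
print(loss_fn(img, txt).item(), "temperature:", 1.0 / loss_fn.log_scale.exp().item())
```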
---
New submissions
===============
Title: A Survey on Future Frame Synthesis: Bridging Deterministic and Generative Approaches
Abstract: Future Frame Synthesis (FFS) aims to enable models to generate sequences of future frames based on existing content. This survey comprehensively reviews historical and contemporary works in FFS, including widely used datasets and algorithms. It scrutinizes the challenges and the evolving landscape of FFS within computer vision, with a focus on the transition from deterministic to generative synthesis methodologies. Our taxonomy highlights the significant advancements and shifts in approach, underscoring the growing importance of generative models in achieving realistic and diverse future frame predictions.
URL: https://openreview.net/forum?id=QSUTq0plnW
---
Title: Conformalized Credal Regions for Classification with Ambiguous Ground Truth
Abstract: An open question in Imprecise Probabilistic Machine Learning is how to empirically derive a credal region (i.e., a closed and convex family of probabilities on the output space) from the available data, without any prior knowledge or assumption. In classification problems, credal regions are a tool that is able to provide provable guarantees under realistic assumptions by characterizing the uncertainty about the distribution of the labels. Building on previous work, we show that credal regions can be directly constructed using conformal methods. This allows us to provide a novel extension of classical conformal prediction to problems with ambiguous ground truth, that is, when the exact labels for given inputs are not known. The resulting construction enjoys desirable practical and theoretical properties: (i) conformal coverage guarantees, (ii) smaller prediction sets (compared to classical conformal prediction regions) and (iii) disentanglement of uncertainty sources (epistemic, aleatoric). We empirically verify our findings on both synthetic and real datasets.
URL: https://openreview.net/forum?id=L7sQ8CW2FY
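Here is the classical split-conformal construction the paper extends, sketched for classification with the usual 1 - p(true label) score; the credal-region and ambiguous-label extensions are the paper's contribution and are not shown:

```python
import numpy as np

def conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Split conformal prediction. Returns a label set per test point with
    marginal coverage >= 1 - alpha, assuming exchangeable data."""
    n = len(y_cal)
    scores = 1.0 - probs_cal[np.arange(n), y_cal]          # nonconformity
    level = np.ceil((n + 1) * (1 - alpha)) / n             # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return [np.where(1.0 - p <= q)[0] for p in probs_test]

rng = np.random.default_rng(0)
probs_cal = rng.dirichlet(np.ones(4) * 2, size=500)
y_cal = np.array([rng.choice(4, p=p) for p in probs_cal])  # labels follow probs
probs_test = rng.dirichlet(np.ones(4) * 2, size=3)
for s in conformal_sets(probs_cal, y_cal, probs_test):
    print("prediction set:", s)
```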
---
Title: Compressive Recovery of Signals Defined on Perturbed Graphs
Abstract: Recovery of signals whose elements are defined on the nodes of a graph from compressive measurements is an important problem, arising in various domains such as sensor networks, image reconstruction and group testing. In some scenarios, the graph may not be accurately known, and there may exist a few edge additions or deletions relative to a ground truth graph. Such perturbations, even if small in number, significantly affect the Graph Fourier Transform (GFT). This impedes recovery of signals which may have sparse representations in the GFT bases of the ground truth graph. We present an algorithm which simultaneously recovers the signal from the compressive measurements and also corrects the graph perturbations. We analyze some important theoretical properties of the algorithm. Our approach to correction for graph perturbations is based on model selection techniques such as cross-validation in compressed sensing. We validate our algorithm on signals which have a sparse representation in the GFT bases of many commonly used graphs in the network science literature. An application to compressive image reconstruction is also presented, where graph perturbations are modeled as undesirable graph edges linking pixels with significant intensity difference. In all experiments, our algorithm clearly outperforms baseline techniques which either ignore the perturbations or use first order approximations to the perturbations in the GFT bases.
URL: https://openreview.net/forum?id=nXULOmdO3b
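A small numerical illustration of the sensitivity being exploited: a signal that is exactly sparse in the GFT basis of a ring graph loses that sparsity once a single spurious edge is added (the graph and signal are hypothetical):

```python
import numpy as np

def gft_basis(A):
    """Graph Fourier basis = eigenvectors of the combinatorial Laplacian."""
    L = np.diag(A.sum(1)) - A
    _, U = np.linalg.eigh(L)
    return U

# Ring graph and a signal that is 2-sparse in its GFT basis.
n = 40
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
U = gft_basis(A)
x = U[:, 1] + 0.5 * U[:, 5]                     # sparse spectral content

# Perturb a single edge and re-expand the same signal.
A_pert = A.copy()
A_pert[0, n // 2] = A_pert[n // 2, 0] = 1       # one spurious edge
U_pert = gft_basis(A_pert)
for basis, name in [(U, "true graph"), (U_pert, "perturbed graph")]:
    c = basis.T @ x
    print(f"{name}: coefficients above 1e-6 = {(np.abs(c) > 1e-6).sum()}")
```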
---
Title: An Empirical Study of Pre-trained Model Selection for Out-of-Distribution Generalization and Calibration
Abstract: In out-of-distribution (OOD) generalization tasks, fine-tuning pre-trained models has become a prevalent strategy. Different from most prior work that has focused on advancing learning algorithms, we systematically examined how pre-trained model size, pre-training dataset size, and training strategies impact generalization and uncertainty calibration on downstream tasks. We evaluated 100 models across diverse pre-trained model sizes, five pre-training datasets, and five data augmentations through extensive experiments on four distribution shift datasets totaling over 120,000 GPU hours. Our results demonstrate the significant impact of pre-trained model selection, with optimal choices substantially improving OOD accuracy over algorithm improvement alone. Additionally, we find that larger models and bigger pre-training datasets not only enhance OOD performance but also improve calibration, helping to mitigate overconfidence, contrary to some prior studies that found modern deep networks to calibrate worse than classical shallow models. Our work underscores the overlooked importance of pre-trained model selection for out-of-distribution generalization and calibration.
URL: https://openreview.net/forum?id=tYjoHjShxF
---
Title: Where Do We Stand with Implicit Neural Representations? A Technical and Performance Survey
Abstract: Implicit Neural Representations (INRs) have emerged as a paradigm in knowledge representation, offering exceptional flexibility and performance across a diverse range of applications. INRs leverage multilayer perceptrons (MLPs) to model data as continuous implicit functions, providing critical advantages such as resolution independence, memory efficiency, and generalisation beyond discretised data structures. Their ability to solve complex inverse problems makes them particularly effective for tasks including audio reconstruction, image representation, 3D object reconstruction, and high-dimensional data synthesis. This survey provides a comprehensive review of state-of-the-art INR methods, introducing a clear taxonomy that categorises them into four key areas: activation functions, position encoding, combined strategies, and network structure optimisation. We rigorously analyse their critical properties—such as full differentiability, smoothness, compactness, and adaptability to varying resolutions—while also examining their strengths and limitations in addressing locality biases and capturing fine details. Our experimental comparison offers new insights into the trade-offs between different approaches, showcasing the capabilities and challenges of the latest INR techniques across various tasks. In addition to identifying areas where current methods excel, we highlight key limitations and potential avenues for improvement, such as developing more expressive activation functions, enhancing positional encoding mechanisms, and improving scalability for complex, high-dimensional data. This survey serves as a roadmap for researchers, offering practical guidance for future exploration in the field of INRs. We aim to foster new methodologies by outlining promising research directions for INRs and applications.
URL: https://openreview.net/forum?id=QTsJXSvAI2
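As a concrete anchor for the activation-function axis of the taxonomy, here is a minimal sine-activated INR (SIREN-style) fitted to a toy coordinate-to-value signal; the hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class Siren(nn.Module):
    """A minimal sine-activated INR mapping 2D coordinates to values."""
    def __init__(self, in_dim=2, hidden=64, out_dim=1, w0=30.0):
        super().__init__()
        self.w0 = w0
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, coords):
        h = torch.sin(self.w0 * self.fc1(coords))  # high-frequency first layer
        h = torch.sin(self.fc2(h))
        return self.out(h)

# Fit the INR to a toy 2D signal sampled on a coordinate grid.
grid = torch.stack(torch.meshgrid(torch.linspace(-1, 1, 32),
                                  torch.linspace(-1, 1, 32), indexing="ij"), -1)
coords = grid.reshape(-1, 2)
target = torch.sin(4 * coords[:, :1]) * torch.cos(4 * coords[:, 1:])
model = Siren()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    loss = ((model(coords) - target) ** 2).mean()
    loss.backward()
    opt.step()
print("final MSE:", loss.item())
```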
---
Title: Investigating the Effects of Fairness Interventions Using Pointwise Representational Similarity
Abstract: Machine learning (ML) algorithms can often exhibit discriminatory behavior, negatively affecting certain populations across protected groups. To address this, numerous debiasing methods, and consequently evaluation measures, have been proposed. Current evaluation measures for debiasing methods suffer from two main limitations: (1) they primarily provide a global estimate of unfairness, failing to provide a more fine-grained analysis, and (2) they predominantly analyze the model output on a specific task, failing to generalize the findings to other tasks. In this work, we introduce Pointwise Normalized Kernel Alignment (PNKA), a pointwise representational similarity measure that addresses these limitations by measuring how debiasing measures affect the intermediate representations of individuals. On tabular data, the use of PNKA reveals previously unknown insights: while group fairness predominantly influences a small subset of the population, maintaining high representational similarity for the majority, individual fairness constraints uniformly impact representations across the entire population, altering nearly every data point. We show that by evaluating representations using PNKA, we can reliably predict the behavior of ML models trained on these representations. Moreover, applying PNKA to language embeddings shows that existing debiasing methods may not perform as intended, failing to remove biases from stereotypical words and sentences. Our findings suggest that current evaluation measures for debiasing methods are insufficient, highlighting the need for a deeper understanding of the effects of debiasing methods, and show how pointwise representational similarity metrics can help with fairness audits.
URL: https://openreview.net/forum?id=CkVlt2Qgdb
---
Title: Unlabelled Compressive Sensing under Sparse Permutation and Prior Information
Abstract: In this paper, we study the problem of unlabelled compressed sensing, where the correspondence between the measurement values and the rows of the sensing matrix is lost, the number of measurements is less than the dimension of the regression vector, and the regression vector is sparse in the identity basis. Additionally, motivated by practical situations, we assume that we accurately know a small number of correspondences between the rows of the measurement matrix and the measurement vector. We propose a tractable estimator, based on a modified form of the \textsc{Lasso}, to estimate the regression vector, and we derive theoretical error bounds for the estimate. This is unlike previous approaches to unlabelled compressed sensing, which either do not produce theoretical bounds or produce bounds only for intractable estimators. We show that our algorithm outperforms a hard thresholding pursuit (\textsc{Htp}) approach and an $\ell_1$-norm estimator used to solve a similar problem across diverse regimes. We also propose a modified \textsc{Htp} based estimator which has superior properties to the baseline \textsc{Htp} estimator. Lastly, we show an application of unlabelled compressed sensing in image registration, demonstrating the utility of a few known point correspondences.
URL: https://openreview.net/forum?id=HaAg9RN7Hi
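The flavor of the estimator can be sketched with a plain proximal-gradient solver for a Lasso carrying an extra sparse corruption vector that absorbs mismatched measurements. This is an illustrative simplification; the paper's modified Lasso, its use of known correspondences, and its error bounds are more refined:

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-thresholding, the l1 proximal operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def permuted_lasso(A, y, lam_x, lam_e, iters=2000):
    """Proximal gradient on 0.5||y - A x - e||^2 + lam_x||x||_1 + lam_e||e||_1,
    where the sparse vector e models measurements displaced by a sparse
    permutation."""
    m, n = A.shape
    x, e = np.zeros(n), np.zeros(m)
    step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1.0)   # Lipschitz-safe step size
    for _ in range(iters):
        r = A @ x + e - y
        x = soft(x - step * (A.T @ r), step * lam_x)
        e = soft(e - step * r, step * lam_e)
    return x, e

rng = np.random.default_rng(0)
m, n, k = 60, 100, 5
A = rng.normal(size=(m, n)) / np.sqrt(m)
x0 = np.zeros(n); x0[rng.choice(n, k, replace=False)] = rng.normal(size=k) * 3
y = A @ x0
y[[3, 17]] = y[[17, 3]]                               # a sparse (2-row) permutation
x_hat, e_hat = permuted_lasso(A, y, lam_x=0.01, lam_e=0.05)
print("x error:", np.linalg.norm(x_hat - x0),
      "rows flagged as mismatched:", np.nonzero(np.abs(e_hat) > 1e-3)[0])
```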
---
Title: A Fundamental Accuracy--Robustness Trade-off in Regression and Classification
Abstract: We derive a fundamental trade-off between standard and adversarial risk in a rather general situation that formalizes the following simple intuition:
``If no (nearly) optimal predictor is smooth, adversarial robustness comes at the cost of accuracy.''
As a concrete example, we evaluate the derived trade-off in regression with polynomial ridge functions under mild regularity conditions.
URL: https://openreview.net/forum?id=8t0FihWuE2
---
Title: Neural Lattice Reduction: A Self-Supervised Geometric Deep Learning Approach
Abstract: Lattice reduction is a combinatorial optimization problem aimed at finding the most orthogonal basis in a given lattice. The Lenstra–Lenstra–Lovász (LLL) algorithm is the best algorithm in the literature for solving this problem. In light of recent research on algorithm discovery, in this work, we would like to answer this question: is it possible to parametrize the algorithm space for the lattice reduction problem with neural networks and find an algorithm without supervised data? Our strategy is to use equivariant and invariant parametrizations and train in a self-supervised way. We design a deep neural model outputting factorized unimodular matrices and train it in a self-supervised manner by penalizing non-orthogonal lattice bases. We incorporate the symmetries of lattice reduction into the model by making it invariant to isometries and scaling of the ambient space and equivariant with respect to the hyperoctahedral group permuting and flipping the lattice basis elements.
We show that this approach yields an algorithm with comparable complexity and performance to the LLL algorithm on a set of benchmarks. Additionally, motivated by certain applications for wireless communication, we extend our method to a convolutional architecture which performs joint reduction of spatially-correlated lattices arranged in a grid, thereby amortizing its cost over multiple lattices.
URL: https://openreview.net/forum?id=YxXyRSlZ4b
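The self-supervised signal, penalizing non-orthogonal bases, can be made concrete via the orthogonality defect, which equals 1 exactly when the basis is orthogonal; a small sketch in which the basis and unimodular matrix are arbitrary examples:

```python
import numpy as np

def orthogonality_defect(B):
    """Orthogonality defect of a lattice basis (rows of B):
    prod_i ||b_i|| / sqrt(det(B B^T)) >= 1, with equality iff the basis
    is orthogonal. Penalizing it is a natural self-supervised loss for
    learning lattice reduction."""
    norms = np.linalg.norm(B, axis=1)
    _, logdet = np.linalg.slogdet(B @ B.T)       # log det of the Gram matrix
    return float(np.exp(np.sum(np.log(norms)) - 0.5 * logdet))

rng = np.random.default_rng(0)
B = rng.integers(-9, 10, size=(4, 4)).astype(float)
U = np.eye(4); U[0, 1] = 3                        # a unimodular shear
print("defect(B):    ", round(orthogonality_defect(B), 3))
print("defect(U @ B):", round(orthogonality_defect(U @ B), 3))  # same lattice
```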
---
Title: Music Foundation Model as Generic Booster for Music Downstream Tasks
Abstract: We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
URL: https://openreview.net/forum?id=kHl4JzyNzF
---
Title: HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View
Abstract: We present HawkI for synthesizing aerial-view images from text and an exemplar image, without any additional multi-view or 3D information for finetuning or at inference. HawkI uses techniques from classical computer vision and information theory. It seamlessly blends the visual features from the input image within a pretrained text-to-2D-image stable diffusion model with a test-time optimization process for a careful bias-variance trade-off, which uses an Inverse Perspective Mapping (IPM) homography transformation to provide subtle cues for aerial-view synthesis. At inference, HawkI employs a unique mutual information guidance formulation to steer the generated image towards faithfully replicating the semantic details of the input image, while maintaining a realistic aerial perspective. Mutual information guidance maximizes the semantic consistency between the generated image and the input image, without enforcing pixel-level correspondence between vastly different viewpoints. Through extensive qualitative and quantitative comparisons against text + exemplar-image based methods and 3D/multi-view based novel-view synthesis methods on proposed synthetic and real datasets, we demonstrate that our method achieves a significantly better bias-variance trade-off towards generating high fidelity aerial-view images. Code and data will be made publicly available.
URL: https://openreview.net/forum?id=2DQWxUpYfV
---
Title: Identifying Axiomatic Mathematical Transformation Steps using Tree-Structured Pointer Networks
Abstract: The classification of mathematical relations has become a new area of research in deep learning. A major focus lies on determining mathematical equivalence, as this problem is expensive to solve with rule-based systems due to the large search space. While previous work has simply approached the task as binary classification without providing further insight into the underlying decision, we aim to iteratively find a sequence of necessary steps to transform a mathematical expression into an arbitrary equivalent form. Each step in this sequence is specified by an axiom together with its position of application. We denote this task the Stepwise Equation Transformation Identification (SETI) task. To solve the task efficiently, we further propose TreePointerNet, a novel architecture which exploits the inherent tree structure of mathematical equations and consists of three key building blocks: (i) a transformer model tailored to work on hierarchically tree-structured equations, making use of (ii) a copy-pointer mechanism to extract the exact location of a transformation in the tree, and finally (iii) custom embeddings that map distinguishable occurrences of the same token type to a common embedding. In addition, we introduce new datasets of equations for the SETI task. We benchmark our model against various baselines and perform an ablation study to quantify the influence of our custom embeddings and the copy-pointer component. Furthermore, we test the robustness of our model on data of unseen complexity. Our results clearly show that incorporating the hierarchical structure, embeddings, and copy-pointer into a single model is highly beneficial for solving the SETI task.
URL: https://openreview.net/forum?id=gLQ801ewwp
---
Title: DivIL: Unveiling and Addressing Over-Invariance for Out-of- Distribution Generalization
Abstract: Out-of-distribution generalization is a common problem that expects a model to perform well on distributions different from, and possibly far from, the training data. A popular approach to addressing this issue is invariant learning (IL), in which the model is compelled to focus on invariant features instead of spurious features by adding strong constraints during training. However, there are some potential pitfalls of strong invariant constraints. Due to the limited number of diverse environments and over-regularization in the feature space, it may lead to a loss of important details in the invariant features while alleviating the spurious correlations, namely the over-invariance, which can also degrade the generalization performance. We theoretically define the over-invariance and observe that this issue occurs in various classic IL methods. To alleviate this issue, we propose a simple approach, Diverse Invariant Learning (DivIL), which adds unsupervised contrastive learning and a random masking mechanism to compensate for the invariant constraints, and which can be applied to various IL methods. Furthermore, we conduct experiments across multiple modalities on 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of our DivIL framework. Our code is available at https://anonymous.4open.science/r/DivGIL-B68F/.
URL: https://openreview.net/forum?id=2Zan4ATYsh
---
Title: STC-ViT: Spatio-Temporal Continuous Vision Transformer for Weather Forecasting
Abstract: Operational weather forecasting systems rely on computationally expensive physics-based models. Recently, transformer-based models have shown remarkable potential in weather forecasting, achieving state-of-the-art results. However, transformers are discrete and physics-agnostic models, which limits their ability to learn the continuous spatio-temporal features of the dynamical weather system. We address this issue with STC-ViT, a Spatio-Temporal Continuous Vision Transformer for weather forecasting. STC-ViT incorporates continuous-time Neural ODE layers with a multi-head attention mechanism to learn the continuous weather evolution over time. The attention mechanism is encoded as a differentiable function in the transformer architecture to model the complex weather dynamics. Further, we define a customised physics-informed loss for STC-ViT which penalizes the model's predictions for deviating from physical laws. We evaluate STC-ViT against an operational Numerical Weather Prediction (NWP) model and several deep learning based weather forecasting models. STC-ViT, trained on 1.5-degree 6-hourly data, demonstrates computational efficiency and competitive performance compared to state-of-the-art data-driven models trained on higher-resolution data for global forecasting.
URL: https://openreview.net/forum?id=foib4M4UXm
---
Title: Effective Subset Selection Through The Lens of Neural Network Pruning
Abstract: The availability of large amounts of annotated data significantly improves the effectiveness of deep neural networks. However, annotation can be very expensive in some domains, such as medical data. Thus, it is important to select the data to be annotated wisely, which is known as the subset selection problem.
We investigate and establish a relationship between subset selection and neural network pruning, which is more widely studied.
Leveraging insights from network pruning, we propose utilizing the norm criterion of neural network features to improve subset selection methods. We empirically validate our proposed strategy on various networks and datasets, demonstrating enhanced accuracy.
This shows the potential of employing pruning tools for subset selection.
URL: https://openreview.net/forum?id=ZNosoXScyp
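A minimal sketch, under assumptions, of the norm criterion the abstract borrows from pruning: rank unlabeled examples by the norm of their penultimate-layer features and send the highest-norm ones for annotation first. The loader format and the assumption that the model returns feature vectors are illustrative.

```python
# Hypothetical norm-based subset selection for an annotation budget.
import torch

def select_by_feature_norm(model, pool_loader, budget, device="cpu"):
    model.eval().to(device)
    norms, indices = [], []
    with torch.no_grad():
        for idx, x in pool_loader:           # assumed: loader yields (index, image)
            feats = model(x.to(device))      # assumed: model returns (B, d) features
            norms.append(feats.norm(dim=1).cpu())
            indices.append(idx)
    norms, indices = torch.cat(norms), torch.cat(indices)
    order = norms.argsort(descending=True)   # largest feature norms first
    return indices[order[:budget]]           # indices to send for annotation
```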
---
Title: Kernel Orthogonality does not necessarily imply a Decrease in Feature Map Redundancy in CNNs: Convolutional Similarity Minimization
Abstract: Convolutional Neural Networks (CNNs) have been heavily used in Deep Learning due to their success in various tasks. Nonetheless, it has been observed that CNNs suffer from redundancy in feature maps, leading to inefficient capacity utilization. Efforts to mitigate this problem led to the emergence of multiple methods, among which is kernel orthogonality, enforced by various means. In this work, we challenge the common belief that kernel orthogonality leads to a decrease in feature map redundancy, which is, supposedly, the ultimate objective behind kernel orthogonality. We prove, theoretically and empirically, that kernel orthogonality has an unpredictable effect on feature map similarity and does not necessarily decrease it. Based on our theoretical result, we propose an effective method to reduce feature map similarity independently of the input of the CNN. This is done by minimizing a novel loss function we call Convolutional Similarity. Empirical results show that minimizing the Convolutional Similarity increases the performance of classification models and can accelerate their convergence. Furthermore, our proposed method pushes towards a more efficient use of model capacity, allowing the use of significantly smaller models to achieve the same levels of performance.
URL: https://openreview.net/forum?id=tKxga13DZg
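The paper's Convolutional Similarity loss is input-independent and defined there; as a loose illustration of the quantity such a regularizer targets, the sketch below measures pairwise cosine similarity between feature maps of one layer and could be added to a task loss as a redundancy penalty.

```python
# Illustrative redundancy penalty over one layer's feature maps (not the
# paper's exact loss): mean absolute off-diagonal cosine similarity.
import torch
import torch.nn.functional as F

def feature_map_redundancy(fmaps):
    # fmaps: (B, C, H, W) activations from one conv layer
    b, c, h, w = fmaps.shape
    flat = F.normalize(fmaps.reshape(b, c, h * w), dim=2)
    sim = torch.bmm(flat, flat.transpose(1, 2))        # (B, C, C) cosine similarities
    off_diag = sim - torch.eye(c, device=sim.device)   # remove self-similarity
    return off_diag.abs().mean()                       # penalty added to the task loss
```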
---
Title: Tiny models from tiny data: Textual and null-text inversion for few-shot distillation
Abstract: Few-shot learning deals with problems such as image classification using very few training examples. Recent vision foundation models show excellent few-shot transfer abilities, but are large and slow at inference. Using knowledge distillation, the capabilities of high-performing but slow models can be transferred to tiny, efficient models. However, common distillation methods require a large set of unlabeled data, which is not available in the few-shot setting. To overcome this lack of data, there has been a recent interest in using synthetic data. We expand on this line of research by presenting a novel diffusion model inversion technique (TINT) combining the diversity of textual inversion with the specificity of null-text inversion. Using this method in a few-shot distillation pipeline leads to state-of-the-art accuracy among small student models on popular benchmarks, while being significantly faster than prior work. Popular few-shot benchmarks involve evaluation over a large number of episodes, which is computationally cumbersome for methods involving synthetic data generation. We also present a theoretical analysis on how the accuracy estimator variance depends on the number of episodes and query examples, and use these results to lower the computational effort required for method evaluation. Finally, to further motivate the use of generative models in few-shot distillation, we demonstrate that our method outperforms training on real data mined from the dataset used in the original diffusion model training. Source code is available at TBD [Released with the camera-ready version].
URL: https://openreview.net/forum?id=kt6kZHOAVA
---
Title: WavePaint: Resource-efficient Token-mixer for Self-supervised Inpainting
Abstract: Inpainting, which refers to the synthesis of missing regions, can help restore occluded or degraded areas of an image and also serve as a precursor task for the self-supervision of neural networks for computer vision. The current state-of-the-art models for inpainting are computationally heavy as they are based on transformer or CNN backbones that are trained in adversarial or diffusion settings. This paper diverges from vision transformers by using a computationally-efficient WaveMix-based fully convolutional architecture---WavePaint. It uses a 2D discrete wavelet transform (DWT) for spatial and multi-resolution token-mixing along with convolutional layers. The proposed model outperforms the current state-of-the-art models for image inpainting on reconstruction quality while also using far fewer parameters, less GPU RAM, and considerably lower training and evaluation times. Our model even outperforms current GAN-based architectures on the CelebA-HQ dataset without using an adversarially trainable discriminator. This work suggests that neural architectures that are modeled after natural image priors require fewer parameters and computations to achieve better generalization.
URL: https://openreview.net/forum?id=ZgMNtu2Ykd
---
Title: On the Sample Complexity of One Hidden Layer Networks with Equivariance, Locality and Weight Sharing
Abstract: Weight sharing, equivariance, and local filters, as in convolutional neural networks, are believed to contribute to the sample efficiency of neural networks. However, it is not clear how each of these design choices contributes to the generalization error. Through the lens of statistical learning theory, we aim to provide insight into this question by characterizing the relative impact of each choice on the sample complexity. We obtain lower and upper sample complexity bounds for a class of single-hidden-layer networks. We show that the gain of equivariance is directly manifested in the bound, while obtaining a similar increase for weight sharing depends on the sharing mechanism. Among our results, we obtain a completely dimension-free bound for equivariant networks for a class of pooling operations. We show that the bound depends merely on the norm of the filters, which is tighter than using the spectral norm of the respective matrix. We also characterize the trade-off in sample complexity between the parametrization of filters in the spatial and frequency domains, particularly when spatial filters are localized as in vanilla convolutional neural networks.
URL: https://openreview.net/forum?id=Q7aXOnEGgU
---
Title: What Makes ImageNet Look Unlike LAION
Abstract: ImageNet was famously created from Flickr image search results. What if we recreated ImageNet instead by searching the massive LAION dataset based on image captions alone? In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle, yet important, difference in two plausible causal data-generating processes for the respective datasets, that we support with systematic experimentation. In a nutshell, searching based on an image caption alone creates an information bottleneck that mitigates the selection bias otherwise present in image-based filtering. Our explanation formalizes a long-held intuition in the community that ImageNet images are stereotypical, unnatural, and overly simple representations of the class category. At the same time, it provides a simple and actionable takeaway for future dataset creation efforts.
URL: https://openreview.net/forum?id=IrBYuh9W3T
---
Title: Personalized Negative Reservoir for Incremental Learning in Recommender Systems
Abstract: Recommender systems have become an integral part of online platforms. Every day the volume of training data is expanding and the number of user interactions is constantly increasing. The exploration of larger and more expressive models has become a necessary pursuit to improve user experience. However, this progression carries with it an increased computational burden. In commercial settings, once a recommendation system model has been trained and deployed, it typically needs to be updated frequently as new client data arrive. Cumulatively, the mounting volume of data is guaranteed to eventually make full batch retraining of the model from scratch computationally infeasible. Naively fine-tuning solely on the new data runs into the well-documented problem of catastrophic forgetting. Despite the fact that negative sampling is a crucial part of training with implicit feedback, no specialized technique exists that is tailored to the incremental learning framework. In this work, we propose a personalized negative reservoir strategy, which is used to obtain negative samples for the standard triplet loss. Our technique balances alleviation of forgetting with plasticity by encouraging the model to remember stable user preferences and selectively forget when user interests change. We derive the mathematical formulation of a negative sampler to populate and update the reservoir. We integrate our design into three state-of-the-art and commonly used incremental recommendation models. We show that these concrete realizations of our negative reservoir framework achieve state-of-the-art results on standard benchmarks using multiple standard top-k evaluation metrics.
URL: https://openreview.net/forum?id=jrUUk5Fskm
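A hedged sketch of the surrounding machinery: a basic triplet loss with negatives drawn from a per-user reservoir. The reservoir data structure and fallback policy are assumptions for illustration; the paper's personalized reservoir update rule is not reproduced.

```python
# Toy triplet update with reservoir-sourced negatives (illustrative only).
import random
import torch

def triplet_loss(user_emb, pos_emb, neg_emb, margin=1.0):
    pos_dist = (user_emb - pos_emb).pow(2).sum(-1)
    neg_dist = (user_emb - neg_emb).pow(2).sum(-1)
    return torch.relu(pos_dist - neg_dist + margin).mean()

def sample_negative(reservoir, user_id, all_items):
    # reservoir: dict user_id -> list of cached negative item ids (assumed layout)
    if reservoir.get(user_id):
        return random.choice(reservoir[user_id])
    return random.choice(all_items)  # fall back to uniform negative sampling
```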
---
Title: A Stochastic Polynomial Expansion for Uncertainty Propagation through Networks
Abstract: Network-based machine learning constructs are becoming more prevalent in sensing and decision-making systems. As these systems are implemented in safety-critical environments such as pedestrian detection and power management, it is crucial to evaluate confidence in their decisions. At the heart of this problem is a need to understand and characterize how errors at the input of networks become progressively expanded or contracted as signals move through layers, especially in light of the non-trivial nonlinearities manifest throughout modern machine learning architectures. When sampling methods become expensive due to network size or complexity, approximation is needed and popular methods include Jacobian (first order Taylor) linearization and stochastic linearization. However, despite computational tractability, the accuracy of these methods can break down in situations with moderate to high input uncertainty.
Here, we present a generalized method for propagating variational multivariate Gaussian distributions through neural networks. We propose a modified Taylor expansion for the nonlinear transformation of Gaussian distributions, with an additional approximation in which the polynomial terms act on independent, identically distributed Gaussian random variables. With these approximated higher-order terms (HOTs), we obtain significantly more accurate estimates of layer-wise distributions. Despite the introduction of the HOTs, this method can propagate a full covariance matrix with a complexity of $O(n^2)$ (and $O(n)$ if only propagating marginal variance), comparable to Jacobian linearization. Thus, our method finds a balance between efficiency and accuracy. We derive closed-form solutions for this approximate stochastic Taylor expansion for seven commonly used nonlinearities and verify the effectiveness of our method in deep residual neural networks. This general method can be integrated into use-cases such as Kalman filtering, adversarial training, and variational learning.
URL: https://openreview.net/forum?id=2NHRnF859N
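To make the baseline concrete, here is a sketch of first-order (Jacobian) moment propagation through a tanh nonlinearity, checked against Monte Carlo. The paper's higher-order stochastic Taylor terms are not reproduced, and the input moments are illustrative.

```python
# Jacobian linearization of a Gaussian through tanh vs. a Monte Carlo reference.
import numpy as np

mu, var = 0.5, 0.6                       # illustrative input mean and variance

# Linearization: y ~ N(tanh(mu), tanh'(mu)^2 * var)
jac = 1.0 - np.tanh(mu) ** 2
mu_lin, var_lin = np.tanh(mu), jac ** 2 * var

# Monte Carlo reference
x = np.random.default_rng(0).normal(mu, np.sqrt(var), 200_000)
y = np.tanh(x)
print(mu_lin, var_lin)    # linearized moments
print(y.mean(), y.var())  # MC moments; the gap grows with input variance
```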
---
Title: (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum
Abstract: Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size $1$) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $O\left(\exp(-\frac{T}{\sqrt{\kappa}}) + \sigma \right)$ convergence, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. We prove a lower-bound which demonstrates that a $\kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{T}\right)$ rate. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O(\exp(-\frac{T}{\kappa}) + \frac{\sigma^2}{T})$ rate. We empirically demonstrate the effectiveness of the proposed algorithms.
URL: https://openreview.net/forum?id=Okxp1W8If0
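For reference, a minimal stochastic heavy-ball loop on a strongly-convex quadratic; the hyperparameters are illustrative, and the paper's noise-adaptive multi-stage schedule is not reproduced here.

```python
# Toy SHB: v_{t+1} = beta * v_t + g(x_t), x_{t+1} = x_t - lr * v_{t+1}.
import numpy as np

def shb(grad_fn, x0, lr=0.1, beta=0.9, iters=200):
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = beta * v + grad_fn(x)  # momentum buffer
        x = x - lr * v             # heavy-ball step
    return x

rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])                             # condition number kappa = 10
grad = lambda x: A @ x + 0.01 * rng.normal(size=2)   # noisy gradient, small sigma
print(shb(grad, np.array([1.0, 1.0])))               # converges near the origin
```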
---
Title: What’s Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias
Abstract: Knowledge Distillation is a commonly used Deep Neural Network (DNN) compression method, which often maintains overall generalization performance. However, we show that even for balanced image classification datasets, such as CIFAR-100, Tiny ImageNet and ImageNet, as many as 41% of the classes are statistically significantly affected by distillation when comparing class-wise accuracy (i.e. class bias) between a teacher/distilled student or distilled student/non-distilled student model. Changes in class bias are not necessarily an undesirable outcome when considered outside of the context of a model’s usage. Using two common fairness metrics, Demographic Parity Difference (DPD) and Equalized Odds Difference (EOD) on models trained with the CelebA, Trifeature, and HateXplain datasets, our results suggest that increasing the distillation temperature improves the distilled student model’s fairness, and the distilled student fairness can even surpass the fairness of the teacher model at high temperatures. This study highlights the uneven effects of distillation on certain classes and its potentially significant role in fairness, emphasizing that caution is warranted when using distilled models for sensitive application domains.
URL: https://openreview.net/forum?id=xBbj46Y2fN
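The temperature knob studied above is the standard one in the distillation objective; a textbook sketch (not the authors' exact training code) is:

```python
# Standard soft-target distillation loss with temperature T.
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```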
---
Title: DisDet: Exploring Detectability of Backdoor Attack on Diffusion Models
Abstract: In the exciting generative AI era, the diffusion model has emerged as a very powerful and widely adopted content-generation tool. Very recently, some pioneering works have shown the vulnerability of the diffusion model against backdoor attacks, calling for in-depth analysis and investigation of the security challenges. In this paper, for the first time, we systematically explore the detectability of the poisoned noise input for backdoored diffusion models, an important performance metric that has received little attention in existing works. Starting from the perspective of a defender, we first analyze the distribution discrepancy of the trigger pattern in existing diffusion backdoor attacks. Based on this finding, we propose a trigger detection mechanism that can effectively identify poisoned input noise. Then, from the attack side, we propose a backdoor attack strategy that can learn an unnoticeable trigger to evade our proposed detection scheme. Our empirical evaluations across various diffusion models and datasets demonstrate the effectiveness of the proposed trigger detection and detection-evading attack strategy. For trigger detection, our distribution discrepancy-based solution achieves a 100% detection rate for the Trojan triggers used in existing works. For evading trigger detection, our proposed stealthy trigger design approach performs end-to-end learning to make the distribution of poisoned noise input approach that of benign noise, enabling a nearly 100% detection pass rate with very high attack and benign performance for the backdoored diffusion models.
URL: https://openreview.net/forum?id=SfqCaAOF1S
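As a toy instance of a distribution-discrepancy check on input noise (a simplification, not DisDet itself), one can test the noise against a standard normal:

```python
# Flag diffusion input noise whose empirical distribution departs from N(0, 1).
import numpy as np
from scipy import stats

def looks_poisoned(noise, alpha=0.01):
    flat = np.asarray(noise).ravel()
    stat, p_value = stats.kstest(flat, "norm")  # KS test against N(0, 1)
    return p_value < alpha                      # rejection => flag as suspicious

clean = np.random.default_rng(0).normal(size=4096)
trigger = clean + 0.2          # toy global bias standing in for a trigger pattern
print(looks_poisoned(clean), looks_poisoned(trigger))  # False, True
```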
---
Title: ADAPT to Robustify Prompt Tuning Vision Transformers
Abstract: The performance of deep models, including Vision Transformers, is known to be vulnerable to adversarial attacks. Many existing defenses against these attacks, such as adversarial training, rely on full-model fine-tuning to induce robustness in the models. These defenses require storing a copy of the entire model, which can have billions of parameters, for each task. At the same time, parameter-efficient prompt tuning is used to adapt large transformer-based models to downstream tasks without the need to save large copies. In this paper, we examine parameter-efficient prompt tuning of Vision Transformers for downstream tasks under the lens of robustness. We show that previous adversarial defense methods, when applied to the prompt tuning paradigm, suffer from gradient obfuscation and are vulnerable to adaptive attacks. We introduce ADAPT, a novel framework for performing adaptive adversarial training in the prompt tuning paradigm. Our method achieves competitive robust accuracy of ∼40% relative to SOTA robustness methods that use full-model fine-tuning, by tuning only ∼1% of the number of parameters.
URL: https://openreview.net/forum?id=bZzXgheUSD
---
Title: Conditional Density Estimations from Privacy-Protected Data
Abstract: Many modern statistical analysis and machine learning applications require training models on sensitive user data. Under a formal definition of privacy protection, differentially private algorithms inject calibrated noise into the confidential data or during the data analysis process to produce privacy-protected datasets or queries. However, restricting access to only privatized data during statistical analysis makes it computationally challenging to make valid statistical inferences. In this work, we propose simulation-based inference methods from privacy-protected datasets. In addition to sequential Monte Carlo approximate Bayesian computation, we adopt neural conditional density estimators as a flexible family of distributions to approximate the posterior distribution of model parameters given the observed private query results. We illustrate our methods on discrete time-series data under an infectious disease model and with ordinary linear regression models. Illustrating the privacy-utility trade-off, our experiments and analysis demonstrate the necessity and feasibility of designing valid statistical inference procedures to correct for biases introduced by the privacy-protection mechanisms.
URL: https://openreview.net/forum?id=SB7JzhDG45
---
Title: Necessary and Sufficient Watermark for Large Language Models
Abstract: Large language models (LLMs) can now generate texts that are indistinguishable from those written by humans. Such remarkable performance increases the risk of LLMs being used for malicious purposes. Thus, it is necessary to develop methods for distinguishing texts written by LLMs from those written by humans. Watermarking is one of the most powerful methods for achieving this. Although existing methods have successfully detected texts generated by LLMs, they inevitably degrade text quality. In this study, we propose the Necessary and Sufficient Watermark (NS-Watermark) for inserting watermarks into generated texts with minimal text quality degradation. More specifically, we derive the minimum constraints that must be imposed on the generated texts to distinguish whether LLMs or humans wrote them, and we formulate the NS-Watermark as a constrained optimization problem. Through experiments, we demonstrate that the NS-Watermark can generate more natural texts than existing watermarking methods and distinguish more accurately between texts written by LLMs and those written by humans. Especially in machine translation tasks, the NS-Watermark can outperform the existing watermarking method by up to 30 BLEU points.
URL: https://openreview.net/forum?id=FcyHZ6Q4k0
---
Title: Predicting sub-population specific viral evolution
Abstract: Forecasting the change in the distribution of viral variants is crucial for therapeutic design and disease surveillance. This task poses significant modeling challenges due to the sharp differences in virus distributions across sub-populations (e.g., countries) and their dynamic interactions. Existing machine learning approaches that model the variant distribution as a whole are incapable of making location-specific predictions and ignore transmissions that shape the viral landscape. In this paper, we propose a sub-population specific protein evolution model, which predicts the time-resolved distributions of viral proteins in different locations. The algorithm explicitly models the transmission rates between sub-populations and learns their interdependence from data. The change in protein distributions across all sub-populations is defined through a linear ordinary differential equation (ODE) parametrized by transmission rates. Solving this ODE yields the likelihood of a given protein occurring in particular sub-populations. Multi-year evaluation on both SARS-CoV-2 and influenza A/H3N2 demonstrates that our model outperforms baselines in accurately predicting distributions of viral proteins across continents and countries. We also find that the transmission rates learned from data are consistent with the transmission pathways discovered by retrospective phylogenetic analysis.
URL: https://openreview.net/forum?id=Mae23iEqPS
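A minimal sketch of the linear-ODE view described above, with an invented two-location transmission matrix; solving the ODE in closed form gives occurrence probabilities at time t. The rates are illustrative, not learned from data.

```python
# Linear ODE dp/dt = A p over sub-population occurrence probabilities, with
# off-diagonal entries of A playing the role of transmission rates.
import numpy as np
from scipy.linalg import expm

A = np.array([[-0.30,  0.20],    # columns sum to zero, so mass is conserved
              [ 0.30, -0.20]])
p0 = np.array([0.9, 0.1])        # initial occurrence probabilities per location

p_t = expm(A * 5.0) @ p0         # distribution after t = 5 time units
print(p_t)
```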
---
Title: Sparse Decomposition of Graph Neural Networks
Abstract: Graph Neural Networks (GNNs) exhibit superior performance in graph representation learning, but their inference cost can be high due to an aggregation operation that can require memory fetches for a very large number of nodes.
This inference cost is the major obstacle to deploying GNN models with \emph{online prediction} to reflect the potentially dynamic node features.
To address this, we propose an approach to reduce the number of nodes that are included during aggregation.
We achieve this through a sparse decomposition, learning to approximate node representations using a weighted sum of linearly transformed features of a carefully selected subset of nodes within the extended neighbourhood.
The approach achieves linear complexity with respect to the average node degree and the number of layers in the graph neural network.
We introduce an algorithm to compute the optimal parameters for the sparse decomposition, ensuring an accurate approximation of the original GNN model, and present effective strategies to reduce the training time and improve the learning process.
We demonstrate via extensive experiments that our method outperforms other baselines designed for inference speedup, achieving significant accuracy gains with comparable inference times for both node classification and spatio-temporal forecasting tasks.
URL: https://openreview.net/forum?id=xdWP1d8BxI
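As a loose stand-in for the sparse decomposition idea (not the paper's dedicated algorithm), one can approximate a node's representation as a sparse weighted sum of linearly transformed neighbourhood features with an off-the-shelf Lasso; all tensors below are synthetic.

```python
# Sparse approximation of a target node representation from transformed
# neighbourhood features, with sparsity selecting which nodes to fetch.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))              # assumed shared linear transform
neigh_feats = rng.normal(size=(50, 16))    # extended-neighbourhood features
target_rep = rng.normal(size=16)           # node representation to approximate

basis = neigh_feats @ W.T                  # transformed features, shape (50, 16)
lasso = Lasso(alpha=0.05).fit(basis.T, target_rep)  # columns = candidate nodes
support = np.flatnonzero(lasso.coef_)      # nodes kept at inference time
print(len(support), "of", len(neigh_feats), "nodes retained")
```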
---
Title: Causal Discovery over High-Dimensional Structured Hypothesis Spaces with Causal Graph Partitioning
Abstract: The aim in many sciences is to understand the mechanisms that underlie the observed distribution of variables, starting from a set of initial hypotheses. Causal discovery allows us to infer mechanisms as sets of cause and effect relationships in a generalized way---without necessarily tailoring to a specific domain. Causal discovery algorithms search over a structured hypothesis space, defined by the set of Directed Acyclic Graphs (DAG), to find the graph that best explains the data. For high-dimensional problems, however, this search becomes intractable and scalable algorithms for causal discovery are needed to bridge the gap.
In this paper, we define a novel causal graph partition that allows for divide-and-conquer causal discovery with theoretical guarantees under the Maximal Ancestral Graph (MAG) class. We leverage the idea of a superstructure---a set of learned or existing candidate hypotheses---to partition the search space. We prove under certain assumptions that learning with a causal graph partition always yields the Markov Equivalence Class of the true causal graph. We show our algorithm achieves comparable accuracy and a faster time to solution for biologically-tuned synthetic networks and networks up to $10^4$ variables. This makes our method applicable to gene regulatory network inference and other domains with high-dimensional structured hypothesis spaces.
URL: https://openreview.net/forum?id=FecsgPCOHk
---
Title: Time Series Domain Adaptation via Channel-Selective Representation Alignment
Abstract: Building generalizable and robust multivariate time series models can be challenging for real-world settings that involve significant shifts between training and testing. Existing unsupervised domain adaptation methods often struggle with real-world distribution shifts, which can be much more severe in some channels than others. To overcome these obstacles, we introduce a novel method called Signal Selection and Screening via Sinkhorn alignment for Time Series domain Adaptation (SSSS-TSA). SSSS-TSA addresses channel-level variations by aligning both individual channel representations and selectively weighted combined channel representations. This dual alignment strategy based on channel selection not only ensures effective adaptation to new domains but also maintains robustness in scenarios with training and testing set shifts or when certain channels are absent or corrupted. We evaluate our method on several time-series classification benchmarks and find that it consistently improves performance over existing methods. These results demonstrate the importance of adaptively selecting and screening different channels to enable more effective alignment across domains.
URL: https://openreview.net/forum?id=8C8LJIqF4y
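For readers unfamiliar with Sinkhorn alignment, a bare-bones entropic optimal transport cost between two channel feature sets looks as follows; this is illustrative and does not reproduce the paper's channel selection and screening.

```python
# Entropy-regularized OT cost via Sinkhorn scaling iterations.
import torch

def sinkhorn_cost(x, y, eps=0.1, iters=50):
    # x: (n, d) source-channel features, y: (m, d) target-channel features
    C = torch.cdist(x, y) ** 2                 # squared-Euclidean cost matrix
    K = torch.exp(-C / eps)
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u = torch.ones_like(a)
    for _ in range(iters):
        u = a / (K @ (b / (K.t() @ u)))        # alternating scaling updates
    v = b / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]         # transport plan
    return (plan * C).sum()                    # entropic OT cost, usable as a loss
```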
---
Title: Multi-Output Distributional Fairness via Post-Processing
Abstract: Post-processing approaches are becoming prominent techniques for enhancing machine learning models' fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model's distributional parity, a task-agnostic fairness measure. Existing techniques to achieve distributional parity are based on the (inverse) cumulative density function of a model's output, which limits them to single-output models. Extending previous works, our method employs an optimal transport mapping to move a model's outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter, and a kernel regression method is proposed to extend this process to out-of-sample data. Our empirical studies, which compare our method to existing post-processing baselines on multi-task/multi-class classification and representation learning tasks, demonstrate the effectiveness of the proposed approach.
URL: https://openreview.net/forum?id=MJOKrHqiV1
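A one-dimensional sketch of the barycenter-mapping idea: the paper handles multi-output models via optimal transport maps, while here we assume a single score per example for clarity and average group quantile functions.

```python
# Push each group's 1D scores onto the Wasserstein barycenter of the groups.
import numpy as np

def to_barycenter(scores_by_group):
    # scores_by_group: dict group -> 1D array of model outputs
    grid = np.linspace(0, 1, 101)
    quantiles = {g: np.quantile(s, grid) for g, s in scores_by_group.items()}
    bary = np.mean(list(quantiles.values()), axis=0)   # barycenter quantile function
    adjusted = {}
    for g, s in scores_by_group.items():
        ranks = np.searchsorted(np.sort(s), s) / max(len(s) - 1, 1)  # empirical CDF
        adjusted[g] = np.interp(ranks, grid, bary)     # map each score to barycenter
    return adjusted
```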
---
Title: Reheated Gradient-based Discrete Sampling for Combinatorial Optimization
Abstract: Recently, gradient-based discrete sampling has emerged as a highly efficient, general-purpose solver for various combinatorial optimization (CO) problems, achieving performance comparable to or surpassing popular data-driven approaches. However, we identify a critical issue in these methods, which we term ``wandering in contours''. This behavior refers to repeatedly sampling distinct solutions that share very similar objective values over long stretches, leading to computational inefficiency and suboptimal exploration of potential solutions. In this paper, we introduce a novel reheating mechanism, inspired by the concepts of critical temperature and specific heat in physics, aimed at overcoming this limitation. Empirically, our method demonstrates superiority over existing sampling-based and data-driven algorithms across a diverse array of CO problems.
URL: https://openreview.net/forum?id=uPCvfyr2KP
---
Title: Label Distribution Shift-Aware Prediction Refinement for Test-Time Adaptation
Abstract: Test-time adaptation (TTA) is an effective approach to mitigate performance degradation of trained models when encountering input distribution shifts at test time. However, existing TTA methods often suffer significant performance drops when facing additional class distribution shifts. We first analyze TTA methods under label distribution shifts and identify the presence of class-wise confusion patterns commonly observed across different covariate shifts. Based on this observation, we introduce label Distribution shift-Aware prediction Refinement for Test-time adaptation (DART), a novel TTA method that refines the predictions by focusing on class-wise confusion patterns. DART trains a prediction refinement module during an intermediate time by exposing it to several batches with diverse class distributions using the training dataset. This module is then used during test time to detect and correct class distribution shifts, significantly improving pseudo-label accuracy for test data. Our method exhibits 5-18% gains in accuracy under label distribution shifts on CIFAR-10C, without any performance degradation when there is no label distribution shift. Extensive experiments on CIFAR, PACS, OfficeHome, and ImageNet benchmarks demonstrate DART's ability to correct inaccurate predictions caused by test-time distribution shifts. This improvement leads to enhanced performance in existing TTA methods, making DART a valuable plug-in tool.
URL: https://openreview.net/forum?id=c7AAHdEYz5
---
Title: Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Abstract: In recent years, with realistic generation results and a wide range of personalized applications, diffusion-based generative models have gained huge attention in both the visual and audio generation areas. Compared to the considerable advancements in text2image and text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple and lightweight generative transformer, not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in the mask-denoising manner. After training, classifier-free guidance can be deployed off-the-shelf, achieving better performance without any extra training or modification. Since the transformer model is modality-symmetrical, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at this [anonymous link](https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/edit?usp=sharing).
URL: https://openreview.net/forum?id=EC5zQT1nLP
---
Title: Attention Mechanisms Don’t Learn Additive Models: Rethinking Feature Importance for Transformers
Abstract: We address the critical challenge of applying feature attribution methods to the transformer architecture, which dominates current applications in natural language processing and beyond. Traditional attribution methods in explainable AI (XAI) explicitly or implicitly rely on linear or additive surrogate models to quantify the impact of input features on a model's output. In this work, we formally prove an alarming incompatibility: transformers are structurally incapable of aligning with popular surrogate models for feature attribution, undermining the grounding of these conventional explanation methodologies. To address this discrepancy, we introduce the Softmax-Linked Additive Log-Odds Model (SLALOM), a novel surrogate model specifically designed to align with the transformer framework. SLALOM demonstrates the capacity to deliver a range of insightful explanations across both synthetic and real-world datasets. We highlight SLALOM's unique efficiency-quality curve by showing that SLALOM can produce explanations with substantially higher fidelity than competing surrogate models, or provide explanations of comparable quality at a fraction of their computational cost.
URL: https://openreview.net/forum?id=yawWz4qWkF
---
Title: Accelerating Non-Conjugate Gaussian Processes By Trading Off Computation For Uncertainty
Abstract: Non-conjugate Gaussian processes (NCGPs) define a flexible probabilistic framework to model categorical, ordinal and continuous data, and are widely used in practice. However, exact inference in NCGPs is prohibitively expensive for large datasets, thus requiring approximations in practice. The approximation error adversely impacts the reliability of the model and is not accounted for in the uncertainty of the prediction. We introduce a family of iterative methods that explicitly model this error. They are uniquely suited to parallel modern computing hardware, efficiently recycle computations, and compress information to reduce both the time and memory requirements for NCGPs. As we demonstrate on large-scale classification problems, our method significantly accelerates training compared to competitive baselines by trading off reduced computation for increased uncertainty.
URL: https://openreview.net/forum?id=UdcF3JbSKb
---
Title: Memory-Modular Classification: Learning to Generalize with Memory Replacement
Abstract: We propose a novel memory-modular learner for image classification that separates knowledge memorization from reasoning. Our model enables effective generalization to new classes by simply replacing the memory contents, without the need for model retraining. Unlike traditional models that encode both world knowledge and task-specific skills into their weights during training, our model stores knowledge in an external memory of web-crawled image and text data. At inference time, the model dynamically selects relevant content from the memory based on the input image, allowing it to adapt to arbitrary classes by simply replacing the memory contents. The key differentiator is that our learner meta-learns to perform classification tasks with noisy web data from unseen classes, resulting in robust performance across various classification scenarios. Experimental results demonstrate the promising performance and versatility of our approach in handling diverse classification tasks, including zero-shot/few-shot classification of unseen classes, fine-grained classification, and class-incremental classification.
URL: https://openreview.net/forum?id=DcIW0idrg8
---
Title: Rethinking Patch Dependence for Masked Autoencoders
Abstract: In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Anonymized code is available [here](https://anonymous.4open.science/r/mae-cross-anon-11EB/README.md).
URL: https://openreview.net/forum?id=JT2KMuo2BV
---
Title: Hypergraph Neural Networks through the Lens of Message Passing: A Common Perspective to Homophily and Architecture Design
Abstract: Most current learning methodologies and benchmarking datasets in the hypergraph realm are obtained by \emph{lifting} procedures from their graph analogs, which can overshadow characteristics specific to hypergraphs. This paper attempts to confront some pending questions in that regard: Q1 Can the concept of homophily play a crucial role in Hypergraph Neural Networks (HNNs)? Q2 How do models that employ unique characteristics of higher-order networks perform compared to lifted models? Q3 Do well-established hypergraph datasets provide a meaningful benchmark for HNNs? To address them, we first introduce a novel conceptualization of homophily in higher-order networks based on a Message Passing (MP) scheme, unifying both the analytical examination and the modeling of higher-order networks. Further, we investigate some natural strategies for processing higher-order structures within HNNs (such as keeping hyperedge-dependent node representations or performing node/hyperedge stochastic sampling), leading us to the most general MP formulation to date, MultiSet. Finally, we conduct an extensive set of experiments that contextualize our proposals.
URL: https://openreview.net/forum?id=8rxtL0kZnX
---
Title: Accounting for AI and Users Shaping One Another: The Role of Mathematical Models
Abstract: As AI systems enter into a growing number of societal domains, these systems increasingly shape and are shaped by user preferences, opinions, and behaviors. However, the design of AI systems rarely accounts for how AI and users shape one another. In this survey paper, we argue for the development of \textit{formal interaction models} which mathematically specify how AI and users shape one another. Formal interaction models can be leveraged to (1) specify interactions for implementation, (2) monitor interactions through empirical analysis, (3) anticipate societal impacts via counterfactual analysis, and (4) control societal impacts via interventions. The design space of formal interaction models is vast, and model design requires careful consideration of factors such as style, granularity, mathematical complexity, and measurability. Using content recommender systems as a case study, we critically examine the nascent literature of formal interaction models with respect to these use-cases and design axes. More broadly, we call for the community to leverage formal interaction models when designing, evaluating, or auditing any AI system which interacts with users.
URL: https://openreview.net/forum?id=UkP4DhrJt1
---
Title: Robust Model Selection of Gaussian Graphical Models
Abstract: In Gaussian graphical model selection, noise-corrupted samples present significant challenges. It is known that even minimal amounts of noise can obscure the underlying structure, leading to fundamental identifiability issues. A recent line of work addressing this “robust model selection” problem narrows its focus to tree-structured graphical models. Even within this specific class of models, exact structure recovery is shown to be impossible. However, several algorithms have been developed that are known to provably recover the underlying tree-structure up to an (unavoidable) equivalence class. In this paper, we extend these results beyond tree-structured graphs. We first characterize the equivalence class up to which general graphs can be recovered in the presence of noise. Despite the inherent ambiguity (which we prove is unavoidable), the structure that can be recovered reveals local clustering information and global connectivity patterns in the underlying model. Such information is useful in a range of real-world problems, including power grids, social networks, protein-protein interactions, and neural structures. We then propose an algorithm which provably recovers the underlying graph up to the identified ambiguity. We further provide finite sample guarantees in the high-dimensional regime for our algorithm and validate our results through numerical simulations.
URL: https://openreview.net/forum?id=AIby9MQXbu
---
Title: Advanced Optimization Techniques in Neural Networks: A Sobolev Space Approach
Abstract: In this article, we explore the concept of Sobolev loss and its advantages over conventional loss functions in neural network training, particularly in the context of approximating smooth functions and their derivatives. Conventional loss functions like Mean Squared Error (MSE) and Mean Absolute Error (MAE) focus solely on minimizing the difference between predicted and true function values. However, they often fail to capture the smoothness and derivative information critical for accurate function approximation in various scientific and engineering applications.
Sobolev loss addresses this limitation by incorporating terms that measure the difference between the derivatives of the predicted and true functions. This not only ensures better function value approximations but also promotes smoother and more accurate representations of the underlying function. The article delves into the theoretical foundations of Sobolev spaces, which provide the mathematical framework for Sobolev loss, and discusses the benefits of using Sobolev loss in terms of improved generalization, stability, and performance.
We illustrate these concepts through a practical example of approximating $f(x)=\sin(x)$ and $f(x)=e^{-x}$ using a neural network. The example demonstrates how Sobolev loss enables the network to learn both the function values and their derivatives, resulting in a more accurate and smooth approximation compared to traditional loss functions. Additionally, we highlight key references for further reading, including foundational texts on Sobolev spaces and research papers that explore the application of Sobolev loss in neural networks.
By integrating derivative information into the training process, Sobolev loss provides a powerful tool for enhancing the quality of neural network approximations, making it particularly valuable for applications requiring smooth and accurate function representations.
URL: https://openreview.net/forum?id=noubCjKiT9
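A minimal sketch of the Sobolev loss for the $\sin$ example above, with the network's derivative obtained by autograd; the weighting `lam` and the architecture are illustrative choices.

```python
# Sobolev-style loss: penalize errors in both function values and derivatives.
import torch

def sobolev_loss(model, x, f, df, lam=1.0):
    x = x.requires_grad_(True)
    y = model(x)
    dy = torch.autograd.grad(y.sum(), x, create_graph=True)[0]  # network derivative
    return ((y - f) ** 2).mean() + lam * ((dy - df) ** 2).mean()

model = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
x = torch.linspace(-3, 3, 256).unsqueeze(1)
loss = sobolev_loss(model, x, torch.sin(x), torch.cos(x))  # f = sin, f' = cos
loss.backward()  # train with any optimizer step after this
```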
---
Title: Rethinking Knowledge Transfer in Learning Using Privileged Information
Abstract: In supervised machine learning, privileged information (PI) is information that is unavailable at inference, but is accessible during training time. Research on learning using privileged information (LUPI) aims to transfer the knowledge captured in PI onto a model that can perform inference without PI. It seems that this extra bit of information ought to make the resulting model better. However, finding conclusive theoretical or empirical evidence that supports the ability to transfer knowledge using PI has been challenging. In this paper, we critically examine the assumptions underlying existing theoretical analyses and argue that there is little theoretical justification for when LUPI should work. We analyze LUPI methods and reveal that apparent improvements in empirical risk of existing research may not directly result from PI. Instead, these improvements often stem from dataset anomalies or modifications in model design misguidedly attributed to PI. Our experiments for a wide variety of application domains further demonstrate that state-of-the-art LUPI approaches fail to effectively transfer knowledge from PI. Thus, we advocate for practitioners to exercise caution when working with PI to avoid unintended inductive biases.
URL: https://openreview.net/forum?id=dg1tqNIWg3
---
Title: Towards Principled Benchmarking of Non-tabular Reinforcement Learning
Abstract: Thorough evaluation of the performance of reinforcement learning agents is critical to establishing significant progress in the field, with benchmarks being the key component of this process. In the tabular setting, a rich theory of environment hardness has recently been leveraged to design benchmarks with precise characterizations of hardness. In contrast, the non-tabular setting currently lacks such a theory and instead relies on expert judgment and community popularity to establish benchmarks. This reliance on subjective assessments can limit the rigour and reliability of the evaluation process. The goal of this paper is to take the first step towards the design of principled non-tabular benchmarks through four main contributions. First, we review the theory of hardness in the tabular and non-tabular settings to highlight promising directions. Second, we identify the essential features that a principled benchmarking library for non-tabular reinforcement learning should possess, explaining the limitations of existing libraries in meeting those needs. Third, we propose a new library (pharos) specifically designed to support the development of principled benchmarking. Finally, we present an in-depth case study that, in addition to illustrating examples of the kind of analysis that pharos facilitates, demonstrates that, while tabular measures can represent a component in quantifying non-tabular hardness, it is necessary to develop measures tailored to the non-tabular setting.
URL: https://openreview.net/forum?id=2h8Ws41sbj
---
Title: Speech Synthesis By Unrolling Diffusion Process using Neural Network Layers
Abstract: This work proposes a novel setup where a neural network is trained to predict multiple steps of the reverse diffusion process in an unrolled manner, with successive layers corresponding to equally spaced steps in the diffusion schedule. Each layer progressively denoises the input during the reverse process until the final layer estimates the original input, $x_0$. Additionally, we introduce a new learning target by using latent variables, rather than the conventional approach of predicting the original input $x_0$ or source error $\epsilon_0$. In speech synthesis, using $x_0$ or $\epsilon_0$ often leads to large prediction errors in the early stages of the denoising process, causing distortion in the recovered speech. Our method mitigates this issue and, through extensive evaluation, demonstrates the generation of high-fidelity speech in competitive time, outperforming current state-of-the-art techniques. Moreover, the proposed approach generalizes well to unseen speech. Sample audio is available at \url{https://onexpeters.github.io/UDPNet/}.
URL: https://openreview.net/forum?id=F6l3BBPElY
---
Title: Stability-Aware Training of Machine Learning Force Fields with Differentiable Boltzmann Estimators
Abstract: Machine learning force fields (MLFFs) are an attractive alternative to ab-initio methods for molecular dynamics (MD) simulations. However, they can produce unstable simulations, limiting their ability to model phenomena occurring over longer timescales and compromising the quality of estimated observables. To address these challenges, we present Stability-Aware Boltzmann Estimator (StABlE) Training, a multi-modal training procedure which leverages joint supervision from reference quantum-mechanical calculations and system observables. StABlE Training iteratively runs many MD simulations in parallel to seek out unstable regions, and corrects the instabilities via supervision with a reference observable. We achieve efficient end-to-end automatic differentiation through MD simulations using our Boltzmann Estimator, a generalization of implicit differentiation techniques to a broader class of stochastic algorithms. Unlike existing techniques based on active learning, our approach requires no additional ab-initio energy and forces calculations to correct instabilities. We demonstrate our methodology across organic molecules, tetrapeptides, and condensed phase systems, using three modern MLFF architectures. StABlE-trained models achieve significant improvements in simulation stability, data efficiency, and agreement with reference observables. By incorporating observables into the training process alongside first-principles calculations, StABlE Training can be viewed as a general semi-empirical framework applicable across MLFF architectures and systems. This makes it a powerful tool for training stable and accurate MLFFs, particularly in the absence of large reference datasets.
URL: https://openreview.net/forum?id=ZckLMG00sO
---
Title: State-Constrained Offline Reinforcement Learning
Abstract: Traditional offline reinforcement learning (RL) methods predominantly operate in a batch-constrained setting. This confines the algorithms to a specific state-action distribution present in the dataset, reducing the effects of distributional shift but restricting the policy to seen actions. In this paper, we alleviate this limitation by introducing state-constrained offline RL, a novel framework that focuses solely on the dataset’s state distribution. This approach allows the policy to take high-quality out-of-distribution actions that lead to in-distribution states, significantly enhancing learning potential. The proposed setting not only broadens the learning horizon but also improves the ability to combine different trajectories from the dataset effectively, a desirable property inherent in offline RL. Our research is underpinned by theoretical findings that pave the way for subsequent advancements in this area. Additionally, we introduce StaCQ, a deep learning algorithm that achieves state-of-the-art performance on the D4RL benchmark datasets and aligns with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in this domain.
URL: https://openreview.net/forum?id=KcR8ykFlHA
---
Title: VLM’s Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
Abstract: Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye examination process to investigate how a VLM perceives images, focusing on key elements of visual recognition from primitive color and shape to the semantic level. To this end, we introduce a dataset named LENS to guide a VLM to follow the examination and check its readiness. Once the model is ready, we conduct the examination, quantifying and visualizing VLMs' sensitivities to color, shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors while consistently showing insensitivity to green across different VLMs. We also find that shape sensitivity and semantic recognition vary with the LLM's capacity, despite the use of the same fixed visual encoder. Our analyses and findings can inspire the design of VLMs and the pre-processing of visual input to VLMs for improving application performance.
URL: https://openreview.net/forum?id=CgWkVb2lHB
---
Title: Transformers in Uniform TC$^0$
Abstract: Previous work has shown that the languages recognized by average-hard attention transformers (AHATs) and softmax-attention transformers (SMATs) are within the circuit complexity class TC$^0$. However, these results assume limited-precision arithmetic: using floating-point numbers with $O(\log n)$ bits (where $n$ is the length of the input string), Strobl showed that AHATs can be approximated in L-uniform TC$^0$, and Merrill and Sabharwal showed that SMATs can be approximated in DLOGTIME-uniform TC$^0$. Here, we improve these results, showing that AHATs with no approximation, SMATs with $O(\mathrm{poly}(n))$ bits of floating-point precision, and SMATs with at most $2^{-O(\mathrm{poly}(n))}$ absolute error are all in DLOGTIME-uniform TC$^0$.
URL: https://openreview.net/forum?id=ZA7D4nQuQF
---
Title: Meta Sparse Principal Component Analysis
Abstract: We study meta-learning for support recovery (i.e., the non-zero entries of the eigenvectors) in high-dimensional Principal Component Analysis. We reduce the sufficient sample complexity in a novel task using information learned from auxiliary tasks, where a task is defined as a random Principal Component (PC) matrix with its own support. We pool data from all the tasks to execute an improper estimation of a single PC matrix, by maximising the $\ell_1$-regularised predictive covariance. With $m$ tasks of $p$-variate sub-Gaussian random vectors, we establish the sufficient sample complexity for each task to be of the order $O(\sqrt{m^{-1}\log p})$, with high probability. This is very relevant for meta-learning, where there are many tasks, $m = O(\log p)$, each with very few samples, i.e., $n = O(1)$, a scenario in which multi-task learning fails. For a novel task, we prove that the sufficient sample complexity for successful support recovery can be reduced to $O(\log |J|)$, under the additional constraint that the support of the novel task is a subset of the estimated support union ($J$) from the auxiliary tasks. This reduces the original sample complexity of $O(\log p)$ for learning a single task. Theoretical claims are validated with numerical simulations, and the problem of true covariance estimation in brain-imaging and cancer-genetics data sets is considered to validate the proposed methodology.
URL: https://openreview.net/forum?id=QhuoVFT6Kc
---