Survey Certification: A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning
Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni
https://openreview.net/forum?id=u0azVc9Y0y
---
Expert Certification: Investigating Generalization Behaviours of Generative Flow Networks
Lazar Atanackovic, Emmanuel Bengio
https://openreview.net/forum?id=9L0B5N5hUX
---
Featured Certification: MiniFold: Simple, Fast, and Accurate Protein Structure Prediction
Jeremy Wohlwend, Mateo Reveiz, Matt McPartlon, Axel Feldmann, Wengong Jin, Regina Barzilay
https://openreview.net/forum?id=1p9hQTbjgo
---
Survey Certification: Autoregressive Models in Vision: A Survey
Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
https://openreview.net/forum?id=1BqXkjNEGP
---
Survey Certification: Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review
Sofiane ENNADIR, Gabriela Zarzar Gandler, Filip Cornell, Lele Cao, Oleg Smirnov, Tianze Wang, Levente Zólyomi, Björn Brinne, Sahar Asadi
https://openreview.net/forum?id=M7Lhr2anjg
---
Accepted papers
===============
Title: Double Horizon Model-Based Policy Optimization
Authors: Akihiro Kubo, Paavo Parmas, Shin Ishii
Abstract: Model-based reinforcement learning (MBRL) reduces the cost of
real-environment sampling by generating synthetic trajectories (called
rollouts) from a learned dynamics model. However, choosing the length of the rollouts
poses two dilemmas: (1) longer rollouts better preserve on-policy training
but amplify model bias, indicating the need for an intermediate
horizon to mitigate distribution shift (i.e., the gap between
on-policy and past off-policy samples); (2) a longer
model rollout may reduce value estimation bias but raise the variance
of policy gradients due to backpropagation through multiple steps,
implying another intermediate horizon for stable gradient estimates.
These two optimal horizons may differ, however. To resolve this
conflict, we propose Double Horizon Model-Based Policy Optimization
(DHMBPO), which divides the rollout procedure into a long
``distribution rollout'' (DR) and a short ``training rollout'' (TR).
The DR generates on-policy state samples for mitigating distribution
shift. In contrast, the short TR leverages differentiable
transitions to offer accurate value gradient estimation with stable
gradient updates, thereby requiring fewer updates and reducing overall
runtime. We demonstrate that the double-horizon approach effectively
balances distribution shift, model bias, and gradient instability, and
surpasses existing MBRL methods on continuous-control benchmarks in
terms of both sample efficiency and runtime.
URL: https://openreview.net/forum?id=HRvHCd03HM
---
Title: LLM-TS Integrator: Integrating LLM for Enhanced Time Series Modeling
Authors: Can Chen, Gabriel L. Oliveira, Hossein Sharifi-Noghabi, Tristan Sylvain
Abstract: Time series~(TS) modeling is essential in dynamic systems like weather prediction and anomaly detection. Recent studies utilize Large Language Models (LLMs) for TS modeling, leveraging their powerful pattern recognition capabilities. These methods primarily position LLMs as the predictive backbone, often omitting the mathematical modeling within traditional TS models, such as periodicity. Conversely, disregarding LLMs entirely would overlook their pattern recognition capabilities. To address this gap, we introduce \textit{LLM-TS Integrator}, a novel framework that effectively integrates the capabilities of LLMs into traditional TS modeling. Central to this integration is our \textit{mutual information} module, whose core is a traditional TS model enhanced with LLM-derived insights for improved predictive ability. This enhancement is achieved by maximizing the mutual information between the traditional model's TS representations and the LLM's textual representation counterparts, bridging the two modalities. Moreover, we recognize that samples vary in importance for the two losses: traditional prediction and mutual information maximization. To address this variability, we introduce the \textit{sample reweighting} module to improve information utilization. This module assigns dual weights to each sample: one for the prediction loss and another for the mutual information loss, dynamically optimizing these weights via bi-level optimization. Our method achieves state-of-the-art or comparable performance across five mainstream TS tasks, including short-term and long-term forecasting, imputation, classification, and anomaly detection. Our code is available at: \url{https://anonymous.4open.science/r/llm_ts_anonymous-F07D/README.MD}
URL: https://openreview.net/forum?id=vPVqQmjCy8
---
Title: Covariate-dependent Graphical Model Estimation via Neural Networks with Statistical Guarantees
Authors: Jiahe Lin, Yikai Zhang, George Michailidis
Abstract: Graphical models are widely used in diverse application domains to model the conditional dependencies amongst a collection of random variables. In this paper, we consider settings where the graph structure is covariate-dependent, and investigate a deep neural network-based approach to estimate it. The method allows for flexible functional dependency on the covariate, and fits the data reasonably well in the absence of a Gaussianity assumption. Theoretical results with PAC guarantees are established for the method, under assumptions commonly used in an Empirical Risk Minimization framework. The performance of the proposed method is evaluated on several synthetic data settings and benchmarked against existing approaches. The method is further illustrated on real datasets involving data from neuroscience and finance, respectively, and produces interpretable results.
URL: https://openreview.net/forum?id=beqSqPgE33
---
Title: Generalizable and Robust Spectral Method for Multi-view Representation Learning
Authors: Amitai Yacobi, Ofir Lindenbaum, Uri Shaham
Abstract: Multi-view representation learning (MvRL) has garnered substantial attention in recent years, driven by the increasing demand for applications that can effectively process and analyze data from multiple sources. In this context, graph Laplacian-based MvRL methods have demonstrated remarkable success in representing multi-view data. However, these methods often struggle with generalization to new data and face challenges with scalability. Moreover, in many practical scenarios, multi-view data is contaminated by noise or outliers. In such cases, modern deep-learning-based MvRL approaches that rely on alignment or contrastive objectives exhibit degraded performance in downstream tasks, as they may impose incorrect consistency between clean and corrupted data sources. We introduce *SpecRaGE*, a novel fusion-based framework that integrates the strengths of graph Laplacian methods with the power of deep learning to overcome these challenges. SpecRaGE uses neural networks to learn a parametric mapping that approximates a joint diagonalization of graph Laplacians. This solution bypasses the need for alignment while enabling generalizable and scalable learning of informative and meaningful representations. Moreover, it incorporates a meta-learning fusion module that dynamically adapts to data quality, ensuring robustness against outliers and noisy views. Our extensive experiments demonstrate that SpecRaGE outperforms state-of-the-art methods, particularly in scenarios with data contamination, paving the way for more reliable and efficient multi-view learning. Our code will be made publicly available upon acceptance.
URL: https://openreview.net/forum?id=X6IY04Akw1
---
Title: Out-of-Distribution Learning with Human Feedback
Authors: Haoyue Bai, Xuefeng Du, Katie Rainey, Shibin Parameswaran, Yixuan Li
Abstract: Out-of-distribution (OOD) learning often relies on strong statistical assumptions or predefined OOD data distributions, limiting its effectiveness in real-world deployment for both OOD generalization and detection, especially when human inspection is minimal. This paper introduces a novel framework for OOD learning that integrates human feedback to enhance model adaptation and reliability. Our approach leverages freely available unlabeled data in the wild, which naturally captures environmental test-time OOD distributions under both covariate and semantic shifts. To effectively utilize such data, we propose selectively acquiring human feedback to label a small subset of informative samples. These labeled samples are then used to train both a multi-class classifier and an OOD detector. By incorporating human feedback, our method significantly improves model robustness and precision in handling OOD scenarios. We provide theoretical insights by establishing generalization error bounds for our algorithm. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a significant margin. Code is publicly available at https://github.com/HaoyueBaiZJU/ood-hf.
URL: https://openreview.net/forum?id=5qo8MF3QU1
---
Title: A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning
Authors: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni
Abstract: The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a particular input or application. The promise, effectiveness, and large design space of MoErging has spurred the development of many new methods over the past few years. This rapid pace of development has made it challenging to compare different MoErging methods, which are rarely compared to one another and are often validated in different experimental setups. To remedy such gaps, we present a comprehensive survey of MoErging methods that includes a novel taxonomy for cataloging key design choices and clarifying suitable applications for each method. Apart from surveying MoErging research, we inventory software tools and applications that make use of MoErging. We additionally discuss related fields of study such as model merging, multitask learning, and mixture-of-experts models. Taken as a whole, our survey provides a unified overview of existing MoErging methods and creates a solid foundation for future work in this burgeoning field.
URL: https://openreview.net/forum?id=u0azVc9Y0y
---
Title: Investigating Generalization Behaviours of Generative Flow Networks
Authors: Lazar Atanackovic, Emmanuel Bengio
Abstract: Generative Flow Networks (GFlowNets, GFNs) are a generative framework for learning unnormalized probability mass functions over discrete spaces. Since their inception, GFlowNets have proven to be useful for learning generative models in applications where the majority of the discrete space is unvisited during training. This has inspired some to hypothesize that GFlowNets, when paired with deep neural networks (DNNs), have favorable generalization properties. In this work, we empirically verify some of the hypothesized mechanisms of generalization of GFlowNets. We accomplish this by introducing a novel graph-based benchmark environment where reward difficulty can be easily varied, $p(x)$ can be computed exactly, and an unseen test set can be constructed to quantify generalization performance. Using this graph-based environment, we are able to systematically test the hypothesized mechanisms of generalization of GFlowNets and put forth a set of empirical observations that summarize our findings. In particular, we find (and confirm) that the functions that GFlowNets learn to approximate have an implicit underlying structure which facilitates generalization. Surprisingly, and somewhat contrary to existing knowledge, we also find that GFlowNets are sensitive to being trained offline and off-policy. However, the reward implicitly learned by GFlowNets is robust to changes in the training distribution.
URL: https://openreview.net/forum?id=9L0B5N5hUX
---
Title: MiniFold: Simple, Fast, and Accurate Protein Structure Prediction
Authors: Jeremy Wohlwend, Mateo Reveiz, Matt McPartlon, Axel Feldmann, Wengong Jin, Regina Barzilay
Abstract: Protein structure prediction has emerged as a powerful tool for biologists and drug makers. However, the computational cost of state-of-the-art models such as AlphaFold limits their scalability and makes training and fine-tuning prohibitively expensive. Although previous work has achieved considerable inference speedups by replacing the multiple sequence alignment step with protein language models, the overall architecture of structure prediction models, inherited from AlphaFold2, has remained largely unchanged. In this work, we show that protein language model-based structure predictors can be dramatically simplified at little to no loss in accuracy. Our model, MiniFold, consists of a redesigned Evoformer and a lightweight structure module. We also propose two novel GPU kernels, tailored to the proposed architecture. Equipped with the same ESM2 protein language model, MiniFold is competitive with ESMFold on the standard CAMEO and CASP datasets while achieving training and inference speedups of up to 20x, and significant reductions in peak memory. Our results show that MiniFold is an effective solution for large-scale applications and resource-constrained environments.
URL: https://openreview.net/forum?id=1p9hQTbjgo
---
Title: RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning
Authors: Mingqi Yuan, Roger Creus Castanyer, Bo Li, Xin Jin, Wenjun Zeng, Glen Berseth
Abstract: Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks. However, extrinsic rewards frequently fall short in complex environments due to the significant human effort needed for their design and annotation. This limitation underscores the necessity for intrinsic rewards, which offer auxiliary and dense signals and can enable agents to learn in an unsupervised manner. Although various intrinsic reward formulations have been proposed, their implementation and optimization details are insufficiently explored and lack standardization, thereby hindering research progress. To address this gap, we introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward methods. Furthermore, we conduct an in-depth study that identifies critical implementation details and establishes well-justified standard practices in intrinsically-motivated RL. Our documentation, examples, and source code are available at [https://github.com/RLE-Foundation/RLeXplore](https://github.com/RLE-Foundation/RLeXplore).
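To make the notion of a dense, unsupervised intrinsic signal concrete, here is a minimal sketch of one of the simplest formulations, a count-based exploration bonus; this is an editor's illustration of the general idea, not RLeXplore's API (class and parameter names are hypothetical):

```python
from collections import defaultdict

class CountBonus:
    """Count-based intrinsic reward r_int(s) = beta / sqrt(N(s)):
    states visited less often earn a larger bonus, giving the agent
    a dense exploration signal without any extrinsic reward design.
    (Illustrative sketch only; not RLeXplore's implementation.)"""

    def __init__(self, beta=0.05):
        self.beta = beta
        self.counts = defaultdict(int)  # N(s), visit count per state

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / self.counts[state] ** 0.5

bonus = CountBonus()
# Repeated visits to the same state yield a decaying bonus.
rewards = [bonus("s0") for _ in range(4)]
```

In practice such a bonus is simply added to the environment reward at each step; frameworks like the one described above standardize exactly these implementation details (normalization, update schedule) across many richer formulations.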
URL: https://openreview.net/forum?id=B9BHjTN4z6
---
Title: Hyperparameters in Continual Learning: A Reality Check
Authors: Sungmin Cha, Kyunghyun Cho
Abstract: Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks) and stability (retaining prior knowledge). The conventional evaluation protocol, dominantly adopted for CL algorithms, selects the best hyperparameters (e.g., learning rate, mini-batch size, regularization strengths, etc.) within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose the Generalizable Two-phase Evaluation Protocol (GTEP), consisting of hyperparameter tuning and evaluation phases. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated under the conventional evaluation protocol.
URL: https://openreview.net/forum?id=hiiRCXmbAz
---
Title: When Are Bias-Free ReLU Networks Effectively Linear Networks?
Authors: Yedi Zhang, Andrew M Saxe, Peter E. Latham
Abstract: We investigate the implications of removing bias in ReLU networks regarding their expressivity and learning dynamics. We first show that two-layer bias-free ReLU networks have limited expressivity: the only odd function two-layer bias-free ReLU networks can express is a linear one. We then show that, under symmetry conditions on the data, these networks have the same learning dynamics as linear networks. This enables us to give analytical time-course solutions to certain two-layer bias-free (leaky) ReLU networks outside the lazy learning regime. While deep bias-free ReLU networks are more expressive than their two-layer counterparts, they still share a number of similarities with deep linear networks. These similarities enable us to leverage insights from linear networks to understand certain ReLU networks. Overall, our results show that some properties previously established for bias-free ReLU networks arise due to equivalence to linear networks.
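The two-layer claim admits a one-line illustration: since relu(z) - relu(-z) = z elementwise, the odd part of any bias-free two-layer ReLU network is exactly a linear map, so an odd such network must itself be linear. A minimal numerical check of this identity (an editor's sketch, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

# Bias-free two-layer ReLU network: f(x) = W2 @ relu(W1 @ x)
W1 = rng.standard_normal((8, 3))
W2 = rng.standard_normal((2, 8))
f = lambda x: W2 @ relu(W1 @ x)

# Odd part of f: (f(x) - f(-x)) / 2. Because relu(z) - relu(-z) = z
# holds elementwise, this collapses to the linear map x -> 0.5 * (W2 @ W1) @ x.
x = rng.standard_normal(3)
odd_part = 0.5 * (f(x) - f(-x))
assert np.allclose(odd_part, 0.5 * (W2 @ W1) @ x)
```

The same cancellation fails once biases are present (relu(z + b) - relu(-z + b) is no longer linear in z), which is why the restriction to bias-free networks matters.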
URL: https://openreview.net/forum?id=Ucpfdn66k2
---
Title: Accelerating Learned Image Compression Through Modeling Neural Training Dynamics
Authors: Yichi Zhang, Zhihao Duan, Yuning Huang, Fengqing Zhu
Abstract: As learned image compression (LIC) methods become increasingly computationally demanding, enhancing their training efficiency is crucial. This paper takes a step forward in accelerating the training of LIC methods by modeling the neural training dynamics. We first propose a Sensitivity-aware True and Dummy Embedding Training mechanism (STDET) that clusters LIC model parameters into a few separate modes where parameters are expressed as affine transformations of reference parameters within the same mode. By further utilizing the stable intra-mode correlations throughout training and parameter sensitivities, we gradually embed non-reference parameters, reducing the number of trainable parameters. Additionally, we incorporate a Sampling-then-Moving Average (SMA) technique, interpolating sampled weights from stochastic gradient descent (SGD) training to obtain the moving average weights, ensuring smooth temporal behavior and minimizing training state variances. Overall, our method significantly reduces training space dimensions and the number of trainable parameters without sacrificing model performance, thus accelerating model convergence. We also provide a theoretical analysis on the noisy quadratic model, showing that the proposed method achieves a lower training variance than standard SGD. Our approach offers valuable insights for further developing efficient training methods for LICs.
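The averaging idea behind SMA can be pictured as periodically sampling the SGD iterate and interpolating it into a running average; the toy sketch below uses an assumed interpolation coefficient and sampling interval (the paper's exact schedule may differ):

```python
import numpy as np

def sma_update(avg_w, sampled_w, alpha=0.1):
    """Interpolate a freshly sampled weight vector into the running
    moving average. Hypothetical form of the averaging step; alpha and
    the sampling schedule are assumptions, not the paper's values."""
    return (1.0 - alpha) * avg_w + alpha * sampled_w

# Toy loop: noisy SGD-like iterates on a quadratic, smoothed by SMA.
rng = np.random.default_rng(0)
w = np.zeros(4)
avg = w.copy()
for step in range(100):
    # noisy gradient step toward the optimum at 1.0
    w = w - 0.1 * (w - 1.0) + 0.05 * rng.standard_normal(4)
    if step % 5 == 0:  # sample the iterate every few steps
        avg = sma_update(avg, w)
```

The averaged weights `avg` track the optimum while fluctuating less than the raw iterates `w`, which is the smooth-temporal-behavior property the abstract refers to.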
URL: https://openreview.net/forum?id=nannw4SGfS
---
Title: Overcoming Knowledge Barriers: Online Imitation Learning from Visual Observation with Pretrained World Models
Authors: Xingyuan Zhang, Philip Becker-Ehmck, Patrick van der Smagt, Maximilian Karl
Abstract: Pretraining and finetuning models has become increasingly popular in decision-making. But there are still serious impediments in Imitation Learning from Observation (ILfO) with pretrained models. This study identifies two primary obstacles: the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB). The EKB emerges due to the pretrained models' limitations in handling novel observations, which leads to inaccurate action inference. Conversely, the DKB stems from the reliance on limited demonstration datasets, restricting the model's adaptability across diverse scenarios. We propose separate solutions to overcome each barrier and apply them to Action Inference by Maximising Evidence (AIME), a state-of-the-art algorithm. This new algorithm, AIME-NoB, integrates online interactions and a data-driven regulariser to mitigate the EKB. Additionally, it uses a surrogate reward function to broaden the policy's supported states, addressing the DKB. Our experiments on vision-based control tasks from the DeepMind Control Suite and MetaWorld benchmarks show that AIME-NoB significantly improves sample efficiency and converged performance, presenting a robust framework for overcoming the challenges in ILfO with pretrained models. Code available at https://github.com/IcarusWizard/AIME-NoB.
URL: https://openreview.net/forum?id=BaRD2Nfj41
---
Title: Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods
Authors: Andres Fernandez, Frank Schneider, Maren Mahsereci, Philipp Hennig
Abstract: Recently, it has been observed that when training a deep neural net with SGD, the majority of the loss landscape's curvature quickly concentrates in a tiny *top* eigenspace of the loss Hessian, which remains largely stable thereafter.
Independently, it has been shown that successful magnitude pruning masks for deep neural nets emerge early in training and remain stable thereafter.
In this work, we study these two phenomena jointly and show that they are connected:
We develop a methodology to measure the similarity between arbitrary parameter masks and Hessian eigenspaces via Grassmannian metrics. We identify *overlap* as the most useful such metric due to its interpretability and stability.
To compute *overlap*, we develop a matrix-free algorithm based on sketched SVDs that allows us to compute over 1000 Hessian eigenpairs for nets with over 10M parameters, a scale several orders of magnitude beyond prior work.
Our experiments reveal an *overlap* between magnitude parameter masks and top Hessian eigenspaces consistently higher than chance-level, and that this effect gets accentuated for larger network sizes.
This result indicates that *top Hessian eigenvectors tend to be concentrated around larger parameters*, or equivalently, that *larger parameters tend to align with directions of larger loss curvature*.
Our work provides a methodology to approximate and analyze deep learning Hessians at scale, as well as a novel insight into the structure of their eigenspaces.
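For intuition, a mask-vs-eigenspace overlap can be computed as the average fraction of each eigenvector's squared mass that falls on the masked coordinates; the sketch below uses one plausible form of this quantity (the paper's exact Grassmannian definition may differ), with chance level equal to the mask density:

```python
import numpy as np

def overlap(mask, V):
    """Overlap between the coordinate subspace selected by a boolean
    parameter mask and the span of orthonormal eigenvectors V (d x k):
    ||P_mask V||_F^2 / k, a value in [0, 1]. For a random k-dim
    subspace its expectation is the mask density mask.mean().
    (One plausible definition; an illustrative sketch only.)"""
    k = V.shape[1]
    return float((V[mask] ** 2).sum() / k)

rng = np.random.default_rng(0)
d, k = 1000, 20
# Random orthonormal basis standing in for a top Hessian eigenspace.
V, _ = np.linalg.qr(rng.standard_normal((d, k)))
mask = np.zeros(d, dtype=bool)
mask[: d // 10] = True  # "magnitude mask" keeping 10% of coordinates
ov = overlap(mask, V)   # close to the chance level 0.10 here
```

An overlap consistently above the mask density, as reported in the abstract, then indicates that top eigenvectors concentrate on the masked (large-magnitude) parameters.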
URL: https://openreview.net/forum?id=yGGoOVpBVP
---
Title: Dimension reduction via score ratio matching
Authors: Ricardo Baptista, Michael Brennan, Youssef Marzouk
Abstract: Gradient-based dimension reduction decreases the cost of Bayesian inference and probabilistic modeling by identifying maximally informative (and informed) low-dimensional projections of the data and parameters, allowing high-dimensional problems to be reformulated as cheaper low-dimensional problems. A broad family of such techniques identify these projections and provide error bounds on the resulting posterior approximations, via eigendecompositions of certain diagnostic matrices. Yet these matrices require gradients or even Hessians of the log-likelihood, excluding the purely data-driven setting and many problems of simulation-based inference. We propose a framework, derived from score-matching, to extend gradient-based dimension reduction to problems where gradients are unavailable. Specifically, we formulate an objective function to directly learn the score ratio function needed to compute the diagnostic matrices, propose a tailored parameterization for the score ratio network, and introduce regularization methods that capitalize on the hypothesized low-dimensional structure. We also introduce a novel algorithm to iteratively identify the low-dimensional reduced basis vectors more accurately with limited data based on eigenvalue deflation methods. We show that our approach outperforms standard score-matching for problems with low-dimensional structure, and demonstrate its effectiveness for PDE-constrained Bayesian inverse problems and conditional generative modeling.
URL: https://openreview.net/forum?id=mvbZBaqSXo
---
Title: Convex Relaxation for Solving Large-Margin Classifiers in Hyperbolic Space
Authors: Sheng Yang, Peihan Liu, Cengiz Pehlevan
Abstract: Hyperbolic spaces have increasingly been recognized for their outstanding performance in handling data with inherent hierarchical structures compared to their Euclidean counterparts. However, learning in hyperbolic spaces poses significant challenges. In particular, extending support vector machines to hyperbolic spaces is in general a constrained non-convex optimization problem. Previous and popular attempts to solve hyperbolic SVMs, primarily using projected gradient descent, are generally sensitive to hyperparameters and initializations, often leading to suboptimal solutions. In this work, by first rewriting the problem as a polynomial optimization, we apply semidefinite relaxation and sparse moment-sum-of-squares relaxation to effectively approximate the optima. In extensive empirical experiments, these methods achieve better classification accuracy than the projected gradient descent approach on most of the synthetic and real two-dimensional hyperbolic embedding datasets under the one-vs-rest multiclass classification scheme.
URL: https://openreview.net/forum?id=eIPwJgadfZ
---
Title: Can Kernel Methods Explain How the Data Affects Neural Collapse?
Authors: Vignesh Kothapalli, Tom Tirer
Abstract: A vast amount of literature has recently focused on the “Neural Collapse” (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within-class variability of the network’s deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. To address this limitation of UFMs, this paper explores the possibility of analyzing NC1 using kernels associated with shallow NNs. We begin by formulating an NC1 metric as a function of the kernel. Then, we specialize it to the NN Gaussian Process kernel (NNGP) and the Neural Tangent Kernel (NTK), associated with wide networks at initialization and during gradient-based training with a small learning rate, respectively. As a key result, we show that the NTK does not represent more collapsed features than the NNGP for Gaussian data of arbitrary dimensions. This showcases the limitations of data-independent kernels such as NTK in approximating the NC behavior of NNs. As an alternative to NTK, we then empirically explore a recently proposed data-aware Gaussian Process kernel, which generalizes NNGP to model feature learning. We show that this kernel yields lower NC1 than NNGP but may not follow the trends of the shallow NN. Our study demonstrates that adaptivity to data may allow kernel-based analysis of NC, though further advancements in this area are still needed. A nice byproduct of our study is showing both theoretically and empirically that the choice of nonlinear activation function affects NC1 (with ERF yielding lower values than ReLU).
URL: https://openreview.net/forum?id=MbF1gYfIlY
---
Title: ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization
Authors: Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal
Abstract: Parameter-efficient fine-tuning (PEFT) enables the creation of specialized language models for diverse tasks, resulting in numerous expert modules. In many practical use cases, these expert PEFT modules are integrated into a single model that answers arbitrary queries by routing them to different experts. However, only a few experts can be kept in GPU memory due to memory constraints. Consequently, expert modules are frequently loaded and offloaded between CPU/GPU memory or disk storage. This frequent swapping dramatically increases communication overhead, leading to unacceptable latency and degrading user experience. The large size of modern PEFT modules further exacerbates this latency. For example, QLoRA experts for the 65B LLaMA model are 3.2GB, making swapping a major communication bottleneck, particularly in memory-constrained environments. To address these issues, we present ComPEFT (compressed PEFT), a novel method for compressing fine-tuning residuals (task vectors) of PEFT models. Reducing expert PEFT module size effectively addresses both memory and communication limitations, facilitating faster swapping and enabling a higher density of experts within a given memory footprint. ComPEFT employs sparsification and ternary quantization to reduce PEFT module size without any additional training while preserving or enhancing model performance. In extensive evaluation across T5-, T0-, and LLaMA-based models with 200M–65B parameters, ComPEFT achieves compression ratios of 8x–50x. Specifically, we show that ComPEFT improves with scale: stronger models exhibit higher compressibility and better performance. We show that ComPEFT applied to LLaMA-65B outperforms QLoRA by 4.16% on MMLU with a 26x storage size reduction. Additionally, compressed experts produced by ComPEFT maintain few-shot compositional generalization capabilities, facilitate efficient communication and computation, and exhibit enhanced performance when merged.
Lastly, we provide an analysis of different method components, compare ComPEFT with other PEFT methods, and test its efficacy for compressing full fine-tuning residuals.
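The core compression recipe (magnitude sparsification followed by ternary quantization of the task vector) can be sketched in a few lines; the selection rule and scale below are an editor's assumptions for illustration, not ComPEFT's exact procedure:

```python
import numpy as np

def compress_task_vector(tau, density=0.1):
    """Sketch of sparsification + ternary quantization of a fine-tuning
    residual (task vector) tau = finetuned_weights - base_weights.
    Keeps the top `density` fraction of entries by magnitude and replaces
    them with sign(tau) * alpha, where alpha is the mean magnitude of the
    kept entries. Output entries lie in {-alpha, 0, +alpha}, so they can
    be stored as a sparse sign pattern plus one scalar.
    (Illustrative only; ComPEFT's exact scaling/selection may differ.)"""
    k = max(1, int(density * tau.size))
    thresh = np.partition(np.abs(tau), -k)[-k]  # k-th largest magnitude
    mask = np.abs(tau) >= thresh
    alpha = np.abs(tau[mask]).mean()
    return alpha * np.sign(tau) * mask

rng = np.random.default_rng(0)
# Synthetic task vector with a heavy-tailed magnitude profile.
tau = rng.standard_normal(10_000) * np.geomspace(1.0, 1e-3, 10_000)
comp = compress_task_vector(tau, density=0.05)
```

Because only the mask and a single scale survive, the stored size shrinks by orders of magnitude, which is what makes frequent expert swapping cheap.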
URL: https://openreview.net/forum?id=CovLQwu611
---
Title: MaxCutBench: Revisiting and Benchmarking Graph Neural Networks for Maximum Cut
Authors: Ankur Nath, Alan Kuhnle
Abstract: Recently, there has been much work on designing general heuristics for graph-based, combinatorial optimization problems via the incorporation of Graph Neural Networks (GNNs) to learn distribution-specific solution structures. However, there is a lack of consistency in evaluating these heuristics in terms of the baselines and instances chosen, making it difficult to assess the relative performance of the algorithms. In this paper, we introduce \textbf{MaxCutBench}, an open-source benchmark suite dedicated to the NP-hard Maximum Cut problem. The suite offers a unified interface for $16$ algorithms, both traditional and machine-learning-based. Using our benchmark, we conduct an in-depth analysis of the implemented algorithms on a carefully selected set of hard instances from diverse graph datasets. Our main finding is that classical local search heuristics can outperform several highly cited learning-based approaches, including S2V-DQN (Khalil et al., 2017) and ECO-DQN (Barrett et al., 2020), among others, in terms of objective value, generalization, inference time, and scalability. Additionally, we find that the performance of ECO-DQN either remains the same or improves when the GNN is replaced by simple linear regression. We hope our benchmark will contribute to the community's efforts to standardize the evaluation of learned heuristics for combinatorial optimization. Code, data, and pre-trained models are available at: \url{https://github.com/ankurnath/MaxCut-Bench}.
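As a reference point for what "classical local search" means here, a minimal single-flip MaxCut heuristic of the kind such benchmarks compare against can be written in a few lines (an editor's baseline sketch, not MaxCutBench's code):

```python
import random

def local_search_maxcut(adj, seed=0):
    """Single-flip local search for MaxCut: start from a random
    bipartition and greedily flip any vertex whose move increases the
    cut, until no improving flip exists. `adj` maps each vertex
    0..n-1 to its neighbor list. Returns (sides, cut_value)."""
    rng = random.Random(seed)
    n = len(adj)
    side = [rng.randint(0, 1) for _ in range(n)]
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Gain of flipping v = (#same-side nbrs) - (#cross nbrs).
            gain = sum(1 if side[u] == side[v] else -1 for u in adj[v])
            if gain > 0:
                side[v] = 1 - side[v]
                improved = True
    cut = sum(1 for v in range(n) for u in adj[v]
              if u > v and side[u] != side[v])
    return side, cut

# 4-cycle: the optimal cut is 4 (alternate sides around the cycle).
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
side, cut = local_search_maxcut(adj)
```

Even this tiny heuristic only guarantees a local optimum (on the 4-cycle it can stop at cut 2 from an unlucky start), which is why the benchmark's stronger local search variants, and careful instance selection, matter when comparing against learned approaches.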
URL: https://openreview.net/forum?id=322PpCGAX8
---
Title: Future-aware Safe Active Learning of Time Varying Systems using Gaussian Processes
Authors: Markus Lange-Hegermann, Christoph Zimmer
Abstract: Experimental exploration of high-cost systems with safety constraints, common in engineering applications, is a challenging endeavor. Data-driven models offer a promising solution, but acquiring the requisite data remains expensive and potentially unsafe. Safe active learning techniques are therefore essential, enabling the learning of high-quality models from few expensive data points while maintaining safety. This paper introduces a safe active learning framework tailored to time-varying systems, addressing drift, seasonal changes, and complexities due to dynamic behavior. The proposed Time-aware Integrated Mean Squared Prediction Error (T-IMSPE) method minimizes posterior variance over current and future states, extending information gathering into the time domain. Empirical results on synthetic and real-world examples highlight T-IMSPE's advantages in model quality. State-of-the-art Gaussian processes are compatible with T-IMSPE. Our theoretical contributions include a clear delineation of which Gaussian process kernels, domains, and weighting measures are suitable for T-IMSPE, and beyond that for its non-time-aware predecessor IMSPE.
URL: https://openreview.net/forum?id=YBPbMKJbLd
---
Title: VLM’s Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
Authors: Nam Hyeon-Woo, Moon Ye-Bin, Wonseok Choi, Lee Hyun, Tae-Hyun Oh
Abstract: Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks; however, our understanding of their visual perception remains limited. In this work, we propose an eye-examination process to investigate how a VLM perceives images, covering key aspects of visual recognition from basic color and shape to semantic understanding. We introduce a dataset, LENS, to guide a VLM through the examination and to check its readiness. Once the model is ready, we conduct the examination, quantifying and visualizing VLMs' sensitivity to color, shape, and semantic matching. Our findings reveal that VLMs have varying sensitivity to different colors, while consistently showing insensitivity to green across models. We also find that shape sensitivity and semantic recognition vary with the LLM's capacity, despite all models using the same fixed visual encoder. Our analyses and findings have the potential to inspire the design of VLMs and of visual-input pre-processing for improving application performance.
URL: https://openreview.net/forum?id=CgWkVb2lHB
---
Title: Jet: A Modern Transformer-Based Normalizing Flow
Authors: Alexander Kolesnikov, André Susano Pinto, Michael Tschannen
Abstract: In the past, normalizing flows emerged as a promising class of generative models for natural images. This type of model has many advantages: the ability to efficiently compute the log-likelihood of input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of their samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve a much simpler architecture that matches existing normalizing flow models and improves over them when paired with pretraining. While the overall visual quality is still behind current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building components of more powerful generative models.
URL: https://openreview.net/forum?id=jdvnaki7ZY
---
Title: Deep Koopman Learning using Noisy Data
Authors: Wenjian Hao, Devesh Upadhyay, Shaoshuai Mou
Abstract: This paper proposes a data-driven framework to learn a finite-dimensional approximation of a Koopman operator for approximating the state evolution of a dynamical system under noisy observations. The proposed solution has two main advantages. First, it only requires the measurement noise to be bounded. Second, it modifies existing deep Koopman operator formulations by characterizing the effect of the measurement noise on Koopman operator learning and then mitigating it by updating the tunable parameters of the observable functions, making it easy to implement. The performance of the proposed method is demonstrated on several standard benchmarks and compared with similar methods from the recent Koopman learning literature.
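The linear-algebra core behind finite-dimensional Koopman approximations is a least-squares fit between time-shifted snapshot matrices, as in (E)DMD. The paper's contributions, learned deep observables and mitigation of bounded measurement noise, sit on top of this; the sketch below shows only the shared least-squares core.

```python
import numpy as np

def fit_koopman(snapshots):
    """Least-squares sketch of a finite-dimensional Koopman approximation:
    given observable snapshots g(x_0), ..., g(x_T) as columns, fit K
    minimizing ||Y - K X||_F, where X holds g(x_0)..g(x_{T-1}) and Y holds
    g(x_1)..g(x_T). Deep Koopman methods learn the observables g as well;
    here they are fixed and a plain pseudo-inverse is used."""
    X = snapshots[:, :-1]
    Y = snapshots[:, 1:]
    return Y @ np.linalg.pinv(X)

# For a noise-free linear system x_{t+1} = A x_t with the state itself as
# the observable, the fit recovers A exactly.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
x = np.array([1.0, 1.0])
traj = [x]
for _ in range(10):
    x = A @ x
    traj.append(x)
K = fit_koopman(np.array(traj).T)
```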
URL: https://openreview.net/forum?id=j6Rm6T2lFU
---
Title: CoDe: Blockwise Control for Denoising Diffusion Models
Authors: Anuj Singh, Sayak Mukherjee, Ahmad Beirami, Hadi J. Rad
Abstract: Aligning diffusion models to downstream tasks often requires finetuning new models or gradient-based guidance at inference time to enable sampling from the reward-tilted posterior. In this work, we explore a simple inference-time, gradient-free guidance approach, called controlled denoising (CoDe), that circumvents the need for differentiable guidance functions and model finetuning. CoDe is a blockwise sampling method applied during intermediate denoising steps, allowing for alignment with downstream rewards. Our experiments demonstrate that, despite its simplicity, CoDe offers a favorable trade-off between reward alignment, prompt instruction following, and inference cost, achieving competitive performance against state-of-the-art baselines. Our code is available at https://github.com/anujinho/code.
URL: https://openreview.net/forum?id=DqPCWMiMU0
---
Title: Autoregressive Models in Vision: A Survey
Authors: Jing Xiong, Gongye Liu, Lun Huang, Chengyue Wu, Taiqiang Wu, Yao Mu, Yuan Yao, Hui Shen, Zhongwei Wan, Jinfa Huang, Chaofan Tao, Shen Yan, Huaxiu Yao, Lingpeng Kong, Hongxia Yang, Mi Zhang, Guillermo Sapiro, Jiebo Luo, Ping Luo, Ngai Wong
Abstract: Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary across different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision.
To improve readability for researchers from diverse research backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models based on the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey.
URL: https://openreview.net/forum?id=1BqXkjNEGP
---
Title: Expressivity of Representation Learning on Continuous-Time Dynamic Graphs: An Information-Flow Centric Review
Authors: Sofiane ENNADIR, Gabriela Zarzar Gandler, Filip Cornell, Lele Cao, Oleg Smirnov, Tianze Wang, Levente Zólyomi, Björn Brinne, Sahar Asadi
Abstract: Graphs are ubiquitous in real-world applications, ranging from social networks to biological systems, and have inspired the development of Graph Neural Networks (GNNs) for learning expressive representations. While most research has centered on static graphs, many real-world scenarios involve dynamic, temporally evolving graphs, motivating the need for Continuous-Time Dynamic Graph (CTDG) models. This paper provides a comprehensive review of Graph Representation Learning (GRL) on CTDGs with a focus on Self-Supervised Representation Learning (SSRL). We introduce a novel theoretical framework that analyzes the expressivity of CTDG models through an Information-Flow (IF) lens, quantifying their ability to propagate and encode temporal and structural information. Leveraging this framework, we categorize existing CTDG methods based on their suitability for different graph types and application scenarios. Within the same scope, we examine the design of SSRL methods tailored to CTDGs, such as predictive and contrastive approaches, highlighting their potential to mitigate the reliance on labeled data. Empirical evaluations on synthetic and real-world datasets validate our theoretical insights, demonstrating the strengths and limitations of various methods across long-range, bi-partite and community-based graphs. This work offers both a theoretical foundation and practical guidance for selecting and developing CTDG models, advancing the understanding of GRL in dynamic settings.
URL: https://openreview.net/forum?id=M7Lhr2anjg
---
New submissions
===============
Title: Graph Transformer based Large Neighborhood Search via Expert Guided Reinforcement Learning
Abstract: Mixed-Integer Linear Programs (MILPs) have wide-ranging applications across various fields. Recently, significant research efforts have been directed towards developing learning-based Large Neighborhood Search (LNS) methods for efficiently identifying high-quality MILP solutions. Most existing works focus on imitation learning, which faces two major challenges: (i) the performance is limited by the expert policy itself, and (ii) learning the graph representation of MILPs cannot effectively explore the global graph structure due to the limited receptive field of Graph Convolutional Networks (GCNs). To address these issues, we propose a novel expert-guided reinforcement learning model to design the destroy operator in LNS. In our approach, the expert provides weighted guidance to help the learning agent train efficiently. Additionally, we introduce a novel graph transformer-based network, which captures both local and global information efficiently. We prove that our graph transformer-based network is more expressive than the 1-WL test and can successfully distinguish non-isomorphic MILP graphs. We conduct extensive evaluations to demonstrate the effectiveness of our proposed algorithm, showing significant improvements in the performance of LNS for MILPs.
URL: https://openreview.net/forum?id=iHldssy2Z8
---
Title: MUC: Machine Unlearning for Contrastive Learning with Black-box Evaluation
Abstract: Machine unlearning offers effective solutions for revoking the influence of specific training data on pre-trained model parameters. While existing approaches address unlearning for classification and generative models, they overlook an important category of machine learning models: contrastive learning (CL) methods. This paper addresses this gap by introducing the Machine Unlearning for Contrastive Learning (MUC) framework and adapting existing methods. We identify limitations in current approaches, noting that several methods perform inadequately as unlearners and that existing evaluation tools insufficiently validate unlearning effects in contrastive learning. To address these issues, we propose Alignment Calibration (AC), a novel method that explicitly considers contrastive learning properties and optimizes towards new auditing metrics for easy verification of unlearning. Through empirical comparisons with baseline methods on SimCLR, MoCo, and CLIP, we demonstrate that AC: (1) achieves state-of-the-art performance, approximating exact unlearning (retraining); (2) enables data owners to clearly visualize unlearning effects through black-box evaluation.
URL: https://openreview.net/forum?id=F9pjSDvuM9
---
Title: SortBench: Benchmarking LLMs based on their ability to sort lists
Abstract: Sorting is a tedious but simple task for human intelligence and can be solved fairly easily algorithmically. However, for Large Language Models (LLMs) this task is surprisingly hard, as some properties of sorting are among known weaknesses of LLMs: being faithful to the input data, making logical comparisons between values, and strictly differentiating between syntax (used for sorting) and semantics (typically learned by embeddings). In this paper, we describe the new SortBench benchmark for LLMs, which comes with different difficulties and can easily be scaled in terms of difficulty. We apply this benchmark to seven state-of-the-art LLMs, including current test-time reasoning models. Our results show that while the o3-mini model is very capable at sorting in general, even it can be fooled if strings mix syntactic and semantic aspects, e.g., by asking it to sort numbers written out as words. Furthermore, all models have problems staying faithful to the input on long lists, i.e., they drop items and add new ones. Our results also show that test-time reasoning models tend to overthink problems, which degrades performance. Finally, models without test-time reasoning, like GPT-4o, are not much worse than reasoning models.
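The two failure modes highlighted above, unfaithfulness (dropping or inventing items) and incorrect ordering, are mechanically checkable. The routine below is a hypothetical scoring sketch in the spirit of such a benchmark, not SortBench's actual metric; the function name and report fields are invented for illustration.

```python
from collections import Counter

def evaluate_sort(input_items, model_output):
    """Score a model's attempt to sort a list: check (a) faithfulness, i.e.
    the output is a permutation of the input, reporting dropped and
    hallucinated items, and (b) whether the output is in sorted order."""
    in_counts = Counter(input_items)
    out_counts = Counter(model_output)
    dropped = list((in_counts - out_counts).elements())   # items the model lost
    added = list((out_counts - in_counts).elements())     # items it invented
    is_sorted = all(a <= b for a, b in zip(model_output, model_output[1:]))
    return {"faithful": not dropped and not added, "sorted": is_sorted,
            "dropped": dropped, "added": added}

# A model that drops "pear" and invents "fig" fails faithfulness even
# though its output happens to be in order.
report = evaluate_sort(["pear", "apple", "kiwi"], ["apple", "fig", "kiwi"])
```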
URL: https://openreview.net/forum?id=xGhg4RFBHc
---
Title: Continual learning via probabilistic exchangeable sequence modelling
Abstract: Continual learning (CL) refers to the ability to continuously learn and accumulate new knowledge while retaining useful information from past experiences. Although numerous CL methods have been proposed in recent years, it is not straightforward to deploy them directly in real-world decision-making problems due to their computational cost and lack of uncertainty quantification. To address these issues, we propose CL-BRUNO, a probabilistic, Neural Process-based CL model that performs scalable and tractable Bayesian update and prediction. Our proposed approach uses deep generative models to create a unified probabilistic framework capable of handling different types of CL problems, such as task- and class-incremental learning, allowing users to integrate information across different CL scenarios using a single model. Our approach prevents catastrophic forgetting through distributional and functional regularisation without retaining any previously seen samples, making it appealing for applications where data privacy or storage capacity is a concern. Experiments show that CL-BRUNO outperforms existing methods on both natural image and biomedical datasets, confirming its effectiveness in real-world applications.
URL: https://openreview.net/forum?id=fDnAsRUk0F
---
Title: FLD+: Data-efficient Evaluation Metric for Generative Models
Abstract: We introduce a new metric to assess the quality of generated images that is more reliable, data-efficient, compute-efficient, and adaptable to new domains than previous metrics, such as the Fréchet Inception Distance (FID). The proposed metric is based on normalizing flows, which allow for the computation of the density (exact log-likelihood) of images from any domain. Thus, unlike FID, the proposed Flow-based Likelihood Distance Plus (FLD+) metric exhibits strongly monotonic behavior with respect to different types of image degradation, including noise, occlusion, diffusion steps, and generative model size. Additionally, because normalizing flows can be trained stably and efficiently, FLD+ achieves stable results with two orders of magnitude fewer images than FID (which requires large samples of real and generated images to reliably compute the Fréchet distance between their features). We make FLD+ computationally even more efficient by applying normalizing flows to features extracted in a lower-dimensional latent space instead of using a pre-trained network. We also show that FLD+ can easily be retrained on new domains, such as medical images, unlike the networks behind previous metrics, such as InceptionNetV3 pre-trained on ImageNet.
URL: https://openreview.net/forum?id=ApBs8aWUYA
---
Title: DNR-Pruning: Sparsity-Aware Pruning via Dying Neuron Reactivation in Convolutional Neural Networks
Abstract: In this paper, we challenge the conventional view of dead neurons, neurons that cease to activate, during deep neural network training. Traditionally regarded as problematic due to their association with optimization challenges and reduced model adaptability over training epochs, dead neurons are often seen as a hindrance. However, we present a novel perspective, demonstrating that they can be effectively leveraged to enhance network sparsity. Specifically, we propose DNR-Pruning, a dying-neuron-reactivation-based, sparsity-aware pruning approach for convolutional neural networks (CNNs) that exploits the behavior of individual neurons during training. Through a systematic exploration of hyperparameter configurations, we show that dying neurons can be harnessed to improve pruning algorithms. Our method dynamically monitors the occurrence of dying neurons, enabling adaptive sparsification throughout CNN training. Extensive experiments on diverse datasets demonstrate that DNR-Pruning outperforms existing sparsity-aware pruning techniques while achieving competitive results compared to state-of-the-art methods. These findings suggest that dying neurons can serve as an efficient mechanism for network compression and resource optimization in CNNs, opening new avenues for more efficient and high-performance deep learning models.
URL: https://openreview.net/forum?id=ymUjGCNPYa
---
Title: What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions
Abstract: Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We show that the embeddings from autoregressive models correspond to predictive sufficient statistics. By identifying settings where the predictive sufficient statistics are interpretable distributions over latent variables, including exchangeable models and latent state models, we show that embeddings of autoregressive models encode these explainable quantities of interest. We conduct empirical probing studies to extract information from transformers about latent generating distributions. Furthermore, we show that these embeddings generalize to out-of-distribution cases, do not exhibit token memorization, and that the information we identify is more easily recovered than other related measures. Next, we extend our analysis of exchangeable models to more realistic scenarios where the predictive sufficient statistic is difficult to identify by focusing on an interpretable subcomponent of language, topics. We show that large language models encode topic mixtures inferred by latent Dirichlet allocation (LDA) in both synthetic datasets and natural corpora.
URL: https://openreview.net/forum?id=YyMACp98Kz
---
Title: Beyond Grids: Multi-objective Bayesian Optimization With Adaptive Discretization
Abstract: We consider the problem of optimizing a vector-valued objective function $\boldsymbol{f}$ sampled from a Gaussian Process (GP) whose index set is a well-behaved, compact metric space $(\mathcal{X},d)$ of designs. We assume that $\boldsymbol{f}$ is not known beforehand and that evaluating $\boldsymbol{f}$ at design $x$ results in a noisy observation of $\boldsymbol{f}(x)$. Since identifying the Pareto optimal designs via exhaustive search is infeasible when the cardinality of $\mathcal{X}$ is large, we propose an algorithm, called Adaptive $\boldsymbol{\epsilon}$-PAL, that exploits the smoothness of the GP-sampled function and the structure of $(\mathcal{X},d)$ to learn fast. In essence, Adaptive $\boldsymbol{\epsilon}$-PAL employs a tree-based adaptive discretization technique to identify an $\boldsymbol{\epsilon}$-accurate Pareto set of designs in as few evaluations as possible. We provide both information-type and metric dimension-type bounds on the sample complexity of $\boldsymbol{\epsilon}$-accurate Pareto set identification. We also experimentally show that our algorithm outperforms other Pareto set identification methods.
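The exhaustive search that the abstract calls infeasible for large design spaces is easy to state, and it is exactly the comparison that Adaptive ε-PAL's tree-based discretization is designed to avoid. A minimal baseline Pareto filter over a finite set of (noise-free) objective vectors, under maximization:

```python
def pareto_set(points):
    """Exhaustive Pareto filtering: keep a point unless some other point
    dominates it, i.e. is >= in every objective and > in at least one."""
    def dominates(p, q):
        return (all(a >= b for a, b in zip(p, q))
                and any(a > b for a, b in zip(p, q)))
    return [p for p in points
            if not any(dominates(q, p) for q in points)]

# (2, 2) is dominated by (2, 4); the other four points form the front.
pts = [(1, 5), (2, 4), (3, 3), (2, 2), (0, 6)]
front = pareto_set(pts)
```

The quadratic cost of this filter in the number of designs, plus the need to evaluate every design, is what motivates adaptive discretization when the design space is large or continuous.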
URL: https://openreview.net/forum?id=Wq150HaRVE
---
Title: YoooP: You Only Optimize One Prototype per Class for Non-Exemplar Incremental Learning
Abstract: Incremental learning (IL) usually addresses catastrophic forgetting of old tasks when learning new tasks by replaying old tasks' raw data stored in a memory, which can be limited by its size and the risk of privacy leakage. Recent non-exemplar IL methods store class centroids as prototypes and perturb them with high-dimensional Gaussian noise to generate synthetic data for replaying. Unfortunately, this approach has two major limitations. First, the boundary between embedding clusters around prototypes of different classes might be unclear, leading to serious catastrophic forgetting. Second, directly applying high-dimensional Gaussian noise produces nearly identical synthetic samples that fail to preserve the true data distribution, ultimately degrading performance. In this paper, we propose YoooP, a novel exemplar-free IL approach that can greatly outperform previous methods by only storing and replaying one prototype per class even without synthetic data replay. Instead of merely storing class centroids, YoooP optimizes each prototype by (1) shifting it to high-density regions within each class using an attentional mean-shift algorithm, and (2) optimizing its cosine similarity with class-specific embeddings to form compact, well-separated clusters. As a result, replaying only the optimized prototypes effectively reduces inter-class interference and maintains clear decision boundaries. Furthermore, we extend YoooP to YoooP+ by synthesizing replay data preserving the angular distribution between each class prototype and the class's real data in history, which cannot be obtained by high-dimensional Gaussian perturbation. YoooP+ effectively stabilizes and further improves YoooP without storing real data. Extensive experiments demonstrate the superiority of YoooP/YoooP+ over non-exemplar baselines in terms of different metrics. The source code will be released upon acceptance of the paper.
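The prototype-optimization step (1) above can be illustrated with a plain kernel-weighted mean-shift update. This is one plausible reading of the paper's "attentional mean-shift algorithm"; the Gaussian kernel, bandwidth, and step count below are assumptions for the sketch, not the paper's specification.

```python
import numpy as np

def mean_shift_prototype(embeddings, proto, bandwidth=1.0, steps=10):
    """Shift a class prototype toward a high-density region of the class's
    embeddings: repeatedly recompute the prototype as a kernel-weighted
    (attention-like) average of the embeddings around it."""
    for _ in range(steps):
        d2 = ((embeddings - proto) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))   # weights decay with distance
        proto = (w[:, None] * embeddings).sum(axis=0) / w.sum()
    return proto

# A tight cluster near the origin plus one far outlier: the plain centroid
# is pulled toward the outlier, while the shifted prototype settles in the
# dense region.
rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.1, size=(20, 2))
embeddings = np.vstack([cluster, [[10.0, 10.0]]])
proto = mean_shift_prototype(embeddings, embeddings.mean(axis=0))
```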
URL: https://openreview.net/forum?id=FYe66NLDkO
---
Title: Synthesizing world models for bilevel planning
Abstract: Modern reinforcement learning (RL) systems have demonstrated remarkable capabilities in complex environments, such as video games. However, they still fall short of achieving human-like sample efficiency and adaptability when learning new domains. Theory-based reinforcement learning (TBRL) is an algorithmic framework specifically designed to address this gap. Modeled on cognitive theories, TBRL leverages structured, causal world models---``theories''---as forward simulators for use in planning, generalization and exploration. Although current TBRL systems provide compelling explanations of how humans learn to play video games, they face several technical limitations: their theory languages are restrictive, and their planning algorithms are not scalable. To address these challenges, we introduce TheoryCoder, an instantiation of TBRL that exploits hierarchical representations of theories and efficient program synthesis methods for more powerful learning and planning. TheoryCoder equips agents with general-purpose abstractions (e.g., ``move to''), which are then grounded in a particular environment by learning a low-level transition model (a Python program synthesized from observations by a large language model). A bilevel planning algorithm can exploit this hierarchical structure to solve large domains. We demonstrate that this approach can be successfully applied to diverse and challenging grid-world games, where approaches based on directly synthesizing a policy perform poorly. Ablation studies demonstrate the benefits of using hierarchical abstractions.
URL: https://openreview.net/forum?id=m9V4JHLJrD
---
Title: PSformer: Parameter-efficient Transformer with Segment Shared Attention for Time Series Forecasting
Abstract: Time series forecasting remains a critical challenge across various domains, often complicated by high-dimensional data and long-term dependencies. This paper presents a novel transformer architecture for time series forecasting, incorporating two innovative designs: a parameter sharing (PS) module and Segment Shared Attention (SSA). The proposed model, PSformer, reduces the number of training parameters through the integrated parameter sharing mechanism without sacrificing performance. A spatiotemporal segment is defined as a patch spanning spatial variables over a local time window. The introduction of SSA enhances the model's ability to capture local spatio-temporal dependencies and improves its global representation by integrating information across segments. Consequently, the combination of parameter sharing and SSA reduces the model's parameter count while enhancing forecasting performance. Extensive experiments on benchmark datasets demonstrate that PSformer outperforms many baseline approaches in terms of accuracy and scalability, positioning it as an effective and scalable tool for time series forecasting. Code can be found at https://anonymous.4open.science/r/PSformer_Anonymous-3E11.
URL: https://openreview.net/forum?id=LgeOwZgnb7
---
Title: Generalizable Spectral Embedding with an Application to UMAP
Abstract: Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks which curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvectors separation. Existing SE implementations often address two of these drawbacks; however, they fall short in addressing the remaining one. In this paper, we introduce $\textit{Sep-SpectralNet}$ (eigenvector-separated SpectralNet), a SE implementation designed to address $\textit{all}$ three limitations. Sep-SpectralNet extends SpectralNet with an efficient post-processing step to achieve eigenvectors separation, while ensuring both generalizability and scalability. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications. We empirically demonstrate Sep-SpectralNet's ability to consistently approximate and generalize SE, while maintaining scalability. Additionally, we show how Sep-SpectralNet can be leveraged to enable generalizable UMAP visualization. Our code will be publicly available upon acceptance.
URL: https://openreview.net/forum?id=8cuQwztCKk
---
Title: Sortability of Time Series Data
Abstract: Evaluating the performance of causal discovery algorithms that aim to find causal relationships between time-dependent processes remains a challenging topic. In this paper, we show that certain characteristics of datasets, such as varsortability (Reisach et al. 2021) and R2-sortability (Reisach et al. 2023), also occur in datasets of autocorrelated stationary time series. We illustrate this empirically using four types of data: simulated data based on SVAR models and Erdős-Rényi graphs, the data used in the 2019 causality-for-climate challenge (Runge et al. 2019), real-world river stream datasets, and real-world data generated by the Causal Chamber (Gamella et al. 2024). To do this, we adapt var- and R2-sortability to time series data. We also investigate the extent to which the performance of continuous score-based causal discovery methods goes hand in hand with high sortability. Arguably, our most surprising finding is that the investigated real-world datasets exhibit high varsortability and low R2-sortability, indicating that scales may carry a significant amount of causal information.
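The intuition behind varsortability is that, in many simulated linear models, marginal variance accumulates along the causal direction. The sketch below checks this only along direct edges; the published definition averages over all directed paths and handles ties, and the paper's adaptation to time series adds further machinery, so treat this as a simplified illustration.

```python
import numpy as np

def edge_varsortability(data, edges):
    """Fraction of causal edges (i -> j) along which the marginal variance
    increases. A value of 1.0 means variances sort perfectly along the
    graph, the situation variance-based orderings can exploit."""
    var = data.var(axis=0)
    agree = sum(var[j] > var[i] for i, j in edges)
    return agree / len(edges)

# Linear chain X0 -> X1 -> X2 with unit edge weights: each downstream
# variable accumulates fresh noise, so variances grow along the chain.
rng = np.random.default_rng(0)
n = 5000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
score = edge_varsortability(np.column_stack([x0, x1, x2]), [(0, 1), (1, 2)])
```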
URL: https://openreview.net/forum?id=OGvmCpcHdV
---
Title: Algorithmic fairness with monotone likelihood ratios
Abstract: We show that inequalities of many commonly used fairness metrics (true/false positive/negative rates, predicted positive/negative rates, and positive/negative predictive values) are guaranteed for groups with different outcome rates under a monotonically calibrated model whose risk distributions have a monotone likelihood ratio, extending existing impossibility results. We further provide lower bounds on the FNR/FPR disparities and PPR/PNR disparities in the same setting, showing that either the FNR disparity or FPR disparity is at least as large as the positive outcome rate disparity (for FNR disparity) or negative outcome rate disparity (for FPR disparity), and either the PPR disparity or PNR disparity is at least as large as the positive outcome rate disparity (for PPR disparity) or negative outcome rate disparity (for PNR disparity). While incompatibilities of some combinations of these metrics have been demonstrated previously, we are unaware of any work that has demonstrated direct incompatibility of calibration with these individual equalities, equivalence of these inequalities, or lower bounds for the disparity in these values under distributional assumptions about a model's predictions.
URL: https://openreview.net/forum?id=mtoWa0gIKy
---
Title: Trust Me, I’m Calibrated: Robustifying Deep Networks
Abstract: The tremendous success of deep neural networks (DNNs) in solving a wide range of complex computer vision tasks has paved the way for their deployment in real-world applications. However, challenges arise when these models are exposed to natural adversarial corruptions that can occur in unconstrained physical environments. Such corruptions are inherently present in the real world and can significantly degrade model performance by causing incorrect predictions. This vulnerability is further exacerbated by the miscalibration of modern DNNs, where models tend to output incorrect predictions with high confidence. To ensure safe and reliable deployment, it is crucial to calibrate these models correctly. While the existing literature primarily focuses on calibrating DNNs, it often overlooks the impact of adversarial corruption. Thus, substantial scope remains to explore how calibration techniques interact with adversarial robustness and whether improving calibration can increase robustness to corrupted or adversarial data. In this work, we aim to address this gap by employing uncertainty quantification methods to improve the calibration and robustness of DNNs and Transformer-based models against adversarial data.
URL: https://openreview.net/forum?id=qcNrlJd2S7
---
Title: GROOD: Gradient-Aware Out-of-Distribution Detection
Abstract: Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models in real-world applications. Existing methods typically focus on feature representations or output-space analysis, often assuming a distribution over these spaces or leveraging gradient norms with respect to model parameters. However, these approaches struggle to distinguish near-OOD samples and often require extensive hyper-parameter tuning, limiting their practicality.
In this work, we propose GRadient-aware Out-Of-Distribution detection (GROOD), a method that derives an OOD prototype from synthetic samples and computes class prototypes directly from in-distribution (ID) training data. By analyzing the gradients of a nearest-class-prototype loss function with respect to an artificial OOD prototype, our approach achieves a clear separation between in-distribution and OOD samples.
Experimental evaluations demonstrate that gradients computed from the OOD prototype enhance the distinction between ID and OOD data, surpassing established baselines in robustness, particularly on ImageNet-1k. These findings highlight the potential of gradient-based methods and prototype-driven approaches in advancing OOD detection within deep neural networks.
URL: https://openreview.net/forum?id=2V7itvvMVJ
---
Title: Utilising Gradient-Based Proposals Within Sequential Monte Carlo Samplers for Training of Partial Bayesian Neural Networks
Abstract: Partial Bayesian neural networks (pBNNs) have been shown to perform competitively with fully Bayesian neural networks while only having a subset of the parameters be stochastic. Using sequential Monte Carlo (SMC) samplers as the inference method for pBNNs gives a non-parametric probabilistic estimation of the stochastic parameters, and has shown improved performance over parametric methods. In this paper we introduce a new SMC-based training method for pBNNs by utilising a guided proposal and incorporating gradient-based Markov kernels, which gives us better scalability on high dimensional problems. We show that our new method outperforms the state-of-the-art in terms of predictive performance and optimal loss. We also show that pBNNs scale well with larger batch sizes, resulting in significantly reduced training times and often better performance.
URL: https://openreview.net/forum?id=miT5oN8YwX
---
Title: BRL-Attention: Toward Linearly Regularizing the Geometric Bottleneck of Linear Generalized Attention
Abstract: Transformers excel across domains, yet their full self-attention carries a prohibitive $\mathcal{O}(n^2)$ cost for long sequences with length $n$. Existing \textit{efficient} attention methods either restrict the attention pattern (local/sparse attention) or approximate the softmax kernel, each with certain drawbacks. The former suffers from attention bottlenecks (over-squashing of long-range dependencies) and invalidates the use of global tokens in autoregressive tasks, while the latter often requires sequential processing that can degrade in accuracy when approximations fall short. In this work, we introduce a novel attention mechanism, \textit{Bottleneck Regularized Linear Attention (BRL-Attention)}, uniting the strengths of pattern-based and kernel-based techniques to enable efficient, global information flow with linear complexity. BRL-Attention extends a local attention pattern with a small set of compressed tokens that serve as a global information reservoir, ensuring long-range interactions without quadratic cost. This bottleneck regularization strategy effectively alleviates the geometric attention bottleneck and retains full expressiveness; that is, it matches the sequence modeling capacity of full softmax attention while mitigating over-squashing across layers. Moreover, it integrates global tokens without breaking causal masking, making it applicable to both encoder-only and autoregressive decoder architectures. Extensive experiments on long-sequence and graph benchmarks show that BRL-Attention matches or exceeds the predictive performance of standard Transformers with full attention, while substantially reducing memory usage and computation time. These results underscore its potential as a scalable, drop-in replacement for existing attention mechanisms.
URL: https://openreview.net/forum?id=Vpyg3fqXbl
---
Title: Pre-Training Representations of Binary Code Using Contrastive Learning
Abstract: Binary code analysis and comprehension is critical to applications in reverse engineering and computer security tasks where source code is not available. Unfortunately, unlike source code, binary code lacks semantics and is more difficult for human engineers to understand and analyze. In this paper, we present ContraBin, a contrastive learning technique that integrates source code and comment information along with binaries to create an embedding capable of aiding binary analysis and comprehension tasks. Specifically, we present three components in ContraBin: (1) a primary contrastive learning method for initial pre-training, (2) a simplex interpolation method to integrate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to train a binary code embedding. We further analyze the impact of human-written and synthetic comments on binary code comprehension tasks, revealing a significant performance disparity. While synthetic comments provide substantial benefits, human-written comments are found to introduce noise, even resulting in performance drops compared to using no comments. These findings reshape the narrative around the role of comment types in binary code analysis. We evaluate the effectiveness of ContraBin through four indicative downstream tasks related to binary code: algorithmic functionality classification, function name recovery, code summarization, and reverse engineering. The results show that ContraBin considerably improves performance on all four tasks, measured by accuracy, mean average precision, and BLEU scores as appropriate. ContraBin is the first language representation model to incorporate source code, binary code, and comments into contrastive code representation learning and is intended to contribute to the field of binary code analysis. The dataset used in this study is available for further research.
URL: https://openreview.net/forum?id=qmfUL6D0iz
---
Title: Private Regression via Data-Dependent Sufficient Statistic Perturbation
Abstract: Sufficient statistic perturbation (SSP) is a widely used method for differentially private linear regression. SSP adopts a data-independent approach where privacy noise from a simple distribution is added to sufficient statistics. However, sufficient statistics can often be expressed as linear queries and better approximated by data-dependent mechanisms. In this paper we introduce data-dependent SSP for linear regression based on post-processing privately released marginals, and find that it outperforms state-of-the-art data-independent SSP. We extend this result to logistic regression by developing an approximate objective that can be expressed in terms of sufficient statistics, resulting in a novel and highly competitive SSP approach for logistic regression. We also make a connection to synthetic data for machine learning: for models with sufficient statistics, training on synthetic data corresponds to data-dependent SSP, with the overall utility determined by how well the mechanism answers these linear queries.
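For context, the data-independent SSP baseline that this paper improves on can be sketched as follows: perturb the sufficient statistics $X^\top X$ and $X^\top y$ with noise, then solve the normal equations. This is a minimal sketch; the `noise_scale` is assumed to be pre-calibrated to a privacy budget and data bounds, which the sketch does not perform.

```python
import numpy as np

def dp_linear_regression_ssp(X, y, noise_scale, rng=None):
    """Data-independent SSP sketch: perturb the sufficient statistics
    X^T X and X^T y with Gaussian noise, then solve the normal equations.
    noise_scale is assumed pre-calibrated to the desired (epsilon, delta)
    privacy budget and to bounds on the data; that calibration is omitted."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    xtx = X.T @ X + rng.normal(0.0, noise_scale, size=(d, d))
    xtx = (xtx + xtx.T) / 2  # symmetrize the noisy Gram matrix
    xty = X.T @ y + rng.normal(0.0, noise_scale, size=d)
    # least squares guards against an ill-conditioned noisy Gram matrix
    return np.linalg.lstsq(xtx, xty, rcond=None)[0]

# sanity check: with zero noise this reduces to ordinary least squares
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = dp_linear_regression_ssp(X, y, noise_scale=0.0)
```

The paper's data-dependent variant instead answers these linear queries via privately released marginals and post-processing, which the sketch above does not capture.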
URL: https://openreview.net/forum?id=gtCfDKm9ME
---
Title: MobileCLIP2: Improving Multi-Modal Reinforced Training
Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-XL matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency. We will release the data generation code and our pretrained models. The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.
URL: https://openreview.net/forum?id=WeF9zolng8
---
Title: Intern-GS: Vision Model Guided Sparse-View 3D Gaussian Splatting
Abstract: Sparse-view scene reconstruction often faces significant challenges due to the constraints imposed by limited observational data. These limitations result in incomplete information, leading to suboptimal reconstructions using existing methodologies. To address this, we present Intern-GS, a novel approach that effectively leverages rich prior knowledge from vision foundation models to enhance the process of sparse-view Gaussian Splatting, thereby enabling high-quality scene reconstruction. Specifically, Intern-GS utilizes vision foundation models to guide both the initialization and the optimization process of 3D Gaussian splatting, effectively addressing the limitations of sparse inputs. In the initialization process, our method employs DUSt3R to generate a dense and non-redundant Gaussian point cloud. This approach significantly alleviates the limitations encountered by traditional structure-from-motion (SfM) methods, which often struggle under sparse-view constraints. During the optimization process, vision foundation models predict depth and appearance for unobserved views, refining the 3D Gaussians to compensate for missing information in unseen regions. Extensive experiments demonstrate that Intern-GS achieves state-of-the-art rendering quality across diverse datasets, including both forward-facing and large-scale scenes, such as LLFF, DTU, and Tanks and Temples.
URL: https://openreview.net/forum?id=H31Js3M96Z
---
Title: Task Diversity Shortens the In-Context Learning Plateau
Abstract: In-context learning (ICL) describes a language model's ability to generate outputs based on a set of input demonstrations and a subsequent query. To understand this remarkable capability, researchers have studied simplified, stylized models. These studies have consistently observed long loss plateaus, during which models exhibit minimal improvement, followed by a sudden, rapid surge of learning. In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data.
URL: https://openreview.net/forum?id=7t5DzaJOdB
---
Title: Bayesian Optimization of Robustness Measures under Input Uncertainty: A Randomized Gaussian Process Upper Confidence Bound Approach
Abstract: Bayesian optimization based on the Gaussian process upper confidence bound (GP-UCB) offers a theoretical guarantee for optimizing black-box functions. In practice, however, black-box functions often involve input uncertainty. To handle such cases, GP-UCB can be extended to optimize evaluation criteria known as robustness measures. However, GP-UCB-based methods for robustness measures require a trade-off parameter, $\beta$, which, as in the original GP-UCB, must be set sufficiently large to ensure theoretical validity. In this study, we propose randomized robustness measure GP-UCB (RRGP-UCB), a novel method that samples $\beta$ from a chi-squared-based probability distribution. This approach eliminates the need to explicitly specify $\beta$. Notably, the expected value of $\beta$ under this distribution is not excessively large. Furthermore, we show that RRGP-UCB provides tight bounds on the expected regret between the optimal and estimated solutions. Numerical experiments demonstrate the effectiveness of the proposed method.
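The core idea can be sketched in a few lines: rather than fixing the trade-off parameter $\beta$, each acquisition draws it from a chi-squared-based distribution. In this illustrative sketch the degrees of freedom and shift are placeholders, not the paper's exact distribution.

```python
import numpy as np

def randomized_ucb(mu, sigma, rng=None, df=2.0, shift=0.0):
    """Sketch of a randomized GP-UCB acquisition: instead of fixing the
    trade-off parameter beta, draw it from a (shifted) chi-squared
    distribution at each iteration. df and shift are illustrative
    placeholders; the paper specifies the exact distribution."""
    rng = np.random.default_rng(rng)
    beta = shift + rng.chisquare(df)       # random trade-off parameter
    return mu + np.sqrt(beta) * sigma      # maximize over candidates

# toy usage with posterior means/stddevs at three candidate points
mu = np.array([0.1, 0.5, 0.3])
sigma = np.array([0.2, 0.1, 0.4])
scores = randomized_ucb(mu, sigma, rng=0)
best = int(np.argmax(scores))
```

Because the expected value of the sampled $\beta$ stays moderate, the acquisition avoids the over-exploration caused by the large deterministic $\beta$ schedules required for the classical GP-UCB guarantees.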
URL: https://openreview.net/forum?id=FDzojiLSia
---
Title: Approximations to worst-case data dropping: unmasking failure modes
Abstract: A data analyst might worry about generalization if dropping a very small fraction of data points from a study could change its substantive conclusions. Checking this non-robustness directly poses a combinatorial optimization problem and is intractable even for simple models and moderate data sizes. Recently, various authors have proposed a diverse set of approximations to detect this non-robustness. In the present work, we show that, even in a setting as simple as ordinary least squares (OLS) linear regression, many of these approximations can fail to detect (true) non-robustness in realistic data arrangements. We focus on OLS in the present work due to its widespread use and since some approximations work only for OLS. Of the approximations that do not fail our tests, we find not only that a simple recursive greedy algorithm is the most conceptually straightforward but also that it can be orders of magnitude faster to run than the others.
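A minimal sketch of such a recursive greedy procedure for an OLS coefficient is below. This naive version refits one model per candidate point at every step, so it is illustrative only; the paper's algorithms and selection criteria may differ.

```python
import numpy as np

def greedy_drop(X, y, k, coef_index=0):
    """Recursive greedy heuristic sketch: repeatedly refit OLS and drop
    the single point whose removal most changes the coefficient of
    interest. Returns the dropped indices and the final coefficient."""
    keep = list(range(len(y)))
    for _ in range(k):
        base = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0][coef_index]

        def change(i):
            # coefficient shift caused by removing point i alone
            sub = [j for j in keep if j != i]
            b = np.linalg.lstsq(X[sub], y[sub], rcond=None)[0][coef_index]
            return abs(b - base)

        keep.remove(max(keep, key=change))
    dropped = sorted(set(range(len(y))) - set(keep))
    final = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0][coef_index]
    return dropped, final

# intercept-only toy model: one outlier flips the fitted mean
X = np.ones((5, 1))
y = np.array([1.0, 1.0, 1.0, 1.0, -10.0])
dropped, final = greedy_drop(X, y, k=1)
```

In this toy example the greedy step correctly identifies the outlier as the single most influential point.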
URL: https://openreview.net/forum?id=m6EQ6YdPXV
---
Title: The Over-Certainty Phenomenon in Modern Test-Time Adaptation Algorithms
Abstract: When neural networks are confronted with unfamiliar data that deviate from their training set, this signifies a domain shift. While these networks output predictions on their inputs, they typically fail to account for their level of familiarity with these novel observations. Prevailing works navigate test-time adaptation with the goal of curtailing model entropy, yet they unintentionally produce models that struggle with sub-optimal calibration—a dilemma we term the over-certainty phenomenon. This over-certainty in predictions can be particularly dangerous in the setting of domain shifts, as it may lead to misplaced trust. In this paper, we propose a solution that not only maintains accuracy but also addresses calibration by mitigating the over-certainty phenomenon. Our method achieves state-of-the-art performance in terms of expected calibration error and negative log likelihood, all while maintaining parity in accuracy.
URL: https://openreview.net/forum?id=AGQRij8iUC
---
Title: VSCoDe: Visual-Augmentation Selection for Contrastive Decoding
Abstract: Despite the impressive performance of recent Large Vision-Language Models (LVLMs), these models often produce inaccurate responses. To address this issue, previous studies have aimed to reduce hallucinations by using contrastive decoding (CD) with modified images, such as cropping objects related to the query or adding noise, thereby contrasting with the original image. However, these methods have several limitations. First, employing a fixed visual augmentation, such as adding noise, is a simple approach but too rigid to provide contrast across varied queries. Conversely, using semantics in queries or images by leveraging external models can adaptively generate contrastive images, but it entails significant additional costs. To address these shortcomings, we explore using pre-defined visual augmentations to enable flexible adaptation to each query without relying on external models. We observe that each query achieves different contrasts through different visual augmentations. Based on this, we propose a novel method called VSCoDe, Visual-Augmentation Selection for Contrastive Decoding, which adaptively selects augmentations using a proposed distance metric to identify those with higher contrast. Our empirical evaluations demonstrate that VSCoDe outperforms previous methods and enhances the quality of various vision-language tasks without additional training or reliance on external models.
URL: https://openreview.net/forum?id=CqSyPc9W7Y
---
Title: Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Abstract: Evaluations of large language model (LLM) risks and capabilities are increasingly being incorporated into AI risk management and governance frameworks. Currently, most risk evaluations are conducted by designing inputs that elicit harmful behaviors from the system. However, this approach suffers from two limitations. First, input-output evaluations cannot evaluate realistic risks from open-weight models. Second, the behaviors identified during any particular input-output evaluation can only lower-bound the model's worst-possible-case input-output behavior. As a complementary method for eliciting harmful behaviors, we propose evaluating LLMs with model tampering attacks which allow for modifications to latent activations or weights. We pit state-of-the-art techniques for removing harmful LLM capabilities against a suite of 5 input-space and 6 model tampering attacks. In addition to benchmarking these methods against each other, we show that (1) model resilience to capability elicitation attacks lies on a low-dimensional robustness subspace; (2) the attack success rate of model tampering attacks can empirically predict and offer conservative estimates for the success of held-out input-space attacks; and (3) state-of-the-art unlearning methods can easily be undone within 16 steps of fine-tuning. Together these results highlight the difficulty of suppressing harmful LLM capabilities and show that model tampering attacks enable substantially more rigorous evaluations than input-space attacks alone.
URL: https://openreview.net/forum?id=E60YbLnQd2
---
Title: Between Linear and Sinusoidal: Rethinking the Time Encoder in Dynamic Graph Learning
Abstract: Dynamic graph learning is essential for applications involving temporal networks and requires effective modeling of temporal relationships. Seminal attention-based models like TGAT and DyGFormer rely on sinusoidal time encoders to capture temporal relationships between edge events. In this paper, we study a simpler alternative: the linear time encoder, which avoids temporal information loss caused by sinusoidal functions and reduces the need for high dimensional time encoders. We show that the self-attention mechanism can effectively learn to compute time spans from linear time encodings and extract relevant temporal patterns. Through extensive experiments on six dynamic graph datasets, we demonstrate that the linear time encoder improves the performance of TGAT and DyGFormer in most cases. Moreover, the linear time encoder can lead to significant savings in model parameters with minimal performance loss. For example, compared to a 100-dimensional sinusoidal time encoder, TGAT with a 2-dimensional linear time encoder saves 43% of parameters and achieves higher average precision on five datasets. These results can be readily used to positively impact the design choices of a wide variety of dynamic graph learning architectures.
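The contrast between the two encoders can be sketched directly. The sinusoidal encoder below uses fixed geometric frequencies for illustration (TGAT learns them), and the 2-dimensional linear encoding pairs a scaled time span with a constant feature so attention can learn affine functions of time; the exact parameterization in the paper may differ.

```python
import numpy as np

def sinusoidal_time_encoding(t, d):
    """TGAT-style encoder: cos(t * w_i) over d frequencies.
    Frequencies are fixed geometrically here for illustration;
    in TGAT they are learnable."""
    w = 1.0 / 10.0 ** np.linspace(0, 9, d)
    return np.cos(np.outer(t, w))

def linear_time_encoding(t, scale=1.0):
    """The linear alternative studied in the paper: the (scaled) time
    span itself, paired with a constant feature. scale is an
    illustrative normalizer, not the paper's exact choice."""
    t = np.asarray(t, dtype=float)
    return np.stack([t * scale, np.ones_like(t)], axis=-1)

# encode three time spans (e.g., current time minus edge timestamps)
t = np.array([0.0, 5.0, 100.0])
sin_enc = sinusoidal_time_encoding(t, 100)     # 100-dimensional
lin_enc = linear_time_encoding(t, scale=0.01)  # 2-dimensional
```

The parameter saving cited in the abstract comes precisely from this dimensionality gap: downstream projection layers act on a 2-dimensional rather than a 100-dimensional time feature.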
URL: https://openreview.net/forum?id=W6GQvdOGHg
---
Title: GRAD-T: Graph Regularized Attention-based Diffusion Model for Analysis of Contextual Emotion Contagion
Abstract: We propose a Graph Regularized-Attention-Based Diffusion Transformer (GRAD-T) model, which uses kernel temporal attention and a regularized sparse graph method to model and analyze general diffusion processes over networks. The proposed model uses the spatiotemporal nature of data generated from diffusion processes over networks to examine phenomena that vary across different locations and time, such as disease outbreaks, climate patterns, ecological changes, information flows, news contagion, transportation flows, or information and sentiment contagion over social networks. The kernel attention models the temporal dependence of diffusion processes within locations, and the regularized spatial attention mechanism accounts for the spatial diffusion process. The proposed regularization, using a combination of penalized matrix estimation and a resampling approach, helps in modeling high-dimensional data from large graphical networks and in identifying the dominant diffusion pathways. We use the model to predict how emotions spread across sparse networks. We applied our model to a unique dataset of COVID-19 tweets that we curated, spanning March to December 2020 across various U.S. locations. We used model parameters (attention measures) to create indices for comparing emotion diffusion potential within and between nodes. Our findings show that negative emotions like fear, anger, and disgust demonstrate substantial potential for temporal and spatial diffusion. We will release the dataset for public consumption. Using the dataset and the proposed method, we demonstrate that different types of emotions exhibit different patterns of temporal and spatial diffusion. We show that our model improves prediction accuracy of emotion diffusion over social media networks compared with standard models such as LSTM and CNN methods.
Our key contribution is the regularized graph transformer using a penalty and a resampling approach to enhance the robustness, interpretability, and scalability of sparse graph learning.
URL: https://openreview.net/forum?id=DLVhL7dz1c
---
Title: Bridging VMP and CEP: Theoretical Insights for Connecting Different Approximate Bayesian Inference Methods
Abstract: Approximate Bayesian inference (ABI) methods have become indispensable tools in modern machine learning and statistics for approximating intractable posterior distributions. Despite extensive studies and applications across diverse domains, the theoretical connections among these methods have remained relatively unexplored. This paper takes the first step to uncover the underlying relationships between two widely employed ABI techniques: the variational message passing (VMP) and the conditional expectation propagation (CEP) methods. Through rigorous mathematical analysis, we demonstrate a strong connection between these two approaches under mild conditions, from optimization as well as graphical model perspectives. This newly unveiled connection not only enhances our understanding of the performance and convergence properties of VMP and CEP, but it also facilitates the cross-fertilization of their respective strengths. For instance, we prove the convergence of CEP and enable an online variant of VMP through this connection. Furthermore, our findings provide insights into the underlying relationships and distinctive characteristics of other ABI methods, shedding new light on the understanding and development of more advanced ABI techniques. To validate our theoretical findings, we derive and analyze various ABI methods within the context of Bayesian tensor decomposition, a fundamental tool in machine learning research. Specifically, we show that these two approaches yield the same updates within this context and illustrate how the established connection can be leveraged to construct a streaming version of the VMP-based Bayesian tensor decomposition algorithm.
URL: https://openreview.net/forum?id=alODfvLNuP
---
Title: On Joint Regularization and Calibration in Deep Ensembles
Abstract: Deep ensembles are a powerful tool in machine learning, improving both model performance and uncertainty calibration. While ensembles are typically formed by training and tuning models individually, evidence suggests that jointly tuning the ensemble can lead to better performance. This paper investigates the impact of jointly tuning weight decay, temperature scaling, and early stopping on both predictive performance and uncertainty quantification. Additionally, we propose a partially overlapping holdout strategy that relaxes the need for a common holdout set, thereby increasing ensemble diversity. Our results demonstrate that jointly tuning the ensemble matches or improves performance across all conditions, with significant variation in effect size. We highlight the trade-offs between individual and joint optimization in deep ensemble training, with the overlapping holdout strategy offering an attractive practical solution. We believe our findings provide valuable insights and guidance for practitioners looking to optimize deep ensemble models.
URL: https://openreview.net/forum?id=6xqV7DP3Ep
---
Title: Solving Inverse Problems using Diffusion with Iterative Colored Renoising
Abstract: Imaging inverse problems can be solved in an unsupervised manner using pre-trained diffusion models, but doing so requires approximating the gradient of the measurement-conditional score function in the diffusion reverse process. We show that the approximations produced by existing methods are relatively poor, especially early in the reverse process, and so we propose a new approach that iteratively reestimates and "renoises" the estimate several times per diffusion step. This iterative approach, which we call Fast Iterative REnoising (FIRE), injects colored noise that is shaped to ensure that the pre-trained diffusion model always sees white noise, in accordance with how it was trained. We then embed FIRE into the DDIM reverse process and show that the resulting "DDfire" offers state-of-the-art accuracy and runtime on several linear inverse problems, as well as phase retrieval.
URL: https://openreview.net/forum?id=RZv8FcQDPW
---
Title: Reliable and Responsible Foundation Models
Abstract: Foundation models, including autoregressive generative models (e.g., Large Language Models and Large Multimodal Models) and generative diffusion models (e.g., Text-to-Image Models and Video Generative Models), are essential tools with broad applications across various domains such as law, medicine, education, finance, and beyond. As these models are increasingly deployed in real-world scenarios, ensuring their reliability and responsibility has become critical for academia, industry, and government. This survey addresses the reliable and responsible development of foundation models. We explore critical issues, including bias and fairness, security and privacy, uncertainty, explainability, and distribution shift. Our research also covers model limitations, such as hallucinations, as well as methods like alignment and Artificial Intelligence-Generated Content (AIGC) detection. For each area, we review the current state of the field and outline concrete future research directions. Additionally, we discuss the intersections between these areas, highlighting their connections and shared challenges. We hope our survey fosters the development of foundation models that are not only powerful but also ethical, trustworthy, reliable, and socially responsible.
URL: https://openreview.net/forum?id=nLJZh4M6S5
---
Title: DIVINE: Diverse-Inconspicuous Feature Learning to Mitigate Abridge Learning
Abstract: Deep learning algorithms aim to minimize overall error and exhibit impressive performance on test datasets across various domains. However, they often struggle with out-of-distribution data samples. We posit that deep models primarily focus on capturing the prominent features beneficial for the task while neglecting other subtle yet discriminative features. This phenomenon is referred to as Abridge Learning. To address this issue and promote a more comprehensive learning process from data, we introduce a novel DIVerse and INconspicuous feature lEarning (DIVINE) approach aimed at counteracting Abridge Learning. DIVINE embodies a holistic learning methodology, effectively utilizing data by engaging with its diverse dominant features. Through experiments conducted on ten datasets,
including MNIST, CIFAR10, CIFAR100, TinyImageNet, and their corrupted and perturbed counterparts (CIFAR10-C, CIFAR10-P, CIFAR100-C, CIFAR100-P, TinyImageNet-C, and TinyImageNet-P), we demonstrate that DIVINE encourages the learning of a rich set of features. This, in turn, boosts the model’s robustness and its ability to generalize. On perturbed out-of-distribution datasets, DIVINE achieves mean Flip Rates (mFR) of 5.36%, 3.10%, and 21.85% on CIFAR10-P, CIFAR100-P, and TinyImageNet-P, respectively, whereas Abridge Learning yields 6.53%, 11.75%, and 31.90% mFR on the same datasets. The proposed DIVINE algorithm achieves state-of-the-art (SOTA) results on the CIFAR100-P dataset when compared to existing algorithms.
URL: https://openreview.net/forum?id=8NGKGTAD6F
---
Title: CYCle: Choosing Your Collaborators Wisely to Enhance Collaborative Fairness in Decentralized Learning
Abstract: Collaborative learning (CL) enables multiple participants to jointly train machine learning (ML) models on decentralized data sources without raw data sharing. While the primary goal of CL is to maximize the expected accuracy gain for each participant, it is also important to ensure that the gains are **fairly** distributed. Specifically, no client should be negatively impacted by the collaboration, and the individual gains must ideally be commensurate with the contributions. Most existing CL algorithms require central coordination and focus on the gain maximization objective while ignoring collaborative fairness. In this work, we first show that the existing measure of collaborative fairness based on the correlation between accuracy values without and with collaboration has drawbacks because it does not account for negative collaboration gain. We argue that maximizing mean collaboration gain (MCG) while simultaneously minimizing the collaboration gain spread (CGS) is a fairer alternative. Next, we propose the CYCle protocol that enables individual participants in a private decentralized learning (PDL) framework to achieve this objective through a novel reputation scoring method based on gradient alignment between the local cross-entropy and distillation losses. We further extend the CYCle protocol to operate on top of gossip-based decentralized algorithms such as Gossip-SGD. For the simple mean estimation problem with two participants, we also theoretically show that CYCle performs better than standard FedAvg, especially when there is large statistical heterogeneity. Experiments on the CIFAR-10, CIFAR-100, and Fed-ISIC2019 datasets empirically demonstrate the effectiveness of the CYCle protocol to ensure positive and fair collaboration gain for all participants, even in cases where the data distributions of participants are highly skewed.
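The gradient-alignment idea behind the reputation score can be sketched as a cosine similarity between the local cross-entropy gradient and the distillation gradient contributed by a collaborator. This is an illustrative sketch; CYCle's exact scoring, normalization, and aggregation may differ.

```python
import numpy as np

def reputation_score(grad_ce, grad_distill):
    """Gradient-alignment scoring sketch: a collaborator whose
    distillation gradient points in the same direction as the local
    cross-entropy gradient earns a high score; misaligned collaborators
    are clipped to zero weight. (Illustrative; not CYCle's exact rule.)"""
    g1 = np.ravel(grad_ce)
    g2 = np.ravel(grad_distill)
    cos = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2) + 1e-12)
    return max(cos, 0.0)

# a collaborator pushing in the same direction vs. the opposite one
aligned = reputation_score(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
opposed = reputation_score(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))
```

Clipping negative alignment to zero is one simple way to realize the fairness goal in the abstract: a participant whose contribution would hurt the local objective simply receives no weight.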
URL: https://openreview.net/forum?id=ygqNiLQqfH
---
Title: A Unified Approach Towards Active Learning and Out-of-Distribution Detection
Abstract: In real-world applications of deep learning models, active learning (AL) strategies are essential for identifying label candidates from vast amounts of unlabeled data. In this context, robust out-of-distribution (OOD) detection mechanisms are crucial for handling data outside the target distribution during the application’s operation. Usually, these problems have been addressed separately. In this work, we introduce SISOM as a unified solution designed explicitly for AL and OOD detection. By combining feature space-based and uncertainty-based metrics, SISOM leverages the strengths of the currently independent tasks to solve both effectively, without requiring specific training schemes. We conducted extensive experiments showing the problems arising when migrating between both tasks. In our experiments, SISOM underlined its effectiveness by achieving first place in two of the commonly used OpenOOD benchmark settings and second place in the remaining one for near-OOD data. In AL, SISOM outperforms other methods and delivers top-1 performance in three benchmarks.
URL: https://openreview.net/forum?id=HL75La10FN
---
Title: A note on the $k$-means clustering for missing data
Abstract: The classical $k$-means clustering algorithm requires complete data and cannot be directly applied when observations contain missing entries. An intuitive and computationally efficient extension addresses this issue by minimizing the $k$-means loss over the observed entries only, a strategy considered in several studies. This method is known as $k$-POD clustering. In this paper, we provide a theoretical analysis of this approach and demonstrate that it is generally inconsistent, even under the missing completely at random (MCAR) assumption. Specifically, we show that the expected loss being minimized asymptotically differs from the original $k$-means objective, leading to biased estimates of cluster centers in the large-sample limit. This highlights a fundamental limitation: the method may fail to recover the true underlying cluster structure, even in settings where $k$-means performs well on fully observed data. Nevertheless, when the missing rate per variable is sufficiently low and the dimensionality is high, the method can still produce stable and practically useful results, making it a viable alternative when the complete-case analysis is ineffective.
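The strategy under analysis, minimizing the $k$-means loss over observed entries only, reduces to an assignment step that ignores missing coordinates. A minimal sketch of that per-iteration step (variable names ours):

```python
import numpy as np

def kpod_assign(X, mask, centers):
    """Assign each row of X to the nearest center using only the observed
    entries (mask == True). This is the intuitive k-POD-style assignment
    the paper analyzes: computationally convenient, but shown to yield
    asymptotically biased centers even under MCAR."""
    n = X.shape[0]
    labels = np.empty(n, dtype=int)
    for i in range(n):
        obs = mask[i]
        d = ((X[i, obs] - centers[:, obs]) ** 2).sum(axis=1)
        labels[i] = int(np.argmin(d))
    return labels
```

Iterating this assignment with center updates over observed entries gives the full procedure; the paper's point is that the loss being minimized differs in expectation from the complete-data $k$-means objective.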
URL: https://openreview.net/forum?id=pcqlTvePXS
---
Title: A Comprehensive Survey on Knowledge Distillation
Abstract: Deep Neural Networks (DNNs) have achieved notable performance in the fields of computer vision and natural language processing with various applications in both academia and industry. However, with recent advancements in DNNs and transformer models with a tremendous number of parameters, deploying these large models on edge devices causes serious issues such as high runtime and memory consumption. This is especially concerning with the recent large-scale foundation models, Vision-Language Models (VLMs), and Large Language Models (LLMs). Knowledge Distillation (KD) is one of the prominent techniques proposed to address the aforementioned problems using a teacher-student architecture. More specifically, a lightweight student model is trained using additional knowledge from a cumbersome teacher model. In this work, a comprehensive survey of knowledge distillation methods is proposed. This includes reviewing KD from different aspects: distillation sources, distillation schemes, distillation algorithms, distillation by modalities, applications of distillation, and comparison among existing methods. In contrast to most existing surveys, which are either outdated or simply update former surveys, this work proposes a comprehensive survey with a new point of view and representation structure that categorizes and investigates the most recent methods in knowledge distillation. This survey considers various critically important subcategories, including KD for diffusion models, 3D inputs, foundation models, transformers, and LLMs. Furthermore, existing challenges in KD and possible future research directions are discussed.
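The teacher-student setup the survey covers is commonly trained with the classic Hinton-style distillation objective: cross-entropy on hard labels plus a temperature-softened KL term against the teacher. A self-contained sketch (hyperparameters illustrative):

```python
import numpy as np

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic teacher-student distillation objective: a weighted sum of
    cross-entropy on the hard labels and KL divergence between the
    temperature-softened teacher and student distributions. The KL term
    is scaled by T**2 to keep gradient magnitudes comparable."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p_s = softmax(student_logits)
    ce = -np.log(p_s[np.arange(len(labels)), labels]).mean()
    p_t = softmax(teacher_logits / T)
    p_s_T = softmax(student_logits / T)
    kl = (p_t * (np.log(p_t) - np.log(p_s_T))).sum(axis=-1).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

Many of the distillation sources and schemes the survey categorizes (feature-based, relation-based, online, self-distillation) can be read as variations on which signals replace or augment the KL term here.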
URL: https://openreview.net/forum?id=3cbJzdR78B
---
Title: Text-Guided Video Amodal Completion
Abstract: Amodal perception enables humans to perceive entire objects even when parts are occluded, a remarkable cognitive skill that artificial intelligence struggles to replicate. While substantial advancements have been made in image amodal completion, video amodal completion remains underexplored despite its high potential for real-world applications in video editing and analysis. In response, we propose a video amodal completion framework to explore this direction. Our contributions include (i) a synthetic dataset for video amodal completion with text descriptions for the objects of interest; the dataset captures a variety of object types, textures, motions, and scenarios to support zero-shot transfer to natural videos. (ii) A diffusion-based, text-guided video amodal completion framework enhanced with a motion continuity module to ensure temporal consistency across frames. (iii) A zero-shot inference scheme for long videos, inspired by temporal diffusion techniques, that effectively manages long video sequences while improving inference accuracy and maintaining coherent amodal completions. Experimental results show the efficacy of our approach in handling video amodal completion, opening potential capabilities for advanced video editing and analysis with amodal completion.
URL: https://openreview.net/forum?id=o7taRJqXWJ
---
Title: BELLA: Black-box model Explanations by Local Linear Approximations
Abstract: Understanding the decision-making process of black-box models has become not just a legal requirement, but also an additional way to assess their performance. However, state-of-the-art post-hoc explanation approaches for regression models rely on synthetic data generation, which introduces uncertainty and can hurt the reliability of the explanations. Furthermore, they tend to produce explanations that apply to only very few data points. In this paper, we present BELLA, a deterministic model-agnostic post-hoc approach for explaining the individual predictions of regression black-box models. BELLA provides explanations in the form of a linear model trained in the feature space. BELLA maximizes the size of the neighborhood to which the linear model applies so that the explanations are accurate, simple, general, and robust.
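The key idea, fitting a local linear surrogate on real data and maximizing the neighborhood to which it applies, can be sketched as follows. The greedy nearest-neighbor growth and the mean-absolute-error stopping rule are our simplifications, not BELLA's exact criterion:

```python
import numpy as np

def local_linear_explain(x0, X, predict, tol=1.0):
    """Grow the neighborhood around x0 by nearest neighbors and fit a
    least-squares linear surrogate on real (not synthetic) points; keep
    the largest neighborhood whose surrogate still tracks the black-box
    `predict` within `tol` mean absolute error. Returns the linear
    weights followed by the intercept, trained in feature space."""
    order = np.argsort(np.linalg.norm(X - x0, axis=1))
    best = None
    for m in range(X.shape[1] + 2, len(X) + 1):
        idx = order[:m]
        A = np.hstack([X[idx], np.ones((m, 1))])
        coef, *_ = np.linalg.lstsq(A, predict(X[idx]), rcond=None)
        err = np.abs(A @ coef - predict(X[idx])).mean()
        if err > tol:
            break
        best = coef
    return best
```

Because the surrogate is fit on actual data points rather than perturbations, the procedure is deterministic, matching the abstract's contrast with synthetic-data-based explainers.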
URL: https://openreview.net/forum?id=F9Kv96KcwM
---
Title: Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
Abstract: Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method, in which an LLM generates comprehension-based questions from the original text and then answers them based on the translated text. Our benchmark results show an average performance of 95.9%, indicating that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages, finding that the authors consistently judged the translations to accurately capture the original information in their articles. Interestingly, a third of the authors found many technical terms "overtranslated," expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation. The code and translated articles are available at https://anonymous.4open.science/r/ProjectMundo-Anon-106A.
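The QA benchmarking loop described above has a simple skeleton: generate questions from the original, answer them from the translation, and grade against reference answers. The three callables below stand in for LLM calls and are hypothetical interfaces, not the paper's codebase:

```python
def qa_benchmark(original, translated, gen_questions, answer, grade):
    """Skeleton of the QA benchmarking loop: generate comprehension
    questions from the original text, answer each question from both
    the original and the translation, and report the fraction graded
    as matching. `gen_questions`, `answer`, and `grade` would be
    LLM-backed in practice."""
    questions = gen_questions(original)
    correct = 0
    for q in questions:
        ref = answer(original, q)
        hyp = answer(translated, q)
        correct += grade(q, ref, hyp)
    return correct / len(questions)
```

A faithful translation would score near 1.0 under this scheme (the paper reports 95.9% on average), while information lost in translation shows up as failed questions.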
URL: https://openreview.net/forum?id=RlkGGVbcZI
---