Accepted papers
===============
Title: Scalable Multi-Output Gaussian Processes with Stochastic Variational Inference
Authors: Xiaoyu Jiang, Sokratia Georgaka, Magnus Rattray, Mauricio A Álvarez
Abstract: The Multi-Output Gaussian Process (MOGP) is a popular tool for modelling data from multiple sources. A typical choice to build a covariance function for a MOGP is the Linear Model of Coregionalisation (LMC), which parametrically models the covariance between outputs. The Latent Variable MOGP (LV-MOGP) generalises this idea by modelling the covariance between outputs using a kernel applied to latent variables, one per output, leading to a flexible MOGP model that allows efficient generalisation to new outputs with few data points. The computational complexity of the LV-MOGP grows linearly with the number of outputs, which makes it unsuitable for problems with a large number of outputs. In this paper, we propose a stochastic variational inference approach for the LV-MOGP that allows mini-batches for both inputs and outputs, making the computational complexity per training iteration independent of the number of outputs. We demonstrate the performance of the model by benchmarking against other MOGP models on several real-world datasets, including spatial-temporal climate modelling and spatial transcriptomics.
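For background, the LMC covariance this abstract references can be sketched as follows; the kernel, coregionalisation matrix, and inputs below are illustrative choices, not taken from the paper.

```python
import numpy as np

def lmc_covariance(X, B, kernels):
    """Linear Model of Coregionalisation: the covariance between output i at
    x and output j at x' is sum_q B_q[i, j] * k_q(x, x')."""
    n, d = len(X), B.shape[1]
    K = np.zeros((d * n, d * n))
    for Bq, kq in zip(B, kernels):
        Kx = np.array([[kq(a, b) for b in X] for a in X])
        K += np.kron(Bq, Kx)  # outputs index the blocks, inputs vary within
    return K

rbf = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
X = np.array([0.0, 1.0, 2.0])
# One latent kernel (Q = 1) and two outputs; each B_q must be PSD.
B = np.array([[[1.0, 0.8], [0.8, 1.0]]])
K = lmc_covariance(X, B, [rbf])
print(K.shape)              # (6, 6): two outputs times three inputs
print(np.allclose(K, K.T))  # True -- a valid (symmetric) covariance
```

The Kronecker structure makes the parametric coupling between outputs explicit, which is exactly what the latent-variable kernel in the LV-MOGP replaces.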
URL: https://openreview.net/forum?id=kK0WrBZAli
---
Title: CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Authors: Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan A. Rossi, Yixuan Li, Saayan Mitra
Abstract: Large Language Models (LLMs) have revolutionized code generation but require significant resources and tend to over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs is a cost-effective alternative, yet standard supervised approaches rely solely on correct examples, overlooking valuable insights from failures. We introduce CodeLutra, a new framework that leverages both correct and incorrect code attempts. Instead of purely instructing with correct solutions, CodeLutra uses iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This process narrows the performance gap with larger state-of-the-art models without requiring massive datasets or auxiliary models. For example, on a challenging data science coding task, using only 500 samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s level. By capitalizing on both successes and mistakes, CodeLutra offers a scalable, efficient path to high-quality code generation, making smaller open-source models more competitive with leading closed-source alternatives.
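The correct-vs-incorrect pairing at the heart of this idea can be illustrated with a toy sketch; the `run_tests` checker and the candidates below are hypothetical stand-ins, not the paper's pipeline.

```python
def run_tests(code, tests):
    # Hypothetical checker: in practice this would execute the candidate in a
    # sandbox against the task's test cases; here each test is a callable.
    return all(t(code) for t in tests)

def build_preference_pairs(candidates, tests):
    """Partition sampled candidates by test outcome, then pair each passing
    solution (chosen) with each failing one (rejected) as preference data,
    so that failures also contribute a training signal."""
    passed, failed = [], []
    for code in candidates:
        (passed if run_tests(code, tests) else failed).append(code)
    return [(chosen, rejected) for chosen in passed for rejected in failed]

# Toy task: "write a function that doubles its input".
candidates = [lambda x: x * 2, lambda x: x + 2, lambda x: 2 * x]
tests = [lambda f: f(3) == 6, lambda f: f(0) == 0]
pairs = build_preference_pairs(candidates, tests)
print(len(pairs))  # 2 chosen/rejected pairs
```

Pairs like these are what a preference-optimization objective (e.g. a DPO-style loss) would consume in each refinement iteration.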
URL: https://openreview.net/forum?id=IGsEgWM4to
---
Title: Disappearance of Timestep Embedding: A Case Study on Neural ODE and Diffusion Models
Authors: Bum Jun Kim, Yoshinobu Kawahara, Sang Woo Kim
Abstract: Dynamical systems are often time-varying, so modeling them requires a function that evolves with respect to time. Recent studies, such as the neural ordinary differential equation, have proposed time-dependent neural networks whose behavior varies with respect to time. However, we claim that the architectural choice used to build a time-dependent neural network significantly affects its time-awareness and has not been sufficiently validated in its current state. In this study, we conduct an in-depth analysis of the architecture of neural ordinary differential equations. Here, we report a vulnerability of vanishing timestep embedding, which disables the time-awareness of a time-dependent neural network. Specifically, we find that the ConcatConv operation, which is widely used in neural ordinary differential equations, contributes the timestep embedding only additively, an effect that is readily canceled out by the subsequent batch normalization. This vanishing timestep embedding also arises for group normalization and is analyzed thoroughly with respect to the number of channels, groups, and relative variance. Furthermore, we find that this vulnerability can also be observed in diffusion models because they employ a similar architecture that incorporates timestep embedding to discriminate between different timesteps during a diffusion process. Our analysis provides a detailed description of this phenomenon as well as several solutions to address the root cause. Through experiments on neural ordinary differential equations and diffusion models, we observe that keeping time-awareness alive via the proposed solutions boosts performance in terms of classification accuracy, FID, and Inception Score, which implies that current implementations lack sufficient time-dependency.
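The cancellation described here is easy to reproduce numerically; the following is a minimal NumPy sketch with stand-ins for ConcatConv and BatchNorm, not the paper's code.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize per channel over batch and spatial dims, as BatchNorm does
    # at training time (affine parameters omitted for clarity).
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def concat_conv_then_bn(x, t, w_t):
    # A ConcatConv appends a constant timestep channel; with a 1x1 kernel its
    # contribution reduces to an additive per-channel offset t * w_t.
    h = x + t * w_t.reshape(1, -1, 1, 1)
    return batch_norm(h)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 5, 5))  # (batch, channels, H, W)
w_t = rng.normal(size=4)           # weights acting on the timestep channel

out_t1 = concat_conv_then_bn(x, t=0.1, w_t=w_t)
out_t2 = concat_conv_then_bn(x, t=0.9, w_t=w_t)

# The timestep offset is constant within each channel, so batch normalization
# subtracts it out: the output no longer depends on t at all.
print(np.allclose(out_t1, out_t2))  # True
```

Any channel-wise constant shift shared across the batch is removed by the per-channel mean subtraction, which is why the additive timestep signal vanishes.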
URL: https://openreview.net/forum?id=bpaLYaf6Dp
---
Title: Sparser, Better, Faster, Stronger: Sparsity Detection for Efficient Automatic Differentiation
Authors: Adrian Hill, Guillaume Dalle
Abstract: From implicit differentiation to probabilistic modeling, Jacobian and Hessian matrices have many potential use cases in Machine Learning (ML), but they are viewed as computationally prohibitive. Fortunately, these matrices often exhibit sparsity, which can be leveraged to speed up the process of Automatic Differentiation (AD).
This paper presents advances in sparsity detection, previously the performance bottleneck of Automatic Sparse Differentiation (ASD). Our implementation of sparsity detection is based on operator overloading; it can detect both local and global sparsity patterns and supports flexible index set representations. It is fully automatic and requires no modification of user code, making it compatible with existing ML codebases.
Most importantly, it is highly performant, unlocking Jacobians and Hessians at scales where they were considered too expensive to compute. On real-world problems from scientific ML, graph neural networks and optimization, we show significant speed-ups of up to three orders of magnitude. Notably, using our sparsity detection system, ASD outperforms standard AD for one-off computations, without amortization of either sparsity detection or matrix coloring.
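A toy Python illustration of operator-overloading sparsity detection; the paper's implementation is far more general, and the minimal `Tracer` class below is a hypothetical sketch of the idea only.

```python
class Tracer:
    """Tracks which input indices a value depends on (a global pattern)."""
    def __init__(self, deps):
        self.deps = frozenset(deps)
    def _merge(self, other):
        # Any binary arithmetic op makes the result depend on both operands.
        deps = other.deps if isinstance(other, Tracer) else frozenset()
        return Tracer(self.deps | deps)
    __add__ = __mul__ = __sub__ = _merge
    __radd__ = __rmul__ = __rsub__ = _merge

def jacobian_sparsity(f, n):
    # Seed one tracer per input index, run f unmodified, and read off each
    # output's index set -- the nonzero columns of that Jacobian row.
    xs = [Tracer({i}) for i in range(n)]
    return [sorted(y.deps) for y in f(xs)]

# A tridiagonal-style function: output i touches inputs i-1, i, i+1.
def f(x):
    n = len(x)
    return [x[max(i - 1, 0)] * x[i] + x[min(i + 1, n - 1)] for i in range(n)]

pattern = jacobian_sparsity(f, 5)
print(pattern)  # [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4]]
```

Knowing this pattern ahead of time is what lets ASD color and compress the Jacobian instead of materializing all n columns.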
URL: https://openreview.net/forum?id=GtXSN52nIW
---
New submissions
===============
Title: We Can (and Should) Design Neural Networks with a Systematic Dimensional Approach
Abstract: The design of neural network architectures, despite remarkable empirical successes, resembles an architecture zoo characterized by chance innovations and reliance on intuition rather than systematic thinking. This approach limits our ability to deeply understand why architectures succeed, efficiently explore the vast design space, and transfer knowledge across different paradigms. We argue for a shift in how the machine learning community approaches neural architecture design: moving from architecture-centric cataloging to dimension-centric understanding. Building on prior taxonomic work and integrating insights from recent architecture search approaches, we introduce a framework comprising 10 quasi-orthogonal structural dimensions that govern the capabilities of neural networks. This dimensional approach facilitates deeper understanding by enabling the deconstruction of complex architectures into their core design choices and their associated inductive biases. It also aims to enable more principled innovation by providing a modern map for systematic exploration of the design space and targeted design for specific problem characteristics. We demonstrate the framework's utility by mapping diverse, prominent architectures onto these dimensions and call upon the community to adopt such systematic frameworks for more principled and efficient advancement in neural network design.
URL: https://openreview.net/forum?id=lR54W6CjNh
---
Title: Understanding Self-supervised Contrastive Learning through Supervised Objectives
Abstract: Self-supervised representation learning has achieved impressive empirical success, yet its theoretical understanding remains limited. In this work, we provide a theoretical perspective by formulating self-supervised representation learning as an approximation to supervised representation learning objectives. Based on this formulation, we derive a loss function closely related to popular contrastive losses such as InfoNCE, offering insight into their underlying principles. Our derivation naturally introduces the concepts of prototype representation bias and a balanced contrastive loss, which help explain and improve the behavior of self-supervised learning algorithms. We further show how components of our theoretical framework correspond to established practices in contrastive learning. Finally, we empirically validate the effect of balancing positive and negative pair interactions. All theoretical proofs are provided in the appendix, and our code is included in the supplementary material.
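For reference, the InfoNCE loss mentioned above can be sketched as follows; this is the standard formulation, not the variant derived in the paper.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over a batch: row i of z1 and row i of z2 form a positive
    pair; every other row of z2 serves as a negative for row i."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
aligned = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # matched views
random_pairs = info_nce(z, rng.normal(size=z.shape))        # unrelated views
print(aligned < random_pairs)  # True: aligned views give a lower loss
```

The positive/negative structure on the similarity matrix is the part the paper's balanced contrastive loss reweights.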
URL: https://openreview.net/forum?id=cmE97KX2XM
---
Title: Effect of Geometry on Graph Neural Networks
Abstract: Hyperbolic Graph Neural Networks (GNNs) have emerged as a promising approach for modeling graph-structured data with less embedding distortion than Euclidean GNNs. In this paper, we explore the effect of geometry on the performance of three types of GNNs for node classification and link prediction. To do so, we build on the hyperbolic framework outlined in Chen et al. (2022) and propose a family of GNNs with alternating geometry, integrating both hyperbolic and Euclidean components that can be trained jointly. We compare our alternating geometry models’ performance and stability against their Euclidean and hyperbolic counterparts across various datasets. Finally, we examine the impact of the choice of geometry and graph properties on hyperparameter selection. The alternating geometry models achieved the best performance in node classification, while the hyperbolic models outperformed alternating and Euclidean models in link prediction. Additionally, for node classification, architecture choice had a greater impact on performance than geometry, whereas for link prediction, geometry had a more significant effect than architecture.
URL: https://openreview.net/forum?id=qSF5Hsjmkd
---
Title: IBCL: Zero-shot Model Generation under Stability-Plasticity Trade-offs
Abstract: Algorithms that balance the stability-plasticity trade-off are well studied in the Continual Learning literature. However, only a few focus on obtaining models for specified trade-off preferences. When solving the problem of continual learning under specific trade-offs (CLuST), state-of-the-art techniques leverage rehearsal-based learning, which requires retraining when a model corresponding to a new trade-off preference is requested. This is inefficient, since there potentially exists a large number of distinct trade-offs, and a large number of models may be requested. As a response, we propose Imprecise Bayesian Continual Learning (IBCL), an algorithm that tackles CLuST efficiently. IBCL replaces retraining with a constant-time convex combination. Given a new task, IBCL (1) updates the knowledge base as a convex hull of model parameter distributions, and (2) generates one Pareto-optimal model per given trade-off via convex combination without additional training. That is, obtaining models corresponding to specified trade-offs via IBCL is zero-shot. Experiments whose baselines are current CLuST algorithms show that IBCL improves average per-task accuracy by up to 45% and peak per-task accuracy by up to 43%, while maintaining a near-zero to positive backward transfer. In addition, its training overhead, measured by the number of batch updates, remains constant at every task, regardless of the number of preferences requested. Details can be found at: https://github.com/ibcl-anon/ibcl.
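The constant-time convex combination can be illustrated with point estimates in place of IBCL's parameter distributions; this is a simplification, and the parameters and weights below are made up.

```python
import numpy as np

def zero_shot_model(extreme_models, preference):
    """Generate parameters for a trade-off preference as a convex combination
    of cached extreme points -- no retraining, constant time per request."""
    w = np.asarray(preference, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0), "need convex weights"
    return sum(wi * m for wi, m in zip(w, extreme_models))

# Two cached parameter vectors, e.g. one favoring stability, one plasticity.
theta_stable = np.array([1.0, 0.0, 2.0])
theta_plastic = np.array([0.0, 1.0, -2.0])

# A user asking for a 75/25 stability/plasticity trade-off gets a model
# immediately, without any gradient updates.
theta = zero_shot_model([theta_stable, theta_plastic], preference=[0.75, 0.25])
print(theta)  # 0.75 * theta_stable + 0.25 * theta_plastic
```

Serving k preference requests therefore costs k vector combinations rather than k retraining runs, which is the efficiency claim in the abstract.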
URL: https://openreview.net/forum?id=HvTRpctE5n
---
Title: The Alpha-Alternator: Dynamic Adaptation To Varying Noise Levels In Sequences Using The Vendi Score For Improved Robustness and Performance
Abstract: Current state-of-the-art dynamical models, such as Mamba, assume the same level of noisiness for all elements of a given sequence, which limits their performance on noisy temporal data. In this paper, we introduce the $\alpha$-Alternator, a novel generative model for time-dependent data that dynamically adapts to the complexity introduced by varying noise levels in sequences. The $\alpha$-Alternator leverages the Vendi Score (VS), a flexible similarity-based diversity metric, to adjust, at each time step $t$, the influence of the sequence element at time $t$ and the latent representation of the dynamics up to that time step on the predicted future dynamics. This influence is captured by a parameter that is learned and shared across all sequences in a given dataset. The sign of this parameter determines the direction of influence. A negative value indicates a noisy dataset, where a sequence element that increases the VS is considered noisy, and the model relies more on the latent history when processing that element. Conversely, when the parameter is positive, a sequence element that increases the VS is considered informative, and the $\alpha$-Alternator relies more on this new input than on the latent history when updating its predicted latent dynamics. The $\alpha$-Alternator is trained using a combination of observation masking and Alternator loss minimization. Masking simulates varying noise levels in sequences, enabling the model to be more robust to these fluctuations and improving its performance in trajectory prediction, imputation, and forecasting. Our experimental results demonstrate that the $\alpha$-Alternator outperforms both Alternators and state-of-the-art state-space models across neural decoding and time-series forecasting benchmarks.
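The Vendi Score used here has a compact definition: the exponential of the von Neumann entropy of a normalized similarity matrix. A sketch, with an illustrative kernel and data rather than the paper's setup:

```python
import numpy as np

def vendi_score(X, kernel=lambda a, b: np.exp(-np.sum((a - b) ** 2))):
    """Vendi Score: exp of the entropy of the eigenvalues of K/n, where K is
    a similarity matrix with unit diagonal. It ranges from 1 (all samples
    identical) to n (all samples fully dissimilar)."""
    n = len(X)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]  # drop numerical zeros before taking logs
    return float(np.exp(-np.sum(lam * np.log(lam))))

identical = np.zeros((4, 2))
spread = np.array([[0.0, 0.0], [9.0, 0.0], [0.0, 9.0], [9.0, 9.0]])
print(vendi_score(identical))  # 1.0 -- no diversity
print(vendi_score(spread))     # ~4.0 -- four effectively distinct points
```

Tracking how the VS changes when a new sequence element arrives is what lets the model judge that element as noise or as signal, depending on the sign of the learned parameter.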
URL: https://openreview.net/forum?id=L2ixqvYpnK
---
Title: A Unified Understanding and Evaluation of Steering Methods
Abstract: Steering methods provide a practical approach to controlling large language models by applying steering vectors to intermediate activations, guiding outputs toward desired behaviors while avoiding retraining. Despite their growing importance, the field lacks a unified understanding and consistent evaluation across tasks and datasets, hindering progress. This paper introduces a unified framework for analyzing and evaluating steering methods, formalizing their core principles and offering theoretical insights into their effectiveness. Through comprehensive empirical evaluations on multiple-choice and open-ended text generation tasks, we validate these insights, identifying key factors that influence performance and demonstrating the superiority of certain methods. Our work bridges theoretical and practical perspectives, offering actionable guidance for advancing the design, optimization, and deployment of steering methods in LLMs.
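The basic mechanism, adding a steering vector to intermediate activations, can be sketched as follows; the difference-of-means construction is one common choice among the methods the paper unifies, and the activations here are synthetic.

```python
import numpy as np

def apply_steering(hidden, v, alpha=1.0):
    """Steer an intermediate activation: h' = h + alpha * v. The vector v is
    commonly the difference of mean activations between examples exhibiting
    vs. lacking the target behavior."""
    return hidden + alpha * v

# Synthetic activations standing in for a model layer's hidden states on
# "positive" vs. "negative" behavior prompts.
rng = np.random.default_rng(0)
pos_acts = rng.normal(loc=1.0, size=(32, 16))
neg_acts = rng.normal(loc=-1.0, size=(32, 16))
v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)  # steering direction

h = rng.normal(size=16)
steered = apply_steering(h, v, alpha=0.5)
# The steered activation moves toward the "positive" direction by
# alpha * ||v||^2 in projection onto v.
print(np.dot(steered, v) > np.dot(h, v))  # True
```

The evaluation question the paper addresses is precisely how choices like the construction of v, the layer, and alpha affect downstream behavior.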
URL: https://openreview.net/forum?id=NDTtRaPCzF
---
Title: Towards Formalizing Spuriousness of Biased Datasets Using Partial Information Decomposition
Abstract: Spuriousness arises when there is an association between two or more variables in a dataset that are not causally related. In this work, we propose an explainability framework to preemptively disentangle the nature of such spurious associations in a dataset before model training. We leverage a body of work in information theory called Partial Information Decomposition (PID) to decompose the total information about the target into four non-negative quantities, namely, unique information (in core and spurious features respectively), redundant information, and synergistic information. Our framework helps anticipate when the core or spurious feature is indispensable, when either suffices, and when both are jointly needed for an optimal classifier trained on the dataset. Next, we leverage this decomposition to propose a novel measure of the spuriousness of a dataset. We arrive at this measure systematically by examining several candidate measures, and demonstrating what they capture and miss through intuitive canonical examples and counterexamples. Our framework Spurious Disentangler consists of segmentation, dimensionality reduction, and estimation modules, with capabilities to specifically handle high-dimensional image data efficiently. Finally, we perform an empirical evaluation to demonstrate the trends of unique, redundant, and synergistic information, as well as our proposed spuriousness measure, across six benchmark datasets under various experimental settings. We observe an agreement between our preemptive measure of dataset spuriousness and post-training model generalization metrics such as worst-group accuracy, further supporting our proposition.
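Why a decomposition beyond pairwise mutual information is needed can be seen in the classic XOR example, where all information about the target is synergistic; this is a motivating sketch, not the paper's PID estimator.

```python
import numpy as np

def entropy(labels):
    # Plug-in Shannon entropy (in bits) of a discrete sample.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_info(x, t):
    # I(X; T) = H(X) + H(T) - H(X, T), using string-encoded joint symbols.
    joint = [f"{a}|{b}" for a, b in zip(x, t)]
    return entropy(x) + entropy(t) - entropy(joint)

# XOR target: neither feature alone carries any information about T, yet
# together they determine T exactly.
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
t = x1 ^ x2
pair = [f"{a}{b}" for a, b in zip(x1, x2)]
print(mutual_info(x1, t))    # 0.0 bits
print(mutual_info(x2, t))    # 0.0 bits
print(mutual_info(pair, t))  # 1.0 bit -- purely synergistic
```

Pairwise mutual information misses this entirely, which is the gap the unique/redundant/synergistic split in PID is designed to expose.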
URL: https://openreview.net/forum?id=zw6UAPYmyx
---