Daily TMLR digest for Aug 27, 2025

TMLR

Aug 27, 2025, 12:06:08 AM

to tmlr-anno...@googlegroups.com

Accepted papers
===============

Title: FraGNNet: A Deep Probabilistic Model for Tandem Mass Spectrum Prediction

Authors: Adamo Young, Fei Wang, David Wishart, BO WANG, Russell Greiner, Hannes Rost

Abstract: Compound identification from tandem mass spectrometry (MS/MS) data is a critical step in the analysis of complex mixtures. Typical solutions for the MS/MS spectrum to compound (MS2C) problem involve comparing the unknown spectrum against a library of known spectrum-molecule pairs, an approach that is limited by incomplete library coverage. Compound to MS/MS spectrum (C2MS) models can improve retrieval rates by augmenting real libraries with predicted MS/MS spectra. Unfortunately, many existing C2MS models suffer from problems with mass accuracy, generalization, or interpretability. We develop a new probabilistic method for C2MS prediction, FraGNNet, that can efficiently and accurately simulate MS/MS spectra with high mass accuracy. Our approach formulates the C2MS problem as learning a distribution over molecule fragments. FraGNNet achieves state-of-the-art performance in terms of prediction error and surpasses existing C2MS models as a tool for retrieval-based MS2C.

URL: https://openreview.net/forum?id=UsqeHx9Mbx

---

Title: A Mixture of Exemplars Approach for Efficient Out-of-Distribution Detection with Foundation Models

Authors: Evelyn Mannix, Howard Bondell

Abstract: One of the early weaknesses identified in deep neural networks trained for image classification tasks was their inability to provide low confidence predictions on out-of-distribution (OOD) data that was significantly different from the in-distribution (ID) data used to train them. Representation learning, where neural networks are trained in specific ways that improve their ability to detect OOD examples, has emerged as a promising solution. However, these approaches require long training times and can add additional overhead to detect OOD examples. Recent developments in Vision Transformer (ViT) foundation models—large networks trained on large and diverse datasets with self-supervised approaches—also show strong performance in OOD detection, and could address these challenges. This paper presents Mixture of Exemplars (MoLAR), an efficient approach to tackling OOD detection challenges that is designed to maximise the benefit of training a classifier with a high quality, frozen, pretrained foundation model backbone. MoLAR provides strong OOD detection performance when only comparing the similarity of OOD examples to the exemplars, a small set of images chosen to be representative of the dataset, leading to significantly reduced overhead for OOD detection inference over other methods that provide best performance when the full ID dataset is used. Extensive experiments demonstrate the improved OOD detection performance of MoLAR in comparison to comparable approaches in both supervised and semi-supervised settings, and code is available at github.com/emannix/molar-mixture-of-exemplars.

URL: https://openreview.net/forum?id=xpKqnSJtE4

---

Title: Emergent Neural Network Mechanisms for Generalization to Objects in Novel Orientations

Authors: Avi Cooper, Daniel Harari, Tomotake Sasaki, Spandan Madan, Hanspeter Pfister, Pawan Sinha, Xavier Boix

Abstract: The capability of Deep Neural Networks (DNNs) to recognize objects in orientations outside the training data distribution is not well understood. We investigate the limitations of DNNs’ generalization capacities by systematically inspecting DNNs' patterns of success and failure across out-of-distribution (OoD) orientations. We present evidence that DNNs (across architecture types, including convolutional neural networks and transformers) are capable of generalizing to objects in novel orientations, and we describe their generalization behaviors. Specifically, generalization strengthens when training the DNN with an increasing number of familiar objects, but only in orientations that involve 2D rotations of familiar orientations. We also hypothesize how this generalization behavior emerges from internal neural mechanisms – that neurons tuned to common features between familiar and unfamiliar objects enable out of distribution generalization – and present supporting data for this theory. The reproducibility of our findings across model architectures, as well as analogous prior studies on the brain, suggests that these orientation generalization behaviors, as well as the neural mechanisms that drive them, may be a feature of neural networks in general.

URL: https://openreview.net/forum?id=4wBQTZVSHU

---

Title: Transferring Reasoning Capabilities between LLMs operating via Curriculum Learning Policy

Authors: Leonardo Ranaldi, Giulia Pucci, Fabio Massimo Zanzotto

Abstract: In-context reasoning methods, exemplified by Chain-of-Thought (CoT), empower the reasoning abilities of large language models (LLMs), prompting them to solve complex reasoning tasks step-by-step. Nevertheless, the capacity to deliver robust CoT explanations arises only in models with billions of parameters, representing a barrier to entry for many users forced to operate on a smaller model scale, i.e., Small Language Models (SLMs). Even though many companies are releasing LLMs of the same family with a reduced number of parameters, these models sometimes produce misleading answers and are unable to deliver accurate step-wise reasoned answers. This paper proposes a method to transfer step-wise reasoning to SLMs by operating via Instruction-tuning (IT) on synthetic demonstrations delivered in a pedagogically motivated manner. In particular, we first propose aligning step-wise reasoning capabilities via IT using Demonstrations "taught" by LLM teachers to SLM students. Then, we operate via Curriculum Learning (CL), a pedagogically motivated learning method that improves the IT phase. We analyse the impact on downstream performance across four question-answering benchmarks. The results show that SLMs can be instructed to reason via Demonstrations delivered by LLMs. We move a step further in research: conceiving SLMs as human learners, we expose them to a CL teaching-based approach, obtaining better results on downstream performance.

URL: https://openreview.net/forum?id=zPKqyjmyEQ

---

Title: Dextr: Zero-Shot Neural Architecture Search with Singular Value Decomposition and Extrinsic Curvature

Authors: Rohan Asthana, Joschua Conrad, Maurits Ortmanns, Vasileios Belagiannis

Abstract: Zero-shot Neural Architecture Search (NAS) typically optimises the architecture search process by exploiting the network or gradient properties at initialisation through zero-cost proxies. The existing proxies often rely on labelled data, which is usually unavailable in real-world settings. Furthermore, the majority of the current methods focus either on optimising the convergence and generalisation attributes or solely on the expressivity of the network architectures. To address both limitations, we first demonstrate how channel collinearity affects the convergence and generalisation properties of a neural network. Then, by incorporating the convergence, generalisation and expressivity in one approach, we propose a zero-cost proxy that omits the requirement of labelled data for its computation. In particular, we leverage the Singular Value Decomposition (SVD) of the neural network layer features and the extrinsic curvature of the network output to design our proxy. As a result, the proposed proxy is formulated as the simplified harmonic mean of the logarithms of two key components: the sum of the inverse of the feature condition number and the extrinsic curvature of the network output. Our approach enables accurate prediction of network performance on test data using only a single label-free data sample. Our extensive evaluation includes a total of six experiments, including the Convolutional Neural Network (CNN) search space, i.e. DARTS and the Transformer search space, i.e. AutoFormer. The proposed proxy demonstrates a superior performance on multiple correlation benchmarks, including NAS-Bench-101, NAS-Bench-201, and TransNAS-Bench-101-micro; as well as on the NAS task within the DARTS and the AutoFormer search space, all while being notably efficient. The code is available at https://github.com/rohanasthana/Dextr.

URL: https://openreview.net/forum?id=X0vPof5DVh

---

Title: Differentiable Causal Discovery of Linear Non-Gaussian Acyclic Models Under Unmeasured Confounding

Authors: Yoshimitsu Morinishi, Shohei Shimizu

Abstract: We propose a score-based method that extends the framework of the linear non-Gaussian acyclic model (LiNGAM) to address the problem of causal structure estimation in the presence of unmeasured variables. Building on the method proposed by Bhattacharya et al. (2021), we develop a method called ABIC LiNGAM, which assumes that error terms follow a multivariate generalized normal distribution and employs continuous optimization techniques to recover acyclic directed mixed graphs (ADMGs). We demonstrate that the proposed method can estimate causal structures, including the possibility of identifying their orientations, rather than only Markov equivalence classes, under the assumption that the data are linear and follow a multivariate generalized normal distribution. Additionally, we provide proofs of the identifiability of the parameters in ADMGs when the error terms follow a multivariate generalized normal distribution. The effectiveness of the proposed method is validated through simulations and experiments using real-world data.

URL: https://openreview.net/forum?id=HR7MFlW73I

---

Title: Rollout Total Correlation for Deep Reinforcement Learning

Authors: Bang You, Huaping Liu, Jan Peters, Oleg Arenz

Abstract: Learning task-relevant representations is crucial for reinforcement learning. Recent approaches aim to learn such representations by improving the temporal consistency in the observed transitions. However, they only consider individual transitions and can fail to achieve long-term consistency. Instead, we argue that capturing aspects of the state that correlate with other states and actions of the trajectory---even more distant in the future---could further help in extracting task-relevant information. Hence, in this paper we investigate how to learn representations by maximizing the rollout total correlation, the correlation among all learned representations and actions within the trajectories produced by the agent. For improving rollout total correlation, we propose to combine two complementary lower bounds based on a generative and a discriminative model, combined with a simple and effective technique of chunk-wise mini-batching. Furthermore, we propose an intrinsic reward based on the learned representation for better exploration. Experimental evaluations on a set of challenging image-based simulated control tasks show that our method achieves better sample efficiency, and robustness to both white noise and natural video backgrounds compared to leading baselines.

URL: https://openreview.net/forum?id=qTdRJAL8Li

---

New submissions
===============

Title: Targeted Unlearning Using Perturbed Sign Gradient Methods With Applications On Medical Images

Abstract: Machine unlearning aims to remove the influence of specific training samples from a trained model without full retraining. While prior work has largely focused on privacy-motivated settings, we recast unlearning as a general-purpose tool for post-deployment model revision. Specifically, we focus on utilizing unlearning in clinical contexts where data shifts, device deprecation, and policy changes are common. To this end, we propose a bilevel optimization formulation of boundary-based unlearning that can be solved using iterative algorithms. We provide convergence guarantees when first order algorithms are used to unlearn. Our method introduces tunable loss design for controlling the forgetting–retention tradeoff and supports novel model composition strategies that merge the strengths of distinct unlearning runs. Across benchmark and real-world clinical imaging datasets, our approach outperforms baselines on both forgetting and retention metrics, including scenarios involving imaging devices and anatomical outliers. This work establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications.

URL: https://openreview.net/forum?id=XE0bJg6sQN

---

Title: Disentangled Concept-Residual Models: Bridging the Interpretability–Performance Gap for Incomplete Concept Sets

Abstract: Deploying AI in high-stakes settings requires models that are not only accurate but also interpretable and amenable to human oversight. Concept Bottleneck Models (CBMs) support these goals by structuring predictions around human-understandable concepts, enabling interpretability and post-hoc human intervenability. However, CBMs rely on a ‘complete’ concept set, requiring practitioners to define and label enough concepts to match the predictive power of black-box models. To relax this requirement, prior work introduced residual connections that bypass the concept layer and recover information missing from an incomplete concept set. While effective in bridging the performance gap, these residuals can redundantly encode concept information, a phenomenon we term concept-residual overlap. In this work, we investigate the effects of concept-residual overlap and evaluate strategies to mitigate it. We (1) define metrics to quantify the extent of concept-residual overlap in CRMs; (2) introduce complementary metrics to evaluate how this overlap impacts interpretability, concept importance, and the effectiveness of concept-based interventions; and (3) present Disentangled Concept-Residual Models (D-CRMs), a general class of CRMs designed to mitigate this issue. Within this class, we propose a novel disentanglement approach based on minimizing mutual information (MI). Using CelebA, CIFAR100, AA2, CUB, and OAI, we show that standard CRMs exhibit significant concept-residual overlap, and that reducing this overlap with MI-based D-CRMs restores key properties of CBMs, including interpretability, functional reliance on concepts, and intervention robustness, without sacrificing predictive performance.

URL: https://openreview.net/forum?id=NKgNizwDa6

---

Title: Efficient and Unbiased Sampling from Boltzmann Distributions via Variance-Tuned Diffusion Models

Abstract: Score-based diffusion models (SBDMs) are powerful amortized samplers for Boltzmann distributions; however, imperfect score estimates bias downstream Monte Carlo estimates. Classical importance sampling (IS) can correct this bias, but computing exact likelihoods requires solving the probability-flow ordinary differential equation (PF–ODE), a procedure that is prohibitively costly and scales poorly with dimensionality. We introduce Variance-Tuned Diffusion Importance Sampling (VT-DIS), a lightweight post-training method that adapts the per-step noise covariance of a pretrained SBDM by minimizing the $\alpha$-divergence $(\alpha=2)$ between its forward diffusion and reverse denoising trajectories. VT-DIS assigns a single trajectory-wise importance weight to the joint forward–reverse process, yielding unbiased expectation estimates at test time with negligible overhead compared to standard sampling. On the DW-4, LJ-13, and alanine-dipeptide benchmarks, VT-DIS achieves effective sample sizes of approximately 80%, 35%, and 3.5%, respectively, while using only a fraction of the computational budget required by vanilla diffusion + IS or PF-ODE–based IS.

URL: https://openreview.net/forum?id=Jq2dcMCS5R
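
Reader's note: the effective sample sizes quoted in this abstract are percentages of the number of drawn samples. A minimal sketch of how such a figure is commonly computed from importance weights follows, assuming the usual Kish definition ESS = (sum w)^2 / sum w^2; the paper may define it differently. The function names are illustrative and do not come from the paper, and the self-normalized estimator shown is the generic textbook one, not necessarily the estimator used by VT-DIS.

```python
from math import exp


def normalized_weights(log_weights):
    """Turn unnormalized log importance weights into weights summing to 1."""
    # Subtract the maximum before exponentiating to avoid overflow.
    m = max(log_weights)
    w = [exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [x / total for x in w]


def ess_fraction(log_weights):
    """Kish effective sample size divided by the number of samples."""
    w = normalized_weights(log_weights)
    # With normalized weights, ESS = 1 / sum(w_i^2).
    return 1.0 / (len(w) * sum(x * x for x in w))


def snis_estimate(values, log_weights):
    """Self-normalized importance-sampling estimate of an expectation."""
    w = normalized_weights(log_weights)
    return sum(wi * v for wi, v in zip(w, values))
```

Equal weights give a fraction of 1.0 (every sample counts fully), while a single dominant weight among N samples drives the fraction toward 1/N.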

---

Title: Generalization bound for a Shallow Transformer trained using Gradient Descent

Abstract: In this work, we develop a norm-based generalization bound for a shallow Transformer model trained using Gradient Descent. This is achieved in three major steps: (a) defining a class of Transformer models whose weights stay close to their initialization during training; (b) upper bounding the Rademacher complexity of this class; and (c) upper bounding the empirical loss of all Transformer models belonging to the above-defined class for all training steps. We end up with an upper bound on the true loss which tightens sublinearly with increasing number of training examples $N$ for all values of model dimension $d_m$. We also perform experiments on the MNIST dataset to support our theoretical findings.

URL: https://openreview.net/forum?id=t3iUeMOT8Z

---

Title: Rewarding the Rare: Maverick-Aware Shapley Valuation in Federated Learning

Abstract: Federated Learning (FL) allows clients to train a model collaboratively without sharing their private data. Shapley value (SV) provides a principled way to quantify client contributions in FL. However, existing SV methods use uniform per-class weighting during validation, treating all classes as equally important. This uniform weighting breaks down in the presence of clients with underrepresented or rare classes, also referred to as Mavericks. Such clients are often undervalued due to lower model performance on these challenging classes, despite their critical role in improving generalization. To address this, we introduce a Maverick-aware Shapley valuation framework that reweights validation scores based on per-class accuracy, assigning greater importance to classes where models perform poorly. Building on this, we design FedMS, a Maverick-Shapley client selection mechanism that leverages our refined contribution scores to guide intelligent client selection. Experiments on benchmark datasets demonstrate that FedMS improves model performance and better recognizes valuable client contributions, even under scenarios involving adversaries, free-riders, and skewed or rare-class distributions.

URL: https://openreview.net/forum?id=JtybGfTUdq
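
Reader's note: the reweighting idea in this abstract can be illustrated with a toy example. The sketch below is not the paper's method. It assumes class weights proportional to (1 - reference accuracy), which is only one possible rule for "greater importance to classes where models perform poorly", and it computes exact Shapley values by enumerating coalitions, which is feasible only for a handful of clients. All names and the accuracy table are hypothetical.

```python
from itertools import combinations
from math import factorial


def class_weights(reference_acc, eps=1e-8):
    """Assumed rule: weight each class by (1 - accuracy), then normalize."""
    raw = [1.0 - a + eps for a in reference_acc]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_score(per_class_acc, weights):
    """Validation score as a weighted average of per-class accuracies."""
    return sum(w * a for w, a in zip(weights, per_class_acc))


def shapley_values(clients, utility):
    """Exact Shapley values; utility maps a frozenset of clients to a score."""
    n = len(clients)
    values = {c: 0.0 for c in clients}
    for c in clients:
        others = [o for o in clients if o != c]
        for size in range(n):
            coef = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                values[c] += coef * (utility(s | {c}) - utility(s))
    return values


# Hypothetical per-class accuracy [common class, rare class] per coalition.
# Client "A" holds only the common class, client "B" only the rare class.
TOY_ACC = {
    frozenset(): [0.0, 0.0],
    frozenset({"A"}): [0.9, 0.0],
    frozenset({"B"}): [0.0, 0.6],
    frozenset({"A", "B"}): [0.9, 0.6],
}

uniform = [0.5, 0.5]
reweighted = class_weights(TOY_ACC[frozenset({"A", "B"})])

sv_uniform = shapley_values(["A", "B"], lambda s: weighted_score(TOY_ACC[s], uniform))
sv_reweighted = shapley_values(["A", "B"], lambda s: weighted_score(TOY_ACC[s], reweighted))
```

In this toy setting, uniform weighting ranks client A above client B, while the reweighted score ranks the rare-class client B higher.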

---

Title: Preserving Angles Improves Feature Distillation

Abstract: Knowledge distillation methods compress models by training a student network using the classification outputs of a high quality teacher model, but can fail to effectively transfer the properties of computer vision foundation models from the teacher to the student. While it has been recently shown that feature distillation—where a teacher model's output features are replicated instead—can reproduce performance for foundation models across numerous downstream tasks, it falls short in matching critical properties such as robustness and out-of-distribution (OOD) detection performance. This paper overcomes this shortcoming by introducing Cosine-similarity Preserving Compression (CosPress), a feature distillation technique that learns a mapping to compress the latent space of the teacher model into the smaller latent space of the student, by preserving the cosine similarities between image embeddings. This enables direct optimisation of the student network and produces a more faithful reproduction of the teacher's properties. It is shown that distillation with CosPress on a variety of datasets, including ImageNet, produces more accurate models with greater performance on generalisability, robustness and OOD detection benchmarks, and that this technique provides a competitive pathway for training highly performant lightweight models on small datasets. Code is available at https://anonymous.4open.science/r/cospress-83E3/README.md.

URL: https://openreview.net/forum?id=ZEhgODZkWU

---

Title: Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

Abstract: Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences ($\mathtt{CLD}$), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. $\mathtt{CLD}$ is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for $\mathtt{CLD}$-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, $\mathtt{CLD}$-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1\% of more computationally expensive baselines even when not leading. $\mathtt{CLD}$ transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with $<1\%$ degradation. Moreover, $\mathtt{CLD}$ is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, $\mathtt{CLD}$ exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make $\mathtt{CLD}$ a principled, efficient, stable, and transferable tool for scalable dataset optimization.

URL: https://openreview.net/forum?id=QY0pbZTWJ9
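
Reader's note: a rough sketch of the kind of score this abstract describes, built only from per-sample losses recorded at training checkpoints. It assumes the score is the Pearson correlation between a training sample's checkpoint-to-checkpoint loss differences and those of the mean validation loss; the paper's exact definition of CLD, including its per-class validation alignment, may differ. All function names are illustrative.

```python
from math import sqrt


def loss_differences(losses):
    """Differences between losses at consecutive checkpoints."""
    return [b - a for a, b in zip(losses, losses[1:])]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def cld_scores(train_losses, val_losses):
    """Score each training sample against the mean validation trajectory.

    train_losses and val_losses are lists of per-sample loss trajectories,
    one loss value per checkpoint.
    """
    n_ckpt = len(val_losses[0])
    val_mean = [sum(v[t] for v in val_losses) / len(val_losses) for t in range(n_ckpt)]
    val_diff = loss_differences(val_mean)
    return [pearson(loss_differences(tr), val_diff) for tr in train_losses]


def select_coreset(train_losses, val_losses, k):
    """Indices of the k training samples with the highest scores."""
    scores = cld_scores(train_losses, val_losses)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

No gradients or curvature terms appear here: the only inputs are loss values that a training loop can log at each checkpoint.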

---

Title: Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer

Abstract: Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks, leveraging pre-collected datasets and Transformer's capability to model long sequences. Recent works have demonstrated that using parts of trajectories from training tasks as prompts in DT enhances its performance on unseen tasks, giving rise to Prompt-DT methods. However, collecting data from specific environments can be both costly and unsafe in many scenarios, leading to suboptimal performance and limited few-shot prompt abilities due to the data-hungry nature of Transformer-based models. Additionally, the limited datasets used in pre-training make it challenging for Prompt-DT-type methods to distinguish between various RL tasks through prompts alone. To address these challenges, we introduce the Language model-initialized Prompt Decision Transformer (LPDT) framework, which leverages pretrained language models providing rich prior knowledge for RL tasks and fine-tunes the sequence model using Low-rank Adaptation (LoRA) for meta-RL problems. We further incorporate prompt regularization to effectively differentiate between tasks based on prompt feature representations. Comprehensive empirical studies demonstrate that initializing with a pre-trained language model provides prior knowledge and achieves performance similar to Prompt-DT using only $10\%$ of the data. We also provide a thorough ablation study to validate the effectiveness of each component, including sequence modeling, language models, prompt regularizations, and prompt strategies.

URL: https://openreview.net/forum?id=k520i3XEMK

---

Title: Bi-level Hierarchical Neural Contextual Bandits for Online Recommendation

Abstract: Contextual bandit algorithms aim to identify the optimal choice among a set of candidate arms, based on their contextual information. Among others, neural contextual bandit algorithms have demonstrated generally superior performance compared to conventional linear and kernel-based methods. Nevertheless, neural methods can be inherently unsuitable for handling a large number of candidate arms due to their high computational cost when performing principled exploration. Motivated by the widespread availability of arm category information (e.g., movie genres, retailer types), we formulate contextual bandits as a bi-level online recommendation problem, and propose a novel neural bandit framework, named $\text{H}_{2}\text{N-Bandit}$, which utilizes a bi-level hierarchical neural architecture to mitigate the substantial computational cost found in conventional neural bandit methods. To demonstrate its theoretical effectiveness, we provide regret analysis under general over-parameterization settings, along with a guarantee for category-level recommendation. To illustrate its effectiveness and efficiency, we conduct extensive experiments on multiple real-world data sets, highlighting that $\text{H}_{2}\text{N-Bandit}$ can significantly reduce the computational cost over existing strong non-linear baselines, while achieving better or comparable performance under online recommendation settings.

URL: https://openreview.net/forum?id=k3XsA75SGv

---

Title: A Modular Abstraction for Integrating Domain Rules into Deep Learning Models

Abstract: Domain-specific knowledge can often be expressed as suggestive rules defined over subgroups of data. Such rules, when encoded as hard constraints, are often not directly compatible with deep learning frameworks that train neural networks over batches of data. Also, domain-experts often use heuristics that should not be encoded as logical rules. In this work, we propose a framework to capture domain-experts' knowledge as domain-specific rules over subgroups of data, and to leverage such rules in training deep learning models using the modular components of regularization, data augmentation, and parameter optimization. This translation of domain knowledge into custom primitives that can be augmented to existing state-of-the-art deep learning models improves the ability of domain experts to interpret and express model behavior, intervene through changes in the modeling specifications, and improve the overall performance of the model as compared to existing frameworks that incorporate deterministic declarative predicates. On one synthetic and three real-world tasks, we show that our method allows iterative refinement and is demonstrably more accurate.

URL: https://openreview.net/forum?id=KicRPZsIDH

---

Title: Unreasonable effectiveness of LLM reasoning: a doubly cautionary tale of temporal question-answering

Abstract: The remarkable success of Large Language Models in modeling both the syntax and the semantics of language has prompted a body of research into language-adjacent abilities, most notably commonsense reasoning. As LLMs' performance continues to advance on successive benchmarks, we turn to temporal reasoning, which lags somewhat behind other tasks due to its more complex logic. We start from previous work, where authors successfully induce (apparent) reasoning by breaking down the problem into a two-step procedure of temporal graph extraction and subsequent reasoning. Specifically, in the first step an LLM is prompted to parse a natural language description into a semi-structured timeline of events; and in the second step, it is given the extracted timeline and prompted to answer a temporal reasoning question. We conjecture that this procedure presents two separate opportunities for introducing errors and further hypothesise that a neuro-symbolic approach should help in this matter. We follow the recent trend of using external executors in concert with LLMs to carry out exact reasoning and verification. We see the reasoning step of the original two-step procedure as a natural target for a symbolic solver and design a rule-based solution for Temporal Question-Answering, drawing on ideas from Allen’s Interval Algebra. To our surprise, we find that our rule-based reasoner does not improve beyond the previously reported, purely neural solution. It appears that both our approach and the previous method operate at around the limits of achievable performance, imposed by the correctness of information extraction. Such a result seems to suggest that a non-symbolic LLM is capable of symbolic-level reasoning, although upon further investigation we discover that not to be the case. It is not that the neural solution makes no reasoning mistakes, but rather that the LLM manages to compensate for some of its erroneous replies by 'short-cutting' to the correct answer in other questions; that is, not reasoning but guessing. Although the effect is not pronounced performance-wise, we feel it is conceptually important: as we argue, production of correct answers is not a measure of reasoning.

URL: https://openreview.net/forum?id=1DkD0Nd8Rd
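
Reader's note: for readers unfamiliar with Allen's Interval Algebra, it describes the relation between two time intervals using thirteen basic relations. The sketch below classifies a pair of intervals into one of them. It is a generic illustration, not the authors' rule-based reasoner, and it assumes each event is a (start, end) pair with start < end, as might be read off an extracted timeline.

```python
def allen_relation(a, b):
    """Return the basic Allen relation that holds from interval a to interval b.

    Each interval is a (start, end) pair with start < end.
    """
    (a_start, a_end), (b_start, b_end) = a, b
    # Disjoint or touching intervals.
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    # Intervals sharing one or both endpoints.
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    # Strict containment.
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # The only remaining case is a partial overlap.
    return "overlaps" if a_start < b_start else "overlapped-by"
```

For example, `allen_relation((1, 5), (3, 8))` yields "overlaps" and `allen_relation((3, 4), (1, 8))` yields "during".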

---

Title: RIZE: Regularized Imitation Learning via Distributional Reinforcement Learning

Abstract: We propose a novel Inverse Reinforcement Learning (IRL) method that mitigates the rigidity of fixed reward structures and the limited flexibility of implicit reward regularization. Building on the Maximum Entropy IRL framework, our approach incorporates a squared temporal-difference (TD) regularizer with adaptive targets that evolve dynamically during training, thereby imposing adaptive bounds on recovered rewards and promoting robust decision-making. To capture richer return information, we integrate distributional RL into the learning process. Empirically, our method achieves expert-level performance on complex MuJoCo tasks, surpassing baseline methods on the Humanoid task with 3 demonstrations. Extensive experiments and ablation studies further validate the effectiveness of the approach and provide insights into reward dynamics in imitation learning.

URL: https://openreview.net/forum?id=a6DWqXJZCZ

---

Title: ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Abstract: Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no learning-based approach has emerged as a reliable solution. In this work, we propose to view the selection of training data mixtures as a black-box hyperparameter optimization problem, for which Bayesian Optimization is a well-established class of appropriate algorithms. Firstly, we cast data mixture learning as a sequential decision-making problem, in which we aim to find a suitable trade-off between the computational cost of training exploratory (proxy-) models and final mixture performance. Secondly, we systematically explore the properties of transferring mixtures learned at a small scale to larger-scale experiments, providing insights and highlighting opportunities for research at a modest scale. By proposing Multi-fidelity Bayesian Optimization as a suitable method in this common scenario, we introduce a natural framework to balance experiment cost with model fit, avoiding the risks of overfitting to smaller scales while minimizing the number of experiments at high cost. We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of benchmarks, showing speed-ups of over 500% in determining the best data mixture on our largest experiments relative to recent baselines. In addition, we broaden access to research by sharing ADMIRE IFT Runs, a dataset of 460 full training & evaluation runs of reproducible post-training pipelines worth over 13,000 GPU hours, greatly reducing the cost of conducting research in this area. Finally, we highlight rich opportunities for future research in this area, helping bridge the gap towards a comprehensive understanding of the broader effects of training data on model generalization.

URL: https://openreview.net/forum?id=0Euvm9zDpu

---

Title: Unbiased Stochastic Optimization for Gaussian Processes on Finite Dimensional RKHS

Abstract: Current methods for stochastic hyperparameter learning in Gaussian Processes (GPs) rely on approximations, such as computing biased stochastic gradients or using inducing points in stochastic variational inference. However, when using such methods we are not guaranteed to converge to a stationary point of the true marginal likelihood. In this work, we propose algorithms for exact stochastic inference of GPs with kernels that induce a Reproducing Kernel Hilbert Space (RKHS) of moderate finite dimension. Our approach can also be extended to infinite dimensional RKHSs at the cost of forgoing exactness. Both for finite and infinite dimensional RKHSs, our method achieves better experimental results than existing methods when memory resources limit the feasible batch size and the possible number of inducing points.

URL: https://openreview.net/forum?id=nVRpd28Fms

---

Title: TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

Abstract: Diffusion models have been the predominant generative model for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former encounters the challenge of jointly modeling all multi-modal distributions of tabular data in one model. While the latter alleviates this by learning a single representation for all features, it currently leverages sparse suboptimal encoding heuristics and necessitates additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are its density, its flexibility to provide ample separability for nominal features, and its ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results showcase that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.

URL: https://openreview.net/forum?id=yRbtFEh2OP

---
