Accepted papers
===============
Title: On Learning Representations for Tabular Data Distillation
Authors: Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne
Abstract: Dataset distillation generates a small set of information-rich instances from a large dataset, reducing storage requirements, privacy or copyright risks, and computational costs for downstream modeling; however, much of the research to date has focused on the image modality. We study tabular data distillation, which brings novel challenges such as inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To address these challenges, we present $\texttt{TDColER}$, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, ${{\sf \small TDBench}}$. Based on an elaborate evaluation on ${{\sf \small TDBench}}$, comprising 226,200 distilled datasets and 541,980 models trained on them, we demonstrate that $\texttt{TDColER}$ boosts the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models. All of the code used in the experiments can be found at http://github.com/inwonakng/tdbench
URL: https://openreview.net/forum?id=GXlsrvOGIK
---
Title: Stabilizing the Kumaraswamy Distribution
Authors: Max Wasserman, Gonzalo Mateos
Abstract: Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the log-pdf, CDF, and inverse CDF, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models to address exploration-exploitation trade-offs in contextual multi-armed bandits and facilitate uncertainty quantification for link prediction with graph neural networks. We find these models to be most performant when paired with the stable KS. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.
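The closed-form inverse CDF mentioned in this abstract is what makes reparameterized sampling from the Kumaraswamy distribution cheap, but the naive formula cancels catastrophically near the endpoints. Below is a minimal sketch of the standard `log1p`/`expm1` stabilization, as an illustration of the kind of issue the paper addresses; the paper's exact fixes and the PyTorch/TensorFlow details are in the paper itself.

```python
import math

def ks_icdf_naive(u, a, b):
    """Naive Kumaraswamy inverse CDF: (1 - (1 - u)**(1/b))**(1/a).
    Collapses to 0.0 when u is tiny, because 1 - u rounds to 1.0."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def ks_icdf_stable(u, a, b):
    """Inverse CDF computed in log space via log1p/expm1, avoiding
    catastrophic cancellation near u = 0 and u = 1.
    Uses log(1 - (1-u)^(1/b)) = log(-expm1(log1p(-u) / b))."""
    return math.exp(math.log(-math.expm1(math.log1p(-u) / b)) / a)

def ks_cdf(x, a, b):
    """CDF F(x) = 1 - (1 - x^a)^b, also written with log1p/expm1."""
    return -math.expm1(b * math.log1p(-(x ** a)))
```

For example, `ks_icdf_naive(1e-300, 2.0, 5.0)` underflows to exactly `0.0`, while the stable version returns a small positive value, so reparameterized samples keep a usable gradient signal in the tails.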
URL: https://openreview.net/forum?id=baZLwdphqw
---
Title: Empirical Bayes Trend Filtering Through a Variational Inference Framework
Authors: Dongyue Xie
Abstract: This paper introduces a novel framework for Bayesian trend filtering using an empirical Bayes approach and a variational inference algorithm. Trend filtering is a nonparametric regression technique that has gained popularity for its simple formulation and local adaptability. Bayesian adaptations of trend filtering have been proposed as an alternative, but they often rely on computationally intensive sampling-based methods for posterior inference. We propose empirical Bayes trend filtering (EBTF), which leverages shrinkage priors, estimated through an empirical Bayes procedure by maximizing the marginal likelihood. To address the computational challenges posed by large datasets, we implement a variational inference algorithm for posterior computation, ensuring scalability and efficiency. Our framework is flexible, allowing the incorporation of various shrinkage priors, and optimizes the level of smoothness directly from the data. We also discuss alternative formulations of the EBTF model, along with their pros and cons. We demonstrate the performance of our EBTF method through comprehensive simulations and real-world data applications, highlighting its ability to maintain computational efficiency while providing accurate trend estimation.
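For readers unfamiliar with trend filtering, the method shrinks discrete derivatives of the fitted signal toward zero via a difference operator $D$; the Bayesian variant discussed above places a shrinkage prior on those differences. A minimal sketch of the operator, using standard trend filtering notation rather than code from the paper:

```python
import numpy as np

def difference_matrix(n, order):
    """Discrete difference operator D of the given order: for order=1,
    rows are (..., -1, 1, ...); order k+1 yields (k+1)-th differences."""
    return np.diff(np.eye(n), n=order, axis=0)

# Trend filtering fits theta to data y while shrinking D @ theta toward 0;
# second-order differences (order=2) favor piecewise-linear trends.
n = 6
D = difference_matrix(n, order=2)
theta_linear = np.arange(n, dtype=float)  # an exactly linear trend
print(D @ theta_linear)  # second differences of a line are all zero
```

With a Laplace prior on `D @ theta` this recovers the classical $\ell_1$ trend filtering penalty; heavier-tailed shrinkage priors give the adaptive behavior the abstract refers to.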
URL: https://openreview.net/forum?id=AHTz2mTlKk
---
Title: Multi-Output Distributional Fairness via Post-Processing
Authors: Gang Li, Qihang Lin, Ayush Ghosh, Tianbao Yang
Abstract: Post-processing approaches have become prominent techniques for enhancing machine learning models' fairness because of their intuitiveness, low computational cost, and excellent scalability. However, most existing post-processing methods are designed for task-specific fairness measures and are limited to single-output models. In this paper, we introduce a post-processing method for multi-output models, such as the ones used for multi-task/multi-class classification and representation learning, to enhance a model's distributional parity, a task-agnostic fairness measure. Existing methods for achieving distributional parity rely on the (inverse) cumulative distribution function of a model's output, restricting their applicability to single-output models. Extending previous works, we propose to employ optimal transport mappings to move a model's outputs across different groups towards their empirical Wasserstein barycenter. An approximation technique is applied to reduce the complexity of computing the exact barycenter and a kernel regression method is proposed to extend this process to out-of-sample data. Our empirical studies evaluate the proposed approach against various baselines on multi-task/multi-class classification and representation learning tasks, demonstrating the effectiveness of the proposed approach.
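In the single-output case, the Wasserstein barycenter of group score distributions reduces to averaging quantile functions, which is why the CDF-based methods the abstract mentions work there. A toy one-dimensional sketch of barycenter post-processing for two equally sized groups (illustrative only; the paper's multi-output mapping, barycenter approximation, and kernel-regression extension are not reproduced here):

```python
import numpy as np

def barycenter_postprocess(scores_a, scores_b):
    """Map each group's 1-D scores onto the Wasserstein barycenter of the
    two empirical distributions (equal group sizes, equal weights).
    In 1-D the barycenter's quantiles are the average of group quantiles."""
    order_a, order_b = np.argsort(scores_a), np.argsort(scores_b)
    bary = (np.sort(scores_a) + np.sort(scores_b)) / 2.0
    out_a, out_b = np.empty_like(bary), np.empty_like(bary)
    out_a[order_a] = bary  # each sample moves to the barycenter value at its rank
    out_b[order_b] = bary
    return out_a, out_b

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 500)   # group A scores
b = rng.normal(0.5, 2.0, 500)   # group B scores: shifted and wider
new_a, new_b = barycenter_postprocess(a, b)
# After post-processing, both groups share one score distribution
# (distributional parity), while within-group rankings are preserved.
```

The rank-preserving monotone map is exactly the 1-D optimal transport map; extending it to multi-output models is the nontrivial step the paper addresses.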
URL: https://openreview.net/forum?id=MJOKrHqiV1
---
New submissions
===============
Title: Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) to enhance Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to lengthy and redundant outputs, known as the "overthinking phenomenon".
Efficient Reasoning, which seeks to optimize reasoning length while preserving reasoning capabilities, offers practical benefits such as faster processing times, lower energy consumption, and improved responsiveness, especially valuable for reasoning-intensive applications. Despite its potential, efficient reasoning remains in the early stages of research.
In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Based on the inherent mechanisms of LLMs, we categorize existing work into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning-output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input-prompt-based efficient reasoning, which seeks to improve reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking.
URL: https://openreview.net/forum?id=HvoG8SxggZ
---
Title: Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting
Abstract: Flow matching has recently emerged as a powerful paradigm for generative modeling and has been extended to probabilistic time series forecasting. However, the impact of the specific choice of probability path model on forecasting performance, particularly for high-dimensional spatio-temporal dynamics, remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting.
URL: https://openreview.net/forum?id=JApMDLwbLR
---
Title: Are Convex Optimization Curves Convex?
Abstract: In this paper, we study when we might expect the optimization curve induced by gradient descent to be \emph{convex} -- precluding, for example, an initial plateau followed by a sharp decrease, making it difficult to decide when optimization should stop. Although such undesirable behavior can certainly occur when optimizing general functions, might it also occur in the benign and well-studied case of smooth convex functions? As far as we know, this question has not been tackled in previous work. We show, perhaps surprisingly, that the answer crucially depends on the choice of the step size. In particular, for the range of step sizes which are known to result in monotonic convergence to an optimal value, we characterize a regime where the optimization curve will be provably convex, and a regime where the curve can be non-convex. We also extend our results to gradient flow, and to the closely-related but different question of whether the gradient norm decreases monotonically.
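As a toy illustration of the question posed in this abstract (not the paper's analysis), one can run gradient descent on an $L$-smooth convex quadratic with a step size below $1/L$ and test convexity of the optimization curve through its second differences; the paper characterizes which step-size regimes provably guarantee a convex curve and which do not.

```python
import numpy as np

def gd_curve(grad, f, x0, step, iters):
    """Run gradient descent and record the optimization curve f(x_k)."""
    x, curve = x0, [f(x0)]
    for _ in range(iters):
        x = x - step * grad(x)
        curve.append(f(x))
    return np.array(curve)

# f(x) = 0.5 * L * x^2 is L-smooth and convex.
L = 4.0
f = lambda x: 0.5 * L * x * x
grad = lambda x: L * x
curve = gd_curve(grad, f, x0=1.0, step=0.2, iters=30)  # step < 1/L = 0.25

# The discrete curve is convex iff all second differences are nonnegative.
second_diffs = curve[2:] - 2 * curve[1:-1] + curve[:-2]
```

Here `curve` decays geometrically, so the second differences are all nonnegative; the interesting cases in the paper are step sizes where convergence is still monotone but this convexity can fail.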
URL: https://openreview.net/forum?id=TZtpxselK2
---
Title: Wasn’t Me: Enabling Users to Falsify Deepfake Attacks
Abstract: The rise of deepfake technology has made everyone vulnerable to false claims based on manipulated media. While many existing deepfake detection methods aim to identify fake media, they often struggle with deepfakes created by new generative models not seen during training. In this paper, we propose VeriFake, a method that enables users to prove that the media claiming to show them are false. VeriFake is based on two key assumptions: (i) generative models struggle to exactly depict a specific identity, and (ii) they often fail to perfectly synchronize generated lip movements with speech. By combining these assumptions with powerful modern representation encoders, VeriFake achieves highly effective results, even against previously unseen deepfakes. Through extensive experiments, we demonstrate that VeriFake significantly outperforms state-of-the-art deepfake detection techniques despite being simple to implement and not relying on any fake data for pretraining.
URL: https://openreview.net/forum?id=jl6G0DgyaT
---
Title: TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining
Abstract: Large-scale general domain pretraining followed by downstream-specific finetuning has become a predominant paradigm in machine learning. However, discrepancies between the pretraining and target domains can still lead to performance degradation in certain cases, underscoring the need for task-adaptive continued pretraining (TAP). TAP methods typically involve continued pretraining on task-specific unlabeled datasets or introducing additional unsupervised learning objectives to enhance model capabilities. While many TAP methods perform continued pretraining with multiple pretraining objectives, they often determine the tradeoff parameters between objectives manually, resulting in suboptimal outcomes and higher computational costs. In this paper, we propose TapWeight, a task-adaptive pretraining framework which automatically determines the optimal importance of each pretraining objective based on downstream feedback. TapWeight reweights each pretraining objective by solving a multi-level optimization problem. We applied TapWeight to both molecular property prediction and natural language processing tasks, significantly surpassing baseline methods. Experimental results validate the effectiveness and generalizability of TapWeight. Our code is available in the supplementary material.
URL: https://openreview.net/forum?id=DCCw2CEVFS
---
Title: Federated Generalized Novel Category Discovery with Prompts Tuning
Abstract: Generalized category discovery (GCD) handles categories whose labels are unseen during training by clustering them at the inference stage. Most work in GCD provides solutions for unseen classes in data-centralized settings. However, unlabeled categories possessed by clients, which are common in real-world federated learning (FL), have been largely ignored and degrade the performance of classic FL algorithms. To demonstrate and mitigate the harmful effect of unseen classes, we dive into a GCD problem setting applicable to FL, named FedGCD, establish a strong baseline constructed with the state-of-the-art GCD algorithm simGCD, and design a learning framework with prompt tuning to tackle both the overfitting and communication-burden problems in FedGCD. In our method, clients first separately carry out prompt learning on local data. Then, we aggregate the prompts from all clients into a global prompt that captures global knowledge, and send this global prompt back to the clients to give them access to broader knowledge from other clients. This significantly reduces the number of parameters that must be uploaded in FedGCD, alleviating a common obstacle in the real-world application of most FL algorithms. We conduct experiments on both generic and fine-grained datasets such as CIFAR-100 and CUB-200, and show that our method is comparable to the FL version of simGCD and surpasses other baselines with significantly fewer parameters to transmit.
URL: https://openreview.net/forum?id=dVMESwnMlo
---
Title: AttentionSmithy: A Modular Framework for Rapid Transformer Development and Customization
Abstract: Transformer architectures have revolutionized a broad spectrum of AI applications by leveraging attention mechanisms for parallelized and long-range sequence processing. Despite their remarkable success, building and customizing transformers remains prohibitively complex for many domain experts who lack deep knowledge of low-level implementations. We introduce AttentionSmithy, a modular software package that lowers the barrier to transformer innovation by decomposing key components (attention modules, feed-forward networks, normalization layers, and positional encodings) into reusable building blocks. By disentangling architectural elements into well-defined interfaces, users can rapidly prototype, adapt, and evaluate transformer variants without extensive coding overhead. Our framework supports four distinct positional encoding strategies (sinusoidal, learned, rotary, and ALiBi) and integrates seamlessly with neural architecture search (NAS) for automated design exploration. We validate AttentionSmithy by replicating the original “Attention Is All You Need” transformer under resource constraints, demonstrating near state-of-the-art performance on a machine translation task. Leveraging the package’s integrated NAS capability, we made the unexpected discovery that machine translation performance is maximized by combining all available positional encoding methods, highlighting the complementary benefits of each strategy. We further illustrate AttentionSmithy’s adaptability through gene-specific modeling, where a variant of a BERT-style architecture achieves over 95% accuracy on downstream cell type classification tasks using ranked transcriptomic data. These case studies underscore AttentionSmithy’s core advantage: enabling specialized experimentation across diverse application domains, from natural language processing to genomic analysis, by obviating the need for labor-intensive, low-level framework manipulation. We anticipate that AttentionSmithy will serve as a foundation for creative transformer-based solutions, expediting research and development in numerous scientific and industrial fields.
URL: https://openreview.net/forum?id=0jhoriH9yA
---
Title: Does equivariance matter at scale?
Abstract: Given large data sets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.
URL: https://openreview.net/forum?id=wilNute8Tn
---
Title: Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations
Abstract: Training neural networks with high certified accuracy against adversarial examples remains an open challenge despite significant efforts. While certification methods can effectively leverage tight convex relaxations for bound computation, in training these methods can, perhaps surprisingly, perform worse than looser relaxations. Prior work hypothesized that this phenomenon is caused by the discontinuity, non-smoothness, and perturbation sensitivity of the loss surface induced by tighter relaxations. In this work, we theoretically show that Gaussian Loss Smoothing (GLS) can alleviate these issues. We confirm this empirically by instantiating GLS with two variants: a zeroth-order optimization algorithm, called PGPE, which allows training with non-differentiable relaxations, and a first-order optimization algorithm, called RGS, which requires gradients of the relaxation but is much more efficient than PGPE. Extensive experiments show that, when combined with tight relaxations, these methods surpass state-of-the-art methods when training on the same network architecture in many settings. Our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks and pave a path towards leveraging tighter relaxations for certified training.
URL: https://openreview.net/forum?id=lknvxcjuos
---
Title: Black Box Causal Inference: Effect Estimation via Meta Prediction
Abstract: Causal inference and the estimation of causal effects play a central role in decision-making across many areas, including healthcare and economics. Estimating causal effects typically requires an estimator that is tailored to each problem of interest. But developing estimators can take significant effort for even a single causal inference setting. For example, algorithms for regression-based estimators, propensity score methods, and doubly robust methods were designed across several decades to handle causal estimation with observed confounders. Similarly, several estimators have been developed to exploit instrumental variables (IVs), including two-stage least-squares (TSLS), control functions, and the method-of-moments. In this work, we instead frame causal inference as a dataset-level prediction problem, offloading algorithm design to the learning process. The approach we introduce, called black box causal inference (BBCI), builds estimators in a black-box manner by learning to predict causal effects from sampled dataset-effect pairs. We demonstrate accurate estimation of average treatment effects (ATEs) and conditional average treatment effects (CATEs) with BBCI across several causal inference problems with known identification, including problems with less developed estimators.
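The dataset-level prediction framing can be sketched in a few lines: simulate datasets with known effects, featurize each dataset, and fit a meta-predictor from dataset features to effects. A toy randomized-experiment version follows; the simulator, the summary-statistic features, and the linear meta-learner here are illustrative assumptions, not the paper's data-generating process or architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_dataset(n=400):
    """Simulate one randomized experiment with a known ATE (toy SCM)."""
    ate = rng.uniform(-2.0, 2.0)
    t = rng.integers(0, 2, n)                    # randomized treatment
    y = 0.5 + ate * t + rng.normal(0.0, 1.0, n)  # outcome
    return t, y, ate

def featurize(t, y):
    """Dataset-level summary features the meta-learner sees."""
    return np.array([y[t == 1].mean(), y[t == 0].mean(), y.std(), 1.0])

# Build (dataset-features, effect) pairs and fit a linear meta-predictor.
train = [sample_dataset() for _ in range(200)]
X = np.stack([featurize(t, y) for t, y, _ in train])
effects = np.array([ate for _, _, ate in train])
w, *_ = np.linalg.lstsq(X, effects, rcond=None)

# Predict the ATE of a fresh dataset from its features alone.
t, y, true_ate = sample_dataset()
pred = featurize(t, y) @ w
```

In this trivially identified setting the meta-learner essentially rediscovers the difference-in-means estimator; the paper's point is that the same black-box recipe extends to settings where hand-designed estimators are less developed.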
URL: https://openreview.net/forum?id=KEtlsENZSE
---
Title: Dependency-aware Maximum Likelihood Estimation for Active Learning
Abstract: Active learning aims to efficiently build a labeled training set by strategically selecting samples to query labels from annotators. In this sequential process, each sample acquisition influences subsequent selections, causing dependencies among samples in the labeled set. However, these dependencies are overlooked during the model parameter estimation stage when updating the model using Maximum Likelihood Estimation (MLE), a conventional method that assumes independent and identically distributed (i.i.d.) data. We propose Dependency-aware MLE (DMLE), which corrects MLE within the active learning framework by addressing sample dependencies typically neglected due to the i.i.d. assumption, ensuring consistency with active learning principles in the model parameter estimation process. This improved method achieves superior performance across multiple benchmark datasets, reaching higher performance in earlier cycles compared to conventional MLE. Specifically, we observe average accuracy improvements of 6\%, 8.6\%, and 10.5\% for $k=1$, $k=5$, and $k=10$ respectively, after collecting the first 100 samples, where entropy is the acquisition function and $k$ is the query batch size acquired at every active learning cycle.
URL: https://openreview.net/forum?id=qDVDSXXGK1
---