J2C Certification: In-context Learning in Presence of Spurious Correlations
Hrayr Harutyunyan, Rafayel Darbinyan, Samvel Karapetyan, Hrant Khachatrian
https://openreview.net/forum?id=C9CSaTR1iA
---
Expert Certification: Approximately Equivariant Recurrent Generative Models for Quasi-Periodic Time Series with a Progressive Training Scheme
Ruwen Fulek, Markus Lange-Hegermann
https://openreview.net/forum?id=KHk5EECG3Z
---
Featured Certification, J2C Certification: E$^2$M: Double Bounded $\alpha$-Divergence Optimization for Tensor-based Discrete Density Estimation
Kazu Ghalamkari, Jesper Løve Hinrich, Morten Mørup
https://openreview.net/forum?id=954CjhXSXL
---
Featured Certification: PRISM: PRIor from corpus Statistics for topic Modeling
Tal Ishon, Yoav Goldberg, Uri Shaham
https://openreview.net/forum?id=454v3Xbtza
---
Accepted papers
===============
Title: A Complete Decomposition of KL Error using Refined Information and Mode Interaction Selection
Authors: James Enouen, Mahito Sugiyama
Abstract: The log-linear model has received a significant amount of theoretical attention in previous decades and remains the fundamental tool used for learning probability distributions over discrete variables. Despite its large popularity in statistical mechanics and high-dimensional statistics, the vast majority of related energy-based models only focus on the two-variable relationships, such as Boltzmann machines and Markov graphical models. Although these approaches have easier-to-solve structure learning problems and easier-to-optimize parametric distributions, they often ignore the rich structure which exists in the higher-order interactions between different variables. Using more recent tools from the field of information geometry, we revisit the classical formulation of the log-linear model with a focus on higher-order mode interactions, going beyond the 1-body modes of independent distributions and the 2-body modes of Boltzmann distributions. This perspective allows us to define a complete decomposition of the KL error. This then motivates the formulation of a sparse selection problem over the set of possible mode interactions. In the same way as sparse graph selection allows for better generalization, we find that our learned distributions are able to more efficiently use the finite amount of data which is available in practice. We develop an algorithm called MAHGenTa which leverages a novel Monte-Carlo sampling technique for energy-based models alongside a greedy heuristic for incorporating statistical robustness. On both synthetic and real-world datasets, we demonstrate our algorithm’s effectiveness in maximizing the log-likelihood for the generative task and also the ease of adaptability to the discriminative task of classification.
URL: https://openreview.net/forum?id=DxUXI15C38
---
Title: Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
Authors: Jinxin Zhou, Jiachen Jiang, Zhihui Zhu
Abstract: Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across \emph{layer}, \emph{head}, and \emph{token} levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image–text alignment at the expense of visual discriminability (e.g., the last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) displays consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Comprehensive experiments on eight widely used semantic segmentation benchmarks demonstrate that LHT-CLIP achieves substantial performance improvements across diverse scenarios, underscoring its effectiveness and practicality for real-world deployment.
URL: https://openreview.net/forum?id=9spNW3DXg5
---
Title: Unrealized Expectations: Comparing AI Methods vs Classical Algorithms for Maximum Independent Set
Authors: Yikai Wu, Haoyu Zhao, Sanjeev Arora
Abstract: AI methods, such as generative models and reinforcement learning, have recently been applied to combinatorial optimization (CO) problems, especially NP-hard ones. This paper compares such GPU-based methods with classical CPU-based methods on Maximum Independent Set (MIS). Strikingly, even on in-distribution random graphs, leading AI-inspired methods are consistently outperformed by the state-of-the-art classical solver KaMIS running on a single CPU, and some AI-inspired methods frequently fail to surpass even the simplest degree-based greedy heuristic. Even with post-processing techniques like local search, AI-inspired methods still perform worse than CPU-based solvers. To better understand the source of these failures, we introduce a novel analysis, serialization, which reveals that non-backtracking AI-inspired methods, e.g. LTFT (which is based on GFlowNets), end up reasoning similarly to the simplest degree-based greedy heuristic, and thus worse than KaMIS. More generally, our findings suggest a need to rethink current approaches in AI for CO, advocating for more rigorous benchmarking and the principled integration of classical heuristics. Additionally, we find that the CPU-based algorithm KaMIS has strong performance on sparse random graphs, which appears to show that the shattering threshold conjecture for large independent sets proposed by Coja-Oghlan & Efthymiou (2015) is either false or does not apply at real-life sizes (such as $10^6$ nodes).
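The degree-based greedy heuristic that serves as the paper's simplest baseline fits in a few lines; here is a minimal sketch (our illustration, not the authors' code), which repeatedly picks a minimum-degree node and removes it together with its neighbours:

```python
import random

def greedy_mis(adj):
    """Degree-based greedy for Maximum Independent Set.
    adj: dict mapping node -> set of neighbours."""
    adj = {u: set(vs) for u, vs in adj.items()}  # local copy
    independent = []
    while adj:
        u = min(adj, key=lambda v: len(adj[v]))  # min residual degree
        independent.append(u)
        removed = adj.pop(u) | {u}               # delete u and neighbours
        for w in list(adj):
            if w in removed:
                del adj[w]
            else:
                adj[w] -= removed
    return independent

# Toy Erdos-Renyi graph
random.seed(0)
n, p = 50, 0.1
adj = {i: set() for i in range(n)}
for i in range(n):
    for j in range(i + 1, n):
        if random.random() < p:
            adj[i].add(j); adj[j].add(i)
print(len(greedy_mis(adj)))
```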
URL: https://openreview.net/forum?id=ksGoCT5zW6
---
Title: Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown
Authors: Sandra Gómez-Gálvez, Tobias Olenyi, Gillian Dobbie, Katerina Taskova
Abstract: Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high-stakes applications. Current ad-hoc confidence calibration methods attempt to fix this during training but face a fundamental trade-off: two-phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single-loss methods are stable but underperform in classification. This paper addresses and mitigates this stability-performance trade-off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function, together with a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving a more favorable accuracy-calibration trade-off, often converging faster than existing methods.
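To make the shape of such a unified objective concrete, here is an illustrative PyTorch sketch of a loss with K+1 logits (the extra class acting as "unknown") and an uncertainty penalty; the exact form of Socrates Loss is defined in the paper, and the penalty below is our own simplified stand-in:

```python
import torch
import torch.nn.functional as F

def socrates_style_loss(logits, targets, lam=0.1):
    """Illustrative only, NOT the paper's exact formulation.
    logits: (B, K+1), where column K is an auxiliary "unknown" class."""
    probs = logits.softmax(dim=-1)
    p_unknown = probs[:, -1]                        # mass on "unknown"
    ce = F.cross_entropy(logits[:, :-1], targets)   # task classification
    correct = (logits[:, :-1].argmax(-1) == targets).float()
    # push unknown mass down on correct samples, up on mistakes
    penalty = (correct * p_unknown + (1 - correct) * (1 - p_unknown)).mean()
    return ce + lam * penalty

logits = torch.randn(8, 11, requires_grad=True)  # 10 classes + unknown
loss = socrates_style_loss(logits, torch.randint(0, 10, (8,)))
loss.backward()
```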
URL: https://openreview.net/forum?id=DONqw1KhHq
---
Title: In-context Learning in Presence of Spurious Correlations
Authors: Hrayr Harutyunyan, Rafayel Darbinyan, Samvel Karapetyan, Hrant Khachatrian
Abstract: Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to in-weights learning and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.
URL: https://openreview.net/forum?id=C9CSaTR1iA
---
Title: DRAW: Domain Weight Randomization with Bayesian Updating for LLM Pre-Training
Authors: Ruonan Wang, Yongqi Qiao, Zhonglin Xie, Kun Yuan
Abstract: Optimal pre-training data mixture is pivotal for large language model (LLM) performance, but searching for the best domain weights is computationally expensive. We present Domain Weight Randomization with Bayesian Updating (DRAW), a principled framework treating domain weights as Dirichlet-distributed random variables whose parameters scale with model width. Informative priors are first estimated using proxy models; the main model then refines these using Bayesian inference and parameter scaling, dynamically sampling domain weights during training. Theoretically, DRAW reduces generalization error at a rate $\mathcal{O}(1/\sqrt{n})$ as model width increases, ensuring stable convergence. Empirical results on open-domain corpora and diverse benchmarks show DRAW reliably outperforms fixed and adaptive baselines in both language modeling and downstream tasks, achieving better average and worst-case performance alongside strong robustness. DRAW not only highlights valuable data domains while suppressing noisy ones, but also introduces a scalable and effective mechanism for adaptive data mixing in LLM pre-training, facilitating efficient knowledge transfer from proxy to large models.
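A minimal sketch of the sampling mechanism the abstract describes, with hypothetical proxy weights and scaling constants; the paper's actual Bayesian update rule and width scaling are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
domains = ["web", "code", "books", "wiki"]

# Informative prior from proxy-model runs (hypothetical numbers),
# with the Dirichlet concentration scaled by model width.
proxy_weights = np.array([0.45, 0.25, 0.20, 0.10])
width_scale = 64.0                       # grows with model width
alpha = proxy_weights * width_scale      # Dirichlet concentration

for step in range(3):
    w = rng.dirichlet(alpha)             # sample mixture for this phase
    print(step, dict(zip(domains, w.round(3))))
    # A Bayesian update would adjust alpha from observed losses here;
    # the exact update rule is specified in the paper, not reproduced.
```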
URL: https://openreview.net/forum?id=tc8TyD7ZyD
---
Title: Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning
Authors: Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Ajay Jaiswal, Li Shen, Xiaolong Ma, Shiwei Liu, Lu Yin
Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated, such as DeepSeekMoE and QwenMoE. We demonstrate the effectiveness of our method. Specifically, for the DeepSeekMoE-16B model, our approach maintains 90% of the average accuracy while reducing memory usage by 27.5% and increasing inference speed by a factor of 1.26. Moreover, we show that by applying lightweight expert fine-tuning—only to the condensed layers—and using 5 hours on a single 80G A100 GPU, we can successfully recover 98% of the original performance.
URL: https://openreview.net/forum?id=BQe6j6sAu6
---
Title: Towards Generalized Certified Robustness with Multi-Norm Training
Authors: Enyi Jiang, David Shu Cheung, Gagandeep Singh
Abstract: Existing certified training methods can only train models to be robust against a certain perturbation type (e.g. $l_\infty$ or $l_2$). However, an $l_\infty$ certifiably robust model may not be certifiably robust against $l_2$ perturbations (and vice versa) and may also have low robustness against other perturbations (e.g. geometric and patch transformations). By constructing a theoretical framework to analyze and mitigate the tradeoff, we propose the first multi-norm certified training framework \textbf{CURE}, consisting of several multi-norm certified training methods, to attain better \emph{union robustness} when training from scratch or fine-tuning a pre-trained certified model. Inspired by our theoretical findings, we devise bound alignment and connect natural training with certified training for better union robustness. Compared with SOTA certified training, \textbf{CURE} improves union robustness by up to $32.0\%$ on MNIST, $25.8\%$ on CIFAR-10, and $10.6\%$ on TinyImagenet across different epsilon values, and improves generalization to a diverse set of challenging unseen geometric and patch perturbations by up to $6.8\%$ and $16.0\%$ on CIFAR-10. Overall, our contributions pave a path towards \textit{generalized certified robustness}.
URL: https://openreview.net/forum?id=U5U7pazr6X
---
Title: Rethinking Coreset Selection: The Surprising Effectiveness of Soft Labels
Authors: Saumyaranjan Mohanty, Deexitha Vattivella, Konda Reddy Mopuri
Abstract: Data-efficient deep learning is an emerging and powerful branch of deep learning that focuses on minimizing the amount of labeled data required for training. Coreset selection is one such method, where the goal is to select a representative subset from the original dataset, which can achieve comparable generalization performance at a much lower computation and disk space overhead. Dataset Distillation (DD), another branch of data-efficient deep learning, achieves this goal through distilling a small synthetic dataset from the original dataset. While DD works exploit soft labels (probabilistic target labels instead of traditional one-hot labels), which have yielded significant improvements over hard labels, to the best of our knowledge, no such study exists for coreset selection. In this work, for the first time, we study the impact of soft labels on generalization accuracy for the image classification task for various coreset selection algorithms. While soft labels improve the performance of all the methods, surprisingly, random selection with soft labels performs on par with or better than existing coreset selection approaches. Our findings suggest that future coreset algorithms should benchmark against random selection with soft labels as an important baseline.
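The baseline the authors advocate is easy to state precisely; a small sketch, where `teacher_probs` stands in for soft labels produced by any trained reference model:

```python
import numpy as np

def random_coreset_with_soft_labels(X, teacher_probs, frac=0.1, seed=0):
    """Pick a uniformly random subset and keep the teacher's
    probabilistic (soft) labels for it, instead of one-hot labels.
    teacher_probs: (N, K) per-class probabilities from any trained model."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
    return X[idx], teacher_probs[idx]

X = np.random.randn(1000, 32)
teacher_probs = np.random.dirichlet(np.ones(10), size=1000)  # stand-in teacher
X_core, y_soft = random_coreset_with_soft_labels(X, teacher_probs)
print(X_core.shape, y_soft.shape)   # (100, 32) (100, 10)
```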
URL: https://openreview.net/forum?id=Ll78kAR1lj
---
Title: GENIE: Watermarking Graph Neural Networks for Link Prediction
Authors: Venkata Sai Pranav Bachina, Aaryan Ajay Sharma, Charu Sharma, Ankit Gangwal
Abstract: The rapid adoption, usefulness, and resource-intensive training of Graph Neural Network (GNN) models have made them an invaluable intellectual property in graph-based machine learning. However, their widespread adoption also makes them susceptible to stealing, necessitating robust Ownership Demonstration (OD) techniques. Watermarking is a promising OD framework for deep neural networks, but existing methods fail to generalize to GNNs due to the non-Euclidean nature of graph data. Existing works on GNN watermarking primarily focus on node and graph classification, overlooking Link Prediction (LP).
In this paper, we propose GENIE (watermarking Graph nEural Networks for lInk prEdiction), the first scheme to watermark GNNs for LP. GENIE creates a novel backdoor for both node-representation and subgraph-based LP methods, utilizing a unique trigger set and a secret watermark vector. Our OD scheme is equipped with Dynamic Watermark Thresholding (DWT), ensuring high verification probability while addressing practical issues in existing OD schemes. We extensively evaluate GENIE across 4 diverse model architectures (i.e., SEAL, GCN, GraphSAGE, and NeoGNN), 7 real-world datasets, and 21 watermark removal techniques, and demonstrate its robustness to watermark removal and ownership piracy attacks. Finally, we discuss adaptive attacks against GENIE and a defense strategy to counter them.
URL: https://openreview.net/forum?id=EmDuoySsbe
---
Title: Approximately Equivariant Recurrent Generative Models for Quasi-Periodic Time Series with a Progressive Training Scheme
Authors: Ruwen Fulek, Markus Lange-Hegermann
Abstract: We present a simple yet effective generative model for time series, based on a Recurrent Variational Autoencoder that we refer to as RVAE-ST. Recurrent layers often struggle with unstable optimization and poor convergence when modeling long sequences. To address these limitations, we introduce a progressive training scheme that gradually increases the sequence length, stabilizing optimization and enabling consistent learning over extended horizons. By composing known components into a recurrent, approximately time-shift-equivariant topology, our model introduces an inductive bias that aligns with the structure of quasi-periodic and nearly stationary time series. Across several benchmark datasets, RVAE-ST matches or surpasses state-of-the-art generative models, particularly on quasi-periodic data, while remaining competitive on more irregular signals. Performance is evaluated through ELBO, Fréchet Distance, discriminative metrics, and visualizations of the learned latent embeddings.
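The progressive schedule itself is straightforward to illustrate; a toy sketch on a generic recurrent model (not RVAE-ST), where training windows grow from 64 to 512 steps:

```python
import torch
import torch.nn as nn

# Minimal sketch of progressive sequence-length training: train on
# short windows first, then lengthen them.  Model and data are toys.
model = nn.GRU(input_size=1, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)

series = torch.sin(torch.linspace(0, 100, 5000)).unsqueeze(-1)  # toy signal

for seq_len in [64, 128, 256, 512]:          # progressively longer windows
    for _ in range(100):
        start = torch.randint(0, len(series) - seq_len - 1, (16,))
        x = torch.stack([series[s:s + seq_len] for s in start])
        y = torch.stack([series[s + 1:s + seq_len + 1] for s in start])  # next-step target
        out, _ = model(x)
        loss = ((head(out) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(seq_len, float(loss))
```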
URL: https://openreview.net/forum?id=KHk5EECG3Z
---
Title: Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Authors: Muhammad Umair Haider, Hammad Rizwan, Hassan Sajjad, Peizhong Ju, A.B. Siddique
Abstract: Pervasive polysemanticity in large language models (LLMs) undermines discrete neuron–concept attribution, posing a significant challenge for model interpretation and control. We systematically analyze both encoder- and decoder-based LLMs across diverse datasets, and observe that even highly salient neurons for specific semantic concepts consistently exhibit polysemantic behavior. Importantly, we uncover a consistent pattern: concept-conditioned activation magnitudes of neurons form distinct, often Gaussian-like distributions with minimal overlap. Building on this observation, we hypothesize that interpreting and intervening on concept-specific activation ranges can enable more precise interpretability and targeted manipulation in LLMs. To this end, we introduce NeuronLens, a novel range-based interpretation and manipulation framework that localizes concept attribution to activation ranges within a neuron. Extensive empirical evaluations show that range-based interventions enable effective manipulation of target concepts while causing substantially less collateral degradation to auxiliary concepts and overall model performance compared to neuron-level masking.
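A minimal numpy sketch of the range idea (our illustration, not the NeuronLens implementation): estimate the activation range a neuron occupies for a concept, then edit the neuron only inside that range:

```python
import numpy as np

def concept_range(acts_on_concept, q=(0.05, 0.95)):
    """Estimate the activation range a neuron occupies when the concept
    is present, from its roughly Gaussian conditional distribution."""
    return np.quantile(acts_on_concept, q)

def range_intervene(act, lo, hi, fill=0.0):
    """Suppress a concept by editing the neuron only when its activation
    falls inside the concept's range, leaving its other (polysemantic)
    behaviour untouched."""
    return np.where((act >= lo) & (act <= hi), fill, act)

# Toy polysemantic neuron: concept A around 2.0, concept B around -1.5
acts_A = np.random.normal(2.0, 0.3, 500)
acts_B = np.random.normal(-1.5, 0.4, 500)
lo, hi = concept_range(acts_A)
mixed = np.concatenate([acts_A[:5], acts_B[:5]])
print(range_intervene(mixed, lo, hi))   # only concept-A activations zeroed
```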
URL: https://openreview.net/forum?id=AukyIhfBuW
---
Title: Interpreting Kolmogorov-Arnold Networks in Neuroimaging: A Path-Based Attribution Framework
Authors: Suhrud Murthy, Venkatesh Babu Radhakrishnan, Neelam Sinha
Abstract: Explainability aspects of most classification models are learnt through instance-specific analysis. However, in understanding diseases, it is important to consider population-wide analysis in order to identify affected regions that are consistently seen across cohorts of the diseased population. In this study, we report the utility of Kolmogorov-Arnold Networks (KANs) in understanding population-wide characteristics seen in subjects affected by Alzheimer's disease (AD). KANs offer enhanced interpretability through learnable activation functions on network edges. Thus, the learned functions reflect the characteristics of the entire span of training data. In a KAN trained for classification, attributions through the network can be traced to understand how specific inputs influence the output label. In this study, we propose a path-based attribution framework that generates global importance maps by tracing exhaustive information flow through all potential paths. Our method initially scores the functions on the edges of a trained KAN using an appropriate scoring function. Subsequently, these scores are propagated through the network to compute path attributions. This approach scales linearly with network depth, depends only on the trained model, and requires no post-hoc analysis of the training data. Evaluations on three public AD neuroimaging datasets (OASIS, ADNI, and Mendeley, together comprising 7428 acquisitions) were carried out on 2D brain slices as well as 3D brain volumes. The corresponding KAN test accuracies are $93.24\%$, $81.85\%$, and $91.25\%$ on OASIS, ADNI, and Mendeley, respectively. We also demonstrate competitive or improved performance on metrics such as Insertion AUC, Deletion AUC, and Sufficiency. The generated attribution maps identify clinically meaningful regions, including the body and genu of the corpus callosum, corona radiata, bilateral caudate nuclei, medial prefrontal cortex, and temporal lobe structures, aligned with established AD pathology literature. By providing voxel-level global attributions as network-intrinsic properties, our framework addresses a critical gap in AI interpretability and supports exploratory clinical analysis and model auditing of AI-assisted AD diagnosis systems.
URL: https://openreview.net/forum?id=cPtKpNdYc2
---
Title: E$^2$M: Double Bounded $\alpha$-Divergence Optimization for Tensor-based Discrete Density Estimation
Authors: Kazu Ghalamkari, Jesper Løve Hinrich, Morten Mørup
Abstract: Tensor-based discrete density estimation requires flexible modeling and proper divergence criteria to enable effective learning; however, traditional approaches using α-divergence face analytical challenges due to the α-power terms in the objective function, which hinder the derivation of closed-form update rules. We present a generalization of the expectation-maximization (EM) algorithm, called the E$^2$M algorithm. It circumvents this issue by first relaxing the optimization into the minimization of a surrogate objective based on the Kullback–Leibler (KL) divergence, which is tractable via the standard EM algorithm, and subsequently applying a tensor many-body approximation in the M-step to enable simultaneous closed-form updates of all parameters. Our approach offers flexible modeling for not only a variety of low-rank structures, including the CP, Tucker, and Tensor Train formats, but also their mixtures, thus allowing us to leverage the strengths of different low-rank structures. We evaluate the effectiveness of our approach on synthetic and real datasets, highlighting its comparable convergence to gradient-based procedures, robustness to outliers, and favorable density estimation performance compared to prominent existing tensor-based methods.
URL: https://openreview.net/forum?id=954CjhXSXL
---
Title: D-Garment: Physically Grounded Latent Diffusion for Dynamic Garment Deformations
Authors: Antoine Dumoulin, Adnane Boukhayma, Laurence Boissieux, Bharath Bhushan Damodaran, Pierre Hellier, Stefanie Wuhrer
Abstract: We present a method to dynamically deform 3D garments, in the form of a 3D polygon mesh, based on body shape, motion, and physical cloth material properties. Considering physical cloth properties allows learning a physically grounded model, with the advantage of being more accurate in terms of physically inspired metrics such as strain or curvature.
Existing work studies pose-dependent garment modeling to generate garment deformations from example data, and possibly data-driven dynamic cloth simulation to generate realistic garments in motion. We propose *D-Garment*, a learning-based approach trained on new data generated with a physics-based simulator. Compared to prior work, our 3D generative model learns garment deformations conditioned on physical material properties, which makes it possible to model loose cloth geometry, especially for large deformations and dynamic wrinkles driven by body motion. Furthermore, the model can be efficiently fitted to observations captured using vision sensors such as 3D point clouds. We leverage the capability of diffusion models to learn flexible and powerful generative priors by modeling the 3D garment in a 2D parameter space independently from the mesh resolution. This representation allows learning a template-specific latent diffusion model and conditioning global and local geometry on body and cloth material information.
We quantitatively and qualitatively evaluate *D-Garment* on both simulations and data captured with a multi-view acquisition platform. Compared to recent baselines, our method is more realistic and accurate in terms of shape similarity and physical validity metrics. Code and data are available for research purposes at https://dumoulina.github.io/d-garment/
URL: https://openreview.net/forum?id=NrPyio1aUK
---
Title: Empowering Power Outage Prediction with Spatially Aware Hybrid Graph Neural Networks and Contrastive Learning
Authors: Xuyang Shen, Zijie Pan, Diego Cerrai, Xinxuan Zhang, Christopher Colorio, Emmanouil Anagnostou, Dongjin Song
Abstract: Extreme weather events, such as severe storms, hurricanes, snowstorms, and ice storms, which are exacerbated by climate change, frequently cause widespread power outages. These outages halt industrial operations, impact communities, damage critical infrastructure, profoundly disrupt economies, and have far-reaching effects across various sectors. To mitigate these effects, the University of Connecticut and Eversource Energy Center have developed an outage prediction modeling (OPM) system to provide pre-emptive forecasts for electric distribution networks before such weather events occur. However, existing predictive models in the system do not incorporate the spatial effect of extreme weather events. To this end, we develop Spatially Aware Hybrid Graph Neural Networks (SA-HGNN) with contrastive learning to enhance the OPM predictions for extreme weather-induced power outages. Specifically, we first encode spatial relationships of both static features (e.g., land cover, infrastructure) and event-specific dynamic features (e.g., wind speed, precipitation) via SA-HGNN. Next, we leverage contrastive learning to handle the imbalance problem associated with different types of extreme weather events and generate location-specific embeddings by minimizing intra-event distances between similar locations while maximizing inter-event distances across all locations. Thorough empirical studies in four utility service territories, i.e., Connecticut, Western Massachusetts, Eastern Massachusetts, and New Hampshire, demonstrate that SA-HGNN can achieve state-of-the-art performance for power outage prediction.
URL: https://openreview.net/forum?id=Vf5FDYrOiU
---
Title: BN-Pool: Bayesian Nonparametric Pooling for Graphs
Authors: Daniele Castellana, Filippo Maria Bianchi
Abstract: We introduce BN-Pool, the first clustering-based pooling method for Graph Neural Networks that adaptively determines the number of supernodes in a coarsened graph.
BN-Pool leverages a generative model based on a Bayesian nonparametric framework for partitioning graph nodes into an unbounded number of clusters. During training, the node-to-cluster assignments are learned by combining the supervised loss of the downstream task with an unsupervised auxiliary term, which encourages the reconstruction of the original graph topology while penalizing unnecessary proliferation of clusters. By automatically discovering the optimal coarsening level for each graph, BN-Pool preserves the performance of soft-clustering pooling methods while avoiding their typical redundancy by learning compact pooled graphs.
The code is available at https://github.com/NGMLGroup/Bayesian-Nonparametric-Graph-Pooling
URL: https://openreview.net/forum?id=3B3Zr2xfkf
---
Title: MACAW: A Causal Generative Model for Medical Imaging
Authors: Vibujithan Vigneshwaran, Erik Yuiti Ohara, Matthias Wilms, Nils Forkert
Abstract: Although deep learning techniques show promising results for many neuroimaging tasks in research settings, they have not yet found widespread use in clinical scenarios. One of the reasons for this problem is that many machine learning models only identify correlations between the input images and the outputs of interest, which can lead to many practical problems, such as encoding of uninformative biases and reduced explainability. Thus, recent research is exploring whether integrating \textit{a priori} causal knowledge into deep learning models is a potential avenue to mitigate these problems. However, encoding causal reasoning and generating genuine counterfactuals necessitates computationally expensive invertible processes, thus restricting analyses to a small number of causal variables and rendering them infeasible for generating even 2D images. To overcome these limitations, this work introduces a new causal generative architecture named Masked Causal Flow (MACAW) for neuroimaging applications. Within this context, three main contributions are described. First, a novel approach that integrates complex causal structures into normalizing flows is proposed. Second, counterfactual prediction is performed to identify the changes in effect variables associated with a cause variable. Finally, an explicit Bayesian inference for classification is derived and implemented, providing an inherent uncertainty estimation. The feasibility of the proposed method was first evaluated using synthetic data and then using MRI brain data from more than 23,000 participants of the UK Biobank study. The evaluation results show that the proposed method can (1) accurately encode causal reasoning and generate counterfactuals highlighting the structural changes in the brain known to be associated with aging, (2) accurately predict a subject's age from a single 2D MRI slice, and (3) generate new samples assuming other values for subject-specific indicators such as age, sex, and body mass index.
URL: https://openreview.net/forum?id=eYW037oqQ4
---
Title: MS-IMAP - A Multi-Scale Graph Embedding Approach for Interpretable Manifold Learning
Authors: Shay Deutsch, Lionel Yelibi, Alex Tong Lin, Arjun Ravi Kannan
Abstract: Deriving meaningful representations from complex, high-dimensional data in unsupervised settings is crucial across diverse machine learning applications. This paper introduces a framework for multi-scale graph network embedding based on spectral graph wavelets that employs a contrastive learning approach. We theoretically show that in Paley-Wiener spaces on combinatorial graphs, the spectral graph wavelets operator provides greater flexibility and control over smoothness compared to the Laplacian operator, motivating our approach. A key advantage of the proposed embedding is its ability to establish a correspondence between the embedding and input feature spaces, enabling the derivation of feature importance. We validate the effectiveness of our graph embedding framework on multiple public datasets across various downstream tasks, including clustering and unsupervised feature importance.
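For readers unfamiliar with spectral graph wavelets, the operator in question is $\Psi_s = U\,g(s\Lambda)\,U^\top$ for Laplacian eigenpairs $(\Lambda, U)$ and a band-pass kernel $g$; a small numpy sketch with an arbitrary choice of $g$ (our illustration, not the paper's embedding pipeline):

```python
import numpy as np

def spectral_wavelet_operator(A, scale, g=lambda x: x * np.exp(-x)):
    """Spectral graph wavelet operator Psi_s = U g(s*Lambda) U^T for the
    combinatorial Laplacian L = D - A, with a band-pass kernel g.
    Applying Psi_s to a signal gives its wavelet coefficients at scale s."""
    L = np.diag(A.sum(1)) - A
    lam, U = np.linalg.eigh(L)
    return U @ np.diag(g(scale * lam)) @ U.T

# Path graph on 6 nodes; a delta signal at node 2
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1
signal = np.eye(6)[2]
for s in (0.5, 2.0, 8.0):                # multi-scale view of the signal
    print(s, spectral_wavelet_operator(A, s) @ signal)
```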
URL: https://openreview.net/forum?id=pc6BgWrCjp
---
Title: Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
Authors: Yannick Brunink, Daniel Daza, Yunjie He, Michael Cochez
Abstract: Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing.
In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance.
These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
URL: https://openreview.net/forum?id=YVFxB6bkeC
---
Title: PRISM: PRIor from corpus Statistics for topic Modeling
Authors: Tal Ishon, Yoav Goldberg, Uri Shaham
Abstract: Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single-cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings.
Code is available at: https://github.com/shaham-lab/PRISM.
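A hedged sketch of the interface such a corpus-intrinsic prior exposes: the co-occurrence statistic below (co-occurrence degree) is our placeholder, not PRISM's actual estimator, and the returned vector could be passed, e.g., as the `eta` prior of gensim's LdaModel:

```python
import numpy as np

def cooccurrence_prior(docs, vocab, base=0.1):
    """Illustrative stand-in: derive a per-word Dirichlet parameter from
    corpus co-occurrence statistics.  PRISM's exact statistic is defined
    in the paper; here we use co-occurrence degree purely to show the
    interface (a (V,)-shaped prior vector)."""
    V = len(vocab)
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((V, V))
    for doc in docs:
        ids = [idx[w] for w in set(doc)]
        for a in ids:
            for b in ids:
                if a != b:
                    C[a, b] += 1
    strength = C.sum(1)
    return base * (1 + strength / max(strength.max(), 1))

docs = [["cell", "gene", "rna"], ["gene", "protein"], ["rna", "cell"]]
vocab = ["cell", "gene", "rna", "protein"]
print(cooccurrence_prior(docs, vocab).round(3))
```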
URL: https://openreview.net/forum?id=454v3Xbtza
---
Title: A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems
Authors: Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Leo Kozachkov, Christopher Re, Scott Linderman
Abstract: Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using iterative fixed-point methods, like Newton, Picard, and Jacobi iterations. In this work, we show that these methods can be understood within a common framework based on linear dynamical systems (LDSs), where different iteration schemes arise naturally as approximate linearizations of a nonlinear recursion. Moreover, we theoretically analyze the rates of convergence of these methods, and we verify the predictions of this theory with several case studies. This unifying framework highlights shared principles behind these techniques and clarifies when particular fixed-point methods are most likely to be effective. By bridging diverse algorithms through the language of LDSs, the framework provides a clearer theoretical foundation for parallelizing sequential models and points toward new opportunities for efficient and scalable computation.
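The Jacobi-style iteration is easy to illustrate on a scalar nonlinear recursion: each sweep updates every time step from the previous sweep's values, so all $T$ updates within a sweep can run in parallel, and for a contractive recursion the sweeps converge to the sequential answer (our toy example, not the paper's framework):

```python
import numpy as np

# Sequential recursion: x_t = tanh(a * x_{t-1} + u_t)
rng = np.random.default_rng(0)
T, a = 200, 0.8
u = rng.normal(size=T)

def sequential(u):
    x = np.zeros(T)
    for t in range(T):
        x[t] = np.tanh(a * (x[t - 1] if t else 0.0) + u[t])
    return x

def jacobi_parallel(u, iters=50):
    """Fixed-point iteration over the whole sequence at once: each sweep
    updates every x_t from the previous sweep's x_{t-1}, so the T updates
    inside a sweep are embarrassingly parallel."""
    x = np.zeros(T)
    for _ in range(iters):
        prev = np.concatenate([[0.0], x[:-1]])
        x = np.tanh(a * prev + u)         # one parallel sweep
    return x

print(np.max(np.abs(sequential(u) - jacobi_parallel(u))))  # tiny after enough sweeps
```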
URL: https://openreview.net/forum?id=fw6GgAIGur
---
Title: MissNODAG: Differentiable Learning of Cyclic Causal Graphs from Incomplete Data
Authors: Muralikrishnna Guruswamy Sethuraman, Razieh Nabi, Faramarz Fekri
Abstract: Causal discovery in real-world systems, such as biological networks, is often complicated by feedback loops and incomplete data. Standard algorithms, which assume acyclic structures or fully observed data, struggle with these challenges. To address this gap, we propose MissNODAG, a differentiable framework for learning both the underlying cyclic causal graph and the missingness mechanism from partially observed data, including data *missing not at random*. Our framework integrates an additive noise model with an expectation-maximization procedure, alternating between imputing missing values and optimizing the observed data likelihood, to uncover both the cyclic structures and the missingness mechanism. We establish consistency guarantees under exact maximization of the score function in the large sample setting. Finally, we demonstrate the effectiveness of MissNODAG through synthetic experiments and an application to real-world gene perturbation data.
URL: https://openreview.net/forum?id=nNZXQ3Q0GP
---
Title: Learning Long-Range Representations with Equivariant Messages
Authors: Egor Rumiantsev, Marcel F. Langer, Tulga-Erdene Sodjargal, Michele Ceriotti, Philip Loche
Abstract: Machine learning interatomic potentials trained on first-principles reference data are becoming valuable tools for computational physics, biology, and chemistry. Equivariant message-passing neural networks, including transformers, achieve state-of-the-art accuracy but rely on cutoff-based graphs, limiting their ability to capture long-range effects such as electrostatics or dispersion, as well as electron delocalization. While long-range correction schemes based on inverse power laws of interatomic distances have been proposed, they are unable to communicate higher-order geometric information and are thus limited in applicability. To address this shortcoming, we propose the use of equivariant, rather than scalar, charges for long-range interactions, and design a graph neural network architecture, Lorem, around this long-range message passing mechanism. We consider several datasets specifically designed to highlight non-local physical effects, and compare short-range message passing with different receptive fields to invariant and equivariant long-range message passing. Even though most approaches work for careful dataset-specific choices of their model hyperparameters, Lorem works consistently without such changes, with excellent benchmark performance.
URL: https://openreview.net/forum?id=pZI9e4SW9P
---
Title: Kernel Matrix Estimation of a Determinantal Point Process from a Finite Set of Samples: Properties and Algorithms
Authors: Marc Castella, Jean-Christophe Pesquet
Abstract: Determinantal point processes (DPPs) on finite sets have recently gained popularity because of their ability to promote diversity among selected elements in a given subset. The probability distribution of a DPP is defined by the determinant of a positive semi-definite, real-valued matrix. When estimating the DPP parameter matrix, it is often more convenient to express the maximum likelihood criterion using the framework of L-ensembles. However, the resulting optimization problem is non-convex and NP-hard to solve.
In this paper, we establish conditions under which the maximum likelihood criterion has a well-defined optimum for a given finite set of samples. We demonstrate that regularization is generally beneficial for ensuring a proper solution. To solve the resulting optimization problem, we propose a proximal algorithm which minimizes a penalized criterion. Through simulations, we compare our algorithm with previously proposed approaches, illustrating their differing behaviors and providing empirical support for our theoretical findings.
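For concreteness, the L-ensemble likelihood being maximized is $P(A) = \det(L_A)/\det(L+I)$; a short numpy sketch of its log-likelihood over observed subsets (our illustration of the criterion, not the paper's estimation algorithm):

```python
import numpy as np

def dpp_log_likelihood(L, samples):
    """Log-likelihood of subsets under an L-ensemble DPP:
    P(A) = det(L_A) / det(L + I)."""
    n = L.shape[0]
    _, logdet_norm = np.linalg.slogdet(L + np.eye(n))
    ll = 0.0
    for A in samples:
        # det of the empty submatrix is 1, contributing 0 in log space
        ll += np.linalg.slogdet(L[np.ix_(A, A)])[1] if len(A) else 0.0
        ll -= logdet_norm
    return ll

# Toy PSD kernel and two observed subsets
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
L = B @ B.T
print(dpp_log_likelihood(L, [[0, 2], [1, 2, 3]]))
```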
URL: https://openreview.net/forum?id=Cyx9LwB5IN
---
Title: Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification
Authors: Julian Rodemann, Alexander Marquard, Thomas Augustin, Michele Caprio
Abstract: Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model’s predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.
URL: https://openreview.net/forum?id=T8w8L2t3JG
---
New submissions
===============
Title: Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA
Abstract: Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification (Wu et al., 2024) and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49–74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.
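A toy sketch of confidence-weighted fusion in the spirit of the abstract; the actual S-score computation and fusion rule are specified in the paper:

```python
def s_score_fusion(specialist_outputs):
    """Illustrative fusion: each agent contributes (answer, s_score);
    votes are pooled by S-score, and the fused confidence is the
    winning option's share of the total mass."""
    totals = {}
    for answer, s in specialist_outputs:
        totals[answer] = totals.get(answer, 0.0) + s
    best = max(totals, key=totals.get)
    confidence = totals[best] / sum(totals.values())
    return best, confidence

outputs = [("B", 0.82), ("B", 0.64), ("C", 0.55), ("B", 0.40)]
print(s_score_fusion(outputs))   # ('B', ~0.77)
```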
URL: https://openreview.net/forum?id=2ZiljJOMLh
---
Title: Differentially Private Synthetic Data via APIs 3: Using Simulators Instead of Foundation Models
Abstract: Differentially private (DP) synthetic data enables the use of sensitive datasets while providing strong privacy guarantees. Recently, Private Evolution (PE) has emerged as a promising framework for generating DP synthetic data using only inference APIs of foundation models. However, suitable foundation models are not always available for every private data domain.
In this paper, we show that PE is more general than previously understood and can incorporate APIs beyond foundation models. In particular, we show that state-of-the-art non–neural-network data synthesizers, including computer graphics-based image generators and physics simulation tools (which we refer to as simulators), integrate naturally into PE. We call the resulting method Sim-PE.
Sim-PE broadens the applicability of PE while improving performance. For image synthesis, Sim-PE improves downstream classification accuracy by up to 3x, reduces FID by up to 80%, and is significantly more efficient. We further show that simulators and foundation models can be seamlessly combined within Sim-PE for additional gains.
Beyond performance improvements, Sim-PE enables new DP applications that are difficult or impossible for existing methods, including generating DP control signals whose execution reproduces the distribution of private observations.
URL: https://openreview.net/forum?id=GMoM2QpYuz
---
Title: Closing the Modality Gap for Mixed Modality Search
Abstract: Mixed modality search, retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents, is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench, the first benchmark specifically designed for mixed modality search, GR-CLIP improves NDCG@10 by up to 26\% over CLIP, surpasses recent vision-language generative embedding models by 4\%, while using 75$\times$ less compute.
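Gap-removal methods of this kind are typically post-hoc linear corrections; the mean-centering variant below is an illustrative stand-in, not GR-CLIP's exact calibration:

```python
import numpy as np

def remove_modality_gap(img_emb, txt_emb):
    """Image and text embeddings form two displaced clusters; subtracting
    each modality's mean and renormalizing places both on a shared region
    of the sphere.  (Illustrative; GR-CLIP's calibration is defined in
    the paper.)"""
    img = img_emb - img_emb.mean(0, keepdims=True)
    txt = txt_emb - txt_emb.mean(0, keepdims=True)
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    return img, txt

img = np.random.randn(100, 512) + 2.0   # offset clusters mimic the gap
txt = np.random.randn(100, 512) - 2.0
img_c, txt_c = remove_modality_gap(img, txt)
print(np.linalg.norm(img.mean(0) - txt.mean(0)),
      np.linalg.norm(img_c.mean(0) - txt_c.mean(0)))  # gap shrinks
```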
URL: https://openreview.net/forum?id=jKy1yDMFcs
---
Title: Unified Sample Difficulty Estimation in Pathology Foundation Models
Abstract: The rapid scaling of histopathology datasets allows researchers to train various foundation models for disease-centered research, with applications in classifying disease-state information and predicting gene expression levels. However, it has been shown that current models tend to be overconfident and poorly calibrated in classification. This issue is equally underexplored for regression-type tasks such as gene expression prediction, and it could seriously affect diagnosis and treatment based on the developed models. To resolve this critical issue, we propose a universal framework to estimate the sample difficulty (USD) in both regression and classification tasks (full code: https://anonymous.4open.science/r/USD-F176/, also in the supplementary files). In particular, we fit the data in the embedding space with a Gaussian distribution and then utilize a prior-informed relative Mahalanobis distance to estimate sample difficulty. Moreover, we incorporate this difficulty as a weight to regularize the model prediction, which can improve model performance by emphasizing challenging samples. Our method can be seamlessly extended to regression tasks by incorporating discretized targets. Extensive experiments demonstrate that our proposed USD can improve disease-state classification accuracy by up to 3.8\% and gene-level correlation by up to 62.2\% compared with the most frequently used approaches. Finally, we provide comprehensive ablation tests to demonstrate the importance of including sample difficulty in the training stage, and case studies for assigning samples with different difficulty levels.
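As background, the (non-prior-informed) relative Mahalanobis distance of Ren et al. (2021), which the prior-informed variant builds on, can be sketched as follows; all numbers below are synthetic, and a shared covariance is used for both terms purely for brevity:

```python
import numpy as np

def relative_mahalanobis(z, class_means, shared_cov, bg_mean, bg_cov):
    """Relative Mahalanobis distance as a difficulty proxy: distance to
    the nearest class Gaussian minus distance to a single background
    Gaussian fit on all embeddings.  Larger values suggest harder
    (more atypical) samples."""
    P = np.linalg.inv(shared_cov)
    Pb = np.linalg.inv(bg_cov)
    d_cls = min(((z - m) @ P @ (z - m)) for m in class_means)
    d_bg = (z - bg_mean) @ Pb @ (z - bg_mean)
    return d_cls - d_bg

rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 8))                      # embeddings
labels = rng.integers(0, 3, size=500)
means = [Z[labels == k].mean(0) for k in range(3)]
cov = np.cov(Z.T) + 1e-3 * np.eye(8)               # regularized covariance
print(relative_mahalanobis(Z[0], means, cov, Z.mean(0), cov))
```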
URL: https://openreview.net/forum?id=LLlOJs4o2N
---
Title: Efficiently Attacking Memorization Scores
Abstract: Influence estimation tools—such as memorization scores—are widely used to understand model behavior, attribute training data, and inform dataset curation. However, recent applications in data valuation and responsible machine learning raise the question: can these scores themselves be adversarially manipulated? In this work, we present a systematic study of the feasibility of attacking memorization-based influence estimators when valuators only have black-box access to scores. We characterize attacks for producing highly memorized samples as highly sensitive queries in the regime where a trained algorithm is accurate. Our inverse-based attacks are practical, requiring only black-box access to model outputs, incur modest computational overhead, and generalize across data modalities. We empirically validate our attack across a wide suite of image classification tasks, showing that even state-of-the-art proxies are vulnerable to targeted score manipulations. In addition, we provide a theoretical analysis of the stability of memorization scores under adversarial perturbations, revealing conditions under which influence estimates are inherently fragile. Our findings highlight critical vulnerabilities in influence-based attribution and suggest the need for robust defenses. All code can be found at https://anonymous.4open.science/r/MemAttack-5413/
URL: https://openreview.net/forum?id=rAQNI3lH7K
---
Title: Beyond Fixed Horizons: A Theoretical Framework for Adaptive Denoising Diffusions
Abstract: We introduce a new class of generative diffusion models that, unlike conventional denoising diffusion models, achieve a time-homogeneous structure for both the noising and denoising processes, allowing the number of steps to adaptively adjust based on the noise level. This is accomplished by conditioning the forward process using Doob's $h$-transform, which terminates the process at a suitable sampling distribution at a random time. The model is particularly well suited for generating data with lower intrinsic dimensions, as the termination criterion simplifies to a first-hitting rule. A key feature of the model is its adaptability to the target data, enabling a variety of downstream tasks using a pre-trained unconditional generative model. These tasks include natural conditioning through appropriate initialisation of the denoising process and classification of noisy data.
URL: https://openreview.net/forum?id=xowTpMf9fM
---
Title: Self-Consistent Flow: Unifying Velocity and Endpoint Prediction for Rectified Flow Models
Abstract: In rectified-flow–based generative models, the neural network can be trained to predict either of two targets, the instantaneous velocity or the data endpoint, to perform denoising. Although prior work shows that these parameterizations lead to different empirical behaviors, the mechanisms underlying their respective advantages remain underexplored, and how to combine them effectively is still unclear. In this work, we analyze how learning errors from different parameterizations affect the generation performance. We show that predicting the data endpoint has a clear training signal that stabilizes training, whereas predicting the velocity maintains stable sampling dynamics near the data manifold. Motivated by these insights, we propose Self-Consistent Flow (SC-Flow), a new method that unifies the benefits of both parameterizations. By employing a lightweight consistency loss, SC-Flow jointly trains a single network to predict both the local velocity and the data endpoint, and the consistency between the two predictions improves the model's performance. The method requires no major architectural changes and adds minimal computational overhead. Extensive experiments on image generation tasks demonstrate that SC-Flow substantially stabilizes optimization and improves the straightness of generation paths, leading to significant gains in generation quality over standard rectified-flow baselines.
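The consistency idea follows from the rectified-flow interpolation $x_t = (1-t)x_0 + t x_1$ with velocity $v = x_1 - x_0$, so the endpoint implied by a velocity prediction is $x_t + (1-t)v$; a sketch of a simple consistency term (our illustration, not necessarily the paper's exact loss):

```python
import torch

def consistency_loss(x_t, t, v_pred, x1_pred):
    """Tie the two heads together: under x_t = (1-t)*x0 + t*x1 with
    v = x1 - x0, the endpoint implied by a velocity prediction is
    x1 = x_t + (1-t)*v.  Penalize disagreement with the endpoint head."""
    x1_from_v = x_t + (1.0 - t) * v_pred
    return ((x1_from_v - x1_pred) ** 2).mean()

x_t = torch.randn(4, 3, 32, 32)
t = torch.rand(4, 1, 1, 1)                 # per-sample timestep, broadcast
v_pred, x1_pred = torch.randn_like(x_t), torch.randn_like(x_t)
print(consistency_loss(x_t, t, v_pred, x1_pred))
```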
URL: https://openreview.net/forum?id=z9FMXOJgyl
---
Title: Kernel Banzhaf: A Fast and Robust Estimator for Banzhaf Values
Abstract: Banzhaf values provide a popular, interpretable alternative to the widely-used Shapley values for applications in explainable AI. Like Shapley values, computing Banzhaf values exactly requires time exponential in the number of inputs, necessitating the use of efficient estimators. In this work, we introduce Kernel Banzhaf, a regression-based estimator for Banzhaf values. Our approach leverages an existing regression formulation, whose exact solution corresponds to the exact Banzhaf values. Inspired by the success of Kernel SHAP for Shapley values, Kernel Banzhaf obtains an efficient approximation to the Banzhaf values by solving a subsampled instance of this regression problem. Through empirical evaluations across eight datasets, we find that Kernel Banzhaf significantly outperforms prior methods in terms of accuracy and robustness to noise. We complement our experimental evaluation with strong theoretical guarantees on Kernel Banzhaf’s performance.
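The regression formulation alluded to is the classical Hammer–Holzman result: over uniformly random subsets, the coefficients of the best linear fit to $v(S)$ are exactly the Banzhaf values. A subsampled illustration (simplified relative to the paper's estimator):

```python
import numpy as np

def banzhaf_regression_estimate(value_fn, n, m=2000, seed=0):
    """Over uniformly random subsets S (encoded as 0/1 vectors z), the
    coefficients of the best linear fit to v(S) equal the Banzhaf values
    (Hammer & Holzman, 1992).  Subsampling the regression gives an
    efficient estimator; Kernel Banzhaf's exact scheme differs."""
    rng = np.random.default_rng(seed)
    Z = rng.integers(0, 2, size=(m, n)).astype(float)
    y = np.array([value_fn(z) for z in Z])
    X = np.hstack([np.ones((m, 1)), Z])           # intercept + indicators
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]                               # Banzhaf estimates

# Toy game: weighted vote with a quota of 5
w = np.array([4.0, 2.0, 1.0, 1.0])
v = lambda z: float(z @ w >= 5)
print(banzhaf_regression_estimate(v, 4).round(2))  # approx [0.88, 0.12, 0.12, 0.12]
```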
URL: https://openreview.net/forum?id=NYuk0js8Rc
---
Title: Learning 3D Hypersonic Flow with Physics-Enhanced Neural Fields: A Case Study on the Orion Reentry Capsule
Abstract: We develop a 3D aerothermodynamic simulator for the Orion reentry capsule at hypersonic speeds, a timely case study given its role in upcoming lunar missions. The large computational meshes required for these scenarios make traditional computational fluid dynamics impractical for full-mission performance prediction and control. In this work, we propose physics-enhanced 3D neural fields for predicting steady hypersonic flow around aerodynamic bodies. The model maps spatial coordinates and angle of attack to pressure, temperature, and velocity components. We enhance the base model with Fourier positional feature mappings, which allow it to capture the sharp discontinuities typical of hypersonic flows, and further constrain the solution by imposing no-slip and isothermal wall conditions. We compare our proposed approach to other surrogate alternatives, such as graph neural networks, and demonstrate its superior performance in capturing the steep gradients ubiquitous in this regime. Our formulation yields a continuous and computationally efficient aerothermodynamic surrogate that supports rapid exploration of operating conditions based on angle of attack variation under realistic flight profiles. While we focus on Orion, the proposed framework provides a general methodology for data-driven simulation in 3D hypersonic aerothermodynamics.
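The Fourier positional feature mapping is the standard $\gamma(x) = [\sin(2\pi Bx), \cos(2\pi Bx)]$ with random Gaussian $B$; a sketch of the architecture class (our illustration with made-up layer sizes, not the paper's exact network):

```python
import torch
import torch.nn as nn

class FourierFeatureMLP(nn.Module):
    """Neural field with a Fourier positional feature mapping, which lets
    an MLP fit sharp discontinuities (shocks) that plain coordinates
    smooth over.  Inputs: spatial coords + angle of attack; outputs:
    flow quantities."""
    def __init__(self, in_dim=4, n_feats=128, sigma=10.0, out_dim=5):
        super().__init__()
        self.register_buffer("B", torch.randn(in_dim, n_feats) * sigma)
        self.net = nn.Sequential(
            nn.Linear(2 * n_feats, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, out_dim),        # e.g. p, T, u, v, w
        )

    def forward(self, x):
        proj = 2 * torch.pi * x @ self.B
        return self.net(torch.cat([proj.sin(), proj.cos()], dim=-1))

model = FourierFeatureMLP()
coords = torch.randn(1024, 4)               # (x, y, z, angle of attack)
print(model(coords).shape)                   # torch.Size([1024, 5])
```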
URL: https://openreview.net/forum?id=ce2X1X3l0Y
---
Title: Compositional Knowledge Cannot Be Cheaply Distilled
Abstract: Knowledge distillation can compress large neural networks with little loss in task accuracy, yet distilled models frequently preserve factual recall while failing at multi-step reasoning. We prove that this disparity reflects a fundamental separation, not a limitation of current methods. Our first result establishes tight minimax distillation rates that depend solely on student model capacity, entirely independent of teacher complexity; we match the upper and lower bounds exactly. Our second result, proven rigorously for linear compositional primitives, shows that transferring depth-$k$ compositional knowledge requires $k$ times more samples than transferring factual knowledge, an unavoidable gap driven by the information-packing structure of compositions. Our third result provides information-theoretic fidelity bounds showing that representation alignment imposes architectural limits beyond what additional data can overcome. The compositional separation holds for any distillation algorithm and persists even assuming sufficient student capacity and unlimited teacher queries. Experiments on synthetic benchmarks and GPT-2 distillation confirm all three predictions, with the factual–compositional gap matching the theoretical prediction of approximately three times more data for arithmetic reasoning. These bounds offer practitioners a principled basis for predicting when distillation will preserve reasoning capabilities and when it will not.
URL: https://openreview.net/forum?id=KmRN09BmTa
---
Title: Build-Bench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
Abstract: Automatically compiling open-source software (OSS) projects is a vital, labor-intensive, and complex task, which makes it a good challenge for LLM agents. Existing methods rely on manually curated rules and workflows, which cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) relied on selective evaluation over a subset of highly rated OSS, a practice that underestimates the realistic challenges of OSS compilation. In practice, compilation instructions are often absent, dependencies are undocumented, and successful builds may even require patching source files or modifying build scripts. We propose a more challenging and realistic benchmark, Build-Bench, comprising OSS projects that are more diverse in quality, scale, and characteristics. Furthermore, we propose a simple yet strong baseline LLM-based agent, OSS-Build-Agent, an effective system with an enhanced build instruction retrieval module that outperforms the prior rule-based and agentic baselines we evaluated on Build-Bench and is adaptable to heterogeneous OSS characteristics. We also provide a detailed analysis of different compilation method design choices and their influence on the whole task, providing insights to guide future advances. We believe that performance on Build-Bench can faithfully reflect an agent's ability to tackle compilation as a complex software engineering task, and, as such, our benchmark will spur innovation with a significant impact on downstream applications in the fields of software development and software security.
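As a loose sketch of what an LLM build loop with an instruction-retrieval step might look like (every name below, including llm and retrieve_build_docs, is a hypothetical placeholder rather than OSS-Build-Agent's actual interface):

import subprocess

def run(cmd, cwd):
    # Run a shell command inside the repository and capture the build log.
    proc = subprocess.run(cmd, shell=True, cwd=cwd,
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def build_agent(repo_dir, llm, retrieve_build_docs, max_steps=10):
    # Retrieval module: surface READMEs, CI configs, Makefiles, and similar.
    context = retrieve_build_docs(repo_dir)
    history = []
    for _ in range(max_steps):
        cmd = llm(f"Repo docs:\n{context}\n\nPrior attempts:\n{history}\n"
                  "Propose the next single shell command to build this project.")
        code, log = run(cmd, repo_dir)
        if code == 0:
            return True, history            # naive success check for this sketch
        history.append((cmd, log[-2000:]))  # keep the tail of each failure log
    return False, history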
URL: https://openreview.net/forum?id=BxJnz9EqO5
---
Title: Complex Equation Learner: Rational Symbolic Regression with Gradient Descent in Complex Domain
Abstract: Symbolic regression aims to discover interpretable equations from data, yet modern gradient-based methods fail for operators that introduce singularities or domain constraints, including division, logarithms, and square roots. As a result, Equation Learner-type models typically avoid these operators or impose restrictions, e.g., constraining denominators to prevent poles, which narrows the hypothesis class. We propose a complex-weight extension of the Equation Learner that mitigates real-valued optimization pathologies by allowing optimization trajectories to bypass real-axis degeneracies. The proposed approach converges stably even when the target expression has real-domain poles, and it enables unconstrained use of operations such as logarithm and square root. We validate the method on symbolic regression benchmarks and show it can recover singular behavior from experimental frequency response data.
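The central trick, as described, is that complex-valued weights let gradient trajectories detour around real-axis poles. A minimal PyTorch sketch of a complex-weight rational layer (the layout and the final projection back to the real axis are illustrative assumptions):

import torch
import torch.nn as nn

class ComplexRationalLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.num = nn.Linear(in_dim, out_dim, dtype=torch.cfloat)
        self.den = nn.Linear(in_dim, out_dim, dtype=torch.cfloat)

    def forward(self, z):
        # A complex denominator almost never hits zero exactly, so the
        # optimization can route around singularities on the real axis.
        return self.num(z) / self.den(z)

x = torch.linspace(-2.0, 2.0, 256).unsqueeze(-1)
model = nn.Sequential(ComplexRationalLayer(1, 8), ComplexRationalLayer(8, 1))
y_hat = model(x.to(torch.cfloat)).real   # project back to a real prediction
loss = ((y_hat - 1.0 / x) ** 2).mean()   # e.g. fit a target with a pole at 0
loss.backward()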
URL: https://openreview.net/forum?id=Oa6I9aD8Nk
---
Title: Demonstration-Guided Continual Reinforcement Learning in Dynamic Environments
Abstract: Reinforcement learning (RL) excels in various applications but struggles in dynamic environments where the underlying Markov decision process evolves. Continual reinforcement learning (CRL) enables RL agents to continually learn and adapt to new tasks, but balancing stability (preserving prior knowledge) and plasticity (acquiring new knowledge) remains challenging. Existing methods primarily address the stability-plasticity dilemma through mechanisms where past knowledge influences optimization but rarely affects the agent's behavior directly, which may hinder effective knowledge reuse and efficient learning. In contrast, we propose demonstration-guided continual reinforcement learning (DGCRL), which stores prior knowledge in an external, self-evolving demonstration repository that directly guides RL exploration and adaptation. For each task, the agent dynamically selects the most relevant demonstration and follows a curriculum-based strategy to accelerate learning, gradually shifting from demonstration-guided exploration to fully self-directed exploration. Extensive experiments on 2D navigation and MuJoCo locomotion tasks demonstrate DGCRL's superior average performance, enhanced knowledge transfer, mitigation of forgetting, and training efficiency. Additional sensitivity analyses and an ablation study further validate its effectiveness.
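A heavily simplified sketch of the two mechanisms described, demonstration selection plus a curriculum that anneals from demonstration-following to self-exploration (the relevance metric and the schedule are assumptions, not the paper's exact design):

import random
import numpy as np

def select_demo(task_embedding, repository):
    # Pick the stored demonstration whose task embedding is most similar.
    sims = [float(np.dot(task_embedding, d["embedding"])) for d in repository]
    return repository[int(np.argmax(sims))]

def act(policy, demo, state, step, total_steps):
    # Curriculum: follow the demonstration early, self-explore late.
    p_follow = max(0.0, 1.0 - step / (0.5 * total_steps))
    if demo and step < len(demo["actions"]) and random.random() < p_follow:
        return demo["actions"][step]
    return policy(state)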
URL: https://openreview.net/forum?id=67ebl3nvmB
---
Title: ZSQ4MIL: Zero-shot Quantization for Multiple Instance Learning
Abstract: Zero-shot quantization (ZSQ) has emerged as a promising model compression paradigm that bypasses data privacy and security barriers in model deployment by leveraging synthesized samples for quantization calibration. Despite its remarkable success in mainstream vision tasks such as classification, detection, and segmentation, existing ZSQ approaches cannot be directly transferred to Multiple Instance Learning (MIL) models due to the hierarchical bag-instance structure inherent to MIL. To bridge this gap, this paper proposes ZSQ4MIL, the first ZSQ framework explicitly tailored for MIL. The core of ZSQ4MIL lies in synthesizing calibration data that aligns with the intrinsic distributional characteristics of MIL data. Specifically, we analyze three core structural priors of instances within MIL bags: (i) instance heterogeneity, (ii) inter-class separability, and (iii) intra-class compactness. Empirical findings reveal that these structural priors fundamentally dictate the behavioral discrepancy of MIL models when processing random Gaussian noise versus real images. This insight inspires us to reconstruct MIL-aware calibration data by inverting optimized Gaussian noise. Methodologically, we first introduce an instance-level contrastive learning strategy to preserve feature heterogeneity. Subsequently, a grouped instance pseudo-labeling constraint is enforced to guarantee inter-class separability. Finally, a class-centric distance optimization scheme is proposed to further enhance intra-class compactness and widen inter-class margins. Building on these components, we establish the first comprehensive quantization benchmarks for MIL. Extensive experiments validate the effectiveness of ZSQ4MIL, which notably surpasses real-data-dependent calibration methods under several configurations. Code is available in the supplementary material.
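A schematic of the three synthesis objectives named above, applied while optimizing Gaussian noise into MIL-aware calibration bags (the loss forms and equal weighting are assumptions; the pretrained MIL model itself stays frozen):

import torch
import torch.nn.functional as F

def synthesis_losses(inst_feats, pseudo_labels, class_centers, tau=0.1):
    z = F.normalize(inst_feats, dim=-1)
    # (i) instance heterogeneity: a contrastive term pushing instances apart
    sim = (z @ z.T / tau).masked_fill(
        torch.eye(len(z), dtype=torch.bool), float("-inf"))
    l_het = torch.logsumexp(sim, dim=-1).mean()
    # (ii) inter-class separability: grouped pseudo-label classification
    logits = z @ F.normalize(class_centers, dim=-1).T / tau
    l_sep = F.cross_entropy(logits, pseudo_labels)
    # (iii) intra-class compactness: pull instances toward their class center
    l_cmp = (inst_feats - class_centers[pseudo_labels]).pow(2).sum(-1).mean()
    return l_het + l_sep + l_cmp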
URL: https://openreview.net/forum?id=D1gvKw1397
---
Title: ERDE: Entropy-Regularized Distillation for Early-exit
Abstract: Although deep neural networks, and in particular Convolutional Neural Networks, have demonstrated state-of-the-art performance in image classification with relatively high efficiency, they still incur high computational costs, often rendering them impractical for real-time and edge applications. Therefore, a multitude of compression techniques have been developed to reduce these costs while maintaining accuracy.
In addition, dynamic architectures have been introduced to modulate the level of compression at execution time, a desirable property in many resource-limited application scenarios. Our proposed method, ERDE, integrates two well-established optimization techniques, early exits and knowledge distillation, training a reduced student early-exit model from a more complex teacher early-exit model. The primary contribution of this work lies in the approach used to train the student early-exit model.
In comparison to the conventional Knowledge Distillation loss, our approach incorporates a new entropy-based loss for images where the teacher's classification was incorrect. The proposed method optimizes the trade-off between accuracy and efficiency, achieving significant reductions in computational complexity without compromising classification performance. The validity of this approach is substantiated by experimental results on the image classification datasets CIFAR10, CIFAR100, and SVHN, which further opens new research perspectives for Knowledge Distillation in other contexts.
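A sketch of the loss structure described: standard distillation where the teacher is correct, and an entropy-based term where it is not. The exact entropy formulation is an assumption; only the teacher-correct/teacher-wrong split is taken from the abstract.

import torch
import torch.nn.functional as F

def erde_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    teacher_ok = teacher_logits.argmax(-1).eq(targets)                # (B,)
    kd = F.kl_div(F.log_softmax(student_logits / T, -1),
                  F.softmax(teacher_logits / T, -1),
                  reduction="none").sum(-1) * T * T                   # (B,)
    ce = F.cross_entropy(student_logits, targets, reduction="none")   # (B,)
    probs = F.softmax(student_logits, -1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)          # (B,)
    # Distill where the teacher is right; elsewhere, learn from the label
    # while regularizing the student's prediction entropy.
    per_sample = torch.where(teacher_ok,
                             alpha * kd + (1 - alpha) * ce,
                             ce + entropy)
    return per_sample.mean()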
URL: https://openreview.net/forum?id=zLWbwuPhuH
---
Title: Evolutionary Rim Attention for Linear-Time Sequence Modeling
Abstract: Transformer self-attention is powerful, but its quadratic dependence on sequence length limits efficiency at long context. Biological nervous systems, by contrast, appear to rely on sparse, local, and hierarchical processing rather than all-to-all pairwise comparison. Motivated by this contrast, we introduce EvoRimNet, a sequence model with two parallel pathways: Inhibitory Rim Attention, a local competitive operator implemented with multi-scale causal depthwise convolutions initialized with Mexican-hat-like profiles, and Content-Addressable Thalamic Relay, a hierarchical sparse memory that supports content-based write and read operations. On WikiText-2 word-level language modeling, EvoRimNet achieves 162.6 test perplexity, versus 167.3 for a modern Transformer (RoPE + SwiGLU + RMSNorm) and 188.9 for GPT, improving over both baselines at matched parameter counts across three random seeds. At sequence length 4096, our implementation is 14.8 times faster and uses 67 times less peak memory than the GPT baseline in our measurements. Ablations confirm that each biological component (inhibition, thalamic relay, and hierarchical routing) contributes independently and measurably to performance. These results demonstrate that local competitive inhibition combined with sparse content-addressable relay can match or exceed transformer-based attention on language modeling benchmarks, while scaling linearly in both time and memory.
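One concrete ingredient from the description above is easy to sketch: a causal depthwise convolution whose kernels start as Mexican-hat (Ricker) profiles. The paper describes multi-scale kernels; this single-scale Python sketch with assumed sizes only illustrates the initialization and the causal padding.

import torch
import torch.nn as nn

def ricker(kernel_size, width):
    t = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    return (1.0 - (t / width) ** 2) * torch.exp(-0.5 * (t / width) ** 2)

class CausalRimConv(nn.Module):
    def __init__(self, dim, kernel_size=9, width=2.0):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, bias=False)
        with torch.no_grad():
            # Center-on, surround-off profile: local competitive inhibition.
            self.conv.weight.copy_(ricker(kernel_size, width).view(1, 1, -1))
        self.pad = kernel_size - 1           # left-pad only, for causality

    def forward(self, x):                    # x: (batch, dim, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))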
URL: https://openreview.net/forum?id=IOYlVchL1e
---
Title: Estimating the Adoption of Software Engineering Best Practices in Machine Learning Research
Abstract: While experimental reproduction remains a pillar of the scientific method, we observe that the software best practices supporting the reproduction of Machine Learning (ML) research are often undervalued or overlooked. We quantify these concerns by surveying the usage of software best practices in software repositories associated with publications at major ML conferences and journals such as NeurIPS, ICML, ICLR, TMLR, and MLOSS within the last decade. We report the results of this survey that identify areas where software best practices are lacking and areas with potential for growth in the ML community. Finally, we discuss the implications and present concrete recommendations on how we, as a community, can improve reproducibility in ML research.
URL: https://openreview.net/forum?id=13ThuknGwY
---
Title: From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Abstract: Accurate domain-specific benchmarking of LLMs is essential, especially in domains with direct implications for humans, such as law, healthcare, and education. However, existing benchmarks are documented to be contaminated and are based on multiple-choice questions, which suffer from inherent biases. To measure domain-specific knowledge in LLMs, we present a deterministic pipeline that transforms raw domain corpora into completion-style benchmarks without relying on other LLMs or costly human annotation. Our approach first extracts domain-specific keywords and related target vocabulary from an input corpus. It then constructs prompt-target pairs in which domain-specific words serve as prediction targets. By measuring LLMs' ability to complete these prompts, we provide a direct assessment of domain knowledge at low computational cost. Our pipeline avoids benchmark contamination, enables automated updates with new domain data, and facilitates fair comparisons between base and instruction-tuned (chat) models. We validate our approach by showing that model performances on our benchmark significantly correlate with those on an expert-curated benchmark. We then demonstrate how our benchmark provides insights into knowledge acquisition in domain-adaptive, continual, and general pretraining. Finally, we examine the effects of instruction fine-tuning by comparing base and chat models within our unified evaluation framework. In conclusion, our pipeline enables scalable, domain-specific, LLM-independent, and unbiased evaluation of both base and chat models.
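A minimal sketch of the kind of pipeline described: score words by domain specificity, then cut each sentence just before a domain word so the model must complete it (the frequency-ratio scoring below is an assumption, not the paper's exact extractor):

from collections import Counter

def domain_keywords(domain_corpus, general_corpus, top_k=500):
    dom = Counter(w for s in domain_corpus for w in s.lower().split())
    gen = Counter(w for s in general_corpus for w in s.lower().split())
    score = {w: c / (gen[w] + 1) for w, c in dom.items()}  # relative frequency
    return set(sorted(score, key=score.get, reverse=True)[:top_k])

def prompt_target_pairs(domain_corpus, keywords, min_prefix=5):
    pairs = []
    for sentence in domain_corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if i >= min_prefix and w.lower() in keywords:
                pairs.append((" ".join(words[:i]), w))  # (prompt, target)
                break
    return pairs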
URL: https://openreview.net/forum?id=0281hBaWmh
---
Title: Path Planning for Diffusion Language Model Sampling
Abstract: Any-order generation of discrete data using masked diffusion language models (MDMs) offers a compelling alternative to traditional autoregressive models, especially in domains that lack a natural causal ordering of data. However, current popular MDMs depart from their successful continuous diffusion model counterparts with simplified masked inference, wherein unmasked tokens cannot be iteratively refined, even if there is a mistake. In this paper, we extract the full power of MDMs by introducing a novel inference sampling strategy termed \emph{Path Planning (P2)} that decomposes each generation step into two sub-stages: planning and denoising. Under P2, the planner at every step selects the tokens to be updated, which are then sampled by the denoiser. We demonstrate that P2 generalizes all existing sampling strategies for MDMs and critically enhances generative quality through the new capability of refining and updating existing unmasked tokens. We theoretically prove that P2 establishes a new, expanded evidence lower bound (ELBO) on the log marginal likelihood of data. We instantiate P2 with a family of planners including (1) Self-Planning, (2) BERT-Planning, and (3) Trained-Planning with a learned planner, leading to SOTA generative performance for MDMs on a suite of domains. Specifically, using P2 inference alone, we observe relative improvements of $22\%$ in protein sequence foldability, $8\%$ in RNA sequence pLDDT, $4\%$ in math reasoning, $68\%$ in story generation (ROUGE score), and $33\%$ in code generation on the challenging pass@1 metric.
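The planning/denoising split is simple to sketch. Below, a confidence-based planner stands in for the paper's learned planners; it marks the least-confident positions, masked or already unmasked, for resampling by the denoiser (shapes and the planner heuristic are illustrative assumptions):

import torch

@torch.no_grad()
def p2_sample(denoiser, length, mask_id, steps=64, k=4):
    x = torch.full((1, length), mask_id)
    for _ in range(steps):
        probs = denoiser(x).softmax(-1)              # (1, L, V)
        conf = probs.max(-1).values.squeeze(0)       # per-position confidence
        conf[x.squeeze(0) == mask_id] = -1.0         # masks are always eligible
        plan = conf.topk(k, largest=False).indices   # planner: pick k positions
        resampled = torch.multinomial(probs.squeeze(0)[plan], 1).squeeze(-1)
        x[0, plan] = resampled                       # denoiser: fill them in
    return x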
URL: https://openreview.net/forum?id=azdknjl0hE
---
Title: Mitigating the Likelihood Paradox in Flow-based OOD Detection via Entropy Manipulation
Abstract: Deep generative models that can tractably compute input likelihoods, including normalizing flows, often assign unexpectedly high likelihoods to out-of-distribution (OOD) inputs. We mitigate this likelihood paradox by manipulating input entropy based on semantic similarity, applying stronger perturbations to inputs that are less similar to an in-distribution memory bank. We provide a theoretical analysis showing that entropy control increases the expected log-likelihood gap between in-distribution and OOD samples in favor of the in-distribution, and we explain why the procedure works without any additional training of the density model. We then evaluate our method against likelihood-based OOD detectors on standard benchmarks and find consistent AUROC improvements over baselines, supporting our explanation.
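The core mechanism is compact enough to sketch: inputs that are less similar to an in-distribution memory bank receive stronger perturbations before the flow scores them. Gaussian noise as the entropy-raising perturbation, and cosine similarity as the semantic measure, are assumptions here:

import torch
import torch.nn.functional as F

@torch.no_grad()
def perturbed_log_likelihood(flow_log_prob, encoder, memory_bank, x, eps=0.3):
    z = F.normalize(encoder(x), dim=-1)              # (B, D) embeddings
    bank = F.normalize(memory_bank, dim=-1)          # (M, D) in-dist features
    sim = (z @ bank.T).max(-1).values.clamp(0, 1)    # nearest-ID similarity
    sigma = eps * (1.0 - sim)                        # less similar -> noisier
    x_noisy = x + sigma.view(-1, *([1] * (x.dim() - 1))) * torch.randn_like(x)
    return flow_log_prob(x_noisy)  # OOD scores should drop more than ID scores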
URL: https://openreview.net/forum?id=pfWtxxsveL
---
Title: Understanding and Harnessing Sparsity for Unified Multimodal Models
Abstract: Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code will be released upon acceptance.
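A schematic of the described MoE adaptation, partitioning a dense generation FFN into experts with sparse per-token activation (top-1 routing and the expert count are illustrative assumptions, not BAGEL's actual recipe):

import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, dim, hidden, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden // n_experts), nn.GELU(),
                          nn.Linear(hidden // n_experts, dim))
            for _ in range(n_experts))

    def forward(self, x):                    # x: (tokens, dim)
        choice = self.router(x).argmax(-1)   # activate one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = choice == i
            if sel.any():
                out[sel] = expert(x[sel])
        return out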
URL: https://openreview.net/forum?id=2mNkOlfJ0z
---
Title: FairT2I: Latent Variable Guidance for Training-Free Bias Mitigation with LLM-Assisted Bias Detection
Abstract: Text-to-image models have transformed visual content creation, but their reliance on large uncurated web data can encode and amplify societal biases. We present \emph{FairT2I}, a training-free, inference-time framework that leverages large language models to detect implicit bias dimensions in prompts and mitigate them during generation. FairT2I has three components. First, LLM-based bias detection identifies bias-relevant attributes implied by the prompt and makes them explicit for control. Second, attribute resampling generates bias-aware prompts by sampling these attributes from a user-specified target distribution, including uniform, statistics-based, or custom choices. Third, \emph{latent variable guidance} provides a theoretically grounded guidance rule that decomposes the diffusion score into attribute-conditioned components and reweights them to match the target attribute distribution. This can be viewed as an attribute-level generalization of classifier-free guidance, which applies a single global guidance scale to strengthen conditioning without explicit control over individual attributes. Experiments with both uniform targets and real-world employment statistics show that FairT2I outperforms existing training-free bias mitigation methods. On Parti Prompts, FairT2I improves diversity without sacrificing image quality or prompt fidelity, achieving a better quality--diversity trade-off than classifier-free guidance.
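Read literally, the guidance rule mixes attribute-conditioned scores under the target attribute distribution; a schematic (the paper's actual reweighting may differ, and score_fn is a placeholder):

import torch

def guided_score(score_fn, x_t, t, prompt, attributes, target_probs):
    # score_fn(x_t, t, prompt, a): attribute-conditioned diffusion score
    scores = torch.stack([score_fn(x_t, t, prompt, a) for a in attributes])
    w = torch.as_tensor(target_probs, dtype=scores.dtype)
    w = w.view(-1, *([1] * x_t.dim()))       # broadcast over score dimensions
    return (w * scores).sum(0)               # expectation under the target dist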
URL: https://openreview.net/forum?id=WHYE1WjMYf
---
Title: Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Abstract: State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba (MoM), a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining.
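The modality-aware sparsity described above amounts to giving each modality its own copies of the projections inside the Mamba block and selecting them per token. A sketch for a single projection (the full design decouples several projections, per the abstract):

import torch
import torch.nn as nn

class ModalityAwareProj(nn.Module):
    def __init__(self, dim, inner_dim, n_modalities=2):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(dim, inner_dim)
                                   for _ in range(n_modalities))

    def forward(self, x, modality_ids):      # x: (B, L, D); ids: (B, L)
        out = x.new_zeros(*x.shape[:2], self.projs[0].out_features)
        for m, proj in enumerate(self.projs):
            sel = modality_ids == m          # route tokens by modality
            if sel.any():
                out[sel] = proj(x[sel])
        return out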
URL: https://openreview.net/forum?id=GbY2kcB8jv
---