J2C Certification: Probabilistic Pretraining for Improved Neural Regression
Boris N. Oreshkin, Shiv Kumar Tavker, Dmitry Efimov
https://openreview.net/forum?id=F6BTATGXaf
---
J2C Certification: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
https://openreview.net/forum?id=eG3Qy5Oux6
---
Accepted papers
===============
Title: Probabilistic Pretraining for Improved Neural Regression
Authors: Boris N. Oreshkin, Shiv Kumar Tavker, Dmitry Efimov
Abstract: While transfer learning has revolutionized computer vision and natural language processing, its application to probabilistic regression remains underexplored, particularly for tabular data. We introduce NIAQUE (Neural Interpretable Any-Quantile Estimation), a novel permutation-invariant architecture that enables effective transfer learning across diverse regression tasks. Through extensive experiments on 101 datasets, we demonstrate that pre-training NIAQUE on multiple datasets and fine-tuning on target datasets consistently outperforms both traditional tree-based models and transformer-based neural baselines. On real-world Kaggle competitions, NIAQUE achieves competitive performance against heavily hand-crafted and feature-engineered solutions and outperforms strong baselines such as TabPFN and TabDPT, while maintaining interpretability through its probabilistic framework. Our results establish NIAQUE as a robust and scalable approach for tabular regression, effectively bridging the gap between traditional methods and modern transfer learning.
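Editor's note: any-quantile estimators of this kind are typically trained with the pinball (quantile) loss at quantile levels sampled during training. A minimal sketch of that loss (the paper's exact objective may differ):

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss at level q: the standard quantile-regression
    objective, penalizing under- and over-prediction asymmetrically."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))
```

An any-quantile model would take q as an extra input and minimize this loss with q drawn uniformly from (0, 1) per training example.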
URL: https://openreview.net/forum?id=F6BTATGXaf
---
Title: CAE: Repurposing the Critic as an Explorer in Deep Reinforcement Learning
Authors: Yexin Li
Abstract: Exploration remains a fundamental challenge in reinforcement learning, as many existing methods either lack theoretical guarantees or fall short in practical effectiveness. In this paper, we propose CAE, i.e., the Critic as an Explorer, a lightweight approach that repurposes the value networks in standard deep RL algorithms to drive exploration, without introducing additional parameters. CAE leverages multi-armed bandit techniques combined with a tailored scaling strategy, enabling efficient exploration with provable sub-linear regret bounds and strong empirical stability. Remarkably, it is simple to implement, requiring only about 10 lines of code. For complex tasks where learning reliable value networks is difficult, we introduce CAE+, an extension of CAE that incorporates an auxiliary network. CAE+ increases the parameter count by less than 1% while preserving implementation simplicity, adding roughly 10 additional lines of code. Extensive experiments on MuJoCo, MiniHack, and Habitat validate the effectiveness of CAE and CAE+, highlighting their ability to unify theoretical rigor with practical efficiency.
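Editor's note: the abstract does not spell out the bandit machinery, but a UCB-style selection rule over critic value estimates, the kind of technique CAE builds on, fits in a few lines. This is an illustrative sketch, not the paper's exact algorithm:

```python
import numpy as np

def ucb_action(q_values, counts, t, c=1.0):
    """Pick a discrete action by adding a count-based UCB exploration bonus
    to the critic's value estimates (illustrative sketch, not CAE itself)."""
    bonus = c * np.sqrt(np.log(t + 1.0) / (counts + 1e-8))
    return int(np.argmax(np.asarray(q_values) + bonus))
```

With equal value estimates, the rarely tried action wins through its larger bonus, which is the exploratory behavior such rules are meant to induce.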
URL: https://openreview.net/forum?id=54MOD02xC2
---
Title: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
Authors: Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
Abstract: Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). With CodePDE, we present a thorough evaluation on critical capacities of LLM for PDE solving: reasoning, debugging, self-refinement, and test-time scaling. CodePDE shows that, with advanced inference-time algorithms and scaling strategies, LLMs can achieve strong performance across a range of representative PDE problems. We also identify novel insights into LLM-driven solver generation, such as trade-offs between solver reliability and sophistication, design principles for LLM-powered PDE solving agents, and failure modes for LLM on hard tasks. These insights offer guidance for building more capable and reliable LLM-based scientific engines.
URL: https://openreview.net/forum?id=eG3Qy5Oux6
---
New submissions
===============
Title: Efficient Test-time Scaling via Iterative Deepening
Abstract: Recent reasoning models, such as OpenAI’s O1 series, have demonstrated exceptional performance on complex reasoning tasks and revealed new test-time scaling laws. Inspired by this, much subsequent work has studied how to train models for effective self-evaluation and self-correction to further enable this scaling paradigm. However, how to efficiently scale test-time compute from a fixed model remains less studied and still challenging. In this paper, we focus on whether LLMs can benefit from matching the pattern of correct responses. Specifically, we explore how systematically triggering a model's self-correction mechanisms can improve performance on challenging reasoning tasks. To this end, we propose a novel iterative deepening sampling algorithm framework designed to enhance self-correction and generate higher-quality samples. Through extensive experiments on the Math500, AIME, and GPQA-diamond benchmarks, we demonstrate that our method achieves a higher success rate on difficult tasks and provide detailed ablation studies to analyze its effectiveness across diverse settings.
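Editor's note: the control flow the abstract suggests — sample, self-check, and deepen only when needed — can be sketched as follows, where all three callables are hypothetical stand-ins for LLM calls:

```python
def iterative_deepening_sample(generate, verify, refine, max_rounds=3):
    """Draw a response; while a self-check rejects it, trigger another round
    of self-correction, up to max_rounds (illustrative sketch only)."""
    response = generate()
    for _ in range(max_rounds):
        if verify(response):
            break  # accepted: no further compute spent
        response = refine(response)  # deepen: one more self-correction pass
    return response
```

The budget grows only for hard prompts, which is the efficiency argument behind iterative deepening.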
URL: https://openreview.net/forum?id=oSNRwIM6hU
---
Title: LiteXrayNet: Bilateral Asymmetry-Aware Attention for Lightweight Pediatric Pneumonia Detection
Abstract: Pediatric pneumonia remains a major cause of mortality among children under five, with the greatest burden in resource-constrained settings where access to timely diagnosis is limited. Although deep learning methods have achieved strong performance in chest X-ray analysis, many existing approaches rely on large models that are difficult to deploy in such environments and do not explicitly account for the bilateral anatomical structure that radiologists routinely use during interpretation. We present LiteXrayNet, a lightweight convolutional neural network that incorporates Bilateral Asymmetry Attention (BAA), a geometry-guided attention mechanism designed to model left-right lung correspondence through spatial splitting, horizontal flipping, and adaptive feature gating. With only 127K parameters, LiteXrayNet achieves competitive pneumonia classification performance, attaining an F1 score of 97.31% and an accuracy of 97.90%, while supporting real-time inference on edge hardware with latencies of 4.11 ms on GPU and 14.53 ms on CPU. Feature-level bilateral asymmetry analysis indicates that BAA induces representations that differ systematically from those produced by generic attention mechanisms, while Grad-CAM visualizations suggest anatomically structured attention patterns consistent with common radiological reasoning. These results suggest that incorporating domain-specific anatomical priors as architectural constraints can support efficient and interpretable models suitable for deployment in resource-limited clinical settings.
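Editor's note: the described mechanism — split, flip, compare, gate — can be illustrated on a single-channel feature map. This is a simplification of the paper's BAA module, not its implementation:

```python
import numpy as np

def bilateral_asymmetry_gate(feat):
    """Toy bilateral asymmetry gating: mirror the right half onto the left,
    measure left-right disagreement, and gate features by it (sketch only)."""
    h, w = feat.shape
    left, right = feat[:, : w // 2], feat[:, w // 2 :]
    asym = np.abs(left - right[:, ::-1])   # disagreement with the mirrored half
    gate = 1.0 / (1.0 + np.exp(-asym))     # sigmoid: >= 0.5, grows with asymmetry
    return np.concatenate([left * gate, right * gate[:, ::-1]], axis=1)
```

A perfectly symmetric map yields zero asymmetry everywhere, so the gate is uniform; asymmetric regions (where pneumonia cues often appear) are amplified.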
URL: https://openreview.net/forum?id=dsu8ZAL4LJ
---
Title: RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Abstract: Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by “uniformly” optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better quantized models can be obtained by prioritizing learning from important tokens. Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate weight outliers, (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods. Our code is available in the supplementary material.
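Editor's note: the three stages can be sketched end to end. The random rotation and round-to-nearest step below stand in for the paper's rotations and GPTQ-based quantizer, which this sketch does not reproduce:

```python
import numpy as np

def rsq_sketch(W, X, importance, bits=4, seed=0):
    """Toy Rotate-Scale-Quantize: (1) rotate weights with a random orthogonal
    matrix to spread outliers, (2) scale token features by importance,
    (3) round-to-nearest uniform quantization (the paper uses GPTQ here)."""
    rng = np.random.default_rng(seed)
    d = W.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal rotation
    W_rot = W @ Q
    X_scaled = (X @ Q) * importance[:, None]          # importance-weighted tokens
    scale = np.abs(W_rot).max() / (2 ** (bits - 1) - 1)
    W_q = np.round(W_rot / scale) * scale             # uniform round-to-nearest
    return W_q @ Q.T, X_scaled                        # rotate weights back
```

Because the rotation is orthogonal, quantization error in the rotated basis carries over unchanged in Frobenius norm, which is why spreading outliers before rounding helps.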
URL: https://openreview.net/forum?id=kBezrKXHVS
---
Title: Modelling Complex Tabular Datasets with a Mixture of Diverse Generative Models
Abstract: Generative models are widely used, yet they often struggle to capture the multi-modal structure of complex tabular datasets. We address this challenge by introducing a novel framework that employs mixtures of diverse generators, each specialized to different regions of the data space. Our method proceeds in two stages: first, generators are assigned to data clusters via a compute-efficient bandit-based allocation strategy; second, cluster assignments are refined through an iterative procedure inspired by the Expectation–Maximization (EM) framework. Crucially, our approach is designed for settings where the generators’ likelihoods are intractable and only generated data samples are accessible. We provide theoretical guarantees by establishing convergence rates of the mixture distribution under approximate cluster identification. Empirical evaluations on both synthetic and real-world tabular datasets demonstrate that our approach produces high-quality synthetic data, validating its effectiveness in challenging generative modeling tasks.
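Editor's note: the first stage — bandit-based allocation of generators to clusters — can be sketched with a simple epsilon-greedy scheme. The paper's actual bandit strategy and reward are not specified in the abstract, so `reward_fn` here is a hypothetical fit score:

```python
import numpy as np

def allocate_generators(reward_fn, n_gens, n_clusters, rounds=500, eps=0.1, seed=0):
    """Epsilon-greedy bandit allocation: each generator repeatedly tries a
    cluster, observes a fit reward, and keeps running mean estimates."""
    rng = np.random.default_rng(seed)
    means = np.zeros((n_gens, n_clusters))
    counts = np.zeros((n_gens, n_clusters))
    for _ in range(rounds):
        for g in range(n_gens):
            # explore a random cluster with probability eps, else exploit
            c = rng.integers(n_clusters) if rng.random() < eps else int(np.argmax(means[g]))
            r = reward_fn(g, c)
            counts[g, c] += 1
            means[g, c] += (r - means[g, c]) / counts[g, c]  # incremental mean
    return means.argmax(axis=1)
```

The returned assignment would then seed the EM-style refinement described in the abstract.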
URL: https://openreview.net/forum?id=3y3mHAldp7
---
Title: Centrality Graph Shift Operators for Graph Neural Networks
Abstract: Graph Shift Operators (GSOs), such as the adjacency and graph Laplacian matrices, play a fundamental role in graph theory and graph representation learning. Traditional GSOs are typically constructed by normalizing the adjacency matrix by the degree matrix, a local centrality metric. In this work, we instead propose and study Centrality GSOs (CGSOs), which normalize adjacency matrices by global centrality metrics such as the PageRank, $k$-core or count of fixed-length walks. We study the spectral properties of CGSOs, allowing us to understand how they act on graph signals. We confirm this understanding by defining spectral clustering algorithms based on different CGSOs and running them on several synthetic and real-world datasets. We furthermore outline how our CGSOs can act as the message passing operator in any Graph Neural Network and in particular demonstrate strong performance of variants of the Graph Convolutional Network and Graph Attention Network using our CGSOs on several real-world datasets.
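Editor's note: for a concrete instance, replacing the degree matrix in the familiar symmetric normalization $D^{-1/2} A D^{-1/2}$ with PageRank centrality gives one such CGSO. A sketch (the paper studies several centralities; this picks one):

```python
import numpy as np

def pagerank(A, d=0.85, iters=100):
    """Power-iteration PageRank from an adjacency matrix
    (no dangling-node handling, for simplicity)."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    P = A / np.where(out == 0, 1, out)       # row-stochastic transition matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)
    return r

def centrality_gso(A, c):
    """Symmetric centrality normalization C^{-1/2} A C^{-1/2}, mirroring the
    degree-normalized GSO but with a global centrality vector c."""
    inv_sqrt = np.diag(1.0 / np.sqrt(c))
    return inv_sqrt @ A @ inv_sqrt
```

The resulting operator can then be dropped in wherever a GNN layer expects a shift operator.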
URL: https://openreview.net/forum?id=Btd0SIpoO4
---
Title: Identifiable Latent Bandits: Leveraging observational data for personalized decision-making
Abstract: Sequential decision-making algorithms such as multi-armed bandits can find optimal personalized decisions, but are notoriously sample-hungry. In personalized medicine, for example, training a bandit from scratch for every patient is typically infeasible, as the number of trials required is much larger than the number of decision points for a single patient. To combat this, latent bandits offer rapid exploration and personalization beyond what context variables alone can offer, provided that a latent variable model of problem instances can be learned consistently. However, existing works give no guidance as to how such a model can be found. In this work, we propose an identifiable latent bandit framework that leads to optimal decision-making with a shorter exploration time than classical bandits by learning from historical records of decisions and outcomes. Our method is based on nonlinear independent component analysis that provably identifies representations from observational data sufficient to infer optimal actions in new bandit instances. We verify this strategy in simulated and semi-synthetic environments, showing substantial improvement over online and offline learning baselines when identifying conditions are satisfied.
URL: https://openreview.net/forum?id=SvkZ76wKpu
---
Title: LoRAQuant: Mixed-Precision Quantization of LoRA to Ultra-Low Bits
Abstract: Low-Rank Adaptation (LoRA) has become a popular technique for parameter-efficient fine-tuning of large language models (LLMs). In many real-world scenarios, multiple adapters are loaded simultaneously to enable LLM customization for personalized user experiences or to support a diverse range of tasks. Although each adapter is lightweight in isolation, their aggregate cost becomes substantial at scale. To address this, we propose LoRAQuant, a mixed-precision post-training quantization method tailored to LoRA. Specifically, LoRAQuant reparameterizes each adapter by singular value decomposition (SVD) to concentrate the most important information into specific rows and columns. This makes it possible to quantize the important components to higher precision, while quantizing the rest to ultra-low bitwidth. We conduct comprehensive experiments with LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B models on mathematical reasoning, coding, and summarization tasks. Results show that LoRAQuant uses significantly lower bits than other quantization methods, but achieves comparable or even higher performance.
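Editor's note: the core idea — use SVD so a few components carry most of the adapter's energy, then spend bits unevenly — can be sketched as follows. The bit widths, split rule, and round-to-nearest quantizer are illustrative simplifications, not the paper's exact scheme:

```python
import numpy as np

def quantize(M, bits):
    """Uniform round-to-nearest quantization onto a signed `bits`-bit grid."""
    scale = np.abs(M).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(M / scale) * scale

def loraquant_sketch(B, A, k=4, hi_bits=8, lo_bits=2):
    """Split the adapter update BA via SVD: the top-k singular components are
    kept at hi_bits, the residual is pushed to an ultra-low lo_bits."""
    U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    top = (U[:, :k] * s[:k]) @ Vt[:k]        # dominant, bit-hungry part
    rest = (U[:, k:] * s[k:]) @ Vt[k:]       # low-energy residual
    return quantize(top, hi_bits) + quantize(rest, lo_bits)
```

When the adapter's energy really is concentrated in the top components, the low-bit residual contributes little reconstruction error.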
URL: https://openreview.net/forum?id=71svCWi178
---
Title: Autoregressive Models for Knowledge Graph Generation
Abstract: Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality >= 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.
URL: https://openreview.net/forum?id=xhy0tB4uzb
---
Title: Bridging Mechanistic Interpretability and Prompt Engineering with Gradient Ascent for Interpretable Persona Control
Abstract: Controlling emergent behavioral personas (e.g., sycophancy, hallucination) in Large Language Models (LLMs) is critical for AI safety, yet remains a persistent challenge. Existing solutions face a dilemma: manual prompt engineering is intuitive but unscalable and imprecise, while automatic optimization methods are effective but operate as "black boxes" with no interpretable connection to model internals. We propose a novel framework that adapts gradient ascent to LLMs, enabling targeted prompt discovery. Specifically, we propose two methods, RESGA and SAEGA, which both optimize randomly initialized prompts to align their representations with an identified persona direction. We introduce fluent gradient ascent to control the fluency of discovered persona-steering prompts. We demonstrate RESGA and SAEGA’s effectiveness across Llama 3.1, Qwen 2.5, and Gemma 3 for steering three different personas: sycophancy, hallucination, and myopic reward. Crucially, on sycophancy, our automatically discovered prompts achieve a significant improvement (49.90% vs. 79.24%). By grounding prompt discovery in mechanistically meaningful features, our method offers a new paradigm for controllable and interpretable behavior modification.
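Editor's note: in its simplest form, the underlying optimization — gradient ascent on the alignment between a prompt's representation and a persona direction — reduces to ascending cosine similarity. Here the identity map stands in for a frozen model's representation function, which is a strong simplification:

```python
import numpy as np

def prompt_gradient_ascent(persona_dir, embed_dim=16, steps=200, lr=0.5, seed=0):
    """Gradient ascent on cosine similarity between a continuous 'prompt'
    vector and a persona direction (toy stand-in for representation-space
    prompt optimization)."""
    rng = np.random.default_rng(seed)
    p = rng.standard_normal(embed_dim)
    d = persona_dir / np.linalg.norm(persona_dir)
    for _ in range(steps):
        n = np.linalg.norm(p)
        grad = d / n - (p @ d) * p / n**3   # gradient of cos(p, d) w.r.t. p
        p = p + lr * grad
    return p
```

Methods like RESGA/SAEGA additionally need to map such continuous updates back to fluent discrete tokens, which is where the paper's "fluent gradient ascent" comes in.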
URL: https://openreview.net/forum?id=dcmHPxgo4c
---
Title: Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration
Abstract: Multi-expert systems, where multiple Large Language Models (LLMs) collaborate to solve complex tasks, are increasingly adopted for high-performance reasoning and generation. However, the orchestration policies governing expert interaction and sequencing remain largely opaque. We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation, enabling the decoupling of expert interaction structure, execution order, and causal attribution. We use INFORM to evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts drawn from LLaMA-3.1 8B, Qwen3 8B, and DeepSeek-R1 8B, with controlled decoding-temperature variation, and a secondary heterogeneous consortium spanning 1B-7B parameter models. Across tasks, routing dominance is a poor proxy for functional necessity. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution: frequently selected experts often act as interaction hubs with limited causal influence, while sparsely routed experts can be structurally critical. Orchestration behaviors emerge asynchronously, with expert centralization preceding stable routing confidence and expert ordering remaining non-deterministic. Targeted ablations show that masking intrinsically important experts induces disproportionate collapse in interaction structure compared to masking frequent peers, confirming that INFORM exposes causal and structural dependencies beyond accuracy metrics alone.
URL: https://openreview.net/forum?id=4W7sgat04A
---
Title: Fractal Predictive Operators: Learnable Iterated Function Systems for Multi-Scale Latent Modeling
Abstract: Joint Embedding Predictive Architectures (JEPAs) rely on latent-space prediction to learn representations without explicit reconstruction. While effective, their predictors are typically implemented as shallow feed-forward networks, offering limited control over multi-step dynamics and stability. We introduce Learnable Iterated Function Systems (LIFS), a contractive predictive operator that replaces the standard JEPA predictor with a learned mixture of affine maps applied recursively in latent space. Mixture weights are generated conditionally on the context embedding, allowing the operator to adapt its local geometry across spatial locations and inputs. LIFS does not change the training objective or encoder architecture, but explicitly constrains predictor dynamics through spectral control and adaptive gating. Additionally, our analysis unifies spectral control, exponential moving average (EMA) updates, and predictive convergence through a contraction-based perspective. Empirically, integrating LIFS into JEPA improves training stability and yields consistent, though moderate, gains in linear probing accuracy, particularly for ViT-based encoders and non-overlapping prediction settings. These results highlight predictor dynamics as an important and underexplored design axis in self-supervised learning.
URL: https://openreview.net/forum?id=k2Z2gPOtlq
---
Title: CHyLL: Learning Continuous Neural Representations of Hybrid Systems
Abstract: Learning the flows of hybrid systems with both continuous and discrete dynamics is challenging. Existing methods learn the dynamics of each discrete mode separately, and therefore suffer from the combination of mode switching and discontinuities in the flows. In this work, we propose CHyLL (Continuous Hybrid System Learning in Latent Space), which learns a continuous neural representation of a hybrid system without trajectory segmentation, event functions, or mode switching. The key insight of CHyLL is that the reset map glues the state space at the guard surface, reformulating the state space as a piecewise smooth quotient manifold where the flow becomes spatially continuous. Building upon these insights and embedding theorems grounded in differential topology, CHyLL concurrently learns a singularity-free neural embedding in a higher-dimensional space and the continuous flow in it. We demonstrate that CHyLL can predict the flows of hybrid systems with superior accuracy and identify their topological invariants. Finally, we apply CHyLL to a stochastic optimal control problem.
URL: https://openreview.net/forum?id=xK4WQnf7Yj
---
Title: Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits
Abstract: The recent advancement of Large Language Models (LLMs) offers new opportunities to generate user preference data to warm-start bandits. Recent studies on contextual bandits with LLM initialization (CBLI) have shown that these synthetic priors can significantly lower early regret. However, these findings assume that LLM-generated choices are reasonably aligned with actual user preferences. In this paper, we systematically examine how LLM-generated preferences perform when random and label-flipping noise is injected into the synthetic training data. For aligned domains, we find that warm-starting remains effective up to 30\% corruption, loses its advantage around 40\%, and degrades performance beyond 50\%. When there is systematic misalignment, even without added noise, LLM-generated priors can lead to higher regret than a cold-start bandit. To explain these behaviors, we develop a theoretical analysis that decomposes the effect of random label noise and systematic misalignment on the prior error driving the bandit’s regret, and derive a sufficient condition under which LLM-based warm starts are provably better than a cold-start bandit. We validate these results across multiple conjoint datasets and LLMs, showing that estimated alignment reliably tracks when warm-starting improves or degrades recommendation quality.
URL: https://openreview.net/forum?id=tojKjqIOBd
---
Title: CS-pFedTM: Communication-Efficient and Similarity-based Personalised Federated Learning with Tsetlin Machine
Abstract: Federated Learning has emerged as a promising framework for privacy-preserving collaborative model training across decentralised data sources. However, data heterogeneity remains a major challenge, adversely affecting both the performance and efficiency of FL systems. To address this issue, we propose CS-pFedTM (Communication-Efficient and Similarity-based Personalised Federated Learning with Tsetlin Machine), a method that jointly incorporates communication-aware resource allocation and heterogeneity-driven personalisation. CS-pFedTM enforces communication budget constraints through adaptive clause allocation and tailors personalisation by using similarity between clients’ model parameters as a proxy for data heterogeneity. To further enhance scalability, the proposed framework integrates confidence-based aggregation and class-specific weight masking. Extensive experiments show that CS-pFedTM achieves reductions in communication and runtime costs, with up to $1352\times$ and $210\times$ reductions in upload and download communication respectively, and at least $1.43\times$ improvements in runtime efficiency, while maintaining performance comparable to state-of-the-art personalised FL approaches.
URL: https://openreview.net/forum?id=sdwGiofszZ
---
Title: On The Scalability Of Forward Gradients, Evolutionary Strategies, And Control Variates
Abstract: Stochastic gradient estimation methods such as Forward Gradients (FG) and Evolutionary Strategies (ES) have been proposed to overcome drawbacks of computing gradients with backpropagation (BP). However, FG and ES have large variance in high dimensions, connections between these methods have previously remained unclear, and while pure FG is guaranteed to be unbiased, proposed improvements have typically abandoned this property. We illuminate connections between FG and a popular variant of ES by proving mathematical equivalence on all quadratic objective functions. On an illustrative problem, we demonstrate theoretically how optimal convergence and learning rates scale unfavourably with intrinsic dimensionality and population size. We show that popular gradient descent techniques such as momentum and Adam do not address these fundamental scalability problems. We explore using control variates to reduce variance of FG while maintaining unbiasedness, and while we find limited success in improving over baselines, we also identify challenges that need to be overcome for these methods to scale effectively. Lastly, we consider a biased method for variance reduction, and on a particular problem we show that this significantly outperforms the unbiased variance reduction methods that we consider. Assuming access to an asymptotically unbiased control variate, our results suggest that maintaining unbiasedness is not necessarily advantageous for variance reduction techniques; however, we leave open the possibility that unbiasedness may be helpful when the control variate is asymptotically biased. Our code is publicly available at https://github.com/anon908bp2zy/forward_grad_public.
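Editor's note: the basic forward-gradient estimator analyzed here takes a directional derivative along a random direction v and reports (∇f·v)v, which is unbiased since E[vvᵀ] = I. A finite-difference sketch (forward-mode autodiff would compute the directional derivative exactly):

```python
import numpy as np

def forward_gradient(f, x, rng, eps=1e-6):
    """Unbiased forward-gradient estimate (grad f . v) v with v ~ N(0, I);
    the directional derivative is approximated by finite differences here,
    in place of an exact forward-mode JVP."""
    v = rng.standard_normal(x.shape)
    dirderiv = (f(x + eps * v) - f(x)) / eps   # ~ grad f(x) . v
    return dirderiv * v
```

Averaging many such estimates recovers the gradient, but the per-sample variance grows with dimension, which is the scalability issue the paper studies.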
URL: https://openreview.net/forum?id=s6g8yZimHE
---
Title: PAC-Bayesian Meta-Learning for Few-Shot Identification of Linear Dynamical Systems
Abstract: Identifying linear time-invariant (LTI) dynamical systems from data is especially challenging when trajectories are short, noisy, or high-dimensional. Traditional system identification methods typically treat each system in isolation and therefore discard shared information that may exist across related systems. We propose a PAC-Bayesian Meta-Learning framework for LTI system identification (PBML-LTI) that explicitly leverages cross-task structure while preserving task-level heterogeneity. Each task corresponds to an unknown LTI system, and a meta-learner uses a collection of training trajectories to learn a data-dependent prior over system parameters. Given a new system with limited trajectory data, the method performs Bayesian inference to produce a posterior distribution over the new system’s parameters, enabling calibrated uncertainty quantification and principled adaptation in the few-shot regime.
A key technical challenge is temporal dependence: trajectories generated by LTI systems violate i.i.d. assumptions underlying standard learning theory. To address this, we develop generalization guarantees for meta-learned priors under sequential dependence using martingale-based PAC-Bayes analysis with sub-normalized concentration tools. The resulting bounds characterize how the quality of the learned prior controls expected identification error on unseen systems, with explicit dependence on trajectory length, noise, and the divergence between task posteriors and the meta-prior. This connects uncertainty-aware meta-identification with finite-sample theory for dependent dynamical data.
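Editor's note: for background, the classical i.i.d. PAC-Bayes bound that martingale-based analyses like this one extend to dependent data is (McAllester/Maurer form; the paper's bound will differ, in particular in its complexity term and its treatment of temporal dependence):

```latex
\Pr_{S \sim \mathcal{D}^n}\!\left[\, \forall Q:\;
  \mathbb{E}_{h \sim Q}\, L(h) \;\le\; \mathbb{E}_{h \sim Q}\, \hat{L}_S(h)
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln \tfrac{2\sqrt{n}}{\delta}}{2n}}
  \,\right] \;\ge\; 1 - \delta
```

Here $P$ is the (meta-learned, data-dependent) prior, $Q$ any posterior, $L$ and $\hat{L}_S$ the population and empirical losses; the KL term is exactly the "divergence between task posteriors and the meta-prior" dependence the abstract mentions.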
URL: https://openreview.net/forum?id=CiGFpSLzFv
---
Title: A Causal Testbed for Disentangling Skill from Aggregate Game Statistics in Chess
Abstract: A long-standing objective in human-AI interaction is to create personalized AI coaching systems that enhance human skill without distorting quantifiable behavioral patterns. We hypothesize that the common problem of style drift in AI coaching results from a failure to recognize the underlying causal structure, namely the collider structure linking skill and behavioral patterns. We propose a methodological testbed for formalizing, quantifying, and addressing skill-behavioral pattern disentanglement under a particular causal structure. Instead of concentrating on holistic chess style, we specifically target a tractable proxy problem: decoupling skill from six interpretable aggregate play statistics. Our contribution is positioned as methodological rather than a comprehensive solution to chess coaching because this simplified feature space allows controlled testing of the collider hypothesis with known ground truth. We evaluate our approach on 30,000 real-world chess games, demonstrating that unsupervised disentanglement models ($\beta$-VAE, InfoGAN) fail on our testbed (MIG $\approx$ 0), while our causally informed architecture achieves strong disentanglement (MIG = 0.89, HSIC $\approx$ 0.00016). Our model produces statistically independent latent representations while maintaining excellent predictive accuracy. While we achieve statistical disentanglement on our defined features, we cannot validate whether the learned representations capture meaningful strategic concepts or enable effective coaching without human evaluation by chess domain experts. Our contribution demonstrates the statistical mechanism by which collider bias prevents disentanglement and how HSIC regularization addresses it.
URL: https://openreview.net/forum?id=X3s31GOYPz
---