J2C Certification: Tabby: A Language Model Architecture for Tabular and Structured Data Synthesis
Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, Frederic Sala
https://openreview.net/forum?id=b9FPVnb0Bn
---
J2C Certification: Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
https://openreview.net/forum?id=YPVBCTBqHE
---
Accepted papers
===============
Title: Federated Multimodal Fusion for Action Recognition Leveraging Vision-Language Embeddings and Spatio-Temporal CNNs
Authors: Aditi Palit, Kalidas Yeturu
Abstract: Federated learning (FL) for Video Action Recognition (VAR) faces significant challenges in balancing privacy preservation, communication efficiency, and model performance. This paper introduces FLAMeST (Federated Learning for Action Recognition with Multimodal embeddings and Spatio-Temporal Fusion), an FL framework that synergizes Vision-Language Models (VLMs) and spatiotemporal CNNs to address these challenges. Unlike existing works that use BLIP (a VLM) solely for caption generation, FLAMeST leverages BLIP in a dual manner. To enhance temporal modeling, complementary spatiotemporal features are extracted using a pre-trained 3D CNN (Slow network). These semantic (BLIP) and motion (Slow) embeddings are concatenated into a unified representation to train a lightweight Multi-Layer Perceptron (MLP). Within the FL paradigm, only the MLP parameters are shared with the server, ensuring raw video data and generated captions remain local. FLAMeST employs the FedAvg algorithm for model aggregation, achieving 99% lower communication overhead compared to full-model training. Experiments on the UCF101 and HMDB51 datasets demonstrate the framework's robustness, with accuracy improvements of 5.13% and 2.71%, respectively, over the baseline.
URL: https://openreview.net/forum?id=AobzdtqiMe
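A minimal sketch (not the authors' code) of the aggregation idea described above: each client trains a small MLP on concatenated semantic (BLIP) and motion (Slow) embeddings, and only the MLP weights are averaged on the server. Embeddings here are random placeholders and all dimensions are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

SEM_DIM, MOT_DIM, N_CLASSES = 256, 2048, 101  # assumed sizes

def make_mlp():
    return nn.Sequential(nn.Linear(SEM_DIM + MOT_DIM, 512), nn.ReLU(),
                         nn.Linear(512, N_CLASSES))

def local_update(model, features, labels, epochs=1):
    # client-side training on local (embedding, label) pairs
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(features), labels)
        loss.backward()
        opt.step()
    return model.state_dict()

def fedavg(state_dicts):
    # server-side FedAvg: parameter-wise mean over client models
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return avg

global_model = make_mlp()
client_states = []
for _ in range(4):                                 # four simulated clients
    sem = torch.randn(32, SEM_DIM)                 # stand-in for BLIP embeddings
    mot = torch.randn(32, MOT_DIM)                 # stand-in for Slow-network features
    labels = torch.randint(0, N_CLASSES, (32,))
    client_states.append(local_update(global_model, torch.cat([sem, mot], dim=1), labels))
global_model.load_state_dict(fedavg(client_states))  # only MLP weights leave the clients
```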
---
Title: Achieving Tighter Finite-Time Rates for Heterogeneous Federated Stochastic Approximation under Markovian Sampling
Authors: Feng Zhu, Aritra Mitra, Robert W. Heath
Abstract: Motivated by collaborative reinforcement learning (RL) and optimization with time-correlated data, we study a generic federated stochastic approximation problem involving $M$ agents, where each agent is characterized by an agent-specific (potentially nonlinear) local operator. The goal is for the agents to communicate intermittently via a server to find the root of the average of the agents' local operators. The generality of our setting stems from allowing for (i) Markovian data at each agent and (ii) heterogeneity in the roots of the agents' local operators. The limited recent work that has accounted for both these features in a federated setting fails to guarantee convergence to the desired point or to show any benefit of collaboration; furthermore, they rely on projection steps in their algorithms to guarantee bounded iterates. Our work overcomes each of these limitations. We develop a novel algorithm called \texttt{FedHSA}, and prove that it guarantees convergence to the correct point, while enjoying an $M$-fold linear speedup in sample-complexity due to collaboration. To our knowledge, \emph{this is the first finite-time result of its kind}, and establishing it (without relying on a projection step) entails a fairly intricate argument that accounts for the interplay between complex temporal correlations due to Markovian sampling, multiple local steps to save communication, and the drift-effects induced by heterogeneous local operators. Our results have implications for a broad class of heterogeneous federated RL problems (e.g., policy evaluation and control) with function approximation, where the agents' Markov decision processes can differ in their probability transition kernels and reward functions.
URL: https://openreview.net/forum?id=1xRG4ECacS
---
Title: CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models
Authors: Neeraj Anand, Samyak Jha, Udbhav Bamba, Rahul Rahaman
Abstract: Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.
URL: https://openreview.net/forum?id=KQSoZDPVGX
---
Title: Improving Foundation Model Group Robustness with Auxiliary Sentence Embeddings
Authors: Sisuo Lyu, Hong Liu, Jie Li, Yan Teng, Yingchun Wang
Abstract: This paper addresses the critical challenge of mitigating group-based biases in vision-language foundation models, a pressing issue for ensuring trustworthy AI deployment. We introduce DoubleCCA, a novel and computationally efficient framework that systematically enriches textual representations to enhance group robustness. Our key innovation is to leverage an auxiliary large sentence embedding model to capture diverse semantic perspectives, counteracting biased representations induced by limited training data. To this end, we propose a two-stage Canonical Correlation Analysis (DoubleCCA) technique: first, aligning augmented and original embeddings in a shared space; second, reconstructing invariant features to align with visual representations, thus enhancing the model's group robustness. We further propose a simple sentence augmentation approach, which aims to improve the robustness of CCA-induced subspaces. Our method is simple to implement and can be easily integrated into existing models, making it a practical solution for improving the robustness of vision-language foundation models to group-based biases. The experiments on a variety of datasets demonstrate that our method outperforms existing methods in terms of both performance and robustness. Our code is available at https://github.com/sisuolv/doublecca.
URL: https://openreview.net/forum?id=5rMtiB96cg
---
Title: Noise-Aware Adaptation of Pre-trained Foundation Models for Single-photon Image Classification
Authors: Ziting Wen, Wenle Dong, Zili Zhang, Yiheng Qiang, KEMI DING, Xiaoqiang Ren
Abstract: Adapting pre-trained foundation models to novel sensor modalities is a fundamental challenge. These models are pre-trained on large RGB datasets that typically lack exposure to the imaging characteristics of other modalities. Physical acquisition effects, such as photon statistics and sensor-specific noise, produce appearance shifts that are underrepresented in pre-training and can degrade transfer performance. We propose a noise-aware adaptation framework that conditions model adaptation on sensor-specific acquisition statistics. Central to our approach is a lightweight Noise Adapter that modulates pre-trained visual features using summary statistics of the sensor’s outputs, to decouple acquisition-induced appearance variation from semantics and improve robustness in low-label regimes. We instantiate this idea as a case study on single-photon LiDAR depth images by designing a Noise Adapter that leverages summary statistics computed from raw single-photon histograms for few-shot classification. We also present an exploratory analysis showing how learned modulation patterns correspond to noise-induced feature shifts, providing insight into the adapter’s role in feature robustness. Experiments on both synthetic and real single-photon datasets show that our method improves accuracy over baselines, with an average improvement of 3\% over the best baseline. These results suggest that explicitly conditioning adaptation on physical acquisition factors is a practical and promising strategy that may generalize to other non-standard modalities. The code is available at~\url{https://github.com/ZiTingW/noise_adapter}.
URL: https://openreview.net/forum?id=qSnrIy6Ohb
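An illustrative sketch of the adapter idea as we read the abstract (FiLM-style scale/shift modulation is our assumption, not the released design): summary statistics of the sensor output are mapped to parameters that modulate frozen backbone features.

```python
import torch
import torch.nn as nn

class NoiseAdapter(nn.Module):
    """Maps acquisition summary statistics to scale/shift parameters
    that modulate frozen visual features (FiLM-style, assumed)."""
    def __init__(self, stat_dim=8, feat_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(stat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * feat_dim))

    def forward(self, feats, stats):
        gamma, beta = self.net(stats).chunk(2, dim=-1)
        return feats * (1 + gamma) + beta

feats = torch.randn(4, 768)   # frozen backbone features (placeholder)
stats = torch.randn(4, 8)     # per-image acquisition statistics (placeholder)
adapted = NoiseAdapter()(feats, stats)
print(adapted.shape)          # torch.Size([4, 768])
```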
---
Title: Proper Orthogonal Decomposition for Scalable Training of Graph Neural Networks
Authors: Abhishek A, Manohar Kaul, Mohit Meena, Mahesh Chandran
Abstract: As large-scale graphs become ubiquitous in real-world applications, there is growing concern about the memory and time requirements of training a graph neural network (GNN) model for such datasets. Storing the entire adjacency and node embedding matrices in memory is infeasible in such a scenario. Standard sampling-based methods for addressing the memory constraint suffer from the dependence of the number of mini-batches on the graph size. Existing sketch-based methods and graph compression techniques operate at higher sketch ratios, with the graph compression techniques showing poor generalization, implying that different GNNs trained on the same synthetic graph have performance gaps. Sketch-based methods necessitate online learning of sketches, further increasing the complexity. In this paper, we propose a new sketch-based algorithm, PGNN, employing the Proper Orthogonal Decomposition (POD) method to craft update rules to train GNNs, improving the memory requirement and training time without the complication of updating the sketches during training. Experiments on standard graph datasets show that PGNN can reach much lower sketch ratios without compromising performance. We demonstrate that the POD projection matrix is provably optimal, minimizing an upper bound on the projection error induced by the linearized GNN (SGC) update rule. Empirical findings validate our approach, demonstrating superior performance at reduced sketch ratios and adaptability across various GNN architectures.
URL: https://openreview.net/forum?id=LeL6whBoWE
---
Title: Tabby: A Language Model Architecture for Tabular and Structured Data Synthesis
Authors: Sonia Cromp, Satya Sai Srinath Namburi GNVV, Mohammed Alkhudhayri, Catherine Cao, Samuel Guo, Nicholas Roberts, Frederic Sala
Abstract: Large language models (LLMs) have greatly improved the quality of synthetic text data. We aim to extend these advances to tabular data with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby represents differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. Pairing Tabby with Plain, our novel tabular training technique, we observe up to a $7\%$ improvement in quality (measured by MLE) over previous methods. Additionally, our approach is more flexible than prior strategies and extends beyond tables, to more general structured data. In a structured JSON setting, Tabby outperforms all other methods by $2$-$3$ points and is the only approach with MLE equal to the upper bound of non-synthetic data.
URL: https://openreview.net/forum?id=b9FPVnb0Bn
---
Title: Do Vision Encoders Truly Explain Object Hallucination?: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore
Authors: Hongseok Oh, Wonseok Hwang
Abstract: Recently, Large Vision-Language Models (LVLMs) have shown remarkable performance across various domains.
However, these models suffer from object hallucination.
In this work, we study object hallucination primarily in a discriminative, retrieval-style evaluation setting (OHD-Caps), rather than in free-form caption generation.
This study revisits the previous claim that the cause of such hallucinations lies in the limited representational capacity of the vision encoder.
Our analysis implies that the capacity of the vision encoder is not necessarily a major limiting factor in detecting object hallucination.
Based on this insight, we propose Fine-grained CLIPScore (F-CLIPScore), a simple yet effective evaluation metric that enhances object-level granularity by incorporating text embeddings at the noun level.
Evaluations on the OHD-Caps benchmark show that F-CLIPScore outperforms conventional CLIPScore in accuracy by a large margin of 39.6% without additional training.
We further demonstrate that F-CLIPScore-based data filtering reduces object hallucination in LVLMs (a 4.9% improvement in POPE accuracy after alignment pretraining).
Our code is publicly available at https://github.com/abzb1/f-clip
URL: https://openreview.net/forum?id=JTua6tDPgZ
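A rough sketch of the noun-level scoring idea (the aggregation rule and noun extraction here are our assumptions, not the authors' exact recipe): embed the full caption and each noun separately, score every text embedding against the image embedding, and average. Placeholder embeddings stand in for CLIP encoders.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_text(text, dim=64):
    # stand-in for CLIP's text encoder (deterministic per string)
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.normal(size=dim)

def f_clipscore(image_emb, caption, nouns):
    texts = [caption] + nouns          # caption plus noun-level snippets
    scores = [cosine(image_emb, embed_text(t)) for t in texts]
    return float(np.mean(scores))      # assumed aggregation: simple average

image_emb = np.random.default_rng(0).normal(size=64)   # stand-in for CLIP image embedding
caption = "a dog chasing a ball in the park"
nouns = ["dog", "ball", "park"]        # noun extraction (e.g. a POS tagger) assumed
print(f_clipscore(image_emb, caption, nouns))
```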
---
Title: Efficient Few-Shot Continual Learning in Vision-Language Models
Authors: Aristeidis Panos, Rahaf Aljundi, Daniel Olmeda Reino, Richard E. Turner
Abstract: Vision-language models (VLMs) excel at tasks like visual question answering and image captioning, but their reliance on frozen, pretrained image encoders like CLIP often leads to persistent vision errors that degrade downstream performance. Moreover, real-world deployment demands that VLMs continually adapt to new, scarce data in a few-shot setting without forgetting prior knowledge. To meet these challenges, we introduce LoRSU (Low-Rank Adaptation with Structured Updates), a lightweight and robust technique for few-shot continual learning of VLMs’ image encoders. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. In experiments on VQA benchmarks under a few-shot continual learning protocol, LoRSU demonstrates superior scalability, efficiency, and accuracy, offering a practical solution for dynamic, resource-constrained vision-language applications.
URL: https://openreview.net/forum?id=sQ1w92WW0V
---
Title: Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases
Authors: Md. Tanvir Alam, Md. Ahasanul Alam, Md Mahmudur Rahman, Md Mosaddek Khan
Abstract: Relational databases (RDBs) are ubiquitous in enterprise and real-world applications. Flattening the database poses challenges for deep learning models that rely on fixed-size input representations to capture relational semantics from the structured nature of relational data. Graph neural networks (GNNs) have been proposed to address this, but they often oversimplify relational structures by modeling all the tuples as monolithic nodes and ignoring intra-tuple associations. In this work, we propose a novel hypergraph-based framework, rel-HNN, which models each unique attribute-value pair as a node and each tuple as a hyperedge, enabling the capture of fine-grained intra-tuple relationships. Our approach learns explicit multi-level representations across attribute-value, tuple, and table levels. To address the scalability challenges posed by large RDBs, we further introduce a split-parallel training algorithm that leverages multi-GPU execution for efficient hypergraph learning. Extensive experiments on real-world and benchmark datasets demonstrate that rel-HNN significantly outperforms existing methods in both classification and regression tasks. Moreover, although the benefits of split-parallel training diminish on smaller hypergraphs with fewer nodes due to communication overhead, it achieves substantial speedups of up to 3.18× on large-scale relational datasets and up to 2.94× on large hypergraph datasets.
URL: https://openreview.net/forum?id=L7VP7gxpVG
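A minimal sketch of the hypergraph construction described above (our reading of the abstract, not the released code): every unique (attribute, value) pair becomes a node, and every tuple becomes a hyperedge over its pairs' nodes. The toy table is made up.

```python
rows = [
    {"age": 30, "city": "Dhaka", "label": 1},
    {"age": 30, "city": "Tokyo", "label": 0},
    {"age": 45, "city": "Dhaka", "label": 1},
]

node_ids = {}     # (attribute, value) -> node index
hyperedges = []   # one hyperedge (list of node indices) per tuple
for row in rows:
    edge = []
    for attr, value in row.items():
        key = (attr, value)
        if key not in node_ids:
            node_ids[key] = len(node_ids)
        edge.append(node_ids[key])
    hyperedges.append(edge)

print(len(node_ids), "attribute-value nodes")  # shared pairs are reused across tuples
print(hyperedges)                              # [[0, 1, 2], [0, 3, 4], [5, 1, 2]]
```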
---
Title: Unreasonable effectiveness of LLM reasoning: a doubly cautionary tale of temporal question-answering
Authors: Dagmara Panas, Ali Payani, Vaishak Belle
Abstract: The remarkable success of Large Language Models in modeling both the syntax and the semantics of language has prompted a body of research into language-adjacent abilities, most notably commonsense reasoning.
As LLMs' performance continues to advance on successive benchmarks, we turn to temporal reasoning, which lags somewhat behind other tasks due to its more complex logic.
We start from previous work, where authors successfully induce (apparent) reasoning by breaking down the problem into a two-step procedure of temporal graph extraction and subsequent reasoning.
Specifically, in the first step an LLM is prompted to parse a natural language description into a semi-structured timeline of events; and in the second step, it is given the extracted timeline and prompted to answer a temporal reasoning question.
We conjecture that this procedure presents two separate opportunities for introducing errors and further hypothesise that a Neuro-symbolic approach should help in this matter.
We follow the recent trend of using external executors in concert with LLMs to carry out exact reasoning and verification.
We see the reasoning step of the original two-step procedure as a natural target for a symbolic solver and design a rule-based solution for Temporal Question-Answering, drawing on ideas from Allen’s Interval Algebra.
To our surprise, we find that our rule-based reasoner does not improve beyond the previously reported, purely neural solution.
It appears that both our approach and the previous method operate at around the limits of achievable performance, imposed by the correctness of information extraction.
Such a result seems to suggest that a non-symbolic LLM is capable of symbolic-level reasoning, although upon further investigation we discover that not to be the case.
It is not that the neural solution makes no reasoning mistakes, but rather that the LLM manages to compensate for some of its erroneous replies by `short-cutting' to the correct answer in other questions; a.k.a. not reasoning but guessing.
Although the effect is not pronounced performance-wise, we feel it is conceptually important: as we argue, production of correct answers is not a measure of reasoning.
URL: https://openreview.net/forum?id=1DkD0Nd8Rd
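A toy sketch of the symbolic step mentioned above: classify the relation between two events given their (start, end) times, the kind of primitive a rule-based temporal-QA solver composes. The relation names follow Allen's Interval Algebra; the timeline extraction step (the LLM's job) and the example events are assumed.

```python
def allen_relation(a, b):
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "equal"
    if a_start >= b_start and a_end <= b_end:
        return "during (or starts/finishes)"
    if a_start < b_start and a_end < b_end:
        return "overlaps"
    return "other (inverse or containing relation)"

timeline = {"breakfast": (8, 9), "meeting": (9, 11), "call": (10, 12)}
print(allen_relation(timeline["breakfast"], timeline["meeting"]))  # meets
print(allen_relation(timeline["meeting"], timeline["call"]))       # overlaps
```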
---
Title: How Many Images Does It Take? Estimating Imitation Thresholds in Text-to-Image Models
Authors: Sahil Verma, Royi Rassin, Arnav Mohanty Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar
Abstract: Text-to-image models are trained using large datasets of image-text pairs collected from the internet. These datasets often include copyrighted and private images. Training models on such datasets enables them to generate images that might violate copyright laws and individual privacy. This phenomenon is termed imitation – generation of images with content that has recognizable similarity to its training images. In this work we estimate the point at which a model was trained on enough instances of a concept to be able to imitate it – the imitation threshold. We posit this question as a new problem and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training these models from scratch. We experiment with two domains – human faces and art styles, and evaluate four text-to-image models that were trained on three pretraining datasets. We estimate the imitation threshold of these models to be in the range of 200-700 images, depending on the domain and the model. The imitation threshold provides an empirical basis for copyright violation claims and acts as a guiding principle for text-to-image model developers that aim to comply with copyright and privacy laws.
URL: https://openreview.net/forum?id=x0qJo7SPhs
---
Title: Cluster Agnostic Network Lasso Bandits
Authors: Sofien Dhouib, Steven Bilaj, Behzad Nourani-Koliji, Setareh Maghsudi
Abstract: We consider a multi-task contextual bandit setting, where the learner is given a graph encoding relations between the bandit tasks. The tasks' preference vectors are assumed to be piecewise constant over the graph, forming clusters. At every round, we estimate the preference vectors by solving an online network lasso problem with a suitably chosen, time-dependent regularization parameter. We establish a novel oracle inequality relying on a convenient restricted eigenvalue assumption. Our theoretical findings highlight the importance of dense intra-cluster connections and sparse inter-cluster ones, which together yield a sublinear regret bound significantly lower than its counterpart in the independent task learning setting. Finally, we support our theoretical findings by experimental evaluation against graph bandit multi-task learning and online clustering of bandits algorithms.
URL: https://openreview.net/forum?id=QjAyoMP1DD
---
Title: Convergence Aspects of Hybrid Kernel SVGD
Authors: Anson MacDonald, Scott A Sisson, Sahani Pathiraja
Abstract: Stein variational gradient descent (SVGD) is a particle-based approximate inference algorithm. Many variants of SVGD have been proposed in recent years, including the hybrid kernel variant (h-SVGD), which has demonstrated promising results on image classification with deep neural network ensembles. By framing h-SVGD as a kernelised Wasserstein gradient flow on a functional that is not the Kullback-Leibler divergence, we demonstrate that h-SVGD does not converge to the target distribution in the mean field limit. Despite this theoretical result, we provide intuition and experimental support for the ability of h-SVGD to improve variance estimation in high dimensions. Unlike other SVGD variants that also alleviate variance collapse, this is achieved at no additional computational cost and without further assumptions on the posterior.
URL: https://openreview.net/forum?id=JZkbMSQDmD
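A sketch of a hybrid-kernel SVGD update (our own minimal implementation, not the paper's code): the driving and repulsive terms use two different RBF kernels, here with bandwidths h and 2h from a median heuristic. The target is a 2-D standard Gaussian, so grad log p(x) = -x; all settings are illustrative.

```python
import numpy as np

def sqdists(X):
    return ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

def rbf(d2, h):
    return np.exp(-d2 / (2 * h ** 2))

def hsvgd_step(X, grad_log_p, step=0.1):
    n = X.shape[0]
    d2 = sqdists(X)
    h = np.sqrt(0.5 * np.median(d2[d2 > 0]) + 1e-12)   # median heuristic
    K1, K2 = rbf(d2, h), rbf(d2, 2 * h)                # driving vs. repulsive kernel
    drive = K1 @ grad_log_p(X) / n
    # sum_j grad_{x_j} k2(x_j, x_i) = sum_j (x_i - x_j) k2(x_j, x_i) / (2h)^2
    repulse = (K2.sum(1)[:, None] * X - K2 @ X) / (2 * h) ** 2 / n
    return X + step * (drive + repulse)

rng = np.random.default_rng(0)
particles = rng.normal(size=(50, 2)) * 3 + 5           # start far from the target
for _ in range(300):
    particles = hsvgd_step(particles, lambda X: -X)
print(particles.mean(axis=0))                           # should be close to (0, 0)
```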
---
Title: AutoAnnotator: A Collaborative Annotation Framework for Large and Small Language Models
Authors: Yao Lu, Ji Zhaiyuan, Jiawei Du, Yu Shanqing, Qi Xuan, Joey Tianyi Zhou
Abstract: Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still faces two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design AutoAnnotator, a fully automatic annotation framework based on this paradigm. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. The code is available at https://github.com/Zhaiyuan-Ji/AutoAnnotator.
URL: https://openreview.net/forum?id=LauojtjA9F
---
Title: Positional Encoder Graph Quantile Neural Networks for Geographic Data
Authors: William E. R. de Amorim, Scott A Sisson, Thais Carvalho Valadares Rodrigues, David J Nott, Guilherme S. Rodrigues
Abstract: Positional Encoder Graph Neural Networks (PE-GNNs) are among the most effective models for learning from continuous spatial data. However, their predictive distributions are often poorly calibrated, limiting their utility in applications that require reliable uncertainty quantification. We propose the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel framework that combines PE-GNNs with Quantile Neural Networks, partially monotonic neural blocks, and post-hoc recalibration techniques. The PE-GQNN enables flexible and robust conditional density estimation with minimal assumptions about the target distribution, and it extends naturally to tasks beyond spatial data. Empirical results on benchmark datasets show that the PE-GQNN outperforms existing methods in both predictive accuracy and uncertainty quantification, without incurring additional computational cost. We also identify important special cases arising from our formulation, including the PE-GNN.
URL: https://openreview.net/forum?id=5PjL8ZOPBt
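A small sketch of the quantile ("pinball") loss that a quantile neural network such as PE-GQNN optimizes per predicted quantile level; the toy predictions and targets below are placeholders.

```python
import torch

def pinball_loss(pred, target, tau):
    # tau in (0, 1): under-prediction is penalized by tau, over-prediction by (1 - tau)
    diff = target - pred
    return torch.maximum(tau * diff, (tau - 1) * diff).mean()

pred = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.5, 1.5, 3.5])
for tau in (0.1, 0.5, 0.9):
    print(tau, pinball_loss(pred, target, tau).item())
```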
---
Title: Policy-Guided Search on Tree-of-Thoughts for Efficient Problem Solving with Bounded Language Model Queries
Authors: Sumedh Pendurkar, Guni Sharon
Abstract: Recent studies have explored integrating state-space search algorithms with Language Models (LMs) to perform look-ahead over the ``Tree-of-Thoughts'' (ToT) that LMs generate during token generation, thereby improving performance on problem-solving tasks. However, the associated search algorithms often overlook the significant computational costs associated with LM inference, particularly in scenarios with constrained computational budgets. Consequently, we address the problem of improving LM performance on problem-solving tasks under limited computational budgets. We demonstrate how the probabilities assigned to thoughts by LMs can serve as a heuristic to guide search within the ToT framework, thereby reducing the number of thought evaluations. Building on this insight, we adapt a heuristic search algorithm, Levin Tree Search (LTS), to the ToT framework, which leverages LMs as policies to guide the tree exploration efficiently. We extend the theoretical results of LTS by showing that, for ToT (a pruned tree), LTS guarantees a bound on the number of states expanded, and consequently, on the number of thoughts generated. Additionally, we analyze the sensitivity of this bound to the temperature values commonly used in the final softmax layer of the LM. Empirical evaluation under a fixed LM query budget demonstrates that LTS consistently achieves comparable or higher accuracy than baseline search algorithms within the ToT framework, across three domains (Blocksworld, PrOntoQA, Array Sorting) and four distinct LMs. These findings highlight the efficacy of LTS on ToT, particularly in enabling cost-effective and time-efficient problem-solving, making it well-suited for latency-critical and resource-constrained applications.
URL: https://openreview.net/forum?id=Rlk1bWe2ii
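A toy sketch of Levin Tree Search over a thought tree (our illustration of the idea, not the paper's implementation): nodes are expanded in increasing order of (depth + 1) / pi(n), where pi(n) is the product of the LM policy's probabilities along the path. The tree, policy probabilities, and goal test are made up.

```python
import heapq

children = {                      # thought -> [(child thought, policy probability)]
    "root": [("a", 0.7), ("b", 0.3)],
    "a": [("a1", 0.6), ("a2", 0.4)],
    "b": [("b1", 0.9), ("b2", 0.1)],
}
goal = "b1"

def levin_tree_search(root):
    frontier = [(1.0, root, 0, 1.0)]          # (cost, node, depth, pi)
    expansions = 0
    while frontier:
        _, node, depth, pi = heapq.heappop(frontier)
        expansions += 1
        if node == goal:
            return node, expansions
        for child, prob in children.get(node, []):
            child_pi = pi * prob
            cost = (depth + 2) / child_pi     # (child depth + 1) / pi(child)
            heapq.heappush(frontier, (cost, child, depth + 1, child_pi))
    return None, expansions

print(levin_tree_search("root"))              # finds "b1" after a handful of expansions
```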
---
Title: Universal Differential Equations for Stable Multi-Step Volatility Time Series Forecasting
Authors: Prasanna Devadiga, Kishan Gurumurthy, Kshitij Mohan
Abstract: Neural differential equations such as Neural ODEs, Neural CDEs, and Universal Differential Equations (UDEs) model temporal evolution as a continuous-time flow rather than a fixed-step recurrence. Even for regularly sampled data, this formulation differs fundamentally from discrete-time architectures: it learns smooth vector fields governing instantaneous rates of change, reducing discretization bias and improving long-horizon stability. We present a systematic study of Universal Differential Equations for financial volatility forecasting, a domain characterized by regime shifts, heavy tails, and jump discontinuities. UDEs extend Neural ODEs by embedding mechanistic structure within learned dynamics, using neural networks to parameterize coefficients in partially known differential equations instead of learning the system purely from data. Our UDE variants incorporate volatility’s empirical regularities while retaining neural flexibility for regime adaptation. Across market regimes, they outperform both continuous-time baselines and discrete-time models, achieving higher accuracy and greater long-horizon stability while remaining interpretable. These results suggest that UDEs grounded in mechanistic structure and neural flexibility offer a principled route to stable, interpretable multi-step forecasting in nonstationary domains.
URL: https://openreview.net/forum?id=uWGNexco2M
---
Title: Occam’s Razor for SSL: Memory-Efficient Parametric Instance Discrimination
Authors: Eric Gan, Patrik Reizinger, Alice Bizeul, Attila Juhos, Mark Ibrahim, Randall Balestriero, David Klindt, Wieland Brendel, Baharan Mirzasoleiman
Abstract: Self-supervised learning (SSL) is the prevalent paradigm for representation learning, often relying on pairwise similarity between multiple augmented views of each example. Numerous learning methods of varying complexity, involving gradient stopping, negative sampling, projectors, or additional regularization terms, have been introduced in recent years. These methods can be effective, but they require careful hyperparameter tuning, have increased computational and memory requirements, and struggle with latent dimensionality collapse. Furthermore, complexities such as gradient stopping make them hard to analyse theoretically and confound the essential components of SSL. We introduce a simple parametric instance discrimination method, called Datum IndEx as its Target (DIET). DIET has a single computational branch, without explicit negative sampling, gradient stopping, or other hyperparameters. We empirically demonstrate that DIET (1) can be implemented in a memory-efficient way; (2) achieves competitive performance with state-of-the-art SSL methods on small-scale datasets; and (3) is robust to hyperparameters such as batch size. We uncover tight connections to Spectral Contrastive Learning in the lazy training regime, leading to practical insights about the role of feature normalization. Compared to SimCLR or VICReg, DIET also has higher-rank embeddings on CIFAR100 and TinyImageNet, suggesting that DIET captures more latent information.
URL: https://openreview.net/forum?id=GFNTbsVFlP
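A minimal sketch of the DIET objective as we read it from the abstract: a classifier head over backbone features is trained to predict each example's own dataset index with plain cross-entropy, with no negatives, projector, or gradient stopping. The backbone, augmentation, and data below are placeholders.

```python
import torch
import torch.nn as nn

n_examples, feat_dim = 256, 128
backbone = nn.Linear(32, feat_dim)            # stand-in for the real encoder
head = nn.Linear(feat_dim, n_examples)        # one class per training example
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

data = torch.randn(n_examples, 32)
for _ in range(10):
    idx = torch.randperm(n_examples)[:64]             # mini-batch of datum indices
    views = data[idx] + 0.1 * torch.randn(64, 32)     # an "augmented view"
    logits = head(backbone(views))
    loss = nn.functional.cross_entropy(logits, idx)   # target = datum index
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```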
---
Title: Quantifying Context Bias in Domain Adaptation for Object Detection
Authors: Hojun Son, Asma A. Almutairi, Arpan Kusari
Abstract: Domain adaptation for object detection (DAOD) has become essential to counter performance degradation caused by distribution shifts between training and deployment domains. However, a critical factor influencing DAOD—context bias resulting from learned foreground-background (FG–BG) association—remains underexplored. In this work, we present the first comprehensive empirical and causal analysis specifically targeting context bias in DAOD. We address three key questions regarding FG–BG association in object detection: (a) whether FG–BG association is encoded during training, (b) whether there is a causal relationship between FG–BG association and detection performance, and (c) whether FG–BG association affects DAOD. To examine how models capture FG–BG association, we analyze class-wise and feature-wise performance degradation using background masking and feature perturbation, measured via change in accuracy (defined as drop rate). To explore the causal role of FG–BG association, we apply do-calculus to FG–BG pairs guided by class activation mapping (CAM). To quantify the causal influence of FG–BG association across domains, we propose a novel metric—Domain Association Gradient—defined as the ratio of drop rate to maximum mean discrepancy (MMD). Through systematic experiments involving background masking, feature-level perturbations, and CAM, we reveal that convolution-based object detection models encode FG–BG association. The association substantially impacts detection performance, particularly under domain shifts where background information significantly diverges. Our results demonstrate that context bias not only exists but also causally undermines the generalization capabilities of object detection models across domains. Furthermore, we validate these findings across multiple models and datasets, including state-of-the-art architectures such as ALDI++. This study highlights the necessity of addressing context bias explicitly in DAOD frameworks, providing insights that pave the way for developing more robust and generalizable object detection systems.
URL: https://openreview.net/forum?id=YRU0A0nraG
---
Title: Exploring exploration with foundation agents in interactive environments
Authors: Daniel P. Sawyer, Nan Rosemary Ke, Hubert Soyer, Martin Engelcke, John Reid, David P Reichert, Drew A. Hudson, Alexander Lerchner, Danilo Jimenez Rezende, Timothy P Lillicrap, Michael Curtis Mozer, Jane X Wang
Abstract: Foundation models excel at single-turn reasoning, but many real-world challenges, from scientific research to technology development, require multi-turn exploration in dynamic interactive environments. Crucial components of learning from experience in these settings, such as efficiently gathering information to test hypotheses, meta-learning a model of the world's dynamics, and adapting to unexpected changes, remain largely unexplored for these models. We first evaluate foundation models in Feature World, a setting that primarily tests information gathering about a static hidden reward function. In this initial setting, we show that state-of-the-art foundation models come close to optimal efficiency in selecting maximally informative actions in tasks with simple reward functions. As a proof of concept, we also show a model can gather information efficiently in a 3D embodied version of this task, though errors in vision limit some aspects of performance. In order to test exploration across multiple dependent turns and trials, we implement a custom, text-based version of the Alchemy environment, a benchmark designed for meta-learning. Here, agents must deduce a latent causal structure by integrating information across multiple state-dependent trials. In this more complex setting, we find that recent foundation models struggle to meta-learn strategies that enable improved performance over time. However, prompting the models to summarize their observations at regular intervals enables an emergent meta-learning process, allowing them to improve across trials. Notably, in some models, summarization also enabled adaptive re-learning of this information when the environment's rules change unexpectedly. While most models performed reasonably well on simple Feature World tasks, evaluations in Alchemy reveal stark differences in robustness among the models, with Gemini 2.5 performing best, followed by Claude 3.7, and ChatGPT-4o and o4-mini struggling the most. These results underscore Alchemy's value as a benchmark for meta-learning and strategy adaptation in foundation models. By moving beyond simple discovery to complex, stateful environments, we demonstrate that the most significant challenge for foundation agents is not selecting informative actions in the moment, but rather seeking and integrating knowledge through adaptive strategies over time. Intriguingly, we find there is likely no intrinsic barrier to future generations of foundation agents more fully mastering these abilities.
URL: https://openreview.net/forum?id=wOrkUTr0W5
---
Title: DA-DPO: Cost-efficient Difficulty-aware Preference Optimization for Reducing MLLM Hallucinations
Authors: Longtian Qiu, Shan Ning, Chuyu Zhang, Jiaxuan Sun, Xuming He
Abstract: Direct Preference Optimization (DPO) has shown strong potential for mitigating hallucinations in Multimodal Large Language Models (MLLMs). However, existing multimodal DPO approaches often suffer from overfitting due to the difficulty imbalance in preference data. Our analysis shows that MLLMs tend to overemphasize easily distinguishable preference pairs, which hinders fine-grained hallucination suppression and degrades overall performance.
To address this issue, we propose Difficulty-Aware Direct Preference Optimization (DA-DPO), a cost-effective framework designed to balance the learning process. DA-DPO consists of two main components: (1) Difficulty Estimation leverages pre-trained vision–language models with complementary generative and contrastive objectives, whose outputs are integrated via a distribution-aware voting strategy to produce robust difficulty scores without additional training; and (2) Difficulty-Aware Training reweights preference pairs based on their estimated difficulty, down-weighting easy samples while emphasizing harder ones to alleviate overfitting.
This framework enables more effective preference optimization by prioritizing challenging examples, without requiring new data or extra fine-tuning stages. Extensive experiments demonstrate that DA-DPO consistently improves multimodal preference optimization, yielding stronger robustness to hallucinations and better generalization across standard benchmarks, while remaining computationally efficient. The project page is available at https://artanic30.github.io/project_pages/DA-DPO/.
URL: https://openreview.net/forum?id=M52CgPcgGx
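A sketch of the reweighting idea (our reading of the abstract, not the released training code): the standard DPO loss per preference pair is scaled by a difficulty weight, down-weighting easy pairs. The log-probabilities, difficulty scores, and normalization below are placeholders.

```python
import torch
import torch.nn.functional as F

def da_dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                difficulty, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -F.logsigmoid(margin)             # standard DPO objective per pair
    weights = difficulty / difficulty.mean()     # assumed normalization of weights
    return (weights * per_pair).mean()

# toy batch of 4 preference pairs
logp_c, logp_r = torch.tensor([-5., -6., -4., -7.]), torch.tensor([-8., -6.5, -9., -7.2])
ref_c, ref_r = torch.tensor([-5.5, -6., -4.5, -7.]), torch.tensor([-7.5, -6.4, -8.5, -7.1])
difficulty = torch.tensor([0.2, 0.9, 0.1, 0.8])  # higher = harder pair
print(da_dpo_loss(logp_c, logp_r, ref_c, ref_r, difficulty).item())
```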
---
Title: HopCast: Calibration of Autoregressive Dynamics Models
Authors: Muhammad Bilal Shahid, Cody Fleming
Abstract: Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. Many of these models are optimized to predict one step ahead; such approaches produce calibrated one-step predictions if the predictive model can quantify uncertainty, such as Deep Ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method to produce calibrated multi-step predictions. This work introduces an alternative Predictor-Corrector approach named HopCast that uses Modern Hopfield Networks (MHN) to learn the errors of a deterministic Predictor that approximates the dynamical system. The Corrector predicts a set of errors for the Predictor's output based on a context state at any timestep during autoregression. The set of errors creates sharper and well-calibrated prediction intervals with higher predictive accuracy compared to baselines without uncertainty propagation. The calibration and prediction performances are evaluated across a set of dynamical systems. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors.
URL: https://openreview.net/forum?id=wsO6nxvGof
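A sketch of the corrector idea as described above (the memory contents and retrieval rule are illustrative assumptions): a Modern Hopfield-style lookup retrieves a weighted combination of stored predictor errors conditioned on the current context state, which is then added to the autoregressive prediction.

```python
import numpy as np

def hopfield_retrieve(query, keys, values, beta=2.0):
    scores = beta * keys @ query                 # similarity to stored context states
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over memory slots
    return weights @ values                      # blended stored errors

rng = np.random.default_rng(0)
keys = rng.normal(size=(100, 8))      # stored context states from training
values = rng.normal(size=(100, 3))    # predictor errors observed in those contexts
context = rng.normal(size=8)          # current context state during autoregression

predicted_next = np.array([1.0, 0.5, -0.2])      # deterministic Predictor output
corrected = predicted_next + hopfield_retrieve(context, keys, values)
print(corrected)
```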
---
Title: Fractal Generative Models
Authors: Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
Abstract: Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models themselves into atomic modules. Our method constructs generative models by recursively invoking atomic generative modules, resulting in architectures with fractal-like, self-similar properties. We call this new class of models fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic modules and examine it on the challenging task of pixel-by-pixel image generation. Our experiments show strong performance in both likelihood estimation and generation quality. We hope this work could serve as a starting point for future research into fractal generative models, establishing a new paradigm in generative modeling.
URL: https://openreview.net/forum?id=Qk9kn6lOlW
---
Title: Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Authors: Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
Abstract: Scaling the input context length of a large language model (LLM) incurs a significant increase in computation cost and memory footprint to maintain the attention key-value (KV) cache.
Existing KV cache compression methods suffer from inefficient compression strategies and limited memory reduction effects, making it difficult for LLMs to conduct long-context inference on consumer-grade devices, especially when inferring long-context stream input.
Such obstacles prevent consumer-grade devices from supporting more complex applications, creating challenges for the democratization of LLMs.
To overcome this, we propose Locret, a framework to create an eviction policy compatible with chunked prefill. By evaluating the causal importance of KV cache units using \textit{retaining heads}, Locret enables precise eviction of cache units, facilitating efficient long-context inference.
In our empirical studies, Locret outperforms recent popular and competitive approaches in terms of memory efficiency and generation quality: it achieves up to a $20\times$ KV cache compression ratio with less than $10\%$ performance loss.
Furthermore, Locret achieves 128K+ long-context inference on a single NVIDIA 4090 GPU without compromising generation quality and only costs $<1$ GPU hour of additional training.
URL: https://openreview.net/forum?id=YPVBCTBqHE
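A toy sketch of budgeted KV-cache eviction with learned importance scores, in the spirit of the framework above (the scoring function, budget, and shapes are placeholders, not Locret's implementation): after each prefill chunk, keep only the top-B cache units ranked by a retaining-head score.

```python
import torch

budget, head_dim = 64, 32
keys = torch.empty(0, head_dim)
values = torch.empty(0, head_dim)
scores = torch.empty(0)

def retaining_head(k):                      # stand-in for the trained scorer
    return k.abs().mean(dim=-1)

for _ in range(8):                          # chunked prefill over 8 chunks
    chunk_k, chunk_v = torch.randn(16, head_dim), torch.randn(16, head_dim)
    keys = torch.cat([keys, chunk_k]); values = torch.cat([values, chunk_v])
    scores = torch.cat([scores, retaining_head(chunk_k)])
    if keys.shape[0] > budget:              # evict the least-important cache units
        keep = scores.topk(budget).indices.sort().values
        keys, values, scores = keys[keep], values[keep], scores[keep]

print(keys.shape)                           # torch.Size([64, 32]) regardless of input length
```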
---
New submissions
===============
Title: A Systematic Assessment of Weak-to-Strong Confidence Prediction in Large Language Models
Abstract: As large language models (LLMs) are being deployed across a wide range of application domains, understanding their capacity through uncertainty quantification (UQ) is crucial for ensuring safe and reliable behavior. Reliable uncertainty estimates that accompany the text generated by an LLM can signal when a response is likely to be incorrect and thus serve as an effective fail-safe mechanism against hallucinations. In this paper, we explore the extent to which the probability of a frontier model answering a query correctly can be predicted by smaller, weaker models with publicly available embeddings using a simple probe. We show that this probability can be predicted effectively, and the probes are easy to train, making oversight of large proprietary models more widely accessible. Leveraging embeddings from models as small as Llama3-8b, our predictor achieves 83.4% AUROC on TriviaQA and 64.3% on MMLU, and improves selective prediction accuracy by up to 17.9%. We then carefully analyze how different factors impact the probe performance.
Across six benchmarks and fifteen weak predictors, we show that the performance does not simply improve with predictor model size, and that the weak-to-strong signal is robust to label imbalance and embedding aggregation choices. These findings support the view that representational compatibility between weak-model embeddings and the strong model’s behavior matters more than model size alone. Overall, our results advance the understanding of weak-to-strong generalization and provide a simple, scalable framework for building more trustworthy LLMs.
URL: https://openreview.net/forum?id=xYSzkg5qPD
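A minimal sketch of the probing setup described above, with synthetic data in place of real weak-model embeddings and frontier-model correctness labels: a logistic-regression probe on weak-model embeddings predicts whether the strong model answers each query correctly, evaluated with AUROC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(2000, 64))                 # weak-model query embeddings (synthetic)
w = rng.normal(size=64)
correct = (emb @ w + rng.normal(size=2000) > 0).astype(int)  # strong-model correctness (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(emb, correct, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```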
---
Title: TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates
Abstract: Tabular data generation considers a large table with multiple columns -- each column comprised of numerical, categorical, or sometimes ordinal values. The goal is to produce new rows for the table that replicate the distribution of rows from the original data -- without just copying those initial rows. The last three years have seen enormous progress on this problem, mostly using computationally expensive methods that employ one-hot encoding, VAEs, and diffusion.
This paper describes a new approach to the problem of tabular data generation. By employing copula transformations and modeling the distribution as a kernel density estimate we can nearly match the accuracy and privacy-preservation achievements of the previous methods, but with almost no training time. Our method is very scalable, and can be run on data sets orders of magnitude larger than prior state-of-the-art on a simple laptop. Moreover, because we employ kernel density estimates, we can store the model as a coreset of the original data -- we believe the first for generative modeling -- and as a result, require significantly less space as well. Our code is available here: http://github.com/tabkde/tabkde-main
URL: https://openreview.net/forum?id=1eH8K8EAvM
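A rough sketch of the copula-plus-KDE pipeline for numeric columns (our reading of the abstract; categorical handling, privacy checks, and the coreset step are omitted): rank-transform each column to normal scores, fit a Gaussian KDE, sample, and map samples back through the empirical quantiles.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
table = np.column_stack([rng.lognormal(size=500), rng.normal(5, 2, size=500)])  # toy table

# forward copula transform: empirical CDF -> standard normal scores
ranks = np.apply_along_axis(stats.rankdata, 0, table) / (len(table) + 1)
z = stats.norm.ppf(ranks)

kde = stats.gaussian_kde(z.T)              # density estimate in the transformed space
z_new = kde.resample(300, seed=1).T        # synthetic rows in normal-score space

# inverse transform: normal scores -> uniforms -> empirical quantiles per column
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack([np.quantile(table[:, j], u_new[:, j])
                             for j in range(table.shape[1])])
print(synthetic[:3])
```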
---
Title: The Gaussian-Multinoulli Restricted Boltzmann Machine: A Potts Model Extension of the GRBM
Abstract: Many real-world tasks, from associative memory to symbolic reasoning, benefit from discrete, structured representations that standard continuous latent models can struggle to express. We introduce the Gaussian–Multinoulli Restricted Boltzmann Machine (GM-RBM), a generative energy-based model that extends the Gaussian–Bernoulli RBM (GB-RBM) by replacing binary hidden units with $q$-state categorical (Potts) units, yielding a richer latent state space for multivalued concepts. We provide a self-contained derivation of the energy, conditional distributions, and learning rules, and detail practical training choices (contrastive divergence with temperature annealing and intra-slot diversity constraints) that avoid state collapse. To separate architectural effects from sheer latent capacity, we evaluate under both capacity-matched and parameter-matched setups, comparing GM-RBM with GB-RBM configured to have the same number of possible latent assignments. On analogical recall and structured memory benchmarks, GM-RBM achieves competitive, and in several regimes, improved recall at equal capacity with comparable training cost, despite using only Gibbs updates. The discrete $q$-ary formulation is also amenable to efficient implementation. These results clarify when categorical hidden units provide a simple, scalable alternative to binary latents for discrete inference within tractable RBMs.
URL: https://openreview.net/forum?id=3QuXwfMcKo
---