Weekly TMLR digest for Oct 26, 2025


TMLR

Oct 26, 2025, 12:00:10 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Expert Certification: Tracking the Median of Gradients with a Stochastic Proximal Point Method

Fabian Schaipp, Guillaume Garrigos, Umut Simsekli, Robert M. Gower

https://openreview.net/forum?id=WMxLEgYGxu

---


Survey Certification, Expert Certification: Getting aligned on representational alignment

Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Christopher J Cueva, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine Hermann, Kerem Oktar, Klaus Greff, Martin N Hebart, Nathan Cloos, Nikolaus Kriegeskorte, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas O'Connell, Thomas Unterthiner, Andrew Kyle Lampinen, Klaus Robert Muller, Mariya Toneva, Thomas L. Griffiths

https://openreview.net/forum?id=Hiq7lUh4Yn

---


J2C Certification: A Case for Library-Level $k$-Means Binning in Histogram Gradient-Boosted Trees

Asher Labovich

https://openreview.net/forum?id=UaTrLLspJa

---


J2C Certification: Generalized Smooth Stochastic Variational Inequalities: Almost Sure Convergence and Convergence Rates

Daniil Vankov, Angelia Nedich, Lalitha Sankar

https://openreview.net/forum?id=EjqSpbUBWU

---


J2C Certification: AB-UPT: Scaling Neural CFD Surrogates for High-Fidelity Automotive Aerodynamics Simulations via Anchored-Branched Universal Physics Transformers

Benedikt Alkin, Maurits Bleeker, Richard Kurle, Tobias Kronlachner, Reinhard Sonnleitner, Matthias Dorfer, Johannes Brandstetter

https://openreview.net/forum?id=nwQ8nitlTZ

---


J2C Certification: Temporal Test-Time Adaptation with State-Space Models

Mona Schirmer, Dan Zhang, Eric Nalisnick

https://openreview.net/forum?id=HFETOmUtrV

---


J2C Certification: Multi-model Online Conformal Prediction with Graph-Structured Feedback

Erfan Hajihashemi, Yanning Shen

https://openreview.net/forum?id=9u8ugbismg

---


J2C Certification: MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning

Yupeng Chen, Senmiao Wang, Yushun Zhang, Zhihang Lin, Haozhe Zhang, Weijian Sun, Tian Ding, Ruoyu Sun

https://openreview.net/forum?id=T1qXIDn9my

---


J2C Certification: RouteFinder: Towards Foundation Models for Vehicle Routing Problems

Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, André Hottung, Niels Wouda, Leon Lan, Junyoung Park, Kevin Tierney, Jinkyoo Park

https://openreview.net/forum?id=QzGLoaOPiY

---


J2C Certification: Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

Or Tal, Felix Kreuk, Yossi Adi

https://openreview.net/forum?id=xXc5DeaBYw

---


J2C Certification: Discrete Audio Tokens: More Than a Survey!

Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

https://openreview.net/forum?id=eqNchtvc6v

---


J2C Certification: Adversarial Robustness of Graph Transformers

Philipp Foth, Lukas Gosch, Simon Geisler, Leo Schwinn, Stephan Günnemann

https://openreview.net/forum?id=4xK0vjxTWL

---


J2C Certification: Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda

https://openreview.net/forum?id=PTIUjARnbc

---


J2C Certification: Chimera: State Space Models Beyond Sequences

Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu

https://openreview.net/forum?id=yv0TUssepk

---


J2C Certification: Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning

Jingcheng Niu, Subhabrata Dutta, Ahmed Elshabrawy, Harish Tayyar Madabushi, Iryna Gurevych

https://openreview.net/forum?id=10QqO1tM1H

---


J2C Certification: Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models

Hung Le, Van Dai Do, Dung Nguyen, Svetha Venkatesh

https://openreview.net/forum?id=tmdwuU2uKs

---


J2C Certification: Model Tensor Planning

An Thai Le, Khai Nguyen, Minh Nhat VU, Joao Carvalho, Jan Peters

https://openreview.net/forum?id=fk1ZZdXCE3

---


J2C Certification: ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models

Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal

https://openreview.net/forum?id=otSHFe8wTf

---


J2C Certification: FraGNNet: A Deep Probabilistic Model for Tandem Mass Spectrum Prediction

Adamo Young, Fei Wang, David Wishart, BO WANG, Russell Greiner, Hannes Rost

https://openreview.net/forum?id=UsqeHx9Mbx

---


J2C Certification: HDCS: Hierarchy Discovery and Critic Shaping for Reinforcement Learning with Automaton Specification

Duo XU, Faramarz Fekri

https://openreview.net/forum?id=BGoRme2MfG

---


J2C Certification: G2D2: Gradient-Guided Discrete Diffusion for Inverse Problem Solving

Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji

https://openreview.net/forum?id=fj23qnVifX

---


J2C Certification: Synthesizing world models for bilevel planning

Zergham Ahmed, Joshua B. Tenenbaum, Chris Bates, Samuel J. Gershman

https://openreview.net/forum?id=m9V4JHLJrD

---


J2C Certification: Slicing the Gaussian Mixture Wasserstein Distance

Moritz Piening, Robert Beinert

https://openreview.net/forum?id=yPBtJ4JPwi

---


J2C Certification: Rollout Total Correlation for Deep Reinforcement Learning

Bang You, Huaping Liu, Jan Peters, Oleg Arenz

https://openreview.net/forum?id=qTdRJAL8Li

---


J2C Certification: Training Dynamics of the Cooldown Stage in Warmup-Stable-Decay Learning Rate Scheduler

Aleksandr Dremov, Alexander Hägele, Atli Kosson, Martin Jaggi

https://openreview.net/forum?id=ZnSYEcZod3

---


J2C Certification: Stochastic Block Model-Aware Topological Neural Networks for Graph Link Prediction

Yuzhou Chen, Xiao Guo, Shujie Ma

https://openreview.net/forum?id=FBjVSPAsgs

---


J2C Certification: Low-rank Momentum Factorization for Memory Efficient Training

Pouria Mahdavinia, Mehrdad Mahdavi

https://openreview.net/forum?id=W3D3TVo9a3

---


J2C Certification: HalluEntity: Benchmarking and Understanding Entity-Level Hallucination Detection

Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li

https://openreview.net/forum?id=494k7e9R5D

---


J2C Certification: Online Selective Conformal Inference: Errors and Solutions

Yusuf Sale, Aaditya Ramdas

https://openreview.net/forum?id=PjIQwFyP07

---


J2C Certification: LC-PLM: Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers

Yingheng Wang, Zichen Wang, Gil Sadeh, Luca Zancato, Alessandro Achille, George Karypis, Huzefa Rangwala

https://openreview.net/forum?id=dWvztQzfy4

---


J2C Certification: LumiNet: Perception-Driven Knowledge Distillation via Statistical Logit Calibration

Md. Ismail Hossain, M M Lutfe Elahi, Sameera Ramasinghe, Ali Cheraghian, Fuad Rahman, Nabeel Mohammed, Shafin Rahman

https://openreview.net/forum?id=3rU1lp9w2l

---


J2C Certification: Are Domain Generalization Benchmarks with Accuracy on the Line Misspecified?

Olawale Elijah Salaudeen, Nicole Chiou, Shiny Weng, Sanmi Koyejo

https://openreview.net/forum?id=fNywRyqPQo

---


J2C Certification: Exploring the Limitations of Layer Synchronization in Spiking Neural Networks

Roel Koopman, Amirreza Yousefzadeh, Mahyar Shahsavari, Guangzhi Tang, Manolis Sifalakis

https://openreview.net/forum?id=mfmAVwtMIk

---


J2C Certification: Class-wise Generalization Error: an Information-Theoretic analysis

Firas Laakom, Moncef Gabbouj, Jürgen Schmidhuber, Yuheng Bu

https://openreview.net/forum?id=asW4VcDFpi

---


J2C Certification: Physics-Aware Spatiotemporal Causal Graph Network for Forecasting with Limited Data

Zijun Cui, Sam Griesemer, Sungyong Seo, Joshua Hikida, Yan Liu

https://openreview.net/forum?id=n3yrVzPcNa

---


J2C Certification: Doubly Robust Uncertainty Quantification for Quantile Treatment Effects in Sequential Decision Making

Yang Xu, Chengchun Shi, Shikai Luo, Lan Wang, Rui Song

https://openreview.net/forum?id=F0BwbieVws

---


J2C Certification: Rational Tuning of LLM Cascades via Probabilistic Modeling

Michael J. Zellinger, Matt Thomson

https://openreview.net/forum?id=YCBVcGSZeR

---


J2C Certification: UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting

Juncheng Liu, Chenghao Liu, Gerald Woo, Yiwei Wang, Bryan Hooi, Caiming Xiong, Doyen Sahoo

https://openreview.net/forum?id=p3y5q4cvzV

---


J2C Certification: Investigating Continual Pretraining in Large Language Models: Insights and Implications

Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

https://openreview.net/forum?id=aKjJoEVKgO

---


J2C Certification: Learning Energy-Based Generative Models via Potential Flow: A Variational Principle Approach to Probability Density Homotopy Matching

Junn Yong Loo, Leong Fang Yu, Michelle Adeline, Julia K. Lau, Hwa Hui Tew, Arghya Pal, VISHNU MONN BASKARAN, Chee-Ming Ting, Raphael CW Phan

https://openreview.net/forum?id=vc7poEYOFK

---


J2C Certification: Reinforcement Learning from Bagged Reward

Yuting Tang, Xin-Qiang Cai, Yao-Xiang Ding, Qiyu Wu, Guoqing Liu, Masashi Sugiyama

https://openreview.net/forum?id=bXUipBbZDA

---


J2C Certification: Tighter sparse variational Gaussian processes

Thang D Bui, Matthew Ashman, Richard E. Turner

https://openreview.net/forum?id=L33DSu3zvq

---


J2C Certification: Generalized Compressed Sensing for Image Reconstruction with Diffusion Probabilistic Models

Ling-Qi Zhang, Zahra Kadkhodaie, Eero P Simoncelli, David H. Brainard

https://openreview.net/forum?id=lmHh4FmPWZ

---


J2C Certification: Information Theoretic Guarantees For Policy Alignment In Large Language Models

Youssef Mroueh, Apoorva Nitsure

https://openreview.net/forum?id=Uz9J77Riul

---


J2C Certification: Selective Concept Bottleneck Models Without Predefined Concepts

Simon Schrodi, Julian Schur, Max Argus, Thomas Brox

https://openreview.net/forum?id=PMO30TLI4l

---


J2C Certification: SE3Set: Harnessing Equivariant Hypergraph Neural Networks for Molecular Representation Learning

Hongfei Wu, Lijun Wu, Guoqing Liu, Zhirong Liu, Bin Shao, Zun Wang

https://openreview.net/forum?id=muWEt1TOyo

---


J2C Certification: Return-Aligned Decision Transformer

Tsunehiko Tanaka, Kenshi Abe, Kaito Ariu, Tetsuro Morimura, Edgar Simo-Serra

https://openreview.net/forum?id=lTt2cTW8h1

---


J2C Certification: Recall and Refine: A Simple but Effective Source-free Open-set Domain Adaptation Framework

Ismail Nejjar, Hao Dong, Olga Fink

https://openreview.net/forum?id=HBZoXjUAqV

---


J2C Certification: Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin

https://openreview.net/forum?id=Nu6N69i8SB

---


J2C Certification: Setting the Record Straight on Transformer Oversmoothing

Gbetondji Jean-Sebastien Dovonon, Michael M. Bronstein, Matt Kusner

https://openreview.net/forum?id=HHI6qWLFF1

---


J2C Certification: Calibrated Probabilistic Forecasts for Arbitrary Sequences

Charles Marx, Volodymyr Kuleshov, Stefano Ermon

https://openreview.net/forum?id=nuIUTHGlM5

---


J2C Certification: Optimization Guarantees for Square-Root Natural-Gradient Variational Inference

Navish Kumar, Thomas Möllenhoff, Mohammad Emtiyaz Khan, Aurelien Lucchi

https://openreview.net/forum?id=OMOFmb6ve7

---


J2C Certification: Path-Specific Counterfactual Fairness via Dividend Correction

Daisuke Hatano, Satoshi Hara, Hiromi Arai

https://openreview.net/forum?id=RXoSmiyObR

---


J2C Certification: FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness

Vincent Abbott, Gioele Zardini

https://openreview.net/forum?id=pF2ukh7HxA

---


J2C Certification: On Training-Conditional Conformal Prediction and Binomial Proportion Confidence Intervals

Rudi Coppola, Manuel Mazo Espinosa

https://openreview.net/forum?id=pSk5qyt1ob

---


J2C Certification: Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations

Akshay Kumar, Jarvis Haupt

https://openreview.net/forum?id=VNM6V1gi3k

---


J2C Certification: Distributed Quasi-Newton Method for Fair and Fast Federated Learning

Shayan Mohajer Hamidi, Linfeng Ye

https://openreview.net/forum?id=KbteA50cni

---


J2C Certification: Multi-Bellman operator for convergence of $Q$-learning with linear function approximation

Diogo S. Carvalho, Pedro A. Santos, Francisco S. Melo

https://openreview.net/forum?id=D2PjEPGXgh

---


J2C Certification: Dynamics-inspired Structure Hallucination for Protein-protein Interaction Modeling

Fang Wu, Stan Z. Li

https://openreview.net/forum?id=GGHk5ukO6t

---


J2C Certification: Celo: Training Versatile Learned Optimizers on a Compute Diet

Abhinav Moudgil, Boris Knyazev, Guillaume Lajoie, Eugene Belilovsky

https://openreview.net/forum?id=SLqJbt4emY

---


J2C Certification: Directed Exploration in Reinforcement Learning from Linear Temporal Logic

Marco Bagatella, Andreas Krause, Georg Martius

https://openreview.net/forum?id=cjK5ZvP4zZ

---


J2C Certification: Adam-family Methods with Decoupled Weight Decay in Deep Learning

Kuangyu Ding, Nachuan Xiao, Kim-chuan Toh

https://openreview.net/forum?id=xVEHiAZ7uR

---


J2C Certification: CNN Interpretability with Multivector Tucker Saliency Maps for Self-Supervised Models

Aymene Mohammed Bouayed, Samuel Deslauriers-gauthier, Adrian IACOVELLI, David Naccache

https://openreview.net/forum?id=VM8bNd5A09

---


J2C Certification: On the stability of gradient descent with second order dynamics for time-varying cost functions

Travis E Gibson, Sawal Acharya, Anjali Parashar, Joseph Emilio Gaudio, Anuradha Annaswamy

https://openreview.net/forum?id=HlzjI2fn2T

---


J2C Certification: Dimension reduction via score ratio matching

Ricardo Baptista, Michael Brennan, Youssef Marzouk

https://openreview.net/forum?id=mvbZBaqSXo

---


J2C Certification: Leveraging a Simulator for Learning Causal Representations from Post-Treatment Covariates for CATE

Lokesh Nagalapatti, Pranava Singhal, Avishek Ghosh, Sunita Sarawagi

https://openreview.net/forum?id=vmmgFW3ztz

---


J2C Certification: Iterated $Q$-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning

Théo Vincent, Daniel Palenicek, Boris Belousov, Jan Peters, Carlo D'Eramo

https://openreview.net/forum?id=Lt2H8Bd8jF

---


J2C Certification: Accelerating Non-Conjugate Gaussian Processes By Trading Off Computation For Uncertainty

Lukas Tatzel, Jonathan Wenger, Frank Schneider, Philipp Hennig

https://openreview.net/forum?id=UdcF3JbSKb

---


J2C Certification: Faster Diffusion Through Temporal Attention Decomposition

Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber

https://openreview.net/forum?id=xXs2GKXPnH

---


J2C Certification: LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

https://openreview.net/forum?id=zKv8qULV6n

---


J2C Certification: Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, Han Zhao

https://openreview.net/forum?id=9CWU8Oi86d

---


J2C Certification: BM$^2$: Coupled Schrödinger Bridge Matching

Stefano Peluchetti

https://openreview.net/forum?id=fqkq1MgONB

---


J2C Certification: Wasserstein Coreset via Sinkhorn Loss

Haoyun Yin, Yixuan Qiu, Xiao Wang

https://openreview.net/forum?id=DrMCDS88IL

---


J2C Certification: No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Manu Gaur, Darshan Singh S, Makarand Tapaswi

https://openreview.net/forum?id=gqh0yzPYdo

---


Accepted papers
===============


Title: Initialization Matters: Unraveling the Impact of Pre-Training on Federated Learning

Authors: Divyansh Jhunjhunwala, Pranay Sharma, Zheng Xu, Gauri Joshi

Abstract: Initializing with pre-trained models when learning on downstream tasks is becoming standard practice in machine learning. Several recent works explore the benefits of pre-trained initialization in a federated learning (FL) setting, where the downstream training is performed at the edge clients with heterogeneous data distribution. These works show that starting from a pre-trained model can substantially reduce the adverse impact of data heterogeneity on the test performance of a model trained in a federated setting, with no changes to the standard FedAvg training algorithm. In this work, we provide a deeper theoretical understanding of this phenomenon. To do so, we study the class of two-layer convolutional neural networks (CNNs) and provide bounds on the training error convergence and test error of such a network trained with FedAvg. We introduce the notion of aligned and misaligned filters at initialization and show that the data heterogeneity only affects learning on misaligned filters. Starting with a pre-trained model typically results in fewer misaligned filters at initialization, thus producing a lower test error even when the model is trained in a federated setting with data heterogeneity. Experiments in synthetic settings and practical FL training on CNNs verify our theoretical findings.
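
A minimal sketch of the FedAvg loop under study may help readers place the result; the model, client loaders, and the unweighted aggregation below are generic placeholders rather than the paper's experimental setup, and the pre-trained versus random start is simply whatever state global_model is initialized with.

    import copy
    import torch
    import torch.nn.functional as F

    def fedavg(global_model, client_loaders, rounds=10, local_epochs=1, lr=0.01):
        # global_model may start from random init or a pre-trained checkpoint;
        # the paper's point is that the pre-trained start is far more robust to
        # heterogeneous client data.
        for _ in range(rounds):
            client_states = []
            for loader in client_loaders:
                local = copy.deepcopy(global_model)
                opt = torch.optim.SGD(local.parameters(), lr=lr)
                for _ in range(local_epochs):
                    for x, y in loader:
                        opt.zero_grad()
                        F.cross_entropy(local(x), y).backward()
                        opt.step()
                client_states.append(local.state_dict())
            s0 = client_states[0]
            # unweighted parameter average (FedAvg usually weights by client size)
            avg = {k: torch.stack([s[k].float() for s in client_states]).mean(0).to(s0[k].dtype)
                   for k in s0}
            global_model.load_state_dict(avg)
        return global_model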

URL: https://openreview.net/forum?id=wW4Cvhkxcx

---

Title: LanPaint: Training-Free Diffusion Inpainting with Asymptotically Exact and Fast Conditional Sampling

Authors: Candi Zheng, Yuan Lan, Yang Wang

Abstract: Diffusion models excel at joint pixel sampling for image generation but lack efficient training-free methods for partial conditional sampling (e.g., inpainting with known pixels). Prior works typically formulate this as an intractable inverse problem, relying on coarse variational approximations, heuristic losses requiring expensive backpropagation, or slow stochastic sampling. These limitations preclude (1) accurate distributional matching in inpainting results, (2) efficient inference modes without gradients, and (3) compatibility with fast ODE-based samplers. To address these limitations, we propose LanPaint: a training-free, asymptotically exact partial conditional sampling method for ODE-based and rectified-flow diffusion models. By leveraging carefully designed Langevin dynamics, LanPaint enables fast, backpropagation-free Monte Carlo sampling. Experiments demonstrate that our approach achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks. Code is available on https://github.com/scraed/LanPaint.
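
For readers unfamiliar with the sampling machinery involved, a bare-bones unadjusted Langevin sampler on a known score function looks roughly as follows; LanPaint's actual scheme (Langevin dynamics designed for partial conditioning inside ODE-based and rectified-flow samplers) is considerably more involved, and score_fn, the step size, and the toy Gaussian target below are placeholders.

    import torch

    def langevin_sample(score_fn, x0, n_steps=500, step_size=1e-3):
        # Unadjusted Langevin dynamics: x <- x + eta * score(x) + sqrt(2 * eta) * noise
        x = x0.clone()
        for _ in range(n_steps):
            x = x + step_size * score_fn(x) + (2 * step_size) ** 0.5 * torch.randn_like(x)
        return x

    # Toy usage: sample from a standard Gaussian, whose score is -x.
    samples = langevin_sample(lambda x: -x, torch.randn(1024, 2))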

URL: https://openreview.net/forum?id=JPC8JyOUSW

---

Title: Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Authors: Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko

Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles\footnote[2]{For a preliminary study with another pretext task, image rotation, please refer to~\Cref{app:rotation}.} as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: \textit{Firstly,} we find that MLLMs, initially performing close to random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. \textit{Secondly,} training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. \textit{Thirdly,} MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. \textit{Fourthly,} we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. \textit{Finally,} our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of the jigsaw to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: \href{https://github.com/zifuwanggg/Jigsaw-R1}{https://github.com/zifuwanggg/Jigsaw-R1}.

URL: https://openreview.net/forum?id=XqQCsuyPve

---

Title: Mixture of Experts for Image Classification: What's the Sweet Spot?

Authors: Mathurin VIDEAU, Alessandro Leite, Marc Schoenauer, Olivier Teytaud

Abstract: Mixture-of-Experts (MoE) models have shown promising potential for parameter-efficient scaling across domains. However, their application to image classification remains limited, often requiring billion-scale datasets to be competitive. In this work, we explore the integration of MoE layers into image classification architectures using open datasets. We conduct a systematic analysis across different MoE configurations and model scales. We find that moderate parameter activation per sample provides the best trade-off between performance and efficiency. However, as the number of activated parameters increases, the benefits of MoE diminish. Our analysis yields several practical insights for vision MoE design. First, MoE layers most effectively strengthen tiny and mid-sized models, while gains taper off for large-capacity networks and do not redefine state-of-the-art ImageNet performance. Second, a Last-2 placement heuristic offers the most robust cross-architecture choice, with Every-2 slightly better for Vision Transformer (ViT), and both remaining effective as data and model scale increase. Third, larger datasets (e.g., ImageNet-21k) allow more experts, up to 16, for ConvNeXt to be utilized effectively without changing placement, as increased data reduces overfitting and promotes broader expert specialization. Finally, a simple linear router performs best, suggesting that additional routing complexity yields no consistent benefit.
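
As a reference for the design space explored here, a sparsely gated MoE layer with the simple linear router favored by the paper can be sketched as follows; the hidden sizes, expert count, and top-k value are arbitrary choices for illustration, not the benchmarked configurations.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MoELayer(nn.Module):
        def __init__(self, dim=256, hidden=512, n_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, n_experts)    # simple linear router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
                for _ in range(n_experts))
            self.top_k = top_k

        def forward(self, x):                          # x: (tokens, dim)
            gate_logits = self.router(x)
            weights, idx = gate_logits.topk(self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, slot] == e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * expert(x[mask])
            return out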

URL: https://openreview.net/forum?id=hKise4AJgp

---

Title: ViTime: Foundation Model for Time Series Forecasting Powered by Vision Intelligence

Authors: Luoxiao Yang, Yun Wang, Xinqi Fan, Israel Cohen, jingdong chen, Zijun Zhang

Abstract: Time series forecasting (TSF) possesses great practical value in various fields, including power and energy, transportation, etc. TSF methods have been studied based on knowledge from classical statistics to modern deep learning. Yet, all of them were developed based on one fundamental concept, numerical data fitting. Thus, the models developed have been long known for being problem-specific and lacking application generalizability. Practitioners expect a TSF foundation model that serves TSF tasks in different applications. The central question is then how to develop such a TSF foundation model. This paper offers one pioneering study in the TSF foundation model development method and proposes a vision intelligence-powered framework, ViTime, for the first time. ViTime fundamentally shifts TSF from numerical fitting to operations based on a binary image-based time series metric space and naturally supports both point and probabilistic forecasting. We also provide rigorous theoretical analyses of ViTime, including quantization-induced system error bounds and principled strategies for optimal parameter selection. Furthermore, we propose RealTS, an innovative synthesis algorithm generating diverse and realistic training samples, effectively enriching the training data and significantly enhancing model generalizability. Extensive experiments demonstrate ViTime's SOTA performance. In zero-shot scenarios, ViTime outperforms TimesFM by 9-15%. With just 10% fine-tuning data, ViTime surpasses both leading foundation models and fully-supervised benchmarks, a gap that widens with 100% fine-tuning. ViTime also exhibits exceptional robustness, effectively handling missing data and outperforming TimesFM by 20-30% under various data perturbations, validating the power of its visual space data operation paradigm.

URL: https://openreview.net/forum?id=XInsJDBIkp

---

Title: Boosting Revisited: Benchmarking and Advancing LP-Based Ensemble Methods

Authors: Fabian Akkerman, Julien Ferry, Christian Artigues, Emmanuel Hebrard, Thibaut Vidal

Abstract: Despite their theoretical appeal, totally corrective boosting methods based on linear programming have received limited empirical attention. In this paper, we conduct the first large-scale experimental study of six LP-based boosting formulations, including two novel methods, NM-Boost and QRLP-Boost, across 20 diverse datasets. We evaluate the use of both heuristic and optimal base learners within these formulations, and analyze not only accuracy, but also ensemble sparsity, margin distribution, anytime performance, and hyperparameter sensitivity. We show that totally corrective methods can outperform or match state-of-the-art heuristics like XGBoost and LightGBM when using shallow trees, while producing significantly sparser ensembles. We further show that these methods can thin pre-trained ensembles without sacrificing performance, and we highlight both the strengths and limitations of using optimal decision trees in this context.

URL: https://openreview.net/forum?id=lscC4PZUE4

---

Title: Improved seeding strategies for k-means and k-GMM

Authors: Guillaume Carrière, Frederic Cazals

Abstract: We revisit the randomized seeding techniques for k-means clustering and k-GMM (Gaussian Mixture model fitting with Expectation-Maximization), formalizing their three key ingredients: the metric used for seed sampling, the number of candidate seeds, and the metric used for seed selection. This analysis yields novel families of initialization methods exploiting a lookahead principle, which conditions the seed selection on an enhanced coherence with the final metric used to assess the algorithm, and a multipass strategy to tame down the effect of randomization. Experiments show a significant improvement over classical contenders. In particular, for k-means, our methods improve on the recently designed multi-swap strategy (similar results in terms of SSE, with seeding roughly 6x faster), which was the first one to outperform the greedy k-means++ seeding. Our experimental analysis also sheds light on subtle properties of k-means often overlooked, including the (lack of) correlations between the SSE upon seeding and the final SSE, the variance reduction phenomena observed in iterative seeding methods, and the sensitivity of the final SSE to the pool size for greedy methods. Practically, our most effective seeding methods are strong candidates to become one of the, if not the, standard techniques. From a theoretical perspective, our formalization of seeding opens the door to a new line of analytical approaches.
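
For orientation, the greedy k-means++ seeding mentioned as the reference point above can be sketched as follows; this is the classical D^2-weighted candidate-sampling baseline, not the lookahead or multipass variants proposed in the paper, and the candidate-pool size is an arbitrary default.

    import numpy as np

    def greedy_kmeans_pp(X, k, n_candidates=None, rng=None):
        # Greedy k-means++: draw several candidate seeds by D^2 weighting and keep
        # the one that most reduces the potential (sum of squared distances).
        rng = rng or np.random.default_rng(0)
        n = len(X)
        n_candidates = n_candidates or 2 + int(np.log(k))
        centers = [X[rng.integers(n)]]
        d2 = ((X - centers[0]) ** 2).sum(axis=1)
        for _ in range(1, k):
            cand = rng.choice(n, size=n_candidates, p=d2 / d2.sum())
            best_i, best_d2 = None, None
            for i in cand:
                new_d2 = np.minimum(d2, ((X - X[i]) ** 2).sum(axis=1))
                if best_d2 is None or new_d2.sum() < best_d2.sum():
                    best_i, best_d2 = i, new_d2
            centers.append(X[best_i])
            d2 = best_d2
        return np.array(centers)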

URL: https://openreview.net/forum?id=4Ut2YnekhN

---

Title: Diversity Augmentation of Dynamic User Preference Data for Boosting Personalized Text Summarizers

Authors: Parthiv Chatterjee, Shivam R Sonawane, Amey Hengle, Aditya Tanna, Sourish Dasgupta, Tanmoy Chakraborty

Abstract: Document summarization facilitates efficient identification and assimilation of user-relevant content, a process inherently influenced by individual subjectivity. Discerning $\textit{subjective}$ salient information within a document, particularly when it has multiple facets, poses significant challenges. This complexity underscores the necessity for $\textit{personalized summarization}$. However, training models for personalized summarization has so far been challenging, particularly because diverse training data containing both user preference history (i.e., $\textit{click-skip}$ trajectory) and expected (gold-reference) summaries are scarce. The MS/CAS PENS dataset is a rare resource in this direction. However, the training data only contains preference history $\textit{without any target summaries}$, thereby blocking end-to-end supervised learning. Also, the diversity in terms of topic transitions along the trajectory is relatively low, thereby leaving scope for better generalization. To address this, we propose PerAugy, a novel $\textit{cross-trajectory shuffling}$ and $\textit{summary-content perturbation}$-based data augmentation technique that significantly boosts the accuracy of four state-of-the-art (SOTA) baseline user-encoders commonly used in personalized summarization frameworks (\text{best result}: $\text{0.132}$$\uparrow$ w.r.t AUC). We select two such SOTA summarizer frameworks as baselines and observe that when augmented with their corresponding improved user-encoders, they consistently show an increase in personalization ($\text{avg. boost}$: ${61.2\%}\uparrow$ w.r.t. PSE-SU4 metric). As a post-hoc analysis of the role of induced diversity in the augmented dataset by PerAugy, we introduce three dataset diversity metrics -- $\mathrm{TP}$, $\mathrm{RTC}$, and DegreeD to quantify the induced diversity. We find that $\mathrm{TP}$ and DegreeD have a strong correlation with the user-encoder performance when trained on the PerAugy-generated dataset across all accuracy metrics, indicating that the increase in dataset diversity plays a major role in performance gain.

URL: https://openreview.net/forum?id=JVx7Qi8tz3

---

Title: Adversarial Surrogate Risk Bounds for Binary Classification

Authors: Natalie Frank

Abstract: A central concern in classification is the vulnerability of machine learning models to adversarial attacks. Adversarial training is one of the most popular techniques for training robust classifiers, which involves minimizing an adversarial surrogate risk. Recent work has characterized the conditions under which any sequence minimizing the adversarial surrogate risk also minimizes the adversarial classification risk in the binary setting, a property known as \emph{adversarial consistency}. However, these results do not address the rate at which the adversarial classification risk approaches its optimal value along such a sequence. This paper provides surrogate risk bounds that quantify that convergence rate.

URL: https://openreview.net/forum?id=Bay1cHLk7h

---

Title: Risk‑Seeking Reinforcement Learning via Multi‑Timescale EVaR Optimization

Authors: Deep Kumar Ganguly, Ajin George Joseph, Sarthak Girotra, Sirish Sekhar

Abstract: Tail-aware objectives shape agents' behavior when navigating uncertainty and can depart from risk-neutral scenarios. Risk measures such as Value at Risk (VaR) and Conditional Value at Risk (CVaR) have shown promising results in reinforcement learning. In this paper, we study the incorporation of a relatively new coherent risk measure, Entropic Value at Risk (EVaR), as a high-return, risk-seeking objective that the agent seeks to maximize. We propose a multi-timescale stochastic approximation algorithm to seek the optimal parameterized EVaR policy. Our algorithm enables effective exploration of high-return tails and robust gradient approximation to optimize the EVaR objective. We analyze the asymptotic behavior of our proposed algorithm and rigorously evaluate it across various discrete and continuous benchmark environments. The results highlight that the EVaR policy achieves higher cumulative returns and corroborate that EVaR is indeed a competitive risk-seeking objective for RL.

URL: https://openreview.net/forum?id=4nbEgNDsii

---

Title: Revisiting Contrastive Divergence for Density Estimation and Sample Generation

Authors: Azwar Abdulsalam, Joseph G. Makin

Abstract: Energy-based models (EBMs) have recently attracted renewed attention as models for complex distributions of data, like natural images. Improved image generation under the maximum-likelihood (MLE) objective has been achieved by combining very complex energy functions, in the form of deep neural networks, with Langevin dynamics for sampling from the model. However, Nijkamp and colleagues have recently shown that such EBMs become good generators without becoming good density estimators: an impractical number of Langevin steps is typically required to exit the burn-in of the Markov chain, so the training merely sculpts the energy landscape near the distribution used to initialize the chain. Careful hyperparameter choices and the use of persistent chains can significantly shorten the required number of Langevin steps, but at the price that new samples can be generated only in the vicinity of the persistent chain and not from noise. Here we introduce a simple method to achieve both convergence of the Markov chain in a practical number of Langevin steps (L = 500) and the ability to generate diverse, high-quality samples from noise. Under the hypothesis that Hinton’s classic contrastive-divergence (CD) training does yield good density estimators, but simply lacks a mechanism for connecting the noise manifold to the learned data manifold, we combine CD with an MLE-like loss. We demonstrate that a simple ConvNet can be trained with this method to be good at generation as well as density estimation for CIFAR-10, Oxford Flowers, and a synthetic dataset in which the learned density can be verified visually.
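
As background, plain contrastive-divergence training with Langevin negatives, the starting point that the paper augments with an MLE-like loss, looks roughly like the sketch below; the energy network, step sizes, and noise scale are placeholders, and the paper's additional loss term is not shown.

    import torch

    def cd_loss(energy, x_data, k=500, step=1e-2, noise_scale=5e-3):
        # Contrastive divergence with k Langevin steps started from (noised) data.
        x_neg = (x_data + 0.01 * torch.randn_like(x_data)).detach()
        for _ in range(k):
            x_neg.requires_grad_(True)
            grad, = torch.autograd.grad(energy(x_neg).sum(), x_neg)
            x_neg = (x_neg - step * grad + noise_scale * torch.randn_like(x_neg)).detach()
        # push energy down on data and up on the model's own samples
        return energy(x_data).mean() - energy(x_neg).mean()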

URL: https://openreview.net/forum?id=i5K4SZeqtq

---

Title: Tracking the Median of Gradients with a Stochastic Proximal Point Method

Authors: Fabian Schaipp, Guillaume Garrigos, Umut Simsekli, Robert M. Gower

Abstract: There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. Examples include distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, and heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median.
We first derive iterative methods based on the stochastic proximal point method for computing the median gradient and generalizations thereof. Then we propose an algorithm estimating the median gradient across *iterations*, and find that several well known methods are particular cases of this framework.
For instance, we observe that different forms of clipping allow to compute online estimators of the *median* of gradients, in contrast to (heavy-ball) momentum, which corresponds to an online estimator of the *mean*.
Finally, we provide a theoretical framework for an algorithm computing the median gradient across *samples*, and show that the resulting method can converge even under heavy-tailed, state-dependent noise.
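
The contrast drawn here between momentum (an online mean) and clipped updates (an online median) can be illustrated with a toy stream of heavy-tailed scalar "gradients"; the step sizes and clipping threshold below are arbitrary and the paper's estimators are more general, so treat this only as an illustration of the two tracking rules.

    import numpy as np

    rng = np.random.default_rng(0)
    # heavy-tailed "gradient" stream: mostly near 1, with occasional huge outliers
    grads = np.where(rng.random(20000) < 0.99,
                     rng.normal(1.0, 0.1, 20000),
                     rng.standard_cauchy(20000) * 100)

    alpha, tau = 0.01, 0.05
    mean_est, median_est = 0.0, 0.0
    for g in grads:
        mean_est += alpha * (g - mean_est)                         # momentum-style: tracks the mean
        median_est += alpha * np.clip(g - median_est, -tau, tau)   # clipped step: tracks the median
    print(mean_est)     # typically far from 1, dragged around by the outliers
    print(median_est)   # close to the median of the stream, i.e. roughly 1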

URL: https://openreview.net/forum?id=WMxLEgYGxu

---

Title: What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Authors: Dongheng Lin, Han Hu, Jianbo Jiao

Abstract: Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: *what time tells us?* To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning.
We find that the proposed TICL 1) not only achieves state-of-the-art performance on the timestamp estimation task over various benchmark metrics, 2) but also, interestingly, despite only seeing static images, yields time-aware embeddings with strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context. Project page: https://rathgrith.github.io/timetells_release/

URL: https://openreview.net/forum?id=f1MYOG4iDG

---

Title: Solving the Cold Start Problem on One's Own as an End User via Preference Transfer

Authors: Ryoma Sato

Abstract: We propose a new approach that enables end users to directly solve the cold start problem by themselves. The cold start problem is a common issue in recommender systems, and many methods have been proposed to address the problem on the service provider's side. However, when the service provider does not take action, users are left with poor recommendations and no means to improve their experience. We propose an algorithm, Pretender, that allows end users to proactively solve the cold start problem on their own. Pretender does not require any special support from the service provider and can be deployed independently by users. We formulate the problem as minimizing the distance between the source and target distributions and optimize item selection from the target service accordingly. Furthermore, we establish theoretical guarantees for Pretender based on a discrete quadrature problem. We conduct experiments on real-world datasets to demonstrate the effectiveness of Pretender.

URL: https://openreview.net/forum?id=Sgj0ZdoVWH

---

Title: Getting aligned on representational alignment

Authors: Ilia Sucholutsky, Lukas Muttenthaler, Adrian Weller, Andi Peng, Andreea Bobu, Been Kim, Bradley C. Love, Christopher J Cueva, Erin Grant, Iris Groen, Jascha Achterberg, Joshua B. Tenenbaum, Katherine M. Collins, Katherine Hermann, Kerem Oktar, Klaus Greff, Martin N Hebart, Nathan Cloos, Nikolaus Kriegeskorte, Nori Jacoby, Qiuyi Zhang, Raja Marjieh, Robert Geirhos, Sherol Chen, Simon Kornblith, Sunayana Rane, Talia Konkle, Thomas O'Connell, Thomas Unterthiner, Andrew Kyle Lampinen, Klaus Robert Muller, Mariya Toneva, Thomas L. Griffiths

Abstract: Biological and artificial information processing systems form representations of the world that they can use to categorize, reason, plan, navigate, and make decisions. How can we measure the similarity between the representations formed by these diverse systems? Do similarities in representations then translate into similar behavior? If so, then how can a system's representations be modified to better match those of another system? These questions pertaining to the study of \emph{representational alignment} are at the heart of some of the most promising research areas in contemporary cognitive science, neuroscience, and machine learning. In this Perspective, we survey the exciting recent developments in representational alignment research in the fields of cognitive science, neuroscience, and machine learning. Despite their overlapping interests, there is limited knowledge transfer between these fields, so work in one field ends up duplicated in another, and useful innovations are not shared effectively. To improve communication, we propose a unifying framework that can serve as a common language for research on representational alignment, and map several streams of existing work across fields within our framework. We also lay out open problems in representational alignment where progress can benefit all three of these fields. We hope that this paper will catalyze cross-disciplinary collaboration and accelerate progress for all communities studying and developing information processing systems.

URL: https://openreview.net/forum?id=Hiq7lUh4Yn

---

Title: A Hierarchical Nearest Neighbour Approach to Contextual Bandits

Authors: Stephen Pasteris, Madeleine Dwyer, Chris Hicks, Vasilios Mavroudis

Abstract: In this paper we consider the contextual bandit problem in metric spaces. We design and analyse an algorithm that can handle the fully adversarial problem in which no assumptions are made about the space itself, or the generation of contexts and losses. In addition to analysing our performance on general metric spaces, we further analyse the important special case in which the space is euclidean, and furthermore analyse the i.i.d. stochastic setting. Unlike previous work our algorithm is adaptive to the local density of contexts and the smoothness of the decision boundary of the comparator policy, as well as other quantities. Our algorithm is highly efficient - having a per-trial time polylogarithmic in both the number of trials and the number of actions when the dimensionality of the metric space is bounded. We also give the results of real world experiments, demonstrating the excellent performance of our algorithm.

URL: https://openreview.net/forum?id=4bJMIrI5oX

---

Title: Prior Specification for Exposure-based Bayesian Matrix Factorization

Authors: Zicong Zhu, Issei Sato

Abstract: The rapid development of the Internet has resulted in a surge of information, particularly with the rise of recommender systems (RSs). One of the most significant challenges facing existing RS models is data sparsity. To address problems related to sparse data, Bayesian models have been applied to RS systems because of their effectiveness with small sample sizes. However, the performance of Bayesian models is heavily influenced by the choice of prior distributions and hyperparameters. Recent research has introduced an analytical method for specifying prior distributions in generic Bayesian models. The major concept is a statistical technique called Prior Predictive Matching~(PPM), which optimizes hyperparameters by aligning virtual statistics generated by the prior with observed data. This approach aims to reduce the need for repeated and costly posterior inference and enhance overall Bayesian model performance. However, our evaluation of this theoretical method reveals considerable deviations in prior specification estimates as data sparsity increases. In this study, we present an enhanced method for specifying priors in Bayesian matrix factorization models. We improve the estimators by implementing an exposure-based model to better simulate data scarcity. Our method demonstrates significant accuracy improvements in hyperparameter estimation during synthetic experiments. We also explore the feasibility of applying this method to real-world datasets and provide insights into how the model's behavior adapts to varying levels of data sparsity.

URL: https://openreview.net/forum?id=o5R4Hv9XqC

---

Title: Reasoning Under 1 Billion: Memory-Augmented Reinforcement Learning for Large Language Models

Authors: Hung Le, Van Dai Do, Dung Nguyen, Svetha Venkatesh

Abstract: Recent advances in fine-tuning large language models (LLMs) with reinforcement learning (RL) have shown promising improvements in complex reasoning tasks, particularly when paired with chain-of-thought (CoT) prompting. However, these successes have been largely demonstrated on large-scale models with billions of parameters, where a strong pretraining foundation ensures effective initial exploration. In contrast, RL remains challenging for tiny LLMs with 1 billion parameters or fewer because they lack the necessary pretraining strength to explore effectively, often leading to suboptimal reasoning patterns. This work introduces a novel intrinsic motivation approach, called Memory-R+, that leverages episodic memory to address this challenge, improving tiny LLMs in CoT reasoning tasks. Inspired by human memory-driven learning, our method leverages successful reasoning patterns stored in memory while allowing controlled exploration to generate novel responses. Intrinsic rewards are computed efficiently using a kNN-based episodic memory, allowing the model to discover new reasoning strategies while quickly adapting to effective past solutions. Experiments on three reasoning datasets demonstrate that our approach significantly enhances smaller LLMs' reasoning performance and generalization capability, making RL-based reasoning improvements more accessible in low-resource settings.
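
The kNN-based intrinsic reward mentioned in the abstract can be sketched along the following lines; this is a generic episodic-novelty bonus (mean distance to the k nearest stored embeddings), written as an assumption about the general mechanism rather than Memory-R+'s actual reward, which also exploits stored successful reasoning patterns.

    import numpy as np

    class EpisodicMemory:
        # toy kNN episodic memory over response embeddings
        def __init__(self, k=5):
            self.k, self.embeddings = k, []

        def intrinsic_reward(self, emb):
            if not self.embeddings:
                return 1.0                                 # everything is novel at first
            dists = np.linalg.norm(np.stack(self.embeddings) - emb, axis=1)
            return float(np.sort(dists)[: self.k].mean())  # larger = more novel response

        def add(self, emb):
            self.embeddings.append(np.asarray(emb, dtype=np.float32))

    # hypothetical usage inside the RL loop:
    #   total_reward = task_reward + beta * memory.intrinsic_reward(embed(response))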

URL: https://openreview.net/forum?id=tmdwuU2uKs

---


New submissions
===============


Title: Denoising Efficiency and Lines Matching Models

Abstract: In this paper, we analyze the denoising loss used by key denoising models and identify an inefficiency that stems from the random pairing which they employ between samples from the source and target distributions. Regressing the denoiser under these non-deterministic conditions causes its predictions to collapse toward the mean of the source or target distributions. We show that this degeneracy creates false basins of attraction, distorting the denoising trajectories and ultimately increasing the number of steps required to sample these models.

We also analyze the alternative strategy of deriving the pairing from an Optimal Transport between the two distributions, and show that while this approach can alleviate this degeneracy, it suffers from a curse of dimensionality, where the pairing set size must scale exponentially with the signal's dimension.

In order to empirically validate and utilize these theoretical observations, we design a new training approach that circumvents these pitfalls by leveraging the deterministic ODE-based samplers, offered by certain denoising diffusion and score-matching models. These deterministic samplers establish a well-defined change-of-variables between the source and target distributions. We use this correspondence to construct a new probability flow model, the Lines Matching Model (LMM), which matches globally straight lines interpolating between the two distributions. We show that the flow fields produced by the LMM exhibit notable temporal consistency, resulting in trajectories with excellent straightness scores, and allow us to exceed the quality of distilling the input correspondence.

The LMM flow formulation allows us to further enhance the fidelity of the generated samples beyond the input correspondence by integrating domain-specific reconstruction and adversarial losses. Overall, the LMM achieves state-of-the-art FID scores with minimal NFEs on established benchmark datasets: 1.57/1.39 (NFE=1/2) on CIFAR-10, 1.47/1.17 on ImageNet, and 2.68/1.54 on AFHQ.

URL: https://openreview.net/forum?id=K8QIvC6vQk

---

Title: Mitigating Unintended Memorization with LoRA in Federated Learning for LLMs

Abstract: Federated learning (FL) is a popular paradigm for collaborative training which avoids direct data exposure between clients. However, data privacy issues still remain: FL-trained large language models are capable of memorizing and completing phrases and sentences contained in training data when given their prefixes. Thus, it is possible for adversarial and honest-but-curious clients to recover training data of other participants simply through targeted prompting. In this work, we demonstrate that a popular and simple fine-tuning strategy, low-rank adaptation (LoRA), reduces memorization during FL by a factor of up to 10 without significant performance cost. We study this effect by performing fine-tuning tasks in high-risk domains such as medicine, law, and finance. We observe a reduction in memorization for a wide variety of model families, from 1B to 70B parameters. We find that LoRA can reduce memorization in centralized learning as well, and we compare how the memorization patterns differ. Furthermore, we study the effect of hyperparameters and show that LoRA can be combined with other privacy-preserving techniques such as gradient clipping and Gaussian noise, secure aggregation, and Goldfish loss to further improve record-level privacy while maintaining performance.
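
For context, LoRA replaces full-weight updates with a frozen base weight plus a trainable low-rank correction; a minimal linear-layer version is sketched below (generic LoRA with arbitrary rank and scaling, not the paper's federated training loop). Only the small A and B matrices carry the fine-tuning signal, which is the property the paper connects to reduced memorization.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)        # freeze the pre-trained weight
            if self.base.bias is not None:
                self.base.bias.requires_grad_(False)
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)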

URL: https://openreview.net/forum?id=WKPyZnLIW4

---

Title: DiffKGW: Stealthy and Robust Diffusion Model Watermarking

Abstract: Diffusion models are known for their supreme capability to generate realistic images. However, ethical concerns, such as copyright protection and the generation of inappropriate content, pose significant challenges for the practical deployment of diffusion models. Recent work has proposed a flurry of watermarking techniques that inject artificial patterns into initial latent representations of diffusion models, offering a promising solution to these issues. However, enforcing a specific pattern on selected elements can disrupt the Gaussian distribution of the initial latent representation. Inspired by watermarks for large language models (LLMs), we generalize the LLM KGW watermark to image diffusion models and propose a stealthy probability adjustment approach DiffKGW that preserves the Gaussian distribution of initial latent representation. In addition, we dissect the design principles of state-of-the-art watermarking techniques and introduce a unified framework. We identify a set of dimensions that explain the manipulation enforced by watermarking methods, including the distribution of individual elements, the specification of watermark shapes within each channel, and the choice of channels for watermark embedding. Through the empirical studies on regular text-to-image applications and the first systematic attempt at watermarking image-to-image diffusion models, we thoroughly verify the effectiveness of our proposed framework through comprehensive evaluations. On all the diffusion models, including Stable Diffusion, our approach induced from the proposed framework not only preserves image quality but also outperforms existing methods in robustness against a wide range of attacks.
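
The LLM-side KGW scheme that this submission generalizes hashes the preceding token to split the vocabulary into a "green" and a "red" list and nudges green-token logits upward; a simplified decoding-step sketch with arbitrary constants is shown below. How DiffKGW carries this idea over to the initial latents of a diffusion model is the subject of the paper and is not reflected here.

    import torch

    def kgw_bias_logits(logits, prev_token_id, vocab_size, gamma=0.5, delta=2.0):
        # Kirchenbauer-style soft watermark: boost a pseudo-random "green list"
        # of size gamma * |V|, seeded by the previous token.
        gen = torch.Generator().manual_seed(int(prev_token_id))
        green = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
        biased = logits.clone()                           # logits: (vocab,) for the next token
        biased[green] += delta                            # sampling now prefers green tokens
        return biased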

URL: https://openreview.net/forum?id=OXi9vcIOgD

---

Title: From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers

Abstract: Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-\{3B, 7B, 8B\}. ARMADA achieves up to $3.4$% improvement on language understanding tasks and $2.6$% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

URL: https://openreview.net/forum?id=Lvjgy71Gpu

---

Title: Unlocking [CLS] Features for Continual Post-Training

Abstract: Continual learning requires models to integrate new classes or domains over time while preserving previously acquired knowledge. Within this paradigm, foundation models often achieve strong performance, but they still remain subject to the stability–plasticity trade-off, where excessive plasticity leads to forgetting of prior knowledge, and excessive stability constrains the adaptation. This necessitates an effective post-training strategy that introduces minimal yet functional modifications. To address this challenge, we first introduce a new parameter-efficient fine-tuning module ‘Learn and Calibrate’, or LuCA, designed to acquire task-specific knowledge through an adapter-calibrator couple, enabling well-refined feature representations. Then, for each task, we deploy a sparse LuCA module on top of the last classification token [CLS] just before the classifier, which we refer to as ‘Token-level Sparse Calibration and Adaptation’, or TOSCA. By leaving the generalization capabilities of the foundation models intact and adapting exclusively via the last token, our approach achieves a harmonious balance between stability and plasticity while reducing both training and inference complexity. We demonstrate that TOSCA yields state-of-the-art performance while introducing 8 times fewer parameters compared to prior methods.

URL: https://openreview.net/forum?id=OWfWyj6krc

---

Title: Implicit Probabilistic Reasoning Does Not Reflect Explicit Answers in Large Language Models

Abstract: The handling of probabilities in the form of uncertainty or partial information is an essential task for LLMs in many settings and applications. A common approach to evaluate an LLM's probabilistic reasoning capabilities is to assess its ability to answer questions pertaining to probability through the use of multiple-choice questions (MCQs). However, this paradigm, which we refer to as explicit probabilistic reasoning, has been shown in the literature to yield significant limitations (e.g., sensitivity to answer ordering). In this work, we introduce an alternative approach, named implicit probabilistic reasoning, which evaluates the models' ability to integrate probabilistic reasoning into their text generation process. To achieve this, we rephrase MCQs as text-completion scenarios with a determined set of outcomes and compare the model's next-token probability assignments to the true likelihood of the outcomes. In line with previous work, we find that models exhibit solid performance in their explicit probabilistic reasoning (i.e., answers to MCQs). However, during text completion (i.e., implicit probabilistic reasoning), where the same information must be taken into account to generate text, the models' predictions often significantly diverge from the known ground truth. For instance, our evaluation method reveals that implicit probabilistic reasoning is improperly influenced by many factors, such as independent prior events, partial observations about a result, or statistical background information. All of these issues can cause erroneous results to be produced in text generation, which are not detected by conventional MCQ-based evaluation.
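
The measurement described here, reading off next-token probabilities for a fixed set of outcomes and comparing them with the true likelihoods, can be reproduced with standard tooling along the following lines; the prompt, the candidate outcomes, and the gpt2 checkpoint are placeholders rather than the paper's benchmark.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "I flipped a fair coin and it came up"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]            # distribution over the next token
    probs = logits.softmax(-1)

    for outcome in [" heads", " tails"]:
        tok_id = tok.encode(outcome)[0]                   # first sub-token of the outcome
        print(outcome, float(probs[tok_id]))              # ideally both close to 0.5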

URL: https://openreview.net/forum?id=HaaAY4ZXPa

---

Title: Text-Guided Diffusion Based Ambiguous Medical Image Segmentation

Abstract: Medical image segmentation often suffers from ambiguity due to unclear boundaries, expert inconsistencies, and varying interpretation standards. Traditional segmentation models produce single deterministic outputs, failing to capture this uncertainty and the range of plausible interpretations in such cases. To address this, we introduce AmbiguousTextDiff, a novel text-guided diffusion model that generates diverse and plausible segmentation proposals reflecting the ambiguity observed in medical imaging. By combining the strengths of text-conditional diffusion models with ambiguity-aware training, our approach generates multiple valid segmentations for a single input image. We use descriptive text prompts including anatomical and diagnostic attributes as conditioning signals to guide segmentation. We generate these prompts by extracting detailed metadata from the LIDC-IDRI dataset such as nodule size, texture, spiculation, and malignancy. This text-based conditioning improves both the controllability and clinical relevance of the model's outputs, aligning them more closely with radiologist interpretation. Extensive evaluations and ablations on the LIDC-IDRI dataset demonstrate that AmbiguousTextDiff achieves superior performance across Combined Sensitivity, Diversity Agreement, Generalized Energy Distance (GED), and Collective Insight (CI) Score, offering a comprehensive measure of both accuracy and uncertainty. Our results highlight the value of text-guided diffusion for ambiguity-aware segmentation and establish a new direction for controllable and interpretable medical image analysis.

URL: https://openreview.net/forum?id=k46r0L5ZyR

---

Title: A Novel Method for Time Series Counterfactual Inference Based on Penalized Autoencoders

Abstract: The ability to accurately perform counterfactual inference on time series is crucial for decision-making in fields like finance, healthcare, and marketing, as it allows us to understand the impact of events or treatments on outcomes over time. In this paper, we introduce a new counterfactual inference approach tailored to time series data impacted by market events, which arises from an industrial context. Utilizing the abduction-action-prediction procedure and the Structural Causal Model framework, we begin by employing methods based on variational autoencoders and adversarial autoencoders, both previously used in counterfactual work, although not in time-series settings. Then, we present the Conditional Entropy-Penalized Autoencoder (CEPAE), a novel autoencoder-based approach for counterfactual inference, which employs an entropy penalization loss over the latent space to achieve disentangled data representations. We validate our approach both theoretically and experimentally on synthetic, semi-synthetic, and real-world datasets, showing that CEPAE outperforms the other approaches in the evaluated metrics.

URL: https://openreview.net/forum?id=X6lrzqOtQo

---

Title: CRoPS: A Training-Free Hallucination Mitigation Framework for Vision-Language Models

Abstract: Despite the rapid success of Large Vision-Language Models (LVLMs), a persistent challenge is their tendency to generate hallucinated content, undermining reliability in real-world use. Existing training-free methods address hallucinations but face two limitations: (i) they rely on narrow assumptions about hallucination sources, and (ii) their effectiveness declines toward the end of generation, where hallucinations are most likely to occur. A common strategy is to build hallucinated models by completely or partially removing visual tokens and contrasting them with the original model. Yet, this alone proves insufficient, since visual information still propagates into generated text. Building on this insight, we propose a novel hallucinated model that captures hallucination effects by selectively removing key text tokens. We further introduce Generalized Contrastive Decoding, which integrates multiple hallucinated models to represent diverse hallucination sources. Together, these ideas form CRoPS, a training-free hallucination mitigation framework that improves CHAIR scores by 20% and achieves consistent gains across six benchmarks and three LVLM families, outperforming state-of-the-art training-free methods.

URL: https://openreview.net/forum?id=KQSoZDPVGX

---

Title: Moment Constrained Optimal Transport for Control Applications

Abstract: This paper concerns the application of techniques from optimal transport (OT) to mean field control, in which the probability measures of interest in OT correspond to empirical distributions associated with a large collection of controlled agents. The control objective of interest motivates a one-sided relaxation of OT, in which the first marginal is fixed and the second marginal is constrained to a “moment class”: a set of probability measures defined by generalized moment constraints. This relaxation is particularly interesting for control problems, as it enables the coordination of agents without the need to know the desired distribution beforehand. The inclusion of an entropic regularizer is motivated both by computational considerations and by the need to impose hard constraints on agent behavior. A computational approach inspired by the Sinkhorn algorithm is proposed to solve this problem. This new approach to distributed control is illustrated with an application to charging a fleet of electric vehicles while satisfying grid constraints. An online version is proposed and applied in a case study on the ElaadNL dataset containing 10,000 EV charging transactions in the Netherlands. This empirical validation demonstrates the effectiveness of the proposed approach to optimizing flexibility while respecting grid constraints.
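
Schematically, and in our own notation (the paper's precise formulation may differ), the one-sided, entropy-regularized relaxation described above can be written for discrete measures as

$$\min_{\pi \ge 0}\ \langle C, \pi\rangle \;+\; \varepsilon\,\mathrm{KL}\!\left(\pi \,\|\, \mu \otimes \nu_0\right) \quad \text{s.t.}\quad \pi\mathbf{1} = \mu,\qquad \pi^{\top}\mathbf{1} \in \mathcal{M},$$

where $\mu$ is the fixed first marginal, $\nu_0$ is a reference measure, and $\mathcal{M} = \{\nu : \langle g_j, \nu\rangle \le c_j,\ j = 1,\dots,m\}$ is a moment class defined by generalized moment constraints.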

URL: https://openreview.net/forum?id=2hAtSpnat9

---

Title: DAG-aware Transformer for Causal Effect Estimation

Abstract: Causal inference is a critical task across fields such as healthcare, economics, and the social sciences. While recent advances in machine learning, especially those based on deep-learning architectures, have shown potential in estimating causal effects, existing approaches often fall short in handling complex causal structures and lack adaptability across various causal scenarios. In this paper, we present a novel transformer-based method for causal inference that overcomes these challenges. The core innovation of our model lies in its integration of causal Directed Acyclic Graphs (DAGs) directly into the attention mechanism, enabling it to accurately model the underlying causal structure. This allows for flexible estimation of both average treatment effects (ATE) and conditional average treatment effects (CATE). Extensive experiments on both synthetic and real-world datasets demonstrate that our approach surpasses existing methods in estimating causal effects across a wide range of scenarios. The flexibility and robustness of our model make it a valuable tool for researchers and practitioners tackling complex causal inference problems.

URL: https://openreview.net/forum?id=pzFwkjRVXy

---

Title: Single-loop Algorithms for Stochastic Non-Convex Optimization with Weakly-Convex Constraints

Abstract: Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a hinge-based penalty, which permits the use of a constant penalty parameter, enabling us to achieve a state-of-the-art complexity for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.
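
For concreteness, the classical exact (hinge) penalty reformulation the abstract refers to takes the form (notation ours)

$$\min_{x}\ f(x) \;+\; \rho \sum_{i=1}^{m} \max\{0,\ g_i(x)\},$$

where the constraints are $g_i(x) \le 0$ for $i = 1,\dots,m$ and $\rho > 0$ is a constant penalty parameter.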

URL: https://openreview.net/forum?id=aCgOR2KvAI

---

Title: Still Competitive: Revisiting Recurrent Models for Irregular Time Series Prediction

Abstract: Modeling irregularly sampled multivariate time series is a persistent challenge in domains like healthcare and sensor networks. While recent works have explored a variety of complex learning architectures to solve prediction problems for irregularly sampled time series, it remains unclear what the true benefits of some of these architectures are, and whether clever modifications of simpler and more efficient RNN-based algorithms remain competitive, i.e., on par with or even superior to these methods. In this work, we propose and study GRUwE: Gated Recurrent Unit with Exponential basis functions, which builds upon RNN-based architectures for observations made at irregular times. GRUwE supports both regression-based and event-based predictions in continuous time. GRUwE works by maintaining a Markov state representation of the time series that updates with the arrival of irregular observations. The Markov state update relies on two reset mechanisms: (i) observation-triggered reset, and (ii) time-triggered reset of the GRU state using learnable exponential decays, to support predictions in continuous time. Our empirical evaluations across several real-world benchmarks on next-observation and next-event prediction tasks demonstrate that GRUwE achieves performance that is competitive with, or superior to, recent state-of-the-art (SOTA) methods. Thanks to its simplicity, GRUwE offers compelling advantages: it is easy to implement, requires minimal hyper-parameter tuning, and significantly reduces computational overhead in online deployment.
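
One standard way such a time-triggered decay can be parameterized (our illustration; the paper's exact exponential-basis form may differ) is

$$h(t_0 + \Delta) \;=\; \exp(-\lambda \Delta) \odot h(t_0), \qquad \lambda \ge 0 \ \text{learnable},$$

so that the Markov state decays elementwise as the time $\Delta$ since the last observation grows.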

URL: https://openreview.net/forum?id=YLoZA77QzR

---

Title: CATS: Mitigating Correlation Shift for Multivariate Time Series Classification

Abstract: Unsupervised Domain Adaptation (UDA) leverages labeled source data to train models for unlabeled target data. Given the prevalence of multivariate time series (MTS) data across various domains, the UDA task for MTS classification has emerged as a critical challenge. In this work, we identify a key property of MTS: multivariate correlations vary significantly across domains, and we further formalize this phenomenon as a novel type of domain shift, termed correlation shift. To mitigate correlation shift, we propose a scalable and parameter-efficient \underline{C}orrelation \underline{A}dapter for M\underline{TS} (CATS). Designed as a plug-and-play technique compatible with various Transformer variants, CATS employs temporal convolution to capture local temporal patterns and a graph attention module to model the changing multivariate correlation. The adapter reweights the target correlations to align them with the source correlations with a theoretically guaranteed precision. A correlation alignment loss is further proposed to mitigate correlation shift, bypassing the alignment challenge posed by the non-i.i.d. nature of MTS data. Extensive experiments on four real-world datasets demonstrate that (1) compared with vanilla Transformer-based models, CATS increases average accuracy by over $10\%$ while adding only around $1\%$ parameters, and (2) all Transformer variants equipped with CATS either reach or surpass state-of-the-art baselines.

URL: https://openreview.net/forum?id=WdTfsGh6VI

---

Title: Kolmogorov-Arnold Network Autoencoders for High-Dimensional Data Representation

Abstract: In the era of high-dimensional data, traditional machine learning models often face challenges related to computational complexity, overfitting, and suboptimal feature representations. This paper introduces Kolmogorov-Arnold Network Autoencoders (KANAs), a novel framework that leverages Kolmogorov-Arnold Networks (KANs) for dimensionality reduction and data reconstruction. Through experiments across diverse datasets, KANAs consistently demonstrate superior reconstruction fidelity and linear probing accuracy, establishing the framework as a powerful and versatile tool for high-dimensional data processing. The proposed framework shows strong potential for applications in areas such as scientific modeling and data compression.

URL: https://openreview.net/forum?id=clhnYdO3nh

---

Title: Outlier-Free SpeechLM for Fast Adaptation and Robust Quantization

Abstract: We introduce SOFA (Stabilized Outlier-Free Attention), a drop-in replacement for the softmax activation that tackles the attention-outlier problem when turning a text-only LLM into a speech-text multi-modal model (SpeechLM).
Our primary observation is that outliers emerge from both multi-modal low-rank adaptation and post-training quantization of transformer attention, degrading the performance of state-of-the-art SpeechLMs.
To address these issues, we leverage a pretrained language model as a foundation and replace the standard softmax attention with SOFA.
We propose a plug-in method that directly eliminates outliers without adjusting pretrained weights, and we quantitatively measure the prevalence and impact of outliers in a unified speech-text transformer.
We evaluate two multi-modal adaptation strategies: (i) full fine-tuning on multi-modal data followed by post-training quantization, and (ii) applying LoRA to a SOFA-equipped model (the SOFA-LoRA adapter), which keeps the pretrained LLM frozen and requires no extra pre-training.
The full fine-tuning route delivers strong, consistent gains across all modalities (textLM, SpeechLM, ASR, TTS), whereas the SOFA-LoRA adapter, without touching any pretrained weights, surpasses the vanilla-LoRA adapter baseline and is particularly effective on text-output tasks such as ASR, all while retaining full compatibility with standard LLM checkpoints.
Empirically, on the OPT-1.3b model, incorporating SOFA into SpeechLM yields an 88% improvement in multi-modal low-rank adaptation and a 37% improvement in post-training quantization.

URL: https://openreview.net/forum?id=2gDSRwfQTN

---

Title: ReVision: High-Quality Video Generation with Explicit 3D Physics Modeling for Complex Motion and Interaction

Abstract: In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D physical knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized physical prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D physical knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.

URL: https://openreview.net/forum?id=mQ5frFQTFV

---

Title: A Propagation Property of the Pareto Order with Applications to Multiobjective Optimization

Abstract: We establish a propagation property of the strict Pareto dominance order $\prec$ on $\mathbb{R}^k$ and present its application to efficient nondominated sorting and Pareto archive maintenance in multiobjective optimization. Precisely, if $u,v\in\mathbb{R}^k$ are mutually nondominated ($(u\nprec v) \wedge (v\nprec u)$), then no $q\in\mathbb{R}^k$ can satisfy $u\prec q\prec v$ or $v\prec q\prec u$. We give algebraic and geometric proofs, the latter via containment of strict down-sets (lower orthants). As a corollary, we state and prove a \emph{post-witness ``$u \prec q$'' elimination rule} for Pareto archive $S$ insertion: once a witness $w\in S$ with $q\prec w$ is found, no remaining $u \in S \setminus\{w\}$ can dominate $q$. We can therefore skip all subsequent ``$u\prec q$?'' checks and only test ``$q\prec u$?'' to remove dominated vectors. We provide pseudocode for the resulting archive-insertion routine, and outline extensions to weak, $\varepsilon$-, and noisy dominance. Finally, under a standard random-input model in which points are drawn independently from a continuous distribution on $\mathbb{R}^k$ (general position almost surely), we derive a closed-form expression for the expected post-witness fraction of the remaining “$u\prec q$?” comparisons (over $u\in S\setminus\{w\}$) that become unnecessary once a first witness $w$ with $q\prec w$ is identified. The formula reveals how savings in comparison checks scale with dimension $k$ (and archive size), justifies witness-first heuristic scanning orders, and provides a reproducible baseline for empirical evaluation of dominance comparison-count reductions in archive insertion and nondominated sorting implementations. This probabilistic baseline complements the deterministic post-witness exclusion guaranteed by the propagation property in mutually nondominated curated archives.
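
A minimal sketch (ours, not the paper's pseudocode) of an archive-insertion routine that applies the post-witness elimination rule, assuming minimization in every objective:

```python
# Strict Pareto dominance: a dominates b if a <= b componentwise and a != b.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def insert(archive, q):
    """Insert q into a mutually nondominated archive (returned as a new list)."""
    witness_found = False
    survivors = []
    for u in archive:
        if not witness_found:
            if dominates(u, q):        # q is dominated: keep the archive unchanged
                return archive
            if dominates(q, u):        # first witness w with q ≺ w (w is dropped)
                witness_found = True
                continue
            survivors.append(u)
        else:
            # Post-witness: skip the "u ≺ q?" check, only test "q ≺ u?".
            if not dominates(q, u):
                survivors.append(u)
    survivors.append(q)
    return survivors

archive = insert(insert([], (1.0, 4.0)), (3.0, 2.0))
print(insert(archive, (2.0, 1.0)))     # (2, 1) dominates (3, 2) but not (1, 4)
```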

URL: https://openreview.net/forum?id=LfBLWOzSzG

---

Title: A Multilevel Low-Rank Newton Method with Super-linear Convergence Rate and its Application to Non-convex Problems

Abstract: Second-order methods can address the shortcomings of first-order methods for the optimization of large-scale machine learning models.
However, second-order methods have significantly higher computational costs associated with the computation of second-order information. Subspace methods that are based on randomization have addressed some of these computational costs as they compute search directions in lower dimensions. Even though super-linear convergence rates have been empirically observed, it has not been possible to rigorously show that these variants of second-order methods can indeed achieve such fast rates.
Also, it is not clear whether subspace methods are efficient in non-convex settings.
To address these shortcomings, we develop a link between multigrid optimization methods and low-rank Newton methods that enables us to prove the super-linear rates of stochastic low-rank Newton methods rigorously. Our method does not require any computations in the original model dimension. We further propose a truncated version of the method that is capable of solving high-dimensional non-convex problems. Preliminary numerical experiments show that our method has a better escape rate from saddle points compared to the state-of-the-art first-order methods and thus returns lower training errors.

URL: https://openreview.net/forum?id=PKakPzVVja

---

Title: Graph Coarsening using Game Theoretic Approach

Abstract: Graph coarsening is a method for reducing the size of an original graph while preserving its structural and feature-related properties. In graph machine learning, it is often employed as a preprocessing step to improve efficiency and scalability when handling large graph datasets. In this study, we address the challenge of coarsening an original graph into a coarsened graph that retains these characteristics. We propose a Cooperative-Based Graph Coarsening (CGC) algorithm, which leverages cooperative game theory as a framework for combinatorial optimization, aiming to minimize the total Dirichlet energy of the graph through localized optimizations. We prove that the proposed coarsening game is a potential game that guarantees convergence to a stable coarsened graph. Tests on real-world datasets demonstrate that the CGC algorithm surpasses prior state-of-the-art techniques in terms of coarsened graph accuracy and achieves reduced time complexity. These results highlight the potential of game-theoretic approaches in the advancement of graph coarsening techniques.

URL: https://openreview.net/forum?id=5vLBjQJCln

---

Title: Defending Multimodal Large Language Models Against Jailbreak Attacks Through Embedding-Space Adversarial Smoothing

Abstract: Multimodal large language models have achieved unprecedented capabilities in integrating visual perception with natural language understanding. However, jailbreak attacks exploit coordinated vision-text manipulations through typographic prompts, pictorial code, and multi-modal linkage, achieving attack success rates exceeding 90%. We introduce Embedding-Space Adversarial Smoothing (ESAS), operating directly on the embedding manifold through cross-modal coupled interpolation, contrastive safety anchoring, and lightweight adapter transformation. Our framework synthesizes adversarial embeddings via gradient-based visual perturbations and text suffix injection, applies Beta-distributed mixing for smoothed manifold trajectories, and leverages safety anchors to attract embeddings toward safe regions while repelling adversarial zones. Evaluation across seven attacks and four architectures demonstrates 78.8% attack mitigation, reducing ASR from 79.2% to 16.8% with a 0.6% accuracy drop. ESAS outperforms four state-of-the-art defenses, maintaining ASR below 20% under perturbations up to $\epsilon = 0.15$. This work establishes embedding-space geometric regularization as a principled paradigm for defending multimodal systems against cross-modal jailbreak threats.

URL: https://openreview.net/forum?id=YAnGiUur2P

---

Title: Is Oracle Pruning the True Oracle? -- A Sanity-Check of Pruning Paradigms over 35 years

Abstract: Oracle pruning, which selects unimportant weights by minimizing the pruned train loss, has served as the foundation for most neural network pruning methods for over 35 years, yet few (if any) have examined how well this foundation actually holds. This paper, for the first time, attempts to examine its validity on modern deep models through empirical correlation analyses and to provide reflections on the field of neural network pruning. Specifically, for a typical pruning algorithm with three stages (pre-training, pruning, and retraining), we analyze the correlation between model performance before and after retraining. Extensive experiments (37K models are trained) across a wide spectrum of models (LeNet5, VGG, ResNets, ViT, MLLM) and datasets (MNIST and its variants, CIFAR10/CIFAR100, ImageNet-1K, MLLM data) are conducted. The results lead to a surprising conclusion: on modern deep learning models, the performance before retraining is barely correlated with the performance after retraining. Namely, the weights selected by oracle pruning can hardly guarantee good performance after retraining. This further implies that existing works using oracle pruning to derive pruning criteria may be groundless from the beginning. Further studies suggest that rising task complexity is one factor that makes oracle pruning invalid nowadays. Finally, given the evidence, we argue that the retraining stage in a pruning algorithm should be accounted for when developing any pruning criterion.
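
The before/after-retraining correlation analysis at the heart of the study can be pictured with a toy sketch (the accuracy arrays below are random placeholders, not the paper's measurements):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
acc_before = rng.uniform(0.1, 0.9, size=200)   # accuracy right after pruning
acc_after = rng.uniform(0.6, 0.95, size=200)   # accuracy after retraining

r, _ = pearsonr(acc_before, acc_after)
rho, _ = spearmanr(acc_before, acc_after)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")  # near zero here by construction
```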

URL: https://openreview.net/forum?id=k13YnckIZY

---

Title: Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes

Abstract: In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as "$Q-Manipulation$" (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.

URL: https://openreview.net/forum?id=u2b31c9Noe

---

Title: Neural Fourier Transform for Multiple Time Series Prediction

Abstract: Multivariate time series forecasting is an important task in various fields such as economic planning, healthcare management, and environmental monitoring.
In this work, we present a novel methodology for improving multivariate forecasting, particularly in datasets with strong seasonality.
We frame the forecasting task as a Multi-Dimensional Fourier Transform (MFT) problem and propose the Neural Fourier Transform (NFT) that leverages a deep learning model to predict future time series values by learning the MFT coefficients.
The efficacy of NFT is empirically validated on 7 diverse datasets, demonstrating improvements over multiple forecasting horizons and lookbacks, thereby establishing new state-of-the-art results.
Our contributions advance the field of multivariate time series forecasting by providing a model that excels in predictive accuracy.
The code of this study is publicly available.

URL: https://openreview.net/forum?id=0GBjIwRuVp

---

Title: Prompt-based Adaptation in Large-scale Vision Models: A Survey

Abstract: In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the ``pretrain-then-finetune'' paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity: pixel-level and token-level. Beyond the core methodologies, we examine PA’s integration across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA’s methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners in all areas to understand and explore the evolving landscape of PA-related research.

URL: https://openreview.net/forum?id=UwtXDttgsE

---

Title: Occam’s Razor for SSL: Memory-Efficient Parametric Instance Discrimination

Abstract: Self-supervised learning (SSL) is the prevalent paradigm for representation learning, often relying on pairwise similarity between multiple augmented views of each example. Numerous learning methods with various complexities, such as gradient stopping, negative sampling, projectors, and additional regularization terms, have been introduced in recent years. These methods can be effective, but they require careful hyperparameter tuning, have increased computational and memory requirements, and struggle with latent dimensionality collapse. Furthermore, complexities such as gradient stopping make them hard to analyse theoretically and confound the essential components of SSL. We introduce a simple parametric instance discrimination method, called Datum IndEx as its Target (DIET). DIET has a single computational branch, without explicit negative sampling, gradient stopping or other hyperparameters. We empirically demonstrate that DIET (1) can be implemented in a memory-efficient way; (2) achieves competitive performance with state-of-the-art SSL methods on small-scale datasets; and (3) is robust to hyperparameters such as batch size. We uncover tight connections to Spectral Contrastive Learning in the lazy training regime, leading to practical insights about the role of feature normalization. Compared to SimCLR or VICReg, DIET also has higher-rank embeddings on CIFAR100 and TinyImageNet, suggesting that DIET captures more latent information.
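
A rough sketch of the parametric instance-discrimination idea as we read the abstract (each datum's index is its own class label); the encoder, augmentation, and sizes below are placeholders, not the paper's setup:

```python
import torch
import torch.nn as nn

n_data, dim, feat = 1000, 32, 64
encoder = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, feat))
head = nn.Linear(feat, n_data)           # one "class" per training example
opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

data = torch.randn(n_data, dim)          # stand-in dataset
for step in range(10):
    idx = torch.randint(0, n_data, (256,))
    x = data[idx] + 0.1 * torch.randn(256, dim)   # crude stand-in augmentation
    loss = loss_fn(head(encoder(x)), idx)         # target = the datum's own index
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final loss: {loss.item():.3f}")
```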

URL: https://openreview.net/forum?id=GFNTbsVFlP

---

Title: Pave Your Own Path: Graph Gradual Domain Adaptation on Fused Gromov-Wasserstein Geodesics

Abstract: Graph neural networks, despite their impressive performance, are highly vulnerable to distribution shifts on graphs.
Existing graph domain adaptation (graph DA) methods often implicitly assume a mild shift between source and target graphs, limiting their applicability to real-world scenarios with large shifts.
Gradual domain adaptation (GDA) has emerged as a promising approach for addressing large shifts by gradually adapting the source model to the target domain via a path of unlabeled intermediate domains.
Existing GDA methods exclusively focus on independent and identically distributed (IID) data with a predefined path, leaving their extension to non-IID graphs without a given path an open challenge.
To bridge this gap, we present Gadget, the first GDA framework for non-IID graph data.
First (theoretical foundation), the Fused Gromov-Wasserstein (FGW) distance is adopted as the domain discrepancy for non-IID graphs, based on which, we derive an error bound on node, edge and graph-level tasks, showing that the target domain error is proportional to the length of the path.
Second (optimal path), guided by the error bound, we identify the FGW geodesic as the optimal path, which can be efficiently generated by our proposed algorithm.
The generated path can be seamlessly integrated with existing graph DA methods to handle large shifts on graphs, improving state-of-the-art graph DA methods by up to 6.8% in accuracy on real-world datasets.

URL: https://openreview.net/forum?id=dTPBqTKGPs

---

Title: On Connection between $\texttt{CLS}$ Token and Virtual Node: Are they both sides of the same coin?

Abstract: Transformers have emerged as a promising tool for learning long-range dependencies in language tasks. Recently, it has been shown that the self-attention module in the Transformer operates on the complete graph of tokens obtained from the input dataset. Specifically, in language and vision-related tasks, a \texttt{CLS} token is added at the beginning of the token sequence to extract the representation of the entire sequence. On the other hand, a Virtual Node (VN) is added in a graph to mitigate the effect of oversquashing, an information bottleneck typically observed in Graph Neural Networks (GNNs). In this work, we observe that both the \texttt{CLS} token and the VN are structurally identical in their respective graphs: each is connected to every other token or node and aggregates features from all remaining tokens or nodes in a similar way. Although the embedding of the \texttt{CLS} token is used to classify the input sequence, the embedding of the VN is not fully explored; instead, standard pooling is employed to extract the graph-level representation for classification. To bridge this gap, we consider representations of VNs for solving classification tasks on standard graph datasets and observe surprisingly competitive performance.

URL: https://openreview.net/forum?id=21hIKHKYtx

---

Title: Remote Action Generation: Remote Control with Minimal Communication

Abstract: We address the challenge of remote control where one or more actors, lacking direct reward access, are steered by a controller over a communication-constrained channel. The controller learns an optimal policy from observed rewards and communicates action guidance to the actors, which becomes demanding for large or continuous action spaces. To achieve rate-efficient communication throughout this interactive learning and control process, we introduce a novel framework leveraging remote generation. Instead of transmitting full action specifications, the controller sends minimal information, enabling the actors to locally generate actions by sampling from the controller's evolving target policy. This guided sampling is facilitated by an importance sampling approach. Concurrently, the actors use the received guidance as supervised learning data to learn the controller's policy. This actor-side learning improves their local sampling capabilities, progressively reducing future communication needs. Our solution, Guided Remote Action Sampling Policy (GRASP), demonstrates significant communication reduction, achieving an average 12-fold data reduction across all experiments (50-fold for continuous action spaces) compared to direct action transmission, and a 41-fold reduction compared to reward transmission.

URL: https://openreview.net/forum?id=MpwOWtwUTJ

---

Title: Cross-Server Interoperability in Multi-MCP Automated AI Agent Networks

Abstract: This paper introduced a combined framework for cross-server interoperability in multi-MCP automated AI agent networks. The design combined communication abstraction, orchestration optimization, and security validation. The framework was tested on BoT-IoT, ToN-IoT, and PettingZoo datasets, which represented adversarial traffic detection, telemetry-heavy IoT environments, and dynamic multi-agent orchestration. Results showed improvements in coverage, efficiency, and robustness, with accuracy, precision, recall, and F1-score above 0.95 across multiple trials. Ablation analysis confirmed the role of each component, scalability tests showed stable performance as servers increased, and stress evaluations demonstrated graceful degradation under heavier attack loads. Error analysis and statistical validation supported the reliability of the outcomes, while resource usage comparisons indicated reduced runtime and memory consumption against baselines. Cross-domain generalization confirmed adaptability across unseen datasets. These findings demonstrated that interoperability in heterogeneous MCP networks can be achieved without sacrificing efficiency, scalability, or reliability, providing a foundation for secure and practical multi-domain agent collaboration.

URL: https://openreview.net/forum?id=htkKOuWGyB

---

Title: SoundnessBench: A Soundness Benchmark for Neural Network Verifiers

Abstract: Neural network (NN) verification aims to formally verify properties of NNs, which is crucial for ensuring the behavior of NN-based models in safety-critical applications. In recent years, the community has developed many NN verifiers and benchmarks to evaluate them. However, existing benchmarks typically lack ground-truth for hard instances where no current verifier can verify the property and no counterexample can be found. This makes it difficult to validate the soundness of a verifier, when it claims verification on such challenging instances that no other verifier can handle. In this work, we develop a new benchmark for NN verification, named "SoundnessBench", specifically for testing the soundness of NN verifiers. SoundnessBench consists of instances with deliberately inserted counterexamples that are hidden from adversarial attacks commonly used to find counterexamples. Thereby, it can identify false verification claims when hidden counterexamples are known to exist. We design a training method to produce NNs with hidden counterexamples and systematically construct our SoundnessBench with instances across various model architectures, activation functions, and input data. We demonstrate that our training effectively produces hidden counterexamples and our SoundnessBench successfully identifies bugs in state-of-the-art NN verifiers.

URL: https://openreview.net/forum?id=UuYYldVLH3

---

Title: Navigating the Labyrinth: Evaluating LLMs’ Ability to Reason About Search Problems

Abstract: Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles. Each SearchBench problem type is equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench; for example, OpenAI’s frontier models GPT-4 and o1-preview solve only 1.4% and 18.6% of problems, respectively. The reason is that SearchBench problems require considering multiple pathways and performing backtracking, posing a significant challenge to auto-regressive models. Interestingly, performance is significantly boosted when we prompt models to generate a complete A* search algorithm—a comparatively more cognitively difficult task. This approach effectively offloads the iterative search and backtracking process from the models, which they struggle with in text. This in-context learning baseline is further enhanced via a Multi-Stage-Multi-Try (MSMT) inference method, increasing GPT-4’s rate of correct solutions to over 57%.

URL: https://openreview.net/forum?id=oub2I1ioL5

---

Title: StyleTalker: Stylized Talking Head Avatar from Monocular Video

Abstract: We introduce **StyleTalker**, a text-guided framework for editing and animating dynamic 3D head avatars from a monocular video. Current 3D scene editing techniques face two main challenges when applied in this task: 1) They typically require multi-view videos for accurate geometry reconstruction. Additionally, they are not suited for dynamic scenarios, making them ineffective for editing talking head avatars from a single-view video. 2) They struggle with fine-grained local edits, largely due to biases inherited from pre-trained 2D image diffusion models and limitations in detecting detailed facial landmarks. To overcome these challenges, we propose StyleTalker with two key innovations: **1)** A **mesh-enhanced 3D Gaussian reconstruction** approach that combines 3D head priors with multi-view video diffusion, improving the accuracy and flexibility of the reconstruction process. **2)** A **landmark-driven talking head editing** method that uses 3D facial landmarks to guide the editing process. By adjusting the strength of the edits based on the distance to these landmarks, our method ensures that the avatar's original identity is preserved while achieving the desired editing. Our extensive experiments demonstrate that StyleTalker outperforms current state-of-the-art methods, delivering high-quality edits and enabling the animation of avatars with diverse facial expressions, all based on a single-source video.

URL: https://openreview.net/forum?id=p6DtT0r2ed

---

Title: Semantic-aware Adversarial Fine-tuning for CLIP

Abstract: Recent studies have shown that CLIP model's adversarial robustness in zero-shot classification tasks can be enhanced by adversarially fine-tuning its image encoder with adversarial examples (AEs), which are generated by minimizing the cosine similarity between images and a hand-crafted template (e.g., ''A photo of a {label}''). However, it has been shown that the cosine similarity between a single image and a single hand-crafted template is insufficient to measure the similarity for image-text pairs. Building on this, in this paper, we find that the AEs generated using cosine similarity may fail to fool CLIP when the similarity metric is replaced with semantically enriched alternatives, making the image encoder fine-tuned with these AEs less robust. To overcome this issue, we first propose a semantic-ensemble attack to generate semantic-aware AEs by minimizing the average similarity between the original image and an ensemble of refined textual descriptions. These descriptions are initially generated by a foundation model to capture core semantic features beyond hand-crafted templates and are then refined to reduce hallucinations. To this end, we propose Semantic-aware Adversarial Fine-Tuning (SAFT), which fine-tunes CLIP's image encoder with semantic-aware AEs. Extensive experiments show that SAFT outperforms current methods, achieving substantial improvements in zero-shot adversarial robustness across 16 datasets. Our code is available at: https://anonymous.4open.science/r/SAFT-FA06.

URL: https://openreview.net/forum?id=SzZOBzueK0

---

Title: Best of mini-N in-loop Sampling: A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling

Abstract: Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained with pairwise comparison data. While effective at learning relative preferences, this paradigm fails to capture a signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples. In this paper, we address this critical reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best of mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated, early-exit condition. Our experiments show that when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB-sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.
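
An illustrative rendering of the best-of-mini-N-in-loop idea as we read the abstract: spend the generation budget in small sequential loops and exit early once the reward model deems a candidate acceptable. The `generate`, `reward`, and threshold `tau` below are placeholders, not the paper's interface.

```python
import random

def best_of_mini_n(prompt, generate, reward, N=32, n=4, tau=0.9):
    """Partition a budget of N samples into loops of size n with an early exit."""
    best, best_score = None, float("-inf")
    for _ in range(N // n):
        candidates = [generate(prompt) for _ in range(n)]
        scored = [(reward(prompt, c), c) for c in candidates]
        loop_score, loop_best = max(scored, key=lambda t: t[0])
        if loop_score > best_score:
            best, best_score = loop_best, loop_score
        if best_score >= tau:          # calibrated early-exit: "good enough"
            break
    return best

# Toy stand-ins: generations are random numbers, the reward is the number itself.
print(best_of_mini_n("hi", lambda p: random.random(), lambda p, c: c))
```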

URL: https://openreview.net/forum?id=dZjpc5c835

---

Title: Hierarchical Time Series Forecasting with Robust Reconciliation

Abstract: This paper focuses on forecasting hierarchical time-series data, where each higher-level observation equals the sum of its corresponding lower-level time series. In such contexts, the forecast values should be coherent, meaning that the forecast value of each parent series exactly matches the sum of the forecast values of its child series. Existing hierarchical forecasting methods typically generate base forecasts independently for each series and then apply a reconciliation procedure to adjust them so that the resulting forecast values are coherent across the hierarchy. These methods generally derive an optimal reconciliation using a covariance matrix of the forecast errors. In practice, however, the true covariance matrix is unknown and has to be estimated from finite samples in advance. This gap between the true and estimated covariance matrices may degrade forecast performance. To address this issue, we propose a robust optimization framework for hierarchical reconciliation that accounts for uncertainty in the estimated covariance matrix. We first introduce an uncertainty set for the estimated covariance matrix and formulate a reconciliation problem that minimizes the worst-case expected squared error over this uncertainty set. We show that our problem can be cast as a semidefinite optimization problem. Numerical experiments demonstrate that the proposed robust reconciliation method achieves better forecast performance than existing hierarchical forecasting methods, indicating the effectiveness of integrating uncertainty into the reconciliation process.

URL: https://openreview.net/forum?id=XHPLjF52gY

---

Title: Simplifying Optimal Transport through Schatten-$p$ Regularization

Abstract: We propose a new general framework for recovering low-rank structure in optimal transport using Schatten-$p$ norm regularization. Our approach extends existing methods that promote sparse and interpretable transport maps or plans, while providing a unified and principled family of convex programs that encourage low-dimensional structure. The convexity of our formulation enables direct theoretical analysis: we derive optimality conditions and prove recovery guarantees for low-rank couplings and barycentric maps in simplified settings. To efficiently solve the proposed program, we develop a mirror-descent algorithm with convergence guarantees for $p \geq 1$. Experiments on synthetic and real data demonstrate the method’s efficiency, scalability, and ability to recover low-rank transport structures.

URL: https://openreview.net/forum?id=DIawkTG5VH

---

Title: Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos

Abstract: Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, an instruction-tuning dataset containing 140,057 glitch-centric question–answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a meta-information–guided prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5-VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.

URL: https://openreview.net/forum?id=Oe5TdpPv1b

---

Title: $$\texttt{C2-DPO}$$: Constrained Controlled Direct Preference Optimization

Abstract: Direct preference optimization (\texttt{DPO}) has emerged as a promising approach for solving the alignment problem in AI. In this paper, we make two counter-intuitive observations about \texttt{DPO}. First, we show that the \texttt{DPO} loss could be derived by starting from an alternative optimization problem that only defines the KL guardrail on in-sample responses, unlike the original RLHF problem where guardrails are defined on the entire distribution. Second, we prove a surprising property of this alternative optimization problem, where both the preferred and rejected responses tend to decrease in probability under its optimal policy, a phenomenon typically displayed by DPO in practice. To control this behavior, we propose a set of constraints designed to limit the displacement of probability mass between the preferred and rejected responses in the reference and target policies. The resulting algorithm, which we call Constrained Controlled DPO (\texttt{C2-DPO}), has a meaningful RLHF interpretation. By hedging against the displacement, \texttt{C2-DPO} provides practical improvements over vanilla \texttt{DPO} when aligning several language models using standard preference datasets.
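
For context, the standard DPO objective that the paper's analysis and constraints build on can be written (in common notation) as

$$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where $y_w$ and $y_l$ are the preferred and rejected responses and $\pi_{\mathrm{ref}}$ is the reference policy.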

URL: https://openreview.net/forum?id=7h5Ho9t5NL

---

Title: RewardSDS: Aligning Score Distillation via Reward-Weighted Sampling

Abstract: Score Distillation Sampling (SDS) has emerged as a highly effective technique for leveraging 2D diffusion priors for a diverse set of tasks such as text-to-3D generation. While powerful, SDS still struggles with achieving fine-grained alignment to user intent. To overcome this limitation, we introduce RewardSDS, a novel approach that weights noise samples based on the alignment scores of a reward model, producing a weighted SDS loss. This loss prioritizes gradients from noise samples that yield aligned high-reward output. Our approach is broadly applicable and can be applied to diverse methods extending SDS. In particular, we also demonstrate its applicability to Variational Score Distillation (VSD) by introducing RewardVSD. We evaluate RewardSDS on text-to-image, 2D editing, and text-to-3D generation tasks, demonstrating a significant improvement over SDS and subsequent baselines on a diverse set of metrics measuring generation quality and alignment to desired reward models.

URL: https://openreview.net/forum?id=2aeNk00Yw2

---

Title: Guess-and-Learn (G&L): Measuring the Cumulative Error Cost of Cold-Start Adaptation

Abstract: Evaluation of machine learning models typically emphasizes final accuracy, overlooking the cost of adaptation: the cumulative errors incurred while learning from scratch. Guess-and-Learn (G&L) v1.0 addresses this gap by measuring cold-start adaptability---the total mistakes a model makes while sequentially labeling an unlabeled dataset. At each step, the learner selects an instance, predicts its label, receives the ground truth, and updates parameters under either online (per-sample) or batch (delayed) mode. The resulting error trajectory exposes adaptation speed, selection quality, and bias---dynamics invisible to endpoint metrics. G&L defines four tracks (Scratch/Pretrained $\times$ Online/Batch) to disentangle the effects of initialization and update frequency. We formalize the protocol, relate it to classical mistake-bound theory, and estimate a heuristic ``oracle reference band'' for MNIST as a plausibility reference. Baseline experiments on MNIST and AG~News, spanning classical methods (Perceptron, $k$-NN), convolutional architectures (CNN, ResNet-50), and pretrained transformers (ViT-B/16, BERT-base), reveal systematic differences in early-phase efficiency: smaller models can adapt with fewer initial errors, while pretraining benefits vary by domain. Across settings, current models remain well above the oracle band, highlighting an adaptability gap. By quantifying the mistake cost of early learning, G&L complements conventional benchmarks and provides a reproducible framework for developing learners that are not only accurate in the limit but also reliable from the first examples.
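
A minimal rendering of the G&L online protocol described above (ours, for illustration): sequentially pick an unlabeled instance, guess its label, receive the ground truth, update, and count cumulative mistakes. The toy perceptron-style learner and random selection below are placeholders, not the benchmark's baselines.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X @ rng.normal(size=10) > 0).astype(int)   # hidden ground-truth labels

w = np.zeros(10)
unlabeled = list(range(len(X)))
mistakes = 0
while unlabeled:
    i = unlabeled.pop(int(rng.integers(len(unlabeled))))  # instance selection
    guess = int(X[i] @ w > 0)                             # predict before seeing the label
    if guess != y[i]:
        mistakes += 1
        w += (2 * y[i] - 1) * X[i]                        # perceptron-style online update
print("cumulative errors (G&L cost):", mistakes)
```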

URL: https://openreview.net/forum?id=uNxKhjcRp9

---

Title: Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out-of-Distribution Generalization

Abstract: Recent progress has pushed AI frontiers from pattern-recognition tasks toward problems that require step-by-step, System-2-style reasoning, especially with large language models. Yet, unlike learning, where generalization and out‑of‑distribution (OoD) evaluation concepts are well formalized, there is no clear, consistent definition or metric for reasoning ability. We propose Complexity Out‑of‑Distribution (Complexity OoD) generalization as a framework and problem setting to define and measure reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity, either representational (richer solution structure) or computational (more reasoning steps/program length), exceeds that of all training examples. We formalize complexity via solution description Kolmogorov complexity and operational proxies (e.g., object/relation counts; reasoning‑step counts), clarifying how Complexity OoD differs from length and compositional OoD. This lens unifies learning and reasoning: many cases solvable with System‑1‑like processing at low complexity become System‑2‑like under complexity pressure, while System‑2 can be viewed as generalization over solution structures. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack: incorporating complexity into benchmark and evaluation metric design; rethinking supervision to target solution traces (from final outcomes to process‑level feedback and RL/search); seeking and designing inductive biases for Complexity‑OoD generalization; and addressing learning‑to‑reason spillovers such as spurious shortcuts, semantic robustness, catastrophic forgetting, and step‑wise calibration. In light of recent controversies over LLM reasoning, we put the problem on firm footing: treat reasoning as Complexity OoD, enabling rigorous evaluation and more systematic research.

URL: https://openreview.net/forum?id=07fh13gWs0

---

Title: SPoT: Subpixel Placement of Tokens in Vision Transformers

Abstract: Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.

URL: https://openreview.net/forum?id=Ms13C812AJ

---

Title: Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Abstract: Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While Identity Preference Optimization (IPO) addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $\beta$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $\beta$ values, and its filtering mechanism discards potentially useful training signals.

In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal.

We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a sentiment generation task, where MADPO consistently and significantly outperforms strong baselines across datasets of varying quality. It achieves performance gains of up to +33.3\% on High Quality data and +10.5\% on Low Quality data over the next-best method. Our results establish MADPO as a more robust and principled approach to preference alignment.
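
One way to picture the instance-level re-weighting described above (a rough sketch under our own assumptions; the weighting function and reward-model interface are illustrative, not the paper's):

```python
import torch

def madpo_style_loss(logratio_w, logratio_l, margin, beta=0.1):
    """Per-sample DPO-style loss re-weighted by an estimated preference margin."""
    per_sample = -torch.nn.functional.logsigmoid(beta * (logratio_w - logratio_l))
    weight = torch.sigmoid(-margin)   # hard pairs (small margin) get larger weight
    return (weight * per_sample).mean()

# Toy tensors standing in for policy/reference log-ratios and reward-model margins.
lw, ll, m = torch.randn(8), torch.randn(8), torch.randn(8)
print(madpo_style_loss(lw, ll, m))
```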

URL: https://openreview.net/forum?id=nkulfhphPr

---

Title: GIFT: A Framework for Global Interpretable Faithful Textual Explanations of Vision Classifiers

Abstract: Understanding the decision processes of deep vision models is essential for their safe and trustworthy deployment in real-world settings. Existing explainability approaches, such as saliency maps or concept-based analyses, often suffer from limited faithfulness, local scope, or ambiguous semantics. We introduce GIFT, a post-hoc framework that derives Global, Interpretable, Faithful, and Textual explanations for vision classifiers. GIFT begins by generating a large set of faithful, local visual counterfactuals, then employs vision–language models to translate these counterfactuals into natural-language descriptions of visual changes. These local explanations are aggregated by a large language model into concise, human-readable hypotheses about the model’s global decision rules. Crucially, GIFT includes a verification stage that quantitatively assesses the causal effect of each proposed explanation by performing image-based interventions, ensuring that the final textual explanations remain faithful to the model’s true reasoning process. Across diverse datasets, including the synthetic CLEVR benchmark, the real-world CelebA faces, and the complex BDD driving scenes, GIFT reveals not only meaningful classification rules but also unexpected biases and latent concepts driving model behavior. Altogether, GIFT bridges the gap between local counterfactual reasoning and global interpretability, offering a principled and extensible approach to causally grounded textual explanations for vision models.

URL: https://openreview.net/forum?id=OwhW5MpFmD

---

Title: DASB - Discrete Audio and Speech Benchmark

Abstract: Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge.
Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research.

URL: https://openreview.net/forum?id=vGWrp0NjaE

---

Title: Softmax is $1/2$-Lipschitz: A tight bound across all $\ell_p$ norms

Abstract: The softmax function is a basic operator in machine learning and optimization, used in classification, attention mechanisms, reinforcement learning, game theory, and problems involving log-sum-exp terms. Existing robustness guarantees of learning models and convergence analysis of optimization algorithms typically consider the softmax operator to have a Lipschitz constant of $1$ with respect to the $\ell_2$ norm. In this work, we prove that the softmax function is contractive with the Lipschitz constant $1/2$, uniformly across all $\ell_p$ norms with $p \ge 1$. We also show that the local Lipschitz constant of softmax attains $1/2$ for $p = 1$ and $p = \infty$, and for $p \in (1,\infty)$, the constant remains strictly below $1/2$ and the supremum $1/2$ is achieved only in the limit. To our knowledge, this is the first comprehensive norm-uniform analysis of softmax Lipschitz continuity. We demonstrate how the sharper constant directly improves a range of existing theoretical results on robustness and convergence. We further validate the sharpness of the $1/2$ Lipschitz constant of the softmax operator through empirical studies on attention-based architectures (ViT, GPT-2, Qwen3-8B) and on stochastic policies in reinforcement learning.
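
A quick numerical spot-check of the claimed bound (NumPy; the random-logit sampling below is only for illustration):

import numpy as np

# The softmax Jacobian at z is J = diag(s) - s s^T with s = softmax(z); the abstract's
# claim implies its l2 operator norm never exceeds 1/2.
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(10_000):
    z = rng.normal(scale=5.0, size=int(rng.integers(2, 16)))
    s = np.exp(z - z.max())
    s /= s.sum()
    J = np.diag(s) - np.outer(s, s)
    worst = max(worst, np.linalg.norm(J, 2))     # largest singular value of the Jacobian
print(f"max observed Jacobian 2-norm: {worst:.4f} (claimed bound: 0.5)")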

URL: https://openreview.net/forum?id=6dowaHsa6D

---

Title: Super-fast Rates of Convergence for Neural Network Classifiers under the Hard Margin Condition

Abstract: We study the classical binary classification problem for hypothesis spaces of Deep Neural Networks (DNNs) with ReLU activation under Tsybakov's low-noise condition with exponent $q > 0$, as well as its limit case $q = \infty$, which we refer to as the \textit{hard margin condition}. We demonstrate that DNN solutions to the empirical risk minimization (ERM) problem with square loss surrogate and $\ell_p$ penalty on the weights $(0 < p < \infty)$ can achieve excess risk bounds of order $\mathcal{O}\left(n^{-\alpha}\right)$ for arbitrarily large $\alpha > 1$ under the hard-margin condition, provided that the Bayes regression function $\eta$ satisfies a \textit{distribution-adapted} smoothness condition relative to the marginal data distribution $\rho_X$. Additionally, we establish minimax lower bounds, showing that these rates cannot be improved upon. Our proof relies on a novel decomposition of the excess risk for general ERM-based classifiers, which may be of independent interest.

URL: https://openreview.net/forum?id=HXun3l0Feu

---

Title: Shape Happens: Automatic Feature Manifold Discovery in LLMs via Supervised Multi-Dimensional Scaling

Abstract: The linear representation hypothesis states that language models (LMs) encode concepts as directions in their latent space, forming organized, multidimensional manifolds. Prior efforts focus on discovering specific geometries for specific features, and thus lack generalization. We introduce Supervised Multi-Dimensional Scaling (SMDS), a model-agnostic method to automatically discover feature manifolds. We apply SMDS to temporal reasoning as a case study, finding that different features form various geometric structures such as circles, lines, and clusters. SMDS reveals many insights on these structures: they consistently reflect the properties of the concepts they represent; are stable across model families and sizes; actively support reasoning in models; and dynamically reshape in response to context changes. Together, our findings shed light on the functional role of feature manifolds, supporting a model of entity-based reasoning in which LMs encode and transform structured representations.

URL: https://openreview.net/forum?id=vCKZ40YYPr

---

Title: ENA: Efficient N-dimensional Attention

Abstract: Modeling long sequences of high-order data calls for architectures more efficient than the Transformer. In this paper, we investigate two key aspects of extending linear recurrent models, especially those originally designed for long-context language modeling, to high-order data: scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, while attention-hybrid models yield promising results. Focusing on the latter, we further evaluate types of attention that are worth integrating and find that tiled high-order sliding window attention (SWA) is efficient in both theory and practice. We term the resulting hybrid architecture of linear recurrence and high-order SWA Efficient N-dimensional Attention (ENA), and conduct several experiments to evaluate its effectiveness. With comparable performance to Transformers and higher efficiency, ENA offers a promising and practical solution for ultra-long high-order data modeling.

URL: https://openreview.net/forum?id=bMDc1wMkNe

---

Title: Transferring Human Daily Activity Skills to Surgical Robots via Deep Successor Features

Abstract: We propose a framework for surgical robot task learning that leverages large-scale human Activities of Daily Living (ADL) datasets to mitigate the scarcity of surgical training data. Surgical robot learning is uniquely constrained: datasets are limited, costly, and unsafe to collect at scale. In contrast, the robotics community has curated extensive ADL datasets capturing motor behaviors such as food preparation and tool use. Our key insight is that these datasets encode transferable visuomotor primitives—such as instrument manipulation and hand–eye coordination—that parallel the basic skills underlying surgical maneuvers. Inspired by how surgeons develop expertise by first mastering everyday skills before refining them in the operating room, we leverage ADL data to pretrain representations for surgical robot learning. To address task variability and embodiment differences, we design a modular deep successor feature architecture that learns predictive state representations from ADL tool-use and adapts them to surgical domains. Unlike prior approaches that depend solely on limited surgical data, our framework enables large-scale offline pretraining on abundant non-surgical datasets while supporting efficient reinforcement learning during deployment. We validate the framework on the da Vinci Research Kit (dVRK) in both simulation and real-world settings, showing that pretraining on ADLs accelerates adaptation with limited surgical data and improves sample efficiency compared to imitation learning and reinforcement learning baselines. While our current evaluation emphasizes a subset of fundamental surgical tasks, our results provide initial evidence that ADL pretraining offers a principled and scalable pathway toward data-efficient and safe autonomous surgical robot learning.
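
To make the successor-feature ingredient concrete, here is a minimal tabular sketch (illustrative only; the paper uses a deep, modular architecture and ADL-pretrained features, none of which appear here):

import numpy as np

def td_update_sf(psi, s, a, s_next, a_next, phi, gamma=0.99, lr=0.1):
    # Successor features psi(s, a) accumulate expected discounted future features phi;
    # a task-specific weight vector w then yields Q(s, a) = psi(s, a) . w.
    target = phi[s, a] + gamma * psi[s_next, a_next]
    psi[s, a] += lr * (target - psi[s, a])
    return psi

n_states, n_actions, d = 5, 2, 3
phi = np.random.rand(n_states, n_actions, d)   # hypothetical state-action features
psi = np.zeros((n_states, n_actions, d))
psi = td_update_sf(psi, s=0, a=1, s_next=2, a_next=0, phi=phi)
w = np.array([1.0, 0.0, -0.5])                 # hypothetical task (reward) weights
q_value = psi[0, 1] @ w                        # task-specific value from shared features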

URL: https://openreview.net/forum?id=DoTxvJ3os3

---

Title: A Deep Bayesian Nonparametric Framework for Robust Mutual Information Estimation

Abstract: Mutual Information (MI) is a crucial measure for capturing dependencies between variables, yet exact computation is challenging in high dimensions with intractable likelihoods, impacting accuracy and robustness. One idea is to use an auxiliary neural network to train an MI estimator; however, methods based on the empirical distribution function (EDF) can introduce sharp fluctuations in the MI loss due to poor out-of-sample performance, thereby destabilizing convergence.
We present a Bayesian nonparametric (BNP) solution for training an MI estimator by constructing the MI loss with a finite representation of the Dirichlet process posterior to incorporate regularization in the training process. With this regularization, the MI loss integrates both prior knowledge and empirical data to reduce the loss sensitivity to fluctuations and outliers in the sample data, particularly in small sample settings like mini-batches. This approach balances accuracy and variance: by effectively reducing variance, it stabilizes the MI loss gradients during training, enhances the convergence of the MI approximation, and offers stronger theoretical guarantees. We explore the application of our estimator in maximizing MI between the data space and the latent space of a variational autoencoder. Experimental results demonstrate significant improvements in convergence over EDF-based methods, with applications across synthetic and real datasets, notably in 3D CT image generation, yielding enhanced structure discovery and reduced overfitting in data synthesis. While this paper focuses on generative models in application, the proposed estimator is not restricted to this setting and can be applied more broadly in various BNP learning procedures.

URL: https://openreview.net/forum?id=mqGzGKXnFi

---

Title: Towards Customized Knowledge Distillation for Efficient Dense Image Predictions

Abstract: Efficient dense image prediction (EDIP) models designed for AI chips and trained with the knowledge distillation (KD) framework have been shown to encounter two key challenges, namely maintaining boundary region completeness and ensuring target region connectivity, despite their favorable real-time capacity to recognize the main object regions. In this work, we propose a customized boundary and context knowledge distillation (BCKD) method for EDIPs, which facilitates targeted KD from large, accurate teacher models to compact student models. Specifically, the boundary distillation focuses on extracting explicit object-level boundaries from the hierarchical feature maps to enhance the student model's mask quality in boundary regions. Meanwhile, the context distillation leverages self-relations as a bridge to transfer implicit pixel-level contexts from the teacher model to the student model, ensuring strong connectivity in target regions. Our method is specifically designed for EDIP tasks and is characterized by its simplicity and efficiency. Theoretical analysis and extensive experimental results across semantic segmentation, object detection, and instance segmentation on five representative datasets demonstrate the effectiveness of BCKD, resulting in well-defined object boundaries and smooth connecting regions.

URL: https://openreview.net/forum?id=4verIe3tE4

---

Title: On Convergence of the Alternating Directions Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) Algorithms

Abstract: We study convergence rates of practical Hamiltonian Monte Carlo (HMC) style algorithms where the Hamiltonian motion is approximated with leapfrog integration and where gradients of the log target density are accessed via a stochastic gradient (SG) oracle.
Importantly, our analysis extends to allowing the use of general auxiliary distributions via a novel HMC procedure of alternating directions (AD).

The convergence analysis is based on the investigation of the Dirichlet forms associated with the underlying Markov chain driving the algorithms. For this purpose, we provide a detailed analysis on the error of the leapfrog integrator for Hamiltonian motions when both the kinetic and potential energy functions are in general form. We characterize the explicit dependence of the convergence rates on key parameters such as the problem dimension, functional properties of the target and auxiliary distributions and the quality of the SG oracle. Our analysis also identifies a crucial derivative condition on the log density of the auxiliary distribution, and we show that Gaussians (auxiliaries for standard HMC) as well as common choices of general auxiliaries for ADHMC satisfy this condition.
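
For orientation, a generic leapfrog trajectory with a stochastic-gradient oracle looks as follows (a Gaussian auxiliary is assumed, so the kinetic gradient is simply the momentum; the paper's alternating-directions procedure and general auxiliaries are not shown):

import numpy as np

def leapfrog_sg(theta, p, grad_logpost_sg, step=1e-2, n_steps=10):
    # grad_logpost_sg(theta): noisy estimate of the gradient of the log target density.
    p = p + 0.5 * step * grad_logpost_sg(theta)
    for _ in range(n_steps - 1):
        theta = theta + step * p
        p = p + step * grad_logpost_sg(theta)
    theta = theta + step * p
    p = p + 0.5 * step * grad_logpost_sg(theta)
    return theta, p

# Toy target: standard Gaussian; the oracle adds noise to mimic minibatch gradients.
rng = np.random.default_rng(0)
noisy_grad = lambda th: -th + 0.1 * rng.normal(size=th.shape)
theta, p = leapfrog_sg(np.zeros(3), rng.normal(size=3), noisy_grad)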

URL: https://openreview.net/forum?id=YcXrEtecYa

---

Title: Amplified Patch-Level Differential Privacy for Free via Random Cropping

Abstract: Random cropping is one of the most common data augmentation techniques in computer vision, yet the role of its inherent randomness in training differentially private machine learning models has thus far gone unexplored. We observe that when sensitive content in an image is spatially localized, such as a face or license plate, random cropping can probabilistically exclude that content from the model’s input. This introduces a third source of stochasticity in differentially private training with stochastic gradient descent, in addition to gradient noise and minibatch sampling. This additional randomness amplifies differential privacy without requiring changes to model architecture or training procedure. We formalize this effect by introducing a patch-level neighboring relation for vision data and deriving tight privacy bounds for differentially private stochastic gradient descent (DP-SGD) when combined with random cropping. Our analysis quantifies the patch inclusion probability and shows how it composes with minibatch sampling to yield a lower effective sampling rate. Empirically, we validate that patch-level amplification improves the privacy-utility trade-off across multiple segmentation architectures and datasets. Our results demonstrate that aligning privacy accounting with domain structure and additional existing sources of randomness can yield stronger guarantees at no additional cost.
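
A back-of-the-envelope composition of the two sampling events (the sizes, the border-free placement assumption, and the formula below are simplifications for illustration; the paper derives tight bounds):

# All quantities below are hypothetical.
image, crop, patch = 256, 96, 32                        # square side lengths

positions_per_axis = image - crop + 1                   # uniformly random crop origins
covering_per_axis = crop - patch + 1                    # origins whose crop contains the patch
p_patch = (covering_per_axis / positions_per_axis) ** 2 # patch assumed away from the border

q_batch = 256 / 50_000                                  # minibatch sampling rate
q_eff = q_batch * p_patch                               # lower effective rate for the DP accountant
print(f"patch inclusion probability ~ {p_patch:.3f}, effective sampling rate ~ {q_eff:.1e}")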

URL: https://openreview.net/forum?id=pSWuUF8AVP

---

Title: A Multi-Fidelity Control Variate Approach for Policy Gradient Estimation

Abstract: Many reinforcement learning (RL) algorithms are impractical for deployment in operational systems or for training with computationally expensive high-fidelity simulations, as they require large amounts of data. Meanwhile, low-fidelity simulators—such as reduced-order models, heuristic reward functions, or generative world models—can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a control variate formed from a large volume of low-fidelity simulation data to construct an unbiased, variance-reduced estimator for on-policy policy gradients. We instantiate the framework by developing a practical, multi-fidelity variant of the classical REINFORCE algorithm. We show that under standard assumptions, the MFPG estimator guarantees asymptotic convergence of multi-fidelity REINFORCE to locally optimal policies in the target environment, and achieves faster finite-sample convergence rates compared to training with high-fidelity data alone. Empirically, we evaluate the MFPG algorithm across a suite of simulated robotics benchmark tasks in scenarios with limited high-fidelity data but abundant off-dynamics, low-fidelity data. With mild to moderate dynamics gaps, MFPG reliably improves the median performance over a standard baseline trained with only high-fidelity data, matching the performance of leading multi-fidelity baselines despite its simplicity and minimal tuning overhead. Under large dynamics gaps, MFPG demonstrates the strongest robustness among the evaluated multi-fidelity approaches. An additional experiment shows that MFPG can remain effective even when low-fidelity environments exhibit reward misspecification. Thus, MFPG not only offers a novel paradigm for efficient sim-to-real transfer, but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
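
The control-variate construction at the heart of the framework can be sketched generically as follows (a hand-set coefficient c and toy data stand in for the paper's estimator and baselines):

import numpy as np

def control_variate_grad(g_hi, g_lo, g_lo_big, c=1.0):
    # g_hi:     few expensive high-fidelity per-sample gradient estimates
    # g_lo:     low-fidelity estimates paired with the same trajectories/seeds as g_hi
    # g_lo_big: abundant independent low-fidelity estimates (near-exact low-fidelity mean)
    # The combination is unbiased for E[g_hi] whenever E[g_lo] = E[g_lo_big].
    return g_hi.mean(axis=0) - c * (g_lo.mean(axis=0) - g_lo_big.mean(axis=0))

rng = np.random.default_rng(1)
g_hi = rng.normal(1.0, 2.0, size=(8, 4))                # 8 expensive samples, 4-dim gradient
g_lo = g_hi + rng.normal(0.0, 0.3, size=(8, 4)) - 0.2   # correlated low-fidelity counterpart
g_lo_big = rng.normal(0.8, 2.0, size=(4096, 4))         # cheap large batch, same mean as g_lo
print(control_variate_grad(g_hi, g_lo, g_lo_big))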

URL: https://openreview.net/forum?id=zAo0L7Dcqt

---

Title: Learning Adaptive Multi-Stage Energy-based Prior for Hierarchical Generative Model

Abstract: Hierarchical generative models represent data with multiple layers of latent variables organized in a top-down structure. These models typically assume Gaussian priors for multi-layer latent variables, which lack expressivity for the contextual dependencies among latents, resulting in a distribution gap between the prior and the learned posterior. Recent works have explored hierarchical energy-based prior models (EBMs) as a more expressive alternative to bridge this gap. However, most approaches learn only a \textit{single} EBM, which can be ineffective when the target distribution is highly multi-modal and multi-scale across hierarchical layers of latent variables. In this work, we propose a framework that learns \textit{multi-stage} hierarchical EBM priors, where a sequence of adaptive stages progressively refines the prior to match the posterior. Our method supports both joint training with the generator and a more efficient two-phase strategy for deeper hierarchies. Experiments across standard benchmarks show that our approach consistently generates higher-quality images and learns richer hierarchical representations.

URL: https://openreview.net/forum?id=W2zqUkA9Ub

---

Title: Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

Abstract: Mixture-of-Experts (MoE) has garnered significant attention for its ability to scale up neural networks while utilizing the same or even fewer active parameters. However, MoE does not alleviate the massive memory requirements of networks, which limits their practicality in real-world applications, especially in the era of large language models (LLMs). While recent work explores the possibility of removing entire layers of MoE to reduce memory, the performance degradation is still notable. In this paper, we propose ConDense-MoE (CD-MoE), which, instead of dropping the entire MoE layer, condenses the large, sparse MoE layer into a smaller, denser layer with only a few experts activated for all tokens, while maintaining hardware friendliness. Our approach is specifically designed for fine-grained MoE with shared experts, where Feed-Forward Networks are split into many small experts, with certain experts isolated to serve as shared experts that are always activated, such as DeepSeekMoE and QwenMoE. We demonstrate the effectiveness of our method. Specifically, for the DeepSeekMoE-16B model, our approach maintains 90% of the average accuracy while reducing memory usage by 27.5% and increasing inference speed by 1.26 times. Moreover, we show that by applying lightweight expert fine-tuning—only to the condensed layers—and using 5 hours on a single 80G A100 GPU, we can successfully recover 98% of the original performance.
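
A rough sketch of the condensation step (the selection criterion, the number of kept experts, and the way their outputs are combined are placeholders, not the paper's procedure; shared experts would simply be kept as-is):

import torch
import torch.nn as nn

def condense_moe_layer(experts, routing_counts, keep=2):
    # Keep only a few experts and run them densely for every token.
    kept_ids = torch.topk(routing_counts, keep).indices.tolist()
    kept = nn.ModuleList([experts[i] for i in kept_ids])

    def dense_forward(x):
        # Every token passes through every kept expert; outputs are averaged here.
        return torch.stack([e(x) for e in kept], dim=0).mean(dim=0)

    return dense_forward

experts = nn.ModuleList([nn.Linear(8, 8) for _ in range(16)])
routing_counts = torch.randint(0, 1000, (16,)).float()  # how often each expert was routed to
dense_layer = condense_moe_layer(experts, routing_counts, keep=2)
out = dense_layer(torch.randn(4, 8))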

URL: https://openreview.net/forum?id=BQe6j6sAu6

---

Title: FARM: Enhancing Molecular Representations with Functional Group Awareness

Abstract: We introduce Functional Group-Aware Representations for Small Molecules (FARM), a novel foundation model designed to bridge the gap between SMILES, natural language, and molecular graphs. The key innovation of FARM lies in its functional group (FG) annotation at the atomic level, which enables both FG-enhanced SMILES and FG graphs: SMILES are enriched with FG information to specify which functional group each atom belongs to, while the FG graph captures the molecular backbone by showing how the functional groups are connected. As an example of FG-enhanced SMILES, instead of using "O" to represent all oxygen atoms, we assign specific tokens like "O_ketone" and "O_hydroxyl" to differentiate oxygen atoms belonging to distinct FGs. This tokenization not only injects chemical knowledge into SMILES but also expands the chemical lexicon, effectively bridging the gap between SMILES and natural language in terms of vocabulary size, making the sequences more suitable for Transformer-based models. An example of an FG graph can be seen in methanol (CH3OH), which contains a hydroxyl group attached to a methyl group. In the FG graph, each FG is represented as a node: one for the hydroxyl (OH) and one for the methyl (CH3). FARM then learns molecules from two complementary perspectives to fully encode both functional and structural information. Masked language modeling on FG-enhanced SMILES captures atom-level features enriched with FG context, while graph neural networks encode higher-level molecular topology by representing how FGs are connected. Contrastive learning then aligns these two views into a unified embedding, ensuring that atom-level details and FG-level structure are jointly represented. This dual-level modeling is central to FARM’s ability to predict molecular properties accurately. We rigorously evaluate FARM on the MoleculeNet dataset, achieving state-of-the-art performance on 11 out of 13 tasks, and further validate its generalization on the photostability dataset for quantum mechanics properties. These results highlight FARM’s potential to improve molecular representation learning, demonstrate its strong transfer learning capabilities across drug discovery and material design domains, and make way for broad applications in pharmaceutical research and functional materials development. The code is available at: https://anonymous.4open.science/r/farm_molecular_representation-1E2F
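
The methanol example from the abstract, written out as a toy data structure (the FG assignment is hard-coded here; FARM derives it from atom-level functional-group annotation):

# Plain SMILES tokens for methanol vs. its FG-enhanced tokenization.
plain_tokens = ["C", "O"]
fg_enhanced_tokens = ["C_methyl", "O_hydroxyl"]   # each atom token carries its functional group

# Companion FG graph: one node per functional group, edges for how they are connected.
fg_nodes = ["CH3", "OH"]
fg_edges = [(0, 1)]
print(plain_tokens, "->", fg_enhanced_tokens, "| FG graph:", fg_nodes, fg_edges)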

URL: https://openreview.net/forum?id=2QGXeWfqbX

---

Title: Solving Inverse Problems via Diffusion-Based Priors: An Approximation-Free Ensemble Sampling Approach

Abstract: Diffusion models (DMs) have proven to be effective in modeling high-dimensional distributions, leading to their widespread adoption for representing complex priors in Bayesian inverse problems (BIPs). However, current DM-based posterior sampling methods proposed for solving common BIPs rely on heuristic approximations to the generative process. To exploit the generative capability of DMs, we propose an ensemble-based algorithm that performs posterior sampling without such heuristic approximations. Our algorithm is motivated by existing work that combines DM-based methods with the sequential Monte Carlo (SMC) method. By examining how the prior evolves through the diffusion process encoded by the pre-trained score function, we derive a modified partial differential equation (PDE) governing the evolution of the corresponding posterior distribution. This PDE includes a modified diffusion term and a reweighting term, which can be simulated via stochastic weighted particle methods. Theoretically, we prove that the error between the produced distribution and the true posterior distribution can be bounded in terms of the training error of the pre-trained score function and the number of particles in the ensemble. Empirically, we validate our algorithm on several inverse problems in imaging to show that our method gives more accurate reconstructions compared to existing DM-based methods.

URL: https://openreview.net/forum?id=qN8ASsfjKs

---

Title: Uncertainty estimation in classification via weighted test-time augmentation

Abstract: In classification, deep learning models are considered superior in terms of prediction accuracy compared to standard statistical models. However, these models are often overconfident in their predictions, which inhibits their use in safety-critical applications where mistakes can lead to disastrous consequences. To address this issue, several uncertainty quantification methods have been proposed to provide more reliable predictions and better calibration. In this paper, we focus on a universal uncertainty quantification method called test-time augmentation (TTA). We then present a weighted version of test-time augmentation (WTTA) that introduces weights into the algorithm to generate better augmentations. Our approach is illustrated with various models and data sets. In a simulation study, where the true uncertainties in the data are known, we show that WTTA produces better uncertainty estimates. Furthermore, the method is applied to two benchmark data sets used in the development of machine learning models. Although it is a rather simple post-processing method, WTTA arguably outperforms the standard TTA and temperature scaling methods in terms of calibration error and prediction accuracy, especially on small data sets.
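
A minimal weighted-TTA sketch (the weighting scheme and the uncertainty score below are placeholders chosen for the example, not the paper's):

import numpy as np

def wtta_predict(aug_probs, weights):
    # aug_probs: (n_augmentations, n_classes) softmax outputs for one input
    # weights:   nonnegative weights, e.g., favoring milder augmentations
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    probs = (w[:, None] * aug_probs).sum(axis=0)   # weighted average of predictive distributions
    return probs, 1.0 - probs.max()                # predictive distribution and a crude uncertainty

aug_probs = np.array([[0.7, 0.3], [0.6, 0.4], [0.4, 0.6]])   # three augmented views
print(wtta_predict(aug_probs, weights=[0.5, 0.3, 0.2]))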

URL: https://openreview.net/forum?id=zE4f3Gspfa

---

Title: Uncovering the Redundancy in Transformers via a Unified Study of Layer Dropping

Abstract: While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different Transformer modules, including blocks, MLP layers, and attention layers, through the lens of layer dropping. Surprisingly, despite the pivotal role of attention mechanisms in distinguishing Transformers from other architectures, we find that a large portion of attention layers exhibit excessively high redundancy and can be pruned without degrading performance. For example, LLaMA-3-70B achieves a 43.4\% speedup with only a 1.8\% drop in performance by pruning half of its attention layers. In contrast, dropping MLP layers severely impairs the model's ability to distinguish between tokens, leading to catastrophic performance degradation. Moreover, our analysis reveals that attention layer redundancy persists not only throughout training but is also evident in randomly initialized models. We attribute this redundancy to three key factors that constrain representational updates from attention layers: sparse attention patterns, over-smoothed token embeddings, and the low representational magnitude of attention outputs. Overall, our findings offer valuable insights into the internal redundancy of Transformer architectures and provide practical guidance for designing more efficient LLMs. Code will be released upon acceptance.
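
A sketch of what attention-layer dropping amounts to, on a generic pre-norm block (the attribute names norm2 and mlp are assumptions about the block layout, not a specific model's API):

import torch
import torch.nn as nn

class AttnDroppedBlock(nn.Module):
    # Wraps a Transformer block so its attention sublayer is skipped entirely,
    # keeping only the residual MLP path.
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return x + self.block.mlp(self.block.norm2(x))

class ToyBlock(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads=2, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

y = AttnDroppedBlock(ToyBlock())(torch.randn(2, 10, 16))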

URL: https://openreview.net/forum?id=1I7PCbOPfe

---

Title: Large-Scale Constraint Generation - Can LLMs Parse Hundreds of Constraints?

Abstract: Recent research has explored the constrained generation capabilities of Large Language Models (LLMs) when explicitly prompted with a few task-specific requirements. In contrast, we introduce Large-Scale Constraint Generation (LSCG), a new problem that evaluates whether LLMs can parse a large, fine-grained, generic list of constraints. To examine the LLMs’ ability to handle an increasing number of constraints, we create a practical instance of LSCG, called Words Checker. In Words Checker, we evaluate the impact of model characteristics (e.g., size, family) and steering techniques (e.g., Simple Prompt, Chain of Thought, Best of N) on performance. We also propose FoCusNet, a small and dedicated model that parses the original list of constraints into a smaller subset, helping the LLM focus on relevant constraints. Experiments reveal that existing solutions suffer a significant performance drop as the number of constraints increases, with FoCusNet showing an 8-13% accuracy boost.

URL: https://openreview.net/forum?id=g4Y7YiIfoZ

---

Title: Learning Encoding-Decoding Direction Pairs to Unveil Concepts of Influence in Deep Vision Networks

Abstract: Empirical evidence shows that deep vision networks represent concepts as directions in latent space, vectors which we call concept embeddings. For each concept, a latent factor—a scalar—indicates the degree of its presence in an input patch. For a given patch, the latent factors of multiple concepts are encoded into a compact vector representation by linearly combining concept embeddings, with the latent factors serving as coefficients. Since these embeddings enable such encoding, we refer to them as encoding directions. A latent factor can be recovered from the representation by taking the inner product with a filter, a vector which we call a decoding direction. These encoding-decoding direction pairs are not directly accessible, but recovering them unlocks significant potential to open the black-box nature of deep networks, enabling understanding, debugging, and improving deep learning models. Decoding directions help attribute meaning to latent codes, while encoding directions help assess the influence of the concept on the predictions, and both directions may assist model correction by unlearning concepts irrelevant to the network's prediction task. Compared to previous matrix decomposition, autoencoder, and dictionary learning approaches which rely on the reconstruction of feature activations, we propose a different perspective to learn these direction pairs. We base identifying the decoding directions on directional clustering of feature activations and introduce signal vectors to estimate encoding directions under a probabilistic perspective. Unlike most other works, we also take advantage of the knowledge encoded in the weights of the network to guide our direction search. For this, we illustrate that a novel technique called \textit{Uncertainty Region Alignment} can exploit this knowledge to effectively reveal interpretable directions that influence the network's predictions. We perform a thorough and multifaceted comparative analysis to offer insights on the fidelity of direction pairs, the advantages of the method compared to other unsupervised direction learning approaches, and how the learned directions compare in relation to those learned with supervision. We find that: a) In controlled settings with synthetic data, our approach is effective in recovering the ground-truth encoding-decoding direction pairs; b) In real-world settings, the decoding directions correspond to monosemantic interpretable concepts, often scoring substantially better in interpretability metrics than other unsupervised baselines; c) In the same settings, signal vectors are faithful estimators of the concept encoding directions validated with a novel approach based on activation maximization. At the application level, we provide examples that demonstrate how the learned directions can help to a) understand global model behavior; b) explain individual sample predictions in terms of local, spatially-aware, concept contributions; and c) intervene on the network's prediction strategy to provide either counterfactual explanations or correct erroneous model behavior.
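
A toy linear-algebra illustration of the encoding/decoding picture described above (the directions here are synthetic and exactly paired via a pseudo-inverse; the paper learns them from feature activations and network weights):

import numpy as np

rng = np.random.default_rng(0)
d_model, n_concepts = 16, 4
E = rng.normal(size=(d_model, n_concepts))   # columns: concept encoding directions
D = np.linalg.pinv(E)                        # rows: matching decoding directions (filters)

a = np.array([1.5, 0.0, -0.7, 2.0])          # latent factors for one image patch
z = E @ a                                    # compact representation: weighted sum of encodings
recovered = D @ z                            # inner products with decoding directions
print(np.allclose(recovered, a))             # True: the factors are read back out exactly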

URL: https://openreview.net/forum?id=lIeyZpPEJn

---

Title: LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Abstract: Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and the intermediate layer activations they produce exhibit significantly higher statistical variance and entropy compared to text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly over different layers, with some layers having lower-entropy activation distributions. We empirically show that such layers can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. Additionally, we show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40\% and 31\% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10\% on the MME benchmark.
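
A sketch of the layer-selection idea (the histogram-based entropy, the fixed fraction, and the bit widths are placeholders, not the paper's procedure):

import numpy as np

def activation_entropy(acts, bins=256):
    hist, _ = np.histogram(acts.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def assign_bitwidths(layer_acts, ultra_low_fraction=0.4, low_bits=2, default_bits=4):
    # Rank layers by the entropy of their activations and push the lowest-entropy
    # layers (assumed more tolerant) to ultra-low bit width.
    ent = {name: activation_entropy(a) for name, a in layer_acts.items()}
    order = sorted(ent, key=ent.get)                     # lowest entropy first
    k = int(len(order) * ultra_low_fraction)
    return {name: (low_bits if name in order[:k] else default_bits) for name in order}

# Synthetic activations with different distribution shapes per layer.
layer_acts = {
    f"layer_{i}": (np.random.laplace(size=4096) if i % 2 else np.random.randn(4096))
    for i in range(6)
}
print(assign_bitwidths(layer_acts))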

URL: https://openreview.net/forum?id=3eK6U6ZiSp

---

Title: Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

Abstract: Human perception of similarity across uni- and multi-modal inputs is highly complex, making it challenging to develop automated metrics that accurately mimic it. While general-purpose vision-language models (VLMs) like CLIP and large multi-modal models (LMMs) can serve as zero-shot perceptual metrics, they are not explicitly trained for this task. As a result, recent efforts have developed specialized models for narrow perceptual tasks. However, the extent to which these metrics align with human perception remains unclear. To address this, we introduce UniSim-Bench, a benchmark covering seven multi-modal perceptual similarity tasks across 25 datasets. Our evaluation reveals that models fine-tuned on a specific dataset struggle to generalize to unseen datasets within the same task or to related perceptual tasks. As a first step towards a unified multi-task perceptual similarity metric, we fine-tune both encoder-based and generative vision-language models on a subset of UniSim-Bench tasks. This approach achieves the highest average performance and, in some cases, surpasses task-specific models. Our comparative analysis demonstrates that encoder-based VLMs exhibit superior generalization capabilities as perceptual metrics. However, these models still struggle with unseen tasks, underscoring the challenge of developing a robust, unified metric that accurately captures the human notions of similarity.

URL: https://openreview.net/forum?id=g1rIemDcNC

---

Title: Denoising Hamiltonian Network for Physical Reasoning

Abstract: Machine learning frameworks for physical problems must capture and enforce physical constraints that preserve the structure of dynamical systems. Many existing approaches achieve this by integrating physical operators into neural networks. While these methods offer theoretical guarantees, they face two key limitations: (i) they primarily model local relations between adjacent time steps, overlooking longer-range or higher-level physical interactions, and (ii) they focus on forward simulation while neglecting broader physical reasoning tasks. We propose the Denoising Hamiltonian Network (DHN), a novel framework that generalizes Hamiltonian mechanics operators into more flexible neural operators. DHN captures non-local temporal relationships and mitigates numerical integration errors through a denoising mechanism. DHN also supports multi-system modeling with a global conditioning mechanism. We demonstrate its effectiveness and flexibility across three diverse physical reasoning tasks with distinct inputs and outputs.

URL: https://openreview.net/forum?id=KublEgx7Hv

---

Title: LJ-Bench: Ontology-Based Benchmark for U.S. Crime

Abstract: The potential of Large Language Models (LLMs) to provide harmful information remains a significant concern due to the vast breadth of illegal queries they may encounter. Unfortunately, existing benchmarks cover only a handful of types of illegal activity and are not grounded in legal frameworks. In this work, we introduce an ontology of crime-related concepts grounded in the Model Penal Code, an influential reference for criminal law that has been adopted by many U.S. states, and instantiated using California law. This structured knowledge forms the foundation for LJ-Bench, the first comprehensive benchmark designed to evaluate LLM robustness against a wide range of illegal activities. Spanning 76 distinct crime types organized taxonomically, LJ-Bench enables systematic assessment of diverse attacks, revealing valuable insights into LLM vulnerabilities across crime categories: LLMs exhibit heightened susceptibility to attacks targeting societal harm rather than those directly impacting individuals. Our benchmark aims to facilitate the development of more robust and trustworthy LLMs. The LJ-Bench benchmark and LJ-Ontology, along with the experiment implementations for reproducibility, are publicly available at https://anonymous.4open.science/r/LJ-Bench-TMLR-2025/.

URL: https://openreview.net/forum?id=gsWEbyzFl2

---

Title: Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

Abstract: The ratio of "outlier" parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) adaptation inherently more challenging than cross-domain adaptation. As a result, many previous studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a "bridge training" stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, which requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.

URL: https://openreview.net/forum?id=N7DSUbnzYo

---
