Daily TMLR digest for Feb 23, 2026


TMLR

Feb 23, 2026, 12:30:09 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Authors: Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Raghuveer Thirukovalluru, Xuan Zhang, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz

Abstract: Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embedding models such as VLM2Vec, E5-V, and GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, retrieval-augmented generation (RAG) systems, and recommendation systems. To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering -- spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 not only achieves strong performance on the newly introduced video and document retrieval tasks but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

URL: https://openreview.net/forum?id=TpU38jbKIJ

---

Title: LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks

Authors: Dominik J. Mühlematter, Michelle Halbheer, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, Mehmet Ozgur Turkoglu

Abstract: Numerous real-world decisions rely on machine learning algorithms and require calibrated uncertainty estimates. However, modern methods often yield overconfident, uncalibrated predictions. The dominant approach to quantifying the uncertainty inherent in the model is to train an ensemble of separate predictors and measure their empirical variance. In an explicit implementation, the ensemble has a high computational cost and memory footprint, especially if the base model itself is already large, like modern transformers. This motivates efforts to develop implicit ensemble methods that emulate the ensemble without explicitly instantiating all its members. We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks. It is based on Low-Rank Adaptation (LoRA), originally developed for efficient LLM fine-tuning, and extends it into an implicit ensembling scheme, where all ensemble members share the same, pre-trained self-attention network, but have individual low-rank matrices for the attention projections. The resulting method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble, while at the same time achieving superior calibration.
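The core mechanism -- one shared, frozen backbone weight with per-member low-rank adapters, whose prediction spread yields an uncertainty estimate -- can be sketched in a few lines. This NumPy toy is illustrative only, not the authors' implementation; the dimensions, the `member_forward` helper, and the simple averaging scheme are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_members = 16, 8, 2, 4

# Shared, frozen "pre-trained" projection weight (stands in for one
# attention projection of the self-attention network).
W = rng.normal(size=(d_out, d_in))

# Each ensemble member only owns its low-rank factors A_i, B_i.
A = rng.normal(scale=0.1, size=(n_members, d_out, rank))
B = rng.normal(scale=0.1, size=(n_members, rank, d_in))

def member_forward(x, i):
    """Forward pass of member i: shared W plus its low-rank update."""
    return (W + A[i] @ B[i]) @ x

x = rng.normal(size=d_in)
preds = np.stack([member_forward(x, i) for i in range(n_members)])

mean = preds.mean(axis=0)          # ensemble prediction
uncertainty = preds.var(axis=0)    # empirical variance as uncertainty

# Parameter cost: one shared W vs. n_members small (A_i, B_i) pairs.
shared = W.size
per_member = A[0].size + B[0].size
assert per_member < shared
```

The point of the sketch is the cost asymmetry: every member reuses the same W, so adding a member costs only rank * (d_in + d_out) extra parameters rather than a full copy of the network.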

URL: https://openreview.net/forum?id=yhXXmOMpSQ

---

Title: DiffKGW: Stealthy and Robust Diffusion Model Watermarking

Authors: Tianxin Wei, Ruizhong Qiu, Yifan Chen, Yunzhe Qi, Jiacheng Lin, Wenxuan Bao, Wenju Xu, Sreyashi Nag, Ruirui Li, Hanqing Lu, Zhengyang Wang, Chen Luo, Hui Liu, Suhang Wang, Jingrui He, Qi He, Xianfeng Tang

Abstract: Diffusion models are known for their remarkable capability to generate realistic images. However, ethical concerns, such as copyright protection and the generation of inappropriate content, pose significant challenges for the practical deployment of diffusion models. Recent work has proposed a flurry of watermarking techniques that inject artificial patterns into the initial latent representations of diffusion models, offering a promising solution to these issues. However, enforcing a specific pattern on selected elements can disrupt the Gaussian distribution of the initial latent representation. Inspired by watermarks for large language models (LLMs), we generalize the LLM KGW watermark to image diffusion models and propose DiffKGW, a stealthy probability adjustment approach that preserves the Gaussian distribution of the initial latent representation. In addition, we dissect the design principles of state-of-the-art watermarking techniques and introduce a unified framework. We identify a set of dimensions that explain the manipulation enforced by watermarking methods, including the distribution of individual elements, the specification of watermark shapes within each channel, and the choice of channels for watermark embedding. Through empirical studies on regular text-to-image applications and the first systematic attempt at watermarking image-to-image diffusion models, we verify the effectiveness of our proposed framework through comprehensive evaluations. Across all tested diffusion models, including Stable Diffusion, the approach derived from our framework not only preserves image quality but also outperforms existing methods in robustness against a wide range of attacks.

URL: https://openreview.net/forum?id=OXi9vcIOgD

---

Title: MiniGPT-Med: A Unified Vision-Language Model for Radiology Image Understanding

Authors: Asma Alkhaldi, Raneem Alnajim, Layan Alabdullatef, Rawan Alyahya, Jun Chen, Deyao Zhu, Ahmed Z. Alsinan, Mohamed Elhoseiny

Abstract: Recent advances in artificial intelligence (AI) have precipitated significant breakthroughs in healthcare, particularly in the refinement of diagnostic procedures. However, existing studies have been limited in terms of functional coverage. This study introduces MiniGPT-Med, a vision-language model adapted from MiniGPT-v2 for medical applications through domain-specific fine-tuning on medical datasets. MiniGPT-Med demonstrates remarkable versatility across various imaging modalities, including X-rays, CT scans, and MRIs, enhancing its utility. The model is capable of performing tasks such as medical report generation, visual question answering (VQA), and disease identification within medical imagery. Its integrated processing of both image and textual clinical data markedly improves diagnostic accuracy. Our empirical assessments confirm the superior performance of MiniGPT-Med in disease detection, medical report generation, and VQA benchmarks, representing a significant step towards reducing the gap in assisting radiology practice. Furthermore, it achieves state-of-the-art performance in medical report generation, with substantial gains in BERT-Sim over both specialist and generalist baselines, improving by 17 and 12 points, respectively. MiniGPT-Med promises to become a unified Vision-Language model for radiology diagnoses, enhancing diagnostic efficiency across a wide range of medical imaging applications.

URL: https://openreview.net/forum?id=NenHFEg1Di

---

Title: Augmenting Molecular Graphs with Geometries via Machine Learning Interatomic Potentials

Authors: Cong Fu, Yuchao Lin, Zachary Krueger, Haiyang Yu, Maho Nakata, Jianwen Xie, Emine Kucukbenli, Xiaofeng Qian, Shuiwang Ji

Abstract: Accurate molecular property predictions require 3D geometries, which are typically obtained using expensive methods such as density functional theory (DFT). Here, we attempt to obtain molecular geometries by relying solely on machine learning interatomic potential (MLIP) models. To this end, we first curate a large-scale molecular relaxation dataset comprising 3.5 million molecules and 300 million snapshots. Then MLIP pre-trained models are trained with supervised learning to predict energy and forces given 3D molecular structures. Once trained, we show that the pre-trained models can be used in different ways to obtain geometries either explicitly or implicitly. First, it can be used to obtain approximate low-energy 3D geometries via geometry optimization. While these geometries do not consistently reach DFT-level chemical accuracy or convergence, they can still improve downstream performance compared to non-relaxed structures. To mitigate potential biases and enhance downstream predictions, we introduce geometry fine-tuning based on the relaxed 3D geometries. Second, the pre-trained models can be directly fine-tuned for property prediction when ground truth 3D geometries are available. Our results demonstrate that MLIP pre-trained models trained on relaxation data can learn transferable molecular representations to improve downstream molecular property prediction and can provide practically valuable but approximate molecular geometries that benefit property predictions. Our code is publicly available at: https://github.com/divelab/AIRS/.

URL: https://openreview.net/forum?id=JwxhHTISJL

---

Title: From Feature Visualization to Visual Circuits: Effect of Model Perturbation

Authors: geraldin nanfack, Michael Eickenberg, Eugene Belilovsky

Abstract: Understanding the inner workings of large-scale deep neural networks is challenging yet crucial in several high-stakes applications. Mechanistic interpretability is an emergent field that tackles this challenge, often by identifying human-understandable subgraphs in deep neural networks known as circuits. In vision-pretrained models, these subgraphs are typically interpreted by visualizing their node features through a popular technique called feature visualization. Recent works have analyzed the stability of different feature visualization types under the adversarial model manipulation framework, where models are subtly perturbed to alter their interpretations while maintaining performance. However, existing model manipulation methods have two key limitations: (1) they manipulate either synthetic or natural feature visualizations individually, but not both simultaneously, and (2) no work has studied whether circuit-based interpretations are vulnerable to such manipulations.
This paper exposes these vulnerabilities by proposing a novel attack called ProxPulse that simultaneously manipulates both types of feature visualizations. Surprisingly, we find that visual circuits exhibit some robustness to ProxPulse. We therefore introduce CircuitBreaker, the first attack targeting entire circuits, which successfully manipulates circuit interpretations, revealing that circuits also lack robustness. The effectiveness of these attacks is validated across a range of pre-trained models, from smaller architectures like AlexNet to medium-scale models like ResNet-50, and larger ones such as ResNet-152 and DenseNet-201 on ImageNet. ProxPulse changes both visualization types with <1% accuracy drop, while our CircuitBreaker attack manipulates visual circuits with attribution correlation scores dropping from near-perfect to ~0.6 while preserving circuit head functionality.

URL: https://openreview.net/forum?id=x6ZwuyTy65

---


New submissions
===============


Title: Issues with Value-Based Multi-objective Reinforcement Learning: Value Function Interference and Overestimation Sensitivity

Abstract: Multi-objective reinforcement learning (MORL) algorithms extend conventional reinforcement learning (RL) to the more general case of problems with multiple, conflicting objectives, represented by vector-valued rewards. Widely-used scalar RL methods such as Q-learning can be modified to handle multiple objectives by (1) learning vector-valued value functions, and (2) performing action selection using a scalarisation or ordering operator which reflects the user's preferences with respect to the different objectives. This paper investigates two previously unreported issues that can hinder the performance of value-based MORL algorithms when applied in conjunction with a non-linear utility function -- value function interference and sensitivity to overestimation. We illustrate the nature of these phenomena on simple multi-objective MDPs using a tabular implementation of multi-objective Q-learning.
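The value-function-interference issue can be illustrated with a toy tabular setup: vector-valued Q-estimates are scalarised by a non-linear utility only at action-selection time, so averaged estimates can misrepresent every underlying outcome. The sketch below is ours, not the paper's code; the `utility` function (a min over objectives) and all constants are illustrative assumptions:

```python
import numpy as np

# Vector-valued Q: Q[s, a] is a 2-d return estimate (one entry per objective).
n_states, n_actions, n_obj = 3, 2, 2
Q = np.zeros((n_states, n_actions, n_obj))

def utility(v):
    # A non-linear utility; favouring the worst objective is one example.
    return v.min()

def select_action(s):
    # Scalarise the vector estimates at action-selection time.
    return int(np.argmax([utility(Q[s, a]) for a in range(n_actions)]))

def update(s, a, r_vec, s_next, alpha=0.1, gamma=0.9):
    # One tabular multi-objective Q-learning update.
    a_next = select_action(s_next)
    Q[s, a] += alpha * (r_vec + gamma * Q[s_next, a_next] - Q[s, a])

update(0, 0, np.array([1.0, 0.0]), 1)

# Value-function interference in miniature: if an action yields the two
# outcomes [1, 0] and [0, 1] with equal probability, Q converges to their
# mean [0.5, 0.5]; utility(mean) = 0.5 even though the utility of each
# underlying return is 0 -- the averaged estimate reflects no real outcome.
mixed = 0.5 * np.array([1.0, 0.0]) + 0.5 * np.array([0.0, 1.0])
assert utility(mixed) == 0.5
```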

URL: https://openreview.net/forum?id=KImrufIw0L

---

Title: DMT-JEPA: Learning Discriminative Masked Targets for Joint-Embedding Predictive Architecture

Abstract: The joint-embedding predictive architecture (JEPA) recently has shown impressive results in extracting visual representations from unlabeled imagery under a masking strategy. However, we reveal its disadvantages, notably its insufficient understanding of local semantics. This deficiency originates from masked modeling in the embedding space, resulting in a reduction of discriminative power and can even lead to the neglect of critical local semantics. To bridge this gap, we introduce DMT-JEPA, a novel masked modeling objective rooted in JEPA, specifically designed to generate discriminative latent targets from neighboring information. Our key idea is simple: we consider a set of semantically similar neighboring patches as a target of a masked patch. To be specific, the proposed DMT-JEPA (a) computes feature similarities between each masked patch and its corresponding neighboring patches to select patches having semantically meaningful relations, and (b) employs lightweight cross-attention heads to aggregate features of neighboring patches as the masked targets. Consequently, DMT-JEPA highlights that increased discriminative power of target representations benefits a diverse spectrum of downstream tasks. Through extensive experiments, we demonstrate our effectiveness across various visual benchmarks, including ImageNet-1K image classification, ADE20K semantic segmentation, and COCO object detection tasks.
Code is available at: https://anonymous.4open.science/r/DMT-JEPA-anony.

URL: https://openreview.net/forum?id=73demKsXn4

---

Title: Towards Scalable Explainable AI: Using Vision-Language Models to Interpret Vision Systems

Abstract: Explainable AI (xAI) is increasingly important for the trustworthy deployment of vision models in domains such as medical imaging, autonomous driving, and safety-critical systems. However, modern vision models are typically trained on massive datasets, making it nearly impossible for researchers to manually track how models learn from each sample, especially when relying on saliency maps that require intensive visual inspection. Traditional xAI methods, while useful, often focus on the instance-level explanation and risk losing important information about model behavior at scale, leaving analysis time-consuming, subjective, and difficult to reproduce. To overcome these challenges, we propose an automated evaluation pipeline that leverages Vision-Language Models to analyze vision models at both the sample and dataset levels. Our pipeline systematically assesses, generates, and interprets saliency-based explanations, aggregates them into structured summaries, and enables scalable discovery of failure cases, biases, and behavioral trends. By reducing reliance on manual inspection while preserving critical information, the proposed approach facilitates more efficient and reproducible xAI research, supporting the development of robust and transparent vision models.

URL: https://openreview.net/forum?id=Ta2cvwmlVb

---

Title: DINOv3

Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images, using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

URL: https://openreview.net/forum?id=2NlGyqNjns

---

Title: UniRec: Unified Multimodal Encoding for LLM-Based Recommendations

Abstract: Large language models (LLMs) have recently shown promise for multimodal recommendation, particularly with text and image inputs. Yet real-world recommendation signals extend far beyond these modalities. To reflect this, we formalize recommendation features into four modalities: text, images, categorical features, and numerical attributes, and emphasize the unique challenges this heterogeneity poses for LLMs in understanding multimodal information. In particular, these challenges arise not only across modalities but also within them, as attributes (e.g., price, rating, time) may all be numeric yet carry distinct meanings. Beyond this intra-modality ambiguity, another major challenge is the nested structure of recommendation signals, where user histories are sequences of items, each carrying multiple attributes. To address these challenges, we propose UniRec, a unified multimodal encoder for LLM-based recommendation. UniRec first employs modality-specific encoders to produce consistent embeddings across heterogeneous signals. It then applies a triplet representation—comprising attribute name, type, and value—to separate schema from raw inputs and preserve semantic distinctions. Finally, a hierarchical Q-Former models the nested structure of user interactions while maintaining their layered organization. On multiple real-world benchmarks, UniRec outperforms state-of-the-art multimodal and LLM-based recommenders by up to 15%, while extensive ablation studies further validate the contributions of each component.

URL: https://openreview.net/forum?id=WXE255GWhQ

---

Title: Demystifying MaskGIT Sampler and Beyond: Adaptive Order Selection in Masked Diffusion

Abstract: Masked diffusion models have shown promising performance in generating high-quality samples in a wide range of domains, but accelerating their sampling process remains relatively underexplored. To investigate efficient samplers for masked diffusion, this paper theoretically analyzes the MaskGIT sampler for image modeling, revealing its implicit temperature sampling mechanism. Through this analysis, we show that MaskGIT is asymptotically equivalent to a choose-then-sample (CTS) formulation, instantiated as the “moment sampler,” which explicitly separates index selection from token sampling. This CTS reformulation is essential: it yields unbiased token sampling and exposes an algorithmic design space for index selection, both of which are inaccessible in MaskGIT’s original formulation. Regarding token sampling, we reveal that MaskGIT implicitly adopts a low-temperature sampler, which explains why MaskGIT often degrades with more sampling steps. The CTS reformulation of MaskGIT allows us to fix the temperature sampling to ensure unbiasedness. We also improve the index selection in CTS through two key innovations: a partial caching technique for transformers that approximates longer sampling trajectories without proportional computational cost, and a hybrid approach formalizing the exploration-exploitation trade-off in adaptive unmasking. Experiments in image and text domains support our theory and demonstrate the efficiency of our proposed methods, advancing both theoretical understanding and practical implementation of masked diffusion samplers.
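The choose-then-sample (CTS) idea -- select which indices to unmask first, then draw tokens from the unmodified (temperature-1) categorical so token sampling stays unbiased -- can be sketched as follows. This is a toy illustration with a random stand-in for the network, not the paper's sampler; the confidence-based index selection is just one plausible instantiation:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab = 8, 5
MASK = -1
tokens = np.full(seq_len, MASK)

def model_probs(tokens):
    """Stand-in for the network: per-position categorical distributions."""
    logits = rng.normal(size=(seq_len, vocab))
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

n_steps = 4
per_step = seq_len // n_steps
for _ in range(n_steps):
    probs = model_probs(tokens)
    masked = np.flatnonzero(tokens == MASK)
    # Choose: pick the most confident masked positions (index selection).
    conf = probs[masked].max(axis=1)
    chosen = masked[np.argsort(-conf)[:per_step]]
    # Then sample: draw from the *unmodified* categorical, i.e.
    # temperature-1 sampling, keeping token sampling unbiased.
    for i in chosen:
        tokens[i] = rng.choice(vocab, p=probs[i])

assert (tokens != MASK).all()
```

The contrast with MaskGIT's original formulation is that confidence only decides *where* to commit next, never *which token* is written; the token draw itself is left at temperature 1.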

URL: https://openreview.net/forum?id=mKlW68i2Ig

---

Title: Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

Abstract: Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
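The reported entropy conversion follows directly if one assumes a Gaussian residual at the loss floor, since the differential entropy of N(0, c) is 0.5 * log2(2*pi*e*c). A minimal sketch of both the parametric law and the conversion; the power-law coefficients a and alpha below are illustrative placeholders, and only the floor c = 1.44 comes from the abstract:

```python
import numpy as np

# Parametric scaling law L(N) = c + a * N**(-alpha): an irreducible loss
# floor c plus a power-law term in parameter count N.
a, alpha, c = 50.0, 0.3, 1.44       # a, alpha illustrative; c from the paper
N = np.logspace(2.7, 8.5, 7)        # roughly 533 to 3.4e8 parameters
mse = c + a * N ** (-alpha)

# Converting the fitted floor to bits under a Gaussian residual model:
# differential entropy of N(0, c) is 0.5 * log2(2 * pi * e * c).
bits_per_position = 0.5 * np.log2(2 * np.pi * np.e * c)
print(round(bits_per_position, 2))  # ~2.31 bits per masked gene position
```

With c = 1.44 this comes out to about 2.31 bits, consistent with the abstract's ~2.30-bit estimate, which suggests the Gaussian-residual reading of the floor is the intended conversion.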

URL: https://openreview.net/forum?id=a8rUQqionr

---

Title: Time-Aware Prior Fitted Networks for Zero-Shot Forecasting with Exogenous Variables

Abstract: In many time series forecasting settings, the target time series is accompanied by exogenous covariates, such as promotions and prices in retail demand; temperature in energy load; calendar and holiday indicators for traffic or sales; and grid load or fuel costs in electricity pricing. Ignoring these exogenous signals can substantially degrade forecasting accuracy, particularly when they drive spikes, discontinuities, or regime and phase changes in the target series. Most current time series foundation models (e.g., Chronos, Sundial, TimesFM, TimeMoE, TimeLLM, and LagLlama) ignore exogenous covariates and make forecasts solely from the numerical time series history, thereby limiting their performance. In this paper, we develop ApolloPFN, a prior-data fitted network (PFN) that is time-aware (unlike prior PFNs) and that natively incorporates exogenous covariates (unlike prior univariate forecasters). Our design introduces two major advances: (i) a synthetic data generation procedure tailored to resolve the failure modes that arise when tabular (non-temporal) PFNs are applied to time series; and (ii) time-aware architectural modifications that embed inductive biases needed to exploit the time series context. We demonstrate that ApolloPFN achieves state-of-the-art results across benchmarks, such as M5 and electric price forecasting, that contain exogenous information.

URL: https://openreview.net/forum?id=nJARpxp3cF

---

Title: A Unified Framework with Environmental and Interaction Uncertainty for Robust Multi-Agent Reinforcement Learning

Abstract: Multi-agent reinforcement learning (MARL) has achieved remarkable success across diverse domains, yet its robustness remains hindered by various inherent uncertainties arising from multi-agent systems. Although previous studies have explored robustness in MARL, most of them focus on a single type of uncertainty, without a unified framework to handle multiple sources simultaneously. As a result, their methods often fail to remain robust when exposed to diverse and interacting disturbances. To address this limitation, we propose a unified framework that explicitly models two complementary sources of uncertainty: environmental uncertainty, caused by stochastic dynamics, and interaction uncertainty, arising from the unpredictable behaviors of other agents. We capture these factors using hierarchical entropy-based uncertainty sets, which are then integrated into the robust Markov game formulation. This hierarchical design enables the framework to distinguish the distinct impacts of each uncertainty source while avoiding the excessive conservatism of treating them as a single unified set. On top of this formulation, we introduce the solution concept of an Aleatoric Robust Equilibrium (ARE), where each agent optimizes its policy against worst-case scenarios derived from the hierarchical sets. To compute the ARE, we develop specialized actor–critic algorithms with theoretical convergence guarantees. Extensive experiments in both the multi-agent particle environment (MPE) and the multi-agent MuJoCo benchmark show that our approach achieves consistently superior robustness and performance across a wide range of uncertainty settings.

URL: https://openreview.net/forum?id=DMllImVr8k

---

Title: Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection

Abstract: Zero-shot anomaly detection/localization trains on a source domain and discriminates images from unseen target domains given only textual prompts (e.g., "normal" vs. "anomaly"); therefore, performance hinges on generalization. Recent methods build on CLIP for its strong zero-shot generalization; however, as we show, localization has not improved as much as detection and, especially for small regions, remains near random, with AUPRO close to chance, indicating weak pixel-level generalization. We attribute this to CLIP’s limited ability to retain fine-grained features in its vision encoder and insufficient alignment between the text encoder and dense visual features, which have not been effectively addressed in previous methods. To address these challenges, first, we replace CLIP’s vision encoder with an adapted vision encoder that uses a correlation-based attention module to better preserve fine-grained features and small details. Second, we boost text–vision alignment by conditioning the learnable prompts in the text encoder on image context extracted from the vision encoder and performing local-to-global representation fusion, further improving localization. Finally, we show that our correlation-based attention module can incorporate feature correlations from additional models such as DINOv2, further enhancing spatial understanding and localization. We call our model Crane (Context-Guided Prompt Learning and Attention Refinement) and its DINOv2-boosted variant Crane+ and show that it improves the state-of-the-art by up to 28% in pixel-level localization (AUPRO) and up to 4.5% in image-level detection (AP), across 14 industrial and medical datasets.

URL: https://openreview.net/forum?id=logc7dzJRS

---

Title: On Symmetric Losses for Policy Optimization with Noisy Preferences

Abstract: Optimizing policies based on human preferences is key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases, which can be asymmetric. We propose a principled framework for robust policy optimization under noisy preferences based on the view of reward modeling as a binary classification problem. Specifically, we demonstrate that asymmetric preference noise can be effectively treated as symmetric noise under this framework. This viewpoint allows us to leverage symmetric losses, well known for their robustness to label noise in classification, for reward modeling, which leads to our Symmetric Preference Optimization (SymPO) method, a novel offline preference optimization algorithm. Theoretically, we prove that symmetric losses enable successful policy improvement even with noisy labels, as the resulting reward is rank-preserving—a property we identify as sufficient for policy improvement. Empirical evaluations on a synthetic dataset and real-world language model alignment tasks demonstrate that SymPO achieves competitive or higher performance than existing robust methods in high-noise scenarios.
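The symmetry property behind this kind of robustness result -- l(z) + l(-z) = constant, which makes the noisy risk an affine function of the clean risk under symmetric label flips -- is easy to verify numerically. The sigmoid loss below is one standard example of a symmetric loss; this sketch is ours and does not reproduce SymPO itself:

```python
import numpy as np

def sigmoid_loss(z):
    """A symmetric loss: l(z) + l(-z) = 1 for every margin z."""
    return 1.0 / (1.0 + np.exp(z))

def logistic_loss(z):
    """Standard (non-symmetric) logistic loss, for comparison."""
    return np.log1p(np.exp(-z))

z = np.linspace(-5, 5, 101)
# The symmetry condition that underpins noise robustness.
assert np.allclose(sigmoid_loss(z) + sigmoid_loss(-z), 1.0)
# The logistic loss violates it.
assert not np.allclose(logistic_loss(z) + logistic_loss(-z), 1.0)

# Consequence: if preference labels flip with probability rho, the noisy
# risk of a symmetric loss is (1 - 2*rho) * clean_risk + rho -- an affine
# map that preserves the risk minimiser (the reward ranking is unchanged).
rho = 0.3
clean = sigmoid_loss(z)
noisy = (1 - rho) * sigmoid_loss(z) + rho * sigmoid_loss(-z)
assert np.allclose(noisy, (1 - 2 * rho) * clean + rho)
```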

URL: https://openreview.net/forum?id=cBWGLmSeao

---

Title: Theoretical Foundations of Continual Learning via Drift-Plus-Penalty

Abstract: In many real-world settings, data streams are inherently nonstationary and arrive sequentially, necessitating learning systems to adapt continuously without repeatedly retraining from scratch. Continual learning (CL) addresses this setting by seeking to incorporate new tasks while preventing catastrophic forgetting, whereby updates for recent data induce performance degradation on previously acquired knowledge. We introduce a control-theoretic perspective on CL that explicitly regulates the temporal evolution of forgetting, framing adaptation to new tasks as a controlled process subject to long-term stability constraints. We focus on replay-based CL settings in which a finite memory buffer preserves representative samples from prior tasks, allowing forgetting to be explicitly regulated. We propose COntinual Learning with Drift-Plus-Penalty (COLD), a novel CL framework grounded in the stochastic optimization-based Drift-Plus-Penalty (DPP) principle. At each task, COLD minimizes the instantaneous penalty corresponding to the current task loss while simultaneously maintaining a virtual queue that explicitly tracks deviations from long-term stability on previously learned tasks, hence capturing the stability–plasticity trade-off as a regulated dynamical process. We establish stability and convergence guarantees that characterize this trade-off, as governed by a tunable control parameter. Empirical results on standard benchmark datasets show that the proposed framework consistently achieves superior accuracy compared to a wide range of state-of-the-art CL baselines, while exhibiting competitive and tunable forgetting behavior that reflects the explicit regulation of the stability–plasticity trade-off through virtual queues and the DPP objective.
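The virtual-queue mechanism at the heart of drift-plus-penalty can be sketched in a few lines. This toy is ours, not the paper's algorithm; the forgetting budget eps, penalty weight V, and simulated replay losses are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, eps = 10.0, 0.05  # penalty weight and per-task forgetting budget (assumed)
Q = 0.0              # virtual queue: accumulated budget violations

for task in range(5):
    # Per task, the DPP-style objective trades the current-task loss
    # (penalty) against the queue-weighted replay loss (drift):
    #     minimise  V * loss_current(theta) + Q * loss_replay(theta)
    # Here we only simulate the resulting post-update replay loss,
    # to show the queue dynamics.
    replay_loss = rng.uniform(0.0, 0.1)
    # Lyapunov-style update: the queue grows when forgetting exceeds
    # the budget and drains when the model stays within it.
    Q = max(Q + replay_loss - eps, 0.0)

# A larger Q makes past-task stability dominate the next update;
# the queue never goes negative.
assert Q >= 0.0
```

The tunable control parameter in the abstract corresponds to V here: larger V favours plasticity on the current task, while the queue Q enforces long-term stability.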

URL: https://openreview.net/forum?id=QhxNMdhhBy

---

Title: Towards Principled Task Grouping for Multi-Task Learning

Abstract: Multi-task learning (MTL) aims to leverage shared information among tasks to improve learning efficiency and accuracy. However, MTL often struggles to effectively manage positive and negative transfer between tasks, which can hinder performance improvements. Task grouping addresses this challenge by organizing tasks into meaningful clusters, maximizing beneficial transfer while minimizing detrimental interactions.
This paper introduces a principled approach to task grouping in MTL, advancing beyond existing methods by addressing key theoretical and practical limitations. Unlike prior studies, our method offers a theoretically grounded approach that does not depend on restrictive assumptions for constructing transfer gains. We also present a flexible mathematical programming formulation that accommodates a wide range of resource constraints, thereby enhancing its versatility.
Experimental results across diverse domains, including computer vision datasets, combinatorial optimization benchmarks, and time series tasks, demonstrate the superiority of our method over an extensive set of baselines, thereby validating its effectiveness and general applicability in MTL without sacrificing efficiency.
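To make the grouping formulation concrete: given pairwise transfer gains, the problem is to pick the partition of tasks that maximizes summed within-group gain under a resource budget. The brute-force search and the gain matrix below are hypothetical illustrations (the paper's actual mathematical programming formulation scales beyond this toy enumeration):

```python
from itertools import combinations

def set_partitions(items):
    """Enumerate all partitions of a small task list (Bell-number growth)."""
    if not items:
        yield []
        return
    head, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [head]] + part[i + 1:]
        yield [[head]] + part

def best_grouping(gain, n_tasks, max_groups):
    """Pick the partition maximizing summed within-group transfer gain,
    subject to a budget on the number of groups (models to train)."""
    def score(part):
        return sum(gain[i][j] for grp in part
                   for i, j in combinations(sorted(grp), 2))
    feasible = (p for p in set_partitions(list(range(n_tasks)))
                if len(p) <= max_groups)
    return max(feasible, key=score)

# Hypothetical transfer-gain matrix: tasks 0,1 help each other, as do 2,3;
# cross-cluster pairs interfere (negative gain).
gain = [[0, 2, -1, -1],
        [2, 0, -1, -1],
        [-1, -1, 0, 3],
        [-1, -1, 3, 0]]
groups = best_grouping(gain, 4, max_groups=2)
print(sorted(sorted(g) for g in groups))
```

With this matrix the optimizer separates the two synergistic pairs rather than merging everything, which is exactly the positive-vs-negative-transfer trade-off task grouping targets.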

URL: https://openreview.net/forum?id=3DeSIpzuro

---

Title: Harnessing Optimization Dynamics for Curvature-Informed Model Merging

Abstract: Model merging is an effective strategy for composing capabilities in large language models without the need for costly joint retraining. We study this process in the supervised fine-tuning (SFT) stage, consolidating multiple checkpoints specialized for distinct capabilities (e.g., math, coding, and precise instruction following) into a single model. First, we introduce Optimization Trajectory Aware (OTA) Merging, a curvature-aware method for mitigating task interference that uses optimizer second-moment statistics as a diagonal curvature proxy to first prune the task vector with our Fast Fisher Grafting (FFG) technique and then reweight the pruned vector. When merging diverse, capability-based checkpoints, OTA improves the merged model's performance over strong baseline methods, as evaluated on unseen capability-based benchmarks. Second, we conduct a comprehensive, theoretically inspired empirical analysis to explain the effectiveness of OTA. Our analysis surprisingly reveals that FFG implicitly induces a layer- and role-wise aware pruning mechanism that is capable of maintaining fine-tuning performance at much more aggressive pruning ratios compared to magnitude pruning and that exhibits interpretable task localization properties. Third, an extensive comparison of our curvature proxy across capability checkpoints shows that experts converge to a basin with substantial curvature similarity, offering a novel lens on why simple linear merging can be effective in practice. This result further strengthens our ablation study, showing that FFG is critical for merging performance. Finally, we develop a memory-light variant of OTA that efficiently compresses the second moments, mitigating the additional storage requirements of our method and improving scalability. We make all code, training and evaluation scripts, visualization artifacts, and capability-specific SFT checkpoints accessible through an anonymized repository at \url{https://github.com/tmlr-ota/ota}.
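A schematic of the curvature-proxy merging idea on flat weight vectors. The saliency score, reweighting form, and all toy values are assumptions for illustration only; they are not the authors' exact FFG or OTA procedures:

```python
import numpy as np

def ffg_prune(tau, second_moment, keep_frac):
    """Keep only entries with highest curvature-weighted saliency v * tau^2
    (a schematic stand-in for Fast Fisher Grafting; the score is an assumption)."""
    score = second_moment * tau ** 2
    k = max(1, int(keep_frac * tau.size))
    thresh = np.partition(score, -k)[-k]
    return np.where(score >= thresh, tau, 0.0)

def ota_merge(theta_base, checkpoints, moments, keep_frac=0.1):
    """Merge fine-tuned checkpoints: prune each task vector with the
    diagonal curvature proxy, then combine curvature-weighted vectors."""
    merged_delta = np.zeros_like(theta_base)
    total_w = np.zeros_like(theta_base)
    for theta, v in zip(checkpoints, moments):
        tau = ffg_prune(theta - theta_base, v, keep_frac)
        w = np.sqrt(v) + 1e-12           # curvature reweighting (assumed form)
        merged_delta += w * tau
        total_w += w * (tau != 0)
    return theta_base + merged_delta / np.maximum(total_w, 1e-12)

# Toy demo: two checkpoints edit disjoint coordinates, with high second
# moments exactly where each checkpoint made its meaningful change.
base = np.zeros(4)
ck1, m1 = np.array([1.0, 0.01, 0.0, 0.0]), np.array([1.0, 1e-4, 1e-4, 1e-4])
ck2, m2 = np.array([0.0, 0.0, 1.0, 0.01]), np.array([1e-4, 1e-4, 1.0, 1e-4])
merged = ota_merge(base, [ck1, ck2], [m1, m2], keep_frac=0.5)
print(merged)
```

Because the two pruned task vectors touch disjoint coordinates, the merge recovers both edits without interference, which is the low-interference regime the curvature-similarity analysis in the abstract points to.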

URL: https://openreview.net/forum?id=Wb2r8TdAyD

---

Title: On Hamming–Lipschitz Type Stability of the Subdominant (Minmax) Ultrametric: Theory and Simple Proofs

Abstract: We study the subdominant (minmax) ultrametric as an operator on pairwise data. Prior stability results show that this operator is non-expansive under uniform perturbations in the supremum norm and in the Gromov–Hausdorff sense, but they say nothing about how widely sparse, targeted edits can ripple through the hierarchy. We close this gap with a pair-count Lipschitz theory in Hamming space: we bound how many ultrametric entries can change, regardless of their magnitudes. The analysis is routed through the \emph{minimum spanning tree} (MST), which encodes the ultrametric as path bottlenecks. Our first theorem proves a locality principle: only pairs whose MST path crosses an edited or newly exposed cut can change, so the impact is confined to a union of fundamental cut rectangles. Building on this, we derive an instance-dependent $\ell_0$ type Lipschitz bound whose constant is determined entirely by the MST’s exposed cuts. We then show optimality by constructing cases where a single off-tree edit forces a quadratic number of changes, so no smaller universal constant is possible for our proposed Lipschitz constant. Finally, under a mild minimal-overlap condition, the upper bound on the number of changed entries of the ultrametric is order-tight, yielding a two-sided characterization of propagation. Conceptually, this advances a magnitude-versus-extent picture for ultrametric stability: classical results control how much entries move under uniform perturbation; our theory controls how far changes spread under sparse edits. Additionally, as a proof of concept, we derive a risk score from our Lipschitz constant that identifies vulnerable edges in the graph.
We use this score to drive two case studies: vulnerability maps of deep embeddings of CIFAR-10, ImageNet-10, and STL-10, where targeted edits to high-score edges cause far larger ultrametric and clustering changes than random edits with the same budget, and fragility maps in a superpixel-based single image segmentation that highlight load-bearing boundaries.
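The object at the heart of the abstract is standard: the subdominant ultrametric sets u(i, j) to the largest edge on the MST path between i and j. A small self-contained sketch (Kruskal plus a bottleneck DFS; the example matrix is made up for illustration):

```python
import itertools

def subdominant_ultrametric(d):
    """Minmax (subdominant) ultrametric from a dissimilarity matrix:
    u(i, j) = the largest edge weight on the MST path between i and j."""
    n = len(d)
    # Kruskal's MST with union-find (path halving).
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = sorted((d[i][j], i, j) for i, j in itertools.combinations(range(n), 2))
    adj = {i: [] for i in range(n)}
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            adj[i].append((j, w))
            adj[j].append((i, w))
    # u(i, j) = running maximum along the unique MST path from each source.
    u = [[0.0] * n for _ in range(n)]
    for s in range(n):
        stack, seen = [(s, 0.0)], {s}
        while stack:
            v, bottleneck = stack.pop()
            u[s][v] = bottleneck
            for nb, w in adj[v]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append((nb, max(bottleneck, w)))
    return u

d = [[0, 1, 4, 5],
     [1, 0, 2, 6],
     [4, 2, 0, 3],
     [5, 6, 3, 0]]
u = subdominant_ultrametric(d)
print(u)
```

Since every u(i, j) is a bottleneck read off the MST, the paper's locality principle follows the same route: an edit can only affect pairs whose MST path crosses the edited or newly exposed cut.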

URL: https://openreview.net/forum?id=R4ASOCp3uM

---

Title: Sharpness-Aware Minimization Driven by Local-Integrability Flatness

Abstract: Sharpness-Aware Minimization (SAM) improves generalization by optimizing for worst-case loss under parameter perturbations, but its max-based objective can be overly conservative, noise-sensitive, and reliant on smoothness assumptions that often fail in modern nonsmooth networks. We propose Lebesgue Sharpness-Aware Minimization (LSAM), a measure-theoretic alternative grounded in the Lebesgue Differentiation Theorem and local Sobolev regularity. Instead of minimizing the worst-case loss, LSAM minimizes the local average loss in a neighborhood of the parameters. This average-case notion of flatness favors Sobolev-regular Lebesgue points with low local loss oscillation and yields a generalization bound depending only on local integrability, a modulus of continuity, and a Sobolev-induced flatness term—without requiring Hessians or global Lipschitz conditions. To make LSAM practical, we introduce a Monte Carlo estimator of the local average that provides an unbiased gradient with modest overhead. Experiments on CIFAR-10/100 with ResNet, ResNeXt, WideResNet, and PyramidNet show that LSAM consistently finds flatter minima and improves test accuracy over both SGD and SAM.
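The key computational ingredient the abstract names is a Monte Carlo estimate of the local average loss gradient. A minimal sketch, assuming uniform sampling from a radius-rho ball (the exact sampling distribution and constants are assumptions, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(0)

def local_avg_grad(grad_fn, theta, rho=0.05, n_samples=8):
    """Monte Carlo gradient of the local average loss
    E_{||u|| <= rho} [L(theta + u)], estimated with perturbations drawn
    uniformly from the radius-rho ball around theta."""
    g = np.zeros_like(theta)
    d = theta.size
    for _ in range(n_samples):
        u = rng.normal(size=theta.shape)
        # Scale a random direction by rho * U^(1/d): uniform in the ball.
        u *= rho * rng.random() ** (1.0 / d) / np.linalg.norm(u)
        g += grad_fn(theta + u)
    return g / n_samples

# Toy demo: descend a quadratic using the averaged gradient instead of
# SAM's worst-case perturbed gradient.
grad_fn = lambda th: 2.0 * th            # gradient of ||theta||^2
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = theta - 0.1 * local_avg_grad(grad_fn, theta)
print(theta)
```

Because the perturbations have mean zero, each sample gives an unbiased estimate of the gradient of the ball-averaged loss, and the averaging smooths the objective rather than maximizing over it, which is the average-case flatness contrast with SAM.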

URL: https://openreview.net/forum?id=29Zg9k5NCo

---
