Daily TMLR digest for Jan 31, 2026

TMLR

Jan 31, 2026, 12:30:07 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Authors: Shrikant Kendre, Austin Xu, Honglu Zhou, Michael S Ryoo, Shafiq Joty, Juan Carlos Niebles

Abstract: Traditional evaluation metrics for textual and visual question answering—like ROUGE, METEOR, and Exact Match (EM)—focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
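
The abstract does not spell out the weighting, but a rough sketch of a composite score in this spirit — sentence-level semantic similarity, keyword-level semantic similarity, and lexical exactness combined with fixed weights — could look like the following. The encoder, the whitespace "keyword" heuristic, and the weights are illustrative assumptions, not the authors' implementation.

```python
# Illustrative composite lexical-semantic QA score in the spirit of SMILE.
# Weights, keyword heuristic, and encoder are assumptions, not the paper's method.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder

def composite_qa_score(prediction: str, reference: str,
                       w_sent: float = 0.4, w_key: float = 0.4,
                       w_lex: float = 0.2) -> float:
    # 1) Sentence-level semantic similarity between the full answers.
    sent_sim = util.cos_sim(encoder.encode(prediction, convert_to_tensor=True),
                            encoder.encode(reference, convert_to_tensor=True)).item()

    # 2) Keyword-level semantic similarity: best match for each reference word
    #    among prediction words (a crude stand-in for real keyword extraction).
    ref_words = reference.lower().split()
    pred_words = prediction.lower().split() or [""]
    ref_emb = encoder.encode(ref_words, convert_to_tensor=True)
    pred_emb = encoder.encode(pred_words, convert_to_tensor=True)
    key_sim = util.cos_sim(ref_emb, pred_emb).max(dim=1).values.mean().item()

    # 3) Lexical exactness: fraction of reference words reproduced verbatim.
    lex_sim = sum(w in pred_words for w in ref_words) / max(len(ref_words), 1)

    return w_sent * sent_sim + w_key * key_sim + w_lex * lex_sim

print(composite_qa_score("the Eiffel Tower in Paris", "Eiffel Tower"))
```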

URL: https://openreview.net/forum?id=lnpOvuQYih

---

Title: CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

Authors: Kung-Hsiang Huang, Akshara Prabhakar, Onkar Thorat, Divyansh Agarwal, Prafulla Kumar Choubey, Yixin Mao, Silvio Savarese, Caiming Xiong, Chien-Sheng Wu

Abstract: While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.

URL: https://openreview.net/forum?id=EPlpe3Fx1x

---

Title: Dual-Phase Continual Learning: Supervised Adaptation Meets Unsupervised Retention

Authors: Vaibhav Singh, Rahaf Aljundi, Eugene Belilovsky

Abstract: Foundational vision-language models (VLMs) excel across diverse tasks, but adapting them to new domains without forgetting prior knowledge remains a critical challenge. Continual Learning (CL) addresses this challenge by enabling models to learn sequentially from new data while mitigating the forgetting of prior information, typically under supervised settings involving label shift. Nonetheless, abrupt distribution shifts can still cause substantial forgetting, potentially nullifying the benefits of supervised updates, especially when storing or replaying past data is infeasible. In this work, we propose leveraging unlabeled test-time data in an unsupervised manner to reinforce prior task performance without requiring replay or stored examples. Unlike traditional Test-Time Adaptation (TTA), which primarily focuses on domain shift or corruption, our method improves performance on earlier tasks by exploiting representative test samples encountered during deployment. We introduce a simple teacher-student framework with gradient-based sparse parameter updates, and show that it effectively mitigates forgetting in class-incremental CL for VLMs, offering a memory-free alternative to episodic replay with strong empirical results.
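
As a rough illustration of a memory-free teacher-student retention step on unlabeled test-time data with sparse gradient-based updates — the distillation loss, the top-k gradient sparsification, and all names below are assumptions for the sketch, not the paper's exact method:

```python
# Sketch: distill a frozen prior-task teacher into the student on unlabeled test data,
# keeping only the top `sparsity` fraction of gradients so the update stays sparse.
import torch
import torch.nn.functional as F

def retention_step(student, teacher, unlabeled_batch, optimizer, sparsity=0.01):
    with torch.no_grad():
        teacher_logits = teacher(unlabeled_batch)      # frozen teacher (prior tasks)
    student_logits = student(unlabeled_batch)
    loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    # Sparsify: zero out all but the largest-magnitude gradients.
    grads = torch.cat([p.grad.abs().flatten() for p in student.parameters()
                       if p.grad is not None])
    k = max(1, int(grads.numel() * (1.0 - sparsity)))
    threshold = grads.kthvalue(k).values
    for p in student.parameters():
        if p.grad is not None:
            p.grad[p.grad.abs() < threshold] = 0.0
    optimizer.step()
    return loss.item()
```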

URL: https://openreview.net/forum?id=GFrHdXzZwo

---

Title: Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

Authors: Rémi Kazmierczak, Steve Azzolin, Goran Frehse, Eloïse Berthier, Gianni Franchi

Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination—incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

URL: https://openreview.net/forum?id=2xaOl0wluw

---

Title: Random Projection-Induced Gaussian Latent Features for Arbitrary Style Transfer

Authors: Weizhi Lu, Zhongzheng Li, Dongchen Gao, Mingrui Chen, Weiyu Li, Jinglin Zhang, Wei Zhang

Abstract: The feature transfer technique centered on mean and variance statistics, widely known as AdaIN, lies at the core of current style transfer research. This technique relies on the assumption that latent features for style transfer follow Gaussian distributions. In practice, however, this assumption is often hard to meet, as the features typically exhibit sparse distributions due to the significant spatial correlation inherent in natural images. To tackle this issue, we propose first performing a random projection on the sparse features, and then conducting style transfer on these projections. Statistically, the projections will satisfy or approximate Gaussian distributions, thereby better aligning with AdaIN's requirements and enhancing transfer performance. With the stylized projections, we can further reconstruct them back to the original feature space by leveraging compressed sensing theory, thereby obtaining the stylized features. The entire process constitutes a projection-stylization-reconstruction module, which can be seamlessly integrated into AdaIN without necessitating network retraining. Additionally, our proposed module can also be incorporated into another promising style transfer technique based on cumulative distribution functions, dubbed EFDM. This technique faces limitations when there are substantial differences in sparsity levels between content and style features. By projecting both types of features into dense Gaussian distributions, random projection can reduce their sparsity disparity, thereby improving performance. Experiments demonstrate that the aforementioned performance improvements can be achieved on existing state-of-the-art approaches.
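
A minimal sketch of the projection-stylization step, assuming flattened (C, N) features with matching spatial sizes and substituting a least-squares pseudo-inverse for the paper's compressed-sensing reconstruction:

```python
# Sketch: apply AdaIN in a randomly projected space where features are closer to
# Gaussian; the pseudo-inverse reconstruction is a simplifying stand-in.
import torch

def adain(content: torch.Tensor, style: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Channel-wise AdaIN: match mean/std of `content` to `style`; inputs are (C, M).
    c_mean, c_std = content.mean(1, keepdim=True), content.std(1, keepdim=True) + eps
    s_mean, s_std = style.mean(1, keepdim=True), style.std(1, keepdim=True) + eps
    return (content - c_mean) / c_std * s_std + s_mean

def projected_adain(content_feat, style_feat, proj_dim: int = 256):
    # content_feat, style_feat: flattened features of shape (C, N); same N assumed here.
    C, N = content_feat.shape
    P = torch.randn(proj_dim, N) / N ** 0.5        # random Gaussian projection matrix
    content_proj = content_feat @ P.T              # (C, proj_dim), approximately Gaussian
    style_proj = style_feat @ P.T
    stylized_proj = adain(content_proj, style_proj)
    # Map back to feature space; least-squares here, compressed sensing in the paper.
    return stylized_proj @ torch.linalg.pinv(P).T  # (C, N)
```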

URL: https://openreview.net/forum?id=XBu6iqHof8

---

Title: Privacy Profiles Under Tradeoff Composition

Authors: Paul Glasserman

Abstract: Privacy profiles and tradeoff functions are two frameworks for comparing differential privacy guarantees of alternative privacy mechanisms. We study connections between these frameworks. We show that the composition of tradeoff functions corresponds to a binary operation on privacy profiles we call their T-convolution. Composition of tradeoff functions characterizes group privacy guarantees, so the T-convolution provides a bridge for translating group privacy properties from one framework to the other. Composition of tradeoff functions has also been used to characterize mechanisms with log-concave additive noise; we derive a corresponding property based on privacy profiles. We also derive new bounds on privacy profiles for log-concave mechanisms based on new convexity properties. In developing these ideas, we characterize regular privacy profiles, which are privacy profiles for mutually absolutely continuous probability measures.
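
For context, the two frameworks are usually defined as follows (standard definitions from the differential-privacy literature; the paper's T-convolution itself is not reproduced here):

```latex
% Privacy profile of a mechanism M over adjacent datasets D ~ D':
\[
  \delta_M(\varepsilon)
    = \sup_{D \sim D'} \; \sup_{S}
      \Bigl( \Pr[M(D) \in S] - e^{\varepsilon}\, \Pr[M(D') \in S] \Bigr)
\]
% Tradeoff function between distributions P and Q (type I vs. type II error of tests \phi):
\[
  T(P, Q)(\alpha) = \inf \bigl\{ \beta_{\phi} : \alpha_{\phi} \le \alpha \bigr\}
\]
% Standard link between the frameworks: M is (\varepsilon, \delta)-DP iff its tradeoff
% function f satisfies
\[
  f(\alpha) \ge \max\bigl\{ 0,\; 1 - \delta - e^{\varepsilon}\alpha,\;
                            e^{-\varepsilon}(1 - \delta - \alpha) \bigr\}
  \quad \text{for all } \alpha \in [0, 1].
\]
```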

URL: https://openreview.net/forum?id=gRvKjXWacu

---

Title: Efficient Dilated Squeeze and Excitation Neural Operator for Differential Equations

Authors: Prajwal Chauhan, Salah Eddine Choutri, Saif Jabari

Abstract: Fast and accurate surrogates for physics-driven partial differential equations (PDEs) are essential in fields such as aerodynamics, porous media design, and flow control. However, many transformer-based models and existing neural operators remain parameter-heavy, resulting in costly training and sluggish deployment. We propose D-SENO (Dilated Squeeze-Excitation Neural Operator), a lightweight operator learning framework for efficiently solving a wide range of PDEs, including airfoil potential flow, Darcy flow in porous media, pipe Poiseuille flow, and incompressible Navier–Stokes vortical fields. D-SENO combines dilated convolution (DC) blocks with squeeze-and-excitation (SE) modules to jointly capture wide receptive fields and dynamics alongside channel-wise attention, enabling both accurate and efficient PDE inference. Carefully chosen dilation rates allow the receptive field to focus on critical regions, effectively modeling long-range physical dependencies. Meanwhile, the SE modules adaptively recalibrate feature channels to emphasize dynamically relevant scales. Our model trains up to $\approx 20\times$ faster than standard transformer-based models and neural operators, while also surpassing (or matching) them in accuracy across multiple PDE benchmarks. Ablation studies show that removing the SE modules leads to a slight drop in performance.
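
A minimal PyTorch sketch of one dilated-convolution + squeeze-and-excitation block of this kind; channel counts, dilation rates, activations, and the residual connection are assumptions, not the D-SENO architecture:

```python
# Sketch of a dilated-conv + squeeze-and-excitation block; shapes are illustrative.
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # Squeeze: global average pool; excite: per-channel gates in [0, 1].
        w = self.fc(x.mean(dim=(2, 3)))
        return x * w[:, :, None, None]

class DilatedSEBlock(nn.Module):
    def __init__(self, channels: int, dilation: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.act = nn.GELU()
        self.se = SqueezeExcite(channels)

    def forward(self, x):
        # Dilated conv widens the receptive field; SE recalibrates channels; residual add.
        return x + self.se(self.act(self.conv(x)))

block = DilatedSEBlock(channels=64, dilation=4)
print(block(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```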

URL: https://openreview.net/forum?id=Xl942THEUa

---

Title: How Well Can Preference Optimization Generalize Under Noisy Feedback?

Authors: Shawn Im, Yixuan Li

Abstract: As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a central component of this alignment process. However, most existing works assume noise-free feedback, which is unrealistic due to the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. In particular, we consider noise models that correspond to common real-world sources of noise, such as mislabeling and uncertainty. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We characterize how generalization decays under different noise types and noise rates, as a function of the preference data distribution and the number of samples. Our analysis of noisy preference learning applies to a broad family of preference optimization losses, such as DPO, IPO, and SLiC. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.
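
As a concrete anchor, the standard DPO loss (one of the losses covered by the analysis) and an illustrative simulation of mislabeling noise can be written as follows; the flip-based noise injection is a simplification for intuition, not the paper's analysis:

```python
# Standard DPO loss plus an illustrative mislabeling simulation: with probability
# `noise_rate` the preferred/non-preferred responses are swapped.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    # logp_*: policy log-probs of the preferred (w) and non-preferred (l) responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

def noisy_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   noise_rate: float = 0.2, beta: float = 0.1):
    flip = torch.rand_like(logp_w) < noise_rate          # mislabeled pairs
    w_logp = torch.where(flip, logp_l, logp_w)
    l_logp = torch.where(flip, logp_w, logp_l)
    w_ref = torch.where(flip, ref_logp_l, ref_logp_w)
    l_ref = torch.where(flip, ref_logp_w, ref_logp_l)
    return dpo_loss(w_logp, l_logp, w_ref, l_ref, beta)
```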

URL: https://openreview.net/forum?id=8f5gRWwzDx

---

Title: KITTEN: A Knowledge-Integrated Evaluation of Image Generation on Visual Entities

Authors: Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C.K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang

Abstract: Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTegrated image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of recent text-to-image models, retrieval-augmented models, and unified understanding and generation models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analyses using carefully designed human evaluations, automatic metrics, and MLLMs as judges show that even advanced text-to-image and unified models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entities in creative text prompts. The dataset and evaluation code are publicly available at https://kitten-project.github.io.

URL: https://openreview.net/forum?id=wejaKS9Ps0

---


New submissions
===============


Title: A Survey on Behavioral Data Representation Learning

Abstract: Behavioral data, reflecting dynamic and complex interactions among entities, are pivotal for advancing multidisciplinary research and practical applications. Effective modeling and representation of behavioral data facilitate enhanced understanding, predictive analytics, and informed decision-making across diverse domains. This paper presents a comprehensive taxonomy of behavioral data representation learning methods, categorized by data modalities: tabular data, event sequences, dynamic graphs, and natural language. Within each category, we further dissect methods based on distinct modeling strategies and capabilities, and provide detailed reviews of their developments. Additionally, we extensively discuss significant downstream applications, datasets, and benchmarks, highlighting their roles in guiding methodological development and evaluating performance. To support further exploration in behavioral data representation learning, we release a continuously maintained repository at [Anonymous GitHub](https://anonymous.4open.science/r/BehavioralDataSurvey) that curates the methods and papers covered in this survey.

URL: https://openreview.net/forum?id=NSdV2qvglw

---

Title: DeltaSM: Delta-Level Contrastive Learning with Mamba for Time-Series Representation

Abstract: Self-supervised contrastive learning offers a compelling route to transferable time-series representations in label-scarce settings. Yet existing frameworks face a persistent trade-off between preserving fine-grained local dynamics at high temporal resolution and scaling to long sequences under practical compute constraints. Convolutional encoders often require deep stacks to retain rapid transitions, whereas Transformers incur quadratic cost in sequence length, making high-resolution long-context training expensive. Recent selective state-space models such as \emph{Mamba} enable linear-time ($O(L)$) sequence modeling and offer a promising path to mitigate this bottleneck. However, their potential for \emph{general-purpose} time-series representation learning remains underexplored; to our knowledge, prior Mamba-based contrastive learners have not been evaluated on the full UCR 2018 archive (128 datasets) under a unified protocol. We propose \textbf{DeltaSM} (\emph{Delta-selective Mamba}), a self-supervised framework for univariate time series that reconciles efficiency and expressivity. DeltaSM integrates (i) a lightweight Mamba backbone, (ii) token-budget-constrained training, and (iii) a $\Delta$-level contrastive objective that counterbalances Mamba's smoothing tendency. Specifically, we apply curvature-adaptive weighting to first-order differences of the latent sequence, encouraging the encoder to emphasize informative local transitions without increasing computational cost. At inference time, we further augment the learned time-domain embeddings with explicitly extracted frequency-domain descriptors from the raw signal to improve expressivity at negligible overhead. Across all 128 UCR datasets, under \textbf{Protocol A}---a unified compute setting with a fixed number of optimization steps and a standardized downstream classifier---DeltaSM converges in seconds and achieves classification accuracy comparable to or better than strong baselines such as TS-TCC, TS2Vec, and TimesURL, using a single global configuration and a fixed pretraining-step budget (300 optimization updates per dataset). On a focused subset that includes long-sequence datasets under \textbf{Protocol B}---where baselines are allowed their recommended training budgets and hyperparameters while DeltaSM remains fixed as in Protocol A---DeltaSM reduces pretraining time by up to $184\times$ while remaining competitive. Extensive ablations confirm that curvature-based weighting is crucial for suppressing noise while capturing local dynamics, and that inference-time frequency integration provides complementary gains with minimal additional cost.
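
A hedged sketch of a curvature-weighted, delta-level contrastive term of the kind the abstract describes — an InfoNCE loss over first-order differences of two augmented views, weighted by local curvature. The concrete DeltaSM objective is not given in the abstract, so the pairing, padding, and temperature below are assumptions:

```python
# Sketch: contrast first-order latent differences across two views, with
# curvature-adaptive (second-difference) weights. Not the paper's exact objective.
import torch
import torch.nn.functional as F

def delta_contrastive_loss(z1, z2, tau: float = 0.1, eps: float = 1e-8):
    # z1, z2: (B, T, D) latent sequences from two augmented views of the same series.
    d1 = z1[:, 1:] - z1[:, :-1]                          # first-order deltas, (B, T-1, D)
    d2 = z2[:, 1:] - z2[:, :-1]
    # Curvature-adaptive weights: emphasize steps where the latent trajectory bends.
    curv = (d1[:, 1:] - d1[:, :-1]).norm(dim=-1)         # (B, T-2)
    w = torch.cat([curv[:, :1], curv], dim=1)            # pad to (B, T-1)
    w = w / (w.sum(dim=1, keepdim=True) + eps)
    d1, d2 = F.normalize(d1, dim=-1), F.normalize(d2, dim=-1)
    # Temporal InfoNCE over deltas: same time step in the other view is the positive.
    logits = torch.einsum("btd,bsd->bts", d1, d2) / tau  # (B, T-1, T-1)
    targets = torch.arange(d1.size(1), device=z1.device).expand(d1.size(0), -1)
    per_step = F.cross_entropy(logits.transpose(1, 2), targets, reduction="none")
    return (w * per_step).sum(dim=1).mean()
```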

URL: https://openreview.net/forum?id=3S1kAqCiAS

---

Title: BG-HGNN: Toward Efficient Learning for Complex Heterogeneous Graphs

Abstract: Heterogeneous graphs—comprising diverse node and edge types connected through varied relations—are ubiquitous in real-world applications. Message-passing heterogeneous graph neural networks (HGNNs) have emerged as a powerful model class for such data. However, existing HGNNs typically allocate a separate set of learnable weights for each relation type to model relational heterogeneity. Despite their promise, these models are effective primarily on simple heterogeneous graphs with only a few relation types. In this paper, we show that this standard design inherently leads to parameter explosion (the number of learnable parameters grows rapidly with the number of relation types) and relation collapse (the model loses the ability to distinguish among different relations). These issues make existing HGNNs inefficient or impractical for complex heterogeneous graphs with many relation types. To address these challenges, we propose Blend&Grind-HGNN (BG-HGNN), a unified feature-representation framework that integrates and distills relational heterogeneity into a shared low-dimensional feature space. This design eliminates the need for relation-specific parameter sets and enables efficient, expressive learning even as the number of relations grows. Empirically, BG-HGNN achieves substantial gains over state-of-the-art HGNNs—improving parameter efficiency by up to 28.96 $\times$ and training throughput by up to 110.30 $\times$—while matching or surpassing their accuracy on complex heterogeneous graphs.
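
A rough sketch of the shared-space idea: one learned relation embedding blended with source-node features and a single projection shared across all relations, instead of one weight matrix per relation. Shapes, the fusion rule, and the aggregation below are assumptions, not BG-HGNN's exact design:

```python
# Sketch: a single shared projection over [node feature, relation embedding] replaces
# per-relation weight matrices, so parameters grow only with the embedding size.
import torch
import torch.nn as nn

class SharedRelationConv(nn.Module):
    def __init__(self, in_dim: int, rel_dim: int, out_dim: int, num_relations: int):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, rel_dim)   # O(R * rel_dim) params
        self.proj = nn.Linear(in_dim + rel_dim, out_dim)      # shared across relations

    def forward(self, x, edge_index, edge_type):
        # x: (N, in_dim) node features; edge_index: (2, E); edge_type: (E,) relation ids.
        src, dst = edge_index
        msg = self.proj(torch.cat([x[src], self.rel_emb(edge_type)], dim=-1))
        out = torch.zeros(x.size(0), msg.size(-1), device=x.device)
        out.index_add_(0, dst, msg)                            # sum-aggregate messages
        deg = torch.bincount(dst, minlength=x.size(0)).clamp(min=1).unsqueeze(-1)
        return out / deg                                       # mean aggregation
```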

URL: https://openreview.net/forum?id=pkhDICLhy7

---

Title: Symmetric Divergence and Normalized Similarity: A Unified Topological Framework for Representation Analysis

Abstract: Topological Data Analysis (TDA) offers a principled, intrinsic lens for comparing neural representations. However, existing paired topological divergences (e.g., RTD) are limited by heuristic asymmetry and, more critically, unbounded scores that depend on sample size, hindering reliable cross-scenario benchmarking. To address these challenges, we develop a unified topological toolkit serving two complementary needs: fine-grained structural diagnosis and robust, standardized evaluation.
First, we complete the RTD framework by introducing \textbf{Symmetric Representation Topology Divergence (SRTD)} and its efficient variant \textbf{SRTD-lite}. Beyond resolving the theoretical asymmetry of prior variants, SRTD consolidates diagnostic information into a single, comprehensive cross-barcode signature. This allows for precise localization of structural discrepancies and serves as an effective optimization objective without the overhead of dual directional computations.
Second, to enable reliable benchmarking across heterogeneous settings, we propose \textbf{Normalized Topological Similarity (NTS)}. By measuring the rank correlation of hierarchical merge orders, NTS yields a scale-invariant metric bounded between -1 and 1, effectively overcoming the scale and sample-dependence of unnormalized divergences.
Experiments across synthetic and real-world deep learning settings demonstrate that our toolkit captures functional shifts in CNNs missed by geometric measures and robustly maps LLM genealogy even under distance saturation, offering a rigorous, topology-aware perspective that complements measures like CKA.
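
As a loose stand-in for the merge-order idea, one can rank-correlate the single-linkage merge heights (cophenetic distances) of the same point pairs in two representations, which already yields a scale-invariant score in [-1, 1]; the paper's NTS construction may differ in its details:

```python
# Stand-in sketch: Spearman rank correlation of hierarchical merge heights between
# two representations of the same points. Not the paper's exact NTS definition.
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def merge_order_similarity(X: np.ndarray, Y: np.ndarray) -> float:
    # X, Y: (n_points, dim_x), (n_points, dim_y) representations of the same points.
    coph_x = cophenet(linkage(pdist(X), method="single"))   # pairwise merge heights
    coph_y = cophenet(linkage(pdist(Y), method="single"))
    return spearmanr(coph_x, coph_y)[0]                     # rank correlation in [-1, 1]

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 32))
print(merge_order_similarity(A, A @ rng.normal(size=(32, 16))))  # high for related spaces
```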

URL: https://openreview.net/forum?id=pGgJ9qB2Io

---

Title: Confidence-Aware Explanations for 3D Molecular Graphs via Energy-Based Masking

Abstract: Graph Neural Networks (GNNs) have become a powerful tool for modeling molecular data. To improve their reliability and interpretability, various explanation methods aim to identify key molecular substructures, typically a subset of edges, influencing the decision-making process. Early work on 2D GNNs represented molecules as graphs with atoms as nodes and bonds as edges, ignoring 3D geometric configurations. While existing explanation methods perform well for 2D GNNs, there is a growing demand for 3D explanation techniques suited to 3D GNNs, which often surpass 2D GNNs in performance. Current explanation methods struggle with 3D GNNs due to the construction of edges based on distance cutoffs, leading to an exponential increase in the number of edges a molecular graph possesses. We identify key sources of explanation errors and decompose them into two components, derived from an upper bound between optimized masks and the true explanatory subgraph. This gap is especially significant in 3D GNNs because of the dense edge structures. To improve explanation fidelity, our method assigns two energy values to each atom, representing its contribution to predictions: one for importance and one for non-importance. The explanation model becomes more confident when the distinction between importance and non-importance is clearer. Analogous to physics, lower energy values indicate greater stability, enhancing confidence in the associated scenario. By optimizing these energy values to distinguish the two cases, we minimize both components of the error bound and identify a stable subgraph with high explanation fidelity. Experiments with various 3D backbone models on widely used datasets are conducted to validate our method's effectiveness in providing accurate and reliable explanations for 3D molecular graphs.
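
A hedged sketch of the energy-based masking idea: each atom receives two energies, one for "important" and one for "not important"; lower energy means higher confidence, and a Boltzmann-style normalization of the two turns them into a soft mask. Network shapes and the temperature are assumptions, not the paper's model:

```python
# Sketch: per-atom energies for importance vs. non-importance; a clearer separation
# between the two yields a more confident (near 0/1) mask.
import torch
import torch.nn as nn

class EnergyMask(nn.Module):
    def __init__(self, node_dim: int, hidden: int = 64, temperature: float = 1.0):
        super().__init__()
        self.energy_net = nn.Sequential(nn.Linear(node_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 2))   # [E_important, E_not]
        self.temperature = temperature

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        energies = self.energy_net(node_feats)                  # (num_atoms, 2)
        # Lower energy -> higher probability for that case.
        probs = torch.softmax(-energies / self.temperature, dim=-1)
        return probs[:, 0]                                       # importance mask per atom

mask = EnergyMask(node_dim=16)(torch.randn(10, 16))
print(mask.shape, float(mask.min()), float(mask.max()))
```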

URL: https://openreview.net/forum?id=V4oLqP0vJf

---
