Daily TMLR digest for Feb 11, 2026


TMLR

Feb 11, 2026, 12:30:06 AM
to tmlr-anno...@googlegroups.com


New certifications
==================

Survey Certification: The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why - A Survey from MARL to Emergent Language and LLMs

Jingdi Chen, Hanqing Yang, Zongjun Liu, Carlee Joe-Wong

https://openreview.net/forum?id=LGsed0QQVq

---


Accepted papers
===============


Title: CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Authors: Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text–image alignment—without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and EXploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks—covering diverse compositional challenges—show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity.

URL: https://openreview.net/forum?id=XB1cwXHV0c
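
As a rough illustration of the explore-then-optimize pattern the abstract describes, here is a minimal sketch; `generate`, `reward_fn`, and all hyperparameters are hypothetical stand-ins, not the authors' code, and the gradient step assumes both callables are differentiable.

    import torch

    def explore_then_optimize(generate, reward_fn, prompt, n_candidates=8,
                              opt_steps=20, lr=0.05, shape=(4, 64, 64)):
        # Exploration: sample several initial noises, keep the best scorer.
        candidates = [torch.randn(shape) for _ in range(n_candidates)]
        scores = [float(reward_fn(generate(z, prompt), prompt))
                  for z in candidates]
        z = candidates[max(range(n_candidates), key=scores.__getitem__)]

        # Optimization: refine the winning noise by gradient ascent on the
        # reward (assumes generate and reward_fn are differentiable).
        z = z.clone().requires_grad_(True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(opt_steps):
            loss = -reward_fn(generate(z, prompt), prompt)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return z.detach()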

---

Title: Steering Large Reasoning Models towards Concise Reasoning via Flow Matching

Authors: Yawei Li, Benjamin Bergner, Yinghan Zhao, Vihang Prakash Patil, Bei Chen, Cheng Wang

Abstract: Large Reasoning Models (LRMs) excel at complex reasoning tasks, but their efficiency is often hampered by overly verbose outputs. Prior steering methods attempt to address this issue by applying a single, global vector to hidden representations—an approach grounded in the restrictive linear representation hypothesis. In this work, we introduce FlowSteer, a nonlinear steering method that goes beyond uniform linear shifts by learning a complete transformation between the distributions associated with verbose and concise reasoning. This transformation is learned via Flow Matching as a velocity field, enabling precise, input-dependent control over the model's reasoning process. By aligning steered representations with the distribution of concise-reasoning activations, FlowSteer yields more compact reasoning than linear shifts. Across diverse reasoning benchmarks, FlowSteer demonstrates strong task performance and token efficiency compared to leading inference-time baselines. Our work demonstrates that modeling the full distributional transport with generative techniques offers a more effective and principled foundation for controlling LRMs.

URL: https://openreview.net/forum?id=qwcJMdGerK
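
The standard flow-matching recipe the abstract builds on can be sketched as follows; the hidden size, network, and integration schedule are illustrative assumptions, not FlowSteer itself.

    import torch
    import torch.nn as nn

    d = 256  # assumed hidden size
    v_field = nn.Sequential(nn.Linear(d + 1, 512), nn.SiLU(),
                            nn.Linear(512, d))
    opt = torch.optim.Adam(v_field.parameters(), lr=1e-4)

    def fm_step(h_verbose, h_concise):
        # Linear path x_t = (1 - t) x0 + t x1; the regression target for
        # the velocity field is the constant displacement x1 - x0.
        t = torch.rand(h_verbose.size(0), 1)
        x_t = (1 - t) * h_verbose + t * h_concise
        pred = v_field(torch.cat([x_t, t], dim=-1))
        loss = ((pred - (h_concise - h_verbose)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()

    @torch.no_grad()
    def steer(h, n_steps=8):
        # Euler integration of the learned ODE moves an activation toward
        # the concise-reasoning distribution.
        dt = 1.0 / n_steps
        for i in range(n_steps):
            t = torch.full((h.size(0), 1), i * dt)
            h = h + dt * v_field(torch.cat([h, t], dim=-1))
        return h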

---

Title: The Five Ws of Multi-Agent Communication: Who Talks to Whom, When, What, and Why - A Survey from MARL to Emergent Language and LLMs

Authors: Jingdi Chen, Hanqing Yang, Zongjun Liu, Carlee Joe-Wong

Abstract: Multi-agent sequential decision-making underpins many real-world systems, from autonomous vehicles and robotics to collaborative AI assistants. In dynamic and partially observable environments, effective communication is essential for reducing uncertainty and enabling coordination. Although research on multi-agent communication (MA-Comm) spans diverse paradigms, we organize this survey explicitly around the Five Ws of communication: who communicates with whom, what is communicated, when communication occurs, and why communication is beneficial. This lens provides a coherent structure for synthesizing diverse approaches and exposing shared design principles across paradigms. Within Multi-Agent Reinforcement Learning (MARL), early work relied on hand-designed or implicit communication protocols, followed by trainable, end-to-end mechanisms optimized for reward and control. While effective, these approaches often yield task-specific and weakly interpretable communication, motivating research on Emergent Language (EL), where agents develop more structured or symbolic protocols through interaction. EL methods, however, still face challenges in grounding, generalization, and scalability, which have driven recent interest in large language models (LLMs) as a means to leverage natural language priors for reasoning, planning, and coordination in open-ended multi-agent settings. This progression motivates our survey: we analyze how communication paradigms evolve in response to the limitations of earlier approaches and how MARL, EL, and LLM-based systems address complementary aspects of multi-agent communication. This paper provides a unified survey of MA-Comm across MARL, EL, and LLM-based multi-agent systems. Organized around the Five Ws, we examine how different paradigms motivate, structure, and operationalize communication, reveal cross-paradigm trade-offs, and identify open challenges in communication, coordination, and learning. By offering systematic comparisons and design-oriented insights, this survey helps the community extract effective communication design patterns and supports the development of hybrid systems that combine learning, language, and control to meet diverse task, scalability, and interpretability requirements.

URL: https://openreview.net/forum?id=LGsed0QQVq

---


New submissions
===============


Title: Robust Answers, Fragile Logic: Probing the Decoupling Hypothesis in LLM Reasoning

Abstract: While Chain-of-Thought (CoT) prompting has become a cornerstone for complex reasoning in Large Language Models (LLMs), the faithfulness of the generated reasoning remains an open question. We investigate the Decoupling Hypothesis: that correct answers often mask fragile, post-hoc rationalizations that are not causally tied to the model's prediction. To systematically verify this, we introduce MATCHA, a novel Answer-Conditioned Probing framework. Unlike standard evaluations that focus on final output accuracy, MATCHA isolates the reasoning phase by conditioning generation on the model's predicted answer, allowing us to stress-test the stability of the rationale itself. Our experiments reveal a critical vulnerability: under imperceptible input perturbations, LLMs frequently maintain the correct answer while generating inconsistent or nonsensical reasoning - effectively being "Right for the Wrong Reasons". Using LLM judges to quantify this robustness gap, we find that multi-step and commonsense tasks are significantly more susceptible to this decoupling than logical tasks. Furthermore, we demonstrate that adversarial examples generated by MATCHA transfer non-trivially to black-box models. Our findings expose the illusion of CoT robustness and underscore the need for future architectures that enforce genuine answer-reasoning consistency rather than mere surface-level accuracy.

URL: https://openreview.net/forum?id=pMhTFUdM4G
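
One way to picture answer-conditioned probing: fix the model's predicted answer, perturb the question, and have a judge compare the two rationales. The callables and prompt templates below are illustrative guesses, not the MATCHA implementation.

    def probe_decoupling(llm, judge, question, perturb):
        answer = llm(f"Q: {question}\nAnswer only:")
        rationale = llm(f"Q: {question}\nThe answer is {answer}. "
                        f"Explain step by step why:")
        q_adv = perturb(question)  # small, meaning-preserving edit
        rationale_adv = llm(f"Q: {q_adv}\nThe answer is {answer}. "
                            f"Explain step by step why:")
        # An LLM judge scores whether the two rationales make the same
        # argument; a low score signals answer-reasoning decoupling.
        return answer, judge(rationale, rationale_adv)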

---

Title: Augmented Mixup Procedure for Privacy-Preserving Collaborative Training

Abstract: Mixup, introduced by Zhang et al., is a regularization technique for training neural networks that generates convex combinations of input samples and their corresponding labels. Motivated by this approach, Huang et al. proposed InstaHide, an image encryption method designed to preserve the discriminative properties of data while protecting original information during collaborative training across multiple parties. However, recent studies by Carlini et al., Luo et al., and Chen et al. have demonstrated that attacks exploiting the linear system generated by the mixup procedure can compromise the security guarantees of InstaHide. To address this vulnerability, we propose a modified mixing procedure that introduces perturbations into samples before forming convex combinations, making the associated linear inverse problem ill-conditioned for adversaries. We present a theoretical worst-case security analysis and empirically evaluate the performance of our method in mitigating such attacks. Our results indicate that robust attack mitigation can be achieved by increasing the perturbation level, without causing a significant reduction in classification accuracy. Furthermore, we compare the performance of our approach with that of InstaHide on standard benchmark datasets, including MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet.

URL: https://openreview.net/forum?id=1SrZyNgmpY
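
The core modification reads, in a minimal numpy sketch (parameter names and the noise model are our assumptions):

    import numpy as np

    def perturbed_mixup(xs, ys, sigma=0.1, rng=np.random.default_rng(0)):
        # Perturb each sample before the convex combination, so the linear
        # system an adversary must invert becomes ill-conditioned.
        lam = rng.dirichlet(np.ones(len(xs)))  # convex weights, sum to 1
        noisy = [x + sigma * rng.standard_normal(x.shape) for x in xs]
        x_mix = sum(l * x for l, x in zip(lam, noisy))
        y_mix = sum(l * y for l, y in zip(lam, ys))
        return x_mix, y_mix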

---

Title: MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Abstract: We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data—signals typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M–1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameters/300B-tokens setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36× fewer tokens (300B vs. ≈11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness.

URL: https://openreview.net/forum?id=SyCcUNUUMf
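
As a sketch of what risk-aware usage of the shard-level metadata might look like (the field names are assumptions, not the dataset's actual schema):

    ALLOWED_TIERS = {1, 2}  # e.g., keep only the lowest-risk tiers

    def select_shards(shards):
        # Each shard is assumed to carry a risk tier and license label.
        return [s for s in shards if s["tier"] in ALLOWED_TIERS]

    shards = [
        {"path": "shard-0001", "tier": 1, "license": "CC-BY"},
        {"path": "shard-0002", "tier": 3, "license": "unknown"},
    ]
    print(select_shards(shards))  # keeps only shard-0001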

---

Title: Learning Structured Set Utility Functions with Contrastive Element Representations

Abstract: Learning utility functions over sets of elements is central to many machine learning and decision-making tasks such as feature selection, sensor placement, and content recommendation, where the goal is to evaluate and select an optimal subset of elements that provide the largest utility. These utility functions often exhibit desirable properties like monotonicity and submodularity over sets, but are typically expensive to evaluate and may lack an explicit analytical form. Moreover, the utility of a set can vary depending on certain contextual variables, further complicating the learning task. In this work, we propose a unified framework for modeling and learning contextual set functions with monotone submodular structure from data using deep networks equipped with structural regularization. Our key insight is to decompose the set function into two learnable components: (i) a context-conditioned contrastive embedding network that maps elements to a shared latent space based on performance and contextual similarity, and (ii) an aggregation network that predicts set-level utility from the sum of embeddings with a submodular norm-based regularization term encouraging the learned function to exhibit diminishing returns. This combination improves utility prediction for unseen sets and contexts and enables greedy subset selection, which admits near-optimality guarantees. We evaluate our framework on a wide variety of real-world contextual subset selection tasks such as content recommendation, document summarization, and sensor selection, demonstrating consistent improvements in utility prediction compared to baselines and stronger subset selection performance under context shifts.

URL: https://openreview.net/forum?id=SZ8mOziJBx
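
The two-component decomposition lends itself to a DeepSets-style sketch; the sizes and layers below are placeholders, and the greedy loop relies on the near-optimality guarantee for (approximately) monotone submodular functions.

    import torch
    import torch.nn as nn

    class SetUtility(nn.Module):
        def __init__(self, d_elem, d_ctx, d_emb=64):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(d_elem + d_ctx, 128),
                                       nn.ReLU(), nn.Linear(128, d_emb))
            self.agg = nn.Sequential(nn.Linear(d_emb, 64),
                                     nn.ReLU(), nn.Linear(64, 1))

        def forward(self, elems, ctx):
            # elems: (n, d_elem); ctx: (d_ctx,) broadcast to every element.
            z = self.embed(torch.cat(
                [elems, ctx.expand(len(elems), -1)], dim=-1))
            return self.agg(z.sum(dim=0)).squeeze()

    def greedy_select(model, elems, ctx, k):
        # Greedily add the element with the largest predicted utility gain.
        chosen, remaining = [], list(range(len(elems)))
        for _ in range(k):
            gains = [model(elems[chosen + [i]], ctx) for i in remaining]
            best = remaining[int(torch.stack(gains).argmax())]
            chosen.append(best)
            remaining.remove(best)
        return chosen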

---

Title: Forcing and Diagnosing Failure Modes of Fourier Neural Operators Across Diverse PDE Families

Abstract: Fourier Neural Operators (FNOs) have shown strong performance in learning solution maps of partial differential equations (PDEs). Still, their robustness under distribution shifts, long-horizon rollouts, and structural perturbations remains poorly understood. We present a systematic stress-testing framework that probes failure modes of FNOs across five qualitatively different PDE families: dispersive, elliptic, multi-scale fluid, financial, and chaotic systems. Rather than optimizing in-distribution accuracy, we design controlled stress tests — including parameter shifts, boundary or terminal-condition changes, resolution extrapolation with spectral analysis, and iterative rollouts — to expose vulnerabilities such as spectral bias, compounding integration errors, and overfitting to restricted boundary regimes. Our large-scale evaluation (1,000 trained models) reveals that distribution shifts in parameters or boundary conditions can inflate errors by more than an order of magnitude, while resolution changes primarily concentrate error in high-frequency modes. Input perturbations generally do not amplify error, though worst-case scenarios (e.g., localized Poisson perturbations) remain challenging. These findings provide a comparative failure-mode atlas and actionable insights for improving robustness in operator learning.

URL: https://openreview.net/forum?id=0S1LWZHQYn
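
One of the stress tests, iterative rollout, can be pictured as follows; `fno` and `reference_step` are hypothetical one-step operators, not the paper's code.

    import torch

    @torch.no_grad()
    def rollout_errors(fno, reference_step, u0, n_steps=50):
        u_model, u_ref, errs = u0.clone(), u0.clone(), []
        for _ in range(n_steps):
            u_model = fno(u_model)          # model feeds on its own output
            u_ref = reference_step(u_ref)   # ground-truth integrator
            rel = (torch.linalg.norm(u_model - u_ref)
                   / torch.linalg.norm(u_ref))
            errs.append(rel.item())
        return errs  # growth over steps exposes compounding error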

---

Title: Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Abstract: Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To address the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates intermediate error signals to enhance information flow and convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, paving the way for its application in real-world systems.

URL: https://openreview.net/forum?id=iXFmzKpPNA
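
For context, the classic two-phase EP update (Scellier and Bengio, 2017) that the paper scales up looks like this for a Hopfield-style energy; the proposed intermediate-error variant is not shown.

    import numpy as np

    def ep_weight_update(s_free, s_nudged, beta, lr):
        # s_free: equilibrium neuron states with no output nudging (phase 1)
        # s_nudged: equilibrium states with outputs nudged toward the
        # target with strength beta (phase 2)
        grad = (np.outer(s_nudged, s_nudged)
                - np.outer(s_free, s_free)) / beta
        return lr * grad  # local rule: depends only on the two phases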

---

Title: Injecting Falsehoods: Adversarial Man-in-the-Middle Attacks Undermining Factual Recall in LLMs

Abstract: LLMs are now an integral part of information retrieval. As such, their role as question answering chatbots raises significant concerns due to their demonstrated vulnerability to adversarial man-in-the-middle (MitM) attacks. Here, we propose the first principled attack evaluation on LLM factual memory under prompt injection via Xmera, our novel, theory-grounded MitM framework. By perturbing the input given to "victim" LLMs in three closed-book and fact-based QA settings, we undermine the correctness of the responses and assess the uncertainty of their generation process. Surprisingly, trivial instruction-based attacks achieve the highest success rate (up to ~85.3%) while simultaneously having a high uncertainty for incorrectly answered questions. To provide a simple defense mechanism against Xmera, we train Random Forest classifiers on the response uncertainty levels to distinguish between attacked and unattacked queries (average AUC of up to ~96%). We believe that signaling users to be cautious about the answers they receive from black-box and potentially corrupt LLMs is a first checkpoint toward user cyberspace safety.

URL: https://openreview.net/forum?id=DWxrPA4ZBY
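
The defense is a standard supervised detector; a self-contained sketch with synthetic uncertainty features (the real feature set is not specified here) might look like:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X_clean = rng.normal(0.2, 0.1, size=(500, 4))     # lower uncertainty
    X_attacked = rng.normal(0.6, 0.2, size=(500, 4))  # higher uncertainty
    X = np.vstack([X_clean, X_attacked])
    y = np.array([0] * 500 + [1] * 500)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100,
                                 random_state=0).fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))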

---

Title: Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Abstract: In this work, we evaluate the potential of Large Language Models (LLMs) in building Bayesian Networks (BNs) by approximating domain expert priors. LLMs have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. We explore utilizing the probabilistic knowledge inherent in LLMs to derive probability estimates for statements regarding events and their relationships within a BN. Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Our experiments on eighty publicly available Bayesian Networks, spanning domains from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from data, especially when data is scarce. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with real-world data. Additionally, we establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.

URL: https://openreview.net/forum?id=Fy3Byg3CVo
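
A minimal sketch of eliciting one CPT row from an LLM, assuming a text-completion callable `llm` and an illustrative prompt (both are our assumptions, not the paper's protocol):

    def elicit_cpt_row(llm, child, child_states, parent_assignment):
        probs = []
        for state in child_states:
            prompt = (f"Given {parent_assignment}, what is the probability "
                      f"that {child} = {state}? Reply with a number in "
                      f"[0, 1].")
            probs.append(float(llm(prompt)))
        total = sum(probs)
        return [p / total for p in probs]  # normalize into a distribution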

---
