Daily TMLR digest for Jan 24, 2026


TMLR

Jan 24, 2026, 12:30:07 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Domain Translation with Monolingual Lexical Distribution

Authors: Yusuke Sakai, Zhi Qu, Hidetaka Kamigaito, Taro Watanabe, Xiaojiang Liu

Abstract: Neural machine translation (NMT) often demands a large amount of high-quality training data when adapting to a new domain with a carefully designed fine-tuning strategy. However, constructing a sufficient amount of parallel data for training poses challenges even for fine-tuning. This work proposes to fine-tune a generic NMT model using only the monolingual lexical distribution estimated from a small amount of in-domain data in the target language. Word frequency plays a critical role in analyzing the differences among corpora in various fields, e.g., psycholinguistics and language education, and our challenge lies in whether an NMT model can be fitted using such naive statistics collected from the target-language domain. We leverage a variant of energy-based models (EBMs) based on Conditional Distributional Policy Gradients (CDPG), using a large number of EBMs to constrain the fine-tuning process with the lexical distribution. We conduct experiments across four translation directions and four domain datasets, totaling 16 domain adaptation scenarios. The results demonstrate that our method enables robust domain shift while mitigating catastrophic forgetting, achieving effective domain adaptation using only a small amount of monolingual resources.
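
For intuition, here is a minimal sketch of fine-tuning against a target lexical distribution. It uses a plain REINFORCE-style reward rather than the paper's CDPG-over-EBMs objective, and `sample_fn` is a hypothetical helper that samples translations from the model:

```python
import torch

def unigram_distribution(token_id_lists, vocab_size, smoothing=1e-6):
    """Empirical unigram (lexical) distribution over a small monolingual corpus."""
    counts = torch.full((vocab_size,), smoothing)
    for ids in token_id_lists:
        for i in ids:
            counts[i] += 1.0
    return counts / counts.sum()

def distributional_finetune_step(sample_fn, p_target, optimizer):
    """One REINFORCE-style update pushing outputs toward the target lexical
    distribution.  `sample_fn` is a hypothetical helper returning
    (token_id_lists, per_sample_log_probs) for a batch of sampled
    translations; the paper's actual update is the CDPG objective over
    many EBMs, not this simplified reward."""
    token_ids, log_probs = sample_fn()
    # Reward each sample by how well its tokens fit the target lexicon.
    rewards = torch.stack([p_target[ids].log().mean() for ids in token_ids])
    rewards = rewards - rewards.mean()              # simple variance-reduction baseline
    loss = -(rewards.detach() * log_probs).mean()   # policy-gradient surrogate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```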

URL: https://openreview.net/forum?id=UKLBobrFCR

---

Title: Order from Chaos: Physical World Understanding from Glitchy Gameplay Videos

Authors: Meng Cao, Haoran Tang, Haoze Zhao, Mingfei Han, Ruyang Liu, Qiang Sun, Xiaojun Chang, Ian Reid, Xiaodan Liang

Abstract: Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of achieving human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, referring to visual anomalies that violate predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, an instruction-tuning dataset containing 140,057 glitch-centric question–answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a meta-information–guided prompting strategy that utilizes gameplay metadata such as titles and descriptions to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark with 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5-VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
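
For a sense of what meta-information-guided prompting might look like, here is a hypothetical sketch; the prompt wording and the category names are assumptions, not the paper's pipeline:

```python
def build_glitch_qa_prompt(title, description, categories):
    """Hypothetical sketch of meta-information-guided prompting: gameplay
    metadata (title, uploader description) steers a QA generator toward
    the physical law the glitch violates."""
    return (
        f"Gameplay video title: {title}\n"
        f"Uploader description: {description}\n"
        f"Candidate physics categories: {', '.join(categories)}\n"
        "The video contains a glitch violating a physical law. Write one\n"
        "multiple-choice question (four options, one correct) asking which\n"
        "physical principle the glitch violates, grounded in the metadata."
    )

prompt = build_glitch_qa_prompt(
    "Car clips through bridge at full speed",
    "Ragdoll physics gone wrong in an open-world racer",
    ["gravity", "rigid-body collision", "velocity", "object permanence"],
)
```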

URL: https://openreview.net/forum?id=Oe5TdpPv1b

---

Title: Generative Causal Structure Learning with Dual Latent Spaces and Annealing

Authors: Soma Bandyopadhyay, Sudeshna Sarkar

Abstract: In this work, we address causal structure learning in the presence of unobserved confounders. Such causal structures can be represented by Acyclic Directed Mixed Graphs (ADMGs), where observed cause-effect relations are depicted by directed edges and unobserved confounded relations by bidirected edges. Prior methods for causal structure learning with unobserved common causes have primarily focused on search-based approaches, and more recently on flow-based generative models. We propose a novel generative method based on a variant of the Variational Autoencoder (VAE) with dual latent spaces to represent the directed cause-effect relations and the bidirected unobserved confounded relations, associating two trainable adjacency matrices. To enhance the learning process, we introduce a causality constraint combined with the concept of a causal annealing strategy during training, guiding the learning toward meaningful causal structures. Experimental results show that our method achieves competitive performance in identifying both observed and latent causal relationships on synthetic datasets. Furthermore, we demonstrate that the learned causal structure significantly improves downstream causal inference performance on real-world data.
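
Two of the stated ingredients are concrete enough to sketch: an acyclicity ("causality") constraint on the directed adjacency matrix and an annealed constraint weight. The sketch below uses the standard NOTEARS penalty and a geometric ramp; the paper's exact constraint and schedule may differ:

```python
import torch

def acyclicity_penalty(D):
    """NOTEARS-style constraint h(D) = tr(exp(D*D)) - d, zero iff the
    weighted directed graph D is acyclic.  A standard choice for the
    abstract's causality constraint; the paper's exact form may differ."""
    return torch.trace(torch.matrix_exp(D * D)) - D.shape[0]

def annealed_weight(step, total_steps, w_min=1e-3, w_max=10.0):
    """Hypothetical annealing schedule: ramp the constraint weight so early
    training focuses on reconstruction, later training on a valid ADMG."""
    t = min(step / total_steps, 1.0)
    return w_min * (w_max / w_min) ** t     # geometric ramp

d = 5
D = torch.randn(d, d, requires_grad=True)   # directed (cause-effect) edges
B = torch.randn(d, d, requires_grad=True)   # bidirected (confounded) edges
B_sym = 0.5 * (B + B.T)                     # bidirected part must be symmetric
loss = annealed_weight(step=100, total_steps=1000) * acyclicity_penalty(D)
loss.backward()                             # added to the VAE's ELBO in training
```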

URL: https://openreview.net/forum?id=wI5rFWfjKV

---

Title: Learning and Transferring Physical Models through Derivatives

Authors: Alessandro Trenta, Andrea Cossu, Davide Bacciu

Abstract: We propose Derivative Learning (DERL), a supervised approach that models physical systems by learning their partial derivatives. We also leverage DERL to build physical models incrementally, by designing a distillation protocol that effectively transfers knowledge from a pre-trained model to a student one. We provide theoretical guarantees that DERL can learn the true physical system, being consistent with the underlying physical laws, even when using empirical derivatives. DERL outperforms state-of-the-art methods in generalizing an ODE to unseen initial conditions and a parametric PDE to unseen parameters. We also design a method based on DERL to transfer physical knowledge across models by extending them to new portions of the physical domain and a new range of PDE parameters. This introduces a new pipeline to build physical models incrementally in multiple stages.
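
A minimal sketch of the derivative-supervision idea, assuming central finite differences for the empirical derivatives and an illustrative target field; the architecture and loss weighting are not the authors' settings:

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def derl_loss(net, coords, du_dt, du_dx):
    """Match the network's autograd derivatives to empirical ones."""
    coords = coords.clone().requires_grad_(True)
    grads = torch.autograd.grad(net(coords).sum(), coords, create_graph=True)[0]
    return ((grads[:, 0] - du_dt) ** 2).mean() + ((grads[:, 1] - du_dx) ** 2).mean()

# Empirical derivatives of u(t, x) = sin(x - t) sampled on a grid.
t, x = torch.meshgrid(torch.linspace(0, 1, 32), torch.linspace(0, 1, 32), indexing="ij")
u = torch.sin(x - t)
dt = dx = 1 / 31
du_dt = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2 * dt)   # central differences
du_dx = (u[1:-1, 2:] - u[1:-1, :-2]) / (2 * dx)
coords = torch.stack([t[1:-1, 1:-1], x[1:-1, 1:-1]], dim=-1).reshape(-1, 2)
loss = derl_loss(net, coords, du_dt.reshape(-1), du_dx.reshape(-1))
loss.backward()
```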

URL: https://openreview.net/forum?id=IbBCDDeDF7

---


New submissions
===============


Title: Precision Is Not Performance: A Utility-Aware Evaluation of Quantized LLM Inference

Abstract: Large Language Models (LLMs) have become an increasingly important part of modern AI systems; however, as LLMs grow in size, response latency grows with them, and memory consumption and compute cost make efficient inference difficult when resources are limited. Quantization, which reduces the numerical precision used at inference time, is a possible way to address these issues by lowering memory usage and improving cost efficiency. Unfortunately, research into quantization has typically focused on theoretical performance predictions and isolated performance benchmarks, providing a limited view of how reduced numerical precision affects the end-to-end behavior of model responses in real-world use. As a result, a significant gap exists in the practical ability to make deployment decisions about quantized LLMs. To help fill this gap, this study presents a utility-aware, end-to-end evaluation framework for quantized LLM inference that operates on actual models, actual prompts, and actual hardware, capturing for each configuration the latency, the throughput, and the impact on response quality at various levels of precision. The proposed Utility-Aware Quantization Framework (UAQF) is applied to a modern instruction-tuned LLM (Mistral-7B-Instruct-v0.2) and tested with FP16, 8-bit, and 4-bit quantization, comparing each configuration against a high-precision baseline evaluation. The experimental results suggest that lower-bit quantization increases throughput dramatically with little or no effect on output quality, and that aggressive quantization yields significantly greater overall utility than intermediate-precision settings, underscoring the need for empirical, deployment-oriented methods for quantization assessment.
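
A minimal sketch of such a benchmark loop using Hugging Face transformers with bitsandbytes quantization; this is illustrative, not the paper's UAQF code, and quality scoring against the high-precision reference would be a separate step:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def load_model(precision):
    """Load the model at a given precision (the 8-/4-bit paths need a CUDA
    GPU and the bitsandbytes package)."""
    kwargs = {"device_map": "auto"}
    if precision == "fp16":
        kwargs["torch_dtype"] = torch.float16
    elif precision == "int8":
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
    elif precision == "int4":
        kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True)
    return AutoModelForCausalLM.from_pretrained(MODEL_ID, **kwargs)

def benchmark(model, prompt, max_new_tokens=128):
    """Latency and throughput for one prompt; response quality would be
    scored separately against the FP16 reference output."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    n_new = out.shape[1] - inputs["input_ids"].shape[1]
    return {"latency_s": elapsed,
            "tokens_per_s": n_new / elapsed,
            "text": tokenizer.decode(out[0], skip_special_tokens=True)}
```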

URL: https://openreview.net/forum?id=Abl9vSlIn0

---

Title: Mitigating Preference Hacking in Policy Optimization with Pessimism

Abstract: This work tackles the problem of overoptimization in reinforcement learning from human feedback (RLHF), a prevalent technique for aligning models with human preferences. RLHF relies on reward or preference models trained on fixed preference datasets, and these models are unreliable when evaluated outside the support of this preference data, leading to the common reward or preference hacking phenomenon. We propose novel, pessimistic objectives for RLHF which are provably robust to overoptimization through the use of pessimism in the face of uncertainty, and design practical algorithms, P3O and PRPO, to optimize these objectives. Our approach is derived for the general preference optimization setting, but can be used with reward models as well. We evaluate P3O and PRPO on the tasks of fine-tuning language models for document summarization and creating helpful assistants, demonstrating a remarkable resilience to overoptimization.
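
For intuition, one common way to instantiate pessimism is a lower-confidence-bound reward from an ensemble of reward models; the sketch below shows that stand-in, not the paper's P3O/PRPO objectives:

```python
import torch

def pessimistic_reward(reward_models, batch, beta=1.0):
    """Pessimism in the face of uncertainty, sketched with an ensemble:
    penalize responses where the reward models disagree, i.e. where the
    preference data gives little support.  A stand-in, not the paper's
    derivation."""
    with torch.no_grad():
        scores = torch.stack([rm(batch) for rm in reward_models])  # (k, batch)
    return scores.mean(dim=0) - beta * scores.std(dim=0)  # lower confidence bound

# This pessimistic score would then replace the point-estimate reward in
# a PPO-style policy-optimization loop.
```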

URL: https://openreview.net/forum?id=nuJZj6idVY

---

Title: CoQuEST: Entity-Focused Code-Mixed Question Generation for Entertainment Videos

Abstract: Earlier research on video-based question generation has primarily focused on generating questions about general objects and attributes, often neglecting the complexities of bilingual communication and entity-specific queries. This study addresses these limitations by developing CoQuEST, a multimodal transformer framework capable of integrating video and textual inputs to generate semantically rich, entity-centric, and information-driven questions in a code-mixed Hindi-English format. Such a system is particularly significant for multilingual societies, offering applications in bilingual education, interactive learning platforms, and conversational agents, while promoting cultural and linguistic relevance. To the best of our knowledge, no large-scale Hindi-English (Hinglish) code-mixed dataset exists for video-based question generation. To address this limitation, we curated a subset of the TVQA dataset and had it annotated by bilingual experts, ensuring fluency, contextual appropriateness, and adherence to the code-mixed structure. Empirical evaluation shows that CoQuEST achieves competitive performance (RQUGE: 1.649, BLEU-1: 0.04, CIDEr: 0.29, METEOR: 0.20, Distinct-1: 0.96, Distinct-2: 0.99, ROUGE-L: 0.20, BERT-Score F1: 0.88), validating its practical utility and effectiveness. We make the code and dataset publicly available.

URL: https://openreview.net/forum?id=rHtLS6pFNW

---

Title: Adaptive Model Selection in Offline Contextual MDP's without Stationarity

Abstract: Contextual MDPs are powerful tools with wide applicability in areas from biostatistics to machine learning. However, specializing them to offline datasets has been challenging due to a lack of robust, theoretically backed methods. Our work tackles this problem by introducing a new approach to adaptive estimation and cost optimization of contextual MDPs. This estimator, to the best of our knowledge, is the first of its kind and is endowed with strong optimality guarantees. We achieve this by overcoming the key technical challenges arising from the endogenous properties of contextual MDPs, such as non-stationarity and model irregularity. Our guarantees are established in complete generality by utilizing the relatively recent and powerful statistical technique of $T$-estimation (Baraud, 2011). We first provide a procedure for selecting an estimator given a sample from a contextual MDP and use it to derive oracle risk bounds under two distinct, but nevertheless meaningful, loss functions. We then consider the problem of determining the optimal control with the aid of the aforementioned density estimate and provide finite-sample guarantees for the cost function.

URL: https://openreview.net/forum?id=FGBZ4q1HPZ

---

Title: A Resilience Framework for Bi-Criteria Combinatorial Optimization with Bandit Feedback

Abstract: We study bi-criteria combinatorial optimization under noisy function evaluations. While resilience and black-box offline-to-online reductions have been studied in single-objective settings, extending these ideas to bi-criteria problems introduces new challenges due to the coupled degradation of approximation guarantees for objectives and constraints. We introduce a notion of $(\alpha,\beta,\delta,\texttt{N})$-resilience for bi-criteria approximation algorithms, capturing how joint approximation guarantees degrade under bounded (possibly adversarial) oracle noise, and develop a general black-box framework that converts any resilient offline algorithm into an online algorithm for bi-criteria combinatorial multi-armed bandits with bandit feedback. The resulting online guarantees achieve sublinear regret and cumulative constraint violation of order $\tilde{O}(\delta^{2/3}\texttt{N}^{1/3}T^{2/3})$ without requiring structural assumptions such as linearity, submodularity, or semi-bandit feedback on the noisy functions. We demonstrate the applicability of the framework by establishing resilience for several classical greedy algorithms in submodular optimization.
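
The offline-to-online reduction can be sketched as an explore-then-commit skeleton; the budget split and interfaces below are assumptions, not the paper's construction:

```python
def explore_then_commit(noisy_eval, offline_alg, arms, T):
    """Skeleton of a black-box offline-to-online reduction in the spirit of
    the abstract: spend ~T^(2/3) rounds estimating objective/constraint
    values with repeated noisy pulls, hand the empirical means to a
    resilient offline approximation algorithm, then commit to its output."""
    budget = max(len(arms), int(T ** (2 / 3)))
    pulls = budget // len(arms)
    estimates = {a: sum(noisy_eval(a) for _ in range(pulls)) / pulls
                 for a in arms}
    solution = offline_alg(estimates)   # must tolerate bounded estimation error
    return [solution] * (T - pulls * len(arms))
```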

URL: https://openreview.net/forum?id=jcjxXUMyJ5

---

Title: Learning Embeddings for Discrete Tree-Structured Data via Structural Prediction

Abstract: Tree-structured data in natural language syntax, program analysis, and other symbolic domains are typically discrete, rooted, and ordered combinatorial objects. Despite their ubiquity, scalable and learnable representations for comparing such discrete structural trees remain limited. Classical methods such as tree edit distance (TED) and tree kernels provide principled structural measures but are computationally prohibitive, while previous neural encoders often produce latent representations without defining a consistent or interpretable space.

We introduce a framework for learning embeddings for discrete tree-structured data in which a Transformer encoder is trained through structural prediction tasks—predicting parent indices, node positions, and optionally tree-level categories. Rather than supervising distances directly, these structural objectives induce a coherent Euclidean embedding space for rooted, ordered trees.
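
A minimal sketch of one such structural objective, assuming a vanilla Transformer encoder and a parent-index head; node-position and tree-category heads would be analogous, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class TreeStructEncoder(nn.Module):
    """Encode node tokens with a Transformer and predict each node's parent
    index; the pooled hidden state serves as the tree embedding."""
    def __init__(self, vocab_size, d=128, max_nodes=64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d)
        self.pos = nn.Embedding(max_nodes, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.parent_head = nn.Linear(d, max_nodes)   # logits over parent slots

    def forward(self, node_ids):                     # (batch, n_nodes)
        pos = torch.arange(node_ids.shape[1], device=node_ids.device)
        h = self.enc(self.tok(node_ids) + self.pos(pos))
        return self.parent_head(h), h.mean(dim=1)   # per-node logits, tree embedding

model = TreeStructEncoder(vocab_size=1000)
node_ids = torch.randint(0, 1000, (2, 10))           # two dummy 10-node trees
gold_parents = torch.randint(0, 10, (2, 10))         # dummy parent indices
logits, tree_emb = model(node_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 64), gold_parents.reshape(-1))
```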

A key property of the resulting embedding space is its stability under local structural perturbations: a bounded number of edits, such as inserting or deleting a leaf node, produces a proportionally bounded change in the embedding. Empirically, real datasets exhibit a global envelope in which the ratio between embedding distance and edit count remains uniformly bounded. This yields a smoother and more robust structure than TED and other discrete comparison methods, which often exhibit abrupt jumps under minor structural variations.

We demonstrate the effectiveness of our approach across Universal Dependencies treebanks, synthetic random trees, and abstract syntax trees. The learned embeddings correlate strongly with TED, reveal cross-linguistic and cross-parser structural patterns, separate natural from random syntax, and support structure-only code clone retrieval. Together, these results show that structural prediction alone can induce a stable, scalable, and domain-general embedding space that captures fine-grained properties of discrete tree structure.

URL: https://openreview.net/forum?id=W1bzDTSwzA

---

Title: Conversational Markov Chains: A Framework for Behavioral Analysis of Large Language Models

Abstract: How do you compare two language models that score identically on benchmarks but behave very differently in conversation? One model may explore diverse topics fluidly, while another repeats familiar patterns.

We propose modeling multi-turn conversations as Markov chains over semantic states. By embedding conversation turns and clustering them into discrete states, we construct transition graphs that capture conversational dynamics beyond static performance metrics. From this structure, we derive three interpretable observables: entropy rate, spectral gap, and stationary distribution, corresponding respectively to behavioral diversity, responsiveness, and long-term conversational patterns.
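
These three observables are standard Markov-chain quantities and can be computed directly from a fitted transition matrix. A minimal sketch, assuming k-means clustering of turn embeddings (the paper's clustering choice and state count may differ):

```python
import numpy as np
from sklearn.cluster import KMeans

def markov_observables(turn_embeddings, n_states=8, seed=0):
    """Cluster turns into discrete semantic states, estimate the transition
    matrix, and compute entropy rate, spectral gap, and the stationary
    distribution."""
    states = KMeans(n_clusters=n_states, random_state=seed,
                    n_init=10).fit_predict(turn_embeddings)
    P = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):        # consecutive turns
        P[a, b] += 1
    P /= np.clip(P.sum(axis=1, keepdims=True), 1, None)
    vals, vecs = np.linalg.eig(P.T)
    pi = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    pi /= pi.sum()                                   # stationary distribution
    safe = np.where(P > 0, P, 1.0)                   # avoid log(0)
    entropy_rate = float(pi @ -(P * np.log2(safe)).sum(axis=1))
    spectral_gap = float(1 - np.sort(np.abs(vals))[-2])  # mixing speed
    return entropy_rate, spectral_gap, pi
```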

We apply the framework to over 300,000 turns of teacher–student dialogue, comparing Llama 3.1 8B and Mistral 7B. Despite similar benchmark performance, the models exhibit distinct behavioral signatures: Llama produces more diverse responses and transitions more fluidly between semantic states, while Mistral concentrates probability mass on a narrower set of conversational behaviours.

Conversational Markov analysis provides a principled, model-agnostic tool for analysing how language models behave over time, complementing existing evaluation methods and enabling deeper insight into conversational dynamics.

URL: https://openreview.net/forum?id=PhaC70uXc5

---

Title: Limits to Predicting Online Speech Using Large Language Models

Abstract: Our paper studies the predictability of online speech -- that is, how well language models learn to model the distribution of user-generated content on X (previously Twitter). We define predictability as a measure of the model's uncertainty, i.e., its negative log-likelihood. As the basis of our study, we collect 10M tweets for "tweet-tuning" base models and a further 6.25M posts from more than 5,000 X users and their peers. Across these subjects, we find that predicting posts of individual users remains surprisingly hard. Moreover, it matters greatly what context is used: models using the users' own history significantly outperform models using posts from their social circle. We validate these results across four large language models ranging in size from 1.5 billion to 70 billion parameters, and our results replicate if, instead of prompting the model with additional context, we finetune on it. We follow up with a detailed investigation of what is learned in-context and a demographic analysis: up to 20% of what is learned in-context is the use of @-mentions and hashtags, and our main results hold across the demographic groups we studied.
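
The predictability measure itself is straightforward to compute: per-token negative log-likelihood of a post, optionally conditioned on context. A minimal sketch with a small stand-in model (GPT-2 here; the paper uses larger models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def nll_per_token(post, context=""):
    """Mean negative log-likelihood of `post`'s tokens, conditioned on an
    optional context prefix (e.g. the user's own history vs. peer posts)."""
    ids = tok(context + post, return_tensors="pt").input_ids
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1] if context else 0
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # next-token predictions
    targets = ids[0, 1:]
    nll = -logp[torch.arange(targets.numel()), targets]
    return nll[max(ctx_len - 1, 0):].mean().item()     # score only the post

# Lower NLL = more predictable; the paper's comparison is, in spirit:
# nll_per_token(post, context=own_history) vs. nll_per_token(post, context=peer_posts)
```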

URL: https://openreview.net/forum?id=nAzOZauIB8

---
