Daily TMLR digest for Dec 31, 2025

TMLR

Dec 31, 2025, 12:30:06 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Unreasonable effectiveness of LLM reasoning: a doubly cautionary tale of temporal question-answering

Authors: Dagmara Panas, Ali Payani, Vaishak Belle

Abstract: The remarkable success of Large Language Models in modeling both the syntax and the semantics of language has prompted a body of research into language-adjacent abilities, most notably commonsense reasoning.
As LLMs' performance continues to advance on successive benchmarks, we turn to temporal reasoning, which lags somewhat behind other tasks due to its more complex logic.
We start from previous work, in which the authors successfully induce (apparent) reasoning by breaking the problem down into a two-step procedure of temporal graph extraction and subsequent reasoning.
Specifically, in the first step an LLM is prompted to parse a natural language description into a semi-structured timeline of events; and in the second step, it is given the extracted timeline and prompted to answer a temporal reasoning question.
We conjecture that this procedure presents two separate opportunities for introducing errors, and further hypothesise that a neuro-symbolic approach should help in this matter.
We follow the recent trend of using external executors in concert with LLMs to carry out exact reasoning and verification.
We see the reasoning step of the original two-step procedure as a natural target for a symbolic solver and design a rule-based solution for Temporal Question-Answering, drawing on ideas from Allen’s Interval Algebra.
To our surprise, we find that our rule-based reasoner does not improve beyond the previously reported, purely neural solution.
It appears that both our approach and the previous method operate at around the limits of achievable performance, imposed by the correctness of information extraction.
Such a result seems to suggest that a non-symbolic LLM is capable of symbolic-level reasoning, although upon further investigation we discover that this is not the case.
It is not that the neural solution makes no reasoning mistakes, but rather that the LLM manages to compensate for some of its erroneous replies by 'short-cutting' to the correct answer on other questions; in other words, not reasoning but guessing.
Although the effect is not pronounced performance-wise, we feel it is conceptually important: as we argue, the production of correct answers is not a measure of reasoning.
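
For readers unfamiliar with Allen's Interval Algebra, the sketch below shows the kind of pairwise interval classification a rule-based temporal reasoner builds on (an illustrative Python sketch, not the authors' implementation):

    # Classify the Allen relation between two intervals, each given as a
    # (start, end) pair with start < end. The thirteen relations covered
    # here are exhaustive and mutually exclusive for real-valued endpoints.
    def allen_relation(a, b):
        (a1, a2), (b1, b2) = a, b
        if a2 < b1:  return "before"
        if a2 == b1: return "meets"
        if b2 < a1:  return "after"
        if b2 == a1: return "met-by"
        if (a1, a2) == (b1, b2): return "equal"
        if a1 == b1: return "starts" if a2 < b2 else "started-by"
        if a2 == b2: return "finishes" if a1 > b1 else "finished-by"
        if b1 < a1 and a2 < b2: return "during"
        if a1 < b1 and b2 < a2: return "contains"
        return "overlaps" if a1 < b1 else "overlapped-by"

    assert allen_relation((1, 3), (2, 5)) == "overlaps"

A full solver would additionally compose such relations along chains of events via Allen's transitivity table; the sketch covers only the pairwise classification step.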

URL: https://openreview.net/forum?id=1DkD0Nd8Rd

---

Title: How Many Images Does It Take? Estimating Imitation Thresholds in Text-to-Image Models

Authors: Sahil Verma, Royi Rassin, Arnav Mohanty Das, Gantavya Bhatt, Preethi Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar

Abstract: Text-to-image models are trained using large datasets of image-text pairs collected from the internet. These datasets often include copyrighted and private images. Training models on such datasets enables them to generate images that might violate copyright laws and individual privacy. This phenomenon is termed imitation – the generation of images whose content bears recognizable similarity to the model's training images. In this work we estimate the point at which a model has been trained on enough instances of a concept to be able to imitate it – the imitation threshold. We pose this question as a new problem and propose an efficient approach that estimates the imitation threshold without incurring the colossal cost of training these models from scratch. We experiment with two domains – human faces and art styles – and evaluate four text-to-image models trained on three pretraining datasets. We estimate the imitation threshold of these models to be in the range of 200-700 images, depending on the domain and the model. The imitation threshold provides an empirical basis for copyright violation claims and acts as a guiding principle for text-to-image model developers who aim to comply with copyright and privacy laws.
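
As a rough illustration of what estimating such a threshold involves, the sketch below picks the concept count that best separates low-imitation from high-imitation concepts; the inputs (per-concept training-image counts and imitation scores) are assumed to be given, and the paper's actual estimator differs:

    # Crude change-point sketch: given each concept's training-image count
    # and an imitation score in [0, 1], return the count at the split that
    # maximizes the gap between mean scores on either side.
    def imitation_threshold(counts, scores):
        pairs = sorted(zip(counts, scores))
        best_gap, best_count = float("-inf"), None
        for i in range(1, len(pairs)):
            left = sum(s for _, s in pairs[:i]) / i
            right = sum(s for _, s in pairs[i:]) / (len(pairs) - i)
            if right - left > best_gap:
                best_gap, best_count = right - left, pairs[i][0]
        return best_count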

URL: https://openreview.net/forum?id=x0qJo7SPhs

---

Title: Cluster Agnostic Network Lasso Bandits

Authors: Sofien Dhouib, Steven Bilaj, Behzad Nourani-Koliji, Setareh Maghsudi

Abstract: We consider a multi-task contextual bandit setting, where the learner is given a graph encoding relations between the bandit tasks. The tasks' preference vectors are assumed to be piecewise constant over the graph, forming clusters. At every round, we estimate the preference vectors by solving an online network lasso problem with a suitably chosen, time-dependent regularization parameter. We establish a novel oracle inequality relying on a convenient restricted eigenvalue assumption. Our theoretical findings highlight the importance of dense intra-cluster connections and sparse inter-cluster ones, which yield a sublinear regret bound significantly lower than its counterpart in the independent task learning setting. Finally, we support our theoretical findings with an experimental evaluation against graph bandit multi-task learning and online clustering of bandits algorithms.
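
For intuition, the network lasso penalty ties each task's preference vector to its graph neighbours, pushing the solution towards piecewise-constant (clustered) vectors; a minimal numpy sketch of the regularized objective (our notation, not the paper's):

    import numpy as np

    # Network lasso objective for multi-task linear estimation: a per-task
    # least-squares fit plus an edge-wise penalty that encourages connected
    # tasks to share the same preference vector.
    def network_lasso_objective(W, X, y, edges, lam):
        fit = sum(np.sum((X[i] @ W[i] - y[i]) ** 2) for i in range(len(X)))
        penalty = sum(np.linalg.norm(W[i] - W[j]) for i, j in edges)
        return fit + lam * penalty

Dense intra-cluster edges make the penalty pull same-cluster vectors together strongly, while sparse inter-cluster edges keep distinct clusters from being merged, matching the regime the theory favours.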

URL: https://openreview.net/forum?id=QjAyoMP1DD

---

Title: Convergence Aspects of Hybrid Kernel SVGD

Authors: Anson MacDonald, Scott A Sisson, Sahani Pathiraja

Abstract: Stein variational gradient descent (SVGD) is a particle-based approximate inference algorithm. Many variants of SVGD have been proposed in recent years, including the hybrid kernel variant (h-SVGD), which has demonstrated promising results on image classification with deep neural network ensembles. By framing h-SVGD as a kernelised Wasserstein gradient flow on a functional that is not the Kullback-Leibler divergence, we demonstrate that h-SVGD does not converge to the target distribution in the mean field limit. Despite this theoretical result, we provide intuition and experimental support for the ability of h-SVGD to improve variance estimation in high dimensions. Unlike other SVGD variants that also alleviate variance collapse, this is achieved at no additional computational cost and without further assumptions on the posterior.
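
For context, a standard SVGD step moves each particle along a kernel-averaged score plus a kernel-gradient repulsion; the hybrid variant uses a different kernel for each of the two terms. A minimal numpy sketch with RBF kernels of bandwidths h1 and h2 (illustrative only):

    import numpy as np

    def rbf(X, h):
        # Pairwise RBF kernel matrix k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2h)).
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * h))

    def h_svgd_step(X, grad_log_p, h1, h2, eps=1e-2):
        # One hybrid-kernel SVGD update for particles X of shape (n, d):
        # kernel K1 weights the driving (score) term, K2 the repulsive term.
        n = X.shape[0]
        K1, K2 = rbf(X, h1), rbf(X, h2)
        drive = K1 @ grad_log_p(X)
        repulse = (K2.sum(axis=1, keepdims=True) * X - K2 @ X) / h2
        return X + eps * (drive + repulse) / n

Setting h1 = h2 recovers vanilla SVGD; the paper's analysis concerns what changes in the mean-field limit when the two kernels differ.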

URL: https://openreview.net/forum?id=JZkbMSQDmD

---

Title: AutoAnnotator: A Collaborative Annotation Framework for Large and Small Language Models

Authors: Yao Lu, Ji Zhaiyuan, Jiawei Du, Yu Shanqing, Qi Xuan, Joey Tianyi Zhou

Abstract: Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still faces two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to the field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and, based on it, design AutoAnnotator, a fully automatic annotation framework. Specifically, AutoAnnotator consists of two layers. The upper meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code, and verify difficult samples; the lower task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples flagged by the meta-controller layer's secondary review as a reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving their generalization. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority-voting settings. Notably, AutoAnnotator reduces annotation cost by 74.15% compared to annotating directly with GPT-3.5-turbo, while still improving accuracy by 6.21%. The code is available at https://github.com/Zhaiyuan-Ji/AutoAnnotator.
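
A schematic of the voting-plus-escalation pattern described above (hypothetical interfaces; see the linked repository for the actual implementation):

    from collections import Counter

    # Multi-SLM voting with escalation of hard samples: `slms` are
    # callables text -> label, and `llm_review` adjudicates samples the
    # small models disagree on (both interfaces are hypothetical).
    def annotate(texts, slms, llm_review, agree=0.8):
        labels, hard = [], []
        for t in texts:
            votes = Counter(m(t) for m in slms)
            label, n = votes.most_common(1)[0]
            if n / len(slms) >= agree:
                labels.append(label)
            else:
                labels.append(llm_review(t, votes))  # secondary review
                hard.append(t)  # retained for staged SLM fine-tuning
        return labels, hard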

URL: https://openreview.net/forum?id=LauojtjA9F

---

Title: Positional Encoder Graph Quantile Neural Networks for Geographic Data

Authors: William E. R. de Amorim, Scott A Sisson, Thais Carvalho Valadares Rodrigues, David J Nott, Guilherme S. Rodrigues

Abstract: Positional Encoder Graph Neural Networks (PE-GNNs) are among the most effective models for learning from continuous spatial data. However, their predictive distributions are often poorly calibrated, limiting their utility in applications that require reliable uncertainty quantification. We propose the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel framework that combines PE-GNNs with Quantile Neural Networks, partially monotonic neural blocks, and post-hoc recalibration techniques. The PE-GQNN enables flexible and robust conditional density estimation with minimal assumptions about the target distribution, and it extends naturally to tasks beyond spatial data. Empirical results on benchmark datasets show that the PE-GQNN outperforms existing methods in both predictive accuracy and uncertainty quantification, without incurring additional computational cost. We also identify important special cases arising from our formulation, including the PE-GNN.
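
The quantile-network component of such a model is trained with the pinball loss, whose minimizer is the conditional tau-quantile; a minimal sketch of the standard definition (not specific to this paper):

    import numpy as np

    # Pinball (quantile) loss for quantile level tau in (0, 1): residuals
    # above the prediction are weighted by tau, those below by 1 - tau.
    def pinball_loss(y, y_hat, tau):
        diff = y - y_hat
        return np.mean(np.maximum(tau * diff, (tau - 1) * diff))

Evaluating the network across a grid of tau values yields a full predictive distribution rather than a point estimate, which is what enables the calibration-focused comparison above.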

URL: https://openreview.net/forum?id=5PjL8ZOPBt

---


New submissions
===============


Title: A Systematic Assessment of Weak-to-Strong Confidence Prediction in Large Language Models

Abstract: As large language models (LLMs) are deployed across a wide range of application domains, understanding their capacity through uncertainty quantification (UQ) is crucial for ensuring safe and reliable behavior. Reliable uncertainty estimates that accompany the text generated by an LLM can signal when a response is likely to be incorrect and thus serve as an effective fail-safe mechanism against hallucinations. In this paper, we explore the extent to which the probability of a frontier model answering a query correctly can be predicted by smaller, weaker models with publicly available embeddings, using a simple probe. We show that this probability can be predicted effectively, and that the probes are easy to train, making oversight of large proprietary models more widely accessible. Leveraging embeddings from models as small as Llama3-8b, our predictor achieves 83.4% AUROC on TriviaQA and 64.3% on MMLU, and improves selective prediction accuracy by up to 17.9%. We then carefully analyze how different factors affect probe performance. Across six benchmarks and fifteen weak predictors, we show that performance does not simply improve with predictor model size, and that the weak-to-strong signal is robust to label imbalance and embedding aggregation choices. These findings support the view that representational compatibility between weak-model embeddings and the strong model's behavior matters more than model size alone. Overall, our results advance the understanding of weak-to-strong generalization and provide a simple, scalable framework for building more trustworthy LLMs.
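
The probe described here can be as simple as logistic regression on frozen weak-model embeddings; a sketch under that assumption (the paper's exact probe may differ):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Predict whether the strong model answers each query correctly (0/1)
    # from a weak model's embedding of that query, and report AUROC.
    def train_probe(emb_train, y_train, emb_test, y_test):
        probe = LogisticRegression(max_iter=1000).fit(emb_train, y_train)
        scores = probe.predict_proba(emb_test)[:, 1]
        return probe, roc_auc_score(y_test, scores)

Thresholding the probe's scores gives a selective-prediction rule of the kind evaluated above: answer only when the predicted probability of correctness is high enough.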

URL: https://openreview.net/forum?id=xYSzkg5qPD

---
