Daily TMLR digest for Dec 01, 2025


TMLR

Dec 1, 2025, 12:30:08 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Joint Diffusion for Universal Hand-Object Grasp Generation

Authors: Jinkun Cao, Jingyuan Liu, Kris Kitani, Yi Zhou

Abstract: Predicting and generating human hand grasps over objects is critical for animation and robotics tasks. In this work, we focus on generating both the hand and the object in a grasp with a single diffusion model. Our proposed Joint Hand-Object Diffusion (JHOD) models the hand and object in a unified latent representation. It uses hand-object grasping data to learn to accommodate the hand and object to each other and form plausible grasps. To generalize over diverse object shapes, it also leverages large-scale object datasets to learn an inclusive object latent embedding. With or without a given object as an optional condition, the diffusion model can generate grasps unconditionally or conditioned on the object. Compared to the usual practice of learning object-conditioned grasp generation from hand-object grasp data alone, our method benefits from the more diverse object data used during training and handles grasp generation more universally. In both qualitative and quantitative experiments, conditional and unconditional hand grasp generation achieve good visual plausibility and diversity. With the more inclusive object representation learned from large-scale object datasets, the proposed method generalizes well to unseen object shapes.
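
A rough sketch (not the authors' code) of the optional-conditioning idea: a single diffusion sampler that either receives an object latent or falls back to a learned null embedding. The denoiser, noise schedule, and dimensions below are illustrative placeholders, not the JHOD architecture.

    # Hedged sketch: one diffusion model, optional object condition.
    import torch, torch.nn as nn

    HAND_DIM, OBJ_DIM, T = 64, 32, 50

    class Denoiser(nn.Module):
        def __init__(self):
            super().__init__()
            self.null_obj = nn.Parameter(torch.zeros(OBJ_DIM))  # used when no object is given
            self.net = nn.Sequential(nn.Linear(HAND_DIM + OBJ_DIM + 1, 128),
                                     nn.SiLU(), nn.Linear(128, HAND_DIM))

        def forward(self, x, t, obj=None):
            cond = self.null_obj.expand(x.size(0), -1) if obj is None else obj
            t_feat = torch.full((x.size(0), 1), float(t) / T)
            return self.net(torch.cat([x, cond, t_feat], dim=-1))  # predicted noise

    @torch.no_grad()
    def sample(model, n=4, obj=None):
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)
        x = torch.randn(n, HAND_DIM)
        for t in reversed(range(T)):
            eps = model(x, t, obj)
            x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x  # latent grasp codes; a decoder would map these to hand/object geometry

    model = Denoiser()
    unconditional = sample(model)                              # no object given
    conditional = sample(model, obj=torch.randn(4, OBJ_DIM))   # conditioned on an object latent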

URL: https://openreview.net/forum?id=TZ0ztsYR6x

---

Title: Sparse-Input Neural Network using Group Concave Regularization

Authors: Bin Luo, Susan Halabi

Abstract: Simultaneous feature selection and non-linear function estimation is challenging in modeling, especially in high-dimensional settings where the number of variables exceeds the available sample size. In this article, we investigate the problem of feature selection in neural networks. Although the group least absolute shrinkage and selection operator (LASSO) has been utilized to select variables for learning with neural networks, it tends to select unimportant variables into the model to compensate for its over-shrinkage. To overcome this limitation, we propose a framework of sparse-input neural networks using group concave regularization for feature selection in both low-dimensional and high-dimensional settings. The main idea is to apply a proper concave penalty to the $l_2$ norm of weights from all outgoing connections of each input node, and thus obtain a neural net that only uses a small subset of the original variables. In addition, we develop an effective algorithm based on backward path-wise optimization to yield stable solution paths, in order to tackle the challenge of complex optimization landscapes. We provide a rigorous theoretical analysis of the proposed framework, establishing finite-sample guarantees for both variable selection consistency and prediction accuracy. These results are supported by extensive simulation studies and real data applications, which demonstrate the finite-sample performance of the estimator in feature selection and prediction across continuous, binary, and time-to-event outcomes.
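
As a rough illustration of the core idea (MCP is one possible concave penalty, not necessarily the authors' exact choice), the penalty applied to the per-feature l2 norm of the first layer's outgoing weights could look like this; hyperparameters are illustrative.

    # Hedged sketch: group concave (MCP) penalty on the column norms of the
    # first layer, so that whole input features can be zeroed out.
    import torch, torch.nn as nn

    def mcp(t, lam, gamma=3.0):
        # Minimax concave penalty, applied elementwise to nonnegative group norms t.
        inside = lam * t - t ** 2 / (2 * gamma)
        outside = torch.full_like(t, gamma * lam ** 2 / 2)
        return torch.where(t <= gamma * lam, inside, outside)

    def group_concave_penalty(first_layer, lam=0.1):
        # first_layer.weight has shape (hidden, n_features); column j collects all
        # outgoing weights of input feature j.
        group_norms = first_layer.weight.norm(dim=0)
        return mcp(group_norms, lam).sum()

    net = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
    x, y = torch.randn(64, 100), torch.randn(64, 1)
    loss = nn.functional.mse_loss(net(x), y) + group_concave_penalty(net[0])
    loss.backward()  # features whose whole weight column shrinks to zero drop out of the model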

URL: https://openreview.net/forum?id=m9UsLHZYeX

---

Title: Stabilizing black-box model selection with the inflated argmax

Authors: Melissa Adrian, Jake A Soloff, Rebecca Willett

Abstract: Model selection is the process of choosing from a class of candidate models given data. For instance, methods such as the LASSO and sparse identification of nonlinear dynamics (SINDy) formulate model selection as finding a sparse solution to a linear system of equations determined by training data. However, absent strong assumptions, such methods are highly unstable: if a single data point is removed from the training set, a different model may be selected. In this paper, we present a new approach to stabilizing model selection with theoretical stability guarantees that leverages a combination of bagging and an ``inflated'' argmax operation. Our method selects a small collection of models that all fit the data, and it is stable in that, with high probability, the removal of any training point will result in a collection of selected models that overlaps with the original collection. We illustrate this method in (a) a simulation in which strongly correlated covariates make standard LASSO model selection highly unstable, (b) a Lotka–Volterra model selection problem focused on identifying how competition in an ecosystem influences species' abundances, (c) a graph subset selection problem using cell-signaling data from proteomics, and (d) unsupervised $k$-means clustering. In these settings, the proposed method yields stable, compact, and accurate collections of selected models, outperforming a variety of benchmarks.
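
A minimal sketch of the bagging-plus-inflated-argmax recipe described above (the scoring rule, epsilon, and toy data are illustrative stand-ins, not the paper's construction):

    # Hedged sketch: average candidate scores over bootstrap resamples, then keep
    # every candidate within eps of the best score instead of a single argmax.
    import numpy as np

    def univariate_fit_score(j, d):
        # Negative MSE of a one-variable least-squares fit of y (last column) on column j.
        x, y = d[:, j], d[:, -1]
        beta = (x @ y) / (x @ x)
        return -np.mean((y - beta * x) ** 2)

    def bagged_scores(candidates, data, score_fn, n_bags=50, seed=0):
        rng = np.random.default_rng(seed)
        n = len(data)
        scores = np.zeros(len(candidates))
        for _ in range(n_bags):
            boot = data[rng.integers(0, n, size=n)]      # bootstrap resample of the rows
            scores += np.array([score_fn(c, boot) for c in candidates])
        return scores / n_bags

    def inflated_argmax(scores, eps=0.01):
        return [i for i, s in enumerate(scores) if s >= scores.max() - eps]

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)      # two nearly identical covariates
    y = X[:, 0] + 0.1 * rng.normal(size=200)
    data = np.column_stack([X, y])
    print(inflated_argmax(bagged_scores([0, 1, 2], data, univariate_fit_score)))
    # A plain argmax flips between columns 0 and 1 across resamples; the inflated
    # argmax keeps both, which is the stable behaviour the abstract describes.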

URL: https://openreview.net/forum?id=DSDWHsQLgA

---

Title: CoCoIns: Consistent Subject Generation via Contrastive Instantiated Concepts

Authors: Lee Hsin-Ying, Kelvin C.K. Chan, Ming-Hsuan Yang

Abstract: While text-to-image generative models can synthesize diverse and faithful content, subject variation across multiple generations limits their application to long-form content generation. Existing approaches require time-consuming fine-tuning, reference images for all subjects, or access to previously generated content. We introduce Contrastive Concept Instantiation (CoCoIns), a framework that effectively synthesizes consistent subjects across multiple independent generations. The framework consists of a generative model and a mapping network that transforms input latent codes into pseudo-words associated with specific concept instances. Users can generate consistent subjects by reusing the same latent codes. To construct such associations, we propose a contrastive learning approach that trains the network to distinguish between different combinations of prompts and latent codes. Extensive evaluations on human faces with a single subject show that CoCoIns performs comparably to existing methods while maintaining greater flexibility. We also demonstrate the potential for extending CoCoIns to multiple subjects and other object categories.
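
A rough sketch of the contrastive association between latent codes and pseudo-words (dimensions, the mapping network, and the InfoNCE-style loss below are illustrative placeholders, not the CoCoIns implementation):

    # Hedged sketch: a mapping network turns a latent code into a pseudo-word
    # embedding; a contrastive loss ties the same code to the same concept
    # instance across different prompt contexts.
    import torch, torch.nn as nn, torch.nn.functional as F

    LATENT, TOKEN = 128, 768
    mapper = nn.Sequential(nn.Linear(LATENT, 512), nn.SiLU(), nn.Linear(512, TOKEN))

    def contrastive_loss(z, ctx_a, ctx_b, temp=0.07):
        # The same latent codes seen under two prompt contexts should map to
        # matching pseudo-words; other codes in the batch act as negatives.
        a = F.normalize(mapper(z + ctx_a), dim=-1)
        b = F.normalize(mapper(z + ctx_b), dim=-1)
        logits = a @ b.t() / temp                      # (batch, batch) similarities
        return F.cross_entropy(logits, torch.arange(z.size(0)))

    z = torch.randn(16, LATENT)                        # one latent code per subject instance
    loss = contrastive_loss(z, 0.1 * torch.randn(16, LATENT), 0.1 * torch.randn(16, LATENT))
    loss.backward()
    # At generation time the mapped pseudo-word is inserted into the prompt;
    # reusing the same latent code keeps the subject consistent across generations.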

URL: https://openreview.net/forum?id=fPZ7DNlOSn

---

Title: Unlocking the matrix form of the Quaternion Fourier Transform and Quaternion Convolution: Properties, connections, and application to Lipschitz constant bounding

Authors: Giorgos Sfikas, George Retsinas

Abstract: Linear transformations are ubiquitous in machine learning, and matrices are the standard way to represent them. In this paper, we study matrix forms of quaternionic versions of the Fourier Transform and Convolution operations. Quaternions offer a powerful representation unit; however, their use comes with difficulties that stem foremost from the non-commutativity of quaternion multiplication and from the fact that $\mu^2 = -1$ has infinitely many solutions in the quaternion domain. Handling quaternionic matrices is consequently complicated in several respects (definition of eigenstructure, determinant, etc.). Our findings clarify the relation of the Quaternion Fourier Transform matrix to the standard (complex) Discrete Fourier Transform matrix, and the extent to which well-known complex-domain theorems extend to quaternions. We focus especially on the relation of Quaternion Fourier Transform matrices to Quaternion Circulant matrices (representing quaternionic convolution) and on the eigenstructure of the latter. We also present a proof-of-concept application that makes direct use of our theoretical results: a method to bound the Lipschitz constant of a Quaternionic Convolutional Neural Network. Code is publicly available at: https://github.com/sfikas/quaternion-fourier-convolution-matrix.
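
For orientation, the classical complex/real-domain fact that this line of work generalizes can be checked in a few lines: a circulant matrix (a circular convolution) is diagonalized by the DFT, so its Lipschitz constant is the largest magnitude of the kernel's Fourier coefficients. The quaternionic counterpart requires the machinery developed in the paper.

    # Hedged sketch of the complex/real-domain analogue only.
    import numpy as np

    kernel = np.random.randn(8)
    # Matrix of the circular convolution x -> kernel * x (circulant, first column = kernel).
    C = np.stack([np.roll(kernel, i) for i in range(len(kernel))], axis=1)
    lipschitz_exact = np.linalg.norm(C, 2)             # largest singular value
    lipschitz_fft = np.abs(np.fft.fft(kernel)).max()   # max eigenvalue magnitude via the DFT
    print(lipschitz_exact, lipschitz_fft)              # the two agree up to floating-point error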

URL: https://openreview.net/forum?id=rhcpXTxb8j

---


New submissions
===============


Title: Probing Layer-wise Memorization and Generalization in Deep Neural Networks via Model Stitching

Abstract: It is well-known that deep neural networks can both memorize randomly labeled training data and generalize to unseen inputs. However, despite several prior efforts, the mechanism and dynamics of how and where memorization takes place in the network are still unclear, with contradictory findings in the literature. To address this, we aim to study the functional similarity between the layers of a model that memorizes and those of a model that generalizes. Specifically, we leverage model stitching as a tool for layer-wise comparison of a noisy model that memorizes, trained on a partially noisy-labeled dataset, with a clean model that generalizes, trained on a clean, noise-free dataset.
Our simple but effective approach guides the design of experiments that shed light on the learning dynamics of different layers in deep neural networks and on why models with harmful memorization still generalize well. Our results show that early layers are as important as deeper ones for generalization. We find that ``cleaning'' the early layers of the noisy model improves the functional similarity of its deeper layers to the corresponding layers of the clean model. Moreover, cleaning the noise in the early layers of the noisy model can drastically reduce memorization and improve generalization. Furthermore, fixing the noise up to a certain depth results in generalization similar to that of a noise-free model. Interestingly, however, the reverse may not be true: if early layers are noisy but deeper layers are noise-free, perfect memorization cannot be achieved, emphasizing the dominant role of deeper layers in memorization.
Our extensive experiments on four different architectures (a customized CNN, ResNet-18, ResNet-34, and ResNet-50) and three datasets (SVHN, CIFAR-10, and CIFAR-100), with varying levels of noise, consistently corroborate our findings.
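
A toy sketch of the stitching tool itself (architectures, split point, and training loop are placeholders, not the paper's CNN/ResNet setups): early blocks come from one trained model, later blocks from another, and only a small stitching layer is trained to measure functional similarity.

    # Hedged sketch: stitch the early layers of a "noisy" model to the deep
    # layers of a "clean" model through a trainable 1x1 convolution.
    import torch, torch.nn as nn

    def make_cnn():
        return nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),    # "early" block
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),   # "deep" block
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

    noisy_model, clean_model = make_cnn(), make_cnn()     # assume both are already trained

    class Stitched(nn.Module):
        def __init__(self, front, back, split, width):
            super().__init__()
            self.front, self.back = front[:split], back[split:]
            self.stitch = nn.Conv2d(width, width, kernel_size=1)  # the only trainable part
            for p in list(self.front.parameters()) + list(self.back.parameters()):
                p.requires_grad_(False)

        def forward(self, x):
            return self.back(self.stitch(self.front(x)))

    stitched = Stitched(noisy_model, clean_model, split=2, width=16)
    opt = torch.optim.Adam(stitched.stitch.parameters(), lr=1e-3)
    x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(stitched(x), y)
    loss.backward(); opt.step()
    # If the stitched network matches the clean model's accuracy, the swapped-in
    # early layers are functionally similar to the clean model's early layers.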

URL: https://openreview.net/forum?id=wWye46fXo7

---

Title: Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

Abstract: Effective problem solving with Large Language Models (LLMs) can be enhanced when they are paired with external search algorithms. By viewing the space of diverse ideas and their follow-up possibilities as a tree structure, the search algorithm can navigate such a search space and guide the LLM toward better solutions more efficiently. While the search algorithm enables an effective balance between exploitation and exploration of a tree-structured space, the need for an external component can complicate the overall problem-solving process. We therefore pose the following question: can LLMs or their underlying Transformer architectures fully internalize the search algorithm? To answer this question, we first introduce a simplified framework in which tree extensions and feedback signals are externally specified, allowing for controlled evaluation of search capabilities. We call this setting unknown tree search with bandit feedback. Within this setting, we show that Transformers are theoretically expressive enough to implement distinct search strategies and can be trained from scratch to approximate those strategies. Our Transformer models are capable of generalizing to unseen conditions such as longer horizons or deeper trees. Furthermore, we demonstrate that continued task-focused training unlocks the complete capabilities of a pretrained LLM, by fine-tuning the LLM on search trajectories.
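
A minimal sketch of the kind of classical strategy behind this setting: UCB-style search over a tree whose extensions and noisy rewards come from an external environment. The toy tree, reward model, and constants are illustrative; the paper's contribution is showing that Transformers can represent and learn such strategies.

    # Hedged sketch: UCB-style tree search with bandit (noisy) feedback.
    import math, random

    class Node:
        def __init__(self, value):
            self.value, self.visits, self.total = value, 0, 0.0
            self.children = []

    def expand(node):
        # The external environment reveals a few child extensions of this node.
        node.children = [Node(node.value + random.gauss(0, 1)) for _ in range(2)]

    def ucb_search(root, budget=200, c=1.4):
        expand(root)
        for _ in range(budget):
            path, node = [root], root
            while node.children:
                # Descend by picking the child with the highest upper confidence bound.
                node = max(node.children, key=lambda ch: float("inf") if ch.visits == 0
                           else ch.total / ch.visits + c * math.sqrt(math.log(node.visits + 1) / ch.visits))
                path.append(node)
            if node.visits > 0:
                expand(node)                             # grow the tree at a revisited leaf
            reward = node.value + random.gauss(0, 0.5)   # bandit feedback: noisy evaluation
            for n in path:                               # backpropagate the observed reward
                n.visits += 1
                n.total += reward
        return max(root.children, key=lambda ch: ch.visits)

    best = ucb_search(Node(0.0))
    print(round(best.value, 3))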

URL: https://openreview.net/forum?id=Jij7zCjVfc

---

Title: From Clutter to Clarity: Visual Recognition through Foveated Object-Centric Learning (FocL)

Abstract: Human active vision integrates spatial attention (dorsal) and object recognition (ventral) as distinct information processing pathways. Rapid eye movements focus perception on task-relevant regions while filtering out background clutter. Mimicking this ventral specialization, we introduce FocL (Foveated Object-Centric Learning), a training strategy that biases image classification models toward label-consistent object regions by replacing full images with foveated crops. Standard training often relies on spurious correlation between label and background, increasing memorization of hard examples in the tail of the difficulty distribution. FocL simulates saccades by jittering fixation points and extracting foveated glimpses from annotated bounding boxes. This object-first restructuring reduces non-foreground contamination and lowers mean training loss. FocL reduces memorization, lowering mean cumulative sample loss by approximately 65 % and making nearly all high-memorization samples (top 1 %) easier to learn. It also increases the mean $\ell_2$ adversarial perturbation distance required to flip predictions by approximately 62 %. On ImageNet-V1, FocL achieves around 11 % higher accuracy on oracle crops. When paired with the Segment Anything Model (SAM) as a dorsal proposal generator, FocL provides around an 8 % gain on ImageNet-V1 and around 8 % under natural distribution shift (ImageNet-V2). Extending this setup to COCO, FocL improves cross-domain mAP by 3--4 points without any target-domain training. Finally, FocL reaches higher accuracy using roughly 56 % less training data, offering a simple path to more robust and efficient visual recognition.
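
A small sketch of the foveated-crop step described above (glimpse size and jitter scale are illustrative; the paper's fixation sampling and SAM-based proposals are not reproduced here):

    # Hedged sketch: jitter a fixation point inside the object's bounding box and
    # cut a square glimpse around it; the model trains on this glimpse, not the
    # full cluttered image.
    import numpy as np

    def foveated_crop(image, box, glimpse=96, jitter=0.1, seed=None):
        # image: (H, W, C) array; box: (x0, y0, x1, y1) object bounding box.
        rng = np.random.default_rng(seed)
        H, W = image.shape[:2]
        x0, y0, x1, y1 = box
        cx = (x0 + x1) / 2 + rng.normal(0, jitter * (x1 - x0))  # jittered fixation point
        cy = (y0 + y1) / 2 + rng.normal(0, jitter * (y1 - y0))
        half = glimpse // 2
        left = int(np.clip(cx - half, 0, W - glimpse))
        top = int(np.clip(cy - half, 0, H - glimpse))
        return image[top:top + glimpse, left:left + glimpse]

    img = np.zeros((224, 224, 3), dtype=np.uint8)
    print(foveated_crop(img, box=(60, 80, 180, 200)).shape)     # (96, 96, 3)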

URL: https://openreview.net/forum?id=kVS7sMlv7P

---

Title: CP-POL + PPI: Conformal Guarantees in Partially-Observed Label Space

Abstract: We study Conformal Prediction (CP) in the practical and challenging regime where labeled training and calibration data observe only a subset of the label space. In this setting, classical conformal guarantees no longer control marginal risk, and naive unseen-label detection methods are either overconservative or uninformative. We introduce CP-POL, a simple operational pipeline that couples split CP over observed labels with a calibrated novelty test and integrates Prediction-Powered Inference (PPI) for finite-sample population estimation. We provide a non-asymptotic theory that (i) proves a Le Cam-type impossibility result: novelty testing from features alone is hopeless without structural assumptions, (ii) derives tight finite-sample coverage decompositions that isolate the role of the non-conforming event $s(X)>q$, (iii) gives Dvoretzky-Kiefer-Wolfowitz (DKW)-based conservative estimators and anytime martingale analogues for the novel mass function $\pi_{nov}$, (iv) identifies practically meaningful structural conditions under which strong guarantees for novel-region prediction hold, and (v) proves finite-sample PPI bounds that cleanly separate sampling fluctuation, trained-model error, and novel-mass effects. We validate the theory with reproducible simulations. All bounds are non-asymptotic and designed for immediate use in deployed monitoring pipelines.
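
Two of the named ingredients can be sketched independently of the pipeline itself: the split-conformal threshold over observed labels (with the usual finite-sample correction) and the DKW bound used for conservative CDF estimation. The nonconformity score and the novelty test used in CP-POL are not reproduced; everything below is an illustrative placeholder.

    # Hedged sketch: split-conformal prediction sets plus the DKW band.
    import numpy as np

    def split_conformal_threshold(cal_scores, alpha=0.1):
        # Standard split-CP quantile with the (n + 1) finite-sample correction.
        n = len(cal_scores)
        level = np.ceil((n + 1) * (1 - alpha)) / n
        return np.quantile(cal_scores, min(level, 1.0), method="higher")

    def prediction_set(prob_row, qhat):
        # Score s(x, y) = 1 - p(y | x); keep every observed label below the threshold.
        return [k for k, p in enumerate(prob_row) if 1.0 - p <= qhat]

    def dkw_band(n, delta=0.05):
        # Dvoretzky-Kiefer-Wolfowitz: sup_t |F_hat(t) - F(t)| <= band with prob. 1 - delta.
        return np.sqrt(np.log(2.0 / delta) / (2.0 * n))

    rng = np.random.default_rng(0)
    cal_probs = rng.dirichlet(np.ones(5), size=500)       # calibration softmax outputs
    cal_labels = rng.integers(0, 5, size=500)
    scores = 1.0 - cal_probs[np.arange(500), cal_labels]
    qhat = split_conformal_threshold(scores)
    print(prediction_set(rng.dirichlet(np.ones(5)), qhat), dkw_band(500))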

URL: https://openreview.net/forum?id=GEy2BtBQKa

---

Title: A Normative Framework for Reasoning in Language Models

Abstract: Large Language Models (LLMs) increasingly exhibit advanced abilities, enabled by techniques such as chain-of-thought prompting and test-time deliberation. However, they continue to struggle with tasks that demand complex reasoning, prompting debate over whether their outputs reflect genuine reasoning processes or merely statistical pattern generation. These difficulties stem in part from the absence of a unified framework for explaining and assessing reasoning in LLMs, which limits our ability to diagnose errors, establish bounds, and design effective interventions. In this paper, we propose a normative framework that characterizes reasoning as probabilistic inference over propositions and we show how this abstraction can be instantiated in LLMs. Within this framework, we provide a typology of reasoning modes, formalise success criteria for proposition-level correctness, and derive a taxonomy of failure modes. For each class, we map model-level requirements to LLM-level implementation constraints and identify potential remedies. Finally, we outline a roadmap for improving proposition-level accuracy under tractable approximations. Our contribution is both diagnostic and prescriptive: an account of what it means for LLMs to reason, where and why current systems fail, and how to close the gap.

URL: https://openreview.net/forum?id=rexmsDzqwf

---

Title: Learn to Explore: Meta NAS via Bayesian Optimization Guided Graph Generation

Abstract: Neural Architecture Search (NAS) automates the design of high-performing neural networks but typically targets a single predefined task, thereby restricting its real-world applicability. To address this, Meta Neural Architecture Search (Meta-NAS) has emerged as a promising paradigm that leverages prior knowledge across tasks to enable rapid adaptation to new ones. Nevertheless, existing Meta-NAS methods often struggle with poor generalization, limited search spaces, or high computational costs. In this paper, we propose a novel Meta-NAS framework, GraB-NAS. Specifically, GraB-NAS first models neural architectures as graphs, and then a hybrid search strategy is developed to find and generate new graphs that lead to promising neural architectures. The search strategy combines global architecture search via Bayesian Optimization in the search space with local exploration for novel neural networks via gradient ascent in the latent space. Such a hybrid search strategy allows GraB-NAS to discover task-aware architectures with strong performance, even beyond the predefined search space. Extensive experiments demonstrate that GraB-NAS outperforms state-of-the-art Meta-NAS baselines, achieving better generalization and search effectiveness.

URL: https://openreview.net/forum?id=w15FmwsmKW

---

Title: Addition is almost all you need: Compressing neural networks with double binary factorization

Abstract: Binary quantization approaches, which replace weight matrices with binary matrices and substitute costly multiplications with cheaper additions, offer a computationally efficient way to address the increasing computational and storage requirements of Large Language Models (LLMs). However, the severe quantization constraint ($\pm1$) can lead to significant accuracy degradation. In this paper, we propose Double Binary Factorization (DBF), a novel method that factorizes dense weight matrices into products of two binary (sign) matrices, each accompanied by scaling vectors. DBF preserves the efficiency advantages of binary representations while achieving compression rates that are competitive with or superior to state-of-the-art methods. Specifically, in the 1-bit-per-weight range, DBF is better than existing binarization approaches; in the 2-bit-per-weight range, it is competitive with the best quantization methods such as QuIP\# and QTIP. Unlike most existing compression techniques, which offer limited compression level choices, DBF allows fine-grained control over compression ratios by adjusting the factorization's intermediate dimension. Based on this advantage, we further introduce an algorithm for estimating non-uniform layer-wise compression ratios for DBF, based on previously developed channel pruning criteria.
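
A rough sketch of the parameterization only (the SVD initialization and scale choices below are illustrative assumptions, not the paper's fitting procedure): a weight matrix is approximated by two sign matrices, each carrying a scaling vector, and the intermediate dimension k sets the compression ratio.

    # Hedged sketch: W ~ (diag(a) @ S1) @ (S2 @ diag(b)) with S1, S2 in {-1, +1}.
    import numpy as np

    def double_binary_factorization(W, k):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        A = U[:, :k] * np.sqrt(s[:k])           # (m, k) real factor
        B = np.sqrt(s[:k])[:, None] * Vt[:k]    # (k, n) real factor
        S1, S2 = np.sign(A), np.sign(B)         # the two binary (sign) matrices
        a = np.abs(A).mean(axis=1)              # per-row scales for the first factor
        b = np.abs(B).mean(axis=0)              # per-column scales for the second factor
        return (a[:, None] * S1) @ (S2 * b[None, :])

    W = np.random.randn(256, 256)
    for k in (64, 128, 256):                    # intermediate dimension controls compression
        W_hat = double_binary_factorization(W, k)
        print(k, round(np.linalg.norm(W - W_hat) / np.linalg.norm(W), 3))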

URL: https://openreview.net/forum?id=k5kUKoewdQ

---

Title: SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

Abstract: Traditional evaluation metrics for textual and visual question answering—like ROUGE, METEOR, and Exact Match (EM)—focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
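
A toy sketch of what a composite lexical-semantic score can look like (the embed() function, keyword extraction, and weights are hypothetical placeholders and not the SMILE formula):

    # Hedged sketch: weighted mix of sentence-level similarity, keyword-level
    # similarity, and exact keyword overlap.
    import numpy as np

    def embed(text):
        # Placeholder: hashed bag-of-words vector standing in for a real sentence encoder.
        v = np.zeros(256)
        for tok in text.lower().split():
            v[hash(tok) % 256] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def keywords(text):
        stop = {"the", "a", "an", "of", "is", "to", "in", "on"}
        return {t for t in text.lower().split() if t not in stop}

    def composite_score(pred, ref, w=(0.5, 0.3, 0.2)):
        sent_sim = float(embed(pred) @ embed(ref))
        kw_p, kw_r = keywords(pred), keywords(ref)
        kw_sim = float(embed(" ".join(kw_p)) @ embed(" ".join(kw_r)))
        exact = len(kw_p & kw_r) / max(len(kw_p | kw_r), 1)   # lexical (Jaccard) overlap
        return w[0] * sent_sim + w[1] * kw_sim + w[2] * exact

    print(composite_score("the cat sat on the mat", "a cat is sitting on a mat"))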

URL: https://openreview.net/forum?id=lnpOvuQYih

---

Title: StethoLM: Audio Language Model for Cardiopulmonary Analysis Across Clinical Tasks

Abstract: Listening to heart and lung sounds — auscultation — is one of the first and most fundamental steps in a clinical examination. Despite being fast and non-invasive, it demands years of experience to interpret subtle audio cues. Recent deep learning methods have made progress in automating cardiopulmonary sound analysis, yet most are restricted to simple classification and offer little clinical interpretability or decision support. We present StethoLM, the first audio–language model specialized for cardiopulmonary auscultation, capable of performing instruction-driven clinical tasks across the full spectrum of auscultation analysis. StethoLM integrates audio encoding with a medical language model backbone and is trained on StethoBench, a comprehensive benchmark comprising 77,027 instruction–response pairs synthesized from 16,125 labeled cardiopulmonary recordings spanning seven clinical task categories: binary classification, detection, reporting, reasoning, differential diagnosis, comparison, and location-based analysis. Through multi-stage training that combines supervised fine-tuning and direct preference optimization, StethoLM achieves substantial gains in performance and robustness on out-of-distribution data. Our work establishes a foundation for instruction-following AI systems in clinical auscultation.

URL: https://openreview.net/forum?id=i9RuUH9Jyj

---
