Daily TMLR digest for Aug 20, 2025

TMLR

Aug 20, 2025, 12:06:05 AM

to tmlr-anno...@googlegroups.com

Accepted papers
===============

Title: TRIDE: A Text-assisted Radar-Image weather-aware fusion network for Depth Estimation

Authors: Huawei Sun, Zixu Wang, Hao Feng, Julius Ott, Lorenzo Servadei, Robert Wille

Abstract: Depth estimation, essential for autonomous driving, seeks to interpret the 3D environment surrounding vehicles. The development of radar sensors, known for their cost-efficiency and robustness, has spurred interest in radar-camera fusion-based solutions. However, existing algorithms fuse features from these modalities without accounting for weather conditions, despite radars being known to be more robust than cameras under adverse weather. Additionally, while Vision-Language models have seen rapid advancement, utilizing language descriptions alongside other modalities for depth estimation remains an open challenge. This paper first introduces a text-generation strategy along with feature extraction and fusion techniques that can assist monocular depth estimation pipelines, leading to improved accuracy across different algorithms on the KITTI dataset. Building on this, we propose TRIDE, a radar-camera fusion algorithm that enhances text feature extraction by incorporating radar point information. To address the impact of weather on sensor performance, we introduce a weather-aware fusion block that adaptively adjusts radar weighting based on current weather conditions. Our method, benchmarked on the nuScenes dataset, demonstrates performance gains over the state-of-the-art, achieving a 12.87% improvement in MAE and a 9.08% improvement in RMSE. Code: https://github.com/harborsarah/TRIDE

URL: https://openreview.net/forum?id=ZMqMnwMfse

---

Title: Corner Cases: How Size and Position of Objects Challenge ImageNet-Trained Models

Authors: Mishal Fatima, Steffen Jung, Margret Keuper

Abstract: Backgrounds in images play a major role in contributing to spurious correlations among different data points. Owing to aesthetic preferences of humans capturing the images, datasets can exhibit positional (location of the object within a given frame) and size (region-of-interest to image ratio) biases for different classes. In this paper, we show that these biases can impact how much a model relies on spurious features in the background to make its predictions. To better illustrate our findings, we propose a synthetic dataset derived from ImageNet-1k, Hard-Spurious-ImageNet, which contains images with various backgrounds, object positions, and object sizes. By evaluating different pretrained models on the dataset, we find that most models rely heavily on spurious features in the background when the region-of-interest (ROI) to image ratio is small and the object is far from the center of the image. Moreover, we show that current methods that aim to mitigate harmful spurious features do not take these factors into account and hence fail to achieve considerable performance gains for worst-group accuracies when the size and location of core features in an image change. The dataset and implementation code are available at https://github.com/Mishalfatima/Corner_Cases.

URL: https://openreview.net/forum?id=Yqf2BhqfyZ

---

New submissions
===============

Title: AC-PKAN: Attention-Enhanced and Chebyshev Polynomial-Based Physics-Informed Kolmogorov–Arnold Networks

Abstract: Kolmogorov–Arnold Networks (KANs) have recently shown promise for solving partial differential equations (PDEs). Yet their original formulation is computationally and memory intensive, motivating the introduction of Chebyshev Type-I-based KANs (Chebyshev1KANs). Although Chebyshev1KANs have outperformed the vanilla KAN architecture, our rigorous theoretical analysis reveals that they still suffer from rank collapse, ultimately limiting their expressive capacity. To overcome these limitations, we enhance Chebyshev1KANs by integrating wavelet-activated MLPs with learnable parameters and an internal attention mechanism. We prove that this design preserves a full-rank Jacobian and is capable of approximating solutions to PDEs of arbitrary order. Furthermore, to alleviate the loss instability and imbalance introduced by the Chebyshev polynomial basis, we externally incorporate a Residual Gradient Attention (RGA) mechanism that dynamically re-weights individual loss terms according to their gradient norms and residual magnitudes. By jointly leveraging internal and external attention, we present AC-PKAN, a novel architecture that constitutes an enhancement to weakly supervised Physics-Informed Neural Networks (PINNs) and extends the expressive power of KANs. Experimental results from nine benchmark tasks across three domains show that AC-PKAN consistently outperforms or matches state-of-the-art models such as PINNsFormer, establishing it as a highly effective tool for solving complex real-world engineering problems in zero-data or data-sparse regimes. The code will be made publicly available upon acceptance.

URL: https://openreview.net/forum?id=J4SkwpIgj7
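
For readers unfamiliar with the basis mentioned in the abstract above, here is a minimal sketch of Chebyshev Type-I polynomials evaluated with the standard recurrence T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x). The function names and the single-edge weighted sum are illustrative assumptions and are not taken from the AC-PKAN paper or its code.

```python
def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using the three-term recurrence."""
    values = [1.0]
    if degree >= 1:
        values.append(x)
    for _ in range(2, degree + 1):
        # T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x)
        values.append(2.0 * x * values[-1] - values[-2])
    return values


def chebyshev_edge(x, coefficients):
    """Hypothetical single KAN-style edge: a learnable combination of basis terms.

    `coefficients` stands in for trainable weights; x is assumed to lie in [-1, 1].
    """
    basis = chebyshev_basis(x, len(coefficients) - 1)
    return sum(c * t for c, t in zip(coefficients, basis))


print(chebyshev_basis(0.5, 3))
```

A full layer would hold one coefficient vector per input-output pair; this sketch shows only the basis expansion for one scalar input.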

---

Title: How iteration composition influences convergence and stability in deep learning

Abstract: Despite exceptional achievements, training neural networks remains computationally expensive and is often plagued by instabilities that can degrade convergence. While learning rate schedules can help mitigate these issues, finding optimal schedules is time-consuming and resource-intensive. This work explores theoretical issues concerning training stability in the constant-learning-rate (i.e., without schedule) and small-batch-size regime. Surprisingly, we show that the composition order of gradient updates affects stability and convergence in gradient-based optimizers. We illustrate this new line of thinking using backward-SGD, which produces parameter iterates at each step by reversing the usual forward composition order of batch gradients. Our theoretical analysis shows that in contractive regions (e.g., around minima) backward-SGD converges to a point while the standard forward-SGD generally only converges to a distribution. This leads to improved stability and convergence, which we demonstrate experimentally. While full backward-SGD is computationally intensive in practice, it highlights that the extra freedom of modifying the usual iteration composition by creatively reusing previous batches at each optimization step may have important beneficial effects in improving training. To our knowledge, this represents a new and unexplored avenue in deep learning optimization.

URL: https://openreview.net/forum?id=GZCBM2Yo3a
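
One way to read the composition-order idea in the abstract above is sketched below on a toy 1-D problem. Each batch b is assumed to have loss 0.5 * (x - c_b)^2, so one constant-learning-rate step is the map T_b(x) = x - lr * (x - c_b). The forward iterate applies T_1 first and T_n last; the backward iterate applies the newest batch first and T_1 last. The toy losses and all function names are assumptions for illustration, not the authors' implementation.

```python
def sgd_step(x, center, lr):
    """One gradient step on the toy batch loss 0.5 * (x - center)**2."""
    return x - lr * (x - center)


def forward_iterate(x0, centers, lr):
    """Usual SGD: T_n(...T_2(T_1(x0))), the newest batch is applied last."""
    x = x0
    for c in centers:
        x = sgd_step(x, c, lr)
    return x


def backward_iterate(x0, centers, lr):
    """Reversed order: T_1(T_2(...T_n(x0))), the newest batch is applied first.

    Every call restarts from x0 and replays all batches, which is why the
    full scheme is expensive.
    """
    x = x0
    for c in reversed(centers):
        x = sgd_step(x, c, lr)
    return x


# Batches alternate between two targets, so the per-batch minima disagree.
centers = [1.0 if i % 2 == 0 else -1.0 for i in range(41)]
lr = 0.5

fwd_gap = abs(forward_iterate(0.0, centers, lr) - forward_iterate(0.0, centers[:40], lr))
bwd_gap = abs(backward_iterate(0.0, centers, lr) - backward_iterate(0.0, centers[:40], lr))
print(fwd_gap, bwd_gap)
```

In this contractive toy setting the forward iterates keep jumping between the two batch targets, while successive backward iterates differ by a factor that shrinks geometrically with the number of batches.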

---

Title: ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment

Abstract: We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image–language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem, preserving the generalization strength of pretrained image–language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video–text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video–language models while using ~8× fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image–language models for fine-grained video understanding.

URL: https://openreview.net/forum?id=Nwzn4qMTGb
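
The alignment step described in the abstract above can be sketched with textbook Dynamic Time Warping. The sketch assumes frame embeddings and sub-action text embeddings already live in one shared space, uses 1 minus cosine similarity as the local cost, and picks the class whose sub-action sequence aligns most cheaply. The function names, the cost choice, and the absence of any score normalization are assumptions; ActAlign's actual scoring may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def dtw_cost(frames, steps):
    """Classic DTW cost between a frame sequence and an ordered step sequence."""
    n, m = len(frames), len(steps)
    inf = float("inf")
    table = [[inf] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 1.0 - cosine(frames[i - 1], steps[j - 1])
            # Stay on the same step, skip ahead in steps, or advance both.
            table[i][j] = local + min(table[i - 1][j], table[i][j - 1], table[i - 1][j - 1])
    return table[n][m]


def classify(frames, class_to_steps):
    """Return the class whose sub-action sequence has the lowest DTW cost."""
    return min(class_to_steps, key=lambda name: dtw_cost(frames, class_to_steps[name]))


# Toy 2-D "embeddings": the video shows pattern A twice, then pattern B.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
class_to_steps = {
    "a_then_b": [[1.0, 0.0], [0.0, 1.0]],
    "b_then_a": [[0.0, 1.0], [1.0, 0.0]],
}
print(classify(frames, class_to_steps))
```

Because DTW only allows monotone alignments, the order of sub-actions matters: the two toy classes contain the same steps but only one ordering matches the video.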

---

Title: Convergence of linear programming hierarchies for Gibbs states of spin systems

Abstract: We consider the problem of computing expectation values of local functions under the Gibbs distribution of a spin system. In particular, we study two families of linear programming hierarchies for this problem. The first hierarchy imposes local spin flip equalities and has been considered in the bootstrap literature in high energy physics. For this hierarchy, we prove fast convergence under a spatial mixing (decay of correlations) condition. This condition is satisfied for example above the critical temperature for Ising models on a d-dimensional grid. The second hierarchy is based on a Markov chain having the Gibbs state as a fixed point and has been studied in the optimization literature and more recently in the bootstrap literature. For this hierarchy, we prove fast convergence provided the Markov chain mixes rapidly. Both hierarchies lead to an ε-approximation for local expectation values using a linear program of size quasi-polynomial in n/ε, where n is the total number of sites, provided the interactions can be embedded in a d-dimensional grid with constant d. Compared to standard Monte Carlo methods, an advantage of this approach is that it always (i.e., for any system) outputs rigorous upper and lower bounds on the expectation value of interest, without needing an a priori analysis of the convergence speed.

URL: https://openreview.net/forum?id=mc1dPxZsv3

---

Title: Efficient Distillation of Classifier-Free Guidance using Adapters

Abstract: While classifier-free guidance (CFG) is essential for conditional diffusion models, it doubles the number of neural function evaluations (NFEs) per inference step. To mitigate this inefficiency, we introduce adapter guidance distillation (AGD), a novel approach that simulates CFG in a single forward pass. AGD leverages lightweight adapters to approximate CFG, effectively doubling the sampling speed while maintaining or even improving sample quality. Unlike prior guidance distillation methods that tune the entire model, AGD keeps the base model frozen and only trains minimal additional parameters (~2%) to significantly reduce the resource requirement of the distillation phase. Additionally, this approach preserves the original model weights and enables the adapters to be seamlessly combined with other checkpoints derived from the same base model. We also address a key mismatch between training and inference in existing guidance distillation methods by training on CFG-guided trajectories instead of standard diffusion trajectories. Through extensive experiments, we show that AGD achieves comparable or superior FID to CFG across multiple architectures with only half the NFEs. Notably, our method enables the distillation of large models (~2.6B parameters) on a single consumer GPU with 24 GB of VRAM, making it more accessible than previous approaches that require multiple high-end GPUs. We will publicly release the implementation of our method.

URL: https://openreview.net/forum?id=uMz8FiiW01
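
To show why CFG doubles the NFE count, as the abstract above states, here is a minimal sketch of one guided prediction using a common form of the CFG combination: unconditional output plus a scaled difference between conditional and unconditional outputs. The toy model, its call counter, and the scale convention are assumptions for illustration; a distilled model such as AGD would aim to produce the guided output from a single call.

```python
class ToyDenoiser:
    """Stand-in for a diffusion network that counts its forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, t, cond):
        self.calls += 1
        shift = 1.0 if cond is not None else 0.0
        return [xi + shift for xi in x]


def cfg_prediction(model, x, t, cond, guidance_scale):
    """One CFG step: two model evaluations combined into a guided output."""
    eps_cond = model(x, t, cond)      # evaluation 1: with the condition
    eps_uncond = model(x, t, None)    # evaluation 2: condition dropped
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


model = ToyDenoiser()
guided = cfg_prediction(model, [0.0, 2.0], t=10, cond="a cat", guidance_scale=3.0)
print(guided, model.calls)
```

Some codebases write the same rule with a scale offset by one, so the numeric value of the guidance scale is not directly comparable across implementations.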

---

Title: Robust Conformal Prediction for Infrequent Classes

Abstract: Many real-world classification tasks involve datasets with large and imbalanced label spaces, making class-specific uncertainty quantification particularly challenging. Conformal Prediction (CP) provides a model-agnostic framework that formally guarantees coverage, meaning that its prediction sets contain the true label with a user-defined probability (confidence level). However, standard class-conditional methods often fail when data is scarce for some classes. We propose a method that uses domain knowledge or label hierarchies to dynamically group semantically related classes to meet the desired coverage for a given confidence threshold. Our method maintains class-conditioned calibration when possible and provides group-conditioned guarantees where necessary. We evaluate our method on outcome diagnosis prediction, an important clinical task that not only benefits from robust uncertainty estimation but also presents a very imbalanced label distribution. We conduct experiments using three clinical datasets, employing two medical taxonomies (ICD-10 and CCSR) and label spaces of varying sizes, ranging up to more than 1,000 classes. Our results show that the proposed approach consistently improves class-conditional coverage for infrequent diagnoses, outperforming strong baselines in all settings. By improving coverage for underrepresented classes, our method enhances the reliability and trustworthiness of predictive models. This improvement is especially valuable in clinical applications, where failure to detect rare but serious conditions can lead to harmful consequences.

URL: https://openreview.net/forum?id=nJ4p8rh3Ig
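
The fallback idea in the abstract above can be illustrated with split conformal prediction: each class gets its own nonconformity threshold when it has enough calibration examples, and otherwise borrows the pooled scores of its group. This is a deliberately simplified sketch. The fixed class-to-group mapping, the `min_count` rule, and all names are assumptions, and the paper's dynamic grouping procedure is not reproduced here.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample conformal threshold; infinite if too few scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(scores)[k - 1]


def calibrate(cal_scores, cal_labels, class_to_group, alpha, min_count):
    """Per-class thresholds, pooling over the group for infrequent classes."""
    by_class, by_group = defaultdict(list), defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_class[label].append(score)
        by_group[class_to_group[label]].append(score)
    thresholds = {}
    for label, group in class_to_group.items():
        pool = by_class[label] if len(by_class[label]) >= min_count else by_group[group]
        thresholds[label] = conformal_quantile(pool, alpha)
    return thresholds


def prediction_set(test_scores, thresholds):
    """Keep every class whose nonconformity score is within its threshold."""
    return {label for label, score in test_scores.items() if score <= thresholds[label]}


# Hypothetical calibration data: one frequent class and one rare class in a shared group.
cal_scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 20.0]
cal_labels = ["flu"] * 9 + ["rare"]
class_to_group = {"flu": "respiratory", "rare": "respiratory"}
thresholds = calibrate(cal_scores, cal_labels, class_to_group, alpha=0.25, min_count=5)
print(thresholds, prediction_set({"flu": 8.5, "rare": 8.5}, thresholds))
```

With a single calibration example, the rare class alone would receive an infinite threshold and therefore appear in every prediction set; pooling over the group gives it a finite one at the price of a group-level rather than class-level guarantee.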

---

Title: Helios 2.0: A Robust, Ultra-Low Power Gesture Recognition System for Event-Sensor based Wearables

Abstract: We present an advance in machine learning powered wearable technology: a mobile-optimised, real-time, ultra-low-power gesture recognition model. This model utilizes an event camera system that enables natural hand gesture control for smart glasses. Critical challenges in hand gesture recognition include creating systems that are intuitive, adaptable to diverse users and environments, and energy-efficient enough to allow practical wearable applications. Our approach addresses these challenges through four key contributions: a novel machine learning model designed for ultra-low-power on-device gesture recognition, a novel training methodology to improve the gesture recognition capability of the model, a novel simulator to generate synthetic micro-gesture data, and purpose-built real-world evaluation datasets. We first carefully selected micro-gestures: lateral thumb swipes across the index finger (in both directions) and a double pinch between thumb and index fingertips. These human-centered interactions leverage natural hand movements, ensuring intuitive usability without requiring users to learn complex command sequences. To overcome variability in users and environments, we developed a simulation methodology that enables comprehensive domain sampling without extensive real-world data collection. Our simulator synthesizes longer, multi-gesture sequences using Markov-based transitions, class-balanced sampling, and kinematic blending. We propose a sequence-based training approach to learn robust micro-gesture recognition entirely from simulated data. For energy efficiency, we introduce a five-stage, quantization-aware architecture with >99.8% of compute optimized for low-power DSP execution. We demonstrate on real-world data that our proposed model is able to generalise to challenging new users and environmental domains, achieving F1 scores above 80%. The model operates at just 6-8 mW when exploiting the Qualcomm Snapdragon Hexagon DSP. In addition, this model surpasses an F1 score of 80% in all gesture classes in user studies. This improves on the state-of-the-art F1 score by 20% with a 25x power reduction when using the DSP. This advancement, for the first time, brings the deployment of ultra-low-power vision systems in wearable devices closer and opens new possibilities for seamless human-computer interaction. A real-time video demonstration of Helios 2.0 can be found here: https://0e84f9dd10852326-tracking-platform-shared-public-assets.s3.eu-west-1.amazonaws.com/IMG_6222.mov

URL: https://openreview.net/forum?id=2sY0M7iZwR

---
