Weekly TMLR digest for Apr 13, 2025

7 views

Skip to first unread message

TMLR

unread,

Apr 13, 2025, 12:00:09 AMApr 13

to tmlr-annou...@googlegroups.com

New certifications
==================

Reproducibility Certification: Contextualized Messages Boost Graph Representations

Brian Godwin Lim, Galvin Brice Sy Lim, Renzo Roel Tan, Kazushi Ikeda

https://openreview.net/forum?id=sXr1fRjs1N

---

Accepted papers
===============

Title: ResiDual Transformer Alignment with Spectral Decomposition

Authors: Lorenzo Basile, Valentino Maiorca, Luca Bortolussi, Emanuele Rodolà, Francesco Locatello

Abstract: When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning level performance on different data distributions while modelling an extremely interpretable and parameter-efficient transformation, as we extensively show on 70 pre-trained network-dataset combinations (7 models, 10 datasets).

URL: https://openreview.net/forum?id=z37LCgSIzI

---

Title: Dynamic Pricing in the Linear Valuation Model using Shape Constraints

Authors: Daniele Bracale, Moulinath Banerjee, Yuekai Sun, Salam Turki, Kevin Stoll

Abstract: We propose a shape-constrained approach to dynamic pricing for censored data in the linear valuation model eliminating the need for tuning parameters commonly required by existing methods. Previous works have addressed the challenge of unknown market noise distribution $F_0$ using strategies ranging from kernel methods to reinforcement learning algorithms, such as bandit techniques and upper confidence bounds (UCB), under the assumption that $F_0$ satisfies Lipschitz (or stronger) conditions. In contrast, our method relies on isotonic regression under the weaker assumption that $F_0$ is $\alpha$-H\"older continuous for some $\alpha \in (0,1]$, for which we derive a regret upper bound. Simulations and experiments with real-world data obtained by Welltower Inc (a major healthcare Real Estate Investment Trust) consistently demonstrate that our method attains lower empirical regret in comparison to several existing methods in the literature while offering the advantage of being tuning-parameter free.

URL: https://openreview.net/forum?id=uKZ0R4IQaO

---

Title: Rank Suggestion in Non-negative Matrix Factorization: Residual Sensitivity to Initial Conditions (RSIC)

Authors: Marc A. Tunnell, Zachary DeBruine, Erin Carrier

Abstract: Determining the appropriate rank in Non-negative Matrix Factorization (NMF) is a critical challenge that often requires extensive parameter tuning and domain-specific knowledge. Traditional methods for rank determination focus on identifying a single optimal rank, which may not capture the complex structure inherent in real-world datasets. In this study, we introduce a novel approach called Residual Sensitivity to Intial Conditions (RSIC) that suggests potentially multiple ranks of interest by analyzing the sensitivity of the relative residuals (e.g., relative reconstruction error) to different initializations. By computing the Mean Coordinatewise Interquartile Range (MCI) of the residuals across multiple random initializations, our method identifies regions where the NMF solutions are less sensitive to initial conditions and potentially more meaningful. We evaluate RSIC on a diverse set of datasets, including single-cell gene expression data, image data, and text data, and compare it against current state-of-the-art rank determination methods. Our experiments demonstrate that RSIC effectively identifies relevant ranks consistent with the underlying structure of the data, outperforming traditional methods in scenarios where they are computationally infeasible or less accurate. This approach provides a more scalable and generalizable solution for rank determination in NMF that does not rely on domain-specific knowledge or assumptions.

URL: https://openreview.net/forum?id=9Xj5w4DX0t

---

Title: Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization

Authors: Han Guo, Ramtin Hosseini, Ruiyi Zhang, Sai Ashish Somayajula, Ranak Roy Chowdhury, Rajesh K. Gupta, Pengtao Xie

Abstract: Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning. It operates by randomly masking image patches and reconstructing these masked patches using the unmasked ones. A key limitation of MAE lies in its disregard for the varying informativeness of different patches, as it uniformly selects patches to mask. To overcome this, some approaches propose masking based on patch informativeness. However, these methods often do not consider the specific requirements of downstream tasks, potentially leading to suboptimal representations for these tasks. In response, we introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that leverages end-to-end feedback from downstream tasks to learn an optimal masking strategy during pretraining. Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning. Compared to existing methods, it demonstrates remarkable improvements across diverse datasets and tasks, showcasing its adaptability and efficiency. Our code is available at https://github.com/Alexiland/MLO-MAE.

URL: https://openreview.net/forum?id=cFmmaxkD5A

---

Title: Latte: Latent Diffusion Transformer for Video Generation

Authors: Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

Abstract: We propose Latte, a novel Latent Diffusion Transformer for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, \textit{i.e.}, FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to the text-to-video generation (T2V) task, where Latte achieves results that are competitive with recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.

URL: https://openreview.net/forum?id=ntGPYNUF3t

---

Title: Graph Potential Field Neural Network for Massive Agents Group-wise Path Planning

Authors: Yueming Lyu, Xiaowei Zhou, Xingrui Yu, Ivor Tsang

Abstract: Multi-agent path planning is important in both multi-agent path finding and multi-agent reinforcement learning areas. However, continual group-wise multi-agent path planning that requires the agents to perform as a team to pursue high team scores instead of individually is less studied. To address this problem, we propose a novel graph potential field-based neural network (GPFNN), which models a valid potential field map for path planning. Our GPFNN unfolds the T-step iterative optimization of the potential field maps as a T-layer feedforward neural network. Thus, a deeper GPFNN leads to more precise potential field maps without the over-smoothing issue. A potential field map inherently provides a monotonic potential flow from any source node to the target nodes to construct the optimal path (w.r.t. the potential decay), equipping our GPFNN with an elegant planning ability. Moreover, we incorporate dynamically updated boundary conditions into our GPFNN to address group-wise multi-agent path planning that supports both static targets and dynamic moving targets. Empirically, experiments on three different-sized mazes (up to $1025 \times 1025$ sized mazes) with up to 1,000 agents demonstrate the planning ability of our GPFNN to handle both static and dynamic moving targets. Experiments on extensive graph node classification tasks on six graph datasets (up to millions of nodes) demonstrate the learning ability of our GPFNN.

URL: https://openreview.net/forum?id=LJHVPWNnV6

---

Title: Rethinking Patch Dependence for Masked Autoencoders

Authors: Letian Fu, Long Lian, Renhao Wang, Baifeng Shi, XuDong Wang, Adam Yala, Trevor Darrell, Alexei A Efros, Ken Goldberg

Abstract: In this work, we examine the impact of inter-patch dependencies in the decoder of masked autoencoders (MAE) on representation learning. We decompose the decoding mechanism for masked reconstruction into self-attention between mask tokens and cross-attention between masked and visible tokens. Our findings reveal that MAE reconstructs coherent images from visible patches not through interactions between patches in the decoder but by learning a global representation within the encoder. This discovery leads us to propose a simple visual pretraining framework: cross-attention masked autoencoders (CrossMAE). This framework employs only cross-attention in the decoder to independently read out reconstructions for a small subset of masked patches from encoder outputs. This approach achieves comparable or superior performance to traditional MAE across models ranging from ViT-S to ViT-H and significantly reduces computational requirements. By its design, CrossMAE challenges the necessity of interaction between mask tokens for effective masked pretraining. Code and models are publicly available: https://crossmae.github.io/

URL: https://openreview.net/forum?id=JT2KMuo2BV

---

Title: A Systematic Evaluation of the Planning and Scheduling Abilities of the Reasoning Model o1

Authors: Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

Abstract: OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive Large Language Models (LLMs)–making it a new kind of model: a Large Reasoning Model (LRM)–and be generally capable of tackling procedural reasoning tasks. We present the first comprehensive evaluation of these models on the fundamental tasks of planning and scheduling. Previous research attempted to use LLMs’ expressive generation capabilities to solve these problems, but met with only limited success. We fill in the gaps in this literature by testing a larger suite of state-of-the-art LLMs on a set of large benchmarks, and then use this as a baseline to evaluate o1-preview and o1-mini. We see that while they can offer significant accuracy improvements over LLMs, this single metric is misleading and incomplete, as LRM queries demand large and unpredictable costs and take significant amounts of time to complete. We provide a case study demonstrating that, at those same price points, other methods of inference time scaling can do just as well. We also show that, contrary to OpenAI’s injunctions, o1’s performance can be improved further by embedding it in compound systems that separately, but complementarily, scale inference time further. Finally, while the paper is focused on o1, we provide similar evaluations of a more recent (and open-weight) LRM -- DeepSeek R1.

URL: https://openreview.net/forum?id=FkKBxp0FhR

---

Title: Evaluating Compositional Scene Understanding in Multimodal Generative Models

Authors: Shuhao Fu, Andrew Jun Lee, Yixin Anna Wang, Ida Momennejad, Trevor Bihl, Hongjing Lu, Taylor Whittington Webb

Abstract: The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many (>5) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.

URL: https://openreview.net/forum?id=7bIfe2I7bK

---

Title: Distributed and Secure Kernel-Based Quantum Machine Learning

Authors: Arjhun Swaminathan, Mete Akgün

Abstract: Quantum computing promises to revolutionize machine learning, offering significant efficiency gains for tasks such as clustering and distance estimation. Additionally, it provides enhanced security through fundamental principles like the measurement postulate and the no-cloning theorem, enabling secure protocols such as quantum teleportation and quantum key distribution. While advancements in secure quantum machine learning are notable, the development of secure and distributed quantum analogs of kernel-based machine learning techniques remains underexplored.

In this work, we present a novel approach for securely computing three commonly used kernels: the polynomial, radial basis function (RBF), and Laplacian kernels, when data is distributed, using quantum feature maps. Our methodology formalizes a robust framework that leverages quantum teleportation to enable secure and distributed kernel learning. The proposed architecture is validated using IBM’s Qiskit Aer Simulator on various public datasets.

URL: https://openreview.net/forum?id=3jdI0aEW3k

---

Title: An Embedding is Worth a Thousand Noisy Labels

Authors: Francesco Di Salvo, Sebastian Doerrich, Ines Rieger, Christian Ledig

Abstract: The performance of deep neural networks scales with dataset size and label quality, rendering the efficient mitigation of low-quality data annotations crucial for building robust and cost-effective systems. Existing strategies to address label noise exhibit severe limitations due to computational complexity and application dependency. In this work, we propose WANN, a Weighted Adaptive Nearest Neighbor approach that builds on self-supervised feature representations obtained from foundation models. To guide the weighted voting scheme, we introduce a reliability score $\eta$, which measures the likelihood of a data label being correct. WANN outperforms reference methods, including a linear layer trained with robust loss functions, on diverse datasets of varying size and under various noise types and severities. WANN also exhibits superior generalization on imbalanced data compared to both Adaptive-NNs (ANN) and fixed k-NNs. Furthermore, the proposed weighting scheme enhances supervised dimensionality reduction under noisy labels. This yields a significant boost in classification performance with 10x and 100x smaller image embeddings, minimizing latency and storage requirements. Our approach, emphasizing efficiency and explainability, emerges as a simple, robust solution to overcome inherent limitations of deep neural network training.

URL: https://openreview.net/forum?id=X3gSvQjShh

---

Title: LTL-Constrained Policy Optimization with Cycle Experience Replay

Authors: Ameesh Shah, Cameron Voloshin, Chenxi Yang, Abhinav Verma, Swarat Chaudhuri, Sanjit A. Seshia

Abstract: Linear Temporal Logic (LTL) offers a precise means for constraining the behavior of reinforcement learning agents. However, in many settings where both satisfaction and optimality conditions are present, LTL is insufficient to capture both. Instead, LTL-constrained policy optimization, where the goal is to optimize a scalar reward under LTL constraints, is needed. This constrained optimization problem proves difficult in deep Reinforcement Learning (DRL) settings, where learned policies often ignore the LTL constraint due to the sparse nature of LTL satisfaction. To alleviate the sparsity issue, we introduce Cycle Experience Replay (CyclER), a novel reward shaping technique that exploits the underlying structure of the LTL constraint to guide a policy towards satisfaction by encouraging partial behaviors compliant with the constraint. We provide a theoretical guarantee that optimizing CyclER will achieve policies that satisfy the LTL constraint with near-optimal probability. We evaluate CyclER in three continuous control domains. Our experimental results show that optimizing CyclER in tandem with the existing scalar reward outperforms existing reward-shaping methods at finding performant LTL-satisfying policies.

URL: https://openreview.net/forum?id=gxUp2d4JTw

---

Title: Bézier Flow: a Surface-wise Gradient Descent Method for Multi-objective Optimization

Authors: Akiyoshi Sannai, Yasunari Hikima, Ken Kobayashi, Akinori Tanaka, Naoki Hamada

Abstract: This paper proposes a framework to construct a multi-objective optimization algorithm from a single-objective optimization algorithm by using the Bézier simplex model. Additionally, we extend the stability of optimization algorithms in the sense of Probably Approximately Correct (PAC) learning and define the PAC stability. We prove that it leads to an upper bound on the generalization error with high probability.
Furthermore, we show that multi-objective optimization algorithms derived from a gradient descent-based single-objective optimization algorithm are PAC stable. We conducted numerical experiments with synthetic and real multi-objective optimization problem instances and demonstrated that our method achieved lower generalization errors than the existing multi-objective optimization algorithms.

URL: https://openreview.net/forum?id=I1gALvbRxj

---

Title: Maximising the Utility of Validation Sets for Imbalanced Noisy-label Meta-learning

Authors: Hoang Anh Dung, Cuong C. Nguyen, Vasileios Belagiannis, Thanh-Toan Do, Gustavo Carneiro

Abstract: Meta-learning is an effective method to handle imbalanced and noisy-label learning, but it generally depends on a clean validation set. Unfortunately, this validation set has poor scalability when the number of classes increases, as traditionally these samples need to be randomly selected, manually labelled and balanced-distributed. This problem therefore has motivated the development of meta-learning methods to automatically select validation samples that are likely to have clean labels and balanced class distribution. Unfortunately, a common missing point of existing meta-learning methods for noisy label learning is the lack of consideration for data informativeness when constructing the validation set. The construction of an informative validation set requires hard samples, i.e., samples that the model has low confident prediction, but these samples are more likely to be noisy, which can degrade the meta reweighting process. Therefore, the balance between sample informativeness and cleanness is an important criteria for validation set optimization. In this paper, we propose new criteria to characterise the utility of such meta-learning validation sets, based on: 1) sample informativeness; 2) balanced class distribution; and 3) label cleanliness. We also introduce a new imbalanced noisy-label meta-learning (INOLML) algorithm that auto- matically builds a validation set by maximising such utility criteria. The proposed method shows state-of-the-art (SOTA) results compared to previous meta-learning and noisy-label learning approaches on several noisy-label learning benchmarks.

URL: https://openreview.net/forum?id=SBM9yeNZz5

---

Title: Controlled Training Data Generation with Diffusion Models

Authors: Teresa Yeo, Andrei Atanov, Harold Luc Benoit, Aleksandr Alekseev, Ruchira Ray, Pooya Esmaeil Akhoondi, Amir Zamir

Abstract: We present a method to control a text-to-image generative model to produce training data useful for supervised learning. Unlike previous works that employ an open-loop approach via pre-defined prompts to generate new data using either a language model or human expertise, we develop an automated closed-loop system that involves two feedback mechanisms. The first mechanism uses feedback from a given supervised model to find adversarial prompts that result in generated images that maximize the model's loss and, consequently, expose its vulnerabilities. While these adversarial prompts generate training examples curated for improving the given model, they are not curated for a specific target distribution of interest, which can be inefficient. Therefore, we introduce the second feedback mechanism that can optionally guide the generation process towards a desirable target distribution. We call the method combining these two mechanisms Guided Adversarial Prompts. The proposed closed-loop system allows us to control the training data generation for a given model and target image distribution. We evaluate on different tasks, datasets, and architectures, with different types of distribution shifts (corruptions, spurious correlations, unseen domains) and illustrate the advantages of the proposed feedback mechanisms compared to open-loop approaches.

URL: https://openreview.net/forum?id=sSOxuUjE2o

---

Title: (Accelerated) Noise-adaptive Stochastic Heavy-Ball Momentum

Authors: Anh Quang Dang, Reza Babanezhad Harikandeh, Sharan Vaswani

Abstract: Stochastic heavy ball momentum (SHB) is commonly used to train machine learning models, and often provides empirical improvements over stochastic gradient descent. By primarily focusing on strongly-convex quadratics, we aim to better understand the theoretical advantage of SHB and subsequently improve the method. For strongly-convex quadratics, Kidambi et al. (2018) show that SHB (with a mini-batch of size $1$) cannot attain accelerated convergence, and hence has no theoretical benefit over SGD. They conjecture that the practical gain of SHB is a by-product of using larger mini-batches. We first substantiate this claim by showing that SHB can attain an accelerated rate when the mini-batch size is larger than a threshold $b^*$ that depends on the condition number $\kappa$. Specifically, we prove that with the same step-size and momentum parameters as in the deterministic setting, SHB with a sufficiently large mini-batch size results in an $O\left(\exp(-\frac{T}{\sqrt{\kappa}}) + \sigma \right)$ convergence when measuring the distance to the optimal solution in the $\ell_2$ norm, where $T$ is the number of iterations and $\sigma^2$ is the variance in the stochastic gradients. We prove a lower-bound which demonstrates that a $\kappa$ dependence in $b^*$ is necessary. To ensure convergence to the minimizer, we design a noise-adaptive multi-stage algorithm that results in an $O\left(\exp\left(-\frac{T}{\sqrt{\kappa}}\right) + \frac{\sigma}{\sqrt{T}}\right)$ rate when measuring the distance to the optimal solution in the $\ell_2$ norm. We also consider the general smooth, strongly-convex setting and propose the first noise-adaptive SHB variant that converges to the minimizer at an $O(\exp(-\frac{T}{\kappa}) + \frac{\sigma^2}{T})$ rate when measuring the distance to the optimal solution in the squared $\ell_2$ norm. We empirically demonstrate the effectiveness of the proposed algorithms.

URL: https://openreview.net/forum?id=Okxp1W8If0

---

Title: Quantile Activation: Correcting a failure mode of traditional ML models

Authors: Aditya Challa, Sravan Danda, Laurent Najman, Snehanshu Saha

Abstract: Standard ML models fail to infer the context distribution and suitably adapt. For instance, the learning fails when the underlying distribution is actually a mixture of distributions with contradictory labels. Learning also fails if there is a shift between train and test distributions. Standard neural network architectures like MLPs or CNNs are not equipped to handle this.

In this article, we propose a simple activation function, quantile activation (QAct), that addresses this problem without significantly increasing computational costs. The core idea is to "adapt" the outputs of each neuron to its context distribution. The proposed quantile activation (QAct) outputs the relative quantile position of neuron activations within their context distribution, diverging from the direct numerical outputs common in traditional networks.

A specific case of the above failure mode is when there is an inherent distribution shift, i.e the test distribution differs slightly from the train distribution. We validate the proposed activation function under covariate shifts, using datasets designed to test robustness against distortions. Our results demonstrate significantly better generalisation across distortions compared to conventional classifiers and other adaptive methods, across various architectures. Although this paper presents a proof of concept, we find that this approach unexpectedly outperforms DINOv2 (small), despite DINOv2 being trained with a much larger network and dataset.

URL: https://openreview.net/forum?id=nWk5OtZ7ze

---

Title: Contextualized Messages Boost Graph Representations

Authors: Brian Godwin Lim, Galvin Brice Sy Lim, Renzo Roel Tan, Kazushi Ikeda

Abstract: Graph neural networks (GNNs) have gained significant attention in recent years for their ability to process data that may be represented as graphs. This has prompted several studies to explore their representational capability based on the graph isomorphism task. Notably, these works inherently assume a countable node feature representation, potentially limiting their applicability. Interestingly, only a few study GNNs with uncountable node feature representation. In the paper, a new perspective on the representational capability of GNNs is investigated across all levels—node-level, neighborhood-level, and graph-level—when the space of node feature representation is uncountable. Specifically, the injective and metric requirements of previous works are softly relaxed by employing a pseudometric distance on the space of input to create a soft-injective function such that distinct inputs may produce similar outputs if and only if the pseudometric deems the inputs to be sufficiently similar on some representation. As a consequence, a simple and computationally efficient soft-isomorphic relational graph convolution network (SIR-GCN) that emphasizes the contextualized transformation of neighborhood feature representations via anisotropic and dynamic message functions is proposed. Furthermore, a mathematical discussion on the relationship between SIR-GCN and key GNNs in literature is laid out to put the contribution into context, establishing SIR-GCN as a generalization of classical GNN methodologies. To close, experiments on synthetic and benchmark datasets demonstrate the relative superiority of SIR-GCN, outperforming comparable models in node and graph property prediction tasks.

URL: https://openreview.net/forum?id=sXr1fRjs1N

---

Title: GOTHAM: Graph Class Incremental Learning Framework under Weak Supervision

Authors: Aditya Hemant Shahane, Prathosh AP, Sandeep Kumar

Abstract: Graphs are growing rapidly and so are the number of different categories associated with it. Applications like e-commerce, healthcare, recommendation systems, and various social media platforms are rapidly moving towards graph representation of data due to their ability to capture both structural and attribute information. One crucial task in graph analysis is node classification, where unlabeled nodes are categorized into predefined classes. In practice, novel classes appear incrementally sometimes with just a few labels (seen classes) or even without any labels (unseen classes), either because they are new or haven't been explored much. Traditional methods assume abundant labeled data for training, which isn't always feasible. We investigate a broader objective: Graph Class Incremental Learning under Weak Supervision (GCL), addressing this challenge by meta-training on base classes with limited labeled instances. During the incremental streams, novel classes can have few-shot or zero-shot representation. Our proposed framework GOTHAM efficiently accommodates these unlabeled nodes by finding the closest prototype representation, serving as class representatives in the attribute space. For Text-Attributed Graphs (TAGs), our framework additionally incorporates semantic information to enhance the representation. By employing teacher-student knowledge distillation to mitigate forgetting, GOTHAM achieves promising results across various tasks. Experiments on datasets such as Cora-ML, Amazon, and OBGN-Arxiv showcase the effectiveness of our approach in handling evolving graph data under limited supervision.

URL: https://openreview.net/forum?id=hCyT4RsF27

---

New submissions
===============

Title: Rollout Total Correlation for Deep Reinforcement Learning

Abstract: Learning task-relevant representations is crucial for reinforcement learning. Recent approaches aim to learn such representations by improving the temporal consistency in the observed transitions. However, they only consider individual transitions and can fail to achieve long-term consistency. Instead, we argue that capturing aspects of the state that correlate with other states and actions of the trajectory---even more distant in the future---could further help in extracting task-relevant information. Hence, in this paper we investigate how to learn representations by maximizing the rollout total correlation, the correlation among all learned representations and actions within the trajectories produced by the agent. For improving rollout total correlation, we propose to combine two complementary lower bounds based on a generative and a discriminative model, combined with a simple and effective technique of chunk-wise mini-batching. Furthermore, we propose an intrinsic reward based on the learned representation for better exploration. Experimental evaluations on a set of challenging image-based simulated control tasks show that our method achieves better sample efficiency, and robustness to both white noise and natural video backgrounds compared to leading baselines.

URL: https://openreview.net/forum?id=qTdRJAL8Li

---

Title: Influential Bandits: Pulling an Arm May Change the Environment

Abstract: While classical formulations of multi-armed bandit problems assume that each arm's reward is independent and stationary, real-world applications often involve non-stationary environments and interdependencies between arms. In particular, selecting one arm may influence the future rewards of other arms, a scenario not adequately captured by existing models such as rotting bandits or restless bandits. To address this limitation, we propose the influential bandit problem, which models inter-arm interactions through an unknown, symmetric, positive semi-definite interaction matrix that governs the dynamics of arm losses. We formally define this problem and establish two regret lower bounds, including a superlinear $\Omega(T^2 / \log^2 T)$ bound for the standard UCB algorithm and an algorithm-independent $\Omega(T)$ bound, which highlight the inherent difficulty of the setting. We then introduce a new algorithm based on a lower confidence bound (LCB) estimator tailored to the structure of the loss dynamics. Under mild assumptions, our algorithm achieves a regret of $O(KT \log T)$, which is nearly optimal in terms of its dependence on the time horizon. The algorithm is simple to implement and computationally efficient. Empirical evaluations on both synthetic and real-world datasets demonstrate the presence of inter-arm influence and confirm the superior performance of our method compared to conventional bandit algorithms.

URL: https://openreview.net/forum?id=YNKaDfYbY3

---

Title: Collaboration with Dynamic Open Ad Hoc Team via Team State Modelling

Abstract: Open ad hoc teamwork presents the challenging problem of designing an autonomous agent that can rapidly adapt to collaborate with teammates without prior coordination in an open environment. Existing methods primarily rely on fixed, predefined teammate types, overlooking the fact that teammates may change dynamically. To address this limitation, we propose a novel reinforcement learning approach, the Open Online Teammate Adaptation Framework (Open-OTAF), which enables a controlled agent to collaborate with dynamic teammates in open ad hoc environments. To achieve this, the controlled agent employs a dual teamwork situation inference model to capture the current teamwork state, facilitating decision-making under partial observability. To handle the dynamic nature of teammate types, we first introduce a Chinese Restaurant Process-based model to categorize diverse teammate policies into distinct clusters, improving the efficiency of identifying teamwork situations. Next, to model heterogeneous agent relationships and accommodate a variable number of teammates, we represent the team as a heterogeneous graph and leverage heterogeneous graph attention neural networks to learn the representation of the teamwork situation. Extensive experiments across four challenging multi-agent benchmark tasks—Level-Based Foraging, Wolf-Pack, Cooperative Navigation, and FortAttack—demonstrate that our method successfully enables dynamic teamwork in open ad hoc settings. Open-OTAF outperforms state-of-the-art methods, achieving superior performance with faster convergence.

URL: https://openreview.net/forum?id=BukMU42P3G

---

Title: Counting Hours, Counting Losses: The Toll of Unpredictable Work Schedules on Financial Security

Abstract: Financial instability is a pressing concern in the United States, with drivers that include growing employment disparities and insufficient wages. While research typically focuses on financial aspects such as income inequality in precarious work environments, there is a tendency to overlook the time-related aspect of unstable work schedules. The inability to rely on a consistent work schedule not only leads to burnout and conflicts between work and family life but also results in financial shocks that directly impact workers' income and assets. Unforeseen fluctuations in earnings pose challenges in financial planning, affecting decisions regarding savings and spending, and ultimately undermining individuals' long-term financial stability and well-being. This issue is particularly evident in sectors where workers experience frequently changing schedules without sufficient notice. The lack of advance notice disproportionately affects vulnerable groups, including those in the food service and retail sectors, part-time and hourly workers, individuals with lower incomes and education levels, and specific racial groups. These groups are already more financially vulnerable, and the unpredictable nature of their work schedules exacerbates their financial fragility.

Our objective in this study is to understand how unforeseen fluctuations in earnings exacerbate financial fragility by investigating the extent to which individuals' financial management depends on their ability to anticipate and plan for future events. To address this question, we develop an online learning approach in which individuals adapt their consumption strategies over time in response to financial uncertainty and evolving information. This approach forms the basis of our simulation framework, which models how workers manage consumption in the face of variable work schedules and the imperative to avoid financial ruin.

With this framework, we demonstrate both theoretically and empirically how a worker's capacity to anticipate schedule changes enhances their long-term utility. Conversely, the inability to predict future events can worsen workers' financial instability. Moreover, our framework enables us to explore interventions aimed at mitigating the problem of schedule uncertainty and evaluate their effectiveness.

URL: https://openreview.net/forum?id=PEZz2i9kiP

---

Title: Synthesizing Minority Samples for Long-tailed Classification via Distribution Matching

Abstract: In many real-world applications, deep neural networks (DNNs) often perform poorly on datasets with long-tailed distributions. To address this issue, a promising approach is to propose an optimization objective to transform real majority samples into synthetic minority samples. However, this objective is designed only from the classification perspective. To this end, we propose a novel framework that synthesizes minority samples from the majority by considering both classification and distribution matching. Specifically, our method adjusts the distribution of synthetic minority samples to closely align with that of the true minority class, while enforcing the synthetic samples to learn more generalizable and discriminative features of the minority class. Experimental results on several standard benchmark datasets demonstrate the effectiveness of our method in both long-tailed classification and synthesizing high-quality synthetic minority samples.

URL: https://openreview.net/forum?id=VqLe8tPbZn

---

Title: Emergent Corpus Pre-training Benefits Vision Language Models

Abstract: Vision-Language Pre-trained Models (VL-PTMs) have achieved impressive performance across a wide range of tasks, but their success often hinges on access to large-scale multimodal datasets. While effective in high-resource settings, these models tend to struggle in data-scarce regimes. In this work, we investigate Emergent Communication (EC) as a mechanism to improve sample efficiency in VL-PTMs. We pre-train a Vision-Language Model (VLM) using EC tokens generated through a referential game between two artificial agents. Across three diverse cross-modal matching and reasoning benchmarks, EC pretraining yields substantial gains, improving Visual Referring Expression (VRE) accuracy by 108.6% and Visual Entailment (VE) by 69.6%. To further validate the the effectiveness of EC pretraining, we introduce LLaVA-1.5-EC, a LLaVA variant trained entirely on EC tokens. LLaVA-1.5-EC outperforms strong LVLM baselines, including BLIP-2 (13B), achieving relative gains of 104.23% on VizWiz, 34.8% on GQA, and 10.8% on VQAv2, and top performance on MMBench, a challenging instruction-following benchmark. These results highlight the transferability and generalization capacity of EC pretraining and underscore the potential of leveraging grounded EC tokens to enhance vision-language reasoning in low-resource settings, especially in settings with limited natural language data. We discuss implications and propose avenues for future research to explore the connections between EC and VL for multimodal understanding and effective human-machine communication. Code and data are available at anonymized link.

URL: https://openreview.net/forum?id=bivKGSaXkD

---

Title: Stochastic Block Model-Aware Topological Neural Networks for Graph Link Prediction

Abstract: Link prediction is an important learning task for graph-structured data and is indispensable to understanding graphs' properties. Recent works focus on designing complicated graph neural networks (GNNs) architectures to explore and capture various pairwise interactions among graph nodes. Most GNNs are based on combining graph structural and node feature information by iterative message-passing schemes. However, despite GNNs revolutionizing the field of graph representation learning, some thorny questions are raised concerning whether GNNs can efficiently learn the edge probabilities based on topological structures (i.e., higher-order interactions) and node features, and provide statistically rigorous uncertainty estimates. In this paper, we tackle these challenges and propose a novel stochastic block model (SBM)-aware topological neural networks, called SBM-TNN, that uses SBMs to infer the latent community structure of nodes from graph structures and uses persistent homology to encode higher-order information. Furthermore, we theoretically study the entrywise bound and asymptotic normality of the estimated edge probability matrix to quantify the uncertainty in statistical inference of the edge probabilities. Our extensive experiments for link prediction on both graphs and knowledge graphs show that SBM-TNN achieves state-of-the-art performance over a set of popular baseline methods.

URL: https://openreview.net/forum?id=FBjVSPAsgs

---

Title: Optimized Graph Structures for Calibrating Graph Neural Networks with Out-of-Distribution Nodes

Abstract: Graph neural networks~(GNNs) achieve remarkable success in tasks such as node classification, link prediction, and graph classification. However, despite their effectiveness, the reliability of the GNN's prediction remains a major concern. particularly when the graphs contain out-of-distribution~(OOD) nodes. Up to now, the calibration of GNNs in the presence of OOD nodes is still largely under-explored. Our empirical studies reveal that the calibration issue becomes significantly more complex when OOD nodes are present, and existing calibration methods prove to be less effective in this scenario. Recently, graph structure learning~(GSL), a family of data-centric learning approaches, has proved to be effective in mitigating the adverse effects of the noisy information in the graph topology by optimizing the graph structure alongside with GNN training. However, current GSL methods do not explicitly address the calibration issue in graphs with OOD nodes. To tackle the this challenge, we propose a novel framework called \underline{G}raph \underline{C}alibration via \underline{S}tructure \underline{O}ptimization~(GCSO) to calibrate GNNs against OOD nodes. Our empirical findings suggest that manually reducing the weight of edges connecting in-distribution~(ID) nodes and OOD nodes could effectively mitigate the calibration issue. However, identifying these edges and determining their appropriate weights is challenging, as the distribution of OOD nodes is unknown. To address it, we propose a novel framework to calibrate GNNs against OOD nodes. In our method we first develop an iterative edge-sampling mechanism to capture the topological information of the graph and formulate it as the Markov Decision Process~(MDP). Then, we leverage the actor-critic method to dynamically adjust the edge weights and assess their impact on target nodes. Additionally, we design a specialized reward signal to guide the policy function toward an optimal graph structure that minimizes the negative influence of OOD nodes. Note that our modified graph structure could be seamlessly integrated with existing temperature scaling-based calibration techniques for further improvement. Experimental results on benchmark datasets demonstrate that our method can effectively reduce the expected calibration error~(ECE) while maintaining comparable accuracy in GNNs. And our approach outperforms strong baseline methods, The anonymous GitHub repository for the code is available at \url{https://anonymous.4open.science/r/calibration-7F61}.

URL: https://openreview.net/forum?id=tmZxjo07SB

---

Title: Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers

Abstract: The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant drawback to runtime, thus making them ill-suited for online (real-time) inference where a prediction is made on any new input as it comes in. We introduce the Visual Word Tokenizer (VWT), a training-free method for reducing energy costs while retaining performance and runtime. The VWT groups visual subwords (image patches) that are frequently used into visual words while infrequent ones remain intact. To do so, intra-image or inter-image statistics are leveraged to identify similar visual concepts for sequence compression. Experimentally, we demonstrate a reduction in wattage of up to 25% with only a 20% increase in runtime at most. Comparative approaches of 8-bit quantization and token merging achieve a lower or similar energy efficiency but exact a higher toll on runtime (up to 100% or more). Our results indicate that VWTs are well-suited for efficient online inference with a marginal compromise on performance.

URL: https://openreview.net/forum?id=TY4qi6dBnA

---

Title: Universal and Efficient Detection of Adversarial Data through Nonuniform Impact on Network Layers

Abstract: Deep Neural Networks (DNNs) are notoriously vulnerable to adversarial input designs with limited noise budgets. While numerous successful attacks with subtle modifications to original input have been proposed, defense techniques against these attacks are relatively understudied. Existing defense approaches either focus on improving DNN robustness by negating the effects of perturbations or use a secondary model to detect adversarial data. Although equally important, the attack detection approach, which is studied in this work, provides a more practical defense compared to the robustness approach. We show that the existing detection methods are either ineffective against the state-of-the-art attack techniques or computationally inefficient for real-time processing. We propose a novel universal and efficient method to detect adversarial examples by analyzing the varying degrees of impact of attacks on different DNN layers. Through theoretical arguments and extensive experiments, we demonstrate that our detection method is highly effective, computationally efficient for real-time processing, compatible with any DNN architecture, and applicable across different domains, such as image, video, and audio.

URL: https://openreview.net/forum?id=0CY5APFnFI

---

Title: Provable Robustness of (Graph) Neural Networks Against Data Poisoning and Backdoor Attacks

Abstract: Generalization of machine learning models can be severely compromised by data poisoning, where adversarial changes are applied to the training data. This vulnerability has led to interest in certifying (i.e., proving) that such changes up to a certain magnitude do not affect test predictions. We, for the first time, certify Graph Neural Networks (GNNs) against poisoning attacks, including backdoors, targeting the node features of a given graph. Our certificates are white-box and based upon $(i)$ the neural tangent kernel, which characterizes the training dynamics of sufficiently wide networks; and $(ii)$ a novel reformulation of the bilevel optimization problem describing poisoning as a mixed-integer linear program. Consequently, we leverage our framework to provide fundamental insights into the role of graph structure and its connectivity on the worst-case robustness behavior of convolution-based and PageRank-based GNNs. We note that our framework is more general and constitutes the first approach to derive white-box poisoning certificates for NNs, which can be of independent interest beyond graph-related tasks.

URL: https://openreview.net/forum?id=jIAPLDdGVx

---

Title: A stochastic gradient descent algorithm with random search directions

Abstract: Stochastic coordinate descent algorithms are efficient methods in which each iterate is obtained by fixing most coordinates at their values from the current iteration, and approximately minimizing the objective with respect to the remaining coordinates. However, this approach is usually restricted to canonical basis vectors of $\mathbb{R}^d$. In this paper, we develop a new class of stochastic gradient descent algorithms with random search directions which uses the directional derivative of the gradient estimate following more general random vectors. We establish the almost sure convergence of these algorithms with decreasing step. We further investigate their central limit theorem and pay particular attention to analyze the impact of the search distributions on the asymptotic covariance matrix. We also provide non-asymptotic $\mathbb{L}^p$ rates of convergence.

URL: https://openreview.net/forum?id=npER8AaLSb

---

Title: End-to-end generation and evaluation of nuclei-aware histology patches using diffusion models

Abstract: In recent years, computational pathology has witnessed remarkable progress, particularly through the adoption of deep learning techniques in segmentation and classification tasks that enhance diagnostic and prognostic workflows. Despite its importance, training effective deep learning models for these applications remains a significant challenge due to the need for large-scale annotated datasets. We present a nuclei-aware semantic tissue generation framework leveraging advancements in conditional diffusion modeling. Our framework generates high-quality synthetic tissue patches that are inherently annotated with instances of six distinct nuclei types. We demonstrate the efficacy of generated samples through extensive quantitative and expert evaluation.

URL: https://openreview.net/forum?id=uFOP7v0Van

---

Title: RefDeblur: Blind Motion Deblurring with Self-Generated Reference Image

Abstract: The challenge of blind motion deblurring is often tackled via two distinct paradigms: kernel-based and kernel-free methods. Each deblurring method provides inherent strengths. Kernel-based methods facilitate generating texture-detailed sharp images by closely aligning with the blurring process. In contrast, kernel-free methods are more effective in handling complex blur patterns. Building upon these complementary benefits, we propose a hybrid framework that decomposes a non-uniform deblurring task into two simpler tasks: a uniform kernel estimation, managed by our kernel-based method, and error prediction, handled by our kernel-free method. Our kernel-based method serves to generate a reference image with realistic texture details while our kernel-free model refines the reference image by correcting residual errors with preserving texture details. To efficiently build our kernel-based model, we consider the logarithmic fourier space that facilitates estimating a blur kernel easier by simplifying the relationship between blur and sharp samples. Furthermore, the regime under using a texture-detailed reference image allows for reducing the size of our kernel-free model without compromising performance. As a result, the proposed method achieves remarkable performance on several datasets such as RealBlur, RSBlur and GoPro, and comparable performance to state-of-the-art methods with a 75% reduction in computational costs.

URL: https://openreview.net/forum?id=Nyewu7xztw

---

Title: To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online

Abstract: Double-blind conferences have engaged in debates over whether to allow authors to post their papers online on arXiv or elsewhere during the review process. Independently, some authors of research papers face the dilemma of whether to put their papers on arXiv due to its pros and cons. We conduct a study to substantiate this debate and dilemma via quantitative measurements. Specifically, we conducted surveys of reviewers in two top-tier double-blind computer science conferences---ICML 2021 (5361 submissions and 4699 reviewers) and EC 2021 (498 submissions and 190 reviewers). Our three main findings are as follows. First, more than a third of the reviewers self-report searching online for a paper they are assigned to review. Second, conference policies restricting authors from publicising their work on social media or posting preprints before the review process may have only limited effectiveness in maintaining anonymity. Third, outside the review process, we find that preprints from better-ranked institutions experience a very small increase in visibility compared to preprints from other institutions.

URL: https://openreview.net/forum?id=YhdQn73Qci

---

Title: Towards Robust Scale-Invariant Mutual Information Estimators

Abstract: Mutual information (MI) is hard to estimate for high dimensional data, and various estimators have been proposed over the years to tackle this problem. Here, we note that there exists another challenging problem, namely that many estimators of MI, which we denote as $I(X;T)$, are sensitive to scale, i.e., $I(X;\alpha T)\neq I(X;T)$ where $\alpha \in \mathbb{R}$. Although some normalization methods have been hinted at in previous works, there is no in-depth study of the problem. In this work, we study new normalization strategies for MI estimators to be scale-invariant, particularly for the Kraskov–Stögbauer–Grassberger (KSG) and the neural network-based MI (MINE) estimators. We provide theoretical and empirical results and show that the original un-normalized estimators are not scale-invariant and highlight the consequences of an estimator's scale-dependence. We propose new global normalization strategies that are tuned to the corresponding estimator and scale invariant. We compare our global normalization strategies to existing local normalization strategies and provide intuitive and empirical arguments to support the use of global normalization. Extensive experiments across multiple distributions and settings are conducted, and we find that our proposed variants KSG-Global-$L_{\infty}$ and MINE-Global-Corrected are most accurate within their respective approaches. Finally, we perform an information plane analysis of neural networks and observe clearer trends of fitting and compression using the normalized estimators compared to the original un-normalized estimators. Our work highlights the importance of scale awareness and global normalization in the MI estimation problem.

URL: https://openreview.net/forum?id=vB7Wvytko5

---

Title: Meta-Sparsity: Learning Optimal Sparse Structures in Multitask Networks through Meta-learning

Abstract: This paper presents meta-sparsity, a framework for learning model sparsity, basically learning the parameter that controls the degree of sparsity, that allows deep neural networks (DNNs) to inherently generate optimal sparse shared structures in multi-task learning (MTL) setting. This proposed approach enables the dynamic learning of sparsity patterns across a variety of tasks, unlike traditional sparsity methods that rely heavily on manual hyperparameter tuning. Inspired by Model Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios by implementing a penalty-based, channel-wise structured sparsity during the meta-training phase. This method improves the model’s efficacy by removing unnecessary parameters and enhances its ability to handle both seen and previously unseen tasks. The effectiveness of meta-sparsity is rigorously evaluated by extensive experiments on two datasets, NYU-v2 and CelebAMask-HQ, covering a broad spectrum of tasks ranging from pixel-level to image-level predictions. The results show that the proposed approach performs well across many tasks, indicating its potential as a versatile tool for creating efficient and adaptable sparse neural networks. This work, therefore, presents an approach towards learning sparsity, contributing to the efforts in the field of sparse neural networks and suggesting new directions for research towards parsimonious models.

URL: https://openreview.net/forum?id=SyjEihChuq

---

Title: Differentiated Aggregation to Improve Generalization in Federated Learning

Abstract: This paper focuses on reducing the communication cost of federated learning by exploring generalization bounds and representation learning. We first characterize a tighter generalization bound for one-round federated learning based on local clients’ generalizations and heterogeneity of data distribution (non-iid scenario). We also characterize a generalization bound in R-round federated learning and its relation to the number of local updates (local stochastic gradient descents (SGDs)). Then, based on our generalization bound analysis and its interpretation through representation learning, we infer that less frequent aggregations for the representation extractor (typically corresponds to initial layers) compared to the head (usually the final layers) leads to the creation of more generalizable models, particularly in non-iid scenarios. We design a novel Federated Learning with Adaptive Local Steps (FedALS) algorithm based on our generalization bound and representation learning analysis. FedALS employs varying aggregation frequencies for different parts of the model, so reduces the communication cost. The paper is followed with experimental results showing the effectiveness of FedALS. Our codes are available for reproducibility.

URL: https://openreview.net/forum?id=F5hgpQ1Ccd

---

Title: Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Abstract: Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial networks-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melniket al., 2024; Cao et al., 2023; Xing et al., 2024c) which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field.

URL: https://openreview.net/forum?id=2ODDBObKjH

---

Title: What Time Tells Us? An Explorative Study of Time Awareness Learned from Static Images

Abstract: Time becomes visible through illumination changes in what we see. Inspired by this, in this paper we explore the potential to learn time awareness from static images, trying to answer: *what time tells us?* To this end, we first introduce a Time-Oriented Collection (TOC) dataset, which contains 130,906 images with reliable timestamps. Leveraging this dataset, we propose a Time-Image Contrastive Learning (TICL) approach to jointly model timestamps and related visual representations through cross-modal contrastive learning.
We found that the proposed TICL, 1) not only achieves state-of-the-art performance on the timestamp estimation task, over various benchmark metrics, 2) but also, interestingly, though only seeing static images, the time-aware embeddings learned from TICL show strong capability in several time-aware downstream tasks such as time-based image retrieval, video scene classification, and time-aware image editing. Our findings suggest that time-related visual cues can be learned from static images and are beneficial for various vision tasks, laying a foundation for future research on understanding time-related visual context.

URL: https://openreview.net/forum?id=f1MYOG4iDG

---

Title: Regularization-based Framework for Quantization-, Fault- and Variability-Aware Training

Abstract: Efficient inference is critical for deploying deep learning models on edge AI devices. Low-bit quantization (e.g., 3- and 4-bit) with fixed-point arithmetic improves efficiency, while low-power memory technologies like analog nonvolatile memory enable further gains. However, these methods introduce non-ideal hardware behavior, including bit faults and device-to-device variability. We propose a regularization-based quantization-aware training (QAT) framework that supports fixed, learnable step-size, and learnable non-uniform quantization, achieving competitive results on CIFAR-10 and ImageNet. Our method also extends to Spiking Neural Networks (SNNs), demonstrating strong performance on 4-bit networks on CIFAR10-DVS and N-Caltech 101. Beyond quantization, our framework enables fault- and variability-aware fine-tuning, mitigating stuck-at faults (fixed weight bits) and device resistance variability. Compared to prior fault-aware training, our approach significantly improves performance recovery under upto 20% bit-fault rate and 40% device-to-device variability. Our results establish a generalizable framework for quantization and robustness-aware training, enhancing efficiency and reliability in low-power, non-ideal hardware.

URL: https://openreview.net/forum?id=6CRQbAH7by

---

Title: Textual Prototypes Guided Balanced Visual Feature Learning For Long-Tailed Vision Recognition

Abstract: In recent advancements, pre-trained contrastive models like CLIP have demonstrated remarkable multi-modal prowess in tackling diverse vision tasks. Yet, their potential in addressing the long-tailed vision recognition challenge has not been thoroughly investigated. In this study, we observe that textual features coming from CLIP exhibit a more discriminative and balanced distribution compared to their visual counterparts. Leveraging this insight, we propose a novel approach that uses these balanced textual features as prototypes to guide the learning of robust, disentangled representations from biased visual features. Our method begins with the fine-tuning of CLIP through contrastive learning, enabling the encoders to better adapt to the target dataset. Subsequently, we freeze the visual encoder and apply a linear adapter to enhance the visual representations. To achieve robust vision recognition, we integrate a linear classifier into our framework, which is initialized with the fine-tuned textual features and the weights can be viewed as prototypes. We then introduce a principled approach to robust vision representation learning by minimizing the optimal transport distance between the refined visual features and the prototypes, facilitating the disentanglement of biased features and the iterative optimization of prototypes towards class centroids. Additionally, we introduce a supervised contrastive learning loss based on the transport plan for further enhanced robust vision representation learning. Extensive experiments on long-tailed vision recognition benchmarks demonstrate the superiority of our method.

URL: https://openreview.net/forum?id=PxMTHVUbRM

---

Reply all

Reply to author

Forward

0 new messages