J2C Certification: Single-loop Algorithms for Stochastic Non-Convex Optimization with Weakly-Convex Constraints
Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang
https://openreview.net/forum?id=aCgOR2KvAI
---
Accepted papers
===============
Title: Single-loop Algorithms for Stochastic Non-Convex Optimization with Weakly-Convex Constraints
Authors: Ming Yang, Gang Li, Quanqi Hu, Qihang Lin, Tianbao Yang
Abstract: Constrained optimization with multiple functional inequality constraints has significant applications in machine learning. This paper examines a crucial subset of such problems where both the objective and constraint functions are weakly convex. Existing methods often face limitations, including slow convergence rates or reliance on double-loop algorithmic designs. To overcome these challenges, we introduce a novel single-loop penalty-based stochastic algorithm. Following the classical exact penalty method, our approach employs a hinge-based penalty, which permits the use of a constant penalty parameter, enabling us to achieve a state-of-the-art complexity for finding an approximate Karush-Kuhn-Tucker (KKT) solution. We further extend our algorithm to address finite-sum coupled compositional objectives, which are prevalent in artificial intelligence applications, establishing improved complexity over existing approaches. Finally, we validate our method through experiments on fair learning with receiver operating characteristic (ROC) fairness constraints and continual learning with non-forgetting constraints.
URL: https://openreview.net/forum?id=aCgOR2KvAI
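A minimal sketch of the hinge-penalty idea the abstract describes, not the paper's algorithm: a hinge term max(0, g(x)) with a constant penalty parameter rho turns the constrained problem into an unconstrained one that a single loop of (sub)gradient steps can solve. The toy objective f(x) = x^2, constraint g(x) = 1 - x <= 0, step size, and rho = 5 are all illustrative assumptions.

```python
# Illustrative exact (hinge) penalty sketch, not the paper's stochastic algorithm.
# Minimize f(x) = x^2 subject to g(x) = 1 - x <= 0 (i.e., x >= 1).
# Penalized objective: F(x) = f(x) + rho * max(0, g(x)), with a CONSTANT rho.

def subgrad_F(x, rho=5.0):
    # subgradient of f(x) = x^2 plus a subgradient of rho * max(0, 1 - x)
    g = 2.0 * x
    if 1.0 - x > 0.0:            # constraint violated: hinge term is active
        g -= rho
    return g

x = 0.0
for _ in range(5000):
    x -= 0.001 * subgrad_F(x)    # single loop: one (sub)gradient step per iteration

# Because rho = 5 exceeds the optimal multiplier (lambda* = 2 here), the penalty
# is exact and the iterates settle near the constrained optimum x* = 1.
print(round(x, 2))
```

The key point mirrored from the abstract is that rho stays constant; no inner loop or penalty-parameter schedule is needed for this toy problem.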
---
Title: Explainable Graph Learning for Particle Accelerator Operations
Authors: Song Wang, Chris Tennant, Jundong Li
Abstract: Particle accelerators are vital tools in physics, medicine, and industry, requiring precise tuning to ensure optimal beam performance. However, real-world deviations from idealized simulations make beam tuning a time-consuming and error-prone process. In this work, we propose an explanation-driven framework for providing actionable insight into beamline operations, with a focus on the injector beamline at the Continuous Electron Beam Accelerator Facility (CEBAF). We represent beamline configurations as heterogeneous graphs, where setting nodes represent elements that human operators can actively adjust during beam tuning, and reading nodes passively provide diagnostic feedback. To identify the most influential setting nodes responsible for differences between any two beamline configurations, our approach first predicts the resulting changes in reading nodes caused by variations in settings, and then learns importance scores that capture the joint influence of multiple setting nodes. Experimental results on real-world CEBAF injector data demonstrate the framework’s ability to generate interpretable insights that can assist human operators in beamline tuning and reduce operational overhead.
URL: https://openreview.net/forum?id=jnReRk2EX1
---
Title: Robust Clustering using Gaussian Mixtures in the Presence of Cellwise Outliers
Authors: Pushpendra Rajpurohit, Petre Stoica, Prabhu babu
Abstract: In this paper, we propose a novel algorithm for robust estimation of Gaussian Mixture Model (GMM) parameters and clustering that explicitly accounts for cellwise outliers. To achieve this, the proposed algorithm minimizes a penalized negative log-likelihood function where the penalty term is derived via the false discovery rate principle. The penalized negative log-likelihood function is cyclically minimized over outlier positions and the GMM parameters. Furthermore, the minimization over the GMM parameters is done using the majorization-minimization framework: specifically, we minimize a tight upper bound on the negative log-likelihood function which decouples into simpler optimization subproblems that can be solved efficiently.
We present several numerical simulation studies comprising experiments aimed at evaluating the performance of the proposed method on synthetic as well as real world data and at systematically comparing it with state-of-the-art robust techniques in different scenarios. The simulation studies demonstrate that our approach effectively addresses the challenges inherent in parameter estimation of GMM and clustering in contaminated data environments.
URL: https://openreview.net/forum?id=oVHPEgjdWk
---
Title: Enhancing Deep Consistent Graph Metric with Affinity and Alignment for Incremental Social Event Detection using Cross-Layer Attention
Authors: Shraban Kumar Chatterjee, Shubham Gupta, Suman Kundu
Abstract: Existing methods of event detection from social media (e.g., X), for instance, KPGNN, FinEvent, and CLKD, use triplet loss for feature separation. Triplet loss suffers from two notable discrepancies in the latent space: (i) inconsistency in intra-event and inter-event distances, and (ii) an inability to ensure the closeness of messages from the same event across different mini-batches. The present paper proposes two novel loss functions to improve consistency in the latent space. The first loss function guarantees consistent intra-event and inter-event distances by increasing the affinity between intra-event points. On the other hand, the alignment loss enhances the cosine similarity between the feature space and label space, thereby aligning features of the same event class across diverse mini-batches. We provide theoretical justification that the proposed loss ensures discriminative features in the latent space, like CGML, without its costly pairwise or specialised batching. In addition to our loss functions, we introduce a new attention module designed to effectively address heterogeneous relations without necessitating a separate optimisation objective. Through comprehensive experimentation on two publicly available datasets, we have shown an average improvement of $24.05\%$, $27.23\%$ and $123.69\%$ in NMI, AMI and ARI, respectively, over supervised SOTA event detection methods. Our method also shows improvements over SOTA unsupervised event detection methods across both datasets. These results are supported by statistical significance tests. The generalizability of the proposed loss to general clustering problems in the graph domain is demonstrated through experiments.
URL: https://openreview.net/forum?id=vNJ7mCgDbq
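An illustrative sketch of an alignment-style loss in the spirit of the abstract, not the paper's exact formulation: the loss rewards high cosine similarity between a message's feature vector and a fixed anchor direction for its event class, so features of the same event stay aligned across mini-batches. The 2-D anchors and toy batch below are assumptions for illustration.

```python
# Illustrative alignment-style loss (not the paper's exact losses): pull each
# feature toward a per-class anchor via cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def alignment_loss(features, labels, anchors):
    # 1 - cos(feature, class anchor), averaged over the batch
    return sum(1.0 - cosine(f, anchors[y]) for f, y in zip(features, labels)) / len(features)

anchors = {0: [1.0, 0.0], 1: [0.0, 1.0]}   # one anchor direction per event class
feats = [[0.9, 0.1], [0.2, 0.8]]           # features already near their anchors
labs = [0, 1]
loss = alignment_loss(feats, labs, anchors)
print(round(loss, 3))                      # → 0.018 (small: batch is well aligned)
```

Because the anchors are shared across batches, minimizing this loss keeps same-event features close even when they never co-occur in a mini-batch, which is the consistency property the abstract targets.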
---
Title: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Authors: Hanjiang Hu, Alexander Robey, Changliu Liu
Abstract: Large language models (LLMs) have been shown to be vulnerable to jailbreaking attacks, where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to proactively detect and filter harmful queries emerging from evolving contexts. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering, and lightweight LLM guardrail baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness, and over-refusal. Check out the website at https://sites.google.com/view/llm-nbf/home.
URL: https://openreview.net/forum?id=dcyLr9xYoI
---
New submissions
===============
Title: Constraint-Aware Flow Matching via Randomized Exploration
Abstract: We consider the problem of designing constraint-aware flow matching (FM) models that address the issue of constraint violations commonly observed in vanilla generative models. We consider two scenarios, viz.: (a) when a differentiable distance function to the constraint set is given, and (b) when the constraint set is only available via queries to a membership oracle. For case (a), we propose a simple adaptation of the FM objective with an additional term that penalizes the distance between the constraint set and the generated samples. For case (b), we propose to employ randomization and learn a mean flow that is numerically shown to have a high likelihood of satisfying the constraints. This approach deviates significantly from existing works that require simple convex constraints, knowledge of a barrier function, or a reflection mechanism to constrain the probability flow. Furthermore, in the proposed setting we show that a two-stage approach, where both stages approximate the same original flow but with only the second stage probing the constraints via randomization, is more computationally efficient. Through several synthetic cases of constrained generation, we numerically show that the proposed approaches achieve significant gains in terms of constraint satisfaction while matching the target distributions. As a showcase for a practical oracle-based constraint, we show how our approach can be used for training an adversarial example generator, using queries to a hard-label black-box classifier. We conclude with several future research directions.
URL: https://openreview.net/forum?id=OR4h9WPJhV
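A minimal sketch of case (a) from the abstract, where a differentiable distance to the constraint set is available: the flow-matching loss is augmented with a penalty on the mean distance between generated samples and the set. The axis-aligned box constraint, the penalty weight, and the stand-in scalar FM loss are illustrative assumptions, not the paper's setup.

```python
# Illustrative penalized objective for constraint-aware flow matching, case (a):
# total loss = FM loss + lam * mean distance of generated samples to the set.
def dist_to_box(x, lo=0.0, hi=1.0):
    # Euclidean distance from point x to the axis-aligned box [lo, hi]^d
    return sum(max(lo - xi, 0.0, xi - hi) ** 2 for xi in x) ** 0.5

def penalized_loss(fm_loss, samples, lam=10.0):
    penalty = sum(dist_to_box(x) for x in samples) / len(samples)
    return fm_loss + lam * penalty

samples = [[0.5, 0.5], [1.2, 0.5], [-0.3, 2.0]]   # second and third violate the box
print(round(penalized_loss(0.25, samples), 3))    # → 4.397
```

Samples inside the box contribute zero penalty, so the extra term only activates on constraint violations, which matches the additive-penalty adaptation the abstract describes.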
---
Title: The Hidden Cost of Modeling P(x): Vulnerability to Membership Inference Attacks in Generative Text Classifiers
Abstract: Membership Inference Attacks (MIAs) pose a critical privacy threat by enabling adversaries to determine whether a specific sample was included in a model's training dataset. Despite extensive research on MIAs, systematic comparisons between generative and discriminative classifiers remain limited. This work addresses this gap by first providing theoretical motivation for why generative classifiers exhibit heightened susceptibility to MIAs, then validating these insights through comprehensive empirical evaluation.
Our study encompasses discriminative, generative, and pseudo-generative text classifiers across varying training data volumes, evaluated on nine benchmark datasets. Employing a diverse array of MIA strategies, we consistently demonstrate that fully generative classifiers which explicitly model the joint likelihood $P(X,Y)$ are most vulnerable to membership leakage. Furthermore, we observe that the canonical inference approach commonly used in generative classifiers significantly amplifies this privacy risk.
These findings reveal a fundamental utility-privacy trade-off inherent in classifier design, underscoring the critical need for caution when deploying generative classifiers in privacy-sensitive applications. Our results motivate future research directions in developing privacy-preserving generative classifiers that can maintain utility while mitigating membership inference vulnerabilities.
URL: https://openreview.net/forum?id=SHMC01wdVM
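A minimal sketch of a loss-threshold membership inference attack, the standard baseline underlying studies like this one (not this paper's specific attacks): members of the training set tend to have lower per-sample loss (negative log-likelihood) under the model, so thresholding the loss separates members from non-members. The loss distributions below are synthetic assumptions standing in for a real model's train/test losses.

```python
# Illustrative loss-threshold MIA on synthetic per-sample losses: the wider the
# train/test loss gap (overfitting, or rich P(X,Y) modeling), the more accurate
# the attack becomes.
import random

random.seed(0)
member_losses = [random.gauss(0.5, 0.2) for _ in range(1000)]      # fitted samples
nonmember_losses = [random.gauss(1.5, 0.4) for _ in range(1000)]   # unseen samples

threshold = 1.0   # attacker's decision rule: loss < threshold  =>  "member"
tp = sum(l < threshold for l in member_losses)
tn = sum(l >= threshold for l in nonmember_losses)
accuracy = (tp + tn) / 2000
print(accuracy > 0.9)   # large loss gap => attack well above chance (0.5)
```

This is why the abstract's finding is intuitive: a fully generative classifier assigns likelihoods to the inputs themselves, giving the attacker a richer signal than a discriminative model's P(Y|X) alone.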
---
Title: On Rate-Optimal Partitioning Classification from Observable and from Privatised Data
Abstract: In this paper, we revisit the classical method of partitioning classification and prove novel convergence rates under relaxed conditions, both for observable (non-privatised) and for privatised data. We consider the problem of classification in a $d$-dimensional Euclidean space. Previous results on the partitioning classifier worked with the strong density assumption (SDA), which is restrictive, as we demonstrate through simple examples. Here, we study the problem under much milder assumptions. We presuppose that the distribution of the inputs is a mixture of an absolutely continuous and a discrete distribution, such that the absolutely continuous component is concentrated on a $d_a$-dimensional subspace. In addition to the standard Lipschitz and margin conditions, a novel characteristic of the absolutely continuous component is introduced, by which the convergence rate of the classification error probability is computed, both for the binary and for the multi-class cases. This bound can reach the minimax optimal convergence rate achievable using SDA, but under much milder distributional assumptions. Interestingly, this convergence rate depends only on the intrinsic dimension of the continuous inputs, $d_a$, and not on $d$. Under privacy constraints, the data cannot be directly observed, and the constructed classifiers are functions of the randomised outcome of a suitable local differential privacy mechanism. In this paper, we add Laplace-distributed noise to the discretisations of all possible locations of the feature vector and to its label. Again, tight upper bounds on the convergence rate of the classification error probability can be derived, without using SDA, such that this rate depends on $2d_a$.
URL: https://openreview.net/forum?id=KYYvIrtgK0
---
Title: On Theoretical Identifiability of Discrete Latent Causal Graphical Models
Abstract: This paper considers the challenging problem of identifying a causal graphical model in the presence of latent variables. While various identifiability conditions have been proposed in the literature, they often require multiple pure children per latent variable or restrictions on the latent causal graph. Furthermore, existing conditions commonly require all observed variables to exhibit the same modality. Consequently, the existing identifiability conditions are often too stringent for complex real-world data. We consider a general nonparametric measurement model with arbitrary observed variable types and binary latent variables, and propose a double triangular graphical condition that guarantees identifiability of the entire causal graphical model. The proposed condition significantly relaxes the popular pure children condition. We also establish necessary conditions for identifiability and provide valuable insights into fundamental limits of identifiability. Simulation studies verify that latent structures satisfying our conditions can be accurately estimated from data. We also illustrate the practicality of our conditions with a real data example.
URL: https://openreview.net/forum?id=KiiSlAsLuN
---
Title: Nested Slice Sampling: Vectorized Nested Sampling for GPU-Accelerated Inference
Abstract: Model comparison and calibrated uncertainty quantification often require integrating over parameters, but scalable inference can be challenging for complex, multimodal targets. Nested Sampling is a robust alternative to standard MCMC, yet its typically sequential structure and hard constraints make efficient accelerator implementations difficult. This paper introduces Nested Slice Sampling (NSS), a GPU-friendly, vectorized formulation of Nested Sampling that uses Hit-and-Run Slice Sampling for constrained updates. A tuning analysis yields a simple near-optimal rule for setting the slice width, improving high-dimensional behavior and making per-step compute more predictable for parallel execution. Experiments on challenging synthetic targets, high dimensional Bayesian inference, and Gaussian process hyperparameter marginalization show that NSS maintains accurate evidence estimates and high-quality posterior samples, and is particularly robust on difficult multimodal problems where current state-of-the-art methods such as tempered SMC baselines can struggle. An open-source implementation is released to facilitate adoption and reproducibility.
URL: https://openreview.net/forum?id=5mF2eRl3gt
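A minimal sketch of the hit-and-run slice sampling primitive the abstract builds on, not the paper's NSS algorithm: each step picks a random direction, then runs shrinkage-based slice sampling along that line. The 2-D standard Gaussian target and the fixed slice width are toy assumptions (the paper's contribution includes a tuned near-optimal width rule, which is not reproduced here).

```python
# Illustrative hit-and-run slice sampler on a 2-D standard Gaussian target.
import math, random

random.seed(1)

def logpdf(x):
    return -0.5 * (x[0] ** 2 + x[1] ** 2)

def hit_and_run_slice_step(x, w=2.0):
    # 1) random direction on the unit circle
    theta = random.uniform(0.0, 2.0 * math.pi)
    d = (math.cos(theta), math.sin(theta))
    # 2) slice level: log p(x) + log U, U ~ Uniform(0, 1)
    log_y = logpdf(x) + math.log(random.random())
    # 3) randomly positioned bracket of width w containing t = 0, then shrink
    u = random.random()
    lo, hi = -w * u, w * (1.0 - u)
    while True:
        t = random.uniform(lo, hi)
        prop = (x[0] + t * d[0], x[1] + t * d[1])
        if logpdf(prop) > log_y:
            return prop
        if t < 0.0:        # shrink the bracket toward the current point
            lo = t
        else:
            hi = t

x = (3.0, 3.0)
samples = []
for _ in range(5000):
    x = hit_and_run_slice_step(x)
    samples.append(x)

mean0 = sum(s[0] for s in samples[1000:]) / 4000   # post burn-in mean of coord 0
print(abs(mean0) < 0.2)
```

The shrinkage loop always terminates because the current point (t = 0) lies above the slice level; in Nested Sampling this same kernel would be run under the additional hard likelihood constraint.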
---
Title: A Survey on Efficient Protein Language Models
Abstract: Protein language models (pLMs) have become indispensable tools in computational biology, driving advances in variant effect prediction, functional annotation, structure prediction, and engineering. However, their rapid expansion from millions to tens of billions of parameters introduces significant computational, accessibility, and sustainability challenges that limit practical application in environments constrained by GPU memory, hardware availability, and energy budgets. This survey presents the first comprehensive review of efficient pLMs, synthesizing recent advancements across four key dimensions. We first examine (1) dataset efficiency through meta-learning-based few-shot and scaling-law-guided data allocation; and (2) architecture efficiency via lightweight alternatives including quantized transformers, embedding compression, and convolution-based designs. Furthermore, we review (3) training efficiency through scaling-law-informed pretraining, structure-integrated multimodal approaches, and low-rank adaptations with diverse distillation strategies; and (4) inference efficiency via quantization, dense-retrieval, and structure-search methods. By providing a structured taxonomy and practical guidance, this survey enables the development of high-performance, scalable, yet sustainable next-generation pLMs.
URL: https://openreview.net/forum?id=PTReuOwsXz
---
Title: WAREX: Web Agent Reliability Evaluation on Existing Benchmarks
Abstract: Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side or server-side issues, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications that can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX, a plug-and-play, network-layer tool that integrates with existing web agent benchmarks by simulating common website failures. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents. We demonstrate that WAREX serves as more than a diagnostic tool. By fine-tuning an open-source model (Qwen3-8B) on WAREX-generated "failure-recovery" trajectories, we achieve an 88.9% relative improvement in error recovery rates, validating WAREX as a core component for training the next generation of reliable web agents.
URL: https://openreview.net/forum?id=o4pXVP8RCD
---
Title: Unlocking The Power Of Layer-By-Layer Training And Fine-Tuning
Abstract: Layer-wise (LW) training of deep neural networks has long been associated with memory and parallelism advantages, yet it suffers from information degradation and poor convergence in deep architectures. Recent work attributes these issues to the loss of input information and the lack of layer-role differentiation, as measured by the Hilbert-Schmidt Independence Criterion (HSIC). In this paper, we present a novel algorithm that enables full end-to-end training of ResNet-18/ResNet-50 and end-to-end fine-tuning of Large Language Models (LLMs) using a modified LW approach, while minimizing performance degradation. Our fundamental contribution lies in the discovery that strategically reintroducing the final layers during LW training mitigates the convergence degradation typically observed during LW when compared to conventional end-to-end fine-tuning. We introduce Segmented Propagation (SegProp), a training paradigm that seamlessly integrates the computational efficiency of LW optimization with the representational power of global supervision. Quantitative results demonstrate substantial improvements in convergence compared to standard LW fine-tuning of LLMs and compared to LW training of ResNet-18/ResNet-50. SegProp improves ResNet-50 accuracy on CIFAR-10 from 90.0% (LW) to 94.3%, approaching E2E training at 95.5%. On ResNet-18, SegProp improves CIFAR-10 accuracy from 93.7% (LW) to 95.2%, closely matching E2E at 95.5%. On Mistral-Nemo-Instruct-2407, SegProp segmented fine-tuning matches E2E MMLU (5-shot) performance (69.3%), and for Llama3.1-8B-Instruct it achieves 78.9% on Winogrande (5-shot), closely matching E2E fine-tuning at 79.1%.
URL: https://openreview.net/forum?id=p5ObETPuTi
---
Title: Efficient and Programmable Exploration of Synthesizable Chemical Space
Abstract: The constrained nature of synthesizable chemical space poses a significant challenge for sampling molecules that are both synthetically accessible and possess desired properties. In this work, we present PrexSyn, an efficient and programmable model for molecular discovery within synthesizable chemical space. PrexSyn is based on a decoder-only transformer trained on a billion-scale datastream of synthesizable pathways paired with molecular properties, enabled by a real-time, high-throughput C++-based data generation engine. The large-scale training data allows PrexSyn to reconstruct the synthesizable chemical space nearly perfectly at a high inference speed and learn the association between properties and synthesizable molecules. Based on its learned property-pathway mappings, PrexSyn can generate synthesizable molecules that satisfy not only single-property conditions but also composite property queries joined by logical operators, thereby allowing users to ``program'' generation objectives. Moreover, by exploiting this property-based querying capability, PrexSyn can efficiently optimize molecules against black-box oracle functions via iterative query refinement, achieving higher sampling efficiency than even synthesis-agnostic baselines, making PrexSyn a powerful general-purpose molecular optimization tool. Overall, PrexSyn pushes the frontier of synthesizable molecular design by setting a new state of the art in synthesizable chemical space coverage, molecular sampling efficiency, and inference speed.
URL: https://openreview.net/forum?id=xDlIer2UnI
---
Title: Differential Privacy for Transformer Embeddings of Text with Nonparametric Variational Information Bottleneck
Abstract: We propose a privacy-preserving method for sharing text data by sharing noisy versions of their transformer embeddings.
It has been shown that hidden representations learned by deep models can encode sensitive information from the input, making it possible for adversaries to recover the input data with considerable accuracy. This problem is exacerbated in transformer embeddings because they consist of multiple vectors, one per token. To mitigate this risk, we propose Nonparametric Variational Differential Privacy (NVDP), which ensures both useful data sharing and strong privacy protection. We take a differential privacy (DP) approach, integrating a nonparametric variational information bottleneck (NVIB) layer into the transformer architecture to inject noise into its multivector embeddings and thereby hide information, and measuring privacy protection with Rényi Divergence (RD) and its corresponding Bayesian Differential Privacy (BDP) guarantee. Training the NVIB layer calibrates the noise level according to the utility of the downstream task. We test NVDP on the General Language Understanding Evaluation (GLUE) benchmark and show that varying the noise level gives us a useful trade-off between privacy and accuracy. With lower noise levels, our model maintains high accuracy while offering strong privacy guarantees, effectively balancing privacy and utility.
URL: https://openreview.net/forum?id=Y5rKWT4e6G
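A minimal sketch of the underlying noise-injection idea with the classical Laplace mechanism, purely for illustration: the abstract's actual method learns a calibrated noise level through an NVIB layer and accounts for privacy via Rényi divergence and Bayesian DP, none of which is shown here. The embedding values, sensitivity, and epsilon below are assumptions.

```python
# Illustrative Laplace mechanism on per-token embedding vectors (NOT NVDP/NVIB:
# there the noise level is learned and calibrated to the downstream task).
import math, random

random.seed(42)

def laplace_sample(scale):
    # inverse-CDF sampling of a zero-mean Laplace variate with the given scale
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(embedding, sensitivity=1.0, epsilon=2.0):
    # classical Laplace mechanism: noise scale = sensitivity / epsilon
    scale = sensitivity / epsilon
    return [x + laplace_sample(scale) for x in embedding]

token_embeddings = [[0.1, -0.4, 0.3], [0.7, 0.0, -0.2]]   # one vector per token
noisy = [privatize(v) for v in token_embeddings]
print(len(noisy) == len(token_embeddings) and len(noisy[0]) == 3)
```

The multi-vector structure (one embedding per token) is what makes transformer representations especially leaky, and is why the abstract injects noise into every vector rather than a single pooled embedding.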
---
Title: Concatenated Matrix SVD: Compression Bounds, Incremental Approximation, and Error-Constrained Clustering
Abstract: Large collections of matrices arise throughout modern machine learning, signal processing, and scientific computing, where they are commonly compressed by concatenation followed by truncated singular value decomposition (SVD). This strategy enables parameter sharing and efficient reconstruction and has been widely adopted across domains ranging from multi-view learning and signal processing to neural network compression. However, it leaves a fundamental question unanswered: which matrices can be safely concatenated and compressed together under explicit reconstruction error constraints? Existing approaches rely on heuristic or architecture-specific grouping and provide no principled guarantees on the resulting SVD approximation error. In the present work, we introduce a theory-driven framework for compression-aware clustering of matrices under SVD compression constraints. Our analysis establishes new spectral bounds for horizontally concatenated matrices, deriving global upper bounds on the optimal rank-$r$ SVD reconstruction error from lower bounds on singular value growth. The first bound follows from Weyl-type monotonicity under blockwise extensions, while the second leverages singular values of incremental residuals to yield tighter, per-block guarantees. We further develop an efficient approximate estimator based on incremental truncated SVD that tracks dominant singular values without forming the full concatenated matrix. Building on these bounds, we propose three clustering algorithms that merge matrices only when their predicted joint SVD compression error remains below a user-specified threshold. The algorithms span a trade-off between speed, provable accuracy, and scalability, enabling compression-aware clustering with explicit error control.
URL: https://openreview.net/forum?id=E9n35dehqx
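A minimal sketch of the error-constrained merging rule, not the paper's bounds or estimators: two matrices are concatenated horizontally only if the rank-1 truncated-SVD reconstruction error of the concatenation stays below a user-specified threshold. The top singular value is obtained by power iteration on A^T A, and the toy matrices and threshold are assumptions.

```python
# Illustrative merge-if-error-below-threshold rule for concatenated matrices,
# using ||A - A_1||_F = sqrt(||A||_F^2 - sigma_1^2) for the best rank-1 SVD.
import math, random

random.seed(0)

def matvec_AtA(A, v):
    # computes A^T (A v) without explicitly forming A^T A
    Av = [sum(row[j] * v[j] for j in range(len(v))) for row in A]
    return [sum(A[i][k] * Av[i] for i in range(len(A))) for k in range(len(v))]

def top_singular_value(A, iters=200):
    v = [random.random() for _ in range(len(A[0]))]
    for _ in range(iters):            # power iteration on A^T A
        w = matvec_AtA(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    w = matvec_AtA(A, v)              # Rayleigh quotient gives sigma_1^2
    return math.sqrt(sum(a * b for a, b in zip(w, v)))

def rank1_error(A):
    fro2 = sum(x * x for row in A for x in row)
    return math.sqrt(max(fro2 - top_singular_value(A) ** 2, 0.0))

def hcat(A, B):
    return [ra + rb for ra, rb in zip(A, B)]

A = [[1.0, 2.0], [2.0, 4.0]]   # rank 1
B = [[0.5, 1.0], [1.0, 2.0]]   # same column space as A: safe to merge
C = [[1.0, 0.0], [0.0, 1.0]]   # orthogonal directions: merging loses information

tau = 0.1
print(rank1_error(hcat(A, B)) <= tau)   # compatible pair: merge
print(rank1_error(hcat(A, C)) <= tau)   # incompatible pair: keep separate
```

The decision depends only on singular values of the concatenation, which is the quantity the paper's incremental estimator tracks without ever forming the full concatenated matrix.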
---
Title: FedIndex: Federated Domain Adaptation with Continuous Domain Indices
Abstract: Federated domain adaptation incorporates source clients’ knowledge to improve the model performance on the target client under the coordination of the server, mitigating the impact of data insufficiency and domain shift. Existing federated domain adaptation (FDA) methods focus on domain adaptation with categorical domain indices (e.g., “source” and “target”), while many real-world tasks involve domains with continuous domain indices. For instance, hospitals need to adapt disease analysis and prediction across patients via age, a continuous domain index in medical applications capturing the underlying relation between patient information and disease analysis. Prior FDA methods struggle with such tasks due to their ignorance of continuous domain indices. This paper proposes FedIndex to enable FDA with continuous domain indices. FedIndex performs adversarial domain adaptation across clients with the help of a global discriminator, aligning all domains’ distributions. Our theoretical analysis demonstrates the capability of FedIndex to generate domain-invariant features across clients using continuous domain indices without accessing data on clients, simultaneously maintaining privacy preservation. Our empirical results show that FedIndex outperforms the state-of-the-art FDA methods on synthetic and real-world datasets.
URL: https://openreview.net/forum?id=fnbGFH0330
---
Title: Multitask Transformer Models for Demographic and Industry Profiling on Long-Form Blog Texts
Abstract: We address the challenge of multitask author profiling on long-form blog text by developing four transformer-based models that jointly predict gender, age group, and industry. Using a cleaned version of the Blog Authorship Corpus, we explore document-length handling strategies that span input ranges from 192 to 500 tokens, including long-context encoding, BART-based summarization, and chunked processing with prediction fusion. Our experiments show that multitask learning consistently outperforms strong single-task baselines, with the largest gains for industry. We further find that broader input context yields more reliable predictions, while alternative representations emphasize complementary stylistic and topical cues. Taken together, these findings provide a comprehensive analysis of text-length effects in multitask author profiling and highlight the importance of contextual breadth for robust demographic inference. The dataset was preprocessed by merging industry tags into fourteen categories and applying standard text normalization.
URL: https://openreview.net/forum?id=WtFwcCvt9i
---
Title: Causally Fair Node Classification on Non-IID Graph Data
Abstract: Fair machine learning seeks to identify and mitigate biases in predictions against unfavorable populations characterized by demographic attributes, such as race and gender. Recent research has extended fairness to graph data, such as social networks, but many neglect the causal relationships among data instances. This paper addresses the prevalent challenge in fair ML algorithms, which typically assume Independent and Identically Distributed (IID) data, from the causal perspective. We base our research on the Network Structural Causal Model (NSCM) framework and develop a Message Passing Variational Autoencoder for Causal Inference (MPVA) framework to compute interventional distributions and facilitate causally fair node classification through estimated interventional distributions. Theoretical soundness of the proposed method is established under two general and practical conditions: Decomposability and Graph Independence. These conditions formalize when interventional distributions can be computed using do-calculus in non-IID settings, thereby grounding the framework in rigorous causal inference theory rather than imposing ad hoc constraints. Empirical evaluations on semi-synthetic and real-world datasets demonstrate that MPVA outperforms conventional methods by effectively approximating interventional distributions and mitigating bias. The implications of our findings underscore the potential of causality-based fairness in complex ML applications, setting the stage for further research into relaxing the initial assumptions to enhance model fairness.
URL: https://openreview.net/forum?id=AwptwzGld5
---
Title: RA-CoA: Training-free Fashion Image Captioning via Retrieval-Augmented Chain-of-Attributes
Abstract: Fashion Image Captioning (FIC) plays a vital role in enhancing user experience and product search in e-commerce platforms. Unlike natural scene image captioning, FIC requires fine-grained visual reasoning and knowledge of domain-specific terminology to capture subtle attributes such as neckline and closure types, graphic patterns, and dress silhouettes. Moreover, as fashion inventories evolve rapidly with new trends, styles, and frequently emerging vocabulary, developing a training-free captioning solution becomes essential for scalability and real-world adaptability. Instruction-tuned vision-language models (VLMs) offer a promising solution to fashion image captioning due to their strong zero-shot capabilities and natural language fluency. However, these general-purpose models often lack attribute-level coverage and precision, and tend to hallucinate or misidentify fine-grained fashion details, making them less suitable for high-fidelity applications like product cataloging or personalized recommendations. To address this, we propose RA-CoA (Retrieval-Augmented Chain-of-Attributes), a novel, training-free framework that disentangles fashion image captioning into two interpretable stages: (i) retrieval of relevant attribute sets from a product knowledge base, and (ii) attribute-level reasoning to generate the final caption. RA-CoA is a model-agnostic approach that works with frozen VLMs to improve fine-grained attribute precision in product captions without the need for fine-tuning. Extensive evaluations across diverse VLM model families under different prompting paradigms demonstrate that RA-CoA significantly improves caption quality, achieving an average gain of 26.3% METEOR score over zero-shot captioning. We shall make our code publicly available upon acceptance.
URL: https://openreview.net/forum?id=PpkOrVUpJ6
---