Accepted papers
===============
Title: GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Authors: Muhammad Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James R. Glass
Abstract: In this work, we propose GLOV, which enables Large Language Models (LLMs) to act as implicit optimizers for Vision-Language Models (VLMs) to enhance downstream vision tasks. GLOV prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP). These prompts are ranked according to their fitness for the downstream vision task. In each optimization step, the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with knowledge of the type of prompts preferred by the downstream VLM. Furthermore, we explicitly guide the LLM's generation at each optimization step by adding an offset vector (calculated from the embedding differences between previous positive and negative solutions) to an intermediate layer of the network for the next generation. This offset vector biases the LLM generation toward the type of language the downstream VLM prefers, resulting in enhanced performance on the downstream vision tasks. We comprehensively evaluate GLOV on two tasks: object recognition and the critical task of enhancing VLM safety. GLOV improves performance by up to 15.0% and 57.5% for dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVA) models on object recognition, and reduces the attack success rate (ASR) on state-of-the-art VLMs by up to 60.7%.
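The guidance step described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the paper's implementation: the function names and the scaling factor `alpha` are assumptions, and in GLOV the offset is applied inside a real LLM rather than to toy arrays.

```python
import numpy as np

def steering_offset(pos_hidden, neg_hidden):
    # Offset = mean of hidden states from high-scoring ("positive") prompts
    # minus mean of hidden states from low-scoring ("negative") prompts.
    return np.mean(pos_hidden, axis=0) - np.mean(neg_hidden, axis=0)

def apply_offset(hidden, offset, alpha=1.0):
    # Add the scaled offset to an intermediate layer's activations to bias
    # the next generation toward the language the downstream VLM prefers.
    return hidden + alpha * offset
```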
URL: https://openreview.net/forum?id=kZLANTp6Vw
---
Title: ABC: Achieving Better Control of Visual Embeddings using VLLMs
Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
Abstract: Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. Such tasks necessitate an embedding model whose output representation can be controlled by a natural language instruction. Existing CLIP-based approaches embed images and text independently and fuse the result. We find that this results in weak interactions between modalities and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top-performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control. Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/
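The weak-interaction criticism of additive late fusion can be illustrated with a toy example (this is not the ABC model, just the baseline behavior the abstract argues against):

```python
import numpy as np

rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=8), rng.normal(size=8)
instr = rng.normal(size=8)

def late_fuse(img_emb, txt_emb):
    # Embed image and text independently, then combine additively
    # (no cross-attention between the modalities).
    return img_emb + txt_emb

# The instruction shifts every image embedding by the same constant vector,
# so it cannot modulate the representation conditioned on image content.
delta = late_fuse(img_a, instr) - late_fuse(img_b, instr)
```

A VLM backbone avoids this by letting instruction tokens attend to image tokens, so the instruction's effect depends on what is in the image.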
URL: https://openreview.net/forum?id=RezANmBpxW
---
Title: Spatio-temporal Partial Sensing Forecast of Long-term Traffic
Authors: Zibo Liu, Zhe Jiang, Zelin Xu, Tingsong Xiao, Zhengkun Xiao, Yupu Zhang, Haibo Wang, Shigang Chen
Abstract: Traffic forecasting uses recent measurements by sensors installed at chosen locations to forecast future road traffic. Existing work either assumes all locations are equipped with sensors or focuses on short-term forecasting. This paper studies partial sensing forecast of long-term traffic, assuming sensors are available only at some locations. The problem is challenging due to the unknown data distribution at unsensed locations, the intricate spatio-temporal correlation in long-term forecasting, and noise in traffic patterns. We propose a Spatio-temporal Long-term Partial sensing Forecast model (SLPF) for traffic prediction, with several novel contributions, including a rank-based embedding technique to reduce the impact of noise in data, a spatial transfer matrix to overcome the spatial distribution shift from sensed locations to unsensed locations, and a multi-step training process that utilizes all available data to successively refine the model parameters for better accuracy. Extensive experiments on several real-world traffic datasets demonstrate its superior performance. Our source code is at https://github.com/zbliu98/SLPF
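One plausible reading of the rank-based embedding idea, sketched below as an assumption rather than the paper's exact technique: replacing raw readings with normalized ranks caps the influence of any single extreme spike. The spatial transfer matrix can similarly be pictured as a learned linear map from sensed-location features to unsensed ones.

```python
import numpy as np

def rank_embed(x):
    # Replace raw sensor readings with their normalized ranks in [0, 1];
    # an outlier spike changes its rank by at most one position, so it
    # cannot dominate the embedding the way a raw value would.
    ranks = np.argsort(np.argsort(x))
    return ranks / (len(x) - 1)
```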
URL: https://openreview.net/forum?id=Ff08aPjVjD
---
Title: DiffCLIP: Differential Attention Meets CLIP
Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem
Abstract: We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency.
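For readers unfamiliar with differential attention, the core operation (following the Differential Transformer formulation the abstract references) is the difference of two softmax attention maps, which cancels common-mode "noise" attention. A minimal numpy sketch, with λ fixed rather than learned as in the original work:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    # Two attention maps from two sets of query/key projections;
    # subtracting the second suppresses attention mass that both
    # maps assign to irrelevant context.
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v
```

With `lam=0` this reduces to standard scaled dot-product attention, which is why the extra parameter cost is minimal.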
URL: https://openreview.net/forum?id=2I2fTehry2
---
Title: On the Role of Discrete Representation in Sparse Mixture of Experts
Authors: Giang Do, Kha Pham, Hung Le, Truyen Tran
Abstract: Sparse Mixture of Experts (SMoE) is an effective solution for scaling up model capacity without increasing computational cost. A crucial component of SMoE is the router, responsible for directing the input to relevant experts; however, it is also a major weakness, leading to routing inconsistencies and representation collapse. Instead of fixing the router as in previous works, we propose an alternative that assigns experts to inputs via \emph{indirection}: a discrete representation of the input points to the expert. The discrete representations are learned via vector quantization, resulting in a new architecture dubbed Vector-Quantized Mixture of Experts (VQMoE). We provide theoretical support and empirical evidence demonstrating VQMoE's ability to overcome the challenges present in traditional routers. Through extensive evaluations on both large language models and vision tasks for pre-training and fine-tuning, we show that VQMoE achieves a 28\% improvement in robustness compared to other SMoE routing methods while maintaining strong performance in fine-tuning tasks.
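The indirection idea can be sketched as a nearest-codeword lookup: the discrete code, not a learned router network, decides which expert processes the input. This is a simplified sketch; in VQMoE the codebook is trained with vector quantization, which is omitted here.

```python
import numpy as np

def vq_route(x, codebook):
    # Assign the input to the expert whose codeword is nearest.
    # Codeword i is bound to expert i, so routing is a lookup,
    # not the output of a trainable router network.
    dists = np.linalg.norm(codebook - x, axis=1)
    return int(np.argmin(dists))
```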
URL: https://openreview.net/forum?id=GTWKmojpI7
---
Title: RefDeblur: Blind Motion Deblurring with Self-Generated Reference Image
Authors: Insoo Kim, Geonseok Seo, Hyong-Euk Lee, Jinwoo Shin
Abstract: The challenge of blind motion deblurring is often tackled via two distinct paradigms: kernel-based and kernel-free methods. Each paradigm has inherent strengths. Kernel-based methods facilitate generating texture-detailed sharp images by closely aligning with the blurring process. In contrast, kernel-free methods are more effective at handling complex blur patterns. Building upon these complementary benefits, we propose a hybrid framework that decomposes a non-uniform deblurring task into two simpler tasks: uniform kernel estimation, managed by our kernel-based method, and error prediction, handled by our kernel-free method. Our kernel-based method serves to generate a reference image with realistic texture details, while our kernel-free model refines the reference image by correcting residual errors while preserving texture details. To efficiently build our kernel-based model, we consider the logarithmic Fourier space, which makes blur kernel estimation easier by simplifying the relationship between blurred and sharp samples. Furthermore, using a texture-detailed reference image allows us to reduce the size of our kernel-free model without compromising performance. As a result, the proposed method achieves remarkable performance on several datasets such as RealBlur, RSBlur and GoPro, and comparable performance to state-of-the-art methods with a 75% reduction in computational costs.
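The log-Fourier simplification rests on a standard identity: uniform blur is a convolution, which becomes a product in Fourier space and a sum in log-Fourier space, so the kernel falls out of a subtraction. A minimal sketch of that identity under an idealized noise-free, circular-convolution assumption (not the paper's estimator):

```python
import numpy as np

def estimate_kernel_spectrum(blurred, sharp, eps=1e-8):
    # blurred = sharp (*) kernel  =>  F(b) = F(s) * F(k)
    # => log F(b) = log F(s) + log F(k), so the kernel spectrum is
    # recovered by a subtraction in log-Fourier space.
    Fb, Fs = np.fft.fft2(blurred), np.fft.fft2(sharp)
    return np.exp(np.log(Fb + eps) - np.log(Fs + eps))
```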
URL: https://openreview.net/forum?id=Nyewu7xztw
---
Title: Enhancing Cost Efficiency in Active Learning with Candidate Set Query
Authors: Yeho Gwon, Sehyun Hwang, Hoyoung Kim, Jungseul Ok, Suha Kwak
Abstract: This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64. The project page can be found at https://yehogwon.github.io/csq-al.
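The conformal step that generates small-but-reliable candidate sets can be sketched with standard split conformal prediction. This is background on the technique, not the paper's exact procedure; the acquisition function and cost model are omitted.

```python
import numpy as np

def conformal_candidate_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    # Split conformal prediction: calibrate a score threshold on held-out
    # data so candidate sets contain the ground-truth class with
    # probability >= 1 - alpha.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    # A class enters the candidate set if its score clears the threshold.
    return [np.where(1.0 - p <= q)[0] for p in test_probs]
```

As the model improves over AL rounds, its softmax probabilities concentrate and the calibrated sets shrink, which is what drives the labeling-cost reduction.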
URL: https://openreview.net/forum?id=LhHxl30xQ1
---
New submissions
===============
Title: Overcoming Open-Set Approaches to Adversarial Defense
Abstract: Machine learning (ML) models are increasingly proposed to replace or augment safety-critical sensor processing systems, yet their fragility to evasion attacks remains a well-documented open problem. This work analyzes a class of deep neural network defenses that add a none-of-the-above (NOTA) class as an open-set-inspired closed-set adversarial defense. We show that such approaches often appear far more robust than they are because standard adversarial attacks lack explicit handling for large auxiliary classes like NOTA, causing stopping-criterion, target-selection, and objective-function behaviors that mask true vulnerabilities. We formalize these issues in a taxonomy of evaluation pitfalls, adapt seven prominent adversarial attacks to eliminate them, and show that adding a NOTA class alone does not solve the core challenge of defending DNNs against evasion attacks. We release our adapted attack suite to enable more rigorous future evaluations of open-set-inspired defenses.
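One of the pitfalls described (an attack counting "pushed into NOTA" as a success) can be sketched as follows. The function names are illustrative, not taken from the paper's attack suite:

```python
import numpy as np

def pick_attack_target(logits, true_class, nota_class):
    # Target selection that excludes the NOTA class: pushing an input
    # into NOTA is detection/rejection, not a successful evasion.
    masked = logits.copy()
    masked[[true_class, nota_class]] = -np.inf
    return int(np.argmax(masked))

def attack_succeeded(pred, true_class, nota_class):
    # Success only if the model outputs a wrong *real* class.
    return pred != true_class and pred != nota_class
```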
URL: https://openreview.net/forum?id=iuQ9r8VSIX
---
Title: Multimodal Prescriptive Deep Learning
Abstract: We introduce a multimodal deep learning framework, Prescriptive Neural Networks (PNNs), that combines ideas from optimization and machine learning, and is, to the best of our knowledge, the first prescriptive method to handle multimodal data. The PNN is a feedforward neural network trained on embeddings to output an outcome-optimizing prescription. In two real-world multimodal datasets, we demonstrate that PNNs prescribe treatments that are able to greatly improve estimated outcomes in transcatheter aortic valve replacement (TAVR) procedures by reducing estimated postoperative complication rates by over 40\% and in liver trauma injuries by reducing estimated mortality rates by 25\%. In four real-world, unimodal tabular datasets, we demonstrate that PNNs outperform or perform comparably to other well-known, state-of-the-art prescriptive models; importantly, on tabular datasets, we also recover interpretability through knowledge distillation, fitting interpretable Optimal Classification Tree models onto the PNN prescriptions as classification targets, which is critical for many real-world applications. Finally, we demonstrate that our multimodal PNN models achieve stability across randomized data splits comparable to other prescriptive methods and produce realistic prescriptions across the different datasets.
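At its core, a prescriptive model of this kind scores candidate treatments with a learned outcome model and prescribes the best one. A minimal sketch under that assumption; in the PNN the outcome model is a feedforward network trained end-to-end on multimodal embeddings, which is abstracted away here:

```python
import numpy as np

def prescribe(outcome_model, embedding, treatments):
    # Score every candidate treatment with the learned outcome model
    # and prescribe the treatment with the lowest predicted risk.
    risks = [outcome_model(embedding, t) for t in treatments]
    return treatments[int(np.argmin(risks))]
```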
URL: https://openreview.net/forum?id=AwfWOCVLbJ
---
Title: Probably Approximately Correct Causal Discovery
Abstract: The discovery of causal relationships is a foundational problem in artificial intelligence, statistics, epidemiology, economics, and beyond. While elegant theories exist for accurate causal discovery given infinite data, real-world applications are inherently resource-constrained. Effective methods for inferring causal relationships from observational data must perform well under finite data and time constraints, where "performing well" implies achieving high, though not perfect, accuracy.
In his seminal paper *A Theory of the Learnable*, Valiant highlighted the importance of resource constraints in supervised machine learning, introducing the concept of Probably Approximately Correct (PAC) learning as an alternative to exact learning. Inspired by Valiant's work, we propose the Probably Approximately Correct Causal (PACC) Discovery framework, which extends PAC learning principles to the causal field. This framework emphasizes both computational and sample efficiency for established causal methods such as propensity score techniques and instrumental variable approaches. Furthermore, we show that it can provide theoretical guarantees for other widely used methods, such as the Self-Controlled Case Series (SCCS) method, which had previously lacked such guarantees.
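For context, the classical PAC guarantee for a finite hypothesis class (a standard result from Valiant-style learning theory, not a statement of the PACC framework itself) reads:

```latex
m \;\ge\; \frac{1}{\varepsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)
\quad\Longrightarrow\quad
\Pr\big[\operatorname{err}(\hat{h}) \le \varepsilon\big] \;\ge\; 1 - \delta,
```

i.e., with enough samples a consistent learner is probably (with confidence $1-\delta$) approximately (to error $\varepsilon$) correct. PACC Discovery transports this style of finite-sample guarantee to causal estimands.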
URL: https://openreview.net/forum?id=N3UJqEQZmL
---
Title: The Speed-up Factor: A Quantitative Multi-Iteration Active Learning Performance Metric
Abstract: Machine learning models excel with abundant annotated data, but annotation is often costly and time-intensive.
Active learning (AL) aims to improve the performance-to-annotation ratio by using query methods (QMs) to iteratively select the most informative samples.
While AL research focuses mainly on QM development, the evaluation of this iterative process lacks appropriate performance metrics.
This work reviews eight years of AL evaluation literature and formally introduces the speed-up factor, a quantitative multi-iteration QM performance metric that indicates the fraction of samples needed to match random sampling performance.
Using four datasets from diverse domains and seven QMs of various types, we empirically evaluate the speed-up factor and compare it with state-of-the-art AL performance metrics.
The results confirm the assumptions underlying the speed-up factor, demonstrate its accuracy in capturing the described fraction, and reveal its superior stability across iterations.
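A minimal sketch of how such a metric could be computed from two learning curves. This is my reading of the definition, not the paper's estimator: the speed-up is taken here as the ratio of labels random sampling uses to labels the query method needs to first match random sampling's final performance, and its reciprocal is the fraction of samples described above.

```python
import numpy as np

def speed_up_factor(qm_curve, random_curve, samples):
    # qm_curve / random_curve: accuracy after labeling samples[i] points.
    # Find where the query method first matches random sampling's final
    # accuracy, then compare label budgets at equal performance.
    target = random_curve[-1]
    reached = np.asarray(qm_curve) >= target
    n_qm = np.asarray(samples)[np.argmax(reached)]
    return samples[-1] / n_qm
```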
URL: https://openreview.net/forum?id=q6hRb6fETo
---
Title: Alice in Wonderland: Variations in Simple Problems Reveal Severe Generalization Deficits in Large Language and Reasoning Models
Abstract: Large language and reasoning models (LLMs, LRMs) are instances of foundation models exhibiting scaling laws that predict generalization improvement when increasing the pre-training scale. As such, they are supposed to possess strong generalization and therefore transfer robustly across various tasks and conditions in a few-shot or zero-shot manner. Such claims rely on various standardized benchmarks that should measure core functions like generalization and reasoning, where state-of-the-art (SOTA) models score high. We demonstrate here a severe breakdown of zero-shot generalization in most SOTA models which claim strong function, including reasoning models like DeepSeek R1 or o1-mini, trained at the largest scales, using a simple, short common-sense problem formulated in concise natural language, easily solvable by humans (the AIW problem). The breakdown is severe, as on this simple problem it manifests in both low average performance and, importantly, strong performance fluctuations on natural variations of the problem template that change neither the problem structure nor its difficulty. By testing models on further control problems of similar form, we rule out that the breakdown is rooted in minor low-level issues like natural language or number parsing. In conventional LLMs, we observe strong overconfidence in the wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations. We use these observations to stimulate re-assessment of the capabilities of the current generation of LLMs and LRMs as claimed by standardized language understanding and reasoning benchmarks. Such re-assessment also requires common action to establish benchmarks that would allow proper detection of deficits in generalization and reasoning that remain undiscovered by current evaluation procedures, where models with clear deficits still manage to score high.
We discuss how this illusion might be caused by leakage of test sets into training, and how procedural test problem generation can alleviate this. Code for reproducing experiments in the paper and raw experiments data can be found at https://anonymous.4open.science/r/AITW_anonymous-69A6
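Procedural test problem generation, as advocated above, can be sketched as templated instantiation with randomized parameters so that no fixed instance can leak into training data. The template below is an assumed reconstruction of an AIW-style question; the paper's exact wording may differ.

```python
import random

def aiw_instance(rng):
    # One procedurally generated AIW-style instance. Varying the numbers
    # (and, in the paper, the phrasing) preserves structure and difficulty
    # while preventing memorization of any fixed benchmark item.
    brothers, sisters = rng.randint(1, 5), rng.randint(1, 5)
    question = (f"Alice has {brothers} brothers and she also has {sisters} "
                f"sisters. How many sisters does Alice's brother have?")
    return question, sisters + 1  # each brother also counts Alice herself
```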
URL: https://openreview.net/forum?id=frA7uYn2um
---