Accepted papers
===============
Title: Decoding-based Regression
Authors: Xingyou Song, Dara Bahri
Abstract: Language models have recently been shown to be capable of performing regression, in which numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal sequence decoding models as numeric regression heads given any feature representation. We find that, despite being trained in the usual way (for next-token prediction via cross-entropy loss), decoder-based heads are as performant as standard pointwise heads when benchmarked over standard regression tasks, while being flexible enough to capture smooth numeric distributions, such as in the task of density estimation.
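To make the decoding-based setup concrete, the following is a minimal sketch of representing numbers as token sequences for next-token training and decoding them back; the fixed-point tokenizer, precision, and token names here are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch of decoding-based regression's number representation.
# Assumptions: fixed-point tokenization over a digit vocabulary with an <eos>
# token; the paper's tokenizer and decoding strategy may differ.

VOCAB = [str(d) for d in range(10)] + ["-", ".", "<eos>"]

def encode(y: float, precision: int = 3) -> list[str]:
    """Render a number as a token sequence for next-token (cross-entropy) training."""
    return list(f"{y:.{precision}f}") + ["<eos>"]

def decode(tokens: list[str]) -> float:
    """Invert the tokenization: join digit tokens back into a float."""
    return float("".join(t for t in tokens if t != "<eos>"))

toks = encode(3.14159)    # ['3', '.', '1', '4', '2', '<eos>']
print(decode(toks))       # 3.142
```

A decoder head trained this way outputs a distribution over token sequences, which is what lets it represent smooth predictive densities rather than a single point estimate.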
URL: https://openreview.net/forum?id=avUQ8jguxg
---
Title: Explaining Confident Black-Box Predictions
Authors: Evan Yao, Retsef Levi, Assaf Avrahami, Abraham Meidan
Abstract: Interpretability is crucial for leveraging predictive machine learning for decision-making, but the strongest-performing models are often black boxes that are difficult to understand. For binary classification models, a growing body of literature seeks model-agnostic explanations by treating a model as a list of 0/1 predictions and identifying patterns for when a model predicts 1 over 0 (or vice versa). While such explanations are useful for understanding when a model predicts 1 over 0, they do not consider the confidence (i.e., the probability) behind predictions, a critical piece of information provided by most classification models. Since the 0/1 predictions of a model depend on a subjective threshold for discretizing predicted probabilities, the resulting explanations may change as the threshold changes, even though the underlying model stays the same. In contrast, this work proposes model-agnostic explanations that treat a black-box model as a ranking of a dataset from lowest predicted probability of 1 to highest, rather than as a list of 0/1 predictions. Under this ranking, a useful explanation should capture broadly when a model confidently predicts 1 (i.e., highly ranked data points). Since highly confident predictions are often correlated with predictions that are more accurate and actionable, understanding when a model predicts confidently is often quite valuable to a practitioner.
This work builds explanations based on rule lists (i.e., collections of if-then rules) as well as a novel special case called checklists. A strong rule list or checklist is satisfied by a large number of data points that are ranked highly by the model. This quality is measured by the traditional metric of support (i.e., the number of data points an explanation applies to); the average ranking of those data points, which we call the Average Black-Box Ranking (ABBR); and the sparsity of the explanation (e.g., the number of rules in the rule list). Given these metrics, this work develops a local-search-based optimization methodology for finding rule-list and checklist explanations that maximize ABBR under user-specified support and sparsity constraints. The methodology starts from an initial rule list chosen greedily from a pool of candidate rules, then gradually perturbs it by swapping rules in the list with rules from the candidate pool. This approach is evaluated on six real-world datasets in application areas ranging from healthcare to criminal justice and finance. Empirical results suggest that the methodology finds rule lists of length at most 5 with ABBR within 7.4% of the optimal ABBR of any explanation, while checklists provide greater interpretability at a small cost in performance.
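As an illustration of the objective, here is a hedged sketch of computing ABBR and support for a rule list, plus one swap-based local-search move; function and variable names are ours, and the paper's actual search has additional machinery (greedy initialization, checklists, sparsity handling).

```python
# Hedged sketch: rules are predicates over data points; ranking[i] is data
# point i's rank under the black-box model (higher = more confidently '1').

def abbr(rule_list, data, ranking):
    """Average Black-Box Ranking and support of a rule list."""
    covered = [i for i, x in enumerate(data) if any(r(x) for r in rule_list)]
    if not covered:
        return 0.0, 0
    return sum(ranking[i] for i in covered) / len(covered), len(covered)

def swap_step(rule_list, pool, data, ranking, min_support):
    """One local-search move: swap each rule for each candidate and keep the
    best ABBR-improving list that still meets the support constraint."""
    best, (best_score, _) = rule_list, abbr(rule_list, data, ranking)
    for i in range(len(rule_list)):
        for cand in pool:
            trial = rule_list[:i] + [cand] + rule_list[i + 1:]
            score, support = abbr(trial, data, ranking)
            if support >= min_support and score > best_score:
                best, best_score = trial, score
    return best
```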
URL: https://openreview.net/forum?id=SAwZpgKJcc
---
Title: Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Authors: Yutong He, Alexander Robey, Naoki Murata, Yiding Jiang, Joshua Nathaniel Williams, George J. Pappas, Hamed Hassani, Yuki Mitsufuji, Ruslan Salakhutdinov, J Zico Kolter
Abstract: Prompt engineering is an effective but labor-intensive way to control text-to-image (T2I) generative models. Its time-intensive nature and complexity have spurred the development of algorithms for automated prompt generation. However, these methods often struggle with transferability across T2I models, require white-box access to the underlying model, or produce non-intuitive prompts. In this work, we introduce PRISM, an algorithm that automatically produces human-interpretable and transferable prompts that can effectively generate desired concepts given only black-box access to T2I models. Inspired by large language model (LLM) jailbreaking, PRISM leverages the in-context learning ability of LLMs to iteratively refine the candidate prompt distribution built upon the reference images. Our experiments demonstrate the versatility and effectiveness of PRISM in generating accurate prompts for objects, styles, and images across multiple T2I models, including Stable Diffusion, DALL-E, and Midjourney.
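The core loop of PRISM-style black-box prompt search can be sketched as follows; `llm_propose`, `t2i_generate`, and `clip_score` are hypothetical placeholders for the black-box APIs, and the actual candidate-distribution update in PRISM is more involved.

```python
# Hedged sketch of iterative black-box prompt refinement. All three callables
# are assumed interfaces, not PRISM's actual components.
def refine_prompt(reference_images, llm_propose, t2i_generate, clip_score,
                  n_iters=10, n_candidates=4):
    best_prompt, best_score = "", float("-inf")
    history = []  # (prompt, score) pairs fed back to the LLM in context
    for _ in range(n_iters):
        for prompt in llm_propose(reference_images, history, n_candidates):
            image = t2i_generate(prompt)              # black-box T2I model
            score = clip_score(image, reference_images)
            history.append((prompt, score))           # in-context feedback
            if score > best_score:
                best_prompt, best_score = prompt, score
    return best_prompt
```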
URL: https://openreview.net/forum?id=IVYVDN6pJ6
---
Title: Emergent Corpus Pre-training Benefits Vision Language Models
Authors: Makanjuola Adekunmi Ogunleye, Chase Vickery, Ismini Lourentzou
Abstract: Vision-Language Pre-trained Models (VL-PTMs) have achieved impressive performance across a wide range of tasks, but their success often hinges on access to large-scale multimodal datasets. While effective in high-resource settings, these models tend to struggle in data-scarce regimes. In this work, we investigate Emergent Communication (EC) as a mechanism to improve sample efficiency in VL-PTMs. We pre-train a Vision-Language Model (VLM) using EC tokens generated through a referential game between two artificial agents. Across three diverse cross-modal matching and reasoning benchmarks, EC pre-training yields substantial gains, improving Visual Referring Expression (VRE) accuracy by 108.6% and Visual Entailment (VE) by 69.6%. To further validate the effectiveness of EC pre-training, we introduce LLaVA-1.5-EC, a LLaVA variant trained entirely on EC tokens. LLaVA-1.5-EC outperforms strong LVLM baselines, including BLIP-2 (13B), achieving relative gains of 104.23% on VizWiz, 34.8% on GQA, and 10.8% on VQAv2, as well as top performance on MMBench, a challenging instruction-following benchmark. These results highlight the transferability and generalization capacity of EC pre-training and underscore the potential of leveraging grounded EC tokens to enhance vision-language reasoning in low-resource settings, especially those with limited natural language data. We discuss implications and propose avenues for future research exploring the connections between EC and VL for multimodal understanding and effective human-machine communication.
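For readers unfamiliar with referential games, a toy speaker/listener setup of the kind that produces EC tokens might look like the sketch below (PyTorch); the architectures, message length, and Gumbel-softmax channel are our simplifications, not the paper's agents.

```python
# Toy referential game: a speaker emits discrete tokens describing a target
# image; a listener must pick the target among distractors from the message.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Speaker(nn.Module):
    def __init__(self, feat_dim=512, vocab=64, msg_len=8):
        super().__init__()
        self.to_logits = nn.Linear(feat_dim, vocab * msg_len)
        self.vocab, self.msg_len = vocab, msg_len

    def forward(self, target_feats):                   # (B, feat_dim)
        logits = self.to_logits(target_feats).view(-1, self.msg_len, self.vocab)
        return F.gumbel_softmax(logits, hard=True)     # (B, L, V), discrete-ish

class Listener(nn.Module):
    def __init__(self, feat_dim=512, vocab=64, msg_len=8):
        super().__init__()
        self.msg_enc = nn.Linear(vocab * msg_len, feat_dim)

    def forward(self, message, candidate_feats):       # candidates: (B, K, feat_dim)
        m = self.msg_enc(message.flatten(1))           # (B, feat_dim)
        return torch.einsum("bd,bkd->bk", m, candidate_feats)  # match scores

# Training with cross-entropy between the listener's scores and the target's
# index pressures the emergent tokens to encode visual content.
```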
Project Website: https://plan-lab.github.io/ec-vlm/
URL: https://openreview.net/forum?id=bivKGSaXkD
---
New submissions
===============
Title: Federated Multimodal Fusion for Action Recognition Leveraging Vision-Language Embeddings and Spatio-Temporal CNNs
Abstract: Federated learning (FL) for Video Action Recognition (VAR) faces significant challenges in balancing privacy preservation, communication efficiency, and model performance. This paper introduces FLAMeST (Federated Learning for Action Recognition with Multimodal embeddings and Spatio-Temporal Fusion), an FL framework that synergizes Vision-Language Models (VLMs) and spatiotemporal CNNs to address these challenges. Unlike existing works that use the BLIP VLM solely for caption generation, FLAMeST leverages BLIP in a dual manner. To enhance temporal modeling, complementary spatiotemporal features are extracted using a pre-trained 3D CNN (Slow network). These semantic (BLIP) and motion (Slow) embeddings are concatenated into a unified representation to train a lightweight Multi-Layer Perceptron (MLP). Within the FL paradigm, only the MLP parameters are shared with the server, ensuring that raw video data and generated captions remain local. FLAMeST employs the FedAvg algorithm for model aggregation, achieving 99% lower communication overhead compared to full-model training. Experiments on the UCF101 and HMDB51 datasets demonstrate the framework's robustness, with accuracy improvements of 5.13% and 2.71%, respectively, over the baseline.
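A minimal sketch of the fusion-and-aggregation idea described above, assuming frozen BLIP and Slow backbones whose embedding dimensions we guess at; only the small MLP head would cross the network under FedAvg.

```python
# Hedged sketch: concatenate semantic (BLIP) and motion (Slow) embeddings,
# train a lightweight MLP head, and federate only the head's parameters.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, blip_dim=768, slow_dim=2048, n_classes=101):  # dims assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(blip_dim + slow_dim, 512), nn.ReLU(),
            nn.Linear(512, n_classes))

    def forward(self, blip_emb, slow_emb):
        return self.mlp(torch.cat([blip_emb, slow_emb], dim=-1))

def fedavg(client_states, weights):
    """Weighted average of client MLP state dicts; raw videos, captions, and
    backbone weights never leave the clients."""
    return {k: sum(w * s[k] for w, s in zip(weights, client_states))
            for k in client_states[0]}
```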
URL: https://openreview.net/forum?id=AobzdtqiMe
---
Title: Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop
Abstract: Large Language Models (LLMs) are already widely used to generate content for a variety of online platforms. Because LLM-generated content cannot be reliably distinguished from human-produced content, it ends up in the training data for the next generation of LLMs, giving rise to a self-consuming training loop. From the image-generation domain, it is known that such a self-consuming training loop reduces both the quality and diversity of images, eventually ending in model collapse. However, it is unclear whether this alarming effect can also be observed for LLMs. We therefore present the first study investigating the self-consuming training loop for LLMs. Furthermore, we propose a novel method based on logic expressions that allows us to unambiguously verify the correctness of LLM-generated content, which is difficult for natural-language text. We find that the self-consuming training loop still produces correct outputs; however, the diversity of the outputs declines, with the rate of decline depending on the proportion of generated data used. Fresh data can slow this decline, but not stop it. We observe similar results on a real natural-language dataset. Given these concerning results, we encourage researchers to study methods to counteract this process.
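The loop being studied can be summarized in a few lines; `train` and `generate` stand in for the actual LLM pipeline, and the distinct-output ratio is a crude diversity proxy of our choosing, not the paper's metric.

```python
# Hedged sketch of a self-consuming training loop with a fresh-data fraction.
def self_consuming_loop(real_data, train, generate, generations=5, fresh_frac=0.2):
    data = list(real_data)
    for g in range(generations):
        model = train(data)
        synthetic = generate(model, n=len(real_data))
        n_fresh = int(fresh_frac * len(real_data))
        # Each generation trains mostly on its predecessor's outputs,
        # optionally mixed with a slice of fresh human data.
        data = synthetic[: len(real_data) - n_fresh] + list(real_data)[:n_fresh]
        diversity = len(set(synthetic)) / len(synthetic)
        print(f"generation {g}: distinct-output ratio = {diversity:.2f}")
    return model
```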
URL: https://openreview.net/forum?id=FzIzju42B3
---
Title: Formal Methods in Robot Policy Learning: A Survey on Current Techniques and Future Directions
Abstract: As hardware and software systems have grown in complexity, formal methods have been indispensable tools for (1) rigorously specifying acceptable behaviors, (2) synthesizing programs to meet these specifications, and (3) validating the correctness of existing programs. In the field of robotics, a similar trend of rising complexity has emerged, driven in large part by the adoption of deep learning. While this shift has enabled the development of highly performant robot policies, their implementation as deep neural networks has posed challenges to traditional formal analysis, leading to models that are inflexible, fragile, and difficult to interpret. In response, the robotics community has introduced new formal and semi-formal methods to support the precise specification of complex objectives, guide the learning process to achieve them, and enable the verification of learned policies against them.
In this survey, we provide a comprehensive overview of how formal methods are integrated into robot policy learning. We organize our discussion around three key pillars: specification, synthesis, and verification of learned policies. For each, we highlight representative techniques, compare their scalability and expressiveness, and summarize how they contribute to meaningful improvements in the safety and correctness of real-world robot systems. We conclude with a discussion of the remaining obstacles to achieving that goal and promising directions for advancing formal methods in robot learning.
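As a pocket illustration of the verification pillar, a runtime monitor can check a policy rollout against a safety ("always avoid") and a liveness ("eventually reach") property; real work in this space uses temporal logics such as LTL or STL and far richer state spaces, so this is only a schematic example.

```python
# Schematic runtime monitor for two temporal properties over a rollout:
# G(not unsafe) -- the state never enters the unsafe region -- and
# F(goal) -- the goal is eventually reached.
def check_rollout(states, unsafe, goal):
    always_safe = all(not unsafe(s) for s in states)
    eventually_goal = any(goal(s) for s in states)
    return always_safe and eventually_goal

# Toy 1-D example: unsafe region around 5, goal at >= 10.
rollout = [0, 2, 4, 7, 10]
print(check_rollout(rollout,
                    unsafe=lambda s: 4.5 < s < 5.5,
                    goal=lambda s: s >= 10))  # True
```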
URL: https://openreview.net/forum?id=DZkikdg5sl
---