Daily TMLR digest for Aug 28, 2025

TMLR

Aug 28, 2025, 12:06:07 AM

to tmlr-anno...@googlegroups.com

New certifications
==================

Featured Certification: MobileCLIP2: Improving Multi-Modal Reinforced Training

Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari

https://openreview.net/forum?id=WeF9zolng8

---

Reproducibility Certification: [Re] Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

Oliver van Erven, Konstantinos Zafeirakis, Jacobus Smit, Julio Smidi, Luc Buijs

https://openreview.net/forum?id=EWWxSkUchO

---

Accepted papers
===============

Title: MobileCLIP2: Improving Multi-Modal Reinforced Training

Authors: Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari

Abstract: Foundation image-text models such as CLIP with zero-shot capabilities enable a wide array of applications. MobileCLIP is a recent family of image-text models at 3-15ms latency and 50-150M parameters with state-of-the-art zero-shot accuracy. The main ingredients in MobileCLIP were its low-latency and light architectures and a novel multi-modal reinforced training that made knowledge distillation from multiple caption-generators and CLIP teachers efficient, scalable, and reproducible. In this paper, we improve the multi-modal reinforced training of MobileCLIP through: 1) better CLIP teacher ensembles trained on the DFN dataset, 2) improved captioner teachers trained on the DFN dataset and fine-tuned on a diverse selection of high-quality image-caption datasets. We discover new insights through ablations such as the importance of temperature tuning in contrastive knowledge distillation, the effectiveness of caption-generator fine-tuning for caption diversity, and the additive improvement from combining synthetic captions generated by multiple models. We train a new family of models called MobileCLIP2 and achieve state-of-the-art ImageNet-1k zero-shot accuracies at low latencies. In particular, we observe 2.2% improvement in ImageNet-1k accuracy for MobileCLIP2-B compared with MobileCLIP-B architecture. Notably, MobileCLIP2-S4 matches the zero-shot accuracy of SigLIP-SO400M/14 on ImageNet-1k while being 2× smaller and improves on DFN ViT-L/14 at 2.5× lower latency. We release our pretrained models and the data generation code. The data generation code makes it easy to create new reinforced datasets with arbitrary teachers using distributed scalable processing.

URL: https://openreview.net/forum?id=WeF9zolng8

---

Title: [Re] Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents

Authors: Oliver van Erven, Konstantinos Zafeirakis, Jacobus Smit, Julio Smidi, Luc Buijs

Abstract: Large Language Models (LLMs) are increasingly used in strategic decision-making environments, including game-theoretic scenarios where multiple agents interact under predefined rules. One such setting is the common pool resource environment. In this study, we build upon Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents (Piatti et al., 2024), a framework designed to test cooperation strategies among LLM agents. We begin by replicating their results to a large degree to validate the framework, reproducing the original claims regarding model scale in their simulation environment. Then, we extend the analysis to include models that represent the recent reasoning paradigm: Phi-4, DeepSeek-R1, and one of the distilled variants, which show improvements over their baseline counterparts but come at a higher computational cost. Here, we identify a notable trend: specialized models with reasoning-oriented training outperform general-purpose models of similar scale in this environment. Finally, we investigate the impact of different experiments, including the veil of ignorance mechanism and other prompting strategies based on universalization principles with varying levels of abstraction. Our results suggest that older models benefit significantly from explicit boundary conditions, whereas newer models demonstrate greater robustness to implicit constraints.

URL: https://openreview.net/forum?id=EWWxSkUchO

---

Title: Mixture of Balanced Information Bottlenecks for Long-Tailed Visual Recognition

Authors: Yifan Lan, Cai xin, Jun Cheng, Shan Tan

Abstract: Deep neural networks (DNNs) have achieved significant success in various applications with large-scale and balanced data. However, data in real-world visual recognition are usually long-tailed, bringing challenges to efficient training and deployment of DNNs. Information bottleneck (IB) is an elegant approach for representation learning. In this paper, we propose a balanced information bottleneck (BIB) approach, in which loss function re-balancing and self-distillation techniques are integrated into the original IB network. BIB is thus capable of learning a sufficient representation with essential label-related information fully preserved for long-tailed visual recognition. To further enhance the representation learning capability, we also propose a novel structure of mixture of multiple balanced information bottlenecks (MBIB), where different BIBs are responsible for combining knowledge from different network layers. MBIB facilitates an end-to-end learning strategy that trains representation and classification simultaneously from an information theory perspective. We conduct experiments on commonly used long-tailed datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018. Both BIB and MBIB reach state-of-the-art performance for long-tailed visual recognition.

URL: https://openreview.net/forum?id=9eiALSuZGA

---

Title: Can Masked Autoencoders Also Listen to Birds?

Authors: Lukas Rauch, René Heinrich, Ilyass Moummad, Alexis Joly, Bernhard Sick, Christoph Scholz

Abstract: Masked Autoencoders (MAEs) learn rich representations in audio classification through an efficient self-supervised reconstruction task. Yet, general-purpose models struggle in fine-grained audio domains such as bird sound classification, which demands distinguishing subtle inter-species differences under high intra-species variability. We show that bridging this domain gap requires full-pipeline adaptation beyond domain-specific pretraining data. Using BirdSet, a large-scale bioacoustic benchmark, we systematically adapt pretraining, fine-tuning, and frozen feature utilization. Our Bird-MAE sets new state-of-the-art results on BirdSet’s multi-label classification benchmark. Additionally, we introduce the parameter-efficient prototypical probing, which boosts the utility of frozen MAE features by achieving up to 37 mAP points over linear probes and narrowing the gap to fine-tuning in low-resource settings. Bird-MAE also exhibits strong few-shot generalization with prototypical probes on our newly established few-shot benchmark on BirdSet, underscoring the importance of tailored self-supervised learning pipelines for fine-grained audio domains.

URL: https://openreview.net/forum?id=GIBWR0Xo2J

---

Title: Factor Learning Portfolio Optimization Informed by Continuous-Time Finance Models

Authors: Sinong Geng, houssam nassif, Zhaobin Kuang, Anders Max Reppen, K. Ronnie Sircar

Abstract: We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors.

URL: https://openreview.net/forum?id=KLOJUGusVE

---

Title: Private Regression via Data-Dependent Sufficient Statistic Perturbation

Authors: Cecilia Ferrando, Daniel Sheldon

Abstract: Sufficient statistic perturbation (SSP) is a widely used method for differentially private linear regression. SSP adopts a data-independent approach where privacy noise from a simple distribution is added to sufficient statistics. However, sufficient statistics can often be expressed as linear queries and better approximated by data-dependent mechanisms. In this paper we introduce data-dependent SSP for linear regression based on post-processing privately released marginals, and find that it outperforms state-of-the-art data-independent SSP. We extend this result to logistic regression by developing an approximate objective that can be expressed in terms of sufficient statistics, resulting in a novel and highly competitive SSP approach for logistic regression. We also make a connection to synthetic data for machine learning: for models with sufficient statistics, training on synthetic data corresponds to data-dependent SSP, with the overall utility determined by how well the mechanism answers these linear queries.

URL: https://openreview.net/forum?id=gtCfDKm9ME

---

New submissions
===============

Title: RLHF in an SFT Way: From Optimal Solution to Reward-Weighted Alignment

Abstract: Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high implementation complexity and computational cost, specifically for online sampling-based methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO). Even with recent simplifications, such as Direct Preference Optimization (DPO), which designs an offline implicit reward learning objective relying on pre-collected preference datasets, the problems of over-fitting and training instability continue to hinder the alignment process from reaching the expected optimal performance. To address the existing challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called **V**ariational **A**lignment with **R**e-weighting (**VAR**). Specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into an offline reward-driven re-weighted supervised fine-tuning (SFT) form, which only requires minor adjustment on the SFT loss to obtain noticeable improvement on training stability and effectiveness. In comprehensive evaluation benchmarks, our objective empowers LLMs to outperform offline alignments, demonstrating superior performance in both helpfulness and harmlessness metrics (on average $7.16\%$ higher than DPO). Meanwhile, when compared to online sampling methods, our method is comparable or even better while significantly reducing computational overhead and accelerating convergence speed (over $5\times$ faster than GRPO), suggesting our approach as an efficient and effective solution in bridging the gap between efficiency and performance in LLM alignment.

URL: https://openreview.net/forum?id=jewB0UhFuj

---

Title: Beyond Affinity: A Benchmark of 1D, 2D, and 3D Methods Reveals Critical Trade-offs in Structure-Based Drug Design

Abstract: Currently, the field of structure-based drug design is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of fifteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities and poses with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, which is typically neglected. Our evaluation reveals distinct patterns across model categories. 3D structure-based models excel in binding affinities but show inconsistencies in chemical validity and pose quality. 1D models demonstrate reliable performance in standard molecular metrics but rarely achieve optimal binding affinities. 2D models offer balanced performance, maintaining high chemical validity while achieving moderate binding scores. Through detailed analysis across multiple protein targets, we identify key improvement areas for each model category, providing insights for researchers to combine strengths of different approaches while addressing their limitations.

URL: https://openreview.net/forum?id=gaTwx1rzCw

---

Title: Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective

Abstract: Large Language Models (LLMs) are known to exhibit social, demographic, and gender biases, often as a consequence of the data on which they are trained. In this work, we adopt a mechanistic interpretability approach to analyze how such biases are structurally represented within models such as GPT-2 and Llama2. Focusing on demographic and gender biases, we explore different metrics to identify the internal edges responsible for biased behavior. We then assess the stability, localization, and generalizability of these components across dataset and linguistic variations. Through systematic ablations, we demonstrate that bias-related computations are highly localized, often concentrated in a small subset of layers. Moreover, the identified components change across fine-tuning settings, including those unrelated to bias. Finally, we show that removing these components not only reduces biased outputs but also affects other NLP tasks, such as named entity recognition and linguistic acceptability judgment because of the sharing of important components with these tasks.

URL: https://openreview.net/forum?id=EpQ2CBJTjD

---

Title: Multi-Modal Foundation Models for Computational Pathology: A Survey

Abstract: Foundation models have emerged as a powerful paradigm in computational pathology (CPath), enabling scalable and generalizable analysis of histopathological images. While early developments centered on uni-modal models trained solely on visual data, recent advances have highlighted the promise of multi-modal foundation models that integrate heterogeneous data sources such as textual reports, structured domain knowledge, and molecular profiles. In this survey, we provide a comprehensive and up-to-date review of multi-modal foundation models in CPath, with a particular focus on models built upon hematoxylin and eosin (H&E) stained whole slide images (WSIs) and tile-level representations. We categorize 32 state-of-the-art multi-modal foundation models into three major paradigms: vision-language, vision-knowledge graph, and vision-gene expression. We further divide vision-language models into non-LLM-based and LLM-based approaches. Additionally, we analyze 28 available multi-modal datasets tailored for pathology, grouped into image-text pairs, instruction datasets, and image-other modality pairs. Our survey also presents a taxonomy of downstream tasks, highlights training and evaluation strategies, and identifies key challenges and future directions. We aim for this survey to serve as a valuable resource for researchers and practitioners working at the intersection of pathology and AI.

URL: https://openreview.net/forum?id=NZ7GSH92cY

---

Title: State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

Abstract: Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements. In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on successful trajectories generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

URL: https://openreview.net/forum?id=XB1dd01Ozz

---

Title: Diffusion Self-Weighted Guidance for Offline Reinforcement Learning

Abstract: Offline reinforcement learning (RL) recovers the optimal policy $\pi$ given historical observations of an agent. In practice, $\pi$ is modeled as a weighted version of the agent's behavior policy $\mu$, using a weight function $w$ working as a critic of the agent's behavior. Though recent approaches to offline RL based on diffusion models have exhibited promising results, the computation of the required scores is challenging due to their dependence on the unknown $w$. In this work, we alleviate this issue by constructing a diffusion over both the actions and the weights. With the proposed setting, the required scores are directly obtained from the diffusion model without learning extra networks. Our main conceptual contribution is a novel guidance method, where guidance (which is a function of $w$) comes from the same diffusion model, therefore, our proposal is termed Self-Weighted Guidance (SWG). We show that SWG generates samples from the desired distribution on toy examples and performs on par with state-of-the-art methods on D4RL's challenging environments, while maintaining a streamlined training pipeline. We further validate SWG through ablation studies on weight formulations and scalability.

URL: https://openreview.net/forum?id=jmXBnpmznv

---

Title: STLDM: Spatio-Temporal Latent Diffusion Model for Precipitation Nowcasting

Abstract: Precipitation nowcasting is a critical spatio-temporal prediction task for society to prevent severe damage owing to extreme weather events. Despite the advances in this field, the underlying complex and stochastic nature of this task still poses challenges to previous approaches. Specifically, deterministic models produce blurry predictions while generative models suffer from poor accuracy. In this paper, we present a simple yet effective model architecture termed STLDM, which learns the latent representation end to end alongside both the Variational Autoencoder and the conditioning network. Experimental results across multiple radar datasets demonstrate that the proposed STLDM is more effective than the state of the art.

URL: https://openreview.net/forum?id=f4oJwXn3qg

---

Title: ComFe: An Interpretable Head for Vision Transformers

Abstract: Interpretable computer vision models explain their classifications through comparing the distances between the local embeddings of an image and a set of prototypes that represent the training data. However, these approaches introduce additional hyper-parameters that need to be tuned to apply to new datasets, scale poorly, and are more computationally intensive to train in comparison to black-box approaches. In this work, we introduce Component Features (ComFe), a highly scalable interpretable-by-design image classification head for pretrained Vision Transformers (ViTs) that can obtain competitive performance in comparison to comparable non-interpretable methods. To our knowledge, ComFe is the first interpretable head and unlike other interpretable approaches can be readily applied to large-scale datasets such as ImageNet-1K. Additionally, ComFe provides improved robustness and outperforms previous interpretable approaches on key benchmark datasets while using a consistent set of hyperparameters and without finetuning the pretrained ViT backbone. With only global image labels and no segmentation or part annotations, ComFe can identify consistent component features within an image and determine which of these features are informative in making a prediction. Code is available at https://anonymous.4open.science/r/cospress-83E3/README.md.

URL: https://openreview.net/forum?id=cI4wrDYFqE

---

Title: HopCast: Calibration of Autoregressive Dynamics Models

Abstract: Deep learning models are often trained to approximate dynamical systems that can be modeled using differential equations. Many of these models are optimized to predict one step ahead; such approaches produce calibrated one-step predictions if the predictive model can quantify uncertainty, such as Deep Ensembles. At inference time, multi-step predictions are generated via autoregression, which needs a sound uncertainty propagation method to produce calibrated multi-step predictions. This work introduces an alternative Predictor-Corrector approach named HopCast that uses Modern Hopfield Networks (MHN) to learn the errors of a deterministic Predictor that approximates the dynamical system. The Corrector predicts a set of errors for the Predictor's output based on a context state at any timestep during autoregression. The set of errors creates sharper and well-calibrated prediction intervals with higher predictive accuracy compared to baselines without uncertainty propagation. The calibration and prediction performances are evaluated across a set of dynamical systems. This work is also the first to benchmark existing uncertainty propagation methods based on calibration errors.

URL: https://openreview.net/forum?id=wsO6nxvGof

---

Title: RT2I-Bench: Evaluating Robustness of Text-to-Image Systems Against Adversarial Attacks

Abstract: Text-to-Image (T2I) systems have demonstrated impressive abilities in the generation of images from text descriptions. However, these systems remain susceptible to adversarial prompts—carefully crafted input manipulations that can result in misaligned or even toxic outputs. This vulnerability highlights the need for systematic evaluation and development of attack strategies that exploit these weaknesses, as well as defense mechanisms that safeguard T2I models. This work introduces RT2I-Bench, a comprehensive benchmark designed to assess the robustness of T2I systems against adversarial attacks. The benchmark serves two primary purposes. First, it provides a structured evaluation of various adversarial attacks, examining their effectiveness, transferability, stealthiness and potential for generating misaligned or toxic outputs, as well as assessing the resilience of state-of-the-art T2I models to such attacks. We observe that state-of-the-art T2I systems are vulnerable to adversarial prompts, with the most effective attacks achieving success rates of over 60% across the majority of T2I models we tested. Second, RT2I-Bench enables the creation of a set of strong adversarial prompts (consisting of 1,439 that induce misaligned or targeted outputs and 173 that induce toxic outputs), which are effective across a wide range of systems. This dataset offers a valuable resource for robustness testing and defense evaluation. Finally, our benchmark is designed to be extensible, enabling the seamless addition of new attack techniques, T2I models, and evaluation metrics. This flexible framework provides an automated and scalable solution for robustness assessment and adversarial prompt generation in T2I systems.

URL: https://openreview.net/forum?id=ZUiWjEouSf

---

Title: Designing a Conditional Prior Distribution for Flow-Based Generative Models

Abstract: Flow-based generative models have recently shown impressive performance for conditional generation tasks, such as text-to-image generation. However, current methods transform a general unimodal noise distribution to a specific mode of the target data distribution. As such, every point in the initial source distribution can be mapped to every point in the target distribution, resulting in long average paths. To this end, in this work, we tap into a non-utilized property of conditional flow-based models: the ability to design a non-trivial prior distribution. Given an input condition, such as a text prompt, we first map it to a point lying in data space, representing an “average” data point with the minimal average distance to all data points of the same conditional mode (e.g., class). We then utilize the flow matching formulation to map samples from a parametric distribution centered around this point to the conditional target distribution. Experimentally, our method significantly improves training times and generation efficiency (FID, KID and CLIP alignment scores) compared to baselines, producing high quality samples using fewer sampling steps.

URL: https://openreview.net/forum?id=Teh9Bq4giF

---

Title: Towards shutdownable agents via stochastic choice

Abstract: The POST-Agents Proposal (PAP) is an idea for ensuring that advanced artificial agents never resist shutdown. A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be 'USEFUL'), and (2) choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus provide some initial evidence that DReST reward functions could train advanced agents to be USEFUL and NEUTRAL. Our theoretical work suggests that these agents would be useful and shutdownable.

URL: https://openreview.net/forum?id=j5Qv7KdWBn

---

Title: Clus-UCB: A Near-Optimal Algorithm for Clustered Bandits

Abstract: We study a stochastic multi-armed bandit setting where arms are partitioned into known clusters, such that the mean rewards of arms within a cluster differ by at most a known threshold. While the clustering structure is known a priori, the arm means are unknown. We derive an asymptotic lower bound on the regret that improves upon the classical bound of Lai & Robbins (1985). We then propose Clus-UCB, an efficient algorithm that closely matches this lower bound asymptotically. Clus-UCB is designed to exploit the clustering structure and introduces a new index to evaluate an arm, which depends on other arms within the cluster. In this way, arms share information among each other. We present simulation results of our algorithm and compare its performance against KL-UCB and other well known algorithms for bandits with dependent arms. Finally, we address some limitations of this work and conclude by mentioning some possible future research.

URL: https://openreview.net/forum?id=QDMvPO9WJT

---

Title: Model Alignment Search

Abstract: When can we say that two neural systems are the same? What nuances do we miss when we fail to causally probe the representations of the systems? In this work, we introduce a method for connecting neural representational similarity to behavior through causal interventions. The method learns transformations that find an aligned subspace in which behavioral information can be interchanged between multiple distributed networks' representations. We first show that the method can be used to transfer the behavior from one frozen Neural Network (NN) to another in a manner similar to model stitching, and we show how the method can differ from correlative similarity measures like Representational Similarity Analysis. Next, we empirically and theoretically show how the method can be equivalent to model stitching when desired, or it can take a form that has a more restrictive focus to shared causal information; in both forms, it reduces the number of required matrices for a comparison of n models to be linear in n. We then present a case study on number-related tasks showing that the method can be used to examine specific subtypes of causal information, and we present another case study showing that the method can reveal toxicity in fine-tuned DeepSeek-r1-Qwen-1.5B models. Lastly, we show how to augment the loss with a counterfactual latent auxiliary objective to improve causal relevance when one of the two networks is causally inaccessible (as is often the case in comparisons with biological networks). We use our results to encourage the use of causal methods in neural similarity analyses and to suggest future explorations of network similarity methodology for model misalignment.

URL: https://openreview.net/forum?id=I9shNCSmCU

---

Title: Transformers as Implicit State Estimators: In-Context Learning in Dynamical Systems

Abstract: Predicting the behavior of a dynamical system from noisy observations of its past outputs is a classical problem encountered across engineering and science. For linear systems with Gaussian inputs, the Kalman filter -- the best linear minimum mean-square error estimator of the state trajectory -- is optimal in the Bayesian sense. For nonlinear systems, Bayesian filtering is typically approached using suboptimal heuristics such as the Extended Kalman Filter (EKF), or numerical methods such as particle filtering (PF). In this work, we show that transformers, employed in an in-context learning (ICL) setting, can implicitly infer hidden states in order to predict the outputs of a wide family of dynamical systems, without test-time gradient updates or explicit knowledge of the system model. Specifically, when provided with a short context of past input–output pairs and, optionally, system parameters, a frozen transformer accurately predicts the current output. In linear-Gaussian regimes, its predictions closely match those of the Kalman filter; in nonlinear regimes, its performance approaches that of EKF and PF. Moreover, prediction accuracy degrades gracefully when key parameters, such as the state-transition matrix, are withheld from the context, demonstrating robustness and implicit parameter inference. These findings suggest that transformer in-context learning provides a flexible, non-parametric alternative for output prediction in dynamical systems, grounded in implicit latent-state estimation.

URL: https://openreview.net/forum?id=hIMK5MvGkP

---

Title: StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs

Abstract: As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce $\textbf{StructEval}$, a comprehensive benchmark for evaluating LLMs' capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: $\textbf{1)}$ generation tasks, producing structured output from natural language prompts, and $\textbf{2)}$ conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps—even state-of-the-art models like o1-mini achieve only $75.58$ average score, with open-source alternatives lagging approximately $10$ points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.

URL: https://openreview.net/forum?id=buDwV7LUA7

---

Title: Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

Abstract: Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.

URL: https://openreview.net/forum?id=gVRsIh9x7W

---
