Accepted papers
===============
Title: Template-Based Probes Are Imperfect Lenses for Counterfactual Bias Evaluation in LLMs
Authors: Farnaz Kohankhaki, D. B. Emerson, Jacob-Junqi Tian, Laleh Seyyed-Kalantari, Faiza Khan Khattak
Abstract: Bias in large language models (LLMs) has many forms, from overt discrimination to implicit stereotypes. Counterfactual bias evaluation is a widely used approach to quantifying bias and often relies on template-based probes that explicitly state group membership. It aims to measure whether the outcome of a task performed by an LLM is invariant to a change in group membership. In this work, we find that template-based probes can introduce systematic distortions in bias measurements. Specifically, we find that such probes suggest that LLMs classify text associated with White race as negative at disproportionately elevated rates. This is observed consistently across a large collection of LLMs, over several diverse template-based probes, and with different classification approaches. We hypothesize that this artifact arises from linguistic asymmetries present in LLM pretraining data, in the form of markedness (e.g., "Black president" vs. "president"), interacting with the templates used for bias measurement (e.g., "Black president" vs. "White president"). These findings highlight the need for more rigorous methodologies in counterfactual bias evaluation, ensuring that observed disparities reflect genuine biases rather than artifacts of linguistic conventions.
URL: https://openreview.net/forum?id=lhWEovKdoU
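The markedness asymmetry the abstract hypothesizes can be made concrete with a minimal sketch (the template and wording below are hypothetical illustrations, not the paper's actual probes):

```python
# Hypothetical counterfactual template probe; only the group term varies.
TEMPLATE = "The {group} president gave a speech."

probe_black = TEMPLATE.format(group="Black")
probe_white = TEMPLATE.format(group="White")

# The unmarked form omits the group term entirely. Pretraining text is
# far more likely to contain this unmarked form than "White president",
# so both marked probes deviate from the linguistic default, but
# asymmetrically -- the distortion the abstract points to.
unmarked = "The president gave a speech."
```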
---
Title: Learning from Online Videos at Inference Time for Computer-Use Agents
Authors: Yujian Liu, Ze Wang, Hao Chen, Ximeng Sun, Xiaodong Yu, Jialian Wu, Jiang Liu, Emad Barsoum, Zicheng Liu, Shiyu Chang
Abstract: Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn from online videos at inference time effectively. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. Particularly, using a VLM, we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At inference time, a two-stage selection mechanism dynamically chooses a single trajectory to add in context at each step, focusing the agent on the most helpful local guidance for its next decision. Experiments on two widely used benchmarks show that our framework consistently outperforms strong base agents and variants that use only textual tutorials or transcripts. Analyses highlight the importance of trajectory segmentation and selection, action filtering, and visual information, suggesting that abundant online videos can be systematically distilled into actionable guidance that improves computer-use agents at inference time.
URL: https://openreview.net/forum?id=YDFQIe6dqI
---
Title: Auditing Predictive Models for Intersectional Biases
Authors: Kate Boxer, Edward McFowland III, Daniel B. Neill
Abstract: Predictive models that satisfy group fairness criteria in aggregate for members of a protected class, but do not guarantee subgroup fairness, could produce biased predictions for individuals at the intersection of two or more protected classes. To address this risk, we propose Conditional Bias Scan (CBS), an auditing framework for detecting intersectional biases in the outputs of classification models that may lead to disparate impact. CBS aims to identify the subgroup with the most significant bias against the protected class, compared to the equivalent subgroup in the non-protected class. The framework can audit for predictive biases using common group fairness definitions that can be represented as conditional independence statements (separation and sufficiency) for both probabilistic and binarized predictions. We show through empirical evaluations that this methodology has substantially higher bias detection power compared to similar methods that audit for subgroup fairness. We then use this approach to detect statistically significant intersectional biases in the predictions of the COMPAS pre-trial risk assessment tool and a model trained on the German Credit data.
URL: https://openreview.net/forum?id=1JTnlHMSmO
---
Title: StFT: Spatio-temporal Fourier Transformer for Long-term Dynamics Prediction
Authors: Da Long, Shandian Zhe, Samuel Williams, Leonid Oliker, Zhe Bai
Abstract: Simulating the long-term dynamics of multi-scale and multi-physics systems poses a significant challenge in understanding complex phenomena across science and engineering. The complexity arises from the intricate interactions between scales and the interplay of diverse physical processes, which manifest in PDEs through coupled, nonlinear terms that govern the evolution of multiple physical fields across scales. Neural operators have shown potential in short-term prediction of such complex spatio-temporal dynamics; however, achieving stable high-fidelity predictions and providing robust uncertainty quantification over extended time horizons remains an open research challenge. These limitations often lead to stability degradation with rapid error accumulation, particularly in long-term forecasting of systems characterized by multi-scale behaviors involving dynamics of different orders. To address these challenges, we propose an autoregressive Spatio-temporal Fourier Transformer (StFT), in which each transformer block is designed to learn the system dynamics at a distinct scale through a dual-path architecture that integrates frequency-domain and spatio-temporal representations. By leveraging a structured hierarchy of StFT blocks, the resulting model explicitly captures the underlying dynamics across both macro- and micro-spatial scales. Furthermore, a generative residual correction mechanism is introduced to learn a probabilistic temporal refinement while simultaneously quantifying prediction uncertainties, enhancing both the accuracy and reliability of long-term probabilistic forecasting. Evaluations conducted on three benchmark datasets (plasma, fluid, and atmospheric dynamics) demonstrate the advantages of our approach over state-of-the-art ML methods.
URL: https://openreview.net/forum?id=o9Cb0ri2oW
---
New submissions
===============
Title: Structured Noise Adaptation for Sequential Bayesian Filtering with Embedded Latent Transfer Operators
Abstract: Kalman filters based on Embedded Latent Transfer Operators (ELTOs) have emerged as powerful statistical tools for sequential state estimation. However, a critical limitation stems from their use of a simplified noise model, which fails to adapt dynamically to non-stationary processes. To address this limitation, we introduce an ELTO-based Bayesian filtering approach with a new structured parameterization for the filter’s noise model. This parameterization enables structured noise adaptation, which couples the data-driven learning of an optimal time-invariant noise model with dynamic parameter adaptation that responds to local changes in dynamics within non-stationary processes. Empirical results show that the proposed noise model improves the filter’s dynamic state estimation performance in noisy, time-varying environments.
URL: https://openreview.net/forum?id=smFAyzvh5r
---
Title: Differentiable Cluster Discovery in Temporal Graphs
Abstract: Existing temporal graph clustering methods suffer from poor optimization dynamics due to their reliance on a heuristically initialized cluster assignment distribution that does not account for the dynamic nature of the evolving graph. The target cluster assignment distribution often conflicts with evolving temporal representations, leading to oscillatory gradients and unstable convergence. Motivated by the need for differentiable and adaptive clustering in dynamic settings, we propose TGRAIL (Temporal Graph Alignment and Index Learning), a novel end-to-end framework for temporal graph clustering based on Gumbel–Softmax sampling. TGRAIL enables discrete cluster assignments while maintaining the gradient flow. To ensure stable training, we formulate the clustering objective as an expectation over Monte Carlo samples and show that this estimator is both unbiased and variance-reduced. Furthermore, we incorporate a temporal consistency loss to preserve the order of interactions across time. Extensive experiments on six real-world temporal graph datasets demonstrate that our approach consistently outperforms state-of-the-art baselines, achieving higher clustering accuracy and robustness. Our results validate the effectiveness of jointly optimizing temporal dynamics and discrete cluster assignments in evolving graphs.
URL: https://openreview.net/forum?id=1caZVb6zL7
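The Gumbel–Softmax mechanism the abstract relies on can be sketched in a few lines (a generic illustration of the sampling step, not TGRAIL's implementation; cluster logits and dimensions are made up):

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Relaxed one-hot sample over clusters via the Gumbel-Softmax trick.

    Adding Gumbel(0, 1) noise to the logits and taking a temperature-scaled
    softmax yields a differentiable approximation of a categorical draw.
    """
    rng = rng or np.random.default_rng(0)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))         # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

# Hypothetical scores of one node against 3 cluster centroids.
logits = np.array([2.0, 0.5, -1.0])
soft = gumbel_softmax(logits, tau=0.5)
# Straight-through variant: discrete assignment in the forward pass,
# gradients flow through the soft sample in the backward pass.
hard = np.eye(3)[soft.argmax()]
```

Lowering `tau` sharpens the soft sample toward one-hot; averaging the objective over several such draws gives the Monte Carlo estimator the abstract describes.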
---
Title: A Survey on Benchmarks of LLM-based GUI Agents
Abstract: LLM-based GUI agents have made rapid progress in understanding visual interfaces, interpreting user intentions, and executing multi-step operations across web, mobile, and desktop environments. As these agents become more capable, systematic and reproducible evaluation has become essential for measuring progress and identifying remaining weaknesses. This survey provides a comprehensive overview of benchmarks for LLM-based GUI agents, covering three major categories: grounding and QA tasks, navigation and multi-step reasoning tasks, and open-world environments that reflect realistic and dynamic software usage. We examine how existing benchmarks evaluate both component-level abilities, such as intent understanding, GUI grounding, navigation, and context tracking, and system-level abilities, such as adaptation, personalization, privacy protection, safety, and computational efficiency. By comparing datasets, environments, and evaluation metrics, the survey reveals clear trends in benchmark design, along with persistent gaps including limited adaptability, vulnerability to malicious interfaces and prompt attacks, lack of interpretability, and significant computational overhead. We highlight emerging directions such as safety-aware evaluation, user-centric personalization, lightweight deployment, and zero-shot generalization. This survey aims to serve as a practical guide for researchers who design GUI agents, build benchmarks, or study LLM-driven user interface automation.
URL: https://openreview.net/forum?id=ri3yPWE21Q
---
Title: DEMIX: Dual-Encoder Latent Masking Framework for Mixed Noise Reduction in Ultrasound Imaging
Abstract: Ultrasound imaging is widely used in noninvasive medical diagnostics due to its efficiency, portability, and avoidance of ionizing radiation. However, its utility is limited by the signal quality. Signal-dependent speckle noise, signal-independent sensor noise, and non-uniform spatial blurring caused by the transducer and modeled by the point spread function (PSF) degrade the image quality. These degradations challenge conventional image restoration methods, which assume simplified noise models, and highlight the need for specialized algorithms capable of effectively reducing the degradations while preserving fine structural details. We propose DEMIX, a novel dual-encoder denoising framework with a masked gated fusion mechanism, for denoising ultrasound images degraded by mixed noise and further degraded by PSF-induced distortions. DEMIX is inspired by diffusion models and is characterized by a forward process and a deterministic reverse process. DEMIX adaptively assesses the different noise components, disentangles them in the latent space, and suppresses these components while compensating for PSF degradations. Extensive experiments on two ultrasound datasets, along with a downstream segmentation task, demonstrate that DEMIX consistently outperforms state-of-the-art baselines, achieving superior noise suppression and preserving structural details. The code will be made available.
URL: https://openreview.net/forum?id=aguvqEbor8
---
Title: Towards Understanding the Transferability of Adversarial Suffixes in Large Language Models
Abstract: Discrete optimization-based jailbreaking attacks on large language models aim to generate short, nonsensical suffixes that, when appended to input prompts, elicit disallowed content. Notably, these suffixes are often transferable, succeeding on prompts and models for which they were never optimized. Yet although transferability is surprising and empirically well-established, the field lacks a rigorous analysis of when and why transfer occurs. To fill this gap, we identify three statistical properties that strongly correlate with transfer success across numerous experimental settings: (1) how much a prompt without a suffix activates a model’s internal refusal direction, (2) how strongly a suffix induces a push away from this direction, and (3) how large these shifts are in directions orthogonal to refusal. On the other hand, we find that prompt semantic similarity only weakly correlates with transfer success. These findings lead to a more fine-grained understanding of transferability, which we use in interventional experiments to showcase how our statistical analysis can translate into practical improvements in attack success.
URL: https://openreview.net/forum?id=wQZmcEZCUK
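The three statistics in the abstract reduce to simple projections onto a refusal direction. A toy sketch on synthetic vectors (the function name and dimensions are hypothetical; a real analysis would use hidden states extracted from an LLM):

```python
import numpy as np

def refusal_stats(h_prompt, h_suffixed, refusal_dir):
    """Toy versions of the abstract's three statistics.

    (1) baseline activation of the refusal direction by the bare prompt,
    (2) push away from that direction induced by the suffix,
    (3) magnitude of the shift orthogonal to the refusal direction.
    """
    r = refusal_dir / np.linalg.norm(refusal_dir)
    base = float(h_prompt @ r)                               # statistic (1)
    shift = h_suffixed - h_prompt
    push = float(-(shift @ r))                               # statistic (2)
    ortho = float(np.linalg.norm(shift - (shift @ r) * r))   # statistic (3)
    return base, push, ortho

# Synthetic example: the suffix moves the representation by exactly 0.8
# units against the refusal direction, with no orthogonal component.
rng = np.random.default_rng(0)
r = rng.normal(size=16)
h_prompt = rng.normal(size=16)
h_suffixed = h_prompt - 0.8 * r / np.linalg.norm(r)
base, push, ortho = refusal_stats(h_prompt, h_suffixed, r)
```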
---