Accepted papers
===============
Title: Heterogeneous Knowledge for Augmented Modular Reinforcement Learning
Authors: Lorenz Wolf, Mirco Musolesi
Abstract: Existing modular Reinforcement Learning (RL) architectures are generally based on reusable components, also allowing for "plug-and-play" integration. However, these modules are homogeneous in nature: in fact, they essentially provide policies obtained via RL through the maximization of individual reward functions. Consequently, such solutions still lack the ability to integrate and process multiple types of information (i.e., heterogeneous knowledge representations), such as rules, sub-goals, and skills from various sources. In this paper, we discuss several practical examples of heterogeneous knowledge and propose Augmented Modular Reinforcement Learning (AMRL) to address these limitations. Our framework uses a selector to combine heterogeneous modules and seamlessly incorporate different types of knowledge representations and processing mechanisms. Our results demonstrate the gains in performance, efficiency, and generalization that can be achieved by augmenting traditional modular RL with heterogeneous knowledge sources and processing mechanisms. Finally, we examine the safety, robustness, and interpretability issues stemming from the introduction of knowledge heterogeneity.
URL: https://openreview.net/forum?id=eme87YbiND
---
Title: HyResPINNs: A Hybrid Residual Physics-Informed Neural Network Architecture Designed to Balance Expressiveness and Trainability
Authors: Madison Cooley, Mike Kirby, Shandian Zhe, Varun Shankar
Abstract: Physics-informed neural networks (PINNs) have emerged as a powerful approach for solving partial differential equations (PDEs) by training neural networks with loss functions that incorporate physical constraints. In this work, we introduce HyResPINNs, a two-level convex-gated architecture designed to maximize approximation expressiveness for a fixed number of degrees of freedom (DoF). The first level involves a trainable, per-block combination of smooth basis functions with trainable sparsity, and deep neural networks; the second involves the ability to gate entire blocks (much like in ResNets or Highway Nets), allowing for expressivity along the depth dimension of the architecture. Our empirical evaluation on a diverse set of challenging PDE problems demonstrates that HyResPINNs consistently achieve superior accuracy to baseline methods while remaining competitive in training time. These results highlight the potential of HyResPINNs to combine desirable features from traditional scientific computing methods and modern machine learning, paving the way for more robust and expressive approaches to physics-informed modeling.
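A minimal sketch of the two-level convex-gating idea as described (PyTorch; the Gaussian basis, layer sizes, and all names are illustrative assumptions, not the authors' implementation):

    import torch
    import torch.nn as nn

    class ConvexGatedBlock(nn.Module):
        # Level 1: convex combination of a smooth radial-basis expansion
        # and a small MLP. Level 2: a gate on the whole block, much like
        # in ResNets / Highway Nets.
        def __init__(self, dim, n_basis=32, hidden=64):
            super().__init__()
            self.centers = nn.Parameter(torch.randn(n_basis, dim))
            self.basis_out = nn.Linear(n_basis, dim)
            self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, dim))
            self.alpha = nn.Parameter(torch.zeros(1))  # branch gate
            self.beta = nn.Parameter(torch.zeros(1))   # block gate

        def forward(self, x):
            phi = torch.exp(-torch.cdist(x, self.centers) ** 2)  # smooth basis
            a = torch.sigmoid(self.alpha)               # convex weight in [0, 1]
            mixed = a * self.basis_out(phi) + (1 - a) * self.mlp(x)
            return x + torch.sigmoid(self.beta) * mixed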
URL: https://openreview.net/forum?id=et9WkjkqAw
---
Title: Why Settle for Mid: A Probabilistic Viewpoint to Spatial Relationship Alignment in Text-to-image Models
Authors: Parham Rezaei, Arash Marioriyad, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Abstract: Despite the ability of text-to-image (T2I) models to generate high-quality, realistic, and diverse images, they face challenges in compositional generation, often struggling to accurately represent details specified in the input prompt. A prevalent issue in compositional generation is the misalignment of spatial relationships, as models often fail to faithfully generate images that reflect the spatial configurations specified between objects in the input prompts.
To address this challenge, we propose a novel probabilistic framework for modeling the relative spatial positioning of objects in a scene, leveraging the concept of Probability of Superiority (PoS). Building on this insight, we make two key contributions. First, we introduce a novel evaluation metric, PoS-based Evaluation (PSE), designed to assess the alignment of 2D and 3D spatial relationships between text and image, with improved adherence to human judgment. Second, we propose PoS-based Generation (PSG), an inference-time method that improves the alignment of 2D and 3D spatial relationships in T2I models without requiring fine-tuning. PSG employs a PoS-based reward function that can be utilized in two distinct ways: (1) as a gradient-based guidance mechanism applied to the cross-attention maps during the denoising steps, or (2) as a search-based strategy that evaluates a set of initial noise vectors to select the best one. Extensive experiments demonstrate that the PSE metric exhibits stronger alignment with human judgment compared to traditional center-based metrics, providing a more nuanced and reliable measure of complex spatial relationship accuracy in text-image alignment. Furthermore, PSG significantly enhances the ability of text-to-image models to generate images with specified spatial configurations, outperforming state-of-the-art methods across multiple evaluation metrics and benchmarks.
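As a concrete illustration of the Probability-of-Superiority idea (our reading of the abstract, not the authors' code): for a prompt like "A to the left of B", one can sample pixel coordinates from the two objects' segmentation masks and estimate P(x_A < x_B) instead of merely comparing object centers:

    import numpy as np

    def pos_left_of(mask_a, mask_b):
        # mask_a, mask_b: boolean HxW segmentation masks of objects A and B.
        # Returns the estimated probability that a random pixel of A lies
        # to the left of a random pixel of B (subsample for large masks).
        xa = np.nonzero(mask_a)[1].astype(float)  # column indices of A
        xb = np.nonzero(mask_b)[1].astype(float)  # column indices of B
        return (xa[:, None] < xb[None, :]).mean()

A score near 1 indicates the relation holds almost surely; a score near 0.5 is exactly the ambiguous "mid" case the title alludes to.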
URL: https://openreview.net/forum?id=mFlanJKVFD
---
Title: The AI Hippocampus: How Far are We From Human Memory?
Authors: Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, Song-Chun Zhu
Abstract: Memory plays a foundational role in augmenting the reasoning, adaptability, and contextual fidelity of modern Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs). As these models transition from static predictors to interactive systems capable of continual learning and personalized inference, the incorporation of memory mechanisms has emerged as a central theme in their architectural and functional evolution. This survey presents a comprehensive and structured synthesis of memory in LLMs and MLLMs, organizing the literature into a cohesive taxonomy comprising implicit, explicit, and agentic memory paradigms. Specifically, the survey delineates three primary memory frameworks. Implicit memory refers to the knowledge embedded within the internal parameters of pre-trained transformers, encompassing their capacity for memorization, associative retrieval, and contextual reasoning. Recent work has explored methods to interpret, manipulate, and reconfigure this latent memory. Explicit memory involves external storage and retrieval components designed to augment model outputs with dynamic, queryable knowledge representations—such as textual corpora, dense vectors, and graph-based structures—thereby enabling scalable and updatable interaction with information sources. Agentic memory introduces persistent, temporally extended memory structures within autonomous agents, facilitating long-term planning, self-consistency, and collaborative behavior in multi-agent systems, with relevance to embodied and interactive AI. Extending beyond text, the survey examines the integration of memory within multi-modal settings, where coherence across vision, language, audio, and action modalities is essential. Key architectural advances, benchmark tasks, and open challenges are discussed, including issues related to memory capacity, alignment, factual consistency, and cross-system interoperability. By charting the current landscape and identifying critical research directions, this survey aims to inform the development of memory-augmented (M)LLMs that are more flexible, context-sensitive, and aligned with the requirements of real-world intelligent systems. The survey’s website is available at https://github.com/bigai-nlco/LLM-Memory-Survey.
URL: https://openreview.net/forum?id=Sk7pwmLuAY
---
New submissions
===============
Title: Reconciling In-Context and In-Weight Learning via Dual Representation Space Encoding
Abstract: In-context learning (ICL) is a valuable capability exhibited by Transformers pretrained on diverse sequence tasks. However, previous studies have observed that ICL often conflicts with the model’s inherent in-weight learning (IWL) ability. By examining the representation space learned by a toy model in synthetic experiments, we identify the shared encoding space for context and samples in Transformers as a potential source of this conflict. To address this, we modify the model architecture to separately encode the context and samples into two distinct spaces: a task representation space and a sample representation space. We model these two spaces under a simple yet principled framework, assuming a linear representational structure and treating them as a pair of dual spaces. Both theoretical analysis and empirical results demonstrate that our proposed architecture, CoQE, not only enhances ICL performance through improved representation learning, but also successfully reconciles ICL and IWL capabilities across all tested conditions.
URL: https://openreview.net/forum?id=bJK7VIOWAU
---
Title: Retrieval-augmented Adaptive Decoding for Open-ended Question Answering Generation
Abstract: Ensuring truthfulness in large language models (LLMs) remains a critical challenge for reliable text generation. While supervised fine-tuning and reinforcement learning with human feedback have shown promise, they require a substantial amount of annotated data and computational resources, limiting scalability. In contrast, decoding-time interventions offer lightweight alternatives without model retraining. However, existing decoding strategies often face issues like prompt sensitivity, limited generalization, or dependence on internal model states. We propose Retrieval-Augmented Decoding (RAD), a context-aware adaptive decoding method that leverages a compact reference grounding space, built from as few as 10 annotated examples and comprising pairs of context embeddings and next-token logits from truthful responses, to enable retrieval-based logit shaping during inference. At each decoding step, RAD retrieves high-quality, semantically similar contexts from this grounding space and aggregates their associated next-token logits to modify the model's current logits. Across three open-ended question-answering benchmarks and four LLMs, our method consistently outperforms strong baselines and shows robust cross-task generalization, underscoring the promise of context-aware decoding for enhancing factual reliability.
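A minimal sketch of one decoding step of the retrieval-based logit shaping described above (NumPy; the top-k softmax aggregation and the mixing weight lam are our assumptions, as the abstract does not specify the exact rule):

    import numpy as np

    def rad_step(model_logits, ctx_emb, store_embs, store_logits, k=4, lam=0.5):
        # store_embs:   (N, d) context embeddings from the few truthful examples
        # store_logits: (N, V) next-token logits recorded at those contexts
        sims = store_embs @ ctx_emb / (
            np.linalg.norm(store_embs, axis=1) * np.linalg.norm(ctx_emb) + 1e-8)
        top = np.argsort(sims)[-k:]                  # k most similar contexts
        w = np.exp(sims[top]); w /= w.sum()          # similarity-weighted average
        retrieved = (w[:, None] * store_logits[top]).sum(axis=0)
        return (1 - lam) * model_logits + lam * retrieved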
URL: https://openreview.net/forum?id=XVeVhCKkqN
---
Title: When Test-Time Training Fails: A Critical Analysis of Robustness and Hyperparameter Sensitivity
Abstract: Test-time training (TTT) through input perplexity minimization has emerged as a promising approach for enhancing language model performance during inference. However, questions remain about its practical robustness and applicability beyond popular benchmarks. This paper presents a preliminary analysis investigating two critical questions: whether TTT is effective on unseen tasks and how sensitive it is to hyperparameter choices. We evaluate TTT on three anti-memorization datasets—Memo-Trap, GSM-Symbolic, and Math-Perturb—using six models from the Qwen 2.5 and Llama 3 families. Our findings reveal that while TTT shows effectiveness on common benchmarks such as AIME 2024, it struggles with tasks designed to counter memorization, raising questions about whether the gains stem from domain adaptation or data contamination. We identify significant performance differences among optimizers, with SGD outperforming Adam despite slower convergence. Through extensive hyperparameter sweeps over learning rates, training steps, weight decay, momentum, and gradient normalization, we demonstrate that TTT is highly sensitive to these choices, with no universal recipe across tasks and models. Notably, gradient normalization emerges as an effective technique for improving robustness by mitigating catastrophic performance drops and reducing sensitivity to the learning rate. Our analysis also reveals that tuning feed-forward networks can achieve better peak performance than full model tuning, while attention-only tuning provides more stable worst-case performance. These findings highlight the need for continued research into making test-time training more practical and reliable for real-world deployment. Since this work focuses on a specific TTT algorithm, input perplexity minimization, our conclusions may not apply to all TTT algorithms. We call on the community to pay closer attention to TTT's sensitivity to make it better suited for real-world applications.
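For readers unfamiliar with the studied algorithm, a minimal sketch of TTT by input perplexity minimization (PyTorch, Hugging Face-style causal LM interface; hyperparameters are placeholders, and gradient normalization is approximated here by clipping to unit norm):

    import torch

    def ttt_perplexity_min(model, input_ids, steps=3, lr=1e-5):
        # A few SGD steps minimizing the NLL (log-perplexity) of the
        # prompt itself before answering.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            out = model(input_ids, labels=input_ids)
            opt.zero_grad()
            out.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            opt.step()

The paper's hyperparameter sweeps vary exactly the knobs visible here: the optimizer, lr, steps, and whether gradients are normalized.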
URL: https://openreview.net/forum?id=0Eh31N1Hoj
---
Title: GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation
Abstract: Modern deep learning methods typically treat image sequences as large tensors of sequentially stacked frames. However, is this straightforward representation ideal given the current state-of-the-art (SoTA)? In this work, we address this question in the context of generative models and aim to devise a more effective way of modeling image sequence data. Observing the inefficiencies and bottlenecks of current SoTA image sequence generation methods, we showcase that rather than working with large tensors, we can improve the generation process by factorizing it into first generating the coarse sequence at low resolution and then refining the individual frames at high resolution. We train a generative model solely on grid images comprising subsampled frames. Yet, we learn to generate image sequences, using the strong self-attention mechanism of the Diffusion Transformer (DiT) to capture correlations between frames. In effect, our formulation extends a 2D image generator to operate as a 3D image-sequence generator without introducing any architectural modifications. Subsequently, we super-resolve each frame individually to add the sequence-independent high-resolution details. This approach offers several advantages and can overcome key limitations of the SoTA in this domain. Compared to existing image sequence generation models, our method achieves superior synthesis quality and improved coherence across sequences. It also delivers high-fidelity generation of arbitrary-length sequences and increased efficiency in inference time and training data usage. Furthermore, our straightforward formulation enables our method to generalize effectively across diverse data domains, which typically require additional priors and supervision to model in a generative context. Our method consistently delivers superior quality and offers a more than 2x speedup in inference across various datasets. We will make our implementation publicly available.
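The grid factorization can be pictured with a simple packing/unpacking helper (NumPy sketch under the assumption that T subsampled frames tile a rows x cols grid; not the authors' code):

    import numpy as np

    def frames_to_grid(frames, rows, cols):
        # (T, H, W, C) low-res frames -> one (rows*H, cols*W, C) grid image
        t, h, w, c = frames.shape
        assert t == rows * cols
        return (frames.reshape(rows, cols, h, w, c)
                      .transpose(0, 2, 1, 3, 4)
                      .reshape(rows * h, cols * w, c))

    def grid_to_frames(grid, rows, cols):
        # Inverse: split the generated grid back into individual frames,
        # each then super-resolved independently.
        gh, gw, c = grid.shape
        h, w = gh // rows, gw // cols
        return (grid.reshape(rows, h, cols, w, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(rows * cols, h, w, c))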
URL: https://openreview.net/forum?id=QLD47Ou5lp
---
Title: Discovering Symbolic Differential Equations with Symmetry Invariants
Abstract: Discovering symbolic differential equations from data uncovers fundamental dynamical laws underlying complex systems. However, existing methods often struggle with the vast search space of equations and may produce equations that violate known physical laws. In this work, we address these problems by introducing the concept of symmetry invariants in equation discovery. We leverage the fact that differential equations admitting a symmetry group can be expressed in terms of differential invariants of symmetry transformations. Thus, we propose to use these invariants as atomic entities in equation discovery, ensuring the discovered equations satisfy the specified symmetry. Our approach integrates seamlessly with existing equation discovery methods such as sparse regression and genetic programming, improving their accuracy and efficiency. We validate the proposed method through applications to various physical systems, such as Darcy flow and reaction-diffusion, demonstrating its ability to recover parsimonious and interpretable equations that respect the laws of physics.
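The integration point with sparse regression can be sketched as follows (a SINDy-style sequentially thresholded least squares over a library of evaluated invariants; names and thresholds are illustrative assumptions):

    import numpy as np

    def sparse_fit(library, target, names, thresh=0.05, iters=10):
        # library: (n_samples, n_terms) matrix whose columns are symmetry
        # invariants evaluated on data, used as atomic entities instead of
        # raw derivatives; target: (n_samples,) left-hand side.
        coef = np.linalg.lstsq(library, target, rcond=None)[0]
        for _ in range(iters):
            small = np.abs(coef) < thresh
            coef[small] = 0.0
            if (~small).any():
                coef[~small] = np.linalg.lstsq(library[:, ~small], target,
                                               rcond=None)[0]
        return {n: c for n, c in zip(names, coef) if c != 0.0}

Because every library column is itself invariant, any equation assembled from them satisfies the specified symmetry by construction.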
URL: https://openreview.net/forum?id=9t1dEyYfPc
---
Title: Solving Constrained Optimization Problems as ODE-based Models Using Reinforcement Learning
Abstract: Previous learning-to-optimize (L2O) methods on constrained optimization problems often treat neural networks as initializers that generate approximate solutions requiring substantial post-hoc refinements. This approach overlooks a key insight: Solving complex optimization problems often requires iterative refinement of candidate solutions, a process naturally aligned with the Markov Decision Process (MDP) and reinforcement learning (RL) framework. We show that within the MDP framework, RL and Ordinary Differential Equation (ODE)-based generative models (e.g., diffusion, flow matching) are formally equivalent, unifying them as trainable optimizers. Building on our unified perspective, we propose to train a flow-matching model within an RL paradigm as a learnable refinement mechanism, thereby incorporating constraint satisfaction directly into the optimization process. To further enhance feasibility, we introduce a minimal correction step that adjusts solutions to ensure constraint compliance. Empirical results demonstrate that our approach achieves state-of-the-art performance across a range of constrained optimization tasks, yielding improvements in efficiency, solution quality, and feasibility over prior baselines.
URL: https://openreview.net/forum?id=QW0ZX4zRC2
---
Title: OmniCache: Multidimensional Hierarchical Feature Caching for Diffusion Models
Abstract: Recent high-resolution image and video diffusion models, e.g., SD3, FLUX, Sora, have advanced generative intelligence but remain computationally expensive due to quadratic attention and multi-step inference. In this paper, we address the challenge of computational inefficiency in image and video generation by exploiting the inherent redundancy in the processed token content. We identify four primary types of redundancy: intra-frame, inter-frame, motion, and step redundancy. To mitigate these, we propose OmniCache, a novel mechanism that employs multidimensional hierarchical feature caching: a Frame Cache and a Block Cache, combined with a Token Cache across transformer layers. These strategies enable us to compress spatial features in the temporal layers and temporal features in the spatial layers, significantly enhancing generation efficiency without the need for additional training. Moreover, we also study the improvements introduced by the orthogonal layered caching technique with OmniCache. OmniCache is evaluated on state-of-the-art diffusion models for both image and video generation, including SD3, SVD-XT, and Latte. It achieves up to a 35% reduction in inference latency on Stable Diffusion 3 (SD3), 25% on SVD-XT, and 28% on Latte, while maintaining high visual fidelity.
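The step-redundancy part of the idea reduces to a small caching pattern (illustrative sketch, not the authors' implementation; refresh_every is a placeholder schedule):

    def cached_block(cache, key, step, compute, refresh_every=2):
        # Recompute a block's features only every `refresh_every` denoising
        # steps and reuse the cached tensor in between; OmniCache applies
        # the same pattern along frame, block, and token dimensions.
        if key not in cache or step % refresh_every == 0:
            cache[key] = compute()
        return cache[key]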
URL: https://openreview.net/forum?id=5lRaQ4XAwN
---
Title: ODE-Constrained Generative Modeling of Cardiac Dynamics for 12-Lead ECG Synthesis
Abstract: Generating realistic training data for supervised learning remains a significant challenge in artificial intelligence. This is particularly true in the synthesis of electrocardiograms (ECGs), where the objective is to develop a synthetic 12-lead ECG model. The primary challenge in this task lies in accurately modeling the intricate biological and physiological interactions among different ECG leads. Although mathematical process models have shed light on these dynamics, effectively incorporating this understanding into generative models is not straightforward. We introduce MultiODE-GAN, a method that employs ordinary differential equations (ODEs) to enhance the fidelity of 12-lead ECG data generation. This approach integrates cardiac dynamics directly into the generative optimization process via a novel Euler Loss, producing biologically plausible data that respects real-world variability and inter-lead constraints. Empirical analysis on the G12EC and PTB-XL datasets demonstrates that augmenting training data with MultiODE-GAN yields consistent, statistically significant improvements in specificity across multiple cardiac abnormalities. This highlights the value of enforcing physiological coherence in synthetic medical data.
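Our reading of the Euler Loss, as a sketch (PyTorch; f_theta stands for the cardiac-dynamics right-hand side, and the exact discretization details are assumptions):

    import torch

    def euler_loss(x, f_theta, dt):
        # x: (B, T, L) generated multi-lead ECG segments sampled at step dt.
        # Penalize deviation of consecutive samples from one explicit
        # Euler step of the ODE dx/dt = f_theta(x).
        pred_next = x[:, :-1] + dt * f_theta(x[:, :-1])
        return ((x[:, 1:] - pred_next) ** 2).mean()

Added to the usual GAN objective, this term pushes the generator toward trajectories consistent with the cardiac ODE.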
URL: https://openreview.net/forum?id=4N56Pwwsti
---
Title: Adapt via Bayesian Nonparametric Clustering: Fine-Grained Classification for Model Recycling Under Domain and Category Shift
Abstract: Recycling pretrained classification models for new domains, known as Source-Free Domain Adaptation (SFDA), has been extensively studied under the closed-set assumption that source and target domains share identical label spaces. However, this assumption does not hold when unseen classes appear in the target domain. Addressing this category shift is challenging, as unknown target classes usually arise with no prior knowledge of their identities or number, and becomes particularly difficult in the source-free setting, where access to source data is unavailable. Most existing methods treat all unknown classes as a single group during both training and evaluation, limiting their capacity to model the underlying structure within the unknown class space. In this work, we present Adapt via Bayesian Nonparametric Clustering (ABC), a novel framework designed for SFDA scenarios where unknown target classes are present. Unlike prior methods, ABC explicitly achieves fine-grained classification of unknown target classes, offering a more structured view of the problem. Our method first identifies high-confidence target samples likely to belong to known source classes. Using these as guidance, we develop a guided Bayesian nonparametric clustering approach that learns distinct prototypes for both known and unknown classes without requiring the number of unknown classes a priori, and assigns target samples accordingly. We further introduce a training objective that refines the source model by encouraging prototype-based discriminability and local prediction consistency. Experiments show that our method achieves competitive performance on standard benchmarks while simultaneously providing effective clustering of unknown classes.
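The guided nonparametric assignment can be pictured with a DP-means-style rule (a deliberately simplified stand-in for the paper's Bayesian nonparametric clustering; lam and all names are illustrative):

    import numpy as np

    def dp_means_assign(feats, prototypes, lam):
        # feats: target-sample features; prototypes: initial prototypes from
        # high-confidence samples of known source classes. A sample joins
        # its nearest prototype unless the distance exceeds lam, in which
        # case it spawns a new (unknown-class) prototype, so the number of
        # unknown classes need not be given a priori.
        protos = [p.copy() for p in prototypes]
        labels = []
        for f in feats:
            d = [np.linalg.norm(f - p) for p in protos]
            j = int(np.argmin(d))
            if d[j] > lam:
                protos.append(f.copy())
                j = len(protos) - 1
            labels.append(j)
        return labels, protos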
URL: https://openreview.net/forum?id=J5B4yt7C37
---
Title: A Dual-Protection Framework for Copyright Protection and Image Editing Using Multi-Label Conformal Prediction
Abstract: Recent advances in diffusion models have significantly enhanced image editing capabilities, raising serious concerns about copyright protection. Traditional watermarks often fail to withstand diffusion-based edits, making image protection challenging. To address this, we propose a method that embeds an imperceptible perturbation in images, serving as a watermark while simultaneously disrupting the output of latent diffusion models. Our approach employs a Score Estimator trained on select latent embeddings to embed the watermark by minimizing the score function. We then apply conformal inference to compute p-values for watermark detection. To distort the output of latent diffusion models, we shift watermarked image embeddings away from the distribution mean, distorting unauthorized generations. Experiments demonstrate our framework's superior performance in watermark detection, imperceptibility, and robustness against attacks, offering a comprehensive approach to protect images against latent diffusion models.
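The detection side rests on a standard split-conformal p-value, sketched below (the direction of the nonconformity score is our assumption; the paper's Score Estimator supplies the scores):

    import numpy as np

    def conformal_p_value(test_score, calib_scores):
        # calib_scores: nonconformity scores of known watermark-free images.
        # A small p-value means the test image's score is unusually large
        # relative to calibration, flagging the watermark.
        calib = np.asarray(calib_scores)
        return (1 + (calib >= test_score).sum()) / (len(calib) + 1)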
URL: https://openreview.net/forum?id=yiOmppKOdj
---
Title: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Abstract: In standard large vision-language model (LVLM) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, and uses the token probabilities it assigns to weight each token during LVLM training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on these importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data. The code and data will be made public.
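A sketch of the re-weighted NTP loss as we read it (PyTorch; the paper derives its weighting from importance sampling, and 1 - ref_probs below is our simplified stand-in):

    import torch.nn.functional as F

    def prior_loss(lvlm_logits, ref_probs, labels):
        # lvlm_logits: (B, T, V); ref_probs: (B, T) per-token probabilities
        # from the text-only reference LLM; labels: (B, T) caption tokens.
        per_tok = F.cross_entropy(lvlm_logits.transpose(1, 2), labels,
                                  reduction="none")      # (B, T)
        w = 1.0 - ref_probs        # hard to predict without the image => high weight
        w = w / (w.mean() + 1e-8)  # keep the overall loss scale unchanged
        return (w * per_tok).mean()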
URL: https://openreview.net/forum?id=jDcnL1hB1Z
---
Title: Progressive Depth Up-scaling via Optimal Transport
Abstract: Scaling Large Language Models (LLMs) yields performance gains but incurs substantial training costs. Depth up-scaling offers training efficiency by adding new layers to pre-trained models. However, most existing methods copy or average weights from base layers, neglecting neuron permutation differences. This limitation can potentially cause misalignment that harms performance. Inspired by applying Optimal Transport (OT) for neuron alignment, we propose Optimal Transport Depth Up-Scaling (OpT-DeUS). OpT-DeUS aligns and fuses Transformer blocks in adjacent base layers via OT when creating new layers, mitigating neuron permutation mismatch between layers. OpT-DeUS achieves better overall performance and improved training efficiency than existing methods for continual pre-training and supervised fine-tuning across different model sizes. Evaluating the impact of interpolation positions, our extensive analysis further shows that inserting new layers closer to the top yields higher training efficiency, due to shorter back-propagation paths, while obtaining additional performance gains. We also find that strong depth up-scaling performance correlates with high transport-matrix entropy. Code is provided in the supplementary material.
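With uniform marginals and exact OT, the transport plan between two layers' neurons reduces to a permutation, so the alignment step can be sketched with a Hungarian assignment (illustrative only; the transport-entropy observation suggests the paper may use a soft/entropic plan instead):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def align_and_fuse(W1, W2):
        # W1, W2: (n, d) weight matrices of adjacent layers, rows = neurons
        # (equal neuron counts assumed). Match neurons by minimal pairwise
        # distance, then average the aligned weights to seed the new layer.
        cost = np.linalg.norm(W1[:, None, :] - W2[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)
        return 0.5 * (W1[rows] + W2[cols])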
URL: https://openreview.net/forum?id=ybKrBKnxPV
---
Title: RWR-RGCN : A Novel Framework for Fraud Detection via Node Context Aggregation
Abstract: The integrity of online reviews is crucial for businesses, yet widespread review fraud poses significant risks. This paper addresses this challenge by leveraging the power of multi-relational graph convolutional networks (RGCNs) for fraud detection. We introduce RWR-RGCN, a novel framework integrating a multi-layer RGCN architecture with Random Walks with Restart (RWR). RWR plays an essential role in capturing critical connections: it generates node sequences along which node features are aggregated, enhancing the model's understanding of both local and global context within the review graph. To further refine fraud detection, we incorporate Louvain clustering for community identification, flagging high-modularity clusters indicative of coordinated fraudulent activity. Evaluated on the Yelp dataset, RWR-RGCN achieved an AUC of 82.58% and a recall of 94.56%, surpassing state-of-the-art and baseline methods in AUC and recall. These results demonstrate the effectiveness of the proposed framework in detecting fraudulent activity within complex online review networks.
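The RWR component is the classical iteration, sketched here (NumPy; restart probability c and iteration count are placeholders):

    import numpy as np

    def rwr_scores(A, seed, c=0.15, iters=100):
        # A: (n, n) adjacency matrix; seed: index of the query node.
        # Iterate r = (1 - c) * W r + c * e, where W is the column-stochastic
        # transition matrix and e the restart indicator; high-scoring nodes
        # form the sequence/context aggregated around the seed node.
        W = (A / (A.sum(axis=1, keepdims=True) + 1e-12)).T
        e = np.zeros(A.shape[0]); e[seed] = 1.0
        r = e.copy()
        for _ in range(iters):
            r = (1 - c) * (W @ r) + c * e
        return r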
URL: https://openreview.net/forum?id=y3mhzyu1TT
---
Title: Counting Still Counts: Understanding Neural Complex Query Answering Through Query Relaxation
Abstract: Neural methods for Complex Query Answering (CQA) over knowledge graphs (KGs) are widely believed to learn patterns that generalize beyond explicit graph structure, allowing them to infer answers that are unreachable through symbolic query processing.
In this work, we critically examine this assumption through a systematic analysis comparing neural CQA models with an alternative, training-free query relaxation strategy that retrieves possible answers by relaxing query constraints and counting resulting paths. Across multiple datasets and query structures, we find several cases where neural and relaxation-based approaches perform similarly, with no neural model consistently outperforming the latter. Moreover, a similarity analysis reveals that their retrieved answers exhibit little overlap, and that combining their outputs consistently improves performance.
These results call for a re-evaluation of progress in neural query answering: despite their complexity, current models fail to subsume the reasoning patterns captured by query relaxation. Our findings highlight the importance of stronger non-neural baselines and suggest that future neural approaches could benefit from incorporating principles of query relaxation.
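A toy one-hop version of the relaxation baseline, to fix ideas (our simplification; the paper's strategy handles full complex query structures):

    from collections import Counter

    def relaxed_answers(edges, anchors, relation):
        # edges: set of (head, rel, tail) triples; answer (anchor, relation, ?).
        # If the exact relation yields nothing, relax it away and rank
        # candidate answers by how many supporting paths reach them.
        exact = Counter(t for h, r, t in edges if h in anchors and r == relation)
        if exact:
            return exact
        return Counter(t for h, r, t in edges if h in anchors)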
URL: https://openreview.net/forum?id=YVFxB6bkeC
---
Title: Phase Transitions or Continuous Evolution? Methodological Sensitivity in Neural Network Training Dynamics
Abstract: Recent work on neural network training dynamics often identifies transitions or phase changes in weight matrices through rank-based metrics. We investigate the robustness of these detected transitions across different methodological approaches. Analyzing 55 experiments spanning Transformer, CNN, and MLP architectures (30,147 measurement points), we find that transition detection exhibits substantial sensitivity to methodological choices. Varying the detection threshold from 2σ to 100σ changes the total number of detected transitions by an order of magnitude (25,513 to 1,608). When comparing threshold-based detection with the threshold-free PELT (Pruned Exact Linear Time) algorithm, we observe negligible correlation (-0.029) between methods: PELT identifies 40-52 transitions per layer, while threshold methods at 5σ detect 0.00-0.09. Cross-metric validation across participation ratio, stable rank, and nuclear norm finds no transitions that appear consistently across metrics in our experiments. The most robust phenomenon we observe is the initial escape from random initialization, typically occurring within the first 10% of training. Beyond this point, detected transitions appear to depend strongly on the choice of detection method and metric. While architecture-specific patterns emerge within each method, the lack of agreement across methods and metrics raises important questions about the interpretation of phase transitions in neural network training. Our findings suggest that current detection methods cannot reliably identify phase transitions in models at the scales we studied, with training dynamics exhibiting predominantly continuous evolution beyond initialization. We propose practical guidelines for practitioners that embrace continuous monitoring approaches and discuss the implications for understanding neural network optimization. This work highlights the importance of methodological scrutiny when characterizing training dynamics and suggests that multiple perspectives, both continuous and discrete, may be needed to fully understand how neural networks learn.
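The kind of threshold detector whose sensitivity is being measured can be written in a few lines (illustrative; the paper also compares against the threshold-free PELT algorithm):

    import numpy as np

    def detect_transitions(metric, sigma=5.0):
        # metric: per-step values of a rank-based metric (participation
        # ratio, stable rank, nuclear norm, ...). Flag a step as a
        # "transition" when its change exceeds sigma standard deviations
        # of all step-to-step changes; the detected count depends strongly
        # on sigma, which is the paper's central observation.
        d = np.diff(metric)
        z = (d - d.mean()) / (d.std() + 1e-12)
        return np.nonzero(np.abs(z) > sigma)[0] + 1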
URL: https://openreview.net/forum?id=MkZIew531l
---
Title: From Prompts to Perception: Auditing Stereotypes in Multimodal AI
Abstract: Multimodal large language models (MLLMs) and text-to-image (T2I) systems are pervasive, yet how stereotypes propagate across pipelines remains unclear. We present a model-agnostic auditing framework that evaluates joint stereotype formation across T2I and MLLM pipelines using four T2I models and five MLLMs. We use seven nationalities (American, Indian, Iranian, Japanese, Mexican, Nigerian, Russian) along with five gender terms (man, woman, boy, girl, person) to create a set of images, which is then evaluated across different attributes and traits. For evaluation, we also generate a set of images as a neutral baseline, along with distance and radar plots. t-SNE embeddings and distance plots reveal tight nationality clusters and a drift of gender-neutral prompts toward “man”. We further introduce five metrics: TDS and WTD to quantify trait shifts; SDI and OM for label dominance/overlap; and MCS for corruption-induced instability. TDS and WTD show minimal deviation for American and maximal deviation for Nigerian groups, indicating that physical traits can be nationality-specific. Frequency plots and treemaps, along with SDI and OM, indicate an over-reliance on a few words. MCS shows that mild degradations yield 15-45% meaningful label changes and accuracy drops, indicating that noise affects predictions. Our framework offers actionable and reproducible tools for auditing stereotype risk in multimodal AI.
URL: https://openreview.net/forum?id=qjNoOeJVfJ
---
Title: Dynamic Neural Graph Encoding of Inference Processes in Deep Weight Space
Abstract: The rapid advancements in using neural networks as implicit data representations have attracted significant interest in developing machine learning methods that analyze and process the weight spaces of other neural networks. However, efficiently handling these high-dimensional weight spaces remains challenging. Existing methods often overlook the sequential nature of layer-by-layer processing in neural network inference. In this work, we propose a novel approach using dynamic graphs to represent neural network parameters, capturing the temporal dynamics of inference. Our Dynamic Neural Graph Encoder (DNG-Encoder) processes these graphs, preserving the sequential nature of neural processing. Additionally, we leverage the DNG-Encoder to develop INR2JLS, facilitating downstream applications such as classifying INRs. Our approach demonstrates significant improvements across multiple tasks, surpassing the state-of-the-art INR classification accuracy by approximately 10% on CIFAR-100-INR. The source code has been made available in the supplementary materials.
URL: https://openreview.net/forum?id=4fweEyVYLF
---
Title: On Adversarial Attacks In Acoustic Localization
Abstract: Multi-rotor aerial vehicles (drones) are increasingly deployed across diverse domains, where accurate navigation is critical. The limitations of vision-based methods under poor lighting and occlusions have driven growing interest in acoustic sensing as an alternative. However, the security of acoustic-based localization has not been examined. Adversarial attacks pose a serious threat, potentially leading to mission-critical failures and safety risks. While prior research has explored adversarial attacks on vision-based systems, no work has addressed the acoustic setting. In this paper, we present the first comprehensive study of adversarial robustness in acoustic drone localization. We formulate white-box projected gradient descent (PGD) attacks from an external sound source and show their significant impact on localization accuracy. Furthermore, we propose a novel defense algorithm based on rotor phase modulation, capable of effectively recovering clean signals and mitigating adversarial degradation. Our results highlight both the vulnerability of acoustic localization and the potential for robust defense strategies.
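A generic white-box PGD attack of the kind formulated above, as a sketch (PyTorch; the targeted formulation, the loc_model interface, and all hyperparameters are illustrative assumptions):

    import torch

    def pgd_attack(waveform, loc_model, target_pos, eps=0.01, alpha=0.002, steps=40):
        # Perturb the external sound within an L-inf ball of radius eps so
        # that the localizer's predicted position moves toward a wrong
        # target position, degrading localization accuracy.
        delta = torch.zeros_like(waveform, requires_grad=True)
        for _ in range(steps):
            pred = loc_model(waveform + delta)
            loss = ((pred - target_pos) ** 2).sum()
            loss.backward()
            with torch.no_grad():
                delta -= alpha * delta.grad.sign()  # step toward the wrong target
                delta.clamp_(-eps, eps)             # project back into the ball
            delta.grad.zero_()
        return (waveform + delta).detach()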
URL: https://openreview.net/forum?id=Nxm5xXoLFb
---