Daily TMLR digest for Aug 11, 2025

TMLR

Aug 11, 2025, 12:06:07 AM
to tmlr-anno...@googlegroups.com

Accepted papers
===============


Title: Semantic Mapping in Indoor Embodied AI - A Survey on Advances, Challenges, and Future Directions

Authors: Sonia Raychaudhuri, Angel X Chang

Abstract: Intelligent embodied agents (e.g. robots) need to perform complex semantic tasks in unfamiliar environments. Among the many skills that the agents need to possess, building and maintaining a semantic map of the environment is among the most crucial for long-horizon tasks. A semantic map captures information about the environment in a structured way, allowing the agent to reference it for advanced reasoning throughout the task. While existing surveys in embodied AI focus on general advancements or specific tasks like navigation and manipulation, this paper provides a comprehensive review of semantic map-building approaches in embodied AI, specifically for indoor navigation. We categorize these approaches based on their structural representation (spatial grids, topological graphs, dense point clouds, or hybrid maps) and the type of information they encode (implicit features or explicit environmental data). We also explore the strengths and limitations of the map-building techniques, highlight current challenges, and propose future research directions. We identify that the field is moving towards developing open-vocabulary, queryable, task-agnostic map representations, while high memory demands and computational inefficiency remain open challenges. This survey aims to guide current and future researchers in advancing semantic mapping techniques for embodied AI systems.

URL: https://openreview.net/forum?id=USgQ38RG6G
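
As a rough illustration of one representation category the survey covers (a spatial grid with explicit environmental data), a minimal semantic grid map could look like the sketch below; the class name, fields, and update rule are illustrative, not taken from the paper.

    # Illustrative sketch: a "spatial grid + explicit semantics" map, one of the
    # representation categories the survey distinguishes. Fields are assumptions.
    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class SemanticGridMap:
        resolution: float                            # meters per cell
        size: tuple                                  # (rows, cols)
        occupancy: np.ndarray = field(init=False)    # 0 = free, 1 = occupied
        labels: np.ndarray = field(init=False)       # per-cell semantic class id

        def __post_init__(self):
            self.occupancy = np.zeros(self.size, dtype=np.uint8)
            self.labels = np.full(self.size, -1, dtype=np.int32)

        def update(self, x, y, occupied, class_id=-1):
            i, j = int(y / self.resolution), int(x / self.resolution)
            self.occupancy[i, j] = int(occupied)
            if class_id >= 0:
                self.labels[i, j] = class_id

    m = SemanticGridMap(resolution=0.05, size=(200, 200))
    m.update(1.2, 3.4, occupied=True, class_id=7)    # e.g. class 7 = "chair"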

---

Title: Joint Generative Modeling of Grounded Scene Graphs and Images via Diffusion Models

Authors: Bicheng Xu, Qi Yan, Renjie Liao, Lele Wang, Leonid Sigal

Abstract: A grounded scene graph represents a visual scene as a graph, where nodes denote objects (including labels and spatial locations) and directed edges encode relations among them. In this paper, we introduce a novel framework for joint grounded scene graph and image generation, a challenging task involving high-dimensional, multi-modal structured data. To effectively model this complex joint distribution, we adopt a factorized approach: first generating a grounded scene graph, followed by image generation conditioned on the generated grounded scene graph. While conditional image generation has been widely explored in the literature, our primary focus is on the generation of grounded scene graphs from noise, which provides efficient and interpretable control over the image generation process. This task requires generating plausible grounded scene graphs with heterogeneous attributes for both nodes (objects) and edges (relations among objects), encompassing continuous attributes (e.g., object bounding boxes) and discrete attributes (e.g., object and relation categories). To address this challenge, we introduce DiffuseSG, a novel diffusion model that jointly models the heterogeneous node and edge attributes. We explore different encoding strategies to effectively handle the categorical data. Leveraging a graph transformer as the denoiser, DiffuseSG progressively refines grounded scene graph representations in a continuous space before discretizing them to generate structured outputs. Additionally, we introduce an IoU-based regularization term to enhance empirical performance. Our model outperforms existing methods in grounded scene graph generation on the Visual Genome and COCO-Stuff datasets, excelling in both standard and newly introduced metrics that more accurately capture the task’s complexity. Furthermore, we demonstrate the broader applicability of DiffuseSG in two important downstream tasks: (1) achieving superior results in a range of grounded scene graph completion tasks, and (2) enhancing grounded scene graph detection models by leveraging additional training samples generated by DiffuseSG. Code is available at https://github.com/ubc-vision/DiffuseSG.

URL: https://openreview.net/forum?id=2cxxZI2LOL
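
The IoU-based regularization term mentioned above can be pictured as a penalty on low overlap between denoised and ground-truth object boxes. A minimal sketch, assuming axis-aligned (x1, y1, x2, y2) boxes; the exact form used in DiffuseSG may differ.

    # Hedged sketch of an IoU-style box regularizer; not the paper's exact term.
    import torch

    def iou_loss(pred_boxes, gt_boxes, eps=1e-6):
        """pred_boxes, gt_boxes: (N, 4) tensors in (x1, y1, x2, y2) format."""
        x1 = torch.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
        y1 = torch.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
        x2 = torch.minimum(pred_boxes[:, 2], gt_boxes[:, 2])
        y2 = torch.minimum(pred_boxes[:, 3], gt_boxes[:, 3])
        inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
        area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
        area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
        union = area_p + area_g - inter
        return (1.0 - inter / (union + eps)).mean()   # penalize low overlap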

---


New submissions
===============


Title: Efficient Few-Shot Continual Learning in Vision-Language Models

Abstract: Vision-language models (VLMs) excel at tasks like visual question answering and image captioning, but their reliance on frozen, pretrained image encoders like CLIP often leads to persistent vision errors that degrade downstream performance. Moreover, real-world deployment demands that VLMs continually adapt to new, scarce data in a few-shot setting without forgetting prior knowledge. To meet these challenges, we introduce LoRSU (Low-Rank Adaptation with Structured Updates), a lightweight and robust technique for few-shot continual learning of VLMs’ image encoders. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. In experiments on VQA benchmarks under a few-shot continual learning protocol, LoRSU demonstrates superior scalability, efficiency, and accuracy, offering a practical solution for dynamic, resource-constrained vision-language applications.

URL: https://openreview.net/forum?id=sQ1w92WW0V
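
The abstract does not spell out the update rule, but its name suggests two ingredients: low-rank adapters and a structured choice of which modules to update. A minimal sketch under that assumption, with a gradient-norm selection criterion that is purely illustrative:

    # Hedged sketch: LoRA-style adapter plus structured selection of which modules
    # stay trainable (scored here by gradient norm). Names and the selection rule
    # are assumptions, not the paper's actual criterion.
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            self.base.weight.requires_grad_(False)            # frozen pretrained weight
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

    def select_critical_modules(model, loss, k):
        """Keep adapters trainable only in the k modules with the largest gradient norm."""
        loss.backward()
        scores = {name: m.A.grad.norm() + m.B.grad.norm()
                  for name, m in model.named_modules() if isinstance(m, LoRALinear)}
        keep = set(sorted(scores, key=scores.get, reverse=True)[:k])
        for name, m in model.named_modules():
            if isinstance(m, LoRALinear):
                m.A.requires_grad_(name in keep)
                m.B.requires_grad_(name in keep)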

---

Title: Tex4D: Zero-shot 4D Character Texturing with Video Diffusion Models

Abstract: 3D meshes are widely used in movies, games, AR, and VR for their efficiency in animation and minimal memory footprint, leading to the creation of a large number of mesh sequences. However, creating dynamic textures for these mesh sequences to model the appearance transformations remains labor-intensive for professional artists. In this work, we present Tex4D, a zero-shot approach that creates multi-view and temporally consistent dynamic mesh textures by integrating the inherent 3D geometry knowledge with the expressiveness of video diffusion models. Given an untextured mesh sequence and a text prompt as inputs, our method enhances multi-view consistency by synchronizing the diffusion process across different views through latent aggregation in the UV space. To ensure temporal consistency across effects such as lighting changes, wrinkles, and appearance transformations, we leverage prior knowledge from a conditional video generation model for texture synthesis. However, naively combining the video diffusion model with UV texture aggregation leads to blurred results. We analyze the underlying causes and propose a simple yet effective modification to the DDIM sampling process to address this issue. Additionally, we introduce a reference latent texture to strengthen the correlation between frames during the denoising process. To the best of our knowledge, Tex4D is the first method specifically designed for 4D character texturing. Extensive experiments demonstrate its superiority in producing multi-view and multi-frame consistent dynamic textures for mesh sequences.

URL: https://openreview.net/forum?id=brKF0ta0Bg
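
The "latent aggregation in the UV space" step can be pictured as splatting each view's denoised latent onto a shared UV texture and averaging. A minimal sketch, assuming the rasterizer already provides a pixel-to-texel index map (uv_idx); the paper's actual aggregation may be more elaborate.

    # Hedged sketch of UV-space latent aggregation across views; uv_idx is assumed
    # to come from rasterizing the mesh in each view.
    import torch

    def aggregate_latents_in_uv(latents, uv_idx, num_texels):
        """
        latents: (V, C, H, W) per-view denoised latents
        uv_idx:  (V, H, W) long tensor, texel index per pixel (-1 = no hit)
        returns: (num_texels, C) view-averaged UV-space latent texture
        """
        V, C, H, W = latents.shape
        flat = latents.permute(0, 2, 3, 1).reshape(-1, C)      # (V*H*W, C)
        idx = uv_idx.reshape(-1)
        valid = idx >= 0
        tex = torch.zeros(num_texels, C)
        cnt = torch.zeros(num_texels, 1)
        tex.index_add_(0, idx[valid], flat[valid])
        cnt.index_add_(0, idx[valid], torch.ones(int(valid.sum()), 1))
        return tex / cnt.clamp(min=1)                           # average across views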

---

Title: Model-Free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach

Abstract: We study a model-free federated linear quadratic regulator (LQR) problem where M agents with unknown, distinct yet similar dynamics collaboratively learn an optimal policy to minimize an average quadratic cost while keeping their data private. To exploit the similarity of the agents' dynamics, we propose to use federated learning (FL) to allow the agents to periodically communicate with a central server to train policies by leveraging a larger dataset from all the agents. With this setup, we seek to understand the following questions: (i) Is the learned common policy stabilizing for all agents? (ii) How close is the learned common policy to each agent's own optimal policy? (iii) Can each agent learn its own optimal policy faster by leveraging data from all agents? To answer these questions, we propose the federated and model-free algorithm FedLQR. Our analysis overcomes numerous technical challenges, such as heterogeneity in the agents’ dynamics, multiple local updates, and stability concerns. We show that FedLQR produces a common policy that, at each iteration, is stabilizing for all agents. Moreover, we prove that when learning each agent's optimal policy, FedLQR achieves a sample complexity reduction proportional to the number of agents M in a low-heterogeneity regime, compared to the single-agent setting.

URL: https://openreview.net/forum?id=WSRQeCUc3g
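
The setup described above (model-free local updates plus periodic server averaging) can be sketched as follows; the zeroth-order gradient estimator, step sizes, and number of local steps are placeholders rather than the paper's exact algorithm.

    # Hedged sketch of a FedLQR-style round: each agent forms a model-free
    # (zeroth-order) gradient estimate of its own LQR cost, and the server
    # averages the locally updated gains into a common policy K.
    import numpy as np

    def zeroth_order_grad(cost_fn, K, r=0.05, n_samples=20):
        """Two-point zeroth-order estimate of grad_K cost_fn(K)."""
        d = K.size
        g = np.zeros_like(K)
        for _ in range(n_samples):
            U = np.random.randn(*K.shape)
            U /= np.linalg.norm(U)
            g += (cost_fn(K + r * U) - cost_fn(K - r * U)) / (2 * r) * U
        return (d / n_samples) * g

    def fedlqr_round(agent_cost_fns, K, lr=1e-3, local_steps=5):
        """One communication round: local updates per agent, then server averaging."""
        local_Ks = []
        for cost_fn in agent_cost_fns:            # each agent's private cost
            K_i = K.copy()
            for _ in range(local_steps):
                K_i -= lr * zeroth_order_grad(cost_fn, K_i)
            local_Ks.append(K_i)
        return np.mean(local_Ks, axis=0)          # aggregated common policy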

---

Title: MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Abstract: As the era of large language models (LLMs) acting on behalf of users unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a framework for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. While existing methods such as Direct Preference Optimization (DPO) and its variants treat preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO extends this paradigm by integrating prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. More importantly, MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. In addition, MaPPO can be used as a plugin with consistent improvements on DPO variants, including the widely used SimPO, IPO, and CPO. Extensive empirical evaluations across different model sizes and model series on three standard benchmarks, including MT-Bench, AlpacaEval 2.0, and Arena-Hard, demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.

URL: https://openreview.net/forum?id=NKoV9wEiox
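
The abstract does not give the objective, but the contrast it draws (MLE in DPO vs. a MAP objective with prior reward estimates) can be pictured schematically as below. Note the abstract says MaPPO adds no extra hyperparameter, so the explicit weight lam here is an artifact of this illustration only.

    # Schematic only: one way prior reward estimates could turn a DPO-style MLE
    # loss into a MAP-style one. Not the actual MaPPO objective.
    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """Standard DPO (MLE) loss on chosen (w) vs. rejected (l) responses."""
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin).mean()

    def map_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       prior_rw, prior_rl, beta=0.1, lam=1.0):
        """DPO term plus a prior term anchoring implicit rewards to prior estimates."""
        imp_rw = beta * (logp_w - ref_logp_w)       # implicit reward, chosen
        imp_rl = beta * (logp_l - ref_logp_l)       # implicit reward, rejected
        nll = dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
        prior = ((imp_rw - prior_rw) ** 2 + (imp_rl - prior_rl) ** 2).mean()
        return nll + lam * prior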

---

Title: Rethinking Industrial Anomaly Detection in the Era of Large Vision-Language Models

Abstract: State-of-the-art methods for industrial anomaly detection (IAD) typically rely on a training set of images to define normal conditions, flagging any deviations as anomalies. Obtaining this training set has two main issues: it is time-consuming to obtain an extensive labeled set, and the assumption that all patterns outside the training set are truly anomalous is often unrealistic. Many rare patterns not captured in the training set, such as environmental changes, positional changes, or permissible deformation, may not constitute actual industrial defects. In this paper, we reframe the IAD task by using large vision-language models (LVLMs) without fine-tuning on training images. LVLMs can interpret and generalize from a single reference image, and can be more robust to rare but acceptable changes in images. Our experiments on two popular benchmarks, MvTec-AD and VisA, show that LVLMs with just one image and a textual description are competitive with state-of-the-art models and offer a more robust and generalizable solution even with variations in testing images. We also identify a key limitation: LVLM performance degrades when detecting small anomalies. Despite this, our findings highlight the potential of LVLMs as a flexible and scalable foundation for industrial anomaly detection, opening new directions for future research.

URL: https://openreview.net/forum?id=EyhzBqeRNb
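
The one-reference-image setup described above is essentially a prompting recipe. A minimal sketch, where query_lvlm is a hypothetical stand-in for whichever LVLM API is used; the actual prompts and models in the paper may differ.

    # Hedged sketch: one defect-free reference image plus a textual description,
    # compared against a test image by an LVLM. query_lvlm is hypothetical.
    def detect_anomaly(query_lvlm, reference_image, test_image, part_description):
        prompt = (
            f"The first image shows a defect-free {part_description}. "
            "Compare the second image against it and state whether the second "
            "image contains a defect, and if so, describe where it is."
        )
        return query_lvlm(images=[reference_image, test_image], text=prompt)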

---

Title: Adaptive Mesh Quantization for Neural PDE Solvers

Abstract: Physical systems commonly exhibit spatially varying complexity, presenting a significant challenge for neural PDE solvers. In traditional numerical methods, adaptive mesh refinement addresses this challenge by increasing node density in dynamic regions, thereby allocating more computational resources where needed. However, for graph neural operators, this is not always a feasible or optimal strategy. We therefore introduce a novel approach to this issue: rather than modifying grid resolution, we maintain a fixed mesh while dynamically adjusting the bit-width used by a quantized model. We propose an adaptive bit-width allocation strategy driven by a lightweight auxiliary model that identifies high-loss regions in the input mesh. This enables dynamic resource distribution in the main model, where regions of higher difficulty are allocated increased bit-width, optimizing computational resource utilization. We demonstrate our framework's effectiveness by integrating it with two state-of-the-art models, MP-PDE and GraphViT, to evaluate performance across multiple tasks: 2D Darcy flow, large-scale unsteady fluid dynamics in 2D, steady-state Navier–Stokes simulations in 3D, and a 2D hyper-elasticity problem.
Our framework demonstrates consistent Pareto improvements over uniformly quantized baselines, yielding up to 50% improvements in performance at the same cost.

URL: https://openreview.net/forum?id=NN17y897WG
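
The allocation strategy described above (an auxiliary model scores per-node difficulty, and harder regions get more bits) can be sketched as below; the bit levels, quantile thresholds, and uniform fake-quantizer are assumptions, not the paper's exact scheme.

    # Hedged sketch of per-node adaptive bit-width allocation for a quantized
    # graph neural operator.
    import torch

    def allocate_bitwidths(difficulty, levels=(4, 8, 16)):
        """Map per-node difficulty scores (N,) to bit-widths via quantiles."""
        qs = torch.quantile(difficulty, torch.tensor([1 / 3, 2 / 3]))
        bits = torch.full_like(difficulty, levels[0], dtype=torch.long)
        bits[difficulty > qs[0]] = levels[1]
        bits[difficulty > qs[1]] = levels[2]
        return bits

    def fake_quantize(x, bits):
        """Uniform per-node fake quantization of node features x: (N, C)."""
        scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
        qmax = (2 ** (bits.float() - 1) - 1).unsqueeze(1)
        return torch.round(x / scale * qmax) / qmax * scale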

---

Title: Learning to Prompt Your Domain for Federated Vision-Language Models

Abstract: The prompt tuning paradigm, with its great advantages of low parameter count and stable training, has recently inspired numerous applications of CLIP-like vision-language models in federated learning. However, in this work, we posit that under significant domain gaps across federated participants, prompt-based CLIP may easily collapse to non-optimal solutions due to the neglect of domain-aware knowledge. We present a novel prompt tuning method, termed ADAPT, to address this issue by learning both intra- and inter-domain prompts. Specifically, we assign each federated participant a domain-specific prompt and use the image's visual features as a condition to guide the generation of language features, with the underlying idea that the prompted CLIP should detect the input image's domain correspondence before making the prediction of its category. Extensive experiments demonstrate ADAPT's significant efficiency and effectiveness in federated learning. For example, by learning and sharing only 2.1M parameters, ADAPT attains a 69.8% average accuracy over the six domains of DomainNet, which improves the original CLIP accuracy by 16.2%.

URL: https://openreview.net/forum?id=OS7zPOZjr3
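
The core idea described above (a per-participant domain prompt whose effect is conditioned on the image's visual features) can be sketched as a small prompt learner; the dimensions and the conditioning form are assumptions, and ADAPT's actual architecture may differ.

    # Hedged sketch: learnable domain-specific context vectors plus an
    # image-conditioned shift, to be prepended to CLIP's text tokens.
    import torch
    import torch.nn as nn

    class DomainPromptLearner(nn.Module):
        def __init__(self, n_ctx=16, ctx_dim=512, vis_dim=512):
            super().__init__()
            self.domain_ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
            self.cond = nn.Linear(vis_dim, ctx_dim)      # image-conditioned shift

        def forward(self, image_features):               # (B, vis_dim) CLIP features
            shift = self.cond(image_features)            # (B, ctx_dim)
            return self.domain_ctx.unsqueeze(0) + shift.unsqueeze(1)   # (B, n_ctx, ctx_dim)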

---
