Daily TMLR digest for May 30, 2025

TMLR

May 30, 2025, 12:06:08 AM
to tmlr-anno...@googlegroups.com


Accepted papers
===============


Title: ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis

Authors: Bernard Spiegl, Andrea Perin, Stephane Deny, Alexander Ilin

Abstract: Deep learning is providing a wealth of new approaches to the problem of novel view synthesis, from Neural Radiance Field (NeRF) based approaches to end-to-end style architectures. Each approach offers specific strengths but also comes with limitations in its applicability. This work introduces ViewFusion, an end-to-end generative approach to novel view synthesis with unparalleled flexibility. ViewFusion consists of simultaneously applying a diffusion denoising step to any number of input views of a scene, then combining the noise gradients obtained for each view with an (inferred) pixel-weighting mask, ensuring that for each region of the target view only the most informative input views are taken into account. Our approach resolves several limitations of previous approaches by (1) being trainable and generalizing across multiple scenes and object classes, (2) adaptively taking in a variable number of pose-free views at both train and test time, (3) generating plausible views even in severely underdetermined conditions (thanks to its generative nature)---all while generating views of quality on par with or better than that of comparable methods. Limitations include not generating a 3D embedding of the scene, resulting in a relatively slow inference speed, and our method only being tested on the relatively small Neural 3D Mesh Renderer dataset. Code is available.
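The fusion step described above—combining per-view noise gradients with an inferred pixel-weighting mask—can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, array shapes, and the use of a softmax over view logits are all assumptions.

```python
import numpy as np

def fuse_noise_predictions(eps_per_view, weight_logits):
    """Hypothetical sketch: combine per-view noise predictions with a
    per-pixel softmax weighting over the view axis.

    eps_per_view: (V, H, W, C) diffusion noise predictions, one per input view.
    weight_logits: (V, H, W) inferred per-pixel informativeness scores.
    """
    # Softmax over the view axis -> per-pixel weights that sum to 1,
    # so only the most informative views dominate each region.
    w = np.exp(weight_logits - weight_logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # Weighted sum of noise predictions, broadcasting weights over channels.
    return (w[..., None] * eps_per_view).sum(axis=0)
```

Because the weighting is computed per pixel, the number of input views V can vary freely between calls, matching the abstract's claim of handling a variable number of pose-free views.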

URL: https://openreview.net/forum?id=amUisgrmte

---

Title: Learning Actionable Counterfactual Explanations in Large State Spaces

Authors: Keziah Naggita, Matthew Walter, Avrim Blum

Abstract: Recourse generators provide actionable insights, often through feature-based counterfactual explanations (CFEs), to help negatively classified individuals understand how to adjust their input features to achieve a positive classification. These feature-based CFEs, which we refer to as \emph{low-level} CFEs, are overly specific (e.g., coding experience: \(4 \to 5+\) years) and often recommended in a feature space that doesn't straightforwardly align with real-world actions. To bridge this gap, we introduce three novel recourse types grounded in real-world actions: high-level continuous (\emph{hl-continuous}), high-level discrete (\emph{hl-discrete}), and high-level ID (\emph{hl-id}) CFEs.

We formulate single-agent CFE generation methods for hl-discrete and hl-continuous CFEs. For the hl-discrete CFE, we cast the task as a weighted set cover problem that selects the least cost set of hl-discrete actions that satisfy the eligibility of features, and model the hl-continuous CFE as a solution to an integer linear program that identifies the least cost set of hl-continuous actions capable of favorably altering the prediction of a linear classifier. Since these methods require costly optimization per agent, we propose data-driven CFE generation approaches that, given instances of agents and their optimal CFEs, learn a CFE generator that quickly provides optimal CFEs for new agents. This approach, also viewed as one of learning an optimal policy in a family of large but deterministic MDPs, considers several problem formulations, including formulations in which the actions and their effects are unknown, and therefore addresses informational and computational challenges.
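The hl-discrete formulation above is a weighted set cover: pick the least-cost set of high-level actions whose combined effects satisfy the required features. A standard greedy ln(n)-approximation for weighted set cover can be sketched as below; the data layout and function name are illustrative assumptions, not the paper's exact solver.

```python
def greedy_weighted_set_cover(universe, actions):
    """Greedy approximation for weighted set cover.

    universe: set of feature requirements that must be satisfied.
    actions: dict mapping action name -> (cost, set of features it satisfies).
    Returns a list of chosen action names covering `universe`.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the action with the best cost per newly-covered feature.
        name, (cost, feats) = min(
            ((n, a) for n, a in actions.items() if a[1] & uncovered),
            key=lambda item: item[1][0] / len(item[1][1] & uncovered),
        )
        chosen.append(name)
        uncovered -= feats
    return chosen
```

The per-agent cost of solving such problems exactly is what motivates the paper's data-driven generators, which amortize the optimization across agents.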

We conduct extensive empirical evaluations using publicly available healthcare datasets (BRFSS, Foods, and NHANES) and fully-synthetic data. For negatively classified agents identified by linear and threshold-based binary classifiers, we compare the proposed forms of recourse to low-level CFEs, which suggest how the agent can transition from state \(\mathbf{x}\) to a new state \(\mathbf{x}'\) where the model prediction is desirable. We also extensively evaluate the effectiveness of our neural network-based, data-driven CFE generation approaches. Empirical results show that the proposed data-driven CFE generators are accurate and resource-efficient, and the proposed forms of recourse offer various advantages over the low-level CFEs.

URL: https://openreview.net/forum?id=tXnVRpRlR8

---

Title: Learning distributed representations with efficient SoftMax normalization

Authors: Lorenzo Dall'Amico

Abstract: Learning distributed representations, or embeddings, that encode the relational similarity patterns among objects is a relevant task in machine learning. A popular method to learn the embedding matrices $X, Y$ is optimizing a loss function of the term ${\rm SoftMax}(XY^T)$. The complexity required to calculate this term, however, scales quadratically with the problem size, making it a computationally heavy solution. In this article, we propose a linear-time heuristic approximation to compute the normalization constants of ${\rm SoftMax}(XY^T)$ for embedding vectors with bounded norms. We show on some pre-trained embedding datasets that the proposed estimation method achieves accuracy comparable to or higher than that of competing methods. From this result, we design an efficient and task-agnostic algorithm that learns the embeddings by optimizing the cross entropy between the softmax and a set of probability distributions given as inputs. The proposed algorithm is interpretable and easily adapted to arbitrary embedding problems. We consider a few use cases and observe similar or higher performance and lower computation time than similar ``2Vec'' algorithms.
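The quadratic bottleneck is the row-wise normalization constant $Z_i = \sum_j \exp(x_i \cdot y_j)$. The exact computation, and one generic way to estimate it in sub-quadratic time by subsampling columns, can be sketched as follows. Note the paper proposes a specific heuristic exploiting bounded embedding norms; the Monte Carlo estimator below is a stand-in illustration, not the paper's method.

```python
import numpy as np

def exact_normalizers(X, Y):
    # Z_i = sum_j exp(x_i . y_j); costs O(n*m*d) -- quadratic in problem size.
    return np.exp(X @ Y.T).sum(axis=1)

def sampled_normalizers(X, Y, k, rng):
    # Illustrative Monte Carlo estimate from k sampled rows of Y; O(n*k*d).
    # (Hypothetical stand-in for the paper's bounded-norm heuristic.)
    idx = rng.choice(len(Y), size=k, replace=False)
    return np.exp(X @ Y[idx].T).mean(axis=1) * len(Y)
```

With k much smaller than the number of embeddings, the estimate replaces the quadratic pass with a linear one, which is the kind of saving the abstract targets.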

URL: https://openreview.net/forum?id=9M4NKMZOPu

---

Title: Explaining Node Embeddings

Authors: Zohair Shafi, Ayan Chatterjee, Tina Eliassi-Rad

Abstract: Node embedding algorithms produce low-dimensional latent representations of nodes in a graph. These embeddings are often used for downstream tasks, such as node classification and link prediction. In this paper, we investigate the following two questions: (Q1) Can we explain each embedding dimension with human-understandable graph features (e.g., degree, clustering coefficient, and PageRank)? (Q2) How can we modify existing node embedding algorithms to produce embeddings that can be easily explained by human-understandable graph features? We find that the answer to Q1 is yes and introduce a new framework called XM (short for eXplain eMbedding) to answer Q2. A key aspect of XM involves minimizing the nuclear norm of the generated explanations. We show that by minimizing the nuclear norm, we minimize the lower bound on the entropy of the generated explanations. We test XM on a variety of real-world graphs and show that XM not only preserves the performance of existing node embedding methods, but also enhances their explainability.
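The nuclear-norm penalty central to XM is just the sum of singular values of the explanation matrix, which is straightforward to compute via an SVD. A minimal sketch, assuming an explanation matrix E (dimensions x features) and a hypothetical regularization weight `lam`:

```python
import numpy as np

def nuclear_norm(E):
    # Nuclear norm = sum of singular values of the explanation matrix E.
    return np.linalg.svd(E, compute_uv=False).sum()

def regularized_loss(task_loss, E, lam=0.1):
    # Hypothetical combined objective: preserve task performance while
    # penalizing the nuclear norm to encourage low-rank, low-entropy explanations.
    return task_loss + lam * nuclear_norm(E)
```

Minimizing the nuclear norm pushes E toward low rank, which (per the abstract) bounds the entropy of the explanations from below.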

URL: https://openreview.net/forum?id=QQZ8uPxFb3

---


New submissions
===============


Title: $\textit{VIA}$: Unified Spatiotemporal $\underline{Vi}$deo $\underline{A}$daptation for Global and Local Video Editing

Abstract: Video editing serves as a fundamental pillar of digital media, spanning applications in entertainment, education, and professional communication.
However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistent edits in the spatiotemporal dimension, especially for long videos.
In this paper, we introduce $\textit{VIA}$, a unified spatiotemporal $\underline{VI}$deo $\underline{A}$daptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos.
First, to ensure local consistency within individual frames, we designed \emph{test-time editing adaptation} to adapt a pre-trained image editing model for improving consistency between potential editing directions and the text instruction, and adapt masked latent variables for precise local control.
Furthermore, to maintain global consistency over the video sequence, we introduce \emph{spatiotemporal adaptation} that recursively \textbf{gathers} consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects.
Extensive experiments demonstrate that, compared to baseline methods, our $\textit{VIA}$ approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that $\textit{VIA}$ can achieve consistent long video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.

URL: https://openreview.net/forum?id=qny1BqVEZ3

---

Title: DiffCLIP: Differential Attention Meets CLIP

Abstract: We propose DiffCLIP, a novel vision-language model that extends the differential attention mechanism to CLIP architectures. Differential attention was originally developed for large language models to amplify relevant context while canceling out noisy information. In this work, we integrate this mechanism into CLIP's dual encoder (image and text) framework. With minimal additional parameters, DiffCLIP achieves superior performance on image-text understanding tasks. Across zero-shot classification, retrieval, and robustness benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably, these gains come with negligible computational overhead, demonstrating that differential attention can significantly enhance multi-modal representations without sacrificing efficiency.
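Differential attention, as originally proposed for language models, computes two attention maps and subtracts a scaled copy of the second from the first to cancel common-mode noise. A single-head NumPy sketch is below; array shapes, the fixed scalar `lam`, and the absence of multi-head grouping are simplifying assumptions, and how DiffCLIP wires this into CLIP's encoders is not shown.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Two attention maps over the same values; the second, scaled by lam,
    is subtracted to cancel attention assigned to noisy context."""
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V
```

With lam = 0 this reduces to standard scaled dot-product attention, which is why the mechanism adds only minimal parameters (the second set of query/key projections).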

URL: https://openreview.net/forum?id=2I2fTehry2

---

Title: A Proximal Operator for Inducing 2:4-Sparsity

Abstract: Recent hardware advancements in AI accelerators and GPUs allow sparse matrix multiplications to be computed efficiently, especially when 2 out of every 4 consecutive weights are set to zero. However, this so-called 2:4 sparsity usually comes at the cost of decreased model accuracy. We derive a regularizer that exploits the local correlation of features to find better sparsity masks in trained models. We minimize the regularizer jointly with a local squared loss by deriving the proximal operator, for which we show that it has an efficient solution in the 2:4-sparse case. After optimizing the mask, we introduce masked-gradient updates to further minimize the local squared loss. We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters. On models up to 13B we improve over previous state-of-the-art algorithms, whilst on 70B models we match their performance.
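For context on the 2:4 constraint itself, the simplest baseline projection zeroes the two smallest-magnitude weights in every group of four. This magnitude-based mask is the standard baseline the paper's correlation-aware proximal operator improves upon, not the paper's method:

```python
import numpy as np

def mask_2_to_4(W):
    """Baseline 2:4 projection: in every group of 4 consecutive weights
    along the last axis, keep the 2 largest-magnitude entries and zero the rest.
    Assumes the last dimension of W is divisible by 4."""
    flat = W.reshape(-1, 4)
    # Indices of the two largest-magnitude entries per group of four.
    keep = np.argsort(np.abs(flat), axis=1)[:, 2:]
    out = np.zeros_like(flat)
    rows = np.arange(flat.shape[0])[:, None]
    out[rows, keep] = flat[rows, keep]
    return out.reshape(W.shape)
```

Sparse tensor cores can then skip the zeroed half of each group, which is where the hardware speedup comes from; the paper's contribution is choosing *which* two weights to keep using feature correlations rather than raw magnitude.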

URL: https://openreview.net/forum?id=AsFbXRIe4q

---

Title: Efficient Class-Incremental Segmentation Learning via Expanding Visual Transformers

Abstract: Incrementally learning new semantic concepts while retaining existing information is fundamental for several real-world applications. While the behavior of backbones of different sizes and of various architectural choices has been studied to design efficient, limited-size architectures for many non-incremental computer vision applications, only large convolutional and Visual Transformer (ViT) backbones have been explored for class-incremental semantic segmentation, without a fair comparison with respect to model size. In this work, we present a fair study of existing class-incremental semantic segmentation methods, focusing on the models' efficiency with respect to their memory footprint. Moreover, we propose TILES (Transformer-based Incremental Learning for Expanding Segmenter), a novel approach that exploits the efficiency of small ViT backbones to offer an alternative solution where severe memory constraints apply. It expands the architecture with each increment, allowing the model to learn new tasks while retaining old knowledge within a limited memory footprint. In addition, to tackle the background semantic shift, we apply adaptive losses specific to the incremental branches while balancing old and new knowledge. Furthermore, we exploit the confidence of each incremental task to propose an efficient branch-merging strategy. TILES provides state-of-the-art results on challenging benchmarks using up to $14$ times fewer parameters.

URL: https://openreview.net/forum?id=9lBWcsOIcz

---