Accepted papers
===============
Title: Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models
Authors: Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang YUAN, Feng Liu
Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP).
However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective.
In this paper, we tackle this issue from two perspectives: view refinement and description refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity among the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity of removing redundant information in text-visual alignment.
URL: https://openreview.net/forum?id=ZmbkzZnHO4
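The two refinement steps are concrete enough to sketch. Below is a minimal Python illustration, not the authors' code: the IoU threshold (0.5), the cosine-similarity threshold (0.9), and the greedy keep-first policy are all assumptions, one plausible reading of "removes redundant patches/descriptions with high IoU / cosine similarity".

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def refine_views(patch_boxes, iou_thresh=0.5):
    """View refinement: greedily drop patches that overlap a kept patch too much."""
    kept = []
    for box in patch_boxes:
        if all(iou(box, k) < iou_thresh for k in kept):
            kept.append(box)
    return kept

def refine_descriptions(text_embs, sim_thresh=0.9):
    """Description refinement: drop a description whose embedding is too
    similar (cosine) to one that was already kept."""
    normed = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    kept = []
    for i in range(len(normed)):
        if all(normed[i] @ normed[j] < sim_thresh for j in kept):
            kept.append(i)
    return kept
```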
---
Title: HypCBC: Domain-Invariant Hyperbolic Cross-Branch Consistency for Generalizable Medical Image Analysis
Authors: Francesco Di Salvo, Sebastian Doerrich, Jonas Alle, Christian Ledig
Abstract: Robust generalization beyond training distributions remains a critical challenge for deep neural networks. This is especially pronounced in medical image analysis, where data is often scarce and covariate shifts arise from different hardware devices, imaging protocols, and heterogeneous patient populations. These factors collectively hinder reliable performance and slow down clinical adoption. Despite recent progress, existing learning paradigms primarily rely on the Euclidean manifold, whose flat geometry fails to capture the complex, hierarchical structures present in clinical data. In this work, we exploit the advantages of hyperbolic manifolds to model complex data characteristics. We present the first comprehensive validation of hyperbolic representation learning for medical image analysis and demonstrate statistically significant gains across eleven in-distribution datasets and three ViT models. We further propose an unsupervised, domain-invariant hyperbolic cross-branch consistency constraint. Extensive experiments confirm that our proposed method promotes domain-invariant features and outperforms state-of-the-art Euclidean methods by an average of +2.1% AUC on three domain generalization benchmarks: Fitzpatrick17k, Camelyon17-WILDS, and a cross-dataset setup for retinal imaging. These datasets span different imaging modalities, data sizes, and label granularities, confirming generalization capabilities across substantially different conditions.
URL: https://openreview.net/forum?id=1spGpYmDjy
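The abstract does not spell out the consistency constraint, but a generic version is straightforward to sketch on the Poincaré ball. The code below is an assumption-laden illustration, not the paper's formulation: it lifts two branches' Euclidean features onto the ball via the exponential map at the origin and penalizes their geodesic distance, with an assumed curvature c = 1.

```python
import torch

def mobius_add(x, y, c=1.0):
    """Möbius addition on the Poincaré ball with curvature -c."""
    xy = (x * y).sum(-1, keepdim=True)
    x2 = (x * x).sum(-1, keepdim=True)
    y2 = (y * y).sum(-1, keepdim=True)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den.clamp_min(1e-9)

def expmap0(v, c=1.0):
    """Exponential map at the origin: lifts Euclidean features onto the ball."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(1e-9)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def hyperbolic_distance(x, y, c=1.0):
    """Geodesic distance between two points on the Poincaré ball (c=1 assumed)."""
    diff_norm = mobius_add(-x, y, c).norm(dim=-1).clamp(max=1 - 1e-5)
    return (2 / c ** 0.5) * torch.atanh(c ** 0.5 * diff_norm)

def cross_branch_consistency(feats_a, feats_b, c=1.0):
    """Consistency loss: pull the two branches' hyperbolic embeddings together."""
    return hyperbolic_distance(expmap0(feats_a, c), expmap0(feats_b, c), c).mean()
```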
---
Title: TextOCVP: Object-Centric Video Prediction with Language Guidance
Authors: Angel Villar-Corrales, Gjergj Plepi, Sven Behnke
Abstract: Understanding and forecasting future scene states is critical for autonomous agents to plan and act effectively in complex environments. Object-centric models, with structured latent spaces, have shown promise in modeling object dynamics and interactions to predict future scene states, but they often struggle to scale beyond simple synthetic datasets and to integrate external guidance, limiting their applicability in robotic environments. To address these limitations, we propose TextOCVP, an object-centric model for video prediction guided by textual descriptions. TextOCVP parses an observed scene into object representations, called slots, and utilizes a text-conditioned transformer predictor to forecast future object states and video frames. Our approach jointly models object dynamics and interactions while incorporating textual guidance, enabling accurate and controllable predictions. TextOCVP's structured latent space offers more precise control over the forecasting process, outperforming several video prediction baselines on two datasets. Additionally, we show that structured object-centric representations provide superior robustness to novel scene configurations, as well as improved controllability and interpretability, enabling more precise and understandable predictions.
URL: https://openreview.net/forum?id=7JEgXCyQgX
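A minimal sketch of a text-conditioned transformer predictor over slots, in the spirit of the abstract. Dimensions, layer counts, and the use of nn.TransformerDecoder are illustrative assumptions; the actual TextOCVP architecture is more elaborate.

```python
import torch
import torch.nn as nn

class TextConditionedSlotPredictor(nn.Module):
    """Predicts next-step object slots, cross-attending to text-token embeddings."""
    def __init__(self, slot_dim=128, text_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=slot_dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.text_proj = nn.Linear(text_dim, slot_dim)

    def forward(self, slots, text_tokens):
        # slots: (B, num_slots, slot_dim); text_tokens: (B, seq_len, text_dim)
        memory = self.text_proj(text_tokens)
        # Self-attention mixes slot interactions; cross-attention injects guidance.
        return self.decoder(tgt=slots, memory=memory)

# Example: 8 slots per frame, a 12-token caption embedding, batch of 2.
pred = TextConditionedSlotPredictor()
next_slots = pred(torch.randn(2, 8, 128), torch.randn(2, 12, 128))
```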
---
Title: Achieving Faster than O(1/t) Convergence in General Convex Federated Learning
Authors: Jie Liu, Zuang Wang, Yongqiang Wang
Abstract: This paper aims to achieve faster than O(1/t) convergence in federated learning for general convex loss functions. Under the independent and identically distributed (IID) condition, we show that accurate convergence to an optimal solution can be achieved in convex federated learning even when individual clients select stepsizes locally without any coordination. More importantly, this local stepsize strategy allows exploitation of the local geometry of individual clients' loss functions, and is shown to lead to faster convergence than the case where the same universal stepsize is used for all clients. When the distribution is non-IID, we share gradients in addition to the global model parameters to ensure o(1/t) convergence to an optimal solution in convex federated learning. For both algorithms, we theoretically prove that stepsizes much larger than existing counterparts are allowed, which leads to much faster convergence in empirical evaluations. It is worth noting that, beyond providing a general framework for federated learning with drift correction, our second algorithm's o(1/t) convergence to the exact optimal solution under general convex loss functions has not been previously reported in the federated learning literature, except in certain restricted convex cases with additional constraints. We believe this is significant because, even after incorporating momentum, existing first-order federated learning algorithms can only ensure O(1/t) convergence for general convex loss functions when no additional assumptions on heterogeneity are imposed.
URL: https://openreview.net/forum?id=Dae3jVdPod
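To make the local-stepsize idea concrete, here is a toy round of local SGD in which each client chooses its own stepsize from its local smoothness. This is an illustration only: the paper's actual stepsize rule and its o(1/t) drift-corrected algorithm are not reproduced. The toy also shows why heterogeneity makes gradient sharing necessary.

```python
import numpy as np

def local_sgd_round(global_w, clients, local_steps=10):
    """One FedAvg-style round in which every client picks its own stepsize
    from its local geometry (here a hypothetical inverse-smoothness rule)."""
    updates = []
    for grad_fn, smoothness in clients:
        w = global_w.copy()
        eta = 1.0 / smoothness          # assumed per-client stepsize rule
        for _ in range(local_steps):
            w = w - eta * grad_fn(w)
        updates.append(w)
    return np.mean(updates, axis=0)     # server averages the local iterates

# Toy quadratic clients f_i(w) = 0.5 * a_i * (w - b_i)^2, smoothness a_i.
clients = [(lambda w, a=a, b=b: a * (w - b), a)
           for a, b in [(1.0, 0.0), (4.0, 1.0), (0.5, -2.0)]]
w = np.array([5.0])
for _ in range(5):
    w = local_sgd_round(w, clients)
# Converges to the mean of the local minimizers (-1/3), not the global
# optimum (~0.545): exactly the client drift that gradient sharing corrects.
print(w)
```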
---
Title: Detecting generalization deficits in large language and reasoning models by using natural variations in simple problems
Authors: Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti, Jenia Jitsev
Abstract: Large language and reasoning models (LLMs, LRMs) are instances of foundation models exhibiting scaling laws that predict improved generalization with increasing pre-training scale. As such, they are supposed to possess strong generalization and therefore to transfer robustly across various tasks and conditions in a few-shot or zero-shot manner. Such claims rely on various standardized benchmarks that should measure core functions like generalization and reasoning, on which state-of-the-art (SOTA) models score high. We demonstrate a remarkable zero-shot generalization deficit in most SOTA models claiming strong function, including reasoning models such as DeepSeek R1 or o1-mini trained at the largest scales, using a simple, short common-sense math problem formulated in concise natural language and easily solvable by humans, which we term the Alice in Wonderland (AIW) problem. The deficit manifests as strong performance fluctuations across natural variations of the simple problem template that change neither the problem structure nor its difficulty. By testing models on further control problems of similar form, we rule out the possibility that the deficit is rooted in minor low-level issues such as natural-language or number parsing. In conventional LLMs, we observe strong overconfidence in wrong solutions, expressed in the form of plausible-sounding, explanation-like confabulations. Many models showing the deficit also collapse to near-zero accuracy on AIW problems while still exhibiting high scores on various standardized benchmarks. We show how this illusion of strong function might be caused by leakage of test sets into training data. For reasoning models, while we observe clearly improved performance compared to LLMs, we still see strong fluctuations across problem variations that keep structure and difficulty unchanged. Our observations suggest that current LLMs and LRMs possess generalization deficits that can be detected by controlled, structure- and difficulty-preserving variations of simple problems, in contrast to standardized benchmarks, which contain problems of higher difficulty but fail to detect such clear deficits. Code for reproducing the experiments and the raw experimental data can be found at https://github.com/LAION-AI/AIW
URL: https://openreview.net/forum?id=frA7uYn2um
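A tiny harness illustrates the evaluation protocol of structure- and difficulty-preserving variations. The template below is written in the spirit of the AIW problem as the abstract describes it (the paper's exact wording may differ), and query_model is a hypothetical stub.

```python
import random

# Illustrative template in the AIW spirit: varying only the names and counts
# preserves both problem structure and difficulty.
TEMPLATE = ("{name} has {b} brothers and {s} sisters. "
            "How many sisters does {name}'s brother have?")

def make_variations(n=20, seed=0):
    rng = random.Random(seed)
    names = ["Alice", "Sofia", "Nina", "Maria"]
    for _ in range(n):
        b, s = rng.randint(1, 6), rng.randint(1, 6)
        # A brother has the s sisters plus the named girl herself: s + 1.
        yield TEMPLATE.format(name=rng.choice(names), b=b, s=s), s + 1

def query_model(prompt):
    """Hypothetical stub: replace with a real LLM API call."""
    raise NotImplementedError

def accuracy_over_variations():
    results = []
    for prompt, answer in make_variations():
        results.append(str(answer) in query_model(prompt))
    return sum(results) / len(results)
```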
---
Title: DiffCATS: Causally Associated Time-Series Generation through Diffusion Models
Authors: Giuseppe Masi, Andrea Coletta, Elizabeth Fons, Svitlana Vyetrenko, Novella Bartolini
Abstract: Modeling and recovering causal relationships in time-series data can be crucial for supporting real-world interventions and decision-making, but progress in Time-Series Causal Discovery (TSCD) is often limited by the lack of high-quality datasets with diverse and realistic temporal causal relationships. This highlights the need for synthetic time-series generation tools with realism as a primary objective, which requires incorporating causal relationships beyond mere correlation. To address this challenge, we propose a diffusion model called DiffCATS. It simultaneously generates multiple causally associated time-series as well as a ground-truth causal graph that reflects their mutual temporal dependencies, requiring only observational time-series data for training. Experiments demonstrate that it outperforms state-of-the-art methods in producing realistic time-series with causal graphs that closely resemble those of real-world phenomena. We highlight the practical utility of our data on three downstream tasks, including benchmarking widely used TSCD algorithms.
URL: https://openreview.net/forum?id=FwC6CyaHop
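To illustrate the downstream benchmarking use, the following sketch assumes a generated sample consists of N series plus a lagged adjacency tensor, and scores a TSCD algorithm's recovered graph against the generated ground truth by structural Hamming distance. All shapes and the graph encoding are assumptions, and the arrays are random stand-ins, not DiffCATS output.

```python
import numpy as np

def structural_hamming_distance(true_graph, est_graph):
    """Count edge disagreements between the generated ground-truth lagged
    causal graph and the graph recovered by a TSCD algorithm."""
    return int(np.sum(true_graph != est_graph))

# Assumed sample layout: N series of length T, plus an adjacency tensor
# where A[i, j, k] = 1 means "series j at lag k+1 causes series i".
N, T, max_lag = 5, 200, 3
series = np.random.randn(N, T)                    # stand-in for generated series
true_graph = np.random.rand(N, N, max_lag) < 0.1  # stand-in ground-truth graph

est_graph = np.zeros_like(true_graph)             # stand-in TSCD algorithm output
print(structural_hamming_distance(true_graph, est_graph))
```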
---
New submissions
===============
Title: Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language
Abstract: Misinformation on social media is a widely acknowledged issue, and researchers worldwide are actively engaged in its detection. However, low-resource languages such as Urdu have received limited attention in this domain. An obvious approach is to utilize a multilingual pretrained language model and fine-tune it for a downstream classification task, such as misinformation detection. However, these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to optimize model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, and apply domain-adaptive pretraining using a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R generally outperforms its vanilla counterpart, while domain-adapted mBERT exhibits mixed results. These findings highlight the varying impact of domain adaptation across multilingual architectures in low-resource settings. We release our domain-adapted models, code, and datasets at URL withheld.
URL: https://openreview.net/forum?id=hSS4jvad8g
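The staged recipe (domain-adaptive masked-LM pretraining, then fine-tuning) maps directly onto the Hugging Face transformers API. Below is a condensed sketch with assumed hyperparameters and a placeholder corpus, not the authors' released code.

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import Dataset

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Stage 1: domain-adaptive pretraining (masked LM) on an Urdu news corpus.
news = Dataset.from_dict({"text": ["<urdu news sentences go here>"]})  # placeholder
news = news.map(lambda x: tok(x["text"], truncation=True, max_length=256),
                remove_columns=["text"])
mlm = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
Trainer(model=mlm,
        args=TrainingArguments(output_dir="xlmr-urdu-dapt", num_train_epochs=3),
        train_dataset=news,
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
        ).train()
mlm.save_pretrained("xlmr-urdu-dapt")

# Stage 2: fine-tune the adapted encoder for binary fake-news classification
# (the classification head is freshly initialized on top of the adapted weights).
clf = AutoModelForSequenceClassification.from_pretrained("xlmr-urdu-dapt",
                                                         num_labels=2)
```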
---