Expert Certification: Sparse Decomposition of Graph Neural Networks
Yaochen Hu, Mai Zeng, Ge Zhang, Pavel Rumiantsev, Liheng Ma, Yingxue Zhang, Mark Coates
https://openreview.net/forum?id=xdWP1d8BxI
---
Accepted papers
===============
Title: FedDr+: Stabilizing Dot-regression with Global Feature Distillation for Federated Learning
Authors: Seongyoon Kim, Minchan Jeong, Sungnyun Kim, Sungwoo Cho, Sumyeong Ahn, Se-Young Yun
Abstract: Federated Learning (FL) has emerged as a pivotal framework for the development of effective global models (global FL) or personalized models (personalized FL) across clients with heterogeneous, non-iid data distribution. A key challenge in FL is client drift, where data heterogeneity impedes the aggregation of scattered knowledge. Recent studies have tackled the client drift issue by identifying significant divergence in the last linear (classifier) layer. To mitigate this divergence, strategies such as freezing the classifier weights and aligning the feature extractor accordingly have proven effective. Although the local alignment between classifier and feature extractor has been studied as a crucial factor in FL, we observe that it may lead the model to overemphasize the observed classes and underestimate the unobserved classes within each client. Therefore, our goals are twofold: (1) improving local alignment and (2) maintaining the representation of unseen class samples, ensuring that the solution seamlessly incorporates knowledge from individual clients, thus enhancing performance in both global and personalized FL. To achieve this, we introduce a novel algorithm named FedDr+, which empowers local model alignment using dot-regression loss. FedDr+ freezes the classifier as a simplex ETF to align the features and improves aggregated global models by employing a feature distillation mechanism to retain information about unseen/missing classes. Our empirical results demonstrate that FedDr+ not only outperforms methods with a frozen classifier but also surpasses other state-of-the-art approaches, ensuring robust performance across diverse data distributions.
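A minimal sketch of the frozen simplex-ETF classifier the abstract refers to, using the standard equiangular tight frame construction from the neural-collapse literature; the function name and the check at the end are illustrative, and the dot-regression loss trained on top of the frozen classifier is not shown:

```python
import numpy as np

def simplex_etf(num_classes: int, feat_dim: int, seed: int = 0) -> np.ndarray:
    """Build a fixed simplex equiangular tight frame (ETF) classifier.

    Returns a (feat_dim, num_classes) matrix whose columns are unit-norm class
    prototypes with identical pairwise cosine similarity of -1/(C-1); FedDr+
    keeps such a classifier frozen and aligns the feature extractor to it.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal basis U of shape (feat_dim, num_classes); requires feat_dim >= num_classes.
    U, _ = np.linalg.qr(rng.standard_normal((feat_dim, num_classes)))
    C = num_classes
    return np.sqrt(C / (C - 1)) * U @ (np.eye(C) - np.ones((C, C)) / C)

etf = simplex_etf(num_classes=10, feat_dim=128)
gram = etf.T @ etf
print(np.allclose(np.diag(gram), 1.0))                          # unit-norm prototypes
print(np.allclose(gram[~np.eye(10, dtype=bool)], -1.0 / 9.0))   # equiangular at -1/(C-1)
```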
URL: https://openreview.net/forum?id=a6WthNFhL2
---
Title: Sparsified State-Space Models are Efficient Highway Networks
Authors: Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin
Abstract: State-space models (SSMs) offer a promising architecture for sequence modeling, providing an alternative to Transformers by replacing expensive self-attention with linear recurrences. In this paper, we propose a simple yet effective trick to enhance SSMs within given computational budgets by sparsifying them. Our intuition is that tokens in SSMs are highly redundant due to gradual recurrent updates, and dense recurrence operations block the delivery of past information. In particular, we observe that upper layers of SSMs tend to be more redundant as they encode global information, while lower layers encode local information. Motivated by this, we introduce Simba, a hierarchical sparsification method for SSMs based on token pruning. Simba sparsifies upper layers more than lower layers, encouraging the upper layers to behave like highways. To achieve this, we propose a novel token pruning criterion for SSMs, measuring the global impact of tokens on the final output by accumulating local recurrences. We demonstrate that Simba outperforms the baseline model, Mamba, with the same FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences. Code is available at https://github.com/woominsong/Simba.
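A rough sketch of hierarchical token pruning in the spirit described above; the saliency score here is a placeholder (Simba's criterion accumulates local recurrences to estimate global impact), and the layer-wise keep-ratio schedule is an assumed example:

```python
import torch

def prune_tokens(hidden: torch.Tensor, scores: torch.Tensor, keep_ratio: float):
    """Keep the top-`keep_ratio` fraction of tokens by importance score.

    hidden: (batch, seq, dim) token states entering an SSM layer.
    scores: (batch, seq) per-token importance; a placeholder for Simba's
    accumulated-recurrence criterion.
    """
    B, S, D = hidden.shape
    k = max(1, int(S * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values      # preserve temporal order
    return hidden.gather(1, idx.unsqueeze(-1).expand(B, k, D)), idx

# Hierarchical schedule: upper layers are sparsified more than lower layers.
num_layers = 12
keep_ratios = [1.0 - 0.5 * layer / (num_layers - 1) for layer in range(num_layers)]

x = torch.randn(2, 64, 32)
saliency = x.abs().mean(dim=-1)                                 # placeholder saliency
pruned, kept = prune_tokens(x, saliency, keep_ratios[-1])
print(pruned.shape)                                             # deepest layer keeps 50% of tokens
```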
URL: https://openreview.net/forum?id=G1p0YwrX8X
---
Title: Removing Structured Noise using Diffusion Models
Authors: Tristan Stevens, Hans van Gorp, Faik C Meral, Junseob Shin, Jason Yu, Jean-luc Robert, Ruud Van Sloun
Abstract: Solving ill-posed inverse problems requires careful formulation of prior beliefs over the signals of interest and an accurate description of their manifestation into noisy measurements. Handcrafted signal priors based on e.g. sparsity are increasingly replaced by data-driven deep generative models, and several groups have recently shown that state-of-the-art score-based diffusion models yield particularly strong performance and flexibility. In this paper, we show that the powerful paradigm of posterior sampling with diffusion models can be extended to include rich, structured, noise models. To that end, we propose a joint conditional reverse diffusion process with learned scores for the noise and signal-generating distribution. We demonstrate strong performance gains across various inverse problems with structured noise, outperforming competitive baselines using normalizing flows, adversarial networks and various posterior sampling methods for diffusion models. This opens up new opportunities and relevant practical applications of diffusion modeling for inverse problems in the context of non-Gaussian measurement models.
URL: https://openreview.net/forum?id=BvKYsaOVEn
---
Title: On Using Certified Training towards Empirical Robustness
Authors: Alessandro De Palma, Serge Durand, Zakaria Chihani, François Terrier, Caterina Urban
Abstract: Adversarial training is arguably the most popular way to provide empirical robustness against specific adversarial examples. While variants based on multi-step attacks incur significant computational overhead, single-step variants are vulnerable to a failure mode known as catastrophic overfitting, which hinders their practical utility for large perturbations. A parallel line of work, certified training, has focused on producing networks amenable to formal guarantees of robustness against any possible attack. However, the wide gap between the best-performing empirical and certified defenses has severely limited the applicability of the latter. Inspired by recent developments in certified training, which rely on a combination of adversarial attacks with network over-approximations, and by the connections between local linearity and catastrophic overfitting, we present experimental evidence on the practical utility and limitations of using certified training towards empirical robustness. We show that, when tuned for the purpose, a recent certified training algorithm can prevent catastrophic overfitting on single-step attacks, and that it can bridge the gap to multi-step baselines under appropriate experimental settings. Finally, we present a conceptually simple regularizer for network over-approximations that can achieve similar effects while markedly reducing runtime.
URL: https://openreview.net/forum?id=UaaT2fI9DC
---
Title: Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Authors: Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
Abstract: The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
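A small sketch of the modality-decoupling idea: per-modality feed-forward parameters with routing by the known modality id of each token. The shared global self-attention and the decoupled attention/normalization parameters are omitted, and all class and variable names are illustrative:

```python
import torch
from torch import nn

class ModalityFFN(nn.Module):
    """One sub-layer in the MoT spirit: every modality gets its own feed-forward
    parameters, while tokens of all modalities still share one global
    self-attention (not shown). Routing uses the known modality id, so no
    learned gating is needed."""

    def __init__(self, dim: int, hidden: int, num_modalities: int):
        super().__init__()
        self.ffns = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_modalities)
        )

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        for m, ffn in enumerate(self.ffns):      # each modality's tokens use their own FFN
            mask = modality == m
            out[mask] = ffn(x[mask])
        return out

tokens = torch.randn(2, 10, 64)                  # mixed text/image/speech sequence
modality = torch.randint(0, 3, (2, 10))          # per-token modality ids
layer = ModalityFFN(dim=64, hidden=256, num_modalities=3)
print(layer(tokens, modality).shape)             # (2, 10, 64)
```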
URL: https://openreview.net/forum?id=Nu6N69i8SB
---
Title: Multiplayer Information Asymmetric Contextual Bandits
Authors: William Chang, Yuanhao Lu
Abstract: Single-player contextual bandits are a well-studied problem in reinforcement learning that has seen applications in various fields such as advertising, healthcare, and finance. In light of the recent work on information asymmetric bandits, we propose a novel multiplayer information asymmetric contextual bandit framework where there are multiple players, each with their own set of actions. At every round, they observe the same context vectors and simultaneously take an action from their own set of actions, giving rise to a joint action. However, upon taking this action, the players are subjected to information asymmetry in (1) actions and/or (2) rewards. We design an algorithm, mLinUCB, by modifying the classical single-player algorithm LinUCB of Chu et al. (2011) to achieve the optimal regret $O(\sqrt{T})$ when only one kind of asymmetry is present. We then propose a novel algorithm, ETC, built on explore-then-commit principles, which achieves the same optimal regret when both types of asymmetry are present.
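For reference, a compact single-player LinUCB sketch of the kind the paper modifies; the multiplayer coordination under information asymmetry that defines mLinUCB is not shown here:

```python
import numpy as np

class LinUCB:
    """Single-player LinUCB (Chu et al., 2011). mLinUCB adapts this routine so
    that multiple players with asymmetric observations can coordinate, a
    detail omitted in this sketch."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x: np.ndarray) -> int:
        ucb = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            ucb.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(ucb))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

bandit = LinUCB(n_arms=5, dim=8)
rng = np.random.default_rng(0)
for _ in range(100):
    ctx = rng.standard_normal(8)
    arm = bandit.select(ctx)
    bandit.update(arm, ctx, reward=float(ctx[0] * (arm == 0)))   # toy reward signal
```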
URL: https://openreview.net/forum?id=nMCJ8bFq4B
---
Title: Relationship between Batch Size and Number of Steps Needed for Nonconvex Optimization of Stochastic Gradient Descent using Armijo-Line-Search Learning Rate
Authors: Yuki Tsukada, Hideaki Iiduka
Abstract: While stochastic gradient descent (SGD) can use various learning rates, such as constant or diminishing rates, previous numerical results showed that SGD performs better than other deep-learning optimizers when it uses learning rates given by line search methods. In this paper, we perform a convergence analysis on SGD with a learning rate given by an Armijo line search for nonconvex optimization indicating that the upper bound of the expectation of the squared norm of the full gradient becomes small when the number of steps and the batch size are large. Next, we show that, for SGD with the Armijo-line-search learning rate, the number of steps needed for nonconvex optimization is a monotone decreasing convex function of the batch size; that is, the number of steps needed for nonconvex optimization decreases as the batch size increases. Furthermore, we show that the stochastic first-order oracle (SFO) complexity, which is the stochastic gradient computation cost, is a convex function of the batch size; that is, there exists a critical batch size that minimizes the SFO complexity. Finally, we provide numerical results that support our theoretical results.
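A minimal sketch of one SGD step with an Armijo backtracking line search on the mini-batch loss, using the standard sufficient-decrease condition; constants and function names are illustrative choices, not the paper's settings:

```python
import numpy as np

def armijo_sgd_step(params, loss_fn, grad, lr0=1.0, c=1e-4, beta=0.5, max_backtracks=20):
    """One SGD step with an Armijo backtracking line search on the mini-batch loss.

    Accepts the largest learning rate in {lr0 * beta^k} satisfying
        loss(params - lr * grad) <= loss(params) - c * lr * ||grad||^2,
    the usual sufficient-decrease condition.
    """
    base_loss = loss_fn(params)
    g_norm_sq = float(np.dot(grad, grad))
    lr = lr0
    for _ in range(max_backtracks):
        candidate = params - lr * grad
        if loss_fn(candidate) <= base_loss - c * lr * g_norm_sq:
            return candidate, lr
        lr *= beta
    return params - lr * grad, lr   # fall back to the smallest trial step

# Toy quadratic example: loss(w) = 0.5 * ||w||^2, grad = w.
w = np.array([3.0, -4.0])
w, used_lr = armijo_sgd_step(w, lambda v: 0.5 * float(v @ v), grad=w.copy())
print(w, used_lr)
```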
URL: https://openreview.net/forum?id=pqZ6nOm3WF
---
Title: Daphne: Multi-Pass Compilation of Probabilistic Programs into Graphical Models and Neural Networks
Authors: Christian Dietrich Weilbach, Frank Wood
Abstract: Daphne is a probabilistic programming system that provides an expressive syntax to denote a large, but restricted, class of probabilistic models. Programs written in the Daphne language can be compiled into a general graph data structure of a corresponding probabilistic graphical model with simple link functions that can easily be implemented in a wide range of programming environments. Alternatively, Daphne can further compile such a graphical model into understandable and vectorized PyTorch code that can be used to train neural networks for inference. The Daphne compiler is structured as a layered multi-pass framework that allows independent and easy extension of the syntax by adding additional passes. It leverages extensive partial evaluation to reduce all syntax extensions to the graphical model at compile time.
URL: https://openreview.net/forum?id=OGCuDFab4b
---
Title: Cluster Tree for Nearest Neighbor Search
Authors: Dan Kushnir, Sandeep Silwal
Abstract: Tree-based algorithms are an important and widely used class of algorithms for Nearest Neighbor Search (NNS), with the random partition (RP) tree being arguably the most well studied. However, in spite of possessing theoretical guarantees and strong practical performance, a major drawback of the RP tree is its lack of adaptability to the input dataset. Inspired by recent theoretical and practical works for NNS, we attempt to remedy this by introducing *ClusterTree*, a new tree-based algorithm. Our approach utilizes randomness as in RP trees while adapting to the underlying cluster structure of the dataset to create well-balanced and meaningful partitions. Experimental evaluations on real-world datasets demonstrate improvements over RP trees and other tree-based methods for NNS while maintaining efficient construction time. In addition, we show theoretically and empirically that *ClusterTree* finds partitions which are superior to those found by RP trees in preserving the cluster structure of the input dataset.
URL: https://openreview.net/forum?id=ELtNtkGXoK
---
Title: Neuron-based explanations of neural networks sacrifice completeness and interpretability
Authors: Nolan Simran Dey, Eric Taylor, Alexander Wong, Bryan P. Tripp, Graham W. Taylor
Abstract: High quality explanations of neural networks (NNs) should exhibit two key properties. Completeness ensures that they accurately reflect a network’s function and interpretability makes them understandable to humans. Many existing methods provide explanations of individual neurons within a network. In this work we provide evidence that for AlexNet pretrained on ImageNet, neuron-based explanation methods sacrifice both completeness and interpretability compared to activation principal components. Neurons are a poor basis for AlexNet embeddings because they don’t account for the distributed nature of these representations. By examining two quantitative measures of completeness and conducting a user study to measure interpretability, we show the most important principal components provide more complete and interpretable explanations than the most important neurons. Much of the activation variance may be explained by examining relatively few high-variance PCs, as opposed to studying every neuron. These principal components also strongly affect network function, and are significantly more interpretable than neurons. Our findings suggest that explanation methods for networks like AlexNet should avoid using neurons as a basis for embeddings and instead choose a basis, such as principal components, which accounts for the high dimensional and distributed nature of a network's internal representations. Interactive demo and code available at https://ndey96.github.io/neuron-explanations-sacrifice.
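A small sketch of the alternative basis the abstract advocates: principal components of a layer's activations over a probe set, computed via SVD. This is illustrative code for the general technique, not the authors' pipeline:

```python
import numpy as np

def activation_pcs(activations: np.ndarray, n_components: int):
    """Return the top principal components of a layer's activations.

    activations: (n_samples, n_units) matrix, e.g. embeddings of a probe set
    of images. The PCs, rather than individual units, serve as the basis to be
    explained.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    # SVD of the centered data: rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = (s ** 2) / (s ** 2).sum()
    return Vt[:n_components], explained[:n_components]

acts = np.random.default_rng(0).standard_normal((1000, 256))
pcs, var_ratio = activation_pcs(acts, n_components=10)
print(pcs.shape, var_ratio.sum())   # (10, 256), fraction of variance covered by the top PCs
```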
URL: https://openreview.net/forum?id=UWNa9Pv6qA
---
Title: Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection
Authors: Moussa Kassem Sbeyti, Nadja Klein, Azarm Nowzad, Fikret Sivrikaya, Sahin Albayrak
Abstract: Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four building blocks that can be seamlessly integrated into an SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in up to 6% increase in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code is available at https://mos-ks.github.io/publications.
URL: https://openreview.net/forum?id=vRYt8QLKqK
---
Title: Compositionality in Time Series: A Proof of Concept using Symbolic Dynamics and Compositional Data Augmentation
Authors: Michael Hagmann, Michael Staniek, Stefan Riezler
Abstract: This work investigates whether time series of natural phenomena can be understood as being generated by sequences of latent states which are ordered in systematic and regular ways. We focus on clinical time series and ask whether clinical measurements can be interpreted as being generated by meaningful physiological states whose succession follows systematic principles. Uncovering the underlying compositional structure will allow us to create synthetic data to alleviate the notorious problem of sparse and low-resource data settings in clinical time series forecasting, and deepen our understanding of clinical data.
We start by conceptualizing compositionality for time series as a property of the data generation process, and then study data-driven procedures that can reconstruct the elementary states and composition rules of this process.
We evaluate the success of these methods using two empirical tests originating from a domain adaptation perspective.
Both tests infer the similarity of the original time series distribution and the synthetic time series distribution from the similarity of expected risk of time series forecasting models trained and tested on original and synthesized data in specific ways.
Our experimental results show that the test set performance achieved by training on compositionally synthesized data is comparable to training on original clinical time series data, and that evaluation of models on compositionally synthesized test data shows similar results to evaluating on original test data.
In both experiments, performance based on compositionally synthesized data by far surpasses that based on synthetic data that were created by randomization-based data augmentation.
An additional downstream evaluation of the prediction task of sequential organ failure assessment (SOFA) scores shows significant performance gains when model training is entirely based on compositionally synthesized data compared to training on original data, with improvements increasing with the size of the synthesized training set.
URL: https://openreview.net/forum?id=msI02LXVJX
---
Title: Understanding and Robustifying Sub-domain Alignment for Domain Adaptation
Authors: Yiling Liu, Juncheng Dong, Ziyang Jiang, Ahmed Aloui, Keyu Li, Michael Hunter Klein, Vahid Tarokh, David Carlson
Abstract: In unsupervised domain adaptation (UDA), aligning source and target domains improves the predictive performance of learned models on the target domain. A common methodological improvement in alignment methods is to divide the domains and align sub-domains instead. These sub-domain-based algorithms have demonstrated great empirical success but lack theoretical support. In this work, we establish a rigorous theoretical understanding of the advantages of these methods that have the potential to enhance their overall impact on the field. Our theory uncovers that sub-domain-based methods optimize an error bound that is at least as strong as non-sub-domain-based error bounds and is empirically verified to be much stronger. Furthermore, our analysis indicates that when the marginal weights of sub-domains shift between source and target tasks, the performance of these methods may be compromised. We therefore implement an algorithm to robustify sub-domain alignment for domain adaptation under sub-domain shift, offering a valuable adaptation strategy for future sub-domain-based methods. Empirical experiments across various benchmarks validate our theoretical insights, prove the necessity for the proposed adaptation strategy, and demonstrate the algorithm's competitiveness in handling label shift.
URL: https://openreview.net/forum?id=oAzu0gzUUb
---
Title: SAFE-NID: Self-Attention with Normalizing-Flow Encodings for Network Intrusion Detection
Authors: Brian Matejek, Ashish Gehani, Nathaniel D. Bastian, Daniel J Clouse, Bradford J Kline, Susmit Jha
Abstract: Machine learning models are increasingly adopted to monitor network traffic and detect intrusions. In this work, we introduce SAFE-NID, a novel machine learning approach for real-time packet-level traffic monitoring and intrusion detection that includes a safeguard to detect zero day attacks as out-of-distribution inputs. Unlike traditional models, which falter against zero-day attacks and concept drift, SAFE-NID leverages a lightweight encoder-only transformer architecture combined with a novel normalizing flows-based safeguard. This safeguard not only quantifies uncertainty but also identifies out-of-distribution (OOD) inputs, enabling robust performance in dynamic threat landscapes. Our generative model learns class-conditional representations of the internal features of the deep neural network. We demonstrate the effectiveness of our approach by converting publicly available network flow-level intrusion datasets into packet-level ones. We release the labeled packet-level versions of these datasets with over 50 million packets each and describe the challenges in creating these datasets. We withhold from the training data certain attack categories to simulate zero-day attacks. Existing deep learning models, which achieve an accuracy of over 99% when detecting known attacks, only correctly classify 1% of the novel attacks. Our proposed transformer architecture with normalizing flows model safeguard achieves an area under the receiver operating characteristic curve of over 0.97 in detecting these novel inputs, outperforming existing combinations of neural architectures and model safeguards. The additional latency in processing each packet by the safeguard is a small fraction of the overall inference task. This dramatic improvement in detecting zero-day attacks and distribution shifts emphasizes SAFE-NID’s novelty and utility as a reliable and efficient safety monitoring tool for real-world network intrusion detection.
URL: https://openreview.net/forum?id=hDywd5AbIM
---
Title: A Unified View of Double-Weighting for Marginal Distribution Shift
Authors: José I. Segovia-Martín, Santiago Mazuelas, Anqi Liu
Abstract: Supervised classification traditionally assumes that training and testing samples are drawn from the same underlying distribution. However, practical scenarios are often affected by distribution shifts, such as covariate and label shifts. Most existing techniques for correcting distribution shifts are based on a reweighted approach that weights training samples, assigning lower relevance to the samples that are unlikely at testing. However, these methods may achieve poor performance when the weights obtained take large values at certain training samples. In addition, in multi-source cases, existing methods do not exploit complementary information among sources, and equally combine sources for all instances. In this paper, we establish a unified learning framework for distribution shift adaptation. We present a double-weighting approach to deal with distribution shifts, considering weight functions associated with both training and testing samples. For the multi-source case, the presented methods assign source-dependent weights for training and testing samples, where weights are obtained jointly using information from all sources. We also present generalization bounds for the proposed methods that show a significant increase in the effective sample size compared with existing approaches. Empirically, the proposed methods achieve enhanced classification performance in both synthetic and empirical experiments.
URL: https://openreview.net/forum?id=aPyJilTiIb
---
Title: Distilling Datasets Into Less Than One Image
Authors: Asaf Shul, Eliahu Horwitz, Yedid Hoshen
Abstract: Dataset distillation aims to compress a dataset into a much smaller one so that a model trained on the distilled dataset achieves high accuracy. Current methods frame this as maximizing the distilled classification accuracy for a budget of K distilled images-per-class, where K is a positive integer. In this paper, we push the boundaries of dataset distillation, compressing the dataset into less than an image-per-class. It is important to realize that the meaningful quantity is not the number of distilled images-per-class but the number of distilled pixels-per-dataset. We therefore, propose Poster Dataset Distillation (PoDD), a new approach that distills the entire original dataset into a single poster. The poster approach motivates new technical solutions for creating training images and learnable labels. Our method can achieve comparable or better performance with less than an image-per-class compared to existing methods that use one image-per-class. Specifically, our method establishes a new state-of-the-art performance on CIFAR-10, CIFAR-100, and CUB200 on the well established 1 IPC benchmark, while using as little as 0.3 images-per-class.
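One way to read the poster idea is that per-class training crops are sliced, with overlap, out of a single shared image. The sketch below only illustrates that slicing under assumed sizes; PoDD additionally learns the poster contents and soft labels, which is not shown:

```python
import numpy as np

def crops_from_poster(poster: np.ndarray, n_classes: int, patch: int, stride: int):
    """Slice a single 'poster' into overlapping per-class training crops.

    poster: (H, W, C) array holding the whole distilled dataset; crops are taken
    in row-major order and assigned one class each (illustrative only).
    """
    H, W, _ = poster.shape
    crops, labels = [], []
    cls = 0
    for top in range(0, H - patch + 1, stride):
        for left in range(0, W - patch + 1, stride):
            if cls >= n_classes:
                break
            crops.append(poster[top:top + patch, left:left + patch])
            labels.append(cls)
            cls += 1
    return np.stack(crops), np.array(labels)

poster = np.random.rand(80, 80, 3)           # ~0.6 CIFAR-sized images-per-class of pixels for 10 classes
crops, labels = crops_from_poster(poster, n_classes=10, patch=32, stride=16)
print(crops.shape, labels)                   # (10, 32, 32, 3), one crop per class
```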
URL: https://openreview.net/forum?id=qsipSdfWeV
---
Title: On Using Secure Aggregation in Differentially Private Federated Learning with Multiple Local Steps
Authors: Mikko A. Heikkilä
Abstract: Federated learning is a distributed learning setting where the main aim is to train machine learning models without sharing raw data, exchanging only what is required for learning. To guarantee training data privacy and high-utility models, differential privacy and secure aggregation techniques are often combined with federated learning. However, with fine-grained protection granularities, e.g., the common sample-level protection, the currently existing techniques generally require the parties to communicate for each local optimization step if they want to fully benefit from the secure aggregation in terms of the resulting formal privacy guarantees. In this paper, we show how a simple new analysis allows the parties to perform multiple local optimization steps while still benefiting from using secure aggregation. We show that our analysis enables higher-utility models with guaranteed privacy protection under a limited number of communication rounds.
URL: https://openreview.net/forum?id=uxyWlXPuIg
---
Title: Which Backbone to Use: A Resource-efficient Domain Specific Comparison for Computer Vision
Authors: Pranav Jeevan P, Amit Sethi
Abstract: For computer vision applications on small, niche, and proprietary datasets, fine-tuning a neural network (NN) backbone that is pre-trained on a large dataset, such as ImageNet, is a common practice. However, it is unknown whether the backbones that perform well on large datasets, such as vision transformers, are also the right choice for fine-tuning on smaller custom datasets. The present comprehensive analysis aims to aid machine learning practitioners in selecting the most suitable backbone for their specific problem. We systematically evaluated multiple lightweight, pre-trained backbones under consistent training settings across a variety of domains spanning natural, medical, deep space, and remote sensing images. We found that even though attention-based architectures are gaining popularity, they tend to perform poorly compared to CNNs when fine-tuned on small amounts of domain-specific data. We also observed that certain CNN architectures consistently perform better than others when controlled for network size. Our findings provide actionable insights into the performance trade-offs and effectiveness of different backbones for a broad spectrum of computer vision domains.
URL: https://openreview.net/forum?id=XVSQnnf7QT
---
Title: Enhancing Maritime Trajectory Forecasting via H3 Index and Causal Language Modelling (CLM)
Authors: Nicolas Drapier, Aladine Chetouani, Aurélien Chateigner
Abstract: The prediction of ship trajectories is a growing field of study in artificial intelligence. Traditional methods rely on the use of LSTM, GRU networks, and even Transformer architectures for the prediction of spatio-temporal series. This study proposes a viable alternative for predicting these trajectories using only GNSS positions. It considers this spatio-temporal problem as a natural language processing problem. The latitude/longitude coordinates of AIS messages are transformed into cell identifiers using the H3 index. Thanks to the pseudo-octal representation, it becomes easier for language models to learn the spatial hierarchy of the H3 index. The method is qualitatively compared to a classical Kalman filter and quantitatively to Seq2Seq and TrAISformer models. The Fréchet distance is introduced as the main evaluation metric for these comparisons. We show that it is possible to predict ship trajectories quite precisely up to 8 hours ahead with 30 minutes of context, using solely GNSS positions, without relying on any additional information such as speed, course, or external conditions — unlike many traditional methods. We demonstrate that this alternative works well enough to predict trajectories worldwide.
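A tiny sketch of the tokenization step: GNSS fixes are mapped to H3 cell identifiers that a causal language model can then treat as words. It assumes the h3-py 3.x API (`geo_to_h3`); the consecutive-duplicate removal is an assumption, and the pseudo-octal re-encoding described by the paper is omitted:

```python
import h3  # h3-py 3.x API assumed (v4 renames geo_to_h3 to latlng_to_cell)

def trajectory_to_tokens(positions, resolution: int = 8):
    """Map a sequence of GNSS (lat, lon) fixes to H3 cell identifiers.

    Each cell id becomes a 'word', so trajectory forecasting can be cast as
    next-token prediction over cell sequences.
    """
    tokens = []
    for lat, lon in positions:
        cell = h3.geo_to_h3(lat, lon, resolution)
        if not tokens or tokens[-1] != cell:      # drop consecutive duplicates
            tokens.append(cell)
    return tokens

track = [(51.95, 4.05), (51.96, 4.07), (51.97, 4.10)]   # toy positions near a port
print(trajectory_to_tokens(track))
```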
URL: https://openreview.net/forum?id=tIfS6jyO9f
---
Title: Lower Ricci Curvature for Efficient Community Detection
Authors: Yun Jin Park, Didong Li
Abstract: This study introduces the Lower Ricci Curvature (LRC), a novel, scalable, and scale-free discrete curvature designed to enhance community detection in networks. Addressing the computational challenges posed by existing curvature-based methods, LRC offers a streamlined approach with linear computational complexity, which makes it well suited for large-scale network analysis. We further develop an LRC-based preprocessing method that effectively augments popular community detection algorithms. Through applications on multiple real-world datasets, including the NCAA football league network, the DBLP collaboration network, the Amazon product co-purchasing network, and the YouTube social network, we demonstrate the efficacy of our method in significantly improving the performance of various community detection algorithms.
URL: https://openreview.net/forum?id=EoiuRII7MQ
---
Title: Meta-learning Optimizers for Communication-Efficient Learning
Authors: Charles-Étienne Joseph, Benjamin Thérien, Abhinav Moudgil, Boris Knyazev, Eugene Belilovsky
Abstract: Communication-efficient variants of SGD, specifically local SGD, have received a great deal of interest in recent years. These approaches compute multiple gradient steps locally on each worker, before averaging model parameters, helping relieve the critical communication bottleneck in distributed deep learning training. Although many variants of these approaches have been proposed, they can sometimes lag behind state-of-the-art adaptive optimizers for deep learning. In this work, we investigate if the recent progress in the emerging area of learned optimizers can potentially close this gap in homogeneous data and homogeneous device settings while remaining communication-efficient. Specifically, we meta-learn how to perform global updates given an update from local SGD iterations. Our results demonstrate that learned optimizers can substantially outperform local SGD and its sophisticated variants while maintaining their communication efficiency. Our learned optimizers can even generalize to unseen and much larger datasets and architectures, including ImageNet and ViTs, and to unseen modalities such as language modeling. We therefore show the potential of learned optimizers for improving communication-efficient distributed learning.
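A bare-bones sketch of one local-SGD communication round; the outer update here is plain delta averaging, which is exactly the component the paper replaces with a meta-learned optimizer (all names and hyperparameters are illustrative):

```python
import copy
import torch

def local_sgd_round(global_model, worker_batches, loss_fn, lr=0.05):
    """One communication round of local SGD. Each worker takes one SGD step per
    local batch, then the per-worker parameter deltas are aggregated. Plain
    local SGD averages them; the paper meta-learns this outer update instead.
    """
    deltas = []
    for batches in worker_batches:
        model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for x, y in batches:                     # local steps, no communication
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        deltas.append([q.detach() - p.detach()
                       for q, p in zip(model.parameters(), global_model.parameters())])
    with torch.no_grad():                        # outer update: simple averaging here
        for i, p in enumerate(global_model.parameters()):
            p.add_(torch.stack([d[i] for d in deltas]).mean(dim=0))
    return global_model

model = torch.nn.Linear(4, 1)
workers = [[(torch.randn(16, 4), torch.randn(16, 1)) for _ in range(4)] for _ in range(3)]
model = local_sgd_round(model, workers, torch.nn.functional.mse_loss)
```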
URL: https://openreview.net/forum?id=uRbf9ANAns
---
Title: Sparse Decomposition of Graph Neural Networks
Authors: Yaochen Hu, Mai Zeng, Ge Zhang, Pavel Rumiantsev, Liheng Ma, Yingxue Zhang, Mark Coates
Abstract: Graph Neural Networks (GNNs) exhibit superior performance in graph representation learning, but their inference cost can be high due to an aggregation operation that can require a memory fetch for a very large number of nodes.
This inference cost is the major obstacle to deploying GNN models with *online prediction* to reflect the potentially dynamic node features.
To address this, we propose an approach to reduce the number of nodes that are included during aggregation.
We achieve this through a sparse decomposition, learning to approximate node representations using a weighted sum of linearly transformed features of a carefully selected subset of nodes within the extended neighbourhood.
The approach achieves linear complexity with respect to the average node degree and the number of layers in the graph neural network.
We introduce an algorithm to compute the optimal parameters for the sparse decomposition, ensuring an accurate approximation of the original GNN model, and present effective strategies to reduce the training time and improve the learning process.
We demonstrate via extensive experiments that our method outperforms other baselines designed for inference speedup, achieving significant accuracy gains with comparable inference times for both node classification and spatio-temporal forecasting tasks.
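A schematic of what inference looks like under such a sparse decomposition, assuming the support-node indices, mixing weights, and linear transform have already been learned (shapes and names are illustrative, not the authors' implementation):

```python
import torch

def sparse_decomposition_inference(x, W, support, weights):
    """Approximate GNN output embeddings with a sparse decomposition.

    x:       (N, d_in) raw node features (possibly updated online).
    W:       (d_in, d_out) learned linear transform shared across nodes.
    support: (N, k) indices of the k selected support nodes per target node.
    weights: (N, k) learned mixing weights for those support nodes.

    Each target node's embedding is a weighted sum of linearly transformed
    features of its k support nodes, so inference touches only N*k nodes
    instead of full multi-hop neighbourhoods.
    """
    transformed = x @ W                               # (N, d_out)
    gathered = transformed[support]                   # (N, k, d_out)
    return (weights.unsqueeze(-1) * gathered).sum(dim=1)

N, d_in, d_out, k = 1000, 32, 16, 8
x = torch.randn(N, d_in)
W = torch.randn(d_in, d_out)
support = torch.randint(0, N, (N, k))
weights = torch.softmax(torch.randn(N, k), dim=1)
print(sparse_decomposition_inference(x, W, support, weights).shape)   # (N, 16)
```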
URL: https://openreview.net/forum?id=xdWP1d8BxI
---
Title: Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques
Authors: Shwai He, Daize Dong, Liang Ding, Ang Li
Abstract: Scaling large language models has driven remarkable advancements across various domains, yet the continual increase in model size presents significant challenges for real-world deployment. The Mixture of Experts (MoE) architecture offers a promising solution by dynamically selecting and activating only a subset of experts during inference, thus substantially reducing computational costs while preserving high performance. Despite these benefits, MoE introduces new inefficiencies, such as excessive parameters and communication overhead. In this work, we present a holistic study of compression techniques for Mixture of Experts to enhance both efficiency and scalability. While recent efforts have focused on Expert Trimming, which reduces the number of experts, these approaches still suffer from considerable communication and computational costs. To address this, we propose more aggressive strategies, such as Layer Drop, which removes entire MoE layers, and Block Drop, which eliminates transformer blocks. Surprisingly, these aggressive pruning techniques not only preserve model performance but also substantially improve computation and memory efficiency. Furthermore, beyond Expert Trimming, we also introduce Expert Slimming, which compresses individual experts to further boost performance and can be seamlessly integrated with Expert Trimming. Extensive experimental results demonstrate the effectiveness of our proposed methods—Layer Drop and Block Drop—along with the comprehensive recipe that integrates Expert Slimming and Expert Trimming, achieving a 6.05× speedup with 77.1% reduced memory usage while maintaining over 92% of performance on Mixtral-8×7B. Our code is released at https://github.com/CASE-Lab-UMD/Unified-MoE-Compression.
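A toy sketch of Block Drop as a structural pruning step on a stack of transformer blocks; how the dropped indices are chosen, and the Layer Drop and Expert Slimming variants, are not shown:

```python
import torch
from torch import nn

def drop_blocks(layers: nn.ModuleList, drop_idx: set) -> nn.ModuleList:
    """Block Drop as post-training structural pruning: remove whole transformer
    blocks by index and return the shortened stack. Layer Drop is analogous but
    replaces only the MoE sub-layer of each chosen block with an identity."""
    kept = [layer for i, layer in enumerate(layers) if i not in drop_idx]
    return nn.ModuleList(kept)

blocks = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
                        for _ in range(8)])
pruned = drop_blocks(blocks, drop_idx={5, 6})        # e.g. drop two redundant blocks
x = torch.randn(2, 10, 64)
for block in pruned:
    x = block(x)
print(len(pruned), x.shape)                          # 6 blocks, (2, 10, 64)
```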
URL: https://openreview.net/forum?id=HTpMOl6xSI
---
New submissions
===============
Title: MagicPose4D: Crafting Articulated Models with Appearance and Motion Control
Abstract: With the success of 2D and 3D visual generative models, there is growing interest in generating 4D content. Existing methods primarily rely on text prompts to produce 4D content, but they often fall short of accurately defining complex or rare motions. To address this limitation, we propose MagicPose4D, a novel framework for refined control over both appearance and motion in 4D generation. Unlike current 4D generation methods, MagicPose4D accepts monocular videos or mesh sequences as motion prompts, enabling precise and customizable motion control. MagicPose4D comprises two key modules:
(i) a Dual-Phase 4D Reconstruction Module, which operates in two phases. The first phase focuses on capturing the model's shape using accurate 2D supervision and less accurate but geometrically informative 3D pseudo-supervision, without imposing skeleton constraints. The second phase extracts the 3D motion (skeleton poses) using the more accurate pseudo-3D supervision obtained in the first phase and introduces kinematic chain-based skeleton constraints to ensure physical plausibility. Additionally, we propose a Global-local Chamfer loss that aligns the overall distribution of predicted mesh vertices with the supervision while maintaining part-level alignment without extra annotations. (ii) a Cross-category Motion Transfer Module, which leverages the motion extracted by the 4D reconstruction module and uses a kinematic-chain-based skeleton to achieve cross-category motion transfer. It ensures smooth transitions between frames through dynamic rigidity, facilitating robust generalization without additional training. Through extensive experiments, we demonstrate that MagicPose4D significantly improves the accuracy and consistency of 4D content generation, outperforming existing methods on various benchmarks.
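For the loss mentioned above, a minimal sketch of the global term of a Chamfer-style objective; the part-level ("local") terms and their vertex grouping are assumptions left out here:

```python
import torch

def chamfer_distance(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3).

    A Global-local Chamfer loss would combine a global term like this one with
    per-part terms over grouped vertices; only the global term is sketched.
    """
    d = torch.cdist(pred, target)                  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred = torch.rand(500, 3)
target = torch.rand(400, 3)
print(chamfer_distance(pred, target))
```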
URL: https://openreview.net/forum?id=qgHq1NFUJk
---
Title: Labeling without Seeing? Blind Annotation for Privacy-Preserving Entity Resolution
Abstract: The entity resolution problem requires finding pairs across datasets that belong to different owners but refer to the same entity in the real world. To train and evaluate solutions (either rule-based or machine-learning-based) to the entity resolution problem, generating a ground truth dataset with entity pairs or clusters is needed. However, such a data annotation process involves humans as domain oracles to review the plaintext data for all candidate record pairs from different parties, which inevitably infringes the privacy of data owners, especially in privacy-sensitive cases like medical records. To the best of our knowledge, there is no prior work on privacy-preserving ground truth labeling in the context of entity resolution. We propose a novel blind annotation protocol based on homomorphic encryption that allows domain oracles to collaboratively label ground truth without sharing data in plaintext with other parties. In addition, we design a domain-specific, user-friendly language that conceals the complex underlying homomorphic encryption circuits, making it more accessible and easier for users to adopt this technique. The empirical experiments indicate the feasibility of our privacy-preserving protocol (the F-measure achieves more than 90% on average compared with the real ground truth).
URL: https://openreview.net/forum?id=bAM8y3Hm0p
---
Title: Compressed Decentralized Momentum Stochastic Gradient Methods for Nonconvex Optimization
Abstract: In this paper, we design two compressed decentralized algorithms for solving nonconvex stochastic optimization under two different scenarios. Both algorithms adopt a momentum technique to achieve fast convergence and a message-compression technique to save communication costs. Though momentum acceleration and compressed communication have been used in the literature, it is highly nontrivial to theoretically prove the effectiveness of their composition in a decentralized algorithm that can maintain the benefits of both sides, because of the need to simultaneously control the consensus error, the compression error, and the bias from the momentum gradient.
For the scenario where gradients are bounded, our proposal is a compressed decentralized adaptive method. To the best of our knowledge, this is the first decentralized adaptive stochastic gradient method with compressed communication. For the scenario of data heterogeneity without bounded gradients, our proposal is a compressed decentralized heavy-ball method, which applies a gradient tracking technique to address the challenge of data heterogeneity. Notably, both methods achieve an optimal convergence rate, and they can achieve linear speedup and adopt topology-independent algorithmic parameters within a certain regime of the user-specified error tolerance. Superior empirical performance is observed over state-of-the-art methods on training deep neural networks (DNNs) and Transformers.
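A single-worker sketch of the two ingredients being composed, heavy-ball momentum plus a top-k message compressor; the decentralized consensus, gradient tracking, and adaptive variants analyzed in the paper are not represented:

```python
import numpy as np

def top_k_compress(vec: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude coordinates (a standard contractive
    compressor); everything else is zeroed before communication."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out

def heavy_ball_step(x, m, grad, lr=0.01, beta=0.9, k=None):
    """Heavy-ball momentum update with optional compression of the transmitted
    momentum buffer, standing in for what each decentralized worker would send
    to its neighbours."""
    m = beta * m + grad
    msg = top_k_compress(m, k) if k is not None else m
    return x - lr * msg, m

x, m = np.random.randn(100), np.zeros(100)
x, m = heavy_ball_step(x, m, grad=x.copy(), k=10)   # toy gradient of 0.5*||x||^2
```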
URL: https://openreview.net/forum?id=RqhMQHHkB4
---
Title: On Efficient Bayesian Exploration in Model-Based Reinforcement Learning
Abstract: In this work, we address the challenge of data-efficient exploration in reinforcement learning by developing a principled, information-theoretic approach to intrinsic motivation. Specifically, we introduce a novel class of exploration bonuses that targets epistemic uncertainty rather than the aleatoric noise inherent in the environment. We prove that these bonuses naturally signal epistemic information gains and converge to zero once the agent becomes sufficiently certain about the environment’s dynamics and rewards, thereby aligning exploration with genuine knowledge gaps. To enable practical use, we also discuss tractable approximations via sparse variational Gaussian Processes, Deep Kernels and Deep Ensemble models. We then propose a Predictive Trajectory Sampling with Bayesian Exploration (PTS-BE) algorithm, which combines model-based planning with our proposed information-theoretic bonuses to achieve sample-efficient deep exploration. Empirically, we demonstrate that PTS-BE substantially outperforms other baselines across a variety of environments characterized by sparse rewards and/or purely exploratory tasks.
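As a crude stand-in for the epistemic bonuses described above, the sketch below scores disagreement across a deep ensemble of dynamics models, which vanishes as the models converge; the paper's information-theoretic bonuses and GP/deep-kernel approximations are not reproduced:

```python
import numpy as np

def ensemble_disagreement_bonus(models, state, action):
    """Epistemic exploration bonus approximated by ensemble disagreement: the
    variance of the models' mean predictions shrinks to zero as the ensemble
    agrees, unlike aleatoric noise in the environment."""
    preds = np.stack([m(state, action) for m in models])   # (n_models, state_dim)
    return float(preds.var(axis=0).mean())

# Toy ensemble: linear dynamics models with slightly different weights.
rng = np.random.default_rng(0)
models = [(lambda W: (lambda s, a: W @ np.concatenate([s, a])))(rng.standard_normal((4, 6)))
          for _ in range(5)]
bonus = ensemble_disagreement_bonus(models, state=np.ones(4), action=np.ones(2))
print(bonus)   # large where the models disagree, ~0 once they converge
```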
URL: https://openreview.net/forum?id=Na02hDWqkF
---
Title: A Survey on Verifiable Cross-Silo Federated Learning
Abstract: Federated Learning (FL) is a widespread approach that allows training machine learning (ML) models with data distributed across multiple devices. In cross-silo FL, which often appears in domains like healthcare or finance, the number of participants is moderate, and each party typically represents a well-known organization. For instance, in medicine data owners are often hospitals or data hubs which are well-established entities. However, malicious parties may still attempt to disturb the training procedure in order to obtain certain benefits, for example, a biased result or a reduction in computational load. While one can easily detect a malicious agent when data used for training is public, the problem becomes much more acute when it is necessary to maintain the privacy of the training dataset. To address this issue, there is recently growing interest in developing verifiable protocols, where one can check that parties do not deviate from the training procedure and perform computations correctly. In this paper, we present a survey on verifiable cross-silo FL. We analyze various protocols, fit them in a taxonomy, and compare their efficiency and threat models. We also analyze Zero-Knowledge Proof (ZKP) schemes and discuss how their overall cost in a FL context can be minimized. Lastly, we identify research gaps and discuss potential directions for future scientific work.
URL: https://openreview.net/forum?id=uMir8UIHST
---
Title: Open Problems in Mechanistic Interpretability
Abstract: Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
URL: https://openreview.net/forum?id=91H76m9Z94
---
Title: On Diffusion Posterior Sampling via Sequential Monte Carlo for Zero-Shot Scaffolding of Protein Motifs
Abstract: With the advent of diffusion models, new proteins can be generated at an unprecedented rate. The motif scaffolding problem requires steering this generative process to yield proteins with a desirable functional substructure -- a motif. While models have been trained to take the motif as conditional input, recent techniques in diffusion posterior sampling can be leveraged as zero-shot alternatives whose approximations can be corrected with sequential Monte Carlo (SMC) algorithms. In this work, we introduce a new set of guidance potentials to describe and solve scaffolding tasks by adapting SMC-aided diffusion posterior samplers with an unconditional model, Genie, acting as a prior. Against established benchmarks, we successfully scaffold several single-motif and multi-motif problems. The latter is possible by pairing reconstruction guidance with SE(3)-invariant potentials. In the single-motif case, we find these potentials perform comparably to the conventional masking approach and that reconstruction guidance outperforms replacement methods when aided with SMC. We additionally consider a guidance potential for point symmetry constraints and produce designable internally symmetric monomers with our setup. Overall, this work highlights the capabilities and areas for improvement of zero-shot posterior samplers in motif scaffolding tasks.
URL: https://openreview.net/forum?id=KXRYY7iwqh
---
Title: A Survey of State Representation Learning for Deep Reinforcement Learning
Abstract: Representation learning methods are an important tool for addressing the challenges posed by complex observation spaces in sequential decision-making problems. Recently, many methods have used a wide variety of approaches to learn meaningful state representations in reinforcement learning, allowing better sample efficiency, generalization, and performance. This survey aims to provide a broad categorization of these methods within a model-free online setting, exploring how they tackle the learning of state representations differently. We categorize the methods into six main classes, detailing their mechanisms, benefits, and limitations. Through this taxonomy, our aim is to enhance the understanding of this field and provide a guide for new researchers. We also discuss techniques for assessing the quality of representations, and detail relevant future directions.
URL: https://openreview.net/forum?id=gOk34vUHtz
---
Title: Do Concept Bottleneck Models Respect Localities?
Abstract: Concept-based explainability methods use human-understandable intermediaries to produce explanations for machine learning models. These methods assume concept predictions can help understand a model's internal reasoning. In this work, we assess the degree to which such an assumption is true by analyzing whether concept predictors leverage ``relevant'' features to make predictions, a term we call locality. Concept-based models that fail to respect localities also fail to be explainable because concept predictions are based on spurious features, making the interpretation of the concept predictions vacuous. To assess whether concept-based models respect localities, we construct and use three metrics to characterize when models respect localities, complementing our analysis with theoretical results. Many concept-based models used in practice fail to respect localities because concept predictors cannot always clearly distinguish distinct concepts. Based on these findings, we propose suggestions for alleviating this issue.
URL: https://openreview.net/forum?id=4mCkRbUXOf
---
Title: FB-MOAC: A Reinforcement Learning Algorithm for Forward-Backward Markov Decision Processes
Abstract: Reinforcement learning (RL) algorithms are effective in solving problems that can be modeled as Markov decision processes (MDPs). They primarily target forward MDPs, whose dynamics evolve over time from an initial state. However, several important problems in stochastic control and network systems, among others, exhibit both forward and backward dynamics. As a consequence, they cannot be expressed as a standard MDP, thereby calling for a novel theory for RL in this context. Accordingly, this work introduces the concept of Forward-Backward Markov Decision Processes (FB-MDPs) for multi-objective problems and develops a novel theoretical framework to characterize their optimal solutions. Moreover, it introduces the FB-MOAC algorithm, which employs a step-wise forward-backward mechanism to obtain optimal policies with guaranteed convergence and a competitive rate with respect to standard approaches in RL. FB-MOAC is finally evaluated on three use cases in the context of mathematical finance, mobile resource management, and edge computing. The obtained results show that FB-MOAC outperforms the state of the art across different metrics, highlighting its ability to learn and maximize rewards.
URL: https://openreview.net/forum?id=li5DyC6rfS
---
Title: Associative memory inspires improvements for in-context learning using a novel attention residual stream architecture
Abstract: Large language models (LLMs) demonstrate an impressive ability to utilise information within the context of their input sequences to appropriately respond to data unseen by the LLM during its training procedure. This ability is known as in-context learning (ICL). Humans and non-human animals demonstrate similar abilities; however, their neural architectures differ substantially from those of LLMs. Despite this, a critical component within LLMs, the attention mechanism, resembles modern associative memory models, widely used in and influenced by the computational neuroscience community to model biological memory systems. Using this connection, we introduce an associative memory model capable of performing ICL. We use this as inspiration for a novel residual stream architecture that allows information to flow directly between attention heads. We test this architecture during training within a two-layer Transformer and show its ICL abilities manifest more quickly than without this modification. We then apply our architecture in small language models with 8 million parameters, focusing on attention head values, with results also indicating improved ICL performance at this larger and more naturalistic scale.
URL: https://openreview.net/forum?id=lcTFm4LIRR
---
Title: ModernTCN Revisited: A Reproducibility Study with Extended Benchmarks
Abstract: This study presents a reproducibility analysis of ModernTCN, a recently proposed convolutional architecture for time series analysis. ModernTCN aims to address the limitations of traditional Temporal Convolutional Networks (TCNs) by enhancing the effective receptive field (ERF) and capturing long-range dependencies. We validate the experimental setup and performance claims of the original paper, and extend the evaluation to include additional datasets and tasks, such as short-term forecasting on ETT, classification on Speech Commands and PhysioNet, and ablation studies on the cross-variable component. Our results show that while ModernTCN achieves competitive performance, its state-of-the-art claims are tempered by sensitivity to experimental settings and data handling. Furthermore, ModernTCN's performance on Speech Commands lags behind convolutional methods with global receptive fields, and it exhibits less parameter efficiency. However, ablation studies on the PhysioNet dataset confirm the importance of the cross-variable component in handling missing data. This study provides a comprehensive evaluation of ModernTCN's contributions, reproducibility, and generalizability in time series analysis.
URL: https://openreview.net/forum?id=R20kKdWmVZ
---
Title: Do Think Tags Really Help LLMs Plan? A Critical Evaluation of ReAct-Style Prompting
Abstract: The reasoning abilities of Large Language Models (LLMs) remain a topic of considerable interest and debate. Among the original papers arguing for emergent reasoning abilities of LLMs, ReAct became particularly popular by claiming to tease out LLM reasoning abilities with special prompting involving “interleaving reasoning trace with action execution". In this paper, we critically examine the claims of ReAct style prompting for planning and sequential decision-making problems. By introducing systematic variations to the input prompt, we perform a sensitivity analysis along the original claims of ReAct. Our experiments in AlfWorld and WebShop, domains that were used in the original ReAct work, show that the performance is minimally influenced by the interleaved reasoning trace or by the content of these generated reasoning traces. Instead, the performance of LLMs is primarily driven by the unreasonably high degree of similarity between input example tasks and queries, with shockingly little ability to generalize. In addition to raising questions on claims about reasoning abilities, this lack of generalization also implicitly forces the prompt designer to provide instance-specific examples, significantly increasing the cognitive burden on the human. Our empirical results show that the perceived reasoning abilities of LLMs stem from the exemplar-query similarity and approximate retrieval rather than any inherent reasoning abilities, thereby leading to severe lack of generalization beyond the few-shot examples given in the prompts.
URL: https://openreview.net/forum?id=aFAMPSmNHR
---
Title: On The Landscape of Spoken Language Models: A Comprehensive Survey
Abstract: The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs) which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both ``pure'' language models of speech---models of the distribution of tokenized speech sequences---and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.
URL: https://openreview.net/forum?id=BvxaP3sVbA
---
Title: Continuous Tensor Relaxation for Finding Diverse Solutions in Combinatorial Optimization Problems
Abstract: Finding the optimal solution is often the primary goal in combinatorial optimization (CO). However, real-world applications frequently require diverse solutions rather than a single optimum, particularly in two key scenarios. First, when directly handling constraints is challenging, penalties are incorporated into the cost function, reformulating the problem as an unconstrained CO problem. Tuning these penalties to obtain a desirable solution is often time-consuming. Second, the optimal solution may lack practical relevance when the cost function or constraints only approximate a more complex real-world problem. To address these challenges, generating (i) penalty-diversified solutions by varying penalty intensities and (ii) variation-diversified solutions with distinct structural characteristics provides valuable insights, enabling practitioners to post-select the most suitable solution for their specific needs. However, efficiently discovering these diverse solutions is more challenging than finding a single optimal one. This study introduces Continual Tensor Relaxation Annealing (CTRA), a computationally efficient framework for unsupervised-learning (UL)-based CO solvers that generates diverse solutions within a single training run. CTRA leverages representation learning and parallelization to automatically discover shared representations, substantially accelerating the search for these diverse solutions. Numerical experiments demonstrate that CTRA outperforms existing UL-based solvers in generating these diverse solutions while significantly reducing computational costs.
URL: https://openreview.net/forum?id=ix33zd5zCw
---
Title: Node Duplication Improves Cold-start Link Prediction
Abstract: Graph Neural Networks (GNNs) are prominent in graph machine learning and have shown state-of-the-art performance in Link Prediction (LP) tasks. Nonetheless, recent studies show that GNNs struggle to produce good results on low-degree nodes despite their overall strong performance. In practical applications of LP, like recommendation systems, improving performance on low-degree nodes is critical, as it amounts to tackling the cold-start problem of improving the experiences of users with few observed interactions. In this paper, we investigate improving GNNs' LP performance on low-degree nodes while preserving their performance on high-degree nodes and propose a simple yet surprisingly effective augmentation technique called NodeDup. Specifically, NodeDup duplicates low-degree nodes and creates links between nodes and their own duplicates before following the standard supervised LP training scheme. By leveraging a ``multi-view'' perspective for low-degree nodes, NodeDup shows significant LP performance improvements on low-degree nodes without compromising any performance on high-degree nodes. Additionally, as a plug-and-play augmentation module, NodeDup can be easily applied on existing GNNs with very light computational cost. Extensive experiments show that NodeDup achieves 38.49%, 13.34%, and 6.76% relative improvements on isolated, low-degree, and warm nodes, respectively, on average across all datasets compared to GNNs and the existing cold-start methods.
URL: https://openreview.net/forum?id=hIOTzz87N9
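The NodeDup augmentation is simple enough to sketch directly from the abstract: duplicate nodes whose degree falls below a threshold and link each duplicate to its original before standard LP training. The NumPy sketch below follows that reading; the threshold value, the (2, E) edge-list layout, and the function name are illustrative assumptions, and in practice each duplicate would also reuse the original node's features.

import numpy as np

def node_dup(edge_index: np.ndarray, num_nodes: int, degree_threshold: int = 2):
    """Duplicate low-degree nodes and link each duplicate to its original.

    edge_index: (2, E) array of undirected edges, each pair stored once.
    Returns the augmented edge list and the new node count.
    """
    degrees = np.zeros(num_nodes, dtype=int)
    np.add.at(degrees, edge_index[0], 1)
    np.add.at(degrees, edge_index[1], 1)

    low_degree = np.where(degrees < degree_threshold)[0]
    duplicates = np.arange(num_nodes, num_nodes + len(low_degree))

    # New edges connect each low-degree node to its freshly created duplicate.
    new_edges = np.stack([low_degree, duplicates])
    augmented = np.concatenate([edge_index, new_edges], axis=1)
    return augmented, num_nodes + len(low_degree)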
---
Title: Designing Large Foundation Models for Efficient Training and Inference: A Survey
Abstract: This paper focuses on modern efficient training and inference technologies for foundation models and illustrates them from two perspectives: model design and system design. Both optimize LLM training and inference from different aspects to save computational resources, making LLMs more efficient, affordable, and accessible.
URL: https://openreview.net/forum?id=h2BKKBiIcu
---
Title: Transformers trained on proteins can learn to attend to Euclidean distance
Abstract: While conventional Transformers generally operate on sequence data, they can be used in conjunction with structure models, typically SE(3)-invariant or equivariant graph neural networks (GNNs), for 3D applications such as protein structure modelling. These hybrids typically involve either (1) preprocessing/tokenizing structural features as input for Transformers or (2) taking Transformer embeddings and processing them within a structural representation. However, there is evidence that Transformers can learn to process structural information on their own, such as the AlphaFold3 structural diffusion model. In this work we show that Transformers can function independently as structure models when passed linear embeddings of coordinates. We first provide a theoretical explanation for how Transformers can learn to filter attention as a 3D Gaussian with learned variance. We then validate this theory using both simulated 3D points and in the context of masked token prediction for proteins. Finally, we show that pre-training protein Transformer encoders with structure improves performance on a downstream task, yielding better performance than custom structural models. Together, this work provides a basis for using standard Transformers as hybrid structure-language models.
URL: https://openreview.net/forum?id=mU59bDyqqv
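The claim that attention can learn to act as a 3D Gaussian filter over coordinates can be illustrated directly: given per-token 3D coordinates, a Gaussian-in-distance attention matrix with a learned variance reproduces the filtering behaviour the paper argues Transformers discover. The sketch below constructs that reference filter explicitly; it is a didactic stand-in, not the paper's trained model, and the variance parameterization is an assumption.

import torch

def gaussian_distance_attention(values, coords, sigma=1.0):
    """Attention weights that fall off as a 3D Gaussian of Euclidean distance.

    values: (N, D) token features, coords: (N, 3) token coordinates.
    """
    sq_dists = torch.cdist(coords, coords) ** 2        # (N, N) pairwise squared distances
    logits = -sq_dists / (2 * sigma ** 2)               # Gaussian kernel in log space
    attn = torch.softmax(logits, dim=-1)                 # row-normalized weights
    return attn @ values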
---
Title: An Empirical Study of the Accuracy-Robustness Trade-off and Training Efficiency in Robust Self-Supervised Learning
Abstract: Self-supervised learning (SSL) has made significant strides in learning image representations, yet its principles remain partially understood, particularly in adversarial scenarios. This work explores the interplay between SSL and adversarial training (AT), focusing on whether this integration can yield robust representations that balance computational efficiency, clean accuracy, and robustness. A major challenge lies in the inherently high cost of AT, which combines an inner maximization problem (generating adversarial examples) with an outer minimization problem (training representations). This challenge is exacerbated by the extensive training epochs required for SSL convergence, which become even more demanding in adversarial settings.
Recent advances in SSL, such as Extreme-Multi-Patch Self-Supervised Learning (EMP-SSL), have demonstrated that increasing the number of patches per image instance can significantly reduce the number of training epochs. Building on this, we introduce Robust-EMP-SSL, an extension of EMP-SSL specifically designed for adversarial training scenarios. Robust-EMP-SSL is a framework that leverages multiple crops per image to enhance data diversity, integrates invariance terms with regularization to prevent collapse, and optimizes adversarial training efficiency by reducing the required training epochs. By aligning these components, Robust-EMP-SSL enables the learning of robust representations while addressing the high computational costs and accuracy trade-offs inherent in adversarial training.
This study poses a central question: "How can multiple crops or diverse patches, combined with adversarial training strategies, achieve trade-offs between computational efficiency, clean accuracy, and robustness?"
Our empirical results show that Robust-EMP-SSL not only accelerates convergence, but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming SimCLR, a widely used self-supervised baseline that, like other methods, relies on only two augmentations. Furthermore, we propose the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method, which incorporates free adversarial training into the multi-crop SSL framework. CF-AMC-SSL demonstrates the potential to enhance both clean accuracy and adversarial robustness under reduced epoch conditions, further improving efficiency.
These findings highlight the potential of Robust-EMP-SSL and CF-AMC-SSL to make SSL more practical in adversarial scenarios, paving the way for future empirical explorations and real-world applications.
URL: https://openreview.net/forum?id=WTqHDiETg5
---
Title: SAIF: Sparse Adversarial and Imperceptible Attack Framework
Abstract: Adversarial attacks hamper the decision-making ability of neural networks by perturbing the input signal. For instance, adding calculated small distortions to images can deceive a well-trained image classification network. In this work, we propose a novel attack technique called \textbf{S}parse \textbf{A}dversarial and \textbf{I}mperceptible Attack \textbf{F}ramework (SAIF). Specifically, we design imperceptible attacks that contain low-magnitude perturbations at a few pixels and leverage these sparse attacks to reveal the vulnerability of classifiers. We use the Frank-Wolfe (conditional gradient) algorithm to simultaneously optimize the attack perturbations for bounded magnitude and sparsity with $O(1/\sqrt{T})$ convergence. Empirical results show that SAIF computes highly imperceptible and interpretable adversarial examples, and largely outperforms state-of-the-art sparse attack methods on ImageNet and CIFAR-10.
URL: https://openreview.net/forum?id=YZL29eJ5j1
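The abstract describes optimizing a perturbation that is simultaneously magnitude-bounded and sparse with the Frank-Wolfe (conditional gradient) method. The sketch below is a generic Frank-Wolfe attack loop under those two constraints, not the authors' exact algorithm: the linear oracle over the set with bounded infinity-norm and at most k nonzero entries picks the k coordinates with the largest gradient magnitude, and the eps, k, step count, and single-image batch are illustrative assumptions.

import torch

def frank_wolfe_sparse_attack(model, x, y, eps=8/255, k=1000, steps=100):
    """Frank-Wolfe sketch for a sparse, magnitude-bounded untargeted attack."""
    loss_fn = torch.nn.CrossEntropyLoss()
    delta = torch.zeros_like(x)
    for t in range(steps):
        delta.requires_grad_(True)
        loss = loss_fn(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)

        # Vertex of the constraint set most aligned with the ascent direction:
        # +/- eps on the k coordinates with the largest gradient magnitude.
        flat = grad.abs().flatten()
        topk = flat.topk(k).indices
        s = torch.zeros_like(flat)
        s[topk] = eps * grad.flatten()[topk].sign()
        s = s.view_as(delta)

        gamma = 2.0 / (t + 2.0)          # standard Frank-Wolfe step size
        delta = ((1 - gamma) * delta + gamma * s).detach()
    return (x + delta).clamp(0, 1)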
---
Title: Text-to-Image Generation Via Energy-Based CLIP
Abstract: Joint Energy Models (JEMs), while drawing significant research attention, have not been successfully scaled to real-world, high-resolution datasets. We present CLIP-JEM, a novel approach extending JEMs to the multimodal vision-language domain using CLIP, integrating both generative and discriminative objectives. For the generative one, we introduce an image-text joint-energy function based on Cosine similarity in the CLIP space, training CLIP to assign low energy to real image-caption pairs and high energy otherwise. For the discriminative one, we employ contrastive adversarial loss, extending the adversarial training objective to the multimodal domain. CLIP-JEM not only generates realistic images from text but also achieves competitive results on the compositionality benchmark, outperforming leading methods with fewer parameters. Additionally, we demonstrate the superior guidance capability of CLIP-JEM by enhancing CLIP-based generative frameworks and converting unconditional diffusion models to text-based ones. Lastly, we show that our model can serve as a more robust evaluation metric for text-to-image generative tasks than CLIP.
URL: https://openreview.net/forum?id=FBmWiJXIGk
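The generative objective described, an image-text joint energy based on cosine similarity in CLIP space with low energy for real pairs and high energy otherwise, can be written compactly. The sketch below uses negative cosine similarity as the energy and a simple margin-style loss with shuffled captions as negatives; the margin, the negative-sampling scheme, and the encoder interface are assumptions, and the discriminative contrastive adversarial part is not shown.

import torch
import torch.nn.functional as F

def joint_energy(image_emb, text_emb):
    """Energy of an image-caption pair: negative cosine similarity in CLIP space."""
    return -F.cosine_similarity(image_emb, text_emb, dim=-1)

def generative_energy_loss(image_emb, text_emb, margin=0.5):
    """Push real pairs to low energy and mismatched (shuffled) pairs to high energy."""
    pos = joint_energy(image_emb, text_emb)                           # aligned pairs
    neg = joint_energy(image_emb, text_emb.roll(shifts=1, dims=0))    # shuffled captions
    return (pos + F.relu(margin - neg)).mean()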
---
Title: Theoretical Learning Performance of Graph Networks: the Impact of Jumping Connections and Layer-wise Sparsification
Abstract: Jumping connections enable Graph Convolutional Networks (GCNs) to overcome over-smoothing, while graph sparsification reduces computational demands by selecting a submatrix of the graph adjacency matrix during neighborhood aggregation. Learning GCNs with graph sparsification has shown empirical success across various applications, but a theoretical understanding of the generalization guarantees remains limited, with existing analyses ignoring either graph sparsification or jumping connections. This paper presents the first learning dynamics and generalization analysis of GCNs with jumping connections using graph sparsification.
Our analysis demonstrates that the generalization accuracy of the learned model closely approximates the highest achievable accuracy within a broad class of target functions dependent on the proposed sparse effective adjacency matrix $A^*$. Thus, graph sparsification maintains generalization performance when $A^*$ accurately models data correlations. We reveal that jumping connections lead to different sparsification requirements across layers. In a two-hidden-layer GCN, the generalization is more affected by the sparsified matrix deviations from $A^*$ of the first layer than the second layer. To the best of our knowledge, this marks the first theoretical characterization of jumping connections' role in sparsification requirements. We validate our theoretical results on benchmark datasets in deep GCNs.
URL: https://openreview.net/forum?id=Q9AkJpfJks
---
Title: Detecting Systematic Weaknesses in Vision Models along Predefined Human-Understandable Dimensions
Abstract: Slice discovery methods (SDMs) are prominent algorithms for finding systematic weaknesses in DNNs. They identify top-k semantically coherent slices/subsets of data where a DNN-under-test has low performance. To be directly useful, slices should be aligned with human-understandable and relevant dimensions, which, for example, are defined by safety and domain experts as part of the operational design domain (ODD). While SDMs can be applied effectively on structured data, their application on image data is complicated by the lack of semantic metadata. To address these issues, we present an algorithm that combines foundation models for zero-shot image classification to generate semantic metadata with methods for combinatorial search to find systematic weaknesses in images. In contrast to existing approaches, ours identifies weak slices that are in line with pre-defined human-understandable dimensions. As the algorithm includes foundation models, its intermediate and final results may not always be exact. Therefore, we include an approach to address the impact of noisy metadata. We validate our algorithm on both synthetic and real-world datasets, demonstrating its ability to recover human-understandable systematic weaknesses. Furthermore, using our approach, we identify systematic weaknesses of multiple pre-trained and publicly available state-of-the-art computer vision DNNs.
URL: https://openreview.net/forum?id=yK9pvt4nBX
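Because the search operates over predefined, human-understandable dimensions, the combinatorial part reduces to grouping images by metadata values and ranking groups by model accuracy. The pandas sketch below illustrates that step only; the column names, the minimum slice size, the maximum combination order, and the top-10 cut-off are assumptions, and the zero-shot metadata generation and noise handling from the paper are not shown.

from itertools import combinations
import pandas as pd

def weak_slices(df: pd.DataFrame, dimensions, min_support=30, max_order=2):
    """Rank metadata slices by the accuracy of the DNN-under-test.

    df has one row per image, a boolean column 'correct', and one column per
    human-understandable dimension (e.g. 'weather', 'time_of_day') filled by
    a zero-shot classifier.
    """
    results = []
    for order in range(1, max_order + 1):
        for dims in combinations(dimensions, order):
            for values, group in df.groupby(list(dims)):
                key = values if isinstance(values, tuple) else (values,)
                if len(group) >= min_support:
                    results.append((dict(zip(dims, key)),
                                    group['correct'].mean(), len(group)))
    # Lowest-accuracy slices are the candidate systematic weaknesses.
    return sorted(results, key=lambda r: r[1])[:10]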
---
Title: Diverse Condensed Data Generation via Class Preserving Distribution Matching
Abstract: Large-scale datasets for training many real-world machine learning models pose significant computational resource challenges. One approach to mitigate this is via data condensation, which aims at learning a small dataset but still sufficiently capturing the rich information in the original one. Most existing approaches learn the condensed dataset and task-related model parameters (e.g., classifier) in a bi-level meta-learning way. The recently proposed distribution matching (DM), however, avoids the expensive bi-level optimization but ignores task-related models. This work proposes a novel class preserving DM framework consisting of two key components. The first one is responsible for capturing the original data distribution of each class based on energy distance, which can encourage the diversity in the generated synthetic data. The other is a classifier-critic constraint, which forces the learned synthetic samples to fit pre-trained task-related models, such as an off-the-shelf classifier. Designing the optimization loss in this way, we can generate more diverse and class preserving distilled data without the bi-level optimization. Extensive experiments reveal that our method can produce more effective condensed data for downstream tasks with less training cost and can also be successfully applied to de-biased dataset condensation.
URL: https://openreview.net/forum?id=QOrzmDQYou
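The two components named in the abstract, class-wise distribution matching with the energy distance and a classifier-critic constraint, can be combined into a single loss. The PyTorch sketch below is a schematic reading of that combination rather than the paper's exact objective; the feature extractor, the dictionary of per-class real features, and the critic weight are assumptions.

import torch
import torch.nn.functional as F

def energy_distance(x, y):
    """Empirical energy distance between two batches of feature vectors."""
    d_xy = torch.cdist(x, y).mean()
    d_xx = torch.cdist(x, x).mean()
    d_yy = torch.cdist(y, y).mean()
    return 2 * d_xy - d_xx - d_yy

def condensation_loss(syn_images, syn_labels, real_feats_by_class,
                      feature_net, frozen_classifier, critic_weight=1.0):
    """Class-wise energy-distance matching plus a classifier-critic term.

    real_feats_by_class: dict {class_id: tensor of real features}; assumes
    every class has at least one synthetic sample in the batch.
    """
    syn_feats = feature_net(syn_images)
    dm_loss = sum(energy_distance(syn_feats[syn_labels == c], feats)
                  for c, feats in real_feats_by_class.items())
    critic_loss = F.cross_entropy(frozen_classifier(syn_feats), syn_labels)
    return dm_loss + critic_weight * critic_loss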
---
Title: TSkips: Efficiency Through Explicit Temporal Delay Connections in Spiking Neural Networks
Abstract: Spiking Neural Networks (SNNs) with their bio-inspired Leaky Integrate-and-Fire (LIF) neurons inherently capture temporal information. This makes them well-suited for sequential tasks like processing event-based data from Dynamic Vision Sensors (DVS) and event-based speech tasks. Harnessing the temporal capabilities of SNNs requires mitigating vanishing spikes during training, capturing spatio-temporal patterns and enhancing precise spike timing. To address these challenges, we propose _TSkips_, augmenting SNN architectures with forward and backward skip connections that incorporate explicit temporal delays. These connections capture long-term spatio-temporal dependencies and facilitate better spike flow over long sequences. The introduction of _TSkips_ creates a vast search space of possible configurations, encompassing skip positions and time delay values. To efficiently navigate this search space, this work leverages training-free Neural Architecture Search (NAS) to identify optimal network structures and corresponding delays. We demonstrate the effectiveness of our approach on four event-based datasets: DSEC-flow for optical flow estimation, DVS128 Gesture for hand gesture recognition and Spiking Heidelberg Digits (SHD) and Spiking Speech Commands (SSC) for speech recognition. Our method achieves significant improvements across these datasets: up to 18% reduction in Average Endpoint Error (AEE) on DSEC-flow, 8% increase in classification accuracy on DVS128 Gesture, and up to ~8% and ~16% higher classification accuracy on SHD and SSC, respectively.
URL: https://openreview.net/forum?id=hwz32S06G4
---
Title: Learning Federated Neural Graph Databases for Answering Complex Queries from Distributed Knowledge Graphs
Abstract: The increasing demand for deep learning-based foundation models has highlighted the importance of efficient data retrieval mechanisms. Neural graph databases (NGDBs) offer a compelling solution, leveraging neural spaces to store and query graph-structured data, thereby enabling LLMs to access precise, contextually relevant information. However, current NGDBs are constrained to single-graph operation, limiting their capacity to reason across multiple, distributed graphs. Furthermore, the lack of support for multi-source graph data in existing NGDBs hinders their ability to capture the complexity and diversity of real-world data. In many applications, data is distributed across multiple sources, and the ability to reason across these sources is crucial for making informed decisions. This limitation is particularly problematic when dealing with sensitive graph data, as directly sharing and aggregating such data poses significant privacy risks. As a result, many applications that rely on NGDBs are forced to choose between compromising data privacy and sacrificing the ability to reason across multiple graphs. To address these limitations, we propose to learn Federated Neural Graph DataBases (FedNGDB), a pioneering systematic framework that empowers privacy-preserving reasoning over multi-source graph data. FedNGDB leverages federated learning to collaboratively learn graph representations across multiple sources, enriching relationships between entities and improving the overall quality of the graph data. Unlike existing methods, FedNGDB can handle complex graph structures and relationships, making it suitable for various downstream tasks. We evaluate FedNGDB on three real-world datasets, demonstrating its effectiveness in retrieving relevant information from multi-source graph data while keeping sensitive information secure on local devices. Our results show that FedNGDB can efficiently retrieve answers to cross-graph queries, making it a promising approach for LLMs and other applications that rely on efficient data retrieval mechanisms.
URL: https://openreview.net/forum?id=3K1LRetR6Y
---
Title: Unsupervised Anomaly Detection through Mass Repulsing Optimal Transport
Abstract: Detecting anomalies in datasets is a longstanding problem in machine learning. In this context, an anomaly is defined as a sample that significantly deviates from the remaining data. Meanwhile, Optimal Transport (OT) is a field of mathematics concerned with moving mass between two probability distributions with the least effort. In classical OT, the optimal transportation strategy of a distribution to itself is the identity, i.e., each sample keeps its mass. In this paper, we tackle anomaly detection by forcing samples to displace their mass, while keeping the least effort objective. We call this new transportation problem Mass Repulsing Optimal Transport (MROT). Naturally, samples lying in low-density regions of space will be forced to displace mass very far, incurring a higher transportation cost. In contrast, samples in high-density regions are able to send their mass just outside an \emph{exclusion zone}. We use these concepts to design a new anomaly score. Through a series of experiments on existing benchmarks and fault detection problems, we show that our algorithm improves over existing methods.
URL: https://openreview.net/forum?id=PPGJ3EvENv
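One rough way to read the mass-repulsing construction: compute pairwise costs, forbid transport inside each sample's exclusion zone (including keeping mass on itself), solve the transport problem from the empirical distribution to itself, and score each sample by its transport cost. The sketch below follows that reading using the POT library's exact solver; the quantile-based exclusion radius, the large-but-finite penalty used to forbid nearby transport, and the scoring normalization are all assumptions rather than the paper's exact formulation.

import numpy as np
import ot  # POT: Python Optimal Transport

def mrot_anomaly_scores(X, exclusion_quantile=0.05):
    """Score samples by how far they must ship their mass under a masked cost."""
    n = len(X)
    C = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    radius = np.quantile(C[C > 0], exclusion_quantile)

    masked = C.copy()
    masked[C < radius] = C.max() * 10.0   # forbid keeping mass nearby (incl. self)

    a = np.full(n, 1.0 / n)
    plan = ot.emd(a, a, masked)           # exact OT plan under the masked cost
    return (plan * C).sum(axis=1) * n     # per-sample cost in the original metric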
---
Title: Out-of-Distribution Detection with Overlap Index
Abstract: Out-of-distribution (OOD) detection is crucial for the deployment of machine learning models in the open world. While existing OOD detectors are effective in identifying OOD samples that deviate significantly from in-distribution (ID) data, they often come with trade-offs. For instance, deep OOD detectors usually suffer from high computational costs, require tuning hyperparameters, and have limited interpretability, whereas traditional OOD detectors may have a low accuracy on large high-dimensional datasets. To address these limitations, we propose a novel and effective OOD detection approach that employs an overlap index (OI)-based confidence score function to evaluate the likelihood of a given input belonging to the same distribution as the available ID samples. The proposed OI-based confidence score function is non-parametric, lightweight, and easy to interpret, hence providing strong flexibility and generality. Extensive empirical evaluations indicate that our OI-based OOD detector is competitive with state-of-the-art OOD detectors in terms of detection accuracy on a wide range of datasets while incurring lower computation and memory costs. Lastly, we show that the proposed OI-based confidence score function inherits nice properties from OI (e.g., insensitivity to small distributional variations and robustness against Huber $\epsilon$-contamination) and is a versatile tool for estimating OI and model accuracy in specific contexts.
URL: https://openreview.net/forum?id=bDHJhFEgcA
---
Title: Spaced Scheduling for Large Language Model Training
Abstract: Recent breakthroughs in deep learning have accelerated progress toward increasingly capable large language models (LLMs), even sparking discussions about the path to Artificial General Intelligence (AGI).
Yet, current LLM training pipelines continue to depend on heuristics and human-driven empirical analysis to curate data. In practice, more sophisticated data selection methods often incur high costs, show limited adaptability, or fail to surpass simple random baselines consistently across models and datasets.
In this work, we propose *Spaced Scheduled Training* (SST), a novel adaptive data selection strategy that prioritizes training examples based on per-example perplexity computed from the model’s own evolving parameters.
By obviating the need for external reference models, SST customizes data selection to the model’s unique characteristics—including its pre-training data composition—and eliminates biases introduced by these external models.
Extensive experiments on eight LLMs (0.5B to 32B parameters) show that SST consistently outperforms state-of-the-art selection approaches like DEITA and InsTag on the Open LLM Leaderboard.
For instance, with Qwen2.5-32B and a 30k-example data budget, SST achieves a 42.75% Open LLM Leaderboard score, surpassing both the top data-selection baseline (38.21%) and a baseline using 70% more data (39.58%).
We further present a theoretical framework to assess computational overhead induced by a model-based selection method, showing that SST remains efficient in practical scenarios, and propose strategies to mitigate the overhead in worst-case scenarios.
Our findings underscore the potential of model-informed dynamic data selection, offering an efficient, adaptable, and cost-effective approach.
We release our training code, trained models, and data mixes in our public repository.
URL: https://openreview.net/forum?id=p0KTYl2B9T
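The core scoring step, per-example perplexity under the model's own current parameters, is easy to sketch; how SST turns those scores into a spaced schedule is not spelled out in the abstract, so the selection rule below (simply favouring high-perplexity examples up to a budget) is only an illustrative assumption, as is the Hugging Face-style model(ids).logits interface.

import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity_scores(model, tokenized_examples, device="cuda"):
    """Score each example by perplexity under the model's current parameters."""
    scores = []
    for input_ids in tokenized_examples:            # one example at a time for clarity
        ids = input_ids.to(device).unsqueeze(0)
        logits = model(ids).logits
        # Next-token loss: predict token t+1 from tokens up to t.
        loss = F.cross_entropy(logits[0, :-1], ids[0, 1:])
        scores.append(loss.exp().item())
    return scores

def select_batch(model, examples, budget):
    scores = perplexity_scores(model, examples)
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in order[:budget]]     # favour high-perplexity examples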
---
Title: Modeling Human Beliefs about AI Behavior for Scalable Oversight
Abstract: Contemporary work in AI alignment often relies on human feedback to teach AI systems human preferences and values. Yet as AI systems grow more capable, human feedback becomes increasingly unreliable. This raises the problem of scalable oversight: How can we supervise AI systems that exceed human capabilities? In this work, we propose to model the human evaluator's beliefs about the AI system's behavior to better interpret the human's feedback. We formalize human belief models and theoretically analyze their role in inferring human values. We then characterize the remaining ambiguity in this inference and conditions for which the ambiguity disappears. To mitigate reliance on exact belief models, we then introduce the relaxation of human belief model covering. Finally, we propose using foundation models to construct covering belief models, providing a new potential approach to scalable oversight.
URL: https://openreview.net/forum?id=gSJfsdQnex
---
Title: Expressive Pooling for Graph Neural Networks
Abstract: Considerable efforts have been dedicated to exploring methods that enhance the expressiveness of graph neural networks. Current endeavors primarily focus on modifying the message-passing process to overcome limitations imposed by the Weisfeiler-Leman test, often at the expense of increasing computational cost. In practical applications, message-passing layers are interleaved with pooling layers for graph-level tasks, enabling the learning of increasingly abstract and coarser representations of input graphs. In this work, we formally prove two directions that allow pooling methods to increase the expressive power of a graph neural network while keeping the message-passing method unchanged. We systematically assign eight frequently used pooling operators to our theoretical conditions for increasing expressivity and introduce a novel pooling method XP, short for eXpressive Pooling, as an additional simple method that satisfies our theoretical conditions. Experiments conducted on the BREC dataset confirm that those pooling methods that satisfy our conditions empirically increase the expressivity of graph neural networks.
URL: https://openreview.net/forum?id=xGADInGWMt
---
Title: Registers in Small Vision Transformers: A Reproducibility Study of Vision Transformers Need Registers
Abstract: Recent work has shown that Vision Transformers (ViTs) can produce “high-norm” artifact tokens in attention maps. These artifacts disproportionately accumulate global information, can degrade performance, and reduce interpretability in these models. Darcet et al. (2024) proposed registers—auxiliary learnable tokens—to mitigate these artifacts. In this reproducibility study, we verify whether these improvements extend to smaller ViTs. Specifically, we examine whether high-norm tokens appear in a DeiT-III Small model, whether registers reduce these artifacts, and how registers influence local and global feature representation. Our results confirm that smaller ViTs also exhibit high-norm tokens and that registers partially alleviate them, improving interpretability. Although the overall performance gains are modest, these findings reinforce the utility of registers in enhancing ViTs while highlighting open questions about their varying effectiveness across different inputs and tasks. Our code is available at https://anonymous.4open.science/r/regs-small-vits.
URL: https://openreview.net/forum?id=5JflRlCt3Q
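Registers are implemented as a few extra learnable tokens that are added to the token sequence before the encoder and discarded afterwards. A minimal PyTorch wrapper in that spirit is sketched below; the wrapper interface (an encoder mapping a (B, N, D) token sequence to (B, N, D)) and the number of registers are assumptions for illustration.

import torch
import torch.nn as nn

class RegisterWrapper(nn.Module):
    """Append learnable register tokens to a ViT's token sequence.

    Registers are discarded after encoding, as in Darcet et al. (2024).
    """
    def __init__(self, encoder, dim, num_registers=4):
        super().__init__()
        self.encoder = encoder
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        nn.init.trunc_normal_(self.registers, std=0.02)

    def forward(self, tokens):                       # tokens: (B, N, D) = [CLS] + patches
        b = tokens.shape[0]
        regs = self.registers.expand(b, -1, -1)
        out = self.encoder(torch.cat([tokens, regs], dim=1))
        return out[:, :tokens.shape[1]]              # keep CLS + patch tokens only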
---
Title: Fully Automatic Neural Network Reduction for Formal Verification
Abstract: Formal verification of neural networks is essential before their deployment in safety-critical applications.
However, existing methods for formally verifying neural networks are not yet scalable enough to handle practical problems under strict time constraints.
We address this challenge by introducing a fully automatic and sound reduction of neural networks using reachability analysis.
The soundness ensures that the verification of the reduced network entails the verification of the original network.
Our sound reduction approach is applicable to neural networks with any type of element-wise activation function, such as ReLU, sigmoid, and tanh.
The network reduction is computed on the fly while simultaneously verifying the original network and its specifications.
All parameters are automatically tuned to minimize the network size without compromising verifiability.
We further show the applicability of our approach to convolutional neural networks by explicitly exploiting similar neighboring pixels.
Our evaluation shows that our approach reduces large neural networks to a fraction of the original number of neurons and thus shortens the verification time to a similar degree.
URL: https://openreview.net/forum?id=gmflcWlVMl
---
Title: CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Abstract: Large Language Models (LLMs) have revolutionized code generation but require significant resources and tend to over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs is a cost-effective alternative, yet standard supervised approaches rely solely on correct examples, overlooking valuable insights from failures. We introduce CodeLutra, a new framework that leverages both correct and incorrect code attempts. Instead of purely instructing with correct solutions, CodeLutra uses iterative preference-based refinement, comparing successful and failed outputs to better approximate desired results. This process narrows the performance gap with state-of-the-art, larger models, without requiring massive datasets or auxiliary models. For example, on a challenging data science coding task, using only 500 samples improved Llama-3-8B’s accuracy from 28.2% to 48.6%, approaching GPT-4’s level. By capitalizing on both successes and mistakes, CodeLutra offers a scalable, efficient path to high-quality code generation, making smaller open-source models more competitive with leading closed-source alternatives.
URL: https://openreview.net/forum?id=IGsEgWM4to
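The abstract describes iterative preference-based refinement over successful and failed code attempts. One common way to instantiate such a preference step is a DPO-style loss over (correct, failed) pairs, sketched below; this is an illustrative stand-in rather than necessarily CodeLutra's exact objective, and the summed log-probability inputs and beta value are assumptions.

import torch.nn.functional as F

def preference_loss(policy_logps_correct, policy_logps_failed,
                    ref_logps_correct, ref_logps_failed, beta=0.1):
    """DPO-style loss that pushes the model toward correct code attempts.

    Each argument is the summed log-probability of a full code completion
    under the trainable policy or the frozen reference model.
    """
    policy_margin = policy_logps_correct - policy_logps_failed
    ref_margin = ref_logps_correct - ref_logps_failed
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()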
---
Title: Using the Path of Least Resistance to Explain Deep Networks
Abstract: Integrated Gradients (IG), a widely used axiomatic path-based attribution method, assigns importance scores to input features by integrating model gradients along a straight path from a baseline to the input. While effective in some cases, we show that straight paths can lead to flawed attributions. In this paper, we identify the cause of these misattributions and propose an alternative approach that treats the input space as a Riemannian manifold, computing attributions by integrating gradients along geodesics. We call this method Geodesic Integrated Gradients (GIG).
To approximate geodesic paths, we introduce two techniques: a $k$-Nearest Neighbours-based approach for smaller models and a Stochastic Variational Inference-based method for larger ones. Additionally, we propose a new axiom, Strong Completeness, extending the axioms satisfied by IG. We show that this property is desirable for attribution methods and that GIG is the only method that satisfies it.
Through experiments on both synthetic and real-world data, we demonstrate that GIG outperforms existing explainability methods, including IG.
URL: https://openreview.net/forum?id=M6cL4nWOqK
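For the k-Nearest-Neighbours variant, one plausible reading is: build a kNN graph over reference data, take the shortest path from the baseline to the input as the geodesic approximation, and run straight-line integrated gradients segment by segment along that path. The sketch below implements that reading; the graph construction, path recovery, step counts, and single-output target indexing are assumptions, and the Stochastic Variational Inference variant for larger models is not shown.

import torch
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_ig(model, data, baseline_idx, input_idx, target, k=10, steps=32):
    """Approximate geodesic IG: integrate gradients along a kNN shortest path.

    data: (N, D) tensor of reference points containing baseline and input.
    """
    graph = kneighbors_graph(data.numpy(), k, mode="distance")
    _, predecessors = shortest_path(graph, directed=False,
                                    indices=baseline_idx, return_predecessors=True)

    # Recover the node sequence baseline -> input from the predecessor array.
    path, node = [], input_idx
    while node != -9999 and node != baseline_idx:
        path.append(node)
        node = predecessors[node]
    path = [baseline_idx] + path[::-1]

    attribution = torch.zeros(data.shape[1])
    for a, b in zip(path[:-1], path[1:]):            # straight-line IG per segment
        start, end = data[a], data[b]
        for alpha in torch.linspace(0, 1, steps):
            point = (start + alpha * (end - start)).clone().requires_grad_(True)
            model(point.unsqueeze(0))[0, target].backward()
            attribution += (end - start) * point.grad / steps
    return attribution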
---
Title: Efficient Representations for Whole Slide Image Classification
Abstract: The advent of digital pathology has transformed diagnostic and research capabilities, offering unprecedented insights through the analysis of high-resolution whole slide images (WSIs). However, the gigapixel size and complexity of WSIs present significant computational challenges. To address this, we propose a scalable and efficient pipeline for WSI classification that integrates patch-based feature extraction, clustering, and advanced representation techniques. Our methodology begins by extracting features from patches identified based on their pathological significance using deep feature embeddings from a pre-trained convolutional neural network (CNN) fine-tuned on a histology dataset under noisy labels. This approach ensures that the extracted features are robust and tailored to histopathological patterns despite the inherent noise in the training data. These embeddings are then clustered using K-means clustering to group semantically similar regions. To represent these clusters effectively, we experimented with two strategies: first, using the cluster mean to summarize each cluster; and second, employing Fisher vector (FV) encoding to model the distribution of patch embeddings within clusters using a parametric Gaussian mixture model (GMM). The resulting high-dimensional feature vector encapsulates both local and global tissue structures, enabling robust classification of WSIs. This approach significantly reduces computational overhead while maintaining high accuracy, as validated across multiple datasets. Our innovative framework combines the precision of Fisher vectors with the scalability of clustering, establishing an efficient and precise solution for WSI analysis that advances the practical application of digital pathology in medical diagnostics and research.
URL: https://openreview.net/forum?id=vKLH4PDN7V
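Of the two cluster-representation strategies mentioned, the cluster-mean variant is straightforward to sketch: cluster the patch embeddings with K-means and concatenate the per-cluster means into a fixed-length slide descriptor. The scikit-learn sketch below shows only that step (the Fisher vector variant would replace the means with GMM gradient statistics); the number of clusters and the assumption that patch embeddings are precomputed are illustrative choices.

import numpy as np
from sklearn.cluster import KMeans

def slide_representation(patch_embeddings, n_clusters=8):
    """Cluster-mean slide descriptor from precomputed patch embeddings.

    patch_embeddings: (num_patches, dim) array from a pre-trained CNN.
    Returns a fixed-length (n_clusters * dim,) slide-level feature vector.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(patch_embeddings)
    dim = patch_embeddings.shape[1]
    means = np.zeros((n_clusters, dim))
    for c in range(n_clusters):
        members = patch_embeddings[km.labels_ == c]
        if len(members):
            means[c] = members.mean(axis=0)          # summarize each cluster by its mean
    return means.reshape(-1)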
---
Title: Reproducibility Study of Boosting Adversarial Transferability via Gradient Relevance Attack
Abstract: This paper presents a reproducibility study of "Boosting Adversarial Transferability via Gradient Relevance Attack" by Zhu et al., a paper that introduces the Gradient Relevance Attack (GRA) method. GRA enhances the transferability of adversarial examples across different machine learning models, improving black-box adversarial attacks. We successfully replicated the key experiments, focusing on the gradient relevance framework and the decay indicator. Our methodology involved reimplementing the GRA algorithm and evaluating it on the same set of models used in the original paper. We achieved attack success rates comparable to those of the original article, within a margin of 1%, confirming the effectiveness of the GRA method. Additionally, we extended the original work by introducing a dynamic learning rate (α) that adjusts the step size based on the cosine similarity between the current momentum and the average gradient. An adjustment factor (γ) of 0.01, with thresholds of 0.75 and 0.25, modulates the step size. Our findings suggest that this adaptive step size mechanism can lead to faster convergence and potentially improved attack performance in certain scenarios. This study validates the GRA method and explores avenues for further improving adversarial transferability through dynamic parameter adjustments.
URL: https://openreview.net/forum?id=cu926HOF7F
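The dynamic step-size extension is described concretely enough to sketch: compare the current momentum with the average gradient via cosine similarity and nudge the step size using the stated adjustment factor of 0.01 and thresholds of 0.75 and 0.25. The exact update rule is not given in the abstract, so the multiplicative adjustment below is an assumption.

import torch.nn.functional as F

def adjust_step_size(alpha, momentum, avg_gradient, gamma=0.01,
                     upper=0.75, lower=0.25):
    """Adapt the attack step size from momentum/gradient agreement.

    Cosine similarity above `upper` (directions agree) increases alpha;
    below `lower` (directions conflict) it decreases alpha.
    """
    sim = F.cosine_similarity(momentum.flatten(), avg_gradient.flatten(), dim=0)
    if sim > upper:
        alpha = alpha * (1 + gamma)
    elif sim < lower:
        alpha = alpha * (1 - gamma)
    return alpha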
---
Title: On the reproducibility of "Discovering and Mitigating Visual Biases through Keyword Explanation"
Abstract: A computer vision model in machine learning is only as good as its training data (Moseley, 2024). Visual biases and spurious correlations in training data can manifest in model decision-making, potentially leading to discrimination against sensitive groups (Mehrabi et al., 2021). To address this issue, various methods have been proposed to automate bias discovery and use these biases to train bias-aware computer vision models. However, one major drawback of these methods is their lack of transparency, as the discovered biases are often not human-interpretable. To overcome this, Kim et al. introduced a Bias-to-Text (B2T) framework that identifies visual biases as keywords and expresses them in natural language. This paper aims to reproduce their findings and expand upon the evaluation methods. The central claims that the authors make are that B2T (i) can discover both novel and known biases, (ii) can facilitate debiased training of image classifiers, and (iii) can be deployed across different classifier architectures, such as Transformer- and convolutional neural network (CNN)-based models. We successfully reproduce their main claims and extend their findings by analyzing whether novel bias keywords discovered by B2T represent actual biases. Additionally, we conduct further robustness experiments, leading us to conclude that the framework not only discovers biases in data, but also is sensitive to changes in the underlying classification model, highlighting a future research direction. Our code is publicly available at https://anonymous.4open.science/r/B2T-Repr-898B.
URL: https://openreview.net/forum?id=zKbtAPGEh5
---
Title: On the Shift Invariance of Max Pooling Feature Maps in Convolutional Neural Networks
Abstract: This paper focuses on improving the mathematical interpretability of convolutional neural networks (CNNs) in the context of image classification. Specifically, we tackle the instability issue arising in their first layer, which tends to learn parameters that closely resemble oriented band-pass filters when trained on datasets like ImageNet. Subsampled convolutions with such Gabor-like filters are prone to aliasing, causing sensitivity to small input shifts. In this context, we establish conditions under which the max pooling operator approximates a complex modulus, which is nearly shift invariant. We then derive a measure of shift invariance for subsampled convolutions followed by max pooling. In particular, we highlight the crucial role played by the filter's frequency and orientation in achieving stability. We experimentally validate our theory by considering a deterministic feature extractor based on the dual-tree complex wavelet packet transform, a particular case of discrete Gabor-like decomposition.
URL: https://openreview.net/forum?id=hKsvQMUs4r
---