Reproducibility Certification: Reconciling Kaplan and Chinchilla Scaling Laws
Tim Pearce, Jinyeop Song
https://openreview.net/forum?id=NLoaLyuUUF
---
Accepted papers
===============
Title: Expressive Higher-Order Link Prediction through Hypergraph Symmetry Breaking
Authors: Simon Zhang, Cheng Xin, Tamal K. Dey
Abstract: A hypergraph consists of a set of nodes along with a collection of subsets of the nodes called hyperedges. Higher order link prediction is the task of predicting the existence of a missing hyperedge in a hypergraph. A hyperedge representation learned for higher order link prediction is fully expressive when it does not lose distinguishing power up to an isomorphism. Many existing hypergraph representation learners are bounded in expressive power by the Generalized Weisfeiler Lehman-1 (GWL-1) algorithm, a generalization of the Weisfeiler Lehman-1 (WL-1) algorithm. The WL-1 algorithm can approximately decide whether two graphs are isomorphic. However, GWL-1 has limited expressive power. In fact, GWL-1 can only view the hypergraph as a collection of trees rooted at each of the nodes in the hypergraph. Furthermore, message passing on hypergraphs can already be computationally expensive, particularly with limited GPU device memory. To address these limitations, we devise a preprocessing algorithm that can identify certain regular subhypergraphs exhibiting symmetry with respect to GWL-1. Our preprocessing algorithm runs once with time complexity linear in the size of the input hypergraph. During training, we randomly drop the hyperedges of the subhypergraphs identified by the algorithm and add covering hyperedges to break symmetry. We show that our method improves the expressivity of GWL-1. Our extensive experiments also demonstrate the effectiveness of our approach for higher-order link prediction on both graph and hypergraph datasets with negligible change in computation.
URL: https://openreview.net/forum?id=oG65SjZNIF
---
Title: How does over-squashing affect the power of GNNs?
Authors: Francesco Di Giovanni, T. Konstantin Rusch, Michael Bronstein, Andreea Deac, Marc Lackenby, Siddhartha Mishra, Petar Veličković
Abstract: Graph Neural Networks (GNNs) are the state-of-the-art model for machine learning on graph-structured data. The most popular class of GNNs operate by exchanging information between adjacent nodes, and are known as Message Passing Neural Networks (MPNNs). While understanding the expressive power of MPNNs is a key question, existing results typically consider settings with uninformative node features. In this paper, we provide a rigorous analysis to determine which function classes of node features can be learned by an MPNN of a given capacity. We do so by measuring the level of *pairwise interactions* between nodes that MPNNs allow for. This measure provides a novel quantitative characterization of the so-called over-squashing effect, which is observed to occur when a large volume of messages is aggregated into fixed-size vectors. Using our measure, we prove that, to guarantee sufficient communication between pairs of nodes, the capacity of the MPNN must be large enough, depending on properties of the input graph structure, such as commute times. For many relevant scenarios, our analysis results in impossibility statements in practice, showing that *over-squashing hinders the expressive power of MPNNs*. Our theory also holds for geometric graphs and hence extends to equivariant MPNNs on point clouds. We validate our analysis through extensive controlled experiments and ablation studies.
URL: https://openreview.net/forum?id=KJRoQvRWNs
---
Title: Optimized Tradeoffs for Private Prediction with Majority Ensembling
Authors: Shuli Jiang, Qiuyi Zhang, Gauri Joshi
Abstract: We study a classical problem in private prediction, the problem of computing an $(m\epsilon, \delta)$-differentially private majority of $K$ $(\epsilon, \Delta)$-differentially private algorithms for $1 \leq m \leq K$ and $1 > \delta \geq \Delta \geq 0$. Standard methods such as subsampling or randomized response are widely used, but do they provide optimal privacy-utility tradeoffs? To answer this, we introduce the Data-dependent Randomized Response Majority (DaRRM) algorithm. It is parameterized by a data-dependent noise function $\gamma$, and enables efficient utility optimization over the class of all private algorithms, encompassing those standard methods. Surprisingly, we show that an $(m\epsilon, \delta)$-private majority algorithm with maximal utility can be computed tractably for any $m \leq K$ by a novel structural result that reduces the infinitely many privacy constraints into a polynomial set. In some settings, we show that DaRRM provably enjoys a privacy gain of a factor of 2 over common baselines, with fixed utility. Lastly, we demonstrate the strong empirical effectiveness of our first-of-its-kind privacy-constrained utility optimization for ensembling labels for private prediction from private teachers in image classification. Notably, our DaRRM framework with an optimized $\gamma$ exhibits substantial utility gains when compared against several baselines.
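As a rough illustration (not from the paper), the randomized-response majority baseline that DaRRM generalizes can be sketched as follows; the parameter name `gamma` and the uniform fallback are assumptions for this sketch, whereas DaRRM itself learns a *data-dependent* noise function $\gamma$:

```python
import random

def rr_majority(votes, gamma):
    """Randomized-response majority over K private binary votes.

    With probability gamma, release the true majority vote; otherwise
    release a uniformly random bit. A constant gamma recovers the
    standard randomized-response baseline; DaRRM instead makes gamma
    a function of the observed votes to optimize utility.
    """
    majority = int(sum(votes) > len(votes) / 2)
    if random.random() < gamma:
        return majority
    return random.randint(0, 1)
```
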
URL: https://openreview.net/forum?id=dwJluAakM8
---
Title: Dataset Distillation via Curriculum Data Synthesis in Large Data Era
Authors: Zeyuan Yin, Zhiqiang Shen
Abstract: Dataset distillation or condensation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained more efficiently while achieving decent performance when evaluated on the original test data distribution. Previous decoupled methods like SRe$^2$L simply use a unified gradient update scheme for synthesizing data from Gaussian noise. However, we notice that the initial several update iterations determine the final outline of the synthesis, so an improper gradient update strategy may dramatically affect the final generation quality. To address this, we introduce a simple yet effective global-to-local gradient refinement approach enabled by curriculum data augmentation ($\texttt{CDA}$) during data synthesis. The proposed framework achieves the current published highest accuracy on both large-scale ImageNet-1K and 21K with 63.2% under IPC (Images Per Class) 50 and 36.1% under IPC 20, using a regular input resolution of 224$\times$224 with faster convergence speed and less synthesis time. The proposed model outperforms the current state-of-the-art methods like SRe$^2$L, TESLA, and MTT by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to its full-data training counterparts to less than an absolute 15%. Moreover, this work represents the inaugural success in dataset distillation on the larger-scale ImageNet-21K dataset under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget is available at https://github.com/VILA-Lab/SRe2L/tree/main/CDA.
URL: https://openreview.net/forum?id=PlaZD2nGCl
---
Title: Active Learning for Level Set Estimation Using Randomized Straddle Algorithms
Authors: Yu Inatsu, Shion Takeno, Kentaro KUTSUKAKE, Ichiro Takeuchi
Abstract: Level set estimation (LSE), the problem of identifying the set of input points where a function takes a value above (or below) a given threshold, is important in practical applications. When the function is expensive to evaluate and black-box, the straddle algorithm, a representative heuristic for LSE based on Gaussian process models, and its extensions with theoretical guarantees have been developed. However, many existing methods include a confidence parameter, $\beta^{1/2}_t$, that must be specified by the user. Methods that choose $\beta^{1/2}_t$ heuristically do not provide theoretical guarantees. In contrast, theoretically guaranteed values of $\beta^{1/2}_t$ need to be increased depending on the number of iterations and candidate points; they are conservative and do not perform well in practice. In this study, we propose a novel method, the randomized straddle algorithm, in which $\beta_t$ in the straddle algorithm is replaced by a random sample from the chi-squared distribution with two degrees of freedom. The confidence parameter in the proposed method does not require adjustment, does not depend on the number of iterations and candidate points, and is not conservative. Furthermore, we show that the proposed method has theoretical guarantees that depend on the sample complexity and the number of iterations. Finally, we validate the applicability of the proposed method through numerical experiments using synthetic and real data.
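A minimal sketch of one acquisition step, assuming the classical straddle score $\beta^{1/2}\sigma(x) - |\mu(x) - \theta|$ with $\beta_t$ drawn from a chi-squared distribution with two degrees of freedom as described above; the exact acquisition form and tie-breaking in the paper may differ:

```python
import numpy as np

def randomized_straddle(mu, sigma, threshold, rng):
    """One acquisition step of a straddle-style LSE rule.

    mu, sigma : posterior GP mean / standard deviation at candidate points
    threshold : the level-set threshold theta
    The classical straddle score is beta^{1/2} * sigma - |mu - theta|;
    here beta is resampled from a chi-squared distribution with 2
    degrees of freedom, so no confidence parameter needs tuning.
    """
    beta = rng.chisquare(df=2)
    score = np.sqrt(beta) * sigma - np.abs(mu - threshold)
    return int(np.argmax(score))  # index of the next point to evaluate
```
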
URL: https://openreview.net/forum?id=N8M2yqRicS
---
Title: UniCtrl: Improving the Spatiotemporal Consistency of Text-to-Video Diffusion Models via Training-Free Unified Attention Control
Authors: Tian Xia, Xuweiyi Chen, Sihan Xu
Abstract: Video Diffusion Models have been developed for video generation, usually integrating text and image conditioning to enhance control over the generated content. Despite the progress, ensuring consistency across frames remains a challenge, particularly when using text prompts as control conditions. To address this problem, we introduce UniCtrl, a novel, plug-and-play method that is universally applicable to improve the spatiotemporal consistency and motion diversity of videos generated by text-to-video models without additional training. UniCtrl ensures semantic consistency across different frames through cross-frame self-attention control, and meanwhile, enhances the motion quality and spatiotemporal consistency through motion injection and spatiotemporal synchronization. Our experimental results demonstrate UniCtrl's efficacy in enhancing various text-to-video models, confirming its effectiveness and universality.
URL: https://openreview.net/forum?id=x2uFJ79OjK
---
Title: Audio-Visual Dataset Distillation
Authors: Saksham Singh Kushwaha, Siva Sai Nagender Vasireddy, Kai Wang, Yapeng Tian
Abstract: In this article, we introduce \textit{audio-visual dataset distillation}, a task to construct a smaller yet representative synthetic audio-visual dataset that maintains the cross-modal semantic association between audio and visual modalities. Dataset distillation techniques have primarily focused on image classification. However, with the growing capabilities of audio-visual models and the vast datasets required for their training, it is necessary to explore distillation methods beyond the visual modality. Our approach builds upon the foundation of Distribution Matching (DM), extending it to handle the unique challenges of audio-visual data. A key challenge is to jointly learn synthetic data that distills both the modality-wise information and natural alignment from real audio-visual data. We introduce a vanilla audio-visual distribution matching framework that separately trains visual-only and audio-only DM components, enabling us to investigate the effectiveness of audio-visual integration and various multimodal fusion methods. To address the limitations of unimodal distillation, we propose two novel matching losses: implicit cross-matching and cross-modal gap matching. These losses work in conjunction with the vanilla unimodal distribution matching loss to enforce cross-modal alignment and enhance the audio-visual dataset distillation process. Extensive audio-visual classification and retrieval experiments on four audio-visual datasets, AVE, MUSIC-21, VGGSound, and VGGSound-10K, demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data. This work establishes a new frontier in audio-visual dataset distillation, paving the way for further advancements in this exciting field. \textit{Our source code and pre-trained models will be released}.
URL: https://openreview.net/forum?id=IJlbuSrXmk
---
Title: Stealthy Backdoor Attack via Confidence-driven Sampling
Authors: Pengfei He, Yue Xing, Han Xu, Jie Ren, Yingqian Cui, Shenglai Zeng, Jiliang Tang, Makoto Yamada, Mohammad Sabokrou
Abstract: Backdoor attacks facilitate unauthorized control in the testing stage by carefully injecting harmful triggers during the training phase of deep neural networks. Previous works have focused on improving the stealthiness of the trigger while randomly selecting samples to attack. However, we find that random selection harms the stealthiness of the model. In this paper, we identify significant pitfalls of random sampling, which make the attacks more detectable and easier to defend against. To improve the stealthiness of existing attacks, we introduce a method of strategically poisoning samples near the model's decision boundary, aiming to minimally alter the model's behavior (decision boundary) before and after backdooring. Our main insight for detecting boundary samples is exploiting the confidence scores as a metric for being near the decision boundary and selecting those to poison (inject) the attack. The proposed approach makes it significantly harder for defenders to identify the attacks. Our method is versatile and independent of any specific trigger design. We provide theoretical insights and conduct extensive experiments to demonstrate the effectiveness of the proposed method.
URL: https://openreview.net/forum?id=Flh5EXz8dA
---
Title: Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary
Authors: Takashi Morita
Abstract: This study reports an unintuitive finding that positional encoding enhances learning of recurrent neural networks (RNNs). Positional encoding is a high-dimensional representation of time indices on input data. Most famously, positional encoding complements the capabilities of Transformer neural networks, which lack an inherent mechanism for representing the data order. By contrast, RNNs can encode the temporal information of data points on their own, rendering their use of positional encoding seemingly redundant. Nonetheless, investigations through synthetic benchmarks reveal an advantage of coupling positional encoding and RNNs, especially for handling a large vocabulary that yields low-frequency tokens. Further scrutiny reveals that these low-frequency tokens destabilize the gradients of vanilla RNNs and that positional encoding resolves this instability. These results shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers.
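For readers unfamiliar with the mechanism, a standard sinusoidal positional encoding can be sketched as below; this is the canonical Transformer-style formulation (with the usual 10000 base), not necessarily the exact variant the paper couples with RNNs:

```python
import numpy as np

def sinusoidal_pe(seq_len, dim):
    """Sinusoidal positional encoding (Transformer-style).

    Returns a (seq_len, dim) array whose row t is a high-dimensional
    representation of time index t; it can be added to (or concatenated
    with) the token embeddings fed to an RNN. Assumes dim is even.
    """
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, dim, 2)[None, :]          # (1, dim/2)
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```
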
URL: https://openreview.net/forum?id=PtnwXd13SF
---
Title: AdaWaveNet: Adaptive Wavelet Network for Time Series Analysis
Authors: Han Yu, Peikun Guo, Akane Sano
Abstract: Time series data analysis is a critical component in various domains such as finance, healthcare, and meteorology. Despite the progress in deep learning for time series analysis, there remains a challenge in addressing the non-stationary nature of time series data. Most existing models, which are built on the assumption of constant statistical properties over time, often struggle to capture the temporal dynamics of realistic time series, resulting in bias and error in time series analysis. This paper introduces the Adaptive Wavelet Network (AdaWaveNet), a novel approach that employs Adaptive Wavelet Transformation for multi-scale analysis of non-stationary time series data. AdaWaveNet incorporates a lifting scheme-based wavelet decomposition and reconstruction mechanism for adaptive and learnable wavelet transforms, which offers enhanced flexibility and robustness in analysis. We conduct extensive experiments on 10 datasets across 3 different tasks, including forecasting, imputation, and a newly established super-resolution task. The evaluations demonstrate the effectiveness of AdaWaveNet over existing methods in all three tasks, which illustrates its potential in various real-world applications.
URL: https://openreview.net/forum?id=m4bE9Y9FlX
---
Title: CREW: Facilitating Human-AI Teaming Research
Authors: Lingyu Zhang, Zhengran Ji, Boyuan Chen
Abstract: With the increasing deployment of artificial intelligence (AI) technologies, the potential for humans to work alongside AI agents has been growing rapidly. Human-AI teaming is an important paradigm for studying various aspects of humans and AI agents working together. The unique aspect of Human-AI teaming research is the need to jointly study humans and AI agents, demanding multidisciplinary research efforts from machine learning to human-computer interaction, robotics, cognitive science, neuroscience, psychology, social science, and complex systems. However, existing platforms for Human-AI teaming research are limited, often supporting oversimplified scenarios and a single task, or specifically focusing on either human-teaming research or multi-agent AI algorithms. We introduce \textbf{CREW}, a platform to facilitate Human-AI teaming research in real-time decision-making scenarios and engage collaborations from multiple scientific disciplines, with a strong emphasis on human involvement. It includes pre-built tasks for cognitive studies and Human-AI teaming with expandable potential from our modular design. Following conventional cognitive neuroscience research, CREW also supports multimodal human physiological signal recording for behavior analysis. Moreover, CREW benchmarks real-time human-guided reinforcement learning agents using state-of-the-art algorithms and well-tuned baselines. With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark.
URL: https://openreview.net/forum?id=ZRXwHRXm8i
---
Title: Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally
Authors: Manon Verbockhaven, Théo Rudkiewicz, Guillaume Charpiat, Sylvain Chevallier
Abstract: Machine learning tasks are generally formulated as optimization problems, where one searches for an optimal function within a certain functional space. In practice, parameterized functional spaces are considered, in order to be able to perform gradient descent. Typically, a neural network architecture is chosen and fixed, and its parameters (connection weights) are optimized, yielding an architecture-dependent result. This way of proceeding however forces the evolution of the function during training to lie within the realm of what is expressible with the chosen architecture, and prevents any optimization across architectures. Costly architectural hyper-parameter optimization is often performed to compensate for this. Instead, we propose to adapt the architecture on the fly during training. We show that the information about desirable architectural changes, due to expressivity bottlenecks when attempting to follow the functional gradient, can be extracted from backpropagation. To do this, we propose a mathematical definition of expressivity bottlenecks, which enables us to detect, quantify and solve them while training, by adding suitable neurons. Thus, while the standard approach requires large networks, in terms of number of neurons per layer, for expressivity and optimization reasons, we provide tools and properties to develop an architecture starting with a very small number of neurons. As a proof of concept, we show results on the CIFAR dataset, matching large neural network accuracy, with competitive training time, while removing the need for standard architectural hyper-parameter search.
URL: https://openreview.net/forum?id=hbtG6s6e7r
---
Title: Learning State Reachability as a Graph in Translation Invariant Goal-based Reinforcement Learning Tasks
Authors: Hedwin BONNAVAUD, Alexandre Albore, Emmanuel Rachelson
Abstract: Deep Reinforcement Learning proved efficient at learning universal control policies when the goal state is close enough to the starting state, or when the value function features few discontinuities.
But reaching goals that require long action sequences in complex environments remains difficult.
Drawing inspiration from the cognitive process which reuses learned atomic skills in a global planning procedure, we propose an algorithm which encodes reachability between abstract goals as a graph, and produces plans in this goal space.
Transitions between goals rely on the exploitation of a learned policy which enjoys a property we call \emph{translation invariant local optimality}, which encodes the intuition that goal-reaching skills can be reused throughout the state space.
Overall, our contribution permits solving large and difficult navigation tasks, outperforming related methods from the literature.
URL: https://openreview.net/forum?id=PkHkPQMTxg
---
Title: No Identity, no problem: Motion through detection for people tracking
Authors: Martin Engilberge, Friedrich Wilke Grosche, Pascal Fua
Abstract: Tracking-by-detection has become the de facto standard approach to people tracking. To increase robustness, some approaches incorporate re-identification using appearance models and regress motion offsets, which requires costly identity annotations. In this paper, we propose exploiting motion clues while providing supervision only for the detections, which is much easier to do.
Our algorithm predicts detection heatmaps at two different times, along with a 2D motion estimate between the two images. It then warps one heatmap using the motion estimate and enforces consistency with the other one. This provides the required supervisory signal on the motion without the need for any motion annotations. In this manner, we couple the information obtained from different images during training and increase accuracy, especially in crowded scenes and when using low frame-rate sequences.
We show that our approach delivers state-of-the-art results for single- and multi-view multi-target tracking on the MOT17 and WILDTRACK datasets.
URL: https://openreview.net/forum?id=ogEM2H9IGK
---
Title: Stability and Generalization in Free Adversarial Training
Authors: Xiwei Cheng, Kexin Fu, Farzan Farnia
Abstract: While adversarial training methods have significantly improved the robustness of deep neural networks against norm-bounded adversarial perturbations, the generalization gap between their performance on training and test data is considerably greater than that of standard empirical risk minimization. Recent studies have aimed to connect the generalization properties of adversarially trained classifiers to the min-max optimization algorithm used in their training. In this work, we analyze the interconnections between generalization and optimization in adversarial training using the algorithmic stability framework. Specifically, our goal is to compare the generalization gap of neural networks trained using the vanilla adversarial training method, which fully optimizes perturbations at every iteration, with the free adversarial training method, which simultaneously optimizes norm-bounded perturbations and classifier parameters. We prove bounds on the generalization error of these methods, indicating that the free adversarial training method may exhibit a lower generalization gap between training and test samples due to its simultaneous min-max optimization of classifier weights and perturbation variables. We conduct several numerical experiments to evaluate the train-to-test generalization gap in vanilla and free adversarial training methods. Our empirical findings also suggest that the free adversarial training method could lead to a smaller generalization gap over a similar number of training iterations. The paper code is available at https://github.com/Xiwei-Cheng/Stability_FreeAT.
URL: https://openreview.net/forum?id=jmwEiC9bq2
---
Title: Data-Centric Defense: Shaping Loss Landscape with Augmentations to Counter Model Inversion
Authors: Si Chen, Feiyang Kang, Nikhil Abhyankar, Ming Jin, Ruoxi Jia
Abstract: Machine Learning models have shown susceptibility to various privacy attacks, with model inversion (MI) attacks posing a significant threat. Current defense techniques are mostly \emph{model-centric}, involving modifying model training or inference. However, these approaches require model trainers' cooperation, are computationally expensive, and often result in a significant privacy-utility tradeoff. To address these limitations, we propose a novel \emph{data-centric} approach to mitigate MI attacks. Compared to traditional model-centric techniques, our approach offers the unique advantage of enabling each individual user to control their data's privacy risk, aligning with findings from a Cisco survey that only a minority actively seek privacy protection. Specifically, we introduce several privacy-focused data augmentations that modify the private data uploaded to the model trainer. These augmentations shape the resulting model's loss landscape, making it challenging for attackers to generate private target samples. Additionally, we provide theoretical analysis to explain why such augmentations can reduce the risk of model inversion. We evaluate our approach against state-of-the-art MI attacks and demonstrate its effectiveness and robustness across various model architectures and datasets. Specifically, in standard face recognition benchmarks, we reduce face reconstruction success rates to $\leq5\%$, while maintaining high utility with only a 2\% classification accuracy drop, significantly surpassing state-of-the-art model-centric defenses. This is the first study to propose a data-centric approach for mitigating model inversion attacks, showing promising potential for decentralized privacy protection.
URL: https://openreview.net/forum?id=r8wXaLJBIS
---
Title: Modeling Causal Mechanisms with Diffusion Models for Interventional and Counterfactual Queries
Authors: Patrick Chao, Patrick Blöbaum, Sapan Kirit Patel, Shiva Kasiviswanathan
Abstract: We consider the problem of answering observational, interventional, and counterfactual queries in a causally sufficient setting where only observational data and the causal graph are available. Utilizing the recent developments in diffusion models, we introduce diffusion-based causal models (DCM) to learn causal mechanisms, that generate unique latent encodings. These encodings enable us to directly sample under interventions and perform abduction for counterfactuals. Diffusion models are a natural fit here, since they can encode each node to a latent representation that acts as a proxy for exogenous noise. Our empirical evaluations demonstrate significant improvements over existing state-of-the-art methods for answering causal queries. Furthermore, we provide theoretical results that offer a methodology for analyzing counterfactual estimation in general encoder-decoder models, which could be useful in settings beyond our proposed approach.
URL: https://openreview.net/forum?id=EDHQDsqiSe
---
Title: Reconciling Kaplan and Chinchilla Scaling Laws
Authors: Tim Pearce, Jinyeop Song
Abstract: Kaplan and Chinchilla studied the scaling behavior of transformers trained on next-token language prediction. These studies produced different estimates for how the number of parameters ($N$) and training tokens ($D$) should be set to achieve the lowest possible loss for a given compute budget ($C$). Kaplan: $N_\text{optimal} \propto C^{0.73}$, Chinchilla: $N_\text{optimal} \propto C^{0.50}$. This paper finds that much of this discrepancy can be attributed to Kaplan counting non-embedding rather than total parameters, combined with their analysis being performed at small scale. Simulating the Chinchilla study under these conditions produces biased scaling coefficients close to Kaplan's. Hence, this paper reaffirms Chinchilla's scaling coefficients, by explaining the primary cause of Kaplan's original overestimation. As a second contribution, the paper explains differences in the reported relationships between loss and compute. These findings lead us to recommend that future scaling studies use total parameters and compute.
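The practical gap between the two exponents can be illustrated with a quick calculation (the proportionality coefficients below are illustrative placeholders, not values from either paper):

```python
def optimal_params(compute, exponent, coeff=1.0):
    """N_optimal = coeff * C^exponent for a compute budget C."""
    return coeff * compute ** exponent

# Relative growth of the optimal model size when compute scales 1000x,
# under each reported exponent. The coefficient cancels in the ratio.
ratio_kaplan = optimal_params(1e3, 0.73) / optimal_params(1.0, 0.73)
ratio_chinchilla = optimal_params(1e3, 0.50) / optimal_params(1.0, 0.50)
# Kaplan's exponent implies ~155x larger models for 1000x compute,
# versus ~32x under Chinchilla's -- a large difference in how budget
# is split between parameters and training tokens.
```
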
URL: https://openreview.net/forum?id=NLoaLyuUUF
---
Title: Confidence Intervals and Simultaneous Confidence Bands Based on Deep Learning
Authors: Asaf Ben Arie, Malka Gorfine
Abstract: Deep learning models have significantly improved prediction accuracy in various fields, gaining recognition across numerous disciplines. Yet, an aspect of deep learning that remains insufficiently addressed is the assessment of prediction uncertainty. Producing reliable uncertainty estimators could be crucial in practical terms. For instance, predictions associated with a high degree of uncertainty could be sent for further evaluation. Recent works in uncertainty quantification of deep learning predictions, including Bayesian posterior credible intervals and frequentist confidence-interval estimation, have proven to yield either invalid or overly conservative intervals. Furthermore, there is currently no method for quantifying uncertainty that can accommodate deep neural networks for survival (time-to-event) data that involves right-censored outcomes. In this work, we provide a non-parametric bootstrap method that disentangles data uncertainty from the noise inherent in the adopted optimization algorithm. The validity of the proposed approach is demonstrated through an extensive simulation study, which shows that the method is accurate (i.e., valid and not overly conservative) as long as the network is sufficiently deep to ensure that the estimators provided by the deep neural network exhibit minimal bias. Otherwise, undercoverage of up to 8\% is observed. The proposed ad-hoc method can be easily integrated into any deep neural network without interfering with the training process.
The utility of the proposed approach is demonstrated through two applications: constructing simultaneous confidence bands for survival curves generated by deep neural networks dealing with right-censored survival data, and constructing a confidence interval for classification probabilities in the context of binary classification. Code for the data analysis and reported simulations is available at \url{https://github.com/Asafba123/Survival_bootstrap}.
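For orientation, a plain percentile bootstrap looks as follows; this sketch covers only the data-resampling part, whereas the paper's method additionally disentangles optimization noise from data uncertainty (the function name and defaults are assumptions for this sketch):

```python
import numpy as np

def bootstrap_ci(data, estimator, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a point estimator.

    Resamples the data with replacement, recomputes the estimator on
    each resample, and returns the empirical alpha/2 and 1 - alpha/2
    quantiles of the resampled statistics.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = [estimator(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_boot)]
    return np.quantile(stats, alpha / 2), np.quantile(stats, 1 - alpha / 2)
```
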
URL: https://openreview.net/forum?id=PdbaruPVUY
---
Title: Contaminated Online Convex Optimization
Authors: Tomoya Kamijima, Shinji Ito
Abstract: In online convex optimization, efficient algorithms have been designed for each of the individual classes of objective functions, e.g., convex, strongly convex, and exp-concave. However, existing regret analyses, including those of universal algorithms, are limited to cases in which the objective functions in all rounds belong to the same class and cannot be applied to cases in which the properties of the objective functions may change at each time step. This paper introduces a novel approach to address such cases, proposing a new regime we term \textit{contaminated} online convex optimization. For the contaminated case, we demonstrate that the regret is lower bounded by $\Omega(\log T + \sqrt{k})$, where $k$ signifies the level of contamination in the objective functions. We also demonstrate that the regret is bounded by $O(\log T+\sqrt{k\log T})$ when universal algorithms are used. When our proposed algorithms with additional information are employed, the regret is bounded by $O(\log T+\sqrt{k})$, which matches the lower bound. These are intermediate bounds between the convex case and the strongly convex or exp-concave case.
URL: https://openreview.net/forum?id=QdGtwjDgub
---
Title: Deep Tabular Learning via Distillation and Language Guidance
Authors: Ruohan Wang, Wenhao Fu, Carlo Ciliberto
Abstract: Tabular data is arguably one of the most ubiquitous data structures in application domains such as science, healthcare, finance and manufacturing. Given the recent success of deep learning (DL), there has been a surge of new DL models for tabular learning. However, despite the efforts, tabular DL models still clearly trail behind tree-based approaches. In this work, we propose DisTab, a novel framework for tabular learning based on the transformer architecture. Our method leverages model distillation to mimic the favorable inductive biases of tree-based models, and incorporates language guidance for more expressive feature embeddings. Empirically, DisTab outperforms existing tabular DL models and is highly competitive against tree-based models across diverse datasets, effectively closing the gap with these methods.
URL: https://openreview.net/forum?id=p6KIteShzf
---
Title: Feature Distillation Improves Zero-Shot Transfer from Synthetic Images
Authors: Niclas Popp, Jan Hendrik Metzen, Matthias Hein
Abstract: Vision-language foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their size and the resulting latency. Knowledge distillation mitigates these challenges by distilling small image encoders that can replace the large CLIP image encoder. In a zero-shot setting, where only the class names are known, no real domain images can be used for this process. Instead, we investigate the use of synthetic images for this purpose. Unlike existing works that focus on improving the quality of synthetic images to bridge the performance gap compared to training on natural images, we find the choice of loss to be a crucial factor. Specifically, minimizing only the distance between the student and teacher image features, without incorporating image captions in the loss function, increases the robustness to spurious features and data corruptions. As a result, this feature distillation approach greatly improves the transfer performance from synthetic to real images. Leveraging these insights, we are able to train domain-specific students that achieve zero-shot performance comparable to a ViT-B/32 teacher on six fine-grained classification datasets while using up to 92% fewer parameters.
URL: https://openreview.net/forum?id=SP8DLl6jgb
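The loss choice this abstract highlights, matching student image features to frozen teacher features without any caption term, amounts to plain feature regression. A minimal sketch of that idea (our own illustration, not the authors' code; the shapes and the linear projection `W` to bridge the student/teacher widths are assumptions):

```python
import numpy as np

def feature_distillation_loss(student_feats, teacher_feats, W):
    """Mean squared distance between projected student features and
    frozen teacher features: no caption/text term in the loss.
    W is an assumed learnable projection to the teacher's width."""
    projected = student_feats @ W
    return np.mean((projected - teacher_feats) ** 2)

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 256))   # small student encoder outputs
teacher = rng.normal(size=(4, 512))   # CLIP-style teacher outputs
W = rng.normal(size=(256, 512)) * 0.01
loss = feature_distillation_loss(student, teacher, W)
print(loss > 0.0)  # True
```

In a real pipeline the teacher features would come from the CLIP image encoder on synthetic images, and `W` (or the student itself) would be trained by gradient descent on this loss.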
---
Title: Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity
Authors: Charlie Hou, Kiran Koshy Thekumparampil, Michael Shavlovsky, Giulia Fanti, Yesh Dattatreya, Sujay Sanghavi
Abstract: On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study problem settings which may not fully capture the complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics, sometimes by as much as 38%, both overall and on outliers.
URL: https://openreview.net/forum?id=093Q9VxaWt
---
New submissions
===============
Title: Combating Inter-Task Confusion and Catastrophic Forgetting by Metric Learning and Re-Using a Past Trained Model
Abstract: In this paper, utilizing metric learning, we tackle two fundamental issues of class-incremental learning (class-IL) that have not yet been fully addressed in the literature: inter-task confusion and catastrophic forgetting. To mitigate inter-task confusion, we propose a novel loss that uses the centroids of previously learned classes as negatives and current data samples as positives in the embedding space, reducing overlap between the classes of the current and past tasks. To combat catastrophic forgetting, we propose storing and re-using the past trained model to generate past data samples. Building on this, we further propose a novel knowledge distillation approach utilizing inter-class embedding clusters, intra-class embedding clusters, and mean square embedding distances. Extensive experiments on CIFAR-10, CIFAR-100, Mini-ImageNet, and TinyImageNet show that our proposed exemplar-free metric class-IL method achieves state-of-the-art performance, beating all baseline methods by notable margins. We release our code as supplementary material.
URL: https://openreview.net/forum?id=jRbKsQ3sYO
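The centroids-as-negatives idea above can be written as an InfoNCE-style objective. The sketch below is our own reading of the abstract, not the authors' loss; the temperature `tau` and the toy vectors are assumptions:

```python
import numpy as np

def centroid_contrastive_loss(z, z_pos, old_centroids, tau=0.1):
    """Pull an embedding z toward its positive z_pos while pushing it
    away from the centroids of previously learned classes (negatives).
    All inputs are L2-normalized; tau is an assumed temperature."""
    pos = np.exp(z @ z_pos / tau)
    neg = np.exp(old_centroids @ z / tau).sum()
    return -np.log(pos / (pos + neg))

def unit(v):
    return v / np.linalg.norm(v)

z = unit(np.array([1.0, 0.2, 0.0]))
z_pos = unit(np.array([0.9, 0.3, 0.1]))               # same-class sample
cents = np.stack([unit(np.array([-1.0, 0.0, 0.0])),   # past-class centroids
                  unit(np.array([0.0, -1.0, 0.5]))])
near = centroid_contrastive_loss(z, z_pos, cents)
far = centroid_contrastive_loss(z, unit(np.array([-1.0, 0.1, 0.0])), cents)
print(near < far)  # True: loss is lower when the positive is close
```

Minimizing such a loss over current-task samples separates them from the regions occupied by past classes, which is the stated mechanism for reducing inter-task confusion.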
---
Title: Formal Verification of Graph Convolutional Networks with Uncertain Node Features and Uncertain Graph Structure
Abstract: Graph neural networks are becoming increasingly popular in machine learning due to their unique ability to process graph-structured data. They have also been applied in safety-critical environments where perturbations inherently occur. Because neural networks are prone to adversarial attacks, these perturbations require us to formally verify neural networks before deploying them in such environments. While there is existing research on the formal verification of neural networks, no work verifies the robustness of generic graph convolutional network architectures with uncertainty in both the node features and the graph structure over multiple message-passing steps. This work addresses this research gap by explicitly preserving the non-convex dependencies of all elements in the underlying computations through reachability analysis with (matrix) polynomial zonotopes. We demonstrate our approach on three popular benchmark datasets.
URL: https://openreview.net/forum?id=B6y12Ot0cP
---
Title: Almost Sure Convergence of Stochastic Gradient Methods under Gradient Domination
Abstract: Stochastic gradient methods are among the most important algorithms for training machine learning models. While classical assumptions such as strong convexity allow a simple analysis, they are rarely satisfied in applications. In recent years, global and local gradient domination properties have been shown to be a more realistic replacement for strong convexity. They were proved to hold in diverse settings such as (simple) policy gradient methods in reinforcement learning and the training of deep neural networks with analytic activation functions. We prove almost sure convergence rates $f(X_n)-f^*\in o\big( n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for stochastic gradient descent (with and without momentum) under global and local $\beta$-gradient domination assumptions. The almost sure rates get arbitrarily close to recent rates in expectation. Finally, we demonstrate how to apply our results to training tasks in both supervised and reinforcement learning.
URL: https://openreview.net/forum?id=OTwnNBxZFB
---
Title: Instance-Aware Graph Prompt Learning
Abstract: Graph neural networks stand as the predominant technique for graph representation learning owing to their strong expressive power, yet the performance highly depends on the availability of high-quality labels in an end-to-end manner. Thus the pretraining and fine-tuning paradigm has been proposed to mitigate the label cost issue. Subsequently, the gap between the pretext tasks and downstream tasks has spurred the development of graph prompt learning which inserts a set of graph prompts into the original graph data with minimal parameters while preserving competitive performance. However, the current exploratory works are still limited since they all concentrate on learning fixed task-specific prompts which may not generalize well across the diverse instances that the task comprises. To tackle this challenge, we introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper, aiming to generate distinct prompts tailored to different input instances. The process involves generating intermediate prompts for each instance using a lightweight architecture, quantizing these prompts through trainable codebook vectors, and employing the exponential moving average technique to ensure stable training. Extensive experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
URL: https://openreview.net/forum?id=W50i7r3DHE
---
Title: ResiDual Transformer Alignment with Spectral Decomposition
Abstract: When examined through the lens of their residual streams, a puzzling property emerges in transformer networks: residual contributions (e.g., attention heads) sometimes specialize in specific tasks or input attributes. In this paper, we analyze this phenomenon in vision transformers, focusing on the spectral geometry of residuals, and explore its implications for modality alignment in vision-language models. First, we link it to the intrinsically low-dimensional structure of visual head representations, zooming into their principal components and showing that they encode specialized roles across a wide variety of input data distributions. Then, we analyze the effect of head specialization in multimodal models, focusing on how improved alignment between text and specialized heads impacts zero-shot classification performance. This specialization-performance link consistently holds across diverse pre-training data, network sizes, and objectives, demonstrating a powerful new mechanism for boosting zero-shot classification through targeted alignment. Ultimately, we translate these insights into actionable terms by introducing ResiDual, a technique for spectral alignment of the residual stream. Much like panning for gold, it lets the noise from irrelevant unit principal components (i.e., attributes) wash away to amplify task-relevant ones. Remarkably, this dual perspective on modality alignment yields fine-tuning level performance on different data distributions while modeling an extremely interpretable and parameter-efficient transformation, as we extensively show on more than 50 (pre-trained network, dataset) pairs.
URL: https://openreview.net/forum?id=z37LCgSIzI
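The "amplify task-relevant principal components, wash away the rest" operation can be pictured as reweighting a feature matrix in its SVD basis. This is a generic spectral-reweighting sketch under our own assumptions (toy data, hand-picked per-component gains), not the paper's actual ResiDual procedure:

```python
import numpy as np

def spectral_reweight(X, weights):
    """Project features onto their principal components, rescale each
    component by a task-relevance gain, and map back.
    X: (n, d) residual-unit outputs; weights: (d,) per-component gains."""
    mean = X.mean(axis=0)
    Xc = X - mean                                 # center the features
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (U * (S * weights)) @ Vt + mean        # reweighted reconstruction

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
# Keep only the two leading principal components (a crude "alignment").
w = np.zeros(8)
w[:2] = 1.0
Xa = spectral_reweight(X, w)
print(Xa.shape)  # (100, 8)
```

In practice the gains would be chosen per unit from how well each component aligns with the text/task direction, rather than hard-coded as here.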
---
Title: Interpretability Needs a New Paradigm
Abstract: Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents three emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.
URL: https://openreview.net/forum?id=IVnGVW0IEH
---
Title: Producers Equilibria and Dynamics in Engagement-Driven Recommender Systems
Abstract: Online platforms such as YouTube and Instagram rely heavily on recommender systems to decide what content to present to users. Producers, in turn, often create content that is likely to be recommended to users and to draw their engagement. To do so, producers try to align their content with the preferences of their targeted user base. In this work, we explore the equilibrium behavior of producers who are interested in maximizing user engagement. We study two variants of the content-serving rule for the platform's recommender system, and provide a structural characterization of producer behavior at equilibrium: namely, each producer chooses to focus on a single embedded feature. We further show that specialization, defined as different producers optimizing for distinct types of content, naturally emerges from the competition among producers trying to maximize user engagement. We provide a heuristic for computing equilibria of our engagement game, and evaluate it experimentally. We highlight i) the performance and convergence of our heuristic, ii) the degree of producer specialization, and iii) the impact of the content-serving rule on producer and user utilities at equilibrium, and provide guidance on how to set the content-serving rule.
URL: https://openreview.net/forum?id=EWT4GxjGDS
---
Title: Magnifying Three Phases of GAN Training via Evolution of Discriminator and Generator Gradients
Abstract: Generative Adversarial Networks (GANs) are powerful generative models but often suffer from mode mixture and mode collapse. We propose a three-phase characterization of GAN training: fitting, refining, and collapsing, where mode mixture and mode collapse are treated as interconnected. Inspired by the particle model interpretation of GANs, we leverage the discriminator gradient to analyze particle movement and the generator gradient, specifically "steepness," to quantify the severity of mode mixture by measuring the generator's sensitivity to changes in the latent space. Using these theoretical insights into the evolution of gradients, we design a specialized metric that integrates both gradients to detect the transition from refining to collapsing. This metric forms the basis of an early stopping algorithm, which stops training at a point that balances sample quality and diversity. Experiments on synthetic and real-world datasets, including MNIST, Fashion MNIST, and CIFAR-10, validate our theoretical findings and demonstrate the effectiveness of the proposed algorithm.
URL: https://openreview.net/forum?id=58gPkcVbFL
---
Title: PHICO: Personalised Human-AI Cooperative Classification Using Augmented Noisy Labels and Model Prediction
Abstract: The nuanced differences in human behavior and the complex dynamics of human-AI interactions pose significant challenges in optimizing human-AI cooperation. Existing approaches tend to oversimplify the problem and rely on a single global behavior model, which overlooks individual variability, leading to sub-optimal solutions. To bridge this gap, we introduce PHICO, a novel framework for human-AI cooperative classification that initially identifies a set of representative annotator profiles characterized by unique noisy label patterns. These patterns are then augmented to train personalised AI cooperative models, each tailored to an annotator profile. When these models are paired with human inputs that exhibit similar noise patterns from a corresponding profile, they consistently achieve a joint classification accuracy that exceeds those achieved by either AI or human alone. To evaluate PHICO, we introduce novel measures for assessing human-AI cooperative classification and empirically demonstrate its generalisability and performance across diverse datasets including CIFAR-10N, CIFAR-10H, Fashion-MNIST-H, AgNews, and Chaoyang histopathology. PHICO is both a model-agnostic and effective solution for improving human-AI cooperation.
URL: https://openreview.net/forum?id=SSssKg3mHd
---
Title: A Scalable Approach for Mapper via Efficient Spatial Search
Abstract: Topological Data Analysis (TDA) is a branch of applied mathematics that studies the shape of high-dimensional datasets using ideas from algebraic topology. The Mapper algorithm is a widely used TDA tool for uncovering hidden structures in complex data. However, existing implementations often rely on naive and inefficient methods for constructing the open covers that Mapper is based on, leading to performance issues, especially with large, high-dimensional datasets. In this study, we introduce a novel, more scalable method for constructing open covers for Mapper, leveraging techniques from computational geometry. Our approach significantly enhances efficiency, improving Mapper's performance on large, high-dimensional data. We present theoretical insights into our method and demonstrate its effectiveness through experimental evaluations on well-known datasets, showcasing substantial improvements in running time compared to existing approaches. We implemented our method in a new Python library called library-omitted-for-anonymity, which is freely available at link-omitted-for-anonymity, providing a powerful tool for TDA practitioners and researchers.
URL: https://openreview.net/forum?id=lTX4bYREAZ
---
Title: Where to Intervene: Action Selection in Deep Reinforcement Learning
Abstract: Deep reinforcement learning (RL) has gained widespread adoption in recent years but faces significant challenges, particularly in unknown and complex environments. Among these, *high-dimensional action selection* stands out as a critical problem. Existing works often require a sophisticated prior design to eliminate redundancy in the action space, relying heavily on domain expert experience or involving high computational complexity, which limits their generalizability across different RL tasks. In this paper, we address these challenges by proposing a general data-driven action selection approach that is model-free and computationally friendly. Our method not only *selects minimal sufficient actions* but also *controls the false discovery rate* via knockoff sampling. More importantly, we seamlessly integrate the action selection into deep RL methods during online training. Empirical experiments validate the established theoretical guarantees, demonstrating that our method surpasses various alternative techniques in terms of both variable-selection performance and overall achieved rewards.
URL: https://openreview.net/forum?id=D3au9XkWuy
---
Title: Closed-form proximal operator of regularized exponential functions for incremental learning
Abstract: Incremental model-based minimization methods have recently been proposed as a way to mitigate numerical challenges associated with stochastic or online optimization. One of their main desirable properties is stability with respect to the step-size choice and loss-function weights, which makes them attractive for use cases where parameter tuning is prohibitive. In contrast to incremental gradient methods, their main computational tool is the proximal operator rather than the gradient. This operator is precisely one of the main obstacles to adoption in practice: it may be inefficient to compute, and harder for a practitioner to implement, due to the lack of closed-form formulas and an expressive calculus.
In this work, we aim to address this challenge for a specific family of losses: compositions of the exponential with linear functions. One prominent application is Poisson regression, where the negative log-likelihood is of this form. We devise a closed-form formula for the proximal operator in terms of Lambert's W function, whose implementation is available in many standard numerical-computing and machine-learning packages, such as SciPy or TensorFlow. Then, we show that expressing the same formula in terms of the lesser-known Wright omega function, which is also available in SciPy, provides substantial numerical benefits. Finally, we provide an open-source vectorized PyTorch implementation of the Wright omega function and the proximal operator, ported from SciPy. This allows practitioners wishing to use the algorithm devised here to use the entire arsenal of tools provided by PyTorch, such as automatic differentiation and GPU computing. We have made our code available at https://anonymous.4open.science/r/exponential-proximal-point-B8DD.
URL: https://openreview.net/forum?id=L4brh4iAXF
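For the simplest member of this loss family, $f(x)=e^x$, the closed form via the Wright omega function can be derived in a few lines: stationarity of $e^x + (x-v)^2/(2t)$ gives $x + t e^x = v$, whose solution is $x = v - \omega(v + \log t)$, since $u=\omega(v+\log t)$ satisfies $u + \log u = v + \log t$. A minimal scalar sketch (our own derivation for the unweighted exponential, not the authors' general formula or code):

```python
import numpy as np
from scipy.special import wrightomega

def prox_exp(v, t):
    """Proximal operator of f(x) = exp(x) with step size t:
    argmin_x  exp(x) + (x - v)**2 / (2*t).
    Stationarity gives x + t*exp(x) = v, solved in closed form by
    the Wright omega function: x = v - omega(v + log t)."""
    return v - wrightomega(v + np.log(t)).real

v, t = 0.5, 1.3
x = prox_exp(v, t)
print(abs(x + t * np.exp(x) - v) < 1e-10)  # True: stationarity holds
```

Using `wrightomega(v + log t)` instead of `lambertw(t * exp(v))` avoids overflow in `t * exp(v)` for large `v`, which is presumably the kind of numerical benefit the abstract refers to.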
---
Title: Cycle Conditioning for Robust Representation Learning from Categorical Data
Abstract: This paper introduces a novel diffusion-based method for learning representations from categorical data. Conditional diffusion models have demonstrated their potential to extract meaningful representations from input samples. However, they often struggle to yield versatile, general-purpose information, limiting their adaptability to unforeseen tasks. To address this, we propose a cycle conditioning approach for diffusion models, designed to capture expressive information from conditioning samples. However, cycle conditioning alone can be insufficient. Diffusion models may ignore conditioning samples that vary across training iterations, an issue that occurs within cycle conditioning. To counter this limitation, we introduce additional "spelling" information to guide the conditioning process, ensuring that the conditioning sample remains influential during denoising. While this supervision enhances the generalizability of extracted representations, it is constrained by the sparse nature of spelling information in categorical data, leading to sparse latent conditions. This sparsity reduces the robustness of the extracted representations for downstream tasks or as effective guidance in the diffusion process. To overcome this challenge, we propose a linear navigation strategy within the latent space of conditioning samples, allowing dense representations to be extracted even with sparse supervision. Our experiments demonstrate that our method achieves at least a 1.42\% improvement in AUROC and a 4.12\% improvement in AUCPR over the best results from existing state-of-the-art methods.
URL: https://openreview.net/forum?id=GkYOcbNLaW
---
Title: Autoregressive Models in Vision: A Survey
Abstract: Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel in producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. However, the representation strategy in computer vision can vary at different levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse research backgrounds, we start with preliminaries on sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three general sub-categories, namely pixel-based, token-based, and scale-based models, according to the representation strategy. We then explore the interconnections between autoregressive models and other generative models. Furthermore, we present a multi-faceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multi-modal generation. We also elaborate on their applications in diverse domains, including emerging domains such as embodied AI and 3D medical AI, with about 250 related references. Finally, we highlight the current challenges to autoregressive models in vision with suggestions about potential research directions. We have also set up a GitHub repository to organize the papers included in this survey at: https://anonymous.4open.science/r/Autoregressive-Models-in-Vision-Survey-TMLR-9641/
URL: https://openreview.net/forum?id=1BqXkjNEGP
---
Title: Evaluating World Models with LLM for Decision Making
Abstract: World models have emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world model is either evaluated as a general world simulator or as a functional module of the agent, i.e., predicting transitions to assist planning. In this work, we propose a comprehensive evaluation of world models with LLMs from the decision-making perspective. Specifically, we leverage 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment for diverse evaluation. We then design three main tasks, i.e., policy verification, action proposal, and policy planning, in which the world model is used solely for decision making. Finally, we comprehensively evaluate advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on these environments for the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially on tasks that require domain knowledge; ii) the performance of the world model with an LLM decreases on long-term decision-making tasks; and iii) combining different functionalities of the world model brings additional instability in performance.
URL: https://openreview.net/forum?id=xxJ41g3gyk
---
Title: DStruct2Design: Data Structure Driven Generative Floor Plan Design
Abstract: Text-conditioned generative models for images have yielded impressive results. Text-conditioned floorplan generation, as a special type of raster image generation task, has also received particular attention. However, there are many use cases in floorplan generation where numerical properties of the generated result are more important than aesthetics. For instance, one might want to specify sizes for certain rooms in a floorplan and compare the generated floorplan against the given specifications. Current approaches, datasets, and commonly used evaluations do not support these kinds of constraints. As such, an attractive strategy is to generate an intermediate data structure containing the numerical properties of a floorplan, which can then be used to generate the final floorplan image. To explore this setting, we (1) construct a new dataset for this data-structure to data-structure formulation of floorplan generation using two popular image-based floorplan datasets, RPLAN and ProcTHOR-10k, and provide tools to convert further procedurally generated ProcTHOR floorplan data into our format; (2) explore the task of floorplan generation given a partial or complete set of constraints, and design a series of metrics and benchmarks to evaluate how well samples generated from models respect the constraints; and (3) create multiple baselines by finetuning a large language model (LLM), Llama3, demonstrating the feasibility of using floorplan-data-structure-conditioned LLMs for floorplan generation that respects numerical constraints. We hope that our language-based approach to this image-based design problem and our newly developed benchmarks will encourage further research on ways to improve the performance of LLMs and other generative modelling techniques for generating designs where quantitative constraints are only partially specified, but must be respected.
URL: https://openreview.net/forum?id=ERyuDrxsGH
---
Title: Variation Matters: from Mitigating to Embracing Zero-Shot NAS Ranking Function Variation
Abstract: Neural Architecture Search (NAS) is a powerful automatic alternative to manual design of a neural network. In the zero-shot version, we use fast ranking functions to compare architectures without training them. The outputs of the ranking functions often vary significantly due to different sources of randomness, including the evaluated architecture's weights' initialization or the batch of data used for calculations. A common approach to addressing the variation is to average a ranking function output over several evaluations. We propose taking into account the variation in a different manner, by viewing the ranking function output as a random variable representing a proxy performance metric. During the search process, we strive to construct a stochastic ordering of the performance metrics to determine the best architecture. Our experiments show that the proposed stochastic ordering can effectively boost performance of a search on standard benchmark search spaces.
URL: https://openreview.net/forum?id=SbGt90dxdp
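Treating a noisy zero-shot ranking function as a random variable, as the abstract proposes, suggests comparing architectures by their score distributions rather than by single averaged values. The sketch below is a generic illustration of that idea (using a one-sided rank test as a practical proxy for stochastic ordering; the paper's actual ordering construction is not specified in the abstract, and the score distributions here are made up):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Noisy zero-shot proxy scores: several evaluations per architecture,
# since different weight inits / data batches give different outputs.
scores_a = rng.normal(loc=1.0, scale=0.5, size=30)
scores_b = rng.normal(loc=1.6, scale=0.5, size=30)

# Test whether B's proxy score is stochastically greater than A's
# (one-sided Mann-Whitney U as a practical surrogate for dominance).
stat, p = mannwhitneyu(scores_b, scores_a, alternative='greater')
print(p < 0.05)  # True with this seed: prefer architecture B
```

Compared with simply averaging the evaluations, such a test also accounts for how much the score distributions overlap, which is the point of viewing the proxy as a random variable.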
---
Title: HyperMagNet: A Magnetic Laplacian based Hypergraph Neural Network
Abstract: In data science, hypergraphs are natural models for data exhibiting multi-way or group relationships in contrast to graphs which only model pairwise relationships. Nonetheless, many proposed hypergraph neural networks effectively reduce hypergraphs to undirected graphs via symmetrized matrix representations, potentially losing important multi-way or group information. We propose an alternative approach to hypergraph neural networks in which the hypergraph is represented as a non-reversible Markov chain. We use this Markov chain to construct a complex Hermitian Laplacian matrix — the magnetic Laplacian — which serves as the input to our proposed hypergraph neural network. We study $\textit{HyperMagNet}$ for the task of node classification, and demonstrate its effectiveness over graph-reduction based hypergraph neural networks.
URL: https://openreview.net/forum?id=Gdf4P7sEzE
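For readers unfamiliar with the magnetic Laplacian, the standard construction for a directed graph is shown below; the paper instead derives its phases from a non-reversible Markov chain on the hypergraph, so this is only a minimal sketch of the underlying object (the charge `q` and the toy adjacency matrix are assumptions):

```python
import numpy as np

def magnetic_laplacian(A, q=0.25):
    """Magnetic Laplacian of a directed adjacency matrix A.
    Symmetrized weights get a complex phase encoding edge direction,
    yielding a Hermitian, positive semi-definite matrix."""
    As = (A + A.T) / 2.0                  # symmetrized edge weights
    theta = 2.0 * np.pi * q * (A - A.T)   # antisymmetric direction phases
    H = As * np.exp(1j * theta)           # Hermitian "adjacency"
    D = np.diag(As.sum(axis=1))
    return D - H

A = np.array([[0., 1., 0.],              # directed 3-cycle
              [0., 0., 1.],
              [1., 0., 0.]])
L = magnetic_laplacian(A)
print(np.allclose(L, L.conj().T))        # True: Hermitian
print(np.linalg.eigvalsh(L).min() >= -1e-9)  # True: PSD spectrum
```

Because the matrix is Hermitian, its eigenvalues are real and its spectral convolutions are well defined, while the complex phases retain the directional (here, non-reversible) information that a symmetrized real Laplacian would discard.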
---
Title: Valley: Video Assistant with Large Language Model Enhanced Ability
Abstract: Large Language Models (LLMs), with remarkable conversational capability, have emerged as AI assistants that can handle both visual and textual modalities. However, their effectiveness in joint video and language understanding has not been extensively explored.
In this paper, we introduce Valley, a multi-modal foundation model that is designed to enable enhanced video comprehension and instruction-following capabilities. To this end, we construct two datasets, namely `Valley-702k' and `Valley-instruct-73k', to cover a diverse range of video-text alignment and video-based instruction tasks, such as multi-shot captions, long video descriptions, action recognition, causal inference, etc. Then, we adopt ViT-L/14 as the vision encoder and explore three different temporal modeling modules to learn multifaceted features for enhanced video understanding. In addition, we implement a two-phase training approach for Valley: the first phase focuses solely on training the projection module to facilitate the LLM's capacity to understand visual input, and the second phase jointly trains the projection module and the LLM to improve their instruction-following ability. Extensive experiments demonstrate that Valley has the potential to serve as an effective video assistant, simplifying complex video-understanding scenarios.
Our code and data are published anonymously at <https://github.com/valley-vl/Valley>.
URL: https://openreview.net/forum?id=xYLYBe2nbI
---
Title: SE3Set: Harnessing Equivariant Hypergraph Neural Networks for Molecular Representation Learning
Abstract: In this paper, we develop SE3Set, an SE(3) equivariant hypergraph neural network architecture tailored for advanced molecular representation learning. Hypergraphs are not merely an extension of traditional graphs; they are pivotal for modeling high-order relationships, a capability that conventional equivariant graph-based methods lack due to their inherent limitations in representing intricate many-body interactions. To achieve this, we first construct hypergraphs by proposing a new fragmentation method that considers both the chemical and three-dimensional spatial information of the molecular system. We then design SE3Set, which incorporates equivariance into the hypergraph neural network. This ensures that the learned molecular representations are invariant to spatial transformations, thereby providing robustness essential for accurate prediction of molecular properties. SE3Set has shown performance on par with state-of-the-art (SOTA) models for small molecule datasets like QM9 and MD17. It excels on the MD22 dataset, achieving a notable improvement of approximately 20\% in accuracy across all molecules, which highlights the prevalence of complex many-body interactions in larger molecules. This exceptional performance of SE3Set across diverse molecular structures underscores its transformative potential in computational chemistry, offering a route to more accurate and physically nuanced modeling.
URL: https://openreview.net/forum?id=f8I1c0XdLi
---
Title: Building Blocks for Robust and Effective Semi-Supervised Real-World Object Detection
Abstract: Semi-supervised object detection (SSOD) based on pseudo-labeling significantly reduces dependence on large labeled datasets by effectively leveraging both labeled and unlabeled data. However, real-world applications of SSOD often face critical challenges, including class imbalance, label noise, and labeling errors. We present an in-depth analysis of SSOD under real-world conditions, uncovering causes of suboptimal pseudo-labeling and key trade-offs between label quality and quantity. Based on our findings, we propose four model-agnostic building blocks that can be seamlessly integrated into any SSOD framework. Rare Class Collage (RCC): a data augmentation method that enhances the representation of rare classes by creating collages of rare objects. Rare Class Focus (RCF): a stratified batch sampling strategy that ensures a more balanced representation of all classes during training. Ground Truth Label Correction (GLC): a label refinement method that identifies and corrects false, missing, and noisy ground truth labels by leveraging the consistency of teacher model predictions. Pseudo-Label Selection (PLS): a selection method for removing low-quality pseudo-labeled images, guided by a novel metric estimating the missing detection rate while accounting for class rarity. We validate our methods through comprehensive experiments on autonomous driving datasets, resulting in up to a 6\% increase in SSOD performance. Overall, our investigation and novel, data-centric, and broadly applicable building blocks enable robust and effective SSOD in complex, real-world scenarios. Code will be released upon publication.
URL: https://openreview.net/forum?id=vRYt8QLKqK
---
Title: On the Connection Between Counterfactual Fairness, Statistical Parity and Individual Fairness
Abstract: The relations among observational fairness notions (those defined based on data distributions) have been studied in the literature, yet the relations between counterfactual fairness and observational fairness notions remain less explored. In this paper, we study the relations between counterfactual fairness and two kinds of observational fairness, statistical parity and individual fairness. In particular, we are interested in understanding whether a predictor trained using counterfactually fair representations (Zuo et al., 2023) can satisfy individual fairness and statistical parity. We show that, for a certain type of causal model called the Gaussian Causal Model (GCM), counterfactual fairness can imply both statistical parity and individual fairness. We also identify another class of causal models under which counterfactual fairness implies statistical parity. Experiments on both synthetic and real-world data demonstrate that counterfactually fair representation can enhance fairness in machine learning models without compromising performance, outperforming methods designed for observational fairness.
URL: https://openreview.net/forum?id=6JAaZzqyJ6
---
Title: Density of states in neural networks: an in-depth exploration of learning in parameter space
Abstract: Learning in neural networks critically hinges on the intricate geometry of the loss landscape associated with a given task. Traditionally, most research has focused on finding specific weight configurations that minimize the loss. In this work, born from the cross-fertilization of machine learning and theoretical soft matter physics, we introduce a novel, computationally efficient approach to examine the weight space across all loss values. Employing the Wang-Landau enhanced sampling algorithm, we explore the neural network density of states -- the number of network parameter configurations that produce a given loss value -- and analyze how it depends on specific features of the training set. Using both real-world and synthetic data, we quantitatively elucidate the relation between data structure and network density of states across different sizes and depths of binary-state networks. This work presents and illustrates a novel, informative analysis method that aims at paving the way for a better understanding of the interplay between structured data and the networks that process, learn, and generate them.
URL: https://openreview.net/forum?id=BLDtWlFKhn
---
Title: Graph Potential Field Neural Network for Massive Agents Group-wise Path Planning
Abstract: Multi-agent path planning is important in both multi-agent path finding and multi-agent reinforcement learning areas. However, group-wise multi-agent path planning, which requires the agents to perform as a team to pursue high team scores instead of individual ones, is less studied. To address this problem, we propose a novel graph potential field-based neural network (GPFNN), which models a valid potential field map for path planning. Our GPFNN unfolds the T-step iterative optimization of the potential field maps as a T-layer feedforward neural network. Thus, a deeper GPFNN leads to more precise potential field maps without the over-smoothing issue. A potential field map inherently provides a monotonic potential flow from any source node to the target nodes to construct the optimal path, equipping our GPFNN with an elegant planning ability. Moreover, we incorporate dynamically updated boundary conditions into our GPFNN to address group-wise multi-agent path planning that supports both static targets and dynamic moving targets. Experiments on three mazes of different sizes (up to $1025 \times 1025$) with up to 1,000 agents demonstrate the planning ability of our GPFNN to handle both static and dynamic moving targets. Experiments on extensive graph node classification tasks on six graph datasets (up to millions of nodes) demonstrate the learning ability of our GPFNN.
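The iterative optimization that such a network unfolds into layers can be illustrated with a plain grid-based relaxation sketch; the grid, boundary handling, and step count below are illustrative assumptions, not the paper's graph formulation:

```python
def relax(grid, obstacles, target, steps):
    """Toy Jacobi relaxation of a potential field on a grid. The target cell
    and obstacle cells are left untouched (fixed boundary conditions); every
    other cell is repeatedly replaced by the mean of its neighbours. Each
    relaxation step corresponds to one 'layer' of an unfolded network."""
    rows, cols = len(grid), len(grid[0])
    for _ in range(steps):
        new = [row[:] for row in grid]
        for r in range(rows):
            for c in range(cols):
                if (r, c) == target or (r, c) in obstacles:
                    continue
                nbrs = [grid[r + dr][c + dc]
                        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= r + dr < rows and 0 <= c + dc < cols]
                new[r][c] = sum(nbrs) / len(nbrs)
        grid = new
    return grid

# Clamp the target to potential 0; after relaxation the potential decreases
# monotonically toward the target, so greedy descent on the field yields a path.
field = relax([[0.0 if (r, c) == (0, 0) else 1.0 for c in range(5)]
               for r in range(5)], obstacles=set(), target=(0, 0), steps=100)
```

Cells closer to the target end up with lower potential, which is the monotonic flow property the abstract describes.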
URL: https://openreview.net/forum?id=LJHVPWNnV6
---
Title: Long Short-Term Imputer: Handling Consecutive Missing Values in Time Series
Abstract: Encountered frequently in time series data, missing values can significantly impede time-series analysis. With the progression of deep learning, advanced imputation models exploit the temporal dependencies inherent in time series data, showcasing remarkable performance. This positions them as intuitive choices for time series imputation tasks that assume data is ``Missing Completely at Random''. Nonetheless, long-interval consecutive missing values may obstruct the model's ability to grasp long-term temporal dependencies, consequently hampering imputation performance. To tackle this challenge, we propose the Long Short-term Imputer (LSTI) to impute consecutive missing values with intervals of different lengths. The Long-term Imputer is designed around bi-directional autoregression: a forward prediction model and a backward prediction model are trained with a consistency regularization, which captures long-term dependencies and can adapt to long-interval consecutive missing values. The Short-term Imputer captures short-term dependencies and can effectively impute short-interval consecutive missing values. A meta-weighting network is then proposed to take advantage of the strengths of the two imputers. As a result, LSTI can effectively impute consecutive missing values with different interval lengths. Experiments demonstrate that our approach, on average, reduces the error by 57.4% compared to state-of-the-art deep models across five datasets.
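The bi-directional blending idea can be sketched as follows; the exponential weighting scheme and the `tau` parameter are illustrative stand-ins for the learned meta-weighting network:

```python
import math

def impute_gap(forward_pred, backward_pred, gap_len, tau=5.0):
    """Blend forward (left-to-right) and backward (right-to-left) predictions
    for each missing position in a gap: trust the forward model near the left
    edge of the gap and the backward model near the right edge."""
    out = []
    for i, (f, b) in enumerate(zip(forward_pred, backward_pred)):
        # Softmax-style weight that decays with distance from the left edge.
        w = math.exp(-i / tau) / (math.exp(-i / tau) + math.exp(-(gap_len - 1 - i) / tau))
        out.append(w * f + (1.0 - w) * b)
    return out
```

For a gap of length 3 where the forward model predicts 1.0 everywhere and the backward model 3.0 everywhere, the blended values move smoothly from near 1 at the left edge to near 3 at the right, with exactly 2.0 in the middle.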
URL: https://openreview.net/forum?id=9NVJ0ZgEfT
---
Title: Guided Discrete Diffusion for Electronic Health Record Generation
Abstract: Electronic health records (EHRs) are a pivotal data source that enables numerous applications in computational medicine, e.g., disease progression prediction, clinical trial design, and health economics and outcomes research. Despite their wide usability, their sensitive nature raises privacy and confidentiality concerns, which limit potential use cases. To tackle these challenges, we explore the use of generative models to synthesize artificial, yet realistic EHRs. While diffusion-based methods have recently demonstrated state-of-the-art performance in generating other data modalities and overcome the training instability and mode collapse issues that plague previous GAN-based approaches, their applications in EHR generation remain underexplored. The discrete nature of tabular medical code data in EHRs poses challenges for high-quality data generation, especially for continuous diffusion models. To this end, we introduce a novel tabular EHR generation method, EHR-D3PM, which enables both unconditional and conditional generation using the discrete diffusion model. Our experiments demonstrate that EHR-D3PM significantly outperforms existing generative baselines on comprehensive fidelity and utility metrics while maintaining lower attribute and membership vulnerability risks. Furthermore, we show EHR-D3PM is effective as a data augmentation method and enhances performance on downstream tasks when combined with real data.
URL: https://openreview.net/forum?id=N2rWhTgits
---
Title: Lower Ricci Curvature for Efficient Community Detection
Abstract: This study introduces the Lower Ricci Curvature (LRC), a novel, scalable, and scale-free discrete curvature designed to enhance community detection in networks. Addressing the computational challenges posed by existing curvature-based methods, LRC offers a streamlined approach with linear computational complexity, which makes it well suited for large-scale network analysis. We further develop an LRC-based preprocessing method that effectively augments popular community detection algorithms. Through applications on multiple real-world datasets, including the NCAA football league network, the DBLP collaboration network, the Amazon product co-purchasing network, and the YouTube social network, we demonstrate the efficacy of our method in significantly improving the performance of various community detection algorithms.
URL: https://openreview.net/forum?id=EoiuRII7MQ
---
Title: ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning
Abstract: This paper presents a new self-supervised video representation learning framework \textbf{ARVideo}, which \textit{autoregressively} predicts the next video token in a tailored sequence order.
Two key designs are included. First, we organize autoregressive video tokens into
clusters that span both \textit{spatially} and \textit{temporally}, thereby enabling a richer aggregation of contextual information compared to the standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example,
when trained with the ViT-B backbone, ARVideo competitively attains 81.2\% on Kinetics-400 and 70.9\% on Something-Something V2, which are on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also demonstrates higher training efficiency, \ie, it trains 14\% faster and requires 58\% less GPU memory compared to VideoMAE.
URL: https://openreview.net/forum?id=TRKwzPnXWQ
---
Title: Anon Embed: Training a Reproducible Long Context Text Embedder
Abstract: This technical report describes the training of anon-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark.
We release the training code and model weights under an Apache 2.0 license.
In contrast with other open-source models, we release the full curated training data and code that allows for full replication of anon-embed-text-v1. You can find code and data to replicate the model at <anon github link>
URL: https://openreview.net/forum?id=IPmzyQSiQE
---
Title: Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity
Abstract: Vision-language pre-training (VLP) has emerged as an effective scheme for multimodal representation learning, but its reliance on large-scale multimodal data poses significant challenges for medical applications. Federated learning (FL) offers a promising solution to scale up the dataset for medical VLP while preserving data privacy. However, we observe that client data heterogeneity in real-world scenarios could cause models to learn biased cross-modal alignment during local pre-training. This would limit the transferability of the federated representation model on downstream tasks. To address this challenge, we propose Federated Distributionally Robust Alignment (FedDRA), a framework for federated VLP that achieves robust vision-language alignment under heterogeneous conditions. Based on client datasets, we construct a distribution family that encompasses potential test-time domains, and apply a distributionally robust framework to optimize the pre-trained model's performance across this distribution space. This approach bridges the gap between pre-training samples and downstream applications. To avoid over-fitting on client-specific information, we use anchor representations from the global model to guide the local training, and adopt a two-stage approach to first tune deeper layers before updating the entire network. Extensive experiments on real-world datasets demonstrate FedDRA's effectiveness in enhancing medical federated VLP under data heterogeneity. Our method also adapts well to various medical pre-training methods.
URL: https://openreview.net/forum?id=hb3ZGvBja4
---
Title: Enhancing Gradient Boosting Machines with Attention
Abstract: Gradient boosting machines (GBMs) are a popular machine learning model, well known for their high accuracy and flexibility. Despite significant research over the past two decades improving their accuracy, speed, and robustness, room for improvement remains, notably in their ability to interpret complex patterns in noisy data. To address this challenge, this paper proposes AMBeRBoost: a novel model that integrates neural attention mechanisms into a GBM, aiming to help the model “focus” on important data and improve its predictive performance on otherwise hard-to-predict datasets. A series of experiments were performed to evaluate the effects of the attention mechanism, along with the performance of AMBeRBoost against other state-of-the-art models across several publicly available datasets. The results show that AMBeRBoost consistently outperforms the attentionless baseline model on almost all metrics, with results comparable to, and sometimes even exceeding, state-of-the-art models. This research contributes to the continuous improvement and refinement of machine learning models by bridging the gap between GBMs and neural attention mechanisms.
URL: https://openreview.net/forum?id=VQ948Ay8kD
---
Title: Smart Transportation Without Neurons - Fair Metro Network Expansion with Tabular Reinforcement Learning
Abstract: We address the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding metro systems to satisfy travel demand. Traditional methods have relied on exact and heuristic approaches that require expert-defined constraints to reduce the search space and enable tractability. Recently, reinforcement learning (RL), particularly deep reinforcement learning (Deep RL), has emerged as a powerful alternative due to its effectiveness in optimizing complex sequential decision-making processes. However, Deep RL methods can be computationally expensive, environmentally costly and hard to interpret. In this paper we re-formulate the MNEP as a Markov Decision Process (MDP), and solve it through tabular Q-Learning. By using a redefined MDP and a tabular RL approach, we achieve similar performance to Deep RL, with substantially fewer training episodes, offering the added benefit of greater interpretability. Furthermore, we incorporate diverse social equity criteria into the reward functions, balancing efficiency with fairness, thus highlighting the versatility of our method. Our approach is evaluated in real-world settings---specifically in Xi’an and Amsterdam---where it demonstrates competitive results, reducing the total training episodes by a factor of 18 and total carbon emissions by a factor of 12 on average. Our approach provides a replicable, interpretable, and resource-efficient solution, with potential applicability to other combinatorial optimization problems.
URL: https://openreview.net/forum?id=xpJRT3Q8On
---
Title: On Joint Noise Scaling in Differentially Private Federated Learning with Multiple Local Steps
Abstract: Federated learning is a distributed learning setting where the main aim is to train machine learning models without sharing raw data, communicating only what is required for learning. To guarantee training-data privacy and high-utility models, differential privacy and secure aggregation techniques are often combined with federated learning. However, with fine-grained protection granularities, existing techniques require the parties to communicate at every local optimization step in order to fully benefit from secure aggregation in terms of the resulting formal privacy guarantees. In this paper, we show how a simple new analysis allows the parties to perform multiple local optimization steps while still benefiting from joint noise scaling when using secure aggregation. We show that our analysis enables higher-utility models with guaranteed privacy protection under a limited number of communication rounds.
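The joint-noise-scaling idea can be sketched as follows: if each of n parties adds Gaussian noise with standard deviation sigma / sqrt(n), the securely aggregated sum carries total noise of standard deviation sigma, matching what a trusted central aggregator would add. The function names and parameters below are illustrative, not the paper's analysis:

```python
import math
import random

def local_update_with_noise(update, sigma, n_parties, rng):
    """Each party perturbs its local update with per-party noise of
    standard deviation sigma / sqrt(n_parties)."""
    scale = sigma / math.sqrt(n_parties)
    return [u + rng.gauss(0.0, scale) for u in update]

def secure_aggregate(updates):
    """Stand-in for secure aggregation: only the element-wise sum is revealed."""
    return [sum(col) for col in zip(*updates)]

rng = random.Random(0)
n, sigma = 16, 4.0
updates = [local_update_with_noise([1.0], sigma, n, rng) for _ in range(n)]
agg = secure_aggregate(updates)  # sum of true updates plus noise of std sigma
```

Because the sum of n independent Gaussians with std sigma/sqrt(n) has std sigma, no single party's contribution reveals more than the centrally noised aggregate would.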
URL: https://openreview.net/forum?id=uxyWlXPuIg
---
Title: Meta-learning Population-based Methods for Reinforcement Learning
Abstract: Reinforcement learning (RL) algorithms are highly sensitive to their hyperparameter settings. Recently, numerous methods have been proposed to dynamically optimize these hyperparameters. One prominent approach is Population-Based Bandits (PB2), which uses time-varying Gaussian processes (GP) to dynamically optimize hyperparameters with a population of parallel agents. Despite its strong overall performance, PB2 experiences slow starts due to the GP initially lacking sufficient information. To mitigate this issue, we propose four different methods that utilize meta-data from various environments. These approaches are novel in that they adapt meta-learning methods to accommodate the time-varying setting. Among these approaches, MultiTaskPB2, which uses meta-learning for the surrogate model, stands out as the most promising approach. It outperforms PB2 and other baselines in both anytime and final performance across two RL environment families.
URL: https://openreview.net/forum?id=d9htascfP8
---
Title: Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics
Abstract: Natural language is often the easiest and most convenient modality for humans to specify tasks for robots. However, learning to ground language to behavior typically requires impractical amounts of diverse, language-annotated demonstrations collected on each target robot. In this work, we aim to separate the problem of what to accomplish from how to accomplish it, as the former can benefit from substantial amounts of external observation-only data, and only the latter depends on a specific robot embodiment. To this end, we propose Video-Language Critic, a reward model that can be trained on readily available cross-embodiment data using contrastive learning and a temporal ranking objective, and use it to score behavior traces from a separate actor. When trained on Open X-Embodiment data, our reward model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only, despite a significant domain gap. Using in-domain data but in a challenging task generalization setting on Meta-World, we further demonstrate more sample-efficient training than is possible with prior language-conditioned reward models that are either trained with binary classification, use static images, or do not leverage the temporal information present in video data.
URL: https://openreview.net/forum?id=jJOVpnNrEp
---
Title: Relax and penalize: a new bilevel approach to mixed-binary hyperparameter optimization
Abstract: In recent years, bilevel approaches have become very popular for efficiently estimating high-dimensional hyperparameters of machine learning models. However, to date, binary parameters are handled by continuous relaxation and rounding strategies, which can lead to inconsistent solutions. In this context, we tackle the challenging optimization of mixed-binary hyperparameters by resorting to an equivalent continuous bilevel reformulation based on an appropriate penalty term. We propose an algorithmic framework that, under suitable assumptions, is guaranteed to provide mixed-binary solutions. Moreover, the generality of the method allows existing continuous bilevel solvers to be used safely within the proposed framework. We evaluate the performance of our approach on two specific machine learning problems, i.e., the estimation of the group-sparsity structure in regression problems and the data distillation problem. The reported results clearly show that our method can outperform state-of-the-art approaches based on relaxation and rounding.
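One classic way to realize such a penalty-based reformulation is to add a concave term that vanishes exactly on binary values; the specific penalty below is a standard textbook choice, not necessarily the paper's:

```python
def binary_penalty(w):
    """Concave penalty that is zero iff every entry of w is exactly 0 or 1,
    and strictly positive for any fractional entry."""
    return sum(wi * (1.0 - wi) for wi in w)

def penalized(f, w, rho):
    """Penalized relaxed objective: as the penalty weight rho grows,
    minimizers are driven toward binary values, recovering the
    mixed-binary problem from its continuous relaxation."""
    return f(w) + rho * binary_penalty(w)
```

For example, with `f = sum` and `w = [0.5, 1.0]`, the fractional entry contributes 0.25 to the penalty, so `penalized(sum, w, 10.0)` equals 1.5 + 2.5 = 4.0; a binary `w` would incur no penalty at all.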
URL: https://openreview.net/forum?id=A1R1cQ93Cb
---
Title: Entropy Controllable Direct Preference Optimization
Abstract: In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance.
Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
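To illustrate the flavour of the modification, here is a minimal sketch of standard DPO alongside a hypothetical entropy-scaled variant with a coefficient alpha (alpha = 1 recovers DPO; alpha < 1 sharpens the fitted distribution). The exact H-DPO loss is given in the paper; this parameterization is an assumption for illustration only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Standard DPO: binary cross-entropy on the implicit reward margin
    between the chosen (w) and rejected (l) responses."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def h_dpo_loss(beta, alpha, logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Hypothetical entropy-controlled variant: dividing the policy log-probs
    by alpha corresponds to fitting a temperature-sharpened policy."""
    margin = beta * ((logp_w / alpha - ref_logp_w) - (logp_l / alpha - ref_logp_l))
    return -math.log(sigmoid(margin))
```

With alpha = 1 the two losses coincide, and the change is indeed a one-line modification of the loss calculation, consistent with the abstract's claim of minimal implementation overhead.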
URL: https://openreview.net/forum?id=js6TS3ck7H
---
Title: Respecting the limit: Bayesian optimization with a bound on the optimal value
Abstract: In many real-world optimization problems, we have prior information about what objective function values are achievable. In this paper, we study the scenario in which we have either exact knowledge of the minimum value or a, possibly inexact, lower bound on it. We propose bound-aware Bayesian optimization (BABO), a Bayesian optimization method that uses a new surrogate model and acquisition function to exploit such prior information. We present SlogGP, a new surrogate model that incorporates the bound information, and adapt the Expected Improvement (EI) acquisition function accordingly. Empirical results on a variety of benchmarks demonstrate the benefit of taking prior information about the optimal value into account, and show that the proposed approach significantly outperforms existing techniques. Furthermore, we observe that even in the absence of prior information on the bound, the proposed SlogGP surrogate model still performs better than the standard GP model in most cases, which we attribute to its greater expressiveness.
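One natural way to exploit a known lower bound in a surrogate is to model the objective in a warped space where the bound is respected by construction. The sketch below illustrates this shifted-log idea; the `eps` constant and function names are assumptions, not the paper's exact SlogGP construction:

```python
import math

def to_log_space(y, lower_bound, eps=1e-9):
    """Warp observations so a standard GP can be fit to g(x) = log(f(x) - L).
    Any prediction mapped back through exp is guaranteed to lie above L."""
    return [math.log(v - lower_bound + eps) for v in y]

def from_log_space(g, lower_bound):
    """Inverse warp: exp(g) + L, which always respects the lower bound L."""
    return [math.exp(v) + lower_bound for v in g]
```

The round trip is (numerically) lossless, and the exponential inverse means predictive distributions in the original space are automatically truncated below the bound.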
URL: https://openreview.net/forum?id=y5Hf0otJLk
---
Title: How to Leverage Predictive Uncertainty Estimates for Reducing Catastrophic Forgetting in Online Continual Learning
Abstract: Many real-world applications require machine-learning models to be able to deal with non-stationary data distributions and thus learn autonomously over an extended period of time, often in an online setting. One of the main challenges in this scenario is so-called catastrophic forgetting (CF), whereby the learning model tends to focus on the most recent tasks while experiencing predictive degradation on older ones. In the online setting, the most effective solutions employ a fixed-size memory buffer to store old samples used for replay when training on new tasks. Many approaches have been presented to tackle this problem, and conflicting strategies are proposed to populate the memory. Are the easiest-to-forget or the easiest-to-remember samples more effective in combating CF? Furthermore, it is not clear how predictive uncertainty information can be leveraged most effectively for memory management. Starting from the intuition that predictive uncertainty provides an idea of the samples' location in the decision space, this work presents an in-depth analysis of different uncertainty estimates and strategies for populating the memory. The investigation provides a better understanding of the characteristics data points should have for alleviating CF. Then, we propose an alternative method for estimating predictive uncertainty via the generalised variance induced by the negative log-likelihood. Finally, we demonstrate that the use of predictive uncertainty measures helps in reducing CF in different settings.
URL: https://openreview.net/forum?id=dczXe0S1oL
---
Title: Conditional Latent Space Molecular Scaffold Optimization for Accelerated Molecular Design
Abstract: The rapid discovery of new chemical compounds is essential for advancing global health and developing treatments. While generative models show promise in creating novel molecules, challenges remain in ensuring the real-world applicability of these molecules and finding such molecules efficiently. To address this, we introduce Conditional Latent Space Molecular Scaffold Optimization (CLaSMO), which combines a Conditional Variational Autoencoder (CVAE) with Latent Space Bayesian Optimization (LSBO) to modify molecules strategically while maintaining similarity to the original input. Our LSBO setting improves the sample-efficiency of our optimization, and our modification approach helps us to obtain molecules with higher chances of real-world applicability. CLaSMO explores substructures of molecules in a sample-efficient manner by performing BO in the latent space of a CVAE conditioned on the atomic environment of the molecule to be optimized. Our experiments demonstrate that CLaSMO efficiently enhances target properties with minimal substructure modifications, achieving state-of-the-art results with a smaller model and dataset compared to existing methods. We also provide an open-source web application that enables chemical experts to apply CLaSMO in a Human-in-the-Loop setting.
URL: https://openreview.net/forum?id=jcd8rnrTKi
---
Title: Maximising the Utility of Validation Sets for Imbalanced Noisy-label Meta-learning
Abstract: Meta-learning is an effective method for handling imbalanced and noisy-label learning, but it generally depends on a clean validation set. Unfortunately, such a validation set scales poorly as the number of classes increases, since traditionally its samples must be randomly selected, manually labelled, and balanced across classes. This problem has therefore motivated the development of meta-learning methods that automatically select validation samples likely to have clean labels and a balanced class distribution. A common shortcoming of existing meta-learning methods for noisy-label learning, however, is the lack of consideration for data informativeness when constructing the validation set. Building an informative validation set requires hard samples, i.e., samples for which the model has low prediction confidence, but such samples are more likely to be noisy, which can degrade the meta-reweighting process. The balance between sample informativeness and cleanness is therefore an important criterion for validation set optimization. In this paper, we propose new criteria to characterise the utility of such meta-learning validation sets, based on: 1) sample informativeness; 2) balanced class distribution; and 3) label cleanliness. We also introduce a new imbalanced noisy-label meta-learning (INOLML) algorithm that automatically builds a validation set by maximising these utility criteria. The proposed method shows state-of-the-art (SOTA) results compared to previous meta-learning and noisy-label learning approaches on several noisy-label learning benchmarks.
URL: https://openreview.net/forum?id=SBM9yeNZz5
---
Title: Conditional Image Synthesis with Diffusion Models: A Survey
Abstract: Conditional image synthesis based on user-specified requirements is a key component in creating complex visual content. In recent years, diffusion-based generative modeling has become a highly effective way for conditional image synthesis, leading to exponential growth in the literature. However, the complexity of diffusion-based modeling, the wide range of image synthesis tasks, and the diversity of conditioning mechanisms present significant challenges for researchers to keep up with rapid developments and understand the core concepts on this topic. In this survey, we categorize existing works based on how conditions are integrated into the two fundamental components of diffusion-based modeling, i.e., the denoising network and the sampling process. We specifically highlight the underlying principles, advantages, and potential challenges of various conditioning approaches in the training, re-purposing, and specialization stages to construct a desired denoising network. We also summarize six mainstream conditioning mechanisms in the essential sampling process. All discussions are centered around popular applications. Finally, we pinpoint some critical yet still open problems to be solved in the future and suggest some possible solutions.
URL: https://openreview.net/forum?id=ewwNKwh6SK
---
Title: SynCode: LLM Generation with Grammar Augmentation
Abstract: LLMs are widely used in complex AI applications. These applications underscore the need for LLM outputs to adhere to a specific format so that they can be integrated with other components in the system. Typically the format rules – e.g., data serialization formats such as JSON or YAML, or code in a programming language – are expressed as a context-free grammar (CFG). Due to the hallucinations and unreliability of LLMs, instructing LLMs to adhere to a specified syntax becomes an increasingly important challenge.
We present SynCode, a novel framework for efficient and general syntactical decoding with LLMs, to address this challenge. SynCode ensures soundness and completeness with respect to the CFG of a formal language, effectively retaining valid tokens while filtering out invalid ones. SynCode uses an offline-constructed, efficient lookup table, the DFA mask store, created from the DFA (Deterministic Finite Automaton) of the language's grammar for efficient generation. SynCode seamlessly integrates with any language defined by a CFG, as evidenced by experiments focusing on generating JSON, SQL, Python, and Go outputs. Our experiments evaluating the effectiveness of SynCode for JSON generation demonstrate that SynCode eliminates all syntax errors and significantly outperforms state-of-the-art baselines. Furthermore, our results show that SynCode reduces syntax errors in generated Python and Go code by 96.07%, showcasing its substantial impact on enhancing syntactical precision in LLM generation.
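The core masking step behind grammar-constrained decoding can be sketched in a few lines; the toy vocabulary and JSON-like grammar below are illustrative, not SynCode's actual DFA mask store:

```python
NEG_INF = float("-inf")

def build_mask(valid_tokens, vocab):
    """Precompute a boolean mask over the vocabulary: True where the token
    is grammatically valid in the current parser state. In practice this
    lookup is built offline, per DFA state, rather than per step."""
    return [tok in valid_tokens for tok in vocab]

def constrained_argmax(logits, mask):
    """Pick the highest-scoring token among the grammatically valid ones."""
    best, best_score = None, NEG_INF
    for i, (score, ok) in enumerate(zip(logits, mask)):
        if ok and score > best_score:
            best, best_score = i, score
    return best

vocab = ["{", "}", '"key"', ":", "42", "foo"]
# After emitting '{', a toy JSON grammar allows only a string key or '}'.
mask = build_mask({'"key"', "}"}, vocab)
logits = [0.1, 0.2, 0.5, 0.9, 1.2, 2.0]  # unconstrained argmax would pick "foo"
print(vocab[constrained_argmax(logits, mask)])
```

Here the unconstrained model would emit the syntactically invalid `foo`, but the mask forces the decoder to choose `"key"`, the best-scoring valid token.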
URL: https://openreview.net/forum?id=HiUZtgAPoH
---
Title: System-Aware Neural ODE Processes for Few-Shot Bayesian Optimization
Abstract: We consider the problem of optimizing initial conditions and termination time in dynamical systems governed by unknown ordinary differential equations (ODEs), where evaluating different initial conditions is costly and the state's value cannot be measured in real time but only with a delay while the measuring device processes the sample. To identify the optimal conditions in limited trials, we introduce a few-shot Bayesian Optimization (BO) framework based on the system's prior information. At the core of our approach is the System-Aware Neural ODE Processes (SANODEP), an extension of Neural ODE Processes (NODEP) designed to meta-learn ODE systems from multiple trajectories using a novel context embedding block. We further develop a two-stage BO framework to effectively incorporate search space constraints, enabling efficient optimization of both initial conditions and observation timings. We conduct extensive experiments showcasing SANODEP's potential for few-shot BO within dynamical systems. We also explore SANODEP's adaptability to varying levels of prior information, highlighting the trade-off between prior flexibility and model fitting accuracy.
URL: https://openreview.net/forum?id=FFnRLvWefK
---
Title: Towards LifeSpan Cognitive Systems
Abstract: Building a human-like system that continuously interacts with complex environments -- whether simulated digital worlds or human society -- presents several key challenges. Central to this is enabling continuous, high-frequency interactions, where the interactions are termed experiences. We refer to this envisioned system as the **LifeSpan Cognitive System (LSCS)**. A critical feature of LSCS is its ability to engage in incremental and rapid updates while retaining and accurately recalling past experiences. We identify two major challenges in achieving this: (1) Abstraction and Experience Merging, and (2) Long-term Retention with Accurate Recall. These properties are essential for storing new experiences, organizing past experiences, and responding to the environment in ways that leverage relevant historical data. Unlike language models with continual learning, which typically rely on large corpora for fine-tuning and focus on improving performance within specific domains or tasks, LSCS must rapidly and incrementally update with new information from its environment at a high frequency. Existing technologies with the potential to solve the above two major challenges can be classified into four classes based on a conceptual metric called **Storage Complexity**, which measures the relative space required to store past experiences. Each of these four classes of technologies has its own strengths and limitations, and we argue that none of them alone can achieve LSCS. To this end, we propose a potential paradigm for LSCS that integrates all four classes of technologies. The new paradigm, serving as a conjecture, operates through two core processes: Absorbing Experiences and Generating Responses.
URL: https://openreview.net/forum?id=LZ9FmeFeLV
---
Title: On Using Certified Training towards Empirical Robustness
Abstract: Adversarial training is arguably the most popular way to provide empirical robustness against specific adversarial examples. While variants based on multi-step attacks incur significant computational overhead, single-step variants are vulnerable to a failure mode known as catastrophic overfitting, which hinders their practical utility for large perturbations. A parallel line of work, certified training, has focused on producing networks amenable to formal guarantees of robustness against any possible attack. However, the wide gap between the best-performing empirical and certified defenses has severely limited the applicability of the latter. Inspired by recent developments in certified training, which rely on a combination of adversarial attacks with network over-approximations, and by the connections between local linearity and catastrophic overfitting, we present experimental evidence on the practical utility and limitations of using certified training towards empirical robustness. We show that, when tuned for the purpose, a recent certified training algorithm can prevent catastrophic overfitting on single-step attacks, and that it can bridge the gap to multi-step baselines under appropriate experimental settings. Finally, we present a novel regularizer for network over-approximations that can achieve similar effects while markedly reducing runtime.
URL: https://openreview.net/forum?id=UaaT2fI9DC
---