Weekly TMLR digest for Sep 03, 2023


TMLR

Sep 2, 2023, 8:00:16 PM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series

Jean-Christophe Gagnon-Audet, Kartik Ahuja, Mohammad Javad Darvishi Bayazi, Pooneh Mousavi, Guillaume Dumas, Irina Rish

https://openreview.net/forum?id=mvftzofTYQ

---


Accepted papers
===============


Title: Semantic Representations of Mathematical Expressions in a Continuous Vector Space

Authors: Neeraj Gangwar, Nickvash Kani

Abstract: Mathematical notation makes up a large portion of STEM literature, yet finding semantic representations for formulae remains a challenging problem. Because mathematical notation is precise, and its meaning changes significantly with small character shifts, the methods that work for natural text do not necessarily work well for mathematical expressions. This work describes an approach for representing mathematical expressions in a continuous vector space. We use the encoder of a sequence-to-sequence architecture, trained on visually different but mathematically equivalent expressions, to generate vector representations (or embeddings). We compare this approach with a structural approach that considers visual layout to embed an expression and show that our proposed approach is better at capturing mathematical semantics. Finally, to expedite future research, we publish a corpus of equivalent transcendental and algebraic expression pairs.

URL: https://openreview.net/forum?id=EWPA9TZcUy

---

Title: WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series

Authors: Jean-Christophe Gagnon-Audet, Kartik Ahuja, Mohammad Javad Darvishi Bayazi, Pooneh Mousavi, Guillaume Dumas, Irina Rish

Abstract: Deep learning models often fail to generalize well under distribution shifts. Understanding and overcoming these failures have led to a new research field on Out-of-Distribution (OOD) generalization. Despite being extensively studied for static computer vision tasks, OOD generalization has been severely underexplored for time series tasks. To shine a light on this gap, we present WOODS: 10 challenging time series benchmarks covering a diverse range of data modalities, such as videos, brain recordings, and smart device sensory signals. We revise the existing OOD generalization algorithms for time series tasks and evaluate them using our systematic framework. Our experiments show a large room for improvement for empirical risk minimization and OOD generalization algorithms on our datasets, thus underscoring the new challenges posed by time series tasks.

URL: https://openreview.net/forum?id=mvftzofTYQ

---

Title: Representations and Computations in Transformers that Support Generalization on Structured Tasks

Authors: Yuxuan Li, James McClelland

Abstract: Transformers have shown remarkable success in natural language processing and computer vision, serving as the foundation of large language and multimodal models. These networks can capture nuanced context sensitivity across high-dimensional language tokens or image pixels, but it remains unclear how highly structured behavior and systematic generalization can arise in these systems. Here, we explore the solution process a causal transformer discovers as it learns to solve a set of algorithmic tasks involving copying, sorting, and hierarchical compositions of these operations. We search for the minimal layer and head configuration sufficient to solve these tasks and unpack the roles of the attention heads, as well as how token representations are reweighted across layers to complement these roles. Our results provide new insights into how attention layers in transformers support structured computation within and across tasks: 1) Replacing fixed position labels with labels sampled from a larger set enables strong length generalization and faster learning. The learnable embeddings of these labels develop different representations, capturing sequence order if necessary, depending on task demand. 2) Two-layer transformers can learn reliable solutions to the multi-level problems we explore. The first layer tends to transform the input representation to allow the second layer to share computation across repeated components within a task or across related tasks. 3) We introduce an analysis pipeline that quantifies how the representation space in a given layer prioritizes different aspects of each item. We show that these representations prioritize information needed to guide attention relative to information that only requires downstream readout.

URL: https://openreview.net/forum?id=oFC2LAqS6Z

---

Title: Causal Parrots: Large Language Models May Talk Causality But Are Not Causal

Authors: Matej Zečević, Moritz Willig, Devendra Singh Dhami, Kristian Kersting

Abstract: Some argue that scale is all that is needed to achieve AI, even covering causal models. We make it clear that large language models (LLMs) cannot be causal and give reasons why we might sometimes feel otherwise. To this end, we define and exemplify a new subgroup of Structural Causal Models (SCMs) that we call meta SCMs, which encode causal facts about other SCMs within their variables. We conjecture that in the cases where LLMs succeed at causal inference, an underlying meta SCM exposed correlations between causal facts in the natural language data on which the LLM was ultimately trained. If our hypothesis holds true, this would imply that LLMs are like parrots in that they simply recite the causal knowledge embedded in the data. Our empirical analysis provides favoring evidence that current LLMs are, at best, weak 'causal parrots.'

URL: https://openreview.net/forum?id=tv46tCzs83

---

Title: An Option-Dependent Analysis of Regret Minimization Algorithms in Finite-Horizon Semi-MDP

Authors: Gianluca Drappo, Alberto Maria Metelli, Marcello Restelli

Abstract: A large variety of real-world Reinforcement Learning (RL) tasks is characterized by a complex and heterogeneous structure that makes end-to-end (or flat) approaches hardly applicable or even infeasible. Hierarchical Reinforcement Learning (HRL) provides general solutions to these problems thanks to a convenient multi-level decomposition of the tasks that makes their solution accessible. Although HRL is often used in practice, few works provide theoretical guarantees that justify its effectiveness. Thus, it is not yet clear when to prefer such approaches over standard flat ones. In this work, we provide an option-dependent upper bound on the regret suffered by regret minimization algorithms in finite-horizon problems. We illustrate that the performance improvement derives from the planning horizon reduction induced by the temporal abstraction enforced by the hierarchical structure. Then, focusing on a sub-setting of HRL approaches, the options framework, we highlight how the average duration of the available options affects the planning horizon and, consequently, the regret itself. Finally, we relax the assumption of having pre-trained options to show how, in particular situations, a hierarchical approach is still preferable to a standard one.

URL: https://openreview.net/forum?id=VP9p4u9jAo

---

Title: On Adaptivity in Quantum Testing

Authors: Omar Fawzi, Nicolas Flammarion, Aurélien Garivier, Aadil Oufkir

Abstract: Can adaptive strategies outperform non-adaptive ones for quantum hypothesis selection? We exhibit problems where adaptive strategies provably reduce the number of required samples by a factor of four in the worst case, and possibly by more when the actual difficulty of the problem allows it. In addition, we exhibit specific hypothesis classes for which there is a provable polynomial separation between adaptive and non-adaptive strategies -- a specificity of the quantum framework that does not appear in classical testing.

URL: https://openreview.net/forum?id=Hf95zFnQ7H

---

Title: Teaching Smaller Language Models To Generalise To Unseen Compositional Questions

Authors: Tim Hartill, Neset TAN, Michael Witbrock, Patricia J. Riddle

Abstract: We equip a smaller Language Model to generalise to answering challenging compositional questions that have not been seen in training. To do so we propose a combination of multitask supervised pretraining on up to 93 tasks designed to instill diverse reasoning abilities, and a dense retrieval system that aims to retrieve a set of evidential paragraph fragments. Recent progress in question-answering has been achieved either through prompting methods against very large pretrained Language Models in zero or few-shot fashion, or by fine-tuning smaller models, sometimes in conjunction with information retrieval. We focus on the less explored question of the extent to which zero-shot generalisation can be enabled in smaller models with retrieval against a corpus within which sufficient information to answer a particular question may not exist. We establish strong baselines in this setting for diverse evaluation datasets (StrategyQA, CommonsenseQA, IIRC, DROP, Musique and ARC-DA), and show that performance can be significantly improved by adding retrieval-augmented training datasets which are designed to expose our models to a variety of heuristic reasoning strategies such as weighing partial evidence or ignoring an irrelevant context.

URL: https://openreview.net/forum?id=d4Vr6E0jjm

---

Title: Subgraph Permutation Equivariant Networks

Authors: Joshua Mitton, Roderick Murray-Smith

Abstract: In this work we develop a new method, named Sub-graph Permutation Equivariant Networks (SPEN), which provides a framework for building graph neural networks that operate on sub-graphs with a permutation-equivariant base update function and that are equivariant to a novel choice of automorphism group. Message passing neural networks have been shown to be limited in their expressive power, and recent approaches to overcome this either lack scalability or require structural information to be encoded into the feature space. The general framework presented here overcomes the scalability issues associated with global permutation equivariance by operating more locally on sub-graphs. In addition, operating on sub-graphs improves upon the expressive power of higher-dimensional global permutation equivariant networks; this is due to the fact that two non-distinguishable graphs often contain distinguishable sub-graphs. Furthermore, the proposed framework only requires a choice of $k$-hops for creating ego-network sub-graphs and a choice of representation space to be used for each layer, which makes the method easily applicable across a range of graph-based domains. We experimentally validate the method on a range of graph benchmark classification tasks, demonstrating statistically indistinguishable results from the state-of-the-art on six out of seven benchmarks. Further, we demonstrate that the use of local update functions offers a significant improvement in GPU memory over global methods.

URL: https://openreview.net/forum?id=3agxS3aDUs

---

Title: About the Cost of Central Privacy in Density Estimation

Authors: Clément Lalanne, Aurélien Garivier, Rémi Gribonval

Abstract: We study non-parametric density estimation for densities in Lipschitz and Sobolev spaces, and under central privacy. In particular, we investigate regimes where the privacy budget is not supposed to be constant. We consider the classical definition of central differential privacy, but also the more recent notion of central concentrated differential privacy. We recover the result of Barber & Duchi (2014) stating that histogram estimators are optimal against Lipschitz distributions for the L2 risk and, under regular differential privacy, we extend it to other norms and notions of privacy. Then, we investigate higher degrees of smoothness, drawing two conclusions: First, and contrary to what happens with constant privacy budget (Wasserman & Zhou, 2010), there are regimes where imposing privacy degrades the regular minimax risk of estimation on Sobolev densities. Second, so-called projection estimators are near-optimal against the same classes of densities in this new setup with pure differential privacy, but contrary to the constant privacy budget case, it comes at the cost of relaxation. With zero concentrated differential privacy, there is no need for relaxation, and we prove that the estimation is optimal.

URL: https://openreview.net/forum?id=uq29MIWvIV

---

Title: Some Remarks on Identifiability of Independent Component Analysis in Restricted Function Classes

Authors: Simon Buchholz

Abstract: In this short note, we comment on recent results on identifiability of independent component analysis.
We point out an error in earlier works and clarify that this error cannot be fixed as the chosen approach is not sufficiently
powerful to prove identifiability results. In addition, we explain the necessary ingredients to prove stronger identifiability results.
Finally, we discuss and extend the flow-based technique to construct spurious solutions for independent component analysis problems
and provide a counterexample to an earlier identifiability result.

URL: https://openreview.net/forum?id=REtKapdkyI

---

Title: You Only Transfer What You Share: Intersection-Induced Graph Transfer Learning for Link Prediction

Authors: Wenqing Zheng, Edward W Huang, Nikhil Rao, Zhangyang Wang, Karthik Subbian

Abstract: Link prediction is central to many real-world applications, but its performance may be hampered when the graph of interest is sparse. To alleviate issues caused by sparsity, we investigate a previously overlooked phenomenon: in many cases, a densely connected, complementary graph can be found for the original graph. The denser graph may share nodes with the original graph, which offers a natural bridge for transferring selective, meaningful knowledge. We identify this setting as Graph Intersection-induced Transfer Learning (GITL), which is motivated by practical applications in e-commerce or academic co-authorship predictions. We develop a framework to effectively leverage the structural prior in this setting. We first create an intersection subgraph using the shared nodes between the two graphs, then transfer knowledge from the source-enriched intersection subgraph to the full target graph. In the second step, we consider two approaches: a modified label propagation, and a multi-layer perceptron (MLP) model in a teacher-student regime. Experimental results on proprietary e-commerce datasets and open-source citation graphs show that the proposed workflow outperforms existing transfer learning baselines that do not explicitly utilize the intersection structure.

URL: https://openreview.net/forum?id=Nn71AdKyYH

---

Title: Logistic-Normal Likelihoods for Heteroscedastic Label Noise

Authors: Erik Englesson, Amir Mehrpanah, Hossein Azizpour

Abstract: A natural way of estimating heteroscedastic label noise in regression is to model the observed (potentially noisy) target as a sample from a normal distribution whose parameters are learned by minimizing the negative log-likelihood. This formulation has desirable loss attenuation properties, as it reduces the contribution of high-error examples. Intuitively, this behavior can improve robustness against label noise by reducing overfitting. We propose an extension of this simple and probabilistic approach to classification that has the same desirable loss attenuation properties. Furthermore, we discuss and address some practical challenges of this extension. We evaluate the effectiveness of the method by measuring its robustness against label noise in classification. We perform enlightening experiments exploring the inner workings of the method, including sensitivity to hyperparameters, ablation studies, and other insightful analyses.

URL: https://openreview.net/forum?id=7wA65zL3B3
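
The classification extension is the paper's contribution; as a reference point, the following is a minimal sketch of the regression-side Gaussian negative log-likelihood the abstract starts from, showing the loss-attenuation effect. The shapes and values are illustrative, not the authors' code.

    import torch

    def gaussian_nll(mean, log_var, target):
        # 0.5 * [(y - mu)^2 / sigma^2 + log sigma^2], dropping the constant term
        inv_var = torch.exp(-log_var)
        return 0.5 * (inv_var * (target - mean) ** 2 + log_var).mean()

    # A larger predicted log-variance down-weights (attenuates) the squared error
    # of a noisy example, at the price of the log-variance penalty term.
    mean = torch.zeros(2)
    log_var = torch.tensor([0.0, 2.0])   # second example is treated as noisier
    target = torch.tensor([3.0, 3.0])
    print(gaussian_nll(mean, log_var, target))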

---

Title: RECLIP: Resource-efficient CLIP by Training with Small Images

Authors: Runze Li, Dahun Kim, Bir Bhanu, Weicheng Kuo

Abstract: We present RECLIP (Resource-efficient CLIP), a simple method that minimizes the computational resource footprint of CLIP (Contrastive Language Image Pretraining). Inspired by the notion of coarse-to-fine in computer vision, we leverage small images to learn from large-scale language supervision efficiently, and finetune the model with high-resolution data in the end. Since the complexity of the vision transformer heavily depends on input image size, our approach significantly reduces the training resource requirements both in theory and in practice. Using the same batch size and training epochs, RECLIP achieves highly competitive zero-shot classification and image-text retrieval accuracy with 6 to 8× fewer computational resources and 7 to 9× fewer FLOPs than the baseline. Compared to the state-of-the-art contrastive learning methods, RECLIP demonstrates 5 to 59× training resource savings while maintaining highly competitive zero-shot classification and retrieval performance. Finally, RECLIP matches the state of the art in transfer learning to open-vocabulary detection tasks, achieving 32 APr on LVIS. We hope this work will pave the path for the broader research community to explore language supervised pretraining in resource-friendly settings.

URL: https://openreview.net/forum?id=Ufc5cWhHko
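
A minimal sketch of the coarse-to-fine sizing schedule the abstract describes; the image sizes, epoch counts, and helper names are illustrative assumptions, not the released RECLIP recipe.

    import torchvision.transforms as T

    def clip_transform(image_size):
        # standard resize/crop plus tensor conversion; normalization omitted for brevity
        return T.Compose([T.RandomResizedCrop(image_size), T.ToTensor()])

    low_res = clip_transform(64)     # bulk of pretraining on small images (size assumed)
    high_res = clip_transform(224)   # short final fine-tuning phase at full resolution

    # Schedule (train_clip and make_loader are hypothetical helpers):
    # train_clip(model, make_loader(low_res), epochs=main_epochs)
    # train_clip(model, make_loader(high_res), epochs=finetune_epochs)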

---

Title: Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

Authors: Washim Uddin Mondal, Vaneet Aggarwal

Abstract: We investigate an infinite-horizon average reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. The delay and compositeness of rewards mean that rewards generated as a result of taking an action at a given state are fragmented into different components, and they are sequentially realized at delayed time instances. The partial anonymity attribute implies that a learner, for each state, only observes the aggregate of past reward components generated as a result of different actions taken at that state, but realized at the observation instance. We propose an algorithm named $\mathrm{DUCRL2}$ to obtain a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}\left(DS\sqrt{AT} + d (SA)^3\right)$ where $S$ and $A$ are the sizes of the state and action spaces, respectively, $D$ is the diameter of the MDP, $d$ is a parameter upper bounded by the maximum reward delay, and $T$ denotes the time horizon. This demonstrates the optimality of the bound in the order of $T$, and an additive impact of the delay.

URL: https://openreview.net/forum?id=ubCoTAynPp

---

Title: Towards Multi-spatiotemporal-scale Generalized PDE Modeling

Authors: Jayesh K Gupta, Johannes Brandstetter

Abstract: Partial differential equations (PDEs) are central to describing complex physical system simulations. Their expensive solution techniques have led to an increased interest in deep neural network based surrogates. However, the practical utility of training such surrogates is contingent on their ability to model complex multi-scale spatio-temporal phenomena. In recent years, various neural network architectures have been proposed to target such phenomena, most notably Fourier Neural Operators (FNOs), which give a natural handle over local and global spatial information via parameterization of different Fourier modes, and U-Nets which treat local and global information via downsampling and upsampling paths. However, large-scale comparisons between these convolution-based approaches are notoriously sparse. In this work, we make such comprehensive comparisons regarding performance, runtime complexity, memory requirements, and generalization capabilities. Concretely, we stress-test various FNO, (Dilated) ResNet, and U-Net like approaches on fluid mechanics problems in both vorticity-stream and velocity function form. For U-Nets, we transfer recent architectural improvements from computer vision, most notably from object segmentation and generative modeling. Next, we use our insights on design considerations, and introduce U-FNets, i.e., modern U-Nets that are augmented with FNO downsampling layers. Those architectures further improve performance without major degradation of computational cost. Finally, we ablate and discuss various choices for parameter conditioning, and show promising results on generalization to different PDE parameters and time-scales with a single surrogate model. Source code for our PyTorch benchmark framework is available at https://anonymous.4open.science/r/tmlr-pdemulti-6677/.

URL: https://openreview.net/forum?id=dPSTDbGtBY

---

Title: The Multiquadric Kernel for Moment-Matching Distributional Reinforcement Learning

Authors: Ludvig Killingberg, Helge Langseth

Abstract: Distributional reinforcement learning has gained significant attention in recent years due to its ability to handle uncertainty and variability in the returns an agent will receive for each action it takes. A key challenge in distributional reinforcement learning is finding a measure of the difference between two distributions that is well-suited for use with the distributional Bellman operator, a function that takes in a value distribution and produces a modified distribution based on the agent's current state and action. In this paper, we address this challenge by introducing the multiquadric kernel to moment-matching distributional reinforcement learning. We show that this kernel is both theoretically sound and empirically effective. Our contribution is mainly of a theoretical nature, presenting the first formally sound kernel for moment-matching distributional reinforcement learning with good practical performance. We also provide insights into why the RBF kernel has been shown to provide good practical results despite its theoretical problems. Finally, we evaluate the performance of our kernel on a number of standard benchmarks, obtaining results comparable to the state-of-the-art.

URL: https://openreview.net/forum?id=z49eaB8kiH

---


New submissions
===============


Title: Accelerating Batch Active Learning Using Continual Learning Techniques

Abstract: A major problem with Active Learning (AL) is high training costs, since models are typically retrained from scratch after every query round. We start by demonstrating that standard AL on neural networks with warm starting fails, both to accelerate training and to avoid catastrophic forgetting when using fine-tuning over AL query rounds. We then develop a new class of techniques that circumvent this problem by biasing further training towards previously labeled sets. We accomplish this by employing existing, and developing novel, replay-based Continual Learning (CL) algorithms that are effective at quickly learning the new without forgetting the old, especially when data comes from an evolving distribution. We call this paradigm "Continual Active Learning" (CAL). We show CAL achieves significant speedups using a plethora of replay schemes that use model distillation and that select diverse/uncertain points from the history. We conduct experiments across many data domains, including natural language, vision, medical imaging, and computational biology, each with different neural architectures and dataset sizes. CAL consistently provides a $\sim$3x reduction in training time, while retaining performance and out-of-distribution robustness, showing its wide applicability.

URL: https://openreview.net/forum?id=T55dLSgsEf
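
A minimal sketch of the replay idea behind CAL, under assumed interfaces (a model, a caller-supplied train_step, and plain Python lists of labeled examples); it is not the authors' implementation, only the shape of one query round: fine-tune on newly labeled points mixed with a replay sample from earlier rounds instead of retraining from scratch.

    import random

    def make_batches(points, batch_size=32):
        points = list(points)
        random.shuffle(points)
        return [points[i:i + batch_size] for i in range(0, len(points), batch_size)]

    def cal_round(model, train_step, new_points, history, replay_size=256):
        """One query round: continue training on new labels mixed with replayed old ones."""
        replay = random.sample(history, min(replay_size, len(history)))
        for batch in make_batches(new_points + replay):
            train_step(model, batch)      # a single optimization step on the mixed batch
        history.extend(new_points)
        return model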

---

Title: HICO-DET-SG and V-COCO-SG: New Data Splits for Evaluating the Systematic Generalization Performance of Human-Object Interaction Detection Models

Abstract: Human-Object Interaction (HOI) detection is a task to localize humans and objects in an image and predict the interactions in human-object pairs. In real-world scenarios, HOI detection models are required to generalize systematically, i.e., to novel combinations of objects and interactions, because the training data are expected to cover only a limited portion of all possible combinations. However, to our knowledge, no open benchmarks or previous work exist for evaluating the systematic generalization performance of HOI detection models. To address this issue, we created two new sets of HOI detection data splits named HICO-DET-SG and V-COCO-SG, based on the HICO-DET and V-COCO datasets, respectively. When evaluated on the new data splits, representative HOI detection models performed much more poorly than when evaluated on the original splits. This reveals that systematic generalization is a challenging goal in HOI detection. By analyzing the evaluation results, we also gain insights for improving systematic generalization performance and identify four possible future research directions. We hope that our new data splits and presented analysis will encourage further research on systematic generalization in HOI detection.

URL: https://openreview.net/forum?id=ywK0NdJBOk

---

Title: RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Abstract: Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Using a reward model and a sufficient number of samples, our approach selects high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve model performance, as measured by the learned reward and other automated metrics, for both large language models and diffusion models.

URL: https://openreview.net/forum?id=m7p5O7zblY

---

Title: Universal Graph Continual Learning

Abstract: We address catastrophic forgetting in graph learning as incoming data transitions from one graph distribution to another. Whereas prior studies primarily tackle a single setting of graph continual learning, such as incremental node classification, we focus on a universal approach wherein each data point in a task can be a node or a graph, and the task varies from node to graph classification. We propose a novel method that enables graph neural networks to excel in this universal setting. Our approach preserves knowledge about past tasks through a rehearsal mechanism that maintains local and global structure consistency across the graphs. We benchmark our method against various continual learning baselines on real-world graph datasets and achieve significant improvement in average performance and forgetting across tasks.

URL: https://openreview.net/forum?id=wzRE5kTnl3

---

Title: U-Turn Diffusion

Abstract: We present a comprehensive examination of score-based diffusion models for generating synthetic images. These models hinge upon a dynamic auxiliary time mechanism driven by stochastic differential equations, wherein the score function is acquired from input images. Our investigation unveils a criterion for evaluating the efficiency of score-based diffusion models: the power of the generative process depends on the ability to de-construct fast correlations during the reverse/de-noising phase. To improve the quality of the produced synthetic images, we introduce an approach coined "U-Turn Diffusion". The U-Turn Diffusion technique starts with the standard forward diffusion process, albeit with a condensed duration compared to conventional settings. Subsequently, we execute the standard reverse dynamics, initialized with the concluding configuration from the forward process. This U-Turn Diffusion procedure, combining forward, U-turn, and reverse processes, creates a synthetic image approximating an independent and identically distributed (i.i.d.) sample from the probability distribution implicitly described by the input samples. To analyze the relevant time scales, we employ various analytical tools, including auto-correlation analysis, weighted norm of the score-function analysis, and the Kolmogorov-Smirnov Gaussianity test. These tools guide us in establishing that analysis of the Kernel Intersection Distance, a metric comparing synthetic samples with real data samples, reveals the optimal U-turn time.

URL: https://openreview.net/forum?id=bON6OeUdsL
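
A minimal sketch of the U-turn procedure, assuming a pretrained score function score_fn(x, t) and a simple constant-coefficient variance-preserving discretization; the step counts and coefficients are illustrative, not the paper's settings.

    import numpy as np

    def u_turn_sample(x0, score_fn, t_uturn=0.4, n_steps=400, beta=8.0, seed=0):
        rng = np.random.default_rng(seed)
        dt = 1.0 / n_steps
        k = int(t_uturn * n_steps)
        x = np.array(x0, dtype=float)
        # forward (noising) phase, stopped early at the U-turn time
        for _ in range(k):
            x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
        # reverse (denoising) phase, initialized at the U-turn state
        for i in reversed(range(k)):
            t = (i + 1) * dt
            x += (0.5 * beta * x + beta * score_fn(x, t)) * dt \
                 + np.sqrt(beta * dt) * rng.standard_normal(x.shape)
        return x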

---

Title: Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data

Abstract: Decentralized learning algorithms enable the training of deep learning models over large distributed datasets without the need for a central server. The current state-of-the-art decentralized algorithms mostly assume the data distributions to be Independent and Identically Distributed (IID). In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. This paper focuses on improving decentralized learning on non-IID data with minimal compute and memory overheads. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. In particular, the proposed method averages the local gradients with model-variant or data-variant cross-gradients based on the communication budget. Model-variant cross-gradients are derivatives of the received neighbors' model parameters with respect to the local dataset. Data-variant cross-gradients are derivatives of the local model with respect to its neighbors' datasets. The data-variant cross-gradients are aggregated through an additional communication round without breaking the privacy constraints of the decentralized setting. We theoretically analyze the convergence characteristics of NGC and demonstrate its efficiency on non-IID data sampled from various vision and language datasets. Our experiments demonstrate that the proposed method either remains competitive or outperforms (by $0-6\%$) the existing state-of-the-art (SoTA) decentralized learning algorithm on non-IID data with significantly less compute and memory requirements. Further, we show that the model-variant cross-gradient information available locally at each agent can improve the performance on non-IID data by $1-35\%$ without additional communication cost.

URL: https://openreview.net/forum?id=vkiKzK5G3e
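
A minimal sketch of the core mixing step, not the authors' implementation; the averaging rule and the coefficient alpha are illustrative assumptions.

    import torch

    def ngc_mixed_gradient(local_grad, cross_grads, alpha=0.5):
        """local_grad: tensor; cross_grads: list of cross-gradient tensors from neighbors."""
        if not cross_grads:
            return local_grad
        neighbor_avg = torch.stack(cross_grads).mean(dim=0)
        return alpha * local_grad + (1.0 - alpha) * neighbor_avg

    # Each agent then takes its local step:
    #   w <- w - lr * ngc_mixed_gradient(g_local, received_cross_gradients)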

---

Title: Learning Representations for Reinforcement Learning with Hierarchical Forward Models

Abstract: Learning control from pixels is difficult for reinforcement learning (RL) agents because representation learning and policy learning are intertwined. Previous approaches remedy this issue with auxiliary representation learning tasks, but they either do not consider the temporal aspect of the problem or only consider single-step transitions, which may cause learning inefficiencies if important environmental changes take many steps to manifest. We propose Hierarchical $k$-Step Latent (HKSL), an auxiliary task that learns multiple representations via a hierarchy of forward models that learn to communicate and an ensemble of $n$-step critics, all operating at varying magnitudes of step skipping. We evaluate HKSL on a suite of 30 robotic control tasks, with and without distractors, and on a task of our own creation. We find that HKSL either converges to higher episodic returns or reaches optimal returns more quickly than several alternative representation learning approaches. Furthermore, we find that HKSL's representations capture task-relevant details accurately across timescales (even in the presence of distractors) and that communication channels between hierarchy levels organize information based on both sides of the communication process, both of which improve sample efficiency.

URL: https://openreview.net/forum?id=fovUNTilp9

---

Title: Fair Evaluation of Graph Markov Neural Networks

Abstract: Graph Markov Neural Networks (GMNN) have recently been proposed to improve regular graph neural networks (GNN) by including label dependencies in the semi-supervised node classification task. GMNNs do this in a theoretically principled way and use three kinds of information to predict labels. Just like ordinary GNNs, they use the node features and the graph structure, but they moreover leverage information from the labels of neighboring nodes to improve the accuracy of their predictions. In this paper, we introduce a new dataset named WikiVitals, which contains a graph of 48k mutually referring Wikipedia articles classified into 32 categories and connected by 2.3M edges. Our aim is to rigorously evaluate the contributions of three distinct sources of information to the prediction accuracy of GMNNs on this dataset: the content of the articles, their connections with each other, and the correlations among their labels. For this purpose we adapt a method recently proposed for performing fair comparisons of GNN performance, using an appropriate randomization over partitions and a clear separation of model selection and model assessment.

URL: https://openreview.net/forum?id=4n9itvGILv

---

Title: Early Stopping for Deep Image Prior

Abstract: Deep image prior (DIP) and its variants have shown remarkable potential for solving inverse problems in computational imaging (CI), needing no separate training data. Practical DIP models are often substantially overparameterized. During the learning process, these models first learn the desired visual content and then pick up potential modeling and observational noise, i.e., they perform early learning and then overfit. Thus, the practicality of DIP hinges on early stopping (ES) that can capture the transition period. In this regard, most previous DIP works for CI tasks only demonstrate the potential of the models, reporting the peak performance against the ground truth but providing no clue about how to operationally obtain near-peak performance without access to the ground truth. In this paper, we set out to break this practicality barrier of DIP and propose an effective ES strategy that consistently detects near-peak performance across several CI tasks and DIP variants. Based simply on the running variance of DIP intermediate reconstructions, our ES method not only outpaces existing ones---which only work in very narrow regimes---but also remains effective when combined with methods that try to mitigate overfitting.

URL: https://openreview.net/forum?id=231ZzrLC8X
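
A minimal sketch of a running-variance stopping rule in the spirit of the abstract; the window size and patience are illustrative assumptions, not the paper's tuned values.

    from collections import deque
    import numpy as np

    class RunningVarianceES:
        def __init__(self, window=30, patience=100):
            self.buf = deque(maxlen=window)
            self.best = np.inf
            self.patience = patience
            self.stall = 0

        def step(self, recon):
            """recon: current DIP reconstruction (ndarray). Returns True when training should stop."""
            self.buf.append(recon)
            if len(self.buf) < self.buf.maxlen:
                return False
            var = np.stack(list(self.buf)).var(axis=0).mean()
            if var < self.best:
                self.best, self.stall = var, 0
            else:
                self.stall += 1
            return self.stall >= self.patience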

---

Title: Unifying Generative Models with GFlowNets and Beyond

Abstract: There are many frameworks for generative modeling, each often presented with its own specific training algorithms and inference methods. Here, we demonstrate the connections between existing generative models and the recently introduced GFlowNet framework (Bengio et al.), a probabilistic inference machine that treats sampling as a decision-making process. This analysis sheds light on their overlapping traits and provides a unifying viewpoint through the lens of learning with Markovian trajectories. Our framework provides a means of unifying training and inference algorithms and offers a single lens through which to view many generative models. Beyond this, we provide a practical and experimentally verified recipe for improving generative modeling with insights from the GFlowNet perspective.

URL: https://openreview.net/forum?id=BmFsyIEFu4

---

Title: The (Un)Scalability of Informed Heuristic Function Estimation in NP-Hard Search Problems

Abstract: The A* algorithm is commonly used to solve NP-hard combinatorial optimization problems. When provided with a completely informed heuristic function, A* can solve such problems in time complexity that is polynomial in the solution cost and branching factor. In light of this fact, we examine a line of recent publications that propose fitting deep neural networks to the completely informed heuristic function. We assert that these works suffer from inherent scalability limitations since --- under the assumption of NP $\not \subseteq$ P/poly --- such approaches result in either (a) network sizes that scale super-polynomially with the instance size or (b) fitted networks whose accuracy scales inversely with the instance size. Complementing our theoretical claims, we provide experimental results for three representative NP-hard search problems. The results suggest that fitting deep neural networks to informed heuristic functions requires network sizes that grow exponentially with the problem instance size. We conclude by suggesting that the research community should focus on scalable methods for integrating heuristic search with machine learning, as opposed to methods relying on informed heuristic estimation.

URL: https://openreview.net/forum?id=JllRdycmLk

---

Title: Do we need to estimate the variance in robust mean estimation?

Abstract: In this paper, we propose self-tuned robust estimators for estimating the mean of distributions with only finite variances. Our method involves introducing a new loss function that considers both the mean parameter and a robustification parameter. By simultaneously optimizing the empirical loss function with respect to both parameters, the resulting estimator for the robustification parameter can adapt to the unknown variance automatically and can achieve near-optimal finite-sample performance. Our approach outperforms previous methods in terms of both computational and asymptotic efficiency. Specifically, it does not require cross-validation or Lepski's method to tune the robustification parameter, and the variance of our estimator achieves the Cramér-Rao lower bound.

URL: https://openreview.net/forum?id=CIv8NfvxsX

---

Title: IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

Abstract: India has a rich linguistic landscape, with languages from 4 major language families spoken by over a billion people. 22 of these languages listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first $n$-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and conversational test sets. Next, we present IndicTrans2, the first translation model to support all 22 languages, surpassing existing models in performance on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we will release our models and associated data with permissive licenses upon de-anonymization.

URL: https://openreview.net/forum?id=vfT4YuzAYA

---

Title: Training DNNs Resilient to Adversarial and Random Bit-Flips by Learning Quantization Ranges

Abstract: Promoting robustness in deep neural networks (DNNs) is crucial for their reliable deployment in uncertain environments, such as low-power settings or in the presence of adversarial attacks. In particular, bit-flip weight perturbations in quantized networks can significantly degrade performance, underscoring the need to improve DNN resilience. In this paper, we introduce a training mechanism to learn the quantization range of different DNN layers to enhance DNN robustness against bit-flip errors on the model parameters. The proposed approach, called weight clipping-aware training (WCAT), minimizes the quantization range while preserving performance, striking a balance between the two.
Our experimental results on different models and datasets showcase that DNNs trained with WCAT can tolerate a high amount of noise while keeping the accuracy close to the baseline model. Moreover, we show that our method significantly enhances DNN robustness against adversarial bit-flip attacks. Finally, when considering the energy-reliability trade-off inherent in on-chip SRAM memories, we observe that WCAT consistently improves the Pareto frontier of test accuracy and energy consumption across diverse models.

URL: https://openreview.net/forum?id=BxjHMPwZIH
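
A minimal sketch of a linear layer with a learnable clipping/quantization range, in the spirit of what the abstract describes; the uniform quantizer, bit width, and straight-through trick are illustrative assumptions rather than the paper's exact training mechanism.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ClippedQuantLinear(nn.Linear):
        """Linear layer whose weights are clipped to a learnable range and uniformly quantized."""
        def __init__(self, in_features, out_features, bits=8):
            super().__init__(in_features, out_features)
            self.alpha = nn.Parameter(torch.tensor(1.0))   # learnable clipping range
            self.levels = 2 ** (bits - 1) - 1

        def forward(self, x):
            alpha = torch.abs(self.alpha) + 1e-8
            w = torch.maximum(torch.minimum(self.weight, alpha), -alpha)  # clip to [-alpha, alpha]
            step = alpha / self.levels
            w_q = torch.round(w / step) * step
            w_q = w + (w_q - w).detach()                    # straight-through estimator
            return F.linear(x, w_q, self.bias)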

---

Title: Spiking Synaptic Penalty: Appropriate Penalty Term for Energy-Efficient Spiking Neural Networks

Abstract: Spiking neural networks (SNNs) are energy-efficient neural networks because of their spiking nature. However, as the spike firing rate of an SNN increases, so does its energy consumption, and thus the advantage of SNNs diminishes. Here, we tackle this problem by introducing a novel penalty term for the spiking activity into the objective function used in the training phase. Our method is designed to optimize the energy consumption metric directly, without modifying the network architecture. Therefore, the proposed method can reduce energy consumption more than other methods while maintaining accuracy. We conducted experiments on image classification tasks, and the results indicate the effectiveness of the proposed method, which mitigates the energy-accuracy trade-off.

URL: https://openreview.net/forum?id=42BKnT2qW3
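
A minimal sketch of the generic idea (the paper's exact energy-consumption metric is not reproduced here): add a penalty proportional to the total spiking activity to the task loss; lam is an illustrative coefficient.

    import torch

    def loss_with_spike_penalty(task_loss, spike_trains, lam=1e-4):
        """spike_trains: iterable of {0, 1} spike tensors, one per layer and time step."""
        spike_cost = sum(s.sum() for s in spike_trains)
        return task_loss + lam * spike_cost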

---

Title: FREED++: Improving RL Agents for Fragment-Based Molecule Generation by Thorough Reproduction

Abstract: The rational design of new therapeutic drugs aims to find a molecular structure with desired biological functionality, e.g., the ability to activate or suppress a specific protein via binding to it.
Molecular docking is a common technique for evaluating protein-molecule interactions.
Recently, Reinforcement Learning (RL) has emerged as a promising approach to generating molecules with the docking score (DS) as a reward.
In this work, we reproduce, scrutinize and improve the recent RL model for molecule generation called FREED (Yang et al., 2021).
Extensive evaluation of the proposed method reveals several limitations and challenges despite the outstanding results reported for three target proteins.
Our contributions include fixing numerous implementation bugs and simplifying the model while increasing its quality, significantly extending experiments, and conducting an accurate comparison with current state-of-the-art methods for protein-conditioned molecule generation.
We show that the resulting fixed model is capable of producing molecules with superior docking scores compared to alternative approaches.

URL: https://openreview.net/forum?id=YVPb6tyRJu

---

Title: Preliminary Analysis of Adversarial Susceptibility in Neural Networks via Analogy to Chaos-Theoretic Sensitive Dependence

Abstract: Although the susceptibility of neural networks to adversarial attacks has been repeatedly demonstrated experimentally, with a vast array of new attacks and defenses having been developed around this principle, theoretical analysis of why these models succumb to such attacks in the first place has been limited. This paper uses ideas from Chaos Theory to explain, analyze, and quantify the degree to which neural networks are susceptible to or robust against adversarial attacks, followed by brief preliminary experiments to demonstrate the validity of our ideas. To aid in experimental analysis, we present a new metric, the "susceptibility ratio," given by $\hat \Psi(h, \theta)$, which captures how greatly a model's output will be changed by perturbations to a given input.

Our theoretical and experimental results show that susceptibility to attack grows significantly with the depth of the model, which has safety implications for the design of neural networks for production environments. We provide experimental evidence of the relationship between $\hat \Psi$ and the post-attack accuracy of classification models, as well as a discussion of its application to tasks lacking hard decision boundaries.

URL: https://openreview.net/forum?id=iUGyShok1T
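
The abstract does not spell out the formula for $\hat \Psi$, so the following is only an illustrative finite-difference measurement of the same kind of quantity: how much a model's output moves relative to the size of a random input perturbation.

    import torch

    def susceptibility_ratio(model, x, eps=1e-2, n_trials=16):
        """Average ratio of output change to input perturbation size around x."""
        ratios = []
        with torch.no_grad():
            y = model(x)
            for _ in range(n_trials):
                delta = eps * torch.randn_like(x)
                ratios.append((model(x + delta) - y).norm() / delta.norm())
        return torch.stack(ratios).mean()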

---

Title: SMGRL: Scalable Multi-resolution Graph Representation Learning

Abstract: Graph convolutional networks (GCNs) allow us to learn topologically-aware node embeddings, which can be useful for classification or link prediction. However, they are unable to capture long-range dependencies between nodes without adding additional layers---which in turn leads to over-smoothing and increased time and space complexity. Further, the complex dependencies between nodes make mini-batching challenging, limiting their applicability to large graphs. We propose a Scalable Multi-resolution Graph Representation Learning (SMGRL) framework that enables us to learn multi-resolution node embeddings efficiently. Our framework is model-agnostic and can be applied to any existing GCN model. We dramatically reduce training costs by training only on a reduced-dimension coarsening of the original graph, then exploit self-similarity to apply the resulting algorithm at multiple resolutions. The resulting multi-resolution embeddings can be aggregated to yield high-quality node embeddings that capture both long- and short-range dependencies. Our experiments show that this leads to improved classification accuracy, without incurring high computational costs.

URL: https://openreview.net/forum?id=ZBiXWM0AV8

---

Title: Towards fully covariant machine learning

Abstract: Any representation of data involves arbitrary investigator choices. Because those choices are external to the data-generating process, each choice leads to an exact symmetry, corresponding to the group of transformations that takes one possible representation to another.
These are the passive symmetries; they include coordinate freedom, gauge symmetry, and units covariance, all of which have led to important results in physics. In machine learning, the most visible passive symmetry is the relabeling or permutation symmetry of graphs.
Our goal is to understand the implications for machine learning of the many passive symmetries in play. We discuss dos and don'ts for machine learning practice if passive symmetries are to be respected. We discuss links to causal modeling, and argue that the implementation of passive symmetries is particularly valuable when the goal of the learning problem is to generalize out of sample. This paper is conceptual: It translates among the languages of physics, mathematics, and machine learning. We believe that consideration and implementation of passive symmetries might help machine learning in the same ways that it transformed physics in the twentieth century.

URL: https://openreview.net/forum?id=gllUnpYuXg

---

Title: Data Dependent Generalization Bounds for Neural Networks with ReLU

Abstract: We aim to establish that one of the correct data-dependent quantities to examine when proving generalization bounds, even for overparameterized neural networks, is the set of gradients encountered by stochastic gradient descent while training the model. If these are small, then the model generalizes. To make this conclusion rigorous, we weaken the notion of uniform stability of a learning algorithm in a probabilistic way by positing the notion of almost sure (a.s.) support stability and showing that algorithms with this form of stability have generalization error tending to 0 as the training set size increases. Further, we show that for Stochastic Gradient Descent to be a.s. support stable we only need the loss function to be a.s. locally Lipschitz and locally smooth at the training points, thereby showing low generalization error under weaker conditions than have been used in the literature. We then show that neural networks with ReLU activation and a doubly differentiable loss function possess these properties. Our notion of stability is the first data-dependent notion able to show good generalization bounds for non-convex functions with learning rates decaying strictly slower than $1/t$ at the $t$-th step. Finally, we present experimental evidence to validate our theoretical results.

URL: https://openreview.net/forum?id=mH6TelHVKD

---

Title: Rewiring with Positional Encodings for Graph Neural Networks

Abstract: Several recent works use positional encodings to extend the receptive fields of graph neural network (GNN) layers equipped with attention mechanisms. These techniques, however, extend receptive fields to the complete graph, at substantial computational cost and risking a change in the inductive biases of conventional GNNs, or require complex architecture adjustments. As a conservative alternative, we use positional encodings to expand receptive fields to $r$-hop neighborhoods. More specifically, our method augments the input graph with additional nodes/edges and uses positional encodings as node and/or edge features. We thus modify graphs before inputting them to a downstream GNN model, instead of modifying the model itself. This makes our method model-agnostic, i.e., compatible with any of the existing GNN architectures. We also provide examples of positional encodings that are lossless with a one-to-one map between the original and the modified graphs. We demonstrate that extending receptive fields via positional encodings and a virtual fully-connected node significantly improves GNN performance and alleviates over-squashing using small $r$. We obtain improvements on a variety of models and datasets and reach state-of-the-art performance using traditional GNNs or graph Transformers.

URL: https://openreview.net/forum?id=dn3ZkqG2YV
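
A minimal sketch of the graph-level rewiring step, using shortest-path (hop) distance as the positional edge feature; this is one possible lossless choice, not necessarily the encoding used in the paper. The downstream GNN itself is left untouched.

    import networkx as nx

    def rewire_r_hop(g, r=2):
        """Return a copy of g with an edge for every node pair within r hops,
        annotated with the hop distance as a positional edge feature."""
        h = g.copy()
        for u, lengths in nx.all_pairs_shortest_path_length(g, cutoff=r):
            for v, d in lengths.items():
                if d >= 1:
                    h.add_edge(u, v, hop_distance=d)
        return h

    # Example: h = rewire_r_hop(nx.path_graph(6), r=3); h then feeds any GNN unchanged.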

---
