Weekly TMLR digest for Oct 03, 2023

TMLR

Oct 3, 2023, 12:03:32 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Expert Certification: Does ‘Deep Learning on a Data Diet’ reproduce? Overall yes, but GraNd at Initialization does not

Andreas Kirsch

https://openreview.net/forum?id=1dwXa9vmOI

---


Expert Certification: DPVIm: Differentially Private Variational Inference Improved

Joonas Jälkö, Lukas Prediger, Antti Honkela, Samuel Kaski

https://openreview.net/forum?id=GlhM6XX1wv

---


Expert Certification: Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning

Andreas Kirsch, Sebastian Farquhar, Parmida Atighehchian, Andrew Jesson, Frédéric Branchaud-Charron, Yarin Gal

https://openreview.net/forum?id=vcHwQyNBjW

---


Survey Certification: A Survey on Transformers in Reinforcement Learning

Wenzhe Li, Hao Luo, Zichuan Lin, Chongjie Zhang, Zongqing Lu, Deheng Ye

https://openreview.net/forum?id=r30yuDPvf2

---


Featured Certification: On the Sample Complexity of Lipschitz Constant Estimation

Julien Walden Huang, Stephen J. Roberts, Jan-Peter Calliess

https://openreview.net/forum?id=UIalYAHdBH

---


Featured Certification: Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits

Zixin Zhong, Wang Chi Cheung, Vincent Tan

https://openreview.net/forum?id=XXfEmIMJDm

---


Featured Certification, Reproducibility Certification: High Fidelity Neural Audio Compression

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

https://openreview.net/forum?id=ivCd8z8zR2

---


Featured Certification: AP: Selective Activation for De-sparsifying Pruned Networks

Shiyu Liu, Rohan Ghosh, Mehul Motani

https://openreview.net/forum?id=EGQSpkUDdD

---


Expert Certification: Neural Causal Structure Discovery from Interventions

Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael Curtis Mozer, Christopher Pal, Yoshua Bengio

https://openreview.net/forum?id=rdHVPPVuXa

---


Accepted papers
===============


Title: Dynamic Regret Analysis of Safe Distributed Online Optimization for Convex and Non-convex Problems

Authors: Ting-Jui Chang, Sapana Chaudhary, Dileep Kalathil, Shahin Shahrampour

Abstract: This paper addresses safe distributed online optimization over an unknown set of linear safety constraints. A network of agents aims at jointly minimizing a global, time-varying function, which is only partially observable to each individual agent. Therefore, agents must engage in local communications to generate a safe sequence of actions competitive with the best minimizer sequence in hindsight, and the gap between the two sequences is quantified via dynamic regret. We propose distributed safe online gradient descent (D-Safe-OGD) with an exploration phase, where all agents estimate the constraint parameters collaboratively to build estimated feasible sets, ensuring the action selection safety during the optimization phase. We prove that for convex functions, D-Safe-OGD achieves a dynamic regret bound of $O(T^{2/3} \sqrt{\log T} + T^{1/3}C_T^*)$, where $C_T^*$ denotes the path-length of the best minimizer sequence. We further prove a dynamic regret bound of $O(T^{2/3} \sqrt{\log T} + T^{2/3}C_T^*)$ for certain non-convex problems, which establishes the first dynamic regret bound for a safe distributed algorithm in the non-convex setting.

URL: https://openreview.net/forum?id=xiQXHvL1eN

---

Title: Revisiting Image Classifier Training for Improved Certified Robust Defense against Adversarial Patches

Authors: Aniruddha Saha, Shuhua Yu, Mohammad Sadegh Norouzzadeh, Wan-Yi Lin, Chaithanya Kumar Mummadi

Abstract: Certifiably robust defenses against adversarial patches for image classifiers ensure correct prediction against any changes to a constrained neighborhood of pixels. PatchCleanser, the state-of-the-art certified defense, uses a double-masking strategy for robust classification. The success of this strategy relies heavily on the model's invariance to image pixel masking. In this paper, we take a closer look at model training schemes to improve this invariance. Instead of using Random Cutout augmentations like PatchCleanser, we introduce the notion of worst-case masking, i.e., selecting masked images which maximize classification loss. However, finding worst-case masks requires an exhaustive search, which might be prohibitively expensive to do on-the-fly during training. To solve this problem, we propose a two-round greedy masking strategy (Greedy Cutout) which finds an approximate worst-case mask location with much less compute. We show that models trained with our Greedy Cutout improve certified robust accuracy over Random Cutout in PatchCleanser across a range of datasets and architectures. Certified robust accuracy on ImageNet with a ViT-B16-224 model increases from 58.1% to 62.3% against a 3% square patch applied anywhere on the image.

URL: https://openreview.net/forum?id=2tdhQMLg36

---

Title: An Optical Control Environment for Benchmarking Reinforcement Learning Algorithms

Authors: Abulikemu Abuduweili, Changliu Liu

Abstract: Deep reinforcement learning has the potential to address various scientific problems. In this paper, we implement an optics simulation environment for reinforcement learning-based controllers. The environment captures the essence of nonconvexity, nonlinearity, and time-dependent noise inherent in optical systems, offering a more realistic setting.
Subsequently, we provide the benchmark results of several reinforcement learning algorithms on the proposed simulation environment. The experimental findings demonstrate the superiority of off-policy reinforcement learning approaches over traditional control algorithms in navigating the intricacies of complex optical control environments.

URL: https://openreview.net/forum?id=61TKzU9B96

---

Title: Learning-to-defer for sequential medical decision-making under uncertainty

Authors: Shalmali Joshi, Sonali Parbhoo, Finale Doshi-Velez

Abstract: Learning-to-defer is a framework to automatically defer decision-making to a human expert when ML-based decisions are deemed unreliable. Existing learning-to-defer frameworks are not designed for sequential settings. That is, they defer at every instance independently, based on immediate predictions, while ignoring the potential long-term impact of these interventions. As a result, existing frameworks are myopic. Further, they do not defer adaptively, which is crucial when human interventions are costly. In this work, we propose Sequential Learning-to-Defer (SLTD), a framework for learning-to-defer to a domain expert in sequential decision-making settings. Contrary to existing literature, we pose the problem of learning-to-defer as model-based reinforcement learning (RL) to i) account for long-term consequences of ML-based actions using RL and ii) adaptively defer based on the dynamics (model-based). Our proposed framework determines whether to defer (at each time step) by quantifying whether a deferral now will improve the value compared to delaying deferral to the next time step. To quantify the improvement, we account for potential future deferrals. As a result, we learn a pre-emptive deferral policy (i.e. a policy that defers early if using the ML-based policy could worsen long-term outcomes). Our deferral policy is adaptive to the non-stationarity in the dynamics. We demonstrate that adaptive deferral via SLTD provides an improved trade-off between long-term outcomes and deferral frequency on synthetic, semi-synthetic, and real-world data with non-stationary dynamics. Finally, we interpret the deferral decision by decomposing the propagated (long-term) uncertainty around the outcome, to justify the deferral decision.

URL: https://openreview.net/forum?id=0pn3KnbH5F

---

Title: Learning domain-specific causal discovery from time series

Authors: Xinyue Wang, Konrad Kording

Abstract: Causal discovery (CD) from time-varying data is important in neuroscience, medicine, and machine learning. Techniques for CD encompass randomized experiments, which are generally unbiased but expensive, and algorithms such as Granger causality, conditional-independence-based, structural-equation-based, and score-based methods that are only accurate under strong assumptions made by human designers. However, as demonstrated in other areas of machine learning, human expertise is often not entirely accurate and tends to be outperformed in domains with abundant data. In this study, we examine whether we can enhance domain-specific causal discovery for time series using a data-driven approach. Our findings indicate that this procedure significantly outperforms human-designed, domain-agnostic causal discovery methods, such as Mutual Information, VAR-LiNGAM, and Granger Causality on the MOS 6502 microprocessor, the NetSim fMRI dataset, and the Dream3 gene dataset. We argue that, when feasible, the causality field should consider a supervised approach in which domain-specific CD procedures are learned from extensive datasets with known causal relationships, rather than being designed by human specialists. Our findings promise a new approach toward improving CD in neural and medical data and for the broader machine learning community.

URL: https://openreview.net/forum?id=JFaZ94tT8M

---

Title: Does ‘Deep Learning on a Data Diet’ reproduce? Overall yes, but GraNd at Initialization does not

Authors: Andreas Kirsch

Abstract: Training deep neural networks on vast datasets often results in substantial computational demands, underscoring the need for efficient data pruning. In this context, we critically re-evaluate the data pruning metrics introduced in `Deep Learning on a Data Diet' by Paul et al. (2021): the Gradient Norm (GraNd) (at initialization) and the Error L2 Norm (EL2N). Our analysis uncovers a strong correlation between the GraNd scores at initialization and a sample's input norm, suggesting the latter as a potential baseline for data pruning. However, comprehensive tests on CIFAR-10 show neither metric outperforming random pruning, contradicting one of the findings in Paul et al. (2021). We pinpoint the inconsistency in the GraNd at initialization results to a later-fixed bug in FLAX's checkpoint restoring mechanism (https://github.com/google/flax/commit/28fbd95500f4bf2f9924d2560062fa50e919b1a5). Altogether, our findings do not support using the input norm or GraNd scores at initialization for effective data pruning. Nevertheless, EL2N and GraNd scores at later training epochs do provide useful pruning signals, aligning with the expected performance.

URL: https://openreview.net/forum?id=1dwXa9vmOI

---

Title: Dynamic Subgoal-based Exploration via Bayesian Optimization

Authors: Yijia Wang, Matthias Poloczek, Daniel R. Jiang

Abstract: Reinforcement learning in sparse-reward navigation environments with expensive and limited interactions is challenging and poses a need for effective exploration. Motivated by complex navigation tasks that require real-world training (when cheap simulators are not available), we consider an agent that faces an unknown distribution of environments and must decide on an exploration strategy. It may leverage a series of training environments to improve its policy before it is evaluated in a test environment drawn from the same environment distribution. Most existing approaches focus on fixed exploration strategies, while the few that view exploration as a meta-optimization problem tend to ignore the need for _cost-efficient_ exploration. We propose a cost-aware Bayesian optimization approach that efficiently searches over a class of dynamic subgoal-based exploration strategies. The algorithm adjusts a variety of levers --- the locations of the subgoals, the length of each episode, and the number of replications per trial --- in order to overcome the challenges of sparse rewards, expensive interactions, and noise. An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains. We also provide a theoretical foundation and prove that the method asymptotically identifies a near-optimal subgoal design.

URL: https://openreview.net/forum?id=ThJl4d5JRg

---

Title: Weight-balancing fixes and flows for deep learning

Authors: Lawrence K. Saul

Abstract: Feedforward neural networks with homogeneous activation functions possess an internal symmetry: the functions they compute do not change when the incoming and outgoing weights at any hidden unit are rescaled by reciprocal positive values. This paper makes two contributions to our understanding of these networks. The first is to describe a simple procedure, or {\it fix}, for balancing the weights in these networks: this procedure computes multiplicative rescaling factors---one at each hidden unit---that rebalance the weights of these networks without changing the end-to-end functions that they compute. Specifically, given an initial network with arbitrary weights, the procedure determines the functionally equivalent network whose weight matrix is of minimal $\ell_{p,q}$-norm; the weights at each hidden unit are said to be balanced when this norm is stationary with respect to rescaling transformations. The optimal rescaling factors are computed in an iterative fashion via simple multiplicative updates, and the updates are notable in that (a) they do not require the tuning of learning rates, (b) they operate in parallel on the rescaling factors at all hidden units, and (c) they converge monotonically to a global minimizer of the $\ell_{p,q}$-norm. The paper's second contribution is to analyze the optimization landscape for learning in these networks. We suppose that the network's loss function consists of two terms---one that is invariant to rescaling transformations, measuring predictive accuracy, and another (a regularizer) that breaks this invariance, penalizing large weights. We show how to derive a weight-balancing {\it flow} such that the regularizer remains minimal with respect to rescaling transformations as the weights descend in the loss function. These dynamics reduce to an ordinary gradient flow for $\ell_2$-norm regularization, but not otherwise. In this way our analysis suggests a canonical pairing of alternative flows and regularizers.

URL: https://openreview.net/forum?id=uaHyXxyp2r

---

Title: Gated Domain Units for Multi-source Domain Generalization

Authors: Simon Föll, Alina Dubatovka, Eugen Ernst, Siu Lun Chau, Martin Maritsch, Patrik Okanovic, Gudrun Thaeter, Joachim M. Buhmann, Felix Wortmann, Krikamol Muandet

Abstract: The phenomenon of distribution shift (DS) occurs when a dataset at test time differs from the dataset at training time, which can significantly impair the performance of a machine learning model in practical settings due to a lack of knowledge about the data's distribution at test time. To address this problem, we postulate that real-world distributions are composed of latent Invariant Elementary Distributions (I.E.D) across different domains. This assumption implies an invariant structure in the solution space that enables knowledge transfer to unseen domains. To exploit this property for domain generalization, we introduce a modular neural network layer consisting of Gated Domain Units (GDUs) that learn a representation for each latent elementary distribution. During inference, a weighted ensemble of learning machines can be created by comparing new observations with the representations of each elementary distribution. Our flexible framework also accommodates scenarios where explicit domain information is not present. Extensive experiments on image, text, and graph data show consistent performance improvement on out-of-training target domains. These findings support the practicality of the I.E.D assumption and the effectiveness of GDUs for domain generalisation.

URL: https://openreview.net/forum?id=V7BvYJyTmM

---

Title: IBIA: An Incremental Build-Infer-Approximate Framework for Approximate Inference of Partition Function

Authors: Shivani Bathla, Vinita Vasudevan

Abstract: Exact computation of the partition function is known to be intractable, necessitating approximate inference techniques. Existing methods for approximate inference are slow to converge for many benchmarks. The control of accuracy-complexity trade-off is also non-trivial in many of these methods. We propose a novel incremental build-infer-approximate (IBIA) framework for approximate inference that addresses these issues. In this framework, the probabilistic graphical model is converted into a sequence of clique tree forests (SCTF) with bounded clique sizes. We show that the SCTF can be used to efficiently compute the partition function. We propose two new algorithms which are used to construct the SCTF and prove the correctness of both. The first is an algorithm for incremental construction of CTFs that is guaranteed to give a valid CTF with bounded clique sizes and the second is an approximation algorithm that takes a calibrated CTF as input and yields a valid and calibrated CTF with reduced clique sizes as the output. We have evaluated our method using several benchmark sets from recent UAI competitions and our results show good accuracies with competitive runtimes.

URL: https://openreview.net/forum?id=8L7Rh6FIXt

---

Title: Revisiting Sparsity Hunting in Federated Learning: Why does Sparsity Consensus Matter?

Authors: Sara Babakniya, Souvik Kundu, Saurav Prakash, Yue Niu, Salman Avestimehr

Abstract: Edge devices can benefit remarkably from federated learning due to their distributed nature; however, their limited resources and computing power pose limitations for deployment. A possible solution to this problem is to utilize off-the-shelf sparse learning algorithms at the clients to meet their resource budget. However, such naive deployment in the clients causes significant accuracy degradation, especially for highly resource-constrained clients. In particular, our investigations reveal that the lack of consensus in the sparsity masks among the clients may potentially slow down the convergence of the global model and cause a substantial accuracy drop.
With these observations, we present \textit{federated lottery aware sparsity hunting} (FLASH), a unified sparse learning framework for training a sparse sub-model that maintains the performance under ultra-low parameter density while yielding proportional communication benefits. Moreover, given that different clients may have different resource budgets, we present \textit{hetero-FLASH}, where clients can take different density budgets based on their device resource limitations instead of supporting only one target parameter density. Experimental analysis on diverse models and datasets shows the superiority of FLASH in closing the gap with an unpruned baseline while yielding up to $\mathord{\sim}10.1\%$ improved accuracy with $\mathord{\sim}10.26\times$ less communication, compared to existing alternatives, at similar hyperparameter settings.

URL: https://openreview.net/forum?id=iHyhdpsnyi

---

Title: Relating graph auto-encoders to linear models

Authors: Solveig Klepper, Ulrike von Luxburg

Abstract: Graph auto-encoders are widely used to construct graph representations in Euclidean vector spaces. However, it has already been pointed out empirically that linear models on many tasks can outperform graph auto-encoders.
In our work, we prove that the solution space induced by graph auto-encoders is a subset of the solution space of a linear map. This demonstrates that linear embedding models have at least the representational power of graph auto-encoders based on graph convolutional networks. So why are we still using nonlinear graph auto-encoders? One reason could be that actively restricting the linear solution space might introduce an inductive bias that helps improve learning and generalization. While many researchers believe that the nonlinearity of the encoder is the critical ingredient towards this end, we instead identify the node features of the graph as a more powerful inductive bias. We give theoretical insights by introducing a corresponding bias in a linear model and analyzing the change in the solution space. Our experiments are aligned with other empirical work on this question and show that the linear encoder can outperform the nonlinear encoder when using feature information.

URL: https://openreview.net/forum?id=Y1eYplvxrE

---

Title: Deep Operator Learning Lessens the Curse of Dimensionality for PDEs

Authors: Ke Chen, Chunmei Wang, Haizhao Yang

Abstract: Deep neural networks (DNNs) have achieved remarkable success in numerous domains, and their application to PDE-related problems has been rapidly advancing. This paper provides an estimate for the generalization error of learning Lipschitz operators over Banach spaces using DNNs with applications to various PDE solution operators. The goal is to specify DNN width, depth, and the number of training samples needed to guarantee a certain testing error. Under mild assumptions on data distributions or operator structures, our analysis shows that deep operator learning can have a relaxed dependence on the discretization resolution of PDEs and, hence, lessen the curse of dimensionality in many PDE-related problems including elliptic equations, parabolic equations, and Burgers equations. Our results also give insights into discretization invariance in operator learning.

URL: https://openreview.net/forum?id=zmBFzuT2DN

---

Title: Label Noise-Robust Learning using a Confidence-Based Sieving Strategy

Authors: Reihaneh Torkzadehmahani, Reza Nasirigerdeh, Daniel Rueckert, Georgios Kaissis

Abstract: In learning tasks with label noise, improving model robustness against overfitting is a pivotal challenge because the model eventually memorizes labels, including the noisy ones. Identifying the samples with noisy labels and preventing the model from learning them is a promising approach to address this challenge. When training with noisy labels, the per-class confidence scores of the model, represented by the class probabilities, can be reliable criteria for assessing whether the input label is the true label or the corrupted one. In this work, we exploit this observation and propose a novel discriminator metric called confidence error and a sieving strategy called CONFES to differentiate between the clean and noisy samples effectively. We provide theoretical guarantees on the probability of error for our proposed metric. Then, we experimentally illustrate the superior performance of our proposed approach compared to recent studies on various settings, such as synthetic and real-world label noise. Moreover, we show CONFES can be combined with other state-of-the-art approaches, such as Co-teaching and DivideMix to further improve model performance.

URL: https://openreview.net/forum?id=3taIQG4C7H

---

Title: On Perfect Clustering for Gaussian Processes

Authors: Juan Cuesta-Albertos, Subhajit Dutta

Abstract: In this paper, we propose a data-based transformation for infinite-dimensional Gaussian processes and derive its limit theorem. For a clustering problem using mixture models, an appropriate modification of this transformation asymptotically leads to perfect separation of the populations under rather general conditions, except in the scenario in which differences between clusters depend only on the locations, in which case our procedure is useless. Theoretical properties related to label consistency are studied for the k-means clustering algorithm when used on this transformed data. Good empirical performance of the proposed methodology is demonstrated using simulated as well as benchmark data sets, when compared with some popular parametric and nonparametric methods for such functional data.

URL: https://openreview.net/forum?id=igDOV2KBwM

---

Title: How Reliable is Your Regression Model's Uncertainty Under Real-World Distribution Shifts?

Authors: Fredrik K. Gustafsson, Martin Danelljan, Thomas B. Schön

Abstract: Many important computer vision applications are naturally formulated as regression problems. Within medical imaging, accurate regression models have the potential to automate various tasks, helping to lower costs and improve patient outcomes. Such safety-critical deployment does however require reliable estimation of model uncertainty, also under the wide variety of distribution shifts that might be encountered in practice. Motivated by this, we set out to investigate the reliability of regression uncertainty estimation methods under various real-world distribution shifts. To that end, we propose an extensive benchmark of 8 image-based regression datasets with different types of challenging distribution shifts. We then employ our benchmark to evaluate many of the most common uncertainty estimation methods, as well as two state-of-the-art uncertainty scores from the task of out-of-distribution detection. We find that while methods are well calibrated when there is no distribution shift, they all become highly overconfident on many of the benchmark datasets. This uncovers important limitations of current uncertainty estimation methods, and the proposed benchmark therefore serves as a challenge to the research community. We hope that our benchmark will spur more work on how to develop truly reliable regression uncertainty estimation methods.

URL: https://openreview.net/forum?id=WJt2Pc3qtI

---

Title: RIGNN: A Rationale Perspective for Semi-supervised Open-world Graph Classification

Authors: Xiao Luo, Yusheng Zhao, Zhengyang Mao, Yifang Qin, Wei Ju, Ming Zhang, Yizhou Sun

Abstract: Graph classification has gained growing attention in the graph machine learning community and a variety of semi-supervised methods have been developed to reduce the high cost of annotation. They usually combine graph neural networks (GNNs) and extensive semi-supervised techniques such as knowledge distillation. However, they adhere to the closed-set assumption that unlabeled graphs all belong to known classes, limiting their applications in the real world. This paper goes further, investigating a practical problem of semi-supervised open-world graph classification where these unlabeled graph data could come from unseen classes. A novel approach named Rationale-Informed GNN (RIGNN) is proposed, which takes a rationale view to detect components containing the most information related to the label space and classify unlabeled graphs into a known class or an unseen class. In particular, RIGNN contains a relational detector and a feature extractor to produce effective rationale features, which maximize the mutual information with label information and exhibit sufficient disentanglement with non-rationale elements. Furthermore, we construct a graph-of-graph based on geometrical relationships, which provides guidance for enhancing rationale representations. By virtue of effective rationale representations, we can provide accurate and balanced predictions for unlabeled graphs. An extension is also made to accomplish effective open-set graph classification. We verify our proposed methods on four benchmark datasets in various settings and experimental results reveal the effectiveness of our proposed RIGNN compared with state-of-the-art methods.

URL: https://openreview.net/forum?id=qcCE4mC2jI

---

Title: SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration

Authors: Giulia Vezzani, Dhruva Tirumala, Markus Wulfmeier, Dushyant Rao, Abbas Abdolmaleki, Ben Moran, Tuomas Haarnoja, Jan Humplik, Roland Hafner, Michael Neunert, Claudio Fantacci, Tim Hertweck, Thomas Lampe, Fereshteh Sadeghi, Nicolas Heess, Martin Riedmiller

Abstract: The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.

URL: https://openreview.net/forum?id=JwGKVpRfVD

---

Title: Estimating Differential Equations from Temporal Point Processes

Authors: Shuichi Miyazawa, Daichi Mochihashi

Abstract: Ordinary differential equations (ODEs) allow interpretation of phenomena in various scientific fields. They have mostly been applied to numerical data observed at regular intervals, but not to irregularly observed discrete events, also known as point processes. In this study, we introduce an ODE modeling of such events by combining ODEs with log-Gaussian Cox processes (Møller et al., 1998). In the experiments with different types of ODEs regarding infectious disease, predator-prey interaction, and competition among participants, our method outperformed existing baseline methods assuming regularly observed continuous data with respect to the accuracy of recovering the latent parameters of ODEs. Through both synthetic and actual examples, we also showed the ability of our method to extrapolate, model latent events that cannot be observed, and offer interpretability of phenomena from the viewpoint of the estimated parameters of ODE.

URL: https://openreview.net/forum?id=cJgHzw8Qhq

---

Title: Turning a Curse into a Blessing: Enabling In-Distribution-Data-Free Backdoor Removal via Stabilized Model Inversion

Authors: Si Chen, Yi Zeng, Won Park, Jiachen T. Wang, Xun Chen, Lingjuan Lyu, Zhuoqing Mao, Ruoxi Jia

Abstract: The effectiveness of many existing techniques for removing backdoors from machine learning models relies on access to clean in-distribution data. However, given that these models are often trained on proprietary datasets, it may not be practical to assume that in-distribution samples will always be available.
On the other hand, model inversion techniques, which are typically viewed as privacy threats, can reconstruct realistic training samples from a given model, potentially eliminating the need for in-distribution data.
To date, the only prior attempt to integrate backdoor removal and model inversion involves a simple combination that produced very limited results. This work represents a first step toward a more thorough understanding of how model inversion techniques could be leveraged for effective backdoor removal. Specifically, we seek to answer several key questions: What properties must reconstructed samples possess to enable successful defense? Is perceptual similarity to clean samples enough, or are additional characteristics necessary? Is it possible for reconstructed samples to contain backdoor triggers?

We demonstrate that relying solely on perceptual similarity is insufficient for effective defenses. The stability of model predictions in response to input and parameter perturbations also plays a critical role. To address this, we propose a new bi-level optimization based framework for model inversion that promotes stability in addition to visual quality. Interestingly, we also find that reconstructed samples from a pre-trained generator's latent space do not contain backdoors, even when signals from a backdoored model are utilized for reconstruction. We provide a theoretical analysis to explain this observation. Our evaluation shows that our stabilized model inversion technique achieves state-of-the-art backdoor removal performance without requiring access to clean in-distribution data. Furthermore, its performance is on par with or even better than using the same amount of clean samples.

URL: https://openreview.net/forum?id=XuOE99cmST

---

Title: Momentum Tracking: Momentum Acceleration for Decentralized Deep Learning on Heterogeneous Data

Authors: Yuki Takezawa, Han Bao, Kenta Niwa, Ryoma Sato, Makoto Yamada

Abstract: SGD with momentum is one of the key components for improving the performance of neural networks. For decentralized learning, a straightforward approach using momentum is Distributed SGD (DSGD) with momentum (DSGDm). However, DSGDm performs worse than DSGD when the data distributions are statistically heterogeneous. Recently, several studies have addressed this issue and proposed methods with momentum that are more robust to data heterogeneity than DSGDm, although their convergence rates remain dependent on data heterogeneity and deteriorate when the data distributions are heterogeneous. In this study, we propose Momentum Tracking, which is a method with momentum whose convergence rate is proven to be independent of data heterogeneity. More specifically, we analyze the convergence rate of Momentum Tracking in the setting where the objective function is non-convex and the stochastic gradient is used. Then, we identify that it is independent of data heterogeneity for any momentum coefficient $\beta \in [0, 1)$. Through experiments, we demonstrate that Momentum Tracking is more robust to data heterogeneity than the existing decentralized learning methods with momentum and can consistently outperform these existing methods when the data distributions are heterogeneous.

URL: https://openreview.net/forum?id=8koy8QuTZD

---

Title: Optimistic Optimization of Gaussian Process Samples

Authors: Julia Grosse, Cheng Zhang, Philipp Hennig

Abstract: Bayesian optimization is a popular formalism for global optimization, but its computational costs limit it to expensive-to-evaluate functions. A competing, computationally more efficient, global optimization framework is optimistic optimization, which exploits prior knowledge about the geometry of the search space in the form of a dissimilarity function. We investigate to which degree the conceptual advantages of Bayesian Optimization can be combined with the computational efficiency of optimistic optimization. By mapping the kernel to a dissimilarity, we obtain an optimistic optimization algorithm for the Bayesian Optimization setting with a run-time of up to $O(N \log N)$. As a high-level take-away we find that, when using stationary kernels on objectives of low evaluation cost, optimistic optimization can be preferable over Bayesian optimization, while for strongly coupled and parametric models, Bayesian optimization can perform much better, even at low evaluation cost. As a conceptual takeaway, our results demonstrate that balancing exploration and exploitation under Gaussian process assumptions does not require computing a posterior.

URL: https://openreview.net/forum?id=KQ5jI19kF3

---

Title: Linearized Relative Positional Encoding

Authors: Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, Yiran Zhong

Abstract: Relative positional encoding is widely used in vanilla and linear transformers to represent positional information. However, existing encoding methods of a vanilla transformer are not always directly applicable to a linear transformer, because the latter requires a decomposition of the query and key representations into separate kernel functions. Nevertheless, principles for designing encoding methods suitable for linear transformers remain understudied. In this work, we put together a variety of existing linear relative positional encoding approaches under a canonical form and further propose a family of linear relative positional encoding algorithms via unitary transformation. Our formulation leads to a principled framework that can be used to develop new relative positional encoding methods that preserve linear space-time complexity. Equipped with different models, the proposed linearized relative positional encoding (LRPE) family derives effective encoding for various applications. Experiments show that compared with existing methods, LRPE achieves state-of-the-art performance in language modeling, text classification, and image classification. Meanwhile, it highlights a general paradigm for designing a broader range of relative positional encoding methods applicable to linear transformers.

URL: https://openreview.net/forum?id=xoLyps2qWc

---

Title: DPVIm: Differentially Private Variational Inference Improved

Authors: Joonas Jälkö, Lukas Prediger, Antti Honkela, Samuel Kaski

Abstract: Differentially private (DP) release of multidimensional statistics typically considers an aggregate sensitivity, e.g. the vector norm of a high-dimensional vector. However, different dimensions of that vector might have widely different magnitudes and therefore DP perturbation disproportionately affects the signal across dimensions. We observe this problem in the gradient release of the DP-SGD algorithm when using it for variational inference (VI), where it manifests in poor convergence as well as high variance in outputs for certain variational parameters, and make the following contributions: (i) We mathematically isolate the cause for the difference in magnitudes between gradient parts corresponding to different variational parameters. Using this as prior knowledge, we establish a link between the gradients of the variational parameters, and propose an efficient yet simple fix for the problem to obtain a less noisy gradient estimator, which we call \emph{aligned} gradients. This approach allows us to obtain the updates for the covariance parameter of a Gaussian posterior approximation without a privacy cost. We compare this to alternative approaches for scaling the gradients using analytically derived preconditioning, e.g. natural gradients. (ii) We suggest using iterate averaging over the DP parameter traces recovered during the training, to reduce the DP-induced noise in parameter estimates at no additional cost in privacy. Finally, (iii) to accurately capture the additional uncertainty DP introduces to the model parameters, we infer the DP-induced noise from the parameter traces and include that in the learned posteriors to make them \emph{noise aware}. We demonstrate the efficacy of our proposed improvements through various experiments on real data.

URL: https://openreview.net/forum?id=GlhM6XX1wv

---

Title: RIFLE: Imputation and Robust Inference from Low Order Marginals

Authors: Sina Baharlouei, Sze-Chuan Suen, Meisam Razaviyayn

Abstract: The ubiquity of missing values in real-world datasets poses a challenge for statistical inference and can prevent similar datasets from being analyzed in the same study, precluding many existing datasets from being used for new analyses. While an extensive collection of packages and algorithms has been developed for data imputation, the overwhelming majority perform poorly if there are many missing values and low sample sizes, which are unfortunately common characteristics in empirical data. Such low-accuracy estimations adversely affect the performance of downstream statistical models. We develop a statistical inference framework for predicting the target variable in the presence of missing data without imputation. Our framework, RIFLE (Robust InFerence via Low-order moment Estimations), estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model. We specialize our framework to linear regression and normal discriminant analysis, and we provide convergence and performance guarantees. This framework can also be adapted to impute missing data. We compare RIFLE with state-of-the-art approaches (including MICE, Amelia, MissForest, KNN-imputer, MIDA, and Mean Imputer) in numerical experiments. Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small. RIFLE is publicly available.

URL: https://openreview.net/forum?id=oud7Ny0KQy

---

Title: Offline Reinforcement Learning with Mixture of Deterministic Policies

Authors: Takayuki Osa, Akinobu Hayashi, Pranav Deo, Naoki Morihira, Takahide Yoshiike

Abstract: Offline reinforcement learning (RL) has recently attracted considerable attention as an approach for utilizing past experiences to learn a policy. Recent studies have reported the challenges of offline RL, such as estimating the values of actions that are outside the data distribution. To mitigate offline RL issues, we propose an algorithm that leverages a mixture of deterministic policies. When the data distribution is multimodal, fitting a policy modeled with a unimodal distribution, such as Gaussian distribution, may lead to interpolation between separate modes, thereby resulting in the value estimation of actions that are outside the data distribution. In our framework, the state-action space is divided by learning discrete latent variables, and the sub-policies corresponding to each region are trained. The proposed algorithm was derived by considering the variational lower bound of the offline RL objective function. We show empirically that the use of the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets.

URL: https://openreview.net/forum?id=zkRCp4RmAF

---

Title: Stochastic Batch Acquisition: A Simple Baseline for Deep Active Learning

Authors: Andreas Kirsch, Sebastian Farquhar, Parmida Atighehchian, Andrew Jesson, Frédéric Branchaud-Charron, Yarin Gal

Abstract: We examine a simple stochastic strategy for adapting well-known single-point acquisition functions to allow batch active learning. Unlike acquiring the top-K points from the pool set, score- or rank-based sampling takes into account that acquisition scores change as new data are acquired. This simple strategy for adapting standard single-sample acquisition strategies can even perform just as well as compute-intensive state-of-the-art batch acquisition functions, like BatchBALD or BADGE, while using orders of magnitude less compute. In addition to providing a practical option for machine learning practitioners, the surprising success of the proposed method in a wide range of experimental settings raises a difficult question for the field: when are these expensive batch acquisition methods pulling their weight?

URL: https://openreview.net/forum?id=vcHwQyNBjW

---

Title: A Survey on Transformers in Reinforcement Learning

Authors: Wenzhe Li, Hao Luo, Zichuan Lin, Chongjie Zhang, Zongqing Lu, Deheng Ye

Abstract: Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge of using Transformers has appeared in the domain of reinforcement learning (RL), but it is faced with unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. In this paper, we seek to systematically review motivations and progress on using Transformers in RL, provide a taxonomy on existing works, discuss each sub-field, and summarize future prospects.

URL: https://openreview.net/forum?id=r30yuDPvf2

---

Title: On the Sample Complexity of Lipschitz Constant Estimation

Authors: Julien Walden Huang, Stephen J. Roberts, Jan-Peter Calliess

Abstract: Estimating the Lipschitz constant of a function, also known as Lipschitz learning, is a fundamental problem with broad applications in fields such as control and global optimization. In this paper, we study the Lipschitz learning problem with minimal parametric assumptions on the target function. As a first theoretical contribution, we derive novel lower bounds on the sample complexity of this problem for both noise-free and noisy settings under mild assumptions. Moreover, we propose a simple Lipschitz learning algorithm called $\textit{Lipschitz Constant Estimation by Least Squares Regression}$ (referred to as LCLS). We show that LCLS is asymptotically consistent for general noise assumptions and offers finite sample guarantees that can be translated to new upper bounds on the sample complexity of the Lipschitz learning problem. Our analysis shows that the sample complexity rates derived in this paper are optimal in both the noise-free setting and in the noisy setting when the noise is assumed to follow a Gaussian distribution and that LCLS is a sample-optimal algorithm in both cases. Finally, we show that by design, the LCLS algorithm is computationally faster than existing theoretically consistent methods, and can be readily adapted to various noise assumptions with little to no prior knowledge of the target function properties or noise distribution.

URL: https://openreview.net/forum?id=UIalYAHdBH

---

Title: Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits

Authors: Zixin Zhong, Wang Chi Cheung, Vincent Tan

Abstract: We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To this end, we design and analyze the BoBW-lil’UCB($\gamma$) algorithm. Complementarily, by establishing lower bounds on the regret achievable by any algorithm with a given BAI failure probability, we show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives, and (ii) BoBW-lil’UCB($\gamma$) achieves order-wise optimal performance for RM or BAI under different values of $\gamma$. Our work elucidates the trade-off more precisely by showing how the constants in previous works depend on certain hardness parameters. Finally, we show that BoBW-lil’UCB outperforms a close competitor UCB$_\alpha$ (Degenne et al., 2019) in terms of the time complexity and the regret on diverse datasets such as MovieLens and Published Kinase Inhibitor Set.

URL: https://openreview.net/forum?id=XXfEmIMJDm

---

Title: Quantization Robust Federated Learning for Efficient Inference on Heterogeneous Devices

Authors: Kartik Gupta, Marios Fournarakis, Matthias Reisser, Christos Louizos, Markus Nagel

Abstract: Federated Learning (FL) is a machine learning paradigm to distributively learn machine learning models from decentralized data that remains on-device. Despite the success of standard Federated optimization methods, such as Federated Averaging (FedAvg) in FL, the energy demands and hardware-induced constraints for on-device learning have not been considered sufficiently in the literature. Specifically, an essential demand for on-device learning is to enable trained models to be quantized to various bit-widths based on the energy needs and heterogeneous hardware designs across the federation. In this work, we introduce multiple variants of the federated averaging algorithm that train neural networks robust to quantization. Such networks can be quantized to various bit-widths with only limited reduction in full precision model accuracy. We perform extensive experiments on standard FL benchmarks to evaluate our proposed FedAvg variants for quantization robustness and provide a convergence analysis for our Quantization-Aware variants in FL. Our results demonstrate that integrating quantization robustness results in FL models that are significantly more robust to different bit-widths during quantized on-device inference.

URL: https://openreview.net/forum?id=lvevdX6bxm

---

Title: High Fidelity Neural Audio Compression

Authors: Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

Abstract: We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model including: training objective, architectural changes and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and samples are available under github.com/facebookresearch/encodec.

URL: https://openreview.net/forum?id=ivCd8z8zR2

---

Title: Fair and Useful Cohort Selection

Authors: Konstantina Bairaktari, Paul Tsela Langton, Huy Nguyen, Niklas Smedemark-Margulies, Jonathan Ullman

Abstract: A challenge in fair algorithm design is that, while there are compelling notions of individual fairness, these notions typically do not satisfy desirable composition properties, and downstream applications based on fair classifiers might not preserve fairness.
To study fairness under composition, Dwork & Ilvento (2019) introduced an archetypal problem called the fair-cohort-selection problem, where a single fair classifier is composed with itself to select a group of candidates of a given size, and proposed a solution to this problem.

In this work we design algorithms for selecting cohorts that not only preserve fairness, but also maximize the utility of the selected cohort under two notions of utility that we introduce and motivate. We give optimal (or approximately optimal) polynomial-time algorithms for this problem in both an offline setting, and an online setting where candidates arrive one at a time and are classified as they arrive.

URL: https://openreview.net/forum?id=wRepWp1KC7

---

Title: Walking Out of the Weisfeiler Leman Hierarchy: Graph Learning Beyond Message Passing

Authors: Jan Tönshoff, Martin Ritzert, Hinrikus Wolf, Martin Grohe

Abstract: We propose CRaWl, a novel neural network architecture for graph learning. Like graph neural networks, CRaWl layers update node features on a graph and thus can freely be combined or interleaved with GNN layers. Yet CRaWl operates fundamentally differently from message-passing graph neural networks. CRaWl layers extract and aggregate information on subgraphs appearing along random walks through a graph using 1D Convolutions. Thereby it detects long-range interactions and computes non-local features. As the theoretical basis for our approach, we prove a theorem stating that the expressiveness of CRaWl is incomparable with that of the Weisfeiler Leman algorithm and hence with graph neural networks. That is, there are functions expressible by CRaWl, but not by GNNs and vice versa. This result extends to higher levels of the Weisfeiler Leman hierarchy and thus to higher-order GNNs. Empirically, we show that CRaWl matches state-of-the-art GNN architectures across a multitude of benchmark datasets for classification and regression on graphs.

URL: https://openreview.net/forum?id=vgXnEyeWVY

---

Title: Global Contrastive Learning for Long-Tailed Classification

Authors: Thong Bach, Anh Tong, Truong Son Hy, Vu Nguyen, Thanh Nguyen-Tang

Abstract: We consider the long-tailed classification problem in which a few classes in the training data dominate the majority of the other classes. For concreteness, we focus on the visual domain in this paper. Most current methods employ contrastive learning to learn a representation for long-tailed data. In this paper, first, we investigate $k$-positive sampling, a popular baseline method widely used to build contrastive learning models for imbalanced data. Previous works show that $k$-positive learning, which only chooses $k$ positive samples (instead of all positive images) for each query image, suffers from inferior performance in long-tailed data. In this work, we further point out that $k$-positive learning limits the learning capability of both head and tail classes. Based on this perspective, we propose a novel contrastive learning framework that addresses this limitation of $k$-positive learning by enlarging its positive selection space, so it helps the model learn more semantically discriminative features. Second, we analyze how the temperature (the hyperparameter used for tuning the concentration of samples in feature space) affects the gradients of each class in long-tailed learning, and propose a new method that can mitigate inadequate gradients between classes, which makes model learning easier. We name this framework CoGloAT. Finally, we go on to introduce a new prototype learning framework, ProCo, based on coreset selection, which creates a global prototype for each cluster while keeping the computation cost within a reasonable time, and show that combining CoGloAT with ProCo can further enhance the model learning ability on long-tailed data.

URL: https://openreview.net/forum?id=xWrtiJwJj5

---

Title: Approximating Naive Bayes on Unlabelled Categorical Data

Authors: Cormac Herley

Abstract: We address the question of binary classification when no labels are available and the input features are categorical. The lack of labels means supervised approaches can't be used, and the lack of a natural distance measure means that most unsupervised methods do poorly. For such problems, where the alternatives might be a) do nothing or b) heuristic rules-based approaches, we offer a third alternative: a classifier that approximates Naive Bayes. Our primary scenarios are those that involve distinguishing scripted, or bot, web traffic from that of legitimate users.

Our main assumption is the existence of some attribute $x_*$ more prevalent in the benign than the scripted traffic; i.e., $P(x_*|\overline{\mbox{bot}}) = K \cdot P(x_*|\mbox{bot}),$ for $K>1.$ We show that any such disparity yields a lower bound on $P(\mbox{bot}|x_{j})$ even when we have no prior estimates of $P(x_*|\overline{\mbox{bot}}),$ $P(x_*|\mbox{bot})$ or $K$ (except that $K>1$). We show that when at least one bin of at least one feature receives no attack traffic then we under-estimate the actual conditional probability by a factor of $1-1/K.$ Thus, any attribute with a large disparity between prevalence in benign and abuse traffic (i.e., $K$ is large), allows good approximation of the Naive Bayes classifier without the benefit of labels.

The approach is particularly suited to problems where $K$ is high and thus the approximation is very accurate. Example problems (and relevant attributes) might be: password-guessing, if login attempts from legitimate users succeed at a much higher rate than those from password-guessing attackers; Credit Card Verification Value (CVV) guessing, if an attacker exhaustively tries all possible 3- or 4-digit values and fails at a higher rate than legitimate users; account registration, if legitimate users use email addresses from services that do not allow free anonymous accounts (e.g., {\tt .edu}) at a much higher rate than attackers; click fraud, if legitimate users visit pages and services that contain no ads at a higher rate than click-fraud bots.

URL: https://openreview.net/forum?id=KpElM2S9pw

---

Title: $k$-Mixup Regularization for Deep Learning via Optimal Transport

Authors: Kristjan Greenewald, Anming Gu, Mikhail Yurochkin, Justin Solomon, Edward Chien

Abstract: Mixup is a popular regularization technique for training deep neural networks that improves generalization and increases robustness to certain distribution shifts. It perturbs input training data in the direction of other randomly-chosen instances in the training set. To better leverage the structure of the data, we extend mixup in a simple, broadly applicable way to $k$-mixup, which perturbs $k$-batches of training points in the direction of other $k$-batches. The perturbation is done with displacement interpolation, i.e., interpolation under the Wasserstein metric. We demonstrate theoretically and in simulations that $k$-mixup preserves cluster and manifold structures, and we extend theory studying the efficacy of standard mixup to the $k$-mixup case. Our empirical results show that training with $k$-mixup further improves generalization and robustness across several network architectures and benchmark datasets of differing modalities. For the wide variety of real datasets considered, the performance gains of $k$-mixup over standard mixup are similar to or larger than the gains of mixup itself over standard ERM after hyperparameter optimization. In several instances, in fact, $k$-mixup achieves gains in settings where standard mixup has negligible to zero improvement over ERM.
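
For intuition, here is a minimal sketch of the core $k$-mixup step in the common case of equal-size batches with uniform weights, where the optimal transport plan reduces to a matching; the use of scipy's Hungarian solver and the squared-Euclidean cost are implementation assumptions rather than the authors' code.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def k_mixup(x1, y1, x2, y2, alpha=1.0):
        """Sketch: optimally match two k-batches, then apply displacement
        interpolation (vanilla mixup along the matched pairs)."""
        # squared-Euclidean cost between the two k-batches (flattened inputs)
        cost = ((x1[:, None, :] - x2[None, :, :]) ** 2).sum(-1)
        rows, cols = linear_sum_assignment(cost)   # OT plan = permutation for uniform weights
        lam = np.random.beta(alpha, alpha)
        x_mix = lam * x1[rows] + (1 - lam) * x2[cols]
        y_mix = lam * y1[rows] + (1 - lam) * y2[cols]
        return x_mix, y_mix

    # usage: x1, x2 are (k, d) arrays; y1, y2 are one-hot label arrays of shape (k, c)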

URL: https://openreview.net/forum?id=lOegPKSu04

---

Title: HypUC: Hyperfine Uncertainty Calibration with Gradient-boosted Corrections for Reliable Regression on Imbalanced Electrocardiograms

Authors: Uddeshya Upadhyay, Sairam Bade, Arjun Puranik, Shahir Asfahan, Melwin Babu, Francisco Lopez-Jimenez, Samuel Asirvatham, Ashim Prasad, Ajit Rajasekharan, Samir Awasthi, Rakesh Barve

Abstract: The automated analysis of medical time series, such as the electrocardiogram (ECG), electroencephalogram (EEG), and pulse oximetry, has the potential to serve as a valuable tool for diagnostic decisions, allowing for remote monitoring of patients and more efficient use of expensive and time-consuming medical procedures. Deep neural networks (DNNs) have been demonstrated to process such signals effectively. However, previous research has primarily focused on classifying medical time series rather than attempting to regress the continuous-valued physiological parameters central to diagnosis. One significant challenge in this regard is the imbalanced nature of the dataset, as a low prevalence of abnormal conditions can lead to heavily skewed data, resulting in inaccurate predictions and a lack of certainty in such predictions when deployed. To address these challenges, we propose HypUC, a framework for imbalanced probabilistic regression in medical time series, making several contributions. (i) We introduce a simple kernel density-based technique to tackle the imbalanced regression problem with medical time series. (ii) Moreover, we employ a probabilistic regression framework that allows uncertainty estimation for the predicted continuous values. (iii) We also present a new approach to further calibrate the predicted uncertainty. (iv) Finally, we demonstrate a technique to use calibrated uncertainty estimates to improve the predicted continuous value, and show the efficacy of the calibrated uncertainty estimates for flagging unreliable predictions. HypUC is evaluated on a large, diverse, real-world dataset of ECGs collected from millions of patients, outperforming several conventional baselines on various diagnostic tasks and suggesting a potential use case for the reliable clinical deployment of deep learning models. Consequently, a hyperkalemia diagnosis algorithm based on HypUC will be the subject of a prospective real-world clinical study.
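
The abstract only says the imbalance is handled with "a simple kernel density-based technique"; one plausible instantiation, sketched below with assumed names, is to weight each sample inversely to the estimated density of its continuous target. This is an illustrative guess, not the paper's method.

    import numpy as np
    from scipy.stats import gaussian_kde

    def inverse_density_weights(targets, eps=1e-3):
        """Sketch of a kernel density-based reweighting for imbalanced regression:
        rare physiological target values receive larger loss weights."""
        kde = gaussian_kde(targets)          # 1-D KDE over the training targets
        density = kde(targets)
        weights = 1.0 / (density + eps)
        return weights / weights.mean()      # normalise so the average weight is 1

    # usage: loss = (inverse_density_weights(y_train) * (y_pred - y_train) ** 2).mean()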

URL: https://openreview.net/forum?id=0Xo9giEZWf

---

Title: AP: Selective Activation for De-sparsifying Pruned Networks

Authors: Shiyu Liu, Rohan Ghosh, Mehul Motani

Abstract: The rectified linear unit (ReLU) is a highly successful activation function in neural networks as it allows networks to easily obtain sparse representations, which reduces overfitting in overparameterized networks. However, in the context of network pruning, we find that the sparsity introduced by ReLU, which we quantify by a term called dynamic dead neuron rate (DNR), is not beneficial for the pruned network. Interestingly, the more the network is pruned, the smaller the dynamic DNR becomes during and after optimization. This motivates us to propose a method to explicitly reduce the dynamic DNR for the pruned network, i.e., de-sparsify the network. We refer to our method as Activate-while-Pruning (AP). We note that AP does not function as a stand-alone method, as it does not evaluate the importance of weights. Instead, it works in tandem with existing pruning methods and aims to improve their performance by selective activation of nodes to reduce the dynamic DNR. We conduct extensive experiments using various popular networks (e.g., ResNet, VGG, DenseNet, MobileNet) via two classical and three state-of-the-art pruning methods. The experimental results on public datasets (e.g., CIFAR-10, CIFAR-100) suggest that AP works well with existing pruning methods and improves the performance by 3% - 4%. For larger scale datasets (e.g., ImageNet) and state-of-the-art networks (e.g., vision transformer), we observe an improvement of 2% - 3% with AP as opposed to without. Lastly, we conduct an ablation study to examine the effectiveness of the components comprising AP.
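
The abstract's "dynamic dead neuron rate" can be read roughly as the fraction of post-ReLU activations that are zero on a batch; the hook-based sketch below reflects that reading, while the paper's exact averaging may differ.

    import torch

    @torch.no_grad()
    def dynamic_dead_neuron_rate(model, batch):
        """Rough sketch: fraction of post-ReLU activations that are exactly zero
        on a given batch, aggregated over all ReLU layers via forward hooks."""
        zero, total, hooks = 0, 0, []

        def hook(module, inputs, output):
            nonlocal zero, total
            zero += (output == 0).sum().item()
            total += output.numel()

        for m in model.modules():
            if isinstance(m, torch.nn.ReLU):
                hooks.append(m.register_forward_hook(hook))
        model(batch)
        for h in hooks:
            h.remove()
        return zero / max(total, 1)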

URL: https://openreview.net/forum?id=EGQSpkUDdD

---

Title: TSMixer: An All-MLP Architecture for Time Series Forecasting

Authors: Si-An Chen, Chun-Liang Li, Sercan O Arik, Nathanael Christian Yoder, Tomas Pfister

Abstract: Real-world time-series datasets are often multivariate with complex dynamics. To capture this complexity, high-capacity architectures like recurrent- or attention-based sequential deep learning models have become popular. However, recent work demonstrates that simple univariate linear models can outperform such deep learning models on several commonly used academic benchmarks. Building on these findings, we investigate the capabilities of linear models for time-series forecasting and present Time-Series Mixer (TSMixer), a novel architecture designed by stacking multi-layer perceptrons (MLPs). TSMixer is based on mixing operations along both the time and feature dimensions to extract information efficiently. On popular academic benchmarks, the simple-to-implement TSMixer is comparable to specialized state-of-the-art models that leverage the inductive biases of specific benchmarks. On the challenging and large-scale M5 benchmark, a real-world retail dataset, TSMixer demonstrates superior performance compared to the state-of-the-art alternatives. Our results underline the importance of efficiently utilizing cross-variate and auxiliary information for improving the performance of time series forecasting. We present various analyses to shed light on the capabilities of TSMixer. The design paradigms utilized in TSMixer are expected to open new horizons for deep learning-based time series forecasting.
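
A minimal sketch of the time-mixing/feature-mixing idea described in the abstract follows; the layer sizes, the omission of normalisation, and the residual placement are simplifying assumptions, not the published architecture.

    import torch
    import torch.nn as nn

    class MixerBlock(nn.Module):
        """Sketch of one TSMixer-style block: an MLP applied along the time axis,
        followed by an MLP applied along the feature axis, each with a residual."""
        def __init__(self, seq_len, n_features, hidden=64):
            super().__init__()
            self.time_mlp = nn.Sequential(nn.Linear(seq_len, hidden), nn.ReLU(), nn.Linear(hidden, seq_len))
            self.feat_mlp = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(), nn.Linear(hidden, n_features))

        def forward(self, x):                # x: (batch, seq_len, n_features)
            x = x + self.time_mlp(x.transpose(1, 2)).transpose(1, 2)   # mix along time
            x = x + self.feat_mlp(x)                                   # mix along features
            return x

    # usage: y = MixerBlock(seq_len=96, n_features=7)(torch.randn(32, 96, 7))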

URL: https://openreview.net/forum?id=wbpxTuXgm0

---

Title: Revisiting Hidden Representations in Transfer Learning for Medical Imaging

Authors: Dovile Juodelyte, Amelia Jiménez-Sánchez, Veronika Cheplygina

Abstract: While a key component to the success of deep learning is the availability of massive amounts of training data, medical image datasets are often limited in diversity and size. Transfer learning has the potential to bridge the gap between related yet different domains. For medical applications, however, it remains unclear whether it is more beneficial to pre-train on natural or medical images. We aim to shed light on this problem by comparing initialization on ImageNet and RadImageNet on seven medical classification tasks. Our work includes a replication study, which yields results contrary to previously published findings. In our experiments, ResNet50 models pre-trained on ImageNet tend to outperform those trained on RadImageNet. To gain further insights, we investigate the learned representations using Canonical Correlation Analysis (CCA) and compare the predictions of the different models. Our results indicate that, contrary to intuition, ImageNet and RadImageNet may converge to distinct intermediate representations, which appear to diverge further during fine-tuning. Despite these distinct representations, the predictions of the models remain similar. Our findings show that the similarity between networks before and after fine-tuning does not correlate with performance gains, suggesting that the advantages of transfer learning might not solely originate from the reuse of features in the early layers of a convolutional neural network.
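
The abstract compares learned representations with CCA; a minimal sketch of a mean canonical-correlation score between two layers' activation matrices is given below, using scikit-learn's CCA as a stand-in for whatever CCA variant the authors actually use. The component count and activation shapes are assumptions.

    import numpy as np
    from sklearn.cross_decomposition import CCA

    def mean_cca_similarity(acts_a, acts_b, n_components=10):
        """Sketch: average canonical correlation between two activation matrices
        of shape (n_samples, n_units), as a simple representation-similarity score."""
        cca = CCA(n_components=n_components, max_iter=1000)
        a_c, b_c = cca.fit_transform(acts_a, acts_b)
        corrs = [np.corrcoef(a_c[:, i], b_c[:, i])[0, 1] for i in range(n_components)]
        return float(np.mean(corrs))

    # usage: mean_cca_similarity(imagenet_layer_acts, radimagenet_layer_acts)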

URL: https://openreview.net/forum?id=ScrEUZLxPr

---

Title: The Geometry of Mixability

Authors: Armando J Cabrera Pacheco, Robert Williamson

Abstract: Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set $\textrm{spr}(\eta \ell)$ of the scaled loss function $\eta \ell$ slides freely inside the superprediction set $\textrm{spr}(\ell_{\log})$ of the log loss $\ell_{\log}$, under fairly general assumptions on the differentiability of $\ell$. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a ''coordinate-free'' manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.

URL: https://openreview.net/forum?id=VrvGHDSzZ7

---

Title: Neural Causal Structure Discovery from Interventions

Authors: Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael Curtis Mozer, Christopher Pal, Yoshua Bengio

Abstract: Recent promising results have generated a surge of interest in continuous optimization methods for causal discovery from observational data. However, there are theoretical limitations on the identifiability of underlying structures obtained solely from observational data. Interventional data, on the other hand, provides richer information about the underlying data-generating process. Nevertheless, extending and applying methods designed for observational data to include interventions is a challenging problem. To address this issue, we propose a general framework based on neural networks to develop models that incorporate both observational and interventional data. Notably, our method can handle the challenging and realistic scenario where the identity of the intervened upon variable is unknown. We evaluate our proposed approach in the context of graph recovery, both de novo and from a partially-known edge set. Our method achieves strong benchmark results on various structure learning tasks, including structure recovery of synthetic graphs as well as standard graphs from the Bayesian Network Repository.

URL: https://openreview.net/forum?id=rdHVPPVuXa

---

Title: Evaluating Human-Language Model Interaction

Authors: Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael S. Bernstein, Percy Liang

Abstract: Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

URL: https://openreview.net/forum?id=hjDYJUn9l1

---

Title: Benchmarking Continuous Time Models for Predicting Multiple Sclerosis Progression

Authors: Alexander Luke Ian Norcliffe, Lev Proleev, Diana Mincu, F Lee Hartsell, Katherine A Heller, Subhrajit Roy

Abstract: Multiple sclerosis is a disease of the brain and spinal cord that can lead to severe disability and has no known cure. The majority of prior work in machine learning for multiple sclerosis has centered on Magnetic Resonance Imaging scans or laboratory tests; these modalities are both expensive to acquire and can be unreliable. A recent paper showed that disease progression can be predicted effectively using performance outcome measures and demographic data. In our work we build on this to investigate the modeling side, using continuous time models to predict progression. We benchmark four continuous time models on a publicly available multiple sclerosis dataset. We find that the best continuous model is often able to outperform the best benchmarked discrete time model. We also carry out an extensive ablation to discover the sources of the performance gains, finding that standardizing existing features leads to a larger performance increase than interpolating missing features.

URL: https://openreview.net/forum?id=2uMnAwWnRy

---

Title: Differentially Private Diffusion Models

Authors: Tim Dockhorn, Tianshi Cao, Arash Vahdat, Karsten Kreis

Abstract: While modern machine learning models rely on increasingly large training datasets, data is often limited in privacy-sensitive domains. Generative models trained with differential privacy (DP) on sensitive data can sidestep this challenge, providing access to synthetic data instead. We build on the recent success of diffusion models (DMs) and introduce Differentially Private Diffusion Models (DPDMs), which enforce privacy using differentially private stochastic gradient descent (DP-SGD). We investigate the DM parameterization and the sampling algorithm, which turn out to be crucial ingredients in DPDMs, and propose noise multiplicity, a powerful modification of DP-SGD tailored to the training of DMs. We validate our novel DPDMs on image generation benchmarks and achieve state-of-the-art performance in all experiments. Moreover, on standard benchmarks, classifiers trained on DPDM-generated synthetic data perform on par with task-specific DP-SGD-trained classifiers, which has not been demonstrated before for DP generative models. Project page and code: https://nv-tlabs.github.io/DPDM.

URL: https://openreview.net/forum?id=ZPpQk7FJXF

---

Title: Inherent Limits on Topology-Based Link Prediction

Authors: Justus Isaiah Hibshman, Tim Weninger

Abstract: Link prediction systems (e.g. recommender systems) typically use graph topology as one of their main sources of information. However, automorphisms and related properties of graphs beget inherent limits in predictability. We calculate hard upper bounds on how well graph topology alone enables link prediction for a wide variety of real-world graphs. We find that in the sparsest of these graphs the upper bounds are surprisingly low, thereby demonstrating that prediction systems on sparse graph data are inherently limited and require information in addition to the graph topology.

URL: https://openreview.net/forum?id=izL3B8dPx1

---

Title: On the special role of class-selective neurons in early training

Authors: Omkar Ranadive, Nikhil Thakurdesai, Ari S. Morcos, Matthew L Leavitt, Stephane Deny

Abstract: It is commonly observed that deep networks trained for classification exhibit class-selective neurons in their early and intermediate layers. Intriguingly, recent studies have shown that these class-selective neurons can be ablated without deteriorating network function. But if class-selective neurons are not necessary, why do they exist? We attempt to answer this question in a series of experiments on ResNet-50s trained on ImageNet. We first show that class-selective neurons emerge during the first few epochs of training, before receding rapidly but not completely; this suggests that class-selective neurons found in trained networks are in fact vestigial remains of early training. With single-neuron ablation experiments, we then show that class-selective neurons are important for network function in this early phase of training. We also observe that the network is close to a linear regime in this early phase; we thus speculate that class-selective neurons appear early in training as quasi-linear shortcut solutions to the classification task. Finally, in causal experiments where we regularize against class selectivity at different points in training, we show that the presence of class-selective neurons early in training is critical to the successful training of the network; in contrast, class-selective neurons can be suppressed later in training with little effect on final accuracy. It remains to be understood by which mechanism the presence of class-selective neurons in the early phase of training contributes to the successful training of networks.
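
The abstract leans on the notion of class-selective neurons; a commonly used selectivity index from the related literature is sketched below for one neuron. Whether the paper uses exactly this definition is an assumption, and the activations are assumed to be non-negative (post-ReLU).

    import numpy as np

    def class_selectivity_index(mean_activations):
        """Sketch of a standard class-selectivity index for a single neuron:
        (mu_max - mu_rest) / (mu_max + mu_rest), where mu_max is the mean
        activation on the preferred class and mu_rest is the mean activation
        averaged over all other classes."""
        mean_activations = np.asarray(mean_activations, dtype=float)  # shape (n_classes,)
        top = mean_activations.max()
        rest = (mean_activations.sum() - top) / (len(mean_activations) - 1)
        return (top - rest) / (top + rest + 1e-12)

    # a neuron firing only for one class -> index close to 1; uniform firing -> close to 0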

URL: https://openreview.net/forum?id=JaNlH6dZYk

---


New submissions
===============


Title: Debiasing Machine Learning Models by Using Weakly Supervised Learning

Abstract: We tackle the problem of bias mitigation in algorithmic decisions in a setting where both the output of the algorithm and the sensitive variable are continuous. Most prior work deals with discrete sensitive variables, meaning that biases are measured for subgroups of persons defined by a label, leaving out important algorithmic bias cases where the sensitive variable is continuous. Typical examples are unfair decisions made with respect to age or financial status. In this work, we propose a bias mitigation strategy for continuous sensitive variables, based on the notion of endogeneity from the field of econometrics. In addition to addressing this new problem, our bias mitigation strategy is a weakly supervised learning method, which requires that a small portion of the data can be measured in a fair manner. It is model-agnostic, in the sense that it does not make any hypothesis on the prediction model, and it makes use of a reasonably large number of input observations and their corresponding predictions. Only a small fraction of the true output predictions needs to be known. This therefore limits the need for expert interventions. Results obtained on synthetic data show the effectiveness of our approach on examples as close as possible to real-life applications in econometrics.

URL: https://openreview.net/forum?id=5JvRRTpWdb

---

Title: To Transfer or Not to Transfer: Suppressing Concepts from Source Representations

Abstract: With the proliferation of large pre-trained models in various domains, transfer learning has gained prominence where intermediate representations from these models can be leveraged to train better (target) task-specific models, with possibly limited labeled data. Although transfer learning can be beneficial in many applications, it can transfer undesirable information to target tasks that may severely curtail its performance in the target domain or raise ethical concerns related to privacy and/or fairness. In this paper, we propose a novel approach for suppressing the transfer of user-determined semantic concepts (viz. color, glasses, etc.) in intermediate source representations to target tasks without retraining the source model, which can otherwise be expensive or even infeasible. Notably, we tackle a bigger challenge in the input data as a given intermediate source representation is biased towards the source task, thus possibly further entangling the desired concepts. We evaluate our approach qualitatively and quantitatively in the visual domain, showcasing its efficacy for classification and generative source models. Finally, we provide a concept selection approach that automatically suppresses the undesirable concepts.

URL: https://openreview.net/forum?id=BNP4MxzDEI

---

Title: Distillation Policy Optimization

Abstract: While on-policy algorithms are known for their stability, they often demand a substantial number of samples. In contrast, off-policy algorithms, which leverage past experiences, are considered sample-efficient but tend to exhibit instability. Can we develop an algorithm that harnesses the benefits of off-policy data while maintaining stable learning? In this paper, we introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control, facilitating rapid learning and adaptable integration with on-policy algorithms. This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline, improving the efficacy of both on- and off-policy learning. Our empirical results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to off-policy approaches and demonstrating the promise of our approach as a novel learning paradigm.

URL: https://openreview.net/forum?id=e1AiKT036u

---

Title: Leveraging Endo- and Exo-Temporal Regularization for Black-box Video Domain Adaptation

Abstract: To enable video models to be applied seamlessly across video tasks in different environments, various Video Unsupervised Domain Adaptation (VUDA) methods have been proposed to improve the robustness and transferability of video models. Despite improvements in model robustness, these VUDA methods require access to both source data and source model parameters for adaptation, raising serious data privacy and model portability issues. To cope with these concerns, this paper first formulates Black-box Video Domain Adaptation (BVDA) as a more realistic yet challenging scenario where the source video model is provided only as a black-box predictor. While a few methods for Black-box Domain Adaptation (BDA) have been proposed in the image domain, they cannot be applied directly to the video domain, since the video modality has more complicated temporal features that are harder to align. To address BVDA, we propose a novel Endo and eXo-TEmporal Regularized Network (EXTERN) that applies mask-to-mix strategies and two video-tailored regularizations: endo-temporal regularization and exo-temporal regularization, performed across both clip and temporal features while distilling knowledge from the predictions of the black-box predictor. Empirical results demonstrate the state-of-the-art performance of EXTERN across various cross-domain closed-set and partial-set action recognition benchmarks, even surpassing most existing video domain adaptation methods that have source data accessibility.

URL: https://openreview.net/forum?id=icoP08mrQJ

---

Title: We're Not Using Videos Effectively: An Updated Video Domain Adaptation Baseline

Abstract: There has been abundant work in unsupervised domain adaptation for semantic segmentation seeking to adapt a model trained on images from a labeled source domain to an unlabeled target domain. While the vast majority of prior work has studied this as a frame-level Image-DA problem, a few Video-DA works have sought to additionally leverage the temporal signal present in adjacent frames. However, Video-DA works have historically studied a distinct set of benchmarks from Image-DA, with minimal cross-benchmarking. In this work, we address this gap. Surprisingly, we find that (1) even after carefully controlling for data and model architecture, modern Image-DA methods strongly outperform Video-DA methods on Video-DA benchmarks (+14.5 mIoU on Viper to Cityscapes-Seq, +19.0 mIoU on Synthia-Seq to Cityscapes-Seq!), and (2) naive combinations of Image-DA and Video-DA techniques do not lead to consistent performance improvements. To avoid siloed progress between Image-DA and Video-DA, we open-source our codebase with support for a comprehensive set of Video-DA and Image-DA methods on a common benchmark. Code available at this link: https://github.com/effectivevideoda/UnifiedVideoDA/tree/main

URL: https://openreview.net/forum?id=10R6iX6JHm

---

Title: Gradient Descent Temporal Difference-difference Learning

Abstract: Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the most well known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning, which improves on GTD2, a GTD algorithm, by introducing second-order differences in successive parameter updates. We investigate this algorithm in the framework of linear value function approximation, theoretically proving its convergence by applying the theory of stochastic approximation. Studying the model empirically on the random walk task, the Boyan-chain task, and Baird's off-policy counterexample, we find substantial improvement over GTD2 and, in several cases, better performance even than conventional TD learning.

URL: https://openreview.net/forum?id=xx815uanS2

---

Title: Image Reconstruction via Deep Image Prior Subspaces

Abstract: Deep learning has been widely used for solving image reconstruction tasks, but its deployability has been held back by the shortage of high-quality paired training data. Unsupervised learning methods, e.g., deep image prior (DIP), naturally fill this gap, but bring a host of new issues: susceptibility to overfitting due to a lack of robust early stopping strategies, and unstable convergence. We present a novel approach to tackle these issues by restricting DIP optimisation to a sparse linear subspace of its parameters, employing a synergy of dimensionality reduction techniques and second order optimisation methods. The low dimensionality of the subspace reduces DIP's tendency to fit noise and allows the use of stable second order optimisation methods, e.g., natural gradient descent or L-BFGS. Experiments across both image restoration and tomographic tasks of different geometry and ill-posedness show that second order optimisation within a low-dimensional subspace is favourable in terms of the trade-off between optimisation stability and reconstruction fidelity.

URL: https://openreview.net/forum?id=torWsEui9N

---

Title: Equivariant MuZero

Abstract: Deep reinforcement learning has shown great success in closed, well-defined domains such as games (Chess, Go, StarCraft). The next frontier is real-world scenarios, where setups are numerous and varied. For this, agents need to learn the underlying rules governing the environment, so as to robustly generalise to conditions that differ from those they were trained on. Model-based reinforcement learning algorithms, such as MuZero or Dreamer, aim to accomplish this by learning a world model. However, leveraging a world model has not yet consistently shown greater generalisation capabilities compared to model-free alternatives. In this work, we propose improving the data efficiency and generalisation capabilities of MuZero by explicitly incorporating the symmetries of the environment in its world-model architecture. We prove that, so long as the neural networks used by MuZero are equivariant to a particular symmetry group acting on the environment, the entirety of MuZero's action-selection algorithm will also be equivariant to that group. We evaluate Equivariant MuZero on procedurally-generated MiniPacman and on Chaser from the ProcGen suite: training on a set of mazes and then testing on unseen rotated versions, demonstrating the benefits of equivariance. Further, we verify that our performance improvements hold even when only some of the components of Equivariant MuZero obey strict equivariance, which highlights the robustness of our construction.

URL: https://openreview.net/forum?id=ExbGarTbLE

---

Title: Optimization with Access to Auxiliary Information

Abstract: We investigate the fundamental optimization question of minimizing a \emph{target} function $f$, whose gradients are expensive to compute or have limited availability, given access to some \emph{auxiliary} side function $h$ whose gradients are cheap or more available. This formulation captures many settings of practical relevance, such as i) re-using batches in SGD, ii) transfer learning, iii) federated learning, iv) training with compressed models/dropout, etc. We propose two generic new algorithms that apply in all these settings and prove that we can benefit from this framework using only an assumption on the Hessian similarity between the target and side information. A benefit is obtained when this similarity measure is small; we also show a potential benefit from stochasticity when the auxiliary noise is correlated with that of the target function.

URL: https://openreview.net/forum?id=kxYqgSkH8I

---

Title: CiPR: An Efficient Framework with Cross-instance Positive Relations for Generalized Category Discovery

Abstract: We tackle the issue of generalized category discovery (GCD). GCD considers the open-world problem of automatically clustering a partially labelled dataset, in which the unlabelled data may contain instances from both novel categories and labelled classes. In this paper, we address the GCD problem with an unknown category number for the unlabelled data. We propose a framework, named CiPR, to bootstrap the representation by exploiting cross-instance positive relations in the partially labelled data for contrastive learning, which have been neglected in existing methods. To obtain reliable cross-instance relations to facilitate representation learning, we introduce a semi-supervised hierarchical clustering algorithm, named selective neighbor clustering (SNC), which can produce a clustering hierarchy directly from the connected components of a graph constructed from selective neighbors. We further present a method to estimate the unknown class number using SNC with a joint reference score that considers clustering indexes of both labelled and unlabelled data, and extend SNC to allow label assignment for the unlabelled instances with a given class number. We thoroughly evaluate our framework on public generic image recognition datasets and challenging fine-grained datasets, and establish a new state-of-the-art.

URL: https://openreview.net/forum?id=1fNcpcdr1o

---

Title: Don't Discriminate Among Corruptions: Generative Robustness Network for Unified Resiliency

Abstract: The vulnerability of deep recognition algorithms to image degradations widens the gap between their performance and that of robust human perception. Such degradations can be intentionally crafted or commonly occur. While both types of corruption have a similar adverse impact, they are handled independently in the existing literature. We assert they should not be dealt with independently if we aim to build a universally secure and robust system. In this research, we address both types of image degradations (referred to as common corruptions and intentional adversarial perturbations) using a single unified framework: \textit{generative robustness network}. Using the proposed framework, we present: (i) the detection of degraded samples, (ii) mitigating the impact of these degradations, and (iii) establishing a possible connection between different types of degradations. This research presents a \textit{`universal'} framework using a generative robustness network to safeguard deep learning algorithms against both common and deliberate corruptions. We assert that not only the degradations but also their severity is critical since they can widen \textit{intra} and \textit{inter} class variations. Extensive experiments are performed using multiple datasets and degradations in both seen and unseen settings to demonstrate the efficacy of the proposed framework.

URL: https://openreview.net/forum?id=d6yTn3p0x0

---

Title: ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers

Abstract: We propose a memory-efficient finetuning algorithm for large language models (LLMs) that supports finetuning LLMs with 65B parameters in 3-bit or 4-bit precision on as little as one 48GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 3-bit LLMs for the first time; leveraging state-of-the-art 3-bit OPTQ quantization, it often outperforms finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models—including the first family of 3-bit instruction following Alpaca LLMs—as part of LLMTOOLS, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.
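
A rough sketch of the general idea of combining a black-box quantizer with LoRA is given below: the frozen base weight is stored in low precision, dequantised only inside the forward pass, and gradients flow solely through the low-rank adapters. The `dequantize` callable, the shapes, and the initialisation are assumptions; ModuLoRA's actual quantization-agnostic backward pass is more involved than this.

    import torch
    import torch.nn as nn

    class QuantizedLoRALinear(nn.Module):
        """Sketch: y = dequant(W_q) x + B A x, with the quantized base weight frozen
        and only the low-rank factors A, B trained (LoRA on top of a black-box quantizer)."""
        def __init__(self, quantized_weight, dequantize, in_features, out_features, rank=16):
            super().__init__()
            self.quantized_weight = quantized_weight   # opaque low-precision storage
            self.dequantize = dequantize               # black-box callable: storage -> (out, in) weight
            self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(out_features, rank))

        def forward(self, x):
            with torch.no_grad():                      # no gradient through the frozen base weight
                w = self.dequantize(self.quantized_weight)  # materialised on the fly, then discarded
            return x @ w.t() + (x @ self.lora_a.t()) @ self.lora_b.t()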

URL: https://openreview.net/forum?id=r9p9CV52MV

---

Title: Generalizing Neural Additive Models via Statistical Multimodal Analysis

Abstract: Interpretable models are gaining increasing attention in the deep learning community, and significant progress is being made to develop simple, interpretable, yet powerful deep learning approaches. Generalized Additive Models (GAM) and Neural Additive Models (NAM) are prime examples. Despite these methods' great potential and popularity in critical applications (e.g., medical applications), they fail to generalize to multimodal data distributions. The main reason behind this limitation is that these "all-fit-one" models collapse multiple relationships by being forced to fit the data unimodally. We address this critical limitation by proposing interpretable multimodal network frameworks capable of learning a Mixture of Neural Additive Models (MNAM). The proposed MNAM learns relationships between input features and outputs in a multimodal fashion and assigns a probability to each mode. The proposed method shares similarities with Mixture Density Networks (MDN) while keeping the interpretability that characterizes GAM and NAM. We demonstrate how the proposed MNAM balances rich representations and interpretability with numerous empirical observations and pedagogical studies. We present and discuss training alternatives and provide an extensive practical evaluation to assess the suitability of the different options. The code is available at \href{https://anonymous.4open.science/r/MNAM-66CC/}{https://anonymous.4open.science/r/MNAM-66CC/} (an anonymized version of GitHub).
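
To make the idea concrete, here is a minimal sketch of a mixture of neural additive models: each feature gets its own small shape-function MLP per mode, the per-feature contributions are summed, and a gating head assigns a probability to each mode. Layer sizes and the gating design are illustrative assumptions, not the released code.

    import torch
    import torch.nn as nn

    class MNAMSketch(nn.Module):
        """Sketch of a Mixture of Neural Additive Models: K additive models (one
        shape function per feature, per mode) plus a gate producing mode probabilities."""
        def __init__(self, n_features, n_modes=3, hidden=32):
            super().__init__()
            self.shape_fns = nn.ModuleList([
                nn.ModuleList([nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 1))
                               for _ in range(n_features)])
                for _ in range(n_modes)])
            self.gate = nn.Linear(n_features, n_modes)

        def forward(self, x):                           # x: (batch, n_features)
            # per-mode additive predictions: sum of per-feature shape functions
            preds = torch.stack([
                sum(fns[j](x[:, j:j + 1]) for j in range(x.size(1))).squeeze(-1)
                for fns in self.shape_fns], dim=1)      # (batch, n_modes)
            probs = torch.softmax(self.gate(x), dim=1)  # mode probabilities
            return preds, probs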

URL: https://openreview.net/forum?id=xLg8ljlEba

---

Title: Layer-diverse Negative Sampling for Graph Neural Networks

Abstract: Graph neural networks (GNNs) are a powerful solution for various structure learning applications due to their strong representation capabilities for graph data. However, traditional GNNs, relying on message-passing mechanisms that gather information exclusively from first-order neighbours (known as positive samples), can lead to issues such as over-smoothing and over-squashing. To mitigate these issues, we propose a layer-diverse negative sampling method for message-passing propagation. This method employs a sampling matrix within a determinantal point process, which transforms the candidate set into a space and selectively samples from this space to generate negative samples. To further enhance the diversity of the negative samples during each forward pass, we develop a space-squeezing method to achieve layer-wise diversity in multi-layer GNNs. Experiments on various real-world graph datasets demonstrate the effectiveness of our approach in improving the diversity of negative samples and overall learning performance. Moreover, adding negative samples dynamically changes the graph's topology, and thus has strong potential to improve the expressiveness of GNNs and reduce the risk of over-squashing.

URL: https://openreview.net/forum?id=WOrdoKbxh6

---

Title: PNeRV: A Polynomial Neural Representation for Videos

Abstract: The application of Implicit Neural Representations (INRs) to video data poses unique challenges due to the introduction of an additional temporal dimension. In the context of videos, INRs have predominantly relied on a frame-only parameterization, which, unfortunately, sacrifices the spatiotemporal continuity observed in pixel-level (spatial) representations. To mitigate this, we introduce Polynomial Neural Representation for Videos (PNeRV), a parameter-efficient, patch-wise INR for videos that preserves spatiotemporal continuity. PNeRV leverages the modeling capabilities of Polynomial Neural Networks (PNNs) to perform the modulation of a continuous spatial (patch) signal with a continuous time (frame) signal. We further propose a custom Hierarchical Spatial Sampling Scheme that ensures spatial continuity while retaining parameter efficiency. We also employ a carefully designed Positional Embedding methodology to further enhance PNeRV's performance. Our extensive experimentation demonstrates that PNeRV outperforms the baselines in conventional Neural Representation (NR) tasks like compression along with downstream applications that require spatiotemporal continuity in the underlying representation. PNeRV not only addresses the challenges posed by video data in the realm of INRs but also opens new avenues for advanced video processing and analysis in the field.

URL: https://openreview.net/forum?id=oCBsxCov2g

---

Title: Learning By Self-Explaining

Abstract: Artificial intelligence (AI) research has a long track record of drawing inspiration from findings in biology, in particular human intelligence. In contrast to current AI research that mainly treats explanations as a means for model inspection, a somewhat neglected finding from human psychology is the benefit of self-explaining in an agent's learning process. Motivated by this, we introduce a novel learning paradigm, termed Learning by Self-Explaining (LSX). The underlying idea is that a learning module (learner) performs a base task, e.g. image classification, and provides explanations for its decisions. An internal critic module next evaluates the quality of these explanations given the original task. Finally, the learner is refined with the critic's feedback and the loop is repeated as required. The intuition behind this is that an explanation is considered ``good'' if the critic can perform the same task given the respective explanation. Despite many implementation possibilities, the structure of any LSX instantiation can be taxonomized based on four learning modules, which we identify as Fit, Explain, Reflect and Revise. In our work, we provide distinct instantiations of LSX for two different learner models, each illustrating different choices for the various LSX components. We broadly evaluate these on several datasets and show that Learning by Self-Explaining not only boosts the generalization abilities of AI models, particularly in small-data regimes, but also aids in mitigating the influence of confounding factors and leads to more task-specific and faithful model explanations. Overall, our results provide experimental evidence of the potential of self-explaining within the learning phase of an AI model.

URL: https://openreview.net/forum?id=OvJZ58BmOE

---

Title: Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets

Abstract: In this note, we demonstrate a first-of-its-kind provable convergence of SGD to the global minima of appropriately regularized logistic empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates with adequately smooth and bounded activations like sigmoid and tanh. We also prove an exponentially fast convergence rate for continuous time SGD that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of Frobenius norm regularized logistic loss functions on constant-sized neural nets which are ``Villani functions'', and thus to build on recent progress in analyzing SGD on such objectives.

URL: https://openreview.net/forum?id=9TqAUYB6tC

---

Title: Context-Aware Estimation of Attribution Robustness In Text

Abstract: Explanations are crucial parts of deep neural network (DNN) classifiers. In high stakes applications, faithful and robust explanations are important to understand DNN classifiers and gain trust. However, recent work has shown that state-of-the-art attribution methods in text classifiers are susceptible to imperceptible adversarial perturbations that alter explanations significantly while maintaining the correct prediction outcome. If undetected, this can critically mislead the users of DNNs. Thus, it is crucial to understand the influence of such adversarial perturbations on the networks’ explanations. In this work, we establish a novel definition of attribution robustness (AR) in text classification. Crucially, it reflects both attribution change induced by adversarial input alterations and perceptibility of such alterations. Moreover, we introduce a set of measures to effectively capture several aspects of perceptibility of perturbations in text, such as semantic distance to the original text, smoothness and grammaticality of the adversarial samples. We then propose our novel Context-Aware Explanation Attack (CEA), a strong adversary that provides a tight estimation for attribution robustness in text classification. CEA uses context-aware masked language models to extract word substitutions that result in fluent adversarial samples. Finally, with experiments on several classification architectures, we show that CEA consistently outperforms current state-of-the-art AR estimators, yielding perturbations that alter explanations to a greater extent while being less perceptible.

URL: https://openreview.net/forum?id=n4QC44o5yv

---

Title: Exploring Format Consistency for Instruction Tuning

Abstract: Instruction tuning has emerged as a promising approach to enhancing large language models in following human instructions. It has been shown that increasing the diversity and number of instructions in the training data consistently enhances generalization performance, which has motivated a recent endeavor to collect various instructions and integrate existing instruction tuning datasets into larger collections. However, different users have their unique ways of expressing instructions, and there often exist variations across different datasets in instruction styles and formats, i.e., format inconsistency. In this work, we study how format inconsistency may impact the performance of instruction tuning. We propose a framework called ``Unified Instruction Tuning'' (UIT), which calls OpenAI APIs for automatic format transfer among different instruction tuning datasets. We show that UIT successfully improves the generalization performance on unseen instructions, which highlights the importance of format consistency for instruction tuning. To make the UIT framework more practical, we further propose a novel perplexity-based denoising method to reduce the noise of automatic format transfer. We also train a smaller offline model that achieves comparable format transfer capability to OpenAI APIs, to reduce costs in practice. The code and trained models will be made available.

URL: https://openreview.net/forum?id=n8fZ6mY6PB

---

Title: Kernel Normalized Convolutional Networks

Abstract: Existing convolutional neural network architectures frequently rely upon batch normalization (BatchNorm) to effectively train the model. BatchNorm, however, performs poorly with small batch sizes and is inapplicable to differential privacy. To address these limitations, we propose kernel normalization and kernel normalized convolutional layers, and incorporate them into kernel normalized convolutional networks (KNConvNets) as the main building blocks. We implement KNConvNets corresponding to the state-of-the-art ResNets while forgoing BatchNorm layers. Through extensive experiments, we show that KNConvNets achieve higher or competitive performance compared to their BatchNorm counterparts in image classification and semantic segmentation. They also significantly outperform their batch-independent competitors, including layer and group normalization, in non-private and differentially private training. KNConvNets thus combine the batch-independence property of layer and group normalization with the performance advantage of BatchNorm.

URL: https://openreview.net/forum?id=Uv3XVAEgG6

---

Title: MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments

Abstract: Self-supervised learning can be used to reduce the dependence of Vision Transformer networks on very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. In doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols, with training that is at least 3 times faster than prior methods.

URL: https://openreview.net/forum?id=OdDsCaacZ0

---

Title: Correcting Model Misspecification via Generative Adversarial Networks

Abstract: Machine learning models are often misspecified in the likelihood, which leads to a lack of robustness in the predictions. In this paper, we introduce a framework for correcting likelihood misspecifications in several paradigm-agnostic noisy prior models and test the model's ability to remove the misspecification. The ``ABC-GAN'' framework introduced here is a novel generative modeling paradigm that combines Generative Adversarial Networks (GANs) and Approximate Bayesian Computation (ABC). This new paradigm assists existing GANs by incorporating, via ABC, any subjective knowledge available about the modeling process as a regularizer, resulting in a partially interpretable model that operates well under low-data regimes. At the same time, unlike in a standard Bayesian analysis, the explicit knowledge need not be perfect, since the generator in the GAN can be made arbitrarily complex. ABC-GAN eliminates the need for summary statistics and distance metrics, as the discriminator implicitly learns them, and enables simultaneous specification of multiple generative models. Model misspecification is simulated in our experiments by introducing noise of various biases and variances. The correction term is learnt via the ABC-GAN with skip connections, referred to as skipGAN. The strength of the skip connection indicates the amount of correction needed, or how misspecified the prior model is. Based on a simple experimental setup, we show that the ABC-GAN models not only correct the misspecification of the prior but also perform as well as or better than the respective priors under noisier conditions. In this work, we show that ABC-GANs get the best of both worlds.

URL: https://openreview.net/forum?id=9DVSDOIWeu

---

Title: Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models

Abstract: As general purpose vision models get increasingly effective at a wide set of tasks, it is imperative that they be consistent across the tasks they support. Inconsistent AI models are considered brittle and untrustworthy by human users and are more challenging to incorporate into larger systems that take dependencies on their outputs. Measuring consistency between very heterogeneous tasks that might include outputs in different modalities is challenging since it is difficult to determine if the predictions are consistent with one another. As a solution, we introduce a benchmark dataset, CocoCON, where we create contrast sets by modifying test instances for multiple tasks in small but semantically meaningful ways to change the gold label, and outline metrics for measuring if a model is consistent by ranking the original and perturbed instances across tasks. We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks, especially for more heterogeneous tasks. To alleviate this issue, we propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets, that improves the multi-task consistency of large unified models while retaining their original accuracy on downstream tasks. Data and sample code are available in the supplementary.

URL: https://openreview.net/forum?id=ue9igTDLN2

---

Title: Privacy Budget Tailoring in Private Data Analysis

Abstract: We consider the problem of learning differentially private linear and logistic regression models that do not exhibit disparate performance for minority groups in the data. Small-sized datasets pose a challenging regime for differential privacy: satisfying differential privacy while learning models from data can lead to models with worse accuracy for subgroups that are minorities in size. To address this challenge, inspired by Abowd & Schmutte (2018), we propose (i) systematically tailoring the privacy budget to the different groups, and (ii) using linear optimization oracles over a grid to optimize Lagrangian objectives that correspond to fair learning and optimization. We present efficient differentially private algorithms for linear and logistic regression subject to fairness constraints (e.g., bounded group loss) that allocate the privacy budget based on the private standard error of each subgroup in the data. Consequently, the formulation reduces the amount of noise added to these groups, which leads to more accurate models for them. We validate the proposed group-aware budget allocation method on synthetic and real-world datasets, where we show significant reductions in prediction error for the smallest groups while still preserving sufficient privacy to protect the minority group from re-identification attacks. In addition, we provide sample complexity lower bounds for our problem formulation.

URL: https://openreview.net/forum?id=SnPEhMyuYX

---

Title: On the Choice of Learning Rate for Local SGD

Abstract: Distributed data-parallel optimization accelerates the training of neural networks, but requires constant synchronization of gradients between the workers, which can become a bottleneck. One way to reduce communication overhead is to use Local SGD, where each worker asynchronously takes multiple local gradient steps, after which the model weights are averaged. In this work, we discuss the choice of learning rate for Local SGD, showing that it faces an intricate trade-off. Unlike in the synchronous case, its gradient estimate is biased, with the bias dependent on the learning rate itself. Thus, using learning rate scaling techniques designed for faster convergence in the synchronous case results, with Local SGD, in the performance degradation observed previously. To analyze the manifestation of this bias, we study the convergence behaviour of Local SGD and synchronous data-parallel SGD when each uses its optimal learning rate. Our experiments show that the optimal learning rate for Local SGD differs substantially from that of SGD, and that when using it the performance of Local SGD matches that of SGD. However, this performance comes at the cost of additional training iterations, rendering Local SGD faster than SGD only when communication is much more time-consuming than computation, suggesting that Local SGD is of limited practical utility.
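
For readers unfamiliar with the scheme under discussion, a compact sketch of one Local SGD round follows: every worker takes a fixed number of local steps with its own learning rate before the weights are averaged. The workers are simulated serially here for clarity; the number of local steps, loss function, and loader names are assumptions.

    import copy
    import torch

    def local_sgd_round(global_model, worker_loaders, loss_fn, lr, local_steps):
        """Sketch of one Local SGD round: every worker starts from the global weights,
        takes `local_steps` SGD steps on its own data, then the weights are averaged."""
        worker_states = []
        for loader in worker_loaders:
            model = copy.deepcopy(global_model)
            opt = torch.optim.SGD(model.parameters(), lr=lr)
            data_iter = iter(loader)
            for _ in range(local_steps):
                x, y = next(data_iter)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            worker_states.append(model.state_dict())

        # average the workers' parameters into the global model
        avg = {k: torch.stack([s[k].float() for s in worker_states]).mean(0)
               for k in worker_states[0]}
        global_model.load_state_dict(avg)
        return global_model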

URL: https://openreview.net/forum?id=DPvwr4HJdt

---

Title: EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models

Abstract: Electronic health records (EHR) contain a wealth of biomedical information, serving as valuable resources for the development of precision medicine systems. However, privacy concerns have resulted in limited access to high-quality and large-scale EHR data for researchers, impeding progress in methodological development. Recent research has delved into synthesizing realistic EHR data through generative modeling techniques, where a majority of proposed methods relied on generative adversarial networks (GANs) and their variants for EHR synthesis. Despite GAN-based methods attaining state-of-the-art performance in generating EHR data, these approaches are difficult to train and prone to mode collapse. Recently introduced in generative modeling, diffusion models have established cutting-edge performance in image generation, but their efficacy in EHR data synthesis remains largely unexplored. In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data while protecting private information.

URL: https://openreview.net/forum?id=DIGkJhGeqi

---

Title: Pathologies of Predictive Diversity in Deep Ensembles

Abstract: Classic results establish that encouraging predictive diversity improves performance in ensembles of low-capacity models, e.g. through bagging or boosting. Here we demonstrate that these intuitions do not apply to high-capacity neural network ensembles (deep ensembles), and in fact the opposite is often true. In a large scale study of nearly 600 neural network classification ensembles, we examine a variety of interventions that trade off component model performance for predictive diversity. While such interventions can improve the performance of small neural network ensembles (in line with standard intuitions), they harm the performance of the large neural network ensembles most often used in practice. Surprisingly, we also find that discouraging predictive diversity is often benign in large-network ensembles, fully inverting standard intuitions. Even when diversity-promoting interventions do not sacrifice component model performance (e.g. using heterogeneous architectures and training paradigms), we observe an opportunity cost associated with pursuing increased predictive diversity. Examining over 1000 ensembles, we observe that the performance benefits of diverse architectures/training procedures are easily dwarfed by the benefits of simply using higher-capacity models, despite the fact that such higher capacity models often yield significantly less predictive diversity. Overall, our findings demonstrate that standard intuitions around predictive diversity, originally developed for low-capacity ensembles, do not apply to modern high-capacity deep ensembles. These results call into question the benefit of efforts to create more diverse deep ensembles, especially in the face of an easier alternative: simply forming ensembles from ever more powerful (and less diverse) component models.

URL: https://openreview.net/forum?id=TQfQUksaC8

---

Title: MMD-Regularized Unbalanced Optimal Transport

Abstract: We study the unbalanced optimal transport (UOT) problem, where the marginal constraints are enforced using Maximum Mean Discrepancy (MMD) regularization. Our work is motivated by the observation that the literature on UOT is focused on regularization based on $\phi$-divergence (e.g., KL divergence). Despite the popularity of MMD, its role as a regularizer in the context of UOT seems less understood.
We begin by deriving a specific dual of MMD-regularized UOT (MMD-UOT), which helps us prove several useful properties.
One interesting outcome of this duality result is that MMD-UOT induces novel metrics, which not only lift the ground metric like the Wasserstein but are also sample-wise efficient to estimate like the MMD.
Further, for real-world applications involving non-discrete measures, we present an estimator for the transport plan that is supported only on the given ($m$) samples. Under mild conditions, we prove that the estimation error with this finitely-supported transport plan is also $\mathcal{O}(1/\sqrt{m})$. As far as we know, such error bounds that are free from the curse of dimensionality are not known for $\phi$-divergence regularized UOT. Finally, we discuss how the proposed estimator can be computed efficiently using accelerated gradient descent.
Our experiments show that MMD-UOT consistently outperforms popular baselines, including KL-regularized UOT and MMD, in diverse machine learning applications.
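
For orientation, one common way to write such an MMD-regularized objective (the weights $\lambda_1, \lambda_2$ and the kernel $k$ below are generic notation, not taken from the paper) is

$$\min_{\pi \ge 0} \; \int c \, \mathrm{d}\pi \;+\; \lambda_1\, \mathrm{MMD}_k^2(\pi_1, \mu) \;+\; \lambda_2\, \mathrm{MMD}_k^2(\pi_2, \nu),$$

where $c$ is the ground cost, $\pi_1, \pi_2$ are the marginals of the plan $\pi$, and the MMD terms replace the hard marginal constraints of balanced optimal transport.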

URL: https://openreview.net/forum?id=eN9CjU3h1b

---

Title: Estimating Optimal Policy Value in General Linear Contextual Bandits

Abstract: In many bandit problems, the maximal reward achievable by a policy is often unknown in advance. We consider the problem of estimating the optimal policy value in the sublinear data regime before the optimal policy is even learnable. We refer to this as $V^*$ estimation. It was previously shown that fast $V^*$ estimation is possible but only in disjoint linear bandits with Gaussian covariates. Whether this is possible for more realistic context distributions has remained an open and important question for tasks such as model selection. In this paper, we first provide lower bounds showing that this general problem is hard. However, under stronger assumptions, we give an algorithm and analysis proving that $\widetilde{\mathcal{O}}(\sqrt{d})$ sublinear estimation of $V^*$ is indeed information-theoretically possible, where $d$ is the dimension. We subsequently introduce a practical and computationally efficient algorithm that estimates a problem-specific upper bound on $V^*$, valid for general distributions and tight for Gaussian context distributions. We prove our algorithm requires only $\widetilde{\mathcal{O}}(\sqrt{d})$ samples to estimate the upper bound. We use this upper bound in conjunction with the estimator to derive novel and improved guarantees for several applications in bandit model selection and testing for treatment effects. We present promising experimental benefits on a semi-synthetic simulation using historical data on warfarin treatment dosage outcomes.

URL: https://openreview.net/forum?id=RUNiIDU8P7

---

Title: Recovering Exact Support in Federated lasso without Optimization

Abstract: Federated learning provides a framework to address the challenges of distributed computing, data ownership, and privacy over a large number of distributed clients with low computational and communication capabilities. In this paper, we study the problem of learning the exact support of sparse linear regression in the federated learning setup. We provide a simple communication efficient algorithm that only needs one-shot communication with the centralized server to compute the exact support by majority voting. Our method does not require the clients to solve any optimization problem and thus, can be run on devices with low computational capabilities. Our method is naturally robust to the problems of client failure, model poisoning, and straggling clients. We formally prove that our method requires a number of samples per client that is polynomial with respect to the support size, but independent of the dimension of the problem. We require the number of distributed clients to be logarithmic in the dimension of the problem. For certain classes of predictor variables (e.g. mutually independent, correlated Gaussian, etc.), the overall sample complexity matches the optimal sample complexity of the non-federated centralized setting. Furthermore, our method is easy to implement and has an overall polynomial time complexity.
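
As a minimal sketch of the one-shot aggregation step, assuming each client has already formed a binary estimate of the support by some local, optimization-free rule (the function and variable names below are illustrative, not from the paper):

```python
import numpy as np

def aggregate_support(client_supports, threshold=0.5):
    """One-shot server-side aggregation by majority vote.

    client_supports: (num_clients, d) binary matrix; row i is client i's
    local support estimate (how each client forms it is left abstract here).
    Returns the coordinates voted into the support by at least a
    `threshold` fraction of clients.
    """
    votes = np.asarray(client_supports).mean(axis=0)  # fraction of clients voting for each coordinate
    return np.flatnonzero(votes >= threshold)

# toy usage: 5 clients, 8 features, true support {1, 4}
supports = np.zeros((5, 8), dtype=int)
supports[:, [1, 4]] = 1   # every client recovers the true support
supports[0, 6] = 1        # one client adds a spurious coordinate
print(aggregate_support(supports))  # -> [1 4]
```

Robustness to a few failing or poisoned clients follows from the same mechanism: a minority of bad votes cannot flip the majority.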

URL: https://openreview.net/forum?id=JdXzKSyqbH

---

Title: Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

Abstract: Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates.
Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks.
Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task.
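
A minimal sketch of the two-stage idea, assuming pre-computed candidate embeddings and an expensive triplet scorer `rerank_fn` standing in for the dual encoder (the names and the value of `k` are illustrative, not from the paper):

```python
import numpy as np

def two_stage_retrieval(query_emb, candidate_embs, rerank_fn, k=100):
    """Stage 1: prune candidates with fast cosine similarity.
    Stage 2: re-rank the shortlist with an expensive scorer."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    shortlist = np.argsort(-(c @ q))[:k]                  # top-k by cosine similarity
    scores = np.array([rerank_fn(i) for i in shortlist])  # e.g. score of the full (reference, text, candidate i) triplet
    return shortlist[np.argsort(-scores)]                 # shortlist re-ordered by the second stage
```

The pruning stage keeps test-time cost close to the embedding-only pipeline, while the second stage only pays the expensive per-candidate cost for the k survivors.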

URL: https://openreview.net/forum?id=fJAwemcvpL

---

Title: MANDERA: Malicious Node Detection in Federated Learning via Ranking

Abstract: Byzantine attacks aim to hinder the deployment of federated learning algorithms by sending malicious gradients to degrade the model. Although the benign gradients and Byzantine gradients are distributed differently, identifying the malicious gradients is challenging due to (1) the gradient is high-dimensional and each dimension has its unique distribution, and (2) the benign gradients and the malicious gradients are mixed (two-sample test methods cannot apply directly). To address these issues, we propose MANDERA, which is theoretically guaranteed to efficiently detect all malicious gradients under Byzantine attacks with no prior knowledge or history about the number of attacked nodes. More specifically, we propose to transform the original matrix of gradient updates into a ranking matrix. By such an operation, the scales of different dimensions of the gradients in the ranking space become identical. Then the high-dimensional benign gradients and the malicious gradients can be easily separated in the ranking space. The effectiveness of MANDERA is further confirmed by experimentation on *four* Byzantine attack implementations (Gaussian, Zero Gradient, Sign Flipping, Shifted Mean), compared with state-of-the-art defences. The experiments cover both IID and Non-IID datasets.
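
A sketch of the ranking step alone (the detection rule applied on top of it is not reproduced here; the toy example is illustrative):

```python
import numpy as np
from scipy.stats import rankdata

def to_ranking_matrix(gradients):
    """Column-wise rank transform of per-node gradients.

    gradients: (num_nodes, d) matrix, one flattened update per node.
    Entry (i, j) of the result is the rank of node i's update in
    dimension j, so every dimension lives on the same 1..num_nodes scale.
    """
    g = np.asarray(gradients)
    return np.column_stack([rankdata(g[:, j]) for j in range(g.shape[1])])

# toy usage: 3 benign nodes plus one crude outlier
g = np.array([[0.10, -2.0,  5.0],
              [0.20, -1.9,  4.8],
              [0.15, -2.1,  5.1],
              [9.00, 30.0, -40.0]])
print(to_ranking_matrix(g))  # the last row sits at an extreme rank in every dimension
```

After this transform, updates that are extreme in many dimensions become easy to flag regardless of the original scale of each dimension.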

URL: https://openreview.net/forum?id=ptZiZAli6D

---

Title: Hyperspherical Prototype Node Clustering

Abstract: The general workflow of deep node clustering is to encode the nodes into node embeddings via graph neural networks and uncover clustering decisions from them, so clustering performance is heavily affected by the embeddings. However, existing works only consider preserving the semantics of the graph but ignore the inter-cluster separability of the nodes, so there is no guarantee that the embeddings can present a clear clustering structure. To remedy this deficiency, we propose Hyperspherical Prototype Node Clustering (HPNC), an end-to-end clustering paradigm that explicitly enhances the inter-cluster separability of learned node embeddings. Concretely, we constrain the embedding space to a unit hypersphere, enabling us to scatter the cluster prototypes over the space with maximized pairwise distances. Then, we employ a graph autoencoder to map nodes onto the same hypersphere manifold. Consequently, cluster affinities can be directly retrieved from cosine similarities between node embeddings and prototypes. A clustering-oriented loss is imposed to sharpen the affinity distribution so that the learned node embeddings are encouraged to have small intra-cluster distances and large inter-cluster distances. Based on the proposed HPNC paradigm, we devise two schemes (HPNC-IM and HPNC-DEC) with distinct clustering backbones. Empirical results on popular benchmark datasets demonstrate the superiority of our method compared to other state-of-the-art clustering methods, and visualization results illustrate improved separability of the learned embeddings.
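
A small sketch of the affinity computation described above, with fixed prototypes already scattered on the sphere (the temperature value and names are assumptions, not from the paper):

```python
import numpy as np

def cluster_affinities(embeddings, prototypes, temperature=0.1):
    """Softmax over cosine similarities between hyperspherical node
    embeddings and cluster prototypes; a lower temperature sharpens
    the affinity distribution."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (z @ p.T) / temperature                # (num_nodes, num_clusters)
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```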

URL: https://openreview.net/forum?id=z3ZlnaOM0d

---

Title: Learning to Abstain From Uninformative Data

Abstract: Learning and decision-making in domains with naturally high noise-to-signal ratios – such as finance or healthcare – is often challenging, yet the stakes are very high.
In this paper, we study the problem of learning and acting under a general noisy generative process. In this problem, the data distribution has a significant proportion of uninformative samples with high noise in the label, while part of the data contains useful information represented by low label noise. This dichotomy is present during both training and inference, which requires the proper handling of uninformative data during both training and testing. We propose a novel approach to learning under these conditions via a loss inspired by the selective learning theory. By minimizing this loss, the model is guaranteed to make a near-optimal decision by distinguishing informative data from uninformative data and making predictions. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluates its empirical performance in a variety of settings.

URL: https://openreview.net/forum?id=KKARKoPcEA

---

Title: Hiding in a Plain Sight: Out-of-Distribution Detection from Logit Space Embeddings

Abstract: Although deep learning (DL) models have revolutionized the field of machine learning (ML), these classification models cannot easily distinguish in-distribution (ID) from out-of-distribution (OOD) data at test time.
This paper analyzes the landscape of ID and OOD data embeddings and demonstrates that OOD data is always embedded toward the center in the logit space.
Furthermore, ID data are embedded far from the center, towards the positive regions of the logit space, ensuring minimal overlap between ID and OOD embeddings.
Based on these observations, we propose to make the classification model sensitive to the OOD data by incorporating the configuration of the logit space into the predictive response.
Hence, we estimate the distribution of the ID logits by utilizing a density estimator over the training data logits.
Our proposed approach is data and architecture-agnostic and could be easily incorporated with a trained model without exposure to OOD data.
We ran experiments on the popular image datasets and obtained state-of-the-art performance and an improvement of up to 10$\%$ on AUCROC on the Google genome dataset.
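
A rough sketch of the scoring rule this suggests, with a kernel density estimator standing in for whatever density model is actually used (in practice the logits would typically be reduced to a few dimensions first; everything below is an illustrative assumption):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_logit_density(train_logits):
    """Fit a density estimator over in-distribution training logits.
    gaussian_kde expects data of shape (dims, samples)."""
    return gaussian_kde(np.asarray(train_logits).T)

def ood_score(kde, test_logits):
    """Lower density under the ID logit distribution => more OOD-like."""
    return -kde.logpdf(np.asarray(test_logits).T)
```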

URL: https://openreview.net/forum?id=2nGufrROnC

---

Title: Debiasing Vision-Language Models via Biased Prompts

Abstract: Machine learning models have been shown to inherit biases from their training datasets. This can be particularly problematic for vision-language foundation models trained on uncurated datasets scraped from the internet. The biases can be amplified and propagated to downstream applications like zero-shot classifiers and text-to-image generative models. In this study, we propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding. In particular, we show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models. The proposed closed-form solution enables easy integration into large-scale pipelines, and empirical results demonstrate that our approach effectively reduces social bias and spurious correlation in both discriminative and generative vision-language models without the need for additional data or training.
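
One generic way to realize "projecting out biased directions" is a linear orthogonal projection; the calibration the paper adds is not reproduced in this sketch, and the names are illustrative:

```python
import numpy as np

def debias_projection(bias_directions):
    """Projection matrix that removes the span of the given bias
    directions from embedding space.

    bias_directions: (k, d) matrix of directions to remove.
    Returns P = I - B B^+ of shape (d, d), with B = bias_directions.T.
    """
    B = np.asarray(bias_directions).T            # (d, k)
    return np.eye(B.shape[0]) - B @ np.linalg.pinv(B)

# usage: debiased = text_embeddings @ debias_projection(bias_dirs)
# (P is symmetric, so right-multiplication projects each row embedding)
```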

URL: https://openreview.net/forum?id=Gu1t2ar96S

---

Title: Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Abstract: Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criterion for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning -- hypothetical consistency (a model's ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model's final outputs when intermediate sub-steps are replaced with the model's outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.

URL: https://openreview.net/forum?id=5nBqY1y96B

---

Title: Adversarial Fairness with Elastic Weight Consolidation

Abstract: A central goal of algorithmic fairness is to develop a non-discriminatory approach with respect to a protected group. We study methods to improve worst-group accuracy, primarily when the data are unevenly distributed across groups. We propose a method to enhance both accuracy and fairness for the worst group using regularization based on Elastic Weight Consolidation (EWC). We mitigate socially undesirable biases for binary classification tasks by applying adversarial models. To retain the parameters critical for predicting the target attribute, we regularize the model using the Fisher information, an approach referred to as EWC. We confirm that learning the task on the UCI Adult (Census), CelebA, and Waterbirds datasets yields a better trade-off between accuracy and fairness than in previous studies. The experimental results on tabular and image datasets show that our proposed method achieves better fairness improvements than previous methods while maintaining accuracy under widely used fairness criteria.
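
For reference, the EWC-style penalty the abstract refers to is usually written as

$$\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \mathcal{L}_{\text{adv}}(\theta) + \frac{\lambda}{2} \sum_i F_i \left(\theta_i - \theta_i^{*}\right)^2,$$

where $F_i$ is the Fisher information of parameter $i$, $\theta^{*}$ are the reference parameters whose predictive behavior should be preserved, and $\lambda$ trades off plasticity against retention; how the adversarial term $\mathcal{L}_{\text{adv}}$ is weighted and combined here is an assumption, not taken from the abstract.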

URL: https://openreview.net/forum?id=rXMnlnKhSJ

---

Title: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-layered approach to the development of safer AI systems.

URL: https://openreview.net/forum?id=bx24KpJ4Eb

---

Title: Unsupervised 3D Scene Representation Learning via Movable Object Inference

Abstract: Unsupervised, category-agnostic, object-centric 3D representation learning for complex scenes remains an open problem in computer vision. While a few recent methods can discover 3D objects from a single image, they still struggle on scenes with diverse and complex object configurations, as they discover objects mostly by appearance similarity, which is insufficient for textured objects. In this work, we propose Movable Object Radiance Fields (MORF), aiming at scaling to complex scenes with diverse categories of objects. Inspired by cognitive science studies of object learning in babies, MORF learns 3D object representations via movable object inference. While obtaining 3D movable object signals requires multi-view videos of moving objects, we propose lifting a 2D movable object inference module that can be pretrained on monocular videos without supervision. Thus, MORF requires only multi-view images of static training scenes. During testing, MORF can discover, reconstruct, and move unseen objects from novel categories, all from a single image of novel scenes. Experiments show that MORF extracts accurate object geometry and supports realistic object and scene reconstruction and editing, significantly outperforming the state-of-the-art.

URL: https://openreview.net/forum?id=1QjCzP0KIw

---

Title: Relation Guided Message Passing for Multi-label Classification

Abstract: A well-known challenge in multi-label classification is modelling the dependencies between the labels. Most of the attempts in the literature focus on label dependencies that exhibit themselves through co-occurrences. Co-occurrences represent a pulling type of relationship between labels, meaning that labels that are observed together in training samples are more likely to co-occur. But other label relationships are common, such as a group of labels that never occur together. We call this a pushing relation. Successfully modeling such relations and the dependencies they induce can also lead to improved prediction performance.
In this work, we develop a graph-based dependency module that models multiple types of relations between labels and thus captures richer dependencies. The module is designed to be flexible so that it can be integrated into most embedding-based multi-label classification approaches. We propose a generic method to extract pulling and pushing relations between labels for any multi-label data. We then present Relation Guided Message Passing (RGMP), a Transformer based classifier for multi-label classification that uses the proposed label dependency module. Experiments on benchmark datasets show that RGMP yields similar or superior performance compared to state-of-the-art methods and the approach imposes only minor additional computational and memory overheads.

URL: https://openreview.net/forum?id=hjGw26lTor

---

Title: Introspective Experience Replay: Look Back When Surprised

Abstract: In reinforcement learning (RL), experience replay-based sampling techniques are crucial in promoting convergence by eliminating spurious correlations. However, widely used methods such as uniform experience replay (UER) and prioritized experience replay (PER) have been shown to have sub-optimal convergence and high seed sensitivity, respectively. To address these issues, we propose a novel approach called Introspective Experience Replay (IER) that selectively samples batches of data points prior to surprising events. Our method builds upon the theoretically sound reverse experience replay (RER) technique, which has been shown to reduce bias in the output of Q-learning-type algorithms with linear function approximation. However, RER is not always practically reliable when using neural function approximation. Through empirical evaluations, we demonstrate that IER with neural function approximation yields reliable and superior performance compared to UER, PER, and hindsight experience replay (HER) across most tasks.
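
A toy sketch of the sampling rule, with absolute TD error standing in for "surprise" (the surprise measure, batch construction details, and names are assumptions, not from the paper):

```python
import numpy as np

def introspective_batches(td_errors, batch_size, num_batches):
    """Pick the most surprising time steps (largest |TD error|) and,
    for each, return the indices of the transitions that immediately
    precede it in the replay buffer."""
    td = np.abs(np.asarray(td_errors))
    pivots = np.argsort(-td)[:num_batches]        # most surprising steps
    batches = []
    for t in pivots:
        start = max(0, t - batch_size + 1)
        batches.append(np.arange(start, t + 1))   # transitions leading up to t
    return batches
```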

URL: https://openreview.net/forum?id=vWTZO1RXZR

---

Title: AmbientFlow: Invertible generative models from incomplete, noisy measurements

Abstract: Generative models have gained popularity for their potential applications in imaging science, such as image reconstruction, posterior sampling and data sharing. Flow-based generative models are particularly attractive due to their ability to tractably provide exact density estimates along with fast, inexpensive and diverse samples. Training such models, however, requires a large, high quality dataset of objects. In applications such as computed imaging, it is often difficult to acquire such data due to requirements such as long acquisition time or high radiation dose, while acquiring noisy or partially observed measurements of these objects is more feasible. In this work, we propose AmbientFlow, a variational Bayesian framework for learning flow-based generative models directly from noisy and incomplete data. Extensive numerical studies demonstrate the effectiveness of AmbientFlow in correctly learning the object distribution. The utility of AmbientFlow in a downstream inference task of image reconstruction is demonstrated.

URL: https://openreview.net/forum?id=txpYITR8oa

---

Title: A Survey on Self-Supervised Representation Learning

Abstract: Learning meaningful representations is at the heart of many tasks in the field of modern machine learning. Recently, many methods have been introduced that allow learning image representations without supervision. These representations can then be used in downstream tasks like classification or object detection. The quality of these representations approaches that of supervised learning, while no labeled images are needed. This survey provides a comprehensive review of these methods in a unified notation, points out similarities and differences between them, and proposes a taxonomy that sets these methods in relation to each other. Furthermore, our survey summarizes the most recent experimental results reported in the literature in the form of a meta-study. It is intended as a starting point for researchers and practitioners who want to dive into the field of representation learning.

URL: https://openreview.net/forum?id=NaDWaYfVxm

---

Title: ECG Representation Learning with Multi-Modal EHR Data

Abstract: Electronic health records (EHRs) provide a rich source of medical information across different modalities such as electrocardiograms (ECG), structured EHRs (sEHR), and unstructured EHRs (text). Motivated by the fact that many cardiac and non-cardiac diseases influence the behavior of the ECG, we pair ECGs with structured and unstructured EHRs from multiple sources and propose three new multi-modal contrastive learning models that combine the ECG, sEHR, and text modalities, as well as a supervised large-scale multi-task learning model trained to perform both classification and regression on a large number of cardiovascular diseases and lab test measurements, producing robust ECG representations that can subsequently be used for a variety of downstream tasks. The performance of these models is compared against different baselines, such as supervised learning models trained from scratch with random weight initialization and self-supervised learning models trained only on ECGs. We pre-train the models on a large proprietary dataset of about 9 million ECGs from around 2.4 million patients, evaluate them on various downstream tasks such as classification, zero-shot retrieval, and out-of-distribution detection that involve predicting various heart conditions from ECG waveforms, and demonstrate that the models presented in this work show significant improvements over all baseline models.
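
For intuition, the multi-modal contrastive objectives mentioned above are typically of the symmetric InfoNCE form sketched below for ECG–text pairs; this is a generic formulation, not the paper's exact loss, and the temperature is an assumed value:

```python
import numpy as np

def symmetric_contrastive_loss(ecg_emb, text_emb, temperature=0.07):
    """CLIP-style loss: the i-th ECG should match the i-th text and vice versa."""
    e = ecg_emb / np.linalg.norm(ecg_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (e @ t.T) / temperature                       # (n, n) similarities
    idx = np.arange(logits.shape[0])

    def cross_entropy(l):                                  # -log softmax of the matched pair
        l = l - l.max(axis=1, keepdims=True)               # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```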

URL: https://openreview.net/forum?id=UxmvCwuTMG

---

Title: Towards Fair Video Summarization

Abstract: Automated video summarization is a vision task that aims to generate concise summaries of lengthy videos. Recent advancements in deep learning have led to highly performant video summarization models; however, there has been a lack of attention given to fairness and unbiased representation in the generated summaries. To bridge this gap, we introduce and analytically define the fair video summarization problem, and demonstrate its connections to the well-established problem of fair clustering. To facilitate fair model development, we also introduce the FairVidSum dataset, which is similar in design to state-of-the-art video summarization datasets such as TVSum and SumMe, but also includes annotations for sensitive attributes and individuals alongside frame importance scores. Finally, we propose the SumBal metric for quantifying the fairness of an outputted video summary. We conduct extensive experiments to benchmark the fairness of various state-of-the-art video summarization models. Our results highlight the need for better models that balance accuracy and fairness to ensure equitable representation and inclusion in summaries. For completeness, we also provide a novel fair-only baseline, FVS-LP, to showcase the fairness-utility gap models can improve upon.

URL: https://openreview.net/forum?id=Uj6MRfR1P5

---

Title: Optimizing Performance of Feedforward and Convolutional Neural Networks through Dynamic Activation Functions

Abstract: Deep learning training algorithms have seen enormous success in recent years in many fields, including speech, text, image, and video. Deeper and deeper architectures have been proposed with great success, with ResNet structures reaching around 152 layers. Shallow convolutional neural networks (CNNs) are still an active research area, where some phenomena remain unexplained. The activation functions used in a network are of utmost importance, as they provide its non-linearity; ReLUs are the most commonly used activation function. We propose complex piecewise-linear (PWL) activations for the hidden layers and show that these PWL activations work much better than ReLU activations for both convolutional neural networks and multilayer perceptrons. Result comparisons in PyTorch for shallow and deep CNNs are given to further strengthen our case.
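
As a concrete, hypothetical example of what a trainable piecewise-linear hidden activation can look like, here is one built from shifted ReLU hinges; this is one possible parameterization, not necessarily the one used in the paper:

```python
import torch
import torch.nn as nn

class PiecewiseLinearActivation(nn.Module):
    """f(x) = sum_k s_k * relu(x - h_k): continuous, piecewise linear,
    with learnable slope changes s_k at fixed hinge locations h_k."""
    def __init__(self, num_hinges=5, low=-2.0, high=2.0):
        super().__init__()
        self.register_buffer("hinges", torch.linspace(low, high, num_hinges))
        slopes = torch.zeros(num_hinges)
        slopes[self.hinges.abs().argmin()] = 1.0   # initialize to (roughly) a plain ReLU
        self.slopes = nn.Parameter(slopes)

    def forward(self, x):
        return (self.slopes * torch.relu(x.unsqueeze(-1) - self.hinges)).sum(dim=-1)

# usage: drop-in replacement for nn.ReLU() in a small CNN or MLP
act = PiecewiseLinearActivation()
print(act(torch.linspace(-3, 3, 7)))
```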

URL: https://openreview.net/forum?id=DOhicmIElC

---

Title: Optimal Input Gain: All You Need to Supercharge a Feed-Forward Neural Network

Abstract: Linear transformation of the inputs alters the training performance of feed-forward networks that are otherwise equivalent. However, most linear transforms are viewed as a pre-processing operation separate from the actual training. Starting from equivalent networks, it is shown that pre-processing inputs using a linear transformation is equivalent to multiplying the negative gradient matrix by an autocorrelation matrix per training iteration. A second-order method is proposed to find the autocorrelation matrix that maximizes learning in a given iteration. When the autocorrelation matrix is diagonal, the method optimizes input gains. This optimal input gain (OIG) approach is used to improve two first-order two-stage training algorithms, namely back-propagation (BP) and hidden weight optimization (HWO), which alternately update the input weights and solve linear equations for the output weights. Results show that the proposed OIG approach greatly enhances the performance of the first-order algorithms, often allowing them to rival the popular Levenberg-Marquardt approach with far less computation. Since HWO is equivalent to BP with a whitening transformation applied to the inputs, OIG-improved HWO could be a significant building block for more complex deep learning architectures.
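
In symbols, the equivalence described above can be written (with generic notation $z$, $\mathbf{G}$, $\mathbf{R}$ that is not taken from the abstract) as the input-weight update

$$\mathbf{W} \leftarrow \mathbf{W} + z\,\mathbf{G}\,\mathbf{R},$$

where $\mathbf{G}$ is the negative gradient of the loss with respect to the input weights, $z$ a learning rate, and $\mathbf{R}$ the autocorrelation-type matrix induced by the linear input transform; OIG amounts to choosing $\mathbf{R}$ (diagonal in the input-gain case) so that each iteration's improvement is maximized.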

URL: https://openreview.net/forum?id=uC5UIiN2NM

---

Title: From Differential Privacy to Bounds on Membership Inference: Less can be More

Abstract: Differential Privacy (DP) is the de facto standard for reasoning about the privacy of a training algorithm. Yet, learning with DP often yields poor performance unless one trains on a large dataset. In this paper, we instead outline how training on less data can be beneficial when we are only interested in defending against specific attacks; we take the canonical example of defending against membership inference. To arrive at this result, we first derive (tight) bounds on the success of all membership inference attacks. These bounds do not replace DP, rather they introduce a complementary interpretation of a DP algorithm's ability to defend against membership inference specifically. Because our bound more tightly captures the effect of how training data was selected, we can show that decreasing the sampling rate when constructing the training dataset has a disparate effect on the bound when compared to strengthening the DP guarantee. Thus, when the privacy protection we care about is defending against membership inference, training on less data can yield more advantageous trade-offs between preventing membership inference and utility than strengthening the DP guarantee. We empirically illustrate this on MNIST, CIFAR10 and SVHN-extended.

URL: https://openreview.net/forum?id=daXqjb6dVE

---

Title: Towards a General Transfer Approach for Policy-Value Networks

Abstract: Transferring trained policies and value functions from one task to another, such as one game to another with a different board size, board shape, or more substantial rule changes, is a challenging problem. Popular benchmarks for reinforcement learning (RL), such as Atari games and ProcGen, have limited variety especially in terms of action spaces. Due to a focus on such benchmarks, the development of transfer methods that can also handle changes in action spaces has received relatively little attention. Furthermore, we argue that progress towards more general methods should include benchmarks where new problem instances can be described by domain experts, rather than machine learning experts, using convenient, high-level domain specific languages (DSLs). In addition to enabling end users to more easily describe their problems, user-friendly DSLs also contain relevant task information which can be leveraged to make effective zero-shot transfer plausibly achievable. As an example, we use the Ludii general game system, which includes a highly varied set of over 1000 distinct games described in such a language. We propose a simple baseline approach for transferring fully convolutional policy-value networks between any pair of games modelled in this system, and present extensive results—including various cases of highly successful zero-shot transfer.

URL: https://openreview.net/forum?id=vJcTm2v9Ku

---

Title: Personalized Federated Learning with Spurious Features: An Adversarial Approach

Abstract: One of the common approaches for personalizing federated learning is fine-tuning the global model for each local client. While this addresses some issues of statistical heterogeneity, we find that such personalization methods are vulnerable to spurious features at local agents, leading to reduced generalization performance. This work considers a setup where spurious features correlate with the label in each client's training environment, and the mixture of multiple training environments (i.e., the global environment) diminishes the spurious correlations. While the global federated learning model trained over the global environment suffers less from spurious features, the local fine-tuning step may lead to personalized models vulnerable to spurious correlations. In light of this practical and pressing challenge, we propose a novel strategy to mitigate the effect of spurious features during personalization by maintaining the adversarial transferability between the global and personalized models. Empirical results on object and action recognition tasks show that our proposed approach bounds personalized models from further exploiting spurious features while preserving the benefit of enhanced accuracy from fine-tuning.

URL: https://openreview.net/forum?id=N2wx9UVHkH

---

Title: Improving Diffusion Models for Scene Text Editing with Dual Encoders

Abstract: Scene text editing is a challenging task that involves modifying or inserting specified texts in an image while maintaining its natural and realistic appearance. Most previous approaches to this task rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and are unable to insert texts into images. Recent advances in diffusion models have shown promise in overcoming these limitations with text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle with rendering correct text and controlling text style. To address these problems, we propose DIFFSTE to improve pre-trained diffusion models with a dual encoder design, which includes a character encoder for better text legibility and an instruction encoder for better style control. An instruction tuning framework is introduced to train our model to learn the mapping from the text instruction to the corresponding image with either the specified style or the style of the surrounding texts in the background. Such a training method further brings our method the zero-shot generalization ability to the following three scenarios: generating text with unseen font variation, e.g., italic and bold, mixing different fonts to construct a new font, and using more relaxed forms of natural language as the instructions to guide the generation task. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability.

URL: https://openreview.net/forum?id=yL15ys5swq

---

Title: Blind Biological Sequence Denoising with Self-Supervised Set Learning

Abstract: Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the “average” of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of ≤ 6 subreads with 17% fewer errors and large reads of > 6 subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.

URL: https://openreview.net/forum?id=3s7ior0WZ5

---

Title: A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models

Abstract: Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, and propose three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates.

URL: https://openreview.net/forum?id=5G3PI1hEdw

---

Title: Visual Prompt Based Personalized Federated Learning

Abstract: As a popular paradigm of distributed learning, personalized federated learning (PFL) allows personalized models to improve generalization ability and robustness by utilizing knowledge from all distributed clients. Most existing PFL algorithms tackle personalization in a model-centric way, such as personalized layer partition, model regularization, and model interpolation, which all fail to take into account the data characteristics of distributed clients. In this paper, we propose a novel PFL framework for image classification tasks, dubbed pFedPT, that leverages personalized visual prompts to implicitly represent local data distribution information of clients and provides that information to the aggregation model to help with classification tasks. Specifically, in each round of pFedPT training, each client generates a local personalized prompt related to local data distribution. Then, the local model is trained on the input composed of raw data and a visual prompt to learn the distribution information contained in the prompt. During model testing, the aggregated model obtains prior knowledge of the data distributions based on the prompts, which can be seen as an adaptive fine-tuning of the aggregation model to improve model performances on different clients. Furthermore, the visual prompt can be added as an orthogonal method to implement personalization on the client for existing FL methods to boost their performance. Experiments on the CIFAR10 and CIFAR100 datasets show that pFedPT outperforms several state-of-the-art (SOTA) PFL algorithms by a large margin in various settings.

URL: https://openreview.net/forum?id=dUVejidXO7

---

Title: Are you using test log-likelihood correctly?

Abstract: Test log-likelihood is commonly used to compare different models of the same data or different approximate inference algorithms for fitting the same probabilistic model. We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error.

URL: https://openreview.net/forum?id=n2YifD4Dxo

---

Title: CR-MoE: Consistent Routed Mixture-of-Experts for Scaling Contrastive Learning

Abstract: While Contrastive Learning (CL) achieves great success in many downstream tasks, its good performance heavily relies on a large model capacity. As previous methods focus on scaling dense models, training and inference costs increase rapidly with model sizes, leading to large resource consumption. In this paper, we explore CL with an efficient scaling method, Mixture of Experts (MoE), to obtain a large but sparse model. We start by plugging in the state-of-the-art CL method to MoE. However, this naive combination fails to visibly improve performance despite a much larger capacity. A closer look reveals that the naive MoE+CL model has a strong tendency to route two augmented views of the same image token to different subsets of experts: such ``cross-view instability" breaks the weight-sharing nature in CL and misleads the invariant feature learning. To address this issue, we introduce a new regularization mechanism, by enforcing expert-routing similarity between different views of the same image (or its overlapped patch tokens), while promoting expert-routing diversity of patches from different images. The resultant method, called CR-MoE, improves by 1.7 points in terms of 1\% semi-supervised learning accuracy on ImageNet, compared to the naive combination baseline. It further surpasses the state-of-the-art CL methods on ImageNet pre-training of Vision Transformer (ViT) by 2.8 points, at the same computational cost. Our findings validate CR-MoE as an effective and efficient image representation learner. Code is included in the supplemental materials.

URL: https://openreview.net/forum?id=qKIvn9xL1R

---

Title: Provable Guarantees for Sparsity Recovery with Deterministic Missing Data Patterns

Abstract: We study the problem of consistently recovering the sparsity pattern of a regression parameter vector from correlated observations governed by deterministic missing data patterns using Lasso. We consider the case in which the observed dataset is censored by a deterministic, non-uniform filter. Recovering the sparsity pattern in datasets with deterministic missing structure can be arguably more challenging than recovering in a uniformly-at-random scenario. In this paper, we propose an efficient algorithm for missing value imputation by utilizing the topological property of the censorship filter. We then provide novel theoretical results for exact recovery of the sparsity pattern using the proposed imputation strategy. Our analysis shows that, under certain statistical and topological conditions, the hidden sparsity pattern can be recovered consistently with high probability in polynomial time and logarithmic sample complexity.

URL: https://openreview.net/forum?id=SSqOqAwpN7

---

Title: Exact Inference with Latent Variables in an Arbitrary Domain

Abstract: We analyze the necessary and sufficient conditions for exact inference of a latent model. In latent models, each entity is associated with a latent variable following some probability distribution. The challenging question we try to solve is: can we perform exact inference without observing the latent variables, even without knowing what the domain of the latent variables is? We show that exact inference can be achieved using a semidefinite programming (SDP) approach without knowing either the latent variables or their domain. Our analysis predicts the experimental correctness of SDP with high accuracy, showing the suitability of our focus on the Karush-Kuhn-Tucker (KKT) conditions and the spectrum of a properly defined matrix. Running on a laptop equivalent, our method can achieve exact inference in models with over 10000 entities efficiently. As a byproduct of our analysis, we also provide concentration inequalities with dependence on latent variables, both for bounded moment generating functions as well as for the spectra of matrices. To the best of our knowledge, these results are novel and could be useful for many other problems.

URL: https://openreview.net/forum?id=1R7spWLnpR

---

Title: Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs

Abstract: Memory complexity and data scarcity have so far prohibited learning solution operators of partial differential equations (PDEs) at high resolutions. We address these limitations by introducing a new data efficient and highly parallelizable operator learning approach with reduced memory requirement and better generalization, called multi-grid tensorized neural operator (MG-TFNO).
MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena, through a decomposition of both the input domain and the operator's parameter space. Our contributions are threefold: i) we enable parallelization over input samples with a novel multi-grid-based domain decomposition, ii) we represent the parameters of the model in a high-order latent subspace of the Fourier domain, through a global tensor factorization, resulting in an extreme reduction in the number of parameters and improved generalization, and iii) we propose architectural improvements to the backbone FNO. Our approach can be used in any operator learning setting. We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression. The tensorization, combined with the domain decomposition, yields over a 150x reduction in the number of parameters and a 7x reduction in the domain size without loss in accuracy, while also enabling parallelism.

URL: https://openreview.net/forum?id=oFqHIkw8sd

---

Title: The Kernel Perspective on Dynamic Mode Decomposition

Abstract: This manuscript revisits theoretical assumptions concerning dynamic mode decomposition (DMD) of Koopman operators, including the existence of lattices of eigenfunctions, common eigenfunctions between Koopman operators, and boundedness and compactness of Koopman operators. Counterexamples that illustrate restrictiveness of the assumptions are provided for each of the assumptions. In particular, this manuscript proves that the native reproducing kernel Hilbert space (RKHS) of the Gaussian RBF kernel function only supports bounded Koopman operators if the dynamics are affine. In addition, a new framework for DMD, that requires only densely defined Koopman operators over RKHSs is introduced, and its effectiveness is demonstrated through numerical examples.

URL: https://openreview.net/forum?id=4csshMM3HB

---
