Weekly TMLR digest for Jun 04, 2023

2 views

Skip to first unread message

TMLR

unread,

Jun 3, 2023, 8:00:11 PM6/3/23

to tmlr-annou...@googlegroups.com

New certifications
==================

Reproducibility Certification: Aux-Drop: Handling Haphazard Inputs in Online Learning Using Auxiliary Dropouts

Rohit Agarwal, Deepak Gupta, Alexander Horsch, Dilip K. Prasad

https://openreview.net/forum?id=R9CgBkeZ6Z

---

Accepted papers
===============

Title: Federated Learning under Covariate Shifts with Generalization Guarantees

Authors: Ali Ramezani-Kebrya, Fanghui Liu, Thomas Pethick, Grigorios Chrysos, Volkan Cevher

Abstract: This paper addresses intra-client and inter-client covariate shifts in federated learning (FL) with a focus on the overall generalization performance. To handle covariate shifts, we formulate a new global model training paradigm and propose Federated Importance-Weighted Empirical Risk Minimization (FTW-ERM) along with improving density ratio matching methods without requiring perfect knowledge of the supremum over true ratios. We also propose the communication-efficient variant FITW-ERM with the same level of privacy guarantees as those of classical ERM in FL. We theoretically show that FTW-ERM achieves smaller generalization error than classical ERM under certain settings. Experimental results demonstrate the superiority of FTW-ERM over existing FL baselines in challenging imbalanced federated settings in terms of data distribution shifts across clients.

URL: https://openreview.net/forum?id=N7lCDaeNiS

---

Title: When Does Uncertainty Matter?: Understanding the Impact of Predictive Uncertainty in ML Assisted Decision Making

Authors: Sean McGrath, Parth Mehta, Alexandra Zytek, Isaac Lage, Himabindu Lakkaraju

Abstract: As machine learning (ML) models are increasingly being employed to assist human decision makers, it becomes critical to provide these decision makers with relevant inputs which can help them decide if and how to incorporate model predictions into their decision making. For instance, communicating the uncertainty associated with model predictions could potentially be helpful in this regard. In this work, we carry out user studies (1,330 responses from 190 participants) to systematically assess how people with differing levels of expertise respond to different types of predictive uncertainty (i.e., posterior predictive distributions with different shapes and variances) in the context of ML assisted decision making for predicting apartment rental prices. We found that showing posterior predictive distributions led to smaller disagreements with the ML model's predictions, regardless of the shapes and variances of the posterior predictive distributions we considered, and that these effects may be sensitive to expertise in both ML and the domain. This suggests that posterior predictive distributions can potentially serve as useful decision aids which should be used with caution and take into account the type of distribution and the expertise of the human.

URL: https://openreview.net/forum?id=pbs22kJmEO

---

Title: The Robustness Limits of SoTA Vision Models to Natural Variation

Authors: Mark Ibrahim, Quentin Garrido, Ari S. Morcos, Diane Bouchacourt

Abstract: Recent state-of-the-art vision models have introduced new architectures, learning
paradigms, and larger pretraining data, leading to impressive performance on tasks
such as classification. While previous generations of vision models were shown to
lack robustness to factors such as pose, the extent to which this next generation
of models are more robust remains unclear. To study this question, we develop a
dataset of more than 7 million images with controlled changes in pose, position
background, lighting color, and size. We study not only how robust recent state-of-
the-art models are, but also the extent to which models can generalize to variation in
each of these factors. We consider a catalog of recent vision models, including vision
transformers (ViT), self-supervised models such as masked autoencoders (MAE),
and models trained on larger datasets such as CLIP. We find that even today’s best
models are not robust to common changes in pose, size, and background. When
some samples varied during training, we found models required a significant portion
of instances seen varying to generalize—though eventually robustness did improve.
When variability is only witnessed for some classes however, we found that models
did not generalize to other classes unless the classes were very similar to those seen
varying during training. We hope our work will shed further light on the blind
spots of SoTA models and spur the development of more robust vision models.

URL: https://openreview.net/forum?id=QhHLwn3D0Y

---

Title: Robust Multi-Agent Reinforcement Learning with State Uncertainty

Authors: Sihong He, Songyang Han, Sanbao Su, Shuo Han, Shaofeng Zou, Fei Miao

Abstract: In real-world multi-agent reinforcement learning (MARL) applications, agents may not have perfect state information (e.g., due to inaccurate measurement or malicious attacks), which challenges the robustness of agents' policies. Though robustness is getting important in MARL deployment, little prior work has studied state uncertainties in MARL, neither in problem formulation nor algorithm design. Motivated by this robustness issue and the lack of corresponding studies, we study the problem of MARL with state uncertainty in this work. We provide the first attempt to the theoretical and empirical analysis of this challenging problem. We first model the problem as a Markov Game with state perturbation adversaries (MG-SPA) by introducing a set of state perturbation adversaries into a Markov Game. We then introduce robust equilibrium (RE) as the solution concept of an MG-SPA. We conduct a fundamental analysis regarding MG-SPA such as giving conditions under which such a robust equilibrium exists. Then we propose a robust multi-agent Q-learning (RMAQ) algorithm to find such an equilibrium, with convergence guarantees. To handle high-dimensional state-action space, we design a robust multi-agent actor-critic (RMAAC) algorithm based on an analytical expression of the policy gradient derived in the paper. Our experiments show that the proposed RMAQ algorithm converges to the optimal value function; our RMAAC algorithm outperforms several MARL and robust MARL methods in multiple multi-agent environments when state uncertainty is present. The source code is public on https://github.com/sihongho/robust_marl_with_state_uncertainty.

URL: https://openreview.net/forum?id=CqTkapZ6H9

---

Title: Optimum-statistical Collaboration Towards General and Efficient Black-box Optimization

Authors: Wenjie Li, Chi-Hua Wang, Guang Cheng, Qifan Song

Abstract: In this paper, we make the key delineation on the roles of resolution and statistical uncertainty in hierarchical bandits-based black-box optimization algorithms, guiding a more general analysis and a more efficient algorithm design. We introduce the optimum-statistical
collaboration, an algorithm framework of managing the interaction between optimization error flux and statistical error flux evolving in the optimization process. We provide a general analysis of this framework without specifying the forms of statistical error and uncertainty quantifier. Our framework and its analysis, due to their generality, can be applied to a large family of functions and partitions that satisfy different local smoothness assumptions and have different numbers of local optimums, which is much richer than the class of functions studied in prior works. Our framework also inspires us to propose a better measure of the statistical uncertainty and consequently a variance-adaptive algorithm VHCT. In theory, we prove the algorithm enjoys rate-optimal regret bounds under different local smoothness assumptions; in experiments, we show the algorithm outperforms prior efforts in different settings.

URL: https://openreview.net/forum?id=ClIcmwdlxn

---

Title: An Adaptive Half-Space Projection Method for Stochastic Optimization Problems with Group Sparse Regularization

Authors: Yutong Dai, Tianyi Chen, Guanyi Wang, Daniel Robinson

Abstract: Optimization problems with group sparse regularization are ubiquitous in various popular downstream applications, such as feature selection and compression for Deep Neural Networks (DNNs). Nonetheless, the existing methods in the literature do not perform particularly well when such regularization is used in combination with a stochastic loss function. In particular, it is challenging to design a computationally efficient algorithm with a convergence guarantee and can compute group-sparse solutions. Recently, a half-space stochastic projected gradient (HSPG) method was proposed that partly addressed these challenges. This paper presents a substantially enhanced version of HSPG that we call AdaHSPG+ that makes two noticeable advances. First, AdaHSPG+ is shown to have a stronger convergence result under significantly looser assumptions than those required by HSPG. This improvement in convergence is achieved by integrating variance reduction techniques with a new adaptive strategy for iteratively predicting the support of a solution. Second, AdaHSPG+ requires significantly less parameter tuning compared to HSPG, thus making it more practical and user-friendly. This advance is achieved by designing automatic and adaptive strategies for choosing the type of step employed at each iteration and for updating key hyperparameters. The numerical effectiveness of our proposed AdaHSPG+ algorithm is demonstrated on both convex and non-convex benchmark problems. The source code is available at https://github.com/tianyic/adahspg.

URL: https://openreview.net/forum?id=KBhSyBBeeO

---

Title: Causally-guided Regularization of Graph Attention Improves Generalizability

Authors: Alexander P Wu, Thomas Markovich, Bonnie Berger, Nils Yannick Hammerla, Rohit Singh

Abstract: Graph attention networks estimate the relational importance of node neighbors to aggregate relevant information over local neighborhoods for a prediction task. However, the inferred attentions are vulnerable to spurious correlations and connectivity in the training data, hampering the generalizability of models. We introduce CAR, a general-purpose regularization framework for graph attention networks. Embodying a causal inference approach based on invariance prediction, CAR aligns the attention mechanism with the causal effects of active interventions on graph connectivity in a scalable manner. CAR is compatible with a variety of graph attention architectures, and we show that it systematically improves generalizability on various node classification tasks. Our ablation studies indicate that CAR hones in on the aspects of graph structure most pertinent to the prediction (e.g., homophily), and does so more effectively than alternative approaches. Finally, we also show that \methodname enhances interpretability of attention coefficients by accentuating node-neighbor relations that point to causal hypotheses.

URL: https://openreview.net/forum?id=iDNMZgjJuJ

---

Title: Analyzing Deep PAC-Bayesian Learning with Neural Tangent Kernel: Convergence, Analytic Generalization Bound, and Efficient Hyperparameter Selection

Authors: Wei Huang, Chunrui Liu, Yilan Chen, Richard Yi Da Xu, Miao Zhang, Tsui-Wei Weng

Abstract: PAC-Bayes is a well-established framework for analyzing generalization performance in machine learning models. This framework provides a bound on the expected population error by considering the sum of training error and the divergence between posterior and prior distributions. In addition to being a successful generalization bound analysis tool, the PAC-Bayesian bound can also be incorporated into an objective function for training probabilistic neural networks, which we refer to simply as {\it Deep PAC-Bayesian Learning}. Deep PAC-Bayesian learning has been shown to achieve competitive expected test set error and provide a tight generalization bound in practice at the same time through gradient descent training. Despite its empirical success, theoretical analysis of deep PAC-Bayesian learning for neural networks is rarely explored. To this end, this paper proposes a theoretical convergence and generalization analysis for Deep PAC-Bayesian learning. For a deep and wide probabilistic neural network, our analysis shows that PAC-Bayesian learning corresponds to solving a kernel ridge regression when the probabilistic neural tangent kernel (PNTK) is used as the kernel. We utilize this outcome in conjunction with the PAC-Bayes $\mathcal{C}$-bound, enabling us to derive an analytical and guaranteed PAC-Bayesian generalization bound for the first time. Finally, drawing insight from our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is proven to be time-saving on various benchmarks. Our work not only provides a better understanding of the theoretical underpinnings of Deep PAC-Bayesian learning, but also offers practical tools for improving the training and generalization performance of these models.

URL: https://openreview.net/forum?id=nEX2q5B2RQ

---

Title: High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Authors: Paul Pu Liang, Yiwei Lyu, Xiang Fan, Jeffrey Tsaw, Yudong Liu, Shentong Mo, Dani Yogatama, Louis-Philippe Morency, Russ Salakhutdinov

Abstract: Many real-world problems are inherently multimodal, from the communicative modalities humans use to express social and emotional states such as spoken language, gestures, and paralinguistics to the force, proprioception, and visual sensors ubiquitous on robots. While there has been an explosion of interest in multimodal representation learning, these methods are still largely focused on a small set of modalities, primarily in the language, vision, and audio space. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. Since adding new models for every new modality or task becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? This paper proposes two new information theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar $2$ modalities $\{X_1,X_2\}$ are by measuring how much information can be transferred from $X_1$ to $X_2$, while (2) interaction heterogeneity studies how similarly pairs of modalities $\{X_1,X_2\}, \{X_3,X_4\}$ interact by measuring how much interaction information can be transferred from $\{X_1,X_2\}$ to $\{X_3,X_4\}$. We show the importance of these $2$ proposed metrics in high-modality scenarios as a way to automatically prioritize the fusion of modalities that contain unique information or unique interactions. The result is a single model, HighMMT, that scales up to $10$ modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and $15$ tasks from $5$ different research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning. We release our code and benchmarks, which we hope will present a unified platform for subsequent theoretical and empirical analysis.

URL: https://openreview.net/forum?id=ttzypy3kT7

---

Title: Learning Interpolations between Boltzmann Densities

Authors: Bálint Máté, François Fleuret

Abstract: We introduce a training objective for continuous normalizing flows that can be used in the absence of samples but in the presence of an energy function. Our method relies on either a prescribed or a learnt interpolation $f_t$ of energy functions between the target energy $f_1$ and the energy function of a generalized Gaussian $f_0(x) = ||x/\sigma||_p^p$. The interpolation of energy functions induces an interpolation of Boltzmann densities $p_t \propto e^{-f_t}$ and we aim to find a time-dependent vector field $V_t$ that transports samples along the family $p_t$ of densities. The condition of transporting samples along the family $p_t$ is equivalent to satisfying the continuity equation with $V_t$ and $p_t = Z_t^{-1}e^{-f_t}$. Consequently, we optimize $V_t$ and $f_t$ to satisfy this partial differential equation. We experimentally compare the proposed training objective to the reverse KL-divergence on Gaussian mixtures and on the Boltzmann density of a quantum mechanical particle in a double-well potential.

URL: https://openreview.net/forum?id=TH6YrEcbth

---

Title: Retiring $\Delta \text{DP}$: New Distribution-Level Metrics for Demographic Parity

Authors: Xiaotian Han, Zhimeng Jiang, Hongye Jin, Zirui Liu, Na Zou, Qifan Wang, Xia Hu

Abstract: Demographic parity is the most widely recognized measure of group fairness in machine learning, which ensures equal treatment of different demographic groups. Numerous works aim to achieve demographic parity by pursuing the commonly used metric $\Delta DP$. Unfortunately, in this paper, we reveal that the fairness metric $\Delta DP$ can not precisely measure the violation of demographic parity, because it inherently has the following drawbacks: i) zero-value $\Delta DP$ does not guarantee zero violation of demographic parity, ii) $\Delta DP$ values can vary with different classification thresholds. To this end, we propose two new fairness metrics, Area Between Probability density function Curves (ABPC) and Area Between Cumulative density function Curves (ABCC), to precisely measure the violation of demographic parity at the distribution level. The new fairness metrics directly measure the difference between the distributions of the prediction probability for different demographic groups. Thus our proposed new metrics enjoy: i) zero-value ABCC/ABPC guarantees zero violation of demographic parity; ii) ABCC/ABPC guarantees demographic parity while the classification thresholds are adjusted. We further re-evaluate the existing fair models with our proposed fairness metrics and observe different fairness behaviors of those models under the new metrics.

URL: https://openreview.net/forum?id=LjDFIWWVVa

---

Title: Generating Adversarial Examples with Task Oriented Multi-Objective Optimization

Authors: Anh Tuan Bui, Trung Le, He Zhao, Quan Hung Tran, Paul Montague, Dinh Phung

Abstract: Deep learning models, even the-state-of-the-art ones, are highly vulnerable to adversarial examples. Adversarial training is one of the most efficient methods to improve the model's robustness. The key factor for the success of adversarial training is the capability to generate
qualified and divergent adversarial examples which satisfy some objectives/goals (e.g., finding adversarial examples that maximize the model losses for simultaneously attacking multiple models). Therefore, multi-objective optimization (MOO) is a natural tool for adversarial example generation to achieve multiple objectives/goals simultaneously. However, we observe that a naive application of MOO tends to maximize all objectives/goals equally, without caring if an objective/goal has been achieved yet. This leads to useless effort to further improve the goal-achieved tasks, while putting less focus on the goal-unachieved tasks. In this paper, we propose \emph{Task Oriented MOO} to address this issue, in the context where we can explicitly define the goal achievement for a task. Our principle is to only maintain the goal-achieved tasks, while letting the optimizer spend more effort on improving the goal-unachieved tasks. We conduct comprehensive experiments for our Task Oriented MOO on various adversarial example generation schemes. The experimental results firmly demonstrate the merit of our proposed approach.

URL: https://openreview.net/forum?id=2f81Q622ww

---

Title: Aux-Drop: Handling Haphazard Inputs in Online Learning Using Auxiliary Dropouts

Authors: Rohit Agarwal, Deepak Gupta, Alexander Horsch, Dilip K. Prasad

Abstract: Many real-world applications based on online learning produce streaming data that is haphazard in nature, i.e., contains missing features, features becoming obsolete in time, the appearance of new features at later points in time and a lack of clarity on the total number of input features. These challenges make it hard to build a learnable system for such applications, and almost no work exists in deep learning that addresses this issue. In this paper, we present Aux-Drop, an auxiliary dropout regularization strategy for online learning that handles the haphazard input features in an effective manner. Aux-Drop adapts the conventional dropout regularization scheme for the haphazard input feature space ensuring that the final output is minimally impacted by the chaotic appearance of such features. It helps to prevent the co-adaptation of especially the auxiliary and base features, as well as reduces the strong dependence of the output on any of the auxiliary inputs of the model. This helps in better learning for scenarios where certain features disappear in time or when new features are to be modeled. The efficacy of Aux-Drop has been demonstrated through extensive numerical experiments on SOTA benchmarking datasets that include Italy Power Demand, HIGGS, SUSY and multiple UCI datasets. The code is available at https://github.com/Rohit102497/Aux-Drop.

URL: https://openreview.net/forum?id=R9CgBkeZ6Z

---

New submissions
===============

Title: Label Noise-Robust Learning using a Confidence-Based Sieving Strategy

Abstract: In learning tasks with label noise, improving model robustness against overfitting is a pivotal challenge because the model eventually memorizes labels including the noisy ones. Identifying the samples with noisy labels and preventing the model from learning them is a promising approach to address this challenge. When training with noisy labels, the per-class confidence scores of the model, represented by the class probabilities, can be reliable criteria for assessing whether the input label is the true label or the corrupted one. In this work, we exploit this observation and propose a novel discriminator metric called confidence error and a sieving strategy called CONFES to effectively differentiate between the clean and noisy samples. We provide theoretical guarantees on the probability of error for our proposed metric. Then, we experimentally illustrate the superior performance of our proposed approach compared to recent studies on various settings such as synthetic and real-world label noise. Moreover, we show CONFES can be combined with other state-of-the-art approaches such as Co-teaching and DivideMix to further improve model performance.

URL: https://openreview.net/forum?id=3taIQG4C7H

---

Title: Revisiting Hidden Representations in Transfer Learning for Medical Imaging

Abstract: While a key component to the success of deep learning is the availability of massive amounts of training data, medical image datasets are often limited in diversity and size. Transfer learning has the potential to bridge the gap between related yet different domains. For medical applications, however, it remains unclear whether it is more beneficial to pre-train on natural or medical images. We aim to shed light on this problem by comparing initialization on ImageNet and RadImageNet on seven medical classification tasks. We investigate their learned representations with Canonical Correlation Analysis (CCA) and compare the predictions of the different models. Our results show that, contrary to intuition, ImageNet and RadImageNet converge to distinct intermediate representations, and that these representations are even more dissimilar after fine-tuning. Despite these distinct representations, the predictions of the models remain similar. Our findings challenge the notion that transfer learning is effective due to the reuse of general features in the early layers of a convolutional neural network and show that weight similarity before and after fine-tuning is negatively related to performance gains.

URL: https://openreview.net/forum?id=ScrEUZLxPr

---

Title: Some Remarks on Identifiability of Independent Component Analysis in Restricted Function Classes

Abstract: In this short note, we comment on recent results on identifiability of independent component analysis.
We point out an error in earlier works and clarify that this error cannot be fixed as the chosen approach is not sufficiently
powerful to prove identifiability results. In addition, we explain the necessary ingredients to prove stronger identifiability results.
Finally, we discuss and extend the flow-based technique to construct spurious solutions for independent component analysis problems
and provide a counterexample to an earlier identifiability result.

URL: https://openreview.net/forum?id=REtKapdkyI

---

Title: Learning State Reachability as a Graph in Translation Invariant Goal-based Reinforcement Learning Tasks

Abstract: Deep Reinforcement Learning proved efficient at learning universal control policies when the goal state is close enough to the starting state, or when the value function features few discontinuities. But reaching goals that require long action sequences in complex environments remains difficult. Drawing inspiration from the cognitive process which reuses learned atomic skills in a global planning procedure, we propose an algorithm which encodes reachability between abstract goals as a graph, and produces plans in this goal space. Transitions between goals rely on the exploitation of a learned policy which enjoys a property we call translation invariant local optimality, which encodes the intuition that goal-reaching skills can be reused throughout the state space. Overall, our contribution permits solving large and difficult navigation tasks, outperforming related methods from the literature.

URL: https://openreview.net/forum?id=p2E7P2Rsfv

---

Title: Error Controlled Feature Selection for Ultrahigh Dimensional and Highly Correlated Feature Space Using Deep Learning

Abstract: Deep learning has been at the center of analytics in recent years due to its impressive empirical success in analyzing complex data objects. Despite this success, most existing tools behave like black-box machines, thus the increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Feature-selected deep learning has emerged as a promising tool in this realm. However, the recent developments do not accommodate ultra-high dimensional and highly correlated features or high noise levels. In this article, we propose a novel screening and cleaning method with the aid of deep learning for a data-adaptive multi-resolutional discovery of highly correlated predictors with a controlled error rate. Extensive empirical evaluations over a wide range of simulated scenarios and several real datasets demonstrate the effectiveness of the proposed method in achieving high power while keeping the false discovery rate at a minimum.

URL: https://openreview.net/forum?id=FTWAjl2kG0

---

Title: Deep Operator Learning Lessens the Curse of Dimensionality for PDEs

Abstract: Deep neural networks (DNNs) have achieved remarkable success in numerous domains, and their application to PDE-related problems has been rapidly advancing. This paper provides an estimate for the generalization error of learning Lipschitz operators over Banach spaces using DNNs with applications to various PDE solution operators. The goal is to specify DNN width, depth, and the number of training samples needed to guarantee a certain testing error. Under mild assumptions on data distributions or operator structures, our analysis shows that deep operator learning can have a relaxed dependence on the discretization resolution of PDEs and, hence, lessen the curse of dimensionality in many PDE-related problems including elliptic equations, parabolic equations, and Burgers equations. Our results are also applied to give insights about discretization-invariant in operator learning.

URL: https://openreview.net/forum?id=zmBFzuT2DN

---

Title: Long-Tailed Visual Recognition with Global Contrastive Learning and Prototype Learning

Abstract: We consider the visual recognition problem in long-tailed data in which few classes dominate the majority of the other classes. Most current methods employ contrastive learning to learn a representation for long-tailed data. In this paper, first, we investigate $k$-positive sampling, a popular baseline method widely used to build contrastive learning models for imbalanced data. Previous works show that $k$-positive learning, which only chooses $k$ positive samples (instead of all positive images) for each query image, suffers from inferior performance in long-tailed data. In this work, we further point out that k-positive learning limits the learning capability of both head and tail classes. Based on this perspective, we propose a novel contrastive learning framework namely GloCo which improves the limitation in k-positive learning by enlarging its positive selection space, so it can help the model learn more semantic discrimination features. Second, we analyze how the temperature (the hyperparameter used for tuning a concentration of samples on feature space) affects the gradients of each class in long-tailed learning, and propose a new method that can mitigate inadequate gradients between classes, which can help model learning easier. Finally, we go on to introduce a new prototype learning framework namely ProCo based on coreset selection, which can help us create a global prototype for each cluster while keeping the computation cost within a reasonable time and show that combining GloCo with ProCo can further enhance the model learning ability on long-tailed data.

URL: https://openreview.net/forum?id=xWrtiJwJj5

---

Title: Long-term Forecasting with TiDE: Time-series Dense Encoder

Abstract: Recent work has shown that simple linear models can outperform several Transformer based approaches in long term time-series forecasting. Motivated by this, we propose a Multi-layer Perceptron (MLP) based encoder-decoder model, \underline{Ti}me-series \underline{D}ense \underline{E}ncoder (TiDE), for long-term time-series forecasting that enjoys the simplicity and speed of linear models while also being able to handle covariates and non-linear dependencies. Theoretically, we prove that the simplest linear analogue of our model can achieve near optimal error rate for linear dynamical systems (LDS) under some assumptions. Empirically, we show that our method can match or outperform prior approaches on popular long-term time-series forecasting benchmarks while being 5-10x faster than the best Transformer based model.

URL: https://openreview.net/forum?id=pCbC3aQB5W

---

Title: Worst-case Feature Risk Minimization for Data-Efficient Learning

Abstract: Deep learning models typically require massive amounts of annotated data to train a strong model for a task of interest. However, data annotation is time-consuming and costly. How to use labeled data from a related but distinct domain, or just a few samples to train a satisfactory model are thus important questions. To achieve this goal, models should resist overfitting to the specifics of the training data in order to generalise well to new data. This paper proposes a novel Worst-case Feature Risk Minimization (WFRM) method that helps improve model generalization. Specifically, we tackle a minimax optimization problem in feature space at each training iteration. Given the input features, we seek the feature perturbation that maximizes the current training loss and then minimizes the training loss of the worst-case features. By incorporating our WFRM during training, we significantly improve model generalization under distributional shift – Domain Generalization (DG) and in the low-data regime – Few-shot Learning (FSL). We theoretically analyse WFRM and find the key reason why it works better than ERM – it induces an empirical risk-based semi-adaptive $L_{2}$ regularization of the classifier weights, enabling a better risk-complexity trade-off. We evaluate WFRM on two data-efficient learning tasks, including three standard DG benchmarks of PACS, VLCS, OfficeHome and the most challenging FSL benchmark Meta-Dataset. Despite the simplicity, our method consistently improves various DG and FSL methods, leading to the new state-of-the-art performances in all settings. Codes & models will be released on github.

URL: https://openreview.net/forum?id=czev0exHXT

---

Title: Asymptotic Analysis of Conditioned Stochastic Gradient Descent

Abstract: In this paper, we investigate a general class of stochastic gradient descent (SGD) algorithms, called $\textit{conditioned}$ SGD, based on a preconditioning of the gradient direction. Using a discrete-time approach with martingale tools, we establish under mild assumptions the weak convergence of the rescaled sequence of iterates for a broad class of conditioning matrices including stochastic first-order and second-order methods. Almost sure convergence results, which may be of independent interest, are also presented. Interestingly, the asymptotic normality result consists in a stochastic equicontinuity property so when the conditioning matrix is an estimate of the inverse Hessian, the algorithm is asymptotically optimal.

URL: https://openreview.net/forum?id=U4XgzRjfF1

---

Title: Adaptive Decision-Making for Optimization of Safety-Critical Systems: The ARTEO Algorithm

Abstract: Real-time decision-making in uncertain environments with safety constraints is a common problem in many business and industrial applications. In these problems, it is often the case that a general structure of the problem and some of the underlying relationships among the decision variables are known and other relationships are unknown but measurable subject to a certain level of noise. In this work, we develop the ARTEO algorithm by formulating such real-time decision-making problems as constrained mathematical programming problems, where we combine known structures involved in the objective function and constraint formulations with learned Gaussian process (GP) regression models. We then utilize the uncertainty estimates of the GPs to (i) enforce the resulting safety constraints within a confidence interval and (ii) make the cumulative uncertainty expressed in the decision variable space a part of the objective function to drive exploration for further learning – subject to the safety constraints. We demonstrate the safety and efficiency of our approach with two case studies: optimization of electric motor current and real-time bidding problems. We further evaluate the performance of ARTEO compared to other methods that rely entirely on GP-based safe exploration and optimization. The results indicate that ARTEO benefits from the incorporation of prior knowledge to the optimization problems and leads to lower cumulative regret while ensuring the satisfaction of the safety constraints.

URL: https://openreview.net/forum?id=MuKcAr8b19

---

Title: Exploring validation metrics for offline model-based optimisation

Abstract: In offline model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of desirability through an expensive but real-world scoring process. Offline MBO tries to approximate this expensive scoring function and use that to evaluate generated designs, however evaluation is non-exact because one approximation is being evaluated with another. Instead, we ask ourselves: if we did have the real world scoring function at hand, what cheap-to-compute validation metrics would correlate best with this? Since the real-world scoring function is available for simulated MBO datasets, insights obtained from this can be transferred over to real-world offline MBO tasks where the real-world scoring function is expensive to compute. To address this, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and demonstrate this on two conditional variants of denoising diffusion models. Empirically, we find that two of the proposed validation metrics correlate very well with the ground truth. Furthermore, an additional analysis reveals that controlling the trade-off between sample quality and diversity (via classifier guidance) is extremely crucial to generating high scoring samples.

URL: https://openreview.net/forum?id=wC4ZID0H9a

---

Title: Gated Domain Units for Multi-source Domain Generalization

Abstract: The phenomenon of distribution shift (DS) occurs when a dataset at test time differs from the dataset at training time, which can significantly impair the performance of a machine learning model in practical settings due to a lack of knowledge about the data's distribution at test time. To address this problem, we postulate that real-world distributions are composed of latent Invariant Elementary Distributions (I.E.D) across different domains. This assumption implies an invariant structure in the solution space that enables knowledge transfer to unseen domains. To exploit this property for domain generalization, we introduce a modular neural network layer consisting of Gated Domain Units (GDUs) that learn a representation for each latent elementary distribution. During inference, a weighted ensemble of learning machines can be created by comparing new observations with the representations of each elementary distribution. Our flexible framework also accommodates scenarios where explicit domain information is not present. Extensive experiments on image, text, and graph data show consistent performance improvement on out-of-training target domains. These findings support the practicality of the I.E.D assumption and the effectiveness of GDUs for domain generalisation.

URL: https://openreview.net/forum?id=V7BvYJyTmM

---

Title: Dynamic Regret Analysis of Safe Distributed Online Optimization for Convex and Non-convex Problems

Abstract: This paper addresses safe distributed online optimization over an unknown set of linear safety constraints. A network of agents aims at jointly minimizing a \emph{global}, \emph{time-varying} function, which is only partially observable to each individual agent. Therefore, agents must engage in {\it local} communications to generate a {\it safe} sequence of actions competitive with the best minimizer sequence in hindsight, and the gap between the two sequences is quantified via dynamic regret. We propose distributed safe online gradient descent (D-Safe-OGD) with an exploration phase, where all agents estimate the constraint parameters collaboratively to build estimated feasible sets, ensuring the action selection safety during the optimization phase. We prove that for convex functions, D-Safe-OGD achieves a dynamic regret bound of $O(T^{2/3}{\color{black} \sqrt{\log T}} + T^{1/3}C_T^*)$, where $C_T^*$ denotes the path-length of the best minimizer sequence. We further prove a dynamic regret bound of $O(T^{2/3}{\color{black} \sqrt{\log T}} + T^{2/3}C_T^*)$ for certain non-convex problems, which establishes the first dynamic regret bound for a {\it safe distributed} algorithm in the {\it non-convex} setting.

URL: https://openreview.net/forum?id=xiQXHvL1eN

---

Title: Foiling Explanations in Deep Neural Networks

Abstract: Deep neural networks (DNNs) have greatly impacted numerous fields over the past decade. Yet despite exhibiting superb performance over many problems, their black-box nature still poses a significant challenge with respect to explainability. Indeed, explainable artificial intelligence (XAI) is crucial in several fields, wherein the answer alone---sans a reasoning of how said answer was derived---is of little value. This paper uncovers a troubling property of explanation methods for image-based DNNs: by making small visual changes to the input image---hardly influencing the network's output---we demonstrate how explanations may be arbitrarily manipulated through the use of evolution strategies. Our novel algorithm, AttaXAI, a model-and-data XAI-agnostic, adversarial attack on XAI algorithms, only requires access to the output logits of a classifier and to the explanation map; these weak assumptions render our approach highly useful where real-world models and data are concerned. We compare our method's performance on two benchmark datasets---CIFAR100 and ImageNet---using four different pretrained deep-learning models: VGG16-CIFAR100, VGG16-ImageNet, MobileNet-CIFAR100, and Inception-v3-ImageNet. We find that the XAI methods can be manipulated without the use of gradients or other model internals. Our novel algorithm is successfully able to manipulate an image in a manner imperceptible to the human eye, such that the XAI method outputs a specific explanation map. To our knowledge, this is the first such method in a black-box setting, and we believe it has significant value where explainability is desired, required, or legally mandatory.

URL: https://openreview.net/forum?id=wvLQMHtyLk

---

Title: Policy Gradient Algorithms Implicitly Optimize by Continuation

Abstract: Direct policy optimization in reinforcement learning is usually solved with policy-gradient algorithms, which optimize policy parameters via stochastic gradient ascent. This paper provides a new theoretical interpretation and justification of these algorithms. First, we formulate direct policy optimization in the optimization by continuation framework. The latter is a framework for optimizing nonconvex functions where a sequence of surrogate objective functions, called continuations, are locally optimized. Second, we show that optimizing affine Gaussian policies and performing entropy regularization can be interpreted as implicitly optimizing deterministic policies by continuation. Based on these theoretical results, we argue that exploration in policy-gradient algorithms consists in computing a continuation of the return of the policy at hand, and that the variance of policies should be history-dependent functions adapted to avoid local extrema rather than to maximize the return of the policy.

URL: https://openreview.net/forum?id=3Ba6Hd3nZt

---

Title: V1T: large-scale mouse V1 response prediction using a Vision Transformer

Abstract: Accurate predictive models of the visual cortex neural response to natural visual stimuli remain a challenge in computational neuroscience. In this work, we introduce $V{\small 1}T$, a novel Vision Transformer based architecture that learns a shared visual and behavioral representation across animals. We evaluate our model on two large datasets recorded from mouse primary visual cortex and outperform previous convolution-based models by more than 12.7% in prediction performance. Moreover, we show that the self-attention weights learned by the Transformer correlate with the population receptive fields. Our model thus sets a new benchmark for neural response prediction and can be used jointly with behavioral and neural recordings to reveal meaningful characteristic features of the visual cortex.

URL: https://openreview.net/forum?id=qHZs2p4ZD4

---

Title: Influence Learning in Complex Systems

Abstract: High sample complexity hampers successful applications of reinforcement learning methods especially in real-world scenarios whose complex dynamics are typically computationally demanding to simulate. One idea is to decompose a large factored problem into small local subproblems including only few state variables and model the influence that the external portion of the system exerts on each of them. This principled approach allows to convert the global simulator of the entire environment into local lightweight simulators, thus enabling faster simulations, planning and solutions. The ability to represent accurately the influence experienced by each local component is crucial for the effectiveness of this method. In this work, we examine different aspects of the problem of learning approximations of the influence in realistic domains. We empirically investigate several learning methods to conclude that even for large and complex systems, in practice, the influence problem often turns into a relatively manageable learning task. Finally, we discuss how to leverage effectively the influence models for long horizon tasks for planning or reinforcement learning problems. Our results show that in many cases short horizon trajectories collected from the global simulator can be used to obtain accurate approximations of the influence for much longer horizons.

URL: https://openreview.net/forum?id=Jn3hvY5FZv

---

Title: Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation

Abstract: In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the sources content/position are estimated using the variational expectation-maximization algorithm, leading to multi-source trajectories estimation. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method works well on these two tasks, and outperforms several baseline methods.

URL: https://openreview.net/forum?id=sbkZKBVC31

---

Title: The Multiquadric Kernel for Moment-Matching Distributional Reinforcement Learning

Abstract: Distributional reinforcement learning has gained significant attention in recent years due to its ability to handle uncertainty and variability in the returns an agent can expect to receive for each action it takes. A key challenge in distributional reinforcement learning is finding a measure of the difference between two distributions that is well-suited for use with the distributional Bellman operator, a function that takes in a value distribution and produces a modified distribution based on the agent's current state and action. In this paper, we address this challenge by introducing the multiquadric kernel to moment-matching distributional reinforcement learning. We show that this kernel is both theoretically sound and empirically effective. Our contribution is mainly of a theoretical nature, presenting the first formally sound kernel for moment-matching distributional reinforcement learning with good practical performance. We also provide insights into why the RBF kernel has been shown to provide good practical results despite its theoretical problems. Finally, we evaluate the performance of our kernel on a number of standard tasks, obtaining results comparable to the state-of-the-art.

URL: https://openreview.net/forum?id=z49eaB8kiH

---

Title: One-Step Distributional Reinforcement Learning

Abstract: Reinforcement learning (RL) allows an agent interacting sequentially with an environment to maximize its long-term expected return. In the distributional RL (DistrRL) paradigm, the agent goes beyond the limit of the expected value, to capture the underlying probability distribution of the return across all time steps. The set of DistrRL algorithms has led to improved empirical performance. Nevertheless, the theory of DistrRL is still not fully understood, especially in the control case. In this paper, we present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework encompassing only the randomness induced by the one-step dynamics of the environment. Contrary to DistrRL, we show that our approach comes with a unified theory for both policy evaluation and control. Indeed, we propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis. The proposed approach compares favorably with categorical DistrRL on various environments.

URL: https://openreview.net/forum?id=ZPMf53vE1L

---

Reply all

Reply to author

Forward

0 new messages