Weekly TMLR digest for Sep 07, 2023

Sep 7, 2023, 2:19:45 PM
to tmlr-annou...@googlegroups.com

New certifications

Survey Certification: A Survey on Causal Discovery Methods for I.I.D. and Time Series Data

Uzma Hasan, Emam Hossain, Md Osman Gani



Accepted papers

Title: Multi-annotator Deep Learning: A Probabilistic Framework for Classification

Authors: Marek Herde, Denis Huseljic, Bernhard Sick

Abstract: Solving complex classification tasks using deep neural networks typically requires large amounts of annotated data. However, corresponding class labels are noisy when provided by error-prone annotators, e.g., crowdworkers. Training standard deep neural networks leads to subpar performances in such multi-annotator supervised learning settings. We address this issue by presenting a probabilistic training framework named multi-annotator deep learning (MaDL). A downstream ground truth and an annotator performance model are jointly trained in an end-to-end learning approach. The ground truth model learns to predict instances' true class labels, while the annotator performance model infers probabilistic estimates of annotators' performances. A modular network architecture enables us to make varying assumptions regarding annotators' performances, e.g., an optional class or instance dependency. Further, we learn annotator embeddings to estimate annotators' densities within a latent space as proxies of their potentially correlated annotations. Together with a weighted loss function, we improve the learning from correlated annotation patterns. In a comprehensive evaluation, we examine three research questions about multi-annotator supervised learning. Our findings show MaDL's state-of-the-art performance and robustness against many correlated, spamming annotators.

URL: https://openreview.net/forum?id=MgdoxzImlK


Title: Learning to Optimize Quasi-Newton Methods

Authors: Isaac Liao, Rumen Dangovski, Jakob Nicolaus Foerster, Marin Soljacic

Abstract: Fast gradient-based optimization algorithms have become increasingly essential for the computationally efficient training of machine learning models. One technique is to multiply the gradient by a preconditioner matrix to produce a step, but it is unclear what the best preconditioner matrix is. This paper introduces a novel machine learning optimizer called LODO, which tries to online meta-learn the best preconditioner during optimization. Specifically, our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn preconditioners parameterized as neural networks; they are more flexible than preconditioners in other quasi-Newton methods. Unlike other L2O methods, LODO does not require any meta-training on a training task distribution, and instead learns to optimize on the fly while optimizing on the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify that our algorithm can optimize in noisy settings, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters at speeds comparable to those of standard neural network optimizers.

URL: https://openreview.net/forum?id=Ns2X7Azudy
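
The preconditioning idea mentioned in the abstract above (multiply the gradient by a matrix to produce a step) can be illustrated with a minimal sketch. This is not LODO itself: the preconditioner here is a fixed, hand-picked matrix rather than a learned neural network, and the toy quadratic, the names `matvec` and `preconditioned_step`, and the step sizes are assumptions made for illustration only.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a dense matrix by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def preconditioned_step(x: Vector, grad: Vector, precond: Matrix, lr: float) -> Vector:
    """One update x <- x - lr * P @ grad, where P is the preconditioner."""
    direction = matvec(precond, grad)
    return [x_i - lr * d_i for x_i, d_i in zip(x, direction)]


# Toy quadratic f(x) = 0.5 * x^T A x with an ill-conditioned diagonal Hessian A.
A: Matrix = [[1.0, 0.0], [0.0, 10.0]]
A_INV: Matrix = [[1.0, 0.0], [0.0, 0.1]]  # exact inverse Hessian (known only in this toy case)
IDENTITY: Matrix = [[1.0, 0.0], [0.0, 1.0]]

x0: Vector = [1.0, 1.0]
g0 = matvec(A, x0)  # the gradient of f at x0 is A @ x0

# With the inverse Hessian as preconditioner, one step reaches the minimum at the origin.
newton_like = preconditioned_step(x0, g0, A_INV, lr=1.0)
# With the identity (plain gradient descent), the step size is limited by the stiff coordinate.
plain_gd = preconditioned_step(x0, g0, IDENTITY, lr=0.1)
```

The contrast between `newton_like` and `plain_gd` shows why the choice of preconditioner matters: on this quadratic, the inverse Hessian rescales both coordinates so that a single step suffices, while the identity leaves the slow coordinate far from the minimum.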


Title: Task Weighting in Meta-learning with Trajectory Optimisation

Authors: Cuong C. Nguyen, Thanh-Toan Do, Gustavo Carneiro

Abstract: Developing meta-learning algorithms that are unbiased toward a subset of training tasks often requires hand-designed criteria to weight tasks, potentially resulting in sub-optimal solutions. In this paper, we introduce a new principled and fully-automated task-weighting algorithm for meta-learning methods. By considering the weights of tasks within the same mini-batch as an action, and the meta-parameter of interest as the system state, we cast the task-weighting meta-learning problem as a trajectory optimisation problem and employ the iterative linear quadratic regulator to determine the optimal action or weights of tasks. We theoretically show that the proposed algorithm converges to an $\epsilon_{0}$-stationary point, and empirically demonstrate that the proposed approach outperforms common hand-engineered weighting methods on two few-shot learning benchmarks.

URL: https://openreview.net/forum?id=SSkTBUyJip


Title: Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize

Authors: Mert Gurbuzbalaban, Yuanhan Hu, Umut Simsekli, Lingjiong Zhu

Abstract: Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contains i.i.d. random stepsize, cyclic stepsize, and the constant stepsize as special cases. Motivated by the literature showing that the heaviness of the tails (measured by the so-called "tail-index") in the SGD iterates is correlated with generalization, we study the tail-index and provide a number of theoretical results that demonstrate how the tail-index varies with the stepsize schedule. Our results bring a new understanding of the benefits of cyclic and randomized stepsizes compared to constant stepsize in terms of the tail behavior. We illustrate our theory on linear regression experiments and show through deep learning experiments that Markovian stepsizes can achieve an even heavier tail and be a viable alternative to cyclic and i.i.d. randomized stepsize rules.

URL: https://openreview.net/forum?id=lNB5EHx8uC


Title: A probabilistic Taylor expansion with Gaussian processes

Authors: Toni Karvonen, Jon Cockayne, Filip Tronarp, Simo Särkkä

Abstract: We study a class of Gaussian processes for which the posterior mean, for a particular choice of data, replicates a truncated Taylor expansion of any order. The data consist of derivative evaluations at the expansion point and the prior covariance kernel belongs to the class of Taylor kernels, which can be written in a certain power series form. We discuss and prove some results on maximum likelihood estimation of parameters of Taylor kernels. The proposed framework is a special case of Gaussian process regression based on data that is orthogonal in the reproducing kernel Hilbert space of the covariance kernel.

URL: https://openreview.net/forum?id=2TneniEIDB


Title: Bridging the Gap Between Target Networks and Functional Regularization

Authors: Alexandre Piché, Valentin Thomas, Joseph Marino, Rafael Pardinas, Gian Maria Marconi, Christopher Pal, Mohammad Emtiyaz Khan

Abstract: Bootstrapping is behind much of the success of deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still misunderstood. In this work, we show that they act as an implicit regularizer which can be beneficial in some cases, but which also has disadvantages: it is inflexible and can result in instabilities, even when vanilla TD(0) converges. To overcome these issues, we propose an explicit Functional Regularization alternative that is flexible and a convex regularizer in function space, and we theoretically study its convergence. We conducted an experimental study across a range of environments, discount factors, and data collections of varying off-policiness to investigate the effectiveness of the regularization induced by Target Networks and Functional Regularization in terms of performance, accuracy, and stability. Our findings emphasize that Functional Regularization can be used as a drop-in replacement for Target Networks and results in performance improvement. Furthermore, adjusting both the regularization weight and the network update period in Functional Regularization can result in further performance improvements compared to solely adjusting the network update period as typically done with Target Networks. Our approach also enhances the ability of networks to recover accurate $Q$-values.

URL: https://openreview.net/forum?id=BFvoemrmqX


Title: HERMES: Hybrid Error-corrector Model with inclusion of External Signals for nonstationary fashion time series

Authors: Etienne David, Jean Bellot, Sylvain Le Corff

Abstract: Developing models and algorithms to predict nonstationary time series is a long-standing statistical problem. It is crucial for many applications, in particular for fashion or retail industries, to make optimal inventory decisions and avoid massive waste. By tracking thousands of fashion trends on social media with state-of-the-art computer vision approaches, we propose a new model for fashion time series forecasting. Our contribution is twofold. We first publicly provide a dataset gathering 10000 weekly fashion time series. As influence dynamics are the key to emerging trend detection, we associate with each time series an external weak signal representing behaviours of influencers. Secondly, to leverage such a dataset, we propose a new hybrid forecasting model. Our approach combines per-time-series parametric models with seasonal components and a global recurrent neural network to include sporadic external signals. This hybrid model provides state-of-the-art results on the proposed fashion dataset and on the weekly time series of the M4 competition, and illustrates the benefit of the contribution of external weak signals.

URL: https://openreview.net/forum?id=4ofFo7D5GL


Title: Detecting incidental correlation in multimodal learning via latent variable modeling

Authors: Taro Makino, Yixin Wang, Krzysztof J. Geras, Kyunghyun Cho

Abstract: Multimodal neural networks often fail to utilize all modalities. They subsequently generalize worse than their unimodal counterparts, or make predictions that only depend on a subset of modalities. We refer to this problem as \emph{modality underutilization}. Existing work has addressed this issue by ensuring that there are no systematic biases in dataset creation, or that our neural network architectures and optimization algorithms are capable of learning modality interactions. We demonstrate that even when these favorable conditions are met, modality underutilization can still occur in the small data regime. To explain this phenomenon, we put forth a concept that we call \emph{incidental correlation}. It is a spurious correlation that emerges in small datasets, despite not being a part of the underlying data generating process (DGP). We develop our argument using a DGP under which multimodal neural networks must utilize all modalities, since all paths between the inputs and target are causal. This represents an idealized scenario that often fails to materialize. Instead, due to incidental correlation, small datasets sampled from this DGP have higher likelihood under an alternative DGP with spurious paths between the inputs and target. Multimodal neural networks that use these spurious paths for prediction fail to utilize all modalities. Given its harmful effects, we propose to detect incidental correlation via latent variable modeling. We specify an identifiable variational autoencoder such that the latent posterior encodes the spurious correlations between the inputs and target. This allows us to interpret the Kullback-Leibler divergence between the latent posterior and prior as the severity of incidental correlation. We use an ablation study to show that identifiability is important in this context, since we derive our conclusions from the latent posterior. Using experiments with synthetic data, as well as with VQA v2.0 and NLVR2, we demonstrate that incidental correlation emerges in the small data regime, and leads to modality underutilization. Practitioners of multimodal learning can use our method to detect whether incidental correlation is present in their datasets, and determine whether they should collect additional data.

URL: https://openreview.net/forum?id=QoRo9QmOAr


Title: Fast Kernel Methods for Generic Lipschitz Losses via $p$-Sparsified Sketches

Authors: Tamim El Ahmad, Pierre Laforgue, Florence d'Alché-Buc

Abstract: Kernel methods are learning algorithms that enjoy solid theoretical foundations while suffering from important computational limitations. Sketching, which consists in looking for solutions within a subspace of reduced dimension, is a well-studied approach to alleviate these computational burdens. However, statistically-accurate sketches, such as the Gaussian one, usually contain few null entries, such that their application to kernel methods and their non-sparse Gram matrices remains slow in practice. In this paper, we show that sparsified Gaussian (and Rademacher) sketches still produce theoretically-valid approximations while allowing for important time and space savings thanks to an efficient \emph{decomposition trick}. To support our method, we derive excess risk bounds for both single and multiple output kernel problems, with generic Lipschitz losses, thereby providing new guarantees for a wide range of applications, from robust regression to multiple quantile regression. Our theoretical results are complemented with experiments showing the empirical superiority of our approach over state-of-the-art sketching methods.

URL: https://openreview.net/forum?id=ry2qgRqTOw


Title: Single-Pass Contrastive Learning Can Work for Both Homophilic and Heterophilic Graph

Authors: Haonan Wang, Jieyu Zhang, Qi Zhu, Wei Huang, Kenji Kawaguchi, Xiaokui Xiao

Abstract: Existing graph contrastive learning (GCL) techniques typically require two forward passes for a single instance to construct the contrastive loss, which is effective for capturing the low-frequency signals of node features. Such a dual-pass design has shown empirical success on homophilic graphs, but its effectiveness on heterophilic graphs, where directly connected nodes typically have different labels, is unknown. In addition, existing GCL approaches fail to provide strong performance guarantees. Coupled with the unpredictability of GCL approaches on heterophilic graphs, their applicability in real-world contexts is limited. Then, a natural question arises: Can we design a GCL method that works for both homophilic and heterophilic graphs with a performance guarantee? To answer this question, we theoretically study the concentration property of features obtained by neighborhood aggregation on homophilic and heterophilic graphs, introduce the single-pass graph contrastive learning loss based on the property, and provide performance guarantees for the minimizer of the loss on downstream tasks. As a direct consequence of our analysis, we implement the Single-Pass Graph Contrastive Learning method (SP-GCL). Empirically, on 14 benchmark datasets with varying degrees of homophily, the features learned by the SP-GCL can match or outperform existing strong baselines with significantly less computational overhead, which demonstrates the usefulness of our findings in real-world cases.

URL: https://openreview.net/forum?id=244KePn09i


Title: Variational Elliptical Processes

Authors: Maria Margareta Bånkestad, Jens Sjölund, Jalil Taghia, Thomas B. Schön

Abstract: We present elliptical processes—a family of non-parametric probabilistic models that subsumes Gaussian processes and Student's t processes. This generalization includes a range of new heavy-tailed behaviors while retaining computational tractability. Elliptical processes are based on a representation of elliptical distributions as a continuous mixture of Gaussian distributions. We parameterize this mixture distribution as a spline normalizing flow, which we train using variational inference. The proposed form of the variational posterior enables a sparse variational elliptical process applicable to large-scale problems. We highlight advantages compared to Gaussian processes through regression and classification experiments. Elliptical processes can supersede Gaussian processes in several settings, including cases where the likelihood is non-Gaussian or when accurate tail modeling is essential.

URL: https://openreview.net/forum?id=djN3TaqbdA
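
The abstract above describes elliptical distributions as continuous mixtures of Gaussians. The best-known special case, the Student's t distribution, can be sampled that way, and the following minimal sketch shows it. It is not the paper's method (which parameterizes the mixing distribution with a spline normalizing flow); the function name, the seed, and the choice of `nu = 10` are assumptions for illustration.

```python
import math
import random


def student_t_via_gaussian_mixture(rng: random.Random, nu: float) -> float:
    """Draw one Student's t sample (nu degrees of freedom) as a Gaussian scale mixture."""
    # Mixing variable: xi = nu / chi2_nu, where chi2_nu ~ Gamma(shape=nu/2, scale=2).
    chi2 = rng.gammavariate(nu / 2.0, 2.0)
    xi = nu / chi2
    # Conditional on xi, the sample is a zero-mean Gaussian with variance xi.
    return rng.gauss(0.0, math.sqrt(xi))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
nu = 10.0
samples = [student_t_via_gaussian_mixture(rng, nu) for _ in range(20000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
# For nu > 2 the variance of a Student's t variable is nu / (nu - 2), i.e. 1.25 here,
# so `variance` should land close to that value.
```

Swapping the distribution of the mixing variable `xi` changes the tail behavior while each conditional draw stays Gaussian, which is the property the abstract relies on for tractability.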


Title: Mitigating Confirmation Bias in Semi-supervised Learning via Efficient Bayesian Model Averaging

Authors: Charlotte Loh, Rumen Dangovski, Shivchander Sudalairaj, Seungwook Han, Ligong Han, Leonid Karlinsky, Marin Soljacic, Akash Srivastava

Abstract: State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been highly successful in leveraging a mix of labeled and unlabeled data, often via self-training or pseudo-labeling. During pseudo-labeling, the model's predictions on unlabeled data are used for training and may result in confirmation bias where the model reinforces its own mistakes. In this work, we show that SOTA SSL methods often suffer from confirmation bias and demonstrate that this is often a result of using a poorly calibrated classifier for pseudo-labeling. We introduce BaM-SSL, an efficient Bayesian Model averaging technique that improves uncertainty quantification in SSL methods with limited computational or memory overhead. We demonstrate that BaM-SSL mitigates confirmation bias in SOTA SSL methods across the standard vision benchmarks CIFAR-10 and CIFAR-100, giving up to 16% improvement in test accuracy on the CIFAR-100 with 400 labels benchmark. Furthermore, we also demonstrate its effectiveness in additional realistic and challenging problems, such as class-imbalanced datasets and in photonics science.

URL: https://openreview.net/forum?id=PRrKOaDQtQ


Title: A Survey on Causal Discovery Methods for I.I.D. and Time Series Data

Authors: Uzma Hasan, Emam Hossain, Md Osman Gani

Abstract: The ability to understand causality from data is one of the major milestones of human-level intelligence. Causal Discovery (CD) algorithms can identify the cause-effect relationships among the variables of a system from related observational data with certain assumptions. Over the years, several methods have been developed primarily based on the statistical properties of data to uncover the underlying causal mechanism. In this study, we present an extensive discussion on the methods designed to perform causal discovery from both independent and identically distributed (I.I.D.) data and time series data. For this purpose, we first introduce the common terminologies used in causal discovery literature and then provide a comprehensive discussion of the algorithms designed to identify causal relations in different settings. We further discuss some of the benchmark datasets available for evaluating the algorithmic performance, off-the-shelf tools or software packages to perform causal discovery readily, and the common metrics used to evaluate these methods. We also evaluate some widely used causal discovery algorithms on multiple benchmark datasets and compare their performances. Finally, we conclude by discussing the research challenges and the applications of causal discovery algorithms in multiple areas of interest.

URL: https://openreview.net/forum?id=YdMrdhGx9y


Title: Individual Privacy Accounting for Differentially Private Stochastic Gradient Descent

Authors: Da Yu, Gautam Kamath, Janardhan Kulkarni, Tie-Yan Liu, Jian Yin, Huishuai Zhang

Abstract: Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose \emph{output-specific} $(\varepsilon,\delta)$-DP to characterize privacy guarantees for individual examples when releasing models trained by DP-SGD. We also design an efficient algorithm to investigate individual privacy across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bound. We further discover that the training loss and the privacy parameter of an example are well-correlated. This implies groups that are underserved in terms of model utility simultaneously experience weaker privacy guarantees. For example, on CIFAR-10, the average $\varepsilon$ of the class with the lowest test accuracy is 44.2\% higher than that of the class with the highest accuracy.

URL: https://openreview.net/forum?id=l4Jcxs0fpC
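
For readers unfamiliar with DP-SGD, the core gradient computation (clip each example's gradient, sum, add Gaussian noise, average) can be sketched as follows. This shows only the standard DP-SGD step, not the individual privacy accounting proposed in the paper above; the function names, the toy gradients, and the parameter values are assumptions for illustration.

```python
import math
import random
from typing import List

Vector = List[float]


def clip_to_norm(grad: Vector, clip_norm: float) -> Vector:
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0.0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(
    per_example_grads: List[Vector],
    clip_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> Vector:
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_to_norm(g, clip_norm) for g in per_example_grads]
    dim = len(per_example_grads[0])
    summed = [sum(g[j] for g in clipped) for j in range(dim)]
    sigma = noise_multiplier * clip_norm  # noise std scales with the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


# Toy batch: the first gradient has norm 5 and gets clipped, the second (norm 0.5) does not.
toy_grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dp_sgd_gradient(toy_grads, clip_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
private = dp_sgd_gradient(toy_grads, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

Setting `noise_multiplier=0.0` removes the privacy noise and is useful only for checking the clipping logic; a private run needs a positive multiplier together with a privacy accountant, which this sketch omits.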


Title: A DNN Optimizer that Improves over AdaBelief by Suppression of the Adaptive Stepsize Range

Authors: Guoqiang Zhang, Kenta Niwa, W. Bastiaan Kleijn

Abstract: We make contributions towards improving adaptive-optimizer performance. Our improvements are based on suppression of the range of adaptive stepsizes in the AdaBelief optimizer. Firstly, we show that the particular placement of the parameter $\epsilon$ within the update expressions of AdaBelief reduces the range of the adaptive stepsizes, making AdaBelief closer to SGD with momentum. Secondly, we extend AdaBelief by further suppressing the range of the adaptive stepsizes. To achieve the above goal, we perform mutual layerwise vector projections between the gradient $\boldsymbol{g}_t$ and its first momentum $\boldsymbol{m}_t$ before using them to estimate the second momentum. The new optimization method is referred to as \emph{Aida}. Thirdly, extensive experimental results show that Aida outperforms nine optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100, while matching the best performance of the nine methods when training WGAN-GP models for image generation tasks. Furthermore, Aida produces higher validation accuracies than AdaBelief for training ResNet18 over ImageNet.

URL: https://openreview.net/forum?id=VI2JjIfU37


Title: Faster Training of Neural ODEs Using Gauß–Legendre Quadrature

Authors: Alexander Luke Ian Norcliffe, Marc Peter Deisenroth

Abstract: Neural ODEs demonstrate strong performance in generative and time-series modelling. However, training them via the adjoint method is slow compared to discrete models due to the requirement of numerically solving ODEs. To speed neural ODEs up, a common approach is to regularise the solutions. However, this approach may affect the expressivity of the model; when the trajectory itself matters, this is particularly important. In this paper, we propose an alternative way to speed up the training of neural ODEs. The key idea is to speed up the adjoint method by using Gauß-Legendre quadrature to solve integrals faster than ODE-based methods while remaining memory efficient. We also extend the idea to training SDEs using the Wong-Zakai theorem, by training a corresponding ODE and transferring the parameters. Our approach leads to faster training of neural ODEs, especially for large models. It also presents a new way to train SDE-based models.

URL: https://openreview.net/forum?id=f0FSDAy1bU
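
The quadrature rule named in the abstract above approximates an integral by a weighted sum of a few function evaluations. A minimal sketch of the 3-point Gauß-Legendre rule follows; it illustrates the quadrature only, not the paper's adjoint-based training procedure, and the function name and test integrands are assumptions for illustration.

```python
import math
from typing import Callable

# Nodes and weights of the 3-point Gauss-Legendre rule on the reference interval [-1, 1].
_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)


def gauss_legendre_3(f: Callable[[float], float], a: float, b: float) -> float:
    """Approximate the integral of f over [a, b] with three function evaluations."""
    half = 0.5 * (b - a)  # scaling from [-1, 1] to [a, b]
    mid = 0.5 * (a + b)   # shift from [-1, 1] to [a, b]
    return half * sum(w * f(half * x + mid) for x, w in zip(_NODES, _WEIGHTS))


# The 3-point rule is exact (up to rounding) for polynomials of degree at most 5.
quartic = gauss_legendre_3(lambda t: t ** 4, 0.0, 1.0)  # true value: 1/5
quintic = gauss_legendre_3(lambda t: t ** 5, 0.0, 2.0)  # true value: 32/3
# For non-polynomial integrands the result is an approximation.
sine = gauss_legendre_3(math.sin, 0.0, math.pi)         # true value: 2
```

An n-point Gauß-Legendre rule is exact for polynomials up to degree 2n - 1, which is why a handful of evaluations can replace many small steps of a general-purpose ODE solver when the integrand is smooth.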


Title: Bridging the Sim2Real gap with CARE: Supervised Detection Adaptation with Conditional Alignment and Reweighting

Authors: Viraj Uday Prabhu, David Acuna, Rafid Mahmood, Marc T. Law, Yuan-Hong Liao, Judy Hoffman, Sanja Fidler, James Lucas

Abstract: Sim2Real domain adaptation (DA) research focuses on the constrained setting of adapting from a labeled synthetic source domain to an unlabeled or sparsely labeled real target domain. However, for high-stakes applications (e.g. autonomous driving), it is common to have a modest amount of human-labeled real data in addition to plentiful auto-labeled source data (e.g. from a driving simulator).

We study this setting of supervised sim2real DA applied to 2D object detection. We propose Domain Translation via Conditional Alignment and Reweighting (CARE), a novel algorithm that systematically exploits target labels to explicitly close the sim2real appearance and content gaps. We present an analytical justification of our algorithm and demonstrate strong gains over competing methods on standard benchmarks.

URL: https://openreview.net/forum?id=lAQQx7hlku


Title: Efficient Inference With Model Cascades

Authors: Luzian Lebovitz, Lukas Cavigelli, Michele Magno, Lorenz K Muller

Abstract: State-of-the-art deep learning models are becoming ever larger. However, many practical applications are constrained by the cost of inference. Cascades of pretrained models with conditional execution address these requirements based on the intuition that some inputs are easy enough that they can be processed correctly by a smaller model, allowing for an early exit. If the smaller model is not sufficiently confident in its prediction, the input is passed on to a larger model. The selection of the confidence threshold allows trading off computational cost against accuracy. In this work we explore the effective design of model cascades, thoroughly evaluate the impact on the accuracy-efficiency trade-off, and provide a reproducible state-of-the-art baseline that is currently missing for related research. We demonstrate that model cascades dominate the ImageNet Pareto front already with 2-model cascades, achieving an average reduction in compute effort at equal accuracy of almost 3.1x above 86% and more than 1.9x between 80% and 86% top-1 accuracy, while 3-model cascades achieve 4.4x above 87% accuracy. We confirm wider applicability and effectiveness of the method on the GLUE benchmark. We release the code to reproduce our experiments in the supplementary material and use only publicly available pretrained models and datasets.

URL: https://openreview.net/forum?id=obB415rg8q
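
The early-exit logic described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: it assumes that confidence is measured as the maximum predicted class probability, and the toy models, the `cascade_predict` name, and the threshold value are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Probs = Sequence[float]
Model = Callable[[object], Probs]


def cascade_predict(x: object, models: List[Model], thresholds: List[float]) -> Tuple[int, int]:
    """Run models from smallest to largest; return (predicted class, index of answering model)."""
    assert len(thresholds) == len(models) - 1, "one threshold per non-final model"
    for i, model in enumerate(models):
        probs = model(x)
        label = max(range(len(probs)), key=probs.__getitem__)
        is_last = i == len(models) - 1
        # Exit early if this model is confident enough; the last model always answers.
        if is_last or probs[label] >= thresholds[i]:
            return label, i
    raise ValueError("models must not be empty")


# Hypothetical stand-ins for a cheap and an expensive pretrained classifier.
def small_model(x: object) -> Probs:
    return [0.9, 0.1] if x == "easy" else [0.55, 0.45]


def large_model(x: object) -> Probs:
    return [0.2, 0.8]


easy_result = cascade_predict("easy", [small_model, large_model], thresholds=[0.8])
hard_result = cascade_predict("hard", [small_model, large_model], thresholds=[0.8])
```

Raising the threshold sends more inputs to the larger model (higher cost, typically higher accuracy); lowering it lets the smaller model answer more often, which is the trade-off the abstract refers to.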


New submissions

Title: Redundancy Aware Multiple Reference Based Gainwise Evaluation of Extractive Summarization

Abstract: While very popular for evaluating extractive summarization tasks, the ROUGE metric has long been criticized for its lack of semantic awareness and its ignorance about the ranking quality of the summarizer. Previous research has addressed these issues by proposing a gain-based automated metric called \textit{Sem-nCG}, which is both rank and semantic aware. However, \textit{Sem-nCG} does not consider the amount of redundancy present in a model-generated summary and currently does not support evaluation with multiple reference summaries. Unfortunately, addressing both these limitations simultaneously is not trivial. Therefore, in this paper, we propose a redundancy-aware \textit{Sem-nCG} metric and demonstrate how this new metric can be used to evaluate model summaries against multiple references. We also explore different ways of incorporating redundancy into the original metric through extensive experiments. Experimental results demonstrate that the new redundancy-aware metric exhibits a higher correlation with human judgments than the original \textit{Sem-nCG} metric for both single and multiple reference scenarios.

URL: https://openreview.net/forum?id=8RKKz09uEq


Title: Compressing the Activation Maps in Deep Convolutional Neural Networks and Its Regularizing Effect

Abstract: Deep learning has dramatically improved performance in various image analysis applications in the last few years. However, recent deep learning architectures can be very large, with up to hundreds of layers and millions or even billions of model parameters that are impossible to fit into commodity graphics processing units. We propose a novel approach for compressing high-dimensional activation maps, the most memory-consuming part when training modern deep learning architectures. To this end, we also evaluated three different methods to compress the activation maps: Wavelet Transform, Discrete Cosine Transform, and Simple Thresholding. We performed experiments in two classification tasks for natural images and two semantic segmentation tasks for medical images. Using the proposed method, we could reduce the memory usage for activation maps by up to 95%. Additionally, we show that the proposed method induces a regularization effect that acts on the layer weight gradients.

URL: https://openreview.net/forum?id=s1qh12FReM


Title: RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation

Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a foundation agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming multi-embodiment action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100–1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent’s capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.

URL: https://openreview.net/forum?id=vsCpILiWHu


Title: ProtoCaps: A Fast and Non-Iterative Capsule Network Routing Method

Abstract: Capsule Networks have emerged as a powerful class of deep learning architectures, known for robust performance with relatively few parameters compared to Convolutional Neural Networks (CNNs). However, their inherent efficiency is often overshadowed by their slow, iterative routing mechanisms which establish connections between Capsule layers, posing computational challenges resulting in an inability to scale. In this paper, we introduce a novel, non-iterative routing mechanism, inspired by trainable prototype clustering. This innovative approach aims to mitigate computational complexity, while retaining, if not enhancing, performance efficacy. Furthermore, we harness a shared Capsule subspace, negating the need to project each lower-level Capsule to each higher-level Capsule, thereby significantly reducing memory requisites during training. Our approach demonstrates superior results compared to the current best non-iterative Capsule Network and tests on the Imagewoof dataset, which is too computationally demanding to handle efficiently by iterative approaches. Our findings underscore the potential of our proposed methodology in enhancing the operational efficiency and performance of Capsule Networks, paving the way for their application in increasingly complex computational scenarios.

URL: https://openreview.net/forum?id=Id10mlBjcx


Title: Pathwise gradient variance reduction in variational inference via zero-variance control variates

Abstract: Stochastic gradient descent is a workhorse in modern deep learning. The gradient of interest is almost always the gradient of an expectation, which is unavailable in closed form. The pathwise and score-function gradient estimators represent the most common approaches to estimating the gradient of an expectation. When it is applicable, the pathwise gradient estimator is often preferred over the score-function gradient estimator because it has substantially lower variance. Indeed, the latter is almost always applied with some variance reduction techniques. However, a series of works suggest, in the context of variational inference, that pathwise gradient estimators may also benefit from variance reduction. Work in this vein generally relies on approximations of the integrand, which requires the functional form of the variational family to be simple. In this work, we apply zero-variance control variates for variance reduction of pathwise gradient estimators, which have the advantage that very little is required of the variational distribution, except that we can sample from it.

URL: https://openreview.net/forum?id=c9OMHKDy78
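
The variance gap between the two estimators named in the abstract above can be seen on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's method: the integrand f(z) = z^2, the Gaussian N(mu, sigma^2), and the function name `gradient_samples` are all chosen here for illustration. The control variate shown is a generic scalar one (the noise eps, whose mean is known to be zero), not the zero-variance control variates the paper constructs.

```python
import numpy as np

def gradient_samples(mu=1.0, sigma=1.0, n=200_000, seed=0):
    """Per-sample estimates of d/dmu E[z^2] for z ~ N(mu, sigma^2).

    The true gradient is 2*mu. Toy setup chosen for illustration only.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    z = mu + sigma * eps                    # reparameterized sample

    # Pathwise estimator: differentiate f(mu + sigma*eps) with respect to mu.
    pathwise = 2.0 * z

    # Score-function estimator: f(z) * d/dmu log N(z; mu, sigma^2).
    score = z**2 * (z - mu) / sigma**2

    # Generic control variate: eps has known mean 0, so subtracting c*eps
    # leaves the expectation unchanged. c is the usual cov/var coefficient,
    # estimated here from the same samples.
    c = np.cov(pathwise, eps)[0, 1] / np.var(eps, ddof=1)
    pathwise_cv = pathwise - c * eps

    return pathwise, score, pathwise_cv

pathwise, score, pathwise_cv = gradient_samples()
for name, g in [("pathwise", pathwise),
                ("score-function", score),
                ("pathwise + CV", pathwise_cv)]:
    print(f"{name:>15}: mean={g.mean():.3f}  var={g.var():.3f}")
```

All three estimators target the same gradient, 2*mu = 2, but the score-function samples are far noisier than the pathwise ones. In this deliberately simple case the pathwise estimator is linear in eps, so the control variate removes essentially all of its variance; real variational objectives are not this convenient.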


Title: Wavelet Networks: Scale-Translation Equivariant Learning From Raw Waveforms

Abstract: Leveraging the symmetries inherent to specific data domains for the construction of equivariant neural networks has led to remarkable improvements in terms of data efficiency and generalization. However, most existing research focuses on symmetries arising from planar and volumetric data, leaving a crucial data source largely underexplored: *time-series*. In this work, we fill this gap by leveraging the symmetries inherent to time-series for the construction of equivariant neural networks. We identify two core symmetries: *scale and translation*, and construct scale-translation equivariant neural networks for time-series learning. Intriguingly, we find that scale-translation equivariant mappings share a strong resemblance with the *wavelet transform*. Inspired by this resemblance, we term our networks *Wavelet Networks*, and show that they perform nested non-linear wavelet-like time-frequency transforms. Empirical results show that Wavelet Networks outperform conventional CNNs on raw waveforms, and match strongly engineered spectrogram techniques across several tasks and time-series types, including audio, environmental sounds, and electrical signals. Our code is publicly available at *[link removed for the sake of the double-blind review process]*.

URL: https://openreview.net/forum?id=ga5SNulYet


Title: Large Language Models as your Personal Data Scientist

Abstract: Large Language Models (LLMs) have contributed to massive performance improvements for various language understanding and generation tasks; however, their limits are yet to be fully explored for "ill-defined" complex tasks. One such task is conversational data science, where a user can talk to an intelligent agent to explain their data science needs, and the agent will serve the user by engaging in a conversation with them as a human data scientist would and, accordingly, formulate and execute precise Machine Learning tasks. Although this is a very ambitious goal, given the recent developments in LLMs, a fully functional conversational data science system seems quite achievable in the near future. Through an in-depth case study in this paper, we explore the potential of employing LLMs as a solution to conversational data science. We hope that our findings will not only broaden the horizons of NLP research but also bring transformative changes in future AI technology.

URL: https://openreview.net/forum?id=SYz8THtTr4


Title: Threshold-aware Learning to Generate Feasible Solutions for Mixed Integer Programs

Abstract: Finding a high-quality feasible solution to a combinatorial optimization (CO) problem in a limited time is challenging due to its discrete nature. Recently, there has been an increasing number of machine learning (ML) methods for addressing CO problems. Neural diving (ND) is one of the learning-based approaches to generating partial discrete variable assignments in Mixed Integer Programs (MIP), a framework for modeling CO problems. However, a major drawback of ND is a large discrepancy between the ML and MIP objectives, i.e., misalignment between the variable value classification accuracy and primal bound. Our study finds that a specific range of variable assignment rates (coverage) yields high-quality feasible solutions, and we suggest that optimizing the coverage bridges the gap between the learning and MIP objectives. Consequently, we introduce a post-hoc method and a learning-based approach for optimizing the coverage. A key idea of our approach is to jointly learn to restrict the coverage search space and to predict the coverage in the learned search space. Experimental results demonstrate that learning a deep neural network to estimate the coverage for finding high-quality feasible solutions achieves state-of-the-art performance on the NeurIPS ML4CO datasets. In particular, our method shows outstanding performance on the workload apportionment dataset, achieving an optimality gap of 0.45%, a ten-fold improvement over SCIP within the one-minute time limit.

URL: https://openreview.net/forum?id=xndr1Nnkzp


Title: Integrated Variational Fourier Features for Fast Spatial Modelling with Gaussian Processes

Abstract: Sparse variational approximations are popular methods for scaling up inference and learning in Gaussian processes to larger datasets. For $N$ training points, exact inference has $O(N^3)$ cost; with $M \ll N$ features, state-of-the-art sparse variational methods have $O(NM^2)$ cost. Recently, methods have been proposed using more sophisticated features; these promise $O(M^3)$ cost, with good performance in low-dimensional tasks such as spatial modelling, but they only work with a very limited class of kernels, excluding some of the most commonly used. In this work, we propose integrated Fourier features, which extend these performance benefits to a very broad class of stationary covariance functions. We motivate the method and choice of parameters from a convergence analysis and empirical exploration, and show practical speedup in synthetic and real-world spatial regression tasks.

URL: https://openreview.net/forum?id=PtBzWCaCYB
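
To give a feel for the three cost regimes quoted in the abstract above, the short sketch below plugs example sizes into the leading-order terms. The sizes N = 10^6 and M = 10^3 and the helper name `gp_cost_orders` are assumptions made here for illustration; constants and lower-order terms are dropped, so the numbers indicate orders of magnitude only, not measured runtimes.

```python
def gp_cost_orders(n, m):
    """Leading-order operation counts for the costs quoted in the abstract.

    Constants and lower-order terms are ignored, so these are rough
    orders of magnitude rather than real operation counts.
    """
    return {
        "exact O(N^3)": n**3,
        "sparse variational O(N M^2)": n * m**2,
        "O(M^3)": m**3,
    }

# Hypothetical problem size: one million points, one thousand features.
costs = gp_cost_orders(n=1_000_000, m=1_000)
for name, ops in costs.items():
    print(f"{name:>28}: {ops:.1e}")
```

With these example sizes each regime is several orders of magnitude cheaper than the one before it (10^18, 10^12, and 10^9 respectively), which is why removing the dependence on N matters for large spatial datasets.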


Title: Proximal Mean Field Learning in Shallow Neural Networks

Abstract: We propose a custom learning algorithm for shallow over-parameterized neural networks, i.e., networks with a single hidden layer having infinite width. The infinite width of the hidden layer serves as an abstraction for the over-parameterization. Building on the recent mean field interpretations of learning dynamics in shallow neural networks, we realize mean field learning as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow for the learning dynamics over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.

URL: https://openreview.net/forum?id=vyRBsqj5iG
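
For readers unfamiliar with the Sinkhorn building block mentioned in the abstract above, the sketch below shows the standard Sinkhorn fixed-point iteration for entropy-regularized optimal transport between two weighted point clouds. It is a generic textbook version, not the paper's proximal algorithm: the point locations, uniform weights, squared-distance cost, regularization strength `eps`, and the function name `sinkhorn` are all assumptions chosen here for illustration.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, n_iter=200):
    """Standard Sinkhorn iteration for entropy-regularized optimal transport.

    a, b : nonnegative weights of the two point clouds, each summing to 1
    C    : cost matrix of shape (len(a), len(b))
    eps  : entropic regularization strength (illustrative value)
    Returns a transport plan whose marginals approximate a and b.
    """
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                # rescale to match row marginals
        v = b / (K.T @ u)              # rescale to match column marginals
    return u[:, None] * K * v[None, :]

# Two small weighted point clouds on the unit interval (hypothetical data).
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 4)
a = np.full(5, 1.0 / 5)
b = np.full(4, 1.0 / 4)
C = (x[:, None] - y[None, :]) ** 2     # squared-distance cost

P = sinkhorn(a, b, C)
print("row marginal error:", np.abs(P.sum(axis=1) - a).max())
print("col marginal error:", np.abs(P.sum(axis=0) - b).max())
```

The alternating rescaling is itself a contractive fixed-point recursion on the scaling vectors `u` and `v`, and it operates directly on the weighted points without any grid, which is the sense in which such computations are meshless.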


Title: On the Dual Problem of Convexified Convolutional Neural Networks

Abstract: We study the dual problem of convexified convolutional neural networks (DCCNNs). First, we introduce a primal learning problem motivated by convexified convolutional neural networks (CCNNs), and then construct the dual convex training program through careful analysis of the Karush-Kuhn-Tucker (KKT) conditions and Fenchel conjugates. Our approach reduces the computational overhead of constructing a large kernel matrix and, more importantly, eliminates the ambiguity of factorizing the matrix. Due to the low-rank structure in CCNNs and the related subdifferential of nuclear norms, there is no closed-form expression to recover the primal solution from the dual solution. To overcome this, we propose a highly novel weight recovery algorithm, which takes the dual solution and the kernel information as the input, and recovers the linear weight and the output of the convolutional layer, instead of the weight parameters. Furthermore, our recovery algorithm exploits the low-rank structure and imposes a small number of filters indirectly, which reduces the parameter size. As a result, DCCNNs inherit all the statistical benefits of CCNNs, while enjoying a more formal and efficient workflow.

URL: https://openreview.net/forum?id=0yMuNezwJ1


Title: Implicit Regularization of Bregman Proximal Point Algorithm and Mirror Descent on Separable Data

Abstract: The Bregman proximal point algorithm (BPPA) has seen emerging machine learning applications, yet its theoretical understanding remains largely unexplored. We study the computational properties of BPPA through learning linear classifiers with separable data, and demonstrate provable algorithmic regularization of BPPA. For any BPPA instantiated with a fixed Bregman divergence, we provide a lower bound of the margin obtained by BPPA with respect to an arbitrarily chosen norm. The obtained margin lower bound differs from the maximal margin by a multiplicative factor, which inversely depends on the condition number of the distance-generating function measured in the dual norm. We show that the dependence on the condition number is tight, thus demonstrating the importance of divergence in affecting the quality of the learned classifiers. We then extend our findings to mirror descent, for which we establish similar connections between the margin and Bregman divergence, together with a non-asymptotic analysis. Numerical experiments on both synthetic and real-world datasets are provided to support our theoretical findings. To the best of our knowledge, the aforementioned findings appear to be new in the literature of algorithmic regularization.

URL: https://openreview.net/forum?id=Yoe5cRxp4P

