Weekly TMLR digest for Jun 30, 2024

TMLR

Jun 30, 2024, 12:00:10 AM
to tmlr-annou...@googlegroups.com


New certifications
==================

Featured Certification: Fine-tuning can cripple your foundation model; preserving features may be the solution

Jishnu Mukhoti, Yarin Gal, Philip Torr, Puneet K. Dokania

https://openreview.net/forum?id=kfhoeZCeW7

---


Reproducibility Certification: On the Reproducibility of: "Learning Perturbations to Explain Time Series Predictions"

Wouter Bant, Ádám Divák, Jasper Eppink, Floris Six Dijkstra

https://openreview.net/forum?id=nPZgtpfgIx

---


Accepted papers
===============


Title: Bytes Are All You Need: Transformers Operating Directly On File Bytes

Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari

Abstract: Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural network. Instead, we investigate modality-independent representation learning by performing classification directly on file bytes, without the need for decoding files at inference time. This enables models to operate on various modalities without any hand-designed, modality-specific processing. Our model, ByteFormer, improves ImageNet Top-1 classification accuracy by $5\%$ (from $72.2\%$ to $77.33\%$) relative to DeIT models of similar size. Compared to Perceiver IO, our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing. We achieve $95.42\%$ classification accuracy on the Speech Commands V2 dataset (comparable to the state-of-the-art accuracy of $98.7\%$). Additionally, we demonstrate that ByteFormer can operate jointly on images and audio, handling joint classification without explicit knowledge of the input modality. We release our code at https://github.com/apple/corenet/tree/main/projects/byteformer.
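
The core idea lends itself to a very small sketch: treat the raw file bytes as a token sequence and feed them to a standard Transformer encoder. The PyTorch code below is an illustrative toy, not the released ByteFormer code; the model sizes, PAD handling, and mean-pooling readout are all assumptions.

import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    # Toy byte-level classifier: 256 byte values plus one PAD token as the vocabulary.
    def __init__(self, num_classes, d_model=192, nhead=4, num_layers=4, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(257, d_model, padding_idx=256)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, byte_ids):                            # byte_ids: (B, L) ints in [0, 256]
        x = self.embed(byte_ids) + self.pos[:, :byte_ids.size(1)]
        x = self.encoder(x, src_key_padding_mask=byte_ids.eq(256))
        return self.head(x.mean(dim=1))                     # mean-pool over byte positions

def file_to_tokens(path, max_len=4096):
    data = list(open(path, "rb").read()[:max_len])          # raw file bytes, no decoding
    data += [256] * (max_len - len(data))                   # pad with the PAD id
    return torch.tensor(data, dtype=torch.long).unsqueeze(0)

The same classifier can be pointed at a JPEG, a WAV file, or any other format, which is what makes the byte-level formulation modality-independent.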

URL: https://openreview.net/forum?id=RkaqxxAOfN

---

Title: Fine-tuning can cripple your foundation model; preserving features may be the solution

Authors: Jishnu Mukhoti, Yarin Gal, Philip Torr, Puneet K. Dokania

Abstract: Pre-trained foundation models, due to their enormous capacity and exposure to vast amounts of data during pre-training, are known to have learned plenty of real-world concepts. An important step in making these pre-trained models effective on downstream tasks is to fine-tune them on related datasets. While various fine-tuning methods have been devised and have been shown to be highly effective, we observe that a fine-tuned model's ability to recognize concepts on tasks different from the downstream one is reduced significantly compared to its pre-trained counterpart. This is an undesirable effect of fine-tuning as a substantial amount of resources was used to learn these pre-trained concepts in the first place. We call this phenomenon "concept forgetting" and via experiments show that most end-to-end fine-tuning approaches suffer heavily from this side effect. To this end, we propose a simple fix to this problem by designing a new fine-tuning method called LDIFS (short for $\ell_2$ distance in feature space) that, while learning new concepts related to the downstream task, allows a model to preserve its pre-trained knowledge as well. Through extensive experiments on 10 fine-tuning tasks we show that LDIFS significantly reduces concept forgetting. Additionally, we show that LDIFS is highly effective in performing continual fine-tuning on a sequence of tasks as well, in comparison with both fine-tuning and continual learning baselines.
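
The feature-preservation idea reads naturally as a regularized fine-tuning loss. The sketch below is a minimal PyTorch interpretation, not the authors' code: it penalizes the $\ell_2$ distance between the fine-tuned backbone's features and those of a frozen pre-trained copy, whereas the paper may regularize features from several intermediate layers; the weight `lam` is an assumption.

import copy
import torch
import torch.nn.functional as F

def make_feature_preserving_loss(backbone, lam=1.0):
    frozen = copy.deepcopy(backbone).eval()              # frozen pre-trained reference
    for p in frozen.parameters():
        p.requires_grad_(False)

    def loss_fn(head, x, y):
        feats = backbone(x)                              # features being fine-tuned
        with torch.no_grad():
            feats_pre = frozen(x)                        # pre-trained features
        task = F.cross_entropy(head(feats), y)           # downstream task loss
        preserve = F.mse_loss(feats, feats_pre)          # L2 distance in feature space
        return task + lam * preserve

    return loss_fn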

URL: https://openreview.net/forum?id=kfhoeZCeW7

---

Title: Unmasking the Veil: An Investigation into Concept Ablation for Privacy and Copyright Protection in Images

Authors: Shivank Garg, Manyana Tiwari

Abstract: In this paper, we extend the study of concept ablation within pre-trained models as introduced in 'Ablating Concepts in Text-to-Image Diffusion Models' by Kumari et al. (2022). Our work focuses on reproducing the results achieved by the different variants of concept ablation proposed through predefined metrics. We also introduce a novel variant of concept ablation—trademark ablation. This variant combines the principles of memorization and instance ablation to tackle the nuanced influence of proprietary or branded elements in model outputs. Further, our research contributions include an observational analysis of the model's limitations. Moreover, we investigate the model's behavior in response to ablation leakage-inducing prompts, which aim to indirectly ablate concepts, revealing insights into the model's resilience and adaptability. We also observe the model's performance degradation on images generated by concepts far from its target ablation concept, which is documented in the appendix.

URL: https://openreview.net/forum?id=TYYApLzjaQ

---

Title: Improving Variational Autoencoder Estimation from Incomplete Data with Mixture Variational Families

Authors: Vaidotas Simkus, Michael U. Gutmann

Abstract: We consider the task of estimating variational autoencoders (VAEs) when the training data is incomplete. We show that missing data increases the complexity of the model’s posterior distribution over the latent variables compared to the fully-observed case. The increased complexity may adversely affect the fit of the model due to a mismatch between the variational and model posterior distributions. We introduce two strategies based on (i) finite variational-mixture and (ii) imputation-based variational-mixture distributions to address the increased posterior complexity. Through a comprehensive evaluation of the proposed approaches, we show that variational mixtures are effective at improving the accuracy of VAE estimation from incomplete data.

URL: https://openreview.net/forum?id=lLVmIvZfry

---

Title: Conciliator steering: Imposing user preference in multi-objective reinforcement learning

Authors: Sara Pyykölä, Klavdiya Olegovna Bochenina, Laura Ruotsalainen

Abstract: Many real-world problems with multiple objectives require reinforcement learning solutions that can handle trade-offs in a user-preferred manner. In the multi-objective framework, a single algorithm adapting to different user preferences based on a pre-defined reward function and a subjectively defined scalarisation function may be developed. The scalarisation function approximation can be done by fitting a meta-model with information gained from the interaction between the user and the environment or the agent. The interaction requires an exact formulation of constructive feedback, which is also simple for the user to give. In this paper, we propose a novel algorithm, Conciliator steering, that leverages priority order and reward transfer to seek optimal user-preferred policies in multi-objective reinforcement learning under the expected scalarised returns criterion. We test Conciliator steering on the DeepSeaTreasure v1 benchmark problem and demonstrate that it can find user-preferred policies with effortless and simple user-agent interaction and negligible bias, which has not been possible before. Additionally, we show that on average Conciliator steering results in a fraction of the carbon dioxide emissions and total energy consumption when compared to training a fully connected MNIST classifier, both run on a personal laptop.

URL: https://openreview.net/forum?id=XAD2kcBS50

---

Title: Can LLMs Effectively Leverage Graph Structural Information through Prompts, and Why?

Authors: Jin Huang, Xingjian Zhang, Qiaozhu Mei, Jiaqi Ma

Abstract: Large language models (LLMs) are gaining increasing attention for their capability to process graphs with rich text attributes, especially in a zero-shot fashion. Recent studies demonstrate that LLMs obtain decent text classification performance on common text-rich graph benchmarks, and the performance can be improved by appending encoded structural information as natural languages into prompts. We aim to understand why the incorporation of structural information inherent in graph data can improve the prediction performance of LLMs. First, we rule out the concern of data leakage by curating a novel leakage-free dataset and conducting a comparative analysis alongside a previously widely-used dataset. Second, as past work usually encodes the ego-graph by describing the graph structure in natural language, we ask the question: do LLMs understand the prompts in graph structures? Third, we investigate why LLMs can improve their performance after incorporating structural information.
Our exploration of these questions reveals that (i) there is no substantial evidence that the performance of LLMs is significantly attributed to data leakage; (ii) instead of understanding prompts as graph structures, LLMs tend to process prompts more as contextual paragraphs and (iii) the most efficient elements of the local neighborhood included in the prompt are phrases that are pertinent to the node label, rather than the graph structure.

URL: https://openreview.net/forum?id=L2jRavXRxs

---

Title: Solving Robust MDPs through No-Regret Dynamics

Authors: Etash Kumar Guha

Abstract: Reinforcement learning is a powerful framework for training agents to navigate different situations, but it is susceptible to changes in environmental dynamics. Generating an algorithm that can find environmentally robust policies efficiently and handle different model parameterizations without imposing stringent assumptions on the uncertainty set of transitions is difficult due to the intricate interactions between policy and environment. In this paper, we address both of these issues with a No-Regret Dynamics framework that utilizes policy gradient methods and iteratively approximates the worst case environment during training, avoiding assumptions on the uncertainty set. Alongside a toolbox of nonconvex online learning algorithms, we demonstrate that our framework can achieve fast convergence rates for many different problem settings and relax assumptions on the uncertainty set of transitions.

URL: https://openreview.net/forum?id=SdCuffxg5A

---

Title: Fair Feature Importance Scores for Interpreting Decision Trees

Authors: Camille Olivia Little, Debolina Halder Lina, Genevera I. Allen

Abstract: Across various sectors such as healthcare, criminal justice, national security, finance, and technology, large-scale machine learning (ML) systems are being deployed to make critical data-driven decisions. Many have asked if we can and should trust these ML systems to be making these decisions. Two critical components are prerequisites for trust in ML systems: interpretability, or the ability to understand why the ML system makes the decisions it does, and fairness, which ensures that ML systems do not exhibit bias against certain individuals or groups. While both interpretability and fairness have garnered substantial attention in the ML literature, methods directly interpreting models in terms of fairness remain limited. This paper considers a popular interpretation for a widely used class of ML models: feature importance scores for decision trees and tree-based models. We introduce a novel Fair Tree Feature Importance Score to assess each feature's impact on fairness or bias in decision trees. Analogous to the mean decrease in impurity for trees, our score quantifies the mean increase (or decrease) in group bias, and extends to interpret tree-based ensembles or surrogates of complex ML systems. Through simulations and real examples on benchmark fairness datasets, we show the validity of our Fair Tree Feature Importance Score, offering meaningful interpretations for both tree-based ensembles and tree-based surrogates of other ML systems.
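
One way to make the score concrete is to walk a fitted sklearn tree and accumulate, per split feature, the sample-weighted change in a group-bias measure (here, the demographic-parity gap of the tree's predictions among samples reaching each node). This is an exploratory sketch in the spirit of the score described above, not the authors' definition; it assumes binary 0/1 labels and a binary protected attribute `group`.

import numpy as np

def node_bias(pred, group, idx):
    # |P(pred=1 | group=1) - P(pred=1 | group=0)| among samples selected by idx.
    a, b = idx & (group == 1), idx & (group == 0)
    if a.sum() == 0 or b.sum() == 0:
        return 0.0
    return abs(pred[a].mean() - pred[b].mean())

def fair_feature_importance(clf, X, group):
    tree, pred = clf.tree_, clf.predict(X)
    reach = clf.decision_path(X).toarray().astype(bool)   # reach[i, n]: sample i visits node n
    scores = np.zeros(X.shape[1])
    for n in range(tree.node_count):
        left, right = tree.children_left[n], tree.children_right[n]
        if left == -1:                                    # leaf: no split to score
            continue
        w = reach[:, n].sum()
        b_parent = node_bias(pred, group, reach[:, n])
        b_child = (reach[:, left].sum() * node_bias(pred, group, reach[:, left]) +
                   reach[:, right].sum() * node_bias(pred, group, reach[:, right])) / w
        scores[tree.feature[n]] += w * (b_child - b_parent)   # positive: split increases bias
    return scores / reach[:, 0].sum()

The function works with any fitted sklearn DecisionTreeClassifier; averaging the scores over the trees of a forest gives an ensemble version in the same spirit.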

URL: https://openreview.net/forum?id=72mDxlzRZ1

---

Title: Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization

Authors: Mahdi Biparva, Raika Karimi, Faezeh Faez, Yingxue Zhang

Abstract: Temporal Graph Neural Networks have garnered substantial attention for their capacity to model evolving structural and temporal patterns while exhibiting impressive performance. However, it is known that these architectures are encumbered by issues that constrain their performance, such as over-squashing and over-smoothing. Meanwhile, Transformers have demonstrated exceptional computational capacity to effectively address challenges related to long-range dependencies. Consequently, we introduce Todyformer—a novel Transformer-based neural network tailored for dynamic graphs. It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers through i) a novel patchifying paradigm for dynamic graphs to improve over-squashing, ii) a structure-aware parametric tokenization strategy leveraging MPNNs, iii) a Transformer with temporal positional-encoding to capture long-range dependencies, and iv) an encoding architecture that alternates between local and global contextualization, mitigating over-smoothing in MPNNs. Experimental evaluations on public benchmark datasets demonstrate that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks. Furthermore, we illustrate the underlying aspects of the proposed model in effectively capturing extensive temporal dependencies in dynamic graphs.

URL: https://openreview.net/forum?id=nAQSUqEspb

---

Title: The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective

Authors: Satyapriya Krishna, Tessa Han, Alex Gu, Steven Wu, Shahin Jabbari, Himabindu Lakkaraju

Abstract: As various post hoc explanation methods are increasingly being leveraged to explain complex models in high-stakes settings, it becomes critical to develop a deeper understanding of if and when the explanations output by these methods disagree with each other, and how such disagreements are resolved in practice. However, there is little to no research that provides answers to these critical questions. In this work, we introduce and study the disagreement problem in explainable machine learning. More specifically, we formalize the notion of disagreement between explanations, analyze how often such disagreements occur in practice, and how practitioners resolve these disagreements. We first conduct interviews with data scientists to understand what constitutes disagreement between explanations generated by different methods for the same model prediction and introduce a novel quantitative framework to formalize this understanding. We then leverage this framework to carry out a rigorous empirical analysis with four real-world datasets, six state-of-the-art post hoc explanation methods, and six different predictive models, to measure the extent of disagreement between the explanations generated by various popular explanation methods. In addition, we carry out an online user study with data scientists to understand how they resolve the aforementioned disagreements. Our results indicate that (1) state-of-the-art explanation methods often disagree in terms of the explanations they output, and (2) machine learning practitioners often employ ad hoc heuristics when resolving such disagreements. These findings suggest that practitioners may be relying on misleading explanations when making consequential decisions. They also underscore the importance of developing principled frameworks for effectively evaluating and comparing explanations output by various explanation techniques.

URL: https://openreview.net/forum?id=jESY2WTZCe

---

Title: Choosing the parameter of the Fermat distance: navigating geometry and noise

Authors: Frederic Chazal, Laure Ferraris, Pablo Groisman, Matthieu Jonckheere, Frederic Pascal, Facundo Fabián Sapienza

Abstract: The Fermat distance has been recently established as a valuable tool for machine learning tasks when a natural distance is not directly available to the practitioner or to improve the results given by Euclidean distances by exploiting the geometrical and statistical properties of the dataset. This distance depends on a parameter $\alpha$ that significantly affects the performance of subsequent tasks. Ideally, the value of $\alpha$ should be large enough to navigate the geometric intricacies inherent to the problem. At the same time, it should remain restrained enough to avoid any deleterious effects stemming from noise during the distance estimation process.
We study both theoretically and through simulations how to select this parameter.

URL: https://openreview.net/forum?id=jDRNEoxVc7

---

Title: On the Unreasonable Effectiveness of Federated Averaging with Heterogeneous Data

Authors: Jianyu Wang, Rudrajit Das, Gauri Joshi, Satyen Kale, Zheng Xu, Tong Zhang

Abstract: Existing theoretical results (such as Woodworth et al., 2020a) predict that the performance of federated averaging (FedAvg) degrades under high data heterogeneity. However, in practice, FedAvg converges well on several naturally heterogeneous datasets. In order to explain this seemingly unreasonable effectiveness of FedAvg that contradicts previous theoretical predictions, this paper introduces the client consensus hypothesis: on certain federated datasets, the average of local model updates on clients starting from the optimum is close to zero. We prove that under this hypothesis, data heterogeneity does not harm the convergence of FedAvg. Moreover, we show that this hypothesis holds for a linear regression problem and some naturally heterogeneous datasets such as FEMNIST and StackOverflow. Therefore, we believe that this hypothesis can better explain the performance of FedAvg in practice.
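
The hypothesis is easy to probe numerically. The snippet below is a toy illustration (not from the paper): K synthetic linear-regression clients with shifted feature distributions each take a few local gradient steps starting from the pooled-data optimum, and we inspect the norm of the averaged update, which the hypothesis predicts should be small.

import numpy as np

rng = np.random.default_rng(0)
K, n, d, lr, local_steps = 10, 200, 5, 0.01, 5
w_true = rng.normal(size=d)
clients = []
for _ in range(K):
    X = rng.normal(size=(n, d)) + rng.normal(size=d)     # heterogeneous feature shift per client
    y = X @ w_true + 0.1 * rng.normal(size=n)
    clients.append((X, y))

# Global least-squares optimum over the pooled data.
X_all = np.vstack([X for X, _ in clients])
y_all = np.hstack([y for _, y in clients])
w_star = np.linalg.lstsq(X_all, y_all, rcond=None)[0]

updates = []
for X, y in clients:
    w = w_star.copy()
    for _ in range(local_steps):                          # local gradient steps from the optimum
        w -= lr * 2 * X.T @ (X @ w - y) / n
    updates.append(w - w_star)

print("norm of averaged local update:", np.linalg.norm(np.mean(updates, axis=0)))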

URL: https://openreview.net/forum?id=zF76Ga4EPs

---

Title: Reproducibility Study of "ITI-GEN: Inclusive Text-to-Image Generation"

Authors: Daniel Gallo Fernández, Răzvan-Andrei Matișan, Alejandro Monroy Muñoz, Janusz Partyka

Abstract: Text-to-image generative models often present issues regarding fairness with respect to certain sensitive attributes, such as gender or skin tone. This study aims to reproduce the results presented in "ITI-GEN: Inclusive Text-to-Image Generation" by Zhang et al. (2023), which introduces a model to improve inclusiveness in these kinds of models. We show that most of the claims made by the authors about ITI-GEN hold: it improves the diversity and quality of generated images, it is scalable to different domains, it has plug-and-play capabilities, and it is efficient from a computational point of view. However, ITI-GEN sometimes uses undesired attributes as proxy features and it is unable to disentangle some pairs of (correlated) attributes such as gender and baldness. In addition, when the number of considered attributes increases, the training time grows exponentially and ITI-GEN struggles to generate inclusive images for all elements in the joint distribution. To solve these issues, we propose using Hard Prompt Search with negative prompting, a method that does not require training and that handles negation better than vanilla Hard Prompt Search. Nonetheless, Hard Prompt Search (with or without negative prompting) cannot be used for continuous attributes that are hard to express in natural language, an area where ITI-GEN excels as it is guided by images during training. Finally, we propose combining ITI-GEN and Hard Prompt Search with negative prompting.

URL: https://openreview.net/forum?id=d3Vj360Wi2

---

Title: On the Reproducibility of: "Learning Perturbations to Explain Time Series Predictions"

Authors: Wouter Bant, Ádám Divák, Jasper Eppink, Floris Six Dijkstra

Abstract: Deep Learning models have taken the front stage in the AI community, yet explainability challenges hinder their widespread adoption. Time series models, in particular, lack attention in this regard. This study tries to reproduce and extend the work of Enguehard (2023b), focusing on time series explainability by incorporating learnable masks and perturbations. Enguehard (2023b) employed two methods to learn these masks and perturbations, the preservation game (yielding SOTA results) and the deletion game (with poor performance). We extend the work by revising the deletion game’s loss function, testing the robustness of the proposed method on a novel weather dataset, and visualizing the learned masks and perturbations. Despite notable discrepancies in results across many experiments, our findings demonstrate that the proposed method consistently outperforms all baselines and exhibits robust performance across datasets. However, visualizations for the preservation game reveal that the learned perturbations primarily resemble a constant zero signal, questioning the importance of learning perturbations. Nevertheless, our revised deletion game shows promise, recovering meaningful perturbations and, in certain instances, surpassing the performance of the preservation game.

URL: https://openreview.net/forum?id=nPZgtpfgIx

---

Title: Revealing an Overlooked Challenge in Class-Incremental Graph Learning

Authors: Daiqing Qi, Handong Zhao, Xiaowei Jia, Sheng Li

Abstract: Graph Neural Networks (GNNs), which effectively learn from static graph-structured data, become ineffective when directly applied to streaming data in a continual learning (CL) scenario. In CL, historical data are not available during the current stage for a number of reasons, such as limited storage or GDPR data retention policies. A few recent works study this problem; however, they overlook the uniqueness of continual graph learning (CGL) compared to well-studied continual image classification: the unavailability of previous training data further poses challenges to inference in CGL, in addition to the well-known catastrophic forgetting problem. While existing works make a strong assumption that full access to historical data is unavailable during training but provided during inference, which potentially contradicts the continual learning paradigm (Van de Ven & Tolias, 2019), we study continual graph learning without this strong and contradictory assumption. In this case, without being re-inserted into previous training graphs for inference, streaming test nodes are often more sparsely connected, which makes the inference more difficult due to insufficient neighborhood information. In this work, we propose ReplayGNN (ReGNN) to jointly solve the above two challenges without memory buffers: catastrophic forgetting and poor neighbor information during inference. Extensive experiments demonstrate the effectiveness of our model over baseline models and its effectiveness in different cases with different levels of neighbor information available.

URL: https://openreview.net/forum?id=ScAc73Y1oJ

---

Title: Selective Pre-training for Private Fine-tuning

Authors: Da Yu, Sivakanth Gopi, Janardhan Kulkarni, Zinan Lin, Saurabh Naik, Tomasz Lukasz Religa, Jian Yin, Huishuai Zhang

Abstract: Text prediction models, when used in applications like email clients or word processors, must protect user data privacy and adhere to model size constraints. These constraints are crucial to meet memory and inference time requirements, as well as to reduce inference costs. Building small, fast, and private domain-specific language models is a thriving area of research. In this work, we show that a careful pre-training on a subset of the public dataset that is guided by the private dataset is crucial to train small language models with differential privacy. On standard benchmarks, small models trained with our new framework achieve state-of-the-art performance. In addition to performance improvements, our results demonstrate that smaller models, through careful pre-training and private fine-tuning, can match the performance of much larger models that do not have access to private data. This underscores the potential of private learning for model compression and enhanced efficiency.

URL: https://openreview.net/forum?id=y3u8OpPHxz

---

Title: Learning Tree-Structured Composition of Data Augmentation

Authors: Dongyue Li, Kailai Chen, Predrag Radivojac, Hongyang R. Zhang

Abstract: Data augmentation is widely used in scenarios where one needs to train a neural network given little labeled data. A common practice of augmentation training is applying a composition of multiple transformations sequentially to the data. Existing augmentation methods such as RandAugment rely on domain expertise to select a list of transformations, while other methods such as AutoAugment formulate an optimization problem over a search space of size $k^d$, which is the number of sequences of length $d$, given a list of $k$ transformation functions.

In this paper, we focus on designing efficient algorithms whose running time complexity is much faster than the worst-case complexity of $O(k^d)$, provably. We propose a new algorithm to search for a binary tree-structured composition of $k$ transformations, where each tree node corresponds to one transformation. The binary tree generalizes sequential augmentations, such as the one constructed by SimCLR. Using a top-down, recursive search procedure, our algorithm achieves a runtime complexity of $O(2^d k)$, which is much faster than $O(k^d)$ as $k$ increases above $2$. We apply the algorithm to tackle data distributions with heterogeneous subpopulations, by searching for one tree in each subpopulation, and then learn a weighted combination, leading to a forest of the trees.

We validate the proposed algorithms on numerous graph and image data sets, including a multi-label graph classification data set we collected. The data set exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost (measured by GPU hours) by 43% over existing augmentation search methods while improving performance by 4.3%. Extensive experiments on contrastive learning also validate the benefit of our approach. The tree structures can be used to interpret the relative importance of each transformation, such as identifying the important transformations on small vs. large graphs.
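
To make the tree-structured composition concrete, here is a hedged sketch of how a binary augmentation tree can be applied: each node applies its transform and forwards the result to both children, so every root-to-leaf path is one sequential composition and the leaves yield the augmented views (a SimCLR-style pair is simply a root with two leaves). This is one plausible reading of the structure described above, not the authors' search algorithm.

from dataclasses import dataclass
from typing import Any, Callable, List, Optional

@dataclass
class AugNode:
    transform: Callable[[Any], Any]
    left: Optional["AugNode"] = None
    right: Optional["AugNode"] = None

def augment(node, x) -> List[Any]:
    if node is None:
        return []
    y = node.transform(x)
    if node.left is None and node.right is None:
        return [y]                                        # leaf: emit one augmented view
    return augment(node.left, y) + augment(node.right, y)

# Toy transforms on numbers stand in for image/graph augmentations.
tree = AugNode(lambda x: x,
               left=AugNode(lambda x: x + 1),
               right=AugNode(lambda x: x * 2))
print(augment(tree, 10))                                  # -> [11, 20]: two views of one input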

URL: https://openreview.net/forum?id=lmgf03HeqV

---

Title: Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding - A Survey

Authors: Xi Fang, Weijie Xu, Fiona Anting Tan, Ziqing Hu, Jiani Zhang, Yanjun Qi, Srinivasan H. Sengamedu, Christos Faloutsos

Abstract: Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently no comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides references to relevant code and datasets. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

URL: https://openreview.net/forum?id=IZnrCGF9WI

---


New submissions
===============


Title: Conditional Idempotent Generative Networks

Abstract: We propose Conditional Idempotent Generative Networks (CIGN), a new approach that expands upon Idempotent Generative Networks (IGN) to enable conditional generation.
While IGNs offer efficient single-pass generation, they lack the ability to control the content of the generated data.
CIGNs address this limitation by incorporating conditioning mechanisms, allowing users to steer the generation process towards specific types of data.

We establish the theoretical foundations for CIGNs, outlining their scope, loss function and evaluation metrics.
We then present two potential architectures for implementing CIGNs, which we call channel conditioning and filter conditioning.
We discuss experimental results obtained on the MNIST dataset, demonstrating the effectiveness of both conditioning approaches.
Our findings pave the way for further exploration of CIGNs on larger datasets and more complex use cases.

URL: https://openreview.net/forum?id=VOKmQLsl6C

---

Title: Improving Text-to-Image Consistency via Automatic Prompt Optimization

Abstract: Impressive advances in text-to-image (T2I) generative models have yielded a plethora of high performing models which are able to generate aesthetically appealing, photorealistic images. Despite the progress, these models still struggle to produce images that are consistent with the input prompt, oftentimes failing to capture object quantities, relations and attributes properly. Existing solutions to improve prompt-image consistency suffer from the following challenges: (1) they oftentimes require model fine-tuning, (2) they only focus on nearby prompt samples, and (3) they are affected by unfavorable trade-offs among image quality, representation diversity, and prompt-image consistency. In this paper, we address these challenges and introduce a T2I optimization-by-prompting framework, OPT2I, which leverages a large language model (LLM) to improve prompt-image consistency in T2I models. Our framework starts from a user prompt and iteratively generates revised prompts with the goal of maximizing a consistency score. Our extensive validation on two datasets, MSCOCO and PartiPrompts, shows that OPT2I can boost the initial consistency score by up to 24.9% in terms of DSG score while preserving the FID and increasing the recall between generated and real data. Our work paves the way toward building more reliable and robust T2I systems by harnessing the power of LLMs.
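
The optimization-by-prompting loop can be summarized in a few lines of Python-style pseudocode. The callables `generate_image`, `consistency_score`, and `llm_revise_prompt` are hypothetical stand-ins for a T2I model, a consistency metric (e.g., a DSG-style scorer), and an LLM call; they are not part of any released OPT2I API, and the history size and candidate count are arbitrary choices.

def optimize_prompt(user_prompt, generate_image, consistency_score, llm_revise_prompt,
                    n_iters=10, n_candidates=4):
    history = [(user_prompt, consistency_score(generate_image(user_prompt), user_prompt))]
    for _ in range(n_iters):
        # Ask the LLM for revised prompts, conditioned on past (prompt, score) pairs.
        candidates = llm_revise_prompt(user_prompt, history, n=n_candidates)
        for p in candidates:
            history.append((p, consistency_score(generate_image(p), user_prompt)))
        history.sort(key=lambda t: t[1], reverse=True)
        history = history[:20]                            # keep the best prompts as context
    return history[0][0]                                  # highest-consistency prompt found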

URL: https://openreview.net/forum?id=g12Gdl6aDL

---

Title: A Theoretical Framework for Zeroth-Order Budget Convex Optimization

Abstract: This paper studies a natural generalization of the problem of minimizing a convex function $f$ by querying its values sequentially.
At each time-step $t$, the optimizer can invest a budget $b_t$ in a query point $X_t$ of their choice to obtain a fuzzy evaluation of $f$ at $X_t$ whose accuracy depends on the amount of budget invested in $X_t$ across times. This setting is motivated by the minimization of objectives whose values can only be determined approximately through lengthy or expensive computations, where it is paramount to recycle past information. In the univariate case, we design ReSearch, an anytime parameter-free algorithm for which we prove near-optimal optimization-error guarantees. Then, we present two applications of our univariate analysis. First, we show how to use ReSearch for stochastic convex optimization, obtaining theoretical and empirical improvements on state-of-the-art benchmarks. Second, we handle the $d$-dimensional budget problem by combining ReSearch with a coordinate descent method, presenting theoretical guarantees and experiments.

URL: https://openreview.net/forum?id=bo8vM9j3UO

---

Title: Towards Backwards-Compatible Data with Confounded Domain Adaptation

Abstract: Most current domain adaptation methods address either covariate shift or label shift, but are not applicable where they occur simultaneously and are confounded with each other. Domain adaptation approaches which do account for such confounding are designed to adapt covariates to optimally predict a particular label whose shift is confounded with covariate shift. In this paper, we instead seek to achieve general-purpose data backwards compatibility. This would allow the adapted covariates to be used for a variety of downstream problems, including on pre-existing prediction models and on data analytics tasks. To do this we consider a modification of generalized label shift (GLS), which we call confounded shift. We present a novel framework for this problem, based on minimizing the expected divergence between the source and target conditional distributions, conditioning on possible confounders. Within this framework, we provide concrete implementations using the Gaussian reverse Kullback-Leibler divergence and the maximum mean discrepancy. Finally, we demonstrate our approach on synthetic and real datasets.

URL: https://openreview.net/forum?id=GSp2WC7q0r

---

Title: From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Abstract: One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.

URL: https://openreview.net/forum?id=eskQMcIbMS

---

Title: For Robust Worst-Group Accuracy, Ignore Group Annotations

Abstract: Existing methods for last layer retraining that aim to optimize worst-group accuracy (WGA) rely heavily on well-annotated groups in the training data. We show, both in theory and practice, that annotation-based data augmentations using either downsampling or upweighting for WGA are susceptible to domain annotation noise. The WGA gap is exacerbated in high-noise regimes for models trained with vanilla empirical risk minimization. To this end, we introduce Regularized Annotation of Domains (RAD) to train robust last layer classifiers without needing explicit domain annotations. Our results show that RAD is competitive with other recently proposed domain annotation-free techniques. Most importantly, RAD outperforms state-of-the-art annotation-reliant methods even with only 5\% noise in the training data for several publicly available datasets.

URL: https://openreview.net/forum?id=l8E68fD6yp

---

Title: Linear Weight Interpolation Leads to Transient Performance Gains

Abstract: We train copies of a neural network on different sets of SGD noise and find that linearly interpolating their weights can, remarkably, produce networks that perform significantly better than the original networks. However, such interpolated networks consistently end up in unfavorable regions of the optimization landscape: with further training, their performance fails to improve or degrades, effectively undoing the performance gained from the interpolation. We identify two quantities that impact an interpolated network's performance and relate our observations to linear mode connectivity. Finally, we investigate this phenomenon from the lens of example importance and find that performance improves and degrades almost exclusively on the harder subsets of the training data, while performance is stable on the easier subsets. Our work represents a step towards a better understanding of neural network loss landscapes and weight interpolation in deep learning.
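
For readers who want to reproduce the basic setup, interpolating two trained copies of the same architecture is a one-liner over their state dicts. This is a minimal sketch under the usual assumptions (identical architectures, copies differing only in SGD noise); non-floating-point buffers are simply kept from the first model.

import torch

def interpolate_state_dicts(sd_a, sd_b, alpha=0.5):
    assert sd_a.keys() == sd_b.keys()
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k]
            if torch.is_floating_point(sd_a[k]) else sd_a[k]
            for k in sd_a}

# Usage sketch: model_a and model_b were trained on different sets of SGD noise.
# merged = copy.deepcopy(model_a)
# merged.load_state_dict(interpolate_state_dicts(model_a.state_dict(),
#                                                model_b.state_dict(), alpha=0.5))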

URL: https://openreview.net/forum?id=XGAdBXlFcj

---

Title: Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization

Abstract: We present Re-weighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample re-weighting. Leveraging insights from distributionally robust optimization (DRO) with Kullback-Leibler divergence, our method dynamically assigns importance weights to training data during each optimization step. RGD is simple to implement, computationally efficient, and compatible with widely used optimizers such as SGD and Adam. We demonstrate the effectiveness of RGD on various learning tasks, including supervised learning, meta-learning, and out-of-domain generalization. Notably, RGD achieves state-of-the-art results on diverse benchmarks, with improvements of +0.7% on DomainBed, +1.44% on tabular classification, +1.94% on GLUE with BERT, and +1.01% on ImageNet-1K with ViT.
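
A minimal sketch of the per-step re-weighting, assuming a KL-DRO-style rule in which per-example losses are turned into softmax importance weights before the backward pass; the temperature `tau` and the softmax normalization are assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def reweighted_step(model, optimizer, x, y, tau=1.0):
    logits = model(x)
    per_example = F.cross_entropy(logits, y, reduction="none")   # shape (B,)
    with torch.no_grad():
        weights = torch.softmax(per_example / tau, dim=0)        # harder examples get larger weight
    loss = (weights * per_example).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because the weights are detached, the update is an ordinary weighted gradient step and composes with any base optimizer such as SGD or Adam.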

URL: https://openreview.net/forum?id=KCf5CLAXZq

---

Title: DIOMIX: A Dynamic Multi-Agent Reinforcement Learning Mixing Structure for Independent Intra-Option Learning

Abstract: In cooperative multi-agent reinforcement learning (MARL), agents are equipped with a formalism to plan, learn, and reason in diverse ways, enabling continual knowledge accumulation over time. Each agent must consistently learn within its environment and possess the ability to reason at various levels of both temporal and spatial abstraction to navigate the intricacies specific to its surroundings. Current state-of-the-art approaches explicitly rely on learning an objective function that harmonizes both planning and learning without explicitly relying on reasoning. We propose a distinctive framework, Dynamic Intra-Options Mixtures (DIOMIX), aiming to address the deficiency in reasoning capabilities present in current state-of-the-art algorithms. We introduce an agent-independent option-based framework, incorporating a notion of temporal abstraction into the MARL paradigm using an advantage-based learning scheme directly on the option policy. This scheme enables higher long-term utility retention compared to directly optimizing action-value functions themselves. However, using temporal difference learning could hinder the optimization of extended temporal actions; therefore, to mitigate this issue where options are optimized solely to execute as primitive actions, we incorporate a regularization mechanism into the learning process to enable option execution over extended periods. Quantitative and qualitative empirical results show that DIOMIX acquires individually separable and explainable reasoning capabilities that lead to agent specialization, task simplification, and improved training efficiency. We achieve this by embedding the agents' learning within an option-based framework without compromising performance.

URL: https://openreview.net/forum?id=IghGTYfMRt

---

Title: Multi-intention Inverse Q-learning for Interpretable Behavior Representation

Abstract: In advancing the understanding of natural decision-making processes, inverse reinforcement learning (IRL) methods have proven instrumental in reconstructing animals' intentions underlying complex behaviors. Given the recent development of a continuous-time multi-intention IRL framework, there has been persistent inquiry into inferring discrete time-varying rewards with IRL. To address this challenge, we introduce the class of hierarchical inverse Q-learning (HIQL) algorithms. Through an unsupervised learning process, HIQL divides expert trajectories into multiple intention segments, and solves the IRL problem independently for each. Applying HIQL to simulated experiments and several real animal behavior datasets, our approach outperforms current benchmarks in behavior prediction and produces interpretable reward functions. Our results suggest that the intention transition dynamics underlying complex decision-making behavior are better modeled by a step function than by a smoothly varying function. This advancement holds promise for neuroscience and cognitive science, contributing to a deeper understanding of decision-making and uncovering underlying brain mechanisms.

URL: https://openreview.net/forum?id=hrKHkmLUFk

---

Title: Nonlinear Behaviour of Critical Points for a Simple Neural Network

Abstract: In severely over-parametrized regimes, neural network optimization can be analyzed by linearization techniques such as the neural tangent kernel, which shows gradient descent convergence to zero training error, and by landscape analysis, which shows that all local minima are global minima.

Practical networks are often much less over-parametrized, and training behavior becomes more nuanced and nonlinear. This paper contains a fine-grained analysis of the nonlinearity for a simple shallow network in one dimension. We show that the networks have unfavorable critical points, which can be mitigated by sufficiently high local resolution. Given this resolution, all critical points satisfy $L_2$ loss bounds of optimal adaptive approximation in Sobolev and Besov spaces on convex and concave subdomains of the target function. These bounds cannot be matched by linear approximation methods and show nonlinear and global behavior of the critical point's inner weights.

URL: https://openreview.net/forum?id=wfdG2PEOHS

---

Title: PriViT: Vision Transformers for Private Inference

Abstract: The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications. However, ViTs are ill-suited for private inference using secure multi-party computation (MPC) protocols, due to the large number of non-polynomial operations (self-attention, feed-forward rectifiers, layer normalization). We develop PriViT, a gradient-based algorithm to selectively Taylorize nonlinearities in ViTs while maintaining their prediction accuracy. Our algorithm is conceptually very simple, easy to implement, and achieves improved performance over existing MPC-friendly transformer architectures in terms of the latency-accuracy Pareto frontier.
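
The selective Taylorization can be pictured as giving every activation a learnable switch between the exact nonlinearity and a cheap polynomial surrogate, with a sparsity penalty pushing units toward the polynomial. The code below is an illustrative sketch in that spirit, not the PriViT algorithm; the quadratic coefficients and the L1 gate penalty are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TaylorGELU(nn.Module):
    def __init__(self, num_units):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(num_units))   # 1 = exact GELU, 0 = polynomial

    def forward(self, x):
        g = torch.clamp(self.gate, 0.0, 1.0)
        poly = 0.125 * x ** 2 + 0.5 * x + 0.25            # crude quadratic surrogate of GELU
        return g * F.gelu(x) + (1 - g) * poly

def gate_penalty(model, lam=1e-3):
    # Encourages gates toward 0, i.e., toward MPC-friendly polynomial units.
    return lam * sum(m.gate.clamp(0, 1).abs().sum()
                     for m in model.modules() if isinstance(m, TaylorGELU))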

URL: https://openreview.net/forum?id=3CmPvcYJnm

---

Title: Towards Provable Log Density Policy Gradient

Abstract: Policy gradient methods are a vital ingredient behind the success of modern reinforcement learning. Modern policy gradient methods, although successful, introduce a residual error in gradient estimation. In this work, we argue that this residual term is significant and correcting for it could potentially improve the sample complexity of reinforcement learning methods. To that end, we propose the log density gradient to estimate the policy gradient, which corrects for this residual error term. The log density gradient method computes the policy gradient by utilising the state-action discounted distributional formulation. We first present the equations needed to exactly find the log density gradient for tabular Markov Decision Processes (MDPs). For more complex environments, we propose a temporal difference (TD) method that approximates the log density gradient by utilizing backward on-policy samples. Since backward sampling from a Markov chain is highly restrictive, we also propose a min-max optimization that can approximate the log density gradient using just on-policy samples. We also prove uniqueness, and convergence under linear function approximation, for this min-max optimization. Finally, we show the sample complexity of our min-max optimization to be of the order of $m^{-1/2}$, where $m$ is the number of on-policy samples. We also demonstrate a proof-of-concept for our log density gradient method on a gridworld environment, and observe that our method is able to improve upon the classical policy gradient method by a clear margin, thus indicating a promising novel direction to develop reinforcement learning algorithms that require fewer samples.

URL: https://openreview.net/forum?id=qIWazsRaTR

---

Title: Continual Adaptation of Foundation Models for Federated Learning

Abstract: In this paper, we focus on the important yet understudied problem of Continual Federated Learning (CFL), where a server communicates with a set of clients to incrementally learn new concepts over time without sharing or storing any data. The complexity of this problem is compounded by challenges from both the Continual and Federated Learning perspectives. Specifically, models trained in a CFL setup suffer from catastrophic forgetting which is exacerbated by data heterogeneity across clients. Existing attempts at this problem tend to impose large overheads on clients and communication channels or require access to stored data which renders them unsuitable for real-world use due to privacy. In this paper, we attempt to tackle forgetting and heterogeneity while minimizing overhead costs and without requiring access to any stored data. We study this problem in the context of Foundation Models and explore parameter-efficient approaches to adapt to dynamic distributions while minimizing forgetting. We achieve this by leveraging a prompting based approach (such that only prompts and classifier heads have to be communicated) and proposing a novel and lightweight generation and distillation scheme to consolidate client models at the server. We formulate this problem for image classification and establish strong baselines for comparison, conduct experiments on CIFAR-100 as well as challenging, large-scale datasets like ImageNet-R and DomainNet. Our approach outperforms both existing methods and our own baselines by as much as 7% while significantly reducing communication and client-level computation costs.

URL: https://openreview.net/forum?id=vsZ5A3Zxyr

---

Title: On the Convergence Rates of Federated Q-Learning across Heterogeneous Environments

Abstract: Large-scale multi-agent systems are often deployed across wide geographic areas, where agents interact with heterogeneous environments. There is an emerging interest in understanding the role of heterogeneity in the performance of the federated versions of classic reinforcement learning algorithms. In this paper, we study synchronous federated Q-learning, which aims to learn an optimal Q-function by having $K$ agents average their local Q-estimates per $E$ iterations. We observe an interesting phenomenon on the convergence speeds in terms of $K$ and $E$. Similar to the homogeneous environment settings, there is a linear speed-up concerning $K$ in reducing the errors that arise from sampling randomness. Yet, in sharp contrast to the homogeneous settings, $E>1$ leads to significant performance degradation. Specifically, we provide a fine-grained characterization of the evolution of the errors in the presence of environmental heterogeneity, which decay to zero as the number of iterations $T$ increases. The slow convergence of having $E>1$ turns out to be fundamental rather than an artifact of our analysis. We prove that, for a wide range of stepsizes, the $\ell_{\infty}$ norm of the error cannot decay faster than $\Theta (E/T)$. In addition, our experiments demonstrate that the convergence exhibits an interesting two-phase phenomenon. For any given stepsize, there is a sharp phase-transition of the convergence: the error decays rapidly in the beginning yet later bounces up and stabilizes. Provided that the phase-transition time can be estimated, choosing different stepsizes for the two phases leads to faster overall convergence.
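
The $K$-agent, average-every-$E$-iterations scheme is simple to simulate in the tabular case. The following toy numpy sketch (illustrative only, not the paper's experimental setup) gives each agent its own randomly drawn transition kernel and averages the local Q-tables every E synchronous sweeps.

import numpy as np

rng = np.random.default_rng(0)
S, A, K, E, T, gamma, lr = 5, 3, 4, 10, 2000, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(K, S, A))             # heterogeneous kernels P[k, s, a] over next states
R = rng.uniform(size=(S, A))                              # shared reward table for simplicity

Q_local = np.zeros((K, S, A))
for t in range(1, T + 1):
    for k in range(K):
        for s in range(S):
            for a in range(A):
                s_next = rng.choice(S, p=P[k, s, a])      # one synchronous sample per (s, a)
                target = R[s, a] + gamma * Q_local[k, s_next].max()
                Q_local[k, s, a] += lr * (target - Q_local[k, s, a])
    if t % E == 0:                                        # server averages every E iterations
        Q_local[:] = Q_local.mean(axis=0)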

URL: https://openreview.net/forum?id=jPMJYlJc4j

---

Title: Dual Gauss-Newton Directions for Deep Learning

Abstract: Gauss-Newton (a.k.a. prox-linear) directions can be computed by solving an optimization subproblem that trades off between a partial linearization of the objective function and a proximity term. In this paper, we study the possibility of leveraging the convexity of this subproblem in order to instead solve the corresponding dual. As we show, the dual can be advantageous when the number of network outputs is smaller than the number of network parameters. We propose a conjugate gradient algorithm to solve the dual, which integrates seamlessly with autodiff through the use of linear operators and handles dual constraints. We prove that this algorithm produces descent directions when run for any number of steps. Finally, we study empirically the advantages and current limitations of our approach compared to various popular deep learning solvers.

URL: https://openreview.net/forum?id=Lce9dGQ1L9

---

Title: A Survey on Large Language Models for Critical Societal Domains: Finance, Healthcare, and Law

Abstract: In the fast-evolving domain of artificial intelligence, large language models (LLMs) such as GPT-3 and GPT-4 are revolutionizing the landscapes of finance, healthcare, and law: domains characterized by their reliance on professional expertise, challenging data acquisition, high stakes, and stringent regulatory compliance. This survey offers a detailed exploration of the methodologies, applications, challenges, and forward-looking opportunities of LLMs within these high-stakes sectors. We highlight the instrumental role of LLMs in enhancing diagnostic and treatment methodologies in healthcare, innovating financial analytics, and refining legal interpretation and compliance strategies. Moreover, we critically examine the ethics of LLM applications in these fields, pointing out the existing ethical concerns and the need for transparent, fair, and robust AI systems that respect regulatory norms. By presenting a thorough review of current literature and practical applications, we showcase the transformative impact of LLMs, and outline the imperative for interdisciplinary cooperation, methodological advancements, and ethical vigilance. Through this lens, we aim to spark dialogue and inspire future research dedicated to maximizing the benefits of LLMs while mitigating their risks in these precision-dependent sectors. To facilitate future research on LLMs in these critical societal domains, we also initiate a reading list that tracks the latest advancements under this topic, which will be released and continually updated.

URL: https://openreview.net/forum?id=upAWnMgpnH

---

Title: Using Multimodal Foundation Models and Clustering for Improved Style Ambiguity Loss

Abstract: Teaching text-to-image models to be creative involves using style ambiguity loss, which requires a pretrained classifier. In this work, we explore a new form of the style ambiguity training objective, used to approximate creativity, that does not require training a classifier or even a labeled dataset. We then train a diffusion model to maximize style ambiguity to imbue the diffusion model with creativity and find our new methods improve upon the traditional method, based on automated metrics for human judgment, while still maintaining creativity and novelty.

URL: https://openreview.net/forum?id=GqG4IvRyNl

---

Title: Strategies for Pretraining Neural Operators

Abstract: Pretraining for partial differential equation (PDE) modeling has recently shown promise in scaling neural operators across datasets to improve generalizability and performance. Despite these advances, our understanding of how pretraining affects neural operators is still limited; studies generally propose tailored architectures and datasets that make it challenging to compare or examine different pretraining frameworks. To address this, we compare various pretraining methods without optimizing architecture choices to characterize pretraining dynamics on different models and datasets as well as to understand its scaling and generalization behavior. We find that pretraining is highly dependent on model and dataset choices, but in general transfer learning or physics-based pretraining strategies work best. In addition, pretraining performance can be further improved by using data augmentations. Lastly, pretraining is additionally beneficial when fine-tuning in scarce data regimes or when generalizing to downstream data similar to the pretraining distribution. Through providing insights into pretraining neural operators for physics prediction, we hope to motivate future work in developing and evaluating pretraining methods for PDEs.

URL: https://openreview.net/forum?id=9vEVeX9oIv

---

Title: Score-based Explainability for Graph Representations

Abstract: Despite the widespread use of unsupervised Graph Neural Networks (GNNs), their post-hoc explainability remains underexplored. Current graph explanation methods typically focus on explaining a single dimension of the final output. However, unsupervised and self-supervised GNNs produce d-dimensional representation vectors whose individual elements lack clear, disentangled semantic meaning. To tackle this issue, we draw inspiration from the success of score-based graph explainers in supervised GNNs and propose a novel framework, grXAI, for graph representation explainability. grXAI generalizes existing score-based graph explainers to identify the subgraph most responsible for constructing the latent representation of the input graph. This framework can be easily and efficiently implemented as a wrapper around existing methods, enabling the explanation of graph representations through connected subgraphs, which are more human-intelligible. Extensive qualitative and quantitative experiments demonstrate grXAI's strong ability to identify subgraphs that effectively explain learned graph representations across various unsupervised tasks and learning algorithms.

URL: https://openreview.net/forum?id=K6DKrrpYpJ

---

Title: Structure-Preserving Network Compression Via Low-Rank Induced Training Through Linear Layers Composition

Abstract: Deep Neural Networks (DNNs) have achieved remarkable success in addressing many previously unsolvable tasks. However, the storage and computational requirements associated with DNNs pose a challenge for deploying these trained models on resource-limited devices. Therefore, a plethora of compression and pruning techniques have been proposed in recent years. Low-rank decomposition techniques are among the approaches most utilized to address this problem. Compared to post-training compression, compression-promoted training is still under-explored. In this paper, we present a theoretically-justified novel approach, termed Low-Rank Induced Training (LoRITa), that promotes low-rankness through the composition of linear layers and compresses by using singular value truncation. This is achieved without the need to change the structure at inference time or require constrained and/or additional optimization, other than the standard weight decay regularization. Moreover, LoRITa eliminates the need to (i) initialize with pre-trained models, (ii) specify rank selection prior to training, and (iii) compute SVD in each iteration. Our experimental results (i) demonstrate the effectiveness of our approach using MNIST on Fully Connected Networks, CIFAR10 on Vision Transformers, and CIFAR10/100 on Convolutional Neural Networks, and (ii) illustrate that we achieve either competitive or state-of-the-art results when compared to leading structured pruning and low-rank training methods in terms of FLOPs and parameters drop.
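
The composition-then-truncation recipe can be sketched in a few lines of PyTorch. This is a hedged illustration (two square-ish factors per layer and a fixed truncation rank), not the LoRITa implementation: train the composed layer with ordinary weight decay, then collapse it and truncate its SVD for deployment.

import torch
import torch.nn as nn

class ComposedLinear(nn.Module):
    # Overparameterize a d_out x d_in layer as the product of two trainable factors.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.a = nn.Linear(d_in, d_in, bias=False)
        self.b = nn.Linear(d_in, d_out, bias=True)

    def forward(self, x):
        return self.b(self.a(x))

    def compress(self, rank):
        # Collapse to a single weight matrix, then keep only the top singular values.
        W = self.b.weight @ self.a.weight                  # (d_out, d_in)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        W_low = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
        layer = nn.Linear(self.a.in_features, self.b.out_features)
        with torch.no_grad():
            layer.weight.copy_(W_low)
            layer.bias.copy_(self.b.bias)
        return layer

The reason the composition helps is the classical fact that weight decay on the two factors upper-bounds the nuclear norm of their product, so standard training already nudges the collapsed matrix toward low rank before any truncation happens.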

URL: https://openreview.net/forum?id=1KCrVMJoJ9

---

Title: The Real Tropical Geometry of Neural Networks

Abstract: We consider a binary classifier defined as the sign of a tropical rational function, that is, as the difference of two convex piecewise linear functions. In particular, we consider binary classifications through ReLU neural networks, whose parameter space is contained as a semialgebraic set inside the parameter space of tropical rational functions.
We initiate the study of two different subdivisions of this parameter space:
a subdivision into semialgebraic sets, on which the combinatorial type of the decision boundary is fixed, and a subdivision into a polyhedral fan, capturing the combinatorics of the partitions of the dataset. The sublevel sets of the $0/1$-loss function arise as subfans of this classification fan, and we show that the level-sets are not necessarily connected. We describe the classification fan i) geometrically, as normal fan of the activation polytope, and ii) combinatorially through a list of properties of associated bipartite graphs, in analogy to covector axioms of oriented matroids and tropical oriented matroids. Our findings extend and refine the connection between neural networks and tropical geometry by observing structures established in real tropical geometry, such as positive tropicalizations of hypersurfaces and tropical semialgebraic sets.

URL: https://openreview.net/forum?id=I7JWf8XA2w

---

Title: Adapt then Unlearn: Exploring Parameter Space Semantics for Unlearning in Generative Adversarial Networks

Abstract: Owing to the growing concerns about privacy and regulatory compliance, it is desirable to regulate the output of generative models. To that end, the objective of this work is to prevent the generation of outputs containing undesired features from a pre-trained Generative Adversarial Network (GAN) where the underlying training data set is inaccessible. Our approach is inspired by the observation that the parameter space of GANs exhibits meaningful directions that can be leveraged to suppress specific undesired features. However, such directions usually result in the degradation of the quality of generated samples. Our proposed two-stage method, known as \textbf{Adapt-then-Unlearn}, excels at unlearning such undesirable features while also maintaining the quality of generated samples. In the initial stage, we adapt a pre-trained GAN on a set of negative samples (containing undesired features) provided by the user. Subsequently, we train the original pre-trained GAN using positive samples, along with a repulsion regularizer. This regularizer encourages the learned model parameters to move away from the parameters of the adapted model (first stage) while not degrading the generation quality. We provide theoretical insights into the proposed method. To the best of our knowledge, our approach stands as the first method addressing unlearning within the realm of high-fidelity GANs (such as StyleGAN). We validate the effectiveness of our method through comprehensive experiments, encompassing both class-level unlearning on the MNIST and AFHQ datasets and feature-level unlearning tasks on the CelebA-HQ dataset. Our code and implementation are available at: https://anonymous.4open.science/r/Unlearning_GAN_Via_Few_Shot_Adaptation/.
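A hypothetical sketch of the second-stage repulsion term; the exact functional form below is an assumption rather than the paper's regularizer.

import torch

def repulsion_penalty(model, adapted_model):
    """Smaller when the current parameters are far from the adapted (negative) parameters."""
    dist_sq = 0.0
    for p, p_neg in zip(model.parameters(), adapted_model.parameters()):
        dist_sq = dist_sq + torch.sum((p - p_neg.detach()) ** 2)
    return torch.exp(-dist_sq)  # decays as the two models move apart

# Inside the second-stage training loop (sketch):
#   loss = gan_loss_on_positive_samples + lam * repulsion_penalty(generator, adapted_generator)
#   loss.backward(); optimizer.step()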

URL: https://openreview.net/forum?id=jAHEBivObO

---

Title: Concept-Driven Continual Learning

Abstract: This paper introduces novel solutions to the challenge of catastrophic forgetting in continual learning: Interpretability Guided Continual Learning (IG-CL) and Intrinsically Interpretable Neural Network (IN2). These frameworks bring interpretability into continual learning, systematically managing human-understandable concepts within neural network models to enhance knowledge retention from previous tasks. Our methods are designed to enhance interpretability, providing transparency and control over the continual training process. While our primary focus is to provide a new framework to design continual learning algorithms based on interpretability instead of improving performance, we observe that our methods often surpass existing ones: IG-CL employs interpretability tools to guide neural networks, showing an improvement of up to 1.4% in average incremental accuracy over existing methods; IN2, inspired by the Concept Bottleneck Model, adeptly adjusts concept units for both new and existing tasks, reducing average incremental forgetting by up to 9.1%. Both frameworks demonstrate superior performance compared to exemplar-free methods and are competitive with exemplar-based methods. When combined with exemplar-based strategies, they further improve the performance by up to 18%. These advancements represent a significant step in addressing the limitations of current continual learning methods, offering efficient and interpretable approaches that do not require additional memory for past data.

URL: https://openreview.net/forum?id=HSW49uvCNW

---

Title: On the effects of similarity metrics in decentralized deep learning under distribution shift

Abstract: Decentralized Learning (DL) enables privacy-preserving collaboration among organizations or users to enhance the performance of local deep learning models. However, model aggregation becomes challenging when client data is heterogeneous, and identifying compatible collaborators without direct data exchange remains a pressing issue. In this paper, we investigate the effectiveness of various similarity metrics in DL for identifying peers for model merging, conducting an empirical analysis across multiple datasets with distribution shifts. Our research provides insights into the performance of these metrics, examining their role in facilitating effective collaboration. By exploring the strengths and limitations of these metrics, we contribute to the development of robust DL methods.
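As a concrete (hypothetical) example of one such metric, cosine similarity between flattened client parameters can be used to score potential merging peers; the thresholding and weighting choices below are assumptions, not the paper's protocol.

import torch

def flatten_params(model):
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def cosine_peer_similarity(models):
    """Return a (num_clients x num_clients) similarity matrix over client models."""
    vecs = torch.stack([flatten_params(m) for m in models])
    vecs = torch.nn.functional.normalize(vecs, dim=1)
    return vecs @ vecs.T

# A client could then merge only with peers whose similarity exceeds a threshold,
# or weight the averaging by these scores.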

URL: https://openreview.net/forum?id=WppTEs4Kkn

---

Title: TSCMamba: Mamba Meets Multi-View Learning for Time Series Classification

Abstract: Time series classification (TSC) on multivariate time series is a critical problem. We propose a novel multi-view approach integrating frequency-domain and time-domain features to provide complementary contexts for TSC. Our method fuses continuous wavelet transform spectral features with temporal convolutional or multilayer perceptron features. We leverage the Mamba state space model for efficient and scalable sequence modeling. We also introduce a novel tango scanning scheme to better model sequence relationships. Experiments on 10 standard benchmark datasets demonstrate our approach achieves an average 6.45% accuracy improvement over state-of-the-art TSC models.
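A minimal sketch of the two views described above, with the fusion step and the Mamba backbone omitted; the feature dimensions, the wavelet choice ("morl"), and the convolutional temporal branch are my assumptions.

import numpy as np
import pywt
import torch
import torch.nn as nn

def spectral_view(x, scales=np.arange(1, 33), wavelet="morl"):
    """x: (channels, length) -> (channels, num_scales, length) CWT magnitudes."""
    views = [np.abs(pywt.cwt(ch, scales, wavelet)[0]) for ch in x]   # each (scales, length)
    return torch.tensor(np.stack(views)).float()

temporal_view = nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

x = np.random.randn(3, 128)                                   # a 3-channel series of length 128
spec = spectral_view(x)                                       # (3, 32, 128)
temp = temporal_view(torch.tensor(x).float().unsqueeze(0))    # (1, 16, 128)
# These two views would then be fused and fed to a Mamba-style sequence model.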

URL: https://openreview.net/forum?id=cpHGwrkbbb

---

Title: Geometric Analysis of Transformer Time Series Forecasting Latent Manifolds

Abstract: Transformer models have consistently achieved remarkable results in various domains such as natural language processing and computer vision. However, despite ongoing research efforts to better understand these models, a comprehensive understanding is still lacking. This is particularly true for deep time series forecasting methods, where analysis and understanding work is relatively limited. Time series data, unlike image and text information, can be more challenging to interpret and analyze. To address this, we approach the problem from a \emph{manifold learning} perspective, assuming that the latent representations of time series forecasting models lie near a low-dimensional manifold. In our study, we focus on analyzing the geometric features of these latent data manifolds, including intrinsic dimension and principal curvatures. Our findings reveal that deep transformer models exhibit similar geometric behavior across layers, and these geometric features are correlated with model performance. Additionally, we observe that untrained models initially have different structures, but they rapidly converge during training.
By leveraging our geometric analysis and differentiable tools, we can potentially design new and improved deep forecasting neural networks. This approach complements existing analysis studies and contributes to a better understanding of transformer models in the context of time series forecasting.
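For concreteness, one common intrinsic-dimension probe of such latent representations is a TwoNN-style estimator; the sketch below is my own illustration and not necessarily the estimator used in the paper.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dimension(z):
    """z: (num_points, dim) latent vectors -> scalar intrinsic-dimension estimate."""
    nn = NearestNeighbors(n_neighbors=3).fit(z)
    dists, _ = nn.kneighbors(z)             # column 0 is the distance of each point to itself
    mu = dists[:, 2] / dists[:, 1]          # ratio of 2nd to 1st neighbor distance
    mu = mu[np.isfinite(mu) & (mu > 1.0)]
    return len(mu) / np.sum(np.log(mu))     # maximum-likelihood estimate

# Toy check: data on a 5-dimensional linear subspace of R^64 should give an estimate near 5.
z = np.random.randn(2000, 64) @ np.random.randn(64, 5) @ np.random.randn(5, 64)
print(twonn_intrinsic_dimension(z))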

URL: https://openreview.net/forum?id=zRZe93OZho

---

Title: Are EEG Sequences Time Series? EEG Classification with Time Series Models and Joint Subject Training

Abstract: As with most other data domains, EEG data analysis relies on rich domain-specific preprocessing. Beyond such preprocessing, machine learning practitioners would hope to treat such data like any other time series data. For EEG classification, many models have been developed with layer types and architectures that we typically do not see in time series classification. Furthermore, a separate model is typically learned for each individual subject rather than one model for all of them. In this paper, we systematically study the differences between EEG classification models and generic time series classification models. We describe three different model setups to deal with EEG data from different subjects, namely subject-specific models (most EEG literature), subject-agnostic models, and subject-conditional models. In experiments on three datasets, we demonstrate that off-the-shelf time series classification models trained per subject perform close to EEG classification models but do not quite reach the performance of domain-specific modeling. Additionally, we combine time series models with subject embeddings to train one joint subject-conditional classifier on all subjects. The resulting models are competitive with dedicated EEG models on 2 out of 3 datasets, even outperforming all EEG methods on one of them.
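A minimal sketch of the subject-conditional setup; the encoder, embedding size, and fusion by concatenation are assumptions, and the paper may condition on the subject differently.

import torch
import torch.nn as nn

class SubjectConditionalClassifier(nn.Module):
    def __init__(self, num_channels, num_subjects, num_classes, emb_dim=16, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(                      # stand-in for any time series model
            nn.Conv1d(num_channels, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.subject_emb = nn.Embedding(num_subjects, emb_dim)
        self.head = nn.Linear(hidden + emb_dim, num_classes)

    def forward(self, x, subject_id):
        feats = self.encoder(x)                            # (batch, hidden)
        cond = self.subject_emb(subject_id)                # (batch, emb_dim)
        return self.head(torch.cat([feats, cond], dim=1))

model = SubjectConditionalClassifier(num_channels=32, num_subjects=10, num_classes=4)
logits = model(torch.randn(8, 32, 512), torch.randint(0, 10, (8,)))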

URL: https://openreview.net/forum?id=yEqzW4NdYh

---

Title: AdaWaveNet: Adaptive Wavelet Network for Time Series Analysis

Abstract: Time series data analysis is a critical component in various domains such as finance, healthcare, and meteorology. Despite the progress in deep learning for time series analysis, there remains a challenge in addressing the non-stationary nature of time series data. Most existing models, which are built on the assumption of constant statistical properties over time, often struggle to capture the temporal dynamics of realistic time series, resulting in bias and error in time series analysis. This paper introduces the Adaptive Wavelet Network (AdaWaveNet), a novel approach that employs Adaptive Wavelet Transformation for multi-scale analysis of non-stationary time series data. AdaWaveNet introduces a lifting-scheme-based wavelet decomposition and construction mechanism for adaptive and learnable wavelet transforms, which offers enhanced flexibility and robustness in analysis. We conduct extensive experiments on 10 datasets across 3 different tasks, including forecasting, imputation, and a newly established super-resolution task. The evaluations demonstrate the effectiveness of AdaWaveNet over existing methods on all three tasks, illustrating its potential in various real-world applications.
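A minimal sketch of a single learnable lifting step (my own simplification; the paper's adaptive decomposition is more elaborate): split the sequence into even and odd samples, predict the odd part from the even part, and update the even part with the resulting detail signal. Stacking such steps yields a learnable, multi-scale wavelet-like decomposition.

import torch
import torch.nn as nn

class LiftingStep(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.predict = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.update = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                      # x: (batch, channels, length), length even
        even, odd = x[..., ::2], x[..., 1::2]
        detail = odd - self.predict(even)      # high-frequency component
        approx = even + self.update(detail)    # low-frequency component
        return approx, detail

step = LiftingStep(channels=8)
approx, detail = step(torch.randn(4, 8, 128))  # each output: (4, 8, 64)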

URL: https://openreview.net/forum?id=m4bE9Y9FlX

---

Title: Hybrid Regularization Methods Achieve Near-Optimal Regularization in Random Feature Models

Abstract: We demonstrate the potential of hybrid regularization methods to automatically and efficiently regularize the training of random feature models to generalize well on unseen data. Hybrid methods automatically combine the strengths of early stopping and weight decay while avoiding their respective weaknesses. By iteratively projecting the original learning problem onto a lower-dimensional subspace, they provide an efficient way to choose the weight decay hyperparameter. In our work, the weight decay hyperparameter is automatically selected by generalized cross-validation (GCV), which implicitly performs leave-one-out cross-validation in a single training run and without the need for a dedicated validation dataset. As a demonstration, we use the random feature model to generate well- and ill-posed training problems arising from image classification. Our results show that hybrid regularization leads to near-optimal regularization in all problems. In particular, it is competitive with optimally tuned classical regularization methods. While hybrid regularization methods are popular in many large-scale inverse problems, their potential in machine learning is under-appreciated, and our findings motivate their wider use.
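For illustration, a direct-SVD version of GCV-based ridge (weight decay) selection for a random feature model is sketched below; the paper's hybrid methods instead work with iterative projections onto low-dimensional subspaces, so this is only a simplified stand-in with names and settings of my choosing.

import numpy as np

def gcv_ridge(A, y, lambdas):
    """A: (n, m) random features, y: (n,) targets. Returns (best_lambda, best_weights)."""
    n = A.shape[0]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uty = U.T @ y
    out_of_span = y - U @ Uty                      # residual component outside col(A)
    best = (np.inf, None, None)
    for lam in lambdas:
        shrink = s**2 / (s**2 + lam)               # eigenvalues of the hat matrix
        resid_sq = np.sum(((1 - shrink) * Uty) ** 2) + np.sum(out_of_span**2)
        gcv = n * resid_sq / (n - np.sum(shrink)) ** 2
        if gcv < best[0]:
            w = Vt.T @ ((s / (s**2 + lam)) * Uty)  # ridge solution for this lambda
            best = (gcv, lam, w)
    return best[1], best[2]

# Toy usage: random features of random inputs, linear targets plus noise (n < m, ill-posed).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
A = np.tanh(X @ rng.normal(size=(10, 500)))
y = X[:, 0] + 0.1 * rng.normal(size=200)
lam, w = gcv_ridge(A, y, lambdas=np.logspace(-6, 2, 25))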

URL: https://openreview.net/forum?id=ayNuSG60LW

---

Title: Undetectable Steganography for Language Models

Abstract: We introduce a cryptographic method to hide an arbitrary secret payload in the response of a Large Language Model (LLM). A secret key is required to extract the payload from the model's response, and without the key it is provably impossible to distinguish between the responses of the original LLM and the LLM that hides a payload. In particular, the quality of generated text is not affected by the payload.
Our approach extends a recent result of Christ, Gunn and Zamir (2023) who introduced an undetectable watermarking scheme for LLMs.

URL: https://openreview.net/forum?id=fq6aQoMSHz

---

Title: Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations

Abstract: Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images of two same or two different abstract shapes, testing generalization on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within and out-of-distribution using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.

URL: https://openreview.net/forum?id=fYIO7nQrTZ

---

Title: Linear Convergence of Decentralized FedAvg for PL Objectives: The Interpolation Regime

Abstract: Federated Learning (FL) is a distributed learning paradigm in which multiple clients, each having access to a local dataset, collaborate to solve a joint problem. Federated Averaging (FedAvg), the algorithm of choice, has been widely explored in the {\em Centralized} setting, where the server coordinates the information sharing among clients. However, this approach incurs high communication cost, and if the central server fails then the complete system fails. Hence, there is a need to study the performance of FedAvg in the {\em Decentralized} setting, which is not very well understood, especially in the interpolation regime, a common phenomenon observed in modern overparameterized neural networks. In this work, we address this challenge and perform a thorough theoretical performance analysis of FedAvg in the interpolation regime in the {\em Decentralized} setting, where only neighboring clients communicate, depending on the network topology. We consider a class of non-convex functions satisfying the Polyak-{\L}ojasiewicz (PL) inequality, a condition satisfied by overparameterized neural networks. For the first time, we establish that {\em Decentralized} FedAvg achieves linear convergence rates of $\mathcal{O}(T^2 \log(1/\epsilon))$, where $\epsilon$ is the solution accuracy and $T$ is the number of local updates at each client. In contrast to standard {\em Decentralized} FedAvg analyses, our work does not require bounded heterogeneity and gradient assumptions. Instead, we show that sample-wise (and local) smoothness of the local objectives suffices to capture the effect of heterogeneity. Experiments on multiple real datasets corroborate our theoretical findings.
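A hypothetical sketch of one decentralized FedAvg round (local SGD steps followed by gossip averaging over a mixing matrix); the function names and the mixing scheme are assumptions for illustration, not the paper's algorithmic details.

import copy
import torch

def decentralized_fedavg_round(models, loaders, mixing_matrix, loss_fn, local_steps, lr=0.01):
    """mixing_matrix[i][j] > 0 only if clients i and j are neighbors; rows sum to 1.
    Assumes floating-point parameters and loaders with at least `local_steps` batches."""
    # Local updates on each client.
    for model, loader in zip(models, loaders):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        data_iter = iter(loader)
        for _ in range(local_steps):
            x, y = next(data_iter)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    # Gossip averaging with neighbors according to the topology.
    states = [copy.deepcopy(m.state_dict()) for m in models]
    for i, model in enumerate(models):
        new_state = {k: sum(mixing_matrix[i][j] * states[j][k] for j in range(len(models)))
                     for k in states[i]}
        model.load_state_dict(new_state)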

URL: https://openreview.net/forum?id=Og3VxBFhwj

---

Title: Sparse Neural Architectures and Deterministic Ramanujan Graphs

Abstract: We present a sparsely connected neural network architecture, constructed using the theory of Ramanujan graphs, that provides performance comparable to a dense network. The deterministic Ramanujan graphs occur either as Cayley graphs of certain algebraic groups or as Ramanujan $r$-coverings of the full $(k,l)$ bi-regular bipartite graph on $k + l$ vertices. The bipartite graphs represent the convolutional and fully connected layers while retaining desirable structural properties such as path connectivity and symmetry. The method is novel as a zero-shot, data-independent, deterministic pruning-at-initialization technique. The approach helps in the early identification of winning lottery tickets, unlike previous techniques which typically determine them in an iterative fashion. We demonstrate experimentally that the proposed architecture provides accuracy and sparsity ratios competitive with those achieved by previous pre-training pruning algorithms.
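A minimal sketch of zero-shot pruning at initialization with a fixed bipartite connectivity pattern; constructing the actual deterministic Ramanujan graph is beyond this illustration, so a placeholder biregular mask stands in for its biadjacency matrix.

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, biadjacency, bias=True):
        super().__init__()
        out_dim, in_dim = biadjacency.shape
        self.linear = nn.Linear(in_dim, out_dim, bias=bias)
        self.register_buffer("mask", biadjacency.float())
        with torch.no_grad():
            self.linear.weight.mul_(self.mask)     # prune once, before any training

    def forward(self, x):
        # Re-applying the mask keeps pruned weights at zero throughout training.
        return nn.functional.linear(x, self.linear.weight * self.mask, self.linear.bias)

# Toy (4,2)-biregular mask: each of 32 outputs connects to 4 of 64 inputs (placeholder, not Ramanujan).
mask = torch.zeros(32, 64)
for i in range(32):
    mask[i, torch.arange(4 * i, 4 * i + 4) % 64] = 1.0
layer = MaskedLinear(mask)
out = layer(torch.randn(8, 64))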

URL: https://openreview.net/forum?id=x8wscCAJ2m

---
