Expert Certification: Multi-Fidelity Active Learning with GFlowNets
Alex Hernández-García, Nikita Saxena, Moksh Jain, Cheng-Hao Liu, Yoshua Bengio
https://openreview.net/forum?id=dLaazW9zuF
---
Accepted papers
===============
Title: How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Authors: Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty
Abstract: Despite the superior reasoning prowess demonstrated by Large Language Models (LLMs) with Chain-of-Thought (CoT) prompting, the internal mechanisms by which models generate CoT remain poorly understood. This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of Llama-2 7B applied to multi-step reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. We observe a functional rift in the middle layers of the LLM: token representations in the initial half remain strongly biased towards the pretraining prior, with the in-context prior taking over in the later half. This internal phase shift manifests in different functional components: attention heads that write the answer token appear in the later half, attention heads that move information along ontological relationships appear in the initial half, and so on. To the best of our knowledge, this is the first attempt at a mechanistic investigation of CoT reasoning in LLMs.
URL: https://openreview.net/forum?id=uHLDkQVtyC
---
Title: Weighted Risk Invariance: Domain Generalization under Invariant Feature Shift
Authors: Gina Wong, Joshua Gleason, Rama Chellappa, Yoav Wald, Anqi Liu
Abstract: Learning models whose predictions are invariant under multiple environments is a promising approach for out-of-distribution generalization. Such models are trained to extract features $X_{\text{inv}}$ where the conditional distribution $Y \mid X_{\text{inv}}$ of the label given the extracted features does not change across environments. Invariant models are also supposed to generalize to shifts in the marginal distribution $p(X_{\text{inv}})$ of the extracted features $X_{\text{inv}}$, a type of shift we call an invariant covariate shift. However, we show that proposed methods for learning invariant models underperform under invariant covariate shift, either failing to learn invariant models---even for data generated from simple and well-studied linear-Gaussian models---or having poor finite-sample performance. To alleviate these problems, we propose weighted risk invariance (WRI). Our framework is based on imposing invariance of the loss across environments subject to appropriate reweightings of the training examples. We show that WRI provably learns invariant models, i.e. discards spurious correlations, in linear-Gaussian settings. We propose a practical algorithm to implement WRI by learning the density $p(X_{\text{inv}})$ and the model parameters simultaneously, and we demonstrate empirically that WRI outperforms previous invariant learning methods under invariant covariate shift.
URL: https://openreview.net/forum?id=WyPKLWPYsr
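The abstract describes imposing invariance of the loss across environments under density-based reweighting, with the density of the invariant features learned jointly. Below is a minimal sketch of one plausible reading of that penalty; the exact weighting scheme and estimator in the paper may differ, and `classifier`, `density`, and `env_batches` are hypothetical placeholders.

import torch
import torch.nn.functional as F

def weighted_risk(classifier, density, feats, y):
    # Per-example cross-entropy reweighted by the learned density of the invariant features.
    w = density(feats).squeeze(-1)                       # hypothetical density model for p(X_inv)
    return (w * F.cross_entropy(classifier(feats), y, reduction="none")).mean()

def wri_penalty(classifier, density, env_batches):
    # One plausible form of "invariance of the loss subject to reweighting":
    # the reweighted risks should agree across environments.
    risks = [weighted_risk(classifier, density, f, y) for f, y in env_batches]
    return sum((ri - rj) ** 2 for i, ri in enumerate(risks) for rj in risks[i + 1:])

# Training sketch: total = mean empirical risk + lambda * wri_penalty(...),
# optimizing the featurizer/classifier and the density model jointly.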
---
Title: Federated Learning with Reduced Information Leakage and Computation
Authors: Tongxin Yin, Xuwei Tan, Xueru Zhang, Mohammad Mahdi Khalili, Mingyan Liu
Abstract: Federated learning (FL) is a distributed learning paradigm that allows multiple decentralized clients to collaboratively learn a common model without sharing local data. Although local data is not exposed directly, privacy concerns nonetheless exist as clients' sensitive information can be inferred from intermediate computations. Moreover, such information leakage accumulates substantially over time as the same data is repeatedly used during the iterative learning process. As a result, it can be particularly difficult to balance the privacy-accuracy trade-off when designing privacy-preserving FL algorithms. This paper introduces Upcycled-FL, a simple yet effective strategy that applies a first-order approximation at every even round of model update. Under this strategy, half of the FL updates incur no information leakage and require much lower computation and transmission costs. We first conduct a theoretical analysis of the convergence (rate) of Upcycled-FL and then apply two perturbation mechanisms to preserve privacy. Extensive experiments on both synthetic and real-world data show that the Upcycled-FL strategy can be adapted to many existing FL frameworks and consistently improves the privacy-accuracy trade-off.
URL: https://openreview.net/forum?id=ZJ4A3xhADV
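The key structural idea above is that every even round applies a first-order approximation instead of a fresh round of client computation, so only odd rounds touch client data. A minimal sketch of one way such a scheme could look; the paper's exact update rule, the extrapolation coefficient `lam`, and the added privacy perturbations are not reproduced here.

import copy

def upcycled_fl(global_weights, fedavg_round, rounds, lam=0.5):
    """Sketch: odd rounds run a normal FL aggregation step (touching client data);
    even rounds reuse the previous update direction via a first-order approximation,
    so they leak no new information and cost little.
    `fedavg_round(weights)` is a stand-in for one round of client training + aggregation."""
    prev = copy.deepcopy(global_weights)
    for t in range(1, rounds + 1):
        if t % 2 == 1:
            new = fedavg_round(global_weights)                 # regular round
        else:
            new = {k: v + lam * (v - prev[k])                  # "upcycled" round:
                   for k, v in global_weights.items()}         # extrapolate, no client work
        prev, global_weights = global_weights, new
    return global_weights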
---
Title: Multi-Fidelity Active Learning with GFlowNets
Authors: Alex Hernández-García, Nikita Saxena, Moksh Jain, Cheng-Hao Liu, Yoshua Bengio
Abstract: In recent decades, the capacity to generate large amounts of data in science and engineering applications has grown steadily. Meanwhile, machine learning has progressed to become a suitable tool to process and utilise the available data. Nonetheless, many relevant scientific and engineering problems present challenges where current machine learning methods cannot yet efficiently leverage the available data and resources. For example, in scientific discovery, we are often faced with the problem of exploring very large, structured and high-dimensional spaces. Moreover, the high-fidelity, black-box objective function is often very expensive to evaluate. Progress in machine learning methods that can efficiently tackle such challenges would help accelerate currently crucial areas such as drug and materials discovery. In this paper, we propose a multi-fidelity active learning algorithm with GFlowNets as a sampler, to efficiently discover diverse, high-scoring candidates where multiple approximations of the black-box function are available at lower fidelity and cost. Our evaluation on molecular discovery tasks shows that multi-fidelity active learning with GFlowNets can discover high-scoring candidates at a fraction of the budget of its single-fidelity counterpart while maintaining diversity, unlike RL-based alternatives. These results open new avenues for multi-fidelity active learning to accelerate scientific discovery and engineering design.
URL: https://openreview.net/forum?id=dLaazW9zuF
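A minimal sketch of the outer loop the abstract describes: iteratively query (candidate, fidelity) pairs proposed by a GFlowNet-style sampler, pay the per-fidelity cost, and update the surrogate. The `sampler`, `surrogate`, `oracles`, and `costs` interfaces are placeholders, not the paper's components.

def multi_fidelity_active_learning(sampler, surrogate, oracles, costs, budget, batch_size=32):
    """Sketch of a multi-fidelity active-learning loop.
    `sampler` stands in for a GFlowNet trained to sample (x, m) pairs in proportion to a
    multi-fidelity acquisition score; oracles[m](x) evaluates x at fidelity m at cost costs[m]."""
    dataset, spent = [], 0.0
    while spent < budget:
        for x, m in sampler.sample(batch_size):       # proposed (candidate, fidelity) pairs
            dataset.append((x, m, oracles[m](x)))     # evaluate at the chosen fidelity
            spent += costs[m]
            if spent >= budget:
                break
        surrogate.fit(dataset)                        # refit the multi-fidelity surrogate
        sampler.train(surrogate)                      # retrain the sampler toward high acquisition
    return dataset                                    # high-scoring candidates are read off at the end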
---
Title: Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion
Authors: Dylan Zhang, Curt Tigges, Zory Zhang, Stella Biderman, Maxim Raginsky, Talia Ringer
Abstract: This paper investigates the ability of transformer-based models to learn structural recursion from examples. Recursion is a universal concept in both natural and formal languages. Structural recursion is central to the programming language and formal mathematics tasks where symbolic tools currently excel beyond neural models, such as inferring semantic relations between datatypes and emulating program behavior.
We introduce a general framework that connects the abstract concepts of structural recursion in the programming language domain to concrete sequence modeling problems and learned models' behavior. The framework includes a representation that captures the general \textit{syntax} of structural recursion, coupled with two different frameworks for understanding its \textit{semantics}---one that is more natural from a programming languages perspective and one that helps bridge that perspective with a mechanistic understanding of the underlying transformer architecture.
With our framework as a powerful conceptual tool, we identify different issues under various set-ups. Models trained to emulate recursive computations cannot fully capture the recursion and instead fit shortcut algorithms, and thus cannot solve certain edge cases that are under-represented in the training distribution. In addition, it is difficult for state-of-the-art large language models (LLMs) to mine recursive rules from in-context demonstrations. Moreover, these LLMs fail in interesting ways when emulating the reduction (step-wise computation) of recursive functions.
URL: https://openreview.net/forum?id=Ry5CXXm1sf
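For readers unfamiliar with the term, here is a small example of the kind of structural recursion the paper studies: a function defined by one case per constructor of a datatype, together with its step-wise reduction (the "emulating reduction" task mentioned above). The Peano-style encoding is illustrative only, not the paper's representation.

# Peano naturals: a number is either Z (zero) or S(n) (successor of n).
Z = ("Z",)
def S(n): return ("S", n)

def add(m, n):
    # Structural recursion on the first argument: one case per constructor.
    if m[0] == "Z":
        return n                    # add(Z, n)      = n
    return S(add(m[1], n))          # add(S(m'), n)  = S(add(m', n))

# Step-wise reduction of add(S(S(Z)), S(Z)), i.e. 2 + 1:
#   add(S(S(Z)), S(Z))
#   -> S(add(S(Z), S(Z)))
#   -> S(S(add(Z, S(Z))))
#   -> S(S(S(Z)))                   # = 3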
---
Title: Revisiting Non-separable Binary Classification and its Applications in Anomaly Detection
Authors: Matthew Lau, ISMAILA SECK, Athanasios P Meliopoulos, Wenke Lee, Eugene Ndiaye
Abstract: The inability to linearly classify $\texttt{XOR}$ has motivated much of deep learning. We revisit this age-old problem and show that $\textit{linear}$ classification of $\texttt{XOR}$ is indeed possible. Instead of separating data between halfspaces, we propose a slightly different paradigm, $\texttt{equality separation}$, that adapts the SVM objective to distinguish data within or outside the margin. Our classifier can then be integrated into neural network pipelines with a smooth approximation. From its properties, we intuit that equality separation is suitable for anomaly detection. To formalize this notion, we introduce $\textit{closing numbers}$, a quantitative measure on the capacity for classifiers to form closed decision regions for anomaly detection. Springboarding from this theoretical connection between binary classification and anomaly detection, we test our hypothesis on supervised anomaly detection experiments, showing that equality separation can detect both seen and unseen anomalies.
URL: https://openreview.net/forum?id=zOJ846BXhl
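A concrete illustration of the claim that XOR admits a linear "equality separation": label points by whether they lie within a margin of a hyperplane rather than by which side of it they fall on. The smooth approximation shown at the end is one natural choice and an assumption, not necessarily the paper's.

import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])    # XOR inputs
y = np.array([0, 0, 1, 1])                         # XOR labels

w, b, margin = np.array([1.0, -1.0]), 0.0, 0.5
pred = (np.abs(X @ w + b) > margin).astype(int)    # "off the hyperplane" => class 1
print(pred)                                        # [0 0 1 1]: XOR is linearly "equality-separated"

# A smooth stand-in usable inside a neural network (one possible choice):
# exp(-(w.x + b)^2) is near 1 on the hyperplane and decays away from it.
smooth = np.exp(-(X @ w + b) ** 2)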
---
Title: SEAL: Simultaneous Label Hierarchy Exploration And Learning
Authors: Zhiquan Tan, Zihao Wang, Yifan Zhang
Abstract: Label hierarchy is an important source of external knowledge that can enhance classification performance. However, most existing methods rely on predefined label hierarchies that may not match the data distribution. To address this issue, we propose Simultaneous label hierarchy Exploration And Learning (SEAL), a new framework that explores the label hierarchy by augmenting the observed labels with latent labels that follow a prior hierarchical structure. Our approach uses a 1-Wasserstein metric over the tree metric space as an objective function, which enables us to simultaneously learn a data-driven label hierarchy and perform (semi-)supervised learning. We evaluate our method on several standard benchmarks and show that it achieves improved results in semi-supervised image classification scenarios.
URL: https://openreview.net/forum?id=JZVqDTNA59
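The objective above builds on the 1-Wasserstein distance under a tree metric, which has a simple closed form: a sum over edges of the edge weight times the absolute difference of probability mass in the subtree below that edge. A minimal sketch of that quantity on a toy label hierarchy; the tree, weights, and how SEAL plugs this into training are placeholders.

import numpy as np

# Toy label tree: node 0 is the root; parent[i] and weight[i] describe the edge above node i.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
weight = {1: 1.0, 2: 1.0, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
leaves = [3, 4, 5, 6]

def tree_wasserstein(p, q):
    """1-Wasserstein distance between leaf distributions p, q under the tree metric:
    sum over edges of edge_weight * |p-mass below the edge - q-mass below the edge|."""
    mass_p = {n: 0.0 for n in range(7)}
    mass_q = {n: 0.0 for n in range(7)}
    for leaf, pi, qi in zip(leaves, p, q):
        node = leaf
        while True:                        # propagate leaf mass up to the root
            mass_p[node] += pi
            mass_q[node] += qi
            if node == 0:
                break
            node = parent[node]
    return sum(w * abs(mass_p[n] - mass_q[n]) for n, w in weight.items())

print(tree_wasserstein([1, 0, 0, 0], [0, 0, 0, 1]))   # leaf 3 to leaf 6: 0.5 + 1 + 1 + 0.5 = 3.0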
---
Title: Fast and Effective Weight Update for Pruned Large Language Models
Authors: Vladimír Boža
Abstract: Pruning large language models (LLMs) is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the performance lost by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates that preserve the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations.
In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs.
URL: https://openreview.net/forum?id=1hcpXd9Jir
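The core ingredient above is an ADMM-based weight update for a pruned layer. Below is a minimal sketch of the generic splitting for masked layer-wise reconstruction (minimize ||W X - Y||_F^2 subject to W being zero outside the mask); the paper's exact subproblems, gradual mask-selection schedule, and approximations may differ.

import numpy as np

def admm_masked_update(X, Y, mask, rho=1.0, iters=50):
    """Layer-wise reconstruction under a fixed sparsity mask via ADMM.
    X: (d_in, n) calibration inputs, Y: (d_out, n) dense-layer outputs, mask: (d_out, d_in) binary."""
    d_out, d_in = mask.shape
    Z = np.zeros((d_out, d_in))          # masked copy of the weights
    U = np.zeros_like(Z)                 # scaled dual variable
    G = X @ X.T                          # Gram matrix, reused every iteration
    A = np.linalg.inv(G + rho * np.eye(d_in))
    YXt = Y @ X.T
    for _ in range(iters):
        W = (YXt + rho * (Z - U)) @ A    # W-step: ridge-like least squares
        Z = (W + U) * mask               # Z-step: project onto the sparsity pattern
        U = U + W - Z                    # dual update
    return Z                             # pruned weights respecting the mask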
---
Title: A Self-Representation Learning Method for Unsupervised Feature Selection using Feature Space Basis
Authors: Prayag Tiwari, Farid Saberi Movahed, Saeed Karami, Farshad Saberi-Movahed, Jens Lehmann, Sahar Vahdati
Abstract: Current self-representation-based feature selection methods use all the features of the original data in their representation framework. This carries redundant and noisy features over into the representation space, diminishing the quality and effectiveness of the results. This work proposes a novel representation learning method, dubbed GRSSLFS (Graph Regularized Self-Representation and Sparse Subspace Learning), that mitigates the drawbacks of using all features. GRSSLFS constructs a basis for the feature space from the features with the highest variance. The objective function of GRSSLFS is then developed within a self-representation framework that combines subspace learning and matrix factorization of the basis matrix. Moreover, these basis features are incorporated into a manifold learning term to preserve the geometrical structure of the underlying data.
We provide an effectiveness and performance evaluation on several widely-used benchmark datasets. The results show that GRSSLFS achieves a high level of performance compared to several classic and state-of-the-art feature selection methods.
URL: https://openreview.net/forum?id=LNvbgBFPMt
---
Title: Cost-Sensitive Learning to Defer to Multiple Experts with Workload Constraints
Authors: Jean Vieira Alves, Diogo Leitão, Sérgio Jesus, Marco O. P. Sampaio, Javier Liébana, Pedro Saleiro, Mario A. T. Figueiredo, Pedro Bizarro
Abstract: Learning to defer (L2D) aims to improve human-AI collaboration systems by learning how to defer decisions to humans when they are more likely to be correct than an ML classifier. Existing research in L2D overlooks key real-world aspects that impede its practical adoption, namely: i) neglecting cost-sensitive scenarios, where type I and type II errors have different costs; ii) requiring concurrent human predictions for every instance of the training dataset; and iii) not dealing with human work-capacity constraints. To address these issues, we propose the \textit{deferral under cost and capacity constraints framework} (DeCCaF). DeCCaF is a novel L2D approach, employing supervised learning to model the probability of human error under less restrictive data requirements (only one expert prediction per instance) and using constraint programming to globally minimize the error cost, subject to workload limitations. We test DeCCaF in a series of cost-sensitive fraud detection scenarios with different teams of 9 synthetic fraud analysts, with individual work-capacity constraints. The results demonstrate that our approach performs significantly better than the baselines in a wide array of scenarios, achieving an average $8.4\%$ reduction in the misclassification cost. The code used for the experiments is available at https://github.com/feedzai/deccaf
URL: https://openreview.net/forum?id=TAvGZm2Rqb
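The final assignment step described above (minimize total expected error cost subject to per-expert workload limits) is, at its core, a capacity-constrained assignment problem. The paper solves it with constraint programming; as a rough stand-in, here is an LP-relaxation sketch with scipy, where `cost[i, j]` is a hypothetical predicted misclassification cost of routing instance i to decision-maker j.

import numpy as np
from scipy.optimize import linprog

def assign_with_capacities(cost, capacity):
    """cost: (n_instances, n_experts) predicted error costs; capacity: (n_experts,) workload caps.
    Requires sum(capacity) >= n_instances for feasibility. Returns an assignment matrix
    (this transportation-type LP typically has an integral optimum)."""
    n, m = cost.shape
    c = cost.ravel()                                  # decision vars x[i, j], row-major
    A_eq = np.zeros((n, n * m))                       # each instance is assigned exactly once
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    A_ub = np.zeros((m, n * m))                       # each expert handles at most capacity[j] instances
    for j in range(m):
        A_ub[j, j::m] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=np.ones(n),
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n, m)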
---
Title: Task-Relevant Feature Selection with Prediction Focused Mixture Models
Authors: Abhishek Sharma, Catherine Zeng, Sanjana Narayanan, Sonali Parbhoo, Roy H. Perlis, Finale Doshi-Velez
Abstract: Probabilistic models, such as mixture models, can encode latent structures that both explain the data and aid specific downstream tasks.
We focus on a constrained setting where we want to learn a model with relatively few components (e.g. for interpretability).
Simultaneously, we ensure that the components are useful for downstream predictions by introducing \emph{prediction-focused} modeling for mixtures, which automatically selects data features relevant to a prediction task.
Our approach identifies task-relevant input features, outperforms models that are not prediction-focused, and is easy to optimize; most importantly, we also characterize \emph{when} prediction-focused modeling can be expected to work.
URL: https://openreview.net/forum?id=voHKJOdCNw
---
Title: Convergence Analysis and Trajectory Comparison of Gradient Descent for Overparameterized Deep Linear Networks
Authors: Hongru Zhao, Jinchao Xu
Abstract: This paper presents a convergence analysis and trajectory comparison of the gradient descent (GD) method for overparameterized deep linear neural networks with different random initializations, demonstrating that the GD trajectory for these networks closely matches that of the corresponding convex optimization problem. This study touches upon a major open theoretical problem in machine learning: why are deep neural networks trained with GD efficient in so many practical applications? While a solution for general nonlinear deep neural networks is still out of reach, extensive efforts have been invested in studying relevant questions for deep linear neural networks, and many interesting results have been reported to date. For example, recent results on the loss landscape show that even though the loss function of deep linear neural networks is non-convex, every local minimizer is also a global minimizer. We focus on the trajectory of GD when applied to deep linear networks and demonstrate that, with appropriate initialization and sufficient width of the hidden layers, the GD trajectory closely matches that of the corresponding convex optimization problem. This result holds regardless of the depth of the network, providing insight into the efficiency of GD in the training of deep neural networks. Furthermore, we show that the GD trajectory for an overparameterized deep linear network automatically avoids bad saddle points.
URL: https://openreview.net/forum?id=jG7ndW7UHp
---
Title: Variational Learning ISTA
Authors: Fabio Valerio Massoli, Christos Louizos, Arash Behboodi
Abstract: Compressed sensing combines the power of convex optimization techniques with a sparsity-inducing prior on the signal space to solve an underdetermined system of equations. For many problems, the sparsifying dictionary is not directly given, nor can its existence be assumed. Moreover, the sensing matrix can change across different scenarios. Addressing these issues requires solving a sparse representation learning problem, namely dictionary learning, taking into account the epistemic uncertainty of the learned dictionaries and, finally, jointly learning sparse representations and reconstructions under varying sensing matrix conditions. We address both concerns by proposing a variant of the LISTA architecture. First, we introduce Augmented Dictionary Learning ISTA (A-DLISTA), which incorporates an augmentation module to adapt parameters to the current measurement setup. Then, we propose to learn a distribution over dictionaries via a variational approach, dubbed Variational Learning ISTA (VLISTA). VLISTA exploits A-DLISTA as the likelihood model and approximates a posterior distribution over the dictionaries as part of an unfolded LISTA-based recovery algorithm. As a result, VLISTA provides a probabilistic way to jointly learn the dictionary distribution and the reconstruction algorithm with varying sensing matrices. We provide theoretical and experimental support for our architecture and show that our model learns calibrated uncertainties.
URL: https://openreview.net/forum?id=AQk0UsituG
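For context, the (L)ISTA backbone the abstract builds on is the iterative soft-thresholding algorithm, whose unrolled iterations LISTA parameterizes with learned weights and thresholds; A-DLISTA additionally adapts those parameters to the current measurement setup, and VLISTA places a distribution over dictionaries. A minimal sketch of the classical ISTA step only; the learned and variational parts are not shown.

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, y, lam, step, iters=100):
    """Solve min_z 0.5*||A z - y||^2 + lam*||z||_1 by iterative soft-thresholding.
    LISTA-style networks replace (A, step, lam) in each unrolled iteration with learned parameters."""
    z = np.zeros(A.shape[1])
    for _ in range(iters):
        z = soft_threshold(z - step * A.T @ (A @ z - y), step * lam)
    return z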
---
Title: Correcting Flaws in Common Disentanglement Metrics
Authors: Louis Mahon, Lei Sha, Thomas Lukasiewicz
Abstract: Disentangled representations are those in which distinct features, such as size or shape, are represented by distinct neurons. Quantifying the extent to which a given representation is disentangled is not straightforward; multiple metrics have been proposed. In this paper, we identify two failings of existing metrics, which mean they can assign a high score to a model which is still entangled, and we propose two new metrics, which redress these problems. First, we use hypothetical toy examples to demonstrate the failure modes we identify for existing metrics. Then, we show that similar situations occur in practice. Finally, we validate our metrics on the downstream task of compositional generalization. We measure the performance of six existing disentanglement models on this downstream compositional generalization task, and show that performance is (a) generally quite poor, (b) correlated, to varying degrees, with most disentanglement metrics, and (c) most strongly correlated with our newly proposed metrics. Anonymous code to reproduce our results is available at https://github.com/anon296/anon.
URL: https://openreview.net/forum?id=c8WJ4Vozb2
---
Title: Overcoming Order in Autoregressive Graph Generation for Molecule Generation
Authors: Edo Cohen-Karlik, Eyal Rozenberg, Daniel Freedman
Abstract: Graph generation is a fundamental problem in various domains, and is of particular interest in chemistry where graphs may be used to represent molecules. Recent work has shown that molecular graph generation using recurrent neural networks (RNNs) is advantageous compared to traditional generative approaches which require converting continuous latent representations into graphs. One issue which arises when treating graph generation as sequential generation is the arbitrary order of the sequence which results from a particular choice of graph flattening method: in the chemistry setting, molecular graphs commonly have multiple SMILES strings corresponding to the same molecule. Inspired by the use case of molecular graph generation, we propose using RNNs, taking into account the non-sequential nature of graphs by adding an Orderless Regularization (OLR) term that encourages the hidden state of the recurrent model to be invariant to different valid orderings present under the training distribution. We demonstrate that sequential molecular graph generation models benefit from our proposed regularization scheme, especially when data is scarce. Our findings contribute to the growing body of research on graph generation and provide a valuable tool for various applications requiring the synthesis of realistic and diverse graph structures.
URL: https://openreview.net/forum?id=BK6Gc10tRy
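A minimal sketch of how an orderless regularization term of the kind described above could be added to a sequence model's loss: penalize the distance between hidden states produced by two valid orderings of the same graph. The exact distance, which states are compared, and the GRU-style interface are assumptions, not the paper's specification.

import torch

def olr_term(rnn, seq_a, seq_b):
    """seq_a, seq_b: embedded token sequences for two valid flattenings (e.g. two SMILES
    orderings) of the same molecular graph, shaped (batch, length, dim). A batch_first
    GRU-style module returning (outputs, final_hidden_state) is assumed."""
    _, h_a = rnn(seq_a)                      # hidden state under ordering A
    _, h_b = rnn(seq_b)                      # hidden state under ordering B
    return ((h_a - h_b) ** 2).mean()         # push the states to be order-invariant

# total_loss = generation_nll + beta * olr_term(rnn, seq_a, seq_b)   # beta: regularization weight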
---
Title: Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration
Authors: Yung-Chen Tang, Pin-Yu Chen, Tsung-Yi Ho
Abstract: Neural network calibration is an essential task in deep learning to ensure consistency between the confidence of model prediction and the true correctness likelihood. In this paper, we propose a new post-processing calibration method called $\textbf{Neural Clamping}$, which employs a simple joint input-output transformation on a pre-trained classifier via a learnable universal input perturbation and an output temperature scaling parameter. Moreover, we provide theoretical explanations of why Neural Clamping is provably better than temperature scaling. Evaluated on BloodMNIST, CIFAR-100, and ImageNet image recognition datasets and a variety of deep neural network models, our empirical results show that Neural Clamping significantly outperforms state-of-the-art post-processing calibration methods. The code is available at github.com/yungchentang/NCToolkit, and the demo is available at huggingface.co/spaces/TrustSafeAI/NCTV.
URL: https://openreview.net/forum?id=qSFToMqLcq
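The abstract specifies the form of the transformation: a learnable universal input perturbation plus an output temperature, fit post hoc while the classifier stays frozen. A minimal sketch; the paper's training objective, initialization, and schedule may differ, and cross-entropy is used here as a stand-in calibration loss.

import torch
import torch.nn.functional as F

def neural_clamping_fit(model, calib_loader, input_shape, epochs=10, lr=1e-3):
    """Post-hoc calibration: learn a universal input perturbation `delta` and a
    temperature `T` on held-out data; only delta and T are optimized."""
    delta = torch.zeros(input_shape, requires_grad=True)   # one perturbation shared by all inputs
    log_T = torch.zeros(1, requires_grad=True)             # parameterize T > 0 via exp
    opt = torch.optim.Adam([delta, log_T], lr=lr)
    model.eval()
    for _ in range(epochs):
        for x, y in calib_loader:
            logits = model(x + delta) / log_T.exp()         # clamp the input, scale the output
            loss = F.cross_entropy(logits, y)               # stand-in calibration loss
            opt.zero_grad(); loss.backward(); opt.step()
    return delta.detach(), log_T.exp().item()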
---
Title: A replica analysis of under-bagging
Authors: Takashi Takahashi
Abstract: Under-bagging (UB), which combines under-sampling and bagging, is a popular ensemble learning method for training classifiers on imbalanced data. Using bagging to reduce the increased variance caused by the reduction in sample size due to under-sampling is a natural approach. However, it has recently been pointed out that in generalized linear models, naive bagging, which does not consider the class imbalance structure, and ridge regularization can produce the same results. It is therefore not obvious whether it is better to use UB, which requires a computational cost proportional to the number of under-sampled data sets, when training linear models. Given this situation, in this study, we heuristically derive sharp asymptotics for UB and use them to compare UB with several other standard methods for learning from imbalanced data, in a scenario where a linear classifier is trained on two-component mixture data. The methods compared include the under-sampling (US) method, which trains a model using a single realization of the subsampled data, and the simple weighting (SW) method, which trains a model with a weighted loss on the entire data. We show that the performance of UB improves as the size of the majority class increases, with the minority class size held fixed, even when the resulting class imbalance is large; the effect is especially pronounced when the minority class is small. This is in contrast to US, whose performance is almost independent of the majority class size. In this sense, bagging and simple regularization differ as methods to reduce the variance increased by under-sampling. On the other hand, the performance of SW with the optimal weighting coefficients is almost equal to that of UB, indicating that the combination of reweighting and regularization may be similar to UB.
URL: https://openreview.net/forum?id=7HIOUZAoq5
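For reference, the under-bagging procedure analyzed above in its simplest form: repeatedly subsample the majority class down to the minority-class size, fit a linear classifier on each balanced subsample, and average the resulting classifiers. This is a sketch of the procedure only; the paper's contribution is its sharp asymptotic analysis, and subsampling details (with or without replacement) vary across implementations.

import numpy as np
from sklearn.linear_model import LogisticRegression

def under_bagging(X, y, n_bags=25, seed=0):
    """y in {0, 1}, with class 1 the minority. Returns an averaged decision function."""
    rng = np.random.default_rng(seed)
    minority, majority = np.where(y == 1)[0], np.where(y == 0)[0]
    models = []
    for _ in range(n_bags):
        sub = rng.choice(majority, size=len(minority), replace=False)   # under-sample majority
        idx = np.concatenate([minority, sub])
        models.append(LogisticRegression().fit(X[idx], y[idx]))
    def decision(X_new):
        return np.mean([m.decision_function(X_new) for m in models], axis=0)
    return decision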
---
New submissions
===============
Title: MoCaE: Mixture of Calibrated Experts Significantly Improves Object Detection
Abstract: Combining the strengths of many existing predictors to obtain a Mixture of Experts which is superior to its individual components is an effective way to improve performance without having to develop new architectures or train a model from scratch. However, surprisingly, we find that naively combining off-the-shelf object detectors in a manner similar to Deep Ensembles can often lead to degraded performance. We identify that the primary cause of this issue is that the predictions of the experts do not match their performance, a mismatch referred to as miscalibration. Consequently, the most confident detector dominates the final predictions, preventing the mixture from leveraging all the predictions from the experts appropriately. To address this, when constructing the Mixture of Experts for object detection, we propose to combine their predictions in a manner that reflects the individual performance of the experts; an objective we achieve by first calibrating the predictions before filtering and refining them. We term this approach the Mixture of Calibrated Experts (MoCaE) and demonstrate its effectiveness through extensive experiments on 5 different detection tasks, showing that it: (i) improves object detectors on COCO and instance segmentation methods on LVIS by up to $\sim 2.5$ AP; (ii) reaches state-of-the-art on COCO test-dev with $65.1$ AP and on DOTA with $82.62$ $\mathrm{AP_{50}}$; (iii) outperforms single models consistently on recent detection tasks such as Open Vocabulary Object Detection. The code will be made public.
URL: https://openreview.net/forum?id=fJEsas1z8J
---
Title: Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
Abstract: Vision-and-Language Navigation (VLN) has gained increasing attention over recent years, and many approaches have emerged to advance its development. The remarkable achievements of foundation models have shaped the challenges and proposed methods in VLN research. In this survey, we provide a top-down review that adopts a principled framework for embodied planning and reasoning, and emphasizes current methods and future opportunities for leveraging foundation models to address VLN challenges. We hope our in-depth discussion provides valuable resources and insights: on one hand, marking milestones of progress and exploring opportunities and potential roles for foundation models in this field, and on the other, organizing the different challenges and solutions in VLN for foundation model researchers.
URL: https://openreview.net/forum?id=yiqeh2ZYUh
---
Title: Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
Abstract: The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky -- perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model editing nonetheless demands a solution, since we need to be able to control knowledge within language models. With this goal in mind, this paper critiques the standard formulation of the model editing problem and proposes a formal testbed for model editing research. We first describe 12 open problems with model editing, based on challenges with (1) defining the problem, (2) developing benchmarks, and (3) assuming LLMs have editable beliefs in the first place. Many of the challenges are extremely difficult to address, e.g. determining far-reaching consequences of edits, labeling probabilistic entailments between facts, and updating beliefs of agent simulators. Next, we introduce a semi-synthetic dataset for model editing based on Wikidata, where we can evaluate edits against labels given by an idealized Bayesian agent. This enables us to say exactly how belief revision in language models falls short of a desirable epistemic standard. We encourage further research exploring settings where such a gold standard can be compared against.
URL: https://openreview.net/forum?id=LRf19n5Ly3
---
Title: Persona-aware Generative Model for Code-mixed Language
Abstract: Code-mixing and script-mixing are prevalent across online social networks and multilingual societies. However, a user's preference toward code-mixing depends on the socioeconomic status, demographics of the user, and the local context, which existing generative models tend to ignore while generating code-mixed texts. In this work, we make a pioneering attempt to develop a persona-aware generative model to generate texts resembling real-life code-mixed texts of individuals. We propose PARADOX, a persona-aware generative model for code-mixed text generation, which is a novel Transformer-based encoder-decoder model that encodes an utterance conditioned on a user's persona and generates code-mixed texts without monolingual reference data. We propose an alignment module that re-calibrates the generated sequence to resemble real-life code-mixed texts. PARADOX generates code-mixed texts that are semantically more meaningful and linguistically more valid. To evaluate the personification capabilities of PARADOX, we propose four new metrics -- CM BLEU, CM Rouge-1, CM Rouge-L and CM KS. On average, PARADOX achieves $1.6$% better CM BLEU, $57$% better perplexity and $32$% better semantic coherence than the non-persona-based counterparts.
URL: https://openreview.net/forum?id=fzP4qIiVIh
---
Title: Tweedie Moment Projected Diffusions for Inverse Problems
Abstract: Diffusion generative models unlock new possibilities for inverse problems as they allow for the incorporation of strong empirical priors into the process of scientific inference. Recently, diffusion models have been repurposed for solving inverse problems using Gaussian approximations to conditional densities of the reverse process via Tweedie’s formula to parameterise the mean, complemented with various heuristics. To address various challenges arising from these approximations, we leverage higher-order information using Tweedie’s formula and obtain a statistically principled approximation. We further provide a theoretical guarantee specifically for posterior sampling, which can lead to a better theoretical understanding of diffusion-based conditional sampling. Finally, we illustrate the empirical effectiveness of our approach for general linear inverse problems on toy synthetic examples as well as image restoration. We show that our method (i) removes any time-dependent step-size hyperparameters required by earlier methods, (ii) brings stability and better sample quality across multiple noise levels, and (iii) is the only method that works stably with variance-exploding (VE) forward processes, as opposed to earlier works.
URL: https://openreview.net/forum?id=4unJi0qrTE
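For reference, the first- and second-order Tweedie identities underlying the approach, stated for a variance-exploding corruption x_t = x_0 + sigma_t * eps: the posterior mean is x_t + sigma_t^2 * score, and the posterior covariance adds the score's Jacobian. A small sketch with a score network as a placeholder; how these moments are projected into the guidance term for inverse problems is the paper's contribution and is not reproduced here.

import torch

def tweedie_moments(score_fn, x_t, sigma_t):
    """Tweedie identities for x_t = x_0 + sigma_t * eps (VE process):
      E[x_0 | x_t]   = x_t + sigma_t**2 * s(x_t)
      Cov[x_0 | x_t] = sigma_t**2 * (I + sigma_t**2 * ds(x_t)/dx_t)
    where s is the score of the noisy marginal, approximated here by `score_fn(x, sigma)`.
    The full Jacobian below is only feasible for small dimensions; practical methods approximate it."""
    d = x_t.numel()
    s = score_fn(x_t, sigma_t)
    mean = x_t + sigma_t ** 2 * s
    jac = torch.autograd.functional.jacobian(lambda z: score_fn(z, sigma_t), x_t)
    cov = sigma_t ** 2 * (torch.eye(d) + sigma_t ** 2 * jac.reshape(d, d))
    return mean, cov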
---
Title: Growing Tiny Networks: Spotting Expressivity Bottlenecks and Fixing Them Optimally
Abstract: Machine learning tasks are generally formulated as optimization problems, where one searches for an optimal function within a certain functional space. In practice, parameterized functional spaces are considered, in order to be able to perform gradient descent. Typically, a neural network architecture is chosen and fixed, and its parameters (connection weights) are optimized, yielding an architecture-dependent result. However, this forces the evolution of the function during training to lie within what is expressible with the chosen architecture, and prevents any optimization across architectures. Costly architectural hyper-parameter optimization is often performed to compensate for this. Instead, we propose to adapt the architecture on the fly during training. We show that the information about desirable architectural changes, due to expressivity bottlenecks when attempting to follow the functional gradient, can be extracted from backpropagation. To do this, we propose a mathematical definition of expressivity bottlenecks, which enables us to detect, quantify and solve them while training, by adding suitable neurons. Thus, while the standard approach requires large networks, in terms of the number of neurons per layer, for expressivity and optimization reasons, we are able to start with very small neural networks and let them grow appropriately. As a proof of concept, we show results on the CIFAR dataset, matching large neural network accuracy, with competitive training time, while removing the need for standard architectural hyper-parameter search.
URL: https://openreview.net/forum?id=hbtG6s6e7r
---
Title: Multi-Attribute Constraint Satisfaction via Language Model Rewriting
Abstract: Obeying precise constraints on top of multiple external attributes is a common computational problem underlying seemingly different domains, from controlled text generation to protein engineering. Existing language model (LM) controllability methods for multi-attribute constraint satisfaction often rely on specialized architectures or gradient-based classifiers, limiting their flexibility to work with arbitrary black-box evaluators and pretrained models. Current general-purpose large language models, while capable, cannot achieve fine-grained multi-attribute control over external attributes. Thus, we create Multi-Attribute Constraint Satisfaction (MACS), a generalized method capable of finetuning language models on any sequential domain to satisfy user-specified constraints on multiple external real-valued attributes. Our method trains LMs as editors by sampling diverse multi-attribute edit pairs from an initial set of paraphrased outputs. During inference, the LM iteratively improves upon its previous solution to satisfy constraints for all attributes. We additionally experiment with offline Reinforcement Learning (RL) methods that improve the constraint satisfaction rate of LMs. To evaluate our approach, we present a new Fine-grained Constraint Satisfaction (FineCS) benchmark, featuring two challenging tasks: (1) Text Style Transfer, where the goal is to simultaneously modify the sentiment and complexity of reviews, and (2) Protein Design, focusing on modulating the fluorescence and stability of Green Fluorescent Proteins (GFP). Our empirical results show that MACS achieves the highest threshold satisfaction in both FineCS tasks, outperforming strong domain-specific baselines. Our work opens new avenues for flexible, generalized multi-attribute control, with implications for diverse applications spanning natural language processing and bioinformatics.
URL: https://openreview.net/forum?id=3q1bUIHTJK
---
Title: Data Augmentation Policy Search for Long-Term Forecasting
Abstract: Data augmentation serves as a popular regularization technique to combat overfitting challenges in neural networks. While automatic augmentation has demonstrated success in image classification tasks, its application to time-series problems, particularly in long-term forecasting, has received comparatively less attention. To address this gap, we introduce a time-series automatic augmentation approach named TSAA, which is both efficient and easy to implement. The solution involves tackling the associated bilevel optimization problem through a two-step process: initially training a non-augmented model for a limited number of epochs, followed by an iterative split procedure. During this iterative process, we alternate between identifying a robust augmentation policy through Bayesian optimization and refining the model while discarding suboptimal runs. Extensive evaluations on challenging univariate and multivariate forecasting benchmark problems demonstrate that TSAA consistently outperforms several robust baselines, suggesting its potential integration into prediction pipelines.
URL: https://openreview.net/forum?id=Wnd0XY0twh
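A minimal sketch of the two-step procedure described above: a short non-augmented warm-up, then alternating between proposing an augmentation policy and partially refining the model while discarding poor runs. The callables passed in are placeholders, `propose_policy` stands in for the Bayesian-optimization proposal step, and the pruning rule is an assumption.

def tsaa_search(init_model, train, validate, propose_policy, n_rounds=10, keep_top=3):
    """Sketch of a TSAA-style bilevel search for a long-term-forecasting model.
    train(model, policy, epochs) fine-tunes under an augmentation policy;
    validate(model) returns a validation error; propose_policy(history) proposes a policy."""
    model = train(init_model, policy=None, epochs=5)       # step 1: short non-augmented warm-up
    history, candidates = [], []
    for _ in range(n_rounds):                              # step 2: iterative split procedure
        policy = propose_policy(history)                   # propose an augmentation policy
        refined = train(model, policy=policy, epochs=2)    # partial refinement under that policy
        err = validate(refined)
        history.append((policy, err))
        candidates.append((err, refined))
        candidates = sorted(candidates, key=lambda t: t[0])[:keep_top]   # discard suboptimal runs
    return candidates[0][1], history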
---