# Weekly TMLR digest for Jul 03, 2022

1 view

### TMLR

Jul 2, 2022, 8:00:10 PMJul 2

Accepted papers
===============

Title: The Graph Cut Kernel for Ranked Data

Authors: Michelangelo Conserva, Marc Peter Deisenroth, K S Sesh Kumar

Abstract: Many algorithms for ranked data become computationally intractable as the number of objects grows due to the complex geometric structure induced by rankings. An additional challenge is posed by partial rankings, i.e. rankings in which the preference is only known for a subset of all objects. For these reasons, state-of-the-art methods cannot scale to real-world applications, such as recommender systems. We address this challenge by exploiting the geometric structure of ranked data and additional available information about the objects to derive a kernel for ranking based on the graph cut function. The graph cut kernel combines the efficiency of submodular optimization with the theoretical properties of kernel-based methods. We demonstrate that our novel kernel drastically reduces the computational cost while maintaining the same accuracy as state-of-the-art methods.

---

Title: NoiLin: Improving adversarial training and correcting stereotype of noisy labels

Authors: Jingfeng Zhang, Xilie Xu, Bo Han, Tongliang Liu, Lizhen Cui, Gang Niu, Masashi Sugiyama

Abstract: Adversarial training (AT) formulated as the minimax optimization problem can effectively enhance the model's robustness against adversarial attacks. The existing AT methods mainly focused on manipulating the inner maximization for generating quality adversarial variants or manipulating the outer minimization for designing effective learning objectives. However, empirical results of AT always exhibit the robustness at odds with accuracy and the existence of the cross-over mixture problem, which motivates us to study some label randomness for benefiting the AT. First, we thoroughly investigate noisy labels (NLs) injection into AT's inner maximization and outer minimization, respectively and obtain some observations on when NL injection benefits AT. Second, based on the observations, we propose a simple but effective method---NoiLIn that randomly injects NLs into training data at each training epoch and dynamically increases the NL injection rate once robust overfitting occurs. Empirically, NoiLIn can significantly mitigate the AT's undesirable issue of robust overfitting and even further improve the generalization of the state-of-the-art AT methods. Philosophically, NoiLIn sheds light on a new perspective of learning with NLs: NLs should not always be deemed detrimental, and even in the absence of NLs in the training set, we may consider injecting them deliberately.

---

Title: Multi-Agent Off-Policy TDC with Near-Optimal Sample and Communication Complexities

Authors: Ziyi Chen, Yi Zhou, Rong-Rong Chen

Abstract: The finite-time convergence of off-policy temporal difference (TD) learning has been comprehensively studied recently. However, such a type of convergence has not been established for off-policy TD learning in the multi-agent setting, which covers broader reinforcement learning applications and is fundamentally more challenging. This work develops a decentralized TD with correction (TDC) algorithm for multi-agent off-policy TD learning under Markovian sampling. In particular, our algorithm avoids sharing the actions, policies and rewards of the agents, and adopts mini-batch sampling to reduce the sampling variance and communication frequency. Under Markovian sampling and linear function approximation, we proved that the finite-time sample complexity of our algorithm for achieving an $\epsilon$-accurate solution is in the order of $\mathcal{O}\big(\frac{M\ln\epsilon^{-1}}{\epsilon(1-\sigma_2)^2}\big)$, where $M$ denotes the total number of agents and $\sigma_2$ is a network parameter. This matches the sample complexity of the centralized TDC. Moreover, our algorithm achieves the optimal communication complexity $\mathcal{O}\big(\frac{\sqrt{M}\ln\epsilon^{-1}}{1-\sigma_2}\big)$ for synchronizing the value function parameters, which is order-wise lower than the communication complexity of the existing decentralized TD(0). Numerical simulations corroborate our theoretical findings.

---

Title: Benchmarking and Analyzing Unsupervised Network Representation Learning and the Illusion of Progress

Authors: Saket Gurukar, Priyesh Vijayan, srinivasan parthasarathy, Balaraman Ravindran, Aakash Srinivasan, Goonmeet Bajaj, Chen Cai, Moniba Keymanesh, Saravana Kumar, Pranav Maneriker, Anasua Mitra, Vedang Patel

Abstract: A number of methods have been developed for unsupervised network representation learning -- ranging from classical methods based on the graph spectra to recent random walk based methods and from deep learning based methods to matrix factorization based methods. Each new study inevitably seeks to establish the relative superiority of the proposed method over others. The lack of a standard assessment protocol and benchmark suite often leave practitioners wondering if a new idea represents a significant scientific advance. In this work, we articulate a clear and pressing need to systematically and rigorously benchmark such methods. Our overall assessment -- a result of a careful benchmarking of 15 methods for unsupervised network representation learning on 16 non-attributed graphs (several with different characteristics) - is that many recently proposed improvements are somewhat of an illusion when assessed through the lens of downstream tasks such as link prediction and node classification. Specifically, we find that several proposed improvements are marginal at best and that aspects of many of these datasets often render such small differences insignificant, especially when viewed from a rigorous statistical lens. A more detailed analysis of our results identify several new insights: first, we find that classical methods, often dismissed or not considered by recent efforts, can compete on certain types of datasets if they are tuned appropriately; second, we find that from a qualitative standpoint, a couple of methods based on matrix factorization offer a small but not always consistent advantage over alternative methods; third, no single method completely outperforms other embedding methods on both node classification and link prediction tasks. Finally, we also present several analysis that reveals settings under which certain algorithms perform well (e.g., the role of neighborhood context and dataset properties that impact performance). An important outcome of this study is the benchmark and evaluation protocol, which practitioners may find useful for future research in this area.

---

New submissions
===============

Title: The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

Abstract: Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models.
Models that are more accurate on the out-of-distribution data relative to this baseline exhibit “effective robustness” and are exceedingly rare.
Identifying such models, and understanding their properties, is key to improving out-of-distribution performance.
We conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence.
We study how properties of the data influence effective robustness, and we show that it increases with the larger size, more diversity, and higher example difficulty of the dataset.
We also find that models that display effective robustness are able to correctly classify 10\% of the examples that no other current testbed model gets correct.
Finally, we discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.

---

Title: A Deeper Look at Optimal Transport for Imitation Learning

Abstract: Optimal transport (OT) tools have shown early promise for imitation learning (IL) and enable a metric-aware alignment of the expert and agent's stationary distributions. Despite encouraging results, the use of OT for IL is still ad hoc and lacks a systematic treatment, which could guide future research. To gain an understanding of these inner workings, we perform theoretical and empirical analysis of existing OT-based methods for IL from a unified perspective. Based on our study we propose OTIL, a simple and modality-agnostic method for IL. Our algorithm demonstrates state-of-the-art performance on a wide range of environments that feature both continuous and discrete action spaces, as well as state and image observations. We make our experimentation code public.

---

Title: Interpretable Node Representation with Attribute Decoding

Abstract: Variational Graph Autoencoders (VGAEs) are powerful models for unsupervised learning of node representations from graph data. In this work, we make a systematic analysis of modeling node attributes in VGAEs and show that attribute decoding is important for node representation learning. We further propose a new learning model, interpretable NOde Representation with Attribute Decoding (NORAD). The model encodes node representations in an interpretable approach: node representations capture community structures in the graph and the relationship between communities and node attributes. We further propose a rectifying procedure to refine node representations of isolated notes, which improves the quality of the representations of these nodes. Our empirical results demonstrate the advantage of the proposed model when learning graph data in an interpretable approach.

---

Title: Lookback for Learning to Branch

Abstract: The expressive and computationally inexpensive bipartite Graph Neural Networks (GNN) have been shown to be an important component of deep learning based Mixed-Integer Linear Program (MILP) solvers. Recent works have demonstrated the effectiveness of such GNNs in replacing the branching (variable selection) heuristic in branch-and-bound B&B solvers. These GNNs are trained, offline and on a collection of MILPs, to imitate a very good but computationally expensive branching heuristic, strong branching. Given that B&B results in a tree of sub-MILPs, we ask (a) whether there are strong dependencies exhibited by the target heuristic among the neighboring nodes of the B&B tree, and (b) if so, whether we can incorporate them in our training procedure. Specifically, we find that with the strong branching heuristic, a child node's best choice was often the parent's second-best choice. We call this the "lookback" phenomenon. Surprisingly, the typical branching GNN of Gasse et al. (2019) often misses this simple "answer". To imitate the target behavior more closely by incorporating the lookback phenomenon in GNNs, we propose two methods: (a) target smoothing for the standard cross-entropy loss function, and (b) adding a Parent-as-Target (PAT) Lookback regularizer term. Finally, we propose a model selection framework to incorporate harder-to-formulate objectives such as solving time in the final models. Through extensive experimentation on standard benchmark instances, we show that our proposal results in up to 22% decrease in the size of the B&B tree and up to 15% improvement in the solving times.

---

Title: INR-V: A Continuous Representation Space for Videos

Abstract: Images are considered a complete signal, whereas videos are usually broken down into a set of temporally coherent images. Consequently, an image space is leveraged for various video-based tasks, such as novel video generation and future video segment prediction. This limits the expressivity of videos to only image-based operations needing network designs to obtain temporally coherent trajectories in the image space. We propose INR-V, a video representation network that learns a continuous latent space directly for videos. INR-V regards videos as complete units parameterized by implicit neural representations (INRs), a multi-layered perceptron with only a few thousand parameters. A meta-network is then used to predict these few thousand parameters allowing it to learn a continuous space over the neural representations. Later, the meta-network can generate novel neural representations by sampling diverse points over the learned space leading to novel videos. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. In this work, we analyze INR-V's several video-based properties. For instance, we show smooth interpolation of coherent videos between two videos by traversing along their latent points in the underlying video space. Moreover, learning a video space allows the network to directly invert an unseen video to its latent point in the latent space. We show the various applications of video inversion. Lastly, INRs learn a continuous signal independent of the input dimension letting INR-V generate multi-resolution videos (like 32 x 32 or 100 x100) directly at inference without any finetuning or architectural changes. We conduct several comparisons and evaluate each of the properties, ultimately demonstrating the potential of a continuous representation space for videos.

---

Title: Emergent Abilities of Large Language Models

Abstract: Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

---

Title: The Alignment Problem in Curriculum Learning

Abstract: In curriculum learning, teaching involves cooperative selection of sequences of data via plans to facilitate efficient and effective learning.
One-off cooperative selection of data has been mathematically formalized as entropy-regularized optimal transport and the limiting behavior of myopic sequential interactions has been analyzed, both yielding theoretical and practical guarantees.
We recast sequential cooperation with curriculum planning in a reinforcement learning framework and analyze performance mathematically and by simulation.
We prove that infinite length plans are equivalent to not planning under certain assumptions on the method of planning, and isolate instances where monotonicity and hence convergence in the limit hold, as well as cases where it does not. We also demonstrate through simulations that argmax data selection is the same across planning horizons and demonstrate problem-dependent sensitivity of learning to the teacher's planning horizon. Thus, we find that planning ahead yields efficiency at the cost of effectiveness. This failure of alignment is illustrated in particular with grid world examples in which the teacher must attempt to steer the learner away from a particular location in order to reach the desired grid square. We conclude with implications and directions for efficient and effective curricula.

---

Title: Approximating 1-Wasserstein Distance with Trees

Abstract: Wasserstein distance, which measures the discrepancy between distributions, shows efficacy in various types of natural language processing (NLP) and computer vision (CV) applications. One of the challenges in estimating Wasserstein distance is that it is computationally expensive and does not scale well for many distribution comparison tasks. In this paper, we aim to approximate the 1-Wasserstein distance by the tree-Wasserstein distance (TWD), where TWD is a 1-Wasserstein distance with tree-based embedding and can be computed in linear time with respect to the number of nodes on a tree. More specifically, we propose a simple yet efficient L1-regularized approach to learning the weights of the edges in a tree. To this end, we first show that the 1-Wasserstein approximation problem can be formulated as a distance approximation problem using the shortest path distance on a tree. We then show that the shortest path distance can be represented by a linear model and can be formulated as a Lasso-based regression problem. Owing to the convex formulation, we can obtain a globally optimal solution efficiently. Moreover, we propose a tree-sliced variant of these methods. Through experiments, we demonstrated that the weighted TWD can accurately approximate the original 1-Wasserstein distance.

---

Title: Completeness and Coherence Learning for Fast Arbitrary Style Transfer

Abstract: Style transfer methods put a premium on two objectives: (1) completeness which encourages the encoding of a complete set of style patterns; (2) coherence which discourages the production of spurious artifacts not present in input styles. While existing methods pursue the two objectives either partially or implicitly, we present the Completeness and Coher- ence Network (CCNet) which jointly learns completeness and coherence components and rejects their incompatibility, both in an explicit manner. Specifically, we develop an attention mechanism integrated with bi-directional softmax operations for explicit imposition of completeness and coherence and for collaborative modelling. We also propose ccLoss as a quantitative measure for evaluating the quality of a stylized image in terms of completeness and coherence. Through an empirical evaluation, we demonstrate that compared with exist- ing methods, our method strikes a better tradeoff between computation costs, generalization ability and stylization quality.

---

Title: Differentiable Model Compression via Pseudo Quantization Noise

Abstract: We propose DiffQ a differentiable method for model compression for quantizing model parameters without gradient approximations (e.g., Straight Through Estimator). We suggest adding independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. DiffQ is differentiable both with respect to the unquantized weights and the number of bits used. Given a single hyper-parameter balancing between the quantized model size and accuracy, DiffQ optimizes the number of bits used per individual weight or groups of weights, in end-to-end training. We experimentally verify that our method outperforms state-of-the-art quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the ImageNet dataset, DiffQ compresses a 12 layers transformer-based model by more than a factor of 8, (lower than 4 bits precision per weight on average), with a loss of 0.3\% in model accuracy. Code is available in the supplementary material.

---

Title: Using the Krylov Subspace Formulation to Improve Regularisation and Interpretation in Partial Least Squares Regression

Abstract: Partial least squares regression (PLS-R) has been an important regression method in the life sciences and many other fields for decades. However, PLS-R is typically solved using an algorithmic approach, rather than through an optimisation formulation and procedure. There is a clear optimisation formulation of the PLS-R problem based on a Krylov subspace formulation, but it is only rarely considered. The popularity of PLS-R is attributed to the ability to interpret the data through the model components, but the model components are not available when solving the PLS-R problem using the Krylov subspace formulation. We therefore highlight a simple reformulation of the PLS-R problem using the Krylov subspace formulation as a promising modelling framework for PLS-R, and illustrate one of the main benefits of this reformulation---namely that it allows arbitrary penalty terms of the regression coefficients to be included in the PLS-R model. Further, we propose an approach to estimate the PLS-R model components for the solution found through the Krylov subspace formulation, that are those we would have obtained had we been able to use the common algorithms for estimating the PLS-R model. We illustrate the utility of the proposed method on simulated and real data.

---

Title: Data Models for Dataset Drift Controls in Machine Learning With Images

Abstract: Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode are performance drops due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This makes it difficult to create physically faithful drift test cases or to provide precise specifications of data models that should be avoided during the deployment of a machine learning model. In this study, we demonstrate how these shortcomings can be overcome by pairing machine learning robustness validation with physical optics. We examine the role raw sensor data and differentiable data models can play in controlling performance risks related to image dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases. The experiments presented here show that the average decrease in model performance is ten to four times less severe than under post-hoc augmentation testing. Second, the gradient connection between machine learning model and our data models allows for drift forensics that can be used to specify performance-sensitive data models which should be avoided during deployment of a machine learning model. Third, drift adjustment opens up the possibility for processing adjustments in the face of drift. This can lead to speed up and stabilization of classifier training at a margin of up to 20% in validation accuracy. Alongside our data model code we release two datasets to the public that we collected as part of this work. In total, the two datasets, Raw-Microscopy and Raw-Drone, comprise 1,488 scientifically calibrated reference raw sensor measurements, 8,928 raw intensity variations as well as 17,856 images processed through our data models with twelve different configurations. A guide to access the open code and datasets is available at https://anonymous.4open.science/r/tmlr/README.md.

---

Title: Fingerprints of Super Resolution Networks

Abstract: Several recent studies have demonstrated that deep-learning based image generation models, such as GANs, can be uniquely identified, and possibly even reverse-engineered, by the fingerprints they leave on their output images. We extend this research to single image super-resolution (SISR) networks. Compared to previously studied models, SISR networks are a uniquely challenging class of image generation model from which to extract and analyze fingerprints, as they can often generate images that closely match the corresponding ground truth and thus likely leave little flexibility to embed signatures. We take SISR models as examples to investigate if the findings from the previous work on fingerprints of GAN-based networks are valid for general image generation models. We show that SISR networks with a high upscaling factor or trained using adversarial loss leave highly distinctive fingerprints, and that under certain conditions, some SISR network hyperparameters can be reverse-engineered from these fingerprints.

---

Title: A mathematical model for estimating anti-learning when a decision tree solves the parity bit problem

Abstract: On some data, machine learning displays anti-learning; this means, in the most surprising scenario, that the more examples you place in the training set, the worse the accuracy becomes, until it becomes $0\%$ on the test set. We produce a framework in which this kind of anti-learning can be reproduced and studied theoretically. We deduce a formula estimating anti-learning when decision trees (one of the most important tools of machine learning) solve the parity bit problem (one of the most famously tricky problems of machine learning). Our estimation formula (deduced under certain mathematical assumptions) agrees very well with experimental results (produced on random data without these assumptions).

---

Title: On the Paradox of Certified Training

Abstract: Certified defenses based on convex relaxations are an established technique for training provably robust models. The key component is the choice of relaxation, varying from simple intervals to tight polyhedra. Counterintuitively, loose interval-based training often leads to higher certified robustness than what can be achieved with tighter relaxations, which is a well-known but poorly understood paradox. While recent works introduced various improvements aiming to circumvent this issue in practice, the fundamental problem of training models with high certified robustness remains unsolved. In this work, we investigate the underlying reasons behind the paradox and identify two key properties of relaxations, beyond tightness, that impact certified training dynamics: continuity and sensitivity. Our extensive experimental evaluation with a number of popular convex relaxations provides strong evidence that these factors can explain the drop in certified robustness observed for tighter relaxations. We also systematically explore modifications of existing relaxations and discover that improving unfavorable properties is challenging, as such attempts often harm other properties, revealing a complex tradeoff. Our findings represent an important first step towards understanding the intricate optimization challenges involved in certified training.

---