Daily TMLR digest for Jun 12, 2024


TMLR

Jun 12, 2024, 12:00:09 AM
to tmlr-anno...@googlegroups.com


New certifications
==================



Featured Certification: Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement

Jonathan Pirnay, Dominik G. Grimm

https://openreview.net/forum?id=agT8ojoH0X

---


Accepted papers
===============


Title: FastDoc: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy

Authors: Abhilash Nandy, Manav Nitin Kapadnis, Sohan Patnaik, Yash Parag Butala, Pawan Goyal, Niloy Ganguly

Abstract: In this paper, we propose FastDoc (Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy), a novel, compute-efficient framework that utilizes document metadata and a domain-specific taxonomy as supervision signals to continually pre-train a transformer encoder on a domain-specific corpus. The main innovation is that during domain-specific pre-training, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents), whereas fine-tuning is done with token-level embeddings as inputs to the same encoder. We perform such domain-specific pre-training on three different domains, namely the customer support, scientific, and legal domains, and compare performance on 6 different downstream tasks and 9 different datasets. The novel use of document-level supervision along with sentence-level embedding inputs for pre-training reduces pre-training compute by around 1,000, 4,500, and 500 times compared to MLM and/or NSP in the Customer Support, Scientific, and Legal domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact, we show that FastDoc either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal domains. Moreover, reduced training aids in mitigating the risk of catastrophic forgetting. Thus, unlike the baselines, FastDoc shows a negligible drop in performance on open-domain tasks.

URL: https://openreview.net/forum?id=RA4yRhjoXw
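
Editor's note: a minimal sketch, assuming PyTorch, of the two input granularities the abstract describes — sentence-level embeddings with document-level taxonomy supervision during continual pre-training, and ordinary token-level inputs to the same encoder at fine-tuning. The class name, heads, and all hyperparameters below are illustrative and are not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedEncoder(nn.Module):
    """One transformer encoder reused for both phases (illustrative, not the paper's code):
    continual pre-training takes sentence-level embeddings supervised by a document-level
    taxonomy label; fine-tuning takes ordinary token-level inputs."""
    def __init__(self, d_model=768, n_heads=12, n_layers=6, vocab_size=30522,
                 n_taxonomy_classes=50, n_task_classes=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.token_emb = nn.Embedding(vocab_size, d_model)            # used only at fine-tuning
        self.taxonomy_head = nn.Linear(d_model, n_taxonomy_classes)   # document-level supervision
        self.task_head = nn.Linear(d_model, n_task_classes)           # downstream task head

    def pretrain_step(self, sentence_embs, taxonomy_labels):
        # sentence_embs: (batch, n_sentences, d_model), e.g. from a frozen sentence encoder
        h = self.encoder(sentence_embs).mean(dim=1)                   # pool over sentences
        return F.cross_entropy(self.taxonomy_head(h), taxonomy_labels)

    def finetune_step(self, token_ids, task_labels):
        # token_ids: (batch, seq_len), ordinary token-level inputs to the same encoder
        h = self.encoder(self.token_emb(token_ids)).mean(dim=1)
        return F.cross_entropy(self.task_head(h), task_labels)

# Toy usage with random data, just to show the two input granularities.
model = SharedEncoder(n_layers=2)
loss_pt = model.pretrain_step(torch.randn(4, 32, 768), torch.randint(0, 50, (4,)))
loss_ft = model.finetune_step(torch.randint(0, 30522, (4, 128)), torch.randint(0, 2, (4,)))
```

Feeding one vector per sentence rather than one per token is what shrinks the pre-training sequence length, and hence the compute, for long documents.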

---

Title: Self-Improvement for Neural Combinatorial Optimization: Sample Without Replacement, but Improvement

Authors: Jonathan Pirnay, Dominik G. Grimm

Abstract: Current methods for end-to-end constructive neural combinatorial optimization usually train a policy using behavior cloning from expert solutions or policy gradient methods from reinforcement learning. While behavior cloning is straightforward, it requires expensive expert solutions, and policy gradient methods are often computationally demanding and complex to fine-tune. In this work, we bridge the two and simplify the training process by sampling multiple solutions for random instances using the current model in each epoch and then selecting the best solution as an expert trajectory for supervised imitation learning. To achieve progressively improving solutions with minimal sampling, we introduce a method that combines round-wise Stochastic Beam Search with an update strategy derived from a provable policy improvement. This strategy refines the policy between rounds by utilizing the advantage of the sampled sequences with almost no computational overhead. We evaluate our approach on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The models trained with our method achieve comparable performance and generalization to those trained with expert data. Additionally, we apply our method to the Job Shop Scheduling Problem using a transformer-based architecture and outperform existing state-of-the-art methods by a wide margin.

URL: https://openreview.net/forum?id=agT8ojoH0X
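
Editor's note: a minimal sketch, assuming PyTorch, of the self-improvement loop the abstract describes — sample several tours per random instance with the current policy, keep the shortest as a pseudo-expert trajectory, and imitate it with a supervised loss. Plain multinomial sampling stands in for the paper's round-wise Stochastic Beam Search, and the tiny policy network is purely illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPolicy(nn.Module):
    # Illustrative stand-in for a constructive NCO model: scores each unvisited city
    # from the coordinates of the current city and the candidate.
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def logits(self, cur, candidates):          # cur: (2,), candidates: (m, 2)
        pairs = torch.cat([cur.expand(len(candidates), 2), candidates], dim=1)
        return self.net(pairs).squeeze(-1)      # (m,)

def rollout(policy, coords):
    """Sample one tour; return (visit order, tour length, summed log-prob of the tour)."""
    n = coords.size(0)
    order, logps = [0], []
    for _ in range(n - 1):
        rem = [i for i in range(n) if i not in order]
        probs = F.softmax(policy.logits(coords[order[-1]], coords[rem]), dim=0)
        idx = int(torch.multinomial(probs, 1))
        logps.append(torch.log(probs[idx]))
        order.append(rem[idx])
    tour = coords[order]
    length = (tour - tour.roll(-1, 0)).norm(dim=1).sum()
    return order, length, torch.stack(logps).sum()

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(3):                                        # self-improvement rounds
    coords = torch.rand(10, 2)                               # a random TSP instance
    samples = [rollout(policy, coords) for _ in range(16)]   # best-of-K sampling, K = 16
    _, best_len, best_logp = min(samples, key=lambda s: s[1].item())
    loss = -best_logp    # imitate the best sampled tour (the pseudo-expert trajectory)
    opt.zero_grad(); loss.backward(); opt.step()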

---


New submissions
===============


Title: A Probabilistic Model behind Self-Supervised Learning

Abstract: In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels. A common task is to classify augmentations or different modalities of the data, which share semantic _content_ (e.g. an object in an image) but differ in _style_ (e.g. the object's location). Many approaches to self-supervised learning have been proposed, e.g. SimCLR, CLIP, and VICReg, which have recently gained much attention for their representations achieving downstream performance comparable to supervised learning. However, a theoretical understanding of the mechanism behind self-supervised methods remains elusive. Addressing this, we present a generative latent variable model for self-supervised learning and show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations, providing a unifying theoretical framework for these methods. The proposed model also justifies connections drawn to mutual information and the use of a "projection head". Learning representations by fitting the model generatively (termed SimVAE) improves performance over discriminative and other VAE-based methods on simple image benchmarks and significantly narrows the gap between generative and discriminative representation learning in more complex settings. Importantly, as our analysis predicts, SimVAE outperforms discriminative self-supervised learning on tasks where style information is required, taking an important step toward understanding self-supervised methods and achieving task-agnostic representations.

URL: https://openreview.net/forum?id=QEwz7447tR
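
Editor's note: a heavily simplified sketch, assuming PyTorch, of the generic idea of fitting a generative latent variable model to pairs of augmentations: each view is reconstructed through a VAE while the views' latent means are pulled together as a crude stand-in for a shared content variable. This is not the authors' SimVAE objective, whose hierarchical prior and exact loss differ; all names and weights below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairVAE(nn.Module):
    """Toy VAE over pairs of augmentations ("views") of the same image; not the SimVAE objective."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterisation trick
        return z, mu, logvar

    def loss(self, x1, x2):
        # x1, x2: two augmentations sharing content but differing in style.
        (z1, m1, lv1), (z2, m2, lv2) = self.encode(x1), self.encode(x2)
        recon = F.mse_loss(self.dec(z1), x1) + F.mse_loss(self.dec(z2), x2)
        kl = -0.5 * (1 + lv1 - m1.pow(2) - lv1.exp()).sum(-1).mean() \
             - 0.5 * (1 + lv2 - m2.pow(2) - lv2.exp()).sum(-1).mean()
        shared_content = F.mse_loss(m1, m2)   # crude stand-in for a shared content latent
        return recon + 1e-3 * kl + shared_content

model = PairVAE()
loss = model.loss(torch.rand(8, 784), torch.rand(8, 784))         # a toy batch of view pairs
```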

---

Title: Data Valuation in the Absence of a Reliable Validation Set

Abstract: Data valuation plays a pivotal role in ensuring data quality and equitably compensating data contributors. Existing game-theoretic data valuation techniques mostly rely on the availability of a high-quality validation set for their efficacy. However, the feasibility of obtaining a clean validation set drawn from the test distribution may be limited in practice. In this work, we show that the choice of validation set can significantly impact the final data value scores. To mitigate this, we introduce a general paradigm that converts a traditional validation-based game-theoretic data valuation method into a validation-free alternative. Specifically, we utilize the cross-validation error as a surrogate for the model's performance on a validation set. As computing the cross-validation error can be computationally expensive, we propose using the cross-validation error of a kernel regression model as an effective and efficient surrogate for the true performance score on the population. We compare the performance of the validation-free variants of existing data valuation techniques with their original validation-based counterparts. Our results indicate that the validation-free variants generally match, and often significantly surpass, the performance of their validation-based counterparts.

URL: https://openreview.net/forum?id=xBORyL316c
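
Editor's note: a minimal sketch, assuming NumPy, of the validation-free recipe the abstract describes — the leave-one-out cross-validation error of a kernel regressor fit on a data subset serves as the utility, and Monte Carlo Shapley values are built from marginal contributions. The Nadaraya-Watson regressor, the estimator, and all names below are illustrative, not the paper's exact method.

```python
import numpy as np

def loo_cv_score(X, y, bandwidth=0.5):
    """Negative leave-one-out CV error of a kernel regressor fit on (X, y):
    the validation-free surrogate for performance on a clean validation set."""
    if len(X) < 2:
        return 0.0
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w = np.exp(-np.sum((X[mask] - X[i]) ** 2, axis=1) / (2 * bandwidth ** 2))
        pred = np.dot(w, y[mask]) / (w.sum() + 1e-12)
        errs.append((pred - y[i]) ** 2)
    return -float(np.mean(errs))

def monte_carlo_shapley(X, y, utility, n_perms=100, rng=np.random.default_rng(0)):
    # Average each point's marginal contribution to the utility over random permutations.
    n = len(X)
    values = np.zeros(n)
    for _ in range(n_perms):
        perm = rng.permutation(n)
        prev = utility(X[perm[:0]], y[perm[:0]])
        for k in range(1, n + 1):
            cur = utility(X[perm[:k]], y[perm[:k]])
            values[perm[k - 1]] += cur - prev
            prev = cur
    return values / n_perms

X, y = np.random.rand(20, 3), np.random.rand(20)
scores = monte_carlo_shapley(X, y, loo_cv_score)   # per-point data values, no validation set
```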

---