Accepted papers
===============
Title: A Note on "Assessing Generalization of SGD via Disagreement"
Authors: Andreas Kirsch, Yarin Gal
Abstract: Several recent works find empirically that the average test error of deep neural networks can be estimated via the prediction disagreement of models, which does not require labels. In particular, Jiang et al. (2022) show, for the disagreement between two separately trained networks, that this "Generalization Disagreement Equality" follows from the well-calibrated nature of deep ensembles under their proposed notion of "class-aggregated calibration." In this reproduction, we show that the suggested theory might be impractical: a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, and labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis-space view employed by Jiang et al. (2022).
URL: https://openreview.net/forum?id=oRP8urZ8Fx
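A minimal sketch of the label-free estimator the abstract refers to (not the authors' code): under the Generalization Disagreement Equality, the expected test error of two well-calibrated, independently trained models approximately equals their expected disagreement rate, which can be computed without labels. The model predictions below are simulated stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # number of unlabeled test points

# Hypothetical hard predictions from two separately trained networks on a
# 10-class problem; model B agrees with model A 90% of the time and
# otherwise predicts a fresh random class.
preds_a = rng.integers(0, 10, size=n)
preds_b = np.where(rng.random(n) < 0.9, preds_a, rng.integers(0, 10, size=n))

# Label-free estimate: the fraction of points where the two models disagree
# serves as a proxy for the average test error.
disagreement = float(np.mean(preds_a != preds_b))
print(f"disagreement rate: {disagreement:.3f}")
```

The note's caveat is that this proxy relies on calibration holding on the new data, which itself cannot be checked without labels.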
---
New submissions
===============
Title: Blind Sequence Denoising with Self-Supervised Set Learning
Abstract: Denoising discrete-valued sequences typically relies on training a supervised model on ground-truth sources or fitting a statistical model of a noisy channel. Biological sequence analysis presents a unique challenge for both approaches, as obtaining ground-truth sequences is resource-intensive and the complexity of sequencing errors makes it difficult to specify an accurate noise model. Recent developments in DNA sequencing have opened an avenue for tackling this problem by producing long DNA reads consisting of multiple subreads, or noisy observations of the same sequence, that can be denoised together. Inspired by this context, we propose a novel method for denoising sets of sequences that does not require access to clean sources. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent space and sequence space. This set embedding represents the “average” of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL-denoised sequences contain 31% fewer errors compared to a traditional denoising algorithm based on a multi-sequence alignment (MSA) of the subreads. When very few subreads are available or high error rates lead to poor alignment, SSSL reduces errors by an even greater margin. On an experimental dataset of antibody sequences, SSSL improves over the MSA-based algorithm on two proposed self-supervised metrics, with a significant difference on difficult reads with fewer than ten subreads that comprise over 75% of the test set. SSSL promises to better realize the potential of high-throughput DNA sequencing data.
URL: https://openreview.net/forum?id=JHTnT9vezG
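A minimal illustrative sketch of the core idea described in the abstract (not the SSSL implementation): embed each noisy subread, then estimate a single set embedding as the midpoint of the subread embeddings in the latent space. The `encode` function here is a hypothetical placeholder standing in for a learned encoder, and no decoder is shown.

```python
import numpy as np

def encode(subread: str, dim: int = 8) -> np.ndarray:
    # Stand-in encoder: bucket character codes into a fixed-size vector.
    # A real SSSL encoder would be a learned neural network.
    vec = np.zeros(dim)
    for i, ch in enumerate(subread):
        vec[i % dim] += ord(ch)
    return vec / max(len(subread), 1)

# Hypothetical noisy observations (subreads) of one underlying sequence.
subreads = ["ACGTACGT", "ACGTACCT", "ACGAACGT"]
embeddings = np.stack([encode(s) for s in subreads])

# Set embedding: the midpoint (mean) of the subread embeddings, which in
# SSSL would then be decoded into a prediction of the clean sequence.
set_embedding = embeddings.mean(axis=0)
print(set_embedding.shape)
```

In the paper's setting, the midpoint is taken in both the latent space and the sequence space; this toy version shows only the latent-space averaging step.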
---