UMI-aware deduplication

Alejandro Pezzulo

Apr 9, 2019, 6:14:46 PM
to Smart-3SEQ
Hi,

What's your favorite current way to process Smart-3SEQ data with UMI-aware deduplication? We just got data back from our first run, for which we used the legacy single-P7 indexing method for 24 samples pooled pre-SPRI.

We're currently following the steps recommended in the protocol, and are playing with Kallisto and others as alternative methods.

Thank you,
Alejandro Pezzulo

Joe Foley

Apr 10, 2019, 1:53:55 PM
to smart...@googlegroups.com
I don't favor deduplication for this protocol. With any variant of 3SEQ we expect many duplicate reads by chance, because we're only sampling sequences from narrow windows upstream of the ends of expressed genes: in any library of moderate sequencing depth, there are likely more reads in your data than possible distinct reads from the sample, and that's before considering factors that lower library complexity, like small amounts of degraded material.

UMIs are supposed to help with this, but they don't entirely solve the problem, because duplicate UMIs also arise by chance: our design uses UMIs of length 5, i.e. 1024 possible sequences (and there is detectable bias in their representation), so if you happen to get thousands of reads from a highly expressed transcript, you are guaranteed many duplicate UMIs from non-duplicate library molecules. Longer UMIs would reduce the probability of collisions, but not to zero. Thus even UMI-based deduplication can introduce new noise (generally a bias against highly expressed transcripts); on the other hand, I'm not convinced that the amount of PCR amplification noise it removes is very large in the first place. So it may do more harm than good.

We can see that ambiguity empirically with the ERCC-only libraries in the manuscript. Attached are several draft versions of the standard-curve figure across multiple dilutions, PCR conditions, ERCC mixes, and replicates (the final version is figure S6). "raw.pdf" shows the original read counts, where the linear fits are already pretty good. "naive.pdf" shows the counts after deduplication with the simplistic algorithm used in most software, which assumes no UMIs are ever duplicated by chance; you can see how it bends the top right of the curve away from the expected line. "weighted_average.pdf" shows the results of a smarter deduplication method from Fumiaki Katagiri (doi: 10.15252/embj.201796529), which uses the relative counts of the different UMIs to set a tolerance for expected duplicates; "weighted_average2.pdf" is my extension of that algorithm to handle arbitrarily high read counts (described further in the "Detection of duplicate reads" section of the manuscript). My interpretation is that the "weighted_average2" algorithm almost compensates for the new bias introduced by deduplication, but the correlations are only slightly better and many of the distributions are shifted off the expected line, so I'm not convinced it actually improves any results.
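For intuition about why the naive counts bend downward, the collision model can be inverted to estimate the true molecule count from the number of distinct UMIs observed. To be clear, this is not Katagiri's weighted-average algorithm (which also uses per-UMI read counts); it's just the simplest uniform-UMI collision correction, sketched here for illustration:

```python
import math

def naive_dedup_count(umis_seen: int) -> int:
    """What naive deduplication reports: one molecule per distinct UMI."""
    return umis_seen

def collision_corrected_count(umis_seen: int, n_umis: int = 4**5) -> float:
    """Invert E[distinct] = K*(1-(1-1/K)^n) to estimate the true number
    of molecules n from the observed number of distinct UMIs.
    Blows up as umis_seen approaches n_umis (the UMI space saturates,
    and the count is no longer recoverable)."""
    if umis_seen >= n_umis:
        raise ValueError("UMI space saturated; true count not recoverable")
    return math.log(1.0 - umis_seen / n_umis) / math.log(1.0 - 1.0 / n_umis)

# Seeing 1,016 distinct 5-nt UMIs at one site is consistent with
# roughly 5,000 true molecules, not 1,016:
print(round(collision_corrected_count(1016)))  # → 4966
```

Note how steep the correction is near saturation: the estimate becomes extremely sensitive to small changes in the observed UMI count, which is one way deduplication can add noise rather than remove it.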

All of these algorithms are implemented in a standalone program, "umi-dedup" (https://github.com/jwfoley/umi-dedup), which is called by the 3SEQtools shell scripts. It grew out of a collaboration with Maxime Turgeon to develop a more rigorous deduplication algorithm, which turned out to be much less practical to use, and nobody involved had the time and motivation to finish the project. Unfortunately we never got around to optimizing the program, so it's still quite slow. My recommendation, then, is to use the "-d" option in "3SEQtools/align_smart-3seq.sh" (or equivalent) to skip duplicate marking altogether, in the interest of time; or you can leave it on for the sake of an additional QC metric (e.g. figures S5B, S11A, S24B, S27A in the manuscript). Either way, I wouldn't discard the duplicate reads for downstream analysis.
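For concreteness, a sketch of the two invocations I mean. The "-d" flag is the real option; the index and FASTQ arguments shown are placeholders, so check the script's own usage text for the actual argument layout:

```
# Fastest: skip duplicate marking altogether with -d
# (<STAR_index> and <sample>.fastq.gz are illustrative placeholders)
bash 3SEQtools/align_smart-3seq.sh -d <STAR_index> <sample>.fastq.gz

# Alternatively, omit -d to keep duplicate marking as a QC metric,
# but still include duplicate-flagged reads in downstream counting.
bash 3SEQtools/align_smart-3seq.sh <STAR_index> <sample>.fastq.gz
```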

raw.pdf
naive.pdf
weighted_average.pdf
weighted_average2.pdf