I don't favor deduplication for this protocol. With any variant of
3SEQ we expect a lot of duplicate reads by chance, because we're
only sampling sequences from narrow windows upstream of the ends of
expressed genes. In any given library with moderate sequencing
depth, it's therefore likely there are more reads in your data than
distinct molecules the sample could have produced, and that's before
considering factors that may lower library complexity, like small
amounts of degraded material. UMIs are supposed to help with this,
but they don't entirely solve the problem, because it's also
possible to get duplicate UMIs by chance: our design uses UMIs of
length 5, or 1024 possible sequences (and there is detectable bias
in the sequence representation), so if you happen to get thousands
of reads from a highly expressed transcript, you are guaranteed to
get many duplicate UMIs from non-duplicate library molecules. Longer
UMIs would reduce the probability of this happening, but not to
zero. Thus even UMI-based deduplication may introduce new noise,
generally in the form of a bias against highly expressed
transcripts; on the other hand, I'm not convinced that the amount of
PCR amplification noise it subtracts is very large in the first
place. So it may do more harm than good.
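To put a number on the chance-collision problem with 5-nt UMIs, here
is a back-of-envelope calculation. It's only a sketch under the
idealized assumption that all 1024 UMI sequences are equally likely
(the real sequence bias mentioned above makes collisions even more
common), and `expected_collisions` is just an illustrative name:

```python
# Back-of-envelope estimate of chance UMI collisions for one highly
# expressed transcript, assuming all k UMI sequences are equally
# likely (an idealization; real UMI usage is biased).
def expected_collisions(n, k=4**5):
    # expected number of distinct UMIs among n molecules:
    # k * (1 - (1 - 1/k)^n)
    distinct = k * (1 - (1 - 1 / k) ** n)
    # every molecule beyond the first to draw a given UMI looks like
    # a PCR duplicate to the deduplicator
    return n - distinct

print(round(expected_collisions(1000)))  # → 361
print(round(expected_collisions(5000)))  # → 3984
```

So with only a few thousand reads of one transcript, hundreds to
thousands of real molecules already look like duplicates by chance.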
We can see that ambiguity empirically with the ERCC-only libraries
in the manuscript. Attached are several draft versions of the
standard-curve figure, covering multiple dilutions, PCR conditions,
ERCC mixes, and replicates (the final version is figure S6).
"raw.pdf" shows the original read counts, where the linear fits are
already pretty good. "naive.pdf" shows the counts after
deduplication with the simplistic algorithm used in most software,
which assumes that no UMIs are ever duplicated by chance; you can
see how it bends the top-right part of the curve away from the
expected line.
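That simplistic approach can be sketched like this (illustrative
Python only, not the code of any particular tool): keep one read per
(alignment position, UMI) pair and discard the rest.

```python
# Naive UMI deduplication: assume every (position, UMI) pair
# represents exactly one original molecule, so keep one read per pair.
def naive_dedup(reads):
    seen = set()
    kept = []
    for pos, umi in reads:  # reads as (alignment position, UMI) tuples
        if (pos, umi) not in seen:
            seen.add((pos, umi))
            kept.append((pos, umi))
    return kept

reads = [
    (100, "ACGTA"), (100, "ACGTA"),  # true PCR duplicate: collapsed
    (100, "TTGCA"),                  # distinct molecule: kept
    (100, "ACGTA"),                  # chance UMI collision: wrongly lost
]
print(len(naive_dedup(reads)))  # → 2, though 3 molecules existed
```

The wrongly discarded collision in this toy example is exactly the
undercounting that bends the top of the standard curve.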
"weighted_average.pdf" shows the results of a smarter deduplication
method from Fumiaki Katagiri (doi:
10.15252/embj.201796529),
which uses the relative counts of the different UMIs to set a
tolerance for expected duplicates; "weighted_average2.pdf" is my
extension of that algorithm to handle indefinitely high read
counts (described more in the "Detection of duplicate reads"
section in the manuscript). My interpretation is that the
"weighted_average2" algorithm nearly compensates for the new bias
added by deduplication, but the correlations are only slightly
better and many of the distributions are still shifted off the
expected line, so I'm not convinced it's actually improving any
results.
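For intuition about why read counts matter to these corrections,
here is a generic estimator that inverts the uniform-UMI expectation
to recover a molecule count from the number of distinct UMIs
observed. This is only a sketch of the general idea, not the
weighted-average algorithms themselves, and it assumes uniform UMI
usage:

```python
import math

# Invert E[distinct UMIs] = k * (1 - (1 - 1/k)^n) to estimate the
# true molecule count n from the observed number of distinct UMIs.
# The estimate diverges as the UMI set saturates, which is why very
# high read counts need special handling.
def estimate_molecules(distinct_umis, k=4**5):
    if distinct_umis >= k:
        raise ValueError("UMIs saturated; count not recoverable")
    return math.log(1 - distinct_umis / k) / math.log(1 - 1 / k)

# 639 distinct UMIs out of 1024 imply roughly 1000 original molecules
print(round(estimate_molecules(639)))  # → 1001
```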
All of these algorithms are implemented in a standalone program,
"umi-dedup" (https://github.com/jwfoley/umi-dedup), which is
called by the 3SEQtools shell scripts. It originated in a
collaboration with Maxime Turgeon to develop a more rigorous
deduplication algorithm, which turned out to be much less practical
to use, and nobody involved had sufficient time and motivation to
finish the project. Unfortunately we never got around to optimizing
the program, so it's still quite slow. My recommendation is
therefore to use the "-d" option in
"3SEQtools/align_smart-3seq.sh" or equivalent to skip duplicate
marking altogether, in the interest of time; or you can leave it
on for the sake of an additional QC metric (e.g. figures S5B,
S11A, S24B, S27A in the MS). Either way, I wouldn't throw out
the duplicate reads for downstream purposes.