Hi Richard,
On Wed, Nov 22, 2017 at 08:39:13AM -0800, Richard C wrote:
> Hi folks,
> One thing we routinely do is check for "proximal" duplicates in HiSeqX
> data. These appear in the data to be the same as "optical" duplicates
> of technologies past, i.e. two duplicate reads are "proximal" when they
> are duplicates at the same position, and are very near in coordinates
> on the flowcell.
> To estimate library complexity we need to both mark the duplicates
> (which we currently do with sambamba), and estimate the fraction of the
> duplicates that are "proximal".  This second step is done
> irregularly (when we need to) by running MarkDuplicates from Picard:
> java -Xms1G -Xmx50G -jar /picard-tools-1.140/picard.jar
> MarkDuplicates ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT I=my.bam
> M=myBam.mets OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500
> READ_NAME_REGEX="[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
> Here a regex is applied to extract the flowcell coordinates from the
> read names.
> Running this second step is becoming a regular activity for us and I'm
> wondering how big of an effort it would be to bring this functionality
> into the duplicate marking routines of sambamba.
> A separate approach I might be looking at would be to continue to mark
> duplicates with sambamba and write a flagstat-like post-processor to
> count up reads and calculate the "proximal" duplicate fraction after
> duplicate marking.
> I'd appreciate hearing your thoughts.
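For reference, a rough sketch of what such a post-processor could look like. It is not sambamba or Picard code, and it only approximates Picard's optical-duplicate logic. It assumes the three capture groups of the regex above are tile, x and y, that reads have already been duplicate-marked, and that the caller supplies a grouping key per read (here a hypothetical (chrom, pos, strand) tuple). The function names and the input tuple layout are made up for illustration.

```python
import re
from collections import defaultdict

# Same pattern as READ_NAME_REGEX in the quoted command.
# Assumption: the three capture groups are tile, x and y.
READ_NAME_RE = re.compile(r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*")


def parse_coords(read_name):
    """Return (tile, x, y) as ints, or None if the name does not match."""
    m = READ_NAME_RE.fullmatch(read_name)
    if m is None:
        return None
    tile, x, y = (int(g) for g in m.groups())
    return tile, x, y


def count_proximal(reads, max_dist=2500):
    """Count duplicate-flagged reads that lie near another read of their set.

    `reads` is an iterable of (read_name, dup_key, is_duplicate) tuples,
    where dup_key is a hypothetical key identifying the duplicate set.
    Returns (proximal_duplicates, total_duplicates).
    """
    groups = defaultdict(list)
    total_dups = 0
    for name, key, is_dup in reads:
        if is_dup:
            total_dups += 1
        coords = parse_coords(name)
        if coords is not None:
            groups[key].append((coords, is_dup))

    proximal = 0
    for members in groups.values():
        for i, ((tile, x, y), is_dup) in enumerate(members):
            if not is_dup:
                continue
            # Proximal: same tile and within max_dist in both x and y
            # of any other read in the same duplicate set.
            for j, ((tile2, x2, y2), _) in enumerate(members):
                if (i != j and tile == tile2
                        and abs(x - x2) <= max_dist
                        and abs(y - y2) <= max_dist):
                    proximal += 1
                    break
    return proximal, total_dups


if __name__ == "__main__":
    example = [
        ("FC_1:1:1101:1000:2000", ("chr1", 100, "+"), False),
        ("FC_1:1:1101:1500:2100", ("chr1", 100, "+"), True),
        ("FC_1:1:2203:1000:2000", ("chr1", 100, "+"), True),
    ]
    proximal, total = count_proximal(example)
    print(f"{proximal} of {total} duplicates are proximal")
```

The pairwise comparison is quadratic in the size of each duplicate set, which is fine for a sketch but would want a smarter approach for very deep duplicate piles.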
I started on a rewrite of markdup. The idea is not only to make
markdup more efficient, but also to be able to apply different
heuristics. So your feature request is timely.
Can you add your feature request as an issue to the GitHub tracker?
Pj.