optical / proximal duplicates


Richard C

Nov 22, 2017, 11:39:13 AM
to sambamba-discussion
Hi folks,

One thing we routinely do is check for "proximal" duplicates in HiSeqX data. In the data these look the same as the "optical" duplicates of earlier technologies, i.e. two duplicate reads are "proximal" when they are duplicates at the same position and lie very close to each other in flowcell coordinates.

To estimate library complexity we need to both mark the duplicates (which we currently do with sambamba) and estimate the fraction of the duplicates that are "proximal". This second step is done irregularly (when we need to) by running MarkDuplicates from Picard:

java -Xms1G -Xmx50G -jar /picard-tools-1.140/picard.jar MarkDuplicates ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT I=my.bam M=myBam.mets OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 READ_NAME_REGEX="[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"

Here READ_NAME_REGEX is applied to extract the flowcell coordinates (tile, x, y) from the read names.
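
To illustrate what the regex captures, here is a minimal Python sketch that applies the same pattern to a read name. The function name flowcell_coords and the example read name are hypothetical, and Python's re module stands in for Picard's Java regex engine, so treat this as an approximation of the matching rather than Picard's exact behaviour.

```python
import re

# Same pattern as READ_NAME_REGEX in the Picard command above.
# The three capture groups are read as tile, x and y, in that order.
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)


def flowcell_coords(read_name):
    """Return (tile, x, y) as integers, or None if the name does not match.

    Hypothetical helper for illustration only.
    """
    match = READ_NAME_REGEX.match(read_name)
    if match is None:
        return None
    tile, x, y = (int(group) for group in match.groups())
    return tile, x, y


if __name__ == "__main__":
    # Made-up read name in the prefix:lane:tile:x:y layout the regex expects.
    print(flowcell_coords("HX1_1234:5:1101:2345:6789"))
```

Read names that do not follow this layout return None, which makes it easy to count how many reads would be left out of a proximity check.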

Running this second step is becoming a regular activity for us and I'm wondering how big of an effort it would be to bring this functionality into the duplicate marking routines of sambamba.

A separate approach I might look at would be to continue marking duplicates with sambamba and write a flagstat-like post-processor that counts reads and calculates the "proximal" duplicate fraction after duplicate marking.
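
As a rough illustration of that post-processing idea, the sketch below counts flagged duplicates that sit close to another read from the same position on the same tile. It is a simplified approximation and not Picard's or sambamba's algorithm. Assumptions: reads arrive as plain (name, ref, pos, strand, is_dup) tuples (a hypothetical layout, in practice these would come from a BAM reader), duplicate sets are approximated by identical (ref, pos, strand), mate information and lane are ignored, and "near" means both the x and the y difference are at most the pixel distance.

```python
import re
from collections import defaultdict

# Same pattern as in the Picard command; groups are tile, x, y.
READ_NAME_REGEX = re.compile(
    r"[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*"
)
PIXEL_DISTANCE = 2500  # mirrors OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500


def proximal_fraction(reads, max_dist=PIXEL_DISTANCE):
    """Fraction of flagged duplicates that are 'proximal' (simplified).

    reads: iterable of (name, ref, pos, strand, is_dup) tuples.
    Reads whose names do not match the regex are skipped entirely.
    """
    # Approximate duplicate sets by grouping on (ref, pos, strand).
    groups = defaultdict(list)
    for name, ref, pos, strand, is_dup in reads:
        match = READ_NAME_REGEX.match(name)
        if match is None:
            continue
        tile, x, y = (int(g) for g in match.groups())
        groups[(ref, pos, strand)].append((tile, x, y, is_dup))

    n_dup = 0
    n_proximal = 0
    for members in groups.values():
        for i, (tile, x, y, is_dup) in enumerate(members):
            if not is_dup:
                continue
            n_dup += 1
            # Proximal: another read in the set on the same tile, close in x and y.
            if any(
                j != i
                and other_tile == tile
                and abs(x - ox) <= max_dist
                and abs(y - oy) <= max_dist
                for j, (other_tile, ox, oy, _) in enumerate(members)
            ):
                n_proximal += 1
    return n_proximal / n_dup if n_dup else 0.0


if __name__ == "__main__":
    # Made-up reads: one duplicate near its original, one on another tile.
    example = [
        ("HX1:1:1101:1000:1000", "chr1", 100, "+", False),
        ("HX1:1:1101:1500:1200", "chr1", 100, "+", True),
        ("HX1:1:2205:9000:9000", "chr1", 100, "+", True),
        ("HX1:1:1101:1000:1000", "chr1", 500, "+", False),
    ]
    print(proximal_fraction(example))
```

The pairwise check inside each set is quadratic in the set size, which is fine for a sketch but would need care for very deep positions.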

I'd appreciate hearing your thoughts.

thanks,
Richard

Pjotr Prins

Nov 24, 2017, 12:00:59 AM
to Richard C, sambamba-discussion
Hi Richard,

On Wed, Nov 22, 2017 at 08:39:13AM -0800, Richard C wrote:
> Hi folks,
> Once thing we routinely do is check for "proximal" duplicates in HiSeqX
> data. These appear in the data to be the same as "optical" duplicates
> of technologies past ie. two duplicate reads are "proximal" when they
> are duplicates at the same position, and are very near in coordinates
> on the flowcell.
> To estimate library complexity we need to both mark the duplicates
> (which we currently do with sambamba), and estimate the fraction of the
> duplicates that are "proximal".  This second step is done
> irregularly (when we need to) by running MarkDuplicates from Picard:
> java -Xms1G -Xmx50G -jar /picard-tools-1.140/picard.jar
> MarkDuplicates ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT I=my.bam
> M=myBam.mets OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500
> READ_NAME_REGEX="[a-zA-Z0-9_]+:[0-9]+:([0-9]+):([0-9]+):([0-9]+).*
> Here a regex to enable extracting flowcell coordinates from read names
> is applied.
> Running this second step is becoming a regular activity for us and I'm
> wondering how big of an effort it would be to bring this functionality
> into the duplicate marking routines of sambamba.
> A separate approach I might be looking at would be to continue to mark
> duplicates with sambamba and write a flagstat-like post-processor to
> count up reads and calculate the "proximal" duplicate post-duplicate
> marking.
> I'd appreciate hearing your thoughts.

I have started on a rewrite of markdup. The idea is not only to make
markdup more efficient, but also to be able to apply different
heuristics. So your feature request is timely.

Can you add your feature request as an issue on the GitHub tracker?

Pj.