Any known issues with using delly on bams that used sambamba to mark duplicates?


Peter Waltman

Oct 23, 2024, 12:06:53 PM
to delly-users
My group has noticed that delly's somatic SV calling infers a very different number of putative SVs depending on whether the input bams had duplicates marked with sambamba or with samtools.

It's uncanny, but if I take a bam (where duplicates haven't been removed) and use sambamba to mark and remove duplicate reads, the initial set of putative SVs will be roughly 4x larger than if I use samtools to mark and remove the duplicate reads.

As an example, for the same sample, I'll get 16500 SVs (this is on a whole-exome sample) if samtools was used to mark & remove duplicates. However, for the same initial bam, if sambamba was used to mark & remove duplicates, delly will infer roughly 62000 SVs. This behavior isn't unique to this one sample; we've observed similar behavior for other samples.

Has anyone else observed the same? If so, is this a known issue? I checked the README, and didn't see any mention of it.

If there's an issue using delly with bams that have been processed with sambamba, it might be a good idea to add a note about that to the README.

tr

Oct 24, 2024, 5:10:27 AM
to delly-users
Hmm, that's puzzling. In general, I would recommend only marking duplicates rather than removing them. Are the duplicate statistics similar for samtools and sambamba? Delly just ignores all duplicate-marked reads.
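
One way to compare the duplicate statistics is to count the records that carry the SAM duplicate flag (0x400) in each file. Below is a minimal Python sketch, assuming the alignments are available as SAM text lines (for example, piped from `samtools view -h`). The function name `count_duplicates` and the example records are hypothetical and only serve as an illustration.

```python
DUPLICATE_FLAG = 0x400  # SAM FLAG bit: PCR or optical duplicate


def count_duplicates(sam_lines):
    """Return (total_records, duplicate_records) for an iterable of SAM text lines."""
    total = 0
    duplicates = 0
    for line in sam_lines:
        # Skip blank lines and header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
        total += 1
        if flag & DUPLICATE_FLAG:
            duplicates += 1
    return total, duplicates


# Made-up records for illustration only; r2 has the duplicate bit set (1123 = 99 + 1024).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t1123\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r3\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
]
total, duplicates = count_duplicates(example)
print(f"records={total} duplicates={duplicates}")
```

Running this on the samtools-marked and sambamba-marked versions of the same bam, before any removal, would show whether the two tools flag a similar number of reads. If duplicates have already been removed, the total record counts are the figure to compare instead. `samtools flagstat` also reports a duplicates count.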