Overlapping read pairs

Ryan Morin

Feb 12, 2015, 8:38:54 PM

to strelka...@googlegroups.com

Hi.
I've noticed a problem in some libraries with smaller fragment length distributions wherein what appear to be PCR errors in libraries are observed twice in shorter fragments and thus count toward a variant call more than they should. We see these coming up quite a bit in certain libraries and usually they have 4 reads supporting them (2 reads from 2 different fragments). Reads from the same fragment should ideally not count twice towards a specific base call. Tools such as FLASH aim to minimize this issue prior to alignment. However, I'm wondering if Strelka has a way to discount bases coming from the same fragment such that they are not double-counted in the model.
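To make the double-counting concrete, here is a minimal sketch in plain Python. It is not Strelka code: the function name `count_support`, the `(fragment_name, base)` input format, and the choice to skip fragments whose mates disagree are all assumptions made for illustration.

```python
from collections import Counter


def count_support(observations):
    """Count base support at one site, per read and per fragment.

    observations: iterable of (fragment_name, base) tuples, one per read
    covering the site (hypothetical input format for this sketch).
    Returns (per_read, per_fragment) Counters.
    """
    per_read = Counter()
    bases_by_fragment = {}
    for fragment_name, base in observations:
        per_read[base] += 1
        bases_by_fragment.setdefault(fragment_name, set()).add(base)

    per_fragment = Counter()
    for bases in bases_by_fragment.values():
        # Mates that agree count once. Fragments whose mates disagree are
        # skipped here, which is an arbitrary choice for this sketch.
        if len(bases) == 1:
            per_fragment[next(iter(bases))] += 1
    return per_read, per_fragment


# Two short fragments carry the same apparent PCR error ("T"), and both
# mates of each fragment cover the site.
pileup = [
    ("frag1", "T"), ("frag1", "T"),
    ("frag2", "T"), ("frag2", "T"),
    ("frag3", "C"), ("frag4", "C"), ("frag5", "C"),
]
per_read, per_fragment = count_support(pileup)
print(per_read["T"], per_fragment["T"])  # 4 2
```

Counted per read, the error has 4 supporting observations; counted per fragment, it has only 2.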

Ryan

Saunders, Chris

Feb 13, 2015, 3:14:55 PM

to strelka...@googlegroups.com

Good question. Strelka currently handles this case incorrectly and will ‘double-count’ short-fragment read overlaps, which will inflate false-positive (FP) call rates. Our internal pipelines all soft-clip the read overlap down to a single copy prior to variant calling, so there hasn’t been an incentive to prioritize a fix. At present you will get better results with Strelka by using FLASH or a similar tool as a pre-processing step.


Ryan Morin

Feb 13, 2015, 7:43:10 PM

to strelka...@googlegroups.com

Thanks Chris.
FLASH is certainly an option that would get rid of this issue. It's not ideal for us because many of the data sets we are working with are already aligned. It would be great if anyone knows of something that does what FLASH does but works directly on BAM files. Otherwise, we can convert back to FASTQ and run FLASH before realignment.

Best,
Ryan

Jennifer Becq

Feb 16, 2015, 6:55:30 AM

to strelka...@googlegroups.com

Hi Ryan,

I've used the "clipOverlap" tool of BamUtil. It simply clips the overlapping portion of the read with the lowest average quality. It specifically works off a BAM of aligned reads, so no need to realign.

http://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap
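
As a rough illustration of that clipping rule (not BamUtil's actual implementation), the sketch below picks which mate to clip and over what reference interval. The `Mate` class, the `overlap_to_clip` function, the use of one mean quality per mate, and the tie-breaking rule are assumptions; indels and existing soft clips are ignored.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Mate:
    name: str
    start: int        # 0-based reference start, inclusive
    end: int          # reference end, exclusive
    mean_qual: float  # average base quality (assumed precomputed)


def overlap_to_clip(first: Mate, second: Mate) -> Optional[Tuple[str, int, int]]:
    """Return (mate_name, clip_start, clip_end), or None if the mates do not overlap."""
    clip_start = max(first.start, second.start)
    clip_end = min(first.end, second.end)
    if clip_start >= clip_end:
        return None
    # Clip the lower-quality mate; on a tie this sketch clips the second mate.
    loser = first if first.mean_qual < second.mean_qual else second
    return loser.name, clip_start, clip_end


# Partially overlapping pair: r2 has the lower quality, so it loses the overlap.
print(overlap_to_clip(Mate("r1", 100, 200, 35.0), Mate("r2", 150, 250, 30.0)))
# Completely overlapping pair: the whole of r1 would be clipped.
print(overlap_to_clip(Mate("r1", 100, 200, 28.0), Mate("r2", 100, 200, 33.0)))
```

After clipping, each reference position in the fragment is covered by only one of the two mates, so a caller that counts reads no longer sees the same fragment twice.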

Cheers

Jennifer

Ryan Morin

Dec 12, 2016, 4:47:17 PM

to strelka-discuss

Using FLASH or clipOverlap means writing out a second copy of the data, which for whole genomes we would rather avoid if we can. Is there any plan to modify Strelka so that it simply does not treat bases supported on both strands as independent and double-count them? Alternatively, is it possible to run each sub-process in the Strelka pipeline on a BAM stream coming from, for example, clipOverlap?

Thanks

Ryan

rcorbett

Dec 21, 2016, 11:26:32 AM

to strelka-discuss

Hi Ryan,

We are noticing here that Strelka is calling more somatic variants with low AF when our WGS reads are aligned with mem instead of aln. Looking deeper into this, much of the difference comes from mem more correctly aligning reads in noisy regions, but also from its ability to better align read pairs that completely overlap (aln for some reason often doesn't align these). When these completely overlapping reads have a mismatch in them, it only takes a couple of fragments that were previously unaligned with aln to generate a variant call once they are aligned with mem. In other words, we are on the same page on this.

When we played with read mergers for a different application earlier this year, we had really good success with PANDAseq and PEAR, but to each their own.
