Recalibrated (SpeedSeq) BAMs

John Doe

Sep 12, 2017, 11:48:05 PM

to Platypus Users

Thanks for making such a great variant caller.

I have some BAMs that have been run through SpeedSeq (https://github.com/hall-lab/speedseq; produced by BWA-MEM, processed by Sambamba and Samblaster) that I'd like to run through Platypus.

I saw your response here https://groups.google.com/forum/#!searchin/platypus-users/recalibration%7Csort:relevance/platypus-users/yi58Qjxiteg/0lg27eFyBQAJ , where you said "It's perfectly ok to use an already de-duplicated BAM file…," so I ran one raw BAM and the corresponding SpeedSeq-recalibrated BAM through Platypus with the default settings. The resulting VCFs were around 95.6% identical, with a total of around 3M variants each.

The files differ, predictably, in some of the annotations, most frequently the ones related to coverage.

Each VCF, however, had about 60 variants not present in the other (I assume because, when Platypus processed the other BAM, it considered that position to be homozygous reference?). In addition, around 150 variants present in both files had different genotypes (e.g., 0/1 in one file and 1/1 in the other).
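For anyone wanting to reproduce this kind of comparison, here is a minimal sketch in plain Python. It assumes single-sample, uncompressed VCFs with a GT entry in the FORMAT column; the function names and the file names in the usage comment are hypothetical, and this is not necessarily how the counts above were obtained.

```python
def read_genotypes(vcf_lines):
    """Map (chrom, pos, ref, alt) -> genotype string for the first sample."""
    genotypes = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        fmt = fields[8].split(":")
        sample = fields[9].split(":")
        gt = sample[fmt.index("GT")].replace("|", "/")
        # Sort alleles so that 1/0 and 0/1 compare as equal.
        genotypes[(chrom, int(pos), ref, alt)] = "/".join(sorted(gt.split("/")))
    return genotypes


def compare_calls(calls_a, calls_b):
    """Return (sites only in a, sites only in b, shared sites with different GT)."""
    only_a = sorted(set(calls_a) - set(calls_b))
    only_b = sorted(set(calls_b) - set(calls_a))
    discordant = sorted(
        key for key in set(calls_a) & set(calls_b) if calls_a[key] != calls_b[key]
    )
    return only_a, only_b, discordant


# Usage (hypothetical file names):
#   with open("raw.vcf") as a, open("recalibrated.vcf") as b:
#       only_raw, only_recal, discordant = compare_calls(
#           read_genotypes(a), read_genotypes(b))
```

Matching on chromosome, position, REF and ALT means that the same variant represented differently in the two files is reported as two private calls, so a few of the listed sites may be worth checking by eye.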

Is this to be expected? Should I consider the genotypes from the recalibrated BAM or those from the raw BAM to be more reliable?

Thank you for your time.

Andy Rimmer

Sep 13, 2017, 6:23:01 AM

to John Doe, Platypus Users

Hi,

I'm not sure what SpeedSeq does, but any kind of recalibration is likely to change the output of Platypus, particularly if base quality scores are modified. Simple de-duplication should not make a difference, as Platypus will remove duplicate reads anyway.

I would expect the changes to occur mostly in low coverage regions, or around sites which have weak support for variation. If some variants are present in one file but not the other, it is probably because they are at the boundary of what Platypus will call, and these are probably not very trustworthy. Genotype differences may occur if a small number of reads have been removed, or had their base qualities changed, or been modified in some other way, but this should only happen if the original genotype call was not very clear (e.g. an 8/2 ref/variant split might get called het in the raw file, but if the 2 variant bases are recalibrated to lower quality then this might become hom ref).
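The 8/2 example can be illustrated with a toy calculation. The sketch below is not Platypus's actual model: it assumes a simple biallelic site, independent reads, a flat prior over genotypes, and a per-base error probability taken directly from the Phred quality. All function names are made up for illustration.

```python
import math


def phred_to_error(q):
    """Convert a Phred base quality to an error probability."""
    return 10 ** (-q / 10)


def genotype_log10_likelihoods(ref_quals, alt_quals):
    """log10 likelihood of each diploid genotype given per-read base qualities.

    ref_quals: qualities of bases supporting the reference allele
    alt_quals: qualities of bases supporting the variant allele
    """
    alt_fraction = {"0/0": 0.0, "0/1": 0.5, "1/1": 1.0}
    loglik = {gt: 0.0 for gt in alt_fraction}
    for shows_alt, quals in ((False, ref_quals), (True, alt_quals)):
        for q in quals:
            e = phred_to_error(q)
            for gt, f in alt_fraction.items():
                # Probability that a read shows the variant base under this genotype.
                p_alt = f * (1 - e) + (1 - f) * e
                loglik[gt] += math.log10(p_alt if shows_alt else 1 - p_alt)
    return loglik


def call_genotype(ref_quals, alt_quals):
    """Pick the maximum-likelihood genotype (flat prior, toy model only)."""
    loglik = genotype_log10_likelihoods(ref_quals, alt_quals)
    return max(loglik, key=loglik.get)


# 8 reference reads and 2 variant reads, all at Q30: the het call wins.
print(call_genotype([30] * 8, [30] * 2))  # 0/1
# Same reads, but the 2 variant bases recalibrated down to Q10: hom ref wins.
print(call_genotype([30] * 8, [10] * 2))  # 0/0
```

Under these assumptions the heterozygous likelihood does not depend on base quality (each read is a coin flip between alleles), whereas the hom-ref likelihood rises sharply once the two variant bases become plausible sequencing errors. That is why a borderline call can flip after recalibration.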

As for which is more reliable, well that depends on whether SpeedSeq is doing anything sensible or not. It is certainly possible to improve calls like this using population information and a well-implemented recalibration.

To be honest though, ~300 variant calling differences out of 3 million is not a lot (0.01%), and the false discovery rate among both call sets is probably much higher than this (> 0.1%), so I suggest checking a couple manually to see what the differences are in the 2 BAMs and then not worrying too much about it.

Kind regards,

Andy
