Brad and Eric;
>> First off, quick question, does MuTect2 work with unpaired tumor samples? I
>> successfully completed variant calling, but all of the mutect2.vcf.gz files
>> are empty of variants.
I think this is a bug that is fixed in GATK 3.6:
http://gatkforums.broadinstitute.org/gatk/discussion/6690/mutect2-tumor-only-mode-empty-vcfs
We have 3.6 support ready for bcbio and hope to roll it out as part of the new
release (which keeps getting pushed back while we deal with bug reports, but
will be ready really soon now).
>> we noticed in some Melanoma cell line data, called with vardict, freebayes,
>> mutect2 and ensemble numpass=2, that it was missing the dinucleotide
>> variant V600K. When I looked at the individual caller.vcf.gz files, MuTect2
>> was empty and the other two looked like this:
>>
>> Freebayes:
>> 7 140453136 . A T
>> 7 140453137 . C T
>>
>> Vardict:
>> 7 140453136 . AC TT
Sorry about the problem. Variant representation is definitely an unsolved
problem, and you will hit cases like this with the ensemble method. It is not
fancy: it intersects calls with bcftools, which will not pick up on these
kinds of overlaps.
Practically, we try to normalize representations as much as possible but don't
currently run vcfallelicprimitives on VarDict like we do on FreeBayes, since
VarDict typically does not combine haplotypes. It looks like you've found an
edge case where it does.
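To make the mismatch concrete, here is a rough Python sketch of why an exact
intersection misses the VarDict call and how splitting the equal-length
substitution into single bases would recover it. This is only an illustration:
`decompose_mnp` is a hypothetical helper, not the vcfallelicprimitives
implementation, and it does not handle indels or complex events.

```python
def decompose_mnp(chrom, pos, ref, alt):
    """Split an equal-length substitution into single-base records.

    Hypothetical helper for illustration only; unequal-length alleles
    (indels, complex events) are returned unchanged.
    """
    if len(ref) != len(alt):
        return [(chrom, pos, ref, alt)]
    return [(chrom, pos + i, r, a)
            for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


# The records reported in this thread, as (chrom, pos, ref, alt) tuples.
freebayes = {("7", 140453136, "A", "T"), ("7", 140453137, "C", "T")}
vardict_raw = {("7", 140453136, "AC", "TT")}

# Exact matching on the raw records finds nothing in common.
raw_overlap = freebayes & vardict_raw

# After decomposition, the VarDict call matches both FreeBayes records.
vardict_norm = {rec for v in vardict_raw for rec in decompose_mnp(*v)}
norm_overlap = freebayes & vardict_norm
```

With the raw records the overlap is empty, so numpass=2 cannot be satisfied;
after decomposition both positions are shared by the two callers.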
The real fix for this would be to have smarter ensemble methods that take into
account multiple representations, like we do for variant comparisons, but I
don't know of any existing tools that do this. SomaticSeq
(https://github.com/bioinform/somaticseq) might do this and is a target for
inclusion in bcbio. I'd also thought about smartening up the current ensemble
approach by using rtg vcfeval to identify overlaps (compare two files and use
the TP set as the overlapping calls).
> Hi Brad, we've noticed the same issue with unifying calls from MuTect and
> UnifiedGenotyper, which report base substitutions individually, with
> FreeBayes and HaplotypeCaller, which report the same event as a
> multi-nucleotide substitution.
>
> We use a post-processing script on the MuTect and UnifiedGenotyper VCFs to
> identify adjacent SNVs and, using the original BAM file, scan the aligned
> reads at that site to determine whether the substitutions occur together in
> the same read -- and if so, replace the original 2 (or more) SNVs with a
> multi-nucleotide substitution.
>
> I'm happy to share this script if you like; the implementation is not fancy.
That's awesome, I'd be happy to work on something smarter with you. Either
better normalization, like you've done above, or better comparison engines,
like using rtg vcfeval, would be good ways to move forward.
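As a starting point for discussion, the adjacent-SNV merging described above
might look roughly like the sketch below. The actual script has not been
shared in this thread, so everything here is an assumption: the function name,
the `min_reads` threshold, and the representation of reads as plain dicts
mapping (chrom, pos) to the observed base, which stands in for alignments
pulled from the original BAM file.

```python
def merge_adjacent_snvs(snvs, reads, min_reads=2):
    """Merge adjacent SNVs into one substitution when reads carry them together.

    snvs:  position-sorted list of (chrom, pos, ref, alt) single-base records.
    reads: list of dicts mapping (chrom, pos) -> observed base; a hypothetical
           stand-in for reads fetched from a BAM file.
    min_reads: assumed threshold for reads supporting all alt bases at once.
    """
    merged = []
    for chrom, pos, ref, alt in snvs:
        if merged:
            pchrom, ppos, pref, palt = merged[-1]
            if pchrom == chrom and ppos + len(pref) == pos:
                # Alt bases a single read must show for the combined event.
                want = {(chrom, ppos + i): base
                        for i, base in enumerate(palt + alt)}
                support = sum(
                    all(read.get(key) == base for key, base in want.items())
                    for read in reads
                )
                if support >= min_reads:
                    merged[-1] = (chrom, ppos, pref + ref, palt + alt)
                    continue
        merged.append((chrom, pos, ref, alt))
    return merged


snvs = [("7", 140453136, "A", "T"), ("7", 140453137, "C", "T")]
in_phase = [{("7", 140453136): "T", ("7", 140453137): "T"}] * 3
out_of_phase = [{("7", 140453136): "T", ("7", 140453137): "C"},
                {("7", 140453136): "A", ("7", 140453137): "T"}] * 2

merged_in_phase = merge_adjacent_snvs(snvs, in_phase)
merged_out_of_phase = merge_adjacent_snvs(snvs, out_of_phase)
```

When the two substitutions occur on the same reads, the pair collapses into a
single AC>TT record matching the VarDict representation; when they sit on
different reads, the two SNVs are left as they are.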
Sorry to not have an immediate solution. Thanks for starting this discussion,
Brad