Detection of somatic dinucleotide changes

Brad Wubbenhorst

unread,

Jul 13, 2016, 2:27:18 PM7/13/16

to biovalidation

I think this might be a big issue for the ensemble caller. First off, quick question, does MuTect2 work with unpaired tumor samples? I successfully completed variant calling, but all of the mutect2.vcf.gz files are empty of variants. Perhaps if it did I would not be writing this, but we noticed in some Melanoma cell line data, called with vardict, freebayes, mutect2 and ensemble numpass=2, that it was missing the dinucleotide variant V600K. When I looked at the individual caller.vcf.gz files, MuTect2 was empty and the other two looked like this:

Freebayes:

7       140453136       .       A       T       1581.78 PASS    AC=1;AF=0.5;AN=2;DECOMPOSED;LEN=1;NS=1;TYPE=snp;ANN=T|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000288602|protein_coding|15/18|c.1799T>A|p.Val600Glu|1860/2480|1799/2301|600/766||,T|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000496384|protein_coding|6/10|c.620T>A|p.Val207Glu|622/8294|620/1125|207/374||WARNING_TRANSCRIPT_NO_START_CODON,T|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000479537|nonsense_mediated_decay|2/6|c.83T>A|p.Val28Glu|83/743|83/309|28/102||WARNING_TRANSCRIPT_NO_START_CODON,T|sequence_feature|LOW|BRAF|ENSG00000157764|domain:Protein_kinase|ENST00000288602|protein_coding|16/18|c.1799T>A||||||,T|sequence_feature|LOW|BRAF|ENSG00000157764|beta_strand|ENST00000288602|protein_coding|15/18|c.1799T>A||||||,T|3_prime_UTR_variant|MODIFIER|BRAF|ENSG00000157764|transcript|ENST00000497784|nonsense_mediated_decay|16/19|c.*1249T>A|||||54959|WARNING_TRANSCRIPT_NO_START_CODON;EPR=pass,cosmic,clinvar      GT:GQ:DP:RO:QR:AO:QA:GL 0|1:5:81:10:317:71:2082:-163.26,0,-4.45723
7       140453137       .       C       T       1581.78 PASS    AC=1;AF=0.5;AN=2;DECOMPOSED;LEN=1;NS=1;TYPE=snp;ANN=T|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000288602|protein_coding|15/18|c.1798G>A|p.Val600Met|1859/2480|1798/2301|600/766||,T|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000496384|protein_coding|6/10|c.619G>A|p.Val207Met|621/8294|619/1125|207/374||WARNING_TRANSCRIPT_NO_START_CODON,T|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000479537|nonsense_mediated_decay|2/6|c.82G>A|p.Val28Met|82/743|82/309|28/102||WARNING_TRANSCRIPT_NO_START_CODON,T|sequence_feature|LOW|BRAF|ENSG00000157764|domain:Protein_kinase|ENST00000288602|protein_coding|16/18|c.1798G>A||||||,T|sequence_feature|LOW|BRAF|ENSG00000157764|beta_strand|ENST00000288602|protein_coding|15/18|c.1798G>A||||||,T|3_prime_UTR_variant|MODIFIER|BRAF|ENSG00000157764|transcript|ENST00000497784|nonsense_mediated_decay|16/19|c.*1248G>A|||||54958|WARNING_TRANSCRIPT_NO_START_CODON;EPR=pass,cosmic,clinvar      GT:GQ:DP:RO:QR:AO:QA:GL 0|1:5:81:10:317:71:2082:-163.26,0,-4.45723

Vardict:

7       140453136       .       AC      TT      193     PASS    ADJAF=0.0465;AF=0.8488;BIAS=2:2;DP=86;HIAF=0.875;HICNT=70;HICOV=80;LSEQ=ACCCACTCCATCGAGATTTC;MQ=60;MSI=1;MSILEN=1;NM=1.3;ODDRATIO=1.80509;PMEAN=32.2;PSTD=1;QSTD=1;QUAL=31.2;REFBIAS=4:6;RSEQ=TGTAGCTAGACCAAAATCAC;SAMPLE=WM4420-3MP1;SBF=0.50422;SHIFT3=0;SN=23.333;TYPE=Complex;VARBIAS=40:33;VD=73;ANN=TT|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000288602|protein_coding|16/18|c.1798_1799delGTinsAA|p.Val600Lys|1860/2480|1798/2301|600/766||,TT|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000496384|protein_coding|7/10|c.619_620delGTinsAA|p.Val207Lys|622/8294|619/1125|207/374||WARNING_TRANSCRIPT_NO_START_CODON,TT|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|ENST00000479537|nonsense_mediated_decay|3/6|c.82_83delGTinsAA|p.Val28Lys|83/743|82/309|28/102||WARNING_TRANSCRIPT_NO_START_CODON,TT|sequence_feature|LOW|BRAF|ENSG00000157764|domain:Protein_kinase|ENST00000288602|protein_coding|16/18|c.1798_1799delGTinsAA||||||,TT|sequence_feature|LOW|BRAF|ENSG00000157764|beta_strand|ENST00000288602|protein_coding|15/18|c.1798_1799delGTinsAA||||||,TT|3_prime_UTR_variant|MODIFIER|BRAF|ENSG00000157764|transcript|ENST00000497784|nonsense_mediated_decay|16/19|c.*1248_*1249delGTinsAA|||||54958|WARNING_TRANSCRIPT_NO_START_CODON;EPR=pass,cosmic  GT:DP:VD:AD:AF:RD:ALD   1/1:86:73:10,73:0.8488:4,6:40,33

And thus this variant is not known to be the same. I am not sure what can be done to correct this.

-bwubb

Eric Talevich

unread,

Jul 13, 2016, 2:53:23 PM7/13/16

to Brad Wubbenhorst, biovalidation

Hi Brad, we've noticed the same issue with unifying calls from MuTect and UnifiedGenotyper, which report base substitutions individually, with FreeBayes and HaplotypeCaller, which report the same event as a multi-nucleotide substitution.

We use a post-processing script on the MuTect and UnifiedGenotyper VCFs to identify adjacent SNVs and, using the original BAM file, scan the aligned reads at that site to determine whether the substitutions occur together in the same read -- and if so, replace the original 2 (or more) SNVs with a multi-nucleotide substitution.

I'm happy to share this script if you like; the implementation is not fancy.

-Eric

--
You received this message because you are subscribed to the Google Groups "biovalidation" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biovalidatio...@googlegroups.com.
To post to this group, send email to bioval...@googlegroups.com.
Visit this group at https://groups.google.com/group/biovalidation.
For more options, visit https://groups.google.com/d/optout.

Brad Chapman

unread,

Jul 14, 2016, 8:55:33 PM7/14/16

to Eric Talevich, Brad Wubbenhorst, biovalidation

Brad and Eric;

>> First off, quick question, does MuTect2 work with unpaired tumor samples? I
>> successfully completed variant calling, but all of the mutect2.vcf.gz files
>> are empty of variants.

I think this is a bug that is fixed in GATK 3.6:

http://gatkforums.broadinstitute.org/gatk/discussion/6690/mutect2-tumor-only-mode-empty-vcfs

We have 3.6 support ready for bcbio and hope to roll it out as part of the new
release (which keeps getting pushed dealing with bug reports but will be ready
really soon now).

>> we noticed in some Melanoma cell line data, called with vardict, freebayes,
>> mutect2 and ensemble numpass=2, that it was missing the dinucleotide
>> variant V600K. When I looked at the individual caller.vcf.gz files, MuTect2
>> was empty and the other two looked like this:
>>
>> Freebayes:
>> 7 140453136 . A T

>> 7 140453137 . C T
>>

>> Vardict:
>> 7 140453136 . AC TT

Sorry about the problem. Variant representation is definitely an unsolved and
you will hit cases like this with the ensemble method. It is not fancy and
does intersection of calls with bcftools which will not pick up on these kind
of overlaps.

Practically, we try to normalize representations as much as possible but don't
currently run vcfallelicprimatives on VarDict like we do on FreeBayes since
VarDict typically does not combine haplotypes. It looks like you've found an
edge case where it does.

The real fix for this would be to have smarter ensemble methods that take into
account multiple representations, like we do for variant comparisons, but I
don't know any existing tools that do this. SomaticSeq
(https://github.com/bioinform/somaticseq) might do and is a target for
inclusion in bcbio. I'd also thought about smartening up the current ensemble
approach by using rtg vcfeval for identifying overlaps (compare two files and
use the TP set as overlapping).

> Hi Brad, we've noticed the same issue with unifying calls from MuTect and
> UnifiedGenotyper, which report base substitutions individually, with
> FreeBayes and HaplotypeCaller, which report the same event as a
> multi-nucleotide substitution.
>
> We use a post-processing script on the MuTect and UnifiedGenotyper VCFs to
> identify adjacent SNVs and, using the original BAM file, scan the aligned
> reads at that site to determine whether the substitutions occur together in
> the same read -- and if so, replace the original 2 (or more) SNVs with a
> multi-nucleotide substitution.
>
> I'm happy to share this script if you like; the implementation is not fancy.

That's awesome, I'd be happy to work on something smarter with you. Either
better normalization, like you've done above, or better comparison engines,
like using rtg vcfeval would be good ways to move forward.

Sorry to not have an immediate solution. Thanks for starting this discussion,
Brad

Brad Wubbenhorst

unread,

Jul 15, 2016, 9:37:53 AM7/15/16

to biovalidation, eric.t...@gmail.com

Eric,

That would be awesome of you to share the script!

Brad,

In your experience and opinion, do you think Mutect or Mutect2 in unpaired mode is worth anything? I honestly didnt know it was possible to run those in such a manner until I started using bcbio. I cant imagine what benefit Mutect brings over UnifiedGenotyper in the absence of a matched normal.

We have 3.6 support ready for bcbio and hope to roll it out as part of the new
release (which keeps getting pushed dealing with bug reports but will be ready
really soon now)

And do you mean that if I set up GATK 3.6 in bcbio it will not work?

-bwubb

Brad Chapman

unread,

Jul 16, 2016, 6:38:03 AM7/16/16

to Brad Wubbenhorst, biovalidation, eric.t...@gmail.com

Brad;

> In your experience and opinion, do you think Mutect or Mutect2 in
> unpaired mode is worth anything? I honestly didnt know it was possible to
> run those in such a manner until I started using bcbio. I cant imagine what
> benefit Mutect brings over UnifiedGenotyper in the absence of a matched
> normal.

I don't have a good sense for MuTect2 on this. For MuTect, I know people have
had good success using a panel of normals (set as `background` in the
configuration:
http://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#variant-calling)
if you have a reasonable way to prepare that.

Absent of that, you do need to do some kind of prioritization with population
databases to try and separate germline from somatic:

http://bcb.io/2015/03/05/cancerval/

Having a normal is always better but if you don't have it for a sample all you
can do is your best. Somatic callers will be better than callers that assume
diploid states since you can pick up lower frequency non-HET variants.

> And do you mean that if I set up GATK 3.6 in bcbio it will not work?

It requires Java 8 so needs to have gatk-framework synced with the recent
version as well so both work. We have gatk-framework 3.6 ready to go as a
bioconda package but this will require you to manually install this. So it's
possible for sure, but the next release should synchronize all this to try and
make it less painful.

Brad

Eric Talevich

unread,

Jul 20, 2016, 3:19:29 PM7/20/16

to Brad Wubbenhorst, biovalidation

Brad, here's the SNV-joining script I mentioned earlier:
https://gist.github.com/etal/74440a849e273a5242fa1dd92377277c

This script requires samtools and the UCSC/kent utility "twoBitToFa", and the code is a little dusty, but hopefully you can adapt it to your needs. We use it on VCFs from GATK 3.4 Mutect, UnifiedGenotyper and HaplotypeCaller (which does report MNPs, but with quirks).

-Eric

Brad Wubbenhorst

unread,

Jul 21, 2016, 7:44:06 PM7/21/16

to biovalidation

Thank you very much!

Brad Wubbenhorst

unread,

Jul 21, 2016, 9:00:16 PM7/21/16

to biovalidation

Not sure if this is a bug or intentional, but I just noticed (in a separate project) that the mutect2-germline.vcf.gz files contain all of the Somatic calls. They are labeled as 'Somatic' in the FILTER column, but are designated 'PASS' in the non-germline file.

Sorry, unrelated to dinucleotides, but related to MuTect2

Brad Wubbenhorst

unread,

Sep 7, 2016, 11:32:40 AM9/7/16

to biovalidation, bwubbe...@gmail.com

Hey Eric,

I can not run your join SNP python script without being yelled at:

  File "/home/bwubb/NGS/joinAdjacentSNPs.py", line 371, in <module>
    main(args.distance, args.vcf, args.bam, args.ref2bit, args.output)
  File "/home/bwubb/NGS/joinAdjacentSNPs.py", line 322, in main
    nextLine = peek_line(inFile)
  File "/home/bwubb/NGS/joinAdjacentSNPs.py", line 294, in peek_line
    nextline = f.readline()
  ValueError: Mixing iteration and read methods would lose data

It does not like that you have a for loop for the lines in the input, while peeking at the next. Did you suppress this error somehow? Because it seems like its pretty important. Let me know if you want me to take this discussion somewhere else. Thanks.

-bwubb

Brad Wubbenhorst

unread,

Sep 7, 2016, 3:41:54 PM9/7/16

to biovalidation

I tried replacing f.readline() with f.next() and that was a terrible idea. Im not completely sure why, but this messes up the value of 'line' by the for loop, even though the position in the "should" be maintained with f.seek()

Eric Talevich

unread,

Sep 8, 2016, 2:30:18 PM9/8/16

to Brad Wubbenhorst, biovalidation

Hey Brad W -- sorry for the trouble, this might have been due to some untested cleanup I did just before posting the script. Would you mind filing an issue in the source repo?:
https://github.com/etal/vcf-scripts

Thanks,

Eric

--

You received this message because you are subscribed to the Google Groups "biovalidation" group.

To unsubscribe from this group and stop receiving emails from it, send an email to biovalidation+unsubscribe@googlegroups.com.

Reply all

Reply to author

Forward