Analyzing matched normal/panel of normals data sets

ni...@cc-tdi.org

Jul 27, 2017, 4:05:16 AM

to Platypus Users

Hi,

My current data are a mixture of matched tumor/normal samples and tumor samples that don't have a match, but do have >40 normal samples to compare to. I understand that to run the variant analysis, I need to run all the files together and generate one VCF with all the calls. This step worked for me. I then turned to somaticMutationDetector.py to interpret the VCF as the correct mutations I needed, but I kept getting the error:

Could not find normal sample normaldna in input VCF header

I get the same error with the tumor sample. I've tried the original file name (normaldna.bam), the file name without the extension (normaldna), and the full file path (/dir/that/contains/normaldna.bam). I have already looked through the input VCF, and I can't find where it records which BAM a particular indel was called from. Does anyone know of another VCF post-processing tool that could accomplish this, or have any experience with this script in particular? I know there are dozens of indel callers, but I'm pretty set on Platypus.

A couple of other answers in this Group mention the script, but I haven't seen one with a full, working example of its usage written out. I also understand that, for a while, Andy Rimmer was using the script only as a placeholder and that it didn't actually work. Does anyone have a solution, or a work-around?

Thanks!

Andy Rimmer

Jul 27, 2017, 5:12:10 AM

to ni...@cc-tdi.org, Platypus Users

Hi Nick,

I think you are getting that error because the script cannot match the sample names in the VCF sample columns with the names you are giving. Check the VCF column headers to see what the actual sample names are. They may not be the same as the BAM file names. Platypus extracts the sample name from the SM field of the RG tag in the BAM. You can find this in the BAM header if you don't know it already (samtools view -H input.bam).
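As a rough illustration of the check described above, the following Python sketch lists the sample names in a VCF's #CHROM header line and the SM values in the @RG lines of a BAM header (the text printed by samtools view -H). The function names and the example data are hypothetical and are not part of Platypus; the sketch only assumes the standard tab-separated VCF and SAM header layouts.

```python
def vcf_sample_names(vcf_lines):
    """Return the sample column names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # The first nine columns are fixed (CHROM ... FORMAT);
            # everything after them is a sample column.
            return fields[9:]
        if not line.startswith("#"):
            break  # reached the records without finding a header line
    return []


def read_group_samples(sam_header_lines):
    """Return the set of SM values found in the @RG lines of a SAM/BAM header."""
    samples = set()
    for line in sam_header_lines:
        if line.startswith("@RG"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SM:"):
                    samples.add(field[3:])
    return samples


# Hypothetical in-memory example; in practice the lines would come from the
# VCF file and from the output of `samtools view -H input.bam`.
example_vcf = [
    "##fileformat=VCFv4.0\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tDNA\n",
]
example_header = ["@HD\tVN:1.4\n", "@RG\tID:rg1\tSM:DNA\tPL:ILLUMINA\n"]

print(vcf_sample_names(example_vcf))
print(read_group_samples(example_header))
```

The names returned by the first function are the ones the script would need to be given, whatever the BAM files themselves are called.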

As you mention, this script is really only a placeholder. It does something sensible, but not very sophisticated, to discriminate somatic variants from germline variants and genotyping errors, and it gives a posterior probability for a variant being somatic. I think it should work, but I would treat the results with caution.

Kind regards,

Andy

--
You received this message because you are subscribed to the Google Groups "Platypus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platypus-users+unsubscribe@googlegroups.com.
To post to this group, send email to platypus-users@googlegroups.com.
Visit this group at https://groups.google.com/group/platypus-users.
To view this discussion on the web, visit https://groups.google.com/d/msgid/platypus-users/ee5c94a3-198f-4760-8118-d90c01198515%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Dr Andrew (Andy) Rimmer

ni...@cc-tdi.org

Jul 27, 2017, 3:15:45 PM

to Platypus Users, ni...@cc-tdi.org

Thank you! BAM file headers turned out to be the problem - both the tumor and normal samples were simply "SM:DNA", so I couldn't figure out where they were supposed to be different.
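The thread does not say how the headers were fixed. One possible approach, sketched here under that caveat, is to rewrite the SM field in the @RG lines of the header text and then apply the edited header with samtools reheader. The function name and the sample name "tumor1" are hypothetical. Because the sample columns in the VCF come from the BAM headers, the calling step would presumably have to be re-run on the corrected BAMs.

```python
def set_sample_name(header_lines, new_sample):
    """Return a copy of a SAM header with SM set to new_sample in every @RG line."""
    out = []
    for line in header_lines:
        if line.startswith("@RG"):
            fields = line.rstrip("\n").split("\t")
            # Replace an existing SM field, leaving all other fields untouched.
            fields = [
                f"SM:{new_sample}" if f.startswith("SM:") else f for f in fields
            ]
            # Add an SM field if the read group had none at all.
            if not any(f.startswith("SM:") for f in fields):
                fields.append(f"SM:{new_sample}")
            line = "\t".join(fields) + "\n"
        out.append(line)
    return out


# Hypothetical header in which the sample is simply called "DNA".
header = ["@HD\tVN:1.4\n", "@RG\tID:rg1\tSM:DNA\tPL:ILLUMINA\n"]
print("".join(set_sample_name(header, "tumor1")), end="")
```

The edited header text could then be written to a file and passed to samtools reheader together with the original BAM.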

I'll mess around with the minimum posterior probability and see how it affects my results. I did find a paper that utilized your script, and kept the default posterior probability minimum of phred 5. If I go lower, what's my risk of false positives?

Thanks again for the help!

ni...@cc-tdi.org

Jul 27, 2017, 3:17:49 PM

to Platypus Users, ni...@cc-tdi.org

Also, should this script work with indels? I worry it is mostly meant for somatic point mutation calling.

Andy Rimmer

Jul 28, 2017, 5:45:22 AM

to ni...@cc-tdi.org, Platypus Users

The script will work with SNPs, indels and MNPs, so everything that Platypus reports. Indels are of course harder to call and genotype correctly, so you'll probably get more false positives here.

As for the posterior threshold, it's very hard to say what an appropriate value is as this is not a well-calibrated model, and in any case things will vary enormously from dataset to dataset. By selecting variants where the genotypes are mismatched between tumour and normal (which is what the script does) you are mostly selecting for genotyping errors of various kinds. The script uses the likelihoods from Platypus to compute a posterior which measures how much evidence there is for this genotype mismatch, i.e. how much evidence there is for the variant being present in the tumour but not the normal. A higher threshold means the evidence must be stronger, which will suppress certain kinds of false positives, but others will creep through (e.g. if there is contamination in the tumour). So, setting the threshold higher is better for killing false positives, but I think you'll have to work out an appropriate threshold empirically.
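To get a feel for what a Phred-scaled threshold such as 5 means, the sketch below converts between Phred values and probabilities using the standard definition Q = -10 * log10(p). Whether the script applies exactly this convention to its posterior is an assumption here, and, as noted above, the model is not well calibrated, so the numbers are only a rough guide.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled value to the probability of being wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error):
    """Convert a probability of being wrong to a Phred-scaled value."""
    return -10 * math.log10(p_error)


# Under the standard definition, Phred 5 corresponds to an error probability
# of about 0.32 (i.e. a posterior of about 0.68), Phred 10 to 0.1 and
# Phred 20 to 0.01.
for q in (5, 10, 20, 30):
    p_err = phred_to_error_prob(q)
    print(f"Phred {q}: error probability {p_err:.4f}, posterior {1 - p_err:.4f}")
```

Read this way, lowering the threshold below 5 would admit calls whose nominal posterior is below roughly two thirds, which is consistent with the advice to choose the value empirically.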

The other thing you should do to minimise the false positive rate is make sure that you get rid of low quality variant calls. So filter on the QUAL column in the original VCF and remove things that don't have PASS or alleleBias in the FILTER column. And you could set some minimum thresholds on per-sample coverage (the NR value in the per-sample data column). Again this will depend on your data and on how much you care about finding low frequency somatics.
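The filtering described above could be sketched roughly as follows. The thresholds (min_qual, min_reads) are hypothetical placeholders, not recommended values, and the sketch makes two assumptions: that the FILTER field must be exactly PASS or alleleBias, and that the coverage threshold applies to every sample in the record. For a cohort with many normals, it may make more sense to check only the tumour and normal columns of interest.

```python
def passes_filters(vcf_line, min_qual=20.0, min_reads=10,
                   ok_filters=("PASS", "alleleBias")):
    """Return True if a VCF record passes QUAL, FILTER and per-sample NR checks."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual, filt, fmt = fields[5], fields[6], fields[8]

    # Drop records with missing or low QUAL.
    if qual == "." or float(qual) < min_qual:
        return False

    # Keep only records whose FILTER is one of the accepted values.
    if filt not in ok_filters:
        return False

    # NR is the per-sample coverage reported in the sample columns.
    keys = fmt.split(":")
    if "NR" not in keys:
        return False
    nr_index = keys.index("NR")

    for sample in fields[9:]:
        values = sample.split(":")
        if nr_index >= len(values):
            return False
        try:
            # NR may hold one value per alternate allele, separated by commas.
            depths = [int(x) for x in values[nr_index].split(",")]
        except ValueError:
            return False  # missing or non-numeric coverage
        if min(depths) < min_reads:
            return False
    return True


# Hypothetical record with one normal and one tumour sample column.
record = ("chr1\t100\t.\tA\tT\t150\tPASS\t.\tGT:GL:GOF:GQ:NR:NV\t"
          "0/0:0,-5,-50:1:50:30:0\t0/1:-40,0,-60:1:99:25:12")
print(passes_filters(record))
```

Header lines (those starting with #) would need to be passed through unchanged before applying this function to the records.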

Kind regards,

Andy
