Re: Mutation Analysis TCGA Breast

David Heiman

Nov 16, 2016, 12:45:26 PM
to Scooter Willis, gdac-...@broadinstitute.org

Hi Scooter,

Apologies for not answering sooner. It is always best to direct any GDAC questions to the GDAC mailing list (CC'd). While I maintain the report, I did not generate the MAF. If someone on the list can't answer your question, we'll relay it or get you the contact information to find the proper expert as soon as possible.

Regards,
David


On Nov 16, 2016 12:31 PM, "Scooter Willis" <will...@gmail.com> wrote:
David

I wanted to confirm that you got the email I sent yesterday, where you are listed as being responsible for the report. If you are not responsible for the report, can you direct me to the appropriate support channel?

The referenced article is the Ovarian manuscript, so I assume the same process was used for SNP calling on the breast data via exome DNA-seq?

Thanks

Scooter

On Tue, Nov 15, 2016 at 2:38 PM Scooter Willis <will...@gmail.com> wrote:
David

In the Mutation Analysis data for TCGA Breast I was hoping you can answer a question regarding what appears to be either a data anomaly or a possible output error in the report.


If you look at the summary data for BRCA-TP.final_analysis_set.maf you see differences in Tumor_Seq_Allele1 and Tumor_Seq_Allele2 which is expected compared to the Reference_Allele. 

1. Is the Reference_Allele always derived from Matched_Norm_Sample_Barcode calls? Shouldn't the Reference_Allele have two values, assuming it came from the matched normal?

2. My concern is that Match_Norm_Seq_Allele1 always equals Match_Norm_Seq_Allele2, which implies no germline heterogeneity, and that shouldn't be the case. It is reasonable for a gene to have an A at position X in one allele and a T at position X in the other. I am worried that this is either a reporting error that outputs the same value twice, or something that could be affecting the mutation calls in the tumor.

3. Do you know what SNP array was used for BRCA calls?

4. If the SNP array had intronic/GWAS SNP probes, were these included in or excluded from the analysis?

Thanks

Scooter Willis


Scooter Willis

Nov 16, 2016, 1:41:24 PM
to gdac-...@broadinstitute.org, David Heiman
In the Mutation Analysis data for TCGA Breast I was hoping you can answer a question regarding what appears to be either a data anomaly or a possible output error in the report.


If you look at the summary data for BRCA-TP.final_analysis_set.maf you see differences in Tumor_Seq_Allele1 and Tumor_Seq_Allele2 which is expected compared to the Reference_Allele. 

1. Is the Reference_Allele always derived from Matched_Norm_Sample_Barcode calls? Shouldn't the Reference_Allele have two values, assuming it came from the matched normal?

2. My concern is that Match_Norm_Seq_Allele1 always equals Match_Norm_Seq_Allele2, which implies no germline heterogeneity, and that shouldn't be the case. It is reasonable for a gene to have an A at position X in one allele and a T at position X in the other. I am worried that this is either a reporting error that outputs the same value twice, or something that could be affecting the mutation calls in the tumor.

3. The BRCA Mutation Analysis references the Nature Ovarian manuscript, so is it safe to assume the same methods and data source (exome DNA-Seq) were used?

4. I am also trying to understand the impact of pancan_mutation_blacklist.v14.hg19.txt. We have a gene of interest, CYP2D6, which in BRCA has only one called mutation, presumably relative to matched normal tissue from the same patient. This would suggest that tumor mutations in CYP2D6 are unlikely. We have a breast cancer cohort where we don't have a normal control for each patient. We have run a SNP array on the cohort, and based on TCGA BRCA, any SNPs we detect have a very high chance of being germline, particularly if they are well-known SNPs. If the pancan blacklist file is filtering CYP2D6 SNPs that appear to be germline but could be mutations in the tumor, we want to make sure we understand the implications. That is difficult when the file is not available to show what is being filtered.

Thanks

Scooter Willis

David Heiman

Dec 1, 2016, 11:23:08 AM
to Gdac-users, dhe...@broadinstitute.org, will...@gmail.com
Hi Scooter,

Many apologies that it seems no one has gotten back to you quickly. My time has been limited, so I tried to put you in touch with experts who I had hoped could quickly answer your questions in detail. Answers to the best of my ability are in-line below.

Regards,
David

On Wednesday, November 16, 2016 at 1:41:24 PM UTC-5, Scooter Willis wrote:
>In the Mutation Analysis data for TCGA Breast I was hoping you can answer a question regarding what appears to be either a data anomaly or a possible output error in the report.
>
>If you look at the summary data for BRCA-TP.final_analysis_set.maf you see differences in Tumor_Seq_Allele1 and Tumor_Seq_Allele2 which is expected compared to the Reference_Allele.
>
>1. Is the Reference_Allele always derived from Matched_Norm_Sample_Barcode calls? Shouldn't the Reference_Allele have two values, assuming it came from the matched normal?

My understanding is that Reference_Allele is derived from the reference genome build specified in NCBI_Build.

>2. My concern is that Match_Norm_Seq_Allele1 always equals Match_Norm_Seq_Allele2, which implies no germline heterogeneity, and that shouldn't be the case. It is reasonable for a gene to have an A at position X in one allele and a T at position X in the other. I am worried that this is either a reporting error that outputs the same value twice, or something that could be affecting the mutation calls in the tumor.

Looking at the MAF Spec, these columns are not required to have values. While the original intent of these columns is as you suggest, having inspected several MAFs from multiple sequencing centers, I do not believe they are being used that way: the two values nearly always match, or are simply blank.

>3. The BRCA Mutation Analysis references the Nature Ovarian manuscript, so is it safe to assume the same methods and data source (exome DNA-Seq) were used?

The reference is to the first published use of MutSig. MutSig_2CV has differences with the version used in that paper; more details available in our FAQ. All open-access TCGA MAFs use exome DNA-Seq, to the best of my knowledge.

>4. I am also trying to understand the impact of pancan_mutation_blacklist.v14.hg19.txt. We have a gene of interest, CYP2D6, which in BRCA has only one called mutation, presumably relative to matched normal tissue from the same patient. This would suggest that tumor mutations in CYP2D6 are unlikely. We have a breast cancer cohort where we don't have a normal control for each patient. We have run a SNP array on the cohort, and based on TCGA BRCA, any SNPs we detect have a very high chance of being germline, particularly if they are well-known SNPs. If the pancan blacklist file is filtering CYP2D6 SNPs that appear to be germline but could be mutations in the tumor, we want to make sure we understand the implications. That is difficult when the file is not available to show what is being filtered.

The blacklist is used to filter out recurrent mutation sites that the MutSig development team found to cause issues with the determination of significance. Because these by nature include germline mutations that may not have been part of available databases at the original generation of the MAF, we are not permitted to release it to the public. In the future we will phase out the blacklist in favor of a panel of normals. I will ensure someone on the team sees this and can hopefully shed light on this question. You could always look at the legacy TCGA data at the GDC to determine if your mutation of interest is available in the raw data: https://gdc-portal.nci.nih.gov/legacy-archive/files/87ac6b5b-806e-4de0-b8d8-ae6888759667.

>Thanks
>
>Scooter Willis
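
[Illustrative aside on the answer to question 2] The observation that the matched-normal allele columns nearly always match or are blank can be checked directly against a MAF. What follows is a minimal Python sketch, not GDAC code. It assumes a tab-delimited MAF with optional '#' comment lines and the column names used in this thread; the function name summarize_match_norm, the Hugo_Symbol column in the sample, and the three sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_match_norm(maf_lines):
    """Tally how the matched-normal allele columns are populated in a MAF.

    maf_lines is any iterable of text lines (an open file works).
    """
    # Skip '#' comment/version lines that may precede the header row.
    rows = csv.DictReader(
        (line for line in maf_lines if not line.startswith("#")),
        delimiter="\t",
    )
    counts = Counter()
    for row in rows:
        ref = (row.get("Reference_Allele") or "").strip()
        a1 = (row.get("Match_Norm_Seq_Allele1") or "").strip()
        a2 = (row.get("Match_Norm_Seq_Allele2") or "").strip()
        if not a1 and not a2:
            counts["blank"] += 1
        elif a1 == a2 == ref:
            counts["both_equal_reference"] += 1
        elif a1 == a2:
            counts["equal_but_not_reference"] += 1
        else:
            # Includes rows where only one of the two columns is filled in.
            counts["differ"] += 1
    return counts


# Tiny made-up example, not real TCGA data.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tReference_Allele\tTumor_Seq_Allele1\tTumor_Seq_Allele2\t"
    "Match_Norm_Seq_Allele1\tMatch_Norm_Seq_Allele2\n"
    "GENE_A\tC\tC\tT\tC\tC\n"
    "GENE_B\tG\tG\tA\t\t\n"
    "GENE_C\tA\tA\tG\tA\tT\n"
)
summary = summarize_match_norm(example)
print(dict(summary))
```

To run it on a real file, pass an open file handle instead of the in-memory sample, for example `with open("BRCA-TP.final_analysis_set.maf") as fh: summarize_match_norm(fh)`. A "differ" count near zero would be consistent with the columns not carrying germline genotype information.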

David Heiman

Dec 6, 2016, 10:51:52 AM
to Scooter Willis, Hertz, Daniel, gdac-...@broadinstitute.org
Hi Scooter,

MAFs drop a lot of the information that is available in the VCFs. To my knowledge, the SNP calling algorithms do use the normals.

TCGA is very conservative with anything that may potentially be a germline mutation. Unless it's been very carefully vetted, it is extremely likely that somatic mutations that appear to be germline will not be available in publicly accessible MAFs.

If you want access to protected data (e.g. VCFs), you will need to follow the instructions here: https://gdc.cancer.gov/access-data/obtaining-access-controlled-data.

-David

--
David Heiman
Run Operations Engineer
TCGA Genome Data Analysis Center
The Broad Institute of MIT and Harvard
415 Main Street
Cambridge, MA 02142

On Thu, Dec 1, 2016 at 12:07 PM, Scooter Willis <will...@gmail.com> wrote:
David

Thanks for the response. Seems odd that the Reference_Allele is a single value if it is being used in the SNP calling algorithm as it does not allow reporting heterogeneity when it occurs. Still concerning that Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 are being reported if they are not accurate or relevant. This is further complicated by LOH where tumor alleles that agree are a marker for a heterozygous deletion assuming that is taken into consideration.  

I have included Daniel Hertz on the email, who I am working with on this particular problem. Ultimately we are trying to determine the likelihood of CYP2D6 mutations in breast cancer to resolve conflicting data on germline SNP mutations in CYP2D6 that are predictive of therapy. Dan is putting in a request for access to the normal and tumor DNA sequence data for TCGA breast so we can resolve the nature of expected mutations in CYP2D6. Based on current TCGA data, for tumor samples with no deletions in the CYP2D6 region, SNP calls in the tumor can be inferred to be germline.

Any guidance or recommendations who we can contact regarding access to normal TCGA SNP data for this type of research? Want to make sure we understand all the options for confirming the SNP calls without reproducing the work that has already been done by GDAC. Given the current irregularities in the reference allele and matched normal would be concerning to use that data in a publication where ideally TCGA is the best data source to answer the question.

Thanks

Scooter   

Scooter Willis

Dec 6, 2016, 10:54:07 AM
to David Heiman, Hertz, Daniel, gdac-...@broadinstitute.org
David

Understood on being conservative. We are going through the process to request the protected data.

Scooter

David Heiman

Dec 6, 2016, 11:22:30 AM
to Scooter Willis, Hertz, Daniel, gdac-...@broadinstitute.org
The following is further clarification from one of our experts:
>To my knowledge, the SNP calling algorithms do use the normals.

Correct.

>Seems odd that the Reference_Allele is a single value if it is being used in the SNP calling algorithm as it does not allow reporting heterogeneity when it occurs. 

The somatic mutation calling algorithms used in TCGA assume that only two alleles (reference germline + somatic) can be present at a given site.  You are correct in observing that this will exclude somatic mutations that either a) occur independently at the same locus in different haplotypes or different subclonal populations or b) occur at the site of a germline het.

These multiallelic sites are excluded because either scenario is extremely unlikely, and we've calculated that any read evidence that might support a multiallelic site is much more likely to be a result of sequencing/alignment artifact than a true multiallelic event.
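
[Illustrative aside] The two-allele assumption described above can be pictured with a toy site classifier. This is only a sketch under stated assumptions, not how the TCGA callers are implemented: the function name classify_site, the per-base read-count dictionary, and the min_alt_reads threshold are all hypothetical, and it looks at tumor reads only, so it does not model the germline-het case (b).

```python
def classify_site(ref_base, base_counts, min_alt_reads=3):
    """Classify one site from per-base read counts.

    Returns a (label, supported_alts) tuple, where label is one of
    'no_variant', 'biallelic', or 'multiallelic'.
    min_alt_reads is an arbitrary illustrative threshold.
    """
    # Non-reference bases with enough supporting reads.
    supported_alts = sorted(
        base
        for base, n in base_counts.items()
        if base != ref_base and n >= min_alt_reads
    )
    if not supported_alts:
        return "no_variant", supported_alts
    if len(supported_alts) == 1:
        # Reference plus exactly one alternate allele: the supported case.
        return "biallelic", supported_alts
    # More than one alternate allele: a caller making the two-allele
    # assumption would exclude this site rather than report it.
    return "multiallelic", supported_alts


print(classify_site("C", {"C": 40, "T": 12}))
print(classify_site("C", {"C": 30, "T": 10, "G": 5}))
print(classify_site("C", {"C": 50, "A": 1}))
```

The third call shows why a support threshold matters in such a sketch: a single stray read is treated as noise rather than as evidence for an alternate allele.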

>Still concerning that Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 are being reported if they are not accurate or relevant. 

This is a fault of the TCGA MAF spec.  Initially, these fields were intended to indicate whether a somatic mutation is homozygous or heterozygous (if the former, Allele1 = Allele2 != reference, if the latter, Allele1 = reference, Allele2 = nonreference).  They were deemed obsolete due to callers' reporting actual somatic allelic fractions, i.e. (alt read count)/(coverage depth), and the fact that somatic mutation callers do not call zygosity, unlike their germline counterparts (see below).
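
[Illustrative aside] The two quantities mentioned above can be written out as a short Python sketch: the original allele-pair convention (Allele1 = Allele2 != reference for homozygous; Allele1 = reference, Allele2 = nonreference for heterozygous) and the allelic fraction, (alt read count)/(coverage depth). The function names are hypothetical and the numbers are made up.

```python
def legacy_zygosity(reference, allele1, allele2):
    """Interpret an allele pair under the original convention described above."""
    if allele1 == allele2 and allele1 != reference:
        return "homozygous"
    if allele1 == reference and allele2 != reference:
        return "heterozygous"
    # Anything else (e.g. both alleles equal the reference, or blanks)
    # carries no zygosity information under this convention.
    return "undetermined"


def allelic_fraction(alt_read_count, coverage_depth):
    """Fraction of reads at a site that support the alternate allele."""
    if coverage_depth <= 0:
        raise ValueError("coverage_depth must be positive")
    return alt_read_count / coverage_depth


print(legacy_zygosity("C", "T", "T"))
print(legacy_zygosity("C", "C", "T"))
print(legacy_zygosity("C", "C", "C"))
print(allelic_fraction(12, 48))
```

The allelic fraction is a continuous value, which is why it replaced the two-state encoding: in a tumor sample with normal contamination or subclones, the fraction rarely sits at exactly 0.5 or 1.0.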

>This is further complicated by LOH where tumor alleles that agree are a marker for a heterozygous deletion assuming that is taken into consideration.  

Correct — LoH events need to be inferred from downstream algorithms (e.g. ABSOLUTE) that use a combination of allelic fraction clusters with respect to all of a patient's mutations and superimposed copy number segments to infer somatic zygosity. In other words, inferring zygosity changes requires exploiting correlations between mutations.  MAFs alone cannot reliably indicate somatic zygosity status, since they are the product solely of mutation callers, which consider sites independently.

>Any guidance or recommendations who we can contact regarding access to normal TCGA SNP data for this type of research? Want to make sure we understand all the options for confirming the SNP calls without reproducing the work that has already been done by GDAC. Given the current irregularities in the reference allele and matched normal would be concerning to use that data in a publication where ideally TCGA is the best data source to answer the question.

Germline data is considered protected access, so you'll need to put in an access request via dbGaP at NCBI to get authorization.  GDAC only hosts public access data (i.e. strictly somatic calls), and thus cannot provide this information.