Hi Scooter,
Apologies for not answering sooner. It is always best to direct any GDAC questions to the GDAC mailing list (CC'd). While I maintain the report, I did not generate the MAF. If someone on the list can't answer your question, we'll relay it or get you the contact information to find the proper expert as soon as possible.
Regards,
David
David,

Wanted to confirm that you got the email that I sent yesterday where you are listed as being responsible for the report. If you are not responsible for the report, can you direct me to the appropriate support channel?

Based on the referenced article it is the Ovarian manuscript, so I assume that is the same process used to process the breast data for SNP calling via exome DNA-seq?

Thanks
Scooter

On Tue, Nov 15, 2016 at 2:38 PM Scooter Willis <will...@gmail.com> wrote:

David,

In the Mutation Analysis data for TCGA Breast I was hoping you can answer a question regarding what appears to be either a data anomaly or a possible output error in the report.

If you look at the summary data for BRCA-TP.final_analysis_set.maf you see differences in Tumor_Seq_Allele1 and Tumor_Seq_Allele2, which is expected, compared to the Reference_Allele.

1. Is the Reference_Allele always derived from Matched_Norm_Sample_Barcode calls? Shouldn't the Reference_Allele have two values, assuming it came from matched normal?
2. My concern is that Match_Norm_Seq_Allele1 always equals Match_Norm_Seq_Allele2, which implies no germline heterogeneity, which shouldn't be the case. It is reasonable for a gene to have an A at position X in one allele and a T at position X in the other allele. My concern is that this is either a reporting error outputting the same value twice or something that possibly impacts the mutation call in tumor.
3. Do you know what SNP array was used for BRCA calls?
4. If the SNP array had intronic/GWAS SNP probes, were these included in or excluded from the analysis?

Thanks
Scooter Willis
In the Mutation Analysis data for TCGA Breast I was hoping you can answer a question regarding what appears to be either a data anomaly or a possible output error in the report.

If you look at the summary data for BRCA-TP.final_analysis_set.maf you see differences in Tumor_Seq_Allele1 and Tumor_Seq_Allele2, which is expected, compared to the Reference_Allele.

1. Is the Reference_Allele always derived from Matched_Norm_Sample_Barcode calls? Shouldn't the Reference_Allele have two values, assuming it came from matched normal?
2. My concern is that Match_Norm_Seq_Allele1 always equals Match_Norm_Seq_Allele2, which implies no germline heterogeneity, which shouldn't be the case. It is reasonable for a gene to have an A at position X in one allele and a T at position X in the other allele. My concern is that this is either a reporting error outputting the same value twice or something that possibly impacts the mutation call in tumor.
3. The BRCA Mutation Analysis references the Nature Ovarian manuscript, so the assumption is that the same methods/source of data (exome DNA-seq) was used?
4. I am also trying to understand the impact of pancan_mutation_blacklist.v14.hg19.txt. We have a gene of interest, CYP2D6, which in BRCA has only one called mutation as compared to what I assume is matched normal tissue from the patient. This would suggest that tumor mutation in CYP2D6 is unlikely. We have a breast cancer cohort where we don't have a normal control for each patient. We have run a SNP array on the cohort where, based on TCGA BRCA, any SNPs that are detected have a very high chance of being germline, in particular if they are well-known SNPs. If the pancan blacklist file is filtering CYP2D6 SNPs that appear to be germline but could be mutations in the tumor, we want to make sure we understand the implications. That is difficult when the file is not available to see what is being filtered.
Thanks
Scooter Willis
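For anyone wanting to reproduce the observation in question 2, here is a minimal sketch that counts how often the two matched-normal allele columns agree in a MAF. It assumes the file is tab-separated with `#` comment lines and uses the column names quoted above; the function name and the example rows are made up for illustration and are not real TCGA data.

```python
import csv
import io


def count_matched_normal_genotypes(maf_file):
    """Count rows where the two matched-normal allele columns agree or differ."""
    # Assumption: MAF is tab-separated and comment lines start with '#'.
    rows = csv.DictReader(
        (line for line in maf_file if not line.startswith("#")),
        delimiter="\t",
    )
    counts = {"same": 0, "different": 0}
    for row in rows:
        a1 = row["Match_Norm_Seq_Allele1"]
        a2 = row["Match_Norm_Seq_Allele2"]
        counts["same" if a1 == a2 else "different"] += 1
    return counts


# Hypothetical example rows (not real TCGA data).
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tReference_Allele\tTumor_Seq_Allele1\tTumor_Seq_Allele2\t"
    "Match_Norm_Seq_Allele1\tMatch_Norm_Seq_Allele2\n"
    "GENE_A\tA\tA\tT\tA\tA\n"
    "GENE_B\tC\tC\tG\tC\tC\n"
)
print(count_matched_normal_genotypes(example))
```

With a real file, one would pass an open file handle for BRCA-TP.final_analysis_set.maf instead of the in-memory example.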
David,

Thanks for the response. It seems odd that the Reference_Allele is a single value if it is being used in the SNP calling algorithm, as that does not allow reporting heterogeneity when it occurs. It is still concerning that Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 are being reported if they are not accurate or relevant. This is further complicated by LOH, where tumor alleles that agree are a marker for a heterozygous deletion, assuming that is taken into consideration.

I have included Daniel Hertz on the email, who I am working with on this particular problem. Ultimately we are trying to determine the likelihood of CYP2D6 mutations in breast cancer to resolve conflicting data on germline SNP mutations in CYP2D6 that are predictive of therapy. Dan is putting in a request for access to the normal and tumor DNA sequence data for TCGA breast so we can resolve the nature of expected mutations in CYP2D6. Based on current TCGA data, for tumor samples with no deletions in the CYP2D6 region, SNP calls in tumor can be inferred as being germline.

Any guidance or recommendations on who we can contact regarding access to normal TCGA SNP data for this type of research? We want to make sure we understand all the options for confirming the SNP calls without reproducing the work that has already been done by GDAC. Given the current irregularities in the reference allele and matched normal, it would be concerning to use that data in a publication where ideally TCGA is the best data source to answer the question.

Thanks
Scooter
>To my knowledge, the SNP calling algorithms do use the normals.
Correct.
>Seems odd that the Reference_Allele is a single value if it is being used in the SNP calling algorithm as it does not allow reporting heterogeneity when it occurs.
The somatic mutation calling algorithms used in TCGA assume that only two alleles (reference germline + somatic) can be present at a given site. You are correct in observing that this will exclude somatic mutations that either a) occur independently at the same locus in different haplotypes or different subclonal populations or b) occur at the site of a germline het.

These multiallelic sites are excluded because either scenario is extremely unlikely, and we've calculated that any read evidence that might support a multiallelic site is much more likely to be a result of sequencing/alignment artifact than a true multiallelic event.
>Still concerning that Match_Norm_Seq_Allele1 and Match_Norm_Seq_Allele2 are being reported if they are not accurate or relevant.
This is a fault of the TCGA MAF spec. Initially, these fields were intended to indicate whether a somatic mutation is homozygous or heterozygous (if the former, Allele1 = Allele2 != reference, if the latter, Allele1 = reference, Allele2 = nonreference). They were deemed obsolete due to callers' reporting actual somatic allelic fractions, i.e. (alt read count)/(coverage depth), and the fact that somatic mutation callers do not call zygosity, unlike their germline counterparts (see below).
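As a small illustration of the allelic fraction described above, i.e. (alt read count)/(coverage depth), here is a minimal sketch. The function name and the read counts are hypothetical, and which MAF columns actually carry these counts depends on the caller and file version.

```python
def allelic_fraction(alt_read_count, coverage_depth):
    """Return the somatic allelic fraction: alt reads divided by total depth."""
    if coverage_depth <= 0:
        raise ValueError("coverage depth must be positive")
    return alt_read_count / coverage_depth


# Hypothetical site: 12 reads support the alternate allele out of 48 total.
print(allelic_fraction(12, 48))  # 0.25
```

A fraction like this is a continuous measurement at a single site, which is why it does not by itself say whether a mutation is homozygous or heterozygous.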
>This is further complicated by LOH where tumor alleles that agree are a marker for a heterozygous deletion assuming that is taken into consideration.
Correct — LoH events need to be inferred from downstream algorithms (e.g. ABSOLUTE) that use a combination of allelic fraction clusters with respect to all of a patient's mutations and superimposed copy number segments to infer somatic zygosity. In other words, inferring zygosity changes requires exploiting correlations between mutations. MAFs alone cannot reliably indicate somatic zygosity status, since they are the product solely of mutation callers, which consider sites independently.
>Any guidance or recommendations who we can contact regarding access to normal TCGA SNP data for this type of research? Want to make sure we understand all the options for confirming the SNP calls without reproducing the work that has already been done by GDAC. Given the current irregularities in the reference allele and matched normal would be concerning to use that data in a publication where ideally TCGA is the best data source to answer the question.
Germline data is considered protected access, so you'll need to put in an access request via dbGaP at NCBI to get authorization. GDAC only hosts public access data (i.e. strictly somatic calls), and thus cannot provide this information.