Mutsig vs. extended mutations in luad

Marcus Kelly

Oct 12, 2017, 10:31:38 AM

to cBioPortal for Cancer Genomics Discussion Group

Hello,

I've downloaded the TCGA provisional data on lung adenocarcinoma (http://www.cbioportal.org/study?id=luad_tcga#summary) for my own analysis of mutation incidence. I've noticed that the mutation data supplied in the data_mutations_extended.txt file does not seem to agree with the data_mutsig.txt file.

Consider the mutations for HBG2 from data_mutations_extended.txt:

$ grep '^HBG2' data_mutations_extended.txt
HBG2    3048    broad.mit.edu   GRCh37 11      5275539 5275539 +       missense_variant        Missense_Mutation       SNP     C       C       A                       TCGA-38-4632-01 TCGA-38-4632-11 C       C       -       -       C   C
Unknown         Somatic Unspecified     WXS     none    NA      NA      Illumina HiSeq 49      21                      ENST00000336906.4:c.298G>T      p.Asp100Tyr     p.D100Y ENST00000336906 NM_000184.2     100     Gat/Tat 0       2   7
35      blood coagulation               Epithelial(150;2.76e-09)|BRCA - Breast invasive adenocarcinoma(625;0.135)               100             HBG2_HUMAN      5.65419e-07     c.298G>T        HBG2_uc001mak.1_RNA|HBG2_uc001maj.1_Missense_Mutation_p.D100Y
1       c.(298-300)GAT>TAT                      P69892 NA                                      heme binding|oxygen binding|oxygen transporter activity A-gamma globin          skin(1)                 1       g.chr11:5275539C>A
4.4004e-07      0.502   TTCTCAGGATCCACATGCAGC                           p.D100Y -                                       hemoglobin complex              uc001mai.1      NM_000559       0.00333 NP_000550       0   Medulloblastoma(188;0.00225)|Breast(177;0.0155)|all_neural(188;0.0212)

etc.

There is, clearly, only one mutation.

However, in the MutSig file:

rank                          35
gene                        HBG2
description hemoglobin, gamma G
N                         201510
n                             10
npat                          10
nsite                         10
nsil                           2
n1                             1
n2                             4
n3                             2
n4                             2
n5                             1
n6                             0
p_classic               0.000124
p_ns_s                     0.280
p_clust                 0.000611
p_cons                    0.0256
p_joint                 0.000191
p                       4.40e-07
q                       0.000227

As far as I can tell, this indicates that there are ten mutations at ten unique sites in ten different patients' HBG2 genes. In which case, why are at least nine of them missing from the data_mutations_extended.txt file?
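The comparison described above can be sketched in Python. This is a minimal sketch, not part of any cBioPortal or MutSig tooling: the MutSig column names (`gene`, `n`, `npat`, `nsite`) are taken from the output shown above, while the MAF column names (`Hugo_Symbol`, `Tumor_Sample_Barcode`, `Chromosome`, `Start_Position`, `Variant_Classification`) are assumed from the standard MAF layout and should be checked against the file header. It also assumes that MutSig's `n` counts only non-silent mutations, so rows classified as `Silent` are skipped by default.

```python
import csv


def maf_counts(lines, gene, skip_silent=True):
    """Count rows, distinct samples and distinct sites for one gene in a MAF-style file.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Column names are assumed from the standard MAF layout.
    """
    # MAF files may start with comment lines such as "#version 2.4".
    rows = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    n = 0
    samples, sites = set(), set()
    for rec in reader:
        if rec["Hugo_Symbol"] != gene:
            continue
        # Assumption: MutSig's n/npat/nsite refer to non-silent mutations only.
        if skip_silent and rec.get("Variant_Classification") == "Silent":
            continue
        n += 1
        samples.add(rec["Tumor_Sample_Barcode"])
        sites.add((rec["Chromosome"], rec["Start_Position"]))
    return {"n": n, "npat": len(samples), "nsite": len(sites)}


def mutsig_counts(lines, gene):
    """Return n, npat and nsite for one gene from a tab-separated MutSig table."""
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for rec in reader:
        if rec["gene"] == gene:
            return {key: int(rec[key]) for key in ("n", "npat", "nsite")}
    return None  # gene not present in the table


# Example usage (file names as in the downloaded study folder):
# with open("data_mutations_extended.txt") as maf, open("data_mutsig.txt") as sig:
#     print(maf_counts(maf, "HBG2"), mutsig_counts(sig, "HBG2"))
```

If the two dictionaries differ for a gene, the mutation list and the MutSig table were most likely built from different sets of calls.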

Kelsey Zhu

Oct 16, 2017, 4:56:06 PM

to mrku...@gmail.com, Jeff Bruce, cBioPortal for Cancer Genomics Discussion Group

Hi Marcus,

I am forwarding the email from our scientist Dr. Bruce. He took a look at your question.

Best!

Kelsey

---------- Forwarded message ----------
From: Jeff Bruce <jeffp...@gmail.com>
Date: Mon, Oct 16, 2017 at 3:20 PM
Subject: Re: Mutsig vs. extended mutations in luad_tcga
To: Kelsey Zhu <kelse...@gmail.com>

It appears that several mutations in HBG2 were filtered out by the Firehose pipeline. It's possible these mutations made it onto a blacklist or were determined to be likely germline variants. The structure of the pipeline appears to have allowed these mutations to make it into MutSig but not into the final set imported into cBioPortal.

The raw calls, including the filtered mutations, can be downloaded here: http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/LUAD/20160128/gdac.broadinstitute.org_LUAD.Mutation_Packager_Oncotated_Raw_Calls.Level_3.2016012800.0.0.tar.gz

but these should be used with caution and the knowledge that at least one filter determined that the majority of the mutations in this gene should be removed.
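To see which calls were dropped, one option is to diff the raw calls against the final mutation file. The following is a minimal sketch under several assumptions: both files are tab-separated MAFs with the standard `Hugo_Symbol`, `Tumor_Sample_Barcode`, `Chromosome` and `Start_Position` columns, and the raw calls may carry longer TCGA barcodes than the cBioPortal file, so barcodes are truncated to the 15-character sample ID before comparison. The function names are illustrative only.

```python
import csv


def mutation_keys(lines, gene):
    """Collect (sample, chromosome, position) keys for one gene from MAF-style lines."""
    rows = (ln for ln in lines if not ln.startswith("#"))  # skip comment lines
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    keys = set()
    for rec in reader:
        if rec["Hugo_Symbol"] == gene:
            # Assumption: the first 15 characters identify the sample,
            # e.g. "TCGA-38-4632-01".
            sample = rec["Tumor_Sample_Barcode"][:15]
            keys.add((sample, rec["Chromosome"], rec["Start_Position"]))
    return keys


def filtered_out(raw_lines, final_lines, gene):
    """Return calls present in the raw file but absent from the final file."""
    return sorted(mutation_keys(raw_lines, gene) - mutation_keys(final_lines, gene))
```

Anything returned by `filtered_out` was removed somewhere between the raw calls and the imported set, which is exactly the group of mutations to treat with caution.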


Jeff Bruce

Oct 17, 2017, 9:43:11 AM

to Mark Kelly, Kelsey Zhu, cBioPortal for Cancer Genomics Discussion Group

Hi Mark, most of that information can be found on the Broad GDAC FAQ site: https://confluence.broadinstitute.org/display/GDAC/FAQ

On Mon, Oct 16, 2017 at 5:52 PM, Mark Kelly <mrku...@gmail.com> wrote:

Thanks!

Can I suggest that the fork in this pipeline be made clear on your website?

Also, while I have your attention, I haven't been able to find mutsig documentation that includes the set of columns in the data_mutsig.txt file. Could you point me to some?

Mark

Mark Kelly

Oct 17, 2017, 10:59:18 AM

to Kelsey Zhu, Jeff Bruce, cBioPortal for Cancer Genomics Discussion Group

Thanks!

Can I suggest that the fork in this pipeline be made clear on your website?

Also, while I have your attention, I haven't been able to find mutsig documentation that includes the set of columns in the data_mutsig.txt file. Could you point me to some?

Mark
