Hello,
I've downloaded the TCGA provisional data on lung adenocarcinoma (http://www.cbioportal.org/study?id=luad_tcga#summary) to do my own analysis of mutation incidence. I've noticed that the mutation data supplied in the data_mutations_extended.txt file does not seem to agree with the data_mutsig file.
Consider the mutations for HBG2 from data_mutations_extended.txt :
$ grep '^HBG2' data_mutations_extended.txt
HBG2 3048 broad.mit.edu GRCh37 11 5275539 5275539 + missense_variant Missense_Mutation SNP C C A TCGA-38-4632-01 TCGA-38-4632-11 C C - - C C
Unknown Somatic Unspecified WXS none NA NA Illumina HiSeq 49 21 ENST00000336906.4:c.298G>T p.Asp100Tyr p.D100Y ENST00000336906 NM_000184.2 100 Gat/Tat 0 2 7
35 blood coagulation Epithelial(150;2.76e-09)|BRCA - Breast invasive adenocarcinoma(625;0.135) 100 HBG2_HUMAN 5.65419e-07 c.298G>T HBG2_uc001mak.1_RNA|HBG2_uc001maj.1_Missense_Mutation_p.D100Y
1 c.(298-300)GAT>TAT P69892 NA heme binding|oxygen binding|oxygen transporter activity A-gamma globin skin(1) 1 g.chr11:5275539C>A
4.4004e-07 0.502 TTCTCAGGATCCACATGCAGC p.D100Y - hemoglobin complex uc001mai.1 NM_000559 0.00333 NP_000550 0 Medulloblastoma(188;0.00225)|Breast(177;0.0155)|all_neural(188;0.0212)
etc.
There is, clearly, only one mutation.
However, in the mutsig file :
rank 35
gene HBG2
description hemoglobin, gamma G
N 201510
n 10
npat 10
nsite 10
nsil 2
n1 1
n2 4
n3 2
n4 2
n5 1
n6 0
p_classic 0.000124
p_ns_s 0.280
p_clust 0.000611
p_cons 0.0256
p_joint 0.000191
p 4.40e-07
q 0.000227
As far as I can tell, this indicates ten mutations (n) at ten unique sites (nsite) in ten different patients (npat) for HBG2. If so, why are at least nine of them missing from the data_mutations_extended.txt file?
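
To make the comparison less dependent on eyeballing grep output, here is a minimal Python sketch that tallies the rows for one gene in the mutation file. It assumes the file is tab-delimited with a header row containing the standard MAF columns Hugo_Symbol, Chromosome, Start_Position and Tumor_Sample_Barcode, possibly preceded by '#' comment lines. The function name count_gene_mutations is made up for illustration, and labelling the counts n, npat and nsite only reflects my reading of the mutsig columns, not a confirmed definition.

```python
import csv


def count_gene_mutations(maf_lines, gene):
    """Count rows, distinct tumor samples, and distinct sites for one gene.

    maf_lines: any iterable of text lines from a tab-delimited MAF file.
    The n / npat / nsite labels are an assumed mapping to the mutsig columns.
    """
    # Skip '#' comment lines that may precede the header row.
    data_lines = (line for line in maf_lines if not line.startswith("#"))
    # QUOTE_NONE so stray quote characters in annotation fields are kept literally.
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)

    n = 0
    samples = set()
    sites = set()
    for row in reader:
        if row["Hugo_Symbol"] != gene:
            continue
        n += 1
        samples.add(row["Tumor_Sample_Barcode"])
        sites.add((row["Chromosome"], row["Start_Position"]))

    return {"n": n, "npat": len(samples), "nsite": len(sites)}
```

To use it, open data_mutations_extended.txt and pass the file handle as maf_lines with gene set to "HBG2". If the result still shows n = 1 while the mutsig file reports n = 10, the two files really do disagree for this gene.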