Loading GENIE data on cBioPortal

Shakuntala Baichoo

Feb 1, 2022, 1:03:38 PM
to cBioPortal for Cancer Genomics Discussion Group
Hi!
Has anyone loaded GENIE data on a local version of cBioPortal?
I am trying to load GENIE 11 data. I managed to load all the panel details, but it seems there are some unknown values in the sample table and the mutation file that prevent the data from loading.

I would be grateful if someone could help with this.

Thanks,
Shakuntala

Ritika Kundra

Feb 1, 2022, 1:08:45 PM
to Shakuntala Baichoo, cBioPortal for Cancer Genomics Discussion Group
Hi Shakuntala,

Can you share the samples or the error message? 

Thanks,
Ritika

--
You received this message because you are subscribed to the Google Groups "cBioPortal for Cancer Genomics Discussion Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cbioportal+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cbioportal/46cc86a5-5446-47da-8ad0-05bc9a47976en%40googlegroups.com.

Shakuntala Baichoo

Feb 1, 2022, 3:52:34 PM
to Ritika Kundra, cBioPortal for Cancer Genomics Discussion Group
Hi Ritika,
It seems that it cannot recognize some values that stand for NA in the samples, mutations_extended, and CNA files.
Here is most of the validation output; the errors are the lines that begin with ERROR:
....
INFO: -: Validation of case list folder complete

INFO: data_gene_matrix.txt: line 1: This column can be replaced by a 'gene_panel' property in the respective meta file; value encountered: 'cna'
ERROR: data_gene_matrix.txt: lines [11, 22, 26, (35601 more)]: column 3: Blank cell found in column; value encountered: ''' (in column 'cna')'
ERROR: data_gene_matrix.txt: lines [11, 22, 26, (35601 more)]: Gene panel ID is not in database. Please import this gene panel before loading study data.; value encountered: ''
INFO: data_gene_matrix.txt: Validation of file complete
INFO: data_gene_matrix.txt: Read 135707 lines. Lines with warning: 0. Lines with error: 35604

WARNING: data_CNA.txt: line 1: The recommended column Entrez_Gene_Id was not found. Using Hugo_Symbol for all gene parsing.
WARNING: data_CNA.txt: lines [92, 99, 179, (46 more)]: Gene symbol not known to the cBioPortal instance. This record will not be loaded.; values encountered: ['BRE', 'C11ORF30', 'CXORF67', '(46 more)']
INFO: data_CNA.txt: Validation of file complete
INFO: data_CNA.txt: Read 965 lines. Lines with warning: 50. Lines with error: 0

WARNING: data_clinical_patient.txt: Columns OS_MONTHS and/or OS_STATUS not found. Overall survival analysis feature will not be available for this study.
WARNING: data_clinical_patient.txt: Columns DFS_MONTHS and/or DFS_STATUS not found. Disease free analysis feature will not be available for this study.
ERROR: data_clinical_patient.txt: lines [6, 10, 13, (77894 more)]: columns [7, 10, 6, (1 more)]: Value of numeric attribute is not a real number; values encountered: ['Unknown', 'Not Applicable', 'Not Collected', '(1 more)']
INFO: data_clinical_patient.txt: Validation of file complete
INFO: data_clinical_patient.txt: Read 121226 lines. Lines with warning: 0. Lines with error: 77897

WARNING: data_fusions.txt: lines [32, 162, 246, (1435 more)]: Gene symbol not known to the cBioPortal instance. This record will not be loaded.; values encountered: ['PARK2', 'HIST1H2BD', 'MRE11A', '(659 more)']
WARNING: data_fusions.txt: lines [2530, 2531, 2532, (2664 more)]: Entrez gene id is not an integer. This record will not be loaded.; values encountered: ['238.0', '324.0', '8289.0', '(709 more)']
WARNING: data_fusions.txt: lines [13995, 14252, 35873, (1 more)]: Hugo Symbol is not in gene or alias table and starts with a number. This can be caused by unintentional gene conversion in Excel.; values encountered: ['48787', '30302_C.890', '1311DEL']
INFO: data_fusions.txt: Validation of file complete
INFO: data_fusions.txt: Read 41199 lines. Lines with warning: 4105. Lines with error: 0

WARNING: data_mutations_extended.txt: column 60: A SWISSPROT column was found in datafile without specifying associated 'swissprot_identifier' in metafile, assuming 'swissprot_identifier: name'.
WARNING: data_mutations_extended.txt: lines [2, 3, 4, (9652 more)]: Variant_Type indicates a SNP, but length of Reference_Allele, Tumor_Seq_Allele1 and/or Tumor_Seq_Allele2 do not equal 1.; values encountered: ['(T, , C)', '(G, , T)', '(C, , A)', '(9 more)']
WARNING: data_mutations_extended.txt: lines [2, 3, 4, (216850 more)]: Missing value in SWISSPROT column; this column is recommended to make sure that the UniProt canonical isoform is used when drawing Pfam domains in the mutations view.; value encountered: ''
INFO: data_mutations_extended.txt: lines [35, 49, 59, (9949 more)]: Line will not be loaded due to the variant classification filter. Filtered types: [Silent, Intron, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, RNA]; values encountered: ['Intron', 'Silent', '3'UTR', '(4 more)']
WARNING: data_mutations_extended.txt: lines [172, 225, 301, (160 more)]: Variant_Type indicates a DNP, but length of Reference_Allele, Tumor_Seq_Allele1 and/or Tumor_Seq_Allele2 do not equal 2.; values encountered: ['(CG, , AA)', '(CC, , AA)', '(GG, , TT)', '(27 more)']
WARNING: data_mutations_extended.txt: lines [303, 315, 1545, (114 more)]: Variant_Type indicates a ONP, but length of Reference_Allele, Tumor_Seq_Allele1 and 2 are not bigger than 3 or are of unequal lengths.; values encountered: ['(GTG, , AAA)', '(ACCAC, , GTGGT)', '(CTG, , TTG)', '(77 more)']
WARNING: data_mutations_extended.txt: lines [12632, 12702, 12888, (3345 more)]: Entrez gene id exists, but gene symbol specified is not known to the cBioPortal instance. The gene symbol will be ignored. Might be wrong mapping, new or deprecated gene symbol.; values encountered: ['MEF2BNB-MEF2B', 'PARK2', 'RFWD2', '(18 more)']
WARNING: data_mutations_extended.txt: lines [12632, 14434, 14478, (789 more)]: Off panel variant. Gene symbol not known to the targeted panel.; values encountered: ['MEF2BNB-MEF2B', 'GNB2L1', 'NUTM1', '(19 more)']
WARNING: data_mutations_extended.txt: lines [29888, 45896, 45961, (27 more)]: No Amino_Acid_Change or HGVSp_Short value. This mutation record will get a generic "MUTATED" flag
WARNING: data_mutations_extended.txt: lines [226807, 226808, 226809, (478212 more)]: Missing value in SWISSPROT column; this column is recommended to make sure that the UniProt canonical isoform is used when drawing Pfam domains in the mutations view.; value encountered: ''
INFO: data_mutations_extended.txt: lines [226825, 226837, 226846, (9415 more)]: Line will not be loaded due to the variant classification filter. Filtered types: [Silent, Intron, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, RNA]; values encountered: ['Intron', 'Silent', 'RNA', '(4 more)']
WARNING: data_mutations_extended.txt: lines [226849, 226916, 226917, (5774 more)]: Entrez gene id exists, but gene symbol specified is not known to the cBioPortal instance. The gene symbol will be ignored. Might be wrong mapping, new or deprecated gene symbol.; values encountered: ['BRE', 'WHSC1L1', 'WHSC1', '(44 more)']
WARNING: data_mutations_extended.txt: lines [227152, 227218, 227353, (350 more)]: Off panel variant. Gene symbol not known to the targeted panel.; values encountered: ['PGBD3', 'MEF2BNB-MEF2B', 'FIP1L1', '(24 more)']
WARNING: data_mutations_extended.txt: lines [227289, 228078, 228676, (534 more)]: Variant_Type indicates a ONP, but length of Reference_Allele, Tumor_Seq_Allele1 and 2 are not bigger than 3 or are of unequal lengths.; values encountered: ['(CTC, CTC, ATT)', '(CTC, CTC, TTT)', '(TGC, TGC, GAA)', '(306 more)']
WARNING: data_mutations_extended.txt: lines [230068, 232465, 243076, (67 more)]: No Amino_Acid_Change or HGVSp_Short value. This mutation record will get a generic "MUTATED" flag
WARNING: data_mutations_extended.txt: lines [331868, 331869, 331870, (1911 more)]: Variant_Type indicates a SNP, but length of Reference_Allele, Tumor_Seq_Allele1 and/or Tumor_Seq_Allele2 do not equal 1.; values encountered: ['(G, , A)', '(C, , G)', '(A, , C)', '(13 more)']
WARNING: data_mutations_extended.txt: lines [331968, 331970, 331974, (265 more)]: Variant_Type indicates a DNP, but length of Reference_Allele, Tumor_Seq_Allele1 and/or Tumor_Seq_Allele2 do not equal 2.; values encountered: ['(CG, , GG)', '(CC, , AC)', '(CG, , AG)', '(55 more)']
ERROR: data_mutations_extended.txt: lines [332357, 332496, 332549, (19 more)]: No Entrez gene id or gene symbol provided for gene.
WARNING: data_mutations_extended.txt: lines [337450, 337462, 337596, (3420 more)]: Gene symbol not known to the cBioPortal instance. This record will not be loaded.; values encountered: ['PAK7', 'PARK2', 'WHSC1', '(10 more)']
WARNING: data_mutations_extended.txt: lines [414405, 415440, 593496]: All Values in columns Reference_Allele, Tumor_Seq_Allele1 and Tumor_Seq_Allele2 are equal.; values encountered: ['(GAGG, GAGG, GAGG)', '(AGG, AGG, AGG)']
WARNING: data_mutations_extended.txt: lines [714440, 714441, 714442, (277644 more)]: Missing value in SWISSPROT column; this column is recommended to make sure that the UniProt canonical isoform is used when drawing Pfam domains in the mutations view.; value encountered: ''
INFO: data_mutations_extended.txt: lines [714469, 714526, 714532, (73719 more)]: Line will not be loaded due to the variant classification filter. Filtered types: [Silent, Intron, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR, RNA]; values encountered: ['5'Flank', '3'UTR', '5'UTR', '(4 more)']
WARNING: data_mutations_extended.txt: lines [714505, 714510, 714737, (1713 more)]: Entrez gene id exists, but gene symbol specified is not known to the cBioPortal instance. The gene symbol will be ignored. Might be wrong mapping, new or deprecated gene symbol.; values encountered: ['HIST1H3E', 'HIST1H3C', 'HIST1H1C', '(44 more)']
WARNING: data_mutations_extended.txt: lines [714571, 714621, 714736, (2660 more)]: Gene symbol not known to the cBioPortal instance. This record will not be loaded.; values encountered: ['FAM46C', 'PAK7', 'PARK2', '(28 more)']
WARNING: data_mutations_extended.txt: lines [714616, 715842, 717575, (333 more)]: Variant_Type indicates a ONP, but length of Reference_Allele, Tumor_Seq_Allele1 and 2 are not bigger than 3 or are of unequal lengths.; values encountered: ['(CTC, CTC, TTT)', '(CAC, CAC, AAA)', '(GAG, GAG, AAA)', '(238 more)']
WARNING: data_mutations_extended.txt: lines [732877, 740361, 743455, (713 more)]: No Amino_Acid_Change or HGVSp_Short value. This mutation record will get a generic "MUTATED" flag
WARNING: data_mutations_extended.txt: lines [778513, 778538, 794417, (448 more)]: Off panel variant. Gene symbol not known to the targeted panel.; values encountered: ['C1orf147', 'HIST2H3D', 'PCDHAC1', '(14 more)']
WARNING: data_mutations_extended.txt: lines [796770, 796773, 796774, (31020 more)]: Variant_Type indicates a SNP, but length of Reference_Allele, Tumor_Seq_Allele1 and/or Tumor_Seq_Allele2 do not equal 1.; values encountered: ['(A, , G)', '(C, , T)', '(T, , A)', '(13 more)']
WARNING: data_mutations_extended.txt: lines [796810, 796813, 796826, (414 more)]: Variant_Type indicates a DNP, but length of Reference_Allele, Tumor_Seq_Allele1 and/or Tumor_Seq_Allele2 do not equal 2.; values encountered: ['(CC, , AC)', '(GA, , AA)', '(CC, , TC)', '(77 more)']
WARNING: data_mutations_extended.txt: lines [846157, 846235, 846527, (169 more)]: All Values in columns Reference_Allele, Tumor_Seq_Allele1 and Tumor_Seq_Allele2 are equal.; values encountered: ['(-, -, -)', '(CT, CT, CT)', '(CG, CG, CG)', '(2 more)']
WARNING: data_mutations_extended.txt: lines [846157, 846527, 846846, (132 more)]: Given value for Variant_Classification column is not one of the expected values. This can result in mapping issues and subsequent missing features in the mutation view UI, such as missing COSMIC information.; values encountered: ['In_Frame_DEL', 'Frame_Shift_DEL']
WARNING: data_mutations_extended.txt: lines [859622, 859623, 859624, (5 more)]: Entrez gene id and gene symbol do not match. The gene symbol will be ignored. Might be wrong mapping or recycled gene symbol.; value encountered: '(KMT2D, 9757)'
ERROR: data_mutations_extended.txt: lines [880084, 880092, 880098, (551 more)]: No Entrez gene id or gene symbol provided for gene.
INFO: data_mutations_extended.txt: Validation of file complete
INFO: data_mutations_extended.txt: Read 1065808 lines. Lines with warning: 972715. Lines with error: 576

WARNING: genie_data_cna_hg19.seg: lines [106307, 153216, 334275, (6 more)]: Segment is zero bases wide and will not be loaded; values encountered: ['153023184-153023184', '65096797-65096797', '36854039-36854039', '(6 more)']
INFO: genie_data_cna_hg19.seg: Validation of file complete
INFO: genie_data_cna_hg19.seg: Read 3748118 lines. Lines with warning: 9. Lines with error: 0

INFO: -: Validation complete
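
To see at a glance which files actually fail validation, the ERROR lines can be tallied from a saved copy of a log like the one above. This is a minimal Python sketch, assuming every log line follows the `LEVEL: file: message` shape seen in this output; the function name `count_errors_by_file` and the sample lines are illustrative and are not part of cBioPortal.

```python
import re
from collections import Counter

# Matches lines shaped like "ERROR: data_gene_matrix.txt: lines [...]: message".
LOG_LINE = re.compile(r"^(DEBUG|INFO|WARNING|ERROR): ([^:]+): (.*)$")


def count_errors_by_file(log_lines):
    """Return a Counter mapping file name -> number of ERROR messages."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(1) == "ERROR":
            counts[match.group(2)] += 1
    return counts


# A few lines copied from the output above, used here as sample input.
sample_log = [
    "INFO: data_gene_matrix.txt: Validation of file complete",
    "ERROR: data_gene_matrix.txt: lines [11, 22, 26, (35601 more)]: "
    "Gene panel ID is not in database. Please import this gene panel "
    "before loading study data.; value encountered: ''",
    "ERROR: data_mutations_extended.txt: lines [332357, 332496, 332549, "
    "(19 more)]: No Entrez gene id or gene symbol provided for gene.",
    "WARNING: data_CNA.txt: line 1: The recommended column Entrez_Gene_Id "
    "was not found. Using Hugo_Symbol for all gene parsing.",
]

for name, n in sorted(count_errors_by_file(sample_log).items()):
    print(f"{name}: {n} ERROR message(s)")
```

Note that each message stands for many data lines (for example "35601 more"), so the tally counts distinct error messages per file, not affected rows.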


Thanks,

Shakuntala
---------------------------------------------------------------
Assoc. Prof. (Dr.) Shakuntala Baichoo
Department of Digital Technologies, FoICDT, University of Mauritius
Phone: +230 4037762

Ritika Kundra

Feb 2, 2022, 10:52:19 AM
to Shakuntala Baichoo, cBioPortal for Cancer Genomics Discussion Group
Hi Shakuntala,

Thanks for sharing that. How did you create your database? And can you also share the MAF and matrix file?

Ritika

Shakuntala Baichoo

Feb 2, 2022, 12:31:42 PM
to Ritika Kundra, cBioPortal for Cancer Genomics Discussion Group
Hi Ritika,
I used docker-compose to create the database.
As for sharing the data, unfortunately I cannot because it is access-controlled; I am sorry for that.

However, I know that cBioPortal has loaded this data on its main site, though separately, so maybe they used some scripts to clean the data before loading it.
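
A cleaning pass of that kind could look like the sketch below, which targets the "Value of numeric attribute is not a real number" error in data_clinical_patient.txt. It is only an illustration and not the script used for the main site. It assumes the clinical file starts with `#` header lines whose third line gives the column datatypes (STRING, NUMBER, ...), and that `NA` is accepted as a missing value by the instance; the function name `blank_non_numeric`, the `placeholder` parameter, and the `AGE` column are hypothetical.

```python
def is_number(value):
    """Return True if the cell parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def blank_non_numeric(lines, placeholder="NA"):
    """Replace non-numeric cells in NUMBER columns with `placeholder`.

    Returns (cleaned_lines, number_of_replaced_cells).
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]
    meta = [row for row in rows if row[0].startswith("#")]
    # Assumption: the third '#' line lists the datatype of each column.
    datatypes = [cell.lstrip("#").strip().upper() for cell in meta[2]]
    numeric_cols = [i for i, t in enumerate(datatypes) if t == "NUMBER"]

    replaced = 0
    seen_column_names = False
    for row in rows:
        if row[0].startswith("#"):
            continue  # metadata header lines are left untouched
        if not seen_column_names:
            seen_column_names = True  # first non-'#' row holds attribute names
            continue
        for i in numeric_cols:
            if i < len(row) and row[i].strip() and not is_number(row[i]):
                row[i] = placeholder
                replaced += 1
    return ["\t".join(row) for row in rows], replaced


# Tiny made-up example; real GENIE files have many more columns.
clinical = [
    "#Patient Identifier\tAge",
    "#Patient identifier\tAge at sequencing",
    "#STRING\tNUMBER",
    "#1\t1",
    "PATIENT_ID\tAGE",
    "P1\t63",
    "P2\tUnknown",
    "P3\tNot Collected",
]
cleaned, count = blank_non_numeric(clinical)
print(count, "cell(s) replaced")
```

Cells that are already empty are left as they are, and it is worth re-running the validator afterwards to confirm the errors are gone.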

Thanks,
Shakuntala 

Ritika Kundra

Feb 9, 2022, 10:25:29 AM
to Shakuntala Baichoo, cBioPortal for Cancer Genomics Discussion Group
Hi Shakuntala,

Apologies for the delay.

Let me try to reproduce your process to see what is causing the problem. I will get back to you once I have an update. Can you share the source from which you downloaded the GENIE files? Since we are submitting data to GENIE for MSK, we are part of the GENIE consortium.

Thanks,
Ritika


Shakuntala Baichoo

Feb 10, 2022, 9:26:24 AM
to Ritika Kundra, cBioPortal for Cancer Genomics Discussion Group
Hi Ritika,
Thanks a lot for your reply.
I downloaded the GENIE data from Sage Bionetworks: https://www.synapse.org/#!Synapse:syn7222066/wiki/410924

Best regards,
Shakuntala

Ritika Kundra

Mar 16, 2022, 11:14:45 AM
to Shakuntala Baichoo, cBioPortal for Cancer Genomics Discussion Group
Hi Shakuntala,

Sorry for the late reply. We saw the same errors as you did. They are caused by missing data: the fields that trigger the errors cannot be imported into the portal in their current format. A workaround is to remove the erroneous entries and re-import. We will work upstream to see whether the centers can resend updated information, but it is hard to give a timeline for that.
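
For the mutation file, removing the erroneous entries could be sketched as below. This is a minimal Python example under stated assumptions, not an official cBioPortal tool: it assumes a tab-delimited MAF with both `Hugo_Symbol` and `Entrez_Gene_Id` columns, and it drops only the rows behind the "No Entrez gene id or gene symbol provided for gene" error. The function name `drop_rows_without_gene` and the `blank_tokens` parameter are hypothetical; which values count as blank on a given instance is an assumption to check.

```python
def drop_rows_without_gene(lines, blank_tokens=("",)):
    """Drop MAF rows that have neither a gene symbol nor an Entrez gene id.

    Returns (kept_lines, number_of_dropped_rows).
    """
    kept, dropped = [], 0
    symbol_idx = entrez_idx = None
    for line in lines:
        row = line.rstrip("\n")
        if symbol_idx is None:
            if not row.startswith("#"):
                # First non-comment line is the column header.
                header = row.split("\t")
                symbol_idx = header.index("Hugo_Symbol")
                entrez_idx = header.index("Entrez_Gene_Id")
            kept.append(row)  # comment and header lines pass through
            continue
        cells = row.split("\t")
        symbol = cells[symbol_idx].strip() if symbol_idx < len(cells) else ""
        entrez = cells[entrez_idx].strip() if entrez_idx < len(cells) else ""
        if symbol in blank_tokens and entrez in blank_tokens:
            dropped += 1  # no way to identify the gene, so skip the row
            continue
        kept.append(row)
    return kept, dropped


# Tiny made-up example with one unidentifiable row.
maf = [
    "#version 2.4",
    "Hugo_Symbol\tEntrez_Gene_Id\tTumor_Sample_Barcode",
    "TP53\t7157\tSAMPLE_1",
    "\t\tSAMPLE_2",
    "KRAS\t\tSAMPLE_3",
]
kept, dropped = drop_rows_without_gene(maf)
print(dropped, "row(s) dropped")
```

Rows that still carry one of the two identifiers are kept, since the validator reports those only as warnings.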

Thanks,
Ritika

Shakuntala Baichoo

Mar 16, 2022, 2:23:08 PM
to Ritika Kundra, cBioPortal for Cancer Genomics Discussion Group
Hi Ritika,
Thanks a lot for the update. I will try that.


Best regards,
Shakuntala
