cBioPortal Datahub Public Dataset Load Into Database Issues

Miu ki Yip

May 8, 2026, 12:15:33 PM

to cbiop...@googlegroups.com, Yichao Sun

Hi all,

I am trying to load some of the public datasets from the cBioPortal DataHub (https://github.com/cBioPortal/datahub/tree/master/public) into a local instance of cBioPortal, but I am encountering errors during import.

For instance, when trying to load the “msk_impact_50k_2026” study into the database, I see the following errors in stdout:

  ERROR: data_clinical_sample.txt: line 5: column 20: Attribute name not in upper case.; value encountered: 'purity_estimate_from_mutations'

And

ERROR: data_sv.txt: lines [25, 150, 154, (354 more)]: column 1: Sample ID not defined in clinical file; values encountered: ['P-0012686-T01-IM5', 'P-0013979-T01-IM5', 'P-0010429-T01-IM5', '(317 more)']

WARNING: data_sv.txt: lines [7423, 7840, 12522, (1 more)]: Hugo Symbol should not start with a number.; values encountered: ['48787', '30302_C.890', '1311DEL']
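The first error above (an attribute name that is not in upper case) can be reproduced outside the importer with a few lines of Python. This is only a minimal sketch, not the validator's own logic: it assumes the clinical file is tab-separated, that metadata rows start with `#`, and that the first non-`#` row holds the attribute names. The function name `lowercase_attributes` is hypothetical.

```python
def lowercase_attributes(clinical_path):
    """Return (column_number, name) pairs for attribute names that are not upper case.

    Assumption: rows starting with '#' are metadata, and the first other row
    is the header that holds the attribute names.
    """
    with open(clinical_path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip metadata rows above the header
            names = line.rstrip("\r\n").split("\t")
            # Column numbers are 1-based to match the numbering in the error message.
            return [(i, n) for i, n in enumerate(names, start=1) if n != n.upper()]
    return []
```

Calling `lowercase_attributes("data_clinical_sample.txt")` on the study folder should list the offending columns, which can then be renamed in the header row before retrying the import.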

I am concerned that I am looking in an out-of-date location for the public datasets, since many of the datasets in the linked GitHub repo fail to load into the database.

Please let me know how I can get these datasets loaded or if there is an updated data repository.

Thank you!

jagn...@gmail.com

May 11, 2026, 9:33:58 AM

to cBioPortal for Cancer Genomics Discussion Group

Hi Miu,

Thanks for reaching out to the cBioPortal team. Which version of cBioPortal are you using for the import?

The direct link to the study file is https://datahub.assets.cbioportal.org/msk_impact_50k_2026.tar.gz

You can obtain this from the main Study page on cbioportal.org. There is a download link for all studies right next to the Study title.

https://www.cbioportal.org/study/summary?id=msk_impact_50k_2026
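For anyone who prefers to script the download, the steps above can be sketched in Python using only the standard library. This is an illustration rather than an official procedure: the helper names `download` and `unpack` and the local paths are hypothetical, and the layout inside the archive is not guaranteed.

```python
import tarfile
import urllib.request
from pathlib import Path

# Direct link quoted in this thread.
STUDY_URL = "https://datahub.assets.cbioportal.org/msk_impact_50k_2026.tar.gz"

def download(url, archive_path):
    """Fetch the study archive to a local file and return its path."""
    urllib.request.urlretrieve(url, archive_path)
    return Path(archive_path)

def unpack(archive_path, dest_dir):
    """Extract a .tar.gz archive and return the names of the top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

if __name__ == "__main__":
    # Hypothetical local paths; adjust to your environment.
    archive = download(STUDY_URL, "msk_impact_50k_2026.tar.gz")
    print(unpack(archive, "studies"))
```

The extracted folder can then be passed to the importer as the study directory.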

I will reach out to the data team since new studies sometimes have errors.

Thanks,
Jag

Prasanna Jagannathan

May 19, 2026, 11:32:31 AM

to Miu ki Yip, Yichao Sun, cBioPortal for Cancer Genomics Discussion Group

Hi Miu,

I have raised your concerns with the DataHub team, and someone from that team will get back to you.

Thanks,

Jag

On Thu, 14 May 2026 at 09:41, Miu ki Yip <miy...@med.cornell.edu> wrote:

Hi Jag,

We are running cBioPortal v6.3.2 locally.

I’ve downloaded the msk_impact_50k_2026 study using the link provided, but I am still getting the same errors.

ERROR: data_clinical_sample.txt: line 5: column 20: Attribute name not in upper case.; value encountered: 'purity_estimate_from_mutations'

And

WARNING: data_sv.txt: line 1: Missing genomic information. Consider adding the fields: [Site1_Contig, Site1_Ensembl_Transcript_Id, Site1_Entrez_Gene_Id, Site1_Region, Site1_Region_Number, Site2_Contig, Site2_Ensembl_Transcript_Id, Site2_Entrez_Gene_Id, Site2_Region, Site2_Region_Number]

ERROR: data_sv.txt: lines [25, 150, 154, (354 more)]: column 1: Sample ID not defined in clinical file; values encountered: ['P-0012686-T01-IM5', 'P-0013979-T01-IM5', 'P-0010429-T01-IM5', '(317 more)']

WARNING: data_sv.txt: lines [7423, 7840, 12522, (1 more)]: Hugo Symbol should not start with a number.; values encountered: ['48787', '30302_C.890', '1311DEL']
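The "Sample ID not defined in clinical file" error amounts to a set difference between the sample IDs used in data_sv.txt and those declared in data_clinical_sample.txt. A rough way to list the missing IDs is sketched below. It assumes both files are tab-separated, that the clinical file uses a `SAMPLE_ID` column and the SV file a `Sample_Id` column (adjust the names if your files differ), and the function names are hypothetical.

```python
import csv

def read_column(path, column):
    """Collect the values of one named column from a tab-separated file.

    Rows starting with '#' (clinical metadata rows) are skipped before parsing.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        rows = (line for line in handle if not line.startswith("#"))
        reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[column] for row in reader]

def undefined_samples(clinical_path, sv_path,
                      clinical_col="SAMPLE_ID", sv_col="Sample_Id"):
    """Return the sorted sample IDs used in the SV file but absent from the clinical file."""
    defined = set(read_column(clinical_path, clinical_col))
    return sorted({s for s in read_column(sv_path, sv_col) if s not in defined})
```

Running `undefined_samples("data_clinical_sample.txt", "data_sv.txt")` inside the study folder should return the IDs the validator complains about, which helps tell whether the clinical file is missing samples or the SV file references stale ones.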

Could our cBioPortal version be the cause of the issues with these new datasets?

Thank you!

Ramya Madupuri

Jun 1, 2026, 1:32:09 PM

to Prasanna Jagannathan, Miu ki Yip, Yichao Sun, cBioPortal for Cancer Genomics Discussion Group

Hi Miu,

Thank you for bringing this to our attention.

We looked into the issues with the 'msk_impact_50k_2026' study and have fixed the import-related errors. Please pull the latest version and try the import again. We are also reviewing other DataHub studies for similar issues and working through them as part of a broader cleanup effort.

Please let us know if you continue to run into any problems after updating the study files.

Best,

Ramya
