cBioPortalData

Peter Saffrey

Nov 25, 2022, 9:45:43 AM

to cBioPortal for Cancer Genomics Discussion Group

Hi there,

Now that I have my own private cBioPortal instance up and running, I've been learning how to use it. I was able to use the Python API to obtain all the mutations from a single study and visualise them.

However, using the R tools (cBioPortalData and cBioDataPack), I don't seem to be able to get the same information. Using cBioPortalData I need to supply a gene list, but I want all the variants, not just a subset. I understand this is to reduce load on the API, but I was wondering whether I can override it when using my own instance, especially since I can definitely get all this data when using Python.

The alternative for bulk data seems to be cBioDataPack, but from what I can tell looking at the documentation, I can't use cBioDataPack for a private instance.

Any suggestions?

Thanks,

Peter

mram...@gmail.com

Nov 28, 2022, 12:12:23 PM

to cBioPortal for Cancer Genomics Discussion Group

Hi Peter,

How are you obtaining all the data when using Python? Are you requesting a specific endpoint?

AFAIU, the API was not designed with this use case in mind. If you can get all the feature data for a particular study, then you should be able to request all the variants.

Note that cBioDataPack downloads static tarballs from the cBioPortal website. There is an option named "cBio_URL" that you can set to download from a different URL if you have tarballs built similarly to those at cBioPortal.

Best,

Marcel

Peter Saffrey

Nov 29, 2022, 5:39:10 AM

to cBioPortal for Cancer Genomics Discussion Group

Hi Marcel,

Thanks for the response. My Python code looks like this:

    profileId = f"{study_name}_mutations"
    sampleListId = f"{study_name}_all"
    mutations = self._client.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
        molecularProfileId=profileId,
        sampleListId=sampleListId,
        projection="DETAILED"
    ).result()

This gets _all_ the mutations for a study in some detail, so whatever downstream analysis I want to do is possible from here. However, as you rightly point out, this is a very blunt instrument and I'm sure there could be better ways to do this. Two spring to mind:

1. Make the use case more specific so that I can look for a better cBioPortal endpoint to get only the data that I want, rather than everything. In this case, I was just doing a proof of concept and wanted to plot mutation counts by sample. On the R side, I can get this data much more cheaply by calling `clinicalData()`, which gives me a variety of per-sample summary statistics, including mutation count, so I suppose a Python equivalent exists.

2. Find a way to get all the mutations, but more efficiently, as in the cBioDataPack example.
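
For the proof-of-concept in option 1, the per-sample counts can also be derived from the bulk result already in hand. The following is only a minimal sketch: it assumes each mutation record exposes a `sampleId` field (as a dict key or as an attribute), and `sample_id` and `mutation_counts_by_sample` are hypothetical helper names, not part of any cBioPortal client. The example records are made up for illustration.

```python
from collections import Counter


def sample_id(record):
    """Return the sample ID from a dict or an attribute-style record.

    Assumption: the field is called `sampleId` in both cases.
    """
    if isinstance(record, dict):
        return record["sampleId"]
    return record.sampleId


def mutation_counts_by_sample(mutations):
    """Count how many mutation records belong to each sample."""
    return Counter(sample_id(m) for m in mutations)


# Hypothetical records standing in for the API result.
example = [
    {"sampleId": "S1", "entrezGeneId": 7157},
    {"sampleId": "S1", "entrezGeneId": 672},
    {"sampleId": "S2", "entrezGeneId": 7157},
]

counts = mutation_counts_by_sample(example)
print(counts.most_common())  # [('S1', 2), ('S2', 1)]
```

This still downloads every mutation first, so it is no cheaper than the bulk request; it only shows that the plot data falls out of the same result without a second call.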

In an ideal world, I would use option 1, but I suspect that hunting around for the clever way to get at exactly the right data might not always work. Therefore, it would certainly be useful to have the more flexible (although less elegant) option 2 at my disposal. Can you point me to documentation on how to prepare the tarballs you mention, so that I can make my data available via cBioDataPack? And is there a similarly efficient method available to get these tarballs into a pandas DataFrame on the Python side?
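
On the Python side, one possible starting point is to read a tab-delimited table straight out of a study tarball with the standard library. This is a sketch under stated assumptions, not an established cBioPortal workflow: it assumes the tarball contains a file whose name ends in `data_mutations.txt`, that the file is tab-delimited, and that metadata lines start with `#`. The function name `read_table_from_tarball` and its `member_suffix` parameter are hypothetical.

```python
import csv
import io
import tarfile


def read_table_from_tarball(tar_path, member_suffix="data_mutations.txt"):
    """Return the rows of one tab-delimited file inside a tarball.

    Assumptions: the member name ends with `member_suffix`, and lines
    starting with '#' are comments to be skipped.
    """
    # "r:*" lets tarfile detect the compression (gz, bz2, xz, or none).
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(member_suffix):
                raw = archive.extractfile(member)
                text = io.TextIOWrapper(raw, encoding="utf-8")
                lines = (line for line in text if not line.startswith("#"))
                # Materialise the rows before the archive is closed.
                return list(csv.DictReader(lines, delimiter="\t"))
    raise FileNotFoundError(
        f"no member ending in {member_suffix!r} found in {tar_path}"
    )
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)`. Alternatively, the file object returned by `extractfile` can be given directly to `pandas.read_csv` with `sep="\t"` and `comment="#"`, which avoids the intermediate list.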