Assigning mutation data to the correct patient ID


dfpo...@gmail.com

Aug 22, 2019, 10:59:41 PM
to cBioPortal for Cancer Genomics Discussion Group
After doing a query of the form (taking a study at random):

You get a bunch of IDs for different tumors as column names with formats all over the place, like:
SD7357_T    SR070761    SU2C_Lung-SU2C-DFCI-LUAD-1002-Tumor-SM-AOL41

I am using all of the studies with mutation data available for download, and some patients are in multiple studies. So to avoid double-counting, I need to convert these column names into unique patient IDs that can be compared between studies. In cases of the form 'TCGA-\w+-\w+-\d+', it seems you can cut off anything after the TCGA-\w+-\w+ and get a patient ID, but most datasets don't appear to have their columns in this format.
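
A minimal sketch of the truncation described above, in Python. The function name `tcga_patient_id` and the example column names are made up for illustration; the pattern assumes the column name starts with the TCGA prefix and that the sample suffix begins with digits.

```python
import re

# Assumed shape: TCGA-<site>-<participant>-<sample digits...>
# Everything up to the participant field is treated as the patient ID.
TCGA_SAMPLE = re.compile(r"^(TCGA-\w+-\w+)-\d+")

def tcga_patient_id(column_name):
    """Return the patient part of a TCGA-style column name, or None."""
    match = TCGA_SAMPLE.match(column_name)
    return match.group(1) if match else None

# Hypothetical column names, for illustration only.
print(tcga_patient_id("TCGA-05-4244-01"))
print(tcga_patient_id("SD7357_T"))
```

Columns that do not follow the TCGA pattern come back as None, so they can be set aside and handled separately.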

I can do a query of the form:

This data does not seem to contain whatever form of ID is used as the column headers of the mutation data returned by getProfileData. I can still match them by looking for patient IDs that appear within the mutation column names, but some column names do not contain any patient ID (e.g. AL4602) and some patient IDs have no matching column (e.g. luad_mskcc_2015_15).
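
Roughly, the substring matching I mean looks like the sketch below. The function name `match_columns_to_patients` is hypothetical, and the pairing of SD7357_T with a patient ID SD7357 is an assumption for illustration, not something taken from the portal data.

```python
def match_columns_to_patients(column_names, patient_ids):
    """Match each column name to a patient ID that appears inside it.

    Returns (matches, unmatched_columns, unmatched_patients).
    """
    matches = {}
    for col in column_names:
        hits = [pid for pid in patient_ids if pid in col]
        if hits:
            # Prefer the longest hit so a short ID does not win over a
            # longer ID that contains it.
            matches[col] = max(hits, key=len)
    unmatched_columns = [c for c in column_names if c not in matches]
    matched_patients = set(matches.values())
    unmatched_patients = [p for p in patient_ids if p not in matched_patients]
    return matches, unmatched_columns, unmatched_patients

# Illustrative inputs; the SD7357 patient ID is assumed.
columns = ["SD7357_T", "AL4602"]
patients = ["SD7357", "luad_mskcc_2015_15"]
matches, orphan_columns, orphan_patients = match_columns_to_patients(columns, patients)
print(matches)
print(orphan_columns)
print(orphan_patients)
```

Substring matching can still mis-assign a column when one patient ID happens to be contained in an unrelated column name, so the two leftover lists are worth checking by hand.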

The big issue is that I am trying to avoid double-counting patients and to assign them to the proper tumor types. Is there a way to map back to the patient IDs from the column names in a getProfileData call in a way that will avoid double-counting between studies?

Thanks,
Douglas

Priti Kumari

Aug 26, 2019, 4:32:21 PM
to cBioPortal for Cancer Genomics Discussion Group, dfpo...@gmail.com
Hi Douglas,

There is no way to use the patient ID to avoid double-counting patients. You could, however, select “Curated set of non redundant studies” on the home page to include only studies with no overlapping samples. See attached screenshot.

Thank you,
Priti


Screen Shot 2019-08-26 at 4.28.54 PM.png

Nikolaus Schultz

Aug 26, 2019, 9:40:45 PM
to Priti Kumari, cBioPortal for Cancer Genomics Discussion Group, dfpo...@gmail.com
Hi Douglas,

Outside of TCGA and MSK-IMPACT, there should not be many samples duplicated across studies. Those studies, though, all use the same patient IDs and sample IDs, so you should be able to detect the duplicates and filter them out.

Niki.
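
A minimal sketch of that kind of filtering, assuming the sample IDs for each study have already been downloaded into a dict. The function name `drop_duplicate_samples` and the study and sample names are hypothetical; a sample ID is kept only in the first study where it is seen.

```python
def drop_duplicate_samples(samples_by_study):
    """Keep each sample ID only in the first study that lists it.

    samples_by_study maps a study name to a list of sample IDs; studies
    are processed in the dict's insertion order.
    """
    seen = set()
    kept = {}
    for study, sample_ids in samples_by_study.items():
        kept[study] = []
        for sample_id in sample_ids:
            if sample_id not in seen:
                seen.add(sample_id)
                kept[study].append(sample_id)
    return kept

# Hypothetical studies and sample IDs, for illustration only.
studies = {
    "study_a": ["S1", "S2"],
    "study_b": ["S2", "S3"],
}
print(drop_duplicate_samples(studies))
```

The same loop works on patient IDs instead of sample IDs if the goal is to count each patient once.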


