Fwd: 128968 - TCGA data - duplicate patient mRNA

Azat Akhmetov

Aug 6, 2012, 12:43:58 PM

to cbiop...@googlegroups.com

Hello,

While trying to download mRNA data from the TCGA data portal using the patient lists that cBio gave me, I ran into something confusing: for some patients more than one sample was taken, yet cBio does not seem to distinguish between the individual samples from each patient (see the topmost forwarded email).

I was wondering if you could help me find out more about this.

Regards,

Azat Akhmetov

---------- Forwarded message ----------
From: Azat Akhmetov <arpha...@gmail.com>
Date: Wed, Jul 25, 2012 at 7:16 PM
Subject: Re: 128968 - TCGA data - duplicate patient mRNA
To: "Swan, Don (NIH/NCI) [C]" <sw...@mail.nih.gov>

Hi Don,

Thank you for the detailed explanation. Regarding #1: So when I ask the cBio Cancer Genomics Portal (http://www.cbioportal.org/public-portal/) for "Patients that have BRCA1 alterations", for instance, and it gives me patient TCGA-29-1710 (which has two sets of mRNA samples), does that mean this patient's primary tumor had BRCA1 alterations, or secondary, or both, or what?

My confusion arises because the cBio portal distinguishes only between individuals, not between tumors, whereas the TCGA data portal does distinguish between tumors. When I want to collect, for instance, mRNA for patients with BRCA1 alterations only, I don't know how to deal with 1710, because I can't tell which of the patient's tumors had the alteration.

By the way, I think in reality 29-1710 shows up as BRCA1 unaltered, but I was just illustrating the issue.

Regards,

Azat

On Mon, Jul 23, 2012 at 11:23 AM, Swan, Don (NIH/NCI) [C] <sw...@mail.nih.gov> wrote:

Hi there-

This is in response to the request you submitted in regard to there being duplicate samples in the TCGA OV mRNA data. Please see your questions followed by my answers:

1- Why is more than one sample taken from some patients and which sample is the correct one?
In the example that you provided, the barcodes show that one is a primary solid tumor and the other is a recurrent solid tumor. This can be determined from the “-01A” and “-02A” fields of the barcodes.

TCGA-29-1710-01A-02R-0566-07
TCGA-29-1710-02A-01R-0810-07

You can find more information about the TCGA barcodes here: https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data

You can find out what the codes mean in the Code Tables Report: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm
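
To make the barcode reading above concrete, here is a minimal Python sketch that pulls the patient ID and sample type out of a barcode. The function name `parse_barcode` and the small `SAMPLE_TYPES` table are illustrative, not part of any TCGA tool; codes 01 and 02 are the two discussed in this thread, 11 is included as the solid tissue normal code, and the Code Tables Report remains the authoritative list.

```python
import re

# Illustrative subset of sample-type codes; the Code Tables Report
# is the authoritative source for the full list.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
    "11": "Solid Tissue Normal",
}

# TCGA-<site>-<participant>-<sample type><vial>-... (remaining fields ignored here)
BARCODE_RE = re.compile(
    r"^TCGA-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"-(?P<sample_type>\d{2})(?P<vial>[A-Z])(?:-.*)?$"
)


def parse_barcode(barcode):
    """Return the patient ID and sample type encoded in a TCGA barcode."""
    match = BARCODE_RE.match(barcode)
    if match is None:
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    code = match.group("sample_type")
    return {
        "patient": f"TCGA-{match.group('tss')}-{match.group('participant')}",
        "sample_type_code": code,
        "sample_type": SAMPLE_TYPES.get(code, "unknown"),
    }


for bc in ("TCGA-29-1710-01A-02R-0566-07", "TCGA-29-1710-02A-01R-0810-07"):
    info = parse_barcode(bc)
    print(info["patient"], info["sample_type_code"], info["sample_type"])
```

Both barcodes resolve to the same patient, TCGA-29-1710, and differ only in the sample type field.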

2- Why are Level 3 Ovarian Cancer gene expression (mRNA) data from "Affymetrix HT Human Genome U133 Array Plate Set" platforms always greater than 1, what kind of values are these?

The processing information of the level 2 and level 3 data and the sample to file mapping that you are looking for can be obtained when you download it from the Data Matrix (http://tcga-data.nci.nih.gov). When you select the data that you want and build an archive, the system automatically provides MAGE-TAB files specific to that data. MAGE-TAB files are tab-delimited text files that contain the annotations for the data that you’re downloading and can be found in the METADATA directory of the archive. The MAGE-TAB files consist of the Investigation Description Format (IDF) file and the Sample and Data Relationship Format (SDRF) file. The one that you are interested in is the IDF file as that contains the protocol information. It should include all of the processing steps, including descriptions or links to descriptions, for the data.
The SDRF file contains the sample annotation information. You will find the sample ID in the first column of the SDRF file, and the corresponding file name for that sample ID in the same row, all the way over to the right.
Take "broad.mit.edu_OV.HT_HG-U133A.sdrf.txt" as an example.
In the first column of the SDRF you will find a sample ID similar to those found in the level 3 data files (e.g. TCGA-01-0628-11A-01R-0362-01). Then, in the same row, you will find the corresponding file name in the column labeled Array Data File (e.g. 5500024056197041909864.G08.CEL).
Since these are tab-delimited text files, it is best to open them in spreadsheet applications like Microsoft Excel.
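
If a spreadsheet is inconvenient, the same lookup can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the message above: the sample ID is in the first column and the file name is in the column labeled Array Data File. The helper name `sample_to_file`, the "Extract Name" and "Protocol REF" headers, and the three-column example table are made up for the sketch; a real SDRF has many more columns.

```python
import csv
import io


def sample_to_file(sdrf_handle, file_column="Array Data File"):
    """Map each sample ID (first column) to its file name in `file_column`."""
    reader = csv.reader(sdrf_handle, delimiter="\t")
    header = next(reader)
    # SDRF files can repeat column names, so look the column up by position
    # (first match) instead of relying on a name-keyed dictionary.
    file_idx = header.index(file_column)
    mapping = {}
    for row in reader:
        if len(row) > file_idx and row[0]:
            mapping[row[0]] = row[file_idx]  # a later row for the same ID overwrites
    return mapping


# Hypothetical miniature SDRF built from the example IDs quoted in this message.
example_sdrf = (
    "Extract Name\tProtocol REF\tArray Data File\n"
    "TCGA-01-0628-11A-01R-0362-01\t(protocol)\t5500024056197041909864.G08.CEL\n"
)

mapping = sample_to_file(io.StringIO(example_sdrf))
print(mapping["TCGA-01-0628-11A-01R-0362-01"])
```

For a downloaded archive, the same function could be called on `open("broad.mit.edu_OV.HT_HG-U133A.sdrf.txt", newline="")`.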

If you are going to be downloading the data from the HTTP Repositories, the MAGE-TAB files can also be found at the bottom of the directories and are labeled accordingly (e.g. hms.harvard.edu_OV.HG-CGH-244A.mage-tab.1.6.0.tar.gz).

If the MAGE-TAB files mention a Description.txt file, it can be found in the MAGE-TAB archives on the HTTP(S) repositories.
In your case, for example:
https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/ov/cgcc/broad.mit.edu/ht_hg-u133a/transcriptome/broad.mit.edu_OV.HT_HG-U133A.mage-tab.1.1007.0/

The OV U133A data is as follows:
Level 2: Probeset-level Robust Multiarray Analysis
Level 3: Gene-level Robust Multiarray Analysis
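
On the question of what kind of values these are: Robust Multiarray Analysis conventionally reports expression on a log2 scale, which would be consistent with Level 3 values that never drop below 1. Assuming that convention applies to these files (the protocol description in the IDF is the place to confirm it), a value can be turned back into a linear intensity as in this small Python sketch. The function name `log2_to_linear` is illustrative.

```python
def log2_to_linear(value):
    """Convert a log2-scale expression value back to a linear intensity."""
    return 2.0 ** value


print(log2_to_linear(10.0))  # 1024.0
```

Under that assumption, a difference of 1 between two values corresponds to a two-fold difference in intensity.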

I also recommend taking some time to read through the TCGA Wiki and Data Primer:
TCGA Wiki: https://wiki.nci.nih.gov/display/TCGA/TCGA+Wiki+Home
TCGA Data Primer: https://wiki.nci.nih.gov/display/TCGA/TCGA+Data+Primer

If you have any other questions please let us know.

Thanks-
Don

Don Swan (Contractor)

Tier 2 Application Support Specialist
Microarray Application Trainer

301-443-6222
Application Support
TerpSys (www.terpsys.com)
2115 East Jefferson St, Suite 6000

Rockville, MD 20852

Application Support
Email: nc...@pop.nci.nih.gov
Local: 301.451.4384
Toll-Free: 888.478.4423
http://ncicb.nci.nih.gov/support

Nikolaus Schultz

Aug 6, 2012, 4:48:31 PM

to cbiop...@googlegroups.com, Azat Akhmetov

Dear Azat,

The cBio Portal currently only stores genomic data for primary tumors (not secondary or recurrent tumors).

There are, however, a couple of primary tumors from TCGA for which multiple samples (possibly representing different regions of the tumor) are available. In these cases only one is included in the cBio Portal, and the selection is random (in our experience, though, there are very few differences between different samples from the same tumor).
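
For anyone who wants to mirror this when pulling mRNA files from the TCGA data portal, the selection described above can be approximated as follows. This is a sketch of the idea only, not the portal's actual code: it keeps barcodes whose sample type is 01 (primary solid tumor) and picks one at random per patient. The function name `pick_primary_per_patient`, the `seed` parameter, and the two TCGA-XX-0001 barcodes are made up for illustration.

```python
import random
from collections import defaultdict


def pick_primary_per_patient(barcodes, seed=0):
    """Keep primary-tumor samples (type 01) and choose one per patient at random."""
    rng = random.Random(seed)  # fixed seed so the choice is reproducible
    by_patient = defaultdict(list)
    for barcode in barcodes:
        parts = barcode.split("-")
        # parts[3] is the sample field, e.g. "01A": type code plus vial letter.
        if len(parts) >= 4 and parts[3][:2] == "01":
            by_patient["-".join(parts[:3])].append(barcode)
    return {
        patient: rng.choice(sorted(samples))
        for patient, samples in sorted(by_patient.items())
    }


barcodes = [
    "TCGA-29-1710-01A-02R-0566-07",  # primary, from the forwarded email
    "TCGA-29-1710-02A-01R-0810-07",  # recurrent, dropped by the filter
    "TCGA-XX-0001-01A-01R-0000-07",  # made-up patient with two primary samples
    "TCGA-XX-0001-01B-01R-0000-07",
]
chosen = pick_primary_per_patient(barcodes)
for patient, barcode in chosen.items():
    print(patient, barcode)
```

With this input, TCGA-29-1710 always resolves to its -01A sample, since the recurrent -02A sample is filtered out before the random choice.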

I hope this helps.

Niki.

Azat Akhmetov

Aug 6, 2012, 4:56:33 PM

to Nikolaus Schultz, cbiop...@googlegroups.com

That explains everything, Niki! Thank you very much.
