Hi there-
This is in response to the request you submitted about duplicate samples in the TCGA OV mRNA data. Your questions are listed below, each followed by my answer:
1- Why is more than one sample taken from some patients and which sample is the correct one?
In the example that you provided, the barcodes show that one sample is a primary solid tumor and the other is a recurrent solid tumor. This can be determined from the “-01A” and “-02A” portions of the barcodes:
TCGA-29-1710-01A-02R-0566-07
TCGA-29-1710-02A-01R-0810-07
You can find more information about the TCGA barcodes here: https://wiki.nci.nih.gov/display/TCGA/Working+with+TCGA+Data
You can find out what the codes mean in the Code Tables Report: https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm
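If you want to check many barcodes at once, the comparison above can be sketched in Python. This is only an illustration: the function name and the regular expression are my own, and the lookup table holds just the two sample type codes discussed here, so please consult the Code Tables Report for the full list.

```python
import re

# Partial lookup table: only the two codes mentioned in this message.
# The Code Tables Report has the complete set.
SAMPLE_TYPES = {
    "01": "Primary Solid Tumor",
    "02": "Recurrent Solid Tumor",
}

# Matches the leading part of a barcode, e.g. TCGA-29-1710-01A
BARCODE_RE = re.compile(
    r"^TCGA-(?P<tss>\w{2})-(?P<participant>\w{4})-(?P<sample_type>\d{2})(?P<vial>[A-Z])"
)

def parse_barcode(barcode):
    """Return the patient ID and sample type encoded in a TCGA barcode."""
    m = BARCODE_RE.match(barcode)
    if m is None:
        raise ValueError("not a recognizable TCGA barcode: " + barcode)
    code = m["sample_type"]
    return {
        "patient": "TCGA-{}-{}".format(m["tss"], m["participant"]),
        "sample_type_code": code,
        "sample_type": SAMPLE_TYPES.get(code, "see Code Tables Report"),
    }

for bc in ("TCGA-29-1710-01A-02R-0566-07", "TCGA-29-1710-02A-01R-0810-07"):
    print(bc, parse_barcode(bc))
```

Both barcodes resolve to the same patient (TCGA-29-1710) but to different sample types, which is why the patient appears twice in the data.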
2- Why are Level 3 Ovarian Cancer gene expression (mRNA) data from "Affymetrix HT Human Genome U133 Array Plate Set" platforms always greater than 1, what kind of values are these?
The processing information for the level 2 and level 3 data, and the sample-to-file mapping that you are looking for, are included when you download the data from the Data Matrix (http://tcga-data.nci.nih.gov). When you select the data that you want and build an archive, the system automatically provides MAGE-TAB files specific to that data. MAGE-TAB files are tab-delimited text files that contain the annotations for the data that you are downloading, and they can be found in the METADATA directory of the archive. The MAGE-TAB files consist of the Investigation Description Format (IDF) file and the Sample and Data Relationship Format (SDRF) file. The one that you are interested in is the IDF file, as it contains the protocol information. It should include all of the processing steps for the data, either as descriptions or as links to descriptions.
The SDRF file contains the sample annotation information. The sample ID is in the first column of the SDRF file, and the corresponding file name for that sample ID is in the same row, further to the right.
For example, in "broad.mit.edu_OV.HT_HG-U133A.sdrf.txt":
The first column of the SDRF holds sample IDs like those found in the level 3 data files (e.g. TCGA-01-0628-11A-01R-0362-01). In the same row, the corresponding file name is in the column labeled Array Data File (e.g. 5500024056197041909864.G08.CEL).
Since these are tab-delimited text files, it is best to open them in spreadsheet applications like Microsoft Excel.
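If you would rather build the sample-to-file mapping with a script than in a spreadsheet, here is a minimal Python sketch. It assumes only what is described above: the sample ID is in the first column and the file name is in a column labeled Array Data File. The function name, the "Extract Name" and "Protocol REF" headers, and the protocol value in the demo row are placeholders of mine, not taken from a real SDRF.

```python
import csv
import io

def sample_to_file(sdrf_handle, file_column="Array Data File"):
    """Map each sample ID (first column) to its entry in file_column."""
    reader = csv.reader(sdrf_handle, delimiter="\t")
    header = next(reader)
    # Look the column up by position; SDRF files may repeat other headers.
    file_idx = header.index(file_column)
    mapping = {}
    for row in reader:
        if len(row) > file_idx and row[0]:
            mapping[row[0]] = row[file_idx]
    return mapping

# Tiny in-memory stand-in for an SDRF file (headers other than
# "Array Data File" are placeholders).
demo = io.StringIO(
    "Extract Name\tProtocol REF\tArray Data File\n"
    "TCGA-01-0628-11A-01R-0362-01\tplaceholder-protocol\t5500024056197041909864.G08.CEL\n"
)
mapping = sample_to_file(demo)
print(mapping)
```

For a real file, you would pass an open handle instead of the demo text, for example open("broad.mit.edu_OV.HT_HG-U133A.sdrf.txt", newline="").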
If you are going to download the data from the HTTP Repositories, the MAGE-TAB files can also be found at the bottom of the directories and are labeled accordingly (e.g. hms.harvard.edu_OV.HG-CGH-244A.mage-tab.1.6.0.tar.gz).
If the MAGE-TAB files mention a Description.txt file, it can be found in the MAGE-TAB archives on the HTTP(S) repositories.
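In your case, for example, the OV U133A data are processed as follows:
Level 2: Probeset-level Robust Multiarray Analysis (RMA)
Level 3: Gene-level Robust Multiarray Analysis (RMA)
RMA reports expression values on a log2 scale, which is why the values you see are typically positive and greater than 1.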
I also recommend taking some time to read through the TCGA Wiki and Data Primer:
TCGA Wiki: https://wiki.nci.nih.gov/display/TCGA/TCGA+Wiki+Home
TCGA Data Primer: https://wiki.nci.nih.gov/display/TCGA/TCGA+Data+Primer
If you have any other questions please let us know.
Thanks-
Don
Don Swan (Contractor)
Tier 2 Application Support Specialist
Microarray Application Trainer
Application Support
TerpSys (www.terpsys.com)
2115 East Jefferson St, Suite 6000
Rockville, MD 20852
Email: nc...@pop.nci.nih.gov
Local: 301.451.4384
Toll-Free: 888.478.4423
http://ncicb.nci.nih.gov/support