Inconsistencies in UCSC Xena Toil RNA-seq CGL processed samples

daniel.joha...@gmail.com

Sep 21, 2016, 11:11:24 AM

to UCSC Xena and Cancer Genomics Browser

Dear UCSC Community,

I am working with data processed by the UCSC Xena Team using their Toil RNA-seq CGL Pipeline. All data files were downloaded from http://xena.ucsc.edu using an S3 bucket link that was provided in the help section this July. The publication related to this data is here: http://dx.doi.org/10.1101/062497

While working with the data, I discovered some inconsistencies which I would like to discuss within the community.

1) The files tcga_expected_count and tcga_RSEM_gene_tpm contain 10657 and 10490 samples respectively and the sample names do not fully overlap. Both files are based on the RSEM output, so I would expect overlapping and matching sample numbers and names.

2) Based on the file TCGA_TARGET_phenotype.txt, I discovered several samples (e.g. TCGA-17-Z000-01) for which the columns sample_type or gender had no values. The GDC data portal however claims that this specific sample is of sample type "Primary Tumor".

3) Sample names seem to omit some characters. For the sample TCGA-17-Z000-01 the GDC data portal claims the sample submitter id TCGA-17-Z000-01A. Why are sample names in UCSC Xena truncated?

4) Some samples appear more than once in the file tcga_RSEM_gene_tpm. The sample id TCGA-38-4625-01 can be found on column 7397 and 10441. The expression values differ. Reading this data into R caused the sample names to be renamed to TCGA-38-4625-01.1 and TCGA-38-4625-01.2 while using check.names = FALSE.

5) Some samples from the file TCGA_TARGET_phenotype.txt are missing the expression files. 200 samples match the string "TCGA Acute Myeloid Leukemia", however those samples cannot be found in the file tcga_RSEM_gene_tpm. The same holds true for the Osteosarcoma samples from TARGET.

6) The file GTEX_phenotype.txt contains 5 samples that have the column "body_site_detail (SMTSD)" filled out (e.g. Stomach) while the column _primary_site is empty. Clearly those values can be derived from the body_site_detail column.

7) The file GTEX_phenotype.txt contains 9783 samples while the file contains only 7863 columns with expression values. Why are there apparently 1919 samples missing?

8) The file TCGA_TARGET_phenotype.txt claims 18,046 samples for TCGA and TARGET combined. However neither the files tcga_RSEM_gene_tpm and target_RSEM_gene_tpm (11,224 samples) nor the information on the website https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net (11,928 samples) support this total sample number of the combined TCGA and TARGET data set.

9) Sample numbers between the file tcga_RSEM_gene_tpm (10490) and the file tcga_Kallisto_tpm (10663) differ. Is it not expected to get the same number of sample columns between those two files, while the only difference is one used RSEM and the other one Kallisto to compute TPM values?

10) It would be nice to add mappings to Uberon using the primary site and the disease ontology using the cancer type. In addition it would be of value to harmonize the body sites and the naming between GTEx and TCGA/TARGET in order to better compare normals to tumors.

Thanks for helping,

Best, Daniel

Mary Goldman

Sep 26, 2016, 11:58:47 AM

to daniel.joha...@gmail.com, UCSC Xena and Cancer Genomics Browser

Hi Daniel,

Please find replies inline. Note that we do not have answers to all your questions but I wanted to reply with the answers we do have as of now.

Best,
Mary
-------------
Mary Goldman
UCSC Xena Browser
http://xena.ucsc.edu/

---------- Forwarded message ----------
From: <daniel.johannes.gerlach@gmail.com>
Date: Wed, Sep 21, 2016 at 4:51 AM
Subject: [ucsc-cancer-genomics-browser] Inconsistencies in UCSC Xena Toil RNA-seq CGL processed samples
To: UCSC Xena and Cancer Genomics Browser <ucsc-cancer-genomics-browser@googlegroups.com>

Dear UCSC Community,

I am working with data processed by the UCSC Xena Team using their Toil RNA-seq CGL Pipeline. All data files were downloaded from http://xena.ucsc.edu using an S3 bucket link that was provided in the help section this July. The publication related to this data is here: http://dx.doi.org/10.1101/062497

While working with the data, I discovered some inconsistencies which I would like to discuss within the community.

1) The files tcga_expected_count and tcga_RSEM_gene_tpm contain 10657 and 10490 samples respectively and the sample names do not fully overlap. Both files are based on the RSEM output, so I would expect overlapping and matching sample numbers and names.

We are looking into this.
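
In the meantime, a minimal sketch of how the two sample lists could be compared. It assumes both files are tab-separated matrices whose first column is the gene identifier and whose remaining header fields are sample IDs; the tiny in-memory files and the sample IDs in them are made up for illustration, and the helper names are hypothetical.

```python
import csv
import io


def read_sample_ids(handle):
    """Return the sample IDs from the header row of a tab-separated matrix.

    Assumption: the first column holds the gene identifier, every other
    column is one sample.
    """
    header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]


def compare_samples(ids_a, ids_b):
    """Return (only_in_a, only_in_b, shared) as sorted lists."""
    a, b = set(ids_a), set(ids_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Made-up stand-ins for tcga_expected_count and tcga_RSEM_gene_tpm.
counts_file = io.StringIO("sample\tTCGA-AA-0001-01\tTCGA-AA-0002-01\nG1\t5\t7\n")
tpm_file = io.StringIO("sample\tTCGA-AA-0002-01\tTCGA-AA-0003-01\nG1\t1.2\t3.4\n")

only_counts, only_tpm, shared = compare_samples(
    read_sample_ids(counts_file), read_sample_ids(tpm_file)
)
print("only in counts:", only_counts)
print("only in tpm:", only_tpm)
print("shared:", shared)
```

With the real files, the `io.StringIO` objects would be replaced by opened file handles; only the header line is read, so this stays cheap even for large matrices.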

2) Based on the file TCGA_TARGET_phenotype.txt, I discovered several samples (e.g. TCGA-17-Z000-01) for which the columns sample_type or gender had no values. The GDC data portal however claims that this specific sample is of sample type "Primary Tumor".

We get our data for the phenotype.txt from the clinical data. If you look here on the GDC, you can see that there is no clinical data file for this sample or patient. This is why there is no sample_type or gender.

3) Sample names seem to omit some characters. For the sample TCGA-17-Z000-01 the GDC data portal claims the sample submitter id TCGA-17-Z000-01A. Why are sample names in UCSC Xena truncated?

We map everything to the sample level, not the vial level. You can map information from the GDC to any level and we have chosen the sample level. https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode
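
As an illustration of the sample-level mapping, here is a small sketch that reduces a longer TCGA barcode to its sample level by keeping the first four dash-separated fields and dropping the vial letter from the fourth. The function name is hypothetical and this is not necessarily how the Xena scripts do it.

```python
def to_sample_level(barcode):
    """Truncate a TCGA barcode to the sample level.

    Keeps project, tissue source site, participant, and the two-digit
    sample type code; drops the vial letter and everything after it.
    """
    parts = barcode.split("-")
    if len(parts) < 4:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    project, tss, participant, sample_vial = parts[:4]
    # The first two characters of the fourth field are the sample type code;
    # a third character, if present, is the vial letter.
    return "-".join([project, tss, participant, sample_vial[:2]])


print(to_sample_level("TCGA-17-Z000-01A"))
print(to_sample_level("TCGA-AC-A3OD-01B-06R-A22O-07"))
```

Both a vial-level ID such as TCGA-17-Z000-01A and a full aliquot barcode collapse to the same kind of sample-level ID, which is why several aliquots can end up under one column name.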

4) Some samples appear more than once in the file tcga_RSEM_gene_tpm. The sample id TCGA-38-4625-01 can be found on column 7397 and 10441. The expression values differ. Reading this data into R caused the sample names to be renamed to TCGA-38-4625-01.1 and TCGA-38-4625-01.2 while using check.names = FALSE.

Here is a response from our private mailing list about these duplicates. Note that the data on our hubs are from cgHub rather than the GDC. We are currently migrating our scripts to take data from the GDC:
----------

There are three issues: are the duplications real, what are they and why, and what do you do about them.

1. Are all the duplications real? Yes, I think so. And they should all have different values because they all come from different cgHUB analysis IDs.

Examples of duplications for primary tumor TCGA-AC-A3OD-01 (the UUIDs in the S3 links below are cgHUB analysis IDs):

s3://cgl-rnaseq-recompute-fixed/tcga/e9f7caba-a833-4c3c-82f2-11d7a233974d.tar.gz
s3://cgl-rnaseq-recompute-fixed/tcga/e9de5496-4486-4ceb-b3b3-30a53b2c52f6.tar.gz
s3://cgl-rnaseq-recompute-fixed/tcga/84a6ed5e-ad73-4399-ac37-381721f3b4e8.tar.gz
s3://cgl-rnaseq-recompute-fixed/tcga/99ce385c-e0e2-41c2-a868-40a31eac1e50.tar.gz
s3://cgl-rnaseq-recompute-fixed/tcga/a03f6b87-d762-479a-9132-aa42563bacc4.tar.gz

2. What are they and why? The reasons are twofold: each tumor sometimes maps to multiple TCGA aliquots (different biological samples, 2 in this case), and occasionally each aliquot maps to multiple cgHUB IDs (different experimental or informatics replicates, 3 and 2 here). See the data below.

cgHUB analysis ID                     TCGA aliquot ID
a03f6b87-d762-479a-9132-aa42563bacc4  TCGA-AC-A3OD-01B-06R-A22O-07
99ce385c-e0e2-41c2-a868-40a31eac1e50  TCGA-AC-A3OD-01B-06R-A22O-07
84a6ed5e-ad73-4399-ac37-381721f3b4e8  TCGA-AC-A3OD-01B-06R-A22O-07
e9de5496-4486-4ceb-b3b3-30a53b2c52f6  TCGA-AC-A3OD-01A-11R-A21T-07
e9f7caba-a833-4c3c-82f2-11d7a233974d  TCGA-AC-A3OD-01A-11R-A21T-07
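
The mapping listed here can be tallied with a short sketch like the following, which groups the five analysis IDs by aliquot and counts the replicates per aliquot. The pairs are copied from the listing; the variable names are only illustrative.

```python
from collections import defaultdict

# (cgHUB analysis ID, TCGA aliquot ID) pairs copied from the listing above.
pairs = [
    ("a03f6b87-d762-479a-9132-aa42563bacc4", "TCGA-AC-A3OD-01B-06R-A22O-07"),
    ("99ce385c-e0e2-41c2-a868-40a31eac1e50", "TCGA-AC-A3OD-01B-06R-A22O-07"),
    ("84a6ed5e-ad73-4399-ac37-381721f3b4e8", "TCGA-AC-A3OD-01B-06R-A22O-07"),
    ("e9de5496-4486-4ceb-b3b3-30a53b2c52f6", "TCGA-AC-A3OD-01A-11R-A21T-07"),
    ("e9f7caba-a833-4c3c-82f2-11d7a233974d", "TCGA-AC-A3OD-01A-11R-A21T-07"),
]

# Group analysis IDs under the aliquot they were derived from.
by_aliquot = defaultdict(list)
for analysis_id, aliquot in pairs:
    by_aliquot[aliquot].append(analysis_id)

replicates = {aliquot: len(ids) for aliquot, ids in by_aliquot.items()}
for aliquot, n in replicates.items():
    print(aliquot, n)
```

This gives two aliquots for the one sample-level ID, with 3 and 2 analysis IDs respectively, i.e. five columns carrying the same sample name.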

3. what do you do about them.

It is really up to you what you want to do. I imagine you could do a simple averaging, just take one of the five values, cluster and identify the centroid (there are different ways to calculate a centroid), or identify outliers first and then cluster and identify the centroid, etc. I don't want to make an arbitrary decision based on my own judgment (it becomes something undocumented), which would then become the downloaded version of the data in a lot of places.
------------
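
For the simplest of the options mentioned above (plain averaging), a minimal sketch could look like this. It assumes the matrix has already been parsed into a list of sample IDs and rows of floats; the function name and the toy values are made up, and whether averaging is appropriate depends on the scale the values are stored in.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(header, rows):
    """Collapse columns that share a sample ID by taking their mean.

    header: list of sample IDs, one per column.
    rows:   list of rows, each a list of floats aligned with header.
    Returns (new_header, new_rows) with one column per distinct sample ID,
    in order of first appearance.
    """
    positions = defaultdict(list)
    for idx, sample in enumerate(header):
        positions[sample].append(idx)
    new_header = list(positions)
    new_rows = [
        [mean(row[i] for i in idxs) for idxs in positions.values()]
        for row in rows
    ]
    return new_header, new_rows


# Toy example: sample "S1" appears twice with different values.
header = ["S1", "S2", "S1"]
rows = [[1.0, 5.0, 3.0], [2.0, 6.0, 4.0]]
new_header, new_rows = average_duplicates(header, rows)
print(new_header)
print(new_rows)
```

Picking one replicate instead would only require swapping `mean(...)` for something like `row[idxs[0]]`; either way the choice should be recorded, for the reason given above.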

5) Some samples from the file TCGA_TARGET_phenotype.txt are missing the expression files. 200 samples match the string "TCGA Acute Myeloid Leukemia", however those samples cannot be found in the file tcga_RSEM_gene_tpm. The same holds true for the Osteosarcoma samples from TARGET.

This is because not all samples in TCGA and TARGET underwent expression analysis.

6) The file GTEX_phenotype.txt contains 5 samples that have the column "body_site_detail (SMTSD)" filled out (e.g. Stomach) while the column _primary_site is empty. Clearly those values can be derived from the body_site_detail column.

We are looking into this.

7) The file GTEX_phenotype.txt contains 9783 samples while the file contains only 7863 columns with expression values. Why are there apparently 1919 samples missing?

Again, not every sample in GTEX underwent expression analysis. In general we find that there are more samples with clinical/phenotype data than with expression data for all major consortia.

8) The file TCGA_TARGET_phenotype.txt claims 18,046 samples for TCGA and TARGET combined. However neither the files tcga_RSEM_gene_tpm and target_RSEM_gene_tpm (11,224 samples) nor the information on the website https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net (11,928 samples) support this total sample number of the combined TCGA and TARGET data set.

We are looking into this.

9) Sample numbers between the file tcga_RSEM_gene_tpm (10490) and the file tcga_Kallisto_tpm (10663) differ. Is it not expected to get the same number of sample columns between those two files, while the only difference is one used RSEM and the other one Kallisto to compute TPM values?

We are looking into this.

10) It would be nice to add mappings to Uberon using the primary site and the disease ontology using the cancer type. In addition it would be of value to harmonize the body sites and the naming between GTEx and TCGA/TARGET in order to better compare normals to tumors.

Perhaps this is something the GDC would be interested in doing? Seems like it would be useful to the community at large.

Thanks for helping,

Best, Daniel
