Dear UCSC Community,
I am working with data processed by the UCSC Xena Team using their Toil RNA-seq CGL Pipeline. All data files were downloaded from
http://xena.ucsc.edu using an S3 bucket link that was provided in the help section this July. The publication related to this data is here:
http://dx.doi.org/10.1101/062497
While working with the data, I discovered some inconsistencies which I would like to discuss within the community.
1) The files tcga_expected_count and tcga_RSEM_gene_tpm contain 10657 and 10490 samples, respectively, and the sample names do not fully overlap. Both files are based on the RSEM output, so I would expect the same number of samples with identical names.
2) Based on the file TCGA_TARGET_phenotype.txt, I discovered several samples (e.g. TCGA-17-Z000-01) for which the columns sample_type or gender had no values. The GDC data portal however claims that this specific sample is of sample type "Primary Tumor".
3) Sample names seem to omit some characters. For the sample TCGA-17-Z000-01 the GDC data portal claims the sample submitter id TCGA-17-Z000-01A. Why are sample names in UCSC Xena truncated?
4) Some samples appear more than once in the file tcga_RSEM_gene_tpm. The sample ID TCGA-38-4625-01 can be found in columns 7397 and 10441, and the expression values differ. Reading this data into R caused the sample names to be renamed to TCGA-38-4625-01.1 and TCGA-38-4625-01.2 even though I used check.names = FALSE.
5) Some samples listed in the file TCGA_TARGET_phenotype.txt have no expression data. 200 samples match the string "TCGA Acute Myeloid Leukemia", but those samples cannot be found in the file tcga_RSEM_gene_tpm. The same holds true for the Osteosarcoma samples from TARGET.
6) The file GTEX_phenotype.txt contains 5 samples that have the column "body_site_detail (SMTSD)" filled out (e.g. Stomach) while the column _primary_site is empty. Clearly those values can be derived from the body_site_detail column.
7) The file GTEX_phenotype.txt contains 9783 samples, while the corresponding GTEx expression file contains only 7863 columns with expression values. Why are roughly 1920 samples apparently missing?
8) The file TCGA_TARGET_phenotype.txt claims 18,046 samples for TCGA and TARGET combined. However, neither the files tcga_RSEM_gene_tpm and target_RSEM_gene_tpm (11,224 samples together) nor the information on the website
https://xenabrowser.net/datapages/?host=https://toil.xenahubs.net (11,928 samples) supports this total sample number for the combined TCGA and TARGET data set.
9) The number of samples differs between the file tcga_RSEM_gene_tpm (10490) and the file tcga_Kallisto_tpm (10663). Shouldn't both files contain the same number of sample columns, given that the only difference is that one used RSEM and the other Kallisto to compute TPM values?
10) It would be nice to add mappings to Uberon based on the primary site and to the Disease Ontology based on the cancer type. In addition, it would be valuable to harmonize the body sites and the naming between GTEx and TCGA/TARGET so that normals can be compared to tumors more easily.
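
For anyone who wants to reproduce the header checks behind points 1, 4 and 9, here is a minimal Python sketch. It is only an illustration under assumptions: it assumes each expression file is a tab-separated matrix whose first row holds the sample IDs and whose first column holds the gene or transcript identifier. The function names and the toy headers (S1, S2, ...) are made up for this example and are not part of the Xena files.

```python
import csv
import io
from collections import Counter


def read_sample_ids(handle):
    """Return the sample IDs from the header row of a tab-separated matrix.

    Assumption: the first header field labels the gene/transcript column and
    every remaining field is a sample ID.
    """
    header = next(csv.reader(handle, delimiter="\t"))
    return header[1:]


def duplicated_ids(sample_ids):
    """Map each repeated sample ID to its 1-based column numbers in the file."""
    counts = Counter(sample_ids)
    return {
        sid: [i + 2 for i, s in enumerate(sample_ids) if s == sid]  # +2: skip ID column, 1-based
        for sid, n in counts.items()
        if n > 1
    }


def compare_ids(ids_a, ids_b):
    """Split the sample IDs of two files into shared and file-specific sets."""
    a, b = set(ids_a), set(ids_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy headers standing in for two expression matrices (hypothetical data).
counts_ids = read_sample_ids(io.StringIO("sample\tS1\tS2\tS3\n"))
tpm_ids = read_sample_ids(io.StringIO("sample\tS2\tS3\tS3\tS4\n"))

print(len(counts_ids), len(tpm_ids))      # number of sample columns per file
print(duplicated_ids(tpm_ids))            # repeated IDs and their columns
print(compare_ids(counts_ids, tpm_ids))   # overlap between the two files
```

With the real files, the StringIO objects would be replaced by file handles, for example open("tcga_RSEM_gene_tpm", newline=""), assuming the file has been decompressed first. Only the header row is read, so the check stays fast even for large matrices.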
Thanks for helping,
Best, Daniel