Xena transcript IDs do not completely match Gencode v23

Zhaleh Safikhani

Oct 11, 2017, 11:26:40 AM

to UCSC Xena and Cancer Genomics Browser

Hi,

I am going to quantify transcript abundance in some other datasets with Kallisto and compare the results to the TCGA quantifications available on Xena: https://xenabrowser.net/datapages/?dataset=tcga_Kallisto_est_counts&host=https://toil.xenahubs.net. I therefore want to use the same transcriptome reference that you used to process the TCGA RNA-seq data (apparently Gencode v23, as stated on the website and in your published paper).

However, I noticed that the Gencode v23 GTF file (CHR), ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_23/gencode.v23.annotation.gtf.gz , contains 198,619 transcripts, while the Xena dataset for TCGA lists 197,045: https://xenabrowser.net/datapages/?host=https%3A%2F%2Ftoil.xenahubs.net&dataset=tcga_Kallisto_est_counts&label=TOIL%20Kallisto%20est_counts&allIdentifiers=true.

Strangely, when I intersected these two sets of transcript IDs, I found only 181,836 in common.

More precisely, 16,783 transcripts are in the v23 GTF but not in your set. I guess these are transcripts of genes duplicated between the X and Y chromosomes that you removed.

But there are also 15,208 extra transcripts in the Xena set of identifiers that are not in Gencode v23, and I don’t understand how that is possible.
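
For reference, a comparison like the one described above could be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the exact procedure behind the numbers in this post: the function names are hypothetical, transcript IDs are taken from the `transcript` feature rows of the GTF, and the Xena identifiers are assumed to have already been loaded into a Python set.

```python
import gzip
import re

# Matches the transcript_id attribute in the 9th (attributes) GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(lines):
    """Collect transcript IDs from the 'transcript' feature rows of a GTF."""
    ids = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids


def read_gtf_transcript_ids(path):
    """Same as above, reading from a gzip-compressed GTF file on disk."""
    with gzip.open(path, "rt") as handle:
        return gtf_transcript_ids(handle)


def compare_id_sets(gencode_ids, xena_ids):
    """Split two ID sets into shared IDs and IDs unique to each side."""
    return {
        "common": gencode_ids & xena_ids,
        "gencode_only": gencode_ids - xena_ids,
        "xena_only": xena_ids - gencode_ids,
    }
```

With a local copy of the GTF, `read_gtf_transcript_ids("gencode.v23.annotation.gtf.gz")` would give the Gencode side, and the sizes of the three sets returned by `compare_id_sets` correspond to the three counts discussed here. Note that the comparison is on exact ID strings, so any difference in version suffixes counts as a mismatch.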

I also checked the file https://toil.xenahubs.net/download/gencode.v23.annotation.transcript.probemap.gz that you shared for ID/gene mapping on Xena. The transcripts in this file exactly match those in Gencode v23 and differ from your transcript IDs.

Could you please let me know whether the Gencode annotation used on Xena is still v23? If it is, why does the set of transcript IDs on Xena differ from Gencode v23, and what procedure should I follow to obtain transcript identifiers identical to those on Xena?

Best,

Zhaleh

diff.xena.tcga.gencode.v23.csv

gloria.k...@gmail.com

Oct 16, 2017, 12:08:47 PM

to UCSC Xena and Cancer Genomics Browser

Hi Zhaleh,

Did you get this resolved? There are a number of non-coding RNAs which are part of the 15,208 transcripts that are missing in Xena.

Gloria

Mary Goldman

Oct 16, 2017, 12:30:40 PM

to gloria.k...@gmail.com, UCSC Xena and Cancer Genomics Browser

Hi Gloria,

Having read your other email, I believe this is a different issue. Toil used the complete set of Gencode genes, which is where the extra genes that Zhaleh saw came from.

Please let me know if my other email does not resolve the issue for you.

Best,
Mary
-------------
Mary Goldman
UCSC Xena

UC Santa Cruz Genomics Institute
