Transcript ID question

Elinne Becket

Dec 19, 2013, 6:38:56 PM

to gen...@soe.ucsc.edu

Hello,

I am a post-doc in the USC Peter Jones Lab. I downloaded RNA-seq isoform data from TCGA, and the output is a list of isoform IDs that all consist of "uc0##xxx.#", which is the same name listed for gene isoforms on the UCSC browser (for example, an isoform of gene FILIP1 is uc010kbe.1).

After an exhaustive search, I can't find what this ID name is specifically called, and how to retrieve gene names from these IDs (on a large scale). I was wondering if you had any knowledge about how I can convert these IDs to gene names? I would be very grateful.

Thank you,

Elinne Becket

Post-doctoral Scholar

Peter A Jones Laboratory

USC Norris Cancer Center, NOR 7341

1441 Eastlake Ave

Los Angeles, CA 90089

818-667-4970

Jonathan Casper

Dec 20, 2013, 5:12:09 PM

to Elinne Becket, gen...@soe.ucsc.edu

Hello Elinne,

Thank you for your question about translating accession numbers to gene names. The uc* accession numbers you refer to are UCSC transcript accession numbers related to the UCSC Genes track on our genome browser. From the track description page (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):

... transcript is assigned a permanent "uc" accession. If the transcript was not in the previous release of UCSC Genes, the accession ends with the suffix ".1" indicating that this is the first version of this transcript. If the transcript is identical to some transcript in the previous release of UCSC Genes, the accession is re-used with the same version number. If the transcript is not identical to any transcript in the previous release but it overlaps a similar transcript with a compatible structure, the previous accession is re-used with the version number incremented.
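To make the accession structure concrete, here is a minimal Python sketch that splits a versioned "uc" accession into its base accession and version number. The regular expression is an assumption inferred from the examples in this thread (such as uc010kbe.1), not an official specification, and the function name split_accession is hypothetical.

```python
import re

# Assumed shape, inferred from examples in this thread: "uc", three digits,
# three lowercase letters, then "." and a version number (e.g. uc010kbe.1).
UC_ACCESSION = re.compile(r"^(uc\d{3}[a-z]{3})\.(\d+)$")


def split_accession(accession):
    """Return (base, version) for a versioned uc accession, or None if it does not match."""
    match = UC_ACCESSION.match(accession.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(split_accession("uc010kbe.1"))  # prints ('uc010kbe', 1)
```

Keeping the version as a separate integer makes it easy to compare versions of the same transcript across UCSC Genes releases.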

On the subject of converting these accession numbers to gene symbols, one of our engineers offers the following insights:

Yes, the TCGA RNA-seq data is referencing the UCSC Genes transcript IDs. Or at least old versions of them. The TCGA standard annotation targets are defined in something called the GAF file. There are two GAF files that are in current use within the TCGA: GAF 3.0 and 2.1. GAF 3.0 corresponds to the previous release of hg19 UCSC Genes, i.e. the track currently on the browser as "Old UCSC Genes". GAF 2.1 corresponds to the release before that. A lot of the RNA-seq data comes from UNC, which is still using GAF 2.1 (and the latest-2 version of UCSC Genes).

If this person just needs to map transcript IDs to gene names, they can use the versions of kgXref that correspond to the UCSC Genes release(s). It's an imperfect solution, since it's old and HUGO keeps evolving (and UCSC Genes isn't always consistent with HUGO), but it should be good for 80% or more of the loci.

You can do this mapping with the UCSC Table Browser as follows:

1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables (also available from the top menu of the UCSC Genome Browser by selecting "Tools", then "Table Browser").
2. Use the following settings:

Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: All Tables
Database: hg19

3. For "Table", select one of the three kgXref tables: kgXref, kgXrefOld6, or kgXrefOld5.
4. Under the "identifiers" heading, click "paste list" or "upload list" to enter the UCSC accession numbers that you want gene symbols for.
5. Under the "output format" heading, select "selected fields from primary and related tables".
6. Click "get output".
7. On the next page, select the fields "kgID" and "geneSymbol" and click "get output".
8. The result should be a list of UCSC accession numbers and their associated gene symbols.

I suggest that in step 3 you start with the kgXref table, and then try submitting any accession numbers that weren't found to the older tables (kgXrefOld6 and kgXrefOld5). Some accession numbers may be too old even for that - your uc010kbe.1 example is one of them. The kgXrefOld5 table only goes as far back as uc010kbe.2. You can try increasing the version number of the old accessions to see if they are found in one of the tables, but you should double-check to make sure that the gene symbols are correctly identified.
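If you save the Table Browser output from each table, the fallback described above (kgXref first, then kgXrefOld6, then kgXrefOld5) could be scripted. The following Python sketch assumes tab-separated output with the columns kgID and geneSymbol. The function names load_xref and lookup are hypothetical, and the sample rows are small illustrative stand-ins for real downloads. In particular, the kgXrefOld5 row pairing uc010kbe.2 with FILIP1 is an assumption for the example and has not been checked against the actual table.

```python
import csv
import io


def load_xref(handle):
    """Read Table Browser output (tab-separated kgID, geneSymbol) into a dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and the "#kgID<TAB>geneSymbol" header line.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        mapping[row[0]] = row[1]
    return mapping


def lookup(accessions, tables):
    """Try each (name, table) pair in order; return (found, missing)."""
    found, missing = {}, []
    for acc in accessions:
        for name, table in tables:
            if acc in table:
                # Record which table supplied the symbol, for later checking.
                found[acc] = (table[acc], name)
                break
        else:
            missing.append(acc)
    return found, missing


# Illustrative stand-ins for saved output; real files would be opened with
# open(path, newline="") instead of io.StringIO.
current = load_xref(io.StringIO("#kgID\tgeneSymbol\nuc001aaa.3\tDDX11L1\nuc001aac.4\tWASH7P\n"))
old6 = load_xref(io.StringIO(""))
old5 = load_xref(io.StringIO("#kgID\tgeneSymbol\nuc010kbe.2\tFILIP1\n"))  # assumed row

tables = [("kgXref", current), ("kgXrefOld6", old6), ("kgXrefOld5", old5)]
found, missing = lookup(["uc001aaa.3", "uc010kbe.2", "uc010kbe.1"], tables)
print(found)
print(missing)
```

Recording the source table next to each symbol makes it easier to double-check symbols that came from an older release, and the missing list contains the accessions to retry with an incremented version number.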

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group


Elinne Becket

Dec 20, 2013, 5:51:28 PM

to Jonathan Casper, gen...@soe.ucsc.edu

Thanks so much, Jonathan. I will definitely give this a try. Even knowing what the uc accession numbers were made up of helped a lot.

One quick additional question. Will the Table Browser strategy work if I remove the numeric suffix from my accession numbers?

Best,

Elinne

Brian Lee

Dec 23, 2013, 4:14:47 PM

to Elinne Becket, Jonathan Casper, gen...@soe.ucsc.edu

Dear Elinne,

Unfortunately, pasting the incomplete identifier will not work with the Table Browser's paste identifiers option. However, you can use the Table Browser's filter option instead.

1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables (also available from the top menu of the UCSC Genome Browser by selecting "Tools", then "Table Browser").

2. Use the following settings:

Clade: Mammal

Genome: Human

Assembly: Feb. 2009 (GRCh37/hg19)

Group: Genes and Gene Prediction Tracks

Track: UCSC Genes

Table: kgXref

3. Click the "filters: create" button.

4. In the "kgID DOES match" box, you can paste a list of your identifiers with the numeric suffix replaced by an asterisk, such as "uc001aaa* uc001aac* uc001aae*". Then click "submit".

5. Under the "output format" heading, select "selected fields from primary and related tables".

6. Click "get output".

7. On the next page, select the fields "kgID" and "geneSymbol" and click "get output".

8. The result should be a list of UCSC accession numbers and their associated gene symbols such as:

#kgID geneSymbol

uc001aaa.3 DDX11L1

uc001aac.4 WASH7P

uc001aae.4 WASH7P
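For a long list of accessions, the wildcard string for the filter box in step 4 could be generated with a short script rather than edited by hand. This Python sketch assumes the version suffix is everything after the first "."; the function names to_wildcard_filter and match_by_base are hypothetical. The rows passed to match_by_base are the example output shown above.

```python
def to_wildcard_filter(accessions):
    """Strip any version suffix, append "*", drop duplicates, and join with spaces."""
    patterns = []
    for acc in accessions:
        base = acc.strip().split(".", 1)[0]
        pattern = base + "*"
        if base and pattern not in patterns:
            patterns.append(pattern)
    return " ".join(patterns)


def match_by_base(rows):
    """Index Table Browser result rows (kgID, geneSymbol) by unversioned accession."""
    by_base = {}
    for kg_id, symbol in rows:
        by_base[kg_id.split(".", 1)[0]] = (kg_id, symbol)
    return by_base


print(to_wildcard_filter(["uc001aaa.1", "uc001aac.2", "uc001aae", "uc001aaa.3"]))
# prints: uc001aaa* uc001aac* uc001aae*

results = match_by_base([
    ("uc001aaa.3", "DDX11L1"),
    ("uc001aac.4", "WASH7P"),
    ("uc001aae.4", "WASH7P"),
])
print(results["uc001aaa"])  # prints ('uc001aaa.3', 'DDX11L1')
```

Indexing the results by unversioned accession lets you see which version the table actually returned for each of your original identifiers.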

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee

UCSC Genome Bioinformatics Group
