Hello Elinne,
Thank you for your question about translating accession numbers to gene names. The uc* accession numbers you refer to are UCSC transcript accession numbers related to the UCSC Genes track on our genome browser. From the track description page (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene):
... transcript is assigned a permanent "uc" accession. If the transcript was not in the previous release of UCSC Genes, the accession ends with the suffix ".1" indicating that this is the first version of this transcript. If the transcript is identical to some transcript in the previous release of UCSC Genes, the accession is re-used with the same version number. If the transcript is not identical to any transcript in the previous release but it overlaps a similar transcript with a compatible structure, the previous accession is re-used with the version number incremented.
On the subject of converting these accession numbers to gene symbols, one of our engineers offers the following insights:
Yes, the TCGA RNA-seq data is referencing the UCSC Genes transcript IDs. Or at least old versions of them. The TCGA standard annotation targets are defined in something called the GAF file. There are two GAF files that are in current use within the TCGA: GAF 3.0 and 2.1. GAF 3.0 corresponds to the previous release of hg19 UCSC Genes, i.e. the track currently on the browser as "Old UCSC Genes". GAF 2.1 corresponds to the release before that. A lot of the RNA-seq data comes from UNC, which is still using GAF 2.1 (and the latest-2 version of UCSC Genes).
If this person just needs to map transcript IDs to gene names, they can use the versions of kgXref that correspond to the relevant UCSC Genes release(s). It's an imperfect solution, since those tables are old and HUGO keeps evolving (and UCSC Genes isn't always consistent with HUGO), but it should be good for 80% or more of the loci.
You can do this mapping with the UCSC Table Browser as follows:
1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables (also available from the top menu of the UCSC Genome Browser by selecting "Tools", then "Table Browser").
2. Use the following settings:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: All Tables
Database: hg19
3. For "Table", select one of the three kgXref tables: kgXref, kgXrefOld6, or kgXrefOld5.
4. Under the "identifiers" heading, click "paste list" or "upload list" to enter the UCSC accession numbers that you want gene symbols for.
5. Under the "output format" heading, select "selected fields from primary and related tables".
6. Click "get output".
7. On the next page, select the fields "kgID" and "geneSymbol" and click "get output".
8. The result should be a list of UCSC accession numbers and their associated gene symbols.
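If you save the output from step 8 and want to see which accession numbers were not found, a short script can help. The following is only a minimal Python sketch. It assumes the output is tab-separated with the accession in the first column, the gene symbol in the second, and a header line starting with "#". The function names, accession numbers, and gene symbols shown are made-up placeholders, not real data.

```python
import csv
import io


def parse_table_browser_output(text):
    """Return {accession: gene_symbol} from tab-separated output text."""
    mapping = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        # Skip blank lines and the header line (assumed to start with "#").
        if not row or row[0].startswith("#"):
            continue
        # Skip malformed lines that lack a second column.
        if len(row) < 2:
            continue
        mapping[row[0]] = row[1]
    return mapping


def find_missing(requested, mapping):
    """Return the requested accessions that are absent from the mapping."""
    return [acc for acc in requested if acc not in mapping]


# Placeholder data for illustration only; these are NOT real annotations.
example_output = (
    "#hg19.kgXref.kgID\thg19.kgXref.geneSymbol\n"
    "uc000aaa.1\tSYMBOL_A\n"
    "uc000bbb.2\tSYMBOL_B\n"
)

symbols = parse_table_browser_output(example_output)
missing = find_missing(["uc000aaa.1", "uc000bbb.2", "uc000ccc.1"], symbols)
print(symbols)
print(missing)
```

The accessions reported as missing are the ones to resubmit against the older tables.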
I suggest that in step 3 you start with the kgXref table, and then try submitting any accession numbers that weren't found to the older tables (kgXrefOld6 and kgXrefOld5). Some accession numbers may be too old even for that - your uc010kbe.1 example is one of them. The kgXrefOld5 table only goes as far back as uc010kbe.2. You can try increasing the version number of the old accessions to see if they are found in one of the tables, but you should double-check to make sure that the gene symbols are correctly identified.
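The lookup order described above (newest table first, then older tables, then higher version numbers as a last resort) can be sketched in Python as follows. This is an illustration under assumptions, not a UCSC tool: it assumes you have already loaded each table's output into a dictionary, and the function name, the "max_bump" limit, and the gene symbols are hypothetical placeholders. Any match found by raising the version number is flagged so that you can double-check it by hand.

```python
def lookup_with_fallback(accession, tables, max_bump=3):
    """Look up an accession in tables ordered newest to oldest.

    tables is a list of (table_name, {accession: symbol}) pairs.
    The exact accession is tried in every table first; only then are
    higher version numbers tried, up to max_bump increments.
    Returns a dict describing the match, or None if nothing matched.
    """
    base, _, version = accession.rpartition(".")
    if not base or not version.isdigit():
        raise ValueError(f"expected an accession like uc010kbe.1, got {accession!r}")
    start = int(version)
    for bump in range(max_bump + 1):
        candidate = f"{base}.{start + bump}"
        for name, table in tables:
            if candidate in table:
                return {
                    "symbol": table[candidate],
                    "table": name,
                    "matched": candidate,
                    # False means the version was incremented: verify by hand.
                    "exact": bump == 0,
                }
    return None


# Placeholder contents for illustration only; the symbols are NOT real.
tables = [
    ("kgXref", {"uc000aaa.3": "SYMBOL_A"}),
    ("kgXrefOld6", {"uc000bbb.1": "SYMBOL_B"}),
    ("kgXrefOld5", {"uc010kbe.2": "SYMBOL_X"}),
]

result = lookup_with_fallback("uc010kbe.1", tables)
print(result)
```

Trying the exact accession in all tables before incrementing the version keeps exact matches preferred over guessed ones.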
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group