Hello Paul,
Thank you for your question about finding a representative transcript for each gene in a list. We do have a table that contains only a single transcript for each cluster; the table's name is knownCanonical. Using the table's data is slightly more awkward in your case, however, as you would like to search by gene name instead of a locus or a transcript ID - the knownCanonical table is not set up for that kind of search. The easiest way to obtain the data you seek is to do a normal Table Browser search on the UCSC Genes knownGene table, but add in fields from knownCanonical as follows:
1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options (or alter them, as appropriate for your project):
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Predictions
Track: UCSC Genes
Table: knownGene
Region: genome
Output Format: selected fields from primary and related tables
3. Click the "paste list" or "upload list" button to add the list of gene identifiers that you are searching for.
4. Click "get output".
5. On the next page, scroll down and check the box for "knownCanonical" from the "Linked Tables" list. Then scroll to the end of the page and click "allow selection from checked tables". You should now be able to select fields from the knownGene, kgXref (checked by default), and the knownCanonical tables.
6. Select whichever data fields you want to have from the knownGene table (transcription start and end coordinates, CDS coordinates, exons ...), then add the geneSymbol field from kgXref and one or more fields from the knownCanonical table.
7. Click "get output".
The resulting output will still have one entry for each transcript, so it contains the same duplicate information that you were trying to filter out. This time, however, most of the lines will show "n/a" in the knownCanonical fields: any transcript that is not in the knownCanonical table (and thus is not the canonical representative of its cluster) has no data to report there. If you filter out all of the lines with "n/a", you will be left with a single transcript for each gene, with the gene symbol listed on the same line. On a UNIX-like system, for example, you can filter out those lines by running
grep -v 'n/a' on the output file.
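Note that grep removes a line if "n/a" appears anywhere on it, so a transcript would also be dropped if some other selected field happened to be empty. If that is a concern, a small script can check only the knownCanonical columns. The following is a minimal sketch, not an official UCSC tool. It assumes the output is tab-separated with a first line of field names such as "#hg19.knownCanonical.transcript"; the function name keep_canonical and the sample rows are made up for illustration.

```python
import csv
import io


def keep_canonical(lines, marker="knownCanonical", missing="n/a"):
    """Return (header, rows), keeping rows that have data in the marker columns.

    Assumes tab-separated text whose first line holds the field names.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Find the columns that came from the knownCanonical table.
    cols = [i for i, name in enumerate(header) if marker in name]
    if not cols:
        raise ValueError(f"no column name contains {marker!r}")
    rows = [
        row for row in reader
        if len(row) > max(cols) and any(row[i] != missing for i in cols)
    ]
    return header, rows


# Made-up sample rows; real output will contain the fields you selected.
sample = (
    "#hg19.knownGene.name\thg19.kgXref.geneSymbol\thg19.knownCanonical.transcript\n"
    "uc000aaa.1\tGENE1\tuc000aaa.1\n"
    "uc000aab.1\tGENE1\tn/a\n"
    "uc000aac.1\tn/a\tuc000aac.1\n"
)
header, rows = keep_canonical(io.StringIO(sample))
for row in rows:
    print("\t".join(row))
```

In this sample the third transcript is kept even though its gene symbol field is "n/a", whereas grep -v 'n/a' would have discarded it. To run the function on a saved file, pass an open file handle in place of io.StringIO(sample).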
Please review any results obtained this way. The knownCanonical table is created by an automated process, and manual curation is not involved. Errors in classification do crop up once in a while.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group