Hi Rajesh,
Thank you for your question about canonical transcripts in the UCSC
Genes track.
The UCSC Genes track scores each transcript in a "gene cluster" and
then selects the transcript with the highest CDS score as the
"canonical" transcript for that cluster. These CDS scores are
produced using the txCdsPredict program and are based on several
factors, including ORF length, other ORF features, and whether or not
the transcript is present in RefSeq.
For the case of the TP53 gene in the hg19 genome, the txCdsPredict
program scores two transcripts the same, uc002gim.3 and uc002gij.3.
To break the tie, the UCSC Genes construction process chooses
uc002gij.3 because it comes before uc002gim.3 alphabetically.
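To illustrate the selection rule described above, here is a minimal
Python sketch, not the actual UCSC build code. The function name and
the score values are hypothetical (the real txCdsPredict scores are not
shown here); it only demonstrates "highest score wins, ties go to the
ID that sorts first".

```python
def pick_canonical(scores):
    """Return the transcript ID with the highest score.

    Ties are broken by taking the ID that sorts first alphabetically,
    mirroring the behavior described for uc002gij.3 vs. uc002gim.3.
    """
    # Sort key: negative score (so higher scores come first), then the ID.
    return min(scores, key=lambda tx: (-scores[tx], tx))


# Hypothetical, equal placeholder scores for the two TP53 transcripts.
cluster = {"uc002gim.3": 1000.0, "uc002gij.3": 1000.0}
print(pick_canonical(cluster))
```

With equal scores, the sketch returns uc002gij.3, matching the result
seen for TP53 in hg19.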
You may also be interested in the use of APPRIS tags in the GENCODE
Genes track,
https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeGencodeV19.
Items in the GENCODE Genes tracks are given various tags,
http://www.gencodegenes.org/gencode_tags.html, some of which are
APPRIS tags. The APPRIS group attempts to annotate alternatively
spliced transcripts using various methods. For almost every gene locus,
they have attempted to select a single CDS variant as the
"PRINCIPAL" isoform, though there are still some that don't have a
"PRINCIPAL" tag. So, if you are looking to create a set of canonical
genes, it may be best to use the Table Browser with a set
of filters to extract the APPRIS "principal" tagged transcripts from
the GENCODE transcript set. If you're interested in this
information, I can come up with some steps to extract it from the
Table Browser.
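In the meantime, if you download the transcript-to-tag information as
tab-separated text, a filter along these lines could pull out the
principal transcripts. This is only a rough sketch under assumptions:
the two-column layout (transcript ID, then tag), the function name, the
"appris_principal" tag prefix, and the example IDs are all
illustrative, so please check them against your actual download.

```python
import csv
import io


def principal_transcripts(lines, tag_prefix="appris_principal"):
    """Collect transcript IDs whose tag starts with tag_prefix.

    Assumes each row is: transcript ID <TAB> tag. Header/comment rows
    starting with '#' and malformed rows are skipped.
    """
    ids = []
    seen = set()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue
        tx, tag = row[0], row[1]
        # A transcript may carry several tags; report each ID only once.
        if tag.startswith(tag_prefix) and tx not in seen:
            seen.add(tx)
            ids.append(tx)
    return ids


# Made-up rows standing in for a downloaded tag table.
example = io.StringIO(
    "#transcriptId\ttag\n"
    "ENST_EXAMPLE_A.1\tappris_principal\n"
    "ENST_EXAMPLE_A.1\tCCDS\n"
    "ENST_EXAMPLE_B.2\tbasic\n"
)
print(principal_transcripts(example))
```

Genes without any "PRINCIPAL" tag will simply not appear in the output,
so you may need a fallback for those.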
I hope this is helpful. If you have any further questions, please
reply to gen...@soe.ucsc.edu. All messages sent to that address are
archived on a publicly accessible Google Groups forum. If your
question includes sensitive data, you may send it instead to
genom...@soe.ucsc.edu.
Matthew Speir
UCSC Genome Bioinformatics Group