knownCanonical


Alex Lenail

Jul 1, 2019, 11:53:19 AM
to gen...@soe.ucsc.edu
Hello, 

I'd like to download a set of canonical splice isoforms for hg38 (gencode 29), with a single splice isoform per gene symbol. Right now, the knownCanonical table available for download contains 64,792 rows, which, when merged with kgXref, reveals multiple "canonical" isoforms per gene symbol for many genes. The FAQ states: 
I just want to download a gene set with a single entry per gene. Where can I find this?

We have data tables named knownCanonical available for different assemblies comprised of a single transcript/isoform per gene.

...

For hg38, the knownCanonical table is a subset of the GENCODE v29 track. As opposed to the hg19 equivalent, which generally used the longest isoform for identification, this table is defined as follows:

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.

It can be downloaded directly from the hg38 downloads database or by using the Table Browser.


It seems there's likely a canonical isoform for each geneSymbol / gene cluster combination. I'm not aware of what a gene cluster is; the table schema at UCSC just tells me it's 
Which cluster of transcripts this belongs to in knownIsoforms

Would you be willing to point me towards documentation as to what those clusterIds are, or better yet, how I can get a table of unique canonical isoforms per gene symbol? 

Many thanks, 

--Alex

Alex Lenail

Jul 1, 2019, 11:53:19 AM
to gen...@soe.ucsc.edu
And a follow-up: 

I had anticipated that the ENST* names would be unique in the knownGene annotation file, but I see many rows like the following: 

ENST00000313871.8 chrX + 1591592 1602514 1593462 1601594 5 1591592,1593443,1595383,1599191,1600658, 1591769,1594224,1595532,1599432,1602514, Q02040 uc004fpn.4
ENST00000313871.8 chrY + 1591592 1602514 1593462 1601594 5 1591592,1593443,1595383,1599191,1600658, 1591769,1594224,1595532,1599432,1602514, Q02040 uc004fpn.4

Surely this must be a mistake -- these genes can't be both on the X and Y chromosomes at the exact same coordinates, right? 

--Alex

Luis Nassar

Jul 4, 2019, 12:24:39 PM
to Alex Lenail, UCSC Genome Browser Discussion List

Hello Alex,

Thank you for your interest in the Genome Browser and for taking the time to write in.

You are correct that the knownCanonical table for hg38 contains some isoforms with duplicate gene symbols. This table is a subset of the GENCODE v29 data, which uses Ensembl gene IDs as primary keys (e.g. ENSG*). For hg38 knownCanonical, one canonical isoform per cluster ID means that there is only one isoform per Ensembl gene ID, not one per gene symbol. The single isoform per gene ID is chosen by the criteria specified in the documentation, in this order of priority: APPRIS tag > GENCODE Basic set > longest isoform.
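As a rough illustration of that priority order, here is a minimal Python sketch. It is not the pipeline code used to build the table: the function name, the dictionary keys (`gene_id`, `appris`, `basic`, `length`), and the choice to break ties within a tier by length are all assumptions made for the example.

```python
from collections import defaultdict


def pick_canonical(transcripts):
    """Pick one transcript per Ensembl gene ID: APPRIS > BASIC > longest.

    `transcripts` is a list of dicts with hypothetical keys:
    name, gene_id, appris (bool), basic (bool), length (int).
    """
    by_gene = defaultdict(list)
    for tx in transcripts:
        by_gene[tx["gene_id"]].append(tx)

    canonical = {}
    for gene_id, txs in by_gene.items():
        appris = [t for t in txs if t["appris"]]
        basic = [t for t in txs if t["basic"]]
        # Use the first non-empty tier; fall back to every transcript of the gene.
        pool = appris or basic or txs
        # Assumption: within a tier, ties are broken by taking the longest.
        canonical[gene_id] = max(pool, key=lambda t: t["length"])["name"]
    return canonical
```

Because the grouping key is the Ensembl gene ID, two Ensembl genes that share a gene symbol each receive their own canonical transcript, which is how duplicate symbols end up in the table.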

We do not have a direct way to extract a gene list with no gene symbol duplicates. Such a list would be more difficult to produce and support because gene symbols change between versions. If you would like to generate one yourself, the knownCanonical table is a good place to start; keeping only the longest isoform for each gene symbol is a common approach.
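A minimal Python sketch of that filtering step follows. It assumes both tables have already been read into lists of dicts keyed by column name (`transcript`, `chromStart`, `chromEnd` for knownCanonical; `kgID`, `geneSymbol` for kgXref; please check these against the table schema), and it measures "longest" as genomic span, which is only one possible definition. The function name is hypothetical.

```python
def longest_per_symbol(known_canonical, kg_xref):
    """Keep one canonical transcript per gene symbol: the longest genomic span."""
    # Map transcript ID -> gene symbol using the cross-reference table.
    symbol_of = {row["kgID"]: row["geneSymbol"] for row in kg_xref}

    best = {}  # symbol -> (span, transcript ID)
    for row in known_canonical:
        symbol = symbol_of.get(row["transcript"])
        if symbol is None:
            continue  # no symbol on record for this transcript
        span = int(row["chromEnd"]) - int(row["chromStart"])
        if symbol not in best or span > best[symbol][0]:
            best[symbol] = (span, row["transcript"])
    return {symbol: tx for symbol, (span, tx) in best.items()}
```

Note that any two transcripts sharing a symbol are collapsed to one, including copies on different chromosomes, so the result should be reviewed rather than trusted blindly.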

As for the gene present on both the X and Y chromosomes, that example falls in a pseudoautosomal region (PAR1). These are regions of X and Y that recombine with each other during meiosis. The assemblers of the human genome deal with PARs by assembling a single haplotype sequence (as they do for all the autosomes) and then duplicating it in the corresponding regions of X and Y. For some analyses, such as sequence alignments, the PAR sequence is replaced with N's on one of the two chromosomes so that the duplicated sequence doesn't distort the results. The Genome Browser does not replace any sequence with N's, so annotations in these regions appear on both chromosomes. A list of PARs in hg38 can be found on the GRC website: https://www.ncbi.nlm.nih.gov/grc/human
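If it is useful to flag such rows, the following minimal Python sketch finds transcripts that occur on both chrX and chrY with identical strand and coordinates, as in the example above. It assumes each knownGene row has been reduced to a (name, chrom, strand, txStart, txEnd) tuple; the function name is hypothetical, and copies whose X and Y coordinates differ would not be caught by this check.

```python
from collections import defaultdict


def find_xy_duplicates(rows):
    """Return transcript names present on both chrX and chrY at the same coordinates."""
    chroms_by_key = defaultdict(set)
    for name, chrom, strand, tx_start, tx_end in rows:
        # Key on everything except the chromosome, then record where it appears.
        chroms_by_key[(name, strand, tx_start, tx_end)].add(chrom)
    return sorted(
        key[0]
        for key, chroms in chroms_by_key.items()
        if {"chrX", "chrY"} <= chroms
    )


rows = [
    ("ENST00000313871.8", "chrX", "+", 1591592, 1602514),
    ("ENST00000313871.8", "chrY", "+", 1591592, 1602514),
]
print(find_xy_duplicates(rows))  # ['ENST00000313871.8']
```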

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute

Training videos & resources: http://genome.ucsc.edu/training/index.html
Want to share the Browser with colleagues?
Host a workshop: http://bit.ly/ucscTraining



Alex Lenail

Jul 4, 2019, 4:43:44 PM
to Luis Nassar, Cory McLean, Alexander Lenail, UCSC Genome Browser Discussion List
Thanks for getting back to me, Lou. 

I've defaulted to choosing the longest of the canonical isoforms as the "true canonical" isoform for each gene symbol. 

For your records, I've attached the ~800 gene symbols for which the UCSC knownCanonical table lists multiple canonical isoforms (excluding those in PARs). I've also copied the first few rows below; the last column is a boolean "canonical" field.

If there is another piece of biology which explains these (analogous to the PAR regions, which I hadn't known about before this, e.g. maybe something related to pseudogenes, etc...) I'd love to know!



	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	geneSymbol	canonical
0	ENST00000508969.2	chr11	-1	71810300	71813859	71810300	71810300	2	71810300,71813342,	71811948,71813859,	ALG1L9P	True
1	ENST00000532875.1	chr11	-1	71800540	71804640	71800540	71800540	4	71800540,71801779,71802857,71804529,	71800787,71801855,71802972,71804640,	ALG1L9P	True
2	ENST00000295900.10	chr3	1	63864559	64003462	63912598	63999467	13	63864559,63898398,63912587,63913156,63952378,6...	63864999,63898497,63912923,63913225,63952483,6...	ATXN7	True
3	ENST00000487717.5	chr3	1	63911928	64000207	63912598	63999467	12	63911928,63912587,63913156,63952378,63979914,6...	63912181,63912923,63913225,63952483,63980167,6...	ATXN7	True
4	ENST00000453174.7	chr10	1	79904897	79951029	79904897	79904897	8	79904897,79906571,79907667,79920951,79921913,7...	79905049,79906780,79907856,79921068,79921988,7...	BMS1P21	True
5	ENST00000634565.1	chr10	1	79906604	79907856	79906604	79906604	2	79906604,79907667,	79906780,79907856,	BMS1P21	True
6	ENST00000580790.1	chr10	-1	73699150	73730487	73699150	73699150	12	73699150,73704985,73713203,73718014,73719159,7...	73699588,73705078,73713332,73718165,73719299,7...	BMS1P4	True
7	ENST00000584747.5	chr10	-1	73715842	73730469	73715842	73715842	10	73715842,73718014,73719159,73720537,73721361,7...	73716180,73718165,73719299,73720659,73721504,7...	BMS1P4	True
8	ENST00000442201.6	chr3	-1	180614587	180679500	180614920	180679380	20	180614587,180616280,180616515,180616825,180619...	180615077,180616363,180616695,180616966,180619...	CCDC39	True
9	ENST00000476379.5	chr3	-1	180614007	180737874	180619347	180679380	25	180614007,180616280,180616515,180616825,180619...	180615077,180616363,180616695,180616966,180619...	CCDC39	True
10	ENST00000580501.2	chr10	-1	47581210	47582321	47581210	47581210	3	47581210,47581458,47582195,	47581358,47581578,47582321,	CTSLP2	True
11	ENST00000628708.1	chr10	1	46753604	46758198	46753604	46753604	3	46753604,46754833,46758077,	46753917,46754892,46758198,	CTSLP2	True
12	ENST00000425346.5	chr3	1	50350694	50354079	50351433	50353744	4	50350694,50351408,50352008,50353240,	50350984,50351560,50352046,50354079,	CYB561D2	True
13	ENST00000607121.5	chr3	1	50365373	50368197	50365373	50365373	3	50365373,50366296,50367795,	50365940,50366331,50368197,	CYB561D2	True
14	ENST00000443649.8	chr12	-1	122207662	122227534	122208380	122226014	7	122207662,122216487,122216758,122218265,122224...	122208577,122216584,122216869,122218397,122224...	DIABLO	True
15	ENST00000464942.7	chr12	-1	122207667	122226052	122208380	122226014	6	122207667,122216487,122216758,122218265,122224...	122208577,122216584,122216869,122218397,122224...	DIABLO	True
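For anyone who wants to reproduce a list like this, here is a minimal Python sketch of the grouping step. It is an illustration rather than the exact code behind the attachment: it assumes the canonical rows have been reduced to (transcript name, gene symbol) pairs, and the function name is hypothetical. Because it collects distinct transcript names, a PAR transcript that appears under the same ENST ID on both chrX and chrY counts once and is not reported.

```python
from collections import defaultdict


def symbols_with_multiple_canonical(rows):
    """Map each gene symbol to its canonical transcripts, keeping symbols with more than one."""
    by_symbol = defaultdict(set)
    for name, symbol in rows:
        by_symbol[symbol].add(name)
    return {
        symbol: sorted(names)
        for symbol, names in by_symbol.items()
        if len(names) > 1
    }


rows = [
    ("ENST00000508969.2", "ALG1L9P"),
    ("ENST00000532875.1", "ALG1L9P"),
]
print(symbols_with_multiple_canonical(rows))
# {'ALG1L9P': ['ENST00000508969.2', 'ENST00000532875.1']}
```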


Happy 4th,

--Alex

duplicate_canonical.tsv