We have data tables named knownCanonical, available for different assemblies, that contain a single transcript/isoform per gene.
...
For hg38, the knownCanonical table is a subset of the GENCODE v29 track. As opposed to the hg19 equivalent, which generally used the longest isoform for identification, this table is defined as follows:
knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
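The priority order described above (APPRIS principal, then BASIC, then longest isoform) can be sketched roughly as follows. This is only an illustration of the rule, not the code used to build the table. The dictionary keys `appris_principal` and `basic`, the function name `pick_canonical`, and the use of genomic span (`txEnd - txStart`) as the measure of "longest" are all assumptions made for this example. The documentation also does not say how ties within the APPRIS or BASIC tier are resolved; the sketch picks the longest one.

```python
from typing import Dict, List


def span(transcript: Dict) -> int:
    """Genomic span of a transcript (assumed stand-in for 'longest')."""
    return transcript["txEnd"] - transcript["txStart"]


def pick_canonical(cluster: List[Dict]) -> Dict:
    """Pick one transcript from a cluster (one Ensembl gene ID).

    Priority: APPRIS principal > GENCODE BASIC > longest isoform.
    Ties inside a tier are broken by span, which is an assumption.
    """
    appris = [t for t in cluster if t.get("appris_principal")]
    if appris:
        return max(appris, key=span)
    basic = [t for t in cluster if t.get("basic")]
    if basic:
        return max(basic, key=span)
    return max(cluster, key=span)


# Hypothetical cluster with made-up transcript names and coordinates.
example_cluster = [
    {"name": "txA", "txStart": 100, "txEnd": 900, "basic": True},
    {"name": "txB", "txStart": 100, "txEnd": 500, "appris_principal": True},
    {"name": "txC", "txStart": 0, "txEnd": 2000},
]
chosen = pick_canonical(example_cluster)
print(chosen["name"])
```

In this made-up cluster the APPRIS transcript is selected even though it is the shortest, which is the main behavioral difference from the hg19 "longest isoform" approach.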
It can be downloaded directly from the hg38 downloads database or by using the Table Browser.
| Which cluster of transcripts this belongs to in knownIsoforms |
Hello Alex,
Thank you for your interest in the Genome Browser and for taking the time to write in.
You are correct that the knownCanonical table for hg38 contains some isoforms with duplicate gene symbols. This table is a subset of the GENCODE v29 data, which uses Ensembl IDs as primary keys (e.g. ENSG*). For hg38 knownCanonical, one canonical isoform per cluster ID means that there is only one isoform per Ensembl gene ID. The single isoform per gene ID is chosen by the parameters specified in the documentation (APPRIS tag > GENCODE Basic set > longest isoform).
We do not have a direct way to extract a gene list with no gene symbol duplicates. Such a list would be more difficult to produce and support due to gene symbol versioning and fluidity. If you would like to generate one, the knownCanonical table would be a good place to start; filtering for the longest isoform per gene symbol is then a common approach.
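As a rough illustration of that filtering step, here is a minimal Python sketch that keeps the longest isoform for each gene symbol. It assumes the rows have already been joined with gene symbols and loaded as dictionaries with `name`, `txStart`, `txEnd` and `geneSymbol` keys, and it measures "longest" by genomic span; the function name `longest_per_symbol` is hypothetical. The four sample rows are taken from the duplicate listing in this thread.

```python
from typing import Dict, Iterable


def longest_per_symbol(rows: Iterable[Dict]) -> Dict[str, Dict]:
    """Return one row per gene symbol, keeping the largest txEnd - txStart."""
    best: Dict[str, Dict] = {}
    for row in rows:
        row_span = row["txEnd"] - row["txStart"]
        current = best.get(row["geneSymbol"])
        # Keep the first row seen on ties; replace only when strictly longer.
        if current is None or row_span > current["txEnd"] - current["txStart"]:
            best[row["geneSymbol"]] = row
    return best


rows = [
    {"name": "ENST00000295900.10", "txStart": 63864559, "txEnd": 64003462, "geneSymbol": "ATXN7"},
    {"name": "ENST00000487717.5", "txStart": 63911928, "txEnd": 64000207, "geneSymbol": "ATXN7"},
    {"name": "ENST00000443649.8", "txStart": 122207662, "txEnd": 122227534, "geneSymbol": "DIABLO"},
    {"name": "ENST00000464942.7", "txStart": 122207667, "txEnd": 122226052, "geneSymbol": "DIABLO"},
]

unique = longest_per_symbol(rows)
for symbol, row in sorted(unique.items()):
    print(symbol, row["name"])
```

Note that collapsing by symbol alone also merges entries that sit at different loci or on different strands but share a symbol, so it may be worth reviewing such cases before discarding rows.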
As for the gene present on both the X and Y chromosomes, that example falls in a pseudoautosomal region (PAR1). These are regions that recombine during meiosis. The assemblers of the human genome deal with PARs by assembling a single haplotype sequence (as they do for all the autosomes), but then duplicating it in corresponding regions in X and Y. For some analyses such as sequence alignments, the PAR sequences are replaced with N's in one of the chromosomes so that the duplication of sequence doesn't mess up the results. The Genome Browser does not replace any sequence with N's. A list of PARs in hg38 can be found on the GRC website: https://www.ncbi.nlm.nih.gov/grc/human
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute
Training videos & resources: http://genome.ucsc.edu/training/index.html
| | name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds | geneSymbol | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENST00000508969.2 | chr11 | -1 | 71810300 | 71813859 | 71810300 | 71810300 | 2 | 71810300,71813342, | 71811948,71813859, | ALG1L9P | True |
| 1 | ENST00000532875.1 | chr11 | -1 | 71800540 | 71804640 | 71800540 | 71800540 | 4 | 71800540,71801779,71802857,71804529, | 71800787,71801855,71802972,71804640, | ALG1L9P | True |
| 2 | ENST00000295900.10 | chr3 | 1 | 63864559 | 64003462 | 63912598 | 63999467 | 13 | 63864559,63898398,63912587,63913156,63952378,6... | 63864999,63898497,63912923,63913225,63952483,6... | ATXN7 | True |
| 3 | ENST00000487717.5 | chr3 | 1 | 63911928 | 64000207 | 63912598 | 63999467 | 12 | 63911928,63912587,63913156,63952378,63979914,6... | 63912181,63912923,63913225,63952483,63980167,6... | ATXN7 | True |
| 4 | ENST00000453174.7 | chr10 | 1 | 79904897 | 79951029 | 79904897 | 79904897 | 8 | 79904897,79906571,79907667,79920951,79921913,7... | 79905049,79906780,79907856,79921068,79921988,7... | BMS1P21 | True |
| 5 | ENST00000634565.1 | chr10 | 1 | 79906604 | 79907856 | 79906604 | 79906604 | 2 | 79906604,79907667, | 79906780,79907856, | BMS1P21 | True |
| 6 | ENST00000580790.1 | chr10 | -1 | 73699150 | 73730487 | 73699150 | 73699150 | 12 | 73699150,73704985,73713203,73718014,73719159,7... | 73699588,73705078,73713332,73718165,73719299,7... | BMS1P4 | True |
| 7 | ENST00000584747.5 | chr10 | -1 | 73715842 | 73730469 | 73715842 | 73715842 | 10 | 73715842,73718014,73719159,73720537,73721361,7... | 73716180,73718165,73719299,73720659,73721504,7... | BMS1P4 | True |
| 8 | ENST00000442201.6 | chr3 | -1 | 180614587 | 180679500 | 180614920 | 180679380 | 20 | 180614587,180616280,180616515,180616825,180619... | 180615077,180616363,180616695,180616966,180619... | CCDC39 | True |
| 9 | ENST00000476379.5 | chr3 | -1 | 180614007 | 180737874 | 180619347 | 180679380 | 25 | 180614007,180616280,180616515,180616825,180619... | 180615077,180616363,180616695,180616966,180619... | CCDC39 | True |
| 10 | ENST00000580501.2 | chr10 | -1 | 47581210 | 47582321 | 47581210 | 47581210 | 3 | 47581210,47581458,47582195, | 47581358,47581578,47582321, | CTSLP2 | True |
| 11 | ENST00000628708.1 | chr10 | 1 | 46753604 | 46758198 | 46753604 | 46753604 | 3 | 46753604,46754833,46758077, | 46753917,46754892,46758198, | CTSLP2 | True |
| 12 | ENST00000425346.5 | chr3 | 1 | 50350694 | 50354079 | 50351433 | 50353744 | 4 | 50350694,50351408,50352008,50353240, | 50350984,50351560,50352046,50354079, | CYB561D2 | True |
| 13 | ENST00000607121.5 | chr3 | 1 | 50365373 | 50368197 | 50365373 | 50365373 | 3 | 50365373,50366296,50367795, | 50365940,50366331,50368197, | CYB561D2 | True |
| 14 | ENST00000443649.8 | chr12 | -1 | 122207662 | 122227534 | 122208380 | 122226014 | 7 | 122207662,122216487,122216758,122218265,122224... | 122208577,122216584,122216869,122218397,122224... | DIABLO | True |
| 15 | ENST00000464942.7 | chr12 | -1 | 122207667 | 122226052 | 122208380 | 122226014 | 6 | 122207667,122216487,122216758,122218265,122224... | 122208577,122216584,122216869,122218397,122224... | DIABLO | True |