Hello Vijay,
Thank you for using the Genome Browser and for your question regarding the knownCanonical table.
The missing BBS5 entry in the knownCanonical table is an artifact of how that table was created for hg19. On the hg19 UCSC Genes description page (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene), we define knownCanonical as follows:
knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
The problem is that when two genes have overlapping coordinates and one lies entirely within the other, the algorithm treats them as isoforms of a single gene, so the smaller gene is missing from knownCanonical. You can see this with BBS5 in the following session (http://genome.ucsc.edu/s/Lou/hg19_MLQ1). KLHL41 has a transcript with the same start site as BBS5, but it extends much further, so all of the BBS5 transcripts fall within it. If you query the Table Browser for these coordinates, you see only KLHL41.
Using coordinates chr2:170,331,250-170,374,046:
chr2 170366211 170382772 17243 uc002ueu.1 uc002ueu.1 KLHL41
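To make the effect concrete, here is a minimal Python sketch of the behavior described above. It is only an illustration under simplifying assumptions, not the actual code used to build the hg19 table: the transcripts, gene names, and coordinates are made up, and the clustering is reduced to "merge anything that overlaps, then keep the longest".

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str
    gene: str
    start: int  # 0-based start
    end: int    # exclusive end


def cluster_by_overlap(transcripts):
    """Group transcripts whose coordinate ranges overlap into one cluster."""
    clusters = []
    current = []
    current_end = None
    for tx in sorted(transcripts, key=lambda t: (t.start, t.end)):
        if current and tx.start < current_end:
            # Overlaps the running cluster: treated as another isoform.
            current.append(tx)
            current_end = max(current_end, tx.end)
        else:
            if current:
                clusters.append(current)
            current = [tx]
            current_end = tx.end
    if current:
        clusters.append(current)
    return clusters


def longest_per_cluster(transcripts):
    """Pick the longest transcript of each overlap cluster."""
    return [
        max(cluster, key=lambda t: t.end - t.start)
        for cluster in cluster_by_overlap(transcripts)
    ]


# Hypothetical toy data: GENE_B sits entirely inside a GENE_A transcript.
toy = [
    Transcript("txA1", "GENE_A", 100, 900),
    Transcript("txB1", "GENE_B", 100, 400),
    Transcript("txB2", "GENE_B", 150, 380),
    Transcript("txC1", "GENE_C", 1200, 1500),
]

canonical_genes = [tx.gene for tx in longest_per_cluster(toy)]
print(canonical_genes)  # ['GENE_A', 'GENE_C']
```

GENE_B never appears in the result because every one of its transcripts is shorter than the GENE_A transcript it was clustered with, which mirrors what happens to BBS5.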
To get around this, you can use the complete knownGene table, or you could use the knownCanonical table for hg38. For the hg38 assembly (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene) the table was generated differently:
knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
This new method does not have the same issue as hg19, because clusters are defined by ENSEMBL gene ID and the canonical transcript is chosen from APPRIS tags first, then the GENCODE BASIC set, and only then the longest isoform. If you convert the region from the session above to hg38 (View in the top blue bar -> In Other Genomes), you get the coordinates chr2:169,474,740-169,517,536. Querying that position in the knownCanonical table with the Table Browser then gives the following results:
chr2 169479177 169506655 10932 ENST00000295240.7 ENSG00000163093.11 BBS5
chr2 169479479 169525922 41682 ENST00000513963.1 ENSG00000251569.1 AC093899.2
chr2 169509701 169526262 36897 ENST00000284669.1 ENSG00000239474.6 KLHL41
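The hg38 selection order quoted above can be sketched roughly as follows. This is a hedged illustration, not the actual build code: the tag strings, gene IDs, and coordinates are hypothetical, and picking the longest transcript when several share a tag is my own assumption, since the description does not say how such ties are resolved.

```python
from collections import defaultdict


def tx_length(tx):
    return tx["end"] - tx["start"]


def pick_canonical(cluster):
    """Apply the fallback order: APPRIS principal, then BASIC, then longest."""
    for tag in ("appris_principal", "basic"):  # hypothetical tag names
        tagged = [tx for tx in cluster if tag in tx["tags"]]
        if tagged:
            # Assumption: break ties within a tag by length.
            return max(tagged, key=tx_length)
    return max(cluster, key=tx_length)


def canonical_by_gene(transcripts):
    """Cluster by gene ID (not by coordinate overlap), then pick one each."""
    clusters = defaultdict(list)
    for tx in transcripts:
        clusters[tx["gene_id"]].append(tx)
    return {gid: pick_canonical(c)["name"] for gid, c in clusters.items()}


# Hypothetical toy data: GENE_B is nested inside GENE_A's transcript.
toy = [
    {"name": "txA1", "gene_id": "GENE_A", "start": 100, "end": 900,
     "tags": {"basic"}},
    {"name": "txB1", "gene_id": "GENE_B", "start": 100, "end": 400,
     "tags": {"basic", "appris_principal"}},
    {"name": "txB2", "gene_id": "GENE_B", "start": 150, "end": 380,
     "tags": {"basic"}},
    {"name": "txC1", "gene_id": "GENE_C", "start": 1200, "end": 1500,
     "tags": set()},
]

result = canonical_by_gene(toy)
print(result)
```

Because the clusters are keyed on gene ID, the nested gene keeps its own entry even though its coordinates fall inside a longer transcript of another gene.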
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Lou Nassar
UCSC Genomics Institute
Hello Vijay,
To get an output from the knownGene table like the one you described you will want to use the "selected fields from primary and related tables" option in the Table Browser:
Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
assembly: hg19
track: UCSC Genes
table: knownGene
output format: selected fields from primary and related tables
get output
Select Fields from hg19.knownGene: chrom, txStart, txEnd
transcript select fields from hg19.kgXref fields: kgID, geneSymbol
get output
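If you end up working from downloaded copies of the tables instead, the same result can be approximated with a small join in Python. This is only a sketch: it assumes tab-separated dumps with header lines, joins knownGene.name to kgXref.kgID, and uses a few rows from your example as inline stand-ins for the real files.

```python
import csv
import io

# Inline stand-ins for tab-separated dumps of hg19.knownGene and hg19.kgXref
# (only the columns needed here; rows taken from the example in this thread).
KNOWN_GENE = """name\tchrom\ttxStart\ttxEnd
uc010nxq.1\tchr1\t11873\t14409
uc009viu.3\tchr1\t14361\t19759
uc001aal.1\tchr1\t69090\t70008
"""

KG_XREF = """kgID\tgeneSymbol
uc010nxq.1\tDDX11L1
uc009viu.3\tWASH7P
uc001aal.1\tOR4F5
"""


def read_tsv(text):
    """Parse a tab-separated table with a header line into dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_symbols(known_gene_rows, kg_xref_rows):
    """Attach the gene symbol to each transcript via knownGene.name = kgXref.kgID."""
    symbol_by_id = {row["kgID"]: row["geneSymbol"] for row in kg_xref_rows}
    return [
        (row["chrom"], int(row["txStart"]), int(row["txEnd"]),
         row["name"], symbol_by_id.get(row["name"], ""))
        for row in known_gene_rows
    ]


rows = join_symbols(read_tsv(KNOWN_GENE), read_tsv(KG_XREF))
for row in rows:
    print("\t".join(str(field) for field in row))
```

For knownCanonical the idea is the same, except that the coordinate columns are chromStart and chromEnd and the join column is knownCanonical.transcript.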
Your output file should look like this (first 10 entries from chrom1):
If you would like to get this output with knownCanonical (like your example) you can follow the steps above with the following changes:
table: knownCanonical
...
Select Fields from hg19.knownCanonical: chrom, chromStart, chromEnd
transcript select fields from hg19.kgXref fields: kgID, geneSymbol
Your output file should look like this (first 10 entries from chrom1):
It may also be worth mentioning that our data tables (such as the example output) use 0-start, half-open coordinates. If this is relevant to you, we have a blog post on the topic: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
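As a quick illustration of the difference, the sketch below converts between the 0-start, half-open coordinates stored in the tables and the 1-start, fully-closed positions shown in the Browser's position box. The function names are my own and are not part of any UCSC tool.

```python
def table_to_position(chrom, chrom_start, chrom_end):
    """0-start, half-open table coordinates -> 1-start, fully-closed position."""
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


def position_to_table(position):
    """1-start, fully-closed position string -> (chrom, start, end) table form."""
    chrom, span = position.replace(",", "").split(":")
    start, end = span.split("-")
    return chrom, int(start) - 1, int(end)


def span_length(chrom_start, chrom_end):
    """With half-open coordinates the length is simply end minus start."""
    return chrom_end - chrom_start


print(table_to_position("chr1", 11873, 14409))   # chr1:11874-14409
print(position_to_table("chr2:169,474,740-169,517,536"))
print(span_length(11873, 14409))                 # 2536
```

Only the start changes between the two systems; the end coordinate is the same number in both.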
Lou Nassar
UCSC Genomics Institute
Dear Luis,
Thank you very much for your reply.
How can I get a file from the knownGene table which looks like this?
chrom transcript_start transcript_stop transcript_id gene symbol
chr1 11873 14409 uc010nxq.1 DDX11L1
chr1 14361 19759 uc009viu.3 WASH7P
chr1 14406 29370 uc009viw.2 WASH7P
chr1 34610 36081 uc001aak.3 FAM138F
chr1 69090 70008 uc001aal.1 OR4F5
chr1 134772 140566 uc021oeg.2 LOC729737
chr1 321083 321115 uc001aaq.2 DQ597235
chr1 321145 321207 uc001aar.2 DQ599768
chr1 322036 326938 uc009vjk.2 LOC100133331
Thanks and Regards
Vijay Lakhujani
Senior Bioinformatician
Neuberg Center for Genomic Medicine,
Neuberg Supratech Reference Laboratories, Ahmedabad
vijay.l...@supratechlabs.com