Gene BBS5 missing from UCSC knownCanonical table

VIJAY LAKHUJANI

Mar 4, 2019, 12:00:51 PM

to gen...@soe.ucsc.edu

Dear Team,

I am trying to generate a BED file using the steps below.

Download a BED file for the canonical transcripts using the UCSC Table Browser:

track: UCSC Genes
table: knownCanonical
output format: select fields from primary and related tables
press get output
select fields from hg19.knownCanonical: chrom, chromStart, chromEnd, transcript
select fields from hg19.kgXref: geneSymbol
press get output

This BED file does not have the gene "BBS5" in it.

However, when I follow the same steps but select the table "knownGene" (instead of knownCanonical), I can see the BBS5 gene. Why does this gene not appear in the knownCanonical table?

Could you please help?

Thanks and Regards
Vijay Lakhujani
Senior Bioinformatician
Neuberg Center for Genomic Medicine,
Neuberg Supratech Reference Laboratories, Ahmedabad
vijay.l...@supratechlabs.com

Luis Nassar

Mar 4, 2019, 6:13:30 PM

to VIJAY LAKHUJANI, gen...@soe.ucsc.edu

Hello Vijay,

Thank you for using the Genome Browser and for your question regarding the knownCanonical table.

The missing BBS5 entry in the knownCanonical table is an artifact of how that table was created for hg19. On the hg19 UCSC Genes description page (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene), we define knownCanonical as follows:

knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.

The problem arises when two genes have overlapping coordinates and one lies entirely within the other: the algorithm treats them as isoforms of a single gene, and the smaller gene is missing from knownCanonical. You can see this with BBS5 in the following session (http://genome.ucsc.edu/s/Lou/hg19_MLQ1). KLHL41 has a transcript with the same start site as BBS5, but it extends much further, so all of the BBS5 transcripts fall within it. If you query the Table Browser for these coordinates, you see only KLHL41.

Using coordinates chr2:170,331,250-170,374,046:

chr2 170366211 170382772 17243 uc002ueu.1 uc002ueu.1 KLHL41
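
The effect can be reproduced with a toy model. The sketch below is not the code used to build the hg19 table; it assumes a simplified rule in which transcripts with overlapping coordinates are merged into one cluster and the longest member is reported as canonical. All names and coordinates in it are hypothetical, chosen so that the small gene lies entirely inside the large one.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    name: str   # hypothetical transcript label
    gene: str   # hypothetical gene symbol
    start: int  # 0-based start
    end: int    # end, exclusive


def canonical_by_overlap(transcripts):
    """Merge overlapping transcripts into clusters; keep the longest of each."""
    clusters = []  # each entry is [cluster_end, members]
    for tx in sorted(transcripts, key=lambda t: (t.start, t.end)):
        if clusters and tx.start < clusters[-1][0]:
            # Overlaps the current cluster: join it and extend the cluster end.
            clusters[-1][0] = max(clusters[-1][0], tx.end)
            clusters[-1][1].append(tx)
        else:
            # No overlap with the previous cluster: start a new one.
            clusters.append([tx.end, [tx]])
    return [max(members, key=lambda t: t.end - t.start) for _, members in clusters]


# Hypothetical data: GENE_SMALL sits entirely inside GENE_LARGE.
toy = [
    Transcript("small_tx1", "GENE_SMALL", 1_000, 3_000),
    Transcript("small_tx2", "GENE_SMALL", 1_200, 2_800),
    Transcript("large_tx1", "GENE_LARGE", 1_000, 9_000),
    Transcript("other_tx1", "GENE_OTHER", 12_000, 15_000),
]

canonical = canonical_by_overlap(toy)
print([tx.gene for tx in canonical])  # GENE_SMALL has no canonical entry
```

Under this rule GENE_SMALL never gets its own row, because every one of its transcripts is absorbed into the GENE_LARGE cluster and loses on length.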

To get around this, you can use the complete knownGene table, or use the knownCanonical table for hg38. For the hg38 assembly (http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene), the table was generated differently:

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.

This newer method does not have the same issue as hg19, because it defines clusters by ENSEMBL gene ID and then picks the transcript using APPRIS tags, then the GENCODE BASIC set, and only if neither is available the longest isoform. If you convert the region from the session above to hg38 (View in the top blue bar -> In Other Genomes), you get the coordinates chr2:169,474,740-169,517,536. Querying that position in the knownCanonical table with the Table Browser gives the following results:

chr2 169479177 169506655 10932 ENST00000295240.7 ENSG00000163093.11 BBS5
chr2 169479479 169525922 41682 ENST00000513963.1 ENSG00000251569.1 AC093899.2
chr2 169509701 169526262 36897 ENST00000284669.1 ENSG00000239474.6 KLHL41
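
The hg38 selection order quoted above can be sketched in a few lines. This is only an illustration of the stated priority (APPRIS principal, then BASIC, then longest), not the actual build code. The dictionary layout, the tag strings "appris_principal" and "basic", and all IDs and coordinates are hypothetical, and breaking ties within a tier by length is an assumption the description does not confirm.

```python
from collections import defaultdict


def pick_canonical(transcripts):
    """Pick one transcript: APPRIS principal, else BASIC, else longest."""
    def length(tx):
        return tx["end"] - tx["start"]

    for tag in ("appris_principal", "basic"):
        tagged = [tx for tx in transcripts if tag in tx["tags"]]
        if tagged:
            # Assumption: within a tier, prefer the longest transcript.
            return max(tagged, key=length)
    return max(transcripts, key=length)


def canonical_per_gene(transcripts):
    """Group by gene ID (not by coordinate overlap), then pick per group."""
    by_gene = defaultdict(list)
    for tx in transcripts:
        by_gene[tx["gene_id"]].append(tx)
    return {gene_id: pick_canonical(members)["id"]
            for gene_id, members in by_gene.items()}


# Hypothetical data: gene_b lies entirely inside gene_a's span.
toy = [
    {"id": "tx_a1", "gene_id": "gene_a", "start": 100, "end": 5100, "tags": {"basic"}},
    {"id": "tx_a2", "gene_id": "gene_a", "start": 100, "end": 3100,
     "tags": {"basic", "appris_principal"}},
    {"id": "tx_b1", "gene_id": "gene_b", "start": 200, "end": 1200, "tags": set()},
    {"id": "tx_b2", "gene_id": "gene_b", "start": 300, "end": 800, "tags": {"basic"}},
    {"id": "tx_c1", "gene_id": "gene_c", "start": 10000, "end": 10200, "tags": set()},
    {"id": "tx_c2", "gene_id": "gene_c", "start": 10000, "end": 10900, "tags": set()},
]

result = canonical_per_gene(toy)
print(result)
```

Because the grouping key is the gene ID, a gene nested inside another still gets its own canonical transcript.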

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute

Luis Nassar

Mar 7, 2019, 6:10:53 PM

to VIJAY LAKHUJANI, gen...@soe.ucsc.edu

Hello Vijay,

To get output from the knownGene table like the one you described, use the "selected fields from primary and related tables" option in the Table Browser:

Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
assembly: hg19
track: UCSC Genes
table: knownGene
output format: selected fields from primary and related tables
get output
select fields from hg19.knownGene: chrom, txStart, txEnd
select fields from hg19.kgXref: kgID, geneSymbol
get output

Your output file should look like this (first entries from chr1):

#hg19.knownGene.chrom    hg19.knownGene.txStart    hg19.knownGene.txEnd    hg19.kgXref.kgID    hg19.kgXref.geneSymbol
chr1    11873    14409    uc001aaa.3    DDX11L1
chr1    11873    14409    uc010nxr.1    DDX11L1
chr1    11873    14409    uc010nxq.1    DDX11L1
chr1    14361    16765    uc009vis.3    WASH7P
chr1    14361    19759    uc009vit.3    WASH7P
chr1    14361    19759    uc009viu.3    WASH7P
chr1    14361    19759    uc001aae.4    WASH7P
chr1    14361    29370    uc001aah.4    WASH7P
chr1    14361    29370    uc009vir.3    WASH7P

If you would like to get this output with knownCanonical (like your example) you can follow the steps above with the following changes:

table: knownCanonical
...
select fields from hg19.knownCanonical: chrom, chromStart, chromEnd
select fields from hg19.kgXref: kgID, geneSymbol

Your output file should look like this (first 10 entries from chr1):

#hg19.knownCanonical.chrom    hg19.knownCanonical.chromStart    hg19.knownCanonical.chromEnd    hg19.kgXref.kgID    hg19.kgXref.geneSymbol
chr1    11873    14409    uc010nxq.1    DDX11L1
chr1    14361    19759    uc009viu.3    WASH7P
chr1    14406    29370    uc009viw.2    WASH7P
chr1    34610    36081    uc001aak.3    FAM138F
chr1    69090    70008    uc001aal.1    OR4F5
chr1    134772    140566    uc021oeg.2    LOC729737
chr1    321083    321115    uc001aaq.2    DQ597235
chr1    321145    321207    uc001aar.2    DQ599768
chr1    322036    326938    uc009vjk.2    LOC100133331
chr1    327545    328439    uc021oei.1    LOC388312
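
Once both exports are saved, a short script can report which gene symbols appear in the knownGene output but not in the knownCanonical output. The sketch below assumes tab-separated text with the "#"-prefixed header line shown above; the function names are made up for this example, the inline strings stand in for the saved files, and the GENE_X row is invented purely to show a missing symbol.

```python
import csv
import io


def read_export(text):
    """Parse tab-separated Table Browser output into a list of dicts.

    The leading '#' is dropped from the header and the 'db.table.' prefix
    is stripped, so 'hg19.kgXref.geneSymbol' becomes 'geneSymbol'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [name.split(".")[-1] for name in lines[0].lstrip("#").split("\t")]
    reader = csv.reader(io.StringIO("\n".join(lines[1:])), delimiter="\t")
    return [dict(zip(header, row)) for row in reader]


def missing_symbols(full_rows, canonical_rows):
    """Return gene symbols present in full_rows but absent from canonical_rows."""
    full = {row["geneSymbol"] for row in full_rows}
    canonical = {row["geneSymbol"] for row in canonical_rows}
    return sorted(full - canonical)


known_gene_text = "\n".join([
    "#hg19.knownGene.chrom\thg19.knownGene.txStart\thg19.knownGene.txEnd"
    "\thg19.kgXref.kgID\thg19.kgXref.geneSymbol",
    "chr1\t11873\t14409\tuc001aaa.3\tDDX11L1",
    "chr1\t14361\t16765\tuc009vis.3\tWASH7P",
    "chrN\t100\t200\ttx_hypothetical\tGENE_X",  # invented row for illustration
])

known_canonical_text = "\n".join([
    "#hg19.knownCanonical.chrom\thg19.knownCanonical.chromStart"
    "\thg19.knownCanonical.chromEnd\thg19.kgXref.kgID\thg19.kgXref.geneSymbol",
    "chr1\t11873\t14409\tuc010nxq.1\tDDX11L1",
    "chr1\t14361\t19759\tuc009viu.3\tWASH7P",
])

full_rows = read_export(known_gene_text)
canonical_rows = read_export(known_canonical_text)
print(missing_symbols(full_rows, canonical_rows))
```

With real exports, the inline strings would be replaced by the contents of the two downloaded files.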

It may also be worth mentioning that our data tables (such as the example output) use 0-start, half-open coordinates. If this is relevant to you, we have a blog post on the topic: http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/
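
As a small illustration of the difference, the sketch below converts a 0-start, half-open table interval into the 1-start, fully-closed form used for positions in the browser. The helper names are made up for this example; the interval is the first row of the example output.

```python
def to_one_based(chrom, start, end):
    """Convert a 0-start, half-open interval to a 1-start, fully-closed position."""
    return f"{chrom}:{start + 1}-{end}"


def interval_length(start, end):
    """Length in bases of a 0-start, half-open interval."""
    return end - start


print(to_one_based("chr1", 11873, 14409))  # chr1:11874-14409
print(interval_length(11873, 14409))       # 2536
```

Only the start changes between the two systems; the end value is the same number in both.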

Lou Nassar
UCSC Genomics Institute

On Tue, Mar 5, 2019 at 11:06 PM VIJAY LAKHUJANI <vijay.l...@supratechlabs.com> wrote:

Dear Luis,

Thank you very much for your reply.

How can I get a file from the knownGene table that looks like this?

chrom transcript_start transcript_stop transcript_id gene symbol
chr1 11873 14409 uc010nxq.1 DDX11L1
chr1 14361 19759 uc009viu.3 WASH7P
chr1 14406 29370 uc009viw.2 WASH7P
chr1 34610 36081 uc001aak.3 FAM138F
chr1 69090 70008 uc001aal.1 OR4F5
chr1 134772 140566 uc021oeg.2 LOC729737
chr1 321083 321115 uc001aaq.2 DQ597235
chr1 321145 321207 uc001aar.2 DQ599768
chr1 322036 326938 uc009vjk.2 LOC100133331

Thanks and Regards
Vijay Lakhujani
Senior Bioinformatician
Neuberg Center for Genomic Medicine,
Neuberg Supratech Reference Laboratories, Ahmedabad
vijay.l...@supratechlabs.com
