genes missing in canonical transcripts data

Genetics Savvy

Nov 25, 2015, 1:04:55 PM

to gen...@soe.ucsc.edu

Hello,

I downloaded the canonical transcripts data for GRCh38 from the UCSC Genome Browser (I wanted the gene symbols/names and the Ensembl ID of each canonical transcript).

The steps I followed were:

Assembly: GRCh38
Track: GENCODE v22
Group: Genes and Gene Predictions
Table: knownCanonical
Output format: selected fields from primary and secondary tables
Clicked on "get output"
Select fields from hg38.knownCanonical: checked "transcript"
Linked tables: "knownToEnsembl" and "kgXref"
Clicked the "allow selection from selected tables" button
Under hg38.kgXref, checked only "geneSymbol"
Under hg38.knownToEnsembl, checked "name" and "value"

This gave me an output of 49534 entries with the headers: #hg38.knownCanonical.transcript hg38.kgXref.geneSymbol hg38.knownToEnsembl.name hg38.knownToEnsembl.value

I found that the following genes, among others, were not present in the output file (this is not the full list):

hsa-mir-6724
CBX3P4
CHMP1AP1
CLOCK
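A check like this can be scripted rather than done by eye. The following is a minimal sketch, assuming the Table Browser output is tab-separated with the header line shown above; the function names and the sample rows (tx_A, GENE1, ENST_A and so on) are made up for illustration and are not real UCSC data.

```python
import csv

def read_table(text):
    """Parse tab-separated Table Browser output into a list of dict rows."""
    lines = text.strip().splitlines()
    # The header line starts with '#', which is dropped before splitting.
    header = lines[0].lstrip("#").split("\t")
    reader = csv.reader(lines[1:], delimiter="\t")
    return [dict(zip(header, row)) for row in reader]

def missing_symbols(rows, wanted, column="hg38.kgXref.geneSymbol"):
    """Return the symbols in `wanted` that never appear in `column`."""
    present = {row[column] for row in rows}
    return [gene for gene in wanted if gene not in present]

# Made-up sample rows standing in for the downloaded file.
SAMPLE = (
    "#hg38.knownCanonical.transcript\thg38.kgXref.geneSymbol\t"
    "hg38.knownToEnsembl.name\thg38.knownToEnsembl.value\n"
    "tx_A\tGENE1\ttx_A\tENST_A\n"
    "tx_B\tGENE2\ttx_B\tENST_B\n"
)

rows = read_table(SAMPLE)
result = missing_symbols(rows, ["GENE1", "CLOCK"])
print(result)
```

With a real download, the text of the saved file would replace SAMPLE, and the list of gene symbols of interest would replace the two-item query list.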

I also wanted to compare this output to hg19, so I downloaded the canonical transcripts (gene names and Ensembl IDs) the same way, the only changes being:

Assembly: hg19
Track: UCSC genes

This gave me an output of 26811 entries with the headers: #hg19.knownCanonical.transcript hg19.kgXref.geneSymbol hg19.knownToEnsembl.name hg19.knownToEnsembl.value

The CLOCK gene was present in the output, while CHMP1AP1, CBX3P4, hsa-mir-6724 were not present in this data.
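For the comparison between the two assemblies, a set difference over the gene symbol columns lists every symbol found in one download but not the other. This is only a sketch: the helper name is hypothetical and the symbol lists are made-up placeholders, not the contents of the actual hg19 or GRCh38 files.

```python
def symbols_only_in_first(first, second):
    """Return the sorted gene symbols found in `first` but not in `second`."""
    return sorted(set(first) - set(second))

# Made-up placeholder lists; in practice these would be the geneSymbol
# columns of the hg19 and hg38 Table Browser outputs.
hg19_symbols = ["CLOCK", "GENE_X", "GENE_Y"]
hg38_symbols = ["GENE_X", "GENE_Y", "GENE_Z"]

only_hg19 = symbols_only_in_first(hg19_symbols, hg38_symbols)
only_hg38 = symbols_only_in_first(hg38_symbols, hg19_symbols)
print(only_hg19, only_hg38)
```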

I would like to know why some genes that are present in the hg19 data (obtained as described above) are missing from the GRCh38 data (obtained as described above).

Also, many genes are missing even from the hg19 data I downloaded. Could you tell me why that is?

Thanks in advance,

Genetics Savvy

Brian Lee

Nov 30, 2015, 5:55:36 PM

to Genetics Savvy, gen...@soe.ucsc.edu

Dear Genetics Savvy,

Thank you for using the UCSC Genome Browser. Please see this previously answered mailing list question: https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/_6asF5KciPc/UPFDONvZBgAJ

The process for building the knownCanonical table changed between hg19 and hg38, which likely explains the difference you are observing. If you go to the track description pages for these two tracks on their respective assemblies, you will find these paragraphs:

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene

knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
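To make the hg38 priority order concrete, here is a minimal sketch of the selection rule as the description above words it: APPRIS principal first, then BASIC, then the longest isoform. It is not the code UCSC uses to build the table. The Transcript class, its field names, and the sample transcripts are hypothetical, and breaking ties within a tier by length is an assumption that the track description does not state.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    transcript_id: str
    length: int
    appris_principal: bool = False  # hypothetical flag for the APPRIS principal tag
    basic: bool = False             # hypothetical flag for BASIC set membership

def pick_canonical(transcripts):
    """Pick one transcript for a cluster: APPRIS principal, else BASIC, else longest."""
    for flag in ("appris_principal", "basic"):
        tier = [t for t in transcripts if getattr(t, flag)]
        if tier:
            # Assumption: ties within a tier are broken by length;
            # the track description does not say how ties are resolved.
            return max(tier, key=lambda t: t.length)
    # No tagged transcript in the cluster: fall back to the longest isoform,
    # which is also roughly what the hg19 description says.
    return max(transcripts, key=lambda t: t.length)

# Made-up clusters for illustration.
tagged = [
    Transcript("tx1", 5000),
    Transcript("tx2", 3000, basic=True),
    Transcript("tx3", 2000, appris_principal=True),
]
untagged = [Transcript("tx4", 1200), Transcript("tx5", 4100)]

chosen_tagged = pick_canonical(tagged)
chosen_untagged = pick_canonical(untagged)
print(chosen_tagged.transcript_id, chosen_untagged.transcript_id)
```

In the first made-up cluster the shortest transcript is chosen because it carries the APPRIS flag, whereas a longest-isoform rule alone would pick tx1. That is one way the canonical transcript for the same gene can differ between the two assemblies.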

Besides reviewing track description pages, searching our mailing list archives is one of the best ways to find answers before mailing the list. Note, however, that a process such as how the knownCanonical table is built can occasionally change, so an older answer may no longer reflect current behavior: https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome/knownCanonical%7Csort:date

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute
