--
Dear Laura,
Thank you for using the UCSC Genome Browser and your question about refGene and knownCanonical for hg38.
Ensembl and GENCODE merged in the past and can be considered identical. For hg38, the knownGene and knownCanonical tables, which previously represented "UCSC Genes", are now built from GENCODE and are labeled GENCODE v22 (and are thus representative of Ensembl genes as well). Please read this description page, which also includes a note about how knownCanonical is built: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene
knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
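As a rough illustration of the priority order just described (APPRIS principal, then BASIC, then longest), here is a small Python sketch. It is not the code UCSC uses to build the table. The `pick_canonical` helper, the dictionary keys, and the transcript names are hypothetical, and breaking ties by length within the APPRIS or BASIC sets is an assumption, since the description does not say how such ties are resolved.

```python
# Hypothetical sketch of the selection order described above; this is not
# UCSC's build code. Each transcript is a dict with assumed keys:
#   "id", "tags" (set of GENCODE tag strings), "length" (int, in bases).

def pick_canonical(transcripts):
    """Return one transcript for a gene cluster using the stated priority."""
    if not transcripts:
        raise ValueError("cluster has no transcripts")

    # 1. Prefer transcripts carrying an APPRIS principal tag.
    appris = [t for t in transcripts
              if any(tag.startswith("appris_principal") for tag in t["tags"])]
    if appris:
        candidates = appris
    else:
        # 2. Otherwise fall back to the BASIC set, if any.
        basic = [t for t in transcripts if "basic" in t["tags"]]
        # 3. Otherwise consider every isoform in the cluster.
        candidates = basic or transcripts

    # Assumption: ties are broken by length (not stated in the description).
    return max(candidates, key=lambda t: t["length"])


# Made-up cluster purely for illustration.
cluster = [
    {"id": "tx_a", "tags": {"basic"}, "length": 3000},
    {"id": "tx_b", "tags": {"basic", "appris_principal_1"}, "length": 2500},
    {"id": "tx_c", "tags": set(), "length": 4000},
]
print(pick_canonical(cluster)["id"])  # tx_b
```

In this made-up cluster the shorter tx_b is chosen over the longer tx_c because an APPRIS principal tag outranks length.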
When you open knownCanonical you will see lines like the following:
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/knownCanonical.txt.gz | gzip -d | grep uc004ega.3
chrX 100628669 100636806 1 uc004ega.3 ENSG00000000003.13
The fifth column is a unique identifier for this transcript (uc004ega.3) that also appears in the related knownGene table. It can also be looked up in the "knownGene cross-reference" table, abbreviated kgXref, which is also available for download:
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgXref.txt.gz | gzip -d | grep uc004ega.3
uc004ega.3 NM_003270 O43657 TSN6_HUMAN TSPAN6 NM_003270 NM_003270 Homo sapiens tetraspanin 6 (TSPAN6), transcript variant 1, mRNA. (from RefSeq NM_003270)
In the sixth column of the kgXref file you will see the RefSeq accession (NM_003270). It can be used to find the corresponding entry in refGene.txt, which thereby relates back to knownCanonical:
curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gzip -d | grep NM_003270
1352 NM_003270 chrX - 100627107 100636857 100630797 100636694 8 100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636607, 100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636857, 0 TSPAN6 cmpl cmpl -1,0,0,0,0,0,0,0,
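If you prefer to script the three lookups rather than grep for each ID by hand, the chain knownCanonical -> kgXref -> refGene could be sketched in Python roughly as below. The rows are taken from the example output in this message (the refGene row is truncated to its first six columns for brevity), and the `canonical_refgene_rows` helper is hypothetical. In practice you would read the downloaded tab-separated files instead of the inline lists.

```python
# Example rows from this thread, already split into fields. The real
# .txt.gz downloads are tab-separated, one row per line.
known_canonical = [
    ["chrX", "100628669", "100636806", "1", "uc004ega.3", "ENSG00000000003.13"],
]
kg_xref = [
    ["uc004ega.3", "NM_003270", "O43657", "TSN6_HUMAN", "TSPAN6",
     "NM_003270", "NM_003270"],
]
ref_gene = [
    # Truncated to the first six refGene columns for brevity.
    ["1352", "NM_003270", "chrX", "-", "100627107", "100636857"],
]


def canonical_refgene_rows(known_canonical, kg_xref, ref_gene):
    """Return refGene rows reachable from knownCanonical through kgXref."""
    # knownCanonical column 5 (index 4) is the transcript ID.
    canonical_ids = {row[4] for row in known_canonical}
    # kgXref column 1 (index 0) is the same ID; column 6 (index 5) is the
    # RefSeq accession, which can be empty for some transcripts.
    refseq_ids = {row[5] for row in kg_xref
                  if row[0] in canonical_ids and row[5]}
    # refGene column 2 (index 1) is the RefSeq accession.
    return [row for row in ref_gene if row[1] in refseq_ids]


for row in canonical_refgene_rows(known_canonical, kg_xref, ref_gene):
    print("\t".join(row))
```

Unlike grep, which matches substrings, this sketch compares whole IDs, so NM_003270 would not also match a longer accession that merely contains it.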
You can select these refGene specific rows relating back to knownCanonical from our Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
1. Select hg38, and group "Genes..." and track "GENCODE v22"
2. Change table from "knownGene" to "knownCanonical"
3. Change "output format" to "selected fields from primary and related tables".
4. Click "get output"
5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene".
6. Click the "allow selection from checked tables" button.
7. Below "hg38.refGene fields" you can click "check all" and then "get output".
Now you will have all refGene rows that relate back through knownCanonical, such as the following line:
1352 NM_001278740 chrX - 100627107 100636732 100630797 100635569 8 100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636190, 100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636732, 0 TSPAN6 cmpl cmpl -1,0,0,0,0,0,0,-1,
At any point when using the Table Browser, you can set the "Group:" to "All Tables" then find a table you are interested in, and then click the "describe table schema" link to see descriptions about the rows and some example data.
Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genomics Institute
--
Hi Laura,
Thank you for your message. Please try just the Table Browser steps. Here are some modified steps that will include the ENST information.
In a new browser window navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables
1. Select hg38, then set "group:" to "Genes and Gene Predictions" and "track:" to "GENCODE v22" (this should be the default selection).
2. Change table from "knownGene" to "knownCanonical".
This step is a good opportunity to click the "describe table schema" button to see more about the table data you are requesting.
3. Change "output format" to "selected fields from primary and related tables".
This step allows you to add information from other tables beyond the knownCanonical table.
4. Click "get output".
This screen is where we can request information from other tables. We are going to request information from the hg38 refGene table and the hg38 knownToEnsembl table.
5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene" and the box next to "hg38 knownToEnsembl".
6. Scroll to the very bottom and click the "allow selection from checked tables" button.
Now we can select the fields we want from each of these three tables.
7. Under "Select Fields from hg38.knownCanonical", click the box next to "transcript".
This will be the transcript ID driving all the related table output.
8. Under hg38.knownToEnsembl fields click "check all".
9. Under hg38.refGene fields click "check all".
10. Click "get output".
The results will be rows like the following:
#hg38.knownCanonical.transcript hg38.knownToEnsembl.name hg38.knownToEnsembl.value hg38.refGene.bin hg38.refGene.name hg38.refGene.chrom hg38.refGene.strand hg38.refGene.txStart hg38.refGene.txEnd hg38.refGene.cdsStart hg38.refGene.cdsEnd hg38.refGene.exonCount hg38.refGene.exonStarts hg38.refGene.exonEnds hg38.refGene.score hg38.refGene.name2 hg38.refGene.cdsStartStat hg38.refGene.cdsEndStat hg38.refGene.exonFrames
uc001ggs.5 uc001ggs.5 ENST00000367772.7 29 NM_181093 chr1 - 169853075 169893959 169853712 169888840 14 169853075,169854269,169855795,169859040,169862612,169864368,169866895,169868927,169870254,169873695,169875977,169878633,169888675,169893787, 169853772,169854964,169855957,169859212,169862797,169864508,169866973,169869039,169870357,169873752,169876091,169878819,169888890,169893959, 0 SCYL3 cmpl cmpl 0,1,1,0,1,2,2,1,0,0,0,0,0,-1,
The first field (hg38.knownCanonical.transcript) is the ID from knownCanonical that drives the selection of all the other data. The next two fields make up the entire knownToEnsembl table, which exists to provide the related ENST ID (ENST00000367772.7). The remaining fields are all the fields from the refGene table that correspond to the entries in the knownCanonical table.
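If you save this output to a file and want to work with it in a script, the header line (starting with "#") can be used to name the fields. The following Python sketch is only one possible way to do that. The `parse_table_browser` and `exon_intervals` helpers are hypothetical names, and the sample keeps only six of the nineteen columns from the SCYL3 row above to keep the example short.

```python
# A subset of the columns from the example output above, tab-separated as
# the Table Browser writes them.
HEADER = "#" + "\t".join([
    "hg38.knownCanonical.transcript",
    "hg38.knownToEnsembl.value",
    "hg38.refGene.name",
    "hg38.refGene.exonCount",
    "hg38.refGene.exonStarts",
    "hg38.refGene.exonEnds",
])
ROW = "\t".join([
    "uc001ggs.5",
    "ENST00000367772.7",
    "NM_181093",
    "14",
    "169853075,169854269,169855795,169859040,169862612,169864368,169866895,"
    "169868927,169870254,169873695,169875977,169878633,169888675,169893787,",
    "169853772,169854964,169855957,169859212,169862797,169864508,169866973,"
    "169869039,169870357,169873752,169876091,169878819,169888890,169893959,",
])
sample = HEADER + "\n" + ROW + "\n"


def parse_table_browser(text):
    """Turn header-plus-rows text into a list of dicts keyed by field name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].lstrip("#").split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def exon_intervals(record, prefix="hg38.refGene."):
    """Pair up exonStarts and exonEnds (both end with a trailing comma)."""
    starts = [int(x) for x in record[prefix + "exonStarts"].split(",") if x]
    ends = [int(x) for x in record[prefix + "exonEnds"].split(",") if x]
    return list(zip(starts, ends))


records = parse_table_browser(sample)
first = records[0]
print(first["hg38.refGene.name"], len(exon_intervals(first)))  # NM_181093 14
```

Checking that the number of intervals equals exonCount is a simple way to confirm a row was not truncated when it was copied or saved.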
Please do make these selections independently. Here is a session to compare your steps against to help see the output: http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=Brian%20Lee&hgS_otherUserSessionName=hg38.refGene.canonical
Here is also a link to a video tutorial about using the Table Browser: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=28
All the best,
Brian Lee
UCSC Genomics Institute
--
In step #1, should I change the track to "RefSeq Genes" (from "GENCODE v22") when I need to download the knownCanonical for RefSeq?
Dear Laura,
Thank you for using the UCSC Genome Browser. The process for building the knownCanonical table changed between hg19 and hg38, likely explaining the difference you are observing.
If you go to the track description page for these two tracks on their respective assemblies you will find these paragraphs:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene
knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene
knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
Besides reviewing track description pages, searching our mailing list archives is one of the best ways to find answers to questions before mailing the list. Note, however, that archived answers may occasionally be out of date because processes change, such as how the knownCanonical table is built: https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome/knownCanonical%7Csort:date
All the best,
Brian Lee
UCSC Genomics Institute
#hg38.refGene.name hg38.refGene.name2 hg38.wgEncodeGencodeTagV23.tag
NM_001159746 ABR CCDS,alternative_5_UTR,basic,alternative_5_UTR,basic,cds_start_NF,mRNA_start_NF,
NM_001282149 ABR CCDS,not_organism_supported,basic,not_organism_supported,basic,
NM_021962 ABR CCDS,basic,appris_principal_1,basic,cds_start_NF,mRNA_start_NF,
Dear Laura,
Thank you for using the UCSC Genome Browser. I want to clarify a statement you used earlier, where you referred to our support for your questions as "UCSC's curated RefSeq canonical transcript files". Please note that RefSeq does not produce an official set of "canonical" transcripts, and UCSC does not provide a "knownCanonical" for RefSeq genes.
I want to be clear for our mailing list that in this thread we are attempting to support you in your requests, not outlining a UCSC curated canonical file for RefSeq. Also, along the way we have been attempting to point out this method is imperfect and you may still end up with multiple transcripts associated with a single gene symbol and other unexpected errors, and these issues are ultimately your responsibility to resolve in your search to meet your research needs.
This mailing list is not a source of scientific advice; rather, it is intended to provide support for questions related to the use of the UCSC Genome Browser and utilities. There are forums like BioStar, https://www.biostars.org/, where scientists may be able to provide you with the scientific direction you need, or other agencies devoted to resolving such questions, like APPRIS (Annotating principal splice isoforms).
Given that information, I can assist you in finding the table you are looking for in hg38. You need to use the "allow selection from checked tables" option to keep relating tables until the desired related table appears with fields you can select. Following the steps you provided:
Track: GENCODE v22
Group: Genes and Gene Predictions
Table: knownGene
Region: Genome
Pasted a list of genes using the "identifiers (name/accession)" option.
Output format: selected fields from primary and related tables
Scroll down to "Linked Tables", click the box next to hg38 refGene.
Click "allow selection from checked tables" at the bottom, a new list of tables will appear.
Scroll down to "Linked Tables", click the box next to hg38 wgEncodeGencodeRefSeqV23.
Click "allow selection from checked tables" at the bottom, a new list of tables will appear.
Scroll down to "Linked Tables", click the box next to hg38 wgEncodeGencodeTagV23.
Click "allow selection from checked tables" at the bottom, a new list of tables will appear.
Go to the top and select the choices that meet your research interests, likely checking the box next to "tag" in "hg38.wgEncodeGencodeTagV23 fields".
It should be noted that this is just data from an external site that has been loaded into the browser. For example, for your previous question about ABR, if you go to APPRIS, http://appris.bioinfo.cnio.es/, and search for ABR, you would find a page like the following, http://appris.bioinfo.cnio.es/#/database/id/homo_sapiens/ENSG00000159842?db=hg38, which shows how ENST00000302538 is annotated by them as PRINCIPAL:1. When you search the wgEncodeGencodeTagV23 table for an entry like "appris_principal_1", you are ultimately referring back to APPRIS. You can review a paper about APPRIS to see how annotating principal splice isoforms is a challenging scientific topic, beyond the general scope of our mailing list: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531113/
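Once you have output with a tag column, such as the three ABR rows shown earlier in this thread, filtering for an "appris_principal" tag can be scripted. The Python sketch below is only an illustration of such a filter, not a recommended way to define a canonical RefSeq transcript. The `appris_principal_by_gene` helper is a hypothetical name, and a gene may still come back with zero or several transcripts.

```python
# The three ABR rows from the Table Browser output shown earlier in this
# thread: (refGene.name, refGene.name2, wgEncodeGencodeTagV23.tag).
rows = [
    ("NM_001159746", "ABR",
     "CCDS,alternative_5_UTR,basic,alternative_5_UTR,basic,"
     "cds_start_NF,mRNA_start_NF,"),
    ("NM_001282149", "ABR",
     "CCDS,not_organism_supported,basic,not_organism_supported,basic,"),
    ("NM_021962", "ABR",
     "CCDS,basic,appris_principal_1,basic,cds_start_NF,mRNA_start_NF,"),
]


def appris_principal_by_gene(rows):
    """Map each gene symbol to the accessions tagged appris_principal*."""
    result = {}
    for name, gene, tag_field in rows:
        # The tag field is comma-separated with a trailing comma and may
        # repeat tags, so split it, drop empties, and de-duplicate.
        tags = {t for t in tag_field.split(",") if t}
        if any(t.startswith("appris_principal") for t in tags):
            result.setdefault(gene, []).append(name)
    return result


print(appris_principal_by_gene(rows))  # {'ABR': ['NM_021962']}
```

Genes missing from the returned dictionary have no transcript with an APPRIS principal tag in the input, and any gene listed with more than one accession would still need a manual decision.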
All the best,
Brian Lee
UCSC Genomics Institute
Hi Matthew,
Thank you very much for all your help and for sending me the instructions to download the GRCh38 refseq canonical transcripts. I followed your instructions and got the appris tags for refseq transcripts.
I am also very interested in downloading the ENSEMBL canonical transcripts from the UCSC Genome Browser. I had downloaded them based on the instructions sent by Brian earlier. Some of the ENSEMBL genes also have multiple canonical transcripts for a given gene. I tried downloading the Ensembl APPRIS tag data from UCSC the same way you suggested for RefSeq, but I was not able to do so.
I made the following selections in the UCSC Table Browser:
track: all Gencode v22
Group: genes and gene prediction
table: Comprehensive
region: genome
Then I tried to paste the list of the 1345 genes in the box that opens on clicking "paste list" next to "identifiers (name/accession)". It didn't accept the gene names; it requires the Ensembl IDs of the transcripts.
Then I tried this:
Track: GENCODE v22
Group: Genes and Gene Prediction
Table: knownGene
Region: Genome
Pasted the list of genes next to "identifiers (name/accession)". This time it accepted the gene names.
Output format: selected fields from primary and related tables
Clicked on "get output". In the linked tables, I couldn't find the "wgEncodeGencodeTagV23" option, which actually gives the APPRIS tags.
I even tried setting the "Table" to "knownCanonical" but got the same problem in that case.
Would you please let me know what I may be doing wrong? What is your suggestion to choose a canonical ensembl transcript for a given gene if there are multiple ones? Thank you so much for all your help.
Best,
Laura
ps: There were around ~49k Ensembl canonical transcripts when I downloaded them from the UCSC Genome Browser. I noticed that there were duplicate "gene names" but unique "transcript ids".
I have one more question and would very much appreciate your response. I downloaded the RefSeq canonical transcripts from the UCSC Genome Browser. I noticed that gene ABR has three transcripts ["NM_001159746.2", "NM_001282149.1", "NM_021962.4"] having 18, 21, 21 exons respectively. I would like to choose only one of them. Would you suggest choosing the longest one?
thanks,
Laura