Refseq and Ensembl Canonical transcripts


Laura Smith

Oct 26, 2015, 7:05:56 PM
to gen...@soe.ucsc.edu
Hi, 

Could you please let me know what is the best way to download the Refseq and ENSEMBL canonical transcripts from UCSC genome browser for HG38? 

thanks,
Laura 

Amonida Zadissa

Oct 27, 2015, 12:06:48 PM
to Laura Smith, gen...@soe.ucsc.edu, amo...@ebi.ac.uk
Hi Laura,

Please refer to the answer you received from my colleague, Denise, on
Ensembl help-desk about retrieving the Ensembl transcripts.

Best regards,
Amonida

--
Amonida Zadissa
Ensembl Production Team
EMBL-EBI
Hinxton
England


Luvina Guruvadoo

Oct 29, 2015, 1:43:29 PM
to Laura Smith, gen...@soe.ucsc.edu
Hello Laura,

Thank you for your question. The RefSeq and GENCODE (Ensembl) data for hg38 can be downloaded from our server here:
http://hgdownload.soe.ucsc.edu/downloads.html. Click on "Human", then "Annotation database". Scroll down and you will find the knownCanonical.txt.gz and refGene.txt.gz files.

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group


--


Laura Smith

Oct 30, 2015, 12:21:10 PM
to Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Luvina, 

Thank you for your reply. 

Could you please clarify if the following is true based on your previous email? 

I downloaded the "knownCanonical.txt.gz" file. It looks like this is the canonical transcript list for ENSEMBL transcripts. I only see "ENSEMBL ids" in this file. 



I also downloaded the "refGene.txt.gz" file. This file seems to be the whole of RefSeq, not the "canonical transcripts" for RefSeq. Could you please point me to the canonical transcripts file that has RefSeq IDs in it?

Thank you very much for your help,
Laura 

Brian Lee

Oct 30, 2015, 6:15:54 PM
to Laura Smith, Luvina Guruvadoo, gen...@soe.ucsc.edu

Dear Laura,

Thank you for using the UCSC Genome Browser and your question about refGene and knownCanonical for hg38.

Ensembl and GENCODE merged in the past and can be considered identical. For hg38, the knownGene and knownCanonical tables, which previously held "UCSC Genes", are now built from GENCODE and are labeled GENCODE v22 (and are therefore representative of Ensembl genes as well). Please read this description page, which also includes a note about how knownCanonical is built: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
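
As a rough illustration of the tiered rule described above, the selection for one gene cluster could be sketched in Python as follows. The `Transcript` class, the `pick_canonical` function, and the toy transcript IDs are hypothetical and are not part of any UCSC tool. The rule does not say which transcript is taken when several fall in the same tier, so breaking ties by length is an assumption made only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    transcript_id: str
    gene_id: str
    length: int  # assumed here to be the spliced length in bases
    tags: set = field(default_factory=set)


def pick_canonical(transcripts):
    """Pick one transcript for a single gene cluster using the tiered rule."""
    if not transcripts:
        raise ValueError("empty cluster")
    # Tier 1: any transcript carrying an APPRIS principal tag.
    appris = [t for t in transcripts
              if any(tag.startswith("appris_principal") for tag in t.tags)]
    # Tier 2: transcripts in the BASIC set.
    basic = [t for t in transcripts if "basic" in t.tags]
    # Tier 3: everything, so the longest isoform wins.
    for tier in (appris, basic, transcripts):
        if tier:
            # Tie-break by length inside a tier: an assumption, not documented.
            return max(tier, key=lambda t: t.length)


# Toy cluster with made-up IDs and lengths.
cluster = [
    Transcript("tx_long", "gene1", 5000),
    Transcript("tx_basic", "gene1", 3000, {"basic"}),
    Transcript("tx_appris", "gene1", 2000, {"basic", "appris_principal_1"}),
]
chosen = pick_canonical(cluster)
print(chosen.transcript_id)
```

Note that under this rule the chosen transcript is not necessarily the longest one, which is one reason hg38 results can differ from hg19.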

When you open knownCanonical you will see lines like the following:

curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/knownCanonical.txt.gz | gzip -d | grep uc004ega.3
chrX    100628669    100636806    1    uc004ega.3    ENSG00000000003.13

The fifth column is a unique identifier for this transcript (uc004ega.3) that also appears in the related knownGene table. It can also be looked up in the knownGene cross-reference table, abbreviated kgXref, which is also available for download:

curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/kgXref.txt.gz | gzip -d | grep uc004ega.3
uc004ega.3    NM_003270    O43657    TSN6_HUMAN    TSPAN6    NM_003270    NM_003270    Homo sapiens tetraspanin 6 (TSPAN6), transcript variant 1, mRNA. (from RefSeq NM_003270)

In the sixth column of the kgXref file you will see the RefSeq accession (NM_003270). It can be used to find the corresponding entry in refGene.txt, which is thereby related back to knownCanonical:

curl -s http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gzip -d | grep NM_003270
1352    NM_003270    chrX    -    100627107    100636857    100630797    100636694    8    100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636607,    100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636857,    0    TSPAN6    cmpl    cmpl    -1,0,0,0,0,0,0,0,
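
The three lookups above (knownCanonical to kgXref to refGene) can also be chained offline once the files are downloaded. The following is a minimal Python sketch that uses the example rows from this message as in-memory text. The column positions are read off those rows, and the helper name `rows` is made up for illustration.

```python
# Example rows from this thread, tab-separated as in the downloaded files.
KNOWN_CANONICAL = "chrX\t100628669\t100636806\t1\tuc004ega.3\tENSG00000000003.13\n"
KG_XREF = "\t".join([
    "uc004ega.3", "NM_003270", "O43657", "TSN6_HUMAN", "TSPAN6",
    "NM_003270", "NM_003270",
    "Homo sapiens tetraspanin 6 (TSPAN6), transcript variant 1, mRNA. "
    "(from RefSeq NM_003270)",
]) + "\n"
REF_GENE = "\t".join([
    "1352", "NM_003270", "chrX", "-", "100627107", "100636857",
    "100630797", "100636694", "8",
    "100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636607,",
    "100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636857,",
    "0", "TSPAN6", "cmpl", "cmpl", "-1,0,0,0,0,0,0,0,",
]) + "\n"


def rows(text):
    """Split tab-separated text into lists of fields, skipping empty lines."""
    return [line.split("\t") for line in text.splitlines() if line]


# kgXref: column 1 is the UCSC ID, column 6 is the RefSeq accession.
kg_to_refseq = {r[0]: r[5] for r in rows(KG_XREF) if r[5]}

# refGene: column 2 is the accession, column 13 is the gene symbol.
refgene_by_name = {}
for r in rows(REF_GENE):
    refgene_by_name.setdefault(r[1], []).append(r)

# knownCanonical: column 5 is the UCSC ID, column 6 is the Ensembl gene ID.
canonical_refseq = []
for r in rows(KNOWN_CANONICAL):
    refseq = kg_to_refseq.get(r[4])
    for ref_row in refgene_by_name.get(refseq, []):
        canonical_refseq.append((r[4], r[5], refseq, ref_row[12]))

print(canonical_refseq)
```

For the real downloads, the same loops could read the compressed files line by line, for example with `gzip.open(path, "rt")`. Canonical transcripts with no RefSeq accession in kgXref simply drop out of the result.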

You can select these refGene specific rows relating back to knownCanonical from our Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables

1. Select hg38, and group "Genes..." and track "GENCODE v22" 
2. Change table from "knownGene" to "knownCanonical" 
3. Change "output format" to "selected fields from primary and related tables".
4. Click "get output" 
5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene". 
6. Click the "allow selection from checked tables" 
7. Below "hg38.refGene fields" you can click "check all" and then "get output".

Now you will have all refGene rows that relate back through knownCanonical, such as the following TSPAN6 row:

1352    NM_001278740    chrX    -    100627107    100636732    100630797    100635569    8    100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636190,    100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636732,    0    TSPAN6    cmpl    cmpl    -1,0,0,0,0,0,0,-1,

At any point when using the Table Browser, you can set the "Group:" to "All Tables" then find a table you are interested in, and then click the "describe table schema" link to see descriptions about the rows and some example data.

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute


--


Laura Smith

Nov 2, 2015, 7:35:29 PM
to Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Dear Brian, 

Thank you very much for your detailed message. I have downloaded the canonical transcripts as you described  in your last email. Then, I noticed that the transcript IDs for ENSEMBL genes didn't come with the selections I made (for example ENST000004111184). I only see ENSG.. ids. Could you please let me know which field from which table I need to select to obtain the ENST ids?

Thank you very much for your help, very much appreciated! 

Laura   

Brian Lee

Nov 2, 2015, 8:11:13 PM
to Laura Smith, Luvina Guruvadoo, gen...@soe.ucsc.edu

Hi Laura,

Thank you for your message, please try just the Table Browser steps. Here are some modified steps that will include the ENST information.

In a new browser window navigate to the Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables

1. Select hg38, then set "group:" to "Genes and Gene Predictions" and "track:" to "GENCODE v22" (this should be the default selection).
2. Change table from "knownGene" to "knownCanonical". 
This step is a good opportunity to click the "describe table schema" button to see more about the table data you are requesting.


3. Change "output format" to "selected fields from primary and related tables".

This step allows you to add information from other tables beyond the knownCanonical table.
4. Click "get output".
This screen is where we can add requests for information from other tables. We are going to request information from the hg38 refGene table and the hg38 knownToEnsembl table.
5. Scroll down to the "Linked Tables" section and click the box next to "hg38 refGene" and the box next to "hg38 knownToEnsembl".
6. Scroll to the very bottom and click the "allow selection from checked tables". 
Now we can select the fields we want from each of these three tables.
7. Under the "Select Fields from hg38.knownCanonical" click the box next to transcript.
This will be the transcript location driving all the related table output.
8. Under hg38.knownToEnsembl fields click "check all". 
9. Under hg38.refGene fields click "check all". 
10. Click "get output".

The results will be rows like the following: 

#hg38.knownCanonical.transcript    hg38.knownToEnsembl.name    hg38.knownToEnsembl.value    hg38.refGene.bin    hg38.refGene.name    hg38.refGene.chrom    hg38.refGene.strand    hg38.refGene.txStart    hg38.refGene.txEnd    hg38.refGene.cdsStart    hg38.refGene.cdsEnd    hg38.refGene.exonCount    hg38.refGene.exonStarts    hg38.refGene.exonEnds    hg38.refGene.score    hg38.refGene.name2    hg38.refGene.cdsStartStat    hg38.refGene.cdsEndStat    hg38.refGene.exonFrames

uc001ggs.5    uc001ggs.5    ENST00000367772.7    29    NM_181093    chr1    -    169853075    169893959    169853712    169888840    14    169853075,169854269,169855795,169859040,169862612,169864368,169866895,169868927,169870254,169873695,169875977,169878633,169888675,169893787,    169853772,169854964,169855957,169859212,169862797,169864508,169866973,169869039,169870357,169873752,169876091,169878819,169888890,169893959,    0    SCYL3    cmpl    cmpl    0,1,1,0,1,2,2,1,0,0,0,0,0,-1,

The first field (hg38.knownCanonical.transcript) is the ID from knownCanonical that drives the selection of all the other data. The next two fields are the entire knownToEnsembl table, which provides the related ENST ID (ENST00000367772.7). The remaining fields are all the fields from the refGene table that correspond to the entries in the knownCanonical table.
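
Once the Table Browser output is saved, the ENST and RefSeq IDs can be pulled out by column name rather than by position. The following is a minimal Python sketch that assumes the output is tab-separated with a header line starting with "#", as in the example above. The function name `read_table_browser` is made up for illustration.

```python
import csv
import io

REFGENE_FIELDS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]
HEADER = [
    "#hg38.knownCanonical.transcript",
    "hg38.knownToEnsembl.name",
    "hg38.knownToEnsembl.value",
] + ["hg38.refGene." + f for f in REFGENE_FIELDS]
# The SCYL3 example row from this message.
ROW = [
    "uc001ggs.5", "uc001ggs.5", "ENST00000367772.7", "29", "NM_181093",
    "chr1", "-", "169853075", "169893959", "169853712", "169888840", "14",
    "169853075,169854269,169855795,169859040,169862612,169864368,169866895,"
    "169868927,169870254,169873695,169875977,169878633,169888675,169893787,",
    "169853772,169854964,169855957,169859212,169862797,169864508,169866973,"
    "169869039,169870357,169873752,169876091,169878819,169888890,169893959,",
    "0", "SCYL3", "cmpl", "cmpl", "0,1,1,0,1,2,2,1,0,0,0,0,0,-1,",
]
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def read_table_browser(text):
    """Parse Table Browser output into a list of dicts keyed by column name."""
    handle = io.StringIO(text)
    # The header line starts with "#"; strip it so keys match field names.
    header = handle.readline().rstrip("\n").lstrip("#").split("\t")
    return list(csv.DictReader(handle, fieldnames=header, delimiter="\t"))


records = read_table_browser(SAMPLE)
triples = [
    (r["hg38.knownToEnsembl.value"], r["hg38.refGene.name"], r["hg38.refGene.name2"])
    for r in records
]
print(triples)
```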

Please do make these selections independently. Here is a session to compare your steps against to help see the output: http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=Brian%20Lee&hgS_otherUserSessionName=hg38.refGene.canonical

Here is also a link to a video tutorial about using the Table Browser: http://www.openhelix.com/cgi/tutorialInfo.cgi?id=28

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute

Laura Smith

Nov 9, 2015, 12:11:28 PM
to Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Dear Brian,

Thank you very much for your kind help! :) I very much appreciate it. I was able to download the hg38 ensembl canonical transcripts  and I have a question regarding the number of canonical transcripts again if you don't mind: 

When I downloaded the Ensembl canonical transcripts from the Ensembl website last year for hg19, I got ~63k transcripts in total. Then, when I downloaded the hg38 Ensembl transcripts from the UCSC Genome Browser this year as you described in your email, the number of UCSC Ensembl transcripts was ~49k. Could this be because the UCSC Genome Browser definition of canonical transcripts is different from Ensembl's definition?

Or is it possible that there are fewer Ensembl canonical transcripts in hg38 than there were in hg19? If you could please provide your thoughts on this, I would very much appreciate it. Thank you so much for all your help and quick feedback!

Best,
Laura 

Laura Smith

Nov 10, 2015, 4:11:15 PM
to Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu

Laura Smith

Nov 10, 2015, 5:27:58 PM
to Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Brian, 

Thank you very much for providing the detailed steps below for downloading the ensembl canonical transcripts earlier. I have a quick question. 

When I want to download the hg38 refseq canonical transcripts, should I change the track name to "RefSeq genes" track instead of "GENCODE v22" in step #1? 

I want to make sure I download the refseq canonical transcripts the correct way, this is why I am asking. 

Thank you very much,
Laura 
 




Matthew Speir

Nov 12, 2015, 1:54:33 PM
to Laura Smith, Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Laura,

Thank you for your questions. Can you expand upon what you mean by "canonical" transcripts? To my knowledge, RefSeq does not provide a "canonical" set in the same way that we provide the "knownCanonical" table for something like UCSC Genes on hg19 or GENCODE Genes V22 on hg38. Note that the "knownCanonical" sets for hg19 and hg38 have different requirements for inclusion. See the discussion of the "knownCanonical" table on the track description pages for more information.
If you can provide us with some of the things you are looking for from a "canonical" RefSeq transcript set, maybe we can provide some tips on how to get that using data from the UCSC Genome Browser. You may also want to consider contacting NCBI directly to see if they can provide any guidance on obtaining a canonical transcript set from the transcript set they provide. Contact information for RefSeq/NCBI can be found here: https://www.ncbi.nlm.nih.gov/home/about/contact.shtml.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

--


Laura Smith

Nov 12, 2015, 7:31:59 PM
to Matthew Speir, Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Thank you very much Matthew for your email and detailed explanation. I really appreciate it.  Let me try to better explain what I need.  What I am interested in is only hg38 version of transcripts.  I am not interested in hg19 version.
 
What I am trying to do is to get the "knownCanonical" transcripts for "ENSEMBL genes" and for "REFSEQ genes" based on what the UCSC Genome Browser says is "knownCanonical". I have downloaded the ENSEMBL transcripts using the steps Brian provided below, choosing the GENCODE track in step #1 since "GENCODE" stands for ENSEMBL.


My question is for "REFSEQ genes" only. Please see Brian's email below, where he gave me a list of steps. In step #1, should I change the track to "refseq genes" (from "GENCODE v22") when I need to download the knownCanonical for RefSeq? All I need in the end is a list of RefSeq transcript IDs (one transcript per gene) that are "knownCanonical" per the UCSC definition.

I hope it is clear. Thank you for all your help! 

Laura

Matthew Speir

Nov 13, 2015, 2:16:30 PM
to Laura Smith, Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Laura,

RefSeq does not produce an official set of "canonical" transcripts, so there is no "knownCanonical" for RefSeq genes. The "knownCanonical" table for Gencode Genes V22 attempts to provide a transcript set where there is only a single transcript for each gene. The instructions provided by my colleague Brian attempt to provide you with this single transcript coverage for RefSeq genes using the knownCanonical table. This method is imperfect and you may still end up with multiple transcripts associated with a single gene symbol.
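
To see where this happens in your own output, one option is to group the RefSeq accessions by gene symbol and list the symbols with more than one accession. The following is a minimal Python sketch. The function name `multi_transcript_symbols` is made up, and the input pairs are the accession and symbol values from the example rows earlier in this thread.

```python
from collections import defaultdict

# (RefSeq accession, gene symbol) pairs taken from example rows in this thread.
pairs = [
    ("NM_003270", "TSPAN6"),
    ("NM_001278740", "TSPAN6"),
    ("NM_181093", "SCYL3"),
]


def multi_transcript_symbols(pairs):
    """Return {symbol: sorted accessions} for symbols with several accessions."""
    by_symbol = defaultdict(set)
    for accession, symbol in pairs:
        by_symbol[symbol].add(accession)
    return {s: sorted(a) for s, a in by_symbol.items() if len(a) > 1}


ambiguous = multi_transcript_symbols(pairs)
print(ambiguous)
```

How to resolve each such case is a research decision that this check does not make for you.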


> In step#1, should i change the track to "refseq genes" (from "GENCODE v22") when I need to download the knownCanonical for refseq?

No, the selected track should still be GENCODE Genes V22 with the knownCanonical table. Follow the directions as laid out by Brian in his email.

As I noted in my last email, you may want to contact NCBI/RefSeq to see if they have any recommendations for obtaining a "canonical" transcript set for hg38/GRCh38 from the RefSeq site.


I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Laura Smith

Nov 20, 2015, 1:53:36 PM
to Matthew Speir, Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Matthew, 

Thank you very much for your detailed email. My goal was to get the canonical transcripts as defined by UCSC, as you described. I contacted the RefSeq folks for a list of canonical transcripts in the past, and they informed me that they don't have such a list. So I am getting a list of canonical transcripts from UCSC and using the RefSeq IDs of those transcripts.

I will use the GENCODE v22 track as you advised. Thank you so much for your help with this. 


I have one more quick question if you don't mind. When I downloaded the UCSC canonical transcripts this year for GRCh38 following the instructions by Brian, I noticed that "GNAS" gene was missing. However, this gene was present in the hg19 version. 

Is there any reason for this? 

Thank you,
Laura 



Brian Lee

Nov 24, 2015, 4:15:58 PM
to Laura Smith, Matthew Speir, Luvina Guruvadoo, gen...@soe.ucsc.edu

Dear Laura,

Thank you for using the UCSC Genome Browser. The process for building the knownCanonical table changed between hg19 and hg38, likely explaining the difference you are observing.

If you go to the track description page for these two tracks on their respective assemblies you will find these paragraphs:

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.

http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=knownGene

knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally, this is the longest isoform.
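
The hg19 description does not spell out how "longest" is measured. As one possible reading, the following Python sketch computes the spliced length (the sum of exon lengths) from the exonStarts and exonEnds columns of a refGene row. The function name `spliced_length` is made up, and treating spliced length as the measure is an assumption. The coordinates are the two TSPAN6 rows shown earlier in this thread.

```python
def spliced_length(exon_starts, exon_ends):
    """Sum exon lengths from refGene-style comma-terminated coordinate lists."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return sum(end - start for start, end in zip(starts, ends))


# exonStarts / exonEnds for the two TSPAN6 refGene rows quoted in this thread.
tspan6 = {
    "NM_003270": (
        "100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636607,",
        "100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636857,",
    ),
    "NM_001278740": (
        "100627107,100630758,100632484,100633404,100633930,100635177,100635557,100636190,",
        "100629986,100630866,100632568,100633539,100634029,100635252,100635746,100636732,",
    ),
}

lengths = {acc: spliced_length(*coords) for acc, coords in tspan6.items()}
longest = max(lengths, key=lengths.get)
print(lengths, longest)
```

Measuring by genomic span (txEnd minus txStart) instead can rank the same two transcripts the other way round, so the choice of measure matters.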

Besides reviewing track description pages, searching our mailing list archives is one of the best ways to find answers to questions before mailing the list. Note, however, that the archives are imperfect because processes change, such as how the knownCanonical table is built, so an older answer may no longer reflect what is current: https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!searchin/genome/knownCanonical%7Csort:date

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute

Laura Smith

Jan 12, 2016, 12:10:27 PM
to Brian Lee, Matthew Speir, Luvina Guruvadoo, gen...@soe.ucsc.edu
Dear Brian, 

Thank you very much for your reply. I have one more question on ensembl canonical transcripts. I downloaded the ensembl canonical transcripts from UCSC genome browser (using the directions you had provided before which was very helpful, thank you again). There were around ~49k ensembl canonical transcripts. Noticed that there were duplicate "gene names" but unique “transcript ids”. These are the genes:

5S_rRNA
CCL3L1
CYP2D6
DEFB130
MIR4509-2
Metazoa_SRP
RNA5-8S5
SNORA26
SNORA31
SNORA75
SNORD113
SNTG2
U1
U3
U6
U8
Y_RNA
snoU13
uc_338

All of the above are valid gene names (checked in Ensembl), but they are missing a suffix, which causes them to have different transcript IDs. For example, ENST00000600454 should have the gene name Metazoa_SRP.23-201, while ENST00000621054.1 should have the gene name Metazoa_SRP.28-201, but both entries have the same gene name, Metazoa_SRP.

Do you know why this could happen? Which transcript Id shall I choose for such cases where the same gene has multiple transcripts? For instance for Metazoa_SRP gene? I would very much appreciate your guidance on this. 

Thank you,
Laura  




Laura Smith

Jan 25, 2016, 7:09:46 PM
to Matthew Speir, Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Matthew, 

I have a quick question regarding the refseq and ensembl canonical transcripts that I downloaded from UCSC genome browser based on Brian's recommendation and steps he sent before in the email chain below. 

I noticed that the GRCh38 canonical transcripts that I downloaded also include transcripts on alternative assembly chromosomes such as the format chrNN_xxxxxxx_alt and chrNN_xxxxxx_random, for example chr6_GL000254v2_alt . I think this is expected to have canonical transcripts also on these sort of chromosomes, right?  I wanted to double check with you if this is expected or not. 

Thank you,
Laura. 

Luvina Guruvadoo

Jan 28, 2016, 12:27:05 PM
to Laura Smith, Matthew Speir, Brian Lee, gen...@soe.ucsc.edu
Hello Laura,

Yes, this is expected.
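
If rows on alternate, random, or unplaced sequences are not wanted for a particular analysis, they can be filtered out by chromosome name after download. The following is a minimal Python sketch. The function name `is_primary` is made up, the pattern assumes UCSC-style names (chr1 to chr22, chrX, chrY, chrM), and the sample names other than chr6_GL000254v2_alt are illustrative.

```python
import re

# Matches chr1..chr22-style names plus chrX, chrY, chrM; anything with a
# suffix such as _alt or _random, or a chrUn_ prefix, does not match.
PRIMARY = re.compile(r"chr(\d{1,2}|X|Y|M)")


def is_primary(chrom):
    """True when the name has no _alt, _random, or chrUn-style decoration."""
    return PRIMARY.fullmatch(chrom) is not None


chroms = [
    "chr6",
    "chr6_GL000254v2_alt",
    "chr1_KI270706v1_random",
    "chrUn_KI270302v1",
    "chrX",
    "chrM",
]
kept = [c for c in chroms if is_primary(c)]
print(kept)
```

Whether to drop these rows depends on the analysis. The alternate sequences carry real annotation, so filtering is a convenience rather than a correction.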

For your previous question, the Ensembl site provides information on their naming convention. See:
http://www.ensembl.org/info/genome/genebuild/genome_annotation.html

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Regards,
Luvina

--
Luvina Guruvadoo
UCSC Genome Browser

http://genome.ucsc.edu





Laura Smith

Feb 3, 2016, 10:37:07 AM
to Luvina Guruvadoo, Matthew Speir, Brian Lee, gen...@soe.ucsc.edu
Hello Luvina, 

Thank you very much for your reply. 

I have one more question and would very much appreciate your response. I downloaded the RefSeq canonical transcripts from the UCSC Genome Browser. I noticed that the gene ABR has three transcripts ['NM_001159746.2', 'NM_001282149.1', 'NM_021962.4'] with 18, 21, and 21 exons respectively. I would like to choose only one of them. Would you suggest choosing the longest one?

thanks,
Laura 

Matthew Speir

Feb 8, 2016, 2:50:40 PM
to Laura Smith, Luvina Guruvadoo, Brian Lee, gen...@soe.ucsc.edu
Hi Laura,

Thank you for your question about getting a list of canonical transcripts.

Since you are using "knownCanonical" for hg38 to get this list of transcripts, I would recommend basing your decision off the "appris_principal_1" tag associated with a particular transcript. This tag is used by GENCODE to mark the primary transcript of a particular gene. Here are the tags associated with the three transcripts listed in your email:

#hg38.refGene.name    hg38.refGene.name2    hg38.wgEncodeGencodeTagV23.tag
NM_001159746    ABR    CCDS,alternative_5_UTR,basic,alternative_5_UTR,basic,cds_start_NF,mRNA_start_NF,
NM_001282149    ABR    CCDS,not_organism_supported,basic,not_organism_supported,basic,
NM_021962    ABR    CCDS,basic,appris_principal_1,basic,cds_start_NF,mRNA_start_NF,

As you can see, NM_021962 is the only transcript that has this "appris_principal_1" tag associated with it.
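
The same check can be scripted over a larger output. The following is a minimal Python sketch that splits the comma-separated tag column and keeps the accessions carrying a given tag. The function name `principal_transcripts` is made up, and the rows are the three ABR rows shown above.

```python
# (refGene.name, refGene.name2, tag column) for the three ABR rows above.
ROWS = [
    ("NM_001159746", "ABR",
     "CCDS,alternative_5_UTR,basic,alternative_5_UTR,basic,cds_start_NF,mRNA_start_NF,"),
    ("NM_001282149", "ABR",
     "CCDS,not_organism_supported,basic,not_organism_supported,basic,"),
    ("NM_021962", "ABR",
     "CCDS,basic,appris_principal_1,basic,cds_start_NF,mRNA_start_NF,"),
]


def principal_transcripts(rows, tag="appris_principal_1"):
    """Return {gene symbol: [accessions]} for rows whose tag list has `tag`."""
    result = {}
    for accession, symbol, tags in rows:
        # The column ends with a trailing comma, so drop empty pieces.
        tag_set = {t for t in tags.split(",") if t}
        if tag in tag_set:
            result.setdefault(symbol, []).append(accession)
    return result


principal = principal_transcripts(ROWS)
print(principal)
```

Genes that end up with no entry, or with more than one accession, still need a manual decision.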

You can get this information from the Table Browser using the following steps:

1. Navigate to the Table Browser.
2. Make the following selections:
    clade: Mammal
    genome: Human
    assembly: Dec. 2013 (GRCh38/hg38)
    group: Genes and Gene Predictions Tracks
    track: RefSeq Genes
    table: refGene
    output: selected fields from primary and related tables

3. Next to "identifiers", click "paste list".
4. Paste in your list of identifiers:
        NM_001159746
        NM_001282149
        NM_021962

5. Click "submit".
6. Click "get output".
7. Under the "Linked Tables" section, check the box next to "wgEncodeGencodeRefSeqV23".
8. Click "allow selection from checked tables".
9. Go back to the "Linked Tables" section and check the box next to "wgEncodeGencodeTagV23".
10. Click "allow selection from checked tables".
11. Under the "Select Fields from hg38.refGene" section, check the boxes next to "name" and "name2".
12. Under the "hg38.wgEncodeGencodeTagV23 fields" section, check the box next to "tag".
13. Click "get output".

You can find descriptions of the various GENCODE tags on their website here: http://www.gencodegenes.org/gencode_tags.html. You can read more about APPRIS here: http://appris.bioinfo.cnio.es/#/help/intro.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group


Laura Smith

Feb 8, 2016, 2:55:28 PM
to Laura Smith, Brian Lee, Matthew Speir, Luvina Guruvadoo, gen...@soe.ucsc.edu
Dear Brian, Luvina and Matthew, 

Thank you all very much for your help to me with downloading the UCSC's curated refseq canonical transcript files in the past couple of months. You have been extremely helpful and I very much appreciate it.  

I have one question if you don't mind. There are about 35k genes in the RefSeq table that I downloaded from the UCSC Genome Browser. However, there are only 20k RefSeq canonical transcripts. Why is there a gap in the number of genes? Is this expected?

Thank you very much,
Laura



Matthew Speir

Feb 8, 2016, 6:19:45 PM
to Laura Smith, Brian Lee, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Laura,

Could you provide more context for where you are getting your value of 35,000 RefSeq genes? Looking at the total number of items in the UCSC "refGene" table shows a total count of over 60,000 transcripts. Often a single gene will have multiple alternatively spliced transcripts associated with it. If we assume that a list of unique gene symbols in this refGene table represents all of the genes, then we get a total of a little over 26,000 genes.
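
These two counts can be reproduced from the downloaded file. The following is a minimal Python sketch that counts rows and distinct gene symbols in a refGene-style .txt.gz file. The function names and the tiny sample file are made up for illustration, and the gene symbol is assumed to be in the 13th column, as in the refGene rows shown earlier in this thread.

```python
import gzip
import os
import tempfile


def count_refgene(path):
    """Return (transcript rows, distinct gene symbols) for a refGene-style file."""
    n_rows = 0
    symbols = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 13:
                continue  # skip blank or malformed lines
            n_rows += 1
            symbols.add(fields[12])  # name2, the gene symbol
    return n_rows, len(symbols)


def toy_row(name, name2):
    """Build a 16-column row with placeholders except accession and symbol."""
    fields = ["."] * 16
    fields[1], fields[12] = name, name2
    return "\t".join(fields) + "\n"


# Write a three-row sample file so the sketch runs without a download.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "refGene.sample.txt.gz")
    with gzip.open(path, "wt") as out:
        out.writelines([
            toy_row("NM_003270", "TSPAN6"),
            toy_row("NM_001278740", "TSPAN6"),
            toy_row("NM_181093", "SCYL3"),
        ])
    counts = count_refgene(path)

print(counts)
```

Pointing `count_refgene` at the real refGene.txt.gz would give the transcript and gene-symbol totals for whatever version of the table was downloaded.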

Your 20,000 from the steps described in previous emails doesn't seem too far off from this rough 26,000 estimate. Additionally, various sources at NIH/NCBI/NHGRI seem to indicate there are between 20,000 and 25,000 genes in the human genome.
The other thing you should note is that the process described by my colleague Brian in previous emails is imperfect: it depends on numerous links and assumptions about equivalency between two different gene sets, RefSeq and GENCODE V23. These gene sets are generated in different ways, so items found in one set may not be found in the other, and such items will not end up in your output.


I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group


Laura Smith

Feb 12, 2016, 12:16:20 PM
to Laura Smith, Matthew Speir, Luvina Guruvadoo, Brian Lee, gen...@soe.ucsc.edu
Hi Matthew, 

Thank you very much for all your help and for sending me the instructions to download the GRCh38 refseq canonical transcripts. I followed your instructions and got the appris tags for refseq transcripts. 

I am also very interested in downloading the ENSEMBL canonical transcripts from UCSC genome browser. I had downloaded them based on the instructions sent by Brian earlier. 
Some of the ENSEMBL genes also have multiple canonical transcripts for a given gene. I tried downloading the Ensembl APPRIS tag data from UCSC the same way you suggested for RefSeq, but I was not able to do so.

I did the following selections on the UCSC genome table browser:
track: all Gencode v22
Group : genes and gene prediction
table:Comprehensive
region: genome
Then I tried to paste the list of the 1345 genes in the box that opens up on clicking “paste list” next to “identifiers (name/accession)”.
It didn’t accept the gene-names, it requires the ensembl Ids of the transcripts.


Then I tried this:
Track:GENCODE v22
Group:Genes and Gene Prediction
Table:knownGene
Region:Genome
Pasted the list of gene next to the “identifiers (name/accession)”. This time it accepted the gene Names
Output format:selected fields from primary and related tables
Clicked on “get output”
In the linked tables, I couldn’t find the “wgEncodeGencodeTagV23” option which actually gives the Appris tags.

I even tried with setting the “Table” to “knownCanonical” but got the same problem in that case. 

Would you please let me know what I may be doing wrong? What is your suggestion to choose a canonical ensembl transcript for a given gene if there are multiple ones? Thank you so much for all your help. 

Best,
Laura 

ps: There were around ~49k ensembl canonical transcripts when I downloaded them from UCSC genome browser. I noticed that there were duplicate "gene names" but unique “transcript ids”. 

Brian Lee

Feb 17, 2016, 6:41:32 PM
to Laura Smith, Matthew Speir, Luvina Guruvadoo, gen...@soe.ucsc.edu

Dear Laura,

Thank you for using the UCSC Genome Browser. I want to clarify a statement you used earlier where you referred to our support to your questions as ’UCSC's curated RefSeq canonical transcript files’. Please note that RefSeq does not produce an official set of "canonical" transcripts, and UCSC does not provide a "knownCanonical" for RefSeq genes.

I want to be clear for our mailing list that in this thread we are attempting to support you in your requests, not outlining a UCSC curated canonical file for RefSeq. Also, along the way we have been attempting to point out this method is imperfect and you may still end up with multiple transcripts associated with a single gene symbol and other unexpected errors, and these issues are ultimately your responsibility to resolve in your search to meet your research needs.

This mailing list is not a source of scientific advice; rather, it is intended to provide support for questions related to the use of the UCSC Genome Browser and utilities. There are forums like BioStar, https://www.biostars.org/, where scientists may be able to provide you with the scientific direction you need, and other groups devoted to resolving such questions, like APPRIS (Annotating principal splice isoforms).

With that said, I can help you find the table you are looking for in hg38. You need to use the "allow selection from checked tables" option to keep relating tables until the desired related table appears with fields you can select. Following the steps you provided:

Track:GENCODE v22
Group:Genes and Gene Prediction
Table:knownGene
Region:Genome

Pasted a list of genes using the “identifiers (name/accession)” option. 


Output format:selected fields from primary and related tables

Scroll down to "Linked Tables", click the box next to hg38 refGene. 
Click "allow selection from checked tables" at the bottom, a new list of tables will appear.
Scroll down to "Linked Tables", click the box next to hg38 wgEncodeGencodeRefSeqV23. 
Click "allow selection from checked tables" at the bottom, a new list of tables will appear.
Scroll down to "Linked Tables", click the box next to hg38 wgEncodeGencodeTagV23. 
Click "allow selection from checked tables" at the bottom, a new list of tables will appear.
Go to the top and select the fields that meet your research interests, most likely the box next to "tag" in "hg38.wgEncodeGencodeTagV23 fields".

It should be noted this is just data from an external site that has been loaded into the browser. For example, for your previous question about ABR, if you go to APPRIS, http://appris.bioinfo.cnio.es/, and search for ABR, you will find a page like the following, http://appris.bioinfo.cnio.es/#/database/id/homo_sapiens/ENSG00000159842?db=hg38, which shows that ENST00000302538 is annotated by them as PRINCIPAL:1. When you search the wgEncodeGencodeTagV23 table for an entry like "appris_principal_1", you are ultimately referring back to APPRIS. You can review a paper about APPRIS to see how annotating principal splice isoforms is a challenging scientific topic, beyond the general scope of our mailing list: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531113/

All the best,

Brian Lee
UCSC Genomics Institute

José Manuel Rodríguez

Feb 18, 2016, 12:07:11 PM
to Laura Smith, Matthew Speir, Luvina Guruvadoo, Brian Lee, gen...@soe.ucsc.edu
Hi Laura (and the rest of community),

I am Jose Rodriguez, the first author of APPRIS.
(Sorry, I have only just read this email, which mentions APPRIS.)

APPRIS works with protein-coding genes (because its methods work at the protein level).

Most of your genes (at least the first ones) do not have protein sequences.

It's a general problem, where an HGNC gene name maps to multiple Ensembl gene IDs:

Metazoa_SRP -> ENSG00000213167

Metazoa_SRP -> ENSG00000268154
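
To find such cases in your own list, a short script can group gene IDs by gene name and report the names that map to more than one ID. This is only a sketch; the function name is hypothetical, and the input pairs are typed in by hand here (the Metazoa_SRP pairs above plus ABR from earlier in the thread) rather than read from a real export.

```python
from collections import defaultdict


def ambiguous_gene_names(pairs):
    """Return {gene_name: sorted gene IDs} for names with more than one ID."""
    ids_by_name = defaultdict(set)
    for name, gene_id in pairs:
        ids_by_name[name].add(gene_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


pairs = [
    ("Metazoa_SRP", "ENSG00000213167"),
    ("Metazoa_SRP", "ENSG00000268154"),
    ("ABR", "ENSG00000159842"),
]

ambiguous = ambiguous_gene_names(pairs)
for name, gene_ids in ambiguous.items():
    print(name, "->", ", ".join(gene_ids))
```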


As Brian Lee suggests, we should keep in contact by individual email rather than through the UCSC mailing list.

Best Regards,
J



*********************************************
Jose Manuel Rodriguez Carrasco

Structural Biology and BioComputing Programme
Spanish National Cancer Center (CNIO)
Spanish National Bioinformatics Institute (INB) - http://www.inab.org -

Address: C/ Melchor Fernandez Almagro 3, Madrid (Spain) ZipCode: 28029

On Feb 12, 2016, at 3:14 AM, 'Laura Smith' via UCSC Genome Browser discussion list <gen...@soe.ucsc.edu> wrote:

Hi Matthew, 

Thank you very much for all your help and for sending me the instructions to download the GRCh38 refseq canonical transcripts. I followed your instructions and got the appris tags for refseq transcripts. 

I am also very interested in downloading the ENSEMBL canonical transcripts from UCSC genome browser. I had downloaded them based on the instructions sent by Brian earlier. 
Some ENSEMBL genes also have multiple canonical transcripts. I tried downloading the ensembl appris tag data from UCSC the same way you suggested for Refseq, but I was not able to do so. 

I did the following selections on the UCSC genome table browser:
track: all Gencode v22
Group : genes and gene prediction
table:Comprehensive
region: genome
Then I tried to paste the list of the 1345 genes in the box that opens up on clicking "paste list" next to "identifiers (name/accession)".
It didn't accept the gene names; it requires the ensembl IDs of the transcripts.


Then I tried this:
Track:GENCODE v22
Group:Genes and Gene Prediction
Table:knownGene
Region:Genome
Pasted the list of genes next to "identifiers (name/accession)". This time it accepted the gene names.
Output format:selected fields from primary and related tables
Clicked on "get output"
In the linked tables, I couldn't find the "wgEncodeGencodeTagV23" option, which actually gives the Appris tags.

I even tried setting the "Table" to "knownCanonical" but got the same problem in that case. 

Would you please let me know what I may be doing wrong? What is your suggestion to choose a canonical ensembl transcript for a given gene if there are multiple ones? Thank you so much for all your help. 

Best,
Laura 

PS: There were around 49k ensembl canonical transcripts when I downloaded them from UCSC genome browser. I noticed that there were duplicate "gene names" but unique "transcript ids". 
I have one more question and would very much appreciate your response. I downloaded the refseq canonical transcripts from ucsc genome browser. I noticed that gene ABR has three transcripts ['NM_001159746.2', 'NM_001282149.1', 'NM_021962.4'] having 18, 21, and 21 exons respectively. I would like to choose only one of them. Would you suggest choosing the longest one?

thanks,
Laura 








