[Genome] Canonical transcripts for ENSEMBL

Laura Smith

Jun 18, 2012, 1:26:11 PM
to gen...@soe.ucsc.edu
To whom it may concern:

Hello,

I would like to get the canonical transcripts for Ensembl. Is there a way
to get them online from the UCSC Genome Browser, or perhaps a link to
this data?

If so, could you please send me directions for downloading it? I looked
at the UCSC Genome Browser website but couldn't figure out where the
Ensembl canonical transcripts are.

I would very much appreciate your help on this matter. 

Thank you,
Laura

Laura Smith

Jun 19, 2012, 12:45:12 PM
to gen...@soe.ucsc.edu
Hello,

I would like to get the canonical transcripts for RefSeq. Is there a way
to get them online from the UCSC Genome Browser, or perhaps a link to
this data?

If so, could you please send me directions for downloading it? I looked
at the UCSC Genome Browser website but couldn't figure out where the
RefSeq canonical transcripts are.

Steve Heitner

Jun 19, 2012, 1:16:07 PM
to Laura Smith, gen...@soe.ucsc.edu
Hello, Laura.

You can do this by using our Table Browser. If you're unfamiliar with the
Table Browser, please see the User's Guide at
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html.

We have a knownCanonical table associated with the UCSC Genes track. If you
want the transcripts in terms of RefSeq IDs, you will have to link to an
additional table. Please follow the instructions below:

1. From http://genome.ucsc.edu, select "Tables" from the blue navigation
bar at the top of the screen.

2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Prediction Tracks
Track: UCSC Genes
Table: knownCanonical
Region: Select either "genome" for the entire genome or specify a region
next to "position"
Output format: selected fields from primary and related tables

3. Click the "get output" button

4. In the "Select Fields from hg19.knownCanonical" section, check the
checkboxes for the fields you wish to display

5. In the "hg19.kgXref fields" section, check the hg19.refseq checkbox to
display the associated RefSeq ID in your output

6. Click the "get output" button
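
The join performed by the steps above can also be scripted offline. The following is a minimal Python sketch, assuming the knownCanonical and kgXref tables have been exported as tab-separated text with a header line. The column names (`transcript`, `kgID`, `refseq`) are assumed to match the hg19 table schemas, and the sample rows, the placeholder IDs, and the helper names `read_tsv` and `join_canonical_to_refseq` are hypothetical, made up for illustration.

```python
import csv
import io

# Placeholder exports with made-up IDs; real data would come from the
# Table Browser. Column names are assumed to match the hg19 schemas.
KNOWN_CANONICAL_TSV = (
    "chrom\tchromStart\tchromEnd\tclusterId\ttranscript\tprotein\n"
    "chr1\t100\t900\t1\tucDEMOa.1\tucDEMOa.1\n"
    "chr2\t200\t800\t2\tucDEMOb.1\tucDEMOb.1\n"
)
KG_XREF_TSV = (
    "kgID\tgeneSymbol\trefseq\n"
    "ucDEMOa.1\tGENE1\tNM_DEMO1\n"
    "ucDEMOb.1\tGENE2\t\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_canonical_to_refseq(canonical_rows, xref_rows):
    """Pair each canonical transcript with its RefSeq ID ('' if none)."""
    refseq_by_kg = {row["kgID"]: row["refseq"] for row in xref_rows}
    return [
        (row["transcript"], refseq_by_kg.get(row["transcript"], ""))
        for row in canonical_rows
    ]


pairs = join_canonical_to_refseq(
    read_tsv(KNOWN_CANONICAL_TSV), read_tsv(KG_XREF_TSV)
)
for transcript, refseq in pairs:
    print(transcript, refseq or "(no RefSeq ID)")
```

Canonical transcripts without a RefSeq ID are kept with an empty value rather than dropped, so the output still has one row per canonical transcript.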

Please contact us again at gen...@soe.ucsc.edu if you have any further
questions.

---
Steve Heitner
UCSC Genome Bioinformatics Group
_______________________________________________
Genome maillist - Gen...@soe.ucsc.edu
https://lists.soe.ucsc.edu/mailman/listinfo/genome


Steve Heitner

Jun 19, 2012, 4:39:04 PM
to Laura Smith, gen...@soe.ucsc.edu
Hello, Laura.

I just wanted to point out that the solution I sent you earlier does not
produce a canonical set directly from RefSeq. We do not have such a set.
The method I suggested is simply a means of obtaining RefSeq IDs from our
own canonical set. If this method is satisfactory to you, a slight
modification to the previous steps may provide you with more RefSeq IDs than
the original method. Instead of the previous step 5, perform the following:

5. In the "Linked Tables" section, check the hg19.knownToRefSeq checkbox

6. Scroll down and click the "allow selection from checked tables" button

7. In the "hg19.knownToRefSeq fields" section, check the "value" checkbox

8. Click the "get output" button
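
Whether the knownToRefSeq route really yields more RefSeq IDs for a given download can be checked by counting mapped transcripts under each mapping. The following is a minimal Python sketch of that count. The dictionaries stand in for the kgXref `refseq` column and the knownToRefSeq `value` column, and every ID in them is a made-up placeholder, so the numbers printed say nothing about the real tables. The helper name `count_mapped` is hypothetical.

```python
def count_mapped(canonical_ids, mapping):
    """Count canonical transcripts with a non-empty RefSeq ID in mapping."""
    return sum(1 for kg_id in canonical_ids if mapping.get(kg_id))


# Made-up placeholder data, for illustration only.
canonical_ids = ["ucDEMOa.1", "ucDEMOb.1", "ucDEMOc.1"]
kgxref_refseq = {"ucDEMOa.1": "NM_DEMO1", "ucDEMOb.1": ""}
known_to_refseq = {"ucDEMOa.1": "NM_DEMO1", "ucDEMOb.1": "NM_DEMO2"}

via_kgxref = count_mapped(canonical_ids, kgxref_refseq)
via_known_to_refseq = count_mapped(canonical_ids, known_to_refseq)
print("kgXref.refseq:", via_kgxref)
print("knownToRefSeq.value:", via_known_to_refseq)
```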

In addition, you may find the previously-answered mailing list question at
https://lists.soe.ucsc.edu/pipermail/genome/2010-April/021815.html helpful.
This answer assumes a degree of familiarity with command line tools. If you
are not as comfortable with command line tools, you can also find an
assortment of useful tools at Galaxy (http://galaxy.psu.edu/).

Greg Roe

Jun 19, 2012, 4:47:14 PM
to Laura Smith, gen...@soe.ucsc.edu
Hi Laura,

We don't have exactly what you're asking for. I would suggest contacting
Ensembl (http://uswest.ensembl.org/info/about/contact/index.html) with
this question.

If you have any additional questions, please reply to: gen...@soe.ucsc.edu
-
Greg Roe
UCSC Genome Bioinformatics Group

Amonida Zadissa

Jun 20, 2012, 9:58:48 AM
to Greg Roe, Laura Smith, gen...@soe.ucsc.edu
Hi Laura,

Just adding to Greg's comment here; if you are familiar with Perl, you
can use the Ensembl Perl API
(http://www.ensembl.org/info/docs/api/core/index.html#api) for
extracting the canonical transcripts for any of the species that we
annotate.

If you need help with the API or any other data sets from Ensembl,
please contact our help-desk (help...@ensembl.org).

Best regards,
Amonida

--
Amonida Zadissa Ph.D.
Deputy team leader
EnsEMBL Genebuild team
Wellcome Trust Sanger Institute
England
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

Laura Smith

Jun 20, 2012, 2:43:29 PM
to st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hi Steve,

Thank you very much for your detailed and quick reply. I really appreciate it. I tried the second method you gave me and it worked.


thanks!
Laura


Laura Smith

Jun 20, 2012, 2:44:36 PM
to Amonida Zadissa, Greg Roe, gen...@soe.ucsc.edu
Thank you very much. Yes, I contacted Ensembl and they were very helpful, so I have these transcripts now.

Thanks for your reply. 

Laura


Laura Smith

May 7, 2014, 11:14:23 PM
to st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hello, 

In March 2014 I downloaded the knownCanonical transcripts from the UCSC website, linked to RefSeq per the instructions below. However, in the downloaded file I noticed that in rare cases there are multiple transcripts per gene. Is this an error on the UCSC site, or is it intentional? If you could please help clarify this, I would very much appreciate it.

If it is intentional, could you please explain why? I need to limit the set to one transcript per gene, so how do I make the selection? I could choose the longest transcript; however, in the example below both transcripts mapping to the same gene have exactly the same length. In that case, which one would you recommend choosing as the canonical transcript?

Thank you, 
Laura 

PS: Below is an example where one gene has two canonical transcripts:

TSPO    NM_000714.5
TSPO    NM_001256531.1

One quick solution I thought of was to choose the longer of the two transcripts. But when I looked at the originally downloaded file (attached; the two-column format was derived from it), both transcripts have exactly the same start (43547519) and end (43559248) positions.

For some reason, UCSC keeps both of these transcripts as canonical.

Which one would make more sense to keep? Both transcripts have the same summary. For example, should we keep the one with the most recent date?

 
Field                              Row 1          Row 2
hg19.knownCanonical.chrom          chr22          chr22
hg19.knownCanonical.chromStart     43547519       43547519
hg19.knownCanonical.chromEnd       43559248       43559248
hg19.knownCanonical.clusterId      19615          19614
hg19.knownCanonical.transcript     uc003bdn.4     uc003bdo.4
hg19.knownCanonical.protein        uc003bdn.4     uc003bdo.4
hg19.gbStatus.acc                  NM_000714      NM_001256531
hg19.gbStatus.version              5              1
hg19.gbStatus.modDate              2013-10-03     2014-01-26
hg19.gbStatus.srcDb                RefSeq         RefSeq
hg19.knownToRefSeq.name            uc003bdn.4     uc003bdo.4
hg19.knownToRefSeq.value           NM_000714      NM_001256531
hg19.refGene.bin                   917            917
hg19.refGene.name                  NM_000714      NM_001256531
hg19.refGene.chrom                 chr22          chr22
hg19.refGene.name2                 TSPO           TSPO
hg19.refSeqStatus.mrnaAcc          NM_000714      NM_001256531
hg19.refSeqStatus.status           Reviewed       Reviewed
hg19.refSeqStatus.mol              mRNA           mRNA
hg19.refSeqSummary.mrnaAcc         NM_000714      NM_001256531
hg19.refSeqSummary.completeness    FullLength     Unknown

hg19.refSeqSummary.summary (identical in both rows):
Present mainly in the mitochondrial compartment of peripheral tissues, the protein encoded by this gene interacts with some benzodiazepines and has different affinities than its endogenous counterpart. The protein is a key factor in the flow of cholesterol into mitochondria to permit the initiation of steroid hormone synthesis. Alternatively spliced transcript variants have been reported; one of the variants lacks an internal exon and is considered non-coding, and the other variants encode the same protein. [provided by RefSeq, Feb 2012].
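
Genes like this can be listed automatically from the two-column gene/RefSeq format shown earlier. The following is a minimal Python sketch, assuming the data is already loaded as (gene, transcript) pairs. The two TSPO pairs are the ones from this message; `GENE_X` with `NM_DEMO1.1` and the helper name `genes_with_multiple_transcripts` are hypothetical. Note that the two TSPO rows carry different clusterId values (19615 and 19614), which may be why both appear; that reading is an assumption, not something confirmed in this thread.

```python
from collections import defaultdict


def genes_with_multiple_transcripts(pairs):
    """Return {gene: [transcripts]} for genes with more than one transcript."""
    by_gene = defaultdict(list)
    for gene, transcript in pairs:
        # Skip exact repeats so a duplicated row is not counted twice.
        if transcript not in by_gene[gene]:
            by_gene[gene].append(transcript)
    return {gene: ts for gene, ts in by_gene.items() if len(ts) > 1}


pairs = [
    ("TSPO", "NM_000714.5"),
    ("TSPO", "NM_001256531.1"),
    ("GENE_X", "NM_DEMO1.1"),  # hypothetical single-transcript gene
]

duplicates = genes_with_multiple_transcripts(pairs)
for gene, transcripts in duplicates.items():
    print(gene, ", ".join(transcripts))
```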



I also looked at the archives FAQ and found the following note. Is it possible that multiple transcripts were chosen as canonical for the same gene by mistake?



RE: [genome] Multiple transcripts position to a single gene position [message #11012 is a reply to message #10995], Tue, 02 October 2012 10:55

There is also a table in the UCSC Genes track called knownCanonical which
contains one canonical transcript per gene. This table is not perfect and
in some rare cases (such as with WASH7P), there is more than one canonical
transcript reported. We are currently working on revising the method used
to select canonical transcripts to correct this, but as mentioned above, it
is rare that more than one canonical transcript is reported. If you would
like to revise your existing query to use the knownCanonical table instead,
just replace "knownGene" with "knownCanonical" and replace "alignID" with
"transcript":


Matthew Speir

May 13, 2014, 5:10:49 PM
to Laura Smith, st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hello Laura,

Thank you for your question about the knownCanonical table. Unfortunately, the issue of a gene being assigned multiple canonical transcripts is still present in our most recent versions of the knownCanonical table. We are looking at different solutions to this complex problem and hope to have it resolved in a future version of the UCSC Genes track. For the gene you mentioned in your email, one of our engineers suggests arbitrarily choosing which of the two transcripts to keep and which to discard.
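
An arbitrary choice can still be made reproducible, so that the same transcript is kept every time the file is processed. The following is a minimal Python sketch, assuming (gene, transcript) pairs as input. The tie-break used here, keeping the transcript whose accession sorts first as a string, is purely an assumption for illustration and not a UCSC recommendation. The helper name `one_transcript_per_gene` is hypothetical.

```python
def one_transcript_per_gene(pairs):
    """Keep one transcript per gene: the first in sorted (string) order.

    The sort-order rule is an arbitrary but repeatable tie-break,
    chosen only for illustration.
    """
    chosen = {}
    for gene, transcript in sorted(pairs):
        # setdefault keeps the first transcript seen for each gene.
        chosen.setdefault(gene, transcript)
    return chosen


pairs = [
    ("TSPO", "NM_001256531.1"),
    ("TSPO", "NM_000714.5"),
]

chosen = one_transcript_per_gene(pairs)
for gene, transcript in chosen.items():
    print(gene, transcript)
```

Any other fixed rule (for example, the most recent modDate) would serve equally well, as long as it is applied consistently.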

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group