Hello,
In March 2014 I downloaded the knownCanonical transcripts from the UCSC website, joined to RefSeq through knownToRefSeq and refGene, following the instructions below. In the downloaded file I noticed that, in rare cases, there are multiple transcripts per gene. Is this an error on the UCSC site, or was it intentional? I would very much appreciate it if you could clarify this.
If it was intentional, could you please explain why? I need to limit the set to one transcript per gene, so how should I make the selection? I could choose the longest transcript, but in the example below both transcripts mapping to the same gene have exactly the same length. In that case, which one would you recommend as the canonical transcript?
Thank you,
Laura
PS: Below is an example where one gene has 2 canonical transcripts:
TSPO NM_000714.5
TSPO NM_001256531.1
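To check how many genes are affected, here is a minimal sketch that scans the 2-column file (gene, then transcript, separated by whitespace) and reports every gene listed with more than one transcript. The function name and the in-memory example list are my own illustration; the layout of the full file is assumed to match the two lines above.

```python
from collections import defaultdict


def genes_with_multiple_transcripts(lines):
    """Return {gene: [transcripts]} for genes listed with more than one transcript."""
    by_gene = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        gene, transcript = fields
        if transcript not in by_gene[gene]:
            by_gene[gene].append(transcript)
    # Keep only the genes that have two or more distinct transcripts.
    return {gene: accs for gene, accs in by_gene.items() if len(accs) > 1}


# The two lines from the example above.
example = [
    "TSPO NM_000714.5",
    "TSPO NM_001256531.1",
]
print(genes_with_multiple_transcripts(example))
```

For the real file, the same function can be given an open file handle instead of the list.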
One quick solution I thought of was to choose the longer of the two transcripts above. However, in the originally downloaded file (attached, from which we derived the 2-column format), both transcripts have exactly the same start (43547519) and end (43559248) positions.
For some reason, UCSC keeps both of these transcripts as canonical.
Which one would it make more sense to keep? Both transcripts have the same summary. For example, should we keep the one with the most recent date?
#hg19.knownCanonical.chrom hg19.knownCanonical.chromStart hg19.knownCanonical.chromEnd hg19.knownCanonical.clusterId hg19.knownCanonical.transcript hg19.knownCanonical.protein hg19.gbStatus.acc hg19.gbStatus.version hg19.gbStatus.modDate hg19.gbStatus.srcDb hg19.knownToRefSeq.name hg19.knownToRefSeq.value hg19.refGene.bin hg19.refGene.name hg19.refGene.chrom hg19.refGene.name2 hg19.refSeqStatus.mrnaAcc hg19.refSeqStatus.status hg19.refSeqStatus.mol hg19.refSeqSummary.mrnaAcc hg19.refSeqSummary.completeness hg19.refSeqSummary.summary
chr22 43547519 43559248 19615 uc003bdn.4 uc003bdn.4 NM_000714 5 2013-10-03 RefSeq uc003bdn.4 NM_000714 917 NM_000714 chr22 TSPO NM_000714 Reviewed mRNA NM_000714 FullLength Present mainly in the mitochondrial compartment of peripheral tissues, the protein encoded by this gene interacts with some benzodiazepines and has different affinities than its endogenous counterpart. The protein is a key factor in the flow of cholesterol into mitochondria to permit the initiation of steroid hormone synthesis. Alternatively spliced transcript variants have been reported; one of the variants lacks an internal exon and is considered non-coding, and the other variants encode the same protein. [provided by RefSeq, Feb 2012].
chr22 43547519 43559248 19614 uc003bdo.4 uc003bdo.4 NM_001256531 1 2014-01-26 RefSeq uc003bdo.4 NM_001256531 917 NM_001256531 chr22 TSPO NM_001256531 Reviewed mRNA NM_001256531 Unknown Present mainly in the mitochondrial compartment of peripheral tissues, the protein encoded by this gene interacts with some benzodiazepines and has different affinities than its endogenous counterpart. The protein is a key factor in the flow of cholesterol into mitochondria to permit the initiation of steroid hormone synthesis. Alternatively spliced transcript variants have been reported; one of the variants lacks an internal exon and is considered non-coding, and the other variants encode the same protein. [provided by RefSeq, Feb 2012].
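To make the question concrete, here is a minimal sketch of the tie-breaking rule I am considering: keep, per gene, the entry with the largest genomic span (chromEnd minus chromStart), and break ties by the most recent gbStatus.modDate. This is only my own candidate rule, not something UCSC recommends, and the span is not the same as spliced transcript length. The function name and dictionary keys are hypothetical; the values are copied from the two rows above.

```python
from datetime import date


def pick_one_per_gene(rows):
    """Keep one accession per gene: largest span first, then newest modDate."""
    best = {}
    for row in rows:
        # Tuples compare element by element: span first, then date.
        key = (row["end"] - row["start"], date.fromisoformat(row["mod_date"]))
        gene = row["gene"]
        # Strict ">" means that on a full tie the first row seen is kept.
        if gene not in best or key > best[gene][0]:
            best[gene] = (key, row["acc"])
    return {gene: acc for gene, (_, acc) in best.items()}


# Values taken from the two TSPO rows above.
rows = [
    {"gene": "TSPO", "acc": "NM_000714", "start": 43547519,
     "end": 43559248, "mod_date": "2013-10-03"},
    {"gene": "TSPO", "acc": "NM_001256531", "start": 43547519,
     "end": 43559248, "mod_date": "2014-01-26"},
]
print(pick_one_per_gene(rows))  # {'TSPO': 'NM_001256531'}
```

With this rule the TSPO tie falls to NM_001256531 only because its modification date is later, which is why I am unsure whether the date is a meaningful criterion.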
I also looked at the archives and FAQ and found the following note. Is it possible that multiple transcripts were chosen as canonical for the same gene by mistake?
RE: [genome] Multiple transcripts position to a single gene position [message #11012 is a reply to message #10995], Tue, 02 October 2012 10:55
…
There is also a table in the UCSC Genes track called knownCanonical which
contains one canonical transcript per gene. This table is not perfect and
in some rare cases (such as with WASH7P), there is more than one canonical
transcript reported. We are currently working on revising the method used
to select canonical transcripts to correct this, but as mentioned above, it
is rare that more than one canonical transcript is reported. If you would
like to revise your existing query to use the knownCanonical table instead,
just replace "knownGene" with "knownCanonical" and replace "alignID" with
"transcript":