tx, cds, exon start and stop sites

Rodney T Perry

Oct 30, 2013, 3:51:32 PM

to gen...@soe.ucsc.edu, Howard William Wiener

Dear Sir/Madame:

I have two technical questions about the definition and annotation of transcription, coding sequence, and exon start and stop sites. I will use the gene GKAP1 (hg19; chr9:86,354,000-86,445,000) as an example. It has 3 variants coded on the negative strand. When I download these from the Table Browser, it shows the following for the first variant, uc004amy:

Tx start and stop are: 86,354,335-86,432,752.

Cds start and stop are: 86,354,611-86,421,432.

Exon starts are:

86354335,86356866,86357444,86363223,86368172,86383732,86395296,86399629,86403515,86414099,86421216,86431910,86432431

Exon stops are:

86354659,86356944,86357515,86363287,86368274,86383885,86395319,86399753,86403593,86414243,86421475,86432042,86432752,

I have no problem with the Tx start and stop.

I have no problem with the Cds start, 86,354,611. It starts at the end of the 3’UTR (86,354,335-86,354,610), at what should be the beginning of the last exon. This is clear in the Genome Browser window and is confirmed by the CCDS track, which shows the last exon but not the 3’UTR segment.

However, the Table Browser lists the first exon start at 86,354,335, which is the same as the Tx start and so includes the 3’UTR.

So, my first question is: how can you (correctly) list different start and stop coordinates for transcription and coding sequence (because the transcript includes the 5’UTR and the 3’UTR while the coding sequence does not) and then include the 5’UTR and 3’UTR as part of the first and last exons, respectively? Can you clarify this?

The second question is for the same variant. The Tx stop is 86,432,752, but the Cds stop is 86,421,432. This means that the last 2 exons, 86,431,910-86,432,042 and 86,432,431-86,432,752, are spliced out. In fact, they are spliced out of all 3 variants. So why are these last 2 exons called exons? I thought the definition of an exon is “a segment of a gene that is retained during splicing for translation”. Or are they named exons because the DNA sequence shows the “characteristics” of an exon, such as the recognition nucleotides at the splice donor and acceptor sites of exons and introns?

These just seem confusing to me and I appreciate the assistance and clarification.

Best,

Rodney Perry

Matthew Speir

Nov 5, 2013, 3:03:49 PM

to Rodney T Perry, gen...@soe.ucsc.edu, Howard William Wiener

Hello Rodney,

Transcription is the creation of the mRNA from the gene in the DNA. As you know, the exons at the 5' and 3' ends of the mRNA often contain untranslated regions (or UTRs), and sometimes these UTRs contain introns as well. The introns are removed, and what's left is the mature mRNA that contains only the exons. You can see an example of this in the following session: http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=mspeeeer&hgS_otherUserSessionName=hg19_MLQ_Example. In that example, the first exon in view contains both translated and untranslated regions, and the subsequent exons are all pieces of the 3' UTR. We define an exon as any part of a gene that is included in the mature mRNA, regardless of whether that exon is translated. NCBI defines exons in a similar way, as seen in this entry from their glossary (http://www.ncbi.nlm.nih.gov/books/NBK21106/):

    exon
    Refers to the portion of a gene that encodes for a part of that gene's mRNA. A gene may comprise many exons, some of which may include only protein-coding sequence;
    however, an exon may also include 5' or 3' untranslated sequence. Each exon codes for a specific portion of the complete protein. In some species (including humans),
    a gene's exons are separated by long regions of DNA (called introns or sometimes “junk DNA”) that often have no apparent function but have been shown to encode small
    untranslated RNAs or regulatory information. (See also splice sites.)
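
To make the definition concrete, here is a minimal Python sketch that splits each exon of the uc004amy example from this thread into noncoding (UTR) and coding bases. The coordinates are copied from the original question. The function name `split_exon` is hypothetical, and the sketch assumes the intervals are half-open, [start, end), as in UCSC gene tables.

```python
# Coordinates for uc004amy (GKAP1, hg19, chr9) as posted in this thread.
# Assumption: every interval is half-open, [start, end).
TX_START, TX_END = 86354335, 86432752
CDS_START, CDS_END = 86354611, 86421432
EXON_STARTS = [86354335, 86356866, 86357444, 86363223, 86368172, 86383732,
               86395296, 86399629, 86403515, 86414099, 86421216, 86431910,
               86432431]
EXON_ENDS = [86354659, 86356944, 86357515, 86363287, 86368274, 86383885,
             86395319, 86399753, 86403593, 86414243, 86421475, 86432042,
             86432752]


def split_exon(start, end, cds_start, cds_end):
    """Return (bases below the CDS, coding bases, bases above the CDS).

    "Below" and "above" refer to genomic coordinates, not to 5'/3';
    which side is 5' depends on the strand.
    """
    below = max(0, min(end, cds_start) - start)
    coding = max(0, min(end, cds_end) - max(start, cds_start))
    above = max(0, end - max(start, cds_end))
    return below, coding, above


parts = [split_exon(s, e, CDS_START, CDS_END)
         for s, e in zip(EXON_STARTS, EXON_ENDS)]

for (s, e), (below, coding, above) in zip(zip(EXON_STARTS, EXON_ENDS), parts):
    print(f"{s}-{e}: noncoding_below={below} coding={coding} noncoding_above={above}")
```

With these numbers, the first listed exon splits into 276 noncoding and 48 coding bases, and the last two listed exons are entirely noncoding. They are still exons because they are part of the transcript; the exon list spans the full Tx range, not just the Cds range.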

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Rodney T Perry

Nov 6, 2013, 11:02:36 AM

to gen...@soe.ucsc.edu

Matthew,

Thanks for the explanation. OK, I understand that the 5’UTR and 3’UTR are counted as part of an exon, which explains why the transcription start and exon start coordinates are the same but the transcription start and coding sequence start coordinates are different.

As a technical point, you say to look at the “first exon in view” in your example. It looks like you mean on the left side of the browser? This gene is coded on the opposite strand, so that exon would be the fourth exon transcribed, going right to left in this browser window. So when you say “the subsequent exons that are all pieces of the 3’UTR”, these are really pieces of the 5’UTR, right? If so, then according to NCBI’s definition, you are saying these are exons because they code for part of the protein, and that is the only difference between these exons and the introns that are spliced out, right? But what if this gene has no mature mRNAs/transcripts that retain these exons for translation? How do you know they code for part of the protein if they are eventually spliced out and never translated to protein? Is it the recognition nucleotides at the intron/exon junctions that I mentioned before? This appears to be the situation with the 2 exons in the gene example I used in my original message, GKAP1.
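
The strand question can be checked mechanically. The following is a minimal Python sketch, not a UCSC tool: the function names `classify_exon` and `exons_in_transcript_order` are hypothetical, the coordinates are the uc004amy values from the first message, and intervals are assumed to be half-open, [start, end).

```python
# Values from the uc004amy (GKAP1) example; the gene is on the negative strand.
STRAND = "-"
CDS_START, CDS_END = 86354611, 86421432
EXONS = [(86354335, 86354659), (86356866, 86356944), (86357444, 86357515),
         (86363223, 86363287), (86368172, 86368274), (86383732, 86383885),
         (86395296, 86395319), (86399629, 86399753), (86403515, 86403593),
         (86414099, 86414243), (86421216, 86421475), (86431910, 86432042),
         (86432431, 86432752)]


def classify_exon(start, end, cds_start, cds_end, strand):
    """Label an exon by where it lies relative to the coding sequence."""
    if end <= cds_start:
        side = "low"    # wholly below the CDS in genomic coordinates
    elif start >= cds_end:
        side = "high"   # wholly above the CDS in genomic coordinates
    else:
        return "contains coding sequence"
    # On the + strand the low-coordinate end is 5'; on the - strand it is 3'.
    five_prime_side = "low" if strand == "+" else "high"
    return "5' UTR exon" if side == five_prime_side else "3' UTR exon"


def exons_in_transcript_order(exons, strand):
    """Order exons 5' to 3': ascending for +, descending for -."""
    return sorted(exons, reverse=(strand == "-"))


for start, end in exons_in_transcript_order(EXONS, STRAND):
    print(start, end, classify_exon(start, end, CDS_START, CDS_END, STRAND))
```

Under these assumptions, the two exons above the Cds stop (86,431,910-86,432,042 and 86,432,431-86,432,752) come out as 5' UTR exons and are the first two exons in transcript order, because the transcript reads right to left on the negative strand.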

Sorry for the continued questions about this basic information, but I just want to be precise about the terms and definitions used. Thanks again, Matthew.

Rodney Perry

Galt Barber

Nov 7, 2013, 2:03:53 PM

to Rodney T Perry, gen...@soe.ucsc.edu

http://en.wikipedia.org/wiki/Exon

"However, the term exon is often misused to refer only to coding sequences for the final protein."