Dear Sir/Madam:
I have two technical questions about the definition and annotation of transcription, coding sequence, and exon start and stop sites. I will use the gene GKAP1 (hg19; chr9:86,354,000-86,445,000) as an example. There are 3 variants coded on the negative strand. When I download these from the Table Browser, it shows the following for the first variant, uc004amy:
Tx start and stop are: 86,354,335-86,432,752.
Cds start and stop are: 86,354,611-86,421,432.
Exon starts are:
86354335,86356866,86357444,86363223,86368172,86383732,86395296,86399629,86403515,86414099,86421216,86431910,86432431
Exon stops are:
86354659,86356944,86357515,86363287,86368274,86383885,86395319,86399753,86403593,86414243,86421475,86432042,86432752,
I have no problem with the Tx start and stop.
I have no problem with the Cds start, 86,354,611. It starts at the end of the 3’UTR (86,354,335-86,354,610) and at what should be the beginning of the last exon. This is clear in the genome browser window and is confirmed by the CCDS track, which shows the last exon but not the 3’UTR segment.
However, the Table Browser lists the exon start at 86,354,335, which is where the Tx start is. Does that mean the exon includes the 3’UTR?
So, my first question is how can you (correctly) list different start and stop coordinates for transcription and coding sequences (because the transcription includes the 5’UTR and the 3’UTR while the coding sequence doesn’t include them) and then turn around and include the 5’UTR and 3’UTR as part of the first exon and last exon, respectively? Can you clarify this?
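To make the first question concrete, the coordinates above can be compared directly. The following is a minimal Python sketch, not anything produced by the Table Browser: the helper name classify_exons is hypothetical, and it assumes the start and end values can be treated as half-open intervals in a single coordinate system. It labels each listed exon as entirely inside the Cds range, entirely outside it, or straddling a Cds boundary.

```python
# Coordinates copied from the message above (uc004amy, hg19, negative strand).
CDS_START, CDS_END = 86354611, 86421432
EXON_STARTS = [86354335, 86356866, 86357444, 86363223, 86368172, 86383732,
               86395296, 86399629, 86403515, 86414099, 86421216, 86431910,
               86432431]
EXON_ENDS = [86354659, 86356944, 86357515, 86363287, 86368274, 86383885,
             86395319, 86399753, 86403593, 86414243, 86421475, 86432042,
             86432752]


def classify_exons(exon_starts, exon_ends, cds_start, cds_end):
    """Label each exon 'coding', 'utr', or 'mixed' (hypothetical helper).

    Assumes half-open intervals, so end - start is the exon length.
    """
    labels = []
    for start, end in zip(exon_starts, exon_ends):
        # Number of bases of this exon that fall inside the Cds interval.
        overlap = max(0, min(end, cds_end) - max(start, cds_start))
        if overlap == 0:
            labels.append("utr")      # exon lies entirely outside the Cds
        elif overlap == end - start:
            labels.append("coding")   # exon lies entirely inside the Cds
        else:
            labels.append("mixed")    # exon straddles a Cds boundary
    return labels


labels = classify_exons(EXON_STARTS, EXON_ENDS, CDS_START, CDS_END)
for start, end, label in zip(EXON_STARTS, EXON_ENDS, labels):
    print(start, end, label)
```

Under these assumptions, the first and eleventh listed exons straddle a Cds boundary, the nine between them fall entirely inside the Cds range, and the last two listed exons fall entirely outside it.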
The 2nd question is for the same variant. The Tx stop is 86,432,752, but the Cds stop is 86,421,432. This means that the last 2 exons, 86,431,910-86,432,042 and 86,432,431-86,432,752, are spliced out. In fact, they are spliced out of all 3 variants. So why are these last 2 exons called exons? I thought the definition of an exon is “a segment of a gene that is retained during splicing for translation”. Or are they named exons because the DNA sequence shows the “characteristics” of an exon, such as the recognition nucleotides at the splice donor and acceptor sites of exons and introns?
These points are confusing to me, and I would appreciate your assistance and clarification.
Best,
Rodney Perry
--
Matthew,
Thanks for the explanation. OK, I understand that the 5’UTR and 3’UTR are counted as part of an exon, which explains why the transcription start and exon start coordinates are the same but the transcription start and coding sequence start coordinates are different.
As a technical point, you say to look at the “first exon in view” in your example. It looks like you mean on the left side of the browser? This gene is coded on the opposite strand so that exon would be the fourth exon that is transcribed going right to left in this browser window. So when you say, “the subsequent exons that are all pieces of the 3’UTR”, these are really pieces of the 5’UTR, right? If so, then according to NCBI’s definition, you are saying these are exons because they code for part of the protein, and that is the only difference between these exons and the introns that are spliced out, right? But what if this gene has no mature mRNAs/transcripts that retain these exons for translation… how do you know they code for part of the protein if they are eventually spliced out and never translated to protein? Is it the recognition nucleotides at the intron/exon junctions that I mentioned before? This appears to be the situation with the 2 exons in the gene example I used below, GKAP1.
Sorry for the continued questions about this basic information, but I just want to be precise about the terms and definitions used. Thanks again, Matthew.
Rodney Perry
--