refGene table; possible issue with some exonFrames(?)


CH Albach

Dec 31, 2014, 12:46:53 PM
to gen...@soe.ucsc.edu, Samuel Gross, John Bates
Hi UCSC,

I've written some code which generates exon frame data, and noticed an inconsistency with the exonFrames field when processing 392 of the 56836 lines in the current refGene table (as downloaded from here, where it had an upload timestamp of "21-Dec-2014 23:28").

From my understanding of the exonFrames field, it should be possible to derive it entirely from the cds{Start,End} and exon{Starts,Ends} fields. I've found little documentation on exonFrames, though, so please correct me if there are edge cases I haven't accounted for.

I've attached the offending lines of refGene.txt to this email.

Take, for example, line 320 of refGene.txt:
73 NM_001282171 chr19_KI270922v1_alt + 92406 143045 92443 123738 9 92406,95658,97510,100973,123126,123693,123733,123844,143022, 92477,95958,97804,101024,123231,123731,123746,124394,143045, 0 KIR3DS1 cmpl cmpl 0,1,1,1,1,2,1,-1,-1,

First, note that this is a forward-strand feature. Next, note exons 4 and 5 (0-indexed):
  • Exon 4: [123126, 123231), frame=1 // Note: this exon has a length divisible by 3.
  • Exon 5: [123693, 123731), frame=2
I expected exon 5 to also have frame 1, since exon 4 has a length of 105 (105%3 = 0).
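The derivation described above can be sketched as follows. This is only a minimal illustration, not the code that produced the 392-line count: it assumes 0-based, half-open coordinates as in the table, and it assumes that an exon's frame is the number of CDS bases preceding it in transcription order, modulo 3. The function and variable names are hypothetical.

```python
def naive_exon_frames(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Derive exon frames from genomic coordinates only.

    Assumes 0-based, half-open intervals and that no alignment gaps exist
    between the transcript and the genome (an assumption, see lead-in).
    Exons with no coding bases get frame -1.
    """
    n = len(exon_starts)
    frames = [-1] * n
    # Walk exons in transcription order: left-to-right on "+", reversed on "-".
    order = range(n) if strand == "+" else range(n - 1, -1, -1)
    coding_bases = 0  # CDS bases seen so far
    for i in order:
        start = max(exon_starts[i], cds_start)
        end = min(exon_ends[i], cds_end)
        if start >= end:
            continue  # exon lies entirely in the UTR
        frames[i] = coding_bases % 3
        coding_bases += end - start
    return frames


def parse_int_list(field):
    """Parse a comma-terminated list such as '0,1,1,' into a list of ints."""
    return [int(x) for x in field.split(",") if x]


# The refGene row quoted above, split on whitespace.
row = ("73 NM_001282171 chr19_KI270922v1_alt + 92406 143045 92443 123738 9 "
       "92406,95658,97510,100973,123126,123693,123733,123844,143022, "
       "92477,95958,97804,101024,123231,123731,123746,124394,143045, "
       "0 KIR3DS1 cmpl cmpl 0,1,1,1,1,2,1,-1,-1,").split()

naive = naive_exon_frames(row[3], int(row[6]), int(row[7]),
                          parse_int_list(row[9]), parse_int_list(row[10]))
reported = parse_int_list(row[15])
mismatches = [i for i, (a, b) in enumerate(zip(naive, reported)) if a != b]
print("naive:   ", naive)
print("reported:", reported)
print("mismatching exons (0-indexed):", mismatches)
```

For this row the sketch assigns frame 1 to exon 5 and frame 0 to exon 6, whereas the table reports 2 and 1.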

Is this a data issue, or am I misunderstanding something about the field? It would also be useful to understand how this field is generated.

Thanks!
CH
refGene_frameIssue.txt

Brian Lee

Jan 5, 2015, 2:35:24 PM
to CH Albach, gen...@soe.ucsc.edu, Samuel Gross, John Bates
Dear CH,

Thank you for using the UCSC Genome Browser and for your question about exonFrames in the refGene table. Thank you also for including the details of NM_001282171 as a useful example.

Below is a session link that loads the browser around exon 5 (zero-indexed, as you kindly noted) and displays the custom track of the transcript shown below: http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=hg38.NM_001282171.exon5

track name="NM_001282171" description="refGene NM_001282171 chr19_KI270922v1_alt:92407-143045; Exon5 deletion 123,732-123,733"
chr19_KI270922v1_alt 92406 143045 NM_001282171 0 + 92443 123738 0 9 71,300,294,51,105,38,13,550,23, 0,3252,5104,8567,30720,31287,31327,31438,50616,

The exon frames come from the mRNA, not the genome, and the example you provided is a transcript with deletions relative to the reference. In the attached session you will see that there is a codon with a gap at chr19_KI270922v1_alt:123,732-123,733.

Since the coding region is determined from the mRNA transcript, not from the aligned genomic chunk, exonFrames cannot be derived from the refGene cds and exon start/end table values, as it sounds like you are trying to do in your code.
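As a rough cross-check, the gap described above is also visible in the table coordinates: exon 5 ends at 123731 and exon 6 starts at 123733 (0-based, half-open), leaving 2 bases between them, which is too short to be a plausible intron. The sketch below flags such tiny gaps between consecutive exons. It is only a heuristic under stated assumptions: the function name and the `max_gap` threshold of 10 bases are hypothetical choices, not part of any UCSC tool, and a flagged gap only suggests an alignment gap rather than proving one.

```python
def tiny_gaps(exon_starts, exon_ends, max_gap=10):
    """Return (exon_index, gap_start, gap_end) for each pair of consecutive
    exons separated by 1..max_gap bases (0-based, half-open coordinates).

    max_gap is an assumed threshold; such short gaps more likely reflect
    mRNA-to-genome alignment gaps than real introns.
    """
    gaps = []
    for i in range(len(exon_starts) - 1):
        gap = exon_starts[i + 1] - exon_ends[i]
        if 0 < gap <= max_gap:
            gaps.append((i, exon_ends[i], exon_starts[i + 1]))
    return gaps


# Exon coordinates of NM_001282171 from the refGene row in this thread.
exon_starts = [92406, 95658, 97510, 100973, 123126, 123693, 123733, 123844, 143022]
exon_ends = [92477, 95958, 97804, 101024, 123231, 123731, 123746, 124394, 143045]

flagged = tiny_gaps(exon_starts, exon_ends)
for index, gap_start, gap_end in flagged:
    print(f"gap after exon {index}: [{gap_start}, {gap_end}), "
          f"{gap_end - gap_start} bases")
```

For this transcript the only flagged gap is the one after exon 5, [123731, 123733), which corresponds to 123,732-123,733 in the browser's 1-based coordinates.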

You can also see these alignment details by clicking a RefSeq gene in the browser. The details page has a section titled "mRNA/Genomic Alignments", where you can click the link titled "View details of parts of alignment within browser window." There, scroll down to the section titled "Side by Side Alignment" to see dots indicating a deletion in the alignment:

000896 a..ag 000898
>>>>>> |  || >>>>>>
123731 aacag 123735

I hope this information was helpful. Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group

CH Albach

Jan 7, 2015, 5:21:24 PM
to Brian Lee, gen...@soe.ucsc.edu, Samuel Gross, John Bates
Thanks Brian for the thorough explanation. You've addressed all of my questions/concerns.