Hi Jim,
Thank you for the quick reply.
As mentioned in my previous mail, there is a 2 bp insertion (a frameshift) in my genome, which is why I thought my stop codon had shifted a couple of hundred bases to the left. So I thought, why not change the CDS end (to the base where I think my actual stop codon lies) and see whether the protein translation I get is correct. To my surprise, it translates correctly.
![](https://lh3.googleusercontent.com/-R_xMxX3mYKg/Wuf1NJqZmVI/AAAAAAAAAWE/N4VLsxX7VN0AastxPg7EYCZdzckyjgmpACLcBGAs/s1600/20.png)
![](https://lh3.googleusercontent.com/-_0plk7TBxZk/Wuf1zGRG5RI/AAAAAAAAAWM/5Ca7zXsyxEY-0AsPIUV79LMRE0-uVl-iACLcBGAs/s1600/21.png)
As you can see here, once I corrected the stop codon I get the translation I expected (see the 3rd line and gene_STOP_corrected..).
To check further, I compared it with human hg19 from Broad. Below is the image for the same annotation:
![](https://lh3.googleusercontent.com/-eU4fUb9pVIA/Wuf33q2F3gI/AAAAAAAAAWk/BvhCYpAhYTgzcQCyyM5fP9QQdvpfQNUtACLcBGAs/s1600/23.png)
As you can see, the two match.
Additionally, if there is an insertion (whose length is not a multiple of 3) in the middle of an exon, the translated protein must be the same in both cases (with and without the insertion) up to the point of insertion; after that it may change. To see whether this is happening:
Below is the image of human hg19 (from Broad):
![](https://lh3.googleusercontent.com/-IqDO5yhECjo/Wuf50mqAnmI/AAAAAAAAAW0/GpSdlLjvwSAwtPJQTOX4TxwIQyMxew-cwCLcBGAs/s1600/25.png)
Below is the image of my genome, with a GT insertion after ATTTT..:
![](https://lh3.googleusercontent.com/-HJxK5jHjYLU/Wuf6qplxA3I/AAAAAAAAAXQ/K8fHYWl2GOgDGdU48Sw8SDn7bec5a3IIACLcBGAs/s1600/26.png)
As seen above, before the insertion both proteins are the same: <- ...TEEEGRD... <- (only if you compare against the STOP-corrected one)
Only at the point of insertion do the amino acids change: <- ..WWRF TEE.. <- in hg19
<- ..GGLV TEE.. <- in my genome
So my C-terminus is a little different from the annotated one (that is fine).
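The expectation that a frameshift leaves the protein unchanged up to the insertion can be illustrated with a minimal Python sketch. The sequence below is a toy example written 5' to 3' in the direction of transcription (it is not the actual gene, which lies on the reverse strand), and the helper name `translate` is made up for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate seq from its first base; a trailing partial codon is ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# Toy coding sequence (hypothetical), already in transcription orientation.
reference = "ATGACCGAAGAAGAAGGTCGTGAC"
# Insert GT after the 9th base, i.e. after the third codon.
mutant = reference[:9] + "GT" + reference[9:]

print(translate(reference))  # MTEEEGRD
print(translate(mutant))     # MTEVKKVV
```

The first three residues (MTE) are identical in both translations; everything after the 2 bp insertion is read in a different frame.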
----------------------------------------------------------------------------------------------------------------------
So what I think is happening here is that IGV parses the positive strand from left to right (using the coordinates from the refSeq file)
(cdsStart begins at the T on the positive strand; for a reverse-strand gene this is in reality the CDS end)
- - - - --- > positive strand
Let's suppose the sequence is TCACTGCATGCT... (the same sequence as in my case, positive strand)
What it actually does is break it into 3-mers:
5' TCA CTG CAT GCT .. 3'
then reverse complement it:
5' TCA CTG CAT GCT .. 3'
3' AGT GAC GTA CGA 5'
<--------------- direction of transcription, as it goes from 5' to 3'
and translate it:
TGA - stop codon
CAG - Q
ATG - M
AGC - S
Stop, Q, M, S... <- the same thing you see in IGV (for my genome, starting from the stop codon)
I think it does not consider the frame information when translating (at least for the last exons of genes transcribed from the reverse strand) and uses it only for display.
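The behaviour hypothesised above can be reproduced with a small Python sketch. It only illustrates the hypothesis and is not IGV's actual code; the function names `left_anchored` and `transcript_anchored` are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def left_anchored(plus_seq):
    """Hypothesised behaviour: cut the positive strand into 3-mers starting
    at its left end, then reverse complement and translate each 3-mer."""
    usable = len(plus_seq) - len(plus_seq) % 3
    return [CODON_TABLE[revcomp(plus_seq[i:i + 3])] for i in range(0, usable, 3)]


def transcript_anchored(plus_seq):
    """Reference behaviour: reverse complement the whole sequence, translate
    from the transcript's 5' end, and report residues in genomic
    (left-to-right) order so the two results are comparable."""
    rc = revcomp(plus_seq)
    usable = len(rc) - len(rc) % 3
    return [CODON_TABLE[rc[i:i + 3]] for i in range(0, usable, 3)][::-1]


print(left_anchored("TCACTGCATGCT"))          # ['*', 'Q', 'M', 'S']
print(transcript_anchored("TCACTGCATGCT"))    # ['*', 'Q', 'M', 'S']
# Two extra bases on the right: the length is no longer a multiple of 3.
print(left_anchored("TCACTGCATGCTGT"))        # ['*', 'Q', 'M', 'S']
print(transcript_anchored("TCACTGCATGCTGT"))  # ['S', 'C', 'A', 'T']
```

When the length is a multiple of 3, both approaches agree. Once it is not, the left-anchored reading stays in the frame set by the left end, while a reading anchored at the transcript's 5' end changes.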
....................................................
To validate this further, I changed the frame info in human hg19 (from Broad) to each of the three frames.
The image is attached below:
![](https://lh3.googleusercontent.com/-oJ919extKF4/WugCPTm0QlI/AAAAAAAAAXs/x3g94xyiTqoVbdcX8yP7tmWnAEyKP1BXwCLcBGAs/s1600/27.png)
The image above shows the exon end.
![](https://lh3.googleusercontent.com/-rysSjpTgHdQ/WugCkI_KdII/AAAAAAAAAX0/pTL-TTZocbMCNfjFf2cAp_sxaRRkMeIqACLcBGAs/s1600/28.png)
The image above shows the exon start.
As we can see, all three proteins are the same, and the frame information is used only to display the starting nucleotides.
P.S. The majority of RefSeq genes have frame information exactly in sync with the exon lengths. (There are very rare cases where the frame information does not agree with the exon length; this happens when the alignment between the mRNA and hg19 is not 100% identical, i.e. hg19 has indels with respect to the mRNA.) As a result, in 99.9% of cases the translation obtained while parsing the positive strand from right to left is accurate (assuming the annotation is 100% perfect and has absolutely zero errors), but in a few borderline cases it fails.
The genome I am working on is a derivative of hg19 with indels introduced into it, and I have written scripts to lift over the hg19 UCSC refGene annotation to my genome. (Since I had the VCF file with all the information about the introduced indels, the liftover is more or less accurate; I have verified and double-checked it, and also used it to build a SnpEff database.)
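What "frame information in sync with exon length" means can be sketched as a small consistency check. The sketch assumes that a frame value gives the offset (0, 1 or 2) of an exon's first coding base within its codon, counted in the direction of transcription, and that the exons are supplied in transcription order; the function name and the example numbers are hypothetical.

```python
def inconsistent_exons(coding_lengths, frames):
    """Return the indices of exons whose frame does not follow from the
    previous exon's frame and coding length.

    Assumption: frames[i] is the position within a codon (0, 1 or 2) of the
    first coding base of exon i, and exons are listed in transcription order.
    """
    bad = []
    for i in range(len(frames) - 1):
        expected = (frames[i] + coding_lengths[i]) % 3
        if frames[i + 1] != expected:
            bad.append(i + 1)
    return bad


# Consistent annotation: 100 coding bases, then 50, then 30.
print(inconsistent_exons([100, 50, 30], [0, 1, 0]))  # []
# Same frames, but the first exon gained 2 bases from an insertion.
print(inconsistent_exons([102, 50, 30], [0, 1, 0]))  # [1]
```

An exon flagged by such a check is one where translating purely by exon length and translating by the annotated frame would disagree.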
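The coordinate shift behind such a liftover can be sketched roughly as follows. This is a simplified illustration and not the actual scripts: it assumes 1-based positions and VCF-style (pos, ref, alt) records that are sorted and non-overlapping, and it does not handle positions that fall inside a deleted stretch. The function name `lift` and the example records are hypothetical.

```python
def lift(pos, indels):
    """Map a 1-based hg19 position to the derived genome.

    indels: list of (pos, ref, alt) tuples in VCF style, where the first
    base of ref and alt is the shared anchor base.
    Assumption: pos does not lie inside a deleted region.
    """
    shift = 0
    for vpos, ref, alt in indels:
        # The variant affects only positions after the end of its ref allele.
        if vpos + len(ref) - 1 < pos:
            shift += len(alt) - len(ref)
    return pos + shift


indels = [
    (100, "T", "TGT"),  # 2 bp insertion after position 100
    (200, "ACG", "A"),  # 2 bp deletion of positions 201-202
]
print(lift(100, indels))  # 100
print(lift(101, indels))  # 103
print(lift(203, indels))  # 203
```

Because refGene stores 0-based, half-open intervals, exon and CDS coordinates would need converting to this 1-based convention (or the comparison adjusting) before applying such a shift.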
Looking forward to hearing from you.
Yours sincerely,
Harsh