Issue with ref-seq frame information and translation

harsh shukla

Apr 30, 2018, 6:48:57 AM
to igv-help
Hi,
 
  Hope you are doing fine.

1) My IGV version and machine are as follows

  IGV :   Version 2.3.97 (157)
   OS :   Ubuntu 16.04

I am working with a different human reference genome (closely related to hg19) and a corresponding annotation file in refSeq format. One gene, GPATCH4, is transcribed from the negative strand, and its frame information is as follows:

GPATCH4   cmpl   cmpl   2,2,0,0,0,1,1,0,
exon number             8 7 6 5 4 3 2 1

Since the gene is on the negative strand, the last exon (exon 8) is the leftmost one, and its refSeq frame field is 2. A frame of 2 means that the last 2 bases of the previous exon and the 1st base of the current exon form a codon. Hence, for the current exon, the first complete codon starts at base 2.
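The frame list above can be reproduced from exon and CDS coordinates. Here is a minimal Python sketch, assuming (as this post reads the field) that an exon's frame is the number of coding bases upstream of it in transcription order, modulo 3, and that coordinates are 0-based and half-open as in UCSC genePred tables. The helper name `exon_frames` and the toy coordinates are made up for illustration; they are not GPATCH4's real coordinates.

```python
def exon_frames(exon_starts, exon_ends, cds_start, cds_end, strand):
    """Return one frame per exon, in genomic (left-to-right) order.

    Hypothetical helper: frame = coding bases upstream of the exon
    (in transcription order) modulo 3; -1 for exons with no coding bases.
    """
    n = len(exon_starts)
    # Transcription order: left to right on '+', right to left on '-'.
    order = range(n) if strand == "+" else range(n - 1, -1, -1)
    frames = [-1] * n
    coding_upstream = 0
    for i in order:
        start = max(exon_starts[i], cds_start)
        end = min(exon_ends[i], cds_end)
        if start >= end:
            continue  # exon lies entirely in the UTR
        frames[i] = coding_upstream % 3
        coding_upstream += end - start
    return frames


# Toy three-exon gene (made-up coordinates).
starts, ends = [0, 20, 40], [10, 31, 50]
minus_frames = exon_frames(starts, ends, 5, 48, "-")
plus_frames = exon_frames(starts, ends, 5, 48, "+")
print(minus_frames)  # [1, 2, 0]
print(plus_frames)   # [0, 2, 1]
```

On the negative strand the rightmost exon is read first, so it gets frame 0; this is consistent with the trailing 0 (exon 1) in the GPATCH4 frame list.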

Here the sequence shown is the flipped (reverse) strand.


Technically, the correct translation here is the 3rd line of the translated frames (since it is the reverse strand):

R    A    A
CGT  GCC  GCT


I didn't know what was happening, so I created two temporary files with different frame information:

1,2,0,0,0,1,1,0,
0,2,0,0,0,1,1,0,




This made no sense at all: even after changing the frame info, IGV still gives me the same protein in all cases. Also, the translated nucleotide sequence corresponds only to frame 0 (refGene_0) in all cases (see the 1st line of translated frames, NHDW..).


I decided to look at the ends of these exons. No matter what the frame information is, IGV always translates the stop codon first and then moves right, translating three nucleotides at a time (in reverse order), so it effectively discards the frame information. This should not happen; it should start translating from the right and move left. (Here the 3rd frame line, ...SRGDSC.. <-, is actually the correct frame of translation.)

My guess is that IGV is interpreting and translating from left to right, when it should go from right to left and then assign amino acids.

P.S. All this hoopla is happening because I have a 2-nucleotide insertion somewhere in the middle of exon 8, so the C-terminal chain is a little different from the annotated one. I also think that, because of the insertion, my actual stop codon has shifted around a hundred bases to the left.

Small red box on line 3 on left side of image

Original stop codon: 1st line; red box above the CDS end of exon 8

James Robinson

Apr 30, 2018, 2:24:20 PM
to igv-help
This is a curious case, could you attach a sample file?   

IGV translates from right->left for negative strand genes; it would be a major bug noticed by many people otherwise. It works correctly for all refGene files downloaded from UCSC. What program produced your annotation? I think there might be some error or confusion around the interpretation of frame. This is not the same as GFF phase. It's been some time since I looked at this, but it does work for UCSC annotation files (i.e. the various "genePred" formats as downloaded from their site).



--

---
You received this message because you are subscribed to the Google Groups "igv-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to igv-help+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/igv-help/375d5aab-9380-45dd-87ff-f50d690263b5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

James Robinson

Apr 30, 2018, 2:37:59 PM
to igv-help
An idea, if you are able to do it:  Upload your file to UCSC in  "bigGenePred" format to see how it interprets it.  Instructions are here:  https://genome.ucsc.edu/goldenPath/help/bigGenePred.html


harsh shukla

May 1, 2018, 2:20:43 AM
to igv-help
Hi Jim ,

       Thank You for the quick reply.

   As mentioned in my previous mail, there is a 2-base-pair insertion (a frameshift) in my genome, which is why I thought my stop codon was shifting a couple of hundred bases to the left. So I thought, why not change the CDS end (to the base where I think my actual stop codon lies) and see whether the resulting protein translation is correct. To my surprise, it translates correctly.




As you can see here, once I corrected the stop codon, I get the translation I thought was correct (see the 3rd line and gene_STOP_corrected..).

To check further, I compared it with human hg19 from Broad. Below is the image for the same annotation.


As you can see, the two match.


Additionally, if there is an insertion (whose length is not a multiple of 3) in the middle of an exon, the protein translated in both cases (with and without the insertion) must be the same up to the point of insertion; after that it may change. To see whether this is happening:
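The frameshift argument can be illustrated with a small Python sketch. It assumes the sequence is already in transcript orientation (for a negative-strand gene, that is the reverse complement of the genomic sequence) and uses the standard genetic code. The toy sequence and the helper names `translate`, `insert_bases`, and `shared_prefix` are made up for illustration and are not taken from GPATCH4.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate from the first base, ignoring a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def insert_bases(seq, pos, bases):
    """Insert `bases` before 0-based position `pos`."""
    return seq[:pos] + bases + seq[pos:]


def shared_prefix(a, b):
    """Longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


reference = "ATGGCCGCTCGTGAAGAATAA"           # toy coding sequence
mutated = insert_bases(reference, 9, "GT")    # 2-base insertion after codon 3

ref_protein = translate(reference)
mut_protein = translate(mutated)
print(ref_protein)                              # MAAREE*
print(mut_protein)                              # MAAVVKN
print(shared_prefix(ref_protein, mut_protein))  # MAA
```

The two proteins agree for the three codons upstream of the insertion and diverge after it, and the original stop codon is no longer read in frame.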

Below is an image of human hg19 (from Broad):


Below is an image of my genome, with a GT insertion after ATTTT..:



As seen, before the insertion both proteins are the same: <- ...TEEEGRD... <- (only if you compare against the STOP-corrected track).
Only at the point of insertion do the amino acids change:

<- ..WWRF  TEE.. <-  in hg19
<- ..GGLV  TEE.. <-  in my genome

So my C-terminus is a little different from the annotated one (that is fine).

----------------------------------------------------------------------------------------------------------------------

So what I think is happening here is that IGV is parsing the positive strand from left to right (using the coordinates from the refSeq file).

(cdsStart is at the T on the positive strand; for the reverse strand it is in reality the CDS end.)
                                              - - - - --- > positive strand
Let us suppose the sequence is TCACTGCATGCT... (the same sequence as in my case, positive strand).

What it is actually doing is breaking it into 3-mers

5'  TCA CTG CAT GCT ..     3'

Reverse complementing it

5'   TCA CTG CAT GCT ..  3'
3'   AGT GAC GTA CGA     5'

    <---------------    direction of transcription, as it goes from 5' to 3'

and translating it

TGA -  stop codon
CAG -  Q
ATG -  M
AGC -  S

 Stop.  QMS...<-     The same thing you see in IGV (for my genome, starting from the stop codon)
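The hypothesis above can be written out as a short Python sketch. It only reproduces the behaviour suspected in this post and is not IGV's actual code; the function names are made up for illustration. The first function chunks the positive strand into 3-mers from the left (the stop-codon end) and reverse-complements each chunk; the second anchors the reading frame at the 5' end of the transcript instead.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate_left_anchored(plus_seq):
    """Suspected behaviour: 3-mers counted from the left end of the
    positive strand, each reverse-complemented and translated."""
    usable = len(plus_seq) - len(plus_seq) % 3
    codons = [plus_seq[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[revcomp(c)] for c in codons)


def translate_transcript_anchored(plus_seq):
    """Frame anchored at the 5' end of the negative-strand transcript,
    i.e. at the right end of the positive-strand sequence."""
    transcript = revcomp(plus_seq)
    usable = len(transcript) - len(transcript) % 3
    return "".join(CODON_TABLE[transcript[i:i + 3]] for i in range(0, usable, 3))


seq = "TCACTGCATGCT"  # example sequence from the post
print(translate_left_anchored(seq))              # *QMS
print(translate_transcript_anchored(seq))        # SMQ*

# Drop one base from the right end: the two anchorings now disagree.
print(translate_left_anchored(seq[:-1]))         # *QM
print(translate_transcript_anchored(seq[:-1]))   # ACS
```

When the sequence length is a multiple of 3, both functions read the same codons (in opposite order), which would explain why the difference only shows up when the frame and the exon lengths are out of sync.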

I think IGV is not using the frame information for translation (at least for the last exons of genes transcribed from the reverse strand) and is using it only for display.
....................................................

To validate this further, I changed the frame info in human hg19 (from Broad) to each of the three frames.
The images are attached below.

Above image: exon end


Above image: exon start

As we can see, all three proteins are the same, and IGV is using the frame information only to display the starting nucleotides.


P.S. The majority of refSeq genes have frame information exactly in sync with the exon lengths. There are very rare cases where the frame information does not agree with the exon length; this happens when the mRNA-to-hg19 alignment is not 100% identical, i.e. hg19 has indels with respect to the mRNA. As a result, in 99.9% of cases the translation obtained while parsing the positive strand from right to left is accurate (assuming the annotation is 100% perfect and has no errors), but in a few borderline cases it fails.

The genome I am working on is a derivative of hg19 with indels introduced into it, and I have written scripts to lift over the hg19 UCSC refGene annotation to my genome. (Since I had the VCF file with all the information about the introduced indels, the liftover is more or less accurate; I have verified and double-checked it, and also used it to build a SnpEff database.)

Looking forward to hearing from you.


Yours sincerely,
Harsh

Harsh Shukla

May 1, 2018, 3:49:51 AM
to igv-...@googlegroups.com
Hi James,

  I forgot to attach the hg19 annotation files.

gene_hg19_orignal_refGene.txt  -  original annotation from the UCSC refGene table;  frame of exon 8 (last exon; first w.r.t. the positive strand): 2
gene_hg19_refGene_1            -  frame of the last exon modified;                  frame of exon 8 (last exon; first w.r.t. the positive strand): 1
gene_hg19_refGene_0            -  frame of the last exon modified;                  frame of exon 8 (last exon; first w.r.t. the positive strand): 0
  


Regards,
Harsh

gene_hg19_orignal_refGene.txt
gene_hg19_refGene_0.txt
gene_hg19_refGene_1.txt

James Robinson

May 1, 2018, 12:29:24 PM
to igv-help
I will look at this when I can make some time. My interest here is that the code correctly interprets the annotation as written with respect to the UCSC specification. The test for this will be to take the same annotation, upload it to UCSC, and compare the result. It would speed things along if you could prepare a bigGenePred, as described in the link I sent, to do this.



Harsh Shukla

May 2, 2018, 6:53:16 AM
to igv-...@googlegroups.com
Hi James,

   I followed the steps you mentioned and was able to make it work. I extracted only the GPATCH4 records from the original hg19 refSeq annotation file. The annotation was in genePredExt format.

I converted it to a bigGenePred file using the commands below:

./genePredToBigGenePred <hg19_genePredExt> stdout | sort -k1,1 -k2,2n > out.bgpInput
./bedToBigBed -type=bed12+8 -tab -as=bigGenePred.as out.bgpInput hg19.chrom.sizes <hg19_output.bb>

I did this for three files:

gene_hg19_orignal_frame_2.genePredExt  --  original file, frame of exon 8: 2
gene_hg19_frame_1.genePredExt          --  frame of exon 8: 1 (changed)
gene_hg19_frame_0.genePredExt          --  frame of exon 8: 0 (changed)


The following are the corresponding output binary files (bigGenePred):

I copied hg19_output.bb into my GitHub repository and finally managed to create my custom track (for the UCSC Genome Browser).




The results are as follows (snapshot of the start of exon 8):

As we can see, changing the frame info does change the protein being translated.



In IGV, however, that is not the case:




RefSeq gene              -  original, 8th exon frame info: 2
gene_hg19_refGene_1.txt  -  8th exon frame info: 1 (changed)
gene_hg19_refGene_0.txt  -  8th exon frame info: 0 (changed)


I am attaching the genePredExt and bigGenePred (.bb) files used to create the custom tracks mentioned above (IGV_files.zip). I am also attaching screenshots with details about my custom tracks (Custom_track_details.zip). I hope you find it all useful and satisfactory.

Looking forward to hearing from you.

Regards,
Harsh


IGV_files.zip
Custom_track_details.zip

James Robinson

May 2, 2018, 1:48:33 PM
to igv-help
Thanks for the test case, that will be very helpful.


James Robinson

May 7, 2018, 12:48:25 AM
to igv-help
OK, I think I've got this fixed; however, there is a discrepancy with the UCSC screenshot. For frame = 1, UCSC shows the first AA of exon 8 to be K. However, the last nucleotide of exon 7 is A, so the first codon of exon 8 with frame = 1 is AGG => R. This is what IGV shows with the fixes. I don't know what UCSC is doing here; since it is a nonsensical annotation, maybe that is not important.
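The junction codon described above can be sketched in a few lines of Python. This is an illustration of the frame arithmetic only, not IGV's implementation; the helper name `junction_codon` is made up. It assumes both sequences are given in transcript orientation and that the frame is the number of bases the codon borrows from the previous exon. Only the bases named in this post are used (exon 7 ends in A, and exon 8 starts with GG, as implied by AGG).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def junction_codon(prev_exon_tail, exon_head, frame):
    """First codon touching this exon, in transcript orientation.

    `frame` bases come from the end of the previous exon and
    `3 - frame` bases from the start of the current exon.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    borrowed = prev_exon_tail[len(prev_exon_tail) - frame:]
    return borrowed + exon_head[:3 - frame]


codon = junction_codon("A", "GG", 1)  # last base of exon 7, first two of exon 8
amino = CODON_TABLE[codon]
print(codon, amino)  # AGG R
```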

It will take a few days to roll this out in a release.   Thank you again for the test data.
