Script to get the longest isoform for each 'gene'


Jacqueline Farrell

May 21, 2015, 10:44:28 AM
to trinityrn...@googlegroups.com
Hello all -- 

Before I start writing my own script, I was wondering: is there a script already included with Trinity that pulls out the longest isoform for each gene? Thanks in advance for your help.

Cheers, 

Jacqueline

Mark Chapman

May 21, 2015, 11:02:46 AM
to Jacqueline Farrell, trinityrn...@googlegroups.com
Hi Jacqueline,

There's a Python script available that takes an old-format trinity.fasta file. Maybe you could make a few modifications so that it accepts the newly formatted ones (i.e. the TRx|...etc nomenclature):
If you get it to work, feel free to post it to the group :)

Best wishes, Mark

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
Dr. Mark A. Chapman
+44 (0)2380 594396
------------------------------------
Centre for Biological Sciences
University of Southampton
Life Sciences Building 85
Highfield Campus
Southampton
SO17 1BJ

Tiago Hori

May 21, 2015, 11:19:27 AM
to Mark Chapman, Jacqueline Farrell, trinityrn...@googlegroups.com
There may be another way. TrinityStats.pl works by creating an array that contains the longest isoforms; that is how it calculates the average.

It should be super simple to tweak that.

T.

Sent from my iPhone

Farbod Emami

May 21, 2015, 11:28:26 AM
to trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
Dear Tiago, hi.
In which file is the name of only the longest isoform per gene stored while TrinityStats.pl is running?



Tiago Hori

May 21, 2015, 11:34:24 AM
to Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
It doesn't create a file. It creates a Perl array. All I am saying is that one could, if one wanted, grab that array and write a piece of code that iterates over it and prints its contents.

It may be easier than writing from scratch, although if the Python alternative works as is, that is easier.

<digression> 
I should really have learned Python.
</digression>


Brian Haas

May 21, 2015, 2:30:07 PM
to Tiago Hori, Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
The longest transcript isn't always the 'best' transcript....  but this has been asked for so many times, I'll just write the script and post it shortly.

~b

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Brian Haas

May 21, 2015, 2:30:36 PM
to Tiago Hori, Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
yes, we should all learn python!  ... and if I didn't have all this legacy perl code...   ;)

Brian Haas

May 21, 2015, 2:47:19 PM
to Tiago Hori, Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
Just drop the attached script into your TRINITY_HOME/util/misc folder and it should do the trick.

best,

~b
get_longest_isoform_seq_per_trinity_gene.pl
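
For readers without the attachment, the idea can be sketched in Python. This is not the attached Perl script, only a minimal sketch under two assumptions: headers look like the `TRx|c0_g1_i1` form discussed in this thread, and the gene identifier is the transcript identifier with its trailing `_i<N>` removed. The function names are hypothetical.

```python
import re

# Assumption: the isoform number is the trailing "_i<N>" of the identifier,
# so removing it leaves the Trinity 'gene' identifier.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Return {gene_id: (header, sequence)}, keeping the longest isoform.

    Ties are resolved in favour of the isoform seen first.
    """
    best = {}
    for header, seq in records:
        transcript_id = header.partition(" ")[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


def write_fasta(records, handle):
    """Write (header, sequence) pairs, one sequence per line."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")
```

A possible way to use it, with placeholder file names:

```python
with open("Trinity.fasta") as src, open("longest_isoforms.fasta", "w") as dst:
    write_fasta(longest_isoform_per_gene(read_fasta(src)).values(), dst)
```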

Ken Field

May 21, 2015, 5:41:04 PM
to Brian Haas, Tiago Hori, Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
Thanks Brian! And now could you just write a script that outputs the best transcript? ;)




--
Ken Field, Ph.D.
Associate Professor of Biology
Program in Cell Biology/Biochemistry
Bucknell University
Room 203A Biology Building

Brian Haas

May 21, 2015, 5:44:59 PM
to Ken Field, Tiago Hori, Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
Ha!  I was thinking the same thing when I put this together.  It would be doable, but not something that would be done in just a few minutes...   It would involve the Trinotate report and the RSEM expression matrix.

I'm also curious about what the utility would be for it, as I try to make the claim that you simply don't have to do these things:


I'm clearly biased...

~b
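
The expression half of that idea can be sketched on its own. The following is only an illustration, not a "best transcript" picker: it ignores annotation entirely and simply keeps the most highly expressed isoform of each gene from a tab-delimited table. The column names `transcript_id`, `gene_id`, and `TPM` are assumptions about the table layout and may need adjusting; the function name is hypothetical.

```python
import csv


def top_expressed_isoform(handle, gene_col="gene_id",
                          tx_col="transcript_id", value_col="TPM"):
    """Return {gene_id: transcript_id} for the highest-valued isoform per gene.

    `handle` is an open tab-delimited file with a header row. The default
    column names are assumptions, not a guaranteed format.
    """
    best = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        gene = row[gene_col]
        value = float(row[value_col])
        # Keep the first isoform seen when two isoforms tie.
        if gene not in best or value > best[gene][1]:
            best[gene] = (row[tx_col], value)
    return {gene: tx for gene, (tx, _) in best.items()}
```

A combined choice would still need a rule for weighing expression against annotation evidence, which is the part that takes more than a few minutes.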

Tiago Hori

May 21, 2015, 6:30:50 PM
to Ken Field, Brian Haas, Farbod Emami, trinityrn...@googlegroups.com, markcha...@gmail.com, jacquelin...@gmail.com
Isn't that called magic?

Sent from my iPhone

Farbod Emami

May 23, 2015, 12:13:40 AM
to trinityrn...@googlegroups.com, jacquelin...@gmail.com, farbo...@gmail.com, bh...@broadinstitute.org, kfi...@bucknell.edu, markcha...@gmail.com
Dear Brian, hi.
I used your script for the longest isoforms as follows:
get_the_longest_. . .  Trinity.fasta > longest_trinity_isoforms.fasta
but it seems that the output isoforms (transcripts) do not follow the order of the transcripts in the main file. Is that right?
I mean the output is not ordered as:
Transcript No. 1 Longest isoform
Transcript No. 2 Longest isoform
Transcript No. 3 Longest isoform
and so on.
(Of course, I don't think this should be a problem!)
Thank you



Brian Haas

May 23, 2015, 9:48:32 AM
to Farbod Emami, trinityrn...@googlegroups.com, Jacqueline Farrell, Ken Field, Mark Chapman
That's right, they don't retain their original order from the Trinity.fasta file. Sequences are sorted by length in descending order, just for aesthetics.

~b
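
If the original order matters for a downstream step, it can be restored afterwards. A minimal sketch, assuming you already have the transcript ids in Trinity.fasta order and the selected sequences keyed by transcript id; `reorder_like_original` is a hypothetical helper, not part of Trinity:

```python
def reorder_like_original(original_ids, selected):
    """Return (id, sequence) pairs for the selected records in input order.

    original_ids: iterable of transcript ids in Trinity.fasta order.
    selected: dict mapping transcript id -> sequence (the kept isoforms).
    Ids that were not selected are skipped.
    """
    return [(tid, selected[tid]) for tid in original_ids if tid in selected]


# Example with made-up ids: t2 was not kept, so it is skipped.
ordered = reorder_like_original(["t1", "t2", "t3"], {"t3": "AAAA", "t1": "AC"})
```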


Joseph Lee

Jun 2, 2015, 5:12:05 AM
to trinityrn...@googlegroups.com, jacquelin...@gmail.com, kfi...@bucknell.edu, markcha...@gmail.com, farbo...@gmail.com
This works fine and REALLY helps me a lot!!!
I had been looking for a tool like this ever since I failed to write a script that does this job!
Thank you, Brian!!!

JLee


Farbod Emami

Jul 7, 2015, 5:06:22 AM
to trinityrn...@googlegroups.com, farbo...@gmail.com, markcha...@gmail.com, jacquelin...@gmail.com, kfi...@bucknell.edu
Dear Brian, hi.
I have used your script to collect the longest isoform per gene. I expected only one transcript per gene (some people call it a unigene) after running the script, but while using the TransRate program (http://hibberdlab.com/transrate/getting_started.html) to check my assembly contigs, I noticed several cases where the names begin with the same prefix. Please check this sample:

$ cat '/home/emami/Soft2/trinityrnaseq-2.0.6/util/misc/Farbod_longest_isoform.fasta'  | grep TR85783
>TR85783|c0_g2_i1 len=6498 path=[12981:0-2513 12982:2514-2543 12983:2544-6497] [-1, 12981, 12982, 12983, -2]
>TR85783|c0_g1_i1 len=6498 path=[12984:0-2513 12985:2514-2543 12986:2544-6497] [-1, 12984, 12985, 12986, -2]
>TR85783|c1_g1_i1 len=923 path=[1801:0-847 1802:848-922] [-1, 1801, 1802, -2]

Is this normal, or is there some error in the script?
Thank you
Farbod

Mark Chapman

Jul 7, 2015, 5:16:29 AM
to Farbod Emami, trinityrn...@googlegroups.com, Jacqueline Farrell, Ken Field
Hi Farbod,
The same TR number doesn't mean the same component ('gene'). It's just the last number that differs between transcripts of the same gene.
e.g.
>TR85783|c0_g2_i1
>TR85783|c1_g1_i1
are different components ('genes'), whereas
>TR85783|c0_g2_i1
>TR85783|c0_g2_i2
are the same component ('gene').
In your example, though, the first two components, although supposedly different, look the same according to the 'path'. Maybe something odd has happened here (Brian?). But the script to isolate one isoform per gene is working correctly, as per your question to the group.

Best wishes, Mark
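
The naming rule above can be checked mechanically. A minimal Python sketch, assuming identifiers follow the `TR<number>|c<number>_g<number>_i<number>` layout seen in the examples in this thread; the pattern and function names are illustrative, not part of Trinity:

```python
import re

# Assumed layout, taken from the headers quoted in this thread:
# TR<number>|c<number>_g<number>_i<number>
TRINITY_ID = re.compile(r"^>?(TR\d+\|c\d+_g\d+)_i(\d+)")


def split_trinity_id(header):
    """Return (gene_id, isoform_number) parsed from a Trinity header."""
    match = TRINITY_ID.match(header)
    if match is None:
        raise ValueError(f"unrecognised Trinity identifier: {header!r}")
    return match.group(1), int(match.group(2))


def same_gene(header_a, header_b):
    """True when two headers differ only in their isoform number."""
    return split_trinity_id(header_a)[0] == split_trinity_id(header_b)[0]
```

With this rule, `TR85783|c0_g2_i1` and `TR85783|c0_g2_i2` share a gene, while `TR85783|c0_g2_i1`, `TR85783|c0_g1_i1`, and `TR85783|c1_g1_i1` are three different genes, so one record for each of them in the output is expected.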




Cera Fisher

Jan 5, 2016, 3:37:54 PM
to trinityrnaseq-users, kfi...@bucknell.edu, tiag...@me.com, farbo...@gmail.com, markcha...@gmail.com, jacquelin...@gmail.com
Thanks for writing the script even though you didn't think we should use it. :) In my case, I've got the transcriptomes of two non-model organisms and I'm really just trying to find orthologs between the two in order to do some cross-species gene expression analysis, which I'm doing by reciprocal blast. It seemed like it would reduce complication if I only had one transcript from each gene, and it seemed simplest to choose the longest transcript for each Trinity component. I'll let you know how this goes. 
Sincerely,
a naive grad student
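
The reciprocal-BLAST step described above can be sketched as follows. This is a minimal illustration, assuming both searches were saved in BLAST's 12-column tabular format (query id in column 1, subject id in column 2, bit score in column 12) and that "best" means highest bit score; the function names are hypothetical.

```python
def best_hits(lines):
    """Map each query id to the subject id of its highest-bit-score hit."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep the first hit seen when two hits tie on bit score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

Both arguments can be open file handles, since the functions only iterate over lines. Keeping one transcript per gene on each side, as described above, means each gene contributes at most one candidate pair.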

