TransDecoder output naming conventions and ORF selection


Gwilym Haynes

Oct 5, 2015, 4:31:17 PM
to TransDecoder-users
Hi,
I am using TransDecoder to analyze transcriptomes reconstructed from RNA-Seq data, and would like to better understand how the entries in the output files are named. Below are three potential ORFs identified by TransDecoder and written to the filename.cds file. In each header, the transcript/scaffold name is repeated several times, suffixed with either "g.<number>" or "m.<number>". Do g and m stand for gene and mRNA, respectively?

>scaffold1000|size5236|m.5263 scaffold1000|size5236|g.5263  ORF scaffold1000|size5236|g.5263 scaffold1000|size5236|m.5263 type:complete len:204 (+) scaffold1000|size5236:1882-2493(+)
>scaffold1000|size5236|m.5261 scaffold1000|size5236|g.5261  ORF scaffold1000|size5236|g.5261 scaffold1000|size5236|m.5261 type:complete len:306 (+) scaffold1000|size5236:1743-2660(+)
>scaffold1000|size5236|m.5260 scaffold1000|size5236|g.5260  ORF scaffold1000|size5236|g.5260 scaffold1000|size5236|m.5260 type:5prime_partial len:1469 (-) scaffold1000|size5236:828-5234(-)


In my last analysis, TransDecoder sometimes returned multiple CDSs for a single transcript. Is there a way to get TransDecoder to select only the best CDS from each mRNA sequence?


Regards,
Gwilym Haynes

Brian Haas

Oct 6, 2015, 8:20:27 AM
to Gwilym Haynes, TransDecoder-users
Hi Gwilym,

In this case, the g-number refers to the original transcript, and the m-number to the ORF identified on that transcript.
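
For anyone who wants to pull these fields out programmatically, here is a minimal Python sketch. The header layout it assumes is inferred only from the three example lines quoted in this thread, not from any TransDecoder specification, and the function name and field names (orf_id, g_id, and so on) are made up for illustration.

```python
import re

# Assumed layout (inferred from the example headers in this thread only):
#   ><id ending in m.N> <id ending in g.N> ... type:<type> len:<N> ... <transcript>:<start>-<end>(<strand>)
HEADER_RE = re.compile(
    r"^>?(?P<orf_id>\S+)\s+(?P<g_id>\S+)\s+.*?"
    r"type:(?P<orf_type>\S+)\s+len:(?P<length>\d+)\s+.*?"
    r"(?P<transcript>\S+):(?P<start>\d+)-(?P<end>\d+)\((?P<strand>[+-])\)\s*$"
)


def parse_orf_header(header):
    """Split one .cds/.pep header line into a dict of named fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("header does not match the assumed layout: " + header)
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("length", "start", "end"):
        fields[key] = int(fields[key])
    return fields


example = (
    ">scaffold1000|size5236|m.5263 scaffold1000|size5236|g.5263  ORF "
    "scaffold1000|size5236|g.5263 scaffold1000|size5236|m.5263 "
    "type:complete len:204 (+) scaffold1000|size5236:1882-2493(+)"
)
fields = parse_orf_header(example)
print(fields["orf_id"], fields["g_id"], fields["transcript"], fields["length"])
```

In the three example headers above, the coordinate span (end - start + 1) is exactly three times the len value, so len appears to count codons on the transcript.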

Currently, there isn't an option for TransDecoder to report only the single best ORF per transcript, but we plan to integrate one in a future release, since many have asked for it.

best,

~b  

--
You received this message because you are subscribed to the Google Groups "TransDecoder-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to transdecoder-us...@googlegroups.com.
To post to this group, send email to transdeco...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/transdecoder-users/d2fbdd5a-fccb-428f-bf73-87109537fc7a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Rui Li

Mar 5, 2016, 2:43:56 PM
to TransDecoder-users, gwily...@gmail.com
Hi Brian, 

Is the one ORF option available now? Thanks!

Rui


Brian Haas

Mar 6, 2016, 8:28:14 AM
to Rui Li, TransDecoder-users, Gwilym Haynes
Hi Rui,

I've added a script, get_longest_ORF_per_transcript.pl, under TransDecoder/util.

You can drop in the attached script at that location in your current installation and it should work for you.

Note that it just pulls the single longest ORF per transcript. The longest one might not actually be the 'best' one, although most of the time it is.

      usage: ./get_longest_ORF_per_transcript.pl file.transdecoder.pep


I'm only adding it because a number of people have asked for it, not because I think it's the best way to find the 'best' ORF.
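
For readers who cannot use the attached Perl script, the same idea (keep one longest ORF per transcript) can be sketched in Python. This is not the attached script, only an illustration under stated assumptions: the first token of each header is the ORF ID, the transcript name is that ID with a trailing "|m.<number>" removed, and the header carries a len: field. The function names are hypothetical.

```python
import re

LEN_RE = re.compile(r"\blen:(\d+)")


def read_fasta(lines):
    """Yield (header_without_'>', sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_orf_per_transcript(lines):
    """Return {transcript: (header, sequence)} keeping the longest ORF of each."""
    best = {}
    for header, seq in read_fasta(lines):
        orf_id = header.split()[0]
        # Assumption: ORF IDs look like "<transcript>|m.<number>".
        transcript = re.sub(r"\|m\.\d+$", "", orf_id)
        match = LEN_RE.search(header)
        # Fall back to the sequence length if the header has no len: field.
        length = int(match.group(1)) if match else len(seq.rstrip("*"))
        # Strict '>' means the first ORF seen wins when lengths are tied.
        if transcript not in best or length > best[transcript][0]:
            best[transcript] = (length, header, seq)
    return {t: (h, s) for t, (_, h, s) in best.items()}


def write_fasta(records, handle):
    """Write the selected (header, sequence) records back out as FASTA."""
    for header, seq in records.values():
        handle.write(">" + header + "\n" + seq + "\n")
```

A typical call would open the .pep file, pass the file handle to longest_orf_per_transcript, and write the result with write_fasta. Because only the first token and the len: field are read, extra header fields should not matter, but as noted above, longest is not necessarily best.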

good luck!

~b


Attachment: get_longest_ORF_per_transcript.pl

Rui Li

Mar 7, 2016, 11:57:50 PM
to TransDecoder-users, liruir...@gmail.com, gwily...@gmail.com
Hello Brian,

I have successfully extracted the longest ORF for each transcript, in pep.fasta format.
Do you have code to get the longest ORFs for the genome.gff file as well?

Thanks!

Rui

Brian Haas

Mar 8, 2016, 6:21:32 AM
to Rui Li, TransDecoder-users, Gwilym Haynes
You could try this:

Index the GFF3 file you want to extract entries from, like so:

    util/index_gff3_files_by_isoform.pl transcripts.fasta.transdecoder.genome.gff3

Then grab the accessions for the entries you want to extract:

    grep '>' transcripts.fasta.transdecoder.longestOnly.pep | perl -lane 'if (/>(\S+)/) { print "$1";}' > accs

Finally, extract those entries from the GFF3 file:

    util/gene_list_to_gff.pl  accs transcripts.fasta.transdecoder.genome.gff3.inx > subset.gff3
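
If the util scripts are not at hand, the filtering step can be approximated in Python. This is a rough sketch, not a reimplementation of gene_list_to_gff.pl: it assumes each accession appears as the ID attribute of a feature (presumably the mRNA), that child features point to it through Parent, and that the enclosing gene is the Parent of the selected feature. The function names are hypothetical, and URL-escaped attribute values are not decoded.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column ("ID=x;Parent=y") into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def subset_gff3(lines, accessions):
    """Return the GFF3 feature lines that belong to the given accessions."""
    accessions = set(accessions)
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        records.append((line, parse_attributes(cols[8])))

    # First pass: collect the parents (e.g. genes) of the selected features.
    wanted_parents = set()
    for _, attrs in records:
        if attrs.get("ID") in accessions:
            wanted_parents.update(attrs.get("Parent", "").split(","))
    wanted_parents.discard("")

    # Second pass: keep selected features, their parents, and their children.
    kept = []
    for line, attrs in records:
        parents = set(attrs.get("Parent", "").split(","))
        feature_id = attrs.get("ID")
        if (feature_id in accessions
                or feature_id in wanted_parents
                or parents & accessions):
            kept.append(line)
    return kept
```

The accessions would be the lines of the accs file produced above, and the kept lines can be written to subset.gff3. It is worth checking the output against a few known entries, since the ID/Parent assumptions above may not hold for every TransDecoder version.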


best,


~b





Giuseppe Puglia

Jun 29, 2016, 5:46:44 AM
to TransDecoder-users, liruir...@gmail.com, gwily...@gmail.com
Hi Brian,

I'm using TransDecoder for my RNA-seq project and I would like to use the script you made to extract only the longest ORF per transcript (I know this approach may not be the best solution, but I need it to get a general idea of my data). Unfortunately, I got this error message:


Error, cannot parse header info: TCONS_00000080|m.1 TCONS_00000080|g.1 type:complete len:507 gc:universal TCONS_00000080:2829-4349(+)  at ./get_longest_ORF_per_transcript.pl line 31, <$filehandle> line 1.


Below I have copied the first two amino acid sequences from the transdecoder.pep file:


>TCONS_00000080|m.1 TCONS_00000080|g.1 type:complete len:507 gc:universal TCONS_00000080:2829-4349(+)

MEKFQSYLGLDRSQQHYFLYPLIFQEYIYVLAHDHGLNRSILLENAGYDNKSSLLIVKRLITRMYQQNHLILSVNDSKQTPFLGHNKNFYSQVMSEVSSIIMEIPLSLRLISSLEKKGVVKSDNLRSIHSIFSFLEDNFLHLNYVLDILIPYPAHLEILVQALRYWIKDASSLHLLRFFLHECHNWDNLITSNSKKASSSFSKRNHRLFFFLYTSHVCEYESGFIFLRNQSSHLRSTSSGALLERIYFYGKMEHLAEVFARAFQANLWLFKDPFMHYVRYQGKSILASKGTFLLMNKWKYYFVNFWKSYFYLWSEPGRIYINQLSNHSLDFLGYRSSVRLKRSMVRSQMLENAFLIDNAIKKFDTIVPIMPLIGSLAKSKFCNALGHPIGKVIWANLSDSDIIDRFGRIYRNLSHYHSGSSKKKSLYRVKYILRLSCARTLARKHKSTVRAFLKRFGSELLEEFFTEEEQVFSLTFPKVSSISRRLSRRRIWYLDIICINDLANHE*

>TCONS_00000080|m.2 TCONS_00000080|g.2 type:complete len:354 gc:universal TCONS_00000080:4876-5937(+)

MTAILERRESESLWGRFCNWITSTENRLYIGWFGVLMIPTLLTATSVFIIAFIAAPPVDIDGIREPVSGSLLYGNNIISGAIIPTSAAIGLHFYPIWEAASVDEWLYNGGPYELIVLHFLLGVACYMGREWELSFRLGMRPWIAVAYSAPVAAATAVFLIYPIGQGSFSDGMPLGISGTFNFMIVFQAEHNILMHPFHMLGVAGVFGGSLFSAMHGSLVTSSLIRETTENESANEGYRFGQEEETYNIVAAHGYFGRLIFQYASFNNSRSLHFFLAAWPVVGIWFTALGISTMAFNLNGFNFNQSVVDSQGRVINTWADIINRANLGMEVMHERNAHNFPLDLAAIEAPSTNG*


The .pep file seems OK. Any idea how to solve this issue?


Thank you in advance.

Best wishes,

Giuseppe

Brian Haas

Jun 29, 2016, 6:56:44 AM
to Giuseppe Puglia, TransDecoder-users, liruir...@gmail.com, gwily...@gmail.com
Hi Giuseppe

The latest code for TransDecoder.LongOrfs should have a parameter for restricting the output to the longest ORF. Can you rerun with this option? If you need to keep your existing .pep file, I can look into modifying the script soon.

Best,

-Brian
(by iPhone)
