transdecoder after stringtie merge


Raymond

Apr 4, 2017, 8:06:17 AM
to TransDecoder-users
ls

My problem seems similar to one previously reported in this group (titled "Transdecoder after cuffmerge", dated 09-02-2016), namely the conversion from GTF to GFF.

I have a stringtie generated gtf:
head StringtieMerged.20170404.gtf
# /data/annotations/......
# StringTie version 1.3.1c
000000F|arrow   StringTie       transcript      64278   65756   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";
000000F|arrow   StringTie       exon    64278   64803   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "1";
000000F|arrow   StringTie       exon    65112   65470   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "2";
000000F|arrow   StringTie       exon    65649   65756   1000    -       .       gene_id "MSTRG.1"; transcript_id "MSTRG.1.1"; exon_number "3";
000000F|arrow   StringTie       transcript      67344   69649   1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";
000000F|arrow   StringTie       exon    67344   67702   1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "1";
000000F|arrow   StringTie       exon    67851   67922   1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "2";
000000F|arrow   StringTie       exon    68037   68137   1000    -       .       gene_id "MSTRG.2"; transcript_id "MSTRG.2.1"; exon_number "3";

but after conversion it looks as follows:

head StringtieMerged.20170404.gff
000000F|arrow   Cufflinks       match   2025117 2025392 100     +       .       ID=GENE^MSTRG.81,TRANS^MSTRG.81.1;Target=GENE^MSTRG.81,TRANS^MSTRG.81.1 1 276 +
000000F|arrow   Cufflinks       match   2027771 2027909 100     +       .       ID=GENE^MSTRG.81,TRANS^MSTRG.81.1;Target=GENE^MSTRG.81,TRANS^MSTRG.81.1 277 415 +
000000F|arrow   Cufflinks       match   2028345 2028653 100     +       .       ID=GENE^MSTRG.81,TRANS^MSTRG.81.1;Target=GENE^MSTRG.81,TRANS^MSTRG.81.1 416 724 +

000000F|arrow   Cufflinks       match   2025139 2027831 100     +       .       ID=GENE^MSTRG.81,TRANS^MSTRG.81.3;Target=GENE^MSTRG.81,TRANS^MSTRG.81.3 1 2693 +
000000F|arrow   Cufflinks       match   2028390 2028902 100     +       .       ID=GENE^MSTRG.81,TRANS^MSTRG.81.3;Target=GENE^MSTRG.81,TRANS^MSTRG.81.3 2694 3206 +


In the original thread I cannot see whether there was a solution.
If there was, could you provide it?
Many thanks,

Raymond

Brian Haas

Apr 5, 2017, 8:03:03 PM
to Raymond, TransDecoder-users
Hi Ray,

In the sample directory, there's an example that uses a cufflinks output file.  I haven't explored using this with stringtie yet, but it's on my list.

~brian

--
You received this message because you are subscribed to the Google Groups "TransDecoder-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to transdecoder-users+unsub...@googlegroups.com.
To post to this group, send email to transdecoder-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/transdecoder-users/19e8b2ae-1d94-4e78-8927-efbcd1468065%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Stefan Reuscher

Apr 21, 2017, 5:00:12 AM
to TransDecoder-users, r...@keygene.com
Jumping onto this dying thread.

I seem to have problems connecting hisat -> stringtie -> TransDecoder.
Coming from:
stringtie --merge ./stringtie_out/*.gtf -o ./stringtie_out/merged.gtf

I try to use:

cufflinks_gtf_genome_to_cdna_fasta.pl ./stringtie_out/merged.gtf ./data_in/OL_genome.fa > ./TransDecoder_out/transcripts.fasta

This results in

-parsing cufflinks output: ./stringtie_out/merged.gtf
Use of uninitialized value $type in string eq at /usr/NGS_tools/TransDecoder-3.0.1/util/cufflinks_gtf_genome_to_cdna_fasta.pl line 39, <$fh> line 1.
Use of uninitialized value $type in string eq at /usr/NGS_tools/TransDecoder-3.0.1/util/cufflinks_gtf_genome_to_cdna_fasta.pl line 39, <$fh> line 2.
-parsing genome fasta: ./data_in/OL_genome.fa
-done parsing genome.
// processing chromosome01_38544486bp
// processing chromosome02_33003716bp

A FASTA file is produced, however.

Then using the stringtie-way of making transcripts

gffread -w ./TransDecoder_out/transcripts.fasta -g ./data_in/OL_genome.fa ./stringtie_out/merged.gtf

also works. The files have the same number of seqs but different seq names:

> names(transcripts_cuff[1:10])
 [1] "MSTRG.2594.1 MSTRG.2594" "MSTRG.1203.6 MSTRG.1203" "MSTRG.1203.1 MSTRG.1203"
 [4] "MSTRG.1203.2 MSTRG.1203" "MSTRG.1203.4 MSTRG.1203" "MSTRG.1203.3 MSTRG.1203"
 [7] "MSTRG.1203.5 MSTRG.1203" "MSTRG.1203.7 MSTRG.1203" "MSTRG.1122.1 MSTRG.1122"
[10] "MSTRG.3742.1 MSTRG.3742"

> names(transcripts_gffread[1:10])
 [1] "MSTRG.1.1 gene=MSTRG.1" "MSTRG.1.2 gene=MSTRG.1" "MSTRG.2.1 gene=MSTRG.2"
 [4] "MSTRG.2.2 gene=MSTRG.2" "MSTRG.3.1 gene=MSTRG.3" "MSTRG.3.2 gene=MSTRG.3"
 [7] "MSTRG.4.1 gene=MSTRG.4" "MSTRG.5.1 gene=MSTRG.5" "MSTRG.6.1 gene=MSTRG.6"
[10] "MSTRG.7.1 gene=MSTRG.7"
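
One way to check that the two FASTA files describe the same transcripts despite the different header styles is to compare only the first whitespace-delimited token of each header. The sketch below is a minimal illustration, not part of TransDecoder or gffread. The function name `fasta_ids` and the placeholder sequences are hypothetical, and it assumes the transcript ID is always the first token, as in the names shown above.

```python
def fasta_ids(lines):
    """Collect transcript IDs: the first whitespace-delimited token of each '>' header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


# Header styles copied from the two listings above; the sequences are placeholders.
cuff_style = [">MSTRG.1.1 MSTRG.1", "ACGT", ">MSTRG.2.1 MSTRG.2", "ACGT"]
gffread_style = [">MSTRG.1.1 gene=MSTRG.1", "ACGT", ">MSTRG.2.1 gene=MSTRG.2", "ACGT"]

only_in_cuff = fasta_ids(cuff_style) - fasta_ids(gffread_style)
only_in_gffread = fasta_ids(gffread_style) - fasta_ids(cuff_style)
print(len(only_in_cuff), len(only_in_gffread))
```

For real files, pass an open file handle instead of the lists; two empty difference sets would mean both tools emitted the same transcript IDs, only in a different order and with different descriptions.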

I could also replicate the problem above.

I can make TransDecoder work, however there are warnings and errors along the way.

My question is: Is TransDecoder safe to run on stringtie output? Are there any plans from the devs to check that?

Best Regards,

Stefan



Brian Haas

Apr 21, 2017, 8:17:56 AM
to Stefan Reuscher, TransDecoder-users, Raymond
I'll work on this later today and see how it goes.

stay tuned...

~b


Brian Haas

Apr 21, 2017, 1:42:29 PM
to Stefan Reuscher, TransDecoder-users, Raymond

I took a look into this.  The couple of error messages encountered during the gtf to gff3 conversion are harmless...  The code wasn't ignoring the couple of comment lines at the top of the stringtie gtf file. I've updated the devel code so it'll ignore the comments now.
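
For anyone on an older release who would rather avoid the messages altogether, the comment lines can be dropped before conversion. The following is only a sketch of that idea, not the change made in the devel code. The function name `strip_gtf_comments` is hypothetical, and it assumes comment lines start with `#`, as in the StringTie header shown earlier in the thread.

```python
def strip_gtf_comments(lines):
    """Yield only GTF feature lines, skipping '#' comment lines and blank lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield line


# Two header comments and one feature line, modelled on the StringTie output above.
sample = [
    "# /data/annotations/......\n",
    "# StringTie version 1.3.1c\n",
    "\t".join([
        "000000F|arrow", "StringTie", "transcript", "64278", "65756",
        "1000", "-", ".", 'gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    ]) + "\n",
]

cleaned = list(strip_gtf_comments(sample))
print(len(cleaned))
```

With real data, iterate over the open GTF file and write the yielded lines to a new file, then pass that file to the conversion script.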

For converting the stringtie gtf to the transcript fasta file, you'd use the script included in TransDecoder that we use similarly for the cufflinks example:

$TransDecoder/util/cufflinks_gtf_genome_to_cdna_fasta.pl stringtie_merged.gtf  genome.fasta > stringtie_merged.transcripts.fasta


If you want to pull the latest code:

     git clone https://github.com/TransDecoder/TransDecoder.git


you'll find a stringtie example in the sample_data/ directory along with a runMe.sh script that outlines the process.   Here, I'm using the stringtie data that comes from running the Tuxedo2 pipeline in their latest protocol paper.

best,

~brian


   



Stefan Reuscher

Apr 23, 2017, 8:00:17 PM
to TransDecoder-users, reusche...@gmail.com, r...@keygene.com
Dear Brian,

thanks for your effort. I understand the errors thrown now and will not worry about them.

I also inspected the results visually. I loaded stringtie_merged.gtf, stringtie_merged.gff3 (twice: once generated with your .pl script and once with StringTie's gffread utility) and the final TransDecoder.genome.gff into the same genome browser, and could confirm that the output appears consistent between tools and conversions.

I have one more question, if you don't mind. When converting the transcript-based GFF to the genome-based GFF I get about 2,000 warnings:

Warning [2411], shouldn't have a minus-strand ORF on a spliced transcript structure. Skipping entry Gene.105676::MSTRG.9956.2::g.105676::m.105676.

Any idea where this could come from?

Regards,

Stefan

Brian Haas

Apr 23, 2017, 8:42:02 PM
to Stefan Reuscher, TransDecoder-users, Raymond
Good to hear!

You can ignore those warnings.  Basically, it just means that the ORF that was predicted is found to be in the antisense orientation when taking into account the spliced orientation of the transcript sequence on the genome.

If you run TransDecoder in strand-specific mode, then you should have ORFs that only match the transcribed orientation of the stringtie transcripts. 
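
To see which transcripts are affected by these warnings, the skipped entries can be tallied from a saved log. This is a rough sketch only. The function name `skipped_transcripts` is hypothetical, and the assumption that the transcript ID is the second `::`-separated field is inferred from the single warning line quoted in this thread, not from the TransDecoder source.

```python
from collections import Counter


def skipped_transcripts(log_lines):
    """Count skipped antisense ORF entries per transcript ID."""
    marker = "Skipping entry "
    counts = Counter()
    for line in log_lines:
        if "minus-strand ORF" not in line or marker not in line:
            continue
        # Drop the sentence-final period, then split the entry name on '::'.
        entry = line.split(marker, 1)[1].strip().rstrip(".")
        fields = entry.split("::")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


# The warning line quoted earlier in the thread, plus an unrelated log line.
log = [
    "Warning [2411], shouldn't have a minus-strand ORF on a spliced transcript "
    "structure. Skipping entry Gene.105676::MSTRG.9956.2::g.105676::m.105676.",
    "-done parsing genome.",
]

counts = skipped_transcripts(log)
print(counts.most_common(5))
```

Transcripts that appear in this tally simply lose the antisense ORF in the genome-based output; per the explanation above, that is expected behaviour rather than an error.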

best,

~brian
