How to get just Trinity "genes" from the assembly file?


setar...@gmail.com

Sep 29, 2015, 4:34:59 AM
to trinityrnaseq-users

Dear friends,

I have done a transcriptome assembly using Trinity and now I would like to get just "genes" from the assembly file. Could you please help me out to this end?

Thanks,
Mary

Laurent Marc

Sep 29, 2015, 7:20:26 AM
to trinityrnaseq-users

This command shows the number of loci:
grep '^>' trinity_out_dir/Trinity.fasta | sed 's/TR[0-9]*[|]//' | cut -f1,2 -d"_" | sort | uniq | wc -l

And this one shows how many transcripts there are within each locus:
grep '^>' trinity_out_dir/Trinity.fasta | sed 's/TR[0-9]*[|]//' | cut -f1,2 -d"_" | sort | uniq -c
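
For anyone who prefers a script to a shell pipeline, here is a minimal Python sketch of the same per-gene counting. It assumes each header's first token ends in an isoform suffix of the form `_i<number>` and treats everything before that suffix as the gene name. Unlike the `sed` step, it keeps any `TR<n>|` prefix, on the assumption that the prefix, when present, is part of the gene identifier, so the counts may differ from the pipeline's. The function names are illustrative and are not part of Trinity.

```python
import re
from collections import Counter

# Assumption: isoforms are marked by a trailing "_i<number>" on the first
# whitespace-delimited token of the FASTA header.
ISOFORM_SUFFIX = re.compile(r"^(?P<gene>.+)_i\d+$")


def gene_id(header_line):
    """Return the gene part of a header line, or None if it does not match."""
    tokens = header_line.lstrip(">").split()
    if not tokens:
        return None
    match = ISOFORM_SUFFIX.match(tokens[0])
    return match.group("gene") if match else None


def transcripts_per_gene(lines):
    """Count isoforms per gene from an iterable of FASTA lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            gene = gene_id(line)
            if gene is not None:
                counts[gene] += 1
    return counts
```

Passing an open file handle for `Trinity.fasta` to `transcripts_per_gene` gives the per-gene isoform counts; `len()` of the result is the number of genes.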


Do you agree?

Laurent --

setar...@gmail.com

Sep 29, 2015, 7:32:03 AM
to trinityrnaseq-users
Thanks for your command, Laurent. As you mentioned, the first command gives the number of genes in the assembly, but it is just a count, isn't it? I need the "gene" sequences for ORF prediction and further analysis. Could you please help me with that?

Ken Field

Sep 29, 2015, 8:08:36 AM
to setar...@gmail.com, trinityrnaseq-users
Mary-
I had a similar problem when I first started working with Trinity. What you need to realize is that there is no single sequence for what Trinity calls a gene. A Trinity gene is a collection of related transcripts. Assuming that this is a de novo assembly, the transcripts are related by having shared kmers during the assembly stage.

For each gene you could try to identify a consensus sequence, or you could take the longest sequence, or you could take the most abundant sequence (this is probably the best option). You could then call that sequence your gene sequence, but it wouldn't really be true. I found that it was best to keep things as transcripts when dealing with sequences, and I used TransDecoder for ORF prediction.
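
As a rough illustration of the "longest sequence" option only (not the consensus or most-abundant options), here is a minimal Python sketch. It assumes isoform headers end in `_i<number>` and that everything before that suffix names the gene; `read_fasta` and `longest_isoform_per_gene` are hypothetical helper names, not Trinity utilities. Picking the most abundant isoform would additionally need an expression table, which is not shown here.

```python
import re

# Assumption: the first header token looks like "<gene>_i<number>".
ISOFORM_SUFFIX = re.compile(r"^(?P<gene>.+)_i\d+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Map gene id -> (header, sequence) of its longest isoform.

    Ties are resolved in favour of the isoform seen first.
    """
    best = {}
    for header, seq in records:
        tokens = header.split()
        if not tokens:
            continue  # skip records with an empty header
        match = ISOFORM_SUFFIX.match(tokens[0])
        gene = match.group("gene") if match else tokens[0]
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

The result is one representative transcript per gene, which is a convenience rather than a true gene sequence.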

Ken

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
Ken Field, Ph.D.
Associate Professor of Biology
Program in Cell Biology/Biochemistry
Bucknell University
Room 203A Biology Building

Tiago Hori

Sep 29, 2015, 8:14:35 AM
to Ken Field, setar...@gmail.com, trinityrnaseq-users
I think we all did. Think of a gene as a group, not as a unit. It is a group of sequences that we assume are bound by either paralogy or alternative splicing (or both).

I agree with Ken that using consensus is not ideal, you may be throwing things that are very different functionally into the same bag. As Ken said highest expression is probably better, but I would also look at DE from the isoform standpoint.

T.

"Profanity the is the only language all programmers understand" 
Sent from my iPhone, the universal excuse for my poor spelling.

setar...@gmail.com

Sep 29, 2015, 8:23:07 AM
to trinityrnaseq-users
Thanks Ken. You're right. I filtered lowly supported transcripts with an FPKM cutoff of 1. Since I have no previous experience, I was planning to try other approaches, but I gather that filtering based on expression is probably the best way. You mentioned TransDecoder; I would be happy to hear your opinion about my ORF prediction with TransDecoder.
 
I made a de novo transcriptome assembly using Illumina reads (PE, 100 bp) generated from a strand-specific library (FR), then filtered lowly supported transcripts as mentioned above. I tried to find ORFs using the TransDecoder tool within the Trinity package (20140717). On one hand, given the strand-specific RNA-seq library, I used the -S flag as recommended in the TransDecoder guide, which generated 13430 peptide sequences (longest_orfs.pep), while without -S the longest_orfs.pep file contains 46823 sequences. Does this mean that many ORFs are located on the minus strand? Is that normal for such a library? On the other hand, based on a blastx search of the assembly against UniProt, many hits were on the reverse strand, as the qstart was greater than the qend. In your professional view, can this result confirm that many ORFs are located on the minus strand?
I'm new to this field, so please help me find out what happened. It has made me doubt the strand-specific library. I would highly appreciate it if you could share what you know.
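
The qstart/qend comparison can be tallied with a short script. This is a minimal Python sketch, assuming the blastx results were saved in the standard 12-column tabular format (`-outfmt 6`, with qstart and qend in columns 7 and 8) and that the first row for each query is its best hit; both are assumptions about how the search was run. The function names are illustrative.

```python
from collections import Counter


def strand_of_hit(fields):
    """Classify a tabular BLAST row by query orientation.

    Assumes 12-column tabular output: qstart is column 7 and qend is
    column 8 (1-based). qstart > qend means the hit is on the reverse
    strand of the query transcript.
    """
    qstart, qend = int(fields[6]), int(fields[7])
    return "plus" if qstart <= qend else "minus"


def tally_top_hit_strands(lines):
    """Count plus/minus orientation using the first row seen per query."""
    seen = set()
    tally = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        if query in seen:
            continue  # keep only the first (assumed best) hit per query
        seen.add(query)
        tally[strand_of_hit(fields)] += 1
    return tally
```

If the transcripts were assembled in the sense orientation, most top hits would be expected on the plus strand; a large minus fraction would be consistent with the library orientation having been specified backwards.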

Many thanks,






Tiago Hori

Sep 29, 2015, 8:30:33 AM
to setar...@gmail.com, trinityrnaseq-users
How many transcripts are there in your assembly? 

Are you doing DE?

T.

setar...@gmail.com

Sep 29, 2015, 8:38:21 AM
to trinityrnaseq-users
There were about 170,000 transcripts in the assembly, but after removing lowly supported transcripts that dropped to 72,000. I plan to do DE analysis on this assembly file containing 72,000 transcripts.



Tiago Hori

Sep 29, 2015, 8:49:04 AM
to setar...@gmail.com, trinityrnaseq-users
Ok.

So, if your libraries are truly stranded, the vast majority of your transcripts, if not all, should hit one strand. That is because when you select the stranded option in Trinity, it ignores all assemblies that involve reverse complements. There is one assumption here: that your libraries are truly stranded, i.e. there were no issues with library construction. Obviously, if there was a problem during library prep and the unwanted strand was allowed to be included too, then you might have a mix. When you take out the -S option you are allowing for ORFs in both orientations, and realistically you should not have those.

Having said that, you would be annotating less than one third of your transcripts with TransDecoder.

Have you re-mapped the original reads to the assembly using a stranded option?

T.

Ken Field

Sep 29, 2015, 8:49:11 AM
to setar...@gmail.com, trinityrnaseq-users
Mary-
I think that strand specific reads on the Illumina are almost always "RF" not "FR". I would make certain that you don't have it backwards before proceeding!
Ken


Tiago Hori

Sep 29, 2015, 8:58:03 AM
to Ken Field, setar...@gmail.com, trinityrnaseq-users
If you used TruSeq Stranded then they are RF for sure, because they are dUTP based. Unless something changed and Illumina didn't call me :p (they never call).

Now if you used an in house construction or a third party method things could be different!

T.

setar...@gmail.com

Sep 29, 2015, 9:00:24 AM
to trinityrnaseq-users
Thanks Tiago. Yes, about 89% of reads mapped back to the assembly, and about 62% of reads mapped more than once, which sounds normal for a Trinity assembly as it includes gene isoforms. About strandedness: I sent our sample to a sequencing center and they told me the orientation is FR. Is there any way to check the orientation to make sure it is right?





setar...@gmail.com

Sep 29, 2015, 9:04:05 AM
to trinityrnaseq-users
Tiago, actually they told us that the TruSeq Stranded mRNA Sample Prep Kit (RS-122-2101 or RS-122-2102) was used, and that the strand-specific reads generated at the centre are forward-reverse (dUTP). So it is FR, right?



Tiago Hori

Sep 29, 2015, 9:07:22 AM
to setar...@gmail.com, trinityrnaseq-users
Yes. Ask them what kit they used to construct your libraries. It sounds to me that they are either using an in-house kit or a mix between in-house and non-stranded TruSeq.


You can visualize your alignments in IGV and check whether they're aligning in the right orientation. Also, did you use the Trinity built-in alignment pipeline for quality assessment? https://github.com/trinityrnaseq/trinityrnaseq/wiki/RNA-Seq-Read-Representation-by-Trinity-Assembly

T.

Tiago Hori

Sep 29, 2015, 9:09:08 AM
to setar...@gmail.com, trinityrnaseq-users
As far as I know, dUTP generates RF libraries in Trinity nomenclature. They could be referring to the TopHat nomenclature, which is different.

T.

Tiago Hori

Sep 29, 2015, 9:15:34 AM
to setar...@gmail.com, trinityrn...@googlegroups.com
Ultimate stresses sportsmanship and fair play. Competitive play is encouraged, but NEVER AT THE EXPENSE OF RESPECT BETWEEN PLAYERS, adherence to the rules and the BASIC JOY OF PLAY.

setar...@gmail.com

Sep 29, 2015, 9:15:51 AM
to trinityrnaseq-users
Thanks Tiago for helping me. No, I didn't run the quality assessment, but I will certainly try it. If a high percentage of reads map as proper pairs, does that mean everything is OK?





Tiago Hori

Sep 29, 2015, 9:32:12 AM
to setar...@gmail.com, trinityrn...@googlegroups.com
I am not completely sure what the impact of using the wrong strandedness is. Any thoughts, Ken, Mark or Brian? I suspect it would confuse Trinity because it would increase the level of improper pairing, maybe?


Your assembly could be OK, but I suspect you are not using the full power of the stranded feature if you had the wrong orientation. I would suggest re-running the assembly. May I ask where your libraries were made?

T.

Ken Field

Sep 29, 2015, 9:53:29 AM
to Tiago Hori, setar...@gmail.com, trinityrn...@googlegroups.com
I don't understand enough about how Trinity uses orientation information to know whether they would show up as improper pairs (oriented away from each other) or if they would show up as proper pairs but just on the wrong strand. In any case, you need to know for sure if they used dUTP to mark the strand. If they did, the orientation is RF.

If you can't be certain, then I think you are better off rerunning the assembly without the strand-specific flag rather than with the wrong one.

Ken

setar...@gmail.com

Sep 29, 2015, 10:47:43 AM
to trinityrnaseq-users
Thanks for all the comments. Tiago, you mentioned that the assembly could be OK. However, I checked it by running ./bowtie_PE_separate_then_join.pl and may redo the assembly. I would still like to know about the TransDecoder results.
About the sequencing center, I just know it was done in Germany.




Brian Haas

Sep 29, 2015, 11:56:13 AM
to Ken Field, Tiago Hori, setar...@gmail.com, trinityrn...@googlegroups.com
If you set the strandedness wrong, it'll all show up as antisense.  Just revcomp the assembly and it'll be fine.
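
One way to reverse-complement every sequence in the assembly is sketched below in Python, as a minimal example rather than an official Trinity utility. It assumes a plain nucleotide FASTA, handles IUPAC ambiguity codes, and leaves headers unchanged; `revcomp_fasta` and the 60-column line width are illustrative choices.

```python
# Complement table covering A/C/G/T, IUPAC ambiguity codes and N,
# in upper and lower case.
_BASES = "ACGTRYKMSWBDHVN"
_COMPS = "TGCAYRMKSWVHDBN"
COMPLEMENT = str.maketrans(_BASES + _BASES.lower(), _COMPS + _COMPS.lower())


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_fasta(lines, width=60):
    """Yield FASTA lines with each sequence reverse-complemented.

    Headers are passed through unchanged; sequences are re-wrapped
    at `width` characters per line.
    """
    def emit(header, chunks):
        seq = reverse_complement("".join(chunks))
        yield ">" + header
        for start in range(0, len(seq), width):
            yield seq[start:start + width]

    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield from emit(header, chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield from emit(header, chunks)
```

Writing the yielded lines to a new file, one per line, gives a reverse-complemented copy of the assembly that can then be used for ORF prediction.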

-Brian
(by iPhone)
