Trinity hybrid assembly 454 + illumina single and paired end


Giorgio Casaburi

Jun 3, 2015, 10:21:33 AM
to trinityrn...@googlegroups.com
Hello,

I have a large amount of RNA-Seq data (about 1 TB) from different technologies (454 single-end; Illumina single- and paired-end), all from the same species, which has no annotated genome, across different tissues and conditions. So far I have run several Trinity assemblies on the Illumina data: one for each individual run, plus one huge assembly of all the Illumina reads merged together (the latter took about 6 TB of storage for intermediate files!).
Everything finished with no problem.

I also have a smaller amount of 454 data that I would like to integrate with the Illumina data. What would you suggest?

I thought of:  

1) Merging all the 454 data into one .fasta file and running Trinity on it.

2) Then merging all the Trinity.fasta files obtained from the 454 data, the individual Illumina runs, and the merged Illumina assembly (the huge one), and running Trinity again (maybe with normalization).

3) Doing some downstream processing, such as cd-hit-est or CAP3, to remove redundancy.

What is your suggestion?

Thanks in advance,
~Giorgio

Tiago Hori

Jun 3, 2015, 4:16:42 PM
to Giorgio Casaburi, trinityrn...@googlegroups.com
Giorgio,

I would try DETONATE to assess the quality of these assemblies and go from there.

I would take the best Illumina assembly and merge it with the best 454 assembly using cd-hit. If you make multiple assemblies per sequencing technology, I would merge those into one, but I would use DETONATE to make sure the merged assembly is better than an individual assembly.
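
A practical detail when merging assemblies before clustering: contig IDs from separate Trinity runs can collide, so it may help to tag each header with its source first. A minimal Python sketch, assuming hypothetical file names and labels (`illumina`, `454`); it only concatenates and renames, and the clustering itself is still left to cd-hit:

```python
from pathlib import Path
import tempfile


def merge_fasta(inputs, out_path):
    """Concatenate FASTA files, prefixing each header with a per-assembly label.

    `inputs` maps a label (hypothetical, e.g. "illumina") to a FASTA path.
    Returns the number of sequences written.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for label, path in inputs.items():
            with open(path) as handle:
                for line in handle:
                    # Guard against a missing newline at the end of a file.
                    if not line.endswith("\n"):
                        line += "\n"
                    if line.startswith(">"):
                        # Keep the original ID, but make it unique across assemblies.
                        out.write(f">{label}|{line[1:]}")
                        n_records += 1
                    else:
                        out.write(line)
    return n_records


if __name__ == "__main__":
    # Tiny made-up inputs so the sketch runs on its own.
    tmp = Path(tempfile.mkdtemp())
    (tmp / "illumina.fasta").write_text(">c0_seq1\nACGTACGT\n")
    (tmp / "454.fasta").write_text(">c0_seq1\nACGTTTGA\n")
    merged = tmp / "merged.fasta"
    count = merge_fasta(
        {"illumina": tmp / "illumina.fasta", "454": tmp / "454.fasta"}, merged
    )
    print(count, "sequences written to", merged)
```

The label prefix also makes it easy to see afterwards which assembly each surviving representative sequence came from.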

T.

Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Giorgio Casaburi

Jun 3, 2015, 4:22:49 PM
to trinityrn...@googlegroups.com, giorgio...@gmail.com
Thanks, Tiago! I did try the DETONATE approach, but (a) it takes forever for just a single assembly and (b) it often fails, and I can't reproduce the error.




Tiago Hori

Jun 3, 2015, 4:37:59 PM
to Giorgio Casaburi, trinityrn...@googlegroups.com
You can try looking at your read-mapping rates to get an idea of quality. Try the previous version of DETONATE if you can. Also, it needs a fair amount of memory.

T.


Giorgio Casaburi

Jun 3, 2015, 4:44:08 PM
to trinityrn...@googlegroups.com, giorgio...@gmail.com
That's a good suggestion. I'm going to try the previous version; it might work. Thanks!

Giorgio Casaburi

Jun 4, 2015, 8:59:17 AM
to trinityrn...@googlegroups.com, giorgio...@gmail.com
Hi Tiago,

Do you know if there is any way to run DETONATE (in non-reference-based mode) without using a closely related transcriptome? The species I sequenced has no annotated genome yet, and neither does its closest relative. Beyond that, the available species are phylogenetically very distant, and I'm not sure that using their transcriptomes would give a meaningful result.

Thanks!

Tiago Hori

Jun 4, 2015, 9:03:55 AM
to Giorgio Casaburi, trinityrn...@googlegroups.com
Yep. DETONATE has two components: RSEM-EVAL is reference-independent and REF-EVAL is reference-dependent. With RSEM-EVAL you take your assembly and map the reads back to it. It is like re-mapping, but with a more elegant statistical approach. It gives you two things I really like: a quality score and a list of contigs with low read support.

T.

Giorgio Casaburi

Jun 4, 2015, 9:09:20 AM
to trinityrn...@googlegroups.com, giorgio...@gmail.com
Thank you so much, Tiago, for the fastest answer ever :)! I guess my concern was about the parameter file you have to pass to RSEM-EVAL in DETONATE.
E.g.: The --transcript-length-parameters option instructs RSEM-EVAL to parameterize its prior distribution using the mean and standard deviation of the transcript lengths in the Ensembl mouse annotation. These parameters can also be estimated from a species more closely related to the one you are interested in, using ./rsem-eval/rsem-eval-estimate-transcript-length-distribution. If --transcript-length-parameters is not provided, default transcript-length parameters, estimated from the human Ensembl annotation, will be used.

How important is this parameter really?



Mark Chapman

Jun 4, 2015, 9:11:37 AM
to Giorgio Casaburi, trinityrn...@googlegroups.com
Hi Giorgio,
You can estimate this parameter from your own data, although I don't know whether you should change it between 'detonating' your 454 assembly and your Illumina assembly.


--
Dr. Mark A. Chapman
+44 (0)2380 594396
------------------------------------
Centre for Biological Sciences
University of Southampton
Life Sciences Building 85
Highfield Campus
Southampton
SO17 1BJ

Tiago Hori

Jun 4, 2015, 9:13:16 AM
to Giorgio Casaburi, trinityrn...@googlegroups.com
Just estimate it based on your own assembly. There is a tutorial buried somewhere that tells you to do that.
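
To make concrete what is being estimated: the parameter quoted earlier in the thread is described as the mean and standard deviation of transcript lengths. The Python sketch below computes those two numbers from FASTA lines. It is only an illustration, not a replacement for the rsem-eval-estimate-transcript-length-distribution script that ships with DETONATE; the function names are hypothetical, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math


def transcript_lengths(fasta_lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def length_mean_sd(lengths):
    """Return (mean, population standard deviation) of the lengths.

    Assumption: population SD; the real estimator may differ.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no sequences found")
    mean = sum(lengths) / len(lengths)
    variance = sum((x - mean) ** 2 for x in lengths) / len(lengths)
    return mean, math.sqrt(variance)


if __name__ == "__main__":
    # Made-up toy assembly: three transcripts of 8, 12 and 16 bases.
    demo = [">t1", "ACGT", "ACGT", ">t2", "ACGTACGTACGT", ">t3", "ACGTACGTACGTACGT"]
    mean, sd = length_mean_sd(transcript_lengths(demo))
    print(f"mean={mean:.1f} sd={sd:.2f}")
```

For a real assembly, the lines would come from an open file handle on the Trinity.fasta output instead of the toy list.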

T.


Tiago Hori

Jun 4, 2015, 9:13:36 AM
to Mark Chapman, Giorgio Casaburi, trinityrn...@googlegroups.com
Goddamned, you beat me to it :)

Giorgio Casaburi

Jun 4, 2015, 9:20:47 AM
to trinityrn...@googlegroups.com, markcha...@gmail.com, giorgio...@gmail.com
Thank you so much, guys! I think the Illumina one will probably make more sense, and I'm going to try multiple DETONATE runs anyway and see which gives the best score for the reads I want to use. I forgot to mention one important thing, though: once I obtain the final assembly, only part of the reads (i.e. those from the last experiment) will be used for the study and mapped against the assembly. So I should probably run DETONATE with these reads against the different assemblies and see which gives the best result. Does that make sense?






Mark Chapman

Jun 4, 2015, 11:17:00 AM
to Giorgio Casaburi, trinityrn...@googlegroups.com
Hi Giorgio,
I don't think that's the right way to evaluate your assembly. For DETONATE you should use the reads that were used to make the assembly, not just a subset of them, even if you plan to map only some of the reads to the assembly.


Tiago Hori

Jun 4, 2015, 12:40:39 PM
to Mark Chapman, Giorgio Casaburi, trinityrn...@googlegroups.com
I would definitely use all the reads. Among other things, that will tell you the effect of normalization on transcript recovery and give you a true percentage of properly mapped reads.

T. 


Giorgio Casaburi

Jun 4, 2015, 1:42:23 PM
to trinityrn...@googlegroups.com, giorgio...@gmail.com, markcha...@gmail.com
Thank you, guys. I must admit that I agree with you both, but I was hoping for another solution, since the forward set of the total reads is 450 GB and the reverse another 450 GB, so I don't know how long DETONATE would take. I will try it and post the performance for future reference. Can I just use the normalized read set from Trinity? (There are far fewer of them.) The main thing is that I have one experimental run that needs to be analyzed, but I also have data from other labs, and I was trying to assemble the most reliable transcriptome out of all the data. I would then use the resulting main assembly to study my run. It looks like DETONATE won't help with that, since it cannot tell me which assembly is most suitable for my run without considering all the sequences used.

What would you do in my situation?
 

Tiago Hori

Jun 4, 2015, 2:02:36 PM
to Giorgio Casaburi, trinityrn...@googlegroups.com, markcha...@gmail.com
I just remembered that you have mixed technologies. I have never done that before. However, you can still align the reads separately to the merged reference and compare that with what you got from the technology-specific assembly. Let's say you have three assemblies: one Illumina, one 454, and one merged. You align the Illumina reads to the Illumina and merged assemblies and compare the results, then do the same for the 454 reads. That should give you a good idea of what will work better for DE.
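
The comparison can be tabulated roughly like this. A minimal Python sketch, assuming you already have mapped and total read counts from whatever aligner you use; all counts below are made-up placeholders and the function names are hypothetical:

```python
def mapping_rate(mapped, total):
    """Fraction of reads that aligned to an assembly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return mapped / total


def best_assembly_per_read_set(counts):
    """Pick the assembly with the highest mapping rate for each read set.

    `counts` has the shape {read_set: {assembly: (mapped, total)}}.
    Returns {read_set: (best_assembly, {assembly: rate})}.
    """
    result = {}
    for read_set, per_assembly in counts.items():
        rates = {
            name: mapping_rate(mapped, total)
            for name, (mapped, total) in per_assembly.items()
        }
        best = max(rates, key=rates.get)
        result[read_set] = (best, rates)
    return result


if __name__ == "__main__":
    # Placeholder numbers only, not real results.
    counts = {
        "illumina_reads": {"illumina_assembly": (800, 1000), "merged_assembly": (900, 1000)},
        "454_reads": {"454_assembly": (700, 1000), "merged_assembly": (600, 1000)},
    }
    for read_set, (best, rates) in best_assembly_per_read_set(counts).items():
        print(read_set, "->", best, rates)
```

Mapping rate alone is a coarse measure, so it is best read alongside the DETONATE scores rather than instead of them.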

T.


Giorgio Casaburi

Jun 4, 2015, 2:23:20 PM
to trinityrn...@googlegroups.com, giorgio...@gmail.com, markcha...@gmail.com
Great, thanks! So I will use the reads from my experiment and run DETONATE against basically every assembly I obtained (single, merged, distinct technologies, etc.) and see which gets the best score.
One last thing, if you don't mind, because I haven't seen it reported anywhere:
Does it make more sense to give DETONATE the raw reads or the trimmed, normalized reads from the Trinity pipeline?
Just a curiosity that I bet a lot of people have had already!
 


Tiago Hori

Jun 4, 2015, 2:46:52 PM
to Giorgio Casaburi, trinityrn...@googlegroups.com, markcha...@gmail.com
I would go with reads trimmed for both adapters and quality, but not normalized. Normalization will inflate your DETONATE scores by reducing the proportion of unmapped reads.

T.


Giorgio Casaburi

Jun 4, 2015, 2:54:07 PM
to trinityrn...@googlegroups.com, giorgio...@gmail.com, markcha...@gmail.com
Interesting, great suggestion, thanks!