Merging RNA-seq assemblies

1,518 views
Skip to first unread message

Tevfik Hamdi Kitapci

unread,
Feb 23, 2016, 8:26:57 PM2/23/16
to trinityrnaseq-users
Hi,
I have data from a time series experiment 8 time points 3 bio-replicates at each point. I have assembled each samples independently using trinity command

Trinity --seqType fq --max_memory 28G --left %s --right %s --trimmomatic --output %s

I want to merge those assemblies. I read about using cd-hit to merge assemblies but I don't understand how to use it. Can somebody please give me details about how to use cd-hit or any other software to merge those assemblies.

Btw I am also assembling my data by combining all my samples (same command as above plus the --normalize_reads option since combined data has over 220 million paired-end reads) I am still waiting for the output for this (it's been running for 190 hours on a 128GB memory machine) While I wait for this result I want to merge my individual assemblies and then I will compare the results.

Thanks a lot
Best Regards
Hamdi

Samuel Abalde

unread,
Feb 24, 2016, 4:16:54 AM2/24/16
to trinityrnaseq-users
Hi Hamdi,

Why do you want to merge all your assemblies? If you want to compare the results between samples, shouldn't you assemble them individually and then to perform the comparisons?

Anyways, you could use Transrate instead of CD-HIT. I use CD-HIT sometimes, but you could lose some isoforms since they are pretty similar each other and they will be part of the same cluster.
I personally like to use Transrate, which has been developed to analyse the quality of your assemblies. It analyses each contig individually and output a "assembly score". Besides, it has the option --merge-assemblies, what chooses the "good contigs" of every assembly and joins them into one file.

The way to use it would be:

transrate  -t 16  --assembly $assembly1,$assembly2  --merge-assemblies $assemblym

Where:

-t is the number of threads to use (optional)
--assembly the comma-separated names of the files to be merged
--merge-assemblies the name of the output fasta file

If you want to output this file in another directory, this option could be path/to/the/directory/file.fasta

I hope it helps,
Samu

Samuel Abalde

unread,
Feb 24, 2016, 4:18:30 AM2/24/16
to trinityrnaseq-users
Btw, I assume you know it, but the $ just says the name is a variable. To use it directly I'd write:

transrate  -t 16  --assembly assembly1,assembly2  --merge-assemblies assemblym


Brian Haas

unread,
Feb 24, 2016, 8:52:41 AM2/24/16
to Samuel Abalde, trinityrnaseq-users
Hi all,

I'm a tad suspicious of Transrate (no offense to any Transrate developers intended).  Be sure to evaluate resulting assemblies using

after doing whatever it is you want to do in order to improve the assembly.

On Wed, Feb 24, 2016 at 4:18 AM, Samuel Abalde <saab...@gmail.com> wrote:
Btw, I assume you know it, but the $ just says the name is a variable. To use it directly I'd write:

transrate  -t 16  --assembly assembly1,assembly2  --merge-assemblies assemblym


--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Samuel Abalde

unread,
Feb 24, 2016, 10:53:24 AM2/24/16
to trinityrnaseq-users
But Trinity doesn't have an option to merge assemblies, does it?

Brian Haas

unread,
Feb 24, 2016, 10:56:40 AM2/24/16
to Samuel Abalde, trinityrnaseq-users
Trinity doesn't have an option to merge assemblies.  You basically generate one assembly, and you can decide to filter it based on expression values.

The primary way to get a better assembly with Trinity is to feed it more reads (and normalize)... and some future version of the Trinity software will deal with any shortcomings as it's always a work in progress.



On Wed, Feb 24, 2016 at 10:53 AM, Samuel Abalde <saab...@gmail.com> wrote:
But Trinity doesn't have an option to merge assemblies, does it?

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Richard Smith-Unna

unread,
Feb 24, 2016, 11:37:48 AM2/24/16
to trinityrnaseq-users, saab...@gmail.com
Hi all,

Transrate lead developer here.

I don't recommend using the current --merge-assemblies option of Transrate for this. What it does is pool the contigs from all the assemblies, then perform a read-based quality assessment of the pooled assembly and filter out low-scoring contigs. Ideally this leaves you with the well-assembled stuff from each assembly, but it doesn't attempt to reconcile the contigs with one another in any way.

We've developed a separate tool for assembly merging: https://github.com/cboursnell/transfuse. This also takes into account quality information, but is quite conservative in discarding contigs. It clusters the transcripts that pass the filter, builds a multiple sequence alignment for each cluster, and then resolves the splice graph for each alignment to generate merged contigs. It's still in development but it appears to be performing well - the manuscript is in preparation.

Brian, I'd be interested to hear what makes you suspicious of Transrate, and what you mean by it. We work completely in the open and welcome constructive criticism, and are constantly improving our software and documentation. Happy to talk here, in our user group, or on our chat room: https://gitter.im/Blahah/transrate.

Cheers,
Richard Smith-Unna


On Wednesday, 24 February 2016 13:52:41 UTC, Brian Haas wrote:
Hi all,

I'm a tad suspicious of Transrate (no offense to any Transrate developers intended).  Be sure to evaluate resulting assemblies using

after doing whatever it is you want to do in order to improve the assembly.
On Wed, Feb 24, 2016 at 4:18 AM, Samuel Abalde <saab...@gmail.com> wrote:
Btw, I assume you know it, but the $ just says the name is a variable. To use it directly I'd write:

transrate  -t 16  --assembly assembly1,assembly2  --merge-assemblies assemblym


--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Feb 24, 2016, 11:56:21 AM2/24/16
to Richard Smith-Unna, trinityrnaseq-users
Hi Richard,

I suppose I should be more forthcoming.  I had reviewed the manuscript a long while back, and I'm honestly very enthusiastic about being able to leverage such a method, but I did have some concerns.  I've posted my review below.  I never did get a response to my critique - which I thought was mostly constructive - and I would also be curious about thoughts by others that are more well versed in the mathematics related to these optimization techniques - such as the Detonate authors.   If I'm terribly wrong about anything I stated in my review, I'm happy to own up to it.  I don't know if the latest version on bioRxiv is the same as what I had looked at many months ago, so it could be that some of this was already addressed.  

I really like the idea of TransRate - especially if it can provide results on par with Detonate but require relatively little time and resources to compute.  ... and I harbor absolutely no ill will.



====

The manuscript by Smith-Unna et al. describes a method and tool named TransRate that was developed to score the quality of a de novo transcriptome assembly.  The manuscript is well written and well organized.  However, I suspect that there are several potential flaws in the method as it’s currently implemented, which I’ll further describe below.

One of my biggest concerns, and one that the authors attempt to proactively defend in the manuscript, is that the contig scores do not appear to be weighted according to their estimated expression levels.  Each contig, regardless of whether it has much read support or whether it has little to no apparent read support, is weighted equally in its contribution to the assembly aggregate score.  This is a problem for a number of reasons.  Those transcripts that are moderately to highly expressed and assembled incorrectly should be penalized more heavily than those misassembled contains that have little read support, most simply due to the more highly expressed transcript accounting for a larger proportion of the total number of sequenced reads.  Intuitively and mathematically, this should lead to a more suitable contig and cumulative assembly scoring scheme.

This equal weighting of transcript contigs goes onto confound other analyses, such as the spearman rank correlations and comparison between TransRate and Detonate.  The redundant contigs with little to no reads assigned, even though they may in many cases be correctly reconstructed, are skewing the correlation statistic.  Based on the authors observation related to this, they would be better off removing those contigs that form the distribution at the mark of minimum value inflation and recompute correlations without those data.

The authors appear to wholly assign reads to contigs based on the maximum probability of their assignment, instead of fractionally assigning reads according to this probability.  In all the scoring metrics, it would be better to weight the contribution of each read according to its fractional assignment to that contig.  The all-or-nothing read assignment as done by the authors will artificially augment scores for more dominantly expressed isoforms and unfairly penalize the lower expressed isoforms.  Related to this, the authors should specifically address in the manuscript the impact of overzealous alternative isoform reconstructions, and the degree to which the sensitivity of an algorithm in reconstructing very lowly expressed transcripts impacts the cumulative assembly score.

Although the scoring system devised by the authors touches on many of the important metrics for exploring the quality of the assembly, the scoring system implemented seems largely ad hoc, particularly in how the scores are combined to generate a single assembly quality score.   This is in stark contrast to the earlier published Detonate software, which has a very solid mathematical framework.  

The authors compare TransRate assembly scores for a large number of published transcriptome assemblies.  Although I applaud them for this effort, and it would be useful to have a score that would allow for reliable comparisons among different assemblies from different transcriptomes, there is no compelling reason to think that such a score as computed by TransRate would enable trustworthy comparisons between unrelated transcriptomes.  It’’s clear that multiple assemblies given a single set of reads as input would allow for meaningful comparisons (as done via Detonate), but being able to compare across different assemblies from different sets of reads would ideally be based on a mathematical model with strong theoretical foundations.

The overall usefulness of the Cseg metric should be demonstrated on known falsely-fused transcripts. It should be demonstrated to be a good predictor of such events if it is to be used as a key quality metric.

====


Richard Smith-Unna

unread,
Feb 24, 2016, 12:36:58 PM2/24/16
to trinityrnaseq-users, richard...@gmail.com
Hi Brian,

Thanks very much for your response and for your thoughtful review. We are actually just about to submit our revised MS. It took many months for reviews to come back (November I think) - I assume another reviewer was much slower than you. It also took a while to resubmit as I've paused my PhD to do a fellowship. We haven't updated the preprint yet, but we will do this shortly.

Briefly though, transrate deliberately departs from the likelihood-based approach taken by DETONATE and many genome assessment tools. The reason is that most users are unable to interpret likelihoods usefully except as a simple 'which number is higher?'. We aim to provide detailed information about every contig in a way that enables researchers to make evidence based decisions, and to quantify effects that can be intuitively grasped and potentially corrected. The feedback we have been getting from users is that we're making good progress towards this goal (though of course we will continue trying to get better). In short - the approach is based on optimising usefulness first and perfect accuracy second (though both matter, of course). 

The point about weighting of contigs is valid - and we now provide an alternative expression-weighted score. In the future we will continue to improve the scoring procedure, but we have a lot of users right now who want a paper that describes what they've been using :). We also computed alternative correlation statistics with lowest-scoring contigs removed.

Some of the other points are addressed with explanation or discussion that I mostly won't duplicate here. The only one I will highlight is that in the final figure of the paper where we show transrate scores of hundreds of assemblies from different datasets, we are not trying to compare the quality of assemblies from different datasets, but to show empirically how assembly scores vary and allow discussion of what variables might contribute to the variance - the purpose is again to make interpretation easier for users.

Finally, we added some analysis of transcript fusions in yeast with the segmentation score component.

I hope we've addressed your concerns, and would be very happy to work with you to continue to improve in the future.

Cheers,
Richard

Brian Haas

unread,
Feb 24, 2016, 12:40:00 PM2/24/16
to Richard Smith-Unna, trinityrnaseq-users
Fantastic!  I look forward to the updated version, and I'm very happy to see this has been so constructive.

yours truly,

~brian


--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Tevfik Hamdi Kitapci

unread,
Feb 24, 2016, 2:28:38 PM2/24/16
to trinityrnaseq-users
Hi Samuel,
Thanks for your reply. Can you please explain how can compare results between samples from individual assemblies? I want to perform a time series differential expression analysis: my idea it first get an combined transcriptome assembly then map the reads from each sample to this assembly. Then follow up with a differential expression analysis based on the expression count data (normalized as necessary such as FPKM, TPM etc.)

I don't know how to perform this analysis with 24 individual assemblies. I can map the reads from each sample to the assembly obtained from those reads but then how can I compare the results from those samples ? I'll appreciate your comments.

Thanks
Best Regards
Hamdi

Tevfik Hamdi Kitapci

unread,
Feb 24, 2016, 2:33:34 PM2/24/16
to trinityrnaseq-users, saab...@gmail.com
Hi Richard,
Thanks for your comments. I will definitely give a try to transfuse. Do you know when you will submit/publish your manuscript ?

Best Regards
Hamdi

Samuel Abalde

unread,
Feb 25, 2016, 4:36:24 AM2/25/16
to trinityrnaseq-users
Hi Hamdi,

I'm not an expert in differential expression, so I don't want to give you a wrong information. My concern was if you merge all your assembies in just one, you may be losing some important information, such as low expressed reads or pretty similar ones (i. e. CD-HIT clusters reads following a percentage of similarity -by default 90%, I think-, so you could lose reads more similar than 90% but different ones, if that's your threshold).

I haven't thought in your approach when I wrote my answer. When you work with samples of different tissues from the same organism, it's true there are people who recommend to merge all those reads, perform the assembly and then map the reads (individually from each sample) to that assembly. So I don't dare to say that's a wrong approach. It's just a little concern as a beginner and some curiousity about your goals. I'm not saying you can't do that. If you wanna use this approach, I'd recommend you to read this forum post: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ where it's recommended to use TPM metrics, instead of FPKM, and the mathematical reasons.
If I'm not wrong, you can get those metrics with Trinity.

In some case you want to compare your samples individually, I'd recommend you to read this one about "between samples normalization", where you can see some methods to compare these individual samples:  https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/

Well, I know I haven't answered your question, but I'm not an expert and I don't want to interfere in your job. I hope the reads above can be useful for you to make a proper decision. I'm sorry if a made you to have doubts about your pipeline or I've wasted your time with my curiosity.

Good luck,
Samu
Reply all
Reply to author
Forward
0 new messages