CD-HIT-EST, Corset and SuperTranscript comparison


Tal Zaquin

Jun 14, 2022, 1:30:48 AM
to corset-project
Hi all,
As part of my study, I am trying to identify orthology relationships among multiple species. Some of the species are model organisms with well-established genomes, and others are non-model. For the non-model species we pooled multiple samples per species and ran the Trinity assembler. Obviously, some of the contigs are true isoforms and as such will have a paralogous relationship, which is the result of duplications. However, as we get many isoforms per 'gene', I looked at random samples of orthologous groups and found that in many cases the isoforms align to different locations on the sequence from a better-annotated species. This led me to conclude that they are probably not isoforms but parts of the same gene that the assembler was not able to assemble correctly.
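A minimal sketch of the spot check described above, assuming the alignments of each Trinity 'isoform' to one reference gene are already available as start/end coordinates (for example from a BLAST tabular report). The function names and the toy coordinates are hypothetical. Contigs whose aligned regions never overlap are candidates for fragments of one transcript, although real isoforms that share no exon would look the same, so this is only a first filter.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous block: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def classify_group(hits):
    """hits: list of (contig_id, ref_start, ref_end) on ONE reference gene.

    Returns "overlapping" if all aligned regions chain into a single block,
    otherwise "disjoint" (possible assembly fragments).
    """
    blocks = merge_intervals([(start, end) for _, start, end in hits])
    return "overlapping" if len(blocks) == 1 else "disjoint"


# Toy coordinates, invented for illustration only.
isoform_like = [("c1_i1", 1, 400), ("c1_i2", 350, 900)]
fragment_like = [("c2_i1", 1, 400), ("c2_i2", 600, 1200)]

result_isoform_like = classify_group(isoform_like)
result_fragment_like = classify_group(fragment_like)
print(result_isoform_like, result_fragment_like)
```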
Here, I would love your input: do you think that using a clustering algorithm to remove redundant sequences would be helpful? (I am not planning to run any DE analysis downstream.) Furthermore, if the answer is yes, I understand that Corset and SuperTranscript will be quite similar in their output (I'm planning on running both of them individually for each assembly in any case), while CD-HIT will be quite different. As each one has its own advantages and disadvantages, might a combination, for example first SuperTranscript and then CD-HIT, be beneficial?

I'm sorry for the long post, and I hope I have stated my intentions clearly.
Looking forward to your suggestions.

P.S. Yes, I know that it would be much better to assemble a genome; this will be the lab's next step. However, for now we only have the RNA-seq data.

Nadia Davidson

Jul 4, 2022, 3:11:53 AM
to corset-project
Hi,
Was not going to suggest you should assemble a genome :)
I think the approach you take depends a lot on what you are looking for. For example, if you are doing a phylogenetic analysis and just want to know the SNPs and distance between species, you could possibly map reads from all species onto the reference genome of the model ones? If you are interested in differences in gene structure, this is more complex of course. Assemblers make a lot of mistakes and require a reasonable depth of coverage to assemble an isoform. Long-read sequencing could be good for this.

In general, removing redundancy in an assembly is a good thing, and both Corset and CD-HIT should do this. The SuperTranscripts method (Lace) takes already clustered transcripts, so it is a downstream step. I think you probably don't want to assemble and then compare superTranscripts between species, because the order of some exons can be ambiguous (without a reference genome) and the choice of order might be different between your Lace assemblies.
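To make the redundancy-removal idea concrete, here is a rough sketch of greedy, longest-first clustering, which is the general strategy CD-HIT uses. It is not CD-HIT's actual code: the k-mer containment score is an assumed stand-in for its sequence identity calculation, and the function names, threshold and toy sequences are hypothetical. Corset works differently (it groups contigs by the reads they share), so this only illustrates the sequence-similarity approach.

```python
import random


def kmers(seq, k=8):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(short, long_, k=8):
    """Fraction of the shorter sequence's k-mers also present in the longer one."""
    short_kmers = kmers(short, k)
    if not short_kmers:
        return 0.0
    return len(short_kmers & kmers(long_, k)) / len(short_kmers)


def greedy_cluster(seqs, threshold=0.9, k=8):
    """seqs: dict of id -> sequence. Returns dict of representative id -> member ids."""
    clusters = {}
    # Longest first: a sequence joins the first representative it matches,
    # otherwise it becomes a new representative.
    for sid in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        for rep in clusters:
            if containment(seqs[sid], seqs[rep], k) >= threshold:
                clusters[rep].append(sid)
                break
        else:
            clusters[sid] = [sid]
    return clusters


# Toy data: a random 300 nt "gene", invented for illustration only.
rng = random.Random(0)
gene = "".join(rng.choice("ACGT") for _ in range(300))

# A shorter contig fully contained in a longer one is absorbed.
redundant = greedy_cluster({"full": gene, "sub": gene[50:200]})

# Two non-overlapping fragments of the same gene share no sequence,
# so similarity-based clustering keeps them apart.
fragments = greedy_cluster({"frag_a": gene[:120], "frag_b": gene[180:]})

print(redundant)
print(len(fragments))
```

The second toy case is the one relevant to the question above: contigs that cover different, non-overlapping parts of a gene look unrelated to a sequence-similarity clusterer, so it would leave them as separate entries.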

Good luck with your project.

Cheers,
Nadia.