Hi all,
As part of my study, I am trying to identify the orthology relationships between multiple species. Some of the species are model organisms with a well-established genome, and others are non-model. For the non-model species we pooled multiple samples per species and ran the Trinity assembler. Obviously, some of the contigs are true isoforms and, as such, will appear to have a paralogous relationship, as if they were the result of duplications. However, as we get many isoforms per 'gene', I looked at random samples of orthologous groups and found that in many cases the isoforms align to different locations on the gene of a better-annotated species. This led me to the conclusion that they are probably not isoforms, but fragments of the same gene that the assembler was not able to assemble into a single contig.
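To make the observation above concrete, here is a minimal Python sketch of how such a check could be automated. It assumes the contigs of one Trinity 'gene' have already been aligned to the same reference gene and reduced to (start, end) coordinates on that reference, for example from BLAST tabular output. The function names, the contig IDs and the 10% overlap threshold are my own illustrative assumptions, not part of Trinity or any other tool.

```python
from itertools import combinations


def overlap_fraction(a, b):
    """Overlap of two reference intervals, as a fraction of the shorter one.

    Intervals are (start, end), 1-based inclusive. Coordinates are sorted
    first, because minus-strand hits may be reported with start > end.
    """
    a_start, a_end = sorted(a)
    b_start, b_end = sorted(b)
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    shorter = min(a_end - a_start, b_end - b_start) + 1
    return max(overlap, 0) / shorter


def classify_gene(hits, max_overlap=0.1):
    """Classify one Trinity 'gene' from its contigs' hits on one reference gene.

    hits: dict mapping contig ID -> (ref_start, ref_end).
    max_overlap: hypothetical threshold; pairs overlapping by no more than
        this fraction are treated as separate pieces of the same transcript.

    Returns 'single' for one contig, 'fragmented' if at least one pair of
    contigs barely overlaps on the reference, 'isoform-like' otherwise.
    """
    if len(hits) < 2:
        return "single"
    for a, b in combinations(hits.values(), 2):
        if overlap_fraction(a, b) <= max_overlap:
            return "fragmented"
    return "isoform-like"


# Made-up coordinates, for illustration only.
tiled = {"gene1_i1": (1, 400), "gene1_i2": (420, 900)}
nested = {"gene2_i1": (1, 900), "gene2_i2": (1, 700)}
print(classify_gene(tiled))   # fragmented
print(classify_gene(nested))  # isoform-like
```

This only looks at where the contigs land on the reference, so it cannot tell a fragmented assembly from isoforms that happen to share no exons; it is just a quick way to estimate how common the pattern is across orthologous groups.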
Here I would love your input: do you think that using a clustering algorithm to remove redundant sequences would be helpful? (I am not planning to run any DE analysis downstream.) Furthermore, if the answer to the above is yes, I understand that Corset and SuperTranscript will be quite similar in their output (I'm planning on running both of them individually for each assembly in any case), while CD-HIT will be quite different. As each one has its own advantages and disadvantages, would a combination of, for example, first SuperTranscript and then CD-HIT be beneficial?
I'm sorry for the long post, and I hope that I stated my intentions clearly.
Looking forward to your suggestions.
P.S. Yes, I know that it would be much better to assemble a genome; this will be the lab's next step. However, for now we only have the RNA-seq data.