single gene sequence per cluster

Lada Jovović

Jun 30, 2023, 5:20:57 AM

to corset-project

Hi guys,

Is there a way to get a single gene sequence per cluster?

For example, I have multiple transcripts assigned to one cluster and I want one representative sequence. What do you do at that step to shrink the transcriptome further, down to the gene level?

Should I take, for example, the longest "isoform", or is there a script that can do that (or apply some other filtering approach) for me?

I read some of the previous conversations and saw that Nadia said something like that was in development years ago.

Thanks,

Lada

Nadia Davidson

Jun 30, 2023, 6:38:44 AM

to corset-project

Hi Lada,

I think Lace is what you are looking for.

Cheers,

Nadia.

Lada Jovović

Jun 30, 2023, 10:18:56 AM

to corset-project

Hi Nadia,

That's what I thought I should use, I just wasn't sure. Thank you so much for your quick response!

Best,

Lada

shi ye

Jul 7, 2023, 11:41:07 PM

to corset-project

Hi Nadia

I want to take one representative transcript from each cluster and do functional annotation. I think that using SuperTranscripts may affect the result. What should I do?

Best,

Shiye

Nadia Davidson

Jul 10, 2023, 4:50:22 AM

to corset-project

Hi Shiye,

You could try something like what we show at the end of this link https://github.com/Oshlack/Corset/wiki/Example for annotation. I guess it depends on what you will use the functional annotation for. I wouldn't recommend using SuperTranscripts for this.

Cheers,

Nadia.

Lada Jovović

Jul 14, 2023, 5:39:27 AM

to corset-project

Hi Nadia and Shiye,

OK, so if SuperTranscripts are not good for annotation, what should I do with my Corset output? fetchClusterSeqs.py is for working with a subset of clusters of interest. But I am not working on a subset. What I want to do is take my assembled transcripts from Trinity, remove the redundancy by clustering transcripts into genes (I see people use CD-HIT too, but I'd like to use Corset), and then use these clusters ("genes") for all downstream RNA-seq applications such as ORF prediction, annotation, quantification, mapping, DESeq, GO and KEGG enrichment, etc.

I just don't understand what I should do if building SuperTranscripts with Lace is not good for functional annotation. Specifically, I don't understand what to do with clusters that have more than one transcript: how do I use those for the aforementioned downstream applications? Are they all Corset "genes", or do I somehow have to filter among them? For example, should I take the longest transcript in the cluster as a representative?
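
The "longest transcript per cluster" option mentioned above can be sketched in a few lines of Python. This is only an illustration of that heuristic, not a method recommended anywhere in this thread. It assumes the clusters file is the two-column, tab-delimited transcript-to-cluster table that Corset writes, and that the FASTA IDs (first word of each header) match the transcript IDs in that table. The function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {transcript_id: sequence}.

    The ID is taken as the first whitespace-delimited word of the header.
    """
    seqs, current = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            seqs[current] = []
        elif current is not None:
            seqs[current].append(line)
    return {tid: "".join(parts) for tid, parts in seqs.items()}


def read_clusters(lines):
    """Parse 'transcript<TAB>cluster' lines into {transcript_id: cluster_id}."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            mapping[fields[0]] = fields[1]
    return mapping


def longest_per_cluster(seqs, clusters):
    """Return {cluster_id: (transcript_id, sequence)} for the longest transcript."""
    best = {}
    for tid, cid in clusters.items():
        seq = seqs.get(tid)
        if seq is None:
            continue  # listed in the clusters file but missing from the FASTA
        # Longer sequence wins; on equal length the larger ID wins (deterministic)
        if cid not in best or (len(seq), tid) > (len(best[cid][1]), best[cid][0]):
            best[cid] = (tid, seq)
    return best


def write_representatives(fasta_path, clusters_path, out_path):
    """Write one FASTA record per cluster, named by cluster ID."""
    with open(fasta_path) as fa, open(clusters_path) as cl:
        best = longest_per_cluster(read_fasta(fa), read_clusters(cl))
    with open(out_path, "w") as out:
        for cid in sorted(best):
            tid, seq = best[cid]
            out.write(f">{cid} representative={tid}\n{seq}\n")
```

Calling `write_representatives("Trinity.fasta", "clusters.txt", "representatives.fasta")` (file names are placeholders) would give one sequence per cluster. Note that the longest transcript is not necessarily the most complete or best-supported one, so whether this is appropriate depends on the downstream use.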

To be even more specific: in one study there were 138,752 Trinity transcripts, and using Corset they arrived at a final set of 72,826 genes, which was used for all further analysis. I don't understand how I get to that (lower) number if I have multiple transcripts assigned to the same cluster ID. I hope I explained this well, and I apologise if it's a silly question :)
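
On the counting question, one plausible reading is that the "gene" count in such a study is the number of distinct cluster IDs, since each cluster stands for one gene regardless of how many transcripts it contains. A minimal Python sketch of that count follows, assuming the same two-column, tab-delimited transcript-to-cluster table as Corset's clusters file. The function names are hypothetical.

```python
from collections import Counter


def cluster_sizes(lines):
    """Count transcripts per cluster from 'transcript<TAB>cluster' lines."""
    sizes = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            sizes[fields[1]] += 1
    return sizes


def summarize(sizes):
    """Return (n_transcripts, n_clusters, n_multi_transcript_clusters)."""
    n_transcripts = sum(sizes.values())
    n_clusters = len(sizes)
    n_multi = sum(1 for n in sizes.values() if n > 1)
    return n_transcripts, n_clusters, n_multi
```

With a real file, `summarize(cluster_sizes(open("clusters.txt")))` (file name is a placeholder) reports how many transcripts were clustered, how many clusters there are, and how many of those clusters hold more than one transcript.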