-D parameter

Lada Jovović

Mar 31, 2023, 7:15:20 AM

to corset-project

Hello everyone,

I am new to Corset so can someone please help me out with the following?

My input was 633,679 transcripts (from Trinity.fasta) assembled from 10 samples. I ran Corset with default parameters and got 51,556 clusters. What bothered me is that multiple transcripts were assigned to the same cluster ID, so in the end I have 46,698 unique cluster IDs. I need to use this for DGE analysis.

I read about superTranscripts and that, for those, I should run Corset with the -D parameter set high. In the example, it is set to 99999999999.

https://github.com/Oshlack/Corset/wiki/Example

I tried that, but it has been running for 10 days now. How do I decide on a value for the -D parameter for my data?

Tnx,

Lada

Nadia Davidson

Apr 11, 2023, 5:59:05 PM

to corset-project

Hi Lada,

It's expected that you get multiple transcripts for each cluster ID. This is because you can have multiple isoforms per gene (biology) or because the assembler has generated some redundancy (technical). Either way, the reads that map to these transcripts should be aggregated into clusters, which is what Corset does. Corset should generate a file that gives you the counts for each cluster by sample, which you can use for DGE analysis.
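As a rough illustration of why there are fewer cluster IDs than transcripts, here is a minimal Python sketch that counts transcripts per cluster. It assumes the cluster assignments are a two-column, tab-separated table (transcript ID, then cluster ID); the IDs below are made up for the example and are not taken from any real run.

```python
from collections import Counter
import csv
import io

# Hypothetical excerpt in the assumed two-column layout:
# transcript ID <TAB> cluster ID (all IDs here are invented).
CLUSTERS_TSV = (
    "TRINITY_DN1_c0_g1_i1\tCluster-0.0\n"
    "TRINITY_DN1_c0_g1_i2\tCluster-0.0\n"
    "TRINITY_DN2_c0_g1_i1\tCluster-1.0\n"
    "TRINITY_DN3_c0_g1_i1\tCluster-2.0\n"
    "TRINITY_DN3_c0_g2_i1\tCluster-2.0\n"
)


def transcripts_per_cluster(handle):
    """Count how many transcripts share each cluster ID."""
    reader = csv.reader(handle, delimiter="\t")
    # Skip blank or malformed rows that lack a second column.
    return Counter(row[1] for row in reader if len(row) >= 2)


sizes = transcripts_per_cluster(io.StringIO(CLUSTERS_TSV))
n_transcripts = sum(sizes.values())  # one row per transcript
n_clusters = len(sizes)              # one key per distinct cluster ID

print(f"{n_transcripts} transcripts in {n_clusters} clusters")
for cluster_id, n in sizes.most_common():
    print(cluster_id, n)
```

For a real run, the `io.StringIO(...)` argument could be swapped for an open file handle on the cluster assignment file, provided it really has this layout.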

To annotate each cluster to a gene, there are a number of approaches. For example, you can take the longest transcript as a representative and use Blast2GO. You can also use Lace to create a superTranscript representation for each cluster. This can be useful for differential transcript usage (DTU) analysis, visualising mapped reads and SNP calling. If you want to do DTU analysis, the -D parameter needs to be set high, as you mention. However, it's not clear to me why this would make Corset run so slowly.

I think if you are only interested in DGE analysis, using the original Corset results is fine. The number of transcripts per cluster you have sounds about right from my experience. Best of luck with your analysis.

Cheers,

Nadia.

Lada Jovović

Apr 21, 2023, 5:23:37 AM

to corset-project

Dear Nadia,

Thank you so much for the clarification. Yes, it makes total sense now (multiple transcripts in one cluster could be isoforms).

I am still not sure how the counting works. When multiple transcripts are aggregated into one cluster, does the counts.txt file give the read count for that cluster simply by summing the reads that mapped to all transcripts in that cluster, or does it do something else (for example, look only at the reads that are shared)?
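To make the two possibilities concrete, here is a toy Python sketch. The reads, transcript names and alignments are invented, and it is not Corset's code; it only shows how the two counting schemes differ when a read maps to more than one transcript in the same cluster.

```python
# Hypothetical alignments: read ID -> set of transcripts it maps to.
alignments = {
    "read1": {"tA"},
    "read2": {"tA", "tB"},  # multi-mapped within the cluster
    "read3": {"tB"},
    "read4": {"tA", "tB"},  # multi-mapped within the cluster
}

# Hypothetical cluster made of two transcripts.
cluster = {"tA", "tB"}

# Scheme 1: count reads per transcript, then sum over the cluster.
# A read that hits both transcripts is counted twice.
per_transcript = {
    t: sum(t in hits for hits in alignments.values()) for t in cluster
}
summed_over_transcripts = sum(per_transcript.values())

# Scheme 2: count each read once if it hits any transcript in the cluster.
once_per_read = sum(1 for hits in alignments.values() if hits & cluster)

print("summed over transcripts:", summed_over_transcripts)
print("each read counted once:", once_per_read)
```

In this toy case the first scheme gives 6 and the second gives 4, so the two only agree when no read maps to more than one transcript of the cluster.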

As for the -D parameter, this job finally finished in the meantime, and for the same dataset I got many more clusters: 290,687 in total (186,126 unique). I am definitely gonna proceed with the first result. :)