Hierarchical clustering stuck on a cluster

Jodi Thomas

Feb 21, 2021, 6:31:33 PM
to corset-project
Good morning!

I am running corset on my assembly and it appears to be stuck on clustering a specific cluster:
...
145.6 million equivalence classes read
Starting hierarchial clustering...
0 thousand clusters done
1 thousand clusters done
2 thousand clusters done
3 thousand clusters done
4 thousand clusters done
5 thousand clusters done
cluster with 23521 transcripts.. this might take a while

It took about 1 hour 20 minutes to get to that stage, and after 2.5 days nothing has changed. I have read previous posts in this group about corset taking a while to run and have tried to follow the advice in them. My transcriptome is a long-read Iso-Seq assembly with 49,981 transcripts - I ran CD-HIT before corset to remove some redundancy, as well as TransDecoder to cut down the number of transcripts. I have 4 groups, each with 10 samples, and am using the equivalence class information from salmon mapping. I have added '-l 10' to corset to try to speed things up. Here is my code:

g="3,1,3,1,4,2,4,2,3,1,3,1,3,1,4,2,4,2,4,2,3,1,3,1,4,2,4,2,3,1,3,1,3,1,4,2,4,2,4,2"
n="PS506_CNS,PS506_eyes,PS508_CNS,PS508_eyes,PS509_CNS,PS509_eyes,PS510_CNS,PS510_eyes,PS513_CNS,PS513_eyes,PS514_CNS,PS514_eyes,PS518_CNS,PS518_eyes,PS522_CNS,PS522_eyes,$
eq_classes="PS506_CNS_eq_classes.txt PS506_eyes_eq_classes.txt PS508_CNS_eq_classes.txt PS508_eyes_eq_classes.txt PS509_CNS_eq_classes.txt PS509_eyes_eq_classes.txt PS510_$

corset -I \
        -g ${g} \
        -n ${n} \
        -i salmon_eq_classes ${eq_classes} \
        -l 10 \
        -p isoseq3_cdhit0.99_transdecoder_nrmollusca_genes_corset
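
For context, each eq_classes file above comes from running salmon with --dumpEq. Roughly, per sample, it looks like this (the index name and read file paths here are placeholders rather than my exact commands, and I'm assuming paired-end reads):

# build the salmon index once against the assembly (fasta name is a placeholder)
salmon index -t isoseq3_cdhit0.99_transdecoder.fasta -i salmon_index

# one quant run per sample; --dumpEq writes the equivalence classes that
# corset's salmon_eq_classes input mode reads
salmon quant -i salmon_index -l A \
        -1 PS506_CNS_R1.fastq.gz -2 PS506_CNS_R2.fastq.gz \
        --dumpEq -p 8 -o PS506_CNS_salmon

# the equivalence class file ends up in PS506_CNS_salmon/aux_info/eq_classes.txt
# (gzipped in recent salmon versions - unzip it before giving it to corset)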

I'm expecting it to take 1 - 2 weeks to run based on previous posts, but I just wanted to check whether you think that seems realistic? Or, because it's been stuck at this stage for a couple of days, might it take a very long time? Would it be worth adding the '-x' parameter? I haven't added '-x' yet as I wasn't sure if it would remove any useful information, but I see in a previous post you recommend '-x 100'. I am working on a squid, which I would expect to have high levels of RNA editing, so this may be a further complication.

Thanks so much for your help!
Jodi :)

Nadia Davidson

Feb 23, 2021, 12:18:14 AM
to corset-project
Hi Jodi,

Has the job progressed at all since yesterday? Is there any output after the "cluster with 23521 transcripts.. this might take a while" line?
Your job is a little different from some of the previous cases that took a long time to run, because the number of transcripts you have is not large, and I would assume the assembly is fairly clean given that it's long-read. However, you have many samples, so presumably a lot of reads and equivalence classes to process. Are the reads short reads? Roughly how many are there per sample? Also, what type of computer are you running corset on, and how much RAM does it have?

It may even be worth trying an older version of corset in your case, such as 1.07 (see https://github.com/Oshlack/Corset/releases). I don't know whether this will be faster, but a lot of changes were made for efficiency when the number of transcripts is very large, which may actually make things slower in your case. Feel free to update me on whether this helps.
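
If you do try it, the steps are roughly as follows - a sketch only, as I haven't checked the exact tarball name on the release page (there may also be a pre-built Linux binary attached to the release, which would avoid compiling altogether):

# rough sketch; the tarball name is assumed, check the releases page
tar -xzf corset-1.07.tar.gz
cd corset-1.07
./configure        # see Corset's README if configure needs extra --with-* paths
make
# then point your job script at this ./corset binary instead of the installed one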

Cheers,
Nadia.

Jodi Thomas

Feb 23, 2021, 1:33:38 AM
to corset-project
Hi Nadia,

Thanks for your reply! No, the job hasn't progressed - the last line is still 'cluster with 23521 transcripts.. this might take a while'.

Yes, they are short RNA-seq reads, with approx 349,732,537 reads per sample. I'm running corset on my university's HPC - I don't know much about the hardware, but I think it is a PowerEdge R630 rack server. I have given the corset job 1 core (as I understand corset can only use a maximum of 1 core) and 300 GB of memory - does that sound OK?

OK, good idea, thank you! The other idea I had is splitting my corset job in two. I have 10 samples in each of two treatment groups for tissue A (20 samples in total for tissue A), and 10 samples in each of two treatment groups for tissue B (20 samples in total for tissue B). I ran corset with all 40 samples, specifying 4 groups, to potentially make downstream analysis easier. However, I could instead do one corset run with the 20 tissue A samples in 2 groups, and a second corset run with the 20 tissue B samples in 2 groups. What do you think of this?

Thank you!
Jodi 

Jodi Thomas

Feb 23, 2021, 1:36:26 AM
to corset-project
Sorry, approx 87,433,134 per sample!!

Nadia Davidson

Feb 23, 2021, 2:08:03 AM
to corset-project
350 million reads per sample did sound like a lot! 87 million reads * 40 samples is still very high, and there's a reasonable chance corset will not be able to process that much. Splitting the job by samples is one possibility, but it may still take a very long time to run (if it finishes at all), and you will not be able to combine the count data later (as the cluster IDs won't correspond).

I've just thought of another option that might work in your case. You could take a small subset of samples (e.g. 1 sample from each condition) and run corset on this. The clustering does not usually change much with more samples (even 1 sample does a good job of clustering the transcripts, in my experience!). The output of corset's clustering can then be used to build superTranscripts (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1284-1, https://github.com/Oshlack/Lace), which you can then use a bit like a reference genome: map the reads from all your samples back using a tool like STAR and then get gene-level counts. It's a longer process, but it has the advantage of allowing you to do other things, like look at gene structure and splicing, visualise read coverage over genes in IGV, etc.
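
Very roughly, the whole path would look something like the sketch below. Sample names, thread counts and paths are placeholders, the group labels are taken from your original -g vector, and the Lace/STAR options should be checked against their own documentation rather than copied from here:

# 1) cluster a small subset of samples (one per condition) with corset
corset -i salmon_eq_classes \
        PS506_CNS_eq_classes.txt PS506_eyes_eq_classes.txt \
        PS509_CNS_eq_classes.txt PS509_eyes_eq_classes.txt \
        -g 3,1,4,2 \
        -p subset_corset

# 2) build the superTranscripts from the assembly plus the corset clusters
#    (check the Lace README for the exact invocation and output file names)
python Lace.py isoseq3_assembly.fasta subset_corset-clusters.txt \
        --cores 8 --outputDir lace_out

# 3) index the superTranscript fasta and map every sample back with STAR
STAR --runMode genomeGenerate \
        --genomeDir star_index \
        --genomeFastaFiles lace_out/SuperDuper.fasta \
        --genomeSAindexNbases 11        # lower than the default because the "genome" is small

STAR --genomeDir star_index \
        --readFilesIn PS506_CNS_R1.fastq.gz PS506_CNS_R2.fastq.gz \
        --readFilesCommand zcat \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix PS506_CNS_

# 4) count reads per superTranscript; since each cluster behaves like its own
#    "chromosome", a per-reference count such as samtools idxstats is enough for DGE
samtools index PS506_CNS_Aligned.sortedByCoord.out.bam
samtools idxstats PS506_CNS_Aligned.sortedByCoord.out.bam > PS506_CNS_counts.txt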

Cheers,
Nadia.

Jodi Thomas

Feb 23, 2021, 2:20:59 AM
to corset-project
Awesome, what a great idea! Thanks Nadia I'll try that.

Jodi :)

Jodi Thomas

Feb 23, 2021, 6:13:41 PM
to corset-project
Good morning Nadia,

Just thought I'd let you know that I re-ran corset with 2 samples in each of the 4 groups (total 8 samples) and it ran in ~6.5 hours.

Thanks for your help!
Jodi :)

Jodi Thomas

Feb 23, 2021, 8:05:33 PM
to corset-project
I just wanted to check, please - if I build a superTranscriptome using Lace and then map all my reads against it to get gene-level counts for DGE analysis, can I use my annotated de novo transcriptome (the same transcriptome that went into corset)? Or do I need to annotate the superTranscriptome output from Lace?

Sorry, this question may be better suited to the Lace Google group. I have followed the link on the Lace GitHub page and tried searching for the Google group, but I cannot find it. I'm not sure if this is because it is not viewable by the general public.

Thank you!!
Jodi :)

Nadia Davidson

Feb 24, 2021, 4:36:17 AM
to corset-project

Hi Jodi,

The superTranscript IDs will be the cluster IDs from corset, so if you have already annotated the transcripts, it's possible to use this to annotate the superTranscripts via the mapping between transcript and cluster IDs that corset outputs. However, this would require you to write a custom script. It's worth noting that you would probably need to do this with the corset-only method as well. If you are just interested in DGE, you do not need to worry about exon/intron positions: each gene is essentially its own chromosome, and you just need a tool that tells you how many reads map to each "chromosome". You can also map the de novo assembly (or even the unassembled long reads) back to the superTranscripts if you wish to annotate gene structure. Hope that makes some sense; otherwise I'm happy to answer more questions!
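
To give you an idea of what that custom step could look like: assuming you have (or can make) a simple two-column table of transcript ID and annotation - an assumption on my part, your annotation format may well differ - something like the following joins it to corset's <prefix>-clusters.txt file (transcript ID, tab, cluster ID) and collapses the annotations per cluster:

# sort both tab-separated files on the transcript ID so they can be joined
sort -t $'\t' -k1,1 subset_corset-clusters.txt       > clusters.sorted.txt
sort -t $'\t' -k1,1 transcript_annotation.tsv        > annotation.sorted.txt

# join on transcript ID, then collapse to one semicolon-separated
# annotation string per cluster ID
join -t $'\t' -1 1 -2 1 clusters.sorted.txt annotation.sorted.txt \
    | awk -F'\t' '{ann[$2] = (ann[$2] ? ann[$2] ";" : "") $3}
                  END {for (c in ann) print c "\t" ann[c]}' \
    > cluster_annotation.tsv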

Thanks for pointing out that the Lace Google group was closed. I have tried to fix this, so hopefully it is now open for viewing and questions from the general public.

Cheers,
Nadia.

Jodi Thomas

Feb 25, 2021, 6:47:12 PM
to corset-project
Hi Nadia,

Great, that makes sense, thanks for your reply. And good news - the original corset run I posted about (40 samples across 4 groups, stuck on a cluster) has now finished!! I had left it running in the background in the small hope it might finish, and it has! It took 6 days and 6 hours. As this has worked and I have count information for every sample, I will use it for my DGE analysis. I have annotated my transcriptome assembly and plan on using Adam Taranto's fetchClusterSeqs.py script to pull out the transcripts for each cluster of interest, then compare these against my annotated assembly to annotate the transcripts belonging to the DE clusters.
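
In case it helps anyone else, the same extraction can also be done with awk plus seqtk if the fetchClusterSeqs.py options turn out not to suit - roughly like this (the cluster ID and fasta name below are just examples):

# pull the transcript IDs assigned to one cluster of interest out of corset's clusters file
awk -F'\t' '$2 == "Cluster-1000.0"' isoseq3_cdhit0.99_transdecoder_nrmollusca_genes_corset-clusters.txt \
        | cut -f1 > cluster_ids.txt

# extract those sequences from the assembly (fasta name is a placeholder)
seqtk subseq isoseq3_cdhit0.99_transdecoder.fasta cluster_ids.txt > cluster_transcripts.fasta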

Thank you - I can now find the google group!

Thanks so much for your help Nadia, it has made using corset a great experience!
Jodi :)

Nadia Davidson

Feb 25, 2021, 6:52:40 PM
to corset-project
Excellent! I'm glad the job succeeded in the end! 

Happy to help and answer further questions as required.

Cheers,
Nadia. 
