Corset with Salmon

Yoshida Naoki

unread,

Sep 2, 2020, 12:29:05 PM9/2/20

to corset-project

Hi Corset Authors,

I am struggling to run Corset and I would appreciate if you kindly support me.

I am working with a non-model frog using 2 groups with each 2 replications (total 4 RNA samples). These were sequenced with 2 x 150 bp, trimmed, de novo assembled by Trinity with using 4 fastq pair files, and evaluated by BUSCO, resulting in reliable app. 200,000 transcripts. Previously, I used CD-hit-est to reduce redaudancy, but now I would like to use Corset to get better clustering. To run Corset, I firstly run salmon on Mac terminal:

*Indexing the assembled TranscriptomeAssemblyTools

salmon index -p 2 -t /Users/naoki/AF_1_2_liver/trinity_contig_AF_1_2/Trinity_AF_1_2.fasta -i Salmon_ref_index_AF_1_2

*Quantifying the reads of 4 samples (individually run)

salmon quant -i rSalmon_ref_index_AF_1_2 -l ISF --dumpEq -1 pair1.fastq.gz -2 pair2.fastq.gz -p 8 -o Salmon_quant_AF_1_2_output

Four eq_classes.txt.gz files (2.1-3 MB) were generated in aux_info sub-directory in main directories, named "A1, A2, B1 and B2". The I run Corset with using the following command:

corset -g 1,1,2,2 -n A1,A2,B1,B2 -i salmon_eq_classes A1/aux_info/eq_classes.txt.gz A2/aux_info/eq_classes.txt.gz B1/aux_info/eq_classes.txt.gz B2/aux_info/eq_classes.txt.gz

Then I got count.txt and clusters.txt, but these files are blank (counts.txt notes but "A1 A2 B1 B2"). And Mac terminal showed as follows.

Running Corset Version 1.09

Setting sample groups:1,1,2,2, 2 groups in total

Setting sample names to:A1,A2,B1,B2

Reading salmon eq_classes file : A1/aux_info/eq_classes.txt.gz

Reading data on 0 transcripts in 0 equivalence classes

0 reads counted, 0 reads filtered, 0 reads redistributed.

Reading salmon eq_classes file : A2/aux_info/eq_classes.txt.gz

Reading data on 0 transcripts in 0 equivalence classes

0 reads counted, 0 reads filtered, 0 reads redistributed.

Reading salmon eq_classes file : B1/aux_info/eq_classes.txt.gz

Reading data on 0 transcripts in 0 equivalence classes

0 reads counted, 0 reads filtered, 0 reads redistributed.

Reading salmon eq_classes file : B2/aux_info/eq_classes.txt.gz

Reading data on 0 transcripts in 0 equivalence classes

0 reads counted, 0 reads filtered, 0 reads redistributed.

Done reading all files.

Start to cluster the reads

Starting hierarchial clustering...

Finished

I was wondering if eq_classes.txt.gz files were not generated correctly? Or something wrong in Corset command or setting of directory?

Thanks and Best Regards,

Naoki

Yoshida Naoki

unread,

Sep 5, 2020, 12:17:41 PM9/5/20

to corset-project

This problem was all resolved. As written in the following salmon v1.3.0 release note, eq_classes.txt output is now gzipped as eq_classes.txt.

https://github.com/COMBINE-lab/salmon/releases

The equivalence class output is now gzipped when written (and written to aux_info/eq_classes.txt.gz rather than aux_info/eq_classes.txt). To detect this behavior, an extra property gzipped is written to the eq_class_properties entry of aux_info/meta_info.json. Apart from being gzipped to save space, however, the format is unchanged. So, you can simply read the file using a gzip stream, or, alternatively, simply unzip the file before reading it.

So I just unzipped eq_classes.txt.gz by "gzip -d eq_classes.txt.gz". And I just use this file.

corset -g 1,1,2,2 -n A1,A2,B1,B2 -i salmon_eq_classes A1/aux_info/eq_classes.txt. A2/aux_info/eq_classes.txt B1/aux_info/eq_classes.txt B2/aux_info/eq_classes.txt

Finally I successfully got the output. Sorry for bothering you.

Best,

Naoki

2020年9月3日木曜日 1:29:05 UTC+9 Yoshida Naoki:

Nadia Davidson

unread,

Sep 6, 2020, 10:35:37 PM9/6/20

to corset-project

Hi Naoki,

Thanks for your email. I wasn't aware of the change to the eq_class file in the newer version of salmon, so I will update the documentation on our wiki to reflect this.

Your email has brought another issue to my attention and that is that salmon version 1 upwards uses the "validateMapping" option by default, which is not compatible with corset. This can be switched off in salmon with the flag "--hardFilter". I'm going to update our documentation about this too. You may want to rerun you salmon alignment with this option as I suspect it will impact your corset results.

Cheers,

Nadia.

Yoshida Naoki

unread,

Sep 7, 2020, 12:09:43 PM9/7/20

to corset-project

Hi Nadia,

Thank you very much for your reply and pointing out the setting in salmon. This is really helpful for me!

I used salmon v1.2.1 without the flag "--hardFilter" and run corset. This produced 21,000 clusters from 27,000 transcripts. It seems good result for me, but will you still recommend to rerun salmon with the lag "--hardFilter" to get better corset result?

Sincerely,

Naoki

2020年9月7日月曜日 11:35:37 UTC+9 nadiamd...@gmail.com:

Nadia Davidson

unread,

Sep 7, 2020, 9:43:00 PM9/7/20

to corset-project

Hi Naoki,

The clustering may be okay, but to be safe it would be better to rerun salmon with the --hardFilter option, in particular if you are interested in using the counts generated by corset.

Cheers,

Nadia.

Nadal Tegstrom

unread,

Jan 2, 2021, 3:26:18 AM1/2/21

to corset-project

Hi Nadia,

Happy new year! And many thanks for your continuous support for Corset!

Sorry for revoking this thread, but I feel it is relevant. From your previous reply, I understand that the --hardFilter flag will turn of the --validateMappings function. I am however running Salmon 1.3 and I wonder if it is different for this verion since I understand from the log below (#with --hardFilter, in bold and italicized) that the --hardFilter flag will trigger mapping validation. Do you have a suggestion in this case whether I should or shouldn't go with the --hardFilter flag?

Thanks for your help in advance and looking forward to hearing from you!

Best,

Nadal

----------------------------

# With --hardFilter

[2021-01-02 00:28:01.558] [jointLog] [info] setting maxHashResizeThreads to 8

[2021-01-02 00:28:01.558] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.

[2021-01-02 00:28:01.558] [jointLog] [info] The --mimicBT2, --mimicStrictBT2 and --hardFilter flags imply mapping validation (--validateMappings). Enabling mapping validation.

[2021-01-02 00:28:01.558] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65

[2021-01-02 00:28:01.558] [jointLog] [info] The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter. Disabling range-factorized equivalence classes.

[2021-01-02 00:28:01.559] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.

[2021-01-02 00:28:01.559] [jointLog] [info] parsing read library format

[2021-01-02 00:28:01.559] [jointLog] [info] There is 1 library.

[2021-01-02 00:28:01.732] [jointLog] [info] Loading pufferfish index

[2021-01-02 00:28:01.753] [jointLog] [info] Loading dense pufferfish index.

---------------------------

#"Without --hardFilter

[2021-01-02 09:08:57.607] [jointLog] [info] setting maxHashResizeThreads to 8

[2021-01-02 09:08:57.607] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.

[2021-01-02 09:08:57.607] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65

[2021-01-02 09:08:57.607] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.35.

[2021-01-02 09:08:57.607] [jointLog] [info] parsing read library format

[2021-01-02 09:08:57.607] [jointLog] [info] There is 1 library.

[2021-01-02 09:08:57.743] [jointLog] [info] Loading pufferfish index

[2021-01-02 09:08:57.743] [jointLog] [info] Loading dense pufferfish index.

Nadia Davidson

unread,

Jan 10, 2021, 11:57:43 PM1/10/21

to corset-project

Hi Nadal,

From your log files, I think --hardFilter is okay, as we want salmon to switch of range factorisation ( which the --validateMapping parameter used to switch on), so it's not really the validateMapping that's a problem directly. The key log line is "The use of range-factorized equivalence classes does not make sense in conjunction with --hardFilter. Disabling range-factorized equivalence classes." So I think your corset results for this should be okay. I'll check through the corset documentation and update in case I've said something incorrect about hardFilter switching off validateMappings.

Thanks.

Cheers,

Nadia.

Nadal Tegstrom

unread,

Jan 14, 2021, 5:08:32 PM1/14/21

to corset-project

Hi Nadia,

Thanks so much for your advice. I will keep the --hardFilter then. :)

I also have the second question emerged very recently if you don't mind. I have time-series samples of the same species (e.g. samples collected at 5 and 10 days after exposed to a fungal pathogen and each timepoint has its own controls), and I would like compare results within each timepoint and between the two timepoints. Would it make sense if I run Corset with the below command so that Corset can form clusters shared by both timepoints and I can use them in R?

Corset -g 1,1,1,2,2,2,3,3,3,4,4,4 (# 4 groups; 2 controls and 2 treatments) -n D5S1,D5S2,D5S3,D5C1,D5C2,D5C3,D10S1,D10S2,D10S3,D10C1,D10C2,D10C3 (# treatment day5, control day5, treatment day10, control day10) -salmon_eq_classes *-Sample*/aux_info/eq_classes.txt

Thank you very much and looking forward to hearing from you.

Best regards,

Nadal

Nadia Davidson

unread,

Jan 14, 2021, 6:30:04 PM1/14/21

to corset-project

Yes, that's how you'd do it!

吉田直樹

unread,

Jan 25, 2021, 10:36:19 AM1/25/21

to corset-project

Hello Nadia,

Thanks to your answer and suggestion, I finally got DEGs with using Salmon, Corset and EdgeR. When I do annotation and GO enrichment analysis, I found that some identified DEGs had similar cluster ID. And I looked back at Corset cluster file, I found that some cluster IDs were overlapped as Cluster-4423.1, Cluster-4423.2, Cluster-4423.3... Most clusters were single as Cluster-1.0, 2.0, 3.0, and so on. What was such overlapped (with different sub numbers) clusters? Would it be okay to forward DEG analysis with using those clusters and their sequences?

Thank you!

Naoki

2021年1月15日金曜日 8:30:04 UTC+9 nadiamd...@gmail.com:

Nadia Davidson

unread,

Feb 7, 2021, 9:27:56 PM2/7/21

to corset-project

Hi Naoki,

In general each ID is a unique cluster which you can do DGE on. So Cluster-4423.1, Cluster-4423.2, Cluster-4423.3 etc. would in theory be separate genes. The numbering just indicates that they may have some sequence in common. However in practice, such clusters could be from the same gene (because assembly and clustering is not perfect). This should not matter too much for DGE analysis; you will just see the same gene coming up as DGE a few time in the ranked list. For GO enrichment analysis, however it may be a problem that you are double counting genes. If you have annotated all the clusters to genes, you could choose the lowest p-value for each gene and proceed with that for GO enrichment or other gene set testing.