Are the read counts in counts.txt raw or normalized?

nadalt...@gmail.com

May 1, 2020, 11:47:25 AM

to corset-project

Hi,

Thank you very much for developing and continuously supporting Corset; it is very useful for my experiment. I have some questions and I hope you don't mind helping to explain. I have 2 treatments (A and B) with 3 replicates each. After Trinity, I used Salmon to quantify read abundance and followed the Corset instructions. I got 2 output files: clusters.txt and counts.txt.

I wanted to be sure that the samples are listed correctly, so I looked at quant.sf from Salmon and tried to match the counts from each sample back to the Corset output. I am not sure whether I did something wrong, but the numbers of reads reported in quant.sf by Salmon and in counts.txt are not the same (please see some of the results below). Please correct me if I am wrong, but I understand that Corset sums all reads from the transcripts of each cluster, rounds the counts somehow and reports them as raw counts. They should therefore be the same as (or very close to) the Salmon outputs. Or are the values reported in counts.txt normalised reads?

Thank you very much and looking forward to hearing from you.

Best regards
Nad

-----------------
Transcript 1
-----------------
(1) Result from clusters.txt
TRINITY_DN22615_c0_g1_i1 is the only transcript in the Cluster-6962.59096

(2) Results from counts.txt.
Cluster               A1     A2     A3     B1     B2     B3
Cluster-6962.59096    140    130    83     219    233    195

(3) Results from quant.sf (Salmon)
Sample    Name                        Length    EffectiveLength    TPM          NumReads
A1        TRINITY_DN22615_c0_g1_i1    237       61.993             25.864129    154.199
A2        TRINITY_DN22615_c0_g1_i1    237       58.081             29.414225    140.224
A3        TRINITY_DN22615_c0_g1_i1    237       56.194             18.605301    88.449
B1        TRINITY_DN22615_c0_g1_i1    237       56.120             48.992792    219.375
B2        TRINITY_DN22615_c0_g1_i1    237       55.068             58.604391    250.620
B3        TRINITY_DN22615_c0_g1_i1    237       63.627             36.524794    199.230

-----------------
Transcript 2
-----------------
(1) Result from clusters.txt
TRINITY_DN77576_c0_g1_i1 is the only transcript in the Cluster-43219.0

(2) Results from counts.txt.
Cluster            A1    A2    A3    B1    B2    B3
Cluster-43219.0    26    2     8     1     27    0

(3) Results from quant.sf (Salmon)
Sample    Name                        Length    EffectiveLength    TPM         NumReads
A1        TRINITY_DN77576_c0_g1_i1    793       565.197            0.433184    23.546
A2        TRINITY_DN77576_c0_g1_i1    793       550.086            0.044297    2.000
A3        TRINITY_DN77576_c0_g1_i1    793       546.198            0.173130    8.000
B1        TRINITY_DN77576_c0_g1_i1    793       548.196            0.022862    1.000
B2        TRINITY_DN77576_c0_g1_i1    793       541.401            0.642176    27.000
B3        TRINITY_DN77576_c0_g1_i1    793       563.757            0.000000    0.000

Nadia Davidson

May 3, 2020, 7:00:55 PM

to corset-project

Hi Nad,

Some differences between Salmon's output and Corset's are to be expected, and the numbers you list are fairly similar, so it looks like Corset is working correctly. The two differ slightly because of the way each estimates counts. Both are based on something called equivalence class counts, which Salmon calculates. Corset simply adds up all the equivalence class counts for a gene. If an equivalence class matches more than one gene, Corset assigns it to one of them at random. Salmon instead uses the equivalence class counts to infer the counts per transcript. If an equivalence class matches more than one transcript, it uses more sophisticated methods to try to work out where to assign the reads. In practice, these differences seem to have very little effect on detecting differential expression in genes.
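To make the counting idea concrete, here is a minimal Python sketch. It is not Corset's actual code. The function name `cluster_counts`, the transcript and cluster IDs, and the read counts are all hypothetical. It also assumes that an equivalence class spanning several clusters is assigned as a whole to one of them at random. Here an equivalence class is the set of transcripts that a group of reads maps to equally well, together with the number of such reads.

```python
import random
from collections import defaultdict


def cluster_counts(eq_classes, transcript_to_cluster, seed=0):
    """Sum equivalence class counts per cluster (illustrative only).

    eq_classes: list of (tuple_of_transcript_ids, read_count) pairs.
    transcript_to_cluster: dict mapping transcript ID -> cluster ID.
    Assumption: a class whose transcripts fall in several clusters is
    given, as a whole, to one of those clusters chosen at random.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    for transcripts, n_reads in eq_classes:
        # Distinct clusters touched by this equivalence class
        clusters = sorted({transcript_to_cluster[t] for t in transcripts})
        chosen = clusters[0] if len(clusters) == 1 else rng.choice(clusters)
        counts[chosen] += n_reads
    return dict(counts)


# Hypothetical toy input: t1 and t2 belong to cluster c1, t3 to cluster c2
transcript_to_cluster = {"t1": "c1", "t2": "c1", "t3": "c2"}
eq_classes = [
    (("t1",), 100),       # unique to t1, so counted for c1
    (("t1", "t2"), 40),   # shared within c1, so still counted for c1
    (("t2", "t3"), 10),   # spans c1 and c2, so assigned at random
]
result = cluster_counts(eq_classes, transcript_to_cluster)
print(result)
```

Classes that fall entirely inside one cluster always count towards it, so only reads shared between clusters depend on the random choice. Salmon's per-transcript NumReads come from a statistical model that can split such reads fractionally, which is one reason the two sets of numbers need not match exactly.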

Cheers,

Nadia.

nadalt...@gmail.com

May 4, 2020, 9:56:00 PM

to corset-project

Hi Nadia,

Thanks so much for your reply and sorry for the very basic question. I am a wet-lab biologist very new to RNA-Seq. I hope you do not mind one last question. Could you please confirm my understanding that the file counts.txt provides raw reads, and that in the Corset workflow you perform normalization in R with calcNormFactors?

Thanks again and have a great day!

Best wishes,

Nad

Nadia Davidson

May 4, 2020, 10:41:09 PM

to corset-project

No worries and not a basic question at all.

Yes, that's right: Corset provides raw, unnormalised read counts, which you can normalise in R using methods such as calcNormFactors.
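As a rough illustration of what "raw" versus "normalised" means here, the Python sketch below scales each sample's counts by its library size (counts per million). This is not what calcNormFactors does: that function belongs to the edgeR package in R and by default computes TMM scaling factors. The function name `counts_per_million`, the cluster names and the numbers are all hypothetical.

```python
def counts_per_million(counts):
    """Scale raw counts by each sample's total (simple library-size scaling).

    counts: dict mapping cluster ID -> dict mapping sample name -> raw count.
    Assumption: plain counts-per-million, not edgeR's TMM normalisation.
    """
    samples = list(next(iter(counts.values())).keys())
    # Library size = total raw count across all clusters in a sample
    lib_sizes = {s: sum(row[s] for row in counts.values()) for s in samples}
    return {
        cluster: {s: row[s] * 1e6 / lib_sizes[s] for s in samples}
        for cluster, row in counts.items()
    }


# Hypothetical raw counts: sample B1 was sequenced three times as deeply as A1
raw = {
    "Cluster-X": {"A1": 100, "B1": 300},
    "Cluster-Y": {"A1": 300, "B1": 900},
}
cpm = counts_per_million(raw)
print(cpm)
```

In this toy case the raw counts differ threefold between samples, but the scaled values are identical, because the difference is due to sequencing depth alone.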