Are the read counts in counts.txt raw or normalized?

nadalt...@gmail.com

May 1, 2020, 11:47:25 AM
to corset-project
Hi,

Thank you very much for developing and continuing to support Corset; it is very useful for my experiment. I have some questions and hope you don't mind helping me understand. I have 2 treatments (A and B) with 3 replicates each. After Trinity, I used Salmon to quantify read abundance and followed the Corset instructions. I got 2 output files: clusters.txt and counts.txt.

I wanted to be sure that the samples are listed correctly, so I looked at quant.sf from Salmon and tried to match the counts from each sample back to the Corset output. I am not sure whether I did something wrong, but the numbers of reads reported in quant.sf by Salmon and in counts.txt are not the same (please see some of the results below). Please correct me if I am wrong, but I understand that Corset sums all reads from the transcripts in each cluster, rounds the counts somehow, and reports them as raw counts. If so, they should be the same as (or very close to) the Salmon outputs. Or are the values reported in counts.txt normalised reads?

Thank you very much and looking forward to hearing from you.

Best regards
Nad

-----------------
Transcript 1
-----------------
(1) Result from clusters.txt
TRINITY_DN22615_c0_g1_i1 is the only transcript in Cluster-6962.59096.

(2) Results from counts.txt.
Cluster               A1     A2     A3    B1     B2     B3
Cluster-6962.59096    140    130    83    219    233    195

(3) Results from quant.sf (Salmon)
Sample    Name                        Length    EffectiveLength    TPM          NumReads
A1        TRINITY_DN22615_c0_g1_i1    237       61.993             25.864129    154.199
A2        TRINITY_DN22615_c0_g1_i1    237       58.081             29.414225    140.224
A3        TRINITY_DN22615_c0_g1_i1    237       56.194             18.605301    88.449
B1        TRINITY_DN22615_c0_g1_i1    237       56.120             48.992792    219.375
B2        TRINITY_DN22615_c0_g1_i1    237       55.068             58.604391    250.620
B3        TRINITY_DN22615_c0_g1_i1    237       63.627             36.524794    199.230

-----------------
Transcript 2
-----------------
(1) Result from clusters.txt
TRINITY_DN77576_c0_g1_i1 is the only transcript in Cluster-43219.0.

(2) Results from counts.txt.
Cluster            A1    A2    A3    B1    B2    B3
Cluster-43219.0    26    2     8     1     27    0

(3) Results from quant.sf (Salmon)
Sample    Name                        Length    EffectiveLength    TPM         NumReads
A1        TRINITY_DN77576_c0_g1_i1    793       565.197            0.433184    23.546
A2        TRINITY_DN77576_c0_g1_i1    793       550.086            0.044297    2.000
A3        TRINITY_DN77576_c0_g1_i1    793       546.198            0.173130    8.000
B1        TRINITY_DN77576_c0_g1_i1    793       548.196            0.022862    1.000
B2        TRINITY_DN77576_c0_g1_i1    793       541.401            0.642176    27.000
B3        TRINITY_DN77576_c0_g1_i1    793       563.757            0.000000    0.000
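
For a quick look at how far the two sets of numbers are apart, the values listed above can be compared sample by sample. This is only a rough sketch: the numbers are copied by hand from the tables in this post, and the helper name `differences` is made up for illustration.

```python
# Values copied from the tables above, in sample order A1, A2, A3, B1, B2, B3.
corset = {
    "Cluster-6962.59096": [140, 130, 83, 219, 233, 195],
    "Cluster-43219.0": [26, 2, 8, 1, 27, 0],
}
salmon = {
    "Cluster-6962.59096": [154.199, 140.224, 88.449, 219.375, 250.620, 199.230],
    "Cluster-43219.0": [23.546, 2.000, 8.000, 1.000, 27.000, 0.000],
}


def differences(cluster):
    """Return Corset count minus Salmon NumReads for each sample."""
    return [round(c - s, 3) for c, s in zip(corset[cluster], salmon[cluster])]


for cluster in corset:
    print(cluster, differences(cluster))
```

The differences are small relative to the counts themselves, and for the second cluster five of the six samples match exactly.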


Nadia Davidson

May 3, 2020, 7:00:55 PM
to corset-project
Hi Nad,

Some differences between Salmon's output and Corset's are to be expected, and the numbers you list are fairly similar, so it looks like Corset is working correctly. They are slightly different because of the way each tool estimates counts. Both are based on something called equivalence class counts, which Salmon calculates. Corset simply adds up all the equivalence class counts for a gene. If an equivalence class matches more than one gene, Corset assigns it randomly. Salmon instead uses the equivalence class counts to infer the counts per transcript. If an equivalence class matches more than one transcript, it uses more sophisticated methods to work out where to assign the reads. In practice, these differences seem to make very little difference to detecting differential expression in genes.
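
As a toy illustration of the "add up the equivalence class counts" idea, here is a minimal Python sketch. It is not Corset's actual code: the function name `cluster_counts`, the input layout, and the example data are all hypothetical, and it assumes an ambiguous class is handed over whole to one randomly chosen cluster.

```python
import random


def cluster_counts(eq_classes, transcript_to_cluster, seed=0):
    """Sum equivalence-class counts per cluster (illustrative only).

    eq_classes: list of (transcript_ids, read_count) pairs.
    A class whose transcripts fall in more than one cluster is assigned,
    as a whole, to one of those clusters picked at random (an assumption).
    """
    rng = random.Random(seed)
    totals = {}
    for transcripts, count in eq_classes:
        clusters = sorted({transcript_to_cluster[t] for t in transcripts})
        chosen = clusters[0] if len(clusters) == 1 else rng.choice(clusters)
        totals[chosen] = totals.get(chosen, 0) + count
    return totals


# Hypothetical data: t1 and t2 share cluster c1, t3 is alone in c2.
toy_map = {"t1": "c1", "t2": "c1", "t3": "c2"}
toy_classes = [(("t1",), 10), (("t1", "t2"), 5), (("t3",), 7), (("t2", "t3"), 4)]
print(cluster_counts(toy_classes, toy_map))
```

Every read ends up in exactly one cluster, so the totals are whole numbers, whereas Salmon's per-transcript NumReads can be fractional.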

Cheers,
Nadia.

nadalt...@gmail.com

May 4, 2020, 9:56:00 PM
to corset-project
Hi Nadia,

Thanks so much for your reply, and sorry for the very basic question. I am a wet-lab biologist and very new to RNA-Seq. I hope you do not mind one last question. Could you please confirm my understanding that counts.txt provides raw reads, and that in the Corset workflow normalization is performed in R with calcNormFactors?

Thanks again and have a great day!

Best wishes,
Nad

Nadia Davidson

May 4, 2020, 10:41:09 PM
to corset-project
No worries and not a basic question at all.

Yes, that's right: Corset provides raw, unnormalised reads, which you can normalise in R using methods such as calcNormFactors.
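
To show what "raw" versus "normalised" means in practice, here is a small Python sketch of the simplest kind of normalisation, scaling each sample by its library size (counts per million). This is deliberately not what calcNormFactors does (in edgeR it computes TMM scaling factors by default); the function name `counts_per_million` and the toy counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale each sample's raw counts by its library size (column total).

    counts: dict mapping sample name -> list of raw counts, one per cluster.
    Samples with no reads at all are returned as zeros.
    """
    cpm = {}
    for sample, values in counts.items():
        library_size = sum(values)
        if library_size == 0:
            cpm[sample] = [0.0 for _ in values]
        else:
            cpm[sample] = [v / library_size * 1_000_000 for v in values]
    return cpm


# Hypothetical toy counts for three clusters in two samples.
raw = {"A1": [140, 26, 834], "B1": [219, 1, 1780]}
normalised = counts_per_million(raw)
print(normalised)
```

The raw values in counts.txt correspond to the input here; anything like the scaled output only appears after a normalisation step downstream.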

Cheers,
Nadia.