Number of mapped reads: sample_name.cnt vs expect_counts

492 views
Skip to first unread message

Mafalda Ferreira

unread,
May 13, 2015, 10:41:14 AM5/13/15
to rsem-...@googlegroups.com
Hello,

I'm trying to understand something concerning the outputs of RSEM. One is the file sample_name.cnt inside the folder sample_name.stat and comparing the result of that file and the expected counts obtained.

I'm aware that inside the file sample_name.cnt, the first line has 4 numbers which are:

N0 = number of unalignable reads; 

N1 = number of alignable reads; 

N2 = number of filtered reads due to too many alignments;
N= N0 + N1 + N2


Giving an example of one of my samples, this is the first line of sample_name.cnt for that sample:

0 22910113 0 22910113


so N0 = 0, N1 = 22910113, N2 = 0 and Ntotal=22910113


I find this result odd because the total number of read pairs for this sample is 28,805,265 (from the fastq file). Assuming that the numbers inside .cnt file concern read pairs, I expect that N1 would be smaller than the total no. of input read pairs since I don't expect that all reads align. However, if there is some amount of reads that could not be mapped, shouldn't N0 be different from 0 and Nt equal to the number of input read pairs? Did the meaning of these 4 numbers changed with RSEM version?

Also, summing the expected_counts from the sample_name.gene.results, I get a library size of 11433722.93 (EdgeR outputs this number too, since it sums the expected counts column to produce the library size). This number is approximately half of N1. Now this is puzzling because it either means: 


1) that N1 is actually the total number of reads mapped (no. R1 + no. R2). In that case there is a large amount of reads not mapping (22910113 / 28,805,265 * 2 ~ 0.40 --> 40 % mapping success) which is not good...

2) that N1 still represents read pairs, but the expected counts are wrong (?).

It is weird because I've done this math before to calculate mapping success and the N always matched the number of input read pairs, and N1 always matched library size. 

Can I get some help solving this out? 
Thank you in advance,

Mafalda

Colin Dewey

unread,
May 13, 2015, 12:26:42 PM5/13/15
to rsem-...@googlegroups.com
Hi Mafalda,

Could you give what version of RSEM you are using, the command line you are running for rsem-calculate-expression, and, if you are using a custom alignment, how that alignment was obtained?  That would help here.

Indeed, the numbers in the first line of sample_name.cnt are for fragments (read pairs, in the case of paired-end data).

If you are not seeing any unalignable reads, it might be that the aligner you are using is not putting the unaligned reads into its output.

The sum of the expected counts for a sample should generally be pretty close to the value of N1.  They are not the same because some reads may only have poor-quality alignments and may be assigned (or have some fraction of them assigned) to the “noise” gene in the RSEM model.  If you have a lot of poor-quality alignments, then this could explain a difference between the number of aligned reads and the sum of the counts in the RSEM output files.

Hope that helps,
Colin

--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rsem-users.

Mafalda Sousa Ferreira

unread,
May 14, 2015, 5:58:55 AM5/14/15
to rsem-...@googlegroups.com, cde...@biostat.wisc.edu
Could you give what version of RSEM you are using, the command line you are running for rsem-calculate-expression, and, if you are using a custom alignment, how that alignment was obtained?  That would help here.

I'm using RSEM v. 1.2.19 and I'm calling it with this command (inside a script that calls one by one the two pairs of reads for each sample):

/src/trinityrnaseq-2.0.2/util/align_and_estimate_abundance.pl --transcripts ../trinity_out_dir/transcriptome.Trinity.fasta --seqType fq --left $my_first_read_file --right $my_second_read_file --est_method RSEM --output_prefix $my_prefix --SS_lib_type RF --output_dir ./output --prep_reference --trinity_mode --aln_method bowtie --bowtie_RSEM "--all --best --strata -m 300 --chunkmbs 512 --seedlen 23"

I'm using RNA-seq reads for a group of samples to construct a transcriptome (with Trinity 2.0.2) and I'm mapping these reads back to the transcriptome using the scrip 
align_and_estimate_abundance.pl that calls RSEM.

Indeed, the numbers in the first line of sample_name.cnt are for fragments (read pairs, in the case of paired-end data).

If you are not seeing any unalignable reads, it might be that the aligner you are using is not putting the unaligned reads into its output.

I'm using bowtie 1.0.0

The sum of the expected counts for a sample should generally be pretty close to the value of N1.  They are not the same because some reads may only have poor-quality alignments and may be assigned (or have some fraction of them assigned) to the “noise” gene in the RSEM model.  If you have a lot of poor-quality alignments, then this could explain a difference between the number of aligned reads and the sum of the counts in the RSEM output files.

Even so, it is somewhat odd that they are almost exactly half of N1, no? Or else, it is as you say and a lot of them map very poorly... I thought the difference was due to the fact that the expected counts are not a raw read count but are an estimated count made by the RSEM model, while N1 should be the exact number of reads aligned.  

Thanks for your answer,
Mafalda

You received this message because you are subscribed to a topic in the Google Groups "RSEM Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rsem-users+...@googlegroups.com.

Bo Li

unread,
May 15, 2015, 7:40:05 PM5/15/15
to rsem-...@googlegroups.com
Hi Mafalda,

That's weird. 'sample_name.cnt' is the mapping statistic and should be
consistent with the alignments your aligner produced. Can you try to run
RSEM directly instead of using the Trinity script? Then we will know if
the problem is due to RSEM or the Trinity script.

Best,
Bo

On 2015-05-14 02:58, Mafalda Sousa Ferreira wrote:
> Could you give what version of RSEM you are using, the command line
> you are running for rsem-calculate-expression, and, if you are using a
> custom alignment, how that alignment was obtained?  That would help
> here.
>
> I'm using RSEM v. 1.2.19 and I'm calling it with this command (inside
> a script that calls one by one the two pairs of reads for each
> sample):
>
> /src/trinityrnaseq-2.0.2/util/align_and_estimate_abundance.pl [4]
> --transcripts ../trinity_out_dir/transcriptome.Trinity.fasta --seqType
> fq --left $my_first_read_file --right $my_second_read_file
> --est_method RSEM --output_prefix $my_prefix --SS_lib_type RF
> --output_dir ./output --prep_reference --trinity_mode --aln_method
> bowtie --bowtie_RSEM "--all --best --strata -m 300 --chunkmbs 512
> --seedlen 23"
>
> I'm using RNA-seq reads for a group of samples to construct a
> transcriptome (with Trinity 2.0.2) and I'm mapping these reads back to
> the transcriptome using the scrip align_and_estimate_abundance.pl [4]
>>> I find this result odd because the total number of read PAIRS for
>>> this sample is 28,805,265 (from the fastq file). Assuming that the
>>> numbers inside .cnt file concern read PAIRS, I expect that N1
>>> would be smaller than the total no. of input read pairs since I
>>> don't expect that all reads align. However, if there is some
>>> amount of reads that could not be mapped, shouldn't N0 be
>>> different from 0 and Nt equal to the number of input read pairs?
>>> Did the meaning of these 4 numbers changed with RSEM version?
>>>
>>> Also, summing the expected_counts from the
>>> sample_name.gene.results, I get a library size of 11433722.93
>>> (EdgeR outputs this number too, since it sums the expected counts
>>> column to produce the library size). This number is approximately
>>> half of N1. Now this is puzzling because it either means: 
>>>
>>> 1) that N1 is actually the TOTAL number of reads mapped (NO. R1 +
>>> NO. R2). In that case there is a large amount of reads not mapping
>>> (22910113 / 28,805,265 * 2 ~ 0.40 --> 40 % mapping success) which
>>> is not good...
>>>
>>> 2) that N1 still represents read PAIRS, but the expected counts
>>> are wrong (?).
>>>
>>> It is weird because I've done this math before to calculate
>>> mapping success and the N always matched the number of INPUT READ
>>> PAIRS, and N1 always matched library size. 
>>>
>>> Can I get some help solving this out? 
>>> Thank you in advance,
>>> Mafalda
>>>
>>> --
>>> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
>>> ---
>>> You received this message because you are subscribed to the
>>> Google Groups "RSEM Users" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email to rsem-users+...@googlegroups.com.
>>> To post to this group, send email to rsem-...@googlegroups.com.
>>> Visit this group at http://groups.google.com/group/rsem-users
>>> [2].
>>
>> --
>> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
>> ---
>> You received this message because you are subscribed to a topic in
>> the Google Groups "RSEM Users" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe
>> [3].
>> To unsubscribe from this group and all its topics, send an email to
>> rsem-users+...@googlegroups.com.
>> To post to this group, send email to rsem-...@googlegroups.com.
>> Visit this group at http://groups.google.com/group/rsem-users [2].
>
> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [2].
>
>
> Links:
> ------
> [1] http://deweylab.biostat.wisc.edu/rsem/
> [2] http://groups.google.com/group/rsem-users
> [3]
> https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe
> [4] http://align_and_estimate_abundance.pl

Mafalda Sousa Ferreira

unread,
May 17, 2015, 7:05:34 AM5/17/15
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Hello,

I am going to describe what I did. 

I started by updating RSEM and bowtie versions just to be sure that this wasn't a problem of incompatibility between Trinity and RSEM/bowtie. I am using now RSEM 2.1.21 and bowtie 1.1.1.

Then I used two different pipelines for the same sample and transcriptome. 

1) Using Trinity's align_and_estimate_abundance.pl :

/src/trinityrnaseq-2.0.2/util/align_and_estimate_abundance.pl --transcripts ../trinity_out_dir/transcriptome_XAB.Trinity.fasta --seqType fq --left 3112_XAB_variable_R1.fastq --right 3112_XAB_variable_R2.fastq --est_method RSEM --output_prefix 3112_XAB_variable --SS_lib_type RF --output_dir ./output --prep_reference --trinity_mode --aln_method bowtie --bowtie_RSEM "--all --best --strata -m 200 --chunkmbs 512 --seendlen 23"

2) Using RSEM's pipeline: 

2a)  /src/rsem-1.2.21/rsem-prepare-reference --transcript-to-gene-map 
transcriptome_XAB.Trinity.fasta.gene_trans_map transcriptome_XAB.Trinity.fasta XAB

2b) 
/src/rsem-1.2.21/rsem-calculate-expression --forward-prob 0 --calc-ci --bowtie-path /src/bowtie-1.1.1 --seed-length 23 --bowtie-chunkmbs 512 --paired-end 3112_XAB_variable_R1.fastq 3112_XAB_variable_R2.fastq XAB 3112_XAB_variable.rsem.out

While running 2b) I got this error:

"Could not locate a Bowtie index corresponding to basename XAB

I experimented a lot of things in an attempt to solve this like changing the reference name, moving the reference from a separate folder to the exact folder I was running RSEM, writing the absolute path to the reference name... Since I've never used bowtie alone I didn't really know what was happening (maybe it is obvious to more experienced users but I wasn't for me). I compared the files that were being formed after commands 2a) and 1) and noticed that the files with extension ebwt. and rev.ebwt were missing after 2a). So essentially I had to run, in between 2a) and 2b), this command:

/src/bowtie-1.1.1/bowtie-build transcriptome_XAB.Trinity.fasta XAB

and re-run 2b). These are the results I obtained:

Sample: 3112_XAB_variable
With rsem scripts With trinity script
N input read pairs 10,064,999 10,064,999
N0 6,028,072 0
N1 4,036,927 6,946,945
N2 0 0
Nt 10,064,999 6,946,945
library size (sum expected counts) 3,945,740 3,765,312

Running the RSEM pipeline independently, as you suggested, I now obtain N = N input read pairs and N1 approximately equal to library size, which seems correct and what I was expecting! Do you think I can trust these results now?

With the trinity script align_and_estimate_abundance.pl the results are still the same as before, which leads me to believe that the script is not working correctly. It even reports more aligned reads than rsem's pipeline but then a smaller library size is produced. Maybe this is because I'm not using the most recent Trinity version but I've been reticent to update it because I produced other transcriptomes with v.2.0.2 and want to maintain consistency. 

Best,
Mafalda






To unsubscribe from this group and all its topics, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.

Bo Li

unread,
May 17, 2015, 3:04:32 PM5/17/15
to rsem-...@googlegroups.com, trinityrn...@googlegroups.com
Hi Mafalda,

Yes, I think that you can trust the RSEM results. They look reasonable
to me. In addition, you may want to report the problem you encountered
about the Trinity script to Trinity users group.

Best,
Bo

On 2015-05-17 04:05, Mafalda Sousa Ferreira wrote:
> Hello,
>
> I am going to describe what I did. 
>
> I started by updating RSEM and bowtie versions just to be sure that
> this wasn't a problem of incompatibility between Trinity and
> RSEM/bowtie. I am using now RSEM 2.1.21 and bowtie 1.1.1.
>
> Then I used two different pipelines for the same sample and
> transcriptome. 
>
> 1) Using Trinity's align_and_estimate_abundance.pl [5] :
>
> /src/trinityrnaseq-2.0.2/util/align_and_estimate_abundance.pl [1]
> SAMPLE:
> 3112_XAB_variable
>
> WITH RSEM SCRIPTS
> WITH TRINITY SCRIPT
>
> N INPUT READ PAIRS
> 10,064,999
> 10,064,999
>
> N0
> 6,028,072
> 0
>
> N1
> 4,036,927
> 6,946,945
>
> N2
> 0
> 0
>
> NT
> 10,064,999
> 6,946,945
>
> LIBRARY SIZE (SUM EXPECTED COUNTS)
> 3,945,740
> 3,765,312
>
> Running the RSEM pipeline independently, as you suggested, I now
> obtain N = N input read pairs and N1 approximately equal to library
> size, which seems correct and what I was expecting! Do you think I can
> trust these results now?
>
> With the trinity script align_and_estimate_abundance.pl [5] the
>> /src/trinityrnaseq-2.0.2/util/align_and_estimate_abundance.pl [1]
>> [4]
>> --transcripts ../trinity_out_dir/transcriptome.Trinity.fasta
>> --seqType
>> fq --left $my_first_read_file --right $my_second_read_file
>> --est_method RSEM --output_prefix $my_prefix --SS_lib_type RF
>> --output_dir ./output --prep_reference --trinity_mode --aln_method
>> bowtie --bowtie_RSEM "--all --best --strata -m 300 --chunkmbs 512
>> --seedlen 23"
>>
>> I'm using RNA-seq reads for a group of samples to construct a
>> transcriptome (with Trinity 2.0.2) and I'm mapping these reads back
>> to
>> the transcriptome using the scrip align_and_estimate_abundance.pl
>> [1] [4]
>> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [2] [1]
>> ---
>> You received this message because you are subscribed to the
>> Google Groups "RSEM Users" group.
>> To unsubscribe from this group and stop receiving emails from it,
>> send an email to rsem-users+...@googlegroups.com.
>> To post to this group, send email to rsem-...@googlegroups.com.
>> Visit this group at http://groups.google.com/group/rsem-users [3]
>> [2].
>>
>> --
>> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [2] [1]
>> ---
>> You received this message because you are subscribed to a topic in
>> the Google Groups "RSEM Users" group.
>> To unsubscribe from this topic, visit
>> https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe
>> [4]
>> [3].
>> To unsubscribe from this group and all its topics, send an email to
>> rsem-users+...@googlegroups.com.
>> To post to this group, send email to rsem-...@googlegroups.com.
>> Visit this group at http://groups.google.com/group/rsem-users [3]
>> [2].
>
>  --
>  RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [2] [1]
>  ---
>  You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
>  To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
>  To post to this group, send email to rsem-...@googlegroups.com.
>  Visit this group at http://groups.google.com/group/rsem-users [3]
> [2].
>
> Links:
> ------
> [1] http://deweylab.biostat.wisc.edu/rsem/ [2]
> [2] http://groups.google.com/group/rsem-users [3]
> [3]
> https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe
> [4]
> [4] http://align_and_estimate_abundance.pl [1]
>
> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [2]
> --- You received this message because you are subscribed to a topic
> in the Google Groups "RSEM Users" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe
> [4].
> To unsubscribe from this group and all its topics, send an email to
> rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [3].
>
> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [2]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [3].
>
>
> Links:
> ------
> [1] http://align_and_estimate_abundance.pl
> [2] http://deweylab.biostat.wisc.edu/rsem/
> [3] http://groups.google.com/group/rsem-users
> [4]
> https://groups.google.com/d/topic/rsem-users/L4IQPTbNKos/unsubscribe
> [5] http://align_and_estimate_abundance.pl/

Mafalda Sousa Ferreira

unread,
May 18, 2015, 7:52:37 AM5/18/15
to rsem-...@googlegroups.com, trinityrn...@googlegroups.com
Hello,

I realized I missed the flag --bowtie in the RSEM command and that is why the indices are not produced during rsem-prepare-reference. My bad...

Yes, I will report the problem to Trinity users group!

Thanks for your help,
Mafalda 


To unsubscribe from this group and all its topics, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages