Discrepancy in alignment stats between outputs from STAR and RSEM (.cnt file)

141 views
Skip to first unread message

Raj Sasidharan

unread,
Jun 18, 2020, 7:21:39 PM6/18/20
to RSEM Users
Dear Bo and Colin,

I ran the following RSEM command on a single-end FASTQ file. I aligned the reads to STAR indices built for the human transcriptome.

rsem-calculate-expression -p 12 --star --append-names --star-path /usr/local/bin/ --star-gzipped-read-file /data/sample1.fastq.gz /genome/GRCh38_RSEM_STAR sample > sample_RSEM.log 2>&1


I don't quite understand how the alignment numbers from STAR log file (below) maps to the output by RSEM in .cnt file. There are a total of 16702488 reads and according to STAR 11145815 reads are uniquely mapped, 4302351 reads are mapped to multiple loci, 59784 reads are mapped to too many loci, 20167 reads were unmapped because of too many mismatches, 1055479 reads were unmapped because they were too short and 118892 reads were unmapped for other reasons. The total mapped and unmapped reads add to total reads. 

However, from RSEM's .cnt file (below), for this sample, there are total of 8473737 reads, of which 7219415 are aligned (both unique and multi-mapped) and 1254322 reads are unaligned. The Total reads should be 16702488, and total unaligned should be 9483073, right? 

Would appreciate if you could you please reconcile these two outputs for me. Also, could you please explain what the numbers in line 2 and 3 mean in the .cnt file? Thanks.


Output from STAR log file for the sample:

                                 Started job on | Jun 16 04:09:46

                             Started mapping on | Jun 16 04:10:58

                                    Finished on | Jun 16 04:13:01

       Mapping speed, Million of reads per hour | 488.85


                          Number of input reads | 16702488

                      Average input read length | 49

                                    UNIQUE READS:

                   Uniquely mapped reads number | 11145815

                        Uniquely mapped reads % | 66.73%

                          Average mapped length | 49.18

                       Number of splices: Total | 501431

            Number of splices: Annotated (sjdb) | 490452

                       Number of splices: GT/AG | 496455

                       Number of splices: GC/AG | 4424

                       Number of splices: AT/AC | 328

               Number of splices: Non-canonical | 224

                      Mismatch rate per base, % | 0.28%

                         Deletion rate per base | 0.01%

                        Deletion average length | 1.67

                        Insertion rate per base | 0.01%

                       Insertion average length | 1.35

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci | 4302351

             % of reads mapped to multiple loci | 25.76%

        Number of reads mapped to too many loci | 59784

             % of reads mapped to too many loci | 0.36%

                                  UNMAPPED READS:

  Number of reads unmapped: too many mismatches | 20167

       % of reads unmapped: too many mismatches | 0.12%

            Number of reads unmapped: too short | 1055479

                 % of reads unmapped: too short | 6.32%

                Number of reads unmapped: other | 118892

                     % of reads unmapped: other | 0.71%

                                  CHIMERIC READS:

                       Number of chimeric reads | 0

                            % of chimeric reads | 0.00%


Output from .cnt file for the sample:

1254322 7219415 0 8473737

4632069 2587346 4804035

24264981 1

0 1254322

1 2415380

2 1356563

3 1468225

4 444324

5 448121

6 200991

7 138113

8 413042

9 47136

10 112856

11 17339

12 20372

13 11737

14 16524

15 9874

16 6305

17 13935

18 10011

19 5951

20 5897

21 5531

22 2173

23 3960

24 2891

25 1498

26 5893

27 1658

28 2301

29 577

30 1346

31 3768

32 1992

33 1365

34 359

35 1205

36 4698

37 730

38 195

39 165

40 1693

41 330

42 598

43 67

44 212

45 1203

46 5

47 31

48 1009

49 949

50 354

52 34

53 39

54 245

55 20

56 3824

57 1

58 9

59 14

60 182

62 68

63 86

64 35

65 9

66 169

68 17

70 1246

71 20

72 371

73 144

74 10

76 27

78 31

80 1352

81 8

84 2

Inf 0


Bo Li

unread,
Jun 18, 2020, 7:52:08 PM6/18/20
to rsem-...@googlegroups.com
Dear Raj and other RSEM users,

Very happy to help here. 


re: STAR alignment.  STAR aligns to the genome and thus 11145815 refers to reads that are mapped unique to one genomic locus (e.g. exon, intron, intergenic). However, when STAR generates the transcriptome BAM file, it does not write all reads to the transcriptomic BAM. For example, reads aligned to intergenic regions will not be reported in the transcriptome BAM. Since RSEM takes STAR's transcriptomic BAM as its input, there is no way for RSEM to know those reads that are not reported in the transcriptomic BAM file. That's why in RSEM's cnt file, only 8473737 reads were reported. 

Hope it helps,
Bo


--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rsem-users/5522511b-2578-448b-a8ce-17590b189d6ao%40googlegroups.com.

Raj Sasidharan

unread,
Jun 18, 2020, 8:40:18 PM6/18/20
to RSEM Users
Dear Bo,

Thanks for the fast response. I have two additional questions:

1. I used rsem-prepare-reference, supplied GTF file and prepared STAR indices for human transcriptome. So, could you please explain what you mean by saying "STAR aligns to the genome"? Aren't these indices for transcriptome sequences? 
2. Counting reads with SAM flag 0x4 gives the total unaligned reads (1254322), as also reported in .cnt file. The total unmapped reads reported by STAR is 1194538. Shouldn't these two be identical?

Thanks again for your input on this.

Bo Li

unread,
Jun 18, 2020, 9:23:20 PM6/18/20
to rsem-...@googlegroups.com
Hi Raj,

Please see my comments below.

On Thu, Jun 18, 2020 at 8:40 PM Raj Sasidharan <rs2...@gmail.com> wrote:
Dear Bo,

Thanks for the fast response. I have two additional questions:

1. I used rsem-prepare-reference, supplied GTF file and prepared STAR indices for human transcriptome. So, could you please explain what you mean by saying "STAR aligns to the genome"? Aren't these indices for transcriptome sequences? 

No. STAR always align to the genome. However, if you provide STAR a GTF file, STAR will use the GTF when it converts alignments to transcriptomic coordinates.

 
2. Counting reads with SAM flag 0x4 gives the total unaligned reads (1254322), as also reported in .cnt file. The total unmapped reads reported by STAR is 1194538. Shouldn't these two be identical?


No. The unmapped reported by STAR are those reads unmappable to the genome. The unmappable reads in the transcriptomic BAM are reads mappable to the genome but not the transcriptome (e.g. intronic reads). 

Best,
Bo

 
--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.

Raj Sasidharan

unread,
Jun 18, 2020, 10:39:21 PM6/18/20
to RSEM Users
Thank you Bo.

I think it would be useful for RSEM to report aligned, unaligned reads wrt total reads in the FASTQ file. 


On Thursday, June 18, 2020 at 4:21:39 PM UTC-7, Raj Sasidharan wrote:
Reply all
Reply to author
Forward
0 new messages