STAR log file vs. samtools flagstats

76 views
Skip to first unread message

j...@pieriandx.com

unread,
Oct 5, 2017, 1:10:03 PM10/5/17
to rna-star
Hi Alex,

I did samtools flagstats on my BAM file (including chimeric alignments), and got below stats:
samtools flagstat main.bam 

8959204 + 0 in total (QC-passed reads + QC-failed reads)

6982701 + 0 duplicates

8959204 + 0 mapped (100.00%:-nan%)

8959204 + 0 paired in sequencing

4478770 + 0 read1

4480434 + 0 read2

8078500 + 0 properly paired (90.17%:-nan%)

8957972 + 0 with itself and mate mapped

1232 + 0 singletons (0.01%:-nan%)

839501 + 0 with mate mapped to a different chr

839501 + 0 with mate mapped to a different chr (mapQ>=5)


and below is the log file from STAR:

                                 Started job on | Aug 15 16:39:59

                             Started mapping on | Aug 15 16:52:32

                                    Finished on | Aug 15 16:58:32

       Mapping speed, Million of reads per hour | 44.56


                          Number of input reads | 4455951

                      Average input read length | 150

                                    UNIQUE READS:

                   Uniquely mapped reads number | 3266496

                        Uniquely mapped reads % | 73.31%

                          Average mapped length | 150.11

                       Number of splices: Total | 2176826

            Number of splices: Annotated (sjdb) | 2176586

                       Number of splices: GT/AG | 2160380

                       Number of splices: GC/AG | 16022

                       Number of splices: AT/AC | 424

               Number of splices: Non-canonical | 0

                      Mismatch rate per base, % | 0.43%

                         Deletion rate per base | 0.03%

                        Deletion average length | 1.75

                        Insertion rate per base | 0.00%

                       Insertion average length | 1.23

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci | 310114

             % of reads mapped to multiple loci | 6.96%

        Number of reads mapped to too many loci | 1266

             % of reads mapped to too many loci | 0.03%

                                  UNMAPPED READS:

       % of reads unmapped: too many mismatches | 0.00%

                 % of reads unmapped: too short | 9.55%

                     % of reads unmapped: other | 0.03%

                                  CHIMERIC READS:

                       Number of chimeric reads | 451269

                            % of chimeric reads | 10.13%

-----------------------------------------------------------------------------
I did read one of your previous threads on how to compare samtools output with STAR summary, but those formula just don't work here. Could you please give some hints?
Basically, I am trying to get the number of duplicated reads and number of unique reads (unique here, means without duplicates, not means these uniquely mapped ones..). How would you suggest I get these numbers?

Thanks,

Jia 

Alexander Dobin

unread,
Oct 6, 2017, 5:40:46 PM10/6/17
to rna-star
Hi Jia,

I would suggest directly counting the unique read names in each category.
For instance, to count duplicate reads:
$ samtools view -f 0x400 A.bam | cut -f 1 | sort | uniq | wc -l
To count non-duplicate reads, I think, this will work:
$ samtools view -F 0xD00 -F A.bam | cut -f 1 | sort | uniq | wc -l
Here 0xD00=0x400+0x800+0x100, as you do not want to output the "secondary" and "supplementary" aligment which may not be marked as duplicate.

The sorting may take time - if you file is already sorted by read name, or is Unsorted STAr output, you can try to avoid sort and pipe direclty to uniq, though uniq may complain.
Also, instead of uniq | wc -l, you can write you own script that will compare the read name to the previous one and count only unique ones.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages