Read counts difference between log.final.out and samtools flagstat

2,220 views
Skip to first unread message

Chao Zhang

unread,
Sep 25, 2014, 4:03:22 PM9/25/14
to rna-...@googlegroups.com
Hi:

Here is the output from samtools flagstat:
117917848 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
117917848 + 0 mapped (100.00%:-nan%)
117917848 + 0 paired in sequencing
58958929 + 0 read1
58958919 + 0 read2
117916342 + 0 properly paired (100.00%:-nan%)
117916342 + 0 with itself and mate mapped
1506 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

And this is log.final.out:
                                 Started job on | Mar 05 22:05:24
                             Started mapping on | Mar 05 22:05:25
                                    Finished on | Mar 05 22:10:29
       Mapping speed, Million of reads per hour | 665.98

                          Number of input reads | 56238010
                      Average input read length | 98
                                    UNIQUE READS:
                   Uniquely mapped reads number | 51896331
                        Uniquely mapped reads % | 92.28%
                          Average mapped length | 98.34
                       Number of splices: Total | 17581525
            Number of splices: Annotated (sjdb) | 17458185
                       Number of splices: GT/AG | 17412869
                       Number of splices: GC/AG | 125455
                       Number of splices: AT/AC | 19153
               Number of splices: Non-canonical | 24048
                      Mismatch rate per base, % | 0.32%
                         Deletion rate per base | 0.00%
                        Deletion average length | 1.51
                        Insertion rate per base | 0.01%
                       Insertion average length | 1.20
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 2762064
             % of reads mapped to multiple loci | 4.91%
        Number of reads mapped to too many loci | 20833
             % of reads mapped to too many loci | 0.04%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 2.73%
                     % of reads unmapped: other | 0.04%

I know Star treat one pair as one read in log.final.out, but 56238010*2=112476020 < 117917848. Why output contains more read than input?

-Chao

Alexander Dobin

unread,
Sep 30, 2014, 11:41:37 AM9/30/14
to rna-...@googlegroups.com
Hi Chao,

the difference is due to multi-mappers, which samtools flagstat counts as separate "reads" (or rather alignments).
You can count unique only alignments with 
samtools view -q 255 | samtools flagstat -
   this should agree with the number of unique reads
or primary only alignments
samtools view -F 0x100 | samtools flagstat -
    this should agree with the (unique+multiple).

Cheers
Alex

guillem ylla

unread,
Jan 26, 2015, 11:33:29 AM1/26/15
to rna-...@googlegroups.com
Hello,


I have noticed the same differences than Chao Zhang between the STAR Log.final.out and samtools. So as to I followed what Alex proposed, I tried "samtools view -b -q 255file.bam | samtools flagstat -" and "samtools view -b -F 0x100 OD0_star.bam| samtools flagstat -",  but I can not see the coherency...



                          Number of input reads |    4349066
                      Average input read length |    434
                                    UNIQUE READS:
                   Uniquely mapped reads number |    3209716
                                         *
                                         *                  
                                         *
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |    137421
             % of reads mapped to multiple loci |    3.16%
        Number of reads mapped to too many loci |    1307
             % of reads mapped to too many loci |    0.03%

                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |    0.00%
                 % of reads unmapped: too short |    22.97%

                     % of reads unmapped: other |    0.04%


Total reads: 4349066*2=8698132
 Uniquely mapped reads: 3209716*2=6419432
 Uniquely mapped plus multi-mapped: 137421*2+1307*2+3209716*2 =  6696888


 $ samtools view -b -q 255 OD0_star.bam | samtools flagstat -
6096823 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
6096823 + 0 mapped (100.00%:-nan%)
6096823 + 0 paired in sequencing
3189240 + 0 read1
2907583 + 0 read2

 $ samtools view -b -F 0x100 OD0_star.bam| samtools flagstat -
6354684 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
6354684 + 0 mapped (100.00%:-nan%)
6354684 + 0 paired in sequencing
3325786 + 0 read1
3028898 + 0 read2
6015094 + 0 properly paired (94.66%:-nan%)
6015094 + 0 with itself and mate mapped
339590 + 0 singletons (5.34%:-nan%)

0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

5774214 + 0 properly paired (94.71%:-nan%)
5774214 + 0 with itself and mate mapped
322609 + 0 singletons (5.29%:-nan%)

0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


 $  samtools flagstat OD0_star.bam
6676803 + 0 in total (QC-passed reads + QC-failed reads)
322119 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
6676803 + 0 mapped (100.00%:-nan%)
6354684 + 0 paired in sequencing
3325786 + 0 read1
3028898 + 0 read2
6015094 + 0 properly paired (94.66%:-nan%)
6015094 + 0 with itself and mate mapped
339590 + 0 singletons (5.34%:-nan%)

0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)




El dimarts, 30 setembre de 2014 17:41:37 UTC+2, Alexander Dobin va escriure:

Alexander Dobin

unread,
Jan 28, 2015, 2:52:02 PM1/28/15
to rna-...@googlegroups.com
Hi Guillem,

which parameters are using for mapping? Are you trimming the reads?
It appears that STAR outputs single-end alignments, since the numbers for read1 and read2 are not equal.
In this case you need a different way to reconcile the number from STAR and flagstat, since STAR count 1 read for both the single- and paired-end alignments, while flagstat will count 1 and 2 respectively.

$ samtools view -b -q 255 OD0_star.bam | samtools flagstat -
...
5774214 + 0 properly paired (94.71%:-nan%)
5774214 + 0 with itself and mate mapped
322609 + 0 singletons (5.29%:-nan%)

Flagstat counts 2 for each of the PE aligned reads and 1 for SE aligned reads, so the total number of mapped reads:
5774214/2+322609=3209716 which agrees with the STAR number for uniquely mapped reads.

$ samtools view -b -F 0x100 OD0_star.bam| samtools flagstat -
...

6015094 + 0 properly paired (94.66%:-nan%)
6015094 + 0 with itself and mate mapped
339590 + 0 singletons (5.34%:-nan%)

6015094/2+339590=3347137 which agrees with unique+multi number from STAR 137421+3209716=3347137.
Note,  that "reads mapped to too many loci" are output as unmapped and should not be added here.

Cheers
Alex

guillem ylla

unread,
Feb 4, 2015, 9:28:04 AM2/4/15
to rna-...@googlegroups.com
Dear Alex,

Thank you for the answer!

I did the trimming with trimmomatic-0.32 (with the parameters=  ILLUMINACLIP:"TruSeq3-PE-2.fa":2:30:10:8:TRUE SLIDINGWINDOW:4:15). It generates 4 files R1-clean.fastq.gz, R2-clean.fastq.gz, R1-unpaired.fastq.gz and R2-unpaired.fastq.gz.

And then the STAR as:

$ STAR  --runThreadN 10 --genomeDir genome_indexed_star --readFilesIn R1-clean.fastq.gz R2-clean.fastq.gz --readFilesCommand zcat --outFileNamePrefix Outfile

What do you mean with "It appears that STAR outputs single-end alignments, since the numbers for read1 and read2 are not equal" ? Why?  The 2 "clean" files have the same number of reads according to FastQC report did after Trimming.

Thank you,


Guillem


El dimecres, 28 gener de 2015 20:52:02 UTC+1, Alexander Dobin va escriure:

Alexander Dobin

unread,
Feb 5, 2015, 6:20:46 PM2/5/15
to rna-...@googlegroups.com
Hi Guillem,

STAR controls the quality of output alignments with the following parameters:
outFilterScoreMin               0
    int: alignment will be output only if its score is higher than this value
outFilterScoreMinOverLread      0.66
        float: outFilterScoreMin normalized to read length (sum of mates' lengths for paired-end reads)
outFilterMatchNmin              0
    int: alignment will be output only if the number of matched bases is higher than this value
outFilterMatchNminOverLread     0.66
    float: outFilterMatchNmin normalized to read length (sum of mates' lengths for paired-end reads)

By default, at least ~2/3 of the read length (sum of both mates) has to be mapped, which for untrimmed reads of equal length requires paired-end alignments.
However, if one of the reads is trimmed significantly before mapping, a single-end alignment may be allowed.
For instance, if you have 2x100 and the 2nd mate was trimmed to 20 bases, than an alignment longer than ~2/3*(100+20)=80 will be accepted.
To ensure that you only get paired alignments you can specify --outFilterMatchNmin <minMateLength+1> (e.g. 101 for 2x100).

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages