differing numbers of unmapped reads

59 views
Skip to first unread message

Nikelle Petrillo

unread,
Sep 28, 2017, 9:48:26 AM9/28/17
to rna-star
Hi all, 

I used this command to generate a bam file that included unmapped reads. 

STAR --genomeDir /data/databases/genomes/indices/cyno-index --readFilesIn filtered.Cyno-Spleen_1.1P.fastq.gz filtered.Cyno-Spleen_2.2P.fastq.gz --runThreadN 8 --outFileNamePrefix /data/data/JTX0002DF/trimmed/alignments/tmp1/filtered.Cyno_ --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --outSAMunmapped Within --quantMode GeneCounts --sjdbGTFfile /data/databases/ensembl_annotations/GCF_000364345.1_Macaca_fascicularis_5.0_genomic.gff --sjdbOverhang 149


I then used: samtools view  -h -f 4 filtered.Cyno_Aligned.sortedByCoord.out.bam > unmapped.bam  to extract all the unmapped reads and put them in their own separate bam file. 

I then used samtools view -c unmapped.bam to count the number of unmapped reads. The total I got was 57,548,280 unmapped reads. 

My final log states that there was a total of 340,439,154 input reads and 5.22 % of reads were unmapped (too short) and .74% of reads were unmapped (other). After doing the math, this comes out to 20,290,173 reads that should be unmapped. 
How come I got an unmapped read count of about 57M while the final log shows about 20M reads unmapped? 

Thanks!
          Started job on |       Sep 27 18:26:52
                             Started mapping on |       Sep 27 18:39:47
                                    Finished on |       Sep 28 01:14:43
       Mapping speed, Million of reads per hour |       51.72

                          Number of input reads |       340439154
                      Average input read length |       264
                                    UNIQUE READS:
                   Uniquely mapped reads number |       300599104
                        Uniquely mapped reads % |       88.30%
                          Average mapped length |       270.37
                       Number of splices: Total |       189960264
            Number of splices: Annotated (sjdb) |       184613580
                       Number of splices: GT/AG |       187682508
                       Number of splices: GC/AG |       1427789
                       Number of splices: AT/AC |       151126
               Number of splices: Non-canonical |       698841
                      Mismatch rate per base, % |       0.56%
                         Deletion rate per base |       0.03%
                        Deletion average length |       2.37
                        Insertion rate per base |       0.02%
                       Insertion average length |       1.94
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       18343786
             % of reads mapped to multiple loci |       5.39%
        Number of reads mapped to too many loci |       1195265
             % of reads mapped to too many loci |       0.35%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       5.22%
                     % of reads unmapped: other |       0.74%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

Alexander Dobin

unread,
Oct 2, 2017, 5:33:22 PM10/2/17
to rna-star
Hi Nikelle,

first, STAR counts the number of paired reads, while samtools reports the number of lines - i.e. each mate separately.
So you would need to multiply the STAR number of unmapped reads (20M) by 2. Still, 40M is not 57M you see in samtools.
Another correct is that STAR reports the "reads mapped to too many loci" as unmapped, bu in your case it will make a very smal difference.

I only see one other explanation for the difference: ~17M reads have a mate that has been trimmed to a short sequence that cannot be mapped well.
Those reads will still be considered mapped (since majority of the PE sequence maps), however, the unmapped mates will be output with SAM FLAG bit 0x4.
You can check that by running
$ samtools view -c -f 0x8 unmapped.bam
This should give you ~40M "both mates unmapped", if you divide this by 2, it should agree with STAR's unmapped reads.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages