Re: Many unmapped reads in STAR, classified as "too short"?

Alexander Dobin

May 23, 2013, 10:12:35 AM
to rna-...@googlegroups.com
Hi Carmen,

there are three main explanations for poor mappability with a large proportion of unmapped reads reported as "too short":

1. Poor sequencing quality.
On your quality score distribution plots, if the median quality score drops below 20 at a certain cycle, you will have problems mapping the tails.
Note that by default STAR requires the mapped length to be > 2/3 of the total read length (i.e. 2/3*202≈135 bases in the case of PE101).
This is controlled by --outFilterMatchNminOverLread and --outFilterScoreMinOverLread.
You can try to reduce these parameters to, say, 0.4 to see if you get more reads mapped - but the alignments will be shorter, of course.
Also, you can try to map read1 and read2 separately to see if one of them is more problematic than the other.
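
To make the threshold arithmetic concrete, here is a minimal sketch. The helper name `min_mapped_len` is made up for illustration and is not part of STAR. It assumes STAR's documented default of 0.66 for both options (roughly the 2/3 quoted above), and the paths in the commented command are placeholders.

```shell
# Hypothetical helper (not part of STAR): approximate minimum mapped length
# implied by a fractional threshold such as --outFilterMatchNminOverLread.
# $1 = total read length (sum of both mates for paired-end), $2 = fraction.
min_mapped_len() {
  awk -v len="$1" -v frac="$2" 'BEGIN { printf "%.1f\n", len * frac }'
}

min_mapped_len 202 0.66   # default threshold for PE101
min_mapped_len 202 0.4    # relaxed threshold suggested above

# A relaxed run might then look like this (placeholder paths, other options as usual):
# STAR --genomeDir /path/to/Fly/ --readFilesIn read1.fq read2.fq \
#      --outFilterMatchNminOverLread 0.4 --outFilterScoreMinOverLread 0.4
```

Running the same command with only read1 (or only read2) after --readFilesIn is one way to test the mates separately.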

2a. Contamination with exogenous sequences.
You can try to BLAST a few of the unmapped reads against the full NCBI database to see if you get any good matches.
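
One possible way to pull out a handful of reads for BLAST is sketched below. It assumes standard 4-line FASTQ records. The function name `fastq_head_to_fasta` is hypothetical, and the file name in the commented example assumes STAR was run with --outReadsUnmapped Fastx so that unmapped reads are written to a separate file.

```shell
# Convert the first N reads of a FASTQ stream to FASTA for pasting into BLAST.
# Assumes 4-line FASTQ records. $1 = number of reads to keep.
fastq_head_to_fasta() {
  awk -v n="$1" '
    NR > 4 * n  { exit }
    NR % 4 == 1 { print ">" substr($0, 2) }   # header: replace "@" with ">"
    NR % 4 == 2 { print }                      # sequence line
  '
}

# Example (file name assumes --outReadsUnmapped Fastx was used):
# fastq_head_to_fasta 20 < Unmapped.out.mate1 > unmapped_sample.fa
```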

2b. Contamination with ribosomal RNA.
If your samples are "total RNA", depleted with Ribo-Zero or Ribo-Minus kits, it is possible that the depletion did not work well. rRNA are typically multi-mappers (and you get plenty of those), however, not all rRNA repeats make it into the main chromosomal assembly, and in this case they will not be mapped and will be reported as "alignment too short". We have recently had many cases like that in our lab for human tissues. I believe for the fly genome, the unplaced contigs are in chrU and chrUextra - please try to include them in the genome if you have not done so.
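
A quick way to check which sequences are actually in the genome FASTA is sketched below. The function name `list_fasta_names` is made up for illustration, and the path in the commented example is a placeholder.

```shell
# Print the sequence names in a genome FASTA (text after ">" up to the first
# whitespace), so you can check whether chrU / chrUextra are included.
list_fasta_names() {
  grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//'
}

# Example: count sequences whose names start with chrU
# list_fasta_names /path/to/genomes/fly/dm3_genome.fa | grep -c '^chrU'
```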

2c. Contamination with primer-dimers.
You can try to clip off Illumina adapters with various clipper software. I think this is quite rare.

3. Inserts that are too short.
If this happens, you will be sequencing into your adapter at the end of the 2nd mate. STAR will try to trim it off, but the resulting alignment might be too short.
You can check if this is the case by following the suggestions in 1 or 2c.
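
Before running a clipper, a rough check for adapter read-through could look like the sketch below. It assumes 4-line FASTQ records and uses the common Illumina TruSeq adapter prefix AGATCGGAAGAGC as the default search string. The function name `count_adapter_reads` is hypothetical, and an exact substring search will miss adapters that contain sequencing errors.

```shell
# Report how many reads in a FASTQ stream contain an adapter prefix.
# $1 = adapter sequence (default: common Illumina TruSeq prefix).
count_adapter_reads() {
  awk -v ad="${1:-AGATCGGAAGAGC}" '
    NR % 4 == 2 { total++; if (index($0, ad) > 0) hits++ }
    END { printf "%d of %d reads contain %s\n", hits, total, ad }
  '
}

# Example on the first 100,000 reads of mate 2 (placeholder file name):
# zcat -f read2.fq.gz | head -n 400000 | count_adapter_reads
```

A large fraction of hits would point to short inserts rather than poor base quality.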

Please let me know if this helps
Cheers
Alex


On Wednesday, May 22, 2013 9:21:24 PM UTC-4, Carmen Sandoval wrote:

Hi all,

I am currently using STAR to map several Hi-SEQ mRNA runs, and I am very pleased with the run time (working with 200M+ HiSeq runs...)


However, I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)


In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.

The unmapped bin that most of the reads fall into is "too short", which I believe Alex has pointed out to be correlated with read quality. But I've run the runs through FastQC, and the quality is pretty good up until the ~85th base out of 101. 

What parameter might I be mis-specifying? These are PE 101 Illumina reads, and we have around 200M reads per sample.

What other parameter could be causing the "unmapped reads: too short" set of reads to be so large?

Many Thanks! 

Carmen


My command line is like so:

# $1 = READ1 fq file
# $2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM]

/path/to/STAR --genomeDir /path/to/Fly/ --readFilesCommand 'zcat -fc' --readFilesIn $1 $2 --runThreadN 32 --genomeLoad LoadAndRemove --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 2 --outSAMstrandField None --outSAMmode Full --outSAMattributes Standard --outSAMunmapped None --outFilterType BySJout --outStd SAM | samtools view -b -o $3_STAR.bam -S -


This is the command line I used to build the STAR index.

/path/to/STAR_2.3.0e/STAR --runMode genomeGenerate --genomeDir /path/to/Genomes/Fly/ --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa --runThreadN 16  --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf --sjdbGTFtagExonParentTranscript transcript_id --sjdbOverhang 100

And all Final Logs look something like this:

./Log.final.out



                             Started job on |    May 20 22:49:23
                         Started mapping on |    May 20 22:52:08
                                Finished on |    May 21 05:18:10
   Mapping speed, Million of reads per hour |    32.74

                      Number of input reads |    210640950
                  Average input read length |    202
                                UNIQUE READS:
               Uniquely mapped reads number |    29841188
                    Uniquely mapped reads % |    14.17%
                      Average mapped length |    190.07
                   Number of splices: Total |    4405621
        Number of splices: Annotated (sjdb) |    4123661
                   Number of splices: GT/AG |    4348429
                   Number of splices: GC/AG |    25290
                   Number of splices: AT/AC |    664
           Number of splices: Non-canonical |    31238
                  Mismatch rate per base, % |    1.23%
                     Deletion rate per base |    0.03%
                    Deletion average length |    1.92
                    Insertion rate per base |    0.02%
                   Insertion average length |    2.41
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |    36646260
         % of reads mapped to multiple loci |    17.40%
    Number of reads mapped to too many loci |    229494
         % of reads mapped to too many loci |    0.11%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |    0.00%
             % of reads unmapped: too short |    58.84%
                 % of reads unmapped: other |    9.49%
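
When comparing many samples, the "too short" percentage can be pulled out of each Log.final.out with a one-liner. This is only a sketch: the function name `unmapped_too_short` is made up, and it assumes the log keeps the `label | value` layout shown above.

```shell
# Print the "% of reads unmapped: too short" value from a STAR Log.final.out.
# $1 = path to the log file. Prints the number without the trailing "%".
unmapped_too_short() {
  awk -F'|' '/% of reads unmapped: too short/ {
    gsub(/[ \t%]/, "", $2)   # strip padding and the percent sign
    print $2
  }' "$1"
}

# Example over several runs (placeholder directory layout):
# for f in */Log.final.out; do printf '%s\t%s\n' "$f" "$(unmapped_too_short "$f")"; done
```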


Carmen Sandoval

May 23, 2013, 7:26:48 PM
to rna-...@googlegroups.com
Thanks Alex,

I will try all these and let you know how the alignments improve. 
I checked the Quality through FastQC for all samples, and it seemed to be OK, but perhaps I'll have to double check or look into it in more depth.

C

Carmen Sandoval

May 24, 2013, 4:51:21 PM
to rna-...@googlegroups.com
Alex --

How much better/worse do you think it would be to pass the runs first through a Quality Trimmer that will remove the low-quality tails (such as Qtrim) and then specify an --outFilterMatchNminOverLread / --outFilterScoreMinOverLread that is very stringent given that STAR will now be mapping high-quality bases? First of all, will STAR consider each read (pair) length individually for the Lread value? And if so, could this be a better way to gain sensitivity on low-quality-tail reads, without causing more false positives on higher-quality-tail reads?

Just an idea.
C

Alexander Dobin

May 25, 2013, 10:09:08 AM
to rna-...@googlegroups.com
Hi Carmen,

this could be a good approach if you have a lot of poor quality tails, and also in cases when inserts are short and you see adapter sequence at the 3' ends of the reads.
If you trim the reads before feeding them to STAR, it will indeed consider each read length (Lread) individually for normalized thresholds such as --outFilterMatchNminOverLread / --outFilterScoreMinOverLread

Note that STAR also trims reads (soft clipping, S in the CIGAR) in cases where it cannot place a short junction overhang. If you specify very stringent --outFilterMatchNminOverLread / --outFilterScoreMinOverLread, you will lose these alignments.
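
To get a feel for how much soft clipping a given alignment carries, the S operations in its CIGAR string can be summed. The sketch below uses a hypothetical helper name, `softclipped_bases`, and the CIGAR string in the example is invented for illustration.

```shell
# Sum the soft-clipped bases (S operations) in a single CIGAR string.
# $1 = CIGAR string, e.g. 5S90M6S
softclipped_bases() {
  printf '%s\n' "$1" | awk '{
    n = 0
    # repeatedly find "<number>S" and add the number to the total
    while (match($0, /[0-9]+S/)) {
      n += substr($0, RSTART, RLENGTH - 1)
      $0 = substr($0, RSTART + RLENGTH)
    }
    print n
  }'
}

softclipped_bases 5S90M6S
```

With a stringent threshold, an alignment whose matched length falls below the required fraction of Lread because of such clipping would be filtered out.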

Cheers
Alex

Shawn Driscoll

May 27, 2013, 11:38:57 AM
to rna-...@googlegroups.com
I second the idea of collecting the unmapped reads and then working with those directly to see what's going on. Try BLAST...if it can't find hits for those then there's no chance they will align. You can also try bowtie (1), which has command line options to trim bases from the 3' and/or 5' ends of the reads. I'd do that as a test only. Try trimming 25 bases and see what happens. I would also do this with just one end of the pairs at first to simplify things. If you can find a way to start to get those unmapped reads to map then you can go back to STAR and finish the job with an edited copy of those reads.
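
If you would rather produce the trimmed test copy yourself than rely on the aligner's trimming options, a fixed-length 3' trim could be sketched as follows. It assumes 4-line FASTQ records, and the function name `trim3_fastq` is hypothetical. It is blind trimming for a quick test, not quality-aware trimming.

```shell
# Trim N bases from the 3' end of every read in a FASTQ stream (test use only).
# $1 = number of bases to remove; sequence and quality lines are cut together.
trim3_fastq() {
  awk -v n="$1" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, length($0) - n); next }
    { print }
  '
}

# Example: trim 25 bases from one mate only (placeholder file names)
# zcat -f read1.fq.gz | trim3_fastq 25 > read1.trim25.fq
```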

Aditi Kulkarni

Jun 20, 2017, 10:55:15 AM
to rna-star
Hi Alex,

I have a question regarding your response below. I tried to BLAST some of the unmapped reads and they have hits that match genes. What could be the reason that these reads remain unmapped?

Thanks,
Aditi

Alexander Dobin

Jun 21, 2017, 5:05:15 PM
to rna-star
Hi Aditi,

could you explain a bit more about this problem? What genome are you mapping to? What is the mapping rate? Please send examples of BLAST hits.

Cheers
Alex

Aditi Kulkarni

Jun 22, 2017, 9:23:38 AM
to rna-star
Hi Alex,

I have sent you an email with a screenshot and explained my problem at length to you.

Thanks,
Aditi

SG

Jul 6, 2017, 10:57:47 AM
to rna-star


On Thursday, May 23, 2013 at 10:12:35 AM UTC-4, Alexander Dobin wrote:

Ehsan Hajiramezanali

Dec 19, 2018, 11:38:15 AM
to rna-star
Hi Alex,

I'm working with 150nt single-end melon data. I have 7 different samples. While I got 95% uniquely mapped reads for two samples, I see between 45% and 75% unmapped (marked as "too short") for the others. As you suggested, I checked the quality of the samples and found it to be similar across them; the quality is good. In addition, I tried setting the two options --outFilterMatchNminOverLread and --outFilterScoreMinOverLread to 0.4, and it did not help.

Then I BLASTed the unmapped reads and could not find any good match. The matched parts are shorter than 30nt, with at least 3 mismatches (the e-value is higher than 0.5).

I'm wondering whether you have any idea what could be causing this?
Would you please help me with this?

Thanks in advance
Ehsan

P.S. 

Two of the logs follow:

~~~
                                 Started job on | Dec 16 20:21:46
                             Started mapping on | Dec 16 20:21:53
                                    Finished on | Dec 16 20:28:44
       Mapping speed, Million of reads per hour | 36.06

                          Number of input reads | 4116318
                      Average input read length | 145
                                    UNIQUE READS:
                   Uniquely mapped reads number | 1073122
                        Uniquely mapped reads % | 26.07%
                          Average mapped length | 144.93
                       Number of splices: Total | 471198
            Number of splices: Annotated (sjdb) | 417622
                       Number of splices: GT/AG | 465239
                       Number of splices: GC/AG | 4294
                       Number of splices: AT/AC | 204
               Number of splices: Non-canonical | 1461
                      Mismatch rate per base, % | 0.97%
                         Deletion rate per base | 0.04%
                        Deletion average length | 2.65
                        Insertion rate per base | 0.02%
                       Insertion average length | 2.53
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 15706
             % of reads mapped to multiple loci | 0.38%
        Number of reads mapped to too many loci | 184
             % of reads mapped to too many loci | 0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 73.47%
                     % of reads unmapped: other | 0.07%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%
~~~

~~~
                                 Started job on | Dec 16 19:07:44
                             Started mapping on | Dec 16 19:07:50
                                    Finished on | Dec 16 19:08:48
       Mapping speed, Million of reads per hour | 256.44

                          Number of input reads | 4131463
                      Average input read length | 145
                                    UNIQUE READS:
                   Uniquely mapped reads number | 3962156
                        Uniquely mapped reads % | 95.90%
                          Average mapped length | 143.87
                       Number of splices: Total | 1677217
            Number of splices: Annotated (sjdb) | 1474769
                       Number of splices: GT/AG | 1655227
                       Number of splices: GC/AG | 15702
                       Number of splices: AT/AC | 606
               Number of splices: Non-canonical | 5682
                      Mismatch rate per base, % | 0.95%
                         Deletion rate per base | 0.04%
                        Deletion average length | 2.68
                        Insertion rate per base | 0.03%
                       Insertion average length | 2.49
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci | 52887
             % of reads mapped to multiple loci | 1.28%
        Number of reads mapped to too many loci | 162
             % of reads mapped to too many loci | 0.00%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches | 0.00%
                 % of reads unmapped: too short | 2.75%
                     % of reads unmapped: other | 0.06%
                                  CHIMERIC READS:
                       Number of chimeric reads | 0
                            % of chimeric reads | 0.00%
~~~

Thanks