Reducing percentage of unmapped "too short" reads.


Blair Perry

Jan 6, 2016, 5:05:02 PM
to rna-star

Hello everyone, 


I’m getting a high percentage of reads left unmapped because they are “too short” (~35% - 42%) when mapping paired-end Illumina reads. After reading some of the threads from users with similar problems, I tried setting --outFilterScoreMinOverLread, --outFilterMatchNminOverLread, and --alignSplicedMateMapLminOverLmate to 0.50, which reduced the percentage of “too short” reads by only about 4 points. Any suggestions for other ways to reduce the number of unmapped reads?
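For context, these three options are fractions of read length, so the aligned length they demand can be estimated with simple arithmetic. The sketch below assumes the STAR default of 0.66 for these filters and that, for paired-end data, the "Average input read length" of 195 in Log.final.out is the combined length of both mates. The function name is illustrative, and STAR's exact rounding is not reproduced.

```python
# Rough estimate of the aligned length that STAR's relative-length filters
# require. Assumption: the filters scale with total read length (both mates
# summed for paired-end data); STAR's exact rounding is not reproduced here.

DEFAULT_FRACTION = 0.66  # assumed STAR default for the *MinOverLread filters


def required_aligned_bases(read_length, fraction=DEFAULT_FRACTION):
    """Approximate number of bases that must align for a read to be kept."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return read_length * fraction


average_read_length = 195  # "Average input read length" from Log.final.out
for fraction in (0.66, 0.50):
    needed = required_aligned_bases(average_read_length, fraction)
    print(f"fraction {fraction:.2f}: about {needed:.1f} of "
          f"{average_read_length} bases must align")
```

Under these assumptions, lowering the fraction from 0.66 to 0.50 only moves the cutoff by about 30 bases, so reads that align over much less than half their length stay in the "too short" category.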



Here is the command I used with STAR 2.4.2a:


STAR --runMode alignReads --runThreadN 8 --genomeDir Tsirt_genome --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic --outFilterType BySJout --outSAMattributes All --outSAMheaderPG Sickle --chimOutType SeparateSAMold --outSAMattrRGline ID:SRR805147 DS:Lung --outFileNamePrefix ./map/SRR805147_Lung --readFilesIn sickle_trimmed/SRR805147_Lung_1.sickle.fastq sickle_trimmed/SRR805147_Lung_2.sickle.fastq --bamRemoveDuplicatesType UniqueIdentical




Below is the Log.final.out file from a run with the above flags left at their default values:


                          Number of input reads |       44635143
                      Average input read length |       195
                                    UNIQUE READS:
                   Uniquely mapped reads number |       26148473
                        Uniquely mapped reads % |       58.58%
                          Average mapped length |       189.97
                       Number of splices: Total |       12516076
            Number of splices: Annotated (sjdb) |       12514958
                       Number of splices: GT/AG |       12333007
                       Number of splices: GC/AG |       126045
                       Number of splices: AT/AC |       8109
               Number of splices: Non-canonical |       48915
                      Mismatch rate per base, % |       0.43%
                         Deletion rate per base |       0.02%
                        Deletion average length |       1.95
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.53
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       647350
             % of reads mapped to multiple loci |       1.45%
        Number of reads mapped to too many loci |       7705
             % of reads mapped to too many loci |       0.02%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       39.79%
                     % of reads unmapped: other |       0.16%



And here is the Log.final.out after changing the flags to 0.50:


                          Number of input reads |       44635143
                      Average input read length |       195
                                    UNIQUE READS:
                   Uniquely mapped reads number |       28049700
                        Uniquely mapped reads % |       62.84%
                          Average mapped length |       184.83
                       Number of splices: Total |       13307673
            Number of splices: Annotated (sjdb) |       13306377
                       Number of splices: GT/AG |       13090522
                       Number of splices: GC/AG |       135649
                       Number of splices: AT/AC |       8716
               Number of splices: Non-canonical |       72786
                      Mismatch rate per base, % |       0.50%
                         Deletion rate per base |       0.02%
                        Deletion average length |       1.90
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.53
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       859077
             % of reads mapped to multiple loci |       1.92%
        Number of reads mapped to too many loci |       9478
             % of reads mapped to too many loci |       0.02%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       35.05%
                     % of reads unmapped: other |       0.16%
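
To compare the two runs field by field rather than by eye, a short sketch like the following can parse both logs. It assumes the "name | value" layout shown above. The function names are illustrative, and only three fields from each log are embedded to keep the example short; in practice the full file contents would be read from disk.

```python
def parse_log_final(text):
    """Return {field name: raw value string} from Log.final.out text."""
    fields = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        name, value = line.split("|", 1)
        fields[name.strip()] = value.strip()
    return fields


def as_percent(value):
    """Convert a string such as '39.79%' to the float 39.79."""
    return float(value.rstrip("%"))


# Excerpts copied from the two logs above.
default_run = parse_log_final("""
        Uniquely mapped reads % |       58.58%
          Average mapped length |       189.97
 % of reads unmapped: too short |       39.79%
""")
relaxed_run = parse_log_final("""
        Uniquely mapped reads % |       62.84%
          Average mapped length |       184.83
 % of reads unmapped: too short |       35.05%
""")

changes = {}
for key in ("Uniquely mapped reads %", "% of reads unmapped: too short"):
    changes[key] = as_percent(relaxed_run[key]) - as_percent(default_run[key])
    print(f"{key}: {changes[key]:+.2f} percentage points")
```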


Please let me know if I can provide any other helpful information.


Thank you very much for your time. I really appreciate any feedback or suggestions.


Blair Perry

Alexander Dobin

Jan 6, 2016, 5:53:30 PM
to rna-star
Hi Blair,

You can try reducing these parameters even more, say to 0.3, to see whether that gets you more mapped reads.
However, the short alignments will be unreliable and may mask the actual cause of the poor mappability. As usual, I would suggest the following:
1. Check for ribosomal RNA "contamination". This is especially important if you have total RNA (not A+ selected data). Have you included the non-chromosomal contigs in the genome? One of them contains a highly expressed rRNA locus.
2. Map read1 and read2 separately; this may point to a problem with one of the reads.
3. Check sequencing quality by plotting quality scores vs. position in read (Illumina pipelines typically produce these plots).
4. BLAST a few of the unmapped reads to check whether you have some sort of contamination.

Cheers
Alex
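
As a rough illustration of suggestion 2, the sketch below builds one single-end STAR command per mate, reusing only options and input paths from the original command in this thread. The output prefixes and the helper name are hypothetical, and nothing is executed; the commands are only printed so they can be reviewed first.

```python
import shlex


def single_mate_command(fastq, prefix, genome_dir="Tsirt_genome", threads=8):
    """Build a single-end STAR command (as an argument list) for one mate."""
    return [
        "STAR",
        "--runMode", "alignReads",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", fastq,          # one file only, so STAR maps single-end
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", prefix,   # hypothetical per-mate output prefix
    ]


mates = {
    "mate1": "sickle_trimmed/SRR805147_Lung_1.sickle.fastq",
    "mate2": "sickle_trimmed/SRR805147_Lung_2.sickle.fastq",
}
commands = {
    mate: single_mate_command(path, f"./map/SRR805147_Lung_{mate}_")
    for mate, path in mates.items()
}
for mate, command in commands.items():
    print(shlex.join(command))
```

If one mate maps at a much lower rate than the other in its own Log.final.out, that points to a problem with that read file rather than with the library as a whole.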

Blair Perry

Jan 8, 2016, 11:02:32 AM
to rna-star
Hello,

Thank you for your feedback! I will give those a shot.

Blair
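
For suggestions 3 and 4 above, a minimal sketch of the FASTQ handling might look like the following. It assumes standard four-line FASTQ records with Phred+33 quality encoding; the function names and the two demo reads are made up for illustration. Unmapped reads can typically be written out by STAR itself with --outReadsUnmapped Fastx, if your version supports it, and the FASTA text produced here could then be pasted into BLAST.

```python
import io
import itertools


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def mean_quality_by_position(records, offset=33):
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    totals, counts = [], []
    for _, _, quality in records:
        for i, char in enumerate(quality):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


def first_reads_as_fasta(records, n=10):
    """Format the first n reads as FASTA text, e.g. for a BLAST query."""
    return "".join(
        f">{name}\n{sequence}\n"
        for name, sequence, _ in itertools.islice(records, n)
    )


# Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
demo = "@read1\nACGT\n+\nIIII\n@read2\nACG\n+\n!!!\n"
per_position = mean_quality_by_position(read_fastq(io.StringIO(demo)))
fasta = first_reads_as_fasta(read_fastq(io.StringIO(demo)), n=2)
print(per_position)
print(fasta, end="")
```

A steady drop in the per-position means toward the 3' end, or a large difference between the two mates, would be worth checking before relaxing the mapping filters further.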