Reducing percentage of unmapped "too short" reads.

Blair Perry

Jan 6, 2016, 5:05:02 PM

to rna-star

Hello everyone,

I’m getting a high percentage of reads left unmapped as “too short” (~35% - 42%) when mapping paired-end Illumina reads. After reading some of the threads from users with similar problems, I tried setting --outFilterScoreMinOverLread, --outFilterMatchNminOverLread, and --alignSplicedMateMapLminOverLmate to 0.50, which reduced the percentage of “too short” reads by only about 4 percentage points. Any suggestions for other ways to reduce the number of unmapped reads?

Here is the command I used with STAR 2.4.2a:

STAR --runMode alignReads --runThreadN 8 --genomeDir Tsirt_genome --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM GeneCounts --twopassMode Basic --outFilterType BySJout --outSAMattributes All --outSAMheaderPG Sickle --chimOutType SeparateSAMold --outSAMattrRGline ID:SRR805147 DS:Lung --outFileNamePrefix ./map/SRR805147_Lung --readFilesIn sickle_trimmed/SRR805147_Lung_1.sickle.fastq sickle_trimmed/SRR805147_Lung_2.sickle.fastq --bamRemoveDuplicatesType UniqueIdentical

Below is the Log.final.out from a run with those three parameters left at their default values:

Number of input reads | 44635143
Average input read length | 195

UNIQUE READS:
Uniquely mapped reads number | 26148473
Uniquely mapped reads % | 58.58%
Average mapped length | 189.97
Number of splices: Total | 12516076
Number of splices: Annotated (sjdb) | 12514958
Number of splices: GT/AG | 12333007
Number of splices: GC/AG | 126045
Number of splices: AT/AC | 8109
Number of splices: Non-canonical | 48915
Mismatch rate per base, % | 0.43%
Deletion rate per base | 0.02%
Deletion average length | 1.95
Insertion rate per base | 0.01%
Insertion average length | 1.53

MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 647350
% of reads mapped to multiple loci | 1.45%
Number of reads mapped to too many loci | 7705
% of reads mapped to too many loci | 0.02%

UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 39.79%
% of reads unmapped: other | 0.16%

And here is the Log.final.out after changing the flags to 0.50:

Number of input reads | 44635143
Average input read length | 195

UNIQUE READS:
Uniquely mapped reads number | 28049700
Uniquely mapped reads % | 62.84%
Average mapped length | 184.83
Number of splices: Total | 13307673
Number of splices: Annotated (sjdb) | 13306377
Number of splices: GT/AG | 13090522
Number of splices: GC/AG | 135649
Number of splices: AT/AC | 8716
Number of splices: Non-canonical | 72786
Mismatch rate per base, % | 0.50%
Deletion rate per base | 0.02%
Deletion average length | 1.90
Insertion rate per base | 0.01%
Insertion average length | 1.53

MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 859077
% of reads mapped to multiple loci | 1.92%
Number of reads mapped to too many loci | 9478
% of reads mapped to too many loci | 0.02%

UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 35.05%
% of reads unmapped: other | 0.16%
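For anyone who wants to compare two such reports without reading them side by side, here is a minimal Python sketch. It assumes only the "name | value" layout shown above. The function names `parse_star_log` and `percent` are made up for illustration, and only three fields from each report are pasted in to keep it short.

```python
def parse_star_log(text):
    """Return {field name: value string} from Log.final.out text."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and section headers like "UNIQUE READS:"
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a percentage field such as '39.79%' to a float."""
    return float(stats[key].rstrip("%"))


# Values copied from the two reports above.
default_log = """
Number of input reads | 44635143
Uniquely mapped reads % | 58.58%
% of reads unmapped: too short | 39.79%
"""
relaxed_log = """
Number of input reads | 44635143
Uniquely mapped reads % | 62.84%
% of reads unmapped: too short | 35.05%
"""

before = parse_star_log(default_log)
after = parse_star_log(relaxed_log)

for key in ("Uniquely mapped reads %", "% of reads unmapped: too short"):
    change = percent(after, key) - percent(before, key)
    print(f"{key}: {percent(before, key):.2f} -> {percent(after, key):.2f} "
          f"({change:+.2f} points)")

# Express the drop in "too short" as an approximate number of input reads.
too_short = "% of reads unmapped: too short"
n_reads = int(before["Number of input reads"])
recovered = (percent(before, too_short) - percent(after, too_short)) / 100 * n_reads
print(f"roughly {recovered:,.0f} fewer reads reported as too short")
```

With real files, the pasted strings would be replaced by the contents of each run's Log.final.out.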

Please let me know if I can provide any other helpful information.

Thank you very much for your time; I really appreciate any feedback or suggestions.

Blair Perry

Alexander Dobin

Jan 6, 2016, 5:53:30 PM

to rna-star

Hi Blair,

You can try reducing these parameters even further, say to 0.3, to see whether that gets you more mapped reads.
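To give a sense of scale, here is a rough Python sketch of what these ratios mean in bases. It relies on two points from the STAR manual as I understand them: the ratios are taken relative to the read length, which for paired-end data is the sum of both mates' lengths, and the default for each is 0.66. The function name is made up for illustration, the average input read length of 195 from the log stands in for a single read pair, and STAR's exact rounding is not reproduced.

```python
def min_aligned_bases(read_length, ratio):
    """Approximate number of matched bases needed to pass a ratio-based filter.

    read_length: for paired-end reads, the summed length of both mates.
    ratio: the value given to --outFilterMatchNminOverLread; the score
           filter, --outFilterScoreMinOverLread, scales the same way.
    """
    return ratio * read_length


# 195 is the "Average input read length" reported in Log.final.out,
# used here as a stand-in for one typical read pair (an assumption).
average_pair_length = 195

for ratio in (0.66, 0.50, 0.30):
    needed = min_aligned_bases(average_pair_length, ratio)
    print(f"ratio {ratio:.2f}: about {needed:.1f} of {average_pair_length} bases must align")
```

For a 195-base pair this works out to about 128.7 bases at 0.66, 97.5 at 0.50, and 58.5 at 0.30, so each reduction admits considerably shorter alignments.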

However, the short alignments will be unreliable, and may mask the actual cause of the poor mappability. As usual, I would suggest the following:

1. Check for ribosomal RNA "contamination". This is especially important if you have total RNA (not A+ selected data). Have you included the non-chromosomal contigs in the genome? One of them contains a highly expressed rRNA locus.

2. Map read1 and read2 separately; this may point to a problem with one of the reads.

3. Check sequencing quality by plotting quality scores vs position in read (Illumina pipelines typically produce these plots).

4. BLAST a few of the unmapped reads to check if you have some sort of contamination.
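Suggestions 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a proper QC tool. It assumes four-line FASTQ records with Phred+33 quality encoding, and the function names are made up. To my knowledge, running STAR with --outReadsUnmapped Fastx writes the unmapped reads to Unmapped.out.mate1 and Unmapped.out.mate2, which could be read in place of the tiny invented example used here.

```python
import io
import random


def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def mean_quality_by_position(records, offset=33):
    """Mean Phred score at each read position (assumes Phred+33 encoding)."""
    totals, counts = [], []
    for _, _, qual in records:
        for i, char in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def sample_as_fasta(records, n, seed=0):
    """Pick up to n reads at random and format them as FASTA for BLAST."""
    records = list(records)  # loads everything; subsample large files first
    random.Random(seed).shuffle(records)
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in records[:n])


# Tiny invented example; in practice, open a real FASTQ file instead.
demo = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCC\n+\nII5#\n"
)
records = list(read_fastq(demo))
print(mean_quality_by_position(records))
print(sample_as_fasta(records, n=1))
```

Running `mean_quality_by_position` on the read1 and read2 files separately would show whether quality drops toward the end of one mate, and the FASTA output can be pasted into a BLAST search.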

Cheers

Alex

Blair Perry

Jan 8, 2016, 11:02:32 AM

to rna-star

Hello,

Thank you for your feedback! I will give those a shot.

Blair
