High percentage of reads unmapped: too short

Ali F

unread,

Apr 22, 2016, 10:56:23 AM4/22/16

to rna-star

Hi all,

I have been trying to do some alignments using STAR. The software indeed seems stellar, giving very clear outputs and with most options well explained in the documentation (and of course the ultra-high mapping speed). However I am getting quite a high “% of reads unmapped: too short”.

Having read quite extensively through similar problems both on this google group and on SeqAnswers, etc, I'm still a bit unsure as to why exactly this is happening.

For a bit more information on the issue:

I have assembled a transcriptome de novo from an RNA-seq dataset. As of yet I have no GTF file, I am mapping straight to the transcript contigs. When running STAR on my denovo assembly I don't get too many unmapped reads:

Average input read length | 182

Uniquely mapped reads % | 43.81%

Average mapped length | 179.89

% of reads mapped to multiple loci | 50.93%

% of reads mapped to too many loci | 0.90%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 4.35%

% of reads unmapped: other | 0.00%

CHIMERIC READS:

Number of chimeric reads | 0

% of chimeric reads | 0.00%

The high rate of multi-mapping loci is expected here as the dataset is comprised of 6 individuals with high levels of heterozygosity, and the raw assembly thus contains allelic variation and alternatively spliced transcripts sharing exons.

However, when I undergo a pipeline to collapse some of this allelic variation, and scaffold a lot of my contigs together, the rate of unmapped reads being “too short” dramatically increases on the new assembly:

Average input read length | 182

Uniquely mapped reads % | 46.69%

Average mapped length | 179.19

% of reads mapped to multiple loci | 7.94%

% of reads mapped to too many loci | 0.01%

UNMAPPED READS:

% of reads unmapped: too many mismatches | 0.00%

% of reads unmapped: too short | 45.36%

% of reads unmapped: other | 0.00%

CHIMERIC READS:

Number of chimeric reads | 0

% of chimeric reads | 0.00%

Having read through various forums, I have changed various parameters of the alignment to try to 'optimise' the mapping run. I am confident that sequence quality is not an issue for either forward or reverse reads – I was fairly stringent with trimming (using Trimmomatic) and adapter sequences have already been trimmed off, and from eyeballing the FastQC outputs none of my .fq files have bases with low quality, most in fact are >30. I even tried tricking STAR by icreasing '--outQSconversionAdd' to '20', but this has no effect on the results.

The STAR options which seem to have the most effect are indeed '--outFilterMatchNminOverLread' '--outFilterScoreMinOverLread'. Dropping the value of these from the defauilt leads to a decreasing amount of “reads unmapped”. Setting both the options to '0' results in 100% of the reads mapping back to the reference, with ~74% uniquely mapping. The 100% of reads mapping is a result I do not trust, especially as I know a number of contigs have been dropped from the original assembly.

I would like to ask:

1 – Why reads are classed as “unmapped too short”, as opposed to “too many mistmatches” or simply not finding any suitable alignment match in the reference?

2 – Given that I used to have high mapping rates with the original assembly, why am I suddenly finding so many are “too short”?

3 – How can I optimise the mapping but still be confident that I'm not relaxing the parameters too far so as to induce false matches? I would expect some mistmatching due to the new assembly being a tentative consensus from different contigs and alleles.

Thank you very much, and my apologies for such a long post!

Ali

Alexander Dobin

unread,

Apr 22, 2016, 4:58:31 PM4/22/16

to rna-star

Hi Ali,

by default, STAR will reduce the mapped length to avoid too many mismatches - this is why the unmapped reads are mostly "too short"

("too many mismatches" category gets populated if you map reads end-to-end --alignEndsType EndToEnd).

If you reduce --outFilterMatchNminOverLread --outFilterScoreMinOverLread too much, you will be allowing very short -not reliable- mapped lengths.

I see two possible explanations for the reduced mapping rate with the new assembly:

1. All individuals are strongly divergent from the consensus assembly.

In this case you could try to increase the number of allowed mismatches with, say --outFilterMismatchNmax 20 or even 50 .

Please post the entire Log.final.out file so that we can check the error rates.

2. Some highly expressed regions are missing from the new assembly.

Not sure how to check this, maybe you can re-map the unmapped reads to the old assembly to see if they come from any specific loci.

Cheers

Alex

Aravind Tallam

unread,

Apr 29, 2016, 10:57:32 AM4/29/16

to rna-star

Hi Alexander,

Thanks for the useful information.

I'm also having the same problem but with Human genome. My problem is even worst as the percentage of "unmapped reads: too short" is above 95% for all of my samples. I followed the steps described in the manual for genome generation and also for mapping and tried several run but the problem persists. May be its a small parameter error or something that I'm missing. I attached the log.final.out file from one of my samples for your kind perusal and below are the commands that i used for the run. I would be greatly thankful if you can give me some advice to solve this problem.

./STAR --runThreadN 10 --runMode genomeGenerate --genomeDir $DIR/GenomeIndices --genomeFastaFiles $DIR/Homo_sapiens.GRCh38.dna.primary_assembly.fa --sjdbGTFfile $DIR/Homo_sapiens.GRCh38.84.gtf --sjdbOverhang 100

./STAR --runThreadN 10 --genomeDir GenomeIndices --readFilesCommand gunzip -c $DIR/TrimmedFASTQ/$1_R1_val_1.fq.gz $DIR/TrimmedFASTQ/$1_R2_val_2.fq.gz --sjdbGTFfile /Homosapiens/Homo_sapiens.GRCh38.84.gtf --quantMode TranscriptomeSAM GeneCounts --outReadsUnmapped Fastx --outSAMtype BAM SortedByCoordinate --outWigType bedGraph read1_5p --outFileNamePrefix $DIR/Alignment/$1_

Thanks a lot.

cheers,

Aravind

36_Log.final.out

Alexander Dobin

unread,

Apr 29, 2016, 3:55:00 PM4/29/16

to rna-star

Hi Aravind,

are you mapping human RNA-seq data to the human genome?

If so, this is not the diveregent species problem like the post above.

Most likely culprit here would be trimming, some trimmers drop the overtrimmed reads from only one mate, which screws up the mate correspondence in the two files.

I would recommend to

1. Try to map untrimmed reads.

2. Try to map one read only.

Cheers

Alex

Reply all

Reply to author

Forward