Re: "too short" reads

Alexander Dobin

Jan 3, 2013, 1:49:49 PM

to rna-...@googlegroups.com

Hi Monica,

the minimum mapped length is controlled by --outFilterScoreMinOverLread and --outFilterMatchNminOverLread.

By default both parameters are equal to 0.66, i.e. if either the number of matched bases OR the alignment score (which is the number of matched bases plus penalties) is < 66% of the read length, the alignment will not be output and the read will be reported as "too short".
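
As a rough illustration of the rule above, the check can be sketched in Python. This is a simplified model based only on the description in this post, not STAR's actual source code; the function name `is_too_short` and its argument names are hypothetical, and the real implementation may round or count the read length differently.

```python
def is_too_short(read_length, matched_bases, score,
                 score_min_over_lread=0.66, match_nmin_over_lread=0.66):
    """Return True if an alignment would be reported as "too short".

    Simplified model of the rule described in the thread: the alignment is
    rejected if EITHER the score OR the number of matched bases falls below
    the given fraction of the read length. Not STAR's actual code.
    """
    score_too_low = score < score_min_over_lread * read_length
    matches_too_few = matched_bases < match_nmin_over_lread * read_length
    return score_too_low or matches_too_few


# For a 50-base read the thresholds are 0.66 * 50 = 33.
print(is_too_short(50, matched_bases=40, score=38))  # False: both above 33
print(is_too_short(50, matched_bases=30, score=30))  # True: both below 33
print(is_too_short(50, matched_bases=40, score=30))  # True: score alone fails
```

Setting both fractions to 0 in this model disables the check entirely, which is why relaxing the two parameters turns "too short" reads into mapped ones.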

In your case, it appears that 45% of your reads have a mapped length below 66% of the read length (i.e. 33b for 50b reads). Is this what you expect? For example, this would happen if you have short RNA libraries and did not fully trim the adapter. If your libraries are standard long RNA-seq, this generally should not happen.

You can try 0 values for --outFilterScoreMinOverLread and --outFilterMatchNminOverLread, and check the average mapped length you will get in this case.
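
A minimal sketch of assembling such a run from a script, assuming you launch STAR from Python. Only the parameter names come from this thread; the helper `build_star_command`, the executable name and the paths are placeholders, and the command is only printed here, not executed.

```python
import shlex


def build_star_command(star_bin, genome_dir, reads_fastq, min_over_lread):
    """Assemble a STAR argument list with both length filters set to one value.

    star_bin, genome_dir and reads_fastq are placeholders supplied by the caller.
    """
    return [
        star_bin,
        "--genomeDir", genome_dir,
        "--readFilesIn", reads_fastq,
        "--outFilterScoreMinOverLread", str(min_over_lread),
        "--outFilterMatchNminOverLread", str(min_over_lread),
    ]


# Placeholder executable and paths; substitute your own.
cmd = build_star_command("STAR", "/path/to/genome", "reads.fastq", 0)
print(shlex.join(cmd))
```

After the run, the "Average mapped length" line in Log.final.out shows how much of each read was actually aligned.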

If you could send me a few hundred thousand of the reads, I could take a look at them.

Cheers

Alex

On Wednesday, January 2, 2013 9:03:09 PM UTC-5, mbritton wrote:

Hi folks,

I've been using STAR for a little while, and today tried running it for the first time on a small set of HiSeq 50 bp single-end sequences, aligning to hg19. I had originally generated the genome (with gencode.v14.annotation.gtf.sjdb) using --sjdbOverhang 99, but I have since re-generated the genome (in a new directory) using --sjdbOverhang 49, which would seem to be more appropriate for these reads. I also generated a genome without the sjdb file (so no --sjdbOverhang). I ran star against all these genome directories, using both the raw reads, and the same reads that had been trimmed for quality and adapter contamination. A sample command line is:

star_2.2.0c/star \
--genomeDir /references/star49 \
--readFilesIn \
../raw/C1556ACXX_721_S08_8750_CAGATC_CAGATC_L007_R1_001.fastq \
--runThreadN 14 genomeLoad=LoadAndRemove \
--outFileNamePrefix ./C1556ACXX_721_S08_8750_49_ \
--outFilterMultimapNmax 2

The results are very similar for each type of run, with a very high percentage of "too short" unmapped reads.
Here are the last few lines of a Log.final.out:

                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |    0.00%
                 % of reads unmapped: too short |    45.77%
                     % of reads unmapped: other |    0.17%
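
For comparing several runs, lines like these can be pulled out of Log.final.out with a few lines of Python. This is only a sketch that assumes each statistic is on its own line in the `name | value` layout shown above; the function name `parse_log_final_out` is hypothetical.

```python
def parse_log_final_out(text):
    """Parse 'name | value' lines from Log.final.out text into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNMAPPED READS:" carry no value
        name, value = line.split("|", 1)
        stats[name.strip()] = value.strip()
    return stats


sample = """\
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |    0.00%
                 % of reads unmapped: too short |    45.77%
                     % of reads unmapped: other |    0.17%
"""

stats = parse_log_final_out(sample)
too_short = float(stats["% of reads unmapped: too short"].rstrip("%"))
print(too_short)  # 45.77
```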

This stays at ~45% whether I input untrimmed reads (all 50bp) or trimmed reads (most 50bp, but some as short as 20).
What exactly does "too short" refer to? Does it mean that the aligned portion of the read is "too short"? Is there a parameter that I should change that would allow more reads to be mapped?

Thanks,

Monica

mbritton

Jan 4, 2013, 3:20:15 AM

to rna-...@googlegroups.com

Hi Alex:

Thanks for the suggestions. I had problems setting the outFilter parameters to zero, but setting them to 0.1 increased the % mapped reads to ~75%. I should mention that this is not a good read set; the RNA is from FFPE samples and is known to be degraded. It is interesting, though, that adapter and quality trimming only made a slight improvement to %reads aligned; I will investigate that further.

I was confused by the output from the runs where 45% of the reads were considered unmapped because the alignment length was "too short", but no reads were filtered out due to "too many mismatches", which I would have expected to be quite high. I now see that the default max mismatch (outFilterMismatchNmax) is 10, which seems to be a very high number. I also see that the ratio of mismatches to aligned length can be adjusted (by outFilterMismatchNoverLmax). Other aligners generally use much lower mismatch defaults, but I don't think they have the finer ratio controls included with star. I would like to get your opinion (and that of others too), about which parameters are best to use to control thresholds for alignment filtering, and whether you recommend different parameters when the goal of the alignment is differential expression calculation vs. SNP/Indel calling.

Thanks,

Monica

Alexander Dobin

Jan 4, 2013, 4:39:41 PM

to rna-...@googlegroups.com

Hi Monica,

The STAR algorithm can trim poor-quality tails, adapters, or any other kind of unmappable tail, so it is not surprising that the mapping percentage does not change significantly after trimming.

The default --outFilterMismatchNmax and --outFilterMismatchNoverLmax parameters are very tolerant, and are only good for relatively high quality samples with long reads (>75).

You can check the average "mismatch rate per base" in Log.final.out. For our typical RNA-seq it's 0.3-0.6%; you would probably expect a higher rate for your samples.

The "Average mapped length" will tell you how much of the sequence was trimmed off by STAR. Note that if a read has a tail with >50% of bases mismatched, this tail will be trimmed off completely, and if the remaining mapped part does not pass the --outFilterScoreMinOverLread and --outFilterMatchNminOverLread thresholds, the read will be reported as "unmapped - too short".

The default parameters are close to the optimum for expression quantification, but you would want more stringent parameters for calling SNP/indels. For SNP calling with 50b reads I would probably allow no more than 1 or 2 mismatches, and also would check the next best alignment.
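
To make the two mismatch controls concrete, here is a simplified Python model of how --outFilterMismatchNmax (an absolute count) and --outFilterMismatchNoverLmax (a ratio of mismatches to aligned length, as described earlier in the thread) could combine. It is a sketch, not STAR's code: the function name is hypothetical and the 0.05 ratio is an arbitrary illustrative value, not a recommended or default setting.

```python
def passes_mismatch_filters(mismatches, mapped_length,
                            mismatch_nmax, mismatch_nover_lmax):
    """Simplified model: an alignment passes only if the mismatch count is
    within mismatch_nmax AND mismatches / mapped_length is within
    mismatch_nover_lmax. Not STAR's actual implementation.
    """
    if mapped_length <= 0:
        return False  # nothing aligned, so nothing can pass
    return (mismatches <= mismatch_nmax
            and mismatches / mapped_length <= mismatch_nover_lmax)


# Stringent settings for SNP calling on 50b reads: at most 2 mismatches.
# The 0.05 ratio is an assumed value chosen only for this example.
print(passes_mismatch_filters(2, 50, mismatch_nmax=2, mismatch_nover_lmax=0.05))  # True
print(passes_mismatch_filters(3, 50, mismatch_nmax=2, mismatch_nover_lmax=0.05))  # False: too many mismatches
print(passes_mismatch_filters(2, 35, mismatch_nmax=2, mismatch_nover_lmax=0.05))  # False: 2/35 exceeds 0.05
```

The ratio matters most for reads that STAR has trimmed: the same two mismatches are a larger fraction of a 35b alignment than of a full 50b one.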

Cheers

Alex

kminie

Apr 21, 2017, 7:23:34 AM

to rna-star

I am facing the same kind of problem: "% of reads unmapped: too short" is nearly 70-80%.
I am not sure why, as we used a standard library size selection.

Please tell me what the threshold cutoff is for a read to be categorized as "too short". What is the minimum number of matches required for a read to be mapped?

Alexander Dobin

Apr 21, 2017, 3:58:56 PM

to rna-star

Hi @kminie,

the --outFilterScoreMinOverLread and --outFilterMatchNminOverLread parameters are 0.66 by default, which means that ~66% of the read length is required to match the genome; otherwise the read is considered unmapped - "too short".

Please post the whole Log.final.out results of your run.

Cheers

Alex
