Hi Shawn,
Thanks a lot for sharing this detailed analysis. Your observations about the behavior of the STAR parameters are right on point.
The output filters in STAR are combined with a logical "AND": an alignment has to pass every filter, so the most stringent filter controls the final decision.
If you want only one of the similar filters to work, you have to make the other completely nonrestrictive - as you did with --outFilterMismatchNoverLmax 0.1 --outFilterMismatchNmax <very large>
Of course, if you wanted to set an absolute number of mismatches Nmm rather than a relative one, you could use --outFilterMismatchNoverLmax 1 --outFilterMismatchNmax <Nmm>.
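To make the "AND" behavior of the two mismatch filters concrete, here is a minimal Python sketch. It is not STAR's code: the function name and arguments are hypothetical, and it assumes that both comparisons are "less than or equal" and that the relative filter is measured against the mapped length.

```python
def passes_mismatch_filters(n_mismatch, mapped_length, nmax, n_over_l_max):
    """Return True only if the alignment passes BOTH mismatch filters.

    nmax          stands in for --outFilterMismatchNmax (absolute limit)
    n_over_l_max  stands in for --outFilterMismatchNoverLmax (relative limit)
    Assumption: both limits are inclusive, and the relative limit is
    taken against the mapped length of the alignment.
    """
    passes_absolute = n_mismatch <= nmax
    passes_relative = n_mismatch <= n_over_l_max * mapped_length
    return passes_absolute and passes_relative


# Relative-only control: the absolute limit is made nonrestrictive.
print(passes_mismatch_filters(8, 100, nmax=999, n_over_l_max=0.1))
print(passes_mismatch_filters(12, 100, nmax=999, n_over_l_max=0.1))

# Absolute-only control: the relative limit is set to 1, so it never binds.
print(passes_mismatch_filters(3, 100, nmax=3, n_over_l_max=1.0))
print(passes_mismatch_filters(4, 100, nmax=3, n_over_l_max=1.0))
```

In each pair only the filter that was left restrictive decides the outcome, which is the effect of making the other one nonrestrictive.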
The logic for multi-mapping reads is indeed different from that of DNA aligners, since STAR was originally designed for RNA alignments. As far as I understand, DNA aligners typically report the multi-mapping alignments to define the "mapping quality" of the best alignment. For RNA-seq, on the other hand, we typically care about all possible loci from which a read could have originated, so we only need to explore alignment scores very close to the best score. STAR will report all multi-mapping alignments with scores >= (bestScore - outFilterMultimapScoreRange). If the number of these alignments is > outFilterMultimapNmax, no alignments of this read will be output; this is just to prevent pollution of the output.
If you set --outFilterMultimapScoreRange to a very large number (> read length), STAR will output a lot of suboptimal alignment pieces, more or less like BLAT does - I am not sure if this is very useful. I guess I could introduce an option to limit the absolute minimum alignment score for the sub-optimal alignments.
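The multi-mapper rule can be sketched in a few lines of Python. This only illustrates the rule as stated in this message, not STAR's implementation; the function name and the plain list of scores are hypothetical.

```python
def select_multimappers(scores, score_range, multimap_nmax):
    """Apply the multi-mapper rule to a list of alignment scores.

    score_range    stands in for --outFilterMultimapScoreRange
    multimap_nmax  stands in for --outFilterMultimapNmax
    Returns the scores of the alignments that would be reported,
    or an empty list if the read is dropped as too multi-mapping.
    """
    if not scores:
        return []
    best_score = max(scores)
    # Keep every alignment whose score is within score_range of the best.
    kept = [s for s in scores if s >= best_score - score_range]
    # Too many near-best loci: report nothing for this read.
    if len(kept) > multimap_nmax:
        return []
    return kept


scores = [98, 97, 90]
print(select_multimappers(scores, score_range=1, multimap_nmax=10))    # [98, 97]
print(select_multimappers(scores, score_range=1, multimap_nmax=1))     # []
print(select_multimappers(scores, score_range=200, multimap_nmax=10))  # [98, 97, 90]
```

The last call shows the effect of a very large score range: every suboptimal alignment gets through, up to the outFilterMultimapNmax cap.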
outFilterMatchNminOverLread restricts the minimum number of bases matched to the genome (as a fraction of the read length), excluding mismatches and indels. If you set it very high, it will indeed limit the number of allowed mismatches. Other parameters with a similar meaning are listed below (again, these filters are applied in the "AND" fashion):
outFilterMatchNmin 0
    int: alignment will be output only if the number of matched bases is higher than this value
outFilterScoreMin 0
    int: alignment will be output only if its score is higher than this value
outFilterScoreMinOverLread 0.66
    float: outFilterScoreMin normalized to read length (sum of mates' lengths for paired-end reads)
By default these filters are quite nonrestrictive, allowing alignments longer than 2/3 of the read length.
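As a worked example of the defaults, the sketch below turns the normalized thresholds into absolute ones for a paired-end read with two 100-base mates. The names are hypothetical. The 0.66 default for outFilterMatchNminOverLread is my understanding of the manual rather than something stated above, and whether the comparisons include the boundary value is an assumption.

```python
def passes_length_filters(n_matched, score, read_length,
                          match_nmin=0, score_min=0,
                          match_nmin_over_lread=0.66,
                          score_min_over_lread=0.66):
    """Return True only if all four match/score filters pass.

    read_length is the sum of the mates' lengths for paired-end reads.
    Assumption: thresholds are inclusive (>=); the boundary behavior in
    STAR itself may differ.
    """
    return (n_matched >= match_nmin
            and score >= score_min
            and n_matched >= match_nmin_over_lread * read_length
            and score >= score_min_over_lread * read_length)


read_length = 2 * 100  # two 100-base mates
# With the defaults, the effective threshold is about 0.66 * 200 = 132.
print(passes_length_filters(n_matched=152, score=150, read_length=read_length))
print(passes_length_filters(n_matched=101, score=100, read_length=read_length))
```

The first alignment covers well over 2/3 of the combined read length and passes; the second covers about half and is filtered out.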
Your observation that STAR finds fewer indels than the DNA aligners is correct: STAR is very cautious about indels that are close to read ends.
Accurate identification of such indels is not easy for RNA-seq, since they could also be mapped as splice junctions with much longer gaps.
My preference is to have a lower sensitivity but higher precision. For DNA alignments, calling short indels near the ends is important, and I will have to think about an efficient algorithm to detect them.
I would like to make STAR more friendly to DNA in the future, but I would have to delve into DNA alignment, with which I do not have much experience.
I think you have figured out the main points already: end-to-end alignments vs. local, controlling the number of mismatches, controlling the multi-mappers, prohibiting splicing.
Cheers
Alex