small-RNA seq multi-mapping concerns

Hartwig Visser

unread,

Oct 13, 2020, 8:09:32 AM10/13/20

to rna-star

Hi Alex (and everyone else :))

I trust you are well during these difficult times.

I've been reading far and wide to find a solution to multi-mapped reads when aligning with STAR, but I'm just confusing myself. Especially since it seems there is no real consensus on how to deal with them.

I have decided, that the best option for me would be to keep only uniquely mapped reads along with 1 instance of every multi-mapped read (i.e. the best mapped one) and then to discard the rest.

I have paired-end sequences (trimmed to 25 bp) (which I will align to miRBase reference assembly) and I'm currently working via Galaxy and using your settings for small-RNA:

--outFilterMismatchNoverLmax 0.05 --outFilterMatchNmin 16 --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --alignIntronMax 1

Which other settings would I need to change to achieve my goal (i.e. only unique reads and 1 best read for multi-mappers)?

Furthermore, I ultimately want to perform differential expression analysis on miRNA expression. Provided everything goes to plan, would the STAR output BAM file only include the abovementioned reads, or do I need to further filter the BAM file? I have tried to play around with the settings but the number of reported "uniquely mapped reads" never corresponds with the sum of the read counts from e.g. samtools idxstat.

Thank you for you assistance.

Kind regards

Hartwig Visser

rohit satyam

unread,

Oct 13, 2020, 3:34:19 PM10/13/20

to rna-star

Hi @hartwig

You can retain these multi mapping reads in STAR BAM files and use mmquant from here: https://bitbucket.org/mzytnicki/multi-mapping-counter/src/master/ to take into account multi mapping reads. However, it will fuse the genes that the read multimaps to. Given your read length i.e. 25 bps , you would want to increase --outFilterMatchNmin to 20 and that might reduce the multi mapping reads. However, in the case of miRNA, read mapping to 5-20 distinct sites are allowed as per this paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8. Also you might like to use miRNA-specific aligners or maybe QuickMIRSeq pipeline alongside star to see if there is any difference.

Also Uniquely mapped reads from STAR will never correspond to what you get from samtools and it has been discussed in the manual:

Log.final.out: summary mapping statistics after mapping job is complete, very useful for quality control. The statistics are calculated for each read (single- or paired-end) and then summed or averaged overall reads. Note that STAR counts a paired-end read as one read, (unlike the samtools flagstat/idxstats, which count each mate separately). Most of the information is collected about the UNIQUE mappers (unlike samtools flagstat/idxstats which does not separate unique or multi-mappers). Each splicing is counted in the numbers of splices, which would correspond to summing the counts in SJ.out.tab. The mismatch/indel error rates are calculated on a per base basis, i.e. as the total number of mismatches/indels in all unique mappers divided by the total number of mapped bases.

Hope this helps.

Best

Alexander Dobin

unread,

Oct 14, 2020, 7:56:22 PM10/14/20

to rna-star

Hi Hartwig,

The set of parameters you have are good for small RNA-seq.

if you want to keep only one alignment of each multimapper, you need --outSAMmultNmax 1 .

You probably want to pick the alignment randomly from the multimappers, you would need --outMultimapperOrder Random

To consider only top-scoring alignments, you need --outFilterMultimapScoreRange 0

You may also consider increasing --outFilterMultimapNmax from default 10 - it will not blow up your BAM, since you are only recording 1 alignment.

Altogether:

--outSAMmultNmax 1 --outMultimapperOrder Random --outFilterMultimapScoreRange 0 --outFilterMultimapNmax 50