Multi-mapped reads

Leif Majeres

Dec 2, 2022, 4:20:03 PM
to rna-star
Hi Alex,

I spent the better part of yesterday afternoon trying to troubleshoot this, but I think I've reached a point where I'm going to need to ask for help.

I'm using STAR to align paired-end reads from a NovaSeq run. Most reads are 150bp long, but quite a few had adapter contamination at the ends, so after trimming we have a good number of reads that are 135bp and below.

My initial alignment, which I ran with my data as paired-end, had a fairly low mapping rate (60~70% of reads mapped), with the remainder getting binned as 'too short'.
Based on my understanding of the documentation, I *think* this has something to do with how STAR will normalize the score it needs to align to the length of the paired-end reads. I have a labmate with a similar but much worse problem (all of her forward reads are half the length of the reverse reads, and STAR's giving her a 1% map rate when run paired-end).
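
To make the normalization concrete, here is a minimal sketch, not STAR's actual code. It assumes the default --outFilterScoreMinOverLread of 0.66 and that, for paired-end input, the length used for normalization is the sum of both mates' lengths. The function name is hypothetical, and the score and lengths are taken from the SAM records quoted further down (AS 122, two 125-base mates). STAR applies other filters as well (for example --outFilterMatchNminOverLread), which are left out here.

```python
def passes_length_filter(score, mate_lengths, min_over_lread=0.66):
    """Simplified version of a length-normalized score filter.

    score          -- alignment score (the AS tag in the SAM output)
    mate_lengths   -- list of mate lengths; one entry for single-end input
    min_over_lread -- assumed default of --outFilterScoreMinOverLread
    """
    return score >= min_over_lread * sum(mate_lengths)


# Only one 125-base mate aligned, but the threshold is scaled to both mates.
paired_ok = passes_length_filter(122, [125, 125])   # needs a score of at least 165
# The same alignment, judged against a single 125-base read.
single_ok = passes_length_filter(122, [125])        # needs a score of at least 82.5

print(paired_ok, single_ok)
```

Under these assumptions, an alignment that covers only one mate fails the paired-end threshold and is reported as 'too short', while the same alignment passes easily when the mate is run as a single-end read.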

When I ran my dataset as unpaired, the maprates were much higher. ~95% uniquely mapped, with less than 1% being too short, and the remaining 3~4% being considered mapped to multiple loci. So I do think it is an issue with how STAR is handling the pairs.

Now, I did figure out how to turn off binning things as 'too short' for STAR to align, and it all was able to map. Some of those reads went to 'uniquely mapped', but a lot are now considered 'mapped to multiple loci' (20~30%). I figured it was unlikely to be an rRNA thing, because this didn't occur when I ran the same data unpaired.

So I decided to go digging around in IGV and in the output SAM.
Most of the reads I found that were considered mapped to 2 loci (which was the majority of the multi-mapped reads) were behaving like this. Here are the lines from my SAM for one library.

A01770:18:HF3HVDMXY:2:1157:17309:20134  345     NC_037349.1     50888249        3       1S124M  *       0       0       TTCTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCC        FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:2  HI:i:2  AS:i:122        nM:i:0
A01770:18:HF3HVDMXY:2:1157:17309:20134  137     NC_037349.1     50888250        3       124M1S  *       0       0       CTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCCCA        FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NH:i:2  HI:i:1  AS:i:122        nM:i:0

And here's what it looks like in IGV.
[Attachment: 522A troubleshooting.PNG]
The green-highlighted ones are the lines in question.
Now this looks and acts like paired reads in IGV. However, when I look at the flag code, it's saying their mate is unmapped. And sure enough, when I look at the unmapped output, they both are there too!
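
For reference, the two flag values in the SAM lines above (345 and 137) can be decoded with a few lines of Python. This is a small sketch based on the bit definitions in the SAM specification; the dictionary labels and the function name are just illustrative choices.

```python
# Bit values as defined in the SAM specification; the labels are informal.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


for flag in (345, 137):
    print(flag, decode_flag(flag))
```

Both records carry the mate_unmapped bit and neither carries proper_pair, so each mate was aligned on its own. Flag 345 additionally marks a first-in-pair, reverse-strand, secondary alignment; flag 137 marks a second-in-pair, forward-strand alignment.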

First in pair (mapped)
TTCTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCC

Second in pair (mapped)
CTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCCCA

Mate1 (unmapped)
GGACACGGGATGATGGAGGAGGTGCTCTCACGTGCAGAGGGTGAGCTCTGGGTTAGGTGTCGCAGGATCTGACCTGGGGCTCCAAGCGCAGGAGCACGTCGTAGTCCATGGCTCTCGGTACAGAA

Mate2 (unmapped)
CTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCCCA

I looked at a lot of these, and what they have in common is that they're short enough that the mate almost completely overlaps its partner. In this one, for example, they're only off from each other by two bases (which might be adapters that didn't get clipped; I admit I haven't looked into that enough yet).
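
The offset between the two mates can be worked out from the POS and CIGAR fields of the two SAM records above. The sketch below is only an illustration (the helper name is made up); it computes the reference span of each alignment, ignoring soft-clipped bases since they do not occupy reference positions.

```python
import re


def ref_span(pos, cigar):
    """Return (start, end), 1-based inclusive, of an alignment on the reference."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Only M, D, N, = and X consume reference bases; S, H, I and P do not.
    ref_len = sum(int(n) for n, op in ops if op in "MDN=X")
    return pos, pos + ref_len - 1


minus_mate = ref_span(50888249, "1S124M")  # flag 345, reverse strand
plus_mate = ref_span(50888250, "124M1S")   # flag 137, forward strand

# How far the reverse-strand mate starts to the left of the forward-strand mate,
# and how far the forward-strand mate ends to the right of the reverse-strand mate.
left_protrusion = plus_mate[0] - minus_mate[0]
right_protrusion = plus_mate[1] - minus_mate[1]

print(minus_mate, plus_mate, left_protrusion, right_protrusion)
```

By this calculation, each mate sticks out past the other by one aligned base (plus one soft-clipped base on each record), which matches the two-base difference between the sequences.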
My conundrum is, I feel like STAR should be able to consider these mapped, not multi-mapped. It's strange that it's discarding its partner as unmapped, then mapping the thing anyway. I feel like it should be able to handle these as paired, mapped reads, even if they are really overlapping.
What would be the best way to solve this problem?
I'll attach the logs for the file I've been using to explore/troubleshoot this problem, if it's of any use to you. Let me know if there's anything else I can give you that might be of use.

Best, and happy Thanksgiving!
Leif
UnpairedMate1_522A_77_S77_1.Log.out
Paired_522A_77_S77.Log.final.out
Paired_522A_77_S77.Log.out
UnpairedMate1_522A_77_S77_1.Log.final.out

Alexander Dobin

Dec 2, 2022, 4:27:47 PM
to rna-star
Hi Leif,

I think the problem is that, after trimming, the ends of overlapping mates protrude improperly (i.e. the start of the -strand mate is to the left of the start of the +strand mate), which is not allowed by default.
You can allow such alignments with --alignEndsProtrude 10 ConcordantPair
If you have a lot of overlapping reads, I would recommend trying the new option in STAR that merges overlapping mates before mapping: --peOverlapNbasesMin 10
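
As a rough sketch of how either suggestion might be added to a run, the helper below only assembles a STAR argument list and does not execute anything. The function name, genome directory, FASTQ names and output prefix are placeholders, not values from this thread; only the two options and their values come from the reply above.

```python
def star_command(genome_dir, mate1, mate2, prefix, merge_overlapping=False):
    """Build a STAR paired-end command line as a list of arguments."""
    cmd = [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", mate1, mate2,
        "--outFileNamePrefix", prefix,
    ]
    if merge_overlapping:
        # Merge overlapping mates before mapping.
        cmd += ["--peOverlapNbasesMin", "10"]
    else:
        # Allow up to 10 protruding bases and treat such pairs as concordant.
        cmd += ["--alignEndsProtrude", "10", "ConcordantPair"]
    return cmd


# Placeholder paths, for illustration only.
protrude_cmd = star_command("genome_index", "R1.fastq", "R2.fastq", "protrude_")
merge_cmd = star_command("genome_index", "R1.fastq", "R2.fastq", "merged_",
                         merge_overlapping=True)
print(" ".join(protrude_cmd))
print(" ".join(merge_cmd))
```

The resulting list could be passed to subprocess.run once the placeholder paths are replaced with real ones.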

Cheers
Alex
