Hi Alex,
I spent the better part of yesterday afternoon trying to troubleshoot this, but I think I've reached a point where I'm going to need to ask for help.
I'm using STAR to align paired-end reads from a NovaSeq run. Most reads are 150 bp long, but quite a few had adapter contamination at the ends, so after trimming we have a good number of reads that are 135 bp and below.
My initial alignment, run as paired-end, had a fairly low mapping rate (60-70% of reads mapped), with the remainder binned as 'too short'.
Based on my understanding of the documentation, I *think* this has to do with how STAR normalizes the minimum score an alignment needs to the combined length of both mates in a pair. I have a labmate with a similar but much worse problem (all of her forward reads are half the length of her reverse reads, and STAR is giving her a 1% mapping rate when run paired-end).
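To make that concrete, here's a rough Python sketch of the arithmetic as I understand it. It assumes the default ratio of 0.66 for --outFilterScoreMinOverLread and --outFilterMatchNminOverLread, applied to the summed length of both mates; the function names and the 125 bp example pair are just for illustration, not anything STAR itself exposes.

```python
# Sketch of STAR's length-normalized filter for paired-end reads.
# Assumption: the default ratio of 0.66 is applied to the summed
# length of both mates (mate1_len + mate2_len).

DEFAULT_RATIO = 0.66  # assumed STAR default, unchanged in the run


def min_required(mate1_len: int, mate2_len: int,
                 ratio: float = DEFAULT_RATIO) -> float:
    """Minimum score / matched bases a pair needs to pass the filter."""
    return ratio * (mate1_len + mate2_len)


def passes(aligned_bases: int, mate1_len: int, mate2_len: int,
           ratio: float = DEFAULT_RATIO) -> bool:
    """True if an alignment covering `aligned_bases` clears the threshold."""
    return aligned_bases >= min_required(mate1_len, mate2_len, ratio)


if __name__ == "__main__":
    # Hypothetical pair of two 125 bp mates.
    print(min_required(125, 125))   # threshold for the pair
    print(passes(124, 125, 125))    # only one mate aligned (124 bases)
    print(passes(248, 125, 125))    # both mates aligned
```

If that reading is right, a pair where only one mate aligns can never clear the threshold on its own, which would be consistent with those reads landing in 'too short'.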
When I ran my dataset as unpaired, the mapping rates were much higher: ~95% uniquely mapped, less than 1% too short, and the remaining 3-4% mapped to multiple loci. So I do think it is an issue with how STAR is handling the pairs.
Now, I did figure out how to stop STAR from binning reads as 'too short', and with that change everything was able to map. Some of those reads went to 'uniquely mapped', but a lot are now considered 'mapped to multiple loci' (20-30%). I figured it was unlikely to be an rRNA thing, because this didn't occur when I ran the same data unpaired.
So I decided to go digging around in IGV and in the output SAM.
Most of the reads I found that were considered mapped to 2 loci (which was the majority of the multi-mapped reads) were behaving like this. Here are the SAM lines for one such read from one library.
A01770:18:HF3HVDMXY:2:1157:17309:20134 345 NC_037349.1 50888249 3 1S124M * 0 0 TTCTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:2 AS:i:122 nM:i:0
A01770:18:HF3HVDMXY:2:1157:17309:20134 137 NC_037349.1 50888250 3 124M1S * 0 0 CTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCCCA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:1 AS:i:122 nM:i:0
And here's what it looks like in IGV.
The green-highlighted ones are the reads in question.
Now, these look and act like paired reads in IGV. However, when I look at the flags, they say each read's mate is unmapped. And sure enough, when I look at the unmapped output, both reads are there too!
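For reference, this is how I'm reading the two flags (345 and 137) from the SAM lines above. It's a small Python sketch that uses the bit definitions from the SAM specification; the dictionary labels and the decode_flag helper are my own naming.

```python
from typing import List

# SAM flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(345))
# ['paired', 'mate_unmapped', 'reverse', 'first_in_pair', 'secondary']
print(decode_flag(137))
# ['paired', 'mate_unmapped', 'second_in_pair']
```

So both records are flagged as paired with the mate unmapped, neither has the proper-pair bit set, and the first-in-pair record is marked as a secondary alignment.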
First in pair (mapped)
TTCTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCC
Second in pair (mapped)
CTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCCCA
Mate1 (unmapped)
GGACACGGGATGATGGAGGAGGTGCTCTCACGTGCAGAGGGTGAGCTCTGGGTTAGGTGTCGCAGGATCTGACCTGGGGCTCCAAGCGCAGGAGCACGTCGTAGTCCATGGCTCTCGGTACAGAA
Mate2 (unmapped)
CTGTACCGAGAGCCATGGACTACGACGTGCTCCTGCGCTTGGAGCCCCAGGTCAGATCCTGCGACACCTAACCCAGAGCTCACCCTCTGCACGTGAGAGCACCTCCTCCATCATCCCGTGTCCCA
I looked at a lot of these, and what they have in common is that they're short enough that each mate almost completely overlaps its partner. In this one, for example, they're only offset from each other by two bases (which might be adapter bases that didn't get clipped; I admit I haven't looked into that enough yet).
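In case it's useful, here's roughly how the overlap between two mates can be checked outside of IGV. It's a minimal Python sketch that assumes exact matches only (no mismatches or indels); reverse_complement and best_offset are hypothetical helpers I wrote for this, and the sequences in the example are short made-up stand-ins rather than the full reads.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence (upper-case A/C/G/T/N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def best_offset(a: str, b: str, min_overlap: int = 20) -> Optional[int]:
    """Find a shift at which b matches a exactly over their overlap.

    A shift of k means b[0] lines up with a[k] (negative k: b starts
    before a). Returns the first matching shift, scanning from the most
    negative, or None if no shift gives an exact overlap of at least
    `min_overlap` bases.
    """
    for shift in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        start_a = max(shift, 0)
        start_b = max(-shift, 0)
        n = min(len(a) - start_a, len(b) - start_b)
        if n >= min_overlap and a[start_a:start_a + n] == b[start_b:start_b + n]:
            return shift
    return None


if __name__ == "__main__":
    # Toy stand-ins: mate 1 as sequenced, mate 2 as sequenced.
    mate1 = "CTCGGTACAGAA"
    mate2 = "CTGTACCGAGCA"
    # Put mate 1 on the same strand as mate 2 before comparing.
    print(best_offset(reverse_complement(mate1), mate2, min_overlap=5))
```

Running the same comparison on the full sequences (after reverse-complementing Mate1) is how I'd confirm the two-base offset for other pairs.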
My conundrum is, I feel like STAR should be able to consider these mapped, not multi-mapped. It's strange that it's discarding its partner as unmapped, then mapping the thing anyway. I feel like it should be able to handle these as paired, mapped reads, even if they are really overlapping.
What would be the best way to solve this problem?
I'll attach the logs for the file I've been using to explore/troubleshoot this problem, if it's of any use to you. Let me know if there's anything else I can give you that might be of use.
Best, and happy Thanksgiving!
Leif