Hello,
I have a very large set of reads and, for post-processing reasons, I used STAR to align the reads to the organism's reference genome and then also to the reverse complement of that reference genome (then I should be able to take the "forward" set of alignments from each run to get the complete set). However, when I aligned to the reverse-complement genome and then converted the results to "forward" coordinates, the results didn't match up in a small fraction of cases (and overall, there were slightly fewer uniquely aligned reads and splice junctions). As an example, I isolated some alignments to chr XIII and just ran STAR on those reads to align to chr XIII and chr XIII_minus fasta files. For the following read, these are the two alignments found:
SNXXXX:XXX:8844:15231 272 chrXIII_minus 213483 3 2S12M247348N36M * 0 0 GTGCTCTTCCGACTGACAAGCGCCCATTCTGACCATTAAACTATCACGGA IGIIHJIGJJJIJIHHGHJIJIIIHH@DIJJJJJJIIHHFHFFDFFFCC@ NH:i:2 HI:i:1 AS:i:36 nM:i:0
SNXXXX:XXX:8844:15231 0 chrXIII 463554 3 38M247348N10M2S * 0 0 TCCGTGATAGTTTAATGGTCAGAATGGGCGCTTGTCAGTCGGAAGAGCAC @CCFFFDFFHFHHIIJJJJJJID@HHIIIJIJHGHHIJIJJJGIJHIIGI NH:i:2 HI:i:2 AS:i:40 nM:i:2
(chr XIII is 924431 bp long so in this instance it maps back to the same genomic location.) I understand that given STAR's rules, these two alignments are scored differently, but why wouldn't the reverse-complement
equivalent of the bottom alignment appear as an equally-scored alignment? (Or, for that matter, the reverse-complement equivalent of the top). Thus, is the alignment to chrXIII_minus at position 213483 with CIGAR string "2S10M247348N38M" not being found? Or not being scored appropriately? Or might there be a keyword that I should have set? (I only set the keywords: outFilterMultimapNmax=100, outFilterMultimapScoreRange=10)
Any help would be appreciated! If there's anything else I should provide, let me know.
Thank you!
Blythe