Hello,
I am working with 150-base paired-end sequencing reads from Mus musculus and their STAR-generated CIGAR strings. I used STAR to align these reads to an N-masked genome that I created to mask known SNP positions. I used the GRCm39 Ensembl GTF file to create the N-masked genome (along with an N-masked FASTA file). On closer inspection, a few of the CIGAR strings don't seem to match the alignment.
For example,
Read A:
A00783:429:HCKLKDSXY:3:1164:11306:10238 163 15 + 98020822 150 255 75M1D75M 15 98020906 234 CAAGAACGTTAGTGTGCAGCCCTAACCCCCAGCACCTTTGGTGGCTTCTGGGTTGTGGCCCTGCACCTCCTCTTGTTCCTTCTGCCCATGCTTTTCTCCTGCCAGGAGCCATCTCATGACAAAACACTTTTGTCATAGTTTACCTTAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:
CIGAR is 75M1D75M
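As a sanity check on what this CIGAR implies, here is a minimal sketch that counts how many read and reference bases it consumes. It assumes the standard SAM convention (M, I, S, =, X consume the read; M, D, N, =, X consume the reference). The helper names `parse_cigar` and `cigar_spans` are made up for illustration and are not part of STAR.

```python
import re

# Which CIGAR operations consume read (query) bases and reference bases,
# assuming the standard SAM specification table.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) tuples."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def cigar_spans(cigar):
    """Return (read bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query_len = sum(n for n, op in ops if op in CONSUMES_QUERY)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return query_len, ref_len

query_len, ref_len = cigar_spans("75M1D75M")
start = 98020822             # 1-based leftmost position from the record above
end = start + ref_len - 1    # last reference position covered by the alignment
print(query_len, ref_len, end)  # 150 151 98020972
```

So the 150-base read covers 151 reference bases, and the reference slice used for comparison needs to be 151 bases long, not 150.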
Pulling this region from the N-masked genome FASTA file gives:
CAAGAANGTTAGTGTGCAGCCCNAANGCCCAGCACCTTTGGTGGCTTCTGGGTTGTGACCCTGCACCTCCTCTTGNTANCTTCTGCCCNTGCTTTTCTCCTGCCAGGAGCCNTCTCATGACAAAACACTTTTGTCGTAGTTTACCTCAGG
If I place these sequences side by side with position numbering, I get the output shown in the attached file 'forum.example'.
I find that the bases match up until base pair #77. I realize there is an "N" in the 76th position of the N-masked FASTA sequence. Could this have anything to do with the incorrect CIGAR? Even if so, why were the Ns that occur earlier in the FASTA sequence assigned an "M" in the CIGAR string?
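A side-by-side comparison like this can also be done programmatically by walking the CIGAR over both sequences. Below is a minimal sketch on toy sequences. It assumes SAM semantics, where "M" is an alignment match that may be either a sequence match or a mismatch, and it labels columns where the reference is "N" as masked rather than as mismatches. The function name `walk_alignment` and the status labels are hypothetical, and this does not reproduce how STAR itself scores Ns.

```python
import re
from collections import Counter

def walk_alignment(read, ref, cigar):
    """Yield (read_base, ref_base, status) for each alignment column.

    `ref` must start at the alignment's leftmost position and cover every
    reference-consuming base. Only M, =, X, I, D, N and S are handled.
    """
    r = g = 0  # current offsets into the read and the reference
    for length, op in re.findall(r"(\d+)([MIDNS=X])", cigar):
        for _ in range(int(length)):
            if op in "M=X":
                a, b = read[r], ref[g]
                if b == "N":
                    status = "masked"      # reference base is N-masked
                elif a == b:
                    status = "match"
                else:
                    status = "mismatch"
                yield a, b, status
                r += 1
                g += 1
            elif op in "IS":               # consumes the read only
                yield read[r], "-", "insertion" if op == "I" else "softclip"
                r += 1
            else:                          # D or N: consumes the reference only
                yield "-", ref[g], "deletion" if op == "D" else "skipped"
                g += 1

# Toy example: an 8-base read against a 9-base reference slice with one N.
read = "ACGTACGT"
ref = "ACNTGACGA"
columns = list(walk_alignment(read, ref, "4M1D4M"))
counts = Counter(status for _, _, status in columns)
for read_base, ref_base, status in columns:
    print(read_base, ref_base, status)
```

Running the same walk on a real read requires a reference slice that spans all reference-consuming operations. Comparing the two strings position by position without accounting for the deletion shifts every base after it by one.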
In short, my question is: how does STAR handle CIGAR generation over an N-masked genome?
I'm happy to provide more examples of incorrect CIGAR strings and/or any other information that might be helpful in understanding this issue.
Thank you in advance!
Jack