STAR Cigar Strings Not Matching // N-masked genome

Jack Engel

Jul 10, 2023, 11:05:06 AM

to rna-star

Hello,

I am working with 150-base paired-end sequencing reads from Mus musculus and their STAR-generated CIGAR strings. I used STAR to align these reads to an N-masked genome that I created to mask known SNP positions. I used the GRCm39 Ensembl GTF file to create the N-masked genome (along with an N-masked FASTA file). After inspecting a few CIGAR strings, I find that they don't seem to describe the alignment correctly.

For example,

Read A:

A00783:429:HCKLKDSXY:3:1164:11306:10238 163 15 + 98020822 150 255 75M1D75M 15 98020906 234 CAAGAACGTTAGTGTGCAGCCCTAACCCCCAGCACCTTTGGTGGCTTCTGGGTTGTGGCCCTGCACCTCCTCTTGTTCCTTCTGCCCATGCTTTTCTCCTGCCAGGAGCCATCTCATGACAAAACACTTTTGTCATAGTTTACCTTAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:

CIGAR is 75M1D75M

Pulling this region of the N-masked genome fasta file outputs:

CAAGAANGTTAGTGTGCAGCCCNAANGCCCAGCACCTTTGGTGGCTTCTGGGTTGTGACCCTGCACCTCCTCTTGNTANCTTCTGCCCNTGCTTTTCTCCTGCCAGGAGCCNTCTCATGACAAAACACTTTTGTCGTAGTTTACCTCAGG

If I place these sequences next to each other with a numbering system, I get this output --> see attached file 'forum.example'

I find that the sequences match up to base #77. I realize there is an "N" at position 76 of the N-masked FASTA sequence. Could this have anything to do with the incorrect CIGAR? Even if so, why were the Ns found earlier in the FASTA sequence assigned an "M" in the CIGAR string?
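For reference, a minimal sketch of the same side-by-side comparison that follows the CIGAR operations instead of pairing bases one-to-one. The function name `align_by_cigar` is hypothetical, and the sketch assumes the reference string starts at the alignment's leftmost position. It uses the SAM convention that M/=/X consume both read and reference, I and S consume only the read, and D and N consume only the reference.

```python
import re

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def align_by_cigar(read, ref, cigar):
    """Return (read_row, ref_row) with '-' gaps placed according to the CIGAR.

    Assumption: `ref` begins at the alignment's leftmost reference position.
    """
    read_row, ref_row = [], []
    i = j = 0  # i indexes the read, j indexes the reference
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":          # consumes read and reference
            read_row.append(read[i:i + n])
            ref_row.append(ref[j:j + n])
            i += n
            j += n
        elif op == "I":          # consumes read only
            read_row.append(read[i:i + n])
            ref_row.append("-" * n)
            i += n
        elif op in "DN":         # consumes reference only
            read_row.append("-" * n)
            ref_row.append(ref[j:j + n])
            j += n
        elif op == "S":          # soft clip: skip read bases, not shown
            i += n
        # H and P consume neither sequence
    return "".join(read_row), "".join(ref_row)


# Toy example (not the real read): one reference base is skipped by 1D.
read_row, ref_row = align_by_cigar("ACGTTT", "ACGNTTT", "3M1D3M")
print(read_row)
print(ref_row)
```

Under these assumptions, for 75M1D75M the 1D would consume reference position 76 without consuming a read base, so read base 76 is paired with reference base 77 rather than 76.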

In short, my question is: how does STAR handle CIGAR generation over an N-masked genome?

I'm happy to provide more examples of incorrect CIGAR strings and/or any other information that might be helpful in understanding this issue.

Thank you in advance!

Jack

forum.example

Alexander Dobin

Jul 10, 2023, 3:31:07 PM

to rna-star

Hi Jack,

STAR should not have a problem with rare Ns in the genome. Please check that masking Ns did not introduce insertions in the genome.
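One way to run such a check is sketched below, assuming the original and masked sequences for a chromosome have already been loaded as plain strings. The function name `check_masking` is hypothetical and is not part of STAR. The idea is that masking should only substitute bases with N, so the two sequences must have equal length and may differ only where the masked base is N.

```python
def check_masking(original, masked):
    """Check that `masked` differs from `original` only by substitutions to N.

    Returns (ok, message). Both arguments are plain sequence strings for the
    same chromosome (assumption: already read from the two FASTA files).
    """
    if len(original) != len(masked):
        # A length change means bases were inserted or deleted during masking.
        return False, f"length changed: {len(original)} -> {len(masked)}"
    for pos, (a, b) in enumerate(zip(original, masked), start=1):
        if a.upper() != b.upper() and b.upper() != "N":
            return False, f"non-N change at position {pos}: {a} -> {b}"
    return True, "only substitutions to N"


# Toy examples (not real genome data)
print(check_masking("ACGT", "ACNT"))   # same length, one base masked
print(check_masking("ACGT", "ACNGT"))  # an N was inserted, shifting coordinates
```

A length mismatch on any chromosome would suggest that masking shifted downstream coordinates, which could show up as unexpected insertions or deletions in the CIGAR strings.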
