How does STAR handle repeat masked genomes?

Joshua Bradley

May 15, 2015, 1:18:59 PM

to rna-...@googlegroups.com

I've seen multiple posts on Biostars that say "never align to a repeat masked genome", which makes sense. So my question is:

Does STAR recognize a genome that has soft-masked (lowercase) repeats, or is the sequence matching process completely case-insensitive?

Alexander Dobin

May 15, 2015, 5:07:24 PM

to rna-...@googlegroups.com, jgbra...@gmail.com

Hi Joshua,

Lowercase and uppercase A, C, G, T, and all other characters are treated the same (they are actually converted to numeric codes 0-5) for both the genome and the read sequences.

Cheers

Alex

César Lizárraga

Aug 10, 2015, 10:39:15 AM

to rna-star, jgbra...@gmail.com

Wait, does that mean lowercase and uppercase A, C, G, T are all scored the same, and so is Other (N)? Or is N treated differently?

Thanks.

Cesar.

Alexander Dobin

Aug 10, 2015, 4:08:58 PM

to rna-star, jgbra...@gmail.com

Hi Cesar,

a=A, c=C, g=G, t=T for both the genome and read sequences.

All other characters (including the IUPAC ambiguity codes for two or three nucleotides) are converted into "N", i.e. unknown.

Cheers

Alex
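
The conversion described above can be illustrated with a minimal Python sketch. This is not STAR's source code: the function name `encode` and the specific numeric codes (A=0, C=1, G=2, T=3, everything else=4) are assumptions made for illustration only. It shows that a soft-masked sequence and its uppercase equivalent encode identically, while IUPAC codes such as R or Y collapse to the same code as N.

```python
# Illustrative only: this is not STAR's source code.
# The numeric codes below are an assumption made for this sketch.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
N_CODE = 4  # any character other than A/C/G/T, including IUPAC codes


def encode(seq: str) -> list[int]:
    """Convert a nucleotide string to numeric codes, ignoring case."""
    return [BASE_CODES.get(base, N_CODE) for base in seq.upper()]


soft_masked = "ACgtacGT"
unmasked = "ACGTACGT"
print(encode(soft_masked) == encode(unmasked))  # True
print(encode("ARYN"))  # [0, 4, 4, 4]
```

Under this scheme, soft masking carries no information into the alignment step, which matches the answer that lowercase and uppercase bases are equivalent.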

Zhu Zhuo

Sep 14, 2015, 4:31:58 PM

to rna-star, jgbra...@gmail.com

Hi Alex,

Continuing with this question, I'm wondering how STAR scores a base that is aligned to "N" in the genome. Will it be considered a match or a mismatch?

Thank you!

Zhu

Alexander Dobin

Sep 15, 2015, 6:43:00 PM

to rna-star, jgbra...@gmail.com

Hi Zhu,

A base mapped to N is considered neutral, neither a match nor a mismatch. Nothing is added to the alignment score, and it's not counted in the number of matches or mismatches.

Cheers

Alex
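
As a rough illustration of "neutral" scoring, here is a small Python sketch for an ungapped alignment. It is not STAR's implementation: the function `score_ungapped`, the score values `MATCH_SCORE = 1` and `MISMATCH_SCORE = -1`, and the choice to treat N in the read the same way as N in the genome are all assumptions made for this example.

```python
# Illustrative only: not STAR's implementation.
# MATCH_SCORE and MISMATCH_SCORE are assumed values for this sketch.
MATCH_SCORE = 1
MISMATCH_SCORE = -1


def score_ungapped(read: str, genome: str) -> dict[str, int]:
    """Score an ungapped, equal-length alignment; positions with N are neutral."""
    if len(read) != len(genome):
        raise ValueError("read and genome segment must have the same length")
    result = {"score": 0, "matches": 0, "mismatches": 0}
    for r, g in zip(read.upper(), genome.upper()):
        if r not in "ACGT" or g not in "ACGT":
            # Neutral position: adds nothing to the score or to either counter.
            # Treating a non-ACGT read base this way is an assumption.
            continue
        if r == g:
            result["score"] += MATCH_SCORE
            result["matches"] += 1
        else:
            result["score"] += MISMATCH_SCORE
            result["mismatches"] += 1
    return result


print(score_ungapped("ACGTAC", "ACNTAG"))
# {'score': 3, 'matches': 4, 'mismatches': 1}
```

In the example, the third position (G against N) contributes nothing, so six aligned bases yield four matches and one mismatch.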

David Soong

Mar 11, 2016, 2:27:30 PM

to rna-star, jgbra...@gmail.com

Hi Alex,

There are several regions in hg19 that contain long stretches of A's (or T's), and STAR tends to align all the long A reads in my sample to these regions, e.g.

chr6:160,521,742-160,521,853

chr12:66,451,367-66,451,458

chr15:77,910,857-77,910,956

Should I mask out those regions with N's or is there a way for STAR to handle this situation?

Thanks.

Alexander Dobin

Mar 14, 2016, 6:38:07 AM

to rna-star, jgbra...@gmail.com

Hi David,

Masking the genome with Ns is the way to go; there is no other way to handle it with STAR.
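
One possible way to do the masking is sketched below in Python. The helper `mask_regions` is hypothetical, and the sketch assumes the coordinates are 1-based and inclusive, as in the browser-style positions quoted in the question. It works on a single chromosome sequence held in memory; reading and writing the FASTA file is left out.

```python
def mask_regions(seq: str, regions: list[tuple[int, int]]) -> str:
    """Replace each 1-based, inclusive (start, end) interval of seq with N."""
    bases = list(seq)
    for start, end in regions:
        if start < 1 or end > len(bases) or start > end:
            raise ValueError(f"invalid region {start}-{end}")
        # Convert the 1-based inclusive interval to a Python slice.
        bases[start - 1:end] = "N" * (end - start + 1)
    return "".join(bases)


# Toy chromosome with a poly-A stretch at positions 5-12.
chrom = "ACGTAAAAAAAAGTCA"
print(mask_regions(chrom, [(5, 12)]))  # ACGTNNNNNNNNGTCA
```

After writing the masked sequences back to a FASTA file, the STAR genome index would need to be regenerated (`--runMode genomeGenerate`) so that the masked positions take effect.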