How does STAR handle repeat masked genomes?

Joshua Bradley

May 15, 2015, 1:18:59 PM
to rna-...@googlegroups.com
I've seen multiple posts on Biostar that say "never align to a repeat-masked genome", which makes sense. So my question is:

Does STAR recognize a genome that has soft-masked (lowercase) repeats, or is the sequence matching process completely case-insensitive?

Alexander Dobin

May 15, 2015, 5:07:24 PM
to rna-...@googlegroups.com, jgbra...@gmail.com
Hi Joshua,

Lower- and uppercase A, C, G, T, and all other characters are treated the same (they are actually converted to 0-5) for both the genome and read sequences.

Cheers
Alex

César Lizárraga

Aug 10, 2015, 10:39:15 AM
to rna-star, jgbra...@gmail.com
Wait, does that mean lowercase and uppercase A, C, G, T are all scored the same, and that Other (N) is scored the same as well? Or is N treated differently?

Thanks.

Cesar.

Alexander Dobin

Aug 10, 2015, 4:08:58 PM
to rna-star, jgbra...@gmail.com
Hi Cesar,

a=A, c=C, g=G, t=T for both the genome and read sequences.
All other characters (including the IUPAC codes that stand for two or three possible nucleotides) are converted into "N", i.e. unknown.

Cheers
Alex
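
To make the conversion described above concrete, here is a minimal Python sketch of a case-insensitive encoding. It is not STAR's code: the thread only states that characters are converted to 0-5, so the specific code assignments (`BASE_CODES`, `N_CODE`) and the function name `encode_sequence` are assumptions for illustration.

```python
# Hypothetical encoding table: the thread only says characters map to 0-5;
# the specific assignments below are an assumption for illustration.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
N_CODE = 4  # assumed code for "N" and every other character


def encode_sequence(seq: str) -> list[int]:
    """Convert a nucleotide string to numeric codes, ignoring case."""
    return [BASE_CODES.get(ch.upper(), N_CODE) for ch in seq]


# Soft-masked (lowercase) bases get the same codes as uppercase ones;
# "n" and IUPAC codes such as R and Y all collapse to the N code.
print(encode_sequence("ACgtnRyA"))
```

Under these assumptions a soft-masked genome and its uppercase equivalent encode identically, which is why soft-masking has no effect on alignment.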

Zhu Zhuo

Sep 14, 2015, 4:31:58 PM
to rna-star, jgbra...@gmail.com
Hi Alex,

Continuing with this question, I'm wondering how STAR scores a base that is aligned to "N" in the genome. Will it be considered a match or a mismatch?

Thank you!

Zhu

Alexander Dobin

Sep 15, 2015, 6:43:00 PM
to rna-star, jgbra...@gmail.com
Hi Zhu,

A base mapped to N is considered neutral, neither a match nor a mismatch. Nothing is added to the alignment score, and it's not counted in the number of matches or mismatches.

Cheers
Alex
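
The neutral treatment of N can be sketched as a toy ungapped scorer in Python. This is an illustration only, not STAR's implementation: the +1/-1 score values, the function name `score_ungapped`, and the choice to treat an N in the read the same way as an N in the genome are all assumptions.

```python
MATCH_SCORE = 1      # assumed value for illustration
MISMATCH_SCORE = -1  # assumed value for illustration


def score_ungapped(read: str, genome: str) -> tuple[int, int, int]:
    """Return (score, n_match, n_mismatch) for an ungapped alignment."""
    score = n_match = n_mismatch = 0
    for r, g in zip(read.upper(), genome.upper()):
        if r not in "ACGT" or g not in "ACGT":
            # Position involves N (or any other character): neutral,
            # so it changes neither the score nor the counts.
            continue
        if r == g:
            score += MATCH_SCORE
            n_match += 1
        else:
            score += MISMATCH_SCORE
            n_mismatch += 1
    return score, n_match, n_mismatch


# G aligned to N is skipped; the final A/T pair is the only mismatch.
print(score_ungapped("ACGTA", "ACNTT"))  # (2, 3, 1)
```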

David Soong

Mar 11, 2016, 2:27:30 PM
to rna-star, jgbra...@gmail.com
Hi Alex,

There are several regions in hg19 that contain long stretches of A's (or T's), and STAR tends to align all long A reads in my sample to these regions, e.g.

chr6:160,521,742-160,521,853
chr12:66,451,367-66,451,458
chr15:77,910,857-77,910,956

Should I mask out those regions with N's or is there a way for STAR to handle this situation?

Thanks.

Alexander Dobin

Mar 14, 2016, 6:38:07 AM
to rna-star, jgbra...@gmail.com
Hi David,

Masking the genome with Ns is the way to go; there is no other way to handle it with STAR.

Cheers
Alex
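
As a rough illustration of the masking suggested above, the following Python sketch replaces a region of an in-memory sequence with Ns. The function name `mask_region` is hypothetical, and the coordinates are assumed to be 1-based and inclusive, as in the genome-browser style regions quoted earlier in the thread. Reading and writing the FASTA file is left out.

```python
def mask_region(seq: str, start: int, end: int) -> str:
    """Replace 1-based, inclusive positions start..end of seq with N."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region lies outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[:start - 1] + "N" * (end - start + 1) + seq[end:]


# Toy example: mask the poly-A stretch at positions 3-8.
print(mask_region("GGAAAAAAGG", 3, 8))  # GGNNNNNNGG
```

After masking the FASTA this way, the STAR genome index has to be regenerated from the masked file before the change takes effect.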