Why does STAR limit alignments in soft-masked regions (lowercase letters in the reference genome)?

MaVi Ruiz

Aug 17, 2020, 11:39:21 AM

to rna-star

Dear All,

I have been trying to figure out the best parameters for the STAR aligner to find all possible alignments of an RNA read.

I noticed that STAR somehow limits alignments, or fails to align reads, in soft-masked regions, even though the missing positions would not require any mismatch or soft-clipping for STAR to produce the alignment.

I set the multimapping limit very high in order to get all the regions where the read can map, but that does not do the trick.

I would be glad to hear your suggestions!

Thanks in advance.

Alexander Dobin

Aug 17, 2020, 1:18:29 PM

to rna-star

Hi MaVi,

How many positions do you expect the read to align to?

In addition to --outFilterMultimapNmax, set --winAnchorMultimapNmax to at least that number.

Soft-masking (i.e. lower case bases) does not make a difference for STAR, it converts them all to ACGT.

Cheers

Alex

MaVi Ruiz

Aug 19, 2020, 6:33:57 PM

to rna-star

Hello Alex,

Thank you for your kind reply.

Yes, I found a post ("sequence that should map, but doesn't", from Eric Londin in May) where he had a question similar to mine.

I did what was suggested there, which is the same as you suggest here; however, STAR still doesn't map the reads to all the positions.

So, at this point I don't know what else to do.

Would you have any other suggestion?

I thank you in advance!

Alexander Dobin

Aug 21, 2020, 4:14:20 PM

to rna-star

Hi MaVi,

We will need to look at specific examples: reads that match the genome exactly at fewer than min(--outFilterMultimapNmax, --winAnchorMultimapNmax) loci but are not reported as mapped.
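One way to collect such examples is to count the exact matches of a read independently of STAR and compare the count with the two limits. The sketch below is only an illustration: the function names and the toy genome are made up here, it loads whole sequences into memory, and it assumes reads contain only A, C, G, T or N. It upper-cases the reference first, mirroring the point above that soft-masked bases are treated like any other base.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (assumes only A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def exact_loci(read, genome):
    """Return (chrom, 1-based start, strand) for every exact match of `read`.

    `genome` is a dict mapping a sequence name to its sequence (hypothetical
    in-memory representation; a real genome would be parsed from FASTA).
    """
    read = read.upper()
    queries = {"+": read}
    rc = reverse_complement(read)
    if rc != read:  # avoid double counting palindromic reads
        queries["-"] = rc
    hits = []
    for chrom, seq in genome.items():
        seq = seq.upper()  # lowercase (soft-masked) bases count like uppercase
        for strand, query in queries.items():
            pos = seq.find(query)
            while pos != -1:
                hits.append((chrom, pos + 1, strand))
                pos = seq.find(query, pos + 1)  # allow overlapping matches
    return hits


# Toy data, not real sequences.
genome = {"toy1": "ACGTacgtACGTTTGA", "toy2": "ggAAACGTcc"}
hits = exact_loci("ACGTTT", genome)

# Hypothetical limits, standing in for the values passed to STAR.
out_filter_multimap_nmax = 1000
win_anchor_multimap_nmax = 1000
should_be_reported = len(hits) < min(out_filter_multimap_nmax, win_anchor_multimap_nmax)
print(hits, should_be_reported)
```

Reads for which `should_be_reported` is true but which are missing from the STAR output would be the examples worth posting.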

Cheers

Alex

virgin...@gmail.com

Aug 26, 2020, 7:39:16 PM

to rna-star

Hi Alex,

Thanks once again for your answer.

I think I was able to figure out good parameters to get as many alignments as possible.

I combined --outFilterMultimapNmax X, --winAnchorMultimapNmax X and --seedSearchStartLmax 15 (since I have short reads, this last one was the key addition).

But now I'm trying to figure out what would be a good compromise between the number X and the mapping time.

If you have any suggestions I would greatly appreciate it.

Thanks again,

MaVi

Alexander Dobin

Aug 29, 2020, 12:37:30 PM

to rna-star

Hi MaVi,

indeed, increasing --winAnchorMultimapNmax will significantly increase mapping time.

A couple of suggestions:

I. You can increase this parameter iteratively, e.g.:

1. Map with the default of 50

2. Re-map the unmapped reads with 200

3. Re-map the unmapped reads with 500

...

until you reach your desired value.
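The rounds in suggestion I could be scripted roughly as follows. This is a sketch under assumptions, not a tested pipeline: the helper name and paths are hypothetical, it raises --outFilterMultimapNmax together with --winAnchorMultimapNmax (following the earlier advice to keep the two at least equal), and it relies on --outReadsUnmapped Fastx writing unmapped reads to `<prefix>Unmapped.out.mate1`. It is worth checking in your STAR version that reads rejected for mapping to too many loci actually end up in that file. The code only builds the command lines; it does not run STAR.

```python
import shlex


def iterative_star_commands(genome_dir, reads, out_root, limits=(50, 200, 500)):
    """Build one STAR command per round, feeding each round the previous
    round's unmapped reads. All paths are placeholders."""
    commands = []
    current_reads = reads
    for round_no, n_max in enumerate(limits, start=1):
        prefix = f"{out_root}/round{round_no}_"
        commands.append([
            "STAR",
            "--genomeDir", genome_dir,
            "--readFilesIn", current_reads,
            "--outFileNamePrefix", prefix,
            "--outFilterMultimapNmax", str(n_max),
            "--winAnchorMultimapNmax", str(n_max),
            "--outReadsUnmapped", "Fastx",
        ])
        # Single-end reads assumed: the next round reads this file.
        current_reads = prefix + "Unmapped.out.mate1"
    return commands


commands = iterative_star_commands("genome_index", "reads.fastq", "out")
for cmd in commands:
    print(shlex.join(cmd))
```

Each command has to finish before the next one starts, since round N+1 reads the file that round N writes.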

II. Mask the exact repeats in the genome whose length is greater than the read length. Mask all copies of each repeat with Ns except one. Then the reads will be aligned to only one copy of each repeat, but you can reconstruct all alignments since you know which masked loci each non-masked repeat corresponds to.
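A minimal sketch of the masking idea in suggestion II, assuming the groups of identical repeat copies have already been found by some other tool (that step is not shown, and the function names here are hypothetical). It uses 0-based, half-open coordinates, considers only same-strand copies, and only expands alignments that lie entirely inside the kept copy.

```python
def mask_repeat_copies(genome, repeat_groups):
    """Mask all but the first copy of each exact repeat with N.

    genome: dict of sequence name -> sequence string.
    repeat_groups: list of groups; each group is a list of
        (chrom, start, end) tuples describing identical copies.
    Returns the masked genome and a map kept_copy -> masked copies.
    """
    masked = {chrom: list(seq) for chrom, seq in genome.items()}
    kept_to_masked = {}
    for kept, *others in repeat_groups:
        kept_to_masked[kept] = others
        for chrom, start, end in others:
            masked[chrom][start:end] = "N" * (end - start)
    return {chrom: "".join(bases) for chrom, bases in masked.items()}, kept_to_masked


def expand_alignment(chrom, pos, read_len, kept_to_masked):
    """Project an alignment inside a kept copy onto its masked copies."""
    loci = [(chrom, pos)]
    for (k_chrom, k_start, k_end), others in kept_to_masked.items():
        if k_chrom == chrom and k_start <= pos and pos + read_len <= k_end:
            offset = pos - k_start
            loci.extend((o_chrom, o_start + offset) for o_chrom, o_start, _ in others)
    return loci


# Toy genome: CCCGGG occurs at [3, 9) and [13, 19).
genome = {"toy": "AAACCCGGGTTTACCCGGGA"}
groups = [[("toy", 3, 9), ("toy", 13, 19)]]
masked_genome, kept_to_masked = mask_repeat_copies(genome, groups)
all_loci = expand_alignment("toy", 4, 4, kept_to_masked)
print(masked_genome["toy"], all_loci)
```

Reads that overlap a repeat boundary are not handled here; they align to unique flanking sequence and would need to be treated separately.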

Cheers

Alex

virgin...@gmail.com

Sep 14, 2020, 3:46:29 PM

to rna-star

Hi Alex,

Thank you so much for your suggestions, they are very valuable!! I will definitely take them into account for my pipeline.

I would like to take this opportunity to ask you one other thing about the mapping. I have observed that for some reads STAR favors spliced alignments over a "complete" (contiguous) alignment at a single locus.

So, for instance, I have this read: ACCACCAGACCTGCCTTACAGGAGCTC, which should map to chr7:109553889-109553915, taking into consideration the SNP rs567326500 (T>G) at position chr7:109553909. However, with STAR I currently get this spliced alignment: chr7:109553889-109553902|110585860-110585872.

I get the same spliced alignment whether I set the seed length to 15 or to 24.

Do you have any clue why STAR does not align the whole read contiguously?

My parameters are the same as I mentioned before (anchor and multimap limits set high, plus --alignEndsType EndToEnd and dbSNP149_all.vcf); I only changed the seed length, as I thought it might be the cause of this observation.

Thank you again very much, in advance!!

virgin...@gmail.com

Sep 16, 2020, 9:58:20 AM

to rna-star

Screen Shot 2020-09-16 at 9.45.22 AM.png

Hello Alex,

I realized that a snapshot of the genome positions I talked about in my last post would probably be more useful. As you will see, if you take rs567326500 (T>G) into account, the read ACCACCAGACCTGCCTTACAGGAGCTC would align perfectly at that position of the genome. However, STAR split the read into ACCACCAGACCTGC and CTTACAGGAGCTC and found the spliced alignment. That is totally fine, but I am expecting the entire read to be aligned at chr7:109553889-109553915. Am I wrong to expect this? Should I change the parameters to prevent STAR from favouring spliced alignments?

Thank you very much again for your time and your advice!

Best,

MaVi

Alexander Dobin

Sep 17, 2020, 12:20:57 PM

to rna-star

Hi MaVi,

There are two reasons why the spliced alignment is favored.

1. The annotated spliced alignments by default get a higher score: --sjdbScore 2. If you want to remove this bias, you would need --sjdbScore 0.

2. However, if the spliced alignment does not have a mismatch, but the contiguous alignment has a mismatch with the reference allele at the SNP position, it will be counted as a mismatch, even if you provide the VCF file.

The VCF file only serves to check which alignments overlap variants, and how confidently they align to these loci (with WASP). Since this read has another good alignment, it won't pass the WASP test.
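To see how the two reasons combine for the 27-nt read discussed here, a rough back-of-the-envelope comparison can help. The function below is a deliberately simplified model and not STAR's actual scoring code: it assumes +1 per matched base, -1 per mismatched base, and an --sjdbScore bonus per annotated junction, and it ignores the genomic-length term and any gap penalties. It also assumes the junction in question is annotated.

```python
def simplified_score(matches, mismatches, annotated_junctions=0, sjdb_score=2):
    """Toy alignment score: +1 per match, -1 per mismatch, plus a bonus
    for each annotated splice junction crossed (simplified model)."""
    return matches - mismatches + annotated_junctions * sjdb_score


# Contiguous alignment: 26 matching bases and 1 mismatch at the SNP position.
contiguous = simplified_score(matches=26, mismatches=1)

# Spliced alignment: 14 + 13 = 27 matching bases across one junction.
spliced_default = simplified_score(27, 0, annotated_junctions=1, sjdb_score=2)
spliced_no_bonus = simplified_score(27, 0, annotated_junctions=1, sjdb_score=0)

print(contiguous, spliced_default, spliced_no_bonus)  # 25 29 27
```

Under these assumptions the spliced alignment still outscores the contiguous one with --sjdbScore 0, because the SNP position is penalized as a mismatch either way.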

Cheers

Alex

virgin...@gmail.com

Oct 12, 2020, 3:52:38 PM

to rna-star

Hello Alex,

Thank you very much for all your time and explanations.

Sorry to keep bothering you, but I have another question about spliced alignments: is there any way to prevent STAR from aligning reads across unannotated splice junctions?

I am really thankful for your kind support!

Regards,

MaVi

Alexander Dobin

Oct 14, 2020, 7:35:55 PM

to rna-star

Hi MaVi,

To prevent unannotated splices, you can set --alignSJoverhangMin to a large number (> read length).

Cheers

Alex

virgin...@gmail.com

Feb 12, 2021, 12:38:26 PM

to rna-star

(I am resending this message, which probably got lost because I sent it as a reply from my Gmail account.)

Dear Alexander,

First of all, I wish you a happy new year.

Thank you for all the advice you have given me so far to run STAR quickly (masking the repeated regions of the genome that I am currently working on). I would like to know whether you have other insights that could help me advance toward my goal of recovering as many positions as possible for short RNA reads (24-33 nt) in a short time.

Here are the parameters that I have configured so far to launch STAR:

--seedSearchStartLmax 15
--alignEndsType EndToEnd
--sjdbOverhang 32
--sjdbScore 2
--alignSJDBoverhangMin 1
--alignSJoverhangMin 1000 # I put this very high to prevent unknown alignments as you suggested
--winAnchorMultimapNmax 1000
--outFilterMultimapNmax 1000
--outFilterMatchNmin 25
--genomeConsensusFile # I use a vcf with the SNPs

Thank you again for all your help so far and thank you in advance for your time to read this.

Best,

MaVi

Alexander Dobin

Feb 18, 2021, 12:56:17 PM

to rna-star

Hi MaVi,

If you are not interested in short spliced fragments of mRNA, I would recommend prohibiting splicing: set --alignIntronMax 1 and do not use any --sjdb* parameters.

--outFilterMatchNmin 25 is probably too harsh, I would set it to the minimum mapped length you want to keep.

I would not use --genomeConsensusFile initially, this is still an experimental option. I have not tested it for small RNAs, but it would be interesting to see how much effect it has on alignments.

You may also want to set a hard limit on mismatches with --outFilterMismatchNmax 1 . I would not recommend allowing more than one mismatch for short RNAs. Even with 1MM it's not guaranteed that you will find all possible alignments with one mismatch.
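Put together, the recommendations in this reply could translate into an argument list along the following lines. This is a sketch under assumptions: the helper name and paths are hypothetical, the multimapper limits and seed setting are simply carried over from the parameter list earlier in the thread, and the minimum mapped length of 20 is a placeholder to be replaced by whatever length you want to keep. The code only builds the command; it does not run STAR.

```python
def star_unspliced_short_rna_args(genome_dir, reads, prefix,
                                  min_mapped_length, n_max=1000):
    """Build a STAR command with splicing prohibited and at most one
    mismatch. No --sjdb* options and no --genomeConsensusFile are passed."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", reads,
        "--outFileNamePrefix", prefix,
        "--alignEndsType", "EndToEnd",
        "--alignIntronMax", "1",            # prohibit spliced alignments
        "--outFilterMismatchNmax", "1",     # hard limit of one mismatch
        "--outFilterMatchNmin", str(min_mapped_length),
        "--outFilterMultimapNmax", str(n_max),
        "--winAnchorMultimapNmax", str(n_max),
        "--seedSearchStartLmax", "15",
    ]


# Placeholder paths and a placeholder minimum mapped length.
args = star_unspliced_short_rna_args("genome_index", "reads.fastq", "out/",
                                     min_mapped_length=20)
print(" ".join(args))
```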