STAR crashed with untemplated polyA at the 3' end

Bo Han

Aug 20, 2013, 11:21:29 PM
to rna-...@googlegroups.com
Hello, Dear Alex, 

I encountered a weird bug when I tried to use STAR to map PAS-seq reads (I expect STAR to mark the untemplated sequence as soft-clipped).
I ended up reproducing the error with a very simple sequence:

# genome FASTA (test.fa)

>chr1
TCAAATTATACTCTGAATACAGAATGGCATTTTCAGAATCAAACTTTAAT

# build index

STAR --runMode genomeGenerate --genomeDir STARindex/ --genomeFastaFiles test.fa

# it has no problem mapping the following fastq (individually)
STAR --runMode alignReads --genomeDir STARindex --readFilesIn test.fq 

@perfect
TTATACTCTGAATACAGAAT
+
IIIIIIIIIIIIIIIIIIII
@tail1
TTATACTCTGAATACAGAATA
+
IIIIIIIIIIIIIIIIIIIII
@tail2
TTATACTCTGAATACAGAATAA
+
IIIIIIIIIIIIIIIIIIIIII
@tail3
TTATACTCTGAATACAGAATAAA
+
IIIIIIIIIIIIIIIIIIIIIII
@tail4
TTATACTCTGAATACAGAATAAAA
+
IIIIIIIIIIIIIIIIIIIIIIII

# but it gives a segmentation fault when mapping

@tail5
TTATACTCTGAATACAGAATAAAAA
+
IIIIIIIIIIIIIIIIIIIIIIIII

# this error does NOT happen for polyC/G/T

@tail5G
TTATACTCTGAATACAGAATGGGGG
+
IIIIIIIIIIIIIIIIIIIIIIIII
@tail5C
TTATACTCTGAATACAGAATCCCCC
+
IIIIIIIIIIIIIIIIIIIIIIIII
@tail5T
TTATACTCTGAATACAGAATTTTTT
+
IIIIIIIIIIIIIIIIIIIIIIIII

# when I changed the last A to T/C/G, the crash went away and results are as expected.

@tail5T
TTATACTCTGAATACAGAATAAAAT
+
IIIIIIIIIIIIIIIIIIIIIIIII
@tail5C
TTATACTCTGAATACAGAATAAAAC
+
IIIIIIIIIIIIIIIIIIIIIIIII
@tail5G
TTATACTCTGAATACAGAATAAAAG
+
IIIIIIIIIIIIIIIIIIIIIIIII

tail5T 0 chr1 6 255 20M5S * 0 0 TTATACTCTGAATACAGAATAAAAT IIIIIIIIIIIIIIIIIIIIIIIII NH:i:1 HI:i:1 AS:i:19 nM:i:0
tail5C 0 chr1 6 255 20M5S * 0 0 TTATACTCTGAATACAGAATAAAAC IIIIIIIIIIIIIIIIIIIIIIIII NH:i:1 HI:i:1 AS:i:19 nM:i:0
tail5G 0 chr1 6 255 20M5S * 0 0 TTATACTCTGAATACAGAATAAAAG IIIIIIIIIIIIIIIIIIIIIIIII NH:i:1 HI:i:1 AS:i:19 nM:i:0
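
To check the SAM records above programmatically, here is a minimal sketch that sums the operation lengths in a CIGAR string. The function name is hypothetical, and the record is split on whitespace because the fields are space-separated as pasted here (real SAM files are tab-separated, which `split()` also handles).

```python
import re

def cigar_op_lengths(cigar):
    """Sum the lengths of each CIGAR operation, e.g. '20M5S' -> {'M': 20, 'S': 5}."""
    totals = {}
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals

# One of the records above; field 6 (index 5) is the CIGAR string.
record = ("tail5T 0 chr1 6 255 20M5S * 0 0 "
          "TTATACTCTGAATACAGAATAAAAT IIIIIIIIIIIIIIIIIIIIIIIII "
          "NH:i:1 HI:i:1 AS:i:19 nM:i:0")
ops = cigar_op_lengths(record.split()[5])
print(ops)  # 20 aligned bases (M) and 5 soft-clipped bases (S)
```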

# the error doesn't happen when the AAAAA is located at the 5' end

@tail5
AAAAATTATACTCTGAATACAGAAT
+
IIIIIIIIIIIIIIIIIIIIIIIII

I was able to reproduce this error in versions 2.3.0e and 2.3.1o, with both the self-compiled and the downloaded precompiled static binaries.

Thanks in advance, 
Bo

Alexander Dobin

Aug 22, 2013, 11:10:33 AM
to rna-...@googlegroups.com
Hi Bo,

I think this error is caused by the small genome problem (see, for example, this post).
Please try generating the genome with --genomeSAindexNbases 2, and hopefully the problem will go away.
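
For readers wondering where the value 2 comes from, here is a minimal sketch, assuming the rule of thumb that the STAR manual gives for small genomes: min(14, log2(GenomeLength)/2 - 1). The function name and the rounding to the nearest integer are illustrative assumptions, not part of STAR itself.

```python
import math

def suggested_sa_index_nbases(genome_length, cap=14):
    """Rule-of-thumb value for --genomeSAindexNbases (assumed formula, see above)."""
    value = math.log2(genome_length) / 2 - 1
    # Rounding to the nearest integer is an assumption of this sketch.
    return min(cap, round(value))

# The test genome in this thread is a single 50-base sequence.
print(suggested_sa_index_nbases(50))
```

For the 50-base test genome this gives 2, the value suggested above.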

The untemplated A-tails will indeed be represented as soft-clipping in the CIGAR string.

Cheers
Alex

Bo Han

Aug 22, 2013, 1:24:04 PM
to rna-...@googlegroups.com
Hi, Alex, 
Thanks for the prompt reply; this indeed fixed the problem. Sorry that I didn't notice the previous post.

I have another question: how is the penalty for soft-clipping calculated? My feeling is that one soft-clipped base has a lower penalty than a mismatch: when I set outFilterMultimapScoreRange to 1, different alignments of the same read are reported, and those alignments have different M and S lengths in their CIGAR strings.

Thanks!

Alexander Dobin

Aug 23, 2013, 8:39:04 AM
to rna-...@googlegroups.com
Hi Bo,

Soft-clipped bases contribute 0 to the alignment score, while matched and mismatched bases contribute +1 and -1, respectively. Thus, relative to a matched base, a soft-clipped base effectively costs 1 point and a mismatched base costs 2 points, which agrees with your observation.
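
The arithmetic can be sketched as follows. This is a simplified model of only the three per-base contributions described here, and the function and constant names are hypothetical. It is not STAR's full scoring: the records earlier in this thread report AS:i:19 for 20M5S with no mismatches, one less than the 20 this model gives, so STAR evidently applies at least one further term that is not modeled here.

```python
MATCH, MISMATCH, SOFTCLIP = 1, -1, 0  # per-base contributions as described above

def simple_score(n_match, n_mismatch, n_softclip):
    """Per-base score only; ignores any other terms STAR may add."""
    return n_match * MATCH + n_mismatch * MISMATCH + n_softclip * SOFTCLIP

# A 25-base read aligned three ways:
perfect = simple_score(25, 0, 0)     # all bases match
clipped = simple_score(24, 0, 1)     # last base soft-clipped
mismatched = simple_score(24, 1, 0)  # last base aligned as a mismatch

print(perfect - clipped)     # cost of soft-clipping one base
print(perfect - mismatched)  # cost of one mismatch
```

Under this model the clipped alignment outscores the mismatched one by 1, so with outFilterMultimapScoreRange set to 1 both can fall within the reported range.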

Cheers
Alex