Substitution instead of Mismatch

787 views
Skip to first unread message

Smitha Shankar

unread,
Feb 28, 2013, 4:32:00 PM2/28/13
to rna-...@googlegroups.com
Hi All, 

In STAR I find that the mismatches at end of sequence are called substitutions instead of mismatch, this results in nM=0, rather than nM=1
and some of these terminal substitutions are long, >20, Is there a way I can I can get these to be called as mismatches instead? 

Smitha

Smitha Shankar

unread,
Feb 28, 2013, 4:43:21 PM2/28/13
to rna-...@googlegroups.com
My paired end reads 2*50 bp

Alexander Dobin

unread,
Feb 28, 2013, 9:29:38 PM2/28/13
to rna-...@googlegroups.com
Hi Smitha,

if I understand it correctly, are you talking about the "Soft-clip" operation "S" in the CIGAR string?

STAR performs the so called local alignment of the read sequence to the genome, as opposed to the end-to-end (semi-global) alignment which is performed by many DNA aligners such as bowtie1. 
This means that STAR will try to maximize the alignment score by "extending" the alignment towards the end of the reads. However, it will not try to force the "full-length" read alignment from the first to the last base of the read sequence. The score is a sum of +1 for matches and -1 for mismatches.
Consider the following example:
A A A A C A A A C C A C A       read 
| | | | X | | | X X | X |
A A A A A A A A A A A A A       genome
+ + + + - + + + - - + - +       +/-1 for matches/mismatches
1 2 3 4 3 4 5 6 5 4 5 4 5       alignment score
              ^max score
             8M5S, nM=1
STAR will "extend" the alignment to include 8 first bases (where the alignment score reaches maximum), and will "Soft-clip" the remaining 5 bases.
The alignment length is 8, and the number of mismatches is 1 which considers just those in the aligned portion of the read.

So if you see something like 20S, it means that at least 10 bases out of 20 are mismatched.
This could be caused by many reasons: poor sequencing quality tails, adapter sequence, A-tails, 3'-end RNA modifications.
However, the most likely reason is that STAR could not find a good splicing accommodation for those 20b.
That's why I would not recommend counting the mismatches in the S-clipped portions of the read.

Cheers
Alex

Smitha Shankar

unread,
Mar 1, 2013, 8:10:09 PM3/1/13
to rna-...@googlegroups.com
Thank you for the detailed explanation Alex! 
It was very helpful. Although this concept works well in most cases, I woud like you to take a look at few reads below. These reads span known slice junctions. There were many reads supporting this junction, but I was wondering why these reads are being clipped when it can be split and mapped to the junction, please take a look:


TATCAAGACTGAGTTGATTTCTGTGTCTGAAGTACACCCTTCTAGACTTC

GSNAP :Cigar = 17M879N33M

STAR: Cigar = 17S33M

GAGGCTTTATGTGGAAGCAGCATTTGCTGGAGAGAGTAAACATGAGGTCG

GSNAP :Cigar = 46M787N4M

STAR: Cigar = 48M2S

CGTACCCTAACACGCACGTCCCAGGAGACGGCAGATGTCAAGGGGTCAGT

GSNAP :Cigar = 44M10111N6M

STAR: Cigar = 50M

Could this be problem with the way I have set the parameters? I have set --sjdbOverhang to 49 as my paired end reads are 50BP in length. 


Smitha 

Alexander Dobin

unread,
Mar 2, 2013, 1:04:49 PM3/2/13
to rna-...@googlegroups.com
Hi Smitha,

I have looked closely at these examples. In all these cases the spliced alignments would have mismatches very close to the splice junctions (check out the UCSC session). STAR considers such splices not trustworthy and chooses to not report them, especially if overhangs are short. I think it would make some sense to report these splices in cases they cross annotated junctions - I will need to make some changes in the code to allow for it. Can you make an estimate from your data how many reads we are dealing with here?

Thanks!
Alex

Smitha Shankar

unread,
Mar 5, 2013, 6:07:57 PM3/5/13
to rna-...@googlegroups.com
Hello Alex, thank you for working on this!
I looked through my data. There are about 50,000 reads. 
Is there any other information you would need? 

Smitha

Smitha Shankar

unread,
Mar 6, 2013, 7:33:44 PM3/6/13
to rna-...@googlegroups.com
Hi Alex, I looked through another sample. The issue I have mentioned below is more prominet. I compared the depth of the known splice junctions from GNSAP and STAR. GSNAP reports about 3667378 reads more than STAR across known splice junctions. In my observation , many of these reads had the soft clip where splicing would have been an option. 

Smitha

Alexander Dobin

unread,
Mar 6, 2013, 10:49:15 PM3/6/13
to rna-...@googlegroups.com
Hi Smitha,
This is a large number of reads and I need to check what is going on.
Could you please send me a chunk of you reads, say 1M, as well as run parameters for STAR and GSNAP?  
Thanks!
Alex
Message has been deleted

Smitha Shankar

unread,
Mar 8, 2013, 1:11:37 PM3/8/13
to rna-...@googlegroups.com
Hello, Alex. I am not able to attach the file in this chain here. I have mailed you the settings and a few reads as you requested. 

Smitha
Reply all
Reply to author
Forward
0 new messages