Re: Soft Clipping at end of the reads

951 views
Skip to first unread message

Smitha Shankar

unread,
Feb 27, 2013, 12:23:29 AM2/27/13
to rna-...@googlegroups.com
Oops! here is the attachment.

On Tuesday, February 26, 2013 9:16:29 PM UTC-8, Smitha Shankar wrote:
Hi Alex, 

I have been using STAR for sometime now and I have to see it runs very fast compared to all other tools I have used till now. 
I have a dataset of short reads 50 bp,I have alignment results from both GSNAP and STAR. 
I have many reads that are soft clipped at the ends instead of insertions and  then aligning the rest of the read. Dues to this I am not able to identify many known splice junction between two exons after aligning with STAR.
I have attached a snapshot of IGV viewer where in you can see the first alignment having many reads supporting the splice junction whereas STAR (second in the window) is not reporting any reads supporting the splice junctions due to soft clipping. 
Here is an example of a CIGAR string for one read with the above mentioned issue:
Read: TTCAAGGAAACCTTTCCAACACAGATGTCCAAGCTGCCAAGAACAAGCTG
STAR - CIGAR: 1S40M9S
GSNAP -CIGAR: 40M4303N10M

This is the case with many splice junctions. Can you advise on what parameter I could change to improve the output? 


Smitha
igv_snapshot.png

Alexander Dobin

unread,
Feb 27, 2013, 1:34:01 AM2/27/13
to rna-...@googlegroups.com
Hi Smitha,

capturing the splices with short overhangs (10b in your example) is a dangerous task since short sequences can in principle map to many loci making the alignment ambiguous. There are two ways to increase STAR's sensitivity to these short overhangs:

1. I highly recommend to run STAR with annotations. You would need to generate the genome with --sjdbGTFfile /path/to/gtf .
I used gencode annotations to map your read (you can download that genome here), and got exactly the junction you were looking for 1S39M4303N10M. Note that the first base of this read is mismatched so STAR reports it as soft-clipped 1S rather than a mismatch.
Of course, this approach will only allow for detecting short overhangs on annotated junctions.

2. Increase STAR's sensitivity with --seedSearchStartLmax 12. This allows to capture the same junction in your example without using annotations. So, in principle, this will allow detecting unannotated junctions with short overhangs. However, there is a trade-off here between sensitivity and precision. For simulated data in our paper (Fig. 2), when run without annotations, GSNAP indeed shows higher sensitivity than STAR (96% vs 90%), but at the same time much higher false positive rate (4% vs 1%).

Cheers
Alex

Smitha Shankar

unread,
Feb 27, 2013, 2:52:13 PM2/27/13
to rna-...@googlegroups.com
Thank you Alex! 
I have a few comments. For my analysis I have been using an annotation file with  --sjdbFileChrStartEnd (in the genome generation step) , I still get the soft clipping when I use this file. It is in this format: 
chr10 100008748         100010821   -
chr10 100010933   100011322   -
chr10 100011459         100012109   -
chr10 100011959         100015344  +
chr10 100012225         100013309   -

Could you spot any obvious mistakes I could be making? Also, modifying seedSearchStartLmax allows me to identify many junctions but like u said it also results in a high rate of false positives. For the time, I do not want to modify the seedSearchStartLmax. I tried downloading the file you have linked for annotation. It just times out everytime!

Smitha 

Alexander Dobin

unread,
Feb 27, 2013, 5:17:19 PM2/27/13
to rna-...@googlegroups.com
Hi Smitha,

I think the intron start coordinates are all shifted by one base, the correct line for the 1st one is:
chr10   100008749       100010821       -
If you look at the Log.final.out file, and there are very few annotated junctions ('Number of splices: Annotated (sjdb)' line), it means that something is wrong with the splice junction database.

With the latest version of STAR you can use annotations in the GTF format (--sjdbGTFfile /path/to/gtf at the genome generation step). For example, this is one from Gencode: ftp://ftp.sanger.ac.uk/pub/gencode/release_15/gencode.v15.annotation.gtf.gz

If you want to use GTF files from UCSC table browser, please let me know and I will explain how to fix it - they do not conform to GTF standards.

Cheers
Alex

Smitha Shankar

unread,
Feb 27, 2013, 6:10:06 PM2/27/13
to rna-...@googlegroups.com
Thank you Alex, I used the same annotation file as you and I am now able to account for all the junctions. Thank you for helping out!

Smitha
Reply all
Reply to author
Forward
0 new messages