Optimal parameters for chimeric split reads

Nicolas Stransky

Feb 16, 2013, 6:26:45 PM
to rna-...@googlegroups.com
Hi all,

I have a dataset of relatively short reads (48bp) in which I can definitely see a (known, real) fusion event between two genes after aligning with STAR.
The problem is that despite the high number of chimeric reads, there are no split reads supporting the fusion. However, I do see that some reads are soft clipped and have 29M19S, 30M18S and 31M17S CIGAR strings. I can manually align the soft-clipped sequences to the other gene, but what I'd really like to do is have STAR output them automatically too (like 29S19M, etc.). 
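
For reference, a CIGAR such as 29M19S means 29 aligned bases followed by 19 soft-clipped bases, and the complementary segment being asked for would read 29S19M. A minimal Python sketch of that bookkeeping follows; the helper names `parse_cigar` and `complementary_cigar` are made up for illustration and are not part of STAR.

```python
import re

# One CIGAR operation: a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return (length, op) tuples, e.g. '29M19S' -> [(29, 'M'), (19, 'S')]."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings that contain anything besides valid operations.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def complementary_cigar(cigar):
    """Swap M and S in a simple two-operation CIGAR (hypothetical helper)."""
    ops = parse_cigar(cigar)
    if len(ops) != 2 or sorted(op for _, op in ops) != ["M", "S"]:
        raise ValueError("expected exactly one M and one S operation")
    swap = {"M": "S", "S": "M"}
    return "".join(f"{length}{swap[op]}" for length, op in ops)


for cigar in ("29M19S", "30M18S", "31M17S"):
    print(cigar, "->", complementary_cigar(cigar))
```

This only handles the simple one-match, one-clip case seen in these reads; spliced or indel-containing alignments would need more care.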
I've tried --chimSegmentMin 17 to no avail. I've also tried to combine it with --chimScoreSeparation 15, thinking that it should allow the output of both 31M17S and 31S17M, given the score difference, but that didn't work either. 
What am I missing? 

Thanks!
Nico

Alexander Dobin

Feb 18, 2013, 12:36:20 AM
to rna-...@googlegroups.com
Hi Nicolas,

Are your reads paired-end 2x48?
It looks like I have a problem in Chimeric.out.sam for single-end reads; I have never tried it before. I am working to fix it.

I think the parameter that is limiting output in your case is 
--chimJunctionOverhangMin     20
    int>0: minimum overhang for a chimeric junction
By default, it requires 20 bases of read sequence on each side of a chimeric junction, while you are looking for 17-19 base pieces.
Still, it should have output the splits like 28/20,27/21,...,24/24.
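
The arithmetic behind those splits can be sketched in a few lines of Python. This is a simplified model that only applies the "minimum overhang on each side" rule to a 48-base read; it is not STAR's actual implementation, which also scores the alignments, and the function name `allowed_splits` is hypothetical.

```python
def allowed_splits(read_length, min_overhang):
    """List (left, right) segment lengths with at least min_overhang on each side."""
    return [
        (left, read_length - left)
        for left in range(min_overhang, read_length - min_overhang + 1)
    ]


# With an overhang of 20, a 48-base read can only be split between 20/28 and 28/20.
print(allowed_splits(48, 20))

# With an overhang of 15, splits such as 29/19, 30/18 and 31/17 become possible.
print((31, 17) in allowed_splits(48, 15))
```

Under this model, a 31/17 split is excluded at the default of 20 but admitted once the minimum is lowered to 17 or below.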
I would recommend trying --chimJunctionOverhangMin 15 and --chimSegmentMin 15.
--chimScoreSeparation deals with the "next best" chimeric alignment and will probably not affect the results much. It prevents output of chimeras which have the next best hit within --chimScoreSeparation value, so increasing it will actually reduce sensitivity.

From a different perspective, if you are looking to quantify known chimeric junctions rather than discover new ones (which I think is troublesome with short reads), the most sensitive approach would be to create artificial "chromosomes" made of the sequences of the full chimeric transcripts. For single-end reads you would not need full chimeric transcripts, just short sequences from the donor and acceptor sides of a chimeric junction concatenated (for instance, 45 + 45 for 1x48 reads). That would allow STAR to map reads directly to the chimeric "chromosomes".
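
A minimal Python sketch of building such a 45 + 45 artificial "chromosome" is shown here. It assumes you already have the donor sequence ending exactly at the junction and the acceptor sequence starting exactly at it, both in transcript orientation; the function names, the record name, and the placeholder sequences are hypothetical.

```python
def chimeric_reference(donor_seq, acceptor_seq, flank=45):
    """Concatenate the last `flank` donor bases with the first `flank` acceptor bases."""
    if len(donor_seq) < flank or len(acceptor_seq) < flank:
        raise ValueError("both sequences must be at least `flank` bases long")
    return donor_seq[-flank:] + acceptor_seq[:flank]


def to_fasta(name, seq, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Placeholder sequences only; real donor and acceptor sequences would go here.
donor = "A" * 100
acceptor = "C" * 100
record = to_fasta("chimera_donor_acceptor", chimeric_reference(donor, acceptor))
print(record, end="")
```

The resulting FASTA record would then be supplied alongside the regular genome FASTA files when the genome index is generated, so that reads spanning the junction can align to it contiguously.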

Please let me know if this works for you.
Cheers
Alex

Nicolas Stransky

Feb 18, 2013, 8:22:04 PM
to rna-...@googlegroups.com
Hi Alex

My reads are indeed paired end 2x48. 
The goal here was actually not to quantify a known chimeric junction in a sample; instead, I was just using it to benchmark my pipeline and see if I could recover most reads that I can manually identify. 
Here is an example with a read that STAR can map as 19S29M (I used --chimJunctionOverhangMin 15 --chimSegmentMin 15 as you suggested).
TGGTGCTTCCGGCGGTACAAGAGCCCACACCTGGGAAAGGACCTAAAG
The first 19 bases belong to ALK, and the last 29 bases belong to EML4.
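
To make the 19/29 split explicit, the read can be cut at the stated breakpoint with a short Python sketch. The helper name `split_at` is hypothetical, and the gene labels simply follow the description above.

```python
READ = "TGGTGCTTCCGGCGGTACAAGAGCCCACACCTGGGAAAGGACCTAAAG"


def split_at(read, left_length):
    """Split a read into the first `left_length` bases and the remainder."""
    if not 0 < left_length < len(read):
        raise ValueError("breakpoint must fall inside the read")
    return read[:left_length], read[left_length:]


alk_part, eml4_part = split_at(READ, 19)
print(len(READ), "bases in total")
print("ALK  (first 19):", alk_part)
print("EML4 (last 29): ", eml4_part)
```

Each piece can then be aligned on its own to check where it lands.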

Is there anything else I could try?

Thanks!!