Include chimeric reads in the main Aligned.out.sam file

mbourgey

Sep 25, 2014, 10:23:43 AM

to rna-...@googlegroups.com

Hi Alex,

First of all, thanks a lot for your impressive work!

I was running STAR (with chimeric detection turned on) on a sample with a known fusion. I found it strange that some read pairs appear unaligned in Aligned.out.sam but aligned in Chimeric.out.sam, while for other pairs one read is in Aligned.out.sam and both reads are in Chimeric.out.sam. It is a little confusing to me.
So here are my questions:
What is the rule that decides whether reads are included in one file or in both?
Do you think merging chimeric and non-chimeric alignments into a single SAM file is a good idea?

Thank you in advance

Mathieu

Alexander Dobin

Sep 30, 2014, 11:31:50 AM

to rna-...@googlegroups.com

Hi Mathieu,

I think the reads that appear as both chimeric and non-chimeric can be explained as follows.

STAR will output a non-chimeric alignment into Aligned.out.sam with a portion of the read soft-clipped. If this portion is long enough, and it maps well and uniquely somewhere else in the genome, there will also be a chimeric alignment output into Chimeric.out.sam.

Imagine that you have a PE read where the second mate can be split chimerically into 70 and 30 bases. The 100 bases of the first mate plus 70 bases of the second mate map non-chimerically, and the mapped length/score is large enough, so they will be output into the Aligned.out.sam file. At the same time, the chimeric segments (100 bases of mate 1 + 70 bases of mate 2, and 30 bases of mate 2) will be output into Chimeric.out.sam.
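
To make the 70/30 split concrete, here is a minimal Python sketch that reads the soft-clip lengths off a CIGAR string. The CIGAR strings and the function name `soft_clipped_bases` are hypothetical illustrations, not actual STAR output or part of STAR.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clipped_bases(cigar):
    """Return (leading, trailing) soft-clip lengths of a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips; ignore them here.
    ops = [(n, op) for n, op in ops if op != "H"]
    leading = ops[0][0] if ops and ops[0][1] == "S" else 0
    trailing = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return leading, trailing

# Hypothetical CIGARs for a 100-base second mate:
print(soft_clipped_bases("70M30S"))  # (0, 30): 70 bases aligned, 30 clipped
print(soft_clipped_bases("100M"))    # (0, 0): no clipping
```

A soft-clipped tail of 30 bases is the kind of segment that, if it maps uniquely elsewhere, would show up as the second chimeric piece.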

This is the reason I keep the normal and chimeric alignments separated.

I am working on the new formatting of chimeric output, consistent with the latest SAM recommendations - this will allow output of chimeric and non-chimeric alignments in one file.

Cheers

Alex

Zhongwu Lai

Feb 14, 2015, 11:18:33 PM

to rna-...@googlegroups.com

Hi Alex,

Thanks for the impressive STAR aligner.

I have a slightly different question. First, I like that the alignments keep soft-clipped reads, which allows fusion detection. My question is whether you have plans to output discordant pairs, where both mates can be mapped non-chimerically, but to different genes, or even different chromosomes? Another question is how to handle pairs where one read is mapped uniquely, while the other is not mappable? So far, STAR does not seem to allow singletons, but they are typically indicative of fusions or large insertions.

Tophat handles discordant mates well by allowing mates to be mapped to different locations but can't handle soft-clipping, while STAR handles chimeric reads well by soft-clipping but can't handle discordant mates. I would really like to have an aligner that can handle both. That would make fusion detection much easier and more confident.

Many thanks,

Zhongwu

Alexander Dobin

Feb 20, 2015, 12:39:03 AM

to rna-...@googlegroups.com

Hi Zhongwu,

STAR will output discordant pairs into a separate file Chimeric.out.sam if you switch on chimeric detection with

--chimSegmentMin N (with N>0).

STAR will output single-end alignments if you reduce the minimum mapped length/score requirements to <0.5, e.g.

--outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4
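
As an illustration of how these options fit into a full invocation, the following Python sketch assembles a STAR argument list. The helper name `star_command`, the paths, and the value 25 are placeholders chosen for this example; `--genomeDir` and `--readFilesIn` are standard STAR options that are not mentioned in the thread itself.

```python
import shlex

def star_command(genome_dir, mate1, mate2,
                 chim_segment_min=None, min_over_lread=None):
    """Build a STAR argument list; optional features are added only if set."""
    cmd = ["STAR", "--genomeDir", genome_dir, "--readFilesIn", mate1, mate2]
    if chim_segment_min is not None:
        # Any value > 0 switches on chimeric detection.
        cmd += ["--chimSegmentMin", str(chim_segment_min)]
    if min_over_lread is not None:
        # Values below 0.5 let single-end alignments of a pair pass the filters.
        cmd += ["--outFilterScoreMinOverLread", str(min_over_lread),
                "--outFilterMatchNminOverLread", str(min_over_lread)]
    return cmd

# Placeholder paths; 25 is an arbitrary positive segment length.
cmd = star_command("/path/to/genome", "reads_1.fastq", "reads_2.fastq",
                   chim_segment_min=25, min_over_lread=0.4)
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths point at a real genome index and FASTQ files.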

Cheers

Alex

Zhongwu Lai

Feb 21, 2015, 11:02:39 PM

to rna-...@googlegroups.com

Hi Alex,

Many thanks for the tip. That's very helpful. I have the same request as Mathieu to combine the two in a single file, and set the insert size to non-zero if they're on the same chromosome. Looking forward to your updates.

Also, I'm curious why the second mate below was marked as a secondary alignment?

read1 73 chr2 29445400 3 48M * 0 0 GGGGGCTTGGGTCGTTGGGCATTCCGGACACCTGGCCTTCATACACCT @CCFFFFFHHHDFHHIJJIIGIIIIIIBGHIJGIJJJJJGIIJJJJJG NH:i:2 HI:i:1 AS:i:47 NM:i:0 nM:i:0

read1 393 chr2 42488395 3 40M1883N8M * 0 0 GACAAACTCCAGAAAGCAAGAATGCTACTCCCACCAAAAGCATAAAAC @@@DDFFFHFHHHJJJJJJIGGHHIJGGIIJIJJIJGGIJGJJIHDDG NH:i:2 HI:i:2 AS:i:47 NM:i:0 nM:i:0 XS:A:+

I used --chimSegmentMin 25 --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4, as you recommended. Both mates are in Aligned.out.sam as singletons, and not in Chimeric.out.sam. The second read can be mapped uniquely, to a unique region.

Thanks again,

Zhongwu

Alexander Dobin

Feb 25, 2015, 5:46:06 PM

to rna-...@googlegroups.com

Hi Zhongwu,

If you use --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4, STAR finds two single-end alignments of this paired-end read. Both pass the filters and are considered "multi-mapping" alignments of the same read, hence the non-primary flag for one of them, and NH:i:2. At the same time, these alignments are no longer considered chimeric, and are not output into Chimeric.out.sam.
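
The flags 73 and 393 from the two records quoted in the question can be decoded with a few lines of Python. This is a small sketch based on the bit definitions in the SAM specification; the short bit names and the function `decode_flag` are labels chosen for this example.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]

print(decode_flag(73))   # ['paired', 'mate_unmapped', 'first_in_pair']
print(decode_flag(393))  # ['paired', 'mate_unmapped', 'second_in_pair', 'secondary']
```

Both records carry the mate-unmapped bit, i.e. each is reported as a single-end alignment, and only the second one carries the secondary (non-primary) bit.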

If you do not use --outFilterScoreMinOverLread 0.4 --outFilterMatchNminOverLread 0.4, these alignments will be output into Chimeric.out.sam:

1 65 chr2 29445400 3 48M chr2 42488395 0 GGGGGCTTGGGTCGTTGGGCATTCCGGACACCTGGCCTTCATACACCT * NH:i:2 HI:i:1 AS:i:47 nM:i:0

1 129 chr2 42488395 3 40M1883N8M chr2 29445400 0 GACAAACTCCAGAAAGCAAGAATGCTACTCCCACCAAAAGCATAAAAC * NH:i:2 HI:i:2 AS:i:47 nM:i:0

Note that since there is a small junction overhang of 8b in one of the chimeric pieces, you need to use a small value of --chimJunctionOverhangMin, say 5.

Cheers

Alex

Zhongwu Lai

Feb 27, 2015, 2:17:22 PM

to rna-...@googlegroups.com

Thank you, Alex. I did notice NH:i:2 and was wondering why. My question is: is this the desired behavior? My assumption is that if two chimeric mates can each be uniquely mapped, they should be reported as chimeric, not as singletons, even if singleton mapping is turned on. Singletons are for pairs where one mate can be mapped uniquely and the other can't be aligned anywhere that satisfies the user-specified criteria. Given the speed of STAR, it would be great to just use STAR, instead of combining alignments from other aligners. I'm currently developing a variant caller (VarDict) that I hope can do fusion calling from STAR alignments, in addition to WGS. My goal is for VarDict to handle SNPs, MNPs, indels, and structural variants in a single pass. So it would be very helpful if STAR could perform both chimeric and singleton alignment and output them in a single BAM file. Currently, I have to write a script to combine them so they can be visualized.
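
For readers wondering what such a combining script might look like, here is a minimal Python sketch. It is not the script referred to above, and the function name `merge_sam` is made up for this example. It assumes both SAM files come from the same STAR run, so the header of the first file is valid for both, and it makes no attempt to sort the output or to reconcile reads that appear in both files.

```python
def merge_sam(aligned_path, chimeric_path, out_path):
    """Write the header of the first SAM file, then the records of both.

    Returns the number of alignment records written.
    """
    n_records = 0
    with open(out_path, "w") as out:
        for i, path in enumerate((aligned_path, chimeric_path)):
            with open(path) as sam:
                for line in sam:
                    if line.startswith("@"):
                        # Header line: keep only the first file's header.
                        if i == 0:
                            out.write(line)
                        continue
                    out.write(line)
                    n_records += 1
    return n_records

# Example call with STAR's output file names (assumed to be in the
# current directory):
# merge_sam("Aligned.out.sam", "Chimeric.out.sam", "Merged.out.sam")
```

The merged file is unsorted, so it would still need to be coordinate-sorted and indexed (for example with samtools) before most viewers will load it.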