Re: Chimeric alignments

2,923 views
Skip to first unread message

Alexander Dobin

unread,
Dec 3, 2012, 11:04:53 PM12/3/12
to rna-...@googlegroups.com
Hi Nicolas,

Thanks for trying out STAR!

In my terminology, each chimeric alignment consists of two "segments". Each segment is actually non-chimeric on its own, but the segments are chimeric to each other (i.e. the segments belong to different chromosomes, or different strands, or are far from each other).

Note that both segments may contain splice junctions, and one of the segments may contain portions of both mates.

The --chimSegmentMin parameter controls the minimum mapped length of the two segments that is allowed.
For example, if you have 2x75 reads and used --chimSegmentMin 20, a chimeric alignment with 130b on one chromosome and 20b on the other will be output, while 135 + 15 won't be.

Setting any number >0 for -chimSegmentMin should generate chimeric output. I just checked the latest release and it worked fine for my data, so I would need a sample of your data to see what the problem is. Could you please send me your genome, and a few thousand reads that replicate this bug?

Cheers
Alex


On Monday, December 3, 2012 7:21:31 PM UTC-5, Nicolas Stransky wrote:
Hi!
STAR is really impressive!
I'm curious about the meaning of --chimSegmentMin and how to turn on chimeric output.
What is the "length of chimeric segment"? Is this referring to the length of a chimeric pair when both reads would align to the same chromosome?
In any case, I've tried tried many different values and I invariably get this when chimSegmentMin is not 0:
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check
Any idea how to fix this and turn on chimeric output?
Thanks!
Nico

Nicolas Stransky

unread,
Dec 4, 2012, 12:39:30 AM12/4/12
to rna-...@googlegroups.com
Hi Alex, and thanks for the explanation!

Regarding the error, it's actually irrespective of the data, and I have tried with different input dataset with the same effect.
I have isolated the problem to a combination of using chimSegmentMin > 0 and hg19_Gencode14.overhang75. It's only when I use this genome that the error arises, not when I use hg19 (both downloaded from the ftp). With hg19_Gencode14.overhang75, the error occurs immediately after the genome has finished loading. To make sure this is not a problem with a corrupted file, I have checked the genome md5sum again, re-extracted the tar, cleared the shared memory, but the error remains the same.

Nicolas

Alexander Dobin

unread,
Dec 4, 2012, 10:55:28 PM12/4/12
to rna-...@googlegroups.com
Hi Nicolas,

thanks a lot for detailed reporting, there was indeed a problem with running chimeric detection with sjdb-genomes. 
I fixed the bug - could you please try the patched version here:

Cheers
Alex

Nicolas Stransky

unread,
Dec 5, 2012, 9:57:30 AM12/5/12
to Alexander Dobin, rna-...@googlegroups.com
Works great now, thanks a lot!

I did notice something else: in the "Chimeric.out.sam" file, the sam flags are all "mate unmapped", whereas it is not the case, i.e the mate is there and it is mapped. Was there a reason for it?

Thanks!
Nico

Alexander Dobin

unread,
Dec 5, 2012, 12:35:55 PM12/5/12
to rna-...@googlegroups.com, Alexander Dobin
Hi Nicolas,

I personally think that SAM/BAM format it's not very specific when it comes to describing a chimeric junction.
My general rule for both Chimeric.out.sam and Aligned.out.sam outputs is that the mate is considered unmapped if it's not mapped at all OR it's mapped discordantly (chimerically).

Consider the following read that maps to a BCR-ABL junction;
SINATRA_0006:3:3:6387:5665#0    105     chr22   23632554        3       47M29S  *       0       0       
SINATRA_0006:3:3:6387:5665#0    99      chr9    133729451       3       47S29M  =       133729520       145     
SINATRA_0006:3:3:6387:5665#0    147     chr9    133729520       3       76M     =       133729451       -145    
The first two lines are from the first mate split at the chimeric junction. The 3rd line is from the second mate, and it makes a "concordant" PE alignment with the second portion of the 1st mate.
STAR sets the FLAG bits in the following way:
0x2       0x8
0          1
1          0
1          0
My logic was as follows. The 0x2 flag shows that both mates contribute to a proper PE alignments on chr9 (last two lines), while the first line on chr22 is "non-concordant" alignment.
The 0x8 flag shows that for the 1st line the mate is not concordantly paired, while for 2nd and 3rd lines the mates are concordantly paired.
In other words, when setting the "mate unmapped" bit 0x8, STAR considers the mate unmapped if it's mapped discordantly (chimerically).

Cheers
Alex

Nicolas Stransky

unread,
Dec 5, 2012, 12:49:19 PM12/5/12
to Alexander Dobin, rna-star
Hi Alex,

Thanks for the explanation (and thanks again for fixing STAR so
quickly, it's terrific!).

That is indeed one way to interpret the SAM format. But to be fair, if
we stick to the strict meaning of the flag descriptions, both reads in
the pair map somewhere, so in that sense, they shouldn't be flagged as
0x8 "mate unmapped". I agree that the mapping could be discordant, but
it is still mapped, and that is how picard would see it when fixing
the flags. Another implication of setting 0x8 is that you have no way
to get to the mate in coordinate-sorted bams (or even know that it
exists), whereas it is indeed there.

Nico

Alexander Dobin

unread,
Dec 5, 2012, 2:56:20 PM12/5/12
to rna-...@googlegroups.com, Alexander Dobin
Hi Nicolas,

I do not have a particularly strong feeling about this format issue, so I can change it to agree with Picard's interpretation.
What is Picard doing for the 0x2 flag? Again, in my example, I could set it for all three lines, since both mates were actually mapped.

There is another field that's ambiguous - the RNEXT (col 7). I use "*" for chimeric mates, however, it's supposed to be the "next segment" reference. In my example should I rather have the following column 7:
SINATRA_0006:3:3:6387:5665#0    105     chr22   23632554        3       47M29S  chr9       0       0       
SINATRA_0006:3:3:6387:5665#0    99      chr9    133729451       3       47S29M  =       133729520       145     
SINATRA_0006:3:3:6387:5665#0    147     chr9    133729520       3       76M     chr22       133729451       -145    

Cheers
Alex

Alexander Dobin

unread,
Dec 5, 2012, 5:01:47 PM12/5/12
to rna-...@googlegroups.com, Alexander Dobin
One more important message. This information will be in the manual in the next release.

In addition to 'Chimeric.out.sam', STAR will generate 'Chimeric.out.junction' file which I think is a bit easier for further analysis.
The format of this file is as follows. Every line contains one chimerically aligned read.

chr22   23632601        +       chr9    133729450       +       1       0       0       SINATRA_0006:3:3:6387:5665#0    23632554   47M29S  133729451       47S29M40p76M

The first 9 columns give information about the chimeric junction:
1: chromosome of donor
2: first base of the intron of the donor
3: strand of the donor
4: chromosome of the acceptor
5: first base of the intron of the acceptor
6: strand of the acceptor
7: junction type: -1=junction is between the mates, 1=GT/AG, 2=CT/AC
8: repeat length to the left of the junction
9: repeat length to the right of the junction

Columns 10-14 describe the alignments of the two chimeric segments, it is SAM like. Alignments are given with respect to the + strand
10: read name
11: first base of the first segment (on the + strand)
12: CIGAR of the first segment
13: first base of the second segment
14: CIGAR of the second segment

Note, that unlike standard SAM, both mates are recorded in one line here. The gap of length L  between the mates is marked by the “Lp” in the CIGAR string.
If the mates overlap, L<0. I believe this format is convenient since all the information is provided on one line.

For strand definitions, when aligning paired end reads, the sequence of the second mate is reverse complemented.

To filter chimeric junctions and find the number of reads supporting each junction you could use, for example:
cat Chimeric.out.junction | awk '$1!="chrM" && $4!="chrM" && $7>0 && $8+$9<=5 {print $1,$2,$3,$4,$5,$6,$7,$8,$9}' | sort | uniq -c | sort -k1,1rn
This will keep only the canonical junctions with the repeat length less than 5 and will remove chimeras with mitochondrion genome.

When I do it for one of our K562 runs, I get:
    181 chr1    144676873       -       chr1    147917466       +       1       0       1
     29 chr5    69515744        -       chr5    34182973        -       1       3       1
     28 chr1    143910077       -       chr1    149459550       -       1       1       0
     27 chr22   23632601        +       chr9    133729450       +       1       0       0
     20 chr12   90313405        -       chr21   40684813        -       1       2       0
     20 chr22   23632601        +       chr9    133655755       +       1       0       1
     20 chr9    123636256       -       chr9    123578959       +       1       1       4
     15 chr16   85589970        +       chr6    16762582        +       1       3       2
     15 chr3    197348574       -       chr3    195392936       +       1       1       0
     14 chr18   39584506        +       chr18   39560613        -       1       2       0
Note that line 4 and 6 here are BCR/ABL fusions. You would need to filter these junctions further to see which of them connect known but not homologous genes. 

Nicolas Stransky

unread,
Dec 5, 2012, 6:02:54 PM12/5/12
to rna-...@googlegroups.com
Hi Alex,

I believe that if the mate is unmapped, RNEXT is supposed to be "*",
not "chr9". Or else, if you specify MRNM, you need to specify MPOS too
and the mate is then mapped.

Also, is it really the case that the mate is unmapped? The mate is
marked as mapping to chr9 133729520 and has the mapped flag. So I
think that one could argue that both segments from the first read
should be "mapped in proper pair".

Actually what picard does for your example, is that it marks the reads as follows:
SINATRA_0006:3:3:6387:5665#0    67      chr9    133729451       3       47M29S  chr22   23632554        0
SINATRA_0006:3:3:6387:5665#0    147     chr9    133729520       3       76M     chr22   133729451       -145
SINATRA_0006:3:3:6387:5665#0    65      chr22   23632554        3       47S29M  chr9    133729451       0

So the chimeric read is not unmapped, but it doesn't have the flag "mapped in proper pair" either.

Nico

Nicolas Stransky

unread,
Dec 5, 2012, 6:05:01 PM12/5/12
to rna-...@googlegroups.com, Alexander Dobin
This output looks extremely useful and a great addition to STAR! Thanks Alex! Looking forward to the next release.

Best,
    Nico

Alexander Dobin

unread,
Dec 6, 2012, 9:48:49 AM12/6/12
to rna-...@googlegroups.com
Hi Nicolas,

thanks for the explanations!
I can now see the logic of it. 
For the next release (hopefully, within a couple of days)I will make the changes in the SAM output to conform with Picard's conventions.

Cheers
Alex

Alexander Dobin

unread,
Dec 7, 2012, 1:33:11 PM12/7/12
to rna-...@googlegroups.com
Hi Nicolas,

I guess I rushed to say that I understand the logic of Picard chimeric conventions.
The good thing is that Alignned.out.sam files (non-chimeric reads) passed the Picard validation without any errors.

I was trying to get Picard to check my chimeric .sam file, and I am still having problems.
I start with my old example:
SINATRA_0006:3:3:6387:5665#0    105     chr22   23632554        3       47M29S  *       0       0       
SINATRA_0006:3:3:6387:5665#0    99      chr9    133729451       3       47S29M  =       133729520       145     
SINATRA_0006:3:3:6387:5665#0    147     chr9    133729520       3       76M     =       133729451       -145    

The I fixed the mates with FixMateInformation.jar which gave me:
SINATRA_0006:3:3:6387:5665#0    65      chr22   23632554        3       47M29S  chr9    133729451       0
SINATRA_0006:3:3:6387:5665#0    67      chr9    133729451       3       47S29M  chr22   23632554        0
SINATRA_0006:3:3:6387:5665#0    147     chr9    133729520       3       76M     =       133729451       -145

This basically agrees with your e-mail except for ordering. 
One thing I do not understand here is why the "next segment" of the Record2 is marked as chr22-23632554, i.e. Record1, while both Record1 and Record2 belong to the same mate.

For this transformed lines I run ValidateSamFile.jar and still get the following errors:

ERROR: Record 1, Read name SINATRA_0006:3:3:6387:5665#0, Both mates are marked as first of pair
ERROR: Read name SINATRA_0006:3:3:6387:5665#0, Mate not found for paired read

I would welcome any suggestions.
Cheers
Alex

Nicolas Stransky

unread,
Dec 7, 2012, 1:53:59 PM12/7/12
to Alexander Dobin, rna-star
Hi Alex,

I was trying to figure out the same thing yesterday! I agree this is confusing and basically I think it's worth taking it to the picard mailing list, because in this case, picard does not even validate its own output (it's true for SamToFastq.jar as well: picard won't process the chimeric .sam file even after FixMateInformation! The ability to return to FastQ files is a must-have).

Thanks so much for trying to make STAR play nicely with other tools.
Nico

--
Nicolas

Alexander Dobin

unread,
Dec 7, 2012, 4:52:02 PM12/7/12
to rna-...@googlegroups.com, Alexander Dobin
Hi Nicolas,

I posted the question at samtools-help, let us see if they have a good solution.

I have found one way to trick Picard into validating the files: if one of the chimeric pieces is marked as secondary alignment (0x100), Picard will not generate errors and SamToFastq works fine (see the example below). However, Picard does not check the mate of the secondary alignment, I can put any number there, so this might screw-up some other tools.

Cheers
Alex

@SQ     SN:chr9 LN:141213431
@SQ     SN:chr22        LN:51304566
SINATRA_0006:3:3:6387:5665#0    321     chr22   23632554        3       47M29S  chr22   133729520       0       TCTGAATGTCATCGTCCACTCAGCCACTTGATTTAAGCATAGTTCAAAAGCCCTTCAGCGGCCAGTAGCATCTGAC    eghchhfehhhhhchhhgghhghghhchhhhhfhhghffhhghfhghheefafahhhahca_f]cbfffabdcfdf
SINATRA_0006:3:3:6387:5665#0    65      chr9    133729451       3       47S29M  =       133729520       145     TCTGAATGTCATCGTCCACTCAGCCACTTGATTTAAGCATAGTTCAAAAGCCCTTCAGCGGCCAGTAGCATCTGAC    eghchhfehhhhhchhhgghhghghhchhhhhfhhghffhhghfhghheefafahhhahca_f]cbfffabdcfdf
SINATRA_0006:3:3:6387:5665#0    129     chr9    133729520       3       76M     =       133729451       -145    CCAAGGAAAACCTTCTCGCTGGACCCAGTGAAAATGACCCCAACCTTTTCGTTGCACTGTATGATTTTGTGGCAAG    ]caaa_aaa\agfahfffd_dfafcfccchchgghedcfafhdghhhhhfhhhfhhgghhghhhhhghhgdfdfff
Reply all
Reply to author
Forward
0 new messages