One more important message. This information will be in the manual in the next release.
The format of the Chimeric.out.junction file is as follows. Every line contains one chimerically aligned read, for example:
chr22 23632601 + chr9 133729450 + 1 0 0 SINATRA_0006:3:3:6387:5665#0 23632554 47M29S 133729451 47S29M40p76M
The first 9 columns give information about the chimeric junction:
1: chromosome of donor
2: first base of the intron of the donor
3: strand of the donor
4: chromosome of the acceptor
5: first base of the intron of the acceptor
6: strand of the acceptor
7: junction type: -1=junction is between the mates, 1=GT/AG, 2=CT/AC
8: repeat length to the left of the junction
9: repeat length to the right of the junction
Columns 10-14 describe the alignments of the two chimeric segments in a SAM-like format. Alignments are given with respect to the + strand:
10: read name
11: first base of the first segment (on the + strand)
12: CIGAR of the first segment
13: first base of the second segment
14: CIGAR of the second segment
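The column layout above can be sketched as a small parser. This is only an illustration, not part of the aligner: the names `ChimericRead` and `parse_chimeric_line` are hypothetical, and the code assumes whitespace-separated fields and ignores anything after column 14.

```python
from typing import NamedTuple


class ChimericRead(NamedTuple):
    # Columns 1-9: the chimeric junction
    donor_chr: str
    donor_pos: int          # first base of the intron of the donor
    donor_strand: str
    acceptor_chr: str
    acceptor_pos: int       # first base of the intron of the acceptor
    acceptor_strand: str
    junction_type: int      # -1 = between the mates, 1 = GT/AG, 2 = CT/AC
    repeat_left: int
    repeat_right: int
    # Columns 10-14: the two aligned segments (SAM-like)
    read_name: str
    seg1_start: int
    seg1_cigar: str
    seg2_start: int
    seg2_cigar: str


def parse_chimeric_line(line: str) -> ChimericRead:
    """Split one line into named fields (hypothetical helper)."""
    f = line.split()  # assumption: fields are separated by spaces or tabs
    if len(f) < 14:
        raise ValueError(f"expected at least 14 columns, got {len(f)}")
    return ChimericRead(
        f[0], int(f[1]), f[2],
        f[3], int(f[4]), f[5],
        int(f[6]), int(f[7]), int(f[8]),
        f[9], int(f[10]), f[11], int(f[12]), f[13],
    )


# The example line from this message
EXAMPLE = ("chr22 23632601 + chr9 133729450 + 1 0 0 "
           "SINATRA_0006:3:3:6387:5665#0 23632554 47M29S "
           "133729451 47S29M40p76M")
```

Calling `parse_chimeric_line(EXAMPLE)` gives a record whose first segment starts at 23632554 and matches 47 bases, so it ends at 23632600, one base before the donor intron start in column 2 (23632601).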
Note that unlike standard SAM, both mates are recorded in one line here. The gap of length L between the mates is marked by “Lp” in the CIGAR string (for example, 40p in the line above).
If the mates overlap, L<0. I believe this format is convenient since all the information is provided on one line.
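To illustrate the “Lp” convention, here is a minimal sketch that splits such a CIGAR string into operations and pulls out the mate gap. The function names `split_cigar` and `mate_gap` are hypothetical. The code also assumes that an overlap (L<0) is written with a leading minus sign, e.g. `-10p`; that rendering is my assumption, not something stated above.

```python
import re
from typing import List, Optional, Tuple

# One operation: an optionally negative length followed by a single letter
_CIGAR_OP = re.compile(r"(-?\d+)([A-Za-z])")
_CIGAR_FULL = re.compile(r"(?:-?\d+[A-Za-z])+")


def split_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Return the CIGAR as a list of (length, operation) pairs."""
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"unrecognised CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]


def mate_gap(cigar: str) -> Optional[int]:
    """Return L from the 'Lp' operation, or None if the CIGAR has no 'p'."""
    for length, op in split_cigar(cigar):
        if op == "p":
            return length
    return None
```

For the example line, `mate_gap("47S29M40p76M")` returns 40, and `mate_gap("47M29S")` returns `None` because that segment lies entirely within one mate.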
For strand definitions, when aligning paired-end reads, the sequence of the second mate is reverse-complemented.
To filter chimeric junctions and find the number of reads supporting each junction you could use, for example:
awk '$1!="chrM" && $4!="chrM" && $7>0 && $8+$9<=5 {print $1,$2,$3,$4,$5,$6,$7,$8,$9}' Chimeric.out.junction | sort | uniq -c | sort -k1,1rn
This will keep only the canonical junctions (types 1 and 2) whose total repeat length (left plus right) is at most 5, and will remove chimeras involving the mitochondrial genome.
When I do it for one of our K562 runs, I get:
181 chr1 144676873 - chr1 147917466 + 1 0 1
29 chr5 69515744 - chr5 34182973 - 1 3 1
28 chr1 143910077 - chr1 149459550 - 1 1 0
27 chr22 23632601 + chr9 133729450 + 1 0 0
20 chr12 90313405 - chr21 40684813 - 1 2 0
20 chr22 23632601 + chr9 133655755 + 1 0 1
20 chr9 123636256 - chr9 123578959 + 1 1 4
15 chr16 85589970 + chr6 16762582 + 1 3 2
15 chr3 197348574 - chr3 195392936 + 1 1 0
14 chr18 39584506 + chr18 39560613 - 1 2 0
Note that lines 4 and 6 here are BCR/ABL fusions. You would need to filter these junctions further to see which of them connect known but not homologous genes.
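If you prefer to do the counting in a script, for instance as a starting point for the further filtering, the awk pipeline above could be sketched in Python roughly as follows. The function name `count_junction_support` and the `max_repeat_sum` parameter are hypothetical. The filters mirror the awk conditions, and fields are assumed to be whitespace-separated.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def count_junction_support(
    lines: Iterable[str], max_repeat_sum: int = 5
) -> List[Tuple[Tuple[str, ...], int]]:
    """Count reads per junction (columns 1-9), most supported first."""
    counts: Counter = Counter()
    for line in lines:
        f = line.split()
        if len(f) < 9:
            continue  # skip blank or truncated lines
        if f[0] == "chrM" or f[3] == "chrM":
            continue  # drop chimeras involving the mitochondrial genome
        if int(f[6]) <= 0:
            continue  # keep canonical junctions only (types 1 and 2)
        if int(f[7]) + int(f[8]) > max_repeat_sum:
            continue  # total repeat length must be at most max_repeat_sum
        counts[tuple(f[:9])] += 1
    # most_common() sorts by count, descending; ties keep first-seen order
    return counts.most_common()
```

It can be applied to an open file, e.g. `count_junction_support(open("Chimeric.out.junction"))`. Each returned item is a pair of the nine junction columns and the number of supporting reads.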