Star 2 Pass not mapping an intron inclusion event

Besse Cummings

unread,

Oct 5, 2015, 3:28:05 PM10/5/15

to rna-...@googlegroups.com

Hi Alex,

I've been realigning a number of Tophat aligned bams with Star 2-pass and checking individual sites to optimize STAR parameters. I haven't been able to get Star to align an event we are pretty certain is real and which aligns with TopHat. We were hoping you could help out!

The event is the splicing in of an intron at chr21:47409808- 47409879. Here is a Sashimi plot comparisons between TopHat and Star 2 pass: http://www.broadinstitute.org/~berylc/Star2PassForum/Img1_StarPost.png

We believe the event is real given that the left hand of the inclusion is aligned with both aligners (the sequence of the exons in the region are unique if you look at UCSC) and that the sample carries a splice creating intronic variant 3' of the inclusion.

All IGV screenshots are showing only uniquely aligning reads, filtering out PCR duplicates, secondary and supplementary alignments.

I’ve looked at where reads that are correctly aligned in Tophat are going with Star. A large portion of the reads are getting placed correctly at the 3' end of this inclusion with Star 2 Pass but instead of aligning the rest of the read to the adjacent canonical exon as splicing, the bases are getting softclipped. Here is a shot of this happening (these are all reads that are aligning uniquely to the adjacent exon in Tophat):

http://www.broadinstitute.org/~berylc/Star2PassForum/Img2_StarPost.png

Here’s a comparison of the CIGAR strings in TopHat and Star2Pass for the same read

TopHat

Alignment start = 47,409,821 (-)

Cigar = 59M292N17M

Star 2 pass

Location = chr21:47,409,829

Alignment start = 47,409,821 (-)

Cigar = 60M16S

(Star aligning one extra base to the splice site and softclipping the rest, the adjacent exon starts with a G)

I’ve tried forcing –alignEndType EndtoEnd. This doesn’t change where the alignment is placed, it just doesn’t softclip the reads. So the CIGAR string becomes 76M (alignment still starting at 47,409,821). Here’s a screenshot of this.

http://www.broadinstitute.org/~berylc/Star2PassForum/Img3_StarPost.png

Another example of the same thing

TopHat

Alignment start = 47,409,813 (-)

Cigar = 67M292N9M

Star 2 Pass

Alignment start = 47,409,813

Cigar = 68M8S

Here are my general parameters

-76 bp unstranded paired-end reads using genome generated with --sjdbOverhang  75

--outFilterMismatchNoverReadLmax 0.1 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --sjdbOverhang 75 --outFilterScoreMinOverLread 0.33 --= outFilterMatchNminOverLread 0.33 --alignSoftClipAtReferenceEnds No --chimJunctionOverhangMin 15

I’ve individually made the following changes and ran the aligner each time:

1) outFilterScoreMinOverLread to 0.66 (default)

2) outFilterMatchNminOverLread to 0.66 (default)

3) sjdbScore 0

4) All of the above

5) add --alignEndsType Extend5pOfRead1

6) add --alignEndsType EndToEnd

7) --alignSJDBoverhangMin 2

None of these options align this inclusion on the right hand side.

It seems there are at least two problems. The first is that Star is soft-clipping the reads a little aggressively. However another issue could be that the exon downstream of the inclusion is quite small (26 bps) so reads mapping to the intron inclusion have to map to this exon and the next exon downstream. So these reads are crossing multiple junction. Is Star unforgiving about reads like these? Any comments would help!

Thanks so much!

Best,

Beryl

Alexander Dobin

unread,

Oct 6, 2015, 3:00:07 PM10/6/15

to rna-star

Hi Beryl,

could you please send me SJ.out.tab from the 1st pass of STAR, as well as all the reads (43) that were mapped to this junction by TopHat (SAM from TopHat with both mates would be most useful).

The question is why STAR did not detect any of the 43 reads mapping to this junction in the 1st pass.

Cheers

Alex

-76 bp unstranded paired-end reads using genome generated with --sjdbOverhang  76

Besse Cummings

unread,

Oct 11, 2015, 12:03:12 PM10/11/15

to rna-star

Exactly! Here's the SJ.out.tab from 1st pass from the two samples that had reads mapped to the junction: http://www.broadinstitute.org/~berylc/Star2PassForum/Sample1_1stpassSJ.out.tab

And a sliced bam from TopHat with some flanking region around the identified junction. If you'd like to look at it in IGV the region is chr21:47,407,560-47,412,817 and the spliced-in intron is at chr21:47,409,808-47,409,879. http://www.broadinstitute.org/~berylc/Star2PassForum/Sample1_forStarForum.bam

On Monday, October 5, 2015 at 3:28:05 PM UTC-4, Besse Cummings wrote:

-76 bp unstranded paired-end reads using genome generated with --sjdbOverhang  76

Besse Cummings

unread,

Oct 19, 2015, 11:34:11 AM10/19/15

to rna-star

Hi Alex,

Just bumping this up, if you have time to look at it!

Thanks,
Beryl

Alexander Dobin

unread,

Oct 19, 2015, 7:31:45 PM10/19/15

to rna-star

Hi Beryl,

sorry for the delay, the week has been busy for me, I will work on it on Wed.

Cheers

Alex

Alexander Dobin

unread,

Oct 21, 2015, 7:22:40 PM10/21/15

to rna-star

Hi Beryl,

I have tried to re-align these reads with STAR, and when I map them as single-end reads, STAR can detect this junction for 4 of them (with long enough overhang), see below.

So something must be wrong with their mates, however, the BAM file does not contain the mates for these reads, so I could not check it.

Could you please send me the BAM file for a bigger interval, starting at 47,406,500?

Cheers

Alex

C6JPEANXX150626:2:2202:7880:64242 0 chr21 47407072 255 18M323N21M90N37M * 0 0 AATAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCACCGGGGCGACCCCG * NH:i:1 HI:i:1 nM:i:1 AS:i:76 NM:i:1 MD:Z:61T14 jI:B:i,47407090,47407412,47407434,47407523 jM:B:c,21,21

C6JPEANXX150626:3:1312:1694:83238 0 chr21 47407072 255 18M323N21M90N37M * 0 0 AATAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCTCCGGGGCGACCCCG * NH:i:1 HI:i:1 nM:i:0 AS:i:78 NM:i:0 MD:Z:76 jI:B:i,47407090,47407412,47407434,47407523 jM:B:c,21,21

C6JPEANXX150626:3:2305:7396:87379 0 chr21 47407073 255 17M323N21M90N38M * 0 0 ATAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCTCCGGGGCGACCCCGG * NH:i:1 HI:i:1 nM:i:0 AS:i:78 NM:i:0 MD:Z:76 jI:B:i,47407090,47407412,47407434,47407523 jM:B:c,21,21

C6JPEANXX150626:3:1216:5689:38387 0 chr21 47407074 255 16M323N21M90N39M * 0 0 TAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCTCCGGGGCGACCCCGGC * NH:i:1 HI:i:1 nM:i:0 AS:i:78 NM:i:0 MD:Z:76 jI:B:i,47407090,47407412,47407434,47407523 jM:B:c,21,21

Besse Cummings

unread,

Oct 21, 2015, 8:17:11 PM10/21/15

to rna-star

Here is a bam with a much larger region: http://www.broadinstitute.org/~berylc/Star2PassForum/Sample1_ForStar_21:47403962-474090380.bam

Thanks!

Beryl

Alexander Dobin

unread,

Oct 22, 2015, 5:57:29 PM10/22/15

to rna-star

Hi Beryl,

I cannot reproduce the problem you are seeing. For the read in your IGV example, I get the spliced alignment with the correct intron:

C6JPEANXX150626:1:1114:9276:86838 147 chr21 47409821 255 59M292N17M = 47407412 -2777 CCCCTCGCCGTCCCCTCCATCTGGAAGGACAAGGACAGCCACCCAGGCACCCAGCAAAGGGCTCCAGGGGACCCAA FF<BF<F/FBF/</F/F/<<FFFFFFFFFFFFFBFFFFFFFFFF/BFFFFFFFBBFFF/FFFFFFFFFFFFBBBBB NH:i:1 HI:i:1 nM:i:0 AS:i:137 NM:i:0 MD:Z:76 jI:B:i,47409880,47410171 jM:B:c,3

I have used genome without annotations, and your parameters --outFilterMismatchNoverReadLmax 0.1 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --sjdbOverhang 75 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --alignSoftClipAtReferenceEnds No --chimJunctionOverhangMin 15

You mentioned 2-pass alignment, how are you doing it? Which parameters are using for the 1st and 2nd pass.

Could you now send me the BAM file from STAR for the same region?

By any chance, are you using --outFilterType BySJout - if so it would explain why this junction is filtered out.

The junction in question (47409880-47410171) has GC/AG "semi-canonical" intron motif.

However, there is another - canonical - junction with the same acceptor site (47409693-47410171).

Parameter --outSJfilterDistToOtherSJmin (=10 0 5 10) filters junctions that are too close to the other junctions. the number are the minimum distance allowed.

It's 0 for canonical junctions, but 5 for GC/AG, so the resulting junction was not output into the SJ.out.tab file, and, if --outFilterType BySJout is used, it will also be filtered from the Aligned.out.bam.

Cheers

Alex

Besse Cummings

unread,

Oct 23, 2015, 1:15:53 AM10/23/15

to rna-star

Yes --outFilterType BySJout was the problem! It was tucked away, and after I removed it the junction is mapping. Thanks for your help!

Beryl

Reply all

Reply to author

Forward