Star 2 Pass not mapping an intron inclusion event

151 views
Skip to first unread message
Assigned to ado...@gmail.com by me

Besse Cummings

unread,
Oct 5, 2015, 3:28:05 PM10/5/15
to rna-...@googlegroups.com

Hi Alex,

I've been realigning a number of Tophat aligned bams with Star 2-pass and checking individual sites to optimize STAR parameters. I haven't been able to get Star to align an event we are pretty certain is real and which aligns with TopHat. We were hoping you could help out!

The event is the splicing in of an intron at chr21:47409808- 47409879. Here is a Sashimi plot comparisons between TopHat and Star 2 pass: http://www.broadinstitute.org/~berylc/Star2PassForum/Img1_StarPost.png

 We believe the event is real given that the left hand of the inclusion is aligned with both aligners (the sequence of the exons in the region are unique if you look at UCSC) and that the sample carries a splice creating intronic variant 3' of the inclusion. 

All IGV screenshots are showing only uniquely aligning reads, filtering out PCR duplicates, secondary and supplementary alignments.

I’ve looked at where reads that are correctly aligned in Tophat are going with Star. A large portion of the reads are getting placed correctly at the 3' end of this inclusion with Star 2 Pass but instead of  aligning the rest of the read to the adjacent canonical exon as splicing, the bases are getting softclipped. Here is a shot of this happening (these are all reads that are aligning uniquely to the adjacent exon in Tophat): 

http://www.broadinstitute.org/~berylc/Star2PassForum/Img2_StarPost.png    

Here’s a comparison of the CIGAR strings in TopHat and Star2Pass for the same read

TopHat

Alignment start = 47,409,821 (-)

Cigar = 59M292N17M

Star 2 pass

Location = chr21:47,409,829

Alignment start = 47,409,821  (-)

Cigar = 60M16S

 

(Star aligning one extra base to the splice site and softclipping the rest, the adjacent exon starts with a G)

 

I’ve tried forcing –alignEndType EndtoEnd. This doesn’t change where the alignment is placed, it just doesn’t softclip the reads. So the CIGAR string becomes 76M (alignment still starting at 47,409,821). Here’s a screenshot of this.

http://www.broadinstitute.org/~berylc/Star2PassForum/Img3_StarPost.png 

Another example of the same thing

TopHat

Alignment start = 47,409,813 (-)

Cigar = 67M292N9M

Star 2 Pass

Alignment start = 47,409,813

Cigar = 68M8S

 

Here are my general parameters

-76 bp unstranded paired-end reads using genome generated with --sjdbOverhang
 75

--outFilterMismatchNoverReadLmax 0.1 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --sjdbOverhang 75 --outFilterScoreMinOverLread 0.33 --= outFilterMatchNminOverLread 0.33 --alignSoftClipAtReferenceEnds No --chimJunctionOverhangMin 15

 

I’ve individually made the following changes and ran the aligner each time:

1) outFilterScoreMinOverLread to 0.66 (default)

2) outFilterMatchNminOverLread to 0.66 (default)

3) sjdbScore 0

4) All of the above

5) add --alignEndsType Extend5pOfRead1

6) add --alignEndsType EndToEnd

7) --alignSJDBoverhangMin 2

None of these options align this inclusion on the right hand side.

 

It seems there are at least two problems. The first is that Star is soft-clipping the reads a little aggressively. However another issue could be that the exon downstream of the inclusion is quite small (26 bps) so reads mapping to the intron inclusion have to map to this exon and the next exon downstream. So these reads are crossing multiple junction. Is Star unforgiving about reads like these? Any comments would help!

Thanks so much!

Best,

Beryl

Alexander Dobin

unread,
Oct 6, 2015, 3:00:07 PM10/6/15
to rna-star
Hi Beryl,

could you please send me SJ.out.tab from the 1st pass of STAR, as well as all the reads (43) that were mapped to this junction by TopHat (SAM from TopHat with both mates would be most useful).
The question is why STAR did not detect any of the 43 reads mapping to this junction in the 1st pass.

Cheers
Alex

-76 bp unstranded paired-end reads using genome generated with --sjdbOverhang
 76 

Besse Cummings

unread,
Oct 11, 2015, 12:03:12 PM10/11/15
to rna-star
Exactly! Here's the SJ.out.tab from 1st pass from the two samples that had reads mapped to the junction: http://www.broadinstitute.org/~berylc/Star2PassForum/Sample1_1stpassSJ.out.tab

And a sliced bam from TopHat with some flanking region around the identified junction. If you'd like to look at it in IGV the region is chr21:47,407,560-47,412,817 and the spliced-in intron is at chr21:47,409,808-47,409,879.  http://www.broadinstitute.org/~berylc/Star2PassForum/Sample1_forStarForum.bam

 








On Monday, October 5, 2015 at 3:28:05 PM UTC-4, Besse Cummings wrote:

-76 bp unstranded paired-end reads using genome generated with --sjdbOverhang
 76 

Besse Cummings

unread,
Oct 19, 2015, 11:34:11 AM10/19/15
to rna-star
Hi Alex,
Just bumping this up, if you have time to look at it! 
Thanks,
Beryl 

Alexander Dobin

unread,
Oct 19, 2015, 7:31:45 PM10/19/15
to rna-star
Hi Beryl,

sorry for the delay, the week has been busy for me, I will work on it on Wed.

Cheers
Alex

Alexander Dobin

unread,
Oct 21, 2015, 7:22:40 PM10/21/15
to rna-star
Hi Beryl,

I have tried to re-align these reads with STAR, and when I map them as single-end reads, STAR can detect this junction for 4 of them (with long enough overhang), see below.
So something must be wrong with their mates, however, the BAM file does not contain the mates for these reads, so I could not check it.
Could you please send me the BAM file for a bigger interval, starting at 47,406,500?

Cheers
Alex


C6JPEANXX150626:2:2202:7880:64242       0       chr21   47407072        255     18M323N21M90N37M        *       0       0       AATAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCACCGGGGCGACCCCG    *       NH:i:1  HI:i:1  nM:i:1  AS:i:76 NM:i:1  MD:Z:61T14      jI:B:i,47407090,47407412,47407434,47407523      jM:B:c,21,21
C6JPEANXX150626:3:1312:1694:83238       0       chr21   47407072        255     18M323N21M90N37M        *       0       0       AATAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCTCCGGGGCGACCCCG    *       NH:i:1  HI:i:1  nM:i:0  AS:i:78 NM:i:0  MD:Z:76 jI:B:i,47407090,47407412,47407434,47407523      jM:B:c,21,21
C6JPEANXX150626:3:2305:7396:87379       0       chr21   47407073        255     17M323N21M90N38M        *       0       0       ATAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCTCCGGGGCGACCCCGG    *       NH:i:1  HI:i:1  nM:i:0  AS:i:78 NM:i:0  MD:Z:76 jI:B:i,47407090,47407412,47407434,47407523      jM:B:c,21,21
C6JPEANXX150626:3:1216:5689:38387       0       chr21   47407074        255     16M323N21M90N39M        *       0       0       TAACGTGGAGCAAGTGTGCTGCTCCTTCGAATGCCAGCCTGCAAGAGGACCTCCGGGGCTCCGGGGCGACCCCGGC    *       NH:i:1  HI:i:1  nM:i:0  AS:i:78 NM:i:0  MD:Z:76 jI:B:i,47407090,47407412,47407434,47407523      jM:B:c,21,21

Besse Cummings

unread,
Oct 21, 2015, 8:17:11 PM10/21/15
to rna-star

Alexander Dobin

unread,
Oct 22, 2015, 5:57:29 PM10/22/15
to rna-star
Hi Beryl,

I cannot reproduce the problem you are seeing. For the read in your IGV example, I get the spliced alignment with the correct intron:
C6JPEANXX150626:1:1114:9276:86838       147     chr21   47409821        255     59M292N17M      =       47407412        -2777   CCCCTCGCCGTCCCCTCCATCTGGAAGGACAAGGACAGCCACCCAGGCACCCAGCAAAGGGCTCCAGGGGACCCAA    FF<BF<F/FBF/</F/F/<<FFFFFFFFFFFFFBFFFFFFFFFF/BFFFFFFFBBFFF/FFFFFFFFFFFFBBBBB    NH:i:1  HI:i:1 nM:i:0   AS:i:137        NM:i:0  MD:Z:76 jI:B:i,47409880,47410171        jM:B:c,3

I have used genome without annotations, and your parameters --outFilterMismatchNoverReadLmax 0.1 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --sjdbOverhang 75 --outFilterScoreMinOverLread 0.33 --outFilterMatchNminOverLread 0.33 --alignSoftClipAtReferenceEnds No --chimJunctionOverhangMin 15

You mentioned 2-pass alignment, how are you doing it? Which parameters are using for the 1st and 2nd pass.
Could you now send me the BAM file from STAR for the same region? 

By any chance, are you using --outFilterType BySJout - if so it would explain why this junction is filtered out.
The junction in question (47409880-47410171) has GC/AG "semi-canonical" intron motif.
However, there is another - canonical - junction with the same acceptor site (47409693-47410171).
Parameter --outSJfilterDistToOtherSJmin (=10   0   5   10) filters junctions that are too close to the other junctions. the number are the minimum distance allowed.
It's 0 for canonical junctions, but 5 for GC/AG, so the resulting junction was not output into the SJ.out.tab file, and, if --outFilterType BySJout is used, it will also be filtered from the Aligned.out.bam.

Cheers
Alex

Besse Cummings

unread,
Oct 23, 2015, 1:15:53 AM10/23/15
to rna-star
Yes --outFilterType BySJout was the problem! It was tucked away, and after I removed it the junction is mapping. Thanks for your help!
Beryl 
Reply all
Reply to author
Forward
0 new messages