strand specific control of novel splice junction discovery

404 views
Skip to first unread message

Sol Katzman

unread,
Feb 19, 2015, 11:49:27 AM2/19/15
to rna-...@googlegroups.com
Dear Alex,

Thanks for the wonderful software.

I am running STAR using the internal 2pass method to discover splice junctions with  "--twopass1readsN -1"
I am using "--outFilterType BySJout"
There is no sjdb in the target genome.

Per a previous post on this forum from you regarding the "--outSJfilter" parameters:

These four integers actually work for all 7 motif types - I need to clarify it in my documentation.
1st int is for non-canonical (i.e. none of the motifs below)
2nd number is for GT/AG and CT/AC. Note that CT/AC is reverse complementary to GT/AG, which means that the junction is transcribed from the (-) DNA strand - that is actually how strand (col. 4) is determined.
3rd:  GC/AG and CT/GC 
4th:  AT/AC and GT/AT

Therefore, if I set "--outSJfilter*" to favor GT/AG junctions, then CT/AC junctions are also favored.

This is undesirable because CT/AC junctions are actually non-canonical. etc.

To avoid this, it would be necessary for STAR to be informed, for strand-specific libraries,
about the relative strand of read1 and read2 (for paired-End reads). There are 4 possibilities
(2 each for read1, read2). (Plus the unstranded possibility).

As you are probably aware, tophat has a parameter "--library-type", which it uses
for just this purpose (filtering splice junctions).

Have you considered adding such parameters? Or have I missed something in the existing parameters?

I am aware that I could post-process the bam files, removing the reads that have the reverse complement
splices, but this is much less desirable than having a "clean" bam output by STAR.

Thanks again for the software.

/Sol Katzman

Alexander Dobin

unread,
Feb 25, 2015, 4:55:51 PM2/25/15
to rna-...@googlegroups.com
Hi Sol,

I have thought about filtering out the wrong-strand junctions for stranded libraries as you described, however, decided against it.
In the most commonly used dUTP stranded protocol, the strand error (i.e. % of reads sequenced from the wrong strand of the RNA) is typically 0.5-1%.
So the number of reads mapping to the wrong strand is quite substantial. While it is easy to filter out the wrong-strand spliced reads by looking at the intron motif, it is not so clear how to deal with unspliced reads mapping to the wrong strand of the annotations. Note that the latter might also originate from real antisense transcripts.
If we were to filter just the wrong-strand spliced reads, it would create a bias against the end of the exons in the opposite strand, but it will eliminate no more that 40-50% of the wrong-strand reads.
It's relatively easy to eliminate the wrong-splice reads from the BAM file in the post-processing, e.g. you can use --outSAMattributes jM to output intron motif (always with respect to + strand of DNA), and then check the FLAG for actual read strand.

Cheers
Alex

skat...@ucsc.edu

unread,
Apr 3, 2015, 10:48:09 AM4/3/15
to rna-...@googlegroups.com
Dear Alex,

Thanks for the explanation of your thinking.

Based on your estimate of strand error in the dUTP protocol, it seems like including the revComp junctions is going to increase
the sensitivity of splice junction detection by 1%.

Of course, any increase in sensitivity is likely to entail a decrease in specificity of unknown extent. In my samples,
the revComp junction mappings were largely found to be false positives, perhaps due to alignment peculiarities to the
yeast genomes that I was working with. So for me, the sensitivity/specificity tradeoff was clearly a poor one, which
is why I wrote the original post.

I am well aware that real antisense transcripts are common, having seen it for many genes. But if there are splice
junctions among them, I would expect from biological considerations that the splicing machinery would not work
on revComp splice-sites. But again, for true "wrong strand" sequences, we are talking about 1% differences between
spliced and unspliced alignments, which seems pretty minor.

I also note that you have provided exquisite control for the user to filter various splice junction types with parameters
(actually 4 parameters per option):
  --outSJfilterOverhangMin
  --outSJfilterCountUniqueMin
  --outSJfilterCountTotalMin
  --outSJfilterDistToOtherSJmin

 
It seems somewhat inconsistent that you would not let the user control whether revComp junctions should
be included for each junction type.

Note that if you allow the user to specify the library as "unstranded", the results would be the
same as the current situation.

On a more global note, I wonder if you should really be "baking in" to the software an assumption of library protocol
and sequencing technology.

While I often have datasets prepared with dUTP, I also have worked with RNAseq datasets that are unstranded,
as well as those where Illumina read1 is the same as the coding strand (not reverse as in dUTP). (Aside:
I recently had to re-write the "circular RNA" filtering scripts to accommodate such data sets.)

I might not object to your assumption that read1 and read2 are from opposite strands, as it is in Illumina
technology, but I would expect that to be documented very clearly, which I may have missed.

But I would urge you to consider a parameter to specify the strands of read1 and read2, or the fact
that the library is unstranded. (Post-processing the SAM file is a far less desirable option, as you can imagine.)

Thanks again for you attention and the software, which is highly useful.

/Sol

Alexander Dobin

unread,
Apr 7, 2015, 12:25:28 PM4/7/15
to rna-...@googlegroups.com
Hi Sol,

thanks a lot for your thoughts. You make valid arguments, and I agree that the "library strandedness" feature would be quite useful. I will add it to my list, which is already quite long. I may need start some polling to rank the features by user demand.

There is always a question to what extent a mapper should massage the alignments, and how much of that should be done in the post-processing. I feel that filtering at the mapping stage should remove mapping artifacts, while the wet-lab artifacts should be taken care of at the post-mapping stage. Allowing reverse strand junctions follows that philosophy - it's not done to increase sensitivity - I believe these reads have to be filtered out in the post-processing. However, I appreciate that having an option to filter these reads directly in STAR will save users a lot of effort in re-processing SAM files.

Note, that my defnition of the intron motif always refer to the (+) DNA strand - STAR does not try to guess the RNA strand, which works for both stranded and unstranded data.
For RNA on (-) strand the canonical intron motif would be CT/AC in my definition. Only if RNA-seq is stranded, you can tell which strand the RNA is on, and than check whether the true intron motif is GT/AG.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages