comparing 2 pass approaches: multiple sample vs per sample

960 views
Skip to first unread message

shao

unread,
Nov 30, 2017, 4:08:35 PM11/30/17
to rna-star
Hi Alex,

We are running 2 pass alignment with pair-end RNA-seq data. Each sample has roughtly 200M reads.

The 1st pass of aligment took around 2 hours with 5 threads, and as recommended in the manul,
we merged and remove MT/GL.. related SJ (MT rocks!). The clean SJ.out.tab is used for the 2nd pass.

Now with 30 threads, the 2-pass alignment take around 2 hours. It is 6 times slower than 1 pass.

I am wondering how to speed up the 2nd pass STAR alignment? And will the per-sample approach be faster?

More importantly, how much gain we could obtain comparing multiple-samples to per-sample approach?

Thanks!

Shao



G
M
T
Text-to-speech function is limited to 200 characters

Alexander Dobin

unread,
Dec 3, 2017, 1:23:36 PM12/3/17
to rna-star
Hi Shao,

this slow-down is most likely caused by (i) too many novel junctions inserted in the 2nd pass; or (ii) novel junctions in highly expressed loci.

The 2-pass, multi-sample or per-sample, is mostly useful for more accurate quantification of novel junctions.
The advantage of multi-sample over per-sample 2-pass approach is that the former the junctions may only be detected in one sample, but will be quantified in all samples. However, the multi-sample approach often suffers from the "too many junctions problem", while the per-sample approach rarely does.

If you insert too many junctions, couple of problems may occur:

1. The mapping speed is reduced significantly.

2. The number of multimappers increases at the expense of unique mappers.


To mitigate these problems, you need to filter the junctions used in the 2nd pass. Here is approximate strategy:


1. Collapse the junctions from all samples into a set of unique junctions counting the number of reads per junction from all samples, and the number of samples the junction was detected in. I wrote a simple script that does just that:

https://github.com/alexdobin/STAR/blob/master/extras/scripts/sjCollapseSamples.awk


2. Calculate some statistics on these junctions: number of junctions with different intron motifs (column 5), number of junctions detected in 1,2,3... samples (column 10) etc.

This will give you an idea on how to filter these junctions best.


3. Filter the junctions on: (i) number of samples detected, (ii) total number of unique/multimap reads, (iii) max overhang. You may want to do harsher filtering for non-canonical junctions (col5=0). You would want to bring the number of junctions to <1M.


4. For the 2nd pass, use --sjdbFileChrStartEnd SJ.filtered /path/to/this/sample/1st/pass/SJ.out.tab

where SJ.filtered is the list of filtered junctions from 3, and /path/to/this/sample/1st/pass/SJ.out.tab

is the SJ.out.tab of the 1st pass for this one sample.


You may need to adjust the filtering in step 3 to bring the increase in multimapers to no more than 1-2%.


Cheers
Alex


shao

unread,
Dec 6, 2017, 5:16:05 AM12/6/17
to rna-star
Hi Alex,

Thanks for the informative reply!

Before trying your suggestions, here are some comparison results between multiple sample and per-sample 2-pass alignment with one test sample.
Running time: per-sample basic 2-pass took 1.5 hours with 30 cores, so there isnt much gain in speed. However, this time I used
the original SJ tab with  MT and GL contigs, and in my experience the SJ in MT slows down the STAR a lot!

Junctions: In multiple-samples there are 1.5% more total splicing juctions (56445436 vs 55656357), but there are 200% more non-canonical SJ (1136477 vs 317346)!

With your suggestion I will filter the SJs and also MT/GL contigs and run 2 pass again.

Best,
Shao

shao

unread,
Jan 6, 2018, 1:49:07 AM1/6/18
to rna-star
Hi Alex,

Here are more updates on the comparisons:
- run the star in 1-passs mode
- collect the SJ.out.tab, remove the MT and GL related SJ
- reduce the size of total SJ by the script, from 31000906 to 1405884 (close to 1M)
- run star in the two pass mode with the total SJ, reduced total SJ, or original SJ (without MT and GL)

The running time and statistic:
sample running time (30 cores) uniquely mapped reads Total splices annotated Non-canonical
2pass 7 hours 83.13% 51093946  51013149 231818
2pass-basic 1.5 hours 85.58% 50320738 50214759 108398
2pass-filter 6 hours 84.29% 50801705 50715242 145772











Even with reduced SJ, it still took 6 hours to finish, close to the running time without filtering.

And there are few difference in the annotated SJ. Major difference could be seen in the non-can SJ, which has high FDR (I saw you comments on this in another thread). In the end I used the original SJ in the work.

Best,
Shao





Alexander Dobin

unread,
Jan 6, 2018, 3:45:58 PM1/6/18
to rna-star
Hi Shao,

thanks for sharing the results.
Switching from 2-pass basic (single sample) to filtered multi-sample 2-pass, you got ~500k more reads spliced reads,
and then another ~300k spliced reads for unfiltered 2-pass.
Note that these splices are called annotated because they are annotated after the 1st pass. However, most of them will be unannotated with respect to annotations (if you used them).

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages