Multi-sample 2-pass mapping clarification

je...@encodedgenomics.com

Aug 4, 2016, 8:38:34 PM

to rna-star

Hi Alex,

Quick question: in the STAR manual it says for multi-sample 2-pass mapping:

For a study with multiple samples, it is recommended to collect 1st pass junctions from all samples.

1. Run 1st mapping pass for all samples with "usual" parameters. Using annotations is recommended either at the genome generation step or at the mapping step.

2. Run 2nd mapping pass for all samples, listing SJ.out.tab files from all samples in --sjdbFileChrStartEnd /path/to/sj1.tab /path/to/sj2.tab ....

So to be clear: I run STAR with my usual parameters once, and then I run STAR again with the EXACT SAME parameters as the first run, but just adding an extra '--sjdbFileChrStartEnd sj1.out.tab sj2.out.tab...' option, where sj1.out.tab, sj2.out.tab, ... are the STAR splice junction output files from the first run?

Thanks!

-Jerry

Alexander Dobin

Aug 5, 2016, 5:32:26 PM

to rna-star

Hi Jerry,

yes, this is correct.

Cheers

Alex
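
The procedure confirmed above can be sketched as a small shell script. This is only a minimal sketch under stated assumptions: the sample names, the fastq/, pass1/ and pass2/ layout, the thread count, and the run and star_map helpers are hypothetical placeholders, while the STAR options themselves (--runThreadN, --genomeDir, --readFilesIn, --outFileNamePrefix, --sjdbFileChrStartEnd) are the real ones. By default it only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Multi-sample 2-pass mapping sketch. Paths and sample names are hypothetical.
set -e

GENOME_DIR=${GENOME_DIR:-/path/to/genomeDir}  # existing STAR index (placeholder)
SAMPLES="sampleA sampleB"                     # hypothetical sample names
DRY_RUN=${DRY_RUN:-1}                         # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# usage: star_map SAMPLE OUT_PREFIX [extra STAR options...]
# The "usual" parameters live here so that both passes share them exactly.
star_map() {
  sample=$1; prefix=$2; shift 2
  run STAR --runThreadN 8 --genomeDir "$GENOME_DIR" \
      --readFilesIn "fastq/${sample}_R1.fastq" "fastq/${sample}_R2.fastq" \
      --outFileNamePrefix "$prefix" "$@"
}

# 1st pass: every sample, usual parameters only.
first_pass() {
  for s in $SAMPLES; do star_map "$s" "pass1/$s."; done
}

# 2nd pass: same parameters plus the SJ.out.tab files from ALL samples.
second_pass() {
  sj_files=""
  for s in $SAMPLES; do sj_files="$sj_files pass1/$s.SJ.out.tab"; done
  for s in $SAMPLES; do
    # $sj_files is left unquoted on purpose: each path becomes one argument.
    star_map "$s" "pass2/$s." --sjdbFileChrStartEnd $sj_files
  done
}

run mkdir -p pass1 pass2
first_pass
second_pass
```

Keeping the shared options in one helper is one way to guarantee that the second run really uses the exact same parameters as the first, with --sjdbFileChrStartEnd as the only addition.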

Ren Wenhua

Aug 29, 2016, 10:17:38 AM

to rna-star

Dear Alex,

I have been reading the recent STAR 2-pass topics and found that in some of the earlier posts you recommended merging all the SJ.out.tab files and removing the extra columns and the mitochondrial junctions. However, I have also seen that you suggested here and in some other posts that just listing all the SJ.out.tab files generated from the 1st mapping step after "--sjdbFileChrStartEnd" would also work.

Which way is recommended or preferred? Does the answer differ between STAR versions?

I have actually just tried building the index for 2-pass mapping by listing all the SJ.out.tab files, and it worked without any problem. I am about to map the files using the new index and have only just read about all this, so I hope you can explain a bit more about the differences between the two slightly different methods you described.

Thanks!

Wenhua

Alexander Dobin

Aug 29, 2016, 2:22:02 PM

to rna-star

Hi Wenhua,

concatenating the SJ.out.tab files or listing them all will produce exactly the same results.

Filtering the junctions will alter the results, of course. Typically, filtering is only required if the mapping speed decreases drastically without it.

Cheers

Alex
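
For readers who do need to filter, the following is a minimal sketch of one possible filter, not a recommendation from this thread. It relies on the SJ.out.tab column layout (column 1 is the chromosome, column 7 is the number of uniquely mapping reads crossing the junction). The mitochondrial chromosome name (chrM), the read threshold (3), and the function name filter_sj are assumptions chosen for illustration; as noted above, any filtering changes the mapping results.

```shell
#!/bin/sh
# Hypothetical junction filter for SJ.out.tab files.
# Assumed defaults: mitochondrial chromosome is named chrM, and a junction
# is kept only if at least 3 uniquely mapping reads cross it.
filter_sj() {
  awk -v min="${MIN_UNIQ:-3}" -v mito="${MITO:-chrM}" '
    BEGIN { FS = OFS = "\t" }
    # $1 = chromosome, $7 = uniquely mapping reads crossing the junction
    $1 != mito && $7 >= min
  ' "$@"
}

# Example usage (paths are placeholders):
#   filter_sj pass1/*.SJ.out.tab > SJ.filtered.tab
#   STAR ... --sjdbFileChrStartEnd SJ.filtered.tab
```

Because the function accepts several files at once, it also covers the "concatenate first" variant: the filtered output is a single file that can be passed to --sjdbFileChrStartEnd.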

Ren Wenhua

Aug 31, 2016, 11:35:16 AM

to rna-star

Thank you, Alex! Got it.

Wenhua

Patrick Tran Van

May 31, 2017, 5:19:19 PM

to rna-star

So there is no need to make a new index between the 1st-pass and 2nd-pass mapping?

Alexander Dobin

May 31, 2017, 6:04:23 PM

to rna-star

Hi Patrick,

if you use the --sjdbFileChrStartEnd /path/to/sj1.tab /path/to/sj2.tab option for the 2-pass run (with sj files from the 1st pass), you do not need to manually re-generate the genome.

Cheers

Alex

Caleb Radens

Jun 23, 2017, 9:59:10 AM

to rna-star

Hi Alex,

I have a speed question. Let's say I have 300 fastqs I want to align using your two-pass approach. I do the first pass on each fastq individually. Next, I run the second-pass alignment with the following extra options:

--sjdbFileChrStartEnd ./directory/with/300_SJs/*SJ.out.tab --limitSjdbInsertNsj=5000000

(I had to increase --limitSjdbInsertNsj because my 300 *SJ.out.tabs had more than the default max)

The slowest step appears to be the "inserting junctions into the genome indices" step:

Jun 22 12:14:03 ..... started STAR run

Jun 22 12:14:03 ..... loading genome

Jun 22 12:15:20 ..... inserting junctions into the genome indices

Jun 22 12:22:14 ..... started mapping

Jun 22 12:23:04 ..... started sorting BAM

Jun 22 12:23:21 ..... finished successfully

As you can see, the "inserting junctions into the genome indices" step takes 7 minutes, but the mapping is quick.

Would it be faster if I first re-build the genome using the 300 *SJ.out.tab files, then align each file using the newly built genome? (Rather than how I'm doing it now with on-the-fly inserting all 300 *SJ.out.tab files to the second pass alignment)

Thanks!!

- Caleb Radens
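
Regarding the --limitSjdbInsertNsj increase mentioned in this post, a rough way to gauge how many distinct junctions a set of SJ.out.tab files contains is to count unique chromosome/start/end triples. This is only an estimate under the assumption that junctions are distinguished by those three columns; STAR's own count of junctions to insert may differ (for example when annotations are also supplied). The function name count_unique_sj is hypothetical.

```shell
#!/bin/sh
# Rough estimate of the number of distinct junctions across SJ.out.tab files.
count_unique_sj() {
  # Columns 1-3 of SJ.out.tab: chromosome, intron start, intron end
  cat "$@" | cut -f1-3 | sort -u | wc -l | tr -d ' '
}

# Example usage (directory taken from the post above):
#   count_unique_sj ./directory/with/300_SJs/*SJ.out.tab
```

If the printed number is above the limit currently in effect, raising --limitSjdbInsertNsj as described in the post is the likely remedy.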

Alexander Dobin

Jun 23, 2017, 3:42:00 PM

to rna-star

Hi Caleb,

in terms of total CPU time, re-building the genome is faster than spending 7 min 300 times.

Actually, you can save the re-built genome when you insert the junctions on the fly, which will only take 7 min.

Basically, run 2nd pass mapping for just one of your samples with

--sjdbFileChrStartEnd ./directory/with/300_SJs/*SJ.out.tab --limitSjdbInsertNsj=5000000 --sjdbInsertSave All

This will save the re-built genome in the _STARgenome/ directory inside your run directory.

Then for all the other samples, you can specify --genomeDir /path/to/_STARgenome/ without the --sjdbFileChrStartEnd ... options.
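
The approach described above can be sketched as follows. This is a minimal sketch under assumptions, not a tested pipeline: the sample names, the fastq/ and pass2/ layout, and the helper functions are hypothetical, and the options are written in STAR's space-separated form. It assumes each sample is run with its own output directory as the prefix, so that the saved genome is expected in the _STARgenome/ directory under the first sample's prefix. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Insert junctions once, save the re-built genome, reuse it for other samples.
set -e

GENOME_DIR=${GENOME_DIR:-/path/to/genomeDir}  # original STAR index (placeholder)
SJ_DIR=./directory/with/300_SJs               # 1st-pass SJ.out.tab files
SAMPLES="sample001 sample002 sample003"       # hypothetical sample names
DRY_RUN=${DRY_RUN:-1}                         # 1 = print commands, 0 = run them

run() {
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# First sample: insert all junctions on the fly and save the re-built genome.
map_first_and_save() {  # $1 = sample
  run mkdir -p "pass2/$1"
  run STAR --genomeDir "$GENOME_DIR" \
      --readFilesIn "fastq/$1.fastq" \
      --outFileNamePrefix "pass2/$1/" \
      --sjdbFileChrStartEnd "$SJ_DIR"/*SJ.out.tab \
      --limitSjdbInsertNsj 5000000 \
      --sjdbInsertSave All
}

# Remaining samples: point --genomeDir at the saved genome, no sjdb options.
map_with_saved_genome() {  # $1 = sample, $2 = saved genome directory
  run mkdir -p "pass2/$1"
  run STAR --genomeDir "$2" \
      --readFilesIn "fastq/$1.fastq" \
      --outFileNamePrefix "pass2/$1/"
}

set -- $SAMPLES
first=$1; shift
map_first_and_save "$first"
for s in "$@"; do
  map_with_saved_genome "$s" "pass2/$first/_STARgenome"
done
```

With this layout the junction insertion cost is paid once, and every later sample only loads the saved genome.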