2-pass mapping

Felix Schlesinger

May 1, 2015, 1:49:21 PM

to rna-...@googlegroups.com

Just to check that I am not misreading the documentation: the new automated two-pass mode should work with annotation built into the genome index, right? For me it seems to hang indefinitely (at 100% CPU) during the second mapping pass. This could be related to the dataset, but I wanted to check first whether it is even supposed to work. It does work when adding the annotation on the fly (i.e. during mapping, not during genome generation).

Felix

Alexander Dobin

May 1, 2015, 4:13:08 PM

to rna-...@googlegroups.com

Hi Felix,

yes, it's supposed to work if you re-generated the genome with the latest version. Versions 2.4.1a and 2.4.1b had problems with genome generation without annotations, so, just in case, please re-generate the genome with 2.4.1c. If it still does not work, please send me a test example (Log.out, a small set of reads, _STARpass1/SJ.out.tab, and links to the genome/annotations).

Cheers

Alex

Felix Schlesinger

May 5, 2015, 1:45:29 PM

to rna-...@googlegroups.com

I still see this issue with 2.4.1c. The problem appears when generating a genome with a specific sjdbOverhang != 100 and then running alignment with on-the-fly annotations (or two-pass) and the default sjdbOverhang. I.e. it is mostly a user error, but a warning message from STAR would be helpful:

Genome built (overhang=75):

../STAR-STAR_2.4.1c/bin/Linux_x86_64/STAR --runMode genomeGenerate --runThreadN 16 --genomeDir . --genomeFastaFiles UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa --sjdbGTFfile UCSC/hg19/Annotation/Genes/genes.gtf --sjdbOverhang 75

* Successful two-pass (explicit overhang=75; matched):

STAR-STAR_2.4.1c/bin/Linux_x86_64/STAR --genomeDir genome.sj/ --readFilesCommand zcat --readFilesIn fastq/mRNA_UHR_S5_L001_R1_001.fastq.gz fastq/mRNA_UHR_S5_L001_R2_001.fastq.gz --twopassMode Basic --sjdbOverhang 75

* Error message (explicit overhang=100; mismatched):

STAR-STAR_2.4.1c/bin/Linux_x86_64/STAR --genomeDir genome.sj/ --readFilesCommand zcat --readFilesIn fastq/mRNA_UHR_S5_L001_R1_001.fastq.gz fastq/mRNA_UHR_S5_L001_R2_001.fastq.gz --twopassMode Basic --sjdbOverhang 100

EXITING because of fatal PARAMETERS error: present --sjdbOverhang=100 is not equal to the value at the genome generation step =75

SOLUTION:

* No error message, hangs forever on 2nd mapping step (default overhang):

STAR-STAR_2.4.1c/bin/Linux_x86_64/STAR --genomeDir genome.sj/ --readFilesCommand zcat --readFilesIn fastq/mRNA_UHR_S5_L001_R1_001.fastq.gz fastq/mRNA_UHR_S5_L001_R2_001.fastq.gz --twopassMode Basic

May 05 10:38:07 ..... Started STAR run

May 05 10:38:07 ..... Loading genome

May 05 10:38:17 ..... Started 1st pass mapping

May 05 10:39:01 ..... Finished 1st pass mapping

May 05 10:39:03 ..... Inserting junctions into the genome indices

May 05 10:40:43 ..... Started mapping

[hangs with 100% CPU, no logging output]

I think this is independent of the fastq file, genome sequence and annotation used.
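A possible stopgap until STAR reports the mismatch itself is to read the overhang stored with the genome and pass it explicitly. The sketch below assumes the genome directory contains a `genomeParameters.txt` file with a whitespace-separated `sjdbOverhang` line; the helper name `genome_overhang` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the sjdbOverhang recorded in a genome directory.
# Assumption: <genomeDir>/genomeParameters.txt holds a line "sjdbOverhang <value>".
# Exits non-zero if the file or the line is missing.
genome_overhang() {
    awk '$1 == "sjdbOverhang" { print $2; found = 1; exit }
         END { exit !found }' "$1/genomeParameters.txt"
}

# Example use (left commented out; paths are the ones from the commands above):
# overhang=$(genome_overhang genome.sj) &&
#   STAR --genomeDir genome.sj/ --readFilesCommand zcat \
#        --readFilesIn fastq/mRNA_UHR_S5_L001_R1_001.fastq.gz fastq/mRNA_UHR_S5_L001_R2_001.fastq.gz \
#        --twopassMode Basic --sjdbOverhang "$overhang"
```

Passing the stored value explicitly corresponds to the "matched" case above, so the run no longer depends on the default.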

Felix

Felix Schlesinger

May 6, 2015, 8:39:47 PM

to rna-...@googlegroups.com

On a similar note: What about the interaction of shared memory genomes and annotations at mapping-time?

Using shared mem with two-pass produces an error message. Shared-mem with mapping-time annotation starts fine, but fails at "Inserting junctions into the genome indices" for me. Is that expected?

Felix

Alexander Dobin

May 8, 2015, 11:53:35 AM

to rna-...@googlegroups.com, felix.sc...@gmail.com

Hi Felix,

thanks a lot for testing and figuring it out. These issues are related to both the "on-the-fly" changes in the code and the change in the default value of sjdbOverhang from 0 to 100, which screwed up the parameter checking. I am working on a fix to be released on Monday; I was busy at the Biology of Genomes meeting this week.

Cheers

Alex

Alexander Dobin

May 8, 2015, 12:50:53 PM

to rna-...@googlegroups.com, felix.sc...@gmail.com

Any option that inserts sjdb junctions on the fly will not work with shared memory (I need to add more checks to prevent this).

Since the genome index is modified dynamically, I cannot think of a safe way to allow it with shared memory.
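Until such checks exist in STAR, a wrapper can refuse the combination before launching a run. This is only a sketch: the function name `star_args_ok` is hypothetical, and it assumes that on-the-fly insertion is triggered by `--sjdbFileChrStartEnd`, `--sjdbGTFfile`, or a `--twopassMode` other than `None`, and that any `--genomeLoad` value other than `NoSharedMemory` involves shared memory.

```shell
#!/bin/sh
# Hypothetical pre-flight check for a STAR argument list.
# Returns non-zero when a shared-memory --genomeLoad mode is combined
# with an option assumed to insert junctions on the fly.
star_args_ok() {
    _shared=0
    _onthefly=0
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --genomeLoad)
                if [ "${2:-NoSharedMemory}" != "NoSharedMemory" ]; then _shared=1; fi ;;
            --sjdbFileChrStartEnd|--sjdbGTFfile)
                # "-" is treated as "no file given"
                if [ "${2:--}" != "-" ]; then _onthefly=1; fi ;;
            --twopassMode)
                if [ "${2:-None}" != "None" ]; then _onthefly=1; fi ;;
        esac
        shift
    done
    [ "$_shared" -eq 0 ] || [ "$_onthefly" -eq 0 ]
}

# Example use (commented out):
# star_args_ok "$@" || { echo "on-the-fly sjdb insertion cannot use shared memory" >&2; exit 1; }
# STAR "$@"
```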

Felix Schlesinger

May 8, 2015, 1:05:49 PM

to rna-...@googlegroups.com, felix.sc...@gmail.com

Yes, that makes sense. The procedure would probably have to be:

- Load shared genome

- Align all samples 1st pass

- Make new shared genome with all discovered SJs

- Align all samples 2nd pass

The main case where something like this could be useful is many small samples/replicates from the same biological context (e.g. single-cell experiments). There, the few minutes spent loading and unloading the genome and adding SJs to the index can add up, and sharing SJs across samples can be important.

But this can already be done, either with a single STAR run and read groups or a wrapper-script around several STAR calls, so for the specific problem here, I think a clear error message is all that is needed.

Alexander Dobin

May 13, 2015, 11:27:46 AM

to rna-...@googlegroups.com, felix.sc...@gmail.com

Hi Felix,

in this scenario mapping all samples together in a single 2-pass run is indeed the best approach.
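As a rough illustration of the single-run variant, the sketch below only prints a command line for inspection. The function name and the `fastq/<sample>_R1.fastq.gz` / `fastq/<sample>_R2.fastq.gz` layout are assumptions, as is the use of comma-separated `--readFilesIn` lists with matching `--outSAMattrRGline` entries separated by " , "; please check both against the manual for your STAR version.

```shell
#!/bin/sh
# Hypothetical helper: print a single two-pass STAR command that maps
# several paired-end samples together, one read group per sample.
# usage: single_run_cmd <genomeDir> <sample>...
single_run_cmd() {
    _genome=$1
    shift
    _r1="" _r2="" _rg=""
    for _smp in "$@"; do
        _r1="${_r1:+$_r1,}fastq/${_smp}_R1.fastq.gz"   # mate 1 files, comma-separated
        _r2="${_r2:+$_r2,}fastq/${_smp}_R2.fastq.gz"   # mate 2 files, same order
        _rg="${_rg:+$_rg , }ID:${_smp}"                # read groups, same order
    done
    printf '%s\n' "STAR --genomeDir $_genome --readFilesCommand zcat --readFilesIn $_r1 $_r2 --outSAMattrRGline $_rg --twopassMode Basic"
}
```

The printed line is meant to be reviewed and pasted, not passed to `eval`; sample names containing whitespace are not handled.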

If the samples are run separately, then it has to be done with a wrapper script, since you need to make sure that all the 1st pass jobs completed successfully.

The genome with junctions inserted on the fly can be saved for future use, which will save 1-2 hours compared to re-generating the genome from scratch, so the workflow would look like this:

- Load shared genome

- Align all samples 1st pass to the shared genome

- Check that all jobs completed successfully (generated SJ.out.tab file)

- Unload the shared genome

- Make new shared genome with all discovered SJs by mapping one of the samples with junctions inserted on the fly

STAR ... --genomeDir /path/to/pass1/genome/ --sjdbFileChrStartEnd /path/to/sample1/SJ.out.tab /path/to/sample2/SJ.out.tab --sjdbInsertSave All

This will save the new genome indices into the _STARgenome directory inside the run directory.

- Load _STARgenome genome from the previous step into shared memory

- Align all samples to the new genome ("2nd pass")
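The steps above could be wrapped roughly as follows. This is a sketch under several assumptions that are not from the thread: paired-end reads live at `fastq/<sample>_R1.fastq.gz` and `fastq/<sample>_R2.fastq.gz`, the output directories `pass1/`, `insert/` and `pass2/` are free to use, sample names contain no whitespace, and the function names are made up. The `--genomeLoad` values used are `LoadAndExit`, `LoadAndKeep` and `Remove`.

```shell
#!/bin/sh
# Succeed only if every listed sample produced a pass-1 SJ.out.tab.
pass1_complete() {
    for _s in "$@"; do
        [ -f "pass1/$_s/SJ.out.tab" ] || { echo "pass 1 incomplete: $_s" >&2; return 1; }
    done
}

# Map one sample against a genome held in shared memory.
# usage: map_shared <genomeDir> <outDir> <sample>
map_shared() {
    mkdir -p "$2/$3"
    STAR --genomeDir "$1" --genomeLoad LoadAndKeep --readFilesCommand zcat \
         --readFilesIn "fastq/${3}_R1.fastq.gz" "fastq/${3}_R2.fastq.gz" \
         --outFileNamePrefix "$2/$3/"
}

# usage: two_pass_shared <pass1GenomeDir> <sample>...
two_pass_shared() {
    _genome=$1
    shift
    _first=$1

    # Load the shared genome, run pass 1, then unload before checking results.
    STAR --genomeDir "$_genome" --genomeLoad LoadAndExit || return 1
    for _smp in "$@"; do map_shared "$_genome" pass1 "$_smp"; done
    STAR --genomeDir "$_genome" --genomeLoad Remove
    pass1_complete "$@" || return 1

    # Collect all pass-1 junction files.
    _sj=""
    for _smp in "$@"; do _sj="$_sj pass1/$_smp/SJ.out.tab"; done

    # Insert junctions on the fly (no shared memory) while mapping one sample;
    # the new indices are saved under insert/_STARgenome.
    # $_sj is deliberately unquoted so it splits into one argument per file.
    mkdir -p insert
    STAR --genomeDir "$_genome" --sjdbFileChrStartEnd $_sj --sjdbInsertSave All \
         --readFilesCommand zcat \
         --readFilesIn "fastq/${_first}_R1.fastq.gz" "fastq/${_first}_R2.fastq.gz" \
         --outFileNamePrefix insert/ || return 1

    # Load the new genome and run the "2nd pass".
    STAR --genomeDir insert/_STARgenome --genomeLoad LoadAndExit || return 1
    for _smp in "$@"; do map_shared insert/_STARgenome pass2 "$_smp"; done
    STAR --genomeDir insert/_STARgenome --genomeLoad Remove
}

# Example use (commented out; sample names are placeholders):
# two_pass_shared /path/to/pass1/genome/ sample1 sample2
```

Pass 1 here runs the samples one after another; with a job scheduler, the same `pass1_complete` check would sit after waiting for all pass-1 jobs to finish.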

Cheers

Alex
