sjdbOverhang and twopassMode usage

Gil S

Jul 31, 2017, 11:42:24 AM
to rna-star
Hi All

I have created an index for my genome using 
STAR \
  --runThreadN 30 \
  --runMode genomeGenerate \
  --genomeDir ./STAR_index/ \
  --genomeFastaFiles S_lycopersicum_chromosomes.3.00.fa \
  --sjdbGTFfile ../Annotation/Genes/ITAG3.10_gene_models.gtf \
  --sjdbOverhang 100

My colleague has been using the following command for mapping
STAR \
  --genomeDir tomato_solanum_lycopersicum/sl3.0/Sequence/STAR_index \
  --readFilesIn SRR567999_R1.fastq SRR567999_R2.fastq \
  --outFilterMultimapNmax 1 \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM SortedByCoordinate \
  --twopassMode Basic \
  --runThreadN 20 \
  --sjdbGTFfile sl3.genes.gtf \
  --sjdbOverhang 100 \
  --quantMode GeneCounts \
  --readFilesCommand cat \
  --outFileNamePrefix SRR567999 \
  --genomeLoad NoSharedMemory &> SRR567999.txt

Does using --sjdbOverhang 100 at the mapping stage cause the genome to be re-indexed, or is it ignored since the index already exists? What are the ramifications of including this parameter at the mapping stage, as shown above?

In addition, are novel junctions detected automatically, or is the --twopassMode Basic parameter necessary for this feature? (I think the answer is no, but I want to be sure.)

Many thanks,
Gil

Alexander Dobin

Aug 1, 2017, 3:58:45 PM
to rna-star
Hi Gil,

If the genome index was generated with annotations, --sjdbOverhang at the mapping stage has to be set to exactly the same value; otherwise, STAR will stop with an error. It only makes sense to specify this parameter if you generated the genome without annotations.
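
As a rough sketch of how one might check this before launching a mapping job: the value used at index time is recorded in genomeParameters.txt inside the genome directory (a listing of that file appears later in this thread). The function names below are hypothetical helpers, not part of STAR, and treating a recorded value of 0 as "index built without annotations" is an assumption based on the iGenomes listing discussed later.

```shell
# Print the sjdbOverhang value recorded in a STAR genome directory.
# Assumes genomeParameters.txt holds "name<whitespace>value" lines.
index_overhang() {
    local genome_dir=$1
    awk '$1 == "sjdbOverhang" { print $2; exit }' "$genome_dir/genomeParameters.txt"
}

# Succeed if the overhang requested at the mapping stage is consistent
# with the index. Assumption: a recorded value of 0 means the index was
# built without annotations, so any value may be supplied when mapping.
overhang_is_compatible() {
    local genome_dir=$1 requested=$2 recorded
    recorded=$(index_overhang "$genome_dir") || return 2
    [ "$recorded" = "0" ] || [ "$recorded" = "$requested" ]
}
```

For example, `overhang_is_compatible ./STAR_index/ 100 || echo "overhang mismatch"` would flag a mismatch before STAR itself stops with an error.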

If a GTF file is used at the mapping stage with --sjdbGTFfile, the junctions from this file will be added to the junctions in the indexed genome - this usually requires a few minutes.

The --twopassMode Basic will run two passes of STAR, with the novel junctions detected in the 1st pass inserted into the genome for the 2nd pass.
The novel junctions will be detected without this option; however, this option will allow STAR to detect more reads mapping to the novel junctions.

Cheers
Alex

Gil S

Aug 6, 2017, 5:21:52 AM
to rna-star
Hi Alexander

What would happen if we don't create an index for a genome and each time we run STAR we use the --sjdbOverhang and --sjdbGTFfile parameters? Will STAR index the genome every time, or will it check whether an index exists and skip indexing?
Basically I'm asking if there is a drawback to using the aforementioned parameters in every STAR run.

Many thanks,
Gil

Alexander Dobin

Aug 7, 2017, 3:12:06 PM
to rna-star
Hi Gil,

You need to generate the genome index for the genome sequence (FASTA file) before mapping.
--sjdbGTFfile and --sjdbOverhang add annotations to the genome index, which can be done at the genome indexing stage or at the mapping stage.
The mapping results will be the same; however, in the latter case every mapping job will spend a few minutes adding the junctions to the index.
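
A minimal sketch of the first option (annotate once at the indexing stage, then map without the --sjdb options) might look like the following. The function names and the STAR_BIN variable are hypothetical conveniences; setting STAR_BIN=echo only prints the arguments instead of running STAR. File names and extra mapping options would need to be adapted.

```shell
# STAR_BIN is a hypothetical override, e.g. STAR_BIN=echo to preview the arguments.
STAR_BIN=${STAR_BIN:-STAR}

# One-time step: build the index with the annotated junctions included.
build_annotated_index() {
    local genome_dir=$1 fasta=$2 gtf=$3 overhang=$4
    "$STAR_BIN" --runMode genomeGenerate \
        --genomeDir "$genome_dir" \
        --genomeFastaFiles "$fasta" \
        --sjdbGTFfile "$gtf" \
        --sjdbOverhang "$overhang"
}

# Every mapping job: no --sjdbGTFfile or --sjdbOverhang is passed, so the
# junctions do not have to be inserted again at the start of each run.
map_reads() {
    local genome_dir=$1 prefix=$2
    shift 2
    "$STAR_BIN" --genomeDir "$genome_dir" \
        --readFilesIn "$@" \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$prefix"
}
```

Usage would be along the lines of `build_annotated_index ./STAR_index/ genome.fa genes.gtf 100` once, followed by `map_reads ./STAR_index/ sample1_ sample1_R1.fastq sample1_R2.fastq` for each sample.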

Cheers
Alex

Gil S

Aug 8, 2017, 12:02:11 PM
to rna-star
Dear Alex

I am confused. I am trying to determine the best way to map using STAR.
We use indexes generated by Illumina (iGenomes) for the organisms they have available. They index using --sjdbOverhang 0. We map using --sjdbOverhang 100 with --sjdbGTFfile and a GTF file, as well as --twopassMode Basic. To my understanding, this means that every time we map using STAR it will re-index (taking several minutes in each run) since --sjdbOverhang was used. Executing STAR in this way will map reads to junctions defined in the GTF (and reflected in the index) as well as find new junctions (which are also added to the index, since we are using --twopassMode Basic). Since we do not have write permissions to the Illumina folder, I can only assume that temporary index files are written for EACH run (and rewritten in each run).
Am I correct in my understanding?

In addition, we have two types of mapping runs. I am not sure whether having data that does not persist from one run to the next is advantageous. In one type we map only 3' ends with an appropriate GTF, and in the second we map to entire transcripts, with its own GTF.
Is there any disadvantage to mapping as we do right now?

Thanks for all your help,
Gil

Alexander Dobin

Aug 9, 2017, 12:26:12 PM
to rna-star
Hi Gil,

I am not sure how the iGenome genome generation was run.
Do they have the Log.out output for that run? If not, please post the genomeParameters.txt file from the genome directory.
It looks like they did not use the annotations, which means you have to add them at the mapping stage -  so what you are doing is correct.

>>> "We map using --sjdbOverhang 100 with --sjdbGTFfile and a GTF file, as well as --twopassMode Basic. To my understanding, this means that every time we map using STAR it will re-index (taking several minutes in each run) since --sjdbOverhang was used. Executing STAR in this way will map reads to junctions defined in the GTF (and reflected in the index) as well as find new junctions (which are also added to the index, since we are using --twopassMode Basic)."

This is correct.

The modified index is kept in RAM and by default is not saved to disk. You can save the modified index (with inserted junctions) by using the --sjdbInsertSave All option. It will be saved in the _STARgenome/ directory inside the run directory, and can be used as the STAR genome directory for other runs.
So you could run one job with --genomeDir iGenome/STAR/ --sjdbGTFfile ... --sjdbOverhang ... --sjdbInsertSave All, and use the resulting _STARgenome/ for future runs - it will contain the junctions from the GTF. Again, the mapping results will be the same - it will just save you a few minutes for every mapping run.
However, it might be even simpler to get the FASTA file used in iGenome generation and re-generate the genome index with it and --sjdbGTFfile ... --sjdbOverhang ... .
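
The one-off job described above could be sketched roughly as follows. The function name and the STAR_BIN variable are hypothetical (STAR_BIN=echo only prints the arguments), and the paths, overhang and read files are placeholders to be filled in for the actual data.

```shell
# STAR_BIN is a hypothetical override, e.g. STAR_BIN=echo to preview the arguments.
STAR_BIN=${STAR_BIN:-STAR}

# Map one sample against an index built without annotations, insert the GTF
# junctions on the fly, and keep the modified index on disk.
map_and_save_index() {
    local base_index=$1 gtf=$2 overhang=$3 prefix=$4
    shift 4
    "$STAR_BIN" --genomeDir "$base_index" \
        --sjdbGTFfile "$gtf" \
        --sjdbOverhang "$overhang" \
        --sjdbInsertSave All \
        --readFilesIn "$@" \
        --outFileNamePrefix "$prefix"
}
```

With a prefix such as `run1/`, the saved index would be expected under `run1/_STARgenome/`, which later jobs could pass to --genomeDir without repeating the --sjdb options.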

Cheers
Alex

Gil S

Aug 10, 2017, 4:38:08 AM
to rna-star
Hi Alex

Thanks for the information.

The contents of genomeParameters.txt are:
 ### STAR   --runMode genomeGenerate   --runThreadN 30   --genomeDir hg38_star_index   -- WholeGenomeFasta/genome.fa
versionGenome   20201
genomeFastaFiles        WholeGenomeFasta/genome.fa
genomeSAindexNbases     14
genomeChrBinNbits       18
genomeSAsparseD 1
sjdbOverhang    0
sjdbFileChrStartEnd     -
sjdbGTFfile     -
sjdbGTFchrPrefix        -
sjdbGTFfeatureExon      exon
sjdbGTFtagExonParentTranscript  transcript_id
sjdbGTFtagExonParentGene        gene_id
sjdbInsertSave  Basic


We would like to use the --sjdbInsertSave All option when mapping using STAR, which means that any junction previously found (whether via GTF or a novel junction) can be used in a subsequent run. How do we use the modified index stored in the _STARgenome/ directory in a subsequent run?

Many thanks, 
Gil

Alexander Dobin

Aug 16, 2017, 4:32:18 PM
to rna-star
Hi Gil,

The --sjdbInsertSave All option will only add to the genome those junctions that you supplied with the --sjdbGTFfile and --sjdbFileChrStartEnd options; it will not save the genome with the novel junctions inserted.
If you want to use novel junctions in future mapping runs, you need to run STAR again, supplying it with the detected junctions file SJ.out.tab, i.e. with the options --sjdbGTFfile /path/to/GTF --sjdbFileChrStartEnd SJ.out.tab --sjdbInsertSave All.

To use the saved index, move/rename _STARgenome/ to a new "central" location, and in future mapping runs simply specify the path to it in the --genomeDir parameter.
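
Put together, the two steps in this reply might be sketched as below. The function names and STAR_BIN are hypothetical helpers (STAR_BIN=echo only prints the arguments). Passing --sjdbOverhang here is carried over from the earlier messages as an assumption about the base index, and the refusal to overwrite an existing directory is just a cautious design choice, not something STAR requires.

```shell
# STAR_BIN is a hypothetical override, e.g. STAR_BIN=echo to preview the arguments.
STAR_BIN=${STAR_BIN:-STAR}

# Re-run STAR with the GTF plus the junctions detected in an earlier run
# (SJ.out.tab), and save the index with all of them inserted.
insert_novel_junctions() {
    local base_index=$1 gtf=$2 sj_tab=$3 overhang=$4 prefix=$5
    shift 5
    "$STAR_BIN" --genomeDir "$base_index" \
        --sjdbGTFfile "$gtf" \
        --sjdbFileChrStartEnd "$sj_tab" \
        --sjdbOverhang "$overhang" \
        --sjdbInsertSave All \
        --readFilesIn "$@" \
        --outFileNamePrefix "$prefix"
}

# Move the saved _STARgenome/ directory to a shared location.
# Refuses to overwrite an existing destination.
promote_saved_index() {
    local saved=$1 central=$2
    [ -d "$saved" ] || { echo "no saved index at $saved" >&2; return 1; }
    [ ! -e "$central" ] || { echo "$central already exists" >&2; return 1; }
    mkdir -p "$(dirname "$central")" && mv "$saved" "$central"
}
```

After something like `promote_saved_index run2/_STARgenome /shared/star/hg38_with_junctions` (both paths are placeholders), later jobs would point --genomeDir at the new location.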

Cheers
Alex