Outline for running STAR 2pass with multiple samples and both pair-ended and single-ended reads

1,945 views
Skip to first unread message

Mauricio Losilla

unread,
Jun 28, 2015, 7:27:01 PM6/28/15
to rna-...@googlegroups.com

Hello,

 

I am trying to use STAR 2pass mode to map multiple samples from pair-ended RNAseq (i.e. RNA from organ1, organ2, organ3, etc) to one reference genome. I cleaned my sequences with Trimmomatic, and as a consequence I have 2 files with pair-ended reads, but also one with single-ended reads (I merged all individual survivors from Trimmomatic into one file).

 

My goal is to map all reads, both pair-ended and single-ended, from all organs, to the same reference genome. I want to use STAR 2pass mode, making sure that the 2 pass for every read uses the junctions detected from the 1 pass of all reads.

 

I searched extensively in previous topics. Other users have asked similar questions before, but they were using old versions of STAR, where the 2pass mode was implemented in a different manner. I gathered useful information from those topics, but I didn’t find an exact answer. Hence I open this new topic.

 

This is an outline of what I am thinking of doing:

 

1)      Run Star 2pass mode for all pair-ended files from all organs (i.e. --readFilesIn organ1F, organ1R, organ2F, organ2R, organ3F, organ3R, etc), using my preferred –outSJfilter options,  --twopassMode Basic  --sjdbInsertSave All

 

I hope that this will map all pair-ended reads using filtered junctions from all organs; but most importantly, will save the reference genome with all these junctions.

 

2)      Run Star 2pass mode with the same options as before, but for all single-ended reads from all organs (i.e. –readFilesIn organ1SE, organ2SE, organ3SE, etc).

 

I expect that this will map all single-ended reads using filtered junctions from both the pair-ended (step 1) and single-ended (this step) mapped reads, and add the newly detected junctions to the reference genome.

 

3)      Run Star 1pass mode for all pair-ended files from all organs, as in step 1.

 

What I want here is to map all pair-ended reads using the junctions from both pair-ended and single-ended mapped reads. Also, I want to keep the reference genome with all the junctions, which I could use later, for example to analyze each organ individually.

 

Lastly, I want to create a single sortedBam file as the final output. I want this file to have both pair-ended and single-ended reads. Naturally, I want all the original information for each read, such as pair or single- ended, mapped or unmapped, etc. My idea is to use this unified Bam file in Trinity.

 

Maybe naively, I am thinking of simply merging the sortedBams generated from steps 2 and 3:

 

4)      cat SortedBamPE SortedBamSE > SortedBamAllReads

 

Is this how I should proceed? Please comment/correct/suggest.

 

I am using STAR 2.4.1d, but I plan to move to the newly released 2.4.2a in a few days.

 


Many thanks,

Mau

Alexander Dobin

unread,
Jun 30, 2015, 6:08:35 PM6/30/15
to rna-...@googlegroups.com, mlosi...@gmail.com
Hi Mau,

I would recommend the following multi-mapping strategy.

1. Generate the genome with annotations.
2. Map all the different samples, SE and PE separately, without using any 2-pass options.
3. Concatenate all SJ.out.tab files. Remove junctions on chrM (false positives, may slow down the 2nd pass).
4. For one of the samples, run the 2nd pass inserting the new junctions:
STAR ... --genomeDir /path/to/pass1/genome/ --sjdbFileChrStartEnd /path/to/SJ.out.tab.combined.filtered --sjdbInsertSave All
The last option will save the 2nd pass genome (with new junctions inserted) into _STARgenome directory in the run directory.
5. Re-map all sample  to this new genome.

To merge sorted BAM files (and get sorted file out), you need to use the 'samtools merge' command.
Concatenating the SE and PE in one file is a bit more complicated, since you would need to set the proper flags for the SE reads.
You can do it while mapping SE reads using --outSAMflagOR option.
I believe you would need to set this number to 0x1+0x8+0x40=73 for the first read, and 0x1+0x8+0x80=137 for the 2nd read.

Cheers
Alex

Mauricio Losilla

unread,
Jul 6, 2015, 5:07:31 PM7/6/15
to rna-...@googlegroups.com, mlosi...@gmail.com

Hi Alex,

Thank you very much for your suggestions. I followed your strategy, and since I had also started with my own, I compared the final outputs. I think they yielded very similar results. In my opinion, the most striking difference is the number of canonical and non-canonical splices detected by each strategy. I am attaching an excel file with the comparisons, if anyone is interested. I wish I could get the same stats from the Log.final.out file on the Merged Bam file!


As for your approach, I generated the initial genome without annotations, because I don't have any. Also, I couldn't filter mtDNA, because it hasn't been ID and labeled as such on my genome. 
To merge the Bam files from the paired-end (PE) and single-end (SE) analyses, I used your suggestion of samtools merge. I think it worked fine, thanks.

As of my strategy, the only change I did was running the SE files on steps 1 and 3, and the PE on step 2. This way it took a lot less time.

All filtering options where the default.


Thank you very much for your input

Mau
STARcomparison.xlsx

Alexander Dobin

unread,
Jul 8, 2015, 11:40:14 AM7/8/15
to rna-...@googlegroups.com, mlosi...@gmail.com
Hi Mauricio,

the results look very similar to me, overall the differences are insignificant.
I think your approach would take more time, since you are effectively doing 3 passes of mapping - 2-pass mapping in the 1st step for each sample separately, and another 1pass in the 3rd step with junction collected from all samples.

Cheers
Alex

Mauricio Losilla

unread,
Jul 8, 2015, 2:22:19 PM7/8/15
to rna-...@googlegroups.com, mlosi...@gmail.com

With yours I did two 1-pass mode runs, a 2-pass mode run, and a 1-pass mode run.
With mine I did two 2-pass mode runs, and a 1-pass mode run.

I didn't see any obvious differences in running times, at least under my computing resources. I think your strategy would be more useful if handling a very large number of PE reads.

I am very happy with this, thank you very much for your guidance

Mau

Alexander Dobin

unread,
Jul 9, 2015, 5:59:30 PM7/9/15
to rna-...@googlegroups.com, mlosi...@gmail.com
Hi Mauricio,

I was suggesting 1-pass run for both SE and PE, then collecting the junctions and re-generating the genome, and then another 1-pass run for PE and SE.
But if you do not have computational time constraints, making any of those 2-pass run would not hurt.

Cheers
Alex

Tommy Carstensen

unread,
Apr 11, 2017, 10:37:57 AM4/11/17
to rna-star, mlosi...@gmail.com
Can STAR accept multiple fasta files from different samples as input? It runs for a while, but then the log file is not updated. Is my fasta file somehow malformed? Is there a tutorial on how to use STAR? I've already read the manual, but it contains no real world examples. Thanks.

Here my fasta:

>HS31_21758:1:1212:12019:41489/1

GGGAGCTCCCTGGACTGAAGGAGACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTTGGCCTACAGGGGCCGC

>HS31_21758:1:1108:9330:41982/1

GAAGGAGACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTTGGCCTACAGGGGCCGCGGTTGAGGGTGGGAGT

>HS31_21758:1:1316:11018:39202/1

ACGCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTTGGCCTACAGGGGCCGCGGTTGAGGGTGGGAGTGGGGATG

>HS31_21758:1:1302:5188:24725/1

GCGCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTTGGCCTACAGGGGCCGCGGTTGAGGGTGGGAGTGGGGATGCA

>HS31_21758:1:1206:5924:68173/1

GCTGCTGCTGCTGTCGTCCTGCCTGGCGCCTTGGCCTACAGGGGCCGCGGTTGAGGGTGGGAGTGGGGATGCACT


Best wishes,
Tommy

Alexander Dobin

unread,
Apr 11, 2017, 10:48:32 AM4/11/17
to rna-star, mlosi...@gmail.com
Hi Tommy,

please send me the Log.out file.
STAR can run multiple samples simultaneously, but only for all PE or all SE reads, PE and SE reads have to be mapped separately.
If you want to run 2-pass for all samples combined, you would need to do it manually, i.e. run the 1st pass for all samples, combine the detected junctions, and then run 2ns pass on all samples.

For a simple tutorial, please check this paper:

Cheers
Alex

Stephanie

unread,
May 21, 2018, 3:55:47 PM5/21/18
to rna-star
Hello,

I am running STAR (v2.6.0a) similar to Mauricio, in which I would like to map both PE and SE reads to a reference genome.  The SE reads are orphan reads after quality trimming in which the mate was discarded due to low quality.  I have one SE read file per sample, including both orphan R1 and R2 reads.

Alex, in your previous post you had mentioned that after mapping SE and PE reads separately, you can merge the PE and SE BAM files but need to set the proper flags for SE reads.  (During the SE mapping step, the --outSAMflagOR option can be set as 73 for the first read or 137 for the 2nd read).

Since all of my orphan R1 and R2 reads are in a single fastq file per sample, I wanted to ask whether it is valid to map a single SE file per sample in this manner.  In such case, is it possible to set the SAM flags appropriately for R1 and R2 when running STAR?  Alternatively, do I need to create 2 SE files per sample (one for R1 and one for R2) and map these separately so that I can set the appropriate SAM flags?  In the latter case, I imagine I would need to map 3 times (PE, SE R1, SE R2), concatenate SJ.out.tab files, remap 1 sample with the new junctions, remap all 3 sets with the new genome, then merge sorted BAM files.  As another option, can I edit the SE SAM flags appropriately after mapping?

Any advice you have is much appreciated!

Thank you kindly,
Stephanie

Alexander Dobin

unread,
May 22, 2018, 6:01:08 PM5/22/18
to rna-star
Hi Stephanie,

the alignments will be exactly the same, whether the SE R1 and R2 are in one file, or in two separate files.
However, if you want to keep track which read is R1 and which is R2 downstream of STAR, you would need to set different FLAGs for them.
You can do it by post-processing the STAR's alignment files, but in this case, you would need to encode R1/2 in the read name.
If you want to use --outSAMflagOR option, you will need R1 and R2 in separate files, and run STAR separately for each.

For the 2-pass alignment, in the 1st pass you do not need to keep track of R1/2, as no SAM is generated, so you can map SE R1/2 together from one file or comma-separated list.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages