STAR mapping pipeline with 2-pass for multiple samples

4,412 views
Skip to first unread message

Rajesh Detroja

unread,
Jun 3, 2019, 10:32:20 AM6/3/19
to rna-star
Dear Alex,

I am running an STAR 2.7.0f mapping pipeline with 2-pass mode for multiple samples of different diseases and healthy samples as follows:



1) Indexing genome with annotations
STAR --runMode genomeGenerate --genomeDir ~/db/hg38/ --genomeFastaFiles ~/db/hg38/hg38.fa --sjdbGTFfile ~/db/hg38/hg38.gtf --runThreadN 30 --sjdbOverhang 89

Note:
- Indexing for maximum read lengh 90 bp.




2) 1-pass mapping with indexed genome

STAR --genomeDir ~/db/hg38/ --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix sample1. --runThreadN 30

Notes:
- The same command has been run for multiple samples in the for loop, therefore, it will generate SJ.out.tab file for each sample.
- Next, I have copied SJ.out.tab files of all the samples into the a single folder "SJ_out"





3) Indexing genome with annotations and SJ.out.tab files

STAR --runMode genomeGenerate --genomeDir ~/db/hg38/SJ_Index/ --genomeFastaFiles ~/db/hg38/SJ_Index/hg38.fa --sjdbGTFfile ~/db/hg38/SJ_Index/hg38.gtf --runThreadN 30 --sjdbOverhang 89 --sjdbFileChrStartEnd SJ_out/*.SJ.out.tab


Note:
- Again indexing for maximum read lengh 90 bp.




2) 2-pass mapping with new indexed genome with annotations and SJ.out.tab files

STAR --genomeDir ~/db/hg38/SJ_Index/ --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix sample1. --runThreadN 30

Notes:
- Again, same command has been run for multiple samples in the for loop, therefore, it will genreated mapping files for each sample.


Could you please help me to check if all the commands I am running correctly or do you have any suggestions?

Alexander Dobin

unread,
Jun 5, 2019, 5:44:47 PM6/5/19
to rna-star
Hi Rajesh,

the commands look good to me. The only concern is that, if you have too many samples, and too many novel junctions are detected across all samples, the 2nd pass may become too slow and the number of multimappers may be increased.
Then you would need to filter the set of junctions you are using for the 2-pass genome.

Cheers
Alex

Rajesh Detroja

unread,
Jun 6, 2019, 1:43:57 AM6/6/19
to rna-...@googlegroups.com
Dear Alex,

Thanks for the confirming the commands.

According to one of your recent post you have suggested following possible criterias to filter the merged junction file of multiple samples, Is that correct?

1. Filter out the junctions on chrM, those are most likely to be false. That's mean remove all chrM records.

2. Filter out non-canonical junctions (column5 == 0). That's mean (column5 > 0).

3. Filter out junctions supported by multi mappers only (column7==0). That's mean (column7 > 0)

4. Filter out junctions supported by too few reads (e.g. column7<=2). That's mean (column7 > 2)


One more question, I should merge junction files of diseases and healthy samples into one or I need to make separate index for diseases and healhy for 2-pass?




Alexander Dobin

unread,
Jun 6, 2019, 6:49:08 PM6/6/19
to rna-star
Hi Rajesh,

this looks like a good filtering strategy. However, you may need to make filtering more stringent if you see a significant slowdown of the 2nd pass, or increase in multimappers.

If you are planning to compare novel splice junctions between healthy and diseased samples, it's better to combine the junctions from both healthy and diseased samples into one index.

Cheers
Alex

Rajesh Detroja

unread,
Jun 7, 2019, 4:41:37 AM6/7/19
to rna-...@googlegroups.com
Hello Alex,


Thanks for your help.


I have calculated the number of junctions after filtering as follows and I have decided to take filtration with 340,968 Junctions remained:


What do yout think?


cat *.tab | awk '($5 > 0 && $7 > 2)' | wc -l : 32,985,210 Junctions remained


cat *.tab | awk '($5 > 0 && $7 > 5)' | wc -l: 26,981,895 Junctions remained


cat *.tab | awk '($5 > 5 && $7 > 5)' | wc -l: 15,359 Junctions remained


cat *.tab | awk '($5 > 2 && $7 > 5)' | wc -l: 243,547 Junctions remained


cat *.tab | awk '($5 > 1 && $7 > 5)' | wc -l: 13,645,142 Junctions remained


cat *.tab | awk '($5 > 1 && $7 > 5)' | grep -v "MT" | wc -l: 13,645,060 Junctions remained

cat *.tab | awk '($5 > 2 && $7 > 2)' | wc -l: 340,968 Junctions remained

Alexander Dobin

unread,
Jun 7, 2019, 10:37:55 AM6/7/19
to rna-star
Hi Rajesh,

I think you are counting multiple times the same junctions from many samples. You would need to "collapse" the junctions, i.e. remove the duplicate ones after filtering, e.g.
cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq |wc -l
I added $6==0 condition to avoid counting annotated junctions since they are always included from the GTF.
You can feed the resulting small file to STAR, or the original non-collapsed file - STAR will collapse the junctions before inserting them.

Cheers
Alex
Message has been deleted

Rajesh Detroja

unread,
Jun 7, 2019, 2:10:43 PM6/7/19
to rna-...@googlegroups.com
Dear Alex,

Thank you so much for your help!

Finally, I have total 501,404 junctions remaining that I will feed to START for generating a new index for 2-pass mapping.

cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq | wc -l

I got total 501,404 records.

Moreover, 2-pass mapping is not slow at all, I feel as much fast as 1-pass.

Please mark this post as completed, I hope it this post will be useful for many peoples too.

Alexander Dobin

unread,
Jun 10, 2019, 12:50:31 PM6/10/19
to rna-star
Hi Rajesh!

Thanks a lot for sharing your experience!
It will definitely be helpful to other users.

Cheers
Alex

Pedro Milanez-Almeida

unread,
Jun 11, 2019, 11:41:19 AM6/11/19
to rna-star
Dear Dr. Dobin and Dr. Detroja,

If I want to keep the resulting file after using 


cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq |wc -l

all I need to do is replace "| wc -l" in the end by "> SJ.filtered", right?

Thank you very much!
Pedro

PS: using STAR2.4.2a

Rajesh Detroja

unread,
Jun 11, 2019, 12:14:27 PM6/11/19
to rna-star
Dear Pedro,

Yes, you are correct! You can keep resulting file as follows:


cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq > SJ.filtered.tab

Pedro Milanez-Almeida

unread,
Jun 11, 2019, 12:22:07 PM6/11/19
to rna-star
Awesome! Thank you very much! Also thanks for sharing your code and explaining what it means. It's very helpful.

Sreya Mukherjee

unread,
Feb 5, 2020, 8:55:11 AM2/5/20
to rna-star
Hello

I am quite new to STAR hence was wondering if you could share the loop part of your code to take care of multiple samples being mapped in STAR ?

rohit satyam

unread,
Oct 13, 2020, 8:09:53 AM10/13/20
to rna-star
Hi, I wanted to know if removing the mitochondrial junctions should be practiced for all genomes such as the rice genome too?

Alexander Dobin

unread,
Oct 14, 2020, 7:39:21 PM10/14/20
to rna-star
Hi Rohit,

yes, it makes sense to remove mitochondrial splice junctions for all species.

Cheers
Alex

Bekah W

unread,
Sep 12, 2021, 3:58:43 AM9/12/21
to rna-star
Hi Alex,

Is there a reason why non-canonical splice sites should be removed for running the second pass?

Best wishes,
Bekah

Alexander Dobin

unread,
Sep 24, 2021, 6:07:19 PM9/24/21
to rna-star
Hi Bekah,

in general, non-canonical junctions have a much higher rate of false positives, and so their detection is unreliable.
If you are not specifically interesting in non-canonical junctions, it's better to filter them out before the 2nd pass.

Cheers
Alex

Reply all
Reply to author
Forward
0 new messages