two-pass mapping for "too many" samples


Vasily A.

Apr 3, 2019, 10:38:22 PM
to rna-...@googlegroups.com
I am trying to map RNA-seq data (paired-end, 2x150bp, about 25 M reads per sample) for 150 samples. From the manual, I see that "it is recommended to collect 1st pass junctions from all samples".
So my --readFilesIn part will be extremely long:
 --readFilesIn path/to/sample1_read1.fq,path/to/sample2_read1.fq,...,path/to/sample150_read1.fq path/to/sample1_read2.fq,path/to/sample2_read2.fq,...,path/to/sample150_read2.fq

I hope I will not have any issues because of the length of the command, but I am worried about computational resources, especially walltime: it is hard to predict how long it will take, and if my job is terminated because of the cluster's time limit, I will lose everything and have to start again from scratch, right? Is there any way to split the job into smaller parts but get the same result at the end? Also, any recommendations on choosing the parameters for this specific situation (amount of memory to request from the cluster, etc.)?

Thanks!

P.S. Or maybe I understand it wrong altogether, and "collect 1st pass junctions from all samples" means to run it separately for each sample? Then I don't understand in which cases it is useful to run multiple samples on the same command line. I see that we get a single SJ.out.tab in this case - will it be the same as a summary of the individual SJ.out.tab files from single-sample runs?

Alexander Dobin

Apr 4, 2019, 12:32:48 PM
to rna-star
Hi Vasily,

you can run all samples independently, then simply concatenate the SJ.out.tab files from all samples and feed the merged list back to STAR for genome generation.
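A minimal sketch of this merge step (file names here are illustrative; real SJ.out.tab files have nine columns, of which the first four - chromosome, intron start, intron end, strand - identify a junction):

```shell
# Create two toy per-sample junction files (first four SJ.out.tab columns only):
printf '%b\n' 'chr1\t100\t200\t1' 'chr2\t50\t80\t2'  > s1_SJ.out.tab
printf '%b\n' 'chr1\t100\t200\t1' 'chr3\t10\t40\t1'  > s2_SJ.out.tab

# Concatenate all per-sample files, keep the coordinate columns,
# and collapse duplicate sites seen in several samples:
cat *_SJ.out.tab | cut -f1-4 | sort -u > merged_SJ.tab

# merged_SJ.tab now holds 3 unique junctions, ready to be fed back to STAR
# (e.g. via --sjdbFileChrStartEnd at genome generation or at the 2nd pass).
wc -l < merged_SJ.tab
```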
The danger here is that you may get too many junctions from 150 samples, which may slow down the 2nd pass mapping and also bring down the unique mapping rate.
Here are some ideas for filtering the junctions:

Cheers
Alex

Vasily A.

Apr 23, 2019, 5:40:45 AM
to rna-star
Hi Alex,
sorry for getting back to this question, but even after reading the post you recommended (and some others) I still have some issues and don't quite understand how it works. I would be very grateful if you could take a look and give some comments.

For the first pass, I used the following parameters (ENCODE options are taken from the manual section 3.2.2):
 STAR \
   --outFileNamePrefix ./${sN}_ \
   --runThreadN 16 \
   --genomeDir $fndRefIndSTAR \
   --readFilesIn $fnFQ1 $fnFQ2 \
   --outSAMtype None \
   --genomeLoad LoadAndKeep \
   --outFilterType BySJout \
   --outFilterMultimapNmax 20 \
   --alignSJoverhangMin 8 \
   --alignSJDBoverhangMin 1 \
   --outFilterMismatchNmax 999 \
   --outFilterMismatchNoverReadLmax 0.04 \
   --alignIntronMin 20 \
   --alignIntronMax 1000000 \
   --alignMatesGapMax 1000000



I had the following results (numbers from Log.final.out, median of all samples):
Uniquely mapped reads:  26,188,803 (92.04%)
Number of splices: Total:  19,588,401
Number of splices: Annotated (sjdb):  19,402,818
Number of reads mapped to multiple loci:  1,625,906 (5.81%)
Number of reads mapped to too many loci:  3,425 (0.01%)
% of reads unmapped: too short: 2.00%
% of reads unmapped: other: 0.05%

Each SJ.out.tab contained, on average, 276,798 records (198,040 annotated and 78,758 unannotated).
When I merged all my 138 SJ.out.tab files together, I got 38,198,183 records (27,329,514 annotated and 10,868,669 unannotated).
This consisted of 2,295,946 unique sites (311,952 annotated and 1,983,994 unannotated). Of the unannotated sites, most occurred in only a couple of samples (59% were found in only one sample).
I used the following script to extract unannotated sites and filter out rare sites:
 for f in *_SJ.out.tab; do
   awk '{ if ($6==0) {print $1,$2,$3,$4} }' $f
 done > ALL_unAnn.tab
 wc -l ALL_unAnn.tab    # 10,868,669 unannotated junctions

 cat ALL_unAnn.tab | sort | uniq -c > unAnn_uniq.tab
 wc -l unAnn_uniq.tab   # 1,983,994 unique

 # sites found in more than one / more than two samples:
 awk 'BEGIN {OFS="\t"} { if ($1 > 1) { print $2,$3,$4,$5 } }' unAnn_uniq.tab > unAnn_gt1.tab
 awk 'BEGIN {OFS="\t"} { if ($1 > 2) { print $2,$3,$4,$5 } }' unAnn_uniq.tab > unAnn_gt2.tab
 wc -l unAnn_gt1.tab    # 822,348
 wc -l unAnn_gt2.tab    # 581,057


Then I tried to use the resulting list for the second pass by running, for each sample:
 STAR \
   --outFileNamePrefix ./${sN}_ \
   --runThreadN 16 \
   --genomeDir $fndRefIndSTAR \
   --sjdbFileChrStartEnd unAnn_gt1.tab \
   --readFilesIn $fnFQ1 $fnFQ2 \
   --outReadsUnmapped Fastx \
   --outSAMunmapped Within KeepPairs \
   --outFilterType BySJout \
   --outFilterMultimapNmax 20 \
   --alignSJoverhangMin 8 \
   --alignSJDBoverhangMin 1 \
   --outFilterMismatchNmax 999 \
   --outFilterMismatchNoverReadLmax 0.04 \
   --alignIntronMin 20 \
   --alignIntronMax 1000000 \
   --alignMatesGapMax 1000000


When I tried the first list of 822,348 sites detected in more than one sample, I got the error "Fatal LIMIT error: the number of junctions to be inserted on the fly =1183128 is larger than the limitSjdbInsertNsj=1000000", so I went for the smaller list of 581,057 sites detected in more than two samples. This run was about 5 times slower and produced significantly fewer uniquely mapped reads: only 88% (compared to 92% in the first run), while the number of multi-mapped reads increased (from 6% to 10% for reads mapped to multiple loci).

I have a lot of questions, and would be very grateful if you could answer at least some of them or point me to specific resources to read (besides the manual, I have read your "Optimizing RNA-Seq Mapping with STAR" paper - but maybe not thoroughly enough).
1. Do the parameters and output numbers look normal overall? For example, I'm not sure if it's correct to use --outFilterType BySJout for both passes.
2. For the first pass - do I understand correctly that it is needed only to generate SJ.out.tab, so we can completely disable SAM/BAM output?
3. Why does the second pass produce more multi-mapped reads? Why more unmapped reads?
4. In the second pass, does STAR still try to detect novel junctions in addition to those we added by providing --sjdbFileChrStartEnd file from the first pass?

Thanks a lot!!!

Alexander Dobin

Apr 26, 2019, 12:45:25 PM
to rna-star
Hi Vasily,

could you post the Log.final.out files for the 1st and 2nd passes for one of the samples? It would make it easier to comment.
1. Do the parameters and output numbers look normal overall? For example, I'm not sure if it's correct to use --outFilterType BySJout for both passes.
--outFilterType BySJout filters out of the SAM file those reads that contain unannotated junctions that did not make it into the SJ.out.tab.
So it's not needed in the 1st pass.

2. For the first pass - do I understand correctly that it is needed only to generate SJ.out.tab, so we can completely disable SAM/BAM output?
Correct.

3. Why does the second pass produce more multi-mapped reads? Why more unmapped reads?
As you add a large number of junctions into the sequence space, the reads have more opportunities to multi-map; this is especially true for short splice overhangs, which you are allowing with --alignSJDBoverhangMin 1.
You can try to increase this parameter (the default is 3) in the 2nd pass, or filter the junctions more stringently.

4. In the second pass, does STAR still try to detect novel junctions in addition to those we added by providing --sjdbFileChrStartEnd file from the first pass?
Yes, it will still detect novel splices, though they will be very rare.

Cheers
Alex

Vasily A.

Apr 26, 2019, 3:05:49 PM
to rna-star
Alex, thanks a lot for your help! I attach the log files.

For question #4 - just curious, is there any way to skip the detection of novel splices, and would it make the job faster?
pass1_Log.final.out
pass2_Log.final.out

Alexander Dobin

Apr 27, 2019, 12:32:35 PM
to rna-star
Hi Vasily,

it does not take much time to discover the few novel junctions in the 2nd pass.
It's the mapping to the 1st-pass junctions that takes a long time and slows down the 2nd pass, so you would need to filter the junctions more stringently.
A few suggestions:
1. Filter out the junctions on chrM, those are most likely to be false.
2. Filter out non-canonical junctions (column5 == 0).
3. Filter out junctions supported by multimappers only (column7==0)
4. Filter out junctions supported by too few reads (e.g. column7<=2)
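As a sketch, filters 1-4 can be combined in a single awk pass over each per-sample SJ.out.tab (columns per the STAR manual: 1 chromosome, 2 intron start, 3 intron end, 4 strand, 5 motif, 6 annotated flag, 7 uniquely mapping reads; file names here are illustrative):

```shell
# Toy SJ.out.tab: each line is one junction with STAR's nine columns.
printf '%b\n' \
  'chr1\t1000\t2000\t1\t1\t0\t5\t3\t30' \
  'chrM\t100\t300\t1\t1\t0\t9\t0\t25' \
  'chr2\t5000\t6000\t2\t0\t0\t8\t1\t40' \
  'chr3\t700\t900\t1\t2\t0\t1\t7\t15' > sample_SJ.out.tab

# Keep junctions that are not on chrM (1), have a canonical motif,
# i.e. column 5 > 0 (2), and have more than 2 uniquely mapping reads,
# which also drops multimapper-only junctions (3 and 4):
awk 'BEGIN{OFS="\t"} $1!="chrM" && $5>0 && $7>2 {print $1,$2,$3,$4}' \
    sample_SJ.out.tab > sample_SJ.filtered.tab

cat sample_SJ.filtered.tab   # only the chr1 junction passes all four filters
```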

Cheers
Alex

Vasily A.

Apr 29, 2019, 7:09:06 AM
to rna-star
thanks Alex!
I applied more stringent filtering by the number of unique reads (I kept only sites with column7 > 5), so the final tab file was more than 2x smaller (274,000 sites). As a result, the 2nd pass was relatively fast, and the number of multimapped reads increased only slightly compared to the 1st pass.
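For reference, that stricter filter can be sketched like this (file names are illustrative; column 7 of SJ.out.tab is the number of uniquely mapping reads crossing the junction):

```shell
# Toy per-sample file with STAR's nine columns;
# only the second junction has more than 5 unique reads:
printf '%b\n' \
  'chr1\t100\t200\t1\t1\t0\t3\t0\t20' \
  'chr2\t300\t400\t2\t2\t0\t8\t1\t35' > s1_SJ.out.tab

# Keep junctions supported by more than 5 uniquely mapping reads:
awk 'BEGIN{OFS="\t"} $7>5 {print $1,$2,$3,$4}' s1_SJ.out.tab > s1_SJ.gt5.tab
```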