Hi Alex,
sorry for getting back to this question, but even after reading the post you recommended (and some others) I still have some issues and don't quite understand how that thing works... I would be very grateful if you could take a look and give some comments.
For the first pass, I used the following parameters (ENCODE options are taken from the manual section 3.2.2):
STAR \
--outFileNamePrefix ./$sN\_ \
--runThreadN 16 \
--genomeDir $fndRefIndSTAR \
--readFilesIn $fnFQ1 $fnFQ2 \
--outSAMtype None \
--genomeLoad LoadAndKeep \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000
I had the followig results (numbers from Log.final.out, median of all samples):
Uniquely mapped reads: 26,188,803 (92.04%)
Number of splices: Total: 19,588,401
Number of splices: Annotated (sjdb): 19,402,818
Number of reads mapped to multiple loci: 1,625,906 (5.81%)
Number of reads mapped to too many loci: 3,425 (0.01%)
% of reads unmapped: too short: 2.00%
% of reads unmapped: other: 0.05%
Each SJ.out.tab contained, in average, 276,798 records (198,040 annotated and 78,758 unannotated).
When I have merged all my 138 SJ.out.tab files together, I got 38,198,183 records (27,329,514 annotated and 10,868,669 unannotated).
This consisted of 2,295,946 unique sites (311,952 annotated and 1,983,994 unannotated). From the unannotated sites, most of them occurred in only couple samples (59% found only in one sample).
I used the following script to extract unannotated sites and filter out rare sites:
for f in *_SJ.out.tab; do
awk '{ if ($6==0) {print $1,$2,$3,$4} } ' $f ;
done > ALL_unAnn.tab
wc -l ALL_unAnn.tab # 10,868,669 unannotated junctions
cat ALL_unAnn.tab | sort | uniq -c > unAnn_uniq.tab
wc -l unAnn_uniq.tab # 1,983,994 unique
awk 'BEGIN {OFS="\t"} { if ($1 > 1) { print $2,$3,$4,$5} }' unAnn_uniq.tab > unAnn_gt1.tab # sites found in more than one sample
awk 'BEGIN {OFS="\t"} { if ($1 > 2) { print $2,$3,$4,$5} }' unAnn_uniq.tab > unAnn_gt2.tab # found in more than two samples
wc -l unAnn_gt1.tab # 822,348
wc -l unAnn_gt2.tab # 581,057
Then I tried to use the resulting list for the second pass by running, for each sample:
STAR \
--outFileNamePrefix ./$sN\_ \
--runThreadN 16 \
--genomeDir $fndRefIndSTAR \
--sjdbFileChrStartEnd unAnn_gt1.tab \
--readFilesIn $fnFQ1 $fnFQ2 \
--outReadsUnmapped Fastx \
--outSAMunmapped Within KeepPairs \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000
When I tried the first list of 822,348 sites detected in more than one sample, I got an error "Fatal LIMIT error: the number of junctions to be inserted on the fly =1183128 is larger than the limitSjdbInsertNsj=1000000", so I went for a smaller list of 581,057 sites detected in more than two samples. For this run, it was about 5 times slower, and produced significantly less uniquely mapped reads: only 88% (compare to 92% in the first run), while number of multi-mapped reads increased (from 6% to 10% for reads mapped to multiple loci).
I have a lot of questions, would be very grateful if you could answer at least some of them or point me to the specific resources to read (besides the manual, I have read your "Optimizing RNA-Seq Mapping with STAR" paper - but maybe not thoroughly enough).
1. Do the parameters and output numbers look normal overall? For example I'm not sure if it's correct to use --outFilterType BySJout for both passes
2. For the first pass - do I understand correctly that it is needed only to generate SJ.out.tab, so we can completely disable SAM/BAM output?
3. Why the second pass produces more multi-mapped reads? why more unmapped reads?
4. In the second pass, does STAR still try to detect novel junctions in addition to those we added by providing --sjdbFileChrStartEnd file from the first pass?
Thanks a lot!!!