If you insert too many junctions, a couple of problems may occur:
1. The mapping speed is reduced significantly.
2. The number of multimappers increases at the expense of unique mappers.
To mitigate these problems, you need to filter the junctions used in the 2nd pass. Here is an approximate strategy:
1. Collapse the junctions from all samples into a set of unique junctions, counting for each junction the total number of reads across all samples and the number of samples it was detected in. I wrote a simple script that does just that:
https://github.com/alexdobin/STAR/blob/master/extras/scripts/sjCollapseSamples.awk
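For reference, a minimal sketch of what this collapsing step does (the real logic lives in sjCollapseSamples.awk; the two SJ.out.tab inputs below are made-up toy data, and the exact key/aggregation choices here are my assumptions, not a copy of the script):

```shell
# Two made-up SJ.out.tab files (columns: chr, intron start, intron end,
# strand, intron motif, annotated, unique reads, multi-mapping reads,
# max overhang).
cat > s1.SJ.out.tab <<'EOF'
chr1 1000 2000 1 1 0 50 5 30
chr1 5000 6000 2 2 0 10 1 25
EOF
cat > s2.SJ.out.tab <<'EOF'
chr1 1000 2000 1 1 0 20 2 40
EOF

# Key on columns 1-6, sum the unique (col 7) and multi-mapping (col 8) read
# counts, keep the maximum overhang (col 9), and append the number of samples
# the junction was detected in as column 10.
awk 'BEGIN{OFS="\t"}
{
  k = $1 OFS $2 OFS $3 OFS $4 OFS $5 OFS $6
  uniq[k] += $7; multi[k] += $8
  if ($9 > ov[k]) ov[k] = $9
  n[k]++
}
END{for (k in n) print k, uniq[k], multi[k], ov[k], n[k]}' \
  s1.SJ.out.tab s2.SJ.out.tab | sort -k1,1 -k2,2n > SJ.collapsed

cat SJ.collapsed
```

The shared junction (chr1:1000-2000) comes out with summed read counts, the larger overhang, and a sample count of 2.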
2. Calculate some statistics on these junctions: number of junctions with different intron motifs (column 5), number of junctions detected in 1,2,3... samples (column 10) etc.
This will give you an idea of how best to filter these junctions.
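As a concrete illustration of step 2, a quick tally by motif and by sample count (the 10-column collapsed file below is toy data; col 5 = intron motif, col 10 = number of samples):

```shell
# Toy collapsed junction file (10 columns, whitespace-separated here).
cat > SJ.collapsed <<'EOF'
chr1 1000 2000 1 1 0 70 7 40 8
chr1 3000 4000 1 0 0 4 1 15 2
chr1 5000 6000 2 2 0 12 0 25 3
chr2 100 900 1 0 0 60 3 35 6
EOF

# Count junctions per intron motif (column 5) and per number of samples
# each junction was detected in (column 10).
awk '{motif[$5]++; nsamp[$10]++}
     END{for (m in motif) print "motif", m ":", motif[m];
         for (s in nsamp) print "detected in", s, "samples:", nsamp[s]}' \
  SJ.collapsed > SJ.stats

cat SJ.stats
```

Here the two non-canonical junctions (motif 0) stand out, which is exactly the class you will want to filter harder in step 3.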
3. Filter the junctions on: (i) the number of samples they were detected in, (ii) the total number of unique/multimapping reads, (iii) the maximum overhang. You may want to apply harsher filtering to non-canonical junctions (col5=0). You would want to bring the number of junctions to <1M.
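A sketch of step 3 as a one-liner (the collapsed file is the same kind of toy data as above, and all thresholds are illustrative assumptions, not recommended values; tune them for your own data):

```shell
# Toy collapsed junction file (10 columns; col 5 = motif, col 7 = unique
# reads, col 9 = max overhang, col 10 = number of samples).
cat > SJ.collapsed <<'EOF'
chr1 1000 2000 1 1 0 70 7 40 8
chr1 3000 4000 1 0 0 4 1 15 2
chr1 5000 6000 2 2 0 12 0 25 3
chr2 100 900 1 0 0 60 3 35 6
EOF

# Keep canonical junctions (col5>0) seen in >=2 samples with >=3 unique
# reads and overhang >=12; require harsher cutoffs for non-canonical
# junctions (col5==0). Thresholds are placeholders.
awk '($5 >  0 && $10 >= 2 && $7 >= 3  && $9 >= 12) ||
     ($5 == 0 && $10 >= 5 && $7 >= 10 && $9 >= 20)' \
  SJ.collapsed > SJ.filtered

cat SJ.filtered
```

With these cutoffs, the weakly supported non-canonical junction (chr1:3000-4000) is dropped and the other three survive.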
4. For the 2nd pass, use --sjdbFileChrStartEnd SJ.filtered /path/to/this/sample/1st/pass/SJ.out.tab
where SJ.filtered is the list of filtered junctions from 3, and /path/to/this/sample/1st/pass/SJ.out.tab
is the SJ.out.tab of the 1st pass for this one sample.
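Putting step 4 together, the 2nd-pass invocation looks roughly like this (the paths, read files, and the options other than --sjdbFileChrStartEnd are placeholders for your own setup; --sjdbFileChrStartEnd accepts multiple files):

```shell
STAR --runThreadN 12 \
     --genomeDir /path/to/genomeDir \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --sjdbFileChrStartEnd SJ.filtered /path/to/this/sample/1st/pass/SJ.out.tab \
     --outSAMtype BAM SortedByCoordinate
```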
You may need to adjust the filtering in step 3 to keep the increase in multimappers to no more than 1-2%.
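One way to check that 1-2% bound is to compare the "% of reads mapped to multiple loci" line that STAR writes to Log.final.out between the two passes (the one-line excerpts below are fabricated stand-ins for real log files):

```shell
# Fabricated one-line excerpts standing in for full Log.final.out files.
cat > pass1.Log.final.out <<'EOF'
% of reads mapped to multiple loci | 3.10%
EOF
cat > pass2.Log.final.out <<'EOF'
% of reads mapped to multiple loci | 4.80%
EOF

# Pull out the percentages (require at least one digit so the leading
# "%" of the label is not matched) and report the increase.
p1=$(grep 'multiple loci' pass1.Log.final.out | grep -o '[0-9][0-9.]*%' | tr -d '%')
p2=$(grep 'multiple loci' pass2.Log.final.out | grep -o '[0-9][0-9.]*%' | tr -d '%')
awk -v a="$p1" -v b="$p2" \
  'BEGIN{printf "multimapper increase: %.2f%%\n", b - a}' > mm.txt

cat mm.txt
```

If the reported increase drifts above a couple of percent, tighten the step-3 thresholds and rerun.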
Even with the reduced SJ set, it still took 6 hours to finish, close to the running time without filtering, and there were few differences in the annotated SJs. The major difference was in the non-canonical SJs, which have a high FDR (I saw your comments on this in another thread). In the end I used the original SJs in this work. Best, Shao