Second pass mapping - novel junction multisample experiment.

fjros...@gmail.com

Mar 11, 2014, 4:26:02 AM
to rna-...@googlegroups.com
Hi Alex,

As you suggested in a previous post (reposted here because the original was not posted in the STAR group), one could merge the junctions collected from several samples when running a multi-sample RNA-Seq experiment. Sorry if this question is too naive, but could you please advise on the criteria for merging the novel junctions of all samples, apart from non-canonical filtering (as described here)? Should I use all those that are unique according to Chr, Start, End, Strand? Should any +/- overlap in start/end be taken into account? Or is this issue handled when --sjdbOverhang is set?

Thanks in advance.

Cheers,

Fernando

Alexander Dobin

Mar 12, 2014, 5:22:15 PM
to rna-...@googlegroups.com
Hi Fernando,

In principle, you can simply concatenate the junctions from all samples after filtering; STAR will take care of "collapsing" identical junctions. The list of "collapsed" junctions is output as the sjdbList.out.tab file in the genome directory.
If you want to disallow junction sites that are close to each other, you would have to do it manually. However, unless there is a large number of such cases, I would not do it, especially if you are already filtering out non-canonical junctions.
I would filter out any junctions that are detected on the mitochondrial genome; those are likely false positives and may reduce mapping speed (see this post).

Cheers
Alex
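
The concatenate, filter, and collapse steps described above can be sketched in Python. This is only an illustration, not a script from the thread. It assumes the standard SJ.out.tab layout (column 1 chromosome, 2 intron start, 3 intron end, 4 strand coded 0/1/2, 5 intron motif with 0 meaning non-canonical) and assumes the mitochondrial contig is named chrM, MT, or M; adjust that to your genome. The function names merge_junctions and write_sjdb are hypothetical.

```python
import csv

# SJ.out.tab column 4 strand code -> strand symbol for the junction list
STRAND = {"0": ".", "1": "+", "2": "-"}


def merge_junctions(sj_files, mito_names=("chrM", "MT", "M")):
    """Return sorted unique (chr, start, end, strand) tuples.

    mito_names is an assumption: contig names differ between genome builds.
    """
    merged = set()
    for path in sj_files:
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 5:
                    continue  # skip blank or malformed lines
                chrom, start, end, strand, motif = row[:5]
                if motif == "0":
                    continue  # drop non-canonical junctions
                if chrom in mito_names:
                    continue  # drop mitochondrial junctions
                # The set collapses junctions that are identical across samples
                merged.add((chrom, int(start), int(end), STRAND.get(strand, ".")))
    return sorted(merged)


def write_sjdb(junctions, out_path):
    """Write Chr <tab> Start <tab> End <tab> Strand, one junction per line."""
    with open(out_path, "w") as out:
        for chrom, start, end, strand in junctions:
            out.write(f"{chrom}\t{start}\t{end}\t{strand}\n")
```

The resulting four-column file is the kind of list that would be passed to --sjdbFileChrStartEnd when regenerating the genome. Exact duplicates are removed here only for convenience, since STAR collapses identical junctions itself.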

fjros...@gmail.com

Mar 12, 2014, 7:42:09 PM
to rna-...@googlegroups.com
Hi Alex,

Thanks for your prompt reply. I tested concatenating all junctions, filtering out mitochondrial and non-canonical junctions using the shell script you posted before, and then letting STAR collapse identical junctions. I have also concatenated, filtered, and collapsed (cat, grep, sort -u, awk) all SJ.out.tab files manually. Both approaches produced almost identical numbers of junctions (minor differences). One thing I saw was that, in both cases, when generating the genome index (using --sjdbFileChrStartEnd), no sjdbList.out.tab file was generated; a sjdbInfo.txt was generated instead. I could see that the sjdbList.out.tab file was generated when using a GTF file and the GTF annotation parameters: --sjdbGTFfeatureExon exon --sjdbGTFtagExonParentTranscript transcript_id.

Is it normal for STAR not to generate sjdbList.out.tab when using --sjdbFileChrStartEnd with a Chr \tab\ Start \tab\ End \tab\ Strand(+or-) file like the one generated using 2 pass mapping?


Thanks in advance.

Cheers,

Fernando
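
One way to track down minor differences like these is to compare two junction lists directly. The sketch below is a hypothetical helper, not something used in the thread. It assumes both lists are tab-separated files with Chr, Start, End, Strand in the first four columns, and it reports the junctions present in only one of them.

```python
def read_junction_keys(path):
    """Read (chr, start, end, strand) keys from a four-column junction list."""
    keys = set()
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 4:
                continue  # skip blank or malformed lines
            chrom, start, end, strand = fields[:4]
            keys.add((chrom, int(start), int(end), strand))
    return keys


def compare_junction_lists(path_a, path_b):
    """Return (only_in_a, only_in_b) as sorted lists of junction keys."""
    a = read_junction_keys(path_a)
    b = read_junction_keys(path_b)
    return sorted(a - b), sorted(b - a)
```

Because strand is part of the key, two junctions with the same coordinates but different strand symbols are reported as different, which matches collapsing on Chr, Start, End, Strand.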

Alexander Dobin

Mar 13, 2014, 11:56:03 AM
to rna-...@googlegroups.com
Hi Fernando,

Sorry, my mistake: I forgot that STAR does NOT generate sjdbList.out.tab when --sjdbFileChrStartEnd is used to input junctions; this is normal behavior. sjdbList.out.tab is not used by STAR at the mapping stage; it is generated for information purposes only, showing which junctions were extracted from the GTF file. The actual junction information that STAR uses while mapping is contained in sjdbInfo.txt, in an internal format, which is generated for both the --sjdbFileChrStartEnd and --sjdbGTFfile options.

Cheers
Alex

fjros...@gmail.com

Mar 13, 2014, 7:23:40 PM
to rna-...@googlegroups.com
Excellent Alex! Thanks for your help and info.

Cheers,

Fernando