filter by SJ.out.tab with multiple runs per sample

Alison Meynert

unread,

Mar 28, 2014, 11:55:13 AM3/28/14

to rna-...@googlegroups.com

I have a batch of human 2x100bp stranded RNA-seq samples to be aligned, and I've settled on STAR 2 pass, influenced in no small part by the GATK RNA-seq best practices, as variant calling is one of the downstream analyses. Most of my samples have multiple runs, so they need to have multiple read groups in order for GATK to correctly handle base quality score recalibration. I've had no problem doing first pass alignment with annotations for each run separately and am working on merging and filtering the splice junctions (SJ.out.tab files) prior to re-generating the genome index. What I'd like to do for the second pass is align each run again using the new index, use Picard AddOrReplaceReadGroups to sort and deal with read groups, merge all the runs for a given sample, and only *then* apply filtering of aligned reads based on the support for the splice junctions. Basically like using --outFilterType BySJout, but post- rather than pre-merge. Has anyone done something similar?

Alexander Dobin

unread,

Mar 31, 2014, 6:51:36 PM3/31/14

to rna-...@googlegroups.com

Hi Alison,

if I understand it correctly, you want to be able to setup different read IDs for multiple run files, while processing the files simultaneously as a comma-separated list in the --readFilesIn.

This is not implemented yet, unfortunately, but it's easy to implement - I will include it in the next patch.

At the moment I could suggest two workarounds for the 2nd pass (1st pass is not affected):

1. Include read group in each read ID before mapping (i.e. modify fastq file), map all the files together with --outFilterType BySJout, and then process SAM by moving read group ID from read ID into a separate SAM field.

Modifying fastq file can be done on the fly - you can run your own script that streams modified fastq files and use it through --readFilesCommand.

2. You can try a 3-pass run, where after 2nd pass you will again collect all junctions (there will be a very few new junctions added), re-generate the genome, and map for the 3rd time not allowing any novel junctions, so you would not need the --outFilterType BySJout option, and you can map all runs separately with different group IDs.

Cheers

Alex

Alison Meynert

unread,

Apr 4, 2014, 9:29:04 AM4/4/14

to rna-...@googlegroups.com

Hi Alex,

Thanks for the reply. The read group ids are based on the existing read ids from the FASTQ, so I'll try option 1 without the input processing. Looking forward to the next patch - already using 2.3.1z for the shared memory fix and it's generally been well behaved.

Cheers,

Alison

Jonathan Keats

unread,

May 5, 2014, 2:59:19 AM5/5/14

to rna-...@googlegroups.com

Hi Alex,

Did this patch to allow multiple different read groups to be included get implemented in one of the recent releases? I didn't notice it on the change log but this would be immensely helpful.

Jonathan

Jonathan Keats

unread,

May 5, 2014, 2:59:44 AM5/5/14

to rna-...@googlegroups.com

Alexander Dobin

unread,

May 6, 2014, 4:49:23 PM5/6/14

to rna-...@googlegroups.com

Hi Jonathan,

it's only partially working, you can list read groups for multiple input files as a comma-separate list.

However, it does not yet work with --outFilterType BySJout option, it was harder to implement than I thought.

I will fix it in the next patch.

Cheers

Alex

Jonathan Keats

unread,

May 8, 2014, 3:04:50 AM5/8/14

to rna-...@googlegroups.com

Thanks for the update Alex, I'd though about trying what you suggested but we use the --outFilterType BySJout so likely a good thing. We also ran into the issue with "z3", have not tried "z4" as we went back the last working version I'd tested "z".

As always, thanks for all your hard work!

Reply all

Reply to author

Forward