Output in bam rather than in sam

r

Apr 21, 2014, 4:32:34 PM

to rna-...@googlegroups.com

Hi,

I'm probably not the first one to ask this, so I apologize if I missed the answer to this in my searches.

I was wondering if STAR has or will soon have an option to output the alignments in bam rather than in sam?

Some expression estimation tools accept only BAM, and when analyzing many replicates, the SAM-to-BAM conversion takes quite a while. Producing a BAM file directly, without going through a SAM file, would also save valuable disk space.

Thanks a lot,

rubi

Nico

Apr 21, 2014, 5:42:24 PM

to rna-...@googlegroups.com

You can easily output to stdout and pipe it into samtools to write a BAM file.

r

Apr 21, 2014, 7:26:38 PM

to rna-...@googlegroups.com

I see. But that will address only the space issue, not the running time issue, won't it?

Alexander Dobin

Apr 22, 2014, 2:13:40 PM

to rna-...@googlegroups.com

Hi Rubi,

I am working on outputting coordinate-sorted BAM, but it will not be ready before mid-May.

Nico is right: piping the output into samtools will save disk space and some processing time.

The latest versions of samtools allow multithreading, which speeds up sorting, so you can do something like:

$ STAR ..... --outStd SAM | samtools view -buS - | samtools sort -m 5G -@6 - Aligned.sorted
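A point worth spelling out about the command above: the -m limit of samtools sort applies per thread, so -m 5G -@6 can use roughly 30G for sorting alone. The sketch below only does that arithmetic and prints the sort stage of the pipe. THREADS, MEM_GB and OUT_BAM are hypothetical placeholders, and the -o form is an assumption that you are on samtools 1.3 or later, where the output-prefix form shown above is no longer accepted.

```shell
#!/bin/sh
# Hypothetical placeholders; adjust to the node you are running on.
THREADS=${THREADS:-6}
MEM_GB=${MEM_GB:-5}                      # per-thread sort memory, in gigabytes
OUT_BAM=${OUT_BAM:-Aligned.sorted.bam}

# -m applies to each sorting thread, so the total is threads x per-thread memory.
total_sort_mem_gb() {
    echo $(( THREADS * MEM_GB ))
}

# Sort stage, assuming samtools 1.3+: output file via -o, input from stdin via "-".
sort_stage() {
    echo "samtools sort -@ $THREADS -m ${MEM_GB}G -o $OUT_BAM -"
}

echo "sort may use about $(total_sort_mem_gb)G of RAM"
echo "STAR ..... --outStd SAM | $(sort_stage)"
```

If the node has less memory than the printed total, lowering either value brings the budget down; the STAR arguments are left elided exactly as in the command above.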

Cheers

Alex

r

Apr 23, 2014, 12:03:51 AM

to rna-...@googlegroups.com

Thanks a lot.

r

Apr 24, 2014, 5:14:05 PM

to rna-...@googlegroups.com

Hi Alex,

A few clarifications, if that's OK (and I apologize if they're trivial). They relate both to using the shared memory option and to piping SAM to standard output to convert it to BAM.

What I'm basically trying to do is align multiple samples to multiple indexed genomes (each sample is aligned to its own reference genome) efficiently, in both space and time.

Since I'm using a university cluster, my plan is, for each reference genome, to align all of its samples sequentially on one node.

So, for example, my LSF queue system job script will be:

#!/bin/sh

.

STAR <args.sample1> --genomeLoad LoadAndKeep --outStd SAM | samtools view -buS > <sample1.bam>

STAR <args.sample2> --genomeLoad LoadAndKeep --outStd SAM | samtools view -buS > <sample2.bam>

.

Is this the correct way to achieve my goals?

Also, I'm not sure whether --outFileNamePrefix and --outStd are mutually exclusive. I'm asking because I'm also interested in the log and splice junction outputs, and I want them written to a specified location. If these options are not mutually exclusive, am I right in guessing that specifying --outFileNamePrefix <prefix> --outStd SAM | samtools view -buS > <sample1.bam> would save the logs and splice junctions output to <prefix>, with the SAM output piped to standard output? If not, how do I achieve this?

Thanks a lot,

rubi

Alexander Dobin

Apr 25, 2014, 12:23:39 PM

to rna-...@googlegroups.com

Hi Rubi,

You would need to check with your sysadmin about the policy on shared memory on the cluster nodes; it's often not allowed.

In principle, your idea is right: sending the jobs with the same genome to one node will save time on loading the genome. This is true even if you do not use the shared memory option, owing to Linux file caching.

Piping the output into samtools will save disk space if you use compression, i.e. do not use the -u option:

STAR <args.sample2> --genomeLoad LoadAndKeep --outStd SAM | samtools view -bS - > <sample2.bam>

Note that you need "-" instead of input file name for samtools to read from stdin.

If you need coordinate-sorted BAM, you can also include samtools sort in the pipe.

This is correct:

"specifying --outFileNamePrefix <prefix> --outStd SAM | samtools view -bS - > <sample1.bam> saves the logs and splice junctions output to <prefix> and the sam output would be piped to standard output"
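Putting the pieces of this thread together, a per-genome job script might look roughly like the sketch below. It is only an illustration: GENOME_DIR, OUT_DIR, the sample names and the FASTQ naming are hypothetical placeholders, and DRY_RUN is a made-up switch that prints each pipeline instead of running it, so the commands can be checked before submitting the job.

```shell
#!/bin/sh
# Sketch of a per-genome job script: samples are aligned one after another on
# the same node. GENOME_DIR, OUT_DIR, the sample names and the FASTQ naming
# are hypothetical placeholders, not values from this thread.
GENOME_DIR=${GENOME_DIR:-/path/to/genomeA}
OUT_DIR=${OUT_DIR:-star_out}
DRY_RUN=${DRY_RUN:-1}    # 1 = print each pipeline, 0 = actually run it

align_one() {
    sample=$1
    prefix="$OUT_DIR/$sample."    # logs and splice junctions are written under this prefix
    bam="$OUT_DIR/$sample.bam"    # compressed BAM (no -u), read from stdin via "-"
    if [ "$DRY_RUN" = 1 ]; then
        echo "STAR --genomeDir $GENOME_DIR --readFilesIn $sample.fastq --genomeLoad LoadAndKeep --outFileNamePrefix $prefix --outStd SAM | samtools view -bS - > $bam"
    else
        mkdir -p "$OUT_DIR" &&
        STAR --genomeDir "$GENOME_DIR" --readFilesIn "$sample.fastq" \
             --genomeLoad LoadAndKeep --outFileNamePrefix "$prefix" \
             --outStd SAM | samtools view -bS - > "$bam"
    fi
}

for sample in sample1 sample2; do
    align_one "$sample"
done
```

Giving each sample its own prefix keeps the logs and splice junction files from overwriting each other between runs. If shared memory is not allowed on the node, dropping --genomeLoad LoadAndKeep leaves the rest of the script unchanged.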