Mapping multiple files to the same genome

Osvaldo Zagordi

Mar 7, 2014, 5:09:09 AM

to rna-...@googlegroups.com

Hi,

I've been using STAR for a few weeks because I needed a good splice-aware mapping tool. Thanks for making it available. I have a question regarding genome loading and how to keep the genome in memory between runs (if possible). Apologies if this is a totally trivial question.

I have multiple samples that I want to map against the same genome. It is actually multiple genomes, because the reads that do not hit the first genome are screened against a second one, and so on. But let's keep things simple and say that I have two fastq files: S1.fastq and S2.fastq. I also have the human genome indexed in directory HS (Homo sapiens) and the bovine genome in directory BT (Bos taurus). What I currently do is (without giving all the options in detail):

STAR --genomeDir HS --readFilesIn S1.fastq

and I save the SAM file in S1.sam. Then I run

STAR --genomeDir HS --readFilesIn S2.fastq

and save the SAM file in S2.sam.

This takes time because the genome is loaded twice, and I would like to keep it in memory for both S1 and S2. Then I would like to remove it, load the Bos taurus genome, map S1 and S2 against it (actually, only the reads that do not align to the human genome), remove the Bos taurus genome from memory, and so on.

I tried to play with the genomeLoad options by running

STAR --genomeDir HS --readFilesIn S1.fastq --genomeLoad LoadAndKeep &

and then running another job in parallel. But the second one started loading the genome into memory and I stopped it.

Thanks again.

P.S. In case you are wondering, it is for a viral metagenomics project.

Alexander Dobin

Mar 7, 2014, 5:23:55 PM

to rna-...@googlegroups.com

Hi Osvaldo,

The easiest way is to list your input files separated by commas (no spaces!), e.g.

STAR --genomeDir HS --readFilesIn S1.fastq,S2.fastq
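With more than a handful of samples, the comma-separated list can be built rather than typed. This is only a minimal sketch for a POSIX-like shell: `join_by_comma` is a hypothetical helper, not part of STAR, and the final command is printed rather than executed.

```shell
# Hypothetical helper (not part of STAR): join all arguments with commas.
# The function body runs in a subshell, so IFS is not changed for the caller.
join_by_comma() (
    IFS=,
    echo "$*"
)

reads=$(join_by_comma S1.fastq S2.fastq)

# Print the command instead of running it; drop the leading "echo" to execute.
echo STAR --genomeDir HS --readFilesIn "$reads"
```

Note that all reads then end up in a single output, so this suits cases where per-sample files are not needed or can be separated afterwards.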

You can use --genomeLoad LoadAndKeep, but you have to specify it for each run - then it should only be loaded once. Do not forget to remove it with --genomeLoad Remove after you are done with this genome.
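The LoadAndKeep / Remove sequence could look roughly like the sketch below. The helper names `run` and `map_shared`, the `DRY_RUN` switch, and the output prefixes are assumptions made for illustration; only the STAR options themselves come from the STAR command line. By default the commands are just printed, so nothing is loaded until `DRY_RUN=0` is set.

```shell
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually invoke STAR (assumes STAR is on PATH).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: map_shared GENOME_DIR SAMPLE.fastq [SAMPLE.fastq ...]
map_shared() {
    genome=$1
    shift
    for fq in "$@"; do
        # Hypothetical naming scheme: S1.fastq -> output files prefixed "S1_".
        prefix="$(basename "$fq" .fastq)_"
        # LoadAndKeep is specified on every run, so the genome should only
        # be loaded by the first one.
        run STAR --genomeDir "$genome" --readFilesIn "$fq" \
                 --genomeLoad LoadAndKeep --outFileNamePrefix "$prefix"
    done
    # Remove the genome from shared memory once all samples are mapped.
    run STAR --genomeDir "$genome" --genomeLoad Remove
}

map_shared HS S1.fastq S2.fastq
```

The same function could then be called with BT to screen against the next genome after HS has been removed.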

I am not sure what your particular circumstances are, but if you are mapping to multiple species, I generally recommend mapping to a combined genome of all species. It allows the aligner to choose the best possible alignment.

Of course, you need to have plenty of RAM to do that.
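As a rough sketch of what generating such a combined index might look like: the FASTA file names HS.fa and BT.fa, the output directory HS_BT, and the helper `print_index_cmd` are all hypothetical, and the command is only printed here. It assumes that sequence names do not clash between the two FASTA files.

```shell
# Usage: print_index_cmd OUT_DIR FASTA [FASTA ...]
# Hypothetical helper: prints a STAR genome-generation command for review.
print_index_cmd() {
    out_dir=$1
    shift
    # --genomeFastaFiles takes one or more FASTA files, so the species
    # do not need to be concatenated by hand first.
    echo STAR --runMode genomeGenerate --genomeDir "$out_dir" \
              --genomeFastaFiles "$@"
}

# Hypothetical file names; replace with your own reference sequences.
print_index_cmd HS_BT HS.fa BT.fa
```

Mapping then uses --genomeDir HS_BT in place of the per-species directories.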

Cheers

Alex

J. C. Szamosi

Nov 24, 2015, 11:35:57 AM

to rna-star

Hi Alex,

I just want to clarify this answer, since I have a related question. If I use --genomeLoad LoadAndKeep, can I start running a second alignment on that same genome with a second instance of STAR before the first alignment has completed? I have many cores but only 32GB of RAM, so that would speed things up for me significantly. Or is it better to run all the different samples in a single STAR instance and then split the BAM output by read file name after?

Alexander Dobin

Nov 24, 2015, 12:16:55 PM

to rna-star

Hi J.C.,

If the shared genome option works on your system, it may be more convenient to use in your case. I do not think there will be a large difference in throughput whether you use shared memory, map all samples into one file, or even reload the genome for every sample (unless your samples are very small).

With a large number of threads, the main bottleneck will be the disk bandwidth. If you have multiple physical disks (RAIDs) on your system, using different disks for different samples run in parallel may increase throughput significantly.
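A sketch of running two samples in parallel, each reading from and writing to its own disk, might look like this. The mount points /disk1 and /disk2, the thread count, and the helpers `run` and `map_on_disk` are assumptions for illustration, and it presumes the shared genome option works on your system. Commands are printed unless `DRY_RUN=0` is set.

```shell
# Dry run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: map_on_disk SAMPLE DISK
# Input and output for one sample stay on one (hypothetical) disk.
map_on_disk() {
    run STAR --genomeDir HS --readFilesIn "$2/$1.fastq" \
             --outFileNamePrefix "$2/$1_" \
             --genomeLoad LoadAndKeep --runThreadN 8
}

# Start both samples in the background, then wait for both to finish.
map_on_disk S1 /disk1 &
map_on_disk S2 /disk2 &
wait

# Remove the shared genome once every sample is done.
run STAR --genomeDir HS --genomeLoad Remove
```

Whether this is actually faster than a single run depends on the disks, so it is worth timing on a small subset first.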