STAR mapping takes 5 days for a couple dozen samples?


Chahat Upreti

Nov 17, 2019, 8:17:05 PM
to rna-...@googlegroups.com
Hello,

I have been trying to use STAR to map human samples to the hg38 genome.

To generate the genome index, I used:

/Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR --runThreadN 3 --runMode genomeGenerate --genomeDir ./ --genomeFastaFiles ./Homo_sapiens.GRCh38.dna.primary_assembly.fa

Then for mapping, I used:

for i in $(ls /Volumes/bam/DRG/fastq_75/PhenoInfoAvailable); do  /Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR --runThreadN 8  --genomeDir /Users/chahat/Documents/DRG/STAR-master/genome --readFilesIn /Volumes/bam/DRG/fastq_75/PhenoInfoAvailable/$i --outFileNamePrefix /Volumes/bam/DRG/STAR_outputs_redo/$i --limitBAMsortRAM 10000000000 --outSAMtype BAM SortedByCoordinate; done


for i in $(ls /Volumes/bam/DRG/fastq_50/PhenoInfoAvailable); do  /Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR --runThreadN 8  --genomeDir /Users/chahat/Documents/DRG/STAR-master/genome --readFilesIn /Volumes/bam/DRG/fastq_50/PhenoInfoAvailable/$i --outFileNamePrefix /Volumes/bam/DRG/STAR_outputs_redo/$i --limitBAMsortRAM 10000000000 --outSAMtype BAM SortedByCoordinate; done


Both folders have about a dozen samples (average file size ~10 GB). I am running this on a 48 GB system.
After running this code for 5 days, I got 'SampleX.fastqAligned.sortedByCoord.out.bam' files for most of the samples in the second folder, but for none of the samples in the first folder (those .bam files were empty). For the samples with empty .bam files, the corresponding _STARtmp folders were present, and they were huge, so I guess the processing aborted during BAM sorting.

My question is: is there a reason why my STAR runs are taking so long, and is there anything I can do to make them more efficient/faster?

Adding the `--genomeLoad LoadAndKeep` option to the run immediately gives this error:

Nov 17 18:48:50 ..... started STAR run
Nov 17 18:48:50 ..... loading genome
./runSTARonAllSamples.sh: line 1: 36287 Abort trap: 6           /Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR --runThreadN 8 --genomeDir /Users/chahat/Documents/DRG/STAR-master/genome --readFilesIn /Volumes/bam/DRG/fastq_75/PhenoInfoAvailable/$i --genomeLoad LoadAndKeep --outFileNamePrefix /Volumes/bam/DRG/STAR_outputs_redo/redo/$i --limitBAMsortRAM 10000000000 --outSAMtype BAM SortedByCoordinate

Any ideas?

Alexander Dobin

Nov 19, 2019, 2:35:02 PM
to rna-star
Hi Chahat,

were you processing the files from two folders simultaneously, i.e. two STAR jobs were running at the same time?
This will take ~64 GB of RAM, which will not fit into 48 GB, so the system will swap and slow down.
I would recommend running them sequentially, with the number of threads = number of cores.
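A minimal sketch of that sequential setup, assuming single-end FASTQ files (one per sample) and the paths from the thread. The STAR invocation is echoed as a dry run, so the loop structure is the point here; drop the `echo` to actually launch it:

```shell
# Dry-run sketch: one loop, both folders in sequence, threads matched to cores.
STAR=/Users/chahat/Documents/DRG/STAR-master/bin/MacOSX_x86_64/STAR
THREADS=$(sysctl -n hw.ncpu 2>/dev/null || nproc)   # core count: macOS first, Linux fallback
for dir in /Volumes/bam/DRG/fastq_75/PhenoInfoAvailable \
           /Volumes/bam/DRG/fastq_50/PhenoInfoAvailable; do
  for f in "$dir"/*; do
    echo "$STAR" --runThreadN "$THREADS" \
      --genomeDir /Users/chahat/Documents/DRG/STAR-master/genome \
      --readFilesIn "$f" \
      --outFileNamePrefix /Volumes/bam/DRG/STAR_outputs_redo/"$(basename "$f")" \
      --limitBAMsortRAM 10000000000 \
      --outSAMtype BAM SortedByCoordinate
  done
done
```

Looping over full paths (`"$dir"/*`) rather than `$(ls ...)` also avoids word-splitting problems with unusual file names.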

Cheers
Alex

Chahat Upreti

Nov 20, 2019, 7:49:10 AM
to rna-star
Thanks a lot Alex for the response.

were you processing the files from two folders simultaneously, i.e. two STAR jobs were running at the same time?

The code I ran is in my original post. Basically, I had two for loops, one for each folder, so my understanding was that they would run one after the other (the output files were also generated in that order), not simultaneously. I was just concerned whether that kind of processing time was typical.


My second question was about the "--genomeLoad LoadAndKeep" option. Adding it to the STAR run immediately leads to 'Abort trap: 6', even if I run it for a single file. Do you have an idea why this is happening? I think this option could make my STAR runs faster, because right now I may be loading the genome for each file, which could be delaying the run a lot (I speculate).

Thank you!

Alexander Dobin

Nov 23, 2019, 11:04:46 AM
to rna-star
Hi Chahat,

what is strange about this behavior is that you got BAM output in the 2nd folder, but not in the 1st one, which was supposed to complete first, before the 2nd loop started.
Please send me the Log.out file from one of the incomplete runs in the 1st loop.

Abort 6 could mean problems with shared memory. What's the output of
$ cat /proc/sys/kernel/shmall
$ cat /proc/sys/kernel/shmmax

Cheers
Alex

Chahat Upreti

Nov 25, 2019, 1:21:24 AM
to rna-...@googlegroups.com
Hi Alex,

The Log.out file from one of the incomplete runs in the 1st loop is attached. My feeling is that the STAR run was somehow aborted during BAM sorting, because the _STARtmp folder was present for all the samples that had an empty output BAM file, and inside this folder the BAMsort folder was pretty huge.

Regarding the second part, this is being done on a Mac, and the values of shmall and shmmax are - 
kern.sysv.shmmax: 4194304 
kern.sysv.shmmin: 1 
kern.sysv.shmmni: 32 
kern.sysv.shmseg: 8 
kern.sysv.shmall: 1024

Do you think these need to be changed to be able to run STAR with the --genomeLoad option?

Thanks a lot,
Chahat
42T7R.fastqLog.out

Alexander Dobin

Nov 26, 2019, 10:40:29 AM
to rna-star
Hi Chahat,

the Log.out file you attached seems to contain a run with shared memory (--genomeLoad LoadAndRemove), and it failed at the shared memory allocation step.
If you have the Log.out files from runs that failed at the sorting step, please send them to me.
If the problem is really at the sorting step, one solution could be to output unsorted SAM/BAM from STAR, and then sort it separately with samtools sort.

I doubt shared memory will help in this case - it actually requires reserving more RAM for sorting, since the genome has to be kept in RAM.
I found this link that explains how to tweak shared memory on Mac:
You would need to make kern.sysv.shmmax equal to your RAM size (in bytes, I believe), and kern.sysv.shmall = kern.sysv.shmmax/4096
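Under that rule, and assuming the 48 GB machine from the thread, the values would work out as in this sketch (the `sysctl -w` lines are shown commented out since they need root and only apply on macOS):

```shell
# Illustrative values for a 48 GB machine; shmall is counted in 4 KB pages.
SHMMAX=$((48 * 1024 * 1024 * 1024))   # RAM size in bytes
SHMALL=$((SHMMAX / 4096))             # = shmmax / page size
echo "kern.sysv.shmmax=$SHMMAX kern.sysv.shmall=$SHMALL"
# To apply on macOS (requires root):
# sudo sysctl -w kern.sysv.shmmax=$SHMMAX
# sudo sysctl -w kern.sysv.shmall=$SHMALL
```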

Cheers
Alex

Chahat Upreti

Nov 26, 2019, 1:56:23 PM
to rna-star
Alex,

I have attached the Log.out file from another sample, where the shared memory option was not used. It seems to have stopped at the 'Loading SA' step, but the reason I feel the run got stuck at the BAM sorting step is that, as I mentioned above, the _STARtmp folder has a BAMsort folder that is pretty big.

Thank you for the link on how to tweak shared memory on Mac. For now, I reran STAR without the BAM sorting step, and it finished very fast, so I don't have anything to complain about. I am sorting the files with samtools now, and yes, that is the most time-consuming part: it's taking days to sort those BAM files!

Thanks a lot,
Chahat
40T4L.fastqLog.out

Alexander Dobin

Nov 27, 2019, 9:12:29 AM
to rna-star
Hi Chahat,

are you running samtools sort with the -@ <threads> and -m <memory_per_thread> options?
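For example, a dry-run sketch over the thread's output directory (file names are hypothetical; with -@ 8 and -m 2G the sort buffer totals roughly 16 GB, well inside a 48 GB machine). Drop the `echo` to execute:

```shell
# Multi-threaded samtools sort: 8 threads, 2 GB buffer per thread.
for bam in /Volumes/bam/DRG/STAR_outputs_redo/*Aligned.out.bam; do
  echo samtools sort -@ 8 -m 2G \
    -o "${bam%.out.bam}.sortedByCoord.bam" "$bam"
done
```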
Sorting BAM files usually should not take that long... how big are the BAM files?

Cheers
Alex

Chahat Upreti

Nov 27, 2019, 1:33:48 PM
to rna-star
Alex,

My BAM files are ~7 GB in size, and there are about 15 of them. It has been running for 5 days now, on a 48 GB system.
I had not used either the -@ or the -m options. Will use them now and see if it gets faster.

Thanks!