STAR stuck on sorting Suffix Array chunks and saving them to disk


Stephen Smith

Apr 20, 2017, 3:18:40 PM
to rna-star
Hi,

I am trying to use STAR in an RNA-seq analysis.

I am using the Xenopus laevis genome found here: ftp://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.1/Xla.v91.repeatMasked.fa.gz

and the gff3 file found here: ftp://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.1/1.8.3.2/XL_9.1_v1.8.3.2.primaryTranscripts.gff3.gz

The command I am entering is this:

STAR --runThreadN 6 --runMode genomeGenerate --genomeChrBinNbits 14 --genomeDir /path/to/star_index/ --genomeFastaFiles /path/to/Xla.v91.repeatMasked.fa  --sjdbGTFfile /path/to/XL_9.1_v1.8.3.2.primaryTranscripts.gff3 --sjdbGTFtagExonParentTranscript Parent --sjdbOverhang 75



The Log.out file is 400k lines long, and it reports "Number of SA indices: 4900931576".
Something doesn't seem right with those numbers!

Thanks,
Stephen

Alexander Dobin

Apr 21, 2017, 3:55:09 PM
to rna-star
Hi Stephen,

this genome has a lot of contigs, so you need to scale down --genomeChrBinNbits to 10, which will reduce RAM consumption.
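For reference, the manual suggests scaling this parameter roughly as min(18, log2(GenomeLength/NumberOfReferences)); a quick one-liner to compute that number for a FASTA (just a sketch, using the genome file from your command) would be:

awk '/^>/  {n++}
     !/^>/ {len += length($0)}
     END   {v = log(len / n) / log(2)
            if (v > 18) v = 18
            printf "suggested --genomeChrBinNbits: %d\n", int(v)}' Xla.v91.repeatMasked.fa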
However, such a large number of contigs may result in slow mapping. If this is the case, I would recommend that you combine the short sequences into one big "supercontig". For instance, you can keep the longest 50,000 sequences separate and merge the rest of them. If you decide to go this path, I can send you a simple awk script for this conversion. Note that you will have to convert your mapping results back to the original contig coordinates afterwards.
Also, it is best to convert the GFF3 file into GTF; you can use the gffread tool from the Cufflinks package.
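For example (assuming gffread is installed):

# -T tells gffread to write GTF instead of GFF3
gffread XL_9.1_v1.8.3.2.primaryTranscripts.gff3 -T -o XL_9.1_v1.8.3.2.primaryTranscripts.gtf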

Cheers
Alex

Stephen Smith

Apr 22, 2017, 8:03:31 AM
to rna-star
Hi Alex,

OK, I now have a GTF to use instead.

That script would be very much appreciated!
When you say I will have to "convert my mapping results" what exactly does this mean?

Best wishes,
Stephen


Alexander Dobin

Apr 24, 2017, 4:37:04 PM
to rna-star
Hi Stephen,

here is the script:
To run it:
$ awk -f mergeSuperContig.awk All.fasta All.gtf
It has the following hardcoded parameters that you can edit inside the script:

shortL=64000 defines the maximum contig length for contigs that are merged into the supercontig; longer contigs are kept separate.
You need to select this number so that the number of separate contigs is < 50,000-100,000 (a quick way to check this is sketched after these parameters).

pN=60 is the length of the N-padding between the merged short contigs. This has to be roughly the read length.
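A quick way to check how many contigs would stay separate for a given shortL (just a sketch, reusing the All.fasta name from the command above):

awk -v shortL=64000 '
    /^>/ {if (len > shortL) n++; len = 0; next}   # close out the previous record
         {len += length($0)}                      # accumulate sequence length
    END  {if (len > shortL) n++; print n " contigs longer than " shortL " bp"}
' All.fasta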


The script will generate Long.out.fasta, Short.out.fasta and Annot.out.gtf files that have to be fed to STAR for genome generation.
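For example, along the lines of your original command (just a sketch; the paths, thread count and --sjdbOverhang are carried over, and --genomeChrBinNbits should be re-chosen for the merged set of contigs):

STAR --runMode genomeGenerate --runThreadN 6 \
     --genomeDir /path/to/star_index/ \
     --genomeFastaFiles Long.out.fasta Short.out.fasta \
     --sjdbGTFfile Annot.out.gtf \
     --sjdbOverhang 75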

After mapping, the reads mapped to short contigs will need to be transformed to local coordinates. 
This can be done using the ChrStart.tab file that contains the start positions of the short contigs in the super-contig.
If you test the genome generation and mapping, and it works fine for you, I can write a simple script to make this transformation.
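Roughly, the transformation would look like this; this is only a sketch of the idea, not the actual script, and its assumptions are spelled out in the comments:

# Sketch only. Assumes ChrStart.tab is tab-separated "contigName <TAB> start" with
# 1-based start positions sorted in increasing order, and that the merged sequence is
# named "superContig". Mate positions (PNEXT) and the @SQ header lines are not adjusted.
awk 'BEGIN {FS = OFS = "\t"}
     NR == FNR {n++; name[n] = $1; start[n] = $2; next}   # load ChrStart.tab
     /^@/      {print; next}                              # pass SAM header through
     $3 == "superContig" {
         i = 1
         while (i < n && start[i+1] <= $4) i++            # last contig starting at or before POS
         $3 = name[i]                                     # local contig name
         $4 = $4 - start[i] + 1                           # local 1-based position
     }
     {print}' ChrStart.tab Aligned.out.sam > Aligned.local.sam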

Cheers
Alex

Ron

May 18, 2017, 3:57:02 PM
to rna-star
Hi Stephen,

Have you been able to align the fastq files for this Xenopus laevis genome?
I am getting a segmentation fault (core dumped) error while running the alignment.
I was able to generate the genome index files for this genome.

Any suggestions would be of great help.

Thanks,
Ron



Stephen Smith

Jun 25, 2017, 9:38:00 AM
to rna-...@googlegroups.com
Hi Alex,

Am I right in thinking that the "number of separate contigs is < 50,000-100,000" means that after running the script I should have less than 50,000 contigs in my Long.out.fasta?

I currently have ~35,000 in Long.out.fasta and I'm assuming 1 in the Short.out.fasta.
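For what it's worth, a quick way to count the records in each file (assuming standard ">" FASTA headers):

grep -c '^>' Long.out.fasta Short.out.fasta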

Thanks,
Stephen

P.S. I had to put STAR aside because of time constraints, but I have about a month free to try it now.

Edit: I've left STAR running overnight on the genome indexing stage and it is still stuck on the issue in the title.


Attachment: Log_STAR.out

Alexander Dobin

Jun 26, 2017, 4:28:56 PM
to rna-star
Hi Stephen,

how much RAM do you have? You are setting --limitGenomeGenerateRAM 15000000000 (~15 GB), which is a bit too small for your genome.
Please send me the output of `ls -l` on the genome directory.
Also, please try --genomeSuffixLengthMax 1000, which may speed up suffix array generation.
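That is, add something like this to the genomeGenerate command (the RAM figure is only an example; set it to what your node can actually provide):

--limitGenomeGenerateRAM 31000000000    # in bytes, ~31 GB; raise it if more memory is available
--genomeSuffixLengthMax 1000            # must stay longer than the read length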

Cheers
Alex

Stephen Smith

Jun 26, 2017, 6:06:51 PM
to rna-...@googlegroups.com

Hi Alex,

There is plenty of RAM; it is being run on an SGE cluster. I had just found that option while searching around for a solution, and I have removed it now.

I have deleted the genome directory; I will run the new command and show the ls -l ASAP.

I can tell you that the previous contents were the seemingly finished chrName.txt, chrNameLength.txt, chrLength.txt, and one other file that I can't recall, along with 11 SA files, which were empty.

Best,
Stephen

Edit: here is the new ls -l
total 412
-rw-r--r-- 1 ssmith ssmith  45357 Jun 26 22:10 chrLength.txt
-rw-r--r-- 1 ssmith ssmith 160138 Jun 26 22:10 chrNameLength.txt
-rw-r--r-- 1 ssmith ssmith 114781 Jun 26 22:10 chrName.txt
-rw-r--r-- 1 ssmith ssmith  87033 Jun 26 22:10 chrStart.txt
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:17 SA_0
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:17 SA_1
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:20 SA_10
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:18 SA_11
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:20 SA_2
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:18 SA_3
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:18 SA_4
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:20 SA_5
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:19 SA_6
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:19 SA_7
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:17 SA_8
-rw-r--r-- 1 ssmith ssmith      0 Jun 26 22:18 SA_9

