Segfault on custom transcriptome

Vulliard Loan

Jul 26, 2016, 10:30:11 AM

to rna-star

Hi!

I'm stuck on a segmentation fault that looks quite similar to the issue in this thread.
I'm mapping reads to concatenations of alleles in order to capture reads that map to specific regions. The genomeGenerate step went fine (no error message), but the mapping stops with this error message:
Segmentation fault STAR --runThreadN $COHLA_NB_CORE --genomeDir $COHLA_MAIN_SCRIPT_DIR/rsc/bait_genome/STAR_tree/ --readFilesIn $COHLA_INPUT --outFileNamePrefix $COHLA_TMP_STAR_FOLDER/captured_hla_ >> $COHLA_MAIN_LOG_FILE 2>&1
I have tried to remove as many options as possible but still get this message. I have also tried running on subsets of my input data: if I extract only 40 or 80 reads from my file, I don't get any segfault (and I get a .sam output containing only the header, which is fine since I expect most of my reads to be rejected). With 120, 160 or more reads I get the segmentation fault.
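For anyone wanting to reproduce this kind of subsetting, here is a minimal sketch. It assumes the input is an uncompressed FASTQ file with exactly four lines per read; the function names `first_reads` and `write_subset` are made up for illustration and are not part of STAR.

```python
from itertools import islice

LINES_PER_READ = 4  # assumption: plain 4-line FASTQ records, no wrapped sequences

def first_reads(lines, n_reads):
    """Yield the lines of the first n_reads FASTQ records from an iterable of lines."""
    return islice(lines, n_reads * LINES_PER_READ)

def write_subset(src_path, dst_path, n_reads):
    """Copy the first n_reads records of src_path into dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(first_reads(src, n_reads))

# Example call (hypothetical file names):
# write_subset("reads.fastq", "reads_first120.fastq", 120)
```

Increasing `n_reads` step by step (40, 80, 120, ...) helps narrow down the smallest input that still triggers the crash.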
I tried (and failed) with both the 2.4.2a and 2.5.2a_modified (cloned two weeks ago) versions. I'm not sure my setup works perfectly, but I have successfully run other (smaller) mapping jobs with it. Is there a way to quickly test my STAR installation?
I have attached my log files to this post.
Any idea how to deal with this? Thanks!

Loan

STAR_GenomeBuild.Log.out

STAR_Mapping.Log.out

Alexander Dobin

Jul 27, 2016, 6:23:46 PM

to rna-star

Hi Vulliard,

I think this is the small genome problem.

At the genome generation step, please reduce --genomeSAindexNbases: it needs to be scaled with the genome length, roughly as min(14, log2(ReferenceLength)/2 - 1).

Your genome is ~20MB, so try 11 or even 10.
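As a rough sketch of the scaling rule above, the suggested value can be computed from the total reference length. The function name `suggest_sa_index_nbases` is hypothetical and not part of STAR. Rounding down to an integer and clamping to at least 1 are assumptions made here for illustration.

```python
import math

def suggest_sa_index_nbases(reference_length: int) -> int:
    """Apply min(14, log2(reference_length)/2 - 1) to a reference length in bases."""
    if reference_length < 1:
        raise ValueError("reference_length must be a positive number of bases")
    value = math.log2(reference_length) / 2 - 1
    # Assumptions: round down to an integer, never go below 1 for tiny inputs.
    return max(1, min(14, math.floor(value)))

# A ~20 Mb reference, as in this thread
print(suggest_sa_index_nbases(20_000_000))
```

The result is only a starting point: the value is passed to `--genomeSAindexNbases` when the genome is generated again, and it may need to be lowered further if mapping still fails.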

Cheers

Alex

Vulliard Loan

Jul 29, 2016, 11:41:49 AM

to rna-star

Hi Alex,

I had already tried 10 and 11 for --genomeSAindexNbases (cf. the genome generation log file in my previous post), but lowering it even further seems to work. For values lower than or equal to 6, I don't get the segmentation fault even when mapping all the reads. I'm not sure why since, as you said, my genome is ~20MB, so this does not follow the log2 formula given in the manual. Could it be linked to the fact that I have a lot of repeated sequences in my reference genome, or to how the reference FASTA is formatted?
Anyway, now it works fine, thanks!
Regards,

Loan

Alexander Dobin

Jul 29, 2016, 3:55:26 PM

to rna-star

Hi Loan,

You are right: this depends not on the genome length per se, but rather on the k-mer content of the genome. The formula estimates the "normal" k-mer size for a random genome.

If you have many repeats, then the actual value should go down.
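To illustrate what "k-mer content" means here, the sketch below measures how many of the k-mers in a sequence are distinct. This is only an illustration of how repeats reduce k-mer diversity; it is not how STAR chooses or validates the parameter, and the function name `distinct_kmer_fraction` is made up for this example.

```python
def distinct_kmer_fraction(sequence: str, k: int) -> float:
    """Return the fraction of k-mer windows in `sequence` that are distinct."""
    windows = len(sequence) - k + 1
    if k < 1 or windows < 1:
        raise ValueError("k must be between 1 and len(sequence)")
    # Collect every k-mer once; repeated k-mers collapse into one set entry.
    kmers = {sequence[i:i + k] for i in range(windows)}
    return len(kmers) / windows

# A highly repetitive toy sequence: only 4 distinct 4-mers in 37 windows
repetitive = "ACGT" * 10
print(distinct_kmer_fraction(repetitive, 4))
```

A reference built from many near-identical alleles would score much lower on such a measure than a random sequence of the same length, which is consistent with needing a smaller value than the length-based formula suggests.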

Cheers

Alex
