STAR Build genome index using repeat genome (exceeded memory limit (123675052 > 122880000))

yang chen

Nov 18, 2019, 1:47:30 PM

to rna-star

Hi ,

I downloaded the repeat genome and the RepeatMasker GTF file from the UCSC Table Browser. I want to build a STAR index of the repeat genome so that I can remove reads that map to repeat regions. However, the job is always killed for exceeding the memory limit, even after I raised the allocation from 30 GB to 120 GB. Could you help me check my script? Thanks very much.

The repeat genome file is 2.1 GB and the GTF file is 552 MB.

############################################## output

Nov 18 17:58:19 ..... started STAR run

Nov 18 17:58:19 ... starting to generate Genome files

slurmstepd: Job 11091167 exceeded memory limit (123675052 > 122880000), being killed

slurmstepd: Exceeded job memory limit

slurmstepd: *** JOB 11091167 CANCELLED AT 2019-11-18T13:20:19 *** on node311

##################################################### Script

/home/ychen10/STAR-2.7.3a/bin/Linux_x86_64/STAR --runThreadN 4 \
  --runMode genomeGenerate \
  --genomeDir index \
  --genomeFastaFiles /scratch/users/ychen10/STAR/repBase/repeatSeq.fa \
  --sjdbGTFfile /scratch/users/ychen10/STAR/repBase/repeatSeq.gtf \
  --sjdbOverhang 99 \
  --genomeChrBinNbits 16 \
  --genomeSAindexNbases 10 \
  --genomeSAsparseD 4

I tried to follow the advice in previous questions about reducing memory usage, but it does not seem to work.

Alexander Dobin

Nov 19, 2019, 4:00:44 PM

to rna-star

Hi Yang,

What is the number of "contigs" in the repeat genome? You may need to further reduce --genomeChrBinNbits.
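One way to answer this is to count the FASTA header lines and apply the rule of thumb given in the STAR manual for genomes with many references, --genomeChrBinNbits = min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). The sketch below is only an illustration: the function name suggest_chr_bin_nbits, the default read length of 100 and the choice to round the result down are assumptions, not part of STAR.

```shell
#!/bin/sh
# Hypothetical helper (not part of STAR): count contigs and bases in a FASTA
# file and suggest a --genomeChrBinNbits value from the manual's rule of thumb
#   min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength)))
# Rounding down is an assumption made here to stay on the low-memory side.
suggest_chr_bin_nbits() {
    fasta=$1
    read_len=${2:-100}   # assumed read length, override with the second argument
    awk -v read_len="$read_len" '
        /^>/ { n++; next }                               # header line: one contig
        { gsub(/[ \t\r]/, ""); total += length($0) }     # sequence line: count bases
        END {
            if (n == 0) { print "no sequences found" > "/dev/stderr"; exit 1 }
            avg = total / n
            if (avg < read_len) avg = read_len
            bits = int(log(avg) / log(2) + 1e-9)         # floor of log2(avg)
            if (bits > 18) bits = 18
            printf "contigs=%.0f bases=%.0f chrBinNbits=%d\n", n, total, bits
        }' "$fasta"
}
```

For example, `suggest_chr_bin_nbits repeatSeq.fa 100` prints the contig count, the total number of bases and a candidate value that could be passed to --genomeChrBinNbits.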

However, I am not sure this approach is going to work well with STAR: most of the reads will not map to repeats, and because of that the mapping speed will be slow.

I think it would be better to map to the whole genome (repeats included, no masking) and then remove the reads that overlap the repeat loci.

Cheers

Alex

yang chen

Nov 20, 2019, 9:55:53 AM

to rna-star

Hi Alex,

Thanks very much. I found that there are 5,607,738 contigs in the repeat genome. Could I instead use the reference genome "Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa", in which the repeat regions are masked with "N"? Then the reads would not be mapped to the repeats. Is this reasonable?

Best regards,

Yang Chen
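With that contig count, a rough lower bound suggests why the index build with --genomeChrBinNbits 16 ran out of memory: according to the STAR manual, each reference sequence occupies an integer number of bins of 2^genomeChrBinNbits bases, so every contig takes at least one bin. The sketch below only does that arithmetic; the function name is hypothetical, and the result is a lower bound on the padded genome length, not a measured memory figure.

```shell
#!/bin/sh
# Hypothetical helper: lower bound on the padded genome length in bases when
# every contig occupies at least one bin of 2^bits bases.
min_padded_bases() {
    contigs=$1
    bits=$2
    echo $(( contigs * (1 << bits) ))
}

min_padded_bases 5607738 16   # 367508717568
min_padded_bases 5607738 8    # 1435580928
```

With 16 bits the padding alone comes to about 367 billion bases, which, assuming roughly one byte per base, is far beyond a 120 GB allocation. With 8 bits it is about 1.4 billion.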

On Tuesday, November 19, 2019 at 4:00:44 PM UTC-5, Alexander Dobin wrote:

Alexander Dobin

Nov 23, 2019, 11:18:31 AM

to rna-star

Hi Yang,

I would not recommend using the masked genome: it's dangerous because you will be forcing reads that originate in repeats to map to wrong loci in the "non-repeat" genome.

The best solution is to map to the unmasked genome and then discard the alignments that overlap repeat regions.
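The filtering step could be done with bedtools, for example. The sketch below assumes a BAM file of alignments to the unmasked genome and a repeat annotation in BED or GTF format that uses the same chromosome names as the genome FASTA (UCSC files use "chr1", Ensembl files use "1"). The function name and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper around `bedtools intersect` (requires bedtools on PATH).
drop_repeat_alignments() {
    alignments_bam=$1   # alignments to the unmasked genome
    repeats=$2          # repeat loci in BED or GTF format
    filtered_bam=$3     # output: alignments that overlap no repeat
    # -v reports only the entries of -a that have no overlap in -b;
    # with BAM input, the output is written as BAM.
    bedtools intersect -v -a "$alignments_bam" -b "$repeats" > "$filtered_bam"
}

# Usage (illustrative file names):
# drop_repeat_alignments Aligned.sortedByCoord.out.bam repeatSeq.gtf no_repeats.bam
```

By default a spliced alignment is treated as a single interval that spans its introns. bedtools intersect also has a -split option that considers only the aligned blocks, which may be closer to what is wanted for RNA-seq data.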

Cheers

Alex

yang chen

Nov 25, 2019, 6:25:23 PM

to rna-star

Great. Thanks Alex. I will follow your suggestions.

Best regards,

Yang Chen

On Saturday, November 23, 2019 at 11:18:31 AM UTC-5, Alexander Dobin wrote:
