STAR Build genome index using repeat genome (exceeded memory limit (123675052 > 122880000))

yang chen

Nov 18, 2019, 1:47:30 PM
to rna-star
Hi,
I downloaded the repeat genome and GTF (RepeatMasker) files from the UCSC Table Browser. I want to build a repeat genome index so that I can remove the reads that map to repeat regions, but genome generation always fails with an exceeded-memory-limit error, even after I raised the memory request from 30 GB to 120 GB. Could you help me check my script? Thanks very much.
The repeat genome file is 2.1 GB and the GTF file is 552 MB.

############################################## output
Nov 18 17:58:19 ..... started STAR run
Nov 18 17:58:19 ... starting to generate Genome files
slurmstepd: Job 11091167 exceeded memory limit (123675052 > 122880000), being killed
slurmstepd: Exceeded job memory limit
slurmstepd: *** JOB 11091167 CANCELLED AT 2019-11-18T13:20:19 *** on node311


##################################################### Script

/home/ychen10/STAR-2.7.3a/bin/Linux_x86_64/STAR  --runThreadN 4 \
           --runMode genomeGenerate \
           --genomeDir index \
           --genomeFastaFiles /scratch/users/ychen10/STAR/repBase/repeatSeq.fa \
           --sjdbGTFfile /scratch/users/ychen10/STAR/repBase/repeatSeq.gtf \
           --sjdbOverhang 99 \
           --genomeChrBinNbits 16 \
           --genomeSAindexNbases 10 \
           --genomeSAsparseD 4

I tried to follow the advice in previous questions about reducing memory usage, but it does not seem to work.

Alexander Dobin

Nov 19, 2019, 4:00:44 PM
to rna-star
Hi Yang,

How many "contigs" are there in the repeat genome? You may need to reduce --genomeChrBinNbits further.
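
As a rough illustration of why the contig count matters, the sketch below applies the rule of thumb given in the STAR manual, min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))), and estimates how many bases the genome occupies when every contig takes up at least one bin of 2^genomeChrBinNbits bases. The function names are hypothetical helpers, not part of STAR. The example inputs are assumptions: the genome length is approximated by the 2.1 GB file size, the read length of 100 is inferred from --sjdbOverhang 99, and the contig count is the 5,607,738 reported later in this thread.

```shell
# Sketch only: helper names are hypothetical, inputs are approximations.

# Suggested --genomeChrBinNbits following the STAR manual rule of thumb:
#   min(18, log2(max(GenomeLength / NumberOfReferences, ReadLength)))
suggest_chr_bin_nbits() {
    # $1 = genome length in bases, $2 = number of contigs, $3 = read length
    awk -v len="$1" -v n="$2" -v rl="$3" 'BEGIN {
        per = len / n                        # average bases per contig
        if (per < rl) per = rl               # never go below the read length
        bits = int(log(per) / log(2) + 0.5)  # log2, rounded to nearest integer
        if (bits > 18) bits = 18             # cap at 18
        print bits
    }'
}

# Lower bound on the padded genome size, assuming each contig
# occupies at least one bin of 2^nbits bases.
padded_bases() {
    # $1 = number of contigs, $2 = genomeChrBinNbits
    awk -v n="$1" -v b="$2" 'BEGIN { printf "%.0f\n", n * 2^b }'
}

# Example calls with the approximate numbers from this thread:
# suggest_chr_bin_nbits 2100000000 5607738 100
# padded_bases 5607738 16
```

With --genomeChrBinNbits 16 and this many contigs, the lower bound on the padded genome is several hundred billion bases, far more than the 2.1 GB of sequence, which would be consistent with the job running out of memory. The suggested value is only a starting point and may need to be adjusted.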

However, I am not sure this approach will work well with STAR: most of the reads will not map to the repeats, and because of that the mapping will be slow.
I think it is better to map to the whole genome (repeats included, no masking) and then remove the reads that overlap the repeat loci.

Cheers
Alex

yang chen

Nov 20, 2019, 9:55:53 AM
to rna-star
Hi Alex,
Thanks very much. I found that there are 5,607,738 contigs in the repeat genome. Could I instead use the reference genome "Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa", in which the repeat regions are masked with "N"? Then the reads would not be mapped to the repeats. Is this reasonable?

Best regards,
Yang Chen 



Alexander Dobin

Nov 23, 2019, 11:18:31 AM
to rna-star
Hi Yang,

I would not recommend using the masked genome: it is dangerous, because reads that originate in repeats will be forced to map to the wrong loci in the "non-repeat" genome.
The best solution is to map to the unmasked genome and then discard the alignments that overlap repeat regions.
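
One possible way to do the discarding step is sketched below with bedtools. That tool is not named in this thread and is only an assumption about how the filtering could be done; the function name and the file names are hypothetical. The sketch assumes the BAM and the repeat annotation use the same chromosome naming (UCSC files use names like "chr1", while Ensembl FASTA files use "1"), so one of them may need renaming first.

```shell
# Sketch only: drop alignments that overlap annotated repeats.
# Assumes bedtools is installed; function and file names are hypothetical.
discard_repeat_alignments() {
    # $1 = BAM from mapping to the unmasked genome
    # $2 = repeat annotation (BED or GTF) with matching chromosome names
    # $3 = output BAM
    # -v     keep only alignments that have no overlap with the repeat file
    # -split treat spliced alignments as separate blocks, so an intron
    #        that spans a repeat does not count as an overlap
    bedtools intersect -v -split -abam "$1" -b "$2" > "$3"
}

# Example call (hypothetical paths):
# discard_repeat_alignments Aligned.sortedByCoord.out.bam repeatSeq.gtf no_repeats.bam
```

Note that each BAM record is tested on its own, so for paired-end data one mate can be removed while the other is kept; whether that is acceptable depends on the downstream analysis.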

Cheers
Alex

yang chen

Nov 25, 2019, 6:25:23 PM
to rna-star
Great. Thanks Alex. I will follow your suggestions.

Best regards,
Yang Chen 
