STAR mapping for RSEM quantification

2,429 views
Skip to first unread message

verenas...@gmx.net

unread,
Oct 6, 2014, 10:47:40 AM10/6/14
to rna-...@googlegroups.com
Hi folks,

I try to get STAR (2.4.0d) to work with RSEM (1.2.18) quantification of RNA-seq data from PE 100nt sequencing. STAR mapping on the prepared STAR index is quite RAM intensive. Final quantification gives me an error. Here I give a hopefully reproducible example. 

get data

cd STARquestionGC19
wget ftp
://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz; gunzip gencode.v19.annotation.gtf.gz
wget ftp
://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz; gunzip GRCh37.p13.genome.fa.gz
wget https
://ws.molgen.mpg.de/ws/352451/PErun100nt_sample_R1.fastq.gz; gunzip PErun100nt_sample_R1.fastq.gz
wget https
://ws.molgen.mpg.de/ws/983671/PErun100nt_sample_R2.fastq.gz; gunzip PErun100nt_sample_R2.fastq.gz; cp *.fastq ../fastq

generate RSEM reference

rsem-prepare-reference --no-polyA --gtf gencode.v19.annotation.gtf  GRCh37.p13.genome.fa RSEMtr_gc19
# no warnings
# 196520 transcripts are extracted and 0 transcripts are omitted.


generate STAR reference (with sjdbOverhang option)

mkdir STAR_rsem_gc19
STAR
-STAR_2.4.0d/STAR --runMode genomeGenerate --sjdbOverhang 99 --genomeChrBinNbits 14 --genomeDir STAR_rsem_gc19 --genomeFastaFiles RSEMtr_gc19.transcripts.fa --runThreadN 10
# Finished successfully

STAR mapping

STAR-STAR_2.4.0d/STAR \
--genomeDir ../STARquestionGC19/STAR_rsem_gc19 \
--readFilesIn ../fastq/PErun100nt_sample_R1.fastq ../fastq/PErun100nt_sample_R2.fastq \
--outSAMunmapped Within \
--outFilterType BySJout \
--outSAMattributes NH HI AS NM MD \
--outFilterMultimapNmax 20 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverReadLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--runThreadN 10 \
--outSAMtype BAM SortedByCoordinate \
--quantMode TranscriptomeSAM
#Finished successfully

This took quite a lot of RAM (up to 120GB)

Quantify using RSEM

rsem-calculate-expression \
-p 12 \
--bam \
--paired-end \
--no-bam-output \
--forward-prob 0.5 \
--seed 12345 \
../STARquestionGC19/Aligned.toTranscriptome.out.bam ../STARquestionGC19/RSEMtr_gc19 RSEM_
#"rsem-parse-alignments ../STARquestionGC19/RSEMtr_gc19 RSEM_.temp/RSEM_ RSEM_.stat/RSEM_ b ../STARquestionGC19/Aligned.toTranscriptome.out.bam -t 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

Do I miss something essential here? Btw, thanks for this great aligner Alex. We use it a lot with HTseq.

Cheers and thanks in advance,
Verena

Alexander Dobin

unread,
Oct 13, 2014, 12:01:04 PM10/13/14
to rna-...@googlegroups.com
Hi Verena,

if you want to use --quantMode TranscriptomeSAM option to make STAR output RSEM compatible alignments, you need to use the full genome (i.e. GRCh37.p13.genome.fa) at the STAR genome generation step, and use the GTF file (gencode.v19.annotation.gtf):
STAR-STAR_2.4.0d/STAR --runMode genomeGenerate --sjdbOverhang 99 --genomeDir STAR_rsem_gc19 --genomeFastaFiles GRCh37.p13.genome.fa --runThreadN 10 --sjdbGTFfile gencode.v19.annotation.gtf 
(--genomeChrBinNbits 14 is not needed since the full genome, not transcriptome, is used).

All other parameters seem OK, hopefully it will work.

Cheers
Alex

verenas...@gmx.net

unread,
Oct 14, 2014, 9:05:33 AM10/14/14
to rna-...@googlegroups.com
Hi Alex,

thank you very much for the response. Using the suggested command made STAR mapping with RSEM quantification working.

generate STAR reference


STAR-STAR_2.4.0d/STAR --runMode genomeGenerate --sjdbOverhang 99 --genomeDir STAR_rsem_gc19 --genomeFastaFiles GRCh37.p13.genome.fa --runThreadN 10 --sjdbGTFfile gencode.v19.annotation.gtf

Best,
Verena
Reply all
Reply to author
Forward
0 new messages