RSEM + STAR pipeline help

Joshua Bradley

Nov 3, 2015, 4:58:00 PM

to RSEM Users

I have been trying to run RSEM (v1.2.23) with STAR (v2.4.2a) on some chimp samples for a few days now, but I keep getting the same error message (below). My pipeline is based on Colin Dewey's comments here (https://groups.google.com/forum/#!topic/rsem-users/BqXesH92tyA). I would appreciate some guidance, as I don't know whether the problem comes from how I am using STAR or RSEM. I have been able to run RSEM+bowtie2. My data are paired-end and strand-specific.

Generate RSEM reference (using genome and annotations)

rsem-prepare-reference --gtf Pan_troglodytes.CHIMP2.1.4.82.chr.gtf --transcript-to-gene-map Pan_troglodytes.CHIMP2.1.4.82.gene_transcript_mapping.txt Pan_troglodytes.CHIMP2.1.4.chr.fa RSEMchimp

Generate STAR reference (using genome and annotations)

mkdir STARchimp
STAR --runMode genomeGenerate --genomeDir STARchimp --genomeFastaFiles Pan_troglodytes.CHIMP2.1.4.chr.fa --sjdbGTFfile Pan_troglodytes.CHIMP2.1.4.82.chr.gtf --sjdbOverhang 199
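
A note on --sjdbOverhang: the STAR manual recommends max(ReadLength) - 1 for this option. The following is a minimal sketch for deriving that number from a FASTQ file. The helper name suggest_sjdb_overhang is made up for illustration, and the sketch assumes an uncompressed FASTQ with four lines per record; it only samples the first 1000 reads.

```shell
# Hypothetical helper: print (longest sampled read length - 1) for a FASTQ file.
# Assumes an uncompressed FASTQ with exactly four lines per record.
suggest_sjdb_overhang() {
  # Sequence lines are the 2nd line of every 4-line record.
  head -n 4000 "$1" |
    awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
         END { if (max > 0) print max - 1 }'
}

# Example call (file name taken from the alignment command in this post):
#   suggest_sjdb_overhang RNA-Seq_1.fastq
```

If the printed value differs from the one passed to genomeGenerate, that is worth knowing, but it would not by itself explain a "can not recognize reference sequence name" error.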

Align Reads (against the genome, and output transcript coordinates)

mkdir mapped_reads
STAR --genomeDir STARchimp --readFilesIn RNA-Seq_1.fastq RNA-Seq_2.fastq --outSAMunmapped Within  --outFilterType BySJout  --outSAMattributes NH HI AS NM MD  --outFilterMultimapNmax 20  --outFilterMismatchNmax 999  --outFilterMismatchNoverLmax 0.04  --alignIntronMin 20  --alignIntronMax 1000000  --alignMatesGapMax 1000000  --alignSJoverhangMin 8  --alignSJDBoverhangMin 1 --outSAMtype BAM SortedByCoordinate  --quantMode TranscriptomeSAM  --outFileNamePrefix mapped_reads/

RSEM Quantification (using the transcript-coordinate-based file produced by STAR)

rsem-calculate-expression --paired-end --bam --no-bam-output --forward-prob 1.0 mapped_reads/Aligned.toTranscriptome.out.bam RSEMchimp rnaseq_quant

Error Message

rsem-parse-alignments RSEMchimp rnaseq_quant.temp/rnaseq_quant rnaseq_quant.stat/rnaseq_quant b rnaseq_mapped/Aligned.toTranscriptome.out.bam -t 3 -tag XM
RSEM can not recognize reference sequence name ENSPTRT00000054489!
"rsem-parse-alignments RSEMchimp rnaseq_quant.temp/rnaseq_quant rnaseq_quant.stat/rnaseq_quant b rnaseq_mapped/Aligned.toTranscriptome.out.bam -t 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

There are two things I looked into.

1) I checked rnaseq_mapped/Aligned.toTranscriptome.out.bam and do not see the XM tag on the reads. When running RSEM+bowtie2, though, I do see the XM tag. I know bowtie2 generates the XM tag (which I confirmed when running bowtie2), but STAR does not. To run RSEM+STAR, do I need to use a different tag with the --tag option?

2) Since the error message was about recognizing a sequence name, I looked into RSEMchimp.idx.fa that was generated and found the reference sequence is named ENSPTRT00000054489_ptr-mir-200a-201. Can RSEM detect that sequence name with "_ptr-mir-200a-201" appended at the end? I can post some examples of the reads if necessary.
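
The name comparison described in point 2 can be automated. The sketch below is one possible way to do it, not part of RSEM or STAR: the three function names are invented for illustration, and it assumes samtools is installed for reading the BAM header. It lists reference names that the alignments use but that are absent from the RSEM reference FASTA.

```shell
# Hypothetical helpers for comparing sequence names; not part of RSEM or STAR.

# Read a SAM header on stdin and print the SN: value of every @SQ line.
sam_header_names() {
  awk -F'\t' '$1 == "@SQ" {
    for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
  }' | sort -u
}

# Print the sequence names of a FASTA file (text after ">" up to the first space).
fasta_names() {
  grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u
}

# Print names present in file $1 but not in file $2 (both must be sorted).
names_missing_from_reference() {
  comm -23 "$1" "$2"
}

# Example calls (assumes samtools; file names are the ones used in this post):
#   samtools view -H mapped_reads/Aligned.toTranscriptome.out.bam | sam_header_names > bam.names
#   fasta_names RSEMchimp.idx.fa > rsem.names
#   names_missing_from_reference bam.names rsem.names | head
```

An empty result means every name in the BAM header exists in the RSEM reference; any output shows exactly which names RSEM would fail to recognize.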

IH Lin

Nov 8, 2015, 5:57:24 AM

to RSEM Users

Hi,

I am having the same problem. I am using STAR (2.4.2a) with RSEM (I tried 1.2.20 through 1.2.24).

After generating the RSEM reference indices, I used Aligned.toTranscriptome.out.bam as input, but RSEM always complains that it doesn't recognize the sequence name, e.g.

RSEM can not recognize reference sequence name ENST00000456328!

However, the RSEM indices always append the transcript name after the ID. Is that why it doesn't work?

rsem.idx.fa:>ENST00000456328_DDX11L1-002
rsem.n2g.idx.fa:>ENST00000456328_DDX11L1-002
rsem.seq:ENST00000456328_DDX11L1-002
rsem.ti:ENST00000456328_DDX11L1-002
rsem.ti:gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
rsem.transcripts.fa:>ENST00000456328_DDX11L1-002

I-Hsuan

On Wednesday, November 4, 2015 at 5:58:00 AM UTC+8, Joshua Bradley wrote:

Bo Li

Nov 8, 2015, 6:49:54 PM

to rsem-...@googlegroups.com

Hi Joshua and IH,

Thanks for letting me know about this bug. If you are using RSEM v1.2.24,
then yes, it is because we attach the gene/transcript name to the end of
the gene/transcript ID, but STAR does not. We will fix this bug as soon as
we can. For now, you can go back to RSEM v1.2.23
(https://github.com/deweylab/RSEM/archive/v1.2.23.tar.gz). This version
should work well with STAR.
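
Going back to a specific release could look roughly like the sketch below. The function names are invented for illustration. It assumes curl, tar, make and a C++ compiler are available, and that the archive unpacks into a directory named RSEM-<version>; neither assumption is confirmed in this thread.

```shell
# Hypothetical helpers; not part of RSEM.

# Print the GitHub archive URL for a given RSEM version, e.g. 1.2.23.
rsem_archive_url() {
  printf 'https://github.com/deweylab/RSEM/archive/v%s.tar.gz\n' "$1"
}

# Download and build one RSEM release under directory $2.
# Assumption: the tarball unpacks into "RSEM-<version>".
fetch_and_build_rsem() (
  version=$1
  dest=$2
  mkdir -p "$dest" && cd "$dest" || exit 1
  curl -L -o "RSEM-$version.tar.gz" "$(rsem_archive_url "$version")" || exit 1
  tar -xzf "RSEM-$version.tar.gz" || exit 1
  cd "RSEM-$version" || exit 1
  make
)

# Example calls (the install location is arbitrary):
#   fetch_and_build_rsem 1.2.23 "$HOME/tools"
#   export PATH="$HOME/tools/RSEM-1.2.23:$PATH"
```

After switching versions, the reference should be regenerated with rsem-prepare-reference from that same version, because the sequence names are written into the index files when the reference is built.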

Hope it helps,
Bo

IH Lin

Nov 9, 2015, 3:27:12 AM

to RSEM Users

Hi Bo,

Yesterday, while trying the earlier versions, I found that they did not work either. Then I realized that my $PATH setting kept pointing to the latest version.
I have generated a new set of indices with v1.2.23, and it now works with STAR's BAM output. Thanks.
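
For anyone with several RSEM versions installed, here is a minimal sketch for seeing which copy the shell will actually pick up. The function name list_on_path is made up for illustration; it only walks the directories in $PATH and does not query RSEM itself.

```shell
# Hypothetical helper: list every executable named "$1" found in $PATH,
# in lookup order. The first line printed is the copy the shell will run.
list_on_path() (
  name=$1
  IFS=:
  for dir in $PATH; do
    # An empty PATH entry means the current directory.
    candidate="${dir:-.}/$name"
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
    fi
  done
)

# Example calls:
#   list_on_path rsem-calculate-expression
#   list_on_path rsem-prepare-reference
```

If more than one path is printed, the reference and the quantification step may be coming from different installations unless the intended one is listed first.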

I-Hsuan

On Monday, November 9, 2015 at 7:49:54 AM UTC+8, Bo Li wrote:

Joshua Bradley

Nov 9, 2015, 11:33:33 AM

to RSEM Users

I just want to confirm this for anyone else who may come to this thread: Bo's suggestion to use RSEM (v1.2.23) with STAR (v2.4.2a) works. I originally reported having problems with RSEM v1.2.23, but I must have made a mistake when writing the question and misreported the version. Downloading RSEM v1.2.23 and using the commands I posted previously works!

koi...@wisc.edu

Nov 10, 2015, 12:47:59 AM

to RSEM Users

Hi,

I have been using RSEM (v1.2.23) and STAR (v2.4.2a) and ran into a similar issue.

Generate RSEM reference

rsem-prepare-reference --gtf NCBI_Felis8.0_annotation_1106151122.gtf -p 8 GCF_000181335.2_Felis_catus_8.0_genomic.fa Felis8.0_RSEM

rsem-extract-reference-transcripts Felis8.0_RSEM 0 NCBI_Felis8.0_annotation_1106151122.gtf 0 GCF_000181335.2_Felis_catus_8.0_genomic.fa
Parsed 200000 lines
Parsed 400000 lines
Parsed 600000 lines
Parsing gtf File is done!
GCF_000181335.2_Felis_catus_8.0_genomic.fa is processed!
40495 transcripts are extracted and 0 transcripts are omitted.
Extracting sequences is done!
Group File is generated!
Transcript Information File is generated!
Chromosome List File is generated!
Extracted Sequences File is generated!
rsem-preref Felis8.0_RSEM.transcripts.fa 1 Felis8.0_RSEM -l 125
Refs.makeRefs finished!
Refs.saveRefs finished!
Felis8.0_RSEM.idx.fa is generated!
Felis8.0_RSEM.n2g.idx.fa is generated!

Generate STAR reference (using genome and annotations)

STAR --runThreadN 6 --runMode genomeGenerate --genomeDir . --genomeFastaFiles GCF_000181335.2_Felis_catus_8.0_genomic.fa --limitGenomeGenerateRAM 19000000000 --genomeChrBinNbits 13 --sjdbGTFfile NCBI_Felis8.0_annotation_1106151122.gtf --sjdbOverhang 100 --genomeSAindexNbases 11

Mapping

STAR --genomeDir . --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 8 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD \@HD VN:1.4 SO:unsorted --readFilesIn 541OD_R1.fq 541OD_R2.fq

RSEM Quantification (using the transcript-coordinate-based file produced by STAR)

rsem-calculate-expression --paired-end --bam -p 8 541Aligned.toTranscriptome.out.bam Felis8.0_RSEM 541RSEM

ERROR

rsem-parse-alignments Felis8.0_RSEM 541RSEM.temp/541RSEM 541RSEM.stat/541RSEM b 541Aligned.toTranscriptome.out.bam -t 3 -tag XM
RSEM can not recognize reference sequence name rna0!
"rsem-parse-alignments Felis8.0_RSEM 541RSEM.temp/541RSEM 541RSEM.stat/541RSEM b 541Aligned.toTranscriptome.out.bam -t 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

Following the previous posts, I used the same genome and annotation for both STAR and RSEM. I also used Aligned.toTranscriptome.out.bam as input, not genome.bam. STAR's Log.out contains:

Processing sjdbGTFfile=NCBI_Felis8.0_annotation_1106151122.gtf, found:
40495 transcripts
402177 exons (non-collapsed)
203665 collapsed junctions
..... Finished GTF processing
Loaded database junctions from the GTF file:

I couldn't figure out why it didn't work. Could you please suggest a solution?
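
Both logs above report 40495 transcripts, so the counts agree; what remains to compare is the names. The sketch below is one possible check, with invented function names, that looks up an ID such as rna0 in the GTF and in the FASTA written by rsem-prepare-reference. It does not claim to identify the cause of the error.

```shell
# Hypothetical helpers; not part of RSEM or STAR.

# Print the unique transcript_id values found in a GTF file.
gtf_transcript_ids() {
  grep -v '^#' "$1" |
    sed -n 's/.*transcript_id "\([^"]*\)".*/\1/p' |
    sort -u
}

# Succeed if FASTA file $1 has a sequence whose name is exactly $2.
fasta_has_name() {
  awk -v wanted="$2" '
    substr($0, 1, 1) == ">" {
      split(substr($0, 2), parts, /[ \t]/)
      if (parts[1] == wanted) found = 1
    }
    END { exit !found }' "$1"
}

# Example calls (file names are the ones used in this post):
#   gtf_transcript_ids NCBI_Felis8.0_annotation_1106151122.gtf | grep -x 'rna0'
#   fasta_has_name Felis8.0_RSEM.transcripts.fa rna0 && echo "rna0 is in the RSEM reference"
```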

Thank you for your help!

Kazu

Message has been deleted

koi...@wisc.edu

Nov 11, 2015, 9:23:41 PM

to RSEM Users

Thanks! The problem has been solved!
