FASTA before and after rsem-prepare-reference


Floris Barthel

May 29, 2014, 6:33:55 PM
to rsem-...@googlegroups.com
Hi,

I'm a new user of RSEM and I'm trying to understand some of the concepts.

I understand why RSEM needs access to reference files (e.g. Ensembl FASTA) and annotations (e.g. Ensembl GTF) internally to map transcripts to genes. But why does it need to generate RSEM-specific reference files as input to the alignment tools?

I ran rsem-prepare-reference on the Ensembl (release 64) FASTA, which includes both the transcriptome and the genome, together with the GTF, to generate the "idx.fa" file used for alignment. I noticed the following:
- The genome is removed from idx.fa. Why is this? (The file size drops from 3+ GB to ~250 MB, and the tail of the file is a transcript.)
- Many transcripts are removed with the error message "cannot extract transcript XXX's sequence since the chromosome it locates, ZZZ, is absent". What does this mean?

End result: ~17k transcripts extracted, 9k omitted. Is this normal?

Why should I use the "idx.fa" file as the reference for alignment rather than the transcriptome plus genome FASTA?

Thanks

Bo Li

May 29, 2014, 8:00:11 PM
to rsem-...@googlegroups.com
Hi Floris,

RSEM aligns only to transcript sequences; it does not align to genomes.
If you provide RSEM with the genome sequence and a GTF file, RSEM will
extract the set of annotated transcripts for you. If you already have
the transcript sequences, you can use them directly.
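
To make the extraction step concrete, here is a minimal Python sketch of how annotated transcripts can be cut out of a genome using GTF exon records. It is an illustration only, not RSEM's actual code: the function names (`extract_transcripts`, `reverse_complement`) are hypothetical, the genome is assumed to be already loaded as a dict of chromosome name to sequence, and transcripts on chromosomes missing from that dict are simply collected in a "skipped" list.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_transcripts(genome, gtf_lines):
    """Build transcript sequences from exon records.

    genome    -- dict mapping chromosome name to its sequence (assumed input)
    gtf_lines -- iterable of GTF lines (tab-separated, 9 columns)
    Returns (transcripts, skipped): a dict of transcript_id -> sequence and
    a list of transcript ids whose chromosome is not in `genome`.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
        transcript_id = None
        for item in fields[8].split(";"):
            parts = item.strip().split(None, 1)
            if len(parts) == 2 and parts[0] == "transcript_id":
                transcript_id = parts[1].strip('"')
        if transcript_id is not None:
            exons[transcript_id].append((chrom, start, end, strand))

    transcripts, skipped = {}, []
    for transcript_id, parts in exons.items():
        chrom, strand = parts[0][0], parts[0][3]
        if chrom not in genome:
            skipped.append(transcript_id)
            continue
        parts.sort(key=lambda p: p[1])  # order exons by genomic start
        # GTF coordinates are 1-based and inclusive.
        seq = "".join(genome[chrom][s - 1:e] for _, s, e, _ in parts)
        if strand == "-":
            seq = reverse_complement(seq)
        transcripts[transcript_id] = seq
    return transcripts, skipped
```

With a toy genome such as `{"1": "AAACCCGGGTTT"}` and two plus-strand exons at 1-3 and 7-9 for the same transcript, the function joins the exons in genomic order; the result contains transcripts only, which is consistent with the genome itself not appearing in the generated reference.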

RSEM generates its own reference files for the aligner for two reasons:
1) RSEM may add a 125 bp poly(A) tail to the end of each mRNA;
2) The Bowtie aligner cannot align reads against 'N' bases in the
reference, so RSEM first converts all 'N's in the reference to 'G's.
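
The two adjustments above can be sketched in a few lines of Python. This is only an illustration of the idea, not RSEM's implementation: the function name `prepare_for_aligner` and its parameters are hypothetical, and the lowercase handling is an added assumption.

```python
POLYA_LENGTH = 125  # tail length mentioned above


def prepare_for_aligner(seq, add_polya=True, polya_length=POLYA_LENGTH):
    """Return a transcript sequence adjusted for the aligner.

    1. Replace every 'N' with 'G' (lowercase 'n' with 'g' is an assumption).
    2. Optionally append a poly(A) tail of `polya_length` bases.
    """
    converted = seq.replace("N", "G").replace("n", "g")
    if add_polya:
        converted += "A" * polya_length
    return converted
```

For example, `prepare_for_aligner("ACNNT", polya_length=3)` gives `"ACGGTAAA"`. This also explains why the sequences in the generated reference can differ from those in the input FASTA.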

> - many transcripts are removed with the error message "cannot extract
> transcript XXX's sequence since the chromosome it locates, ZZZ, is
> absent". What does this mean?

It means that your genome FASTA file contains no sequence for chromosome
ZZZ. RSEM therefore cannot extract XXX, because it is located on ZZZ.
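
One way to diagnose this before running rsem-prepare-reference is to compare the chromosome names used in the GTF with the sequence names in the FASTA. The Python sketch below does that under two assumptions that may not match RSEM exactly: a sequence's name is taken to be the first whitespace-delimited token of its FASTA header, and names are compared as exact strings. All function names are hypothetical.

```python
from collections import Counter


def fasta_names(fasta_lines):
    """Collect sequence names: first token after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_chromosome_counts(gtf_lines):
    """Count GTF records per chromosome name (column 1)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) >= 9:
            counts[fields[0]] += 1
    return counts


def missing_chromosomes(fasta_lines, gtf_lines):
    """Return {chromosome: record count} for GTF chromosomes absent from the FASTA."""
    present = fasta_names(fasta_lines)
    counts = gtf_chromosome_counts(gtf_lines)
    return {name: n for name, n in counts.items() if name not in present}
```

With real files, open file handles can be passed directly, for example `missing_chromosomes(open("genome.fa"), open("annotation.gtf"))` (both file names are placeholders). A non-empty result lists the chromosomes whose transcripts would be omitted; a naming difference such as "chr1" versus "1" is one possible cause worth checking, though nothing in this thread confirms it for this case.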

Best,
Bo
> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [2].
>
>
> Links:
> ------
> [1] http://deweylab.biostat.wisc.edu/rsem/
> [2] http://groups.google.com/group/rsem-users

Floris Barthel

May 30, 2014, 11:14:06 AM
to rsem-...@googlegroups.com
Thanks, that makes a lot more sense now. Does RSEM work with BWA? I've read some posts saying it does not, because BWA performs gapped alignment.

When I tried to input a paired-end SAM file from BWA, I got this error: "the two reads do not come from the same pair"

Floris

Bo Li

Jun 8, 2014, 11:06:59 PM
to rsem-...@googlegroups.com
Hi Floris,

In theory, RSEM works with BWA, but you need to figure out how to set the
right BWA parameters. In addition, it seems that BWA reports multiple
hits in its optional fields, so you would need to write your own script to
convert the BWA-generated BAM file into a standard BAM file.
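
As a starting point for such a script, the sketch below parses the alternative hits that BWA stores in the `XA` optional field of a SAM record, whose documented format is `XA:Z:chr,pos,CIGAR,NM;` repeated, with a signed position giving the strand. That `XA` is the field meant above is an assumption, the function name `parse_xa_tag` is hypothetical, and the sketch only reads the hits: it does not build the separate alignment records (flags, mate fields, reverse-complemented sequences) that a full conversion would need.

```python
def parse_xa_tag(sam_line):
    """Return the alternative hits listed in a SAM line's XA:Z: field.

    Each hit is a dict with keys ref, pos, strand, cigar and nm.
    Lines without an XA field yield an empty list.
    """
    fields = sam_line.rstrip("\n").split("\t")
    hits = []
    # Optional fields start at column 12 (index 11) of a SAM record.
    for tag in fields[11:]:
        if not tag.startswith("XA:Z:"):
            continue
        for entry in tag[5:].split(";"):
            if not entry:
                continue  # skip the empty piece after the trailing ';'
            # rsplit keeps any commas inside the reference name intact.
            ref, pos, cigar, nm = entry.rsplit(",", 3)
            hits.append({
                "ref": ref,
                "pos": abs(int(pos)),
                "strand": "-" if pos.startswith("-") else "+",
                "cigar": cigar,
                "nm": int(nm),
            })
    return hits
```

Applied to each line of a SAM file (skipping header lines that start with `@`), this gives the extra hits that would have to be written out as their own alignment lines.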

Best,
Bo