rsem with trinity

1,146 views
Skip to first unread message

Musa

unread,
Oct 14, 2011, 3:07:48 PM10/14/11
to rsem-...@googlegroups.com

Hi All,

This is my first time to use RSEM and I can’t figure out a few things. Unfortunately I’ve run into some problems that I’ll probably be embarrassed to look at a few weeks/months from now after using RSEM over and over again, but right now I have no choice but to ask

I have aligned my reads in Trinity and have the fasta file. Now all I want to do with RSEM is to calculate transcript abundance. I’ve read the RSEM online manual and it’s not clear to me how to proceed. Do I still need to do the 'rsem-prepare-reference' at what point is the reference annotation file used?

Regards,

Musa

Colin Dewey

unread,
Oct 14, 2011, 5:12:32 PM10/14/11
to rsem-...@googlegroups.com
Dear Musa,

All questions are welcome. After running Trinity, you should have a set of contigs in FASTA format. Suppose that file is "contigs.fasta". To use RSEM, you first need to run rsem-prepare-reference on these contig sequences:

rsem-prepare-reference --no-polyA contigs.fasta reference_name

where reference_name is an identifier that you choose for this contig set. I suggest using the "--no-polyA" option, as many contigs will not extend to the 3' end of the transcript and others will already have some polyA sequence.

Once the reference is prepared you can run rsem-calculate-expression on your samples.

Hope that helps,
Colin

b...@cs.wisc.edu

unread,
Oct 15, 2011, 3:36:12 AM10/15/11
to rsem-...@googlegroups.com
Hi Musa,

Yes, you still need to run 'rsem-prepare-reference'. RSEM need to build
its internal reference indices for abundance calculation. You can follow
what Prof. Dewey suggests for 'rsem-prepare-reference'. reference_name is
the file name for all internal indices files. For example, if
reference_name = "/u/user/rsem/indices/trinity", RSEM will generate all
its internal indices under "/u/user/rsem/indices" with file name of
"trinity". You have to guarantee that "/u/user/rsem/indices" exists.

After building the indices, you can follow the manual to run
"rsem-calculate-expression". In the result file, the second column is the
expected counts and the third column is the expression value (tau value)
defined in our paper. If you want to use edgeR or DESeq for differential
analysis, you can round the second column into nearest interger and then
use edgeR or DESeq.

Best,
Bo

Musa Hassan

unread,
Oct 17, 2011, 2:34:09 PM10/17/11
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Thanks for the reply.
Am kind of wondering in the event, like this, when the organism am working with doesn't have an annotated genome and I've used Trinity to assemble the reads and now rsem to to estimate abundances,is there a way of knowing which transcripts are the same i.e. same genes or same exons without having the transcript-to-gene map file? i.e. can rsem go all the way to summing up identical exons/isorfoms in the absence of a knownisorform files?

Christina Wells

unread,
Oct 18, 2011, 9:45:06 AM10/18/11
to RSEM Users
Hi Musa,

I also assembled a de novo transcriptional from a non-model organism
using Trinity. I used a really inelegant solution to create the
transcript-to-gene map file - but it might work for you too!

1) First, I used the fasta header extractor at http://users-birc.au.dk/biopv/php/fabox/
to create a file of all the Trinity contig headers from the
Trinity.fasta file.

2)Then I split this file into a several smaller ones using a perl
script called chomp.pl. I can send you this if you'd like. Depending
on how many rows your spreadsheet program can handle, you might not
need it.

3) I imported the fasta headers into a spread sheet program and used a
function like "left(A2, 12)" to extract the first portion of the
Trinity contig name that contains the information on the gene only
(i.e. comp10006_c0). So I ended up with two columns in the
spreadsheet: The full Trinity header, which specifies the isoform, and
the first portion of the header, which specifies the gene.

4) I copy-pasted these two columns into a text file, reassembling the
full set from my multiple spreadsheet files.

Hopefully this helps --

Christina

Samuel Arvidsson

unread,
Oct 18, 2011, 10:01:25 AM10/18/11
to rsem-...@googlegroups.com
Hi Musa and Christina,

I wrote a small python script that does the same in one go. You need
BioPython (it uses the SeqIO fasta parser), however it could be easily
modified to parse the fasta itself if you would need it - let me know.
Here it goes:

..."
from Bio import SeqIO
import sys

fasta_file = sys.argv[1]
fasta_records = SeqIO.parse(open(fasta_file),"fasta")

base_transcripts = {}
for record in fasta_records:
# Check cluster/comp count, if it is a Trinity assembly
id_parts = record.id.split("_")
if (len(id_parts) > 2):
base_transcript = id_parts[0] + "_" + id_parts[1]
if (not base_transcript in base_transcripts):
curr_list = []
else:
curr_list = base_transcripts[base_transcript]
curr_list.append(record.id)
base_transcripts[base_transcript] = curr_list

for base_transcript in base_transcripts:
for curr_transcript in base_transcripts[base_transcript]:
print(base_transcript+"\t"+curr_transcript)
"...

Put the code above into a .py file (e.g.
"extract_trinity_isoforms_for_RSEM.py") and run it as follows:

python extract_trinity_isoforms_for_RSEM.py Trinity.fasta >
Trinity_isoforms_to_genes.txt

The code could be simplified, but it works and is fast enough for my
needs...

Greetings,
Samuel


--
======================================================================
Dr. Samuel Arvidsson AG Bioinformatics / GoFORSYS
Postdoc
+49 331 567 8615 arvi...@mpimp-golm.mpg.de
Max Planck Institute of Molecular Plant Physiology
Am Mühlenberg 1, 14476 Potsdam-Golm, Germany
http://bioinformatics.mpimp-golm.mpg.de http://www.goforsys.de
======================================================================

Musa Hassan

unread,
Oct 18, 2011, 10:47:39 AM10/18/11
to rsem-...@googlegroups.com
Hi Christina,
Thanks very much for this info.
Would you please send me the scripts you used so I can give it a go.
Musa

Musa Hassan

unread,
Oct 18, 2011, 11:18:18 AM10/18/11
to rsem-...@googlegroups.com, arvi...@mpimp-golm.mpg.de
Hi Samuel,
Anything that can help will be highly appreciated. If you have a script that can do what Christina says in one go, I'll be glad to try it. Do you mind sending me this code in clearly formatted way, if you can comment it the better, otherwise its fine as it is. Better still if you can make it parse the fast file.
Musa.

Vamshi

unread,
Oct 18, 2011, 11:19:28 AM10/18/11
to RSEM Users
Hi all,

Even me too was stuck up with the same thing. So basically to just
outline it, if the transcript name of trinity output looks like:
comp1_c1_seq1
comp1_c1_seq2

The above will be the transcript names and the gene name corresponding
to them is comp1_c1.

Please correct me if i am wrong.

Thanks in advance.
Vamshi

On Oct 18, 10:47 pm, Musa Hassan <musah...@gmail.com> wrote:
> Hi Christina,
> Thanks very much for this info.
> Would you please send me the scripts you used so I can give it a go.
> Musa
>

Samuel Arvidsson

unread,
Oct 18, 2011, 11:37:39 AM10/18/11
to Musa Hassan, rsem-...@googlegroups.com
Dear Musa,

See the attached file for the same script, without the need of BioPython (just python is needed) and clear comments of each piece. It also outputs the average isoform count for your Trinity "genes".
It is a bit overkill, the basic extraction could be done with a one-liner with grep and sed. However you could easily extend the python script to do other stuff with the identifiers.

Greetings,
Samuel


On 10/18/2011 05:18 PM, Musa Hassan wrote:
Hi Samuel,
Anything that can help will be highly appreciated. If you have a script that can do what Christina says in one go, I'll be glad to try it. Do you mind sending me this code in clearly formatted way, if you can comment it the better, otherwise its fine as it is. Better still if you can make it parse the fast file.
Musa.

On 18 October 2011 10:01, Samuel Arvidsson <arvi...@mpimp-golm.mpg.de> wrote:
Hi Musa and Christina,

I wrote a small python script that does the same in one go. You need BioPython (it uses the SeqIO fasta parser), however it could be easily modified to parse the fasta itself if you would need it - let me know. Here it goes:

..."
from Bio import SeqIO
import sys

fasta_file = sys.argv[1]
fasta_records = SeqIO.parse(open(fasta_file),"fasta")

base_transcripts = {}
for record in fasta_records:
� �# Check cluster/comp count, if it is a Trinity assembly
� �id_parts = record.id.split("_")
� �if (len(id_parts) > 2):
� � � �base_transcript = id_parts[0] + "_" + id_parts[1]
� � � �if (not base_transcript in base_transcripts):
� � � � � �curr_list = []
� � � �else:
� � � � � �curr_list = base_transcripts[base_transcript]
� � � �curr_list.append(record.id)
� � � �base_transcripts[base_transcript] = curr_list

for base_transcript in base_transcripts:
� �for curr_transcript in base_transcripts[base_transcript]:
� � � �print(base_transcript+"\t"+curr_transcript)
On Oct 17, 2:34 pm, Musa Hassan<musah...@gmail.com> �wrote:
Dear Musa,
All questions are welcome. �After running Trinity, you should have a set
of contigs in FASTA format. �Suppose that file is "contigs.fasta". �To
use
RSEM, you first need to run rsem-prepare-reference on these contig
sequences:
rsem-prepare-reference --no-polyA contigs.fasta reference_name
where reference_name is an identifier that you choose for this contig
set.
�I suggest using the "--no-polyA" option, as many contigs will not extend

to the 3' end of the transcript and others will already have some polyA
sequence.
Once the reference is prepared you can run rsem-calculate-expression on
your samples.
Hope that helps,
Colin
On Oct 14, 2011, at 2:07 PM, Musa wrote:
Hi All,
This is my first time to use RSEM and I can�t figure out a few things.
Unfortunately I�ve run into some problems that I�ll probably be

embarrassed to look at a few weeks/months from now after using RSEM over
and over again, but right now I have no choice but to ask
I have aligned my reads in Trinity and have the fasta file. Now all I
want to do with RSEM is to calculate transcript abundance. I�ve read the
RSEM online manual and it�s not clear to me how to proceed. Do I still

need to do the 'rsem-prepare-reference' at what point is the reference
annotation file used?
Regards,
Musa


--
======================================================================
Dr. Samuel Arvidsson � � � � � � � � � � �AG Bioinformatics / GoFORSYS
Postdoc
+49 331 567 8615 � � � � � � � � � � � � � arvi...@mpimp-golm.mpg.de

Max Planck Institute of Molecular Plant Physiology
Am M�hlenberg 1, 14476 Potsdam-Golm, Germany
http://bioinformatics.mpimp-golm.mpg.de � � � � http://www.goforsys.de
======================================================================




-- 
======================================================================
Dr. Samuel Arvidsson                      AG Bioinformatics / GoFORSYS
Postdoc
+49 331 567 8615                           arvi...@mpimp-golm.mpg.de
Max Planck Institute of Molecular Plant Physiology
Am M�hlenberg 1, 14476 Potsdam-Golm, Germany
extract_trinity_isoforms_for_RSEM_no_BioPython.py

Samuel Arvidsson

unread,
Oct 19, 2011, 3:20:41 AM10/19/11
to Musa Hassan, rsem-...@googlegroups.com
Dear Musa,

Yes, "comp6803_c0" is the "gene" name and "comp6803_c0_seq1" the isoform name.

Greetings,
Samuel

On 10/18/2011 05:53 PM, Musa Hassan wrote:
Dear Samuel,
Thanks very much. The script runs so fast I was doubtful it worked, but it does. Now a question about the output. My output looks like this:
comp6803_c0���� comp6803_c0_seq1
Does that mean comp6803_c0 is the gene name and the other the squence?
Regards,
Musa


On 18 October 2011 11:37, Samuel Arvidsson <arvi...@mpimp-golm.mpg.de> wrote:
Dear Musa,

See the attached file for the same script, without the need of BioPython (just python is needed) and clear comments of each piece. It also outputs the average isoform count for your Trinity "genes".
It is a bit overkill, the basic extraction could be done with a one-liner with grep and sed. However you could easily extend the python script to do other stuff with the identifiers.

Greetings,
Samuel

Christina Wells

unread,
Oct 19, 2011, 2:57:00 PM10/19/11
to RSEM Users
Hi Musa,

I think Samuel's script sounds like the best way to go - I'm certainly
going to try it! But I am emailing you that other perl script just in
case.

Best,

Christina

On Oct 19, 3:20 am, Samuel Arvidsson <arvids...@mpimp-golm.mpg.de>
wrote:
> Dear Musa,
>
> Yes, "comp6803_c0" is the "gene" name and "comp6803_c0_seq1" the isoform
> name.
>
> Greetings,
> Samuel
>
> On 10/18/2011 05:53 PM, Musa Hassan wrote:
>
>
>
>
>
>
>
>
>
> > Dear Samuel,
> > Thanks very much. The script runs so fast I was doubtful it worked,
> > but it does. Now a question about the output. My output looks like this:
> > comp6803_c0     comp6803_c0_seq1
> > Does that mean comp6803_c0 is the gene name and the other the squence?
> > Regards,
> > Musa
>
> > On 18 October 2011 11:37, Samuel Arvidsson
> > <arvids...@mpimp-golm.mpg.de <mailto:arvids...@mpimp-golm.mpg.de>> wrote:
>
> >     Dear Musa,
>
> >     See the attached file for the same script, without the need of
> >     BioPython (just python is needed) and clear comments of each
> >     piece. It also outputs the average isoform count for your Trinity
> >     "genes".
> >     It is a bit overkill, the basic extraction could be done with a
> >     one-liner with grep and sed. However you could easily extend the
> >     python script to do other stuff with the identifiers.
>
> >     Greetings,
> >     Samuel
>
> --
> ======================================================================
> Dr. Samuel Arvidsson                      AG Bioinformatics / GoFORSYS
> Postdoc
> +49 331 567 8615                           arvids...@mpimp-golm.mpg.de

Bo Li

unread,
Oct 19, 2011, 2:59:08 PM10/19/11
to rsem-...@googlegroups.com
Hi all,

Now we released RSEM v1.1.12, which contains a perl script for
generating the map from trinity fasta file. It does not require any
third party package.

Best,
Bo

>> Am M�hlenberg 1, 14476 Potsdam-Golm, Germanyhttp://bioinformatics.mpimp-golm.mpg.de http://www.goforsys.de
>> ======================================================================
>>

Christina Wells

unread,
Oct 19, 2011, 6:53:40 PM10/19/11
to RSEM Users
That's great - thanks Bo!

Musa Hassan

unread,
Oct 19, 2011, 7:01:51 PM10/19/11
to rsem-...@googlegroups.com
Hi Christina,
Thanks for the script.
Bo told me there's a new version of rsem, rsem 1.12 which incorporates all these and I tried. It works just like Samuels' script.r

archana bhardwaj

unread,
Aug 9, 2013, 2:28:05 AM8/9/13
to rsem-...@googlegroups.com
Hello everyone 

I am new to NGS . I have paired end sequence data for differential gene expression analysis. First i did the prepossessing of my data such as adapter , ambiguous character removal and quality filtration . Now i want to assemble my data in form of contigs. I run trininty.pl to get the assembled reads in form of  contigs. There are so many output files . What is the exact contig files??? Inchworm contigs (InchwormK25.fasta or bundled.fasta ???  How can i get the read count of each contig ??? 

waiting for reply

Cleo Nicole

unread,
Jan 7, 2014, 1:28:59 AM1/7/14
to rsem-...@googlegroups.com, cleo nicole
Hello there,

I have some confusions about RSEM. I have performed a transcriptome de novo assembly using Trinity and proceeded with the transcript abundance estimation with RSEM.
What I am not sure is that since I do not have a reference genome to refer to, do I still need to run the rsem-prepare-reference before running RSEM?
What is the purpose of rsem-prepare-reference? I am referring to the protocol stated in this paper "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis", and there is no mention about the rsem-prepare-reference step,

Can someone please explain further about this?

Thank you very much.

Regards,
Cleo

Brian Haas

unread,
Jan 7, 2014, 6:25:49 AM1/7/14
to rsem-...@googlegroups.com, trinityrn...@lists.sf.net
Hi Archana,

The final output of Trinity should be a 'Trinity.fasta' file.  Please see the web documentation at:

or our protocol paper:

for more guidance.

best,

~brian



--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rsem-users.
 
 



--
--
Brian J. Haas
The Broad Institute
http://broad.mit.edu/~bhaas

 

Brian Haas

unread,
Jan 7, 2014, 6:27:54 AM1/7/14
to rsem-...@googlegroups.com, cleo nicole
Hi Cleo,

The process of running RSEM on trinity assemblies has been somewhat simplified into a one-step process, as described here:


and also described in the protocol paper.

best,

~brian


--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rsem-users.

Dinesh

unread,
Apr 12, 2015, 12:43:12 PM4/12/15
to rsem-...@googlegroups.com
Hello All,

I am fairly new to trinity and I have managed to run it and assembled and received

Total trinity transcripts: 97741

Total trinity components: 54216

Contig N50: 1702


My understanding is that the 'total trinity transcripts' represents total contigs including all the isoforms; and the trinity components represents unigenes, correct? Here I m interested in separating unigenes for downstream analysis. How can I get them? 


I would really appreciate if I could any suggestions from you all. Thanks a lot in advance.


-Dinesh

Reply all
Reply to author
Forward
0 new messages