Hi All,
This is my first time to use RSEM and I can’t figure out a few things. Unfortunately I’ve run into some problems that I’ll probably be embarrassed to look at a few weeks/months from now after using RSEM over and over again, but right now I have no choice but to ask
I have aligned my reads in Trinity and have the fasta file. Now all I want to do with RSEM is to calculate transcript abundance. I’ve read the RSEM online manual and it’s not clear to me how to proceed. Do I still need to do the 'rsem-prepare-reference' at what point is the reference annotation file used?
Regards,
Musa
All questions are welcome. After running Trinity, you should have a set of contigs in FASTA format. Suppose that file is "contigs.fasta". To use RSEM, you first need to run rsem-prepare-reference on these contig sequences:
rsem-prepare-reference --no-polyA contigs.fasta reference_name
where reference_name is an identifier that you choose for this contig set. I suggest using the "--no-polyA" option, as many contigs will not extend to the 3' end of the transcript and others will already have some polyA sequence.
Once the reference is prepared you can run rsem-calculate-expression on your samples.
Hope that helps,
Colin
Yes, you still need to run 'rsem-prepare-reference'. RSEM need to build
its internal reference indices for abundance calculation. You can follow
what Prof. Dewey suggests for 'rsem-prepare-reference'. reference_name is
the file name for all internal indices files. For example, if
reference_name = "/u/user/rsem/indices/trinity", RSEM will generate all
its internal indices under "/u/user/rsem/indices" with file name of
"trinity". You have to guarantee that "/u/user/rsem/indices" exists.
After building the indices, you can follow the manual to run
"rsem-calculate-expression". In the result file, the second column is the
expected counts and the third column is the expression value (tau value)
defined in our paper. If you want to use edgeR or DESeq for differential
analysis, you can round the second column into nearest interger and then
use edgeR or DESeq.
Best,
Bo
I wrote a small python script that does the same in one go. You need
BioPython (it uses the SeqIO fasta parser), however it could be easily
modified to parse the fasta itself if you would need it - let me know.
Here it goes:
..."
from Bio import SeqIO
import sys
fasta_file = sys.argv[1]
fasta_records = SeqIO.parse(open(fasta_file),"fasta")
base_transcripts = {}
for record in fasta_records:
# Check cluster/comp count, if it is a Trinity assembly
id_parts = record.id.split("_")
if (len(id_parts) > 2):
base_transcript = id_parts[0] + "_" + id_parts[1]
if (not base_transcript in base_transcripts):
curr_list = []
else:
curr_list = base_transcripts[base_transcript]
curr_list.append(record.id)
base_transcripts[base_transcript] = curr_list
for base_transcript in base_transcripts:
for curr_transcript in base_transcripts[base_transcript]:
print(base_transcript+"\t"+curr_transcript)
"...
Put the code above into a .py file (e.g.
"extract_trinity_isoforms_for_RSEM.py") and run it as follows:
python extract_trinity_isoforms_for_RSEM.py Trinity.fasta >
Trinity_isoforms_to_genes.txt
The code could be simplified, but it works and is fast enough for my
needs...
Greetings,
Samuel
--
======================================================================
Dr. Samuel Arvidsson AG Bioinformatics / GoFORSYS
Postdoc
+49 331 567 8615 arvi...@mpimp-golm.mpg.de
Max Planck Institute of Molecular Plant Physiology
Am Mühlenberg 1, 14476 Potsdam-Golm, Germany
http://bioinformatics.mpimp-golm.mpg.de http://www.goforsys.de
======================================================================
Hi Samuel,
Anything that can help will be highly appreciated. If you have a script that can do what Christina says in one go, I'll be glad to try it. Do you mind sending me this code in clearly formatted way, if you can comment it the better, otherwise its fine as it is. Better still if you can make it parse the fast file.
Musa.
On 18 October 2011 10:01, Samuel Arvidsson <arvi...@mpimp-golm.mpg.de> wrote:
Hi Musa and Christina,
I wrote a small python script that does the same in one go. You need BioPython (it uses the SeqIO fasta parser), however it could be easily modified to parse the fasta itself if you would need it - let me know. Here it goes:
..."
from Bio import SeqIO
import sys
fasta_file = sys.argv[1]
fasta_records = SeqIO.parse(open(fasta_file),"fasta")
base_transcripts = {}
for record in fasta_records:
� �# Check cluster/comp count, if it is a Trinity assembly
� �id_parts = record.id.split("_")
� �if (len(id_parts) > 2):
� � � �base_transcript = id_parts[0] + "_" + id_parts[1]
� � � �if (not base_transcript in base_transcripts):
� � � � � �curr_list = []
� � � �else:
� � � � � �curr_list = base_transcripts[base_transcript]
� � � �curr_list.append(record.id)
� � � �base_transcripts[base_transcript] = curr_list
for base_transcript in base_transcripts:
� �for curr_transcript in base_transcripts[base_transcript]:
� � � �print(base_transcript+"\t"+curr_transcript)
On Oct 17, 2:34 pm, Musa Hassan<musah...@gmail.com> �wrote:
Dear Musa,
All questions are welcome. �After running Trinity, you should have a set
of contigs in FASTA format. �Suppose that file is "contigs.fasta". �To
use
RSEM, you first need to run rsem-prepare-reference on these contigset.
sequences:
rsem-prepare-reference --no-polyA contigs.fasta reference_name
where reference_name is an identifier that you choose for this contig
�I suggest using the "--no-polyA" option, as many contigs will not extend
to the 3' end of the transcript and others will already have some polyA
sequence.
Once the reference is prepared you can run rsem-calculate-expression on
your samples.
Hope that helps,
Colin
On Oct 14, 2011, at 2:07 PM, Musa wrote:
Hi All,
This is my first time to use RSEM and I can�t figure out a few things.
Unfortunately I�ve run into some problems that I�ll probably be
embarrassed to look at a few weeks/months from now after using RSEM over
and over again, but right now I have no choice but to ask
I have aligned my reads in Trinity and have the fasta file. Now all I
want to do with RSEM is to calculate transcript abundance. I�ve read the
RSEM online manual and it�s not clear to me how to proceed. Do I still
need to do the 'rsem-prepare-reference' at what point is the reference
annotation file used?
Regards,
Musa
--
======================================================================
Dr. Samuel Arvidsson � � � � � � � � � � �AG Bioinformatics / GoFORSYS
Postdoc
+49 331 567 8615 � � � � � � � � � � � � � arvi...@mpimp-golm.mpg.de
Max Planck Institute of Molecular Plant Physiology
Am M�hlenberg 1, 14476 Potsdam-Golm, Germany
http://bioinformatics.mpimp-golm.mpg.de � � � � http://www.goforsys.de
======================================================================
-- ====================================================================== Dr. Samuel Arvidsson AG Bioinformatics / GoFORSYS Postdoc +49 331 567 8615 arvi...@mpimp-golm.mpg.de
Max Planck Institute of Molecular Plant Physiology Am M�hlenberg 1, 14476 Potsdam-Golm, Germany
Dear Samuel,
Thanks very much. The script runs so fast I was doubtful it worked, but it does. Now a question about the output. My output looks like this:
comp6803_c0���� comp6803_c0_seq1
Does that mean comp6803_c0 is the gene name and the other the squence?
Regards,
Musa
On 18 October 2011 11:37, Samuel Arvidsson <arvi...@mpimp-golm.mpg.de> wrote:
Dear Musa,
See the attached file for the same script, without the need of BioPython (just python is needed) and clear comments of each piece. It also outputs the average isoform count for your Trinity "genes".
It is a bit overkill, the basic extraction could be done with a one-liner with grep and sed. However you could easily extend the python script to do other stuff with the identifiers.
Greetings,
Samuel
Now we released RSEM v1.1.12, which contains a perl script for
generating the map from trinity fasta file. It does not require any
third party package.
Best,
Bo
>> Am M�hlenberg 1, 14476 Potsdam-Golm, Germanyhttp://bioinformatics.mpimp-golm.mpg.de http://www.goforsys.de
>> ======================================================================
>>
--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rsem-users.
--
RSEM website: http://deweylab.biostat.wisc.edu/rsem/
---
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
To post to this group, send email to rsem-...@googlegroups.com.
Visit this group at http://groups.google.com/group/rsem-users.
Total trinity transcripts: 97741
Total trinity components: 54216
Contig N50: 1702
My understanding is that the 'total trinity transcripts' represents total contigs including all the isoforms; and the trinity components represents unigenes, correct? Here I m interested in separating unigenes for downstream analysis. How can I get them?
I would really appreciate if I could any suggestions from you all. Thanks a lot in advance.
-Dinesh