Error in "rsem-prepare-reference" from align_and_estimate_abundance using CORSET gene_trans_map

Giorgio Casaburi

Apr 20, 2017, 12:51:02 PM
to trinityrnaseq-users
Hi all,

I used CORSET to cluster my transcripts and I am trying to run align_and_estimate_abundance, but I keep getting this error right away:

Mapping Info is not correct, cannot find TRINITY_DN40607_c0_g1_i1's gene_id!


I can see that "TRINITY_DN40607_c0_g1_i1" is in both my assembly.fasta file and the CORSET cluster.txt file.


The header of the assembly.fasta file looks like this:

>TRINITY_DN40607_c0_g1_i1 len=214 path=[192:0-213] [-1, 192, -2]


While this is from the CORSET cluster.txt file:

TRINITY_DN40607_c0_g1_i1 Cluster-14284.0


(PS: I also tried to run prep reference alone, and it crashed with the same error.)


So I cannot figure out what's going on. Any help please?


Thanks in advance,

Giorgio

Brian Haas

Apr 21, 2017, 9:11:20 AM
to Giorgio Casaburi, trinityrnaseq-users
Hi Giorgio,

It looks like Corset's cluster file is a transcript-to-gene mapping (transcript first, cluster second), whereas a gene-to-transcript mapping (gene first, transcript second) is expected here.  Try reversing the order of the fields like so:

  cat cluster.txt | perl -lane 'print "$F[1]\t$F[0]";' > cluster.forTrinity.txt

and use that file instead.
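
A quick way to tell which column order a mapping file uses is to count how many IDs in each column also occur as sequence IDs in the FASTA file. The sketch below does that with awk. The function name count_fasta_ids_per_column is made up for illustration, and it assumes a whitespace-delimited two-column file and FASTA headers whose first word is the sequence ID.

```shell
# Hypothetical helper (not part of Trinity, RSEM or Corset).
# Usage: count_fasta_ids_per_column <fasta> <two_column_map>
count_fasta_ids_per_column() {
  awk '
    # First file (FASTA): remember the first word of each header, minus ">".
    NR == FNR { if (/^>/) seen[substr($1, 2)] = 1; next }
    # Second file (map): count, per column, the IDs that are FASTA sequence IDs.
    NF >= 2 { if ($1 in seen) c1++; if ($2 in seen) c2++ }
    END { printf "col1=%d col2=%d\n", c1 + 0, c2 + 0 }
  ' "$1" "$2"
}
```

Called as count_fasta_ids_per_column assembly.fasta cluster.txt, a result where nearly all matches fall in col1 suggests the file is transcript-first and needs the column swap above; matches in col2 suggest it is already gene(tab)transcript.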

best of luck,

~brian


--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrnaseq-users@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Giorgio Casaburi

Apr 21, 2017, 9:23:23 AM
to Brian Haas, trinityrnaseq-users
Thanks Brian! I realized that later and forgot to post back here. Your solution is perfect!

Cheers,
Giorgio
--
________________________________
Giorgio Casaburi, Ph.D.
Bioinformatics Scientist
2121 Second Street
Suite B107
Davis, CA 95618

Brian Haas

Apr 21, 2017, 9:25:25 AM
to Giorgio Casaburi, trinityrnaseq-users
great!  ok - best of luck!

~b


Xavier

Nov 22, 2017, 4:43:24 AM
to trinityrnaseq-users
Dear all,

I am getting a similar problem but I really don't know how to solve it. Here is the error:

CMD: touch /path/Assembly.fasta.bowtie2.started
CMD: bowtie2-build /path/Assembly.fasta /path/Assembly.fasta.bowtie2
Building a SMALL index
CMD: touch /path/Assembly.fasta.RSEM.rsem.prepped.started
CMD: rsem-prepare-reference  --transcript-to-gene-map /path/Gene_to_Map /path/Assembly.fasta /path/Assembly.fasta.RSEM
Mapping Info is not correct, cannot find DRK1_a_DN42_c0_g2_i1's gene_id!
Error, cmd: rsem-prepare-reference  --transcript-to-gene-map /path/Gene_to_Map /path/Assembly.fasta /path/Assembly.fasta.RSEM died with ret: 65280 at /media/vol1/apps/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl line 778.

My Gene_to_Map file is tab-delimited (some gene IDs are identical to the transcript ID because there is no other isoform):

DRK1_a_DN1_c1_g1_i1    DRK1_a_DN1_c1_g1_i1
DRK1_a_DN42_c0_g2_i1    DRK1_d_DN44513_c0_g1_i1
DRK1_a_DN42_c0_g2_i1    DRK2_a_DN13230_c0_g1_i1
DRK1_a_DN42_c0_g2_i1    DRK2_a_DN13230_c0_g1_i2
DRK1_a_DN42_c0_g2_i1    DRK2_b_DN16647_c0_g1_i2
DRK1_c_DN25880_c1_g1_i14    DRK2_c_DN30917_c0_g2_i3
DRK1_c_DN25880_c1_g1_i14    DRK2_d_DN31894_c0_g1_i5
DRK1_d_DN11678_c0_g1_i1    DRK1_a_DN23988_c2_g10_i1
DRK2_a_DN19969_c0_g1_i3    DRK1_a_DN14564_c0_g1_i1

Any help?
Thank you in advance,
Xavier


Xavier

Nov 22, 2017, 4:46:16 AM
to trinityrnaseq-users
Here is my command:
/media/vol1/apps/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl --transcripts /path/Assembly.fasta --seqType fa --samples_file /path/Samples \
--est_method RSEM  --aln_method bowtie2 --thread_count 20 --output_dir /path/abundance \
--gene_trans_map /path/Gene_to_Map --prep_reference

Mark Chapman

Nov 22, 2017, 6:54:56 AM
to Xavier, trinityrnaseq-users
Hi Xavier,
That file doesn't look right. Did you prepare it manually?
I think it should be gene ID <tab> transcript ID, so all your transcript IDs (e.g. DRK1_a_DN42_c0_g2_i1) should match a gene ID (e.g. DRK1_a_DN42_c0_g2).
Also, it should presumably follow the traditional Trinity naming. What are your transcripts called in your .fasta? The default is e.g. >TRINITY_DN1000|c115_g5_i1
Not sure if this is the issue, just an idea.
Cheers, Mark

--
Dr. Mark A. Chapman
+44 (0)2380 594396
------------------------------------
Biological Sciences
University of Southampton
Life Sciences Building 85
Highfield Campus
Southampton
SO17 1BJ

Xavier

Nov 22, 2017, 7:15:24 AM
to trinityrnaseq-users
Hi Mark,

I created those identifiers from the Trinity results and changed the names to keep traceability. This file is the result of using the EvidentialGene software to reduce 'gene' redundancy.
I am using the information from this file to map transcript (isoform) IDs to gene IDs, rather than Trinity mode, so I suppose that gene_ID <tab> transcript_ID should be OK.
I don't know where the issue is...

Regards
Xavier

Mark Chapman

Nov 22, 2017, 7:17:47 AM
to Xavier, trinityrnaseq-users
Hi Xavier,
I think it's not in a format Trinity accepts. The gene and transcript IDs should be the same, except that the latter has _i1, _i2, etc. on the end. It certainly shouldn't be DRK1_a_DN42 and DRK1_d_DN44513 on the same line.
That's my guess.


Xavier

Nov 22, 2017, 7:28:48 AM
to trinityrnaseq-users
Hi Mark,

I am not sure if that is the issue. As originally Giorgio said:

Mapping Info is not correct, cannot find TRINITY_DN40607_c0_g1_i1's gene_id!

While this is from the CORSET cluster.txt file:

TRINITY_DN40607_c0_g1_i1 Cluster-14284.0


In this example the IDs are not in a Trinity-friendly format either; the problem was the transcript_ID<tab>gene_ID order in the mapping file. I think it is RSEM that has to handle this data to do the job, so it shouldn't matter whether the IDs end in _i1, _i2, etc.


Mark Chapman

Nov 22, 2017, 8:13:04 AM
to Xavier, trinityrnaseq-users
If you remove the '_i1', '_i2', etc. from column one, will it work? I wonder if Trinity thinks these are isoforms, not genes.


Brian Haas

Nov 22, 2017, 1:13:54 PM
to Mark Chapman, Xavier, trinityrnaseq-users
right, that gene-to-map file looks strange.


For lines like this:

DRK1_a_DN1_c1_g1_i1    DRK1_a_DN1_c1_g1_i1

you might make the gene identifier the prefix lacking the _i1

DRK1_a_DN1_c1_g1   DRK1_a_DN1_c1_g1_i1

and see if that helps.  
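
If the Trinity-style names already reflect the grouping you want, a map of this shape can be derived from the FASTA headers by stripping the trailing _i<number> from each transcript ID. This is a minimal sketch, assuming every header ID ends in such a suffix; the function name is hypothetical. It ignores any external clustering (Corset, EvidentialGene), so it only fits when the naming-based gene grouping is acceptable.

```shell
# Hypothetical helper: print "gene<TAB>transcript" for every FASTA record,
# where gene is the transcript ID with its trailing _i<number> removed.
# Usage: trinity_style_gene_trans_map <fasta> > gene_trans_map
trinity_style_gene_trans_map() {
  awk '
    /^>/ {
      t = substr($1, 2)          # transcript ID: first word of header, no ">"
      g = t
      sub(/_i[0-9]+$/, "", g)    # gene ID: drop the isoform suffix
      print g "\t" t
    }
  ' "$1"
}
```

For a header such as >DRK1_a_DN1_c1_g1_i1 this prints the line DRK1_a_DN1_c1_g1(tab)DRK1_a_DN1_c1_g1_i1, matching the example above.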



Xavier

Nov 23, 2017, 4:37:36 AM
to trinityrnaseq-users
Hello everyone,
After changing the Gene_to_Map file with your suggestions I got the same error:

WARNING - appears that another process has started the rsem-prep step... proceeding with caution.
CMD: touch /path/Assembly.rsem.fasta.RSEM.rsem.prepped.started
CMD: rsem-prepare-reference  --transcript-to-gene-map /media/vol2/home/jcordoba/Transcriptomics/DRK/assembly/Gene_to_Map /path/Assembly.rsem.fasta /path/Assembly.rsem.fasta.RSEM

Mapping Info is not correct, cannot find DRK1_a_DN42_c0_g2_i1's gene_id!
Error, cmd: rsem-prepare-reference  --transcript-to-gene-map /media/vol2/home/jcordoba/Transcriptomics/DRK/assembly/Gene_to_Map /path/Assembly.rsem.fasta /path/Assembly.rsem.fasta.RSEM died with ret: 65280 at /media/vol1/apps/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl line 778.

The mapping file now looks like this:

DRK1_a_DN42_c0_g2    DRK1_d_DN44513_c0_g1_i1
DRK1_a_DN42_c0_g2    DRK2_a_DN13230_c0_g1_i1
DRK1_a_DN42_c0_g2    DRK2_a_DN13230_c0_g1_i2
DRK1_a_DN42_c0_g2    DRK2_b_DN16647_c0_g1_i2
DRK2_a_DN29929_c4_g1    DRK1_a_DN22192_c5_g2_i9
DRK2_a_DN29929_c4_g1    DRK1_a_DN23678_c2_g1_i22

I wonder whether I should change the first column to something like:

Gene1    DRK1_d_DN44513_c0_g1_i1
Gene2    DRK2_a_DN13230_c0_g1_i2
Gene3    DRK1_a_DN22192_c5_g2_i9
Gene4    DRK1_a_DN23678_c2_g1_i22

But then I would have to modify the gene identifiers in the Assembly.fasta file, and I would lose all traceability.
I don't know how to modify this mapping file.
Thanks

Mark Chapman

Nov 23, 2017, 5:50:14 AM
to Xavier, trinityrnaseq-users
Hi Xavier,
I am guessing from that error that DRK1_a_DN42_c0_g2_i1 is in your .fasta file but is not in your Gene_to_Map file, is that right? Every transcript should have a gene I would guess.
cheers, Mark
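
One way to test this guess is to list every sequence ID in the FASTA file that never appears in the second (transcript) column of the mapping file. Below is a rough awk sketch, assuming a whitespace-delimited gene(tab)transcript file and FASTA headers whose first word is the ID; the function name is hypothetical. Any ID it prints would presumably trigger the "cannot find ...'s gene_id" error.

```shell
# Hypothetical helper: print FASTA sequence IDs that are absent from
# column 2 (the transcript column) of a gene<TAB>transcript map.
# Usage: fasta_ids_missing_from_map <fasta> <gene_trans_map>
fasta_ids_missing_from_map() {
  awk '
    # First file read is the map: remember every transcript ID in column 2.
    NR == FNR { if (NF >= 2) mapped[$2] = 1; next }
    # Second file read is the FASTA: report header IDs with no map entry.
    /^>/ { id = substr($1, 2); if (!(id in mapped)) print id }
  ' "$2" "$1"
}
```

With the Gene_to_Map lines posted earlier, DRK1_a_DN42_c0_g2_i1 occurs only in column 1, so if a sequence with that ID is in the FASTA file it would be reported here.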


Xavier

Mar 14, 2018, 6:35:37 AM
to trinityrnaseq-users
Hello again,

I have been trying to modify my fasta files, but I got the same error again.

Mapping Info is not correct, cannot find DRK1_a_DN68_c0_g1_i1's gene_id!


If I grep for this ID in the fasta file:
>DRK1_a_DN68_c0_g1_i1
CAGGAAAAAAGCTGTGGATTGGAAATACAGAGGTGCACTATGTGGATCACAGAATACCTTTGCCAAGAGC
GTGAGGCGCTGATGTGTGACTTGGCAAATTGAACTGCGGCTGGTGCAGCAACTTTCGTCAAAGCAGTCGT
TGTGCGTCGTCCGATTTGGGCACCCATTGCGGCCGTGGTGGTGGGGAGGGAACGCCGCGCGGACGCCGAA

and in the mapping file
DRK1_a_DN68_c0_g1_i1    DRK1_c_DN1361_c0_g1_i1

It seems correct: a unique sequence with mapping info.

The log output:
CMD: touch /path/OKALT.fasta.bowtie2.started
CMD: bowtie2-build /path/OKALT.fasta /path/OKALT.fasta.bowtie2
Building a SMALL index
CMD: touch /path/OKALT.fasta.RSEM.rsem.prepped.started
CMD: rsem-prepare-reference  --transcript-to-gene-map /path/isoform /path/OKALT.fasta.RSEM
Mapping Info is not correct, cannot find DRK1_a_DN68_c0_g1_i1's gene_id!
Error, cmd: rsem-prepare-reference  --transcript-to-gene-map /path/isoform /path/OKALT.fasta /path/OKALT.fasta.RSEM died with ret: 65280 at /path/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl line 778.

Xavier

Mar 14, 2018, 6:42:37 AM
to trinityrnaseq-users
Do you have any suggestions? Any help will be welcome.
The code I have used:

/path/trinityrnaseq-Trinity-v2.5.1/util/align_and_estimate_abundance.pl --transcripts /path/OKALT.fasta --seqType fa \
--samples_file path/A.samples --est_method RSEM --aln_method bowtie2 --thread_count 40 --output_dir /path/A \
--gene_trans_map /path/isoform --prep_reference


Thank you again for your support
Best wishes,
Xavier

Brian Haas

Mar 14, 2018, 7:35:45 AM
to Xavier, trinityrnaseq-users
Hi Xavier,

It looks like it's not finding the gene identifier for the transcript named in the error message in your /path/isoform file.

The format should be

 gene_id(tab)isoform_id

and your file has the order switched.



Xavier

Mar 14, 2018, 7:46:32 AM
to trinityrnaseq-users
Hello Brian,

The ID DRK1_a_DN68_c0_g1_i1 is a gene. So, the mapping file is correct:

gene_id(tab)isoform_id
DRK1_a_DN68_c0_g1_i1    DRK1_c_DN1361_c0_g1_i1


Why is it considered an isoform? Should I exclude the gene_id sequences from the fasta file?
Cheers,
Xavier



Brian Haas

Mar 14, 2018, 7:53:14 AM
to Xavier, trinityrnaseq-users
The isoforms have physical sequences and are in the fasta file. 

The genes (in this case) are just  identifiers that group the isoforms together for the purpose of getting gene-level expression estimates.  The genes don't have sequences here. (if you want gene sequences at some point, you can generate 'supertranscript' sequences to represent the genes, but this is a different analysis).

In the case of the gene-to-trans mapping file, each transcript_id (2nd column) should be represented by a sequence in the target fasta file.
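
That requirement can be checked with a short awk sketch that prints every transcript ID from column 2 of the mapping file that has no matching sequence in the FASTA file. The function name is hypothetical, and the sketch assumes a whitespace-delimited two-column map and FASTA headers whose first word is the ID.

```shell
# Hypothetical helper: print column-2 (transcript) IDs from the map that
# have no corresponding record in the FASTA file.
# Usage: map_ids_missing_from_fasta <fasta> <gene_trans_map>
map_ids_missing_from_fasta() {
  awk '
    # First file (FASTA): remember each sequence ID.
    NR == FNR { if (/^>/) have[substr($1, 2)] = 1; next }
    # Second file (map): report transcript IDs with no sequence.
    NF >= 2 && !($2 in have) { print $2 }
  ' "$1" "$2"
}
```

Empty output means every mapped transcript has a sequence. The sketch says nothing about the reverse direction, i.e. FASTA sequences that lack a line in the map.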

It's unusual that your gene identifiers would look like isoform identifiers.  In Trinity, the gene identifiers lack the 'i' suffix and end with a g-number.

hope this helps,

~b

Xavier

Mar 14, 2018, 8:30:44 AM
to trinityrnaseq-users
Thank you very much Brian,

Now it's working fine. I removed the "gene_ids" sequences from my fasta.
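
A filtering step of this kind could look roughly like the sketch below, which keeps only the FASTA records whose ID appears in column 2 of the mapping file and writes them to standard output. It is an illustration under assumptions (whitespace-delimited gene(tab)transcript map, first word of each header is the ID, hypothetical function name), not necessarily the exact procedure used here.

```shell
# Hypothetical helper: keep only FASTA records whose ID is listed as a
# transcript (column 2) in the gene<TAB>transcript map.
# Usage: keep_mapped_fasta_records <fasta> <gene_trans_map> > filtered.fasta
keep_mapped_fasta_records() {
  awk '
    # First file read is the map: remember every transcript ID.
    NR == FNR { if (NF >= 2) mapped[$2] = 1; next }
    # On each FASTA header decide whether the record is kept.
    /^>/ { id = substr($1, 2); keep = (id in mapped) }
    # Print header and sequence lines of kept records only.
    keep
  ' "$2" "$1"
}
```

Writing to a new file rather than overwriting the original assembly keeps the dropped sequences available in case the mapping file is revised later.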


> It's unusual that your gene identifiers would look like isoform identifiers

I created my mapping file by clustering sequences with more than 99% identity and similarity. That is why I had sequences for both the gene_id and the transcript_id.
Now I am going to optimize my mapping file.

Thanks again,
Xavier