abundance_estimates_to_matrix.pl with RSEM

277 views
Skip to first unread message

Sjannie Lefevre

unread,
Mar 12, 2018, 8:55:17 AM3/12/18
to trinityrnaseq-users
Hi

I have used RSEM to estimate abundance for both genes and isoforms, so I have both .genes.results and .isoforms.results for all samples, one of each in separate directories for each sample. However, it does not seem that the script is set up to actually use both files, at least it errors out when I put both in the list of files (causing there to be duplicate sample names, as there is one for egens and one for isoforms). Rather it seems it uses the gene_trans_map to make the gene abundances from just the isoform.results. I have tried to make the matrix separately from the genes.results, as that made me a bit suspicious, and the counts are not the same - not far off, but not actually the same. 

It is of course not at problem just running the script twice (though the results for genes will actually be called isoforms), but then I think it should be specified in the documentation that one needs to do this, and that the counts in the gene matrix will not match the counts from RSEM for genes.

Or have I misunderstood the use of the quant_files option?

Cheers
Sjannie

Sjannie Lefevre

unread,
Mar 12, 2018, 8:59:37 AM3/12/18
to trinityrnaseq-users
Using v2.5.1 by the way.

Brian Haas

unread,
Mar 12, 2018, 9:29:58 AM3/12/18
to Sjannie Lefevre, trinityrnaseq-users
Hi Sjannie,

The latest versions of Trinity just need the isoform files and it'll automatically create the gene matrix for you based on that (and having the gene-to-trans mapping file).

best,

~brian

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrnaseq-users@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Sjannie Lefevre

unread,
Mar 12, 2018, 1:53:07 PM3/12/18
to trinityrnaseq-users
Hi Brian, thanks for clearing that up.

It makes sense, assuming that the gene count is just the sum of the isoforms (which is how RSEM makes the gene.results), but why are the numbers consistently a bit different (smaller), then, e.g. 3286 when they should be 4223, or 178 when they should be 444? I have looked at several now, and it is always the same, that the gene.counts are a bit smaller.

It is even the case when there is just one isoform for a gene, which is really weird - the numbers are close but not the same (e.g. 3 for the isoform and 5 for the 'gene', so the opposite difference)... 
Basically it is just confusing, what is happening with the gene.counts.matrix in between that I am not aware of?

When looking at the TPM files, the gene counts are correct, i.e. they are the same as calculated by RSEM, and also corresponds to the sum of the isoforms.

The number of isoforms and genes are correct, and the ids look as they should, so I do not think it is the gene_trans_map that is wrong (in any case it would not explain the situation where there is just one isoform). Unless the script reads the map in a way that makes it wrong. It looks like this:

Cluster100011_gene0     Cluster100011_gene0_iso157037c5g6i1
Cluster100011_gene0     Cluster100011_gene0_isoDN124568c2g1i1

As I said before I used the trinitystats script on the fasta file with these headers, and that was able to give the correct numbers of isoforms and genes, so I just assumed that it was okay.

Anyways - I am just confused as to why it does not add up, because I though it should (and it does in RSEM's output).

Cheers
Sjannie


On Monday, 12 March 2018 14:29:58 UTC+1, Brian Haas wrote:
Hi Sjannie,

The latest versions of Trinity just need the isoform files and it'll automatically create the gene matrix for you based on that (and having the gene-to-trans mapping file).

best,

~brian
On Mon, Mar 12, 2018 at 8:59 AM, Sjannie Lefevre <sjannie...@gmail.com> wrote:
Using v2.5.1 by the way.


On Monday, 12 March 2018 13:55:17 UTC+1, Sjannie Lefevre wrote:
Hi

I have used RSEM to estimate abundance for both genes and isoforms, so I have both .genes.results and .isoforms.results for all samples, one of each in separate directories for each sample. However, it does not seem that the script is set up to actually use both files, at least it errors out when I put both in the list of files (causing there to be duplicate sample names, as there is one for egens and one for isoforms). Rather it seems it uses the gene_trans_map to make the gene abundances from just the isoform.results. I have tried to make the matrix separately from the genes.results, as that made me a bit suspicious, and the counts are not the same - not far off, but not actually the same. 

It is of course not at problem just running the script twice (though the results for genes will actually be called isoforms), but then I think it should be specified in the documentation that one needs to do this, and that the counts in the gene matrix will not match the counts from RSEM for genes.

Or have I misunderstood the use of the quant_files option?

Cheers
Sjannie

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.

Brian Haas

unread,
Mar 12, 2018, 1:57:18 PM3/12/18
to Sjannie Lefevre, trinityrnaseq-users
Hi Sjannie,

Yes, you'll see some differences there.   The gene counts are computed a bit differently now, as outlined here:

with

When you include the --gene_trans_map file above, it will automatically generate the gene-level count and expression matrices, using the 'scaledTPM' method as described in txImport but implemented here directly in the Trinity script. This 'scaledTPM' method for estimating gene counts accounts for differences in isoform lengths that could otherwise lead to false gene DE reporting under situations where it is differential transcript usage (DTU) as opposed to differential gene expression (DGE) occurring. See Soneson et al., F1000 Research, 2016 for details.

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsubscribe...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.



--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrnaseq-users@googlegroups.com.

Sjannie Lefevre

unread,
Mar 12, 2018, 2:09:52 PM3/12/18
to trinityrnaseq-users
Doh - note to self: look into provided refs before asking silly questions ;)

Thanks!

Sjannie

Brian Haas

unread,
Mar 12, 2018, 2:11:02 PM3/12/18
to Sjannie Lefevre, trinityrnaseq-users
No worries!  This is relatively new and I'm constantly changing things (hopefully for the better).

Do let me know at any time if things don't look as expected.

best,

~b


Sjannie

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrnaseq-users@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages