differential expression analysis (RSEM) using isoforms or genes?

gaball...@gmail.com

unread,

Mar 28, 2016, 10:44:35 PM3/28/16

to trinityrnaseq-users

Dear All:

Hello, my name is Gabriel and I'm working on a de-novo RNAseq project for a novel organism (from which we have no genomic data). We secuenced different samples and I'd like to perform Differential Expression analysis. I've aligned my trinity assembly file with Blast/NR in order to annotate it (the plan is to use a tool such as B2G to assign GO terms).

So far, using the Trinity tutorial coupled with RSEM, I've got the RSEM.isoforms.results and RSEM.genes.results matrix for my samples. Which matrices should I use to perform differential expression analysis? Both, only isoforms or genes?

I've tried using the gene matrix but then, when trying to associate the results with Blast/NR alignments, I find that the Isoform is missing in the DE results, hence I can't give annotation to my transcripts (which is as expected as gene table does not carry the Isoform information). Should I just erase the isoform part of the header, or should I perform the DE analysis with isoforms?

So far I've read that both approaches have advantages and disavantages but I'm still not sure about which one should I take.

Please, could you help me out?

Best Regards,

Gabriel Ballesteros

Ken Field

unread,

Mar 29, 2016, 7:59:29 AM3/29/16

to gaball...@gmail.com, trinityrnaseq-users

Dear Gabriel-

I struggled with this question too. I recommend, at first, to stick with analysis at the transcript level. This has two advantages, in my opinion:

1. It is easier to annotate. Annotation occurs at the transcript level and it is much easier to assign homology, etc.

2. You don't have to worry about the validity of the transcript clusters. I don't understand how Trinity defines a "gene" well enough to know why, but I definitely found that some transcript clusters were not truly the same gene in my assembly. This was with the 2014 version of Trinity, so that may have gotten better, and it may have been a rare occurrence even then. However, it taught me that apparent gene-level differences in expression can be an artifact of the assembly and that each "gene" has to be validated individually.

In the long run, it would be great to do both, of course. But stick with transcripts at first and then see how different gene-level analysis is later.

Sincerely,

Ken

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--

Ken Field, Ph.D.

Associate Professor of Biology

Program in Cell Biology/Biochemistry

Bucknell University

Room 203A Biology Building

setar...@gmail.com

unread,

Mar 31, 2016, 6:04:58 AM3/31/16

to trinityrnaseq-users

Dear Ken,

Thank you for your comment, this issue is also one of my challenges. From your comment, I found that expression analysis at the transcript level is more reliable than gene-level, at least for Trinity assembly (version 2014). Could you please let me know if you have some experimental validation (by qRT-PCR for example) for this conclusion?

I read that given uneven coverage with RNA-seq data, it might better to do expression analysis at the gene-level and perform transcript level profiling for some interesting differentially expressed genes. Having your opinion on this issue would be great.

In the case of doing expression analysis at the gene-level, please kindly tell us if removing isoform part of the header (_i) is sufficient to link the DE genes with annotated transcriptome assembly?

Thank you,

Mary

Ken Field

unread,

Mar 31, 2016, 6:34:25 AM3/31/16

to maryam moazam, trinityrnaseq-users

Hi Mary,

No, I did not use qPCR to validate the expression patterns. My conclusions came from examining closely a few "genes" using IGV and the Trinotate annotations. I don't remember all of the details now, but I think that there were several genes that had transcripts that showed best blast hits to unrelated genes. These appeared to be due to chimeric transcripts that Trinity had grouped with legitimate transcripts. Now the chimeric transcripts showed very low levels of expression (by RSEM) so they may not have been hurting anything, but by using transcript-level expression I didn't have to worry about them (unless they had been differentially expressed and had been a legitimate novel gene, in which case I would have had to do a lot of work to verify it).

I am not sure about just manually deleting the isoform part of the header. That sounds like it should work for me, but I always just used align_and_estimate_abundance.pl and the genes and isoforms matrices that it generates.

Ken

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

setar...@gmail.com

unread,

Mar 31, 2016, 2:42:28 PM3/31/16

to trinityrnaseq-users

Hi Ken,

Thank you for your response. As you mentioned that artifact transcripts usually have low levels of expression, so we can ignore them by filtering of transcripts with FPKM value of less than 1 at the gene or transcript level as suggested in the Trinity, am I right? actually, one paper said that DE genes (or transcript) with FPKM value of less than 1 or 0.5 (I don't remember exactly which one) did not verify by qPCR analysis.

Regarding my second question, I also used align_and_estimate_abundance.pl that generated the genes and isoforms matrices as you told. However as Gabriel mentioned that we cannot give annotation to DE genes because when transcriptome assembly is subjected to blast, the isoform information is kept while it (isoform) missed when expression profiling was conducted at the gene level. In fact, with Trinity (version of 20140717), we have something like c2356_g1 as DE gene while we have something like c2356_g1_i1 or / and c2356_g1_i2 in the annotated assembly. In this case, I was wondering if I can simply remove the isoform part (_i) from the header of annotated assembly in order to identify DE genes using annotated assembly?

Thanks,

Mary

On Tuesday, March 29, 2016 at 7:14:35 AM UTC+4:30, gaball...@gmail.com wrote:

gaball...@gmail.com

unread,

Mar 31, 2016, 5:44:02 PM3/31/16

to trinityrnaseq-users

Hi Mary

As far as I've seen, different isoforms align to the same sequence (at least using blast), although alignments will have differences between isoforms. I think that removing isoform information (_i) shouldn't be such a big deal.

Running align_and_estimate_abundance.pl without a given gene to transcript map makes the script to generate it's own map (this example was obtained using RSEM). The gene_trans_map generated is as follows:

TR55145|c0_g1    TR55145|c0_g1_i1
TR55145|c0_g1    TR55145|c0_g1_i2
TR55145|c0_g1    TR55145|c0_g1_i3
TR55145|c0_g1    TR55145|c0_g1_i4
TR55145|c0_g1    TR55145|c0_g1_i5

This map is used afterwards for RSEM for gene level analysis. Hence, all isoforms are associated to the same gene, so at that level, all the isoform annotations would be associated to the same gene. That suggests me that removing _i from the annotation wouldn't be a big deal for merging annotation at gene level (please, correct me if I'm wrong)

The issue that I see from performing the analysis at gene level is what isoform should we keep/report afterwards? for example, in this case we would have 5 different isoforms associated with the same gene. If we decide to remove the _i from the fasta file, we would have five sequences with the same header. Which sequence (isoform) should we keep and which ones should we discard? The longest one? the better aligned?

Best regards,

Gabriel

setar...@gmail.com

unread,

Apr 1, 2016, 3:39:24 PM4/1/16

to trinityrnaseq-users

Hi Gabriel,

I agree with you for removing the isoform information (_i), from header and I plan to do it for my work. To be honest, I have no enough experience in this field. Please, Brian and other experts join us and correct our discussion.

Your second question is also one of my questions for analysis at the gene-level. It is important which isoform should be kept and probably validated by qRT-PCR. Please let me know if you get any update.

Thanks,

Mary

On Tuesday, March 29, 2016 at 7:14:35 AM UTC+4:30, gaball...@gmail.com wrote:

Brian Haas

unread,

Apr 2, 2016, 8:01:37 AM4/2/16

to maryam moazam, trinityrnaseq-users

For DE analysis, I tend to run it at both the transcript and at the gene level.

For an alternative grouping of transcript contigs into genes, you could explore using CORSET. In any case, read the CORSET paper to understand the utility of gene vs. isoform level analysis.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0410-6

hope this helps,

~b

setar...@gmail.com

unread,

Apr 5, 2016, 11:25:04 AM4/5/16

to trinityrnaseq-users

Hi Brian,

Nice to hear from you.

As I remember from you, transcript level analysis can provide info on differential splicing of alternatively spliced isoforms. Gene-level can have more power for detection of DE given it captures the read counts from all corresponding isoforms. So, I initially prefer to do expression profiling at the gene-level. As Gabriel mentioned in the post here, our question are:

We cannot give annotation to DE genes because when transcriptome assembly is subjected to blast, the isoform information is kept while it (isoform) missed when expression profiling was conducted at the gene level. In fact, with Trinity (version of 20140717), we have something like c2356_g1 as DE gene while we have something like c2356_g1_i1 or / and c2356_g1_i2 in the annotated assembly. In this case, I was wondering if I can simply remove the isoform part (_i) from the header of annotated assembly in order to identify DE genes using annotated assembly?

In the case of doing expression analysis at the gene-level, for genes with various transcript isoform (like below that we have 3 different isoforms associated with the same gene), if we decide to remove the _i from the fasta file, we would have 3 sequences with the same header. Which sequence (isoform) should be reported and validated by qRT-PCR analysis?

Thank you,

Mary

On Tuesday, March 29, 2016 at 7:14:35 AM UTC+4:30, gaball...@gmail.com wrote:

Brian Haas

unread,

Apr 5, 2016, 12:02:54 PM4/5/16

to maryam moazam, trinityrnaseq-users

Hi Mary,

I tend to do the DE analysis at both the gene level and the transcript level separately. I then scrutinize the results from each analysis.

I also load up the results into Trinotate for analysis using TrinotateWeb. If I find a gene of interest, I look at that gene report page and examine the expression of the different isoforms for that gene. It's usually obvious which isoform is the one worth validating since, in the case there are many isoforms, it's usually just one or two that have the dominant expression values (based on my experiences).

~b

setar...@gmail.com

unread,

Apr 5, 2016, 1:26:40 PM4/5/16

to trinityrnaseq-users

Hi Brian,

Thank you for your response. You mentioned that you scrutinize the results from expression analysis at the gene and transcipt level. For me as a beginner, it may take long time, It will be great if you please tell me your pipline for parsing DE results. Sorry, after expression analysis at the gene-level, removing isoform part (_i) from the header of annotated assembly in order to identify DE genes is OK, right? and for the same gene with different isoforms, those isoforms with dominant expression values should be reported and validated, yes?

All the best,

Mary

On Tuesday, March 29, 2016 at 7:14:35 AM UTC+4:30, gaball...@gmail.com wrote:

Brian Haas

unread,

Apr 5, 2016, 2:51:04 PM4/5/16

to maryam moazam, trinityrnaseq-users

Hi Mary,

The quantitation is described here:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification

You'd enable --trinity_mode in order to get the gene counts matrix in addition to the isoform counts matrix. I'll add a note to that effect so it'll be much clearer.

Then, the DE analysis is described here:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression

Just use the gene matrix instead of the isoform matrix to get the DE gene info.

best,

~b

Reed Liu

unread,

Jan 11, 2018, 4:27:16 AM1/11/18

to trinityrnaseq-users

Hi guys,

I've been working on RSEM gene&isoform quantification. Now I have a problem. I can calculate isoform.count.matrix, but can't get gene.counts.matrix. The error says that .isAllZero(counts): counts must be positive finite values. And the gene.counts.matrix (err) show lots NAs. I don't know how to solve this. Or just using the isoform file to do next DE analysis?

Thanks.

Brian Haas

unread,

Jan 11, 2018, 5:18:09 AM1/11/18

to Reed Liu, trinityrnaseq-users

Hi Reed,

The NAs have shown up when folks have run the quantification using different assemblies for each sample instead of against the same assembly. Could that be the case here? Trinity assumes you've done a single assembly for all your reads and are quantifying each sample on that single assembly.

best,

~brian

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrnaseq-users@googlegroups.com.

Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.