How to reduce Trinity transcript redundancy

gaball...@gmail.com

Jul 30, 2016, 1:13:58 AM

to trinityrnaseq-users

Hello:

Recently we sequenced several libraries from different populations of a non-model insect species (Illumina 2x100 bp, RNA-seq). I assembled a transcriptome using Trinity and am now trying to perform differential expression analysis. However, I'm facing the following situation:

Different isoforms with the same annotation (i.e., the same BLAST hit) have different expression values between samples and are reported as overexpressed in all samples.

Which is the correct way to analyse these transcripts? I understand that these would be redundant transcripts/contigs. Should I remove them?

I've tried to reduce the number of redundant transcripts by picking transcripts using homology searches against the NR database (keeping the highest bit-score transcript among several transcripts aligning to the same protein), as described in the paper "Removal of redundant contigs from de novo RNA-Seq assemblies via homology search improves accurate detection of differentially expressed genes", http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-015-2247-0 . Is this a correct way to proceed? Or should I really do the differential expression analysis using the complete transcript set given by Trinity?
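
For readers unfamiliar with the selection step described above, it can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes BLAST tabular output in the default 12-column layout (bit score in the last column), and the function name and example rows are made up.

```python
import csv
import io

def best_transcript_per_protein(blast_tsv):
    """Keep, for each subject protein, the query transcript with the highest bit score.

    blast_tsv: iterable of lines in default BLAST tabular layout (assumed):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    Returns {subject_id: (query_id, bitscore)}.
    """
    best = {}
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if subject not in best or bitscore > best[subject][1]:
            best[subject] = (query, bitscore)
    return best

# Made-up example: two isoforms hit protA, one transcript hits protB.
example = io.StringIO(
    "TRINITY_DN10_c0_g1_i1\tprotA\t98.0\t200\t4\t0\t1\t600\t1\t200\t1e-100\t350\n"
    "TRINITY_DN10_c0_g1_i2\tprotA\t95.0\t180\t9\t0\t1\t540\t1\t180\t1e-80\t290\n"
    "TRINITY_DN22_c1_g1_i1\tprotB\t60.0\t150\t60\t0\t1\t450\t1\t150\t1e-30\t120\n"
)
kept = best_transcript_per_protein(example)
```

Note that every transcript not selected this way is discarded, including transcripts with no hit at all, which is part of what the question is asking about.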

Best regards,

Gabriel

Brian Haas

Jul 30, 2016, 8:19:42 AM

to gaball...@gmail.com, trinityrnaseq-users

Hi Gabriel,

The way to deal with this is to perform the DE analysis at the 'gene' level (in addition to what you've done at the 'isoform' level). You can use the provided Trinity 'gene' identifiers:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Differential-Expression

(search for 'gene' in the page and you'll find the relevant section)
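
To illustrate what the Trinity 'gene' identifier is: Trinity transcript accessions end in an isoform suffix, so the gene identifier is the accession with that suffix removed. The sketch below assumes accessions of the form TRINITY_DN1000_c115_g5_i1; the function names are hypothetical, and Trinity's own tooling already generates this gene-to-transcript mapping for you.

```python
import re

# Assumed accession layout: <cluster>_g<gene number>_i<isoform number>
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")

def trinity_gene_id(transcript_id):
    """Strip the trailing isoform suffix (_iN) to obtain the gene identifier."""
    gene_id, n_subs = _ISOFORM_SUFFIX.subn("", transcript_id)
    if n_subs == 0:
        raise ValueError(f"no isoform suffix found in {transcript_id!r}")
    return gene_id

def gene_to_transcripts(transcript_ids):
    """Group transcript accessions by their gene identifier."""
    mapping = {}
    for tid in transcript_ids:
        mapping.setdefault(trinity_gene_id(tid), []).append(tid)
    return mapping

groups = gene_to_transcripts([
    "TRINITY_DN1000_c115_g5_i1",
    "TRINITY_DN1000_c115_g5_i2",
    "TRINITY_DN7_c0_g1_i1",
])
```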

Alternatively you could use the 'gene' groupings provided by CORSET:

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0410-6

best,

~brian

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

gaball...@gmail.com

Jul 30, 2016, 9:57:07 AM

to trinityrnaseq-users, gaball...@gmail.com

Hi Brian

Thanks for your answer. I understand that, in that analysis, all isoform expression values are assigned to their respective "gene" using the RSEM isoform-to-gene mapping file. But then, how can I link the 'gene' level to the 'isoform' level for annotation? That is, out of 5 different isoforms, which one should I consider as the gene? Some people just take the longest isoform and treat it as the "gene"-level sequence.

Another question: do you recommend using clustering tools such as CD-HIT to reduce transcriptome redundancy and improve detection of differentially expressed genes? I've also seen a paper (linked in my original post) about using homology searches, keeping the best transcripts, and using this reduced dataset for DE analysis. Would that be a good strategy?

Best regards,

Gabriel

Brian Haas

Jul 30, 2016, 10:08:12 AM

to gaball...@gmail.com, trinityrnaseq-users

Responses below

On Sat, Jul 30, 2016 at 9:57 AM, <gaball...@gmail.com> wrote:

> Hi Brian
>
> Thanks for your answer. I understand that, in that analysis, all isoform expression values are assigned to their respective "gene" using the RSEM isoform-to-gene mapping file. But then, how can I link the 'gene' level to the 'isoform' level for annotation? That is, out of 5 different isoforms, which one should I consider as the gene? Some people just take the longest isoform and treat it as the "gene"-level sequence.

I tend to start with the gene-level DE analysis results, and then if I need to pick a representative transcript for downstream analysis, I examine the transcript-level expression values and transcript-level DE results in combination with the Trinotate annotations:

http://trinotate.github.io/

I generally do this through TrinotateWeb, where you select a gene and it shows you all the individual transcript level results in various plots (heatmaps, line graphs, etc.). This is not a high-throughput exercise mind you... but rather exploring certain candidates of interest for more thorough downstream investigation.

> Another question: do you recommend using clustering tools such as CD-HIT to reduce transcriptome redundancy and improve detection of differentially expressed genes? I've also seen a paper (linked in my original post) about using homology searches, keeping the best transcripts, and using this reduced dataset for DE analysis. Would that be a good strategy?

The DE analysis tools we use in Bioconductor tend to be count-based, and if you do the expression analysis at the gene level, it aggregates the (nonredundant) counts across all the corresponding isoforms. As long as you have good gene-to-transcript mappings, I don't think there's reason to worry or any need to do a CD-HIT-style reduction.
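
The aggregation described above amounts to summing each sample's isoform counts within a gene. A minimal sketch of the idea, assuming counts are held as nested dictionaries (the function name, sample names, and numbers are hypothetical; the actual Trinity/RSEM workflow produces gene-level matrices for you):

```python
from collections import defaultdict

def sum_counts_by_gene(isoform_counts, transcript_to_gene):
    """Sum per-sample isoform counts into per-sample gene counts.

    isoform_counts: {transcript_id: {sample: count}}
    transcript_to_gene: {transcript_id: gene_id}
    """
    gene_counts = defaultdict(lambda: defaultdict(float))
    for transcript, per_sample in isoform_counts.items():
        gene = transcript_to_gene[transcript]
        for sample, count in per_sample.items():
            gene_counts[gene][sample] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {gene: dict(samples) for gene, samples in gene_counts.items()}

isoform_counts = {
    "g1_i1": {"popA": 120.0, "popB": 10.0},
    "g1_i2": {"popA": 30.5, "popB": 95.0},
    "g2_i1": {"popA": 7.0, "popB": 8.0},
}
transcript_to_gene = {"g1_i1": "g1", "g1_i2": "g1", "g2_i1": "g2"}
gene_level = sum_counts_by_gene(isoform_counts, transcript_to_gene)
```

Because every read is counted once toward its gene regardless of which isoform it was assigned to, redundant isoforms do not inflate the gene-level signal.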

A lot of people get stressed out over the sheer number of transcripts. Here's our FAQ entry related to this:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-FAQ#ques_why_so_many_transcripts

I hope this helps,

~b

setar...@gmail.com

Jul 31, 2016, 10:59:46 AM

to trinityrnaseq-users

Hi Brian,

In your response, you said, "I tend to start with the gene-level DE analysis results, and then if I need to pick a representative transcript for downstream analysis, I examine the transcript-level expression values". Could you please let me know whether by "representative transcript" you mean the transcript with the highest FPKM value, or whether there are other criteria for selecting it?

During gene expression analysis with edgeR within Trinity, I found that several DE genes annotated as the same protein show opposite expression: some are up-regulated and some are down-regulated. To draw a meaningful biological conclusion from such DE genes, I need to consider them as either up-regulated or down-regulated. Could you please let me know how to interpret such results, and whether there are any criteria for correctly classifying them as up- or down-regulated?

Thank you

Brian Haas

Jul 31, 2016, 11:24:59 AM

to maryam moazam, trinityrnaseq-users

responses below

On Sun, Jul 31, 2016 at 10:59 AM, <setar...@gmail.com> wrote:

> Hi Brian,
>
> In your response, you said, "I tend to start with the gene-level DE analysis results, and then if I need to pick a representative transcript for downstream analysis, I examine the transcript-level expression values". Could you please let me know whether by "representative transcript" you mean the transcript with the highest FPKM value, or whether there are other criteria for selecting it?

I generally take into account the expression profile of the transcript across all samples as shown in the heatmap, including its expression intensity. I also examine any annotation information available, such as the finding of conserved Pfam domains and the extent of alignment to homologous proteins (usually shown in the little integrated 'genome viewer' in the TrinotateWeb report for that gene).

> During gene expression analysis with edgeR within Trinity, I found that several DE genes annotated as the same protein show opposite expression: some are up-regulated and some are down-regulated. To draw a meaningful biological conclusion from such DE genes, I need to consider them as either up-regulated or down-regulated. Could you please let me know how to interpret such results, and whether there are any criteria for correctly classifying them as up- or down-regulated?

If these are paralogous genes, then it could be interesting biology. If they are different isoforms of the same gene, it could also be very interesting - demonstrating differential isoform usage - but I'd be very skeptical and carefully consider the DE analysis results. For differential isoform usage, I've relied more on tools such as EBSeq and mmdiff (neither well integrated into Trinity yet) since the DE analysis takes into account read mapping uncertainty among isoforms, and read mapping uncertainty is the single most confounding issue when trying to estimate different isoform abundances.
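
One way to shortlist candidates for this kind of careful inspection is to flag genes whose significant isoforms change in opposite directions. The sketch below is only a screening aid under assumed inputs (rows of gene, transcript, log fold change, and FDR taken from a transcript-level DE table; the function name and cutoff are hypothetical). It does not account for read mapping uncertainty, so it is no substitute for the dedicated tools mentioned above.

```python
def genes_with_opposing_isoforms(de_rows, fdr_cutoff=0.05):
    """Return genes that have significant isoforms with both positive and negative log fold change.

    de_rows: iterable of (gene_id, transcript_id, log_fc, fdr) tuples.
    """
    up, down = set(), set()
    for gene, _transcript, log_fc, fdr in de_rows:
        if fdr > fdr_cutoff:
            continue  # ignore isoforms that are not significant
        if log_fc > 0:
            up.add(gene)
        elif log_fc < 0:
            down.add(gene)
    return sorted(up & down)

rows = [
    ("g1", "g1_i1", 2.1, 0.001),
    ("g1", "g1_i2", -1.8, 0.010),   # opposite direction, significant
    ("g2", "g2_i1", 1.5, 0.020),
    ("g2", "g2_i2", -0.9, 0.400),   # opposite direction, not significant
    ("g3", "g3_i1", -2.2, 0.003),
]
candidates = genes_with_opposing_isoforms(rows)
```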

Nothing substitutes for validating the potentially interesting cases via RT-PCR, etc.

I hope this helps,

~b

setar...@gmail.com

Jul 31, 2016, 12:00:21 PM

to trinityrnaseq-users

Thank you very much for your prompt response. Regarding the paralogous genes you mentioned, their sequences should be highly similar to each other, yes? However, when I checked some of these DE genes, they showed no significant similarity to each other, even though they received similar annotations from BLAST with an e-value cutoff of 1e-5 (percent identity and score were above 50 and 100, respectively). Please kindly share your opinion on this.

Best

Brian Haas

Jul 31, 2016, 12:04:31 PM

to maryam moazam, trinityrnaseq-users

Sounds like paralogs. Be sure the different assembled transcripts do align to each other (try tblastn). If you have TransDecoder-predicted peptides, see if you can align them together. There should be detectable patterns of conservation.

Paralogs show up differently in Trinity assemblies. If they're divergent paralogs, then you might detect homology at the coding level (translated). In other cases, they could be highly similar, but the patterns of variation will be very different from isoforms of the same gene (isoforms tend to have long identical stretches with skipped exons, etc., whereas paralogs will have lots of scattered small stretches of variation consistent with different evolutionary histories of fixed mutations).
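
The contrast described above (long identical stretches interrupted by a few large differences versus many small scattered differences) can be summarised from a pairwise alignment. A minimal sketch, assuming the two sequences are already aligned to equal length with '-' for gaps; the function name and the toy sequences are hypothetical, and the numbers it returns are descriptive only, with no built-in threshold for calling something a paralog.

```python
def variation_profile(aligned_a, aligned_b):
    """Summarise how differences are distributed along a pairwise alignment.

    Returns the longest run of identical columns and the number of separate
    difference blocks (consecutive mismatch or gap columns count as one block).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    longest_identical = current_run = 0
    difference_blocks = 0
    in_difference = False
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == y and x != "-":
            current_run += 1
            longest_identical = max(longest_identical, current_run)
            in_difference = False
        else:
            current_run = 0
            if not in_difference:
                difference_blocks += 1
                in_difference = True
    return {
        "longest_identical_run": longest_identical,
        "difference_blocks": difference_blocks,
    }

# Toy isoform-like pair: identical apart from one skipped segment.
isoform_like = variation_profile("ACGTACGTACGT----ACGTACGT",
                                 "ACGTACGTACGTTTTTACGTACGT")
# Toy paralog-like pair: several scattered single-base differences.
paralog_like = variation_profile("ACGTACGTACGTACGT",
                                 "ACGAACGTTCGTACCT")
```

On real transcripts, a few large blocks next to very long identical runs would point toward isoforms, while many small blocks spread along the alignment would be more consistent with paralogs.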

setar...@gmail.com

Jul 31, 2016, 12:33:40 PM

to trinityrnaseq-users

Thank you, Brian. That’s interesting. OK, I’ll check the translated sequences. Assuming they are paralogues, can we say they are redundant genes whose simultaneous up- and down-regulation has a regulatory effect in a given experiment?

As a last question: other than RT-PCR, which is an experimental approach, could you please let me know whether there is another reliable way to validate the expression of such DE genes?

Many thanks

Brian Haas

Jul 31, 2016, 1:41:03 PM

to maryam moazam, trinityrnaseq-users

responses below

On Sun, Jul 31, 2016 at 12:33 PM, <setar...@gmail.com> wrote:

> Thank you, Brian. That’s interesting. OK, I’ll check the translated sequences. Assuming they are paralogues, can we say they are redundant genes whose simultaneous up- and down-regulation has a regulatory effect in a given experiment?

You can only say that they have similar expression profiles. If the expression profiles are correlated among a number of experiments/conditions/perturbations, then they could be part of the same regulatory module. Redundant functionality is a difficult thing to prove, requiring double knockouts/phenotypes, rescue by either, etc. Both of these issues are important and you'll want to hit the literature to research this.
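
As a simple illustration of checking whether two expression profiles are correlated across conditions, here is a minimal Pearson correlation sketch. The function name and the example values are made up, and a high correlation by itself would only hint at co-regulation; it says nothing about redundant function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("profiles must have the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

# Made-up expression values for two genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
r = pearson(gene_a, gene_b)
```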

> As a last question: other than RT-PCR, which is an experimental approach, could you please let me know whether there is another reliable way to validate the expression of such DE genes?

There are many experimental approaches one could take, but RT-PCR is the simplest and cheapest that I'm aware of, though some might argue for doing a Northern blot (old school). One of the most convincing assays is FISH, but it's not the first thing you'd usually jump to.

setar...@gmail.com

Jul 31, 2016, 4:29:00 PM

to trinityrnaseq-users

Thanks a lot, Brian. Sorry for the further question: could you please tell me how Trinity distinguishes paralogs during de novo assembly, and what the benefit of this differentiation is?

All the best

Brian Haas

Jul 31, 2016, 4:33:37 PM

to maryam moazam, trinityrnaseq-users

Trinity doesn't do anything special to distinguish paralogs... Identifying paralogs would be a downstream analysis (which we don't have documentation for yet).

sorry - I don't quite understand the second question...

setar...@gmail.com

Aug 2, 2016, 8:30:11 AM

to trinityrnaseq-users

Hi Brian,

Thank you for all your help and response. In your previous response, you mentioned that Trinity doesn't do anything special to distinguish paralogs. Could you please let us know if there is any way to confirm paralogs?

Best

Mark Chapman

Aug 2, 2016, 8:47:05 AM

to maryam moazam, trinityrnaseq-users

You can sequence the putative paralogues from gDNA or RNA using paralogue-specific primers. If they both amplify, then they're 'real'.

--

Dr. Mark A. Chapman

M.Ch...@soton.ac.uk

+44 (0)2380 594396

------------------------------------

Centre for Biological Sciences
University of Southampton

Life Sciences Building 85
Highfield Campus
Southampton
SO17 1BJ

setar...@gmail.com

Aug 2, 2016, 9:10:11 AM

to trinityrnaseq-users

Hi Mark,

Thank you. Before doing experimental work such as PCR, I'd prefer to do some bioinformatic analysis. Is there any way to do that?

Thanks

Mark Chapman

Aug 2, 2016, 9:15:12 AM

to maryam moazam, trinityrnaseq-users

PCR/sequencing is the gold standard; everything else will just give you some suggestions, e.g.:

1. You can compare to sequenced genomes: do the two potential 'paralogues' come back as best hits to different paralogues in another species? (If they do, it is likely that they are real paralogues.)

2. Are the two 'paralogues' indeed from a gene family in another species, or do other species always have one copy? (If other species always have one copy, this suggests they're not really paralogues.)

3. Are the alignments 'sensible'? E.g., if they're 99% similar they might just be alleles that have assembled separately. If they're 90% similar it's unlikely they're isoforms of the same gene, and so they may well be paralogues.

Just some ideas.. there's no one right answer
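
For the similarity check in point 3, percent identity can be computed from a pairwise alignment along these lines. This is a minimal sketch, assuming the two sequences are already aligned to equal length with '-' for gaps and that gapped columns are excluded from the calculation (other definitions of identity exist); the function name is hypothetical, and the 99% and 90% figures above remain rough guides rather than cutoffs.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical positions among columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # gapped columns are skipped under this definition
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared

identity = percent_identity("ACGTACGTAC", "ACGTACGTAT")
```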

Best wishes, Mark

setar...@gmail.com

Aug 3, 2016, 8:40:54 AM

to trinityrnaseq-users

Hi Mark,

Thank you very much for your explanation.
