too many genes and transcripts in the output file(Trinity.fasta)

Qiuzhong Zhou

unread,

Apr 15, 2015, 9:10:47 PM4/15/15

to trinityrn...@googlegroups.com

now, i use trinity2.0.6 to assembly my plant transcriptome, my data is PE-end, specific strand, read length:125bp, three repeat, the number of left reads is about 67,000,000 in each sample. total clean data is 48G (fastq). after Trinity assembler finished its assembly, i found too many genes and transcripts in the Trinity.fasta files, abot 300,000 genes and 450,000 trabscripts. who can tell me why and how to deal with? my cmd is "Trinity --seqType fq --max_memory 130G --SS_lib_type RF --left 1_1.clean.fq,2_1.clean.fq,3_1.clean.fq --right 1_2.clean.fq, 2_2.clean.fq,3_2.clean.fq --CPU 6 -output trinity_out/"

thanks for all at first.

Brian Haas

unread,

Apr 15, 2015, 9:56:25 PM4/15/15

to Qiuzhong Zhou, trinityrn...@googlegroups.com

I think this might address the issue:

http://trinityrnaseq.github.io/analysis/abundance_estimation.html#how_many_expr

best,

~brian

On Wed, Apr 15, 2015 at 9:10 PM, Qiuzhong Zhou <cnzq...@gmail.com> wrote:

now, i use trinity2.0.6 to assembly my plant transcriptome, my data is PE-end, specific strand, read length:125bp, three repeat, the number of left reads is about 67,000,000 in each sample. total clean data is 48G (fastq). after Trinity assembler finished its assembly, i found too many genes and transcripts in the Trinity.fasta files, abot 300,000 genes and 450,000 trabscripts. who can tell me why and how to deal with? my cmd is "Trinity --seqType fq --max_memory 130G --SS_lib_type RF --left 1_1.clean.fq,2_1.clean.fq,3_1.clean.fq --right 1_2.clean.fq, 2_2.clean.fq,3_2.clean.fq --CPU 6 -output trinity_out/"

thanks for all at first.

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Qiuzhong Zhou

unread,

Apr 15, 2015, 10:28:18 PM4/15/15

to trinityrn...@googlegroups.com, cnzq...@gmail.com

thanks very much, but i think the url "http://trinityrnaseq.github.io/analysis/abundance_estimation.html#how_many_expr" can not solve my issue. my issue is why i have so much genes and transcripts, is not the abundance estimation of the assembly result.

On Thursday, April 16, 2015 at 9:56:25 AM UTC+8, Brian Haas wrote:

I think this might address the issue:

http://trinityrnaseq.github.io/analysis/abundance_estimation.html#how_many_expr

best,

~brian

On Wed, Apr 15, 2015 at 9:10 PM, Qiuzhong Zhou <cnzq...@gmail.com> wrote:

now, i use trinity2.0.6 to assembly my plant transcriptome, my data is PE-end, specific strand, read length:125bp, three repeat, the number of left reads is about 67,000,000 in each sample. total clean data is 48G (fastq). after Trinity assembler finished its assembly, i found too many genes and transcripts in the Trinity.fasta files, abot 300,000 genes and 450,000 trabscripts. who can tell me why and how to deal with? my cmd is "Trinity --seqType fq --max_memory 130G --SS_lib_type RF --left 1_1.clean.fq,2_1.clean.fq,3_1.clean.fq --right 1_2.clean.fq, 2_2.clean.fq,3_2.clean.fq --CPU 6 -output trinity_out/"

thanks for all at first.

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Dan Browne

unread,

Apr 16, 2015, 2:09:35 AM4/16/15

to trinityrn...@googlegroups.com, cnzq...@gmail.com

The point is that you can filter out lowly expressed transcripts essentially as noise in your assembly, thus leaving you with a smaller set of perhaps more significant transcripts, as they are more strongly supported by the reads than the transcripts that you filtered out.

Qiuzhong Zhou

unread,

Apr 16, 2015, 3:22:49 AM4/16/15

to trinityrn...@googlegroups.com, cnzq...@gmail.com

but the lowly expressed transcripts maybe also have an important role, i do not want to filter out them. do you have any other method do for this issue. thanks

Mark Chapman

unread,

Apr 16, 2015, 4:11:55 AM4/16/15

to Qiuzhong Zhou, trinityrn...@googlegroups.com

Hi Qiuzhong,

I would agree with Dan's point that some of your isoforms might be very lowly expressed and/or poorly supported, so a little filtering is often useful, and getting an idea for what your data look like is often a good idea.

Try modifying a few of the parameters in the trinity command line. They will probably not make a huge change, but may remove some of the more poorly supported 'genes' and isoforms. In my command line I often use:

--min_kmer_cov 2 --max_internal_gap_same_path 15 --max_diffs_same_path 4

or variants on this.

Also what is it you are assembling? is it likely to be polyploid? is there a genome sequence available?

Thanks, Mark

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--

Dr. Mark A. Chapman

M.Ch...@soton.ac.uk

+44 (0)2380 594396

------------------------------------

Centre for Biological Sciences
University of Southampton

Life Sciences Building 85
Highfield Campus
Southampton
SO17 1BJ

Brian Haas

unread,

Apr 16, 2015, 7:01:24 AM4/16/15

to Mark Chapman, Qiuzhong Zhou, trinityrn...@googlegroups.com

There are a few key points that we're trying to make:

1. *lots* of transcripts is the rule rather than the exception.

2. most of the transcripts are very lowly expressed, and the deeper you sequence and the more complex your genome, the larger the number of lowly expressed transcripts you will be able to assemble. Biological relevance of the lowly expressed transcripts could be questionable - some are bound to be very relevant.

3. there's really no good reason to filter them out. They can be 'passengers' throughout all of your data analyses, and if any of them are important, they'll ideally surface in the relevant study. You can put them all through Trinotate.github.io for annotation/analysis, and you can put them through DE studies just fine. If the read counts are few or lacking, they simply won't surface as significant DE entries, but if there's protein homology or other interesting features, you'll continue to capture this info.

4. if you want to guestimate 'how many expressed genes/transcripts do I have', then the analysis I pointed you to will enable you to count the number that reflect the majority of the reads, excluding counting those that have little read support. The entries with >= ~1 fpkm or tpm tend to be heavily enriched for transcripts that correspond to what we typically think of as 'genes' in the pre-RNA-Seq era, and what typically are awarded annotation status in genome annotation.

best,

~brian

Qiuzhong Zhou

unread,

Apr 16, 2015, 7:34:47 AM4/16/15

to trinityrn...@googlegroups.com, cnzq...@gmail.com

Thanks for your advice, Mark, my research plant is amphiploid and no reference genome, i intend to use your cmd line to run again. if the result also have too many transcripts, i have no choice but to filter out poorly supported transcripts.

On Thursday, April 16, 2015 at 4:11:55 PM UTC+8, Mark Chapman wrote:

Hi Qiuzhong,
I would agree with Dan's point that some of your isoforms might be very lowly expressed and/or poorly supported, so a little filtering is often useful, and getting an idea for what your data look like is often a good idea.
Try modifying a few of the parameters in the trinity command line. They will probably not make a huge change, but may remove some of the more poorly supported 'genes' and isoforms. In my command line I often use:
--min_kmer_cov 2 --max_internal_gap_same_path 15 --max_diffs_same_path 4
or variants on this.
Also what is it you are assembling? is it likely to be polyploid? is there a genome sequence available?

Thanks, Mark

On 16 April 2015 at 08:22, Qiuzhong Zhou <cnzq...@gmail.com> wrote:

but the lowly expressed transcripts maybe also have an important role, i do not want to filter out them. do you have any other method do for this issue. thanks

On Thursday, April 16, 2015 at 2:09:35 PM UTC+8, Dan Browne wrote:

The point is that you can filter out lowly expressed transcripts essentially as noise in your assembly, thus leaving you with a smaller set of perhaps more significant transcripts, as they are more strongly supported by the reads than the transcripts that you filtered out.

On Wednesday, April 15, 2015 at 9:28:18 PM UTC-5, Qiuzhong Zhou wrote:

thanks very much, but i think the url "http://trinityrnaseq.github.io/analysis/abundance_estimation.html#how_many_expr" can not solve my issue. my issue is why i have so much genes and transcripts, is not the abundance estimation of the assembly result.

On Thursday, April 16, 2015 at 9:56:25 AM UTC+8, Brian Haas wrote:

I think this might address the issue:

http://trinityrnaseq.github.io/analysis/abundance_estimation.html#how_many_expr

best,

~brian

On Wed, Apr 15, 2015 at 9:10 PM, Qiuzhong Zhou <cnzq...@gmail.com> wrote:

now, i use trinity2.0.6 to assembly my plant transcriptome, my data is PE-end, specific strand, read length:125bp, three repeat, the number of left reads is about 67,000,000 in each sample. total clean data is 48G (fastq). after Trinity assembler finished its assembly, i found too many genes and transcripts in the Trinity.fasta files, abot 300,000 genes and 450,000 trabscripts. who can tell me why and how to deal with? my cmd is "Trinity --seqType fq --max_memory 130G --SS_lib_type RF --left 1_1.clean.fq,2_1.clean.fq,3_1.clean.fq --right 1_2.clean.fq, 2_2.clean.fq,3_2.clean.fq --CPU 6 -output trinity_out/"

thanks for all at first.

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsubscribe...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

--

Qiuzhong Zhou

unread,

Apr 16, 2015, 7:39:13 AM4/16/15

to trinityrn...@googlegroups.com, markcha...@gmail.com, cnzq...@gmail.com

thanks Brian, your points enlighten me.

Dan Browne

unread,

Apr 17, 2015, 2:37:35 PM4/17/15

to trinityrn...@googlegroups.com, cnzq...@gmail.com, markcha...@gmail.com

Hey Brian,

I was thinking about this and I am wondering whether allowing lowly expressed genes to carry on as 'passengers' through the data analyses would cause problems. For example, in my own sequencing libraries, the depth of sequencing is highly variable across the libraries. I have a time course of data, with no replicates at each time point (I didn't design the experiment, just inherited the data), as follows:

Day 03: 28,523,795 pairs
Day 05: 18,110,358 pairs
Day 08: 3,482,488 pairs
Day 13: 11,599,744 pairs
Day 22: 5,284,506 pairs

My concern is that the variability in sequencing depth will result in the apparent differential expression of some lowly expressed transcripts. Whereas in reality they are simply not detected in the libraries with lower sequencing depth. But then if I'm trying to do GO term enrichment analysis, and all of these lowly expressed genes are showing up as significantly differentially expressed, mightn't they confound the apparent GO term enrichment?

At least in my case, it seems like the better option for me is to filter out the transcripts with very low read support and focus on a subset of more highly represented transcripts. Not to say that there aren't lowly expressed transcripts with significant biological function, but I feel like the noise from sequence depth variability at those low expression levels could mask the important transcript expression level changes.

On the other end of the spectrum, I've also found that most of the highly expressed transcripts (25 out of top 30) in my initial assembly are rRNAs. Thus to improve the quality of my assembly and abundance estimation, I am using Bowtie to filter out all the reads that align to the reconstructed rRNA transcripts. I will then use the rRNA-free libraries to do another assembly and abundance estimation. My results show approximately 5% rRNA contamination in the Day 3, 5, 8, and 13 libraries, whereas the Day 22 library had roughly 30% rRNA contamination.

Would be great to have your thoughts on these matters.

Thanks,
Dan

Brian Haas

unread,

Apr 17, 2015, 4:30:57 PM4/17/15

to Dan Browne, trinityrn...@googlegroups.com, Qiuzhong Zhou, Mark Chapman

Hi Dan,

Very excellent points.

If you have very different read depths among your set of samples, then you will be over-powered for detecting 'upregulated' transcripts in that sample. I'm not exactly sure how the different DE analysis tools are going to handle this.... some might handle it better than others. I have some ideas on how to handle it, but I don't have the mathematics/statistics background to be sure that it wouldn't be the wrong thing to do. Someone else on the list can comment on the best approach. Ultimately, I think it should be the DE process that properly handles all the logic here, and if the read depths are skewed to the point of this sort of artifactual DE result showing up, then it would adjust accordingly.

If rRNA differences are showing up as key DE differences, they're easy enough to discard. Preemptively filtering rRNA reads is certainly a useful thing to do, though it's not a requirement. Also, most DE analysis methods have a way of doing cross-sample normalizations, so excess rRNA or other compositional differences are accounted for (ie. TMM).

Finally, low reads counts corresponding to lowly expressed transcripts are going to be generally underpowered for DE studies, and if they show up as false-positives, they shouldn't have strong statistical significance (ie. running edgeR with bio replicates). Differences in read depth are hopefully not accounting for the most significant results obtained in your comparisons.

Great thread... I hope others contribute. This is something I've thought about for a while but haven't decided on what the best approach should be. (it should be safe to downscale reads in one sample to match the depth in the other to avoid FPs, but then you increase your FN rate.... ). Statistics has never been my forte, unfortunately.

best,

~b

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Ken Field

unread,

Apr 18, 2015, 9:22:42 AM4/18/15

to Brian Haas, Dan Browne, trinityrn...@googlegroups.com, Qiuzhong Zhou, Mark Chapman

Brian and Dan-

I am wondering if anyone has ever quantified the effect of multiple comparisons corrections on the statistical power lost when you have (too) many isoforms?

In my own analysis I have noticed that the FDR corrections are huge with an assembly of ~100,000 genes. I expect that as you increase the number of "genes" or "isoforms" your statistical power will go down because of that jerk, Bonferroni. So while you may get some interesting additional differences when you have a more diverse assembly, those differences become very unlikely to survive FDR correction.

Ken

Ken Field, Ph.D.

Associate Professor of Biology

Program in Cell Biology/Biochemistry

Bucknell University

Room 203A Biology Building

Brian Haas

unread,

Apr 18, 2015, 9:32:56 AM4/18/15

to Ken Field, Dan Browne, trinityrn...@googlegroups.com, Qiuzhong Zhou, Mark Chapman

Hi Ken,

I don't think most of the methods are going to use Bonferroni correction due to it being overly restrictive with applications to large data sets. Instead, the Q-value (Storey 2003) is often used, and I think it should be less influenced by the extra isoforms issue.

The one problem that too many isoforms could have is that they end up sharing reads with other similar isoforms and this could lower power for DE detection at the transcript level (not at the gene level, since all counts are still aggregated to the same gene-level count). Note, though, Trinity already has built-in methods to remove isoforms that are considered to be likely artifacts or noise, and if you have a lot of isoforms remaining for a given gene, trying to further decide on which ones are worthy of keeping vs. discarding can be challenging (unless RSEM indicates good evidence for keeping vs. discarding some of them based on the isopct stat).

I believe there are papers that discuss these issues - such as how the selection of transcript targets impacts the results of your DE analysis. I'm sure there's room for improvement as far as best practices are concerned, but I also think the current SOP should work reasonably well in most applications.