CD-HIT-EST decreases Trinity contig size


Jane

Jun 16, 2015, 2:18:54 PM
to trinityrn...@googlegroups.com
Hi,

I am using cd-hit-est to reduce redundancy in my transcriptome assembly prior to annotation. However, I find that contig N50, mean, and median lengths all decrease when I run filtering with any similarity cutoff (-c 0.95, 0.99, or 1.00). I was under the impression that cd-hit selects the longest sequence of each cluster to be a representative sequence and removes shorter redundant sequences, and would therefore increase contig size. Has anyone else had this problem? Also, can anyone suggest the option in cd-hit-est to use to remove sequences shorter than 200 bp, or has this already been done by Trinity?

Thank you,
Jane


Tiago Hori

Jun 16, 2015, 2:26:24 PM
to Jane, trinityrn...@googlegroups.com
Jane,

Although it is a clustering algorithm, it is perfectly normal for CD-HIT to reduce average contig size, in the same way that the longest-isoform-per-gene statistic is often lower than the statistic for all transcripts. It is not a problem at all. The reason is that genes with longer transcripts often have more isoforms covering different parts of the gene, and those buffer the effect of short outlier transcripts on the average. When you remove them by clustering, you in fact increase the effect of the outliers. For example, say I have 2 isoforms that are 99% identical, each 1000 bp, and a third gene that is 100 bp. If I take all sequences, the mean is 2100/3 = 700. If those two isoforms get clustered, I now have 1100/2 = 550.
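The arithmetic in this example can be checked with a short Python sketch. The function names and the second, larger set of lengths are made up for illustration; nothing here comes from Trinity or CD-HIT itself.

```python
def mean_length(lengths):
    """Arithmetic mean of a list of contig lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Length of the contig at which the running total, taken from
    longest to shortest, first reaches half of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# The example from the post: two 1000 bp isoforms plus one 100 bp transcript.
before = [1000, 1000, 100]
after = [1000, 100]  # the two near-identical isoforms collapse into one

print(mean_length(before))  # 700.0
print(mean_length(after))   # 550.0

# Hypothetical larger case: three redundant 2000 bp isoforms, five 500 bp contigs.
before_big = [2000, 2000, 2000, 500, 500, 500, 500, 500]
after_big = [2000, 500, 500, 500, 500, 500]

print(n50(before_big))  # 2000
print(n50(after_big))   # 500
```

In the three-sequence example only the mean drops (N50 stays at 1000), but once enough redundant long isoforms are collapsed relative to the short contigs, N50 falls as well, which matches what was reported in the original question.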

I wonder why you are using CD-HIT, though. Are you merging assemblies? Part of the power of Trinity is distinguishing paralogs and splice variants, and by running CD-HIT on a single assembly you could be losing those.

T.

Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Jane

Jun 16, 2015, 3:00:45 PM
to trinityrn...@googlegroups.com, jkhud...@gmail.com
Tiago,

Thank you for the response, that makes a lot of sense.

I was trying to cluster because Trinity gave me 510,000 transcripts in 390,000 gene families (I know this is a caveat of de novo assembly, esp Trinity), which made a previous transcriptome incredibly difficult to use for practical purposes such as designing QPCR primers for DE genes. The gene families were enormous (some with ~2000 isoforms!) and contained many isoforms that were quite short (~300 bp) that I thought may be redundant. I was hoping to remove some of those to collapse gene families to usable sizes.

Another problem I had was that after running RSEM-EBSeq, at least a third of genes categorized as DE had FPKM values < 1 or even 0 in some of the samples, which I thought might make them basically junk that is undetectable by QPCR for validation. I was wondering if those were somehow related to the redundancy problem. I tried RSEM filtering by FPKM values, but that removed so many transcripts that I was worried I was losing some real isoforms in the process. I may try eXpress this time instead.
 
I am also trying TransDecoder right now, which reduces the number of transcripts quite significantly to about 250,000 in 50,000 gene families, which sounds more biologically reasonable than half a million transcripts.

Tiago Hori

Jun 16, 2015, 3:04:20 PM
to Jane, trinityrn...@googlegroups.com
Hi Jane,

Filtering by FPKM is always a good idea in my opinion. As long as you are not clustering before mapping, I think you are at less risk of missing some of the nuances of your transcriptome. Also, if you are interested in particular genes, those nuances are not as important, but if you are looking for novel markers they may be!

T.

Sent from my iPhone

Jane

Jun 16, 2015, 3:13:32 PM
to trinityrn...@googlegroups.com
Tiago,

Thanks for the suggestion. I will skip cd-hit and go straight to TransDecoder-Trinotate and then RSEM filter by FPKM. I appreciate your prompt responses.

Cheers,
Jane

thomas duge de bernonville

Jun 17, 2015, 2:30:48 AM
to trinityrn...@googlegroups.com
Hi Jane,

You may try CD-HIT-EST at different identity thresholds (the -c parameter, i.e. from 0.9 to 1) in order to decrease redundancy while keeping true isoforms/splice variants.
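A threshold sweep like this could be scripted roughly as follows. This is only a sketch: the input name `Trinity.fasta`, the output naming scheme, and the helper `cdhit_commands` are assumptions for illustration. It only builds the command lines (using the `-i`, `-o`, and `-c` options of cd-hit-est) and prints them rather than running anything.

```python
def cdhit_commands(fasta, thresholds):
    """Build one cd-hit-est command line per identity threshold.

    The output file naming is a made-up convention for this sketch.
    """
    commands = []
    for c in thresholds:
        out = f"{fasta}.cdhit_c{c:.2f}.fasta"
        commands.append(
            ["cd-hit-est", "-i", fasta, "-o", out, "-c", f"{c:.2f}"]
        )
    return commands


# Print the commands so they can be reviewed before being run by hand
# or passed to subprocess.run().
for cmd in cdhit_commands("Trinity.fasta", [0.90, 0.95, 0.99, 1.00]):
    print(" ".join(cmd))
```

The cd-hit user guide ties the word size option (`-n`) to the identity threshold, so it is worth checking that setting before running at the lower end of the range. Comparing the number of sequences kept at each threshold gives a feel for how much of the assembly is near-identical redundancy versus genuinely distinct isoforms.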

best regards

Thomas

Jane

Jun 18, 2015, 3:33:58 PM
to trinityrn...@googlegroups.com, jkhud...@gmail.com
Tiago,

I have some questions about RSEM filtering by FPKM. I am using RSEM-edgeR and DESeq for differential expression analysis. Do you know if edgeR and/or DESeq have some sort of filtering for transcripts with low counts, or should I map reads back to the filtered assembly instead of the raw one? I have used RSEM-EBSeq with the raw assembly previously and got a large number of DE genes that had super low FPKM, so I want to avoid that this time.

Thanks,
Jane

Tiago Hori

Jun 18, 2015, 4:24:22 PM
to Jane, trinityrn...@googlegroups.com
Hi Jane,

There is a Perl script within Trinity that will not only filter but also graph the number of transcripts per FPKM range.
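For anyone who wants to see the idea without the Trinity script, here is a minimal Python sketch of counting and filtering transcripts by FPKM. It is not the Perl script mentioned above. It assumes a tab-delimited table with `transcript_id` and `FPKM` columns (as in RSEM's isoform-level results), and the five rows of data and the helper names are invented for illustration.

```python
import csv
import io


def read_fpkm(handle, id_col="transcript_id", fpkm_col="FPKM"):
    """Read a tab-delimited table into {transcript_id: FPKM}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[fpkm_col]) for row in reader}


def count_at_or_above(fpkm_by_id, thresholds):
    """Number of transcripts with FPKM >= each threshold."""
    return {
        t: sum(1 for value in fpkm_by_id.values() if value >= t)
        for t in thresholds
    }


def filter_min_fpkm(fpkm_by_id, min_fpkm):
    """Sorted IDs of transcripts that pass the minimum FPKM cutoff."""
    return sorted(tid for tid, value in fpkm_by_id.items() if value >= min_fpkm)


# Made-up example data standing in for a real results file.
example = io.StringIO(
    "transcript_id\tgene_id\tFPKM\n"
    "t1\tg1\t0.00\n"
    "t2\tg1\t0.40\n"
    "t3\tg2\t1.50\n"
    "t4\tg3\t12.00\n"
    "t5\tg4\t250.00\n"
)

fpkm = read_fpkm(example)
print(count_at_or_above(fpkm, [0, 1, 10, 100]))  # {0: 5, 1: 3, 10: 2, 100: 1}
print(filter_min_fpkm(fpkm, 1.0))                # ['t3', 't4', 't5']
```

Looking at the counts across several thresholds before picking a cutoff shows how many transcripts each choice would discard, which speaks to the worry earlier in the thread about losing real isoforms.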

T.

Sent from my iPhone