Tiago,
Thank you for the response, that makes a lot of sense.
I was trying to cluster because Trinity gave me 510,000 transcripts in 390,000 gene families (I know this is a caveat of de novo assembly, especially with Trinity), which made a previous transcriptome incredibly difficult to use for practical purposes such as designing qPCR primers for DE genes. The gene families were enormous (some with ~2000 isoforms!) and contained many quite short isoforms (~300 bp) that I thought might be redundant. I was hoping to remove some of those to collapse the gene families to usable sizes.
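As an aside, one rough way I have been considering to collapse the families is simply keeping the longest isoform per Trinity gene. A minimal Python sketch of that idea (just an illustration, not something from my actual pipeline; it assumes the standard TRINITY_..._g#_i# header convention, and the file names are placeholders):

# Sketch: keep only the longest isoform per Trinity "gene", assuming headers
# like >TRINITY_DN1000_c115_g5_i1 ... where everything before the final
# "_i<N>" identifies the gene. File names are hypothetical.
longest = {}  # gene id -> (length, header, sequence)

def records(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

for header, seq in records("Trinity.fasta"):
    accession = header.split()[0]          # e.g. TRINITY_DN1000_c115_g5_i1
    gene = accession.rsplit("_i", 1)[0]    # strip the isoform suffix
    if gene not in longest or len(seq) > longest[gene][0]:
        longest[gene] = (len(seq), header, seq)

with open("Trinity.longest_isoform.fasta", "w") as out:
    for _, header, seq in longest.values():
        out.write(f">{header}\n{seq}\n")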
Another problem I had was that after running RSEM-EBSeq, at least a third of the genes categorized as DE had FPKM values < 1, or even 0, in some of the samples, which I thought might make them essentially junk that would be undetectable by qPCR for validation. I was wondering whether those were somehow related to the redundancy problem. I tried RSEM filtering by FPKM values, but that removed so many transcripts that I was worried I was losing some real isoforms in the process. I may try eXpress this time instead.
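For reference, the kind of FPKM filter I mean is roughly the following (a sketch only, not RSEM's own filtering; the per-sample *.genes.results file names and the "FPKM >= 1 in at least 2 samples" rule are just example values):

# Keep genes whose FPKM is at least min_fpkm in at least min_samples samples,
# reading the FPKM column from per-sample RSEM *.genes.results tables.
import csv
from collections import defaultdict

samples = ["ctrl_rep1.genes.results", "ctrl_rep2.genes.results",
           "treat_rep1.genes.results", "treat_rep2.genes.results"]
min_fpkm, min_samples = 1.0, 2

passing = defaultdict(int)  # gene_id -> number of samples where FPKM >= min_fpkm
for path in samples:
    with open(path) as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if float(row["FPKM"]) >= min_fpkm:
                passing[row["gene_id"]] += 1

keep = {gene for gene, n in passing.items() if n >= min_samples}
print(f"{len(keep)} genes pass the FPKM filter")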
I am also trying TransDecoder right now; it reduces the number of transcripts quite significantly, to about 250,000 in 50,000 gene families, which sounds more biologically reasonable than half a million transcripts.
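In case it is useful, this is the two-step TransDecoder run I am doing, wrapped in Python here just for illustration (to the best of my understanding of its standard workflow; the input file name is a placeholder):

# Run the standard TransDecoder two-step workflow on the Trinity assembly.
import subprocess

assembly = "Trinity.fasta"
subprocess.run(["TransDecoder.LongOrfs", "-t", assembly], check=True)  # extract long ORFs
subprocess.run(["TransDecoder.Predict", "-t", assembly], check=True)   # predict likely coding regions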
On Tuesday, June 16, 2015 at 11:26:24 AM UTC-7, Tiago Hori wrote:
Jane,
Although it is a clustering algorithm, it is very normal for CD-HIT to reduce the average contig size, in the same way that the longest-isoform-only statistics are often lower than the all-transcripts statistics. It is not a problem at all. The reason is that longer transcripts often have many isoforms covering different parts of the gene, and those buffer the effect of outlier short transcripts on the average. When you remove them by clustering, you actually increase the effect of the outliers. For example: say I have two isoforms that are 99% identical, both 1000 bp, and a third gene that is 100 bp. If I take all sequences, the mean is (1000 + 1000 + 100)/3 = 700 bp. Once those two isoforms get clustered into one representative, I now have (1000 + 100)/2 = 550 bp.
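Just to restate that toy arithmetic explicitly:

# Toy numbers from the example above: two nearly identical 1000 bp isoforms
# plus one 100 bp contig.
all_contigs = [1000, 1000, 100]
clustered   = [1000, 100]  # the two isoforms collapse to one representative

print(sum(all_contigs) / len(all_contigs))  # 700.0 -> mean before clustering
print(sum(clustered) / len(clustered))      # 550.0 -> mean after clustering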
I wonder why you are using CD-HIT, though; are you merging assemblies? Part of the power of Trinity is distinguishing paralogs and splice variants, and by running CD-HIT on a single assembly you could be losing those.
T.
Sent from my iPhone
Hi,
I am using cd-hit-est to reduce redundancy in my transcriptome assembly prior to annotation. However, I find that contig N50, mean, and median lengths all decrease when I filter with any similarity cutoff (-c 0.95, 0.99, or 1.00). I was under the impression that cd-hit selects the longest sequence of each cluster as the representative sequence and removes the shorter redundant sequences, and would therefore increase the contig length statistics. Has anyone else had this problem? Also, can anyone suggest the option in cd-hit-est to remove sequences shorter than 200 bp, or has this already been done by Trinity?
Thank you,
Jane