Trinity assembly statistics

Adriana Fróes

Jul 16, 2015, 1:59:21 PM
to trinityrn...@googlegroups.com
Hi all,

I have a question, if anyone could help me.
I have 12 samples: 4 are from a healthy coral and 8 are from diseased corals. I tried two ways of obtaining the reference: assembling all sequences together in one run, and assembling each sample separately, then pooling everything and assembling again. When I ran the Perl script from Trinity, Trinity.Statistics.pl, on the assembly where the samples were first assembled separately and then together, the results were:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 29358
Total trinity transcripts: 110345
Percent GC: 8.41

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 253
Contig N20: 197
Contig N30: 159
Contig N40: 134
Contig N50: 119

Median contig length: 107
Average contig: 124.98
Total assembled bases: 13791075


And the results for assembling all of them together in a single run were:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 87384
Total trinity transcripts: 88902
Percent GC: 45.13

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 963
Contig N20: 815
Contig N30: 773
Contig N40: 740
Contig N50: 647

Median contig length: 485
Average contig: 544.07
Total assembled bases: 48368739


My questions are:

Just looking at these results, can I tell which assembly strategy was better?

Why did the first assembly generate fewer 'genes' (29358) and many more 'transcripts' (110345), while the second assembly strategy generated almost the same number of genes and transcripts (~88000)?


Thank you very much!!

Tiago Hori

Jul 16, 2015, 2:42:33 PM
to Adriana Fróes, trinityrn...@googlegroups.com
The more depth you get, the more genes/transcripts you will eventually have, up to the ideal maximum of a perfectly assembled transcriptome (which never happens). The ratio of genes to transcripts depends on many things: how fragmented your assembly is, how complex the genome is, how many paralogs there are in the genome, and so forth.
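
To put numbers on that ratio, here is a small Python sketch that uses only the counts posted in this thread. The dictionary labels and the `summarize` helper are names made up for this example. It divides transcripts by 'genes' and, as a sanity check, recomputes the average contig length from the total assembled bases.

```python
# Counts copied from the two reports posted in this thread.
# The labels are descriptive names chosen for this example only.
assemblies = {
    "per-sample then pooled": {
        "genes": 29358, "transcripts": 110345, "bases": 13791075,
    },
    "all reads together": {
        "genes": 87384, "transcripts": 88902, "bases": 48368739,
    },
}


def summarize(counts):
    """Return transcripts per 'gene' and mean contig length for one report."""
    return {
        "transcripts_per_gene": counts["transcripts"] / counts["genes"],
        "mean_contig_length": counts["bases"] / counts["transcripts"],
    }


for label, counts in assemblies.items():
    s = summarize(counts)
    print(
        f"{label}: {s['transcripts_per_gene']:.2f} transcripts per gene, "
        f"mean contig length {s['mean_contig_length']:.2f}"
    )
```

The first report works out to roughly 3.8 transcripts per 'gene' and the second to roughly 1.0. The ratio on its own does not say which assembly is better.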

My guess would be that the second assembly is better, but I would not trust those stats. Use DETONATE if you have paired-end reads; if not, at least re-map the reads to the transcriptome and see what proportion of reads map correctly and what proportion map unambiguously.

Remember to map the same reads that generated a given assembly, otherwise the analysis will be biased.
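
As a rough sketch of that mapping check, the Python below tallies a SAM file produced by whatever aligner was used for the re-mapping. It relies only on the standard SAM columns (FLAG in column 2, MAPQ in column 5). The function name `mapping_summary` and the `min_mapq` cutoff of 10 are assumptions for this example: treating "MAPQ at or above a cutoff" as "unambiguous" is only a proxy, and MAPQ scales differ between aligners. It counts alignment records, so for paired-end data each mate is counted separately.

```python
def mapping_summary(sam_lines, min_mapq=10):
    """Count primary SAM records that are mapped and confidently mapped.

    min_mapq is a hypothetical cutoff used as a stand-in for
    "unambiguous"; adjust it for the aligner in use.
    """
    total = mapped = confident = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        mapq = int(fields[4])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        total += 1
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        if mapq >= min_mapq:
            confident += 1
    return {
        "total": total,
        "mapped": mapped,
        "confident": confident,
        "mapped_fraction": mapped / total if total else 0.0,
        "confident_fraction": confident / total if total else 0.0,
    }
```

It could be called as `mapping_summary(open("reads_vs_assembly.sam"))`, where the file name is a placeholder. Running it once per assembly, with the same reads, gives comparable fractions.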

T.

"Profanity is the only language all programmers understand"
Sent from my iPhone, the universal excuse for my poor spelling.

Mark Chapman

Jul 16, 2015, 3:19:19 PM
to Tiago Hori, Adriana Fróes, trinityrn...@googlegroups.com

I would be very wary of the first assembly as it has a GC% of only 8%. Seems fishy to me (no disrespect to those of you who work on fish Tiago!). This suggests something has gone weird somewhere.
Cheers, Mark

Adriana Fróes

Jul 16, 2015, 3:26:20 PM
to Mark Chapman, Tiago Hori, trinityrn...@googlegroups.com
Thanks Tiago, I thought the same. It is difficult to decide based only on these statistics.

Well spotted, Mark. It is very weird for the GC content to be so low... I didn't pay attention to that important detail because I was actually working with the second assembly.

I mapped the reads back, but only 47% of them mapped to the "reference" contigs and, maybe because of this, I couldn't obtain differentially expressed genes...
The samples are also very mixed, because they come from the coral mucus, which also contains bacteria, fungi, viruses, protozoans, microalgae...


Adriana M. Froes
Laboratório de Microbiologia, Instituto de Biologia, Depto de Biologia Marinha
Universidade Federal do Rio de Janeiro   
Av. Carlos Chagas Filho 373, Sala A3-202, Bloco A (Anexo) do CCS
21941-599, Ilha do Fundão, Rio de Janeiro, RJ

Mark Chapman

Jul 16, 2015, 5:58:08 PM
to Adriana Fróes, Tiago Hori, trinityrn...@googlegroups.com

Hi Adriana,
The very short N50 and mean contig size, plus the weird GC, suggest the first assembly is bogus. Can you explain how you did this? How did you combine all the individual assemblies into one?
Cheers, Mark
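
For anyone who wants to double-check figures such as N50, mean contig length and GC content independently of the Trinity script, here is a minimal Python sketch. It assumes a plain FASTA input. The function names (`read_fasta`, `nx`, `assembly_stats`) are made up for this example, and GC is computed over all bases, including any N. It is not the code Trinity itself uses, so small differences in rounding are possible.

```python
from statistics import mean, median


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def nx(lengths, fraction):
    """Length of the contig at which the running total of lengths,
    taken longest first, reaches `fraction` of all bases (0.5 gives N50)."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0


def assembly_stats(records):
    """Summarize contig lengths and GC content for (name, sequence) pairs."""
    seqs = [seq.upper() for _, seq in records]
    lengths = [len(s) for s in seqs]
    total = sum(lengths)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return {
        "contigs": len(lengths),
        "total_bases": total,
        "percent_gc": 100.0 * gc / total if total else 0.0,
        "n50": nx(lengths, 0.5),
        "median_length": median(lengths) if lengths else 0,
        "mean_length": mean(lengths) if lengths else 0,
    }
```

A call might look like `assembly_stats(read_fasta(open("Trinity.fasta")))`, where the file name is a placeholder for the assembly being checked.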

Adriana Fróes

Jul 17, 2015, 11:37:55 AM
to Mark Chapman, Tiago Hori, trinityrn...@googlegroups.com
Hi Mark, after quality filtering and collapsing the paired-end reads, I just pooled all sequences from all the samples into one set and used Trinity to assemble them.


