Trinity assembly statistics

Yogesh Gupta

Nov 2, 2016, 1:18:33 AM

to Brian Haas, Mark Chapman, trinityrnaseq-users

Dear All,

I assembled 14 tissue-specific RNA-seq libraries using Trinity v2.2.0. The assembly statistics are shown below; the assembly contains a lot of transcripts around 200 bp.

################################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 198359

Total trinity transcripts: 271942

Percent GC: 45.12

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 2921

Contig N20: 2171

Contig N30: 1717

Contig N40: 1368

Contig N50: 1049

Median contig length: 303

Average contig: 599.19

Total assembled bases: 162944629

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 2473

Contig N20: 1685

Contig N30: 1147

Contig N40: 701

Contig N50: 450

Median contig length: 253

Average contig: 419.35

Total assembled bases: 83181387

I then filtered the assembly to keep only the transcripts above 500 bp and extracted the assembly statistics again:

###############################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 32171

Total trinity transcripts: 85753

Percent GC: 43.40

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 3322

Contig N20: 2577

Contig N30: 2134

Contig N40: 1817

Contig N50: 1554

Median contig length: 1062

Average contig: 1309.00

Total assembled bases: 112250910

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 3338

Contig N20: 2539

Contig N30: 2084

Contig N40: 1758

Contig N50: 1480

Median contig length: 936

Average contig: 1222.40

Total assembled bases: 39325850

I then ran the longest-isoform extraction Perl script, and it gives only 103 transcripts:

################################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 103

Total trinity transcripts: 103

Percent GC: 42.31

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 14636

Contig N20: 9903

Contig N30: 7765

Contig N40: 6277

Contig N50: 4907

Median contig length: 2165

Average contig: 3138.86

Total assembled bases: 323303

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 14636

Contig N20: 9903

Contig N30: 7765

Contig N40: 6277

Contig N50: 4907

Median contig length: 2165

Average contig: 3138.86

Total assembled bases: 323303

My questions: Why am I getting so many transcripts of around 200 bp? Is it due to sequencing errors at the ends of the transcripts? And why am I getting so few contigs after using the longest-isoform Perl script?

Thanks

Yogesh
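For readers comparing the median and N50 values in the reports above, here is a minimal Python sketch of how an Nx statistic can be computed from a list of contig lengths, and how dropping short contigs shifts it. This is not the code Trinity's stats script uses; the function names `nx` and `filter_min_length` and the toy lengths are made up for illustration.

```python
def nx(lengths, x=50):
    """Return the Nx length: walking contigs from longest to shortest,
    the length of the contig at which the running total first reaches
    x percent of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths supplied")
    threshold = sum(ordered) * x / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


def filter_min_length(lengths, min_len=500):
    """Keep only contigs of at least min_len bases (hypothetical helper)."""
    return [n for n in lengths if n >= min_len]


# Toy data (not from the real assembly): many short contigs, a few long ones.
lengths = [201, 215, 230, 250, 303, 600, 1049, 2171]

n50_all = nx(lengths)                          # dominated by the long contigs
n50_filtered = nx(filter_min_length(lengths))  # rises once short contigs are dropped
```

On this toy input the N50 is 1049 before filtering and 2171 after, even though most contigs are close to 200 bp. That is the same pattern as in the reports above: a low median with a much higher N50 points to a large number of short contigs that hold only a small share of the assembled bases.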

Yogesh Gupta

Nov 2, 2016, 1:38:47 AM

to Brian Haas, Mark Chapman, trinityrnaseq-users

Dear All,

I was using an older version of the longest-isoform script, which is why it was showing that problem. But why am I getting so many contigs around 200 bp? What could be the reason for that?

Thanks

Yogesh
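The thread does not say why the older script collapsed the assembly to 103 sequences. One plausible cause, stated here as an assumption, is that it parsed gene identifiers from a header format that differs from the one this Trinity release writes. The Python sketch below shows the general idea of picking the longest isoform per gene. It assumes transcript IDs end in `_i<number>` and that everything before that suffix identifies the gene; the function names and the demo sequences are hypothetical, and this is not the script shipped with Trinity.

```python
import re

# Assumption: transcript IDs end in "_i<number>"; the rest names the gene.
_ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(transcript_id):
    """Strip the trailing isoform suffix to get a gene-level identifier."""
    return _ISOFORM_SUFFIX.sub("", transcript_id)


def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first whitespace-delimited token as the ID.
            header, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoforms(records):
    """Map each gene ID to the (transcript ID, sequence) of its longest isoform."""
    best = {}
    for tid, seq in records:
        gid = gene_id(tid)
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (tid, seq)
    return best


# Made-up demo records: two isoforms of one gene, one isoform of another.
demo = """>TRINITY_DN10_c0_g1_i1 len=6
ACGTAC
>TRINITY_DN10_c0_g1_i2 len=10
ACGTACGTAC
>TRINITY_DN10_c0_g2_i1 len=4
ACGT
""".splitlines()

picked = longest_isoforms(read_fasta(demo))
```

If the gene-ID rule does not match the header format, many unrelated transcripts end up under the same key and only one survives per key, which would produce a very small output like the one reported.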

Mark Chapman

Nov 2, 2016, 3:48:44 AM

to Yogesh Gupta, trinityrn...@googlegroups.com, Brian Haas

Many small transcripts often mean that the depth of sequencing wasn't sufficient to get good coverage of the transcripts, and hence many are fragmented. Also, are your tissue-specific libraries all from one inbred individual or from multiple individuals? If it's the latter, some transcripts could be different alleles at the same locus.
Best wishes, Mark

Brian Haas

Nov 2, 2016, 7:56:34 PM

to Mark Chapman, Yogesh Gupta, trinityrn...@googlegroups.com

I'd suggest exploring the ExN50 values as well:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome%20Contig%20Nx%20and%20ExN50%20stats

This is usually more informative.

best,

~b
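As a rough illustration of the ExN50 idea mentioned above, the Python sketch below ranks transcripts by expression, keeps the most highly expressed ones that together account for x percent of the total expression, and computes N50 on that subset only. It is a simplified sketch, not the procedure from the linked wiki page: it assumes you already have one expression value per transcript, and the function names and toy numbers are invented for the example.

```python
def n50(lengths):
    """N50 of a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths supplied")
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return ordered[-1]


def exn50(transcripts, x):
    """N50 over the top-expressed transcripts covering x percent of expression.

    transcripts: list of (length, expression) pairs (expression values assumed given).
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    target = sum(expr for _, expr in ranked) * x / 100.0
    chosen, cumulative = [], 0.0
    for length, expr in ranked:
        if cumulative >= target:
            break
        chosen.append(length)
        cumulative += expr
    return n50(chosen)


# Toy data: three well-expressed long transcripts plus ten short,
# weakly expressed ones (values are illustrative only).
toy = [(1500, 600.0), (1200, 250.0), (1000, 100.0)] + [(300, 5.0)] * 10

e90_n50 = exn50(toy, 90)    # short, weakly expressed contigs are excluded
e100_n50 = exn50(toy, 100)  # same as the plain N50 over all transcripts
```

With these toy values the E90 N50 is 1200 while the N50 over everything is 1000, because the short, barely expressed contigs no longer count toward the statistic.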
