Trinity assembly statistics

Yogesh Gupta

Nov 2, 2016, 1:18:33 AM
to Brian Haas, Mark Chapman, trinityrnaseq-users
Dear All,

I assembled 14 tissue-specific RNA-seq libraries using Trinity v2.2.0. The assembly statistics are shown below; the assembly contains a lot of transcripts of around 200 bp.
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  198359
Total trinity transcripts:      271942
Percent GC: 45.12

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 2921
        Contig N20: 2171
        Contig N30: 1717
        Contig N40: 1368
        Contig N50: 1049

        Median contig length: 303
        Average contig: 599.19
        Total assembled bases: 162944629


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 2473
        Contig N20: 1685
        Contig N30: 1147
        Contig N40: 701
        Contig N50: 450

        Median contig length: 253
        Average contig: 419.35
        Total assembled bases: 83181387

I then filtered the assembly to keep only transcripts longer than 500 bp and extracted the assembly statistics again:



###############################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  32171
Total trinity transcripts:      85753
Percent GC: 43.40

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 3322
        Contig N20: 2577
        Contig N30: 2134
        Contig N40: 1817
        Contig N50: 1554

        Median contig length: 1062
        Average contig: 1309.00
        Total assembled bases: 112250910


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 3338
        Contig N20: 2539
        Contig N30: 2084
        Contig N40: 1758
        Contig N50: 1480

        Median contig length: 936
        Average contig: 1222.40
        Total assembled bases: 39325850


I then ran the longest-isoform extraction Perl script, and it returned only 103 transcripts:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  103
Total trinity transcripts:      103
Percent GC: 42.31

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 14636
        Contig N20: 9903
        Contig N30: 7765
        Contig N40: 6277
        Contig N50: 4907

        Median contig length: 2165
        Average contig: 3138.86
        Total assembled bases: 323303


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 14636
        Contig N20: 9903
        Contig N30: 7765
        Contig N40: 6277
        Contig N50: 4907

        Median contig length: 2165
        Average contig: 3138.86
        Total assembled bases: 323303

My questions: Why am I getting so many transcripts of around 200 bp? Is it due to sequencing errors at the ends of the transcripts? And why do I get so few contigs after running the longest-isoform Perl script?


Thanks
Yogesh



Yogesh Gupta

Nov 2, 2016, 1:38:47 AM
to Brian Haas, Mark Chapman, trinityrnaseq-users
Dear All,

I was using an older version of the longest-isoform script, which is why it showed that problem. But why am I getting so many contigs of around 200 bp? What could be the reason for that?
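
For reference, longest-isoform selection combined with a minimum-length filter can be sketched in a few lines of Python. This is only an illustration, not the Trinity Perl script: the function names are hypothetical, and it assumes the gene ID is the transcript ID with a trailing `_i<number>` isoform suffix removed. The result depends entirely on how the gene ID is derived from the header, so a script written for a different ID format can group transcripts incorrectly.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (transcript_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gene_id(transcript_id: str) -> str:
    # Assumption: the isoform is encoded as a trailing "_i<number>".
    return re.sub(r"_i\d+$", "", transcript_id)


def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]], min_len: int = 0
) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to its longest (transcript_id, sequence) of at least min_len bases."""
    best: Dict[str, Tuple[str, str]] = {}
    for tid, seq in records:
        if len(seq) < min_len:
            continue
        gid = gene_id(tid)
        if gid not in best or len(seq) > len(best[gid][1]):
            best[gid] = (tid, seq)
    return best
```

On a real assembly one would pass an open file handle, for example `longest_isoform_per_gene(read_fasta(open("Trinity.fasta")), min_len=500)`, where the file name is a placeholder.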



Thanks
Yogesh

Mark Chapman

Nov 2, 2016, 3:48:44 AM
to Yogesh Gupta, trinityrn...@googlegroups.com, Brian Haas

Many small transcripts often mean the depth of sequencing wasn't sufficient to get good coverage of the transcripts, and hence many are fragmented. Also, are your tissue-specific libraries all from one inbred individual or from multiple individuals? If it's the latter, some transcripts could be different alleles at the same locus.
Best wishes, Mark

Brian Haas

Nov 2, 2016, 7:56:34 PM
to Mark Chapman, Yogesh Gupta, trinityrn...@googlegroups.com

I'd suggest exploring the ExN50 values as well:


This is usually more informative.
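
To make the idea concrete, here is a minimal Python sketch of N50 and an ExN50-style statistic. It is a simplified illustration and not Trinity's own utility: the function names and input dictionaries are hypothetical, and it assumes one expression value per transcript (for example, a normalized abundance estimate). ExN50 here means the N50 computed only over the most highly expressed transcripts that together account for x% of the total expression.

```python
from typing import Dict, List


def nx(lengths: List[int], x: float = 50.0) -> int:
    """Return Nx: the length L such that contigs of length >= L
    together contain at least x% of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    target = sum(lengths) * x / 100.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return min(lengths)


def exn50(lengths: Dict[str, int], expr: Dict[str, float], x: float) -> int:
    """N50 over the top-expressed transcripts covering x% of total expression.

    Assumption: `lengths` and `expr` share the same transcript IDs as keys.
    """
    total = sum(expr.values())
    ranked = sorted(expr, key=expr.get, reverse=True)
    kept: List[int] = []
    running = 0.0
    for tid in ranked:
        kept.append(lengths[tid])
        running += expr[tid]
        if running >= total * x / 100.0:
            break
    return nx(kept, 50.0)
```

Because weakly expressed contigs, which are often short fragments, are excluded at lower values of x, this statistic is less affected by the large number of ~200 bp contigs than the plain N50.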

best,

~b