Dear All,
I assembled 14 tissue-specific RNA-seq libraries using Trinity v2.2.0. The assembly statistics are shown below; the assembly contains a lot of transcripts of around 200 bp.
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 198359
Total trinity transcripts: 271942
Percent GC: 45.12
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 2921
Contig N20: 2171
Contig N30: 1717
Contig N40: 1368
Contig N50: 1049
Median contig length: 303
Average contig: 599.19
Total assembled bases: 162944629
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 2473
Contig N20: 1685
Contig N30: 1147
Contig N40: 701
Contig N50: 450
Median contig length: 253
Average contig: 419.35
Total assembled bases: 83181387
Next, I filtered the assembly to keep only the transcripts above 500 bp and extracted the assembly statistics again:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 32171
Total trinity transcripts: 85753
Percent GC: 43.40
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3322
Contig N20: 2577
Contig N30: 2134
Contig N40: 1817
Contig N50: 1554
Median contig length: 1062
Average contig: 1309.00
Total assembled bases: 112250910
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3338
Contig N20: 2539
Contig N30: 2084
Contig N40: 1758
Contig N50: 1480
Median contig length: 936
Average contig: 1222.40
Total assembled bases: 39325850
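A length filter of this kind can be sketched in Python as follows. This is only an illustration, not the exact filtering command used above: the function names `read_fasta` and `filter_by_length` are hypothetical, and it assumes "above 500 bp" means a sequence length of at least 500.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=500):
    """Keep records whose sequence is at least min_len bases long (assumed cutoff)."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Tiny made-up example: only the second record reaches 5 bases.
demo = [">t1 len=3", "ACG", ">t2 len=6", "ACG", "TAC"]
kept = filter_by_length(read_fasta(demo), min_len=5)
print([header for header, _ in kept])
```

Note that the full header line is carried through unchanged, so the transcript IDs stay intact for any step that later parses them.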
Then I used the Perl script that extracts the longest isoform per gene, and it returns only 103 transcripts:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 103
Total trinity transcripts: 103
Percent GC: 42.31
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 14636
Contig N20: 9903
Contig N30: 7765
Contig N40: 6277
Contig N50: 4907
Median contig length: 2165
Average contig: 3138.86
Total assembled bases: 323303
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 14636
Contig N20: 9903
Contig N30: 7765
Contig N40: 6277
Contig N50: 4907
Median contig length: 2165
Average contig: 3138.86
Total assembled bases: 323303
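For comparison, selecting the longest isoform per gene can be sketched in Python as below. This is not the Perl script itself, only a minimal illustration of the idea. It assumes the transcript ID is the first whitespace-separated token of the header and ends in `_i<N>`, with everything before that suffix taken as the gene ID; `gene_id` and `longest_isoform_per_gene` are hypothetical names. If the headers do not follow that pattern, the gene grouping will differ.

```python
import re

# Assumed ID layout: <gene id>_i<isoform number>, e.g. "TRINITY_DN1_c0_g1_i2".
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def gene_id(header):
    """Derive the gene ID by stripping the trailing _i<N> from the transcript ID."""
    transcript_id = header.split()[0]
    return ISOFORM_SUFFIX.sub("", transcript_id)


def longest_isoform_per_gene(records):
    """Return one (header, sequence) pair per gene: the longest isoform seen."""
    best = {}
    for header, seq in records:
        g = gene_id(header)
        # Replace the stored isoform only if this one is strictly longer.
        if g not in best or len(seq) > len(best[g][1]):
            best[g] = (header, seq)
    return list(best.values())


# Made-up records: two isoforms of one gene and a single isoform of another.
demo = [
    ("TRINITY_DN1_c0_g1_i1 len=4", "ACGT"),
    ("TRINITY_DN1_c0_g1_i2 len=6", "ACGTAC"),
    ("TRINITY_DN2_c0_g1_i1 len=3", "ACG"),
]
print(len(longest_isoform_per_gene(demo)))
```

With this grouping, the number of transcripts returned equals the number of distinct gene IDs in the input.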
My questions: Why am I getting so many transcripts of around 200 bp? Is it due to sequencing errors at the ends of the transcripts? And why do I get so few contigs after using the longest-isoform Perl script?
Thanks
Yogesh