Hello,
We are currently testing an assembly of RNA-seq data from a non-model insect in our lab. We are interested in four different tissues under two conditions, sampled in triplicate, for a total of 24 samples. Each sample yielded an average of 30 million paired-end 150 bp reads. Some libraries weren't great, and the company is re-sequencing those samples because of machine problems. While we wait, we are testing the parameters of our assembly.
Initially, we are pooling only the data from two full replicates of the experiment for assembly with Trinity 2.6.6, due to our limits on RAM and CPU access on the university's server. That still amounts to around 560 million reads, which were trimmed with Trim Galore to remove adapters. After trimming, most reads were still above 130 bp, so I hope that is not much of an issue. I left the default read normalization on and tested min_kmer_cov values of both 1 and 2. I used the TrinityStats.pl metrics script for a quick-and-dirty look at assembly quality. I expected some fragmentation given the quality of some of the reads, but here are the outputs.
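For context, the runs described above have roughly this shape (file names, CPU and memory values below are placeholders, not our exact settings):

```shell
# Rough sketch of the assembly command; pooled_R1/R2 and the resource
# limits are placeholders, not our actual files or server settings.
Trinity --seqType fq \
        --left  pooled_R1.fq.gz \
        --right pooled_R2.fq.gz \
        --CPU 16 \
        --max_memory 100G \
        --min_kmer_cov 2 \
        --output trinity_mincov2

# Quick metrics on the resulting assembly (the script that produced
# the outputs pasted below):
$TRINITY_HOME/util/TrinityStats.pl trinity_mincov2/Trinity.fasta
```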
With min_kmer_cov = 1
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 240907
Total trinity transcripts: 366666
Percent GC: 32.71
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 4457
Contig N20: 3188
Contig N30: 2411
Contig N40: 1834
Contig N50: 1371
Median contig length: 415
Average contig: 790.29
Total assembled bases: 289771839
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3714
Contig N20: 2483
Contig N30: 1770
Contig N40: 1259
Contig N50: 889
Median contig length: 363
Average contig: 628.22
Total assembled bases: 151342796
-----------
With min_kmer_cov = 2
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 185850
Total trinity transcripts: 304734
Percent GC: 32.82
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 4512
Contig N20: 3258
Contig N30: 2514
Contig N40: 1942
Contig N50: 1485
Median contig length: 412
Average contig: 814.87
Total assembled bases: 248319954
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3870
Contig N20: 2705
Contig N30: 1986
Contig N40: 1465
Contig N50: 1039
Median contig length: 346
Average contig: 649.87
Total assembled bases: 120778934
-------------
The assembly improved somewhat by disallowing singleton k-mers in our pooled data, but it still has a high number of small transcript fragments. Maybe raising the k-mer coverage further, or changing the k-mer size, would improve the assembly. A colleague in the lab is working with a similar data set, but he decided not to pool his samples and instead ran individual assemblies per tissue, and he has obtained fewer partial assemblies. He attributes this to gene isoforms complicating the graph paths. Should I use that strategy to improve the assembly? In the end we want to perform differential expression analysis, but that might prove tricky to compare if each tissue has its own reference for read mapping. Elsewhere I read that increasing the normalization coverage to 200 instead of the default 50 could help, as more reads would be retained to improve the assembly. Any pointers would be helpful, thank you.
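For reference, the re-run I have in mind would look roughly like this; --normalize_max_read_cov is the flag controlling the normalization ceiling mentioned above, and the second command shows how every sample could still be quantified against one pooled reference for DE (all file names are placeholders):

```shell
# Raising the in-silico normalization ceiling from the default 50 to 200
# (pooled_R1/R2 and resource values are placeholders):
Trinity --seqType fq \
        --left  pooled_R1.fq.gz \
        --right pooled_R2.fq.gz \
        --CPU 16 \
        --max_memory 100G \
        --min_kmer_cov 2 \
        --normalize_max_read_cov 200 \
        --output trinity_norm200

# For DE, each sample can still be mapped against the single pooled
# assembly, avoiding separate per-tissue references:
$TRINITY_HOME/util/align_and_estimate_abundance.pl \
    --transcripts trinity_norm200/Trinity.fasta \
    --seqType fq \
    --left  sample1_R1.fq.gz --right sample1_R2.fq.gz \
    --est_method salmon --prep_reference \
    --output_dir sample1_quant
```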
Emiliano Cantón