Hi everyone,
We've used CLC Genomics Workbench in my lab for managing our Illumina data, including doing assemblies, but now want to switch over to an open-source solution. This is for 150-base paired-end data from genome skimming, with libraries that were not subject to hybrid capture. For each sample we have between 60 million and 770 million reads (i.e., 30 million to 385 million pairs). Reads are trimmed with TrimGalore. These are beetles, and no reference genomes are available from close relatives, so the assemblies need to be de novo. The genome sizes are between 500 Mb and 2 Gb.
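For a rough sense of the coverage range those numbers imply, here is a back-of-the-envelope sketch. It assumes every read is a full 150 bases (trimming will shorten some), and it pairs the extreme read counts with the extreme genome sizes, which need not correspond to any actual sample. The function name is just for illustration.

```python
READ_LENGTH_BP = 150  # nominal read length; trimmed reads will be somewhat shorter


def coverage(total_reads: int, genome_size_bp: int,
             read_length_bp: int = READ_LENGTH_BP) -> float:
    """Approximate depth of coverage: total sequenced bases / genome size."""
    return total_reads * read_length_bp / genome_size_bp


# Extremes quoted above; these pairings are illustrative, not real samples.
lowest = coverage(60_000_000, 2_000_000_000)   # fewest reads, largest genome
highest = coverage(770_000_000, 500_000_000)   # most reads, smallest genome

print(f"lowest plausible coverage:  {lowest:.1f}x")
print(f"highest plausible coverage: {highest:.1f}x")
```

So, depending on the sample, nominal coverage could be anywhere from a few-fold to a couple of hundred-fold, which may matter when choosing assembler settings.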
We would like to be able to do these assemblies on a multicore machine with 128 GB of memory. CLC does that easily, and quite quickly. Our exploration of SPAdes suggests that it needs much more than 128 GB of memory, and that it isn't really designed for low-coverage assemblies of large genomes. From what I have read, ABySS, using its newer Bloom filter algorithm to reduce memory usage, seems like the thing to use. However, for our test set of data, ABySS assemblies are much worse than CLC assemblies (both run with default options). This table shows the fragments recovered for 6 genes, for three different specimens:
In this table, Total Reads is the number of individual reads, so the number of pairs is half that amount. The length of the query sequence used for BLASTing the assemblies is shown in the bottom row. The other rows show, in the last 6 columns (one column per gene), the size of the largest fragment returned as a BLAST hit in each assembly. For Bembidion we have also tried k-mer sizes of 32, 64, and 128: the assemblies with 32 and 64 were much worse, and the one with 128 was basically the same as with 96.
As you can see, for every gene, CLC did much better than ABySS. For the 28S gene, CLC managed to assemble the entire ribosomal cistron for Bembidion and Blennidius, and a 6300-base piece for Medusapyga, whereas ABySS did not. For COI, the CLC assembler yielded one fragment that was at least 60% of the whole mitochondrial genome, whereas ABySS yielded a much smaller piece or no piece that matched COI. Similar patterns held for the nuclear protein-coding genes.
The ABySS (version 2.3.7) command that was run is of this form:
abyss-pe name=Test B=50G j=16 k=96 v=-v in='lib1_R1_001_val_1.fq lib1_R2_001_val_2.fq'
Is there something we are doing wrong? Or is ABySS not the tool to use for this?
Thanks for any advice you might have,
David