ABySS versus CLC: ABySS options to perform as well as CLC?


David Maddison

Dec 15, 2023, 4:39:51 PM
to ABySS
Hi everyone,

We've used CLC Genomics Workbench in my lab for managing our Illumina data, including doing assemblies, but now want to switch over to an open-source solution. This is 150 bp paired-end data from genome skimming of libraries that were not subject to hybrid capture. For each sample we have between 60 million and 770 million reads (i.e., 30 million to 385 million pairs). Reads are trimmed with TrimGalore. These are beetles with no closely related reference genomes available, so the assemblies need to be de novo assemblies. The genome sizes are between 500 Mb and 2 Gb.

We would like to be able to do these assemblies on a multicore machine with 128 GB of memory. CLC does that easily, and quite quickly. Our exploration of SPAdes suggests that it needs much more than 128 GB of memory, and that it isn't really designed for low-coverage assemblies of large genomes. From what I have read, ABySS, using the newer Bloom filter algorithm to reduce memory usage, seems like the thing to use. However, for our test set of data, ABySS assemblies are way worse than CLC assemblies (using default options). This table shows the fragments recovered for 6 genes, for three different specimens:

[Table screenshot: Screenshot 2023-12-10 at 12.45.46 PM.png]

In this table, Total Reads is the number of individual reads, so the number of pairs is half that amount. The length of the sequence used as the BLAST query against each assembly is shown in the bottom row. The size of the largest fragment returned as a BLAST hit in each assembly is shown in the other rows, in the last 6 columns (one column per gene). For Bembidion we have also tried k-mer sizes 32, 64, and 128: 32 and 64 were way worse, and 128 was basically the same as 96.

As you can see, for every gene CLC did way better than ABySS. For the 28S gene, CLC managed to assemble the entire ribosomal cistron from Bembidion and Blennidius, and a 6300-base piece for Medusapyga, but the ABySS assemblies did not. For COI, the CLC assembler yielded one fragment that was at least 60% of the whole mitochondrial genome, whereas ABySS yielded a much smaller piece or no piece matching COI. Similar patterns held for the nuclear protein-coding genes.

The ABySS (version 2.3.7) command that was run is of this form:
abyss-pe name=Test B=50G j=16 k=96 v=-v  in='lib1_R1_001_val_1.fq lib1_R2_001_val_2.fq'

Is there something we are doing wrong?  Or is ABySS not the tool to use for this?  

Thanks for any advice you might have,
David


Lauren Coombe

Dec 15, 2023, 4:55:19 PM
to ABySS
Hi David,

Thanks for reaching out. Just so I'm understanding properly - are you intending to assemble targeted regions of these genomes using this genome skimming data? Or are you looking for full de novo assemblies, and using those targeted regions for QC?

If you are looking for de novo assemblies of a full nuclear genome, I generally recommend at least 30-fold coverage. Given the genome size range you provided, it looks like you are just about approaching that at the lower end, but you would be significantly below it at the upper end. To get the best assemblies, you would probably have to play around with some parameters, like k and kc (k-mer coverage). For the nuclear genome, you may want to try kc values 1-3 and k-mer sizes 64-108. ABySS uses a de Bruijn graph approach, so redundancy in the k-mers is needed (and with lower-coverage skimming, you don't get as much of that).
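If it helps, here is a rough sketch of how that sweep could be scripted (I haven't run this on your data, so treat the k and kc values, directory names, and the B/j settings as placeholders to adjust for your machine):

for k in 64 80 96 112; do
  for kc in 1 2 3; do
    mkdir -p k${k}-kc${kc}
    abyss-pe -C k${k}-kc${kc} name=Test B=50G j=16 k=$k kc=$kc v=-v in='../lib1_R1_001_val_1.fq ../lib1_R2_001_val_2.fq'
  done
done
abyss-fac k*-kc*/Test-scaffolds.fa

Running each combination in its own directory keeps the assemblies from overwriting each other, and abyss-fac at the end summarizes the contiguity of each run so you can compare them side by side.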

If you are looking for targeted solutions, there are a couple of alternatives I can suggest which utilize ABySS but are more specialized for the tasks at hand. For animal mitochondrial genome assembly, we have recently developed the pipeline mtGrasp (https://github.com/bcgsc/mtGrasp). For targeted assembly, we have the tool Kollector (https://github.com/bcgsc/kollector).

Let me know if you have any more questions - always happy to discuss more. 
Also, just so you know, we monitor our GitHub repository (https://github.com/bcgsc/abyss) more closely than this Google group, so feel free to open an issue over there.

Thanks!
Lauren

David Maddison

Dec 18, 2023, 11:34:30 AM
to ABySS
Thanks, Lauren! Yes, these are full de novo assemblies. I can do reference-based assemblies (thanks for the tip about Kollector! I'll check that out!), but I'd prefer to explore de novo for these data at the moment. As for coverage: yes, it is lower than ideal, but it is what it is. This is part of a project in which we are getting >100 samples sequenced, and I can't afford more than 40M pairs per sample. I'll also play around with kc values. (I've tried k-mer sizes 32, 64, 96, and 128 with the default kc value.) I'll report back!

Best wishes,
David

Lauren Coombe

Dec 18, 2023, 11:39:13 AM
to ABySS
Hi David,

Thanks for confirming!

Yes, I totally understand that you have to deal with the data you can get! Sweeping those kc values along with the k values is probably your best bet for parameter tuning. I don't know how your unitig/contig/scaffold contiguity results are varying, but other parameters you could tune are `n` and `N`: both of those control the read-pair support needed in the contig and scaffolding stages (https://github.com/bcgsc/abyss?tab=readme-ov-file#assembly-parameters). With low coverage I'd expect those would need to be set lower.
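For example, lowering both on the command line would look something like this (the n=5 and N=5 values are purely illustrative, not a recommendation; see the README for the defaults):

abyss-pe name=Test B=50G j=16 k=96 kc=2 n=5 N=5 v=-v in='lib1_R1_001_val_1.fq lib1_R2_001_val_2.fq'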

Good luck!
Lauren
