Binsanity-lc: estimate -C parameter for large dataset

xvaz...@gmail.com

Apr 14, 2020, 3:51:20 AM
to BinSanity
Hi there,

We are trying to bin a large metagenome assembly (~1.6M contigs, 1 kbp cutoff) using Binsanity-lc. We ran an initial iteration on our cluster with mostly default parameters (just an increase in the number of threads) and it runs out of time. The initial K-means clustering alone takes about 125 h, and the whole process hits the 200 h limit before finishing (it barely gets through binning the first 31 clusters). Many of these initial clusters generate >50 bins each (min. 13, max. 89), most over 20. These 31 clusters had generated 1616 bins by that point.

I've been reading the forum and came across a couple of mentions of the -C parameter and its influence on the process, running time, etc., albeit they are a bit dated. In this reply it is suggested to use num. contigs/10,000 to estimate -C, but that seems a bit extreme. For our dataset it would mean 160... which doesn't seem right to me given that the initial run has already generated over 1600 bins.

I'd rather tune some parameters before increasing the min. contig size, so any input on this would be appreciated.

Thank you in advance,
Xabi

xvaz...@gmail.com

Apr 15, 2020, 12:16:21 AM
to BinSanity
PS: we ran the anvi'o SCG workflow and it suggests 869 bacterial genomes (and a very few archaeal or protist ones)

Elaina

Apr 16, 2020, 2:42:35 PM
to BinSanity
Hi Xabi,

So the '-C' parameter can sometimes help, but with the sheer number of contigs you're inputting I'm not sure it will actually affect the run times. I'll recommend two potential angles you can take:

1. First, you're working with ~1.6M contigs and a 1 kbp cutoff. While I have tested BinSanity down to contigs of 1000 bp, I find that setting a cutoff at ~2000 bp typically speeds up the run significantly and does not reduce bin quality. Contigs below 2000 bp can often be useful, but they also often have more variable coverage profiles and composition metrics that may not align directly with the actual source genome; when I include contigs this small, most end up unbinned, or I end up having to do quite a lot more manual genome refinement in anvi'o to confirm contig assignments. Increasing your cutoff to 2000 bp would be the quickest way to speed up the run and reduce complexity.

2. Your other option for getting BinSanity to run with that many contigs gets a little trickier. One of the main reasons I haven't pushed much further than the current 'Binsanity-lc' implementation in trying to reduce memory is that all the methods I have tried beyond it start to show some loss in the quality of the resultant bins, so I offer this workaround with that caveat. From what I understand, your first attempt completed the K-means clustering step before being cancelled, meaning you have produced a 'BINSANITY-RESULTS' directory in which you should see the following directories and files (or something similar, depending on where it was cancelled):

BINSANITY-INITIAL  BinSanityLC_binsanity_checkm  BinSanityLC-BinsanityLC-log.txt  BinSanityLC.checkm.logfile  BinSanityLC-KMEAN-BINS


Now, the initial K-means bins can be found in 'BinSanityLC-KMEAN-BINS'. If you cannot get 'Binsanity-lc' to run on your system in any other way, you can take the bins in this directory and run individual instances of `Binsanity-lc` or `Binsanity-wf` on them, then take the final genomes from each individual run on each individual K-means bin and combine them at the end to get your final genome set. Again, doing this may cost you some amount of genome quality, but it is one of the few ways I have found to get around memory-related issues. You may still run into the 200-hour time limit, though, depending on your system configuration.
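
To make the idea concrete, a loop along these lines is what I have in mind. This is a rough, untested sketch: I am writing the Binsanity-wf flags (-f, -l, -c, -x, -o) and the output folder/extension names from memory, so check `Binsanity-wf -h` and the contents of your own BINSANITY-RESULTS directory before running it:

# Rough, untested sketch -- adjust paths and names to what your run actually produced.
KMEAN_DIR=BINSANITY-RESULTS/BinSanityLC-KMEAN-BINS   # K-means bins from the first attempt
COV=coverage_profile.lognorm                         # the same coverage file you fed Binsanity-lc
FINAL_SUBDIR=BinSanity-Final-bins                    # rename to whatever your version calls the final-genomes folder
EXT=fna                                              # bin file extension; may be .fa on your install

mkdir -p per-cluster-runs combined-final-genomes

for fasta in "$KMEAN_DIR"/*."$EXT"; do
    name=$(basename "$fasta" ."$EXT")
    mkdir -p "per-cluster-runs/$name"
    Binsanity-wf -f "$KMEAN_DIR" -l "$(basename "$fasta")" \
        -c "$COV" -x 2000 -o "per-cluster-runs/$name"
done

# Pool the final genomes from every per-cluster run, prefixing each file with
# its cluster name so nothing gets overwritten.
for dir in per-cluster-runs/*/; do
    name=$(basename "$dir")
    for genome in "$dir$FINAL_SUBDIR"/*."$EXT"; do
        [ -e "$genome" ] || continue
        cp "$genome" "combined-final-genomes/${name}_$(basename "$genome")"
    done
done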

Please let me know if any of this works or if you need more clarification!

-Elaina

Jéssica Bianca da Silva

Feb 2, 2022, 5:25:12 PM
to BinSanity
Hello everyone,

I'm currently facing the same problem. I have a big dataset, with more than 1 million contigs. I run Binsanity-lc with the standard BinSanity parameters, but when I do, the server memory is quickly exhausted. I'm using the following command line:

Binsanity-lc -f . -l 01.DirPath/contigs.fa -x 2000 --threads 1 --Prefix binsanity_ -c 07.BinSanity/coverage_profile.txt.cov.x100.lognorm -C 100

I have doubts about whether 100 is a good value for the -C parameter. I read that it is the default and is recommended for big datasets, but with this value a lot of memory is consumed.

I'd appreciate any opinion or advice on how to continue this process.

Thanks so much 

Xabier Vázquez-Campos

Feb 2, 2022, 6:18:56 PM
to Jéssica Bianca da Silva, BinSanity
Hi Jéssica,
How big is your dataset? How much mem are you requesting?
I checked some of our logs and it seems we ended up using -x 2500 and the default -C value. However, we had to use the large-memory node on our HPC and request an absurd amount of memory "just in case". I remember that in the end it didn't use as much as expected, but I still think it was >250 GB (definitely >124 GB). It's hard to provide exact numbers, as this run was done by my student.
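Roughly, the call looked like yours but with -x 2500 and without the -C flag (so it falls back to the default). I'm reconstructing it from memory, so treat the paths, prefix and thread count below as placeholders:

Binsanity-lc -f . -l 01.DirPath/contigs.fa -x 2500 --threads 16 \
    --Prefix binsanity_ -c 07.BinSanity/coverage_profile.txt.cov.x100.lognorm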
Cheers,
Xabi

--
Xabier Vázquez-Campos, PhD
Senior Research Associate
NSW Systems Biology Initiative
School of Biotechnology and Biomolecular Sciences
The University of New South Wales
Sydney NSW 2052 AUSTRALIA