Hello Bill,
So with the standard `Binsanity-wf` and `Binsanity` scripts it would likely crash with 3M contigs (although 3TB of RAM might be enough to try!). But there is another way! You can try the current release of `Binsanity-lc` (which has an initial k-means clustering step), or alternatively try the new update I have been working on, which folds everything into `Binsanity2` and should be ready for official release by the summer. Binsanity2 is designed to replace both the `Binsanity-lc` and `Binsanity-wf` scripts. I have updated the GitHub repo with a new directory called `Binsanity2-Beta`. The script itself is functional, but it remains a bit clunky and needs to be cleaned up.

One thing this new version does to help with overall memory allocation is offer two modes: one that runs an initial k-means clustering step and one that skips it (designated by `--skip-kmeans`). For large assemblies you will want the initial k-means clustering to prevent a memory crash. This is similar to what `Binsanity-lc` did, but with very large assemblies the subsetted clusters coming out of `Binsanity-lc`'s initial k-means step were sometimes larger than 100,000 contigs, so piping them into affinity propagation in the following step could still lead to memory failures. Binsanity2-Beta adjusts for these instances.
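For concreteness, a run might look something like the sketch below. Only `--skip-kmeans` is the option named above; the script name and the `-f`/`-l`/`-c`/`-o` arguments are assumptions carried over from the existing Binsanity scripts, so check the beta script's `--help` for the real interface.

```bash
# Sketch only: script name and most flags are assumptions borrowed from
# the classic Binsanity scripts; --skip-kmeans is the one option named above.

# Large assembly: keep the default k-means prepass so affinity propagation
# only ever sees manageable subsets of the contigs.
Binsanity2 -f . -l assembly.fa -c assembly.cov -o BINSANITY-OUTPUT

# Small assembly: skip the k-means prepass and cluster everything at once.
Binsanity2 -f . -l assembly.fa -c assembly.cov -o BINSANITY-OUTPUT --skip-kmeans
```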
I have been actively using Binsanity2-Beta and have sanity-checked its performance against both the available test dataset and some larger metagenomes (the data and some new performance comparison figures will be released with the official version over the summer). Current results show performance comparable to the original Binsanity on small metagenomes, and improvements in the number and quality of MAGs for large metagenome assemblies.

To use the beta version of Binsanity2 I recommend creating a binsanity conda env or virtual environment (if you haven't already done so) for Python 3 (I am still working to make sure everything works in Python 3.8, so to be safe go with Python 3.6). Then install the newest version of Binsanity. I updated the PyPI and conda packages, so they should automatically put the Binsanity2-Beta script in your path. If they don't, pull the GitHub repo and add the script to your path yourself (note that you may have to use `pip install binsanity` rather than `conda install binsanity` to get v0.5.2).
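In practice the setup boils down to something like this (the env name and Python 3.6 pin just mirror the recommendation above; the PATH line assumes the beta script sits in the `Binsanity2-Beta` directory of the repo, so adjust to wherever it actually lands):

```bash
# Fresh environment pinned to Python 3.6 (3.8 support is still being verified).
conda create -n binsanity python=3.6
conda activate binsanity

# Newest release -- pip may be needed if the conda package lags behind v0.5.2.
pip install binsanity

# Fallback if the Binsanity2-Beta script did not land in your PATH:
git clone https://github.com/edgraham/BinSanity.git
export PATH="$PWD/BinSanity/Binsanity2-Beta:$PATH"
```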
I also included two temporary workflow bash scripts on the GitHub (these will be rolled into a Snakemake workflow down the line). They reflect the protocols we have been using in our lab to produce MAGs: one specifically for large metagenomes (>25,000 contigs) and one for small ones (<25,000 contigs); a rough outline of their shape is sketched after the citation below. These workflows are updated versions of the workflow used here:
Tully, B. J. et al. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data 5:170203 doi:10.1038/sdata.2017.203 (2018)
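At a high level the protocol looks roughly like this (an illustrative outline, not the actual scripts; the `Binsanity-profile` arguments in particular are placeholders, so defer to each tool's `--help`):

```bash
# Illustrative outline only -- the real protocols are the two bash
# scripts on the GitHub repo; flags below are placeholders.

# 1. Map reads back to the assembly to get per-contig coverage.
bwa index assembly.fa
bwa mem -t 8 assembly.fa reads_R1.fastq reads_R2.fastq | \
    samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam

# 2. Build the coverage profile Binsanity expects from the sorted BAMs.
Binsanity-profile -i assembly.fa -s . -c assembly.cov

# 3. Bin with Binsanity2-Beta as sketched earlier (k-means prepass on
#    for large assemblies, --skip-kmeans for small ones).
```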
Please let me know if you have any questions about the beta version of Binsanity2 or any comments on performance once you've tried it!
Regards,
Elaina