Very large contig sets

Bill Nelson

Mar 23, 2021, 12:41:34 PM
to BinSanity
Hi Elaina,

I have an assembly of a set of 7 very large metagenomes. There are 3M contigs >2,500 bp. Is this even going to be possible to bin with BinSanity if 110k contigs took 400 GB of memory? I do have access to a server with 3 TB of memory.

thanks,
Bill

Elaina Graham

Mar 24, 2021, 7:24:04 PM
to Bill Nelson, BinSanity
Hello Bill,

So with the standard `Binsanity-wf` and `Binsanity` scripts it would likely crash with 3M contigs (although 3 TB might be enough to try!). But there is another way! You can try the current release of `Binsanity-lc` (which has an initial K-Means clustering step), or alternatively I have been working on a new update that pushes everything into `Binsanity2` and should be ready for official release by the summer. Binsanity2 is designed to replace both the `Binsanity-lc` and `Binsanity-wf` scripts. I have updated the GitHub with a new directory called `Binsanity2-Beta`. The script itself is functional, but it remains a bit clunky and needs to be cleaned up.

One thing this new version does to help with overall memory use is offer two modes: one with an initial K-Means clustering step and one without it (designated by --skip-kmeans). For large assemblies you will want the initial K-Means clustering to prevent a memory crash. This is similar to what Binsanity-lc did, but in Binsanity-lc with very large assemblies the subsetted clusters from the initial K-Means step were sometimes larger than 100,000 contigs, so when they were piped into affinity propagation in the following step they could still lead to memory failures. Binsanity2-Beta adjusts for these cases.
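
For a picture of the K-Means-then-AP idea described above, here is a minimal illustrative sketch using scikit-learn on a made-up feature matrix. This is not BinSanity's actual code; the feature construction, cluster count, and damping value are placeholders.

```python
# Illustrative sketch only -- not BinSanity's source. A cheap K-Means pass
# splits contigs into subsets, then the memory-hungry Affinity Propagation
# step runs per subset. AP builds an N x N similarity matrix, so keeping
# each subset small keeps peak memory down.
import numpy as np
from sklearn.cluster import AffinityPropagation, KMeans

rng = np.random.default_rng(0)
# Placeholder feature matrix: rows = contigs, columns = e.g. per-sample coverage.
features = rng.random((10_000, 7))

# Step 1: coarse K-Means clustering (the "initial kmeans clustering" mode).
coarse_labels = KMeans(n_clusters=100, random_state=0).fit_predict(features)

# Step 2: refine each coarse cluster separately with Affinity Propagation.
final_labels = np.empty(features.shape[0], dtype=object)
for k in np.unique(coarse_labels):
    idx = np.where(coarse_labels == k)[0]
    ap = AffinityPropagation(damping=0.9, random_state=0).fit(features[idx])
    for row, lab in zip(idx, ap.labels_):
        final_labels[row] = f"{k}_{lab}"
```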

I have actively been using Binsanity2-Beta and have sanity-checked its performance against both the available test dataset and some larger metagenomes (the data and some new performance comparison figures will be released with the official version over the summer). Current results show comparable performance to the original BinSanity on small metagenomes, and improvements in the number and quality of MAGs for large metagenome assemblies. To use the beta version of Binsanity2, I recommend creating a binsanity conda env or virtual environment for Python 3 if you haven't already done so (I am still working to make sure everything works in Python 3.8, so to be safe go with Python 3.6), then installing the newest version of BinSanity. I updated the PyPI and conda packages, so they should automatically put the Binsanity2-Beta script in your path. If they don't, pull the GitHub repo and add the script to your path (note that you may have to use `pip install binsanity` rather than `conda install binsanity` to get v0.5.2).

I also included two temporary workflow bash scripts on the GitHub (these will be rolled into a Snakemake workflow down the line). They reflect the protocols we have been using in our lab to produce MAGs. There is one specifically for large metagenomes (>25,000 contigs) and one for small metagenomes (<25,000 contigs). These workflows are updated versions of the workflow used here:

Tully, B. J. et al. The reconstruction of 2,631 draft metagenome-assembled genomes from the global oceans. Sci. Data 5:170203 doi:10.1038/sdata.2017.203 (2018)

Please let me know if you have any questions about the beta version of Binsanity2 or any comments on performance once you've tried it! 

regards,
Elaina

Elaina Graham

Mar 24, 2021, 7:27:50 PM
to Bill Nelson, BinSanity
And I forgot to add this, but if you use the workflow bash scripts, they are called like so:

$ bash binsanity2-beta-wf-largeassemblies.sh [NAME OF ASSEMBLY] [PATH TO ASSEMBLY] [COVERAGE FILE w/ FULL PATH] [THREADS]

The same argument order applies to binsanity2-beta-wf-smallassemblies.sh.

Bill Nelson

Apr 5, 2021, 4:25:04 PM
to BinSanity
Hi Elaina,

Thanks for your help.

Regarding the kmeans clustering, is the Clusternumber default of 100 appropriate for a dataset such as mine? I am anticipating ~1200 bins (based on results from MaxBin2). Is setting the Clusternumber significantly higher than 100 a bad idea?

Thanks,
Bill

Elaina Graham

Apr 5, 2021, 5:10:15 PM
to Bill Nelson, BinSanity
Hi Bill,

You could definitely play around with that number and raise it. Part of why I implemented the initial K-Means clustering was to create more manageable contig subsets to pipe into the Affinity Propagation (AP) algorithm, which is the memory-intensive step. In my experience the AP algorithm does a better job overall of delineating relationships, so the goal of the initial K-Means cluster/subset step is to group contigs with similar coverage patterns that can then be further refined with AP, which incorporates composition at that stage. With that in mind, I tend to underestimate the total number of bins for that stage of clustering, because the point isn't necessarily to get final bins but to do an initial grouping of contigs that reduces the memory demands of the final clustering and refinement stages. Based on what you sent me initially, if you have 3M contigs greater than 2,500 bp I would aim to set that initial cluster number somewhere between 100 and 600. You could go higher or lower if you wanted to play around with it, but above ~600 it led to a little more splitting of bins in the test samples I used. I haven't tested this approach on anything with more than 1M contigs though, so it may not affect you at all if you set that higher.
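
As a rough back-of-the-envelope check of why 100-600 keeps the AP inputs well below the ~100,000-contig range mentioned earlier (assuming roughly even cluster sizes, which a real assembly will not give you):

```python
# Average number of contigs handed to each AP run for a 3M-contig assembly,
# assuming perfectly even initial K-Means clusters (real clusters are skewed).
total_contigs = 3_000_000
for k in (100, 300, 600, 1200):
    print(f"k = {k:>4}: ~{total_contigs // k:,} contigs per initial cluster")
```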

Hopefully that makes sense, let me know if it doesn't! 

Regards
Elaina 

Bill Nelson

Apr 5, 2021, 5:16:18 PM
to BinSanity
Hi Elaina,
Yes, that makes sense. How does a higher cluster number affect run time? I've had the process chugging away on 40 threads for 6+ days (with a cluster number of 1000).

Bill

Elaina Graham

Apr 5, 2021, 8:24:00 PM
to Bill Nelson, BinSanity
Run time is always a tough one, because to a large extent it depends on how complex the sample is and how long the clustering takes to converge on an answer. But ultimately the cluster number shouldn't impact run time too much, from my understanding. What may be causing the lag is that right now the default for the K-Means step is 2000 initializations with different centroid seeds and a maximum of 5000 iterations. With how large your contig set is, it's probably having to use every possible initialization to converge on the final cluster set. Now that you have pointed it out, it is a bit of overkill, and I'll work on adding an option for the user to set it manually. I also probably need to set the random_state flag to make it deterministic, which would help here. Usually once the initial K-Means clustering is done, the AP refinement moves through each bin fairly easily, but I have noticed lagging on some of my larger samples.
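
As a point of reference, here is how those settings map onto scikit-learn's KMeans parameters; this is an assumption about the underlying call, not a quote of the BinSanity source, and the cluster number shown is just Bill's value.

```python
from sklearn.cluster import KMeans

# Hypothetical mapping of the settings described above onto scikit-learn.
km = KMeans(
    n_clusters=1000,   # the cluster number used in Bill's run
    n_init=2000,       # runs with different centroid seeds (the likely bottleneck)
    max_iter=5000,     # maximum iterations per run
    random_state=0,    # fixing the seed makes the clustering deterministic
)
# Exposing n_init / max_iter to the user is the knob mentioned above; lowering
# them trades some clustering stability for a much shorter wall-clock time.
```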

Bill Nelson

Apr 24, 2021, 1:17:02 AM
to Elaina Graham, BinSanity
Hi Elaina,
Just wanted to report that my BinSanity job did finally finish. It took 18 days using 40 threads on the 3 TB RAM server. The CheckM portion failed (not sure why yet), but I did get 28k bins from my 3M contigs. My plan is to use this as input for DAS Tool (along with some other binning results). If you have any suggestions, please let me know.

Thanks for your help,
Bill

egrah...@gmail.com

Apr 24, 2021, 12:30:35 PM
to Bill Nelson, BinSanity
Hi Bill,

Wow, that's an insane number of bins, but I'm excited it got through! Interesting about CheckM. I would almost wager that the CheckM error is something called a “recursion error”, similar to this issue: https://github.com/Ecogenomics/CheckM/issues/118

I've only recently started running into it, and I've been trying to devise a solution. One option is adding a setting for the recursion depth to the code base, but I need to dig into how that could backfire. The second option, which I'm leaning towards, is batching bins into CheckM so that you never pass more than, say, 500 bins at one time. It's on my radar, though it seems to only happen with really large sample sets.
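
A minimal sketch of that batching idea (an illustration, not a released BinSanity or CheckM feature): the directory layout, `.fna` extension, thread count, and batch size of 500 are assumptions, and `checkm` must already be on your PATH.

```python
# Sketch: run CheckM on bins in batches of <= 500 to sidestep the recursion
# error on very large bin sets. Assumes bins are FASTA files named *.fna
# under ./bins/; adjust the extension and paths to your data.
import glob
import os
import subprocess

bins = sorted(glob.glob("bins/*.fna"))
batch_size = 500

for start in range(0, len(bins), batch_size):
    batch_dir = f"checkm_batch_{start // batch_size:03d}"
    os.makedirs(batch_dir, exist_ok=True)
    for fasta in bins[start:start + batch_size]:
        dest = os.path.join(batch_dir, os.path.basename(fasta))
        if not os.path.exists(dest):
            os.symlink(os.path.abspath(fasta), dest)  # link, don't copy
    # checkm lineage_wf <bin dir> <output dir>; -x sets the FASTA extension
    subprocess.run(
        ["checkm", "lineage_wf", "-t", "40", "-x", "fna",
         batch_dir, batch_dir + "_out"],
        check=True,
    )
```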

I have definitely used DAS Tool, and my favorite secondary binner to pair with BinSanity in those scenarios is MetaBAT2, because the two combined seem to cover most things in my experience. You could also try something like Autometa, which takes a different approach. Typically, if I'm working with multiple binning results, I'd use DAS Tool, or just use something like dRep to identify a non-redundant set at a 98.5-99% ANI cutoff.

Regards,
Elaina Graham