Multithreading support?

aditya...@gmail.com

May 4, 2017, 10:02:02 PM
to BinSanity
Hi

I have a small metagenome dataset with approximately 7 million assembled contigs (hundreds of GB in total), all >1 kbp in length. Binsanity has now been running on it for 12+ hours without a single line of output.

I also assume that Binsanity, at the moment, can make use of only a single core. METABAT on the same dataset outputs bins in 3-9 h, depending on the settings. Although I assume I have enough memory for Binsanity on our cluster (2 TB, can scale up to 9 TB), I am not sure how long it will take to finish.

Is there any way that Binsanity can be made to use multiple cores?

Regards
Aditya

edgraham

May 10, 2017, 4:07:10 PM
to BinSanity
Hello Aditya,

We have yet to explore parallel computing, but because the algorithm it uses is deterministic I am not sure multithreading would make much of a difference. The way AP (affinity propagation) works is that it stores and updates matrices of similarities, responsibilities, and availabilities between samples. This process, although more memory intensive, ultimately ends up in more accurate clusters because there is no a priori assumption of cluster numbers. Methods such as MetaBAT and CONCOCT are non-deterministic: every time you run an initialization on a different core, a new set of cluster centers is chosen, which makes multicore usage a useful way to improve performance. This multithreading issue is one of the main limitations of BinSanity right now.
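
For intuition, here is a minimal sketch of that style of clustering using scikit-learn's `AffinityPropagation`; the coverage matrix is random placeholder data, and the `preference` and `damping` values are only illustrative stand-ins for BinSanity's own settings, not its actual defaults:

```python
# Minimal sketch of affinity-propagation clustering on coverage profiles.
# Placeholder data; preference/damping are illustrative, not BinSanity's defaults.
import numpy as np
from sklearn.cluster import AffinityPropagation

coverage = np.random.rand(1000, 4) * 100   # rows = contigs, cols = samples
profiles = np.log10(coverage + 1)          # log-scale the coverage

ap = AffinityPropagation(damping=0.9, preference=-5, max_iter=2000)
labels = ap.fit_predict(profiles)

# AP picks the number of clusters itself -- no a priori cluster count needed
print(len(set(labels)), "clusters found")
```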

7 million contigs is a large number, so the first thing I would recommend is looking at the size of your assembly. Given that size, one thing you should consider is setting a larger size cut-off, like 2500 bp. Small contigs often have difficulty clustering and can sometimes prevent the clustering algorithm from converging on an answer quickly. Our lab tries to limit the number of contigs going directly into `BinSanity-wf` to around 100,000, or else it will just gobble up memory.
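
If it helps, one quick way to apply that kind of cut-off before binning is a short Biopython filter; the file names here are just placeholders:

```python
# Drop contigs shorter than a length cut-off before binning.
# File names are placeholders; requires Biopython.
from Bio import SeqIO

MIN_LEN = 2500  # suggested cut-off in bp
kept = (rec for rec in SeqIO.parse("assembly.fa", "fasta") if len(rec.seq) >= MIN_LEN)
n = SeqIO.write(kept, "assembly_min2500.fa", "fasta")
print(n, "contigs kept at >=", MIN_LEN, "bp")
```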

Now from there I would try running the `Binsanity-lc` script instead of `Binsanity-wf`. 

To run it, first update your install via pip; I just modified the `Binsanity-lc` script for you so it will run the full workflow. This is still in the development stage, which is why I haven't updated the documentation to reflect this script, but tests we've run in our lab indicate performance similar to `Binsanity-wf`.

What `Binsanity-lc` does is run an initial clustering using K-means to subset your contigs before clustering with BinSanity, so the key here is selecting an appropriate number of initial clusters. One way to do this would be to use this approach, which estimates the number of genomes in your assembly based on single-copy genes. You can then take that number and extrapolate how many initial clusters you want; generally, though, you want to round down to ensure you're not over-splitting. For example, if I ran that approach and got 8 expected genomes, I would reduce the expected clusters (`--clusters`) for initial subsampling to 5.
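
To make the two-stage idea concrete, here is a rough sketch of it with scikit-learn; the data is random placeholder coverage and the parameter values are illustrative rather than what `Binsanity-lc` actually uses internally:

```python
# Rough illustration of the Binsanity-lc idea: a cheap K-means pass on the
# coverage profiles splits the contigs into --clusters groups, then affinity
# propagation refines the clustering within each group.
# Placeholder data and parameters; not the script's real internals.
import numpy as np
from sklearn.cluster import KMeans, AffinityPropagation

profiles = np.log10(np.random.rand(20000, 4) * 100 + 1)   # contigs x samples

kmeans = KMeans(n_clusters=50, n_init=10).fit(profiles)   # initial subsetting

refined = {}
for k in range(kmeans.n_clusters):
    subset = profiles[kmeans.labels_ == k]
    if len(subset) < 2:
        continue
    ap = AffinityPropagation(damping=0.9, preference=-5).fit(subset)
    refined[k] = ap.labels_   # final bins within this K-means group
```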

For your metagenome, assuming you are keeping that 1kbp cutoff, I would try running this command:


`> Binsanity-lc -f directory/with/fasta -l /fastafile/ -c /coverage_file/ -o /outputdirectory/ --clusters 800  --threads /number/of/threads -p -5 `

I leave you with the caveat that the largest dataset I have tested this on so far was an assembly with ~300,000 contigs. On that dataset the method yielded high-quality genomes similar to running BinSanity without the initial subsampling, but I haven't played around with optimizing parameters, particularly with how the number of K-means initializations can affect the final output. As of right now it is set to 5000 initializations. If you want to change that, you'll need to open up the script and find this line:

`average_linkage = KMeans(n_clusters=int(clusters), n_jobs=-1, n_init=5000)`

Then change the `n_init` parameter to whatever number you'd like. 

Let me know if you have any other questions, as well as how this ends up working out!

Regards,
Elaina Graham