Tbbmalloc/jemalloc and python modules


mi...@crabcat.com

Nov 17, 2021, 9:55:27 PM
to OpenVDB Forum
Hi all,

I use OpenVDB from vcpkg to build both the Windows and Linux versions of the Python library for our in-house geological data processing software.

I hadn't paid any attention to the allocator until recently, when I tried tbbmalloc_proxy. For our workload it gives a > 10x performance gain in many places, which is amazing.

The problem I face is that it works when we produce an exe, but when we produce a pyd (a Python extension module, i.e. a DLL) tbbmalloc_proxy doesn't load. I'm assuming this is because the proxy needs to set itself up at process startup, not when the pyd is loaded dynamically. I've searched extensively but can't find any solution to this (except perhaps making a custom Python build with tbbmalloc_proxy linked into python.exe, which I'm yet to try). I suspect that LD_PRELOAD would solve this problem on Linux, but I can't find any equivalent feature on Windows.

Has anyone else managed to get tbbmalloc_proxy to work in a Python library on Windows?

Given the performance gain that we're seeing, I'm willing to entertain almost any solution to get this working on Windows. It looks to me like OpenVDB itself has no built-in allocator-related code and just relies on proxies or LD_PRELOAD to replace the underlying system allocation calls. Has anyone tried customizing OpenVDB directly (overriding the new/delete operators) or anything like that? Is there some way that linking jemalloc when building the OpenVDB library would make OpenVDB allocate through jemalloc?
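
To make the new/delete idea concrete, here is a minimal sketch of class-level operator overloads. `ToyNode` and `ToyBackend` are hypothetical names for illustration only, not OpenVDB types, and the backend here just forwards to std::malloc/std::free while counting live blocks. A real build would presumably forward to whichever allocator it prefers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical allocation backend (not part of OpenVDB or TBB).
// It forwards to std::malloc/std::free and counts live blocks.
// The counter is not thread-safe; it is here for illustration only.
struct ToyBackend {
    static std::size_t& liveBlocks() {
        static std::size_t count = 0;
        return count;
    }
    static void* allocate(std::size_t bytes) {
        void* p = std::malloc(bytes);
        if (!p) throw std::bad_alloc();
        ++liveBlocks();
        return p;
    }
    static void deallocate(void* p) noexcept {
        if (!p) return;
        assert(liveBlocks() > 0);
        --liveBlocks();
        std::free(p);
    }
};

// Hypothetical stand-in for a tree node; not OpenVDB's LeafNode.
struct ToyNode {
    float values[512];

    // Class-level overloads: every `new ToyNode` and `delete node`
    // goes through ToyBackend, whatever the global allocator is.
    static void* operator new(std::size_t bytes) { return ToyBackend::allocate(bytes); }
    static void operator delete(void* p) noexcept { ToyBackend::deallocate(p); }
};
```

With this in place, `new ToyNode` and the matching `delete` go through the backend without touching the process-wide malloc. Array forms (`new ToyNode[n]`) would additionally need `operator new[]` and `operator delete[]`.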

Any advice or ideas would be appreciated.

Thanks,
Mike


b...@formlabs.com

Nov 18, 2021, 7:05:32 AM
to OpenVDB Forum
I’ve also had great success with tbb’s allocator. A while ago I opened this request for allocator awareness: https://github.com/AcademySoftwareFoundation/openvdb/issues/1002 
I suspect it’s not actually that hard to do. My interest then was in stateful allocators and John Lakos’s “wink-out” trick, but just doing regular old templated allocators, as std::vector does, would be a powerful addition. The blocker was lack of interest due to no demonstrated perf benefit. If you can show perf, maybe that would change?

I think we could add a trailing `typename Allocator = std::allocator<T>` to `LeafNode` and have other nodes get the child allocator. Then use allocators in place of new & delete. There's plumbing to allow stateful allocators by passing them into the c'tor. Not trivial, and it would make the type names longer, but I think that would let us switch out allocators at compile time or runtime.
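
As a rough illustration of that idea, the sketch below adds a trailing allocator parameter to a toy leaf and passes a stateful allocator in through the constructor. `ToyLeaf` and `CountingAllocator` are hypothetical names, not OpenVDB's actual `LeafNode` API, and the sketch assumes the allocator's pointer type is a plain `T*`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a leaf node with a trailing allocator parameter.
// Assumes std::allocator_traits<Allocator>::pointer is T*.
template <typename T, typename Allocator = std::allocator<T>>
class ToyLeaf {
    using Traits = std::allocator_traits<Allocator>;

public:
    static constexpr std::size_t kSize = 512; // 8^3 voxels

    // A stateful allocator can be handed in through the constructor.
    explicit ToyLeaf(const T& background = T(), const Allocator& alloc = Allocator())
        : mAlloc(alloc), mData(Traits::allocate(mAlloc, kSize))
    {
        // Sketch only: no cleanup if a constructor throws part-way through.
        for (std::size_t i = 0; i < kSize; ++i) {
            Traits::construct(mAlloc, mData + i, background);
        }
    }

    ~ToyLeaf() {
        for (std::size_t i = 0; i < kSize; ++i) Traits::destroy(mAlloc, mData + i);
        Traits::deallocate(mAlloc, mData, kSize);
    }

    ToyLeaf(const ToyLeaf&) = delete;
    ToyLeaf& operator=(const ToyLeaf&) = delete;

    T& operator[](std::size_t i) {
        assert(i < kSize);
        return mData[i];
    }

private:
    Allocator mAlloc; // declared before mData so it is initialized first
    T* mData;
};

// Hypothetical stateful allocator: counts allocations in a caller-owned counter.
template <typename T>
struct CountingAllocator {
    using value_type = T;
    std::size_t* allocations;

    explicit CountingAllocator(std::size_t* counter) : allocations(counter) {}
    template <typename U>
    CountingAllocator(const CountingAllocator<U>& other) : allocations(other.allocations) {}

    T* allocate(std::size_t n) {
        ++*allocations;
        return std::allocator<T>().allocate(n);
    }
    void deallocate(T* p, std::size_t n) { std::allocator<T>().deallocate(p, n); }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>& a, const CountingAllocator<U>& b) {
    return a.allocations == b.allocations;
}
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>& a, const CountingAllocator<U>& b) {
    return !(a == b);
}
```

`ToyLeaf<float>` keeps the default allocator, while `ToyLeaf<float, CountingAllocator<float>>` takes its allocator instance as the second constructor argument. The longer type name in the second case is the cost mentioned above.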

—Ben

mi...@crabcat.com

Nov 18, 2021, 9:05:51 PM
to OpenVDB Forum
Hi Ben,

I did end up making progress yesterday with getting tbbmalloc_proxy working for Python on Windows, albeit with a dirty solution. Rather than recompiling Python, I was able to use the cff.exe (CFF Explorer) tool to rewrite the DLL import table of python.exe and inject tbbmalloc_proxy.dll there. This is effectively the same as LD_PRELOAD, afaik. We can only do this because on Windows we distribute Python with our application, so we have the freedom to patch python.exe like this. I had a user test with their data yesterday and it took the runtime from 150 minutes to 16 minutes, so that is a huge win for us.

In the end, since LD_PRELOAD is sufficient on Linux and this python.exe hack is sufficient on Windows, I won't be pressuring anyone into making any substantial changes in OpenVDB. I hadn't really thought about how it could be implemented in OpenVDB, but I can imagine that if it requires new template arguments then that would have a huge impact on everyone. Maybe something using compiler macros and CMake arguments?

Also FYI, that issue is interesting to me because I'd previously looked at how much time our code was spending in destructors due to the size of our trees. I suspect tbb has also helped a lot with this but I don't have any numbers atm. I'll try to create a simplified test to demonstrate our performance gain and the impact on the tree destructors, but that would just be a curiosity at this point.
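
One way the macro idea could look is sketched below. The macro name `TOY_NODE_ALLOCATOR` and the helper `makeLeafBuffer` are hypothetical; nothing like this exists in OpenVDB as far as this thread establishes. The macro names an allocator template, defaults to `std::allocator`, and could be set from the build system through a compile definition, so user-facing type names would not change.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical macro: names the allocator template to use for node storage.
// A build could override it on the compiler command line; the header that
// declares the chosen allocator would then also need to be included.
#ifndef TOY_NODE_ALLOCATOR
#define TOY_NODE_ALLOCATOR std::allocator
#endif

template <typename T>
using NodeAllocator = TOY_NODE_ALLOCATOR<T>;

template <typename T>
using NodeBuffer = std::vector<T, NodeAllocator<T>>;

// Builds a dense buffer for a cubic leaf with edge length 2^log2Dim,
// filled with the background value.
template <typename T>
NodeBuffer<T> makeLeafBuffer(unsigned log2Dim, const T& background) {
    assert(log2Dim <= 5); // keep the toy example small
    const std::size_t voxels = std::size_t(1) << (3 * log2Dim);
    return NodeBuffer<T>(voxels, background);
}
```

Because the allocator is chosen by the alias rather than by a template argument at each use site, switching allocators would be a rebuild-time decision only, which is the trade-off against the runtime flexibility of stateful allocators.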

Cheers,
Mike

edward

Nov 28, 2021, 5:01:26 PM
to OpenVDB Forum
Hi Mike,

Late to this conversation, but have you tried just adding tbbmalloc_proxy.lib to your linker command when creating your Python module? From the docs, note that you also need to add /INCLUDE:"__TBB_malloc_proxy" to the linker line for the Python module DLL as well.

In general, this is not advised because it "infects" the loading process, causing all subsequent allocations to go through tbbmalloc_proxy which may or may not be what one desires.

Cheers,
-Edward

mi...@crabcat.com

Nov 28, 2021, 8:33:03 PM
to OpenVDB Forum
Hi Edward,

I did try linking tbbmalloc_proxy first and it worked fine when I produced an exe output but when I used it for a python module library (pyd) the injection process just didn't seem to work.  Note that I didn't use the /INCLUDE: option but instead had included the proxy header in my main boost python module cpp file which afaik is equivalent.  In the docs it says ".. that is loaded during application startup." so I think the issue is that the pyd modules are not loaded until later when you do a 'import ..' in the python code. Did you manage to get it to work that way?

Thanks,
Mike

edward

Nov 29, 2021, 12:03:22 PM
to OpenVDB Forum
On Sunday, November 28, 2021 at 8:33:03 PM UTC-5 mi...@crabcat.com wrote:
I did try linking tbbmalloc_proxy first and it worked fine when I produced an exe output but when I used it for a python module library (pyd) the injection process just didn't seem to work.  Note that I didn't use the /INCLUDE: option but instead had included the proxy header in my main boost python module cpp file which afaik is equivalent.  In the docs it says ".. that is loaded during application startup." so I think the issue is that the pyd modules are not loaded until later when you do a 'import ..' in the python code. Did you manage to get it to work that way?

No, I've never tried it for Python modules but I have for DLLs. It sounds reasonable to me that pyd modules are not loaded until you import them. But why would that be an issue? I'd presume that OpenVDB wouldn't get used until your Python module is imported? So once the module is imported, tbbmalloc_proxy hooks into all the allocation functions, and then anything you do with OpenVDB afterwards will be "fast".

-Edward
