running independent Gecode invocations in threads


Oliver Kullmann

Jan 17, 2022, 7:34:28 AM1/17/22
to us...@gecode.org
Hello,

we are using several independent Gecode solvers in one program, and we want to run them in parallel, in different threads. It works correctly, but is extremely slow: running N solvers in parallel, each solver slows down by at least a factor of N. This already occurs for N=2, on servers with many cores and large memory.

The reason seems to be the global variable for memory management, as mentioned in Section 31.1 "Memory areas" of "Modeling and Programming with Gecode". The different threads, which in our case need to acquire a lot of heap memory, all have to go through this single resource, and thus permanently block each other.

So apparently we need to write a custom memory allocator, as mentioned at the end of Section 31.1. Unfortunately, not much further information is given there. Our task should be very simple: we just want to turn the global variable

Heap heap;

(line 44 of heap.cpp in gecode.org/doc/6.2.0/reference/heap_8cpp_source.html )

for heap management into a local one, so that each thread can use its own heap management.

We don't yet have any experience with doing such things. We would be thankful for any help!

With best regards

Oliver Kullmann

Mikael Zayenz Lagerkvist

Jan 17, 2022, 7:59:42 AM1/17/22
to Gecode
Hi,

Interesting use-case and findings, although I wonder a bit if there might be something else that is slowing you down. Unless you've configured the Heap object to record the peak heap size, there is no synchronization done by Gecode for the member functions in Heap (so making the heap variable thread local would not actually have any effect), nor is it done in the standard Allocator instance [1, 2]. I've also used several disjoint Gecode searches in a single program before, and have not seen the same behavior. What does a profiler say are the top hot-spots?

In general, there are two things you might want to try before anything else:

* Using the release/6.3.0 branch instead. I've made some improvements to the memory handling there with regard to running code in parallel (this was for running the tests in parallel), which might help you. In particular, the region pool became thread-local, and some shared counters moved from locking to atomics. This was very useful in cases where very many propagators were created quickly.

* Using a different memory allocator than the standard one. I've used mimalloc (https://github.com/microsoft/mimalloc) to great effect when running tests in parallel with Gecode on macOS, which has a standard malloc implementation that leaves a lot to be desired.

If neither of the two above tips help, then I would look into profiling data to see where locking is happening.

Cheers,
Mikael

Oliver Kullmann

Jan 17, 2022, 6:02:15 PM1/17/22
to us...@gecode.org
Hi Mikael,

thanks for the prompt answer.
I have now installed 6.3.0, which worked fine (just using "./configure; make; sudo make install").
But when trying to compile our code I get the following error:

/usr/local/include/gecode/kernel/gpi.hpp: In member function ‘Gecode::Kernel::GPI::Info* Gecode::Kernel::GPI::allocate(unsigned int)’:
/usr/local/include/gecode/kernel/gpi.hpp:190:50: error: ‘memory_order_seq_cst’ is not a member of ‘std::memory_order’
190 | c->init(npid.fetch_add(1, std::memory_order::memory_order_seq_cst),gid);

Apparently the compilation used -std=c++17 throughout, while we are compiling our code with -std=c++20 (gcc 10.3.0).

According to
en.cppreference.com/w/cpp/atomic/memory_order

it should be either

std::memory_order_seq_cst

or

std::memory_order::seq_cst

but not

std::memory_order::memory_order_seq_cst

??

Best

Oliver

Mikael Zayenz Lagerkvist

Jan 18, 2022, 1:21:31 AM1/18/22
to gec...@googlegroups.com, us...@gecode.org
Hi,

Thanks for pointing that out. Annoying that
std::memory_order::memory_order_seq_cst compiled under C++11 but not
under C++20. There was actually an attempt at making it work in C++20,
but I removed it and just kept the spelling that should work in both
cases. I've pushed the changes (and a change to make the black hole
example work again) to a PR here:
https://github.com/Gecode/gecode/pull/140 that you can try out. Hope
this helps!

Cheers,
Mikael



--
Mikael Zayenz Lagerkvist

Oliver Kullmann

Jan 18, 2022, 2:54:09 AM1/18/22
to us...@gecode.org
Hi Mikael,

thanks, now everything compiles, and running the solvers in parallel threads works well!

You wrote

"This was very useful in cases where very many propagators were created quickly."

And that was likely our problem (we perform many look-aheads by simply copying the space).

Thanks a lot!

Oliver

Mikael Zayenz Lagerkvist

Jan 18, 2022, 3:07:22 AM1/18/22
to gec...@googlegroups.com, us...@gecode.org
Hi,

Great to hear that it works well now. Creating very many small copies
sounds like it triggers exactly the same behavior as running the
tests in parallel, so I am not surprised that the same changes were
needed.

If you haven't tried it yet, it can be very useful to try out
alternative memory allocators like mimalloc as well. For simple
testing, using LD_PRELOAD to replace the system malloc works well
enough.
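For example, something along these lines (the library path and program name are placeholders; adjust them to where mimalloc is installed and to your own binary):

```shell
# Replace the system malloc with mimalloc for this one run only.
# /usr/local/lib/libmimalloc.so and ./my_gecode_program are assumptions.
LD_PRELOAD=/usr/local/lib/libmimalloc.so ./my_gecode_program
```

No recompilation is needed, which makes this a cheap experiment before committing to a different allocator.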

Cheers,
Mikael




Filip Konvička

Jan 18, 2022, 5:22:25 AM1/18/22
to gec...@googlegroups.com, us...@gecode.org
Hi,

This reminded me of a concurrency-related memory-allocation issue that we ran into some time ago. It isn't strictly Gecode-related, but it might be interesting anyway, and it was the first thing I thought of when I saw the original post.

Some time ago I contributed an 'arena allocator' to Gecode, and since then it has undergone some further development in the Gecode code base (I think it is called the "region allocator" now).  The idea is that the allocator gets a big chunk of memory from the heap and serves allocation requests from it, while the 'free' operation is a no-op.  In our products, we use a similar allocator for temporary structures (eliminating the cost of heap mutexes, heap allocation, and structure destructors).  This works very well, but when multiple threads rapidly create and delete such 'arenas', the 'malloc' of the initial arena memory chunk may become a bottleneck, since ultimately they all access the process-wide heap.  We were able to work around this by introducing a 'resettable' arena: when a thread works in a for-loop, it can reuse the arena from the previous iteration, which solves the contention and also improves performance by reusing the already allocated heap memory.

I'm not sure how widely the arena (region) allocators are used by Gecode itself or by Gecode users, but if there is enough interest I could try and provide the extension I just described.

Best Regards,
Filip


On Tue, Jan 18, 2022 at 9:07 AM Mikael Zayenz Lagerkvist <zay...@gmail.com> wrote:

Mikael Zayenz Lagerkvist

Jan 18, 2022, 5:44:14 AM1/18/22
to gec...@googlegroups.com, us...@gecode.org
Hi Filip,

As you correctly guessed, the Region allocator (which is used _a lot_
in Gecode) was a bottleneck for running the tests in parallel. For
the testing case, it was the static pool of chunks that was the
problem, which was fixed by turning it into a thread-local static
instead: https://github.com/Gecode/gecode/commit/bd5c61c0c9de7d930db50316ba811b4a7117777d

I think that in most cases in Gecode, regions are not used in a loop
(I haven't looked carefully, though), and even so, such access
patterns should be handled reasonably well by the thread-local
separation of the pools. This is just a guess; I haven't looked
into it that much. When I profiled testing that stresses the
region system, the move to thread-local regions made it all but
disappear from the profiles, IIRC.

Cheers,
Mikael


