On Thu, 28 Nov 2013 16:25:12 -0600, Paavo Helde <myfir...@osa.pri.ee> wrote:
>
>On NUMA, as the acronym says, some memory is more easily accessible from a
>given NUMA node than from others. Now, let's say I have a deep, dynamically
>allocated data structure that I want to use in a thread running on another
>NUMA node; how should I pass it there? Should I perform the copy in the
>other thread, so that the new dynamic allocations take place in the target
>thread? This probably depends on the memory allocator; what would be the
>best choice on Linux/Windows?
On Windows, for example, you can use the VirtualAllocExNuma API to
allocate storage "near" a given node from any running code. By
default the allocation will be in local memory for the allocating
thread, which would not be optimal if another thread is going to do
all the work on that area.
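For example (an untested sketch; the node number and block size are just
placeholder values):

#include <windows.h>
#include <cstdio>

int main()
{
    // Preferred NUMA node for this block (1 is just an example value).
    const DWORD preferredNode = 1;

    // Reserve and commit 64 MiB of read/write memory "near" that node.
    SIZE_T size = 64 * 1024 * 1024;
    void* p = VirtualAllocExNuma(GetCurrentProcess(),
                                 nullptr,            // let the system pick the address
                                 size,
                                 MEM_RESERVE | MEM_COMMIT,
                                 PAGE_READWRITE,
                                 preferredNode);
    if (!p) {
        std::printf("VirtualAllocExNuma failed: %lu\n", GetLastError());
        return 1;
    }

    // ... hand the block to the worker threads bound to that node ...

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}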
In general, allocating a structure close to the node that executes on it
is certainly a win for some applications. Yes, that's vague, but so is
your problem statement. If a thread (or group of threads) is going to
heavily use a structure that won't fit in cache, then running all of
those threads on a single NUMA node, and allocating that structure in
memory local to that node, can significantly improve the bandwidth and
latency available to those threads, while consuming fewer of the global
resources needed by other threads in the system.
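One way to get the "run all of those threads on a single node" part is to
take the node's processor mask and apply it as the group affinity of each
worker thread before it allocates anything. A rough sketch (Windows 7 or
later for the ...Ex calls; PinCurrentThreadToNode is just a made-up helper
name):

#include <windows.h>

// Pin the calling thread to the processors of a given NUMA node so that
// both its execution and its default allocations stay local to that node.
// Each worker would call this with its assigned node before doing its
// allocations and work.
bool PinCurrentThreadToNode(USHORT node)
{
    GROUP_AFFINITY affinity = {};
    if (!GetNumaNodeProcessorMaskEx(node, &affinity))
        return false;
    return SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr) != 0;
}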
>This is not a theoretical question; we actually see a large scaling
>performance drop on NUMA and have to decide whether to go multi-process
>somehow, or whether there are ways to make multi-threaded apps behave
>better. As far as I understand, there is a hard limit of 64 worker threads
>per process on Windows, so we probably have to go multi-process at some
>point anyway. Any insights or comments?
There is no 64-threads-per-process limit in Windows. Perhaps such a
thing existed in Win9x. Win32 has a limit of 32 logical cores per
machine (which has nothing in particular to do with the number of
threads in a process), but that doesn't apply to Win64 (although if
you run Win32 applications on Win64, only the first 32 cores in the
first processor group are used to execute Win32 code).
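If you want to see how a particular box is laid out, you can enumerate the
processor groups on Windows 7 / Server 2008 R2 and later; something along
these lines:

#include <windows.h>
#include <cstdio>

int main()
{
    // List each processor group and how many logical processors it holds.
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; ++g) {
        DWORD procs = GetActiveProcessorCount(g);
        std::printf("processor group %u: %lu logical processors\n",
                    (unsigned)g, procs);
    }
    std::printf("total: %lu\n", GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}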
If you're running multiple processes, then the system can fairly
easily split the workload between nodes: you're implicitly telling
the system that the data is not shared, so it will try to keep each
process (and hence its allocations) on a particular node. If you're
running enough threads in one process that you're going to span more
than one node, you'll have to specify some of that manually, or
structure things so that allocations only happen on the appropriate
node (usually by doing the required allocations only on the actual
threads that use the allocated areas).
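In other words, have each worker thread allocate and first-touch its own
data rather than building everything on one thread and handing it out. A
trivial sketch of that pattern (Worker and the sizes are placeholders):

#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker: the thread that will crunch the data also performs
// the allocation and the first touch of the pages, so by default the
// backing memory ends up on that thread's own NUMA node.
void Worker(std::size_t items)
{
    std::vector<double> data(items);   // allocated and zero-touched here
    // ... build and process the structure entirely within this thread ...
}

int main()
{
    std::vector<std::thread> pool;
    for (int i = 0; i != 4; ++i)                 // 4 workers, just for the sketch
        pool.push_back(std::thread(Worker, std::size_t(1) << 20));
    for (auto& t : pool)
        t.join();
}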