
Data copying on NUMA


Paavo Helde

Nov 28, 2013, 5:25:12 PM

On NUMA, as the acronym says, some memory is more readily accessible
from a given NUMA node than from others. Now, let's say I have a deep,
dynamically allocated data structure that I want to use in a thread
running on another NUMA node; how should I pass it there? Should I
perform the copy in the other thread, so that the new dynamic
allocations take place in the target thread? This probably depends on
the memory allocator; what would be the best choice on Linux/Windows?
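
Something along these lines is what I have in mind (untested sketch;
Node and deep_copy() are just stand-ins for the real data structure):

#include <memory>
#include <thread>

struct Node {
    int value = 0;
    std::unique_ptr<Node> left, right;
};

// The copies are allocated by whichever thread calls this, so with a
// first-touch policy or a per-thread heap the new nodes should end up
// in that thread's local memory.
std::unique_ptr<Node> deep_copy(const Node* src)
{
    if (!src) return nullptr;
    std::unique_ptr<Node> dst(new Node);
    dst->value = src->value;
    dst->left = deep_copy(src->left.get());
    dst->right = deep_copy(src->right.get());
    return dst;
}

int main()
{
    Node root;                        // built on the "home" node
    std::thread worker([&root] {
        std::unique_ptr<Node> local = deep_copy(&root);
        // ... work on *local here, hopefully out of local memory ...
    });
    worker.join();
}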

This is not a theoretical question; we are actually seeing a large drop
in scaling performance on NUMA, and have to decide whether to go
multi-process somehow, or whether there are ways to make multi-threaded
apps behave better. As far as I understand there is a hard limit of 64
worker threads per process on Windows, so we probably have to go
multi-process at some point anyway. Any insights or comments?

TIA
Paavo

Robert Wessel

Dec 1, 2013, 8:21:29 PM
On Thu, 28 Nov 2013 16:25:12 -0600, Paavo Helde
<myfir...@osa.pri.ee> wrote:

>
>On NUMA, as the acronym says, some memory is more readily accessible
>from a given NUMA node than from others. Now, let's say I have a deep,
>dynamically allocated data structure that I want to use in a thread
>running on another NUMA node; how should I pass it there? Should I
>perform the copy in the other thread, so that the new dynamic
>allocations take place in the target thread? This probably depends on
>the memory allocator; what would be the best choice on Linux/Windows?


On Windows, for example, you can use the VirtualAllocExNuma API to
allocate storage "near" a given node from any running code. By
default the allocation will be in memory local to the allocating
thread, which would not be optimal if another thread is going to do
all the work on that area.
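
A minimal sketch (error handling elided; the node number would come
from something like GetNumaHighestNodeNumber or your own topology
discovery):

#include <windows.h>

// Reserve and commit `size` bytes, with pages taken preferentially
// from the physical memory of NUMA node `node`.
void* alloc_on_node(SIZE_T size, DWORD node)
{
    return VirtualAllocExNuma(GetCurrentProcess(),
                              NULL,   // let the system pick the address
                              size,
                              MEM_RESERVE | MEM_COMMIT,
                              PAGE_READWRITE,
                              node);  // preferred NUMA node
}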

In general, allocating structures close to the executing node is
certainly a win for some applications. Yes, that's vague, but so is
your problem statement. If a thread (or group of threads) is going to
heavily use a structure that won't fit in cache, running all of those
threads on a single NUMA node, and allocating that structure in memory
local to that node, can significantly improve the bandwidth and
latency available to those threads, while also consuming less of the
global resources needed by other threads in the system.
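
Something like this (untested sketch, assumes a 64-bit build) pins
the calling thread to one node's processors before allocating near
that node:

#include <windows.h>

// Restrict the calling thread to the processors of NUMA node `node`,
// then allocate its working buffer with a preference for that node.
void* pin_and_alloc(UCHAR node, SIZE_T size)
{
    ULONGLONG mask = 0;
    if (!GetNumaNodeProcessorMask(node, &mask) || mask == 0)
        return NULL;
    SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)mask);
    return VirtualAllocExNuma(GetCurrentProcess(), NULL, size,
                              MEM_RESERVE | MEM_COMMIT,
                              PAGE_READWRITE, node);
}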


>This is not a theoretical question; we are actually seeing a large drop
>in scaling performance on NUMA, and have to decide whether to go
>multi-process somehow, or whether there are ways to make multi-threaded
>apps behave better. As far as I understand there is a hard limit of 64
>worker threads per process on Windows, so we probably have to go
>multi-process at some point anyway. Any insights or comments?


There is no 64-threads-per-process limit in Windows. Perhaps such a
thing existed in Win9x. Win32 has a limit of 32 logical cores per
machine (which has nothing in particular to do with the number of
threads in a process), but that doesn't apply to Win64 (although if
you run Win32 applications on Win64, only the first 32 cores in the
first processor group are used to execute Win32 code). The 64 you may
be thinking of is the 64-handle limit on a single call to
WaitForMultipleObjects, or the 64 logical processors in a processor
group; neither caps the number of threads in a process.

If you're running multiple processes, then the system can fairly
easily split the workload between nodes: you're implicitly telling
the system that the data is not shared, so the system will try to keep
a process (and hence its allocations) on a particular node. If you're
running enough threads in one process that you're going to span more
than one node, you're going to have to specify some of that manually,
or structure things so that allocations happen only on the appropriate
node (usually by doing the required allocations only on the actual
threads that use the allocated areas).

Andrew Gabriel

Dec 7, 2013, 5:34:38 AM
In article <XnsA287445E6980my...@216.196.109.131>,
Paavo Helde <myfir...@osa.pri.ee> writes:
>
> On NUMA, as the acronym says, some memory is more readily accessible
> from a given NUMA node than from others. Now, let's say I have a deep,
> dynamically allocated data structure that I want to use in a thread
> running on another NUMA node; how should I pass it there? Should I
> perform the copy in the other thread, so that the new dynamic
> allocations take place in the target thread? This probably depends on
> the memory allocator; what would be the best choice on Linux/Windows?

On Solaris, memory is always allocated preferentially local to the
core the thread is running on. When a thread becomes runnable, it
is preferentially scheduled on a core local to the data it has been
accessing. The application itself doesn't need to do anything - this
is a feature of the OS. I'm less clear on how Linux/Windows handle this.
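
For Linux, I believe the default policy is a similar first-touch
locality, and libnuma can be used for explicit placement. A rough
sketch (link with -lnuma; node 0 is an arbitrary choice):

#include <numa.h>    /* from the numactl/libnuma package */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t size = 1 << 20;
    /* Place the buffer's pages on node 0. */
    void *buf = numa_alloc_onnode(size, 0);
    /* ... first-touch and use buf ... */
    numa_free(buf, size);
    return 0;
}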

> This is not a theoretical question; we are actually seeing a large drop
> in scaling performance on NUMA, and have to decide whether to go
> multi-process

Multi-process or multi-threaded doesn't make any difference from the
NUMA point of view, if the data is shared in both cases.

> somehow, or whether there are ways to make multi-threaded apps behave
> better. As far as I understand there is a hard limit of 64 worker
> threads per process on Windows, so we probably have to go multi-process
> at some point anyway. Any insights or comments?

You can optimize the data layout such that different cores are not
competing for the same cache lines and continually invalidating each
other's local caches. To do this, group the data accessed by a single
thread at a time into aligned chunks which are multiples of 64 bytes
long.
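
For example, in C++11 terms (64 bytes assumed to be the cache line
size on your hardware):

#include <atomic>

// One counter per worker thread. alignas(64) aligns each element and
// pads sizeof(PaddedCounter) to a multiple of 64, so no two elements
// of the array share a cache line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter counters[8];    // e.g. one per worker thread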

Break up hot locks where possible. Check for tools to identify hot
locks (on Solaris, plockstat).
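
As a sketch of what breaking up a hot lock can look like: one mutex
per stripe of buckets instead of a single global one (16 stripes is
an arbitrary choice):

#include <cstddef>
#include <mutex>
#include <vector>

class StripedCounters {
    static const std::size_t kStripes = 16;
    std::mutex locks[kStripes];
    std::vector<long> buckets;
public:
    explicit StripedCounters(std::size_t n) : buckets(n) {}
    void add(std::size_t i, long v) {
        // Unrelated updates usually hit different stripes, so they
        // no longer all contend on one global lock.
        std::lock_guard<std::mutex> g(locks[i % kStripes]);
        buckets[i] += v;
    }
};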

If you are heavily using malloc/free in a multi-threaded process,
make sure you are linking in a library which contains a version of
these designed for heavy multi-threaded use (on Solaris, e.g. libumem
or mtmalloc; tcmalloc and jemalloc are common choices elsewhere).

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]