Interaction between multi-threaded shepherds, NUMA, and the topology libraries

Romain Dolbeau

Mar 9, 2011, 4:58:05 AM
to qthr...@googlegroups.com
Hello all,

I've run into an issue regarding the behavior of the multi-threaded
shepherds in a (non)NUMA environment. Specifically, when I want to mess
with the settings :-)

On one non-NUMA system (a dual-socket quad-core Harpertown), qthreads
picks libnumaV2 for the topology. By default, it creates a single shepherd
with 8 workers and exploits all cores. This is fine. If I force a
different number of shepherds (4), qthreads uses 2 workers in each, which
is also fine. Only, I don't know how to check that the pinning was done
properly: I want one shepherd per shared L2 (a quad-core Harpertown has two
dual-core dies, with no cache shared between the two halves, and I want
all the workers in a shepherd to see a shared cache).

On a NUMA system, I have much more trouble. The system is a dual-socket
quad-core Nehalem. If I don't do anything, qthreads uses libnuma (not V2,
which is not installed on that system). It then creates 2 shepherds, but
apparently only creates 1 worker for each, as my CPU usage is never above 200%.

So I force qthreads to use hwloc-1.1.1. It also creates 2 shepherds,
and goes all the way to 800% (I assume 4 workers per shepherd). So far, so
good. But when I force the number of shepherds, and sometimes of workers per
shepherd, I don't get what I expect/want. I abbreviate
QTHREAD_NUM_SHEPHERDS as QNS and QTHREAD_NUM_WORKERS_PER_SHEPHERD as QNWPS:

1) QNS=1 -> CPU is only 400%
2) QNS=1 & QNWPS=8 -> CPU is still only 400%
3) QNS=2 -> OK, 800%
4) QNS=2 & QNWPS=4 -> OK, 800%
5) QNS=4 -> CPU is only 400%
6) QNS=4 & QNWPS=2 -> CPU is only 400%
7) QNS=8 -> CPU is only 200% (!)
8) QNS=8 & QNWPS=1 -> CPU is only 200% (!)

Obviously the point of qthreads is to use many qthreads, but my current
test case doesn't really load the shepherds, so I'd like to be able to
limit them to 2 workers each.

Any ideas are welcome (other than "fix your test case" :-)

--
Romain Dolbeau
<romain....@caps-entreprise.com>

Kyle Wheeler

Mar 9, 2011, 8:51:56 AM
to qthr...@googlegroups.com, Romain Dolbeau
Hello Romain,

As a general rule, libnumaV2 is *always* preferable to libnuma (the
latter has several annoying bugs), and for accurate cpu pinning, hwloc
is often even better because it provides us with a greater level of
detail about the system (such as what the shared caches are) that we
can make decisions based upon.

> On one non-NUMA system (a dual-socket quad-core Harpertown), qthreads
> picks libnumaV2 for the topology. By default, it creates a single shepherd
> with 8 workers and exploits all cores.

That means libnuma told it there was a single node with 8 cpus.

> This is fine. If I force a different number of shepherds (4), qthreads
> uses 2 workers in each, which is also fine. Only, I don't know how to
> check that the pinning was done properly: I want one shepherd per shared
> L2 (a quad-core Harpertown has two dual-core dies, with no cache shared
> between the two halves, and I want all the workers in a shepherd to see
> a shared cache).

Using just libnuma, of any variant, qthreads has no information about
the cache, so it's just going to assign shepherds (and workers) in a
relatively brain-dead fashion.

The way to check exactly what the bindings are is to compile in debug
mode (--enable-debug) and export the QTHREAD_DEBUG_LEVEL environment
variable to be 7 before running something simple (like test_basic, in
the tests directory). That will print out a lot of junk, but near the
beginning, you'll see output for a function called assign_nodes(),
which will tell you exactly which shepherd is being assigned where.

> On a NUMA system, I have much more trouble. The system is a dual-socket
> quad-core Nehalem. If I don't do anything, qthreads uses libnuma (not V2,
> which is not installed on that system). It then creates 2 shepherds, but
> apparently only creates 1 worker for each, as my CPU usage is never above 200%.

Check the debug output, in particular look for functions beginning
with the word "guess_". My guess is that libnuma isn't giving qthreads
very good information.

> So I force qthreads to use hwloc-1.1.1. It also creates 2 shepherds,
> and goes all the way to 800% (I assume 4 workers per shepherd). So far, so
> good. But when I force the number of shepherds, and sometimes of workers per
> shepherd, I don't get what I expect/want. I abbreviate
> QTHREAD_NUM_SHEPHERDS as QNS and QTHREAD_NUM_WORKERS_PER_SHEPHERD as QNWPS:
>
> 1) QNS=1 -> CPU is only 400%

That's as expected. It uses hwloc to try to figure out what a "node"
is, to determine how many workers to spawn. For numbers of shepherds
less than what it would have guessed by itself, it will assume that
you want to use less than the whole machine and so will use the same
number of workers per shepherd that it would have used if you *were*
using the whole machine... if that makes any sense.

> 2) QNS=1 & QNWPS=8 -> CPU is still only 400%

Hrm. What's the debug output say?

> 3) QNS=2 -> OK, 800%
> 4) QNS=2 & QNWPS=4 -> OK, 800%
> 5) QNS=4 -> CPU is only 400%

When over-subscribing the machine (i.e. specifying more shepherds than
it thinks is a good idea), it defaults to just a single worker per
shepherd.

> 6) QNS=4 & QNWPS=2 -> CPU is only 400%
> 7) QNS=8 -> CPU is only 200% (!)
> 8) QNS=8 & QNWPS=1 -> CPU is only 200% (!)

Hrm. What's the debug output say?

~Kyle
--
Ahh! Arrogance and stupidity in the same package, how efficient of you!
--Londo Mollari

Allan Porterfield

Mar 9, 2011, 9:05:10 AM
to qthr...@googlegroups.com, Romain Dolbeau
Adding to Kyle's note: most of the work developing the worker code has been done without the numa or
hwloc packages, and has relied on the relatively stable placement of cores within a Linux system for
cache sharing.

I have noticed in the past several situations where the code, when hwloc is loaded (1.0.2 in my case),
ignores the provided values of at least the worker count and guesses what it thinks the
correct number is. I have a test case on a quad-socket Nehalem here that I think is giving
me the same problem (ignoring the number of workers given and maybe the given number of shepherds).
I will try to track down the problem today and hopefully either Kyle or I will have a fix
in soon. I believe that if the user sets an environment variable, it should override any guesses.

Allan

Romain Dolbeau

Mar 9, 2011, 11:26:48 AM
to qthr...@googlegroups.com
Kyle Wheeler wrote:

> As a general rule, libnumaV2 is *always* preferable to libnuma (the
> latter has several annoying bugs), and for accurate cpu pinning, hwloc
> is often even better because it provides us with a greater level of
> detail about the system (such as what the shared caches are) that we
> can make decisions based upon.

Question: when both are available, why does qthreads pick libnuma{,V2}
over hwloc? I have hwloc-1.1.1 on both systems, but it is not used by
default; I have to use the configure option. It would seem a good choice
to always pick the 'best' available library.

> Using just libnuma, of any variant, qthreads has no information about
> the cache, so it's just going to assign shepherds (and workers) in a
> relatively brain-dead fashion.

OK, makes sense to me. libnuma is fairly basic indeed.

> The way to check exactly what the bindings are is to compile in debug
> mode (--enable-debug) and export the QTHREAD_DEBUG_LEVEL environment
> variable to be 7 before running something simple (like test_basic, in
> the tests directory). That will print out a lot of junk, but near the
> beginning, you'll see output for a function called assign_nodes(),
> which will tell you exactly which shepherd is being assigned where.

OK, and that's where the trouble starts. The number of workers is actually
OK, but the pinning is completely wrong (IMHO).

If I don't force anything, I have 2 shepherds * 4 workers, and each worker
on a single core ; as far as I can tell, each shepherd uses its own socket
-> perfect.

If I force the number of shepherds, then
1) for 1: I get 4 workers on a single socket, as explained later in your
message ; it makes sense (even if it second-guesses the user a bit in my
opinion ;-)
2) for 2: same as nothing, perfect
3) for 4: hell breaks loose:
QDEBUG: qt_affinity_set(176): binding shep 0 worker 0 (0) to mask 0x00000001
QDEBUG: qt_affinity_set(176): binding shep 0 worker 1 (1) to mask 0x00000002
QDEBUG: qt_affinity_set(176): binding shep 1 worker 0 (2) to mask 0x00000010
QDEBUG: qt_affinity_set(176): binding shep 1 worker 1 (3) to mask 0x00000020
QDEBUG: qt_affinity_set(176): binding shep 2 worker 0 (4) to mask 0x00000001
QDEBUG: qt_affinity_set(176): binding shep 2 worker 1 (5) to mask 0x00000002
QDEBUG: qt_affinity_set(176): binding shep 3 worker 0 (6) to mask 0x00000010
QDEBUG: qt_affinity_set(176): binding shep 3 worker 1 (7) to mask 0x00000020
-> core mask 0x04, 0x08, 0x40 & 0x80 are not used.
4) for 8: same as 4, just worse, as 0x02 & 0x20 are no longer used.

It feels like shepherds are allocated to sockets "'shepherd number' modulo
'# of sockets'", and the workers are pinned in sequential order to the
cores inside each socket. Which is not very good (I'm guessing here, I
haven't looked at the affinity code yet :-)
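
A minimal sketch of that guess, to check it against the log above. It is a model of the suspected behavior, not the qthreads affinity code: the function names are made up, and it assumes each socket owns a contiguous range of mask bits (socket 0 = bits 0..3, socket 1 = bits 4..7), which is what the masks in the log suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the suspected placement, NOT the qthreads code.
 * Assumes sockets own contiguous bit ranges of the affinity mask
 * (socket 0 = bits 0..3, socket 1 = bits 4..7 on the dual Nehalem). */
uint32_t suspected_mask(unsigned shep, unsigned worker,
                        unsigned nsockets, unsigned cores_per_socket)
{
    assert(nsockets > 0 && cores_per_socket > 0);
    unsigned socket = shep % nsockets;            /* shepherd -> socket */
    unsigned core   = worker % cores_per_socket;  /* sequential inside socket */
    return (uint32_t)1 << (socket * cores_per_socket + core);
}

/* OR together the masks of every worker to see which cores are ever used. */
uint32_t cores_used(unsigned nsheps, unsigned workers_per_shep,
                    unsigned nsockets, unsigned cores_per_socket)
{
    uint32_t used = 0;
    for (unsigned s = 0; s < nsheps; s++)
        for (unsigned w = 0; w < workers_per_shep; w++)
            used |= suspected_mask(s, w, nsockets, cores_per_socket);
    return used;
}
```

With 2 sockets of 4 cores, cores_used(4, 2, 2, 4) only ever touches masks 0x01, 0x02, 0x10 and 0x20, and cores_used(8, 1, 2, 4) only 0x01 and 0x10, which would be consistent with the 400% and 200% CPU figures.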

> Check the debug output, in particular look for functions beginning
> with the word "guess_". My guess is that libnuma isn't giving qthreads
> very good information.

Maybe, but frankly it's a rhetorical question: I want the memory hierarchy
anyway, so I'm going to stick with hwloc. Out with libnuma, even V2.

> That's as expected. It uses hwloc to try to figure out what a "node"
> is, to determine how many workers to spawn. For numbers of shepherds
> less than what it would have guessed by itself, it will assume that
> you want to use less than the whole machine and so will use the same
> number of workers per shepherd that it would have used if you *were*
> using the whole machine... if that makes any sense.
>
>> 2) QNS=1 & QNWPS=8 -> CPU is still only 400%
> Hrm. What's the debug output say?

That 8 workers are created, but they are all pinned to the same socket (2
workers per core -> not good).

> When over-subscribing the machine (i.e. specifying more shepherds than
> it thinks is a good idea), it defaults to just a single worker per
> shepherd.

From what I observe, it creates 8 workers in all (unless there is only 1
shepherd), but the pinning is wrong (well, in my opinion, obviously).

>> 6) QNS=4 & QNWPS=2 -> CPU is only 400%
>> 7) QNS=8 -> CPU is only 200% (!)
>> 8) QNS=8 & QNWPS=1 -> CPU is only 200% (!)
> Hrm. What's the debug output say?

Again, the right number of workers, but the 'wrong' pinning.

I *think* the best choice would be to have as many workers as cores
(unless overridden by a dumb user like me :-), and pin them for maximum
cache locality inside of a shepherd. That's the same behavior as now for
#shepherds == #nodes, but noticeably different for more shepherds.

Cordially

P.S. any documentation beyond the code for the affinity code? I might want
to hack my own to get what I want ;-)

--
Romain Dolbeau
<romain....@caps-entreprise.com>

Kyle Wheeler

Mar 9, 2011, 12:08:03 PM
to qthr...@googlegroups.com
>> As a general rule, libnumaV2 is *always* preferable to libnuma (the
>> latter has several annoying bugs), and for accurate cpu pinning, hwloc
>> is often even better because it provides us with a greater level of
>> detail about the system (such as what the shared caches are) that we
>> can make decisions based upon.
>
> Question: when both are available, why does qthreads pick libnuma{,V2}
> over hwloc? I have hwloc-1.1.1 on both systems, but it is not used by
> default, I have to use the configure option. It would seem a good choice
> to always pick the 'best' available library.

Because, until recently, hwloc did not have *memory* pinning (which,
as it turns out, is not as reliable as you'd think on Linux). Since
libnuma of either variety does have memory pinning, I dubbed it
"preferable". I suppose the right thing to do is to prefer hwloc-1.1.x
over libnumaV2, then libnuma, then hwloc-1.0.x, and add support for
hwloc's memory pinning interface.

> 3) for 4: hell breaks loose:
> QDEBUG: qt_affinity_set(176): binding shep 0 worker 0 (0) to mask 0x00000001
> QDEBUG: qt_affinity_set(176): binding shep 0 worker 1 (1) to mask 0x00000002
> QDEBUG: qt_affinity_set(176): binding shep 1 worker 0 (2) to mask 0x00000010
> QDEBUG: qt_affinity_set(176): binding shep 1 worker 1 (3) to mask 0x00000020
> QDEBUG: qt_affinity_set(176): binding shep 2 worker 0 (4) to mask 0x00000001
> QDEBUG: qt_affinity_set(176): binding shep 2 worker 1 (5) to mask 0x00000002
> QDEBUG: qt_affinity_set(176): binding shep 3 worker 0 (6) to mask 0x00000010
> QDEBUG: qt_affinity_set(176): binding shep 3 worker 1 (7) to mask 0x00000020
> -> core mask 0x04, 0x08, 0x40 & 0x80 are not used.
> 4) for 8: same as 4, just worse, as 0x02 & 0x20 are no longer used.

Yikes!

> It feels like shepherds are allocated to sockets "'shepherd number' modulo
> '# of sockets'", and the workers are pinned in sequential order to the
> cores inside each socket. Which is not very good (I'm guessing here, I
> haven't looked at the affinity code yet :-)

That sounds likely.

>>> 2) QNS=1 & QNWPS=8 -> CPU is still only 400%
>> Hrm. What's the debug output say?
>
> That 8 workers are created, but they are all pinned to the same socket (2
> workers per core -> not good).

Ahhhh, right, that makes sense, given our built-in assumptions (i.e.
that if QNS < num_sockets_in_machine, you're explicitly wanting to use
less than the whole machine).

>> When over-subscribing the machine (i.e. specifying more shepherds than
>> it thinks is a good idea), it defaults to just a single worker per
>> shepherd.
>
> From what I observe, it creates 8 workers in all (unless there is only 1
> shepherd), but the pinning is wrong (well, in my opinion, obviously).

Fair enough. There's probably something to be said for making
overlapping shepherds aware of each other, but I view that as a corner
case.

For example, there's nothing that says the #shepherds has to be a
multiple of the number of sockets or the number of cores. What is the
"right" thing to do when I want 57 shepherds on an 8-core (4-socket)
machine?

> P.S. any documentation beyond the code for the affinity code? I might want
> to hack my own to get what I want ;-)

Not really. I'm happy to answer any specific questions you have, and
if you put together a more "intelligent" affinity algorithm, I'd be
thrilled to have it.

Kyle Wheeler

Mar 9, 2011, 12:09:22 PM
to qthr...@googlegroups.com
> I will try to track down the problem today and hopefully either Kyle or I will have a fix
> in soon. I believe that if the user sets an environment variable, it should override any guesses.

I agree with you. If that's still happening, we need to fix it.

Romain Dolbeau

Mar 9, 2011, 12:40:42 PM
to qthr...@googlegroups.com
Kyle Wheeler wrote:
> Not really. I'm happy to answer any specific questions you have, and
> if you put together a more "intelligent" affinity algorithm, I'd be
> thrilled to have it.

Before 'intelligent', I'll try "working as I guess is intended" ;-)

Currently, setting QTHREAD_SHEPHERD_BOUNDARY to 'cache' does nothing,
because hwloc answers HWLOC_TYPE_DEPTH_MULTIPLE (well, unless you only have
one level of cache...) and the code falls back to "pu".

This patch:

1) uses the outermost layer of cache for "cache" (i.e. L2 on Harpertown, L3
on Nehalem)
2) uses the innermost layer of cache for "L1cache"
3) uses the second innermost layer of cache for "L2cache"
4, 5) ditto for "L3cache", "L4cache"
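
A minimal sketch of that mapping, not the patch itself. The helper name is made up, and the caller is assumed to have already collected the topology depth of every cache level, ordered from the outermost cache to the innermost one.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not the patch itself. cache_depths[] holds the
 * topology depth of every cache level, ordered from outermost (e.g. L3)
 * to innermost (L1). Returns the depth to use, or -1 if there is no match. */
int boundary_to_depth(const char *boundary, const int *cache_depths, int ncaches)
{
    assert(boundary != NULL && ncaches >= 0);
    if (ncaches == 0)
        return -1;
    if (strcmp(boundary, "cache") == 0)
        return cache_depths[0];                 /* outermost cache */
    /* "L1cache".."L4cache": level k is the k-th cache from the inside. */
    if (boundary[0] == 'L' && boundary[1] >= '1' && boundary[1] <= '4' &&
        strcmp(boundary + 2, "cache") == 0) {
        int level = boundary[1] - '0';
        if (level <= ncaches)
            return cache_depths[ncaches - level];
    }
    return -1;
}
```

For example, if a Nehalem topology had its L3, L2 and L1 caches at depths 2, 3 and 4 (assumed values), "cache" would map to depth 2 and "L1cache" to depth 4; anything unmatched returns -1 so the caller can fall back to another boundary.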

Cordially,

--
Romain Dolbeau
<romain....@caps-entreprise.com>

patch.affinity.RD

Romain Dolbeau

Mar 10, 2011, 4:07:35 AM
to qthr...@googlegroups.com
New patch against r1896, to improve the boundary stuff.

This patch:

1) fixes the # of iterations so that L*cache are properly parsed
2) fixes an excessive setting of shep_depth in case of error
3) adds a couple of constants so that maintenance is easier (no more
immediate values scattered inside the loops)
4) removes a duplicate message about the depth
5) 'fixes' types to match hwloc

patch.hwloc_affinity.cleanup.RD

Romain Dolbeau

Mar 10, 2011, 5:53:49 AM
to qthr...@googlegroups.com
Romain Dolbeau wrote:
> New patch against r1896, to improve the boundary stuff.

This mail includes a second patch (cumulative with the previous one) against
r1896, which tries to improve the binding when using the boundary stuff.

Basically, it tries to find the object used for boundary, and bind workers
to the cpuset of that object.

Seems to work-for-me(tm) on a dual-nehalem (HT disabled, shared L3 inside
each socket), dual-harpertown (two L2 inside each socket), and a single i7
(HT enabled).

Notes
1) it uses the cpuset interface, which seems deprecated in hwloc 1.1.1 (?)
2) it's not very clean...

patch.hwloc_affinity.newbinding.RD

Wheeler, Kyle Bruce

Mar 10, 2011, 12:24:17 PM
to qthr...@googlegroups.com
Works for me - pushed.

> <patch.hwloc_affinity.cleanup.RD>

--
Kyle B. Wheeler
Dept. 1423: Scalable System Software
Sandia National Laboratories
505-844-0394


Wheeler, Kyle Bruce

Mar 10, 2011, 12:40:14 PM
to qthr...@googlegroups.com
This one I'm going to object to, in part because I don't understand it (and I've been staring at this code recently).

> This mail includes a second patch (cumulative with the previous one) against
> r1896, which tries to improve the binding when using the boundary stuff.

It looks like you're trying to find the boundary object type inside qt_affinity_set(), which shouldn't be necessary, since we already found the shep_depth in qt_affinity_init().

Since we've already found the correct shep_depth, when we execute

hwloc_obj_t obj = hwloc_get_obj_inside_cpuset_by_depth(ltopology, allowed_cpuset, shep_depth, myshep->node);

We should be finding an object of the correct type, and then allocating workers within that object's cpuset (which should already be within allowed_cpuset, so there's no need to AND the result with allowed_cpuset except maybe as a sanity check).

Also note that we're using myshep->node instead of myshep->shepherd_id, because we've already handled issues surrounding "what if we have more shepherds than nodes?", in qt_affinity_gendists().

So I guess my question is: what's broken that this is fixing?

> Notes
> 1) it uses the cpuset interface, which seems deprecated in hwloc 1.1.1 (?)

It's not exactly deprecated, it's just that in hwloc 1.1.1 a hwloc_cpuset_t is a typedef of hwloc_bitmap_t, and operations on it are done through the bitmap interface rather than a special cpuset manipulation interface. MOST of the time that doesn't matter, except in cases like the ones I wrapped in #ifdef's.

Romain Dolbeau

Mar 10, 2011, 1:43:20 PM
to qthr...@googlegroups.com
Wheeler, Kyle Bruce wrote:
> So I guess my question is: what's broken that this is fixing?

The old "it doesn't work for me" issue :-)

I tried rev 1896 on both the dual-Harpertown & dual-Nehalem systems, and
using the 'cache' stuff was broken: the right number of shepherds was
created, but the workers were pinned all wrong. Not all cores were used,
and I don't like it when not all cores are used :-)

> It looks like you're trying to find the boundary object type inside
> qt_affinity_set(), which shouldn't be necessary, since we already found
> the shep_depth in qt_affinity_init().

The depth yes, we have ; not the object itself.

> Since we've already found the correct shep_depth, when we execute
>
> hwloc_obj_t obj = hwloc_get_obj_inside_cpuset_by_depth(ltopology,
> allowed_cpuset, shep_depth, myshep->node);
>
> We should be finding an object of the correct type, and then allocating

*an* object, while my patch looks for *the* object ;-)

> workers within that object's cpuset (which should already be within
> allowed_cpuset, so there's no need to AND the result with allowed_cpuset
> except maybe as a sanity check).

I was playing safe, the patch is certainly not perfect.

> Also note that we're using myshep->node instead of myshep->shepherd_id,
> because we've already handled issues surrounding "what if we have more
> shepherds than nodes?", in qt_affinity_gendists().

I wasn't sure how it's supposed to work, but it didn't on my system, I had
all the wrong binding. I suspect the ->node wasn't at fault (I mistakenly
thought it referred to the NUMA node), but that only the binding logic was.

Example for the dual-Harpertown:

##### rev 1896 + only the cleanup patch
$ QTHREAD_DEBUG_LEVEL=77 QTHREAD_SHEPHERD_BOUNDARY=L1cache ./test_basic 2>&1 | grep binding
QDEBUG: qt_affinity_set(252): binding shep 0 worker 0 (0) to mask 0x00000001
QDEBUG: qt_affinity_set(252): binding shep 1 worker 0 (1) to mask 0x00000004
QDEBUG: qt_affinity_set(252): binding shep 2 worker 0 (2) to mask 0x00000010
QDEBUG: qt_affinity_set(252): binding shep 3 worker 0 (3) to mask 0x00000040
QDEBUG: qt_affinity_set(252): binding shep 4 worker 0 (4) to mask 0x00000002
QDEBUG: qt_affinity_set(252): binding shep 5 worker 0 (5) to mask 0x00000008
QDEBUG: qt_affinity_set(252): binding shep 6 worker 0 (6) to mask 0x00000020
QDEBUG: qt_affinity_set(252): binding shep 7 worker 0 (7) to mask 0x00000080
***** -> this is OK
$ QTHREAD_DEBUG_LEVEL=77 QTHREAD_SHEPHERD_BOUNDARY=L2cache ./test_basic 2>&1 | grep binding
QDEBUG: qt_affinity_set(252): binding shep 0 worker 0 (0) to mask 0x00000001
QDEBUG: qt_affinity_set(252): binding shep 0 worker 1 (1) to mask 0x00000001
QDEBUG: qt_affinity_set(252): binding shep 1 worker 0 (2) to mask 0x00000010
QDEBUG: qt_affinity_set(252): binding shep 1 worker 1 (3) to mask 0x00000010
QDEBUG: qt_affinity_set(252): binding shep 2 worker 0 (4) to mask 0x00000008
QDEBUG: qt_affinity_set(252): binding shep 2 worker 1 (5) to mask 0x00000008
QDEBUG: qt_affinity_set(252): binding shep 3 worker 0 (6) to mask 0x00000080
QDEBUG: qt_affinity_set(252): binding shep 3 worker 1 (7) to mask 0x00000080
***** -> this is wrong, only 1 core is used per shepherd, half the cores
are wasted
##### rev 1896 + both patches
$ QTHREAD_DEBUG_LEVEL=77 QTHREAD_SHEPHERD_BOUNDARY=L2cache ./test_basic 2>&1 | grep binding
QDEBUG: qt_affinity_set(277): binding shep 0 worker 0 (0) to mask 0x00000001
QDEBUG: qt_affinity_set(277): binding shep 0 worker 1 (1) to mask 0x00000004
QDEBUG: qt_affinity_set(277): binding shep 1 worker 0 (2) to mask 0x00000010
QDEBUG: qt_affinity_set(277): binding shep 1 worker 1 (3) to mask 0x00000040
QDEBUG: qt_affinity_set(277): binding shep 2 worker 0 (4) to mask 0x00000002
QDEBUG: qt_affinity_set(277): binding shep 2 worker 1 (5) to mask 0x00000008
QDEBUG: qt_affinity_set(277): binding shep 3 worker 0 (6) to mask 0x00000020
QDEBUG: qt_affinity_set(277): binding shep 3 worker 1 (7) to mask 0x00000080
***** -> now we use all cores, and each shepherd only sees one L2 cache
(no sharing between shepherds)
(the patch doesn't change the 'good' behavior when looking for the L1 cache).

------------------------------

I attach a simpler version of this patch which

1) uses ->node instead of ->shepherd_id (wrong fix)
2) drops the cpuset stuff to simply use obj->allowed_cpuset (easier)

... so it basically keeps the 1896 code, but replaces the (in my
experience not-quite-working) "me->packed_worker_id / nb_shepobjs" by
"me->worker_id % hwloc_cpuset_weight(obj->allowed_cpuset)" (and makes the
debug message a bit more verbose).
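
A minimal sketch of that modulo selection, with the cpuset modelled as a plain 32-bit mask instead of a real hwloc cpuset. The function names are made up, and the example cpusets (0x05, 0x50, 0x0A, 0xA0 for the four Harpertown L2 caches) are only inferred from the masks printed in the log above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in a cpuset mask (what hwloc_cpuset_weight() reports
 * for a real cpuset; here the cpuset is modelled as a plain 32-bit mask). */
unsigned mask_weight(uint32_t cpuset)
{
    unsigned n = 0;
    for (; cpuset != 0; cpuset &= cpuset - 1)
        n++;
    return n;
}

/* Hypothetical sketch: bind worker `worker_id` to the
 * (worker_id % weight)-th PU of the shepherd's cpuset. */
uint32_t worker_mask(uint32_t cpuset, unsigned worker_id)
{
    assert(cpuset != 0);
    unsigned target = worker_id % mask_weight(cpuset);
    for (unsigned bit = 0; bit < 32; bit++) {
        if (cpuset & ((uint32_t)1 << bit)) {
            if (target == 0)
                return (uint32_t)1 << bit;
            target--;
        }
    }
    return 0; /* not reached for a non-empty cpuset */
}
```

Under that assumption, workers 0 and 1 of a shepherd whose L2 covers 0x05 land on 0x01 and 0x04, as in the "both patches" output, and a third worker would wrap around to 0x01 again.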

patch.hwloc_affinity.newbinding.RD

Romain Dolbeau

Mar 10, 2011, 2:01:07 PM
to qthr...@googlegroups.com
Romain Dolbeau wrote:

> The depth yes, we have ; not the object itself.

> *an* object, while my patch looks for *the* object ;-)

Those two comments don't make sense because I updated the patch while
writing the mail ; we actually do have the proper object in 'obj'. It
wasn't the 'faulty' bit.

> ... so it basically keeps the 1896 code, but replaces the (in my
> experience not-quite-working) "me->packed_worker_id / nb_shepobjs" by
> "me->worker_id % hwloc_cpuset_weight(obj->allowed_cpuset)" (and makes the
> debug message a bit more verbose).

Same patch attached, but against rev 1899.

patch.hwloc_affinity.newbinding.RD

Wheeler, Kyle Bruce

Mar 10, 2011, 2:28:38 PM
to qthr...@googlegroups.com

> I attach a simpler version of this patch which
>
> 1) uses ->node instead of ->shepherd_id (wrong fix)
> 2) drops the cpuset stuff to simply use obj->allowed_cpuset (easier)
>
> ... so it basically keeps the 1896 code, but replaces the (in my
> experience not-quite-working) "me->packed_worker_id / nb_shepobjs" by
> "me->worker_id % hwloc_cpuset_weight(obj->allowed_cpuset)" (and makes the
> debug message a bit more verbose).

Hrm. So I think I'm understanding this a bit more, but we still aren't in the same place yet. I recognize that the current code is broken, but I don't think we've got the right fix yet. The crux of this, as I understand it, is the call to find sub_obj. Allan and I fixed that recently to deal with the case where he specified 32 shepherds on a 4-socket 8-core-per-socket machine. Originally (and, I think, after applying your change here), they would all pin themselves to the first core on each socket. That's why I started using packed_worker_id and nb_shepobjs, so that they would use successive cores.

In some sense, it may be an issue of "what is the right shepherd boundary when you have more shepherds than you would expect for the default shepherd boundary?"---perhaps it needs to automatically jump to PU's? Or maybe it should check each type of boundary and see if the num-obj's for each category matches the number of desired shepherds, and falls back to PUs if nothing matches? I'm not sure what the right answer is.

I need to spend the rest of today on a different project, but if you have time, can you make sure that your patch correctly handles the case of QTHREAD_NUM_SHEPHERDS=num_cores? Once it can do that, I'll push it into svn.

Romain Dolbeau

Mar 11, 2011, 3:38:19 AM
to qthr...@googlegroups.com
Wheeler, Kyle Bruce wrote:

> Hrm. So I think I'm understanding this a bit more, but we still aren't
> in the same place yet. I recognize that the current code is broken, but
> I don't think we've got the right fix yet.

Same here. The patch works if you set QTHREAD_SHEPHERD_BOUNDARY, but still
miserably fails if you set QTHREAD_NUM_SHEPHERDS.

> In some sense, it may be an issue of "what is the right shepherd
> boundary when you have more shepherds than you would expect for the
> default shepherd boundary?"---perhaps it needs to automatically jump to
> PU's? Or maybe it should check each type of boundary and see if the
> num-obj's for each category matches the number of desired shepherds,
> and falls back to PUs if nothing matches? I'm not sure what the right
> answer is.

I think maybe the problem is the initialization sequence. What happens now
is (well, that's how I understand it):

1) qthread (not affinity) looks up the env. vars. for the numbers
2) qt_affinity is called *without* that information, and looks for the
boundary env. var.
3) the 'guess' functions are called, but only if step 1) didn't give any
results.
4) later on, the shepherds call qt_affinity_set()

So as far as I can tell, the affinity code is never told how many
shepherds & workers there are, or why.

I think a more efficient way would be:

1) (no change)
2) qt_affinity is called with a pointer to this (non-MT) or those (MT)
variables ; if set, it has to work with it/them, otherwise it guesses and
sets the value(s)
3) (this step deliberately left empty)
4) later on, the shepherds call qt_affinity_set()

That way, the affinity code would have enough information about how many
and why, so as to make an 'intelligent' decision.

If I find the time (I have the code _using_ qthread to work on :-), I'll
try to whip up something along those lines.

Romain Dolbeau

Mar 11, 2011, 5:01:40 AM
to qthr...@googlegroups.com
Wheeler, Kyle Bruce wrote:
> but if you have time can you make sure that your patch correctly
> handles the case of QTHREAD_NUM_SHEPHERDS=num_cores? Once it can do
> that, I'll push it into svn.

Attached, a suggestion on how to handle things slightly better (IMHO) in
the affinity code.

Basically, qt_affinity_init() is made aware of the user-defined number of
shepherds / workers. Then if they are set, it tries to guess the depth to
use for 'optimal' affinity.

In case the number of shepherds doesn't have an exact match in the
topology, it uses the next *larger* width, thus using *less* than the
whole machine (i.e., no overlap by default). So for instance on my
dual-harpertown, if I use 3 shepherds, I get 6 workers, and each shepherd
uses one L2 cache. Half a socket (and the associated L2 cache) is unused.
That's debatable, but you've got to pick a value, so...

The case QTHREAD_NUM_SHEPHERDS=num_cores (well, I assume you mean PUs, as
on systems with hyperthreading there are more PUs than cores...) is handled
by picking the depth of either the actual PU (w/ HT) or more often the L1
cache (w/o HT, i.e. #PU==#cores==#L1caches), as the code picks the
outermost (lowest depth) level that matches.

So it *should* work as specified on the quad-octo-systems, but I don't
have one to try on.

So the full logic is:
1) if the user specified QTHREAD_NUM_SHEPHERDS &
QTHREAD_NUM_WORKERS_PER_SHEPHERD, use those, and use the lowest depth whose
width is equal to or larger than QTHREAD_NUM_SHEPHERDS; workers are
round-robin'ed on all PUs in that subtree
2) if the user specified QNS but not QNWPS, then same logic but with one
worker per PU
3) if the user specified QTHREAD_SHEPHERD_BOUNDARY, then find the depth,
use the width as the # of shepherds, then again fill PUs with workers
4) if the user didn't say anything, same as 3 with QSB="socket"

I'm sure the code could be cleaner, that's mostly a proof-of-concept.

Cordially,

------------------------------------------------------------------------
*WARNING* this breaks every non-hwloc affinity ; to fix them & restore the
previous behavior, one needs to

1) add the parameters to each qt_affinity_init()

void qt_affinity_init(
- void)
+ qthread_shepherd_id_t *nbshepherds
+#ifdef QTHREAD_MULTITHREADED_SHEPHERDS
+ , qthread_worker_id_t *nbworkers
+#endif
+ )

2) add the following two lines at the end of each qt_affinity_init()

if (*nbshepherds == 0)
    *nbshepherds = guess_num_shepherds();
#ifdef QTHREAD_MULTITHREADED_SHEPHERDS
if (*nbworkers == 0)
    *nbworkers = guess_num_workers_per_shep(*nbshepherds);
#endif

Also, guess_num_shepherds() & guess_num_workers_per_shep(*nbshepherds) can
probably be made local to the affinity code, as they're not needed outside
anymore.

--
Romain Dolbeau
<romain....@caps-entreprise.com>

patch.hwloc_affinity.init_with_parameters.RD

Wheeler, Kyle Bruce

Mar 14, 2011, 11:36:10 AM
to qthr...@googlegroups.com
This looks good to me (except for the obvious updates to the other affinity interfaces).

Allan, any thoughts?

> <patch.hwloc_affinity.init_with_parameters.RD>

Allan Porterfield

Mar 14, 2011, 11:47:31 AM
to qthr...@googlegroups.com
This sounds like it gives me the control I want and prevents most (all?) of the brain-dead layouts we have noticed in the last week or so. Without actually having it checked out and using it for a while, I won't say it's the fix, but it definitely looks better than where we started.

Allan

Romain Dolbeau

Mar 14, 2011, 12:22:45 PM
to qthr...@googlegroups.com
Wheeler, Kyle Bruce wrote:
> This looks good to me (except for the obvious updates to the other affinity interfaces).

The fix I mentioned should restore the previous behavior for everything
else, except for one (nontrivial) detail I just noticed: it doesn't check
that the guess functions returns a value > 0. In the hwloc affinity, I
fixed that directly in the guess functions (i.e. they always return a
'good' value).
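
As an illustration only, a guard of that kind could be a tiny helper like
the one below. The name at_least_one() is hypothetical and not part of
qthreads; it just shows the clamping the two added lines would need if a
guess function could return 0 or a negative value.

```c
/* Hypothetical helper (not a qthreads function): turn a possibly bad
 * guess (0 or negative) into a usable count of at least 1. */
unsigned int at_least_one(long guessed)
{
    return (guessed > 0) ? (unsigned int)guessed : 1u;
}

/* Possible use at the end of a qt_affinity_init():
 *
 *     if (*nbshepherds == 0)
 *         *nbshepherds = at_least_one(guess_num_shepherds());
 */
```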

That's why I didn't put the other affinities in the patch: if I can't test
the code, it's better to break compilation and have someone who can test
do the fix than submit untested code, IMHO. Should Tilera eventually
answer me, maybe at some point in the distant future I'll be able to test
the tile_affinity.c stuff ;-)

Romain Dolbeau

Mar 14, 2011, 12:54:42 PM
to qthr...@googlegroups.com
Allan Porterfield wrote:
> This sounds like it gives me the control I want and prevents most(all?)
> of the brain-dead layouts, we have noticed in the last week or so.
> Without actually having it checked out and using it for a while,
> I won't say it's the fix, but it definitely looks better than where we
> started.

I've played with qthreads for only about a week, so I'm not sure if the
following makes any sense. I was considering adding some sort of 'cache
requirements' for shepherd creation, should the user be able to evaluate
the working set size of the qthreads inside each shepherd. This would
allow maximizing the amount of data blocked in the caches, while
maximizing the number of PUs using those data without 'spilling'.

Something like QTHREADS_SHEPHERD_MIN_CACHE=7MIB would give

1) on a dual Harpertown, fail to obtain enough cache (the L2s are 6 MiB, so
no cache is big enough) -> either fail or create a shepherd for each L2 (2
workers per shepherd)
2) on a dual Nehalem, use one shepherd for each L3 (they are 8 MiB) with 4
workers per shepherd
3) on a single-socket POWER7, create 4 shepherds with 1 or 2 workers each,
blocking into the shared L3 cache (the L2s are too small)

Obviously, using smaller values would try to block into a closer cache, so
that 200 KiB would:

1) on a dual Harpertown, block in L2 but still create one shepherd
(single worker) per PU
2) on a dual Nehalem, block in L2 and therefore create one shepherd
(single worker) per PU
3) on a POWER7, same as Nehalem

Unfortunately I haven't yet been able to produce the code I intended to
try this on. And I'm not even sure that I'm trying to use the right tool
for the job (OpenMP works kinda well on that particular code...) But I
like the tool anyway :-)

Wheeler, Kyle Bruce

Mar 14, 2011, 1:15:28 PM
to qthr...@googlegroups.com

On Mar 14, 2011, at 10:54 AM, Romain Dolbeau wrote:

> I've played with qthreads for only about a week, so I'm not sure if the
> following makes any sense. I was considering adding some sort of 'cache
> requirements' for shepherd creation, should the user be able to evaluate
> the working set size of the qthreads inside each shepherd. This would
> allow maximizing the amount of data blocked in the caches, while
> maximizing the number of PU using those data without 'spilling'.

Not that I wouldn't accept a solid patch that implemented this (I think it's a reasonable feature), but it's not something I'm interested in writing myself. I view this as a convenience feature (a highly hwloc-dependent convenience feature) for (essentially) auto-selecting a shepherd boundary that you could select manually. So, if you feel motivated, cool! Otherwise... it's probably not going to happen.

Speaking of hwloc... since it doesn't (yet) have a pseudo distance estimator, it's not going to displace libnuma and liblgrp at the top of the "preferred" line. However, when they release 1.2, it's supposed to get a good pseudo-distance estimator, which will make it the preferred affinity library, as far as I'm concerned.

Romain Dolbeau

Mar 14, 2011, 1:26:17 PM
to qthr...@googlegroups.com
Wheeler, Kyle Bruce wrote:

> Not that I wouldn't accept a solid patch that implemented this
> (I think it's a reasonable feature), but it's not something I'm
> interested in writing myself. I view this as a convenience feature
> (a highly hwloc-dependent convenience feature) for (essentially)
> auto-selecting a shepherd boundary that you could select manually.

Actually, I was thinking of it more as an 'auto-tuning' feature. If you
know the working set size for each parallel atom of computation (assuming
each atom is going to run on a single shepherd), you ask the system to
allocate as many shepherds as possible to exploit available caches, then
use as many workers (1 per PU) as possible inside of each shepherd to
exploit computational resources. The idea is to maximize throughput on the
set of all atoms, rather than to minimize the latency of a single atom.
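
To make the idea concrete, here is a rough sketch of such a selection. It
is one possible reading of the proposal, not existing qthreads code: the
struct names, pick_layout(), and the simplified description of the machine
(a uniform list of cache levels, sizes in KiB) are all assumptions made for
this example, and a real version would get the cache sizes from hwloc.

```c
#include <stddef.h>

/* One cache level of a hypothetical, uniform machine description. */
struct cache_level {
    size_t   size_kib;    /* capacity of one cache instance, in KiB */
    unsigned pus_sharing; /* number of PUs sharing one instance */
};

struct layout {
    unsigned shepherds;
    unsigned workers_per_shepherd;
};

/* levels[] is ordered from the closest cache (L1) to the farthest.
 * Picks the closest level that can hold one working set, puts as many
 * shepherds as fit in each cache instance (at most one per PU), and
 * splits the PUs of that instance among them as workers.
 * Returns {0, 0} when no cache is large enough. */
struct layout pick_layout(const struct cache_level *levels, size_t nlevels,
                          unsigned total_pus, size_t min_cache_kib)
{
    struct layout out = { 0, 0 };

    if (min_cache_kib == 0)
        return out;
    for (size_t i = 0; i < nlevels; i++) {
        unsigned sharing = levels[i].pus_sharing;

        if (sharing == 0 || levels[i].size_kib < min_cache_kib)
            continue; /* too small: try the next, farther level */

        /* working sets that fit in one instance, capped by its PUs */
        size_t   fit       = levels[i].size_kib / min_cache_kib;
        unsigned per_cache = (fit < sharing) ? (unsigned)fit : sharing;
        unsigned instances = total_pus / sharing;

        out.shepherds            = instances * per_cache;
        out.workers_per_shepherd = sharing / per_cache;
        return out;
    }
    return out;
}
```

Describing the dual Nehalem as { {32, 1}, {256, 1}, {8192, 4} } with 8 PUs
(the L1 and L2 figures are only illustrative), a 7 MiB request gives 2
shepherds with 4 workers each and a 200 KiB request gives 8 shepherds with
1 worker each. On the dual Harpertown, { {32, 1}, {6144, 2} }, the 7 MiB
request finds no cache big enough and returns {0, 0}, leaving the "fail or
fall back" decision to the caller.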

But I have no idea what the currently preferred domain of application for
qthreads is, and if that would actually make any sense in practice. (it
also assumes the overhead of MT-shepherds in the caches is zero...).

... and to be clear, I'm not asking anyone to write the code for me :-)

--
Romain Dolbeau
<romain....@caps-entreprise.com>
