pin task to worker

Dragos

Apr 27, 2014, 2:52:04 PM

to qthr...@googlegroups.com

Hi all,

Is it currently possible in Qthreads to pin a task to a particular worker (not a shepherd)?

Thank you,

Dragos

Kyle Wheeler

Apr 27, 2014, 3:32:01 PM

to qthr...@googlegroups.com

Hey!

Not using the current API (though the logic to do something like that is relatively simple to add). I'm curious, though: what's the use case?

Dragoș Sbîrlea

Apr 27, 2014, 4:44:31 PM

to qthr...@googlegroups.com

Hi Kyle,

Thanks for the super fast answer! I am exploring a possible locality
optimization in which tasks that work on a particular piece of data
run on the same core. This should get more cache reuse than pinning
them at socket/shepherd level (assuming perfect load balancing this
would translate to better performance, too).

Could you suggest generally where/what the changes would be so that
they make sense/are correct/can be integrated back to Qthreads? I am
having trouble figuring out what would be a good design.
Best,
Dragos

Kyle Wheeler

Apr 27, 2014, 6:09:44 PM

to qthr...@googlegroups.com

Hi Dragoș,

It sounds like the easiest way to get what you want is to simply tell qthreads that you want a shepherd to map to a core: set the QT_SHEPHERD_BOUNDARY environment variable to "core". The default shepherd boundary is "node", which refers to the NUMA node, i.e. the set of processors that have the same view of memory layout with regard to latency. This default only makes sense if you assume that cache reuse between scheduler operations isn't a big deal (e.g. that one core can cheaply peek into the next core's cache), such that the benefits of sharing a queue between those cores (i.e. load balancing) outweigh the cache costs of potentially shifting around within a NUMA node. If that is not the case, and your cost of migrating cache lines within a shepherd/NUMA node is a significant performance problem (which probably depends on your computer as much as it depends on the problem), then the shepherd boundary should be shrunk to the point that those costs are no longer an issue.
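
A minimal sketch of passing that setting to a Qthreads program from a launcher script. Only the variable name QT_SHEPHERD_BOUNDARY and the values "core" and "node" come from the message above; the helper names and the binary path are hypothetical and used for illustration only.

```python
import os
import subprocess


def qthreads_env(boundary="core", base=None):
    """Return a copy of the environment with QT_SHEPHERD_BOUNDARY set.

    `base` defaults to the current process environment; the input
    mapping itself is never modified.
    """
    env = dict(os.environ if base is None else base)
    env["QT_SHEPHERD_BOUNDARY"] = boundary
    return env


def run_app(argv, boundary="core"):
    """Launch a Qthreads-based program with the requested boundary."""
    return subprocess.run(argv, env=qthreads_env(boundary), check=True)


# Hypothetical usage (the binary name is an assumption):
# run_app(["./my_qthreads_app"], boundary="core")
```

Running the same binary once with "node" and once with "core" would be one way to compare the two boundaries on a given machine.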

Now, it's possible that for a given problem, some tasks benefit by having a shepherd be a NUMA node and some benefit by having a shepherd be the size of an L1 cache (or even a single worker thread - the two are different on hyperthreaded cores). I thought that might be your issue, but it doesn't sound like it. (That kind of performance heterogeneity is a tough nut to crack, but also a tough nut to demonstrate, and MAY be a case where more flexible affinity specification, a la hierarchical place trees, can really shine. Hacking worker thread affinity into qthreads might solve that problem, and might just create different problems.)

Does that help?

Dragoș Sbîrlea

Apr 27, 2014, 8:28:05 PM

to qthr...@googlegroups.com

Hi Kyle,

I think your intuition was more accurate than my short description of
the problem. :)

By using the current support for binding tasks to shepherds, and
adding support for binding tasks to workers, I was hoping to get a
basic hierarchical place tree operating on Qthreads (just three levels
of tasks: unbound, bound to socket, bound to core). I wanted to use
the two types of pinning to do the following: pin to cores the tasks
that share small data (so I get temporal reuse in L1 cache), and pin
to sockets the tasks that share bigger data that only fits in L2/L3.
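
The three-level scheme described above could be sketched as a simple size-based rule. This is an illustration under stated assumptions, not Qthreads code: the cache capacities are placeholder values that would need to match the target machine, and leaving data that exceeds the last-level cache unbound is an assumption of this sketch.

```python
from enum import Enum


class Level(Enum):
    UNBOUND = "unbound"  # scheduler may run the task anywhere
    SOCKET = "socket"    # bound to a socket (shepherd-level pinning)
    CORE = "core"        # bound to a core (worker-level pinning)


def choose_level(shared_bytes, l1_bytes=32 * 1024, llc_bytes=8 * 1024 * 1024):
    """Pick a pinning level from the size of the data a task group shares.

    l1_bytes and llc_bytes are assumed capacities, not measured ones.
    """
    if shared_bytes <= l1_bytes:
        return Level.CORE    # small enough for temporal reuse in L1
    if shared_bytes <= llc_bytes:
        return Level.SOCKET  # only fits in the shared L2/L3
    return Level.UNBOUND     # assumption: no pinning for larger data
```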

I think for this I really need to implement support for worker-level
pinning; there is no way around it (I had considered the solution of
having a shepherd per core, but that loses the pinning to sockets).

Thank you for your help,
Dragos

Kyle Wheeler

Apr 28, 2014, 8:14:51 AM

to qthr...@googlegroups.com

Hi Dragoș,

Understood. I'm curious if you will be able to reliably demonstrate benefit (I hope so!). Good luck! What's the app, if you don't mind me asking?

~Kyle

Stark, Dylan

Apr 28, 2014, 10:54:11 AM

to qthr...@googlegroups.com

Hi,

I'd like to second Kyle's view that "hacking worker thread affinity into
qthreads might solve that problem, and might just create different
problems." I would start with a single worker per shepherd and no work
stealing, which is available with the Nemesis scheduler, and then
demonstrate that there is a load imbalance, before diving into updating
the work-stealing scheduler with worker-level affinity. The second step
would then be to use a single worker per shepherd with work stealing to
see if there is still a load imbalance issue.
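
One common way to put a number on the load imbalance in each of those configurations is the ratio of the busiest worker's time to the mean. A minimal sketch, assuming per-worker busy times have already been collected by your own instrumentation (how to gather them from Qthreads is not shown here):

```python
def load_imbalance(busy_times):
    """Return max(busy) / mean(busy); 1.0 means perfectly balanced."""
    if not busy_times:
        raise ValueError("need at least one worker")
    mean = sum(busy_times) / len(busy_times)
    if mean == 0:
        return 1.0  # no work at all: treat as balanced
    return max(busy_times) / mean
```

Comparing this ratio with and without work stealing would show whether stealing removes the imbalance.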

Dylan

Dragoș Sbîrlea

Apr 28, 2014, 12:42:00 PM

to qthr...@googlegroups.com

Kyle,
There is no specific app I am targeting. Assuming an oracle sets these
pinning constraints for any app and the oracle is "perfect", the
oracle would just decide to do no pinning at core level if the data
reuse pattern is not suitable for it or if it leads to load imbalance.
That said, HPTs have been shown to yield speedups in at least some
cases, though not necessarily at core level. I know it is a tough nut
to crack... :(

Dylan,
I am not sure why load imbalance before pinning is an issue. I am
focusing on apps where load balance is perfect before pinning, but
where locality may be poor precisely because of that balancing. Without
pinning, the work may be completely balanced, yet the overall time will
suffer from the lack of cache locality. By adding pinning (at whatever
level), we trade off load balancing for more locality, which can
increase the overall performance or may decrease it, depending on the
pinning done. I was hoping that there is an opportunity for pinning at
core level also, even though the load balancing is even more
restricted than with pinning at socket level. I hope to avoid load
balancing issues by leaving a percentage of tasks un-pinned for each
application.
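
A toy model of that trade-off, under loud assumptions: task costs are known up front, pinned tasks stay on their worker, unpinned tasks go greedily to the least-loaded worker, and cache effects are ignored entirely. It only illustrates how leaving some tasks unpinned can absorb imbalance; it says nothing about real Qthreads scheduling.

```python
import heapq


def makespan(pinned, unpinned, n_workers):
    """Finish time of the busiest worker.

    pinned:   list of (worker_index, cost) pairs, fixed to that worker
    unpinned: list of costs, placed greedily on the least-loaded worker
    """
    load = [0.0] * n_workers
    for worker, cost in pinned:
        load[worker] += cost
    heap = [(l, w) for w, l in enumerate(load)]
    heapq.heapify(heap)
    for cost in sorted(unpinned, reverse=True):  # largest tasks first
        l, w = heapq.heappop(heap)
        heapq.heappush(heap, (l + cost, w))
    return max(l for l, _ in heap)
```

For example, with one 4-unit task pinned to worker 0 and four 1-unit tasks, pinning everything to worker 0 gives a makespan of 8, while leaving the small tasks unpinned on two workers gives 4.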

Do you guys have more experience to share about possible issues that
could happen with pinning at core level versus socket level?
Best,
Dragos
