TensorFlow scheduler


Dmitry Vyukov

Feb 27, 2019, 4:35:04 AM
to lock...@googlegroups.com
Hi,

TensorFlow CPU task scheduler I wrote some time ago:

https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default

https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default

https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default

This and some other fixes improved model training time on CPU up to 3x.
There is some interesting lock-free stuff in there.

Main design goal was to tailor each piece to the specific problem at hand and be
practical: e.g. lock-free where it matters, mutex-based otherwise; a sane level
of complexity (it needs to be right); not trading practical properties
that matter (array-based queue) for formal properties (lock-freedom).

EventCount (aka "condvar for lock-free algos") has fast common paths
and minimizes contention as much as possible. Due to
non-uniform/bursty work, blocking/unblocking threads all the time
actually turned out to be one of the most critical parts of scheduling
in TF.

RunQueue (work-stealing deque) allows all operations from both sides,
as work can be submitted from external threads in TF. Owner operations are
lock-free; remote ones use a mutex. This part was somewhat unique compared
to similar systems. Getting Size right is tricky!

ThreadPool (the scheduler itself) is a quite standard design based on
distributed runqueues, though you can find some interesting tricks wrt
stealing order there. Also steal partitions (not done by me). The spinning
logic required quite some tuning to balance latency against wasted
CPU.

Chris M. Thomasson

Mar 1, 2019, 2:27:37 AM
to Scalable Synchronization Algorithms
On Wednesday, February 27, 2019 at 1:35:04 AM UTC-8, Dmitry Vyukov wrote:
> Hi,
>
> TensorFlow CPU task scheduler I wrote some time ago:
>
> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>
> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>
> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>
> This and some other fixes improved model training time on CPU up to 3x.
> There is some interesting lock-free stuff in there.
>
> Main design goal was to tailor each piece to the specific problem at hand and be
> practical: e.g. lock-free where it matters, mutex-based otherwise; a sane level
> of complexity (it needs to be right); not trading practical properties
> that matter (array-based queue) for formal properties (lock-freedom).
>
> EventCount (aka "condvar for lock-free algos") has fast common paths
> and minimizes contention as much as possible. Due to
> non-uniform/bursty work, blocking/unblocking threads all the time
> actually turned out to be one of the most critical parts of scheduling
> in TF.

Gotta love the eventcount. Fwiw, I remember way back when I created one in 2005:


Actually, IIRC, Joe Seigh created a very interesting bakery-style rw spinlock; I need to find it. IIRC, it can be outfitted with an eventcount to remove the spins.

 

> RunQueue (work-stealing deque) allows all operations from both sides,
> as work can be submitted from external threads in TF. Owner operations are
> lock-free; remote ones use a mutex. This part was somewhat unique compared
> to similar systems. Getting Size right is tricky!
>
> ThreadPool (the scheduler itself) is a quite standard design based on
> distributed runqueues, though you can find some interesting tricks wrt
> stealing order there. Also steal partitions (not done by me). The spinning
> logic required quite some tuning to balance latency against wasted
> CPU.

Nice! Btw, did you ever find a use for work requesting?

Something like:


;^)

Manuel Pöter

Mar 2, 2019, 10:13:35 AM
to Scalable Synchronization Algorithms
Thanks for sharing this!

Do you know the Chase-Lev work-stealing deque? (https://www.dre.vanderbilt.edu/~schmidt/PDF/work-stealing-dequeue.pdf)
I suppose you do, but I was wondering why you went with a blocking design for operations on the back of RunQueue. Was it only because you wanted to store std::function instances instead of pointers (as mentioned in the source comments)? Just wondering, since std::function also requires a dynamic allocation.

Manuel Pöter

Mar 5, 2019, 11:38:47 AM
to Scalable Synchronization Algorithms
After reading the code more thoroughly I've realized that you have some operations that the Chase-Lev work-stealing deque does not support, like PushBack or PopBackHalf (although Hendler and Shavit proposed a variation of the Arora work-stealing deque that supports stealing ~half the entries).
FWIW, for those of you who are interested: my Xenium library contains an implementation of the Chase-Lev work-stealing deque: https://github.com/mpoeter/xenium/blob/master/xenium/chase_work_stealing_deque.hpp

Chris M. Thomasson

Mar 6, 2019, 1:04:40 AM
to Scalable Synchronization Algorithms

Chris M. Thomasson

Mar 6, 2019, 1:11:11 AM
to Scalable Synchronization Algorithms
Love the:

Work PushFront(Work w) {
  unsigned front = front_.load(std::memory_order_relaxed);
  Elem* e = &array_[front & kMask];
  uint8_t s = e->state.load(std::memory_order_relaxed);
  if (s != kEmpty ||
      !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
    return w;
  front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed);
  e->w = std::move(w);
  e->state.store(kReady, std::memory_order_release);
  return Work();
}

part. It reminds me of the same basic logic in your mpmc queue:

http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

Ahh, I still like my little tweak of your most excellent algorithm:

https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion
;^)

Dmitry Vyukov

Mar 7, 2019, 12:36:36 AM
to lock...@googlegroups.com
Not any real one...

Dmitry Vyukov

Mar 7, 2019, 12:52:53 AM
to lock...@googlegroups.com
On Tue, Mar 5, 2019 at 5:38 PM Manuel Pöter <man...@manuel-poeter.at> wrote:
>
> After reading the code more thoroughly I've realized that you have some operations that the Chase-Lev work-stealing deque does not support, like PushBack or PopBackHalf (although Hendler and Shavit proposed a variation of the Arora work-stealing deque that supports stealing ~half the entries).
> FWIW: for those you are interested - my Xenium library contains an implementation of the Chase-Lev work-stealing deque: https://github.com/mpoeter/xenium/blob/master/xenium/chase_work_stealing_deque.hpp

Hi Manuel,

Yes, stealing half, pushing from the other side, storing functions by
value, and moving functions were the reasons.
std::function may allocate itself, but if you also heap-allocate the
std::function, that's 2 allocations, right? Or you need to rework the
whole of Eigen and TensorFlow to allocate your special pointer-sized thing.
And good std::function impls don't allocate in the common case, so that
makes it completely free of dynamic allocation. And the free will
frequently be remote, and there is no magic in the memory allocator: it
will need to queue the block back or something. These are things that
authors of most lock-free papers prefer not to mention ;)
Also, in my design owner and remote threads operate on different data
(the owner doesn't touch the back, remote threads don't touch the front).
I know that front/back are collocated (I don't remember why), but they
could be spread out completely and work independently.
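The "spread out completely" point is the usual false-sharing fix: keep the owner-side index and the remote-side index on separate cache lines so the two sides never invalidate each other's lines. A tiny illustration (my sketch, assuming 64-byte cache lines; not the Eigen layout):

```cpp
#include <atomic>

// Owner threads touch only front_, remote threads touch only back_.
// alignas(64) places each on its own 64-byte cache line, so updates to one
// side do not cause coherence traffic on the other (no false sharing).
struct QueueIndices {
  alignas(64) std::atomic<unsigned> front_{0};  // owner side
  alignas(64) std::atomic<unsigned> back_{0};   // remote side
};
```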

Dmitry Vyukov

Mar 7, 2019, 12:56:54 AM
to lock...@googlegroups.com
On Wed, Mar 6, 2019 at 7:11 AM Chris M. Thomasson <cri...@charter.net> wrote:
>>>
>>> Hi,
>>>
>>> TensorFlow CPU task scheduler I wrote some time ago:
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>>>
>>> This and some other fixes improved model training time on CPU up to 3x.
>>> There is some interesting lock-free stuff in there.
>>> [...]
>>
>>
>> The code is a work of art: Well done Dmitry!

Thanks!

> Love the:
>
> Work PushFront(Work w) {
> unsigned front = front_.load(std::memory_order_relaxed);
> Elem* e = &array_[front & kMask];
> uint8_t s = e->state.load(std::memory_order_relaxed);
> if (s != kEmpty ||
> !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
> return w;
> front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed);
> e->w = std::move(w);
> e->state.store(kReady, std::memory_order_release);
> return Work();
>
> }
>
> part. It reminds me of the same basic logic in your mpmc queue:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
> Ahh, I still like my little tweak of your most excellent algorithm:
>
> https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion
>
> ;^)

IIRC, once a thread has executed the XADD it has no way back, which implies
blocking/waiting on an empty/full queue.
Here I needed "try" versions, because this is not a producer-consumer
scenario and we don't need back-pressure in a cooperative task execution
scenario. If we fail to pop from one queue, we go and try to pop
from another. If we fail to push onto a queue, we can push onto
another or execute the task directly.
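That push-with-fallback shape can be shown in miniature. A toy sketch (hypothetical ToyQueue and Schedule helpers, not the Eigen/TF code): TryPush hands the task back instead of blocking, and the caller's last resort is to run the task inline, which is why submission never needs back-pressure.

```cpp
#include <deque>
#include <functional>
#include <vector>

using Work = std::function<void()>;

// Toy bounded queue: TryPush returns the work back (non-empty) on failure,
// mirroring RunQueue's convention of returning the task instead of blocking.
struct ToyQueue {
  size_t cap = 1;
  std::deque<Work> q;
  Work TryPush(Work w) {
    if (q.size() >= cap) return w;  // full: hand the task back
    q.push_back(std::move(w));
    return Work();                  // empty function object == accepted
  }
};

// Try each queue in turn; if all are full, execute the task inline on the
// submitting thread. Submission therefore never blocks.
void Schedule(std::vector<ToyQueue>& queues, Work w) {
  for (auto& q : queues) {
    w = q.TryPush(std::move(w));
    if (!w) return;  // some queue accepted it
  }
  w();  // all full: run directly
}
```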

Chris M. Thomasson

Mar 7, 2019, 5:57:17 PM
to Scalable Synchronization Algorithms
Damn. Oh well. Thanks for trying in the first place...

What about a non-real one? lol. ;^) 

Chris M. Thomasson

Mar 7, 2019, 5:59:51 PM
to Scalable Synchronization Algorithms
On Wednesday, March 6, 2019 at 9:56:54 PM UTC-8, Dmitry Vyukov wrote:
> On Wed, Mar 6, 2019 at 7:11 AM Chris M. Thomasson <cri...@charter.net> wrote:
>>>
>>> Hi,
>>>
>>> TensorFlow CPU task scheduler I wrote some time ago:
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>>>
>>> This and some other fixes improved model training time on CPU up to 3x.
>>> There is some interesting lock-free stuff in there.
>>> [...]
>>
>>
>> The code is a work of art: Well done Dmitry!

> Thanks!

No problem. Well, when I see awesome code, I know it, in a sense. :^)
 

> Love the:
>
>  Work PushFront(Work w) {
>     unsigned front = front_.load(std::memory_order_relaxed);
>     Elem* e = &array_[front & kMask];
>     uint8_t s = e->state.load(std::memory_order_relaxed);
>     if (s != kEmpty ||
>         !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
>       return w;
>     front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed);
>     e->w = std::move(w);
>     e->state.store(kReady, std::memory_order_release);
>     return Work();
>
> }
>
> part. It reminds me of the same basic logic in your mpmc queue:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
> Ahh, I still like my little tweak of your most excellent algorithm:
>
> https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion
>
> ;^)

> IIRC, once a thread has executed the XADD it has no way back, which implies
> blocking/waiting on an empty/full queue.
> Here I needed "try" versions, because this is not a producer-consumer
> scenario and we don't need back-pressure in a cooperative task execution
> scenario. If we fail to pop from one queue, we go and try to pop
> from another. If we fail to push onto a queue, we can push onto
> another or execute the task directly.

Hard to disagree with this point. Humm... Thanks again. 

Chris M. Thomasson

Mar 7, 2019, 6:40:38 PM
to Scalable Synchronization Algorithms
Perhaps there can be a mix of try and/or wait commit access, kind of like my XADD tweak to your excellent CAS-based queue.