Hi,
TensorFlow CPU task scheduler I wrote some time ago:
https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
This and some other fixes improved model training time on CPU by up to 3x.
There is some interesting lock-free stuff in there.
The main design goal is to tailor each piece to the specific problem at
hand and to be practical: lock-free where it matters, mutex-based
otherwise, at a sane level of complexity (it still needs to be correct).
It does not trade practical properties that matter (an array-based queue)
for formal properties (lock-freedom).
EventCount (aka a "condvar for lock-free algorithms") has fast common
paths and minimizes contention as much as possible. Because the work is
non-uniform and bursty, blocking and unblocking threads all the time
actually turned out to be one of the most critical parts of scheduling
in TF.
RunQueue (the work-stealing deque) allows all operations from both sides,
because in TF work can be submitted from external threads. Owner
operations are lock-free; remote operations use a mutex. This part was
somewhat unique compared to similar systems. Getting Size right is tricky!
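The shape of the queue can be sketched as follows. This is a simplified illustration, not the real code: I drop the packed modification counter the real queue folds into front_/back_ to make Size work (which is exactly the tricky part), and the remote side here is serialized by a plain mutex.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <optional>

// Simplified array-based work-stealing deque sketch (illustrative only).
// Owner thread operates on the front without locks, using a per-element
// state machine (kEmpty -> kBusy -> kReady); remote threads use the back
// under a mutex.
template <typename Work, unsigned kSize = 64>
class RunQueue {
  static_assert((kSize & (kSize - 1)) == 0, "kSize must be a power of two");
  static constexpr unsigned kMask = kSize - 1;
  enum : uint8_t { kEmpty, kBusy, kReady };
  struct Elem {
    std::atomic<uint8_t> state{kEmpty};
    Work w;
  };

 public:
  // Owner only. Returns the work back to the caller if the queue is full.
  std::optional<Work> PushFront(Work w) {
    unsigned front = front_.load(std::memory_order_relaxed);
    Elem* e = &array_[front & kMask];
    uint8_t s = e->state.load(std::memory_order_relaxed);
    if (s != kEmpty ||
        !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
      return w;  // full
    front_.store(front + 1, std::memory_order_relaxed);
    e->w = std::move(w);
    e->state.store(kReady, std::memory_order_release);
    return std::nullopt;
  }

  // Owner only. LIFO pop of the most recently pushed work (good locality).
  std::optional<Work> PopFront() {
    unsigned front = front_.load(std::memory_order_relaxed);
    Elem* e = &array_[(front - 1) & kMask];
    uint8_t s = e->state.load(std::memory_order_relaxed);
    if (s != kReady ||
        !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
      return std::nullopt;  // empty, or lost the element to a stealer
    Work w = std::move(e->w);
    e->state.store(kEmpty, std::memory_order_release);
    front_.store(front - 1, std::memory_order_relaxed);
    return w;
  }

  // Any thread: submit work at the back, under the mutex.
  std::optional<Work> PushBack(Work w) {
    std::lock_guard<std::mutex> lock(mu_);
    unsigned back = back_.load(std::memory_order_relaxed);
    Elem* e = &array_[(back - 1) & kMask];
    uint8_t s = e->state.load(std::memory_order_relaxed);
    if (s != kEmpty ||
        !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
      return w;  // full
    back_.store(back - 1, std::memory_order_relaxed);
    e->w = std::move(w);
    e->state.store(kReady, std::memory_order_release);
    return std::nullopt;
  }

  // Any thread: steal the oldest work (FIFO from the back), under the mutex.
  std::optional<Work> PopBack() {
    std::lock_guard<std::mutex> lock(mu_);
    unsigned back = back_.load(std::memory_order_relaxed);
    Elem* e = &array_[back & kMask];
    uint8_t s = e->state.load(std::memory_order_relaxed);
    if (s != kReady ||
        !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
      return std::nullopt;  // empty
    Work w = std::move(e->w);
    e->state.store(kEmpty, std::memory_order_release);
    back_.store(back + 1, std::memory_order_relaxed);
    return w;
  }

 private:
  std::atomic<unsigned> front_{0};
  std::atomic<unsigned> back_{0};
  std::mutex mu_;
  Elem array_[kSize];
};
```

Note that the element state CAS, not the front_/back_ indices, is what arbitrates between the owner and stealers; the indices themselves are only hints.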
ThreadPool (the scheduler itself) is a fairly standard design based on
distributed run queues, though you can find some interesting tricks there
regarding stealing order. There are also steal partitions (not done by
me). The spinning logic required quite some tuning to balance latency
against wasted CPU.
On Wed, Mar 6, 2019 at 7:11 AM Chris M. Thomasson <cri...@charter.net> wrote:
>>>
>>> Hi,
>>>
>>> TensorFlow CPU task scheduler I wrote some time ago:
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h?at=default&fileviewer=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/RunQueue.h?at=default&fileviewer=file-view-default
>>>
>>> https://bitbucket.org/eigen/eigen/src/9fdbc9e653c1adc64db1078fd2ec899c1e33ba0c/unsupported/Eigen/CXX11/src/ThreadPool/EventCount.h?at=default&fileviewer=file-view-default
>>>
>>> This and some other fixes improved model training time on CPU up to 3x.
>>> There is some interesting lock-free stuff in there.
>>> [...]
>>
>>
>> The code is a work of art: Well done Dmitry!
Thanks!
> Love the:
>
> Work PushFront(Work w) {
>   unsigned front = front_.load(std::memory_order_relaxed);
>   Elem* e = &array_[front & kMask];
>   uint8_t s = e->state.load(std::memory_order_relaxed);
>   if (s != kEmpty ||
>       !e->state.compare_exchange_strong(s, kBusy, std::memory_order_acquire))
>     return w;
>   front_.store(front + 1 + (kSize << 1), std::memory_order_relaxed);
>   e->w = std::move(w);
>   e->state.store(kReady, std::memory_order_release);
>   return Work();
> }
>
> part. It reminds me of the same basic logic in your mpmc queue:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue
>
> Ahh, I still like my little tweak of your most excellent algorithm:
>
> https://groups.google.com/d/topic/lock-free/acjQ3-89abE/discussion
>
> ;^)
IIRC, once a thread has executed the XADD it has no way back, which
implies blocking/waiting on an empty/full queue.
Here I needed "try" versions, because this is not a producer-consumer
scenario and we don't need back-pressure in a cooperative task-execution
scenario. If we fail to pop from one queue, we go and try to pop from
another. If we fail to push onto a queue, we can push onto another or
execute the task directly.
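The contrast can be sketched like this. This is my illustrative rendering of the "try" flavor in the spirit of the bounded MPMC queue linked above; the XADD variant is only described in the comment, and names are mine.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// "Try"-style bounded MPMC queue sketch. The CAS on enqueue_pos_ lets a
// producer observe a full queue *before* claiming a ticket, so TryEnqueue
// can fail and the caller can fall back to another queue or run the task
// inline. An XADD-based variant would instead do
//   size_t pos = enqueue_pos_.fetch_add(1);
// up front; after that the thread owns ticket `pos` and has no way back:
// it must wait for the cell to free up on a full queue.
template <typename T>
class BoundedMpmcQueue {
 public:
  explicit BoundedMpmcQueue(size_t size)  // size must be a power of two
      : cells_(size), mask_(size - 1) {
    for (size_t i = 0; i != size; ++i)
      cells_[i].seq.store(i, std::memory_order_relaxed);
  }

  bool TryEnqueue(T v) {
    size_t pos = enqueue_pos_.load(std::memory_order_relaxed);
    for (;;) {
      Cell* c = &cells_[pos & mask_];
      size_t seq = c->seq.load(std::memory_order_acquire);
      intptr_t dif = (intptr_t)seq - (intptr_t)pos;
      if (dif == 0) {
        // Slot is free; the CAS is the commit point we can still back out of.
        if (enqueue_pos_.compare_exchange_weak(pos, pos + 1,
                                               std::memory_order_relaxed)) {
          c->value = std::move(v);
          c->seq.store(pos + 1, std::memory_order_release);
          return true;
        }
      } else if (dif < 0) {
        return false;  // full: we never committed to a ticket
      } else {
        pos = enqueue_pos_.load(std::memory_order_relaxed);
      }
    }
  }

  bool TryDequeue(T& v) {
    size_t pos = dequeue_pos_.load(std::memory_order_relaxed);
    for (;;) {
      Cell* c = &cells_[pos & mask_];
      size_t seq = c->seq.load(std::memory_order_acquire);
      intptr_t dif = (intptr_t)seq - (intptr_t)(pos + 1);
      if (dif == 0) {
        if (dequeue_pos_.compare_exchange_weak(pos, pos + 1,
                                               std::memory_order_relaxed)) {
          v = std::move(c->value);
          c->seq.store(pos + mask_ + 1, std::memory_order_release);
          return true;
        }
      } else if (dif < 0) {
        return false;  // empty
      } else {
        pos = dequeue_pos_.load(std::memory_order_relaxed);
      }
    }
  }

 private:
  struct Cell {
    std::atomic<size_t> seq;
    T value;
  };
  std::vector<Cell> cells_;
  const size_t mask_;
  std::atomic<size_t> enqueue_pos_{0};
  std::atomic<size_t> dequeue_pos_{0};
};
```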