Anton Ertl wrote:
>>No, that's not a resource lock, that's a wakeup signal.
>
> Only one thread is active, so there can be no contention. You are
> thinking about something that is not contention.
"Contention" here is when someone has to wait. The producer needs space in
the queue - if the queue is full, it's a contention for the producer
("congestion", write access is blocked). The consumer needs data in the
queue, if the queue is empty, it's a contention for the consumer, read
access is blocked.
It's not the same sort of contention you have with equal processes, it's
"pipeline contention".
> If the consumers are faster than the producers, run the producers on
> more cores than the consumers. If you have only one producer, and
> that is sequential, and the consumers combined are still faster than
> the producer, you have a parallelism of <2.
I don't quite see the problem with having a parallelism of, say, 1.8 and
wanting to be able to use it. When the wake/sleep operation costs me as much
as processing one block, I can't use it. The overall number of available
cores today is a small integer, so I can't have 10 producers and 9 consumers
running in parallel.
> You can still reduce the
> wakeup overhead by only sending wakeup signals when the producer has
> produced enough (and stored it in a buffer) that the wakeup overhead
> is small compared to the processing time. Yes, this increases
> latency, so it's a balance of efficiency vs. latency; you can bound
> the increase in latency by waking the consumer some fixed time after
> the producer has started producing.
net2o's timing-based flow control breaks with that approach. You have to
process each packet as soon as possible, and finally, when a packet has
arrived unencrypted at its destination, you take the time of arrival.
There are cases where you can do that, though.
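Where batching is acceptable, it can look roughly like this (a sketch, not
net2o code; BATCH and the item handling are placeholders): the consumer is
only woken once per BATCH items, so the wakeup cost is amortized over BATCH
blocks, at the price of up to BATCH items of extra latency.

/* Batched wakeups: signal the consumer only every BATCH items (or at
   the end).  A real version would also signal after a bounded time,
   as Anton describes, to cap the added latency. */
#include <pthread.h>
#include <stdio.h>

#define NITEMS 1000
#define BATCH  64

static int produced;                     /* items available, protected by m */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    for (int i = 1; i <= NITEMS; i++) {
        /* ... produce item i into a buffer here ... */
        pthread_mutex_lock(&m);
        produced = i;
        if (i % BATCH == 0 || i == NITEMS)   /* amortized wakeup */
            pthread_cond_signal(&more);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg)
{
    int consumed = 0;
    while (consumed < NITEMS) {
        pthread_mutex_lock(&m);
        while (produced == consumed)         /* nothing new: sleep */
            pthread_cond_wait(&more, &m);
        int avail = produced;
        pthread_mutex_unlock(&m);
        /* ... process items consumed+1 .. avail here ... */
        consumed = avail;
    }
    printf("consumed %d items\n", consumed);
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}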
>>And yes, I've a model in mind, where more compute cores are sleeping than
>>working; with a x64-style implementation, you would use SMT for that, not
>>multiple cores.
>
> If there are not enough active threads for all the cores, the only
> reason to use SMT is to conserve energy or because communication is
> cheaper (the shared cache is L1 instead of L3). But if you are
> interested in performance, I expect that running two threads on two
> otherwise idle cores usually gives better performance than running
> them in the same core, even if they communicate. Sure, there is the
> PAUSE instruction for slowing a waiting thread down, but even with
> that, I expect that situations where SMT performs better than two
> cores are not very common.
My situation is different: I have running and sleeping processes, and I want
to wake/sleep them quickly. Using different cores and having them all
spin-loop consumes energy and wastes resources, because the threads are just
waiting. So I put them all as "semi-active" on an SMT core, where the
sleeping threads only consume registers in the register file, but can start
running within a nanosecond.
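The closest you get to that in user space on today's x86 is a PAUSE-based
spin wait; roughly this sketch (pinning both threads to sibling hyperthreads
is left out; _mm_pause() is the PAUSE intrinsic):

/* "Semi-active" waiter on an SMT sibling: its context stays in the
   register file, PAUSE hands the core's execution resources to the
   other hyperthread, and waking it up is a single store - no syscall,
   no scheduler, latency of a few cycles. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <immintrin.h>                 /* _mm_pause(), x86 only */

static atomic_int go;

static void *waiter(void *arg)
{
    while (!atomic_load_explicit(&go, memory_order_acquire))
        _mm_pause();                   /* de-prioritize this hyperthread */
    puts("woken up");
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, waiter, NULL);
    atomic_store_explicit(&go, 1, memory_order_release);  /* the wakeup */
    pthread_join(t, NULL);
    return 0;
}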
SMT with more than one active thread per core is also faster than one thread
having all resources for itself when the usual pipeline bubbles occur:
mispredicted branches and waits for cache accesses (L2 and beyond).
For real-time tasks, just being able to run at all, even at a slower pace,
is better than having to wait for the next time slice.
>>And buffering reduces performance, as it requires way more actual
>>parallelism.
>
> Buffering also enables parallelism. E.g., with a conventional screen
> and with vertical synchronization to avoid tearing, double buffering
> means that rendering has to wait for the vsync, while with triple
> buffering there can be rendering all the time.
But the right solution is to move away from the fixed refresh and display
the buffers when they are ready. Then the rendering only has to wait when
the render time is so short that it exceeds the display's maximum rate (e.g.
140fps), and not when it's just faster than the minimum acceptable rate
(e.g. 30fps).
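A back-of-the-envelope comparison (the numbers are only examples): with a
fixed 60Hz refresh and double buffering every frame gets rounded up to the
next vsync, while with present-when-ready the rendering only waits when it
is faster than the panel's maximum rate.

/* Effective frame interval: present-when-ready vs. fixed refresh with
   double buffering (wait for the next vsync after rendering). */
#include <stdio.h>

int main(void)
{
    const double max_hz = 140.0, fixed_hz = 60.0;
    const double render_ms[] = { 5.0, 12.0, 20.0, 30.0 };

    for (int i = 0; i < 4; i++) {
        double r = render_ms[i];
        double min_ms = 1000.0 / max_hz;
        double adaptive = r > min_ms ? r : min_ms;
        double period = 1000.0 / fixed_hz;
        double fixed = period * ((int)(r / period) + 1);
        printf("render %5.1f ms -> adaptive %5.1f ms (%3.0f fps), "
               "fixed %5.1f ms (%3.0f fps)\n",
               r, adaptive, 1000.0 / adaptive, fixed, 1000.0 / fixed);
    }
    return 0;
}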
> Anyway, yes, you were talking of having lots of little tasks, which
> sounds like having lots of parallelism to me. Then you switch to a
> hardly-parallel problem of one sequential producer and consumers that
> need little CPU. Of course these different kinds of problems need
> different solutions.
The little tasks are usually connected in such a way. There may be way more
than just two tasks overall, but the typical relation between these tasks is
often producer/consumer, and they aren't well balanced.
>>> So don't use the OS in the normal case.
>>
>>Anton, that's the entire point of suggesting that the wake/sleep IPC
>>should be done in hardware.
>
> Well, Intel tried that kind of thing in the iAPX 432 and in the 80286
> protected mode and the hardware operations (task gates) were slow. A
> context switch is slow whether it is done in hardware or in software.
That's why I say "use the SMT capabilities". That's how to do it cheaply.
BTW: Just moving the task switch from software into microcode doesn't make
it "hardware". Neither the iAPX 432 nor 286 protected mode had any hardware
capabilities for task switching; they both just had complicated microcode.
AFAIK, a Forth PAUSE implemented as pusha/mov sp,[up+next]/popa was about
two orders of magnitude faster than a task gate call.
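For illustration, the same structure in portable C with ucontext (just a
sketch; swapcontext() is heavier than pusha/mov sp,[up+next]/popa because it
also saves the signal mask, but the principle is the same: PAUSE = save my
state, load the next task's state, no kernel scheduler involved):

/* Cooperative round-robin "PAUSE" in user space. */
#include <stdio.h>
#include <ucontext.h>

#define NTASKS  2
#define STACKSZ 65536

static ucontext_t task[NTASKS], main_ctx;
static char stack[NTASKS][STACKSZ];
static int current;

static void pause_(void)               /* yield to the next task */
{
    int prev = current;
    current = (current + 1) % NTASKS;
    swapcontext(&task[prev], &task[current]);
}

static void worker(int id)
{
    for (int i = 0; i < 3; i++) {
        printf("task %d, step %d\n", id, i);
        pause_();
    }
    swapcontext(&task[id], &main_ctx); /* done: back to main */
}

int main(void)
{
    for (int i = 0; i < NTASKS; i++) {
        getcontext(&task[i]);
        task[i].uc_stack.ss_sp = stack[i];
        task[i].uc_stack.ss_size = STACKSZ;
        task[i].uc_link = &main_ctx;
        makecontext(&task[i], (void (*)(void))worker, 1, i);
    }
    current = 0;
    swapcontext(&main_ctx, &task[0]);
    return 0;
}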
> You may be dreaming of hardware that has, say, dozens of sets of
> process/thread states per core, most of which are sleeping most of the
> time, and that can put themselves to sleep cheaply, and be woken up
> cheaply, but you have to make a really good case for that to get it
> included in hardware. Is your model of lots of small tasks with
> little parallelism overall really that relevant?
That's your model, not mine. Let's repeat mine:
* The tasks are small, so expensive wake/sleep operations don't work, and
moving the tasks from one core to another is also not a good idea. Hot
tasks, whose code is already in the L1 instruction cache and whose data is
in the L1 data cache, can easily be several times faster than "cold" tasks.
There's not enough work for one specific task to run all the time, because
the tasks are diverse.
* The tasks are related to each other in a consumer-producer relation,
sometimes with DAG-like structures, i.e. a task may combine the output of
two producers, or feed two consumers.
* There are enough of those tasks to keep all cores in current CPUs active
(and maybe many more), but when used with the current high-overhead IPC,
there's no benefit.
Take Marcel's ngspice as an example: You really can compute the "next
voltage/current" output for all active nodes at the same time (and that's
often only a part of the entire circuit; not all devices have fast input
voltage or current swings). The parallelism is there. Nodes depending on
fast-changing voltages need to recalculate their output more often when
triggered, but those nodes which stay mostly the same don't need to
calculate often (then the linear solver is sufficient). The number of
different device models can easily be 20 or more, with device-parameter-
dependent paths (CMOS, NMOS, different gate thicknesses, Zener, Schottky
and "normal" diodes, NPN and PNP bipolars, several types of capacitors,
including parasitics, and resistors, including parasitics).
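The triggering scheme I have in mind looks roughly like this (a toy sketch,
not ngspice code; the "device model" is just an average of its inputs): a
node is only re-evaluated when one of its inputs has moved by more than a
tolerance, so the quiescent parts of the circuit cost nothing.

/* Event-driven re-evaluation: only triggered nodes recompute. */
#include <stdio.h>
#include <math.h>

#define MAXNODES 8
#define MAXFAN   4
#define TOL      1e-6

struct node {
    double out;                        /* last computed value */
    int nin, in[MAXFAN];               /* input node indices */
    int nout, fan[MAXFAN];             /* dependent node indices */
    double (*eval)(const struct node *, const double *);
};

static struct node nodes[MAXNODES];

/* stand-in for a real device model evaluation */
static double avg_model(const struct node *n, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < n->nin; i++) s += v[n->in[i]];
    return n->nin ? s / n->nin : n->out;
}

static void propagate(int changed, double *v)
{
    int work[64], nwork = 0;
    for (int k = 0; k < nodes[changed].nout; k++)
        work[nwork++] = nodes[changed].fan[k];
    while (nwork > 0) {
        int i = work[--nwork];
        double nv = nodes[i].eval(&nodes[i], v);
        if (fabs(nv - v[i]) < TOL)     /* quiescent: nothing downstream */
            continue;
        v[i] = nodes[i].out = nv;
        for (int k = 0; k < nodes[i].nout; k++)
            work[nwork++] = nodes[i].fan[k];
    }
}

int main(void)
{
    double v[MAXNODES] = { 0 };

    /* tiny DAG: node 0 drives nodes 1 and 2, which drive node 3 */
    nodes[0] = (struct node){ .nout = 2, .fan = {1, 2}, .eval = avg_model };
    nodes[1] = (struct node){ .nin = 1, .in = {0}, .nout = 1, .fan = {3},
                              .eval = avg_model };
    nodes[2] = (struct node){ .nin = 1, .in = {0}, .nout = 1, .fan = {3},
                              .eval = avg_model };
    nodes[3] = (struct node){ .nin = 2, .in = {1, 2}, .eval = avg_model };

    v[0] = nodes[0].out = 1.0;         /* input step on the source node */
    propagate(0, v);
    for (int i = 0; i < 4; i++)
        printf("node %d = %g\n", i, v[i]);
    return 0;
}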
AMD's GCN GPUs have to some extent what I'm asking for: you can run several
asynchronous tasks and feed data from one to the next, and context switching
is cheap.
http://www.anandtech.com/show/9124/amd-dives-deep-on-asynchronous-shading
It does considerably improve performance. This is a rather new approach for
GPU-style parallelism; it's not just 1000s of cores, each doing the same
thing on different data. The number of different contexts active at the
same time might be a bit on the low side now, but that stuff is just new.