Way back in the mid 90s, P6 simplified the problem by having only fully
pipelined functional units, and completely non-pipelined (DIVide).
If all units are fully pipelined, you never have to schedule their
availability.
So all you need is
(1) a marker, that marks ready instructions (non-priority CAM)
and (2) a picker for each fully pipeline functional unit. Basically a
find-first circuit, for some permutation of the candidates.
Yes, the critical loop picker->marker to determine back to back ready
instructions can be a pain. Various logic tricks are possible, e.g.
using decoded rather than encoding CAMs, and other decoded bitmask
tricks to move the find-first out of the loop. Trading more gates for
logic depth.
Or see the Wmt era patents on bit matrix schedulers - the P6 scheduler
was not strictly oldest first, but that introduced glass jaws.
Subrao Palacharla and Jim Smith started a line of research on
"complexity effective" microarchitecture, a large thread of which
attacked the scheduling problem. E.g. instead of having a homogenous
dataflow scheduler, they (and others at the same time and before, olike
me) proposed schedulers that combined queues with dataflow. E.g. at
onwe point in time Wmt was supposed to be purely a timing wheel
scheduler, but ended up being queues from a larger window fronted by a
16-entry bit matrix.
Much of this should be in patents.
...
As for mixing fully pipelined with fully non-pipelined: fixed latency is
easy. Variable latency, the divider had to arbitrate into the scheduler
loop several cycles (3?) before it needed a writeback port.
Basically, fully pipelined / independent pipelines / latency known up
front is easy. But variable latency is harder to deal with, and is
usually penalized. Which saddens me: one of the things that dataflow
should do is tolerate variable latency, but the pipelining costs means
that the only variable latencies we can tolerate are comparatively long,
like cache misses and divides.
...
P6-era, i.e. simplified the problem.
E.g. one of the biggest reasons that P6 only had a single load port was
that, while 2 load ports would be easy if truly dual ported, if you have
pseudo-dual-porting by banking then you have a dynamic scheduling
problem. Whereas if you pseudo-dual-port 1 load port and 1 store port,
the stores can often be delayed when there is a bank conflict without
the hassles of repairing incorrectly scheduled dependent operations.
We also see 2 load ports + 1 store port, on systems that have true dual
porting (between the loads, => no bank conflicts) and pseudo-porting for
the store port.
...
More recent machines *are* more complicated. They truly do have dynamic
scheduling, e.g. when they have an incomplete bypass network. So on
such systems there may be a "packer" in addition to the marking and
picking. However, if the packing is a simple greedy algorithm, it can
be done with bitmasks. Truly optimal picking and packing is probably too
hard (and IMHO is better left to compiler guys like Ivan, although
recording such information in a trace cache is tempting).
--
The content of this message is my personal opinion only. Although I am
an employee (currently Imagination Technologies, which acquired MIPS; in
the past of companies such as Intellectual Ventures/QIPS, Intel, AMD,
Motorola, and Gould), I reveal this only so that the reader may account
for any possible bias I may have towards my employers' products. The
statements I make here in no way represent my employers' positions on
the issue, nor am I authorized to speak on behalf of my employers, past
or present.