The basic problem is that, for each of the four 32-bit registers representing the
current AES state, we have to extract four bytes, send them to
8-bit-to-32-bit S-boxes, then MOV or XOR the S-box outputs into four
different registers representing the next round's state. It seems that once
we're done with the first register of the current state, we still need 3
registers for the rest of the current state, 4 for the next state, and 1 more
as a scratch register for byte extraction, so 8 registers appear necessary.
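To make the data flow concrete, here is a minimal C sketch of one table-based round written in the "scatter" order described above: each input word is split into four bytes, and each byte's 8-bit-to-32-bit table output is XORed into a different output word. This is not the assembly under discussion. The names (T, init_tables, round_scatter) are made up for illustration, and the big-endian packing of the state words is an assumption; an x86 implementation may well pack the bytes the other way.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t T[4][256];   /* T[j]: 8-bit-to-32-bit table for byte position j */
static int tables_ready = 0;

/* Multiply by x in GF(2^8) modulo the AES polynomial. */
static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t v, int n) {
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

static uint32_t rotr32(uint32_t v, int n) {
    return (v >> n) | (v << ((32 - n) & 31));
}

/* Build the AES S-box algorithmically, then fold MixColumns into it. */
static void init_tables(void) {
    for (int x = 0; x < 256; x++) {
        uint8_t inv = 0;   /* multiplicative inverse, with 0 mapped to 0 */
        for (int y = 1; y < 256; y++) {
            if (gf_mul((uint8_t)x, (uint8_t)y) == 1) { inv = (uint8_t)y; break; }
        }
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        uint32_t t0 = ((uint32_t)xtime(s) << 24) | ((uint32_t)s << 16) |
                      ((uint32_t)s << 8) | (uint32_t)(xtime(s) ^ s);
        for (int j = 0; j < 4; j++) T[j][x] = rotr32(t0, 8 * j);
    }
    tables_ready = 1;
}

/* One round: s[] is the current state, t[] the next, rk[] the round key. */
static void round_scatter(const uint32_t s[4], const uint32_t rk[4], uint32_t t[4]) {
    assert(tables_ready);
    for (int k = 0; k < 4; k++) t[k] = rk[k];
    for (int i = 0; i < 4; i++) {          /* one current-state register at a time */
        uint32_t w = s[i];
        for (int j = 0; j < 4; j++) {      /* extract byte j, most significant first */
            uint8_t b = (uint8_t)(w >> (24 - 8 * j));
            t[(i - j + 4) % 4] ^= T[j][b]; /* lands in a different output register */
        }
    }
}
```

While the inner loop is working on s[i], all four t[] words are live along with the remaining s[] words and the extracted byte, which is where the register pressure comes from.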
Gladman's trick is to process parts of two registers representing the
current state, then combine the remaining parts into one register with a
rotate, mask and OR. This saves a register because the outputs of the first
4 S-box lookups are now stored into only 3 output registers (with one XOR
being done) instead of 4 registers. My improvement is to combine the parts
with a single 8-bit register move, like "mov al, cl", instead of 3
operations, thus saving 2 instructions per round.
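The difference between the two ways of combining can be sketched in C. This only models the effect of the instructions; it is not the actual sequence from either implementation, and the register layouts are assumptions made for illustration. The first function assumes the leftover byte sits in the top byte of c and that the low byte of a is already zero; the second assumes the leftover byte is already in the low byte of c, which is what makes a plain byte move sufficient.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t v, int n) {
    return (v << n) | (v >> ((32 - n) & 31));
}

/* Three-operation merge: rotate, mask, OR.
   Assumed layout: wanted byte in bits 24..31 of c, low byte of a is zero. */
static uint32_t merge_rotate_mask_or(uint32_t a, uint32_t c) {
    assert((a & 0xFFu) == 0);
    c = rotl32(c, 8);   /* rotate: bring the wanted byte down to bits 0..7 */
    c &= 0xFFu;         /* mask: drop everything else */
    return a | c;       /* OR: merge into a */
}

/* Models "mov al, cl": replace the low byte of a with the low byte of c.
   C needs several operators to say this; on x86 it is one instruction. */
static uint32_t merge_byte_move(uint32_t a, uint32_t c) {
    return (a & 0xFFFFFF00u) | (c & 0xFFu);
}
```

For example, merge_rotate_mask_or(0x11223300, 0x44abcdef) and merge_byte_move(0x112233ff, 0xdeadbe44) both produce 0x11223344. The byte move does not care what was previously in the low byte of a.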
(If anyone looks at my code, it actually uses an MMX register as well,
because for -fPIC compatibility, one general-purpose register has to be used
to point to the S-boxes.)
Another register-saving trick I used is to copy the round keys to the stack, and
then use the ESP register as a loop counter. This avoids having to fully
unroll the loops, without incurring additional memory accesses or costing
another register for the loop counter.
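Here is a rough C model of how one pointer can serve as both the round-key pointer and the loop counter. How the end of the loop is detected is not stated above, so the alignment test used here is purely an assumption: the copy is placed so that it ends on a 256-byte boundary, and the loop stops when the low bits of the pointer become zero. The function name and the XOR "round body" are placeholders, and the iterations variable exists only so the sketch can be checked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ALIGN = 256, ROUND_KEY_BYTES = 16, MAX_KEYS = 15 };

/* Copies nkeys round keys into a local buffer that ends on an ALIGN boundary,
   then walks them using only the pointer. Returns the number of iterations. */
static int walk_round_keys(const uint32_t *round_keys, int nkeys, uint32_t acc[4]) {
    unsigned char buf[MAX_KEYS * ROUND_KEY_BYTES + ALIGN];
    size_t nbytes = (size_t)nkeys * ROUND_KEY_BYTES;
    assert(nkeys >= 1 && nkeys <= MAX_KEYS);

    /* Choose the start so that start + nbytes is a multiple of ALIGN. */
    uintptr_t lo = (uintptr_t)buf + nbytes;
    uintptr_t end = (lo + ALIGN - 1) & ~(uintptr_t)(ALIGN - 1);
    unsigned char *p = buf + (end - (uintptr_t)buf) - nbytes;
    memcpy(p, round_keys, nbytes);

    int iterations = 0;   /* for checking only; the loop itself does not need it */
    do {
        uint32_t rk[4];
        memcpy(rk, p, sizeof rk);                     /* fetch this round's key */
        for (int k = 0; k < 4; k++) acc[k] ^= rk[k];  /* stand-in for a real round */
        p += ROUND_KEY_BYTES;                         /* pointer doubles as counter */
    } while ((iterations++, ((uintptr_t)p & (ALIGN - 1)) != 0));
    return iterations;
}
```

Because all the round keys for any AES key size fit in 240 bytes, no address inside the copy is a multiple of 256 except its end, so the test cannot fire early.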