Hi,
Not feeling very helpful today, are we, Jim? ;-)
Robert, I've not looked at your code. In fact, I somewhat doubt I
could help. (I don't really understand FPU/SSE.) But since you claim
you have a pure Pascal version, I would indeed be curious to run it
(preferably in pure DOS). Then again, I think you said it runs in
less than a second, which is hardly a worthwhile benchmark.
Just to explain, I've recently tested some compilers (GPC, FPC, TP55,
VP21) and various high-level tricks to optimize their output. Nothing
fancy, just mild curiosity.
For GPC, the obvious trick is inlining: either mark routines with
attribute(inline) or let the compiler do it automatically with
-finline-functions or -O3.
FPC needs the "inline" function directive (and -Si). I've seen you
complain about FPC before, but it matches GPC (GCC 3.4.6 / 2005)
in output speed nowadays. Seriously, I would reconsider and try
FPC again. It's very good.
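
A minimal sketch of the FPC "inline" directive, in case it helps. The
routine name Square is made up for illustration, and {$inline on} is
used as the in-source equivalent of passing -Si on the command line.
Whether the compiler really inlines the call is still its decision,
so check the generated code before trusting any speedup:

```pascal
program InlineDemo;
{$inline on}   { same effect as -Si; without it FPC ignores "inline" }

{ Hypothetical hot routine, small enough to be worth inlining. }
function Square(x: LongInt): LongInt; inline;
begin
  Square := x * x;
end;

var
  i, sum: LongInt;
begin
  sum := 0;
  for i := 1 to 10 do
    sum := sum + Square(i);   { call site where inlining can happen }
  WriteLn(sum);               { prints 385 }
end.
```

For GPC the idea is the same, with attribute(inline) on the routine
in place of the "inline" directive.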
TP55 (and similar) are very old but still work fine. There are
various speedups available there, but of course there are better
compilers nowadays, too.
I think you said Virtual Pascal is slow and generates lousy code.
Not quite true. Sure, it doesn't go past 586, but it's not really
slow. It also has the (Delphi-ish) inline function directive
(same as FPC) but with much more limited functionality, so it's not
nearly as useful. Still, it can help a lot.
The other problem I noticed is that VP does indeed claim to
use (186+) ENTER/LEAVE for nested procs. The docs say that
is for 586, but AFAIK it applies to all targets (386, 486, 586).
The docs also say it was faster on an actual 586. But I've seen
this problem before. On my Core i5 (admittedly somewhat old,
Nehalem/Westmere), that kind of code, when heavily used, is actually
four times slower than the older 8086 equivalent. So try
"flattening" your source to avoid nested procedures (move
them to global scope) and re-benchmark it. It really helps!
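
To show what I mean by "flattening", here is a rough sketch. All
the names (SumNested, Step, SumFlat, StepFlat) are invented for the
example and have nothing to do with your code. The nested version
reaches into its parent's stack frame; the flat version gets the
same variable passed in as a var parameter instead:

```pascal
program FlattenDemo;

{ Nested form: Step lives inside SumNested and touches its parent's
  local variable, so the compiler has to set up access to the
  enclosing frame on every call. }
function SumNested(n: LongInt): LongInt;
var
  total, i: LongInt;

  procedure Step(k: LongInt);
  begin
    total := total + k;   { uses the enclosing routine's local }
  end;

begin
  total := 0;
  for i := 1 to n do
    Step(i);
  SumNested := total;
end;

{ Flattened form: the helper sits at global scope and the shared
  state is passed explicitly. }
procedure StepFlat(var total: LongInt; k: LongInt);
begin
  total := total + k;
end;

function SumFlat(n: LongInt): LongInt;
var
  total, i: LongInt;
begin
  total := 0;
  for i := 1 to n do
    StepFlat(total, i);
  SumFlat := total;
end;

begin
  { Both versions must agree; prints "5050 5050". }
  WriteLn(SumNested(100), ' ', SumFlat(100));
end.
```

Time each version separately over a few million calls with your own
compiler and switches. How big the difference is (if any) depends
on the compiler and the CPU.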