Based on personal experience, you should be able to get close to your goals using HardFloat as your starting point. There are some fairly obvious places to add staging flops to accomplish pipelining. You should also be able to minimize power/energy with aggressive clock gating.
The problem with a lot of the FP implementations I've looked at on GitHub is that they fail compliance in some of the hairy edges of IEEE754 (issues with rounding, subnormal handling, etc.). HardFloat (which is pretty much a Verilog or Scala implementation of SoftFloat) has never failed anything we've thrown at it.
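To make the "hairy edges" concrete, here is a minimal Python sketch of a reference model for binary32 multiply in round-to-nearest-even, which can be used to generate expected values for subnormal and tie cases. This is not how HardFloat or SoftFloat are tested; the function names are made up for illustration. It relies on two assumptions: the exact product of two binary32 values fits in a binary64, so only one rounding happens, and the host does IEEE-conformant double-to-float conversion. NaN payloads and exception flags are not modeled.

```python
import struct

F32_INF = 0x7F800000
F32_SIGN = 0x80000000

def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as binary32 and widen it exactly to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

def float_to_bits(x: float) -> int:
    """Round a Python float (binary64) to binary32 and return the bit pattern."""
    try:
        return struct.unpack("<I", struct.pack("<f", x))[0]
    except OverflowError:
        # struct refuses finite values that round to infinity; under
        # round-to-nearest-even the IEEE result is a signed infinity.
        return (F32_SIGN if x < 0 else 0) | F32_INF

def ref_mul_f32(a_bits: int, b_bits: int) -> int:
    """Hypothetical reference model: binary32 multiply, round-to-nearest-even."""
    # A 24-bit by 24-bit significand product needs at most 48 bits, so the
    # binary64 product is exact and float_to_bits performs the only rounding.
    return float_to_bits(bits_to_float(a_bits) * bits_to_float(b_bits))

# A few edge cases: (a, b, short description)
EDGE_CASES = [
    (0x00000001, 0x3F000000, "min subnormal * 0.5: tie, rounds to even (zero)"),
    (0x00000003, 0x3F000000, "3 ulp * 0.5: tie between 1 and 2 ulp, rounds to 2"),
    (0x00000001, 0x3F400000, "min subnormal * 0.75: rounds up to min subnormal"),
    (0x00800000, 0x3F000000, "min normal * 0.5: exact subnormal result"),
    (0x7F7FFFFF, 0x40000000, "max finite * 2.0: overflows to +inf"),
    (0x80000001, 0x3F000000, "-min subnormal * 0.5: rounds to negative zero"),
]

if __name__ == "__main__":
    for a, b, note in EDGE_CASES:
        print(f"{a:08X} * {b:08X} = {ref_mul_f32(a, b):08X}  ({note})")
```

Vectors like these can be fed to a DUT testbench and compared bit-for-bit. For full coverage, a dedicated generator such as Berkeley TestFloat is the better tool.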
Regarding alternate algorithms (e.g. high-radix Booth encoding for the multiplier arrays, or alternative div/sqrt methods), I too would be interested in any data regarding latency vs gate count vs power tradeoffs.
Al Martin
Hi Tommy, thanks for the pointers.
Nasir, if you end up doing a detailed study, would you be up for sharing your evaluation?
Note: we are using HardFloat, which comes with Rocket-Chip, but are interested in other options as we turn towards deeper optimization. If there is interest, we can share what we discover as well.
FYI, the main criteria we're looking at are:
1. IEEE754 compliance
2. ability to pipeline it deeply (targeting 3.5GHz on 6nm)
3. energy per operation -- measured on a canonical node -- for open source work, an option would be SkyWater PDK plus OpenLane tools -- they enable rough power estimation (note: energy varies by operation, but this is more of a back-of-the-envelope estimate, so we use 7% activation)
4. performance factors -- these are qualitative features, such as the type of divider (invert and multiply versus iterative, etc.) -- basically some notes about implementation choices that you believe affect performance.
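For criterion 3, the back-of-the-envelope arithmetic could look something like the Python sketch below. It assumes that "7% activation" means a switching activity factor of 0.07 in the usual dynamic-energy relation E = alpha * C * Vdd^2, and that the unit is fully pipelined at one operation per cycle. The capacitance, voltage, power, and clock values are placeholders rather than measurements, and the function names are made up for illustration.

```python
def dynamic_energy_per_cycle(activity: float, switched_cap_farads: float,
                             vdd_volts: float) -> float:
    """Dynamic switching energy per clock cycle: alpha * C * Vdd^2 (joules)."""
    return activity * switched_cap_farads * vdd_volts ** 2

def energy_per_op(avg_power_watts: float, clock_hz: float,
                  ops_per_cycle: float = 1.0) -> float:
    """Energy per operation from a tool-reported average power (joules)."""
    return avg_power_watts / (clock_hz * ops_per_cycle)

if __name__ == "__main__":
    # Placeholder inputs, NOT measured data.
    alpha = 0.07       # activity factor quoted in the post
    cap = 10e-12       # total switchable capacitance in farads (assumed)
    vdd = 1.8          # supply voltage in volts (assumed)
    e_cycle = dynamic_energy_per_cycle(alpha, cap, vdd)
    print(f"dynamic energy per cycle: {e_cycle * 1e12:.3f} pJ")

    # Alternative route: divide a reported average power by throughput.
    e_op = energy_per_op(avg_power_watts=3.5e-3, clock_hz=100e6)
    print(f"energy per op: {e_op * 1e12:.1f} pJ")
```

This ignores leakage and clock-tree power, and a single activity factor hides the per-operation variation mentioned in the list, so it only works as a first-order comparison between implementations on the same node.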
Thanks Nasir and Tommy,
Sean