mul/div is an insanely odd grouping

Peter Barfuss

Jan 4, 2016, 7:58:28 AM

to isa...@lists.riscv.org

So this makes no sense. One of those (integer multiply) is incredibly important, and a fast implementation simplifies so much code; the other (div/rem) should basically be software-only because it's both so complicated and so rarely used, let alone in the hot path.

Note that every single div-by-constant can always be implemented as a wide multiply, which is another reason for a suggestion: I strongly think mul should be core. If not, then at least break up the addition so it's more clear that you can implement mul without also having to implement div/rem.
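To illustrate the div-by-constant point, here is a minimal sketch (not from the original post) of unsigned 32-bit division by 10 using one widening multiply and a shift. The function names `div10` and `rem10` are made up for this example; the magic constant is ceil(2^35 / 10). On RV32 the 64-bit product would presumably map to a MULHU followed by a 3-bit right shift.

```c
#include <stdint.h>

/* Unsigned 32-bit divide by the constant 10 with a single widening multiply.
 * 0xCCCCCCCD == ceil(2^35 / 10), so (x * 0xCCCCCCCD) >> 35 == x / 10
 * for every 32-bit unsigned x. */
static uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* The remainder follows from the quotient with one more multiply. */
static uint32_t rem10(uint32_t x)
{
    return x - 10u * div10(x);
}
```

Other divisors need a different constant and shift (and some need an extra add-and-shift step), which is what compilers work out automatically.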

-Peter

Christopher Celio

Jan 4, 2016, 10:57:51 AM

to Peter Barfuss, isa...@lists.riscv.org

There's no reason you HAVE to implement div in HW just to get access to mul. You can lie to everyone and let an illegal instruction trap handle it in machine mode, and nobody is the wiser. Unlike the FD extensions, M/D doesn't affect the ABI, so this works out just fine.
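As a rough sketch of what such a machine-mode handler could do (an illustration, not code from this thread), the function below decodes a trapped RV32 instruction and emulates DIV/DIVU/REM/REMU on a saved register array. The name `emulate_divrem` and the `regs[32]` layout are assumptions; the encoding (opcode 0x33, funct7 0x01, funct3 4 to 7) and the divide-by-zero and overflow results follow the M extension's definitions.

```c
#include <stdint.h>

/* Emulate one RV32M divide/remainder instruction.
 * regs: saved integer registers x0..x31 of the trapped context
 *       (hypothetical layout chosen for this sketch).
 * Returns 1 if insn was DIV/DIVU/REM/REMU and was emulated, else 0. */
static int emulate_divrem(uint32_t insn, uint32_t regs[32])
{
    uint32_t opcode = insn & 0x7f;
    uint32_t rd     = (insn >> 7) & 0x1f;
    uint32_t funct3 = (insn >> 12) & 0x7;
    uint32_t rs1    = (insn >> 15) & 0x1f;
    uint32_t rs2    = (insn >> 20) & 0x1f;
    uint32_t funct7 = insn >> 25;

    /* M-extension ops share opcode OP with funct7 = 0000001;
     * funct3 0..3 are the multiplies, 4..7 the divides/remainders. */
    if (opcode != 0x33 || funct7 != 0x01 || funct3 < 4)
        return 0;

    uint32_t a = regs[rs1], b = regs[rs2], r;
    int overflow = (a == 0x80000000u && b == 0xffffffffu); /* INT_MIN / -1 */

    switch (funct3) {
    case 4: /* DIV: x/0 = -1, overflow = dividend */
        if (b == 0)        r = 0xffffffffu;
        else if (overflow) r = a;
        else               r = (uint32_t)((int32_t)a / (int32_t)b);
        break;
    case 5: /* DIVU: x/0 = all ones */
        r = b ? a / b : 0xffffffffu;
        break;
    case 6: /* REM: x%0 = dividend, overflow = 0 */
        if (b == 0)        r = a;
        else if (overflow) r = 0;
        else               r = (uint32_t)((int32_t)a % (int32_t)b);
        break;
    default: /* 7, REMU: x%0 = dividend */
        r = b ? a % b : a;
        break;
    }
    if (rd != 0)          /* x0 stays hard-wired to zero */
        regs[rd] = r;
    return 1;
}
```

A real handler would also save and restore the register file around this call and advance `mepc` by 4 before returning, which is where most of the trap overhead comes from.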

-Chris

kr...@berkeley.edu

Jan 4, 2016, 11:34:54 AM

to Peter Barfuss, isa...@lists.riscv.org

Several general-purpose ISA designs that left out divide were forced
by their software developers to add it later (e.g., Apple and ARM),
because it isn't rare enough to make software-only solutions palatable.

A minimal standalone 32-bit hardware divide/remainder unit is very
simple, fewer than 100 flops. Implementation tricks can further cut the
additional cost, depending on your existing microarchitecture.
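For readers unfamiliar with iterative dividers, the following is a software model (a sketch, not a description of any particular implementation) of a one-bit-per-step restoring divider. The names `divres_t` and `restoring_divu` are invented for this example. Each loop iteration stands in for one clock cycle; the state in this model is two 32-bit registers plus a small step counter.

```c
#include <stdint.h>

typedef struct {
    uint32_t quot;
    uint32_t rem;
} divres_t;

/* Model of a restoring divider that produces one quotient bit per step. */
static divres_t restoring_divu(uint32_t dividend, uint32_t divisor)
{
    uint64_t rem  = 0;         /* partial remainder (at most 33 bits used) */
    uint32_t quot = dividend;  /* dividend shifts out the top, quotient shifts in */

    for (int i = 0; i < 32; i++) {
        /* Shift the next dividend bit into the partial remainder. */
        rem = (rem << 1) | (quot >> 31);
        quot <<= 1;
        /* Trial subtract: keep the result only if it does not go negative. */
        if (rem >= divisor) {
            rem -= divisor;
            quot |= 1u;
        }
    }
    return (divres_t){ quot, (uint32_t)rem };
}
```

With a zero divisor the trial subtract always succeeds and subtracts nothing, so this model returns an all-ones quotient and the dividend as remainder, which happens to match what the M extension specifies for DIVU and REMU by zero.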

Even without hardware support, trap and emulate ensures you stay
compatible with the ABI.

The compiler will usually translate simple div-by-constant to other
sequences anyway, as divide is usually much costlier.

We didn't want to make multiply/divide part of base, as neither is
needed in many low-end applications, or if you're building an
accelerator that has multiply/divide superpowers.

Add to that the desire to avoid too many ABIs, and that's why we made
divide a standard part of the M extension.

There's nothing to prevent you defining your own ABI with hardware
multiply but software divide, but I'd recommend against that if you
want to run a substantial amount of software.

Krste

Joel Vandergriendt

Jan 4, 2016, 6:55:11 PM

to Christopher Celio, Peter Barfuss, isa...@lists.riscv.org

Would it be possible to create a compiler flag that inserts software routines for division but hardware instructions for multiplication? A function call is much cheaper than an illegal instruction exception. Or can someone point me in the right direction to fork riscv-gnu-toolchain?

Andrew Waterman

Jan 4, 2016, 10:22:10 PM

to Joel Vandergriendt, Christopher Celio, Peter Barfuss, isa...@lists.riscv.org

Should be straightforward (grep for MULDIV in the gcc port; split into MUL and DIV).

In some sense, this is an optimization flag--might make sense for -mtune=xxx to disable division, rather than e.g. -mno-div. I say this because it will show up in more places, like omitting MULH or FSQRT.

Probably can't justify another multilib for this in the standard distribution, but building the toolchain with a default value for -mtune could have this effect.

Joel Vandergriendt

Jan 7, 2016, 2:18:46 PM

to Andrew Waterman, Christopher Celio, Peter Barfuss, isa...@lists.riscv.org

I took a look at the code, and -mtune takes a processor name. Would it be bad practice to make it accept something like "-mtune=custom_fpadd_4_5_fpmul_4_5", or is that a terrible idea? I'm coming at this from the perspective of a parameterized FPGA CPU, and I would rather not have to add tune_info structs for each configuration.

Tommy Thorn

Jan 7, 2016, 4:21:14 PM

to Joel Vandergriendt, Andrew Waterman, Christopher Celio, Peter Barfuss, isa...@lists.riscv.org

Guy and I discussed some alternatives at the workshop:

1) Don't assume either, but profile div on start-up and, for the code that cares,
have both software and hardware divider versions (I posit that very little code
is bottlenecked by the perf of DIV).

2) Trap the DIV instruction, but accelerate the trapping support; trapping using the
vanilla RISC-V ISA is pretty slow (*that* is a different discussion), but nothing
precludes private extensions for this, say [a partial] shadow register set (credit
to Guy) and a private "divide-step" instruction. The extensions could be cheaper
than a full implementation.

3) Variant of 2). Implement the unsigned divide instruction and trap the signed
variant (optionally, trap only if either argument is negative).

None of these requires a compiler flag.
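Option 3 relies on signed division being expressible through unsigned division plus sign fix-ups. A minimal sketch of that reduction, assuming RV32 semantics, is below; `hw_divu` is a stand-in for the hardware DIVU instruction and `div_via_divu` is a made-up name for what the trap handler would compute.

```c
#include <stdint.h>

/* Stand-in for the hardware unsigned divide (DIVU returns all ones on x/0). */
static uint32_t hw_divu(uint32_t a, uint32_t b)
{
    return b ? a / b : 0xffffffffu;
}

/* Signed divide built from the unsigned one, rounding toward zero. */
static int32_t div_via_divu(int32_t a, int32_t b)
{
    if (b == 0)
        return -1;                      /* DIV by zero yields -1 */

    /* Magnitudes, computed in unsigned arithmetic so INT32_MIN is safe. */
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    uint32_t uq = hw_divu(ua, ub);

    /* Negate the quotient when the operand signs differ.
     * INT32_MIN / -1 falls out as INT32_MIN, the M-extension overflow
     * result (the final cast assumes two's-complement wraparound). */
    return (int32_t)(((a ^ b) < 0) ? 0u - uq : uq);
}
```

When both operands are non-negative the fix-ups are no-ops, which is why the "trap only if either argument is negative" variant can send the common case straight to the unsigned divider.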

Tommy

Joel Vandergriendt

Jan 8, 2016, 1:20:18 PM

to Tommy Thorn, Andrew Waterman, Christopher Celio, Peter Barfuss, isa...@lists.riscv.org

Option 1 seems bad because generating both software and hardware versions would be tedious and would consume potentially scarce instruction memory.

I have to think more about the other options.

kr...@berkeley.edu

Jan 8, 2016, 4:59:38 PM

to Joel Vandergriendt, Tommy Thorn, Andrew Waterman, Christopher Celio, Peter Barfuss, isa...@lists.riscv.org

I really, really doubt any of these is a better choice than simply
implementing a low-cost iterative divider in hardware. If you're
building a cheap iterative multiplier then the divider is almost free.

I realize FPGAs have hardware support for multiply and not division,
but these other solutions don't seem that much better in the FPGA
context.

Krste
