Tradeoff bitwidth operands for cycles in arithmetic operations


Leonardo Romor

unread,
Nov 26, 2020, 7:26:03 AM11/26/20
to xls-dev
Hi, 

I wrote the following DSLX code:

```
import std

fn main(x: sN[17], a: sN[23], b: sN[23], c: sN[23]) -> sN[74] {
    let x2: sN[34] = std::smul(x, x); // 1:1:32
    let x3: sN[51] = std::smul(x2, x); // 1:2:48

    // Before summing it into the final value, we need to shift it
    // left to match the fractional point. This is 50 - (16 + 2) = 32.
    let p1: sN[40] = std::smul(a, x);
    let p1: sN[74] = p1 as sN[74];
    let p1: sN[74] = p1 << sN[74]: 32;

    // The fractional part is positioned at index 2 + 32 = 34.
    // The shift is then 50 - (32 + 2) = 16.
    let p2: sN[57] = std::smul(b, x2);
    let p2: sN[74] = p2 as sN[74];
    let p2: sN[74] = p2 << sN[74]: 16;

    // This doesn't need any shift.
    let p3: sN[74] = std::smul(c, x3);
    p1 + p2 + p3
}
```
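The widths above follow the rule that the product of an sN[a] and an sN[b] always fits in sN[a+b], which is why std::smul of an sN[23] by an sN[17] yields an sN[40]. A quick check of that bound at the operand extremes, modeled in Python rather than DSLX (the helper name is mine, not part of XLS):

```python
# Sanity check (in Python, not DSLX): the product of an sN[a] value and
# an sN[b] value always fits in sN[a+b]. The smul result widths in the
# snippet above rely on exactly this bound.

def fits_signed(v: int, bits: int) -> bool:
    """True if v is representable as a two's-complement sN[bits]."""
    return -(1 << (bits - 1)) <= v < (1 << (bits - 1))

# Exercise the extremes of sN[17] (x) and sN[23] (a): every product
# must fit in the sN[40] that std::smul(a, x) returns.
for x in (-(1 << 16), (1 << 16) - 1):
    for a in (-(1 << 22), (1 << 22) - 1):
        assert fits_signed(x * a, 40)
```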

As you can see, most of the operations are full width, and the pipeline seems to be split according to the composition of the IR primitive functions.

Does XLS offer a way to optionally change the width of the multiplication operations?
For instance, instead of keeping a wide multiplication intact, could it be split into smaller multiplications and accumulations?
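For illustration, here is the arithmetic such a split relies on, sketched in Python rather than DSLX (the function name and default chunk size are mine, not an XLS feature): each operand is divided into an unsigned low chunk and a signed high chunk, turning one wide multiply into four narrower multiplies plus shifted adds.

```python
# Hypothetical sketch (not an XLS feature): splitting one wide signed
# multiply into narrower partial products plus shifted accumulates,
# the same decomposition a hand-written DSLX helper would mirror.

def split_smul(a: int, b: int, k: int = 8) -> int:
    """Compute a * b via four narrower multiplies.

    Each operand splits into an unsigned low chunk of k bits and a
    signed high chunk, using the identity
    a * b = (a_hi*2^k + a_lo) * (b_hi*2^k + b_lo).
    """
    mask = (1 << k) - 1
    a_lo, a_hi = a & mask, a >> k   # arithmetic shift keeps the sign in a_hi
    b_lo, b_hi = b & mask, b >> k
    return (((a_hi * b_hi) << (2 * k))
            + ((a_hi * b_lo + a_lo * b_hi) << k)
            + a_lo * b_lo)
```

In DSLX this would become four smaller std::smul/std::umul calls plus shifts and adds; whether it actually saves resources depends on how the target maps multipliers.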

Is it something that the delay model automatically takes care of?
My intuition says this shouldn't just be a matter of delay: I might want to reduce the width of some operations to spare BELs, independently of the maximum delay required to perform an operation.

Is chunking into smaller operations something that the user should do?

Best,

Leonardo

Robert Hundt

unread,
Nov 26, 2020, 1:53:40 PM11/26/20
to Leonardo Romor, xls-dev
Hi Leonardo,

Thanks for the feedback, comments inline...

On Thu, Nov 26, 2020 at 4:26 AM Leonardo Romor <leonard...@gmail.com> wrote:
As you can see, most of the operations are full width, and the pipeline seems to be split according to the composition of the IR primitive functions.

This is correct: pipelining happens at the granularity of IR nodes, before Verilog generation.
 

Does XLS offer a way to optionally change the width of the multiplication operations?
For instance, instead of keeping a wide multiplication intact, could it be split into smaller multiplications and accumulations?


Unfortunately, nope. I think that's an interesting idea. I wonder where the right place for such a transformation would be. 
  • The compiler can't guess the intent, so the designer would have to be very specific, e.g., provide corresponding annotations in code. But since that has to be done anyway, it strikes me that a better approach would be to provide "finer-grained" multiplications in the form of a library, and have the user pick the particular granularity they want.

  • The library itself would have to be generated somehow, across a variety of bit-widths. But then you run into the situation that, during design space exploration, the code would need to be changed, which is also not good.

  • Hence we're back to a combined library / compiler approach with a generated library that the compiler fully "understands" and a compiler pass that picks the "right" implementation strategy based on some criteria. There will be interesting interaction with the pipeliner, and probably some form of iterative compilation. Not trivial. It might be most practical to just drive the process from outside the compiler.
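To make the "drive it from outside the compiler" option concrete, here is a toy sweep in Python. The cost model (counting chunk-by-chunk partial products of a schoolbook decomposition) and every name are invented for this sketch; none of it is an XLS feature.

```python
# Illustrative only: a tiny "outside driver" that sweeps chunk widths
# for an n x m multiply and uses the number of chunk x chunk partial
# products as a crude area proxy. A real driver would feed each choice
# back through XLS and read actual synthesis results.
import math

def schoolbook_cost(n: int, m: int, chunk: int) -> int:
    """Number of chunk x chunk partial multiplies for an n x m multiply."""
    return math.ceil(n / chunk) * math.ceil(m / chunk)

def sweep(n: int, m: int, chunks=(4, 8, 12, 17, 24)) -> dict:
    """Map each candidate chunk width to its partial-product count."""
    return {c: schoolbook_cost(n, m, c) for c in chunks}

# e.g. sweep(23, 17) lets a driver weigh one full-width multiply
# against 3 x 3 = 9 byte-wide partial products, and so on.
```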

Is it something that the delay model automatically takes care of?

Not at this point. The delay model really just estimates the delay of the given IR nodes; it doesn't attempt to change the IR at all. That would have to be done in an earlier phase, with the corresponding phase-ordering problem.

 
My intuition says this shouldn't just be a matter of delay: I might want to reduce the width of some operations to spare BELs, independently of the maximum delay required to perform an operation.

Is chunking into smaller operations something that the user should do?


For maximum control, probably. That would be my call. As mentioned above: some flexibility at the DSLX level, and an outside driver for design space exploration. But to be clear, we don't have that today.

Thanks again
    R

 


--
Robert Hundt | Platforms | rhu...@google.com | 650-253-7680

Leonardo Romor

unread,
Nov 30, 2020, 7:50:50 AM11/30/20
to xls-dev
Hi Robert,

thank you for your answer and clarification. I think I understand your arguments, and I must say I also prefer the idea of letting the user take care of the bit-widths, as it seems very user-specific and hard to model or solve in a general way.

An idea could be to write a code generator that emits DSLX targeting small FPGAs, decomposing functions into smaller ones to keep the theoretical space usage under some threshold. This could be one of the policies XLS could also implement.
It could be treated as a flexible CSP: decompose a function into the largest (space-wise) set of functions whose total space consumption stays under some value, while minimizing the latency to compute the output. Sounds like a hell of a problem.
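That CSP framing can be sketched as a small brute force in Python. The widths below mirror the three multiplies in the earlier snippet, but the area proxy (partial-product count), the latency proxy (adder-tree depth), and the candidate chunk widths are all invented for illustration.

```python
# A toy brute force of the constrained problem sketched above: pick a
# chunk width for each multiply so total (made-up) area stays under a
# budget while a latency proxy is minimized. Costs are illustrative,
# not real FPGA figures.
import math
from itertools import product

MULS = [(23, 17), (23, 34), (23, 51)]   # the a*x, b*x^2, c*x^3 widths

def area(n, m, chunk):
    """Partial-product count as a crude area proxy."""
    return math.ceil(n / chunk) * math.ceil(m / chunk)

def latency(n, m, chunk):
    """Adder-tree depth over the partial products as a latency proxy."""
    return math.ceil(math.log2(max(2, area(n, m, chunk))))

def best_plan(budget, chunks=(6, 9, 12, 18, 26)):
    """Return (latency, area, plan) minimizing latency under the budget."""
    best = None
    for plan in product(chunks, repeat=len(MULS)):
        a = sum(area(n, m, c) for (n, m), c in zip(MULS, plan))
        if a > budget:
            continue
        l = max(latency(n, m, c) for (n, m), c in zip(MULS, plan))
        if best is None or (l, a) < best[:2]:
            best = (l, a, plan)
    return best
```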
 
In our case we will just take the largest multiplication, write a function that computes it using smaller widths, and reuse that function for each of the other multiplications.
Something similar will happen for the sums as well.

Best,

Leonardo

Mark Heffernan

unread,
Dec 1, 2020, 3:31:05 PM12/1/20
to Leonardo Romor, xls-dev
Splitting expensive operations for timing/area purposes is something XLS could conceivably take care of. Currently, XLS only considers timing (not area) when generating pipelines. I can imagine that if an operation doesn't fit in a cycle (or doesn't fit well, as in scheduling leaves lots of slack in a stage), XLS could split the op. However, we don't have that feature at the moment.

Mark