Challenge: (n mod 3) in hardware???

RISC taker

unread,

Apr 23, 2003, 12:54:22 PM4/23/03

to

Hey, I need to calculate (n mod 3) in a Virtex-II design. n is a
10-bit unsigned number and 3 is a constant. This has to be done in the
same cycle (combinatorial!). Now what's a good way to implement that?

I thought of a lookup table (distributed RAM) but this takes quite a
lot of space. Any better ideas? (Ray, the arithmetic guru? :-)

Do you think I can perform this operation at 200 MHz in a Virtex-II?

Thanks!
RISC_taker

Ray Andraka

unread,

Apr 23, 2003, 1:58:58 PM4/23/03

to

with only 10 bits input, you only have a 1K address space, and the output
only needs 2 bits (0,1,2), a so this can easily fit in a single block
RAM. That would be your best bet for single cycle operation at 200 MHz.
If you are really adverse to using block RAM, there is a shortcut similar
to the one for decimal numbers but I don't recall it off hand. The
decimal shortcut is to sum the digits, and divide that sum by 3. The
remainder of that division is the mod. It would take a fair amount of
area to attempt that combinatorially, and you'd be hard pressed to do it
in a single cycle at 200 MHz....go with the BRAM.

RISC taker wrote:

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email r...@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

Ben Jackson

unread,

Apr 23, 2003, 2:09:19 PM4/23/03

to

In article <18c289aa.03042...@posting.google.com>,

RISC taker <RISC_...@alpenjodel.de> wrote:
>Hey, I need to calculate (n mod 3) in a Virtex-II design. n is a
>10-bit unsigned number and 3 is a constant. This has to be done in the
>same cycle (combinatorial!). Now what's a good way to implement that?

Off the top of my head, I think so:

Consider any all-1s binary number with an even number of bits. That
number is a multiple of 3. 3, 15, 63, 255, or in your case, 1023.
For every EVEN bit NOT set in that number, the remainder goes up by 2.
For every ODD bit not set in that number, the remainder goes up by 1.
For example:

1111 = 15 = multiple of 3
0111 = 7 = remainder 1
1011 = 11 = remainder 2
1101 = 13 = remainder 1
1110 = 14 = remainder 2

and any combination:
0101 = 5 = remainder (1+1) => 2
0110 = 6 = remainder (1+2) => 3 => 0

Conveniently each pair of bits makes a 2-bit number which you can add
up.

Now if you add up all of the possible bits that way for a 10 digit number
the most you can get is 15 (1 x 5 odd bits + 2 x 5 even bits = 15). You
could then handle that with a lookup table or a few well-chosen gates.
Or you could repeat this process again.

--
Ben Jackson
<b...@ben.com>
http://www.ben.com/

Falk Brunner

unread,

Apr 23, 2003, 1:22:54 PM4/23/03

to

"RISC taker" <RISC_...@alpenjodel.de> schrieb im Newsbeitrag
news:18c289aa.03042...@posting.google.com...

> Hey, I need to calculate (n mod 3) in a Virtex-II design. n is a
> 10-bit unsigned number and 3 is a constant. This has to be done in the
> same cycle (combinatorial!). Now what's a good way to implement that?
>
> I thought of a lookup table (distributed RAM) but this takes quite a
> lot of space. Any better ideas? (Ray, the arithmetic guru? :-)

Iam not a guru, but how about a BRAM, used in x4 configuration. Just store
the truth table ther, doen. OK, it has 1 cycle latency, but hey, its pretty
easy and fast. 200 MHz should be possible.

Otherwise you could try to use a excel sheet or something to generate the
truth table. Maybe you will find some clever optimization possibilities.

--
MfG
Falk

Allan Herriman

unread,

Apr 23, 2003, 2:20:49 PM4/23/03

to

On 23 Apr 2003 09:54:22 -0700, RISC_...@alpenjodel.de (RISC taker)
wrote:

Hi RISC,

If you break up your 10 bit input word into five 2 bit words, you can
take the modulus of each (using five pairs of 2 input LUTs), then add
the results (to get a four bit number), then take the modulus of that.

This works since

a mod (a - 1) = 1

and

b mod (a - 1) = (a x b) mod (a - 1)

(Think of a as being an even power of 2, which means we don't change
the modulus if we shift the input by 2 bits.)

We can improve the timing by using pairs of 4 input LUTS to take four
bit slices of your input.
We then sum the three 2 bit values (using 2 levels of logic) to get a
4 bit result, then take the modulus of that in another pair of LUTs.

This takes a total of four levels of logic, which should work at
200MHz in Virtex-II, depending on the speed grade, your patience, etc.

Regards,
Allan.

Allan Herriman

unread,

Apr 23, 2003, 2:28:58 PM4/23/03

to

BTW, the total hardware is 13 LUTs, which might fit into 2 CLBs.

(Or you could use a block ram, as other posters have suggested.)

Regards,
Allan.

Avrum

unread,

Apr 23, 2003, 2:34:34 PM4/23/03

to

You are on the right track...

There are three keys to this solution, the first is

(n*4^m mod 3) = n mod 3 for all values of m

The second is

(4n mod 3) = n mod 3

The third is

((a + b) mod 3) = ((a mod 3) + (b mod 3)) mod 3

Using these reductions you can construct a VERY fast mod 3 calculator with
extremely small gate count. I have used this to do 19 bits at over 100MHz
with no problems at all, and I am sure it will do 10 bits at 200 MHz. Its
size is TINY. For 10 bits it will require 3 levels of LUTs and a total of (I
think) 8 LUTs.

The basis of the algorithm is that you can take the input 4 bits at a time,
and generate the mod 3 (which is 2 bits) in two LUTs (one for each output
bit). Then adjacent pairs of these two bit quantities can be concatenated to
generate 4 more bits, which can be reduce to 2. You need to do this
log2(width)-1 times to get to the final 2 bits.

Here it is. You can let the synthesis tool optimize out the LUTs you don't
need (since you are only doing 10 bits), or you chop out the stages you
don't need by hand.

Avrum

------------------------------------------------------------

/*
* Module: syn_mod3_32
* Creation Date: Tue Feb 20 2000
* Author: Avrum Warshawsky
* Description: Synthetic Mod3 calculator
* Instantiated models: none
* DEFINE: WIDTH
*
*
* Description:
*
* This module will calculate mod 3 for any number up to 32 bits.
* The parameter WIDTH determines the width of the input.
* The width of the output is always 2 bits, which will be
* 0, 1, or 2 (never 3) - the MOD3 of the input data
*
*/

module syn_mod3_32(out, in);

//**************************************************************************
****
// Port Declarations
//**************************************************************************
****

parameter WIDTH=19;

input [WIDTH-1:0] in;
output [1:0] out;

function [1:0] digit_mod;
input [3:0] digit;
case(digit)
4'h0: digit_mod = 2'd0;
4'h1: digit_mod = 2'd1;
4'h2: digit_mod = 2'd2;
4'h3: digit_mod = 2'd0;
4'h4: digit_mod = 2'd1;
4'h5: digit_mod = 2'd2;
4'h6: digit_mod = 2'd0;
4'h7: digit_mod = 2'd1;
4'h8: digit_mod = 2'd2;
4'h9: digit_mod = 2'd0;
4'ha: digit_mod = 2'd1;
4'hb: digit_mod = 2'd2;
4'hc: digit_mod = 2'd0;
4'hd: digit_mod = 2'd1;
4'he: digit_mod = 2'd2;
4'hf: digit_mod = 2'd0;
endcase
endfunction

wire [1:0] m00, m01, m02, m03,
m04, m05, m06, m07;

wire [1:0] m10, m11, m12, m13;

wire [1:0] m20, m21;

wire [31:0] my_in = in; // Let it zero extend for us

assign m00 = digit_mod(my_in[ 3:0 ]);
assign m01 = digit_mod(my_in[ 7:4 ]);
assign m02 = digit_mod(my_in[11:8 ]);
assign m03 = digit_mod(my_in[15:12]);
assign m04 = digit_mod(my_in[19:16]);
assign m05 = digit_mod(my_in[23:20]);
assign m06 = digit_mod(my_in[27:24]);
assign m07 = digit_mod(my_in[31:28]);

assign m10 = digit_mod({m01, m00});
assign m11 = digit_mod({m03, m02});
assign m12 = digit_mod({m05, m04});
assign m13 = digit_mod({m07, m06});

assign m20 = digit_mod({m11, m10});
assign m21 = digit_mod({m13, m12});

assign out = digit_mod({m21, m20});

// synthesis translate_off
initial
begin
if (WIDTH > 32)
begin
$display("%t ERROR: Mod3 width must be <= 32 in %m",$realtime);
end
end
// synthesis translate_on

endmodule

"Ben Jackson" <b...@ben.com> wrote in message
news:jFApa.584264$3D1.324134@sccrnsc01...

rickman

unread,

Apr 23, 2003, 3:52:15 PM4/23/03

to

If I understand how this algorithm works, I think the logic can be
reduced to 10 LUTs in two levels.

Instead of adding the values of the pairs of inputs, group them into
even and odd bits. Use the LUTs with the F5 mux to implement the
modified two bit sum of five equal value inputs. When I say modified,
you need to produce a two bit result so logically add the result bit 2^2
back in as another input bit. Five bits in, two bits out. Or you can
think of this as a 32 entry truth table. The point is that you only
need two outputs from any of these functions to produce a 0, a 1 or a
2.

This will use a total of 8 LUTs to give you a two bit even bit sum and a
two bit odd bit sum. These four signals can then be run through a pair
of LUTs to give you the two bit modulo three result by using a simple
truth table.

--

Rick "rickman" Collins

rick.c...@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Avrum

unread,

Apr 23, 2003, 4:35:47 PM4/23/03

to

You are right - it can be done with 2 levels of logic and some F5MUXes
(which will be faster than the three levels of logic I propsed above).

Referring to my earlier implementation, you will still need the function
digit_mod, and you will also need an equivalent 5 bit function, lets call it
five_bit_mod - it will be a 5 bit input, 2 bit output case table like
digit_mod (which just lists the mod3 of all 32 5 bit combinations). This
SHOULD be implemented as 2*(2*LUT4+F5MUX), if the synthesis tool does its
job. The digit_mod should be implemented as 2*LUT4

wire [1:0] m0, m1;

m0 = five_bit_mod(in[4:0]);
m1 = five_bit_mod(in[9:5]);

out = digit_mod({m1,m0});

This will (oddly) take two more LUTs, but should be faster.

Avrum
"rickman" <spamgo...@yahoo.com> wrote in message
news:3EA6EEEF...@yahoo.com...

Avrum

unread,

Apr 23, 2003, 4:45:26 PM4/23/03

to

I take that back - it won't work (sorry).

32 mod 3 is NOT 1, therefore you cannot use five_bit_mod on bits 9:5. You
can build a different 32->2 function for bit 9:5, but it would be

5'd0: out = 0;
5'd1: out = 2; // the mod3 of 2
5'd2: out = 1; // the mod3 of 4
5'd3: out = 0; // the mod3 of 6
5'd4: out = 2; // the mod3 of 8
etc...

This would account for the 1 bit shift.

Avrum

"Avrum" <av...@REMOVEsympatico.ca> wrote in message
news:tPCpa.2998$2g5.4...@news20.bellglobal.com...

Allan Herriman

unread,

Apr 23, 2003, 5:01:35 PM4/23/03

to

On Wed, 23 Apr 2003 15:52:15 -0400, rickman <spamgo...@yahoo.com>
wrote:

Yes, I realised how to do it in 10 LUTs in *three* levels just after I
posted, but by that time Avrum had already posted the equivalent
solution in Verilog so I didn't bother with a retraction.

If anyone is interested, I have appended the equivalent code in VHDL
(with "automatic" depth detection, and without the 32 bit limitation)
to this post.

Your solution using the F5 mux is faster of course (if less portable).
I'll think about adding it to the VHDL when I get some time.

>Instead of adding the values of the pairs of inputs, group them into
>even and odd bits. Use the LUTs with the F5 mux to implement the
>modified two bit sum of five equal value inputs. When I say modified,
>you need to produce a two bit result so logically add the result bit 2^2
>back in as another input bit. Five bits in, two bits out. Or you can
>think of this as a 32 entry truth table. The point is that you only
>need two outputs from any of these functions to produce a 0, a 1 or a
>2.
>
>This will use a total of 8 LUTs to give you a two bit even bit sum and a
>two bit odd bit sum. These four signals can then be run through a pair
>of LUTs to give you the two bit modulo three result by using a simple
>truth table.

Regards,
Allan.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mod3 is
generic (
width : positive := 10
);
port (
input : in unsigned(width - 1 downto 0);
output : out unsigned(1 downto 0)
);
end entity mod3;

architecture rtl of mod3 is

pure function digit_mod (arg : unsigned(3 downto 0)) return unsigned
is
type mod3_table is array (0 to 15) of unsigned(1 downto 0);
constant luts : mod3_table := (
"00","01","10","00",
"01","10","00","01",
"10","00","01","10",
"00","01","10","00"
);
begin
return luts(to_integer(arg));
end digit_mod;

pure function work_out_depth (width : positive) return positive is
variable depth : integer := 1;
variable my_count : integer := (width - 1) / 4;
begin
while my_count > 0 loop
depth := depth + 1;
my_count := my_count / 2;
end loop;
return depth;
end work_out_depth;

constant depth : positive := work_out_depth(width);

type t_unsigned_array is array (depth downto 0) of unsigned(width +
3 downto 0);
signal unsigned_array : t_unsigned_array := (others => (others =>
'0'));

begin

unsigned_array(0)(input'range) <= input; -- zero extend input

g1: for d in 1 to depth generate
g2: for w in 0 to (width - 1) / 4 generate
unsigned_array(d)(2 * w + 1 downto 2 * w)
<= digit_mod(unsigned_array(d - 1)(4 * w + 3 downto 4 * w));
end generate g2;
end generate g1;

output <= unsigned_array(depth)(1 downto 0);

end architecture rtl;

rickman

unread,

Apr 23, 2003, 5:20:33 PM4/23/03

to

Avrum wrote:
>
> You are right - it can be done with 2 levels of logic and some F5MUXes
> (which will be faster than the three levels of logic I propsed above).
>
> Referring to my earlier implementation, you will still need the function
> digit_mod, and you will also need an equivalent 5 bit function, lets call it
> five_bit_mod - it will be a 5 bit input, 2 bit output case table like
> digit_mod (which just lists the mod3 of all 32 5 bit combinations). This
> SHOULD be implemented as 2*(2*LUT4+F5MUX), if the synthesis tool does its
> job. The digit_mod should be implemented as 2*LUT4
>
> wire [1:0] m0, m1;
>
> m0 = five_bit_mod(in[4:0]);
> m1 = five_bit_mod(in[9:5]);
>
> out = digit_mod({m1,m0});
>
> This will (oddly) take two more LUTs, but should be faster.

Two more LUTs than what? You originally proposed 13 LUTs, I suggested
10 and I think you are agreeing with me that it will take 10. The
outputs of the five_bit_mod are only two bits each. Are you thinking
that it will be more? Each of the five_bit_mod functions require four
LUTs and two F5MUX elements for a total of eight. To combine the two
pairs of outputs takes two more LUTs for a total of 10.

Am I missing something here with the LUT count?

BTW, your two five_bit_mod functions *can* be the same if you account
for it in the digit_mod function. Think of the combination of even bits
as counting +1 elements and the odd bits as counting -1 (-1 mod 3 = 2)
elements. In reality this is just another arbitrary truth table.

Avrum

unread,

Apr 23, 2003, 9:53:52 PM4/23/03

to

My original solution (which is based on only LUT4) actually requires only 8
LUTs total, if you want to hand optimize it...

m00 = digit_mod(in[3:0]); // 2 LUTS
m01 = digit_mod(in[7:4]); // 2 LUTS

m10 = digit_mod({m00,m01}); // 2 LUTs

out = digit_mod({m10,in[10:9]}); // 2 LUTs

Thus 8 LUTs in 3 levels.

The solution using the F5MUX (and your variation) requires 10 LUTs, 2
F5MUXes (which are free), and only 2 (and a bit) levels of logic.

Avrum

"rickman" <spamgo...@yahoo.com> wrote in message

news:3EA703A1...@yahoo.com...

Joshua Yin

unread,

Apr 23, 2003, 11:14:03 PM4/23/03

to

I think you can not get a way to finished this div operation in 5ns.
As I remember Xilinx need finished it in M(+N)+3 cycles, Altera's Div IP
Cores need 50~60ns.
(M, N is bit number of the div operation object)

--
Best Regards,
Joshua Yin
2003.04.23

"RISC taker" <RISC_...@alpenjodel.de> wrote in message
news:18c289aa.03042...@posting.google.com...

nospam

unread,

Apr 25, 2003, 9:50:22 AM4/25/03

to

RISC_...@alpenjodel.de (RISC taker) wrote:

>Hey, I need to calculate (n mod 3) in a Virtex-II design. n is a
>10-bit unsigned number and 3 is a constant. This has to be done in the
>same cycle (combinatorial!). Now what's a good way to implement that?
>
>I thought of a lookup table (distributed RAM) but this takes quite a
>lot of space. Any better ideas? (Ray, the arithmetic guru? :-)

A good way of implementing it?

Write a 1024 case switch assigning the required output.

If the solution is a simple combinatorial expression the tool will create
it - no arithmetic guru required.

John_H

unread,

Apr 25, 2003, 11:36:20 AM4/25/03

to

Dang! Typo on line 739 of the case statement.

Distilling a problem to its essence will often produce the best results.
Though when we get too creative, ample comments become a requirement.

"nospam" <nos...@nospam.invalid> wrote in message
news:kgeiav8p95jerg36q...@4ax.com...

Hal Murray

unread,

Apr 25, 2003, 12:36:54 PM4/25/03

to

>Dang! Typo on line 739 of the case statement.
>
>Distilling a problem to its essence will often produce the best results.
>Though when we get too creative, ample comments become a requirement.

Write a program to write the body of the code for the case statement.

It's just a big table - the same bits you would need if
you did the RAM approach.

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Paul Dankoski

unread,

Apr 29, 2003, 9:05:14 PM4/29/03

to

What's the prize for this challenge anyway?

reg [9:0] n;
wire [2:0] p1, p2;
wire [3:0] p;
wire [1:0] q1, q2;
wire [2:0] q;
wire [1:0] r1;
wire r2;
wire [1:0] r;
wire [1:0] out;

assign p1=n[8]+n[6]+n[4]+n[2]+n[0];
assign p2=n[9]+n[7]+n[5]+n[3]+n[1];
assign p={p2,1'b0}+{1'b0,p1};
assign q1=p[2]+p[0];
assign q2=p[3]+p[1];
assign q={q2,1'b0}+{1'b0,q1};
assign r1=q[2]+q[0];
assign r2=q[1];
assign r={r2,1'b0}+{1'b0,r1};
assign out = &r ? 2'b0 : r;

Paul Dankoski
"Ain't math great?"