Ed Chester
University of Newcastle
There may be some standard way I don't know, but here's my
best offer:
A full adder with carry-in and carry-out can be thought of
as a way of counting how many 1s in a 3-bit word, yielding a 2-bit
result. Two of those together feed into a 2-bit full adder yielding
a 3-bit result - but we can soak up one more bit of the original
word on the carry input of this new adder, so now we have accounted
for 7 bits of the source word with two 1-bit FAs and one 2-bit FA.
Two of these modules feed into a 3-bit full adder, which soaks up
yet one more bit of source word, so we have accounted for 15 bits
of source. One more stage in the tree and we have accounted for
31 bits. Only one bit to go! 5-bit incrementer with carry out will
deal with that. So the total damage is:
eight 1-bit full adders
four 2-bit full adders
two 3-bit full adders
one 4-bit full adder
one 5-bit incrementer
and any pipeline registers you may need for speed.
Not especially sneaky, but it's a start.
Jonathan Bromley
--
>Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
>32b word without successively shifting-out and accumulating?
You can use a lookup table. Of course a 32 bit lookup table would be too big so
you can use an 8 bit lookup table and use it four times and add the results.
Another approach would be to use a 32 input 1 bit adder. The result will be at
most 6 bits so it wouldn't be too big.
muzo
WDM & NT Kernel Driver Development Consulting <mu...@pacbell.net>
Just for grins, I tried a few real implementations.
1) Verilog no-brainer to count how many 1s there are, presumably
synthesising to a bunch of adders:
module count32a (source, count);
input [31:0] source;
output [5:0] count; reg [5:0] count;
integer i;
always @(source) begin
count=0;
for (i=0; i<32; i=i+1) count=count+source[i];
end
endmodule
2) RTL Verilog to implement the tree scheme I proposed - messy,
not worth posting here
3) The scheme I proposed, implemented as a schematic using ripple
carry throughout (seems OK since they are all so narrow)
4) Bit-serial implementation: 32-bit parallel-in, serial-out
shifter, and a 6-bit counter to count the ones as they shift,
in regular RTL Verilog
Verilog examples (1), (2), (4) were synthesised using a fairly old
version of Synplify targeting QuickLogic. (3) was targeted to
QuickLogic using their off-the-shelf soft macro library of full
and half adders. Results (bear in mind that the combinatorial
part of a QuickLogic cell is worth around 8 to 10 gates -
your mileage may vary...):
version size (cells) longest delay including pads
======================================================
1 52 66ns (17 levels + I/O)
2 52 69ns (18 levels + I/O)
3 57 48ns (9 levels + I/O)
4 41 23ns (dominated by fanout on the
(incl. 38 FFs) LOAD/GO input)
Timing estimates from QuickLogic's static timing analyser,
set up for worst-case conditions on the slowest speed grade.
There must be a moral in this somewhere, but I'm darned if I can
see what it is, except perhaps that Synplify is pretty good at
inferring compact implementations of messy arithmetic.
I think Synplify got better delay-per-logic-level than I did
because it was more careful about fanout, but I'm not sure.
Hope this is some interest to someone....
Jonathan Bromley
--
Ed Chester
> Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
> 32b word without successively shifting-out and accumulating? Trying to
> find some combinational way, ultimately for synthesis or use in a DSP.
> Thanks for any ideas.
>
> Ed Chester
> University of Newcastle
Why be sneaky? Take the straight forward approach and let synthesis do the
rest.
wire [31:0] vec;
wire [5:0] ones =
{5'd0, vec[31]}
+ {5'd0, vec[30]}
+ {5'd0, vec[29]}
.
.
+ {5'd0, vec[0]};
-- Bob Taylor
> Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
> 32b word without successively shifting-out and accumulating? Trying to
> find some combinational way, ultimately for synthesis or use in a DSP.
> Thanks for any ideas.
>
> Ed Chester
> University of Newcastle
Here's a C function that does what you describe. The conversion
to verilog is left as an exercise to the reader.
int bitcount (unsigned int w)
{
w = (0x55555555 & w) + (0x55555555 & (w>> 1));
w = (0x33333333 & w) + (0x33333333 & (w>> 2));
w = (0x0f0f0f0f & w) + (0x0f0f0f0f & (w>> 4));
w = (0x00ff00ff & w) + (0x00ff00ff & (w>> 8));
w = (0x0000ffff & w) + (0x0000ffff & (w>>16));
return w;
}
I had to instantiate designware components in order to get good results. My
compile code looked like this: (I didn't really need the total if it was
greater than 7)
DW01_add #(1) U1(lfxor[0], lfxor[1], lfxor[2], lsum1, lsum1c);
DW01_add #(1) U2(lfxor[3], lfxor[4], lfxor[5], lsum2, lsum2c);
DW01_add #(1) U3(lfxor[6], lfxor[7], lfxor[8], lsum3, lsum3c);
DW01_add #(1) U4(lfxor[9], lfxor[10], lfxor[11], lsum4, lsum4c);
DW01_add #(1) U5(lfxor[12], lfxor[13], lfxor[14], lsum5, lsum5c);
DW01_add #(1) U6(lfxor[15], lfxor[16], lfxor[17], lsum6, lsum6c);
DW01_add #(2) U7({lsum1c,lsum1}, {lsum2c,lsum2}, lfxor[18], lsum7,
lsum7c);
DW01_add #(2) U8({lsum3c,lsum3}, {lsum4c,lsum4} , 1'b0, lsum8, lsum8c);
DW01_add #(2) U9({lsum5c,lsum5}, {lsum6c,lsum6}, lfxor[19], lsum9,
lsum9c);
DW01_add #(3) U10({lsum7c,lsum7}, {lsum9c,lsum9}, 1'b0, lsum11,
lsum11c);
wire [2:0]lsum11t = {3{lsum11c}} | lsum11; //limit to 3 bit sum
DW01_add #(3) U11(lsum11t, {lsum8c,lsum8}, 1'b0, lsum, lsumc);
wire [2:0] lfcnt = {3{lsumc}} | lsum; //limit to 3 bits
My simulation code was a little simpler, but gave crummy synthesis results:
wire [4:0] lsum
=lfxor[0]+lfxor[1]+lfxor[2]+lfxor[3]+lfxor[4]+lfxor[5]+lfxor[6]+lfxor[7]+
lfxor[8]+lfxor[9]+lfxor[10]+lfxor[11]+lfxor[12]+lfxor[13]+lfxor[14]+
lfxor[15]+lfxor[16]+lfxor[17]+lfxor[18]+lfxor[19];
Adding bits(for a 20 bit "flag") gave better results (smaller/faster) than
the shift/accumulate method.
Bruce
Bob Taylor wrote in message ...
Solution from our class (not the parenthesis to force the
tree balancing).
module COUNT_BITS(I, C);
input [31:0] I;
output [ 5:0] C;
//
// Please specify a synthesizable description of a circuit
// that counts the number of bits that are set to '1' in
// a 32-bit vector. e.g. "001101" should return '3'.
//
assign C = (((( I[0] + I[1] + I[2] + I[3]) +
( I[4] + I[5] + I[6] + I[7])) +
(( I[8] + I[9] + I[10] + I[11]) +
(I[12] + I[13] + I[14] + I[15]))) +
(((I[16] + I[17] + I[18] + I[19]) +
(I[20] + I[21] + I[22] + I[23])) +
((I[24] + I[25] + I[26] + I[27]) +
(I[28] + I[29] + I[30] + I[31]))));
endmodule
Thomas A. Coonan wrote:
>
> A tree of adders begining with 1-bit wide adders, gradually increasing
> in width. So, start with 16 2-bit adders each having a 2-bit output.
> Next level is 8 2-bit adders, etc. etc. Very common circuit in
> "majority logic" section in FEC modules.