1's counter?

Ed Chester

unread,

Sep 29, 1998, 3:00:00 AM9/29/98

to

Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
32b word without successively shifting-out and accumulating? Trying to
find some combinational way, ultimately for synthesis or use in a DSP.
Thanks for any ideas.

Ed Chester
University of Newcastle

Jonathan Bromley

unread,

Sep 29, 1998, 3:00:00 AM9/29/98

to

Ed Chester wrote:
> Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
> 32b word without successively shifting-out and accumulating?

There may be some standard way I don't know, but here's my
best offer:

A full adder with carry-in and carry-out can be thought of
as a way of counting how many 1s in a 3-bit word, yielding a 2-bit
result. Two of those together feed into a 2-bit full adder yielding
a 3-bit result - but we can soak up one more bit of the original
word on the carry input of this new adder, so now we have accounted
for 7 bits of the source word with two 1-bit FAs and one 2-bit FA.
Two of these modules feed into a 3-bit full adder, which soaks up
yet one more bit of source word, so we have accounted for 15 bits
of source. One more stage in the tree and we have accounted for
31 bits. Only one bit to go! 5-bit incrementer with carry out will
deal with that. So the total damage is:
eight 1-bit full adders
four 2-bit full adders
two 3-bit full adders
one 4-bit full adder
one 5-bit incrementer
and any pipeline registers you may need for speed.

Not especially sneaky, but it's a start.

Jonathan Bromley
--

muzo

unread,

Sep 29, 1998, 3:00:00 AM9/29/98

to

Ed Chester <e.i.c...@ncl.ac.uk> wrote:

>Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
>32b word without successively shifting-out and accumulating?

You can use a lookup table. Of course a 32 bit lookup table would be too big so
you can use an 8 bit lookup table and use it four times and add the results.
Another approach would be to use a 32 input 1 bit adder. The result will be at
most 6 bits so it wouldn't be too big.

muzo

WDM & NT Kernel Driver Development Consulting <mu...@pacbell.net>

Jonathan Bromley

unread,

Sep 30, 1998, 3:00:00 AM9/30/98

to

In an earlier post I suggested an adder-tree scheme to count the
number of ones in a 32-bit word:
<snip details>

> So the total damage is:
> eight 1-bit full adders
> four 2-bit full adders
> two 3-bit full adders
> one 4-bit full adder
> one 5-bit incrementer
> and any pipeline registers you may need for speed.

Just for grins, I tried a few real implementations.

1) Verilog no-brainer to count how many 1s there are, presumably
synthesising to a bunch of adders:
module count32a (source, count);
input [31:0] source;
output [5:0] count; reg [5:0] count;
integer i;
always @(source) begin
count=0;
for (i=0; i<32; i=i+1) count=count+source[i];
end
endmodule
2) RTL Verilog to implement the tree scheme I proposed - messy,
not worth posting here
3) The scheme I proposed, implemented as a schematic using ripple
carry throughout (seems OK since they are all so narrow)
4) Bit-serial implementation: 32-bit parallel-in, serial-out
shifter, and a 6-bit counter to count the ones as they shift,
in regular RTL Verilog

Verilog examples (1), (2), (4) were synthesised using a fairly old
version of Synplify targeting QuickLogic. (3) was targeted to
QuickLogic using their off-the-shelf soft macro library of full
and half adders. Results (bear in mind that the combinatorial
part of a QuickLogic cell is worth around 8 to 10 gates -
your mileage may vary...):

version size (cells) longest delay including pads
======================================================
1 52 66ns (17 levels + I/O)
2 52 69ns (18 levels + I/O)
3 57 48ns (9 levels + I/O)
4 41 23ns (dominated by fanout on the
(incl. 38 FFs) LOAD/GO input)

Timing estimates from QuickLogic's static timing analyser,
set up for worst-case conditions on the slowest speed grade.

There must be a moral in this somewhere, but I'm darned if I can
see what it is, except perhaps that Synplify is pretty good at
inferring compact implementations of messy arithmetic.
I think Synplify got better delay-per-logic-level than I did
because it was more careful about fanout, but I'm not sure.

Hope this is some interest to someone....

Jonathan Bromley
--

Ed Chester

unread,

Oct 15, 1998, 3:00:00 AM10/15/98

to

Thanks for the replies, I also posted to comp.dsp, and there's some
pretty sneaky ideas appearing there in case anybody else is interested,
but it's mostly in Sharc code not Verilog...

Ed Chester

Bob Taylor

unread,

Nov 7, 1998, 3:00:00 AM11/7/98

to

In article <3610C8D9...@ncl.ac.uk>, Ed Chester
<e.i.c...@ncl.ac.uk> wrote:

> Is there any sneaky way to get a 1's count (or 0's count) from e.g. a

> 32b word without successively shifting-out and accumulating? Trying to
> find some combinational way, ultimately for synthesis or use in a DSP.
> Thanks for any ideas.
>
> Ed Chester
> University of Newcastle

Why be sneaky? Take the straight forward approach and let synthesis do the
rest.

wire [31:0] vec;
wire [5:0] ones =
{5'd0, vec[31]}
+ {5'd0, vec[30]}
+ {5'd0, vec[29]}
.
.
+ {5'd0, vec[0]};

-- Bob Taylor

Charlie Burns

unread,

Nov 8, 1998, 3:00:00 AM11/8/98

to

In article <3610C8D9...@ncl.ac.uk>, Ed Chester
<e.i.c...@ncl.ac.uk> wrote:

> Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
> 32b word without successively shifting-out and accumulating? Trying to
> find some combinational way, ultimately for synthesis or use in a DSP.
> Thanks for any ideas.
>
> Ed Chester
> University of Newcastle

Here's a C function that does what you describe. The conversion
to verilog is left as an exercise to the reader.

int bitcount (unsigned int w)
{
w = (0x55555555 & w) + (0x55555555 & (w>> 1));
w = (0x33333333 & w) + (0x33333333 & (w>> 2));
w = (0x0f0f0f0f & w) + (0x0f0f0f0f & (w>> 4));
w = (0x00ff00ff & w) + (0x00ff00ff & (w>> 8));
w = (0x0000ffff & w) + (0x0000ffff & (w>>16));
return w;
}

Thomas A. Coonan

unread,

Nov 8, 1998, 3:00:00 AM11/8/98

to

A tree of adders begining with 1-bit wide adders, gradually increasing
in width. So, start with 16 2-bit adders each having a 2-bit output.
Next level is 8 2-bit adders, etc. etc. Very common circuit in
"majority logic" section in FEC modules.

>In article <3610C8D9...@ncl.ac.uk>, Ed Chester
><e.i.c...@ncl.ac.uk> wrote:
>
>> Is there any sneaky way to get a 1's count (or 0's count) from e.g. a
>> 32b word without successively shifting-out and accumulating? Trying to
>> find some combinational way, ultimately for synthesis or use in a DSP.
>> Thanks for any ideas.
>>
>> Ed Chester
>> University of Newcastle
>

Bruce Nepple

unread,

Nov 9, 1998, 3:00:00 AM11/9/98

to

Adding bit by bit gives good results

I had to instantiate designware components in order to get good results. My
compile code looked like this: (I didn't really need the total if it was
greater than 7)

DW01_add #(1) U1(lfxor[0], lfxor[1], lfxor[2], lsum1, lsum1c);
DW01_add #(1) U2(lfxor[3], lfxor[4], lfxor[5], lsum2, lsum2c);
DW01_add #(1) U3(lfxor[6], lfxor[7], lfxor[8], lsum3, lsum3c);
DW01_add #(1) U4(lfxor[9], lfxor[10], lfxor[11], lsum4, lsum4c);
DW01_add #(1) U5(lfxor[12], lfxor[13], lfxor[14], lsum5, lsum5c);
DW01_add #(1) U6(lfxor[15], lfxor[16], lfxor[17], lsum6, lsum6c);
DW01_add #(2) U7({lsum1c,lsum1}, {lsum2c,lsum2}, lfxor[18], lsum7,
lsum7c);
DW01_add #(2) U8({lsum3c,lsum3}, {lsum4c,lsum4} , 1'b0, lsum8, lsum8c);
DW01_add #(2) U9({lsum5c,lsum5}, {lsum6c,lsum6}, lfxor[19], lsum9,
lsum9c);
DW01_add #(3) U10({lsum7c,lsum7}, {lsum9c,lsum9}, 1'b0, lsum11,
lsum11c);
wire [2:0]lsum11t = {3{lsum11c}} | lsum11; //limit to 3 bit sum
DW01_add #(3) U11(lsum11t, {lsum8c,lsum8}, 1'b0, lsum, lsumc);
wire [2:0] lfcnt = {3{lsumc}} | lsum; //limit to 3 bits

My simulation code was a little simpler, but gave crummy synthesis results:

wire [4:0] lsum
=lfxor[0]+lfxor[1]+lfxor[2]+lfxor[3]+lfxor[4]+lfxor[5]+lfxor[6]+lfxor[7]+

lfxor[8]+lfxor[9]+lfxor[10]+lfxor[11]+lfxor[12]+lfxor[13]+lfxor[14]+
lfxor[15]+lfxor[16]+lfxor[17]+lfxor[18]+lfxor[19];

Adding bits(for a 20 bit "flag") gave better results (smaller/faster) than
the shift/accumulate method.

Bruce

Bob Taylor wrote in message ...

Janick Bergeron

unread,

Nov 9, 1998, 3:00:00 AM11/9/98

to

And lab #3 in our Verilog-for-Synthesis training day.
Apparently an interview question at Intel (and also an
interview question I got [and flunked] at PlainTree)

Solution from our class (not the parenthesis to force the
tree balancing).

module COUNT_BITS(I, C);

input [31:0] I;
output [ 5:0] C;

//
// Please specify a synthesizable description of a circuit
// that counts the number of bits that are set to '1' in
// a 32-bit vector. e.g. "001101" should return '3'.
//

assign C = (((( I[0] + I[1] + I[2] + I[3]) +
( I[4] + I[5] + I[6] + I[7])) +
(( I[8] + I[9] + I[10] + I[11]) +
(I[12] + I[13] + I[14] + I[15]))) +
(((I[16] + I[17] + I[18] + I[19]) +
(I[20] + I[21] + I[22] + I[23])) +
((I[24] + I[25] + I[26] + I[27]) +
(I[28] + I[29] + I[30] + I[31]))));

endmodule

Thomas A. Coonan wrote:
>
> A tree of adders begining with 1-bit wide adders, gradually increasing
> in width. So, start with 16 2-bit adders each having a 2-bit output.
> Next level is 8 2-bit adders, etc. etc. Very common circuit in
> "majority logic" section in FEC modules.