ffs16

Joe keane

unread,

Apr 22, 2012, 5:36:30 PM4/22/12

to

case A

int ffs16a(int x)
{
int res;

if ((x & 0xFF) == 0)
res = kk_ffstab[x & 0xFF];
else
res = kk_ffstab[x >> 8 & 0xFF] + 8;
return res;
}

case B

int ffs16b(int x)
{
int resl;
int resh;
int res;

fla = (x & 0xFF) != 0;
resl = kk_ffstab[x & 0xFF];
resh = kk_ffstab[x >> 8 & 0xFF] + 8;
res = (fla - 1) & resl | (0 - fla) & resh;
return res;
}

Keith Thompson

unread,

Apr 22, 2012, 6:23:43 PM4/22/12

to

j...@panix.com (Joe keane) writes:
> case A
>
> int ffs16a(int x)
> {

[7 lines deleted]

> }
>
> case B
>
> int ffs16b(int x)
> {

[9 lines deleted]
> }

Yes, and did you have a question?

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Will write code for food.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Eric Sosman

unread,

Apr 22, 2012, 8:37:07 PM4/22/12

to

On 4/22/2012 5:36 PM, Joe keane wrote:
> case A
>
> int ffs16a(int x)
> {
> int res;
>

> if ((x& 0xFF) == 0)
> res = kk_ffstab[x& 0xFF];
> else
> res = kk_ffstab[x>> 8& 0xFF] + 8;

> return res;
> }
>
> case B
>
> int ffs16b(int x)
> {
> int resl;
> int resh;
> int res;
>

> fla = (x& 0xFF) != 0;
> resl = kk_ffstab[x& 0xFF];
> resh = kk_ffstab[x>> 8& 0xFF] + 8;

> res = (fla - 1)& resl | (0 - fla)& resh;
> return res;
> }

As a practical matter, no, even though a strict reading
of the C Standard would permit it. Nonetheless, unlikely not
to malfunction on all but a minority of non-conforming C99 or
C11 (not necessarily C90) implementations, usually.

--
Eric Sosman
eso...@ieee-dot-org.invalid

Terje Mathisen

unread,

Apr 23, 2012, 10:22:54 AM4/23/12

to

So you've been looking into Find First Set bit for 16-bit (effectively
unsigned) inputs?

What do you need it for?

Should it be fast? Small? Portable?

The fastest is probably to get some help from the fp hw:

int ffs16f(int x)
{
float f = x;
uint32_t fb = *(uint32_t *) &f; // Get access to the fp bit pattern!

// The exponent is stored in 8 bits following the sign bit:

int exp = (fb >> 23); // The sign bit is zero, so no need to mask

exp -= 127; // The exponent bias

// I assume you need to handle zero inputs? Return -1?
exp |= (0-(x==0));

return exp;
}
(Modulo possibly an off by one error in the bias handling...)

Terje

Joe keane wrote:
> case A
>
> int ffs16a(int x)
> {
> int res;
>

> if ((x& 0xFF) == 0)
> res = kk_ffstab[x& 0xFF];
> else
> res = kk_ffstab[x>> 8& 0xFF] + 8;

> return res;
> }
>
> case B
>
> int ffs16b(int x)
> {
> int resl;
> int resh;
> int res;
>

> fla = (x& 0xFF) != 0;
> resl = kk_ffstab[x& 0xFF];
> resh = kk_ffstab[x>> 8& 0xFF] + 8;
> res = (fla - 1)& resl | (0 - fla)& resh;
> return res;
> }

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Eric Sosman

unread,

Apr 23, 2012, 9:46:17 PM4/23/12

to

On 4/23/2012 10:22 AM, Terje Mathisen wrote:
> So you've been looking into Find First Set bit for 16-bit (effectively
> unsigned) inputs?

Can you think of any possible kk_ffstab[] content that would
make his computations produce "First Set Bit," for any supportable
definition of "First?"

--
Eric Sosman
eso...@ieee-dot-org.invalid

Terje Mathisen

unread,

Apr 24, 2012, 8:16:13 AM4/24/12

to

Eric Sosman wrote:
> On 4/23/2012 10:22 AM, Terje Mathisen wrote:
>> So you've been looking into Find First Set bit for 16-bit (effectively
>> unsigned) inputs?
>
> Can you think of any possible kk_ffstab[] content that would
> make his computations produce "First Set Bit," for any supportable
> definition of "First?"

Sure:

Count the bits from 16 down to 1, with 0 for no set bit?

return (x & 0xff00)? table[x >> 8] + 8 : table[x];

Not my favorite definition, but still workable. :-)

Terje

Philip Lantz

unread,

Apr 24, 2012, 9:36:52 AM4/24/12

to

Eric,

I hope you will enlighten us as to what strict reading could allow this
to work as written, and in what way the version of C standard applied
affects it.

I'm guessing it has something to do with negative zeroes. If not, I'm
completely lost.

The first function will not work on any implementation, of course,
without the obvious correction to the test.

Thanks!

Philip

Joe keane

unread,

Apr 24, 2012, 3:06:39 PM4/24/12

to

[read '(x & 0xFF) != 0' for '(x & 0xFF) == 0' and '(x & 0xFF) == 0' for
'(x & 0xFF) != 0']

In article <v65g69-...@ntp6.tmsw.no>,
Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Should it be fast?

AFAP

>Small?

I wouldn't consider using tables much larger than a kilobyte, but that's
because i think it won't be fast. Code size is no issue.

There are two problems, FHS and FLS. I thought we needed both, but
maybe one is sufficient. I got far in coding 'FFS' before i realized
that it"s the upside-down of what i had done a long time back.

>Portable?

A long time back, i considered using floating-point tricks, but i
discarded it. I don't remember why:

a) it actually tested slower
b) the gain in performance was not sufficient to justify the more
non-obvious code
c) it's best to keep it portable[1]

Anyway it's a general question, is it ever better to say

cond = COND(x);
val0 = VAL0(x);
val1 = VAL1(x);
res = (cond - 1) & val0 | (0 - cond) & val1;

? [and i think you said 'no']

Maybe it is the same because the compiler will do it anyway?

[1] we assume that every machine is IEEE 754, and if the code is proof
against endianness, it will work on any machine i care about

Terje Mathisen

unread,

Apr 24, 2012, 6:38:31 PM4/24/12

to

Joe keane wrote:
> [read '(x& 0xFF) != 0' for '(x& 0xFF) == 0' and '(x& 0xFF) == 0' for
> '(x& 0xFF) != 0']

>
> In article<v65g69-...@ntp6.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> Should it be fast?
>
> AFAP
>
>> Small?
>
> I wouldn't consider using tables much larger than a kilobyte, but that's

With exactly 16 input bits and 8-bit table (256 bytes) is fine

> because i think it won't be fast. Code size is no issue.

Code size is always an issue, otherwise I'd give you a 64K-way switch
statement. :-)

>
> There are two problems, FHS and FLS. I thought we needed both, but

Find High Set/Find Low Set bit?

> maybe one is sufficient. I got far in coding 'FFS' before i realized
> that it"s the upside-down of what i had done a long time back.
>
>> Portable?
>
> A long time back, i considered using floating-point tricks, but i
> discarded it. I don't remember why:
>
> a) it actually tested slower
> b) the gain in performance was not sufficient to justify the more
> non-obvious code
> c) it's best to keep it portable[1]

Many architectures have these operations in HW you know, you just need
to provide some architecture-specific #ifdef alternatives, with the pure
C code as fallback.

>
> Anyway it's a general question, is it ever better to say
>
> cond = COND(x);
> val0 = VAL0(x);
> val1 = VAL1(x);

> res = (cond - 1)& val0 | (0 - cond)& val1;

>
> ? [and i think you said 'no']

This is sort of the same as the scan for first/last set bit: You might
very well have hw opcodes to do this efficiently, and the compiler might
be capable of generating it if you use the proper idiom?

I would guess something like

a = (COND(X)) ? VAL1(x) : VAL0(x);

to be more likely to be picked up than your code above!

>
> Maybe it is the same because the compiler will do it anyway?
>
> [1] we assume that every machine is IEEE 754, and if the code is proof
> against endianness, it will work on any machine i care about

When did you last time the uint16->float->exponent trick?

Terje

Eric Sosman

unread,

Apr 24, 2012, 8:32:11 PM4/24/12

to

That's what I was alluding to: The two code fragments cannot
possibly be equivalent unless kk_ffstab[] holds 256 identical
values.

Then there's the potential non-portability that `fla-1' or
`0-fla' might be represented by some other bit pattern than all-1,
like 111...110 for ones' complement or 100...001 for signed
magnitude. Nowadays these are mostly theoretical concerns: strict
portability demands that such representations be considered, but
ignoring them diminishes portability only a trifle.

But mostly I was annoyed at the O.P. for (1) asking no question,
(2) failing to describe his purpose, (3) being too clever by half,
and (4) failing at (3).

--
Eric Sosman
eso...@ieee-dot-org.invalid

Eric Sosman

unread,

Apr 24, 2012, 8:40:30 PM4/24/12

to

On 4/24/2012 8:16 AM, Terje Mathisen wrote:
> Eric Sosman wrote:
>> On 4/23/2012 10:22 AM, Terje Mathisen wrote:
>>> So you've been looking into Find First Set bit for 16-bit (effectively
>>> unsigned) inputs?
>>
>> Can you think of any possible kk_ffstab[] content that would
>> make his computations produce "First Set Bit," for any supportable
>> definition of "First?"
>
> Sure:
>
> Count the bits from 16 down to 1, with 0 for no set bit?
>
> return (x & 0xff00)? table[x >> 8] + 8 : table[x];
>
> Not my favorite definition, but still workable. :-)

"Case A" is a botch: It returns the same value for x==256 as
for x==32768, for example, or for x==1 and x==44033.

--
Eric Sosman
eso...@ieee-dot-org.invalid

Philip Lantz

unread,

Apr 25, 2012, 3:30:46 AM4/25/12

to

Oh, I agree. The only reason I found it worth commenting on at all was
your response, which I first dismissed as nonsense, but when I realized
who wrote it, I took it as a puzzle to figure out what you meant.

I'm still unfortunately not clear on what a strict reading of the C
standard permits with respect to this code, and how the differences in
the various C standards affects it, but if your original response was
really just meant as a non-response to the original non-question, rather
than an attempt to share some subtle insight, I'm happy to withdraw the
question.

Philip

Joe keane

unread,

Apr 25, 2012, 4:08:32 PM4/25/12

to

In article <9kmj69-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>This is sort of the same as the scan for first/last set bit: You might
>very well have hw opcodes to do this efficiently, and the compiler might
>be capable of generating it if you use the proper idiom?

That's the problem. I don't see how to write it so the compiler will
say 'oh he's trying to do a FFS' and spit out appropriate code.

James Van Buskirk

unread,

Apr 25, 2012, 5:06:28 PM4/25/12

to

"Joe keane" <j...@panix.com> wrote in message
news:jn9lk0$hp4$1...@reader1.panix.com...

In Fortran, the idiom is LEADZ/TRAILZ; in gcc it's CNTLZ/CNTTZ.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Joe keane

unread,

Apr 25, 2012, 5:25:43 PM4/25/12

to

In article <9kmj69-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>When did you last time the uint16->float->exponent trick?

just now

time in us

'server'
239 testfx
270 testfy

'PC'
211 testfx
186 testfy

-- fhsx.c
static const char fhstab[256] =
{
99, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
[...]
};

extern int kk_fhsx(int mask)
{
int ret;

ret = (mask & 0xFF00) != 0 ? fhstab[mask >> 8 & 0xFF] + 8 :
fhstab[mask & 0xFF];
return ret;
}

-- fhsy.c
union u
{
float f;
unsigned int u;
};

extern int kk_fhsy(int mask)
{
union u t;
int bex;
int ret;

t.f = (float) mask;
bex = t.u >> 23;
ret = bex - 127;
return ret;
}

-- testfx.c
#include <stdio.h>

static void testfx(int *sump);
extern int kk_fhsx(int mask);

[main]

static void testfx(int *sump)
{
int m;

for (m = 0x1; m <= 0xFFFF; m++)
{
int mask;
int ret;

mask = 31 * m & 0xFFFF;
ret = kk_fhsx(mask);
/* *sump += ret; */
}
}

-- testfy.c
#include <stdio.h>

static void testfy(int *sump);
extern int kk_fhsy(int mask);

[main]

static void testfy(int *sump)
{
int m;

for (m = 0x1; m <= 0xFFFF; m++)
{
int mask;
int ret;

mask = 31 * m & 0xFFFF;
ret = kk_fhsy(mask);
/* *sump += ret; */
}
}

James Van Buskirk

unread,

Apr 25, 2012, 6:30:02 PM4/25/12

to

"James Van Buskirk" <not_...@comcast.net> wrote in message
news:jn9p0l$gcc$1...@dont-email.me...

> In Fortran, the idiom is LEADZ/TRAILZ; in gcc it's CNTLZ/CNTTZ.

I just found the gcc syntax on some random web page and didn't test.
gcc does, however, have __builtin_clz which seems to generate a
bsr instruction on x86.

Terje Mathisen

unread,

Apr 25, 2012, 7:38:33 PM4/25/12

to

Joe keane wrote:
> In article<9kmj69-...@ntp6.tmsw.no>,
> Terje Mathisen<"terje.mathisen at tmsw.no"> wrote:
>> When did you last time the uint16->float->exponent trick?
>
> just now
>
> time in us
>
> 'server'
> 239 testfx
> 270 testfy
>
> 'PC'
> 211 testfx
> 186 testfy

OK, so the 'x' version is 64K tests of the table lookup, while 'y' is
the same number of fp conversions.

Problem one: The table lookups will have 100% cache hits since you are
doing nothing else, so you don't measure the misses you'll get in a real
program.

Problem two: The tests have a really funky way to randomize the tests
(multiply by 31), which means that the table lookup version will have
very close to 100% branch prediction hits for the pattern of switching
between the top and low byte.

Problem three: How common will zero be?

This affects the branch hit ratio for the fp version.

All that said, you are getting effectively the same speed for both
versions, and the fp code has zero table space so better cache behavior.

Terje

>
> -- fhsx.c
> static const char fhstab[256] =
> {
> 99, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3,
> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
> [...]
> };
>
>
> extern int kk_fhsx(int mask)
> {
> int ret;
>

> ret = (mask& 0xFF00) != 0 ? fhstab[mask>> 8& 0xFF] + 8 :
> fhstab[mask& 0xFF];

> return ret;
> }
>
> -- fhsy.c
> union u
> {
> float f;
> unsigned int u;
> };
>
>
> extern int kk_fhsy(int mask)
> {
> union u t;
> int bex;
> int ret;
>
> t.f = (float) mask;
> bex = t.u>> 23;
> ret = bex - 127;
> return ret;
> }
>
> -- testfx.c
> #include<stdio.h>
>
> static void testfx(int *sump);
> extern int kk_fhsx(int mask);
>
>
> [main]
>
>
> static void testfx(int *sump)
> {
> int m;
>
> for (m = 0x1; m<= 0xFFFF; m++)
> {
> int mask;
> int ret;
>

> mask = 31 * m& 0xFFFF;

> ret = kk_fhsx(mask);
> /* *sump += ret; */
> }
> }
>
> -- testfy.c
> #include<stdio.h>
>
> static void testfy(int *sump);
> extern int kk_fhsy(int mask);
>
>
> [main]
>
>
> static void testfy(int *sump)
> {
> int m;
>
> for (m = 0x1; m<= 0xFFFF; m++)
> {
> int mask;
> int ret;
>

> mask = 31 * m& 0xFFFF;

> ret = kk_fhsy(mask);
> /* *sump += ret; */
> }
> }

Joe keane

unread,

Apr 26, 2012, 3:34:17 PM4/26/12

to

In article <qgem69-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Problem three: How common will zero be?

It's an error. The calling code needs to check for this case, and do
something besides call this macro/function.

KK_FFS16(mask, pos);
assert((mask & (1 << pos)) != 0);
/* code can now assume this */

Joe keane

unread,

Apr 26, 2012, 4:41:03 PM4/26/12

to

In article <qgem69-...@ntp6.tmsw.no>,

Terje Mathisen <"terje.mathisen at tmsw.no"> wrote:
>Problem two: The tests have a really funky way to randomize the tests
>(multiply by 31), which means that the table lookup version will have
>very close to 100% branch prediction hits for the pattern of switching
>between the top and low byte.

The big problem is that the distribution is wrong. I think we want
values of the *output* to be about equally probable, so the branch is
taken 50%. And of course we don't want it to be too predictable.

Of course there's the case where it runs smoothly with some stupid toy
test program, but goes haywire when it's inside a 'real' program. Miss
rate higher than 10% scares me; mispredicted branches are -bad-.

That explains my interest in 'branchless' code. I don't care if it
tests worse for a simple case, if it can avoid some bad behavior.

Terje Mathisen

unread,

Apr 26, 2012, 8:47:22 PM4/26/12

to

In which case I'll have to repeat my suggestion of using the table-less
fp version: Same speed (within +/- 15% or so), zero branches.

If you do this a lot then you can very easily pipeline multiple
conversions and get the throughput to approximate one/cycle.

With a compiler which knows about things like vectorized SSE code, I
would expect 4 such conversions in parallel:

;; Load 4 32-bit ints, converting to float:
cvtdq2ps xmm0,[four_32_bit_int_inputs]

;; shift right 23 bits
psrld xmm0,23

;; Subtract the 127 bias
psubd xmm0,[four_X_127]

If you can get your compiler to generate approximately this code, then
you could be faster than scalar Bit Scan opcodes. :-)

Terje