Here's a faster implementation of generic_fls, which I discovered accidentally
by not noticing 2.5 already had a generic_fls, and so I rolled my own. Like
the incumbent, it's O(log2(bits)), but there's not a lot of resemblance beyond
that. I think the new algorithm is inherently more parallelizable than the
traditional approach. A processor that can speculatively evaluate both sides
of a conditional would benefit even more than the PIII I tested on.
The algorithm works roughly as follows: to find the highest bit in a value of
a given size, test the higher half for any one bits, then recursively apply
the algorithm to one of the two halves, depending on the test. Once down to 8
bits, enumerate all the cases:
return n & 0xf0? (n & 0xc0? (n & 0x80? 8: 7): (n & 0x20? 6: 5)):
n & 0x0f? (n & 0x0c? (n & 0x08? 4: 3): (n & 0x02? 2: 1)): 0;
The above expression can be considerably optimized by noting that once we get
down to just two bits, at least one of which is known to be a one bit, it's
faster to shift out the higher of the two bits and add it directly to the
result than to evaluate a conditional.
A sneaky optimization is possible for the lowest two bits: the four values
{0, 1, 2, 3} map directly onto three of the four wanted results {0, 1, 2, 2},
so a little bit bashing takes care of both the conditional mentioned above
and the test that would otherwise be needed for the zero case. The resulting
optimized code is sufficiently obfuscated for an IOCCC entry, but it's fast:
return n & 0xf0?
n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
In short, to find the high bit of a 32 bit value, the algorithm enumerates
all 32 possibilities using a binary search. Inlines clarify the recursion,
and as a fringe benefit, give us a handy set of 8, 16, 32 and 64 bit
function flavors, the shorter versions being a little faster if you can use
them.
The thing that makes it fast (I think) is that the expressions at the leaves
can be evaluated in parallel with the conditional tests - that is, it's
possible to compute the results before we know exactly which one is needed.
Another thing that may contribute to the speed is that the algorithm is doing
relatively more reading than writing, compared to the current version.
Though I did not test it, I believe the speedup will carry over to assembly
implementations as well.
There are still some optimization possibilities remaining. For example, in
some of the evaluation cases the zero case doesn't have to be evaluated, so
a little arithmetic can be saved. But then the helper functions wouldn't be
useable as sub-functions in their own right any more, so I don't think the
small speedup is worth it for the decreased orthogonality.
The improvement on a PIII varies from about 1.43x with gcc -O2 to 2.08x at
-O3. The benchmark runs 2**32 iterations, evaluating all 32 bit cases.
Roughly speaking, at O3 it takes about 10 cycles to do the job:
      Old      New     Empty loop  Actual old  Actual new  Speedup
O2:   111.267  94.129  54.014      57.253      40.115      1.43
O3:   95.875   53.018  13.200      82.675      39.818      2.08
The test program is here:
http://people.nl.linux.org/~phillips/test_fls.c
The patch is against 2.5.68-bk9.
Regards,
Daniel
--- 2.5.68.clean/include/linux/bitops.h 2003-04-20 04:51:12.000000000 +0200
+++ 2.5.68/include/linux/bitops.h 2003-04-30 01:29:27.000000000 +0200
@@ -41,33 +41,35 @@
* fls: find last bit set.
*/
-extern __inline__ int generic_fls(int x)
+/* Faster generic_fls */
+/* (c) 2002, D.Phillips and Sistina Software */
+/* Licensed under Version 2 of the GPL */
+
+static inline unsigned fls8(unsigned n)
+{
+ return n & 0xf0?
+ n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
+ n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
+}
+
+static inline unsigned fls16(unsigned n)
+{
+ return n & 0xff00? fls8(n >> 8) + 8: fls8(n);
+}
+
+static inline unsigned fls32(unsigned n)
+{
+ return n & 0xffff0000? fls16(n >> 16) + 16: fls16(n);
+}
+
+static inline unsigned fls64(unsigned long long n) /* should be u64 */
{
- int r = 32;
+ return n & 0xffffffff00000000? fls32(n >> 32) + 32: fls32(n);
+}
- if (!x)
- return 0;
- if (!(x & 0xffff0000u)) {
- x <<= 16;
- r -= 16;
- }
- if (!(x & 0xff000000u)) {
- x <<= 8;
- r -= 8;
- }
- if (!(x & 0xf0000000u)) {
- x <<= 4;
- r -= 4;
- }
- if (!(x & 0xc0000000u)) {
- x <<= 2;
- r -= 2;
- }
- if (!(x & 0x80000000u)) {
- x <<= 1;
- r -= 1;
- }
- return r;
+static inline int generic_fls(int n)
+{
+ return fls32(n);
}
extern __inline__ int get_bitmask_order(unsigned int count)
On Wed, Apr 30, 2003 at 04:46:23AM +0200, Daniel Phillips wrote:
> Here's a faster implementation of generic_fls, that I discovered accidently,
> by not noticing 2.5 already had a generic_fls, and so I rolled my own. Like
> the incumbent, it's O log2(bits), but there's not a lot of resemblance beyond
> that. I think the new algorithm is inherently more parallelizable than the
> traditional approach. A processor that can speculatively evaluate both sides
> of a conditional would benefit even more than the PIII I tested on.
I did some tests on an Athlon MP2200 (1.8 GHz) and a Pentium 4/2.0 GHz, with
both gcc 2.95.3 and gcc 3.2.3.
The results are interesting, because they show that the compiler used is far
more important than the code! To sum up quickly, your code is faster than the
old one on an Athlon with gcc-2.95.3 -O3, and in every combination with
gcc-3.2.3. On a P4, the new code compiled with 3.2.3 is still slower than the
old code with 2.95.3.
But gcc's optimizer seems to get worse with each new version. The NIL
checksum takes 3 times longer to complete on gcc3/athlon than on gcc2/athlon!
And in some tests, the P4 goes faster when the code is optimized for the
Athlon than for the P4!
Oh, and the new2 function is another variant I have used in the past, but it
is slower here. It performed well on older machines with gcc 2.7.2. Its main
problem may be the code size, due to the 32-bit masks. Your code has the
advantage of being short and working on small data sets.
#define new2_fls32(___a) ({ \
register int ___x, ___bits = 0; \
if (___a) { \
___x = (___a); \
++___bits; \
if (___x & 0xffff0000) { ___x &= 0xffff0000; ___bits += 16;} \
if (___x & 0xff00ff00) { ___x &= 0xff00ff00; ___bits += 8;} \
if (___x & 0xf0f0f0f0) { ___x &= 0xf0f0f0f0; ___bits += 4;} \
if (___x & 0xcccccccc) { ___x &= 0xcccccccc; ___bits += 2;} \
if (___x & 0xaaaaaaaa) { ___x &= 0xaaaaaaaa; ___bits += 1;} \
} \
___bits; \
})
Here are the results.
Cheers,
Willy
===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====
4294967295 iterations of nil... checksum = 4294967295
real 0m4.778s
user 0m4.770s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265
real 0m37.923s
user 0m37.920s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m47.956s
user 0m47.920s
sys 0m0.030s
4294967295 iterations of new2... checksum = 4294967295
real 0m43.884s
user 0m43.860s
sys 0m0.020s
===== gcc-2.95.3 -march=i686 -O3 / athlon MP/2200 =====
4294967295 iterations of nil... checksum = 4294967295
real 0m4.779s
user 0m4.770s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265
real 0m39.235s
user 0m39.180s
sys 0m0.050s
4294967295 iterations of new... checksum = 4294967265
real 0m32.175s
user 0m32.170s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295
real 0m37.560s
user 0m37.520s
sys 0m0.020s
===== gcc-2.95.3 -march=i686 -O3 / Pentium4 2.0 GHz =====
4294967295 iterations of nil... checksum = 4294967295
real 0m4.370s
user 0m4.370s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m22.148s
user 0m22.150s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m25.854s
user 0m25.850s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295
real 0m30.743s
user 0m28.930s
sys 0m0.160s
==== gcc 3.2.3 -march=i686 -O3 / Pentium 4 / 2.0 GHz ====
4294967295 iterations of nil... checksum = 4294967295
real 0m6.719s
user 0m6.710s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m45.571s
user 0m43.510s
sys 0m0.180s
4294967295 iterations of new... checksum = 4294967265
real 0m28.251s
user 0m26.420s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265
real 0m48.881s
user 0m47.080s
sys 0m0.000s
-> awful, 60% more time than with gcc-2.95.3 !
==== gcc 3.2.3 -march=pentium4 -O3 / Pentium 4 / 2.0 GHz ====
(yes, gcc 3.2 may be producing better code... for the eyes, not
for the CPU)
4294967295 iterations of nil... checksum = 4294967295
real 0m4.422s
user 0m4.420s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m34.950s
user 0m34.950s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m27.667s
user 0m25.810s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295
real 0m42.945s
user 0m42.760s
sys 0m0.160s
==== gcc 3.2.3 -march=athlon-mp -O3 / Pentium 4 / 2.0 GHz ====
4294967295 iterations of nil... checksum = 4294967295
real 0m6.486s
user 0m6.490s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m41.240s
user 0m39.220s
sys 0m0.160s
4294967295 iterations of new... checksum = 4294967265
real 0m25.780s
user 0m25.780s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295
real 0m51.789s
user 0m49.750s
sys 0m0.010s
==== gcc 3.2.3 -march=athlon-mp -O3 / Athlon MP2200 / 1.8 GHz ====
4294967295 iterations of nil... checksum = 4294967295
real 0m14.544s
user 0m14.540s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m44.609s
user 0m44.600s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m26.765s
user 0m26.760s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967295
real 1m6.066s
user 1m6.030s
sys 0m0.030s
==== gcc 3.2.3 -march=pentium4 -O3 / Athlon MP2200 / 1.8 GHz ====
4294967295 iterations of nil... checksum = 4294967295
real 0m10.775s
user 0m10.750s
sys 0m0.030s
4294967295 iterations of old... checksum = 4294967265
real 0m45.837s
user 0m45.830s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m24.414s
user 0m24.410s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967295
real 1m7.299s
user 1m7.280s
sys 0m0.010s
==== gcc 3.2.3 -march=i686 -O3 / Athlon MP2200 / 1.8 GHz ====
4294967295 iterations of nil... checksum = 4294967295
real 0m13.010s
user 0m13.000s
sys 0m0.010s
> But gcc's optimizer seems to be even worse with each new version. The NIL
> checksum takes 3 times more to complete on gcc3/athlon than gcc2/athlon !!!
> And on some test, the P4 goes faster when the code is optimized for athlon
> than for P4 !
It would be interesting to compare the assembly output of the
compilers to see what's happening. The gcc folks appreciate test cases
and this function seems simple and sensitive enough to make a good one.
> ===== gcc-2.95.3 -march=i686 -O2 / athlon MP/2200 (1.8 GHz) =====
-march=athlon would give gcc a chance to make better scheduling
decisions, which could make the results more sensible.
> ==== gcc 3.2.3 -march=i686 -O3 / Pentium 4 / 2.0 GHz ====
This version of gcc accepts -march=pentium4, which again might make
gcc's behavior less bizarre.
Oh, you tried that too, sorry.
And contrary to my memory, gcc 2.95.3 doesn't seem to support
march=athlon. I guess I need to go to sleep.
> Here's a faster implementation of generic_fls, that I discovered
> accidently, by not noticing 2.5 already had a generic_fls, and so I
> rolled my own. Like the incumbent, it's O log2(bits), but there's
> not a lot of resemblance beyond that. I think the new algorithm is
> inherently more parallelizable than the traditional approach. A
> processor that can speculatively evaluate both sides of a
> conditional would benefit even more than the PIII I tested on.
gcc 3.4 will have a __builtin_ctz function which can be used for this.
It will emit special instructions on CPUs that support it (i386, Alpha
EV67), and use a lookup table on others, which is very boring, but
also faster.
--
Falk
Actually, __builtin_clz:
http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
Not having a gcc 3.4 handy, I couldn't test it, but I did notice that the
built-in ffs is very fast. Perhaps all such standard functions will end up
as built-ins instead of kernel library functions, some very long time in the
future. If old compilers ever do finally fade away, that is.
It's somewhat annoying that __builtin_clz leaves the all-ones case dangling
instead of returning -1.
Regards,
Daniel
> It's somewhat annoying that __builtin_clz leaves the all-ones case
> dangling instead of returning -1.
I assume you mean all-zero... Well, the semantics of the corresponding
instructions simply differ too much. clz(0) is 31 on some, 32 on some
others, and undefined on some more architectures. So nailing it down
would incur some performance penalty.
--
Falk
It's really cute.
> It performed well on older machines with gcc 2.7.2.
> Its main problem may be the code size, due to the 32 bits masks. Your code
> has the advantage of being short and working on small data sets.
My code just looks small. At -O2 it expands to 4 times the size of the
existing fls and 5 times at -O3. The saving grace is that, being faster, it
doesn't have to be inline, saving code size overall.
> [code]
>
> Here are the results...
The overall trend seems to be that my version is faster with newer compilers
and higher optimization levels, and overall, the newer compilers are
producing faster code. The big anomaly is where the original version turns
in the best time of the lot, running on PIV, compiled at -O3 with 2.95.3, as
you pointed out.
Well, it raises interesting questions, but I'm not entirely sure what they are
;-)
Regards,
Daniel
:-)
> Well, the semantics of the corresponding
> instructions simply differ too much. clz(0) is 31 on some, 32 on some
> others, and undefined on some more architectures. So nailing it down
> would incur some performance penalty.
It would only penalize stupidly broken architectures; it would help the ones
that got it right. It's hard to imagine how a sane person could think the
result of clz((u32)0) should be other than 32.
Sigh.
Regards,
Daniel
Classic mistake. Lookup tables are only faster in benchmarks, they are
almost always slower in real life. You only need to miss in the cache
_once_ on the lookup to lose all the time you won on the previous one
hundred calls.
"Small and simple" is almost always better than the alternatives. I
suspect that's one reason why older versions of gcc often generate code
that actually runs faster than newer versions: the newer versions _look_
like they do a better job, but..
Linus
> Classic mistake. Lookup tables are only faster in benchmarks, they
> are almost always slower in real life. You only need to miss in the
> cache _once_ on the lookup to lose all the time you won on the
> previous one hundred calls.
It seems to me if you call the function so seldom the table drops out
of the cache, it is irrelevant how long it takes anyway.
> "Small and simple" is almost always better than the alternatives.
Well, if a lookup table isn't "small and simple", I don't know what
is.
--
Falk
Yeah, but then this whole discussion is irrelevant too.
I'm just saying that micro-benchmarks of operations that do not make any
sense on their own are _bad_ measures of real performance. That lookup is
going to lose to the "natural" code even if it hits the L2, and misses
only the L1.
Any piece of code that only works well when it sits in (and owns) the L1
is a piece of crap.
> Well, if a lookup table isn't "small and simple", I don't know what
> is.
Something that calculates it directly? Like, perchance, the current code?
There is _never_ any excuse to use a lookup table for something that can
be calculated with a few simple instructions. That's just stupid.
Linus
> There is _never_ any excuse to use a lookup table for something that
> can be calculated with a few simple instructions. That's just
> stupid.
Well, the "few simple instructions" are 28 instructions on Alpha for
example, including 6 data-dependent branches. So I don't think it's
*that* stupid.
But I agree that benchmarking this stand-alone isn't particularly
useful.
--
Falk
You're comparing apples to oranges.
Clearly you're not going to make _one_ load to get fls, since having a
4GB lookup array for a 32-bit fls would be "somewhat" wasteful.
So the lookup table would probably look up just the last 8 bits.
So the lookup table version is several instructions in itself, doing about
half of what the calculating version needs to do _anyway_. Including those
data-dependent branches.
Linus
> Clearly you're not going to make _one_ load to get fls, since having
> a 4GB lookup array for a 32-bit fls would be "somewhat" wasteful.
>
> So the lookup table would probably look up just the last 8 bits.
>
> So the lookup table version is several instructions in itself, doing
> about half of what the calculating version needs to do _anyway_.
Right.
> Including those data-dependent branches.
Well, at least on Alpha, gcc can optimize them all away except one.
Note I'm not really arguing Linux should use a lookup table for fls;
I'm just arguing putting it as default into libgcc for architectures
that don't provide a special version (which aren't particularly many)
isn't stupid. But probably I'm just biased, because I put it there :)
--
Falk
I agree that this one lies in a gray area, being linearly faster (PIV
notwithstanding) but bigger too.
In the dawn of time, before God gave us Cache, my version would have been the
fastest, because it executes the fewest instructions. In the misty future,
as cache continues to scale and processors sprout more parallel execution
units, it will be clearly better once again. For now, it's marginal, and
after all, what's a factor of two between friends?
Regards,
Daniel
return fls_table[n];
That'll be faster in benchmarks, possibly slower in real world. As usual.
Daniel,
I must acknowledge that your simple code was not easy to beat! You can try
this one on your PIII; I could only test it on an Athlon mobile and a P4.
With gcc 2.95.3 it gives me a boost of about 25%, because it seems gcc cannot
optimize shifts efficiently. With 3.2.3, however, it's between 0 and 5%
depending on optimization/CPU. But the code is a lot smaller (about 105 bytes
total for the function). I also coded a version with parallel execution of
several branches, but it was slower, because every branch not taken was
computed for nothing, and the glue around it had a cost. I tried to implement
it with as many booleans as possible so that the compiler can try to emit
conditional instructions and make use of carries. I did the operations on an
unsigned char because it has some advantages, such as 0x80 being negative ;-)
I also left the other tests in, so that you can try them if you want. Looking
through the generated code, it's clear that an assembler version would be
faster. Not much, but a bit.
Cheers,
Willy
static inline
unsigned new2_fls32(unsigned n32)
{
unsigned t = 0;
unsigned char n;
if (n32 >> 16) {
n32 >>= 16;
t = 16;
}
n = n32;
if (n32 >> 8) {
n = (unsigned char)(n32 >> 8);
t += 8;
}
#if 0
else
if (!n) return t;
// this one uses comparisons and jumps, but is short and fast
return (n < 0x10) ?
t + ((n < 0x04) ? 2 - (n < 0x02) : 4 - (n < 0x08)) :
t + ((n < 0x40) ? 6 - (n < 0x20) : 8 - (n < 0x80)) ;
#endif
#if 0
// this one compiles into several parallel branches, but is slower
if (n < 0x10) {
return t += (n < 0x04) ? n - ((n + 1) >> 2) : 4 - (n < 8);
} else {
return t += (n < 0x40) ? 6 - (n < 0x20) : 8 - (n < 0x80) ;
}
#endif
#if 0
// the same as above
return t += (n < 0x10) ? (n < 0x04) ? n - ((n + 1) >> 2) : 4 - (n < 8) :
(n < 0x40) ? 6 - (n < 0x20) : 8 - (n < 0x80) ;
#endif
#if 1
// this one uses comparisons and jumps, but is short and the fastest
return (n < 0x10) ?
(n < 0x04) ? n + t - ((n + 1) >> 2) : 4 + t - (n < 8) :
(n < 0x40) ? 6 + t - (n < 0x20) : 8 + t - (n < 0x80) ;
#endif
}
It ought to be basically the same as ffs because if I remember rightly
ffs(x^(x-1)) == fls(x)
Been a long time so I may have it wrong
> On Mer, 2003-04-30 at 17:16, Linus Torvalds wrote:
> > Clearly you're not going to make _one_ load to get fls, since having a
> > 4GB lookup array for a 32-bit fls would be "somewhat" wasteful.
>
> It ought to be basically the same as ffs because if I remember rightly
>
> ffs(x^(x-1)) == fls(x)
I don't think so. Maybe you are thinking of ffs(x) == fls(x & -x). So
you can calculate ffs with fls, but not easily the other way round.
--
Falk
Well I asked my CPU to double check. If I got the code right it thinks that
ffs(i^(i-1))==fls(i) is true for 1->2^32-1
If you think about it, it seems to make sense:
x-1 turns the lowest 100..0 sequence into 01......1
The xor removes unchanged higher bits
so
10100
-1 10011
XOR 00111
So did anybody time this? ffs() we have hardware support for on x86, and
it's even reasonably efficient on some CPU's .. So this _should_ beat all
new-comers, and obviously some people already have benchmark programs
ready and willing..
Linus
Unfortunately, on digging up my notes, it seems I got ffs/fls backwards, so it
doesn't help us a great deal.
> On 30 Apr 2003, Alan Cox wrote:
> >
> > It ought to be basically the same as ffs because if I remember rightly
> >
> > ffs(x^(x-1)) == fls(x)
>
> So did anybody time this? ffs() we have hardware support for on x86,
> and it's even reasonably efficient on some CPU's ..
There appears to be hardware support for fls, too. This is what gcc
generates for
int fls(int x) {
return x ? 32 - __builtin_clz(x) : 0;
}
fls:
pushl %ebp
xorl %edx, %edx
movl %esp, %ebp
movl 8(%ebp), %eax
testl %eax, %eax
je .L3
bsrl %eax, %ecx
movl $32, %edx
xorl $31, %ecx
subl %ecx, %edx
.L3:
popl %ebp
movl %edx, %eax
ret
--
Falk
Was something ifdef'd incorrectly? Otherwise, there is something the PIII
hates about that code. I got 107 seconds on the PIII, vs 53 seconds for my
posted code at O3, and virtually no difference at O2. (gcc 3.2.3)
Regards,
Daniel
Yeah, except if you want best code generation you should probably use
static inline int fls(int x)
{
int bit;
/* Only well-defined for non-zero */
asm("bsrl %1,%0":"=r" (bit):"rm" (x));
return x ? bit : 32;
}
which should generate (assuming you give "-march=xxxx" to tell gcc to use
cmov):
movl $32, %eax
bsrl %edx,%ecx
testl %edx, %edx
cmovne %ecx, %eax
which looks pretty optimal.
Linus
> Yeah, except if you want best code generation you should probably use
>
> static inline int fls(int x)
> {
> int bit;
> /* Only well-defined for non-zero */
> asm("bsrl %1,%0":"=r" (bit):"rm" (x));
"g" should work for the second operand too and it'll give gcc
slightly more choices with possibly better code.
But the __builtin is probably preferable if gcc supports it because
a builtin can be scheduled, inline assembly can't.
-Andi
"g" allows immediates, which is _not_ correct.
>But the __builtin is probably preferable if gcc supports it because
>a builtin can be scheduled, inline assembly can't.
You're over-estimating gcc builtins, and underestimating inline
assembly.
gcc builtins usually generate _worse_ code than well-written inline
assembly, since builtins try to be generic (at least the ones that are
cross-architecture).
And inline asms schedule as well as any built-in, unless they are marked
volatile (either explicitly or implicitly by not having any outputs).
And the proof is in the pudding: I'll bet you a dollar my code generates
better code. AND my code works on the intel compiler _and_ with older
gcc's.
The fact is, gcc builtins are almost never worth it.
Linus
> Daniel Phillips <dphi...@sistina.com> wrote:
> >
> > +static inline unsigned fls8(unsigned n)
> > +{
> > + return n & 0xf0?
> > + n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
> > + n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
> > +}
>
> return fls_table[n];
>
>
> That'll be faster in benchmarks, possibly slower in real world. As usual.
>
Here is a table version of fls. Yes, it is faster than the others.
typedef struct fls_table_t {
unsigned min;
unsigned max;
unsigned value;
} fls_table_t;
static fls_table_t fls_table[] = {
{0, 1, 1},
{1, 2, 2},
{2, 4, 3},
{4, 8, 4},
{8, 16, 5},
{16, 32, 6},
{32, 64, 7},
{64, 128, 8},
{128, 256, 9},
{256, 512, 10},
{512, 1024, 11},
{1024, 2048, 12},
{2048, 4096, 13},
{4096, 8192, 14},
{8192, 16384, 15},
{16384, 32768, 16},
{32768, 65536, 17},
{65536, 131072, 18},
{131072, 262144, 19},
{262144, 524288, 20},
{524288, 1048576, 21},
{1048576, 2097152, 22},
{2097152, 4194304, 23},
{4194304, 8388608, 24},
{8388608, 16777216, 25},
{16777216, 33554432, 26},
{33554432, 67108864, 27},
{67108864, 134217728, 28},
{134217728, 268435456, 29},
{268435456, 536870912, 30},
{536870912, 1073741824, 31},
{1073741824, 2147483648, 32},
{0, -1, 32},
};
static inline int fls_table_fls(unsigned n)
{
unsigned i = 0;
do {
i++;
} while (n <= fls_table[i].max && n > fls_table[i].min);
return fls_table[i].value;
}
--test log is here----
model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295
real 0m22.311s
user 0m21.850s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265
real 0m34.673s
user 0m33.960s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m57.702s
user 0m56.530s
sys 0m0.040s
4294967295 iterations of new2... checksum = 4294967265
real 0m41.709s
user 0m40.810s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967295
real 0m22.274s
user 0m21.640s
sys 0m0.010s
------------------
==== -march=i686 -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295
real 0m29.501s
user 0m28.880s
sys 0m0.040s
4294967295 iterations of old... checksum = 4294967265
real 0m49.054s
user 0m47.840s
sys 0m0.040s
4294967295 iterations of new... checksum = 4294967265
real 0m52.331s
user 0m51.270s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265
real 1m7.043s
user 1m5.660s
sys 0m0.010s
4294967295 iterations of table... checksum = 4294967295
real 0m45.783s
user 0m44.860s
sys 0m0.010s
------------------
==== -O2 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295
real 0m34.395s
user 0m33.670s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m52.277s
user 0m51.230s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m54.750s
user 0m53.640s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265
real 0m57.728s
user 0m56.590s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967295
real 0m44.540s
user 0m43.600s
sys 0m0.020s
------------------
==== -O3 ====
test_fls.c:110: warning: decimal constant is so large that it is unsigned
4294967295 iterations of nil... checksum = 4294967295
real 0m16.196s
user 0m15.880s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m34.342s
user 0m33.630s
sys 0m0.010s
4294967295 iterations of new... checksum = 4294967265
real 0m58.554s
user 0m57.380s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265
real 0m37.140s
user 0m36.320s
sys 0m0.040s
4294967295 iterations of table... checksum = 4294967295
real 0m21.723s
user 0m21.270s
sys 0m0.000s
------------------
--
Hu Gang / Steve
Email : hua...@soulinfo.com, st...@soulinfo.com
GPG FinePrint: 4099 3F1D AE01 1817 68F7 D499 A6C2 C418 86C8 610E
ICQ# : 205800361
Registered Linux User : 204016
nooo.. That has a big cache footprint. At the very least you should use
a binary search. gcc will do it for you:
switch (n) {
case 0 ... 1:
return 1;
case 2 ... 3:
return 2;
case 4 ... 7:
return 3;
case 8 ... 15:
return 4;
etc.
> Daniel Phillips <dphi...@sistina.com> wrote:
> >
> > +static inline unsigned fls8(unsigned n)
> > +{
> > + return n & 0xf0?
> > + n & 0xc0? (n >> 7) + 7: (n >> 5) + 5:
> > + n & 0x0c? (n >> 3) + 3: n - ((n + 1) >> 2);
> > +}
>
> return fls_table[n];
>
>
> That'll be faster in benchmarks, possibly slower in real world. As usual.
>
Here is the table version of fls. All of the files can be downloaded from:
http://soulinfo.com/~hugang/kernel/
> nooo.. That has a big cache footprint. At the very least you should use
> a binary search. gcc will do it for you:
>
> switch (n) {
> case 0 ... 1:
> return 1;
> case 2 ... 3:
> return 2;
> case 4 ... 7:
> return 3;
> case 8 ... 15:
> return 4;
>
> etc.
It is here.
--------------------
static inline int fls_table_fls(unsigned n)
{
switch (n) {
case 0 ... 0: return 1;
case 1 ... 1: return 2;
case 2 ... 3: return 3;
case 4 ... 7: return 4;
case 8 ... 15: return 5;
case 16 ... 31: return 6;
case 32 ... 63: return 7;
case 64 ... 127: return 8;
case 128 ... 255: return 9;
case 256 ... 511: return 10;
case 512 ... 1023: return 11;
case 1024 ... 2047: return 12;
case 2048 ... 4095: return 13;
case 4096 ... 8191: return 14;
case 8192 ... 16383: return 15;
case 16384 ... 32767: return 16;
case 32768 ... 65535: return 17;
case 65536 ... 131071: return 18;
case 131072 ... 262143: return 19;
case 262144 ... 524287: return 20;
case 524288 ... 1048575: return 21;
case 1048576 ... 2097151: return 22;
case 2097152 ... 4194303: return 23;
case 4194304 ... 8388607: return 24;
case 8388608 ... 16777215: return 25;
case 16777216 ... 33554431: return 26;
case 33554432 ... 67108863: return 27;
case 67108864 ... 134217727: return 28;
case 134217728 ... 268435455: return 29;
case 268435456 ... 536870911: return 30;
case 536870912 ... 1073741823: return 31;
default: break;
}
return 32;
}
It's now being tested; I will put the log and file at:
http://soulinfo.com/~hugang/kernel/
Here is the corrected table version of fls; the previous version was wrong.
-------
static inline int fls_table_fls(unsigned n)
{
switch (n) {
case 0: return 0;
case 1 ... 1: return 1;
case 2 ... 3: return 2;
case 4 ... 7: return 3;
case 8 ... 15: return 4;
case 16 ... 31: return 5;
case 32 ... 63: return 6;
case 64 ... 127: return 7;
case 128 ... 255: return 8;
case 256 ... 511: return 9;
case 512 ... 1023: return 10;
case 1024 ... 2047: return 11;
case 2048 ... 4095: return 12;
case 4096 ... 8191: return 13;
case 8192 ... 16383: return 14;
case 16384 ... 32767: return 15;
case 32768 ... 65535: return 16;
case 65536 ... 131071: return 17;
case 131072 ... 262143: return 18;
case 262144 ... 524287: return 19;
case 524288 ... 1048575: return 20;
case 1048576 ... 2097151: return 21;
case 2097152 ... 4194303: return 22;
case 4194304 ... 8388607: return 23;
case 8388608 ... 16777215: return 24;
case 16777216 ... 33554431: return 25;
case 33554432 ... 67108863: return 26;
case 67108864 ... 134217727: return 27;
case 134217728 ... 268435455: return 28;
case 268435456 ... 536870911: return 29;
case 536870912 ... 1073741823: return 30;
case 1073741824 ... 2147483647: return 31;
default:return 32;
}
}
--here is the test log-----
model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -march=i686 -O3 ====
4294967295 iterations of nil... checksum = 4294967295
real 0m15.548s
user 0m15.500s
sys 0m0.010s
4294967295 iterations of old... checksum = 4294967265
real 0m34.778s
user 0m34.690s
sys 0m0.020s
4294967295 iterations of new... checksum = 4294967265
real 0m56.330s
user 0m56.190s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265
real 0m40.755s
user 0m40.680s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967265
real 0m49.928s
user 0m49.770s
sys 0m0.050s
------------------
==== -march=i686 -O2 ====
4294967295 iterations of nil... checksum = 4294967295
real 0m28.350s
user 0m28.280s
sys 0m0.020s
4294967295 iterations of old... checksum = 4294967265
real 0m48.199s
user 0m48.100s
sys 0m0.000s
4294967295 iterations of new... checksum = 4294967265
real 0m50.986s
user 0m50.890s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265
real 1m5.356s
user 1m5.210s
sys 0m0.010s
4294967295 iterations of table... checksum = 4294967265
real 0m45.238s
user 0m45.150s
sys 0m0.000s
------------------
==== -O2 ====
4294967295 iterations of nil... checksum = 4294967295
real 0m33.503s
user 0m33.410s
sys 0m0.020s
4294967295 iterations of old... checksum = 4294967265
real 0m50.571s
user 0m50.450s
sys 0m0.020s
4294967295 iterations of new... checksum = 4294967265
real 0m53.324s
user 0m53.200s
sys 0m0.010s
4294967295 iterations of new2... checksum = 4294967265
real 0m56.360s
user 0m56.250s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967265
real 0m46.524s
user 0m46.440s
sys 0m0.000s
------------------
==== -O3 ====
4294967295 iterations of nil... checksum = 4294967295
real 0m5.421s
user 0m5.410s
sys 0m0.000s
4294967295 iterations of old... checksum = 4294967265
real 0m29.744s
user 0m29.670s
sys 0m0.010s
4294967295 iterations of new... checksum = 4294967265
real 0m53.160s
user 0m53.050s
sys 0m0.000s
4294967295 iterations of new2... checksum = 4294967265
real 0m31.464s
user 0m31.400s
sys 0m0.010s
4294967295 iterations of table... checksum = 4294967265
real 0m46.546s
user 0m46.440s
sys 0m0.000s
------------------
No, it was correct. It simply means that some CPUs prefer bit shifting while
others prefer comparisons and sometimes jumps, that mostly depends on the
ALU and the branch prediction unit, it seems. That shows us that if we include
such code in the kernel, it has to be arch-specific, so we may end up coding
it in assembler, keeping your code as the default one when no asm has been coded
for a given arch. BTW, has someone benchmarked BSF/BSR on x86 ? It should be
faster, but it's also possible that a poor microcode implements it with a one
bit/cycle algo, which will result in one instruction not being as fast as your
code.
Cheers,
Willy
> Here is table version of the fls. Yes it fast than other.
Sorry, but this code returns wrong results. Test it with fewer
iterations; it doesn't return the same result.
> unsigned i = 0;
>
> do {
> i++;
> } while (n <= fls_table[i].max && n > fls_table[i].min);
You never even compare with table[0] !
> --test log is here----
I recoded a table based on your idea, and it's clearly faster than the others,
and even faster than yours:
============
static unsigned tbl[33] = { 2147483648, 1073741824, 536870912, 268435456,
134217728, 67108864, 33554432, 16777216,
8388608, 4194304, 2097152, 1048576,
524288, 262144, 131072, 65536,
32768, 16384, 8192, 4096,
2048, 1024, 512, 256,
128, 64, 32, 16,
8, 4, 2, 1, 0 };
static inline int fls_tbl_fls(unsigned n)
{
unsigned i = 0;
while (n < tbl[i])
i++;
return 32 - i;
}
===========
it returns the right result. Compiled with gcc 2.95.3 -march=i686 -O3, athlon
1.533 GHz (1800+) :
4294967295 iterations of new... checksum = 4294967265
real 1m6.182s
user 1m6.060s
sys 0m0.133s
4294967295 iterations of new2... checksum = 4294967265
real 0m36.504s
user 0m36.294s
sys 0m0.202s
4294967295 iterations of fls_table... checksum = 4294967295
real 0m21.962s
user 0m21.833s
sys 0m0.124s
4294967295 iterations of fls_tbl... checksum = 4294967265
real 0m19.268s
user 0m19.102s
sys 0m0.168s
The same compiled with gcc-3.2.3 :
4294967295 iterations of fls_table... checksum = 4294967295
real 0m14.211s
user 0m14.210s
sys 0m0.000s
Cheers,
Willy
Did you see the mail by Andrew Morton? -----
From: Andrew Morton <ak...@digeo.com>
To: hugang <hug...@soulinfo.com>
Cc: linux-...@vger.kernel.org
Subject: Re: [RFC][PATCH] Faster generic_fls
Date: Wed, 30 Apr 2003 22:11:29 -0700
Sender: linux-ker...@vger.kernel.org
X-Mailer: Sylpheed version 0.8.11 (GTK+ 1.2.10; i586-pc-linux-gnu)
hugang <hug...@soulinfo.com> wrote:
>
> Here is table version of the fls. Yes it fast than other.
nooo.. That has a big cache footprint. At the very least you should use
a binary search. gcc will do it for you:
switch (n) {
case 0 ... 1:
return 1;
case 2 ... 3:
return 2;
case 4 ... 7:
return 3;
case 8 ... 15:
return 4;
etc.
----------
I agree with him. The latest code I posted works fine, please check it.
--
Hu Gang / Steve
no sorry, I didn't see it, but I've read it now. I agree with him, that's
what I noticed here too. I also tried a binary search through the table in a
tree-like fashion, but the CPU doesn't like going back and forth through the
table at all ! I'm retrying with an "elastic mask".
Cheers
Willy
Except that the testl is unnecessary. The full semantics of
bsrl are:
if (source != 0) {
destination = <index of most significant set bit in source>;
ZF = 0;
} else {
destination = UNDEFINED;
ZF = 1;
}
Thus, there are two reasonable code sequences:
asm(" bsrl %2, %1"
"\n cmovne %1, %0" : "=r" (bit), "=r" (temp) : "rm" (x), "0" (32));
and (if I've got the earlyclobber syntax right):
asm(" bsrl %1, %0"
"\n cmoveq %2, %0" : "=&r,&r" (bit) : "0,rm" (x), "rm,rm" (32));
Note that in this latter case, I have to list %1 == %0 as an explicit
alternative, because otherwise the & on operand 0 would tell GCC to forbid
that combination and it's only %0 == %2 that's forbidden.
(The first alternative is listed first because it uses fewer registers
and so is preferred, all other things being equal.)
Unfortunately, I'm not sure how to let GCC choose between these two
alternatives...
Ok, I recoded the tree myself with if/else, and it's now faster than all others,
whatever the compiler. I have cut the tree to only 16-bit evaluations because
it was still faster than a full 32-bit tree on x86, probably because of the
code length. However, on Alpha EV6 (gcc-2.95.3), the 32-bit tree is faster.
The 32-bit code is also the same speed as the 16-bit one on gcc-3.2.3 if I
specify -mbranch-cost=2 (because gcc assumes that today's CPUs need one cycle
per jump).
Here is the function, followed by the logs executed on an Athlon.
Disclaimer: Daniel's PentiumIII seems to give very different results than my
Athlon, so every test should be considered for one arch only !!!
To quickly summarize the results: on gcc-3.2.3 it's 45% faster than the table algo,
I don't understand why, and just a little bit faster than Daniel's function.
On gcc-2.95.3, it's only 20% faster than the table, but 46% faster than Daniel's.
However, I have good news for you, the table code is 15% faster than my tree
on an Alpha EV6 ;-)
It may be because gcc can use different tricks on this arch.
Cheers
Willy
static inline int fls_tree_fls(unsigned n) {
#ifndef REALLY_WANT_A_32BITS_TREE
/* it seems as if working on half a tree is faster than the
complete tree.
*/
register t = 0;
if (n >= (1<<16)) {
n >>= 16;
t = 16;
}
if (n < (1<<8))
if (n < (1<<4))
if (n < 4)
return t + n - ((n + 1) >> 2);
else
return t + 3 + (n >= 8);
else
if (n < (1<<6))
return t + 5 + (n >= (1<<5));
else
return t + 7 + (n >= (1<<7));
else
if (n < (1<<12))
if (n < (1<<10))
return t + 9 + (n >= (1<<9));
else
return t + 11 + (n >= (1<<11));
else
if (n < (1<<14))
return t + 13 + (n >= (1<<13));
else
return t + 15 + (n >= (1<<15));
#else
/* perhaps this is faster on systems with BIG caches, but on
athlon, at least, it's slower than the previous one
*/
if (n < (1<<16))
if (n < (1<<8))
if (n < (1<<4))
if (n < 4)
return n - ((n + 1) >> 2);
else
return 3 + (n >= 8);
else
if (n < (1<<6))
return 5 + (n >= (1<<5));
else
return 7 + (n >= (1<<7));
else
if (n < (1<<12))
if (n < (1<<10))
return 9 + (n >= (1<<9));
else
return 11 + (n >= (1<<11));
else
if (n < (1<<14))
return 13 + (n >= (1<<13));
else
return 15 + (n >= (1<<15));
else
if (n < (1<<24))
if (n < (1<<20))
if (n < (1<<18))
return 17 + (n >= (1<<17));
else
return 19 + (n >= (1<<19));
else
if (n < (1<<22))
return 21 + (n >= (1<<21));
else
return 23 + (n >= (1<<23));
else
if (n < (1<<28))
if (n < (1<<26))
return 25 + (n >= (1<<25));
else
return 27 + (n >= (1<<27));
else
if (n < (1<<30))
return 29 + (n >= (1<<29));
else
return 31 + (n >= (1U<<31));
#endif
}
== results on Athlon 1800+ (1.53 GHz), gcc-2.95.3 -O3 -march=i686 :
4294967295 iterations of nil... checksum = 4294967295
real 0m5.600s
user 0m5.590s
sys 0m0.007s
4294967295 iterations of old... checksum = 4294967265
real 0m39.640s
user 0m39.499s
sys 0m0.141s
4294967295 iterations of new... checksum = 4294967265
real 0m45.088s
user 0m44.936s
sys 0m0.158s
4294967295 iterations of fls_table... checksum = 4294967264
real 0m35.988s
user 0m35.833s
sys 0m0.159s
4294967295 iterations of fls_tree... checksum = 4294967265
real 0m28.699s
user 0m28.605s
sys 0m0.096s
== results on Athlon 1800+ (1.53 GHz), gcc-3.2.3 -O3 -march=athlon :
4294967295 iterations of nil... checksum = 4294967295
real 0m16.624s
user 0m16.624s
sys 0m0.001s
4294967295 iterations of old... checksum = 4294967265
real 0m57.685s
user 0m57.568s
sys 0m0.125s
4294967295 iterations of new... checksum = 4294967265
real 0m37.513s
user 0m37.415s
sys 0m0.103s
4294967295 iterations of fls_table... checksum = 4294967264
real 0m56.068s
user 0m55.991s
sys 0m0.085s
4294967295 iterations of fls_tree...
checksum = 4294967265
real 0m36.636s
user 0m36.516s
sys 0m0.125s
=== alpha EV6 / 466 Mhz, gcc-2.95.3 -O3 ====
4294967295 iterations of new... checksum = 4294967265
real 2m19.951s
user 2m19.543s
sys 0m0.018s
4294967295 iterations of fls_table... checksum = 4294967265
real 1m12.825s
user 1m12.667s
sys 0m0.005s
4294967295 iterations of fls_tree... checksum = 4294967265
real 1m33.469s
user 1m33.242s
sys 0m0.013s
> On Thu, May 01, 2003 at 03:05:57PM +0800, hugang wrote:
> Ok, I recoded the tree myself with if/else, and it's now faster than
> all others, whatever the compiler.
Have you tried with not simply increasing, but random numbers? I guess
this could make quite a difference here because of branch prediction.
--
Falk
I thought about this, and indeed, that's what I used in the program with which
I benched the first function I sent yesterday. The problem with random numbers
is that generating them is so slow that you must build a giant table and apply
your tests to that table. So the problem mainly shifts to data cache misses,
which cost more than some operations. If you try it, you'll note that it's
difficult to get comparable results twice.
Other solutions include non-linear sequences, such as mixing sequential
values with BSWAP. E.g.: x ^ bswap(x) ^ bswap(x << 4).
Willy
> However, I have good news for you, the table code is 15% faster than my tree
> on an Alpha EV6 ;-)
> It may be because gcc can use different tricks on this arch.
>
> Cheers
> Willy
However, most of the code changes only increase performance a little. Do we really need it?
Here is a test on my machine; the fastest is still the table.
==== -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 ====
test_fls.c: In function `fls_tree_fls':
test_fls.c:78: warning: type defaults to `int' in declaration of `t'
4294967295 iterations of nil... checksum = 4294967295
real 0m21.852s
user 0m21.500s
sys 0m0.300s
4294967295 iterations of old... checksum = 4294967265
real 0m45.206s
user 0m44.800s
sys 0m0.310s
4294967295 iterations of new... checksum = 4294967265
real 0m45.235s
user 0m44.720s
sys 0m0.420s
4294967295 iterations of new2... checksum = 4294967265
real 0m59.615s
user 0m58.960s
sys 0m0.540s
4294967295 iterations of table... checksum = 4294967265
real 0m41.784s
user 0m41.420s
sys 0m0.280s
4294967295 iterations of tree... checksum = 4294967265
real 0m44.874s
user 0m44.410s
sys 0m0.380s
------------------
--
Hu Gang / Steve
On May 1, hugang wrote:
> On Thu, 1 May 2003 15:52:04 +0200
>
> Willy TARREAU <wi...@w.ods.org> wrote:
> > However, I have good news for you, the table code is 15% faster than my
> > tree on an Alpha EV6 ;-)
> > It may be because gcc can use different tricks on this arch.
> >
> > Cheers
> > Willy
>
> However, The most code changed has increase a little peformance, Do we
> really need it?
>
> Here is test in my machine, The faster is still table.
I think the table version is so fast because a uniform distribution of numbers is
tested. With this distribution half of the occurring values have the MSB set,
one quarter the next significant bit, and so on...
The tree version is approximately equally fast for every number, but the table
version is fast if a very significant bit is set and slow if a less
significant bit is set...
If small values occur more often than big ones the tree version should be
preferred; otherwise the following version may be very fast, too.
static inline int fls_shift(int x)
{
int bit = 32;
while(x > 0) {
--bit;
x <<= 1;
}
return x ? bit : 0;
}
I get the following times on my K6-III (compiled with -O3 -march=k6) for
4294967296 iterations:
fls_bsrl: (uses the 'bsrl' command via inline assembler)
real 3m39.059s
user 3m27.416s
sys 0m0.345s
fls_shift:
real 1m47.337s
user 1m41.801s
sys 0m0.185s
fls_tree:
real 1m56.746s
user 1m50.681s
sys 0m0.165s
The 32Bit tree is a bit slower here, too:
real 1m59.447s
user 1m53.375s
sys 0m0.200s
I did not test the table version as I think it cannot be faster than the shift
version...
Best regards
Thomas Schlichter
P.S.: I'd be happy to see how this performs on an Athlon or P-IV...
I think the original i386 did it with a one-bit-per-cycle algorithm,
anything since should be fine. In particular, on a P4 where I just tested,
the bsf seems to be 4 cycles over the whole input set (actually, my whole
loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
the rest could have been done in parallel).
Linus
>> BTW, has someone benchmarked BSF/BSR on x86 ? It should be
>> faster, but it's also possible that a poor microcode implements it with a one
>> bit/cycle algo, which will result in one instruction not being as fast as your
>> code.
>
> I think the original i386 did it with a one-bit-per-cycle algorithm,
> anything since should be fine. In particular, on a P4 where I just tested,
> the bsf seems to be 4 cycles over the whole input set (actually, my whole
> loop was 4 cycles per iteration, so 4 cycles is worst-case. I'm assuming
> the rest could have been done in parallel).
Just for comparison, the Pentium (Classic) manual says 6-43 clocks for
bsfl and 7-72 (!) for bsrl.
------
Chuck
because, as Falk said, the tests are incremental and the branch prediction
works very well. I proposed a simple scrambling function based on bswap. Please
consider this function :
f(i) = i ^ htonl(i) ^ htonl(i<<7)
It returns such values :
0x00000001 => 0x81000001
0x00000002 => 0x02010002
0x00000003 => 0x83010003
0x00000004 => 0x04020004
0x00000005 => 0x85020005
0x00000006 => 0x06030006
0x00000007 => 0x87030007
0x00000008 => 0x08040008
0x00000009 => 0x89040009
0x0000000a => 0x0a05000a
0x0000000b => 0x8b05000b
etc...
As you see, the high bits move fast enough to defeat a branch predictor.
The tree function, as well as Daniel's "new", holds up better on non-linear sequences.
BTW, the latest changes I did show that the convergence should be between
Daniel's function and mine, because there are common concepts. I noticed that
the expression used in Daniel's function is too complex for gcc to optimize it
well enough. In my case, gcc 3.2.3 coded too many jumps instead of conditional
moves. I saw that playing with -mbranch-cost changes the code. A mix between
the two is used here (and listed after). It's still not optimal, reading the
code, because there's always one useless jump and move. But in the worst case,
it gives exactly the same result as Daniel's, and in other cases it is even
slightly faster. Honestly, now it's just a matter of taste: Daniel's is easier to
read, mine produces smaller code. This time, it's faster than the others on Athlon,
Alpha and Sparc. I don't know about the PentiumIII or the P4.
Here are the results on Athlon, gcc-3.2.3, then Alpha and Sparc.
Willy
====
4294967295 iterations of bswap/nil... checksum = 4294967295
real 0m53.133s
user 0m53.114s
sys 0m0.005s
4294967295 iterations of bswap/fls_table... checksum = 4294967262
real 1m44.686s
user 1m44.163s
sys 0m0.487s
4294967295 iterations of bswap/new... checksum = 4294967262
real 1m16.463s
user 1m16.144s
sys 0m0.314s
4294967295 iterations of bswap/fls_tree... checksum = 4294967262
real 1m16.511s
user 1m16.169s
sys 0m0.296s
== alpha 466 MHz , gcc-3.2.3
gcc -O3 -o testfls -fomit-frame-pointer -mcpu=ev67 ===
4294967295 iterations of bswap/fls_tree... checksum = 4294967262
real 4m14.432s
user 4m13.540s
sys 0m0.038s
4294967295 iterations of bswap/fls_table... checksum = 4294967262
real 4m36.204s
user 4m35.368s
sys 0m0.030s
== ultra-sparc 333 MHz, gcc 3.1.1 (linear only)
gcc -O3 -mcpu=v9 -fomit-frame-pointer ===
4294967295 iterations of fls_tree... checksum = 4294967265
real 1m52.680s
user 1m52.640s
sys 0m0.010s
4294967295 iterations of fls_table... checksum = 4294967265
real 3m24.514s
user 3m24.430s
sys 0m0.010s
======= here is the function :
unsigned fls_tree_fls(unsigned n) {
register t = 0;
if (n >= (1<<16)) {
n >>= 16;
t = 16;
}
if (n < (1<<8))
if (n < (1<<4))
if (n < 4)
return t + n - ((n + 1) >> 2);
else
return t + 3 + (n>>3);
else
if (n < (1<<6))
return t + 5 + (n>>5);
else
return t + 7 + (n>>7);
else
if (n < (1<<12))
if (n < (1<<10))
return t + 9 + (n>>9);
else
return t + 11 + (n>>11);
else
if (n < (1<<14))
return t + 13 + (n>>13);
else
return t + 15 + (n>>15);
~~ snip~~
Here are some results with the scrambling function on AMD K6-III 450MHz,
gcc-2.95.3:
fls_old (generic_fls from linux 2.5.68):
real 3m52.960s
user 3m42.080s
sys 0m0.348s
fls_bsrl:
real 4m39.040s
user 4m25.532s
sys 0m0.401s
fls_table:
real 4m59.622s
user 4m45.511s
sys 0m0.469s
fls_tree:
real 3m3.986s
user 2m55.236s
sys 0m0.272s
fls_shift:
real 2m58.765s
user 2m50.092s
sys 0m0.278s
So for me the table version seems to be the slowest one. The BSRL instruction
on the K6-III seems to be very slow, too. The tree and my shift version are
faster than the original version here...
That someone else can test my fls_shift version I'll provide it here again:
static inline int fls_shift(int x)
{
int bit = 32;
while(x > 0) {
--bit;
x <<= 1;
}
return x ? bit : 0;
}
> Willy
Thomas
This actually is false. GCC does not know what resources the
instruction uses, therefore it cannot perform accurate instruction
scheduling.
Richard and I discussed at some time providing a way for inline
asms to give the instruction attributes.
An easier way is to provide a per-platform way to get at the
"weird" instructions a cpu has. This is precisely what GCC currently
provides in the form of __builtin_${CPU}_${WEIRD_INSN}(args...) type
interfaces. These give you what you want (complete control) yet
also let GCC schedule the thing accurately.
I know you think GCC is a pile of dogshit, in many ways I do too, but I
think it's just as important to point out the good parts where they
exist.
--
David S. Miller <da...@redhat.com>
Your shift version is the fastest on the PIII as well, finishing in 45.3
seconds vs 53.4 for my original, and using only 12 bytes of text. This was a
big surprise. The time was the same, whether I inlined it or not. I take
this to mean that -O3 inlines such a short function in either case.
Regards,
Daniel
That's interesting, Thomas. I get 18.4 s on the Athlon here vs 32.3 for Daniel's
(I have broken my function at the moment so I cannot re-bench it right now, but
it should be near Daniel's). At first, I thought you had coded an FFS instead of
an FLS. But I realized it's valid, and is fast because one half of the numbers
tested will not even take one iteration.
Regards,
Willy
Yep - faster - but with different results.
Shouldn't the checksums be the same?
cheers,
lincoln.
Yes, the shift is much faster than the others. Here is a test on a P4 1.6G.
model name : Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz
Reading specs from /usr/lib/gcc-lib/i386-linux/2.95.4/specs
gcc version 2.95.4 20011002 (Debian prerelease)
==== -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=i686 ====
test_fls.c: In function `fls_tree_fls':
test_fls.c:85: warning: type defaults to `int' in declaration of `t'
4294967295 iterations of nil... checksum = 4294967295
real 0m21.698s
user 0m21.650s
sys 0m0.000s
4294967295 iterations of table... checksum = 4294967265
real 0m41.581s
user 0m41.500s
sys 0m0.000s
4294967295 iterations of tree... checksum = 4294967265
real 1m1.173s
user 1m1.040s
sys 0m0.000s
4294967295 iterations of shift... checksum = 4294967265
real 0m22.050s
user 0m22.000s
sys 0m0.010s
--
Hu Gang / Steve
Aha, and duh. At 1 million iterations, my binary search clobbers the shift
version by a factor of four. At 2**31 iterations, my version also wins, by
20%.
I should note that the iterations parameter to my benchmark program is deeply
flawed, since it changes the nature of the input set, not just the number of
iterations. Fortunately, all tests I've done have been with 2**32
iterations, giving a perfectly uniform distribution of input values.
For uniformly distributed inputs the simple shift loop does well, but for
calculating floor(log2(x)) it's much worse than the fancier alternatives. I
suspect typical usage leans more to the latter than the former.
Regards,
Daniel
Would it be churlish to point out that the only significant user of fls()
is sctp_v6_addr_match_len()?
That is what I posted in my first message in this thread... The shift
algorithm only works fine for uniformly distributed input values... But here is
a version that behaves better for small values, too. I don't think it will
reach the tree version, but it should be much better than the old version!
int fls_shift(int x)
{
int bit;
if(x & 0xFFFF0000) {
bit = 32;
while(x > 0) {
--bit;
x <<= 1;
}
} else {
bit = 0;
while(x) {
++bit;
x >>= 1;
}
}
return bit;
}
For me this version even speeds up the uniformly distributed benchmark...
> For uniformly distributed inputs the simple shift loop does well, but for
> calculating floor(log2(x)) it's much worse than the fancier alternatives.
> I suspect typical usage leans more to the latter than the former.
If this is the case the tree version will surely be the best!
But I think this topic is not worth any further work, as this is not used very
often... So this version will be my last one!
But this was a good example of how well-suited algorithms can speed up benchmarks ;-)
> Regards,
>
> Daniel
Best regards
Thomas
Yeah, and sadly the fact that gcc-3.2.x does better instruction scheduling
has shown itself to not actually _matter_ that much. I'm quite impressed
by the new scheduler, but gcc-2.x seems to still generate faster code on
too many examples.
CPU's are getting smarter, not dumber. And that implies, for example, that
instruction scheduling matters less and less. What matters on the P4, for
example, seems to be the _choice_ of instructions, not the scheduling of
them.
Linus
> I know you think GCC is a pile of dogshit, in many ways I do too, but I
> think it's just as important to point out the good parts where they
> exist.
GCC is the strangest combination of utterly brilliant and brain-dead
stupid that I've ever seen... I've seen it do tail merges that took
my breath away, followed by this:
mov <mem1>,eax
mov eax,<mem2>
mov <mem1>,eax ; eax already contains mem1 you stupid compiler
ret
------
Chuck
> GCC is the strangest combination of utterly brilliant and brain-dead
> stupid that I've ever seen... I've seen it do tail merges that took
> my breath away, followed by this:
>
> mov <mem1>,eax
> mov eax,<mem2>
> mov <mem1>,eax ; eax already contains mem1 you stupid compiler
> ret
Not necessarily if mem2 == mem1 + 2. Consider this code:
#include <string.h>
int f(char* a, char* b)
{
int t;
memcpy(&t, a, sizeof(int));
memcpy(b, &t, sizeof(int));
memcpy(&t, a, sizeof(int));
return t;
}
"gcc -O2 -Wall -S test.c -fomit-frame-pointer" correctly generates:
f:
movl 4(%esp), %ecx
movl (%ecx), %eax
movl 8(%esp), %edx
movl %eax, (%edx)
movl (%ecx), %eax
ret
--
Peter Osterlund - pet...@telia.com
http://w1.894.telia.com/~u89404340
>> mov <mem1>,eax
>> mov eax,<mem2>
>> mov <mem1>,eax ; eax already contains mem1 you stupid compiler
>> ret
>
> Not necessarily if mem2 == mem1 + 2. Consider this code:
I realized after sending that last that I should have added that there
were no volatiles and no aliasing possible. This was the kernel code:
return conf->last_used = new_disk;
(new_disk is a local variable, conf is a function arg.)
------
Chuck
Maybe. It's also useful for breaking up an arbitrary IO region optimally into
binary-sized blocks, which is part of the current 2.4 device-mapper patch
set, not yet submitted.
Regards,
Daniel
I also tried to change your version into this, and at least it's not slower,
so it's good anyway, considering the small code size.
> If this is the case the tree version will surely be the best!
>
> But I think this topic is not worth any further work as this is not used very
> often... So this version will be my last one!
I agree. Just for the record, I'll post two other original implementations
which are not interesting for their real-life performance, but may be
interesting to reuse in other projects or in cheap hardware implementations,
or as a base for other algos. One of the downsides is that they need many
registers. They both are about twice as slow as the tree on Athlon, Alpha and
Sparc.
The first one uses no jump at all if the CPU can do CMOV. It's about twice as
slow as the tree, but may be a win on machines with a big jump cost. And since
there's no jump, its execution time is constant:
unsigned fls32_1(unsigned n)
{
/* this code is totally jumpless on architectures that support CMOV,
and can execute up to 4 instructions per cycle. However, it uses
lots of instructions and registers and is not as fast as it should be.
It's about 20 cycles on a dual-pipeline CPU.
*/
register unsigned x = n, bits = 0, bits2, a, b;
a = x & 0xffff0000;
bits2 = bits + 16;
b = x & 0xff00ff00;
if (x & a) { bits = bits2;}
if (x & a) { x &= a;}
bits2 = bits + 8;
a = x & 0xf0f0f0f0;
if (x & b) { bits = bits2;}
if (x & b) { x &= b;}
bits2 = bits + 4;
b = x & 0xcccccccc;
if (x & a) { bits = bits2;}
if (x & a) { x &= a;}
bits2 = bits + 2;
a = x & 0xaaaaaaaa;
if (x & b) { bits = bits2;}
if (x & b) { x &= b;}
bits2 = bits + 1;
if (x & a) { bits = bits2;}
if (x & a) { x &= a;}
if (x) { bits += 1; }
return bits;
}
The second one has a radically different approach. It converges in 5 shifts.
However, each iteration has a non-negligible cost, and its time is nearly
constant too. The code is rather small (about 70 bytes), but it needs a CPU
which can shift and jump fast to be efficient. It consumes about the same
time as above.
unsigned fls32_2(unsigned n)
{
register unsigned t = 16, r = 0;
register unsigned m = 0xffff0000;
// it only needs 5 iterations to complete, and some
// instructions can be executed in parallel. It's
// more efficient than the pure shift on small values.
// But it needs many registers :-(
if (n) {
while (t) {
if (n & m) {
n >>= t;
r += t;
t >>= 1;
m >>= t;
} else {
n <<= t;
t >>= 1;
m <<= t ? t : 1;
}
}
return r + 1;
}
else
return n;
}
These were my last versions, and not particularly performant ones ;-)
Regards,
Willy
It could -- I had thought the -fno-strict-aliasing option
meant exactly the opposite of what it really means. Since the
compiler has been told to be pessimistic about possibilities
like this maybe that's what it has to do.
------
Chuck
Since new_disk is on the stack, is there something about 'conf'
that guarantees it is not on the stack too? E.g. what if
&conf->last_used were one byte into 'new_disk' or something
like that.
Probably this is all illegal...
--
David S. Miller <da...@redhat.com>