I have read some manuals on Pentium optimization and have tried to
write my own routine, faster than what the gcc compiler produces --
just a simple routine, strlen. The result is the same time; my code is
similar to the code found on Paul Hsieh's page. I have also tried
reading a word and a dword at a time, but the result doesn't change:
my routine runs in the same time as strlen. Does anyone know another
approach (maybe MMX instructions) to improve the algorithm?
regards
claudio
void RoutineC ( void )
{
    char *source = "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDE\0" ; // 128 characters
    unsigned int len = 0;
    len = strlen ( &source[0] ) ;
}
void RoutineASM ( void )
{
    char *source = "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDE\0" ; // 128 characters
    unsigned int len = 0;
    __asm__ ( " .p2align 4,,15\n\t"
              "     movl  %%edx ,%%ebx\n\t"    /* ebx = scan pointer = start */
              "l1:  movb  (%%ebx),%%ah\n\t"    /* load next byte             */
              "     incl  %%ebx\n\t"
              "     testb %%ah ,%%ah\n\t"
              "     jne   l1\n\t"              /* loop until the '\0'        */
              "     subl  %%edx ,%%ebx\n\t"    /* ebx = end + 1 - start      */
              "     decl  %%ebx\n\t"           /* don't count the '\0'       */
              : "=b" (len)                     /* length comes back in ebx   */
              : "d" (&source[0])               /* pointer goes in via edx    */
              : "eax", "cc", "memory"          /* ah lives in eax            */
            ) ;
    len = len ;
}
string x = ...;
int length = x.length();
This can be a simple variable dereference.. now the next problem is
determining the length of the string when assigning a "const char*
text" into it. This is, of course, a tradeoff:
- setup time is increased
- other activities are faster
On the other hand,
- "strlen" is always done -- at setup
But,
- "strlen" is ONLY done -- at setup
Of course, the "length" member of string could be initialized to -1 to
signal "unknown" length and computed when the value is first queried, a
sort of lazy evaluation. On the other hand, that would introduce a
test-branch every time the value is queried; then again, it might be an
overall win to just compute the length at initialization.
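Roughly, the two variants side by side -- a minimal sketch assuming a
hypothetical hand-rolled class, not any particular library:

#include <cstring>

class lazystring
{
    const char* data_;
    mutable int length_;              // -1 signals "unknown" (lazy variant)
public:
    explicit lazystring(const char* text)
        : data_(text),
          length_(-1)                 // or: (int)std::strlen(text) to pay at setup
    {}

    int length() const
    {
        if ( length_ < 0 )            // the extra test-branch on every query
            length_ = static_cast<int>(std::strlen(data_));
        return length_;               // afterwards just a member read
    }
};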
Now to the question (at last!): I am interested in what the code that
does 16 bits or 32 bits at a time looks like. It would help to see what
you already tried..
Here's a typical strlen() implementation in C:
int strlen(const char* text)
{
const char* s = text;
for ( ; *s; ++s )
;
return (int)(s - text);
}
Visual C++ 8.1 Beta 2 compiles it like this:
mov eax, OFFSET $SG-5
$LL3@strlen:
add eax, 1
cmp BYTE PTR [eax], 0
jne SHORT $LL3@strlen
sub eax, OFFSET $SG-5
ret 0
Precisely the intention in C, translated into assembly. Unless we have
some more clever optimizations in mind, such as testing two or more
characters with a single branch, the function doesn't really.. pay off
to write in assembly..
Optimizing x86 code is also much different these days from writing for
the original Pentium.. look at the Pentium 4 NetBurst
microarchitecture: 128 registers internally.. the code is translated on
the fly into RISC-like micro-instructions, and the translated code is
cached (the code cache is called the "trace cache" in the P4). AMD has
a different approach in its K8 architecture, but knowing x86 assembly
doesn't tell you jack about the runtime cost of the code unless you
know how the internals work.. which by itself is not very beneficial..
x86 assembly programming.. well, these days I think the strong point in
favour of that sort of activity is generating code at runtime and then
executing it. Virtual machines, and realtime-optimized systems where
the number of permutations for a computation is too large, spring to
mind. Definitely areas where the other alternatives are smoked alive.
But that requires the know-how to write an optimizing compiler (at
least the backend).
That said, I think writing strlen() in assembly is not very productive,
considering how good the compilers these days are getting (Intel, GNU,
Microsoft..) -- but that's just me, don't be discouraged. :)
spam...@crayne.org wrote:
> That said, I think writing strlen() in assembly is not very productive,
> considering how good the compilers these days are getting (Intel, GNU,
> Microsoft..) -- but that's just me, don't be discouraged. :)
Out of curiosity, what happened to "repne scasb"?
Is an explicit loop of smaller instructions faster, or do compilers
not know about this instruction?
Cheers,
Brendan
Don't be overawed by compilers; assembler coding is not restricted to
the architecture of a C compiler. The following code is a modification
of Agner Fog's DWORD string length routine that aligns the start and
then tests four bytes at a time. It has no stack frame and conforms to
the normal register preservation rules under Windows, so it preserves
ESI and EDI but trashes the rest.
You will still need to convert it to AT&T syntax, but it should have
the legs on any byte scanner around.
fn_004010A4:
push edi
push esi
mov eax, [esp+0Ch]
mov ecx, eax
add ecx, 3
and ecx, 0FFFFFFFCh
sub ecx, eax
mov esi, ecx
jz lbl2
sub eax, 1
lbl0:
add eax, 1
cmp BYTE PTR [eax], 0
jz lbl1
sub ecx, 1
jns lbl0
jmp lbl2
lbl1:
sub eax, [esp+0Ch]
jmp lbl5
lbl2:
lea edx, [eax+3]
nop
lbl3:
mov edi, [eax]
add eax, 4
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, 80808080h
jz lbl3
test ecx, 8080h
jnz lbl4
shr ecx, 10h
add eax, 2
lbl4:
shl cl, 1
sbb eax, edx
add eax, esi
lbl5:
pop esi
pop edi
ret 4
Regards,
hutch at movsd dot com
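For what it's worth, the zero-byte test in the DWORD loop above can be
illustrated in a few lines of C++ (my own example, not part of the
post): for a byte b, (b - 1) gets bit 7 set when b is 0 (via the
borrow), while ~b has bit 7 set only when b <= 0x7f, so ANDing the two
flags exactly the zero bytes; borrows can only disturb lanes above the
first zero, so the first hit is always correct.

#include <cstdio>

int main()
{
    unsigned int w = 0x00434241;   // "ABC\0" as a little-endian dword
    unsigned int m = (w - 0x01010101u) & ~w & 0x80808080u;
    std::printf("%08x\n", m);      // prints 80000000: only byte 3 is flagged
    return 0;
}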
>brendan
>
>Out of curiosity, what happened to "repne scasb"?
I have tried it, but that code runs slower than the code above.
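For reference, a repne scasb routine might look something like this in
gcc inline assembly -- a sketch of what was presumably tried, since the
actual code wasn't posted (the name scasb_strlen is made up):

static inline unsigned int scasb_strlen ( const char *s )
{
    unsigned int len;
    const char *end;                        /* receives the updated edi */
    __asm__ ( "repne scasb"                 /* scan bytes at (%edi) for al == 0 */
              : "=c" (len), "=D" (end)
              : "0" (~0u), "1" (s), "a" (0)
              : "cc"
            ) ;
    (void) end;
    return ~len - 1;                        /* ecx counted down from -1; drop the '\0' */
}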
> Faster way to implement strlen() is to store the length of the string,
> example:
Good solution.
> hutch
Your code works better than mine. I compiled it with
gcc -O3 str.c -o str
and ran the loop many times; these are the results:
debian:~/source# ./s10
timer start : 1130244757
timer stop : 1130244775
timer diff : 18
timer start : 1130244775
timer stop : 1130244793
timer diff : 18
debian:~/source#
__asm__ ( ".p2align 4,,15\n\t"
          "pushl %%edi\n\t"
          "pushl %%esi\n\t"
          "movl  %%ebx,%%eax\n\t"
          "movl  %%eax,%%ecx\n\t"
          "addl  $3,%%ecx\n\t"           /* align the pointer up by 4        */
          "andl  $0xFFFFFFFC,%%ecx\n\t"
          "subl  %%eax,%%ecx\n\t"        /* ecx = misalignment count         */
          "movl  %%ecx,%%esi\n\t"
          "jz    lbl2\n\t"
          "subl  $1,%%eax\n\t"
          "lbl0:\n\t"
          "addl  $1,%%eax\n\t"
          "cmpb  $0,(%%eax)\n\t"         /* scan the first 1..3 bytes        */
          "jz    lbl1\n\t"
          "subl  $1,%%ecx\n\t"
          "jns   lbl0\n\t"
          "jmp   lbl2\n\t"
          "lbl1:\n\t"
          "subl  %%ebx,%%eax\n\t"        /* terminator found while aligning  */
          "jmp   lbl5\n\t"
          "lbl2:\n\t"
          "leal  3(%%eax),%%edx\n\t"     /* pointer+3, used at the end       */
          "nop\n\t"
          "lbl3:\n\t"
          "movl  (%%eax),%%edi\n\t"      /* read 4 bytes                     */
          "addl  $4,%%eax\n\t"
          "leal  -0x1010101(%%edi),%%ecx\n\t"
          "notl  %%edi\n\t"
          "andl  %%edi,%%ecx\n\t"
          "andl  $0x80808080,%%ecx\n\t"
          "jz    lbl3\n\t"               /* no zero byte yet, keep going     */
          "testl $0x8080,%%ecx\n\t"
          "jnz   lbl4\n\t"
          "shrl  $16,%%ecx\n\t"          /* 10h in the MASM original = 16    */
          "addl  $2,%%eax\n\t"
          "lbl4:\n\t"
          "shlb  $1,%%cl\n\t"
          "sbbl  %%edx,%%eax\n\t"
          "addl  %%esi,%%eax\n\t"
          "lbl5:\n\t"
          "popl  %%esi\n\t"
          "popl  %%edi\n\t"
          : "=a" (len)
          : "b" (&source[0])
          : "ecx", "edx", "cc", "memory"
        ) ;
1. I did assert that doing more than a single compare per iteration
could be a reason to write this in assembly, so I wasn't advocating
compilers that much. But it still looks to me that the assembly code in
the original post isn't worth writing in assembly.. you get the same
result in a portable and easier-to-maintain form through the use of,
say, C.
2. The only instruction that is really out of reach when writing this
in C/C++ is sbb; there is no predictable way to invoke sbb/adc that I
know of, anyway. Another example is barrel shifts, and shld/shrd are
also something that might come in handy.. but performance rarely
suffers from the alternate implementations.. :)
I don't remember the precise ratios, but I think it goes something like
this: 95% of the time is spent in 5% of the functions. If it is
possible to refactor the code not to call that 5% too frequently --
even better, to avoid calling it at all -- performance usually
increases. No assembly required (!) <- hehe.. that's why I think
storing the length of the string is a pretty nice tradeoff, because it
makes things like string concatenation, asking for the length,
comparing two strings and what not, much more efficient.
Example:
string a = ...;
string b = ...;
if ( a != b ) { ...
Internally, the first check could be a nice, cool if ( a.length !=
b.length ) -- cheap, and statistically it does most of the work most of
the time, too. The length member simply is the way to go! Generally,
the strcat(), strcpy(), strlen() ... API is pretty inefficient. But
damn, it's _simple_, and most importantly, it *needs* heavy
optimization at the implementation level because of the implicit
inefficiencies this arrangement causes!
int xstrlen(const char* text)
{
const char* p = text;
unsigned int a = reinterpret_cast<unsigned int&>(text);
unsigned int c = ((a + 3) & 0xfffffffc) - a;
unsigned int s = c;
if ( c )
{
for ( ; c >= 0; --c )
{
if ( *p++ == 0 )
return static_cast<int>(p - text - 1);
}
}
const unsigned int* ap = reinterpret_cast<const unsigned int*>(p);
unsigned int d = reinterpret_cast<unsigned int&>(p) + 3;
for ( ;; )
{
unsigned int i = *ap++;
c = (i - 0x01010101) & ~i & 0x80808080;
if ( c )
break;
}
if ( !(c & 0x8080) )
{
c >>= 16;
s += 2;
}
return reinterpret_cast<unsigned int&>(ap) - d - !(c >> 31) + s;
}
Here's the assembly output from MSVC++ (latest version):
push esi
mov esi, OFFSET $SG-28+3
mov eax, OFFSET $SG-28
and esi, -4 ; fffffffcH
sub esi, eax
je SHORT $LN6@xstrlen
$LL8@xstrlen:
mov cl, BYTE PTR [eax]
add eax, 1
test cl, cl
jne SHORT $LL8@xstrlen
sub eax, OFFSET $SG-28
sub eax, 1
pop esi
ret 0
$LN6@xstrlen:
mov eax, OFFSET $SG-28
npad 6
$LL4@xstrlen:
mov edx, DWORD PTR [eax]
lea ecx, DWORD PTR [edx-16843009]
not edx
and ecx, edx
add eax, 4
and ecx, -2139062144 ; 80808080H
je SHORT $LL4@xstrlen
test ecx, 32896 ; 00008080H
jne SHORT $LN1@xstrlen
shr ecx, 16 ; 00000010H
add esi, 2
$LN1@xstrlen:
shr ecx, 31 ; 0000001fH
not ecx
and ecx, 1
sub eax, ecx
sub eax, OFFSET $SG-28+3
add eax, esi
pop esi
ret 0
Is it just me, or do the "inner loops" resemble each other quite a bit?
Why shouldn't we rely on compilers again? If you time the code, the two
versions perform within +- 10% of each other, too.
I still think very much that assembly itself is not an optimization
tool anymore; of course, if you know it, you write better higher-level
code, especially when you know the compiler.. and you check the
assembly *output* to verify the compiler isn't doing something really
stupid.
The best use for machine-specific instructions, IMHO, is to let the
machine generate them. Be this an offline compiler like g++, Visual
C++, et al., or a realtime code generator like a JIT compiler or
something like that.
I'm not "dissing" assembly per se; I used to hold strongly the opinion
that it is the way to performance. I started with the Z80, then some
Commodore 64 (the 6510, if memory serves), 68000, MIPS, PPC, x86,
etc. etc. But over time everyone learns the old truth about optimizing
at the wrong level and in the wrong spots, premature optimization and
so on.. at the very heart of it, optimizing compilers are a very
interesting topic.
So is assembly, x86 included, but if a C/C++ compiler can give the same
performance from higher-level code which is easier to read, write and
generally "see" the flow of, I prefer that. What I am going to do next
is proof-read, rewrite to be portable and then regression-test the
code.. as I didn't do this conversion very carefully, and the variable
names are really just register names from the original assembly code, I
shall rename them to reflect their use better (and clean up the code in
general :)
Then what I'll do with it is archive it and never touch or look at it
again. ;-----)
I also tested different versions of the "sum" routine (basically: if
bit 7 of a byte is set, increase the sum...), and tried this for
example:
s += !((v >> 31) & 1) + ...;
That's branchless.. but I didn't like the NOTs in there, so I added:
v = ~v;
s += ...;
No dice: still no good, the code was slower than the one in the fixed
version below.. the divide-and-conquer technique that has two
compare-branches still beats the branch-free version (testing on a
Pentium M).
This code is somewhat regression-tested with randomly generated strings
against strlen() for returning the correct length.. and also with
bounds checking on array accesses. Note: some bytes after the reserved
memory are read, depending on the leftover chars in a dword.. 3, 2 or 1
bytes too many. This isn't a problem with the Windows memory allocator,
for instance, which reserves *heap* memory from new/malloc in 8-byte
chunks. Nor is it a problem on the stack with objects of "auto" storage
class, as the stack is a valid reading area.. of course the chars after
the terminating null character are garbage, but this code doesn't care
about that.
I haven't thought about how this works on a big-endian architecture,
and the code isn't portable in its current shape anyway. For instance,
aliasing the "const char*" with "unsigned int" is very bad programming
in general; the conversion from pointer to integer should be done
differently.. but I do it that way for this version of the code
only. :)
Actually, writing portable code is harder than it seems.. and this code
ain't portable. But it demonstrates the concept that "C/C++ code isn't
that slow" ;-)
int xstrlen(const char* text)
{
const char* p = text;
unsigned int a = reinterpret_cast<unsigned int&>(text);
unsigned int alignment = ((a + 3) & 0xfffffffc) - a;
unsigned int s = alignment;
if ( alignment )
{
for ( unsigned int i=0; i<alignment; ++i )
{
if ( *p++ == 0 )
return static_cast<int>(p - text - 1);
}
}
s -= reinterpret_cast<unsigned int&>(p) + 3;
const unsigned int* ap = reinterpret_cast<const unsigned int*>(p);
unsigned int v = 0;
for ( ; !v; )
{
unsigned int u = *ap++;
v = (u - 0x01010101) & ~u & 0x80808080;
}
if ( !(v & 0x8080) )
{
v >>= 16;
s += 2;
}
if ( !(v & 0x80) )
{
++s;
}
return reinterpret_cast<unsigned int&>(ap) + s - 1;
}
Some refactoring might still pay off.. looking into how "s" is updated,
there might be room for improvement, but I think the basic idea is
nailed down more or less now. I think one or two variables could be
dropped without aliasing problems.. but the inner loop is where the
action is, and that's pretty decent as things are.. so optimizing
further might not yield much improvement, except with very short
strings.. which are the fast case anyway! So that about wraps it up..
The code has been refactored for better performance, and it works on
both little- and big-endian architectures (tested on MIPS R10000 and
PPC G5). It goes without saying that it works on IA32 and AMD64 /
x86-64..
template <typename chartype>
inline int string_length(const chartype* text)
{
assert( text != NULL );
const chartype* s = text;
for ( ; *s; ++s )
;
return static_cast<int>(s - text);
}
template <>
inline int string_length<char>(const char* text)
{
assert( text != NULL );
const char* p = text;
const char* base = 0;
meta::intp address = static_cast<meta::intp>(text - base);
unsigned int alignment = ((address + 3) & 0xfffffffc) - address;
if ( alignment )
{
for ( unsigned int i=0; i<alignment; ++i )
{
if ( *p++ == 0 )
return static_cast<int>(p - text) - 1;
}
}
const uint32* ap = reinterpret_cast<const uint32*>(p);
uint32 v = 0;
for ( ; !v; )
{
uint32 u = *ap++;
v = (u - 0x01010101) & ~u & 0x80808080;
}
uint32 s = static_cast<int>(reinterpret_cast<const char*>(ap) - p) +
alignment - 3;
#ifdef FUSIONCORE_BIG_ENDIAN
if ( !(v & 0x80800000) )
return v & 0x8000 ? s + 1 : s + 2;
return v & 0x80000000 ? s - 1 : s;
#endif
#ifdef FUSIONCORE_LITTLE_ENDIAN
if ( !(v & 0x8080) )
return v & 0x00800000 ? s + 1 : s + 2;
return v & 0x0080 ? s - 1 : s;
#endif
}
Note, the different offsets are -1, 0, 1, 2.. it might be possible to
compute those more efficiently; currently it's an if-else-?: mess.. if
we assign each bit a different weight and do a masked sum, we might get
the index adjustment value FAST, but it doesn't look worth the effort..
I'll try not to post once more on the topic. :)
Hope this is my last post about this..
FYI, test results:
strlen() 24.0 usec
this version: 14.5 usec
asm version: 13.5 usec
That's on the same machine (Pentium M 1.8 GHz), some number of
iterations and test repetitions.. the average of 20 tests (smallest and
largest result dropped from the average) with timings rounded to the
nearest 0.5 usec.. the difference between the asm and C++ versions
seems negligible from a practical point of view ( < 8 % ).
Thanks for the tip, this comes in handy (not that string initialization
has ever been a performance bottleneck for me.. but in principle let's
have better code when possible, plus I enjoy a little optimization fun
now and then, so thank you!!!)
if ( v & 0x00008080 )
return s - ((v & 0x00000080) >> 7);
return s + 2 - ((v & 0x00800000) >> 23);
The next step is to handle 64 or 128 bits per iteration (at least the
64-bit case should be trivial with MMX/SSE, or natively on a 64-bit
platform if writing this in C/C++ ..)
Note that such a version is liable to crash because we cannot guarantee
that reads stay within allocated boundaries anymore (unless we allocate
the memory ourselves, taking care of the issue?)
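For the 64-bit case without MMX, here is a rough sketch in plain C++ of
what I mean -- my own illustration, assuming a little-endian target,
<cstdint> types, and that reading up to 7 bytes past the terminator
inside an aligned 8-byte block is tolerated, as discussed above.

#include <cstdint>

int strlen64(const char* text)
{
    const char* p = text;
    while ( reinterpret_cast<std::uintptr_t>(p) & 7 )   // byte-scan up to alignment
    {
        if ( *p == 0 )
            return static_cast<int>(p - text);
        ++p;
    }
    const std::uint64_t* ap = reinterpret_cast<const std::uint64_t*>(p);
    std::uint64_t v;
    do
    {
        std::uint64_t u = *ap++;
        v = (u - 0x0101010101010101ULL) & ~u & 0x8080808080808080ULL;
    } while ( !v );
    const char* block = reinterpret_cast<const char*>(ap) - 8;  // block with the '\0'
    int offset = 0;
    while ( !(v & 0x80) )                                // find the first flagged byte
    {
        v >>= 8;
        ++offset;
    }
    return static_cast<int>(block - text) + offset;
}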
Here is a slightly tweaked version of the algo I posted. It unrolls a
block of code by 8 and replaces an immediate in the loop code with the
same value in a spare register. It is clocking in on my test PIV at
about 22% faster than the last version I posted.
I have done all of the testing on strings that are misaligned so that
the alignment code is forced to run.
;
«««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
fn_00401460:
mov [esp-4], esi
mov [esp-8], edi
mov [esp-0Ch], ebx
mov [esp-10h], ebp
mov ebx, 80808080h
mov ebp, 4
mov eax, [esp+4]
mov ecx, eax
add ecx, 3
and ecx, 0FFFFFFFCh
sub ecx, eax
mov esi, ecx
jz lbl2
sub eax, 1
lbl0:
add eax, 1
cmp BYTE PTR [eax], 0
jz lbl1
sub ecx, 1
jns lbl0
jmp lbl2
lbl1:
sub eax, [esp+4]
jmp lbl6
lbl2:
lea edx, [eax+3]
mov edi, edi
lbl3:
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jne lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
je lbl3
lbl4:
test ecx, 8080h
jnz lbl5
shr ecx, 10h
add eax, 2
lbl5:
shl cl, 1
sbb eax, edx
add eax, esi
lbl6:
mov esi, [esp-4]
mov edi, [esp-8]
mov ebx, [esp-0Ch]
mov ebp, [esp-10h]
ret 4
Unrolling like that kinda kills branch prediction, I suppose. And burns
code/tracecache for no real gain..
It probably depends on what you are running it on and how you have set
it up. It should be 16-byte aligned. This was tested on a PIV 2.8 gig
Prescott, but if you have inlined this in a C compiler, check a few
things like how it protects registers on entry and exit and what
alignment it is running at.
These are the timings I get with a MASM test piece.
szLen_xx is the first version
szLen_x2 is the second version
Prescott PIV 2.8 gig
--------------------
1282 MS szLen_xx
859 MS szLen_x2
1266 MS szLen_xx
859 MS szLen_x2
1266 MS szLen_xx
859 MS szLen_x2
1266 MS szLen_xx
859 MS szLen_x2
AMD Sempron 2.4
---------------
1859 MS szLen_xx
1688 MS szLen_x2
1875 MS szLen_xx
1687 MS szLen_x2
1875 MS szLen_xx
1688 MS szLen_x2
1875 MS szLen_xx
1687 MS szLen_x2
> Unrolling like that kinda kills branch prediction, I suppose. And burns
> code/tracecache for no real gain..
Branch prediction penalties need to be weighed, if in fact they are a
factor, against jump reduction, which is generally more useful as taken
jumps are usually slower than fall-through jumps. In the unroll, the
unpredicted jump is only taken once, on exit, so it's not a factor in
the loop code.
This is the actual form of the MASM code that I derived the two
postings from.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
szLen_x2 proc item:DWORD
mov [esp-4], esi
mov [esp-8], edi
mov [esp-12], ebx
mov [esp-16], ebp
mov ebx, 80808080h ; load immediates into registers
mov ebp, 4
mov eax, [esp+4]
mov ecx, eax ; copy EAX to ECX
add ecx, 3 ; align up by 4
and ecx, -4
sub ecx, eax ; calculate any misalignment in ecx
mov esi, ecx ; store ECX in ESI
jz proceed
sub eax, 1
@@:
add eax, 1
cmp BYTE PTR [eax], 0 ; scan for terminator for
je quit ; up to the 1st 3 bytes
sub ecx, 1
jns @B
jmp proceed
quit:
sub eax, [esp+4] ; calculate length if terminator
jmp outa_here ; is found in 1st 3 bytes
; ----------------
proceed: ; proceed with the rest
lea edx, [eax+3] ; pointer+3 used in the end
align 4
@@:
REPEAT 7
mov edi, [eax] ; read first 4 bytes
add eax, ebp ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jnz nxt ; exit loop on zero bytes
ENDM
mov edi, [eax] ; read first 4 bytes
add eax, ebp ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jz @B ; no zero bytes, continue loop
nxt:
test ecx, 00008080h ; test first two bytes
jnz @F
shr ecx, 16 ; not in the first 2 bytes
add eax, 2
@@:
shl cl, 1 ; use carry flag to avoid branch
sbb eax, edx ; compute length
add eax, esi ; add misalignment count
outa_here:
mov esi, [esp-4]
mov edi, [esp-8]
mov ebx, [esp-12]
mov ebp, [esp-16]
ret 4
szLen_x2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Unrolling code won't *kill* branch prediction; it might not take
advantage of branch prediction, but it doesn't kill it (i.e., make it
less effective when it is used).
Of course, the main problem I'm seeing here is that people are running
their programs and measuring the results on different CPUs. And all
bets are off when you do that. This is why it's almost a worthless
exercise to "count cycles" these days. If you optimize the code on one
CPU, chances are pretty good it *won't* be optimal on a different CPU.
So unless you're writing the code to run on a specific rev of a
specific CPU, shaving one or two clock cycles off the execution time of
an algorithm is a real waste of time.
While I can appreciate why it would be nice to have the world's fastest
strlen program, given the effort that seems to be put into this problem
over and over again, the more reasonable question to ask is "why aren't
all these people using a better data structure that makes string
operations more efficient?" (e.g., including the length as part of the
string data type; maybe reference counters, too). Sure, we'd all love
the idea of simply recompiling old programs that don't know any better
and having them run faster, but the truth is that if people would
simply stop using zstrings (except for interface with OSes and other
code that uses them) and employ a better internal data structure, the
whole issue of the "fastest strlen" algorithm would go away. If people
would use reference counters and pointers, then worrying about the
"fastest block move operation" would diminish, too.
As the old saying goes - get the algorithm (and, presumably, data
structure) right *FIRST*.
Cheers,
Randy Hyde
Unrolling 8 times while having essentially verbatim the same inner loop
(branch instruction excluded) is not a particularly effective
optimization. The biggest "optimization" is that the code is bigger.
/*
Of course, the main problem I'm seeing here is that people are running
their programs and measuring the results on different CPUs. And all
*/
It's not really a problem. I'm running MIPS, PPC, x86-64 and IA32 and
also working with ARM (various versions), and generally higher-level
optimizations are "portable"; there are of course small fluctuations
based on architectural differences, but generally, as I just wrote,
higher-level code which does less "work" is usually faster on any
platform.
The biggest differences come from floating point to integer (and vice
versa) interaction and related issues; the lack of a floating point
processor also mixes things up, but that isn't the case here. The code
is fairly trivial, integer-only; only a very RISC-like ALU is required.
The number of registers needed is small here, too.
/*
bets are off when you do that. This is why it's almost a worthless
exercise to "count cycles" these days. If you optimize the code on one
CPU, chances are pretty good it *won't* be optimal on a different CPU.
*/
First, I am not counting cycles. I am measuring average performance
over a large number of runs on data that isn't the same on each
successive run. The slowest and fastest runs are always dropped so that
random fluctuation is eliminated. Additionally, each test string is
tested with different start offsets so that different alignment cases
and different "leftover" cases are tested.
The same code works as a regression test, too. :)
/*
So unless you're writing the code to run on a specific rev of a
specific CPU, shaving one or two clock cycles off the execution time of
an algorithm is a real waste of time.
*/
In production code I rarely do this; I write pretty good code off the
bat anyway. But this is for fun, dude -- you think Usenet is where my
work is at? ;-)
/*
over and over again, the more reasonable question to ask is "why aren't
all these people using a better data structure that makes string
operations more efficient?" (e.g., including the length as part of the
*/
Apparently you haven't read the whole thread. I think it was me who
suggested, multiple times in fact, that storing the length of the
string in the string object is the way to go. Here's the string class,
actually:
http://www.liimatta.org/misc/string.hpp
It uses template metaprogramming for string operations; I wrote this as
an experiment on the topic in 2003. The basic principle is that when we
have the expression:
string bla = a + b + c + d + e ...;
the object on the right-hand side is a tree type (look at the
expr_string template); the right and left nodes are trees, which store
the operands of each side. This tree will be "unbalanced" as we don't
have parentheses; however, that isn't a problem, as the tree is
traversed only once (the assignment evaluates the tree) and inserts are
always O(1).
What this enables is that the "length" of the whole tree is known at
evaluation time in linear time O(n) (this could be improved, but I
couldn't be bothered, as I don't believe in optimizing what doesn't
require it).
After the length is known, each node in the tree is basically
block-copied into its slot in the left-hand argument object. No
temporary string objects are created for each successive + operator,
as with a traditional string (object) implementation.
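A rough sketch of the idea -- my own reconstruction from the
description above, not the actual string.hpp, whose names and details
differ:

#include <cstring>
#include <string>

struct leaf                                            // wraps one operand
{
    const char* s;
    std::size_t len;
    leaf(const char* p) : s(p), len(std::strlen(p)) {}
    std::size_t length() const { return len; }
    char* copy_to(char* dst) const { std::memcpy(dst, s, len); return dst + len; }
};

template <typename L, typename R>
struct expr                                            // one node of the tree
{
    L lhs; R rhs;
    expr(const L& l, const R& r) : lhs(l), rhs(r) {}
    std::size_t length() const { return lhs.length() + rhs.length(); }
    char* copy_to(char* dst) const { return rhs.copy_to(lhs.copy_to(dst)); }
};

inline expr<leaf, leaf> operator+(const leaf& l, const leaf& r)
{ return expr<leaf, leaf>(l, r); }

template <typename L, typename R>
inline expr<expr<L, R>, leaf> operator+(const expr<L, R>& l, const leaf& r)
{ return expr<expr<L, R>, leaf>(l, r); }

template <typename Expr>
std::string evaluate(const Expr& e)                    // one allocation, one pass
{
    std::string out(e.length(), '\0');
    e.copy_to(&out[0]);
    return out;
}

// usage: the total length of the tree is known before a single byte is copied
// std::string s = evaluate(leaf("a") + leaf("b") + leaf("c"));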
/*
the idea of simply recompiling old programs that don't know any better
and having them run faster, but the truth is that if people would
simply stop using zstrings (except for interface with OSes and other
code that uses them) and employ a better internal data structure, the
*/
The internal data structure still needs initialization, and strings'
lengths are needed very often. Lazy evaluation of them invokes a branch
for every use of the information, which I already explained in this
thread. There's a tradeoff being made between different usage patterns.
/*
whole issue of the "fastest strlen" algorithm would go away. If people
would use reference counters and pointers, then worrying about the
"fastest block move operation" would diminish, too.
*/
I am not interested in "fast strlen" per se; I am interested in
discussing the futility of optimizing it in *assembler* when the
performance difference is negligible. I first said that it's not such a
bright idea, just using kinder words. Then I went to my usual trouble
to put my code where my mouth is. Done -- as stated, I don't talk shit
and can back up my opinion with facts.
/*
As the old saying goes - get the algorithm (and, presumably, data
structure) right *FIRST*.
*/
Well, of course, you are talking to the guy who proposed that three
days before you..
After that is done, something on the order of doubling the performance
is still pretty cute for being just a petty implementation detail. I
want to emphasize the fact that I find it mostly cute that someone
wants to stick to assembly code, which isn't portable and has, for all
practical purposes, equivalent performance to C++ code that does the
same thing.
For the record, I'm not a big fan of writing software in assembly; I
am a big fan of understanding assembly on the platforms I work on so
that I understand how to craft my code. The assumption is that the
optimizations the compiler implements don't suck, such as constant
propagation, dead code elimination, common subexpression elimination,
spilling and what not. That is, when such things matter.
They rarely do, as a matter of fact, as only a fraction of code is
time-critical. I'd go as far as to say that if strlen() speed is
time-critical there is something seriously wrong with the design.
However, that isn't the POINT of this thread. It is (for me) about
higher-level languages not necessarily implying inefficient, or slow..
or assembly automatically being synonymous with fast.
The way I see it, C is a portable assembler, which makes it a jack of
all trades but master of none. But in this case it doesn't hurt
performance; as can be seen, the beginning of the code is nearly
identical when compiled. The only real difference is at the end,
because I cannot express the "subtract with borrow" concept in C,
therefore I crafted an alternative which has nearly the same
performance in practice.. it ONLY adds a little constant overhead, but
the inner loop is, well, the same.
Also, I could unroll the C++ inner loop but I won't do it because I
think it is not a particularly good idea in this case.
If you align to an 8-byte boundary, you can safely read 64 bits at a
time, as the blocks would be aligned to the granularity of the memory
allocation unit. I bet that would double the performance if you are so
inclined.
--> misc ranting ..
My goal was to demonstrate that modern C++ compilers are pretty darn
good: they come into about the same ballpark in performance as
handwritten assembly. The other message was to promote more sound
programming techniques (see the notes about storing the string length
to begin with; I fail to see how you could beat THAT with the following
usage:)
void foo(const string& bar)
{
int size = bar.length();
// do something with length...
}
Better software isn't about better implementation but better software
engineering practices. I deliberately don't say *faster* but *better*,
because the performance-critical areas are usually not in string
processing. There is only so much string data to process to begin with,
unless it is a very special application (I bet it's not your usual
run-of-the-mill desktop application -- think Google, SQL, ..)
The string processing was just a good example of code that can be
written in generic, portable C++ and compiled for nearly any platform
without severe performance penalties, even on the x86 platform where
the code had been handcrafted with performance in mind (why else
assembly?)
That's why it is pointless to do performance comparisons between
assembly and higher-level language implementations: usually, the
assembly-heavy software is light on algorithms and data structures,
because such things are a real pain to maintain in assembly. Not
impossible -- been there, done that.. when I was young & wild..
Here's an example of my stupidity from 1997:
www.liimatta.org/misc/_alphablend.asm
A more optimal implementation recognizes the pattern (which this is
based on anyway):
xyzw
wwww
~x~y~z~w
~w~w~w~w
What we notice is that we basically need two mode bits:
1. w replicate
2. complement
These create the four permutations we see above, so we could easily
re-construct the code sequence from four basic loop start blocks and
"build" the code on the fly without resorting to very comprehensive
techniques. Now we have 100 unique inner loops, which we could generate
(!) easily with a simple algorithm.
Even better, we could construct more complex sequences by adding very
simple rules and still use fixed register allocation.
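To make the "two mode bits" observation concrete, here is a small
illustration of my own (not the original MMX code): replicate-w and
complement expressed as template parameters, so all four fetch variants
come out of one description. A runtime code generator would emit the
equivalent machine-code fragments instead of instantiating templates.

#include <cstdint>

struct pixel { std::uint8_t x, y, z, w; };

template <bool ReplicateW, bool Complement>
inline pixel fetch(const pixel& p)
{
    pixel r = p;
    if ( ReplicateW )                    // mode bit 1: wwww instead of xyzw
        r.x = r.y = r.z = r.w = p.w;
    if ( Complement )                    // mode bit 2: ~ each component
    {
        r.x = ~r.x; r.y = ~r.y; r.z = ~r.z; r.w = ~r.w;
    }
    return r;
}

// the four permutations listed above:
//   fetch<false,false>  xyzw          fetch<true,false>  wwww
//   fetch<false,true>   ~x~y~z~w      fetch<true,true>   ~w~w~w~w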
But I would rather write this logic in C++, have the frontend and
interface be C++ (or ANSI C), and then only use the (in this case, MMX
assembly) machine-specific binary as an output format, mainly for
performance reasons.
The rationale for this has many aspects, here are a few:
- we cannot write all permutations at compilation time in ANY language
if the number of permutations is too high
- so, we will use branching and possibly also calls to subroutines to
mix simpler pieces into a more complex "whole", but this is inefficient:
in the context this code is used in, one function call per fragment is
one too many
- we could use templates to generate the code at compilation time
(which is the approach I use in one library for pixel-format
conversions and blitting), but only because the number of permutations
is still manageable there
- we could use macros, and keep including the same header with
different values for the macros to generate different functions with
different parameters, but we have the same problem as before: too many
permutations, and we end up with a 120 MB executable very easily
- realtime-generated code doesn't even HAVE TO BE fully optimized; what
matters is that it is REASONABLY efficient, as the alternative would be
very, very slow code
- it's even better if the designer and implementor of such a system has
experience in compiler design and implementation (that shouldn't be too
difficult a hurdle; I think this is standard material for programmers
at universities around the world)
- if there is no experience, it is possible to gain it by practice; it
just slows down the progress initially :)
Et cetera. Those are just my thoughts on "assembler" -- bytecode is
where the action is at, and it's hard to find faster bytecode than
platform-native binaries.. now the problem is twofold:
- writing a good optimizing compiler for intermediate code generation
- developing a good instruction selection implementation :)
That's where the action is; don't miss it by fumbling assembly by hand,
which is mostly a waste of time. Just my $.02
With tongue in cheek as usual, welcome to the world of mixed-model
cross-hardware development. :)
This is true, but it still doesn't *kill* branch prediction.
>
> /*
> Of course, the main problem I'm seeing here is that people are running
> their programs and measuring the results on different CPUs. And all
> */
>
> It's not really a problem. I'm running MIPS, PPC, x86-64 and IA32 and
> also working with ARM (various versions), and generally higher-level
> optimizations are "portable"; there are of course small fluctuations
> based on architectural differences, but generally, as I just wrote,
> higher-level code which does less "work" is usually faster on any
> platform.
>
> The biggest differences come from floating point to integer (and vice
> versa) interaction and related issues; the lack of a floating point
> processor also mixes things up, but that isn't the case here. The code
> is fairly trivial, integer-only; only a very RISC-like ALU is required.
> The number of registers needed is small here, too.
All the more reason that counting cycles isn't a good approach for a
strlen optimization.
>
> /*
> bets are off when you do that. This is why it's almost a worthless
> exercise to "count cycles" these days. If you optimize the code on one
> CPU, chances are pretty good it *won't* be optimal on a different CPU.
> */
>
> First, I am not counting cycles.
Yes, you are.
> I am measuring average performance
> over a large number of runs on data that isn't the same on each
> successive run. The slowest and fastest runs are always dropped so that
> random fluctuation is eliminated. Additionally, each test string is
> tested with different start offsets so that different alignment cases
> and different "leftover" cases are tested.
And this is averaged across all possible x86 CPUs (and others, too),
right? If not, you're counting cycles on *one* particular CPU. And
that's a waste of time if you're trying to produce generic results.
>
> The same code works as a regression test, too. :)
???
>
> /*
> So unless you're writing the code to run on a specific rev of a
> specific CPU, shaving one or two clock cycles off the execution time of
> an algorithm is a real waste of time.
> */
>
> In production code I rarely do this; I write pretty good code off the
> bat anyway. But this is for fun, dude -- you think Usenet is where my
> work is at? ;-)
Yes, rewriting strlen algorithms seems to be one of the funnest things
ever. Seems like someone new is doing it every other week.
>
> /*
> over and over again, the more reasonable question to ask is "why aren't
> all these people using a better data structure that makes string
> operations more efficient?" (e.g., including the length as part of the
> */
>
> Apparently you haven't read the whole thread.
Why bother?
Do you really think you're adding anything new to the discussion that
hasn't been said a thousand times already? Do you really think this is
new territory around here (or anywhere else)? The better question for
you to ask is "why bother responding to a thread that is probably the
#1 FAQ in a newsgroup such as this one?"
> I think it was me who suggested, multiple times in fact, that storing
> the length of the string in the string object is the way to go. Here's
> the string class, actually:
>
> http://www.liimatta.org/misc/string.hpp
Do you think you're the first to make this suggestion? Do you think my
response is directed specifically at you? Do you think....
No matter what you said before, it's always important to inject reality
checks into threads such as this one. And this thread is definitely in
need of a reality check. And this isn't a direct response to you, it's
a comment to anyone reading this thread who thinks that there might be
something to what's being said here.
The truth is, *memory architecture* is the biggest impediment to memory
scanning algorithms like strlen. Even if *every* (x86) CPU had exactly
the same timing characteristics, measurements would still be all over
the map based on memory controllers and access times (unless, of
course, you cheat and assume that all your strings are in cache before
you run your strlen function, which isn't exactly fair).
[lots of unrelated stuff snipped]
>
> /*
> the idea of simply recompiling old programs that don't know any better
> and having them run faster, but the truth is that if people would
> simply stop using zstrings (except for interface with OSes and other
> code that uses them) and employ a better internal data structure, the
> */
>
> The internal data structure still needs initialization, and strings'
> lengths are needed very often.
Well, let's see. You initialize the string's length once. You use it
many times. Sounds like a pretty good tradeoff to me.
> Lazy evaluation of them invokes a branch
> for every use of the information,
Who said anything about lazy evaluation? And even if we *do* use lazy
evaluation, it's still a *whole* lot cheaper than running strlen
every time we want the length, no?
> which I already explained
> in this thread. There's a tradeoff being made between different usage
> patterns.
Obviously, if you know something about your data, you can improve on
the algorithm in use. One thing you seem to know is that people use the
length quite frequently. So cache it rather than compute it (that is,
store the length as part of the data structure). While there may be
some *rare* cases where the extra work to save away this length is more
expensive than recomputing it each time, I'm not sure I can think of
any off-hand.
>
> /*
> whole issue of the "fastest strlen" algorithm would go away. If people
> would use reference counters and pointers, then worrying about the
> "fastest block move operation" would diminish, too.
> */
>
> I am not interested in "fast strlen" per se; I am interested in
> discussing the futility of optimizing it in *assembler* when the
> performance difference is negligible.
Yes. It is futile to optimize bad algorithms to begin with. This is
true no matter *what* language you use. That being said, all you're
doing is saying that *some* HLL implementations beat *some* assembly
optimizations. Do you really think this is something new? The bottom
line is that on the variety of x86 processors available today, if
you've got a C compiler that generates code for *that specific*
processor and you compare this against assembly code that was optimized
for a *different* processor, what do you expect? And, of course,
CPU-dependent HLL code that does a decent job is going to walk all over
some sample code that was written by someone who (1) isn't very good,
(2) doesn't understand the characteristics of the CPU you're running
with, or (3) both.
> I first said that it's not such a
> bright idea, just using kinder words. Then I went to my usual trouble
> to put my code where my mouth is. Done -- as stated, I don't talk shit
> and can back up my opinion with facts.
Why are you wasting your time? :-)
>
> /*
> As the old saying goes - get the algorithm (and, presumably, data
> structure) right *FIRST*.
> */
>
> Well, of course, you are talking to the guy who proposed that three days
> before you..
No, I'm talking to the guy who proposed it in a repeat of a very long
running thread. I can assure you that my response to this question in
1996 was the same as it is today.
>
> After that is done, something on the order of doubling the performance
> is still pretty cute for being just a petty implementation detail. I
> want to emphasize the fact that I find it mostly cute that someone
> wants to stick to assembly code, which isn't portable and has, for all
> practical purposes, equivalent performance to C++ code that does the
> same thing.
Unfortunately, your plan doesn't scale up very well. The problem with
C++ is *not* that you can't write efficient code if you're *very*
careful and consider the code the compiler is emitting (and adjust your
C++ source code appropriately). The problem is that no one writes
anything but trivial little (and often non-portable) code this way.
IOW, C++ really doesn't have much benefit over assembly when you go to
all this trouble other than it might be able to run on different CPUs
(but you often lose the optimizations when you do this).
>
> For the record, I'm not a big fan of writing software in assembly; I
> am a big fan of understanding assembly on the platforms I work on so
> that I understand how to craft my code.
This is good.
> The assumption is that the
> optimizations the compiler implements don't suck, such as constant
> propagation, dead code elimination, common subexpression elimination,
> spilling and what not. That is, when such things matter.
Then the compilers turn around and generate brain-dead sequences for
other stuff. Believe me, I've spent a lot of time studying optimizing
compilers and their code emission (as I'm sure you have). I've seen
them produce some *brilliant* optimizations. I've *learned* some really
neat tricks by doing this. But for every brilliant optimization I've
seen in a particular compiler, I've also found a slew of bone-headed
code sequences that reduce brilliance to mediocrity.
I have no doubt that you can take a trivially small function like
strlen and coerce your HLL code to emit stuff about as good as a
hand-optimized assembly version of the same code. But as I mentioned
earlier, this trick doesn't scale up to larger programs. From a
software engineering perspective, it's just as hard to write C/C++ (or
other HLL code) this way as it is to write assembly. And the tricks
you pull to get the good code emission often won't carry over to other
architectures (or even different CPUs in the same family). Though the
code *may* be portable (in the sense that the semantics are preserved
across compilations on different CPUs), the optimizations themselves
often are not. For example, IIRC your strlen function is working on
dwords. What happens when you compile this on a 16-bit CPU? On a 64-bit
CPU? Maybe the code works fine on various 32-bit CPUs, but the
optimization is hardly portable across different CPUs. (BTW, just for the
record, I first saw this optimization in the BSD C standard library
code back around 1995, I wonder when the trick was first created?
Probably for the VAX?)
>
> They rarely do, as a matter of fact, as only a fraction of code is
> time-critical. I'd go as far as to say that if strlen() speed is
> time-critical there is something seriously wrong with the design.
> However, that isn't the POINT of this thread. It is (for me) about
> higher-level languages not necessarily implying inefficient, or slow..
> or assembly automatically being synonymous with fast.
So your proof is *one* example versus another?
Hmmm... Hardly convincing.
I can provide you with a whole slew of strlen programs in assembly that
run much slower than:
t = s;
while( *s ) ++s;
return s-t;
Indeed, if compilers are *so* great, why can't they convert this code
into something as wonderful as what you've presented? After all, doing
so is really just an induction step (albeit, a complex one).
>
> The way I see it, C is a portable assembler,
C provides no access to low-level machine facilities. Therefore, it is
not an assembly language. It may be lower-level than languages like C++
or Java, that does not make it an assembly language. In order for it to
be an assembly language, you need access to all the features of the
CPU.
> which makes it a jack
> of all trades but master of none. But in this case it doesn't hurt
> performance; as can be seen, the beginning of the code is nearly
> identical when compiled.
You make this claim with just one example?
Gee, I'd argue that it's going to be real hard for an assembly language
programmer to beat the code that a C compiler produces for the
following:
i = 0;
That doesn't prove C compilers are as good as assembly programmers by
any stretch of the imagination. Your example is a bit more complex, but
nowhere near sufficient to "prove" the point.
The real problem with HLLs like C++ is that they encourage people to
write code like:
for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
And they have no idea what the compiler is doing with their code. Take,
for example, that innocuous "si++" at the end of the for argument list.
Written in standard C style (++ as a suffix, which is bad style to
begin with), this simple statement may wind up creating a large
temporary object and then immediately destroying it. How ugly. Sure,
someone who knows exactly what's going on behind the scenes probably
wouldn't write code this way, but how often do you see people writing
standard C++ programs the way you wrote your xstrlen? Do you honestly
write *all* your C++ code that way? Or do you just write code that way
when you're trying to prove that C++ compilers can emit code that's as
good as assembly programmers? And when you *do* write code that way, is
it any faster or easier than using assembly?
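For what it's worth, here is a minimal sketch (mine, not from any
particular library) of why "si++" can cost more than "++si" for a
non-trivial iterator: the postfix form has to copy the iterator just to
return the old value.

struct iterator
{
    const char* p;

    iterator& operator++()        // prefix: advance in place, nothing copied
    {
        ++p;
        return *this;
    }

    iterator operator++(int)      // postfix: copy, advance, return the copy
    {
        iterator old = *this;     // temporary object constructed...
        ++p;
        return old;               // ...and destroyed at the call site
    }
};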
Bottom line is that most C++ programmers would just write:
t = s;
while( *s ) ++s;
return s-t;
(or something similar) and move on. (This is assuming, of course, that
the function wasn't in the stdlib.) It doesn't matter how good the
compiler is (or should be), it's not going to be able to deal well with
code that is written in this fashion.
Though assembly programmers rarely do any better when writing a lot of
code (that is, they don't count cycles either), the bottom line is that
they are usually *forced* into looking at the abominations they create,
it's generally not hidden from them. As a result, they tend to do a
better job of implementing an algorithm that executes efficiently on
the underlying hardware than does a HLL programmer who doesn't have a
clue what's going on. Before you get in a tiff, I *do* realize that
*you* probably do know what's going on. But you don't write all the
world's HLL code.
> The only real difference is at the end,
> because I cannot express the "subtract with borrow" concept in C,
Yes, you do not have access to the low-level machine. As I said, C is
not an assembly language. Believe me, you don't have access to a *lot*
of things that might be useful on occasion.
> therefore
> I crafted an alternative which has nearly the same performance in
> practice.. it ONLY adds a little constant overhead, but the inner loop
> is, well, the same.
The #2 thread (after strlen) is memcpy. My alternative is to simply use
the movsb instruction. When it's all said and done, it's not that much
slower than doing *really* fancy stuff involving SSE instructions,
unrolled loops, and other nonsense that doesn't help improve memory
bandwidth much. I won't argue that I've got the fastest memcpy around,
but the differences are rarely worth the effort.
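A minimal sketch of that alternative in gcc inline assembly -- my
illustration rather than Randy's actual code, and it assumes the
direction flag is clear, as the ABI guarantees:

static inline void* movsb_memcpy ( void *dst, const void *src, unsigned int count )
{
    void *d = dst;
    __asm__ __volatile__ ( "rep movsb"          /* copy ecx bytes from (esi) to (edi) */
                           : "+D" (d), "+S" (src), "+c" (count)
                           :
                           : "memory"
                         ) ;
    return dst;
}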
>
> Also, I could unroll the C++ inner loop but I won't do it because I
> think it is not a particularly good idea in this case.
That depends entirely on the CPU and memory architecture.
Cheers,
Randy Hyde
pxor mm0,mm0
xloop:
pcmpeqb mm0,[esi]
pmovmskb eax,mm0
pxor mm0,mm0
add esi,8
// ...
Feel free to change the order if you think it helps.
- do a 64-bit / 8-component compare; the dest will have every byte cell
set to 0xff if the corresponding byte value in the source operand was
zero, and 0x00 otherwise. Then pack the MSB of each component into a
32-bit register.
From there on, the rest is too trivial to even mention.. about
instruction counts: your technique uses roughly 1.5 instructions per
char, this does roughly 0.5 instructions per char, and uses 64-bit
aligned reads. How much faster it is in practice.. if you want to
know.. find out!
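Since 128 bits per iteration was mentioned earlier in the thread, here
is a rough sketch of the same pcmpeqb/pmovmskb idea using SSE2
intrinsics -- my own illustration, not code from the thread; it assumes
gcc (for __builtin_ctz) and tolerates aligned over-reads within a
16-byte block, as discussed earlier.

#include <emmintrin.h>
#include <cstdint>

int sse2_strlen(const char* text)
{
    // align down to 16 bytes; bytes before 'text' in the first block are
    // masked off below, and the aligned load cannot cross a page boundary
    const __m128i* ap = reinterpret_cast<const __m128i*>(
        reinterpret_cast<std::uintptr_t>(text) & ~static_cast<std::uintptr_t>(15));
    unsigned int skip =
        static_cast<unsigned int>(text - reinterpret_cast<const char*>(ap));

    const __m128i zero = _mm_setzero_si128();
    unsigned int mask = static_cast<unsigned int>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(ap), zero)));
    mask >>= skip;                       // ignore bytes before the start of the string
    if ( mask )
        return __builtin_ctz(mask);      // index of the first zero byte

    for ( ;; )
    {
        ++ap;
        mask = static_cast<unsigned int>(
            _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(ap), zero)));
        if ( mask )
            return static_cast<int>(reinterpret_cast<const char*>(ap) - text)
                 + __builtin_ctz(mask);
    }
}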
p.s. don't worry I been around..
Here's what I wrote:
"Unrolling like that kinda kills branch prediction, I suppose. And
burns
code/tracecache for no real gain.. "
See, when I am not sure, I say so (", I suppose"). I don't see the
problem here; I didn't correct the correction, now did I? I clarified
what I meant and acknowledged the correction as due, yet it still seems
to be an issue somehow. Am I missing something?
/*
All the more reason that counting cycles isn't a good approach for a
strlen optimization.
*/
What "metric" would you, sir, use to measure which code is more
"efficient" when it comes to runtime performance? Timing is good
practice; the Pentium M isn't exactly off the charts and/or wildly
different in its timings from other contemporary x86 implementations.
I wanted to know how each performs, on *my system*, which I carefully
mention nearly every single time I post what it is (Pentium M 1.8
GHz). I post the results and Hutch is free to post his, which he has
done. This already gives three different x86 implementations to compare
with.
> Yes, you are.
I see. I thought you meant "count clock cycles" as in the days when it
was still possible, with the Pentium or older x86 implementations,
memory performance excluded. I thought I was *timing* the code, which
is slightly different, but apparently not enough.
Again, I don't see a problem with measuring the performance on one
system; note that I am not drawing conclusions about how the code will
perform on other, different systems. I am extrapolating, with good
reason, that the performance will be relatively similar or in a similar
ballpark.
/*
And this is averaged across all possible x86 CPUs (and others, too),
right? If not, you're counting cycles on *one* particular CPU. And
that's a waste of time if you're trying to produce generic results.
*/
And it wouldn't be a waste of time to run out and spend a day or two
testing this code on 20 or even 40 different x86 implementations? I
think it is a fair comparison: that's how the code performs on a
Pentium M 1.8 GHz. That was stated many times in this thread.
Of course, you admit later in your reply that you haven't READ the
whole thread, so how could you have known?
/*
Yes, rewriting strlen algorithms seems to be one of the funnest things
ever. Seems like someone new is doing it every other week.
*/
I didn't rewrite anything. I reverse-engineered the C++ source from the
assembly source. I believe Agner Fog wrote the assembly code in
question.
>> Apperently you haven't read the whole thread.
>
>Why bother?
To know what I was replying to, for starters? At first you seemed to
hold the misconception that I am somehow an advocate of strlen(), etc..
you seem grossly misinformed in your assumptions about what my stance
was in the first place. But why bother?
/*
Do you really think you're adding anything new to the discussion that
hasn't been said a thousand times already?
*/
Sir, I have been seeing these "discussions" a thousand times already,
too, and I have seen bizarre things like "100% Windows Assembler
Programmers" and what not. But this discussion amuses me, and quite
honestly I don't have better things to do at this time, as I am
travelling the world with my laptop and credit card. It is not a waste
of time in the sense that it is a good way to pass time.
You think, that I think, that I am adding a lot of new technology and
research innovation to the topic? You think I am so stupid that I
believe *I* am creating something new, when I am merely reverse
engineering asm source into C++? Excuse me, but either you must think I
am an idiot, or you are just trying to teach me something, like, a
lesson? As a rational guy, those are the things that spring to mind
first. I could be wrong, which wouldn't be the first time in history..
but either way I'm going to force myself to think you are simply much
more experienced than I am. No problem there, really!
>Do you really think this is
>new territory around here (or anywhere else)?
Now that you mention it, please give me references on the topic, as I
am curious what conclusions the previous thousand similar debates came
to. Let me guess: either brilliant new techniques were invented, or
everyone agreed that, shit, it doesn't make a frigging difference! (I
am inclined towards the latter, but that's just me.)
/*
The better question for
you to ask is "why bother responding to a thread that is probably the
#1 FAQ in a newsgroup such as this one?"
*/
You seem to know the answer to that one, as you asked it of yourself
and then promptly proceeded to reply to this thread. No, wait, you
aren't even interested in the topic and reply anyway -- why would you
do that? (I'm assuming you're not interested, as you give the
impression that this has been beaten to death a thousand times already,
etc... if I'm wrong and you do have an interest, my bad.)
> Do you think you're the first to make this suggestion?
Well, in this thread, yes. You *thought* *you* were the first to make
this suggestion in this thread? Yes, you did.
> Do you think my response is directed specifically at you?
Of course not, since this is a public forum.. however, when you quote
me, it is still a comment on something I wrote, no? Or what's your
point? That I don't know how to discuss on Usenet? Maybe I don't,
so?
/*
a comment to anyone reading this thread who thinks that there might be
something to what's being said here.
*/
That's fair; I agree with that without any problem whatsoever. Not that
you need my approval, but I'm letting you know anyway.
/*
The truth is, *memory architecture* is the biggest impediment to memory
scanning algorithms like strlen. Even if *every* (x86) CPU had exactly
*/
That's why I was confident that no assembler tomfoolery would be
substantially more efficient. In fact, if you go and scan 64 bits per
iteration with MMX you still get roughly the same performance. The
"application" is bandwidth-limited; there is no way around that.
You asked me before if I think I am bringing some amazing new
innovation to the table. I return the question to you.
>Who said anything about lazy evaluation? And even if we *do* use lazy
>evaluation, it's still a *whole* lot cheaper than running strlen
>everytime we want the length, no?
I did. In the first reply to this thread, I believe (why bother even
knowing what my stance on the issue is before jumping the gun, right?).
The alternative I had in mind was calling strlen() every time a string
was created from a zstring, not every time the length was queried, so
no, wrong assumption.
>expensive than recomputing it each time, I'm not sure I can think of
>any off-hand.
Me neither; hence, I store the length. std::string, Pascal, etc. all
seem to come to the same conclusion, so no, I don't think I am "alone"
with this "information" or presenting anything groundbreaking.
I was mostly interested in dispelling the "assembler myth"; you seem to
be more interested in setting the record straight about the
"misinformation" I have been spreading (or whatever -- wasting my own
time, others' time... bandwidth... storage.. lowering the
signal-to-noise ratio, all of the above :).
/*
hat being said, all you're
doing is saying that *some* HLL implementations beat *some* assembly
optimizations. Do you really think this is something new?
*/
In general? Nope. When replying to Hutch? Yeah, actually, I did.
So, something new to whom? To you? I think not. To me? I think not.
/*
for a *different* processor, what do you expect? And, of course,
CPU-dependent HLL code that does a decent job is going to walk all over
some sample code that was written by someone who (1) isn't very good,
(2) doesn't understand the characteristics of the CPU you're running
with, or (3) both.
*/
pssst... the assembly output from the compiler for the inner loop was
*identical* to the original assembly code. With the code being the same
no matter what x86 implementation runs it, I am not surprised the
timings are nearly the same, too; the differences come mostly from the
constant overhead of the code at the end.
> Why are you wasting your time? :-)
I got time!
> 1996 was the same as it is today.
Mine in 2005 is that it doesn't make much of a difference to me; the
code snippet could have been anything under the sun. It doesn't make
any difference to me what it was, I was mainly keen on the ASM vs HLL
aspect.
/*
Unfortunately, your plan doesn't scale up very well. The problem with
C++ is *not* that you can't write efficient code if you're *very*
careful and consider the code the compiler is emitting (and adjust your
C++ source code appropriately). The problem is that no one writes
anything but trivial little (and often non-portable) code this way.
*/
That mostly applies to instances where I expect to reuse the code and
don't want the worst possible runtime characteristics to be easily
invoked. If possible by design, not at all.
/*
IOW, C++ really doesn't have much benefit over assembly when you go to
all this trouble other than it might be able to run on different CPUs
(but you often lose the optimizations when you do this).
*/
A list of platforms I am working on is omitted here, because you don't
give a rat's ass.
/*
I have no doubt that you can take a trivially small function like
strlen and coerce your HLL code to emit stuff about as good as a
hand-optimized assembly version of the same code. But as I mentioned
earlier, this trick doesn't scale up to larger programs. From a
*/
Nor should it.
/*
And the tricks you pull to get the good code emission often won't carry
over to other
architectures (or even different CPUs in the same family).
*/
Which tricks might those be? I used addition, subtraction, bitwise or,
bitwise and, and other pretty fundamental operations, and a very small
number of local variables; the compiler would have to work really hard
to need more than 8 registers to hold all that. Unless we go and
compile the code for a Commodore 64, or maybe something Z80-based, we
won't run out of registers for such a trivial function very easily, at
least.
I mention 8 above because it takes some effort to think of a 32-bit
architecture, for example, that would have fewer registers and that I
might actually be compiling for someday.
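For the record, the kind of code I am talking about boils down to
something like this sketch (not the actual xstrlen from the repository;
it assumes a 32-bit unsigned type and plays a bit loose with strict
aliasing, as this kind of code tends to):
#include <cstddef>
#include <cstdint>
std::size_t xstrlen_sketch(const char* s)
{
    const char* p = s;
    // Walk byte by byte until the pointer is 4-byte aligned.
    while (reinterpret_cast<std::uintptr_t>(p) & 3u)
    {
        if (*p == 0)
            return static_cast<std::size_t>(p - s);
        ++p;
    }
    // Scan four bytes per iteration; the expression below is nonzero
    // exactly when one of the four bytes in v is zero.
    const std::uint32_t* w = reinterpret_cast<const std::uint32_t*>(p);
    for (;;)
    {
        std::uint32_t v = *w;
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;
        ++w;
    }
    // The terminator is somewhere in this word; finish byte by byte.
    p = reinterpret_cast<const char*>(w);
    while (*p)
        ++p;
    return static_cast<std::size_t>(p - s);
}
Only addition, subtraction, bitwise and/or and a couple of locals,
which was the point above.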
/*
What happens when you compile this on a 16-bit CPU?
*/
God forbid, or an 8-bit CPU! There are practical limitations on what I
assume the systems I will use the code on will have. I don't assume
this will work very well on PDPs either!
/*
On a 64-bit CPU? Maybe the code works fine on various 32-bit CPUs, but
the
optimization hardly portable across different CPUs.
*/
It does, on some. MIPS, PPC and x86-64 spring to mind. If there is a
64-bit CPU where sizeof(int) == sizeof(long long) == 8 and char is 8
bits (sizeof(char) == 1, always), then it won't work; the configuration
will be unknown or not supported, and the headers will fail compilation
at the #error.
If it is just one or two functions that fail, as in this case, I have
to put an #ifdef / #endif kludge there to always use the "char*"
version; if that doesn't work either, then there's no support. So far
the codebase has been very useful, though.
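As a sketch, the guard amounts to something like this (the macro names
are illustrative, not the ones in my headers):
#include <climits>
#if CHAR_BIT != 8
#error "configuration unknown or not supported"
#endif
#if UINT_MAX == 0xFFFFFFFFu
#define XSTR_HAVE_DWORD_SCAN 1   /* 32-bit unsigned int: dword version is usable */
#else
#define XSTR_HAVE_DWORD_SCAN 0   /* fall back to the plain char* loop */
#endif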
I would surmise the code is an order of magnitude more useful than an
x86-specific assembly snippet.
> So you're proof is *one* example versus another?
It's not proof, it's an example. Here is the post I replied to; maybe
you should, after all, read what is being discussed?
"Don't be overawed by compilers, assembler coding is not restricted to
the architecture of a C compiler. The following code is a modification
of Agner Fog's DWORD string length routine that aligns the start and
tests the length 4 bytes at a time. It has no stack frame and conforms
to the normal register preservation rules under windows so it preserves
ESI and EDI but trashes the rest. "
It clearly states that such optimization is exclusive to assembly,
which apparently isn't the case. Now you know the *context* of the
discussion, at least.
>Hmmm... Hardly convincing.
Convincing enough to debunk the implication that such optimization is
only achievable through the holy assembly.
>I can provide you with a whole slew of strlen programs in assembly that
>run much slower than:
>
>t = s;
>while( *s ) ++s;
>return s-t;
Just one is plenty, please do by all means.
/*
Indeed, if compilers are *so* great, why can't they convert this code
into something as wonderful as what you've presented? After all, doing
so is really just an induction step (albeit, a complex one).
*/
Now it is all of a sudden "wonderful", do I sense sarcasm?
>C provides no access to low-level machine facilities. Therefore, it is
>not an assembly language. It may be lower-level than languages like C++
>or Java, that does not make it an assembly language. In order for it to
>be an assembly language, you need access to all the features of the
>CPU.
Oh, gee-whizz, you found something to pick on, good for you! Maybe I
need to clarify my stance on this?
What I mean is that when you write assembly you generally use the
fully qualified register names. You maintain the register names
manually -- a laborious and error-prone process. Enter ANSI C. You can
write your intention with named variables, which are then translated at
compilation and assigned to real registers (add spilling for flavour in
this so-called register allocation stage). And on and on.
Because most microarchitectures are different, the pragmatic approach
taken is to find a common subset of operations the language supports. A
no-brainer, as you well know; you just wanted to nit-pick, well, good
job! Congrats!
> which makes it a jack of all trades but master of none.
As you are quoting this, I hope you also read it! Guess what!? The
above quote means the same thing, just without all the flair and
nitpicking going about!
> You make this claim with just one example?
Well, mostly the claim was based on 10+ years of professional
experience (and nearly 20 years of programming, total) and the opinion
that comes with that. I'm sure you also have a lot of experience, so
you know what I am talking about.
>Gee, I'd argue that it's going to be real hard for an assembly language
>programmer to beat the code that a C compiler produces for the
>following:
>
>i = 0;
Okay. Gee, that can be completely eliminated if the result isn't ever
used in the current scope, and as I don't even see a function call, any
possible side effect can easily be determined to be non-existent in
this case. It's random whether an assembly coder will "see" this or
not; it is more deterministic whether a compiler will see it or not.
But if you don't know the compiler in advance, then it isn't.
I dunno what to make of that. Was that a kind of ridiculous example to
show my actions in a "different light", so to speak? If so, ummmm...
right.
>That doesn't prove C compilers are as good as assembly programmers by
>any stretch of the imagination. You're example is a bit more complex,
>but nowhere near sufficient to "prove" the point.
C compilers aren't better than assembly programmers, they are just more
time- and money-efficient. When there isn't a choice, there isn't a
choice; ask Tom Duff.
But that wasn't my point, even though you seem to be under that
delusion... I was showing how that particular assembly code snippet
doesn't "beat" HLL code, not the other way around.. a subtle
difference... maybe too subtle if you haven't even read the thread...
>for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
>
>And they have no idea what the compiler is doing with their code. Take,
>for example, that innocuous "si++" at the end of the for argument list.
++i vs. i++, gee-whizz, now we're getting to the ABC's and 101's of C++
programming.. and you blame me for going too basic? ;-)
Yeah yeah, i++ creates a temporary object because it has to return the
*current* value; before returning from the ++ operator (postfix) we
have to increment the current value, so we cannot return it directly..
we return a temporary object created before the increment.
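In code, roughly (an illustrative iterator, not any particular class):
// Why postfix ++ costs more for class types: the copy is the
// temporary object being talked about above.
struct char_iter
{
    const char* p;
    char_iter& operator++()      // prefix: advance, return *this
    {
        ++p;
        return *this;
    }
    char_iter operator++(int)    // postfix: copy, advance, return the copy
    {
        char_iter old = *this;
        ++p;
        return old;
    }
};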
/*
someone who knows exactly what's going on behind the scenes probably
wouldn't write code this way, but how often do you see people writing
standard C++ programs the way you wrote your xstrlen?
*/
I wouldn't know; I suppose I've been in a professional community for
far too long. My attitude isn't professional as I am a bit childish, as
you might have noticed.. but that's my problem, thank you for not
making a funny remark about that in advance.
> Do you honestly write *all* your C++ code that way?
I need clarification on this: what do you mean "that way"? What
specifically strikes you as odd about "that way"-- I don't get it. Yes,
I do write code "that way" a lot of the time, it comes from the
backbone. Is it that bad? If so, show me the error in my ways and I'll
learn.
What took me so much effort was that first I reverse engineered the asm
snippet, but I wasn't happy with it, as I would *never* actually go out
and write code like that off the bat. I have some idiosyncrasies, I
admit, which I follow as I have found them sound practice, and I keep
myself trim and up-to-date on what works and what doesn't.
For example,
while ( x-- ) { ... }
vs.
do { ... } while ( --x );
vs.
for ( ; x<xmax; ++x ) { ... }
And so on and so on. I assume very little, I check, but I still keep in
mind the semantics and how they can be expected to behave when thinking
serially about how the IP moves over the code. For example, the first
while loop might not do what's expected if x has a certain value ;) ;)
the do-while is always executed at least ONCE, which might be a Bad
Thing.. the for-loop is more equivalent to the while than to the
do-while if you want to look at the pass-through-branch aspect, and so
on... there is no end to these things, really.
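A tiny illustration of the first point, assuming x is unsigned (purely
illustrative):
void loop_pitfall()
{
    unsigned int x = 0;
    while ( x-- )        // the test sees 0, so the body is skipped...
    {
        /* never entered */
    }
    // ...but the post-decrement has wrapped x around to a huge value here.
    // do { ... } while ( --x ); with x == 0 would be worse still: the body
    // runs once and the loop then spins for roughly 4 billion iterations.
}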
> Or do you just write code that way
>when you're trying to prove that C++ compilers can emit code that's as
>good as assembly programmers? And when you *do* write code that way, is
>it any faster or easier than using assembly?
Well, shit, www.liimatta.org, go to the Fusion page, download the
"latest version", decompress the source code. The source code is 750 kB
compressed (with some minor data inside); feel free to go through every
line if you have to.
And no, I don't write code "that way" to be faster than assembly.
That's "the way" I write code. I don't know if you are trying to insult
me, be polite or just be sceptical. Whatever, dude, if you don't have
the time or the will to verify what I write here, fine, I don't care
what you think about me. But that's some work I've been doing. Want my
resume? I don't have one. I always have job offers in my inbox and it's
been that way since 1996 or so.
>Bottom line is that most C++ programmers would just write:
>
>t = s;
>while( *s ) ++s;
>return s-t;
>
>(or something similar) and move on.
Guess what? That's precisely what I wrote, too, and moved on. It wasn't
until recently, I think October 24, 2005, that I, for fun, wrote the
C++ version, and as it proved to be alright I added it to the source
repository. I wrote regression tests only because the code was added
and I don't take changes lightly; otherwise I'd have moved on again. In
fact, I have moved on already; what is left now is this discussion..
/*
clue what's going on. Before you get in a tiff, I *do* realize that
*you* probably do know what's going on. But you don't write all the
world's HLL code.
*/
Most of the world's HLL code doesn't need to be "fast"; most of the
time I would be glad if it just "worked", which it generally does, if
not before a patch or two then at least after.
"Fast" is not a goal, "fast enough" is. If the code is "fast enough",
that's it, job done. I've seen a guy optimizing keyboard interrupt
handlers in assembler for MS-DOS; maybe he thought someone would press
keys really, really fast and that would slow his program down, go
figure. Or maybe he was scared that he would miss a few keystrokes;
again, go figure. Such strange characters are not my specialty (you may
say myself excluded... ?=)
>Yes, you do not have access to the low-level machine. As I said, C is
>not an assembly language. Believe me, you don't have access to a *lot*
>of things that might be useful on occasion.
I don't have to believe YOU, I believe my own EXPERIENCE.
>The #2 thread (after strlen) is memcpy. My alternative is to simply use
>the movsb instruction.
Are you trying to insult my intelligence? Look at string.hpp, you
might see std::memcpy() being invoked here and there. I don't even
*consider* the alternative!
If you see meta::vcopy(), it is a different beast: it checks whether
the type is a POD (using traits) and does either a memcpy or an
object-by-object copy, so that the corresponding copy constructors and
whatnot are invoked correctly in the process. Mostly I use that
construct with templates.
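The idea, as a sketch (this is not the actual meta::vcopy, just the
dispatch pattern it uses; std::is_trivially_copyable stands in for the
real trait):
#include <cstddef>
#include <cstring>
#include <type_traits>
template <typename T>
void vcopy_impl(T* dest, const T* src, std::size_t n, std::true_type)
{
    std::memcpy(dest, src, n * sizeof(T));   // raw byte copy is enough
}
template <typename T>
void vcopy_impl(T* dest, const T* src, std::size_t n, std::false_type)
{
    for (std::size_t i = 0; i < n; ++i)      // user-defined copy semantics run
        dest[i] = src[i];
}
template <typename T>
void vcopy_sketch(T* dest, const T* src, std::size_t n)
{
    vcopy_impl(dest, src, n, std::is_trivially_copyable<T>());
}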
>> Also, I could unroll the C++ innerloop but I won't do it because I
>> think it is not a particularly good idea in this case.
>
>That depends entirely on the CPU and memory architecture.
Context: competing with the specifically mentioned assembly code, which
was unrolled. If you take that into consideration you have not so much
to nitpick about.
Since it is x86 assembly, I'm assuming some contemporary x86
implementation will be running the code. I don't think unrolling the
C++ code will do much good. Maybe on a 386 or older processor it might
pay off, hell, most likely it would. But I'm not too interested in the
386 these days...
While I am certainly impressed with the capacity to write portable code
that comes near the performance of hand written assembler, on the major
Windows platform, portable code is of little value when so much code is
specific to the actual OS.
I have seen people write portable ANSI C style code while running
DirectX or Windows-specific multimedia functionality, which is simply a
waste of time. I lost my taste for portability when Microsoft dropped
OS/2 support and the time I wasted learning some of this stuff was
never recovered.
If I was writing code for the never-ending flavours of Linux or
something solid like FreeBSD, it would be in portable ANSI C (not++)
but as I write primarily for current 32 bit Windows, such portability
is a waste of time.
Something that makes me laugh is the assumption that assembler is
written only for immediate code problems, where historically it has
regularly been used for reusable libraries, which generally justifies
the extra time spent writing it. Instead of cobbling together something
on the spur of the moment, a reusable library of commonly used
procedures works fine, and assembler is an excellent tool in this area.
Something that also makes me chuckle is the assumption that if you wait
long enough, someone will one day create a C compiler that can output
code in the same class as hand written assembler, but after years of
hearing this nonsense, people who make the effort still outclass a
compiler because there is more to code design than robot software
output.
With Windows still controlling something like 90% of the desktop
market, considerations of what runs on PPC or MIPS or SPARC or Solaris
fit into the category of "who cares" in most instances.
> pxor mm0,mm0
> xloop:
> pcmpeqb mm0,[esi]
> pmovmskb eax,mm0
> pxor mm0,mm0
> add esi,8
> // ...
The basic idea is a good one, but as only about 5% of the computers
around the world could run it, it's a moot point, whereas the two algos
I have posted will run on a 486 running OEM Win95. I would certainly
like the luxury of being able to write PIV code with SSE3, but with the
sheer volume of older computers still running, it will not be any time
soon that this will happen.
One of the good things about 64-bit x86 getting going in the next few
years is that there will be a far more modern instruction set where
10-year-old compatibility is not an anchor around your neck.
Most of the code isn't OS specific unless the application is really
trivial.
>I have seen people write portable ansi C style code while running
>directX or Windows specific mutimedia functionality which is simply a
There is nothing wrong with that either, I do this all the time in
DirectX applications:
surface* pic = surface::create("data.zip/test.jpg");
The same code works with OpenGL, X11, etc.. on Linux, Windows, etc. I
don't find that strange at all. I also keep the rendering module not
too tightly integrated with the rest of the code; I usually use
components and build applications using libraries, some of which, heck,
most of which are portable.
/*
something solid like FreeBSD, it would be in portable ANSI C (not++)
but as I write primarily for current 32 bit Windows, such portability
is a waste of time.
*/
That's what you do; I don't write for a specific OS per se, I have
worked over the years on a wide range of software across hardware and
platforms. This includes mobile graphics chips, game software
(published), graphics chip technology demos for chip vendors.. I'm not
locked to Windows, so I don't share the same opinion.
/*
Something that makes me laugh is the assumption that assembler is
written only for immediate code problems where historicaly it has
regularly been used for reusable libraries which generally justifies
*/
You remind me of my young and wild days..
/*
Something that also makes me chuckle is the assumption that if you wait
long enough, someone one day will create a C compiler that can output
code in the same class as hand written assembler but after years of
*/
Wait a second, champ, I never said that. I am advocating the thought
that resorting to assembly as the *first* thing is folly. Resorting to
it when there is a need isn't.
/*
hearing this nonsense, people who make the effort still outclass a
compiler because there is more to code design that robot software
output.
*/
Of course. But only when it pays off in some way, makes a difference to
real-world software. strlen() is a good example, where it doesn't make
a diddly-doo's difference to a real program's performance in most
real-world, production software. I only took it upon myself to write
the C++ version to validate that my theory, which is forged by years of
practice, is still correct. It still is.
/*
With Windows still controlling somthing like 90% of the desktop market,
considerations of what runs on a PPC or MIPS or SPARC or SOLARIS fits
into the category of "who cares" in most instances.
*/
That depends on who you are writing software for. If you write it for
yourself, okay. If it is freeware, open source.. who cares where it
runs besides the author, or those who contribute. If it is an
application written for a customer, usually they dictate the terms. I
write embedded stuff, so platforms change. Sometimes someone wants to
put the computations onto a cluster. Sometimes... the point is I don't
mind that, I write for wherever the software is needed. That's what I
do.
/*
The basic idea is a good one but as about 5% of computers around the
world could run it, its a mute point where the two algos I have posted
will run on a 486 running OEM win95. I would certainly like the luxury
*/
My basic idea was: why not take advantage of the latest instruction set
extensions when they are *available*? When they are detected, use them
and have faster software, which I think was the WHOLE POINT to begin
with when writing in assembly, or did I miss something? You write in
assembly just because...? What?
If there is no SSE, MMX, whatnot, fall back to generic x86 code that
works down to the 386, no problem?
If your point is to make the fastest possible code and you take the
pains to write the code in assembly, but then ignore the latest
instructions available on the x86 platform, what's wrong with that
picture?
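As a sketch of what I mean by detect-and-fall-back (cpu_has_mmx() is a
placeholder for a real CPUID check, and the "MMX" routine here is only
a stand-in):
typedef unsigned int (*strlen_fn)(const char*);
static unsigned int strlen_generic(const char* s)   // 386-safe byte loop
{
    const char* p = s;
    while (*p)
        ++p;
    return static_cast<unsigned int>(p - s);
}
static unsigned int strlen_mmx(const char* s)       // stand-in only
{
    // a real build would scan 8 bytes per iteration with pcmpeqb here
    return strlen_generic(s);
}
static bool cpu_has_mmx()                           // placeholder
{
    return false;   // real code would execute CPUID and test the MMX bit
}
// selected once, then every call goes through the pointer
static const strlen_fn fast_strlen = cpu_has_mmx() ? strlen_mmx
                                                   : strlen_generic;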
/*
of being able to write PIV code with SSE3 but with the sheer volume of
older computers still running, it will not be any time soon that this
will happen.
*/
It depends on what your audience is. If you write games for the
commercial market, for instance, go and take a look at the minimum
requirements on the game boxes. Most games don't run on a 386 anymore.
Just an example of one market segment.
I don't know why you would still want to support the 386 while writing
*Windows* software in 2005.
/*
One of the good things when 64 bit x86 gets going in the next few years
is that there will be a far more modern instruction set where 10 year
old compatibility is not an anchor around your neck.
*/
I don't see how that is supposed to fit your point of view where
MMX/SSE is already too modern. Explain?
The problem is that you then generalize your results and say "see,
using assembly isn't worthwhile!" You understand the fallacy here,
right?
> with this "information" or presenting anything groundbreaking.
>
> I was mostly interested in dispelling the "assembler myth",
And that's the argument that I'm not buying. The fact that in some
controlled situation you can cajole a C++ compiler into producing code
about as optimal as one can expect does not imply that the compiler
will do this all the time. You are dispelling no myth, I'm afraid.
>
> /*
> hat being said, all you're
> doing is saying that *some* HLL implementations beat *some* assembly
> optimizations. Do you really think this is something new?
> */
>
> In general? Nope. When replying to Hutch? Yeah, I did, actually.
Yet earlier in your post you talk about the results on your machine vs.
the results on Hutch's machine. Exactly how are you dispelling any
myths here? Each machine requires an independent optimization. The fact
that an optimization on Hutch's machine isn't as valid on your machine
should prove to be no surprise here. It's one of the main reasons I
quit "counting cycles" when the Pentium first arrived -- there's no
sense in it anymore.
>
> /*
> for a *different* processor, what do you expect? And, of course,
> CPU-dependent HLL code that does a decent job is going to walk all over
> some sample code that was written by someone who (1) isn't very good,
> (2) doesn't understand the characteristics of the CPU you're running
> with, or (3) both.
> */
>
> pssst... the assembly output from compiler for the innerloop was
> *identical* to the original assembly code. That sort of means no matter
> what x86 implementation the code being the same I am not surprised the
> timings being nearly same, too, the differences come from the constant
> overhead mostly from the code at the end.
And how does this dispel any myths? No one is claiming that a compiler
*never* produces code that could be as good as a human's code,
particularly for short code sequences. Just that as the programs get
larger, the compilers tend to fall flat on their faces. Again, it's the
issue of "brilliant code sometimes" plus "bonehead code other times"
equalling mediocre code overall. Sure, you can "reverse engineer" an
assembly algorithm in C++ (a perfectly fair thing to do), but how often
do you see this in practice? And is the result any better (readable,
understandable, maintainable, robust, etc.) than the corresponding
assembly code? How many people, for example, will find the C++ strlen
function you've written to be any more understandable than the assembly
version (from an algorithmic point of view, obviously)?
>
> /*
> Unfortunately, your plan doesn't scale up very well. The problem with
> C++ is *not* that you can't write efficient code if you're *very*
> careful and consider the code the compiler is emitting (and adjust your
> C++ source code appropriately). The problem is that no one writes
> anything but trivial little (and often non-portable) code this way.
> */
>
> Mostly applies for instances where I expect to reuse the code and don't
> want worst possible case runtime characteristics to be easily invoked.
> If possible by design, not at all.
Ultimately, the way to write faster programs is to *skip* the C
mentality. IOW, if you want really fast programs that manipulate
strings, you don't get in the habit of using C standard library
routines (regardless of how well they are implemented). This is the HLL
trap, not the lack of compiler code-generation quality. If the compiler
produced the best possible code that could be generated and you turned
around and did things like "strlen" or any of a host of other stdlib
functions to achieve your goals, you'd wind up with slower running code
than would be possible if you were completely aware of what was going
on in the program. C++ (and the STL) take this problem to a new
extreme. It's so easy to use things like sets, lists, maps, or other
containers that people do so without thinking about the costs
associated with them. Even in plain C, you get performance problems
when people do things like:
strcpy( a, b );
strcat( a, c );
strcat( a, d );
The problem, of course, is that you wind up computing the length of the
strings over and over again when it is completely unnecessary (as each
string function call above internally produces a pointer to the end of
the string that could be used by the next function call). This may seem
like a trivial example, and easily worked around, but it's typical of
the kinds of problems that sap performance out of HLL programs; it's
also the kind of thing that you don't see in low-level assembly
programs because the programmer sees what is going on when writing
their code (by "low-level" assembly, I mean that you're not simply
writing the code with a HLL mindset and calling these same sorts of
functions from your assembly code).
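To make that concrete, the obvious workaround is to keep the end
pointer around instead of rescanning; a quick sketch (the helper is
hypothetical, not a stdlib routine):
#include <cstddef>
#include <cstring>
char* append(char* end, const char* src)   // returns the new end of the string
{
    std::size_t n = std::strlen(src);
    std::memcpy(end, src, n + 1);          // copy the terminator as well
    return end + n;
}
// replacing the strcpy/strcat chain above:
//   char* end = append(a, b);
//   end = append(end, c);
//   end = append(end, d);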
>
> /*
> IOW, C++ really doesn't have much benefit over assembly when you go to
> all this trouble other than it might be able to run on different CPUs
> (but you often lose the optimizations when you do this).
> */
>
> A list of platforms I am working on omitted here, because you don't
> give a rat's ass.
When you put it that way, I guess I have to agree with you.
Precisely my point. The optimizations are not portable. For this
particular example, you're limited to 32-bit processors.
>
> /*
> On a 64-bit CPU? Maybe the code works fine on various 32-bit CPUs, but
> the
> optimization hardly portable across different CPUs.
> */
>
> It does, on some. MIPS, PPC and x86-86 spring to mind. If there is
> 64-bit CPU where sizeof(int) == sizeof(long long) == 8, where char is 8
> bits (sizeof(char) == 1, always) then it won't work and the
> configuration will be unknown or not supported, and the headers will
> fail compilation at #error ).
IOW, the "trick" isn't portable and you're suffering from some of the
same problems as assembly language.
>
> If it is just one or two functions that fail, such as this case, have
> to put #ifdef / #endif kludge there to always use the "char*" version,
> if that doesn't work either then no support. So far the codebase has
> been very useful, though.
Sure, and we can write multiple assembly routines for different
processors, too. Granted, more than you'd need with C (or any other
HLL), but the idea is just the same.
>
> I would surmise the code is order of magnitude more useful than x86
> specific assembly snip.
By what reasoning?
Given that about 90% of the world's computers today are x86 CPUs, I
don't see how having the code in portable C++ is going to make it an
order of magnitude more useful. Certainly we can find *some* people
for whom portability to other processors is necessary, but an order of
magnitude (IOW, they need to run the code on ten different processors)?
I don't question your claim; from a mathematical perspective I'm sure
we could find a group of people amongst whom the need to have a
portable strlen function that compiles on 10 different 32-bit (non-x86)
processors is important, but...
When you look at the number of people (end users) who will actually
benefit from the code, however, it becomes real clear that the choice
of HLL or x86 assembly is *mostly* irrelevant because most end-users
are running x86 boxes.
>
>
> > So you're proof is *one* example versus another?
>
> It's not proof, it's an example. Here is the post I replied to, maybe
> you should afterall read what is being discussed?
You are "debunking a myth". You don't debunk a myth (that is, prove
your point) with one example. You might be able to prove that assembly
isn't *always* better with one example, buy you cannot make a claim
that there is no need to use assembly language on the basis of one
example.
>
> "Don't be overawed by compilers, assembler coding is not restricted to
> the architecture of a C compiler. The following code is a modification
> of Agner Fog's DWORD string length routine that aligns the start and
> tests the length 4 bytes at a time. It has no stack frame and conforms
> to the normal register preservation rules under windows so it preserves
> ESI and EDI but trashes the rest. "
>
> It is clearly stating that such optimization is exclusive to assembly,
> which apparently isn't the case. Now you know the *context* of the
> discussion, atleast.
>
> >Hmmm... Hardly convincing.
>
> Convincing enough to debunk the implication that such optimization is
> only achievable though the holy assembly.
And you're assuming that Hutch got it right? That his example is the
absolute pinnacle of what can be done in assembly?
>
> >I can provide you with a whole slew of strlen programs in assembly that
> >run much slower than:
> >
> >t = s;
> >while( *s ) ++s;
> >return s-t;
>
> Just one is plenty, please do by all means.
mov edi, src       ; EDI -> start of the string
mov ecx, -1        ; maximum scan count
mov al, 0          ; byte to search for (the terminator)
repne scasb        ; scan forward until [edi] == al
That will probably run slower than the output of a good compiler for
the above C code. Yes, scasb is *that* bad.
>
> /*
> Indeed, if compilers are *so* great, why can't they convert this code
> into something as wonderful as what you've presented? After all, doing
> so is really just an induction step (albeit, a complex one).
> */
>
> Now it is all of a sudden "wonderful", do I sense sarcasm?
Sarcasm or not, it's quite clear that a compiler working on your code
produces a faster result than one might expect from a compiler working
on the simple C code above.
> What I mean, is, that when you write assembly you generally use the
> fully qualified register names. You maintain register names manually.
Yes, one advantage of using assembly is that you have complete access
to the low-level machine facilities, including the registers.
> Labourous and error prone process.
Changing the subject?
I'm certainly not questioning the fact that, in general, writing
assembly language is more "laborious and error-prone" than writing in a
HLL. OTOH, I'll also point out that if you write your HLL code the way
you've written that xstrlen function, you will find writing the code
fairly "labourous and error prone". Optimization is a painful process,
regardless of the language.
> Enter ANSI C. You can write your
> intention with named variables, which are then at compilation
> translated and assigned to real registers (add spilling for flavour in
> this so.-called register allocation stage). And on and on.
And sometimes the compiler is brilliant when doing this, and sometimes
it is real bone-headed. Your point?
>
> Because most microarchitechtures are different, the pragmatic approach
> taken is to find a common subset of operations the language supports. A
> no-brainer, as you well know, you just wanted to nit-pick, well good
> job! Congrats!
Again, we're back to the argument of "this makes life so much easier
for the programmer" rather than "the compiler does as good a job of
this as the programmer could do himself."
>
> > which makes it a jack of all trades but master of none.
>
> As you quoting this, I hope you also read this! Guess what!? The above
> quote means same thing just without all the flair and nitpicking going
> about!
Sorry, you've lost me along the road somewhere. Perhaps you could be
more articulate. It really seems to me that all you've done here is
switch from "compilers produce code as good as humans do" to "it isn't
cost-effective for humans to write code this way, so we live with what
the compilers produce." A very different argument. But for some reason,
that's where this argument always winds up. I guess that means we've
reached the end of the debate.
>
> > You make this claim with just one example?
>
> Well, mostly the claim was based on 10+ years of professional
> experience (and nearly 20 years of programming, total) and the opinion
> that comes with that. I'm sure you also have a lot of experience, so
> you know what I am talking about.
Well, maybe that's the difference between us. You see, I've got about
25 years' experience as a professional programmer and I've worked both
in the times when assembly was mandatory (to get any kind of
performance at all) and I've been around during the past 15 years when
compilers became efficient enough to be usable for the larger
percentage of projects. Most of my real (professional) work is done in
languages like C, C++, and Delphi. So it's not like I'm unaware of what
these languages are good for. OTOH, I don't go around claiming that
there is no reason to use assembly because compilers today are as good
as humans. I may very well say that it makes *economic* sense to use
HLLs, but it's not the case that compilers are as good as human beings.
>
> >Gee, I'd argue that it's going to be real hard for an assembly language
> >programmer to beat the code that a C compiler produces for the
> >following:
> >
> >i = 0;
>
> Okay. Gee, that can be completely eliminated if the result isn't ever
> used in the current scope, as I don't even see function call so any
> possible side effect can easily be determined to be non-existent in
> this case. It's random if assembler coder will "see" this or not, it is
> more deterministic if a compiler will see this or not. But if don't
> know the compiler in advance, then, it isn't.
Touche.
>
> I dunno what to make of that. Was that a kind of ridiculous example to
> show my actions in a "different light", so to speak? If so, ummmm...
> right.
The point I'm making is that xstrlen is a ridiculous example to use to
debunk the myth that assembly isn't useful. Hutch may have overspoken,
but your attempts to show that assembly isn't needed for this task
aren't quite making the point. xstrlen is actually one of the easier
things to code efficiently in C. Just like "i=0;" is pretty easy to
code efficiently in C. If you *really* want to debunk the myth that
assembly has no advantage over HLLs, you need to move beyond strlen.
As an aside:
A few years back (okay, maybe decades at this point) some research at
Berkeley, or thereabouts, demonstrated that most strings processed in
HLL programs (written by students, granted) were 10 characters or less
in length. If that still holds today, it's almost a no-brainer that the
trivial strlen function (byte at a time) will outperform the craziness
embodied in the examples in this thread, because of the intrinsic
overhead. Sure, you can feed your code thousands of long strings and
demonstrate how much better one algorithm is than another, but the
bottom line is that in the real world, the data sets in use may
completely invalidate the test set you're using. IOW, how well does
your test data model the real world data that an average program will
see? This is one reason why I argue that xstrlen is a ridiculous
example.
So when Hutch talks about saving all the function setup and tear-down
code, this is not an insignificant matter. It reduces the overhead of
the function call, thus vastly improving performance for small strings
(which this older research suggests is a common situation). Now the
truth is, some compilers can generate code that doesn't require setting
up and tearing down stack frames too, so Steve's proclamations aren't
all *that* impressive, but for the common case, reducing function call
overhead can produce dramatic results (assuming, again, that short
strings are common).
>
> >That doesn't prove C compilers are as good as assembly programmers by
> >any stretch of the imagination. You're example is a bit more complex,
> >but nowhere near sufficient to "prove" the point.
>
> C compilers aren't better than assembly programmers, they are just more
> time and money -efficient. When there isn't choise, there isn't choise,
> ask Tom Duff.
Again, that's a different argument. Few people question the economic
aspects. Then again, if people wrote C code the way xstrlen has been
written, the economic advantage of C over assembly would be greatly
diminished. Again, *optimization* is an expensive process, regardless
of the language used. Assembly generally has a bad reputation in terms
of programmer efficiency because people who write assembly code tend to
write more (locally) optimized code than those working in HLLs. Ergo,
it's more expensive. If you write assembly code without regard to
minimizing resource use, then it's far less costly to use assembly.
>
> But that wasn't my point, even though you seem to have that
> disillusion... I was showing how that particular assembly code snip
> doesn't "beat" HLL code, not the other way around.. a subtle
> difference...maybe too subtle if haven't even read the thread...
Oh, I've read it. That's not what you said earlier. But I'll allow you
to back out of that gracefully. This is, after all, USENET and we have
to allow for considerable "unstateds" and "misreads".
>
> >for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
> >
> >And they have no idea what the compiler is doing with their code. Take,
> >for example, that innocuous "si++" at the end of the for argument list.
>
> ++i vs. i++, gee-whizz, now we're getting to the ABC's and 101's of C++
> programming.. and you blame me for going too basic? ;-)
Amazing, isn't it? Something so *basic* trips up 99% of the programs
out there. Exactly the point I'm making. You won't see mistakes like
this made in a typical assembly program.
>
> Yeah yeah, i++ creates temporary object because it has to return the
> *current* value, before returning from ++ operator (postfix) we have to
> increase the current value, we cannot return it.. so we return
> temporary object created before the increment.
>
> /*
> someone who knows exactly what's going on behind the scenes probably
> wouldn't write code this way, but how often do you see people writing
> standard C++ programs the way you wrote your xstrlen?
> */
>
> I wouldn't know, I suppose been in a professional community for far too
> long. My attitude isn't professional as I am a bit childish, you might
> have noticed.. but that's my problem, thank you for not making funny
> remark about that in advance.
And when we look in your code, we'll never see an example of this,
right?
That's the only point I'm making- HLL abstractions, the things that
make it easier and faster for programmers to write code, also hide the
things that can cost them dearly. Even when they've got the experience
to know better.
>
> > Do you honestly write *all* your C++ code that way?
>
> I need clarification on this, what you mean "that way?",
As in the way you've written xstrlen.
> what
> specificly strikes odd in "that way"-- I don't get it, yes, I do write
> code "that way" a lot of times, it comes from the backbone. Is it that
> bad, if so, show me the error in my ways and I'll learn.
>
> What took me so much effort was that first I reverse engineered the asm
> snip, but I wasn't happy with it as I would *never* actually, go out,
> and write code that was off the bat. I got some idiosynchronies, I
> admit, which I follow as I found them a sound practise, and I keep
> myself trim and up-to-date what works, and what doesn't.
Great!
The point I'm making here is that writing code like xstrlen is a good
example of something that gets you into trouble down the road. Written
in assembly, we *expect* to rewrite it for later processors. Written in
C? No, we expect to be able to recompile it and have it work fine, no
matter what comes along. And we curse the guy who wrote C code like
that. Other than "why did this idiot use assembly?", few people would
question the use of that crazy strlen algorithm in assembly; indeed,
they would expect it to be written that way.
>
> > Or do you just write code that way
> >when you're trying to prove that C++ compilers can emit code that's as
> >good as assembly programmers? And when you *do* write code that way, is
> >it any faster or easier than using assembly?
>
> Well, shit, www.liimatta.org, go to Fusion page, download the "latest
> version", decompress the sourcecode. The sourcecode is 750 kB
> compressed (with some minor data inside), feel free to go through every
> line if you have to.
Well, I went through enough lines to know that you don't write your
code the way you wrote xstrlen. Which is *good* from a
readable/maintainable/robustness point of view, but it also means that
someone who writes assembly code to do the same job is generally going
to get much more efficient results. Whether this is important or not is
a different question, of course.
>
> And no, I don't write code "that way" to be faster than assembly.
Of course not. Most people writing C++ code don't write their code to
be faster than assembly. Indeed, "fast" is rarely a factor, other than
fast development or easy development.
> That's "the way" I write code.
And it's not a bad style (though I'd suggest more comments :-) ).
But people who write code that way (and I'm no different) are not
writing their C++ code in a manner that compilers can efficiently
translate into machine code. And if someone were doing the same
operations in assembly, even if they weren't the *greatest* assembly
programmer around, they'd probably produce better output than the
compiler. It all has to do with thinking in assembly language rather
than thinking in a HLL. That's the crucial difference. Your xstrlen
function is a good example of thinking in assembly (even when writing
in C). I've seen lots of assembly code where the author was thinking in
a HLL rather than assembly (and the result isn't very good). But when
someone thinks in assembly, the result is often quite good. This is why
assembly programs are generally better than HLL programs. Assembly
programmers often think in assembly whereas HLL programmers think in
their HLL.
> I don't know if you trying to insult, be
> polite or just being sceptic.
Label me a sceptic. I've heard it *many* times before. And the argument
always boils down to (as this one has) that it's more economical to
develop in a HLL, which is what makes the HLL better. No argument
there. But the economics don't imply that the compilers can do a
better, or even as good a job as the assembly programmer. Sure, in a
few specialized cases, it can. But the results don't scale up to large
systems (for, quite frankly, the same reasons using assembly language
doesn't scale up).
As for insulting, please check your own post. There are a few too many
profanities and inferences on your part for you to be able to play this
card here.
> Whatever, dude, if you don't have time or
> will to verify what I write here, good, I wouldn't care what you think
> about me.
That makes us even. I don't care what I think about you either :-)
> But that's some work I been doing. Want my resume? I don't
> have one. I always have job offers on my inbox and it been that way
> since 1996 or so.
Good for you. You escaped the problems of our industry over the past
four years. But discussions of your experience and how long you've been
employed are not particularly good supporting arguments for your
hypothesis that there is no need to use assembly language because
compilers generate code as good as a hand coder.
>
> >Bottom line is that most C++ programmers would just write:
> >
> >t = s;
> >while( *s ) ++s;
> >return s-t;
> >
> >(or something similar) and move on.
>
> Guess what? That's precisely what I wrote, too, and moved on.
And that's exactly the point I'm making. Most HLL programmers (myself
included) will often write code like this and just move on. Assembly
language programmers (myself included, when working in assembly)
generally *wouldn't* do this. Oh, they might do it on the first pass,
but then they'd see how ugly the result is and decide to do something
about it. Sometimes, particularly with inexperienced programmers, the
ugliness might not be discovered until someone points out that a HLL
call is faster than their assembly gem (witness this thread), sometimes
they can just tell that the solution isn't very good. But the bottom
line is that an assembly language programmer is more prone to do
something about the ugly code rather than thinking "well, that's the
best I can do" and move on. How many C/C++ programmers, for example, do
you think could come up with your xstrlen function on their own?
>
> /*
> clue what's going on. Before you get in a tiff, I *do* realize that
> *you* probably do know what's going on. But you don't write all the
> world's HLL code.
> */
>
> Most of the world's HLL code doens't need to be "fast", most of the
> times I would be glad if it "worked", which it generally does, if not
> before a patch or two atleast after.
And for code that doesn't need to be fast (or small, or otherwise
resource limited) there is no need to use assembly. We can agree on
that.
>
> "fast" is not a goal, "fast enough" is.
Of course. And just as in every other "assembly vs. HLL" thread that
has ever existed, we wind up with "okay, so what if HLL code isn't as
fast as assembly; CPUs are so fast we don't need it to be." The fact
that we may not need all programs to be efficient does not tell us that
compilers are doing as good a job as assembly programmers. It simply
tells us that the CPU manufacturers have been doing a decent job and we
can get away with a lot of sloppiness on the part of the compilers
without it affecting our ability to deliver code that meets performance
specifications.
> If code is "fast enough",
> that's it, job done.
Unfortunately, code often gets used (and reused) in ways the original
programmer (or specification) doesn't expect. How fast is "fast enough"
for the xstrlen function, for example? No doubt, it's great for the
application you're writing today. But how about tomorrow? Some
routines, like generic library routines, should be *as fast as
possible* because there is no way to predict how they will be used. If
they're overkill for a beginning student's "number guessing game" then
that's no big deal, but if they're too slow for a database application,
uh-oh. How many programmers have the time to go in and rewrite the
stdlib when their application runs a little too slowly?
> I seen some guy optimizing keyboard interrupt
> handles in assembler for MS-DOS, maybe he thought someone would press
> keys really, really fast and that would slow his program down, go
> figure. Or maybe he was scared that he would miss a few keystrokes,
> again, go figure. Such strange characters are not my specialty (you may
> say myself excluded... ?=)
Again, one example of idiocy does not imply that all attempts to write
fast code are worthless. And you never know -- it could turn out that
this person you're talking about has a real-time foreground application
that couldn't tolerate more than a few (hundred) microseconds'
interruption.
In which case having a fast keyboard interrupt handler is a *very* good
idea.
Even if that person didn't need the performance for his/her current
app, perhaps the next user of that ISR would. I've got a *little* bit
of experience working in real-time systems, and I can assure you that
in most real-time OSes, minimizing time spent in an ISR is a *very*
critical thing (not that MS-DOS qualifies in this respect, but you get
the idea).
>
>
> >Yes, you do not have access to the low-level machine. As I said, C is
> >not an assembly language. Believe me, you don't have access to a *lot*
> >of things that might be useful on occasion.
>
> I don't have to believe YOU, I believe my own EXPERIENCE.
And, in your experience you call C an assembly language. That speaks
volumes about your experience, I'm afraid.
>
> >The #2 thread (after strlen) is memcpy. My alternative is to simply use
> >the movsb instruction.
>
> Are you trying to insult my intelligence?
No, I'm simply pointing out that this thread is second only to the
memcpy threads that pop up. What this would have to do with your
intelligence is beyond me.
> Look at the string.hpp, you
> might see std::memcpy() being invoked here, and there. I don't even
> *consider* the alternative!
Good for you.
>
> If you see meta::vcopy(), it is a different beast, it does check if
> type is pod (uses traits) and does memcpy, or object-by-object copy so
> that corresponding copy constructors and what not are invoked correctly
> in the process. Mostly I use that construct with templates.
I think you completely missed the point of my comment. Allow me to
explain it better and forgive me if I sound patronizing:
(In order of occurrence):
FAQ #1: what's the fastest block copy code we can write
Answer #1: Take a look at the AMD optimization guide and quit posting
routine after routine here. Any attempt to do better than that is going
to fail on different architectures.
FAQ #2: Here's my strlen function, how can I make it faster.
Answer #2: Check out the AMD optimization guide (or Agner Fog's page)
and use that code. Again, unless you're writing the code for a specific
CPU, you aren't going to do substantially better.
FAQ #3: Aren't compilers as good at generating machine code as human
coders?
Answer #3: No they are not. It's *easier* and more *economical* to
write code in a HLL, but the results are often much bigger and slower
than an equivalent program written in assembly. Most of the time,
slower and bigger is no problem, so go ahead and use your HLL. But
don't go around thinking that the code produced by your compiler is as
good as the stuff a decent assembly language programmer will write.
>
> >> Also, I could unroll the C++ innerloop but I won't do it because I
> >> think it is not a particularly good idea in this case.
> >
> >That depends entirely on the CPU and memory architecture.
>
> Context: competing with the specificly mentioned assembly code, which
> was unrolled. If you take that into consideration you have not-so-much
> to nitpick about.
>
> Since it is x86 assembly, I'm assuming some contemporary x86
> implementation will be running the code.
Okay, that's good for today. What about next week's CPU?
This is the thing that killed me 10-15 years ago. I was carefully
hand-optimizing code for the 486 and then the Pentium came out and
changed all the rules. Then the PII, then the PIII, then the PIV. Up to
the 486, whatever rules you applied on one CPU tended to work well on
the next generation. This stopped after the 486.
And "contemporary" doesn't even cut it. The optimization rules for the
PIV are quite a bit different from those for the AMD chips (and, the
PIII, upon which the PM is built). Better just to ignore all the
CPU-specific stuff and go with the general principles that work across
all CPUs. The differences you are talking about (e.g., loop unrolling)
are good examples of things that fall into the CPU-specific categories.
> I don't think unrolling the
> C++ code will do much good. Maybe on 386 or older processor it might
> pay off, hell, most likely it would. But I'm not too much interested in
> 386 these days...
I'm not suggesting that you unroll your code. I'm simply stating that
your argument that unrolling code is bad because of your experiences
with your particular CPU is not wise. On other contemporary CPUs, or on
future CPUs, the rules may be different. And the rules could also
change based on the memory alignment of the code (I've seen some
pretty big differences in performance based on the position of the code
in a program). And let us not forget caching effects. When you run
1,000,000 strings through your xstrlen function, you're hammering on
the same code over and over again and even the data access (usually
sequential) is pretty good as far as a cache is concerned. What happens
when you call xstrlen from within a real program when the code isn't
cached up and the data isn't in cache? You'll probably get quite
different results based on whether the code is unrolled or not. I don't
know which would be better (for a given CPU, of course), but I do know
that claims of "this isn't better" or "this is better" tend to melt
away when the environment changes on you. Bottom line: code that works
great on today's CPU is no guarantee that the same will be true on
tomorrow's CPUs. A lot of assembly language programmers
discovered this fallacy when going from the PIII to the PIV.
Cheers,
Randy Hyde
And, no doubt, Steve will come around in another decade... :-)
>
> /*
> Something that also makes me chuckle is the assumption that if you wait
> long enough, someone one day will create a C compiler that can output
> code in the same class as hand written assembler but after years of
> */
>
> Wait a second champ, I never said that, I am advocating the thought
> that resorting to assembly as First thing is folly. Resorting to it
> when there is need isn't.
This is a new thought in this thread. Here's what you've said in the
past:
"I still think very much that assembly itself is not optimization tool
anymore, ofcourse if you know it, you write better higher level code,
especially when you know the compiler.. and check assembly *output* to
check the compiler isn't doing something really stupid.
Best use for machine specific instructions, IMHO, is to let machine
generate them. Be this a offline compiler like g++, visual c++, et al..
or realtime code generator like JIT Compiler or something like that.
I'm not "dissing" assembly per-se, I used to be strongly with the
opinion that it is the way for performance."
You will have to forgive Steve and me for interpreting this to mean that
you believe C compilers should be generating all the machine code
rather than human beings. As you've constantly complained that I've not
bothered reading this whole thread, I would suggest that you go back
and read what you've written. I suspect that you've not put all your
thoughts into your posts and the missing information that's in your
head is crucial to your line of thought if you want us to agree with
what you're saying.
>
> /*
> hearing this nonsense, people who make the effort still outclass a
> compiler because there is more to code design that robot software
> output.
> */
>
> Ofcourse. But only when it pays off in some ways, makes a difference to
> real-world software. strlen() is a good example, where it doesn't make
> a didly-doo's difference to real programs performance in most of the
> real-world, production software.
Most of the real-world software doesn't need to be any faster than what
the CPUs provide for free to us. Praise the miracle of the side-effects
of Moore's Law over the past 15 years! Perhaps Steve's optimism
concerning strlen is unfounded, but the general idea is still right
(again, as we all seem to agree, someone who is using strlen enough for
a faster algorithm to make a difference in the performance of their
program could get much better performance with a different data
structure; so why bother trying to speed up strlen?).
> I only took it into myself to write
> the C++ version to validate that my theory, which is forged by years of
> practise, is still correct. It still is.
As long as you pick and choose the examples to try, I'm sure you'll
remain convinced of this. :-) Again, I keep coming back to the
optimization of "i=0;" as my "proof". Strlen is no different. You can
certainly get within 10-20% of the performance of a well-written
assembly language function in a HLL like C/C++. That doesn't mean you
can always do this, however.
>
> /*
> With Windows still controlling somthing like 90% of the desktop market,
> considerations of what runs on a PPC or MIPS or SPARC or SOLARIS fits
> into the category of "who cares" in most instances.
> */
>
> That depends who you writing software for. If you write it for
> yourself, okay. If it is freeware, open source.. who cares where it
> runs besides the author, or those who contribute. If it is application
> written for a customer, usually they dictate the terms.
They also dictate the price. Which means it behooves you to write the
code as rapidly as possible (i.e., no optimizations at all, unless
absolutely necessary to meet specs) in order to maximize your profit.
This might seem like a good thing to do until you try and re-use that
code in the next project...
It's the good old "short-term" vs. "long-term" trade-off. Do you spend
more money up-front and less down the road? Or do you wind up rewriting
strlen over and over again because the requirements change on a
continuing basis? BTW, this isn't an argument for HLL vs. assembly, per
se, just an observation.
>
> /*
> The basic idea is a good one but as about 5% of computers around the
> world could run it, it's a moot point where the two algos I have posted
> will run on a 486 running OEM win95. I would certainly like the luxury
> */
>
> My basic idea was: why not take advantage of the latest instruction
> set extensions when *available*? When they are detected, use them,
> have faster software which I think was the WHOLE POINT to begin with
> when writing in assembly, or did I miss something? You write in
> assembly just because...? What?
I have to agree with you, particularly when strlen is involved. This
function has been written so many times that a person can *easily* find
a 386 algorithm should they need it. If you want something good, write
for the latest processors. #ifdef if you have to, but support the
latest.
>
> If no SSE, MMX, what not, fall back to generic x86 code that works down
> to 386, no problem?
Of course, there is the *overhead* associated with using MMX/SSE. You
need to preserve the state of the registers on input (you are nice,
right?). It's expensive to save the state of SSE and it's *really bad*
saving the FPU state. This can *kill* the performance of strlen if
you're processing lots of short strings (a common case).
>
> If your point is to make the fastest possible code, you take the pains to
> write the code in assembly, but then ignore the latest instructions
> available on the x86 platform, what's wrong with the picture?
Again, I have to agree with you in principle. In practice, though,
there are other reasons why using the latest and greatest instructions
may not be the best choice in a *generic* function. Again, preserving
state can be more costly than the savings achieved. OTOH, one advantage
to assembly is that you can often in-line these instructions and not
have the overhead of saving FPU/SSE state (or, you get to amortize the
costs across several uses of those instructions rather than a single
function call). For example, in an earlier thread I wrote:
strcpy( a, b );
strcat( a, c );
strcat( a, d );
(ignoring the fact that this is bad code to begin with). In this
example, you'd only really need to preserve machine state across this
*sequence* of calls rather than on each call. This is one of the
advantages to assembly that you don't get in HLLs.
As for 386 vs. PIV issues, it's easy enough (via dynamic linking,
conditional assembly, or just run-time IF statements, as sketched below)
to include the code optimized for a series of processors, if you want to support older
as well as newer processors. Nevertheless, using a base instruction
set that most people's computers support and optimizing those
instructions based on modern rules is *not* a bad idea. I think it was
Terje, Michael Abrash, or other of their contemporaries who used to
write 8088 code optimized to run on a 386 processor. Everyone could run
the code, but it ran best on modern processors. True, the code wasn't
as good as it would have been had it been written for the 386, but it
ran everywhere. Hutch's requirement is to write code that is as fast as
possible but runs just about everywhere. Granted, I think he could use
the Pentium Pro as his baseline today (or even the PMMX), but his
approach isn't necessarily a bad one. His run-time requirements are
different from yours, that's all. And that's why I keep coming back to
the fact that your xstrlen function beating his unrolled code isn't
that impressive -- you've clearly got a different (set of) target
machine(s) in mind.
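To make the "run-time IF" idea concrete, here is a minimal sketch (nobody's
production code) of a one-time dispatch through a function pointer. It assumes
GCC's __builtin_cpu_init/__builtin_cpu_supports, and the sse2_strlen and
generic_strlen names are just placeholders for whichever processor-specific
versions you actually have:

#include <stddef.h>
#include <string.h>

/* placeholder implementations; real code would supply the tuned versions */
static size_t generic_strlen(const char *s) { return strlen(s); }
static size_t sse2_strlen(const char *s)    { return strlen(s); }

/* chosen once, after which every call is just an indirect call */
static size_t (*strlen_impl)(const char *) = generic_strlen;

static void pick_strlen(void)
{
    __builtin_cpu_init();                   /* GCC: populate CPU feature info */
    if (__builtin_cpu_supports("sse2"))     /* the run-time IF on the feature */
        strlen_impl = sse2_strlen;
}

Call pick_strlen() once at startup; afterwards strlen_impl(s) costs one
indirect call, which is the overhead being traded against the MMX/SSE
state-saving costs discussed above.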
> /*
> of being able to write PIV code with SSE3 but with the sheer volume of
> older computers still running, it will not be any time soon that this
> will happen.
> */
>
> It depends what your audience is. If you write games for instance for
> commercial market, go and take a look at minimum requirements on the
> game boxes. Most of the games don't run on 386 anymore. Just an example
> of one market segment.
Hutch doesn't write games, to my knowledge.
>
> I don't know why you still want to support 386 while writing *windows*
> software in 2005.
Probably he meant the 486 or Pentium. In particular, I don't believe he
accepts the use of MMX or SSE.
>
> /*
> One of the good things when 64 bit x86 gets going in the next few years
> is that there will be a far more modern instruction set where 10 year
> old compatibility is not an anchor around your neck.
> */
>
> I don't see how that is supposed to fit your point of view where
> MMX/SSE is too modern already. Explain?
When 64-bit OSes come out, we know that everyone running one of those
OSes will have a baseline machine that supports 64 bits, SSE/3, etc.,
etc. Therefore, he doesn't have to worry about supporting older
machines at that point (of course, new machines *will* appear, but the
baseline will be high-end for some time after that).
Cheers,
Randy Hyde
I think maybe you have missed the direction I have commented in. I
certainly see writing multiport code as a worthwhile endeavour but I
don't see it as a replacement for hardware specific code where
assembler has no peers in terms of size and speed.
The original post in this topic was a member looking for the difference
between C code and an assembler version of a byte scanner for
determining the length of a zero terminated string. I posted for him a
slightly modified algorithm that was written by Agner Fog in about 1996
and in code terms, that is a reasonably long life for an algo design in
assembler.
The arguments against writing assembler are usually the development
cycle time, yet to produce a nearly-as-fast version in C++, I suggest
that you have probably spent more time than it would take to write it
in assembler, and while the development may be useful to you in terms of
portability, it is neither a development time nor a speed advantage.
> > You remind me in the young and wild days..
In my youth I wrote ANSI C but time and cynicism led me down the road
of writing pure assembler in many places because portability in almost
every instance is a myth. My main use for a C compiler these days is
ratting through the mountain of old C junk for decent algorithm designs
which CL.EXE easily converts to MASM-format assembler, which is then a
good target for manual optimisation.
I am also not without criticism of current C compiler design in terms
of code generation. RISC-theory code design may be convenient for
compiler designers, but current x86 hardware is very badly suited to
such theory with its restricted range of general-purpose registers, and
you regularly see redundant loads and stores so that trivial API calls
and the like are performed in registers.
This is left-over 1990s technology from when pre-PII hardware was
faster that way. Then there is the problem of using the same
optimisation strategy for all code in a module, and while you can
separate the fast code from the hack OS code, few would bother to do
this and even fewer would know what code matters and what does not.
> > Wait a second champ, I never said that, I am advocating the thought
> > that resorting to assembly as the FIRST thing is folly. Resorting to it
> > when there is need isn't.
The problem with this view is that it escalates upwards in the same
manner. Many VB programmers would use the same argument against C++,
where you only need to write "low level" code on a needs basis, so you
wouldn't write natively in C++.
There is in fact an ever growing number of people who do use assembler
as a first choice for some tasks and it is purely a matter of
familiarity with the language format. Many with a high level background
don't properly understand that assembler can routinely work with the
12000 plus API calls, the near massive collection of compatible C
libraries, libraries written in assembler and so on.
Assembler programming is by no means restricted to plugging up the
defects in compiler code output but much more to do with freedom of
design and architecture as well as chasing speed where it matters.
Instruction choice is a matter of targeted market width. If the Linux
desktop market is 2%, gaming is 0.01% of the sum total market, and it
makes high demands on video, memory and processor performance, all of
which change on a weekly basis to a later, faster and more expensive
choice.
I have always been stuck with targeting code at the widest number of
people and this means the furthest backwards compatibility for the
current Windows OS platform. This says primarily 486 code but there is
more to it than just linear backwards compatibility. MMX was a big deal
with a P200 MMX processor but it does not perform reliably across all
of the later processors. It was also cursed with sharing the FP
registers, which excluded joint FP/MMX operations without a massively
expensive time delay.
SSE hit the deck with the PIII and was occasionally faster than MMX but
as usual the limiting factor is memory bandwidth. The gain with SSE(2)
is the non-temporal writes, where you can clock the speed difference in
real time.
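As a rough illustration of what non-temporal writes mean in practice, a
minimal sketch using SSE2 intrinsics; it assumes a 16-byte-aligned
destination and a size that is a multiple of 16, and it is not a tuned
memset, just the shape of the idea:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void stream_fill(void *dst, unsigned char value, size_t bytes)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;          /* caller guarantees 16-byte alignment */
    size_t  n = bytes / 16;               /* caller guarantees a multiple of 16  */

    for (size_t i = 0; i < n; ++i)
        _mm_stream_si128(p + i, v);       /* movntdq: write around the cache */

    _mm_sfence();                         /* order the streaming stores */
}

On large buffers the streaming stores avoid dragging the destination through
the cache, which is where the clockable difference comes from; on small
buffers ordinary cached stores usually win.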
I also see this as the saving factor with compiler-generated code: that
memory bandwidth compresses the difference between shorter code with
fewer instructions and a mountain of redundant loads and stores.
Put simply, the processor is still some powers faster than current DDR 400
and later memory, and this allows a reasonably large number of redundant
instructions to be placed between memory access instructions.
I have an example in mind, clocking an insertion sort where the removal
of 33 redundant loads and stores made no difference to the time of the
algo. The only thing that did make it faster was reducing the number of
memory accesses.
> > Of course. But only when it pays off in some way, makes a difference to
> > real-world software. strlen() is a good example, where it doesn't make
> > a diddly-doo's worth of difference to real programs' performance in most of the
> > real-world, production software.
A single string length algo is only a very small component of common
tasks, yet if the same indifference is applied to the sum total of
software design, you end up with the slow, bloated style of C++ that is
common these days in commercial applications, and hardware is not
getting faster but software is still getting bigger and slower.
It's not what "CAN" be done but what "DOES" get done with the majority
of software production tools that is the measure of the tool. No doubt
a well-written C library will easily produce good quality final
application code in many instances but the vast majority of modern
applications are not in this class.
The hallmark of modern application production is massive size
increases, reduced functionality, inappropriately used threaded code
with endless timing lags and very high demands on current hardware.
> > I only took it upon myself to write
> > the C++ version to validate that my theory, which is forged by years of
> > practice, is still correct. It still is.
This is fine and I hope it was useful to you, but with the development
time to produce a C++ version that is nearly as fast as an old
assembler version, the development-time advantage goes to the assembler
code, not the C++ code.
> > That depends on who you're writing software for. If you write it for
> > yourself, okay. If it is freeware, open source.. who cares where it
> > runs besides the author, or those who contribute. If it is an application
> > written for a customer, usually they dictate the terms.
The project that I maintain is used by a very large number of people
and it must remain useful to this number of people, so there is no real
point in targeting the 0.01% doing unusual things. People who need
code in this range have the perfect tool with an assembler to pick the
advantages they require and simply write what they need.
> > If your point is to make the fastest possible code, you take the pains to
> > write the code in assembly, but then ignore the latest instructions
> > available on the x86 platform, what's wrong with the picture?
The problem with this comment is that it assumes another language's primacy,
yet there are enough people who write simple things in assembler
without feeding them through the restrictions of Delphi or C++ or
whatever else. Apart from speed issues, near-complete freedom in terms
of architecture has a lot going for it, and in the case of MASM, its
pre-processor will eat C compilers alive in terms of capacity.
Being able to design your own language free from the claptrap is one of
the large advantages in assembler programming.
> > I don't know why you still want to support 386 while writing *windows*
> > software in 2005.
Very simple actually: the vast majority of computers around the world
are not high-end dual-core AMD64 Opterons with > 8 gig of memory but
far more humble machines that profit from small, fast software written
in assembler, where the later slow, bloated, hardware-specific stuff just
won't run on such boxes.
Really high-end graphics run on SGI boxes, and when you don't need to
target a wide range of people, this will deliver performance that the PC
market is some power slower than.
Before or after you came along, I did not say using assembly isn't
worthwhile. It is, but very RARELY. In this case, with strlen() in
C++ vs. x86 assembly for instance, it isn't worthwhile.
>And that's the argument that I'm not buying. The fact that in some
>controlled situation you can cajole a C++ compiler into producing code
>about as optimal as one can expect does not imply that the compiler
>will do this all the time. You are dispelling no myth, I'm afraid.
The point is that most of the time it doesn't pay off in any way other
than a warm fuzzy feeling. When it does, it does; that is not being
disputed here.
>that an optimization on Hutch's machine isn't as valid on your machine
>should prove to be no surprise here. It's one of the main reasons I
>quit "counting cycles" when the Pentium first arrived -- there's no
>sense in in anymore.
The point is that it doesn't make any difference other than a negligible
one; if you feel that assembly is a good idea for a negligible performance
increase, I don't want to agree with that stance being a particularly
good idea, since that is the opinion you are opposing here.
>assembly code. How many people, for example, will find the C++ strlen
>function you've written to be any more understandable than the assembly
>version (from an algorithmic point of view, obviously)?
You are asking for a quantity out of an undefined set, which is one way to
make a point; if I knew you better I might know what's happening here.
No smiley, check.
>associated with them. Even in plain C, you get performance problems
>when people do things like:
>
>strcpy( a, b );
>strcat( a, c );
>strcat( a, d );
Context specific. No matter what you do, or how fast you do it, it can turn
out to be a performance problem.
>Precisely my point. The optimizations are not portable. For this
>particular example, you're limited to 32-bit processors.
An application using this library is portable to the platforms the library is
ported to; portable does not imply or equal universal. This is a useful
feature; I would say that the C++ code is MORE portable than the
x86-specific (MASM!) assembly function.
>IOW, the "trick" isn't portable and you're suffering from some of the
>same problems as assembly language.
As will everyone else who writes portable code.
>> I would surmise the code is an order of magnitude more useful than an
>> x86-specific assembly snip.
>
>By what reasoning?
By the reasoning that the snip works only on x86 clones, compiles only
with MASM and is generally only useful in Windows software. That does not
make the snip useless, but C++ code which compiles to the verbatim same
binary and also supports a large number of OTHER platforms could be
reasoned to be "more useful"; maybe the word "order" was a red flag for
you, I shall remove it from in front of your eyes.
>Given that about 90% of the world's computers today are x86 CPUs, I
>don't see how having the code in portable C++ is going to make it an
>order of magnitude more useful. Certainly we can find *some* people
Source code is useful only for software developers; x86 constitutes a
large part of the customer base for software developers, but there are
uses for other platforms as well, and that is where the "more useful" (<-
revised!) comes into effect.
>I don't question your claim; from a mathematical perspective I'm sure
>we could find a group of people amongst whom the need to have a
>portable strlen function that compiles on 10 different 32-bit (non-x86)
>processors is important, but...
That is rich; you are consistently ignoring my stance on this issue: I
do not think an optimized strlen() is very useful at all. I don't expect
anyone else to find such a thing useful either.
I am merely saying that A > B: that the C++ version A, which produces an
identical binary, is MORE useful than version B, which is single-platform,
single-assembler-only code.
The rest is your extrapolation; let's clarify. I never said such a "group
of people" exists, or how many such groups of people exist, or
anything to that effect. You are plain malicious, simple as that.
>When you look at the number of people (end users) who will actually
>benefit from the code, however, it becomes real clear that the choice
>of HLL or x86 assembly is *mostly* irrelevant because most end-users
>are running x86 boxes.
Of course; who said it wasn't?
>isn't *always* better with one example, but you cannot make a claim
>that there is no need to use assembly language on the basis of one
>example.
So what? I didn't make a claim that there is no need to use assembly
language in general. I claim that in this case it doesn't bring
anything substantial to the solution.
>And you're assuming that Hutch got it right? That his example is the
>absolute pinnacle of what can be done in assembly?
I am not assuming such a thing. I am assuming that, until he provides us
with something better, that is the best _he_ can come up with in
assembly, or rather, what Dr. Fog does come up with.
>That will probably run slower than the output of a good compiler for
>the above C code. Yes, scasb is *that* bad.
Yes, it is indeed.
>> Laborious and error-prone process.
>Changing the subject?
Where did you get that notion from? Doing register allocation and spilling
by hand IS laborious, among other things; this is stuff a compiler does in a
fraction of the time, thus saving time and money.
>And sometimes the compiler is brilliant when doing this, and sometimes
>it is real bone-headed. Your point?
That it isn't laborious and error-prone..?
>Again, we're back to the argument of "this makes life so much easier
>for the programmer" rather than "the compiler does as good a job of
>this as the programmer could do himself."
I didn't make such a generalized claim, sorry; the only one who did present
that argument here is you.
I am saying that the C++ code compiled with Visual C++ 8.1 Beta 2
produces precisely the same inner loop as the assembly code it was
being compared with. That is not what you claim I am "presenting as
argument".
>Sorry, you've lost me along the road somewhere. Perhaps you could be
>more articulate. It really seems to me that all you've done here is
>switch from "compilers produce code as good as humans do" to "it isn't
>cost-effective for humans to write code this way, so we live with what
>the compilers produce." A very different argument. But for some reason,
I haven't done such a thing. What I wrote originally is not what you
present it to be; the later posts are responses to the ongoing discussion,
adding to what is being discussed, mainly how I see the things under
discussion.
You bet it is a very different argument: it is mine, whereas the earlier one
was yours presented as mine.
>that's where this argument always winds up. I guess that means we've
>reached the end of the debate.
Hitler?
>Well, maybe that's the difference between us. You see, I've got about
>25 years' experience as a professional programmer and I've worked both
>in the times when assembly was mandatory (to get any kind of
>performance at all) and I've been around during the past 15 years when
Z80 assembly was the first programming language I learned, so what?
>these languages are good for. OTOH, I don't go around claiming that
>there is no reason to use assembly because compilers today are as good
>as humans. I may very well say that it makes *economic* sense to use
Neither do I. It is a different thing to advocate using assembly for a
specific, well-defined task rather than in general.
>code efficiently in C. If you *really* want to debunk the myth that
>assembly has no advantage over HLLs, you need to move beyond strlen.
I don't intend to do that, and I didn't either. You have got this
idea into your head because you think you know me, or "my kind".
You might want to read the FIRST reply to the original poster; that was
written by me. E-V-E-R-Y-T-H-I-N-G you accuse me of and "correct" with
your post was already said there. Store the string length. Don't optimize
this, it doesn't matter.
This ongoing discussion consists of replies to your criticism of points that
were already covered earlier.
>trivial strlen function (byte at a time) will outperform the craziness
>embodied in the examples in this thread, because of the intrinsic
>overhead.
The overhead is alignment, which costs: -,-,+,&
The alignment "inner loop" is virtually the same as the trivial strlen()
inner loop. If we want the short strings (< 10) to go down the lower-overhead
path, we can add 8 to the alignment count to force that code to be run
for the first 1 to 11 characters; this will eliminate the overhead at the
end, which is far greater with multiple branches.
Four arithmetic operations of overhead only for the cheap cases; I could
live with that if it were an important issue.
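To put the alignment overhead in context, here is a minimal C sketch of the
general shape being discussed: a byte loop until the pointer is 4-byte
aligned, then a DWORD per iteration with the classic has-zero-byte test. It
is the same family of algorithm as the Agner Fog routine, not a transcription
of it, and it glosses over the strict-aliasing care a real library version
would need:

#include <stddef.h>
#include <stdint.h>

size_t dword_strlen(const char *s)
{
    const char *p = s;

    /* alignment prologue: same test as the trivial byte-at-a-time loop */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* main loop: four bytes per iteration; it reads whole aligned words,
       so it may touch up to 3 bytes past the terminator (safe in practice
       because an aligned word never crosses a page boundary) */
    const uint32_t *w = (const uint32_t *)p;
    uint32_t v;
    for (;;) {
        v = *w;
        if ((v - 0x01010101u) & ~v & 0x80808080u)  /* some byte of v is zero */
            break;
        ++w;
    }

    /* locate the zero byte within the final word */
    p = (const char *)w;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}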
> This is one reason why I argue that xstrlen is a ridiculous
>example.
Its only purpose is to be compared to the x86 assembly strlen() in the
context of this thread; it produces identical code for the most
time-critical sections, so it works more than well for the discussion that
was going on before you came along. The one I am having with you is
about the Philosophy of Programming and Optimization or some such, or is it
maybe about Putting Jukka In His Place? ;-)
>So when Hutch talks about saving all the function setup and tear-down
>code, this is not an insignificant matter. It reduces the overhead of
>the function call, thus vastly improving performance for small strings
>(which this older research suggests is a common situation). Now the
I have MASM; I *link*, I don't compile with an __asm {} block within the C++
source code, and I don't do that if it can be avoided. That is taken into
consideration when I do the timings; the C++ code is as presented, too.
>Oh, I've read it. That's not what you said earlier. But I'll allow you
>to back out of that gracefully. This is, after all, USENET and we have
>to allow for considerable "unstateds" and "misreads".
Alright, so what did I say earlier that is relevant to that outburst? What
opinion, statement, or claim did I reverse, or whatever it is you seem
to think that I did? I don't follow you.
>Amazing, isn't it? Something so *basic* trips up 99% of the programs out
>there. Exactly the point I'm making. You won't see mistakes like this
>made in a typical assembly program.
I don't quite follow you; the question that springs to mind is WHY you
are making that point.
There is no substitute for knowing what you are doing in either language,
so?
>And when we look in your code, we'll never see an example of this,
>right?
Depends on what you are looking for.
>As in the way you've written xstrlen.
What's the unique characteristic in xstrlen() I should be looking for?
> they would expect it to be written that way.
You still didn't explain what you mean by "that way".
>As for insulting, please check your own post. There are a few too many
>profanities and inferences on your part for you to be able to play this
>card here.
I'm not playing any card. I'm writing what I think, how's that compute?
> That makes us even. I don't care what I think about you either :-)
So I assume you didn't verify what I wrote; my conversation-prediction
logic seems to be working flawlessly.
>Good for you. You escaped the problems of our industry over the past
>four years. But discussions of your experience and how long you've been
You're referring to some situation in the USA? I wouldn't know about that,
because I don't care.
>Again, one example of idiocy does not imply that all attempts to write
>fast code are worthless. And you never know -- It could turn out that
>this person you're talking about has a real-time foreground application
I happen to know precisely what he was doing for reasons outside the
scope of this discussion.
>And, in your experience you call C an assembly language. That speaks
>volumes about your experience, I'm afraid.
"The way I see it is that C is portable assembler, which makes it a
jack
of all trades but master of none."
The language syntax is crafted to be close to the hardware; that's where
this point of view originates from, and that is how I "see" the C
programming language. That is why it is so trivial for me to write C
code while thinking in assembly. It is just plain trivial.
If you think that's funny, go ahead and laugh; my ego can take it.
>No, I'm simply pointing out that this thread is second only to the
>memcpy threads that pop up. What this would have to do with your
>intelligence is beyond me.
So far, you have not said or taught me ANYTHING I don't already know.
And I don't expect to have done so to you either.
>FAQ #2: Here's my strlen function, how can I make it faster.
>FAQ #3: Aren't compilers as good at generating machine code as human
Go ahead and write the FAQ then.
Interesting hypothesis, what's it based on?
>In my youth I wrote ANSI C but time and cynicism led me down the road
>of writing pure assembler in many places because portability in almost
>every instance is a myth. My main use for a C compiler these days is
It's a myth only to someone who never has to do it.
>don't properly understand that assembler can routinely work with the
>12000 plus API calls, the near massive collection of compatible C
>libraries, libraries written in assembler and so on.
That's their problem.
>Instruction choice is a matter of targeted market width. If the Linux
>desktop market is 2%, gaming is 0.01% of the sum total market, and it
>makes high demands on video, memory and processor performance, all of
>which change on a weekly basis to a later, faster and more expensive
>choice.
I don't find the prices rising; on the contrary, new hardware is
cheaper and faster than ever, and the prices of old hardware are diving
rapidly as we speak as well.
>I have always been stuck with targeting code at the widest number of
>people and this means the furthest backwards compatibility for the
>current Windows OS platform. This says primarily 486 code but there is
Out of curiosity, what software is that?
>The hallmark of modern application production is massive size
>increases, reduced functionality, inappropriately used threaded code
>with endless timing lags and very high demands on current hardware.
Yeah, damn Microsoft!
>This is fine and I hope it was useful to you, but with the development
>time to produce a C++ version that is nearly as fast as an old
>assembler version, the development-time advantage goes to the assembler
>code, not the C++ code.
Do you happen to know how long it took Dr. Fog to develop the assembly
version? I happen to know pretty accurately how long it took to write
the C++ version.. I'm just guessing here, but I have reason to
believe that you know neither.
>whatever else. Apart from speed issues, near complete freedom in terms
>of architecture has a lot going for it and in the case of MASM, its
>pre-processor will eat C compilers alive in terms of capacity.
MASM is fine if Windows is all you care about.
>Being able to design your own language free from the claptrap is one of
>the large advantages in assembler programming.
You meant programming using MASM?
>Very simple actually: the vast majority of computers around the world
>are not high-end dual-core AMD64 Opterons with > 8 gig of memory but
>far more humble machines that profit from small, fast software written
>in assembler, where the later slow, bloated, hardware-specific stuff just
>won't run on such boxes.
I would dare to say that your "0.01%" estimate is way unrealistic. I
don't recall the precise dates so I estimate (googling for precise
dates would make me look better but I don't care about that).
Let's say the 386 has been around since 1985 or thereabouts; that's 20 years.
Let's say x86-compatible systems with SSE support as standard have been
around since about 2000; that's 5 years.
I find it unbelievable that in the first 15 years 99.99% of the PCs in
*active use* today would have been built, and only 0.01% of the PCs in active
use today would have been built after the year 2000.
It just sounds totally unrealistic; the market has been expanding, not
shrinking, overall during that time. This means that it's more likely
that in the last 5 years MORE systems have been shipped than in the 6 years
before 2000.
I don't buy that 0.01% FUD; it is totally unrealistic. Maybe I'm
missing the developing countries from my equation, that could explain
it, yeah.. </sarcasm>
"repne scasb" might work a bit better for what you wanted to do, btw.
For someone who has chosen to troll in an x86 assembler forum for C++
programming, I suggest that the only person you have convinced is
yourself, as most programmers in the assembler market have heard all of
this stuff before.
> >I suggest that you have probably spent more time than it would take to write it
> >in assembler, and while the development may be useful to you in terms of
> >portability, it is neither a development time nor a speed advantage.
>
> Interesting hypothesis, what's it based on?
Your response time.
> >In my youth I wrote ANSI C but time and cynicism led me down the road
> >of writing pure assembler in many places because portability in almost
> >every instance is a myth. My main use for a C compiler these days is
>
> It's a myth only to someone who never has to do it.
For a combined market share of about 2%, who cares.
> >don't properly understand that assembler can routinely work with the
> >12000 plus API calls, the near massive collection of compatible C
> >libraries, libraries written in assembler and so on.
>
> That's their problem.
It's also an advantage when such a massive range of functionality is
available apart from what they write themselves. Try writing Windows
software with a C compiler without the library support and you will find
it harder and slower than with an assembler.
> >Instruction choice is a matter of targeted market width. If the Linux
> >desktop market is 2%, gaming is 0.01% of the sum total market, and it
> >makes high demands on video, memory and processor performance, all of
> >which change on a weekly basis to a later, faster and more expensive
> >choice.
>
> I don't find the prices rising; on the contrary, new hardware is
> cheaper and faster than ever, and the prices of old hardware are diving
> rapidly as we speak as well.
Speak to people who have problems raising the couple of grand to buy a
later high-end box. Try the population of China, Asia generally, a
large number of people in the US, South America, programmers from the
old Eastern Bloc in Europe and of course the many very good programmers
from Russia. There is a whole world out there without the type of
funding necessary to keep buying high-end boxes.
> >I have always been stuck with targeting code at the widest number of
> >people and this means the furthest backwards compatibility for the
> >current Windows OS platform. This says primarily 486 code but there is
>
> Out of curiosity, what software is that?
Commercial software is generally written under a non-disclosure
agreement and I still have a few older ones floating around so I cannot
help you with my own history. What I do place for public usage is MASM
code and it is aimed at 486 compatibility.
> >The hallmark of modern application production is massive size
> >increases, reduced functionality, inappropriately used threaded code
> >with endless timing lags and very high demands on current hardware.
>
> Yeah, damn Microsoft!
And damned Borland, damned Linux, damned FreeBSD, damned Sun, SGI and
everyone else who uses C++. :) Nothing like slopping around oversized
underperforming bloated junk to feel profound.
> >This is fine and I hope it was useful to you, but with the development
> >time to produce a C++ version that is nearly as fast as an old
> >assembler version, the development-time advantage goes to the assembler
> >code, not the C++ code.
>
> Do you happen to know how long it took Dr. Fog to develop the assembly
> version? I happen to know pretty accurately how long it took to write
> the C++ version.. I'm just guessing here, but I have reason to
> believe that you know neither.
Interestingly enough he never mentioned it even though he is a member
of our forum but then he is also a very experienced assembler
programmer.
> >whatever else. Apart from speed issues, near complete freedom in terms
> >of architecture has a lot going for it and in the case of MASM, its
> >pre-processor will eat C compilers alive in terms of capacity.
>
> MASM is fine if Windows is all you care about.
>
> >Being able to design your own language free from the claptrap is one of
> >the large advantages in assembler programming.
>
> You meant programming using MASM?
No, I mean what I said, in the context of a pre-processor that will eat
C++ compilers alive with its capacity. Prebuilt languages come with
pre-built assumptions, whereas an assembler with a high-powered
pre-processor suffers none of that garbage.
> >Very simple actually: the vast majority of computers around the world
> >are not high-end dual-core AMD64 Opterons with > 8 gig of memory but
> >far more humble machines that profit from small, fast software written
> >in assembler, where the later slow, bloated, hardware-specific stuff just
> >won't run on such boxes.
>
> I would dare to say that your "0.01%" estimate is way unrealistic. I
> don't recall the precise dates so I estimate (googling for precise
> dates would make me look better but I don't care about that).
Yes, it could be .02% but as a market share, it's trivial. Few in the x86
gaming market make any money anymore and those that do are at the
corporate level.
> Let's say the 386 has been around since 1985 or thereabouts; that's 20 years.
> Let's say x86-compatible systems with SSE support as standard have been
> around since about 2000; that's 5 years.
>
> I find it unbelievable that in the first 15 years 99.99% of the PCs in
> *active use* today would have been built, and only 0.01% of the PCs in active
> use today would have been built after the year 2000.
>
> It just sounds totally unrealistic; the market has been expanding, not
> shrinking, overall during that time. This means that it's more likely
> that in the last 5 years MORE systems have been shipped than in the 6 years
> before 2000.
Put graciously, the ass fell out of the computer market with the
collapse of the dot-com boom, and internationally sales are down. Look
at the number of well known software companies that went out backwards
since 2000 and the various hardware companies that have been taken over
in the last few years and you will forget the idea of a market that
expanded in the same way as it did through most of the 90s. There are a
large number of people who still run old boxes that do what they
require who have little use for the newer stuff, especially as the
later OS versions do little better than 10 year old versions.
Think of DOS boxes, Win95 boxes, old Macs that still slug along
perfectly and you will get some idea of why so many people won't spend
the money to get more problems, bugs and the like.
That's a bit rich; I did prove my point with actual code and did come
up with an SSE scanner which does 8 characters per iteration, neither
of which you have done.
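For reference, and without reproducing the exact fragment posted earlier in
the thread, a minimal sketch of what an 8-characters-per-iteration zero
scanner of that era can look like: pcmpeqb on an MMX register plus pmovmskb
(the SSE addition) to pull out a byte mask. It assumes the pointer is already
8-byte aligned, and the helper name is mine:

#include <mmintrin.h>    /* MMX: _mm_cmpeq_pi8, _mm_setzero_si64, _mm_empty */
#include <xmmintrin.h>   /* SSE: _mm_movemask_pi8 (pmovmskb on MMX regs)    */
#include <stddef.h>

size_t scan_zero8(const char *s)             /* assumes s is 8-byte aligned */
{
    const __m64 *p = (const __m64 *)s;
    const __m64 zero = _mm_setzero_si64();
    int mask;

    for (;;) {
        __m64 eq = _mm_cmpeq_pi8(*p, zero);  /* 0xFF in every byte that is 0 */
        mask = _mm_movemask_pi8(eq);         /* one bit per byte, 8 bits used */
        if (mask)
            break;                           /* the terminator is in this block */
        ++p;
    }
    _mm_empty();                             /* EMMS: hand the FP stack back */

    size_t off = 0;                          /* offset of the lowest set bit */
    while (!(mask & 1)) {
        mask >>= 1;
        ++off;
    }
    return (size_t)((const char *)p - s) + off;
}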
All we have from you is talk, talk, talk.. and borrowed code from
someone else, yes, you didn't even write the code yourself!
>> Interesting hypothesis, what's it based on?
>
>Your response time.
Pleeez, if you don't have either piece of code, then in light of the results
would it make a big difference whether the code was written in C++ or
assembly? Which is generally faster to develop?
Based on your response time to unroll the code, it looks like it takes
even longer to write such code in assembly!
>For a combined market share of about 2%, who cares.
Anyone who wants to do more with the computing power they've got?
>available apart from what they write themselves. Try writing Windows
>software with a C compiler without the library support and you will find
>it harder and slower than with an assembler.
You realize that your suggestion assumes that you would first have to
be very stupid to do that in the first place?
I am not against assembler. None whatsoever. I just don't see how it
improves the code SUBSTANTIALLY in this case (strlen). Now that you have
already exhausted your arguments against that observation, you are
starting to attack my character (implying that I am trolling), and
invoking totally unrealistic scenarios to prove something that I am not
even against in the first place.
And no, I don't think it would be that much easier in assembler, C or
C++ for that matter. It takes considerable time and effort to replicate
the WIN32 API; look at the WINE project for starters. Haven't they been at it
for 10+ years and are still working on it and fixing bugs now and then?
>Speak to people who have problems raising the couple of grand to buy a
>later high-end box. Try the population of China, Asia generally, a
That's why you leave the support for generic x86 in there; then those
with SSE will get the speed benefits. When processing power grows, I expect
the processor to be able to do more, not less.
>from Russia. There is a whole world out there without the type of
>funding necessary to keep buying high-end boxes.
Then by all means write software that is fast enough on a 486. That is
trivial for trivial tasks.
>And damned Borland, damned Linux, damned FreeBSD, damned Sun, SGI and
>everyone else who uses C++. :) Nothing like slopping around oversized
>underperforming bloated junk to feel profound.
I'm assuming you are joking.
>Interestingly enough he never mentioned it even though he is a member
>of our forum but then he is also a very experienced assembler
>programmer.
Yet you draw conclusions about development time w/o knowing how long it
took me (you looked at my "response time") and without knowing how long
that assembly code took to develop. Ask him if he is in your forums.
>Yes, it could be .02% but as a market share, it's trivial. Few in the x86
>gaming market make any money anymore and those that do are at the
>corporate level.
The way I see it, you develop such trivial software that it still runs
alright even on a 486, so it doesn't matter how it is optimized for later
processors. If it's fast on a 486, it had better be fast on the latest
Opteron, of course!
>at the number of well known software companies that went out backwards
>since 2000 and the various hardware companies that have been taken over
I guess the crap ones were weeded out, good riddance! Those who make
products people are willing to pay for are still here.
>require who have little use for the newer stuff, especially as the
>later OS versions do little better than 10 year old versions.
"Little better" is subjective, if someone has appliances he connects to
his computer older OS'es just don't cut it. Just to give a very common
example of what people do with their computers these days, often you
see digital camera, recorder or portable mp3 player being hooked up to
a computer and what not.
For email and surfing the web and the likes older computer is just
fine, if that is all you do.
>Think of DOS boxes, Win95 boxes, old Macs that still slug along
>perfectly and you will get some idea of why so many people won't spend
>the money to get more problems, bugs and the like.
I'm more inclined to believe that if they were given the choice, most would
instantly go for the latest computer equipment rather than not. Most likely
they don't have a choice, or just plain don't care EITHER WAY.
You make it sound like they actually *prefer* it that way; that is highly
unlikely. At best, they might be happy with what they've got, or simply not
know any better.
If you look at the typical computer user, yes, most don't NEED a fraction
of the computing power they've got. But that is not related to what they
WANT and EXPECT from their computers. Completely different discussion.
Dude, you are drifting way off-topic.. I take it you have nothing to
comment on the SSE zero scanner, which does 8 characters per iteration?
You won't use it, because your software is fast enough as it was on a
486, hence no need to make it faster on a system with SSE support? I take
it that you aren't using assembler for speed but because _you_ develop
much faster with it. That says volumes about you as a developer, not
about assembler.
And once more: I am not against writing in assembler. I have been using
x86 assembly language for a long time. Apparently I know how to both read
and write it, if you read this thread, right? THAT shouldn't even be
the topic.. why are you pushing the discussion into market shares and
other non-technical nonsense?
> That's a bit rich; I did prove my point with actual code and did come
> up with an SSE scanner which does 8 characters per iteration, neither
> of which you have done.
Where I have posted two complete working algorithms that both run on a
486 upwards, I saw as your contribution a fragment of SSE code with a
throwaway line after it and a lot of waffle about trying to emulate
asm in C++.
I don't deny the use of multiport code, even though its user base is
trivial in comparison to the mass market where if you could produce the
"killer app" for the sum total of the rest, you would have hit less
than 5% of the market.
> All we have from you is talk, talk, talk.. and borrowed code from
> someone else, yes, you didn't even write the code yourself!
You will have to forgive me, but Agner Fog wrote the algo first. I have
seen and timed many variations of it over time but his original
architecture has stood the test of time. The only code of mine you saw
in it was the leading alignment code, as the original was designed to
work with 4-byte-aligned strings.
Noting that the original posting for this thread was about a string
length algo in asm, posting a well-known, faster, 486-compatible
algorithm with a minor mod to align the start of the buffer is in fact
reasonable, whereas trolling about how and why someone should choose
multiport code is a long way off the subject.
Most assembler programmers have heard this crap before from people with
the same cross to carry from being committed to an outmoded idea of how
and why you should write code. The only thing we are missing is the
OOP(S), bloat and other trivia associated with the same bundle of
out-of-date nonsense.
> Pleeez, if you don't have either piece of code, then in light of the results
> would it make a big difference whether the code was written in C++ or
> assembly? Which is generally faster to develop?
Everybody has a theory on development time, see the VB guys for really
fast development times.
> Based on your response time to unroll the code, it looks like it takes
> even longer to write such code in assembly!
REPEAT 7
; code
ENDM
Truly amazing complexity ? :)
> >For a combined market share of about 2%, who cares.
>
> Anyone, who wants to do more with the computing power they've got?
I doubt that an old notebook Pentium hits the deck as high end
hardware.
> >available apart from what they write themselves. Try writing Windows
> >software with a C compiler without the library support and you will find
> >it harder and slower than with an assembler.
>
> You realize that your suggestion assumes that you would first have to
> be very stupid to do that in the first place?
For someone who MUST depend on other people's prewritten code, it would
in fact be very stupid, but someone who does not know the difference and
argues that they are writing their own code comfortably exceeds
such a level of stupidity.
> I am not against assembler. None whatsoever. I just don't see how it
> improves the code SUBSTANTIALLY in this case (strlen).
It's really simple actually: Agner Fog wrote that algo back when C
compilers were producing crap like SCASB years after it was
out-of-date legacy code for pre-486 hardware, and in the context of true
486-compatible code that can be used by nearly every machine
that can run x86 Windows, it is still a good algo.
Whether someone can eventually emulate it in COBOL or FORTRAN or PASCAL
or whatever else simply does not matter, as it is a viable, fast
algorithm in a general-purpose context.
Whether you have a theory on string length calculation or not, the data
does not come by immaculate conception and you cannot always get that
info somewhere else. It may be something new to you, but most programmers
already know that you don't do more work on code and/or data than you
need, so when you get the length of string data, you normally store it
somewhere.
> already exhausted your arguments against that observation, you are
> starting to attack my character (implying that I am trolling), and
> invoking totally unrealistic scenarios to prove something that I am not
> even against in the first place.
Well, what is all the noise about then? I suggest it is you, with an axe
to grind about assembler programming, that is making the noise, and YES
you are trolling in a topic where a member was asking about an assembler
string length algo.
> That's why you leave the support for generic x86 in there; then those
> with SSE will get the speed benefits. When processing power grows, I expect
> the processor to be able to do more, not less.
This suggestion already assumes multiple copies of algorithms, and while
this is viable in some instances, fortunately it's not the only way or
the best way to perform such a task. After a normal processor detect you
can easily pick the code you should run by building a set of DLLs for
each processor, but to put it into context, how often can a user justify
the expense of buying a late-model high-end box so that an SSE2 algo
can read a string length for typed input slower than a normal integer
version?
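As a concrete shape for the DLL-per-processor idea, a minimal sketch only:
the DLL names and the "xstrlen" export are purely illustrative, and the
feature test uses the stock Win32 IsProcessorFeaturePresent call.

#include <windows.h>
#include <stddef.h>

typedef size_t (__cdecl *strlen_fn)(const char *);

strlen_fn load_best_strlen(void)
{
    const char *dll = "strlen_486.dll";                   /* baseline build */

    if (IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE))
        dll = "strlen_sse2.dll";                          /* SSE2 build */
    else if (IsProcessorFeaturePresent(PF_XMMI_INSTRUCTIONS_AVAILABLE))
        dll = "strlen_sse.dll";                           /* SSE build */

    HMODULE h = LoadLibraryA(dll);
    if (h == NULL)
        return NULL;                                      /* caller falls back */
    return (strlen_fn)GetProcAddress(h, "xstrlen");       /* illustrative export */
}

The point above still stands, of course: for something as cheap as reading
the length of a typed-in string, the detect-and-load machinery costs far more
than the scan itself ever will.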
> >And damned Borland, damned Linux, damned FreeBSD, damned Sun, SGI and
> >everyone else who uses C++. :) Nothing like slopping around oversized
> >underperforming bloated junk to feel profound.
>
> I'm assuming you are joking.
Only as much as you were. I have seen crap code from many places over
time and Microsoft by no means has a monopoly on this.
> Yet you draw conclusions about development time w/o knowing how long it
> took me (you looked at my "response time") and without knowing how long
> that assembly code took to develop. Ask him if he is in your forums.
Not only do I not care, but I doubt that he does either. You may not be
familiar with the situation, but I regularly deal with assembler
programmers who write assembler code daily, so it's not the pie-in-the-sky
mystery you appear to be assuming.
> The way I see it, you develop such trivial software that it still runs
> alright even on a 486, so it doesn't matter how it is optimized for later
> processors. If it's fast on a 486, it had better be fast on the latest
> Opteron, of course!
All you are saying here, in the absence of grasping general-purpose x86
assembler, is that you prefer to avoid the major market that does not
own a 64-bit Opteron. It probably matches your view on multiport code,
which does the same.
> "Little better" is subjective, if someone has appliances he connects to
> his computer older OS'es just don't cut it. Just to give a very common
> example of what people do with their computers these days, often you
> see digital camera, recorder or portable mp3 player being hooked up to
> a computer and what not.
Most people with high-end boxes already have this stuff, but there are a
very large number of people who use a Win3.x box for word processing,
DOS boxes for database software for stock inventory and a host of
things they don't need or want to change.
You are trying to jump on the Microsoft bandwagon here with bigger,
better, faster, smarter, etc .... but the mass market stopped buying at
the end of the dot-com boom and the computer industry has been
floundering since.
> I'm more inclined to believe that if they were given the choice, most would
> instantly go for the latest computer equipment rather than not. Most likely
> they don't have a choice, or just plain don't care EITHER WAY.
I would love to have a play with a 512 Itanium SGI box just to see how
fast it was but there is no way I am going to fund one to find out.
> Dude, you are drifting way off-topic..
No, you did, by trolling an assembler newsgroup with C++ crap that few
would be interested in. I was one of those who was interested, as I
understood the work to be trying to emulate assembler within the crippled
assumptions of a high-level language, but you can be sure I am not one iota
interested in trolling for C++; I have heard it all before.
> I take it you have nothing to
> comment on the SSE zero scanner,
Yawn. I have seen megabytes of SSE(2) code in the last 5 or 6 years
written by people who are very good at it; tell me something new.
> You won't use it, because your software is fast enough as it was on a
> 486, hence no need to make it faster on a system with SSE support?
No, SSE(2) code runs really badly on older hardware. :) Hardware says
naughty things to you like "invalid opcode" and generally pulls the plug
unless you have some form of exception handling in place.
> it that you aren't using assembler for speed but because _you_ develop
> much faster with it. That says volumes about you as a developer, not
> about assembler.
I confess that my C is getting rustier by the day, but then C is getting
rustier by the day without my help. I see rapid development done in a
multitude of languages from Pascal to Basic to VB and scripting and the
like, so anyone assuming that C++ is at the leading edge of rapid
development is plain kidding themselves.
C used to be an excellent language for OS development, but by no stretch
of the imagination is it simple or fast to develop in. Remove its
precanned libraries and an assembler will kick its ass. Treating a C
compiler as a premium code generation tool displays some ignorance of
what a C compiler is good at.
The capacity to use ANY COMPATIBLE OBJECT MODULE and having the
appropriate notation to address and use it makes a C compiler more a
management tool than a premium code generation tool. They are by no
means bad in most instances, but they are by no means premium code
generators either.
> And once more: I am not against writing in assembler.
No, perhaps not, but you are trying to put it in a little box on the side
of your own C++ programming and further, you are trying to inflict this
view on other people who don't hold your assumptions.
> I have been using x86 assembly language for a long time.
Congratulations, so have many of us; we just don't all try and shoehorn
it into your little box, and this is what this discussion has been
about: you trying to inflict your view of how other people should write
their own code.
I don't personally care if you chisel code on granite blocks and read
it into a computer with an OCR reader but try and inflict it on others
and you will end up hearing why they don't agree.
15 years? Moore's Law was actually published 40 years ago, and has proven
itself to be startlingly accurate over that entire period.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
That fragment is essentially the inner loop; it works, too.
>I don't deny the use of multiport code, even though its user base is
>trivial in comparison to the mass market where if you could produce the
>"killer app" for the sum total of the rest, you would have hit less
>than 5% of the market.
I would say that the user base I am targeting with the multiport code
is much larger than the market you command. I'm developing microcode
and drivers for mobile phones. This is a market which just last year was
in excess of 600 million units (based on the data that Nokia, with a 30%
market share in 2004, shipped 200 million units alone).
Yes, Hutch, we still think it is programming even if we don't write
off-the-shelf applications for desktop computers.
>You will have to forgive me, but Agner Fog wrote the algo first. I have
>seen and timed many variations of it over time but his original
>architecture has stood the test of time. The only code of mine you saw
>in it was the leading alignment code, as the original was designed to
>work with 4-byte-aligned strings.
Mostly what I have seen from you is talk, not much code, and you call me the
troll. Go figure. Trolls don't go to the trouble of actually proving
their points, do they? Or how should I know; I'm not an expert on the
subject and not trolling. That's your view on the matter; if it were true,
you would only be playing a troll's game and being a sucker, but I can
assure you that you are not.
>Noting that the original posting for this thread was about a string
>length algo in asm, posting a well-known, faster, 486-compatible
If you note my first reply, it was on topic and gave instructions
on much smarter practices to avoid the need to optimize strlen() in the
first place. It was you who implied that assembly is a must in this
case and that HLL compilers should effectively be ignored.
That isn't the case; as I demonstrated quite clearly, the HLL code does
compile into precisely the *same* inner loop as your assembly piece.
Note that this is not a statement against assembly language; I, like many
others, write assembly and mix it with HLLs. It does not mean
we are *against* assembly or saying that it shouldn't be used.
You got to that conclusion all on your own, and I can tell you, mister,
that that is a wrong assumption and a wrong conclusion.
YOU have been ranting all this time, and complaining that I am some kind
of anti-assembly-language troll; heavens, I am posting at
comp.lang.asm.x86. While I post here and use assembly, it
doesn't mean that I cannot post reasons for when NOT to use assembly. I
don't think strlen() is a particularly clever thing to optimize in
assembly, no offence intended.
I think that you are simply over-reacting and being protective, and assume
that I have points of view which I don't have; there is nothing wrong
with that, until you go and broadcast them to the world as my
opinions.
>algorithm with a minor mod to align the start of the buffer is in fact
>reasonable, whereas trolling about how and why someone should choose
>multiport code is a long way off the subject.
Saying that before arguing the points for days would have much more
substance.
>Most assembler programmers have heard this crap before from people with
>the same cross to carry from being committed to an outmoded idea of how
I write assembly, often, yet I don't label myself as an assembler
programmer. I'm interested in x86 assembly, but I am also interested in
ARM assembly, MIPS and 680x0 and PPC assembly, among other things like
ANSI C, C++, C# and OCaml. The fact that I have an open mind doesn't mean
that what I write is automatically crap.
If you think that what I write, which is what I think, btw, is crap, that's
your opinion. I don't say your opinion is crap, as I know that you are
doing something entirely different. What that precisely is I don't know,
as you are under heavy NDAs not to reveal what you are working on or
have worked on, and it's none of my business to press the issue further.
But notice that I don't think that your opinions are "crap"; I just
don't agree with them because I have different systems I am working
with and on.
Before you say that "what you doing in asm.x86 then?", the answer is
that because I am interested in x86 aswell. Thought my interest is
obviously biased towards SIMD and x86-64. The interest in x86-64 is
mostly non-practical at this stage mostly fueled at fascination at
larger registers and larger number of them especially. But that is
entirely different topic so I won't go further in that direction.
>Everybody has a theory on development time, see the VB guys for really
>fast development times.
You seem to label people as "assembler programmers", "VB guys" and
whatnot. I'm just a programmer first and foremost; languages are just tools
and I am not very religious or fanatical about them.
For me, assembler is a tool and not the only tool in the toolkit. That
doesn't disqualify me from discussing it or automatically make me a
troll like you imply. Shame on you!
>Truly amazing complexity ? :)
Truly amazing, then, to use reaction time as an argument.
>I doubt that an old notebook Pentium hits the deck as high end
>hardware.
I thought you were talking about operating systems and new software in
general; that was the argument until this point, and now you are talking
about old notebook Pentiums.
>For someone who MUST depend on other people's prewritten code, it would
>in fact be very stupid but someone who does not know the difference and
>argues that they are writing their own code, they comfortably exceeds
>such a level of stupidity.
You took Agner Fog's code and pasted it here, you created a scenario
which would be very stupid to go through (rewriting the WIN32 API for
Windows, of all things) and then, of all things, suggested that I am stupid.
I think you didn't understand what I meant: going through the
scenario of recreating the WIN32 API for Windows would be stupid, not you.
For example, creating WINE isn't stupid, as it is meant to make it
possible to run Windows software on other platforms, nor were you stupid
for making the suggestion. Neither was I for taking the bait and trying
to play a game only a fool would play.
>> I am not against assembler. None whatsoever. I just don't see how it
>> improves the code SUBSTANTIALLY in this case (strlen).
>
>Its really simple actually, Agner Fog wrote that algo back when C
>compilers were producing crap like SCASB, years after it was out of
>date legacy code for pre 486 hardware and in the context of true 486
>compatible code that can be used by nearly every machine possible that
>can run x86 windows, it is still a good algo.
What that has to do with me saying that I am not against assembler in any
way whatsoever is a mystery to me.
>Whether someone can eventually emulate it in COBOL or FORTRAN or PASCAL
>or whatever else simply does not matter as it is a viable fast
>algorithm in a general purpose context.
It looks to me that Dr. Fog wrote it based on VAX code (written in C?),
if Dr. Hyde remembers correctly in one of his posts, so the code was
originally C to begin with -- so you could add X86 ASSEMBLY to the above
list, if you don't mind?
>info somewhere else. It may be something new to you but most programmers
>already know that you don't do more work on code and/or data than you
>need so when you get the length of string data, you normally store it
>somewhere
Then why an assembly strlen()? As I understand it, the OP didn't ask how
to compute the length of a zero-terminated string in x86 assembly, but
for a faster strlen() implementation; that implies a C API and warrants
such a suggestion.
>Well, what is all the noise about then ? I suggest it is you with an axe
>to grind about assembler programming that is making the noise and YES
>you are trolling in a topic that was a member asking about an assembler
>string length algo.
That noise is called discussion, and you are taking part in it. I don't
have an axe to grind -- to each his own and all that. That doesn't mean I
cannot express my points of view.
And a correction: it was a question asking for a faster strlen(), not for
an assembler string length algo(rithm). He also explicitly asked about
possible MMX enhancements.
>each processor but to put it into context, how often can a user justify
>the expense of buying a late model high end box so that an SSE2 algo
>can read a string length for a typed input slower than a normal integer
>version ?
I think about that in different terms. I don't think that I will buy a
high-end box just so that I can use an SSE2 algo for something trivial
like strlen(); rather, since I have an SSE2 box to begin with, I think
about how I could max it out.
I also don't think that a computer with SSE instructions, such as a
Pentium III, is all that high end. Those can be had used for less than
$50. Just looking at eBay offerings I see Pentium III laptops going for
$100-$200 (which is expensive for that sort of junk, IMHO -- I have seen
way cheaper) and under $100 for desktop systems. And that was just the
first page of hits; search for "Pentium III" on ebay.com ...
That may be a lot of money for some people, but then a Windows license
is much more expensive than that. They would be better off with a free OS
plus freeware and open source software, which do the same thing and
generally have lower hardware requirements, especially if not running KDE
or GNOME.
>Not only don't I care but I doubt that he does either. You may not be
>familiar with the situation but I regularly deal with assembler
>programmers who write assembler code daily so its not the pie in the
>sky mystery you appear to be assuming.
I am not assuming, or thinking for that matter, that it is a
pie-in-the-sky mystery. Neither am I against using assembler, nor am I
against using assembler for trivial, non-time-critical tasks. I just
don't do that myself.
>All you are saying here in the absence of grasping general purpose x86
>assembler is that you prefer to avoid the major market that does not
Who here is in the absence of grasping general purpose x86 assembler?
>own a 64 bit Opteron. It probably matches your view on multiport code
>that does the same.
You seem strangely obsessed with mentioning the Opteron in connection
with me; where did that come from?
>Most people with high end boxes already have this stuff but there are a
>very large number of people who use a win3.? box for word processing,
>DOS boxes for database software for stock inventory and a host of
>things they don't need or want to change.
If I had a lot of requirements like that on the software I write, I
would think that is a very important point. But as I don't, I think
it is a moot point. That is just how things swing for me, and it
doesn't mean I disrespect the way things worked out for you, as you
seem to believe.
I write a lot of tools as part of my work, and some stuff like
simulations can run for days. Any time we can cut off is money in
the bank. At this time I don't write software for desktop computers,
unless it is some tool or other, but then the environment it runs in is
in-house development systems and server clusters.
A different world. You cannot say anything about "market shares" for
work that simply has to be done. For what you do, market share seems to
be everything; that's all fine with me, really.
I would go as far as to say that it is a very good argument for
convincing yourself that you are doing the right thing, which I have no
doubt you are doing, really.
If you try the SSE code, how does it scale against your unrolled
scanner? I mean, if we want to stick to the technical discussion and not
troll about market shares, etc.
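For reference, the kind of inner loop I have in mind looks roughly like
this in C with SSE2 intrinsics. This is only a sketch, not the exact
fragment I posted: the function name is made up, and it leans on the
assumption that an aligned 16-byte load never crosses a page, so reading
a few bytes before the start or past the terminator is harmless in
practice.

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: compare 16 bytes at a time against zero and use the byte
   mask to find the terminator. */
size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    /* round the pointer down so every load is 16-byte aligned */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned off = (unsigned)(s - p);
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= ~0u << off;               /* ignore bytes before the start */

    while (mask == 0) {               /* no zero byte in this block */
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    /* lowest set bit = position of the NUL within the 16-byte block */
    unsigned idx = 0;
    while (!(mask & 1u)) { mask >>= 1; ++idx; }
    return (size_t)((p + idx) - s);
}

How it scales against the unrolled DWORD scanner surely depends on
string length and alignment; I have not benchmarked it against your
code, which is exactly why I asked.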
>You are trying to jump on the Microsoft bandwagon here with bigger,
>better, faster smarter etc .... but the mass market stopped buying at
>the end of the dot com boom and the computer industry has been
>floundering since.
Excuse me, I am trying to do what? Please explain.
>I would love to have a play with a 512 Itanium SGI box just to see how
>fast it was but there is no way I am going to fund one to find out.
A little bit of an extreme reaction here? I was merely writing a little
code which uses x86 SIMD, which isn't as uncommon as you are leading me
to believe. (0.02% market share? Which market would that be... where do
you pull these numbers from?)
>No, you did by trolling an assembler newsgroup with C++ crap that few
>would be interested in. I was one of those who was as I understood the
>work to try and emulate assembler in the crippled assumptions of a high
>level language but you can be sure I am not one iota interested in
>trolling for C++, I have heard it all before.
You have all the right in the world to think what you want.
>Yawn, I have seen megabytes of SSE(2) code in the last 5 or 6 years
>written by people who are very good at it, tell me something new.
The code in question is only a few lines and does 3x more work per
instruction than the code you posted. Don't knock it before trying it.
Also, look in the mirror: the post was very short and compact, while
this discussion about market shares and other non-topical things brought
in by you is not.
>No, SSE(2) code run really badly on older hardware. :) Hardware says
>naughty things to you like invalid opcode and generally pulls the plug
>unless you have some form of exception handling in place.
You seem somehow upset that there are improvements made to the x86
since the 486 and consider them generally "bloat" and "OOP(S)" crap. I
disagree that the enhancements are useless. I agree that they are most
probably useless to software that you write.
See? We can be in agreement here without any friction.
>multitude of languages from Pascal to basic to VB and scripting and the
>like so anyone assuming that C++ is at the leading edge of rapid
>development is plain kidding themselves.
I never said such a thing. But I think that you can get the trivial
stuff out of the way quicker and more easily with C and C++ than in
assembler. There's a world of difference between what I say and what
you claim I say.
What isn't a performance problem isn't bloat either. You can write very
bloat-free, lean and mean programs in C++. Four-kilobyte intros are
written in C++, for crying out loud. It's not about the language but
about the programmer who writes the bloat, when you compare assembler,
C and C++.
Yes, it has been said that in the absence of knowing what you are
doing you can easily write more bloat in an HLL than in assembler. But
that is only because you can write more code in the same time in an HLL
than in assembler, and that is assuming a hypothetical typical programmer.
If a typical programmer were an assembly programmer, I would be very
surprised, wouldn't you?
>C used to be an excellent language for OS development but by no stretch
>of the imagination is it simple or fast to develop in. Remove its
>precanned libraries and an assembler will kick its ass. Treating a C
>compiler as a premium code generation tool displays some ignorance of
>what a C compiler is good at.
How come no major operating system has ever been written entirely in
assembler? I know about a few projects that were never heard from again,
but a MAJOR operating system which commands, say, the majority of the
market (your words) -- think Windows -- why isn't it written entirely in
assembler?
Because an undertaking of that complexity would be totally unrealistic to
write in assembler, that's why.
While I am forced by your arguments to "find reasons" why assembler Is
Bad, that is not generally my attitude or opinion. I use assembler,
believe it or not. I am interested in assembler, too, believe it or not.
So far you have chosen not to believe it -- so you're calling me a liar now?
The sooner you stop thinking of me as some kind of anti-assembler troll,
the sooner you can have the calm and rational mindset for a rational
discussion.
>No perhaps not but you are trying to put it in a little box on the side
>of your own C++ programming and further, you are trying to inflict this
>view on other people who don't hold your assumptions.
I'm also putting an explanation alongside the expression of my opinion;
that is not an immoral or wrong thing to do. People who can think for
themselves don't need you to defend their own opinions.
>Congratulations, so have many of us, we just don't all try and shoehorn
>it into your little box and this is what this discussion has been
>about, you trying to inflict your view on how other people should write
>their own code.
The famous pot-calling-the-kettle-black situation.
>I don't personally care if you chisel code on granite blocks and read
>it into a computer with an OCR reader but try and inflict it on others
>and you will end up hearing why they don't agree.
You can speak only for yourself.
> >Where I have posted 2 complete working algorithms that both run on a
> >486 upwards, I saw as your contribution a fragment of SSE code with a
> >throw away line after it and a lot of waffle about trying to emulate
> >asm in C++.
>
> That fragment is essentially the innerloop, it works, too.
Fine, but you are imposing high level language "inner loop theory" on
people in an x86 newsgroup who are not saddled with your assumptions.
> I would say, that the userbase I am targeting with the multiport code
> is much larger than the market you command.
I seriously doubt you know much of any market I have worked on.
> I'm developing microcode
> and drivers for mobile phones. This a market, which just last year was
> in excess of 600 million units (based on the data that Nokia with 30%
> market share in 2004 did ship 200 million units alone).
Sounds like a viable way to make a buck but the assumptions of writing
hardware for portable gadgets is hardly the background for telling
people who have written commercial x86 assembler for years how to do
it. Having held Nokia shares in 2000 and got out before they fell
through the floor, you will have to forgive me for not being impressed
with their performance.
> If you note my first reply, it was on the topic and giving instructions
> on much smarter practices to avoid the need to optimize strlen() in the
> first place. It was you who implied that assembly is a must in this
> case and HLL compilers should effectively be ignored.
Same comment as before: string length information does not come by
immaculate conception, you must get it from somewhere. Once you have
got it you usually store it somewhere. You may be hinting at the bad
programming practice of not saving the data and having to repeatedly
regain it, but this has nothing to do with a StrLen() algo; it has to do
with fundamental code design.
> That isn't the case, as I demonstrated quite clearly the HLL code does
> compile into precisely the *same* innerloop as your assembly piece.
If I remember correctly, you spent the time trying to emulate an old
and well known algorithm and eventually got something like the same
timings, fine if it works for you but this algo has been in many
libraries for many years so you prove little apart from being able to
eventually emulate a 9 year old algo.
> Note that is not a statement against assembly language, I as many
> others write assembly and mix it with HLL languages. It does not mean
> we are *against* assembly or saying that it shouldn't be used.
Perhaps not, but it's the same problem as I mentioned in an earlier post:
you are attempting to place other people's coding capacity in your own
little box as a subset of how you use a high level language, yet there
is a vast number of experienced assembler programmers out there that
don't need the confines of your little box. While you speak of high
level language "inner loop theory", there is a multitude of assembler
programmers who comfortably write inner loops, outer loops,
intermediate loops, interdependent loops, and a mountain of other
freestyle variations.
I certainly did not introduce the claptrap of multiport code and C++
into this discussion, you did and you did to try and impose a set of
restrictions on other people on how they write assembler code based on
your own high level language disposition. I will with indifference to
your views post bits of code if I have it around to help out members
who ask for something and I don't particularly care if this does not
fit into your language preferences.
Like many who have written assembler code for a long time, I have heard
all of this crap before and it usually came from people who resented
the performance advantages of true low level code or the lack of need
>to conform to arbitrary standards of other languages. It used to be
open ridicule and abuse but enough pure assembler missiles went past
them to shut the noise up.
> It looks to me that Dr. Fog wrote it based on VAX code (written in C?),
> if Dr. Hyde remembers correctly in one of his posts, so the code was
> originally C to begin with -- so you could add X86 ASSEMBLY to the above
> list, if you don't mind?
Perhaps you should leave historical analysis alone. The vast majority
of historical algorithms existed before C did. C A Hoare wrote in the
60s, Shell in the 70s, Bob Boyer's BM search algo was written in PDP10
assembler, and Knuth designed in his own asm dialect. It was people like
Robert Sedgewick who developed algorithms mainly in C during the 80s,
yet the vast majority of fundamental algorithm design was up and going
before that, and you can look at languages like COBOL, Fortran, Pascal
and a few other old timers.
> And correction, it was a question asking for faster strlen() not for
> assembler string length algo(-rithm). He also explicitly asked for
> possible MMX enhancements.
In an x86 newsgroup, it does mean an assembler question, otherwise the
member probably would have posted in an "Object Pascal" or other
newsgroup.
> >Yawn, I have seen megabytes of SSE(2) code in the last 5 or 6 years
> >written by people who are very good at it, tell me something new.
>
> The code in question is only a few lines and does 3x more work per
> instruction than the code you posted. Don't knock it before trying.
Same comment as above: Yawn, I have seen many megabytes of very well
written SSE(2) code over the last few years, so a fragment of SSE code
is no ground-breaking achievement, and it runs really badly on a pre-SSE
processor.
This discussion ran out of interest when you stopped posting results of
your HLL optimisation and started waxing lyrical about the C++ compiler,
multiport code and where assembler programming fitted into your scheme.
You sound like you know what you are doing in your own area of
expertise but I seriously doubt you have convinced anyone apart from
yourself as to the virtues of trying to place assembler programming
into the confines of the box you have in mind.
Regards
I have not done such a thing. I said that writing strlen() in
assembly wouldn't bring substantial performance benefits. Then I wrote
HLL code to prove that case after you said, in effect, that HLLs suck.
>I seriously doubt you know much of any market I have worked on.
I've no doubt you have worked on many different markets. I only see work
that must be done and figure out ways to get it done with different
tradeoffs, which change from one project to the next.
I see. So you think SSE is not very useful for you, fine! I believe
you!
>Sounds like a viable way to make a buck but the assumptions of writing
>hardware for portable gadgets is hardly the background for telling
>people who have written commercial x86 assembler for years how to do it.
Now you label my background as that of someone who develops for portable
gadgets; while you're at it, remember to label me an "x86 assembly
programmer" too, because I have done that a great deal and have a
background in it (among other things).
Or better yet, don't label people at all, that usually works better --
at least try to keep the labeling to yourself.
No, I'm not telling you how to write x86 code. I'm asking what's wrong
with the SSE code; instead of simply stating that you don't need it
yourself, you start this long lecture about market shares (!), which I
don't really think are as you present them, but that is another topic.
>it. Having held Nokia shares in 2000 and got out before they fell
>through the floor, you will have to forgive me for not being impressed
>with their performance.
It doesn't matter who the market leader will be in the future; I only used
that to give an idea of how large the market is. 200 million devices
commanded 30% of the market in 2004; I have no agenda related to Nokia. I
quoted what my numbers were based on, nothing more, which is usually more
credible than throwing 0.01% or 0.02% out of thin air.
>programming practice of not saving the data and having to repeatedly
>regain it, but this has nothing to do with a StrLen() algo, it has to do
>with fundamental code design.
Yes, that's what I wrote in my first post.
>If I remember correctly, you spent the time trying to emulate an old
>and well known algorithm and eventually got something like the same
>timings, fine if it works for you but this algo has been in many
Actually, I got the same innerloop on the first compile. What I did was
simply to use variable names like "a" for eax, and so on. It was fairly
trivial and took, I think, no more than 10 minutes or so. I went back to
it later to correct a bug which I found in regression testing and put
some effort into the postfix since I figured I could use the code.
You never asked, you just assumed.
>libraries for many years so you prove little apart from being able to
>eventually emulate a 9 year old algo.
The assembly code emulates even older VAX C code, so you should stop
using this argument. I already made this point before.
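For anyone following along, the DWORD trick we keep circling around
looks roughly like this when spelled out in C. This is a sketch of the
general zero-byte test only, not Dr. Fog's actual code; the function
name is mine, and it assumes that reading the tail of the string in
aligned 4-byte units is acceptable.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* step byte by byte until the pointer is 4-byte aligned */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* scan a DWORD at a time; the expression is non-zero exactly
       when one of the four bytes in v is zero */
    for (;;) {
        uint32_t v;
        memcpy(&v, p, sizeof v);      /* aligned 4-byte read */
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;
        p += 4;
    }

    /* locate the exact zero byte inside the final word */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}

A compiler should turn the middle loop into much the same handful of
instructions as the hand-written DWORD versions, which was my point all
along.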
>little box as a subset of how you use a high level language, yet there
>is a vast number of experienced assembler programmers out there that
>don't need the confines of your little box. While you speak of high
The box is a figment of your imagination; I am like a sponge for new
tools and ideas and keep my mind open. What you are talking about is
some kind of average C++ evangelist, who you persistently assume me to
be.
>level language "inner loop theory", there is a multitude of assembler
>programmers who comfortably write inner loops, outer loops,
>intermediate loops, interdependent loops, and a mountain of other
>freestyle variations.
I was thinking more about what you put inside the loop, not the loop
control itself. When you write, say, FPU code, you keep track of what is
in which FPU stack position. That sort of manual labour can be a fairly
good way to spend time. Of course you will disagree with this on
principle. :)
>I certainly did not introduce the claptrap of multiport code and C++
>into this discussion, you did and you did to try and impose a set of
>restrictions on other people on how they write assembler code based on
>your own high level language disposition. I will with indifference to
I did not impose any restrictions on anyone; that claim is pure fiction.
I ask you: is there a MASM for Linux? That is one platform where x86 is
very common, too. My code works on x86-based Linux just dandy with g++;
yours doesn't, and yet you make that out to be a bad thing.
I wouldn't even discuss the point but you keep bringing it up. Each
time, you think of some excuse why the x86 code snip is superior in
almost every regard.
>Like many who have written assembler code for a long time, I have heard
>all of this crap before and it usually came from people who resented
>the performance advantages of true low level code or the lack of need
Do I come across as a person who resents the performance advantages of
well-crafted code, no matter what language is being used? Have I
implied a single time that your opinion is "crap" in any way?
>to conform to arbitrary standards of other languages. It used to be
>open ridicule and abuse but enough pure assembler missiles went past
>them to shut the noise up.
What are you on about?
>Perhaps you should leave historical analysis alone. The vast majority
>of historical algorithms existed before C did. C A Hoare wrote in the
In that case they also, pretty certainly, existed before x86 assembly.
>In an x86 newsgroup, it does mean an assembler question, otherwise the
>member probably would have posted in an "Object Pascal" or other
>newsgroup.
Or he thought it would be possible to gain a meaningful performance
increase from writing this function in assembler; I don't see that
happening in this thread.
>Same comment as above: Yawn, I have seen many megabytes of very well
>written SSE(2) code over the last few years, so a fragment of SSE code
>is no ground-breaking achievement, and it runs really badly on a pre-SSE
>processor.
Just as 386 code runs badly on a 286, some 286 programmers might think
that Dr. Fog's 386 code isn't a ground-breaking achievement, as they have
seen many megabytes of 386 code.
Okay, okay, I get it already! You don't use or need SSE, fine, fine!
>You sound like you know what you are doing in your own area of
>expertise but I seriously doubt you have convinced anyone apart from
>yourself as to the virtues of trying to place assembler programming
>into the confines of the box you have in mind.
Implementing complex state machines with native binary generators -is-
where the action is. Java JIT compilers, .NET IL translators and
similar are examples of this. Whether you like it or not, Windows Vista
is coming in 2006 if their schedules permit. It's not just me; the
industry is going in that direction, driven by Sun, IBM, Microsoft
etc.
It's not a box, it's the future.
That doesn't exclude us from enjoying developing in handwritten assembly
when it is useful. For you that is the primary development language?
Congratulations, but don't assume I am against assembly just because it
isn't mine (at this time, that is; the situation is always in flux and
changes depending on requirements).
> I have not done such a thing. I said that writing strlen() in
> assembly wouldn't bring substantial performance benefits. Then I wrote
> HLL code to prove that case after you said, in effect, that HLLs suck.
This is only recycling the same loop logic as before. This translates
to: if you spend long enough, you can almost emulate an old assembler
algo in an HLL. So what ?
The guy who wants a byte scanner has no need to feed it through your
assumptions, just like the guy who wants a DWORD version like Agner Fog's
does not either. Simply because you can cobble together something
similar does not mean that anyone else has to look through other
language implementations before they use it.
Do you keep up with Forth or Pascal algos ? I hope for your own sake
that you don't. How about the bleeding edge of VB algo design ?
This conversation turned into a nothing when you shifted from code to
waffle about C++ and while it may have gone past you, this is an x86
assembler newsgroup, not a C++ trolling place.
MASM for Linux ? Hold your breath waiting. :)
Regards,
> That doesn't exclude us from enjoying developing in handwritten assembly
> when it is useful. For you that is the primary development language?
> Congratulations
This is probably the funniest sentence in all this
demented discussion. It seems you do not know who and
what Hutch--, --, --, ... is. So here it is:
He is a Power Basic programmer, who occasionally inserts
Asm Code into his HLL developments, and he never wrote
anything significant in full Assembler. His most
significant Application, in full Assembly, is the
Editor found in the MASM32 package. Just take a look at
it, and you will understand better with whom you are
debating, and the reason why he cannot live without such
debates and such attitudes, pushing the readers towards
interpretations that are nothing but the exact reverse
of the facts.
;)
Betov.
During the years 1990-1995 Moore's Law applied to performance as well
as to the number of transistors found on a chip. Prior to that (in
microcomputer designs), performance was a bit behind the curve, and
obviously, after that, we're also a bit behind the curve. Moore's Law
(doubling of transistors every two years) is still working, but
designers can't figure out how to use those extra transistors in order
to speed up the processors (well, other than doubling cores, which
*doesn't* double performance).
Cheers,
Randy Hyde
The C++ compilers usually do the padding at 32-bit boundaries, and this
helps a lot when the processor reads code into the read-ahead buffer. I
think this one reason is sufficient not to code in assembly (except, of
course, the core super-routine of your program).
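If anyone wants to see that padding for themselves, one way (assuming
gcc here; the file and function names are just an example) is to compile
a small loop to assembly and look at the alignment directive the
compiler places in front of the loop label:

/* loop.c -- try:  gcc -O2 -S -falign-loops=16 loop.c
   then look for the .p2align directive emitted before the loop label
   in loop.s; the exact padding depends on the target and the options
   used. */
unsigned int count_nonzero(const unsigned char *p, unsigned int n)
{
    unsigned int c = 0, i;
    for (i = 0; i < n; ++i)   /* the head of this loop gets aligned */
        if (p[i])
            ++c;
    return c;
}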
Neo
> This conversation turned into a nothing when you shifted from code to
> waffle about C++ and while it may have gone past you, this is an x86
> assembler newsgroup, not a C++ trolling place.
Indeed, this is an x86 News Group, not a Power Basic
trolling place, but for the very few here interested
in Assembly, the single important point is that your
preferred discussion theme (String Length, repeated
again and again for at least 8 or 10 years...) is
the best way to definitively kill Assembly.
As Randall Hyde, the great expert in Flex and Bison, wrote
above, if your String Length is not fast enough for you,
it is quite simple not to execute it at all, so it will take
zero time and will beat hands down all of your demented
and ridiculous "Algos", whereas all of those in need
of an effective and practical snippet will live happily with:
mov edi StringPointer, ecx 0-1, al 0
repne scasb
mov eax 0-2 | sub eax ecx
; >>> Length in eax.
Betov.
"rand...@earthlink.net" <spam...@crayne.org> wrote in message
news:1130684744.7...@z14g2000cwz.googlegroups.com...
mov edi StringPointer, ecx 0-1, al 0
repne scasb
mov eax 0-2 | sub eax ecx
; >>> Length in eax.
It is not hard to see which is the 'demented and ridiculous "Algos"'
when the author of a broken assembler is willing to post trash like
this in an assembler newsgroup.
The algorithm uses 3 registers: EDI, ECX and EAX.
The algorithm has a stall from the partial register read of AL,
followed by the write to EAX.
SCASB is particularly slow.
Here is the viable alternative in PowerBASIC inline assembler. 2
registers, shorter code and far faster than anything using SCASB.
! mov edx, src
! or eax, -1
lbl:
! add eax, &H01
! cmp BYTE PTR [edx+eax], 0
! jne lbl
The problem for the author of a broken assembler is that he is still
unable to match the clear Intel syntax of a BASIC compiler, let alone
the archetypal macro assembler for 32-bit Windows, MASM.
Same algo in MASM,
mov edx, src                  ; edx = address of the string
or eax, -1                    ; index starts at -1, first ADD makes it 0
@@:
add eax, 01h                  ; step to the next byte
cmp BYTE PTR [edx+eax], 0     ; terminator reached?
jne @B                        ; no, keep scanning
                              ; EAX = string length on exit
Further comments are awaiting a reply from group moderation.
Regards,
On which platform did REP SCAS become slower than rolling your own?
Was it 486 or Pentium?
The trend with later optimising C compilers is to separate assembler
code completely from the main C code, as the inline assembler interferes
with the compiler optimisation. I gather that 64-bit Windows compilers
will not support inline assembler at all and you will have no choice
but to create separate assembler modules if you need assembler code.
Regards,
Hi Everyone,
And there are some of us that realise that if you do a lot of string
work, then null-terminated strings are just plain awful. (Yes, I do see
some of the advantages of null-terminated strings, however at what
cost?) Better off using meta-data or a string header to give the buffer
size and string length before the string (similar to how Turbo Pascal
and, IIRC, HLA do it), so that strlen operations are blazing fast...
e.g.
struct string {
buffer_size dw ?
string_length dw ?
string_data rb ?
}
Want to find the string length, well just load the string_length
variable...
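In C terms the same idea is roughly the following -- just a sketch, with
16-bit fields mirroring the header above and nothing standard about the
layout:

#include <stdint.h>
#include <stddef.h>

struct lstring {
    uint16_t buffer_size;     /* bytes allocated for data[]     */
    uint16_t string_length;   /* bytes currently in use         */
    char     data[1];         /* string bytes follow the header */
};

/* "strlen" collapses to a single field load, no scanning at all */
static size_t lstring_len(const struct lstring *s)
{
    return s->string_length;
}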
But then again, some people won't learn...
--
Darran (aka Chewy509) brought to you by Google Groups!
> On which platform did REP SCAS become slower than rolling your own?
> Was it 486 or Pentium?
From memory it was the single-pipeline 486 where incremented
pointers became faster than the old string instructions. I think that the
old string instructions are still OK in 16-bit code, but this has
something to do with 16-bit code being slow generally on anything from an
early Pentium upwards.
There may be one of the early AMDs that is faster with some of the
older instructions. I used to own an AMD K6-2 550 that had a fast LOOP
instruction that crashed Win95b.
> And there are some of us, that realise that if you do a lot of string
> work, then null terminated strings are just plain awful. (Yes, I do see
> some of the advantages of null terminated strings, however at what
> cost)? Better off using meta-data or a string header to give the buffer
> size and string length before the string (similar to how Turbo Pascal
> and IIRC HLA does it), so that strlen operations are blazing fast...
Your words are nothing but _one_ of the ways I was
thinking about when saying that it is quite simple
to _NOT_ compute any String Length.
So said, in Assembly, this is evidently not the job
of the language to do anything like this, but the job
of the programmer to choose the most appropriated
method for the actual problem. Your Turbo Pascal
and HLA are HLL, and you cannot promote any such
abusive generalization for Assembly, even if the
method may, occasionaly, be the one of choice.
In practice, when developing PEs there is very little
need of any String Length computation.
Betov.
> SCASB is particularly slow.
Intel problem. Not the programmer's problem.
AMD REP SCASB is not that slow and, anyway, again,
if your String Length is not fast enough for you,
keep away from any, and it will take no time, at
all. Period.
Betov.
> > SCASB is particularly slow.
>
> Intel problem. Not the programmer's problem.
>
> AMD REP SCASB is not that slow and, anyway, again,
> if your String Length is not fast enough for you,
> keep away from any, and it will take no time, at
> all. Period.
Documentation and objective testing say otherwise. Intel recommend NOT
using it and leave it there for backwards compatibility. The code is
larger and uses more registers apart from being far slower than the
smaller code.
String length information does not come by immaculate conception, it
has to be obtained somewhere. Writing out of date, oversized
sub-standard code with ancient technology is a particularly bad design
decision that is consistent with the author of the broken assembler.
As the author of the broken assembler seems to have some problem with
BASIC, he can learn from the posted BASIC inline assembler code that is
smaller and faster than his own.
Somewhere on or after the Pentium. Indeed, someone questioned a post
of mine the other day (private email) and I reverified that on the PIV,
rep scasb is *slower* than the code:
t=s;
while( *s ) ++s;
return s-t;
written in C (without optimization, which compiles to a very
straight-forward implementation in assembly of the C code above).
SCASB is great for those wanting to save space, but not for those who
want to run faster.
Cheers,
Randy Hyde
> String length information does not come by immaculate conception, it
> has to be obtained somewhere. Writing out of date, oversized
> sub-standard code with ancient technology is a particularly bad design
> decision that is consistent with the author of the broken assembler.
If you don't like SCASB, feel free to keep away from it,
and give us a break from your insanities, while I go on
living happily using SCASB for RosAsm, which is, up to now,
the fastest of the current Assemblers:
1 MB of source -- with lots of HLLisms -- per second on a poor
Celeron 1.3.
:)
Betov.
Hi Rene,
Exactly my point. Too many people assume that, because something is
provided with the HLL (aka the standard libraries), or because certain
items are implemented a particular way, it must be the best way.
Unless, of course, you're handed a zstring with no indication of the
length and you have to compute it yourself. If you don't do it very
often, feel free to use SCASB; it won't really matter how slow your
code runs.
>
> So said, in Assembly, this is evidently not the job
> of the language to do anything like this, but the job
> of the programmer to choose the most appropriated
> method for the actual problem.
And, sometimes, the most appropriate method is to compute the length of
a zstring, because you don't *get* the length out of thin air.
> Your Turbo Pascal
> and HLA are HLL,
Back to attacking HLA again?
Of course, HLA strings carry the length around with them. So there is
no such length computation.
> and you cannot promote any such
> abusive generalization for Assembly, even if the
> method may, occasionaly, be the one of choice.
And neither HLA nor Turbo Pascal require you to use the string formats
their library supports. Both support zstrings, for example. Both let
you put characters into an array and manipulate that character data any
way you like. The programmer can decide to do whatever they choose.
Neither the language nor the library requires them to do it however
you're thinking it must be done. There is no reason, for example, that
an HLA programmer (or even a Turbo Pascal programmer, via BASM) could not
write all their string length functions using SCASB, just as you claim
to do inside RosAsm :-).
>
> In practice, when developing PEs there is very little
> need of any String Length computation.
What on earth does the file format of the executable file have to do
with the algorithms and data structures used in the application? The
fact that you don't need to compute string length in RosAsm much does
not imply that it doesn't need to be computed in other applications. I,
too, have written lots of apps that don't compute the length of strings
(at all), I certainly wouldn't generalize such experience and claim
that there is no need for such a function in any program that compiles
to the PE file format; that's just ridiculous.
Cheers,
Randy Hyde
>> So said, in Assembly, this is evidently not the job
>> of the language to do anything like this, but the job
>> of the programmer to choose the most appropriated
>> method for the actual problem.
>
> And, sometimes, the most appropriate method is to compute the length of
> a zstring, because you don't *get* the length out of thin air.
I know that you are not an expert in matters of Win32
API programming, but you should at least know that
most of the Win32 Functions that have to deal with
Strings do effectively provide "the length out of
thin air".
>> Your Turbo Pascal
>> and HLA are HLL,
>
> Back to attacking HLA again?
Where? HLA is not an HLL? Saying that HLA is an HLL is
"attacking HLA"? Or is the simple flat truth far too cruel?
> Of course, HLA strings carry the length around with them. So there is
> no such length computation.
Feel free to implement whatever seems accurate to you
to implement in your HLL Pre-Parser: this is _your_
HLL Pre-Parser. :)
>> and you cannot promote any such
>> abusive generalization for Assembly, even if the
>> method may, occasionaly, be the one of choice.
>
> And neither HLA nor Turbo Pascal require you to use the string formats
> their library supports. Both support zstrings, for example. Both let
> you put characters into an array and manipulate that character data any
> way you like. The programmer can decide to do whatever they choose.
> Neither the language nor the library requires them to do it however
> you're thinking it must be done. There is no reason, for example, that
> an HLA programmer (or even a Turbo Pascal programmer, via BASM) could not
> write all their string length functions using SCASB, just as you claim
> to do inside RosAsm :-).
Great. Congratulations. :))
>> In practice, when developing PEs there is very little
>> need of any String Length computation.
>
> What on earth does the file format of the executable file have to do
> with the algorithms and data structures used in the application?
Same answer as above: for example, when you retrieve a String
from an EditBox, the API tells you the length for free.
> The
> fact that you don't need to compute string length in RosAsm much does
> not imply that it doesn't need to be computed in other applications. I,
> too, have written lots of apps that don't compute the length of strings
> (at all), I certainly wouldn't generalize such experience and claim
> that there is no need for such a function in any program that compiles
> to the PE file format; that's just ridiculous.
In case you did not yet know, RosAsm is an Assembler, that
is to say a program that reads a text File and parses it
through a series of Parsers, which is a good example of String
work. So I fail to see where and when I could have said that
absolutely no 'StringLength' computation could ever be needed:
I said that discussing the _optimizing_ of such an Algo is absurd,
as long as, in cases where there could be any speed problem
with it, several other methods exist (including the
one you are pushing as the universal solution -- which it is
evidently not). Nothing but the usual major key-word of
Assembly: Strategy Optimizations.
Also, given the speed of RosAsm, as opposed to you and Hutch,
I can prove my claims by my own work. Period.
Betov.
A possible optimization to strlen would be to end every string with two
NUL bytes ("\0\0") instead of one. This would allow you to test only
every second byte and decrease the time taken by up to half. It would
require an extra byte of space for every string, and it wouldn't
technically be standard C, but for very long strings it could be worth it.
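As a sketch of what that scan could look like in C (assuming every
string really is stored with the two trailing zero bytes; the function
name is made up, and whether the extra bookkeeping wins in practice
would need measuring):

#include <stddef.h>

/* Requires the string to end with "\0\0". Only even offsets are
   tested; a terminator at an odd offset is still caught because
   the byte after it is also zero. */
size_t strlen2(const char *s)
{
    size_t i = 0;

    while (s[i] != '\0')
        i += 2;

    /* the zero we stopped on is either the terminator itself or
       the byte just after it; the previous byte disambiguates */
    if (i > 0 && s[i - 1] == '\0')
        return i - 1;
    return i;
}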