I have read some manuals on Pentium optimization and have tried to
write my own routine, faster than what the gcc compiler produces --
just a simple routine, strlen. The result is the same time; my code is
similar to the code found on Paul Hsieh's page. I have also tried
reading a word and a dword at a time, but the result doesn't change:
my routine runs in the same time as strlen. Does anyone know another
approach (maybe MMX instructions) to improve the algorithm?
regards
claudio
void RoutineC ( void )
{
    char *source = "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDE\0" ; // 128 characters
    unsigned int len = 0;
    len = strlen ( &source[0] ) ;
}
void RoutineASM ( void )
{
    char *source = "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDEF" \
                   "01234567890ABCDEF0123456789ABCDE\0" ; // 128 characters
    unsigned int len = 0;
    __asm__ ( " .p2align 4,,15\n\t"
              "     movl  %%edx ,%%ebx\n\t"    /* ebx = scan pointer = start */
              "l1:  movb  (%%ebx),%%ah\n\t"    /* load next byte             */
              "     incl  %%ebx\n\t"
              "     testb %%ah ,%%ah\n\t"
              "     jne   l1\n\t"              /* loop until the '\0'        */
              "     subl  %%edx ,%%ebx\n\t"    /* ebx = end + 1 - start      */
              "     decl  %%ebx\n\t"           /* don't count the '\0'       */
              : "=b" (len)                     /* length comes back in ebx   */
              : "d" (&source[0])               /* pointer goes in via edx    */
              : "eax", "cc", "memory"          /* ah lives in eax            */
            ) ;
    len = len ;
}
string x = ...;
int length = x.length();
This can be a simple variable dereference.. now the next problem is
determining the length of the string when assigning a "const char*
text" into it. This is, of course, a tradeoff:
- setup time is increased
- other activities are faster
On the other hand,
- "strlen" is always done -- at setup
But,
- "strlen" is ONLY done -- at setup
Of course, the "length" member of string could be initialized to -1 to
signal "unknown" length and computed when the value is first queried, a
sort of lazy evaluation. On the other hand, that would introduce a
test-branch every time the value is queried; then again, it might be an
overall win to just compute the length at initialization.
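Roughly, the two variants side by side -- a minimal sketch assuming a
hypothetical hand-rolled class, not any particular library:

#include <cstring>

class lazystring
{
    const char* data_;
    mutable int length_;              // -1 signals "unknown" (lazy variant)
public:
    explicit lazystring(const char* text)
        : data_(text),
          length_(-1)                 // or: (int)std::strlen(text) to pay at setup
    {}

    int length() const
    {
        if ( length_ < 0 )            // the extra test-branch on every query
            length_ = static_cast<int>(std::strlen(data_));
        return length_;               // afterwards just a member read
    }
};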
Now to the question (at last!): I am interested in what the code that
does 16 bits or 32 bits at a time looks like. It would help to see what
you already tried..
Here's a typical strlen() implementation in C:
int strlen(const char* text)
{
const char* s = text;
for ( ; *s; ++s )
;
return (int)(s - text);
}
Visual C++ 8.1 Beta 2 compiles it like this:
mov eax, OFFSET $SG-5
$LL3@strlen:
add eax, 1
cmp BYTE PTR [eax], 0
jne SHORT $LL3@strlen
sub eax, OFFSET $SG-5
ret 0
Precisely the intention in C, translated into assembly. Unless we have
some more clever optimizations in mind, such as testing two or more
characters with a single branch, the function doesn't really.. pay off
to write in assembly..
Optimizing x86 code is also much different these days from writing for
the original Pentium.. look at the Pentium 4 NetBurst
microarchitecture: 128 registers internally.. the code is translated on
the fly into RISC-like micro-instructions, and the translated code is
cached (the code cache is called the "trace cache" in the P4). AMD has
a different approach in its K8 architecture, but knowing x86 assembly
doesn't tell you jack about the runtime cost of the code unless you
know how the internals work.. which by itself is not very beneficial..
x86 assembly programming.. well, these days I think the strong point in
favour of that sort of activity is generating code at runtime and then
executing it. Virtual machines, and realtime-optimized systems where
the number of permutations for a computation is too large, spring to
mind. Definitely areas where the other alternatives are smoked alive.
But that requires the know-how to write an optimizing compiler (at
least the backend).
That said, I think writing strlen() in assembly is not very productive,
considering how good the compilers these days are getting (Intel, GNU,
Microsoft..) -- but that's just me, don't be discouraged. :)
spam...@crayne.org wrote:
> That said, I think writing strlen() in assembly is not very productive,
> considering how good the compilers these days are getting (Intel, GNU,
> Microsoft..) -- but that's just me, don't be discouraged. :)
Out of curiosity, what happened to "repne scasb"?
Is an explicit loop of smaller instructions faster, or do compilers
not know about this instruction?
Cheers,
Brendan
Don't be overawed by compilers; assembler coding is not restricted to
the architecture of a C compiler. The following code is a modification
of Agner Fog's DWORD string length routine that aligns the start and
then tests four bytes at a time. It has no stack frame and conforms to
the normal register preservation rules under Windows, so it preserves
ESI and EDI but trashes the rest.
You will still need to convert it to AT&T syntax, but it should have
the legs on any byte scanner around.
fn_004010A4:
push edi
push esi
mov eax, [esp+0Ch]
mov ecx, eax
add ecx, 3
and ecx, 0FFFFFFFCh
sub ecx, eax
mov esi, ecx
jz lbl2
sub eax, 1
lbl0:
add eax, 1
cmp BYTE PTR [eax], 0
jz lbl1
sub ecx, 1
jns lbl0
jmp lbl2
lbl1:
sub eax, [esp+0Ch]
jmp lbl5
lbl2:
lea edx, [eax+3]
nop
lbl3:
mov edi, [eax]
add eax, 4
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, 80808080h
jz lbl3
test ecx, 8080h
jnz lbl4
shr ecx, 10h
add eax, 2
lbl4:
shl cl, 1
sbb eax, edx
add eax, esi
lbl5:
pop esi
pop edi
ret 4
Regards,
hutch at movsd dot com
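For what it's worth, the zero-byte test in the DWORD loop above can be
illustrated in a few lines of C++ (my own example, not part of the
post): for a byte b, (b - 1) gets bit 7 set when b is 0 (via the
borrow), while ~b has bit 7 set only when b <= 0x7f, so ANDing the two
flags exactly the zero bytes; borrows can only disturb lanes above the
first zero, so the first hit is always correct.

#include <cstdio>

int main()
{
    unsigned int w = 0x00434241;   // "ABC\0" as a little-endian dword
    unsigned int m = (w - 0x01010101u) & ~w & 0x80808080u;
    std::printf("%08x\n", m);      // prints 80000000: only byte 3 is flagged
    return 0;
}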
>brendan
>
>Out of curiosity, what happened to "repne scasb"?
I have tried it, but that code runs slower than the code above.
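For reference, a repne scasb routine might look something like this in
gcc inline assembly -- a sketch of what was presumably tried, since the
actual code wasn't posted (the name scasb_strlen is made up):

static inline unsigned int scasb_strlen ( const char *s )
{
    unsigned int len;
    const char *end;                        /* receives the updated edi */
    __asm__ ( "repne scasb"                 /* scan bytes at (%edi) for al == 0 */
              : "=c" (len), "=D" (end)
              : "0" (~0u), "1" (s), "a" (0)
              : "cc"
            ) ;
    (void) end;
    return ~len - 1;                        /* ecx counted down from -1; drop the '\0' */
}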
> Faster way to implement strlen() is to store the length of the string,
> example:
Good solution.
> hutch
Your code works better than mine. I compiled it with
gcc -O3 str.c -o str
and ran the loop many times; these are the results:
debian:~/source# ./s10
timer start : 1130244757
timer stop : 1130244775
timer diff : 18
timer start : 1130244775
timer stop : 1130244793
timer diff : 18
debian:~/source#
__asm__ ( ".p2align 4,,15\n\t"
          "pushl %%edi\n\t"
          "pushl %%esi\n\t"
          "movl  %%ebx,%%eax\n\t"
          "movl  %%eax,%%ecx\n\t"
          "addl  $3,%%ecx\n\t"           /* align the pointer up by 4        */
          "andl  $0xFFFFFFFC,%%ecx\n\t"
          "subl  %%eax,%%ecx\n\t"        /* ecx = misalignment count         */
          "movl  %%ecx,%%esi\n\t"
          "jz    lbl2\n\t"
          "subl  $1,%%eax\n\t"
          "lbl0:\n\t"
          "addl  $1,%%eax\n\t"
          "cmpb  $0,(%%eax)\n\t"         /* scan the first 1..3 bytes        */
          "jz    lbl1\n\t"
          "subl  $1,%%ecx\n\t"
          "jns   lbl0\n\t"
          "jmp   lbl2\n\t"
          "lbl1:\n\t"
          "subl  %%ebx,%%eax\n\t"        /* terminator found while aligning  */
          "jmp   lbl5\n\t"
          "lbl2:\n\t"
          "leal  3(%%eax),%%edx\n\t"     /* pointer+3, used at the end       */
          "nop\n\t"
          "lbl3:\n\t"
          "movl  (%%eax),%%edi\n\t"      /* read 4 bytes                     */
          "addl  $4,%%eax\n\t"
          "leal  -0x1010101(%%edi),%%ecx\n\t"
          "notl  %%edi\n\t"
          "andl  %%edi,%%ecx\n\t"
          "andl  $0x80808080,%%ecx\n\t"
          "jz    lbl3\n\t"               /* no zero byte yet, keep going     */
          "testl $0x8080,%%ecx\n\t"
          "jnz   lbl4\n\t"
          "shrl  $16,%%ecx\n\t"          /* 10h in the MASM original = 16    */
          "addl  $2,%%eax\n\t"
          "lbl4:\n\t"
          "shlb  $1,%%cl\n\t"
          "sbbl  %%edx,%%eax\n\t"
          "addl  %%esi,%%eax\n\t"
          "lbl5:\n\t"
          "popl  %%esi\n\t"
          "popl  %%edi\n\t"
          : "=a" (len)
          : "b" (&source[0])
          : "ecx", "edx", "cc", "memory"
        ) ;
1. I did assert that doing more than a single compare per iteration
could be a reason to write this in assembly, so I wasn't advocating
compilers that much. But it still looks to me that the assembly code in
the original post isn't worth writing in assembly.. you get the same
result in a portable and easier-to-maintain form through the use of,
say, C.
2. The only instruction that is really out of reach when writing this
in C/C++ is sbb; there is no predictable way to invoke sbb/adc that I
know of, anyway. Another example is barrel shifts, and shld/shrd are
also something that might come in handy.. but performance rarely
suffers from the alternate implementations.. :)
I don't remember the precise ratios, but I think it goes something like
this: 95% of the time is spent in 5% of the functions. If it is
possible to refactor the code not to call that 5% too frequently --
even better, to avoid calling it at all -- performance usually
increases. No assembly required (!) <- hehe.. that's why I think
storing the length of the string is a pretty nice tradeoff, because it
makes things like string concatenation, asking for the length,
comparing two strings and what not, much more efficient.
Example:
string a = ...;
string b = ...;
if ( a != b ) { ...
Internally, the first check could be a nice, cool if ( a.length !=
b.length ) -- cheap, and statistically it does most of the work most of
the time, too. The length member simply is the way to go! Generally,
the strcat(), strcpy(), strlen() ... API is pretty inefficient. But
damn, it's _simple_, and most importantly, it *needs* heavy
optimization at the implementation level because of the implicit
inefficiencies this arrangement causes!
int xstrlen(const char* text)
{
const char* p = text;
unsigned int a = reinterpret_cast<unsigned int&>(text);
unsigned int c = ((a + 3) & 0xfffffffc) - a;
unsigned int s = c;
if ( c )
{
for ( ; c >= 0; --c )
{
if ( *p++ == 0 )
return static_cast<int>(p - text - 1);
}
}
const unsigned int* ap = reinterpret_cast<const unsigned int*>(p);
unsigned int d = reinterpret_cast<unsigned int&>(p) + 3;
for ( ;; )
{
unsigned int i = *ap++;
c = (i - 0x01010101) & ~i & 0x80808080;
if ( c )
break;
}
if ( !(c & 0x8080) )
{
c >>= 16;
s += 2;
}
return reinterpret_cast<unsigned int&>(ap) - d - !(c >> 31) + s;
}
Here's the assembly output from MSVC++ (latest version):
push esi
mov esi, OFFSET $SG-28+3
mov eax, OFFSET $SG-28
and esi, -4 ; fffffffcH
sub esi, eax
je SHORT $LN6@xstrlen
$LL8@xstrlen:
mov cl, BYTE PTR [eax]
add eax, 1
test cl, cl
jne SHORT $LL8@xstrlen
sub eax, OFFSET $SG-28
sub eax, 1
pop esi
ret 0
$LN6@xstrlen:
mov eax, OFFSET $SG-28
npad 6
$LL4@xstrlen:
mov edx, DWORD PTR [eax]
lea ecx, DWORD PTR [edx-16843009]
not edx
and ecx, edx
add eax, 4
and ecx, -2139062144 ; 80808080H
je SHORT $LL4@xstrlen
test ecx, 32896 ; 00008080H
jne SHORT $LN1@xstrlen
shr ecx, 16 ; 00000010H
add esi, 2
$LN1@xstrlen:
shr ecx, 31 ; 0000001fH
not ecx
and ecx, 1
sub eax, ecx
sub eax, OFFSET $SG-28+3
add eax, esi
pop esi
ret 0
Is it just me, or do the "inner loops" resemble each other quite a bit?
Why shouldn't we rely on compilers again? If you time the code, the two
versions perform within +- 10% of each other, too.
I still think very much that assembly itself is not an optimization
tool anymore; of course, if you know it, you write better higher-level
code, especially when you know the compiler.. and you check the
assembly *output* to verify the compiler isn't doing something really
stupid.
The best use for machine-specific instructions, IMHO, is to let the
machine generate them. Be this an offline compiler like g++, Visual
C++, et al., or a realtime code generator like a JIT compiler or
something like that.
I'm not "dissing" assembly per se; I used to hold strongly the opinion
that it is the way to performance. I started with the Z80, then some
Commodore 64 (the 6510, if memory serves), 68000, MIPS, PPC, x86,
etc. etc. But over time everyone learns the old truth about optimizing
at the wrong level and in the wrong spots, premature optimization and
so on.. at the very heart of it, optimizing compilers are a very
interesting topic.
So is assembly, x86 included, but if a C/C++ compiler can give the same
performance from higher-level code which is easier to read, write and
generally "see" the flow of, I prefer that. What I am going to do next
is proof-read, rewrite to be portable and then regression-test the
code.. as I didn't do this conversion very carefully, and the variable
names are really just register names from the original assembly code, I
shall rename them to reflect their use better (and clean up the code in
general :)
Then what I'll do with it is archive it and never touch or look at it
again. ;-----)
I also tested different versions of the "sum" routine (basically: if
bit 7 of a byte is set, increase the sum...), and tried this for
example:
s += !((v >> 31) & 1) + ...;
That's branchless.. but I didn't like the NOTs in there, so I added:
v = ~v;
s += ...;
No dice: still no good, the code was slower than the one in the fixed
version below.. the divide-and-conquer technique that has two
compare-branches still beats the branch-free version (testing on a
Pentium M).
This code is somewhat regression-tested with randomly generated strings
against strlen() for returning the correct length.. and also with
bounds checking on array accesses. Note: some bytes after the reserved
memory are read, depending on the leftover chars in a dword.. 3, 2 or 1
bytes too many. This isn't a problem with the Windows memory allocator,
for instance, which reserves *heap* memory from new/malloc in 8-byte
chunks. Nor is it a problem on the stack with objects of "auto" storage
class, as the stack is a valid reading area.. of course the chars after
the terminating null character are garbage, but this code doesn't care
about that.
I haven't thought about how this works on a big-endian architecture,
and the code isn't portable in its current shape anyway. For instance,
aliasing the "const char*" with "unsigned int" is very bad programming
in general; the conversion from pointer to integer should be done
differently.. but I do it that way for this version of the code
only. :)
Actually, writing portable code is harder than it seems.. and this code
ain't portable. But it demonstrates the concept that "C/C++ code isn't
that slow" ;-)
int xstrlen(const char* text)
{
const char* p = text;
unsigned int a = reinterpret_cast<unsigned int&>(text);
unsigned int alignment = ((a + 3) & 0xfffffffc) - a;
unsigned int s = alignment;
if ( alignment )
{
for ( unsigned int i=0; i<alignment; ++i )
{
if ( *p++ == 0 )
return static_cast<int>(p - text - 1);
}
}
s -= reinterpret_cast<unsigned int&>(p) + 3;
const unsigned int* ap = reinterpret_cast<const unsigned int*>(p);
unsigned int v = 0;
for ( ; !v; )
{
unsigned int u = *ap++;
v = (u - 0x01010101) & ~u & 0x80808080;
}
if ( !(v & 0x8080) )
{
v >>= 16;
s += 2;
}
if ( !(v & 0x80) )
{
++s;
}
return reinterpret_cast<unsigned int&>(ap) + s - 1;
}
Some refactoring might still pay off.. looking into how "s" is updated,
there might be room for improvement, but I think the basic idea is
nailed down more or less now. I think one or two variables could be
dropped without aliasing problems.. but the inner loop is where the
action is, and that's pretty decent as things are.. so optimizing
further might not yield much improvement, except with very short
strings.. which are the fast case anyway! So that about wraps it up..
The code has been refactored for better performance, and it works on
both little- and big-endian architectures (tested on MIPS R10000 and
PPC G5). It goes without saying that it works on IA32 and AMD64 /
x86-64..
template <typename chartype>
inline int string_length(const chartype* text)
{
assert( text != NULL );
const chartype* s = text;
for ( ; *s; ++s )
;
return static_cast<int>(s - text);
}
template <>
inline int string_length<char>(const char* text)
{
assert( text != NULL );
const char* p = text;
const char* base = 0;
meta::intp address = static_cast<meta::intp>(text - base);
unsigned int alignment = ((address + 3) & 0xfffffffc) - address;
if ( alignment )
{
for ( unsigned int i=0; i<alignment; ++i )
{
if ( *p++ == 0 )
return static_cast<int>(p - text) - 1;
}
}
const uint32* ap = reinterpret_cast<const uint32*>(p);
uint32 v = 0;
for ( ; !v; )
{
uint32 u = *ap++;
v = (u - 0x01010101) & ~u & 0x80808080;
}
uint32 s = static_cast<int>(reinterpret_cast<const char*>(ap) - p) +
alignment - 3;
#ifdef FUSIONCORE_BIG_ENDIAN
if ( !(v & 0x80800000) )
return v & 0x8000 ? s + 1 : s + 2;
return v & 0x80000000 ? s - 1 : s;
#endif
#ifdef FUSIONCORE_LITTLE_ENDIAN
if ( !(v & 0x8080) )
return v & 0x00800000 ? s + 1 : s + 2;
return v & 0x0080 ? s - 1 : s;
#endif
}
Note, the different offsets are -1, 0, 1, 2.. it might be possible to
compute those more efficiently; currently it's an if-else-?: mess.. if
we assign each bit a different weight and do a masked sum, we might get
the index adjustment value FAST, but it doesn't look worth the effort..
I'll try not to post once more on the topic. :)
Hope this is my last post about this..
FYI, test results:
strlen() 24.0 usec
this version: 14.5 usec
asm version: 13.5 usec
That's on the same machine (Pentium M 1.8 GHz), some number of
iterations and test repetitions.. the average of 20 tests (smallest and
largest result dropped from the average) with timings rounded to the
nearest 0.5 usec.. the difference between the asm and C++ versions
seems negligible from a practical point of view ( < 8 % ).
Thanks for the tip, this comes in handy (not that string initialization
has ever been a performance bottleneck for me.. but in principle let's
have better code when possible, plus I enjoy a little optimization fun
now and then, so thank you!!!)
if ( v & 0x00008080 )
return s - ((v & 0x00000080) >> 7);
return s + 2 - ((v & 0x00800000) >> 23);
The next step is to handle 64 or 128 bits per iteration (at least the
64-bit case should be trivial with MMX/SSE, or natively on a 64-bit
platform if writing this in C/C++ ..)
Note that such a version is liable to crash because we cannot guarantee
that reads stay within allocated boundaries anymore (unless we allocate
the memory ourselves, taking care of the issue?)
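For the 64-bit case without MMX, here is a rough sketch in plain C++ of
what I mean -- my own illustration, assuming a little-endian target,
<cstdint> types, and that reading up to 7 bytes past the terminator
inside an aligned 8-byte block is tolerated, as discussed above.

#include <cstdint>

int strlen64(const char* text)
{
    const char* p = text;
    while ( reinterpret_cast<std::uintptr_t>(p) & 7 )   // byte-scan up to alignment
    {
        if ( *p == 0 )
            return static_cast<int>(p - text);
        ++p;
    }
    const std::uint64_t* ap = reinterpret_cast<const std::uint64_t*>(p);
    std::uint64_t v;
    do
    {
        std::uint64_t u = *ap++;
        v = (u - 0x0101010101010101ULL) & ~u & 0x8080808080808080ULL;
    } while ( !v );
    const char* block = reinterpret_cast<const char*>(ap) - 8;  // block with the '\0'
    int offset = 0;
    while ( !(v & 0x80) )                                // find the first flagged byte
    {
        v >>= 8;
        ++offset;
    }
    return static_cast<int>(block - text) + offset;
}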
Here is a slightly tweaked version of the algo I posted. It unrolls a
block of code by 8 and replaces an immediate in the loop code with the
same value in a spare register. It is clocking in on my test PIV at
about 22% faster than the last version I posted.
I have done all of the testing on strings that are misaligned so that
the alignment code is forced to run.
;
«««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
fn_00401460:
mov [esp-4], esi
mov [esp-8], edi
mov [esp-0Ch], ebx
mov [esp-10h], ebp
mov ebx, 80808080h
mov ebp, 4
mov eax, [esp+4]
mov ecx, eax
add ecx, 3
and ecx, 0FFFFFFFCh
sub ecx, eax
mov esi, ecx
jz lbl2
sub eax, 1
lbl0:
add eax, 1
cmp BYTE PTR [eax], 0
jz lbl1
sub ecx, 1
jns lbl0
jmp lbl2
lbl1:
sub eax, [esp+4]
jmp lbl6
lbl2:
lea edx, [eax+3]
mov edi, edi
lbl3:
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jne lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
jnz lbl4
mov edi, [eax]
add eax, ebp
lea ecx, [edi-1010101h]
not edi
and ecx, edi
and ecx, ebx
je lbl3
lbl4:
test ecx, 8080h
jnz lbl5
shr ecx, 10h
add eax, 2
lbl5:
shl cl, 1
sbb eax, edx
add eax, esi
lbl6:
mov esi, [esp-4]
mov edi, [esp-8]
mov ebx, [esp-0Ch]
mov ebp, [esp-10h]
ret 4
Unrolling like that kinda kills branch prediction, I suppose. And burns
code/tracecache for no real gain..
It probably depends on what you are running it on and how you have set
it up. It should be 16-byte aligned. This was tested on a PIV 2.8 gig
Prescott, but if you have inlined this in a C compiler, check a few
things like how it protects registers on entry and exit and what
alignment it is running at.
These are the timings I get with a MASM test piece.
szLen_xx is the first version
szLen_x2 is the second version
Prescott PIV 2.8 gig
--------------------
1282 MS szLen_xx
859 MS szLen_x2
1266 MS szLen_xx
859 MS szLen_x2
1266 MS szLen_xx
859 MS szLen_x2
1266 MS szLen_xx
859 MS szLen_x2
AMD Sempron 2.4
---------------
1859 MS szLen_xx
1688 MS szLen_x2
1875 MS szLen_xx
1687 MS szLen_x2
1875 MS szLen_xx
1688 MS szLen_x2
1875 MS szLen_xx
1687 MS szLen_x2
> Unrolling like that kinda kills branch prediction, I suppose. And burns
> code/tracecache for no real gain..
Branch prediction penalties need to be weighed, if in fact they are a
factor, against jump reduction, which is generally more useful as taken
jumps are usually slower than fall-through jumps. In the unroll, the
unpredicted jump is only taken once, on exit, so it's not a factor in
the loop code.
This is the actual form of the MASM code that I derived the two
postings from.
OPTION PROLOGUE:NONE
OPTION EPILOGUE:NONE
align 16
szLen_x2 proc item:DWORD
mov [esp-4], esi
mov [esp-8], edi
mov [esp-12], ebx
mov [esp-16], ebp
mov ebx, 80808080h ; load immediates into registers
mov ebp, 4
mov eax, [esp+4]
mov ecx, eax ; copy EAX to ECX
add ecx, 3 ; align up by 4
and ecx, -4
sub ecx, eax ; calculate any misalignment in ecx
mov esi, ecx ; store ECX in ESI
jz proceed
sub eax, 1
@@:
add eax, 1
cmp BYTE PTR [eax], 0 ; scan for terminator for
je quit ; up to the 1st 3 bytes
sub ecx, 1
jns @B
jmp proceed
quit:
sub eax, [esp+4] ; calculate length if terminator
jmp outa_here ; is found in 1st 3 bytes
; ----------------
proceed: ; proceed with the rest
lea edx, [eax+3] ; pointer+3 used in the end
align 4
@@:
REPEAT 7
mov edi, [eax] ; read first 4 bytes
add eax, ebp ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jnz nxt ; exit loop on zero bytes
ENDM
mov edi, [eax] ; read first 4 bytes
add eax, ebp ; increment pointer
lea ecx, [edi-01010101h] ; subtract 1 from each byte
not edi ; invert all bytes
and ecx, edi ; and these two
and ecx, ebx
jz @B ; no zero bytes, continue loop
nxt:
test ecx, 00008080h ; test first two bytes
jnz @F
shr ecx, 16 ; not in the first 2 bytes
add eax, 2
@@:
shl cl, 1 ; use carry flag to avoid branch
sbb eax, edx ; compute length
add eax, esi ; add misalignment count
outa_here:
mov esi, [esp-4]
mov edi, [esp-8]
mov ebx, [esp-12]
mov ebp, [esp-16]
ret 4
szLen_x2 endp
OPTION PROLOGUE:PrologueDef
OPTION EPILOGUE:EpilogueDef
Unrolling code won't *kill* branch prediction; it might not take
advantage of branch prediction, but it doesn't kill it (i.e., make it
less effective when it is used).
Of course, the main problem I'm seeing here is that people are running
their programs and measuring the results on different CPUs. And all
bets are off when you do that. This is why it's almost a worthless
exercise to "count cycles" these days. If you optimize the code on one
CPU, chances are pretty good it *won't* be optimal on a different CPU.
So unless you're writing the code to run on a specific rev of a
specific CPU, shaving one or two clock cycles off the execution time of
an algorithm is a real waste of time.
While I can appreciate why it would be nice to have the world's fastest
strlen program, given the effort that seems to be put into this problem
over and over again, the more reasonable question to ask is "why aren't
all these people using a better data structure that makes string
operations more efficient?" (e.g., including the length as part of the
string data type; maybe reference counters, too). Sure, we'd all love
the idea of simply recompiling old programs that don't know any better
and having them run faster, but the truth is that if people would
simply stop using zstrings (except for interface with OSes and other
code that uses them) and employ a better internal data structure, the
whole issue of the "fastest strlen" algorithm would go away. If people
would use reference counters and pointers, then worrying about the
"fastest block move operation" would diminish, too.
As the old saying goes - get the algorithm (and, presumably, data
structure) right *FIRST*.
Cheers,
Randy Hyde
Unrolling 8 times while having essentially verbatim the same inner loop
(branch instruction excluded) is not a particularly effective
optimization. The biggest "optimization" is that the code is bigger.
/*
Of course, the main problem I'm seeing here is that people are running
their programs and measuring the results on different CPUs. And all
*/
It's not really a problem. I'm running MIPS, PPC, x86-64 and IA32 and
also working with ARM (various versions), and generally higher-level
optimizations are "portable"; there are of course small fluctuations
based on architectural differences, but generally, as I just wrote,
higher-level code which does less "work" is usually faster on any
platform.
The biggest differences come from floating point to integer (and vice
versa) interaction and related issues; the lack of a floating point
processor also mixes things up, but that isn't the case here. The code
is fairly trivial, integer-only; only a very RISC-like ALU is required.
The number of registers needed is small here, too.
/*
bets are off when you do that. This is why it's almost a worthless
exercise to "count cycles" these days. If you optimize the code on one
CPU, chances are pretty good it *won't* be optimal on a different CPU.
*/
First, I am not counting cycles. I am measuring average performance
over a large number of runs on data that isn't the same on each
successive run. The slowest and fastest runs are always dropped so that
random fluctuation is eliminated. Additionally, each test string is
tested with different start offsets so that different alignment cases
and different "leftover" cases are tested.
The same code works as a regression test, too. :)
/*
So unless you're writing the code to run on a specific rev of a
specific CPU, shaving one or two clock cycles off the execution time of
an algorithm is a real waste of time.
*/
In production code I rarely do this; I write pretty good code off the
bat anyway. But this is for fun, dude -- you think Usenet is where my
work is at? ;-)
/*
over and over again, the more reasonable question to ask is "why aren't
all these people using a better data structure that makes string
operations more efficient?" (e.g., including the length as part of the
*/
Apparently you haven't read the whole thread. I think it was me who
suggested, multiple times in fact, that storing the length of the
string in the string object is the way to go. Here's the string class,
actually:
http://www.liimatta.org/misc/string.hpp
It uses template metaprogramming for string operations; I wrote this as
an experiment on the topic in 2003. The basic principle is that when we
have the expression:
string bla = a + b + c + d + e ...;
the object on the right-hand side is a tree type (look at the
expr_string template); the right and left nodes are trees, which store
the operands of each side. This tree will be "unbalanced" as we don't
have parentheses; however, that isn't a problem, as the tree is
traversed only once (the assignment evaluates the tree) and inserts are
always O(1).
What this enables is that the "length" of the whole tree is known at
evaluation time in linear time O(n) (this could be improved, but I
couldn't be bothered, as I don't believe in optimizing what doesn't
require it).
After the length is known, each node in the tree is basically
block-copied into its slot in the left-hand argument object. No
temporary string objects are created for each successive + operator,
as with a traditional string (object) implementation.
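A rough sketch of the idea -- my own reconstruction from the
description above, not the actual string.hpp, whose names and details
differ:

#include <cstring>
#include <string>

struct leaf                                            // wraps one operand
{
    const char* s;
    std::size_t len;
    leaf(const char* p) : s(p), len(std::strlen(p)) {}
    std::size_t length() const { return len; }
    char* copy_to(char* dst) const { std::memcpy(dst, s, len); return dst + len; }
};

template <typename L, typename R>
struct expr                                            // one node of the tree
{
    L lhs; R rhs;
    expr(const L& l, const R& r) : lhs(l), rhs(r) {}
    std::size_t length() const { return lhs.length() + rhs.length(); }
    char* copy_to(char* dst) const { return rhs.copy_to(lhs.copy_to(dst)); }
};

inline expr<leaf, leaf> operator+(const leaf& l, const leaf& r)
{ return expr<leaf, leaf>(l, r); }

template <typename L, typename R>
inline expr<expr<L, R>, leaf> operator+(const expr<L, R>& l, const leaf& r)
{ return expr<expr<L, R>, leaf>(l, r); }

template <typename Expr>
std::string evaluate(const Expr& e)                    // one allocation, one pass
{
    std::string out(e.length(), '\0');
    e.copy_to(&out[0]);
    return out;
}

// usage: the total length of the tree is known before a single byte is copied
// std::string s = evaluate(leaf("a") + leaf("b") + leaf("c"));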
/*
the idea of simply recompiling old programs that don't know any better
and having them run faster, but the truth is that if people would
simply stop using zstrings (except for interface with OSes and other
code that uses them) and employ a better internal data structure, the
*/
The internal data structure still needs initialization, and strings'
lengths are needed very often. Lazy evaluation of them invokes a branch
for every use of the information, which I already explained in this
thread. There's a tradeoff being made between different usage patterns.
/*
whole issue of the "fastest strlen" algorithm would go away. If people
would use reference counters and pointers, then worrying about the
"fastest block move operation" would diminish, too.
*/
I am not interested in "fast strlen" per se; I am interested in
discussing the futility of optimizing it in *assembler* when the
performance difference is negligible. I first said that it's not such a
bright idea, just using kinder words. Then I went to my usual trouble
to put my code where my mouth is. Done -- as stated, I don't talk shit
and can back up my opinion with facts.
/*
As the old saying goes - get the algorithm (and, presumably, data
structure) right *FIRST*.
*/
Well, of course, you are talking to the guy who proposed that three
days before you..
After that is done, something on the order of doubling the performance
is still pretty cute for being just a petty implementation detail. I
want to emphasize the fact that I find it mostly cute that someone
wants to stick to assembly code, which isn't portable and has, for all
practical purposes, equivalent performance to C++ code that does the
same thing.
For the record, I'm not a big fan of writing software in assembly; I
am a big fan of understanding assembly on the platforms I work on so
that I understand how to craft my code. The assumption is that the
optimizations the compiler implements don't suck, such as constant
propagation, dead code elimination, common subexpression elimination,
spilling and what not. That is, when such things matter.
They rarely do, as a matter of fact, as only a fraction of code is
time-critical. I'd go as far as to say that if strlen() speed is
time-critical there is something seriously wrong with the design.
However, that isn't the POINT of this thread. It is (for me) about
higher-level languages not necessarily implying inefficient, or slow..
or assembly automatically being synonymous with fast.
The way I see it, C is a portable assembler, which makes it a jack of
all trades but master of none. But in this case it doesn't hurt
performance; as can be seen, the beginning of the code is nearly
identical when compiled. The only real difference is at the end,
because I cannot express the "subtract with borrow" concept in C,
therefore I crafted an alternative which has nearly the same
performance in practice.. it ONLY adds a little constant overhead, but
the inner loop is, well, the same.
Also, I could unroll the C++ inner loop but I won't do it because I
think it is not a particularly good idea in this case.
If you align to an 8-byte boundary, you can safely read 64 bits at a
time, as the blocks would be aligned to the granularity of the memory
allocation unit. I bet that would double the performance if you are so
inclined.
--> misc ranting ..
My goal was to demonstrate that modern C++ compilers are pretty darn
good: they come into about the same ballpark in performance as
handwritten assembly. The other message was to promote more sound
programming techniques (see the notes about storing the string length
to begin with; I fail to see how you could beat THAT with the following
usage:)
void foo(const string& bar)
{
int size = bar.length();
// do something with length...
}
Better software isn't about better implementation but better software
engineering practices. I deliberately don't say *faster* but *better*,
because the performance-critical areas are usually not in string
processing. There is only so much string data to process to begin with,
unless it is a very special application (I bet it's not your usual
run-of-the-mill desktop application -- think Google, SQL, ..)
The string processing was just a good example of code that can be
written in generic, portable C++ and compiled for nearly any platform
without severe performance penalties, even on the x86 platform where
the code had been handcrafted with performance in mind (why else
assembly?)
That's why it is pointless to do performance comparisons between
assembly and higher-level language implementations: usually, the
assembly-heavy software is light on algorithms and data structures,
because such things are a real pain to maintain in assembly. Not
impossible -- been there, done that.. when I was young & wild..
Here's an example of my stupidity from 1997:
www.liimatta.org/misc/_alphablend.asm
A more optimal implementation recognizes the pattern (which this is
based on anyway):
xyzw
wwww
~x~y~z~w
~w~w~w~w
What we notice is that we basically need two mode bits:
1. w replicate
2. complement
These create the four permutations we see above, so we could easily
re-construct the code sequence from four basic loop start blocks and
"build" the code on the fly without resorting to very comprehensive
techniques. Now we have 100 unique inner loops, which we could generate
(!) easily with a simple algorithm.
Even better, we could construct more complex sequences by adding very
simple rules and still use fixed register allocation.
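To make the "two mode bits" observation concrete, here is a small
illustration of my own (not the original MMX code): replicate-w and
complement expressed as template parameters, so all four fetch variants
come out of one description. A runtime code generator would emit the
equivalent machine-code fragments instead of instantiating templates.

#include <cstdint>

struct pixel { std::uint8_t x, y, z, w; };

template <bool ReplicateW, bool Complement>
inline pixel fetch(const pixel& p)
{
    pixel r = p;
    if ( ReplicateW )                    // mode bit 1: wwww instead of xyzw
        r.x = r.y = r.z = r.w = p.w;
    if ( Complement )                    // mode bit 2: ~ each component
    {
        r.x = ~r.x; r.y = ~r.y; r.z = ~r.z; r.w = ~r.w;
    }
    return r;
}

// the four permutations listed above:
//   fetch<false,false>  xyzw          fetch<true,false>  wwww
//   fetch<false,true>   ~x~y~z~w      fetch<true,true>   ~w~w~w~w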
But I would rather write this logic in C++, have the frontend and
interface be C++ (or ANSI C), and then only use the (in this case, MMX
assembly) machine-specific binary as an output format, mainly for
performance reasons.
The rationale for this has many aspects, here are a few:
- we cannot write all permutations at compilation time in ANY language
if the number of permutations is too high
- so, we will use branching and possibly also calls to subroutines to
mix simpler pieces into a more complex "whole", but this is inefficient:
in the context this code is used in, one function call per fragment is
one too many
- we could use templates to generate the code at compilation time
(which is the approach I use in one library for pixel-format
conversions and blitting), but only because the number of permutations
is still manageable there
- we could use macros, and keep including the same header with
different values for the macros to generate different functions with
different parameters, but we have the same problem as before: too many
permutations, and we end up with a 120 MB executable very easily
- realtime-generated code doesn't even HAVE TO BE fully optimized; what
matters is that it is REASONABLY efficient, as the alternative would be
very, very slow code
- it's even better if the designer and implementor of such a system has
experience in compiler design and implementation (that shouldn't be too
difficult a hurdle; I think this is standard material for programmers
at universities around the world)
- if there is no experience, it is possible to gain it by practice; it
just slows down the progress initially :)
Et cetera. Those are just my thoughts on "assembler" -- bytecode is
where the action is at, and it's hard to find faster bytecode than
platform-native binaries.. now the problem is twofold:
- writing a good optimizing compiler for intermediate code generation
- developing a good instruction selection implementation :)
That's where the action is; don't miss it by fumbling assembly by hand,
which is mostly a waste of time. Just my $.02
With tongue in cheek as usual, welcome to the world of mixed-model
cross-hardware development. :)
This is true, but it still doesn't *kill* branch prediction.
>
> /*
> Of course, the main problem I'm seeing here is that people are running
> their programs and measuring the results on different CPUs. And all
> */
>
> It's not really a problem. I'm running MIPS, PPC, x86-64 and IA32 and
> also working with ARM (various versions), and generally higher-level
> optimizations are "portable"; there are of course small fluctuations
> based on architectural differences, but generally, as I just wrote,
> higher-level code which does less "work" is usually faster on any
> platform.
>
> The biggest differences come from floating point to integer (and vice
> versa) interaction and related issues; the lack of a floating point
> processor also mixes things up, but that isn't the case here. The code
> is fairly trivial, integer-only; only a very RISC-like ALU is required.
> The number of registers needed is small here, too.
All the more reason that counting cycles isn't a good approach for a
strlen optimization.
>
> /*
> bets are off when you do that. This is why it's almost a worthless
> exercise to "count cycles" these days. If you optimize the code on one
> CPU, chances are pretty good it *won't* be optimal on a different CPU.
> */
>
> First, I am not counting cycles.
Yes, you are.
> I am measuring average performance
> over a large number of runs on data that isn't the same on each
> successive run. The slowest and fastest runs are always dropped so that
> random fluctuation is eliminated. Additionally, each test string is
> tested with different start offsets so that different alignment cases
> and different "leftover" cases are tested.
And this is averaged across all possible x86 CPUs (and others, too),
right? If not, you're counting cycles on *one* particular CPU. And
that's a waste of time if you're trying to produce generic results.
>
> The same code works as a regression test, too. :)
???
>
> /*
> So unless you're writing the code to run on a specific rev of a
> specific CPU, shaving one or two clock cycles off the execution time of
> an algorithm is a real waste of time.
> */
>
> In production code I rarely do this; I write pretty good code off the
> bat anyway. But this is for fun, dude -- you think Usenet is where my
> work is at? ;-)
Yes, rewriting strlen algorithms seems to be one of the funnest things
ever. Seems like someone new is doing it every other week.
>
> /*
> over and over again, the more reasonable question to ask is "why aren't
> all these people using a better data structure that makes string
> operations more efficient?" (e.g., including the length as part of the
> */
>
> Apparently you haven't read the whole thread.
Why bother?
Do you really think you're adding anything new to the discussion that
hasn't been said a thousand times already? Do you really think this is
new territory around here (or anywhere else)? The better question for
you to ask is "why bother responding to a thread that is probably the
#1 FAQ in a newsgroup such as this one?"
> I think it was me who suggested, multiple times in fact, that storing
> the length of the string in the string object is the way to go. Here's
> the string class, actually:
>
> http://www.liimatta.org/misc/string.hpp
Do you think you're the first to make this suggestion? Do you think my
response is directed specifically at you? Do you think....
No matter what you said before, it's always important to inject reality
checks into threads such as this one. And this thread is definitely in
need of a reality check. And this isn't a direct response to you, it's
a comment to anyone reading this thread who thinks that there might be
something to what's being said here.
The truth is, *memory architecture* is the biggest impediment to memory
scanning algorithms like strlen. Even if *every* (x86) CPU had exactly
the same timing characteristics, measurements would still be all over
the map based on memory controllers and access times (unless, of
course, you cheat and assume that all your strings are in cache before
you run your strlen function, which isn't exactly fair).
[lots of unrelated stuff snipped]
>
> /*
> the idea of simply recompiling old programs that don't know any better
> and having them run faster, but the truth is that if people would
> simply stop using zstrings (except for interface with OSes and other
> code that uses them) and employ a better internal data structure, the
> */
>
> The internal data structure still needs initialization, and strings'
> lengths are needed very often.
Well, let's see. You initialize the string's length once. You use it
many times. Sounds like a pretty good tradeoff to me.
> Lazy evaluation of them invokes a branch
> for every use of the information,
Who said anything about lazy evaluation? And even if we *do* use lazy
evaluation, it's still a *whole* lot cheaper than running strlen
every time we want the length, no?
> which I already explained
> in this thread. There's a tradeoff being made between different usage
> patterns.
Obviously, if you know something about your data, you can improve on
the algorithm in use. One thing you seem to know is that people use the
length quite frequently. So cache it rather than compute it (that is,
store the length as part of the data structure). While there may be
some *rare* cases where the extra work to save away this length is more
expensive than recomputing it each time, I'm not sure I can think of
any off-hand.
>
> /*
> whole issue of the "fastest strlen" algorithm would go away. If people
> would use reference counters and pointers, then worrying about the
> "fastest block move operation" would diminish, too.
> */
>
> I am not interested in "fast strlen" per se; I am interested in
> discussing the futility of optimizing it in *assembler* when the
> performance difference is negligible.
Yes. It is futile to optimize bad algorithms to begin with. This is
true no matter *what* language you use. That being said, all you're
doing is saying that *some* HLL implementations beat *some* assembly
optimizations. Do you really think this is something new? The bottom
line is that on the variety of x86 processors available today, if
you've got a C compiler that generates code for *that specific*
processor and you compare this against assembly code that was optimized
for a *different* processor, what do you expect? And, of course,
CPU-dependent HLL code that does a decent job is going to walk all over
some sample code that was written by someone who (1) isn't very good,
(2) doesn't understand the characteristics of the CPU you're running
with, or (3) both.
> I first said that it's not such a
> bright idea, just using kinder words. Then I went to my usual trouble
> to put my code where my mouth is. Done -- as stated, I don't talk shit
> and can back up my opinion with facts.
Why are you wasting your time? :-)
>
> /*
> As the old saying goes - get the algorithm (and, presumably, data
> structure) right *FIRST*.
> */
>
> Well, of course, you are talking to the guy who proposed that three days
> before you..
No, I'm talking to the guy who proposed it in a repeat of a very long
running thread. I can assure you that my response to this question in
1996 was the same as it is today.
>
> After that is done, something on the order of doubling the performance
> is still pretty cute for being just a petty implementation detail. I
> want to emphasize the fact that I find it mostly cute that someone
> wants to stick to assembly code, which isn't portable and has, for all
> practical purposes, equivalent performance to C++ code that does the
> same thing.
Unfortunately, your plan doesn't scale up very well. The problem with
C++ is *not* that you can't write efficient code if you're *very*
careful and consider the code the compiler is emitting (and adjust your
C++ source code appropriately). The problem is that no one writes
anything but trivial little (and often non-portable) code this way.
IOW, C++ really doesn't have much benefit over assembly when you go to
all this trouble other than it might be able to run on different CPUs
(but you often lose the optimizations when you do this).
>
> For the record, I'm not a big fan of writing software in assembly; I
> am a big fan of understanding assembly on the platforms I work on so
> that I understand how to craft my code.
This is good.
> The assumption is that the
> optimizations the compiler implements don't suck, such as constant
> propagation, dead code elimination, common subexpression elimination,
> spilling and what not. That is, when such things matter.
Then the compilers turn around and generate brain-dead sequences for
other stuff. Believe me, I've spent a lot of time studying optimizing
compilers and their code emission (as I'm sure you have). I've seen
them produce some *brilliant* optimizations. I've *learned* some really
neat tricks by doing this. But for every brilliant optimization I've
seen in a particular compiler, I've also found a slew of bone-headed
code sequences that reduce brilliance to mediocrity.
I have no doubt that you can take a trivially small function like
strlen and coerce your HLL code to emit stuff about as good as a
hand-optimized assembly version of the same code. But as I mentioned
earlier, this trick doesn't scale up to larger programs. From a
software engineering perspective, it's just as hard to write C/C++ (or
other HLL code) this way as it is to write assembly. And the tricks
you pull to get the good code emission often won't carry over to other
architectures (or even different CPUs in the same family). Though the
code *may* be portable (in the sense that the semantics are preserved
across compilations on different CPUs), the optimizations themselves
often are not. For example, IIRC your strlen function is working on
dwords. What happens when you compile this on a 16-bit CPU? On a 64-bit
CPU? Maybe the code works fine on various 32-bit CPUs, but the
optimization is hardly portable across different CPUs. (BTW, just for the
record, I first saw this optimization in the BSD C standard library
code back around 1995, I wonder when the trick was first created?
Probably for the VAX?)
>
> They rarely do, as a matter of fact, as only a fraction of code is
> time-critical. I'd go as far as to say that if strlen() speed is
> time-critical there is something seriously wrong with the design.
> However, that isn't the POINT of this thread. It is (for me) about
> higher-level languages not necessarily implying inefficient, or slow..
> or assembly automatically being synonymous with fast.
So your proof is *one* example versus another?
Hmmm... Hardly convincing.
I can provide you with a whole slew of strlen programs in assembly that
run much slower than:
t = s;
while( *s ) ++s;
return s-t;
Indeed, if compilers are *so* great, why can't they convert this code
into something as wonderful as what you've presented? After all, doing
so is really just an induction step (albeit, a complex one).
>
> The way I see it, C is a portable assembler,
C provides no access to low-level machine facilities. Therefore, it is
not an assembly language. It may be lower-level than languages like C++
or Java, that does not make it an assembly language. In order for it to
be an assembly language, you need access to all the features of the
CPU.
> which makes it a jack
> of all trades but master of none. But in this case it doesn't hurt
> performance; as can be seen, the beginning of the code is nearly
> identical when compiled.
You make this claim with just one example?
Gee, I'd argue that it's going to be real hard for an assembly language
programmer to beat the code that a C compiler produces for the
following:
i = 0;
That doesn't prove C compilers are as good as assembly programmers by
any stretch of the imagination. Your example is a bit more complex, but
nowhere near sufficient to "prove" the point.
The real problem with HLLs like C++ is that they encourage people to
write code like:
for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
And they have no idea what the compiler is doing with their code. Take,
for example, that innocuous "si++" at the end of the for argument list.
Written in standard C style (++ as a suffix, which is bad style to
begin with), this simple statement may wind up creating a large
temporary object and then immediately destroying it. How ugly. Sure,
someone who knows exactly what's going on behind the scenes probably
wouldn't write code this way, but how often do you see people writing
standard C++ programs the way you wrote your xstrlen? Do you honestly
write *all* your C++ code that way? Or do you just write code that way
when you're trying to prove that C++ compilers can emit code that's as
good as assembly programmers? And when you *do* write code that way, is
it any faster or easier than using assembly?
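For what it's worth, here is a minimal sketch (mine, not from any
particular library) of why "si++" can cost more than "++si" for a
non-trivial iterator: the postfix form has to copy the iterator just to
return the old value.

struct iterator
{
    const char* p;

    iterator& operator++()        // prefix: advance in place, nothing copied
    {
        ++p;
        return *this;
    }

    iterator operator++(int)      // postfix: copy, advance, return the copy
    {
        iterator old = *this;     // temporary object constructed...
        ++p;
        return old;               // ...and destroyed at the call site
    }
};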
Bottom line is that most C++ programmers would just write:
t = s;
while( *s ) ++s;
return s-t;
(or something similar) and move on. (This is assuming, of course, that
the function wasn't in the stdlib.) It doesn't matter how good the
compiler is (or should be), it's not going to be able to deal well with
code that is written in this fashion.
Though assembly programmers rarely do any better when writing a lot of
code (that is, they don't count cycles either), the bottom line is that
they are usually *forced* into looking at the abominations they create,
it's generally not hidden from them. As a result, they tend to do a
better job of implementing an algorithm that executes efficiently on
the underlying hardware than does a HLL programmer who doesn't have a
clue what's going on. Before you get in a tiff, I *do* realize that
*you* probably do know what's going on. But you don't write all the
world's HLL code.
> The only real difference is at the end,
> because I cannot express the "subtract with borrow" concept in C,
Yes, you do not have access to the low-level machine. As I said, C is
not an assembly language. Believe me, you don't have access to a *lot*
of things that might be useful on occasion.
> therefore
> I crafted an alternative which has nearly the same performance in
> practice.. it ONLY adds a little constant overhead, but the inner loop
> is, well, the same.
The #2 thread (after strlen) is memcpy. My alternative is to simply use
the movsb instruction. When it's all said and done, it's not that much
slower than doing *really* fancy stuff involving SSE instructions,
unrolled loops, and other nonsense that doesn't help improve memory
bandwidth much. I won't argue that I've got the fastest memcpy around,
but the differences are rarely worth the effort.
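A minimal sketch of that alternative in gcc inline assembly -- my
illustration rather than Randy's actual code, and it assumes the
direction flag is clear, as the ABI guarantees:

static inline void* movsb_memcpy ( void *dst, const void *src, unsigned int count )
{
    void *d = dst;
    __asm__ __volatile__ ( "rep movsb"          /* copy ecx bytes from (esi) to (edi) */
                           : "+D" (d), "+S" (src), "+c" (count)
                           :
                           : "memory"
                         ) ;
    return dst;
}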
>
> Also, I could unroll the C++ inner loop but I won't do it because I
> think it is not a particularly good idea in this case.
That depends entirely on the CPU and memory architecture.
Cheers,
Randy Hyde
pxor mm0,mm0
xloop:
pcmpeqb mm0,[esi]
pmovmskb eax,mm0
pxor mm0,mm0
add esi,8
// ...
Feel free to change the order if you think it helps.
- do a 64-bit / 8-component compare; the dest will have every byte cell
set to 0xff if the corresponding byte value in the source operand was
zero, and 0x00 otherwise. Then pack the MSB of each component into a
32-bit register.
From there on, the rest is too trivial to even mention.. about
instruction counts: your technique uses roughly 1.5 instructions per
char, this does roughly 0.5 instructions per char, and uses 64-bit
aligned reads. How much faster it is in practice.. if you want to
know.. find out!
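Since 128 bits per iteration was mentioned earlier in the thread, here
is a rough sketch of the same pcmpeqb/pmovmskb idea using SSE2
intrinsics -- my own illustration, not code from the thread; it assumes
gcc (for __builtin_ctz) and tolerates aligned over-reads within a
16-byte block, as discussed earlier.

#include <emmintrin.h>
#include <cstdint>

int sse2_strlen(const char* text)
{
    // align down to 16 bytes; bytes before 'text' in the first block are
    // masked off below, and the aligned load cannot cross a page boundary
    const __m128i* ap = reinterpret_cast<const __m128i*>(
        reinterpret_cast<std::uintptr_t>(text) & ~static_cast<std::uintptr_t>(15));
    unsigned int skip =
        static_cast<unsigned int>(text - reinterpret_cast<const char*>(ap));

    const __m128i zero = _mm_setzero_si128();
    unsigned int mask = static_cast<unsigned int>(
        _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(ap), zero)));
    mask >>= skip;                       // ignore bytes before the start of the string
    if ( mask )
        return __builtin_ctz(mask);      // index of the first zero byte

    for ( ;; )
    {
        ++ap;
        mask = static_cast<unsigned int>(
            _mm_movemask_epi8(_mm_cmpeq_epi8(_mm_load_si128(ap), zero)));
        if ( mask )
            return static_cast<int>(reinterpret_cast<const char*>(ap) - text)
                 + __builtin_ctz(mask);
    }
}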
p.s. don't worry I been around..
Here's what I wrote:
"Unrolling like that kinda kills branch prediction, I suppose. And
burns
code/tracecache for no real gain.. "
See, when I am not sure, I say so (", I suppose"). I don't see the
problem here; I didn't correct the correction, now did I? I clarified
what I meant and acknowledged the correction as due, yet it still seems
to be an issue somehow. Am I missing something?
/*
All the more reason that counting cycles isn't a good approach for a
strlen optimization.
*/
What "metric" would you, sir, use to measure which code is more
"efficient" when it comes to runtime performance? Timing is good
practice; the Pentium M isn't exactly off the charts and/or wildly
different in its timings from other contemporary x86 implementations.
I wanted to know how each performs, on *my system*, which I carefully
mention nearly every single time I post what it is (Pentium M 1.8
GHz). I post the results and Hutch is free to post his, which he has
done. This already gives three different x86 implementations to compare
with.
> Yes, you are.
I see. I thought you meant "count clock cycles" as in the days when it
was still possible, with the Pentium or older x86 implementations,
memory performance excluded. I thought I was *timing* the code, which
is slightly different, but apparently not enough.
Again, I don't see a problem with measuring the performance on one
system; note that I am not drawing conclusions about how the code will
perform on other, different systems. I am extrapolating, with good
reason, that the performance will be relatively similar or in a similar
ballpark.
/*
And this is averaged across all possible x86 CPUs (and others, too),
right? If not, you're counting cycles on *one* particular CPU. And
that's a waste of time if you're trying to produce generic results.
*/
And it wouldn't be a waste of time to run out and spend a day or two
testing this code on 20 or even 40 different x86 implementations? I
think it is a fair comparison: that's how the code performs on a
Pentium M 1.8 GHz. That was stated many times in this thread.
Of course, you admit later in your reply that you haven't READ the
whole thread, so how could you have known?
/*
Yes, rewriting strlen algorithms seems to be one of the funnest things
ever. Seems like someone new is doing it every other week.
*/
I didn't rewrite anything. I reverse-engineered the C++ source from the
assembly source. I believe Agner Fog wrote the assembly code in
question.
>> Apperently you haven't read the whole thread.
>
>Why bother?
To know what I was replying to, for starters? At first you seemed to
hold the misconception that I am somehow an advocate of strlen(), etc..
you seem grossly misinformed in your assumptions about what my stance
was in the first place. But why bother?
/*
Do you really think you're adding anything new to the discussion that
hasn't been said a thousand times already?
*/
Sir, I have been seeing these "discussions" a thousand times already,
too, and I have seen bizarre things like "100% Windows Assembler
Programmers" and what not. But this discussion amuses me, and quite
honestly I don't have better things to do at this time, as I am
travelling the world with my laptop and credit card. It is not a waste
of time in the sense that it is a good way to pass time.
You think, that I think, that I am adding a lot of new technology and
research innovation to the topic? You think I am so stupid that I
believe *I* am creating something new, when I am merely reverse
engineering asm source into C++? Excuse me, but either you must think I
am an idiot, or you are just trying to teach me something, like, a
lesson? As a rational guy, those are the things that spring to mind
first. I could be wrong, which wouldn't be the first time in history..
but either way I'm going to force myself to think you are simply much
more experienced than I am. No problem there, really!
>Do you really think this is
>new territory around here (or anywhere else)?
Now that you mention it, please give me references on the topic, as I
am curious what conclusions the previous thousand similar debates came
to. Let me guess: either brilliant new techniques were invented, or
everyone agreed that, shit, it doesn't make a frigging difference! (I
am inclined towards the latter, but that's just me.)
/*
The better question for
you to ask is "why bother responding to a thread that is probably the
#1 FAQ in a newsgroup such as this one?"
*/
You seem to know the answer to that one, as you asked it of yourself
and then promptly proceeded to reply to this thread. No, wait, you
aren't even interested in the topic and reply anyway -- why would you
do that? (I'm assuming you're not interested, as you give the
impression that this has been beaten to death a thousand times already,
etc... if I'm wrong and you do have an interest, my bad.)
> Do you think you're the first to make this suggestion?
Well, in this thread, yes. You *thought* *you* were the first to make
this suggestion in this thread? Yes, you did.
> Do you think my response is directed specifically at you?
Of course not, since this is a public forum.. however, when you quote
me, it is still a comment on something I wrote, no? Or what's your
point? That I don't know how to discuss on Usenet? Maybe I don't,
so?
/*
a comment to anyone reading this thread who thinks that there might be
something to what's being said here.
*/
That's fair; I agree with that without any problem whatsoever. Not that
you need my approval, but I'm letting you know anyway.
/*
The truth is, *memory architecture* is the biggest impediment to memory
scanning algorithms like strlen. Even if *every* (x86) CPU had exactly
*/
That's why I was confident that no assembler tomfoolery would be
substantially more efficient. In fact, if you go and scan 64 bits per
iteration with MMX you still get roughly the same performance. The
"application" is bandwidth-limited; there is no way around that.
You asked me before if I think I am bringing some amazing new
innovation to the table. I return the question to you.
>Who said anything about lazy evaluation? And even if we *do* use lazy
>evaluation, it's still a *whole* lot cheaper than running strlen
>everytime we want the length, no?
I did. In the first reply to this thread, I believe (why bother even
knowing what my stance on the issue is before jumping the gun, right?).
The alternative I had in mind was calling strlen() every time a string
was created from a zstring, not every time the length was queried, so
no, wrong assumption.
>expensive than recomputing it each time, I'm not sure I can think of
>any off-hand.
Me neither; hence, I store the length. std::string, Pascal, etc. all
seem to come to the same conclusion, so no, I don't think I am "alone"
with this "information" or presenting anything groundbreaking.
I was mostly interested in dispelling the "assembler myth"; you seem to
be more interested in setting the record straight about the
"misinformation" I have been spreading (or whatever -- wasting my own
time, others' time... bandwidth... storage.. lowering the
signal-to-noise ratio, all of the above :).
/*
hat being said, all you're
doing is saying that *some* HLL implementations beat *some* assembly
optimizations. Do you really think this is something new?
*/
In general? Nope. When replying to Hutch? Yeah, actually, I did.
So, something new to whom? To you? I think not. To me? I think not.
/*
for a *different* processor, what do you expect? And, of course,
CPU-dependent HLL code that does a decent job is going to walk all over
some sample code that was written by someone who (1) isn't very good,
(2) doesn't understand the characteristics of the CPU you're running
with, or (3) both.
*/
pssst... the assembly output from the compiler for the inner loop was
*identical* to the original assembly code. With the code being the same
no matter what x86 implementation runs it, I am not surprised the
timings are nearly the same, too; the differences come mostly from the
constant overhead of the code at the end.
> Why are you wasting your time? :-)
I got time!
> 1996 was the same as it is today.
Mine in 2005 is that it doesn't make much of a difference to me; the
code snippet could have been anything under the sun. It doesn't make
any difference to me what it was, I was mainly keen on the ASM vs HLL
aspect.
/*
Unfortunately, your plan doesn't scale up very well. The problem with
C++ is *not* that you can't write efficient code if you're *very*
careful and consider the code the compiler is emitting (and adjust your
C++ source code appropriately). The problem is that no one writes
anything but trivial little (and often non-portable) code this way.
*/
That mostly applies to instances where I expect to reuse the code and
don't want the worst possible runtime characteristics to be easily
invoked. If possible by design, not at all.
/*
IOW, C++ really doesn't have much benefit over assembly when you go to
all this trouble other than it might be able to run on different CPUs
(but you often lose the optimizations when you do this).
*/
A list of platforms I am working on is omitted here, because you don't
give a rat's ass.
/*
I have no doubt that you can take a trivially small function like
strlen and coerce your HLL code to emit stuff about as good as a
hand-optimized assembly version of the same code. But as I mentioned
earlier, this trick doesn't scale up to larger programs. From a
*/
Nor should it.
/*
And the tricks you pull to get the good code emission often won't carry
over to other
architectures (or even different CPUs in the same family).
*/
Which tricks might those be? I used addition, subtraction, bitwise or,
bitwise and, and other pretty fundamental operations, and a very small
number of local variables; the compiler would have to work really hard
to need more than 8 registers to hold all that. Unless we go and
compile the code for a Commodore 64, or maybe something Z80-based, we
won't run out of registers for such a trivial function very easily, at
least.
I mention 8 above because it takes some effort to think of a 32-bit
architecture, for example, that would have fewer registers and that I
might actually be compiling for someday.
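For the record, the kind of code I am talking about boils down to
something like this sketch (not the actual xstrlen from the repository;
it assumes a 32-bit unsigned type and plays a bit loose with strict
aliasing, as this kind of code tends to):
#include <cstddef>
#include <cstdint>
std::size_t xstrlen_sketch(const char* s)
{
    const char* p = s;
    // Walk byte by byte until the pointer is 4-byte aligned.
    while (reinterpret_cast<std::uintptr_t>(p) & 3u)
    {
        if (*p == 0)
            return static_cast<std::size_t>(p - s);
        ++p;
    }
    // Scan four bytes per iteration; the expression below is nonzero
    // exactly when one of the four bytes in v is zero.
    const std::uint32_t* w = reinterpret_cast<const std::uint32_t*>(p);
    for (;;)
    {
        std::uint32_t v = *w;
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;
        ++w;
    }
    // The terminator is somewhere in this word; finish byte by byte.
    p = reinterpret_cast<const char*>(w);
    while (*p)
        ++p;
    return static_cast<std::size_t>(p - s);
}
Only addition, subtraction, bitwise and/or and a couple of locals,
which was the point above.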
/*
What happens when you compile this on a 16-bit CPU?
*/
God forbid, or an 8-bit CPU! There are practical limitations on what I
assume the systems I will use the code on will have. I don't assume
this will work very well on PDPs either!
/*
On a 64-bit CPU? Maybe the code works fine on various 32-bit CPUs, but
the
optimization hardly portable across different CPUs.
*/
It does, on some. MIPS, PPC and x86-64 spring to mind. If there is a
64-bit CPU where sizeof(int) == sizeof(long long) == 8 and char is 8
bits (sizeof(char) == 1, always), then it won't work; the configuration
will be unknown or not supported, and the headers will fail compilation
at the #error.
If it is just one or two functions that fail, as in this case, I have
to put an #ifdef / #endif kludge there to always use the "char*"
version; if that doesn't work either, then there's no support. So far
the codebase has been very useful, though.
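As a sketch, the guard amounts to something like this (the macro names
are illustrative, not the ones in my headers):
#include <climits>
#if CHAR_BIT != 8
#error "configuration unknown or not supported"
#endif
#if UINT_MAX == 0xFFFFFFFFu
#define XSTR_HAVE_DWORD_SCAN 1   /* 32-bit unsigned int: dword version is usable */
#else
#define XSTR_HAVE_DWORD_SCAN 0   /* fall back to the plain char* loop */
#endif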
I would surmise the code is an order of magnitude more useful than an
x86-specific assembly snippet.
> So you're proof is *one* example versus another?
It's not proof, it's an example. Here is the post I replied to; maybe
you should, after all, read what is being discussed?
"Don't be overawed by compilers, assembler coding is not restricted to
the architecture of a C compiler. The following code is a modification
of Agner Fog's DWORD string length routine that aligns the start and
tests the length 4 bytes at a time. It has no stack frame and conforms
to the normal register preservation rules under windows so it preserves
ESI and EDI but trashes the rest. "
It clearly states that such optimization is exclusive to assembly,
which apparently isn't the case. Now you know the *context* of the
discussion, at least.
>Hmmm... Hardly convincing.
Convincing enough to debunk the implication that such optimization is
only achievable through the holy assembly.
>I can provide you with a whole slew of strlen programs in assembly that
>run much slower than:
>
>t = s;
>while( *s ) ++s;
>return s-t;
Just one is plenty, please do by all means.
/*
Indeed, if compilers are *so* great, why can't they convert this code
into something as wonderful as what you've presented? After all, doing
so is really just an induction step (albeit, a complex one).
*/
Now it is all of a sudden "wonderful", do I sense sarcasm?
>C provides no access to low-level machine facilities. Therefore, it is
>not an assembly language. It may be lower-level than languages like C++
>or Java, that does not make it an assembly language. In order for it to
>be an assembly language, you need access to all the features of the
>CPU.
Oh, gee-whizz, you found something to pick on, good for you! Maybe I
need to clarify my stance on this?
What I mean is that when you write assembly you generally use the
fully qualified register names. You maintain the register names
manually -- a laborious and error-prone process. Enter ANSI C. You can
write your intention with named variables, which are then translated at
compilation and assigned to real registers (add spilling for flavour in
this so-called register allocation stage). And on and on.
Because most microarchitectures are different, the pragmatic approach
taken is to find a common subset of operations the language supports. A
no-brainer, as you well know; you just wanted to nit-pick, well, good
job! Congrats!
> which makes it a jack of all trades but master of none.
As you are quoting this, I hope you also read it! Guess what!? The
above quote means the same thing, just without all the flair and
nitpicking going about!
> You make this claim with just one example?
Well, mostly the claim was based on 10+ years of professional
experience (and nearly 20 years of programming, total) and the opinion
that comes with that. I'm sure you also have a lot of experience, so
you know what I am talking about.
>Gee, I'd argue that it's going to be real hard for an assembly language
>programmer to beat the code that a C compiler produces for the
>following:
>
>i = 0;
Okay. Gee, that can be completely eliminated if the result isn't ever
used in the current scope, and as I don't even see a function call, any
possible side effect can easily be determined to be non-existent in
this case. It's random whether an assembly coder will "see" this or
not; it is more deterministic whether a compiler will see it or not.
But if you don't know the compiler in advance, then it isn't.
I dunno what to make of that. Was that a kind of ridiculous example to
show my actions in a "different light", so to speak? If so, ummmm...
right.
>That doesn't prove C compilers are as good as assembly programmers by
>any stretch of the imagination. You're example is a bit more complex,
>but nowhere near sufficient to "prove" the point.
C compilers aren't better than assembly programmers, they are just more
time- and money-efficient. When there isn't a choice, there isn't a
choice; ask Tom Duff.
But that wasn't my point, even though you seem to be under that
delusion... I was showing how that particular assembly code snippet
doesn't "beat" HLL code, not the other way around.. a subtle
difference... maybe too subtle if you haven't even read the thread...
>for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
>
>And they have no idea what the compiler is doing with their code. Take,
>for example, that innocuous "si++" at the end of the for argument list.
++i vs. i++, gee-whizz, now we're getting to the ABC's and 101's of C++
programming.. and you blame me for going too basic? ;-)
Yeah yeah, i++ creates a temporary object because it has to return the
*current* value; before returning from the ++ operator (postfix) we
have to increment the current value, so we cannot return it directly..
we return a temporary object created before the increment.
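In code, roughly (an illustrative iterator, not any particular class):
// Why postfix ++ costs more for class types: the copy is the
// temporary object being talked about above.
struct char_iter
{
    const char* p;
    char_iter& operator++()      // prefix: advance, return *this
    {
        ++p;
        return *this;
    }
    char_iter operator++(int)    // postfix: copy, advance, return the copy
    {
        char_iter old = *this;
        ++p;
        return old;
    }
};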
/*
someone who knows exactly what's going on behind the scenes probably
wouldn't write code this way, but how often do you see people writing
standard C++ programs the way you wrote your xstrlen?
*/
I wouldn't know; I suppose I've been in a professional community for
far too long. My attitude isn't professional as I am a bit childish, as
you might have noticed.. but that's my problem, thank you for not
making a funny remark about that in advance.
> Do you honestly write *all* your C++ code that way?
I need clarification on this: what do you mean "that way"? What
specifically strikes you as odd about "that way"-- I don't get it. Yes,
I do write code "that way" a lot of the time, it comes from the
backbone. Is it that bad? If so, show me the error in my ways and I'll
learn.
What took me so much effort was that first I reverse engineered the asm
snippet, but I wasn't happy with it, as I would *never* actually go out
and write code like that off the bat. I have some idiosyncrasies, I
admit, which I follow as I have found them sound practice, and I keep
myself trim and up-to-date on what works and what doesn't.
For example,
while ( x-- ) { ... }
vs.
do { ... } while ( --x );
vs.
for ( ; x<xmax; ++x ) { ... }
And so on and so on. I assume very little, I check, but I still keep in
mind the semantics and how they can be expected to behave when thinking
serially about how the IP moves over the code. For example, the first
while loop might not do what's expected if x has a certain value ;) ;)
the do-while is always executed at least ONCE, which might be a Bad
Thing.. the for-loop is more equivalent to the while than to the
do-while if you want to look at the pass-through-branch aspect, and so
on... there is no end to these things, really.
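A tiny illustration of the first point, assuming x is unsigned (purely
illustrative):
void loop_pitfall()
{
    unsigned int x = 0;
    while ( x-- )        // the test sees 0, so the body is skipped...
    {
        /* never entered */
    }
    // ...but the post-decrement has wrapped x around to a huge value here.
    // do { ... } while ( --x ); with x == 0 would be worse still: the body
    // runs once and the loop then spins for roughly 4 billion iterations.
}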
> Or do you just write code that way
>when you're trying to prove that C++ compilers can emit code that's as
>good as assembly programmers? And when you *do* write code that way, is
>it any faster or easier than using assembly?
Well, shit, www.liimatta.org, go to the Fusion page, download the
"latest version", decompress the source code. The source code is 750 kB
compressed (with some minor data inside); feel free to go through every
line if you have to.
And no, I don't write code "that way" to be faster than assembly.
That's "the way" I write code. I don't know if you are trying to insult
me, be polite or just be sceptical. Whatever, dude, if you don't have
the time or the will to verify what I write here, fine, I don't care
what you think about me. But that's some work I've been doing. Want my
resume? I don't have one. I always have job offers in my inbox and it's
been that way since 1996 or so.
>Bottom line is that most C++ programmers would just write:
>
>t = s;
>while( *s ) ++s;
>return s-t;
>
>(or something similar) and move on.
Guess what? That's precisely what I wrote, too, and moved on. It wasn't
until recently, I think October 24, 2005, that I, for fun, wrote the
C++ version, and as it proved to be alright I added it to the source
repository. I wrote regression tests only because the code was added
and I don't take changes lightly; otherwise I'd have moved on again. In
fact, I have moved on already; what is left now is this discussion..
/*
clue what's going on. Before you get in a tiff, I *do* realize that
*you* probably do know what's going on. But you don't write all the
world's HLL code.
*/
Most of the world's HLL code doesn't need to be "fast"; most of the
time I would be glad if it just "worked", which it generally does, if
not before a patch or two then at least after.
"Fast" is not a goal, "fast enough" is. If the code is "fast enough",
that's it, job done. I've seen a guy optimizing keyboard interrupt
handlers in assembler for MS-DOS; maybe he thought someone would press
keys really, really fast and that would slow his program down, go
figure. Or maybe he was scared that he would miss a few keystrokes;
again, go figure. Such strange characters are not my specialty (you may
say myself excluded... ?=)
>Yes, you do not have access to the low-level machine. As I said, C is
>not an assembly language. Believe me, you don't have access to a *lot*
>of things that might be useful on occasion.
I don't have to believe YOU, I believe my own EXPERIENCE.
>The #2 thread (after strlen) is memcpy. My alternative is to simply use
>the movsb instruction.
Are you trying to insult my intelligence? Look at string.hpp, you
might see std::memcpy() being invoked here and there. I don't even
*consider* the alternative!
If you see meta::vcopy(), it is a different beast: it checks whether
the type is a POD (using traits) and does either a memcpy or an
object-by-object copy, so that the corresponding copy constructors and
whatnot are invoked correctly in the process. Mostly I use that
construct with templates.
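The idea, as a sketch (this is not the actual meta::vcopy, just the
dispatch pattern it uses; std::is_trivially_copyable stands in for the
real trait):
#include <cstddef>
#include <cstring>
#include <type_traits>
template <typename T>
void vcopy_impl(T* dest, const T* src, std::size_t n, std::true_type)
{
    std::memcpy(dest, src, n * sizeof(T));   // raw byte copy is enough
}
template <typename T>
void vcopy_impl(T* dest, const T* src, std::size_t n, std::false_type)
{
    for (std::size_t i = 0; i < n; ++i)      // user-defined copy semantics run
        dest[i] = src[i];
}
template <typename T>
void vcopy_sketch(T* dest, const T* src, std::size_t n)
{
    vcopy_impl(dest, src, n, std::is_trivially_copyable<T>());
}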
>> Also, I could unroll the C++ innerloop but I won't do it because I
>> think it is not a particularly good idea in this case.
>
>That depends entirely on the CPU and memory architecture.
Context: competing with the specifically mentioned assembly code, which
was unrolled. If you take that into consideration you have not so much
to nitpick about.
Since it is x86 assembly, I'm assuming some contemporary x86
implementation will be running the code. I don't think unrolling the
C++ code will do much good. Maybe on a 386 or older processor it might
pay off, hell, most likely it would. But I'm not too interested in the
386 these days...
While I am certainly impressed with the capacity to write portable code
that comes near the performance of hand written assembler, on the major
Windows platform, portable code is of little value when so much code is
specific to the actual OS.
I have seen people write portable ANSI C style code while running
DirectX or Windows-specific multimedia functionality, which is simply a
waste of time. I lost my taste for portability when Microsoft dropped
OS/2 support and the time I wasted learning some of this stuff was
never recovered.
If I was writing code for the never-ending flavours of Linux or
something solid like FreeBSD, it would be in portable ANSI C (not++)
but as I write primarily for current 32 bit Windows, such portability
is a waste of time.
Something that makes me laugh is the assumption that assembler is
written only for immediate code problems, where historically it has
regularly been used for reusable libraries, which generally justifies
the extra time spent writing it. Instead of cobbling together something
on the spur of the moment, a reusable library of commonly used
procedures works fine, and assembler is an excellent tool in this area.
Something that also makes me chuckle is the assumption that if you wait
long enough, someone will one day create a C compiler that can output
code in the same class as hand written assembler, but after years of
hearing this nonsense, people who make the effort still outclass a
compiler because there is more to code design than robot software
output.
With Windows still controlling something like 90% of the desktop
market, considerations of what runs on PPC or MIPS or SPARC or Solaris
fit into the category of "who cares" in most instances.
> pxor mm0,mm0
> xloop:
> pcmpeqb mm0,[esi]
> pmovmskb eax,mm0
> pxor mm0,mm0
> add esi,8
> // ...
The basic idea is a good one, but as only about 5% of the computers
around the world could run it, it's a moot point, whereas the two algos
I have posted will run on a 486 running OEM Win95. I would certainly
like the luxury of being able to write PIV code with SSE3, but with the
sheer volume of older computers still running, it will not be any time
soon that this will happen.
One of the good things about 64-bit x86 getting going in the next few
years is that there will be a far more modern instruction set where
10-year-old compatibility is not an anchor around your neck.
Most of the code isn't OS specific unless the application is really
trivial.
>I have seen people write portable ansi C style code while running
>directX or Windows specific mutimedia functionality which is simply a
There is nothing wrong with that either, I do this all the time in
DirectX applications:
surface* pic = surface::create("data.zip/test.jpg");
The same code works with OpenGL, X11, etc.. on Linux, Windows, etc. I
don't find that strange at all. I also keep the rendering module not
too tightly integrated with the rest of the code; I usually use
components and build applications using libraries, some of which, heck,
most of which are portable.
/*
something solid like FreeBSD, it would be in portable ANSI C (not++)
but as I write primarily for current 32 bit Windows, such portability
is a waste of time.
*/
That's what you do; I don't write for a specific OS per se, I have
worked over the years on a wide range of software across hardware and
platforms. This includes mobile graphics chips, game software
(published), graphics chip technology demos for chip vendors.. I'm not
locked to Windows, so I don't share the same opinion.
/*
Something that makes me laugh is the assumption that assembler is
written only for immediate code problems where historicaly it has
regularly been used for reusable libraries which generally justifies
*/
You remind me of my young and wild days..
/*
Something that also makes me chuckle is the assumption that if you wait
long enough, someone one day will create a C compiler that can output
code in the same class as hand written assembler but after years of
*/
Wait a second, champ, I never said that. I am advocating the thought
that resorting to assembly as the *first* thing is folly. Resorting to
it when there is a need isn't.
/*
hearing this nonsense, people who make the effort still outclass a
compiler because there is more to code design that robot software
output.
*/
Of course. But only when it pays off in some way, makes a difference to
real-world software. strlen() is a good example, where it doesn't make
a diddly-doo's difference to a real program's performance in most
real-world, production software. I only took it upon myself to write
the C++ version to validate that my theory, which is forged by years of
practice, is still correct. It still is.
/*
With Windows still controlling somthing like 90% of the desktop market,
considerations of what runs on a PPC or MIPS or SPARC or SOLARIS fits
into the category of "who cares" in most instances.
*/
That depends on who you are writing software for. If you write it for
yourself, okay. If it is freeware, open source.. who cares where it
runs besides the author, or those who contribute. If it is an
application written for a customer, usually they dictate the terms. I
write embedded stuff, so platforms change. Sometimes someone wants to
put the computations onto a cluster. Sometimes... the point is I don't
mind that, I write for wherever the software is needed. That's what I
do.
/*
The basic idea is a good one but as about 5% of computers around the
world could run it, its a mute point where the two algos I have posted
will run on a 486 running OEM win95. I would certainly like the luxury
*/
My basic idea was: why not take advantage of the latest instruction set
extensions when they are *available*? When they are detected, use them
and have faster software, which I think was the WHOLE POINT to begin
with when writing in assembly, or did I miss something? You write in
assembly just because...? What?
If there is no SSE, MMX, whatnot, fall back to generic x86 code that
works down to the 386, no problem?
If your point is to make the fastest possible code and you take the
pains to write the code in assembly, but then ignore the latest
instructions available on the x86 platform, what's wrong with that
picture?
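As a sketch of what I mean by detect-and-fall-back (cpu_has_mmx() is a
placeholder for a real CPUID check, and the "MMX" routine here is only
a stand-in):
typedef unsigned int (*strlen_fn)(const char*);
static unsigned int strlen_generic(const char* s)   // 386-safe byte loop
{
    const char* p = s;
    while (*p)
        ++p;
    return static_cast<unsigned int>(p - s);
}
static unsigned int strlen_mmx(const char* s)       // stand-in only
{
    // a real build would scan 8 bytes per iteration with pcmpeqb here
    return strlen_generic(s);
}
static bool cpu_has_mmx()                           // placeholder
{
    return false;   // real code would execute CPUID and test the MMX bit
}
// selected once, then every call goes through the pointer
static const strlen_fn fast_strlen = cpu_has_mmx() ? strlen_mmx
                                                   : strlen_generic;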
/*
of being able to write PIV code with SSE3 but with the sheer volume of
older computers still running, it will not be any time soon that this
will happen.
*/
It depends on what your audience is. If you write games for the
commercial market, for instance, go and take a look at the minimum
requirements on the game boxes. Most games don't run on a 386 anymore.
Just an example of one market segment.
I don't know why you would still want to support the 386 while writing
*Windows* software in 2005.
/*
One of the good things when 64 bit x86 gets going in the next few years
is that there will be a far more modern instruction set where 10 year
old compatibility is not an anchor around your neck.
*/
I don't see how that is supposed to fit your point of view where
MMX/SSE is already too modern. Explain?
The problem is that you then generalize your results and say "see,
using assembly isn't worthwhile!" You understand the fallacy here,
right?
> with this "information" or presenting anything groundbreaking.
>
> I was mostly interested in dispelling the "assembler myth",
And that's the argument that I'm not buying. The fact that in some
controlled situation you can cajole a C++ compiler into producing code
about as optimal as one can expect does not imply that the compiler
will do this all the time. You are dispelling no myth, I'm afraid.
>
> /*
> hat being said, all you're
> doing is saying that *some* HLL implementations beat *some* assembly
> optimizations. Do you really think this is something new?
> */
>
> In general? Nope. When replying to Hutch? Yeah, I did, actually.
Yet earlier in your post you talk about the results on your machine vs.
the results on Hutch's machine. Exactly how are you dispelling any
myths here? Each machine requires an independent optimization. The fact
that an optimization on Hutch's machine isn't as valid on your machine
should prove to be no surprise here. It's one of the main reasons I
quit "counting cycles" when the Pentium first arrived -- there's no
sense in it anymore.
>
> /*
> for a *different* processor, what do you expect? And, of course,
> CPU-dependent HLL code that does a decent job is going to walk all over
> some sample code that was written by someone who (1) isn't very good,
> (2) doesn't understand the characteristics of the CPU you're running
> with, or (3) both.
> */
>
> pssst... the assembly output from compiler for the innerloop was
> *identical* to the original assembly code. That sort of means no matter
> what x86 implementation the code being the same I am not surprised the
> timings being nearly same, too, the differences come from the constant
> overhead mostly from the code at the end.
And how does this dispel any myths? No one is claiming that a compiler
*never* produces code that could be as good as a human's code,
particularly for short code sequences. Just that as the programs get
larger, the compilers tend to fall flat on their faces. Again, it's the
issue of "brilliant code sometimes" plus "bonehead code other times"
equalling mediocre code overall. Sure, you can "reverse engineer" an
assembly algorithm in C++ (a perfectly fair thing to do), but how often
do you see this in practice? And is the result any better (readable,
understandable, maintainable, robust, etc.) than the corresponding
assembly code? How many people, for example, will find the C++ strlen
function you've written to be any more understandable than the assembly
version (from an algorithmic point of view, obviously)?
>
> /*
> Unfortunately, your plan doesn't scale up very well. The problem with
> C++ is *not* that you can't write efficient code if you're *very*
> careful and consider the code the compiler is emitting (and adjust your
> C++ source code appropriately). The problem is that no one writes
> anything but trivial little (and often non-portable) code this way.
> */
>
> Mostly applies for instances where I expect to reuse the code and don't
> want worst possible case runtime characteristics to be easily invoked.
> If possible by design, not at all.
Ultimately, the way to write faster programs is to *skip* the C
mentality. IOW, if you want really fast programs that manipulate
strings, you don't get in the habit of using C standard library
routines (regardless of how well they are implemented). This is the HLL
trap, not the lack of compiler code-generation quality. If the compiler
produced the best possible code that could be generated and you turned
around and did things like "strlen" or any of a host of other stdlib
functions to achieve your goals, you'd wind up with slower running code
than would be possible if you were completely aware of what was going
on in the program. C++ (and the STL) take this problem to a new
extreme. It's so easy to use things like sets, lists, maps, or other
containers that people do so without thinking about the costs
associated with them. Even in plain C, you get performance problems
when people do things like:
strcpy( a, b );
strcat( a, c );
strcat( a, d );
The problem, of course, is that you wind up computing the length of the
strings over and over again when it is completely unnecessary (as each
string function call above internally produces a pointer to the end of
the string that could be used by the next function call). This may seem
like a trivial example, and easily worked around, but it's typical of
the kinds of problems that sap performance out of HLL programs; it's
also the kind of thing that you don't see in low-level assembly
programs because the programmer sees what is going on when writing
their code (by "low-level" assembly, I mean that you're not simply
writing the code with a HLL mindset and calling these same sorts of
functions from your assembly code).
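To make that concrete, the obvious workaround is to keep the end
pointer around instead of rescanning; a quick sketch (the helper is
hypothetical, not a stdlib routine):
#include <cstddef>
#include <cstring>
char* append(char* end, const char* src)   // returns the new end of the string
{
    std::size_t n = std::strlen(src);
    std::memcpy(end, src, n + 1);          // copy the terminator as well
    return end + n;
}
// replacing the strcpy/strcat chain above:
//   char* end = append(a, b);
//   end = append(end, c);
//   end = append(end, d);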
>
> /*
> IOW, C++ really doesn't have much benefit over assembly when you go to
> all this trouble other than it might be able to run on different CPUs
> (but you often lose the optimizations when you do this).
> */
>
> A list of platforms I am working on omitted here, because you don't
> give a rat's ass.
When you put it that way, I guess I have to agree with you.
Precisely my point. The optimizations are not portable. For this
particular example, you're limited to 32-bit processors.
>
> /*
> On a 64-bit CPU? Maybe the code works fine on various 32-bit CPUs, but
> the
> optimization hardly portable across different CPUs.
> */
>
> It does, on some. MIPS, PPC and x86-86 spring to mind. If there is
> 64-bit CPU where sizeof(int) == sizeof(long long) == 8, where char is 8
> bits (sizeof(char) == 1, always) then it won't work and the
> configuration will be unknown or not supported, and the headers will
> fail compilation at #error ).
IOW, the "trick" isn't portable and you're suffering from some of the
same problems as assembly language.
>
> If it is just one or two functions that fail, such as this case, have
> to put #ifdef / #endif kludge there to always use the "char*" version,
> if that doesn't work either then no support. So far the codebase has
> been very useful, though.
Sure, and we can write multiple assembly routines for different
processors, too. Granted, more than you'd need with C (or any other
HLL), but the idea is just the same.
>
> I would surmise the code is order of magnitude more useful than x86
> specific assembly snip.
By what reasoning?
Given that about 90% of the world's computers today are x86 CPUs, I
don't see how having the code in portable C++ is going to make it an
order of magnitude more useful. Certainly we can find *some* people
for whom portability to other processors is necessary, but an order of
magnitude (IOW, they need to run the code on ten different processors)?
I don't question your claim; from a mathematical perspective I'm sure
we could find a group of people amongst whom the need to have a
portable strlen function that compiles on 10 different 32-bit (non-x86)
processors is important, but...
When you look at the number of people (end users) who will actually
benefit from the code, however, it becomes real clear that the choice
of HLL or x86 assembly is *mostly* irrelevant because most end-users
are running x86 boxes.
>
>
> > So you're proof is *one* example versus another?
>
> It's not proof, it's an example. Here is the post I replied to, maybe
> you should afterall read what is being discussed?
You are "debunking a myth". You don't debunk a myth (that is, prove
your point) with one example. You might be able to prove that assembly
isn't *always* better with one example, buy you cannot make a claim
that there is no need to use assembly language on the basis of one
example.
>
> "Don't be overawed by compilers, assembler coding is not restricted to
> the architecture of a C compiler. The following code is a modification
> of Agner Fog's DWORD string length routine that aligns the start and
> tests the length 4 bytes at a time. It has no stack frame and conforms
> to the normal register preservation rules under windows so it preserves
> ESI and EDI but trashes the rest. "
>
> It is clearly stating that such optimization is exclusive to assembly,
> which apparently isn't the case. Now you know the *context* of the
> discussion, atleast.
>
> >Hmmm... Hardly convincing.
>
> Convincing enough to debunk the implication that such optimization is
> only achievable though the holy assembly.
And you're assuming that Hutch got it right? That his example is the
absolute pinnacle of what can be done in assembly?
>
> >I can provide you with a whole slew of strlen programs in assembly that
> >run much slower than:
> >
> >t = s;
> >while( *s ) ++s;
> >return s-t;
>
> Just one is plenty, please do by all means.
mov edi, src       ; EDI -> start of the string
mov ecx, -1        ; maximum scan count
mov al, 0          ; byte to search for (the terminator)
repne scasb        ; scan forward until [edi] == al
That will probably run slower than the output of a good compiler for
the above C code. Yes, scasb is *that* bad.
>
> /*
> Indeed, if compilers are *so* great, why can't they convert this code
> into something as wonderful as what you've presented? After all, doing
> so is really just an induction step (albeit, a complex one).
> */
>
> Now it is all of a sudden "wonderful", do I sense sarcasm?
Sarcasm or not, it's quite clear that a compiler working on your code
produces a faster result than one might expect from a compiler working
on the simple C code above.
> What I mean, is, that when you write assembly you generally use the
> fully qualified register names. You maintain register names manually.
Yes, one advantage of using assembly is that you have complete access
to the low-level machine facilities, including the registers.
> Labourous and error prone process.
Changing the subject?
I'm certainly not questioning the fact that, in general, writing
assembly language is more "laborious and error-prone" than writing in a
HLL. OTOH, I'll also point out that if you write your HLL code the way
you've written that xstrlen function, you will find writing the code
fairly "labourous and error prone". Optimization is a painful process,
regardless of the language.
> Enter ANSI C. You can write your
> intention with named variables, which are then at compilation
> translated and assigned to real registers (add spilling for flavour in
> this so.-called register allocation stage). And on and on.
And sometimes the compiler is brilliant when doing this, and sometimes
it is real bone-headed. Your point?
>
> Because most microarchitechtures are different, the pragmatic approach
> taken is to find a common subset of operations the language supports. A
> no-brainer, as you well know, you just wanted to nit-pick, well good
> job! Congrats!
Again, we're back to the argument of "this makes life so much easier
for the programmer" rather than "the compiler does as good a job of
this as the programmer could do himself."
>
> > which makes it a jack of all trades but master of none.
>
> As you quoting this, I hope you also read this! Guess what!? The above
> quote means same thing just without all the flair and nitpicking going
> about!
Sorry, you've lost me along the road somewhere. Perhaps you could be
more articulate. It really seems to me that all you've done here is
switch from "compilers produce code as good as humans do" to "it isn't
cost-effective for humans to write code this way, so we live with what
the compilers produce." A very different argument. But for some reason,
that's where this argument always winds up. I guess that means we've
reached the end of the debate.
>
> > You make this claim with just one example?
>
> Well, mostly the claim was based on 10+ years of professional
> experience (and nearly 20 years of programming, total) and the opinion
> that comes with that. I'm sure you also have a lot of experience, so
> you know what I am talking about.
Well, maybe that's the difference between us. You see, I've got about
25 years' experience as a professional programmer and I've worked both
in the times when assembly was mandatory (to get any kind of
performance at all) and I've been around during the past 15 years when
compilers became efficient enough to be usable for the larger
percentage of projects. Most of my real (professional) work is done in
languages like C, C++, and Delphi. So it's not like I'm unaware of what
these languages are good for. OTOH, I don't go around claiming that
there is no reason to use assembly because compilers today are as good
as humans. I may very well say that it makes *economic* sense to use
HLLs, but it's not the case that compilers are as good as human beings.
>
> >Gee, I'd argue that it's going to be real hard for an assembly language
> >programmer to beat the code that a C compiler produces for the
> >following:
> >
> >i = 0;
>
> Okay. Gee, that can be completely eliminated if the result isn't ever
> used in the current scope, as I don't even see function call so any
> possible side effect can easily be determined to be non-existent in
> this case. It's random if assembler coder will "see" this or not, it is
> more deterministic if a compiler will see this or not. But if don't
> know the compiler in advance, then, it isn't.
Touche.
>
> I dunno what to make of that. Was that a kind of ridiculous example to
> show my actions in a "different light", so to speak? If so, ummmm...
> right.
The point I'm making is that xstrlen is a ridiculous example to use to
debunk the myth that assembly isn't useful. Hutch may have overspoken,
but your attempts to show that assembly isn't needed for this task
aren't quite making the point. xstrlen is actually one of the easier
things to code efficiently in C. Just like "i=0;" is pretty easy to
code efficiently in C. If you *really* want to debunk the myth that
assembly has no advantage over HLLs, you need to move beyond strlen.
As an aside:
A few years back (okay, maybe decades at this point) some research at
Berkeley, or thereabouts, demonstrated that most strings processed in
HLL programs (written by students, granted) were 10 characters or less
in length. If that still holds today, it's almost a no-brainer that the
trivial strlen function (byte at a time) will outperform the craziness
embodied in the examples in this thread, because of the intrinsic
overhead. Sure, you can feed your code thousands of long strings and
demonstrate how much better one algorithm is than another, but the
bottom line is that in the real world, the data sets in use may
completely invalidate the test set you're using. IOW, how well does
your test data model the real world data that an average program will
see? This is one reason why I argue that xstrlen is a ridiculous
example.
So when Hutch talks about saving all the function setup and tear-down
code, this is not an insignificant matter. It reduces the overhead of
the function call, thus vastly improving performance for small strings
(which this older research suggests is a common situation). Now the
truth is, some compilers can generate code that doesn't require setting
up and tearing down stack frames too, so Steve's proclamations aren't
all *that* impressive, but for the common case, reducing function call
overhead can produce dramatic results (assuming, again, that short
strings are common).
>
> >That doesn't prove C compilers are as good as assembly programmers by
> >any stretch of the imagination. You're example is a bit more complex,
> >but nowhere near sufficient to "prove" the point.
>
> C compilers aren't better than assembly programmers, they are just more
> time and money -efficient. When there isn't choise, there isn't choise,
> ask Tom Duff.
Again, that's a different argument. Few people question the economic
aspects. Then again, if people wrote C code the way xstrlen has been
written, the economic advantage of C over assembly would be greatly
diminished. Again, *optimization* is an expensive process, regardless
of the language used. Assembly generally has a bad reputation in terms
of programmer efficiency because people who write assembly code tend to
write more (locally) optimized code than those working in HLLs. Ergo,
it's more expensive. If you write assembly code without regard to
minimizing resource use, then it's far less costly to use assembly.
>
> But that wasn't my point, even though you seem to have that
> disillusion... I was showing how that particular assembly code snip
> doesn't "beat" HLL code, not the other way around.. a subtle
> difference...maybe too subtle if haven't even read the thread...
Oh, I've read it. That's not what you said earlier. But I'll allow you
to back out of that gracefully. This is, after all, USENET and we have
to allow for considerable "unstateds" and "misreads".
>
> >for( myclass::iterator si = s.begin(); si != s.end(); si++ ) {...}
> >
> >And they have no idea what the compiler is doing with their code. Take,
> >for example, that innocuous "si++" at the end of the for argument list.
>
> ++i vs. i++, gee-whizz, now we're getting to the ABC's and 101's of C++
> programming.. and you blame me for going too basic? ;-)
Amazing, isn't it? Something so *basic* trips up 99% of the programs
out there. Exactly the point I'm making. You won't see mistakes like
this made in a typical assembly program.
>
> Yeah yeah, i++ creates temporary object because it has to return the
> *current* value, before returning from ++ operator (postfix) we have to
> increase the current value, we cannot return it.. so we return
> temporary object created before the increment.
>
> /*
> someone who knows exactly what's going on behind the scenes probably
> wouldn't write code this way, but how often do you see people writing
> standard C++ programs the way you wrote your xstrlen?
> */
>
> I wouldn't know, I suppose been in a professional community for far too
> long. My attitude isn't professional as I am a bit childish, you might
> have noticed.. but that's my problem, thank you for not making funny
> remark about that in advance.
And when we look in your code, we'll never see an example of this,
right?
That's the only point I'm making- HLL abstractions, the things that
make it easier and faster for programmers to write code, also hide the
things that can cost them dearly. Even when they've got the experience
to know better.
>
> > Do you honestly write *all* your C++ code that way?
>
> I need clarification on this, what you mean "that way?",
As in the way you've written xstrlen.
> what
> specificly strikes odd in "that way"-- I don't get it, yes, I do write
> code "that way" a lot of times, it comes from the backbone. Is it that
> bad, if so, show me the error in my ways and I'll learn.
>
> What took me so much effort was that first I reverse engineered the asm
> snip, but I wasn't happy with it as I would *never* actually, go out,
> and write code that was off the bat. I got some idiosynchronies, I
> admit, which I follow as I found them a sound practise, and I keep
> myself trim and up-to-date what works, and what doesn't.
Great!
The point I'm making here is that writing code like xstrlen is a good
example of something that gets you into trouble down the road. Written
in assembly, we *expect* to rewrite it for later processors. Written in
C? No, we expect to be able to recompile it and have it work fine, no
matter what comes along. And we curse the guy who wrote C code like
that. Other than "why did this idiot use assembly?", few people would
question the use of that crazy strlen algorithm in assembly; indeed,
they would expect it to be written that way.
>
> > Or do you just write code that way
> >when you're trying to prove that C++ compilers can emit code that's as
> >good as assembly programmers? And when you *do* write code that way, is
> >it any faster or easier than using assembly?
>
> Well, shit, www.liimatta.org, go to Fusion page, download the "latest
> version", decompress the sourcecode. The sourcecode is 750 kB
> compressed (with some minor data inside), feel free to go through every
> line if you have to.
Well, I went through enough lines to know that you don't write your
code the way you wrote xstrlen. Which is *good* from a
readable/maintainable/robustness point of view, but it also means that
someone who writes assembly code to do the same job is generally going
to get much more efficient results. Whether this is important or not is
a different question, of course.
>
> And no, I don't write code "that way" to be faster than assembly.
Of course not. Most people writing C++ code don't write their code to
be faster than assembly. Indeed, "fast" is rarely a factor, other than
fast development or easy development.
> That's "the way" I write code.
And it's not a bad style (though I'd suggest more comments :-) ).
But people who write code that way (and I'm no different) are not
writing their C++ code in a manner that compilers can efficiently
translate into machine code. And if someone were doing the same
operations in assembly, even if they weren't the *greatest* assembly
programmer around, they'd probably produce better output than the
compiler. It all has to do with thinking in assembly language rather
than thinking in a HLL. That's the crucial difference. Your xstrlen
function is a good example of thinking in assembly (even when writing
in C). I've seen lots of assembly code where the author was thinking in
a HLL rather than assembly (and the result isn't very good). But when
someone thinks in assembly, the result is often quite good. This is why
assembly programs are generally better than HLL programs. Assembly
programmers often think in assembly whereas HLL programmers think in
their HLL.
> I don't know if you trying to insult, be
> polite or just being sceptic.
Label me a sceptic. I've heard it *many* times before. And the argument
always boils down to (as this one has) that it's more economical to
develop in a HLL, which is what makes the HLL better. No argument
there. But the economics don't imply that the compilers can do a
better, or even as good a job as the assembly programmer. Sure, in a
few specialized cases, it can. But the results don't scale up to large
systems (for, quite frankly, the same reasons using assembly language
doesn't scale up).
As for insulting, please check your own post. There are a few too many
profanities and inferences on your part for you to be able to play this
card here.
> Whatever, dude, if you don't have time or
> will to verify what I write here, good, I wouldn't care what you think
> about me.
That makes us even. I don't care what I think about you either :-)
> But that's some work I been doing. Want my resume? I don't
> have one. I always have job offers on my inbox and it been that way
> since 1996 or so.
Good for you. You escaped the problems of our industry over the past
four years. But discussions of your experience and how long you've been
employed are not particularly good supporting arguments for your
hypothesis that there is no need to use assembly language because
compilers generate code as good as a hand coder.
>
> >Bottom line is that most C++ programmers would just write:
> >
> >t = s;
> >while( *s ) ++s;
> >return s-t;
> >
> >(or something similar) and move on.
>
> Guess what? That's precisely what I wrote, too, and moved on.
And that's exactly the point I'm making. Most HLL programmers (myself
included) will often write code like this and just move on. Assembly
language programmers (myself included, when working in assembly)
generally *wouldn't* do this. Oh, they might do it on the first pass,
but then they'd see how ugly the result is and decide to do something
about it. Sometimes, particularly with inexperienced programmers, the
ugliness might not be discovered until someone points out that a HLL
call is faster than their assembly gem (witness this thread), sometimes
they can just tell that the solution isn't very good. But the bottom
line is that an assembly language programmer is more prone to do
something about the ugly code rather than thinking "well, that's the
best I can do" and move on. How many C/C++ programmers, for example, do
you think could come up with your xstrlen function on their own?
>
> /*
> clue what's going on. Before you get in a tiff, I *do* realize that
> *you* probably do know what's going on. But you don't write all the
> world's HLL code.
> */
>
> Most of the world's HLL code doens't need to be "fast", most of the
> times I would be glad if it "worked", which it generally does, if not
> before a patch or two atleast after.
And for code that doesn't need to be fast (or small, or otherwise
resource limited) there is no need to use assembly. We can agree on
that.
>
> "fast" is not a goal, "fast enough" is.
Of course. And just as in every other "assembly vs. HLL" thread that
has ever existed, we wind up with "okay, so what if HLL code isn't as
fast as assembly; CPUs are so fast we don't need it to be." The fact
that we may not need all programs to be efficient does not tell us that
compilers are doing as good a job as assembly programmers. It simply
tells us that the CPU manufacturers have been doing a decent job and we
can get away with a lot of sloppiness on the part of the compilers
without it affecting our ability to deliver code that meets performance
specifications.
> If code is "fast enough",
> that's it, job done.
Unfortunately, code often gets used (and reused) in ways the original
programmer (or specification) doesn't expect. How fast is "fast enough"
for the xstrlen function, for example? No doubt, it's great for the
application you're writing today. But how about tomorrow? Some
routines, like generic library routines, should be *as fast as
possible* because there is no way to predict how they will be used. If
they're overkill for a beginning student's "number guessing game" then
that's no big deal, but if they're too slow for a database application,
uh-oh. How many programmers have the time to go in and rewrite the
stdlib when their application runs a little too slowly?
> I seen some guy optimizing keyboard interrupt
> handles in assembler for MS-DOS, maybe he thought someone would press
> keys really, really fast and that would slow his program down, go
> figure. Or maybe he was scared that he would miss a few keystrokes,
> again, go figure. Such strange characters are not my specialty (you may
> say myself excluded... ?=)
Again, one example of idiocy does not imply that all attempts to write
fast code are worthless. And you never know -- it could turn out that
this person you're talking about has a real-time foreground application
that couldn't tolerate more than a few (hundred) microseconds'
interruption.
In which case having a fast keyboard interrupt handler is a *very* good
idea.
Even if that person didn't need the performance for his/her current
app, perhaps the next user of that ISR would. I've got a *little* bit
of experience working in real-time systems, and I can assure you that
in most real-time OSes, minimizing time spent in an ISR is a *very*
critical thing (not that MS-DOS qualifies in this respect, but you get
the idea).
>
>
> >Yes, you do not have access to the low-level machine. As I said, C is
> >not an assembly language. Believe me, you don't have access to a *lot*
> >of things that might be useful on occasion.
>
> I don't have to believe YOU, I believe my own EXPERIENCE.
And, in your experience you call C an assembly language. That speaks
volumes about your experience, I'm afraid.
>
> >The #2 thread (after strlen) is memcpy. My alternative is to simply use
> >the movsb instruction.
>
> Are you trying to insult my intelligence?
No, I'm simply pointing out that this thread is second only to the
memcpy threads that pop up. What this would have to do with your
intelligence is beyond me.
> Look at the string.hpp, you
> might see std::memcpy() being invoked here, and there. I don't even
> *consider* the alternative!
Good for you.
>
> If you see meta::vcopy(), it is a different beast, it does check if
> type is pod (uses traits) and does memcpy, or object-by-object copy so
> that corresponding copy constructors and what not are invoked correctly
> in the process. Mostly I use that construct with templates.
I think you completely missed the point of my comment. Allow me to
explain it better and forgive me if I sound patronizing:
(In order of occurrence):
FAQ #1: what's the fastest block copy code we can write
Answer #1: Take a look at the AMD optimization guide and quit posting
routine after routine here. Any attempt to do better than that is going
to fail on different architectures.
FAQ #2: Here's my strlen function, how can I make it faster.
Answer #2: Check out the AMD optimization guide (or Agner Fog's page)
and use that code. Again, unless you're writing the code for a specific
CPU, you aren't going to do substantially better.
FAQ #3: Aren't compilers as good at generating machine code as human
coders?
Answer #3: No they are not. It's *easier* and more *economical* to
write code in a HLL, but the results are often much bigger and slower
than an equivalent program written in assembly. Most of the time,
slower and bigger is no problem, so go ahead and use your HLL. But
don't go around thinking that the code produced by your compiler is as
good as the stuff a decent assembly language programmer will write.
>
> >> Also, I could unroll the C++ innerloop but I won't do it because I
> >> think it is not a particularly good idea in this case.
> >
> >That depends entirely on the CPU and memory architecture.
>
> Context: competing with the specificly mentioned assembly code, which
> was unrolled. If you take that into consideration you have not-so-much
> to nitpick about.
>
> Since it is x86 assembly, I'm assuming some contemporary x86
> implementation will be running the code.
Okay, that's good for today. What about next week's CPU?
This is the thing that killed me 10-15 years ago. I was carefully
hand-optimizing code for the 486 and then the Pentium came out and
changed all the rules. Then the PII, then the PIII, then the PIV. Up to
the 486, whatever rules you applied on one CPU tended to work well on
the next generation. This stopped after the 486.
And "contemporary" doesn't even cut it. The optimization rules for the
PIV are quite a bit different from those for the AMD chips (and, the
PIII, upon which the PM is built). Better just to ignore all the
CPU-specific stuff and go with the general principles that work across
all CPUs. The differences you are talking about (e.g., loop unrolling)
are good examples of things that fall into the CPU-specific categories.
> I don't think unrolling the
> C++ code will do much good. Maybe on 386 or older processor it might
> pay off, hell, most likely it would. But I'm not too much interested in
> 386 these days...
I'm not suggesting that you unroll your code. I'm simply stating that
your argument that unrolling code is bad because of your experiences
with your particular CPU is not wise. On other contemporary CPUs, or on
future CPUs, the rules may be different. And the rules could also
change based on the memory alignment of the code (I've seen some
pretty big differences in performance based on the position of the code
in a program). And let us not forget caching effects. When you run
1,000,000 strings through your xstrlen function, you're hammering on
the same code over and over again and even the data access (usually
sequential) is pretty good as far as a cache is concerned. What happens
when you call xstrlen from within a real program when the code isn't
cached up and the data isn't in cache? You'll probably get quite
different results based on whether the code is unrolled or not. I don't
know which would be better (for a given CPU, of course), but I do know
that claims of "this isn't better" or "this is better" tend to melt
away when the environment changes on you. Bottom line: code that works
great on today's CPU is no guarantee that the same will be true on
tomorrow's CPUs. A lot of assembly language programmers
discovered this fallacy when going from the PIII to the PIV.
Cheers,
Randy Hyde
And, no doubt, Steve will come around in another decade... :-)
>
> /*
> Something that also makes me chuckle is the assumption that if you wait
> long enough, someone one day will create a C compiler that can output
> code in the same class as hand written assembler but after years of
> */
>
> Wait a second champ, I never said that, I am advocating the thought
> that resorting to assembly as First thing is folly. Resorting to it
> when there is need isn't.
This is a new thought in this thread. Here's what you've said in the
past:
"I still think very much that assembly itself is not optimization tool
anymore, ofcourse if you know it, you write better higher level code,
especially when you know the compiler.. and check assembly *output* to
check the compiler isn't doing something really stupid.
Best use for machine specific instructions, IMHO, is to let machine
generate them. Be this a offline compiler like g++, visual c++, et al..
or realtime code generator like JIT Compiler or something like that.
I'm not "dissing" assembly per-se, I used to be strongly with the
opinion that it is the way for performance."
You will have to forgive Steve and me for interpreting this to mean that
you believe C compilers should be generating all the machine code
rather than human beings. As you've constantly complained that I've not
bothered reading this whole thread, I would suggest that you go back
and read what you've written. I suspect that you've not put all your
thoughts into your posts and the missing information that's in your
head is crucial to your line of thought if you want us to agree with
what you're saying.
>
> /*
> hearing this nonsense, people who make the effort still outclass a
> compiler because there is more to code design that robot software
> output.
> */
>
> Ofcourse. But only when it pays off in some ways, makes a difference to
> real-world software. strlen() is a good example, where it doesn't make
> a didly-doo's difference to real programs performance in most of the
> real-world, production software.
Most of the real-world software doesn't need to be any faster than what
the CPUs provide for free to us. Praise the miracle of the side-effects
of Moore's Law over the past 15 years! Perhaps Steve's optimism
concerning strlen is unfounded, but the general idea is still right
(again, as we all seem to agree, someone who is using strlen enough for
a faster algorithm to make a difference in the performance of their
program could get much better performance with a different data
structure; so why bother trying to speed up strlen?).
> I only took it into myself to write
> the C++ version to validate that my theory, which is forged by years of
> practise, is still correct. It still is.
As long as you pick and choose the examples to try, I'm sure you'll
remain convinced of this. :-) Again, I keep coming back to the
optimization of "i=0;" as my "proof". Strlen is no different. You can
certainly get within 10-20% of the performance of a well-written
assembly language function in a HLL like C/C++. That doesn't mean you
can always do this, however.
>
> /*
> With Windows still controlling somthing like 90% of the desktop market,
> considerations of what runs on a PPC or MIPS or SPARC or SOLARIS fits
> into the category of "who cares" in most instances.
> */
>
> That depends who you writing software for. If you write it for
> yourself, okay. If it is freeware, open source.. who cares where it
> runs besides the author, or those who contribute. If it is application
> written for a customer, usually they dictate the terms.
They also dictate the price. Which means it behooves you to write the
code as rapidly as possible (i.e., no optimizations at all, unless
absolutely necessary to meet specs) in order to maximize your profit.
This might seem like a good thing to do until you try and re-use that
code in the next project...
It's the good old "short-term" vs. "long-term" trade-off. Do you spend
more money up-front and less down the road? Or do you wind up rewriting
strlen over and over again because the requirements change on a
continuing basis? BTW, this isn't an argument for HLL vs. assembly, per
se, just an observation.
>
> /*
> The basic idea is a good one but as about 5% of computers around the
> world could run it, it's a moot point where the two algos I have posted
> will run on a 486 running OEM win95. I would certainly like the luxury
> */
>
> My basic idea was: why not take advantage of the latest instruction
> set extensions when *available*? When they are detected, use them,
> have faster software which I think was the WHOLE POINT to begin with
> when writing in assembly, or did I miss something? You write in
> assembly just because...? What?
I have to agree with you, particularly when strlen is involved. This
function has been written so many times that a person can *easily* find
a 386 algorithm should they need it. If you want something good, write
for the latest processors. #ifdef if you have to, but support the
latest.
>
> If no SSE, MMX, what not, fall back to generic x86 code that works down
> to 386, no problem?
Of course, there is the *overhead* associated with using MMX/SSE. You
need to preserve the state of the registers on input (you are nice,
right?). It's expensive to save the state of SSE and it's *really bad*
saving the FPU state. This can *kill* the performance of strlen if
you're processing lots of short strings (a common case).
>
> If your point is to make the fastest possible code, you take the pains to
> write the code in assembly, but then ignore the latest instructions
> available on the x86 platform, what's wrong with the picture?
Again, I have to agree with you in principle. In practice, though,
there are other reasons why using the latest and greatest instructions
may not be the best choice in a *generic* function. Again, preserving
state can be more costly than the savings achieved. OTOH, one advantage
to assembly is that you can often in-line these instructions and not
have the overhead of saving FPU/SSE state (or, you get to amortize the
costs across several uses of those instructions rather than a single
function call). For example, in an earlier thread I wrote:
strcpy( a, b );
strcat( a, c );
strcat( a, d );
(ignoring the fact that this is bad code to begin with). In this
example, you'd only really need to preserve machine state across this
*sequence* of calls rather than on each call. This is one of the
advantages to assembly that you don't get in HLLs.
As for 386 vs. PIV issues, it's easy enough (via dynamic linking,
conditional assembly, or just run-time IF statements, as sketched below)
to include the code optimized for a series of processors, if you want to support older
as well as newer processors. Nevertheless, using a base instruction
set that most people's computers support and optimizing those
instructions based on modern rules is *not* a bad idea. I think it was
Terje, Michael Abrash, or other of their contemporaries who used to
write 8088 code optimized to run on a 386 processor. Everyone could run
the code, but it ran best on modern processors. True, the code wasn't
as good as it would have been had it been written for the 386, but it
ran everywhere. Hutch's requirement is to write code that is as fast as
possible but runs just about everywhere. Granted, I think he could use
the Pentium Pro as his baseline today (or even the PMMX), but his
approach isn't necessarily a bad one. His run-time requirements are
different from yours, that's all. And that's why I keep coming back to
the fact that your xstrlen function beating his unrolled code isn't
that impressive -- you've clearly got a different (set of) target
machine(s) in mind.
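To make the "run-time IF" idea concrete, here is a minimal sketch (nobody's
production code) of a one-time dispatch through a function pointer. It assumes
GCC's __builtin_cpu_init/__builtin_cpu_supports, and the sse2_strlen and
generic_strlen names are just placeholders for whichever processor-specific
versions you actually have:

#include <stddef.h>
#include <string.h>

/* placeholder implementations; real code would supply the tuned versions */
static size_t generic_strlen(const char *s) { return strlen(s); }
static size_t sse2_strlen(const char *s)    { return strlen(s); }

/* chosen once, after which every call is just an indirect call */
static size_t (*strlen_impl)(const char *) = generic_strlen;

static void pick_strlen(void)
{
    __builtin_cpu_init();                   /* GCC: populate CPU feature info */
    if (__builtin_cpu_supports("sse2"))     /* the run-time IF on the feature */
        strlen_impl = sse2_strlen;
}

Call pick_strlen() once at startup; afterwards strlen_impl(s) costs one
indirect call, which is the overhead being traded against the MMX/SSE
state-saving costs discussed above.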
> /*
> of being able to write PIV code with SSE3 but with the sheer volume of
> older computers still running, it will not be any time soon that this
> will happen.
> */
>
> It depends what your audience is. If you write games for instance for
> commercial market, go and take a look at minimum requirements on the
> game boxes. Most of the games don't run on 386 anymore. Just an example
> of one market segment.
Hutch doesn't write games, to my knowledge.
>
> I don't know why you still want to support 386 while writing *windows*
> software in 2005.
Probably he meant the 486 or Pentium. In particular, I don't believe he
accepts the use of MMX or SSE.
>
> /*
> One of the good things when 64 bit x86 gets going in the next few years
> is that there will be a far more modern instruction set where 10 year
> old compatibility is not an anchor around your neck.
> */
>
> I don't see how that is supposed to fit your point of view where
> MMX/SSE is too modern already. Explain?
When 64-bit OSes come out, we know that everyone running one of those
OSes will have a baseline machine that supports 64 bits, SSE/3, etc.,
etc. Therefore, he doesn't have to worry about supporting older
machines at that point (of course, new machines *will* appear, but the
baseline will be high-end for some time after that).
Cheers,
Randy Hyde
I think maybe you have missed the direction I have commented in. I
certainly see writing multiport code as a worthwhile endeavour but I
don't see it as a replacement for hardware specific code where
assembler has no peers in terms of size and speed.
The original post in this topic was a member looking for the difference
between C code and an assembler version of a byte scanner for
determining the length of a zero terminated string. I posted for him a
slightly modified algorithm that was written by Agner Fog in about 1996
and in code terms, that is a reasonably long life for an algo design in
assembler.
The arguments against writing assembler are usually the development
cycle time, yet to produce a nearly-as-fast version in C++, I suggest
that you have probably spent more time than it would take to write it
in assembler, and while the development may be useful to you in terms of
portability, it is neither a development time nor a speed advantage.
> > You remind me in the young and wild days..
In my youth I wrote ANSI C but time and cynicism led me down the road
of writing pure assembler in many places because portability in almost
every instance is a myth. My main use for a C compiler these days is
ratting through the mountain of old C junk for decent algorithm designs
which CL.EXE easily converts to MASM-format assembler, which is then a
good target for manual optimisation.
I am also not without criticism of current C compiler design in terms
of code generation. RISC-theory code design may be convenient for
compiler designers, but current x86 hardware is very badly suited to
such theory with its restricted range of general-purpose registers, and
you regularly see redundant loads and stores so that trivial API calls
and the like are performed in registers.
This is left-over 1990s technology from when pre-PII hardware was
faster that way. Then there is the problem of using the same
optimisation strategy for all code in a module, and while you can
separate the fast code from the hack OS code, few would bother to do
this and even fewer would know what code matters and what does not.
> > Wait a second champ, I never said that, I am advocating the thought
> > that resorting to assembly as the FIRST thing is folly. Resorting to it
> > when there is need isn't.
The problem with this view is that it escalates upwards in the same
manner. Many VB programmers would use the same argument against C++,
where you only need to write "low level" code on a needs basis, so you
wouldn't write natively in C++.
There is in fact an ever growing number of people who do use assembler
as a first choice for some tasks and it is purely a matter of
familiarity with the language format. Many with a high level background
don't properly understand that assembler can routinely work with the
12000 plus API calls, the near massive collection of compatible C
libraries, libraries written in assembler and so on.
Assembler programming is by no means restricted to plugging up the
defects in compiler code output but much more to do with freedom of
design and architecture as well as chasing speed where it matters.
Instruction choice is a matter of targeted market width. If the Linux
desktop market is 2%, gaming is 0.01% of the sum total market, and it
makes high demands on video, memory and processor performance, all of
which change on a weekly basis to a later, faster and more expensive
choice.
I have always been stuck with targeting code at the widest number of
people and this means the furthest backwards compatibility for the
current Windows OS platform. This says primarily 486 code but there is
more to it than just linear backwards compatibility. MMX was a big deal
with a P200 MMX processor but it does not perform reliably across all
of the later processors. It was also cursed with sharing the FP
registers, which excluded joint FP/MMX operations without a massively
expensive time delay.
SSE hit the deck with the PIII and was occasionally faster than MMX but
as usual the limiting factor is memory bandwidth. The gain with SSE(2)
is the non-temporal writes, where you can clock the speed difference in
real time.
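As a rough illustration of what non-temporal writes mean in practice, a
minimal sketch using SSE2 intrinsics; it assumes a 16-byte-aligned
destination and a size that is a multiple of 16, and it is not a tuned
memset, just the shape of the idea:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

void stream_fill(void *dst, unsigned char value, size_t bytes)
{
    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;          /* caller guarantees 16-byte alignment */
    size_t  n = bytes / 16;               /* caller guarantees a multiple of 16  */

    for (size_t i = 0; i < n; ++i)
        _mm_stream_si128(p + i, v);       /* movntdq: write around the cache */

    _mm_sfence();                         /* order the streaming stores */
}

On large buffers the streaming stores avoid dragging the destination through
the cache, which is where the clockable difference comes from; on small
buffers ordinary cached stores usually win.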
I also see this as the saving factor with compiler-generated code: that
memory bandwidth compresses the difference between shorter code with
fewer instructions and a mountain of redundant loads and stores.
Put simply, the processor is still some powers faster than current DDR 400
and later memory, and this allows a reasonably large number of redundant
instructions to be placed between memory access instructions.
I have an example in mind, clocking an insertion sort where the removal
of 33 redundant loads and stores made no difference to the time of the
algo. The only thing that did make it faster was reducing the number of
memory accesses.
> > Of course. But only when it pays off in some way, makes a difference to
> > real-world software. strlen() is a good example, where it doesn't make
> > a diddly-doo's worth of difference to real programs' performance in most of the
> > real-world, production software.
A single string length algo is only a very small component of common
tasks, yet if the same indifference is applied to the sum total of
software design, you end up with the slow, bloated style of C++ that is
common these days in commercial applications, and hardware is not
getting faster but software is still getting bigger and slower.
It's not what "CAN" be done but what "DOES" get done with the majority
of software production tools that is the measure of the tool. No doubt
a well-written C library will easily produce good quality final
application code in many instances but the vast majority of modern
applications are not in this class.
The hallmark of modern application production is massive size
increases, reduced functionality, inappropriately used threaded code
with endless timing lags and very high demands on current hardware.
> > I only took it upon myself to write
> > the C++ version to validate that my theory, which is forged by years of
> > practice, is still correct. It still is.
This is fine and I hope it was useful to you, but with the development
time to produce a C++ version that is nearly as fast as an old
assembler version, the development-time advantage goes to the assembler
code, not the C++ code.
> > That depends on who you're writing software for. If you write it for
> > yourself, okay. If it is freeware, open source.. who cares where it
> > runs besides the author, or those who contribute. If it is an application
> > written for a customer, usually they dictate the terms.
The project that I maintain is used by a very large number of people
and it must remain useful to this number of people, so there is no real
point in targeting the 0.01% doing unusual things. People who need
code in this range have the perfect tool with an assembler to pick the
advantages they require and simply write what they need.
> > If your point is to make the fastest possible code, you take the pains to
> > write the code in assembly, but then ignore the latest instructions
> > available on the x86 platform, what's wrong with the picture?
The problem with this comment is that it assumes another language's primacy,
yet there are enough people who write simple things in assembler
without feeding them through the restrictions of Delphi or C++ or
whatever else. Apart from speed issues, near-complete freedom in terms
of architecture has a lot going for it, and in the case of MASM, its
pre-processor will eat C compilers alive in terms of capacity.
Being able to design your own language free from the claptrap is one of
the large advantages in assembler programming.
> > I don't know why you still want to support 386 while writing *windows*
> > software in 2005.
Very simple actually: the vast majority of computers around the world
are not high-end dual-core AMD64 Opterons with > 8 gig of memory but
far more humble machines that profit from small, fast software written
in assembler, where the later slow, bloated, hardware-specific stuff just
won't run on such boxes.
Really high-end graphics run on SGI boxes, and when you don't need to
target a wide range of people, this will deliver performance that the PC
market is some power slower than.
Before or after you came along, I did not say using assembly isn't
worthwhile. It is, but very RARELY. In this case, with strlen() in
C++ vs. x86 assembly for instance, it isn't worthwhile.
>And that's the argument that I'm not buying. The fact that in some
>controlled situation you can cajole a C++ compiler into producing code
>about as optimal as one can expect does not imply that the compiler
>will do this all the time. You are dispelling no myth, I'm afraid.
The point is that most of the time it doesn't pay off in any way other
than a warm fuzzy feeling. When it does, it does; that is not being
disputed here.
>that an optimization on Hutch's machine isn't as valid on your machine
>should prove to be no surprise here. It's one of the main reasons I
>quit "counting cycles" when the Pentium first arrived -- there's no
>sense in in anymore.
The point is that it doesn't make any difference other than a negligible
one; if you feel that assembly is a good idea for a negligible performance
increase, I don't want to agree with that stance being a particularly
good idea, since that is the opinion you are opposing here.
>assembly code. How many people, for example, will find the C++ strlen
>function you've written to be any more understandable than the assembly
>version (from an algorithmic point of view, obviously)?
You are asking for a quantity out of an undefined set, which is one way to
make a point; if I knew you better I might know what's happening here.
No smiley, check.
>associated with them. Even in plain C, you get performance problems
>when people do things like:
>
>strcpy( a, b );
>strcat( a, c );
>strcat( a, d );
Context specific. No matter what you do, or how fast you do it, it can turn
out to be a performance problem.
>Precisely my point. The optimizations are not portable. For this
>particular example, you're limited to 32-bit processors.
An application using this library is portable to the platforms the library is
ported to; portable does not imply or equal universal. This is a useful
feature; I would say that the C++ code is MORE portable than the
x86-specific (MASM!) assembly function.
>IOW, the "trick" isn't portable and you're suffering from some of the
>same problems as assembly language.
As will everyone else who writes portable code.
>> I would surmise the code is an order of magnitude more useful than an
>> x86-specific assembly snip.
>
>By what reasoning?
By the reasoning that the snip works only on x86 clones, compiles only
with MASM and is generally only useful in Windows software. That does not
make the snip useless, but C++ code which compiles to the verbatim same
binary and also supports a large number of OTHER platforms could be
reasoned to be "more useful"; maybe the word "order" was a red flag for
you, I shall remove it from in front of your eyes.
>Given that about 90% of the world's computers today are x86 CPUs, I
>don't see how having the code in portable C++ is going to make it an
>order of magnitude more useful. Certainly we can find *some* people
Source code is useful only for software developers; x86 constitutes a
large part of the customer base for software developers, but there are
uses for other platforms as well, and that is where the "more useful" (<-
revised!) comes into effect.
>I don't question your claim; from a mathematical perspective I'm sure
>we could find a group of people amongst whom the need to have a
>portable strlen function that compiles on 10 different 32-bit (non-x86)
>processors is important, but...
That is rich; you are consistently ignoring my stance on this issue: I
do not think an optimized strlen() is very useful at all. I don't expect
anyone else to find such a thing useful either.
I am merely saying that A > B: that the C++ version A, which produces an
identical binary, is MORE useful than version B, which is single-platform,
single-assembler-only code.
The rest is your extrapolation; let's clarify. I never said such a "group
of people" exists, or how many such groups of people exist, or
anything to that effect. You are plain malicious, simple as that.
>When you look at the number of people (end users) who will actually
>benefit from the code, however, it becomes real clear that the choice
>of HLL or x86 assembly is *mostly* irrelevant because most end-users
>are running x86 boxes.
Of course; who said it wasn't?
>isn't *always* better with one example, but you cannot make a claim
>that there is no need to use assembly language on the basis of one
>example.
So what? I didn't make a claim that there is no need to use assembly
language in general. I claim that in this case it doesn't bring
anything substantial to the solution.
>And you're assuming that Hutch got it right? That his example is the
>absolute pinnacle of what can be done in assembly?
I am not assuming such a thing. I am assuming that, until he provides us
with something better, that is the best _he_ can come up with in
assembly, or rather, what Dr. Fog does come up with.
>That will probably run slower than the output of a good compiler for
>the above C code. Yes, scasb is *that* bad.
Yes, it is indeed.
>> Laborious and error-prone process.
>Changing the subject?
Where did you get that notion from? Doing register allocation and spilling
by hand IS laborious, among other things; this is stuff a compiler does in a
fraction of the time, thus saving time and money.
>And sometimes the compiler is brilliant when doing this, and sometimes
>it is real bone-headed. Your point?
That it isn't laborious and error-prone..?
>Again, we're back to the argument of "this makes life so much easier
>for the programmer" rather than "the compiler does as good a job of
>this as the programmer could do himself."
I didn't make such a generalized claim, sorry; the only one who did present
that argument here is you.
I am saying that the C++ code compiled with Visual C++ 8.1 Beta 2
produces precisely the same inner loop as the assembly code it was
being compared with. That is not what you claim I am "presenting as
argument".
>Sorry, you've lost me along the road somewhere. Perhaps you could be
>more articulate. It really seems to me that all you've done here is
>switch from "compilers produce code as good as humans do" to "it isn't
>cost-effective for humans to write code this way, so we live with what
>the compilers produce." A very different argument. But for some reason,
I haven't done such a thing. What I wrote originally is not what you
present it to be; the later posts are responses to the ongoing discussion,
adding to what is being discussed, mainly how I see the things under
discussion.
You bet it is a very different argument: it is mine, whereas the earlier one
was yours presented as mine.
>that's where this argument always winds up. I guess that means we've
>reached the end of the debate.
Hitler?
>Well, maybe that's the difference between us. You see, I've got about
>25 years' experience as a professional programmer and I've worked both
>in the times when assembly was mandatory (to get any kind of
>performance at all) and I've been around during the past 15 years when
Z80 assembly was the first programming language I learned, so what?
>these languages are good for. OTOH, I don't go around claiming that
>there is no reason to use assembly because compilers today are as good
>as humans. I may very well say that it makes *economic* sense to use
Neither do I. It is a different thing to advocate using assembly for a
specific, well-defined task rather than in general.
>code efficiently in C. If you *really* want to debunk the myth that
>assembly has no advantage over HLLs, you need to move beyond strlen.
I don't intend to do that, and I didn't either. You have got this
idea into your head because you think you know me, or "my kind".
You might want to read the FIRST reply to the original poster; that was
written by me. E-V-E-R-Y-T-H-I-N-G you accuse me of and "correct" with
your post was already said there. Store the string length. Don't optimize
this, it doesn't matter.
This ongoing discussion consists of replies to your criticism of points that
were already covered earlier.
>trivial strlen function (byte at a time) will outperform the craziness
>embodied in the examples in this thread, because of the intrinsic
>overhead.
The overhead is alignment, which costs: -,-,+,&
The alignment "inner loop" is virtually the same as the trivial strlen()
inner loop. If we want the short strings (< 10) to go down the lower-overhead
path, we can add 8 to the alignment count to force that code to be run
for the first 1 to 11 characters; this will eliminate the overhead at the
end, which is far greater with multiple branches.
Four arithmetic operations of overhead only for the cheap cases; I could
live with that if it were an important issue.
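To put the alignment overhead in context, here is a minimal C sketch of the
general shape being discussed: a byte loop until the pointer is 4-byte
aligned, then a DWORD per iteration with the classic has-zero-byte test. It
is the same family of algorithm as the Agner Fog routine, not a transcription
of it, and it glosses over the strict-aliasing care a real library version
would need:

#include <stddef.h>
#include <stdint.h>

size_t dword_strlen(const char *s)
{
    const char *p = s;

    /* alignment prologue: same test as the trivial byte-at-a-time loop */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* main loop: four bytes per iteration; it reads whole aligned words,
       so it may touch up to 3 bytes past the terminator (safe in practice
       because an aligned word never crosses a page boundary) */
    const uint32_t *w = (const uint32_t *)p;
    uint32_t v;
    for (;;) {
        v = *w;
        if ((v - 0x01010101u) & ~v & 0x80808080u)  /* some byte of v is zero */
            break;
        ++w;
    }

    /* locate the zero byte within the final word */
    p = (const char *)w;
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}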
> This is one reason why I argue that xstrlen is a ridiculous
>example.
Its only purpose is to be compared to the x86 assembly strlen() in the
context of this thread; it produces identical code for the most
time-critical sections, so it works more than well for the discussion that
was going on before you came along. The one I am having with you is
about the Philosophy of Programming and Optimization or some such, or is it
maybe about Putting Jukka In His Place? ;-)
>So when Hutch talks about saving all the function setup and tear-down
>code, this is not an insignificant matter. It reduces the overhead of
>the function call, thus vastly improving performance for small strings
>(which this older research suggests is a common situation). Now the
I have MASM; I *link*, I don't compile with an __asm {} block within the C++
source code, and I don't do that if it can be avoided. That is taken into
consideration when I do the timings; the C++ code is as presented, too.
>Oh, I've read it. That's not what you said earlier. But I'll allow you
>to back out of that gracefully. This is, after all, USENET and we have
>to allow for considerable "unstateds" and "misreads".
Alright, so what did I say earlier that is relevant to that outburst? What
opinion, statement, or claim did I reverse, or whatever it is you seem
to think that I did? I don't follow you.
>Amazing, isn't it? Something so *basic* trips up 99% of the programs out
>there. Exactly the point I'm making. You won't see mistakes like this
>made in a typical assembly program.
I don't quite follow you; the question that springs to mind is WHY you
are making that point.
There is no substitute for knowing what you are doing in either language,
so?
>And when we look in your code, we'll never see an example of this,
>right?
Depends on what you are looking for.
>As in the way you've written xstrlen.
What's the unique characteristic in xstrlen() I should be looking for?
> they would expect it to be written that way.
You still didn't explain what you mean by "that way".
>As for insulting, please check your own post. There are a few too many
>profanities and inferences on your part for you to be able to play this
>card here.
I'm not playing any card. I'm writing what I think, how's that compute?
> That makes us even. I don't care what I think about you either :-)
So I assume you didn't verify what I wrote; my conversation-prediction
logic seems to be working flawlessly.
>Good for you. You escaped the problems of our industry over the past
>four years. But discussions of your experience and how long you've been
You're referring to some situation in the USA? I wouldn't know about that,
because I don't care.
>Again, one example of idiocy does not imply that all attempts to write
>fast code are worthless. And you never know -- It could turn out that
>this person you're talking about has a real-time foreground application
I happen to know precisely what he was doing for reasons outside the
scope of this discussion.
>And, in your experience you call C an assembly language. That speaks
>volumes about your experience, I'm afraid.
"The way I see it is that C is portable assembler, which makes it a
jack
of all trades but master of none."
The language syntax is crafted to be close to the hardware; that's where
this point of view originates from, and that is how I "see" the C
programming language. That is why it is so trivial for me to write C
code while thinking in assembly. It is just plain trivial.
If you think that's funny, go ahead and laugh; my ego can take it.
>No, I'm simply pointing out that this thread is second only to the
>memcpy threads that pop up. What this would have to do with your
>intelligence is beyond me.
So far, you have not said or taught me ANYTHING I don't already know.
And I don't expect to have done so to you either.
>FAQ #2: Here's my strlen function, how can I make it faster.
>FAQ #3: Aren't compilers as good at generating machine code as human
Go ahead and write the FAQ then.
Interesting hypothesis, what's it based on?
>In my youth I wrote ANSI C but time and cynicism led me down the road
>of writing pure assembler in many places because portability in almost
>every instance is a myth. My main use for a C compiler these days is
It's a myth only to someone who never has to do it.
>don't properly understand that assembler can routinely work with the
>12000 plus API calls, the near massive collection of compatible C
>libraries, libraries written in assembler and so on.
That's their problem.
>Instruction choice is a matter of targeted market width. If the Linux
>desktop market is 2%, gaming is 0.01% of the sum total market, and it
>makes high demands on video, memory and processor performance, all of
>which change on a weekly basis to a later, faster and more expensive
>choice.
I don't find the prices rising; on the contrary, new hardware is
cheaper and faster than ever, and the prices of old hardware are diving
rapidly as we speak as well.
>I have always been stuck with targeting code at the widest number of
>people and this means the furthest backwards compatibility for the
>current Windows OS platform. This says primarily 486 code but there is
Out of curiosity, what software is that?
>The hallmark of modern application production is massive size
>increases, reduced functionality, inappropriately used threaded code
>with endless timing lags and very high demands on current hardware.
Yeah, damn Microsoft!
>This is fine and I hope it was useful to you, but with the development
>time to produce a C++ version that is nearly as fast as an old
>assembler version, the development-time advantage goes to the assembler
>code, not the C++ code.
Do you happen to know how long it took Dr. Fog to develop the assembly
version? I happen to know pretty accurately how long it took to write
the C++ version.. I'm just guessing here, but I have reason to
believe that you know neither.
>whatever else. Apart from speed issues, near complete freedom in terms
>of architecture has a lot going for it and in the case of MASM, its
>pre-processor will eat C compilers alive in terms of capacity.
MASM is fine if Windows is all you care about.
>Being able to design your own language free from the claptrap is one of
>the large advantages in assembler programming.
You meant programming using MASM?
>Very simple actually: the vast majority of computers around the world
>are not high-end dual-core AMD64 Opterons with > 8 gig of memory but
>far more humble machines that profit from small, fast software written
>in assembler, where the later slow, bloated, hardware-specific stuff just
>won't run on such boxes.
I would dare to say that your "0.01%" estimate is way unrealistic. I
don't recall the precise dates so I estimate (googling for precise
dates would make me look better but I don't care about that).
Let's say the 386 has been around since 1985 or thereabouts; that's 20 years.
Let's say x86-compatible systems with SSE support as standard have been
around since about 2000; that's 5 years.
I find it unbelievable that in the first 15 years 99.99% of the PCs in
*active use* today would have been built, and only 0.01% of the PCs in active
use today would have been built after the year 2000.
It just sounds totally unrealistic; the market has been expanding, not
shrinking, overall during that time. This means that it's more likely
that in the last 5 years MORE systems have been shipped than in the 6 years
before 2000.
I don't buy that 0.01% FUD; it is totally unrealistic. Maybe I'm
missing the developing countries from my equation, that could explain
it, yeah.. </sarcasm>
"repne scasb" might work a bit better for what you wanted to do, btw.
For someone who has chosen to troll in an x86 assembler forum for C++
programming, I suggest that the only person you have convinced is
yourself, as most programmers in the assembler market have heard all of
this stuff before.
> >I suggest that you have probably spent more time than it would take to write it
> >in assembler, and while the development may be useful to you in terms of
> >portability, it is neither a development time nor a speed advantage.
>
> Interesting hypothesis, what's it based on?
Your response time.
> >In my youth I wrote ANSI C but time and cynicism led me down the road
> >of writing pure assembler in many places because portability in almost
> >every instance is a myth. My main use for a C compiler these days is
>
> It's a myth only to someone who never has to do it.
For a combined market share of about 2%, who cares.
> >don't properly understand that assembler can routinely work with the
> >12000 plus API calls, the near massive collection of compatible C
> >libraries, libraries written in assembler and so on.
>
> That's their problem.
It's also an advantage when such a massive range of functionality is
available apart from what they write themselves. Try writing Windows
software with a C compiler without the library support and you will find
it harder and slower than with an assembler.
> >Instruction choice is a matter of targeted market width. If the Linux
> >desktop market is 2%, gaming is 0.01% of the sum total market, and it
> >makes high demands on video, memory and processor performance, all of
> >which change on a weekly basis to a later, faster and more expensive
> >choice.
>
> I don't find the prices rising; on the contrary, new hardware is
> cheaper and faster than ever, and the prices of old hardware are diving
> rapidly as we speak as well.
Speak to people who have problems raising the couple of grand to buy a
later high-end box. Try the population of China, Asia generally, a
large number of people in the US, South America, programmers from the
old Eastern Bloc in Europe and of course the many very good programmers
from Russia. There is a whole world out there without the type of
funding necessary to keep buying high-end boxes.
> >I have always been stuck with targeting code at the widest number of
> >people and this means the furthest backwards compatibility for the
> >current Windows OS platform. This says primarily 486 code but there is
>
> Out of curiosity, what software is that?
Commercial software is generally written under a non-disclosure
agreement and I still have a few older ones floating around so I cannot
help you with my own history. What I do place for public usage is MASM
code and it is aimed at 486 compatibility.
> >The hallmark of modern application production is massive size
> >increases, reduced functionality, inappropriately used threaded code
> >with endless timing lags and very high demands on current hardware.
>
> Yeah, damn Microsoft!
And damned Borland, damned Linux, damned FreeBSD, damned Sun, SGI and
everyone else who uses C++. :) Nothing like slopping around oversized
underperforming bloated junk to feel profound.
> >This is fine and I hope it was useful to you, but with the development
> >time to produce a C++ version that is nearly as fast as an old
> >assembler version, the development-time advantage goes to the assembler
> >code, not the C++ code.
>
> Do you happen to know how long it took Dr. Fog to develop the assembly
> version? I happen to know pretty accurately how long it took to write
> the C++ version.. I'm just guessing here, but I have reason to
> believe that you know neither.
Interestingly enough he never mentioned it even though he is a member
of our forum but then he is also a very experienced assembler
programmer.
> >whatever else. Apart from speed issues, near complete freedom in terms
> >of architecture has a lot going for it and in the case of MASM, its
> >pre-processor will eat C compilers alive in terms of capacity.
>
> MASM is fine if Windows is all you care about.
>
> >Being able to design your own language free from the claptrap is one of
> >the large advantages in assembler programming.
>
> You meant programming using MASM?
No, I mean what I said, in the context of a pre-processor that will eat
C++ compilers alive with its capacity. Prebuilt languages come with
pre-built assumptions, whereas an assembler with a high-powered
pre-processor suffers none of that garbage.
> >Very simple actually: the vast majority of computers around the world
> >are not high-end dual-core AMD64 Opterons with > 8 gig of memory but
> >far more humble machines that profit from small, fast software written
> >in assembler, where the later slow, bloated, hardware-specific stuff just
> >won't run on such boxes.
>
> I would dare to say that your "0.01%" estimate is way unrealistic. I
> don't recall the precise dates so I estimate (googling for precise
> dates would make me look better but I don't care about that).
Yes, it could be .02% but as a market share, it's trivial. Few in the x86
gaming market make any money anymore and those that do are at the
corporate level.
> Let's say the 386 has been around since 1985 or thereabouts; that's 20 years.
> Let's say x86-compatible systems with SSE support as standard have been
> around since about 2000; that's 5 years.
>
> I find it unbelievable that in the first 15 years 99.99% of the PCs in
> *active use* today would have been built, and only 0.01% of the PCs in active
> use today would have been built after the year 2000.
>
> It just sounds totally unrealistic; the market has been expanding, not
> shrinking, overall during that time. This means that it's more likely
> that in the last 5 years MORE systems have been shipped than in the 6 years
> before 2000.
Put graciously, the ass fell out of the computer market with the
collapse of the dot-com boom, and internationally sales are down. Look
at the number of well known software companies that went out backwards
since 2000 and the various hardware companies that have been taken over
in the last few years and you will forget the idea of a market that
expanded in the same way as it did through most of the 90s. There are a
large number of people who still run old boxes that do what they
require who have little use for the newer stuff, especially as the
later OS versions do little better than 10 year old versions.
Think of DOS boxes, Win95 boxes, old Macs that still slug along
perfectly and you will get some idea of why so many people won't spend
the money to get more problems, bugs and the like.
That's a bit rich; I did prove my point with actual code and did come
up with an SSE scanner which does 8 characters per iteration, neither
of which you have done.
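For reference, and without reproducing the exact fragment posted earlier in
the thread, a minimal sketch of what an 8-characters-per-iteration zero
scanner of that era can look like: pcmpeqb on an MMX register plus pmovmskb
(the SSE addition) to pull out a byte mask. It assumes the pointer is already
8-byte aligned, and the helper name is mine:

#include <mmintrin.h>    /* MMX: _mm_cmpeq_pi8, _mm_setzero_si64, _mm_empty */
#include <xmmintrin.h>   /* SSE: _mm_movemask_pi8 (pmovmskb on MMX regs)    */
#include <stddef.h>

size_t scan_zero8(const char *s)             /* assumes s is 8-byte aligned */
{
    const __m64 *p = (const __m64 *)s;
    const __m64 zero = _mm_setzero_si64();
    int mask;

    for (;;) {
        __m64 eq = _mm_cmpeq_pi8(*p, zero);  /* 0xFF in every byte that is 0 */
        mask = _mm_movemask_pi8(eq);         /* one bit per byte, 8 bits used */
        if (mask)
            break;                           /* the terminator is in this block */
        ++p;
    }
    _mm_empty();                             /* EMMS: hand the FP stack back */

    size_t off = 0;                          /* offset of the lowest set bit */
    while (!(mask & 1)) {
        mask >>= 1;
        ++off;
    }
    return (size_t)((const char *)p - s) + off;
}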
All we have from you is talk, talk, talk.. and borrowed code from
someone else, yes, you didn't even write the code yourself!
>> Interesting hypothesis, what's it based on?
>
>Your response time.
Pleeez, if you don't have either piece of code, then in light of the results
would it make a big difference whether the code was written in C++ or
assembly? Which is generally faster to develop?
Based on your response time to unroll the code, it looks like it takes
even longer to write such code in assembly!
>For a combined market share of about 2%, who cares.
Anyone who wants to do more with the computing power they've got?
>available apart from what they write themselves. Try writing Windows
>software with a C compiler without the library support and you will find
>it harder and slower than with an assembler.
You realize that your suggestion assumes that you would first have to
be very stupid to do that in the first place?
I am not against assembler. None whatsoever. I just don't see how it
improves the code SUBSTANTIALLY in this case (strlen). Now that you have
already exhausted your arguments against that observation, you are
starting to attack my character (implying that I am trolling), and
invoking totally unrealistic scenarios to prove something that I am not
even against in the first place.
And no, I don't think it would be that much easier in assembler, C or
C++ for that matter. It takes considerable time and effort to replicate
the WIN32 API; look at the WINE project for starters. Haven't they been at it
for 10+ years and are still working on it and fixing bugs now and then?
>Speak to people who have problems raising the couple of grand to buy a
>later high-end box. Try the population of China, Asia generally, a
That's why you leave the support for generic x86 in there; then those
with SSE will get the speed benefits. When processing power grows, I expect
the processor to be able to do more, not less.
>from Russia. There is a whole world out there without the type of
>funding necessary to keep buying high-end boxes.
Then by all means write software that is fast enough on a 486. That is
trivial for trivial tasks.
>And damned Borland, damned Linux, damned FreeBSD, damned Sun, SGI and
>everyone else who uses C++. :) Nothing like slopping around oversized
>underperforming bloated junk to feel profound.
I'm assuming you are joking.
>Interestingly enough he never mentioned it even though he is a member
>of our forum but then he is also a very experienced assembler
>programmer.
Yet you draw conclusions about development time w/o knowing how long it
took me (you looked at my "response time") and without knowing how long
that assembly code took to develop. Ask him if he is in your forums.
>Yes, it could be .02% but as a market share, it's trivial. Few in the x86
>gaming market make any money anymore and those that do are at the
>corporate level.
The way I see it, you develop such trivial software that it still runs
alright even on a 486, so it doesn't matter how it is optimized for later
processors. If it's fast on a 486, it had better be fast on the latest
Opteron, of course!
>at the number of well known software companies that went out backwards
>since 2000 and the various hardware companies that have been taken over
I guess the crap ones were weeded out, good riddance! Those who make
products people are willing to pay for are still here.
>require who have little use for the newer stuff, especially as the
>later OS versions do little better than 10 year old versions.
"Little better" is subjective, if someone has appliances he connects to
his computer older OS'es just don't cut it. Just to give a very common
example of what people do with their computers these days, often you
see digital camera, recorder or portable mp3 player being hooked up to
a computer and what not.
For email and surfing the web and the likes older computer is just
fine, if that is all you do.
>Think of DOS boxes, Win95 boxes, old Macs that still slug along
>perfectly and you will get some idea of why so many people won't spend
>the money to get more problems, bugs and the like.
I'm more inclined to believe that if they were given the choice, most would
instantly go for the latest computer equipment rather than not. Most likely
they don't have a choice, or just plain don't care EITHER WAY.
You make it sound like they actually *prefer* it that way; that is highly
unlikely. At best, they might be happy with what they've got, or simply not
know any better.
If you look at the typical computer user, yes, most don't NEED a fraction
of the computing power they've got. But that is not related to what they
WANT and EXPECT from their computers. Completely different discussion.
Dude, you are drifting way off-topic.. I take it you have nothing to
comment on the SSE zero scanner, which does 8 characters per iteration?
You won't use it, because your software is fast enough as it was on a
486, hence no need to make it faster on a system with SSE support? I take
it that you aren't using assembler for speed but because _you_ develop
much faster with it. That says volumes about you as a developer, not
about assembler.
And once more: I am not against writing in assembler. I have been using
x86 assembly language for a long time. Apparently I know how to both read
and write it, if you read this thread, right? THAT shouldn't even be
the topic.. why are you pushing the discussion into market shares and
other non-technical nonsense?
> That's a bit rich; I did prove my point with actual code and did come
> up with an SSE scanner which does 8 characters per iteration, neither
> of which you have done.
Where I have posted two complete working algorithms that both run on a
486 upwards, I saw as your contribution a fragment of SSE code with a
throwaway line after it and a lot of waffle about trying to emulate
asm in C++.
I don't deny the use of multiport code, even though its user base is
trivial in comparison to the mass market where if you could produce the
"killer app" for the sum total of the rest, you would have hit less
than 5% of the market.
> All we have from you is talk, talk, talk.. and borrowed code from
> someone else, yes, you didn't even write the code yourself!
You will have to forgive me, but Agner Fog wrote the algo first. I have
seen and timed many variations of it over time but his original
architecture has stood the test of time. The only code of mine you saw
in it was the leading alignment code, as the original was designed to
work with 4-byte-aligned strings.
Noting that the original posting for this thread was about a string
length algo in asm, posting a well-known, faster, 486-compatible
algorithm with a minor mod to align the start of the buffer is in fact
reasonable, whereas trolling about how and why someone should choose
multiport code is a long way off the subject.
Most assembler programmers have heard this crap before from people with
the same cross to carry from being committed to an outmoded idea of how
and why you should write code. The only thing we are missing is the
OOP(S), bloat and other trivia associated with the same bundle of
out-of-date nonsense.
> Pleeez, if you don't have either piece of code, then in light of the results
> would it make a big difference whether the code was written in C++ or
> assembly? Which is generally faster to develop?
Everybody has a theory on development time, see the VB guys for really
fast development times.
> Based on your response time to unroll the code, it looks like it takes
> even longer to write such code in assembly!
REPEAT 7
; code
ENDM
Truly amazing complexity ? :)
> >For a combined market share of about 2%, who cares.
>
> Anyone, who wants to do more with the computing power they've got?
I doubt that an old notebook Pentium hits the deck as high end
hardware.
> >available apart from what they write themselves. Try writing Windows
> >software with a C compiler without the library support and you will find
> >it harder and slower than with an assembler.
>
> You realize that your suggestion assumes that you would first have to
> be very stupid to do that in the first place?
For someone who MUST depend on other people's prewritten code, it would
in fact be very stupid, but someone who does not know the difference and
argues that they are writing their own code comfortably exceeds
such a level of stupidity.
> I am not against assembler. None whatsoever. I just don't see how it
> improves the code SUBSTANTIALLY in this case (strlen).
It's really simple actually: Agner Fog wrote that algo back when C
compilers were producing crap like SCASB years after it was
out-of-date legacy code for pre-486 hardware, and in the context of true
486-compatible code that can be used by nearly every machine
that can run x86 Windows, it is still a good algo.
Whether someone can eventually emulate it in COBOL or FORTRAN or PASCAL
or whatever else simply does not matter, as it is a viable, fast
algorithm in a general-purpose context.
Whether you have a theory on string length calculation or not, the data
does not come by immaculate conception and you cannot always get that
info somewhere else. It may be something new to you, but most programmers
already know that you don't do more work on code and/or data than you
need, so when you get the length of string data, you normally store it
somewhere.
> already exhausted your arguments against that observation, you are
> starting to attack my character (implying that I am trolling), and
> invoking totally unrealistic scenarios to prove something that I am not
> even against in the first place.
Well, what is all the noise about then? I suggest it is you, with an axe
to grind about assembler programming, that is making the noise, and YES
you are trolling in a topic where a member was asking about an assembler
string length algo.
> That's why you leave the support for generic x86 in there; then those
> with SSE will get the speed benefits. When processing power grows, I expect
> the processor to be able to do more, not less.
This suggestion already assumes multiple copies of algorithms, and while
this is viable in some instances, fortunately it's not the only way or
the best way to perform such a task. After a normal processor detect you
can easily pick the code you should run by building a set of DLLs for
each processor, but to put it into context, how often can a user justify
the expense of buying a late-model high-end box so that an SSE2 algo
can read a string length for typed input slower than a normal integer
version?
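As a concrete shape for the DLL-per-processor idea, a minimal sketch only:
the DLL names and the "xstrlen" export are purely illustrative, and the
feature test uses the stock Win32 IsProcessorFeaturePresent call.

#include <windows.h>
#include <stddef.h>

typedef size_t (__cdecl *strlen_fn)(const char *);

strlen_fn load_best_strlen(void)
{
    const char *dll = "strlen_486.dll";                   /* baseline build */

    if (IsProcessorFeaturePresent(PF_XMMI64_INSTRUCTIONS_AVAILABLE))
        dll = "strlen_sse2.dll";                          /* SSE2 build */
    else if (IsProcessorFeaturePresent(PF_XMMI_INSTRUCTIONS_AVAILABLE))
        dll = "strlen_sse.dll";                           /* SSE build */

    HMODULE h = LoadLibraryA(dll);
    if (h == NULL)
        return NULL;                                      /* caller falls back */
    return (strlen_fn)GetProcAddress(h, "xstrlen");       /* illustrative export */
}

The point above still stands, of course: for something as cheap as reading
the length of a typed-in string, the detect-and-load machinery costs far more
than the scan itself ever will.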
> >And damned Borland, damned Linux, damned FreeBSD, damned Sun, SGI and
> >everyone else who uses C++. :) Nothing like slopping around oversized
> >underperforming bloated junk to feel profound.
>
> I'm assuming you are joking.
Only as much as you were. I have seen crap code from many places over
time and Microsoft by no means has a monopoly on this.
> Yet you draw conclusions about development time w/o knowing how long it
> took me (you looked at my "response time") and without knowing how long
> that assembly code took to develop. Ask him if he is in your forums.
Not only do I not care, but I doubt that he does either. You may not be
familiar with the situation, but I regularly deal with assembler
programmers who write assembler code daily, so it's not the pie-in-the-sky
mystery you appear to be assuming.
> The way I see it, you develop such trivial software that it still runs
> alright even on a 486, so it doesn't matter how it is optimized for later
> processors. If it's fast on a 486, it had better be fast on the latest
> Opteron, of course!
All you are saying here, in the absence of grasping general-purpose x86
assembler, is that you prefer to avoid the major market that does not
own a 64-bit Opteron. It probably matches your view on multiport code,
which does the same.
> "Little better" is subjective, if someone has appliances he connects to
> his computer older OS'es just don't cut it. Just to give a very common
> example of what people do with their computers these days, often you
> see digital camera, recorder or portable mp3 player being hooked up to
> a computer and what not.
Most people with high-end boxes already have this stuff, but there are a
very large number of people who use a Win3.x box for word processing,
DOS boxes for database software for stock inventory and a host of
things they don't need or want to change.
You are trying to jump on the Microsoft bandwagon here with bigger,
better, faster, smarter, etc .... but the mass market stopped buying at
the end of the dot-com boom and the computer industry has been
floundering since.
> I'm more inclined to believe that if they were given the choice, most would
> instantly go for the latest computer equipment rather than not. Most likely
> they don't have a choice, or just plain don't care EITHER WAY.
I would love to have a play with a 512 Itanium SGI box just to see how
fast it was but there is no way I am going to fund one to find out.
> Dude, you are drifting way off-topic..
No, you did, by trolling an assembler newsgroup with C++ crap that few
would be interested in. I was one of those who was interested, as I
understood the work to be trying to emulate assembler within the crippled
assumptions of a high-level language, but you can be sure I am not one iota
interested in trolling for C++; I have heard it all before.
> I take it you have nothing to
> comment on the SSE zero scanner,
Yawn. I have seen megabytes of SSE(2) code in the last 5 or 6 years
written by people who are very good at it; tell me something new.
> You won't use it, because your software is fast enough as it was on a
> 486, hence no need to make it faster on a system with SSE support?
No, SSE(2) code runs really badly on older hardware. :) Hardware says
naughty things to you like "invalid opcode" and generally pulls the plug
unless you have some form of exception handling in place.
> it that you aren't using assembler for speed but because _you_ develop
> much faster with it. That says volumes about you as a developer, not
> about assembler.
I confess that my C is getting rustier by the day, but then C is getting
rustier by the day without my help. I see rapid development done in a
multitude of languages from Pascal to Basic to VB and scripting and the
like, so anyone assuming that C++ is at the leading edge of rapid
development is plain kidding themselves.
C used to be an excellent language for OS development, but by no stretch
of the imagination is it simple or fast to develop in. Remove its
precanned libraries and an assembler will kick its ass. Treating a C
compiler as a premium code generation tool displays some ignorance of
what a C compiler is good at.
The capacity to use ANY COMPATIBLE OBJECT MODULE and having the
appropriate notation to address and use it makes a C compiler more a
management tool than a premium code generation tool. They are by no
means bad in most instances, but they are by no means premium code
generators either.
> And once more: I am not against writing in assembler.
No, perhaps not, but you are trying to put it in a little box on the side
of your own C++ programming and further, you are trying to inflict this
view on other people who don't hold your assumptions.
> I have been using x86 assembly language for a long time.
Congratulations, so have many of us; we just don't all try and shoehorn
it into your little box, and this is what this discussion has been
about: you trying to inflict your view of how other people should write
their own code.
I don't personally care if you chisel code on granite blocks and read
it into a computer with an OCR reader but try and inflict it on others
and you will end up hearing why they don't agree.
15 years? Moore's Law was actually published 40 years ago, and has proven
itself to be startlingly accurate over that entire period.
--
- Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.
That fragment is essentially the inner loop; it works, too.
>I don't deny the use of multiport code, even though its user base is
>trivial in comparison to the mass market where if you could produce the
>"killer app" for the sum total of the rest, you would have hit less
>than 5% of the market.
I would say that the user base I am targeting with the multiport code
is much larger than the market you command. I'm developing microcode
and drivers for mobile phones. This is a market which just last year was
in excess of 600 million units (based on the data that Nokia, with a 30%
market share in 2004, shipped 200 million units alone).
Yes, Hutch, we still think it is programming even if we don't write
off-the-shelf applications for desktop computers.
>You will have to forgive me, but Agner Fog wrote the algo first. I have
>seen and timed many variations of it over time but his original
>architecture has stood the test of time. The only code of mine you saw
>in it was the leading alignment code, as the original was designed to
>work with 4-byte-aligned strings.
Mostly what I have seen from you is talk, not much code, and you call me the
troll. Go figure. Trolls don't go to the trouble of actually proving
their points, do they? Or how should I know; I'm not an expert on the
subject and not trolling. That's your view on the matter; if it were true,
you would only be playing a troll's game and being a sucker, but I can
assure you that you are not.
>Noting that the original posting for this thread was about a string
>length algo in asm, posting a well-known, faster, 486-compatible
If you note my first reply, it was on topic and gave instructions
on much smarter practices to avoid the need to optimize strlen() in the
first place. It was you who implied that assembly is a must in this
case and that HLL compilers should effectively be ignored.
That isn't the case; as I demonstrated quite clearly, the HLL code does
compile into precisely the *same* inner loop as your assembly piece.
Note that this is not a statement against assembly language; I, like many
others, write assembly and mix it with HLLs. It does not mean
we are *against* assembly or saying that it shouldn't be used.
You got to that conclusion all on your own, and I can tell you, mister,
that that is a wrong assumption and a wrong conclusion.
YOU have been ranting all this time, and complaining that I am some kind
of anti-assembly-language troll; heavens, I am posting at
comp.lang.asm.x86. While I post here and use assembly, it
doesn't mean that I cannot post reasons for when NOT to use assembly. I
don't think strlen() is a particularly clever thing to optimize in
assembly, no offence intended.
I think that you are simply over-reacting and being protective, and assume
that I have points of view which I don't have; there is nothing wrong
with that, until you go and broadcast them to the world as my
opinions.
>algorithm with a minor mod to align the start of the buffer is in fact
>reasonable, whereas trolling about how and why someone should choose
>multiport code is a long way off the subject.
Saying that before arguing the points for days would have much more
substance.
>Most assembler programmers have heard this crap before from people with
>the same cross to carry from being committed to an outmoded idea of how
I write assembly, often, yet I don't label myself as an assembler
programmer. I'm interested in x86 assembly, but I am also interested in
ARM assembly, MIPS and 680x0 and PPC assembly, among other things like
ANSI C, C++, C# and OCaml. The fact that I have an open mind doesn't mean
that what I write is automatically crap.
If you think that what I write, which is what I think, btw, is crap, that's
your opinion. I don't say your opinion is crap, as I know that you are
doing something entirely different. What that precisely is I don't know,
as you are under heavy NDAs not to reveal what you are working on or
have worked on, and it's none of my business to press the issue further.
But notice that I don't think that your opinions are "crap"; I just
don't agree with them because I have different systems I am working
with and on.
Before you say that "what you doing in asm.x86 then?", the answer is
that because I am interested in x86 aswell. Thought my interest is
obviously biased towards SIMD and x86-64. The interest in x86-64 is
mostly non-practical at this stage mostly fueled at fascination at
larger registers and larger number of them especially. But that is
entirely different topic so I won't go further in that direction.
>Everybody has a theory on development time, see the VB guys for really
>fast development times.
You seem to label people as "assembler programmers", "VB guys" and
whatnot. I'm just a programmer first and foremost; languages are just tools
and I am not very religious or fanatical about them.
For me, assembler is a tool and not the only tool in the toolkit. That
doesn't disqualify me from discussing it or automatically make me a
troll like you imply. Shame on you!
>Truly amazing complexity ? :)
Truly amazing, then, to use reaction time as an argument.
>I doubt that an old notebook Pentium hits the deck as high end
>hardware.
I thought you were talking about operating systems and new software in
general; that was the argument until this point, and now you are talking
about old notebook Pentiums.
>For someone who MUST depend on other people's prewritten code, it would
>in fact be very stupid but someone who does not know the difference and
>argues that they are writing their own code, they comfortably exceeds
>such a level of stupidity.
You took Agner Fog's code and pasted it here, you created a scenario
which would be very stupid to go through (rewriting the WIN32 API for
Windows, of all things) and then, of all things, suggested that I am stupid.
I think you didn't understand what I meant: going through the
scenario of recreating the WIN32 API for Windows would be stupid, not you.
For example, creating WINE isn't stupid, as it is meant to make it
possible to run Windows software on other platforms, nor were you stupid
for making the suggestion. Neither was I for taking the bait and trying
to play a game only a fool would play.
>> I am not against assembler. None whatsoever. I just don't see how it
>> improves the code SUBSTANTIALLY in this case (strlen).
>
>Its really simple actually, Agner Fog wrote that algo back when C
>compilers were producing crap like SCASB, years after it was out of
>date legacy code for pre 486 hardware and in the context of true 486
>compatible code that can be used by nearly every machine possible that
>can run x86 windows, it is still a good algo.
What that has to do with me saying that I am not against assembler in any
way whatsoever is a mystery to me.
>Whether someone can eventually emulate it in COBOL or FORTRAN or PASCAL
>or whatever else simply does not matter as it is a viable fast
>algorithm in a general purpose context.
It looks to me that Dr. Fog wrote it based on VAX code (written in C?),
if Dr. Hyde remembers correctly in one of his posts, so the code was
originally C to begin with -- so you could add X86 ASSEMBLY to the above
list, if you don't mind?
>info somewhere else. It may be something new to you but most programmers
>already know that you don't do more work on code and/or data than you
>need so when you get the length of string data, you normally store it
>somewhere
Then why an assembly strlen()? As I understand it, the OP didn't ask how
to compute the length of a zero-terminated string in x86 assembly, but
for a faster strlen() implementation; that implies a C API and warrants
such a suggestion.
>Well, what is all the noise about then ? I suggest it is you with an axe
>to grind about assembler programming that is making the noise and YES
>you are trolling in a topic that was a member asking about an assembler
>string length algo.
That noise is called discussion, and you are taking part in it. I don't
have an axe to grind -- to each his own and all that. That doesn't mean I
cannot express my points of view.
And a correction: it was a question asking for a faster strlen(), not for
an assembler string length algo(rithm). He also explicitly asked about
possible MMX enhancements.
>each processor but to put it into context, how often can a user justify
>the expense of buying a late model high end box so that an SSE2 algo
>can read a string length for a typed input slower than a normal integer
>version ?
I think about that in different terms. I don't think that I will buy a
high-end box just so that I can use an SSE2 algo for something trivial
like strlen(); rather, since I have an SSE2 box to begin with, I think
about how I could max it out.
I also don't think that a computer with SSE instructions, such as a
Pentium III, is all that high end. Those can be had used for less than
$50. Just looking at eBay offerings I see Pentium III laptops going for
$100-$200 (which is expensive for that sort of junk, IMHO -- I have seen
way cheaper) and under $100 for desktop systems. And that was just the
first page of hits; search for "Pentium III" on ebay.com ...
That may be a lot of money for some people, but then a Windows license
is much more expensive than that. They would be better off with a free OS
plus freeware and open source software, which do the same thing and
generally have lower hardware requirements, especially if not running KDE
or GNOME.
>Not only don't I care but I doubt that he does either. You may not be
>familiar with the situation but I regularly deal with assembler
>programmers who write assembler code daily so its not the pie in the
>sky mystery you appear to be assuming.
I am not assuming, or thinking for that matter, that it is a
pie-in-the-sky mystery. Neither am I against using assembler, nor am I
against using assembler for trivial, non-time-critical tasks. I just
don't do that myself.
>All you are saying here in the absence of grasping general purpose x86
>assembler is that you prefer to avoid the major market that does not
Who here is in the absence of grasping general purpose x86 assembler?
>own a 64 bit Opteron. It probably matches your view on multiport code
>that does the same.
You seem strangely obsessed with mentioning the Opteron in connection
with me; where did that come from?
>Most people with high end boxes already have this stuff but there are a
>very large number of people who use a win3.? box for word processing,
>DOS boxes for database software for stock inventory and a host of
>things they don't need or want to change.
If I had a lot of requirements like that on the software I write, I
would think that is a very important point. But as I don't, I think
it is a moot point. That is just how things swing for me, and it
doesn't mean I disrespect the way things worked out for you, as you
seem to believe.
I write a lot of tools as part of my work, and some stuff like
simulations can run for days. Any time we can cut off is money in
the bank. At this time I don't write software for desktop computers,
unless it is some tool or other, but then the environment it runs in is
in-house development systems and server clusters.
A different world. You cannot say anything about "market shares" for
work that simply has to be done. For what you do, market share seems to
be everything; that's all fine with me, really.
I would go as far as to say that it is a very good argument for
convincing yourself that you are doing the right thing, which I have no
doubt you are doing, really.
If you try the SSE code, how does it scale against your unrolled
scanner? I mean, if we want to stick to the technical discussion and not
troll about market shares, etc.
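For reference, the kind of inner loop I have in mind looks roughly like
this in C with SSE2 intrinsics. This is only a sketch, not the exact
fragment I posted: the function name is made up, and it leans on the
assumption that an aligned 16-byte load never crosses a page, so reading
a few bytes before the start or past the terminator is harmless in
practice.

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: compare 16 bytes at a time against zero and use the byte
   mask to find the terminator. */
size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    /* round the pointer down so every load is 16-byte aligned */
    const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
    unsigned off = (unsigned)(s - p);
    unsigned mask = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    mask &= ~0u << off;               /* ignore bytes before the start */

    while (mask == 0) {               /* no zero byte in this block */
        p += 16;
        mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
    }
    /* lowest set bit = position of the NUL within the 16-byte block */
    unsigned idx = 0;
    while (!(mask & 1u)) { mask >>= 1; ++idx; }
    return (size_t)((p + idx) - s);
}

How it scales against the unrolled DWORD scanner surely depends on
string length and alignment; I have not benchmarked it against your
code, which is exactly why I asked.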
>You are trying to jump on the Microsoft bandwagon here with bigger,
>better, faster smarter etc .... but the mass market stopped buying at
>the end of the dot com boom and the computer industry has been
>floundering since.
Excuse me, I am trying to do what? Please explain.
>I would love to have a play with a 512 Itanium SGI box just to see how
>fast it was but there is no way I am going to fund one to find out.
A little bit of an extreme reaction here? I was merely writing a little
code which uses x86 SIMD, which isn't as uncommon as you are leading me
to believe. (0.02% market share? Which market would that be... where do
you pull these numbers from?)
>No, you did by trolling an assembler newsgroup with C++ crap that few
>would be interested in. I was one of those who was as I understood the
>work to try and emulate assembler in the crippled assumptions of a high
>level language but you can be sure I am not one iota interested in
>trolling for C++, I have heard it all before.
You have all the right in the world to think what you want.
>Yawn, I have seen megabytes of SSE(2) code in the last 5 or 6 years
>written by people who are very good at it, tell me something new.
The code in question is only a few lines and does 3x more work per
instruction than the code you posted. Don't knock it before trying it.
Also, look in the mirror: the post was very short and compact, while
this discussion about market shares and other non-topical things brought
in by you is not.
>No, SSE(2) code run really badly on older hardware. :) Hardware says
>naughty things to you like invalid opcode and generally pulls the plug
>unless you have some form of exception handling in place.
You seem somehow upset that there are improvements made to the x86
since the 486 and consider them generally "bloat" and "OOP(S)" crap. I
disagree that the enhancements are useless. I agree that they are most
probably useless to software that you write.
See? We can be in agreement here without any friction.
>multitude of languages from Pascal to basic to VB and scripting and the
>like so anyone assuming that C++ is at the leading edge of rapid
>development is plain kidding themselves.
I never said such a thing. But I think that you can get the trivial
stuff out of the way quicker and more easily with C and C++ than in
assembler. There's a world of difference between what I say and what
you claim I say.
What isn't a performance problem isn't bloat either. You can write very
bloat-free, lean and mean programs in C++. Four-kilobyte intros are
written in C++, for crying out loud. It's not about the language but
about the programmer who writes the bloat, when you compare assembler,
C and C++.
Yes, it has been said that in the absence of knowing what you are
doing you can easily write more bloat in an HLL than in assembler. But
that is only because you can write more code in the same time in an HLL
than in assembler, and that is assuming a hypothetical typical programmer.
If a typical programmer were an assembly programmer, I would be very
surprised, wouldn't you?
>C used to be an excellent language for OS development but by no stretch
>of the imagination is it simple or fast to develop in. Remove its
>precanned libraries and an assembler will kick its ass. Treating a C
>compiler as a premium code generation tool displays some ignorance of
>what a C compiler is good at.
How come no major operating system has ever been written entirely in
assembler? I know about a few projects that were never heard from again,
but a MAJOR operating system which commands, say, the majority of the
market (your words) -- think Windows -- why isn't it written entirely in
assembler?
Because an undertaking of that complexity would be totally unrealistic to
write in assembler, that's why.
While I am forced by your arguments to "find reasons" why assembler Is
Bad, that is not generally my attitude or opinion. I use assembler,
believe it or not. I am interested in assembler, too, believe it or not.
So far you have chosen not to believe it -- so you're calling me a liar now?
The sooner you stop thinking of me as some kind of anti-assembler troll,
the sooner you can have the calm and rational mindset for a rational
discussion.
>No perhaps not but you are trying to put it in a little box on the side
>of your own C++ programming and further, you are trying to inflict this
>view on other people who don't hold your assumptions.
I'm also putting an explanation alongside the expression of my opinion;
that is not an immoral or wrong thing to do. People who can think for
themselves don't need you to defend their own opinions.
>Congratulations, so have many of us, we just don't all try and shoehorn
>it into your little box and this is what this discussion has been
>about, you trying to inflict your view on how other people should write
>their own code.
The famous pot-calling-the-kettle-black situation.
>I don't personally care if you chisel code on granite blocks and read
>it into a computer with an OCR reader but try and inflict it on others
>and you will end up hearing why they don't agree.
You can speak only for yourself.
> >Where I have posted 2 complete working algorithms that both run on a
> >486 upwards, I saw as your contribution a fragment of SSE code with a
> >throw away line after it and a lot of waffle about trying to emulate
> >asm in C++.
>
> That fragment is essentially the innerloop, it works, too.
Fine, but you are imposing high level language "inner loop theory" on
people in an x86 newsgroup who are not saddled with your assumptions.
> I would say, that the userbase I am targeting with the multiport code
> is much larger than the market you command.
I seriously doubt you know much of any market I have worked on.
> I'm developing microcode
> and drivers for mobile phones. This a market, which just last year was
> in excess of 600 million units (based on the data that Nokia with 30%
> market share in 2004 did ship 200 million units alone).
Sounds like a viable way to make a buck but the assumptions of writing
hardware for portable gadgets is hardly the background for telling
people who have written commercial x86 assembler for years how to do
it. Having held Nokia shares in 2000 and got out before they fell
through the floor, you will have to forgive me for not being impressed
with their performance.
> If you note my first reply, it was on the topic and giving instructions
> on much smarter practices to avoid the need to optimize strlen() in the
> first place. It was you who implied that assembly is a must in this
> case and HLL compilers should effectively be ignored.
Same comment as before: string length information does not come by
immaculate conception, you must get it from somewhere. Once you have
got it you usually store it somewhere. You may be hinting at the bad
programming practice of not saving the data and having to repeatedly
regain it, but this has nothing to do with a StrLen() algo; it has to do
with fundamental code design.
> That isn't the case, as I demonstrated quite clearly the HLL code does
> compile into precisely the *same* innerloop as your assembly piece.
If I remember correctly, you spent the time trying to emulate an old
and well known algorithm and eventually got something like the same
timings, fine if it works for you but this algo has been in many
libraries for many years so you prove little apart from being able to
eventually emulate a 9 year old algo.
> Note that is not a statement against assembly language, I as many
> others write assembly and mix it with HLL languages. It does not mean
> we are *against* assembly or saying that it shouldn't be used.
Perhaps not, but it's the same problem as I mentioned in an earlier post:
you are attempting to place other people's coding capacity in your own
little box as a subset of how you use a high level language, yet there
is a vast number of experienced assembler programmers out there that
don't need the confines of your little box. While you speak of high
level language "inner loop theory", there is a multitude of assembler
programmers who comfortably write inner loops, outer loops,
intermediate loops, interdependent loops, and a mountain of other
freestyle variations.
I certainly did not introduce the claptrap of multiport code and C++
into this discussion, you did and you did to try and impose a set of
restrictions on other people on how they write assembler code based on
your own high level language disposition. I will with indifference to
your views post bits of code if I have it around to help out members
who ask for something and I don't particularly care if this does not
fit into your language preferences.
Like many who have written assembler code for a long time, I have heard
all of this crap before and it usually came from people who resented
the performance advantages of true low level code or the lack of need
>to conform to arbitrary standards of other languages. It used to be
open ridicule and abuse but enough pure assembler missiles went past
them to shut the noise up.
> It looks to me that Dr. Fog wrote it based on VAX code (written in C?),
> if Dr. Hyde remembers correctly in one of his posts, so the code was
> originally C to begin with -- so you could add X86 ASSEMBLY to the above
> list, if you don't mind?
Perhaps you should leave historical analysis alone. The vast majority
of historical algorithms existed before C did. C A Hoare wrote in the
60s, Shell in the 70s, Bob Boyer's BM search algo was written in PDP10
assembler, and Knuth designed in his own asm dialect. It was people like
Robert Sedgewick who developed algorithms mainly in C during the 80s,
yet the vast majority of fundamental algorithm design was up and going
before that, and you can look at languages like COBOL, Fortran, Pascal
and a few other old timers.
> And correction, it was a question asking for faster strlen() not for
> assembler string length algo(-rithm). He also explicitly asked for
> possible MMX enhancements.
In an x86 newsgroup, it does mean an assembler question, otherwise the
member probably would have posted in an "Object Pascal" or other
newsgroup.
> >Yawn, I have seen megabytes of SSE(2) code in the last 5 or 6 years
> >written by people who are very good at it, tell me something new.
>
> The code in question is only a few lines and does 3x more work per
> instruction than the code you posted. Don't knock it before trying.
Same comment as above: Yawn, I have seen many megabytes of very well
written SSE(2) code over the last few years, so a fragment of SSE code
is no ground-breaking achievement, and it runs really badly on a pre-SSE
processor.
This discussion ran out of interest when you stopped posting results of
your HLL optimisation and started waxing lyrical about the C++ compiler,
multiport code and where assembler programming fitted into your scheme.
You sound like you know what you are doing in your own area of
expertise but I seriously doubt you have convinced anyone apart from
yourself as to the virtues of trying to place assembler programming
into the confines of the box you have in mind.
Regards
I have not done such a thing. I said that writing strlen() in
assembly wouldn't bring substantial performance benefits. Then I wrote
HLL code to prove that case after you said, in effect, that HLLs suck.
>I seriously doubt you know much of any market I have worked on.
I've no doubt you have worked on many different markets. I only see work
that must be done and figure out ways to get it done with different
tradeoffs, which change from one project to the next.
I see. So you think SSE is not very useful for you, fine! I believe
you!
>Sounds like a viable way to make a buck but the assumptions of writing
>hardware for portable gadgets is hardly the background for telling
>people who have written commercial x86 assembler for years how to do it.
Now you label my background as that of someone who develops for portable
gadgets; while you're at it, remember to label me an "x86 assembly
programmer" too, because I have done that a great deal and have a
background in it (among other things).
Or better yet, don't label people at all, that usually works better --
at least try to keep the labeling to yourself.
No, I'm not telling you how to write x86 code. I'm asking what's wrong
with the SSE code; instead of simply stating that you don't need it
yourself, you start this long lecture about market shares (!), which I
don't really think are as you present them, but that is another topic.
>it. Having held Nokia shares in 2000 and got out before they fell
>through the floor, you will have to forgive me for not being impressed
>with their performance.
It doesn't matter who the market leader will be in the future; I only used
that to give an idea of how large the market is. 200 million devices
commanded 30% of the market in 2004; I have no agenda related to Nokia. I
quoted what my numbers were based on, nothing more, which is usually more
credible than throwing 0.01% or 0.02% out of thin air.
>programming practice of not saving the data and having to repeatedly
>regain it, but this has nothing to do with a StrLen() algo, it has to do
>with fundamental code design.
Yes, that's what I wrote in my first post.
>If I remember correctly, you spent the time trying to emulate an old
>and well known algorithm and eventually got something like the same
>timings, fine if it works for you but this algo has been in many
Actually, I got the same innerloop on the first compile. What I did was
simply to use variable names like "a" for eax, and so on. It was fairly
trivial and took, I think, no more than 10 minutes or so. I went back to
it later to correct a bug which I found in regression testing and put
some effort into the postfix since I figured I could use the code.
You never asked, you just assumed.
>libraries for many years so you prove little apart from being able to
>eventually emulate a 9 year old algo.
The assembly code emulates even older VAX C code, so you should stop
using this argument. I already made this point before.
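For anyone following along, the DWORD trick we keep circling around
looks roughly like this when spelled out in C. This is a sketch of the
general zero-byte test only, not Dr. Fog's actual code; the function
name is mine, and it assumes that reading the tail of the string in
aligned 4-byte units is acceptable.

#include <stdint.h>
#include <string.h>
#include <stddef.h>

size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* step byte by byte until the pointer is 4-byte aligned */
    while (((uintptr_t)p & 3) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        ++p;
    }

    /* scan a DWORD at a time; the expression is non-zero exactly
       when one of the four bytes in v is zero */
    for (;;) {
        uint32_t v;
        memcpy(&v, p, sizeof v);      /* aligned 4-byte read */
        if ((v - 0x01010101u) & ~v & 0x80808080u)
            break;
        p += 4;
    }

    /* locate the exact zero byte inside the final word */
    while (*p != '\0')
        ++p;
    return (size_t)(p - s);
}

A compiler should turn the middle loop into much the same handful of
instructions as the hand-written DWORD versions, which was my point all
along.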
>little box as a subset of how you use a high level language, yet there
>is a vast number of experienced assembler programmers out there that
>don't need the confines of your little box. While you speak of high
The box is a figment of your imagination; I am like a sponge for new
tools and ideas and keep my mind open. What you are talking about is
some kind of average C++ evangelist, who you persistently assume me to
be.
>level language "inner loop theory", there is a multitude of assembler
>programmers who comfortably write inner loops, outer loops,
>intermediate loops, interdependent loops, and a mountain of other
>freestyle variations.
I was thinking more about what you put inside the loop, not the loop
control itself. When you write, say, FPU code, you keep track of what is
in which FPU stack position. That sort of manual labour can be a fairly
good way to spend time. Of course you will disagree with this on
principle. :)
>I certainly did not introduce the claptrap of multiport code and C++
>into this discussion, you did and you did to try and impose a set of
>restrictions on other people on how they write assembler code based on
>your own high level language disposition. I will with indifference to
I did not impose any restrictions on anyone; that claim is pure fiction.
I ask you: is there a MASM for Linux? That is one platform where x86 is
very common, too. My code works on x86-based Linux just dandy with g++;
yours doesn't, and yet you make that out to be a bad thing.
I wouldn't even discuss the point but you keep bringing it up. Each
time, you think of some excuse why the x86 code snip is superior in
almost every regard.
>Like many who have written assembler code for a long time, I have heard
>all of this crap before and it usually came from people who resented
>the performance advantages of true low level code or the lack of need
Do I come across as a person who resents the performance advantages of
well-crafted code, no matter what language is being used? Have I
implied a single time that your opinion is "crap" in any way?
>to conform to arbitrary standards of other languages. It used to be
>open ridicule and abuse but enough pure assembler missiles went past
>them to shut the noise up.
What are you on about?
>Perhaps you should leave historical analysis alone. The vast majority
>of historical algorithms existed before C did. C A Hoare wrote in the
In that case they also, pretty certainly, existed before x86 assembly.
>In an x86 newsgroup, it does mean an assembler question, otherwise the
>member probably would have posted in an "Object Pascal" or other
>newsgroup.
Or he thought it would be possible to gain a meaningful performance
increase from writing this function in assembler; I don't see that
happening in this thread.
>Same comment as above: Yawn, I have seen many megabytes of very well
>written SSE(2) code over the last few years, so a fragment of SSE code
>is no ground-breaking achievement, and it runs really badly on a pre-SSE
>processor.
Just as 386 code runs badly on a 286, some 286 programmers might think
that Dr. Fog's 386 code isn't a ground-breaking achievement, as they have
seen many megabytes of 386 code.
Okay, okay, I get it already! You don't use or need SSE, fine, fine!
>You sound like you know what you are doing in your own area of
>expertise but I seriously doubt you have convinced anyone apart from
>yourself as to the virtues of trying to place assembler programming
>into the confines of the box you have in mind.
Implementing complex state machines with native binary generators -is-
where the action is. Java JIT compilers, .NET IL translators and
similar are examples of this. Whether you like it or not, Windows Vista
is coming in 2006 if their schedules permit. It's not just me; the
industry is going in that direction, driven by Sun, IBM, Microsoft
etc.
It's not a box, it's the future.
That doesn't exclude us from enjoying developing in handwritten assembly
when it is useful. For you that is the primary development language?
Congratulations, but don't assume I am against assembly just because it
isn't mine (at this time, that is; the situation is always in flux and
changes depending on requirements).
> I have not done such a thing. I said that writing strlen() in
> assembly wouldn't bring substantial performance benefits. Then I wrote
> HLL code to prove that case after you said, in effect, that HLLs suck.
This is only recycling the same loop logic as before. This translates
to: if you spend long enough, you can almost emulate an old assembler
algo in an HLL. So what ?
The guy who wants a byte scanner has no need to feed it through your
assumptions, just like the guy who wants a DWORD version like Agner Fog's
does not either. Simply because you can cobble together something
similar does not mean that anyone else has to look through other
language implementations before they use it.
Do you keep up with Forth or Pascal algos ? I hope for your own sake
that you don't. How about the bleeding edge of VB algo design ?
This conversation turned into a nothing when you shifted from code to
waffle about C++ and while it may have gone past you, this is an x86
assembler newsgroup, not a C++ trolling place.
MASM for Linux ? Hold your breath waiting. :)
Regards,
> That doesn't exclude us from enjoying developing in handwritten assembly
> when it is useful. For you that is the primary development language?
> Congratulations
This is probably the funniest sentence in all this
demented discussion. It seems you do not know who and
what Hutch--, --, --, ... is. So here it is:
He is a Power Basic programmer, who occasionally inserts
Asm Code into his HLL developments, and he never wrote
anything significant in full Assembler. His most
significant Application, in full Assembly, is the
Editor found in the MASM32 package. Just take a look at
it, and you will understand better with whom you are
debating, and the reason why he cannot live without such
debates and such attitudes, pushing the readers towards
interpretations that are nothing but the exact reverse
of the facts.
;)
Betov.
During the years 1990-1995 Moore's Law applied to performance as well
as to the number of transistors found on a chip. Prior to that (in
microcomputer designs), performance was a bit behind the curve, and
obviously, after that, we're also a bit behind the curve. Moore's Law
(doubling of transistors every two years) is still working, but
designers can't figure out how to use those extra transistors in order
to speed up the processors (well, other than doubling cores, which
*doesn't* double performance).
Cheers,
Randy Hyde
The C++ compilers usually do the padding at 32-bit boundaries, and this
helps a lot when the processor reads code into the read-ahead buffer. I
think this one reason is sufficient not to code in assembly (except, of
course, the core super-routine of your program).
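If anyone wants to see that padding for themselves, one way (assuming
gcc here; the file and function names are just an example) is to compile
a small loop to assembly and look at the alignment directive the
compiler places in front of the loop label:

/* loop.c -- try:  gcc -O2 -S -falign-loops=16 loop.c
   then look for the .p2align directive emitted before the loop label
   in loop.s; the exact padding depends on the target and the options
   used. */
unsigned int count_nonzero(const unsigned char *p, unsigned int n)
{
    unsigned int c = 0, i;
    for (i = 0; i < n; ++i)   /* the head of this loop gets aligned */
        if (p[i])
            ++c;
    return c;
}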
Neo
> This conversation turned into a nothing when you shifted from code to
> waffle about C++ and while it may have gone past you, this is an x86
> assembler newsgroup, not a C++ trolling place.
Indeed, this is an x86 News Group, not a Power Basic
trolling place, but for the very few here interested
in Assembly, the single important point is that your
preferred discussion theme (String Length, repeated
again and again for at least 8 or 10 years...) is
the best way to definitively kill Assembly.
As Randall Hyde, the great expert in Flex and Bison, wrote
above, if your String Length is not fast enough for you,
it is quite simple not to execute it at all, so it will take
zero time and will beat hands down all of your demented
and ridiculous "Algos", whereas all of those in need
of an effective and practical snippet will live happily with:
mov edi StringPointer, ecx 0-1, al 0
repne scasb
mov eax 0-2 | sub eax ecx
; >>> Length in eax.
Betov.
"rand...@earthlink.net" <spam...@crayne.org> wrote in message
news:1130684744.7...@z14g2000cwz.googlegroups.com...
mov edi StringPointer, ecx 0-1, al 0
repne scasb
mov eax 0-2 | sub eax ecx
; >>> Length in eax.
It is not hard to see which is the 'demented and ridiculous "Algos"'
when the author of a broken assembler is willing to post trash like
this in an assembler newsgroup.
The algorithm uses 3 registers: EDI, ECX and EAX.
The algorithm has a stall from the partial register read of AL,
followed by the write to EAX.
SCASB is particularly slow.
Here is the viable alternative in PowerBASIC inline assembler. 2
registers, shorter code and far faster than anything using SCASB.
! mov edx, src
! or eax, -1
lbl:
! add eax, &H01
! cmp BYTE PTR [edx+eax], 0
! jne lbl
The problem for the author of a broken assembler is that he is still
unable to match the clear Intel syntax of a BASIC compiler, let alone
the archetypal macro assembler for 32-bit Windows, MASM.
Same algo in MASM,
mov edx, src                  ; edx = address of the string
or eax, -1                    ; index starts at -1, first ADD makes it 0
@@:
add eax, 01h                  ; step to the next byte
cmp BYTE PTR [edx+eax], 0     ; terminator reached?
jne @B                        ; no, keep scanning
                              ; EAX = string length on exit
Further comments are awaiting a reply from group moderation.
Regards,
On which platform did REP SCAS become slower than rolling your own?
Was it 486 or Pentium?
The trend with later optimising C compilers is to separate assembler
code completely from the main C code, as the inline assembler interferes
with the compiler optimisation. I gather that 64-bit Windows compilers
will not support inline assembler at all and you will have no choice
but to create separate assembler modules if you need assembler code.
Regards,
Hi Everyone,
And there are some of us that realise that if you do a lot of string
work, then null-terminated strings are just plain awful. (Yes, I do see
some of the advantages of null-terminated strings, however at what
cost?) Better off using meta-data or a string header to give the buffer
size and string length before the string (similar to how Turbo Pascal
and, IIRC, HLA do it), so that strlen operations are blazing fast...
e.g.
struct string {
buffer_size dw ?
string_length dw ?
string_data rb ?
}
Want to find the string length, well just load the string_length
variable...
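In C terms the same idea is roughly the following -- just a sketch, with
16-bit fields mirroring the header above and nothing standard about the
layout:

#include <stdint.h>
#include <stddef.h>

struct lstring {
    uint16_t buffer_size;     /* bytes allocated for data[]     */
    uint16_t string_length;   /* bytes currently in use         */
    char     data[1];         /* string bytes follow the header */
};

/* "strlen" collapses to a single field load, no scanning at all */
static size_t lstring_len(const struct lstring *s)
{
    return s->string_length;
}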
But then again, some people won't learn...
--
Darran (aka Chewy509) brought to you by Google Groups!
> On which platform did REP SCAS become slower than rolling your own?
> Was it 486 or Pentium?
From memory it was the single-pipeline 486 where incremented
pointers became faster than the old string instructions. I think that the
old string instructions are still OK in 16-bit code, but this has
something to do with 16-bit code being slow generally on anything from an
early Pentium upwards.
There may be one of the early AMDs that is faster with some of the
older instructions. I used to own an AMD K6-2 550 that had a fast LOOP
instruction that crashed Win95b.
> And there are some of us, that realise that if you do a lot of string
> work, then null terminated strings are just plain awful. (Yes, I do see
> some of the advantages of null terminated strings, however at what
> cost)? Better off using meta-data or a string header to give the buffer
> size and string length before the string (similar to how Turbo Pascal
> and IIRC HLA does it), so that strlen operations are blazing fast...
Your words are nothing but _one_ of the ways I was
thinking about when saying that it is quite simple
to _NOT_ compute any String Length.
So said, in Assembly, this is evidently not the job
of the language to do anything like this, but the job
of the programmer to choose the most appropriated
method for the actual problem. Your Turbo Pascal
and HLA are HLL, and you cannot promote any such
abusive generalization for Assembly, even if the
method may, occasionaly, be the one of choice.
In practice, when developing PEs there is very little
need of any String Length computation.
Betov.
> SCASB is particularly slow.
Intel problem. Not the programmer's problem.
AMD REP SCASB is not that slow and, anyway, again,
if your String Length is not fast enough for you,
keep away from any, and it will take no time, at
all. Period.
Betov.
> > SCASB is particularly slow.
>
> Intel problem. Not the programmer's problem.
>
> AMD REP SCASB is not that slow and, anyway, again,
> if your String Length is not fast enough for you,
> keep away from any, and it will take no time, at
> all. Period.
Documentation and objective testing say otherwise. Intel recommend NOT
using it and leave it there for backwards compatibility. The code is
larger and uses more registers apart from being far slower than the
smaller code.
String length information does not come by immaculate conception, it
has to be obtained somewhere. Writing out of date, oversized
sub-standard code with ancient technology is a particularly bad design
decision that is consistent with the author of the broken assembler.
As the author of the broken assembler seems to have some problem with
BASIC, he can learn from the posted BASIC inline assembler code that is
smaller and faster than his own.
Somewhere on or after the Pentium. Indeed, someone questioned a post
of mine the other day (private email) and I reverified that on the PIV,
rep scasb is *slower* than the code:
t=s;
while( *s ) ++s;
return s-t;
written in C (without optimization, which compiles to a very
straight-forward implementation in assembly of the C code above).
SCASB is great for those wanting to save space, but not for those who
want to run faster.
Cheers,
Randy Hyde
> String length information does not come by immaculate conception, it
> has to be obtained somewhere. Writing out of date, oversized
> sub-standard code with ancient technology is a particularly bad design
> decision that is consistent with the author of the broken assembler.
If you don't like SCASB, feel free to keep away from it,
and give us a break from your insanities, while I go on
living happily using SCASB for RosAsm, which is, up to now,
the fastest of the current Assemblers:
1 MB of source -- with lots of HLLisms -- per second on a poor
Celeron 1.3.
:)
Betov.
Hi Rene,
Exactly my point. Too many people assume that, because something is
provided with the HLL (aka the standard libraries), or because certain
items are implemented a particular way, it must be the best way.
Unless, of course, you're handed a zstring with no indication of the
length and you have to compute it yourself. If you don't do it very
often, feel free to use SCASB; it won't really matter how slow your
code runs.
>
> So said, in Assembly, this is evidently not the job
> of the language to do anything like this, but the job
> of the programmer to choose the most appropriated
> method for the actual problem.
And, sometimes, the most appropriate method is to compute the length of
a zstring, because you don't *get* the length out of thin air.
> Your Turbo Pascal
> and HLA are HLL,
Back to attacking HLA again?
Of course, HLA strings carry the length around with them. So there is
no such length computation.
> and you cannot promote any such
> abusive generalization for Assembly, even if the
> method may, occasionaly, be the one of choice.
And neither HLA nor Turbo Pascal require you to use the string formats
their library supports. Both support zstrings, for example. Both let
you put characters into an array and manipulate that character data any
way you like. The programmer can decide to do whatever they choose.
Neither the language nor the library requires them to do it however
you're thinking it must be done. There is no reason, for example, that
an HLA programmer (or even a Turbo Pascal programmer, via BASM) could not
write all their string length functions using SCASB, just as you claim
to do inside RosAsm :-).
>
> In practice, when developing PEs there is very little
> need of any String Length computation.
What on earth does the file format of the executable file have to do
with the algorithms and data structures used in the application? The
fact that you don't need to compute string length in RosAsm much does
not imply that it doesn't need to be computed in other applications. I,
too, have written lots of apps that don't compute the length of strings
(at all), I certainly wouldn't generalize such experience and claim
that there is no need for such a function in any program that compiles
to the PE file format; that's just ridiculous.
Cheers,
Randy Hyde
>> So said, in Assembly, this is evidently not the job
>> of the language to do anything like this, but the job
>> of the programmer to choose the most appropriated
>> method for the actual problem.
>
> And, sometimes, the most appropriate method is to compute the length of
> a zstring, because you don't *get* the length out of thin air.
I know that you are not an expert in matters of Win32
API programming, but you should at least know that
most of the Win32 Functions that have to deal with
Strings do effectively provide "the length out of
thin air".
>> Your Turbo Pascal
>> and HLA are HLL,
>
> Back to attacking HLA again?
Where? HLA is not an HLL? Saying that HLA is an HLL is
"attacking HLA"? Or is the simple flat truth far too cruel?
> Of course, HLA strings carry the length around with them. So there is
> no such length computation.
Feel free to implement whatever seems accurate to you
to implement in your HLL Pre-Parser: this is _your_
HLL Pre-Parser. :)
>> and you cannot promote any such
>> abusive generalization for Assembly, even if the
>> method may, occasionaly, be the one of choice.
>
> And neither HLA nor Turbo Pascal require you to use the string formats
> their library supports. Both support zstrings, for example. Both let
> you put characters into an array and manipulate that character data any
> way you like. The programmer can decide to do whatever they choose.
> Neither the language nor the library requires them to do it however
> you're thinking it must be done. There is no reason, for example, that
> an HLA programmer (or even a Turbo Pascal programmer, via BASM) could not
> write all their string length functions using SCASB, just as you claim
> to do inside RosAsm :-).
Great. Congratulations. :))
>> In practice, when developing PEs there is very little
>> need of any String Length computation.
>
> What on earth does the file format of the executable file have to do
> with the algorithms and data structures used in the application?
Same answer as above: for example, when you retrieve a String
from an EditBox, the API tells you the length for free.
> The
> fact that you don't need to compute string length in RosAsm much does
> not imply that it doesn't need to be computed in other applications. I,
> too, have written lots of apps that don't compute the length of strings
> (at all), I certainly wouldn't generalize such experience and claim
> that there is no need for such a function in any program that compiles
> to the PE file format; that's just ridiculous.
In case you did not yet know, RosAsm is an Assembler, that
is to say a program that reads a text File and parses it
through a series of Parsers, which is a good example of String
work. So I fail to see where and when I could have said that
absolutely no 'StringLength' computation could ever be needed:
I said that discussing the _optimizing_ of such an Algo is absurd,
as long as, in cases where there could be any speed problem
with it, several other methods exist (including the
one you are pushing as the universal solution -- which it is
evidently not). Nothing but the usual major key-word of
Assembly: Strategy Optimizations.
Also, given the speed of RosAsm, as opposed to you and Hutch,
I can prove my claims by my own work. Period.
Betov.
A possible optimization to strlen would be to end every string with two
NUL bytes ("\0\0") instead of one. This would allow you to test only
every second byte and decrease the time taken by up to half. It would
require an extra byte of space for every string, and it wouldn't
technically be standard C, but for very long strings it could be worth it.
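As a sketch of what that scan could look like in C (assuming every
string really is stored with the two trailing zero bytes; the function
name is made up, and whether the extra bookkeeping wins in practice
would need measuring):

#include <stddef.h>

/* Requires the string to end with "\0\0". Only even offsets are
   tested; a terminator at an odd offset is still caught because
   the byte after it is also zero. */
size_t strlen2(const char *s)
{
    size_t i = 0;

    while (s[i] != '\0')
        i += 2;

    /* the zero we stopped on is either the terminator itself or
       the byte just after it; the previous byte disambiguates */
    if (i > 0 && s[i - 1] == '\0')
        return i - 1;
    return i;
}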