
Optimize stricmp() algorithm (caseless string compare)


Rick C. Hodgin

Jun 24, 2017, 11:14:20 AM
Can anybody help me optimize this code?

https://pastebin.com/rv8EfxSP

It doesn't use the stack for parameters, but two global variables
called leftptr and rightptr which hold the input values. It does
return its result in the eax register, so it can be used normally
that way for assignment.

It's designed to be used as a custom assembly algorithm for a
stricmp() algorithm which follows this general pattern, which
is designed to be the target of a qsort() callback:

int caseless_compare(const void *p1, const void *p2)
{
    int d;
    const unsigned char *s1 = p1;
    const unsigned char *s2 = p2;

    while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)
        s1++, s2++;

    return d;
}

Note that this algorithm returns negative, zero, or positive values,
which is different from strcmp() implementations that return exactly
-1, 0, or 1, but it still works generically in this case because the
result goes straight back to qsort(), which only checks for negative,
positive, or equal.
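
For reference, here's the sort of driver it's meant to plug into (a
minimal sketch, not my actual test harness; note the table is an array
of fixed-size rows, so qsort() hands the comparator pointers to the
string bytes directly rather than char** pointers):

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

int caseless_compare(const void *p1, const void *p2)
{
    int d;
    const unsigned char *s1 = p1;
    const unsigned char *s2 = p2;

    while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)
        s1++, s2++;

    return d;
}

int main(void)
{
    /* fixed-size rows; an array of char* would instead need a
       comparator that dereferences the char** arguments first */
    static char names[4][8] = { "delta", "Alpha", "charlie", "Bravo" };
    int i;

    qsort(names, 4, sizeof names[0], caseless_compare);

    for (i = 0; i < 4; i++)        /* prints Alpha Bravo charlie delta */
        printf("%s\n", names[i]);
    return 0;
}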

I tried variations of the branch direction in my asm code, and this
one I posted was the best of those I tried.

There's room for improvement, but I don't know what to look for.
Right now it is faster than the 32-bit code Visual Studio 2015
generates for the caseless_compare() block above. The compiler's
version scores 131 in my testing, and the assembly version I created
scores 107, the scores being times in milliseconds on this test
(lower is better).

Any help is appreciated. Thank you in advance.

Thank you,
Rick C. Hodgin

Terje Mathisen

Jun 24, 2017, 5:50:14 PM
Rick C. Hodgin wrote:
> Can anybody help me optimize this code?
[snip]
> It's designed to be used as a custom assembly algorithm for a
> stricmp() algorithm which follows this general pattern, which
> is designed to be the target of a qsort() callback:
>
> int caseless_compare(const void *p1, const void *p2)
> {
>     int d;
>     const unsigned char *s1 = p1;
>     const unsigned char *s2 = p2;
>
>     while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)
>         s1++, s2++;
>
>     return d;
> }
>

Just writing a version of this code which can handle national 8-bit
character sets is more interesting, you pretty much have to use some
form of lookup table, i.e.

while (d = tolower[c = *s1++] - tolower[*s2++] && c) {};

With Unicode characters (up to 21/31 bits wide?) it is of course
impossible to use a simple lookup table; you need some kind of
algorithmic conversion.

It is probably significantly faster to just create a special key table
containing a monocased version of the strings, sort that, and then use
the resulting sort order on the original table.

For searching I've found that you can get very good results by creating
two new copies of the target string, one that is all uppercase and
another which is all lowercase; then when looking for a match at a given
string offset you allow either target copy to match.
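
A sketch of that search idea (ASCII only; the interface is made up for
illustration, with up and lo being preallocated all-uppercase and
all-lowercase copies of the pattern):

#include <string.h>

/* Return the offset of the first caseless match in hay, or -1.
   At each offset, each byte may match either the uppercase copy
   or the lowercase copy of the pattern. */
long caseless_find(const char *hay, const char *up, const char *lo)
{
    size_t n = strlen(up), h = strlen(hay), i, j;

    if (n > h)
        return -1;
    for (i = 0; i + n <= h; i++) {
        for (j = 0; j < n; j++)
            if (hay[i + j] != up[j] && hay[i + j] != lo[j])
                break;
        if (j == n)
            return (long)i;
    }
    return -1;
}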

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Rick C. Hodgin

Jun 24, 2017, 6:20:17 PM
On Saturday, June 24, 2017 at 5:50:14 PM UTC-4, Terje Mathisen wrote:
> Rick C. Hodgin wrote:
> > Can anybody help me optimize this code?
> [snip]
> > It's designed to be used as a custom assembly algorithm for a
> > stricmp() algorithm which follows this general pattern, which
> > is designed to be the target of a qsort() callback:
> >
> > int caseless_compare(const void *p1, const void *p2)
> > {
> >     int d;
> >     const unsigned char *s1 = p1;
> >     const unsigned char *s2 = p2;
> >
> >     while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)
> >         s1++, s2++;
> >
> >     return d;
> > }
> >
>
> Just writing a version of this code which can handle national 8-bit
> character sets is more interesting, you pretty much have to use some
> form of lookup table, i.e.
>
> while (d = tolower[c = *s1++] - tolower[*s2++] && c) {};

This minor tweak has given the best results so far:

Rick2 = 123
Bart = 117
Ben = 116
RickAsm = 105
BartAsm = 104
Terje C = 96

The original algorithm yours was based on is the Ben score above.
Your little tweak gave it a 20-point gain.

FWIW, Terje, I am always impressed with your optimization skills.
That sentiment dates back to the 90s as well. I've always viewed
you as the best assembly developer I know, and I admire your mind
and skills in considering the variables in tasks like this.

Rick C. Hodgin

Jun 24, 2017, 7:05:21 PM
On Saturday, June 24, 2017 at 5:50:14 PM UTC-4, Terje Mathisen wrote:
> Rick C. Hodgin wrote:
> > Can anybody help me optimize this code?
> [snip]
> > It's designed to be used as a custom assembly algorithm for a
> > stricmp() algorithm which follows this general pattern, which
> > is designed to be the target of a qsort() callback:
> >
> > int caseless_compare(const void *p1, const void *p2)
> > {
> >     int d;
> >     const unsigned char *s1 = p1;
> >     const unsigned char *s2 = p2;
> >
> >     while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)
> >         s1++, s2++;
> >
> >     return d;
> > }
>
> Just writing a version of this code which can handle national 8-bit
> character sets is more interesting, you pretty much have to use some
> form of lookup table, i.e.
>
> while (d = tolower[c = *s1++] - tolower[*s2++] && c) {};

Someone pointed out that this does not produce correct results,
because it has to continue so long as d is 0, hence you need an
== in there, or a !(d...) expression.

When those are re-introduced, this version performs identically
to the previous Ben version.

I was able to update my assembly version to remove an xor and a
redundant jmp. The score improved from 105 to 102.

Terje Mathisen

Jun 25, 2017, 1:50:45 AM
Oops, as you've already noticed I forgot to keep the comparison, i.e.
the diff needs to be zero. :-(

However, as long as this is ~the same speed as the 7-bit only code then
it is a big win in my book since it would support any 8-bit character
set including national language versions of both ASCII and EBCDIC.

What I intended to write more about yesterday was how to make a
reasonably fast Unicode version:

Do a setup stage first, either to monocase all the keys, or to create a
hash table consisting of the lowercase version of all the individual
characters seen in those keys.

With monocased keys the problem is of course already solved, at a cost
of N*M (number of keys * average key length) tolower() calls, while a
hash table would give near-linear lookup speeds after first calling
tolower() the same number of times.

Personally I would have gone to great lengths in the old days to
minimize RAM use, but today I would much rather use the monocased keys
approach.
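
A sketch of the monocased-keys approach (strdup() is POSIX; error
handling omitted):

#include <stdlib.h>
#include <string.h>
#include <ctype.h>

struct key { char *folded; char *orig; };

static int cmp_folded(const void *a, const void *b)
{
    /* plain strcmp() suffices once the keys are monocased */
    return strcmp(((const struct key *)a)->folded,
                  ((const struct key *)b)->folded);
}

/* Sort strs[0..n-1] caselessly: n*m tolower() calls up front,
   none at all during the n*log(n) comparisons. */
void sort_caseless(char **strs, size_t n)
{
    size_t i;
    char *p;
    struct key *k = malloc(n * sizeof *k);

    for (i = 0; i < n; i++) {
        k[i].orig = strs[i];
        k[i].folded = strdup(strs[i]);
        for (p = k[i].folded; *p; p++)
            *p = (char)tolower((unsigned char)*p);
    }
    qsort(k, n, sizeof *k, cmp_folded);
    for (i = 0; i < n; i++) {
        strs[i] = k[i].orig;
        free(k[i].folded);
    }
    free(k);
}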

Rod Pemberton

Jun 25, 2017, 4:05:54 AM
On Sat, 24 Jun 2017 08:03:26 -0700 (PDT)
"Rick C. Hodgin" <rick.c...@nospicedham.gmail.com> wrote:

> Can anybody help me optimize this code?
>
> [link]
Rick,

I'm wondering what happens if you break up the comparison into two
parts. E.g., do a basic character comparison without translation to
lowercase. If that fails, then do a comparison of the characters after
translation to lowercase.

I'm guessing that breaking the comparison into two would actually
improve the overall speed of comparisons, as large portions of normal
text strings will be lowercase by default, e.g., this sentence only
has one uppercase character. A list of names or a list of dictionary
words will only have the first letter upper-cased, too.

In other words, the Terje/Ben algorithm has a fixed instruction cost per
comparison, which includes lower-casing every character in the string
when, most likely, many of the characters will not need to be
lower-cased for a proper comparison. I.e., a variable-cost
comparison might be quicker if the text is mostly lowercase.

E.g., you might want something more like this (minimally tested):

int rp_stricmp(void *p0, void *p1)
{
    int a,b;
    char *x,*y;

    a=!0;
    x=p0;
    y=p1;

    while(1)
    {
        if(!a)
            break;
        a=*x;
        b=*y;
        x++;
        y++;
        if(a==b)
            continue;
        if(lower[a]==lower[b])
            continue;
        break;
    }
    return(a-b);
}

With -O2, 32-bit GCC (DJGPP for DOS) generates:

.globl _rp_stricmp
_rp_stricmp:
pushl %ebp
movl $1, %ecx
movl %esp, %ebp
pushl %esi
pushl %ebx
movl 12(%ebp), %edx
movl 8(%ebp), %ebx
L10:
testl %ecx, %ecx
je L3
movsbl (%ebx),%ecx
movsbl (%edx),%esi
incl %ebx
incl %edx
cmpl %esi, %ecx
je L10
movb _lower(%esi), %al
cmpb %al, _lower(%ecx)
je L10
L3:
subl %esi, %ecx
popl %ebx
movl %ecx, %eax
popl %esi
popl %ebp
ret

With -O2, 64-bit GCC (for Linux) generates:

.globl rp_stricmp
.type rp_stricmp, @function
rp_stricmp:
.LFB15:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
.L2:
movsbl (%rdi), %eax
movsbl (%rsi), %edx
addq $1, %rdi
addq $1, %rsi
cmpl %edx, %eax
je .L4
movslq %edx, %rcx
movslq %eax, %r8
movzbl lower(%rcx), %ebx
cmpb %bl, lower(%r8)
je .L4
.L3:
subl %edx, %eax
popq %rbx
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L4:
.cfi_restore_state
testl %eax, %eax
jne .L2
.p2align 4,,7
jmp .L3
.cfi_endproc

I'm not real sure why 64-bit GCC is dropping in a few extra resizing
instructions in the main part of the loop. It appears to me that GCC
is picking incorrect register sizes ... Adjusting the declared types
didn't seem to help any, e.g., int to char, etc.


Rod Pemberton
--
The entire idea that the U.S. government obtained information from deep
sources in the Russian government that Putin ordered a cyber campaign
to disrupt our democracy is itself dezinformatsiya. Wake up sheeple.

Terje Mathisen

Jun 25, 2017, 9:06:10 AM
Rod, this is almost certainly faster in real-life situations, very nice
catch. :-)

For Unicode it is even nicer because the overhead of monocasing is so
much larger, and the extra overhead of doing an extra compare of the raw
values is close to zero (maximum is the cost of the branch instruction
itself).

I have also considered a sort-of-hybrid approach where you write a
flipcase() function, i.e. given a char it returns the opposite case if
that is defined:

int caseless_compare(const void *p1, const void *p2)
{
    wchar_t a, b;
    const wchar_t *s1 = p1;
    const wchar_t *s2 = p2;

    do {
        while ((a = *s1++) == (b = *s2++)) {
            if (!a) return 0;
        }
    } while (flipcase(a) == b);

    // I have to monocase before I can know which wchar_t is larger!
    return wtolower(a) - wtolower(b);
}

The savings here is that I do just one flipcase() call for each
non-matching position, but then I have to do two wide wtolower() calls
to determine the actual order.
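
For plain ASCII, flipcase() can be as simple as this (a sketch; a real
wide-char version would need the same kind of table-driven approach as
wtolower() itself):

#include <ctype.h>

/* Return the opposite-case version of c if one is defined, else c.
   For ASCII letters the case bit is 0x20, so XOR flips it. */
int flipcase(int c)
{
    if (isalpha((unsigned char)c))
        return c ^ 0x20;
    return c;
}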

I also considered using the result of flipcase(): It is only when that
is different from the input that we would need to monocase the other
char, right?

Except one crucial detail here:

How is stricmp() defined?

Does it always force uppercase/force lowercase before comparing?

The problem occurs because there are ascii characters which are located
between 'Z' and 'a', right?

These will be sorted either before or after an alphabetic char depending
on the case used for the compare, all other chars gives the same result.
...
A bit of googling seems to indicate that the function is defined to
lowercase all chars _before_ the compare, so the OP had it right, and
the fastest sort will almost certainly come from a separate lowercased
set of strings!

Terje

Terje Mathisen

Jun 25, 2017, 12:51:24 PM
More ideas:

In almost all languages the number of glyphs that have both uppercase
and lowercase versions is very limited, right?

I.e. it is probably way less than 256 such characters in all the text
you'll ever encounter, this means that a collision-free hash table can
be relatively easily constructed and that can reduce the wide wtolower()
call to hashing the input down to a reasonable number of bits and then
looking it up in the hash table:

Each hash table entry will contain the character that you are looking
up, along with its lowercase equivalent, so the code will look like this:

int wtolower(int c)
{
    unsigned h = hash(c) & TABLE_MASK;
    if (lowertable[h].ch == c)
        c = lowertable[h].lo;
    return c;
}

This function is still so expensive that Rod's optimization makes
perfect sense but it should reduce each wtolower() call to the cost of a
branch miss, more or less.
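
Building the table is a one-time setup step; a sketch (the multiplier
and table size are placeholders that have to be picked and verified
offline so the characters of interest don't collide):

#define TABLE_BITS 10
#define TABLE_MASK ((1u << TABLE_BITS) - 1)

static struct { int ch, lo; } lowertable[1u << TABLE_BITS];

static unsigned hash(int c)
{
    return (unsigned)c * 2654435761u;   /* multiplicative hash */
}

/* Insert one (character, lowercase) pair. Returns 0 on collision so
   the caller can retry with another multiplier or a bigger table. */
static int add_pair(int ch, int lo)
{
    unsigned h = hash(ch) & TABLE_MASK;

    if (lowertable[h].ch != 0 && lowertable[h].ch != ch)
        return 0;
    lowertable[h].ch = ch;
    lowertable[h].lo = lo;
    return 1;
}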

OTOH, lowercasing the strings first is still way faster since that is
O(n*m) (n = number of keys, m = average key length) while a comparison
function that lowercases on the fly will be called O(n*log(n)) times and
each call needs up to two wtolower() calls per key char.

Rod's version optimizes this to ~1 pair of wtolower per comparison early
on when the input data is mostly randomly ordered, but as soon as you
get smaller partitions with nearly equal keys, the average number of key
char pairs to compare before you find a difference will go up.

Terje

Rod Pemberton

Jun 25, 2017, 6:06:43 PM
On Sun, 25 Jun 2017 18:41:00 +0200
Terje Mathisen <terje.m...@nospicedham.tmsw.no> wrote:

> In almost all languages the number of glyphs that have both uppercase
> and lowercase versions is very limited, right?
>
> I.e. it is probably way less than 256 such characters in all the text
> you'll ever encounter, this means that a collision-free hash table
> can be relatively easily constructed and that can reduce the wide
> wtolower() call to hashing the input down to a reasonable number of
> bits and then looking it up in the hash table:
>
> Each hash table entry will contain the character that you are looking
> up, along with its lowercase equivalent, so the code will look like
> this:
>
> int wtolower(int c)
> {
>     unsigned h = hash(c) & TABLE_MASK;
>     if (lowertable[h].ch == c)
>         c = lowertable[h].lo;
>     return c;
> }
>
> This function is still so expensive that Rod's optimization makes
> perfect sense but it should reduce each wtolower() call to the cost
> of a branch miss, more or less.
>

This brings to mind the concept of a rolling hash.
https://en.wikipedia.org/wiki/Rolling_hash

Terje Mathisen

Jun 26, 2017, 3:52:17 AM
Rod Pemberton wrote:
> On Sun, 25 Jun 2017 18:41:00 +0200
> Terje Mathisen <terje.m...@nospicedham.tmsw.no> wrote:
>
>> In almost all languages the number of glyphs that have both uppercase
>> and lowercase versions is very limited, right?
>>
>> I.e. it is probably way less than 256 such characters in all the text
>> you'll ever encounter, this means that a collision-free hash table
>> can be relatively easily constructed and that can reduce the wide
>> wtolower() call to hashing the input down to a reasonable number of
>> bits and then looking it up in the hash table:
>>
>> Each hash table entry will contain the character that you are looking
>> up, along with its lowercase equivalent, so the code will look like
>> this:
>>
>> int wtolower(int c)
>> {
>>     unsigned h = hash(c) & TABLE_MASK;
>>     if (lowertable[h].ch == c)
>>         c = lowertable[h].lo;
>>     return c;
>> }
>>
>> This function is still so expensive that Rod's optimization makes
>> perfect sense but it should reduce each wtolower() call to the cost
>> of a branch miss, more or less.
>>
>
> This brings to mind the concept of a rolling hash.
> https://en.wikipedia.org/wiki/Rolling_hash

Rolling hash functions, as used in rsync (which I've used daily for
literally decades now), are indeed very nice but not really applicable here.

OTOH they could come in useful when doing LZ-style compression, i.e.
for each block of input you have to figure out if it is a copy of
something that has occurred during the previous N characters, with N=64K
for LZ4.
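
I.e. the classic polynomial rolling hash, where sliding an N-byte
window one position is O(1) (a sketch; the base is arbitrary and the
arithmetic is modulo 2^32 via unsigned wraparound):

#include <stddef.h>

#define B 257u   /* arbitrary base */

/* Hash of the initial window s[0..n-1]; also yields B^(n-1). */
unsigned init_hash(const unsigned char *s, size_t n, unsigned *bpow)
{
    unsigned h = 0, p = 1;
    size_t i;

    for (i = 0; i < n; i++) {
        h = h * B + s[i];
        if (i + 1 < n)
            p *= B;
    }
    *bpow = p;
    return h;
}

/* Slide the window right one byte: drop 'out', append 'in'. */
unsigned roll(unsigned h, unsigned char out, unsigned char in,
              unsigned bpow)
{
    return (h - out * bpow) * B + in;
}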

Terje

Rick C. Hodgin

Jun 26, 2017, 4:52:22 AM
Good idea. I'll implement that and do some testing.

Oh my ... AT&T syntax. Rod, you'll have to use the -masm=intel option if
you want me to try to read it. :-) I have dyslexia and have a hard enough
time reading without all the redundant text everywhere. It is actually
very, very difficult for me to read AT&T syntax.

> I'm not real sure why 64-bit GCC is dropping in a few extra resizing
> instructions in the main part of the loop. It appears to me that GCC
> is picking incorrect register sizes ... Adjusting the declared types
> didn't seem to help any, e.g., int to char, etc.

I was surprised I was able to write an algorithm which beat modern C/C++
compilers. I would've expected I could not. And to be honest, apart
from some basic examination of the effect of JMPing on this or !this
conditions, I didn't do much to try to create an optimized algorithm;
I just wrote it simply, step by step, in minimal assembly instructions
while maintaining a logical and natural flow through what I was seeing
in my mind as the standard C/C++ algorithm.

I had hoped Terje would jump up on the code and find some LEAs or
whatever to make it go much faster. :-)

Rod Pemberton

Jun 26, 2017, 8:23:30 PM
On Mon, 26 Jun 2017 01:45:25 -0700 (PDT)
"Rick C. Hodgin" <rick.c...@nospicedham.gmail.com> wrote:

> Oh my ... AT&T syntax.

With -O2, 32-bit GCC (DJGPP for DOS) generates:

.globl _rp_stricmp
_rp_stricmp:
push ebp
mov ecx, 1
mov ebp, esp
push esi
push ebx
mov edx, DWORD PTR [ebp+12]
mov ebx, DWORD PTR [ebp+8]
L10:
test ecx, ecx
je L3
movsx ecx, BYTE PTR [ebx]
movsx esi, BYTE PTR [edx]
inc ebx
inc edx
cmp ecx, esi
je L10
mov al, BYTE PTR _lower[esi]
cmp BYTE PTR _lower[ecx], al
je L10
L3:
sub ecx, esi
pop ebx
mov eax, ecx
pop esi
pop ebp
ret

With -O2, 64-bit GCC (for Linux) generates:

.globl rp_stricmp
.type rp_stricmp, @function
rp_stricmp:
.LFB15:
.cfi_startproc
push rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
.L2:
movsx eax, BYTE PTR [rdi]
movsx edx, BYTE PTR [rsi]
add rdi, 1
add rsi, 1
cmp eax, edx
je .L4
movsx rcx, edx
movsx r8, eax
movzx ebx, BYTE PTR lower[rcx]
cmp BYTE PTR lower[r8], bl
je .L4
.L3:
sub eax, edx
pop rbx
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L4:
.cfi_restore_state
test eax, eax
jne .L2
.p2align 4,,7
jmp .L3
.cfi_endproc


Rod Pemberton
--

James Harris

Jun 27, 2017, 4:54:06 AM
It's interesting that for lower() DJGPP uses a lookup table.

> je L10
> L3:
> sub ecx, esi
> pop ebx
> mov eax, ecx
> pop esi
> pop ebp
> ret

That's good code but it occurred to me that because the offset is the
same into each string, one offset could be incremented instead of two
string pointers. Then, rather than the following (if ESI and EDI are the
string pointers)

add esi, 1
add edi, 1
movsx eax, byte [esi]
movsx ebx, byte [edi]

if the offset is in EDX then the equivalent would be a bit shorter. That
could end up being faster. And it saves a register.

add edx, 1
movsx eax, byte [esi + edx]
movsx ebx, byte [edi + edx]


I saw you (Rod) make a good point in another thread that if lower() is a
function call then its overhead can be avoided in many cases by XOR of
the two bytes to see if they /might/ match. If the XOR is 0 then they
match. If it is 0x20 then they might match. Otherwise, they cannot match
and there's no need to lower-case either of them.

Even better, the XOR operation can set the flags for the equality test
so we don't need a CMP. Instead of the initial test

cmp eax, ecx ;Compare the two chars
je these_chars_match

we could use

xor eax, ecx
jz these_chars_match

And then EAX is ready to be tested for whether the two bytes /might/ be
a case-insensitive match.

cmp eax, 0x20
jne found_a_mismatch
;The chars might match

(All code untested and may well contain errors....)

--
James Harris

James Harris

Jun 27, 2017, 5:39:10 AM
On 27/06/2017 09:51, James Harris wrote:

Here's an effort at the code needed. All untested. The inner loop is a
seven-instruction scan for either (1) the end of the first string or (2)
two bytes (one from each string) which are not identical.

;ASCII case-insensitive string comparison
;Register usage
;ESI: pointer to start of string 0
;EDI: pointer to start of string 1
;EBX: offset from the start of the strings
;ECX: the current character from string 0
;EDX: temporary storage

push ebp
mov ebp, esp
push esi ;Preserve the callee-saved registers we use
push edi
push ebx
mov esi, [ebp + 8] ;Point ESI at start of string 0
mov edi, [ebp + 12] ;Point EDI at start of string 1

xor ebx, ebx ;Offset = zero
jmp loop_test ;Skip ahead to loop test

loop_top:
movsx eax, byte [edi + ebx] ;Fetch char from string 1
xor eax, ecx ;Xor char from string 0
jne not_identical ;Jump if not identical but might still match

;These two bytes are identical. Go on to the next pair

next_pos:
add ebx, 1 ;Next offset

loop_test:
movsx ecx, byte [esi + ebx] ;Char from string 0
test ecx, ecx ;The string terminator?
jnz loop_top ;Jump back up if not end of string

go_back:
movsx eax, byte [esi + ebx] ;Get char from string 0
movsx ebx, byte [edi + ebx] ;Get char from string 1
sub eax, ebx ;Calculate difference
pop ebx ;Restore saved registers
pop edi
pop esi
pop ebp
ret 8

not_identical:
cmp eax, 0x20 ;Could they be the same but different cases?
jne go_back ;Skip if no

;The bytes might be different cases of the same letter. Check them
push ecx ;The char from string 0
call lower ;Lower-case it
add esp, 4 ;Caller removes the argument (cdecl)
mov edx, eax ;Save in EDX
movsx eax, byte [edi + ebx] ;Fetch char from string 1
push edx ;EDX is caller-saved; keep it across the call
push eax ;Push the char
call lower ;Lower-case it
add esp, 4 ;Caller removes the argument
pop edx ;Recover the lowered char from string 0
cmp eax, edx ;Same?
je next_pos ;Yes, continue looping
jmp go_back ;Mismatch found. Go back


--
James Harris

Rick C. Hodgin

Jun 27, 2017, 8:54:23 AM
That is nice code. Some savings could be made by removing the need
for saving the registers, and by using global variables or registers
for the passed parameters, which would free up ebp as well (not
really useful here, but in other cases).

I am unfamiliar with the .cfi_ forms. What are they?

Rick C. Hodgin

Jun 27, 2017, 2:54:58 PM
On Sunday, June 25, 2017 at 4:05:54 AM UTC-4, Rod Pemberton wrote:
This solution (of adding the equality test before the conversion to
lower-case) increased performance on my test data set from 102 to 96
(lower is better).

This algorithm produced a score of 86!

I don't have any other 64-bit tests to compare against. When I tried
this one, it produced a score of 87.

Your algorithm is the fastest thus far.

Rod Pemberton

Jun 27, 2017, 4:10:06 PM
I don't know. I've only been using Linux for a few years. I've done
the majority of my personal code development on DOS. I've been meaning
to adapt to Linux as time permits.


Rod Pemberton
--

Rick C. Hodgin

Jun 27, 2017, 4:25:09 PM
I think I figured out from context that they are hints provided
to the assembler or linker to do certain things, but aren't really
part of the generated asm code logic.

Rod Pemberton

Jun 27, 2017, 5:55:16 PM
On Tue, 27 Jun 2017 09:51:06 +0100
James Harris <james.h...@nospicedham.gmail.com> wrote:

> It's interesting that for lower() DJGPP uses a lookup table.

No, that's not DJGPP. The Terje/Ben example used an array as a
lookup table. Was that posted on the CLAX thread? I don't think it
was. It might be in the comp.lang.c thread. Anyway, lower[] is a 256
char ASCII table that I filled in with tolower()'s values with a loop
in main(). It was used to complete the routine. (Technically, the
"array" should be unsigned char, but I didn't do that.)

> > je L10
> > L3:
> > sub ecx, esi
> > pop ebx
> > mov eax, ecx
> > pop esi
> > pop ebp
> > ret
>
> That's good code but it occurred to me that because the offset is the
> same into each string, one offset could be incremented instead of two
> string pointers. Then, rather than the following (if ESI and EDI are
> the string pointers)
> [...]
> if the offset is in EDX then the equivalent would be a bit shorter.
> That could end up being faster. And it saves a register.

That might shorten the code. Usually though, using an offset instead
of a pointer requires an addition of the offset to the pointer.

Did you mean something like this? I kept the temporaries (local
variables) to encourage the compiler to place them in registers.

int rp_jh_stricmp(void *p0, void *p1)
{
    int a,b,i;
    char *x,*y;

    i=0;
    a=!0;
    x=p0;
    y=p1;

    while(1)
    {
        if(!a)
            break;
        a=x[i];
        b=y[i];
        i++;
        if(a==b)
            continue;
        if(lower[a]==lower[b])
            continue;
        break;
    }
    return(a-b);
}

With -O2, 32-bit GCC (DJGPP for DOS) generates:

.globl _rp_jh_stricmp
_rp_jh_stricmp:
push ebp
mov edx, 1
mov ebp, esp
push edi
push esi
push ebx
mov edi, DWORD PTR [ebp+8]
mov esi, DWORD PTR [ebp+12]
xor ebx, ebx
L10:
test edx, edx
je L3
movsx edx, BYTE PTR [edi+ebx]
movsx ecx, BYTE PTR [esi+ebx]
inc ebx
cmp edx, ecx
je L10
mov al, BYTE PTR _lower[ecx]
cmp BYTE PTR _lower[edx], al
je L10
L3:
sub edx, ecx
pop ebx
mov eax, edx
pop esi
pop edi
pop ebp
ret


With -O2, 64-bit GCC (for Linux) generates:

.globl rp_jh_stricmp
.type rp_jh_stricmp, @function
rp_jh_stricmp:
.LFB15:
.cfi_startproc
push rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
xor edx, edx
.L2:
movsx eax, BYTE PTR [rdi+rdx]
movsx ecx, BYTE PTR [rsi+rdx]
cmp eax, ecx
je .L4
movsx r8, ecx
movsx r9, eax
movzx ebx, BYTE PTR lower[r8]
cmp BYTE PTR lower[r9], bl
je .L4
.L3:
sub eax, ecx
pop rbx
.cfi_remember_state
.cfi_def_cfa_offset 8
ret
.p2align 4,,10
.p2align 3
.L4:
.cfi_restore_state
add rdx, 1
test eax, eax
jne .L2
.p2align 4,,3
jmp .L3
.cfi_endproc

For 32-bit, this cuts out one inc from the main loop, but places an
addition into both movsx instructions.

For 64-bit, this cuts out two "add ,1"s from the main loop, places one
of them further below, but places an addition into both movsx
instructions.

Is this faster? I can't tell from comparing the code.

> I saw you (Rod) make a good point in another thread that if lower()
> is a function call then its overhead can be avoided in many cases by
> XOR of the two bytes to see if they /might/ match. If the XOR is 0
> then they match. If it is 0x20 then they might match. Otherwise, they
> cannot match and there's no need to lower-case either of them.

That is how I'd approach it for assembly. It might turn out good for C
as well ... I haven't tried that. Maybe, this:

int rp_xor_stricmp(void *p0, void *p1)
{
    int t;
    char *x,*y;

    x=p0;
    y=p1;

    while(1)
    {
        if(!((*x)&(*y)))
            break;
        t=(*x)^(*y);
        x++;
        y++;
        if(!t)
            continue;
        if(t==0x20)
            continue;
        break;
    }
    x--;
    y--;
    return((*x)-(*y));
}


Rod Pemberton
--

James Harris

Jun 27, 2017, 8:55:30 PM
On 27/06/2017 22:48, Rod Pemberton wrote:
> On Tue, 27 Jun 2017 09:51:06 +0100
> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>> It's interesting that for lower() DJGPP uses a lookup table.
>
> No, that's not DJGPP. The Terje/Ben example used an array as a
> lookup table. Was that posted on the CLAX thread? I don't think it
> was. It might be in the comp.lang.c thread. Anyway, lower[] is a 256
> char ASCII table that I filled in with tolower()'s values with a loop
> in main(). It was used to complete the routine. (Technically, the
> "array" should be unsigned char, but I didn't do that.)

True. Looking back, I see your code was lower[a]. For some reason I
thought you had used lower(a).

>
>>> je L10
>>> L3:
>>> sub ecx, esi
>>> pop ebx
>>> mov eax, ecx
>>> pop esi
>>> pop ebp
>>> ret
>>
>> That's good code but it occurred to me that because the offset is the
>> same into each string, one offset could be incremented instead of two
>> string pointers. Then, rather than the following (if ESI and EDI are
>> the string pointers)
>> [...]
>> if the offset is in EDX then the equivalent would be a bit shorter.
>> That could end up being faster. And it saves a register.
>
> That might shorten the code. Usually though, using an offset instead
> of a pointer requires an addition of the offset to the pointer.

You would need to go back a long way in x86 development to find a CPU
for which that was still true. For as long as I remember, Intel and AMD
machines have formed addresses a few steps in the pipeline before they
applied ALU ops. So,

[eax + ebx]

should be just as fast as

[eax]

though it would be good to avoid changing either register immediately
before, if possible.

>
> Did you mean something like this?

Yes.

> I kept the temporaries (local
> variables) to encourage the compiler to place them in registers.

Looking at the generated code, below, it picks up ESI and EDI from the
stack - i.e. they are p0 and p1. It keeps them unchanged and it uses
them in the body of the code. x and y would, therefore, seem to be
redundant. (Assuming your p0 and p1 could be declared as char *).

> Is this faster? I can't tell from comparing the code.

You know, it's funny how much I've forgotten from specific optimisation
guides. I used to know what the Intel Atom did differently from others
in this regard, for example (there is a specific difference), but I
don't now. But however vague I am on those specifics I'm pretty sure
that address formation has been in earlier pipeline stages for many
generations, and that it has its own adders. What was the target CPU for
your compilation? Your compiler evidently thought it was a good idea to
use [X+Y] in the movsx instruction for that CPU and above.

>
>> I saw you (Rod) make a good point in another thread that if lower()
>> is a function call then its overhead can be avoided in many cases by
>> XOR of the two bytes to see if they /might/ match. If the XOR is 0
>> then they match. If it is 0x20 then they might match. Otherwise, they
>> cannot match and there's no need to lower-case either of them.
>
> That is how I'd approach it for assembly. It might turn out good for C
> as well ... I haven't tried that. Maybe, this:
>
> int rp_xor_stricmp(void *p0, void *p1)
> {
> int t;
> char *x,*y;
>
> x=p0;
> y=p1;

Could that section be replaced with this?

int rp_xor_stricmp(char *x, char *y)
{
char t;

>
> while(1)
> {
> if(!((*x)&(*y)))

Do you need to test both chars?

> break;
> t=(*x)^(*y);
> x++;
> y++;
> if(!t)
> continue;
> if(t==0x20)
> continue;

t being 0x20 means that they /might/ match. You would still need to use
lower() (or lower[]) to check.

> break;
> }
> x--;
> y--;
> return((*x)-(*y));
> }

With array notation you would be able to finish with

return x[i] - y[i];

and then you would avoid having to adjust the pointers x and y in the
loop; only i would change.


--
James Harris

Rod Pemberton

Jun 27, 2017, 11:40:42 PM
On Wed, 28 Jun 2017 01:42:31 +0100
James Harris <james.h...@nospicedham.gmail.com> wrote:

> On 27/06/2017 22:48, Rod Pemberton wrote:
> > On Tue, 27 Jun 2017 09:51:06 +0100
> > James Harris <james.h...@nospicedham.gmail.com> wrote:

> > I kept the temporaries (local
> > variables) to encourage the compiler to place them in registers.
>
> Looking at the generated code, below, it picks up ESI and EDI from
> the stack - i.e. they are p0 and p1. It keeps them unchanged and it
> uses them in the body of the code. x and y would, therefore, seem to
> be redundant. (Assuming your p0 and p1 could be declared as char *).

Maybe. I've not confirmed for this code whether that's needed or not.

I've generally found with GCC that using auto's "encourages" GCC to
keep values in registers, instead of reloading them from the stack or
memory. Sometimes this is optimized away, i.e., unnecessary as you've
pointed out may be the situation. But, it ensures that GCC doesn't
generate bad code by default, as is normal. E.g., GCC is exceptionally
bad at selecting when to use 8-bit registers. It can take plenty of
declaration type juggling to fix this.

> > Is this faster? I can't tell from comparing the code.
>
> You know, it's funny how much I've forgotten from specific
> optimisation guides. I used to know what the Intel Atom did
> differently from others in this regard, for example (there is a
> specific difference), but I don't now. But however vague I am on
> those specifics I'm pretty sure that address formation has been in
> earlier pipeline stages for many generations, and that it has its own
> adders. What was the target CPU for your compilation? Your compiler
> evidently thought it was a good idea to use [X+Y] in the movsx
> instruction for that CPU and above.

I overlooked this earlier, but the movsx's, by adding the offsets, both
*access* the same register, i.e., +rdx (64-bit) or +ebx (32-bit). I'm
not sure if that creates a register dependency or not, as the register
being accessed is not modified. I would assume that it wouldn't, but
it might.

Either way, due to the pairing, pipelining, loading time, decoding time,
etc, I'm not sure which is faster without testing.

> > int rp_xor_stricmp(void *p0, void *p1)
> > {
> > int t;
> > char *x,*y;
> >
> > x=p0;
> > y=p1;
>
> Could that section be replaced with this?
>
> int rp_xor_stricmp(char *x, char *y)
> {
> char t;

You can replace lots of things with lots of things, e.g., you could use
a for() loop instead of while(1). You could use *x++ or *sp++ as
Rick or Terje do. (Albeit, this is not recommended due to
sequence-point issues. The post- and pre- increment/decrement
operators are always supposed to be alone on a line for ANSI C.)

Are you asking if it makes a difference in the emitted code for 32-bit
or 64-bit? ... With -O2, 64-bit generates the same code. With -O0,
64-bit routines are similar, except stack offsets and some setup lines.
With -O2, 32-bit generates the same code. With -O0, 32-bit routines
are similar, except stack offsets, some setup lines, and an 'lea'
instruction which has been relocated.

So, yes, you can replace it, at least for the GCC compiler, without
issue.

> > while(1)
> > {
> > if(!((*x)&(*y)))
>
> Do you need to test both chars?

How do you stop when either string reaches the End-Of-String (EOS) or
nul terminator '\0' without checking both strings for EOS or nul
terminator '\0'?

It's not valid in C to read past the string's EOS or nul terminator.
Perhaps, doing so is fine for assembly, if the buffer is large enough
to not access other important data.

FYI, the earlier routine actually checks *both* strings for the EOS
or nul terminator, with the "if(!a)" statement at the top of the loop,
but only after "a==b" on a string match "continues" to the top of the
loop. So, a and b are both '\0' for a match when they reach "if(!a)".
I.e., only one needs to be checked. When a or b is '\0' for a
non-match, the routine will hit the "break;" at the end of the loop,
and never loop to reach "if(!a)" at the top of the loop.

In this routine, t==0 on any character match, due to the XOR, including
'\0'=='\0' for the EOS or nul terminator. So, you can't use t's value
to match '\0'=='\0', unless you read *past* the EOS or nul terminator
of at least one of the strings. This wouldn't be valid for C code,
i.e., undefined behavior.

Do you see a way to only check one string for the EOS without
potentially reading past the end of the other string? If so, I'd be
interested in seeing that solution.

> > break;
> > t=(*x)^(*y);
> > x++;
> > y++;
> > if(!t)
> > continue;
> > if(t==0x20)
> > continue;
>
> t being 0x20 means that they /might/ match. You would still need to
> use lower() (or lower[]) to check.

t being 0x20 means a lowercase alphabetic matches an uppercase
alphabetic character, or vice-versa, at least for the ASCII character
set. Alphabetic characters are all 0x20 (32) apart in ASCII.

A-Z is 0x41 (65) to 0x5A (90).
a-z is 0x61 (97) to 0x7A (122).
0x41 | 0x20 = 0x61.
0x5A | 0x20 = 0x7A.

t being 0x20 wouldn't be a match when dealing with graphics characters,
i.e., non-alphanumeric. In that case, lower() won't fix the issue
either. lower() only works on alphabetic characters. You'd need to add
a range check or an isgraph() call etc. IIRC, the results of isgraph()
are not consistent or accurate across various C compiler platforms for
ASCII. I.e., you might consider constructing your own from other
ctype.h functions.

I.e., while unconfirmed by testing, the routine above should work for
alphabetic characters and digits, if I didn't make any mistakes. Then
again, maybe it doesn't. As always, it's up to you to test it, if you use
it. Generally, I'd run the entire ASCII set through such a routine to
see what it emits with a judiciously placed printf(). (Did I get word
of the day again?)
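
E.g., something like this (a sketch; it checks the routine's
zero/non-zero answer against tolower() over every pair of 7-bit
characters):

#include <stdio.h>
#include <ctype.h>

int rp_xor_stricmp(void *p0, void *p1);  /* routine under test */

int main(void)
{
    char a[2] = {0, 0}, b[2] = {0, 0};
    int i, j, got, want;

    for (i = 1; i < 128; i++)
        for (j = 1; j < 128; j++) {
            a[0] = (char)i;
            b[0] = (char)j;
            got = rp_xor_stricmp(a, b);
            want = tolower(i) - tolower(j);
            if ((got == 0) != (want == 0))
                printf("%d %d: got %d want %d\n", i, j, got, want);
        }
    return 0;
}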

> > break;
> > }
> > x--;
> > y--;
> > return((*x)-(*y));
> > }
>
> With array notation you would be able to finish with

s/array notation/use of the subscript operator/

(Remember, despite what those self-proclaimed "C experts" over on c.l.c.
say, there are *NO* arrays in C. C only has an array declaration that
allocates space for storage and declares a pointer to it. All use of
"arrays" in the C language are actually use of the "subscript
operator" [] . The subscript operator works with any pointer to a
valid type, i.e., non-void, and an offset. Hence, no arrays in C.)

> return x[i] - y[i];
>
> and then you would avoid having to adjust the pointers x and y in the
> loop; only i would change.

You can compare the effect of this with the assembly for
rp_jh_stricmp() and the assembly for the original rp_stricmp(), both
of which were posted earlier. The results for this routine
(rp_xor_stricmp) may be slightly different, but the changes should be
very similar.


Rod Pemberton
--

James Harris

Jun 28, 2017, 6:26:11 AM
On 28/06/2017 04:31, Rod Pemberton wrote:
> On Wed, 28 Jun 2017 01:42:31 +0100
> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>> On 27/06/2017 22:48, Rod Pemberton wrote:
>>> On Tue, 27 Jun 2017 09:51:06 +0100
>>> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>>> I kept the temporaries (local
>>> variables) to encourage the compiler to place them in registers.
>>
>> Looking at the generated code, below, it picks up ESI and EDI from
>> the stack - i.e. they are p0 and p1. It keeps them unchanged and it
>> uses them in the body of the code. x and y would, therefore, seem to
>> be redundant. (Assuming your p0 and p1 could be declared as char *).
>
> Maybe. I've not confirmed for this code whether that's needed or not.
>
> I've generally found with GCC that using auto's "encourages" GCC to
> keep values in registers, instead of reloading them from the stack or
> memory. Sometimes this is optimized away, i.e., unnecessary as you've
> pointed out may be the situation. But, it ensures that GCC doesn't
> generate bad code by default, as is normal. E.g., GCC is exceptionally
> bad at selecting when to use 8-bit registers. It can take plenty of
> declaration type juggling to fix this.

Do you look at generated code much? When using a compiled language I
hardly ever look at the code it has emitted. I can't think why anyone
would - unless you think the compiler will produce faulty code. And if
you did, would you not ditch that compiler or file a bug report?

>
>>> Is this faster? I can't tell from comparing the code.
>>
>> You know, it's funny how much I've forgotten from specific
>> optimisation guides. I used to know what the Intel Atom did
>> differently from others in this regard, for example (there is a
>> specific difference), but I don't now. But however vague I am on
>> those specifics I'm pretty sure that address formation has been in
>> earlier pipeline stages for many generations, and that it has its own
>> adders. What was the target CPU for your compilation? Your compiler
>> evidently thought it was a good idea to use [X+Y] in the movsx
>> instruction for that CPU and above.
>
> I overlooked this earlier, but the movsx's, by adding the offsets, both
> *access* the same register, i.e., +rdx (64-bit) or +ebx (32-bit).

I see that in the array-form code. (The pointer-form code updates two
registers, not just one.)

> I'm
> not sure if that creates a register dependency or not, as the register
> being accessed is not modified. I would assume that it wouldn't, but
> it might.

Effectively, register reads can happen in parallel. There are three
"data hazards" but all involve writing:
https://en.wikipedia.org/wiki/Hazard_(computer_architecture).

>
> Either way, due to the pairing, pipelining, loading time, decoding time,
> etc, I'm not sure which is faster without testing.
>
>>> int rp_xor_stricmp(void *p0, void *p1)
>>> {
>>> int t;
>>> char *x,*y;
>>>
>>> x=p0;
>>> y=p1;
>>
>> Could that section be replaced with this?
>>
>> int rp_xor_stricmp(char *x, char *y)
>> {
>> char t;
>
> You can replace lots of things with lots of things, e.g., you could use
> a for() loop instead of while(1). You could use *x++ or *sp++ as
> Rick or Terje do. (Albeit, this is not recommended due to
> sequence-point issues. The post- and pre- increment/decrement
> operators are always supposed to be alone on a line for ANSI C.)
>
> Are you asking if it makes a difference in the emitted code for 32-bit
> or 64-bit? ... With -O2, 64-bit generates the same code. With -O0,
> 64-bit routines are similar, except stack offsets and some setup lines.
> With -O2, 32-bit generates the same code. With -0O, 32-bit routines
> are similar, except stack offsets, some setup lines, and an 'lea'
> instruction which has been relocated.
>
> So, yes, you can replace it, at least for the GCC compiler, without
> issue.

OK. Yes, I wondered if, although being shorter, it would achieve the
same effect.

>
>>> while(1)
>>> {
>>> if(!((*x)&(*y)))
>>
>> Do you need to test both chars?
>
> How do you stop when either string reaches the End-Of-String (EOS) or
> nul terminator '\0' without checking both strings for EOS or nul
> terminator '\0'?
>
> It's not valid in C to read past the string's EOS or nul terminator.
> Perhaps, doing so is fine for assembly, if the buffer is large enough
> to not access other important data.
>
> FYI, the earlier routine actually checks *both* strings for the EOS
> or nul terminator, with the "if(!a)" statement at the top of the loop,
> but only after "a==b" on a string match "continues" to the top of the
> loop. So, a and b are both '\0' for a match when they reach "if(!a)".
> I.e., only one needs to be checked. When a or b is '\0' for a
> non-match, the routine will hit the "break;" at the end of the loop,
> and never loop to reach "if(!a)" at the top of the loop.
>
> In this routine, t==0 on any character match, due to the XOR, including
> '\0'=='\0' for the EOS or nul terminator. So, you can't use t's value
> to match '\0'=='\0', unless you read *past* the EOS or nul terminator
> of at least one of the strings. This wouldn't be valid for C code,
> i.e., undefined behavior.
>
> Do you see a way to only check one string for the EOS without
> potentially reading past the end of the other string? If so, I'd be
> interested in seeing that solution.

AISI, we are scanning the two strings for the first substantive
difference. As soon as we find such a difference we exit. If that
difference is that one string ends before the other then the normal
algorithm will exit. For example, if 0 means the binary zero at the end
of the string

any0
anybody0

The normal algorithm will stop when it compares 0 and b.

But the normal algorithm wouldn't stop if the two strings were
identical. It would run on past the strings. Therefore we need a test to
prevent that happening - and testing either for a terminating zero is
enough. Hence I think we only need to check one of the strings for the
terminating zero.
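
In C terms, the OP's original loop already does exactly this - it tests
only *s1, because the loop body is only reached while the two current
characters are equal, so if *s1 is zero then *s2 must be zero too:

while ((d = tolower(*s1) - tolower(*s2)) == 0 && *s1)
    s1++, s2++;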

>
>>> break;
>>> t=(*x)^(*y);
>>> x++;
>>> y++;
>>> if(!t)
>>> continue;
>>> if(t==0x20)
>>> continue;
>>
>> t being 0x20 means that they /might/ match. You would still need to
>> use lower() (or lower[]) to check.
>
> t being 0x20 means a lowercase alphabetic matches an uppercase
> alphabetic character, or vice-versa, at least for the ASCII character
> set. Alphabetic characters are all 0x20 (32) apart in ASCII.
>
> A-Z is 0x41 (65) to 0x5A (90).
> a-z is 0x61 (97) to 0x7A (122).
> 0x41 | 0x20 = 0x61.
> 0x5A | 0x20 = 0x7A.
>
> t being 0x20 wouldn't be a match when dealing with graphics characters,
> i.e., non-alphanumeric.

Yes, the XOR of two non-alphabetic values could also result in 0x20,
which is why a lower() call is needed.

> In that case, lower() won't fix the issue
> either. lower() only works on alphabetic characters.

I had in mind that lower('!') would return '!'.

E.g. lower could be defined as follows.

char lower(unsigned char ch)       /* ASCII, not EBCDIC */
{
    if (ch >= 'A' && ch <= 'Z')    /* If uppercase */
        return ch - 'A' + 'a';     /* Return the lowercase version */
    return ch;                     /* Else return the original */
}

> You'd need to add
> a range check or an isgraph() call etc. IIRC, the results of isgraph()
> are not consistent or accurate across various C compiler platforms for
> ASCII. I.e., you might consider constructing your own from other
> ctype.h functions.
>
> I.e., while unconfirmed by testing, the routine above should work for
> alphabetic characters and digits, if I didn't make any mistakes. Then
> again, maybe it doesn't. As always, it's up to you to test it, if you use
> it. Generally, I'd run the entire ASCII set through such a routine to
> see what it emits with a judiciously placed printf(). (Did I get word
> of the day again?)
>
>>> break;
>>> }
>>> x--;
>>> y--;
>>> return((*x)-(*y));
>>> }
>>
>> With array notation you would be able to finish with
>
> s/array notation/use of the subscript operator/
>
> (Remember, despite what those self-proclaimed "C experts" over on c.l.c.
> say, there are *NO* arrays in C. C only has an array declaration that
> allocates space for storage and declares a pointer to it. All use of
> "arrays" in the C language are actually use of the "subscript
> operator" [] . The subscript operator works with any pointer to a
> valid type, i.e., non-void, and an offset. Hence, no arrays in C.)

OT for this newsgroup but contrast these two:

char a[10];
char *p;

When passed to a function they would both go as pointers

f(a, p);

and in the function they would both act as pointers to char. But in the
original they would be different in a number of ways. For example,

sizeof(a)
sizeof(p)

would not be the same.
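
E.g. (the sizes are typical, not guaranteed):

#include <stdio.h>

int main(void)
{
    char a[10];
    char *p = a;

    printf("%zu\n", sizeof(a));  /* 10: the whole array */
    printf("%zu\n", sizeof(p));  /* typically 4 or 8: just the pointer */
    return 0;
}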

>
>> return x[i] - y[i];
>>
>> and then you would avoid having to adjust the pointers x and y in the
>> loop; only i would change.
>
> You can compare the effect of this with the assembly for
> rp_jh_stricmp() and the assembly for the original rp_stricmp(), both
> of which were posted earlier. The results for this routine
> (rp_xor_stricmp) may be slightly different, but the changes should be
> very similar.


--
James Harris

Rod Pemberton

Jun 28, 2017, 8:42:09 PM
On Wed, 28 Jun 2017 11:21:21 +0100
James Harris <james.h...@nospicedham.gmail.com> wrote:

> On 28/06/2017 04:31, Rod Pemberton wrote:
> > On Wed, 28 Jun 2017 01:42:31 +0100
> > James Harris <james.h...@nospicedham.gmail.com> wrote:
> >> On 27/06/2017 22:48, Rod Pemberton wrote:
> >>> On Tue, 27 Jun 2017 09:51:06 +0100
> >>> James Harris <james.h...@nospicedham.gmail.com> wrote:

> >>> I kept the temporaries (local
> >>> variables) to encourage the compiler to place them in registers.
> >>
> >> Looking at the generated code, below, it picks up ESI and EDI from
> >> the stack - i.e. they are p0 and p1. It keeps them unchanged and it
> >> uses them in the body of the code. x and y would, therefore, seem
> >> to be redundant. (Assuming your p0 and p1 could be declared as
> >> char *).
> >
> > Maybe. I've not confirmed for this code whether that's needed or
> > not.
> >
> > I've generally found with GCC that using auto's "encourages" GCC to
> > keep values in registers, instead of reloading them from the stack
> > or memory. Sometimes this is optimized away, i.e., unnecessary as
> > you've pointed out may be the situation. But, it ensures that GCC
> > doesn't generate bad code by default, as is normal. E.g., GCC is
> > exceptionally bad at selecting when to use 8-bit registers. It can
> > take plenty of declaration type juggling to fix this.
>
> Do you look at generated code much?

(This question/argument reminds me of someone on c.l.c., whom I'm not
fond of. He repeatedly and wrongly argues that there is no reason or
need for a C programmer to ever look at the assembly. AISI, a
programmer's job is not only to program, but to verify the accuracy of
the code they produce.)

Yes, I do look at generated code, but not for everything. In many
cases, it's not required to do so, if you're reasonably assured that the
compiler is generating correct code. I look maybe only 10% to 25% of
the time. I do so especially when speed or size is an issue, such as
for the recent threads here, or for say a hex dump or text conversion
program, etc. I also look if I want something to be stored in
registers. GCC is bad about this. And, there are many ways to express
things in C and certain combinations produce much better assembly when
optimized. E.g., you probably noticed that I used a while(1) with
break instead of a for(). Many times, these simple loops optimize
better than for(), do-while(), or while().

> I hardly ever look at the code it has emitted. I can't think why
> anyone would

See above.

Also, recent versions of GCC were reported to optimize away legitimate
code in some situations. I happened to see one instance on Linux when
compiling for this thread. I had the "if-break" statement in a
different location, and even though the C code was correct, GCC
"optimized" away, i.e., deleted, the rest of my loop ... This was
noticed upon execution, and then confirmed by the missing assembly.

> - unless you think the compiler will produce faulty code.

No, I don't usually do it for that. That'll usually show up with
testing, or execution. If it's a primitive C compiler, I may look
for a while just to make sure the code is correct. If there is
something complicated or obfuscated in the code, I might take a look
too, e.g., pointer chains, typedef obfuscation, etc.

> And if you did, would you not ditch that compiler or file a bug
> report?

I can't do that.

I can update GCC for Linux, if there is an issue. I can't update GCC
for DOS, as later versions of DJGPP have changes intended for NTVDM or
a DOS console in Windows. These changes aren't compatible with true
DOSes unlike my older DJGPP version. Although newer versions of
OpenWatcom compiler are available, I can't update OpenWatcom, yet, as
the newer versions have some changes which require "porting" the C code.

> > I'm not sure if that creates a register dependency or not, as the
> > register being accessed is not modified. I would assume that it
> > wouldn't, but it might.
>
> Effectively, register reads can happen in parallel. There are three
> "data hazards" but all involve writing:
> https://en.wikipedia.org/wiki/Hazard_(computer_architecture).

The keyword is "can". The question is "Do they?" and on which x86
processors? All, none, some, newer? Do any have bugs? Etc. Without
testing, I don't know.

> > Do you see a way to only check one string for the EOS without
> > potentially reading past the end of the other string? If so, I'd be
> > interested in seeing that solution.
>
> AISI, we are scanning the two strings for the first substantive
> difference. As soon as we find such a difference we exit. If that
> difference is that one string ends before the other then the normal
> algorithm will exit. For example, if 0 means the binary zero at the
> end of the string
>
> any0
> anybody0
>
> The normal algorithm will stop when it compares 0 and b.
> But the normal algorithm wouldn't stop if the two strings were
> identical. It would run on past the strings. Therefore we need a test
> to prevent that happening - and testing either for a terminating zero
> is enough. Hence I think we only need to check one of the strings for
> the terminating zero.

I don't know what you mean by "normal algorithm".

The original algorithm I posted will break out on any mismatch, whether
a '\0' is involved or not. So, we don't need to check in that
situation. It will also stop when the strings are identical by
breaking out by matching one '\0'. Since both strings are identical,
both have '\0', we only need to check one of them for '\0'.

The xor algorithm I posted will break if there is a mismatch with an
xor difference other than 0x20. So, the question is are there
situations where there can be a '\0' character which follows the
"continue" statements? The answer to that is: "Yes." If both are
'\0', i.e., an exact string match, we must check at least one of them.
That is the first "continue" statement. If one character is '\0' 0x00
and the other is space 0x20, then we must check both of them. That is
the second "continue" statement. E.g., at least one of these examples
will continue reading past the end of the shorter "Hello" string, if we
don't check both strings for the '\0':

/* xor difference of 0x20, x='\0' y=' ' */
Hello\0
Hello World!\n\0

/* xor difference of 0x20, x=' ' y='\0' */
Hello World!\n\0
Hello\0

If we only check x for '\0', we'll miss y being '\0'. Variable t in the
algorithm doesn't have sufficient information to detect this. So, we
must check both strings for '\0' when there is an xor difference of
0x20.

You stated you have TWO non-alphabetic characters with an XOR of 0x20.
For '!' (0x21), the only non-alphabetic "character" with an XOR of 0x20
is the non-printable SOH or Ctrl-A (0x01). So, let's take two
non-alphabetic visible graphics characters instead. For '[' lower
returns '[', and for '{' lower returns '{'. Neither is lower-cased.
Neither is converted to a value that won't result in an XOR difference
of 0x20 between the two, which would cause the XOR to fail. Also, '{'
isn't converted to '[', which would allow the XOR to match. If lower()
doesn't cause a failure or create an acceptance, how does this help? ...

JH> the XOR of two non-alphabetic values could also result in 0x20,

True, if they're graphics characters, not alpha-numeric. But, the XOR
routine wasn't designed for use with graphics characters. Remember the
XOR suggestion was for Ben, not Rick. Ben didn't need graphics
characters. He only needed a routine for alphabetic and numeric
characters. No graphic. You asked about the XOR routine as suggested
to Ben, but did so in Rick's thread. ...

JH> t being 0x20 means that they /might/ match. You would still need to
JH> use lower() (or lower[]) to check.

If you wanted this to work for graphics characters, you'd need an
additional check to exclude or detect the graphics characters, as I
stated previously. This could be a call to isgraph() or !isalnum() or
even x==y, etc. Calling lower() doesn't help, unless you replace the
XOR==0x20 check with "lower(x)==lower(y)". In which case, we're
basically back to the earlier non-XOR routine, since we aren't using
the XOR anymore which you specifically requested. ...

> > (Remember, despite what those self-proclaimed "C experts" over on
> > c.l.c. say, there are *NO* arrays in C. C only has an array
> > declaration that allocates space for storage and declares a pointer
> > to it. All use of "arrays" in the C language are actually use of
> > the "subscript operator" [] . The subscript operator works with
> > any pointer to a valid type, i.e., non-void, and an offset. Hence,
> > no arrays in C.)
>
> OT for this newsgroup but contrast these two:
>
> char a[10];
> char *p;
>
> When passed to a function they would both go as pointers
>
> f(a, p);

When used with the subscript operator [], both are pointers:

char c;
c=a[1]; /* 1 is offset. [] is subscript operator. a is pointer */
c=p[1]; /* 1 is offset. [] is subscript operator. p is pointer */

Where is the array? ... No array here.

The subscript operator [] doesn't accept arrays as a parameter. It
only accepts a pointer to a type and an offset. The operation of the
subscript operator is defined in all the C standards this way,
including early C documents and articles. This nullifies the entire
argument that an array, such as a[], *ONLY* decays into a pointer when
passed into a function as the specifications state. An array is never
an array in the first place, except when declared. Afterwards, it's
always a pointer, but appears otherwise due to syntax.

> and in the function they would both act as pointers to char.

Not act. Are.

> But in the original they would be different in a number of ways.
> For example,
>
> sizeof(a)
> sizeof(p)
>
> would not be the same.

sizeof() is rather "special". It provides access to information only
known by the compiler at compile time, e.g., symbol tables. IIRC, it's
rather bizarre in some situations and incomplete in others. I generally
don't need it for anything, but do use it to check type sizes on
different platforms.

So, you're arguing that a "faulty", specialty, customized, low-use
function, justifies claiming that arrays exist in the C language, other
than simply for their declaration and allocation? If so, I
wholeheartedly disagree.

The arrays' type *CANNOT* be checked throughout the C code, because
there are no arrays in C. Everywhere you think there is an array in C
code, except for the array declaration, there is a only a (non-void)
pointer, an offset, and a subscript operator, or just a (non-void)
pointer. Never is there an array, except the declaration. That's just
there to allocate space.

(This is apparently a hard concept for most to grasp, as it almost
always sparks an intense argument or flame war.)


Rod Pemberton
--

James Harris

unread,
Jun 29, 2017, 5:12:45 AM6/29/17
to
On 29/06/2017 01:40, Rod Pemberton wrote:
> On Wed, 28 Jun 2017 11:21:21 +0100
> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>> On 28/06/2017 04:31, Rod Pemberton wrote:

...

> Also, recent versions of GCC were reported to optimize away legitimate
> code in some situations. I happened to see one instance on Linux when
> compiling for this thread. I had the "if-break" statement in a
> different location, and even though the C code was correct, GCC
> "optimized", i.e., deleted, away the rest of my loop ... This was
> noticed upon execution, and then confirmed by the missing assembly.

I posted something similar to comp.lang.c recently. With higher
optimisation levels gcc would optimise away a certain loop test. Turned
out it was within its rights to do so, in a sense, because the code
depended on overflow of a signed integer - and that has no defined
behaviour in C. I can accept that. That's how C is specified. Where I am
not comfortable is that gcc didn't issue any warnings and that it made
different assumptions at different optimisation levels and, thereby,
produced different programs. I tended to assume, before that, that
higher levels of compiler optimisation would just take longer but would
still produce programs which operated in the same way.

...

>> Effectively, register reads can happen in parallel. There are three
>> "data hazards" but all involve writing:
>> https://en.wikipedia.org/wiki/Hazard_(computer_architecture).
>
> The keyword is "can". The question is "Do they?" and on which x86
> processors? All, none, some, newer? Do any have bugs? Etc. Without
> testing, I don't know.

Without checking I'm not sure how long this has been in place but I
think you'd have to go back a long way - probably more than 20 years -
to find an x86 CPU which did not carry out reads in parallel. And even
those machines would probably be faster doing two reads than your
original updating of two pointers. All these things are documented, of
course, if you wanted to check the details.

>
>>> Do you see a way to only check one string for the EOS without
>>> potentially reading past the end of the other string? If so, I'd be
>>> interested in seeing that solution.
>>
>> AISI, we are scanning the two strings for the first substantive
>> difference. As soon as we find such a difference we exit. If that
>> difference is that one string ends before the other then the normal
>> algorithm will exit. For example, if 0 means the binary zero at the
>> end of the string
>>
>> any0
>> anybody0
>>
>> The normal algorithm will stop when it compares 0 and b.
>> But the normal algorithm wouldn't stop if the two strings were
>> identical. It would run on past the strings. Therefore we need a test
>> to prevent that happening - and testing either for a terminating zero
>> is enough. Hence I think we only need to check one of the strings for
>> the terminating zero.
>
> I don't know what you mean by "normal algorithm".

Basically, the one we were using in this thread, i.e. one which steps
through character by character looking for a difference.

>
> The original algorithm I posted will break out on any mismatch, whether
> a '\0' is involved or not. So, we don't need to check in that
> situation. It will also stop when the strings are identical by
> breaking out by matching one '\0'. Since both strings are identical,
> both have '\0', we only need to check one of them for '\0'.
>
> The xor algorithm I posted will break if there is a mismatch with an
> xor difference other than 0x20.

True. Your latter algorithm relied on the 0x20 difference along these lines:

if the xor is 0 then go on to next pair (as the chars are the same)
if the xor is 0x20 then go on to next pair (assuming a match)

And, thereby, would work with limited ranges of characters as input, and
would need a check of both strings for terminating zeroes.

But the earlier code in this thread allowed any of 256 char codes and if
we keep to that, then the xor value is only a hint. It can be used to
avoid the lower() or lower[] operations, along the following lines.

if the xor is 0 then go on to next pair. The chars are the same
if the xor is 0x20 then we _might_ have a match
check the two lower()s. If they are the same, we have a match

And with that, if one string ends before the other then we will still
have a mismatch and the loop will terminate. lower(0) will return 0. So
AFAICS we only need to check one string for a terminating zero.

...

> JH> the XOR of two non-alphabetic values could also result in 0x20,
>
> True, if they're graphics characters, not alpha-numeric. But, the XOR
> routine wasn't designed for use with graphics characters. Remember the
> XOR suggestion was for Ben, not Rick. Ben didn't need graphics
> characters. He only needed a routine for alphabetic and numeric
> characters. No graphic. You asked about the XOR routine as suggested
> to Ben, but did so in Rick's thread. ...

I understand that in the other thread you were thinking of using xor as
a test in its own right, which would work on limited inputs. In /this/
thread I suggested it for a different purpose - a possible time
optimisation. The inputs would still be all 256 potential byte values,
not a limited range.

To illustrate, here's the complete idea as code. It is all untested (not
even compiled) so it might have bugs. Read it as just a way to show the idea.



#define MIGHT ('A' ^ 'a') /* To indicate that the chars might match */
/* 0x20 in ASCII, 0x40 in EBCDIC */



char lower(unsigned char ch) /* ASCII, not EBCDIC */
{
    if (ch >= 'A' && ch <= 'Z') /* If uppercase */
        return ch - 'A' + 'a';  /* Return the lowercase version */
    return ch;                  /* Else return the original */
}



int jh_stricmp(char *s, char *t)
{
    int i;
    char x; /* The xor of each pair of characters */

    for (i = 0; s[i]; ++i)
    {
        x = s[i] ^ t[i];
        if (x == 0) /* The chars match (in this case, they are identical) */
            continue;
        if (x == MIGHT) /* The chars might match (upper & lower case) */
            if (lower(s[i]) == lower(t[i])) /* They do match */
                continue;
        break;
    }
    return s[i] - t[i];
}



Note that (x == MIGHT) is only used to avoid unnecessary function calls,
not to replace them.

In practice, on long identical strings the xor trick might cost more
than it saves, but would help if the routine were called frequently on
short non-matching strings.

The code should compile to something like this:

xor al, bl ;XOR the two characters
jz next_iteration ;Identical

cmp al, 0x20 ;Is the XOR 0x20?
jne difference_found ;If not, exit the loop; they cannot match

Call the lower() routine twice
If the two lower() results are identical jump to next_iteration
Else: exit the loop; they don't match

As I say, XOR would avoid the need to call lower() if the two characters
could not possibly match.
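
For completeness, here is an equally untested sketch of how such a
routine might be plugged into qsort(), assuming the array being sorted
holds pointers to strings (the wrapper name is made up for the example):

#include <stdlib.h>

/* qsort() passes pointers to the array elements, i.e. pointers to the
   char pointers, so one level of indirection has to be stripped off. */
int jh_stricmp_qsort(const void *p1, const void *p2)
{
    return jh_stricmp(*(char * const *)p1, *(char * const *)p2);
}

/* Usage: qsort(names, count, sizeof names[0], jh_stricmp_qsort); */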


--
James Harris

Rod Pemberton

unread,
Jun 29, 2017, 6:43:40 PM6/29/17
to
On Thu, 29 Jun 2017 09:57:41 +0100
James Harris <james.h...@nospicedham.gmail.com> wrote:

> On 29/06/2017 01:40, Rod Pemberton wrote:
> > On Wed, 28 Jun 2017 11:21:21 +0100
> > James Harris <james.h...@nospicedham.gmail.com> wrote:
> >> On 28/06/2017 04:31, Rod Pemberton wrote:

> > Also, recent versions of GCC were reported to optimize away
> > legitimate code in some situations. I happened to see one instance
> > on Linux when compiling for this thread. I had the "if-break"
> > statement in a different location, and even though the C code was
> > correct, GCC "optimized", i.e., deleted, away the rest of my
> > loop ... This was noticed upon execution, and then confirmed by
> > the missing assembly.
>
> I posted something similar to comp.lang.c recently. With higher
> optimisation levels gcc would optimise away a certain loop test.
> Turned out it was within its rights to do so, in a sense, because the
> code depended on overflow of a signed integer - and that has no
> defined behaviour in C. I can accept that. That's how C is specified.

No ... We should not accept that (in the general sense).

C has a whole bunch of "standards" including early C documents, early
books, articles by the authors, the C Rationale, and various Technical
Corrigenda, all of which explain required behavior, some of which is
not part of the formal ANSI or ISO standards. This historical behavior
must be preserved for C to compile correctly.

> Where I am not comfortable is that gcc didn't issue any warnings and
> that it made different assumptions at different optimisation levels
> and, thereby, produced different programs. I tended to assume, before
> that, that higher levels of compiler optimisation would just take
> longer but would still produce programs which operated in the same
> way.

Well, this is what happens when you get C specification pedants, like
those on comp.lang.c, bringing C compilers into compliance with the
C specifications. What you expected to work doesn't work anymore. I.e.,
they force and maximize undefined behavior wherever they think it
should exist. Not good.

> But the earlier code in this thread allowed any of 256 char codes and
> if we keep to that, then the xor value is only a hint. It can be used
> to avoid the lower() or lower[] operations, along the following lines.
>
> if the xor is 0 then go on to next pair. The chars are the same
> if the xor is 0x20 then we _might_ have a match
> check the two lower()s. If they are the same, we have a match
>
> And with that, if one string ends before the other then we will still
> have a mismatch and the loop will terminate. lower(0) will return 0.
> So AFAICS we only need to check one string for a terminating zero.

Ok.
To avoid the function calls for multiple character translations, you
can create a lookup table.

#include <ctype.h> /* needed for tolower() */

#define ARY 256
char lower[ARY];

/* In main() or as a void func(void) */
int i;
for (i = 0; i < ARY; i++)
{
    lower[i] = tolower(i);
}
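
Used in the compare loop, the table version might then look something
like this (an untested sketch, and the name is made up; the unsigned
char pointers keep the subscripts non-negative):

int tbl_stricmp(const unsigned char *s, const unsigned char *t)
{
    int i;

    /* Step while the lowered chars match and s hasn't ended;
       tolower(0) is 0, so only one string needs the '\0' check. */
    for (i = 0; lower[s[i]] == lower[t[i]] && s[i]; i++)
        ;
    return lower[s[i]] - lower[t[i]];
}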


Also, if you didn't notice, my xor code has an off-by-one error. It
decrements x and y to compensate for the increments of x and y used to
compute the return value. But, it doesn't increment x or y prior to the
'\0' check at the top of the loop.

This,

while (1)
{
    if (!((*x) & (*y)))
        break;
...

Should actually be this, or the returned result is incorrect:

while (1)
{
    if (!((*x) & (*y)))
    {
        x++;
        y++;
        break;
    }
...

> In practice, on long identical strings the xor trick might cost more
> than it saves, but would help if the routine were called frequently
> on short non-matching strings.
>

I would approach the problem slightly differently ...

When "if (x == MIGHT)", the XOR result is 0x20, meaning that the two
characters are either an upper- and lower-case alphabetic which are the
same when lower-cased, or they're two different graphics characters
0x20 apart. So, I think you only need to know if _one_ of the
characters is alphabetic or not. I.e., if one is alphabetic, both
should be, if XOR==0x20, e.g., 'A' and 'a'. If one is a graphic, both
should be, if XOR==0x20, e.g., '[' and '{'. In C, detecting an
alphabetic character is a call to isalpha().

So, my routine using this identity/constraint would be something like
this, which replaces two calls to lower() or two lower[] lookups with
one alpha[] lookup.

int rp_xor_stricmp(void *p0, void *p1)
{
    int t;
    char *x, *y;

    x = p0;
    y = p1;

    while (1)
    {
        /* (*x)&(*y) can only be zero when at least one char is '\0'
           or the two chars share no set bits (so they cannot be a
           case pair); either way, the difference returned below is
           the right answer. */
        if (!((*x) & (*y)))
        {
            x++;
            y++;
            break;
        }
        t = (*x) ^ (*y);
        x++;
        y++;
        if (!t)
            continue;  /* identical chars */
        if (t == 0x20) /* possible case pair: check one char */
        {
            if (!alpha[(int)*(x - 1)])
                break; /* graphics pair, e.g. '[' vs '{': mismatch */
            continue;  /* case pair, e.g. 'A' vs 'a': match */
        }
        break;         /* plain mismatch */
    }
    x--;
    y--;
    return ((*x) - (*y));
}

Note that I needed to dereference x minus one to check
for alpha. Otherwise, I'd have needed to introduce some more
temporaries. This might be better with an index i, as you prefer,
instead of pointers.


Using the same identity/constraint, your 'x==MIGHT' code should look
something like this:

> int jh_stricmp(char *s, char *t)
> {
> int i;
> char x; /* The xor of each pair of characters */
>
> for (i = 0; s[i]; ++i)
> {
> x = s[i] ^ t[i];
> if (x == 0) /* The chars match (in this case, they are identical) */
> continue;
> if (x == MIGHT) /* The chars might match (upper & lower case) */
if (alpha[(int)s[i]]) /* They do match */
continue;
> break;
> }
> return s[i] - t[i];
> }


Note that the array in my code is named alpha, not isalpha, to prevent
a name collision between the array and the function, like the earlier
lower versus tolower.

#include <stdio.h>
#include <ctype.h>

#define ARY 256
char alpha[ARY];

int main(void)
{
    int i;

    for (i = 0; i < ARY; i++)
    {
        alpha[i] = !!isalpha(i);
    }
...

<rant>
Sigh, the true value returned by Linux's is___() ctype.h functions
doesn't fit into a char ... When cast to a char or masked with 0xFF,
the true value is zero (false). Unbelievable! So, it needs two
logical-nots !! to fix. A simple assignment works for tolower(). That
tells me someone once knew what they were doing. It also makes you
wonder if anyone coding for Linux had any C experience prior to Linux.
Who would intentionally cause a failure for a function intended to be
used to fill a character array for use as a lookup table? This is just
F-word insane. They must've broken every C program in existence that
uses ctype.h functions, which was not written specifically for Linux.
I guess it's a good thing I don't usually use the ctype.h functions.
</rant>


Rod Pemberton
--

James Harris

unread,
Jun 30, 2017, 2:44:12 AM6/30/17
to
On 29/06/2017 23:40, Rod Pemberton wrote:
> On Thu, 29 Jun 2017 09:57:41 +0100
> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>> On 29/06/2017 01:40, Rod Pemberton wrote:
>>> On Wed, 28 Jun 2017 11:21:21 +0100
>>> James Harris <james.h...@nospicedham.gmail.com> wrote:
>>>> On 28/06/2017 04:31, Rod Pemberton wrote:

...

> To avoid the function calls for multiple character translations, you
> can create a lookup table.
>
> #define ARY 256
> char lower[ARY];
>
> /* In main() or as a void func(void) */
> int i;
> for(i=0;i<ARY;i++)
> {
> lower[i]=tolower(i);
> }

That's a good way to go. It's fast, simple and would work with any
character set that will fit in 8 bits.

...

> When "if (x == MIGHT)", the XOR result is 0x20, meaning that the two
> characters are either an upper- and lower-case alphabetic which are the
> same when lower-cased, or they're two different graphics characters
> 0x20 apart.

That's far from obvious but I have to say that for ASCII it seems to be
correct. And it may work for EBCDIC too (with an XOR of 0x40).
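
One way to convince oneself is a quick brute-force check over the 7-bit
ASCII range (an untested sketch; in the C locale it should print nothing):

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    int c;

    /* Report any pair (c, c ^ 0x20) where one side is alphabetic
       and the other is not. */
    for (c = 0; c < 128; c++)
        if (!!isalpha(c) != !!isalpha(c ^ 0x20))
            printf("exception: %02X vs %02X\n", c, c ^ 0x20);
    return 0;
}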

...

> Using the same identity/constraint, your 'x==MIGHT' code should look
> something like this:
>
>> int jh_stricmp(char *s, char *t)
>> {
>> int i;
>> char x; /* The xor of each pair of characters */
>>
>> for (i = 0; s[i]; ++i)
>> {
>> x = s[i] ^ t[i];
>> if (x == 0) /* The chars match (in this case, they are identical) */
>> continue;
>> if (x == MIGHT) /* The chars might match (upper & lower case) */
> if (alpha[(int)s[i]]) /* They do match */
> continue;
>> break;
>> }
>> return s[i] - t[i];
>> }

I could understand an (unsigned) cast because chars might be signed but
why the (int) cast? Wouldn't it just sign-extend a negative number but
still leave it negative?

...

> Sigh, the true value for Linux is___() ctype.h functions doesn't fit
> into a char ... When cast to a char or masked with 0xFF, the true value
> is zero (false). Unbelievable! So, it needs two logical-nots !! to
> fix. A simple assignment works for tolower(). That tells me someone
> once knew what they were doing. It also makes you wonder if anyone
> coding for Linux had any C experience prior to Linux. Who would
> intentionally cause a failure for a function intended to be used to
> fill a character array for use as a lookup table? This is just F-word
> insane. They must've broken every C program in existence that use
> ctype.h functions, which was not written specifically for Linux. I
> guess it's a good thing I don't usually use the ctype.h functions.
> </rant>

Understood.


--
James Harris

James Harris

unread,
Jun 30, 2017, 6:59:30 AM6/30/17
to
On 30/06/2017 07:41, James Harris wrote:
> On 29/06/2017 23:40, Rod Pemberton wrote:

...

>> When "if (x == MIGHT)", the XOR result is 0x20, meaning that the two
>> characters are either an upper- and lower-case alphabetic which are the
>> same when lower-cased, or they're two different graphics characters
>> 0x20 apart.
>
> That's far from obvious but I have to say that for ASCII it seems to be
> correct. And it may work for EBCDIC too (with an XOR of 0x40).

Maybe that suggests a fast way of recognising a letter in ASCII without
using a lookup table:

or al, 'a' ^ 'A'
sub al, 'a'
cmp al, 'z' - 'a'
ja is_not_alphabetic
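
In C, the same trick is essentially a one-liner (ASCII only; untested):

/* Force the lowercase bit, then one unsigned range check suffices. */
int is_ascii_letter(unsigned char ch)
{
    return (unsigned)((ch | ('a' ^ 'A')) - 'a') <= 'z' - 'a';
}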


--
James Harris

Terje Mathisen

unread,
Jun 30, 2017, 11:59:48 AM6/30/17
to
That is the method I use for US ASCII. :-)

Rod Pemberton

unread,
Jun 30, 2017, 10:30:27 PM6/30/17
to
On Fri, 30 Jun 2017 07:41:09 +0100
James Harris <james.h...@nospicedham.gmail.com> wrote:

> On 29/06/2017 23:40, Rod Pemberton wrote:

> > if (alpha[(int)s[i]]) /* They do match */
>
> I could understand an (unsigned) cast because chars might be signed
> but why the (int) cast? Wouldn't it just sign-extend a negative
> number but still leave it negative?
>

GCC complains about using chars as array indexes. It seems to like
them to be cast as an int, but I haven't tried all variations of types
in the cast.

Older C compilers accept ints, chars, longs, etc. without issue, AFAIR.

My understanding is that the subscript operator, for the index or
offset "parameter", is supposed to accept any expression that has an
integer result. AIUI, that should include chars, both signed and
unsigned, which are C's small integers.


Without the (int) cast, I get:

OpenWatcom v1.3
(no warnings)

GCC v4.7.2 (64-bit Linux)
warning: array subscript has type 'char' [-Wchar-subscripts]

GCC v3.4.1 (32-bit DOS)
warning: array subscript has type 'char'


An (unsigned char) cast works to eliminate the warning for GCC v4.7.2,
as do larger type casts, both signed and unsigned. Given this, I'm
assuming that GCC's doing some kind of implicit range check, either to
make sure the unsigned range is large enough for the index, or to
encourage "you, the coder" to avoid integer types that might be too
small or have unexpected results in some circumstance, e.g, negative
index/offset.

https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html

Apparently, -Wchar-subscripts is an attempt to catch the use of signed
chars as an array index when unsigned chars are intended, since some
implementations actually treat plain chars as signed rather than as
unsigned like the majority. Using unsigned chars in C, which is what
people actually want when using a char since C lacks a dedicated small
integer type, typically requires casts when calling the char and string
functions, because those functions are declared with plain (possibly
signed) chars.
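
To make the hazard concrete (hypothetical values, for illustration only):

char lower[256]; /* as above */

void example(void)
{
    char ch = (char)0xE9;               /* -23 where plain char is signed */
    char bad = lower[ch];               /* indexes lower[-23]: out of bounds */
    char ok = lower[(unsigned char)ch]; /* indexes lower[233]: as intended */
}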

This was really my fault. I've historically always used 'unsigned char'
instead of 'char' to avoid sign issues. But, recently, I've been
moving slowly away from C types that don't compile as easily with C, in
part due to comments from others on Usenet, e.g. str__ functions with
'unsigned char'. I'm wondering if this was a bad decision now. The
last time I adopted a new paradigm from self-proclaimed C experts on
c.l.c., who so passionately and logically argued their case, that turned
out to have absolutely horrid results. In this case, I did so because
when posting to Usenet, many people complain about me using
"non-standard" C types, e.g., 'unsigned char' instead of 'char' or not
using 'int' but using 'unsigned long' instead. Most C programmers seem
to use the signed integer types despite their faults, i.e., signed value
instead of expected unsigned or causing UB. And, at the time, I didn't
feel like dealing with any additional code casting issues that could've
popped up due to using an unsigned char in the example. Instead, I
should've either spent a bit more time finding out why GCC was
complaining instead of just casting it to an int, or used the safer
unsigned types with casts if necessary. If I was designing a language
like C today, signed integers would not be present. You only need
signed types when doing math or finance problems, which can be
simulated with subtraction. Computer programming almost always needs
unsigned types exclusively.


Rod Pemberton
--

James Harris

unread,
Jul 1, 2017, 3:00:45 AM7/1/17
to
On 01/07/2017 03:18, Rod Pemberton wrote:
> On Fri, 30 Jun 2017 07:41:09 +0100
> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>> On 29/06/2017 23:40, Rod Pemberton wrote:
>
>>> if (alpha[(int)s[i]]) /* They do match */
>>
>> I could understand an (unsigned) cast because chars might be signed
>> but why the (int) cast? Wouldn't it just sign-extend a negative
>> number but still leave it negative?
>>
>
> GCC complains about using chars as array indexes. It seems to like
> them to be cast as an int, but I haven't tried all variations of types
> in the cast.
>
> Older C compilers accept int's, chars, longs, etc without issue, AFAIR.

I have to say I think that gcc, rather than complaining, is being
helpful because it is pointing out that the array index that a
programmer has used is not guaranteed to be non-negative. Knowing your
long-standing preference for unsigned integers I was surprised to see
some uses of MOVSX in your code. That was not like the RP I know at all.
Very surprising! Now, you've said why.

>
> My understanding is that the subscript operator, for the index or
> offset "parameter", is supposed to accept any expression that has an
> integer result. AIUI, that should include chars, both signed and
> unsigned, which are C's small integers.
>
>
> Without the (int) cast, I get:
>
> OpenWatcom v1.3
> (no warnings)
>
> GCC v4.7.2 (64-bit Linux)
> warning: array subscript has type 'char' [-Wchar-subscripts]
>
> GCC v3.4.1 (32-bit DOS)
> warning: array subscript has type 'char'
>
>
> An (unsigned char) cast works to eliminate the warning for GCC v4.7.2,
> as do larger type casts, both signed and unsigned.

As I've said before, I dislike casts for the very reason that they
disable useful warnings. In this case, (int) has made the warning go
away but the problem continues. The index expression could still be
negative.

> Given this, I'm
> assuming that GCC's doing some kind of implicit range check, either to
> make sure the unsigned range is large enough for the index, or to
> encourage "you, the coder" to avoid integer types that might be too
> small or have unexpected results in some circumstance, e.g, negative
> index/offset.
>
> https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html

I think it's simply warning that the char data type could be negative,
though I am not sure whether gcc is warning about the particular
implementation or about the code's potential non-portability.

>
> Apparently, the -Wchar-subscripts is an attempt to catch use of signed
> chars as an array index when unsigned chars are intended, as some
> machines actually treat signed chars as signed instead of unsigned,
> like the majority. Using unsigned chars in C, which is what people
> actually want when using a char since C lacks a dedicated small
> integer type, typically requires casts when using char functions and
> string functions, due to the function declarations using (signed) chars.

I just tried an implementation and found that its chars are signed.

#include <limits.h>
#include <stdio.h>

int main(void) {
    printf("Char range: %i to %i\n", CHAR_MIN, CHAR_MAX);
    return 0;
}

Output:
Char range: -128 to 127

Whatever compiler you used to produce the assembly code must also use
signed chars. Hence the MOVSX instructions.

>
> This was really my fault. I've historically always used 'unsigned char'
> instead of 'char' to avoid sign issues. But, recently, I've been
> moving slowly away from C types that don't compile as easily with C, in
> part due to comments from others on Usenet, e.g. str__ functions with
> 'unsigned char'.

I'd be interested in what those issues are. Care to discuss on another
newsgroup? (It's well OT for this one.)

> I'm wondering if this was a bad decision now. The
> last time I adopted a new paradigm from self-proclaimed C experts on
> c.l.c., who so passionately and logically argued their case, that turned
> out to have absolutely horrid results. In this case, I did so because
> when posting to Usenet, many people complain about me using
> "non-standard" C types, e.g., 'unsigned char' instead of 'char' or not
> using 'int' but using 'unsigned long' instead.

Haha - we'll never keep everyone happy!

> Most C programmers seem
> to use the signed integer types despite their faults, i.e., signed value
> instead of expected unsigned or causing UB.

It's OK to use plain char as long as we keep in mind its limitations.
But like you I tend to prefer unsigned chars. And using a signed char as
an array index is a danger, hence the helpful gcc warning.

> And, at the time, I didn't
> feel like dealing with any additional code casting issues that could've
> popped up due to using an unsigned char in the example. Instead, I
> should've either spent a bit more time finding out why GCC was
> complaining instead of just casting it to an int, or used the safer
> unsigned types with casts if necessary. If I was designing a language
> like C today, signed integers would not be present. You only need
> signed types when doing math or finance problems, which can be
> simulated with subtraction. Computer programming almost always needs
> unsigned types exclusively.

By that, aren't you undermining the premise of the very algorithm we are
discussing? It returns the difference between two byte values as in

return s1[i] - s2[i];

Surely the fact that that could be -ve, zero, or +ve is intended to be
useful. If you ruled out signed numbers then the function could not work
in the same way.

--
James Harris

Rod Pemberton

unread,
Jul 1, 2017, 5:45:57 AM7/1/17
to
On Sat, 1 Jul 2017 07:56:50 +0100
James Harris <james.h...@nospicedham.gmail.com> wrote:

> On 01/07/2017 03:18, Rod Pemberton wrote:

> > [...]
> As I've said before, I dislike casts for the very reason that they
> disable useful warnings.

They also allow valid conversions which the type system may prevent.

E.g., in certain situations, you need the dereferenced pointer type to
be the same type as the original, i.e., address interpreter.

E.g., the %p pointer format for printf() is implementation defined,
i.e., not available, faulty, weird format, etc. So, I prefer to cast
to unsigned long so that I can emit the address as formatted hex.

E.g., you have a pointer to some type, such as a struct, which you
need to treat as an array of chars, e.g., copy, fix endianness,
scramble/encode, translate, zero-fill, etc.
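
To make the %p example concrete (a sketch; it assumes a pointer fits
in an unsigned long, which is common but not guaranteed everywhere):

#include <stdio.h>

int main(void)
{
    int v;
    int *p = &v;

    /* Emit the address as fixed-width hex rather than trusting %p: */
    printf("%08lx\n", (unsigned long)p);
    return 0;
}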

> > If I was designing a language like C today, signed integers would
> > not be present.
>
> By that, aren't you undermining the premise of the very algorithm we
> are discussing? It returns the difference between two byte values as
> in
>
> return s1[i] - s2[i];
>
> Surely the fact that that could be -ve, zero, or +ve is intended to
> be useful. If you ruled out signed numbers then the function could
> not work in the same way.
>

True. Maybe, I should allow int to be signed, but then again.

How many C functions return three conditions? Is there any use for two
true (non-zero) states? Won't true and false generally suffice? The
OP was using qsort(), IIRC. So, he may have needed 3 values, but
normally, I only need to know if a string compare was a match or not.
ATM, I can't think of any functions with 3 return states where I need
3-states.

Isn't the fact that a function uses subtraction to compute a return
value more likely to be the reason for 3-states "-ve, zero, or +ve"
than the fact that 3-states are somehow useful?

James Harris

unread,
Jul 1, 2017, 7:16:03 AM7/1/17
to
On 01/07/2017 10:45, Rod Pemberton wrote:
> On Sat, 1 Jul 2017 07:56:50 +0100
> James Harris <james.h...@nospicedham.gmail.com> wrote:
>
>> On 01/07/2017 03:18, Rod Pemberton wrote:

Since we're getting completely on to C (and you don't like comp.lang.c)
I've set followups to comp.lang.misc. OK with you?

>
>>> [...]
>> As I've said before, I dislike casts for the very reason that they
>> disable useful warnings.
>
> They also allow valid conversions which the type system may prevent.
>
> E.g., in certain situations, you need the dereferenced pointer type to
> be the same type as the original, i.e., address interpreter.

Can you give an example?

>
> E.g., the %p pointer format for printf() is implementation defined,
> i.e., not available, faulty, weird format, etc. So, I prefer to cast
> to unsigned long so that I can emit the address as formatted hex.

OK.

>
> E.g., you have a pointer to some type, such as a struct, which you
> need to treat as an array of chars, e.g., copy, fix endianness,
> scramble/encode, translate, zero-fill, etc.

* Structs might be copyable to/from functions, and assignable.

* There are usually more portable ways to handle endianness differences,
though I accept that C is not well provided with handy routines.

* Ditto encoding and other operations which need access to the bytes.

* I am not sure what you mean by translate.

* Zero fill can be done with memset.


One good reason to use casts is because C's literals can be untyped. For
example,

#define SIZE 400

printf("%l", SIZE);

The function call might fail but only on a big endian machine - a nasty
gotcha that you could miss in testing.

You could fix it with

printf("%l", (long) SIZE);

Or, maybe better, you could include the type in the definition.

#define SIZE ((long) 400)


But many, many uses of casts mask problems. And they also prevent
compilers from issuing useful warnings. C's type system is there to
help programmers, not hinder them.


>
>>> If I was designing a language like C today, signed integers would
>>> not be present.
>>
>> By that, aren't you undermining the premise of the very algorithm we
>> are discussing? It returns the difference between two byte values as
>> in
>>
>> return s1[i] - s2[i];
>>
>> Surely the fact that that could be -ve, zero, or +ve is intended to
>> be useful. If you ruled out signed numbers then the function could
>> not work in the same way.
>>
>
> True. Maybe, I should allow int to be signed, but then again.
>
> How many C functions return three conditions? Is there any use for two
> true (non-zero) states? Won't true and false generally suffice? The
> OP was using qsort(), IIRC. So, he may have needed 3 values, but
> normally, I only need to know if a string compare was a match or not.
> ATM, I can't think of any functions with 3 return states where I need
> 3-states.
>
> Isn't the fact that a function uses subtraction to compute a return
> value more likely to be the reason for 3-states "-ve, zero, or +ve"
> than the fact that 3-states are somehow useful?

I guess the idea is that the return from strcmp matches the sense of one
string being less than, less than or equal to, equal to, not equal to,
greater than or equal to, or greater than another. All six tests are
possible from a single return value. That's pretty cool!
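
For example (a sketch):

#include <string.h>

void demo(const char *s, const char *t)
{
    int r = strcmp(s, t);

    if (r <  0) { /* s sorts before t */ }
    if (r <= 0) { /* s sorts before, or equal to, t */ }
    if (r == 0) { /* s equals t */ }
    if (r != 0) { /* s differs from t */ }
    if (r >= 0) { /* s sorts after, or equal to, t */ }
    if (r >  0) { /* s sorts after t */ }
}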

--
James Harris
