How is strlen implemented?


roy

unread,
Apr 22, 2005, 11:38:06 PM4/22/05
to
Hi,

I was wondering how strlen is implemented.
What if the input string doesn't have a null terminator, namely the
'\0'?
Thanks a lot
Roy

Chris McDonald

unread,
Apr 22, 2005, 11:44:17 PM4/22/05
to
"roy" <roy...@hotmail.com> writes:

>Hi,

>I was wondering how strlen is implemented.
>What if the input string doesn't have a null terminator, namely the
>'\0'?

Without the null-byte terminator, it's not a string!
strlen() can then do whatever it wants.

--
Chris.

roy

unread,
Apr 22, 2005, 11:59:49 PM4/22/05
to
Thanks. Maybe my question should be "what if the input is a char array
without a null terminator". But from my experimental results, it seems
that strlen can still return the number of characters of a char array.
I am just not sure whether I am just lucky or sth else happened inside
strlen.

Jason

unread,
Apr 23, 2005, 12:04:10 AM4/23/05
to

strlen will read from the char* until it finds a '\0' char. If your
string does not use the '\0' as a terminator, then you should avoid
most of the <string.h> functions.

-Jason

Chris Torek

unread,
Apr 22, 2005, 11:52:25 PM4/22/05
to
In article <1114227486....@g14g2000cwa.googlegroups.com>

roy <roy...@hotmail.com> wrote:
>I was wondering how strlen is implemented.
>What if the input string doesn't have a null terminator, namely the
>'\0'?

Q: What if a tree growing in a forest is made of plastic?
A: Then it is not a tree, or at least, it is not growing.

If something someone else is calling a "string" does not have the
'\0' terminator, it is not a string, or at least, not a C string.
In C, the word "string" means "data structure consisting of zero
or more characters, followed by a '\0' terminator". No terminator,
no string.

Since strlen() requires a string, it may assume it gets one.

There are functions that work on "non-stringy arrays"; in particular,
the mem* functions -- memcpy(), memmove(), memcmp(), memset(),
memchr() -- but they take more than one argument. If you have an
array that always contains exactly 40 characters, and it is possible
that none of them is '\0' but you want to find out whether there
is a '\0' in those 40 characters, you can use memchr():

char *p = memchr(my_array, '\0', 40);

memchr() stops when it finds the first '\0' or has used up the
count, whichever occurs first. (It then returns a pointer to the
found character, or NULL if the count ran out.) The strlen()
function has an effect much like memchr() with an "infinite" count,
except that because the count is "infinite", it "always" finds the
'\0':

    size_t much_like_strlen(const char *p) {
        const char *q = memchr(p, '\0', INFINITY);
        return q - p;
    }

except of course C does not really have a way to express "infinity"
here. (You can approximate it with (size_t)-1, though.)
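
For an array whose size is known, the bounded search can be written out directly. This is only a sketch: the name bounded_len is hypothetical (it is not a <string.h> function), and returning the count when no '\0' is found is one possible convention among several.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of <string.h>: length of the string
   stored in buf, looking at no more than n bytes.  Returns n when
   none of the first n bytes is '\0'. */
size_t bounded_len(const char *buf, size_t n)
{
    const char *q = memchr(buf, '\0', n);

    return q != NULL ? (size_t)(q - buf) : n;
}
```

With the 40-character array above, bounded_len(my_array, 40) == 40 would mean the array holds no string.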
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: forget about it http://web.torek.net/torek/index.html
Reading email is like searching for food in the garbage, thanks to spammers.

Chris McDonald

unread,
Apr 23, 2005, 12:10:45 AM4/23/05
to
"roy" <roy...@hotmail.com> writes:

You were just lucky.

--
Chris.

Martin Ambuhl

unread,
Apr 23, 2005, 1:08:18 AM4/23/05
to
roy wrote:
> Hi,
>
> I was wondering how strlen is implemented.

It could be implemented in several ways. The obvious one is to count
characters until a '\0' is encountered.

> What if the input string doesn't have a null terminator, namely the
> '\0'?

Then it isn't a string, which has such a terminator by definition.

Keith Thompson

unread,
Apr 23, 2005, 3:09:24 AM4/23/05
to

It's helpful to provide some context when you post a followup. I
happen to have read the previous articles just before I read this one,
but I could as easily have seen your followup first.

If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers.

As for your question, strlen()'s argument isn't a char array, it's a
pointer to a char. Normally the pointer should point to the first
element of a "string" (i.e., a sequence of characters marked by a '\0'
terminator). strlen() doesn't know how many characters are
actually in the array. By calling strlen(), you're promising that
there's a '\0' terminator somewhere within the array; if you break
that promise, there's no telling what will happen.

A typical implementation of strlen() will simply traverse the elements
of what it assumes to be your array until it finds a '\0' character.
If it doesn't find a '\0' character within the array, it has no way of
knowing it should stop searching, so it will just continue until it
finds a '\0'. As soon as it passes the end of the array, it invokes
undefined behavior. It might happen to find a '\0' character (which
is what happened in your case). Or it might run past the memory owned
by your program and trigger a segmentation fault or something similar.
Or, as far as the C standard is concerned, it might make demons fly
out your nose.

So don't do that.
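
To make the "promise" concrete, here is a small sketch. The array names and the helper holds_string are made up for illustration; only the first array may safely be passed to strlen().

```c
#include <stddef.h>
#include <string.h>

/* Four elements: 'a', 'b', 'c' and the '\0' the literal supplies. */
static const char with_nul[4] = "abc";

/* Three elements and no terminator: a char array, but not a string. */
static const char without_nul[3] = { 'a', 'b', 'c' };

/* Hypothetical helper: nonzero if the first size bytes of arr contain
   a '\0'.  Unlike strlen(), it never looks past the size it is given. */
int holds_string(const char *arr, size_t size)
{
    return memchr(arr, '\0', size) != NULL;
}
```

Calling holds_string(without_nul, sizeof without_nul) is well defined and reports that there is no string; calling strlen(without_nul) is not.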

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Joe Wright

unread,
Apr 23, 2005, 8:46:57 AM4/23/05
to

More precisely, if your char array does not have a 0 terminator, it is
not a string.
--
Joe Wright mailto:joeww...@comcast.net
"Everything should be made as simple as possible, but not simpler."
--- Albert Einstein ---

Richard Tobin

unread,
Apr 23, 2005, 8:47:02 AM4/23/05
to
In article <1114228789.4...@l41g2000cwc.googlegroups.com>,
roy <roy...@hotmail.com> wrote:

>Thanks. Maybe my question should be "what if the input is a char array
>without a null terminator". But from my experimental results, it seems
>that strlen can still return the number of characters of a char array.

Bear in mind that a char array usually *does* have a null terminator.

If it doesn't, it's quite likely to be followed in memory by a zero
byte, which is the representation of nul on almost all systems, so it
will often work by luck.

Debugging systems often have an option to initialize variables to
non-zero values, precisely to stop this kind of "luck" from obscuring
real errors. Some readers will remember the many bugs that were
revealed when dynamic linking was added to SunOS, causing
uninitialized variables in main() to no longer be zero.
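
The same idea can be applied by hand in a test build. A sketch, assuming a hypothetical helper poison_fill and an arbitrary fill value of 0xA5: filling a buffer with non-zero bytes before use removes the stray zero bytes that would otherwise hide a missing terminator.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical debugging aid: overwrite buf with a non-zero pattern so
   that no byte "happens" to be '\0'.  The value 0xA5 is arbitrary. */
void poison_fill(void *buf, size_t n)
{
    memset(buf, 0xA5, n);
}

/* Copies exactly n bytes of src into dst and (deliberately, to show
   the bug) does not append a terminator. */
void copy_without_nul(char *dst, const char *src, size_t n)
{
    memcpy(dst, src, n);
}
```

After poison_fill() followed by copy_without_nul(), a check such as memchr(buf, '\0', sizeof buf) finds nothing, so the missing terminator shows up instead of being masked by leftover zeros.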

-- Richard

Gregory Pietsch

unread,
Apr 23, 2005, 9:09:42 AM4/23/05
to
There has to be a null terminator somewhere.

Here's a short implementation:

#include <string.h>

size_t (strlen)(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Gregory Pietsch */

Joe Estock

unread,
Apr 23, 2005, 9:30:59 AM4/23/05
to
Interesting seeing \0 so widely in use. On most systems, NULL is defined
as \0, however there are a few special cases where it is not. Shouldn't
we be using NULL instead of \0?

Joe Estock

Joe Wright

unread,
Apr 23, 2005, 11:10:37 AM4/23/05
to

No Joe, NULL is the 'null pointer constant' while '\0' is a constant
character (with int type) and value zero. This is often called the null
character or the NUL character. Never NULL character.
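
A short sketch that uses both in one function (the name find_terminator is made up here): NULL is compared against the pointer, '\0' against the characters it points to.

```c
#include <stddef.h>

/* Returns a pointer to the terminating null character of s, or a null
   pointer if s itself is a null pointer. */
const char *find_terminator(const char *s)
{
    if (s == NULL)          /* null pointer: there is no string at all */
        return NULL;
    while (*s != '\0')      /* null character: end of the string */
        s++;
    return s;
}
```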

Minti

unread,
Apr 23, 2005, 1:23:14 PM4/23/05
to

Pardon me Chris, but I really don't get the drift of what you are
trying to convey. These strings are also "stringy", I don't see how
these are "non-stringy".

IOW you are assuming that these "non-stringy" arrays are also supposed
to end with a null character. "Stringy" I say.

--
Imanpreet Singh Arora

Chris Torek

unread,
Apr 23, 2005, 2:07:21 PM4/23/05
to
>Chris Torek wrote:
>>There are functions that work on "non-stringy arrays"; in particular,
>>the mem* functions ... If you have an array that always contains

>>exactly 40 characters, and it is possible that none of them is '\0'
>>but you want to find out whether there is a '\0' in those 40
>>characters, you can use memchr() ...

In article <1114276994.5...@o13g2000cwo.googlegroups.com>,


Minti <iman...@gmail.com> wrote:
>Pardon me Chris, but I really don't get the drift of what you are
>trying to convey. These strings are also "stringy", I don't see how
>these are "non-stringy".

If there is no '\0' byte in all 40 characters, it is not a string.
If there is a '\0' byte somewhere within those 40 characters, it
*is* a string -- and any characters after the first such '\0' are
not part of the string (but remain part of the array).

>IOW you are assuming that these "non-stringy" arrays are also supposed
>to end with a null character. "Stringy" I say.

In other words, I am saying that these arrays do not contain strings
if and only if they do not contain a '\0'. Note that strncpy()
sometimes makes such arrays (which is one reason some people invented
strlcpy()).
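
A sketch of how that happens and of one common workaround. The function names are hypothetical; strlcpy() itself is not in the C standard, so it is not used here.

```c
#include <stddef.h>
#include <string.h>

/* strncpy() writes exactly n bytes to dst and adds no '\0' when src
   has n or more characters.  Returns nonzero if dst ended up holding
   a string. */
int strncpy_made_string(char *dst, const char *src, size_t n)
{
    strncpy(dst, src, n);
    return memchr(dst, '\0', n) != NULL;
}

/* Hypothetical wrapper: copy at most dstsize - 1 characters and always
   terminate.  dstsize must be at least 1. */
void copy_terminated(char *dst, size_t dstsize, const char *src)
{
    strncpy(dst, src, dstsize - 1);
    dst[dstsize - 1] = '\0';
}
```

Copying "abcdef" into a 4-byte array with strncpy() leaves 'a', 'b', 'c', 'd' and no terminator; copy_terminated() leaves the string "abc".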

If I may draw an analogy: in mathematics, a statement is false if
there is even a single counterexample. Hence "x * (1/x) = 1" is
a false statement mathematically, because it does not hold for x=0.
(But note that if we limit it, "x * (1/x) = 1 provided x \ne 0",
the statement becomes true for x \elem real, while it remains false
for x \elem integer, and so on.) (Note that details like "x is a
real number" also matter in computing, where float and double do
not really give us "real numbers", but rather approximations.)

Keith Thompson

unread,
Apr 23, 2005, 4:13:02 PM4/23/05
to
"Gregory Pietsch" <GK...@flash.net> writes:
> There has to be a null terminator somewhere.

To clarify: This doesn't mean that there's a guarantee that there will
be a null terminator somewhere. It means that if there isn't a null
terminator anywhere, you must not call strlen(). The burden is on the
caller.

(I briefly read your statement the other way.)

Mark McIntyre

unread,
Apr 23, 2005, 7:24:26 PM4/23/05
to
On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
<roy...@hotmail.com> wrote:

>Thanks. Maybe my question should be "what if the input is a char array
>without a null terminator".

Your question was already answered. However, a quote from the ISO
Standard may help:

7.21.6.3 The strlen function

3. The strlen function returns the number of characters that precede
the terminating null character.

Clearly if there's no terminating null, this function can't return
anything meaningful. It may in fact not return at all, and it's not
uncommon for it to return absurd numbers such as 5678905 or -456.
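
As a small illustration of "precede the terminating null" (the array and function names are invented for this sketch): characters after the first '\0' belong to the array but are not counted by strlen().

```c
#include <stddef.h>
#include <string.h>

/* Six bytes: 'a', 'b', '\0', 'c', 'd', and the literal's own '\0'. */
static const char data[] = "ab\0cd";

/* Counts the characters that precede the first '\0'. */
size_t string_length(void)
{
    return strlen(data);
}

/* Counts every element of the array, terminators included. */
size_t array_size(void)
{
    return sizeof data;
}
```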


>But from my experimental results, it seems
>that strlen can still return the number of characters of a char array.

How can it do that? It's /required/ to search for the terminating null.
Your compiler is either not standard-compliant, or it's exhibiting
random behaviour.

>I am just not sure whether I am just lucky or sth else happened inside
>strlen.

lucky


--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>


Keith Thompson

unread,
Apr 23, 2005, 8:32:23 PM4/23/05
to
Mark McIntyre <markmc...@spamcop.net> writes:
> On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
> <roy...@hotmail.com> wrote:
[...]

>>But from my experimental results, it seems
>>that strlen can still return the number of characters of a char array.
>
> How can it do that? It's /required/ to search for the terminating null.
> Your compiler is either not standard-compliant, or it's exhibiting
> random behaviour.

strlen() is almost certainly finding a zero byte immediately after his
array. I'd expect that to be a very common manifestation of the
undefined behavior in this case.

>>I am just not sure whether I am just lucky or sth else happened inside
>>strlen.
>
> lucky

No, if he'd been lucky it would have crashed the program (with a
meaningful diagnostic) rather than quietly returning a meaningless
result.

Stan Milam

unread,
Apr 24, 2005, 12:01:15 AM4/24/05
to

I found some C functions coded in assembler for the 8086 way back when.

;
; -------------------------------------------------------
; int strlen(s)
; char *s;
; Purpose: Returns the length of the string, not
; including the NULL character
; -------------------------------------------------------
;
ifndef pca
include macro2.asm
include libdef.asm
endif
;
idt strlen
def strlen
strlen: qenter bx,di
mov di,parm1[bx]
; cmp di,zero
; jz null
mov ax,ds
mov es,ax
mov cx,-1
xor al,al
cld
repnz scasb
not cx
dec cx
mov ax,cx
exitf
;null xor ax,ax
; exitf
modend strlen

I guess its C equivalent is:

unsigned
strlen( char *string )
{
    unsigned rv = -1;

    while( *string ) rv--, string++;

    rv = (-rv) - 1;
    return rv;
}

of course I'd just write it like this:

size_t
strlen( char *string )
{
    size_t rv = 0;
    while ( *string++ ) rv++;
    return rv;
}

Stan Milam

unread,
Apr 24, 2005, 12:02:09 AM4/24/05
to
Keith Thompson wrote:

> Mark McIntyre <markmc...@spamcop.net> writes:
>
>>On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
>><roy...@hotmail.com> wrote:
>
> [...]
>
>>>But from my experimental results, it seems
>>>that strlen can still return the number of characters of a char array.
>>
>>How can it do that? It's /required/ to search for the terminating null.
>>Your compiler is either not standard-compliant, or it's exhibiting
>>random behaviour.
>
>
> strlen() is almost certainly finding a zero byte immediately after his
> array. I'd expect that to be a very common manifestation of the
> undefined behavior in this case.
>
>
>>>I am just not sure whether I am just lucky or sth else happened inside
>>>strlen.
>>
>>lucky
>
>
> No, if he'd been lucky it would have crashed the program (with a
> meaningful diagnostic) rather than quietly returning a meaningless
> result.
>

So, you are saying this is a poorly implemented compiler?

Keith Thompson

unread,
Apr 24, 2005, 1:02:10 AM4/24/05
to
Stan Milam <stm...@swbell.net> writes:
> Keith Thompson wrote:
>> Mark McIntyre <markmc...@spamcop.net> writes:
>>>On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
>>><roy...@hotmail.com> wrote:
>> [...]
>>
>>>>But from my experimental results, it seems
>>>>that strlen can still return the number of characters of a char array.
[...]

>>>>I am just not sure whether I am just lucky or sth else happened inside
>>>>strlen.
>>>
>>>lucky
>> No, if he'd been lucky it would have crashed the program (with a
>> meaningful diagnostic) rather than quietly returning a meaningless
>> result.
>
> So, you are saying this is a poorly implemented compiler?

Not at all.

First, strlen() is part of the runtime library, not part of the
compiler.

An implementation of strlen() that was able to detect the case where
the argument points to the first element of an array that doesn't
contain any '\0' characters would most likely add significant overhead
to *all* operations. The obvious way to implement it is to make all
pointers "fat", so each pointer includes both the base address and
bounds information; strlen() would then have to check the bounds.
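
A rough sketch of what such a "fat" pointer might look like. Both struct fat_ptr and checked_strlen are invented for illustration and do not correspond to any particular implementation.

```c
#include <stddef.h>

/* Hypothetical fat pointer: an address plus the number of bytes
   remaining in the object it points into. */
struct fat_ptr {
    const char *base;
    size_t      size;
};

/* Returns 0 and stores the length in *len if a '\0' lies within
   bounds; returns -1, without reading out of bounds, otherwise. */
int checked_strlen(struct fat_ptr p, size_t *len)
{
    size_t i;

    for (i = 0; i < p.size; i++) {
        if (p.base[i] == '\0') {
            *len = i;
            return 0;
        }
    }
    return -1;
}
```

The extra comparison against p.size on every byte, and the extra word carried around with every pointer, are the kind of overhead described above.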

James McIninch

unread,
Apr 24, 2005, 8:45:19 AM4/24/05
to roy
<posted & mailed>

By definition, a character array without a null terminator is not a string.

Calling strlen on something that isn't a string will cause undefined behavior
(an error).

roy wrote:

--
Remove '.nospam' from e-mail address to reply by e-mail

Mark McIntyre

unread,
Apr 24, 2005, 9:24:18 AM4/24/05
to
On Sun, 24 Apr 2005 00:32:23 GMT, in comp.lang.c , Keith Thompson
<ks...@mib.org> wrote:

>Mark McIntyre <markmc...@spamcop.net> writes:
>> On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
>> <roy...@hotmail.com> wrote:
>[...]
>>>But from my experimental results, it seems
>>>that strlen can still return the number of characters of a char array.
>>
>> How can it do that? It's /required/ to search for the terminating null.
>> Your compiler is either not standard-compliant, or it's exhibiting
>> random behaviour.
>
>strlen() is almost certainly finding a zero byte immediately after his
>array. I'd expect that to be a very common manifestation of the
>undefined behavior in this case.

That comes under my definition of 'random': it's by chance finding a
null just shortly after the string, possibly due to some debugging
mode 'helpfulness'.

Of course, if the string were zero length, then....
:-)

Emmanuel Delahaye

unread,
Apr 24, 2005, 2:02:25 PM4/24/05
to
roy wrote on 23/04/05 :

If the string is malformed (missing terminating 0), the behaviour is
undefined. Anything could happen.

--
Emmanuel
The C-FAQ: http://www.eskimo.com/~scs/C-faq/faq.html
The C-library: http://www.dinkumware.com/refxc.html

.sig under repair

Emmanuel Delahaye

unread,
Apr 24, 2005, 2:05:12 PM4/24/05
to
Stan Milam wrote on 24/04/05 :

> So, you are saying this is a poorly implemented compiler?

What would be a better implementation? If the terminator is not there,
anything can happen. Blame the coder, not the compiler.

"Clearly your code does not meet the original spec."
"You are sentenced to 30 lashes with a wet noodle."
-- Jerry Coffin in a.l.c.c++

Emmanuel Delahaye

unread,
Apr 24, 2005, 2:10:18 PM4/24/05
to
Joe Estock wrote on 23/04/05 :

> Interesting seeing \0 so widely in use. On most systems, NULL is defined as
> \0, however there are a few special cases where it is not. Shouldn't we be
> using NULL instead of \0?

No, because here, we are talking about the null character that is 0 or
'\0' (but I'm too lazy to type the latter).

"C is a sharp tool"

Stan Milam

unread,
Apr 24, 2005, 3:57:50 PM4/24/05
to
Stan Milam wrote:
> Keith Thompson wrote:

>>
>> No, if he'd been lucky it would have crashed the program (with a
>> meaningful diagnostic) rather than quietly returning a meaningless
>> result.
>>
>
> So, you are saying this is a poorly implemented compiler?

Okay guys, that was a joke.

Gregory Pietsch

unread,
Apr 24, 2005, 7:42:03 PM4/24/05
to
I checked my libraries, and the following may be faster than the above:

#include <string.h>
#ifndef _OPTIMIZED_FOR_SIZE
#include <limits.h>
/* Nonzero if X is not aligned on a "long" boundary. */
#ifdef _ALIGN
#define UNALIGNED1(X) ((long)X&(sizeof(long)-1))
#else
#define UNALIGNED1(X) 0
#endif

/* Macros for detecting endchar */
#if ULONG_MAX == 0xFFFFFFFFUL
#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)
#elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
/* Nonzero if X (a long int) contains a NULL byte. */
#define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & \
                       0x8080808080808080)
#else
#define _OPTIMIZED_FOR_SIZE
#endif

#ifdef DETECTNULL
#define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
#endif

#endif
/* strlen */
size_t (strlen)(const char *s)
{
    const char *t = s;
#ifndef _OPTIMIZED_FOR_SIZE
    unsigned long *aligned_addr;

    if (!UNALIGNED1(s)) {
        aligned_addr = (unsigned long *) s;
        while (!DETECTNULL(*aligned_addr))
            aligned_addr++;
        /* The block of bytes currently pointed to by aligned_addr
           contains a null. We catch it using the bytewise search. */
        s = (const char *) aligned_addr;
    }
#endif
    while (*s)
        s++;
    return (size_t) (s - t);
}

/* Gregory Pietsch */
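
For readers puzzling over DETECTNULL, here is the 32-bit form as a standalone function. This is a sketch: it uses uint32_t from <stdint.h> (an assumption; the code above works on unsigned long), and the name has_zero_byte is invented for illustration. The result is nonzero exactly when at least one of the four bytes of x is zero.

```c
#include <stdint.h>

/* Nonzero if any of the four bytes of x is zero.
   Subtracting 0x01 from each byte sets bit 7 of a byte that was zero
   (through the borrow); "& ~x" discards bytes whose bit 7 was already
   set in x; "& 0x80808080" keeps only the bit-7 positions. */
int has_zero_byte(uint32_t x)
{
    uint32_t t = x - UINT32_C(0x01010101);

    return (t & ~x & UINT32_C(0x80808080)) != 0;
}
```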

Gregory Pietsch

unread,
Apr 24, 2005, 7:43:21 PM4/24/05
to
NULL is usually reserved for the null pointer. Here, we're checking for
the null character, '\0'.

Gregory Pietsch

Flash Gordon

unread,
Apr 25, 2005, 4:26:36 AM4/25/05
to
Gregory Pietsch wrote:
> I checked my libraries,

Do you mean your personal libraries or your implementation's? Remember
that the implementation is allowed to do things you are not allowed to do.

> and the following may be faster than the above:

What above? Please quote enough of the message you are replying to for
us to see what you are talking about. There is an option that gets
Google to do the right thing and if you search the group I'm sure you
will find the instructions. It's in someone's sig, but I can't remember who.

> #include <string.h>
> #ifndef _OPTIMIZED_FOR_SIZE

An implementation could declare that or not for any reason it wants.

> #include <limits.h>
> /* Nonzero if either X or Y is not aligned on a "long" boundary. */
> #ifdef _ALIGN

Again, a compiler could declare that or not as it saw fit.

> #define UNALIGNED1(X) ((long)X&(sizeof(long)-1))

There is no guarantee that this will tell you if it is aligned. Some
people around here have worked on word addressed systems where the byte
within the word was flagged in the *high* bits of the address.

> #else
> #define UNALIGNED1(X) 0
> #endif
>
> /* Macros for detecting endchar */
> #if ULONG_MAX == 0xFFFFFFFFUL
> #define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

Misleading name, I initially read that as a screwy attempt to detect a
NULL pointer. DETECTNULCHAR would be better.

> #elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
> /* Nonzero if X (a long int) contains a NULL byte. */
> #define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & \
> 0x8080808080808080)
> #else
> #define _OPTIMIZED_FOR_SIZE

Isn't that macro you are defining in the implementation name space?
Anything could happen.

> #endif
>
> #ifdef DETECTNULL
> #define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
> #endif
>
> #endif
> /* strlen */
> size_t (strlen)(const char *s)
> {
> const char *t = s;
> #ifndef _OPTIMIZED_FOR_SIZE
> unsigned long *aligned_addr;
>
> if (!UNALIGNED1(s)) {
> aligned_addr = (unsigned long *) s;
> while (!DETECTNULL(*aligned_addr))
> aligned_addr++;

The above could read bytes off the end of a properly nul terminated
string. For example,
size_t len = strlen("a");

> /* The block of bytes currently pointed to by aligned_addr
> contains a null. We catch it using the bytewise search. */
> s = (const char *) aligned_addr;
> }
> #endif
> while (*s)
> s++;
> return (size_t) (s - t);

No need to cast the result of the subtraction. The compiler already
knows it is returning a size_t so will do the conversion anyway.

> }
>
> /* Gregory Pietsch */
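
One way to keep the word-at-a-time idea without reading past the end is to tell the function how big the buffer is, which of course changes the interface, so this is no longer strlen(). A sketch under stated assumptions: buffer_strlen is a made-up name, CHAR_BIT is assumed to be 8 and unsigned long to have no padding bits, and memcpy() is used for the loads so that alignment does not matter.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#if CHAR_BIT != 8
#error "this sketch assumes 8-bit bytes"
#endif

#define ONES  (ULONG_MAX / 0xFFu)   /* 0x0101...01 */
#define HIGHS (ONES * 0x80u)        /* 0x8080...80 */

/* Hypothetical: number of characters before the first '\0' in buf, or
   size if there is none.  Never reads outside buf[0 .. size-1]. */
size_t buffer_strlen(const char *buf, size_t size)
{
    size_t i = 0;

    /* Whole words, only while a full word still fits in the buffer. */
    while (size - i >= sizeof(unsigned long)) {
        unsigned long w;

        memcpy(&w, buf + i, sizeof w);
        if ((w - ONES) & ~w & HIGHS)
            break;                  /* this word contains a zero byte */
        i += sizeof w;
    }
    /* Bytewise: locate the '\0' in that word, or scan the short tail. */
    while (i < size && buf[i] != '\0')
        i++;
    return i;
}
```

For strlen("a") the corresponding call would be buffer_strlen("a", 2), which touches only the two bytes that exist.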
--
Flash Gordon
Living in interesting times.
Although my email address says spam, it is real and I read it.

Lawrence Kirby

unread,
Apr 25, 2005, 12:48:04 PM4/25/05
to
On Sun, 24 Apr 2005 05:02:10 +0000, Keith Thompson wrote:

> Stan Milam <stm...@swbell.net> writes:
>> Keith Thompson wrote:
>>> Mark McIntyre <markmc...@spamcop.net> writes:
>>>>On 22 Apr 2005 20:59:49 -0700, in comp.lang.c , "roy"
>>>><roy...@hotmail.com> wrote:
>>> [...]
>>>
>>>>>But from my experimental results, it seems
>>>>>that strlen can still return the number of characters of a char array.
> [...]
>>>>>I am just not sure whether I am just lucky or sth else happened inside
>>>>>strlen.
>>>>
>>>>lucky
>>> No, if he'd been lucky it would have crashed the program (with a
>>> meaningful diagnostic) rather than quietly returning a meaningless
>>> result.
>>
>> So, you are saying this is a poorly implemented compiler?
>
> Not at all.
>
> First, strlen() is part of the runtime library, not part of the
> compiler.

It is part of the implementation which covers both compiler and library.
Many compilers can generate their own inline code for strlen() in which
case the "library" as a separate concept has little to do with it.

Lawrence

Gregory Pietsch

unread,
Apr 25, 2005, 4:40:55 PM4/25/05
to

Flash Gordon wrote:
> Gregory Pietsch wrote:
> > I checked my libraries,
>
> Do you mean your personal libraries or your implementations. Remember
> that the implementation is allowed to do things you are not allowed
> to do.

It was my implementation, based on unravelling the "while(*s)s++" loop.

>
> > and the following may be faster than the above:
>
> What above? Please quote enough of the message you are replying to
> for us to see what you are talking about. There is an option that gets
> Google to do the right thing and if you search the group I'm sure you
> will find the instructions. It's in someone's sig, but I can't
> remember who.
>
> > #include <string.h>
> > #ifndef _OPTIMIZED_FOR_SIZE
>
> An implementation could declare that or not for any reason it wants.

If _OPTIMIZED_FOR_SIZE is not defined, the implementation tries to unravel
the "while(*s)s++" loop somewhat.

>
> > #include <limits.h>
> > /* Nonzero if either X or Y is not aligned on a "long" boundary. */
> > #ifdef _ALIGN
>
> Again, a compiler could declare that or not as it saw fit.

There's no way to portably detect whether a pointer-to-char is aligned
on a long boundary, is there?

>
> > #define UNALIGNED1(X) ((long)X&(sizeof(long)-1))
>
> There is no guarantee that this will tell you if it is aligned. Some
> people around here have worked on word addressed systems where the
> byte within the word was flagged in the *high* bits of the address.

I bet that makes for some funky internal pointer arithmetic!

>
> > #else
> > #define UNALIGNED1(X) 0
> > #endif
> >
> > /* Macros for detecting endchar */
> > #if ULONG_MAX == 0xFFFFFFFFUL
> > #define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)
>
> Misleading name, I initially read that as a screwy attempt to detect
> a NULL pointer. DETECTNULCHAR would be better.
>
> > #elif ULONG_MAX == 0xFFFFFFFFFFFFFFFFUL
> > /* Nonzero if X (a long int) contains a NULL byte. */
> > #define DETECTNULL(X) (((X) - 0x0101010101010101) & ~(X) & \
> > 0x8080808080808080)
> > #else
> > #define _OPTIMIZED_FOR_SIZE
>
> Isn't that macro you are defining in the implementation name space?
> Anything could happen.
>

I tried two types of optimizations, one for time (try to unravel the
loop) and one for size. If I get a kind of system where casting
a pointer-to-char to a pointer-to-unsigned-long doesn't make much
sense, #defining _OPTIMIZED_FOR_SIZE allows me to leave out code that
wouldn't work in that situation.

> > #endif
> >
> > #ifdef DETECTNULL
> > #define DETECTCHAR(X,MASK) DETECTNULL(X^MASK)
> > #endif
> >
> > #endif
> > /* strlen */
> > size_t (strlen)(const char *s)
> > {
> > const char *t = s;
> > #ifndef _OPTIMIZED_FOR_SIZE
> > unsigned long *aligned_addr;
> >
> > if (!UNALIGNED1(s)) {
> > aligned_addr = (unsigned long *) s;
> > while (!DETECTNULL(*aligned_addr))
> > aligned_addr++;
>
> The above could read bytes off the end of a properly nul terminated
> string. For example,
> size_t len = strlen("a");

I'm testing for having a null character somewhere among the characters
that make up the area that aligned_addr points to. If I don't get a
sane environment (as indicated by the _OPTIMIZED_FOR_SIZE macro), this
code isn't even compiled in.

Here's the general idea: suppose, for example, sizeof(unsigned long) is
4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
don't care if *aligned_addr is big-end-aligned or little-end-aligned.
Oh, well, is there a better way to unravel "while(*s)s++"?

>
> > /* The block of bytes currently pointed to by aligned_addr
> > contains a null. We catch it using the bytewise search. */
> > s = (const char *) aligned_addr;
> > }
> > #endif
> > while (*s)
> > s++;
> > return (size_t) (s - t);
>
> No need to cast the result of the subtraction. The compiler already
> knows it is returning a size_t so will do the conversion anyway.

The cast is only for my eyes. ;-)

>
> > }
> >
> > /* Gregory Pietsch */
> --
> Flash Gordon
> Living in interesting times.
> Although my email address says spam, it is real and I read it.

Gregory Pietsch

Keith Thompson

unread,
Apr 25, 2005, 5:21:10 PM4/25/05
to
Lawrence Kirby <lkn...@netactive.co.uk> writes:
> On Sun, 24 Apr 2005 05:02:10 +0000, Keith Thompson wrote:
[...]

>> First, strlen() is part of the runtime library, not part of the
>> compiler.
>
> It is part of the implementation which covers both compiler and library.
> Many compilers can generate their own inline code for strlen() in which
> case the "library" as a separate concept has little to do with it.

You're right. I should have said that strlen() is *typically
implemented as* part of the runtime library, not part of the compiler.
(I don't know how many compilers generate inline code, and therefore
how accurate "typically" is.)

Christian Bau

unread,
Apr 25, 2005, 6:35:16 PM4/25/05
to

> Thanks. Maybe my question should be "what if the input is a char array
> without a null terminator". But from my experimental results, it seems
> that strlen can still return the number of characters of a char array.
> I am just not sure whether I am just lucky or sth else happened inside
> strlen.

You are not lucky, you are unlucky.

If you were lucky, your program would crash as soon as you tried this, and
then you would know there is a bug that needs fixing. If you are
unlucky, you get a result that doesn't show the bug.

Tim Prince

unread,
Apr 25, 2005, 9:16:46 PM4/25/05
to

"Keith Thompson" <ks...@mib.org> wrote in message
news:ln7jiqv...@nuthaus.mib.org...

> Lawrence Kirby <lkn...@netactive.co.uk> writes:
>> On Sun, 24 Apr 2005 05:02:10 +0000, Keith Thompson wrote:
> [...]
>>> First, strlen() is part of the runtime library, not part of the
>>> compiler.
>>
>> It is part of the implementation which covers both compiler and library.
>> Many compilers can generate their own inline code for strlen() in which
>> case the "library" as a separate concept has little to do with it.
>
> You're right. I should have said that strlen() is *typically
> implemented as* part of the runtime library, not part of the compiler.
> (I don't know how many compilers generate inline code, and therefore
> how accurate "typically" is.)
Several common compilers, both commercial and free software, have both
in-line and library implementations, as provided for in standard C (both C89
and C99). In normal usage, not allowing for both possibilities would open
up the possibility of Undefined Behavior.
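
The provision in question is that a library function may additionally be provided as a macro, and that a program can still get at the real function. A small sketch (the wrapper names are invented): parenthesizing the name, as in the (strlen) definitions posted earlier in this thread, suppresses any function-like macro, and so does using the name without a following parenthesis to obtain a pointer.

```c
#include <stddef.h>
#include <string.h>

/* May expand to a macro or be treated as a compiler built-in, if the
   implementation provides one. */
size_t len_default(const char *s)
{
    return strlen(s);
}

/* The parentheses prevent expansion of a function-like macro named
   strlen, so as far as the language is concerned this names the
   library function itself. */
size_t len_function(const char *s)
{
    return (strlen)(s);
}

/* Taking the function's address also bypasses any macro. */
size_t len_via_pointer(const char *s)
{
    size_t (*fn)(const char *) = strlen;

    return fn(s);
}
```

All three must return the same value for any valid string; which machine code is generated for each is up to the implementation.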


Chris Torek

unread,
Apr 27, 2005, 3:51:16 AM4/27/05
to
In article <1114461655.5...@z14g2000cwz.googlegroups.com>,

Gregory Pietsch <GK...@flash.net> wrote:
>There's no way to portably detect whether a pointer-to-char is aligned
>on a long boundary, is there?

No (at least, not if by "portable" you mean what we usually do in
comp.lang.c :-) ... there are versions that are "portable" to those
systems that define an alignment function or macro, such as all
the BSD variants).

[code using things like]

>>> #define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

>I tried two types of optimizations, one for time (try to unravel the
>loop) and one for size. ...

>Here's the general idea: suppose, for example, sizeof(unsigned long) is
>4. I can freely cast a pointer-to-char to a pointer-to-unsigned-long. I
>don't care if *aligned_addr is big-end-aligned or little-end-aligned.
>Oh, well, is there a better way to unravel "while(*s)s++"?

Maybe, maybe not. It is quite CPU-dependent.

For whatever it is worth (perhaps not much at this point), I tried
the above trick in SPARC assembly code when I was writing the 4.4BSD
C library routines for the SPARC. (I wrote many of the "portable"
routines as well; we set things up so that when you built for VAX,
Tahoe, or SPARC, you got either the machine-specific version or the
generic, depending on whether we had written a machine-specific
version.)

The result was that the fancy version using "four bytes at a time"
scans (on aligned pointers) was significantly *slower* than the
dumb, simple, one-byte-at-a-time version, even for relatively long
strings. I was a bit surprised; and the results might be different
on a more modern CPU (this was back in 1991 or so).

(I wrote the whole thing in assembly -- well, in C at first, compiled
to assembly, then hand-edited -- so I know it was not the compiler
doing anything tricky, either.)

It turns out that in most C programs, most strings are very short.
The "Dhrystone" tests that many people used to use to compare C
library implementations use strings that are significantly longer
than average, and overemphasize the time behavior of strlen(),
strcpy(), and strcmp() on relatively long strings. Even for these
longer strings, the "optimized" strlen() was still slower.

Of course, this "most C strings are short" rule of thumb may come
about because most C libraries are optimized for short strings
because most strings are short because most C libraries are optimized
for short strings, etc. :-) In other words, if you have a lot of
long strings, and you do program optimization, you will avoid
calling strlen() on them so much.

Even if one breaks this initial chicken-and-egg loop (by calling
strlen() repeatedly on long strings), and then optimizes the heck
out of strlen(), one can probably still speed up one's programs by
fixing the repeated calls to strlen(). There is another rule of
thumb that applies beyond just C programming, or even computers:

The shortest, fastest, cheapest, and most reliable parts of
any system are the ones that are not there.

(This is another way of putting the "KISS" principle. Of course,
marketing usually gets in the way of this idea. :-) )
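
As an illustration of "fixing the repeated calls" (the function names are invented for this sketch): the first version below calls strlen() on every pass through the loop unless the compiler manages to hoist the call, the second calls it once, and the third does not need it at all.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rescans the whole string each time the loop condition is tested. */
void upcase_slow(char *s)
{
    size_t i;

    for (i = 0; i < strlen(s); i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* One call to strlen(), made before the loop. */
void upcase_once(char *s)
{
    size_t i;
    size_t n = strlen(s);

    for (i = 0; i < n; i++)
        s[i] = (char)toupper((unsigned char)s[i]);
}

/* No call at all: the loop itself looks for the '\0'. */
void upcase_none(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

All three produce the same result; they differ only in how many times the string is traversed.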