Can we have int4_t in <stdint.h> ?
I cannot find the words that prevent this.
A similar question is, in an implementation where CHAR_BITS is defined to be
16 (to cope with UTF-16 "char"), can we have int8_t?
Antoine
In 40f291b2$0$25741$626a...@news.free.fr, Antoine Leca wrote:
> In 40f291b2$0$25741$626a...@news.free.fr, Antoine Leca wrote:
> > Assuming an architecture where "nibbles" (quartets), both signed and
> > unsigned, are easy to deal with.
> >
> > Can we have int4_t in <stdint.h> ?
>
> Sorry for the noise. I found the answer myself: the definition of CHAR_BIT
> (without S) prevents it.
Note that int_least4_t *is* allowed (and should be one char wide).
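For instance, an implementation could provide it along these lines (a
purely hypothetical sketch; nothing requires these exact definitions to
exist):

    /* Hypothetical excerpt from an implementation's <stdint.h>:
       int_least4_t only has to be *at least* 4 bits wide, so one char is
       the natural (and smallest possible) choice.  The limit macros must
       describe the actual type, hence the char limits. */
    #include <limits.h>
    typedef signed char   int_least4_t;
    typedef unsigned char uint_least4_t;
    #define INT_LEAST4_MIN   SCHAR_MIN
    #define INT_LEAST4_MAX   SCHAR_MAX
    #define UINT_LEAST4_MAX  UCHAR_MAX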
Richard
It's prohibited by a combination of:
- intN_t cannot have padding bits
- object size is an integral number of bytes
- the smallest allowed byte size is 8 bits
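Spelled out as a compile-time check, that combination is just arithmetic
(a sketch; STATIC_ASSERT is the usual negative-array-size emulation, not
a standard facility):

    #include <limits.h>

    #define STATIC_ASSERT(cond, name) typedef char name[(cond) ? 1 : -1]

    /* An exact-width intN_t has no padding bits, so its width is exactly
       sizeof(intN_t) * CHAR_BIT.  sizeof of an object type is at least 1,
       and CHAR_BIT is at least 8, so N can never be less than 8; a
       hypothetical int4_t would need sizeof(int4_t) * CHAR_BIT == 4. */
    STATIC_ASSERT(CHAR_BIT >= 8, byte_is_at_least_8_bits);
    STATIC_ASSERT(sizeof(char) * CHAR_BIT >= 8, no_narrower_object_type);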
> A similar question is, in an implementation where CHAR_BITS is defined to be
> 16 (to cope with UTF-16 "char"), can we have int8_t?
No. char is specified as occupying precisely one byte,
so a similar argument applies (change only the third
factor).
Ah: this is the part I cannot find. How do you reach this conclusion?
Antoine
See 6.5.3.4. Now if you can come up with any interpretation of that
which allows correct calculation of the size of an array whilst also
allowing the size of an object to be a non-integral number of bytes
(even though sizeof is required to return an integer value), please
share it with us.
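For what it's worth, the identity at stake is simply this (a small
illustration, nothing more):

    #include <stdio.h>

    /* The size of an array is the element count times the element size,
       both whole numbers of bytes.  If an object could occupy a fraction
       of a byte, sizeof (which yields an integer) could not report it and
       this identity could not hold. */
    int main(void)
    {
        short a[10];
        printf("%zu == 10 * %zu\n", sizeof a, sizeof a[0]);
        return sizeof a == 10 * sizeof a[0] ? 0 : 1;
    }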
--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects
This is precisely what I was intending to avoid!
I feel that demonstrating a "theorem" about a basic property of some basic
concept, using reductio ad absurdum reasoning on a related feature, is a bit
of a red herring.
As a related example, there is quite a bit of code that is "defensively
programmed" and as such uses "sizeof(char)" in place of "1". Of course, I have
nothing against such code (moreover, it serves a documentation purpose very
well).
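(A typical instance of that style, for the record; make_buffer is just an
illustrative name:)

    #include <stdlib.h>

    /* sizeof(char) is 1 by definition, so both forms request exactly n
       bytes; the multiplication only documents the element type and makes
       a later switch to, say, wchar_t a one-line change. */
    char *make_buffer(size_t n)
    {
        return malloc(n * sizeof(char));   /* same as malloc(n) */
    }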
But I also read people explaining that they write it this way "for the day
the Standard is changed to make char wider than one byte."
Isn't that about the same reasoning, just run in reverse?
In fact, does the constraint that objects occupy an integral number of bytes
rest on something as thin as the definition of sizeof for an array?
Antoine
There are two groups of people who do that: those that do not know any
better and those that want to allow themselves to change to a different
type such as wchar_t with minimal trouble.
>
>But I also read people explaining that they write it this way "for the day
>the Standard is changed to make char wider than one byte."
That is just pure paranoia. It will never happen for the simple reason
that WG14 is not actually trying to kill C (and breaking so much legacy
code would be Committee suicide)
>Assuming an architecture where "nibbles" (quartets), both signed and
>unsigned, are easy to deal with.
>
>Can we have int4_t in <stdint.h> ?
>
>I cannot find the words that prevent this.
int4_t would have to be an object type, and that is impossible.
>A similar question is, in an implementation where CHAR_BITS is defined to be
>16 (to cope with UTF-16 "char"), can we have int8_t?
The existence of the sizeof operator renders both scenarios impossible.
Think of unsigned char as the atom out of which larger objects are built.
Bit-fields are not first class citizens in C: they have no address and no
size (in terms of sizeof). And, of course, they cannot exist in a free
state (consider them quarks ;-)
As a direct consequence of these considerations, the bit-addressing
capabilities of certain architectures cannot be mapped onto any standard C
feature: an implementation needs to provide specific extensions for
exploiting them.
Dan
--
Dan Pop
DESY Zeuthen, RZ group
Email: Dan...@ifh.de
What about this one then:
6.2.6.1p2 "Except for bit-fields, objects are composed of contiguous
sequences of one or more bytes, the number, order, and encoding of which
are either explicitly specified or implementation-defined."
> In fact, does the constraint that objects occupy an integral number of bytes
> rest on something as thin as the definition of sizeof for an array?
No, a lot of things in C rely on the fact that objects consist of an
integral number of bytes. Think about memcpy().
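For example, the usual struct-copying idiom only works because sizeof
counts whole bytes (a small illustration):

    #include <string.h>

    struct point { int x, y; };

    /* memcpy()'s interface is defined entirely in bytes: it copies
       sizeof *dst bytes, which presupposes that every object occupies an
       integral number of them. */
    void copy_point(struct point *dst, const struct point *src)
    {
        memcpy(dst, src, sizeof *dst);
    }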
Various ways. The simplest is that sizeof returns an integer.
It was possible, up to the time that Dennis chipped in
during preparation of the original (1989) C standard
and said that sizeof(char)==1 was essential. I'm not
sure he had heard the full context of the debate...
Anyway, once it was standardized programmers started
relying on it as an axiom, so indeed changing it now
would have substantial adverse impact. The *only*
sane way to make such a change would be to do it in
two phases, the first a transition period during which
programmers could identify and change affected areas,
but sizeof(char) would still be 1, and the second when
some implementations would make sizeof(char) > 1.
However, this should have been done back in 1986 along
the lines of my "short char" proposal (short char would
have been the sizeof unit). I don't think it could be
fixed in C without such disruption as to cause mutiny.
Designers of new procedural languages should apply the
lessons learned to get it right in new PLs.
>That is just pure paranoia. It will never happen for the simple reason
>that WG14 is not actually trying to kill C (and breaking so much legacy
>code would be Committee suicide)
C would survive such an attempt; it's the C standard that wouldn't.
It's not very clear to what extent even C99 can be considered alive...
But they should make sure to learn the right lessons, some of which are:
- the ability to represent integer sequences as packed arrays of various
element sizes, including *at least* 1 bit, 8 bits, 16 bits, and 32 bits,
with signed and unsigned variants for > 1 bit, is absolutely required.
Whether a reference to an array position exists as a first class value
should not depend on the element size.
- an octet value is an integer between 0 and 255 inclusive. Encoding
of data for transfer between machines or long-term storage must
be, *at the lowest level of software*, defined in terms of sequences
of octets.
- the definitions of basic types should have well-defined sizes (not
excluding arbitrary-precision types) and semantics.
- character encoding issues cannot be left implementation-defined.
The idea that each platform has its own text format no longer makes
sense; what matters is how well the language and its libraries
support implementation of protocols that make use of specific
character encoding schemes.
- the character encoding schemes defined by Unicode are the most
important ones. Read the current Unicode standard, don't rely on
anything you may have heard about it.
None of this is specific to procedural languages. In any case, I
suspect most language designers would come to similar conclusions
themselves; it's more a matter of learning how not to do it from C.
David Hopwood <david.nospam.hopwood@
Agreed. Ideally, integers could be specified by required
properties rather than merely by width. C has a little of
that in signed vs. unsigned and <stdint.h>, but the
facility should be embedded in the language and provide
more capability. This would be especially useful in
laying out externally-imposed data structure formats.
> - an octet value is an integer between 0 and 255 inclusive. Encoding
> of data for transfer between machines or long-term storage must
> be, *at the lowest level of software*, defined in terms of sequences
> of octets.
I wouldn't put that at the "lowest level of software",
although since it is an overwhelming convention there
should be adequate support for it in I/O functions.
(Note that there are still people using 60-bit hosts
who want whatever modern languages they can get.)
> - the definitions of basic types should have well-defined sizes (not
> excluding arbitrary-precision types) and semantics.
One way or another, there needs to be flexibility for
a systems programming language to match the target
architecture. Something like "system typedefs" (in
terms of requirements-based specified integer types)
would probably meet the need. For example, in C,
"int" could be defined on one platform as integer:
size(32),align(8),fmt(2scompl,v[32..0]),notrap, and
on another as integer:size(48):fmt(sgnmag,v[31..0],
s[46],m[29..0,45..32]),ovtrap, or whatever the
notation would be. The compilers would be smart
enough to optimize such a type to use native integer
operations (as opposed to more involved, general
run-time arithmetic functions).
> - character encoding issues cannot be left implementation-defined.
> The idea that each platform has its own text format no longer makes
> sense; what matters is how well the language and its libraries
> support implementation of protocols that make use of specific
> character encoding schemes.
There still needs to be some platform default encoding,
different across platforms. What I would urge is a
*single* character-unit object type using a universal
encoding (UCS-4 seems to be the only good candidate)
for *internal* program use. All other encodings
would be converted at the I/O interface, using
whatever has been specified for the stream (at that
point).
> - the character encoding schemes defined by Unicode are the most
> important ones. Read the current Unicode standard, don't rely on
> anything you may have heard about it.
Actually, Unicode and Java screwed up big-time when
they tried to get away with only 16 bits. The entire
"surrogate" scheme should *not* be wired into a good PL.
> None of this is specific to procedural languages. In any case, I
> suspect most language designers would come to similar conclusions
> themselves; it's more a matter of learning how not to do it from C.
That's the flip side of the C experience...
Unfortunately, all too often designers model features
after what has already been implemented rather than
what should have been.
Most operations on text don't actually need random-access indexing.
Furthermore, Unicode combining character sequences, case mappings
that map a character to a character sequence, and the like mean
that operations on single characters (= codepoints) should in many
cases be abandoned in favor of operations on complete character
sequences.
For space efficiency reasons, this would suggest UTF-8 as the default
in-memory encoding (with the added benefit of having a representation
compatible with US-ASCII), and only in very few places would there be
a need to convert to a fixed-length codepoint encoding like UCS-4.
-- Niklas Matthies
Many operations on text move "right to left," starting
from the end of a string and working backward. (True, this
is less common than "left to right," but more common than
complete random access.) Can UTF-8 be read backwards?
Or then, there's unidirectional motion but in jumps of
varying sizes, as in some of the high-speed substring search
algorithms. Can one skip forward over N UTF-8 characters
(not C `char's) without examining the skipped data?
Programmers had been relying on it as an axiom long before it was
standardized; that's why the committee agreed with DMR that it was
essential.
-Larry Jones
All this was funny until she did the same thing to me. -- Calvin
Yes.
> Or then, there's unidirectional motion but in jumps of
> varying sizes, as in some of the high-speed substring search
> algorithms. Can one skip forward over N UTF-8 characters
> (not C `char's) without examining the skipped data?
No. But if you mean the same high-speed search algorithms that I'm
familiar with, then they can be implemented in terms of octets instead
of in terms of codepoints. Any match of an octet sequence corresponding
to a UTF-8 codepoint sequence within another such octet sequence is an
actual match of the codepoint sequences, because octets that start a
UTF-8-encoded codepoint never occur in the middle of any UTF-8-encoded
codepoint.
-- Niklas Matthies
I never said that they did. However, the convenience for
programming of characters being handled as single units
has been amply demonstrated. Universality of the internal
code allows programs to easily combine and process text
from different external encodings.
> Furthermore, Unicode combining character sequences, case mappings
> mapping a character to a character sequence and the like means
> that operations on single characters (=codepoints) should in many
> cases be abandoned in favor of operations on complete character
> sequences.
Certain linguistic operations of course need some knowledge
of the language involved. And some "control" functions may
have special handling when it is necessary to take account
of their meaning. However, for a large amount of text
processing, no interpretation of the text units is required,
except perhaps for certain delimiter characters.
> For space efficiency reasons, this would suggest UTF-8 as the default
> in-memory encoding (with the added benefit of having a representation
> compatible with US-ASCII), and only in very few places would there be
> a need to convert to a fixed-length codepoint encoding like UCS-4.
Unfortunately, UTF-8 is a variable-width code, which bogs
down programming. Uniformity is a virtue in programming.
In theory: first translate into single units in a
forward direction, then it's easy... But this is
likely to be impractical or awkward in many cases.
> Or then, there's unidirectional motion but in jumps of
> varying sizes, as in some of the high-speed substring search
> algorithms. Can one skip forward over N UTF-8 characters
> (not C `char's) without examining the skipped data?
Indeed, Boyer-Moore et al. are good examples of
how uniformity of the character units supports
more efficient programming.
I think it was in the same category as assuming
4-byte ints; indeed many programmers made heavy
use of such assumptions, which worked on their
current platforms but which were not guaranteed
to work on every platform. Yet others were more
careful and coded without relying unnecessarily
on such assumptions. (I have encountered a lot
of sizeof(char) in old code, especially when the
type of sizeof (now size_t) was not available as
a typedef, so that multiplication by sizeof(char)
served the additional useful purpose of yielding
that type.) During the C
standardization process, some such assumptions
were turned into guarantees and others were not;
they were all "judgment calls". The most
serious problem with sizeof(char)==1 is it
blurred the logical distinction between storage
access units and character representation, with
the ultimate effect of requiring a second kind
of internal character representation (wchar_t)
in addition to the unavoidable external variety
of encodings. You may recall that in my
proposal (referred to as "short char" since that
was one of its distinctive features) all text
streams were essentially what are now known as
"wide-oriented" streams and char (being separate
from byte) played the role that is now occupied
by wchar_t. In effect, recompile any old
Software-Tools-like program on a new platform
and it would work with the platform's native
multibyte encoding. (There would probably have
eventually been some way to change the encoding
in effect on a stream.)
Anyway, as I've already agreed, it is too late
to do this for C, but maybe the next significant
PL can avoid falling into the same trap.
>For space efficiency reasons, this would suggest UTF-8 as the default
>in-memory encoding (with the added benefit of having a representation
>compatible with US-ASCII), and only in very few places would there be
>a need to convert to a fixed-length codepoint encoding like UCS-4.
Are space efficiency reasons relevant today for text processing
applications? Few text documents exceed 1 MB, and very few 10 MB, while
the cheapest desktop/laptop computers money can buy today come with
256 MB of main memory. By the time UTF-8 becomes the de facto
standard character encoding for text documents, the low end machines
used for text processing will probably have 1 GB of main memory, while
the size of text documents is not expected to increase.
So, I can't imagine many programmers willing to bother with UTF-8 as the
internal character representation, i.e. use multibyte characters instead
of wide characters.
It's not just text documents. Many (if not most) non-text-processing
programs do process and keep in memory a whole lot of strings.
One issue is data caches. With strings consisting mostly of ASCII
characters, cache misses become near to four times more likely with
UCS-4 than with UTF-8. When doing a lot of string operations this can
have a significant impact on performance. (No, I don't have figures.)
Anyway, "RAM is cheap" has always been a questionable reasoning.
My experience is that you always end up wishing for more RAM.
For example my e-mail archive, which consists mostly of text (I very
rarely get binary attachments), currently has over 2GB, and many
folders have dozens of MB. I wouldn't want those to be in UCS-4, or
have to be converted to UCS-4 each time for searching and sorting.
> So, I can't imagine many programmers willing to bother with UTF-8 as the
> internal character representation, i.e. use multibyte characters instead
> of wide characters.
It seems like UTF-8 is being established as the standard character
encoding under Unix (or at least Linux). For many operations, UTF-8
is not any more bothersome than UCS-4.
(This is even more true in C++ where it's easy to abstract away things
like the details of iterating through a UTF-8 string - and we were
talking about some hypothetical new language in this thread.)
-- Niklas Matthies
>For example my e-mail archive, which consists mostly of text (I very
>rarely get binary attachments), currently has over 2GB,
Highly irrelevant, unless it consists of a single file. And if it does,
the I/O time is likely to exceed the processing time by orders of
magnitude, when searching something in it.
>and many
>folders have dozens of MB. I wouldn't want those to be in UCS-4, or
No one was talking about using UCS-4 as an external encoding, right?
>have to be converted to UCS-4 each time for searching and sorting.
Compared to the I/O time, the conversion to UCS-4 should be a piece of
cake.
OTOH, the search and sorting itself might be much more efficient on
characters of the same size. Has anyone tried Boyer-Moore on UTF-8
strings?
>> So, I can't imagine many programmers willing to bother with UTF-8 as the
>> internal character representation, i.e. use multibyte characters instead
>> of wide characters.
>
>It seems like UTF-8 is being established as the standard character
>encoding under Unix (or at least Linux).
In the sense that some utilities can handle it. This is a long way from
getting people to actually use it, instead of their current character set,
e.g. Latin-1. If I type 'é' in vi, under Linux, I get a single character
in the output file, with the default settings. Not exactly what I would
call "UTF-8 being the established standard character encoding under
Linux".
There have been tests, done on highly internationalized data (so probably
biased), that showed that UTF-16 was a bit more efficient than UTF-8 for
internal processing. UTF-32 was clearly worse (cache efficiency, I believe.)
I may be wrong, but I seem to recall that this was a factor in the decision
of the committee to produce TR 19769 on this matter, to make UTF-16 somewhat
standardizable (at the moment, since it is a variable-length encoding, it
does not qualify as wchar_t; and of course it is impractical to use it as
char...)
> Anyway, "RAM is cheap" has always been a questionable reasoning.
> My experience is that you always end up wishing for more RAM.
Sure. After dealing with pure text on 80x25 screens, we enjoyed WYSIWYahwtG.
Then music. Now movies. Next, 3-D games?
> It seems like UTF-8 is being established as the standard character
> encoding under Unix (or at least Linux).
Many people would *pay* for this to be really true...
> For many operations, UTF-8 is not any more bothersome than UCS-4.
OTOH, there are others where it is a nightmare... Basically, if you are doing
anything at character level, a pointer into UTF-8 data is twice as large as a
pointer into UTF-32; and doubling the size of the pointers means halving the
number of usable registers; and on i386... you get the picture.
Antoine
Yes.
In 9eudndciXdC...@comcast.com, Douglas A. Gwyn wrote:
> In theory: first translate into single units in a
> forward direction, then it's easy...
This is the obvious way, but you can do it otherwise: assume the previous
character is correct, and start accumulating while reading backward, until
you get the signal (with (*p & 0300) == 0300); you then have to deal with the
error cases (sequence too long, non-canonical, etc.) but there is no real
added difficulty. It is just different from the way you do it while
reading forward.
The biggest problem I believe is when data are completely garbage, because
you may need to backtrack for a while (until you encounter some ASCII
character, in fact).
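A minimal sketch of that backward step (assuming valid UTF-8 and that we
do not run off the start of the buffer; utf8_prev is just an illustrative
name):

    /* Step back to the start of the previous character: trailing bytes
       all match 10xxxxxx, i.e. (b & 0300) == 0200, so skip them until a
       lead byte (ASCII or 11xxxxxx) is found. */
    const unsigned char *utf8_prev(const unsigned char *p,
                                   const unsigned char *start)
    {
        do {
            --p;
        } while (p > start && (*p & 0300) == 0200);
        return p;
    }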
>> Can one skip forward over N UTF-8 characters
>> (not C `char's) without examining the skipped data?
If you cannot assume all the data are correct: no.
If you can assume that: it requires you to examine exactly N bytes (each
lead byte's pattern tells you how many bytes you may skip; encountering
an ASCII character means no extra skip).
The really difficult thing is to move backwards over N characters: then you
really have to examine all the bytes (as should be obvious from the
above). However, not a lot of algorithms need that.
Short theory of UTF-8: any character has a leading byte and a number of
trailing bytes (max 3 now); these categories do not intersect; and any
leading byte indicates how many trailing bytes follow it.
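As a sketch, under the same assumption of valid data (utf8_skip is again
just an illustrative name):

    #include <stddef.h>

    /* Skip forward over n characters by looking only at each lead byte:
       0xxxxxxx is a 1-byte character, 110xxxxx starts 2 bytes, 1110xxxx
       starts 3 bytes, 11110xxx starts 4 bytes. */
    const unsigned char *utf8_skip(const unsigned char *p, size_t n)
    {
        while (n-- > 0) {
            unsigned char b = *p;
            if      (b < 0200) p += 1;   /* ASCII    */
            else if (b < 0340) p += 2;   /* 110xxxxx */
            else if (b < 0360) p += 3;   /* 1110xxxx */
            else               p += 4;   /* 11110xxx */
        }
        return p;
    }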
Antoine
... or using some internal encoding for these complex sequences to be dealt
with as a single unit. I recently read (here) that emacs 22 may be doing
such a thing.
Similarly, if what you really want is sorting, storing a 32-bit value that
maps to the external characters, but really is a compacted view of the
sorting keys (even multi-level), is probably more efficient than storing
UTF-8. And you might even pretend you are storing text by using the reverse
mapping, provided you choose it correctly.
Antoine
Why would a pointer to UTF-8 data be bigger than a pointer to UTF-32
data?
--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
> In the sense that some utilities can handle it. This is a long way from
> getting people to actually use it, instead of their current character set,
> e.g. Latin-1. If I type 'é' in vi, under Linux, I get a single character
> in the output file, with the default settings.
When running in a Latin-1 locale that's the right thing to do. But then,
with Emacs you can explicitly set the file encoding independent of the
locale; I don't know whether vi(m) can do that.
Andreas.
--
Andreas Schwab, SuSE Labs, sch...@suse.de
SuSE Linux AG, Maxfeldstraße 5, 90409 Nürnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Of course.
> Or then, there's unidirectional motion but in jumps of
> varying sizes, as in some of the high-speed substring search
> algorithms. Can one skip forward over N UTF-8 characters
> (not C `char's) without examining the skipped data?
No. But see Niklas Matthies' argument above: skipping over the
UTF-8 encodings of N code points would not skip over N characters.
High-speed substring search does *not* require skipping over N characters,
or N code points. It does require skipping over N code units, but that
is as easy in UTF-8 as it is in ASCII.
UTF-8 in fact requires no change to substring search algorithms, because
the sets of lead byte values and trail byte values are disjoint (so a
subsequence of bytes that matches a given valid encoding is necessarily
a matching substring).
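In C terms this means a plain byte-oriented search is already correct on
valid UTF-8; for instance (an illustration, not a library facility):

    #include <string.h>

    /* Because lead-byte and trail-byte values are disjoint, any byte-level
       match of one valid UTF-8 string inside another is also a match of
       the code point sequences, so strstr() needs no modification. */
    const char *find_utf8(const char *haystack, const char *needle)
    {
        return strstr(haystack, needle);
    }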
David Hopwood <david.nosp...@blueyonder.co.uk>
Yes, absolutely. Remember that a lot of data that is not natural language
text is represented as character strings. Applications that need to process
data consisting of billions of characters are commonplace.
David Hopwood <david.nosp...@blueyonder.co.uk>
In other words, still "no": the question specified "without examining
the skipped data".
Vim lets you configure the internal encoding, file encoding and terminal
(display) encoding separately. The default encoding is determined by
the current locale.
-- Niklas Matthies
I don't. At the time, 16-bit ints were still alive and kicking and
nearly as popular as 32-bit ints. In fact, given that the PCs of the
time were 16-bit machines, one could argue that they were still more
common than 32-bit ints. (A common philosophical argument of the time
was whether "most" C programmers were using Unix platforms or PC
platforms.) Even the early implementations of C had had varying sizes
of int (K&R notes 16, 32, and 36 bits) and careful programmers knew
better than to rely on any particular size. On the other hand, char had
*always* been 1 (not necessarily 8-bit) byte, and lots of otherwise
carefully written code depended on that characteristic.
-Larry Jones
I always send Grandma a thank-you note right away. ...Ever since she
sent me that empty box with the sarcastic note saying she was just
checking to see if the Postal Service was still working. -- Calvin
Since we're talking about the days when 7-bit codesets
were still deemed to be sufficient, no matter what
"byte size" an implementation used, it was bound to
be big enough to hold the native character encoding,
so there was no "practical" incentive to worry about
it until such time as larger character sets started to
become important. There *was* some carefully written
code that for purposes of portability accommodated
whatever sizeof(char) might be encountered, and there
was no written guarantee of sizeof(char)==1 in the
base document. We could have decided it either way.
Indeed there are about a dozen different approaches
we could have taken concerning characters and bytes;
as I recall we had more than cursory discussion about
at least four of them. I don't think the one we chose
was best, if for no other reason (and there are other
reasons) than that it disallows UCS-2 or -4 as a
multibyte encoding (something there is now an effort
to fix).
> ......there
> was no written guarantee of sizeof(char)==1 in the
> base document. We could have decided it either way.
Assuming the base document was (a close derivative of) K&R1,
this is true, but
The sizeof operator yields the size, in bytes, of its operand.
(A byte is undefined by the language except in terms of the
value of sizeof. However, in all existing implementations,
a byte is the space required to hold a char.)
Some wriggle room there, but evidently not enough in time.
Dennis
Yes, thanks for the quote. I always took it to mean
that "so far we haven't needed to do otherwise" rather
than "you might as well assume it will always be this
way". Others might have interpreted it differently.
Anyway, it was firmly embedded into the C standard,
and any sort of change in this area now would take a
lot of work and be rather disruptive.
Because (if you are doing character-level stuff) you have to record the
length of the character as well as its position.
Antoine
I was not disputing this. I was only pointing out that you are not required
to examine ALL the skipped data, a perhaps useful property, depending on the
context.
Perhaps I might have put it more clearly. So thanks for doing it.
Antoine
I thought the point of UTF-8 was that you can compute the length of a
character (the number of bytes composing it) by examining the data.
You could redundantly store that information along with the pointer,
but that shouldn't be necessary.
No, why would you need to do that? The length of the character is defined
by its lead byte, which is the byte addressed by the pointer.
David Hopwood <david.nosp...@blueyonder.co.uk>
To improve efficiency (and I do not _need_ it.)
> The length of the character is
> defined by its lead byte, which is the byte addressed by the pointer.
Which means a dereference (no cost, since it will be done anyway) + a
256-byte table lookup (this one does not come for free).
Another way to see the same point is to consider how to store a row of
"several" characters: with UTF-32, or in general any fixed-length encoding,
only a pointer to the base of the row is needed. With UTF-8, and in general
any variable-length encoding, you then need to store the positions of all the
characters, _or_ to pay a toll walking the array to find the N-th
character (which is basically the same tradeoff as above.)
I have found that more often than not, I did not need this array, and just
walking the characters did the job adequately. But when it comes to
_processing_ characters, it turned out that I needed the array of positions
or, alternatively, the bigger pointer (depending on whether I knew beforehand
the size of the "row").
Antoine