
Wide character implementation


Thomas Bushnell, BSG

Mar 19, 2002, 12:08:15 AM

If one uses tagged pointers, then it's easy to implement fixnums as
ASCII characters efficiently.

But suppose one wants to have the character datatype be 32-bit Unicode
characters? Or worse yet, 35-bit Unicode characters?

At the same time, most characters in the system will of course not be
wide. What are the sane implementation strategies for this?

Frode Vatvedt Fjeld

Mar 19, 2002, 4:08:59 AM
tb+u...@becket.net (Thomas Bushnell, BSG) writes:

> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.

Hm.. perhaps you mean it's easy to implement characters as immediate
values?

> But suppose one wants to have the character datatype be 32-bit
> Unicode characters? Or worse yet, 35-bit Unicode characters?
>
> At the same time, most characters in the system will of course not
> be wide. What are the sane implementation strategies for this?

I suppose to assign "most characters in the system" to a sub-type of
the wide characters, and implement that sub-type as immediates.

--
Frode Vatvedt Fjeld

Pierpaolo BERNARDI

Mar 19, 2002, 5:22:05 AM

"Thomas Bushnell, BSG" <tb+u...@becket.net> ha scritto nel messaggio
news:87wuw92...@becket.becket.net...

>
> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.
>
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters? Or worse yet, 35-bit Unicode characters?

21 bits are enough for Unicode.

P.


Erik Naggum

Mar 19, 2002, 5:53:48 AM
* Thomas Bushnell, BSG

| If one uses tagged pointers, then it's easy to implement fixnums as
| ASCII characters efficiently.

Huh? No sense this makes.

| But suppose one wants to have the character datatype be 32-bit Unicode
| characters? Or worse yet, 35-bit Unicode characters?

Unicode is a 31-bit character set. The base multilingual plane is 16
bits wide, and then there is the possibility of 20 bits encoded in two
16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (-
(expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
but one does not have to understand the lo- and hi-word codes that make
up the 20-bit character space. In effect, you need 21 bits. Therefore,
you could represent characters with the following bit pattern, with b for
bits and c for code. Fonts are a mistake, so they are removed.

000000ccccccccccccccccccccc00110

This is useful when the fixnum type tag is either 000 for even fixnums
and 100 for odd fixnums, effectively 00 for fixnums. This makes
char-code and code-char a single shift operation. Of course, char-bits
and char-font are not supported in this scheme, but if you _really_ have
to, the upper 4 bits may be used for char-bits.
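
A minimal sketch of that single-shift CHAR-CODE/CODE-CHAR, assuming a
32-bit word, the low five tag bits #b00110 from the pattern above, and
treating the tagged word as a plain integer (the %-prefixed names are
illustrative, not from any particular implementation):

  (defconstant +char-tag+ #b00110)
  (defconstant +tag-width+ 5)

  (defun %char-code (tagged-word)
    ;; strip the tag: one arithmetic shift to the right
    (ash tagged-word (- +tag-width+)))

  (defun %code-char (code)
    ;; attach the tag: one shift to the left plus the tag bits
    (logior (ash code +tag-width+) +char-tag+))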

| At the same time, most characters in the system will of course not be
| wide. What are the sane implementation strategies for this?

I would (again) recommend actually reading the specification. The
character type can handle everything, but base-char could handle the
8-bit things that reasonable people use. The normal string type has
character elements while base-string has base-char elements. It would
seem fairly reasonable to implement a *read-default-string-type* that
would take string or base-string as value if you choose to implement both
string types.

///
--
In a fight against something, the fight has value, victory has none.
In a fight for something, the fight is a loss, victory merely relief.

Janis Dzerins

Mar 19, 2002, 6:31:52 AM
"Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:

What "Unicode"?

--
Janis Dzerins

Eat shit -- billions of flies can't be wrong.

Sander Vesik

Mar 19, 2002, 11:22:30 AM
In comp.lang.scheme Thomas Bushnell, BSG <tb+u...@becket.net> wrote:
>
> If one uses tagged pointers, then it's easy to implement fixnums as
> ASCII characters efficiently.
>
> But suppose one wants to have the character datatype be 32-bit Unicode
> characters? Or worse yet, 35-bit Unicode characters?

They use either UTF-8 or UTF-16 - you cannot rely on whatever size
you pick to be suitably long forever; Unicode is sort of inherently
variable-length (characters even have two possible representations
in many cases, ä and similar 8-)

>
> At the same time, most characters in the system will of course not be
> wide. What are the sane implementation strategies for this?
>

Implement them as variable-length strings using, say, UTF-8. Also, saying that
most characters will not be wide may well be a wrong assumption 8-)

--
Sander

+++ Out of cheese error +++

Sander Vesik

Mar 19, 2002, 11:27:04 AM
In comp.lang.scheme Erik Naggum <er...@naggum.net> wrote:
> * Thomas Bushnell, BSG
> | If one uses tagged pointers, then it's easy to implement fixnums as
> | ASCII characters efficiently.
>
> Huh? No sense this makes.
>
> | But suppose one wants to have the character datatype be 32-bit Unicode
> | characters? Or worse yet, 35-bit Unicode characters?
>
> Unicode is a 31-bit character set. The base multilingual plane is 16
> bits wide, and then there is the possibility of 20 bits encoded in two
> 16-bit values with values from 0 to 1023, effectively (+ (expt 2 20) (-
> (expt 2 16) 1024 1024)) => 1112064 possible codes in this coding scheme,
> but one does not have to understand the lo- and hi-word codes that make
> up the 20-bit character space. In effect, you need 21 bits. Therefore,
> you could represent characters with the following bit pattern, with b for
> bits and c for code. Fonts are a mistake, so they are removed.
>
> 000000ccccccccccccccccccccc00110

I don't think this is true any more as of Unicode 3.1; afaik, 16 bits is
no longer enough.

[snip - this doesn't sound like scheme]

Ben Goetter

Mar 19, 2002, 11:46:41 AM
Quoth Pierpaolo BERNARDI:

> "Thomas Bushnell, BSG" <tb+u...@becket.net> ha scritto
> > But suppose one wants to have the character datatype be 32-bit Unicode
> > characters? Or worse yet, 35-bit Unicode characters?
>
> 21 bits are enough for Unicode.

And ISO 10646, per working group resolution.

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2175.htm
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2225.doc

Thomas Bushnell, BSG

Mar 19, 2002, 5:33:34 PM
"Pierpaolo BERNARDI" <pierpaolo...@hotmail.com> writes:

> 21 bits are enough for Unicode.

Um, Unicode version 3.1.1 has the following as the largest character:

E007F;CANCEL TAG;Cf;0;BN;;;;;N;;;;;

Now the Unicode space isn't sparse, but I don't think compressing the
space is the most efficient strategy.

Erik Naggum

Mar 19, 2002, 6:18:22 PM
* Sander Vesik <san...@haldjas.folklore.ee>

| I don't think this is true any more as of Unicode 3.1; afaik, 16 bits is
| no longer enough.

Please pay attention and actually make an effort to read what you respond
to, will you? You should also be able to count the number of c bits and
arrive at a number greater than 16 if you do not get lost on the way.

Sheesh, some people.

Erik Naggum

Mar 19, 2002, 6:22:39 PM
* Sander Vesik <san...@haldjas.folklore.ee>

| They use either UTF-8 or UTF-16 - you cannot rely on whatever size
| you pick to be suitably long forever; Unicode is sort of inherently
| variable-length (characters even have two possible representations
| in many cases, ä and similar 8-)

Variable-length characters? What the hell are you talking about? UTF-8
is a variable-length _encoding_ of characters that most certainly are
intended to require a fixed number of bits. That is, unless you think
the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
the symbol ą takes up 8. Then you have variable-length characters. Few
people consider this a meaningful way of talking about variable length.

| Implement them as variable-length strings using, say, UTF-8. Also, saying
| that most characters will not be wide may well be a wrong assumption 8-)

Real programming languages work with real character objects, not just
UTF-8-encoded strings in memory.

Acquire clue, _then_ post, OK?

Tim Moore

Mar 19, 2002, 6:32:19 PM
On 19 Mar 2002 14:33:34 -0800, Thomas Bushnell, BSG <tb+u...@becket.net>
wrote:

Um, what's your point? E007f fits in 20 bits. If you're thinking
that's all that's needed, there are private use areas (E000..F8FF,
F0000..FFFFD, and 100000..10FFFD) that need to be encoded too. So 21
bits looks right.
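
A quick way to check that at a Lisp prompt, since INTEGER-LENGTH gives
the number of bits needed for a non-negative integer:

  (integer-length #xE007F)  => 20
  (integer-length #x10FFFD) => 21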

Tim

Thomas Bushnell, BSG

Mar 19, 2002, 6:46:51 PM
tmo...@sea-tmoore-l.dotcast.com (Tim Moore) writes:

> Um, what's your point? E007f fits in 20 bits. If you're thinking
> that's all that's needed, there are private use areas (E000..F8FF,
> F0000..FFFFD, and 100000..10FFFD) that need to be encoded too. So 21
> bits looks right.

Oh what an embarrassing brain fart, yes that's quite right. I don't
know what I was counting, but my head was clearly on backwards.

Ray Dillinger

Mar 20, 2002, 5:29:16 PM

I'd have a fixed-width internal representation -- probably 32 bits
although that's overkilling it by about a byte and a half, probably
identical to some mapping of the Unicode character set -- and then
use i/o functions that were character-set aware and could translate
to and from various character sets and representations.

I wouldn't want to muck about internally with a format that had
characters of various different widths: too much pain to implement,
too many chances to introduce bugs, not enough space savings.
Besides, when people read whole files as strings, do you really
want to run through the whole string counting multi-byte characters
and single-byte characters to find the value of an expression like

(string-ref FOO charcount) ;; lookups in a 32 million character string!

where charcount is large? I don't. Constant width means O(1) lookup
time.

If space is limited, or if you're doing very serious performance
tuning, you might want to have two separate constant-width internal
character representations, one for short characters (ASCII or 16-bit)
and one for long (full Unicode). But if so, you're going to have to
take into account the extra space that will be used by the
additional executable code in your character and string comparisons
and manipulation functions, and deal with the increased complexity
there. That would introduce some mild insanity and chances for a few
bugs, but imo it's not as bad as variable-width characters.

What is sane, however, depends deeply on what environment you expect
to be in. You have to ask yourself whether the scheme you're writing
will be used with data in multiple character sets.

For example, will users want to read strings in EBCDIC and write
them in Unicode? How about the multiple incompatible versions of
EBCDIC? Do you have to support them, or can we let them die now?
Will your implementation want to read and produce both UTF-8 and
UTF-16 output? Will you have to handle miscellaneous ISO character
sets that have different characters mapped to the same character
codes above 127? Or obsolete ASCII where the character code we
use as backslash used to mean 1/8? How about five-bit Baudot
coding? :-)

Get character i/o functions that do translation, and then the
lookups and references and compares and everything just work for
free with simple code, and all you have to do to support a new
character set is to provide a new mapping that the i/o functions
can use.
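
A hedged sketch, in Common Lisp, of the table-driven translation Ray
describes; the *decoders* table, the :latin-1 key and READ-DECODED-CHAR
are invented names, not part of any standard or existing library:

  (defvar *decoders* (make-hash-table :test #'eq))

  ;; Latin-1 octets map one-to-one onto code points U+0000..U+00FF.
  (setf (gethash :latin-1 *decoders*)
        (lambda (octet) (code-char octet)))

  (defun read-decoded-char (octet-stream charset)
    "Read one octet and translate it through CHARSET's decoder."
    (let ((octet (read-byte octet-stream nil nil))
          (decode (gethash charset *decoders*)))
      (when octet (funcall decode octet))))

Everything past that boundary sees only code points, so lookups,
comparisons and string-ref stay simple; supporting a new character set
means registering one more decoder (and the matching encoder for output).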

Ray Dillinger

Mar 20, 2002, 6:11:19 PM

Erik Naggum wrote:
>
> Variable-length characters? What the hell are you talking about? UTF-8

<deletia>


> Acquire clue, _then_ post, OK?

A reminder: Erik Naggum is one of the best things about
comp.lang.scheme, mainly because he doesn't generally post here.
He is frequently right, but that doesn't mean it's worth listening
to him. Newsgroups trimmed accordingly...

Bear

Ben Goetter

Mar 20, 2002, 6:34:35 PM
Quoth Ray Dillinger:
> He is frequently right

At least on his home turf. Seems a bit confused, here.

> Newsgroups trimmed accordingly...

Mr Killfile has been and remains our good friend.

Andy Heninger

Mar 21, 2002, 1:53:06 AM
"Ray Dillinger" <be...@sonic.net> wrote

> Get character i/o functions that do translation, and then the
> lookups and references and compares and everything just work for
> free with simple code, and all you have to do to support a new
> character set is to provide a new mapping that the i/o functions
> can use.

If you want to provide full up international support, the code for string
manipulation becomes anything but simple, no matter what your string
representation. Think string compares that respect the cultural conventions
of different countries and languages (collation), for example. And if
you're thinking Unicode, this is the direction you're headed.

See IBM's open source Unicode library for a good example of what's
involved -
http://oss.software.ibm.com/icu

-- Andy Heninger
heni...@us.ibm.com

Ray Dillinger

Mar 21, 2002, 11:21:57 AM
Andy Heninger wrote:
>
> "Ray Dillinger" <be...@sonic.net> wrote

>
> If you want to provide full up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation. Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example. And if
> you're thinking Unicode, this is the direction you're headed.

I dunno. As implementor I want to make it *possible* to
implement all the complications. I want to take the major
barriers out of the way and deal with encodings intelligently.
I'm willing to leave presentation and non-default collation
to the authors of language packages. Let someone who knows
and cares implement that as a library; I want to provide the
foundation stones so that she can, and provide default
semantics on anonymous characters (which, to me, includes
anything outside of the latin, european, extended latin,
and math planes) that are logical, consistent, and overridable.

Should the REPL rearrange itself to go top-char-to-bottom,
right-column-to-left, with prompts appearing at the top,
if someone has named their variables and defined their
symbols with kanji characters instead of latin? It's an
interesting thought. Should program code go in boustophedron
(alternating left-to-right in rows from top down) if someone
has named stuff using hieroglyphics? Um, maybe.... But is
the scheme system really where that kind of support is
needed, or would it just confuse people? And what's the
indentation convention for boustophedron?

Maybe that last byte-and-a-half should be used for left-right
and up-down and spacing properties and the scheme system itself
ought to do all that stuff. But it's not so important I'm
going to implement it before, say, read-write invariance on
procedure objects.

Bear

Duane Rettig

Mar 21, 2002, 1:00:01 PM
"Andy Heninger" <an...@jtcsv.com> writes:

> "Ray Dillinger" <be...@sonic.net> wrote
> > Get character i/o functions that do translation, and then the
> > lookups and references and compares and everything just work for
> > free with simple code, and all you have to do to support a new
> > character set is to provide a new mapping that the i/o functions
> > can use.

Even before our current version of Allegro CL (6.1), we were
supporting external-formats to exactly that extent, and it has
been extendible (for the most part). See

http://www.franz.com/support/documentation/6.0/doc/iacl.htm#locales-1

> If you want to provide full up international support, the code for string
> manipulation becomes anything but simple, no matter what your string
> representation. Think string compares that respect the cultural conventions
> of different countries and languages (collation), for example. And if
> you're thinking Unicode, this is the direction you're headed.
>
> See IBM's open source Unicode library for a good example of what's
> involved -
> http://oss.software.ibm.com/icu

We incorporate a large amount of IBM's work (and other work, as well)
in our current localization support. See

http://www.franz.com/support/documentation/6.1/doc/iacl.htm#localization-1

Note that we have chosen not to support LC_CTYPE and LC_MESSAGES at this time.
Also, LC_COLLATE is not supported for 6.1, but Unicode Collation Element
Tables (UCETs) will be supported for 6.2.

--
Duane Rettig Franz Inc. http://www.franz.com/ (www)
1995 University Ave Suite 275 Berkeley, CA 94704
Phone: (510) 548-3600; FAX: (510) 548-8253 du...@Franz.COM (internet)

David Rush

Mar 21, 2002, 1:02:59 PM
Ray Dillinger <be...@sonic.net> writes:
> Should program code go in boustophedron
> (alternating left-to-right in rows from top down) if someone
> has named stuff using hieroglyphics? Um, maybe.... But is
> the scheme system really where that kind of support is
> needed, or would it just confuse people? And what's the
> indentation convention for boustophedron?

*This* needs to be assigned as a homework problem:

Scheme 101 (Final project):
Implement a boustophedron pretty-printer
Extra credit: do it in a single pass over
the list structure

heh...

david rush
--
Next to the right of liberty, the right of property is the most
important individual right guaranteed by the Constitution and the one
which, united with that of personal liberty, has contributed more to
the growth of civilization than any other institution established by
the human race.
-- Popular Government (William Howard Taft)

Ray Dillinger

Mar 21, 2002, 9:59:02 PM

David Rush wrote:

> *This* needs to be assigned as a homework problem:
>
> Scheme 101 (Final project):
> Implement a boustophedron pretty-printer
> Extra credit: do it in a single pass over
> the list structure
>

LAUGH!

I can picture the poor kid now.
1) read assignment.
2) look of puzzlement.
3) look up "boustophedron" in OED.
4) nervous laughter, fading into growing horror....

:-)

It would be a riot to hand out as a fake assignment if
you knew the professor was going to be late one day.
Or, if you happen to *be* the professor, you could
offer it as an extra-credit project and see if you've
got anybody gonzo enough to want to do it in the
class....

Bear

hmmm....

================================BoustoPretty.scm==========

;;; reverse the order of characters in a string
(define reverse-string
  (lambda (inputstring)
    (list->string (reverse (string->list inputstring)))))

;;; create a symbol which prints as the reverse of the input symbol
(define reverse-symbol
  (lambda (inputsymbol)
    (string->symbol (reverse-string (symbol->string inputsymbol)))))

....

Jeffrey M. Vinocur

Mar 21, 2002, 10:23:39 PM
In article <okfvgbp...@bellsouth.net>,
David Rush <ku...@bellsouth.net> wrote:
>
> Implement a boustophedron pretty-printer

Boustrophedon, no?


--
Jeffrey M. Vinocur * jm...@cornell.edu
http://www.people.cornell.edu/pages/jmv16/

Jens Axel Søgaard

Mar 21, 2002, 11:30:39 PM

"Ray Dillinger" <be...@sonic.net> skrev i en meddelelse
news:3C9A9DF6...@sonic.net...

>
>
> David Rush wrote:
>
> > *This* needs to be assigned as a homework problem:
> >
> > Scheme 101 (Final project):
> > Implement a boustophedron pretty-printer
> > Extra credit: do it in a single pass over
> > the list structure
> >
> LAUGH!
>
> I can picture the poor kid now.
> 1) read assignment.
> 2) look of puzzlement.
> 3) look up "boustophedron" in OED.
> 4) nervous laughter, fading into growing horror....

Damn. Text readers exist.

http://traevoli.com/boust/screen.php3

But - he was not able to figure out how to turn it into a text editor :-)

--
Jens Axel Søgaard

David Rush

Mar 22, 2002, 2:29:57 AM
jm...@cornell.edu (Jeffrey M. Vinocur) writes:
> In article <okfvgbp...@bellsouth.net>,
> David Rush <ku...@bellsouth.net> wrote:
> >
> > Implement a boustophedron pretty-printer
>
> Boustrophedon, no?

I'd thought so, but Ray consistently spelled it with a
differently-located 'R'. Since he brought it up in the first place,
I'd assumed that he was correct. That turns out not to be the case,
Senator (STR). From Merriam-Webster's Collegiate (courtesy of
yourdictionary.com):

Main Entry: bou.stro.phe.don
Pronunciation: "bü-str&-'fE-"dän, -d[^&]n
Function: noun
Etymology: Greek boustrophEdon, adverb, literally, turning like oxen in
plowing, from bous ox, cow + strephein to turn -- more at COW
Date: 1699
: the writing of alternate lines in opposite directions (as from left
to right and from right to left)

- boustrophedon adjective or adverb
- bou.stro.phe.don.ic /-fE-'d@-nik/ adjective

david rush
--
A beer barf is usually very satisfying, whereas a Tequilla puke tends
to stress the stomach more, and seeing the worm for the second time is
never much fun.
-- Rob Crittenden (on mcom.bad-attitude)

David Rush

Mar 22, 2002, 2:40:41 AM
"Jens Axel Søgaard" <use...@soegaard.net> writes:
> "Ray Dillinger" <be...@sonic.net> skrev i en meddelelse
> news:3C9A9DF6...@sonic.net...
> > David Rush wrote:
> > > *This* needs to be assigned as a homework problem:
> > >
> > > Scheme 101 (Final project):
> > > Implement a boustophedron pretty-printer
> > > Extra credit: do it in a single pass over
> > > the list structure
> >
> > I can picture the poor kid now.
> > 1) read assignment.
> > 2) look of puzzlement.
> > 3) look up "boustophedron" in OED.

And not find it. Thanks to Jeffrey Vinocur for pointing this out. The
correct spelling is "boustrophedon".

> > 4) nervous laughter, fading into growing horror....
>
> Damn. Text readers exist.
>
> http://traevoli.com/boust/screen.php3
>
> But - he was not able to figure out how to turn it into a text editor :-)

Doesn't an Emacs mode exist? (cross-posted to find out ;)

david rush
--
Java is a WORA language! (Write Once, Run Away)
-- James Vandenberg (on prog...@egroups.com)

Ray Dillinger

Mar 22, 2002, 9:45:05 AM
David Rush wrote:
>
> jm...@cornell.edu (Jeffrey M. Vinocur) writes:
> > In article <okfvgbp...@bellsouth.net>,
> > David Rush <ku...@bellsouth.net> wrote:
> > >
> > > Implement a boustophedron pretty-printer
> >
> > Boustrophedon, no?
>
> I'd thought so, but Ray consistently spelled it with a
> differently-located 'R'. Since he brought it up in the first place,
> I'd assumed that he was correct. That turns out not to be the case,
> Senator (STR). From Merriam-Webster's Collegiate (courtesy of
> yourdictionary.com):
>

Yup. I was wrong. 'Boustrophedon' is one of a small number of words
which I invariably spell wrong unless I slow down and remember that
it's in that family of words and correct for it. I have no idea
why this is so, but it's been so ever since I first read the word.

Sorry everybody.

Bear

Alwyn

Mar 22, 2002, 9:56:21 AM
In article <3C9B443D...@sonic.net>,
Ray Dillinger <be...@sonic.net> wrote:

It comes from Greek *bous*, 'ox' and *strophe*, 'turning'. That might
help you to remember.


Alwyn

Sander Vesik

Mar 22, 2002, 4:13:12 PM
In comp.lang.scheme Erik Naggum <er...@naggum.net> wrote:
> * Sander Vesik <san...@haldjas.folklore.ee>
> | They use either UTF-8 or UTF-16 - you cannot rely on whatever size
> | you pick to be suitably long forever; Unicode is sort of inherently
> | variable-length (characters even have two possible representations
> | in many cases, ä and similar 8-)
>
> Variable-length characters? What the hell are you talking about? UTF-8
> is a variable-length _encoding_ of characters that most certainly are
> intended to require a fixed number of bits. That is, unless you think
> the digit 3 takes up only 6 bits while the letter A takes up 7 bits and
> the symbol ą takes up 8. Then you have variable-length characters. Few
> people consider this a meaningful way of talking about variable length.

Wake up, smell the coffee and learn about 'combiners'. And then *think*
just a little bit, including about things like collation, sort order
and similar.

>
> ///

Erik Naggum

Mar 22, 2002, 10:03:52 PM
* Sander Vesik

| Wake up, smell the coffee and learn about 'combiners'. And then *think*
| just a little bit, including about things like collation, sort order and
| similar.

Perhaps you are unaware of the character concept as used in Unicode? It
would seem prudent at this time for you to return to the sources and
obtain the information you lack. To wit, what you incompetently refer to
as "combiners" are actually called "combining characters". I suspect you
knew that, too, since nobody _else_ calls them "combiners". But it seems
that you are fighting for your honor, now, not technical correctness, and
I shall leave to you another pathetic attempt to feel good about yourself
when you should acknowledge inferior knowledge and learn something.

Oh, by the way, Unicode has three levels. Study Unicode, and you will
know what they mean and what they do. Hint: "variable-length character"
is an incompetent restatement. A single _glyph_ may be made up of more
than one _character_ and a given glyph may be specified using more than
one character. If you had known Unicode at all, you would know this.

Sander Vesik

Mar 23, 2002, 1:51:39 PM
In comp.lang.scheme Erik Naggum <er...@naggum.net> wrote:
> * Sander Vesik
> | Wake up, smell the coffee and learn about 'combiners'. And then *think*
> | just a little bit, including about things like collation, sort order and
> | similar.
>
> Perhaps you are unaware of the character concept as used in Unicode? It
> would seem prudent at this time for you to return to the sources and
> obtain the information you lack. To wit, what you incompetently refer to
> as "combiners" are actually called "combining characters". I suspect you
> knew that, too, since nobody _else_ calls them "combiners". But it seems
> that you are fighting for your honor, now, not technical correctness, and
> I shall leave to you another pathetic attempt to feel good about yourself
> when you should acknowledge inferior knowledge and learn something.

I don't subscribe to the concept of honour. I also couldn't care less what
you think of me.

>
> Oh, by the way, Unicode has three levels. Study Unicode, and you will
> know that they mean and what they do. Hint: "variable-length character"
> is an incompetent restatement. A single _glyph_ may be made up of more
> than one _character_ and a given glyph may be specifed using more than
> one character. If you had known Unicode at all, you would know this.

It is pointless to think of glyphs in any other way than characters - it
should not make any difference whether a-diaeresis is represented by one
code point - the precombined one - or two. In fact, if there is a
detectable difference to anything dealing with text strings, the
implementation is demonstrably broken.

Erik Naggum

Mar 23, 2002, 8:46:30 PM
* Sander Vesik

| I also couldn't care less what you think of me.

You should realize that only people who care a lot make this point.

| It is pointless to think of glyphs in any other way than characters - it
| should not make any difference whether a-diaeresis is represented by one
| code point - the precombined one - or two. In fact, if there is a
| detectable difference to anything dealing with text strings, the
| implementation is demonstrably broken.

It took the character set community many years to figure out the crucial
conceptual and then practical difference between the "characteristic
glyph" of a character and the character itself, namly that a character
may have more than one glyph, and a glyph may represent more than one
character. If you work with characters as if they were glyphs, you
_will_ lose, and you make just the kind of arguments that were made by
people who did _not_ grasp this difference in the ISO committees back in
1992 and who directly or indirectly caused Unicode to win over the
original ISO 10646 design. Unicode has many concessions to those who
think character sets are also glyph sets, such as the presentation forms,
but that only means that there are different times you would use
different parts of the Unicode code space. Some people who try to use
Unicode completely miss this point.

It also took some _companies_ a really long time to figure out the difference
between glyph sets and character sets. (E.g., Apple and Xerox, and, of
course, Microsoft has yet to reinvent the distinction badly in the name
of "innovation", so their ISO 8859-1-like joke violates important rules
for character sets.) I see that you are still in the pre-enlightenment
state of mind and have failed to grasp what Unicode does with its three
levels. I cannot help you, since you appear to stop thinking in order to
protect or defend yourself or whatever (it sure looks like some Mideast
"honor" codex to me), but if you just pick up the standard and read its
excellent introductions or even Unicode: A Primer, by Tony Graham, you
will understand a lot more. It does an excellent job of explaining the
distinction between glyph and character. I think you need it much more
than trying to defend yourself by insulting me with your ignorance.

Now, if you want to use or not use combining characters, you make an
effort to convert your input to your preferred form before you start
processing. This isolates the "problem" to a well-defined interface, and
it is no longer a problem in properly designed systems. If you plan to
compare a string with combining characters with one without them, you are
already so confused that there is no point in trying to tell you how
useless this is. This means that thinking in terms of "variable-length
characters" is prima facie evidence of a serious lack of insight _and_ an
attitude problem that something somebody else has done is wrong and that
you know better than everybody else. Neither are problems with Unicode.

Thomas Bushnell, BSG

Mar 23, 2002, 11:25:49 PM

So, a secondary question: if one is designing a new Common Lisp or
Scheme system, and one is not encumbered by any requirements about
being consistent with existing code, existing operating systems, or
existing communications protocols and interchange formats: that is, if
one gets to design the world over again:

Should the Scheme/CL type "character" hold Unicode characters, or
Unicode glyphs? (It seems clear to me that it should hold characters,
but I might be thinking about it poorly.)

And, whichever answer, why is that the right answer?

Thomas

cr88192

Mar 23, 2002, 9:02:30 PM
>
> Should the Scheme/CL type "character" hold Unicode characters, or
> Unicode glyphs? (It seems clear to me that it should hold characters,
> but I might be thinking about it poorly.)
>
> And, whichever answer, why is that the right answer?
>
one could use "the cheap man's unicode" or utf-8.
actually personally I don't care so much about unicode and have held it in
the "possibly later" respect. for now it is not terribly important as I can
just restrict myself to the lower 128 characters.
in any case it sounds simpler to implement than the "codepage" system, so I
will probably use it.
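
For what it is worth, the encoding itself is only a few shifts and
masks; a minimal sketch covering just the one- and two-octet cases
(code points below #x800):

  (defun utf-8-octets (code)
    (cond ((< code #x80)        ; plain ASCII: one octet
           (list code))
          ((< code #x800)       ; two octets: 110xxxxx 10xxxxxx
           (list (logior #xC0 (ash code -6))
                 (logior #x80 (logand code #x3F))))
          (t (error "three- and four-octet cases left out of this sketch"))))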

"ich bin einen Amerikaner, und ich tun nicht erweiterter Zeichen noetig"
(don't mind bad grammar, as I don't really know german...).

nevermind...

Erik Naggum

Mar 24, 2002, 1:51:53 AM
* tb+u...@becket.net (Thomas Bushnell, BSG)

| Should the Scheme/CL type "character" hold Unicode characters, or
| Unicode glyphs? (It seems clear to me that it should hold characters,
| but I might be thinking about it poorly.)

There are no Unicode glyphs. This properly refers to the equivalence of
a sequence of characters starting with a base character and optionally
followed by combining characters, and "precomposed" characters. This is the
canonical-equivalence of character sequences. A processor of Unicode
text is allowed to replace any character sequence with any of its
canonically-equivalent character sequences. It is in this regard that an
application may want to request a particular composite character either
as one character or a character sequence, and may decide to examine each
coded character element individually or as an interpreted character.
These constitute three different levels of interpretation that it must be
possible to specify. Since an application is explicitly permitted to
choose any of the canonical-equivalent character sequences for a
character, the only reasonable approach is to normalize characters into a
known internal form.

There is one crucial restriction on the ability to use equivalent
character sequences. ISO 10646 defines implementation levels 1, 2 and 3
that, respectively, prohibit all combining characters, allow most
combining characters, and allow all combining characters. This is a very
important part of the whole Unicode effort, but Unicode has elected to
refer to ISO 10646 for this, instead of adopting it. From my personal
communication with high-ranking officials in the Unicode consortium, this
is a political decision, not a technical one, because it was feared that
implementors that would be happy with trivial character-to-glyph-mapping
software (such as a conflation of character and glyph concepts and fonts
that support this conflation), especially in the Latin script cultures,
would simply drop support for the more complex usage of the Latin script
and would fail to implement e.g., Greek properly. Far from being an
enabling technology, it was feared that implementing the full set of
equivalences would be omitted and thus not enable the international
support that was so sought after. ISO 10646, on the other hand, has
realized that implementors will need time to get all this right, and may
choose to defer implementation of Unicode entirely if they are not able
to do it stepwise. ISO 10646 Level 1 is intended to be workable for a
large number of uses, while Level 3 is felt not to have an advantage qua
requirement until languages that require far more than composition and
decomposition need to be fully supported. I concur strongly with this.

The character-to-glyph mapping is fraught with problems. One possible
way to do this is actually to use the large private use areas to build
glyphs and then internally use only non-combining characters. The level
of dynamism in the character coding and character-to-glyph mapping here
is so much more difficult to get right that the canonical-equivalent sequences
of characters (which is a fairly simple table-lookup process) pales in
comparison. That is, _if_ you allow combining characters, actually being
able to display them and reason about them (such as computing widths or
dealing with character properties of the implicit base character or
converting their case) is far more difficult than decomposing and
composing characters.

As for the scary effect of "variable length" -- if you do not like it,
canonicalize the input stream. This really is an isolatable non-problem.
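
In other words, normalization happens once, at the boundary; a sketch of
the comparison that falls out of that, where NORMALIZE-NFD stands for a
hypothetical normalizer and is not a standard function:

  (defun text= (a b)
    ;; canonically equivalent strings compare equal after normalization
    (string= (normalize-nfd a) (normalize-nfd b)))

Inside the system, strings are only ever seen in the chosen normal form,
so the ordinary comparison and access operators keep working unchanged.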

Erik Naggum

Mar 24, 2002, 2:00:47 AM
* Thomas Bushnell, BSG

| So a secondary question; if one is designing a new Common Lisp or Scheme
| system, and one is not encumbered by any requirements about being
| consistent with existing code, existing operating systems, or existing
| communications protocols and interchange formats: that is, if one gets to
| design the world over again:

If we could design the world over again, the _first_ thing I would want to
do is making "capital letter" a combining modifier instead of doubling
the size of the code space required to handle it. Not only would this be
such a strong signal to people not to use case-sensitive identifiers in
programming languages, we would have a far better time as programmers.
E.g., considering the enormous amount of information Braille can squeeze
into only 6 bits, with codes for many common words and codes to switch to
and from digits and to capital letters, the limitations of their code
space has effectively been very beneficial.

Thomas Bushnell, BSG

Mar 25, 2002, 1:56:42 PM
Erik Naggum <er...@naggum.net> writes:

> Yeah, me too. Then I could force you to pay attention to the premises
> that start a discussion instead of completely ignoring the context.
> Please see <32259420...@naggum.net>, and pay particular attention to
> what Thomas Bushnell wrote.

So, getting back to my original question about charset implementations
in Lisp/Scheme (though actually Smalltalk or any such
dynamically-typed language will have the same questions and probably
the same kinds of solutions), I've done some more study and thinking,
so let me try again. My previous question was a tad innocent, it
appears, because I was unaware of the great changes that have taken
place in Unicode since the last time I read through it and grokked the
whole thing (which was back at version 1.2 or something).

I haven't fully internalized the terminology yet, though I'm trying.
So please bear with any minor terminological gaffes (and correct them,
too).

The GNU/Linux world is rapidly converging on using UTF-8 to hold
31-bit Unicode values. Part of the reason it does this is so that
existing byte streams of Latin-1 characters can (pretty much) be used
without modification, and it allows "soft conversion" of existing
code, which is quite easy and thus helps everybody switch.

But I'm thinking about a "design the world over again" kind of
strategy. Now Erik is certainly right that capitalization *should* be
a combining character kind of thing. So let me stipulate that I want
to take Unicode as-is; I get to design *my computer system*, subject
to the a priori constraint that Unicode has done a *lot* of work, so I
will accept slight deficiencies if they help Unicode work right on the
system. So I'll take the existing Unicode encodings, even if they
don't do capitals just like we'd want.

But I don't get to redesign existing communications protocols and
such; however, that's an externalization issue, and for internal use
on the system, such protocols don't matter. Similar comments apply
for existing filesystems formats, file conventions, and the like.

Now, I *could* just use UTF-8 internally, but that seems rather
foolish. I think it's obvious that characters should be "immediately"
represented in pointer values in the way that fixnums are.

Now the Universal Character Set is officially 31 bits, but only 16
bits are in use now, and it is expected that at most 21 bits will be
used. So that means it's pretty easy to make sure the whole space of
UCS values fits in an immediate representation. That's fine for
working with actively used data.

However, strings that are going to be kept around a long time should,
it seems to me, be stored more compactly. Essentially all strings
will be in the Basic Multilingual Plane, so they can fit in 16 bits.
That means there would be two underlying string datatypes. I don't
think this is a serious problem. Is it worth having a third (for
8-bit characters) so that Latin-1 files don't have to be inflated by a
factor of two? It seems to me that this would be important too.
Basically then we would have strings which are UCS-4, UCS-2 and
Latin-1 restricted (internally, not visibly to users).

So even if strings are "compressed" this way, they are not UTF-8.
That's Right Out. They are just direct UCS values. Procedures like
string-set! therefore might have to inflate (and thus copy) the entire
string if a value outside the range is stored. But that's ok with me;
I don't think it's a serious lose.
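
A rough sketch of that promote-on-store behaviour, assuming an
implementation with a narrow BASE-CHAR string and a wide CHARACTER
string; ADAPTIVE-STRING and ASTRING-SET are made-up names used only for
illustration:

  (defstruct adaptive-string
    (data (make-string 0 :element-type 'base-char)))

  (defun astring-set (s i ch)
    (let ((data (adaptive-string-data s)))
      (unless (typep ch (array-element-type data))
        ;; inflate: copy into a string with the widest character type
        (setf data (make-array (length data)
                               :element-type 'character
                               :initial-contents data)
              (adaptive-string-data s) data))
      (setf (aref data i) ch)))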

So is this sane?

Ok, then the second question is about combining characters. Level 1
support is really not appropriate here. It would be nice to support
Level 3. But perhaps Level 2 with Hangul Jamo characters [are those
required for Level 2?] would be good enough.

It seems to me that it's most appropriate to use Normalization Form
D. Or is that crazy? It has the advantage of holding all the Level 3
values in a consistent way. (Since precombined characters do not
exist for all possibilities, Normalization Form C results in some
characters precombined and some not, right?)

And finally, should the Lisp/Scheme "character" data type refer to a
single UCS code point, or should it refer to a base character together
with all the combining characters that are attached to it?

Thomas

Erik Naggum

Mar 25, 2002, 8:34:19 PM
* Thomas Bushnell, BSG

| The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
| Unicode values. Part of the reason it does this is so that existing byte
| streams of Latin-1 characters can (pretty much) be used without
| modification, and it allows "soft conversion" of existing code, which is
| quite easy and thus helps everybody switch.

UTF-8 is in fact extremely hostile to applications that would otherwise
have dealt with ISO 8859-1. The addition of a prefix byte has some very
serious implications. UTF-8 is an inefficient and stupid format that
should never have been proposed. However, it has computational elegance
in that it is a stateless encoding. I maintain that encoding is stateful
regardless of whether it is made explicit or not. I therefore strongly
suggest that serious users of Unicode employ the compression scheme that
has been described in Unicode Technical Report #6. I recommend reading
this technical report.

Incidentally, if I could design things all over again, I would most
probably have used a pure 16-bit character set from the get-go. None of
this annoying 7- or 8-bit stuff. Well, actually, I would have opted for
more than 16-bit units -- it is way too small. I think I would have
wanted the smallest storage unit of a computer to be 20 bits wide. That
would have allowed addressing of 4G of today's bytes with only 20 bits.
But I digress...

| So even if strings are "compressed" this way, they are not UTF-8. That's
| Right Out. They are just direct UCS values. Procedures like string-set!
| therefore might have to inflate (and thus copy) the entire string if a
| value outside the range is stored. But that's ok with me; I don't think
| it's a serious lose.

There is some value to the C/Unix concept of a string as a small stream.
Most parsing of strings needs to proceed from start to end, so there is
no point in optimizing them for direct access. However, a string would
then be different from a vector of characters. It would, conceptually,
be more like a list of characters, but with a more compact encoding, of
course. Emacs MULE, with all its horrible faults, has taken a stream
approach to character sequences and then added direct access into it,
which has become amazingly expensive.

I believe that trying to make "string" both a stream and a vector at the
same time is futile and only leads to very serious problems. The default
representation of a string should be a stream, not a vector, and accessors
should use the stream, such as with make-string-{input,output}-stream,
with new operators like dostring, instead of trying to use the string as
a vector when it clearly is not. The character concept needs to be able
to accommodate this, too. Such pervasive changes are of course not free.
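
A minimal sketch of such a DOSTRING, built on the standard string-stream
operators rather than on random access; the macro itself is hypothetical,
only WITH-INPUT-FROM-STRING and READ-CHAR are standard:

  (defmacro dostring ((char string &optional result) &body body)
    (let ((stream (gensym "STREAM")))
      `(with-input-from-string (,stream ,string)
         (do ((,char (read-char ,stream nil nil)
                     (read-char ,stream nil nil)))
             ((null ,char) ,result)
           ,@body))))

  ;; (dostring (c "abc") (write-char (char-upcase c)))  prints ABC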

| Ok, then the second question is about combining characters. Level 1
| support is really not appropriate here. It would be nice to support
| Level 3. But perhaps Level 2 with Hangul Jamo characters [are those
| required for Level 2?] would be good enough.

Level 2 requires every other combining character except Hangul Jamo.

| It seems to me that it's most appropriate to use Normalization Form D.

I agree for the streams approach. I think it is important to make sure
that there is a single code for all character sequences in the stream
when it is converted to a vector. The private use space should be used
for these things, and a mapping to and from character sequences should be
maintained such that if a private use character is queried for its
properties, those of the character sequence would be returned.

| Or is that crazy? It has the advantage of holding all the Level 3 values
| in a consistent way. (Since precombined characters do not exist for all
| possibilities, Normalization Form C results in some characters
| precombined and some not, right?)

Correct.

| And finally, should the Lisp/Scheme "character" data type refer to a
| single UCS code point, or should it refer to a base character together
| with all the combining characters that are attached to it?

Primarily the code point, but both, effectively, by using the private use
space as outlined above.

Christopher Browne

Mar 25, 2002, 10:30:56 PM
The world rejoiced as Erik Naggum <er...@naggum.net> wrote:
> * Thomas Bushnell, BSG
> | The GNU/Linux world is rapidly converging on using UTF-8 to hold 31-bit
> | Unicode values. Part of the reason it does this is so that existing byte
> | streams of Latin-1 characters can (pretty much) be used without
> | modification, and it allows "soft conversion" of existing code, which is
> | quite easy and thus helps everybody switch.
>
> UTF-8 is in fact extremely hostile to applications that would otherwise
> have dealt with ISO 8859-1. The addition of a prefix byte has some very
> serious implications. UTF-8 is an inefficient and stupid format that
> should never have been proposed. However, it has computational elegance
> in that it is a stateless encoding. I maintain that encoding is stateful
> regardless of whether it is made explicit or not. I therefore strongly
> suggest that serious users of Unicode employ the compression scheme that
> has been described in Unicode Technical Report #6. I recommend reading
> this technical report.
>
> Incidentally, if I could design things all over again, I would most
> probably have used a pure 16-bit character set from the get-go. None of
> this annoying 7- or 8-bit stuff. Well, actually, I would have opted for
> more than 16-bit units -- it is way too small. I think I would have
> wanted the smallest storage unit of a computer to be 20 bits wide. That
> would have allowed addressing of 4G of today's bytes with only 20 bits.
> But I digress...

You should have a chat with Charles Moore, of Forth fame. He
designed, using a CAD system he wrote in Forth, called OK, a 20 bit
microprocessor that (surprise, surprise... NOT!) has an instruction
set designed specifically for Forth.

Something that is unfortunate is that the 36 bit processors basically
died off in favor of 32 bit ones. Which means we have great gobs of
algorithms that assume 32 bit word sizes, with the only leap anyone
can conceive of being to 64 bits, and meaning that if you need a tag
bit or two for this or that, 32 bit operations wind up Sucking Bad.

But I digress, too...
--
(concatenate 'string "cbbrowne" "@ntlug.org")
http://www.ntlug.org/~cbbrowne/oses.html
Rules of the Evil Overlord #230. "I will not procrastinate regarding
any ritual granting immortality." <http://www.eviloverlord.com/>

cr88192

Mar 25, 2002, 9:17:00 PM
>
> Something that is unfortunate is that the 36 bit processors basically
> died off in favor of 32 bit ones. Which means we have great gobs of
> algorithms that assume 32 bit word sizes, with the only leap anyone
> can conceive of being to 64 bits, and meaning that if you need a tag
> bit or two for this or that, 32 bit operations wind up Sucking Bad.
>
hello, personally I don't really know what the big difference is...
I would have imagined that in any case a slightly larger word size would
have been useful, but it is not...
sometimes for some of my code I use 48-bit ints (when 32 bits is too small
and 64 is overkill). I would think that with 36 bits the next size up would
be 72, and 36 is not evenly divisible by 8, so you would need a different
byte size as well (i.e., 9 or 12).
sorry, I don't really know of byte sizes other than 8...
am I missing something?

(little has changed in my life since before, except that I am working on an
os now... again...).

ozan s. yigit

Mar 28, 2002, 1:00:38 PM
Erik Naggum:
> ... It does an excellent job of explaining the

> distinction between glyph and character. I think you need it much more
> than trying to defend yourself by insulting me with your ignorance.

imagine how much time you would have saved yourself and everyone else
had you just posted a useful part of the actual Unicode standard, for
example p. 13, "Characters, Not Glyphs" [1]

The Unicode standard draws a distinction between /characters/, which
are the smallest components of written language that have semantic
value, and /glyphs/, which represent the shapes that characters can
have when they are rendered or displayed. Various relationships may
exist between character and glyph; a single glyph may correspond to
a single character, or to a number of characters, or multiple glyphs
may result from a single character.

[etc]

but it is more fun to lecture, and madly scribble on the board, isn't it? :-]

oz
---
[1] The Unicode Standard Version 3.0, Addison-Wesley, 2000.

Erik Naggum

Mar 28, 2002, 1:30:03 PM
* o...@cs.yorku.ca (ozan s. yigit)

| imagine how much time you would have saved yourself and everyone else
| had you just posted a useful part of the actual Unicode standard, for
| example p. 13, "Characters, Not Glyphs" [1]

Imagine how much time people would have saved _everybody_ if they cared
to study something before they thought they had the right to produce
"opinions". "When did ignorance become a point of view?" Then imagine
how much time it would take to find out what some ignorant fuck needs to
hear in order to become unconfused. It is not my task to educate people
who voice opinions on what they do not have the intellectual honesty and
wherewithal to realize that they do not know sufficiently well. People
who cannot keep track of what they know and what they do not know, should
shut the fuck up, but they never will, precisely because they are unaware
of what they know and do not know. Wade Humeniuk gave us a good analogy
to his yoga classes and the mat-abusers. Non-thinking cretins who post
ignorant opinions to a newsgroup are just the same kind of inconsiderate
bastards. But you choose to _defend_ them. What does that make you?
Those who have the intellectual honesty to separate what they know from
what they just assume, also know where they heard something and can rate
its probability and credibility. Those are worth helping, because they
are likely to learn from it. Those who are unlikely to learn from what
you tell them, are a waste of time.

| but it is more fun to lecture, and madly scribble on the board, isn't it? :-]

Your life experiences apparently differ quite significantly from mine,
but if you feel happy about exposing yourself like this, please do. More
idiotic drivel that lets the world know how you think is probably going
to be the result of your obvious desire to inflame rather than inform, so
go ahead, make a spectacle of yourself. This newsgroup is quite used to
your kind by now.

Pekka P. Pirinen

Mar 28, 2002, 2:39:40 PM
tb+u...@becket.net (Thomas Bushnell, BSG) writes:
> So, getting back to my original question about charset implementations
> in Lisp/Scheme (though actually Smalltalk or any such
> [much snippage]

> So that means it's pretty easy to make sure the whole space of
> UCS values fits in an immediate representation. That's fine for
> working with actively used data.

Even for actively used data, compactness of representation pays off in
better cache efficiency. In fact, it is particularly for actively used
data that we should be mindful of this. Since you seem to be thinking of a
32-bit immediate representation, an improvement to 16-bit strings or
even 8-bit strings is nothing to be sneezed at.

> However, strings that are going to be kept around a long time should,
> it seems to me, be stored more compactly. Essentially all strings
> will be in the Basic Multilingual Plane, so they can fit in 16 bits.
> That means there would be two underlying string datatypes. I don't
> think this is a serious problem.

As an implementor, I can tell you that actually the step from one
string type to two is the hardest bit. Once you've figured out how
you want to implement that, having more is not such a big deal. From
a programmer's point of view, the efficiency gains from more string
types outweigh the costs (unless you think you could do without the
larger ones), even if you have to deal with it explicitly.

> Is it worth having a third (for 8-bit characters) so that Latin-1
> files don't have to be inflated by a factor of two? It seems to me
> that this would be important too.

Files and strings don't really have much to do with each other. Files
are an externalization issue. Of course you can store files in UCS,
and sometimes that's the right thing to do, but in the real world, you
have to deal with all kinds of encodings, so you need the machinery,
anyway, to read and write Shift-JIS, Big5, Latin-1, UTF-8, etc.

Like I said above, it _is_ important to have an 8-bit string type.
People in the West, who rarely even realize they could easily support
16-bit users, will get great benefits. And between files and
"actively used data", there are those people who want to load their
entire database in main memory and compute with that; they'll get
their size limit extended as well.

> Basically then we would have strings which are UCS-4, UCS-2 and
> Latin-1 restricted (internally, not visibly to users). [...]

> Procedures like string-set! therefore might have to inflate (and
> thus copy) the entire string if a value outside the range is stored.
> But that's ok with me; I don't think it's a serious lose.

I suppose that is a viable implementation strategy, but I don't think
it's the right option. The language should expose the range of string
data types to the programmer, and let them choose, because the range
of memory usage is just too great to sweep under the mat. Also,
having strings automatically reallocated means an extra indirection
for access which cannot always be optimized away.

I note that offering multiple string types is exactly what all the CL
implementations seem to have done. This doesn't preclude having
features that automatically select the smallest feasible type, e.g.,
for "" read syntax or a STRING-APPEND function.
--
Pekka P. Pirinen
The gap between theory and practice is bigger in practice than in theory.

Thomas Bushnell, BSG

Mar 28, 2002, 3:08:19 PM
Pekka.P...@globalgraphics.com (Pekka P. Pirinen) writes:

> > Is it worth having a third (for 8-bit characters) so that Latin-1
> > files don't have to be inflated by a factor of two? It seems to me
> > that this would be important too.
>
> Files and strings don't really have much to do with each other. Files
> are an externalization issue. Of course you can store files in UCS,
> and sometimes that's the right thing to do, but in the real world, you
> have to deal with all kinds of encodings, so you need the machinery,
> anyway, to read and write Shift-JIS, Big5, Latin-1, UTF-8, etc.

In the system I'm contemplating, there are no files in the normal
sense of the term; all user data lives as strings, more or less (there
might be something more clever, but whatever). Whatever strategies are
done for strings (and similar structures) will be important for all
files.

So such data has to be efficiently stored...

> I note that offering multiple string types is exactly what all the CL
> implementations seem to have done. This doesn't preclude having
> features that automatically select the smallest feasible type, e.g.,
> for "" read syntax or a STRING-APPEND function.

But this is, it seems to me, unclean.

I think of it as being similar to the way numbers work. Yes, I can
find out whether a given number is a fixnum or a bignum, and I might
well care in some special case. But normally I just use numbers and
expect the system to automagically do the right thing.

Similarly, I want the string type to simply encode Unicode strings,
and the user should not be forced to deal with more. The user should
not need to guess at the time the string is created whether or not it
will later need to hold a bigger character code, for example.

Thomas

ozan s yigit

Mar 28, 2002, 8:04:18 PM
[erik's bombastic drivel elided]

heh heh heh, nice try erik, but you are no mikhail zeleny, alas. :]

oz

Erik Naggum

Mar 28, 2002, 10:25:27 PM
* ozan s yigit <o...@blue.cs.yorku.ca>

| [erik's bombastic drivel elided]
|
| heh heh heh, nice try erik, but you are no mikhail zeleny, alas. :]

Oh, great, another nutjob at large.

ozan s. yigit

unread,
Mar 29, 2002, 1:32:12 AM3/29/02
to
Erik Naggum:

> | heh heh heh, nice try erik, but you are no mikhail zeleny, alas. :]
>
> Oh, great, another nutjob at large.

read your previous post. it speaks volumes.

oz
---
dreams already are. -- mark v. shaney

Brian Spilsbury

unread,
Mar 29, 2002, 2:07:30 AM3/29/02
to
Ray Dillinger <be...@sonic.net> wrote in message news:<3C990D44...@sonic.net>...
>
> I wouldn't want to muck about internally with a format that had
> characters of various different widths: too much pain to implement,
> too many chances to introduce bugs, not enough space savings.
> Besides, when people read whole files as strings, do you really
> want to run through the whole string counting multi-byte characters
> and single-byte characters to find the value of an expression like
>
> (string-ref FOO charcount) ;; lookups in a 32 million character string!
>
> where charcount is large? I don't. Constant width means O(1) lookup
> time.

Well, there are several mitigating factors and some issues with CL
which cause difficulties here.

If you consider your string as a sequence, then you can see that the
issues with variable width encodings produce a data-type which has the
access characteristics of a list.

The arguments for and against lists apply directly to variable-width
strings.

If we look at the use of strings, it falls into two fairly distinct
categories:

(a) Iteration:
    Printing, writing, reading, appending, scanning, copying, etc.
(b) Random access:
    Randomly accessing characters.

In fact, almost everything we do with strings is iterative (which
makes sense when you remember why strings are called strings).

The problem is that CL has rather poor support for iterating over
sequences.

If we considered a sequence to be addressed through two spaces, one
being Index-Space, and the other Point-Space, we could avoid a lot of
these issues, and make lists more efficiently usable as sequences.

(elt seq index) would access the sequence through index space (which
might involve walking down a list N steps).
(elt-p seq point) would access the sequence through a point (which
would involve no traversal).

The trick to efficiently exploiting this then would be to get a point
from an index.

(dosequence (element point sequence)
  (when (char= element #\!)
    (setf (elt-p sequence point) #\$)))

for a fairly lame example.

With things like (subseq sequence :start-point a :end-point b) it
starts to become more flexible.

Or the ability to say (dosequence (element point sequence :start-point
point) ...) to allow the continuation of an iteration.
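
To make that concrete, here is a rough sketch of how ELT-P could be
O(1) while ELT stays O(n) on a variable-width string. All the names
are hypothetical, and the one-or-three-octet encoding is a toy chosen
for brevity, not real UTF-8:

(defstruct vw-string
  ;; Octets of a variable-width encoding; a "point" is an octet offset.
  (octets (make-array 0 :element-type '(unsigned-byte 8)
                        :adjustable t :fill-pointer 0)))

(defun vw-push-char (vws char)
  ;; Codes below 128 take one octet, everything else takes three.
  (let ((code (char-code char))
        (octets (vw-string-octets vws)))
    (if (< code 128)
        (vector-push-extend code octets)
        (progn
          (vector-push-extend (+ 128 (ldb (byte 7 14) code)) octets)
          (vector-push-extend (ldb (byte 7 7) code) octets)
          (vector-push-extend (ldb (byte 7 0) code) octets)))
    vws))

(defun vw-elt-p (vws point)
  ;; ELT-P: O(1); decode the character at octet offset POINT and
  ;; return it along with the point of the next character.
  (let* ((octets (vw-string-octets vws))
         (b0 (aref octets point)))
    (if (< b0 128)
        (values (code-char b0) (+ point 1))
        (values (code-char (+ (ash (- b0 128) 14)
                              (ash (aref octets (+ point 1)) 7)
                              (aref octets (+ point 2))))
                (+ point 3)))))

(defun vw-elt (vws index)
  ;; ELT: O(n); walk INDEX characters from the start to find the point.
  (let ((point 0))
    (dotimes (i index)
      (setf point (nth-value 1 (vw-elt-p vws point))))
    (values (vw-elt-p vws point))))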

I'm not suggesting that this is an ideal solution, but it should at
least point out some inadequacies in the current model.

With appropriate primitives the wide-spread use of list-like strings
should not even be considered problematic, imho.

And in answer to the example above, I don't think that anyone would
suggest forcing someone to use a variable-width string representation
at all times. If random access to a particular string is important to
you, then a vector-like string is obviously the way to go.

Regards,

Brian

Brian Spilsbury

unread,
Mar 29, 2002, 4:19:09 AM3/29/02
to
tb+u...@becket.net (Thomas Bushnell, BSG) wrote in message
> Similarly, I want the string type to simply encode Unicode strings,
> and the user should not be forced to deal with more. The user should
> not need to guess at the time the string is created whether or not it
> will later need to hold a bigger character code, for example.
>

I think you need to differentiate between mutable and immutable
strings.

A mutable string which is not explicitly restricted (to, say,
simple-base-string) needs to be able to hold any character, so it
needs to be conservative.

An immutable string cannot be modified, so you are free to encode it
however you like, as long as the encoding can represent whatever the
string holds.

The remainder of the problem is the idea of strings as vectors rather
than as sequences; viewed as sequences, O(1) access is no longer an
issue (although you'd want better iteration support than CL currently
provides).

Beyond this it should be trivial to have an immutable string type
which knows what encoding it is using, and can tell the system what
accessor to use.

As a side-note, string literals and the names of symbols are
effectively immutable in CL (modifying them is undefined).

In addition you would need an operator to encode a mutable string as
an immutable string (using a given encoding); options for immutable
construction for subseq, concatenate, string-output-stream, etc.
would also be useful.
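
A rough sketch of what that operator might look like (FREEZE-STRING
and IMMUTABLE-STRING are hypothetical names; a real implementation
would enforce the immutability in its accessors rather than by
convention):

(defstruct (immutable-string
             (:constructor %make-immutable-string (encoding data)))
  encoding   ; :latin-1, :ucs-2 or :ucs-4
  data)      ; character codes stored in that encoding

(defun freeze-string (string &optional encoding)
  ;; Pick the narrowest encoding that covers the string, unless the
  ;; caller asked for a specific one.
  (let* ((max (reduce #'max string :key #'char-code :initial-value 0))
         (enc (or encoding
                  (cond ((< max 256)   :latin-1)
                        ((< max 65536) :ucs-2)
                        (t             :ucs-4)))))
    (%make-immutable-string
     enc
     (map `(vector ,(ecase enc
                      (:latin-1 '(unsigned-byte 8))
                      (:ucs-2   '(unsigned-byte 16))
                      (:ucs-4   '(unsigned-byte 32))))
          #'char-code string))))

(freeze-string "hello") would then come out Latin-1 restricted, and
readers dispatch on the ENCODING slot to pick the right accessor.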

Regards,

Brian

Stephan H.M.J. Houben

unread,
Mar 30, 2002, 4:38:30 AM3/30/02
to
In article <usn6kh...@globalgraphics.com>, Pekka P. Pirinen wrote:
>> Basically then we would have strings which are UCS-4, UCS-2 and
>> Latin-1 restricted (internally, not visibly to users). [...]
>> Procedures like string-set! therefore might have to inflate (and
>> thus copy) the entire string if a value outside the range is stored.
>> But that's ok with me; I don't think it's a serious lose.
>
>I suppose that is a viable implementation strategy, but I don't think
>it's the right option. The language should expose the range of string
>data types to the programmer, and let them choose, because the range
>of memory usage is just too great to sweep under the mat. Also,
>having strings automatically reallocated means an extra indirection
>for access which cannot always be optimized away.

If you have more than one string type anyway, then you can have
both directly and indirectly represented strings. It is then
possible to arrange that any directly represented string can
be replaced with an indirectly represented string. Then,
arrange for the garbage collector to remove all indirections.

Again, this is not that much more complex once you have decided to
go for multiple string types anyway. Moreover, it is
completely transparent to the programmer and it can provide
other useful features, e.g. growing of strings. Indeed, it is
even possible for the implementation to dynamically decide to
overallocate storage once a string has been grown, so that
naively building a string character-by-character will be
O(n).

All this adds implementation complexity, but it makes string handling
much easier on the programmer.
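
For instance, the growth part of that might look roughly like this
(GSTRING is a hypothetical name; the real thing would hide the header
behind the ordinary string accessors and let the GC snap the
indirection later):

(defstruct gstring
  (data (make-string 16))   ; overallocated backing store
  (length 0))               ; characters actually in use

(defun gstring-push (gstring char)
  ;; Amortized O(1) per character: double the backing store when full,
  ;; so building an N-character string costs O(N) overall.
  (let ((data (gstring-data gstring))
        (len  (gstring-length gstring)))
    (when (= len (length data))
      (let ((new (make-string (* 2 (length data)))))
        (replace new data)
        (setf data (setf (gstring-data gstring) new))))
    (setf (char data len) char
          (gstring-length gstring) (1+ len))
    gstring))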

To go even further: one could provide lazy string copying with
copy-on-write, optimised string concatenation in which
substrings are shared, and since the OP wants to replace files
by strings, he could even consider having the GC dynamically
compress and uncompress large strings.
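
The copy-on-write part, sketched (COW-STRING is again a hypothetical
name, and a real implementation would track sharing more precisely
than this one conservative flag):

(defstruct cow-string
  data              ; possibly shared simple string
  (shared-p nil))   ; true while DATA may be referenced by another header

(defun cow-copy (s)
  ;; O(1) "copy": share the backing data and mark both headers shared.
  (setf (cow-string-shared-p s) t)
  (make-cow-string :data (cow-string-data s) :shared-p t))

(defun cow-string-set (s index char)
  ;; Copy the backing data only on the first write to a shared string.
  (when (cow-string-shared-p s)
    (setf (cow-string-data s) (copy-seq (cow-string-data s))
          (cow-string-shared-p s) nil))
  (setf (char (cow-string-data s) index) char))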

OK, this is really overengineered, but anyway...

Greetings,

Stephan

Thomas Bushnell, BSG

unread,
Mar 31, 2002, 1:11:30 AM3/31/02
to
step...@wsan03.win.tue.nl (Stephan H.M.J. Houben) writes:

> To go even further: one could provide lazy string copying with
> copy-on-write, optimised string concatenation in which
> substrings are shared, and since the OP wants to replace files
> by strings, he could even consider to have the GC dynamically
> compress and uncompress large strings.

I don't know about compressing (though it's not a bogus idea). Doing
lazy sharing by copy-on-write is certainly a good approach for large
strings, and that will probably be a necessary feature of the system
to make various user-interface tweaks work right. Thanks for the
idea.
