
Constant strings - again


Leopold Toetsch

Apr 16, 2004, 6:13:03 AM
to P6I
It's time to bring this topic up again; the last discussion was, AFAIK, in:

Newsgroups: perl.perl6.internals
Date: Thu, 22 Jan 2004 14:35:54 -0500
Subject: Re: Another GC bug
From: d...@sidhe.org (Dan Sugalski)

Attached is the slightly modified version of c2str.pl. This generates a
string "resource" file from _S("string") macros.

I've modified src/objects.c to use these macros for get_init_meth (when
CALL__BUILD is enabled):

$ perl c2str.pl src/objects.c > src/objects.str
$ make -s
$ time CALL__BUILD=1 parrot -j oo2b.pasm

real 0m3.229s # w #include "objects.str"
real 0m3.950s # w string_make

So:
1) Can we make that compile silently?
2) Do all compilers understand this struct initializer?
3) How can we best integrate such a solution into the build process (not
all files - rather, only a few files will need preprocessing)?
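
For readers wondering about the struct initializer asked about in (2), here is a hypothetical sketch of the kind of header c2str.pl might emit; the actual generated layout isn't shown in this message, and the field and macro names below are invented for illustration:

```c
#include <stddef.h>

/* Hypothetical shape of a statically initialized string header, roughly
 * the kind of thing c2str.pl could generate into objects.str. The field
 * names are invented for illustration, not Parrot's actual layout. */
typedef struct {
    const char *strstart;   /* pointer to the literal's bytes */
    size_t      bufused;    /* bytes used */
    size_t      strlen;     /* length in characters */
} StaticStr;

/* One generated entry per _S("...") occurrence in the .c file: */
static const StaticStr static_string_1 = { "__init", 6, 6 };
#define _S_1 (&static_string_1)
```

Aggregate initializers of this form are plain C89, which is why "do all compilers understand this" is mostly a question about any compiler-specific attributes the real headers might need.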

Comments welcome,
leo

PS current __init call (oo2) is 2.7 seconds. The two additional hash
lookups are rather expensive.

objects_c.patch
c2str.pl
oo2b.pasm

Leopold Toetsch

Apr 17, 2004, 3:41:27 PM
to perl6-i...@perl.org
Leopold Toetsch <l...@toetsch.at> wrote:

[ blabla ]

Sorry.

This scheme with constant strings in constant memory doesn't work - at
least not with ARENA_DOD_FLAGS enabled, which assumes *all* objects are
coming from arenas. These string headers live outside of any arena.

It could work w/o ARENA_DOD_FLAGS *if* these strings get additional
flags set:
- is_live ... would prohibit setting live bits
- dont_touch_or_free_header ... might be needed for destruction

But with ARENA_DOD_FLAGS enabled, I don't see much chance to get this
running. This would need collecting all constant strings in an aligned
memory segment, attach an arena header to it and set live bits in
attached dod_flags - a lot of work for a preprocessor, albeit doable
with a lot of effort.

So what about a string cache? It could work similarly to the method cache,
with lookup via some address bits. It should still be cheaper than
constructing all these strings over and over.

Brainstorming time...

leo

PS why I really like to have something like this:

$ time parrot -jG ff.pasm
010

real 0m1.728s # with _S("__get_integer")
real 0m2.148s # with const_string(...)

Jeff Clites

Apr 18, 2004, 12:15:51 AM
to l...@toetsch.at, perl6-i...@perl.org
On Apr 17, 2004, at 12:41 PM, Leopold Toetsch wrote:

> This scheme with constant strings in constant memory doesn't work - at
> least not with ARENA_DOD_FLAGS enabled, which assumes *all* objects are
> coming from arenas. These string headers live outside of any arena.

Oh, yes--darn.

> It could work w/o ARENA_DOD_FLAGS *if* these strings get additional
> flags set:
> - is_live ... would prohibit setting live bits
> - dont_touch_or_free_header ... might be needed for destruction

I wonder if it's possible to identify these at runtime as living in the
C-constant region of memory? For instance, if we could tell their
memory address is < stack base, and use that to identify them as
constant?

> But with ARENA_DOD_FLAGS enabled, I don't see much chance to get this
> running. This would need collecting all constant strings in an aligned
> memory segment, attach an arena header to it and set live bits in
> attached dod_flags - a lot of work for a preprocessor, albeit doable
> with a lot of effort.

It should be possible to aggregate all of the constants into a single
array (one include, rather than one-per-source-file), which would let
us identify them by their memory location, as residing in this range.
That seems pretty straightforward to do. So rather than compiling to
static_string_532, instead _S("foo") would compile to
static_strings[7], or something. Then the check is just whether
(some_string >= static_strings[0] && some_string <=
static_strings[max])--if so, it was from a literal (and thus, is
constant).
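
A minimal sketch of that single-array idea, with invented names (this is not Parrot's actual string layout), might look like:

```c
#include <stddef.h>

/* Sketch: all literal string headers live in one static array, so
 * "is this a constant?" becomes a pointer range check. The struct and
 * names are illustrative stand-ins, not Parrot's real types. */
typedef struct { const char *strstart; size_t buflen; } FakeString;

static FakeString static_strings[] = {
    { "__get_integer", 13 },
    { "__init",        6  },
    { "foo",           3  },
};
#define N_STATIC (sizeof static_strings / sizeof static_strings[0])

/* True if s points into static_strings, i.e. came from a literal. */
static int is_constant_string(const FakeString *s)
{
    return s >= &static_strings[0] && s < &static_strings[N_STATIC];
}
```

One caveat: strictly conforming C only defines relational comparison for pointers into the same array, so a check like this leans on behavior that is common but not guaranteed portable.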

> PS why I really like to have something like this:
>
> $ time parrot -jG ff.pasm
> 010
>
> real 0m1.728s # with _S("__get_integer")
> real 0m2.148s # with const_string(...)

Yes, it seems like a good idea, in general terms; quite reasonable, as
an extension of what C already does for C strings. (FWIW, ObjC has
support for literal NSStrings--I'm not sure how much of the "work" is
done at compile-time v. runtime, though the tricky part for us is
really GC, which isn't a factor for ObjC. I wonder what Java does for
string literals? Maybe something similar to the above.)

JEff

Leopold Toetsch

Apr 18, 2004, 6:13:42 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

> It should be possible to aggregate all of the constants into a single
> array (one include, rather than one-per-source-file), which would let
> us identify them by their memory location, as residing in this range.
> That seems pretty straightforward to do. So rather than compiling to
> static_string_532, instead _S("foo") would compile to
> static_strings[7], or something. Then the check is just whether
> (some_string >= static_strings[0] && some_string <=
> static_strings[max])--if so, it was from a literal (and thus, is
> constant).

The problem with this approach is that it's happening in the fast path
of C<pobject_lives()>. The compare of the memory addresses would be done
for *all* objects.

So I think it's better to just allocate a string header in the constant
string header pool and treat these strings like those in the constant
table.

That still needs collecting all constant strings in one file plus the
creation of the constant string headers on interpreter startup.

> JEff

leo

Leopold Toetsch

Apr 18, 2004, 11:06:50 AM
to perl6-i...@perl.org
Leopold Toetsch <l...@toetsch.at> wrote:

[ initial proposal ]

I've now checked in a working version.
* c2str.pl generates a .str header from a .c file
* c2str.pl --all generates $(INC)/string_private_cstring.h
* this is used in string_init() to finally generate entries
in the interpreter's constant string table

* to add new files, makefiles/root.in or similar has to be edited;
see objects.str as a template

Using multiple files is only slightly tested though.

leo

Jeff Clites

Apr 19, 2004, 4:22:00 AM
to l...@toetsch.at, perl6-i...@perl.org

To handle multiple files, we'll probably need to generate a .c to hold
the C strings (instead of the .h), and have an extern declaration in
the .h (since it will be included in multiple files). That's assuming
they'll all be aggregated into a single file (which makes sense).

Here is a related patch, to cause us to cache the hash values of all
strings (on demand). The important part is that the cached value is
cleared out in unmake_COW, which is called any time we might mutate the
string (and thus, invalidate the cached value). This will have the
side-effect of allowing c2str.pl to be slightly simpler, since it won't
need to pre-calculate the hash value (since const strings are the same
as any others, and their hash value will be calculated and cached if it
is ever needed).
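
The cache-on-demand scheme described above can be sketched roughly like this; the struct, the hash function, and mutate() are simplified stand-ins (the real invalidation point is unmake_COW, per the patch):

```c
#include <stddef.h>

/* Sketch of hash caching on demand: hashval 0 means "not computed yet";
 * any mutation path clears it, mirroring what unmake_COW does in the
 * attached patch. All names are illustrative, not Parrot's. */
typedef struct {
    char   buf[64];
    size_t len;
    size_t hashval;   /* 0 == no cached hash */
} TinyString;

static size_t tiny_hash(const char *p, size_t len)
{
    size_t h = 5381;                 /* djb2-style stand-in hash */
    while (len--) h = h * 33 + (unsigned char)*p++;
    return h ? h : 1;                /* never return the "empty" sentinel */
}

static size_t cached_hash(TinyString *s)
{
    if (s->hashval == 0)
        s->hashval = tiny_hash(s->buf, s->len);
    return s->hashval;
}

static void mutate(TinyString *s, char c)
{
    s->buf[s->len++] = c;
    s->hashval = 0;                  /* invalidate, as unmake_COW does */
}
```

Reserving 0 as the "uncached" sentinel is the one subtlety: the hash function must never legitimately return 0, or that string would be rehashed on every lookup.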

This change speeds up the attached benchmark by a factor of 1.86 in the
optimized case (via --optimize, so -Os), or 3.73 in the unoptimized case
(on Mac OS X):

# without the patch, optimized build
% ./parrot hash-timing.pasm
679 hash entries
128 characters per test key
1000000 lookups each
rep 1: 1.093889 sec
rep 2: 1.109484 sec
rep 4: 1.095041 sec

# with the patch, optimized build
% ./parrot hash-timing.pasm
679 hash entries
128 characters per test key
1000000 lookups each
rep 1: 0.608547 sec
rep 2: 0.586352 sec
rep 4: 0.575159 sec

JEff

hash-caching.patch
hash-timing.pasm

Jarkko Hietaniemi

Apr 18, 2004, 3:17:15 AM
to Jeff Clites, l...@toetsch.at, perl6-i...@perl.org
> C-constant region of memory? For instance, if we could tell their
> memory address is < stack base, and use that to identify them as
> constant?

I don't think there is much chance of getting anything like this working
portably.

> static_strings[7], or something. Then the check is just whether
> (some_string >= static_strings[0] && some_string <=
> static_strings[max])--if so, it was from a literal (and thus, is
> constant).

Something like this would be feasible. In fact, if we are going for
compile-time tricks, all constant strings (or their "bodies", at least)
could be concatenated into a single giant string, and then have another
constant array just having the [offset, bytes] pairs. Or, rather, the
[offset, bytes, hash] triplets.
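
A sketch of that layout, with placeholder names and made-up hash values (in practice the preprocessor would compute them):

```c
#include <stddef.h>

/* Sketch of the concatenated-blob idea: every literal's bytes packed
 * into one constant string, plus a parallel table of
 * (offset, bytes, hash) triplets. Names and hashes are illustrative. */
static const char string_blob[] = "__get_integer" "__init" "foo";

typedef struct { size_t offset, bytes, hash; } ConstStringEntry;

static const ConstStringEntry const_strings[] = {
    { 0,  13, 0x1111 },   /* "__get_integer"; hash would be precomputed */
    { 13, 6,  0x2222 },   /* "__init" */
    { 19, 3,  0x3333 },   /* "foo" */
};

/* Pointer to the i-th literal's bytes (not individually NUL-terminated). */
static const char *const_string_start(size_t i)
{
    return string_blob + const_strings[i].offset;
}
```

Besides making the range check trivial, this layout keeps all literal bytes in one read-only page, which is friendly to sharing across processes.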


Leopold Toetsch

Apr 19, 2004, 5:25:18 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

> To handle multiple files, we'll probably need to generate a .c to hold
> the C strings (instead of the .h), and have an extern declaration in
> the .h (since it will be included in multiple files). That's assuming
> they'll all be aggregated into a single file (which makes sense).

Well, there is one include file per .c (e.g. objects.str), which holds
defines for this file. The global string table is included only in
string.c to construct the strings. There is no need to have include
files with any extern declarations or includes for multiple files.

> Here is a related patch, to cause us to cache the hash values of all
> strings (on demand). The important part is that the cached value is
> cleared out in unmake_COW, which is called any time we might mutate the
> string

Yep. Should work.

> side-effect of allowing c2str.pl to be slightly simpler, since it won't
> need to pre-calculate the hash value (since const strings are the same
> as any others, and their hash value will be calculated and cached if it
> is ever needed).

We can still precalculate for these constant strings and save some extra
cycles (the precalculated value isn't used yet, but ...) And we can
precalculate hash values for the string constants in the constant table
during compilation (and write them into the PBC).

> This change speeds up the attached benchmark by a factor of 1.86 in the
> optimize case (via --optimize, so -Os), or 3.73 in the unoptimized case
> (on Mac OS X):

Wheee, that's a lot.

I'll apply it RSN - I'm currently fighting with the Makefile (classes
stuff moved to main)

> JEff

leo

Leopold Toetsch

Apr 19, 2004, 8:15:30 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

> Here is a related patch, to cause us to cache the hash values of all
> strings (on demand). The important part is that the cached value is
> cleared out in unmake_COW, which is called any time we might mutate the
> string (and thus, invalidate the cached value).

Thanks, applied.
leo

Jeff Clites

Apr 19, 2004, 12:19:59 PM
to l...@toetsch.at, perl6-i...@perl.org
On Apr 19, 2004, at 2:25 AM, Leopold Toetsch wrote:

> We still can precalculate for these constant strings and save some
> extra
> cylces (the precalculated value isn't used yet, but ...) And we can
> precalculate hash values for the string constants in the constant table
> during compilation (and write it into the PBC).

So the tradeoff there is that in the run-from-source case it may be a
slowdown (since we'll calculate hash values of some strings which may
never be used as hash keys or lookup keys), but we can detect how we
are running and only calculate them if we are writing out bytecode, so
that should be a win.

We'll be in trouble if we ever change the hash algorithm, if we're
freezing the result into PBC. So we should probably have something in
the signature, so that if we detect that the algorithm may have
changed, we ignore what's in the PBC, and fall back to cache-on-demand.

JEff

Jeff Clites

Apr 20, 2004, 12:56:11 PM
to l...@toetsch.at, perl6-i...@perl.org
On Apr 19, 2004, at 2:25 AM, Leopold Toetsch wrote:

> Jeff Clites <jcl...@mac.com> wrote:
>
>> This change speeds up the attached benchmark by a factor of 1.86 in
>> the
>> optimize case (via --optimize, so -Os), or 3.73 in the unoptimized
>> case
>> (on Mac OS X):
>
> Wheee, that's a lot.

Here's another tiny patch, to let us fast-fail string_equal if we have
cached hashvals which don't match. It will only make a difference in
some cases (strings of equal length which only differ near the end, and
which have cached hashvals), but in those cases the speedup can be a
factor of 1.8 (optimized build, benchmark attached).

JEff

Index: src/string.c
===================================================================
RCS file: /cvs/public/parrot/src/string.c,v
retrieving revision 1.195
diff -u -b -r1.195 string.c
--- src/string.c 19 Apr 2004 12:15:15 -0000 1.195
+++ src/string.c 20 Apr 2004 16:40:41 -0000
@@ -1758,6 +1758,11 @@
else if (!s1->strlen && !s2->strlen) {
return 0;
}
+ else if ((s1->hashval != s2->hashval)
+ && (s1->hashval != 0) && (s2->hashval != 0))
+ {
+ return 1;
+ }

# if ! DISABLE_GC_DEBUG
/* It's easy to forget that string comparison can trigger GC */
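
A standalone sketch of the fast-fail in the patch above; the struct, names, and the literal hash values are simplified stand-ins (real hashvals would come from the hashing function):

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the fast-fail: if both strings carry cached (nonzero) hash
 * values and they differ, the strings cannot be equal, so report
 * "unequal" without touching the bytes. Simplified stand-in types. */
typedef struct { const char *strstart; size_t strlen; size_t hashval; } HStr;

/* 0 = equal, 1 = not equal, matching string_equal's convention. */
static int hstr_equal(const HStr *a, const HStr *b)
{
    if (a->strlen != b->strlen)
        return 1;
    if (a->hashval && b->hashval && a->hashval != b->hashval)
        return 1;                     /* cached hashes disagree: fast fail */
    return memcmp(a->strstart, b->strstart, a->strlen) != 0;
}
```

Note that a hash collision is harmless here: equal (or colliding) hashes simply fall through to the full byte compare.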

string-equal-timing-hashed.pasm

Leopold Toetsch

Apr 20, 2004, 2:22:28 PM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

> Here's another tiny patch, to let us fast-fail string_equal if we have
> cached hashvals which don't match.

What about a hash value collision?

leo

Leopold Toetsch

Apr 21, 2004, 7:05:33 AM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

> Here's another tiny patch, to let us fast-fail string_equal if we have
> cached hashvals which don't match.

Tested and applied now. I've also adapted JIT/i386 to use string_equal
for C<eq> and C<ne> string ops. This speeds up these ops considerably,
*and* in the case your test is showing, the numbers indicate:

No stored hashval, 1000000 lookups each
same: 0.018617 sec
equal: 0.487958 sec
not equal: 0.517992 sec
With stored hashval, 1000000 lookups each
same: 0.019674 sec
equal: 0.487740 sec
not equal: 0.038465 sec

... a factor ~14 performance increase for the "not equal" case.

> JEff

Thanks,
leo

Jeff Clites

Apr 21, 2004, 12:22:18 PM
to l...@toetsch.at, perl6-i...@perl.org

Ah, great! (And the "not equal" case is the only one which should be
showing a speed up--the "same" and "equal" cases are expected to be
unaffected.)

JEff

Dan Sugalski

Apr 21, 2004, 1:20:50 PM
to Jeff Clites, l...@toetsch.at, perl6-i...@perl.org

Just to make sure... we're making sure the strings are always
properly decomposed before comparing, right?
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Leopold Toetsch

Apr 21, 2004, 1:14:54 PM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:

> On Apr 21, 2004, at 4:05 AM, Leopold Toetsch wrote:

>> ... a factor ~14 performance increase for the "not equal" case.

> Ah, great!

With an optimized compile (of string.c only) the speed up decreases to
only a factor of 12 :)

> (And the "not equal" case is the only one which should be
> showing a speed up--the "same" and "equal" cases are expected to be
> unaffected.)

The "equal" case was missing one thing: if both strings are COWed copies,
the compare can be avoided too - it's equally fast then, as "not equal".
That's already in CVS.

These changes in your code show the case:

concat S0, "a"
#concat S1, "a" # <<<<<<<<
assign S1, S0 # <<<<<<<<
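
The COW short-circuit can be sketched like this (simplified stand-in types; it also reflects the point made later in the thread that the strlens are compared before strstart is looked at):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: two COWed copies share the same strstart, so once the
 * lengths are known equal, pointer identity proves byte equality
 * without a memcmp. Types and names are illustrative. */
typedef struct { const char *strstart; size_t strlen; } CowString;

/* 0 = equal, 1 = not equal, matching string_equal's convention. */
static int cow_string_equal(const CowString *a, const CowString *b)
{
    if (a->strlen != b->strlen)
        return 1;
    if (a->strstart == b->strstart)   /* shared COW buffer: must be equal */
        return 0;
    return memcmp(a->strstart, b->strstart, a->strlen) != 0;
}
```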

> JEff

leo

Jeff Clites

Apr 21, 2004, 2:13:24 PM
to l...@toetsch.at, perl6-i...@perl.org
On Apr 21, 2004, at 10:14 AM, Leopold Toetsch wrote:

> The "equal" case was missing one thing: if both strings are COWed
> copies,
> the compare can be avoided too - it's equally fast then, as "not
> equal".

That makes sense, as long as we never optimize substring via a COW copy
with a different strlen. (That is, one could optimize the
initial-substring case that way, though I don't think we do that
currently.)

JEff

Dan Sugalski

Apr 21, 2004, 2:25:15 PM
to Jeff Clites, perl6-i...@perl.org, l...@toetsch.at
At 11:17 AM -0700 4/21/04, Jeff Clites wrote:

>On Apr 21, 2004, at 10:20 AM, Dan Sugalski wrote:
>
>>At 9:22 AM -0700 4/21/04, Jeff Clites wrote:
>>>On Apr 21, 2004, at 4:05 AM, Leopold Toetsch wrote:
>>>
>>>>... a factor ~14 performance increase for the "not equal" case.
>>>
>>>Ah, great! (And the "not equal" case is the only one which should be
>>>showing a speed up--the "same" and "equal" cases are expected to be
>>>unaffected.)
>>
>>Just to make sure... we're making sure the strings are always properly
>>decomposed before comparing, right?
>
>Nope, this is a literal "equal" comparison--you'd build a normalized
>compare on top of this. (There's 2 reasons for that: (1) You definitely
>need a non-normalized comparison available, because often that's what
>you want, and (2) For normalized comparison, you need to pick which
>style of normalization you want--there are at least 4 choices, each of
>which makes sense in different situations.)

We need to address that, then. If we're doing Unicode, we damn well need
to do it right--å is å, regardless of whether it's composed or
decomposed.

If people want low-level binary comparisons (and generally we
*shouldn't* for most things) then they'll need to force the string to
binary.

Jarkko Hietaniemi

Apr 21, 2004, 4:04:51 PM
to perl6-i...@perl.org, Dan Sugalski, Jeff Clites, perl6-i...@perl.org, l...@toetsch.at
>
> We need to address that, then. If we're doing
> unicode, we damn well need to do it right--å is
> å, regardless of whether it's composed or
> decomposed.

Agreed -- on some level. But if we want to implement Larry's :u0 (bytes)
and :u1 (code points) levels, we also need to have the "more raw"
comparisons available, somehow. (I do not remember whether Larry
specified that :u2 would by default do some of the Unicode
normalizations, thus doing (de)compositions.)

> If people want low-level binary comparisons (and
> generally we *shouldn't* for most things) then
> they'll need to force the string to binary.

And I'm not certain whether "forcing to binary" is the right
visual image or approach here. Maybe we need some sort of
"pragma" support so that we can tweak the ":u level"? The
default level could well be :u2, the highest we can do without
picking some "language" rules.

Dan Sugalski

Apr 21, 2004, 5:14:14 PM
to Jarkko Hietaniemi, perl6-i...@perl.org, Jeff Clites, perl6-i...@perl.org, l...@toetsch.at
At 11:04 PM +0300 4/21/04, Jarkko Hietaniemi wrote:
> >
>> We need to address that, then. If we're doing
>> unicode, we damn well need to do it right--å is
>> å, regardless of whether it's composed or
>> decomposed.
>
>Agreed -- on some level. But If we want to implement Larry's
>:u0 (bytes) and :u1 (code points) levels we need to have also
>the "more raw" comparisons available, somehow. (I do not remember
>whether Larry specified would :u2 do by default some of the Unicode
>normalizations, thus doing (de)compositions.)

We'll work that out when the perl 6 compiler gets to that point. For
Parrot, my preference (unless ICU makes it infeasible, which I doubt) is
to keep everything decomposed. I hear rumor that way's preferred... :)

>>If people want low-level binary comparisons (and generally we
>>*shouldn't* for most things) then they'll need to force the string to
>>binary.
>
>And I'm not certain whether "forcing to binary" is the right
>visual image or approach here. Maybe we need some sort of
>"pragma" support so that we can tweak the ":u level"? The
>default level could well be :u2, the highest we can do without
>picking some "language" rules.

I've got a Cunning Plan, oddly enough, though the margins of this e-mail
are too small to contain it. As soon as I get it finished I'm going to
pass it onto the list and to a few non-list folks who I know are deep
into this stuff (Autrijus and Dan Kogai, if I can get in touch. I
*really* wish I had someone who did mainly Korean text processing
handy...) and we'll see where we go from there. I have no doubt it'll
be... fun. Yeah, that's the word, fun!

Leopold Toetsch

Apr 21, 2004, 6:22:18 PM
to Dan Sugalski, perl6-i...@perl.org
Dan Sugalski <d...@sidhe.org> wrote:

> Just to make sure... we're making sure the strings are always
> properly decomposed before comparing, right?

Not in the absence of any rules for how to decompose, or better, when ;)
We are currently still at Larry's level 0 or 1. Hash values and compare
operations are stable though, for and up to Unicode codepoints.

leo

Leopold Toetsch

Apr 21, 2004, 5:03:02 PM
to Jeff Clites, perl6-i...@perl.org
Jeff Clites <jcl...@mac.com> wrote:
> On Apr 21, 2004, at 10:14 AM, Leopold Toetsch wrote:

>> The "equal" case was missing one thing: if both strings are COWed
>> copies,
>> the compare can be avoided too - it's equally fast then, as "not
>> equal".

> That makes sense, as long as we never optimize substring via a COW copy
> with a different strlen.

The strlen's are compared earlier and are already equal, when the
compare with C<strstart> is done.

> JEff

leo

Dan Sugalski

Apr 21, 2004, 6:45:10 PM
to l...@toetsch.at, perl6-i...@perl.org

Well, then, I'll make A Big Decision:

All strings, in the absence of explicit overriding of behavior, shall
be treated as if they were in Canonical Form. If this is not the case,
the strings will be canonicalized first. If a character set has both
composed and decomposed versions of some characters, the decomposed
version is our canonical form. This includes all hash keys, which means
method names, globals, and lexical variables are all treated as if
their names were stored in decomposed form if there are decomposable
characters in the names.

I can think of a language or two where this might be considered
sub-optimal, so I'm willing to work this out, though I'm not sure I
want to mix composed and decomposed characters depending on which
ones they are.
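
The decision above is easy to illustrate with "å" itself: composed it is the single codepoint U+00E5, decomposed it is U+0061 followed by U+030A (combining ring above). A minimal sketch, using plain codepoint arrays rather than Parrot's string type, shows why a raw compare is not enough:

```c
#include <stddef.h>

/* "å" composed vs decomposed, per the Unicode standard: a raw
 * codepoint compare calls these different, even though they are
 * canonically the same character. */
static const unsigned composed[]   = { 0x00E5 };
static const unsigned decomposed[] = { 0x0061, 0x030A };

/* 1 if the codepoint sequences are identical, 0 otherwise. */
static int raw_equal(const unsigned *a, size_t alen,
                     const unsigned *b, size_t blen)
{
    size_t i;
    if (alen != blen) return 0;
    for (i = 0; i < alen; i++)
        if (a[i] != b[i]) return 0;
    return 1;
}
```

Under Dan's rule, both strings would be canonicalized to the decomposed form before this kind of compare runs, so the mismatch disappears.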

Dan Sugalski

Apr 21, 2004, 6:54:17 PM
to kj, perl6-i...@perl.org
At 4:31 PM -0600 4/21/04, kj wrote:

>On 21 Apr 2004, at 15:14, Dan Sugalski wrote:
>
>>I've got a Cunning Plan, oddly enough, though the margins of this
>>e-mail are too small to contain it. As soon as I get it finished
>>I'm going to pass it onto the list and to a few non-list folks who
>>I know are deep into this stuff (Autrijus and Dan Kogai, if I can
>>get in touch. I *really* wish I had someone who did mainly Korean
>>text processing handy...) and we'll see where we go from there. I
>>have no doubt it'll be... fun. Yeah, that's the word, fun!
>
> I asked a friend of mine about lending a hand with Korean, but he
>will be too busy for the next little while. He did say that he will
>pass on the info to another person who might be able to lend a hand,
>though. Hopefully you'll have your wish soon, Dan.

Woohoo! Cool, and thanks very much.

If he's up for it, could you ask him a question? Namely "Treating all
text as Unicode--good idea or bad idea?" If the answer's going to be
a lot of work you can skip it, that's OK.

Larry Wall

Apr 21, 2004, 1:59:56 PM
to perl6-i...@perl.org
On Wed, Apr 21, 2004 at 01:20:50PM -0400, Dan Sugalski wrote:
: Just to make sure... we're making sure the strings are always
: properly decomposed before comparing, right?

And likewise before hashing.

Larry

Jeff Clites

Apr 22, 2004, 5:48:03 AM
to Dan Sugalski, perl6-i...@perl.org, l...@toetsch.at
On Apr 21, 2004, at 7:33 PM, Dan Sugalski wrote:

> At 11:17 AM -0700 4/21/04, Jeff Clites wrote:
>> On Apr 21, 2004, at 10:20 AM, Dan Sugalski wrote:
>>

>>> Just to make sure... we're making sure the strings are always
>>> properly decomposed before comparing, right?
>>

>> Nope, this is a literal "equal" comparison--you'd build a normalized
>> compare on top of this.
>

> I think this got caught on the list queue for a bit, and it's already
> been addressed, but just to be clear, Parrot's keeping decomposable
> characters decomposed, and generally normalizing, or at least
> pretending it's normalizing if it doesn't actually do so, when working
> with strings.

Yes, in order to define notions of normalized equivalence, you need a
notion of strict equality on which to base them. string_equal() is the
latter; the former are yet-to-be-coded.

JEff

Kj

Apr 21, 2004, 6:31:34 PM
to Dan Sugalski, perl6-i...@perl.org
On 21 Apr 2004, at 15:14, Dan Sugalski wrote:

> I've got a Cunning Plan, oddly enough, though the margins of this
> e-mail are too small to contain it. As soon as I get it finished I'm
> going to pass it onto the list and to a few non-list folks who I know
> are deep into this stuff (Autrijus and Dan Kogai, if I can get in
> touch. I *really* wish I had someone who did mainly Korean text
> processing handy...) and we'll see where we go from there. I have no
> doubt it'll be... fun. Yeah, that's the word, fun!

I asked a friend of mine about lending a hand with Korean, but he will
be too busy for the next little while. He did say that he will pass on
the info to another person who might be able to lend a hand, though.
Hopefully you'll have your wish soon, Dan.

Cheers,

~kj

Kj

Apr 21, 2004, 7:52:56 PM
to Dan Sugalski, perl6-i...@perl.org
On 21 Apr 2004, at 16:54, Dan Sugalski wrote:

> Woohoo! Cool, and thanks very much.

No problem. I can't find someone to come on-board yet, but I did get
an answer to your question.

> If he's up for it, could you ask him a question? Namely "Treating all
> text as Unicode--good idea or bad idea?" If the answer's going to be a
> lot of work you can skip it, that's OK.

The answer is fairly straightforward, fortunately.

Talking to Burnhard and perky on HanIRC, I was able to get the
following information:

- there are (of course) some character sets that don't work well with
Unicode -- for example, Big5HKSCS doesn't encode in UCS2 (though I
didn't find out why)

- that being said, the consensus was that internal storage as Unicode
is a good idea for modern programming languages and APIs.

- Tcl/Tk's method of per-FH filters for EUC, johab, etc. seems to be
useful and well-received.

So in essence, what I got from the conversation was that internal
storage as Unicode is a good thing (and indeed, expected), so long as a
method for conversion on input/output is provided.

Sorry if that doesn't answer all the nuances of the question, but
that's the best I can do for now.

Cheers,

~kj

Jeff Clites

Apr 22, 2004, 11:47:40 AM
to kj, Dan Sugalski, perl6-i...@perl.org
On Apr 21, 2004, at 4:52 PM, kj wrote:

> - there are (of course) some character sets that don't work well with
> Unicode -- for example, Big5HKSCS doesn't encode in UCS2 (though I
> didn't find out why)

UCS-2 is limited--it can only address the BMP (that is, only 2^16
characters). It has been superseded by the UTF-* encodings. (UTF-16 can
be thought of as UCS-2 plus surrogate pairs.)
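
The surrogate-pair math behind that "UCS-2 plus surrogate pairs" description, per the Unicode standard, is simple enough to sketch (the function name is ours, for illustration):

```c
#include <stdint.h>

/* UCS-2 stops at U+FFFF; UTF-16 encodes U+10000..U+10FFFF as two
 * 16-bit units. Split a supplementary codepoint (>= 0x10000) into
 * its high/low surrogate pair, per the Unicode standard. */
static void utf16_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));     /* high surrogate */
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));   /* low surrogate */
}
```

This is exactly the mechanism that lets characters outside the BMP be represented in UTF-16 while having no UCS-2 encoding at all.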

It's my understanding that all of the characters in HKSCS-2001 are
available in Unicode 4.0 (with 35 rarely-used characters being mapped
into the private use area).

Necessarily, Unicode lags behind revisions to national standards--it
takes time to incorporate the changes--but, so does everything else
(inclusion of new characters into fonts, etc.).

JEff

Jeff Clites

Apr 20, 2004, 3:07:01 PM
to l...@toetsch.at, perl6-i...@perl.org

If the hash values are equal, it proceeds on to do the full comparison.
It's only in the case where the hash values are inequal (and neither is
zero) that it can know the strings are inequal, and stop without doing
the full comparison.

(string_equal returns 1 to indicate !=, which matches string_compare
but is the opposite of what you'd expect.)

JEff

Dan Sugalski

Apr 22, 2004, 3:36:00 PM
to Jeff Clites, perl6-i...@perl.org, l...@toetsch.at

I think, honestly, that for strings in a character set with multiple
variants that are declared equal, I want strict equality to be based on
the canonical form of the strings rather than the binary form. For
Unicode the standard defines them as identical, and if we have a mix of
"really identical" and "logically identical" tests we're going to get a
lot of subtle and damned annoying bugs seeping in.
