Unicode step by step

Leopold Toetsch

Apr 10, 2004, 9:13:57 AM

to P6I

1) Allow usage of installed libicu. Needs some config support to locate
the necessary includes. Libraries should be found automatically.

2) String PBC layout. The internal string type has changed. This
currently breaks native_pbc tests (that have strings) as well as some
"parrot xx.pbc" tests related to strings. The layout seems to depend
somehow on the supported Unicode levels (or not). So before fixing the
PBC issues, I'd just like to have a statement: parrot_string_t looks
such and such, or of course stays as it is now.

There is of course still the question: Should we really have ICU in the
tree? This means tracking updates and patching (again) to make it build,
and so on.

Thanks,
leo

Jeff Clites

Apr 10, 2004, 4:12:54 PM

to Leopold Toetsch, P6I

On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote:

> 2) String PBC layout. The internal string type has changed. This
> currently breaks native_pbc tests (that have strings) as well as some
> "parrot xx.pbc" tests related to strings.

These are working for me (which tests are failing for you?)--I did
update the PF_* API to match the changes to string internals. Of
course, since the internals changed the pbc layout changed also, so the
native_pbc test files need to be regenerated on the various
platforms--but the ppc one I submitted (see other post, or original
patch submission) should work. But if that one fails for you, it's
probably b/c of byte order, and I need to look and find where we do the
endianness correction for integers in pbc files, and hook in to do
something similar for certain string cases. If someone can send me a
number_X.pbc file generated on an i386 platform, that will help me
test.

But, it's correct that there's no backward-compatibility code in place,
to allow reading old pbc files. Do we want to have that sort of thing
at this stage? (Certainly, I'd think that after 1.0 we'd want backward
compatibility with any format changes, but do we need it at this
stage?)

But let me know which "parrot xx.pbc" tests are failing for you.

> The layout seems to depend somehow on the supported Unicode levels (or
> not). So before fixing the PBC issues, I'd just have a statement:
> parrot_string_t looks such and such or of course as is now.

Could you rephrase? I'm not understanding what you are saying.

The only real change in the pbc format (if I'm recalling
correctly--I'll have to go back and look) is that rather than
serializing the encoding/chartype/language triple, we are writing out
the s->representation (still followed by s->bufused and then the
contents of the buffer). The only other wrinkle is that for cases where
s->representation is 2 or 4, we need to correct for endianness when we
use the bytecode.
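
As a rough illustration of the endianness correction described above (this is not the actual src/pf_items.c patch), the Python sketch below byte-swaps each fixed-width code unit of a string buffer when the file's byte order differs from the host's. The function names, and the idea of passing the file's byte order as "little" or "big", are assumptions made for this example only.

```python
import sys


def swap_code_units(buf: bytes, width: int) -> bytes:
    """Reverse the byte order of every `width`-byte code unit in `buf`."""
    if width == 1:
        return buf  # single-byte units have no byte order
    if width not in (2, 4):
        raise ValueError("width must be 1, 2 or 4")
    if len(buf) % width != 0:
        raise ValueError("buffer length is not a multiple of the unit width")
    return b"".join(
        buf[i:i + width][::-1] for i in range(0, len(buf), width)
    )


def to_native(buf: bytes, width: int, file_byteorder: str) -> bytes:
    """Return `buf` in host byte order; `file_byteorder` is 'little' or 'big'."""
    if file_byteorder == sys.byteorder:
        return buf
    return swap_code_units(buf, width)
```

With a 2-byte representation, the Angstrom sign U+212B is stored as 21 2b on a big-endian machine and must become 2b 21 on a little-endian one; a 1-byte representation passes through untouched.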

This is probably a separate discussion, but we _could_ decide instead
to represent strings in pbc files always in UTF-8. Advantage: Simpler,
no endianness correction needed, probably durable to further changes in
string internals, could isolate s->representation awareness to string.c
and string_primitives.c. Disadvantages: De-serializing a string from a
pbc file will always involve a copy, and could result in larger files
in some cases. I could argue it either way--one's cleaner, the other is
probably faster.
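
The trade-off can be made concrete with a small Python sketch. It only compares standard encodings; using Latin-1 as a stand-in for a one-byte-per-character representation is an assumption of this example, not a statement about Parrot's internals.

```python
def encodings(text: str) -> dict:
    """Encode `text` in UTF-8 and in both byte orders of UTF-16."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-be": text.encode("utf-16-be"),
        "utf-16-le": text.encode("utf-16-le"),
    }


# UTF-8 has a single byte sequence per string, so no byte-order fix-up is
# needed; the two UTF-16 forms differ and would need correcting on load.
angstrom = encodings("\u212b")
same_bytes_everywhere = angstrom["utf-8"]
differs_by_platform = angstrom["utf-16-be"] != angstrom["utf-16-le"]

# Size cost: a character that fits in one byte in a one-byte representation
# (Latin-1 here, as a stand-in) takes two bytes in UTF-8.
e_acute_one_byte = len("\u00e9".encode("latin-1"))
e_acute_utf8 = len("\u00e9".encode("utf-8"))
```

Pure ASCII text costs the same in both forms, so the size penalty only shows up for strings with non-ASCII characters.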

> There is of course still the question: Should we really have ICU in
> the tree. This needs tracking updates and patching (again) to make it
> build and so on.

One consideration is that I may need to patch ICU in a few places--there's
at least one API which they only expose in C++, so I need to wrap it in
C and it's cleaner to do that as a patch to ICU rather than having C++
code in the core of parrot. Other than that, I think it boils down to
convenience, and (possibly) consistency in being able to say that
parrot version foo corresponds to ICU version bar (but maybe we don't
need to be able to say that).

JEff

Jeff Clites

Apr 10, 2004, 11:55:10 PM

to P6I List

On Apr 10, 2004, at 1:12 PM, Jeff Clites wrote:

> On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote:
>
>> 2) String PBC layout. The internal string type has changed. This
>> currently breaks native_pbc tests (that have strings) as well as some
>> "parrot xx.pbc" tests related to strings.
>
> These are working for me (which tests are failing for you?)--I did
> update the PF_* API to match the changes to string internals. Of
> course, since the internals changed the pbc layout changed also, so
> the native_pbc test files need to be regenerated on the various
> platforms--but the ppc one I submitted (see other post, or original
> patch submission) should work. But if that one fails for you, it's
> probably b/c of byte order, and I need to look and find where we do
> the endianness correction for integers in pbc files, and hook in to do
> something similar for certain string cases.

Here's a patch to src/pf_items.c, and a ppc t/native_pbc/number_3.pbc.
If it's working correctly, the attached strings-and-byte-order.* should
both do the same thing--output the Angstrom symbol. If it's wrong, then
the pbc version should output junk on a little-endian system.

(If your terminal emulator isn't prepared to handle UTF-8, then pipe
the output through 'less', and you should see something like
"<E2><84><AB>".)

(PS--I had to give the pbc file a fake extension, to keep the
develooper mail server from rejecting it.)

JEff

pf_items_c.patch

number_3.pbc

strings-and-byte-order.pasm

strings-and-byte-order.pbc.file

Leopold Toetsch

Apr 11, 2004, 5:20:58 AM

to Jeff Clites, perl6-i...@perl.org

Jeff Clites <jcl...@mac.com> wrote:

> Here's a patch to src/pf_items.c, and a ppc t/native_pbc/number_3.pbc.

Works.

> If it's working correctly, the attached strings-and-byte-order.* should
> both do the same thing--output the Angstrom symbol. If it's wrong, then
> the pbc version should output junk on a little-endian system.

> (If your terminal emulator isn't prepared to handle UTF-8, then pipe
> the output through 'less', and you should see something like
> "<E2><84><AB>".)

$ parrot string_1.pbc
Å

$ parrot string_1.pbc | od -tx1
0000000 e2 84 ab 0a
0000004
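
For readers who want to double-check the od output, the following Python sketch decodes the four bytes shown above. It uses only the standard library; the variable names are made up for this example.

```python
import unicodedata

# The data line printed by `od -tx1` above: an offset, then hex bytes.
od_line = "0000000 e2 84 ab 0a"

raw = bytes.fromhex("".join(od_line.split()[1:]))  # drop the offset column
text = raw.decode("utf-8")

for ch in text:
    # Control characters such as the newline have no Unicode name,
    # so a placeholder is supplied as the default.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}")
```

The first three bytes decode to U+212B ANGSTROM SIGN and the last byte is the trailing newline, which matches the expected result.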

> JEff

Thanks - I'll apply it RSN.

leo

Leopold Toetsch

Apr 11, 2004, 5:04:14 AM

to Jeff Clites, perl6-i...@perl.org

Jeff Clites <jcl...@mac.com> wrote:
> On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote:

>> 2) String PBC layout. The internal string type has changed. This
>> currently breaks native_pbc tests (that have strings) as well as some
>> "parrot xx.pbc" tests related to strings.

> These are working for me (which tests are failing for you?)--

$ make testr
Failed Test        Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/pmc/perlstring.t    3   768    33    3   9.09%  1-3
53 subtests skipped.
Failed 1/89 test scripts, 98.88% okay. 3/1432 subtests failed, 99.79% okay.

I didn't look further yet.

> ... Of
> course, since the internals changed the pbc layout changed also, so the
> native_pbc test files need to be regenerated on the various
> platforms

No problem.

> But, it's correct that there's no backward-compatibility code in place,
> to allow reading old pbc files. Do we want to have that sort of thing
> at this stage?

No, not needed.

>> The layout seems to depend somehow on the supported Unicode levels (or
>> not). So before fixing the PBC issues, I'd just have a statement:
>> parrot_string_t looks such and such or of course as is now.

> Could you rephrase? I'm not understanding what you are saying.

Well, the question is: Is s->representation enough to describe our
strings?

> ... The only other wrinkle is that for cases where
> s->representation is 2 or 4, we need to endianness correct when we use
> the bytecode.

Yep.

> This is probably a separate discussion, but we _could_ decide instead
> to represent strings in pbc files always in UTF-8. Advantage: Simpler,
> no endianness correction needed, probably durable to further changes in
> string internals, could isolate s->representation awareness to string.c
> and string_primitives.c. Disadvantages: De-serializing a string from a
> pbc file will always involve a copy, and could result in larger files
> in some cases. I could argue it either way--one's cleaner, the other is
> probably faster.

Strings from PBC constants can't be used directly anyway. We munmap() or
free() the image after loading, so string constants are always copied. I
think using UTF-8 would be best.
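
A minimal Python sketch of that idea follows: constants are stored as length-prefixed UTF-8 and copied out of the loaded image, so the image can be released afterwards. The layout (a 4-byte little-endian length followed by the payload) and the function names are hypothetical and are not Parrot's actual PBC format.

```python
import struct


def pack_string(text: str) -> bytes:
    """Serialize `text` as a 4-byte little-endian length plus UTF-8 bytes."""
    payload = text.encode("utf-8")
    return struct.pack("<I", len(payload)) + payload


def unpack_string(image, offset: int = 0):
    """Read one string from `image`; return (text, offset of next item)."""
    (length,) = struct.unpack_from("<I", image, offset)
    start = offset + 4
    # bytes(...) copies the payload, so the result stays valid after the
    # image itself has been freed or unmapped.
    payload = bytes(image[start:start + length])
    return payload.decode("utf-8"), start + length
```

Only the length prefix has a byte order in this layout, and the thread notes that integers in pbc files already get endianness correction.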

>> There is of course still the question: Should we really have ICU in
>> the tree. This needs tracking updates and patching (again) to make it
>> build and so on.

> One consideration is that I may need to patch ICU in a few places--there's
> at least one API which they only expose in C++, so I need to wrap it in
> C and it's cleaner to do that as a patch to ICU rather than having C++
> code in the core of parrot.

Can we get the ICU maintainers to integrate that interface?

> JEff

leo

Marcus Thiesen

Apr 10, 2004, 10:45:26 AM

to perl6-i...@perl.org

On Saturday 10 April 2004 15:13, Leopold Toetsch wrote:
> There is of course still the question: Should we really have ICU in the
> tree. This needs tracking updates and patching (again) to make it build
> and so on.

For the sake of platform independence I'd say keep it there. It's far easier
to quickly test on different platforms if you have only the usual build
dependencies and the one special thing is inside the tree.
What I want to say is that you'll find a sane build environment and a Perl on
most machines, but even I don't have ICU installed.
BTW, it doesn't compile on any platform at the moment, after a realclean on
the first "make" it complains about
../data/locales/ja.txt:15: parse error. Stopped parsing with
U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

If you do a make at this point again, it skips these steps and tries to link
parrot, failing on many undefined symbols, I believe from the non existent
ICU.

> Thanks,
> leo

Have fun,
Marcus

--
:: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 ::

I can resist anything but temptation
Oscar Wilde

Jeff Clites

Apr 13, 2004, 6:52:13 AM

to mar...@thiesenweb.de, perl6-i...@perl.org

> BTW, it doesn't compile on any platform at the moment, after a
> realclean on
> the first "make" it complains about
> ../data/locales/ja.txt:15: parse error. Stopped parsing with
> U_INVALID_FORMAT_ERROR
> couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
> make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

Try a "make realclean" first--Dan checked in a fix for this, and it
seems to require this to force everything to start fresh.

> If you do a make at this point again, it skips these steps and tries
> to link
> parrot, failing on many undefined symbols, I believe from the non
> existent
> ICU.

At this point I'd expect it to link, but maybe not run well--that
failure comes when packaging up the data files, and at that point
the libraries themselves should already be built and in the right
place. But you are detecting some "loose" behavior in the Makefile,
which was done in part so that ICU wouldn't rebuild unless you "make
clean".

JEff

Luka Frelih

Apr 13, 2004, 7:28:06 AM

to perl6-i...@perl.org

just a confirmation...
my i386 debian linux gives the same error repeatedly after make
realclean,
if i make again, it compiles a broken parrot which fails (too) many
tests...

also it seems (to me) that parrot's configured choice of compiler,
linker, ... is not used in building icu?

does icu have some non-ubiquitous dependencies?

LF

Marcus Thiesen

Apr 13, 2004, 10:25:30 AM

to perl6-i...@perl.org

On Tuesday 13 April 2004 13:28, luka frelih wrote:
> just a confirmation...
> my i386 debian linux gives the same error repeatedly after make
> realclean,
> if i make again, it compiles a broken parrot which fails (too) many
> tests...
>
> also it seems (to me) that parrot's configured choice of compiler,
> linker, ... is not used in building icu?
>
> does icu have some non-ubiquitous dependencies?

As I said yesterday, it worked on a machine of mine which I hadn't touched for
quite some while. On my notebook, where I do daily builds, I ran into the same
problem, even after having made a realclean.
So I did a "make clean" in the icu subdir directly, deleted all files which
are listed in .cvsignore and ran the "realclean configure build test" all
over and now it works. Seems as if something doesn't get cleaned up in icu
with a parrot realclean.

Have fun,
Marcus

--
:: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 ::

Do something every day that you don't want to do; this is the golden rule for
acquiring the habit of doing your duty without pain
Mark Twain

Leopold Toetsch

Apr 13, 2004, 12:22:05 PM

to mar...@thiesenweb.de, perl6-i...@perl.org

Marcus Thiesen wrote:
>. Seems as if something doesn't get cleaned up in icu
> with a parrot realclean.

Yep. I've removed cleaning icu from clean/realclean[1].

$ make help | grep clean
...
icu.clean: ...

And there is always "make cvsclean".

> Have fun,
> Marcus

leo

[1] If anyone puts that in again he might also send a lot faster PC to
me (and possibly other developers ;)

Dan Sugalski

Apr 13, 2004, 1:01:41 PM

to Leopold Toetsch, mar...@thiesenweb.de, perl6-i...@perl.org

At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote:
>Marcus Thiesen wrote:
>>. Seems as if something doesn't get cleaned up in icu with a parrot
>>realclean.
>
>Yep. I've removed cleaning icu from clean/realclean[1].

I think we need to put that back for a bit, but with this:

>[1] If anyone puts that in again he might also send a lot faster PC
>to me (and possibly other developers ;)

We're also likely going to be well-off if we get configure to detect
a system ICU install and use that instead. It shouldn't be that
tough, but I've not had a chance to poke around in the icu part of
our config system to find out what we need to do.
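
One possible shape for such a detection step is sketched below in Python, purely as an illustration: Parrot's own configure system is written in Perl, and nothing here reflects its real code. The sketch assumes the system ICU ships the `icu-config` helper script and that it accepts `--version`, `--cppflags` and `--ldflags`; the function name and the returned dictionary are hypothetical.

```python
import shutil
import subprocess


def detect_system_icu(which=shutil.which, run=subprocess.run):
    """Return build settings for an installed ICU, or None if none is found.

    `which` and `run` are injectable so the probe can be exercised
    without a real ICU install.
    """
    tool = which("icu-config")
    if tool is None:
        return None  # caller falls back to the copy in the tree

    def query(flag):
        result = run([tool, flag], capture_output=True, text=True, check=True)
        return result.stdout.strip()

    return {
        "version": query("--version"),
        "cppflags": query("--cppflags"),
        "ldflags": query("--ldflags"),
    }
```

A caller would use the returned flags when a system ICU is present and build the bundled copy otherwise, which keeps the in-tree ICU as the fallback.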
--
Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk

Leopold Toetsch

Apr 13, 2004, 5:48:48 PM

to Dan Sugalski, perl6-i...@perl.org

Dan Sugalski <d...@sidhe.org> wrote:
> At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote:
>>Marcus Thiesen wrote:
>>>. Seems as if something doesn't get cleaned up in icu with a parrot
>>>realclean.
>>
>>Yep. I've removed cleaning icu from clean/realclean[1].

> I think we need to put that back for a bit,

I did list two alternatives. The "normal" way of changes doesn't include
changes to ICU source (and honestly shouldn't). Currently building is
still a bit in flux, which does mandate a "make icu.clean".

And there is of course already a new ICU version on *their* website, but
we still try to get/keep 2.6 running.

I'm still not sure that this lib should be part of *our* tree ...

> ... but with this:

>>[1] If anyone puts that in again he might also send a lot faster PC
>>to me (and possibly other developers ;)

> We're also likely going to be well-off if we get configure to detect
> a system ICU install and use that instead.

There are several issues: the first one is MANIFEST and CVS and patches.
Config steps should be simple. But - of course - I'd appreciate this
alternative, as already laid out.

leo