With r8787, the following tcl code:
puts \u666
causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
and forces a few tcl-unicode escape tests into TODOs.
A short PIR test that is equivalent:
.sub main @MAIN
$S0 = "\\u666"
$I0 = 0x666
$S1 = chr $I0 # works, but substr doesn't like this string.
substr $S0, 0, 5, $S1
.end
Running the PIR through gdb, I get the stack trace below.
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00a98fff
0xffff8cf4 in ___memcpy () at /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h:189
189 /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h: No such file or directory.
in /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h
(gdb) bt
#0 0xffff8cf4 in ___memcpy () at /System/Library/Frameworks/System.framework/PrivateHeaders/ppc/cpu_capabilities.h:189
#1 0x0002d04c in string_replace (interpreter=0xd00180, src=0xe5b1c0, offset=0, length=5, rep=0xe5a630, d=0x0) at src/string.c:1238
#2 0x00093a88 in Parrot_substr_s_ic_ic_s (cur_opcode=0xd125fc, interpreter=0xd00180) at ops/string.ops:245
#3 0x00205c5c in runops_slow_core (interpreter=0xd00180, pc=0xd125fc) at src/runops_cores.c:153
#4 0x0004c6f0 in runops_int (interpreter=0xd00180, offset=0) at src/interpreter.c:750
#5 0x00044d70 in runops (interpreter=0xd00180, offs=0) at src/inter_run.c:81
#6 0x0001680c in Parrot_runcode (interpreter=0xd00180, argc=1, argv=0xbffff9d0) at src/embed.c:831
#7 0x000165bc in Parrot_runcode (interpreter=0xd00180, argc=1, argv=0xbffff9d0) at src/embed.c:765
#8 0x000043c8 in main (argc=1, argv=0xbffff9d0) at imcc/main.c:637
>
> causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> and forces a few tcl-unicode escape tests into TODOs.
>
> A short PIR test that is equivalent:
>
> .sub main @MAIN
> $S0 = "\\u666"
> $I0 = 0x666
> $S1 = chr $I0 # works, but substr doesn't like this string.
> substr $S0, 0, 5, $S1
> .end
> #1 0x0002d04c in string_replace (interpreter=0xd00180, src=0xe5b1c0,
> offset=0, length=5, rep=0xe5a630, d=0x0) at src/string.c:1238
string_replace still has the old code, which relies on fixed-width
encodings with 1, 2, or 4 bytes per char. That is of course not true
for utf8. This needs fixing.
Takers wanted,
leo
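To make the mismatch concrete, here is a minimal C sketch of why a fixed bytes-per-char assumption breaks on utf8. It is not the code in src/string.c; the helper names (utf8_encode, utf8_byte_offset, fixed_byte_offset) are hypothetical and exist only for illustration. It shows that U+0666 takes two bytes in utf8, so a byte offset computed as index times width no longer matches the real position.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Parrot function.
 * Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Byte offset of character number idx in a UTF-8 buffer.
 * There is no shortcut: every preceding character has to be skipped,
 * including its continuation bytes (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_byte_offset(const unsigned char *s, size_t nbytes, size_t idx)
{
    size_t pos = 0;
    while (idx > 0 && pos < nbytes) {
        pos++;
        while (pos < nbytes && (s[pos] & 0xC0) == 0x80)
            pos++;
        idx--;
    }
    return pos;
}

/* What a fixed-width assumption computes instead. */
size_t fixed_byte_offset(size_t idx, size_t bytes_per_char)
{
    return idx * bytes_per_char;
}
```

For the bytes of "a", U+0666, "b", character 2 starts at byte 3, while a one-byte-per-char calculation gives byte 2, which lands in the middle of the encoded U+0666.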
> With r8787, the following tcl code:
>
> puts \u666
>
> causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> and forces a few tcl-unicode escape tests into TODOs.
Duh, because it's *evil*.
:-)
Josh
> causes a segfault in the substr opcode (from tcl's lib/tclconst.pir),
> and forces a few tcl-unicode escape tests into TODOs.
>
> A short PIR test that is equivalent:
>
> .sub main @MAIN
> $S0 = "\\u666"
> $I0 = 0x666
> $S1 = chr $I0 # works, but substr doesn't like this string.
> substr $S0, 0, 5, $S1
> .end
Fixed - r8805
Thanks for testing and providing the test,
leo
I thought that one thing Jarkko learned from perl 5's Unicode model was that
the amount of code and pain to support a variable length encoding was
greater than the space saving that that encoding gives.
In turn Dan had decided that Parrot should internally unpack to some form
of fixed width encoding. So all Unicode would be stored internally in the
shortest of ISO-8859-1, UCS-2 and UCS-4 that encompassed all the code
points used.
1: My memory may be wrong on this
2: It may not have been explicit
3: I may have missed an explicit change
But having dealt with the fun of variable length encodings, my gut feeling
is with Jarkko, that it's probably better to stay fixed width internally.
Nicholas Clark
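The "shortest fixed-width encoding that covers every code point" rule described above can be sketched in a few lines of C. This is only an illustration of the idea as stated in this thread, not Parrot's implementation; the function name narrowest_fixed_width is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Parrot function.
 * Returns the bytes per character of the narrowest fixed-width form
 * that can hold every code point in cps:
 *   1 if all are <= 0xFF   (fits ISO-8859-1),
 *   2 if all are <= 0xFFFF (fits a two-byte form),
 *   4 otherwise. */
size_t narrowest_fixed_width(const uint32_t *cps, size_t n)
{
    size_t width = 1;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > 0xFFFF)
            return 4;            /* nothing can be wider, stop early */
        if (cps[i] > 0xFF)
            width = 2;
    }
    return width;
}
```

With this rule a string containing U+0666 would be stored two bytes per character, and indexing stays a plain multiplication.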
> I thought that one thing Jarkko learned from perl 5's Unicode model was that
> the amount of code and pain to support a variable length encoding was
> greater than the space saving that that encoding gives.
>
> In turn Dan had decided that Parrot should internally unpack to some form
> of fixed width encoding. So all Unicode would be stored internally in the
> shortest of ISO-8859-1, UCS-2 and UCS-4 that encompassed all the code
> points used.
Yes, with the enhancement (also proposed by Dan) that the conversion to
a fixed-width encoding is done lazily, i.e. on demand. substr would
typically be such a place to change the encoding to fixed width.
> But having dealt with the fun of variable length encodings, my gut feeling
> is with Jarkko, that it's probably better to stay fixed width internally.
My gut feeling is just the same.
> Nicholas Clark
leo
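A rough C sketch of the lazy conversion described above: the string keeps its utf8 bytes until an operation needs character indexing, and only then unpacks to a fixed-width buffer. The lazy_string type and both functions are hypothetical names for illustration, not Parrot's STRING API. For brevity the sketch always unpacks to four bytes per character and assumes the input is valid utf8.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type, not Parrot's STRING. */
typedef struct {
    const unsigned char *utf8;   /* variable-width bytes, assumed valid UTF-8 */
    size_t               nbytes;
    uint32_t            *fixed;  /* NULL until an indexing operation needs it */
    size_t               nchars; /* only meaningful once fixed is set */
} lazy_string;

/* Unpack to fixed width on first use. Returns 0 on success, -1 on
 * allocation failure or truncated input. */
int lazy_make_fixed(lazy_string *str)
{
    if (str->fixed)
        return 0;                            /* already converted */

    /* One character per byte is the worst case. */
    uint32_t *buf = malloc((str->nbytes ? str->nbytes : 1) * sizeof *buf);
    if (!buf)
        return -1;

    size_t n = 0, pos = 0;
    while (pos < str->nbytes) {
        unsigned char b = str->utf8[pos];
        uint32_t cp;
        size_t len;

        if (b < 0x80)                { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else                         { cp = b & 0x07; len = 4; }

        if (pos + len > str->nbytes) {       /* sequence runs off the end */
            free(buf);
            return -1;
        }
        for (size_t k = 1; k < len; k++)
            cp = (cp << 6) | (str->utf8[pos + k] & 0x3F);

        buf[n++] = cp;
        pos += len;
    }
    str->fixed  = buf;
    str->nchars = n;
    return 0;
}

/* An indexing operation (substr would be the typical caller) triggers
 * the conversion and then indexes in constant time. */
int lazy_char_at(lazy_string *str, size_t idx, uint32_t *out)
{
    if (lazy_make_fixed(str) != 0 || idx >= str->nchars)
        return -1;
    *out = str->fixed[idx];
    return 0;
}
```

Strings that are only ever copied or printed never pay for the conversion; the caller owns the fixed buffer and frees it with free().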
Aha. That's the subtlety that I missed from all this. The form of the "fix"
> >But having dealt with the fun of variable length encodings, my gut feeling
> >is with Jarkko, that it's probably better to stay fixed width internally.
>
> My gut feeling is just the same.
Thanks for the clarification.
Nicholas Clark